ClickHouse
ClickHouse is a database created by Yandex.
The speed of ClickHouse is undeniable. Large data and analytics have always been a big part of my work and also a significant portion of my interest. Analysis that used to take a cluster of servers can now sometimes be done efficiently on a single, moderately equipped machine. Even larger datasets can be linearly clustered. It was even used in CERN's LHCb experiment, where it handled metadata for 10 billion events.
Documentation is very clean and well organized. The team has put a lot of work into this, along with easy navigation and a beautiful look. Few documentation sites are this clear and direct.
Installation is easy:
curl https://clickhouse.com/ | sh
It distributes an self-inflating binary that contains not just the client and server, but multiple tools for benchmarking, checking status, installing and managing the service, copying tables between clusters, and even a file compression utility.
Query processing 200 million rows at 5GB/s:
host.domain :) SELECT client_ip, count() FROM logs.http \
GROUP BY client_ip ORDER BY count() DESC LIMIT 10;
SELECT
client_ip,
count()
FROM http
GROUP BY client_ip
ORDER BY count() DESC
LIMIT 10
Query id: 18ef723e-56c0-44ce-931f-4b981bdd5c7e
┌─client_ip───────────┬──count()─┐
│ 2a03:92c3:2600::116 │ 19565953 │
│ 87.205.52.29 │ 13900800 │
│ 159.151.157.223 │ 2439398 │
│ 137.88.192.88 │ 2164264 │
│ 18.244.231.58 │ 1611585 │
│ 138.20.38.8 │ 932911 │
│ 240.90.76.56 │ 915241 │
│ 152.54.234.15 │ 894005 │
│ 193.131.80.25 │ 799013 │
│ 231.58.195.29 │ 783346 │
└─────────────────────┴──────────┘
10 rows in set. Elapsed: 0.422 sec. Processed 89.66 million rows, 2.09 GB (212.68 million rows/s., 4.96 GB/s.)
Peak memory usage: 84.38 MiB.
Run using the benchmark utility:
bash-5.2$ clickhouse benchmark --query "SELECT client_ip, count() FROM logs.http \
GROUP BY client_ip ORDER BY count() DESC LIMIT 10"
Loaded 1 queries.
Queries executed: 13.
localhost:9000, queries: 13, QPS: 2.575, RPS: 230880043.453, MiB/s: 5133.366, result RPS: 25.751, result MiB/s: 0.001.
0.000% 0.366 sec.
10.000% 0.366 sec.
20.000% 0.368 sec.
30.000% 0.369 sec.
40.000% 0.369 sec.
50.000% 0.369 sec.
60.000% 0.369 sec.
70.000% 0.369 sec.
80.000% 0.372 sec.
90.000% 0.372 sec.
95.000% 0.372 sec.
99.000% 0.404 sec.
99.900% 0.404 sec.
99.990% 0.404 sec.
Clickhouse Local
For testing and development, clickhouse local
provides native functionality of ClickHouse, but with the ability to query local and remote data files. This is very similar to DuckDB. Combined with the impressive Input/Output FORMAT support, you can take a local CSV or TSV and query it using SQL without having to create a database service or import process. In the realm of building structured testing data, this offers real convenience, especially during the early phases of development.
This is quite similar to much of what I use SQLite for, but with an incredible amount of additional features. It should not be understated how easily the team has made this tool, because it is included with the native install. It can be invoked just the same as clickhouse server
and clickhouse client
.
There are many times in which I find myself needing to convert data formats, filter data, or search for outliers in a very large dataset in order to create a comprehensive dataset for testing. When this happens, I either resort to classic UNIX text tools or write a one-off tool in whatever programming language I happen to be working in. There is often lots of boiler-plate code that needs to be written before you can even start working with the actual properties of the data. Having the ability to query and filter data in native SQL syntax can save an incredible amount of time and frustration. Though I used ClickHouse long before I knew about this feature, I wish I would have learned about it from day one.
Clickhouse Backup
clickhouse-backup
is a easy to use, third-party backup/restore utility. It is written in Go by a third-party. This is the simplest way to backup and restore data, especially in unique development environments. There are many backup techniques available with the native server, but sometimes you just need a simple dump/restore without configuring more complex production-worthy methods that require more thought and detailed configuration. It is a great tool to have if you are just starting to work with ClickHouse.