<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Erika</title>
    <description>The latest articles on Forem by Erika (@erikah).</description>
    <link>https://forem.com/erikah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F929502%2F5b006fb2-bce1-423c-8df0-91f21bc531f4.jpg</url>
      <title>Forem: Erika</title>
      <link>https://forem.com/erikah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/erikah"/>
    <language>en</language>
    <item>
      <title>Live Coding Session: From Kafka Streams to REST APIs in minutes</title>
      <dc:creator>Erika</dc:creator>
      <pubDate>Tue, 13 Dec 2022 11:21:40 +0000</pubDate>
      <link>https://forem.com/tinybirdco/live-coding-session-from-kafka-streams-to-rest-apis-in-minutes-3d63</link>
      <guid>https://forem.com/tinybirdco/live-coding-session-from-kafka-streams-to-rest-apis-in-minutes-3d63</guid>
      <description>&lt;p&gt;The week after Christmas and before New Year's is one of the best times to learn something.&lt;/p&gt;

&lt;p&gt;Sure you might be "working", but apart from the occasional on-call alert you're probably just sitting around the (home) office looking for something to do.&lt;/p&gt;

&lt;p&gt;Well, come learn something with Tinybird. On December 27th, Alasdair Brown and Jorge Gomez Sancha are running a deep dive on building analytics APIs from Kafka streams using Tinybird.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.tinybird.co/live-coding-sessions/from-kafka-streams-to-rest-apis?utm_source=Dev.to&amp;amp;utm_medium=Dev.to&amp;amp;utm_campaign=Q4-2022-kafka-live-coding"&gt;Click here for all the details and to register&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You'll learn how to...&lt;/p&gt;

&lt;p&gt;...persist events data into OLAP storage&lt;br&gt;
...filter, aggregate, and enrich your streams with SQL&lt;br&gt;
...publish your queries as low-latency REST APIs&lt;br&gt;
...add query parameters to your API endpoints&lt;br&gt;
...monitor endpoint performance &amp;amp; errors&lt;br&gt;
...determine when Tinybird or ksqlDB is appropriate&lt;/p&gt;

&lt;p&gt;We guarantee it will be at least 2.4x more fun than twiddling your thumbs and at least 4.7x more fun than tearing down the Christmas lights. 🤪&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.tinybird.co/live-coding-sessions/from-kafka-streams-to-rest-apis?utm_source=Dev.to&amp;amp;utm_medium=Dev.to&amp;amp;utm_campaign=Q4-2022-kafka-live-coding"&gt;Join us&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>productivity</category>
      <category>database</category>
      <category>api</category>
    </item>
    <item>
      <title>ClickHouse Fundamentals: Part 2</title>
      <dc:creator>Erika</dc:creator>
      <pubDate>Tue, 22 Nov 2022 12:52:02 +0000</pubDate>
      <link>https://forem.com/tinybirdco/clickhouse-fundamentals-part-2-4ao4</link>
      <guid>https://forem.com/tinybirdco/clickhouse-fundamentals-part-2-4ao4</guid>
      <description>&lt;p&gt;&lt;strong&gt;Change table TTLs&lt;/strong&gt;&lt;br&gt;
You can modify the TTL of a table in ClickHouse by using ALTER TABLE...MODIFY TTL. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE database.table MODIFY TTL event_date + INTERVAL 30 DAY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, ClickHouse will rewrite all table partitions, including those not impacted by the TTL change. This can be a very expensive operation, especially for large tables.&lt;/p&gt;

&lt;p&gt;To avoid impacting the performance of our database, we can instead set materialize_ttl_after_modify to 0 and clear up old partitions manually.&lt;/p&gt;

&lt;p&gt;This avoids the huge performance impact of rewriting all table partitions, but does mean there is additional manual effort.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;set materialize_ttl_after_modify=0;
ALTER TABLE database.table MODIFY TTL event_date + INTERVAL 30 DAY;
ALTER TABLE database.table DROP PARTITION 202205;
ALTER TABLE database.table DROP PARTITION 202206;
ALTER TABLE database.table DROP PARTITION 202207;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Analyze local files with clickhouse-local&lt;/strong&gt;&lt;br&gt;
clickhouse-local is like running a temporary ClickHouse server that only lasts for your session. It's great for exploring local files to quickly experiment with data, without needing to set up a proper ClickHouse deployment.&lt;/p&gt;

&lt;p&gt;You can use clickhouse-local to analyze files of structured data directly from the local file system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT count() FROM file('final.ndjson');

SELECT count()
FROM file('final.ndjson')

Query id: a0a1f4b5-40cb-4125-b68b-4ed978c41576

┌─count()─┐
│  100000 │
└─────────┘

1 row in set. Elapsed: 0.659 sec. Processed 55.38 thousand rows, 96.97 MB (84.04 thousand rows/s., 147.16 MB/s.)


SELECT countDistinct(public_ip) FROM file('final.ndjson');

SELECT countDistinct(public_ip)
FROM file('final.ndjson')

Query id: 21df7ca5-e3bf-4010-b2a0-bf8b854502d2

┌─uniqExact(public_ip)─┐
│                   71 │
└──────────────────────┘

1 row in set. Elapsed: 0.225 sec. Processed 77.53 thousand rows, 96.45 MB (345.22 thousand rows/s., 429.46 MB/s.)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can create tables from the local file if you want to do more than one analysis on the data. The table is destroyed when your clickhouse-local session ends.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE auxiliar Engine=MergeTree() ORDER BY tuple() AS SELECT * FROM file('final.ndjson');

CREATE TABLE auxiliar
ENGINE = MergeTree
ORDER BY tuple() AS
SELECT *
FROM file('final.ndjson')

Query id: a1732be5-a912-41a5-bf8e-e524db8f12f4

Ok.

0 rows in set. Elapsed: 0.486 sec. Processed 100.00 thousand rows, 161.88 MB (205.73 thousand rows/s., 333.03 MB/s.)


SHOW CREATE TABLE auxiliar;

SHOW CREATE TABLE auxiliar

Query id: dffbcd4b-2c08-4d07-916c-b8e1b668c202


│ CREATE TABLE _local.auxiliar
(
    `timestamp_iso8601` Nullable(DateTime64(9)),
    `host` Nullable(String),
    `public_ip` Nullable(String),
    `request_method` Nullable(String),
    `request_path` Nullable(String),
    `status` Nullable(Int64),
    `body_bytes_sent` Nullable(Int64),
    `request_length` Nullable(Int64),
    `first_byte` Nullable(Float64),
    `request_time` Nullable(Float64),
    `lambda_name` Nullable(String),
    `lambda_region` Nullable(String),
    `path_type` Nullable(String),
    `hit_level` Nullable(String),
    `hit_state` Nullable(String),
    `error_details` Nullable(String),
    `owner_id` Nullable(String),
    `project_id` Nullable(String),
    `target_path` Nullable(String),
    `deployment_plan` Nullable(String),
    `lambda_duration` Nullable(Float64),
    `lambda_billed_duration` Nullable(Int64),
    `lambda_memory_size` Nullable(Int64),
    `http_user_agent` Nullable(String),
    `full_vercel_id` Nullable(String),
    `dc` Nullable(String),
    `public_ip_country` Nullable(String),
    `public_ip_city` Nullable(String),
    `asn_id` Nullable(String),
    `asn_name` Nullable(String)
)
ENGINE = MergeTree
ORDER BY tuple()
SETTINGS index_granularity = 8192 │


1 row in set. Elapsed: 0.001 sec.

SELECT count(), status - status % 100 AS status_range FROM auxiliar GROUP BY status_range;

SELECT
    count(),
    status - (status % 100) AS status_range
FROM auxiliar
GROUP BY status_range

Query id: 2685e0d4-827a-4306-8598-5d6e589dbd15

┌─count()─┬─status_range─┐
│   74000 │          200 │
│    5000 │          400 │
│   21000 │          300 │
└─────────┴──────────────┘

3 rows in set. Elapsed: 0.015 sec.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Add a default value for new columns&lt;/strong&gt;&lt;br&gt;
When you add a new column to a table, ClickHouse fills existing rows with the type's default value (the Unix epoch for a DateTime, as below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE local
ENGINE = MergeTree
ORDER BY number AS
SELECT *
FROM numbers(1000000);

ALTER TABLE local
    ADD COLUMN IF NOT EXISTS `date` DateTime;

OPTIMIZE TABLE local FINAL; -- To speed up the mutation / lazy way to know it has finished

SELECT *
FROM local
LIMIT 10

Query id: b5fedb97-a1c8-475f-a674-0b1658c8e889

┌─number─┬────────────────date─┐
│      0 │ 1970-01-01 01:00:00 │
│      1 │ 1970-01-01 01:00:00 │
│      2 │ 1970-01-01 01:00:00 │
│      3 │ 1970-01-01 01:00:00 │
│      4 │ 1970-01-01 01:00:00 │
│      5 │ 1970-01-01 01:00:00 │
│      6 │ 1970-01-01 01:00:00 │
│      7 │ 1970-01-01 01:00:00 │
│      8 │ 1970-01-01 01:00:00 │
│      9 │ 1970-01-01 01:00:00 │
└────────┴─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To backfill old rows with something other than the type's default, declare the default in the column definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE local
    ADD COLUMN IF NOT EXISTS `new_date` DateTime DEFAULT now();

OPTIMIZE TABLE local FINAL;

SELECT *
FROM local
LIMIT 10

Query id: b5ff3afd-78f7-4ea3-8d43-adc7fe14f0a0

┌─number─┬────────────────date─┬────────────new_date─┐
│      0 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      1 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      2 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      3 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      4 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      5 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      6 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      7 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      8 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
│      9 │ 1970-01-01 01:00:00 │ 2022-09-23 13:53:38 │
└────────┴─────────────────────┴─────────────────────┘

10 rows in set. Elapsed: 0.002 sec.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that new rows will also get this default value if the column is omitted on insertion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE local
    MODIFY COLUMN `new_date` DateTime DEFAULT yesterday();

INSERT INTO local(number) VALUES (999999999);

SELECT *
FROM local
WHERE number = 999999999

Query id: 02527ad6-4644-42ff-8755-8869a9df30fa

┌────number─┬────────────────date─┬────────────new_date─┐
│ 999999999 │ 1970-01-01 01:00:00 │ 2022-09-22 00:00:00 │
└───────────┴─────────────────────┴─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>ClickHouse Fundamentals: Part 1</title>
      <dc:creator>Erika</dc:creator>
      <pubDate>Thu, 27 Oct 2022 13:20:55 +0000</pubDate>
      <link>https://forem.com/tinybirdco/clickhouse-fundamentals-part-1-125d</link>
      <guid>https://forem.com/tinybirdco/clickhouse-fundamentals-part-1-125d</guid>
      <description>&lt;p&gt;We've launched an &lt;a href="https://www.tinybird.co/clickhouse/knowledge-base/?utm_source=Dev.to&amp;amp;utm_medium=Dev.to&amp;amp;utm_campaign=Q4-2022-clickhouse"&gt;Open Source ClickHouse® Knowledge Base&lt;/a&gt;! &lt;/p&gt;

&lt;p&gt;We'll be sharing some of the tips here on Dev.to but if you don't want to wait, jump in and find them &lt;a href="https://www.tinybird.co/clickhouse/knowledge-base/?utm_source=Dev.to&amp;amp;utm_medium=Dev.to&amp;amp;utm_campaign=Q4-2022-clickhouse"&gt;all in one place&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Oh, and if you're a ClickHouse guru, please feel free to contribute your own ClickHouse magic.&lt;/p&gt;

&lt;p&gt;Here's part 1:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;View the intermediate states of aggregations&lt;/strong&gt;&lt;br&gt;
Using aggregate functions with the -State modifier (e.g. sumState) results in intermediate states being stored in ClickHouse. These intermediate states generally cannot be read directly, as they are stored in a binary representation. Thus, to read the result, we must use the corresponding -Merge modifier when selecting the result (e.g. sumMerge).&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    number % 4 AS pk,
    avgState(number) AS avg_state
FROM numbers(2000)
GROUP BY pk

Query id: af1c69e7-b5d2-4063-9b8d-1ac08598fc79

┌─pk─┬─avg_state─┐
│  0 │ 8��         │
│  1 │ ,��         │
│  2 │  ��         │
│  3 │ ��          │
└────┴───────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to explore the intermediate states, perhaps without knowing what the original aggregation method was, you can instead use the finalizeAggregation function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    pk,
    finalizeAggregation(avg_state)
FROM
(
    SELECT
        number % 4 AS pk,
        avgState(number) AS avg_state
    FROM numbers(2000)
    GROUP BY pk
)

Query id: 7cf3a07f-f5d1-4ddd-891f-a89bb304b227

┌─pk─┬─finalizeAggregation(avg_state)─┐
│  0 │                            998 │
│  1 │                            999 │
│  2 │                           1000 │
│  3 │                           1001 │
└────┴────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
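&lt;p&gt;To merge the states into a final result, you use the matching -Merge combinator instead. A minimal sketch of the same query with avgMerge (note that -Merge functions are themselves aggregates, so the outer query needs its own GROUP BY):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    pk,
    avgMerge(avg_state) AS avg_final
FROM
(
    SELECT
        number % 4 AS pk,
        avgState(number) AS avg_state
    FROM numbers(2000)
    GROUP BY pk
)
GROUP BY pk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;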



&lt;p&gt;&lt;strong&gt;Apply column default to existing rows&lt;/strong&gt;&lt;br&gt;
ClickHouse includes a special wrapper type called Nullable which allows a column to contain null values. It's common to use this early on in schema design, when a default value has not yet been decided.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE deleteme
(
    `number` UInt64,
    `date` Nullable(DateTime)
)
ENGINE = MergeTree
PARTITION BY number % 10
ORDER BY number AS
SELECT
    number,
    NULL
FROM numbers(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, you will often find that you eventually want to modify this column to remove Nullable and insert a default value instead of nulls.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE deleteme MODIFY COLUMN `date` DEFAULT now()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adding a default value will affect new rows, but will not replace the nulls in existing rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *
FROM deleteme
LIMIT 1;

┌─number─┬─date─┐
│      0 │ ᴺᵁᴸᴸ │
└────────┴──────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To apply the new default value to existing rows, you can use ALTER TABLE ... MATERIALIZE COLUMN.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE deleteme
MATERIALIZE COLUMN `date`;

SELECT *
FROM deleteme
LIMIT 1;

┌─number─┬────────────────date─┐
│      0 │ 2022-09-23 12:31:14 │
└────────┴─────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Avoid TOO_MANY_PARTS with async_insert&lt;/strong&gt;&lt;br&gt;
ClickHouse was originally designed to insert data in batches.&lt;/p&gt;

&lt;p&gt;For engineers accustomed to other databases, it's a common mistake to send hundreds of individual inserts per second to ClickHouse and get a TOO_MANY_PARTS error. This error is ClickHouse telling us to throttle ingestion, as it can't keep up.&lt;/p&gt;

&lt;p&gt;Until recently, you were required to solve this issue yourself, by buffering inserts and sending larger batches.&lt;/p&gt;

&lt;p&gt;However, ClickHouse v21.11 introduced async_insert, which lets ClickHouse handle batching small inserts for you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;&lt;br&gt;
async_insert is disabled by default, so you must enable it to take advantage of this feature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you decide to use it, you should also have a look at async_insert_threads, async_insert_max_data_size, async_insert_busy_timeout_ms and wait_for_async_insert.&lt;/p&gt;
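&lt;p&gt;For example, enabling it per session might look like this (the events table here is just a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET async_insert = 1;
SET wait_for_async_insert = 1;           -- block until the buffered batch is flushed
SET async_insert_busy_timeout_ms = 1000; -- flush at least once per second

INSERT INTO events VALUES (1, 'click', now());
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;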

</description>
      <category>clickhouse</category>
      <category>opensource</category>
      <category>database</category>
      <category>performance</category>
    </item>
    <item>
      <title>When you should use columnar databases and not Postgres, MySQL, or MongoDB</title>
      <dc:creator>Erika</dc:creator>
      <pubDate>Tue, 25 Oct 2022 16:56:19 +0000</pubDate>
      <link>https://forem.com/tinybirdco/when-you-should-use-columnar-databases-and-not-postgres-mysql-or-mongodb-4ldl</link>
      <guid>https://forem.com/tinybirdco/when-you-should-use-columnar-databases-and-not-postgres-mysql-or-mongodb-4ldl</guid>
      <description>&lt;p&gt;&lt;em&gt;by &lt;a href="https://www.linkedin.com/in/javisantana/"&gt;Javier Santana&lt;/a&gt;, Co-founder at Tinybird&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row-oriented, OLTP databases aren't ideal application DBs when you know you'll need to run analytics on lots of data. Choose OLAP instead.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you develop an application, your first choice for a database is often one of &lt;a href="https://www.postgresql.org/"&gt;Postgres&lt;/a&gt;, &lt;a href="https://www.mysql.com/"&gt;MySQL&lt;/a&gt;, or - if NoSQL is more your thing - &lt;a href="https://www.mongodb.com/"&gt;MongoDB&lt;/a&gt;. You can’t go wrong with any of these; they’re great general-purpose databases with huge communities and some excellent features (e.g. transactions) that make developers’ lives easier.&lt;/p&gt;

&lt;p&gt;But there comes a point when things start to slow down. So you read a few blog posts, scan some docs, browse some dev forums, and spend hours tuning and improving queries, database config, etc. This does improve things temporarily, but eventually, you hit a wall.&lt;/p&gt;

&lt;p&gt;If you cut your database teeth on the Postgres/MySQL zeitgeist, it’d be understandable if you thought that most applications should be built on these or similar databases; the row-oriented, OLTP approach lends itself well to most of the app development that’s happened over the last decade.&lt;/p&gt;

&lt;p&gt;But then you realize there are other databases out there focused specifically on analytical use cases with lots of data and complex queries. Newcomers like &lt;a href="https://github.com/ClickHouse"&gt;ClickHouse&lt;/a&gt;, &lt;a href="https://github.com/apache/pinot"&gt;Pinot&lt;/a&gt;, and &lt;a href="https://github.com/apache/druid"&gt;Druid&lt;/a&gt; (all open source) respond to a new class of problem: the need to build applications on API endpoints backed by analytical queries that were previously confined to the data warehouse and BI tools.&lt;/p&gt;

&lt;p&gt;To use a metaphor: if you want to find 3 specific trees in a massive forest, OLTP is great. But if you want to count all the trees? That’s where OLAP comes in.&lt;/p&gt;

&lt;p&gt;When you discover these new analytical databases, all of a sudden your general-purpose database doesn’t feel so “general-purpose” anymore. And you realize that maybe Postgres, MySQL, and MongoDB aren’t the databases you’re looking for to tackle your next project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5QoIiaFi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2rgq3bzd4lcyd7rujj2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5QoIiaFi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2rgq3bzd4lcyd7rujj2k.png" alt="These aren't the databases you're looking for" width="757" height="330"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But why?&lt;/strong&gt;&lt;br&gt;
Why can't every database be good for both transactional and analytical workloads? The answer is basic physics. Hardware has hard limits, and you have to configure it one way for transactional use cases, and another way for analytical ones.&lt;/p&gt;

&lt;p&gt;More specifically, it has to do with how data is physically stored and processed. Analytical databases like ClickHouse, for example, store data in a columnar fashion; all the values in a column of a table are stored together on disk. In OLTP databases, data is stored by row, so all the values in a row stay together.&lt;/p&gt;
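&lt;p&gt;As a rough illustration (the trips table here is hypothetical), a columnar store only has to read the price and pickup_date columns from disk for a query like this, no matter how many other columns the table has, while a row store must read every row in full:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT avg(price)
FROM trips
WHERE pickup_date BETWEEN '2022-01-01' AND '2022-12-31'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;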

&lt;p&gt;And this is very important.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Analytical databases store data in columns. Traditional OLTP databases store it in rows. This makes a big difference.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Data locality matters.&lt;/strong&gt;&lt;br&gt;
If you go to the grocery store to buy 100 cans of soda, you hope to find them packaged in 12-packs all on the same shelf. You go to one place, grab the packs you need, check out, and you’re on to your next errand in 5 minutes tops.&lt;/p&gt;

&lt;p&gt;But if the cans are spread out all over the store, behind the bananas and beside the corn starch, you'd need to push your shopping cart from end to end to collect them all. You’d be lucky to get out of there before closing time.&lt;/p&gt;

&lt;p&gt;The same thing is true with data: if data sits together on disk or in memory, reading and processing it is way faster. &lt;/p&gt;

&lt;p&gt;This is because disks and memory can work 100 times faster when access is sequential. And CPUs process data much faster when they don't need to jump between different tasks.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Disks, memory, and CPU work way faster when data is close together.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But of course, cans of soda aren’t data and it’s not fair to compare them. Cans of soda are physical objects, data is not. If the goal is a “faster checkout time”, then there are many things you can do with data to speed things up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compress it&lt;/strong&gt;&lt;br&gt;
We all use compression to save data in our disks. If you are old like me you probably used winzip. In general, we all understand that if you compress data, it gets smaller. Typically speaking, smaller is better when storing things. It’s why fast food chains buy soda syrup and carbonated water in bulk instead of stacking hundreds of 12-packs next to the burger patties.&lt;/p&gt;

&lt;p&gt;Compression is a general term, and there are many ways to compress data. But there’s one thing common to all compression methods: compression works much better if similar values are together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vectorize it&lt;/strong&gt;&lt;br&gt;
Many years ago, video games started to use 3D graphics that required complex vector calculations to move things in a 3D space. Intel (the market leader) created MMX technology and later SSE to enable CPUs to do vector math quickly. So in one CPU cycle you do 3 operations instead of one. Vectorization lets CPUs process many values at the same time, but it's only possible if those values are stored together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache it&lt;/strong&gt;&lt;br&gt;
In a somewhat famous slide titled “Numbers Everyone Should Know” (slide 13 in &lt;a href="http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf"&gt;this presentation&lt;/a&gt;), Jeff Dean lists retrieval times by data location. As you can see, L1 cache access is 200x faster than main memory access. If you cache, you go fast.&lt;/p&gt;

&lt;p&gt;And it turns out that caching really likes when the data is close together. Add this along with compression so you can fit more data into the cache, and you can feed the CPU that much faster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sort it&lt;/strong&gt;&lt;br&gt;
Sorting is another huge performance factor, especially for columnar databases. If the data is sorted by the columns you’ll be filtering on, everything speeds up. It compresses better, access is faster, and the data locality is much much better. Sorting also helps to improve algorithms like joins, order by, limit, and group by. Proper sorting can speed up queries by multiple orders of magnitude. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parallelize it&lt;/strong&gt;&lt;br&gt;
One more thing: parallelization. It’s not unique to analytical databases, but it helps a lot when you have a lot of data, which is when you typically consider using an analytical database. There are several types of parallelization: inside the CPU (aka vectorization), across multiple CPUs, and even across multiple machines. I’ll talk about it more in a bit.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Because analytical use cases almost always involve aggregating and filtering on columns, running against data stored by column is much faster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All these factors combined, it’s pretty simple: Things go faster when the data you want to access is stored together. And because analytical use cases pretty much always involve aggregating and filtering on specific columns, running these queries against data stored by column is just much faster. The hardware is optimized to count all the “trees” in as few cycles as possible. This is the main reason why column-oriented analytical bases are better for analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But it doesn’t stop at data locality&lt;/strong&gt;&lt;br&gt;
The low-level benefits of columnar data stores for analytics should be pretty clear at this point: Data locality to extract 100% of the hardware is huge for speeding up queries on large amounts of data involving aggregations and filters typical in analytical use cases. But it doesn’t stop there. Analytical databases have other properties that make them even more appealing for handling large amounts of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Probabilistic data structures&lt;/strong&gt;&lt;br&gt;
When you run analytics, you often don’t need many things that OLTP databases offer. One of those things is exactness, especially on statistical calculations. That might not seem like a big deal, but if you’re building on top of an analytical databases, it has big implications.&lt;/p&gt;

&lt;p&gt;If you are allowed to have a small error in your statistics, say +/- 1%, it can mean much, much faster queries. Calculating unique values, for example, is very compute-intensive and requires a lot of memory. If you can get by with some error, you can use &lt;a href="https://en.wikipedia.org/wiki/Category:Probabilistic_data_structures"&gt;probabilistic data structures&lt;/a&gt; like &lt;a href="https://en.wikipedia.org/wiki/HyperLogLog"&gt;HyperLogLog&lt;/a&gt; that estimate unique values with less memory and less CPU.&lt;/p&gt;
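&lt;p&gt;In ClickHouse, for instance, that trade-off is a one-word change (the events table here is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT uniqExact(user_id) FROM events; -- exact, but memory-hungry at high cardinality
SELECT uniq(user_id) FROM events;      -- approximate, far less memory and CPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;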

&lt;p&gt;Eventual consistency is also important here; it’s rarely acceptable for OLTP workloads, but it’s not a problem in analytics. When you have a lot of data, a single machine often isn’t enough to run your analytical workloads. Of course, you should always try to scale vertically if you can, but eventually you’ll need to put data on several machines (sharding and replication are the terms for this). &lt;/p&gt;

&lt;p&gt;This is a well-trodden path: coordination in a distributed system is a hard but well-studied problem, and there are &lt;a href="https://www.amazon.com/s?k=distributed+systems&amp;amp;crid=2P1QM2U3OFKU9&amp;amp;sprefix=distributed+system%2Caps%2C94&amp;amp;ref=nb_sb_noss_1"&gt;numerous books written about it&lt;/a&gt;. So you can pretty easily set up a cluster with several machines to scale your reads and writes. But the benefits go beyond basic horizontal scaling.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Probabilistic data structures improve query performance when exactness isn't required, and they are highly parallelizable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I talked earlier about parallelization at different levels. It turns out that parallelization, distributed systems, and probabilistic data structures get along quite well.&lt;/p&gt;

&lt;p&gt;The same algorithms and methods used to parallelize a workload on many cores work well when you parallelize it on many machines. Probabilistic data structures themselves happen to be highly parallelizable as well.&lt;/p&gt;

&lt;p&gt;Analytical databases take advantage of this. ClickHouse, for example, has a number of non-exact functions like &lt;em&gt;uniq()&lt;/em&gt; and &lt;em&gt;quantileDeterministic()&lt;/em&gt; that estimate their respective statistics. As data volumes get bigger and bigger, this has a meaningful impact on query latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster writes with LSM tree&lt;/strong&gt;&lt;br&gt;
An important component of analytical architecture is the &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree"&gt;log-structured merge-tree&lt;/a&gt; (LSMT), a data structure that many new databases (and some “old”) use because it aligns well with how hardware works. &lt;/p&gt;

&lt;p&gt;ClickHouse, for example, uses an LSMT-like structure that lets you insert millions of rows per second without any problems.&lt;/p&gt;

&lt;p&gt;In a lot of analytical use cases, you don’t just need your queries to be fast, you also need them to query the freshest data. This is especially important for realtime use cases where you need to serve low-latency analytics on streaming data. Every millisecond counts. &lt;a href="https://archive.nytimes.com/www.nytimes.com/imagepages/2009/07/24/business/0724-webBIZ-trading.ready.html"&gt;Just ask the day traders&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental rollups and materializations&lt;/strong&gt;&lt;br&gt;
Almost all databases support some form of materialized views. But in most databases, including Postgres, MySQL, and MongoDB, the materialized views need to be periodically and manually refreshed. Analytical databases, on the other hand, usually have special tables and structures to enable incremental materializations, rollups, and other kinds of aggregations. The result is faster queries over aggregations even on datasets with high-frequency inserts.&lt;/p&gt;
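&lt;p&gt;In ClickHouse this is typically a materialized view feeding a summing table, so the rollup is updated on every insert. A sketch with hypothetical names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE visits_daily
(
    `day` Date,
    `visits` UInt64
)
ENGINE = SummingMergeTree
ORDER BY day;

CREATE MATERIALIZED VIEW visits_daily_mv TO visits_daily AS
SELECT
    toDate(timestamp) AS day,
    count() AS visits
FROM events
GROUP BY day;

-- Read with sum() to merge rows the engine has not compacted yet
SELECT day, sum(visits) AS visits FROM visits_daily GROUP BY day;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;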

&lt;p&gt;&lt;strong&gt;Specialized functions for statistics and time series&lt;/strong&gt;&lt;br&gt;
It’s hard to find an analytical use case that doesn’t involve time series data, statistical functions, or both. Most OLAP database designers understood this, so they designed their DBMS with specialized functions for time series data and statistics. ClickHouse has a &lt;a href="https://clickhouse.com/docs/en/sql-reference/functions/date-time-functions"&gt;host of specialized functions&lt;/a&gt; for dealing with dates and times that you mostly don’t get with Postgres, for example.&lt;/p&gt;
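&lt;p&gt;For example, bucketing a hypothetical events table into 15-minute windows is a one-liner in ClickHouse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    toStartOfFifteenMinutes(timestamp) AS bucket,
    count() AS events
FROM events
GROUP BY bucket
ORDER BY bucket
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;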

&lt;blockquote&gt;
&lt;p&gt;OLAP databases like ClickHouse come with functions and structures that are optimized for analytical use cases.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;They aren’t perfect, but you still need them.&lt;/strong&gt;&lt;br&gt;
Analytical databases aren’t perfect by any means. They’re often a huge pain, not necessarily because they are harder to manage, but because they let you store and process way more data, so things just get harder in general. More data, more problems.&lt;/p&gt;

&lt;p&gt;But maybe, as you approach your next project, you should stop thinking in terms of transactions, &lt;a href="https://jepsen.io/consistency/models/linearizable"&gt;linearizability&lt;/a&gt;, fast point queries, super advanced search indices, and the other trappings of OLTP databases.&lt;/p&gt;

&lt;p&gt;Instead, think about the types of queries you’ll need to run and how much data you’ll have. If you need to do sums on billions of rows, you’re gonna want a columnar OLAP database.&lt;/p&gt;
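&lt;p&gt;A toy sketch of why: in a columnar layout, the values you’re summing sit together, so an aggregation scans only that one column instead of walking every full row. The data here is made up for illustration:&lt;/p&gt;

```python
# Toy contrast between row-oriented and column-oriented layouts.
rows = [{"user": "a", "amount": 10},
        {"user": "b", "amount": 20},
        {"user": "c", "amount": 30}]

# Row-oriented: every whole record must be touched to reach one field.
row_sum = sum(r["amount"] for r in rows)

# Column-oriented: the "amount" column is one contiguous array that can be
# scanned (and compressed) on its own, without touching the other columns.
amount_column = [10, 20, 30]
col_sum = sum(amount_column)

print(row_sum, col_sum)  # 60 60
```

&lt;p&gt;Same answer either way, but at billions of rows the column store reads a tiny fraction of the bytes, which is where the speedup comes from.&lt;/p&gt;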

</description>
      <category>database</category>
      <category>mysql</category>
      <category>mongodb</category>
      <category>postgres</category>
    </item>
    <item>
      <title>We launched an open source ClickHouse Knowledge Base</title>
      <dc:creator>Erika</dc:creator>
      <pubDate>Mon, 24 Oct 2022 15:12:02 +0000</pubDate>
      <link>https://forem.com/tinybirdco/we-launched-an-open-source-clickhouse-knowledge-base-afp</link>
      <guid>https://forem.com/tinybirdco/we-launched-an-open-source-clickhouse-knowledge-base-afp</guid>
      <description>&lt;p&gt;ClickHouse is an incredible database. We want to help our community learn how to better use it for realtime analytics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sUXqgPzE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jxzd7mydw62rnhiyacb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sUXqgPzE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jxzd7mydw62rnhiyacb2.png" alt="clickhouse knowledge base" width="880" height="604"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are very excited to announce that we've launched an &lt;a href="https://www.tinybird.co/clickhouse/knowledge-base/"&gt;open source ClickHouse Knowledge Base&lt;/a&gt;. Please feel free to check it out, and if you are so inclined, contribute to the &lt;a href="https://github.com/tinybirdco/clickhouse_knowledge_base"&gt;public repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Tinybird is built on top of &lt;a href="https://github.com/ClickHouse/ClickHouse/"&gt;ClickHouse&lt;/a&gt;, and we employ several ClickHouse developers who contribute daily to the lightning-fast open-source database to make it… well… lightning-faster.&lt;/p&gt;

&lt;p&gt;In fact, our team has over 40 years of combined experience working with ClickHouse. Not bad for a database whose initial release was only 6 years ago.&lt;/p&gt;

&lt;p&gt;Over the time we’ve been building Tinybird and contributing to ClickHouse, we’ve learned a lot about this column-oriented OLAP database. The more we know it, the more we love it. It’s a perfect analytical database for the problems data companies are trying to solve right now: namely, operationalizing complex analytical queries and building products on top of them.&lt;/p&gt;

&lt;p&gt;We’ve launched this knowledge base to celebrate our continued investment in the ClickHouse open source project, and to help our customers, friends, and the broader ClickHouse community make the most of this incredible OLAP powerhouse. We intend to add new tips almost daily.&lt;/p&gt;

&lt;p&gt;Feel free to check it out. If you learn something, please share it wherever you scroll. And if you’ve got some ClickHouse magic up your sleeve that you want to share, please contribute!&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>database</category>
      <category>serverless</category>
      <category>sql</category>
    </item>
    <item>
      <title>Hackathon: Build a space-themed application (and win a MacBook)</title>
      <dc:creator>Erika</dc:creator>
      <pubDate>Wed, 28 Sep 2022 09:27:39 +0000</pubDate>
      <link>https://forem.com/tinybirdco/hackathon-build-a-space-themed-application-and-win-a-macbook-2l8j</link>
      <guid>https://forem.com/tinybirdco/hackathon-build-a-space-themed-application-and-win-a-macbook-2l8j</guid>
      <description>&lt;p&gt;Calling all explorers of new worlds to join the first ever Tinybird Hackathon. Your mission is simple: Build a space-themed application using Tinybird (and win some awesome prizes).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mission briefing&lt;/strong&gt;&lt;br&gt;
Location: Online&lt;br&gt;
Submission time: Due by 11:59 GMT on October 12th, 2022&lt;br&gt;
Supplies Required: &lt;a href="https://auth.tinybird.co/u/signup/identifier?state=hKFo2SBtWFNnTmoyNGNRR2RvMklfVl9QbEJSYUN1TGlmelotWqFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIGJmTTkzRHVtNFkwRzUyNG92RGYybEtydkZnTzBaR2lUo2NpZNkgVDZleGNNbzhJS2d1dlV3NHZGTllmcWx0OXBlNm1zQ1U&amp;amp;utm_source=dev.to&amp;amp;utm_medium=dev.to&amp;amp;utm_campaign=Q3-2022-space-hackathon"&gt;Free Tinybird account&lt;/a&gt; &amp;amp; GitHub or GitLab account&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Briefing notes&lt;/strong&gt;&lt;br&gt;
Team Size: This is an individual challenge. You’re on your own, Space Ranger.&lt;/p&gt;

&lt;p&gt;Streaming Data: You can use your own data, or pull from &lt;a href="https://www.tinybird.co/events/tinybird-hackathon-resources?utm_source=dev.to&amp;amp;utm_medium=dev.to&amp;amp;utm_campaign=Q3-2022-space-hackathon"&gt;this list we’ve compiled&lt;/a&gt;. Data must not include any PII.&lt;/p&gt;

&lt;p&gt;No Limits: You may create anything as long as it uses Tinybird and is space-themed.&lt;/p&gt;

&lt;p&gt;Fresh Code: You must create something from scratch. Submissions may be subject to a code review to be eligible to win. This is to ensure that all code used is in fact fresh.&lt;/p&gt;

&lt;p&gt;Submissions: Log your submissions to Star Command on &lt;a href="https://www.tinybird.co/events/tinybird-hackathon?utm_source=dev.to&amp;amp;utm_medium=dev.to&amp;amp;utm_campaign=Q3-2022-space-hackathon"&gt;this page&lt;/a&gt;. Each submission should include a well-documented GitHub repo, a working demo app, and a 2-min video demonstration of the app.&lt;/p&gt;

&lt;p&gt;IP &amp;amp; Ownership: You own your IP and anything you create using Tinybird.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judging criteria&lt;/strong&gt;&lt;br&gt;
Submissions will be judged subjectively by a panel of independent developers and data engineers. All of the following factors will be considered when judging whether you’ve completed your mission:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Functionality: Does it work?&lt;/li&gt;
&lt;li&gt;Documentation: Is it easy to understand what you did?&lt;/li&gt;
&lt;li&gt;Performance: Are queries optimized?&lt;/li&gt;
&lt;li&gt;Aesthetics: Does it look nice?&lt;/li&gt;
&lt;li&gt;Gravity: Does it pull us in with wonder and amazement?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Prizes&lt;/strong&gt;&lt;br&gt;
x1 Grand Prize: MacBook Air&lt;br&gt;
x2 Runners-Up: Keychron K2&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mission complete?&lt;/strong&gt; &lt;br&gt;
Submit your work on &lt;a href="https://www.tinybird.co/events/tinybird-hackathon?utm_source=dev.to&amp;amp;utm_medium=dev.to&amp;amp;utm_campaign=Q3-2022-space-hackathon"&gt;this page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is Tinybird?&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.tinybird.co/?utm_source=dev.to&amp;amp;utm_medium=dev.to&amp;amp;utm_campaign=Q3-2022-space-hackathon"&gt;Tinybird&lt;/a&gt; is a platform that lets you ingest data, shape it with SQL, and publish the results as REST APIs. There aren’t any SDKs or libraries to learn, just use whatever you’re comfortable with to interact with APIs over HTTP.&lt;/p&gt;

</description>
      <category>watercooler</category>
      <category>challenge</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
