WEBVTT 00:00.000 --> 00:13.240 All right, I'm going to get right into it. I work at DuckDB Labs. I contribute 00:13.240 --> 00:17.760 a lot to the DuckDB Iceberg extension, and I'm going to be presenting DuckDB in the cloud: 00:17.760 --> 00:24.160 a simple, powerful SQL engine for your lakehouse. So let's get started. What is DuckDB? 00:24.160 --> 00:31.120 We're an MIT-licensed, open-source, analytical database. We have zero dependencies, and we can 00:31.120 --> 00:37.600 install on all major platforms: Windows, Ubuntu, macOS. It's a super easy install. If you 00:37.600 --> 00:42.240 like your CLI, you can just curl it and you're off to the races. We also have a Python client, 00:42.240 --> 00:48.640 along with 15 other clients: Java, Rust, you name it, we probably have it. So what does 00:48.640 --> 00:53.400 that all mean? Well, it means we're super portable and there are a lot of different deployment 00:53.400 --> 00:57.480 methods. You can just install it on your computer, start querying your favorite Parquet 00:57.480 --> 01:03.960 files, your favorite CSV files, and dump them into a DuckDB database file. On top of that, 01:03.960 --> 01:08.520 if you need a little more resources, you can deploy it in the cloud, on your favorite EC2 instance or 01:08.520 --> 01:14.560 whatever, if you need more memory or more storage. But since we have zero dependencies and we 01:14.560 --> 01:19.680 have a small binary, you can also scale it down to a serverless function. So you can put it 01:19.680 --> 01:24.960 in a Lambda and then you can start reading files in an S3 bucket, or whatever blob storage 01:24.960 --> 01:30.560 you prefer. On top of that, you can have multiple readers on a DuckDB database file 01:30.560 --> 01:37.680 in your favorite blob storage. But what we cannot do is have multiple writers on a database 01:37.680 --> 01:41.760 file in blob storage.
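The workflow described here, querying remote files, dumping them into a local database file, and attaching that file read-only from multiple processes, can be sketched in DuckDB SQL. `httpfs`, `read_parquet`, and `ATTACH ... (READ_ONLY)` are real DuckDB features; the bucket, paths, and table names are hypothetical:

```sql
-- Read remote files over S3 (the httpfs extension handles s3:// URLs).
INSTALL httpfs;
LOAD httpfs;

-- Query Parquet straight out of a bucket; bucket and path are made up.
SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet');

-- Dump the result into a local DuckDB database file.
ATTACH 'analytics.duckdb' AS db;
CREATE TABLE db.events AS
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet');

-- Multiple readers are fine: any number of processes can attach read-only.
-- Multiple concurrent writers on the same database file are not supported.
ATTACH 'analytics.duckdb' AS db_ro (READ_ONLY);
```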
So you might be thinking: why should I use this in the cloud if you 01:41.760 --> 01:47.040 can't have multiple writers? Well, because we're an in-process database, if we had multiple 01:47.040 --> 01:51.440 writers, there would be a lot more fine-grained locking contention, and 01:52.480 --> 01:58.800 that would hurt our ability to do great analytics. Thankfully, a couple of years ago, 01:58.800 --> 02:05.040 a lot of people started putting a lot of their data just in blob storage, as Parquet files on S3. 02:05.040 --> 02:10.320 And from that, you had open table formats get developed. I'm talking about Iceberg, I'm talking 02:10.480 --> 02:17.040 about Delta Lake, and what's great about those is you get these basically database-table-like 02:17.040 --> 02:23.440 properties just in file storage, and multiple databases support reading from these open 02:23.440 --> 02:31.680 table formats, including now DuckDB. So you can now read and write these 02:31.680 --> 02:38.480 open table formats in blob storage with DuckDB. This is great. It enables 02:38.480 --> 02:43.760 multiplayer updates on object storage. Unfortunately, it's limited to a single table, 02:43.760 --> 02:49.600 which hopefully I can get back to. So what does Iceberg look like in DuckDB? Well, it looks like 02:49.600 --> 02:55.040 you're querying just your regular CSV or Parquet files. If you know where your metadata.json file 02:55.040 --> 03:00.640 is, you can just do a SELECT * FROM the iceberg_scan of the metadata.json, and Iceberg does 03:00.640 --> 03:05.680 all of the merging of the schema information and all of the data files, and what you get out 03:05.680 --> 03:13.200 looks just like a regular table. This is just lineitem from TPC-H.
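The `iceberg_scan` call described here is the real entry point of the DuckDB Iceberg extension; pointed at a table's metadata.json, it resolves the schema and data files and behaves like a regular table. The S3 path below is hypothetical, and the column names are from the TPC-H lineitem table mentioned in the talk:

```sql
INSTALL iceberg;
LOAD iceberg;

-- Scan an Iceberg table via its current metadata file; the path is made up.
-- The extension merges the schema information and data files for you.
SELECT l_returnflag, sum(l_quantity) AS total_qty
FROM iceberg_scan('s3://warehouse/tpch/lineitem/metadata/v3.metadata.json')
GROUP BY l_returnflag;
```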
One problem with the Iceberg 03:13.200 --> 03:18.240 format is that it is all files, and if you have fast-moving data, there is a lot of 03:18.240 --> 03:23.200 contention around: okay, who gets to insert now, what is the most recent version of the data? 03:23.200 --> 03:28.160 So what has now happened is there is a catalog layer that has been introduced, 03:28.160 --> 03:34.640 which is a way of saying: okay, this catalog, this database, now controls who 03:34.800 --> 03:40.560 gets to update this table with what data. So if any of these clients want to update your 03:40.560 --> 03:45.680 fast-moving table that exists all in blob storage, they get to write all of the data, and then 03:45.680 --> 03:50.800 they get to ask the catalog: hey, update this table with my new data. And the catalog will say: 03:50.800 --> 03:55.920 yes, okay, now this table has been updated, or the catalog will say: no, it hasn't been updated. 03:57.680 --> 04:02.480 And some of the popular catalog providers now are Lakekeeper, Polaris, Nessie, 04:02.560 --> 04:09.040 AWS Glue, and Amazon S3 Tables; these are all in our CI for DuckDB Iceberg. 04:09.040 --> 04:14.800 And this is what it looks like in the DuckDB Iceberg extension: you can create a secret 04:14.800 --> 04:21.520 using our secrets functionality, you can attach to, in this example, 04:21.520 --> 04:27.920 S3 Tables, with your S3 Tables Amazon resource name, type iceberg, endpoint type, and then you're 04:27.920 --> 04:32.480 off to the races and you can dump all of your CSVs and Parquet files into an Iceberg table. 04:34.880 --> 04:39.200 If you've been paying attention, you know that I've mentioned Iceberg, OTFs, open table formats, 04:39.200 --> 04:43.440 and a lot of table state exists in these Avro files.
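The secret-plus-attach flow described for S3 Tables looks roughly like this. `CREATE SECRET`, `TYPE iceberg`, and `ENDPOINT_TYPE` are real DuckDB Iceberg extension syntax; the ARN, catalog alias, namespace, and file names are hypothetical:

```sql
INSTALL iceberg;
LOAD iceberg;

-- Credentials via DuckDB's secrets functionality, here using the AWS credential chain.
CREATE SECRET (TYPE s3, PROVIDER credential_chain);

-- Attach the S3 Tables bucket as an Iceberg catalog; the ARN is made up.
ATTACH 'arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket'
    AS my_catalog (TYPE iceberg, ENDPOINT_TYPE s3_tables);

-- Dump a CSV into an Iceberg table managed by the catalog.
CREATE TABLE my_catalog.my_namespace.events AS
    SELECT * FROM read_csv('events.csv');
```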
Given that, I just want to 04:43.440 --> 04:48.480 mention that DuckLake has recently been developed at DuckDB, which takes a lot of that metadata state 04:48.480 --> 04:53.680 and puts it into the catalog, into the database. And that's a lot more efficient. 04:54.640 --> 04:58.960 It's still a little bit under development, but you can attach with an ATTACH statement, 04:58.960 --> 05:04.400 super easy. And I'm happy to announce that we will be releasing a major version next month, February, 05:04.400 --> 05:08.960 and that's, yeah, that's my talk. Thank you so much.
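The ATTACH statement mentioned for DuckLake can be sketched as follows. The `ducklake:` attach prefix is from the DuckLake extension; the metadata file name and table are hypothetical, and here the catalog database is just a local DuckDB file:

```sql
INSTALL ducklake;

-- Attach a DuckLake catalog; the metadata lives in a database rather than
-- in Avro/JSON files next to the data. The file name is made up.
ATTACH 'ducklake:my_metadata.ducklake' AS my_lake;
USE my_lake;

CREATE TABLE demo AS SELECT 42 AS answer;
```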