WEBVTT 00:00.000 --> 00:13.240 All right, I'm going to get right into it. I work at DuckDB Labs. I contribute 00:13.240 --> 00:17.760 a lot to the DuckDB Iceberg extension, and I'm going to be presenting DuckDB in the cloud: 00:17.760 --> 00:24.160 a simple, powerful SQL engine for your lakehouse. So let's get started. What is DuckDB? 00:24.160 --> 00:31.120 We're an MIT-licensed, open-source, analytical database. We have zero dependencies, and we can 00:31.120 --> 00:37.600 install on all major platforms: Windows, Ubuntu, macOS. It's a super easy install. If you 00:37.600 --> 00:42.240 like your CLI, you can just curl it and you're off to the races. We also have a Python client, 00:42.240 --> 00:48.640 along with 15 other clients: Java, Rust, you name it, we probably have it. So what does 00:48.640 --> 00:53.400 that all mean? Well, it means we're super portable and there are a lot of different deployment 00:53.400 --> 00:57.480 methods. You can just install it on your computer, start querying your favorite Parquet 00:57.480 --> 01:03.960 files, your favorite CSV files, and dump them into a DuckDB database file. On top of that, 01:03.960 --> 01:08.520 if you need a little more resources, you can deploy it in the cloud, on your favorite EC2 instance or 01:08.520 --> 01:14.560 whatever, if you need more memory or more storage. But since we have zero dependencies and we 01:14.560 --> 01:19.680 have a small binary, you can also scale it down to a serverless function. So you can put it 01:19.680 --> 01:24.960 in a Lambda and then you can start reading files in an S3 bucket, or whatever blob storage 01:24.960 --> 01:30.560 you prefer. On top of that, you can have multiple readers on a DuckDB database file 01:30.560 --> 01:37.680 in your favorite blob storage. But what we cannot do is have multiple writers on a database 01:37.680 --> 01:41.760 file in blob storage.
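The workflow described here, querying remote files, dumping them into a local database file, and attaching that file read-only from multiple processes, can be sketched in DuckDB SQL. `httpfs`, `read_parquet`, and `ATTACH ... (READ_ONLY)` are real DuckDB features; the bucket, paths, and table names are hypothetical:

```sql
-- Read remote files over S3 (the httpfs extension handles s3:// URLs).
INSTALL httpfs;
LOAD httpfs;

-- Query Parquet straight out of a bucket; bucket and path are made up.
SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet');

-- Dump the result into a local DuckDB database file.
ATTACH 'analytics.duckdb' AS db;
CREATE TABLE db.events AS
    SELECT * FROM read_parquet('s3://my-bucket/events/*.parquet');

-- Multiple readers are fine: any number of processes can attach read-only.
-- Multiple concurrent writers on the same database file are not supported.
ATTACH 'analytics.duckdb' AS db_ro (READ_ONLY);
```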
So you might be thinking: why should I use this in the cloud if you 01:41.760 --> 01:47.040 can't have multiple writers? Well, because we're an in-process database, if we had multiple 01:47.040 --> 01:51.440 writers, there would be a lot more fine-grained locking contention, and 01:52.480 --> 01:58.800 that would hurt our ability to do great analytics. Thankfully, a couple of years ago, 01:58.800 --> 02:05.040 a lot of people started putting a lot of their data just in blob storage, as Parquet files on S3. 02:05.040 --> 02:10.320 And from that, you had open table formats get developed. I'm talking about Iceberg, I'm talking 02:10.480 --> 02:17.040 about Delta Lake, and what's great about those is you get these basically database-table-like 02:17.040 --> 02:23.440 properties just in file storage, and multiple databases support reading from these open 02:23.440 --> 02:31.680 table formats, including now DuckDB. So you can now read and write these 02:31.680 --> 02:38.480 open table formats in blob storage with DuckDB. This is great. It enables 02:38.480 --> 02:43.760 multiplayer updates on object storage. Unfortunately, it's limited to a single table, 02:43.760 --> 02:49.600 which hopefully I can get back to. So what does Iceberg look like in DuckDB? Well, it looks like 02:49.600 --> 02:55.040 you're querying just your regular CSV or Parquet files. If you know where your metadata.json file 02:55.040 --> 03:00.640 is, you can just do a SELECT * FROM the iceberg_scan of the metadata.json, and Iceberg does 03:00.640 --> 03:05.680 all of the merging of the schema information and all of the data files, and what you get out 03:05.680 --> 03:13.200 looks just like a regular table. This is just lineitem from TPC-H.
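The `iceberg_scan` call described here is the real entry point of the DuckDB Iceberg extension; pointed at a table's metadata.json, it resolves the schema and data files and behaves like a regular table. The S3 path below is hypothetical, and the column names are from the TPC-H lineitem table mentioned in the talk:

```sql
INSTALL iceberg;
LOAD iceberg;

-- Scan an Iceberg table via its current metadata file; the path is made up.
-- The extension merges the schema information and data files for you.
SELECT l_returnflag, sum(l_quantity) AS total_qty
FROM iceberg_scan('s3://warehouse/tpch/lineitem/metadata/v3.metadata.json')
GROUP BY l_returnflag;
```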
One problem with the Iceberg 03:13.200 --> 03:18.240 format is that it is all files, and if you have fast-moving data, there is a lot of 03:18.240 --> 03:23.200 contention around: okay, who gets to insert now, what is the most recent version of the data? 03:23.200 --> 03:28.160 So what has now happened is there is a catalog layer that has been introduced, 03:28.160 --> 03:34.640 which is a way of saying: okay, this catalog, this database, now controls who 03:34.800 --> 03:40.560 gets to update this table with what data. So if any of these clients want to update your 03:40.560 --> 03:45.680 fast-moving table that exists all in blob storage, they get to write all of the data, and then 03:45.680 --> 03:50.800 they get to ask the catalog: hey, update this table with my new data. And the catalog will say: 03:50.800 --> 03:55.920 yes, okay, now this table has been updated, or the catalog will say: no, it hasn't been updated. 03:57.680 --> 04:02.480 And some of the popular catalog providers now are Lakekeeper, Polaris, Nessie, 04:02.560 --> 04:09.040 AWS Glue, and Amazon S3 Tables; these are all in our CI for DuckDB Iceberg. 04:09.040 --> 04:14.800 And this is what it looks like in the DuckDB Iceberg extension: you can create a secret 04:14.800 --> 04:21.520 using our secrets functionality, you can attach to, in this example, 04:21.520 --> 04:27.920 S3 Tables, with your S3 Tables Amazon resource name, type iceberg, endpoint type, and then you're 04:27.920 --> 04:32.480 off to the races and you can dump all of your CSVs and Parquet files into an Iceberg table. 04:34.880 --> 04:39.200 If you've been paying attention, you know that I've mentioned Iceberg, OTFs, open table formats, 04:39.200 --> 04:43.440 and a lot of table state exists in these Avro files.
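The secret-plus-attach flow described for S3 Tables looks roughly like this. `CREATE SECRET`, `TYPE iceberg`, and `ENDPOINT_TYPE` are real DuckDB Iceberg extension syntax; the ARN, catalog alias, namespace, and file names are hypothetical:

```sql
INSTALL iceberg;
LOAD iceberg;

-- Credentials via DuckDB's secrets functionality, here using the AWS credential chain.
CREATE SECRET (TYPE s3, PROVIDER credential_chain);

-- Attach the S3 Tables bucket as an Iceberg catalog; the ARN is made up.
ATTACH 'arn:aws:s3tables:us-east-1:123456789012:bucket/my-table-bucket'
    AS my_catalog (TYPE iceberg, ENDPOINT_TYPE s3_tables);

-- Dump a CSV into an Iceberg table managed by the catalog.
CREATE TABLE my_catalog.my_namespace.events AS
    SELECT * FROM read_csv('events.csv');
```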
Given that, I just want to 04:43.440 --> 04:48.480 mention that DuckLake has recently been developed at DuckDB, which takes a lot of that metadata state 04:48.480 --> 04:53.680 and puts it into the catalog, into the database. And that's a lot more efficient. 04:54.640 --> 04:58.960 It's still a little bit under development, but you can attach with an ATTACH statement, 04:58.960 --> 05:04.400 super easy. And I'm happy to announce that we will be releasing a major version next month, February, 05:04.400 --> 05:08.960 and that's, yeah, that's my talk. Thank you so much.
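The ATTACH statement mentioned for DuckLake can be sketched as follows. The `ducklake:` attach prefix is from the DuckLake extension; the metadata file name and table are hypothetical, and here the catalog database is just a local DuckDB file:

```sql
INSTALL ducklake;

-- Attach a DuckLake catalog; the metadata lives in a database rather than
-- in Avro/JSON files next to the data. The file name is made up.
ATTACH 'ducklake:my_metadata.ducklake' AS my_lake;
USE my_lake;

CREATE TABLE demo AS SELECT 42 AS answer;
```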