WEBVTT
00:00.000 --> 00:07.260
what's up hi Joe hello so we're hanging out at the forward data conference in
00:07.260 --> 00:13.320
wonderful Paris so good to finally hang out yeah Joe it's amazing I mean I
00:13.320 --> 00:16.800
didn't think that we had to come to Paris to finally meet and I was
00:16.800 --> 00:20.700
expecting somewhere in the US but hey here we are well I was actually in your
00:20.700 --> 00:25.660
neck of the woods last week in Amsterdam but we didn't yeah somehow
00:25.660 --> 00:30.520
managed to miss each other but I forgot to send the fax yeah so oh god the fax
00:30.520 --> 00:34.420
machines yeah this is a this is a you know something embarrassing for the
00:34.420 --> 00:38.560
people of Germany to to still using fax machines they still use them yeah yeah
00:38.560 --> 00:42.700
for what I think if you want to talk to some official thing like a like a you
00:42.700 --> 00:46.780
know government agency they they like faxes still although I have recently read
00:46.780 --> 00:51.280
that a what is it I think the tax people are now stopping to accept faxes
00:51.280 --> 00:57.580
finally it's oh yeah I know coming along I know someone said digital technology
00:57.580 --> 01:01.360
could be used for this purpose I don't know it's it'll make sense when they're
01:01.360 --> 01:06.280
older yeah no it's really pleasure to meet you finally after I've been
01:06.280 --> 01:10.180
obviously I have a copy of your book oh you do yeah I do I use it in my class
01:10.180 --> 01:14.980
actually I teach data engineering oh yeah well I mean the
01:14.980 --> 01:20.140
university was like is there a textbook and I was like oh yes there is one
01:20.140 --> 01:24.460
so cool is that well if you ever need a guest lecturer I'm happy to come on
01:24.460 --> 01:28.720
if I happen to be in the area I can pop by if not we'll do it over zoom oh that
01:28.720 --> 01:32.440
would be interesting actually yeah yeah thanks for that yeah yeah so the
01:32.440 --> 01:36.320
students will be thrilled yes well they'll drop the class they don't
01:36.320 --> 01:39.040
obviously believe me when I say something about data engineering but they
01:39.040 --> 01:46.120
will believe you that's cool I guess to kick things off for people who don't
01:46.120 --> 01:49.900
know who you are do you want to give a quick intro sure so my name is
01:49.900 --> 01:54.280
Hannes Mühleisen I'm from Germany but I live in the Netherlands I have
01:54.280 --> 02:00.040
been living there for 12 years I am a one of the two creators of the database
02:00.040 --> 02:05.440
system called DuckDB and I'm also the co-founder and CEO of DuckDB Labs the
02:05.440 --> 02:10.300
company that employs most of the DuckDB contributors I'm also a professor of data
02:10.300 --> 02:15.280
engineering at the wonderful University in Nijmegen which is a small town that's
02:15.280 --> 02:23.440
super cool yeah DuckDB I could make a very strong argument it's it's getting up
02:23.440 --> 02:27.660
there with being a very widely used database I think in terms of mindshare
02:27.660 --> 02:31.900
at least in the analytics community I would say it's probably the the the
02:31.900 --> 02:35.500
hottest database in the world right now in my my view I think that's that's true
02:35.500 --> 02:40.080
like we we do look we do track a bunch of these vanity metrics not too seriously
02:40.080 --> 02:44.400
because like what what's event but oh there's like things like the you know
02:44.400 --> 02:49.340
DB-Engines ranking and there are things like you know the amount of
02:49.340 --> 02:55.980
downloads but I like the metrics that that are sort of not gamed by by scripts
02:55.980 --> 03:02.160
right because DuckDB of course is and I should maybe explain a database as a
03:02.160 --> 03:06.840
library right a data warehouse as a library if you want and that means that
03:06.840 --> 03:10.560
people run it in all sorts of crazy creative places and it very often gets
03:10.560 --> 03:14.780
installed like just you know to spin up a lambda or something like that so that
03:14.780 --> 03:19.300
of course that would skew your download numbers but there are metrics that are not
03:19.300 --> 03:23.700
that skewed one of them I really I'm really impressed by is the amount of
03:23.700 --> 03:28.380
unique visitors to our website okay so we accidentally made one of the more
03:28.380 --> 03:33.480
popular websites of our country right by just having documentation and our blog and
03:33.480 --> 03:38.220
all sorts of things like that like it's more than a million unique visitors each
03:38.220 --> 03:42.840
month for the website yeah it's totally wild I don't know I didn't expect that
03:42.840 --> 03:47.220
how many downloads have you had so far downloads and that's that depends a bit
03:47.220 --> 03:52.500
on the platform so there's Python we have a big Python
03:52.500 --> 03:56.340
distribution so you can pip install it from PyPI and it's like I think 7 million per
03:56.340 --> 03:59.760
month at the moment okay I actually have no idea what the integral of all of
03:59.760 --> 04:05.880
this is the sum in that sense then there's a bunch of
04:05.880 --> 04:12.060
other platforms that get downloads but Python I think is the biggest npm also
04:12.060 --> 04:16.860
has a bunch on our client there is direct downloads from a website of the CLI
04:16.860 --> 04:20.280
the things like homebrew which goes through them and we don't can't
04:20.280 --> 04:26.000
necessarily track it so we don't actually have a great way of of tracking
04:26.000 --> 04:30.080
downloads but what we we do have is these extensions plugins and those go
04:30.080 --> 04:35.420
through our download server and that's currently on the order of I
04:35.420 --> 04:41.660
think 300 terabytes each month just in extension downloads so so that's just
04:41.660 --> 04:45.320
somebody installing a DuckDB extension and that sums up to 300 terabytes by the
04:45.320 --> 04:52.680
way grateful to Cloudflare for sponsoring us thank you that that would be I mean
04:52.680 --> 04:55.760
it's not that Cloudflare charges us money but it's something that gives us like the
04:55.760 --> 04:59.780
confidence that we can pull this off for the next 10 years wow to come yeah it's
04:59.780 --> 05:05.240
it's quite wild Jordan has said at some point that if what
05:05.240 --> 05:08.900
you're doing is exponentially growing every day is the craziest day of your
05:08.900 --> 05:15.980
life uh-huh and it has definitely been that who is that Jordan Tigani of MotherDuck yeah okay
05:15.980 --> 05:20.780
I'm sure like Michael Jordan no no no no that's the other yeah so that's really
05:20.780 --> 05:25.340
something that that has surprised us obviously if you make an open source
05:25.340 --> 05:29.240
project your default is that no one will care mm-hmm right that's true for
05:29.240 --> 05:36.680
99.9% of GitHub repos nobody cares about them yep and when we started
05:36.680 --> 05:41.000
building DuckDB obviously we thought we had a bit of an angle on what could
05:41.000 --> 05:46.780
you know prove to be popular but obviously you have no idea without it
05:46.780 --> 05:50.860
happening right so you don't know it's still unlikely to happen so it's it's
05:50.860 --> 05:55.480
it was an interesting experience I think the it's interesting if you wonder
05:55.480 --> 06:00.780
like I think we knew that we were onto something when the VC started calling oh
06:00.780 --> 06:06.700
right tell me more well okay so we open sourced it in summer of
06:06.700 --> 06:15.280
2019 yeah and then we spun off the company in about 21 but before that
06:15.280 --> 06:22.540
actually in early 21 I think there was I think a Hacker News post which was a
06:22.540 --> 06:27.440
terrible like it was a terrible article it was like here's DuckDB it's like
06:27.440 --> 06:32.700
Postgres yes I remember this who wrote it I don't
06:32.700 --> 06:39.360
remember who wrote it not us it's like SQLite I remember it was
06:39.360 --> 06:43.680
here's DuckDB it's a database it's like SQLite but with Postgres features
06:43.680 --> 06:47.700
which is a very bad characterization of what DuckDB is but that went viral on
06:47.700 --> 06:53.000
Hacker News and that was the sort of that I think was what pushed us over the
06:53.000 --> 06:58.020
thousand stars in GitHub or something like that and that was when you know the
06:58.020 --> 07:01.340
curve started to take its current form I think we now have something in the
07:01.340 --> 07:06.160
order of 25,000 stars in GitHub or something which is not a lot if not a lot
07:06.160 --> 07:11.300
if you have a JavaScript library true but for a data system it's pretty it's
07:11.300 --> 07:14.840
pretty crazy yeah yeah it's been a wild ride I can I have to say that's
07:14.840 --> 07:19.440
interesting walk me through the beginnings and so and actually we're
07:19.440 --> 07:23.660
talking about it earlier so yeah how you came to the database how you named
07:23.660 --> 07:28.080
it all those kind of fun things I think it's pretty hilarious I mean the
07:28.080 --> 07:32.280
database name to be brief is called DuckDB because I used to have a pet duck so I
07:32.280 --> 07:38.760
live on a boat with my family and the neighbors cats kept drowning and so we
07:38.760 --> 07:44.820
decided to not have a cat because yeah they fall in the water it's really sad so
07:44.820 --> 07:48.720
we thought we won't have a cat we'll have
07:48.720 --> 07:54.120
a duck instead of a cat and the duck can swim so we got this
07:54.120 --> 07:59.760
little duckling called Wilbur I taught him how to swim I taught him how to fly it
07:59.760 --> 08:05.580
was very sweet and he has since left and probably has started a ducky family
08:05.580 --> 08:11.340
somewhere but in honor of little Wilbur the database is called DuckDB that's
08:11.340 --> 08:16.260
yeah I don't know it was it was it was kind of obvious to me I didn't think about
08:16.260 --> 08:20.580
it a whole lot but but yeah that was that was the that was very early on and
08:20.580 --> 08:26.640
DuckDB is a product of Mark Raasveldt and me originally and Mark used to be my
08:26.640 --> 08:30.240
PhD student because I come from this whole academic background from back in
08:30.240 --> 08:32.940
Amsterdam at the Centrum Wiskunde & Informatica which is like the Dutch
08:32.940 --> 08:38.520
national research lab for mathematics and computer science it's by the way
08:38.520 --> 08:43.020
where Python was invented oh really yeah they invented Python well Guido
08:43.020 --> 08:48.840
invented Python while he was there wow and yeah in the same sort of research
08:48.840 --> 08:52.440
institute we came up with this idea for DuckDB because we realized that
08:52.440 --> 08:57.360
people were kind of ignoring database technology for the wrong
08:57.360 --> 09:05.340
reasons what do you mean well okay so obviously databases this
09:05.340 --> 09:09.420
relational data transformation is something that's well understood I would
09:09.420 --> 09:15.060
argue it is also the field with a long tradition and a significant body of work
09:15.060 --> 09:21.060
and you know best practices but people were casting that aside for reasons
09:21.060 --> 09:24.480
like oh yeah but it's super hard to get Postgres running I'd rather run like
09:24.480 --> 09:29.700
MongoDB mmm which is vastly inferior from a technical perspective but it was
09:29.700 --> 09:33.660
easy to get running so we took some inspiration from that and actually said
09:33.660 --> 09:40.200
but what if we take the sort of the body of knowledge the the orthodoxy of what
09:40.200 --> 09:45.780
database engine should look like and just put it into a package that doesn't make
09:45.780 --> 09:49.920
you you know hate everything and everyone around you I mean I tried to
09:49.920 --> 09:54.780
install Oracle on a box and I'm at some point I realized that normally
09:54.780 --> 10:00.780
consultants do this yeah because it is horrifying right and so with DuckDB
10:00.780 --> 10:05.340
we really try to be like absolutely like minimalistic in terms of what you need to
10:05.340 --> 10:10.260
install it there's no dependencies right zero you don't need root to install it
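A minimal sketch of that install story, assuming a standard Python environment (the duckdb wheel has no external dependencies and needs no root):

    # pip install duckdb          (a plain user-level install, no server, no root)
    import duckdb
    duckdb.sql("SELECT 42 AS answer").show()   # runs in-process, nothing else to set up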
10:10.260 --> 10:19.680
it's small ish it's like you know tens of megabytes of binary size yeah it's
10:19.680 --> 10:22.980
just generally trying to be unobtrusive but still contain a state-of-the-art
10:22.980 --> 10:27.660
query engine and since we've done that we
10:27.660 --> 10:32.760
actually went much further so we're still doing research in the field of
10:32.760 --> 10:37.860
databases on how we can you know make DuckDB better like we're doing
10:37.860 --> 10:41.820
research for example we just wrote a paper on parsing we did bunch of papers
10:41.820 --> 10:46.860
on bigger-than-memory processing which is something where surprisingly not a lot
10:46.860 --> 10:53.160
of work had been done in the field so we are still you know actually let's say
10:53.160 --> 10:56.320
pushing the envelope of what a relational engine can be but at the same time we're
10:56.320 --> 11:02.580
making it trivial to use and I think that's the interesting sort of combo yeah well
11:02.580 --> 11:06.060
and we were talking last night too you even have somebody working on like the
11:06.060 --> 11:12.300
CSV yeah yeah shout out to Pedro he's I think the second
11:12.300 --> 11:17.100
or third sort of person besides Mark and me to
11:17.100 --> 11:23.940
work on DuckDB and he was a postdoc at the CWI and he did a bunch of stuff he did
11:23.940 --> 11:28.800
like the ART index like a tree structure for indexing but then he found his true
11:28.800 --> 11:34.080
calling which is the CSV reader a noble calling I know and he's
11:34.080 --> 11:39.720
been he's been working on the CSV reader ever since and other things but it's
11:39.720 --> 11:45.900
like his main project and it's super interesting to see you know what he's
11:45.900 --> 11:50.820
done and I think the reason why we spending so much sort of time on CSV
11:50.820 --> 11:54.420
reading is because it is the first thing you do yeah you're running a new
11:54.420 --> 11:57.960
database first thing you do is you're not going to enter your data like with the
11:57.960 --> 12:00.540
keyboard or anything like that you're not running insert statements you're going
12:00.540 --> 12:05.220
to load some CSV files these days ideally it's going to be parquet files but like
12:05.220 --> 12:09.060
yeah it's still gonna be CSV files and so I have spent so much time of my life
12:09.060 --> 12:14.520
dealing with broken CSV readers out there that it's absolutely clear to us
12:14.520 --> 12:16.980
that this needs to be absolutely top-notch we need to have the best
12:16.980 --> 12:22.020
CSV reader in the business and I think we actually do so so that's just to keep this
12:22.020 --> 12:27.060
initial threshold of people using your system somewhat manageable like they
12:27.060 --> 12:30.600
need to be like I think our goal is people like point this thing at
12:30.600 --> 12:34.440
something can be a CSV file it can be a parquet file it can be a bunch of JSON
12:34.440 --> 12:41.280
files last week I worked on Avro files anything point it at it and it
12:41.280 --> 12:45.360
will just be like yes sir here's your table all right
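A minimal sketch of that point-it-at-a-file experience via DuckDB's Python API; the file names below are made up for illustration:

    import duckdb
    # format detection and schema sniffing happen automatically
    duckdb.sql("SELECT * FROM 'orders.csv' LIMIT 5").show()                  # CSV
    duckdb.sql("SELECT * FROM 'events/*.parquet' LIMIT 5").show()            # a folder of Parquet files
    duckdb.sql("SELECT * FROM read_json_auto('logs.json') LIMIT 5").show()   # JSON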
12:45.360 --> 12:49.380
yeah that's where we want to be I think we're pretty close so that's
12:49.380 --> 12:54.360
and I think it comes back to this idea of like user experience I think I think
12:54.360 --> 12:59.220
databases I always say they tend to be sold on golf courses mm-hmm because like
12:59.220 --> 13:03.060
the CEO talks to the other CEO and they go and then they agree on a price shake
13:03.060 --> 13:08.900
hands and then and then that's how the database was sold we try we don't do this
13:08.900 --> 13:14.400
obviously because it's free and open source but we have a more bottom-up
13:14.400 --> 13:18.960
strategy and to do that the experience needs to be good right people need to
13:18.960 --> 13:23.820
just actually just we try to amaze people a bit with okay just can do this
13:23.820 --> 13:29.160
and it works fine and yeah it's people seem to like it what can I say it's that's
13:29.160 --> 13:33.820
interesting so do you have like an opinion on guardrails as well or is it
13:33.820 --> 13:37.680
more meant is your philosophy just make everything as simple as possible no
13:37.680 --> 13:40.320
matter what it is or do you have certain opinions about where those
13:40.320 --> 13:44.720
limitations should be um what do you mean by guardrails guardrails like we're
13:44.720 --> 13:51.000
talking last night about strings giving an example there where yeah if you want
13:51.000 --> 13:54.640
to do something with a string go for it we don't really care oh yeah oh yeah
13:54.640 --> 14:02.040
yeah that's interesting I think I think that that is about a schema well you have
14:02.040 --> 14:06.620
obviously been working on schema stuff so we've been talking about that but yeah but
14:06.620 --> 14:13.100
the let's say to be more forgiving I think databases traditionally have not
14:13.100 --> 14:17.900
been very forgiving we try to be forgiving in DuckDB more so maybe than
14:17.900 --> 14:23.840
other systems so we have things like we have this intermediate compression step
14:23.840 --> 14:29.240
where during execution of a pipeline we will actually look at the types and the
14:29.240 --> 14:33.580
statistics of the types that are in the in the columns in the data and then we
14:33.580 --> 14:36.760
will actually insert intermediate compression decompression steps just to
14:36.760 --> 14:40.960
lower the memory pressure on the way and the complexity of operations so it will
14:40.960 --> 14:44.380
actually not make it make a huge difference whether your type is
14:44.380 --> 14:48.900
declared as let's say a string but only contains integers between one
14:48.900 --> 14:54.760
and a hundred you will still get a good sort of result in terms of performance same
14:54.760 --> 14:58.820
for storing things right we have a bunch of optimizations for storing very short
14:58.820 --> 15:03.460
strings for storing very regular strings we have yeah we have we have
15:03.460 --> 15:06.700
integer compression we have like there's a lot of sort of stuff that happens
15:06.700 --> 15:10.480
magically behind the scenes so you don't have to think about it like our on
15:10.480 --> 15:14.920
disk compression representation like if you you can use DuckDB to store a database
15:14.920 --> 15:21.480
file on disk and there there is this you can say which compression you want but by
15:21.480 --> 15:26.260
default we will actually run sort of an exploratory analysis of like
15:26.260 --> 15:29.460
okay let's try all our compression mechanisms which one is working best okay this
15:29.460 --> 15:35.580
one great let's use this one we also have things like if you have an expression let's
15:35.580 --> 15:40.260
say I have a comparison like let's say select stuff from table
15:40.260 --> 15:47.640
where I don't know this regex matches and this other value is
15:47.640 --> 15:52.980
bigger than four okay look just if you visualize this I have a filter that says if
15:52.980 --> 15:58.080
the regex matches and the other number is bigger than four then the row should
15:58.080 --> 16:02.280
qualify okay so now for maybe not everyone knows but matching a regex is
16:02.280 --> 16:06.420
way more expensive than running a bigger than four comparison so we actually
16:06.420 --> 16:10.980
will automatically reorder this comparison so we first check the bigger-than-four
16:10.980 --> 16:16.620
comparison and then check the regex
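Roughly the kind of query being described, sketched with DuckDB's Python API (the table and column names are made up); the reordering itself is done by the optimizer, not by the user:

    import duckdb
    con = duckdb.connect()
    con.execute("CREATE TABLE t AS SELECT 'row ' || range::VARCHAR AS s, range % 10 AS n FROM range(1000)")
    # the cheap numeric filter can be evaluated before the expensive regex,
    # even though the SQL text lists the regex first
    print(con.execute("SELECT count(*) FROM t WHERE regexp_matches(s, '9$') AND n > 4").fetchone())
    con.sql("EXPLAIN SELECT count(*) FROM t WHERE regexp_matches(s, '9$') AND n > 4").show()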
16:16.620 --> 16:20.340
just because yeah we want to be forgiving in terms of performance we also don't want to create these crazy
16:20.340 --> 16:25.200
performance cliffs that people generally hate databases for right where you change
16:25.200 --> 16:29.100
one little thing you add one value to your rows and suddenly the plan
16:29.100 --> 16:32.580
changes and it's over right like we want to be a bit more robust here not saying
16:32.580 --> 16:36.240
we're entirely there yet but it's definitely a goal it's it's just I think
16:36.240 --> 16:41.400
it's just trying to just trying to be friendly I think it's the it's the general
16:41.400 --> 16:45.960
goal if I want to be more strict do I have that option of
16:45.960 --> 16:50.520
making it stricter it's interesting that you say that because we get the request
16:50.520 --> 16:56.280
every now and then to be more strict like I can
16:56.280 --> 17:01.800
mention the compression technique there you can force a specific one
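A small sketch of pinning one compression algorithm instead of letting DuckDB choose, assuming the force_compression pragma (which exists mainly for testing); the file name is hypothetical:

    import duckdb
    con = duckdb.connect("example.duckdb")           # hypothetical database file
    con.execute("PRAGMA force_compression='rle'")    # pin RLE rather than exploring per column
    con.execute("CREATE TABLE t AS SELECT range % 5 AS v FROM range(100000)")
    con.execute("CHECKPOINT")                        # compression is applied when data is written out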
17:01.800 --> 17:05.880
I don't think you can force a specific execution order for expressions I think we will
17:05.880 --> 17:10.380
always optimize it but it is definitely something that as people productize
17:10.380 --> 17:16.380
DuckDB more or put it more into into you know products or you know tools that do
17:16.380 --> 17:21.180
something else but also include DuckDB we probably will eventually
17:21.180 --> 17:28.560
not get around having more flags to make it more predictable more
17:28.560 --> 17:32.040
deterministic just because you don't want it to change its mind three years
17:32.040 --> 17:35.400
down the road right like it's that's that's exactly the danger that that you can
17:35.400 --> 17:39.880
have there mm-hmm absolutely well especially if you do it using DuckDB on the edge for
17:39.880 --> 17:43.900
example yeah you can't update it as much then yeah yeah there is people that are
17:43.900 --> 17:50.980
running DuckDB on all sorts of devices that's that's and as something that in
17:50.980 --> 17:54.400
the past I think that wasn't a very good idea since we released 1.0 this year I
17:54.400 --> 17:58.000
think we are comfortable with the idea of somebody running that version of
17:58.000 --> 18:02.500
DuckDB somewhere for five years and you know it working out fine uh-huh we do
18:02.500 --> 18:07.960
regularly test against regressions against DuckDB 1.0 and so far we
18:07.960 --> 18:13.740
haven't found anything dramatic so that's what's the craziest place you've
18:13.740 --> 18:18.700
seen DuckDB deployed um so one thing is really crazy when I saw it first was the
18:18.700 --> 18:23.680
web browser okay so André Kohn from the Technical University of Munich
18:23.680 --> 18:29.800
he made DuckDB Wasm which is this version of DuckDB that's compiled to run in
18:29.800 --> 18:35.460
WebAssembly in the web browser right somewhere else I think when I saw that I was
18:35.460 --> 18:40.260
deeply impressed because I didn't even consider that to be a possibility that
18:40.260 --> 18:43.680
you could run DuckDB in a website and you can and lots of people use it is one
18:43.680 --> 18:48.000
of our biggest sort of deployment targets right now because
18:48.000 --> 18:51.240
people use it yeah people put it in dashboards people put it in you know
18:51.240 --> 18:55.440
people use it in in like something behind the scenes in a website like we have we
18:55.440 --> 19:00.360
have people in visualization that use the DuckDB Wasm version to just you know
19:00.360 --> 19:05.460
drive a visualization you know to be reactive to user input live like right
19:05.460 --> 19:10.860
there on the thing that I think was I think the the craziest deployment target
19:10.860 --> 19:13.920
there's of course always the you know the random weirdos we have somebody I'm
19:13.920 --> 19:19.380
sorry valued contributors we have we have somebody that managed to compile DuckDB
19:19.380 --> 19:27.240
for the IBM mainframes like IBM Z series which I didn't think was possible but it
19:27.240 --> 19:31.860
worked and he's happy I think so so that's always funny others like what
19:31.860 --> 19:35.480
would that be used for like I mean that's not such a terrible idea I mean
19:35.480 --> 19:38.400
that yeah it's gonna be working I think pretty well and it's like okay you know
19:38.400 --> 19:42.960
maybe not on 10,000 cores at the same time but if these mainframes of course
19:42.960 --> 19:47.220
allow you to use fewer cores for a problem so you're running there good memory
19:47.220 --> 19:50.040
access I don't know I've never used one of these things I'm sorry yeah I don't we
19:50.040 --> 19:54.240
don't have enough money for an IBM Z series someday if someone sends you some
19:54.240 --> 19:58.080
money so you can get an IBM mainframe we haven't taken any VC money
19:58.080 --> 20:01.620
so I regrettably you know our cocaine budget is limited
20:01.620 --> 20:12.240
dang it well you're in Amsterdam it's difficult IBM mainframes coke whatever boats
20:12.240 --> 20:16.620
I don't know you have a lot of those in Amsterdam yeah that's true that's
20:16.620 --> 20:21.240
interesting yeah but it's a fun ride I have to say I think you said that there's
20:21.240 --> 20:25.560
space satellites running it we haven't gotten confirmation we just went
20:25.560 --> 20:32.480
through the official software procurement process with NASA recently which is
20:32.480 --> 20:36.660
something they it's just funny because they have this massive entitlement that
20:36.660 --> 20:41.000
you will respond to them right so you are just a unsuspecting maintainer of a
20:41.000 --> 20:45.380
popular open-source project and you get a you know 40 question questionnaire sent to
20:45.380 --> 20:48.540
you by NASA and they say well if you don't fill this we can't use your software and
20:48.540 --> 20:52.860
you think what part of you know MIT license don't you understand right yeah
20:52.860 --> 20:57.660
it's like it's you know not fit for any particular purpose but we of course
20:57.660 --> 21:01.740
because it's NASA we did fill it out I don't I don't know exactly what what
21:01.740 --> 21:06.300
mission this is planned to be used for but once we hear we'll know obviously
21:06.300 --> 21:10.620
but because we're European we don't have any telemetry in DuckDB so we don't
21:10.620 --> 21:15.780
really know where it's running so explain that part for the audience but like what
21:15.780 --> 21:21.240
what do you mean by that okay so in I think there is some cultural differences
21:21.240 --> 21:28.740
between the US and Europe one of these differences regards the sort of
21:28.740 --> 21:35.940
sensitivity to privacy things and I think it's pretty common for software
21:35.940 --> 21:40.140
projects especially more modern ones that come out of the US to have some sort of
21:40.140 --> 21:45.720
telemetry built-in basically they will report they will phone home in some way
21:45.720 --> 21:52.920
and report you know hey I'm just running on this IP I'm this version of DuckDB this
21:52.920 --> 21:57.840
is my you know Linux version or this is my glibc version like I don't know this is
21:57.840 --> 22:01.980
usually hidden from the users as a auto updater feature or something like that
22:01.980 --> 22:08.020
which is or not hidden maybe obfuscated is the right word here but we are in
22:08.020 --> 22:11.620
Europe and people care deeply about this sort of thing and we don't want to
22:11.620 --> 22:15.040
leak any information and we don't want to collect information that we don't
22:15.040 --> 22:19.420
strictly need we don't strictly need to know where DuckDB is running so we don't
22:19.420 --> 22:23.920
collect it I mean I mentioned earlier we do see this summary statistics on the
22:23.920 --> 22:27.940
extension installs but that's just because they go through Cloudflare and
22:27.940 --> 22:32.620
they will be able to report like here you know you had this many IPs from
22:32.620 --> 22:36.960
Germany and this many IPs from China and so on and so forth but it is it is
22:36.960 --> 22:40.560
something that we're conscious of I think if one of the strengths of DuckDB
22:40.560 --> 22:47.100
being a local-first kind of system is also that you know you
22:47.100 --> 22:51.000
could have an app imagine you have an app like Strava right they had some
22:51.000 --> 22:55.500
privacy issues recently where people found out like where military bases were
22:55.500 --> 22:59.340
where the like celebrities houses are how to you know best catch them in the
22:59.340 --> 23:06.940
forest a bit scary and why well because the Strava app uploads all
23:06.940 --> 23:14.320
this stuff to the cloud and then they use some big data BS to process it
23:14.320 --> 23:18.580
well but there's no reason there's not really a strong reason to do that they
23:18.580 --> 23:23.620
could also just leave this data on your device run DuckDB locally do all these
23:23.620 --> 23:27.820
analytics and and then you know that would not be an issue so I think that's
23:27.820 --> 23:31.940
something that it's not my main concern but it's
23:31.940 --> 23:36.860
something that I think is is nice if you can you know leave data under the
23:36.860 --> 23:41.220
control of the device that the people have I think that
23:41.220 --> 23:44.400
would be that's a really cool thing of about DuckDB that you can just deploy it
23:44.400 --> 23:50.560
close to the user and we have people running on on phones that's actually on
23:50.560 --> 23:57.660
iPhones is we did an experiment last week a fun fun fun story we found that in
23:57.660 --> 24:01.260
order to get the best database performance out of an iPhone you have
24:01.260 --> 24:07.380
to put it into a box with dry ice oh yeah because it slows
24:07.380 --> 24:12.780
itself down when it gets hot but if you put it in dry ice it doesn't do
24:12.780 --> 24:16.500
that so you showed me a picture of that too yeah yes I think it was kind of melted
24:16.500 --> 24:20.520
around the CPU core oh it wasn't melted the ice melted yeah the phone
24:20.520 --> 24:24.660
wasn't melted just to clarify the phone died briefly after the
24:24.660 --> 24:28.680
experiment but it thankfully came back to life after 10 minutes was that your
24:28.680 --> 24:33.240
phone or somebody else's I have to admit that we bought it from Amazon and
24:33.240 --> 24:35.780
returned it
24:36.900 --> 24:44.060
sorry sorry whoever gets this phone you're a part of history you just won't know it
24:44.060 --> 24:50.640
hopefully we didn't break it too much yeah but we needed we wanted the latest
24:50.640 --> 24:56.680
model because obviously and I want to add you know and Apple stopped making the
24:56.680 --> 24:59.880
iPhone mini so it's really Apple's fault that's what you have though right
24:59.880 --> 25:03.120
it's what I want but they don't make it anymore oh yeah you're gonna have to get
25:03.120 --> 25:07.920
this like big bugger here then yeah well well just means you have to have like
25:07.920 --> 25:14.100
bigger pockets that's true but uh yeah I don't know I don't want the max though
25:14.100 --> 25:17.220
because it feels like I'm have like a iPad mini at that point maybe good as a
25:17.220 --> 25:21.360
self-defense weapon though right like if you could do that can whack people with
25:21.360 --> 25:26.340
it yes that seems like a nice there might be another test for you to do is at
25:26.340 --> 25:30.360
what force can I knock somebody out with my iPhone no we are more about
25:30.360 --> 25:32.720
databases
25:32.720 --> 25:38.820
DuckDB could be collecting analytics I'm not sure just kidding um oh yeah that's
25:38.820 --> 25:44.280
interesting like um so you gave a talk today what was your talk about today today I
25:44.280 --> 25:52.680
talked about updating data which is this this one weird trick that I don't know
25:52.680 --> 25:59.440
that's that Hadoop doesn't know I don't know but this general idea that changing
25:59.440 --> 26:04.140
data is a good thing and especially that transactional changes to data are a
26:04.140 --> 26:09.180
good thing and that transactional changes to data are also a good idea
26:09.180 --> 26:13.860
for analytics right and it's actually kind of interesting because there are a bunch
26:13.860 --> 26:20.220
of popular analytics systems out there I won't name names that completely ignore
26:20.220 --> 26:26.460
transactional semantics right so you say you're loading a CSV file the thing goes
26:26.460 --> 26:31.180
wrong halfway through or something crashes or I don't know your internet goes down and
26:31.180 --> 26:38.800
and now basically the file will be half loaded and somehow that's
26:38.800 --> 26:46.720
acceptable and so we have or we have systems that I also won't name names even
26:46.720 --> 26:53.980
though you're wearing those socks that are not getting the
26:53.980 --> 26:59.940
durability right where you know when you when you commit a transaction the you know
26:59.940 --> 27:04.980
conventional again body of knowledge of database orthodoxy dictates that if you
27:04.980 --> 27:10.440
commit a transaction you need to synchronize your changes to
27:10.440 --> 27:15.480
the disk using something called fsync and there are systems out there that
27:15.480 --> 27:20.120
simply ignore this because then they can say ah but I get 15 million transactions
27:20.120 --> 27:25.320
per second or you don't right and this is true if you disable syncing you can do a
27:25.320 --> 27:29.720
lot but that potentially leads to the problem that your database gets corrupted
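A generic illustration of the durability point being made here (not DuckDB's actual implementation): a commit may only be acknowledged after the change has been forced to stable storage.

    import os
    def commit(wal_path: str, record: bytes) -> None:
        # append the commit record to a write-ahead log
        with open(wal_path, "ab") as f:
            f.write(record)
            f.flush()              # flush user-space buffers to the OS
            os.fsync(f.fileno())   # force the OS to write to the physical disk
        # only now is it safe to report the transaction as committed;
        # skipping the fsync makes the benchmark numbers great until a crash corrupts the database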
27:29.720 --> 27:36.500
it's not great so I was trying to make this point to say hey um we
27:36.500 --> 27:43.600
really want to have sort of classical transactionality for analytics
27:43.600 --> 27:50.480
pipelines so that's I mean you start with we want this okay and then you can say
27:50.480 --> 27:54.560
okay you can want everything but is there an efficient implementation and I think
27:54.560 --> 27:58.880
we've shown with DuckDB because we've wrote a bunch of blog posts on this and you
27:58.880 --> 28:04.760
know we did we did again we innovated what databases can do we showed with DuckDB
28:04.760 --> 28:12.380
that you can actually have full ACID-compliant transactional
28:12.380 --> 28:18.440
semantics in an analytical system without punishing performance that's that's the
28:18.440 --> 28:25.340
that's the big sort of asterisk right and I was trying to show also that if you
28:25.340 --> 28:31.640
have that you can do cool things like you can for example do things like have you
28:31.640 --> 28:35.120
know all-or-nothing kind of semantics on reading a whole folder of CSV files this is
28:35.120 --> 28:40.580
pretty cool to have right
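A minimal sketch of that all-or-nothing load, using DuckDB's Python API; the folder, file pattern, and table name are hypothetical:

    import duckdb
    con = duckdb.connect("warehouse.duckdb")
    con.execute("BEGIN TRANSACTION")
    try:
        # load a whole folder of CSV files into one table
        con.execute("CREATE TABLE events AS SELECT * FROM read_csv_auto('events/*.csv')")
        con.execute("COMMIT")      # either every file is in...
    except Exception:
        con.execute("ROLLBACK")    # ...or none of it is, and the database is unchanged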
28:40.580 --> 28:46.120
you can have all-or-nothing semantics on you know schema changes on table creation table deletion that kind of
28:46.120 --> 28:50.840
thing you always come from a consistent state and go to a consistent state you can
28:50.840 --> 28:55.880
have constraints I mentioned the ART index that Pedro was working on earlier you can
28:55.880 --> 29:02.000
have you know say a primary key defined and it's by the way it is again I won't name names but it is
29:02.000 --> 29:06.960
completely common for analytical data management systems to ignore things like
29:06.960 --> 29:10.720
primary key constraint yeah it's common I know and you know why it's common
29:10.720 --> 29:16.700
why well it's because it's expensive right yeah checking a primary key is
29:16.700 --> 29:22.940
like okay you need to have some giant hash table or a B tree or God knows what to to actually ensure
29:22.940 --> 29:28.700
that that primary key is unique yeah or that the constraint holds and again we've worked
29:28.700 --> 29:36.800
very hard one of our team Tania she's our local index hero she's working very hard on
29:36.800 --> 29:44.840
making an indexing structure that will be able to check constraints efficiently you
29:44.840 --> 29:49.100
know at transaction commit to make sure that your data goes from one consistent state to another
29:49.100 --> 29:53.720
consistent state and that is really just something great to have so that's that's kind of I think
29:53.720 --> 29:59.180
that was what the main point I wanted to make and maybe maybe the sort of the one of the side notes
29:59.180 --> 30:07.700
that that makes this possible is that we used to trade away everything in data for scale yeah like
30:07.700 --> 30:14.720
there's entire like the entire NoSQL movement was essentially that right we said hey we need
30:14.720 --> 30:22.940
web scale whatever that meant back in 2001 lol right before iPhones we need web scale therefore
30:22.940 --> 30:29.000
we need to throw everything overboard that we've ever had like again we throw all the you know crap
30:29.000 --> 30:38.000
that these old database people had invented overboard for scale and one of the things that I was
30:38.000 --> 30:42.080
talking about is like this you probably have talked did you talk to Jordan by the way before I
30:42.080 --> 30:46.880
think you had yeah yeah and you talked about big data with him right yeah yeah so you had talked
30:46.880 --> 30:51.740
about how he makes a very good point about big data not being as big as maybe you think it is
30:51.740 --> 30:58.220
and my point as a follow-up on that is like yeah okay it's not that big which means we can we don't have to
30:58.220 --> 31:06.560
trade everything away we can have things like transactional semantics in a not terrible way we can have you know
31:06.560 --> 31:15.780
basically data warehouse technology this is a weird word we'll have to talk about that yeah in a
31:15.780 --> 31:22.880
non-punishing sort of way from a performance perspective but yeah and I think we should talk
31:22.880 --> 31:30.560
about lake house formats very briefly because I don't like lake house formats let's talk about that full
31:30.560 --> 31:37.940
disclosure so this is this is uh how should I say I think I think it's um it's interesting because I'm
31:37.940 --> 31:46.400
okay I have to maybe as a bit of background I'm I really love file formats okay I'm really I'm really
31:46.400 --> 31:53.720
obsessed with file formats I've personally implemented the parquet reader in DuckDB of course we worked as a
31:53.720 --> 32:02.180
team we worked on our own file format this week I implemented the Avro code we
32:02.180 --> 32:10.640
actually did a paper about protocols Lawrence my current PhD student he's doing papers
32:10.640 --> 32:19.040
on serialization of data structures to disk I really like file formats and I
32:19.040 --> 32:23.660
thought about this recently and I think it's so cool because it's a dimensionality reduction
32:23.660 --> 32:31.100
you have to somehow take a multi-dimensional structure a table is 2D right columns rows and
32:31.100 --> 32:37.820
you have to put that into this one-dimensional thing which is your file or your disk or your blob
32:37.820 --> 32:44.060
store or whatever and somehow all the complexity inherent in that is
32:44.060 --> 32:52.100
really beautiful so I spent a lot of time on parquet and then when Iceberg first came out I was like okay
32:52.100 --> 32:56.840
I know everything there is to know about parquet because it turns out if you implement your own reader yeah
32:56.840 --> 33:02.540
and writer you learn everything there is to learn about parquet in the process um so we did that and
33:02.540 --> 33:07.220
so I thought okay I know this I can understand iceberg and so I looked at it and I thought okay
33:07.220 --> 33:15.480
there's this JSON file on top and then there are these Avro files below that two layers because
33:15.480 --> 33:21.420
why not and then underneath sit the parquet files with the actual data
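Roughly the layering being described, as an illustrative (not exhaustive) Iceberg table directory; the exact file names vary:

    my_table/
      metadata/
        v2.metadata.json        -- table metadata, the JSON on top
        snap-123....avro        -- manifest list (Avro, layer one)
        abc123-m0.avro          -- manifest file (Avro, layer two)
      data/
        part-000....parquet     -- the actual data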
33:21.420 --> 33:27.240
and it seemed extremely cumbersome and at the time I could only really criticize the choice of
33:27.240 --> 33:34.680
avro which to this day I don't understand why you have two different metadata formats in one system but
33:34.680 --> 33:42.960
actually I think my beef with lakehouse formats is something else it's more this
33:42.960 --> 33:55.140
yeah this idea of bringing basically core data warehouse features back but in a really bad way it's hard
33:55.140 --> 34:05.580
to explain it in sort of the worst way right because
34:05.580 --> 34:11.520
you're making technical decisions based on sort of market force reasons and not really on technical
34:11.520 --> 34:17.340
reasons so I think the reason people got fed up with data warehouse systems in general
34:17.340 --> 34:27.120
is because of the pricing model right I would argue yeah is that fair to say like if Oracle hadn't
34:27.120 --> 34:34.620
charged per CPU but instead you know I don't know done something else
34:34.620 --> 34:40.280
would NoSQL ever have happened are you talking about the big tech companies that
34:40.280 --> 34:45.240
came up with their own solutions like Hadoop yeah like other stuff if Oracle
34:45.240 --> 34:51.240
hadn't charged like that I mean okay I don't want to you know pick only on Oracle but
34:51.240 --> 34:58.260
if sort of big database in the late 90s yeah hadn't had these pricing models that were essentially
34:58.260 --> 35:04.320
and they still have them let's say assuming your data has a high value per byte
35:04.320 --> 35:10.960
mm-hmm I think that's that's fair to say right right if they had in that in maybe had some other model
35:10.960 --> 35:15.120
then I think a lot of the NoSQL movement would not have happened because people would
35:15.120 --> 35:21.120
have you know happily installed I don't know SQL Server yeah or like Postgres or something like that
35:21.120 --> 35:30.120
Postgres was free right yeah but Postgres was not where it is now but I think a lot of this has to do with
35:30.120 --> 35:36.120
market forces and so people ended up hating data warehouse technology and there's of course other vendors like that are
35:36.120 --> 35:42.120
more on the analytics side but I think the same restrictions apply you had to have very deep pockets to to again golf course
35:42.120 --> 35:49.120
technology too expensive yeah exactly and so I get that but then people made again
35:49.120 --> 35:57.120
technical decisions like let's throw all this stuff overboard because the market sort of incumbents
35:57.120 --> 36:02.120
weren't willing to compromise on their pricing I think it was a big factor of it I'd love to talk to some of the
36:02.120 --> 36:08.120
some of the friends who are at the big tech companies at the time yeah like I think it was that plus there was a
36:08.120 --> 36:13.120
sense of just not scaling to quote web scale which is an entirely different discussion obviously
36:13.120 --> 36:19.120
yeah but yeah I think it yeah the notion of just let's throw things on commodity servers
36:19.120 --> 36:23.120
and and figure out how we're gonna work with that was a there's a big driving force for sure
36:23.120 --> 36:31.120
but at the time too and for people who you know are listening the vendors for these
36:31.120 --> 36:38.120
database companies I mean they big database I suppose it's aggressive right
36:38.120 --> 36:43.120
I mean you know and I know like some of these companies if you if you decided to break the contract
36:43.120 --> 36:50.120
there were also penalties for that so you know there's a lot going in and a lot going out yeah I get that
36:50.120 --> 36:56.120
but now I think what we're seeing is market forces again I mean we see
36:56.120 --> 37:01.120
again we see the incumbents I would say cloud data warehouse vendors there's three so I don't have to
37:01.120 --> 37:09.120
name them and and those again you know people people whinging whining about I don't know complaining
37:09.120 --> 37:16.120
about the pricing model you know for really like I don't I don't really know about that but so now we
37:16.120 --> 37:24.120
are and then that happened and we got data lakes right so people essentially said fine
37:24.120 --> 37:30.120
we'll we'll just dump everything into s3 you're talking about the original data lakes like the
37:30.120 --> 37:35.120
HDFS ones yeah yeah we'll put everything in HDFS yeah exactly so we'll not give money to these evil
37:35.120 --> 37:42.120
people we'll put everything in what is it called sequence files yeah remember those what a horrible thing
37:42.120 --> 37:49.120
we put everything in sequence files on our HDFS and we'll run some you know some some Java concoction
37:49.120 --> 37:57.120
on top of it and that's gonna be better than paying you know vendor X and that
37:57.120 --> 38:06.120
went on for 10 years or so I think right now we got parquet files better right and now but now we see the
38:06.120 --> 38:12.120
swing back right now the swing back is happening people clearly need data warehouse
38:12.120 --> 38:17.120
features or want them or demand them or something like that I remember the conversations
38:17.120 --> 38:25.120
back in the 2010s during the quote big data era yeah was that data warehousing and BI were going to go
38:25.120 --> 38:28.120
away and data warehousing was dead
38:28.120 --> 38:35.120
especially when Spark came out I think there was a lot of chatter that the days of data warehousing are
38:35.120 --> 38:43.120
done you fast forward to today and it's interesting because well Spark Databricks it's SQL it's basically the lakehouse you know
38:43.120 --> 38:50.120
my friend Bill Inmon who came up with the data warehouse actually wrote a book about data lakehouses so it is interesting the
38:50.120 --> 38:58.120
pendulum swings back and forth I do recall the conversation was SQL is dead data warehousing is also dead you need to forget about
38:58.120 --> 39:03.120
this stuff because we're moving in this is also around the time of data science though so I felt like there was a lot of in
39:03.120 --> 39:15.120
retrospect hubris about the power of using data frames and all this stuff and because it was it felt like the data frame was really
39:15.120 --> 39:20.120
going to supplant everything there was a point in time when pandas was roaring ahead and then spark comes out with
39:20.120 --> 39:27.120
distributed well it was the RDD at first and then distributed data frames and I felt like that had a chance of
39:27.120 --> 39:32.120
being the paradigm for like a split second and then it wasn't yeah yeah that's absolutely true and I think it's
39:32.120 --> 39:40.120
interesting because I see parallels there to what we saw with NoSQL for transactional use absolutely
39:40.120 --> 39:47.120
because people were saying up you know key value is all you need let the eventual consistency is all you need let the
39:47.120 --> 39:52.120
application developer figure it out well your socks yeah
39:52.120 --> 39:58.120
what ended up happening is that the application developers were not happy
39:58.120 --> 40:04.120
dealing with eventual consistency in fact couldn't deal with it the same happened with query languages people said
40:04.120 --> 40:09.120
ah we don't need query languages we'll have the application developer deal with this do the joins in the application
40:09.120 --> 40:15.120
well it turns out that was not great either and I think we are starting to see the same in analytics now
40:15.120 --> 40:20.120
yeah where the data lake and data science movement there basically put all the all responsibility on the
40:20.120 --> 40:26.120
individual data scientists to say you get to you know dig into this sort of giant data lake find the things that
40:26.120 --> 40:36.120
are relevant to you decode them read them parse them put them into a data frame or something and now the same effect is
40:36.120 --> 40:41.120
happening we're saying oh well some of these data warehouse features are actually great like being able to make a
40:41.120 --> 40:47.120
change to a table that's that sounds like great I mean you can do it a bit if you have a zoo of like a folder of
40:47.120 --> 40:54.120
parquet files you can throw another parquet file in there great that works but it doesn't work if you want to I don't
40:54.120 --> 41:00.120
know add a column right it doesn't work well if you want to remove rows now you have to
41:00.120 --> 41:07.120
basically invent your own homegrown invalidation system for it so then we're seeing
41:07.120 --> 41:13.120
lakehouse formats which are essentially doing that and I think what we're also seeing is these catalog things
41:13.120 --> 41:29.120
right and together if you took the user's S3 credentials away then a catalog and lakehouse formats are exactly the same as the old-school data warehouse right?
41:29.120 --> 41:36.120
like okay the only difference that we have left over is that maybe you can have those maybe you can you get to poke
41:36.120 --> 41:40.120
around in the files yourself and then your question is do you actually want that?
41:40.120 --> 41:42.120
is this a good idea?
41:42.120 --> 41:43.120
right
41:43.120 --> 41:54.120
I would argue that there are political reasons why you want to do this you want to be able to blackmail or pressure your database vendor
41:54.120 --> 42:02.120
you want to threaten vendor X that you're going to leave them for vendor Y because they're both using the same formats
42:02.120 --> 42:05.120
okay sure I get that that's a political reason again not a technical reason
42:05.120 --> 42:06.120
right
42:06.120 --> 42:12.120
and there's also just some extreme serious limitations on on lake house formats right like you know can
42:12.120 --> 42:18.120
I don't think anybody can realistically show me a path to more than one transaction per second on an iceberg file
42:18.120 --> 42:25.120
like I don't I just don't see how right you have to stage all your parquet files you have to write the metadata files
42:25.120 --> 42:32.120
you have to write the new the new root sort of metadata thing and then you have to do a commit in the in the catalog
42:32.120 --> 42:38.120
I think that's that's the way they think about it to actually switch like that's okay one transaction per second
42:38.120 --> 42:45.120
that's like 1960s style right that's not and even then they could do more than one per second
42:45.120 --> 42:52.120
so by the way I recently I learned a fun fact did you know that the booking code on your flight tickets
42:52.120 --> 42:59.120
used to be a pointer it was the actual pointer to the record
42:59.120 --> 43:05.120
oh really yeah and so they basically had tapes with all the bookings for flights
43:05.120 --> 43:09.120
and that record locator which is why it's called a record locator was just a pointer
43:09.120 --> 43:13.120
and with that pointer they could find your record by just you know going to that point on the tape anyways
43:13.120 --> 43:20.120
it's interesting Bill Inmon actually told me on a similar note that plane tickets the paper ones used to be punch cards
43:20.120 --> 43:23.120
yeah so probably these probably are related somehow
43:23.120 --> 43:29.120
I didn't know that but to come back to the lakehouse formats we're essentially building
43:29.120 --> 43:35.120
things that are inferior to you know the state of the art from 20 years ago
43:35.120 --> 43:41.120
and somehow get excited about it and I what I think is gonna happen is that somebody is gonna
43:41.120 --> 43:47.120
like what happened with Hadoop basically and the MapReduce paper is that somebody's gonna build a
43:47.120 --> 43:58.120
clone of a disaggregated-storage cloud data warehouse that you know works and I think once that happens
43:58.120 --> 44:05.120
uh we're probably gonna forget about data lake formats quite quickly because then you have the entire
44:05.120 --> 44:11.120
sort of feature set in one place again you have catalog you have query engine you have storage you have updates
44:11.120 --> 44:18.120
you have uh you know authorization authorization is something that there is no story in lake house formats
44:18.120 --> 44:23.120
how to do that right if somebody has an s3 key for your files it's over
44:23.120 --> 44:28.120
uh and this is actually one of the things where we're getting a lot of requests at the moment
44:28.120 --> 44:36.120
like is there any way you crazy people at DuckDB can make row-level authentication possible for
44:36.120 --> 44:42.120
lakehouse formats and I have to tell them there is just no way if you give somebody an S3 key
44:42.120 --> 44:48.120
it's over right they they are they are there they can do anything which again it's very interesting from a sort of
44:48.120 --> 44:55.120
democratization of access perspective because I think one of the things that made data science in its heyday
44:55.120 --> 45:03.120
uh so successful is because we had a data lake and there was just the wild west of parquet files and essentially
45:03.120 --> 45:11.120
there was no um no governance right of any sorts and your low-level analysts could just go and grab some parquet files
45:11.120 --> 45:18.120
and that's I think another swing back that we're seeing that oh actually we have tons of regulation we have to follow now
45:18.120 --> 45:24.120
we can't just do that anymore we need to we need to give you know we need to do proper authorization and logging
45:24.120 --> 45:30.120
and all that stuff and lo and behold we're back at sort of the full-scale data warehouse so so I don't know
45:30.120 --> 45:35.120
and one of the things that I'm actually a bit concerned about in that space is one of the I think reasons
45:35.120 --> 45:42.120
why DuckDB is popular is because there is this wild west right you can just download a parquet file
45:42.120 --> 45:50.120
from your S3 or point DuckDB directly at it right now and your IT department can't really stop you from doing that
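That point-directly-at-it workflow, sketched with DuckDB's Python API; the bucket and path are hypothetical, and remote access assumes the httpfs extension plus whatever S3 credentials are already configured:

    import duckdb
    con = duckdb.connect()
    con.execute("INSTALL httpfs")   # enables http(s):// and s3:// paths
    con.execute("LOAD httpfs")
    con.sql("SELECT count(*) FROM read_parquet('s3://some-bucket/exports/*.parquet')").show()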
45:50.120 --> 45:55.120
right but if the pendulum swings back further and it's kind of what we're seeing now with these catalogs
45:55.120 --> 46:02.120
where you do need credentials and they hand out you know they deal with all that stuff it might not be so easy in the future
46:02.120 --> 46:11.120
and so for us this is actually a potential threat that the wild west disappears
46:11.120 --> 46:18.120
even more and that yeah we would just lose access to this stuff because in the end if you need to pay Snowflake anyway
46:18.120 --> 46:25.120
I'm sorry I mentioned a vendor if you have to pay a cloud data warehouse anyway to get access to your data
46:25.120 --> 46:32.120
then there's no point for you to using things like DuckDB or things like you know Polars or things like ClickHouse
46:32.120 --> 46:39.120
or it really doesn't matter like one of these more like let's say guerrilla data tools because you you're paying them anyway
46:39.120 --> 46:47.120
you might as well use their stupid compute right I mean valuable computer it's uh it's uh it is it is interesting to see
46:47.120 --> 46:54.120
I'm I'm really curious what what will happen there I yeah I think we'll see we'll see that um that that sort of clone
46:54.120 --> 47:01.120
like the Hadoop of the Hadoop of cloud data warehouse hopefully not in Java though I don't know
47:01.120 --> 47:08.120
it's all coming back it's all coming back we're gonna have distributed real-time Java again yes yes
47:08.120 --> 47:15.120
anyways you mentioned local first yeah that's an interesting movement right now I
47:15.120 --> 47:20.120
I know Martin Kleppmann he's working on some really cool stuff with decentralized protocols
47:20.120 --> 47:33.120
do you envision like a decentralized version of DuckDB it's interesting we did have a
47:33.120 --> 47:39.120
uh we do have a current research project running on this actually I did get a grant uh for a research project
47:39.120 --> 47:45.120
with responsible decentralized data architectures I think is the term is the name um that is doing
47:45.120 --> 47:51.120
that is imagining this idea of there is going to be a fleet of of DuckDBs running they are gonna be
47:51.120 --> 47:59.120
you know under users control but we are still the idea is that you can still uh for example run aggregations
47:59.120 --> 48:05.120
over the whole fleet of systems with partial results being shipped back up um that's interesting
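A rough sketch of that "partial results shipped back up" idea, assuming each node in the fleet has its own DuckDB file and a coordinator merges partial aggregates; the file paths, table, and column are hypothetical:

```python
import duckdb

def node_partial(db_path: str):
    # each node computes a partial aggregate over its own local data
    con = duckdb.connect(db_path, read_only=True)
    return con.execute(
        "SELECT sum(amount) AS s, count(*) AS n FROM sales"  # hypothetical table/column
    ).fetchone()

# the coordinator only ever sees the partials, never the raw rows
partials = [node_partial(p) for p in ("node_a.duckdb", "node_b.duckdb")]
total = sum(s for s, _ in partials)
rows = sum(n for _, n in partials)
print("fleet-wide average:", total / rows)
```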
48:05.120 --> 48:12.120
um I haven't seen like the research project is mainly there because you know you need to build some
48:12.120 --> 48:15.120
abstractions for this to be something that it's not just a one-off that somebody hacks because you can
48:15.120 --> 48:20.120
totally build that today right like nothing keeps you from building that today um you can ship
48:20.120 --> 48:25.120
intermediates around there are some there are some organizations that build have built pretty wild
48:25.120 --> 48:31.120
things around parquet files that are being sent around or arrow buffers or anything like that but
48:31.120 --> 48:38.320
I think we need some abstractions there to make this like nice and and efficient um I think that
48:38.320 --> 48:43.040
MotherDuck is doing some interesting stuff so so for those who don't know MotherDuck is a company that's
48:43.840 --> 48:50.560
building a DuckDB as a service and they are their execution model is this whole uh hybrid execution where
48:51.200 --> 48:55.040
you have a DuckDB local you have a DuckDB on the server they talk to each other the query
48:55.040 --> 48:59.920
gets split up and run partially there or there depending on you know where it makes more sense
48:59.920 --> 49:05.520
depending on optimization and I think that's super interesting I didn't actually see that one coming
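As a rough sketch of how that hybrid setup is exposed to a user (the "md:" connection string follows MotherDuck's public documentation as far as I know; treat the names and configuration details as assumptions rather than a recipe):

```python
import duckdb

# Connecting a local DuckDB process to MotherDuck; authentication is expected to
# come from a MotherDuck token (e.g. the MOTHERDUCK_TOKEN environment variable).
con = duckdb.connect("md:my_db")   # "my_db" is a hypothetical database name

# A query mixing a table that lives in the service with a local file may be
# split by the planner, with parts executed locally and parts remotely.
con.execute("""
    SELECT count(*)
    FROM cloud_events e                       -- hypothetical remote table
    JOIN 'local_lookup.parquet' l USING (id)  -- hypothetical local file
""")
```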
49:05.520 --> 49:11.200
I thought okay they're just going to do DuckDB as a service done uh but no they actually have been
49:11.200 --> 49:16.720
innovating in that in that space as well which I think is really cool yeah I remember when Jordan uh
49:17.840 --> 49:22.560
first mentioned this around uh I think it's right before Duck or MotherDuck was announced and I was just
49:22.560 --> 49:27.440
kicking myself because I asked can I put some money in that that's awesome he's like or oversubscribed
49:27.440 --> 49:34.400
like damn it yeah well thank you thank you for the for the for the trust so um yeah it's I think
49:34.400 --> 49:40.800
it's really exciting to see what what like what at the moment I think our from DuckDB labs from DuckDB
49:40.800 --> 49:45.760
the project perspective we are very much sort of focused on a single node yeah because that's still
49:45.760 --> 49:52.080
of the the space we inhabit right that's like we do the best damn job we can do on that in that single
49:52.560 --> 50:00.560
environment and uh and then other I think it's up to other people to you know to build to build sort
50:00.560 --> 50:06.560
of crazy combinations of this we don't we don't have the resources on our other team really to
50:06.560 --> 50:15.360
do a lot we have a small team we're 20 people uh wonderful team uh of of you know database hackers uh
50:15.360 --> 50:19.600
but we can only really with that that size of team right we can only really focus
50:20.320 --> 50:26.720
on one thing and it's gonna be like single node execution and what's your philosophy though in
50:26.720 --> 50:33.040
terms of is the pendulum swinging back to single node are you saying or what yeah I think
50:33.040 --> 50:40.000
that's an excellent question I think that the the the distributed things I think that uh there's a
50:40.000 --> 50:47.600
wonderful uh talk at ICDE it's an academic database conference um I think it's the International
50:47.600 --> 50:54.480
Conference on Data Engineering you should go and speak I should go and speak um uh where he basically says
50:54.480 --> 51:00.640
that database researchers have been solving the whales problems for the last 20 years uh basically
51:00.640 --> 51:06.080
solving the problems that Google has some of the problems that uh Google what else
51:07.280 --> 51:13.760
netflix yeah the big ones right the ones with the blogs the ones with the blogs the ones the big
51:13.760 --> 51:18.800
tech blogs yeah but here's the problem so what I always see in this uh is you know the big tech
51:18.800 --> 51:24.800
companies will publish the blogs here's what we're doing um iceberg for example right uh built at netflix
51:24.800 --> 51:32.000
because it's all the netflix-like problem um and they build a lot of stuff there uh if you're a smaller
51:32.000 --> 51:36.880
company like say you know DuckDB is uh I don't know you suddenly sell furniture or
51:36.880 --> 51:42.960
something like that you just big pivot or ducks whatever boats boats yeah boats okay so then you
51:42.960 --> 51:46.880
have your data warehouse right does it make any sense for you to do something you know like that
51:46.880 --> 51:51.440
or do you so it does not make sense yeah but I don't want to say the blogs that's I mean everyone
51:51.440 --> 51:57.120
looks at the blogs like oh we got to do that at our company too right yeah no this is exactly right
51:57.120 --> 52:01.600
uh there was this one thing where Uber wrote about how they ditched Postgres for MySQL
52:01.600 --> 52:06.560
yeah or maybe the other way around I think it was the other way I don't remember which yeah who cares the
52:06.560 --> 52:13.280
point is it made huge it made huge splashes in the data engineering community and it didn't matter
52:13.280 --> 52:19.280
for everyone like it didn't like it matters it matters to uber sure right um but i think this point
52:19.280 --> 52:23.920
about people solving the whales problems is really interesting because it because we have been neglecting
52:23.920 --> 52:30.400
sort of the 99% of people's data problems because we cannot tell you know like your
52:30.400 --> 52:38.000
your boat uh you know boat sales company to go install spark like there's no point ever for
52:38.000 --> 52:42.720
them to run spark to deal with their customer data yet we have been telling them that for the last 10
52:42.720 --> 52:49.360
years right oh you want to do you want to do data stuff go install spark um like that that that's not
52:49.360 --> 52:54.400
a great well otherwise you're not a real data company yeah i'm saying that jokingly but it's just uh yeah
52:54.400 --> 53:02.000
um but i think i think it's this is also where we see our role especially since we started as taxpayer
53:02.000 --> 53:07.360
funded uh research as a taxpayer funded research project is like we need to solve the the the real
53:07.360 --> 53:12.880
data problems out there the problem of the 99% which is you know the people that run out of steam
53:12.880 --> 53:17.520
with excel right that that is the that it it does not make sense to solve google's problem they have they
53:17.520 --> 53:23.360
have clever people they can afford to pay the clever people the clever people you know are capable of building a
53:23.360 --> 53:29.600
solution that works for google and no one else and i don't care yeah it's it's like uh it's really
53:29.600 --> 53:35.840
a different it's really a different it's a different ball game i think and and uh yeah and i think it's
53:35.840 --> 53:44.000
super interesting if you look at these uh studies that came out from uh redshift and from uh who else
53:44.560 --> 53:50.240
redshift and the other one the snowflake had a private one but yeah but i don't think i think they
53:50.240 --> 53:55.920
didn't release the full benchmarks i'm not yeah i know redshift the redshift one is well known the
53:56.480 --> 54:01.440
um to see what you know data sizes actually look like in the real world and it's i mean jordan has
54:01.440 --> 54:06.640
been talking about this uh as well how he what he saw inside bigquery and these are all data sizes that
54:06.640 --> 54:12.240
are completely manageable um you could argue about the point whether you need disaggregated storage or not
54:12.240 --> 54:20.160
right that's an interesting point um jordan says yes i say probably not um but uh the uh
54:20.160 --> 54:25.840
uh because yeah you know you're you're writing 10 years of data maybe you want to have disaggregated
54:25.840 --> 54:30.720
storage so you don't run out of disk space but again disks are gigantic you can get a you know 20
54:30.720 --> 54:37.920
terabyte ssd no problem it's uh it's it's pretty wild um so i think that that fundamentally um yeah
54:37.920 --> 54:45.520
our our tech is is getting better faster in at a quicker rate than our data sets are getting bigger and
54:45.520 --> 54:50.400
more challenging and therefore we're gonna we're gonna see um actually much more of the sort of
54:50.400 --> 54:56.160
single user workload number one um moving moving to to single node and you know your laptop you have
54:56.160 --> 55:02.400
your laptop hey let's make it work well the laptops are insane these days insane you have six gigabytes
55:02.400 --> 55:08.720
per second I/O speed on a MacBook right like that's I don't know I don't know what this I don't even
55:08.720 --> 55:13.360
know actually and we you know maybe managed to get that busy with DuckDB that's not the
55:13.360 --> 55:19.840
point the but it's it's uh it's it's unheard of so one thing you were talking about we're when we're
55:19.840 --> 55:28.240
having uh lunch um I think somebody asked about memory uh versus DuckDB um maybe can you walk people
55:28.240 --> 55:33.200
through sort of how to think about that yeah yeah people think uh there's a misconception that people
55:33.200 --> 55:39.760
think that DuckDB is an in-memory database it is not uh it's not an in-memory database uh we can use
55:39.760 --> 55:46.720
memory we like memory it's nice it's fast uh but we're not limited by it so again with Laurens
55:46.720 --> 55:52.480
my PhD student he's been working on the DuckDB larger-than-memory capabilities so we've always had
55:52.480 --> 55:57.760
this thing that your input data could be bigger than memory because we read it in a sort of a streaming
55:57.760 --> 56:02.720
way like if you point DuckDB to a parquet file it will not first copy the parquet file into RAM
56:03.200 --> 56:09.440
and then do things with it that would be dumb it instead it it says okay we'll start in the first
56:09.440 --> 56:16.720
row group second row group and so on so forth um however there's operators in relational sort of
56:17.920 --> 56:25.120
analysis like block so-called blocking operators like join sort aggregate some window functions
56:26.160 --> 56:32.320
top n that may have to materialize actually a large of large amount of the input in itself so if you're
56:32.320 --> 56:36.720
loading a terabyte of data we might read this in a streaming fashion but if you aggregate on
56:36.720 --> 56:42.800
the unique key we will have to put it into our hash table for the aggregation doesn't the
56:42.800 --> 56:50.160
semantics of SQL just dictate this um and it's pretty common in analytical systems to actually fail or
56:50.160 --> 56:55.200
do something very slow in this case we have some papers that show that but in DuckDB we actually
56:55.200 --> 57:01.760
build something that's called graceful degradation when we reach the the memory limit uh which means that
57:01.760 --> 57:06.880
we start using the disc and that's only really possible because we have these crazy ssds now
57:06.880 --> 57:10.720
because we can actually offload to disc at six gigabyte per second we can read it back at six
57:10.720 --> 57:15.280
gigabytes per second so and we don't get slowed down by multiple threads doing the same thing
57:15.280 --> 57:19.680
too much so it's really improved quite a lot compared to the spinning rust kind of thing right
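A small sketch of what that graceful degradation looks like from the user's side, assuming a hypothetical pile of Parquet files that is larger than the configured memory limit; the limit, paths, and column names are illustrative only:

```python
import duckdb

con = duckdb.connect("analytics.duckdb")                   # persistent database file
con.execute("SET memory_limit = '2GB';")                   # cap RAM usage
con.execute("SET temp_directory = '/tmp/duckdb_spill';")   # where offloaded intermediates go

# A grouped aggregation whose hash table would not fit in 2GB: instead of failing,
# DuckDB can offload part of the intermediate state to disk and keep going.
con.execute("""
    CREATE OR REPLACE TABLE daily AS
    SELECT user_id, date_trunc('day', ts) AS day, count(*) AS events   -- hypothetical columns
    FROM read_parquet('events/*.parquet')
    GROUP BY ALL
""")
```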
57:20.160 --> 57:26.160
um and then so we can essentially for things like aggregations we can offload part of the
57:26.160 --> 57:31.040
intermediate result to disc and then basically resume and you will actually not feel a performance
57:31.040 --> 57:36.480
cliff there because it will just you know it will still use as much memory as you allow us to use
57:36.480 --> 57:41.600
but it will just also be able to use the disc this gets even more interesting if you have a join
57:41.600 --> 57:48.080
because you can have multiple join uh joins in the same sort of query right and then now we have
57:48.080 --> 57:52.080
multiple operators running at the same time fighting for memory at the same time and now you have to think
57:52.080 --> 57:57.920
about uh essentially a fair allocation strategy between those operators and again we have a paper
57:57.920 --> 58:03.920
that Laurens wrote that's uh currently under revision at VLDB which is a large database conference
58:04.640 --> 58:10.400
that describes how we've implemented um the strategy to deal with this in DuckDB and the result is
58:11.200 --> 58:18.240
that you can set a fairly low memory limit in DuckDB and run complex queries that everybody else
58:18.240 --> 58:23.760
essentially blows up on on a single node again with your disk and still finish those queries in a
58:23.760 --> 58:30.000
reasonable time we we call it the galaxy quest principle have you seen galaxy quest no it's this
58:30.000 --> 58:36.800
movie uh it's just like lampooning star trek it's very funny it's tim tim burton and sigourney weaver i
58:36.800 --> 58:44.560
think um and uh in galaxy quest it's this fake tv show that's like like star trek uh and there you have
58:44.560 --> 58:49.680
this catchphrase which is never give up never surrender right and so we took this galaxy quest
58:49.680 --> 58:55.840
principle to uh to the query processing which means that we we just never want to give up we never
58:55.840 --> 59:00.800
want to surrender i will always be able to finish the query provided there is you know enough disk space
59:01.600 --> 59:07.200
uh to to do it you know we cannot we cannot control what users are doing if you if you cross product
59:07.200 --> 59:15.440
a giant parquet file it will be the end we will abort uh probably but uh if it's in any way reasonably
59:15.440 --> 59:21.040
we want to finish and I think that's something that is unheard of in like research prototype systems it's also
59:21.040 --> 59:26.560
pretty unheard of in in in other database systems like you know if you have a cloud database if you
59:26.560 --> 59:32.480
have distributed query processing you have to do these shuffles there are just gonna be no win scenarios
59:32.480 --> 59:36.800
where a whole partition has to fit on one worker and if it doesn't it's just game over
59:36.800 --> 59:42.320
like Spark has this problem a lot uh but I think on a single node we can do pretty cool things there
59:42.320 --> 59:48.240
and make sure that we finish that's interesting um how does ACID work in that environment where
59:48.240 --> 59:55.920
which environment well in the situation where uh um you have a small amount of memory yeah uh how do
59:55.920 --> 01:00:01.120
you how do you make sure that the transactions are committed or rolled back right uh so this is uh so
01:00:01.120 --> 01:00:05.520
this is independent of query processing right so when a query starts running we'll you know we'll run it
01:00:05.520 --> 01:00:13.760
well MVCC the multi-version concurrency control will read a um a specific version of the
01:00:13.760 --> 01:00:19.840
table okay and that's going to be consistent during the runtime of the query so we'll read this specific
01:00:19.840 --> 01:00:24.320
version so that has nothing to do with memory it's just we run this in this version what is more
01:00:24.320 --> 01:00:32.720
interesting is uh how do we deal with changes that are made within a transaction that are bigger than ram
01:00:32.720 --> 01:00:36.400
right that's kind of where it's getting to exactly exactly yeah and then we're also able to offload
01:00:36.400 --> 01:00:43.440
this so the write-ahead log um uh where basically changes go on transaction commit they um
01:00:43.440 --> 01:00:50.400
that uh will of course be written when at the end of the transaction but uh we do speculatively write
01:00:50.400 --> 01:00:55.600
big changes already to the database file okay so say you're running a big one this is because we had
01:00:55.600 --> 01:01:01.680
this problem this actual problem let's say you have a table you're inserting a terabyte csv file
01:01:02.240 --> 01:01:07.520
in a transaction and then you're committing okay traditionally what would happen is you're
01:01:07.520 --> 01:01:13.360
writing that terabyte file in your write-ahead log because you need to be transactionally safe then
01:01:13.360 --> 01:01:17.680
a checkpoint you read it again and you actually write it to your persistent table at which point you
01:01:17.680 --> 01:01:22.720
truncate your write-ahead log now you've basically written this terabyte file twice to disk and read it
01:01:22.720 --> 01:01:27.520
twice because you read it once to read it in and then you wrote it to the WAL and then you read it
01:01:27.520 --> 01:01:32.080
from the WAL again and you wrote it to the main persistent database that's not great right that's
01:01:32.080 --> 01:01:37.840
both read and write amplification so in DuckDB what Mark has built is a speculative writing
01:01:37.840 --> 01:01:44.080
of large changes so we will already write into the database file and then at transaction commit
01:01:44.080 --> 01:01:51.760
into the WAL we only write references to those blocks and only uh at checkpoint we will
01:01:51.760 --> 01:01:58.640
essentially mark these blocks as being used um yeah so in the main sort of header so that means
01:01:58.640 --> 01:02:04.720
that basically in the default case where it works out we will write it once to the database file and
01:02:04.720 --> 01:02:10.400
that's it we're done and then some metadata has to be updated later on um and in the failure case
01:02:10.400 --> 01:02:17.360
where we abort we'll have written some blocks to the database file that we then mark as
01:02:17.360 --> 01:02:24.160
empty so nothing so the worst thing that happened is that we made our file a terabyte bigger but that's
01:02:24.160 --> 01:02:28.800
uh space that we will be able to reuse later so we have thought about
01:02:28.800 --> 01:02:34.160
this and this is exactly what what i meant like we've thought really how do we bring these transactional
01:02:34.160 --> 01:02:42.240
semantics to analytical use cases without making the without like making it slow right because yeah you
01:02:42.240 --> 01:02:47.760
assume that this write is going to work out well so the default case is going to be fast the
01:02:47.760 --> 01:02:53.840
you know the worst case when it aborts is going to be we we use some disk space oh no it's a good
01:02:53.840 --> 01:02:58.960
trade-off I think from our perspective right yeah so these are the sort of things that we
01:02:58.960 --> 01:03:05.360
do in DuckDB to make this work that's interesting yeah it's a fun thing to do it's uh nobody
01:03:05.360 --> 01:03:11.520
has been there before right it's really like nobody has tried to write a terabyte to a WAL like
01:03:11.520 --> 01:03:17.200
that's uh or not to write a terabyte to the WAL it's uh so it's really funny how the other guys do
01:03:17.200 --> 01:03:24.480
this uh so most systems out there again not gonna name names have a sort of bypass of the transactional
01:03:25.280 --> 01:03:32.800
uh sort of system for bulk loads so you have a special tool then you're basically bypassing the
01:03:32.800 --> 01:03:37.840
whole transactionality just for bulk loads because that that doesn't work and then yeah that's that's
01:03:37.840 --> 01:03:41.920
of course not what you want we for us reading a big file is a common operation so it needs to be
01:03:41.920 --> 01:03:47.280
transactional are you saying like a copy command yeah it is like a copy command but then um there's
01:03:47.280 --> 01:03:52.880
this like usually there's a separate command line tool that that uh that this just bypasses the
01:03:52.880 --> 01:03:59.360
transactional log yeah it's not great but if because if that goes wrong then you're you're back to
01:03:59.360 --> 01:04:03.600
you know not the things that's not to be done anyways it's a fun thing are you seeing many
01:04:03.600 --> 01:04:08.880
streaming use cases with DuckDB um well we're not a streaming system in that sense so maybe by
01:04:08.880 --> 01:04:14.000
definition we don't see a lot of demand for it i think there's really amazing streaming systems
01:04:14.000 --> 01:04:17.440
out there like Materialize I don't know if you've talked to them already yeah Materialize yeah
01:04:18.720 --> 01:04:24.000
so there i think i i think they have really amazing tech and it would be really weird for us to try to
01:04:24.000 --> 01:04:30.000
make a knockoff like there's another unnamed database company that or self self declared
01:04:30.000 --> 01:04:35.040
database company that's uh taking their bulk system and slapping on some sort of lightweight
01:04:35.040 --> 01:04:39.920
fake stream thing uh shouldn't be we don't want to do that right like that's just embarrassing
01:04:40.800 --> 01:04:47.200
uh and uh so I think the real way to do streaming is like what Materialize is doing like
01:04:47.200 --> 01:04:55.040
they have Frank's magic that makes this really work um so I don't think we're gonna
01:04:55.040 --> 01:05:01.840
go there in in the near future what we might end up doing though is maybe that we um have materialized
01:05:01.840 --> 01:05:09.200
views that this you know can incrementally update but that's not really streaming right that's more like
01:05:10.160 --> 01:05:16.640
updated materialized views I think streaming itself um yeah if the
01:05:16.640 --> 01:05:21.920
user wants to we have users that use DuckDB for streaming use cases where they just you know
01:05:21.920 --> 01:05:26.240
they do an insert and they do a delete at the same time and then they rerun the query
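Roughly what that do-it-yourself pattern looks like (table, files, and window below are hypothetical); it works, but every pass recomputes the result from scratch, which is why it is not real streaming:

```python
import duckdb
import time

con = duckdb.connect("metrics.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS events(ts TIMESTAMP, value DOUBLE)")

while True:
    # append whatever arrived since the last pass (in reality you would track
    # which files were already loaded; this sketch glosses over that)
    con.execute("INSERT INTO events SELECT * FROM read_parquet('incoming/*.parquet')")
    # delete rows that have fallen out of the window
    con.execute("DELETE FROM events WHERE ts < now() - INTERVAL 1 HOUR")
    # ...and rerun the full query every time
    print(con.execute("SELECT count(*), avg(value) FROM events").fetchone())
    time.sleep(60)
```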
01:05:27.840 --> 01:05:33.040
that works it's it's it's it's not gonna it's not gonna work well it's really not why it seems kind
01:05:33.040 --> 01:05:43.120
of expensive i think it's also what the unnamed vendor is doing so it's like yeah man you see you
01:05:43.120 --> 01:05:50.640
see if you're long enough in data engines you see things right i'm sure oh my god it's uh it's like
01:05:50.640 --> 01:05:57.280
this fsync thing it's really amazing like I have to say so in my class at
01:05:57.280 --> 01:06:05.600
some point years ago i had a student saying but the key value store with the socks uh manages like so
01:06:05.600 --> 01:06:11.040
the socks he's referring to it starts with an m and there's a database i'll let you figure out the rest
01:06:11.040 --> 01:06:16.800
um i don't you know i don't know if i'm allowed to say it or not but uh but uh there's this there
01:06:16.800 --> 01:06:21.280
was this thing where somebody said yeah but but uh this system can do a million transactions i don't
01:06:21.280 --> 01:06:27.280
i forgot the actual numbers a million transactions per second and postgres being old and stupid uh can
01:06:27.280 --> 01:06:30.800
only do a hundred thousand transactions per second that's what they're saying that was that was the
01:06:30.800 --> 01:06:37.200
blog post with the pen and everything and so i looked at this and i thought bullshit um and i looked
01:06:37.200 --> 01:06:42.240
at it and in the lecture and like in the next lecture i i brought uh my laptop and i was like
01:06:42.240 --> 01:06:46.480
right i figured out we'll run an experiment right here right now i'll show you once and for all
01:06:46.480 --> 01:06:51.760
because it turns out if you enable the proper logging in the key value store that was bragging
01:06:51.760 --> 01:06:58.160
about performance lo and behold it went to 100,000 transactions per second yeah and Postgres has a
01:06:58.160 --> 01:07:03.520
flag to disable them and if you do that lo and behold it went to a million transactions per second
01:07:03.520 --> 01:07:09.280
so the bragging result from blog post turned around 100 and if you had actually configured this fairly
01:07:10.400 --> 01:07:15.520
either both of them do it or none of them do it it would have you know they would have been exact same
01:07:15.520 --> 01:07:21.440
result interesting um and i think it's so funny how how defaults sometimes matter like again with the
01:07:21.440 --> 01:07:26.720
example i mentioned with the you know today with the with the the routers bricking themselves because
01:07:26.720 --> 01:07:32.640
again the same system wasn't using fsync um that is they probably wanted to have this but the default
01:07:32.640 --> 01:07:38.080
was that it was off so they erred on the wrong side if you want right Postgres
01:07:38.080 --> 01:07:42.880
will always err on the side of caution and I think it's the correct thing if you want to optimize
01:07:42.880 --> 01:07:48.240
the hell out of it uh then you can by setting the flags it's just that you need to actually look at
01:07:48.240 --> 01:07:54.720
it and not just write a blog post mm-hmm yeah somebody wrote on Bluesky today about um
01:07:54.720 --> 01:08:02.720
uh i can't remember who it was but uh how not to do database benchmarking 101 and i guess it was um
01:08:02.720 --> 01:08:09.040
just about certain metrics of performance but didn't take into account uh uh indexes and rebalancing there
01:08:09.040 --> 01:08:13.680
that's a classic yeah yeah yeah we wrote a paper about this what's that we actually wrote a paper about
01:08:13.680 --> 01:08:20.640
this oh yeah uh it's one of our most cited papers it's about database benchmarking I forgot the
01:08:20.640 --> 01:08:29.760
actual title of it um i it is it is meant as a how-to but we actually if you're curious uh there's a
01:08:29.760 --> 01:08:35.360
there's a paper on this that go check it out from from us that that lists uh lists the most common
01:08:35.360 --> 01:08:42.400
benchmarking crimes oh yeah which is uh but absolutely like pre-processing time is definitely one of them
01:08:42.400 --> 01:08:49.040
it's a easy way of winning a benchmark is to to pre-process the hell out of it and uh and then
01:08:49.040 --> 01:08:54.160
depending on the benchmarks and also like there is also a lot of lawyering going on you know you
01:08:54.160 --> 01:08:59.360
know TPC-H right you've heard of it right it's terrible it's pretty old and busted people should
01:08:59.360 --> 01:09:05.520
be switching to TPC-DS much better benchmark um but even that uh there's real benchmark lawyers
01:09:05.520 --> 01:09:10.880
out there right that look through the spec and find like the the thing they forgot and they will
01:09:10.880 --> 01:09:16.480
exploit that thing and then they will win at the benchmark it's it's kind of weird well that's
01:09:16.480 --> 01:09:21.280
funny and then some databases don't let you benchmark because of the DeWitt clause the DeWitt
01:09:21.280 --> 01:09:27.840
clause yeah yeah like uh I think that's the famous one so then you see DBMS X and DBMS Y
01:09:27.840 --> 01:09:31.440
and all that stuff in papers which i don't think is you know helping science a whole lot
01:09:32.240 --> 01:09:39.040
but what's that I mean uh because of the DeWitt clause oh okay people can't say I ran this on
01:09:39.040 --> 01:09:47.360
Oracle because uh yeah people fear Oracle's yeah you know legal department probably for a reason yeah
01:09:47.360 --> 01:09:53.200
um and then instead of saying the result with Oracle was this they will say the result with
01:09:53.200 --> 01:10:01.120
DBMS X was this and then it's up to the reader to guess what that was I don't know I don't
01:10:01.120 --> 01:10:06.800
understand that but then leave them out you know who cares yeah but uh yeah benchmarking is a is an
01:10:06.800 --> 01:10:12.320
interesting thing so we because it's so hard to make fair benchmarks we actually don't publish any
01:10:12.320 --> 01:10:18.080
benchmarks of duckdb against something else on like our website so we don't have anything there we we
01:10:19.200 --> 01:10:24.080
we call this a home game obviously it's much more interesting to win an away game yeah so and that's
01:10:24.080 --> 01:10:29.120
when somebody else runs the benchmark uh so that's that's the ones that uh that i think are more
01:10:29.120 --> 01:10:34.480
interesting to us because they hadn't you know they probably haven't spent a whole lot of time
01:10:34.480 --> 01:10:41.920
optimizing uh which is you know hard to do hard to not to do if it's your system uh but uh yeah so
01:10:41.920 --> 01:10:46.720
we're not doing anything of that because yeah it's also like raw performance is overrated
01:10:48.000 --> 01:10:53.280
I think we've talked about this earlier about the ease of use it's not that DuckDB is not fast I think
01:10:53.280 --> 01:11:01.840
it's a very competitive uh query engine uh but it's not the be-all and end-all of data
01:11:01.840 --> 01:11:07.440
things you need to solve a problem you're not solving the problem by being two percent faster
01:11:07.440 --> 01:11:11.360
and then the user not being able to install it because you're using some Intel intrinsic that they don't
01:11:11.360 --> 01:11:18.320
have yeah so so it's probably worth more to not have the two percent extra performance
01:11:18.320 --> 01:11:22.480
but knowing that this will run on every intel processor made in the last 20 years
01:11:22.480 --> 01:11:32.560
just saying that's interesting yeah how did you guys become so cool i don't know i'm not cool i i i
01:11:32.560 --> 01:11:37.200
think i'm database cool like that's cool that database cool is something very different right
01:11:37.200 --> 01:11:41.680
cool i will never achieve cool i know this but you're not like taylor swift exactly but you're kind
01:11:41.680 --> 01:11:46.480
of the taylor swift of the database world right so that's the but i have to admit the bar is much
01:11:46.480 --> 01:11:53.520
much much much lower right i i mean i i really like databases that's right and i have shaved today
01:11:54.560 --> 01:12:01.360
so i'm clearly not one of them but uh no i i know this is a this is like the people are like the
01:12:01.360 --> 01:12:07.440
people who obsess about tables and query engines are a strange subgroup of people I will admit we
01:12:07.440 --> 01:12:12.000
managed to be cool in that subgroup of people yeah and that's okay that's that's that's all we can
01:12:12.000 --> 01:12:18.560
really ever hope for it's not going to be coolness uh as defined by others like whatever and what do
01:12:18.560 --> 01:12:24.880
you think it was though because i mean it's you know you enter the database uh field it's it's it
01:12:24.880 --> 01:12:32.720
tends to be pretty dry it tends to be pretty you know whatever you want to call it right but um but
01:12:32.720 --> 01:12:39.040
you guys have a cult following we have um i mean i can maybe say what originally drew me to the field was
01:12:39.040 --> 01:12:48.480
the low amount of that i initially perceived um so in in data systems uh or in systems research in general
01:12:49.440 --> 01:12:58.480
like there is a definition of right and wrong of course after 25 years in the field i have learned
01:12:58.480 --> 01:13:05.600
that it's not that clear cut uh but at least there was the definition and i mean when i was at university
01:13:05.600 --> 01:13:12.160
there were people doing things like human-computer interaction with like visualization or like
01:13:12.160 --> 01:13:18.800
things that are quite hard to actually evaluate in a scientific way but you cannot it's you have
01:13:18.800 --> 01:13:24.240
to run like a user study you're you know asking 200 people what they think which which which plot
01:13:24.240 --> 01:13:30.080
they find prettier it's not very hard science i mean it is science of course i don't want to disparage
01:13:30.080 --> 01:13:36.400
but it's not something that i want to do uh i i like this idea of there's being like a there's a
01:13:36.400 --> 01:13:42.480
test you run the test you see who is who's better at least on this specific test under these specific
01:13:42.480 --> 01:13:50.640
circumstances so so i think that um yeah that's what i really liked about databases uh why the cult following
01:13:50.640 --> 01:13:58.160
i think i think making the user experience great is something that has not occurred to anyone in databases yet
01:13:58.160 --> 01:14:03.680
i think that i think we may be the first people to actually care about user experience
01:14:04.320 --> 01:14:10.560
um it's not entirely fair because there's other vendors that but certainly like open source systems
01:14:12.080 --> 01:14:16.880
I don't know there's not a lot there there's people in the NoSQL world
01:14:16.880 --> 01:14:23.040
that have gotten it right uh which is where we took some inspiration from and SQLite who got it
01:14:23.040 --> 01:14:26.640
right yeah they they are the world's most widely used database system for a reason
01:14:27.920 --> 01:14:34.160
while being still extremely weird uh in other senses but uh but i think that if there is a cult
01:14:34.160 --> 01:14:41.840
following i'm you know i'm happy to hear it but um uh then i think it is because we really put the
01:14:41.840 --> 01:14:48.320
user first and not the and not our orthodoxy it's it's this needs to this needs to be nice to this
01:14:48.320 --> 01:14:53.760
needs to be easy to use it needs to be nice to use it needs to make you you know solve a problem and
01:14:53.760 --> 01:15:00.400
not not not satisfy our academic curiosity or our need to be the fastest because in the end right
01:15:00.400 --> 01:15:05.440
like let's say we win on a benchmark say we run this big huge benchmark we optimize we spend a year
01:15:05.440 --> 01:15:11.280
optimizing for it and we go out with big fanfare we've you know like other database vendors have
01:15:11.280 --> 01:15:18.160
recently and say we beat TPC-DS scale factor so-and-so and you know screw these others who are slower
01:15:19.040 --> 01:15:25.120
is it gonna make the life of a single one of our users better i don't think so if we spend the
01:15:25.120 --> 01:15:29.040
same amount of time on like the csv reader is that gonna make the life of users better absolutely
01:15:29.920 --> 01:15:34.400
so it's clear where we have to go right we we're not gonna we're not gonna run a giant
01:15:34.400 --> 01:15:41.360
benchmarking campaign it makes sense yeah i think that i think that's i would hope that's why people
01:15:41.360 --> 01:15:50.160
like DuckDB not us DuckDB closing out like what are you excited about over the next year or two
01:15:50.160 --> 01:15:59.120
yeah um well uh i really like this uh extension ecosystem that we started building yeah so for
01:15:59.120 --> 01:16:04.480
those who don't know um we basically made a pip for DuckDB that's kind of built into the system you
01:16:04.480 --> 01:16:09.920
can just install extensions that can add features like they can add new file formats they can add new
01:16:09.920 --> 01:16:18.320
functions and soon it's going to be able to add new syntax um and I think that's really
01:16:18.320 --> 01:16:27.120
where i think the project needs to go is become this runtime fabric i don't know host for uh the
01:16:27.120 --> 01:16:34.560
Cambrian explosion of creativity from people that you know care about one specific
01:16:34.560 --> 01:16:39.040
subset of it like we have people in geospatial that are doing cool things there yeah there's people in
01:16:39.040 --> 01:16:42.880
like compatibility with other systems that are doing cool things there there's people doing
01:16:43.440 --> 01:16:47.760
file formats you know all sorts of crazy things somebody I think made an XML reader
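For those who haven't tried it, the built-in "pip for DuckDB" he's describing looks roughly like this; spatial is one real example of such an extension, and the coordinates are just an illustration:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial;")   # fetch the extension once
con.execute("LOAD spatial;")      # load it into this session

# The extension contributes new functions (and file formats) to plain SQL, e.g.:
print(con.execute("SELECT ST_AsText(ST_Point(4.89, 52.37))").fetchone())
```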
01:16:48.640 --> 01:16:53.920
great right like the world has XML files in it regrettably it'll almost swing back at some
01:16:53.920 --> 01:17:02.720
point um so that's i think what i'm really excited about obviously uh to see where this can go and
01:17:03.600 --> 01:17:08.880
i think enabling people building infrastructure is always going to be uh the winning the winning
01:17:08.880 --> 01:17:14.240
sort of thing to do and if that is not nothing to do with like any sort of you know commercial interest
01:17:14.240 --> 01:17:21.440
it's just that we really care about how what what is the state of the world in terms of how people look
01:17:21.440 --> 01:17:29.680
at data and uh I want people to just be comfortable like wrangling data and not be like terrified of it
01:17:29.680 --> 01:17:36.880
so it's awesome yeah thanks man it's good to chat with you finally and good to finally meet i know
01:17:36.880 --> 01:17:41.520
that we've been in each other's orbit for a bit and uh yeah thanks thanks so much uh joe for the for
01:17:41.520 --> 01:17:50.240
the you know for the nice chat and uh here in Paris awesome all right well uh au revoir or uh
01:17:52.000 --> 01:17:53.360
all right