Skip to content

Instantly share code, notes, and snippets.

@greenlion
Last active August 29, 2022 18:11
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save greenlion/d54d32426117e06ebfe2a768bfb70bde to your computer and use it in GitHub Desktop.
Save greenlion/d54d32426117e06ebfe2a768bfb70bde to your computer and use it in GitHub Desktop.
BTS OnTime data WARP vs InnoDB
util.import_table("/data/ontime/ontime.csv", {"schema": "ontime", "table": "ontime", "dialect": "csv-unix", "skipRows": 0, "showProgress": "true", "threads": 12})
Load WARP table (11 mins)
File '/data/ontime/ontime.csv' (79.68 GB) was imported in 9 min 58.1429 sec at 133.22 MB/s
Total rows affected in ontime.ontime: Records: 157928010 Deleted: 0 Skipped: 0 Warnings: 0
Load InnoDB table (37.5 mins)
File '/data/ontime/ontime.csv' (79.68 GB) was imported in 37 min 28.6070 sec at 35.44 MB/s
Total rows affected in ontime.ontime_innodb: Records: 157928010 Deleted: 0 Skipped: 0 Warnings: 0
MySQL localhost ontime SQL > select Year, count(*) cnt from ontime_innodb group by 1;
+------+---------+
| Year | cnt |
+------+---------+
| 1987 | 1311827 |
| 1992 | 5092166 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
+------+---------+
28 rows in set (8 min 7.4194 sec)
256K rows/sec processed
MySQL localhost ontime SQL > select Year(FlightDate), count(*) cnt from ontime_innodb group by 1;
+------------------+---------+
| Year(FlightDate) | cnt |
+------------------+---------+
| 1987 | 1311827 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092166 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
+------------------+---------+
28 rows in set (3 min 38.2186 sec)
MySQL localhost ontime SQL > select Year, count(*) cnt from ontime group by 1;
+------+---------+
| Year | cnt |
+------+---------+
| 1987 | 1311826 |
| 2014 | 5819811 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 1990 | 5270893 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
+------+---------+
28 rows in set (1 min 9.2067 sec)
> select Year(FlightDate), count(*) cnt from ontime group by 1;
+------------------+---------+
| Year(FlightDate) | cnt |
+------------------+---------+
| 1987 | 1311826 |
| 2014 | 5819811 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 1990 | 5270893 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 1991 | 5076925 |
| 1992 | 5092157 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
+------------------+---------+
28 rows in set (1 min 11.8889 sec)
> select count(*) cnt from ontime_innodb where Year = 1998;
+---------+
| cnt |
+---------+
| 5384721 |
+---------+
1 row in set (8 min 17.9652 sec)
> select count(*) cnt from ontime where Year = 1998;
+---------+
| cnt |
+---------+
| 5384721 |
+---------+
1 row in set (8.1808 sec)
> select count(*) cnt from ontime_innodb where FlightDate between 19980101 and 19981231;
+---------+
| cnt |
+---------+
| 5384721 |
+---------+
1 row in set (8.1397 sec)
> select count(*) cnt from ontime_innodb where FlightDate between 19980101 and 19981231;
+---------+
| cnt |
+---------+
| 5384721 |
+---------+
1 row in set (1.9824 sec)
> select count(*) cnt from ontime where FlightDate between 19980101 and 19981231;
+---------+
| cnt |
+---------+
| 5384721 |
+---------+
1 row in set (9.8641 sec)
> select count(*) cnt from ontime where FlightDate between 19980101 and 19981231;
+---------+
| cnt |
+---------+
| 5384721 |
+---------+
1 row in set (2.1297 sec)
> select count(*) cnt from ontime_innodb where OriginAirportId = 10135 or DestAirportId = 10135;
+--------+
| cnt |
+--------+
| 275922 |
+--------+
1 row in set (8 min 10.1118 sec)
> select count(*) cnt from ontime where OriginAirportId = 10135 or DestAirportId = 10135;
+--------+
| cnt |
+--------+
| 275922 |
+--------+
1 row in set (22.0516 sec)
> select count(*) cnt from ontime where OriginAirportId = 10135 or DestAirportId = 10135;
+--------+
| cnt |
+--------+
| 275922 |
+--------+
1 row in set (1.0812 sec)
<div>
<h1>Using a materializec view to quickly answer multiple queries</h1>
This blog post is a followup to my last blog post comparing WARP and InnoDB. At the end of that post I promised to follow up with how materialized views can help with query performance when btree indexing can't effectively speed up queries enough. One example I gave was using a covering index to speed up queries, but it only speed up the queries by 3x. Materialized views speed up queries 100x or more!
<pre>
mysql> create incremental materialized view ontime_OriginAirport_DestAirport_Year_cnt
as
select OriginAirportId, DestAirportId, Year, count(*) cnt
from ontime_innodb
group by OriginAirportId, DestAirportId, Year;
Query OK, 0 rows affected (9 min 37.53 sec)
</pre>
The first query that I demonstrated in the earlier blog post was:
<pre>
select Year, count(*) cnt from ontime_innodb group by 1;
</pre>
</div>
We can answer this query using our materialized view
<pre>
mysql> select Year, sum(cnt) as `count(*)`
from ontime_OriginAirport_DestAirport_Year_cnt
group by Year order by Year;
+------+----------+
| Year | count(*) |
+------+----------+
| 1987 | 1311827 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092166 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
+------+----------+
28 rows in set (0.05 sec)
</pre>
Compare that to running the query against the base table:
mysql> select Year, count(*) cnt
from ontime_innodb
group by 1 order by 1;
+------+----------+
| Year | count(*) |
+------+----------+
| 1987 | 1311827 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092166 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
+------+----------+
28 rows in set (8 min 7.4194 sec)
</pre>
Using materialized views provides an impressive speedup. Maintaining the materialized view is easy. I deleted a month of data from the base table:
<pre>
mysql> delete from ontime_innodb where FlightDate < '1987-11-01';
Query OK, 448621 rows affected (5.80 sec)
</pre>
The background process that reads from the binary logs takes some time to process them. I waited until the background process caught up (about 4000K rows/sec) and then I refreshed the view:
<pre>mysql> call leapdb.refresh2('ontime','ontime_OriginAirport_DestAirport_Year_cnt');
Query OK, 0 rows affected (1.46 sec)
</pre>
As you can see, it takes very little time to refresh the view once the background process is caught up. If you modify data in batches or nightly, the background process should not be behind often.
<br>
<br>
Now we can do the query by year and see that the number of rows in 1987 has decreased.
<pre>
mysql> select Year, sum(cnt) as `count(*)`
from ontime_OriginAirport_DestAirport_Year_cnt
group by Year order by Year;
+------+----------+
| Year | count(*) |
+------+----------+
| 1987 | 914406 |
| 1988 | 5202096 |
| 1989 | 5041200 |
| 1990 | 5270893 |
| 1991 | 5076925 |
| 1992 | 5092166 |
| 1993 | 5070501 |
| 1994 | 5180048 |
| 1995 | 5327435 |
| 1996 | 5351983 |
| 1997 | 5411843 |
| 1998 | 5384721 |
| 1999 | 5527884 |
| 2000 | 5683047 |
| 2001 | 3422767 |
| 2002 | 4935434 |
| 2003 | 6488540 |
| 2004 | 7129270 |
| 2005 | 7140596 |
| 2006 | 7141922 |
| 2007 | 7455458 |
| 2008 | 7009726 |
| 2009 | 6450285 |
| 2010 | 6450117 |
| 2011 | 6085281 |
| 2012 | 6096762 |
| 2013 | 6369482 |
| 2014 | 5819811 |
+------+----------+
28 rows in set (0.03 sec)
<pre>
The next query I ran in the previous blog post used WARP automatic indexing. The query when run on InnoDB was:
<pre>
mysql> select count(*) cnt
from ontime_innodb
where Year = 1998;
+---------+
| cnt |
+---------+
| 5384721 |
+---------+
1 row in set (8 min 17.9652 sec)
</pre>
The materialized view can be used to answer such a query much faster than scanning the table. In this way, the materialized view is acting as a kind of index.
<pre>
mysql> select Year, sum(cnt) as `count(*)`
from ontime_OriginAirport_DestAirport_Year_cnt
where Year = '1998';
+------+----------+
| Year | count(*) |
+------+----------+
| 1998 | 5384721 |
+------+----------+
1 row in set (0.02 sec)
</pre>
Finally, in the previous blog post I demonstrated a query that could not be answered efficiently using a btree index:
<pre>
mysql> select count(*) cnt
from ontime_innodb
where OriginAirportId = 10135 or DestAirportId = 10135;
+--------+
| cnt |
+--------+
| 275132 |
+--------+
1 row in set (8 min 10.1118 sec)
</pre>
Using the materialized view, this query can be answered efficiently:
<pre>
mysql> select sum(cnt) as `count(*)` from ontime_OriginAirport_DestAirport_Year_cnt where OriginAirportId = 10135 or DestAirportId = 10135;
+----------+
| count(*) |
+----------+
| 275132 |
+----------+
1 row in set (0.02 sec)
</pre>
Using simple query rewrites to acess the materialized view, you can speed up lots of different queries on your system. This single materialized view answers many different questions. Choose how you build your views and you can get a lot of bang for your buck.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment