NOTE: This is a companion article to a talk I did at Montreal.rb on September 20th, 2016 titled "PostgreSQL: The Fastest Ways to Add Data".
Everybody knows how to add data to a PostgreSQL cluster:
INSERT INTO table(column0, column1) VALUES (val0, val1)
This is also the slowest way to add data! Obviously, this depends on your use case: if you're adding a single row, this is the best statement you can use, but if you're adding thousands of rows, there are better solutions.
The problem with doing a SQL statement per new row is two-fold: repeated work server-side and round-trip time.
Whenever you execute a SQL statement, the server will execute something like this (I haven't read the code, but it should be similar):
query := PARSE_FROM(socket);
plan := PLAN(query);
IF autocommit THEN
BEGIN;
EXECUTE(plan);
COMMIT;
ELSE
EXECUTE(plan)
END IF;
REPLY ok;
Every statement executes the same PARSE_FROM and PLAN steps. PostgreSQL and ORM libraries are smart and cache plans and statements, but there is still some wasted work.
Round-Trip Time is really the worst problem when you're trying to achieve any kind of throughput. Every time you talk over the network, you're eating into your time budget. What happens is this:
Client Server
---------------------------------------------
INSERT INTO ... 1 ms >
PARSE
PLAN
EXECUTE
(3 ms)
1 ms < REPLY
The problem stems from the fact that every network packet takes a non-negligible amount of time to reach the server, and for the reply to come back. In the simplified example above, we waste 1 millisecond reaching the server and 1 millisecond for the reply packet to reach us. That's 2 milliseconds wasted out of 5 milliseconds of total time, or 40% of our time spent twiddling our thumbs. This is a huge amount of time!
But we can do better than that.
If we want to achieve better throughput, we have to somehow change the equation. Since we're wasting so much time on the network, we should reduce the number of round-trips we make. The SQL standard allows the following syntax, which I'll call "INSERT many":
INSERT INTO table(col0, col1)
VALUES
(val00, val01)
, (val10, val11)
, (val20, val21)
, ...
What we're doing here is adding many rows in a single SQL statement / one network round-trip. The equation has now become:
Client Server
---------------------------------------------
INSERT INTO ... 1 ms >
PARSE
PLAN
EXECUTE
(498 ms)
1 ms < REPLY
In this simplified example, we're still paying 2 milliseconds of round-trip time, but now the server is doing 498 ms of work for every round-trip. We now waste only 2 out of 500 milliseconds on round-trip time, which boils down to 0.4%. We've already achieved a 100x improvement in throughput with a simple change.
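In application code, an "INSERT many" statement is usually assembled from a list of rows. Here's a minimal sketch of that assembly in Python; the table and column names (`events`, `col0`, `col1`) and the `$n` placeholder style are illustrative, and you'd hand the resulting statement and parameters to whatever driver you use:

```python
# Build a single multi-row INSERT with numbered placeholders ($1, $2, ...)
# so that thousands of rows travel in one statement / one round-trip.
def build_insert_many(table, columns, rows):
    """Return (sql, params) for one multi-row INSERT statement."""
    width = len(columns)
    groups = []
    params = []
    for i, row in enumerate(rows):
        # each row gets its own group of consecutive placeholders
        placeholders = ", ".join(f"${i * width + j + 1}" for j in range(width))
        groups.append(f"({placeholders})")
        params.extend(row)
    sql = (
        f"INSERT INTO {table}({', '.join(columns)})\n"
        "VALUES\n  " + "\n, ".join(groups)
    )
    return sql, params

sql, params = build_insert_many("events", ["col0", "col1"],
                                [("a", 1), ("b", 2), ("c", 3)])
```

Using placeholders instead of interpolating values keeps the statement safe from SQL injection while still batching all rows into one round-trip.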
To achieve even better throughput, we have to delve into PostgreSQL's extensions and internals.
PostgreSQL has a COPY statement. The COPY statement allows PostgreSQL to read files (and STDIN) and copy the data into a table. For larger batches, this is better than going through the full query parser: PostgreSQL can use a simpler CSV parser instead.
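Feeding COPY ... FROM STDIN in CSV format means producing correctly quoted CSV text on the client. A small sketch using Python's standard csv module; how you then hand the buffer to the server depends on your driver (for example psycopg2's copy_expert, or the pg gem's put_copy_data):

```python
import csv
import io

# Serialize rows to CSV exactly as COPY ... WITH (FORMAT csv) expects.
# The csv module takes care of quoting and escaping for us.
def rows_to_csv(rows):
    buf = io.StringIO()
    writer = csv.writer(buf)  # default dialect: comma-delimited, minimal quoting
    writer.writerows(rows)
    buf.seek(0)
    return buf

# A value containing a comma gets quoted automatically.
payload = rows_to_csv([("val00", "val01"), ("has,comma", "x")]).read()
```

Letting a CSV library do the quoting avoids subtle corruption when values contain delimiters, quotes, or newlines.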
On its own, COPY many is slightly faster than INSERT many, but not by much. To achieve an even better throughput, we have to know more about PostgreSQL's internals.
The Write-Ahead Log (WAL) is where PostgreSQL stores its intents: "I will write tuple X to row Y of relation Z." The WAL is the implementation detail that provides the "D" in ACID: durability. In case of a crash, the server can replay the log and restore consistency. But writing to the Write-Ahead Log takes time. If we could somehow skip this write, we would achieve better throughput.
It turns out that if we open a transaction and use the COPY statement to write to a temporary table, PostgreSQL will not write the table's data to the WAL. In detail, the technique is:
BEGIN;
CREATE TEMP TABLE tmp_table( LIKE table ) ON COMMIT DROP;
COPY tmp_table(col0, col1) FROM stdin WITH (FORMAT csv, HEADER false, DELIMITER ',');
"val00","val01"
"val10","val11"
"val20","val21"
"val30","val31"
...
\.
INSERT INTO table
SELECT tmp_table.*
FROM tmp_table LEFT JOIN table USING (col0)
WHERE table.col0 IS NULL;
COMMIT;
- Open a transaction;
- Create a temporary table that has the same schema as the target table;
- COPY the data into the temporary table, skipping the WAL;
- Write the data to the destination table, but only the rows that don't already exist, which prevents duplicates in your destination table;
- Finally, close the transaction; only the subset of rows that were really new is written to the WAL.
The reason we can skip the write to the WAL is that PostgreSQL knows temporary tables cannot live longer than the session (here, we've even specified that the table must be dropped at commit, so even earlier). Since the table cannot survive the session, and sessions cannot survive a server restart, temporary tables do not need to be written to the WAL.
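The transaction above can also be assembled programmatically. A sketch that builds the statement sequence as one script; the names (`events`, `col0`, `col1`) are illustrative, and the CSV payload itself would be streamed separately through your driver's COPY support:

```python
# Build the BEGIN / CREATE TEMP TABLE / COPY / anti-join INSERT / COMMIT
# script for the temp-table technique described above.
def build_copy_script(target, columns, key):
    cols = ", ".join(columns)
    return "\n".join([
        "BEGIN;",
        # LIKE copies the target's column definitions; ON COMMIT DROP
        # guarantees the temp table disappears with the transaction.
        f"CREATE TEMP TABLE tmp_table( LIKE {target} ) ON COMMIT DROP;",
        f"COPY tmp_table({cols}) FROM stdin WITH (FORMAT csv);",
        # anti-join: insert only rows whose key is absent from the target
        f"INSERT INTO {target}",
        "SELECT tmp_table.*",
        f"FROM tmp_table LEFT JOIN {target} USING ({key})",
        f"WHERE {target}.{key} IS NULL;",
        "COMMIT;",
    ])

script = build_copy_script("events", ["col0", "col1"], "col0")
```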
The final trick I have for you is a way to increase concurrency further. The problem with the COPY many approach is this: there's a hidden LOCK taken while writing to a table:
Client Server
---------------------------------------------
INSERT INTO ... 1 ms >
PARSE
PLAN
EXECUTE
LOCK
...
END
1 ms < REPLY
PostgreSQL protects itself against errors with this lock: preserving primary key constraints, for example. In Seevibes' case, we collect and store data for many markets and many time periods simultaneously. Data destined for the Canadian market will not and cannot influence the US market. Likewise, data arriving for September will not and cannot influence the August data. This lets us achieve the best throughput: we can write to different tables from different processes, and no locks will conflict.
-- process 1
BEGIN;
-- no need to be fancy with temp table names:
-- PostgreSQL ensures no two temp tables can conflict, even if they have the same name
-- (in reality, temp tables are created in a per-session schema)
CREATE TEMP TABLE tmp_table( LIKE canada_2016_09 ) ON COMMIT DROP;
COPY ...;
...
\.
INSERT INTO canada_2016_09
SELECT tmp_table.*
FROM tmp_table LEFT JOIN canada_2016_09 USING (service_name, interaction_service_id)
WHERE canada_2016_09.interaction_service_id IS NULL;
COMMIT;
-- process 2: cannot conflict with process 1 because this process is writing to a different table!
BEGIN;
CREATE TEMP TABLE tmp_table( LIKE usa_2016_08 ) ON COMMIT DROP;
COPY ...;
...
\.
INSERT INTO usa_2016_08
SELECT tmp_table.*
FROM tmp_table LEFT JOIN usa_2016_08 USING (service_name, interaction_service_id)
WHERE usa_2016_08.interaction_service_id IS NULL;
COMMIT;
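Each worker only needs to know which table it owns. A tiny helper for deriving the per-market, per-month table name, following the `canada_2016_09` pattern above (the naming scheme itself is the only assumption here):

```python
# Route each (market, year, month) partition to its own table so that
# concurrent writers never contend for the same table lock.
def partition_table(market, year, month):
    return f"{market}_{year:04d}_{month:02d}"
```

With this, each process can build its temp-table COPY transaction against its own target table and run fully in parallel with the others.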
Of course, there are no free lunches: increasing throughput increased latency. In the INSERT one case, each new row was visible to reports within milliseconds. In the COPY many case, new rows only become visible after a relatively long period of time. In Seevibes' case, this was an acceptable tradeoff: waiting two hours for a row to become queryable won't cause issues, because our reports are based on the past six months' data anyway.
Hope this helps someone!