NOTE: This is a companion article to a talk I did at Montreal.rb on September 20th, 2016 titled "PostgreSQL: The Fastest Ways to Add Data".
Everybody knows how to add data to a PostgreSQL cluster:
INSERT INTO table(column0, column1) VALUES (val0, val1)
This is also the slowest way to add data! Obviously, this depends on your use case: if you're adding a single row, this is the best statement you can use, but if you're adding thousands of rows, there are better solutions.
The problem with doing a SQL statement per new row is two-fold: repeated work server-side and round-trip time.
Whenever you execute a SQL statement, the server will execute something like this (I haven't read the code, but it should be similar):
query := PARSE_FROM(socket);
plan := PLAN(query);
IF autocommit THEN
BEGIN;
EXECUTE(plan);
COMMIT;
ELSE
EXECUTE(plan)
END IF;
REPLY ok;
Every statement executes the same PARSE_FROM and PLAN steps. PostgreSQL and ORM libraries are smart and cache plans and statements, but there is still some wasted work.
Round-Trip Time is really the worst problem when you're trying to achieve any kind of throughput. Every time you talk over the network, you're eating into your time budget. What happens is this:
Client Server
---------------------------------------------
INSERT INTO ... 1 ms >
PARSE
PLAN
EXECUTE
(3 ms)
1 ms < REPLY
The problem stems from the fact that every network packet takes a non-negligible amount of time to reach the server, and for the reply to come back. In the simplified example above, we waste 1 millisecond reaching the server and 1 millisecond for the reply packet to reach us. That's 2 milliseconds wasted out of 5 milliseconds of total time, or 40% of our time spent twiddling our thumbs. This is a huge amount of time!
But we can do better than that.
If we want to achieve better throughput, we have to somehow change the equation. Since we're wasting so much time on the network, we should reduce the number of round-trips we make. The SQL standard allows the following syntax, which I'll call "INSERT many":
INSERT INTO table(col0, col1)
VALUES
(val00, val01)
, (val10, val11)
, (val20, val21)
, ...
What we're doing here is adding many rows in a single SQL statement / one network round-trip. The equation has now become:
Client Server
---------------------------------------------
INSERT INTO ... 1 ms >
PARSE
PLAN
EXECUTE
(498 ms)
1 ms < REPLY
In this simplified example, we're still paying 2 milliseconds of round-trip time, but now the server is doing 498 ms of work for every round-trip. We now waste only 2 out of 500 milliseconds on round-trip time, which boils down to 0.4%. We've already achieved a 100x improvement in throughput with a simple change.
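In application code, an "INSERT many" statement is usually assembled from a list of rows. Here's a minimal sketch of that assembly in Python; the table and column names (`events`, `col0`, `col1`) and the `$n` placeholder style are illustrative, and you'd hand the resulting statement and parameters to whatever driver you use:

```python
# Build a single multi-row INSERT with numbered placeholders ($1, $2, ...)
# so that thousands of rows travel in one statement / one round-trip.
def build_insert_many(table, columns, rows):
    """Return (sql, params) for one multi-row INSERT statement."""
    width = len(columns)
    groups = []
    params = []
    for i, row in enumerate(rows):
        # each row gets its own group of consecutive placeholders
        placeholders = ", ".join(f"${i * width + j + 1}" for j in range(width))
        groups.append(f"({placeholders})")
        params.extend(row)
    sql = (
        f"INSERT INTO {table}({', '.join(columns)})\n"
        "VALUES\n  " + "\n, ".join(groups)
    )
    return sql, params

sql, params = build_insert_many("events", ["col0", "col1"],
                                [("a", 1), ("b", 2), ("c", 3)])
```

Using placeholders instead of interpolating values keeps the statement safe from SQL injection while still batching all rows into one round-trip.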
To achieve even better throughput, we have to delve into PostgreSQL's extensions and internals.
PostgreSQL has a COPY statement. The COPY statement allows PostgreSQL to read files (and STDIN) and copy the data into a table. For larger batches, this is better than going through the full query parser: PostgreSQL can use a simpler CSV parser instead.
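Feeding COPY ... FROM STDIN in CSV format means producing correctly quoted CSV text on the client. A small sketch using Python's standard csv module; how you then hand the buffer to the server depends on your driver (for example psycopg2's copy_expert, or the pg gem's put_copy_data):

```python
import csv
import io

# Serialize rows to CSV exactly as COPY ... WITH (FORMAT csv) expects.
# The csv module takes care of quoting and escaping for us.
def rows_to_csv(rows):
    buf = io.StringIO()
    writer = csv.writer(buf)  # default dialect: comma-delimited, minimal quoting
    writer.writerows(rows)
    buf.seek(0)
    return buf

# A value containing a comma gets quoted automatically.
payload = rows_to_csv([("val00", "val01"), ("has,comma", "x")]).read()
```

Letting a CSV library do the quoting avoids subtle corruption when values contain delimiters, quotes, or newlines.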
On its own, COPY many is slightly faster than INSERT many, but not by much. To achieve an even better throughput, we have to know more about PostgreSQL's internals.
The Write-Ahead Log (WAL) is where PostgreSQL stores its intents: "I will write tuple X to row Y of relation Z." The WAL is the implementation detail that provides the "D" in ACID: durability. In case of a crash, the server can replay the log and restore consistency. But writing to the Write-Ahead Log takes time. If we could somehow skip this write, we would achieve better throughput.
It turns out that if we open a transaction and use the COPY statement to write to a temporary table, PostgreSQL will not write the table's data to the WAL. In detail, the technique is:
BEGIN;
CREATE TEMP TABLE tmp_table( LIKE table ) ON COMMIT DROP;
COPY tmp_table(col0, col1) FROM stdin WITH (FORMAT csv, HEADER false, DELIMITER ',');
"val00","val01"
"val10","val11"
"val20","val21"
"val30","val31"
...
\.
INSERT INTO table
SELECT tmp_table.*
FROM tmp_table LEFT JOIN table USING (col0)
WHERE table.col0 IS NULL;
COMMIT;
- Open a transaction;
- Create a temporary table that has the same schema as the target table;
- COPY the data into the temporary table, skipping the WAL;
- Write the data to the destination table, but only the rows that don't already exist, which prevents duplicates in your destination table;
- Finally, close the transaction; only the subset of rows that were really new is written to the WAL.
The reason we can skip the write to the WAL is that PostgreSQL knows temporary tables cannot live longer than the session (here, we've even specified that the table must be dropped at commit, so even earlier). Since the table cannot survive the session, and sessions cannot survive a server restart, temporary tables do not need to be written to the WAL.
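The transaction above can also be assembled programmatically. A sketch that builds the statement sequence as one script; the names (`events`, `col0`, `col1`) are illustrative, and the CSV payload itself would be streamed separately through your driver's COPY support:

```python
# Build the BEGIN / CREATE TEMP TABLE / COPY / anti-join INSERT / COMMIT
# script for the temp-table technique described above.
def build_copy_script(target, columns, key):
    cols = ", ".join(columns)
    return "\n".join([
        "BEGIN;",
        # LIKE copies the target's column definitions; ON COMMIT DROP
        # guarantees the temp table disappears with the transaction.
        f"CREATE TEMP TABLE tmp_table( LIKE {target} ) ON COMMIT DROP;",
        f"COPY tmp_table({cols}) FROM stdin WITH (FORMAT csv);",
        # anti-join: insert only rows whose key is absent from the target
        f"INSERT INTO {target}",
        "SELECT tmp_table.*",
        f"FROM tmp_table LEFT JOIN {target} USING ({key})",
        f"WHERE {target}.{key} IS NULL;",
        "COMMIT;",
    ])

script = build_copy_script("events", ["col0", "col1"], "col0")
```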
The final trick I have for you is a way to increase concurrency further. The problem with the COPY many approach is this: there's a hidden LOCK taken while writing to a table:
Client Server
---------------------------------------------
INSERT INTO ... 1 ms >
PARSE
PLAN
EXECUTE
LOCK
...
END
1 ms < REPLY
PostgreSQL protects itself against errors with this lock: preserving primary key constraints, for example. In Seevibes' case, we collect and store data for many markets and many time periods simultaneously. Data destined for the Canadian market will not and cannot influence the US market. Likewise, data arriving for September will not and cannot influence the August data. This lets us achieve the best throughput: we can write to different tables from different processes, and no locks will conflict.
-- process 1
BEGIN;
-- no need to be fancy with temp table names:
-- PostgreSQL ensures no two temp tables can conflict, even if they have the same name
-- (in reality, temp tables are created in a per-session schema)
CREATE TEMP TABLE tmp_table( LIKE canada_2016_09 ) ON COMMIT DROP;
COPY ...;
...
\.
INSERT INTO canada_2016_09
SELECT tmp_table.*
FROM tmp_table LEFT JOIN canada_2016_09 USING (service_name, interaction_service_id)
WHERE canada_2016_09.interaction_service_id IS NULL;
COMMIT;
-- process 2: cannot conflict with process 1 because this process is writing to a different table!
BEGIN;
CREATE TEMP TABLE tmp_table( LIKE usa_2016_08 ) ON COMMIT DROP;
COPY ...;
...
\.
INSERT INTO usa_2016_08
SELECT tmp_table.*
FROM tmp_table LEFT JOIN usa_2016_08 USING (service_name, interaction_service_id)
WHERE usa_2016_08.interaction_service_id IS NULL;
COMMIT;
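Each worker only needs to know which table it owns. A tiny helper for deriving the per-market, per-month table name, following the `canada_2016_09` pattern above (the naming scheme itself is the only assumption here):

```python
# Route each (market, year, month) partition to its own table so that
# concurrent writers never contend for the same table lock.
def partition_table(market, year, month):
    return f"{market}_{year:04d}_{month:02d}"
```

With this, each process can build its temp-table COPY transaction against its own target table and run fully in parallel with the others.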
Of course, there are no free lunches: increasing throughput increased latency. In the INSERT one case, each new row was visible to reports within milliseconds. In the COPY many case, new rows only become visible after a relatively long period of time. In Seevibes' case, this was an acceptable tradeoff: waiting two hours for a row to become queryable won't cause issues, because our reports are based on the past six months' data anyway.
Hope this helps someone!