@revbucket
Created October 7, 2016 21:38
psycopg2 bulk loading

Bulk Loading from Python to Postgres

Here at MaestroIQ we deal with large amounts of user data, since we need impression-level information to build an accurate picture of our customers' users' paths to purchase. We store all of this data in a Postgres 9.3 instance. I'm hesitant to call it "big data," but our largest tables have over 500M rows and occupy over 200GB of disk space (if you include the indices), so performance is a very real issue we have to worry about. We are also a Python shop on the backend: all of our production code for data analytics and machine learning is Python, as is the code that controls our ETL workflow. To interact with our Postgres instances, we use the psycopg2 package. My hope with this post is to share some of the tips, tricks, and best practices we've discovered through heavy use of psycopg2.

Bulk loading

In dealing with large Postgres tables, one first has to make them large. Bulk loading is an extremely common task, and as we'll see, there are good ways and bad ways to go about it. There are four techniques for inserting a reasonably large number of rows into Postgres from Python. I say "reasonably large" because the rows have to fit in the main memory of whichever machine is doing the T part of your ETL pipeline. The use case we run into most often is loading hundreds of thousands of rows at a time into a table that already has hundreds of millions of rows. I'll support my claims with some benchmark results, so let's set up those benchmarks first.

We'll operate on both a local Postgres 9.3 instance (running on my 2015 MacBook Pro) and an AWS RDS db.t2.medium instance. The test tables we'll be loading into have the following schema:

CREATE TABLE benchmark_table 
  (id int PRIMARY KEY, 
   name varchar(32) NOT NULL, 
   other_info varchar(32),
   time_seen timestamptz);
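
All of the Python snippets below assume a psycopg2 connection object obtained roughly like this; the DSN strings here are placeholders (hypothetical hosts and credentials), not our actual configuration:

import psycopg2

# Placeholder DSNs -- substitute your own hosts and credentials.
LOCAL_DSN = "dbname=benchmark user=postgres host=localhost"
REMOTE_DSN = "dbname=benchmark user=bench password=secret host=example.rds.amazonaws.com"

conn = psycopg2.connect(LOCAL_DSN)  # or psycopg2.connect(REMOTE_DSN)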

We'll test against both an initially empty table and a table pre-populated with 5,000,000 arbitrary rows, generated as follows:

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
INSERT INTO benchmark_table (id, name, other_info, time_seen)
  SELECT g, left(uuid_generate_v4()::text, 32), md5(random()::text), now()
  FROM generate_series(1, 5000000) AS g;

Our benchmark tests will be to insert 100; 1,000; 10,000; 100,000; and 1,000,000 rows into our table. We have four contenders for bulk loading: 1) the execute loop, 2) executemany, 3) chunked execute calls, and 4) COPY. Because some of these methods require potentially costly query preparation, timing starts from the moment we generate the rows we want to insert (as a list of tuples) and ends when the commit returns.

| (seconds)   | 10^2  | 10^3  | 10^4 | 10^5 | 10^6 |
|-------------|-------|-------|------|------|------|
| Local/empty | 0.116 | 0.247 | 1.39 | 9.09 |      |
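
To make the timing methodology concrete, here's a sketch of the kind of harness this implies (not our exact benchmark code); generate_rows is a hypothetical helper that builds tuples matching benchmark_table's schema, and loader stands in for the bulk-load functions discussed below:

import time
import uuid
from datetime import datetime, timezone

def generate_rows(n):
    """Hypothetical helper: build n (id, name, other_info, time_seen) tuples."""
    return [(i,
             uuid.uuid4().hex,              # 32-character name
             uuid.uuid4().hex,              # 32-character other_info
             datetime.now(timezone.utc))    # time_seen
            for i in range(n)]

def time_loader(loader, conn, n):
    """Time a bulk loader; the clock runs from having the rows in hand to commit."""
    rows = generate_rows(n)
    start = time.time()
    loader(conn, 'benchmark_table', rows)   # e.g. execute_loop below; the COPY
                                            # variant also takes a columns argument
    return time.time() - start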

The Execute Loop

Suppose we have the rows we want to insert stored in a local variable named rows: a list of tuples, where each tuple has the form (id, name, other_info, time_seen). The simplest way to insert each row into the table is with code like this:

def execute_loop(conn, table_name, rows):
    cur = conn.cursor()
    # table_name is trusted here (it isn't user input), so plain string formatting is fine
    query = "INSERT INTO {} VALUES (%s, %s, %s, %s);".format(table_name)
    for row in rows:
        cur.execute(query, row)
    conn.commit()

This essentially runs an individual INSERT query for each row in rows. Of course, this is a huge SQL no-no, and no one would ever do this, but it's important to show the 'bad way' of doing things first.

executemany

It'd be really nice if the psycopg2 developers had realized that bulk loading data is a common thing their users do, and what do you know, they did! They provide a method that prepares and executes a query against a sequence of values (conveniently in the exact form of our rows variable). It's called executemany, and while promising, it's just as bad as the execute loop above. Here's a snippet from the C source that defines executemany:

    while ((v = PyIter_Next(vars)) != NULL) {
        if (0 > _psyco_curs_execute(self, operation, v, 0, 1)) {
            Py_DECREF(v);
            Py_XDECREF(iter);
            return NULL;
        }
        else {
            if (self->rowcount == -1)
                rowcount = -1;
            else if (rowcount >= 0)
                rowcount += self->rowcount;
            Py_DECREF(v);
        }
    }

It's the same thing as the execute loop, just written in C instead of Python, so the performance might be slightly better, but not by much.
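
For completeness, here's roughly what the executemany version of our loader looks like (a sketch along the same lines as execute_loop above):

def execute_many(conn, table_name, rows):
    cur = conn.cursor()
    # table_name is trusted here (it isn't user input), so plain string formatting is fine
    query = "INSERT INTO {} VALUES (%s, %s, %s, %s);".format(table_name)
    cur.executemany(query, rows)   # loops internally, issuing one INSERT per row
    conn.commit()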

chunked execute calls

The first two methods perform poorly primarily because they take what ought to be a single query ("insert all this stuff into the table") and break it up into many, many queries ("insert this, and this, and this, ..."). If we know we're bulk loading, we can instead format the query like this:

INSERT INTO benchmark_table
    VALUES (1, 'name1', 'info1', 'time1'),
           (2, 'name2', 'info2', 'time2'),
           ...

There are, of course, limits to query size. Against my local server I could insert up to a million rows this way, but against the remote server a million-row statement failed. There seems to be a 1GB limit per query when talking to a remote server, which probably has something to do with the maximum varlena size being 1GB.
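
One convenient way to build these multi-row statements from Python is cursor.mogrify, which runs psycopg2's usual parameter adaptation and returns the escaped SQL fragment. Here's a sketch, assuming the same rows variable as before and a UTF-8 connection; the 10,000-row chunk size is an arbitrary choice meant to keep each statement well under the size limit:

def execute_chunked(conn, table_name, rows, chunk_size=10000):
    cur = conn.cursor()
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        # mogrify escapes each tuple into a "(...)" values literal (bytes on Python 3)
        values = ','.join(
            cur.mogrify("(%s, %s, %s, %s)", row).decode('utf-8') for row in chunk)
        cur.execute("INSERT INTO {} VALUES {};".format(table_name, values))
    conn.commit()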

COPY

SQL veterans will probably be pretty frustrated with this post up to this point. Of course the best performance for bulk loading is going to come from the SQL command dedicated to bulk loading and unloading of data: COPY. The problem is that COPY support in psycopg2 isn't very well documented, and what's worse, there's no built-in adapter for the various data types: you mostly have to guess how each data type should be rendered into a copyable string. This is easy for simple types like int, varchar, even timestamps, but it becomes far trickier for composite types and other 'new' data types like hstore, jsonb, etc. The exact adaptation method is perhaps a topic for another post, so for now let's look at the cursor's copy_from method.

Since each method we've used so far has started from a list of tuples, let's start there for COPY as well. The first step is to write these tuples into some file-like object; I find StringIO very easy to work with, but one can use an actual file or a tempfile instead. Then, provided the formatting of this file is correct, we can call copy_from, specify the columns, delimiter, and null value, and be amazed at how fast it performs.

from io import StringIO  # on Python 2: from StringIO import StringIO

def copy_into_table(conn, table, columns, rows):
    cur = conn.cursor()
    sio = StringIO()
    # adapt_row is a custom method, differing from psycopg2's adapt
    sio.write('\n'.join(adapt_row(row) for row in rows))
    sio.seek(0)
    cur.copy_from(sio, table, columns=columns)
    conn.commit()
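
Since adapt_row is our own helper rather than part of psycopg2, here's a minimal sketch of what it might look like for this particular schema. It emits the tab-delimited, \N-for-NULL text format that copy_from expects by default, and it only covers the easy types (ints, short strings, timestamps), not the trickier ones mentioned above:

def adapt_row(row):
    """Render one (id, name, other_info, time_seen) tuple as a COPY text line."""
    def adapt(value):
        if value is None:
            return r'\N'                  # copy_from's default NULL marker
        text = str(value)                 # ints, strings, and datetimes all str() cleanly
        # escape characters that are special in COPY's text format
        return (text.replace('\\', '\\\\')
                    .replace('\t', '\\t')
                    .replace('\n', '\\n')
                    .replace('\r', '\\r'))
    return '\t'.join(adapt(v) for v in row)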

Results

The results are summarized in the table below. I've only given the numbers for the 100 and 1,000,000 row cases, but the graph should convince you that these timings scale linearly enough to interpolate between them.

COPY, of course, is the clear winner. Chunking your execute calls comes in a close second, at roughly half the speed, but it can also be expressed more compactly in your Python code, which makes it a useful alternative for small-to-medium inserts. I can't see why anyone would ever use executemany or its pure-Python sibling, the execute loop -- and that holds in general, not just for bulk inserts.

Wrapping up

Bulk loading is a common task for anyone who works with databases. Psycopg2, while providing lots of great methods for interacting with Postgres from Python, is lacking in well-documented, easy-to-use support for bulk loading. While one might be tempted to use executemany to run an insert over lots of rows, the overhead of issuing that many execute calls is very expensive. You're much better off concatenating everything into a single statement, and better off still writing your data to a file-like object and copying it in directly.

@MuhammadWaqas05

Hi Matt,

Great post this is. When coming to the 'Results' heading, I can not find any table in which the results are summarized. Can you please guide me where that table is?

Thanks
