Here at MaestroIQ we deal with large amounts of user data, since we need impression-level information to build an accurate picture of our customers' users' paths to purchase. We store all of this data in a Postgres 9.3 instance. I'm hesitant to call it "big data," but our largest tables have over 500M rows and occupy over 200GB of disk space (if you include the indices), so performance is a very real issue we have to worry about. We are also a Python shop on the backend: all of our production code for data analytics and machine learning is Python, as is the code that controls our ETL workflow. To interact with our Postgres instances, we use the psycopg2 package. My hope for this post is to share some tips, tricks, and best practices we've discovered through heavy use of psycopg2.
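To set the stage, here is the shape of a typical psycopg2 interaction, written as a minimal sketch. The DSN, table, and column names here are illustrative placeholders, not our actual schema:

```python
def fetch_recent_impressions(dsn, since):
    """Fetch impression rows newer than `since`.

    A minimal psycopg2 usage sketch; `impressions` and its columns
    are hypothetical, stand-ins for whatever your schema looks like.
    """
    import psycopg2  # imported lazily so the sketch stands alone

    conn = psycopg2.connect(dsn)
    try:
        # Using the connection as a context manager wraps the block
        # in a transaction (commit on success, rollback on error);
        # note it does NOT close the connection on exit.
        with conn, conn.cursor() as cur:
            # Always pass values as a separate parameter tuple --
            # never interpolate them into the SQL string yourself.
            cur.execute(
                "SELECT user_id, ts FROM impressions WHERE ts > %s",
                (since,),
            )
            return cur.fetchall()
    finally:
        conn.close()
```

The pattern to internalize is the pairing of the connection context manager (transaction scope) with an explicit `close()` (connection lifetime); conflating the two is a common source of leaked connections.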
In dealing with large Postgres tables, one first has to make them large. Bulk-loading is an extremely common task, and as we'll see, there are good ways and