-- Various bits and pieces of code that turned out to be useful for the 'Load' step of ETL
-- (i.e., loading big chunks of data into postgres tables).
-- postgres' COPY is far more efficient than a series of individual INSERT statements:
\copy table_name (col1, col2, …) from '/path/to/file.txt'
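-- If the dump is a CSV file rather than postgres' default text format, the same command accepts
-- COPY options. A minimal variant, assuming a comma-separated file with a header row:
\copy table_name (col1, col2, …) from '/path/to/file.csv' with (format csv, header)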
-- If the text file to load the data from is unusually large, it can also be kept compressed and
-- decompressed on the fly. Note that, regardless of which of the two methods is used (loading directly
-- from a file or decompressing it first with zcat), COPY will *not* load the entire file into memory,
-- so both methods work for very large files even if server and client have little memory available.
\copy table_name (col1, col2, …) from program 'zcat /path/to/file.txt.gz'
-- Even though COPY will not load the entire file into memory, memory usage will still grow with the
-- size of the file. One simple way to avoid this is to split the text file into smaller files and run
-- COPY on each of them. With split's --filter option and "\copy […] from stdin", each chunk is piped
-- to psql directly, so the chunks never have to be written to disk.
-- Notice that this only works if the columns contain no newlines, or only escaped newlines.
-- Note that the following is shell, not SQL:
split -l1000000 --filter="psql -c '\copy table_name (col1, col2, …) from stdin'" /path/to/file.txt
-- Obviously, this method can also be used with compressed files:
zcat /path/to/file.txt.gz | split -l1000000 --filter="psql -c '\copy table_name (col1, col2, …) from stdin'"
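-- The newline caveat above is usually harmless: dumps produced with COPY … TO in postgres' default
-- text format already have embedded newlines escaped as \n, so splitting such a file by lines is safe.
-- A minimal sketch of producing such a dump (the file path is just an example):
\copy table_name (col1, col2, …) to '/path/to/file.txt'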
-- With ETL and data warehousing, you might end up with the following scenario: the table in the source
-- database is too large to be copied to the target database in full on every synchronization. Existing
-- rows may change, so you cannot simply copy only the rows that were added since the last synchronization.
-- You may, however, assume that rows with an ID below a given threshold will no longer change in the
-- source database. So instead of copying the entire table, you copy only those rows whose ID is equal
-- to or larger than the threshold. You can then load that text file into a staging table and let
-- postgres figure out which rows have actually changed and should therefore be written to the actual table.
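-- First, load the extracted rows into a staging table. A minimal sketch, assuming table_name_tmp
-- should mirror the target's columns ((like table_name) copies the column definitions, but no indexes),
-- and that the dump path is just an example:
create table if not exists table_name_tmp (like table_name);
\copy table_name_tmp (id, col1, col2, …) from '/path/to/file.txt'
-- Then let postgres compute the difference inside a single transaction: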
begin;
-- Delete all rows above the threshold that no longer match the source. This covers both rows that
-- were deleted in the source database and the old versions of rows that have changed (the changed
-- rows are re-inserted below).
delete from table_name
where id in (
  select id from (
    select id, col1, col2, … from table_name where id >= (select min(id) from table_name_tmp)
    except
    select id, col1, col2, … from table_name_tmp
  ) deleted_ids
);
-- Next, insert the rows that are new or have changed:
insert into table_name (id, col1, col2, …)
select id, col1, col2, … from (
  select id, col1, col2, … from table_name_tmp
  except
  select id, col1, col2, … from table_name where id >= (select min(id) from table_name_tmp)
) diff;
truncate table table_name_tmp;
commit;
-- Of course, it would be simpler to just delete all rows of the target table whose id is larger than
-- or equal to the threshold and then COPY the text file directly into the target table, with no staging
-- table at all (a sketch follows below). However, the approach described above has two advantages:
-- 1. New rows enter the target table via a regular INSERT instead of COPY. Note that in postgres,
-- COPY FROM does invoke triggers and check constraints, but unlike INSERT, it does not invoke rules.
-- 2. If many indexes exist on the target table, it might be faster to first insert everything into a temporary table (without
-- any indexes), and then insert only a small fraction of the rows of the temporary table into the target table.
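-- For reference, a minimal sketch of the simpler alternative: delete everything above the threshold,
-- then COPY the file straight into the target table. threshold_id is a psql variable, assumed to be
-- set beforehand (e.g. \set threshold_id 1000000):
begin;
delete from table_name where id >= :threshold_id;
\copy table_name (id, col1, col2, …) from '/path/to/file.txt'
commit;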