Final submission for the PostgreSQL GSoC'17 project 'Add errors handling and parallel execution to COPY'
Alexey Kondratov (kondratov.aleksey@gmail.com).
In my initial proposal I was planning to use subtransactions for error handling, since it is the only completely safe way to catch all possible errors during COPY FROM execution. However, it would cause a serious problem: severe transaction ID (XID) consumption hidden from the end-user. This may lead to a huge performance drop when errors are frequent in the input data.
Following the advice received in the pgsql-hackers mailing list thread, I have focused on my main motivation: handling only input data errors (e.g. extra or missing data/columns, wrong type formatting). This goal can be achieved without subtransactions.
Thus, in my current patch:
- New option IGNORE_ERRORS added to the COPY FROM syntax
- Multiple validations raise WARNINGs instead of ERRORs when error handling is turned on
- InputFunctionCall is wrapped with PG_TRY/PG_CATCH to safely catch input data type formatting errors
Usage example:
COPY test_table
FROM '/path/to/test.csv'
WITH (
IGNORE_ERRORS,
FORMAT csv
);
- Fruitful discussion in the pgsql-hackers thread
- Working PostgreSQL patch, which passes all regression tests; applicable to master at commit 6c774caf0ea6977f00af4225192915f0a602ea3d
- Submission to the nearest PostgreSQL Commitfest
- GitHub pull request for an interactive view of all commits and changes
Because of the specifics of PostgreSQL core development and releases, no patch can be merged without passing a detailed multi-step review process, including a pgsql-hackers discussion, code review, etc. However, all required steps toward the successful merging of this patch have been taken.
To understand whether there are any expensive calls in COPY that could be executed in parallel, I did a small investigation. First, see the attached flame graph of the most expensive copy.c calls during a 'COPY FROM file' run (2-copy_from.svg). It reveals that inevitably serial operations like CopyReadLine (<15%) and heap_multi_insert (~15%) together take less than 50% of the time, while the remaining operations, such as heap_form_tuple and the multiple checks inside NextCopyFrom, can probably be executed in parallel.
Second, I compared the execution time of 'COPY FROM a single large file (~300 MB, 50000000 lines)' against 'COPY FROM four equal parts of the original file, executed in four parallel processes'. Though it is a very rough test, it helps to obtain an overall estimate:
Serial:
real 0m56.571s
user 0m0.005s
sys 0m0.006s
Parallel (x4):
real 0m22.542s
user 0m0.015s
sys 0m0.018s
Thus, it results in a ~60% performance boost for each x2 multiplication of parallel processes, which is consistent with the initial estimation. Taking into account all of the gathered information, I have chosen a parallel architecture with one master process and multiple worker processes, so in my current implementation:
The master process:
- Allocates dynamic shared memory holding the parallel state shared across the BGWorkers and the master
- Attaches every worker to a personal message queue (shm_mq)
- Waits for the workers' initialization using a Latch
- Reads raw text lines using CopyReadLine and puts them into the shm_mq's in round-robin fashion to balance the queue load
- When EOF is reached, sends a zero-length message; the workers shut down safely upon receiving it
- Waits on a Latch until the workers complete their jobs
Each BGWorker:
- Signals the master via a Latch upon initialization
- Reinitializes the database connection using the same db_id and user_id as the main process
- Receives raw text lines over its personal shm_mq, processes them, and puts them into the heap
- Signals the master via a Latch when its job is done
All parallel state modifications are done under LWLocks.
The real performance boost is slightly lower than expected: generally between x1.2 and x1.4 (for two parallel workers compared to serial code), and up to x1.5 from time to time, though it is still a notable improvement.
- Mailing-list discussion
- GitHub pull request for an interactive view of all commits and changes
- Working patch, applicable on top of the copy-errors patch
Although parallel COPY FROM is working, a few things still have to be fixed and polished to finalize the parallel code integration, since COPY is core functionality in PostgreSQL and any architectural change can cause unpredictable crashes and problems. Nevertheless, I believe that the main goals of this project have been achieved.