Skip to content

Instantly share code, notes, and snippets.

@ololobus
Last active September 6, 2017 12:32
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ololobus/249432b663e29d111f36b12458e3f30a to your computer and use it in GitHub Desktop.
Save ololobus/249432b663e29d111f36b12458e3f30a to your computer and use it in GitHub Desktop.
Final submission for a GSOC project 'Add errors handling and parallel execution to COPY'

Final submission for a PostgreSQL GSOC'17 project 'Add errors handling and parallel execution to COPY'

Alexey Kondratov (kondratov.aleksey@gmail.com).

Errors handling in COPY FROM

Details

In my initial proposal I was planning to use subtransactions for errors handling, since it is the only one completely safe way to catch all possible errors during the COPY FROM execution. However, it would cause a serious problem – severe transactional IDs (XIDs) consumption hidden from the end-user. It may lead to a huge performance drop in the case, when errors are too frequent in the input data.

Following multiple advises received in the pgsql-hackers mailing list thread I have focused more on my main motivation – handle only input data errors (e.g. extra or not enough data/columns, wrong type formating). This goal may be acheived without subtransactions.

Thus, in my current patch:

  • New option IGNORE_ERRORS added to COPY FROM syntax
  • Multiple validations rise WARNINGs insead of ERRORs, when errors handling is turned on
  • InputFunctionCall is wrapped with PG_TRY/PG_CATCH to safely catch input data type formatting errors

Usage example:

COPY test_table
FROM '/path/to/test.csv'
WITH (
    IGNORE_ERRORS,
    FORMAT csv
);

Results

Because of the PostgreSQL core development and release specifics no patch can be merged without passing detailed multi-step review process, including pgsql-hackers discussion, code-review, etc. However, all required steps toward the successful merging of this patch were made.

Parallel COPY FROM

Details

In order to understand are there any expensive calls in COPY, which can be executed in parallel, I did a small research. First, find flame graph of the most expensive copy.c calls during the 'COPY FROM file' attached (see 2-copy_from.svg). It reveals, that inevitably serial operations like CopyReadLine (<15%), heap_multi_insert (~15%) take less than 50% of time in summary, while remaining operations like heap_form_tuple and multiple checks inside NextCopyFrom probably can be executed well in parallel.

Second, I have compared an execution time of 'COPY FROM a single large file (~300 MB, 50000000 lines)' vs. 'COPY FROM four equal parts of the original file executed in the four parallel processes'. Though it is a very rough test, it helps to obtain an overall estimation:

Serial:

real 0m56.571s
user 0m0.005s
sys  0m0.006s

Parallel (x4):

real 0m22.542s
user 0m0.015s
sys  0m0.018s

Thus, it results in a ~60% performance boost per each x2 multiplication of parallel processes, which is consistent with the initial estimation. Taking into account all gathered information I have choosen a parallel architecture with one master process and multiple worker processes, so in my current implementation:

Master process does:

  • Dynamic shared memory allocation with parallel state across BGWorkers and master
  • Attaching every worker to the personal message query (shm_mq)
  • Wait workers initialization using Latch
  • Read raw text lines using CopyReadLine and puts them into shm_mq's via round-robin to balance queries load
  • When EOF is reached sends zero-length message and workers are safely shut down when receive it
  • Wait using Latch for worker until they complete their jobs

Each BGWorker does:

  • Signal master on initialization via Latch
  • Reinitialize db connection using the same db_id and user_id as main process
  • Receive raw text lines over the personal shm_mq, process them and put them into the heap
  • Signal master via Latch on job done

All parallel state modifications are done under LWLocks.

Real performance boost is slightly less than expected: generally between x1.2-1.4 (for two parallel workers compared to serial code) and up to x1.5 from time to time, though it is still notable improvement.

Results

Although parallel COPY FROM is working, there are a few things to be fixed and polished to finalize parallel code integration, since COPY is a core functionality in PostgreSQL and any architectural changes cause unpredictable crashes and problems. Anyway, I believe that the main goals of this project are achieved.

Display the source blob
Display the rendered blob
Raw
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment