Final submission for the PostgreSQL GSoC'17 project 'Add errors handling and parallel execution to COPY'
Alexey Kondratov (kondratov.aleksey@gmail.com).
In my initial proposal I was planning to use subtransactions for error handling, since it is the only completely safe way to catch all possible errors during COPY FROM execution. However, it would cause a serious problem: severe transaction ID (XID) consumption hidden from the end-user. This may lead to a huge performance drop when errors are frequent in the input data.
Following the advice received in the pgsql-hackers mailing list thread, I have focused on my main motivation: handling only input data errors (e.g. extra or missing data/columns, wrong type formatting). This goal can be achieved without subtransactions.
Thus, in my current patch:
- New option IGNORE_ERRORS added to the COPY FROM syntax
- Multiple validations raise WARNINGs instead of ERRORs when error handling is turned on
- InputFunctionCall is wrapped with PG_TRY/PG_CATCH to safely catch input data type formatting errors
Usage example:
COPY test_table
FROM '/path/to/test.csv'
WITH (
IGNORE_ERRORS,
FORMAT csv
);
- Fruitful discussion in the pgsql-hackers thread
- Working PostgreSQL patch, which passes all regression tests; applicable to master at commit 6c774caf0ea6977f00af4225192915f0a602ea3d
- Submission to the nearest PostgreSQL Commitfest
- GitHub pull request for an interactive view of all commits and changes
Because of the specifics of PostgreSQL core development and releases, no patch can be merged without passing a detailed multi-step review process, including a pgsql-hackers discussion, code review, etc. However, all required steps toward the successful merging of this patch have been taken.
To understand whether there are any expensive calls in COPY that could be executed in parallel, I did a small investigation. First, see the attached flame graph of the most expensive copy.c calls during a 'COPY FROM file' run (2-copy_from.svg). It reveals that inevitably serial operations like CopyReadLine (<15%) and heap_multi_insert (~15%) together take less than 50% of the time, while the remaining operations, such as heap_form_tuple and the multiple checks inside NextCopyFrom, can probably be executed in parallel.
Second, I compared the execution time of 'COPY FROM a single large file (~300 MB, 50000000 lines)' against 'COPY FROM four equal parts of the original file, executed in four parallel processes'. Though it is a very rough test, it helps to obtain an overall estimate:
Serial:
real 0m56.571s
user 0m0.005s
sys 0m0.006s
Parallel (x4):
real 0m22.542s
user 0m0.015s
sys 0m0.018s
Thus, it results in a ~60% performance boost for each x2 multiplication of parallel processes, which is consistent with the initial estimation. Taking into account all of the gathered information, I have chosen a parallel architecture with one master process and multiple worker processes, so in my current implementation:
The master process:
- Allocates dynamic shared memory holding the parallel state shared across the BGWorkers and the master
- Attaches every worker to a personal message queue (shm_mq)
- Waits for the workers' initialization using a Latch
- Reads raw text lines using CopyReadLine and puts them into the shm_mq's in round-robin fashion to balance the queue load
- When EOF is reached, sends a zero-length message; the workers shut down safely upon receiving it
- Waits on a Latch until the workers complete their jobs
Each BGWorker:
- Signals the master via a Latch upon initialization
- Reinitializes the database connection using the same db_id and user_id as the main process
- Receives raw text lines over its personal shm_mq, processes them, and puts them into the heap
- Signals the master via a Latch when its job is done
All parallel state modifications are done under LWLocks.
The real performance boost is slightly lower than expected: generally between x1.2 and x1.4 (for two parallel workers compared to serial code), and up to x1.5 from time to time, though it is still a notable improvement.
- Mailing-list discussion
- GitHub pull request for an interactive view of all commits and changes
- Working patch, applicable on top of the copy-errors patch
Although parallel COPY FROM is working, a few things still have to be fixed and polished to finalize the parallel code integration, since COPY is core functionality in PostgreSQL and any architectural change can cause unpredictable crashes and problems. Nevertheless, I believe that the main goals of this project have been achieved.