@l1x

l1x/merge.parquet.py

Last active Oct 17, 2020
Merging Parquet files with Python
import os
import pyarrow.parquet as pq
#
# Warning!!!
# Suffers from the same problem as the parquet-tools merge function.
#
# parquet-tools merge:
# Merges multiple Parquet files into one. The command doesn't merge row groups,
# just places one after the other. When used to merge many small files, the
# resulting file will still contain small row groups, which usually leads to bad
# query performance.
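

def combine_parquet_files(input_folder, target_path):
    try:
        # Read every file in the folder into memory. Sorting makes the output
        # order deterministic, since os.listdir returns names in arbitrary order.
        files = []
        for file_name in sorted(os.listdir(input_folder)):
            files.append(pq.read_table(os.path.join(input_folder, file_name)))
        # All inputs are assumed to share the schema of the first file.
        with pq.ParquetWriter(target_path,
                              files[0].schema,
                              version='2.0',
                              compression='gzip',
                              use_dictionary=True,
                              data_page_size=2097152,  # 2 MiB
                              write_statistics=True) as writer:
            # Each write_table call appends its table as new row group(s),
            # so small inputs stay small row groups in the output.
            for f in files:
                writer.write_table(f)
    except Exception as e:
        print(e)


combine_parquet_files('data', 'combined.parquet')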
@gudata commented Oct 7, 2020

Hi,
How do you explain this? When I run this method once, it produces a file of, let's say, 20 KB.
When I then run the same method on only the file from the previous run, I get a file of 10 KB.
The file content is identical, but the schema is different.
