
@l1x
Last active March 2, 2024 05:16
Merging Parquet files with Python
import os
import pyarrow.parquet as pq

#
# Warning!!!
# Suffers from the same problem as the parquet-tools merge function:
#
# parquet-tools merge:
# Merges multiple Parquet files into one. The command doesn't merge row groups,
# just places one after the other. When used to merge many small files, the
# resulting file will still contain small row groups, which usually leads to bad
# query performance.
#


def combine_parquet_files(input_folder, target_path):
    try:
        # Read every file in the folder fully into memory.
        files = []
        for file_name in os.listdir(input_folder):
            files.append(pq.read_table(os.path.join(input_folder, file_name)))
        # Append the tables one after the other; existing row groups are
        # copied as-is, not merged into larger ones.
        with pq.ParquetWriter(target_path,
                              files[0].schema,
                              version='2.0',  # deprecated in newer pyarrow; use '2.4' or '2.6'
                              compression='gzip',
                              use_dictionary=True,
                              data_page_size=2097152,  # 2 MB
                              write_statistics=True) as writer:
            for f in files:
                writer.write_table(f)
    except Exception as e:
        print(e)


combine_parquet_files('data', 'combined.parquet')
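
To see the problem the warning describes, you can inspect the merged file's metadata: if the input folder held N small files with one row group each, the output will still contain N small row groups. A quick check using pyarrow's metadata accessors:

import pyarrow.parquet as pq

meta = pq.ParquetFile('combined.parquet').metadata
print('row groups:', meta.num_row_groups)
# Each input file's small row groups survive in the merged output.
for i in range(meta.num_row_groups):
    print('row group', i, 'has', meta.row_group(i).num_rows, 'rows')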
@gudata

gudata commented Oct 7, 2020

Hi,
How do you explain that when I run this method once, it produces a file of, let's say, 20 KB, but when I run the same method on just the file from the previous run, I get a file that is 10 KB? The file content is identical, but the schema is different.

@jtlz2

jtlz2 commented Apr 20, 2022

Awesome!
You mention "suffering from problems" - what are these?

@NickCrews

NickCrews commented Aug 24, 2022

@jtlz2 The problem is the one described in https://issues.apache.org/jira/browse/PARQUET-1115: if you started with small row groups, you will still have small row groups.

See https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c for a solution that

  1. Fixes the above problem
  2. Isn't memory limited because it uses streaming
  3. Is more flexible
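
A minimal sketch of that streaming idea (illustrative only, not the linked gist's actual code; the helper name and the 1M-row threshold are assumptions):

import pyarrow as pa
import pyarrow.parquet as pq

def merge_streaming(paths, target_path, rows_per_group=1_000_000):
    # Hypothetical helper: stream record batches from many small files
    # and flush them as large row groups, bounding memory use to roughly
    # rows_per_group rows at a time.
    schema = pq.ParquetFile(paths[0]).schema_arrow
    with pq.ParquetWriter(target_path, schema, compression='gzip') as writer:
        buffered, buffered_rows = [], 0
        for path in paths:
            for batch in pq.ParquetFile(path).iter_batches():
                buffered.append(batch)
                buffered_rows += batch.num_rows
                if buffered_rows >= rows_per_group:
                    # One large row group instead of many small ones.
                    writer.write_table(pa.Table.from_batches(buffered, schema=schema),
                                       row_group_size=rows_per_group)
                    buffered, buffered_rows = [], 0
        if buffered:
            writer.write_table(pa.Table.from_batches(buffered, schema=schema))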

@l1x
Author

l1x commented Aug 25, 2022

> @jtlz2 The problem is the one described in https://issues.apache.org/jira/browse/PARQUET-1115: if you started with small row groups, you will still have small row groups.
>
> See https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c for a solution that
>
>   1. Fixes the above problem
>   2. Isn't memory limited because it uses streaming
>   3. Is more flexible

Perfect! Thanks Nick!

@yygwak

yygwak commented Sep 5, 2022

Hello. I tried running your code and got the following message:

realloc of size 67108928 failed

Do you know how to solve this problem? Would appreciate some help!

@l1x
Author

l1x commented Sep 5, 2022

> Hello. I tried running your code and got the following message:
>
> realloc of size 67108928 failed
>
> Do you know how to solve this problem? Would appreciate some help!

Use Nick's version; it does not have an O(n) memory requirement.

@yygwak

yygwak commented Sep 5, 2022

Oh, it works now. Thanks! Just one more thing: I can't seem to read the new combined Parquet file with the following code:

parquet_file2 = r'C:\Users\82103\Desktop\by_person\combined.parquet'
pd.read_parquet(parquet_file2, engine='auto')

This code also fails with a similar message: ArrowMemoryError: malloc of size 8388608 failed. Are you aware of a way to work around this kind of problem when reading a huge Parquet file?
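
(For anyone hitting this later: one standard pyarrow workaround, not specific to this thread, is to stream the file in record batches instead of loading it whole with pd.read_parquet; the batch size below is just an illustrative value.)

import pyarrow.parquet as pq

pf = pq.ParquetFile(r'C:\Users\82103\Desktop\by_person\combined.parquet')
for batch in pf.iter_batches(batch_size=65_536):
    # Convert and process one chunk at a time, then let it be freed.
    df = batch.to_pandas()
    print(len(df))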

@mmore500

PSA: If you're looking for a pre-packaged CLI to grab and go, give joinem a try, available via PyPI: python3 -m pip install joinem.

joinem provides a CLI for fast, flexible concatenation of tabular data using polars.
I/O is lazily streamed in order to give good performance when working with numerous, large files.

Example Usage

Pass input files via stdin and output file as an argument.

ls -1 path/to/*.parquet | python3 -m joinem out.parquet

You can add the --progress flag to get a progress bar.

No-install Containerized Interface

If you are working in an HPC environment, joinem can also be conveniently used via singularity/apptainer.

ls -1 *.pqt | singularity run docker://ghcr.io/mmore500/joinem out.pqt

Further Information

joinem is also compatible with CSV, JSON, and feather file types.
See the project's README for more usage examples and a full command-line interface API listing.

Disclosure: I am the author of the joinem library.
