import os
import pyarrow.parquet as pq

#
# Warning!!!
# Suffers from the same problem as the parquet-tools merge function
#
# parquet-tools merge:
# Merges multiple Parquet files into one. The command doesn't merge row groups,
# just places one after the other. When used to merge many small files, the
# resulting file will still contain small row groups, which usually leads to bad
# query performance.
def combine_parquet_files(input_folder, target_path):
    try:
        # Read every file in the folder fully into memory (the O(n) memory cost
        # discussed in the comments below).
        files = []
        for file_name in os.listdir(input_folder):
            files.append(pq.read_table(os.path.join(input_folder, file_name)))
        with pq.ParquetWriter(target_path,
                              files[0].schema,
                              version='2.6',  # the original used '2.0', which newer pyarrow rejects; valid values are '1.0', '2.4', '2.6'
                              compression='gzip',
                              use_dictionary=True,
                              data_page_size=2097152,  # 2MB
                              write_statistics=True) as writer:
            for f in files:
                writer.write_table(f)
    except Exception as e:
        print(e)

combine_parquet_files('data', 'combined.parquet')
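For anyone wanting to verify the warning above, here is a quick hedged check (it assumes the 'combined.parquet' output from the example call). Each writer.write_table() call writes its input table as its own row group, so many small input files yield many small row groups:

import pyarrow.parquet as pq

pf = pq.ParquetFile('combined.parquet')
print(pf.num_row_groups)                  # roughly one row group per input file
print(pf.metadata.row_group(0).num_rows)  # rows in the first (small) row group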
@jtlz2 The problem is the one described in https://issues.apache.org/jira/browse/PARQUET-1115: if you started with small row groups, you will still have small row groups.
See https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c for a solution (sketched after this list) that
- Fixes the above problem
- Isn't memory limited because it uses streaming
- Is more flexible
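For reference, a minimal sketch of that streaming idea (not Nick's exact code; see his gist for the full version). It assumes all input files share a schema and a recent pyarrow: batches are read lazily and flushed as large row groups, so memory stays bounded by rows_per_group:

import pyarrow as pa
import pyarrow.parquet as pq

def merge_parquet_streaming(paths, target_path, rows_per_group=1_000_000):
    # Schema is taken from the first file; all inputs are assumed to match it.
    schema = pq.ParquetFile(paths[0]).schema_arrow
    with pq.ParquetWriter(target_path, schema) as writer:
        batches, buffered = [], 0
        for path in paths:
            for batch in pq.ParquetFile(path).iter_batches():
                batches.append(batch)
                buffered += batch.num_rows
                if buffered >= rows_per_group:
                    # Coalesce many small batches into one large row group.
                    writer.write_table(pa.Table.from_batches(batches, schema=schema))
                    batches, buffered = [], 0
        if batches:
            # Flush whatever is left as a final row group.
            writer.write_table(pa.Table.from_batches(batches, schema=schema))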
Perfect! Thanks Nick!
Hello. I tried running your code and got the following message:
realloc of size 67108928 failed
Do you know how to solve this problem? Would appreciate some help!
Use Nick's version; it does not have an O(n) memory requirement.
Oh, it works now. Thanks! Just one more thing: I can't seem to read the new combined parquet file with the following code:
parquet_file2 = r'C:\Users\82103\Desktop\by_person\combined.parquet'
pd.read_parquet(parquet_file2, engine='auto')
This code also returns a similar message: ArrowMemoryError: malloc of size 8388608 failed. Are you aware of a way to go about this kind of problem when reading a huge parquet file?
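One common workaround (a minimal sketch, assuming pyarrow and a file with multiple row groups; 'combined.parquet' stands in for your actual path) is to read one row group at a time instead of the whole file at once:

import pyarrow.parquet as pq

pf = pq.ParquetFile('combined.parquet')
for i in range(pf.num_row_groups):
    # Read and process one row group's worth of rows at a time, so peak
    # memory stays around the size of a single row group.
    chunk = pf.read_row_group(i).to_pandas()
    # ... process chunk here ...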
PSA: If you're looking for a pre-packaged CLI to grab and go, give joinem a try, available via PyPI: python3 -m pip install joinem
joinem provides a CLI for fast, flexible concatenation of tabular data using polars.
I/O is lazily streamed in order to give good performance when working with numerous, large files.
Example Usage
Pass input files via stdin and output file as an argument.
ls -1 path/to/*.parquet | python3 -m joinem out.parquet
You can add the --progress flag to get a progress bar.
No-install Containerized Interface
If you are working in a HPC environment, joinem can also be conveniently used via singularity/apptainer.
ls -1 *.pqt | singularity run docker://ghcr.io/mmore500/joinem out.pqt
Further Information
joinem is also compatible with CSV, JSON, and feather file types.
See the project's README for more usage examples and a full command-line interface API listing.
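The same stdin/argument pattern presumably carries over to those other formats; for example (an untested sketch with hypothetical file names, inferred from the parquet usage above):

ls -1 logs/*.csv | python3 -m joinem combined.csv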
Disclosure: I am the author of joinem.