Parquet Examples
Parquet Example Programs
========================
These example programs demonstrate simple interactions with Parquet files using Python.
1. `write_parquet.py`: This program generates a small example Parquet file. It shows the ease of creating Parquet files with Python using the `pandas` library.
2. `read_parquet.py`: This program reads and displays the contents of the example Parquet file generated by `write_parquet.py`.
3. `describe_parquet.py`: This program demonstrates how to read a Parquet file and extract information such as the file size, row count, and column names, using the `pyarrow` library.
4. `filter_parquet.py`: This program demonstrates efficient filtering of a Parquet file by applying "predicate pushdown", so that only the matching rows are loaded.
Dependencies:
- Python 3.x
- `pandas` library
- `pyarrow` library
For more information about Parquet files and the libraries, please refer to the official documentation:
- parquet: https://parquet.apache.org/documentation/latest/
- pandas: https://pandas.pydata.org/
- pyarrow: https://arrow.apache.org/docs/python/index.html
# describe_parquet.py
import os

import pyarrow.parquet as pq


def describe_parquet(file_path):
    # Size of the file on disk, in bytes
    file_size = os.path.getsize(file_path)
    print(f"File Size: {file_size} bytes")

    # Load the file into an Arrow table to inspect its shape
    table = pq.read_table(file_path)
    columns = table.column_names
    print(f"Number of rows: {table.num_rows}")
    print(f"Number of columns: {len(columns)}")
    print("Columns:")
    for column in columns:
        print(column)


describe_parquet("example.parquet")
# Output:
# File Size: 3557 bytes
# Number of rows: 7
# Number of columns: 4
# Columns:
# Captain
# Actor
# Ship
# Quote
# filter_parquet.py
import pyarrow.parquet as pq
# To extract entries from a Parquet file where the Ship column exactly matches `USS Enterprise-D`
# without loading the entire file into memory, you can use the filters argument in PyArrow's `read_table`
# function to apply "predicate pushdown". This method allows you to specify conditions that are used to
# filter data during the read operation, which can significantly reduce memory usage by only loading the
# relevant subset of data. This is useful when dealing with Common Crawl's indexes, because they're huge!
# Define filters to apply predicate pushdown
# Here we specify that we only want rows where the 'Ship' column is 'USS Enterprise-D'
filters = [('Ship', '=', 'USS Enterprise-D')]
# Read the Parquet file with the filters applied to avoid loading a monstrously large file into memory
table = pq.read_table('example.parquet', filters=filters)
# Convert to Pandas DataFrame for easier viewing/manipulation (optional)
filtered_df = table.to_pandas()
print(filtered_df)
# Output:
#            Captain            Actor              Ship         Quote
# 0  Jean-Luc Picard  Patrick Stewart  USS Enterprise-D   Make it so.
# 1   Edward Jellico        Ronny Cox  USS Enterprise-D  Get it done.
# read_parquet.py
import pandas as pd
# Read the Parquet file into a DataFrame
df = pd.read_parquet('example.parquet')
# Display the contents of the DataFrame
print("Contents of the Parquet file:")
print(df)
# Output:
# Contents of the Parquet file:
#             Captain            Actor              Ship                            Quote
# 0     James T. Kirk  William Shatner    USS Enterprise              Beam me up, Scotty!
# 1   Jean-Luc Picard  Patrick Stewart  USS Enterprise-D                      Make it so.
# 2    Benjamin Sisko     Avery Brooks      Deep Space 9                 It's a faaaaake!
# 3   Kathryn Janeway     Kate Mulgrew       USS Voyager   There's coffee in that nebula.
# 4   Jonathan Archer     Scott Bakula  Enterprise NX-01  We're not out here to play God.
# 5  William T. Riker  Jonathan Frakes         USS Titan         I love surprise parties.
# 6    Edward Jellico        Ronny Cox  USS Enterprise-D                     Get it done.
# write_parquet.py
import pandas as pd
# Create a sample DataFrame
data = {
    'Captain': ['James T. Kirk', 'Jean-Luc Picard', 'Benjamin Sisko', 'Kathryn Janeway', 'Jonathan Archer', 'William T. Riker', 'Edward Jellico'],
    'Actor': ['William Shatner', 'Patrick Stewart', 'Avery Brooks', 'Kate Mulgrew', 'Scott Bakula', 'Jonathan Frakes', 'Ronny Cox'],
    'Ship': ['USS Enterprise', 'USS Enterprise-D', 'Deep Space 9', 'USS Voyager', 'Enterprise NX-01', 'USS Titan', 'USS Enterprise-D'],
    'Quote': ['Beam me up, Scotty!', 'Make it so.', "It's a faaaaake!", "There's coffee in that nebula.", "We're not out here to play God.", 'I love surprise parties.', 'Get it done.']
}
df = pd.DataFrame(data)
# Write DataFrame to Parquet file
df.to_parquet('example.parquet', index=False)
print("Parquet file 'example.parquet' has been created successfully.")