@kojix2
Created March 30, 2023 04:09

red-datasets-parquet

red-datasets-parquet is a Ruby gem that provides datasets distributed in the Apache Parquet format and exposes them as Apache Arrow tables. It includes datasets from the New York City Taxi and Limousine Commission (TLC), such as green and yellow taxi trip records.

Installation

You can install the red-datasets-parquet gem by adding the following line to your Gemfile:

gem 'red-datasets-parquet'

Then execute:

$ bundle

Or install it yourself as:

$ gem install red-datasets-parquet

Usage

Here's an example of how you can use the red-datasets-parquet gem to access the taxi trip datasets:

require 'datasets-parquet'

# Get the green taxi trip dataset for January 2022
green_taxi_trips = Datasets::TLC::GreenTaxiTrip.new(year: 2022, month: 1)

# Access the dataset as an Apache Arrow table
green_arrow_table = green_taxi_trips.to_arrow

# Iterate over the green taxi trip records
green_taxi_trips.each do |trip|
  # Access data about the taxi trip
  p trip.vendor
  p trip.lpep_pickup_datetime
  p trip.lpep_dropoff_datetime
  # ...
end

# Get the yellow taxi trip dataset for January 2022
yellow_taxi_trips = Datasets::TLC::YellowTaxiTrip.new(year: 2022, month: 1)

# Access the dataset as an Apache Arrow table
yellow_arrow_table = yellow_taxi_trips.to_arrow

# Iterate over the yellow taxi trip records
yellow_taxi_trips.each do |trip|
  # Access data about the taxi trip
  p trip.vendor
  p trip.tpep_pickup_datetime
  p trip.tpep_dropoff_datetime
  # ...
end
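Each dataset object covers a single month, so working with a longer period means building several of them. The sketch below is not part of red-datasets-parquet: `monthly_datasets` and `mean_trip_seconds` are hypothetical helpers. It assumes only what the example above shows (a constructor taking `year:` and `month:`, and an `each` method yielding trip records), plus the unverified assumption that the pickup and dropoff readers return Ruby `Time` objects.

```ruby
# Hypothetical helper (not provided by the gem): build one dataset object
# per month, assuming the class accepts `year:` and `month:` keywords.
def monthly_datasets(dataset_class, year:, months:)
  months.map { |month| dataset_class.new(year: year, month: month) }
end

# Hypothetical helper: mean trip duration in seconds.
# Assumes each record responds to the given pickup/dropoff reader names
# and that both readers return Time objects (an assumption, not verified).
def mean_trip_seconds(trips, pickup:, dropoff:)
  total = 0.0
  count = 0
  trips.each do |trip|
    total += trip.public_send(dropoff) - trip.public_send(pickup)
    count += 1
  end
  # Return nil rather than dividing by zero when there are no trips.
  count.zero? ? nil : total / count
end
```

With the gem loaded, a call might look like `monthly_datasets(Datasets::TLC::GreenTaxiTrip, year: 2022, months: 1..3)`, followed by `mean_trip_seconds(dataset, pickup: :lpep_pickup_datetime, dropoff: :lpep_dropoff_datetime)` for each result. Passing the reader names as arguments lets the same helper serve the yellow taxi records, which use the `tpep_` prefix instead.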

Datasets

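This gem currently provides datasets including the following:

  • Datasets::TLC::GreenTaxiTrip: green taxi trip records
  • Datasets::TLC::YellowTaxiTrip: yellow taxi trip records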

Note that the datasets provided by this gem may be updated periodically, so the data you receive may differ from the examples shown above.

License

The red-datasets-parquet code is available under the MIT License. Please note that the datasets themselves may be subject to different licenses and terms of use.
