@shiumachi
Last active December 28, 2018 06:22
Convert multiple CSV files split by day into monthly Parquet files
# This script compacts daily CSV files into one Parquet file per month.
# The CSV files must be named in "YYYY-MM-DD.csv" format.
#
from glob import glob

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# configure this parameter
year = 2017

for month in range(1, 13):
    # Collect this month's daily files (e.g. "2017-01-*.csv") in date order.
    filenames = sorted(glob("{}-{:02}-*.csv".format(year, month)))
    if not filenames:
        # No CSV files for this month: skip it rather than write an empty file.
        continue
    # DataFrame.append was removed in pandas 2.0, so concatenate in one step.
    df = pd.concat([pd.read_csv(f) for f in filenames], ignore_index=True)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, "{}-{:02}.parq".format(year, month), compression="gzip")
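
The script relies on the "YYYY-MM-DD.csv" naming convention to decide which month a file belongs to. The following is a minimal sketch, using only the standard library, of how that grouping could be done with stricter validation. The helper `group_by_month` and the pattern `DAILY_CSV` are hypothetical names introduced here for illustration and are not part of the original script.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical pattern for the "YYYY-MM-DD.csv" convention described above.
DAILY_CSV = re.compile(r"(\d{4})-(\d{2})-(\d{2})\.csv")


def group_by_month(filenames):
    """Map (year, month) to the sorted daily CSV names that belong to it."""
    groups = defaultdict(list)
    for name in filenames:
        match = DAILY_CSV.fullmatch(name)
        if match is None:
            continue  # name is not in YYYY-MM-DD.csv form
        year, month, day = (int(part) for part in match.groups())
        try:
            date(year, month, day)  # raises ValueError for impossible dates
        except ValueError:
            continue
        groups[(year, month)].append(name)
    return {key: sorted(names) for key, names in groups.items()}


if __name__ == "__main__":
    names = ["2017-01-02.csv", "2017-01-01.csv", "2017-02-30.csv",
             "notes.csv", "2017-03-15.csv"]
    for (year, month), files in sorted(group_by_month(names).items()):
        print("{}-{:02}: {}".format(year, month, files))
```

Grouping the names once up front means each month's file list could be passed straight to `pd.read_csv`, and files that do not follow the convention are ignored instead of being matched by accident.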