@seirl
Created April 7, 2021 07:07
Concatenate small ORC files into a single output using PyORC
#!/usr/bin/env python3
# Copyright (c) 2021 Antoine Pietri <antoine.pietri1@gmail.com>
# SPDX-License-Identifier: GPL-3.0

import argparse

import pyorc


def main():
    parser = argparse.ArgumentParser()
    # The output is mandatory: without it, pyorc.Writer would receive None.
    parser.add_argument('-o', '--output', type=argparse.FileType(mode='wb'),
                        required=True)
    parser.add_argument('files', type=argparse.FileType(mode='rb'), nargs='+')
    args = parser.parse_args()

    # The schema of the first file is the reference for all the others.
    schema = str(pyorc.Reader(args.files[0]).schema)

    with pyorc.Writer(args.output, schema) as writer:
        for i, f in enumerate(args.files):
            reader = pyorc.Reader(f)
            if str(reader.schema) != schema:
                raise RuntimeError(
                    "Inconsistent ORC schemas.\n"
                    "\tFirst file schema: {}\n"
                    "\tFile #{} schema: {}"
                    .format(schema, i, str(reader.schema))
                )
            # Copy every row of the input file into the single output.
            for line in reader:
                writer.write(line)


if __name__ == '__main__':
    main()
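
The script compares each file's schema only when it reaches that file inside the write loop, so a mismatch in a late file is reported after rows from earlier files have already been written. One possible variation is to compare all the schema strings before writing anything. The sketch below shows only that comparison step on plain strings. The helper name `first_schema_mismatch` and the example schema strings are hypothetical and not part of the original script. In practice the strings would come from `str(pyorc.Reader(f).schema)`, as in the script.

```python
from typing import Optional, Sequence, Tuple


def first_schema_mismatch(schemas: Sequence[str]) -> Optional[Tuple[int, str]]:
    """Return (index, schema) of the first entry that differs from schemas[0].

    Returns None when every schema matches or the sequence is empty.
    """
    if not schemas:
        return None
    reference = schemas[0]
    for index, schema in enumerate(schemas):
        if schema != reference:
            return index, schema
    return None


# Hypothetical schema strings, standing in for str(pyorc.Reader(f).schema).
example = [
    "struct<id:int,name:string>",
    "struct<id:int,name:string>",
    "struct<id:int>",
]
mismatch = first_schema_mismatch(example)
if mismatch is not None:
    print("File #{} has a different schema: {}".format(*mismatch))
```

The index is 0-based, matching the `File #{}` numbering in the script's own error message.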