@89465127
Created June 6, 2013 18:10
Read flat Hadoop files into Python. Uses a generator, so lines are streamed one at a time rather than loaded into memory all at once.
import glob
import os


def filelist(path, _filter="part-*"):
    # Expand "~" and resolve to an absolute path before globbing.
    basepath = os.path.abspath(os.path.expanduser(path))
    return glob.glob(os.path.join(basepath, _filter))


def hfile(path, _filter="part-*"):
    # Yield lines one at a time from every file matching the filter.
    for filename in filelist(path, _filter):
        with open(filename) as f:
            for line in f:
                yield line

''' Usage example:
from open_hadoop import hfile
for line in hfile('./input/path/'):
    print(line)
'''

''' Installation:
- Place open_hadoop.py in your site-packages directory.
- Your site-packages directory can be located by running:
    python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])"
  (distutils, which older instructions rely on, was removed in Python 3.12.)
'''
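
Because hfile is a generator, downstream steps can stay lazy too. The sketch below is one possible way to build on it, not part of the original gist: `records` and `head` are hypothetical helper names, and the tab separator assumes the part files hold tab-delimited key/value text, which may not match your data. `hfile` is repeated (with a `sorted()` call added so files are read in a predictable order) only so the block runs on its own.

```python
import glob
import os
from itertools import islice


def hfile(path, _filter="part-*"):
    """Yield lines from every file in `path` that matches `_filter`."""
    basepath = os.path.abspath(os.path.expanduser(path))
    # sorted() is an addition in this sketch; glob order is otherwise arbitrary.
    for filename in sorted(glob.glob(os.path.join(basepath, _filter))):
        with open(filename) as f:
            for line in f:
                yield line


def records(path, sep="\t"):
    """Hypothetical helper: lazily split each line into a (key, value) pair.

    Assumes one record per line with `sep` between key and value.
    A line without `sep` yields (line, "").
    """
    for line in hfile(path):
        key, _, value = line.rstrip("\n").partition(sep)
        yield key, value


def head(path, n=5):
    """Hypothetical helper: return the first n records without reading the rest."""
    return list(islice(records(path), n))
```

Calling `head('./input/path/', 10)` stops reading as soon as ten records have been produced, so a quick look at a large output directory does not scan every part file.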