Hadoop Distributed File System on HPC

Introduction

The department's HPC platform offers users 25 TB of storage space within the [Hadoop Distributed File System](http://www.aosabook.org/en/hdfs.html) (HDFS). This disk space is designed to store large datasets accessible by programs built around the Map/Reduce pattern and running on the Hadoop platform.

User disk space

If you are user fred, you can view the contents of your personal HDFS directory using the hadoop command:

hadoop fs -ls /users/fred

HDFS via Shell

From the command line you can use HDFS much like any other Linux file system. For example:

hadoop fs -cat /users/fred/my_big_file.txt | grep -i 'hello world'

Refer to the Hadoop File System Shell Guide to see all the commands available to you.

Data access via Python

The Pydoop package has been installed on the HPC cluster and is currently available on the head node. It includes an HDFS API that allows Python programs to access the HDFS file system directly.

The HDFS API tutorial provides some simple examples of how to write Python programs using this API.
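
As a quick illustration, the shell example shown earlier could be written in Python along the following lines. This is a minimal, untested sketch: it assumes Pydoop's hdfs module is importable on the head node, and it reuses the example paths from above.

import pydoop.hdfs as hdfs

# List the contents of fred's HDFS home directory
for path in hdfs.ls("/users/fred"):
    print(path)

# Scan a file on HDFS for a string, as in the shell example above
with hdfs.open("/users/fred/my_big_file.txt") as f:
    line = f.readline()
    while line:
        if "hello world" in line.lower():
            print(line.rstrip())
        line = f.readline()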

HDFS web page

The HDFS namenode running on our head node offers a web interface that lets you examine the status of the HDFS. Simply establish an SSH tunnel to this service when you log in:

ssh -L 50070:localhost:50070  fred@cbe-hpc-head.anu.edu.au

Then point your web browser at http://localhost:50070

Running Map/Reduce jobs

This capability has not yet been fully installed but will be available very soon.
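
Once it is, a simple job could be expressed with Pydoop Script roughly as follows. This is a hypothetical sketch of the classic word-count example using the Pydoop Script mapper/reducer signatures; the input and output paths in the comment are placeholders, not real cluster paths.

# wordcount.py -- word count as a Pydoop Script job.
# Once Map/Reduce support is installed, it could be launched with:
#   pydoop script wordcount.py /users/fred/input /users/fred/output

def mapper(key, value, writer):
    # 'value' is one line of input text; emit each word with a count of 1
    for word in value.split():
        writer.emit(word, 1)

def reducer(word, icounts, writer):
    # 'icounts' iterates over all the counts emitted for this word
    writer.emit(word, sum(int(c) for c in icounts))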

Caveats

  1. Your data on HDFS will not be backed up. Please ensure that you can recreate your datasets in the event that they are lost or corrupted. That said, HDFS keeps redundant copies of all data across three separate hosts, so data loss due to disk failure would be most uncommon.
  2. Our HDFS platform is currently in 'experimental' status. The platform and the data may not be available at all times.