There is a file called path.py (https://github.com/metabrainz/listenbrainz-labs/blob/master/listenbrainz_spark/path.py) that contains the paths to directories we need in HDFS, e.g.: DATAFRAME_DIR = os.path.join('/', 'recommendation', 'dataframe')
One use of path.py is here: https://github.com/metabrainz/listenbrainz-labs/blob/master/manage.py#L57
Now, there is another file, create_dataframes.py, which needs the path info, here: https://github.com/metabrainz/listenbrainz-labs/blob/master/listenbrainz_spark/recommendations/create_dataframes.py#L86
The path needed in create_dataframes.py should look something like this: hdfs://hadoop-master:9000/recommendation/dataframe/user.py
os.path.join discards everything before a component that starts with '/', so I built the path as 'hdfs://hadoop-master:9000' + path.DATAFRAME_DIR + '/user.py',
which looks very weird to me.
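To make the behaviour concrete, here is a minimal sketch of what os.path.join does with an absolute component, plus one way to keep the concatenation tidy (the variable names HDFS_CLUSTER_URI and full_path are illustrative, not taken from the repo):

```python
import os

# os.path.join restarts whenever a component is absolute, so everything
# before a component beginning with '/' is thrown away:
joined = os.path.join('hdfs://hadoop-master:9000', '/recommendation', 'dataframe')
print(joined)  # '/recommendation/dataframe' -- the scheme and host are gone

# One alternative: keep the namenode URI separate and only ever
# concatenate it in front of the already-absolute HDFS path.
HDFS_CLUSTER_URI = 'hdfs://hadoop-master:9000'
DATAFRAME_DIR = os.path.join('/', 'recommendation', 'dataframe')
full_path = HDFS_CLUSTER_URI + os.path.join(DATAFRAME_DIR, 'user.py')
print(full_path)  # 'hdfs://hadoop-master:9000/recommendation/dataframe/user.py'
```

This is still string concatenation under the hood, so it is essentially the same approach, just with the cluster URI factored out into one constant instead of being repeated at every call site.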
Also, is it better to create all the required directories beforehand from a single file (manage.py), or should each directory be created by the script that requires it?
I would specifically like your review of this PR: https://github.com/metabrainz/listenbrainz-labs/pull/46/commits/80fb5fe22c0cd77544a467ddb52d3bda9c206137