@inchoate · Last active February 2, 2024
Adding an extra package to a Python Dataflow project to run on GCP

The Problem

The documentation for deploying a pipeline with extra, non-PyPI, pure-Python packages on GCP is missing some detail. This gist shows how to package and deploy an external, pure-Python, non-PyPI dependency to a managed Dataflow pipeline on GCP.

TL;DR: Your external package needs to be a proper Python (source or binary) distribution, packaged and shipped alongside your pipeline. It is not enough to hand the pipeline a tar file that merely contains a setup.py.

Preparing the External Package

Your external package must have a proper setup.py. What follows is an example setup.py for our ETL package. It packages version 1.1.1 of the etl library. The library requires three PyPI packages to run, specified in the install_requires field. The package also ships custom JSON data, declared in the package_data section. Finally, the setuptools.find_packages function searches for all available packages and returns that list:

# ETL's setup.py
from setuptools import setup, find_packages
setup(
    name='etl',
    version='1.1.1',
    install_requires=[
        'nose==1.3.7',
        'datadiff==2.0.0',
        'unicodecsv==0.14.1'
    ],
    description='ETL tools for API v2',
    packages=find_packages(),
    package_data={
        'etl.lib': ['*.json']
    }
)

Otherwise, there is nothing special about this setup.py file.
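
For context, the package_data entry is what makes those JSON files available to the etl code once the package is installed on the Dataflow workers. Below is a minimal sketch of how the library might read one of them at runtime; the etl/lib/loader.py module and the lookup.json filename are hypothetical, not part of the original gist:

# etl/lib/loader.py (hypothetical module, for illustration only)
import json

import pkg_resources  # provided by setuptools


def load_lookup():
    # Resolve lookup.json relative to the installed etl.lib package, which
    # works both from a source checkout and from the sdist installed on
    # the Dataflow workers.
    raw = pkg_resources.resource_string('etl.lib', 'lookup.json')
    return json.loads(raw.decode('utf-8'))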

Building the External Package

You need to create a real source (or binary) distribution of your external package. To do so, run the following in your external package's directory:

python setup.py sdist --formats=gztar

The last few lines of output should look like this:

hard linking etl/transform/user.py -> etl-1.1.1/etl/transform
Writing etl-1.1.1/setup.cfg
Creating tar archive
removing 'etl-1.1.1' (and everything under it)

The output of this command, if it runs successfully, is a source distribution of your package, suitable for inclusion in the pipeline project. Look in your ./dist directory for the file:

14:55 $ ls dist
etl-1.1.1.tar.gz
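
Before wiring the archive into the pipeline, it is worth confirming that it actually contains the JSON files declared in package_data. A small standard-library sketch; the archive path matches the output above:

# check_sdist.py -- list the sdist contents and flag missing JSON data
import tarfile

with tarfile.open('dist/etl-1.1.1.tar.gz', 'r:gz') as tar:
    names = tar.getnames()

for name in names:
    print(name)

# The data files should appear under etl-1.1.1/etl/lib/
assert any(name.endswith('.json') for name in names), \
    'package_data JSON files are missing from the sdist'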

Preparing the Pipeline Project

  • Create a dist/ (or similar extra-packages) directory in your pipeline project and place the file you just built in the previous step there:
cd pipeline-project
mkdir dist/
cp ~/etl-project/dist/etl-1.1.1.tar.gz dist/
  • Let the pipeline know you intend to include this package by using the --extra_package command-line argument (a programmatic equivalent is sketched after the command):
18:38 $ python dataflow_main.py \
    --input=staging \
    --output=bigquery-staging \
    --runner=DataflowPipelineRunner \
    --project=realmassive-staging \
    --job_name dataflow-project-1 \
    --setup_file ./setup.py \
    --staging_location gs://dataflow/staging \
    --temp_location gs://dataflow/temp \
    --requirements_file requirements.txt \
    --extra_package dist/etl-1.1.1.tar.gz
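
With the current Apache Beam SDK (where the runner is named DataflowRunner rather than DataflowPipelineRunner), the same flags can also be set programmatically when constructing the pipeline. This is a sketch under that assumption, reusing the project, bucket, and package paths from the command above and omitting the gist's own --input/--output flags:

# dataflow_main.py (sketch) -- passing the same flags in code
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=realmassive-staging',
    '--job_name=dataflow-project-1',
    '--setup_file=./setup.py',
    '--staging_location=gs://dataflow/staging',
    '--temp_location=gs://dataflow/temp',
    '--requirements_file=requirements.txt',
    '--extra_package=dist/etl-1.1.1.tar.gz',
])

with beam.Pipeline(options=options) as pipeline:
    ...  # build the pipeline as usual; the etl package is importable on the workers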
@tjwebb commented Jan 4, 2021

This is cool. One note to add: this doesn't really work in combination with requirements.txt, since the workers pip install -r requirements.txt before running setup.py. So you'll need to install any Python modules that rely on OS packages in setup.py.
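
One way to do that (not shown in the original thread) is to run the OS-level install from a custom command in the pipeline's setup.py, so it executes on each worker before the Python build. A hedged sketch; libsnappy-dev is a placeholder for whatever OS package your modules actually need:

# setup.py (sketch) -- install OS packages on each worker before the build
import subprocess

from setuptools import setup, find_packages
from setuptools.command.build_py import build_py


class BuildWithOSPackages(build_py):
    """Run OS-level installs, then the normal Python build."""

    def run(self):
        # 'libsnappy-dev' is a placeholder; substitute the OS package(s)
        # your Python modules depend on.
        subprocess.check_call(['apt-get', 'update'])
        subprocess.check_call(['apt-get', 'install', '--assume-yes', 'libsnappy-dev'])
        build_py.run(self)


setup(
    name='pipeline-project',
    version='0.0.1',
    packages=find_packages(),
    cmdclass={'build_py': BuildWithOSPackages},
)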

@elitongadotti

What about when the Dataflow pipeline is itself a package and I need to reference one package from another?
I've tried every option I could find, but no success so far 😢
