gimbo/pythonpath.rst

Python Path

Motivation

A whistle-stop tour of the techniques for getting your Python code into a place where the interpreter can get to work on it, followed by the implications for code structure and usability.
Namespaces

$ python -c "import this" | tail -1
Namespaces are one honking great idea -- let's do more of those!
The preliminary part of this paper is about the way that your File System gets mapped to Python Namespaces.
Module Search Path

6.1.2. The Module Search Path
When a module named spam is imported, the interpreter first searches for a built-in module with that name. If not found, it then searches for a file named spam.py in a list of directories given by the variable sys.path. sys.path is initialized from these locations:

The directory containing the input script (or the current directory when no file is specified).
PYTHONPATH (a list of directory names, with the same syntax as the shell variable PATH).
The installation-dependent default.
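The three sources listed above can be inspected from a running interpreter. The following is a minimal sketch rather than anything from the Python documentation: the helper name `label_search_path` is hypothetical, and it only separates entries that match PYTHONPATH from everything else.

```python
import os
import sys


def label_search_path(entries=None, pythonpath=None):
    """Pair each search-path entry with a rough guess at where it came from."""
    entries = list(sys.path) if entries is None else list(entries)
    if pythonpath is None:
        pythonpath = os.environ.get("PYTHONPATH", "")
    # PYTHONPATH uses the same separator as PATH (':' on POSIX, ';' on Windows).
    from_env = {os.path.abspath(p) for p in pythonpath.split(os.pathsep) if p}
    labelled = []
    for entry in entries:
        if entry and os.path.abspath(entry) in from_env:
            source = "PYTHONPATH"
        else:
            source = "script directory or installation default"
        labelled.append((entry, source))
    return labelled


if __name__ == "__main__":
    for entry, source in label_search_path():
        print(f"{entry!r:60} {source}")
```

Running this with and without PYTHONPATH set is a quick way to see which entries the environment variable contributes.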
Addressing each of these in turn...
The Current Directory

The directories in your current working directory are available for import.
$ mkdir foo
$ mkdir foo/bar
$ cd foo
$ python3 -c "import bar"
PYTHONPATH

The PYTHONPATH environment variable allows directories to be added to the module search path.
Continuing the example above, initially we can't import bar because it isn't a subdirectory of the CWD. However, if we add its parent directory, foo, to the PYTHONPATH, the import succeeds and we can see that foo has become part of sys.path.
$ cd ..
$ python3 -c "import bar"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named 'bar'
$ export PYTHONPATH=/path/to/foo
$ python3 -c "import bar"
$ python3 -c "import sys; print('/path/to/foo' in sys.path)"
True
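The same experiment can be scripted. This is a sketch under a few assumptions: the helper `can_import` and the directory name `bar_demo` are made up for illustration, and a temporary directory stands in for /path/to/foo. Each check runs in a fresh interpreter so that PYTHONPATH is read at start-up, as it is in the shell session above.

```python
import os
import subprocess
import sys
import tempfile


def can_import(module_name, pythonpath=None):
    """Return True if a fresh interpreter can import module_name."""
    env = dict(os.environ)
    env.pop("PYTHONPATH", None)  # start from a clean slate
    if pythonpath is not None:
        env["PYTHONPATH"] = pythonpath
    result = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


with tempfile.TemporaryDirectory() as foo:
    # 'bar_demo' is a hypothetical name, unlikely to clash with a real module.
    os.mkdir(os.path.join(foo, "bar_demo"))
    without_pythonpath = can_import("bar_demo")
    with_pythonpath = can_import("bar_demo", pythonpath=foo)

print(without_pythonpath, with_pythonpath)
```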
Installation-Dependent Default

When you work with Python libraries another mechanism is generally employed - you would install the library.
Installation is generally performed by:

Directly via Distutils.
A Package Manager.

In both cases the code needs to be structured according to the packaging standards, i.e. the Python library would be a package (and probably obtained from PyPI).
Distutils

Python was traditionally installed at the package level and this is still valid:
python setup.py install
Python Packages

Python Packaging standards are well defined. I'll defer to Python Packaging Expert Tarek Ziadé for a tutorial on how to lay out your project.
So a properly packaged project looks like this:
TowelStuff/
    LICENSE.txt
    README.txt
    setup.py
    towelstuff/
        __init__.py
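For the layout above, the setup.py might look something like the sketch below. It is not taken from the tutorial: the version number and the empty dependency list are placeholders, and the guard on the command-line arguments is only there so that the file does nothing unless a command such as install is supplied.

```python
# setup.py: a hypothetical, minimal build script for the TowelStuff layout.
import sys

SETUP_KWARGS = {
    "name": "TowelStuff",
    "version": "0.1",             # placeholder version
    "packages": ["towelstuff"],   # the directory that holds __init__.py
    "install_requires": [],       # runtime dependencies would be listed here
}

# Only hand over to setuptools when a command (e.g. 'install') was given.
if __name__ == "__main__" and len(sys.argv) > 1:
    from setuptools import setup

    setup(**SETUP_KWARGS)
```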
Package Managers

Python has seen an evolution of package managers: easy_install was released as part of setuptools, and pip evolved from this.
The Package Manager will install appropriately packaged Python code and dependencies.
Linux flavours have python packages in their repositories wrapped accordingly.
Site-Packages

The site module is the site-specific configuration hook that is automatically imported during initialization.
The package manager installs your package into the site-packages directory.
$ sudo pip3 install towelstuff

$ python3 -c "import towel_stuff; print(towel_stuff.__file__)"
/usr/local/lib/python3.5/site-packages/towel_stuff/__init__.py

$ python3 -c "import sys; print('/usr/local/lib/python3.5/site-packages' in sys.path)"
True

$ ls /usr/local/lib/python3.5/site-packages | grep towel
towel_stuff
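The location of site-packages differs between platforms and installations, so it can be useful to ask the interpreter rather than hardcode it. A minimal sketch using the standard library's sysconfig module (the variable name `site_packages` is just for illustration):

```python
import sys
import sysconfig

# 'purelib' is the directory where pure-Python packages are installed
# for the interpreter running this code.
site_packages = sysconfig.get_paths()["purelib"]

print(site_packages)
print(site_packages in sys.path)
```

The second line of output shows whether that directory is on the search path of the current interpreter; the result depends on how Python was installed and started.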
VirtualEnv

A virtual environment allows installation into a local environment. The site-packages directory of the environment will be added to sys.path.
$ pyvenv my_venv
$ . my_venv/bin/activate
$ python3 -c "import sys; print('/Users/tomgalvin/my_venv/lib/python3.5/site-packages' in sys.path)"
True
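Code can also check whether it is running inside such an environment. A small sketch, assuming Python 3.3 or later (where sys.base_prefix exists); the function name is hypothetical:

```python
import sys


def in_virtual_environment():
    """True when running inside an environment created by venv."""
    # In a virtual environment sys.prefix points at the environment,
    # while sys.base_prefix still points at the underlying installation.
    return sys.prefix != sys.base_prefix


print(in_virtual_environment())
```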
requirements.txt

There is a mechanism specific to pip for specifying the requirements.
The attentive reader will have noticed that code which has been packaged will already have specified its requirements, so the effort is duplicated. In this context we can regard requirements.txt and virtualenv as a convenient complement to the core technologies that exist around the Python Path.

Prior to Python 3.3, filesystem directories, and directories within zipfiles, had to contain an __init__.py in order to be recognised as Python package directories. Even if there is no initialisation code to run when the package is imported, an empty __init__.py file is still needed for the interpreter to find any modules or subpackages in that directory.

This has changed in Python 3.3: now any directory on sys.path with a name that matches the package name being looked for will be recognised as contributing modules and subpackages to that package.
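The change can be observed directly. The sketch below builds a throwaway directory tree (the names `nsdemo_pkg` and `mod` are made up) that contains no __init__.py, puts its parent on sys.path and imports from it.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Note that no __init__.py is created anywhere in this tree.
    os.makedirs(os.path.join(root, "nsdemo_pkg"))
    with open(os.path.join(root, "nsdemo_pkg", "mod.py"), "w") as handle:
        handle.write("VALUE = 42\n")

    sys.path.insert(0, root)
    try:
        importlib.invalidate_caches()
        package = importlib.import_module("nsdemo_pkg")
        module = importlib.import_module("nsdemo_pkg.mod")
    finally:
        sys.path.remove(root)

# With no __init__.py behind the package, __file__ is either unset or None.
package_file = getattr(package, "__file__", None)
print(package_file, module.VALUE)
```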
The __init__.py file will be executed when the package is imported. This means that any code in the __init__.py will get executed as a side effect of the import:
$ touch foo/bar/__init__.py
$ echo "print('hello world')" > foo/bar/__init__.py
$ python3 -c "import bar"
hello world
What code should be included in the __init__.py is discussed here.
If we aspire to be Pythonic, though, we should recall that "Explicit is better than implicit." Code that is executed as a result of an import rather than a deliberate function call appears to fall under the definition of 'implicit' and should be avoided.
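To make the implicit behaviour concrete, the following sketch writes a print statement into the __init__.py of a temporary package (`noisy_demo_pkg` is a made-up name) and captures what the import itself writes to standard output.

```python
import contextlib
import importlib
import io
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    package_dir = os.path.join(root, "noisy_demo_pkg")
    os.makedirs(package_dir)
    with open(os.path.join(package_dir, "__init__.py"), "w") as handle:
        handle.write("print('hello world')\n")

    sys.path.insert(0, root)
    captured = io.StringIO()
    try:
        importlib.invalidate_caches()
        with contextlib.redirect_stdout(captured):
            # The first import executes __init__.py.
            importlib.import_module("noisy_demo_pkg")
            # The second finds the package in sys.modules and runs nothing.
            importlib.import_module("noisy_demo_pkg")
    finally:
        sys.path.remove(root)

side_effect_output = captured.getvalue()
print(repr(side_effect_output))
```

The caller never asked for anything to be printed; the output is purely a consequence of the import.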
Land Registry Code Structures

Land Registry Python code does not tend to use Python Packaging standards.
There are a couple of consequences of this:

There is no standard for the structure of the Python Apps within the Land Registry.
Land Registry code does not lend itself to the Installation ecosystem that exists in the community.

In turn this has knock-on effects: workarounds, misunderstandings, reinvented wheels and brittle design.
1. Limitations

We are limited in the way that the code gets on the path.
Symptoms:

Necessity to include bespoke shell scripts.
Difficulties in isolation - running modules independently
Assumptions about the Current Working Directory (os.getcwd())
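The last symptom is usually avoidable. As a sketch (the helper `resource_path` and the file name `settings.cfg` are hypothetical), files that ship with a module can be located relative to the module itself rather than relative to wherever the process happened to be started:

```python
from pathlib import Path

# Directory that holds this module, independent of the process's CWD.
# The fallback to the CWD only applies when __file__ is undefined,
# for example in an interactive session.
MODULE_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()


def resource_path(name):
    """Build the path to a file that ships alongside this module."""
    return MODULE_DIR / name


config_path = resource_path("settings.cfg")  # hypothetical file name
print(config_path)
```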

2. Second Class Application Structure

Ironically, for a package manager whose purpose is to install Python packages, pip's requirements.txt assists the creation of ad-hoc structures.
I understand this guidance was given to the Land Registry early on in its adoption of Python. I am not sure of the context of this advice, but it is worth being aware that there are those who take issue with this approach:
The sort of developers that only write applications don’t really understand packaging and are happy to hardcode an assortment of modules into their application and hook them in with the convenient requirements.txt. These developers will most likely tell people to set up a virtualenv and pip install -r requirements.txt. Fortunately if you have read this far then you no longer fall into this category!
Perhaps this ad-hoc approach to packaging can be explained from the context of a simple website where the code is effectively the root node of a hierarchical tree. The deployment and control of the service are the responsibility of the small band of Developers working on the project, so the lack of a formalised layout does not register as a concern: it is their problem and one that goes unnoticed. This hypothetical case is not Microservices. It isn't wise to fall foul of the fifth fallacy of Distributed Computing. Rather than laying out the application as if it were purely a consumer of other packages, package it so that it provides a service to others.
3. Technical Debt

There is a burden of exploring each new project structure and dealing with the peculiarities that arise from it.
4. Ragged Interfaces

The interfaces are not clearly defined... or respected.
All the effort that has gone into making Python Packages sit on the common platforms is lost. So rather than having a unit with clearly defined Entry Points that can be hooked into natively, the Land Registry has become reliant on a hodge-podge of shell scripts to support the running code. Different groups are interacting with the code in different ways. This breaks Bezos' Big Mandate, which I understand to be a Land Registry Design Goal.
5. Satellite Community

An unconventional structure puts Land Registry outside the community.
6. Automation

Without true standards we present challenges to the teams involved in Automation.
Related Issues

1. __init__.py

Putting code in __init__.py seems to be widespread in Flask applications and for that reason, despite my discomfort, I have continued this trend, though I would think again before doing this in future.
It possibly stems from the top ranking Flask Tutorial on Google.
I questioned Miguel on this and his response was:
"I don’t have anything against empty __init__.py files, in fact I’ve worked on OpenStack which is pretty much religious about this practice. But in many cases __init__.py allows you to provide better encapsulation for your package. You don’t always want the structure of the package to be known outside, since that prevents you from changing it or improving it in the future."
The rationale appears to be the need to separate Interface from Implementation (hello Gamma!). There are other ways of doing this without breaking Tim Peters' Zen. The most obvious is to use setuptools entry_points to specify the public interface, i.e. using Python Packaging.
Avoiding throwing novice Flask developers into Python Packaging might be understandable in the context of an internet tutorial. For the purposes of Production code the perceived benefit of using __init__.py in this way appears diminished.
By having a template with code in the __init__.py we have a familiar pitfall:
A commonly seen issue is to add too much code to __init__.py files. When the project complexity grows, there may be sub-packages and sub-sub-packages in a deep directory structure. In this case, importing a single item from a sub-sub-package will require executing all __init__.py files met while traversing the tree.

Leaving an __init__.py file empty is considered normal and even a good practice, if the package’s modules and sub-packages do not need to share any code.

2. Environment Variables

Environment Variables are being used as a means of getting parameters to the applications. This seems to stem from the Heroku 12 Factor application. There are counterpoints to this. I am not sure it is wise to take Design Principles from an organisation with an interest in promoting techniques relevant to its own service.
Environment Variables should not be seen as a panacea for controlling Application Config, for a number of reasons.

Environment Variables are certainly convenient in the way that Global Variables are. They fall down in similar ways.
One notable area of consistency in the Land Registry projects is the use of PORT; conflicts such as this impose restrictions on the way that communicating services are run in parallel.
They present a loose interface that is not easily queried by the user.
Similarly it can be difficult to determine what configuration has been applied. This affects fault finding and post mortems.
Finally, if the Variable is subsequently hardcoded into a shell script, or requires a bespoke Control Plane, the point of configurable convenience is somewhat lost.

As an alternative a proper Command Line Interface presents a clearer contract both at runtime and from code structure (once again defined by the entry points).
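As an illustration of that contract, here is a minimal sketch of a command line interface built with argparse. The program name `towelstuff-serve`, the module path `towelstuff.cli` and the default port are all assumptions made for the example, not anything that exists in a Land Registry project.

```python
import argparse


def build_parser():
    """Describe the options the application accepts, in one queryable place."""
    parser = argparse.ArgumentParser(
        prog="towelstuff-serve",  # hypothetical program name
        description="Run the (hypothetical) service.",
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8000,
        help="TCP port to listen on (default: %(default)s)",
    )
    return parser


def main(argv=None):
    """Entry point; argv defaults to the process's command-line arguments."""
    args = build_parser().parse_args(argv)
    print(f"would listen on port {args.port}")
    return 0


# In a packaged project this function could be exposed through setuptools, e.g.
#   entry_points={"console_scripts": ["towelstuff-serve = towelstuff.cli:main"]}
```

Unlike an environment variable, the option is discoverable through --help, validated on the way in, and two services can be started side by side with different --port values.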
Summary

There are several mechanisms for putting your code in reach of the Python interpreter. Each of these techniques is appropriate at a different stage of Development.
Once the code starts to form an Application it is worth considering using the Python Package structure. Python packages leave more techniques at your disposal for working with and distributing the application - and you have the benefit of evolution and being part of the mainstream.
The results of piecing together a bespoke structure can be seen in a variance in form factors and absence of a clear contract. Workarounds such as using the Current Working Directory will introduce brittle code and tight coupling.
The goals of clear APIs are not just applicable at the HTTP interface. They are present at the Command Line and Package Level, and in these areas the Land Registry applications are found wanting.
Comments

I got the text peer-reviewed by Python Packaging advocate Dave Haynes. Here are his comments:
I spent a few minutes reading your tech note. It all matches what I understand to be good practice just now. Here's a couple of things I might be tempted to emphasise:

* Business benefits of getting it right, ie: the well-defined happy path for a new user
* Use of pkg_resources to access cfg, template files, etc
* Namespace packages allow multiple projects to slot together. This is good when you've got a codebase artificially split across teams for corporate/political reasons

Myself, I always make a new virtualenv for each project I work on. I don't use activate at all, I just specify the absolute path to the python binary.
Definitely never rely on environment variables.
Only ever empty __init__.py except to define __version__ for the package.
Don't like requirements.txt so I define multiple 'extras_require' in setup.py for different uses, eg: building docs, packaging for Windows etc.
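The reviewer's last point can be sketched as follows. The group names and the packages listed are examples only, not the reviewer's actual configuration; the dictionary would be passed to setup() in setup.py as the extras_require argument.

```python
# Hypothetical optional dependency groups for a setup.py.
EXTRAS_REQUIRE = {
    "docs": ["sphinx"],   # needed only when building the documentation
    "test": ["pytest"],   # needed only when running the test suite
}

# In setup.py this would be used as: setup(..., extras_require=EXTRAS_REQUIRE)
for group, packages in sorted(EXTRAS_REQUIRE.items()):
    print(group, packages)
```

A group is then selected at install time with pip's extras syntax, for example `pip install .[docs]` from the project directory.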