Skip to content

Instantly share code, notes, and snippets.

@yogabonito
Last active September 27, 2017 11:53
Show Gist options
  • Save yogabonito/713d3e9177480e6a8ca7b486b57c1fb3 to your computer and use it in GitHub Desktop.
Save yogabonito/713d3e9177480e6a8ca7b486b57c1fb3 to your computer and use it in GitHub Desktop.
Summary of the GSoC 2017 with PySAL, Project: Regionalization
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GSoC with PySAL - Regionalization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regionalization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The analysis of geographical data often requires the clustering of areas into larger spatial units called regions. Furthermore, spatial clustering (rgionalization) can improve the readability of visualizations. This project focused on regionalization algorithms for Python 3."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Porting `clusterpy` to Python 3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An existing Python package offering regionalization algorithms is [`clusterpy`](http://www.rise-group.org/risem/clusterpy/#). Unfortunately, this package is not compatible to Python 3. One aim of this project was to port `clusterpy` to Python 3. A [blog post of mine](http://yogabonito-gsoc.blogspot.co.at/2017/06/porting-clusterpy-to-python-3-in-last.html) describes how this can be achieved. The resulting code can be found in the [`clusterpy` branch of my `region` repo](https://github.com/yogabonito/region/tree/clusterpy3). A short demonstration of the Python-3-port of clusterpy can be found [in this gist](https://gist.github.com/yogabonito/5e5adba8d7a44fb6e3e75ce3b09828ea#file-demo_clusterpy_with_python3-ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Update:_ The code is now merged into `pysal/region` (see [this pull request](https://github.com/pysal/region/pull/2))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implementing Algorithms"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A second goal of the project was to implement regionalization algorithms in a new package. To maximize usability the API allows for various representations of geographical data, including geopandas' [`GeoDataFrame`](http://geopandas.org/data_structures.html#geodataframe)s, PySAL [`W`](http://pysal.readthedocs.io/en/latest/library/weights/weights.html#pysal.weights.weights.W) objects, networkx [`Graph`](http://networkx.readthedocs.io/en/networkx-1.10/reference/classes.graph.html)s, and [sparse (adjacency) matrices](https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix). With the [p-regions problem](#p_regions) and the [Max-p-regions problem](#max_p_regions) we addressed two distinct clustering tasks. You can find the code in the [`master` branch of my `region` repo](https://github.com/yogabonito/region/tree/master)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"_Update:_ The code is now merged into `pysal/region` (see [this pull request](https://github.com/pysal/region/pull/1))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The p-Regions Problem<a id=\"p_regions\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The p-regions problem is defined as a regionalization task where the number of regions (clusters) is known. There are many approaches to solving this problem. One way to solve the problem is to translate it into an [mixed integer program](#p_ex) while another category of algorithms comprises [heuristic approaches](#p_heu). The code related to the p-regions problem can be found in the [`region.p_regions` subpackage](https://github.com/yogabonito/region/tree/master/region/p_regions)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exact Approach<a id=\"p_ex\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Duque, Church, and Middleton (2011)](http://onlinelibrary.wiley.com/doi/10.1111/j.1538-4632.2010.00810.x/abstract) describe three different ways to translate a p-regions problem into exact optimization models and call these approaches [Flow](#flow), [Order](#order), and [Tree](#tree) respectively. While the objective function of these models ensures maximum homogeneity within the regions, the models' constraints ensure that the resulting regions are spatially contiguous. Implementations for all of the three approaches are available through the `ClusterExact` class in the `region.p_regions.exact` module."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Flow<a id=\"flow\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Examples of how to use the Flow implementation are in [this gist](https://gist.github.com/yogabonito/4788eaa456931eba14779a93263968df#file-p_regions_exact-ipynb). Silencing the warnings shouldn't be difficult and is on my todo list."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Order<a id=\"order\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"This works analogously as the gist for the Flow-algorithm. Only add the keyword argument method=\"order\" to the `fit_from_...`-method to use the Order algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Tree<a id=\"tree\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"This works analogously as the gist for the Flow-algorithm. Only add the keyword argument method=\"tree\" to the `fit_from_...`-method to use the Tree algorithm."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Heuristic Approach<a id=\"p_heu\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"While exact approaches ensure optimality, they can be rather slow in delivering the solution. That's why (especially for clustering tasks including many areas) it may be preferable to use an heuristic approach. Many heuristic algorithms have been designed aiming for a high probability of a good solution while keeping the run times low. This category of algorithms includes the [AZP](#azp) described in [Openshaw & Rao (1995)](http://journals.sagepub.com/doi/abs/10.1068/a270425). This paper also presents several variations of this algorithm which aim to improve results, including [simulated annealing](#sa), [basic tabu](#bt), and [reactive tabu](#rt)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### AZP<a id=\"azp\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Examples of how to use the AZP implementation are in [this gist](https://gist.github.com/yogabonito/2f00726bf90acf63921b7de47600f731#file-azp-ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### AZP-SA<a id=\"sa\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Examples of how to use the AZP implementation with simulated annealing are in [this gist](https://gist.github.com/yogabonito/5b88bfddcde699c9d0eccb2b71d13c82#file-azp-sa-ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### AZP Basic Tabu<a id=\"bt\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Examples of how to use the AZP Basic Tabu implementation are in [this gist](https://gist.github.com/yogabonito/121028535d1eea3b2a1cf1682ec511c6#file-azp_basic_tabu-ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### AZP Reactive Tabu<a id=\"rt\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Examples of how to use the AZP Reactive Tabu implementation are in [this gist](https://gist.github.com/yogabonito/f59c5d6f457760348f324407746a79b0#file-azp_reactive_tabu-ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Max-p-Regions Problem<a id=\"max_p_regions\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the p-regions problem the max-p-regions problem is about finding a clustering which satisfies the spatial contiguity condition. A key difference is that in the max-p-regions problem there is no predefined number of regions. The primary goal is to find a clustering that has as many regions as possible. In order to avoid a clustering in which every area forms its own region there is another condition. It is stated by using so called spatial extensive attributes which each area has. The condition requires that the sum of these spatial extensive attributes reaches or exceeds a predefined threshold. When the maximum number of regions is found, a second goal has to be met, namely to find the best clustering with the maximum number of regions. Here, optimality is defined by the areas' attributes. (These attributes are different from the spatial extensive attributes mentioned above.) For a detailed description of the max-p-regions problem take a look at [Duque, Anselin, and Rey](https://asu.pure.elsevier.com/en/publications/the-max-p-regions-problem). It can also be solved using either an [exact](#max_p_ex) or a [heuristic](#max_p_heu) approach. You can find the code in the [`max_p_regions` subpackage](https://github.com/yogabonito/region/tree/master/region/max_p_regions)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Exact Approach<a id=\"max_p_ex\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"An example of how to solve a small max-p-regions problem using an exact approach is in [this gist](https://gist.github.com/yogabonito/f653e4cc027915d497d40fc406aefc72#file-max_p_regions_exact-ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Heuristic Approach<a id=\"max_p_heu\"></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Examples of the heuristic approach with different local search algorithms are in [this gist](https://gist.github.com/yogabonito/9612dc53d2a313561c59dacdbe839f97#file-max_p_regions_heu-ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Documentation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using Sphinx, I have generated a documentation based on the docstrings in the code. A version of the documentation based on commit [7cf14ae](https://github.com/yogabonito/region/commit/7cf14ae5d88aa628d2f14adb457d57a55de0569a) (not up-to-date) was uploaded to [readthedocs.org](https://readthedocs.org).\n",
"Currently, the documentation comprises the automatically generated [index](http://region.readthedocs.io/en/docs/genindex.html) and [module index](http://region.readthedocs.io/en/docs/py-modindex.html) pages as well as a the [package's readme file](http://region.readthedocs.io/en/docs/readme.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### TODO"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The immediate todos include:\n",
"1. ~~Further harmonize the API~~\n",
"2. ~~Optimize the heuristic algorithms (use `scipy.sparse.csr_matrix` instead of `networkx.Graph` internally)~~\n",
"3. ~~Update test cases to recently changed API~~\n",
"\n",
"All three todos are completed. Furthermore, tests have been added as well as a new feature offering the user more control over the objective function used in the heuristic algorithms."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment