@GenevieveBuckley
Last active February 19, 2021 07:12
scipy Japan 2020 talk proposal

Genevieve Buckley is a scientist and programmer based in Melbourne, Australia. She builds software tools for scientific discovery. Her interests include deep learning, automated analysis, and contributing to open source projects. She has a wealth of professional experience with image processing and analysis, spanning X-ray imaging, fluorescence microscopy, and electron beam microscopy. She is a maintainer of the dask-image project.

dask-image: distributed image processing for large data

Contributors:

  • Genevieve Buckley
  • John Kirkham

Short Summary

This talk introduces dask-image, a python library for distributed image processing. Targeted towards applications involving large array data too big to fit in memory, dask-image is built on top of numpy, scipy, and dask, allowing easy scalability and portability from your laptop to a supercomputing cluster. It is of broad interest to a diverse range of scientific fields including astronomy, geosciences, microscopy, and climate sciences. We will provide a general overview of the dask-image library, then discuss mixing and matching with your own custom functions, and present a practical case study of a python image processing pipeline.

Abstract

Scientific imaging datasets are large, and becoming larger. The average size of a single entry on the electron microscopy database EMPIAR is over 1TB. Individual lattice light sheet microscopy datasets can easily reach several terabytes. Even where individual images are small enough to fit in memory, many existing parallelization methods are difficult to scale seamlessly between a laptop and a supercomputing cluster. For instance, the python multiprocessing module is restricted to a single node and can't take advantage of multiple compute nodes on a distributed supercomputing cluster.

We need easy ways to work with large image data. This talk introduces dask-image, a python library for distributed image processing. The target audience is scientists currently using numpy and scipy with large array data, where the whole dataset cannot fit in memory or is close to that limit. It's for people who want to get started with parallel image processing, either because they have large single-image data (for example, very high resolution 2D histology slides where individual image tiles must be processed bit by bit), or because they want to do batch processing, applying the same analysis to many smaller images (sometimes known as an embarrassingly parallel problem). The specific image analysis functions provided by dask-image are of broad interest to a diverse range of scientific fields including (but not limited to) astronomy, geosciences, microscopy, and climate sciences.
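
A serial sketch of that batch-processing pattern, using only numpy and scipy (the images and the analysis function here are hypothetical placeholders): because each image is analysed completely independently, a scheduler like dask can run every call as a separate task, on one machine or across a cluster.

```python
import numpy as np
from scipy import ndimage

# hypothetical batch: many small images, each getting the same analysis
images = [np.random.random((64, 64)) for _ in range(10)]

def analyse(image):
    """Smooth one image, threshold it, and count foreground pixels."""
    smoothed = ndimage.gaussian_filter(image, sigma=1)
    return int((smoothed > smoothed.mean()).sum())

# each call is independent -- an embarrassingly parallel workload
counts = [analyse(image) for image in images]
```

Wrapping each `analyse` call in dask delayed is essentially all it takes to turn this loop into a parallel task graph.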

Specifically, this talk will cover:

  • An overview of the dask-image library
    • Lazy image loading
    • Image pre-processing functionality (convolutions, filters, etc.)
    • Analysis of segmented images (distributed labelling, and measurements of those label regions)
  • Mixing in your own custom analysis functions (using dask delayed, map_blocks, and map_overlap)
  • A practical case study of a Python image processing pipeline

dask-image is open source, released under a BSD 3-Clause license, and can be installed using conda or pip.
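
For reference, the two install routes mentioned above are each a one-liner:

```shell
# with pip
pip install dask-image

# or with conda, from the conda-forge channel
conda install -c conda-forge dask-image
```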

Additional Material

PyConlineAU 2020 https://pretalx.com/pycon-au-2020/talk/review/RP9LMHZUMYZZWUB73ESTKG9SGT9QCBMJ

dask-image: distributed image processing for large data

Abstract

This talk introduces dask-image, a python library for distributed image processing. Targeted towards applications involving large array data too big to fit in memory, dask-image is built on top of numpy, scipy, and dask, allowing easy scalability and portability from your laptop to a supercomputing cluster. It is of broad interest to a diverse range of data analysis applications such as video/streaming data, computer vision, and scientific fields including astronomy, microscopy, and geosciences. We will provide a general overview of the dask-image library, then discuss mixing and matching with your own custom functions, and present a practical case study of a python image processing pipeline.

Detailed abstract

Image datasets are large, and becoming larger. The widely used benchmark dataset COCO (Common Objects in Context) contains 330,000 individual images. The average size of a single entry on the image database EMPIAR is over 1TB, and individual entries can easily reach several terabytes. Even where individual images are small enough to fit in memory, many existing parallelization methods are difficult to scale seamlessly between a laptop and a supercomputing cluster. For instance, the python multiprocessing module is restricted to a single node and can't take advantage of multiple compute nodes on a distributed supercomputing cluster.

We need easy ways to work with large image data. This talk introduces dask-image, a python library for distributed image processing. The target audience is python programmers currently using numpy and scipy with large array data, where the whole dataset cannot fit in memory or is close to that limit. It's for people who want to get started with parallel processing, either because they have large single-image data, or because they want to do batch processing, applying the same analysis to many smaller images (sometimes known as an embarrassingly parallel problem). The specific image analysis functions provided by dask-image are of broad interest to a diverse range of analysis applications including (but not limited to) video/streaming data, computer vision, and scientific fields including astronomy, microscopy, and geosciences.

Specifically, this talk will cover:

  • An overview of the dask-image library
    • Lazy image loading
    • Image pre-processing functionality (convolutions, filters, etc.)
    • Analysis of segmented images (distributed labeling, and measurements of those label regions)
  • Mixing in your own custom analysis functions (using dask delayed, map_blocks, and map_overlap)
  • A practical case study of a Python image processing pipeline
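
As a sketch of the map_overlap idea mentioned above: chunks share a margin of pixels so that a neighbourhood operation (here scipy's Gaussian filter, as an arbitrary example) gives correct results at chunk boundaries. This assumes dask, numpy, and scipy are installed.

```python
import numpy as np
import dask.array as da
from scipy import ndimage

x = da.from_array(np.random.random((1024, 1024)), chunks=(256, 256))

# depth=8 shares an 8-pixel margin between neighbouring chunks, so the
# filter sees enough context at chunk edges; extra keyword arguments
# (sigma=3) are forwarded to the wrapped function
smoothed = x.map_overlap(ndimage.gaussian_filter, depth=8, boundary="reflect", sigma=3)
result = smoothed.compute()
```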

dask-image is open source, released under a BSD 3-Clause license, and can be installed using conda or pip. You can find the source code at https://github.com/dask/dask-image and the quickstart guide at https://github.com/dask/dask-examples/blob/master/applications/image-processing.ipynb
