Genevieve Buckley is a scientist and programmer based in Melbourne, Australia. She builds software tools for scientific discovery. Her interests include deep learning, automated analysis, and contributing to open source projects. She has a wealth of professional experience with image processing and analysis, spanning X-ray imaging, fluorescence microscopy, and electron beam microscopy. She is a maintainer of the dask-image project.
Contributors:
- Genevieve Buckley
- John Kirkham
This talk introduces dask-image, a python library for distributed image processing. Targeted towards applications involving large array data too big to fit in memory, dask-image is built on top of numpy, scipy, and dask, allowing easy scalability and portability from your laptop to a supercomputing cluster. It is of broad interest to a diverse range of scientific fields including astronomy, geosciences, microscopy, and climate sciences. We will provide a general overview of the dask-image library, then discuss mixing and matching with your own custom functions, and present a practical case study of a python image processing pipeline.
Scientific imaging datasets are large, and becoming larger. The average size of a single entry on the electron microscopy database EMPIAR is over 1TB. Individual lattice light sheet microscopy datasets can easily reach several terabytes. Even where individual images are small enough to fit in memory, many existing parallelization methods are difficult to scale seamlessly between a laptop and a supercomputing cluster. For instance, the python multiprocessing module is restricted to a single machine and can't take advantage of multiple compute nodes on a distributed supercomputing cluster.
We need easy ways to work with large image data. This talk introduces dask-image, a python library for distributed image processing. The target audience is scientists currently using numpy and scipy with large array data, where the whole dataset cannot fit in memory or is close to that limit. It's for people who want to get started with parallel image processing, either because they have large single-image data (for example, very high resolution 2D histology slides where individual image tiles must be processed piece by piece), or because they want to do batch processing, applying the same analysis to many smaller images (sometimes known as an embarrassingly parallel problem). The specific image analysis functions provided by dask-image are of broad interest to a diverse range of scientific fields including (but not limited to) astronomy, geosciences, microscopy, and climate sciences.
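The embarrassingly parallel batch-processing case can be sketched with plain dask. This is a minimal illustration, not dask-image's API: the `measure` function and the in-memory stand-in images are hypothetical placeholders for an analysis step applied to many small images loaded from disk.

```python
import numpy as np
import dask


@dask.delayed
def measure(image):
    # hypothetical per-image analysis step: mean intensity
    return float(image.mean())


# stand-in for many small images that would normally be read from disk
images = [np.full((64, 64), i, dtype=float) for i in range(10)]

# build one delayed task per image, then run them all in parallel
results = dask.compute(*[measure(img) for img in images])
print(results)
```

Because each task is independent, the same code scales from a laptop's local threads to a distributed cluster just by changing the dask scheduler.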
Specifically, this talk will cover:
- An overview of the dask-image library
- Lazy image loading
- Image pre-processing functionality (convolutions, filters, etc.)
- Analysis of segmented images (distributed labelling, and measurements of those label regions)
- Mixing in your own custom analysis functions (using dask delayed, map_blocks, and map_overlap)
- A practical case study of a Python image processing pipeline
dask-image is open source, released under a BSD 3-Clause license, and can be installed using conda or pip.
Additional Material
- dask-image source code is available at https://github.com/dask/dask-image
- dask-image quickstart guide: https://github.com/dask/dask-examples/blob/master/applications/image-processing.ipynb
- Presenter speaking samples:
- PyConAU 2019 lightning talk: https://youtu.be/AJqcxEzRdSY?t=784
- SciPy 2019 talk: https://www.youtube.com/watch?v=ytEQl9xs8FQ&list=PLYx7XA2nY5GcDQblpQ_M1V3PQPoLWiDAC&index=79&t=0s
- Previous presentation on this topic: John Kirkham presenting at SciPy in Austin 2019: https://www.youtube.com/watch?v=XGUS174vvLs
PyConlineAU 2020 https://pretalx.com/pycon-au-2020/talk/review/RP9LMHZUMYZZWUB73ESTKG9SGT9QCBMJ
This talk introduces dask-image, a python library for distributed image processing. Targeted towards applications involving large array data too big to fit in memory, dask-image is built on top of numpy, scipy, and dask, allowing easy scalability and portability from your laptop to a supercomputing cluster. It is of broad interest to a diverse range of data analysis applications such as video/streaming data, computer vision, and scientific fields including astronomy, microscopy and geosciences. We will provide a general overview of the dask-image library, then discuss mixing and matching with your own custom functions, and present a practical case study of a python image processing pipeline.
Image datasets are large, and becoming larger. The widely used benchmark dataset COCO (Common Objects in Context) contains 330,000 individual images. The average size of a single entry on the image database EMPIAR is over 1TB, and individual entries can easily reach several terabytes. Even where individual images are small enough to fit in memory, many existing parallelization methods are difficult to scale seamlessly between a laptop and a supercomputing cluster. For instance, the python multiprocessing module is restricted to a single machine and can't take advantage of multiple compute nodes on a distributed supercomputing cluster.
We need easy ways to work with large image data. This talk introduces dask-image, a python library for distributed image processing. The target audience is python programmers currently using numpy and scipy with large array data, where the whole dataset cannot fit in memory or is close to that limit. It's for people who want to get started with parallel processing, either because they have large single-image data, or because they want to do batch processing, applying the same analysis to many smaller images (sometimes known as an embarrassingly parallel problem). The specific image analysis functions provided by dask-image are of broad interest to a diverse range of analysis applications including (but not limited to) video/streaming data, computer vision, and scientific fields including astronomy, microscopy and geosciences.
Specifically, this talk will cover:
- An overview of the dask-image library
- Lazy image loading
- Image pre-processing functionality (convolutions, filters, etc.)
- Analysis of segmented images (distributed labelling, and measurements of those label regions)
- Mixing in your own custom analysis functions (using dask delayed, map_blocks, and map_overlap)
- A practical case study of a Python image processing pipeline
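Mixing in a custom function can be sketched with dask's `map_blocks` and `map_overlap`. This is an illustrative example, assuming a numpy-compatible function applied per chunk; the `denoise` function and the random stand-in image are hypothetical.

```python
import dask.array as da
from scipy import ndimage


def denoise(chunk):
    # custom per-chunk numpy function: a small median filter
    return ndimage.median_filter(chunk, size=3)


image = da.random.random((1024, 1024), chunks=(256, 256))

# map_blocks: apply a function to each chunk independently
centered = image.map_blocks(lambda block: block - block.mean())

# map_overlap: share a 1-pixel halo between neighbouring chunks,
# so the filter is also correct at chunk boundaries
filtered = image.map_overlap(denoise, depth=1, boundary="reflect")
result = filtered.compute()
```

The `depth` argument controls how many pixels of overlap each chunk borrows from its neighbours; it should match the footprint of the custom function.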
dask-image is open source, released under a BSD 3-Clause license, and can be installed using conda or pip. You can find the source code at https://github.com/dask/dask-image and the quickstart guide at https://github.com/dask/dask-examples/blob/master/applications/image-processing.ipynb