What is your session about?

"I wish there was a way to easily manipulate this huge multi-dimensional array in Python...", I thought, as I stared at a huge chunk of satellite data on my laptop. The data was from a satellite measuring air quality - and I wanted to slice and dice the data in some supposedly simple ways. Using pure numpy - the go-to library when the words 'multi-dimensional', 'array' and 'python' are mentioned in the same sentence - was just such a pain. What I wished for was something like pandas - with datetime indexes, fancy ways of selecting subsets, group-by operations and so on - but something that would work with my huge multi-dimensional array.

The solution: XArray - a wonderful library which provides the power of pandas for multi-dimensional data. In this talk I will introduce the XArray library by showing how just a few lines of code can answer questions about my data that would take a lot of complex code to answer with pure numpy - questions like 'What is the average air quality in March?', 'What is the time series of air quality in Southampton?' and 'What is the seasonal average air quality for each census output area?'.

After demonstrating how these questions can be answered easily with XArray, I will introduce the fundamental XArray data types, and show how indexes can be added to raw arrays to fully utilise the power of XArray. I will discuss how to get data in and out of XArray, and how XArray can use dask for high-performance data processing on multiple cores, or distributed across multiple machines. Finally, I will leave you with a taster of some of the advanced features of XArray - including seamless access to data via the internet using OPeNDAP, complex apply functions, and XArray extension libraries.

Is there anything else we should know about your proposal?

XArray is a very useful, but less well-known, library under the PyData umbrella. It provides data structures that allow pandas-style processing of large multi-dimensional arrays - using indexes, fancy indexing, groupbys and more. I have found it invaluable for various projects I've been working on - mostly dealing with large multi-temporal stacks of satellite data - but didn't learn about it until I'd implemented a number of projects 'the hard way'. Therefore, I'm keen to share the project with PyConUK attendees - as I'm sure there are other people who have never come across XArray but would benefit from it.

I should point out that I'm not a developer of XArray, and not an in-depth XArray expert. However, I am a very happy user of XArray - and have the enthusiasm to present it effectively to a room of people who are unlikely to have come across it before.

I have successfully presented at PyConUK before - a couple of years back, when I presented recipy, a module I created for automated provenance tracking. Since then I have been unable to attend due to poor health, but am keen to attend this year and get back into the community.

Can you give us an outline of your proposed session?

00:00 Air quality over the UK - using XArray to answer some simple questions

We will jump straight into the 'motivating example' - a problem I actually had to solve, and that was my first foray into using XArray. We will briefly show examples of loading NetCDF files into XArray, examining dimensions and indexes, selecting by indexes, groupbys and plotting. The idea is to show the power of XArray - how much can be done with so little code - and to motivate the rest of the presentation.
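
As a rough illustration of the kind of code this section might contain, here is a minimal sketch that uses a synthetic array in place of the real satellite data. The grid, the variable name `air_quality` and the approximate coordinates used for Southampton are assumptions made for this example, not details of the actual dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the satellite data: one value per day per grid cell.
times = pd.date_range("2016-01-01", "2016-12-31", freq="D")
lats = np.arange(50.0, 56.0, 0.5)
lons = np.arange(-5.0, 2.0, 0.5)
values = np.random.default_rng(0).random((times.size, lats.size, lons.size))

air_quality = xr.DataArray(
    values,
    coords={"time": times, "lat": lats, "lon": lons},
    dims=("time", "lat", "lon"),
    name="air_quality",
)

# 'What is the average air quality in March?' -> a 2D (lat, lon) map.
march_mean = air_quality.sel(time=slice("2016-03-01", "2016-03-31")).mean("time")

# 'What is the time series of air quality in Southampton?'
# Picks the grid cell nearest to roughly 50.9N, 1.4W (assumed location).
southampton = air_quality.sel(lat=50.9, lon=-1.4, method="nearest")

# Seasonal average for every grid cell, grouped by meteorological season.
seasonal = air_quality.groupby("time.season").mean("time")
```

With real data the synthetic array would be replaced by something like `xr.open_dataset(...)` on a NetCDF file, and each result could be plotted with its `.plot()` method (which needs matplotlib).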

00:05 The fundamental basis of XArray

Now we move on to how XArray represents datasets and how this is related to concepts you may be familiar with in pandas. We'll create an XArray dataset from scratch and show how we can add more metadata such as indexes to it to allow easier slicing and dicing of the data.
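
A minimal sketch of that idea, assuming a small made-up array (the dimension names, coordinate values and the `doubled` variable are purely illustrative):

```python
import numpy as np
import pandas as pd
import xarray as xr

# A bare 3D numpy array: what each axis means is known only by convention.
raw = np.arange(24, dtype=float).reshape(2, 3, 4)

# Wrapping it in a DataArray gives the axes names...
da = xr.DataArray(raw, dims=("time", "lat", "lon"), name="air_quality")

# ...and assigning coordinates gives each axis an index, much like a pandas index.
da = da.assign_coords(
    time=pd.date_range("2016-03-01", periods=2, freq="D"),
    lat=[50.0, 51.0, 52.0],
    lon=[-2.0, -1.0, 0.0, 1.0],
)

# Label-based selection now replaces positional indexing such as raw[1, 0, 2].
value = da.sel(time="2016-03-02", lat=50.0, lon=0.0)

# A Dataset groups several DataArrays that share coordinates,
# roughly as a pandas DataFrame groups several Series.
ds = xr.Dataset({"air_quality": da, "doubled": da * 2})
```

Here `value` refers to the same element as `raw[1, 0, 2]`, but the selection is written in terms of dates and coordinates rather than array positions.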

00:10 Data Input/Output

Here we learn how to go from a set of raw input files into a usable multidimensional dataset in XArray - after all, not everything comes already packaged as a NetCDF file!
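
One possible shape for this step, sketched under the assumption that each raw file holds a single 2D grid for one day and that the date is encoded in the filename. The filenames, the `date_from_name` helper and the in-memory grids standing in for files are all hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-ins for per-day raster files: filename -> 2D grid.
# In practice each grid would be read from disk with a suitable reader.
raw_files = {
    "aq_2016-03-01.tif": np.full((3, 4), 1.0),
    "aq_2016-03-02.tif": np.full((3, 4), 2.0),
    "aq_2016-03-03.tif": np.full((3, 4), 3.0),
}
lats = [50.0, 51.0, 52.0]
lons = [-2.0, -1.0, 0.0, 1.0]


def date_from_name(name):
    """Hypothetical helper: 'aq_2016-03-01.tif' -> Timestamp('2016-03-01')."""
    return pd.Timestamp(name[len("aq_"):-len(".tif")])


# One labelled 2D DataArray per file.
slices = [
    xr.DataArray(grid, coords={"lat": lats, "lon": lons}, dims=("lat", "lon"))
    for grid in raw_files.values()
]
time_index = pd.DatetimeIndex([date_from_name(n) for n in raw_files], name="time")

# Stack the 2D slices along a new, date-indexed 'time' dimension.
stacked = xr.concat(slices, dim=time_index)
```

The result can then be saved with `stacked.to_netcdf(...)` (which needs a NetCDF backend such as netCDF4 or scipy) so that later sessions start from a ready-made file.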

00:13 High performance processing with dask

Often multidimensional datasets are very large - and are hard to process on a single core of a single machine. XArray uses dask behind the scenes to split data up, process it on different cores or different machines, and join it back together again. Here we'll show a live example of this, including looking at dask execution graphs and the dask live dashboard.
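
A small sketch of the chunk-then-compute pattern, assuming dask is installed alongside XArray. The array size and chunk size here are arbitrary choices for illustration, far smaller than real satellite data.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.random.default_rng(0).random((365, 50, 50)),
    dims=("time", "lat", "lon"),
)

# Splitting the array into chunks makes it dask-backed (requires dask).
chunked = data.chunk({"time": 73})

# This is lazy: it only builds a task graph, nothing is calculated yet.
lazy_mean = chunked.mean("time")

# compute() runs the graph, potentially in parallel across the chunks.
result = lazy_mean.compute()
```

By default this uses dask's local scheduler; spreading the work over several machines would need a `dask.distributed` cluster, which is not shown here.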

00:18 A few tasters

A few examples of more complex things that can be done with XArray, including complex apply functions, OPeNDAP and the use of XArray extension libraries - not in depth, but just as a taster of what else is possible, and to pique people's interest in looking further into XArray.

00:20 Questions

What are your equipment and other requirements?

I use a wheelchair for longer distances, but am able to climb stairs to the stage to present, as long as there is a chair for me to sit down on while delivering the talk.

This talk is suitable for data scientists.
