@dstansby
Last active February 22, 2023 18:05

The future of time-series data in sunpy

David Stansby

Introduction

In late 2022 I got a small development grant from NumFOCUS to scope the future of time-series data in sunpy. The successful application can be read on the sunpy wiki; it contains context that I won't repeat here.

The current document will be the key outcome of the small development grant, with a record of what I did, the recommendations I made, and any decisions we came to as a community.

While the document is in its current form, please feel free to leave comments.

User requirements

The first stage of my work investigated what the user requirements are for a sunpy data container. As part of this I used my own experience and the following community engagement:

From these discussions came the following list of requirements:

| Requirement | Notes |
| --- | --- |
| Store data that is a function of time | This means the time column should be treated as the index or coordinates to the data, and be stored as a time-like type. |
| Handle different time scales | Data can have times defined in a variety of different time scales (e.g. UTC, TAI). |
| Store multi-dimensional data | Although time is a common index to timeseries data, it isn't always the only one. As an example, velocity distribution functions measured in the solar wind are 4D datasets, with data as a function of time and three dimensions in velocity space. |
| Handle time scales with leapseconds | Some timescales can contain timestamps that occur within a leapsecond. |
| Store and use physical units with the data and any non-time indices | |
| Store data in a format that can be used with scientific Python libraries | |
| Support for storing out-of-memory datasets | |
| Store metadata alongside actual data | |
| Have a way to store an observer coordinate alongside the time index | |
| Have an easy way to do common data manipulation tasks | e.g. interpolating, resampling, rebinning |
| Have a way to combine multiple timeseries objects, and keep track of metadata | |
| Ability to convert to other common time series objects (e.g. pandas.DataFrame) | |
| Functionality for loading and saving out to common file formats | |

Existing options for a data container

The next step was to identify a set of possible data containers that could be used to store time-series data in sunpy. The identified options were:

  • astropy.timeseries.TimeSeries
  • pandas.DataFrame
  • xarray.DataArray (or xarray.Dataset)
  • numpy.ndarray
  • ndcube

What do other projects use?

I also looked at what Python in Heliophysics projects use (as of writing, in Jan 2023):

| Package | Container |
| --- | --- |
| sunpy | Custom TimeSeries object, backed by pandas.DataFrame |
| HAPI Client | numpy.ndarray |
| pySPEDAS | Not sure, can users actually get at the data itself? |
| spacepy | Unclear if there is any specific timeseries container object? |
| aidapy | xarray.DataArray |
| cdflib | numpy.ndarray |
| NDCube | NDCube |
| pytplot | xarray.DataArray |
| solo-epd-loader | pandas.DataFrame |
| speasy | Custom DataContainer object, backed by numpy.ndarray |

No single container is used across these projects. Of the possible options listed above, only astropy.timeseries.TimeSeries is not used by any of them.

What datasets does sunpy currently support?

sunpy currently has built-in support for reading CDF files that conform to the Space Physics Guidelines for CDF, as long as the dataset is one- or two-dimensional. Alongside this, several custom data readers have been written to support different data sources:

(links point to the data source information web page)

| Data product(s) | File format |
| --- | --- |
| SDO EVE/ESP L1 | FITS |
| SDO EVE/ESP L0CS | Text file |
| FERMI GBM summary | FITS |
| GOES XRS | FITS, netCDF |
| PROBA-2 LYRA lightcurve | FITS |
| NOAA solar cycle monthly indices | JSON |
| NOAA solar cycle predicted indices | JSON |
| NoRH radio | FITS |
| RHESSI x-ray summary | FITS |

Evaluating options

Having found possible options, in this section I've evaluated them against the criteria set out above.

numpy.ndarray

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🛑 | Can store datetime64 data, but no support for indexes |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟩 | |
| Physical units | 🛑 | No support |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🛑 | numpy arrays are always in memory |
| Metadata | 🛑 | No support |
| Observer coordinates | 🛑 | No support |
| Easy data manipulation | | |
| I/O | 🟠 | Can save to binary .npy format or text file |
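
As a minimal sketch of the "no support for indexes" point (the timestamps and values are made up for illustration): times and data live in two unrelated arrays, so selecting by time means building a boolean mask by hand.

```python
import numpy as np

# Hypothetical sample data: three timestamps and three matching measurements.
times = np.array(
    ["2023-01-01T00:00", "2023-01-01T00:01", "2023-01-01T00:02"],
    dtype="datetime64[s]",
)
values = np.array([1.0, 2.0, 4.0])

# Nothing links `values` to `times`, so a time-based selection
# has to go through an explicit boolean mask.
mask = times >= np.datetime64("2023-01-01T00:01")
selected = values[mask]
```

Keeping the two arrays aligned through slicing, sorting, or concatenation is left entirely to the user.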

pandas.DataFrame

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🟩 | |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟠 | Possible, but recommended to use xarray instead |
| Physical units | 🛑 | No native support (tracking issue), could be possible with pint-pandas |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🛑 | pandas DataFrames are always in memory |
| Metadata | 🟩 | Possible to add additional properties to a DataFrame |
| Observer coordinates | 🛑 | No support |
| Easy data manipulation | 🟩 | Many built-in methods for manipulating time-like data |
| I/O | 🟩 | Lots of I/O options |
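
A short sketch of the time-like index and manipulation rows, using made-up data. The column name `flux` and the metadata key are hypothetical, and I'm assuming the `DataFrame.attrs` dictionary (which pandas documents as experimental) is one way to attach metadata.

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute cadence data with a DatetimeIndex.
index = pd.date_range("2023-01-01", periods=6, freq="1min")
df = pd.DataFrame({"flux": np.arange(6.0)}, index=index)

# Attach metadata; `attrs` is a plain dict on the DataFrame.
df.attrs["instrument"] = "example"

# Resample to three-minute bins, averaging the samples in each bin.
resampled = df.resample("3min").mean()
```

Note that the index here is a plain `datetime64` index, so it carries no information about which time scale the timestamps are in.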

xarray.DataArray

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🟩 | |
| Different time scales | 🛑 | No support |
| Multi-dimensional data | 🟩 | |
| Physical units | 🛑 | No native support (tracking issue), could be possible with pint-xarray |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🟩 | Support for computing using dask |
| Metadata | 🟩 | Possible to add metadata to a DataArray |
| Observer coordinates | 🟠 | Support for adding "non-dimensional" coordinates (e.g. longitude/latitude), but not clear if storing astropy SkyCoord would work |
| Easy data manipulation | 🟩 | Many built-in methods for manipulating time-like data |
| I/O | 🟩 | Lots of I/O options |
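
To make the multi-dimensional, metadata, and "non-dimensional" coordinate rows concrete, here is a small sketch with invented data. The dimension names, the `observer_lon` coordinate, and the attribute keys are all hypothetical; the observer position is stored as plain floats, not as an astropy SkyCoord.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2D dataset: 4 time steps by 3 energy channels.
time = pd.date_range("2023-01-01", periods=4, freq="1min")
da = xr.DataArray(
    np.arange(12.0).reshape(4, 3),
    dims=("time", "energy"),
    coords={
        "time": time,
        "energy": [10.0, 20.0, 30.0],
        # A non-dimension coordinate that varies along the time axis.
        "observer_lon": ("time", [0.0, 0.1, 0.2, 0.3]),
    },
    # Free-form metadata; the "units" entry is just a string label.
    attrs={"instrument": "example", "units": "counts"},
)

# Label-based selection in time, and a reduction over the time axis.
subset = da.sel(time=slice("2023-01-01T00:01", "2023-01-01T00:02"))
mean_spectrum = da.mean(dim="time")
```

The `observer_lon` coordinate follows the data through the time selection, which is the behaviour the observer-coordinate requirement is after.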

astropy.timeseries.TimeSeries

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🟩 | |
| Different time scales | 🟩 | |
| Multi-dimensional data | 🛑 | |
| Physical units | 🟩 | |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🛑 | As far as I can tell, data has to be loaded into memory |
| Metadata | 🟩 | Can store on the .meta attribute |
| Observer coordinates | 🟩 | Support for adding "non-dimensional" coordinates (e.g. longitude/latitude), but not clear if storing astropy SkyCoord would work |
| Easy data manipulation | 🟠 | |
| I/O | 🟩 | Lots of options via the astropy.table API |

NDCube

| Requirement | Rating | Notes |
| --- | --- | --- |
| Time-like index data | 🟩 | |
| Different time scales | 🟩 | |
| Multi-dimensional data | 🟩 | |
| Physical units | 🟩 | |
| Interop with scientific Python | 🟩 | |
| Out of memory | 🟠 | Seems to be supported in theory, but little docs |
| Metadata | 🟩 | Can store arbitrary FITS metadata |
| Observer coordinates | 🟩 | Support using the .extra_coords attribute |
| Easy data manipulation | 🛑 | Very few manipulation methods implemented |
| I/O | 🛑 | |

Initial recommendations

  • numpy.ndarray doesn't implement several key features, and these are almost certainly out of scope for future ndarray development, so I suggest ndarray is discounted.
  • xarray.DataArray builds on top of pandas.DataFrame, adding features that would be useful to us, so I suggest pandas.DataFrame is discounted.
  • NDCube is designed specifically to store data that is associated with a FITS world coordinate system (WCS). While some solar timeseries data is already in the FITS format, a large portion is in the tabular CDF format, which FITS is not primarily designed to represent. So I suggest NDCube is discounted.

This leaves us with astropy.timeseries.TimeSeries and xarray.DataArray, with the following comparison:

| Requirement | astropy.timeseries.TimeSeries | xarray.DataArray |
| --- | --- | --- |
| Time-like index data | 🟩 | 🟩 |
| Different time scales | 🟩 | 🛑 |
| Multi-dimensional data | 🛑 | 🟩 |
| Physical units | 🟩 | 🛑 |
| Interop with scientific Python | 🟩 | 🟩 |
| Out of memory | 🛑 | 🟩 |
| Metadata | 🟩 | 🟩 |
| Observer coordinates | 🛑 | 🟠 |
| Easy data manipulation | 🟠 | 🟩 |
| I/O | 🟩 | 🟩 |

My initial recommendation would be to adopt xarray.DataArray, as it has more green items than astropy.timeseries.TimeSeries. I also think its two red items (time scales and physical units) have the possibility of being solved:

  • It should (I haven't confirmed this) be possible to convert times in different time scales (including ones with leapseconds) to a single timescale that doesn't have leapseconds, and store this in an xarray.DataArray.
  • Alternatively, it is possible to use ExtensionArrays to extend the data types used for a pandas Index, which is the data type used to index xarray.DataArray. I haven't checked yet if it's possible to use ExtensionArrays to store astropy Time-like objects, and therefore support different time scales without conversion.
  • Although there is no native support for units in DataArray currently, there is interest and ongoing development to support them.

Finally, xarray has a much bigger development community than astropy.timeseries.TimeSeries, so implementing bug fixes and new features would probably be much easier with xarray.

@Cadair

Cadair commented Feb 21, 2023

One immediate comment is that ndcube very much does have support for non-dimensional coordinates via the .extra_coords property.

I am sure we will discuss this later, but I feel this lacks any discussion on the relative importance of the requirements. Also I would love to know what it would take to add astropy time support to pandas & xarray as indices.

@dstansby

One immediate comment is that ndcube very much does have support for non-dimensional coordinates via the .extra_coords property.

Thanks for spotting that, fixed 👍

@dstansby

Also I would love to know what it would take to add astropy time support to pandas & xarray as indices.

I've added a reference to pandas ExtensionArrays, which are the official way to extend pandas Index objects, which are used as the index in xarray.DataArray. I haven't investigated whether astropy.time.Time will work with this yet.
