Skip to content

Instantly share code, notes, and snippets.

@dhruv9vats
Last active August 19, 2021 17:47
Show Gist options
  • Save dhruv9vats/fdbf193600cefdaf8c8e2f54bfd83083 to your computer and use it in GitHub Desktop.
Save dhruv9vats/fdbf193600cefdaf8c8e2f54bfd83083 to your computer and use it in GitHub Desktop.
Google Summer of Code 2021 Proposal

Student Information

Project Proposal

Proposal Title: Integrating Multitaper Periodogram into Stingray

Abstract:

As far as observational astronomy is concerned, it would not be an exaggeration to say that the study and interpretation of time-series data have become an integral part of modern-day astronomical studies. A common approach for characterizing the properties of time-series data is to estimate the power spectrum of the data using the periodogram. But the periodogram as an estimate suffers from being:

  1. Statistically inconsistent (that is, its variance does not go to zero as data samples reach infinity),
  2. Biased for finite samples, and
  3. Suffers from spectral leakage.

This project aims to implement and integrate a superior spectral estimation technique, known as the Multitaper Periodogram, into Stingray. This integration would follow the library’s existing design principles so its current users can make use of this technique to better interpret their time-series data without any additional steps.

Multitaper Periodogram as an estimate is attractive because it directly trades off bias and variance for frequency resolution; it is consistent; and is fast to compute.

The tapers, or data windows, applied to the data in the time domain before Fourier transforming, are characterized as being the most nearly band-limited functions possible among functions defined on a finite time domain, known as discrete prolate spheroidal sequences (DPSS), tending to the spectral leakage problem.

Deliverables/Milestones:

  1. Proof-of-concept implementation of the multitaper algorithm.
  2. Full implementation of the algorithm in the Stingray framework, including tests and relevant documentation/tutorial.
  3. Proof-of-concept implementation of the Lomb-Scargle (LS) Multitaper periodogram (Optional).
  4. Support for overlapping (tapered) data segments for use in averaged periodograms, like in the Welch method. (Optional).

Description:

Background and Motive:

The analysis of time-series data is becoming increasingly common in astrophysics research (e.g., (Huppenkothen, et al. 2019). And it will continue to grow in prevalence as data from telescopes such as the National Science Foundation Vera C. Rubin Telescope (Collaboration, et al. 2009) comes online in the 2020s. Astrophysics relies on observational time-series data from the diverse instruments on terrestrial and spatial telescopes. This data has and is improving our understanding of exoplanets and stars, but we are limited by issues such as unevenly sampled data, measurement uncertainties, and the statistical methods we employ. This project is most concerned with the latter. It seeks to help democratize and lower the barrier to the use of sophisticated, consistent, and well-tested spectral analysis techniques, which would be welcomed, if not anticipated, into the exploratory analysis toolkit of any time-series analyst.

The particular tool this project will implement is the Thomson’s Multitaper Periodogram (Thomson 1982).

In traditional spectral estimation, the data is often “windowed” (multiplied with a bell-shaped function) in order to reduce spectral leakage. The mathematical form for completeness:

https://user-images.githubusercontent.com/65168627/129919002-4bad1604-6b74-47cd-b630-e61e6942edd1.png

Here, xt being the time-series data, htthe taper(window) and ŜD( f ) the estimated periodogram.

In the multitaper method, too, the data is windowed or tapered. But this method differs from the traditional methods in the tapers used, as highlighted in the abstract, which are the most band-limited functions amongst those defined on a finite time domain, but more importantly, these tapers are orthogonal, enabling us to average the spectrum estimates from more than one tapers to obtain a vastly superior estimate in terms of noise. The resulting spectrum has low leakage, low variance, and retains information contained in the beginning and end of the time series.

image1 image2

Here vt(k) are the tapers, but in particular, are the Discrete Prolate Spheroidal Sequences (DPSS) or, concisely, the Selpians (Slepian 1978). The Selpians, written as vt(k)(N, W ), where k is the order of the sequence, N the length of the data, have the smallest spectral leakage outside of a defined frequency band of width 2W. The first K ⪅ 2NW estimates are then averaged to obtain the final estimate Ŝ(mt)( f ), having the optimal leakage and noise performance.

A more tangible comparison can be seen in Figure 1, in which a time series (N = 1000 samples long) is generated using an autoregressive process,

https://user-images.githubusercontent.com/65168627/129919178-78ae3281-7cc9-4b02-9e20-468d8084b59e.png

of order, p = 4, with coefficients 𝜑1 = 2.7607, 𝜑2 = −3.8106, 𝜑3 = 2.6535, and 𝜑4 = −0.9238, denoted AR(4). The classical and the multitaper periodogram are then subject to this time series sampled from the AR(4) process, yielding the result depicted below:

Figure 1

https://user-images.githubusercontent.com/65168627/129919229-16f9401d-1510-47dc-949e-1032e21e05da.png

As the underlying process (and its spectrum) generating the time series is fully known, we can effectively compare the results of these different methods. Two differences can be observed right away, namely, the bias and the variance. Where the classical periodogram is badly biased, the multitaper method follows the true spectrum quite well. This biased nature of the classical periodogram renders it inconclusive, and more often than not, misleading.

The difference in variance is quite pronounced too. The multitaper method can also trade frequency resolution for bias and variance improvements, making it attractive in terms of consistency and reliability.

The classical periodogram is also referred to as a naïve spectrum estimator because of its basic estimation process. An estimate with better noise properties can also be obtained by splitting the data into segments, producing periodograms with these segments, and averaging the calculated periodograms to obtain the final result. The segmented data can also optionally be tapered using a window of choice.

This functionality to obtain a periodogram as an average of the periodograms obtained from the data segments is implemented in Stingray. A comparison of this method on the same data presented above is shown in Figure 2. Here two estimates are shown, in one the data is simply split and the periodogram calculated and averaged (green curve), whereas in the other (orange curve), the data is also tapered using a ‘Hann’ window before the periodogram is calculated on overlapping data segments; this is known as the Welch method (Welch 1967).

Figure 2

https://user-images.githubusercontent.com/65168627/129919282-7b504890-552a-4511-a7bb-ca263f29f6e7.png

Both the tapered and the untapered averaging periodograms do the job of reducing the variance in the estimate. The bias, though, is still present in the untapered version, whereas the estimate produced using the Welch method does a good job of following the true spectrum all the way through.

The Welch method estimate (orange curve), for the purpose of this example, has been obtained using a function implemented in the SciPy library, namely scipy.signal.welch, as it makes use of overlapping data segments, something Stingray does not support at the moment. Viability and inclusion of the support for overlapping data segments will be checked and attempted.

The difference between this tapered averaging method (Welch method) and the multitaper method is as follows,

In the multitaper method, the same (full) data is used with different tapers, given by the Selpians (Slepian 1978), to calculate the periodograms which are then averaged.

In the Welch method, on the other hand, different data (overlapping data segments) is used with the same taper/window (‘Hann’ in the example above) before calculating and averaging the periodograms.

But there is a catch; because the tapered averaging method segments the original data into smaller pieces the frequency resolution of the individual and the resultant averaged periodograms is quite low (1/8th of the multitaper method in the example above). The segment size could be augmented to counter this, but increase segment size for better frequency resolution, and you lose the variance reductions, and decrease the segment size, and the estimate fails to follow the true spectrum while also having poorer frequency resolution.

This is in contrast to the multitaper method which operates on the whole data, retaining the frequency resolution. By this reasoning, the multitaper method could be considered as being the jack of all trades, giving optimal frequency resolution, bias, variance, and spectral leakage performance, all at the same time; while also providing a reasonable path to trade one property for the other, should such a need arise.

This quick example shows the superiority of the multitaper method, particularly in modelling the noise of the periodogram while retaining the frequency resolution. And considering that generating periodograms for astronomical data is the core task of Stingray (Huppenkothen, et al. 2019), this is an inclusion that is due. This will, I hope, help demystify something that happened,

“A long time ago in a galaxy, far, far away….”

Implementation Details:

As suggested by the mentors for the project, the initial implementation included writing a wrapper around an existing implementation of the algorithm in such a way that the resulting class would be in coherence, in attributes and methods, with Stingray’s other classes that provided similar tools.

A proof-of-concept implementation was put together (the PR linked above) by creating a wrapper around a project named ‘nitime’, intended for time-series analysis of neuroscience data.

Figure 1 was created using this implementation which integrated a wrapper class with Stingray. This was used to check the accuracy and the compatibility of this method. This result was a (successful) attempt to reproduce an example presented in Springford, Eadie, and Thomson (2020).

After further considerations by the project mentors, it was suggested that the multitaper method be implemented using the SciPy package for flexibility, fine-grain control, speed, and extensibility concerns. Implementing using SciPy also helps keep the dependencies small and in check.

Therefore, the SciPy package will be used to realize this implementation, with scipy.signal.windows.dpss being at its core, as the computation of the Selpians is most demanding and limiting in terms of computing resources. Optimal performance is thus predicted, as the SciPy package is quite optimized.

An outline of the defining attributes and methods the multitaper class is expected to include are,

Attributes:

  • power: Array of the PSD amplitudes normalized according to a desired scheme (Leahy, fractional, absolute, or none).
  • unnorm_power: Array of amplitudes of the unnormalized PSD.
  • power_err: Array of amplitudes of the errors in the PSD.
  • df: Frequency resolution of the periodogram.
  • freq: Array of mid-bin frequencies at which the PSD was calculated.

Methods (exposed to the user):

  • rebin: Rebin periodogram (on decimal and log scales).
  • compute_rms: Fractional root-mean-square amplitude between frequencies.
  • plot: Utility function to plot the periodogram (on decimal and log scales).
  • Methods inherited from the superclass (Crossspectrum.py)

Methods internal to the class are not listed and will primarily be used to set the attributes of the class. This list is just a reference template that will be used as a starting point and will likely evolve with the project. Resemblance to similar classes in Stingray will be emphasized.

This implementation will also support techniques such as,

  • Jackknife: A method to estimate the variance of the PSD at each point to provide a confidence interval.
  • Adaptive weighting: A method to combine the PSDs from the different tapers in an adaptive fashion.
  • Low Bias: A method to only use tapers with better than 90% spectral concentration within the given bandwidth to lower the leakage further and ‘play it safe.’

One requirement of these periodograms that we have neglected so far but is of great importance is that the time series data must be evenly sampled in time. But astronomy is rarely so privileged to fulfil this requirement. A popular solution to this is the use of the Lomb-Scargle (LS) periodogram (Lomb 1976) (Scargle 1982), which interpolates the data for further processing. A recent development (Springford, Eadie, and Thomson 2020) extends and combines this method with the multitaper algorithm. However, problems like aliasing and less than expected effective Nyquist frequency are evident, as this method is still very much under development.

Therefore, if time permits*, a proof-of-concept implementation of the Lomb-Scargle (LS) Multitaper periodogram will be attempted to check its integration viability. Support for overlapping data segments, as hinted earlier, will also be attempted as methods like the Welch (1967) Periodograms can make use of it.

The addition of these methods could also play a part in better understanding the Universe :-), as Nikola Tesla famously said,

“If you want to find the secrets of the universe, think in terms of energy, frequency, and vibration.”

*As the already implemented reference of the Multitaper method provides a good starting point, this should be doable.

Timeline:

Period Description
Community bonding period Become familiar with the usage of Stingray by doing spectral analysis on a real-world dataset. Brush up on the documentation and get acquainted with the design framework of the project. Finalize the project details and approach. Layout the known results (from literature or elsewhere) as a reference to check the accuracy of the (forthcoming) implementation. Go through literature once more, being on the outlook for technical subtleties.

June 7 – June 14

(1 week)

Create skeleton code for the class, implement a bare-bone version of the multitaper method, and get familiar with the SciPy APIs, data types, and pre-processing requirements along the way.

June 15 – June 30

(2 weeks)

Start implementing multitaper periodogram-specific methods, following the design aspects of Stingray (such as input/output data types and parameters at instantiation, which are common and root to the whole project). Crosscheck the results with the references for accuracy and identifying and correcting for any discrepancies such as normalizations.

July 1 – July 7

(1 week)

Get feedback from the mentors. Write tests to check the mathematical accuracy of the implementation. The proof-of-concept implementation is projected to be complete by this week.

July 8 – July 15

(1 week)

Buffer period to resolve any unforeseeable issues. Start implementing additional supporting tools and integrating the method with the project. Take mentor and community feedback to course-correct, if required.

July 16 – July 30

(2 weeks)

Finish implementing all the necessary tools and supporting framework as an extension to the proof-of-concept implementation. Link the class with different parts of the codebase as relevant. Study the project’s testing framework and start writing effective tests to evaluate the implementation under the guidance of the mentors.

August 1 – August 7

(1 week)

Finish testing the implementation and start writing relevant/missing documentation. Prepare a tutorial Jupyter notebook taking an analytical and a real-world example to showcase the method while presenting the structure and APIs to use the tool effectively.

August 8 – August 15

(1 week)

Buffer period to iron out any issues, touch-up on the documentation and tutorials. Wrap up this completed implementation. An attempt will be made to put together a proof-of-concept of the Lomb-Scargle (LS) Multitaper to check its integration viability.

Support for overlapping data segments used in averaged periodograms will also be attempted.

August 16 – August 23

(1 week)

Final week

Submit the final implementation of the Multitaper Periodogram and fine-tune any aspect of the project before final evaluations, as directed by the mentors.

Try to develop a mathematically functional implementation of the Lomb-Scargle (LS) Multitaper and take mentor and community feedback on the same for future extensions. This, after all, is not farewell but just a stopover.

I will also be blogging semi-regularly (once every two weeks) related, but not limited to, this project, what it entails, and how it fits into the astrophysics picture.

Cool Phrases:

It is said that mathematics is the language of physics, and arguably, of the modern society. But as ingeniously said by Franz Kafka,

“All language is but a poor translation.”

And a significant effort of the scholars has been to improve and make more efficient this translation of the physical and the abstract, that is, mathematics. And this project aims to be a (tiny) step in democratizing this language for everybody to make use of.

References:

Collaboration, L.S.S.T. Science, Paul A. Abell, Julius Allison, Scott F. Anderson, John R. Andrew, J. Roger P. Angel, Lee Armus, et al. 2009. “LSST Science Book, Version 2.0.”

Huppenkothen, Daniela, Matteo Bachetti, Abigail L. Stevens, Simone Migliari, Paul Balm, Omar Hammad, Usman Mahmood Khan, et al. 2019. “Stingray: A Modern Python Library for Spectral Timing.” The Astrophysical Journal (American Astronomical Society) 881: 39. doi:10.3847/1538-4357/ab258d.

Lomb, N. R. 1976. “Least-squares frequency analysis of unequally spaced data.” Astrophysics and Space Science (Springer Science and Business Media LLC) 39: 447–462. doi:10.1007/bf00648343.

P. Welch, “The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms,” in IEEE Transactions on Audio and Electroacoustics, vol. 15, no. 2, pp. 70-73, June 1967, doi: 10.1109/TAU.1967.1161901.

Scargle, J. D. 1982. “Studies in astronomical time series analysis. II - Statistical aspects of spectral analysis of unevenly spaced data.” The Astrophysical Journal (American Astronomical Society) 263: 835. doi:10.1086/160554.

Slepian, D. 1978. “Prolate Spheroidal Wave Functions, Fourier Analysis, and Uncertainty-V: The Discrete Case.” Bell System Technical Journal (Institute of Electrical and Electronics Engineers (IEEE)) 57: 1371–1430. doi:10.1002/j.1538-7305.1978.tb02104.x.

Springford, Aaron, Gwendolyn M. Eadie, and David J. Thomson. 2020. “Improving the Lomb–Scargle Periodogram with the Thomson Multitaper.” The Astronomical Journal (American Astronomical Society) 159: 205. doi:10.3847/1538-3881/ab7fa1.

Thomson, D. J. 1982. “Spectrum Estimation and Harmonic Analysis.” IEEE Proceedings 70: 1055-1096. https://ui.adsabs.harvard.edu/abs/1982IEEEP..70.1055T.

1990. “Time series analysis of Holocene climate data.” Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences (The Royal Society) 330: 601–616. doi:10.1098/rsta.1990.0041.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment