@exitflynn
Last active November 9, 2023 05:23
GSoC '23 Final Report

GSoC '23 @ SunPy : Scraper Rewrite

Minor edit: the PR was unmerged when the GSoC period ended. Now that it has been merged to main, I've updated parts of this report to reflect that.

Personal Information

Project Information

  • Project: Scraper Rewrite
  • Mentors: Nabil Freij (nabobalis), Shane Maloney (samaloney), Laura Hayes (hayesla)
  • Organization: SunPy (OpenAstronomy)
  • Description: Retrieving data from external sources is one of the core features offered by the SunPy library. Some specialised clients are already implemented to integrate with FIDO, and users can add more. For cases where the required files are stored in an orderly fashion, that is, where all files follow a common naming scheme or directory pattern based on date, time or other factors, the sunpy.net.scraper submodule can be used to write clients. Over time the Scraper class has grown too complex. This project aims to rewrite it so that it is more maintainable and easier to work with. This involves replacing regex with parse and having a scraper client require only one input instead of both a baseurl and a pattern string.
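To make the "one input" idea concrete, here is a minimal sketch of a single pattern with named fields covering both the directory part and the file name. It is only an illustration: the field names (`year`, `month`, `day`), the `compile_pattern` and `extract_date` helpers and the example.org URL are all hypothetical, and it uses the standard library `re` module under the hood so that it stays self-contained, whereas the rewritten Scraper relies on the parse library.

```python
import re
from datetime import datetime

# Hypothetical field types for this sketch only; not sunpy's actual pattern syntax.
FIELD_REGEX = {"year": r"\d{4}", "month": r"\d{2}", "day": r"\d{2}"}


def compile_pattern(pattern):
    """Turn a '{field}' style pattern into a regex with named groups."""
    # Splitting on a capturing group yields: literal, field, literal, field, ...
    pieces = re.split(r"\{(\w+)\}", pattern)
    seen = set()
    regex = []
    for index, piece in enumerate(pieces):
        if index % 2 == 0:
            # Literal text (URL parts, separators, extension).
            regex.append(re.escape(piece))
        elif piece in seen:
            # A repeated field must carry the same value everywhere.
            regex.append(f"(?P={piece})")
        else:
            seen.add(piece)
            regex.append(f"(?P<{piece}>{FIELD_REGEX[piece]})")
    return re.compile("".join(regex) + r"\Z")


def extract_date(pattern, url):
    """Return the date encoded in url, or None if url does not fit pattern."""
    match = compile_pattern(pattern).match(url)
    if match is None:
        return None
    fields = match.groupdict()
    return datetime(int(fields["year"]), int(fields["month"]), int(fields["day"]))


# One string describes both the directory layout and the file name.
PATTERN = "https://example.org/data/{year}/{month}/file_{year}{month}{day}.fits"
print(extract_date(PATTERN, "https://example.org/data/2023/08/file_20230821.fits"))
```

Because the same field name appears in the directory and in the file name, a URL whose directory year disagrees with the year in its file name is rejected rather than silently accepted.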

What work was done

  • Rewrote the Scraper class functions and tests to replace regex with parse. This involved coming up with a way to merge pattern and baseurl, which took some trial and error; I described the problem I faced in my third GSoC blogpost. I also removed some functions that became redundant and added new ones. New requirements kept popping up as I fixed the tests, and I had to handle them in the code, such as standardising datetime information for formats that have more than one way of representing it. This forms a major part of the PR.
  • Moved functions to a new sunpy.net.scraper.scraper_utils submodule for better organisation, after considering other options such as moving them to sunpy.util.net or keeping them in the sunpy.net.scraper submodule but outside the Scraper class.
  • Discovered a bug in how time ranges were extracted from Scraper clients. Fixed it after discussing the desired behaviour and opened an issue to keep track of it.
  • Since the Scraper was a well-integrated, core part of the package, at one point there were more than 150 failing test cases. Fixing them involved updating the code and tests for the Scraper, other clients (NOAA, RHESSI, EVE and GOES) and parts of the FIDO codebase in sunpy.net to comply with the new API. This sometimes required understanding and changing internal logic, as with the (annoyingly complex) GOES clients, where I ended up having to change how information was added to the output.
  • It was brought up that documentation covering how the scraper operates could be useful for users writing Scraper clients, so I also added documentation describing the whole internal scraper algorithm, with examples.
  • Added documentation on how to write the new patterns. Also added and extended docstrings for functions that needed them.
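As an illustration of what "standardising datetime information" can mean in practice, the sketch below collapses alternative representations of the same date (a four-digit or two-digit year, a numeric month or a month name) into one datetime. The field names (`year`, `year2`, `month`, `month_name`, `day`) and the `standardise_fields` helper are hypothetical and chosen for this example only; they are not taken from the sunpy code. The two-digit year rule is an assumption that mirrors the convention of `%y` in Python's `strptime`.

```python
from datetime import datetime

MONTH_ABBREVIATIONS = [
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
]


def standardise_fields(fields):
    """Collapse alternative date representations into a single datetime.

    `fields` maps hypothetical field names to the strings extracted from a URL.
    """
    if "year" in fields:
        year = int(fields["year"])
    else:
        # Assumption: same pivot as strptime's %y (69-99 -> 1969-1999, 0-68 -> 2000-2068).
        short_year = int(fields["year2"])
        year = 1900 + short_year if short_year >= 69 else 2000 + short_year

    if "month" in fields:
        month = int(fields["month"])
    elif "month_name" in fields:
        # Accept both "Aug" and "August" by comparing the first three letters.
        month = MONTH_ABBREVIATIONS.index(fields["month_name"][:3].lower()) + 1
    else:
        # A pattern without month information defaults to the start of the year.
        month = 1

    day = int(fields.get("day", 1))
    return datetime(year, month, day)


# Two different spellings of the same date end up identical.
print(standardise_fields({"year": "2023", "month": "08", "day": "21"}))
print(standardise_fields({"year2": "23", "month_name": "Aug", "day": "21"}))
```

With everything reduced to one representation, later steps such as comparing a file's date against a requested time range only need to handle datetime objects.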

Merged Pull Requests (to personal fork and main)

Since the project is one big change rather than a series of smaller ones, the main contribution is one big PR that stayed unmerged for a while. After the rewrite, my mentors suggested that I open PRs against my own branch for changes that built on top of it while waiting for further reviews and suggestions. These were reviewed in time as well.

Opened (and closed) issues

The Current State of the Project

Apart from some failing doctests, the project is finished and awaits review from other members of the org, who should also look over the changes in case they have any thoughts. Update: The PR has been merged! 🎉

Acknowledgements

I would like to thank Google for making a program like this possible every year, and the very helpful and awesome SunPy community for making me feel comfortable working on this project, especially Nabil, Alasdair and Shane. The mentors were extremely active and quick to respond to any queries I had, whether on weekdays or weekends. They could always make time for a meeting during the day should I need it. They would help out whenever I had a question about what to do, while also encouraging me to take ownership of the project and make logic-backed decisions by myself. Working on this project was an amazing experience. It was the first time I had worked with a codebase of this size, and it is something that I'll always look back on with great fondness.
