minor edit: The PR was unmerged when the GSoC period ended; however, now that it has been merged to main, I've updated parts of this report to reflect that.
- Student: Akshit Tyagi
- Email: tyagiakshit833@gmail.com
- GitHub: @exitflynn
- Project: Scraper Rewrite
- Mentors: Nabil Freij (nabobalis), Shane Maloney (samaloney), Laura Hayes (hayesla)
- Organization: SunPy (OpenAstronomy)
- Description: Retrieving data from external sources is one of the core features offered by the SunPy library. Some specialised clients are already implemented to integrate with FIDO, and more can be added by users. For cases where the required information is stored in an orderly fashion, that is, all files following a common naming scheme or directory pattern based on date, time or other factors, we have the `sunpy.net.scraper` submodule, which we can use to write such clients. Over time the Scraper class has grown too complex, so this project aims to rewrite it to be more maintainable and easier to work with. This involves replacing regex with `parse` and having a Scraper client require only one input instead of both a `baseurl` and a `pattern` string.
- Rewrote the Scraper class functions and tests to replace regex with `parse`. This involved coming up with a way to merge `pattern` and `baseurl`, which initially went through some trial and error; I described the problem I faced in my third GSoC blog post. I also removed functions that had become redundant and added new ones. New requirements kept popping up as I worked on fixing the tests, such as standardising datetime information for formats with more than one way of representing it, and these had to be folded into the code as well. This forms a major part of the PR.
- Moved functions to a new `sunpy.net.scraper.scraper_utils` submodule for better organisation, after considering other options such as moving them to `sunpy.util.net` or moving them outside the Scraper class but still within the `sunpy.net.scraper` submodule.
- Discovered a bug in the codebase in how timeranges were extracted from Scraper clients. Fixed it after discussing the desired behaviour, and opened an issue about it to keep track.
- Since the Scraper was a well-integrated, core part of the package, at one point there were more than 150 failing test cases. Fixing them involved updating the code and tests for the Scraper, the other clients (NOAA, RHESSI, EVE and GOES) and the parts of the FIDO codebase in `sunpy.net` to comply with the new API. This sometimes required understanding and changing internal logic, as with the (annoyingly complex) GOES clients, where I ended up having to change how information was added to the output along the way.
- It was brought up that documentation covering just how the scraper operates could be useful for users writing Scraper clients, so I added documentation describing the whole internal scraper algorithm, with examples.
- Added documentation about how to write the new patterns. Also added and extended docstrings for functions that needed them.
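As an example of the datetime-standardisation problem mentioned above: files may encode the same date with a four-digit or a two-digit year, or a month number versus a month name, and all of these need to collapse to one datetime. The helper and field names below are a hypothetical sketch of mine, not sunpy code:

```python
from datetime import datetime

# Map abbreviated month names to month numbers (Jan -> 1, ..., Dec -> 12).
MONTHS = {m: i for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def fields_to_datetime(fields):
    """Collapse differently-represented date fields into one datetime.

    `fields` maps hypothetical names such as "year"/"year2" and
    "month"/"month_name" to the strings extracted from a filename.
    """
    if "year" in fields:
        year = int(fields["year"])
    else:
        # Two-digit year: pivot at 70, so "98" -> 1998 and "05" -> 2005.
        y2 = int(fields["year2"])
        year = 1900 + y2 if y2 >= 70 else 2000 + y2
    if "month" in fields:
        month = int(fields["month"])
    else:
        # Month given by name, e.g. "Jul" or "July".
        month = MONTHS[fields["month_name"][:3].title()]
    day = int(fields.get("day", 1))
    return datetime(year, month, day)

# Both spellings of 4 July 2023 normalise to the same instant.
a = fields_to_datetime({"year": "2023", "month": "07", "day": "04"})
b = fields_to_datetime({"year2": "23", "month_name": "Jul", "day": "4"})
```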
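For a rough sense of what the internal algorithm documented there does, here is a toy sketch (my own simplification, not the sunpy implementation): expand a strftime-style URL pattern over a time range to get the candidate file URLs to check for:

```python
from datetime import datetime, timedelta

def enumerate_urls(pattern, start, end, step=timedelta(days=1)):
    """Fill a strftime-style URL pattern at every `step` in [start, end].

    Toy version: the real scraper instead works out the smallest time
    unit present in the pattern and only lists directories that can
    actually contain matching files.
    """
    urls, t = [], start
    while t <= end:
        url = t.strftime(pattern)
        if url not in urls:  # coarse patterns repeat between steps
            urls.append(url)
        t += step
    return urls

urls = enumerate_urls(
    "https://example.com/data/%Y/%m/eve_%Y%m%d_l2.fits",
    datetime(2023, 7, 1), datetime(2023, 7, 3),
)
# One candidate URL per day in the range, 1-3 July 2023.
```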
Since the project is one big change rather than many smaller ones over time, the main contribution is one large PR that stayed unmerged for a while. After the rewrite, my mentors suggested I open PRs against my own branch for changes that built on top of it while we waited for further reviews and suggestions; those got reviewed in time as well.
Apart from some failing doctests, the project is finished and awaits reviews from other members of the org, in case they have any thoughts on the changes. Update: The PR has been merged! 🎉
I would like to thank Google for making a program like this possible every year, and the very helpful and awesome SunPy community for making me feel comfortable working on this project, especially Nabil, Alasdair and Shane. The mentors were extremely active and quick to respond to any queries I had, whether on weekdays or weekends, and could always make time for a meeting during the day should I need it. They would help out whenever I had a question about what to do, while also encouraging me to take ownership of the project and make logic-backed decisions myself. Working on this project was an amazing experience, my first time working with a codebase of this size, and something I'll always look back on with great fondness.