@w-nityammm
Last active August 26, 2025 10:40
Google Summer of Code 2025 - Surveyjoin Database Project Report

Google Summer of Code 2025 - Surveyjoin Database

Project Details

  • Contributor: Nityam Madaan
  • Organization: IOOS
  • GSoC Page: https://summerofcode.withgoogle.com/programs/2025/projects/ewtYg3ZW
  • Duration: 12 Weeks
  • Repository: https://github.com/DFO-NOAA-Pacific/surveyjoin-db
  • Background: The surveyjoin project by NOAA and DFO was a foundational effort to consolidate transboundary marine / trawl data from various surveys conducted across the Northeast Pacific Ocean. It successfully combined the data (provided via an R package) from the various sources into a single, large R data frame, providing a unified view limited to the top 55 species due to size constraints. This structure presented challenges in terms of scalability, performance, standardization, and accessibility.
  • Project Description: This project built upon the previous work by migrating surveyjoin to a scalable and standardised relational database in PostgreSQL. The project achieved its major planned milestones, beginning with the design of the relational database schema and the initial migration of the 55-species datasets. Second, the database was expanded to include 1,600+ species, which nearly tripled the amount of available catch data. Furthermore, the project achieved its planned stretch goal of integrating biological / specimen data into the database. The final deliverable is a fast, scalable, and documented database that is accompanied by the reproducible R-based pipelines used throughout the project. This database serves as a powerful new resource for marine ecosystem research and will allow researchers to focus more on their model results rather than data wrangling.

Code Contributions

  • PR#8 - Added NWFSC hook-and-line data (not used in the db yet), and some initial transformation pipes.
  • PR#14 - Populated the database and tested it with data from the original 55 species.
  • PR#23 - Added a guide and pipeline to set up and reproduce the entire database locally for development.
  • PR#31 - Expanded the database to include 1,600+ species with ~1.4M positive catch records. Created proper queryable views for the database. Designed a comprehensive R-function pipeline to query the views.
  • PR#35 - Integrated biological / specimen data into the database.
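
To make the idea of queryable views more concrete, here is a minimal sketch of how a wide data frame can be split into related tables and exposed again through a view. It is not the project's actual schema or R pipeline. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs anywhere, and every table name, column name, function name, and data row below is a hypothetical placeholder invented for illustration.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the ones used in surveyjoin-db.
SCHEMA = """
CREATE TABLE survey_event (
    event_id INTEGER PRIMARY KEY,
    survey   TEXT NOT NULL,
    year     INTEGER NOT NULL,
    lat      REAL,
    lon      REAL
);
CREATE TABLE species (
    species_id      INTEGER PRIMARY KEY,
    scientific_name TEXT NOT NULL UNIQUE
);
-- Only positive catches are stored; absences are implied by a missing row.
CREATE TABLE catch (
    event_id   INTEGER NOT NULL REFERENCES survey_event(event_id),
    species_id INTEGER NOT NULL REFERENCES species(species_id),
    catch_kg   REAL NOT NULL CHECK (catch_kg > 0),
    PRIMARY KEY (event_id, species_id)
);
-- A view hides the joins so that users query one flat relation.
CREATE VIEW catch_view AS
SELECT e.survey, e.year, e.lat, e.lon, s.scientific_name, c.catch_kg
FROM catch AS c
JOIN survey_event AS e ON e.event_id = c.event_id
JOIN species AS s ON s.species_id = c.species_id;
"""


def build_demo_db():
    """Create an in-memory database filled with made-up demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO survey_event VALUES (?, ?, ?, ?, ?)",
        [(1, "Survey A", 2019, 48.1, -125.3),
         (2, "Survey B", 2019, 54.2, -131.0)],
    )
    conn.executemany(
        "INSERT INTO species VALUES (?, ?)",
        [(1, "Anoplopoma fimbria"), (2, "Gadus macrocephalus")],
    )
    conn.executemany(
        "INSERT INTO catch VALUES (?, ?, ?)",
        [(1, 1, 12.5), (2, 1, 3.0), (2, 2, 7.25)],
    )
    conn.commit()
    return conn


def get_catch(conn, scientific_name):
    """Return (survey, year, catch_kg) rows for one species via the view."""
    cur = conn.execute(
        "SELECT survey, year, catch_kg FROM catch_view "
        "WHERE scientific_name = ? ORDER BY survey",
        (scientific_name,),
    )
    return cur.fetchall()
```

Calling get_catch(build_demo_db(), "Anoplopoma fimbria") returns the two demo rows for that species. The design point is that each survey event and each species is stored once, and the view reassembles the flat shape that a single data frame used to provide.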

Future Work

As the main development phase of the project concludes, there are several important directions for future development:

  • Darwin Core Standardization: Map the database schema to Darwin Core terms to improve interoperability with global biodiversity data systems like OBIS.
  • Production Deployment & API: Deploy the database to a production server and provide public access via a REST API.
  • Expanded Survey Integration: Expand the database to incorporate additional data sources beyond trawl surveys.
  • Performance Tuning: Consider further optimizations like materialized views or table partitioning to ensure long-term scalability and query speed as the dataset grows.
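
As a rough illustration of the Darwin Core item above, the mapping step could start as a simple rename of columns to Darwin Core terms. This is only a sketch under stated assumptions: the left-hand column names are hypothetical and do not come from the surveyjoin database, the function name is invented, and the quantity type label is an assumed free-text value. The right-hand names are standard Darwin Core terms.

```python
# Hypothetical internal column names (keys) mapped to Darwin Core terms (values).
DWC_MAP = {
    "event_id": "eventID",
    "date": "eventDate",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "scientific_name": "scientificName",
    "catch_kg": "organismQuantity",
}


def to_darwin_core(record):
    """Rename mapped keys to Darwin Core terms and drop unmapped keys."""
    out = {DWC_MAP[key]: value for key, value in record.items() if key in DWC_MAP}
    if "organismQuantity" in out:
        # Darwin Core pairs a quantity with a type; this label is an assumption.
        out["organismQuantityType"] = "biomass in kg"
    return out
```

A real mapping would also need controlled vocabularies, identifiers, and units to be agreed with the data providers, which a plain rename does not cover.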

Reflection and Acknowledgments

This whole project has been an amazing experience. Working with the complexities and nuances of marine data was challenging at times, but tackling those issues and finding solutions made the process all the more satisfying. Writing the code to clean up and organize millions of records, and then seeing it all come together in a final, fast database, was definitely the best part. I could not have asked for a better summer!

I absolutely could not have done this alone, and this project is what it is today because of the incredible group of people who supported it. I am sincerely grateful to Bridget Ferriss, Chris Rooper, Curt Whitmire, Derek Bolser, Emily Markowitz, Eric Ward, Kelli Johnson, Lewis Barnett, Mukta Gupta, Scott Sauri, Sean Anderson, and Stephen Formel. Our weekly meetings were so helpful, and I want to thank every one of them for their time, guidance, and willingness to share their expertise. Their collective feedback was crucial in shaping the project and making this a truly successful and memorable experience.

I'd also like to extend my thanks to the IOOS org admins for giving me this wonderful opportunity. Finally, a huge thank you to Google for organizing the Summer of Code program and making this entire experience possible.
