@AggelosMargkas
Last active September 21, 2023 14:34
Presented here is my concluding report for the Google Summer of Code project in which I actively participated during the summer of 2023.

Final Report

Contributor: Aggelos Margkas

Mentor: Diomidis Spinellis

Project: alexandria3k

Organization: Open Technologies Alliance - GFOSS

Description: This project took place over the summer of 2023 as part of Google Summer of Code. Its objective was to extend the alexandria3k package to support the Bibliographic (Front Page) Text data of the US patents that the US Patent and Trademark Office (USPTO) issues weekly, from 2005 to the present.

Alexandria3k is an open source project that can be used as a command-line tool or as a Python package. It aims to facilitate access to, and research on, various publication datasets, the largest of which is Crossref. For the datasets it supports, alexandria3k provides efficient relational query access, so that data can be retrieved through SQL queries, and it also offers strong connectivity among the publication datasets.
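To give a feel for what relational query access means in practice, here is a minimal sketch that uses only Python's built-in sqlite3 module. The table names, column names, and rows are invented for illustration; they are not alexandria3k's actual schema, and the sketch does not call alexandria3k itself.

```python
import sqlite3

# Hypothetical, simplified schema and data, for illustration only.
connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE patents (id INTEGER PRIMARY KEY, title TEXT, issue_year INTEGER);
    CREATE TABLE inventors (patent_id INTEGER REFERENCES patents(id), name TEXT);
    INSERT INTO patents VALUES
        (1, 'Widget', 2005), (2, 'Gadget', 2005), (3, 'Gizmo', 2006);
    INSERT INTO inventors VALUES
        (1, 'A. Inventor'), (1, 'B. Inventor'), (3, 'A. Inventor');
""")

# Aggregate query: how many patents were issued in each year?
patents_per_year = connection.execute("""
    SELECT issue_year, COUNT(*) FROM patents
    GROUP BY issue_year ORDER BY issue_year
""").fetchall()

# Join query: which patents list a given inventor?
titles_by_inventor = connection.execute("""
    SELECT patents.title FROM patents
    JOIN inventors ON inventors.patent_id = patents.id
    WHERE inventors.name = ? ORDER BY patents.title
""", ("A. Inventor",)).fetchall()

print(patents_per_year)
print(titles_by_inventor)
```

Once patent records sit in related tables like these, questions that span several entities become a single SQL statement rather than custom parsing code.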

A substantial volume of data from the Bibliographic Front Page text dataset is available for download; the data relevant to this project ranges from the year 2005 to the present.

What work was done

Throughout this project, a significant portion of my effort went into understanding the structure of the Bibliographic (Front Page) Text data and connecting it to alexandria3k. This involved creating a plugin called USPTO that enables alexandria3k to access the data. Using the amazing apsw package to unleash the full power of SQLite3, I developed a cursor that points to the correct data properties and navigates through them, so that relational queries can be run over the data. Some important points of my work are:

  • Learning how to read XML files using the document type definition (DTD). A DTD is a specification file containing a set of markup declarations that define a document type. It establishes the document's structure by listing the valid elements and attributes.
  • Parsing and reading the contents of ZIP and XML files, using libraries such as ElementTree and zipfile. In this part of my work I had to deal with the structure of the XML files, which proved not to be valid. I therefore had to devise a way to read through their contents efficiently.
  • Caching. One of alexandria3k's greatest strengths is caching. I implemented caching both for reading through the files of the dataset and for each patent contained inside them. Note that each weekly patent file can contain up to 9000 patents! Caching is therefore a great help for parsing and reading, and it saves a lot of memory and time.
  • Sampling. This is a really helpful tool that alexandria3k provides. It essentially mirrors running alexandria3k on a directory that holds fewer containers than it really does. The resulting fast execution makes it possible to quickly verify the various steps of a workflow.
  • Testing. To quote my mentor: "The work ain't done, till all tests run." Throughout development I always paired my work with thorough testing routines. Testing proved to be a great ally at times when I couldn't understand what was wrong with my code.
  • Documentation. A necessary part of completing my objective was to provide complete documentation of my work and of how the USPTO plugin extension can be of great importance for the research community.
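The ZIP and XML parsing point above can be sketched as follows. This is not the plugin's actual code. It assumes that the weekly files are not valid XML because each one is a concatenation of many complete XML documents, one per patent, each starting with its own `<?xml` declaration. The function names, the archive member name, and the tiny demo records are all hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

XML_DECLARATION = b"<?xml"


def split_documents(data):
    """Yield each XML document found in a byte string of concatenated documents.

    Assumption: every document starts with an XML declaration and the
    marker does not occur anywhere else in the data.
    """
    start = data.find(XML_DECLARATION)
    while start != -1:
        end = data.find(XML_DECLARATION, start + len(XML_DECLARATION))
        yield data[start:end] if end != -1 else data[start:]
        start = end


def iter_patents(zip_source):
    """Yield one parsed root element per patent in each XML member of the archive."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.endswith(".xml"):
                continue
            # For simplicity the whole member is read into memory; a real
            # weekly file is large and would call for streaming instead.
            for document in split_documents(archive.read(name)):
                yield ET.fromstring(document)


def build_demo_archive():
    """Create an in-memory ZIP holding two concatenated, made-up patent documents."""
    template = (
        b'<?xml version="1.0" encoding="UTF-8"?>\n'
        b"<us-patent-grant><doc-number>%s</doc-number></us-patent-grant>\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("demo.xml", template % b"0000001" + template % b"0000002")
    buffer.seek(0)
    return buffer


numbers = [root.findtext("doc-number") for root in iter_patents(build_demo_archive())]
print(numbers)
```

Splitting on the declaration keeps each patent an independent, well-formed document, so a standard parser such as ElementTree can handle them one at a time even though the file as a whole cannot be parsed in one pass.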

Merged Pull Requests

And some minor PRs that were quick fixes:

Raised Issues

What's left to do

What wasn't completed was the creation of the plugin for the PubMed/MEDLINE data. This was an optional addition, recommended in the project's description in case the USPTO implementation was completed ahead of schedule.

Acknowledgement

I want to express my heartfelt gratitude to my mentor, Diomidis Spinellis, for being an unwavering source of support and guidance throughout my GSoC journey. Working under Diomidis's mentorship has been an incredibly enriching experience, marking my first foray into the realm of professional developers, open-source communities, and established software projects. The profound experiences and challenges I've encountered during this project have not only accelerated my growth but have also paved the way for my future development. Diomidis's mentorship has been instrumental in shaping my journey, and I am truly privileged to have had such an inspiring mentor by my side.
