
Google Summer of Code '23 - Final Report

Basic Info

Project Details

[GA4GH logo]

Overview of the Project

The genome of an organism consists of a set of chromosomes (or contigs). Genomic scientists often rely on a reference genome to define a baseline of knowledge against which our understanding of biological systems, phenotypes and variation is based. Most of these genomes are served by central providers such as the INSDC (e.g. NCBI, ENA), Ensembl or UCSC, and while these providers usually agree on the sequences being used, they often differ on the naming of those sequences. A standardized sequence naming scheme is therefore crucial for effective communication about genomes.

In this context, the EVA has developed a service named the Contig Alias service for the purpose of translating between different sequence naming conventions.

In the same context, the GA4GH has created a standard named Sequence Collections (SeqCol), which mainly provides a protocol for uniquely identifying the content of a set of sequences, together with a comparison function that assesses the compatibility of two given sequence collections.

Project Goals

  • Evaluate the contig alias database and data model
  • Create the database and Java model to represent a reference genome as sequence collection data (link)
  • Create a mechanism to serialise and digest a sequence collection object
  • Define the main endpoint that resolves a sequence collection object in JSON from its digest (link)
  • Create a comparison function and the associated endpoint (link)

Important Workflows

SeqCol Ingestion Workflow

[Diagram: SeqCol ingestion workflow]

SeqCol digest calculation process

[Diagram: SeqCol digest calculation process]

For more details about the sequence digest computation, see seqcol.readthedocs.io
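At a high level, each seqCol attribute (and the level 1 object itself) is serialised to canonical JSON and digested with the GA4GH sha512t24u function: SHA-512, truncated to 24 bytes and base64url-encoded without padding. Below is a minimal Java sketch of that digest function, for illustration only; it is not the service's exact implementation:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.Base64;

    public class Sha512t24u {

        // GA4GH sha512t24u: SHA-512 of the input, truncated to 24 bytes,
        // then base64url-encoded without padding.
        public static String digest(String canonicalJson) throws NoSuchAlgorithmException {
            byte[] sha512 = MessageDigest.getInstance("SHA-512")
                    .digest(canonicalJson.getBytes(StandardCharsets.UTF_8));
            byte[] truncated = Arrays.copyOfRange(sha512, 0, 24);
            return Base64.getUrlEncoder().withoutPadding().encodeToString(truncated);
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            // e.g. digesting a canonically serialised "lengths" array
            System.out.println(digest("[248956422,242193529,198295559]"));
        }
    }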

Data Model

To design an efficient data model for the seqCol project, we chose disk (database) space and processing time as the main selection criteria. Based on these criteria, we evaluated several data models:

  1. Evaluation of the contig-alias backend, to determine whether its data model would be useful for our API. We eventually agreed that this data model is not the best option for us. More details about the evaluation of this model can be found here.
  2. Evaluation of a recursive data model that uses a single database table to store all the different levels of seqCol objects. This model was also rejected. More details about the evaluation of this model can be found here.
  3. Evaluation of a recursive data model that uses two database tables to store the seqCol data at different levels; this is the model we chose. More details about the evaluation of this model can be found here. Its design is illustrated below:
  • sequence_collections_l1: stores the seqCol level 1 objects. Example:

    | digest (level 0) | naming_convention | seq_col_level1object |
    |---|---|---|
    | rkTW1yZ0e22IN8K-0frqoGOMT8dynNyE | UCSC | {"names": "rIeQU2I79mbAFQwiV_kf6E8OUIEWq5h9", "lengths": "Ms_ixPgQMJaM54dVntLWeovXSO7ljvZh", "sequences": "dda3Kzi1Wkm2A8I99WietU1R8J4PL-D6", "md5-sequences": "_6iaYtcWw4TZaowlL7_64Wu9mbHpDUw4", "sorted-name-length-pairs": "qCETfSy_Ygmk0qtUJwI8V6SdsYX_AC53"} |
    | 3mTg0tAA3PS-R1TzelLVWJ2ilUzoWfVq | GENBANK | {"names": "mfxUkK3J5y7BGVW7hJWcJ3erxuaMX6xm", "lengths": "Ms_ixPgQMJaM54dVntLWeovXSO7ljvZh", "sequences": "dda3Kzi1Wkm2A8I99WietU1R8J4PL-D6", "md5-sequences": "_6iaYtcWw4TZaowlL7_64Wu9mbHpDUw4", "sorted-name-length-pairs": "QFuKs5Hh8uQwwUtnRxIf8W3zeJoFOp8Z"} |
  • seqcol_extended_data: stores the extended (exploded) level 2 seqCol data; each entry's digest corresponds to the value of one level 1 attribute. Example:

    | digest | extended_seq_col_data |
    |---|---|
    | dda3Kzi1Wkm2A8I99WietU1R8J4PL-D6 | {"object": ["SQ.lZyxiD_ByprhOUzrR1o1bq0ezO_1gkrn", "SQ.vw8jTiV5SAPDH4TEIZhNGylzNsQM4NC9", "SQ.A_i2Id0FjBI-tQyU4ZaCEdxRzQheDevn", "SQ.QXSUMoZW_SSsCCN9_wc-xmubKQSOn3Qb", "SQ.UN_b-wij0EtsgFqQ2xNsbXs_GYQQIbeQ", "SQ.z-qJgWoacRBV77zcMgZN9E_utrdzmQsH", "SQ.9wkqGXgK6bvM0gcjBiTDk9tAaqOZojlR", "SQ.K8ln7Ygob_lcVjNh-C8kUydzZjRt3UDf", "SQ.hb1scjdCWL89PtAkR0AVH9-dNH5R0FsN", "SQ.DKiPmNQT_aUFndwpRiUbgkRj4DPHgGjd", "SQ.RwKcMXVadHZub1qL0Y5c1gmNU1_vHFme", "SQ.1sw7ZtgO9JRb1kUEuhVz1wBix5_8Opci", "SQ.V7DQqMKG7bcyxiMZK9wNjkK-udR7hrad", "SQ.R8nT1N2qQFMc_uVMQUVMw-D2GcVmb5v6", "SQ.DPa_ORXLkGyyCbW9SWeqePfortM-Vdlm", "SQ.koyLEKoDOQtGHjb4r0m3o2SXxI09Z_sI"]} |
    | _6iaYtcWw4TZaowlL7_64Wu9mbHpDUw4 | ... |
    | rIeQU2I79mbAFQwiV_kf6E8OUIEWq5h9 | ... |
    | Ms_ixPgQMJaM54dVntLWeovXSO7ljvZh | ... |
    | qCETfSy_Ygmk0qtUJwI8V6SdsYX_AC53 | ... |
    | mfxUkK3J5y7BGVW7hJWcJ3erxuaMX6xm | ... |
    | QFuKs5Hh8uQwwUtnRxIf8W3zeJoFOp8Z | ... |

The matching digests show how the data is resolved recursively between the two tables: each attribute value in a seq_col_level1object is itself the digest key of a row in seqcol_extended_data.
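To make the two-table design more concrete, here is a minimal JPA sketch of the corresponding entities; the class and field names are illustrative and are not the service's actual code:

    import javax.persistence.*;

    // Level 1: one row per sequence collection and naming convention
    @Entity
    @Table(name = "sequence_collections_l1")
    public class SeqColLevel1 {

        @Id
        private String digest;                      // level 0 digest

        @Enumerated(EnumType.STRING)
        private NamingConvention namingConvention;  // e.g. UCSC, GENBANK

        @Column(columnDefinition = "jsonb")
        private String seqColLevel1Object;          // attribute name -> level 1 digest

        public enum NamingConvention { UCSC, GENBANK }
    }

    // Level 2: one row per exploded attribute, keyed by its level 1 digest
    @Entity
    @Table(name = "seqcol_extended_data")
    class SeqColExtendedData {

        @Id
        private String digest;                      // level 1 attribute digest

        @Column(columnDefinition = "jsonb")
        private String extendedSeqColData;          // e.g. {"object": [...]}
    }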

A detailed evaluation of all the proposed data models can be found here: Deliverable 1.

Endpoints

Note: the seqCol service is currently deployed on server 45.88.81.158, on port 8081

Exposed endpoints

  • PUT - SERVER_IP:PORT/eva/webservices/seqcol/admin/seqcols/{asm_accession}
  • GET - SERVER_IP:PORT/eva/webservices/seqcol/collection/{seqCol_digest}?level={level}
  • GET - SERVER_IP:PORT/eva/webservices/seqcol/comparison/{seqColA_digest}/{seqColB_digest}
  • POST - SERVER_IP:PORT/eva/webservices/seqcol/comparison/{seqColA_digest}; body = {level 2 JSON representation of another seqCol}
  • GET - SERVER_IP:PORT/eva/webservices/seqcol/collection/service-info
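As an illustration, the comparison endpoint can be called programmatically as below: a minimal sketch using Java 11's built-in HTTP client and the two level 0 digests from the data model example above.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SeqColComparisonExample {

        public static void main(String[] args) throws Exception {
            // Base URL of the deployed service (see the note above)
            String base = "http://45.88.81.158:8081/eva/webservices/seqcol";

            // Compare two sequence collections by their level 0 digests
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(base + "/comparison/"
                            + "rkTW1yZ0e22IN8K-0frqoGOMT8dynNyE/"
                            + "3mTg0tAA3PS-R1TzelLVWJ2ilUzoWfVq"))
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());

            System.out.println(response.body()); // JSON comparison report
        }
    }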

Usage and description

For more detailed, user-friendly documentation of the API's endpoints, please visit the seqCol service's Swagger page

Compile

This web service has some authenticated endpoints. The current approach to securing them is to provide the credentials in the src/main/resources/application.properties file at compilation time, using Maven profiles.

The application also requires a connection to an external database (PostgreSQL by default) in order to function. The database credentials must be provided at compilation time using the same Maven profiles.

You can edit the Maven profile values in pom.xml by locating the section below and changing the values manually, or by setting environment variables. Alternatively, you can make the changes directly in the application.properties file.

Use <ftp.proxy.host> and <ftp.proxy.port> to configure proxy settings for accessing FTP servers (such as NCBI's). Set them to null and 0 to avoid overriding the default proxy configuration.

Set a boolean flag using <contig-alias.scaffolds-enabled> to enable or disable parsing and storing of scaffolds in the database.

 <profiles>
	<profile>
		<id>seqcol</id>
		<properties>
			<spring.profiles.active>seqcol</spring.profiles.active>
			<seqcol.db-url>jdbc:postgresql://${env.SERVER_IP}:${env.POSTGRES_PORT}/seqcol_db</seqcol.db-url>
			<seqcol.db-username>${env.POSTGRES_USER}</seqcol.db-username>
			<seqcol.db-password>${env.POSTGRES_PASS}</seqcol.db-password>
			<seqcol.ddl-behaviour>${env.DDL_BEHAVIOUR}</seqcol.ddl-behaviour>
			<seqcol.admin-user>${env.ADMIN_USER}</seqcol.admin-user>
			<seqcol.admin-password>${env.ADMIN_PASSWORD}</seqcol.admin-password>
			<ftp.proxy.host>${optional default=null}</ftp.proxy.host>
			<ftp.proxy.port>${optional default=0}</ftp.proxy.port>
			<contig-alias.scaffolds-enabled>${optional default=false}</contig-alias.scaffolds-enabled>
		</properties>
		<activation>
			<activeByDefault>true</activeByDefault>
		</activation>
	</profile>
 </profiles>

Once that's done, you can trigger the variable replacement with Maven's -P option. For example, run mvn clean install -Pseqcol to compile the service and run the tests, or mvn clean install -Pseqcol -DskipTests to skip them.

You can then run mvn spring-boot:run to start the service.

Deployment

The API is currently deployed on an embedded Tomcat server at 45.88.81.158, port 8081. Briefly, the deployment steps are as follows:

  1. Connect to server 45.88.81.158 (deployment server) via SSH
  2. Clone the seqCol project's repo on the server (under /opt/eva-seqcol for example)
  3. Update the configuration files such as pom.xml and/or application.properties with the correct information/credentials
  4. Run: /opt/eva-seqcol$ nohup bash -c 'git pull && mvn clean install -DskipTests -f local-pom.xml && export TOMCAT_PASS=$(cat ~/.seqcol-tomcat-pass) && sleep 5 && WAR_FILE=$(ls -1 target/*.war | head -1) && java -jar $WAR_FILE' 1>> eva-seqcol.log 2>&1 & (the single quotes keep the command substitutions from being expanded by the outer shell)
  5. Log messages are written to the eva-seqcol.log file; you can follow them live with /opt/eva-seqcol$ tail -f eva-seqcol.log
  6. If you want to stop the running version and pull the latest one, run /opt/eva-seqcol$ kill `ps -ef | grep -E '.[e]va-seqcol..war' | awk '{print $2}'`, then repeat from step 4.

Pull Requests and Issues

EBIvariation/contig-alias

  • Issue #105: Sequence collection data models using contig alias
  • Issue #106: Sequence collection data models without using contig alias
  • Issue #107: Add function to download the fasta file when retrieving the genome from FTP
  • Issue #108: Some assemblies in the NCBI database do not have fna/fasta files
  • Issue #111: Evaluation of the Contig-alias's data model
  • PR #110: Seqcol: Retrieve the FASTA file from the remote repository.
  • PR #112: With contig alias (Data Model Proposition)
  • PR #113: Recursive approach (Data Model Proposition)
  • PR #114: Recursive approach v2 (Data Model Proposition)
  • PR #115: Recursive approach v3 (Data Model Proposition)
  • PR #116: Relational data model (Data Model Proposition)
  • PR #117: Simple intuitive approach (Data Model Proposition)

EBIvariation/eva-seqcol

  • PR #1: Created the spring boot project with the necessary dependencies
  • PR #2: Implemented the process of downloading the assembly sequences fasta file
  • PR #3: Changed the checksum calculation design and renamed some classes plus slight changes
  • Issue #4: Refget checksum calculation (ga4gh ) not working
  • Issue #5: Confusing sequence (chromosome) identifiers in the FASTA files
  • PR #6: Composite primary key
  • PR #7: Ingest seqcol process
  • PR #8: Create maven.yml - Github action
  • PR #9: Data model refactor
  • PR #10: Full ingest of seqcol
  • PR #11: Seqcol initial controller
  • Issue #12: Some naming conventions are missing in some assembly reports
  • PR #14: Add new attribute to model
  • PR #16: Seqcol comparison api
  • Issue #18: Method getSeqColByDigestAndLevel unexpectedly throws org.hibernate.LazyInitializationException...
  • PR #19: fixed the digest calculation algorithm
  • PR #20: Seqcol comparison api generic version - using low level reflection libraries
  • PR #21: added the unbalanced-duplicate condition to the comparison api
  • PR #22: fetch and insert all possible seqCol objects from one assembly read
  • Issue #23: Adding flexibility for additional attributes in comparison endpoint
  • PR #24: Test suite enhancements - SeqColWriter & AssemblyDataGenerator
  • Issue #26: There's still a use of the language level reflections in the merged code
  • Issue #27: Common elements count algorithm is not correct - Comparison api
  • PR #28: fix common elements count algorithm - Comparison API
  • PR #29: Change json attribute name - md5-sequences
  • PR #30: added the sorted-name-length-pair attribute to the model
  • PR #31: Some code clean up and enhancements
  • Issue #32: Using the wrong algorithm to digest seqCol Level 2 into Level 1 seqCol object
  • PR #33: Generic comparison function for both GET and POST requests
  • Issue #34: Wrong algorithm for a-and-b-same-order in comparison function
  • PR #36: Digest algorithm correction - level 2 to level 1 digest calculation
  • Issue #37: Declare implementation capability in service-info endpoint
  • Issue #38: Create swagger page
  • PR #40: fixed the a-and-b-same-order algorithm
  • PR #41: added a basic authentication for admin endpoints
  • PR #42: Swagger documentation
  • PR #43: Implemented the service-info endpoint
  • PR #44: basic integration tests for the API's endpoints
  • PR #46: Fixed the digits verification for seqCol arrays to be digested - ingestion endpoint
  • PR #47: fixed the seqCol retrieval's default level

Technologies Used

  • Spring Boot v2.7.13
  • PostgreSQL Database v15.2
  • Swagger v3 (springdoc-openapi implementation)
  • Git and GitHub Actions

Useful Links

Experience

Google Summer of Code was one of the best programs I have taken part in so far. During the program, I had the opportunity to work and collaborate with highly experienced people, from whom I gained not only technical skills in Java, Spring Boot, Git, GitHub, etc., but also project management skills that allowed me to submit deliverables on time. I can summarise the experience I gained in GSoC '23 as follows:

  • Enhanced my programming skills mainly in Java and Spring Boot
  • Enhanced my software and database design capabilities
  • Enhanced my skills in Git, GitHub Actions and open-source contribution in general
  • Enhanced my communication skills, allowing me to better convey my ideas and ask the right questions
  • Got involved with a wonderful community full of passion and energy

Acknowledgment

Finally, I would like to give special thanks to the people who helped make this project a success: my mentors, Timothee Cezard, April Shen and, of course, Sundar Venkataraman. They were very professional and supportive, and I was really lucky to have them as my mentors.
