Last active
April 18, 2024 15:59
-
-
Save DSuveges/82f7bbf2bcb3b38a97d9326034ff6e2f to your computer and use it in GitHub Desktop.
QC Updated GnomAD
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "# QC updated GnomAD datasets\n\nApplied updates on the datasets:\n\n- Removing Ensembl/Gnomad coordinate conversion for indels\n- Updating schema to reflect change\n\n**Tasks**\n\n- Ensure data consistency - the number of variants and variants in the ld index should not change.\n- Validate new schema.\n- Validate the changes in the indel coordinates\n\n" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T14:23:14.760329Z", | |
"end_time": "2024-04-18T14:23:14.766101Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "from pyspark.sql import SparkSession\nfrom pyspark.sql import functions as f\nfrom pyspark.sql import types as t\nfrom pyspark.sql import DataFrame, Column\n\nspark = SparkSession.builder.getOrCreate()\n\n", | |
"execution_count": 3, | |
"outputs": [] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## QC Variant annotation\n\n\n**Conclusions:**\n\n- The reference for variant annotation in the configuration point to a dataset based on GnomAD3.\n- The `gnomad3VariantId` column is dropped. Good.\n- Could verify that the same indel variant in the old and the new show different in position:\n\n*New data*\n\n```\n-RECORD 0---------------------------\n variantId | 10-41776411-C-CT \n chromosome | 10 \n position | 41776411 \n referenceAllele | C \n alternateAllele | CT \n```\n\n*Old data*:\n\n```\n-RECORD 0----------------------------\n variantId | 10_41776412_C_CT \n chromosome | 10 \n position | 41776412 \n referenceAllele | C \n alternateAllele | CT \n gnomad3VariantId | 10-41776411-C-CT \n```" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T14:24:10.995962Z", | |
"end_time": "2024-04-18T14:25:18.405182Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "OLD_PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/variant_annotation'\nNEW_PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX_gnomad_update/variant_annotation'\n\ndatasets = {\n 'old': spark.read.parquet(OLD_PATH),\n 'new': spark.read.parquet(NEW_PATH),\n}\n\ndef get_count(df: DataFrame) -> int:\n return df.count()\n\nfor label, df in datasets.items():\n print(f'Number of variants in the {label} dataset: {get_count(df)}')", | |
"execution_count": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": " \r", | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "Number of variants in the old dataset: 759302267\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "[Stage 5:======================================================>(530 + 1) / 531]\r", | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "Number of variants in the new dataset: 759336320\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "\r \r", | |
"name": "stderr" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T14:26:12.560050Z", | |
"end_time": "2024-04-18T14:27:02.063373Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "G4PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX_gnomad4/variant_annotation'\n\nspark.read.parquet(G4PATH).count()", | |
"execution_count": 5, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": " \r", | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "execute_result", | |
"execution_count": 5, | |
"data": { | |
"text/plain": "759336320" | |
}, | |
"metadata": {} | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T14:33:15.689236Z", | |
"end_time": "2024-04-18T14:33:18.446310Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "datasets = {\n 'old': spark.read.parquet(G4PATH),\n 'new': spark.read.parquet(NEW_PATH),\n}", | |
"execution_count": 7, | |
"outputs": [] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T14:35:02.128013Z", | |
"end_time": "2024-04-18T14:35:02.135270Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "def show_schema(df: DataFrame) -> None:\n df.printSchema()\n \nfor label, df in datasets.items():\n print(f'The schema of the {label} dataset:')\n show_schema(df)\n print('\\n')", | |
"execution_count": 8, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "The schema of the old dataset:\nroot\n |-- gnomad3VariantId: string (nullable = true)\n |-- chromosome: string (nullable = true)\n |-- position: integer (nullable = true)\n |-- variantId: string (nullable = true)\n |-- chromosomeB37: string (nullable = true)\n |-- positionB37: integer (nullable = true)\n |-- referenceAllele: string (nullable = true)\n |-- alternateAllele: string (nullable = true)\n |-- rsIds: array (nullable = true)\n | |-- element: string (containsNull = true)\n |-- alleleType: string (nullable = true)\n |-- alleleFrequencies: array (nullable = true)\n | |-- element: struct (containsNull = true)\n | | |-- populationName: string (nullable = true)\n | | |-- alleleFrequency: double (nullable = true)\n |-- vep: struct (nullable = true)\n | |-- mostSevereConsequence: string (nullable = true)\n | |-- transcriptConsequences: array (nullable = true)\n | | |-- element: struct (containsNull = true)\n | | | |-- aminoAcids: string (nullable = true)\n | | | |-- consequenceTerms: array (nullable = true)\n | | | | |-- element: string (containsNull = true)\n | | | |-- geneId: string (nullable = true)\n | | | |-- lof: string (nullable = true)\n |-- inSilicoPredictors: struct (nullable = true)\n | |-- cadd: struct (nullable = true)\n | | |-- phred: float (nullable = true)\n | | |-- raw: float (nullable = true)\n | |-- revelMax: double (nullable = true)\n | |-- spliceaiDsMax: float (nullable = true)\n | |-- pangolinLargestDs: double (nullable = true)\n | |-- phylop: double (nullable = true)\n | |-- siftMax: double (nullable = true)\n | |-- polyphenMax: double (nullable = true)\n\n\n\nThe schema of the new dataset:\nroot\n |-- variantId: string (nullable = true)\n |-- chromosome: string (nullable = true)\n |-- position: integer (nullable = true)\n |-- chromosomeB37: string (nullable = true)\n |-- positionB37: integer (nullable = true)\n |-- referenceAllele: string (nullable = true)\n |-- alternateAllele: string (nullable = true)\n |-- rsIds: array (nullable = true)\n | |-- element: string (containsNull = true)\n |-- alleleType: string (nullable = true)\n |-- alleleFrequencies: array (nullable = true)\n | |-- element: struct (containsNull = true)\n | | |-- populationName: string (nullable = true)\n | | |-- alleleFrequency: double (nullable = true)\n |-- vep: struct (nullable = true)\n | |-- mostSevereConsequence: string (nullable = true)\n | |-- transcriptConsequences: array (nullable = true)\n | | |-- element: struct (containsNull = true)\n | | | |-- aminoAcids: string (nullable = true)\n | | | |-- consequenceTerms: array (nullable = true)\n | | | | |-- element: string (containsNull = true)\n | | | |-- geneId: string (nullable = true)\n | | | |-- lof: string (nullable = true)\n |-- inSilicoPredictors: struct (nullable = true)\n | |-- cadd: struct (nullable = true)\n | | |-- phred: float (nullable = true)\n | | |-- raw: float (nullable = true)\n | |-- revelMax: double (nullable = true)\n | |-- spliceaiDsMax: float (nullable = true)\n | |-- pangolinLargestDs: double (nullable = true)\n | |-- phylop: double (nullable = true)\n | |-- siftMax: double (nullable = true)\n | |-- polyphenMax: double (nullable = true)\n\n\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T15:07:02.804680Z", | |
"end_time": "2024-04-18T15:07:03.373575Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "filter_columns = [\n 'variantId',\n 'chromosome',\n 'position',\n 'referenceAllele',\n 'alternateAllele'\n]\n\n(\n datasets['new']\n .filter(\n (f.length(f.col('referenceAllele')) > 1) |\n (f.length(f.col('alternateAllele')) > 1)\n )\n .select(*filter_columns)\n .show(1, vertical=True)\n)", | |
"execution_count": 11, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "-RECORD 0---------------------------\n variantId | 10-41776411-C-CT \n chromosome | 10 \n position | 41776411 \n referenceAllele | C \n alternateAllele | CT \nonly showing top 1 row\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T15:08:19.641765Z", | |
"end_time": "2024-04-18T15:08:20.133358Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "(\n datasets['old']\n .filter(\n (f.length(f.col('referenceAllele')) > 1) |\n (f.length(f.col('alternateAllele')) > 1)\n )\n .select(*filter_columns, 'gnomad3VariantId')\n .show(1, vertical=True)\n)", | |
"execution_count": 15, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "-RECORD 0----------------------------\n variantId | 10_41776412_C_CT \n chromosome | 10 \n position | 41776412 \n referenceAllele | C \n alternateAllele | CT \n gnomad3VariantId | 10-41776411-C-CT \nonly showing top 1 row\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "## QC LD-index\n\n- The size of the LD index doesn't change.\n- The coordinates of indels in the LD index is changed:\n\n*Old ld-index:*\n\n```\n+-----------------+\n|variantId |\n+-----------------+\n|6_100124716_CA_C |\n|6_100124716_C_CA |\n|6_100124715_C_A |\n|6_100124716_C_CAA|\n+-----------------+\n```\n\n*New ld-index:*\n\n```\n+-----------------+\n|variantId |\n+-----------------+\n|6_100124715_C_CA |\n|6_100124715_C_CAA|\n|6_100124715_C_A |\n|6_100124715_CA_C |\n+-----------------+\n```" | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T15:14:58.162619Z", | |
"end_time": "2024-04-18T15:16:30.024804Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "OLD_PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/ld_index'\nNEW_PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX_gnomad_update/ld_index'\n\ndatasets = {\n 'old': spark.read.parquet(OLD_PATH),\n 'new': spark.read.parquet(NEW_PATH),\n}\n\ndef get_count(df: DataFrame) -> int:\n return df.count()\n\nfor label, df in datasets.items():\n print(f'Number of variants in the {label} dataset: {get_count(df)}')", | |
"execution_count": 16, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": " \r", | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "Number of variants in the old dataset: 33997554\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "[Stage 29:=====================================================>(830 + 1) / 831]\r", | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "Number of variants in the new dataset: 33997554\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "\r \r", | |
"name": "stderr" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T15:18:29.033648Z", | |
"end_time": "2024-04-18T15:18:31.708552Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "datasets = {\n 'old': spark.read.parquet(OLD_PATH),\n 'new': spark.read.parquet(NEW_PATH),\n}\n\nfor label, df in datasets.items():\n print(f'The schema of the {label} dataset:')\n show_schema(df)\n print('\\n')\n\n", | |
"execution_count": 18, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "The schema of the old dataset:\nroot\n |-- variantId: string (nullable = true)\n |-- ldSet: array (nullable = true)\n | |-- element: struct (containsNull = true)\n | | |-- tagVariantId: string (nullable = true)\n | | |-- rValues: array (nullable = true)\n | | | |-- element: struct (containsNull = true)\n | | | | |-- population: string (nullable = true)\n | | | | |-- r: double (nullable = true)\n |-- chromosome: string (nullable = true)\n\n\n\nThe schema of the new dataset:\nroot\n |-- variantId: string (nullable = true)\n |-- ldSet: array (nullable = true)\n | |-- element: struct (containsNull = true)\n | | |-- tagVariantId: string (nullable = true)\n | | |-- rValues: array (nullable = true)\n | | | |-- element: struct (containsNull = true)\n | | | | |-- population: string (nullable = true)\n | | | | |-- r: double (nullable = true)\n |-- chromosome: string (nullable = true)\n\n\n\n", | |
"name": "stdout" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"ExecuteTime": { | |
"start_time": "2024-04-18T15:24:46.569218Z", | |
"end_time": "2024-04-18T15:25:11.834286Z" | |
}, | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "for label, df in datasets.items():\n print(f'Example from {label} dataset:')\n (\n df\n .select('variantId')\n .filter(\n (f.col('chromosome') == '6') &\n (f.split(f.col('variantId'), '_')[1].startswith('10012471'))\n )\n .show(truncate=False)\n )\n print('\\n')", | |
"execution_count": 22, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"text": "Example from old dataset:\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": " \r", | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "+-----------------+\n|variantId |\n+-----------------+\n|6_100124716_CA_C |\n|6_100124716_C_CA |\n|6_100124715_C_A |\n|6_100124716_C_CAA|\n+-----------------+\n\n\n\nExample from new dataset:\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "[Stage 56:=====================================================> (25 + 1) / 26]\r", | |
"name": "stderr" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "+-----------------+\n|variantId |\n+-----------------+\n|6_100124715_C_CA |\n|6_100124715_C_CAA|\n|6_100124715_C_A |\n|6_100124715_CA_C |\n+-----------------+\n\n\n\n", | |
"name": "stdout" | |
}, | |
{ | |
"output_type": "stream", | |
"text": "\r \r", | |
"name": "stderr" | |
} | |
] | |
}, | |
{ | |
"metadata": { | |
"trusted": true | |
}, | |
"cell_type": "code", | |
"source": "", | |
"execution_count": null, | |
"outputs": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"name": "python3", | |
"display_name": "Python 3", | |
"language": "python" | |
}, | |
"language_info": { | |
"name": "python", | |
"version": "3.10.8", | |
"mimetype": "text/x-python", | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"pygments_lexer": "ipython3", | |
"nbconvert_exporter": "python", | |
"file_extension": ".py" | |
}, | |
"gist": { | |
"id": "82f7bbf2bcb3b38a97d9326034ff6e2f", | |
"data": { | |
"description": "QC Updated GnomAD", | |
"public": false | |
} | |
}, | |
"_draft": { | |
"nbviewer_url": "https://gist.github.com/DSuveges/82f7bbf2bcb3b38a97d9326034ff6e2f" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 5 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment