Skip to content

Instantly share code, notes, and snippets.

@DSuveges
Last active April 18, 2024 15:59
Show Gist options
  • Save DSuveges/82f7bbf2bcb3b38a97d9326034ff6e2f to your computer and use it in GitHub Desktop.
Save DSuveges/82f7bbf2bcb3b38a97d9326034ff6e2f to your computer and use it in GitHub Desktop.
QC Updated GnomAD
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": "# QC updated GnomAD datasets\n\nApplied updates on the datasets:\n\n- Removing Ensembl/Gnomad coordinate conversion for indels\n- Updating schema to reflect change\n\n**Tasks**\n\n- Ensure data consistency - the number of variants and variants in the ld index should not change.\n- Validate new schema.\n- Validate the changes in the indel coordinates\n\n"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T14:23:14.760329Z",
"end_time": "2024-04-18T14:23:14.766101Z"
},
"trusted": true
},
"cell_type": "code",
"source": "from pyspark.sql import SparkSession\nfrom pyspark.sql import functions as f\nfrom pyspark.sql import types as t\nfrom pyspark.sql import DataFrame, Column\n\nspark = SparkSession.builder.getOrCreate()\n\n",
"execution_count": 3,
"outputs": []
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## QC Variant annotation\n\n\n**Conclusions:**\n\n- The reference for variant annotation in the configuration point to a dataset based on GnomAD3.\n- The `gnomad3VariantId` column is dropped. Good.\n- Could verify that the same indel variant in the old and the new show different in position:\n\n*New data*\n\n```\n-RECORD 0---------------------------\n variantId | 10-41776411-C-CT \n chromosome | 10 \n position | 41776411 \n referenceAllele | C \n alternateAllele | CT \n```\n\n*Old data*:\n\n```\n-RECORD 0----------------------------\n variantId | 10_41776412_C_CT \n chromosome | 10 \n position | 41776412 \n referenceAllele | C \n alternateAllele | CT \n gnomad3VariantId | 10-41776411-C-CT \n```"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T14:24:10.995962Z",
"end_time": "2024-04-18T14:25:18.405182Z"
},
"trusted": true
},
"cell_type": "code",
"source": "OLD_PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/variant_annotation'\nNEW_PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX_gnomad_update/variant_annotation'\n\ndatasets = {\n 'old': spark.read.parquet(OLD_PATH),\n 'new': spark.read.parquet(NEW_PATH),\n}\n\ndef get_count(df: DataFrame) -> int:\n return df.count()\n\nfor label, df in datasets.items():\n print(f'Number of variants in the {label} dataset: {get_count(df)}')",
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "Number of variants in the old dataset: 759302267\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "[Stage 5:======================================================>(530 + 1) / 531]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "Number of variants in the new dataset: 759336320\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T14:26:12.560050Z",
"end_time": "2024-04-18T14:27:02.063373Z"
},
"trusted": true
},
"cell_type": "code",
"source": "G4PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX_gnomad4/variant_annotation'\n\nspark.read.parquet(G4PATH).count()",
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "execute_result",
"execution_count": 5,
"data": {
"text/plain": "759336320"
},
"metadata": {}
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T14:33:15.689236Z",
"end_time": "2024-04-18T14:33:18.446310Z"
},
"trusted": true
},
"cell_type": "code",
"source": "datasets = {\n 'old': spark.read.parquet(G4PATH),\n 'new': spark.read.parquet(NEW_PATH),\n}",
"execution_count": 7,
"outputs": []
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T14:35:02.128013Z",
"end_time": "2024-04-18T14:35:02.135270Z"
},
"trusted": true
},
"cell_type": "code",
"source": "def show_schema(df: DataFrame) -> None:\n df.printSchema()\n \nfor label, df in datasets.items():\n print(f'The schema of the {label} dataset:')\n show_schema(df)\n print('\\n')",
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": "The schema of the old dataset:\nroot\n |-- gnomad3VariantId: string (nullable = true)\n |-- chromosome: string (nullable = true)\n |-- position: integer (nullable = true)\n |-- variantId: string (nullable = true)\n |-- chromosomeB37: string (nullable = true)\n |-- positionB37: integer (nullable = true)\n |-- referenceAllele: string (nullable = true)\n |-- alternateAllele: string (nullable = true)\n |-- rsIds: array (nullable = true)\n | |-- element: string (containsNull = true)\n |-- alleleType: string (nullable = true)\n |-- alleleFrequencies: array (nullable = true)\n | |-- element: struct (containsNull = true)\n | | |-- populationName: string (nullable = true)\n | | |-- alleleFrequency: double (nullable = true)\n |-- vep: struct (nullable = true)\n | |-- mostSevereConsequence: string (nullable = true)\n | |-- transcriptConsequences: array (nullable = true)\n | | |-- element: struct (containsNull = true)\n | | | |-- aminoAcids: string (nullable = true)\n | | | |-- consequenceTerms: array (nullable = true)\n | | | | |-- element: string (containsNull = true)\n | | | |-- geneId: string (nullable = true)\n | | | |-- lof: string (nullable = true)\n |-- inSilicoPredictors: struct (nullable = true)\n | |-- cadd: struct (nullable = true)\n | | |-- phred: float (nullable = true)\n | | |-- raw: float (nullable = true)\n | |-- revelMax: double (nullable = true)\n | |-- spliceaiDsMax: float (nullable = true)\n | |-- pangolinLargestDs: double (nullable = true)\n | |-- phylop: double (nullable = true)\n | |-- siftMax: double (nullable = true)\n | |-- polyphenMax: double (nullable = true)\n\n\n\nThe schema of the new dataset:\nroot\n |-- variantId: string (nullable = true)\n |-- chromosome: string (nullable = true)\n |-- position: integer (nullable = true)\n |-- chromosomeB37: string (nullable = true)\n |-- positionB37: integer (nullable = true)\n |-- referenceAllele: string (nullable = true)\n |-- alternateAllele: string (nullable = true)\n |-- rsIds: array (nullable = true)\n | |-- element: string (containsNull = true)\n |-- alleleType: string (nullable = true)\n |-- alleleFrequencies: array (nullable = true)\n | |-- element: struct (containsNull = true)\n | | |-- populationName: string (nullable = true)\n | | |-- alleleFrequency: double (nullable = true)\n |-- vep: struct (nullable = true)\n | |-- mostSevereConsequence: string (nullable = true)\n | |-- transcriptConsequences: array (nullable = true)\n | | |-- element: struct (containsNull = true)\n | | | |-- aminoAcids: string (nullable = true)\n | | | |-- consequenceTerms: array (nullable = true)\n | | | | |-- element: string (containsNull = true)\n | | | |-- geneId: string (nullable = true)\n | | | |-- lof: string (nullable = true)\n |-- inSilicoPredictors: struct (nullable = true)\n | |-- cadd: struct (nullable = true)\n | | |-- phred: float (nullable = true)\n | | |-- raw: float (nullable = true)\n | |-- revelMax: double (nullable = true)\n | |-- spliceaiDsMax: float (nullable = true)\n | |-- pangolinLargestDs: double (nullable = true)\n | |-- phylop: double (nullable = true)\n | |-- siftMax: double (nullable = true)\n | |-- polyphenMax: double (nullable = true)\n\n\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T15:07:02.804680Z",
"end_time": "2024-04-18T15:07:03.373575Z"
},
"trusted": true
},
"cell_type": "code",
"source": "filter_columns = [\n 'variantId',\n 'chromosome',\n 'position',\n 'referenceAllele',\n 'alternateAllele'\n]\n\n(\n datasets['new']\n .filter(\n (f.length(f.col('referenceAllele')) > 1) |\n (f.length(f.col('alternateAllele')) > 1)\n )\n .select(*filter_columns)\n .show(1, vertical=True)\n)",
"execution_count": 11,
"outputs": [
{
"output_type": "stream",
"text": "-RECORD 0---------------------------\n variantId | 10-41776411-C-CT \n chromosome | 10 \n position | 41776411 \n referenceAllele | C \n alternateAllele | CT \nonly showing top 1 row\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T15:08:19.641765Z",
"end_time": "2024-04-18T15:08:20.133358Z"
},
"trusted": true
},
"cell_type": "code",
"source": "(\n datasets['old']\n .filter(\n (f.length(f.col('referenceAllele')) > 1) |\n (f.length(f.col('alternateAllele')) > 1)\n )\n .select(*filter_columns, 'gnomad3VariantId')\n .show(1, vertical=True)\n)",
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"text": "-RECORD 0----------------------------\n variantId | 10_41776412_C_CT \n chromosome | 10 \n position | 41776412 \n referenceAllele | C \n alternateAllele | CT \n gnomad3VariantId | 10-41776411-C-CT \nonly showing top 1 row\n\n",
"name": "stdout"
}
]
},
{
"metadata": {},
"cell_type": "markdown",
"source": "## QC LD-index\n\n- The size of the LD index doesn't change.\n- The coordinates of indels in the LD index is changed:\n\n*Old ld-index:*\n\n```\n+-----------------+\n|variantId |\n+-----------------+\n|6_100124716_CA_C |\n|6_100124716_C_CA |\n|6_100124715_C_A |\n|6_100124716_C_CAA|\n+-----------------+\n```\n\n*New ld-index:*\n\n```\n+-----------------+\n|variantId |\n+-----------------+\n|6_100124715_C_CA |\n|6_100124715_C_CAA|\n|6_100124715_C_A |\n|6_100124715_CA_C |\n+-----------------+\n```"
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T15:14:58.162619Z",
"end_time": "2024-04-18T15:16:30.024804Z"
},
"trusted": true
},
"cell_type": "code",
"source": "OLD_PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX/ld_index'\nNEW_PATH = 'gs://genetics_etl_python_playground/output/python_etl/parquet/XX.XX_gnomad_update/ld_index'\n\ndatasets = {\n 'old': spark.read.parquet(OLD_PATH),\n 'new': spark.read.parquet(NEW_PATH),\n}\n\ndef get_count(df: DataFrame) -> int:\n return df.count()\n\nfor label, df in datasets.items():\n print(f'Number of variants in the {label} dataset: {get_count(df)}')",
"execution_count": 16,
"outputs": [
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "Number of variants in the old dataset: 33997554\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "[Stage 29:=====================================================>(830 + 1) / 831]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "Number of variants in the new dataset: 33997554\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T15:18:29.033648Z",
"end_time": "2024-04-18T15:18:31.708552Z"
},
"trusted": true
},
"cell_type": "code",
"source": "datasets = {\n 'old': spark.read.parquet(OLD_PATH),\n 'new': spark.read.parquet(NEW_PATH),\n}\n\nfor label, df in datasets.items():\n print(f'The schema of the {label} dataset:')\n show_schema(df)\n print('\\n')\n\n",
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": "The schema of the old dataset:\nroot\n |-- variantId: string (nullable = true)\n |-- ldSet: array (nullable = true)\n | |-- element: struct (containsNull = true)\n | | |-- tagVariantId: string (nullable = true)\n | | |-- rValues: array (nullable = true)\n | | | |-- element: struct (containsNull = true)\n | | | | |-- population: string (nullable = true)\n | | | | |-- r: double (nullable = true)\n |-- chromosome: string (nullable = true)\n\n\n\nThe schema of the new dataset:\nroot\n |-- variantId: string (nullable = true)\n |-- ldSet: array (nullable = true)\n | |-- element: struct (containsNull = true)\n | | |-- tagVariantId: string (nullable = true)\n | | |-- rValues: array (nullable = true)\n | | | |-- element: struct (containsNull = true)\n | | | | |-- population: string (nullable = true)\n | | | | |-- r: double (nullable = true)\n |-- chromosome: string (nullable = true)\n\n\n\n",
"name": "stdout"
}
]
},
{
"metadata": {
"ExecuteTime": {
"start_time": "2024-04-18T15:24:46.569218Z",
"end_time": "2024-04-18T15:25:11.834286Z"
},
"trusted": true
},
"cell_type": "code",
"source": "for label, df in datasets.items():\n print(f'Example from {label} dataset:')\n (\n df\n .select('variantId')\n .filter(\n (f.col('chromosome') == '6') &\n (f.split(f.col('variantId'), '_')[1].startswith('10012471'))\n )\n .show(truncate=False)\n )\n print('\\n')",
"execution_count": 22,
"outputs": [
{
"output_type": "stream",
"text": "Example from old dataset:\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": " \r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+-----------------+\n|variantId |\n+-----------------+\n|6_100124716_CA_C |\n|6_100124716_C_CA |\n|6_100124715_C_A |\n|6_100124716_C_CAA|\n+-----------------+\n\n\n\nExample from new dataset:\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "[Stage 56:=====================================================> (25 + 1) / 26]\r",
"name": "stderr"
},
{
"output_type": "stream",
"text": "+-----------------+\n|variantId |\n+-----------------+\n|6_100124715_C_CA |\n|6_100124715_C_CAA|\n|6_100124715_C_A |\n|6_100124715_CA_C |\n+-----------------+\n\n\n\n",
"name": "stdout"
},
{
"output_type": "stream",
"text": "\r \r",
"name": "stderr"
}
]
},
{
"metadata": {
"trusted": true
},
"cell_type": "code",
"source": "",
"execution_count": null,
"outputs": []
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.10.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"gist": {
"id": "82f7bbf2bcb3b38a97d9326034ff6e2f",
"data": {
"description": "QC Updated GnomAD",
"public": false
}
},
"_draft": {
"nbviewer_url": "https://gist.github.com/DSuveges/82f7bbf2bcb3b38a97d9326034ff6e2f"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment