@nchelaru
Created November 6, 2019 01:31
7b. DR CNS transcriptome analyses.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"# Index reference transcripts \n",
"\n",
"Updated on Jan 2018"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"## Download \n",
"cat DR_cDNA.fa DR_ncRNA.fa > DR_all_RNA.fa\n",
"\n",
"## Create Salmon index (k=31)\n",
"salmon index -t DR_all_RNA.fa -i ~/DR_all_RNA_Jun28_k31_salmon_index --type quasi -k 31"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"# Pre-processing & mapping"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## SRR3465546 (15-50bp)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"## FastQC\n",
"fastqc -o . -f fastq --extract SRR3465546.fastq.gz -t 8\n",
"\n",
"## rCorrector \n",
"perl ~/install/Rcorrector-master/run_rcorrector.pl -t 10 -s ./SRR3465546.fastq.gz\n",
"\n",
"## Filter\n",
"python ~/FilterUncorrectabledSEfastq.py -i SRR3465546.cor.fq.gz -o filtered\n",
"\n",
"## fastp \n",
"fastp -i filtered_SRR3465546.cor.fq -o filtered_SRR3465546_fastp.cor.fq \\\n",
"-q 5 -c -p -w 10 \\\n",
"-j filtered_SRR3465546_fastp.json -h filtered_SRR3465546_fastp.html -R \"filtered_SRR3465546 report\"\n",
"\n",
"## FastQC [27909863.bc]\n",
"fastqc -o . -f fastq --extract filtered_SRR3465546_fastp.cor.fq -t 10\n",
"\n",
"## Mapping (k=31) ---> 71.3872% reads mapped \n",
"salmon quant -i ~/DR_all_RNA_Jun28_k31_salmon_index -l A \\\n",
"-r filtered_SRR3465546_fastp.cor.fq \\\n",
"-o ~/filtered_SRR3465546_fastp_DR_all_RNA_Jun28_k31_salmon_quant"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## SRR3465547 (50bp)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"## FastQC\n",
"fastqc -o . -f fastq --extract SRR3465547.fastq.gz -t 8\n",
"\n",
"## rCorrector \n",
"perl ~/install/Rcorrector-master/run_rcorrector.pl -t 10 -s ./SRR3465547.fastq.gz\n",
"\n",
"## Filter \n",
"python ~/FilterUncorrectabledSEfastq.py -i SRR3465547.cor.fq.gz -o filtered\n",
" \n",
"## fastp \n",
"fastp -i filtered_SRR3465547.cor.fq -o filtered_SRR3465547_fastp.cor.fq \\\n",
"-q 5 -c -p -w 10 \\\n",
"-j filtered_SRR3465547_fastp.json -h filtered_SRR3465547_fastp.html -R \"filtered_SRR3465547 report\"\n",
"\n",
"## FastQC \n",
"fastqc -o . -f fastq --extract filtered_SRR3465547_fastp.cor.fq -t 10\n",
" \n",
"## Mapping (k=31) index ---> 71.0672%% reads mapped [27922630.bc]\n",
"salmon quant -i ~/DR_all_RNA_Jun28_k31_salmon_index -l A \\\n",
"-r filtered_SRR3465547_fastp.cor.fq \\\n",
"-o ~/filtered_SRR3465547_fastp_DR_all_RNA_Jun28_k31_salmon_quant"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## SRR3465548 (50bp)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"## FastQC\n",
"fastqc -o . -f fastq --extract SRR3465548.fastq.gz -t 8\n",
"\n",
"## rCorrector\n",
"perl ~/install/Rcorrector-master/run_rcorrector.pl -t 10 -s ./SRR3465548.fastq.gz \n",
"\n",
"## Filter \n",
"python ~/FilterUncorrectabledSEfastq.py -i SRR3465548.cor.fq.gz -o filtered\n",
"\n",
"## fastp\n",
"fastp -i filtered_SRR3465548.cor.fq -o filtered_SRR3465548_fastp.cor.fq \\\n",
"-q 5 -c -p -w 10 \\\n",
"-j filtered_SRR3465548_fastp.json -h filtered_SRR3465548_fastp.html -R \"filtered_SRR3465548 report\"\n",
"\n",
"## FastQC \n",
"fastqc -o . -f fastq --extract filtered_SRR3465548_fastp.cor.fq -t 10\n",
"\n",
"## Mapping (k=31) ---> 71.4919% reads mapped\n",
"salmon quant -i ~/DR_all_RNA_Jun28_k31_salmon_index -l A \\\n",
"-r filtered_SRR3465548_fastp.cor.fq \\\n",
"-o ~/filtered_SRR3465548_fastp_DR_all_RNA_Jun28_k31_salmon_quant"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## SRR3465549 (15-50 bp)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"## rCorrector [27908342.bc]\n",
"perl ~/install/Rcorrector-master/run_rcorrector.pl -t 10 -s ./SRR3465549.fastq.gz\n",
"\n",
"## Filter [27908407.bc]\n",
"python ~/FilterUncorrectabledSEfastq.py -i SRR3465549.cor.fq.gz -o filtered\n",
"\n",
"## fastp [27909875.bc]\n",
"fastp -i filtered_SRR3465549.cor.fq -o filtered_SRR3465549_fastp.cor.fq \\\n",
"-q 5 -c -p -w 10 \\\n",
"-j filtered_SRR3465549_fastp.json -h filtered_SRR3465549_fastp.html -R \"filtered_SRR3465549 report\"\n",
"\n",
"## FastQC [27909876.bc]\n",
"fastqc -o . -f fastq --extract filtered_SRR3465549_fastp.cor.fq -t 10\n",
"\n",
"## Mapping (k=31) ---> 71.5107% reads mapped [27922634.bc]\n",
"salmon quant -i ~/DR_all_RNA_Jun28_k31_salmon_index -l A \\\n",
"-r filtered_SRR3465549_fastp.cor.fq \\\n",
"-o ~/filtered_SRR3465549_fastp_DR_all_RNA_Jun28_k31_salmon_quant"
]
},
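{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"The four per-library cells above differ only in the accession. As a rough sketch (not part of the original run), the shared steps could be generated from one template. `build_commands` and `INDEX` are hypothetical names; every flag and path is copied from the cells above, the initial FastQC on the raw reads is left out because not every library ran it, and nothing is executed.\n",
"\n",
"```python\n",
"# Hypothetical helper: only builds command strings, it does not run them\n",
"INDEX = '~/DR_all_RNA_Jun28_k31_salmon_index'\n",
"\n",
"def build_commands(acc, threads=10):\n",
"    cor = f'filtered_{acc}.cor.fq'            # output of the filtering step\n",
"    trimmed = f'filtered_{acc}_fastp.cor.fq'  # output of fastp\n",
"    return [\n",
"        f'perl ~/install/Rcorrector-master/run_rcorrector.pl -t {threads} -s ./{acc}.fastq.gz',\n",
"        f'python ~/FilterUncorrectabledSEfastq.py -i {acc}.cor.fq.gz -o filtered',\n",
"        f'fastp -i {cor} -o {trimmed} -q 5 -c -p -w {threads} '\n",
"        f'-j filtered_{acc}_fastp.json -h filtered_{acc}_fastp.html -R \"filtered_{acc} report\"',\n",
"        f'fastqc -o . -f fastq --extract {trimmed} -t {threads}',\n",
"        f'salmon quant -i {INDEX} -l A -r {trimmed} '\n",
"        f'-o ~/filtered_{acc}_fastp_DR_all_RNA_Jun28_k31_salmon_quant',\n",
"    ]\n",
"\n",
"for acc in ['SRR3465546', 'SRR3465547', 'SRR3465548', 'SRR3465549']:\n",
"    print('\\n'.join(build_commands(acc)))\n",
"```\n",
"\n",
"Printing the commands rather than running them keeps the sketch safe to try; the output can be pasted into a job script once it has been checked."
]
},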
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"# MultiQC summary"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"- [DR_multiqc_report_Jun29.html](https://www.dropbox.com/s/jt1nfaq28nn26st/DR_multiqc_report_Jun29.html?dl=0)\n",
"- [DR multiQC summary Jun30.xlsx](https://www.dropbox.com/s/rflaagkhvaftq5a/DR%20multiQC%20summary%20Jun30.xlsx?dl=0)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"multiqc ."
]
},
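{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"The mapping rates noted in the comments above can also be read back from Salmon's own output. A minimal sketch, assuming each quant directory contains `aux_info/meta_info.json` with a `percent_mapped` field (worth checking against the installed Salmon version); `mapping_rate` and `collect_rates` are hypothetical helper names.\n",
"\n",
"```python\n",
"import json\n",
"from pathlib import Path\n",
"\n",
"def mapping_rate(quant_dir):\n",
"    # Assumption: Salmon wrote aux_info/meta_info.json with a 'percent_mapped' field\n",
"    meta_file = Path(quant_dir) / 'aux_info' / 'meta_info.json'\n",
"    return json.loads(meta_file.read_text())['percent_mapped']\n",
"\n",
"def collect_rates(quant_dirs):\n",
"    # Map each quant directory name to its mapping rate, skipping runs without metadata\n",
"    return {Path(d).name: mapping_rate(d) for d in quant_dirs\n",
"            if (Path(d) / 'aux_info' / 'meta_info.json').exists()}\n",
"```\n",
"\n",
"For example, `collect_rates(sorted(Path('.').glob('*_salmon_quant')))` would gather the rate for every quant directory in the working directory."
]
},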
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"# Extract expressed transcripts"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Extract IDs of transcripts with TPM>0 in all libraries"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"kernel": "Python3"
},
"outputs": [],
"source": [
"## Import libraries\n",
"import pandas as pd\n",
"import os\n",
"os.chdir(\"/home/zhanglab1/ndong/Lymnaea_CNS_transcriptome_files/7_Interspecies_comparison/7b_Zebrafish\")\n",
"\n",
"## Define function\n",
"def extract_non0(salmon_output_filename, library_ID):\n",
" with open(salmon_output_filename, \"r\") as infile:\n",
" lib = pd.read_csv(infile, sep='\\t')\n",
" lib_non0 = lib.loc[lib[\"TPM\"]>0] \n",
" lib_non0_counts = lib_non0[[\"Name\", \"TPM\"]] \n",
" return(lib_non0_counts)"
]
},
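{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"To see what the TPM>0 filter does without the real quant files, here is a small sketch on a made-up table with the same column layout as `quant.sf`. The transcript names and values are invented for illustration only.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for a tab-delimited Salmon quant.sf file; all values are made up\n",
"toy_quant = io.StringIO(\n",
"    'Name\\tLength\\tEffectiveLength\\tTPM\\tNumReads\\n'\n",
"    'tx1\\t1000\\t850.0\\t2.5\\t40\\n'\n",
"    'tx2\\t1200\\t1050.0\\t0.0\\t0\\n'\n",
"    'tx3\\t900\\t750.0\\t7.1\\t95\\n'\n",
")\n",
"\n",
"lib = pd.read_csv(toy_quant, sep='\\t')\n",
"# Same filter as extract_non0: keep rows with TPM > 0, then only the Name and TPM columns\n",
"expressed = lib.loc[lib['TPM'] > 0, ['Name', 'TPM']]\n",
"print(expressed)\n",
"```\n",
"\n",
"Only `tx1` and `tx3` survive, and they keep their original row labels (0 and 2), which is why the row numbers in the real output below have gaps."
]
},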
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"kernel": "Python3"
},
"outputs": [],
"source": [
"## Analyze each library \n",
"lib6 = extract_non0(\"./filtered_SRR3465546_fastp_DR_all_RNA_Jun28_k31_salmon_quant/quant.sf\", \"SRR3465546\")\n",
"lib7 = extract_non0(\"./filtered_SRR3465547_fastp_DR_all_RNA_Jun28_k31_salmon_quant/quant.sf\", \"SRR3465547\")\n",
"lib8 = extract_non0(\"./filtered_SRR3465548_fastp_DR_all_RNA_Jun28_k31_salmon_quant/quant.sf\", \"SRR3465548\")\n",
"lib9 = extract_non0(\"./filtered_SRR3465549_fastp_DR_all_RNA_Jun28_k31_salmon_quant/quant.sf\", \"SRR3465549\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"kernel": "R"
},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th scope=col>Name</th><th scope=col>TPM</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>3</th><td>ENSDART00000170923.2</td><td> 2.479800 </td></tr>\n",
"\t<tr><th scope=row>4</th><td>ENSDART00000171190.2</td><td> 7.667080 </td></tr>\n",
"\t<tr><th scope=row>5</th><td>ENSDART00000165811.2</td><td> 3.880930 </td></tr>\n",
"\t<tr><th scope=row>7</th><td>ENSDART00000007487.9</td><td>17.090300 </td></tr>\n",
"\t<tr><th scope=row>8</th><td>ENSDART00000162972.3</td><td> 0.141226 </td></tr>\n",
"\t<tr><th scope=row>9</th><td>ENSDART00000171570.2</td><td> 5.170290 </td></tr>\n",
"\t<tr><th scope=row>10</th><td>ENSDART00000168177.2</td><td> 7.088380 </td></tr>\n",
"\t<tr><th scope=row>12</th><td>ENSDART00000162709.3</td><td> 9.806990 </td></tr>\n",
"\t<tr><th scope=row>14</th><td>ENSDART00000168926.2</td><td>13.296900 </td></tr>\n",
"\t<tr><th scope=row>15</th><td>ENSDART00000169180.2</td><td> 8.571490 </td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" & Name & TPM\\\\\n",
"\\hline\n",
"\t3 & ENSDART00000170923.2 & 2.479800 \\\\\n",
"\t4 & ENSDART00000171190.2 & 7.667080 \\\\\n",
"\t5 & ENSDART00000165811.2 & 3.880930 \\\\\n",
"\t7 & ENSDART00000007487.9 & 17.090300 \\\\\n",
"\t8 & ENSDART00000162972.3 & 0.141226 \\\\\n",
"\t9 & ENSDART00000171570.2 & 5.170290 \\\\\n",
"\t10 & ENSDART00000168177.2 & 7.088380 \\\\\n",
"\t12 & ENSDART00000162709.3 & 9.806990 \\\\\n",
"\t14 & ENSDART00000168926.2 & 13.296900 \\\\\n",
"\t15 & ENSDART00000169180.2 & 8.571490 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| <!--/--> | Name | TPM | \n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 3 | ENSDART00000170923.2 | 2.479800 | \n",
"| 4 | ENSDART00000171190.2 | 7.667080 | \n",
"| 5 | ENSDART00000165811.2 | 3.880930 | \n",
"| 7 | ENSDART00000007487.9 | 17.090300 | \n",
"| 8 | ENSDART00000162972.3 | 0.141226 | \n",
"| 9 | ENSDART00000171570.2 | 5.170290 | \n",
"| 10 | ENSDART00000168177.2 | 7.088380 | \n",
"| 12 | ENSDART00000162709.3 | 9.806990 | \n",
"| 14 | ENSDART00000168926.2 | 13.296900 | \n",
"| 15 | ENSDART00000169180.2 | 8.571490 | \n",
"\n",
"\n"
],
"text/plain": [
" Name TPM \n",
"3 ENSDART00000170923.2 2.479800\n",
"4 ENSDART00000171190.2 7.667080\n",
"5 ENSDART00000165811.2 3.880930\n",
"7 ENSDART00000007487.9 17.090300\n",
"8 ENSDART00000162972.3 0.141226\n",
"9 ENSDART00000171570.2 5.170290\n",
"10 ENSDART00000168177.2 7.088380\n",
"12 ENSDART00000162709.3 9.806990\n",
"14 ENSDART00000168926.2 13.296900\n",
"15 ENSDART00000169180.2 8.571490"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th scope=col>Name</th><th scope=col>TPM</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>3</th><td>ENSDART00000170923.2</td><td> 1.229110 </td></tr>\n",
"\t<tr><th scope=row>4</th><td>ENSDART00000171190.2</td><td> 5.517730 </td></tr>\n",
"\t<tr><th scope=row>5</th><td>ENSDART00000165811.2</td><td> 3.949240 </td></tr>\n",
"\t<tr><th scope=row>7</th><td>ENSDART00000007487.9</td><td>16.860100 </td></tr>\n",
"\t<tr><th scope=row>8</th><td>ENSDART00000162972.3</td><td> 0.665929 </td></tr>\n",
"\t<tr><th scope=row>9</th><td>ENSDART00000171570.2</td><td> 5.979240 </td></tr>\n",
"\t<tr><th scope=row>10</th><td>ENSDART00000168177.2</td><td> 6.115120 </td></tr>\n",
"\t<tr><th scope=row>12</th><td>ENSDART00000162709.3</td><td> 9.458940 </td></tr>\n",
"\t<tr><th scope=row>14</th><td>ENSDART00000168926.2</td><td>12.616400 </td></tr>\n",
"\t<tr><th scope=row>15</th><td>ENSDART00000169180.2</td><td> 8.660670 </td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"\\begin{tabular}{r|ll}\n",
" & Name & TPM\\\\\n",
"\\hline\n",
"\t3 & ENSDART00000170923.2 & 1.229110 \\\\\n",
"\t4 & ENSDART00000171190.2 & 5.517730 \\\\\n",
"\t5 & ENSDART00000165811.2 & 3.949240 \\\\\n",
"\t7 & ENSDART00000007487.9 & 16.860100 \\\\\n",
"\t8 & ENSDART00000162972.3 & 0.665929 \\\\\n",
"\t9 & ENSDART00000171570.2 & 5.979240 \\\\\n",
"\t10 & ENSDART00000168177.2 & 6.115120 \\\\\n",
"\t12 & ENSDART00000162709.3 & 9.458940 \\\\\n",
"\t14 & ENSDART00000168926.2 & 12.616400 \\\\\n",
"\t15 & ENSDART00000169180.2 & 8.660670 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| <!--/--> | Name | TPM | \n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 3 | ENSDART00000170923.2 | 1.229110 | \n",
"| 4 | ENSDART00000171190.2 | 5.517730 | \n",
"| 5 | ENSDART00000165811.2 | 3.949240 | \n",
"| 7 | ENSDART00000007487.9 | 16.860100 | \n",
"| 8 | ENSDART00000162972.3 | 0.665929 | \n",
"| 9 | ENSDART00000171570.2 | 5.979240 | \n",
"| 10 | ENSDART00000168177.2 | 6.115120 | \n",
"| 12 | ENSDART00000162709.3 | 9.458940 | \n",
"| 14 | ENSDART00000168926.2 | 12.616400 | \n",
"| 15 | ENSDART00000169180.2 | 8.660670 | \n",
"\n",
"\n"
],
"text/plain": [
" Name TPM \n",
"3 ENSDART00000170923.2 1.229110\n",
"4 ENSDART00000171190.2 5.517730\n",
"5 ENSDART00000165811.2 3.949240\n",
"7 ENSDART00000007487.9 16.860100\n",
"8 ENSDART00000162972.3 0.665929\n",
"9 ENSDART00000171570.2 5.979240\n",
"10 ENSDART00000168177.2 6.115120\n",
"12 ENSDART00000162709.3 9.458940\n",
"14 ENSDART00000168926.2 12.616400\n",
"15 ENSDART00000169180.2 8.660670"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%get lib6 --from Python3\n",
"%get lib7 --from Python3 \n",
"\n",
"head(lib6, 10)\n",
"head(lib7, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "R"
},
"source": [
"## Create median-sorted lookup table for TPM>0 transcripts in all four libraries"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"kernel": "Python3"
},
"outputs": [],
"source": [
"def sorted_merged_table(df1, df2, df3, df4, \n",
" lib_name1, lib_name2, lib_name3, lib_name4):\n",
" non0_2 = df1.merge(df2, on='Name') # Create lookup merged table of all libraries\n",
" non0_3 = non0_2.merge(df3, on='Name')\n",
" non0_4 = non0_3.merge(df4, on='Name')\n",
" non0_4['Median'] = non0_4.median(axis=1) # Calculate median read count of each transcript across all libraries\n",
" non0_4_median_sorted = non0_4.sort_values(by=\"Median\", ascending=False) # Sort transcripts by median read count \n",
" non0_4_median_sorted.columns = (\"Name\", lib_name1, lib_name2, lib_name3, lib_name4, \"Median\") # Rename columns\n",
" return non0_4_median_sorted\n",
"\n",
"DR_sorted_merged = sorted_merged_table(lib6, lib7, lib8, lib9, \n",
" \"SRR3465546\", \"SRR3465547\", \"SRR3465548\", \"SRR3465549\")"
]
},
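{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"The merge-then-median step relies on `merge` doing an inner join by default, so a transcript is kept only if it passed the TPM>0 filter in every library. A small sketch with two invented libraries (`lib_a`, `lib_b` and their values are illustrative, not real data):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Two made-up libraries; tx2 is absent from lib_b because its TPM was 0 there\n",
"lib_a = pd.DataFrame({'Name': ['tx1', 'tx2', 'tx3'], 'TPM': [2.0, 5.0, 9.0]})\n",
"lib_b = pd.DataFrame({'Name': ['tx1', 'tx3'], 'TPM': [4.0, 1.0]})\n",
"\n",
"# Inner join on Name (the default for merge) drops tx2\n",
"merged = lib_a.rename(columns={'TPM': 'A'}).merge(\n",
"    lib_b.rename(columns={'TPM': 'B'}), on='Name')\n",
"merged['Median'] = merged[['A', 'B']].median(axis=1)  # row-wise median TPM\n",
"merged = merged.sort_values(by='Median', ascending=False)\n",
"print(merged)\n",
"```\n",
"\n",
"With an even number of libraries the median is the mean of the two middle values, which is why some medians in the real table below carry an extra decimal place."
]
},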
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"kernel": "R"
},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th scope=col>Name</th><th scope=col>SRR3465546</th><th scope=col>SRR3465547</th><th scope=col>SRR3465548</th><th scope=col>SRR3465549</th><th scope=col>Median</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>34737</th><td>ENSDART00000093611.3</td><td>121405.00 </td><td>132392.00 </td><td>132560.00 </td><td>135275.00 </td><td>132476.00 </td></tr>\n",
"\t<tr><th scope=row>35022</th><td>ENSDART00000117474.3</td><td> 58401.40 </td><td> 65014.10 </td><td> 67116.20 </td><td> 63151.30 </td><td> 64082.70 </td></tr>\n",
"\t<tr><th scope=row>33311</th><td>ENSDART00000182970.1</td><td> 19842.10 </td><td> 9807.03 </td><td> 21568.10 </td><td> 15676.90 </td><td> 17759.50 </td></tr>\n",
"\t<tr><th scope=row>34736</th><td>ENSDART00000093609.3</td><td> 11807.40 </td><td> 13954.20 </td><td> 14022.30 </td><td> 13290.50 </td><td> 13622.35 </td></tr>\n",
"\t<tr><th scope=row>34739</th><td>ENSDART00000093613.3</td><td> 10803.00 </td><td> 12767.50 </td><td> 12902.30 </td><td> 13264.50 </td><td> 12834.90 </td></tr>\n",
"\t<tr><th scope=row>34735</th><td>ENSDART00000093606.3</td><td> 11273.20 </td><td> 13201.90 </td><td> 12731.30 </td><td> 12627.60 </td><td> 12679.45 </td></tr>\n",
"\t<tr><th scope=row>34738</th><td>ENSDART00000093612.3</td><td> 9459.21 </td><td> 11129.50 </td><td> 10755.40 </td><td> 11220.60 </td><td> 10942.45 </td></tr>\n",
"\t<tr><th scope=row>35138</th><td>ENSDART00000174022.2</td><td> 10769.90 </td><td> 14598.50 </td><td> 10601.30 </td><td> 8607.61 </td><td> 10685.60 </td></tr>\n",
"\t<tr><th scope=row>35030</th><td>ENSDART00000116823.3</td><td> 6529.78 </td><td> 8440.01 </td><td> 7806.37 </td><td> 6778.69 </td><td> 7292.53 </td></tr>\n",
"\t<tr><th scope=row>35019</th><td>ENSDART00000116869.3</td><td> 7975.68 </td><td> 5231.02 </td><td> 7202.87 </td><td> 6323.97 </td><td> 6763.42 </td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"\\begin{tabular}{r|llllll}\n",
" & Name & SRR3465546 & SRR3465547 & SRR3465548 & SRR3465549 & Median\\\\\n",
"\\hline\n",
"\t34737 & ENSDART00000093611.3 & 121405.00 & 132392.00 & 132560.00 & 135275.00 & 132476.00 \\\\\n",
"\t35022 & ENSDART00000117474.3 & 58401.40 & 65014.10 & 67116.20 & 63151.30 & 64082.70 \\\\\n",
"\t33311 & ENSDART00000182970.1 & 19842.10 & 9807.03 & 21568.10 & 15676.90 & 17759.50 \\\\\n",
"\t34736 & ENSDART00000093609.3 & 11807.40 & 13954.20 & 14022.30 & 13290.50 & 13622.35 \\\\\n",
"\t34739 & ENSDART00000093613.3 & 10803.00 & 12767.50 & 12902.30 & 13264.50 & 12834.90 \\\\\n",
"\t34735 & ENSDART00000093606.3 & 11273.20 & 13201.90 & 12731.30 & 12627.60 & 12679.45 \\\\\n",
"\t34738 & ENSDART00000093612.3 & 9459.21 & 11129.50 & 10755.40 & 11220.60 & 10942.45 \\\\\n",
"\t35138 & ENSDART00000174022.2 & 10769.90 & 14598.50 & 10601.30 & 8607.61 & 10685.60 \\\\\n",
"\t35030 & ENSDART00000116823.3 & 6529.78 & 8440.01 & 7806.37 & 6778.69 & 7292.53 \\\\\n",
"\t35019 & ENSDART00000116869.3 & 7975.68 & 5231.02 & 7202.87 & 6323.97 & 6763.42 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| <!--/--> | Name | SRR3465546 | SRR3465547 | SRR3465548 | SRR3465549 | Median | \n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 34737 | ENSDART00000093611.3 | 121405.00 | 132392.00 | 132560.00 | 135275.00 | 132476.00 | \n",
"| 35022 | ENSDART00000117474.3 | 58401.40 | 65014.10 | 67116.20 | 63151.30 | 64082.70 | \n",
"| 33311 | ENSDART00000182970.1 | 19842.10 | 9807.03 | 21568.10 | 15676.90 | 17759.50 | \n",
"| 34736 | ENSDART00000093609.3 | 11807.40 | 13954.20 | 14022.30 | 13290.50 | 13622.35 | \n",
"| 34739 | ENSDART00000093613.3 | 10803.00 | 12767.50 | 12902.30 | 13264.50 | 12834.90 | \n",
"| 34735 | ENSDART00000093606.3 | 11273.20 | 13201.90 | 12731.30 | 12627.60 | 12679.45 | \n",
"| 34738 | ENSDART00000093612.3 | 9459.21 | 11129.50 | 10755.40 | 11220.60 | 10942.45 | \n",
"| 35138 | ENSDART00000174022.2 | 10769.90 | 14598.50 | 10601.30 | 8607.61 | 10685.60 | \n",
"| 35030 | ENSDART00000116823.3 | 6529.78 | 8440.01 | 7806.37 | 6778.69 | 7292.53 | \n",
"| 35019 | ENSDART00000116869.3 | 7975.68 | 5231.02 | 7202.87 | 6323.97 | 6763.42 | \n",
"\n",
"\n"
],
"text/plain": [
" Name SRR3465546 SRR3465547 SRR3465548 SRR3465549\n",
"34737 ENSDART00000093611.3 121405.00 132392.00 132560.00 135275.00 \n",
"35022 ENSDART00000117474.3 58401.40 65014.10 67116.20 63151.30 \n",
"33311 ENSDART00000182970.1 19842.10 9807.03 21568.10 15676.90 \n",
"34736 ENSDART00000093609.3 11807.40 13954.20 14022.30 13290.50 \n",
"34739 ENSDART00000093613.3 10803.00 12767.50 12902.30 13264.50 \n",
"34735 ENSDART00000093606.3 11273.20 13201.90 12731.30 12627.60 \n",
"34738 ENSDART00000093612.3 9459.21 11129.50 10755.40 11220.60 \n",
"35138 ENSDART00000174022.2 10769.90 14598.50 10601.30 8607.61 \n",
"35030 ENSDART00000116823.3 6529.78 8440.01 7806.37 6778.69 \n",
"35019 ENSDART00000116869.3 7975.68 5231.02 7202.87 6323.97 \n",
" Median \n",
"34737 132476.00\n",
"35022 64082.70\n",
"33311 17759.50\n",
"34736 13622.35\n",
"34739 12834.90\n",
"34735 12679.45\n",
"34738 10942.45\n",
"35138 10685.60\n",
"35030 7292.53\n",
"35019 6763.42"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"36362"
],
"text/latex": [
"36362"
],
"text/markdown": [
"36362"
],
"text/plain": [
"[1] 36362"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%get DR_sorted_merged --from Python3\n",
"\n",
"head(DR_sorted_merged,10)\n",
"nrow(DR_sorted_merged)"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "R"
},
"source": [
"## Retrieve gene names and biotype info of all expressed transcripts from Biomart"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "R"
},
"source": [
"### Extract IDs of all expressed transcripts"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"kernel": "Python3"
},
"outputs": [
{
"data": {
"text/plain": [
"(36362,)"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DR_median_sorted_IDs = DR_sorted_merged[\"Name\"]\n",
"DR_median_sorted_IDs.shape"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"kernel": "R"
},
"outputs": [
{
"data": {
"text/html": [
"<dl class=dl-horizontal>\n",
"\t<dt>34737</dt>\n",
"\t\t<dd>'ENSDART00000093611.3'</dd>\n",
"\t<dt>35022</dt>\n",
"\t\t<dd>'ENSDART00000117474.3'</dd>\n",
"\t<dt>33311</dt>\n",
"\t\t<dd>'ENSDART00000182970.1'</dd>\n",
"\t<dt>34736</dt>\n",
"\t\t<dd>'ENSDART00000093609.3'</dd>\n",
"\t<dt>34739</dt>\n",
"\t\t<dd>'ENSDART00000093613.3'</dd>\n",
"\t<dt>34735</dt>\n",
"\t\t<dd>'ENSDART00000093606.3'</dd>\n",
"\t<dt>34738</dt>\n",
"\t\t<dd>'ENSDART00000093612.3'</dd>\n",
"\t<dt>35138</dt>\n",
"\t\t<dd>'ENSDART00000174022.2'</dd>\n",
"\t<dt>35030</dt>\n",
"\t\t<dd>'ENSDART00000116823.3'</dd>\n",
"\t<dt>35019</dt>\n",
"\t\t<dd>'ENSDART00000116869.3'</dd>\n",
"</dl>\n"
],
"text/latex": [
"\\begin{description*}\n",
"\\item[34737] 'ENSDART00000093611.3'\n",
"\\item[35022] 'ENSDART00000117474.3'\n",
"\\item[33311] 'ENSDART00000182970.1'\n",
"\\item[34736] 'ENSDART00000093609.3'\n",
"\\item[34739] 'ENSDART00000093613.3'\n",
"\\item[34735] 'ENSDART00000093606.3'\n",
"\\item[34738] 'ENSDART00000093612.3'\n",
"\\item[35138] 'ENSDART00000174022.2'\n",
"\\item[35030] 'ENSDART00000116823.3'\n",
"\\item[35019] 'ENSDART00000116869.3'\n",
"\\end{description*}\n"
],
"text/markdown": [
"34737\n",
": 'ENSDART00000093611.3'35022\n",
": 'ENSDART00000117474.3'33311\n",
": 'ENSDART00000182970.1'34736\n",
": 'ENSDART00000093609.3'34739\n",
": 'ENSDART00000093613.3'34735\n",
": 'ENSDART00000093606.3'34738\n",
": 'ENSDART00000093612.3'35138\n",
": 'ENSDART00000174022.2'35030\n",
": 'ENSDART00000116823.3'35019\n",
": 'ENSDART00000116869.3'\n",
"\n"
],
"text/plain": [
" 34737 35022 33311 \n",
"\"ENSDART00000093611.3\" \"ENSDART00000117474.3\" \"ENSDART00000182970.1\" \n",
" 34736 34739 34735 \n",
"\"ENSDART00000093609.3\" \"ENSDART00000093613.3\" \"ENSDART00000093606.3\" \n",
" 34738 35138 35030 \n",
"\"ENSDART00000093612.3\" \"ENSDART00000174022.2\" \"ENSDART00000116823.3\" \n",
" 35019 \n",
"\"ENSDART00000116869.3\" "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%get DR_median_sorted_IDs --from Python3\n",
"head(DR_median_sorted_IDs, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"### Biomart analysis"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"kernel": "R"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Batch submitting query [>------------------------------] 3% eta: 1m\n",
"Batch submitting query [>------------------------------] 4% eta: 1m\n",
"Batch submitting query [=>-----------------------------] 5% eta: 1m\n",
"Batch submitting query [=>-----------------------------] 7% eta: 1m\n",
"Batch submitting query [==>----------------------------] 8% eta: 1m\n",
"Batch submitting query [==>----------------------------] 10% eta: 1m\n",
"Batch submitting query [==>----------------------------] 11% eta: 1m\n",
"Batch submitting query [===>---------------------------] 12% eta: 1m\n",
"Batch submitting query [===>---------------------------] 14% eta: 1m\n",
"Batch submitting query [====>--------------------------] 15% eta: 1m\n",
"Batch submitting query [====>--------------------------] 16% eta: 1m\n",
"Batch submitting query [=====>-------------------------] 18% eta: 1m\n",
"Batch submitting query [=====>-------------------------] 19% eta: 1m\n",
"Batch submitting query [=====>-------------------------] 21% eta: 1m\n",
"Batch submitting query [======>------------------------] 22% eta: 1m\n",
"Batch submitting query [======>------------------------] 23% eta: 1m\n",
"Batch submitting query [=======>-----------------------] 25% eta: 1m\n",
"Batch submitting query [=======>-----------------------] 26% eta: 1m\n",
"Batch submitting query [=======>-----------------------] 27% eta: 1m\n",
"Batch submitting query [========>----------------------] 29% eta: 1m\n",
"Batch submitting query [========>----------------------] 30% eta: 1m\n",
"Batch submitting query [=========>---------------------] 32% eta: 1m\n",
"Batch submitting query [=========>---------------------] 33% eta: 1m\n",
"Batch submitting query [==========>--------------------] 34% eta: 49s\n",
"Batch submitting query [==========>--------------------] 36% eta: 50s\n",
"Batch submitting query [==========>--------------------] 37% eta: 49s\n",
"Batch submitting query [===========>-------------------] 38% eta: 47s\n",
"Batch submitting query [===========>-------------------] 40% eta: 46s\n",
"Batch submitting query [============>------------------] 41% eta: 45s\n",
"Batch submitting query [============>------------------] 42% eta: 44s\n",
"Batch submitting query [=============>-----------------] 44% eta: 43s\n",
"Batch submitting query [=============>-----------------] 45% eta: 42s\n",
"Batch submitting query [=============>-----------------] 47% eta: 41s\n",
"Batch submitting query [==============>----------------] 48% eta: 40s\n",
"Batch submitting query [==============>----------------] 49% eta: 38s\n",
"Batch submitting query [===============>---------------] 51% eta: 37s\n",
"Batch submitting query [===============>---------------] 52% eta: 36s\n",
"Batch submitting query [================>--------------] 53% eta: 35s\n",
"Batch submitting query [================>--------------] 55% eta: 34s\n",
"Batch submitting query [================>--------------] 56% eta: 33s\n",
"Batch submitting query [=================>-------------] 58% eta: 32s\n",
"Batch submitting query [=================>-------------] 59% eta: 32s\n",
"Batch submitting query [==================>------------] 60% eta: 31s\n",
"Batch submitting query [==================>------------] 62% eta: 30s\n",
"Batch submitting query [===================>-----------] 63% eta: 28s\n",
"Batch submitting query [===================>-----------] 64% eta: 27s\n",
"Batch submitting query [===================>-----------] 66% eta: 27s\n",
"Batch submitting query [====================>----------] 67% eta: 25s\n",
"Batch submitting query [====================>----------] 68% eta: 24s\n",
"Batch submitting query [=====================>---------] 70% eta: 23s\n",
"Batch submitting query [=====================>---------] 71% eta: 22s\n",
"Batch submitting query [======================>--------] 73% eta: 21s\n",
"Batch submitting query [======================>--------] 74% eta: 20s\n",
"Batch submitting query [======================>--------] 75% eta: 19s\n",
"Batch submitting query [=======================>-------] 77% eta: 18s\n",
"Batch submitting query [=======================>-------] 78% eta: 17s\n",
"Batch submitting query [========================>------] 79% eta: 16s\n",
"Batch submitting query [========================>------] 81% eta: 15s\n",
"Batch submitting query [========================>------] 82% eta: 13s\n",
"Batch submitting query [=========================>-----] 84% eta: 12s\n",
"Batch submitting query [=========================>-----] 85% eta: 11s\n",
"Batch submitting query [==========================>----] 86% eta: 10s\n",
"Batch submitting query [==========================>----] 88% eta: 9s\n",
"Batch submitting query [===========================>---] 89% eta: 8s\n",
"Batch submitting query [===========================>---] 90% eta: 7s\n",
"Batch submitting query [===========================>---] 92% eta: 6s\n",
"Batch submitting query [============================>--] 93% eta: 5s\n",
"Batch submitting query [============================>--] 95% eta: 4s\n",
"Batch submitting query [=============================>-] 96% eta: 3s\n",
"Batch submitting query [=============================>-] 97% eta: 2s\n",
"Batch submitting query [==============================>] 99% eta: 1s\n",
"Batch submitting query [===============================] 100% eta: 0s\n"
]
},
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>36361</li>\n",
"\t<li>6</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 36361\n",
"\\item 6\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 36361\n",
"2. 6\n",
"\n",
"\n"
],
"text/plain": [
"[1] 36361 6"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th scope=col>ensembl_transcript_id_version</th><th scope=col>ensembl_gene_id</th><th scope=col>external_gene_name</th><th scope=col>description</th><th scope=col>transcript_biotype</th><th scope=col>gene_biotype</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><td>ENSDART00000002691.9 </td><td>ENSDARG00000008407 </td><td>tspan7b </td><td>tetraspanin 7b [Source:ZFIN;Acc:ZDB-GENE-040927-5] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000003001.8 </td><td>ENSDARG00000006316 </td><td>rpl23a </td><td>ribosomal protein L23a [Source:ZFIN;Acc:ZDB-GENE-030131-7479] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000003008.8 </td><td>ENSDARG00000027419 </td><td>gad1b </td><td>glutamate decarboxylase 1b [Source:ZFIN;Acc:ZDB-GENE-030909-3] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000003042.6 </td><td>ENSDARG00000020708 </td><td>mdkb </td><td>midkine b [Source:ZFIN;Acc:ZDB-GENE-010131-6] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000003825.7 </td><td>ENSDARG00000018997 </td><td>cplx2l </td><td>complexin 2, like [Source:ZFIN;Acc:ZDB-GENE-040718-160] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000004109.7 </td><td>ENSDARG00000009553 </td><td>gng3 </td><td>guanine nucleotide binding protein (G protein), gamma 3 [Source:ZFIN;Acc:ZDB-GENE-010705-1]</td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000004692.9 </td><td>ENSDARG00000003795 </td><td>idh2 </td><td>isocitrate dehydrogenase 2 (NADP+), mitochondrial [Source:ZFIN;Acc:ZDB-GENE-031118-95] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000005191.8 </td><td>ENSDARG00000011146 </td><td>uqcrb </td><td>ubiquinol-cytochrome c reductase binding protein [Source:ZFIN;Acc:ZDB-GENE-050522-542] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000005879.8 </td><td>ENSDARG00000001788 </td><td>atp5po </td><td>ATP synthase peripheral stalk subunit OSCP [Source:ZFIN;Acc:ZDB-GENE-050522-147] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><td>ENSDART00000006132.8 </td><td>ENSDARG00000021124 </td><td>cfl1 </td><td>cofilin 1 [Source:ZFIN;Acc:ZDB-GENE-030131-215] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"\\begin{tabular}{r|llllll}\n",
" ensembl\\_transcript\\_id\\_version & ensembl\\_gene\\_id & external\\_gene\\_name & description & transcript\\_biotype & gene\\_biotype\\\\\n",
"\\hline\n",
"\t ENSDART00000002691.9 & ENSDARG00000008407 & tspan7b & tetraspanin 7b {[}Source:ZFIN;Acc:ZDB-GENE-040927-5{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000003001.8 & ENSDARG00000006316 & rpl23a & ribosomal protein L23a {[}Source:ZFIN;Acc:ZDB-GENE-030131-7479{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000003008.8 & ENSDARG00000027419 & gad1b & glutamate decarboxylase 1b {[}Source:ZFIN;Acc:ZDB-GENE-030909-3{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000003042.6 & ENSDARG00000020708 & mdkb & midkine b {[}Source:ZFIN;Acc:ZDB-GENE-010131-6{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000003825.7 & ENSDARG00000018997 & cplx2l & complexin 2, like {[}Source:ZFIN;Acc:ZDB-GENE-040718-160{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000004109.7 & ENSDARG00000009553 & gng3 & guanine nucleotide binding protein (G protein), gamma 3 {[}Source:ZFIN;Acc:ZDB-GENE-010705-1{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000004692.9 & ENSDARG00000003795 & idh2 & isocitrate dehydrogenase 2 (NADP+), mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-031118-95{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000005191.8 & ENSDARG00000011146 & uqcrb & ubiquinol-cytochrome c reductase binding protein {[}Source:ZFIN;Acc:ZDB-GENE-050522-542{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000005879.8 & ENSDARG00000001788 & atp5po & ATP synthase peripheral stalk subunit OSCP {[}Source:ZFIN;Acc:ZDB-GENE-050522-147{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t ENSDART00000006132.8 & ENSDARG00000021124 & cfl1 & cofilin 1 {[}Source:ZFIN;Acc:ZDB-GENE-030131-215{]} & protein\\_coding & protein\\_coding \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"ensembl_transcript_id_version | ensembl_gene_id | external_gene_name | description | transcript_biotype | gene_biotype | \n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| ENSDART00000002691.9 | ENSDARG00000008407 | tspan7b | tetraspanin 7b [Source:ZFIN;Acc:ZDB-GENE-040927-5] | protein_coding | protein_coding | \n",
"| ENSDART00000003001.8 | ENSDARG00000006316 | rpl23a | ribosomal protein L23a [Source:ZFIN;Acc:ZDB-GENE-030131-7479] | protein_coding | protein_coding | \n",
"| ENSDART00000003008.8 | ENSDARG00000027419 | gad1b | glutamate decarboxylase 1b [Source:ZFIN;Acc:ZDB-GENE-030909-3] | protein_coding | protein_coding | \n",
"| ENSDART00000003042.6 | ENSDARG00000020708 | mdkb | midkine b [Source:ZFIN;Acc:ZDB-GENE-010131-6] | protein_coding | protein_coding | \n",
"| ENSDART00000003825.7 | ENSDARG00000018997 | cplx2l | complexin 2, like [Source:ZFIN;Acc:ZDB-GENE-040718-160] | protein_coding | protein_coding | \n",
"| ENSDART00000004109.7 | ENSDARG00000009553 | gng3 | guanine nucleotide binding protein (G protein), gamma 3 [Source:ZFIN;Acc:ZDB-GENE-010705-1] | protein_coding | protein_coding | \n",
"| ENSDART00000004692.9 | ENSDARG00000003795 | idh2 | isocitrate dehydrogenase 2 (NADP+), mitochondrial [Source:ZFIN;Acc:ZDB-GENE-031118-95] | protein_coding | protein_coding | \n",
"| ENSDART00000005191.8 | ENSDARG00000011146 | uqcrb | ubiquinol-cytochrome c reductase binding protein [Source:ZFIN;Acc:ZDB-GENE-050522-542] | protein_coding | protein_coding | \n",
"| ENSDART00000005879.8 | ENSDARG00000001788 | atp5po | ATP synthase peripheral stalk subunit OSCP [Source:ZFIN;Acc:ZDB-GENE-050522-147] | protein_coding | protein_coding | \n",
"| ENSDART00000006132.8 | ENSDARG00000021124 | cfl1 | cofilin 1 [Source:ZFIN;Acc:ZDB-GENE-030131-215] | protein_coding | protein_coding | \n",
"\n",
"\n"
],
"text/plain": [
" ensembl_transcript_id_version ensembl_gene_id external_gene_name\n",
"1 ENSDART00000002691.9 ENSDARG00000008407 tspan7b \n",
"2 ENSDART00000003001.8 ENSDARG00000006316 rpl23a \n",
"3 ENSDART00000003008.8 ENSDARG00000027419 gad1b \n",
"4 ENSDART00000003042.6 ENSDARG00000020708 mdkb \n",
"5 ENSDART00000003825.7 ENSDARG00000018997 cplx2l \n",
"6 ENSDART00000004109.7 ENSDARG00000009553 gng3 \n",
"7 ENSDART00000004692.9 ENSDARG00000003795 idh2 \n",
"8 ENSDART00000005191.8 ENSDARG00000011146 uqcrb \n",
"9 ENSDART00000005879.8 ENSDARG00000001788 atp5po \n",
"10 ENSDART00000006132.8 ENSDARG00000021124 cfl1 \n",
" description \n",
"1 tetraspanin 7b [Source:ZFIN;Acc:ZDB-GENE-040927-5] \n",
"2 ribosomal protein L23a [Source:ZFIN;Acc:ZDB-GENE-030131-7479] \n",
"3 glutamate decarboxylase 1b [Source:ZFIN;Acc:ZDB-GENE-030909-3] \n",
"4 midkine b [Source:ZFIN;Acc:ZDB-GENE-010131-6] \n",
"5 complexin 2, like [Source:ZFIN;Acc:ZDB-GENE-040718-160] \n",
"6 guanine nucleotide binding protein (G protein), gamma 3 [Source:ZFIN;Acc:ZDB-GENE-010705-1]\n",
"7 isocitrate dehydrogenase 2 (NADP+), mitochondrial [Source:ZFIN;Acc:ZDB-GENE-031118-95] \n",
"8 ubiquinol-cytochrome c reductase binding protein [Source:ZFIN;Acc:ZDB-GENE-050522-542] \n",
"9 ATP synthase peripheral stalk subunit OSCP [Source:ZFIN;Acc:ZDB-GENE-050522-147] \n",
"10 cofilin 1 [Source:ZFIN;Acc:ZDB-GENE-030131-215] \n",
" transcript_biotype gene_biotype \n",
"1 protein_coding protein_coding\n",
"2 protein_coding protein_coding\n",
"3 protein_coding protein_coding\n",
"4 protein_coding protein_coding\n",
"5 protein_coding protein_coding\n",
"6 protein_coding protein_coding\n",
"7 protein_coding protein_coding\n",
"8 protein_coding protein_coding\n",
"9 protein_coding protein_coding\n",
"10 protein_coding protein_coding"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"## Load Drosophila Biomart dataset\n",
"library(biomaRt)\n",
"DR_ensembl = useMart(\"ensembl\", dataset=\"drerio_gene_ensembl\")\n",
"\n",
"## Create Biomart query\n",
"DR_biomart_output <- getBM(attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id', 'external_gene_name', \n",
" 'description', 'transcript_biotype', 'gene_biotype'), \n",
" filters = 'ensembl_transcript_id_version', \n",
" values = DR_median_sorted_IDs, \n",
" mart = DR_ensembl)\n",
"\n",
"## Preview results\n",
"dim(DR_biomart_output)\n",
"head(DR_biomart_output, 10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Merge with Biomart output with TPM>0 read counts table"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"### Import BiomaRt output into Python & rename transcript ID column header"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"kernel": "Python3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dataframe dimensions: (36361, 6) \n",
"\n"
]
}
],
"source": [
"%get DR_biomart_output --from R\n",
"\n",
"## Check dataframe dimension after import\n",
"print(\"Dataframe dimensions:\", DR_biomart_output.shape, \"\\n\")\n",
"\n",
"## Change transcript ID column header to match TPM>0 read counts table for merging \n",
"DR_biomart_output_df = DR_biomart_output.rename(columns = {'ensembl_transcript_id_version':'Name'})"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"### Merge with Biomart output with TPM>0 read counts table & extract entries of protein-coding transcripts"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"kernel": "Python3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(32082, 11)\n"
]
}
],
"source": [
"## Merge Biomart output & read counts table sorted by median read count\n",
"DR_sorted_counts_BM = DR_sorted_merged.merge(DR_biomart_output_df, on='Name')\n",
"\n",
"## Extract entries of protein-coding transcripts\n",
"DR_PC_transcripts = DR_sorted_counts_BM.loc[DR_sorted_counts_BM[\"transcript_biotype\"].str.contains(\"protein_coding\")] \n",
"\n",
"## Check dataframe dimensions\n",
"print(DR_PC_transcripts.shape)"
]
},
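{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"The merge in the preceding cell uses the pandas default inner join, so any transcript ID missing from the Biomart output is dropped without a message. A minimal sketch of how the match could be audited, using small made-up tables (`counts` and `annot` are hypothetical stand-ins for `DR_sorted_merged` and `DR_biomart_output_df`):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Toy stand-ins for the real tables (hypothetical IDs and values)\n",
"counts = pd.DataFrame({\n",
"    'Name': ['T1.1', 'T2.1', 'T3.1'],\n",
"    'Median': [30.0, 20.0, 10.0],\n",
"})\n",
"annot = pd.DataFrame({\n",
"    'Name': ['T1.1', 'T2.1'],\n",
"    'transcript_biotype': ['protein_coding', 'lincRNA'],\n",
"})\n",
"\n",
"# A left merge keeps every row of the counts table; indicator=True adds a\n",
"# '_merge' column, and validate raises if a transcript ID is duplicated.\n",
"merged = counts.merge(annot, on='Name', how='left',\n",
"                      indicator=True, validate='one_to_one')\n",
"unmatched = merged.loc[merged['_merge'] == 'left_only', 'Name'].tolist()\n",
"print('Transcripts without annotation:', unmatched)\n",
"\n",
"# Exact comparison on the biotype instead of a substring match\n",
"pc = merged.loc[merged['transcript_biotype'] == 'protein_coding']\n",
"print(pc.shape)\n",
"```\n",
"\n",
"On the real tables, a non-empty `unmatched` list would be worth inspecting before filtering; a mismatch in transcript version suffixes is one possible cause, not a confirmed one."
]
},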
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"kernel": "R"
},
"outputs": [
{
"data": {
"text/html": [
"<ol class=list-inline>\n",
"\t<li>32082</li>\n",
"\t<li>11</li>\n",
"</ol>\n"
],
"text/latex": [
"\\begin{enumerate*}\n",
"\\item 32082\n",
"\\item 11\n",
"\\end{enumerate*}\n"
],
"text/markdown": [
"1. 32082\n",
"2. 11\n",
"\n",
"\n"
],
"text/plain": [
"[1] 32082 11"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th scope=col>Name</th><th scope=col>SRR3465546</th><th scope=col>SRR3465547</th><th scope=col>SRR3465548</th><th scope=col>SRR3465549</th><th scope=col>Median</th><th scope=col>ensembl_gene_id</th><th scope=col>external_gene_name</th><th scope=col>description</th><th scope=col>transcript_biotype</th><th scope=col>gene_biotype</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>0</th><td>ENSDART00000093611.3 </td><td>121405.00 </td><td>132392.00 </td><td>132560.00 </td><td>135275.00 </td><td>132476.000 </td><td>ENSDARG00000063910 </td><td>mt-atp8 </td><td>ATP synthase 8, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-19] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>2</th><td>ENSDART00000182970.1 </td><td> 19842.10 </td><td> 9807.03 </td><td> 21568.10 </td><td> 15676.90 </td><td> 17759.500 </td><td>ENSDARG00000111458 </td><td>wu:fi09b08 </td><td>wu:fi09b08 [Source:ZFIN;Acc:ZDB-GENE-030131-5630] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>3</th><td>ENSDART00000093609.3 </td><td> 11807.40 </td><td> 13954.20 </td><td> 14022.30 </td><td> 13290.50 </td><td> 13622.350 </td><td>ENSDARG00000063908 </td><td>mt-co2 </td><td>cytochrome c oxidase II, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-15] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>4</th><td>ENSDART00000093613.3 </td><td> 10803.00 </td><td> 12767.50 </td><td> 12902.30 </td><td> 13264.50 </td><td> 12834.900 </td><td>ENSDARG00000063912 </td><td>mt-co3 </td><td>cytochrome c oxidase III, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-16] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>5</th><td>ENSDART00000093606.3 </td><td> 11273.20 </td><td> 13201.90 </td><td> 12731.30 </td><td> 12627.60 </td><td> 12679.450 </td><td>ENSDARG00000063905 </td><td>mt-co1 </td><td>cytochrome c oxidase I, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-14] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>6</th><td>ENSDART00000093612.3 </td><td> 9459.21 </td><td> 11129.50 </td><td> 10755.40 </td><td> 11220.60 </td><td> 10942.450 </td><td>ENSDARG00000063911 </td><td>mt-atp6 </td><td>ATP synthase 6, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-18] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>10</th><td>ENSDART00000171617.2 </td><td> 5566.77 </td><td> 8967.55 </td><td> 5215.21 </td><td> 7021.53 </td><td> 6294.150 </td><td>ENSDARG00000103498 </td><td>epd </td><td>ependymin [Source:ZFIN;Acc:ZDB-GENE-980526-111] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>11</th><td>ENSDART00000188084.1 </td><td> 5578.48 </td><td> 6410.95 </td><td> 6359.70 </td><td> 5774.42 </td><td> 6067.060 </td><td>ENSDARG00000117167 </td><td>rpl39 </td><td>ribosomal protein L39 [Source:ZFIN;Acc:ZDB-GENE-040625-51] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>12</th><td>ENSDART00000093617.3 </td><td> 4871.62 </td><td> 6375.34 </td><td> 5814.17 </td><td> 5918.74 </td><td> 5866.455 </td><td>ENSDARG00000063916 </td><td>mt-nd4l </td><td>NADH dehydrogenase 4L, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-11] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>13</th><td>ENSDART00000181887.1 </td><td> 5265.68 </td><td> 5232.80 </td><td> 5966.44 </td><td> 5993.42 </td><td> 5616.060 </td><td>ENSDARG00000115319 </td><td>mtbl </td><td>metallothionein-B-like [Source:ZFIN;Acc:ZDB-GENE-110414-3] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>15</th><td>ENSDART00000093625.3 </td><td> 4082.53 </td><td> 4731.02 </td><td> 4862.35 </td><td> 4979.03 </td><td> 4796.685 </td><td>ENSDARG00000063924 </td><td>mt-cyb </td><td>cytochrome b, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-17] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>19</th><td>ENSDART00000182119.1 </td><td> 4289.66 </td><td> 4498.40 </td><td> 4416.94 </td><td> 4565.07 </td><td> 4457.670 </td><td>ENSDARG00000112656 </td><td>rpl36a </td><td>ribosomal protein L36A [Source:ZFIN;Acc:ZDB-GENE-020423-1] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>22</th><td>ENSDART00000149639.2 </td><td> 4097.25 </td><td> 3771.98 </td><td> 4444.10 </td><td> 3604.46 </td><td> 3934.615 </td><td>ENSDARG00000036186 </td><td>mbpa </td><td>myelin basic protein a [Source:ZFIN;Acc:ZDB-GENE-030128-2] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>24</th><td>ENSDART00000052556.8 </td><td> 3106.96 </td><td> 3046.13 </td><td> 3425.84 </td><td> 2607.85 </td><td> 3076.545 </td><td>ENSDARG00000036186 </td><td>mbpa </td><td>myelin basic protein a [Source:ZFIN;Acc:ZDB-GENE-030128-2] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>25</th><td>ENSDART00000093615.3 </td><td> 2732.44 </td><td> 3048.63 </td><td> 2987.04 </td><td> 3038.94 </td><td> 3012.990 </td><td>ENSDARG00000063914 </td><td>mt-nd3 </td><td>NADH dehydrogenase 3, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-9] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>26</th><td>ENSDART00000182410.1 </td><td> 2893.27 </td><td> 2790.55 </td><td> 2946.34 </td><td> 3451.55 </td><td> 2919.805 </td><td>ENSDARG00000116304 </td><td>ndufa4 </td><td>NADH:ubiquinone oxidoreductase subunit A4 [Source:ZFIN;Acc:ZDB-GENE-040426-1962]</td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>28</th><td>ENSDART00000093623.3 </td><td> 2203.58 </td><td> 2532.95 </td><td> 2440.17 </td><td> 2446.22 </td><td> 2443.195 </td><td>ENSDARG00000063922 </td><td>mt-nd6 </td><td>NADH dehydrogenase 6, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-13] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>29</th><td>ENSDART00000186782.1 </td><td> 2557.15 </td><td> 2037.68 </td><td> 2265.55 </td><td> 2603.08 </td><td> 2411.350 </td><td>ENSDARG00000113583 </td><td>mt2 </td><td>metallothionein 2 [Source:ZFIN;Acc:ZDB-GENE-030131-4174] </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>30</th><td>ENSDART00000191912.1 </td><td> 2323.54 </td><td> 2424.89 </td><td> 2257.54 </td><td> 2486.85 </td><td> 2374.215 </td><td>ENSDARG00000116220 </td><td>BX005436.1 </td><td> </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"\t<tr><th scope=row>31</th><td>ENSDART00000186881.1 </td><td> 2329.81 </td><td> 2438.60 </td><td> 2360.62 </td><td> 1852.06 </td><td> 2345.215 </td><td>ENSDARG00000114294 </td><td>BX511120.1 </td><td> </td><td>protein_coding </td><td>protein_coding </td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"\\begin{tabular}{r|lllllllllll}\n",
" & Name & SRR3465546 & SRR3465547 & SRR3465548 & SRR3465549 & Median & ensembl\\_gene\\_id & external\\_gene\\_name & description & transcript\\_biotype & gene\\_biotype\\\\\n",
"\\hline\n",
"\t0 & ENSDART00000093611.3 & 121405.00 & 132392.00 & 132560.00 & 135275.00 & 132476.000 & ENSDARG00000063910 & mt-atp8 & ATP synthase 8, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-19{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t2 & ENSDART00000182970.1 & 19842.10 & 9807.03 & 21568.10 & 15676.90 & 17759.500 & ENSDARG00000111458 & wu:fi09b08 & wu:fi09b08 {[}Source:ZFIN;Acc:ZDB-GENE-030131-5630{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t3 & ENSDART00000093609.3 & 11807.40 & 13954.20 & 14022.30 & 13290.50 & 13622.350 & ENSDARG00000063908 & mt-co2 & cytochrome c oxidase II, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-15{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t4 & ENSDART00000093613.3 & 10803.00 & 12767.50 & 12902.30 & 13264.50 & 12834.900 & ENSDARG00000063912 & mt-co3 & cytochrome c oxidase III, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-16{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t5 & ENSDART00000093606.3 & 11273.20 & 13201.90 & 12731.30 & 12627.60 & 12679.450 & ENSDARG00000063905 & mt-co1 & cytochrome c oxidase I, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-14{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t6 & ENSDART00000093612.3 & 9459.21 & 11129.50 & 10755.40 & 11220.60 & 10942.450 & ENSDARG00000063911 & mt-atp6 & ATP synthase 6, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-18{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t10 & ENSDART00000171617.2 & 5566.77 & 8967.55 & 5215.21 & 7021.53 & 6294.150 & ENSDARG00000103498 & epd & ependymin {[}Source:ZFIN;Acc:ZDB-GENE-980526-111{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t11 & ENSDART00000188084.1 & 5578.48 & 6410.95 & 6359.70 & 5774.42 & 6067.060 & ENSDARG00000117167 & rpl39 & ribosomal protein L39 {[}Source:ZFIN;Acc:ZDB-GENE-040625-51{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t12 & ENSDART00000093617.3 & 4871.62 & 6375.34 & 5814.17 & 5918.74 & 5866.455 & ENSDARG00000063916 & mt-nd4l & NADH dehydrogenase 4L, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-11{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t13 & ENSDART00000181887.1 & 5265.68 & 5232.80 & 5966.44 & 5993.42 & 5616.060 & ENSDARG00000115319 & mtbl & metallothionein-B-like {[}Source:ZFIN;Acc:ZDB-GENE-110414-3{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t15 & ENSDART00000093625.3 & 4082.53 & 4731.02 & 4862.35 & 4979.03 & 4796.685 & ENSDARG00000063924 & mt-cyb & cytochrome b, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-17{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t19 & ENSDART00000182119.1 & 4289.66 & 4498.40 & 4416.94 & 4565.07 & 4457.670 & ENSDARG00000112656 & rpl36a & ribosomal protein L36A {[}Source:ZFIN;Acc:ZDB-GENE-020423-1{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t22 & ENSDART00000149639.2 & 4097.25 & 3771.98 & 4444.10 & 3604.46 & 3934.615 & ENSDARG00000036186 & mbpa & myelin basic protein a {[}Source:ZFIN;Acc:ZDB-GENE-030128-2{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t24 & ENSDART00000052556.8 & 3106.96 & 3046.13 & 3425.84 & 2607.85 & 3076.545 & ENSDARG00000036186 & mbpa & myelin basic protein a {[}Source:ZFIN;Acc:ZDB-GENE-030128-2{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t25 & ENSDART00000093615.3 & 2732.44 & 3048.63 & 2987.04 & 3038.94 & 3012.990 & ENSDARG00000063914 & mt-nd3 & NADH dehydrogenase 3, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-9{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t26 & ENSDART00000182410.1 & 2893.27 & 2790.55 & 2946.34 & 3451.55 & 2919.805 & ENSDARG00000116304 & ndufa4 & NADH:ubiquinone oxidoreductase subunit A4 {[}Source:ZFIN;Acc:ZDB-GENE-040426-1962{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t28 & ENSDART00000093623.3 & 2203.58 & 2532.95 & 2440.17 & 2446.22 & 2443.195 & ENSDARG00000063922 & mt-nd6 & NADH dehydrogenase 6, mitochondrial {[}Source:ZFIN;Acc:ZDB-GENE-011205-13{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t29 & ENSDART00000186782.1 & 2557.15 & 2037.68 & 2265.55 & 2603.08 & 2411.350 & ENSDARG00000113583 & mt2 & metallothionein 2 {[}Source:ZFIN;Acc:ZDB-GENE-030131-4174{]} & protein\\_coding & protein\\_coding \\\\\n",
"\t30 & ENSDART00000191912.1 & 2323.54 & 2424.89 & 2257.54 & 2486.85 & 2374.215 & ENSDARG00000116220 & BX005436.1 & & protein\\_coding & protein\\_coding \\\\\n",
"\t31 & ENSDART00000186881.1 & 2329.81 & 2438.60 & 2360.62 & 1852.06 & 2345.215 & ENSDARG00000114294 & BX511120.1 & & protein\\_coding & protein\\_coding \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| <!--/--> | Name | SRR3465546 | SRR3465547 | SRR3465548 | SRR3465549 | Median | ensembl_gene_id | external_gene_name | description | transcript_biotype | gene_biotype | \n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | ENSDART00000093611.3 | 121405.00 | 132392.00 | 132560.00 | 135275.00 | 132476.000 | ENSDARG00000063910 | mt-atp8 | ATP synthase 8, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-19] | protein_coding | protein_coding | \n",
"| 2 | ENSDART00000182970.1 | 19842.10 | 9807.03 | 21568.10 | 15676.90 | 17759.500 | ENSDARG00000111458 | wu:fi09b08 | wu:fi09b08 [Source:ZFIN;Acc:ZDB-GENE-030131-5630] | protein_coding | protein_coding | \n",
"| 3 | ENSDART00000093609.3 | 11807.40 | 13954.20 | 14022.30 | 13290.50 | 13622.350 | ENSDARG00000063908 | mt-co2 | cytochrome c oxidase II, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-15] | protein_coding | protein_coding | \n",
"| 4 | ENSDART00000093613.3 | 10803.00 | 12767.50 | 12902.30 | 13264.50 | 12834.900 | ENSDARG00000063912 | mt-co3 | cytochrome c oxidase III, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-16] | protein_coding | protein_coding | \n",
"| 5 | ENSDART00000093606.3 | 11273.20 | 13201.90 | 12731.30 | 12627.60 | 12679.450 | ENSDARG00000063905 | mt-co1 | cytochrome c oxidase I, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-14] | protein_coding | protein_coding | \n",
"| 6 | ENSDART00000093612.3 | 9459.21 | 11129.50 | 10755.40 | 11220.60 | 10942.450 | ENSDARG00000063911 | mt-atp6 | ATP synthase 6, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-18] | protein_coding | protein_coding | \n",
"| 10 | ENSDART00000171617.2 | 5566.77 | 8967.55 | 5215.21 | 7021.53 | 6294.150 | ENSDARG00000103498 | epd | ependymin [Source:ZFIN;Acc:ZDB-GENE-980526-111] | protein_coding | protein_coding | \n",
"| 11 | ENSDART00000188084.1 | 5578.48 | 6410.95 | 6359.70 | 5774.42 | 6067.060 | ENSDARG00000117167 | rpl39 | ribosomal protein L39 [Source:ZFIN;Acc:ZDB-GENE-040625-51] | protein_coding | protein_coding | \n",
"| 12 | ENSDART00000093617.3 | 4871.62 | 6375.34 | 5814.17 | 5918.74 | 5866.455 | ENSDARG00000063916 | mt-nd4l | NADH dehydrogenase 4L, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-11] | protein_coding | protein_coding | \n",
"| 13 | ENSDART00000181887.1 | 5265.68 | 5232.80 | 5966.44 | 5993.42 | 5616.060 | ENSDARG00000115319 | mtbl | metallothionein-B-like [Source:ZFIN;Acc:ZDB-GENE-110414-3] | protein_coding | protein_coding | \n",
"| 15 | ENSDART00000093625.3 | 4082.53 | 4731.02 | 4862.35 | 4979.03 | 4796.685 | ENSDARG00000063924 | mt-cyb | cytochrome b, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-17] | protein_coding | protein_coding | \n",
"| 19 | ENSDART00000182119.1 | 4289.66 | 4498.40 | 4416.94 | 4565.07 | 4457.670 | ENSDARG00000112656 | rpl36a | ribosomal protein L36A [Source:ZFIN;Acc:ZDB-GENE-020423-1] | protein_coding | protein_coding | \n",
"| 22 | ENSDART00000149639.2 | 4097.25 | 3771.98 | 4444.10 | 3604.46 | 3934.615 | ENSDARG00000036186 | mbpa | myelin basic protein a [Source:ZFIN;Acc:ZDB-GENE-030128-2] | protein_coding | protein_coding | \n",
"| 24 | ENSDART00000052556.8 | 3106.96 | 3046.13 | 3425.84 | 2607.85 | 3076.545 | ENSDARG00000036186 | mbpa | myelin basic protein a [Source:ZFIN;Acc:ZDB-GENE-030128-2] | protein_coding | protein_coding | \n",
"| 25 | ENSDART00000093615.3 | 2732.44 | 3048.63 | 2987.04 | 3038.94 | 3012.990 | ENSDARG00000063914 | mt-nd3 | NADH dehydrogenase 3, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-9] | protein_coding | protein_coding | \n",
"| 26 | ENSDART00000182410.1 | 2893.27 | 2790.55 | 2946.34 | 3451.55 | 2919.805 | ENSDARG00000116304 | ndufa4 | NADH:ubiquinone oxidoreductase subunit A4 [Source:ZFIN;Acc:ZDB-GENE-040426-1962] | protein_coding | protein_coding | \n",
"| 28 | ENSDART00000093623.3 | 2203.58 | 2532.95 | 2440.17 | 2446.22 | 2443.195 | ENSDARG00000063922 | mt-nd6 | NADH dehydrogenase 6, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-13] | protein_coding | protein_coding | \n",
"| 29 | ENSDART00000186782.1 | 2557.15 | 2037.68 | 2265.55 | 2603.08 | 2411.350 | ENSDARG00000113583 | mt2 | metallothionein 2 [Source:ZFIN;Acc:ZDB-GENE-030131-4174] | protein_coding | protein_coding | \n",
"| 30 | ENSDART00000191912.1 | 2323.54 | 2424.89 | 2257.54 | 2486.85 | 2374.215 | ENSDARG00000116220 | BX005436.1 | | protein_coding | protein_coding | \n",
"| 31 | ENSDART00000186881.1 | 2329.81 | 2438.60 | 2360.62 | 1852.06 | 2345.215 | ENSDARG00000114294 | BX511120.1 | | protein_coding | protein_coding | \n",
"\n",
"\n"
],
"text/plain": [
" Name SRR3465546 SRR3465547 SRR3465548 SRR3465549 Median \n",
"0 ENSDART00000093611.3 121405.00 132392.00 132560.00 135275.00 132476.000\n",
"2 ENSDART00000182970.1 19842.10 9807.03 21568.10 15676.90 17759.500\n",
"3 ENSDART00000093609.3 11807.40 13954.20 14022.30 13290.50 13622.350\n",
"4 ENSDART00000093613.3 10803.00 12767.50 12902.30 13264.50 12834.900\n",
"5 ENSDART00000093606.3 11273.20 13201.90 12731.30 12627.60 12679.450\n",
"6 ENSDART00000093612.3 9459.21 11129.50 10755.40 11220.60 10942.450\n",
"10 ENSDART00000171617.2 5566.77 8967.55 5215.21 7021.53 6294.150\n",
"11 ENSDART00000188084.1 5578.48 6410.95 6359.70 5774.42 6067.060\n",
"12 ENSDART00000093617.3 4871.62 6375.34 5814.17 5918.74 5866.455\n",
"13 ENSDART00000181887.1 5265.68 5232.80 5966.44 5993.42 5616.060\n",
"15 ENSDART00000093625.3 4082.53 4731.02 4862.35 4979.03 4796.685\n",
"19 ENSDART00000182119.1 4289.66 4498.40 4416.94 4565.07 4457.670\n",
"22 ENSDART00000149639.2 4097.25 3771.98 4444.10 3604.46 3934.615\n",
"24 ENSDART00000052556.8 3106.96 3046.13 3425.84 2607.85 3076.545\n",
"25 ENSDART00000093615.3 2732.44 3048.63 2987.04 3038.94 3012.990\n",
"26 ENSDART00000182410.1 2893.27 2790.55 2946.34 3451.55 2919.805\n",
"28 ENSDART00000093623.3 2203.58 2532.95 2440.17 2446.22 2443.195\n",
"29 ENSDART00000186782.1 2557.15 2037.68 2265.55 2603.08 2411.350\n",
"30 ENSDART00000191912.1 2323.54 2424.89 2257.54 2486.85 2374.215\n",
"31 ENSDART00000186881.1 2329.81 2438.60 2360.62 1852.06 2345.215\n",
" ensembl_gene_id external_gene_name\n",
"0 ENSDARG00000063910 mt-atp8 \n",
"2 ENSDARG00000111458 wu:fi09b08 \n",
"3 ENSDARG00000063908 mt-co2 \n",
"4 ENSDARG00000063912 mt-co3 \n",
"5 ENSDARG00000063905 mt-co1 \n",
"6 ENSDARG00000063911 mt-atp6 \n",
"10 ENSDARG00000103498 epd \n",
"11 ENSDARG00000117167 rpl39 \n",
"12 ENSDARG00000063916 mt-nd4l \n",
"13 ENSDARG00000115319 mtbl \n",
"15 ENSDARG00000063924 mt-cyb \n",
"19 ENSDARG00000112656 rpl36a \n",
"22 ENSDARG00000036186 mbpa \n",
"24 ENSDARG00000036186 mbpa \n",
"25 ENSDARG00000063914 mt-nd3 \n",
"26 ENSDARG00000116304 ndufa4 \n",
"28 ENSDARG00000063922 mt-nd6 \n",
"29 ENSDARG00000113583 mt2 \n",
"30 ENSDARG00000116220 BX005436.1 \n",
"31 ENSDARG00000114294 BX511120.1 \n",
" description \n",
"0 ATP synthase 8, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-19] \n",
"2 wu:fi09b08 [Source:ZFIN;Acc:ZDB-GENE-030131-5630] \n",
"3 cytochrome c oxidase II, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-15] \n",
"4 cytochrome c oxidase III, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-16] \n",
"5 cytochrome c oxidase I, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-14] \n",
"6 ATP synthase 6, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-18] \n",
"10 ependymin [Source:ZFIN;Acc:ZDB-GENE-980526-111] \n",
"11 ribosomal protein L39 [Source:ZFIN;Acc:ZDB-GENE-040625-51] \n",
"12 NADH dehydrogenase 4L, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-11] \n",
"13 metallothionein-B-like [Source:ZFIN;Acc:ZDB-GENE-110414-3] \n",
"15 cytochrome b, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-17] \n",
"19 ribosomal protein L36A [Source:ZFIN;Acc:ZDB-GENE-020423-1] \n",
"22 myelin basic protein a [Source:ZFIN;Acc:ZDB-GENE-030128-2] \n",
"24 myelin basic protein a [Source:ZFIN;Acc:ZDB-GENE-030128-2] \n",
"25 NADH dehydrogenase 3, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-9] \n",
"26 NADH:ubiquinone oxidoreductase subunit A4 [Source:ZFIN;Acc:ZDB-GENE-040426-1962]\n",
"28 NADH dehydrogenase 6, mitochondrial [Source:ZFIN;Acc:ZDB-GENE-011205-13] \n",
"29 metallothionein 2 [Source:ZFIN;Acc:ZDB-GENE-030131-4174] \n",
"30 \n",
"31 \n",
" transcript_biotype gene_biotype \n",
"0 protein_coding protein_coding\n",
"2 protein_coding protein_coding\n",
"3 protein_coding protein_coding\n",
"4 protein_coding protein_coding\n",
"5 protein_coding protein_coding\n",
"6 protein_coding protein_coding\n",
"10 protein_coding protein_coding\n",
"11 protein_coding protein_coding\n",
"12 protein_coding protein_coding\n",
"13 protein_coding protein_coding\n",
"15 protein_coding protein_coding\n",
"19 protein_coding protein_coding\n",
"22 protein_coding protein_coding\n",
"24 protein_coding protein_coding\n",
"25 protein_coding protein_coding\n",
"26 protein_coding protein_coding\n",
"28 protein_coding protein_coding\n",
"29 protein_coding protein_coding\n",
"30 protein_coding protein_coding\n",
"31 protein_coding protein_coding"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%get DR_PC_transcripts --from Python3\n",
"\n",
"## Preview\n",
"dim(DR_PC_transcripts)\n",
"head(DR_PC_transcripts, 20)\n",
"\n",
"## Extract to CSV\n",
"write.csv(head(DR_PC_transcripts, 20), file = \"DR_top_20_PC_transcripts.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"## Extract amino acid sequences of protein-coding sequences from Biomart"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"kernel": "R"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Batch submitting query [>------------------------------] 3% eta: 5m\n",
"Batch submitting query [>------------------------------] 5% eta: 6m\n",
"Batch submitting query [=>-----------------------------] 6% eta: 6m\n",
"Batch submitting query [=>-----------------------------] 8% eta: 6m\n",
"Batch submitting query [==>----------------------------] 9% eta: 6m\n",
"Batch submitting query [==>----------------------------] 11% eta: 6m\n",
"Batch submitting query [===>---------------------------] 12% eta: 6m\n",
"Batch submitting query [===>---------------------------] 14% eta: 5m\n",
"Batch submitting query [====>--------------------------] 15% eta: 5m\n",
"Batch submitting query [====>--------------------------] 17% eta: 5m\n",
"Batch submitting query [=====>-------------------------] 18% eta: 5m\n",
"Batch submitting query [=====>-------------------------] 20% eta: 5m\n",
"Batch submitting query [======>------------------------] 22% eta: 5m\n",
"Batch submitting query [======>------------------------] 23% eta: 5m\n",
"Batch submitting query [=======>-----------------------] 25% eta: 5m\n",
"Batch submitting query [=======>-----------------------] 26% eta: 5m\n",
"Batch submitting query [========>----------------------] 28% eta: 5m\n",
"Batch submitting query [========>----------------------] 29% eta: 4m\n",
"Batch submitting query [=========>---------------------] 31% eta: 4m\n",
"Batch submitting query [=========>---------------------] 32% eta: 4m\n",
"Batch submitting query [=========>---------------------] 34% eta: 4m\n",
"Batch submitting query [==========>--------------------] 35% eta: 4m\n",
"Batch submitting query [==========>--------------------] 37% eta: 4m\n",
"Batch submitting query [===========>-------------------] 38% eta: 4m\n",
"Batch submitting query [===========>-------------------] 40% eta: 4m\n",
"Batch submitting query [===============================] 100% eta: 0s\n"
]
}
],
"source": [
"DR_PC_transcripts_ID <- DR_PC_transcripts$Name\n",
"\n",
"## Query BioMart with getSequence() to obtain peptide sequences for the selected transcript IDs\n",
"DR_BM_peptide_seqs <- getSequence(id = DR_PC_transcripts_ID, \n",
" type = 'ensembl_transcript_id_version', \n",
" seqType = 'peptide', \n",
" mart = DR_ensembl)\n",
"\n",
"## Export to FASTA\n",
"exportFASTA(DR_BM_peptide_seqs, file='./DR_non0_pep_Nov11.fasta')"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "R"
},
"source": [
"## KOG analysis of extracted protein sequences"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"##[28617541.bc]\n",
"rpsblast -query ~/Lymnaea_CNS_transcriptome_files/7_Interspecies_comparison/7b_Zebrafish/DR_non0_pep_Nov11.fasta -db Kog \\\n",
"-out ~/DR_CNS_pep_Nov12_KOG.txt -evalue 1E-5 \\\n",
"-outfmt \"6 qseqid sseqid stitle pident length mismatch gapopen qlen qstart qend slen sstart send evalue bitscore qcovhsp qcovs\" \\\n",
"-max_hsps 1 -max_target_seqs 1"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n"
]
}
],
"source": [
"## Replace \"[\" and \"]\" plus the character that follows it with \"#\", so the category can be split out when the file is later imported as a dataframe\n",
"cp ~/DR_CNS_pep_Nov12_KOG.txt .\n",
"\n",
"sed -i 's/\\[/#/g' DR_CNS_pep_Nov12_KOG.txt \n",
"sed -i 's/\\]./#/g' DR_CNS_pep_Nov12_KOG.txt"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"kernel": "Python3"
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import os\n",
"os.chdir(\"/home/zhanglab1/ndong/Lymnaea_CNS_transcriptome_files/7_Interspecies_comparison/7b_Zebrafish\")\n",
"\n",
"DR_KOG = pd.read_csv(\"DR_CNS_pep_Nov12_KOG.txt\", sep='#', header=None, engine=\"python\")\n",
"type(DR_KOG)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"kernel": "R"
},
"outputs": [
{
"data": {
"text/html": [
"<table>\n",
"<thead><tr><th></th><th scope=col>0</th><th scope=col>1</th><th scope=col>2</th></tr></thead>\n",
"<tbody>\n",
"\t<tr><th scope=row>0</th><td>ENSDART00000014723.7\tgnl|CDD|230178\tKOG2239, KOG2239, KOG2239, Transcription factor containing NAC and TS-N domains </td><td>Transcription </td><td>\t81.019\t216\t33\t3\t216\t1\t215\t209\t1\t209\t3.22e-94\t272\t99\t99 </td></tr>\n",
"\t<tr><th scope=row>1</th><td>ENSDART00000010777.7\tgnl|CDD|229436\tKOG1495, KOG1495, KOG1495, Lactate dehydrogenase </td><td>Energy production and conversion </td><td>\t68.769\t333\t103\t1\t335\t1\t333\t332\t1\t332\t0.0\t553\t99\t99 </td></tr>\n",
"\t<tr><th scope=row>2</th><td>ENSDART00000004692.9\tgnl|CDD|229467\tKOG1526, KOG1526, KOG1526, NADP-dependent isocitrate dehydrogenase </td><td>Energy production and conversion </td><td>\t73.039\t204\t52\t1\t226\t22\t225\t383\t180\t380\t1.90e-153\t428\t90\t90 </td></tr>\n",
"\t<tr><th scope=row>3</th><td>ENSDART00000016181.11\tgnl|CDD|230870\tKOG2931, KOG2931, KOG2931, Differentiation-related gene 1 protein (NDR1 protein), related proteins </td><td>Function unknown </td><td>\t59.218\t179\t60\t2\t363\t149\t320\t222\t38\t210\t3.28e-82\t247\t47\t47 </td></tr>\n",
"\t<tr><th scope=row>4</th><td>ENSDART00000013690.9\tgnl|CDD|231387\tKOG3449, KOG3449, KOG3449, 60S acidic ribosomal protein P2 </td><td>Translation, ribosomal structure and biogenesis </td><td>\t65.789\t114\t37\t1\t115\t1\t114\t112\t1\t112\t1.23e-30\t103\t99\t99 </td></tr>\n",
"\t<tr><th scope=row>5</th><td>ENSDART00000006948.6\tgnl|CDD|230248\tKOG2309, KOG2309, KOG2309, 60s ribosomal protein L2/L8 </td><td>Translation, ribosomal structure and biogenesis </td><td>\t82.677\t254\t38\t4\t258\t1\t254\t248\t1\t248\t1.69e-151\t420\t98\t98 </td></tr>\n",
"\t<tr><th scope=row>6</th><td>ENSDART00000002595.7\tgnl|CDD|229671\tKOG1732, KOG1732, KOG1732, 60S ribosomal protein L21 </td><td>Translation, ribosomal structure and biogenesis </td><td>\t83.125\t160\t27\t0\t161\t1\t160\t160\t1\t160\t2.68e-85\t245\t99\t99 </td></tr>\n",
"\t<tr><th scope=row>7</th><td>ENSDART00000015629.9\tgnl|CDD|229242\tKOG1300, KOG1300, KOG1300, Vesicle trafficking protein Sec1 </td><td>Intracellular trafficking, secretion, and vesicular transport </td><td>\t59.391\t591\t232\t8\t592\t4\t589\t593\t1\t588\t0.0\t841\t99\t99 </td></tr>\n",
"\t<tr><th scope=row>8</th><td>ENSDART00000002691.9\tgnl|CDD|231813\tKOG3882, KOG3882, KOG3882, Tetraspanin family integral membrane protein </td><td>General function prediction only </td><td>\t34.728\t239\t147\t3\t250\t11\t248\t237\t5\t235\t1.51e-60\t188\t95\t95 </td></tr>\n",
"\t<tr><th scope=row>9</th><td>ENSDART00000006132.8\tgnl|CDD|229674\tKOG1735, KOG1735, KOG1735, Actin depolymerizing factor </td><td>Cytoskeleton </td><td>\t49.367\t158\t58\t5\t166\t1\t152\t146\t1\t142\t1.08e-53\t164\t92\t92 </td></tr>\n",
"</tbody>\n",
"</table>\n"
],
"text/latex": [
"\\begin{tabular}{r|lll}\n",
" & 0 & 1 & 2\\\\\n",
"\\hline\n",
"\t0 & ENSDART00000014723.7\tgnl\\textbar{}CDD\\textbar{}230178\tKOG2239, KOG2239, KOG2239, Transcription factor containing NAC and TS-N domains & Transcription & \t81.019\t216\t33\t3\t216\t1\t215\t209\t1\t209\t3.22e-94\t272\t99\t99 \\\\\n",
"\t1 & ENSDART00000010777.7\tgnl\\textbar{}CDD\\textbar{}229436\tKOG1495, KOG1495, KOG1495, Lactate dehydrogenase & Energy production and conversion & \t68.769\t333\t103\t1\t335\t1\t333\t332\t1\t332\t0.0\t553\t99\t99 \\\\\n",
"\t2 & ENSDART00000004692.9\tgnl\\textbar{}CDD\\textbar{}229467\tKOG1526, KOG1526, KOG1526, NADP-dependent isocitrate dehydrogenase & Energy production and conversion & \t73.039\t204\t52\t1\t226\t22\t225\t383\t180\t380\t1.90e-153\t428\t90\t90 \\\\\n",
"\t3 & ENSDART00000016181.11\tgnl\\textbar{}CDD\\textbar{}230870\tKOG2931, KOG2931, KOG2931, Differentiation-related gene 1 protein (NDR1 protein), related proteins & Function unknown & \t59.218\t179\t60\t2\t363\t149\t320\t222\t38\t210\t3.28e-82\t247\t47\t47 \\\\\n",
"\t4 & ENSDART00000013690.9\tgnl\\textbar{}CDD\\textbar{}231387\tKOG3449, KOG3449, KOG3449, 60S acidic ribosomal protein P2 & Translation, ribosomal structure and biogenesis & \t65.789\t114\t37\t1\t115\t1\t114\t112\t1\t112\t1.23e-30\t103\t99\t99 \\\\\n",
"\t5 & ENSDART00000006948.6\tgnl\\textbar{}CDD\\textbar{}230248\tKOG2309, KOG2309, KOG2309, 60s ribosomal protein L2/L8 & Translation, ribosomal structure and biogenesis & \t82.677\t254\t38\t4\t258\t1\t254\t248\t1\t248\t1.69e-151\t420\t98\t98 \\\\\n",
"\t6 & ENSDART00000002595.7\tgnl\\textbar{}CDD\\textbar{}229671\tKOG1732, KOG1732, KOG1732, 60S ribosomal protein L21 & Translation, ribosomal structure and biogenesis & \t83.125\t160\t27\t0\t161\t1\t160\t160\t1\t160\t2.68e-85\t245\t99\t99 \\\\\n",
"\t7 & ENSDART00000015629.9\tgnl\\textbar{}CDD\\textbar{}229242\tKOG1300, KOG1300, KOG1300, Vesicle trafficking protein Sec1 & Intracellular trafficking, secretion, and vesicular transport & \t59.391\t591\t232\t8\t592\t4\t589\t593\t1\t588\t0.0\t841\t99\t99 \\\\\n",
"\t8 & ENSDART00000002691.9\tgnl\\textbar{}CDD\\textbar{}231813\tKOG3882, KOG3882, KOG3882, Tetraspanin family integral membrane protein & General function prediction only & \t34.728\t239\t147\t3\t250\t11\t248\t237\t5\t235\t1.51e-60\t188\t95\t95 \\\\\n",
"\t9 & ENSDART00000006132.8\tgnl\\textbar{}CDD\\textbar{}229674\tKOG1735, KOG1735, KOG1735, Actin depolymerizing factor & Cytoskeleton & \t49.367\t158\t58\t5\t166\t1\t152\t146\t1\t142\t1.08e-53\t164\t92\t92 \\\\\n",
"\\end{tabular}\n"
],
"text/markdown": [
"\n",
"| <!--/--> | 0 | 1 | 2 | \n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | ENSDART00000014723.7\tgnl|CDD|230178\tKOG2239, KOG2239, KOG2239, Transcription factor containing NAC and TS-N domains | Transcription | \t81.019\t216\t33\t3\t216\t1\t215\t209\t1\t209\t3.22e-94\t272\t99\t99 | \n",
"| 1 | ENSDART00000010777.7\tgnl|CDD|229436\tKOG1495, KOG1495, KOG1495, Lactate dehydrogenase | Energy production and conversion | \t68.769\t333\t103\t1\t335\t1\t333\t332\t1\t332\t0.0\t553\t99\t99 | \n",
"| 2 | ENSDART00000004692.9\tgnl|CDD|229467\tKOG1526, KOG1526, KOG1526, NADP-dependent isocitrate dehydrogenase | Energy production and conversion | \t73.039\t204\t52\t1\t226\t22\t225\t383\t180\t380\t1.90e-153\t428\t90\t90 | \n",
"| 3 | ENSDART00000016181.11\tgnl|CDD|230870\tKOG2931, KOG2931, KOG2931, Differentiation-related gene 1 protein (NDR1 protein), related proteins | Function unknown | \t59.218\t179\t60\t2\t363\t149\t320\t222\t38\t210\t3.28e-82\t247\t47\t47 | \n",
"| 4 | ENSDART00000013690.9\tgnl|CDD|231387\tKOG3449, KOG3449, KOG3449, 60S acidic ribosomal protein P2 | Translation, ribosomal structure and biogenesis | \t65.789\t114\t37\t1\t115\t1\t114\t112\t1\t112\t1.23e-30\t103\t99\t99 | \n",
"| 5 | ENSDART00000006948.6\tgnl|CDD|230248\tKOG2309, KOG2309, KOG2309, 60s ribosomal protein L2/L8 | Translation, ribosomal structure and biogenesis | \t82.677\t254\t38\t4\t258\t1\t254\t248\t1\t248\t1.69e-151\t420\t98\t98 | \n",
"| 6 | ENSDART00000002595.7\tgnl|CDD|229671\tKOG1732, KOG1732, KOG1732, 60S ribosomal protein L21 | Translation, ribosomal structure and biogenesis | \t83.125\t160\t27\t0\t161\t1\t160\t160\t1\t160\t2.68e-85\t245\t99\t99 | \n",
"| 7 | ENSDART00000015629.9\tgnl|CDD|229242\tKOG1300, KOG1300, KOG1300, Vesicle trafficking protein Sec1 | Intracellular trafficking, secretion, and vesicular transport | \t59.391\t591\t232\t8\t592\t4\t589\t593\t1\t588\t0.0\t841\t99\t99 | \n",
"| 8 | ENSDART00000002691.9\tgnl|CDD|231813\tKOG3882, KOG3882, KOG3882, Tetraspanin family integral membrane protein | General function prediction only | \t34.728\t239\t147\t3\t250\t11\t248\t237\t5\t235\t1.51e-60\t188\t95\t95 | \n",
"| 9 | ENSDART00000006132.8\tgnl|CDD|229674\tKOG1735, KOG1735, KOG1735, Actin depolymerizing factor | Cytoskeleton | \t49.367\t158\t58\t5\t166\t1\t152\t146\t1\t142\t1.08e-53\t164\t92\t92 | \n",
"\n",
"\n"
],
"text/plain": [
" 0 \n",
"0 ENSDART00000014723.7\\tgnl|CDD|230178\\tKOG2239, KOG2239, KOG2239, Transcription factor containing NAC and TS-N domains \n",
"1 ENSDART00000010777.7\\tgnl|CDD|229436\\tKOG1495, KOG1495, KOG1495, Lactate dehydrogenase \n",
"2 ENSDART00000004692.9\\tgnl|CDD|229467\\tKOG1526, KOG1526, KOG1526, NADP-dependent isocitrate dehydrogenase \n",
"3 ENSDART00000016181.11\\tgnl|CDD|230870\\tKOG2931, KOG2931, KOG2931, Differentiation-related gene 1 protein (NDR1 protein), related proteins \n",
"4 ENSDART00000013690.9\\tgnl|CDD|231387\\tKOG3449, KOG3449, KOG3449, 60S acidic ribosomal protein P2 \n",
"5 ENSDART00000006948.6\\tgnl|CDD|230248\\tKOG2309, KOG2309, KOG2309, 60s ribosomal protein L2/L8 \n",
"6 ENSDART00000002595.7\\tgnl|CDD|229671\\tKOG1732, KOG1732, KOG1732, 60S ribosomal protein L21 \n",
"7 ENSDART00000015629.9\\tgnl|CDD|229242\\tKOG1300, KOG1300, KOG1300, Vesicle trafficking protein Sec1 \n",
"8 ENSDART00000002691.9\\tgnl|CDD|231813\\tKOG3882, KOG3882, KOG3882, Tetraspanin family integral membrane protein \n",
"9 ENSDART00000006132.8\\tgnl|CDD|229674\\tKOG1735, KOG1735, KOG1735, Actin depolymerizing factor \n",
" 1 \n",
"0 Transcription \n",
"1 Energy production and conversion \n",
"2 Energy production and conversion \n",
"3 Function unknown \n",
"4 Translation, ribosomal structure and biogenesis \n",
"5 Translation, ribosomal structure and biogenesis \n",
"6 Translation, ribosomal structure and biogenesis \n",
"7 Intracellular trafficking, secretion, and vesicular transport\n",
"8 General function prediction only \n",
"9 Cytoskeleton \n",
" 2 \n",
"0 \\t81.019\\t216\\t33\\t3\\t216\\t1\\t215\\t209\\t1\\t209\\t3.22e-94\\t272\\t99\\t99 \n",
"1 \\t68.769\\t333\\t103\\t1\\t335\\t1\\t333\\t332\\t1\\t332\\t0.0\\t553\\t99\\t99 \n",
"2 \\t73.039\\t204\\t52\\t1\\t226\\t22\\t225\\t383\\t180\\t380\\t1.90e-153\\t428\\t90\\t90\n",
"3 \\t59.218\\t179\\t60\\t2\\t363\\t149\\t320\\t222\\t38\\t210\\t3.28e-82\\t247\\t47\\t47 \n",
"4 \\t65.789\\t114\\t37\\t1\\t115\\t1\\t114\\t112\\t1\\t112\\t1.23e-30\\t103\\t99\\t99 \n",
"5 \\t82.677\\t254\\t38\\t4\\t258\\t1\\t254\\t248\\t1\\t248\\t1.69e-151\\t420\\t98\\t98 \n",
"6 \\t83.125\\t160\\t27\\t0\\t161\\t1\\t160\\t160\\t1\\t160\\t2.68e-85\\t245\\t99\\t99 \n",
"7 \\t59.391\\t591\\t232\\t8\\t592\\t4\\t589\\t593\\t1\\t588\\t0.0\\t841\\t99\\t99 \n",
"8 \\t34.728\\t239\\t147\\t3\\t250\\t11\\t248\\t237\\t5\\t235\\t1.51e-60\\t188\\t95\\t95 \n",
"9 \\t49.367\\t158\\t58\\t5\\t166\\t1\\t152\\t146\\t1\\t142\\t1.08e-53\\t164\\t92\\t92 "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%get DR_KOG --from Python3\n",
"head(DR_KOG, 10)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"kernel": "Python3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RNA processing and modification 882\n",
"Chromatin structure and dynamics 425\n",
"Energy production and conversion 567\n",
"Cell cycle control 687\n",
"Amino acid transport and metabolism 605\n",
"Nucleotide transport and metabolism 286\n",
"Carbohydrate transport and metabolism 676\n",
"Coenzyme transport and metabolism 165\n",
"Lipid transport and metabolism 731\n",
"Translation, ribosomal structure and biogenesis 691\n",
"Transcription 3095\n",
"Replication, recombination and repair 438\n",
"Cell wall/membrane/envelope biogenesis 278\n",
"Cell motility 64\n",
"Posttranslational modification 2036\n",
"Inorganic ion transport and metabolism 809\n",
"Secondary metabolites 291\n",
"General function prediction only 3365\n",
"Function unknown 1940\n",
"Signal transduction mechanisms 6221\n",
"Intracellular trafficking 1613\n",
"Defense mechanisms 228\n",
"Extracellular structures 608\n",
"Nuclear structure 172\n",
"Cytoskeleton 1429\n",
" KOG Count DR_Percentage\n",
"0 RNA processing and modification 882 3.116388\n",
"1 Chromatin structure and dynamics 425 1.501661\n",
"2 Energy production and conversion 567 2.003392\n",
"3 Cell cycle control 687 2.427390\n",
"4 Amino acid transport and metabolism 605 2.137658\n",
"5 Nucleotide transport and metabolism 286 1.010529\n",
"6 Carbohydrate transport and metabolism 676 2.388524\n",
"7 Coenzyme transport and metabolism 165 0.582998\n",
"8 Lipid transport and metabolism 731 2.582856\n",
"9 Translation, ribosomal structure and biogenesis 691 2.441524\n",
"10 Transcription 3095 10.935623\n",
"11 Replication, recombination and repair 438 1.547594\n",
"12 Cell wall/membrane/envelope biogenesis 278 0.982263\n",
"13 Cell motility 64 0.226132\n",
"14 Posttranslational modification 2036 7.193838\n",
"15 Inorganic ion transport and metabolism 809 2.858455\n",
"16 Secondary metabolites 291 1.028196\n",
"17 General function prediction only 3365 11.889619\n",
"18 Function unknown 1940 6.854639\n",
"19 Signal transduction mechanisms 6221 21.980779\n",
"20 Intracellular trafficking 1613 5.699244\n",
"21 Defense mechanisms 228 0.805597\n",
"22 Extracellular structures 608 2.148258\n",
"23 Nuclear structure 172 0.607731\n",
"24 Cytoskeleton 1429 5.049113\n"
]
}
],
"source": [
"## Count the number of occurrences of each category\n",
"KOGs= [\"RNA processing and modification\", \"Chromatin structure and dynamics\", \"Energy production and conversion\", \"Cell cycle control\", \n",
" \"Amino acid transport and metabolism\", \"Nucleotide transport and metabolism\", \"Carbohydrate transport and metabolism\", \"Coenzyme transport and metabolism\", \n",
" \"Lipid transport and metabolism\", \"Translation, ribosomal structure and biogenesis\", \"Transcription\", \"Replication, recombination and repair\", \n",
" \"Cell wall/membrane/envelope biogenesis\", \"Cell motility\", \"Posttranslational modification\", \"Inorganic ion transport and metabolism\", \n",
" \"Secondary metabolites\", \"General function prediction only\", \"Function unknown\", \"Signal transduction mechanisms\", \"Intracellular trafficking\", \n",
" \"Defense mechanisms\", \"Extracellular structures\", \"Nuclear structure\", \"Cytoskeleton\"]\n",
"\n",
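"## Column 1 holds the functional category text; tally substring matches once per category\n",
"data = []\n",
"for KOG in KOGs:\n",
"    count = DR_KOG[1].str.contains(KOG).sum()\n",
"    print(KOG, count)\n",
"    data.append([KOG, count])\n",
"\n",
"df = pd.DataFrame(data, columns=[\"KOG\", \"Count\"])\n",
"\n",
"## Percentage of all category assignments (the denominator is the sum of counts)\n",
"df[\"DR_Percentage\"] = df[\"Count\"] / df[\"Count\"].sum() * 100\n",
"print(df)\n",
"\n",
"## Save the per-category percentages as a tab-separated file\n",
"df[[\"KOG\", \"DR_Percentage\"]].to_csv(\"DR_KOG_summary.txt\", sep=\"\\t\", index=False)"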
]
},
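{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"The tally above relies on substring matching, so a short name such as `Intracellular trafficking` also matches the longer category text in the rpsblast titles. The cell below is a minimal sketch of that behaviour on made-up toy rows; `toy`, `toy_categories` and `toy_counts` are hypothetical names, and `regex=False` is an added assumption so that category names are treated as literal text."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "Python3"
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"## Hypothetical toy annotations standing in for column 1 of DR_KOG\n",
"toy = pd.Series([\n",
"    'Transcription ',\n",
"    'Energy production and conversion ',\n",
"    'Intracellular trafficking, secretion, and vesicular transport ',\n",
"    'Transcription ',\n",
"])\n",
"toy_categories = ['Transcription', 'Energy production and conversion', 'Intracellular trafficking']\n",
"\n",
"## regex=False matches each category name as a literal substring\n",
"toy_counts = pd.DataFrame(\n",
"    [(c, int(toy.str.contains(c, regex=False).sum())) for c in toy_categories],\n",
"    columns=['KOG', 'Count'],\n",
")\n",
"\n",
"## Percentages are relative to the total number of matches\n",
"toy_counts['Percentage'] = toy_counts['Count'] / toy_counts['Count'].sum() * 100\n",
"print(toy_counts)"
]
},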
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"# Archive"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"- `SRR3465546_non0_Jun30.txt` ---> 44,699 IDs\n",
"- `SRR3465547_non0_Jun30.txt` ---> 44,324 IDs\n",
"- `SRR3465548_non0_Jun30.txt` ---> 43,915 IDs\n",
"- `SRR3465549_non0_Jun30.txt` ---> 44,017 IDs\n",
"\n",
"- `DR_6789_non0_Jun30.txt` ---> 36,362 IDs"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"## Extract transcripts with TPM>0 in all 4 libraries\n",
"grep -Fwf SRR3465546_non0_Jun30.txt SRR3465547_non0_Jun30.txt > DR_67_non0_Jun30.txt\n",
"grep -Fwf SRR3465548_non0_Jun30.txt DR_67_non0_Jun30.txt > DR_678_non0_Jun30.txt\n",
"grep -Fwf SRR3465549_non0_Jun30.txt DR_678_non0_Jun30.txt > DR_6789_non0_Jun30.txt"
]
},
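{
"cell_type": "markdown",
"metadata": {
"kernel": "SoS"
},
"source": [
"The chained `grep -Fwf` calls keep only the IDs found in all four libraries. The cell below is a minimal sketch of the same idea as an exact set intersection in Python. The `shared_ids` helper and the three short toy lists are hypothetical; a real run would read the `*_non0_Jun30.txt` files instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "Python3"
},
"outputs": [],
"source": [
"def shared_ids(*id_lists):\n",
"    # Return the IDs present in every list, keeping the order of the first list\n",
"    common = set(id_lists[0]).intersection(*id_lists[1:])\n",
"    return [i for i in id_lists[0] if i in common]\n",
"\n",
"## Toy lists built from IDs shown in the output below, not the real per-library files\n",
"lib_a = ['ENSDART00000093611.3', 'ENSDART00000117474.3', 'ENSDART00000174022.2']\n",
"lib_b = ['ENSDART00000117474.3', 'ENSDART00000093611.3']\n",
"lib_c = ['ENSDART00000093611.3', 'ENSDART00000093609.3']\n",
"\n",
"print(shared_ids(lib_a, lib_b, lib_c))"
]
},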
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"36362 DR_6789_non0_Jun30.txt\n",
"\n",
"ENSDART00000093611.3\n",
"ENSDART00000117474.3\n",
"ENSDART00000174022.2\n",
"ENSDART00000093609.3\n",
"ENSDART00000093606.3\n",
"ENSDART00000093613.3\n",
"ENSDART00000093612.3\n",
"ENSDART00000182970.1\n",
"ENSDART00000171617.2\n",
"ENSDART00000116823.3\n",
"\n"
]
}
],
"source": [
"wc -l DR_6789_non0_Jun30.txt\n",
"echo \"\"\n",
"head -n 10 DR_6789_non0_Jun30.txt"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "R"
},
"source": [
"## KOG analysis of extracted protein sequences"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"rpsblast -query ~/CNS-transcriptomes/DR/DR_CNS_pep_Jun30.fa -db Kog \\\n",
"-out ~/CNS-transcriptomes/DR/DR_CNS_pep_Jun30_KOG.txt -evalue 1E-5 \\\n",
"-outfmt \"6 qseqid sseqid stitle pident length mismatch gapopen qlen qstart qend slen sstart send evalue bitscore qcovhsp qcovs\" \\\n",
"-max_hsps 1 -max_target_seqs 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"kernel": "R"
},
"source": [
"## Extract protein sequences of expressed transcripts from Ensembl reference protein sequences\n",
"\n",
"`DR_CNS_pep_Jun30.fa` ---> 32,394 sequences"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"kernel": "calysto_bash"
},
"outputs": [],
"source": [
"## Download protein sequences\n",
"wget ftp://ftp.ensembl.org/pub/release-92/fasta/danio_rerio/pep/Danio_rerio.GRCz11.pep.all.fa.gz\n",
"gunzip Danio_rerio.GRCz11.pep.all.fa.gz\n",
"mv Danio_rerio.GRCz11.pep.all.fa DR_pep.fa\n",
"\n",
"## Extract protein sequences of transcripts with TPM>0 in all 4 libraries [27923037.bc]\n",
"filterbyname.sh in=DR_pep.fa out=DR_CNS_pep_Jun30.fa names=DR_6789_non0_Jun30.txt include=t substring "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "SoS",
"language": "sos",
"name": "sos"
},
"language_info": {
"codemirror_mode": "sos",
"file_extension": ".sos",
"mimetype": "text/x-sos",
"name": "sos",
"nbconvert_exporter": "sos_notebook.converter.SoS_Exporter",
"pygments_lexer": "sos"
},
"sos": {
"kernels": [
[
"Python3",
"python3",
"Python3",
"#FFD91A"
],
[
"R",
"ir",
"R",
"#DCDCDA"
],
[
"calysto_bash",
"calysto_bash",
"",
""
]
],
"version": "0.9.15.8"
},
"toc-autonumbering": true
},
"nbformat": 4,
"nbformat_minor": 2
}