Skip to content

Instantly share code, notes, and snippets.

@k3yavi
Last active July 11, 2019 20:35
Show Gist options
  • Save k3yavi/c501705ed2d29b12b0d10cf78b3ed001 to your computer and use it in GitHub Desktop.
Save k3yavi/c501705ed2d29b12b0d10cf78b3ed001 to your computer and use it in GitHub Desktop.
Example pipeline for Alevin-tool on 10x-pbmc 1k data.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"salmon-latest_linux_x86_64/\n",
"salmon-latest_linux_x86_64/sample_data.tgz\n",
"salmon-latest_linux_x86_64/bin/\n",
"salmon-latest_linux_x86_64/bin/salmon\n",
"salmon-latest_linux_x86_64/lib/\n",
"salmon-latest_linux_x86_64/lib/libm.so.6\n",
"salmon-latest_linux_x86_64/lib/libtbb.so\n",
"salmon-latest_linux_x86_64/lib/libgomp.so.1\n",
"salmon-latest_linux_x86_64/lib/libtbbmalloc.so\n",
"salmon-latest_linux_x86_64/lib/libtbbmalloc_proxy.so\n",
"salmon-latest_linux_x86_64/lib/libtbbmalloc.so.2\n",
"salmon-latest_linux_x86_64/lib/libtbb.so.2\n",
"salmon-latest_linux_x86_64/lib/liblzma.so.0\n",
"salmon-latest_linux_x86_64/lib/libtbbmalloc_proxy.so.2\n",
"salmon-latest_linux_x86_64/lib/libgcc_s.so.1\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"2019-07-11 16:00:18 URL:https://github-production-release-asset-2e65be.s3.amazonaws.com/32549942/905b8780-981a-11e9-9e44-e488007f1799?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20190711%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20190711T200017Z&X-Amz-Expires=300&X-Amz-Signature=a41878f46d8dab13cc4a704f30c2e8eae317c3190889328fad9565a18fc6b7e0&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dsalmon-0.14.1_linux_x86_64.tar.gz&response-content-type=application%2Foctet-stream [37093812/37093812] -> \"salmon-0.14.1_linux_x86_64.tar.gz.1\" [1]\n"
]
}
],
"source": [
"%%bash\n",
"# Download Alevin\n",
"wget -nv https://github.com/COMBINE-lab/salmon/releases/download/v0.14.1/salmon-0.14.1_linux_x86_64.tar.gz\n",
"tar -xvzf salmon-0.14.1_linux_x86_64.tar.gz"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2019-07-11 16:01:40 URL: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.pc_transcripts.fa.gz [42658144] -> \"gencode.v31.pc_transcripts.fa.gz.1\" [1]\n"
]
}
],
"source": [
"%%bash\n",
"# Download Reference transcriptome\n",
"wget -nv ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.pc_transcripts.fa.gz"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2019-07-11 16:02:40 URL: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.primary_assembly.annotation.gtf.gz [42834319] -> \"gencode.v31.primary_assembly.annotation.gtf.gz\" [1]\n"
]
}
],
"source": [
"%%bash\n",
"# Downloading GTF and extracting Txp to Gene mapping\n",
"wget -nv ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.primary_assembly.annotation.gtf.gz\n",
"bioawk -c gff '$feature==\"transcript\" {print $group}' <(gunzip -c gencode.v31.primary_assembly.annotation.gtf.gz) | awk -F ' ' '{print substr($4,2,length($4)-3) \"\\t\" substr($2,2,length($2)-3)}' - > txp2gene.tsv"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Version Info: This is the most recent version of salmon.\n",
"index [\"index\"] did not previously exist . . . creating it\n",
"[2019-07-11 16:02:51.688] [jLog] [info] building index\n",
"[2019-07-11 16:02:51.688] [jointLog] [info] [Step 1 of 4] : counting k-mers\n",
"[2019-07-11 16:02:55.396] [jointLog] [warning] Entry with header [ENST00000632684.1|ENSG00000282431.1|OTTHUMG00000190602.2|OTTHUMT00000485301.2|TRBD1-202|TRBD1|12|CDS:1-12|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.288] [jointLog] [warning] Entry with header [ENST00000415118.1|ENSG00000223997.1|OTTHUMG00000170844.2|OTTHUMT00000410670.2|TRDD1-201|TRDD1|8|CDS:1-8|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.288] [jointLog] [warning] Entry with header [ENST00000434970.2|ENSG00000237235.2|OTTHUMG00000170845.2|OTTHUMT00000410671.2|TRDD2-201|TRDD2|9|CDS:1-9|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.288] [jointLog] [warning] Entry with header [ENST00000448914.1|ENSG00000228985.1|OTTHUMG00000170846.2|OTTHUMT00000410672.2|TRDD3-201|TRDD3|13|CDS:1-13|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000439842.1|ENSG00000236597.1|OTTHUMG00000152435.2|OTTHUMT00000326213.2|IGHD7-27-201|IGHD7-27|11|CDS:1-11|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000390567.1|ENSG00000211907.1|OTTHUMG00000152429.2|OTTHUMT00000326207.2|IGHD1-26-201|IGHD1-26|20|CDS:1-20|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000452198.1|ENSG00000225825.1|OTTHUMG00000152436.2|OTTHUMT00000326214.2|IGHD6-25-201|IGHD6-25|18|CDS:1-18|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000390569.1|ENSG00000211909.1|OTTHUMG00000152427.2|OTTHUMT00000326205.2|IGHD5-24-201|IGHD5-24|20|CDS:1-20|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000450276.1|ENSG00000237020.1|OTTHUMG00000152432.2|OTTHUMT00000326210.2|IGHD1-20-201|IGHD1-20|17|CDS:1-17|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000390574.1|ENSG00000211914.1|OTTHUMG00000152431.2|OTTHUMT00000326209.2|IGHD6-19-201|IGHD6-19|21|CDS:1-21|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000390575.1|ENSG00000211915.1|OTTHUMG00000152433.2|OTTHUMT00000326211.2|IGHD5-18-201|IGHD5-18|20|CDS:1-20|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000451044.1|ENSG00000227108.1|OTTHUMG00000152369.2|OTTHUMT00000326003.2|IGHD1-14-201|IGHD1-14|17|CDS:1-17|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000390580.1|ENSG00000211920.1|OTTHUMG00000152370.2|OTTHUMT00000326004.2|IGHD6-13-201|IGHD6-13|21|CDS:1-21|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000390581.1|ENSG00000211921.1|OTTHUMG00000152367.2|OTTHUMT00000326001.2|IGHD5-12-201|IGHD5-12|23|CDS:1-23|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000430425.1|ENSG00000237197.1|OTTHUMG00000152357.2|OTTHUMT00000325963.2|IGHD1-7-201|IGHD1-7|17|CDS:1-17|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.648] [jointLog] [warning] Entry with header [ENST00000454691.1|ENSG00000228131.1|OTTHUMG00000152353.2|OTTHUMT00000325959.2|IGHD6-6-201|IGHD6-6|18|CDS:1-18|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.649] [jointLog] [warning] Entry with header [ENST00000390588.1|ENSG00000211928.1|OTTHUMG00000152360.2|OTTHUMT00000325966.2|IGHD5-5-201|IGHD5-5|20|CDS:1-20|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.649] [jointLog] [warning] Entry with header [ENST00000454908.1|ENSG00000236170.1|OTTHUMG00000152359.2|OTTHUMT00000325965.2|IGHD1-1-201|IGHD1-1|17|CDS:1-17|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.649] [jointLog] [warning] Entry with header [ENST00000604642.1|ENSG00000270961.1|OTTHUMG00000184622.2|OTTHUMT00000468982.2|IGHD5OR15-5A-201|IGHD5OR15-5A|23|CDS:1-23|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.649] [jointLog] [warning] Entry with header [ENST00000605284.1|ENSG00000271336.1|OTTHUMG00000184580.2|OTTHUMT00000468908.2|IGHD1OR15-1A-201|IGHD1OR15-1A|17|CDS:1-17|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.650] [jointLog] [warning] Entry with header [ENST00000604446.1|ENSG00000270824.1|OTTHUMG00000184624.2|OTTHUMT00000468984.2|IGHD5OR15-5B-201|IGHD5OR15-5B|23|CDS:1-23|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"[2019-07-11 16:02:57.650] [jointLog] [warning] Entry with header [ENST00000604838.1|ENSG00000270185.1|OTTHUMG00000184585.2|OTTHUMT00000468915.2|IGHD1OR15-1B-201|IGHD1OR15-1B|17|CDS:1-17|], had length less than the k-mer length of 31 (perhaps after poly-A clipping)\n",
"Elapsed time: 7.7178s\n",
"\n",
"[2019-07-11 16:02:59.406] [jointLog] [warning] Removed 140 transcripts that were sequence duplicates of indexed transcripts.\n",
"[2019-07-11 16:02:59.406] [jointLog] [warning] If you wish to retain duplicate transcripts, please use the `--keepDuplicates` flag\n",
"[2019-07-11 16:02:59.407] [jointLog] [info] Replaced 0 non-ATCG nucleotides\n",
"[2019-07-11 16:02:59.407] [jointLog] [info] Clipped poly-A tails from 611 transcripts\n",
"[2019-07-11 16:02:59.423] [jointLog] [info] Building rank-select dictionary and saving to disk\n",
"[2019-07-11 16:02:59.442] [jointLog] [info] done\n",
"Elapsed time: 0.0191958s\n",
"[2019-07-11 16:02:59.442] [jointLog] [info] Writing sequence data to file . . . \n",
"[2019-07-11 16:02:59.610] [jointLog] [info] done\n",
"Elapsed time: 0.168379s\n",
"[2019-07-11 16:02:59.618] [jointLog] [info] Building 32-bit suffix array (length of generalized text is 193,209,717)\n",
"[2019-07-11 16:02:59.953] [jointLog] [info] Building suffix array . . . \n",
"success\n",
"saving to disk . . . done\n",
"Elapsed time: 0.762352s\n",
"done\n",
"Elapsed time: 31.4003s\n",
"processed 193,000,000 positions[2019-07-11 16:05:17.135] [jointLog] [info] khash had 74,005,889 keys\n",
"[2019-07-11 16:05:17.641] [jointLog] [info] saving hash to disk . . . \n",
"[2019-07-11 16:05:26.118] [jointLog] [info] done\n",
"Elapsed time: 8.47652s\n",
"[2019-07-11 16:05:26.904] [jLog] [info] done building index\n"
]
}
],
"source": [
"%%bash\n",
"# create salmon index\n",
"salmon-latest_linux_x86_64/bin/salmon index -i index -k 31 --gencode -p 4 -t gencode.v31.pc_transcripts.fa.gz"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# Assuming the raw-fastqs are available under folder fastqs\n",
"# NOTE: for 10x-v2, Files with subsequence -- `R1` and `R2` are relavant for alevin, `I1` can be ignored."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.00201235\t0.000547496\t0.00189954\t96398.5\t3590.05\t0.100335\t\n",
"0.0386245\t0.130075\t0.078244\t3117.08\t281.453\t2044.74\t\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Version Info: This is the most recent version of salmon.\n",
"Logs will be written to alevin_output/logs\n",
"[2019-07-11 16:15:52.376] [jointLog] [info] Fragment incompatibility prior below threshold. Incompatible fragments will be ignored.\n",
"[2019-07-11 16:15:52.376] [jointLog] [info] Usage of --validateMappings implies use of minScoreFraction. Since not explicitly specified, it is being set to 0.65\n",
"[2019-07-11 16:15:52.376] [jointLog] [info] Usage of --validateMappings, without --hardFilter implies use of range factorization. rangeFactorizationBins is being set to 4\n",
"[2019-07-11 16:15:52.376] [jointLog] [info] Usage of --validateMappings implies a default consensus slack of 0.2. Setting consensusSlack to 0.2.\n",
"[2019-07-11 16:15:52.376] [jointLog] [info] Using default value of 0.87 for minScoreFraction in Alevin\n",
"Using default value of 0.6 for consensusSlack in Alevin\n",
"[2019-07-11 16:15:52.376] [alevinLog] [info] Loading Header\n",
"[2019-07-11 16:15:52.376] [alevinLog] [info] Loading Transcript Info \n",
"### alevin (dscRNA-seq quantification) v0.14.1\n",
"### [ program ] => salmon \n",
"### [ command ] => alevin \n",
"### [ libType ] => { ISR }\n",
"### [ mates1 ] => { fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz }\n",
"### [ mates2 ] => { fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz }\n",
"### [ chromium ] => { }\n",
"### [ index ] => { index }\n",
"### [ threads ] => { 8 }\n",
"### [ output ] => { alevin_output }\n",
"### [ tgMap ] => { txp2gene.tsv }\n",
"\n",
"\n",
"[2019-07-11 16:15:52.531] [alevinLog] [info] Found all transcripts to gene mappings\n",
"[2019-07-11 16:15:52.546] [alevinLog] [info] Processing barcodes files (if Present) \n",
"\n",
" \n",
"\u001b[32mprocessed\u001b[31m 38 Million \u001b[32mbarcodes\u001b[0m\n",
"\n",
"[2019-07-11 16:17:05.092] [alevinLog] [info] Done barcode density calculation.\n",
"[2019-07-11 16:17:05.092] [alevinLog] [info] # Barcodes Used: \u001b[32m38588322\u001b[0m / \u001b[31m38615880\u001b[0m.\n",
"[2019-07-11 16:17:09.788] [alevinLog] [info] Knee found left boundary at \u001b[32m 1219 \u001b[0m\n",
"[2019-07-11 16:17:10.194] [alevinLog] [info] Gauss Corrected Boundary at \u001b[32m 1016 \u001b[0m\n",
"[2019-07-11 16:17:10.194] [alevinLog] [info] Learned InvCov: 86.3314 normfactor: 567.612\n",
"[2019-07-11 16:17:10.194] [alevinLog] [info] Total \u001b[32m1521\u001b[0m(has \u001b[32m504\u001b[0m low confidence) barcodes\n",
"[2019-07-11 16:17:10.920] [alevinLog] [info] Done True Barcode Sampling\n",
"[2019-07-11 16:17:11.045] [alevinLog] [info] Total 8.66749% reads will be thrown away because of noisy Cellular barcodes.\n",
"[2019-07-11 16:17:11.162] [alevinLog] [info] Done populating Z matrix\n",
"[2019-07-11 16:17:11.173] [alevinLog] [info] Total 31730 CB got sequence corrected\n",
"[2019-07-11 16:17:11.176] [alevinLog] [info] Done indexing Barcodes\n",
"[2019-07-11 16:17:11.176] [alevinLog] [info] Total Unique barcodes found: 489673\n",
"[2019-07-11 16:17:11.176] [alevinLog] [info] Used Barcodes except Whitelist: 30899\n",
"[2019-07-11 16:17:11.373] [alevinLog] [info] Done with Barcode Processing; Moving to Quantify\n",
"\n",
"[2019-07-11 16:17:11.373] [alevinLog] [info] parsing read library format\n",
"[2019-07-11 16:17:11.488] [stderrLog] [info] Loading Suffix Array \n",
"[2019-07-11 16:17:11.374] [jointLog] [info] There is 1 library.\n",
"[2019-07-11 16:17:11.488] [jointLog] [info] Loading Quasi index\n",
"[2019-07-11 16:17:11.488] [jointLog] [info] Loading 32-bit quasi index\n",
"[2019-07-11 16:17:12.182] [stderrLog] [info] Loading Transcript Info \n",
"[2019-07-11 16:17:12.333] [stderrLog] [info] Loading Rank-Select Bit Array\n",
"[2019-07-11 16:17:12.368] [stderrLog] [info] There were 88,775 set bits in the bit array\n",
"[2019-07-11 16:17:12.393] [stderrLog] [info] Computing transcript lengths\n",
"[2019-07-11 16:17:12.393] [stderrLog] [info] Waiting to finish loading hash\n",
"[2019-07-11 16:17:15.356] [stderrLog] [info] Done loading index\n",
"[2019-07-11 16:17:15.356] [jointLog] [info] done\n",
"[2019-07-11 16:17:15.356] [jointLog] [info] Index contained 88,775 targets\n",
"\n",
"\n",
"\n",
"\n",
"\u001b[32mprocessed\u001b[31m 0 Million \u001b[32mfragments\u001b[0m\n",
"\u001b[32mprocessed\u001b[31m 38 Million \u001b[32mfragments\u001b[0m\n",
"hits: 67274069, hits per frag: 1.74774\n",
"\n",
"\n",
"\n",
"[2019-07-11 16:22:50.153] [jointLog] [info] Computed 59,776 rich equivalence classes for further processing\n",
"[2019-07-11 16:22:50.154] [jointLog] [info] Counted 16,819,018 total reads in the equivalence classes \n",
"[2019-07-11 16:22:50.154] [jointLog] [info] Number of fragments discarded because they are best-mapped to decoys : 0\n",
"[2019-07-11 16:22:50.154] [jointLog] [warning] Found 1673 reads with `N` in the UMI sequence and ignored the reads.\n",
"Please report on github if this number is too large\n",
"[2019-07-11 16:22:50.154] [jointLog] [info] Mapping rate = 43.5547%\n",
"\n",
"[2019-07-11 16:22:50.154] [jointLog] [info] finished quantifyLibrary()\n",
"[2019-07-11 16:22:50.405] [alevinLog] [info] Starting optimizer\n",
"\n",
"\n",
"\u001b[32mAnalyzed 7 cells (\u001b[31m0%\u001b[32m of all).\u001b[0m\n",
"\u001b[32mAnalyzed 1520 cells (\u001b[31m100%\u001b[32m of all).\u001b[0m\n",
"\u001b[32mAnalyzed 1520 cells (\u001b[31m100%\u001b[32m of all).\u001b[0m\n",
"[2019-07-11 16:22:57.832] [alevinLog] [info] Total 2733613.00 UMI after deduplicating.\n",
"[2019-07-11 16:22:57.832] [alevinLog] [info] Total 2706643 BiDirected Edges.\n",
"[2019-07-11 16:22:57.832] [alevinLog] [info] Total 318391 UniDirected Edges.\n",
"[2019-07-11 16:22:57.832] [alevinLog] [warning] Skipped 53 barcodes due to No mapped read\n",
"[2019-07-11 16:22:57.836] [alevinLog] [info] Clearing EqMap; Might take some time.\n",
"[2019-07-11 16:22:58.386] [alevinLog] [info] Starting white listing of 1467 cells\n",
"[2019-07-11 16:22:58.386] [alevinLog] [info] Starting to make feature Matrix\n",
"[2019-07-11 16:22:58.396] [alevinLog] [info] Done making feature Matrix\n",
"[2019-07-11 16:22:58.561] [alevinLog] [info] Finished white listing\n",
"[2019-07-11 16:22:58.586] [alevinLog] [info] Finished optimizer\n"
]
}
],
"source": [
"%%bash\n",
"# Running Alevin\n",
"salmon-latest_linux_x86_64/bin/salmon alevin -lISR -1 fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz -2 fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz --chromium -i index -p 8 -o alevin_output --tgMap txp2gene.tsv"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment