Skip to content

Instantly share code, notes, and snippets.

@hussius
Created November 17, 2016 09:39
Show Gist options
  • Save hussius/5bd8f044eb4fda04bd83b5c3dbff5020 to your computer and use it in GitHub Desktop.
Save hussius/5bd8f044eb4fda04bd83b5c3dbff5020 to your computer and use it in GitHub Desktop.
Sum transcript TPMs by gene using the FASTA file used for a Kallisto index
import sys
import gzip
if len(sys.argv)<3:
sys.exit("python sum_per_gene.py <cDNA FASTA file> <TPM table>")
ensg = {}
mapf = gzip.open(sys.argv[1])
ctr = 0
for line in mapf:
if not line.startswith('>'): continue
ctr += 1
if ctr % 1000 == 0: sys.stderr.write(ctr)
c = line.split(' ')
tx = c[0][1:]
gene = c[3].split(':')[1].split('.')[0]
ensg[tx] = gene
gene_tpm = {}
tpm = open(sys.argv[2])
print(tpm.readline().strip()) # Header
for line in tpm:
c = line.split()
tx = c[0]
vals = []
for val in c[1:len(c)]:
vals.append(float(val))
gene = ensg[tx]
if gene in gene_tpm:
prev = gene_tpm[gene]
curr = vals
new = [prev[i]+curr[i] for i in range(0,len(prev))]
gene_tpm[gene] = new
else:
gene_tpm[gene] = vals
for g in gene_tpm:
string_rep = [x for x in map(str, gene_tpm[g])]
print(g + '\t' + '\t'.join(string_rep))
@hussius
Copy link
Author

hussius commented Nov 17, 2016

Intended to collapse transcript TPMs by gene name based on the original FASTA file that Kallisto used for quantification

python sum_per_gene.py gzipped FASTA file TPM table

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment