@jameslyons
Last active January 4, 2020 04:15
convert pdb file to fasta
import sys
if len(sys.argv) <= 1:
    print('usage: python pdb2fasta.py file.pdb > file.fasta')
    sys.exit()
input_file = open(sys.argv[1])
letters = {'ALA':'A','ARG':'R','ASN':'N','ASP':'D','CYS':'C','GLU':'E','GLN':'Q','GLY':'G','HIS':'H',
           'ILE':'I','LEU':'L','LYS':'K','MET':'M','PHE':'F','PRO':'P','SER':'S','THR':'T','TRP':'W',
           'TYR':'Y','VAL':'V'}
print('>', sys.argv[1])
prev = '-1'
for line in input_file:
    toks = line.split()
    if len(toks) < 1: continue
    if toks[0] != 'ATOM': continue
    if toks[4] != prev:  # toks[4] is the residue number only if the file has no chain ID column
        sys.stdout.write('%c' % letters[toks[3]])
    prev = toks[4]
sys.stdout.write('\n')
input_file.close()
@jingchuansun

Hello, I tried your pdb2fasta.py, but the output file contained only one letter. After I changed toks[4] on lines 19 and 21 to toks[3], the result seems correct.

It's a really great help to me.
Best,
Jiim

@bougui505

Hello, it's the same for me: I changed toks[4] on lines 19 and 21 to toks[3].
Best,
Guillaume

@jlaffy

jlaffy commented Feb 26, 2016

Lines 19 and 21 should be changed to use toks[5] rather than toks[3], because the latter solution skips amino acids that are the same as the previous one in the sequence.
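
A minimal sketch of the point above, assuming whitespace-split ATOM records that include a chain ID column. The two sample lines are made up for illustration, and `count_residues` is a hypothetical helper that mimics the gist's loop, keyed on a chosen token index:

```python
# Two made-up ATOM records for consecutive isoleucines in chain A.
# Only the token layout matters here, not the coordinates.
lines = [
    "ATOM      1  CA  ILE A   1      11.104  13.207   2.100  1.00 20.00           C",
    "ATOM      2  CA  ILE A   2      12.560  14.120   3.400  1.00 20.00           C",
]

def count_residues(lines, index):
    """Count residues the way the gist's loop does, keyed on toks[index]."""
    prev, count = '-1', 0
    for line in lines:
        toks = line.split()
        if toks[index] != prev:
            count += 1
        prev = toks[index]
    return count

print(count_residues(lines, 3))  # keyed on residue name
print(count_residues(lines, 4))  # keyed on chain ID
print(count_residues(lines, 5))  # keyed on residue number
```

With these records, indices 3 (residue name) and 4 (chain ID) see no change between the two isoleucines and count one residue, while index 5 (residue number) counts two. In a file without a chain ID column the residue number sits at index 4 instead, which is presumably why the original script works on some files and not others.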

@dongshuyan

On lines 18 and 20, toks[4] (as in "prev=toks[4]") should be changed to toks[5].

@dongshuyan

Hello, I developed a program based on your script:
https://github.com/dongshuyan/pdb2fasta

@RaviThakkar369

Hi,
This script does not work for sequences with repeated residues.
After changing [5] to [3] the script runs, but gives the wrong output. For example, if the original PDB has the sequence "IIPLEES", the script generates "IPLES": it omits repeated residues.

Download Rosetta and use the script "get_fasta_from_pdb.py" from the folder "rosetta_bin_linux_2019.35.60890_bundle/tools/protein_tools/scripts".
Syntax: python get_fasta_from_pdb.py PDB chainID outputname
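
For what it's worth, here is a sketch of a variant that avoids both problems by reading the fixed columns of the PDB format (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26, insertion code in column 27) instead of splitting on whitespace. The names `pdb_to_sequence` and `make_atom`, and the choice to emit 'X' for residue names outside the 20 standard ones, are assumptions for illustration; they are not part of the original gist or of Rosetta.

```python
THREE_TO_ONE = {
    'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
    'GLU': 'E', 'GLN': 'Q', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
    'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
    'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V',
}

def pdb_to_sequence(lines):
    """Return the one-letter sequence from an iterable of PDB lines."""
    seq = []
    prev = None
    for line in lines:
        if not line.startswith('ATOM  '):
            continue
        resname = line[17:20]
        # Chain ID plus residue number and insertion code identify a residue,
        # so two identical neighbours (e.g. ILE 1, ILE 2) stay distinct.
        key = (line[21:22], line[22:27])
        if key != prev:
            # Assumption: unknown residue names become 'X' instead of raising.
            seq.append(THREE_TO_ONE.get(resname, 'X'))
            prev = key
    return ''.join(seq)

def make_atom(serial, resname, chain, resseq):
    """Hypothetical helper: build the leading columns of an ATOM record."""
    return "ATOM  %5d  CA  %3s %1s%4d    " % (serial, resname, chain, resseq)

names = ['ILE', 'ILE', 'PRO', 'LEU', 'GLU', 'GLU', 'SER']
lines = [make_atom(i, name, 'A', i) for i, name in enumerate(names, start=1)]
print(pdb_to_sequence(lines))  # prints IIPLEES
```

This sketch concatenates all chains into one sequence and does not treat MODEL records specially, so a multi-model file would repeat the sequence once per model.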
