@hammer
Created January 11, 2016 22:20
Convert DeepSEA training sequences to a BED file
import h5py
# HDF5 file with two arrays: 'trainxdata' (samples) and 'traindata' (labels)
INFILE_SAMPLES = ''
INFILE_REFERENCE_FASTA = ''
OUTFILE_FASTA = 'deepsea_train10k.fa'
OUTFILE_BED = 'deepsea_train10k.bed'
def onehot2base(onehot):
    if onehot == [1,0,0,0]:
        return 'A'
    elif onehot == [0,1,0,0]:
        return 'G'
    elif onehot == [0,0,1,0]:
        return 'C'
    elif onehot == [0,0,0,1]:
        return 'T'
    elif onehot == [0,0,0,0]:
        return 'N'
    else:
        return 'U'
training_data_file = h5py.File(INFILE_SAMPLES, 'r')
samples_onehot = training_data_file['trainxdata']
# One FASTA record per sample; samples are indexed along the last axis
samples_fasta = '\n'.join(
    ['>seq' + str(i) + '\n' +
     ''.join(map(onehot2base, samples_onehot[:,:,i].tolist()))
     for i in range(samples_onehot.shape[2])]
)
with open(OUTFILE_FASTA, 'w') as f:
    f.write(samples_fasta)
# Call bwa mem hg19.fasta OUTFILE_FASTA
# Use bedtools bamtobed on BWA MEM output
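
The two commented steps above could be driven from Python roughly as follows. This is a minimal sketch, not part of the original gist. It assumes `bwa`, `samtools`, and `bedtools` are on the PATH and that the reference has already been indexed with `bwa index`. The `samtools view -b` step is an added assumption: `bwa mem` emits SAM, while `bedtools bamtobed` expects BAM. The function names `alignment_commands` and `run_alignment` and the BAM path are hypothetical.

```python
import subprocess


def alignment_commands(reference_fasta, reads_fasta, bam_path):
    """Build argv lists for the external steps; nothing is executed here."""
    return {
        # bwa mem writes SAM to stdout
        'align': ['bwa', 'mem', reference_fasta, reads_fasta],
        # samtools reads SAM from stdin ('-') and writes BAM to bam_path
        'to_bam': ['samtools', 'view', '-b', '-o', bam_path, '-'],
        # bedtools writes BED to stdout
        'to_bed': ['bedtools', 'bamtobed', '-i', bam_path],
    }


def run_alignment(reference_fasta, reads_fasta, bam_path, bed_path):
    """Align reads, convert SAM to BAM, then write alignments as BED."""
    cmds = alignment_commands(reference_fasta, reads_fasta, bam_path)
    # Pipe bwa's SAM output straight into samtools
    bwa = subprocess.Popen(cmds['align'], stdout=subprocess.PIPE)
    subprocess.run(cmds['to_bam'], stdin=bwa.stdout, check=True)
    bwa.stdout.close()
    if bwa.wait() != 0:
        raise subprocess.CalledProcessError(bwa.returncode, cmds['align'])
    with open(bed_path, 'w') as bed:
        subprocess.run(cmds['to_bed'], stdout=bed, check=True)
```

With the constants defined in the script, a call might look like `run_alignment('hg19.fasta', OUTFILE_FASTA, 'deepsea_train10k.bam', OUTFILE_BED)`, where the BAM filename is only an illustration.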