@elyase
Last active December 28, 2015 13:39
Counts motif appearances in a list of DNA sequences
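from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

def tokenizer(s):
    # split a sequence into all overlapping chunks of length `width`
    width = 7
    return [s[i:i+width] for i in range(len(s)-width+1)]

def count_chunks(sequence_list):
    # note: CountVectorizer lowercases its input by default
    vectorizer = CountVectorizer(tokenizer=tokenizer)
    X = vectorizer.fit_transform(sequence_list)
    # count each chunk once per sequence (presence), not total occurrences
    counts = (X.toarray()>0).astype(int).sum(axis=0)
    # get_feature_names() was removed in scikit-learn 1.2
    return vectorizer.get_feature_names_out(), counts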
#import data
data = np.genfromtxt('data.txt', dtype=(str))
down = data[:,1].astype(float) < -0.5
down_list = data[:,2][down] # down_list.size == 5534
not_down_list = data[:,2][~down] # not_down_list.size == 6312
#calculate counts
down_names, down_counts = count_chunks(down_list)
not_down_names, not_down_counts = count_chunks(not_down_list)
# to get the negative counts (sequences without the motif) just subtract, for example
no_down_counts = down_list.size - down_counts
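One thing to keep in mind when comparing the two groups: `down_names` and `not_down_names` come from two separately fitted vectorizers, so the same index does not necessarily refer to the same motif. A minimal sketch of aligning them by name, using a hypothetical helper `align_counts` and made-up toy values (neither is part of the gist):

```python
def align_counts(names_a, counts_a, names_b, counts_b):
    """Put two (names, counts) results on one shared, sorted motif list.

    Motifs missing from one side get a count of 0 on that side.
    """
    a = dict(zip(names_a, counts_a))
    b = dict(zip(names_b, counts_b))
    motifs = sorted(set(a) | set(b))
    return motifs, [a.get(m, 0) for m in motifs], [b.get(m, 0) for m in motifs]

# toy values, invented for illustration only
motifs, a_counts, b_counts = align_counts(
    ["aaaaaaa", "cgtacgt"], [3, 1],
    ["aaaaaaa", "ttttttt"], [2, 5],
)
print(motifs, a_counts, b_counts)
```

With the real data, the arguments would be `down_names, down_counts, not_down_names, not_down_counts`.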
@kitaekim077

Thank you for your help, @elyase. However, I'm not counting the number of motifs within a sequence; what I count is the presence or absence of each motif across the list of sequences.

@elyase

elyase commented Nov 18, 2013

I am counting the same as you.
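To make that concrete: `(X.toarray()>0)` turns each per-sequence count into a 0/1 flag before summing, so a motif repeated inside one sequence still contributes only 1. Here is a rough standard-library sketch of the same idea, with invented toy sequences and a hypothetical helper name `count_presence` (not part of the gist, and without the lowercasing that `CountVectorizer` applies by default):

```python
from collections import Counter

def kmers(s, width=7):
    """Return every overlapping substring of length `width` in s."""
    return [s[i:i + width] for i in range(len(s) - width + 1)]

def count_presence(sequence_list, width=7):
    """Count, for each k-mer, how many sequences contain it at least once."""
    presence = Counter()
    for seq in sequence_list:
        # set() collapses repeats within one sequence to a single hit
        presence.update(set(kmers(seq, width)))
    return presence

# toy data, invented for illustration only
toy = ["AAAAAAAAA", "AAAAAAACG", "CGTACGTAC"]
presence = count_presence(toy)

# "AAAAAAA" occurs 3 times in the first sequence and once in the second,
# but it is present in only 2 sequences
print(presence["AAAAAAA"])             # number of sequences containing it
print(len(toy) - presence["AAAAAAA"])  # number of sequences without it
```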
