Lisa Andreevna lisanka93

@lisanka93
lisanka93 / training_nb.py
Created August 3, 2020 16:29
training naive bayes classifier and printing accuracy
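from sklearn.metrics import accuracy_score
# naive_bayes is assumed to be an already instantiated scikit-learn naive Bayes classifier
naive_bayes.fit(training_data, y_train)
predictions = naive_bayes.predict(testing_data)
print('accuracy: {}'.format(accuracy_score(y_test, predictions)))
# accuracy: 0.7391304347826086
# using cross-validation: vectorise the whole dataset (count_vectorizer is created in countvectoriser.py)
X_whole = count_vectorizer.fit_transform(covid_data['prep_arg'])
y_whole = covid_data['concern']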
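The preview above stops right after the whole dataset is vectorised. A minimal sketch of how the cross-validation step could continue is given here; it is not the author's code. It assumes scikit-learn's `MultinomialNB` as the classifier (the original never shows which naive Bayes class is used) and replaces `covid_data` with a small made-up corpus. Wrapping the vectoriser and the classifier in one pipeline means the vocabulary is learned again inside every fold instead of once on the whole dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins (assumption) for covid_data['prep_arg'] and covid_data['concern'].
texts = [
    "vaccine developed too fast",
    "trials rushed too fast",
    "development rushed not enough testing",
    "too fast not enough trials",
    "side effects unknown",
    "worried about long term side effects",
    "unknown long term effects",
    "side effects worry me",
]
labels = ["speed", "speed", "speed", "speed",
          "safety", "safety", "safety", "safety"]

# Vectoriser and classifier in one estimator: each fold fits its own vocabulary.
model = make_pipeline(CountVectorizer(binary=True), MultinomialNB())

# cv=2 gives two stratified folds; each score is the accuracy on one held-out fold.
scores = cross_val_score(model, texts, labels, cv=2, scoring="accuracy")
print("accuracy per fold:", scores)
print("mean accuracy: {:.3f}".format(scores.mean()))
```

With the real data, `texts` and `labels` would be `covid_data['prep_arg']` and `covid_data['concern']`, and a larger `cv` such as 5 or 10 is the more common choice.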
@lisanka93
lisanka93 / countvectoriser.py
Created August 3, 2020 16:26
instantiating countvectorizer and learning vocabulary and transforming arguments into vectors
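from sklearn.feature_extraction.text import CountVectorizer
# binary=True records only whether a word occurs, not how often
count_vectorizer = CountVectorizer(binary=True)
# learn the vocabulary from the training data and transform it
training_data = count_vectorizer.fit_transform(X_train)
# transform the test data with the vocabulary learned from the training data
testing_data = count_vectorizer.transform(X_test)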
@lisanka93
lisanka93 / train_test.py
Created August 3, 2020 16:25
splitting data into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
covid_data['prep_arg'],
covid_data['concern'],
test_size=0.2,
random_state=50
)
@lisanka93
lisanka93 / read_in_covid_data.py
Created August 3, 2020 16:24
reading and preprocessing anti covid-19 vaccine arguments
import pandas as pd
covid_data = pd.read_csv('covid_vacc_concerns.csv')
# preprocess is a text-cleaning function that must be defined before this line
covid_data['prep_arg'] = covid_data['arg'].apply(preprocess)
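The `preprocess` function is used above but never shown. The sketch below is a hypothetical stand-in, not the author's implementation: it keeps letters only, lower-cases the text and drops stop words. The tiny `STOP_WORDS` set is an assumption made to keep the example self-contained; NLTK's `stopwords.words('english')` would be the fuller list.

```python
import re

# Tiny illustrative stop word list (assumption); NLTK's English list is much larger.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "this", "to"}

def preprocess(text):
    """Hypothetical cleaning step: letters only, lower case, stop words removed."""
    # Replace every non-letter character with a space.
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lower-case and split on whitespace.
    tokens = letters_only.lower().split()
    # Drop stop words and join the remaining tokens back into one string.
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = preprocess("This is a test. To demonstrate 2 regex expressions!!")
print(cleaned)
```

Returning a single string rather than a token list keeps the result usable as direct input to `CountVectorizer`.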
@lisanka93
lisanka93 / bow.csv
Created August 3, 2020 16:07
Bag of words model explained
vocabulary,apples,are,great,but,so,pears,however,sometimes,I,feel,like,oranges,and,on,other,days,bananas
index,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
Sentence 1,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0
Sentence 2,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,2
Sentence 3,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0
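How such a table can be produced is sketched below in plain Python. The three sentences are made up for illustration and are not the ones behind bow.csv; the helper names `build_vocabulary` and `vectorize` are likewise hypothetical. Each sentence becomes a row of word counts over a shared vocabulary.

```python
from collections import Counter

def build_vocabulary(sentences):
    """Map each distinct lower-cased word to a column index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def vectorize(sentence, vocab):
    """Return the count of every vocabulary word in the sentence."""
    counts = Counter(sentence.lower().split())
    vector = [0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:  # words outside the vocabulary are ignored
            vector[vocab[token]] = count
    return vector

# Made-up example sentences (assumption).
sentences = [
    "apples are great",
    "bananas are great and bananas are cheap",
    "i like oranges",
]
vocab = build_vocabulary(sentences)
vectors = [vectorize(s, vocab) for s in sentences]

print(list(vocab))
for row in vectors:
    print(row)
```

Word order is lost in this representation; only the counts per vocabulary entry remain.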
@lisanka93
lisanka93 / Dummy movie dataset.ipynb
Created July 13, 2020 16:12
notebook dummy movie dataset
@lisanka93
lisanka93 / lem_stem.py
Created July 13, 2020 16:09
lemmatisation and stemming with NLTK
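from nltk.stem import WordNetLemmatizer, PorterStemmer, SnowballStemmer
# the lemmatiser needs the WordNet corpus: nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
word = "considering"
# stemming strips suffixes by rule and may return a non-word
stemmed_word = stemmer.stem(word)
# lemmatize() treats the word as a noun by default; pass pos='v' to lemmatise it as a verb
lemmatised_word = lemmatizer.lemmatize(word)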
@lisanka93
lisanka93 / regex_punct.py
Created July 13, 2020 15:59
regular expressions to remove punctuation
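import re
# letters only: replace every non-letter character with a space
raw_text = "this is a test. To demonstrate 2 regex expressions!!"
letters_only_text = re.sub(r"[^a-zA-Z]", " ", raw_text)
# keep numbers: replace runs of characters that are not letters, digits or whitespace
letnum_text = re.sub(r"[^a-zA-Z0-9\s]+", " ", raw_text)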
@lisanka93
lisanka93 / ngrams.py
Created July 13, 2020 15:53
NLTK ngrams, bigrams and trigrams
from nltk.tokenize import word_tokenize
from nltk.util import ngrams, bigrams, trigrams
sen = "Dummy sentence to demonstrate bigrams"
# using word_tokenize from NLTK rather than split() because split() does not separate punctuation from words
nltk_tokens = word_tokenize(sen)
# splitting sentence into bigrams and trigrams
print(list(bigrams(nltk_tokens)))
print(list(trigrams(nltk_tokens)))
# creating a dictionary that shows occurrences of n-grams in text
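The preview is cut off at the n-gram dictionary. One possible way to build such a dictionary is sketched here with the standard library only; this is an illustration, not the author's code. The helper `count_ngrams` and the example sentence are assumptions, and `split()` stands in for `word_tokenize` so that the block runs without NLTK data.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count how often each n-gram (a tuple of n consecutive tokens) occurs."""
    # zip over n shifted copies of the token list yields every run of n tokens.
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Made-up sentence (assumption); split() is used instead of word_tokenize for brevity.
tokens = "the cat sat on the mat and the cat slept".split()

bigram_counts = count_ngrams(tokens, 2)
trigram_counts = count_ngrams(tokens, 3)

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(3))
```

A `Counter` is a dictionary subclass, so `bigram_counts[("the", "cat")]` looks up a count directly and returns 0 for n-grams that never occur.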
@lisanka93
lisanka93 / stopword_removal.py
Created July 13, 2020 15:37
removing stopwords with NLTK
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
example_sent = "This is a sample sentence, showing off the stop words filtration."
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
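# NLTK's stop word list is lower case, so compare in lower case to catch capitalised words such as "This"
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]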
print(filtered_sentence)