@sandersyen
Last active March 11, 2019 03:27
source code (CS 839 Spring 2019, Project Stage 1, team23)
able
bad
best
better
big
black
certain
clear
different
early
easy
economic
federal
free
full
good
great
hard
high
human
important
international
large
late
little
local
long
low
major
military
national
new
old
only
other
political
possible
public
real
recent
right
small
social
special
strong
sure
true
white
whole
young
other
new
good
high
old
great
big
American
small
large
national
different
black
long
little
important
political
bad
white
real
best
right
social
only
public
sure
low
early
able
human
local
late
hard
major
better
economic
strong
possible
whole
free
military
true
federal
international
full
special
easy
clear
recent
certain
personal
open
red
difficult
available
likely
short
single
medical
current
wrong
private
past
foreign
fine
common
poor
natural
significant
similar
hot
dead
central
happy
serious
ready
simple
left
physical
general
environmental
financial
blue
democratic
dark
various
entire
close
legal
religious
cold
final
main
green
nice
huge
popular
traditional
cultural
Oliver
Jake
Noah
James
Jack
Connor
Liam
John
Harry
Callum
Mason
Robert
Jacob
Michael
Charlie
Kyle
William
Williams
Thomas
Shawn
Joe
Ethan
David
George
Reece
Michael
Richard
Oscar
Rhys
Alexander
Joseph
James
Charlie
James
Charles
Damian
Daniel
Thomas
Amelia
Margaret
Emma
Mary
Olivia
Samantha
Patricia
Isla
Bethany
Sophia
Jennifer
Emily
Elizabeth
Isabella
Elizabeth
Poppy
Joanne
Ava
Linda
Megan
Mia
Barbara
Isabella
Victoria
Susan
Jessica
Lauren
Abigail
Margaret
Lily
Michelle
Madison
Jessica
Sophie
Cooper
Tracy
Charlotte
Sarah
Murphy
Li
Smith
Jones
O'Kelly
Johnson
Jones
Wilson
O'Sullivan
Lam
Brown
Walsh
Martin
Taylor
Jones
Gelbero
Wilson
Taylor
Davies
O'Brien
Miller
Roy
Taylor
Byrne
Davis
Tremblay
Morton
Singh
Evans
O'Ryan
Garcia
Lee
White
Wang
Thomas
O'Connor
Rodriguez
Gagnon
Martin
Anderson
Roberts
O'Neill
Anderson
Clark
Wright
Mitchell
Johnson
Rodriguez
Lopez
Perez
Jackson
Lewis
Hill
Roberts
Jones
White
Scott
Turner
Brown
Harris
Walker
Green
Phillips
Hall
Adams
Campbell
Miller
Allen
Baker
Parker
Garcia
Young
Gonzalez
Evans
Moore
Martinez
Hernandez
Nelson
Edwards
Taylor
Robinson
Carter
Collins
George
Ronald
John
Richard
Kenneth
Anthony
Charles
Paul
Steven
Michael
Joseph
Mark
Thomas
Donald
Brian
Jeff
Mary
Jennifer
Lisa
Sandra
Michelle
Patricia
Maria
Nancy
Donna
Laura
Linda
Susan
Karen
Carol
Sarah
Barbara
Margaret
Betty
Ruth
Kimberly
Elizabeth
Dorothy
Helen
Sharon
Deborah
Sanders
Joy
Sean
Walton
Reznor
Antonio
Trump
Julia
Blair
Nobel
Johann
Ann
Lindsay
Laura
Sam
Kelly
Bill
Maya
Adriana
Lola
Ingrid
Clare
Emma
Isabella
Abigail
Charlotte
Lillian
Hannah
Samantha
Caroline
Sheeran
Madelyn
Kate
Hayes
Arianna
Maggie
Audrey
Luis
Paolo
Oliver
Emilio
Gustav
Tyler
Taylor
Javier
Kristian
Henrik
Stefan
Etienne
Johnson
Ferdinand
Hector
Catlin
Hugo
Ali
Raymond
Xavier
Harry
Potter
Evan
Elvis
Harrison
Jasper
Hitler
Scott
John
Patricia
Robert
Linda
Richard
Susan
Joseph
Jessica
Thomas
Sarah
Charles
Margaret
Christopher
Daniel
Nancy
Matthew
Lisa
Anthony
Betty
Donald
Dorothy
Paul
Ashley
Andrew
Donna
Kenneth
Carol
Joshua
Amanda
Brian
Melissa
Deborah
Ronald
Stephanie
Timothy
Rebecca
Jeffrey
Helen
Sharon
Gary
Kathleen
Nicholas
Amy
Eric
Shirley
Angela
Larry
Justin
Brenda
Scott
Pamela
Nicole
Frank
Katherine
Benjamin
Samantha
Gregory
Christine
Samuel
Virginia
Rachel
Jack
Janet
Dennis
Jerry
Carolyn
Maria
Aaron
Heather
Jose
Julie
Douglas
Joyce
Peter
Evelyn
Nathan
Victoria
Zachary
Walter
Christina
Kyle
Lauren
Harold
Frances
Carl
Martha
Judith
Gerald
Cheryl
Keith
Megan
Roger
Andrea
Arthur
Olivia
Terry
Ann
Jacqueline
Ethan
Austin
Doris
Kathryn
Albert
Gloria
Jesse
Teresa
Willie
Sara
Billy
Janice
Marie
Bruce
Noah
Jordan
Judy
Dylan
Theresa
Ralph
Madison
Roy
Beverly
Alan
Denise
Wayne
Marilyn
Eugene
Amber
Juan
Danielle
Gabriel
Rose
Louis
Brittany
Russell
Diana
Randy
Abigail
Vincent
Natalie
Philip
Jane
Logan
Lori
Bobby
Alexis
Tiffany
Johnny
Kayla
Boccaccio
Gruber
Huber
Bauer
Wagner
Pichler
Steiner
Moser
Mayer
Hofer
Leitner
Berger
Fuchs
Eder
Fischer
Schmid
Winkler
Weber
Schwarz
Maier
Schneider
Reiter
Mayr
Schmidt
Wimmer
Egger
Brunner
Lang
Baumgartner
Auer
Binder
Lechner
Wolf
Wallner
Aigner
Ebner
Koller
Lehner
Haas
Schuster
Heilig
Peeters
Janssens
Maes
Jacobs
Mertens
Willems
Claes
Goossens
Wouters
Dubois
Lambert
Dupont
Martin
Simon
Nielsen
Jensen
Hansen
Pedersen
Andersen
Christensen
Larsen
Rasmussen
Petersen
Madsen
Kristensen
Olsen
Thomsen
Christiansen
Poulsen
Johansen
Mortensen
Joensen
Hansen
Jacobsen
Olsen
Poulsen
Petersen
Johannesen
Thomsen
Nielsen
Johansen
Rasmussen
Simonsen
Djurhuus
Jensen
Danielsen
Mortensen
Mikkelsen
Dam
Andreasen
Johansson
Nyman
Lindholm
Karlsson
Andersson
Hendriks
or
but
nor
so
for
yet
after
although
as
as if
as long as
because
before
even if
even though
once
since
so that
though
till
unless
until
what
when
whenever
wherever
whether
while
why
if
after
from
by
for
with
but
of
to
and
before
how
which
a
an
the
these
our
i
he
she
they
there
are
is
be
you
able
about
across
all
almost
also
am
among
any
at
been
best
can
cannot
could
dear
did
do
does
either
else
ever
every
get
got
have
has
had
her
hers
him
his
however
in
into
it
its
just
least
let
like
likely
other
rather
me
might
most
must
my
neither
not
nor
often
off
on
only
should
some
then
that
their
then
this
too
us
we
who
whom
would
yet
here
there
bbc
abc
news
maybe
perhaps
man
men
woman
women
Out
yes
no
in
out
Afghanistan
Albania
Algeria
America
Andorra
Angola
Antigua
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Russians
Europeans
Benin
Bhutan
Bissau
Bolivia
Bosnia
Botswana
Brazil
British
Britain
Brunei
Bulgaria
Burkina
Burma
Burundi
Cambodia
Cameroon
Canada
Cape Verde
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo
Costa Rica
country debt
Croatia
Cuba
Cyprus
Czech
Denmark
Djibouti
Dominica
East Timor
Ecuador
Egypt
El Salvador
Emirate
England
Eritrea
Estonia
Ethiopia
Russian
Fiji
Finland
France
Gabon
Gambia
Georgia
French
Germany
Ghana
Great Britain
Europe
European
Britain
Greece
Grenada
Grenadines
Guatemala
Guinea
Guyana
Haiti
Herzegovina
Honduras
Hungary
Iceland
in usa
India
Indian
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Ivory Coast
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Korea
Kosovo
Kuwait
Kyrgyzstan
Laos
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
Macedonia
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall
Mauritania
Mauritius
Mexico
Micronesia
Moldova
Monaco
Mongolia
Montenegro
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands
New Zealand
Nicaragua
Niger
Nigeria
Norway
Oman
Pakistan
Palau
Panama
Papua
Paraguay
Peru
Philippines
Poland
Portugal
Qatar
Romania
Russia
Rwanda
Samoa
San Marino
Sao Tome
Saudi Arabia
scotland
scottish
Senegal
Serbia
Seychelles
Sierra Leone
Singapore
Slovakia
Slovenia
Solomon
Somalia
South Africa
Africa
South Sudan
Spain
Sri Lanka
St. Kitts
St. Lucia
St Kitts
St Lucia
Saint Kitts
Santa Lucia
Sudan
Suriname
Swaziland
Sweden
Switzerland
Syria
Taiwan
Tajikistan
Tanzania
Thailand
Tobago
Togo
Tonga
Trinidad
Tunisia
Turkey
Turkmenistan
Tuvalu
Uganda
Ukraine
United Kingdom
United States
Uruguay
USA
US
UK
Uzbekistan
Vanuatu
Vatican
Venezuela
Vietnam
wales
welsh
Yemen
Zambia
Zimbabwe
Afghan
Albanian
Algerian
American
Andorran
Angolan
Antiguans
Argentinean
Armenian
Australian
Austrian
Azerbaijani
Bahamian
Bahraini
Bangladeshi
Barbadian
Barbudans
Batswana
Belarusian
Belgian
Bourgeoi
Bourgeoisie
Belizean
Beninese
Bhutanese
Bolivian
Beverly Hills
Bosnian
Brazilian
British
Bruneian
Bulgarian
Burkinabe
Burmese
Burundian
Cambodian
Cameroonian
Canadian
Cape Verdean
Central African
Chadian
Chilean
Chinese
Colombian
Comoran
Congolese
Costa Rican
Croatian
Cuban
Cypriot
Czech
Danish
Djibouti
Dominican
Dutch
East Timorese
Ecuadorean
Egyptian
Emirian
Equatorial Guinean
Eritrean
Estonian
Ethiopian
Fijian
Filipino
Finnish
French
Gabonese
Gambian
Georgian
German
Ghanaian
Greek
Grenadian
Guatemalan
Guinea-Bissauan
Guinean
Guyanese
Haitian
Herzegovinian
Honduran
Hungarian
I-Kiribati
Icelander
Indian
Indonesian
Iranian
Iraqi
Irish
Israeli
Italian
Ivorian
Jamaican
Japanese
Jordanian
Kazakhstani
Kenyan
Kittian
Nevisian
Kuwaiti
Kyrgyz
Laotian
Latvian
Lebanese
Liberian
Libyan
Liechtensteiner
Lithuanian
Luxembourger
Macedonian
Malagasy
Malawian
Malaysian
Maldivian
Malian
Maltese
Marshallese
Mauritanian
Mauritian
Mexican
Micronesian
Moldovan
Monacan
Mongolian
Moroccan
Mosotho
Motswana
Mozambican
Namibian
Nauruan
Nepalese
New Zealander
Ni-Vanuatu
Nicaraguan
Nigerian
Nigerien
North Korean
Northern Irish
Norwegian
Omani
Pakistani
Palauan
Panamanian
Papua New Guinean
Paraguayan
Peruvian
Polish
Portuguese
Qatari
Romanian
Russian
Rwandan
Saint Lucian
Salvadoran
Samoan
San Marinese
Sao Tomean
Saudi
Scottish
Senegalese
Serbian
Seychellois
Sierra Leonean
Singaporean
Slovakian
Slovenian
Solomon Islander
Somali
South African
South Korean
Spanish
Sri Lankan
Sudanese
Surinamer
Swazi
Swedish
Swiss
Syrian
Taiwanese
Tajik
Tanzanian
Thai
Togolese
Tongan
Trinidadian
Tobagonian
Tunisian
Turkish
Tuvaluan
Ugandan
Ukrainian
Uruguayan
Uzbekistani
Uzbekistan
Venezuelan
Vietnamese
Welsh
Yemenite
Zambian
Zimbabwean
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday
Beijing
Chicago
Taoyuan
San Antonio
Toronto
New York
English
Pennsylvania
South Carolina
Texas
Wisconsin
St Paul
London
Soho
Brexit
Britain
Manchester
Middle Eastern
Taipei
Vienna
EU
Yemeni
Europe
European
South America
South American
Asia
Asian
Oceania
Oceanian
Africa
African
Antarctica
Pacific
Atlantic
Mediterranean
Scot
Scots
Korean
California
Swedes
Swede
Zurich
Yemenis
Western
Chicago
northeast
southeast
southwest
northwest
northern
western
eastern
southern
States
state
Limburger
Limburgers
Country
Countries
City
Cities
County
Counties
York
Madison
def load_who_file():
    """
    Read file whos.txt and insert its entries into a hash set
    :return: the hash set with all "who" words from file whos.txt
    """
    with open("data/whos.txt", 'r') as f:
        return set(who.strip('\n').lower() for who in f)


def load_common_name_file():
    """
    Read file common_name.txt and insert its entries into a hash set
    :return: the hash set with all common names from file common_name.txt
    """
    with open("data/common_name.txt", 'r') as f:
        return set(common_name.strip('\n').lower() for common_name in f)


def load_common_adj_file():
    """
    Read file common_adj.txt and insert its entries into a hash set
    :return: the hash set with all common adjectives from file common_adj.txt
    """
    with open("data/common_adj.txt", 'r') as f:
        return set(common_adj.strip('\n').lower() for common_adj in f)


def load_country_file():
    """
    Read file countries.txt and insert its entries into a hash set
    :return: the hash set with all country names from file countries.txt
    """
    with open("data/countries.txt", 'r') as f:
        return set(country.strip('\n').lower() for country in f)


def load_conjunction_file():
    """
    Read file conjunctions.txt and insert its entries into a hash set
    :return: the hash set with all conjunctions from file conjunctions.txt
    """
    with open("data/conjunctions.txt", 'r') as f:
        return set(conjunction.strip('\n').lower() for conjunction in f)


def load_prefix_library():
    """
    Generate a hash set with all prefixes
    :return: the hash set with all prefixes
    """
    with open("data/prefix.txt", 'r') as f:
        return set(prefix.strip('\n').lower() for prefix in f)


def load_organ_library():
    """
    Generate a hash set with all organization titles
    :return: the hash set with all organization titles
    """
    with open("data/organization.txt", 'r') as f:
        return set(organ.strip('\n').lower() for organ in f)


def load_month_file():
    """
    Generate a hash set with all months
    :return: the hash set with all months
    """
    with open("data/month.txt", 'r') as f:
        return set(month.strip('\n').lower() for month in f)


def load_verb_file():
    """
    Read files irregular_verbs.txt and regular_verbs.txt and insert their entries into a hash set
    :return: the hash set with all verbs from both files
    """
    with open("data/irregular_verbs.txt", 'r') as f1, open("data/regular_verbs.txt", 'r') as f2:
        return set(f1.read().split(', ')) | set(f2.read().split(', '))


def load_preposition_file():
    """
    Read file preposition.txt and insert its entries into a hash set
    :return: the hash set with all prepositions from the file
    """
    with open("data/preposition.txt", 'r') as f:
        return set(f.read().split(', '))
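All of the loaders above repeat the same read-strip-lowercase pattern. A consolidated sketch (the generic helper name `load_word_set` is hypothetical, not part of the project), demonstrated against a throwaway file standing in for one of the `data/*.txt` lists:

```python
import os
import tempfile

def load_word_set(path):
    # one generic loader for the newline-delimited word files;
    # the context manager ensures the file handle is closed
    with open(path, 'r') as f:
        return {line.strip('\n').lower() for line in f if line.strip()}

# demo with a temporary file in place of e.g. data/countries.txt
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write('France\nGermany\nCosta Rica\n')
tmp.close()
words = load_word_set(tmp.name)
os.unlink(tmp.name)
```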
def contains_country(ngram, country_set):
    """
    Identify if an n-gram contains a country name, return 1 (has feature) or 0 (no such feature)
    :param ngram: an n-gram
    :param country_set: a set containing all countries
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check every 1-, 2-, and 3-word window of the n-gram against the country set
    words = ngram[0].split(' ')
    for word in words:
        if word.lower() in country_set:
            return 1
    if len(words) >= 2:
        for i in range(1, len(words)):
            if (words[i-1] + ' ' + words[i]).lower() in country_set:
                return 1
    if len(words) >= 3:
        for i in range(2, len(words)):
            if (words[i-2] + ' ' + words[i-1] + ' ' + words[i]).lower() in country_set:
                return 1
    return 0


def contains_common_name(ngram, common_name_set):
    """
    Identify if an n-gram contains a common name, return 1 (has feature) or 0 (no such feature)
    :param ngram: an n-gram
    :param common_name_set: a set containing all common names
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check if any word in the current n-gram is a common name
    for word in ngram[0].split(' '):
        if word.lower() in common_name_set:
            return 1
    return 0
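The three explicit window loops in `contains_country` can be collapsed into one sliding-window scan. A self-contained sketch of the same matching logic, under the assumption (taken from the call sites) that an ngram tuple looks like `(text, filename, start, end)`; the name `contains_country_windowed` is this sketch's, not the project's:

```python
def contains_country_windowed(ngram, country_set):
    # test every window of 1 to 3 words against the lowercased country set,
    # equivalent to the three explicit loops in contains_country
    words = ngram[0].split(' ')
    for size in (1, 2, 3):
        for i in range(len(words) - size + 1):
            if ' '.join(words[i:i + size]).lower() in country_set:
                return 1
    return 0

countries = {'china', 'costa rica', 'central african republic'}
hit = contains_country_windowed(('visited Costa Rica today', '1.txt', 0, 3), countries)
miss = contains_country_windowed(('visited the museum today', '1.txt', 4, 7), countries)
```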
def contains_common_adj(ngram, common_adj_set):
    """
    Identify if an n-gram contains a common adjective, return 1 (has feature) or 0 (no such feature)
    :param ngram: an n-gram
    :param common_adj_set: a set containing all common adjectives
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check if any word in the current n-gram is a common adjective
    for word in ngram[0].split(' '):
        if word.lower() in common_adj_set:
            return 1
    return 0


def contains_prefix(ngram, prefix_set):
    """
    Identify if an n-gram contains a name prefix, return 1 (has feature) or 0 (no such feature)
    :param ngram: an n-gram
    :param prefix_set: a set containing all prefixes
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check if any word in the current n-gram is a name prefix
    for word in ngram[0].split(' '):
        if word.lower() in prefix_set:
            return 1
    return 0
def contains_month(ngram, month_set):
    """
    Identify if an n-gram contains a month, return 1 (has feature) or 0 (no such feature)
    :param ngram: an n-gram
    :param month_set: a set containing all months
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check if any word in the current n-gram is a month
    for word in ngram[0].split(' '):
        if word.lower() in month_set:
            return 1
    return 0


def contains_organization(ngram, organ_set):
    """
    Identify if an n-gram contains an organization title, return 1 (has feature) or 0 (no such feature)
    :param ngram: an n-gram
    :param organ_set: a set containing common organization titles
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check every 1- and 2-word window of the n-gram against the organization set
    words = ngram[0].split(' ')
    for word in words:
        if word.lower() in organ_set:
            return 1
    if len(words) >= 2:
        for i in range(1, len(words)):
            if (words[i-1] + ' ' + words[i]).lower() in organ_set:
                return 1
    return 0


def contains_conjunction(ngram, conjunctions_set):
    """
    Identify if an n-gram contains a conjunction, return 1 (has feature) or 0 (no such feature)
    :param ngram: an n-gram
    :param conjunctions_set: a set containing all conjunctions
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check if any word in the current n-gram is a conjunction
    for word in ngram[0].split(' '):
        if word.lower() in conjunctions_set:
            return 1
    return 0


def contains_verb(ngram, verb_set):
    """
    Identify if an n-gram contains a verb, return 1 (has feature) or 0 (no such feature)
    :param ngram: an n-gram
    :param verb_set: a set containing all verbs
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check if any word in the current n-gram is a verb
    for word in ngram[0].split(' '):
        if word.lower() in verb_set:
            return 1
    return 0
def is_all_upper(ngram):
    """
    Check whether every word in the n-gram starts with an upper-case letter
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if len(word) > 0 and word[0].islower():
            return 0
    return 1


def has_who(ngram, who_set):
    """
    Check whether the n-gram contains a "who" word
    :param ngram: an n-gram
    :param who_set: a set containing the "who" words
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in who_set:
            return 1
    return 0


def no_more_than_one_lower(ngram):
    """
    Check whether the n-gram contains at most one all-lower-case word
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    count = 0
    for word in ngram[0].split(' '):
        if word.islower():
            count += 1
        if count > 1:
            return 0
    return 1
def has_prefix_before_ngram(ngram, single_grams, prefix_set):
    """
    Check if the word in front of the input n-gram is a prefix for a name
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :param prefix_set: a set containing all prefixes
    :return: 1 (has feature) or 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams[ngram[2] - 1][0].lower()
        if preWord in prefix_set:
            return 1
    return 0


def has_human_verb(ngram, single_grams, verb_set):
    """
    Check if the word after the input n-gram is a verb usually used for humans
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :param verb_set: a set containing all verbs usually used for humans
    :return: 1 (has feature) or 0 (no such feature)
    """
    ngram_end_index = ngram[3]
    if (ngram_end_index + 1) < len(single_grams):
        if single_grams[ngram_end_index + 1][0] in verb_set:
            return 1
    return 0


def features_label_separator(ngrams, labels_set=None):
    """
    Separate features and labels from n-grams and return two lists
    :param ngrams: all n-grams from all articles
    :param labels_set: the hash set of all labels
    :return: two lists -- features and labels from the n-grams
    """
    features = [ngram[4:] for ngram in ngrams]
    label = [1 if ngram[0] in labels_set else 0 for ngram in ngrams] if labels_set else []
    return features, label
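`features_label_separator` relies on the convention that positions 0-3 of each tuple are metadata and everything from index 4 onward is a feature. A small worked example under that assumption (the function body is reproduced so the example runs standalone; the sample tuples are invented for illustration):

```python
def features_label_separator(ngrams, labels_set=None):
    # indices 0-3 are (text, filename, start, end); the rest are features
    features = [ngram[4:] for ngram in ngrams]
    label = [1 if ngram[0] in labels_set else 0 for ngram in ngrams] if labels_set else []
    return features, label

ngrams = [('John Smith', '1.txt', 0, 1, 1, 0),   # two feature columns after the metadata
          ('the city', '1.txt', 2, 3, 0, 0)]
feats, labels = features_label_separator(ngrams, labels_set={'John Smith'})
```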
def afterpreposition(ngram, single_grams, preposition_set):
    """
    Check if the word in front of the input n-gram is a preposition
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :param preposition_set: a set containing all prepositions
    :return: 1 (has feature) or 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        prepos = single_grams[ngram[2] - 1][0].lower()
        if prepos in preposition_set:
            return 1
    return 0


def before_who(ngram, single_grams, who_set):
    """
    Check if the word after the input n-gram is "who"
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :param who_set: a set containing the "who" words
    :return: 1 (has feature) or 0 (no such feature)
    """
    # ngram[3] is the n-gram's end index, so ngram[3] + 1 is the word right after it
    return 1 if (ngram[3] + 1) < len(single_grams) and single_grams[ngram[3] + 1][0].lower() in who_set else 0


def has_duplicate(ngram):
    """
    Check if the input n-gram has any duplicate words
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    words = set()
    for word in ngram[0].split(' '):
        if word in words:
            return 1
        words.add(word)
    return 0
def count_occurrences(ngram, single_grams):
    """
    Count the n-gram's occurrences in the article (only meaningful for single words)
    :param ngram: an n-gram
    :param single_grams: the full article text joined into a single string
    :return: the number of occurrences
    """
    # note: str.count counts substring matches, not whole words
    return single_grams.count(ngram[0])


def start_end_dash(ngram):
    """
    Check if the n-gram starts or ends with a non-alphabetic token, or contains more than one dash
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    words = ngram[0].split(' ')
    if not words[0].isalpha() or (len(words) > 1 and not words[-1].isalpha()) or words.count('-') > 1:
        return 1
    count = 0
    for word in words:
        count += word.count('-')
        if count > 1:
            return 1
    return 0
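Because `count_occurrences` uses `str.count`, a candidate like `art` is also counted inside `party`. If whole-word counts were wanted instead, a regex with word boundaries is one alternative (a sketch only, not what the project ships; the helper name is hypothetical):

```python
import re

def count_word_occurrences(article_text, phrase):
    # \b anchors restrict matches to whole words, so "art"
    # no longer matches inside "party"
    return len(re.findall(r'\b' + re.escape(phrase) + r'\b', article_text))

raw = 'the art of party art'.count('art')                       # substring count: 3
exact = count_word_occurrences('the art of party art', 'art')   # whole words only
```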
def has_one_dash(ngram):
    """
    Check if the n-gram contains exactly one dash
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    words = ngram[0].split(' ')
    if words.count('-') == 1:
        return 1
    count = 0
    for word in words:
        count += word.count('-')
    if count == 1:
        return 1
    return 0


def all_upper_character(ngram):
    """
    Check whether any word in the n-gram is entirely upper case
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.isupper():
            return 1
    return 0


def word_length(ngram):
    """
    Count the number of words in the n-gram
    :param ngram: an n-gram
    :return: the number of words
    """
    words = ngram[0].split(' ')
    return len(words)
def has_fullstop_before_ngram(ngram, single_grams2):
    """
    Check if the word in front of the input n-gram ends with a full stop
    :param ngram: an n-gram
    :param single_grams2: all words, including punctuation, in an article, in order
    :return: 1 (has feature) or 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams2[ngram[2] - 1][0].lower()
        if preWord.endswith("."):
            return 1
    return 0


def has_comma_before_ngram(ngram, single_grams2):
    """
    Check if the word in front of the input n-gram ends with a comma
    :param ngram: an n-gram
    :param single_grams2: all words, including punctuation, in an article, in order
    :return: 1 (has feature) or 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams2[ngram[2] - 1][0].lower()
        if preWord.endswith(","):
            return 1
    return 0


def has_comma(ngram, single_grams2):
    """
    Check if the last word of the n-gram ends with a comma
    :param ngram: an n-gram
    :param single_grams2: all words, including punctuation, in an article, in order
    :return: 1 (has feature) or 0 (no such feature)
    """
    lastWord = single_grams2[ngram[3]][0]
    if lastWord.endswith(","):
        return 1
    return 0


# shared by is_name_suffix and start_with_suffix
NAME_SUFFIXES = ['Sr', 'Sr.', 'Jr', 'Jr.', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X',
                 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX',
                 'sr', 'sr.', 'jr', 'jr.', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x',
                 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx']


def is_name_suffix(ngram):
    """
    Check if the n-gram contains a name suffix
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word in NAME_SUFFIXES:
            return 1
    return 0


def start_with_suffix(ngram):
    """
    Check if the n-gram starts with a name suffix
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    words = ngram[0].split(' ')
    if words[0] in NAME_SUFFIXES:
        return 1
    return 0
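The suffix checks carry the same literal list in both upper and lower case; one lowercase set plus `word.lower()` covers both. A minimal sketch of that variant (the `_ci` name is this sketch's, not the project's; Roman numerals above `X` are elided here for brevity):

```python
NAME_SUFFIXES_CI = {'sr', 'sr.', 'jr', 'jr.', 'iii', 'iv', 'v',
                    'vi', 'vii', 'viii', 'ix', 'x'}

def is_name_suffix_ci(ngram):
    # case-insensitive membership test replaces the duplicated
    # upper-/lower-case lists
    return 1 if any(w.lower() in NAME_SUFFIXES_CI for w in ngram[0].split(' ')) else 0

king = is_name_suffix_ci(('Martin Luther King Jr.', '1.txt', 0, 3))
plain = is_name_suffix_ci(('Martin Luther King', '1.txt', 0, 2))
```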
is, am, are, was, were, has been, have been, beat, beaten, become, became, begin, began, begun, bend, bent, bet, bid, bite, bit, bitten, blow, blew, blown, break, broke, broken, bring, brought, build, built, burn, burned, burnt, buy, bought, catch, caught, choose, chose, chosen, come, came, cost, cut, dig, dug, dive, dove, dived, do, did, done, draw, drew, drawn, dream, dreamed, dreamt, drive, drove, driven, drink, drank, drunk, eat, ate, eaten, fall, fell, fallen, feel, felt, fight, fought, find, found, fly, flew, flown, forget, forgot, forgotten, forgive, forgave, forgiven, freeze, froze, frozen, get, got, gotten, give, gave, given, go, went, gone, grow, grew, grown, hang, hung, have, had, hear, heard, hide, hid, hidden, hit, hold, held, hurt, keep, kept, know, knew, known, lay, laid, lead, led, leave, left, lend, lent, let, lie, lain, lose, lost, make, made, mean, meant, meet, met, pay, paid, put, read, ride, rode, ridden, ring, rang, rung, rise, rose, risen, run, ran, say, said, see, saw, seen, sell, sold, send, sent, show, showed, shown, shut, sing, sang, sung, sit, sat, sleep, slept, speak, spoke, spoken, spend, spent, stand, stood, swim, swam, swum, take, took, taken, teach, taught, tear, tore, torn, tell, told, think, thought, throw, threw, thrown, understand, understood, wake, woke, woken, wear, wore, worn, win, won, write, wrote, written
from sklearn.model_selection import cross_val_score, ShuffleSplit
# from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import preprocessing
from ngramGenerator import *
from featureIdentifier import *
from mlModel import *
from postProcessing import *
import pandas as pd
from pandas import DataFrame
def main():
    articles, train_labels_set, test_labels_set = [], set(), set()

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Pre-processing '''
    ''' (1) Load data and split data into train/test sets '''
    ''' (2) Hashset the labels and remove labels on the data '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    # add all files' data into articles
    preprocessing.read_data(articles)
    # split data to train and test sets
    train_set, test_set = preprocessing.data_split(articles)
    train_label_count, test_label_count = 0, 0
    # take off label and add names to labels
    for i in range(len(train_set)):
        train_set[i], train_label_count, train_labels_set = \
            preprocessing.label_extraction_takeoff(paragraphs=train_set[i], count=train_label_count, labels=train_labels_set)
    for i in range(len(test_set)):
        test_set[i], test_label_count, test_labels_set = \
            preprocessing.label_extraction_takeoff(paragraphs=test_set[i], count=test_label_count, labels=test_labels_set)

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' N-gram generation '''
    ''' (1) Generate all n-gram (with first feature whether contains 's) '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    train_ngram_result, test_ngram_result = [], []
    train_single_gram, test_single_gram = [], []
    train_single_gram2, test_single_gram2 = [], []  # save single ones in order for later use
    for i in range(len(train_set)):
        ngrams, singles, singles2 = generate_ngrams(filename=train_set[i][0], content=train_set[i][1], n=5)
        train_ngram_result.append(ngrams)
        train_single_gram.append(singles)
        train_single_gram2.append(singles2)
    for i in range(len(test_set)):
        ngrams, singles, singles2 = generate_ngrams(filename=test_set[i][0], content=test_set[i][1], n=5)
        test_ngram_result.append(ngrams)
        test_single_gram.append(singles)
        test_single_gram2.append(singles2)
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Take out n-gram with only lowercase (only for training data) '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    for index in range(len(train_ngram_result)):
        train_ngram_result[index] = eliminate_all_lower(train_ngram_result[index])

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Create a test ngram result without n-gram has only lowercase '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    test_ngram_result_without_all_lower = test_ngram_result[:]
    for index in range(len(test_ngram_result_without_all_lower)):
        test_ngram_result_without_all_lower[index] = eliminate_all_lower(test_ngram_result_without_all_lower[index])
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Feature creation '''
    ''' (1) 's (added during generation of ngram) '''
    ''' (2) contains country '''
    ''' (3) contains conjunction '''
    ''' (4) all capitalised '''
    ''' (5) prefix before n-gram '''
    ''' (6) verbs for humans '''
    ''' (7) prefix in n-gram '''
    ''' (8) after preposition '''
    ''' (9) contains organization '''
    ''' (10) has no more than 1 word without capitalised starting letter '''
    ''' (11) contains month '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    country_set, conjunction_set, prefix_set, verb_set, preposition_set, organ_set, month_set, who_set, common_name_set, common_adj_set = \
        load_country_file(), load_conjunction_file(), load_prefix_library(), \
        load_verb_file(), load_preposition_file(), load_organ_library(), load_month_file(), load_who_file(), load_common_name_file(), load_common_adj_file()
    for ngram_set_index in range(len(train_ngram_result)):
        article = ' '.join(a[0] for a in train_single_gram[ngram_set_index])
        for ngram_index in range(len(train_ngram_result[ngram_set_index])):
            ngram = train_ngram_result[ngram_set_index][ngram_index]
            train_ngram_result[ngram_set_index][ngram_index] = ngram + \
                (contains_country(ngram=ngram, country_set=country_set),
                 contains_conjunction(ngram=ngram, conjunctions_set=conjunction_set),
                 is_all_upper(ngram=ngram),
                 has_prefix_before_ngram(ngram=ngram, single_grams=train_single_gram[ngram_set_index], prefix_set=prefix_set),
                 has_human_verb(ngram=ngram, single_grams=train_single_gram[ngram_set_index], verb_set=verb_set),
                 contains_prefix(ngram=ngram, prefix_set=prefix_set),
                 afterpreposition(ngram=ngram, single_grams=train_single_gram[ngram_set_index], preposition_set=preposition_set),
                 contains_organization(ngram=ngram, organ_set=organ_set),
                 contains_common_name(ngram=ngram, common_name_set=common_name_set),
                 has_duplicate(ngram=ngram),
                 count_occurrences(ngram=ngram, single_grams=article),
                 no_more_than_one_lower(ngram=ngram),
                 contains_month(ngram=ngram, month_set=month_set),
                 contains_verb(ngram=ngram, verb_set=verb_set),
                 start_end_dash(ngram=ngram),
                 all_upper_character(ngram=ngram),
                 word_length(ngram=ngram),
                 has_fullstop_before_ngram(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 has_comma_before_ngram(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 before_who(ngram=ngram, single_grams=train_single_gram[ngram_set_index], who_set=who_set),
                 has_comma(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 has_who(ngram=ngram, who_set=who_set),
                 is_name_suffix(ngram=ngram),
                 has_one_dash(ngram=ngram),
                 start_with_suffix(ngram=ngram),
                 contains_common_adj(ngram=ngram, common_adj_set=common_adj_set),)
    for ngram_set_index in range(len(test_ngram_result_without_all_lower)):
        article = ' '.join(a[0] for a in test_single_gram[ngram_set_index])
        for ngram_index in range(len(test_ngram_result_without_all_lower[ngram_set_index])):
            ngram = test_ngram_result_without_all_lower[ngram_set_index][ngram_index]
            test_ngram_result_without_all_lower[ngram_set_index][ngram_index] = ngram + \
                (contains_country(ngram=ngram, country_set=country_set),
                 contains_conjunction(ngram=ngram, conjunctions_set=conjunction_set),
                 is_all_upper(ngram=ngram),
                 has_prefix_before_ngram(ngram=ngram, single_grams=test_single_gram[ngram_set_index], prefix_set=prefix_set),
                 has_human_verb(ngram=ngram, single_grams=test_single_gram[ngram_set_index], verb_set=verb_set),
                 contains_prefix(ngram=ngram, prefix_set=prefix_set),
                 afterpreposition(ngram=ngram, single_grams=test_single_gram[ngram_set_index], preposition_set=preposition_set),
                 contains_organization(ngram=ngram, organ_set=organ_set),
                 contains_common_name(ngram=ngram, common_name_set=common_name_set),
                 has_duplicate(ngram=ngram),
                 count_occurrences(ngram=ngram, single_grams=article),
                 no_more_than_one_lower(ngram=ngram),
                 contains_month(ngram=ngram, month_set=month_set),
                 contains_verb(ngram=ngram, verb_set=verb_set),
                 start_end_dash(ngram=ngram),
                 all_upper_character(ngram=ngram),
                 word_length(ngram=ngram),
                 has_fullstop_before_ngram(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 has_comma_before_ngram(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 before_who(ngram=ngram, single_grams=test_single_gram[ngram_set_index], who_set=who_set),
                 has_comma(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 has_who(ngram=ngram, who_set=who_set),
                 is_name_suffix(ngram=ngram),
                 has_one_dash(ngram=ngram),
                 start_with_suffix(ngram=ngram),
                 contains_common_adj(ngram=ngram, common_adj_set=common_adj_set),)
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Train DT, SVM, NB '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    train_ngrams = []
    while len(train_ngram_result):
        train_ngrams.extend(train_ngram_result.pop())
    train_ngrams = sorted(train_ngrams, key=lambda i: (int(i[1]), i[2], i[3] - i[2]), reverse=True)
    new_train, train_label = features_label_separator(ngrams=train_ngrams, labels_set=train_labels_set)
    decision_tree = build_decision_tree(data=new_train, label=train_label)
    support_vector_machine = build_support_vector_machine(data=new_train, label=train_label)
    nb_classifier = build_nb_classifier(data=new_train, label=train_label)
    rf_classifier = build_rf_classifier(data=new_train, label=train_label)
    lr_classifier = build_lr_classifier(data=new_train, label=train_label)

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' merge test ngram result '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    test_ngrams = []
    while len(test_ngram_result_without_all_lower):
        test_ngrams.extend(test_ngram_result_without_all_lower.pop())
    test_ngrams = sorted(test_ngrams, key=lambda i: (int(i[1]), i[2], i[3] - i[2]), reverse=True)
    new_test, test_label = features_label_separator(ngrams=test_ngrams, labels_set=test_labels_set)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
''' use DT, SVM, NB, RF, LR to predict train set '''
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
print("Train Set")
print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
print(" ")
print("Number of Names: ")
print(train_label_count)
decision_tree_predict_train = decision_tree.predict(new_train)
support_vector_machine_predict_train = support_vector_machine.predict(new_train)
nb_classifier_predict_train = nb_classifier.predict(new_train)
rf_classifier_predict_train = rf_classifier.predict(new_train)
lr_classifier_predict_train = lr_classifier.predict(new_train)
print("precision before post processing:")
print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, train_label)])) / sum(lr_classifier_predict_train))
print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, train_label)])) / sum(decision_tree_predict_train))
print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, train_label)])) / sum(support_vector_machine_predict_train))
print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, train_label)])) / sum(nb_classifier_predict_train))
print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, train_label)])) / sum(rf_classifier_predict_train))
print('')
print("recall before post processing:")
print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, train_label)])) / sum(train_label))
print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, train_label)])) / sum(train_label))
print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, train_label)])) / sum(train_label))
print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, train_label)])) / sum(train_label))
print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, train_label)])) / sum(train_label))
print('')
decision_tree_ngrams_train, decision_tree_predict_train, decision_tree_label_train = take_out_overlapped(train_ngrams, decision_tree_predict_train, train_label)
support_vector_machine_ngrams_train, support_vector_machine_predict_train, support_vector_machine_label_train = take_out_overlapped(train_ngrams, support_vector_machine_predict_train, train_label)
nb_classifier_ngrams_train, nb_classifier_predict_train, nb_classifier_label_train = take_out_overlapped(train_ngrams, nb_classifier_predict_train, train_label)
rf_classifier_ngrams_train, rf_classifier_predict_train, rf_classifier_label_train = take_out_overlapped(train_ngrams, rf_classifier_predict_train, train_label)
lr_classifier_ngrams_train, lr_classifier_predict_train, lr_classifier_label_train = take_out_overlapped(train_ngrams, lr_classifier_predict_train, train_label)
decision_tree_predict_train = set_predict_value(ngrams=decision_tree_ngrams_train, predict=decision_tree_predict_train)
support_vector_machine_predict_train = set_predict_value(ngrams=support_vector_machine_ngrams_train, predict=support_vector_machine_predict_train)
nb_classifier_predict_train = set_predict_value(ngrams=nb_classifier_ngrams_train, predict=nb_classifier_predict_train)
rf_classifier_predict_train = set_predict_value(ngrams=rf_classifier_ngrams_train, predict=rf_classifier_predict_train)
lr_classifier_predict_train = set_predict_value(ngrams=lr_classifier_ngrams_train, predict=lr_classifier_predict_train)
print("precision:")
print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, lr_classifier_label_train)])) / sum(lr_classifier_predict_train))
print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, decision_tree_label_train)])) / sum(decision_tree_predict_train))
print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, support_vector_machine_label_train)])) / sum(support_vector_machine_predict_train))
print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, nb_classifier_label_train)])) / sum(nb_classifier_predict_train))
print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, rf_classifier_label_train)])) / sum(rf_classifier_predict_train))
print('')
print("recall:")
print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, lr_classifier_label_train)])) / sum(lr_classifier_label_train))
print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, decision_tree_label_train)])) / sum(decision_tree_label_train))
print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, support_vector_machine_label_train)])) / sum(support_vector_machine_label_train))
print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, nb_classifier_label_train)])) / sum(nb_classifier_label_train))
print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, rf_classifier_label_train)])) / sum(rf_classifier_label_train))
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
''' use DT, SVM, NB, RF, LR to predict test set '''
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
print("Test Set")
print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
print(" ")
print("Number of Names: ")
print(test_label_count)
decision_tree_predict = decision_tree.predict(new_test)
support_vector_machine_predict = support_vector_machine.predict(new_test)
nb_classifier_predict = nb_classifier.predict(new_test)
rf_classifier_predict = rf_classifier.predict(new_test)
lr_classifier_predict = lr_classifier.predict(new_test)
print("precision before post processing:")
print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, test_label)])) / sum(lr_classifier_predict))
print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, test_label)])) / sum(decision_tree_predict))
print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, test_label)])) / sum(support_vector_machine_predict))
print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, test_label)])) / sum(nb_classifier_predict))
print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, test_label)])) / sum(rf_classifier_predict))
print('')
print("recall before post processing:")
print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, test_label)])) / sum(test_label))
print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, test_label)])) / sum(test_label))
print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, test_label)])) / sum(test_label))
print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, test_label)])) / sum(test_label))
print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, test_label)])) / sum(test_label))
print('')
decision_tree_ngrams, decision_tree_predict, decision_tree_label = take_out_overlapped(test_ngrams, decision_tree_predict, test_label)
support_vector_machine_ngrams, support_vector_machine_predict, support_vector_machine_label = take_out_overlapped(test_ngrams, support_vector_machine_predict, test_label)
nb_classifier_ngrams, nb_classifier_predict, nb_classifier_label = take_out_overlapped(test_ngrams, nb_classifier_predict, test_label)
rf_classifier_ngrams, rf_classifier_predict, rf_classifier_label = take_out_overlapped(test_ngrams, rf_classifier_predict, test_label)
lr_classifier_ngrams, lr_classifier_predict, lr_classifier_label = take_out_overlapped(test_ngrams, lr_classifier_predict, test_label)
decision_tree_predict = set_predict_value(ngrams=decision_tree_ngrams, predict=decision_tree_predict)
support_vector_machine_predict = set_predict_value(ngrams=support_vector_machine_ngrams, predict=support_vector_machine_predict)
nb_classifier_predict = set_predict_value(ngrams=nb_classifier_ngrams, predict=nb_classifier_predict)
rf_classifier_predict = set_predict_value(ngrams=rf_classifier_ngrams, predict=rf_classifier_predict)
lr_classifier_predict = set_predict_value(ngrams=lr_classifier_ngrams, predict=lr_classifier_predict)
print("precision:")
print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, lr_classifier_label)])) / sum(lr_classifier_predict))
print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, decision_tree_label)])) / sum(decision_tree_predict))
print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, support_vector_machine_label)])) / sum(support_vector_machine_predict))
print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, nb_classifier_label)])) / sum(nb_classifier_predict))
print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, rf_classifier_label)])) / sum(rf_classifier_predict))
print('')
print("recall:")
print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, lr_classifier_label)])) / sum(lr_classifier_label))
print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, decision_tree_label)])) / sum(decision_tree_label))
print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, support_vector_machine_label)])) / sum(support_vector_machine_label))
print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, nb_classifier_label)])) / sum(nb_classifier_label))
print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, rf_classifier_label)])) / sum(rf_classifier_label))
# print ("==========================================================================")
# print("data frame:")
# df = pd.DataFrame(columns=['words', 'predict', 'label'])
# for i in range(len(rf_classifier_predict)):
# if not (rf_classifier_predict[i] == rf_classifier_label[i]) and rf_classifier_predict[i] == 1:
# df = df.append({'words': rf_classifier_ngrams[i], 'predict': rf_classifier_predict[i], 'label':rf_classifier_label[i]}, ignore_index = True)
# DataFrame.to_csv(df, "rf_classifier_predict.csv", index=False)
# scores = cross_val_score(svm.SVC(), new_train, train_label, cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=0))
# print (scores)
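The repeated inline expressions above all compute the same two quantities from a pair of parallel 0/1 lists. A named helper (a sketch; `precision_recall` is a name introduced here, not part of the original script) makes the formula explicit: true positives divided by predicted positives gives precision, divided by actual positives gives recall.

```python
# Sketch of the metric computed inline above: a pair (a, b) from the two
# parallel 0/1 lists counts as a true positive when a == b == 1.
def precision_recall(predict, label):
    tp = sum(1 for a, b in zip(predict, label) if a == b == 1)
    return float(tp) / sum(predict), float(tp) / sum(label)

# 3 positives predicted, 2 of them correct; both actual positives found
p, r = precision_recall([1, 0, 1, 1], [1, 0, 0, 1])
print(p, r)
```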
if __name__ == "__main__":
main()
from sklearn.linear_model import LogisticRegression
from sklearn import tree, svm
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
def build_decision_tree(data, label):
"""
Build the decision tree based on the data and its corresponding label
:param data: a list of tuples, each holding the features of one sample
:param label: a list of labels for the data
:return: a trained decision tree
"""
dt_tree = tree.DecisionTreeClassifier()
return dt_tree.fit(data, label)
def build_support_vector_machine(data, label):
"""
Build the support vector machine based on the data and its corresponding label
:param data: a list of tuples, each holding the features of one sample
:param label: a list of labels for the data
:return: trained support vector machine
"""
trained_svm = svm.SVC(gamma='scale', C=100)
return trained_svm.fit(data, label)
def build_nb_classifier(data, label):
"""
Build the naive bayes classifier based on the data and its corresponding label
:param data: a list of tuples, each holding the features of one sample
:param label: a list of labels for the data
:return: trained naive bayes classifier
"""
classifier = BernoulliNB()
return classifier.fit(data, label)
def build_rf_classifier(data, label):
"""
Build the random forest classifier based on the data and its corresponding label
:param data: a list of tuples, each holding the features of one sample
:param label: a list of labels for the data
:return: trained random forest classifier
"""
# pipe = make_pipeline(StandardScaler(),RandomForestClassifier())
# param_grid = {'n_estimators': list(range(1, 30))}
# gs = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, \
# iid=False, n_jobs=-1, refit=True,scoring='accuracy',cv=10)
# gs.fit(data, label)
# n_estimators=gs.best_params_['n_estimators']
classifier = RandomForestClassifier(n_estimators=34, n_jobs=-1, criterion='gini', class_weight={0: 1, 1: 1.45}, random_state=10)
return classifier.fit(data, label)
def build_lr_classifier(data, label):
"""
Build the logistic regression classifier based on the data and its corresponding label
:param data: a list of tuples, each holding the features of one sample
:param label: a list of labels for the data
:return: trained logistic regression classifier
"""
classifier = LogisticRegression(solver='newton-cg',n_jobs=-1,class_weight={0: 1, 1: 1.5})
return classifier.fit(data, label)
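A minimal end-to-end sketch of how these builders are used: fit on feature rows plus labels, then call `.predict` on new rows. The toy 0/1 rows below are invented for illustration and are not the project's real feature tuples; only the decision-tree builder is exercised.

```python
from sklearn import tree

def build_decision_tree(data, label):
    # same body as the builder above: fit a default decision tree
    dt_tree = tree.DecisionTreeClassifier()
    return dt_tree.fit(data, label)

# each row is a binary feature vector; label 1 means "this n-gram is a name"
train_rows = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
train_labels = [1, 0, 1, 0]

clf = build_decision_tree(train_rows, train_labels)
pred = clf.predict([[1, 0, 1], [0, 0, 0]])
```

With no depth limit, the tree separates this (linearly separable) toy set perfectly, so predictions on the training rows match their labels.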
January
February
March
April
May
June
July
August
September
October
November
December
import re
def generate_ngrams(filename, content, n):
"""
Generate n-grams (with a feature indicating whether each gram ends in "'s") from the content
:param filename: filename
:param content: the whole article
:param n: the size of the n-gram
:return: generated lists of n-grams, cleaned single grams, and lightly cleaned single grams
"""
sentences = content.split(".")
index, index2 = 0, 0
n_grams, single_grams, single_grams2 = [], [], []
for sentence in sentences:
sections = sentence.split(",")
for section in sections:
parts = section.split(";")
for part in parts:
words = part.split()
single_grams_temp, feature_single_quote_temp = [], []
for i in range(len(words)):
words2 = words[:]
words2[i] = re.sub('[;@#$()\{\}:"]', '', words2[i])
single_grams2.append((words2[i], filename, index2, index2))
index2 += 1
# first clean the data
for i in range(len(words)):
# clean data by removing special characters
words[i] = re.sub('[?;!@#$()\{\}:\,\."]', '', words[i])
# for cases 's, take off 's
if (len(words[i]) >= 2 and words[i][-2] == "'"):
words[i] = words[i][:-2]
feature_single_quote_temp.append(1)
elif (len(words[i]) >= 2 and words[i][-2] == "s" and words[i][-1] == "'"):
words[i] = words[i][:-1]
feature_single_quote_temp.append(1)
else:
feature_single_quote_temp.append(0)
single_grams_temp.append((words[i], filename, index, index))
index += 1
n_grams_temp = [] # the return list
for i in range(len(words)):
temp = words[i]
for j in range(1, n):
if (i + j) < len(words):
temp = temp + ' ' + words[i + j]
temp_with_first_index = (temp, filename, single_grams_temp[i][2], single_grams_temp[i + j][2], feature_single_quote_temp[i + j])
n_grams_temp.append(temp_with_first_index)
# single_grams += n_grams
for i in range(len(single_grams_temp)):
n_grams_temp.append(single_grams_temp[i] + (feature_single_quote_temp[i],))
n_grams.extend(n_grams_temp)
single_grams.extend(single_grams_temp)
return n_grams, single_grams, single_grams2
def eliminate_all_lower(ngrams):
"""
Take out every n-gram that has no capitalised word
:param ngrams: all n-grams
:return: the n-grams in which at least one word is capitalised
"""
new_ngram = []
for ngram in ngrams:
for word in ngram[0].split(' '):
if len(word) > 0 and word[0].isupper():
new_ngram.append(ngram)
break
return new_ngram
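A quick self-contained check of the contract above (the function is reproduced verbatim so the snippet runs on its own): an n-gram survives only if at least one of its words starts with an uppercase letter. The tuples mimic the (text, filename, start, end) shape used elsewhere, with invented values.

```python
def eliminate_all_lower(ngrams):
    # keep an n-gram as soon as one of its words is capitalised
    new_ngram = []
    for ngram in ngrams:
        for word in ngram[0].split(' '):
            if len(word) > 0 and word[0].isupper():
                new_ngram.append(ngram)
                break
    return new_ngram

kept = eliminate_all_lower([
    ("John Smith", "a01", 0, 1),
    ("the guitar", "a01", 2, 3),
    ("met Mary", "a01", 4, 5),
])
print([t[0] for t in kept])
```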
university
college
association
commission
council
laboratory
government
committee
department
school
research
office
affairs
court
corporation
company
agency
organization
group
empire
league
music
hotel
hotels
white house
party
hilton
Walmart
Genk
Brugge
Concert
organisation
prize
rolling stones
White House
Virgin Galactic
Art Brut
amazon
walmart
art
following
club
people
human
guitar
violin
def take_out_overlapped(ngrams, predict, label):
"""
Take out every n-gram that is contained in the span of an already kept n-gram
:param ngrams: all n-grams
:param predict: predicted labels aligned with ngrams
:param label: true labels aligned with ngrams
:return: the remaining n-grams, predictions, and labels
"""
new_ngrams, new_predict, new_label, prev, prev_predict = [], [], [], None, 0
for element_index in range(len(ngrams)):
# keep this n-gram if prev is None, or the filenames differ, or it is not
# fully contained in the span of the previously kept n-gram
if not prev \
or ngrams[element_index][1] != prev[1] \
or ngrams[element_index][2] == 0 \
or prev_predict == 0 \
or not (prev[2] <= ngrams[element_index][2] <= prev[3]) \
or not (prev[2] <= ngrams[element_index][3] <= prev[3]):
prev = ngrams[element_index]
prev_predict = predict[element_index]
new_ngrams.append(ngrams[element_index])
new_predict.append(predict[element_index])
new_label.append(label[element_index])
return new_ngrams, new_predict, new_label
def set_predict_value(ngrams, predict):
"""
Rule-based post-processing: force the prediction to 0 for any n-gram
flagged by a feature that rules out a person name
:param ngrams: all n-grams with their feature tuples
:param predict: predicted labels, updated in place
:return: the updated predictions
"""
for element_index in range(len(ngrams)):
# 19: start_end_dash, 5: contains_country, 10: contains_prefix, 12: contains_organization, 18: contains_verb,\
# 6: contains_conjunction, 29: start_with_suffix, 30: contains_common_adj
if ngrams[element_index][19] == 1 or ngrams[element_index][5] == 1 \
or ngrams[element_index][29] == 1 or ngrams[element_index][6] == 1 \
or ngrams[element_index][10] == 1 or ngrams[element_index][12] == 1 \
or ngrams[element_index][18] == 1 or ngrams[element_index][30] == 1:
predict[element_index] = 0
return predict
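The feature positions tested above can be gathered into one tuple. In this sketch the name `RULE_OUT` and the `any(...)` form are introduced here, but the logic is the same chain of `or`s: a flagged n-gram is forced to 0, an unflagged one keeps its prediction. The feature tuples are stand-ins with invented values.

```python
# 19: start_end_dash, 5: contains_country, 29: start_with_suffix,
# 6: contains_conjunction, 10: contains_prefix, 12: contains_organization,
# 18: contains_verb, 30: contains_common_adj
RULE_OUT = (19, 5, 29, 6, 10, 12, 18, 30)

def set_predict_value(ngrams, predict):
    # zero out any prediction whose n-gram trips a rule-out feature
    for i in range(len(ngrams)):
        if any(ngrams[i][k] == 1 for k in RULE_OUT):
            predict[i] = 0
    return predict

flagged = [0] * 31
flagged[12] = 1        # contains_organization fires, so this cannot be a name
clean = [0] * 31
result = set_predict_value([flagged, clean], [1, 1])
print(result)
```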
adm
atty
baz
brother
capped
chief
cmdr
col
dean
dr
elder
father
gen
gov
hon
maj
msgt
mr
mrs
ms
prince
prof
rabbi
rev
king
queen
professor
maid
madam
princess
duke
duchess
baroness
baron
pope
popess
president
mother
saint
minister
doctor
major
general
marshal
officer
admiral
attorney
commander
colonel
governor
honorable
mister
reverend
actor
actress
writer
performer
journalism
dj
star
producer
engineer
coordinator
administrator
manager
agent
promoter
accompanist
bassist
busker
cellist
composer
drummer
fiddler
flautist
flutist
impressionist
instrumentalist
keyboardist
leader
musician
pianist
player
saxophonist
soloist
timpanist
tuner
virtuoso
guitarist
organist
violinist
trumpeter
trombonist
percussionist
oboist
mandolinist
keytarist
harpsichordist
harpist
clarinetist
bassoonist
bagpiper
accordionist
master
by
winner
nominee
lord
sir
sculptor
uncle
co-star
representative
pilot
cinematographer
named
director
author
lady
maid
junior
stars
farmer
anchorwoman
nephew
newcomer
prodigy
brother
photographer
assistant
journalist
miss
novelist
father
agent
partner
lawyer
reporter
sisters
composer
Major
actor
captain
astronaut
commander
painter
musician
meets
champion
orphan
sheriff
writer
detective
artist
jr
army
attorney
commandant
filmmaker
filmmakers
guardian
ceo
cfo
cto
mayor
st
emperor
senator
administration
senators
representatives
representative
chancellor
dj
secretary
after
from
by
for
with
but
of
to
and
before
import os
from unidecode import unidecode
import re
def read_data(articles):
"""
read the file and append it to the articles
:param articles: a list for all articles
:return: None
"""
def files(path):
"""
find all files in a given path and yield their paths
:param path: the path to the location of files
:return: files' paths
"""
for f in os.listdir(path):
if len(f.split('.')[0]) == 3 and f.split('.')[1] == "txt" and os.path.isfile(os.path.join(path, f)):
yield path + "/" + f, f.split('.')[0]
for file_path, filename in files("data"):
articles.append((filename, unidecode(file(file_path, 'r').read().decode("UTF-8"))))
def data_split(articles):
"""
split data into two data-sets (training and testing)
:param articles: a list of articles
:return: two lists for two data-sets (training and testing)
"""
train_set, test_set = [], []
for i in range(0, len(articles), 3):
train_set.append(articles[i])
train_set.append(articles[i+1])
test_set.append(articles[i+2])
return train_set, test_set
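The loop above assumes `len(articles)` is a multiple of 3 and sends every third article to the test set, giving a 2:1 train/test split. A quick check with dummy article ids (reproducing the function verbatim so the snippet is self-contained):

```python
def data_split(articles):
    # two articles to train, one to test, per group of three
    train_set, test_set = [], []
    for i in range(0, len(articles), 3):
        train_set.append(articles[i])
        train_set.append(articles[i + 1])
        test_set.append(articles[i + 2])
    return train_set, test_set

train, test = data_split(["a", "b", "c", "d", "e", "f"])
print(train, test)
```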
def label_extraction_takeoff(paragraphs, count, labels=None):
"""
Take off the label <person> and </person> and return the paragraph without labels
:param paragraphs: string input data with <person></person> labels
:param count: number of labels in articles
:param labels: a set which contains all label among all input data
:return: new paragraphs without labels, number of labels in articles
"""
LABEL, LABEL_END = "<person>", "</person>"
index, new_paragraph = 0, ""
filename = paragraphs[0]
paragraphs = paragraphs[1]
while index < len(paragraphs):
# find the index of the closest LABEL
found = paragraphs.find(LABEL, index)
# if the label is found
if found != -1:
# find the index (location) of the end of label
found_end = paragraphs.find(LABEL_END, found)
# append label to the return variable new_paragraph
new_paragraph += paragraphs[index:found] + paragraphs[found+len(LABEL):found_end]
# if labels is not None, add the label into it
if labels is not None:
labels.add(re.sub('[?;!@#$(){}\\,\\."]', '', paragraphs[found+len(LABEL):found_end]))
# update the current index
index = found_end + len(LABEL_END)
count += 1
else:
new_paragraph += paragraphs[index:]
break
return (filename, new_paragraph), count, labels
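A self-contained run of the tag-stripping logic above on a two-label sentence (the function is reproduced with the same behavior; the only change is unpacking `filename` in one step):

```python
import re

def label_extraction_takeoff(paragraphs, count, labels=None):
    LABEL, LABEL_END = "<person>", "</person>"
    index, new_paragraph = 0, ""
    filename, text = paragraphs
    while index < len(text):
        found = text.find(LABEL, index)
        if found != -1:
            found_end = text.find(LABEL_END, found)
            # copy the text before the tag plus the tagged name itself
            new_paragraph += text[index:found] + text[found + len(LABEL):found_end]
            if labels is not None:
                labels.add(re.sub('[?;!@#$(){}\\,\\."]', '',
                                  text[found + len(LABEL):found_end]))
            index = found_end + len(LABEL_END)
            count += 1
        else:
            new_paragraph += text[index:]
            break
    return (filename, new_paragraph), count, labels

doc = ("001", "<person>John Smith</person> met <person>Mary</person>.")
(fname, clean), n, names = label_extraction_takeoff(doc, 0, set())
print(clean, n, sorted(names))
```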
accept, add, admire, admit, advise, afford, agree, alert, allow, amuse, analyze , analyze , announce, annoy, answer, apologize, appear, applaud, appreciate, approve, argue, arrange, arrest, arrive, ask, attach, attack, attempt, attend, attract, avoid, back, bake, balance, ban, bang, bare, bat, bathe, battle, beam, beg, behave, belong, bleach, bless, blind, blink, blot, blush, boast, boil, bolt, bomb, book, bore, borrow, bounce, bow, box, brake, branch, breathe, bruise, brush, bubble, bump, burn, bury, buzz, calculate, call, camp, care, carry, carve, cause, challenge, change, charge, chase, cheat, check, cheer, chew, choke, chop, claim, clap, clean, clear, clip, close, coach, coil, collect, color, comb, command, communicate, compare, compete, complain, complete, concentrate, concern, confess, confuse, connect, consider, consist, contain, continue, copy, correct, cough, count, cover, crack, crash, crawl, cross, crush, cry, cure, curl, curve, cycle, dam, damage, dance, dare, decay, deceive, decide, decorate, delay, delight, deliver, depend, describe, desert, deserve, destroy, detect, develop, disagree, disappear, disapprove, disarm, discover, dislike, divide, double, doubt, drag, drain, dream, dress, drip, drop, drown, drum, dry, dust, earn, educate, embarrass, employ, empty, encourage, end, enjoy, enter, entertain, escape, examine, excite, excuse, exercise, exist, expand, expect, explain, explode, extend, face, fade, fail, fancy, fasten, fax, fear, fence, fetch, file, fill, film, fire, fit, fix, flap, flash, float, flood, flow, flower, fold, follow, fool, force, form, found, frame, frighten, fry, gather, gaze, glow, glue, grab, grate, grease, greet, grin, grip, groan, guarantee, guard, guess, guide, hammer, hand, handle, hang, happen, harass, harm, hate, haunt, head, heal, heap, heat, help, hook, hop, hope, hover, hug, hum, hunt, hurry, identify, ignore, imagine, impress, improve, include, increase, influence, inform, inject, injure, instruct, intend, interest, 
interfere, interrupt, introduce, invent, invite, irritate, itch, jail, jam, jog, join, joke, judge, juggle, jump, kick, kill, kiss, kneel, knit, knock, knot, label, land, last, laugh, launch, learn, level, license, lick, lie, lighten, like, list, listen, live, load, lock, long, look, love, man, manage, march, mark, marry, match, mate, matter, measure, meddle, melt, memorize, mend, mess up, milk, mine, miss, mix, moan, moor, mourn, move, muddle, mug, multiply, murder, nail, name, need, nest, nod, note, notice, number, obey, object, observe, obtain, occur, offend, offer, open, order, overflow, owe, own, pack, paddle, paint, park, part, pass, paste, pat, pause, peck, pedal, peel, peep, perform, permit, phone, pick, pinch, pine, place, plan, plant, play, please, plug, point, poke, polish, pop, possess, post, pour, practice , practice , pray, preach, precede, prefer, prepare, present, preserve, press, pretend, prevent, prick, print, produce, program, promise, protect, provide, pull, pump, punch, puncture, punish, push, question, queue, race, radiate, rain, raise, reach, realize, receive, recognize, record, reduce, reflect, refuse, regret, reign, reject, rejoice, relax, release, rely, remain, remember, remind, remove, repair, repeat, replace, reply, report, reproduce, request, rescue, retire, return, rhyme, rinse, risk, rob, rock, roll, rot, rub, ruin, rule, rush, sack, sail, satisfy, save, saw, scare, scatter, scold, scorch, scrape, scratch, scream, screw, scribble, scrub, seal, search, separate, serve, settle, shade, share, shave, shelter, shiver, shock, shop, shrug, sigh, sign, signal, sin, sip, ski, skip, slap, slip, slow, smash, smell, smile, smoke, snatch, sneeze, sniff, snore, snow, soak, soothe, sound, spare, spark, sparkle, spell, spill, spoil, spot, spray, sprout, squash, squeak, squeal, squeeze, stain, stamp, stare, start, stay, steer, step, stir, stitch, stop, store, strap, strengthen, stretch, strip, stroke, stuff, subtract, succeed, suck, suffer, suggest, 
suit, supply, support, suppose, surprise, surround, suspect, suspend, switch, talk, tame, tap, taste, tease, telephone, tempt, terrify, test, thank, thaw, tick, tickle, tie, time, tip, tire, touch, tour, tow, trace, trade, train, transport, trap, travel, treat, tremble, trick, trip, trot, trouble, trust, try, tug, tumble, turn, twist, type, undress, unfasten, unite, unlock, unpack, untidy, use, vanish, visit, wail, wait, walk, wander, want, warm, warn, wash, waste, watch, water, wave, weigh, welcome, whine, whip, whirl, whisper, whistle, wink, wipe, wish, wobble, wonder, work, worry, wrap, wreck, wrestle, wriggle, x-ray, yawn, yell, zip, zoom, accepted, added, admired, admitted, advised, afforded, agreed, alerted, allowed, amused, analyze ed, analyze ed, announced, annoyed, answered, apologized, appeared, applauded, appreciated, approved, argued, arranged, arrested, arrived, asked, attached, attacked, attempted, attended, attracted, avoided, backed, baked, balanced, banned, banged, bared, bated, bathed, battled, beamed, begged, behaved, belonged, bleached, blessed, blinded, blinked, blotted, blushed, boasted, boiled, bolted, bombed, booked, bored, borrowed, bounced, bowed, boxed, braked, branched, breathed, bruised, brushed, bubbled, bumped, burned, buried, buzzed, calculated, called, camped, cared, carried, carved, caused, challenged, changed, charged, chased, cheated, checked, cheered, chewed, choked, chopped, claimed, clapped, cleaned, cleared, clipped, closed, coached, coiled, collected, colored, combed, commanded, communicated, compared, competed, complained, completed, concentrated, concerned, confessed, confused, connected, considered, consisted, contained, continued, copied, corrected, coughed, counted, covered, cracked, crashed, crawled, crossed, crushed, cried, cured, curled, curved, cycled, damed, damaged, danced, dared, decayed, deceived, decided, decorated, delayed, delighted, delivered, depended, described, deserted, deserved, destroyed, detected, 
developed, disagreed, disappeared, disapproved, disarmed, discovered, disliked, divided, doubled, doubted, dragged, drained, dreamed, dressed, dripped, dropped, drowned, drummed, dried, dusted, earned, educated, embarrassed, employed, emptied, encouraged, ended, enjoyed, entered, entertained, escaped, examined, excited, excused, exercised, existed, expanded, expected, explained, exploded, extended, faced, faded, failed, fancied, fastened, faxed, feared, fenced, fetched, filed, filled, filmed, fired, fitted, fixed, flapped, flashed, floated, flooded, flowed, flowered, folded, followed, fooled, forced, formed, founded, framed, frightened, fried, gathered, gazed, glowed, glued, grabbed, grated, greased, greeted, grinned, griped, groaned, guaranteed, guarded, guessed, guided, hammered, handed, handled, hanged, happened, harassed, harmed, hated, haunted, headed, healed, heaped, heated, helped, hooked, hoped, hoped, hovered, hugged, hummed, hunted, hurried, identified, ignored, imagined, impressed, improved, included, increased, influenced, informed, injected, injured, instructed, intended, interested, interfered, interrupted, introduced, invented, invited, irritated, itched, jailed, jammed, jogged, joined, joked, judged, juggled, jumped, kicked, killed, kissed, kneeled, knitted, knocked, knotted, labeled, landed, lasted, laughed, launched, learned, leveled, licensed, licked, lied, lightened, liked, listed, listened, lived, loaded, locked, longed, looked, loved, maned, managed, marched, marked, married, matched, mated, mattered, measured, meddled, melted, memorized, mended, mess upped, milked, mined, missed, mixed, moaned, moored, mourned, moved, muddled, mugged, multiplied, murdered, nailed, named, needed, nested, nodded, noted, noticed, numbered, obeyed, objected, observed, obtained, occurred, offended, offered, opened, ordered, overflowed, owed, owned, packed, paddled, painted, parked, parted, passed, pasted, pated, paused, pecked, pedaled, peeled, peeped, performed, 
permitted, phoned, picked, pinched, pined, placed, planed, planted, played, pleased, plugged, pointed, poked, polished, popped, possessed, posted, poured, practice ed, practice ed, prayed, preached, preceded, preferred, prepared, presented, preserved, pressed, pretended, prevented, pricked, printed, produced, programed, promised, protected, provided, pulled, pumped, punched, punctured, punished, pushed, questioned, queued, raced, radiated, rained, raised, reached, realized, received, recognized, recorded, reduced, reflected, refused, regretted, reigned, rejected, rejoiced, relaxed, released, relied, remained, remembered, reminded, removed, repaired, repeated, replaced, replied, reported, reproduced, requested, rescued, retired, returned, rhymed, rinsed, risked, robed, rocked, rolled, rotted, rubbed, ruined, ruled, rushed, sacked, sailed, satisfied, saved, sawed, scared, scattered, scolded, scorched, scraped, scratched, screamed, screwed, scribbled, scribed, sealed, searched, separated, served, settled, shaded, shared, shaved, sheltered, shivered, shocked, shopped, shrugged, sighed, signed, signaled, sinned, sipped, skied, skipped, slapped, slipped, slowed, smashed, smelled, smiled, smoked, snatched, sneezed, sniffed, snored, snowed, soaked, soothed, sounded, spared, sparked, sparkled, spelled, spilled, spoiled, spotted, sprayed, sprouted, squashed, squeaked, squealed, squeezed, stained, stamped, stared, started, stayed, steered, stepped, stirred, stitched, stoped, stored, strapped, strengthened, stretched, striped, stroked, stuffed, subtracted, succeeded, sucked, suffered, suggested, suited, supplied, supported, supposed, surprised, surrounded, suspected, suspended, switched, talked, tamed, taped, tasted, teased, telephoned, tempted, terrified, tested, thanked, thawed, ticked, tickled, tied, timed, tipped, tired, touched, toured, towed, traced, traded, trained, transported, trapped, traveled, treated, trembled, tricked, tripped, trotted, troubled, trusted, tried, 
tugged, tumbled, turned, twisted, typed, undressed, unfastened, united, unlocked, unpacked, untidied, used, vanished, visited, wailed, waited, walked, wandered, wanted, warmed, warned, washed, wasted, watched, watered, waved, weighed, welcomed, whined, whipped, whirled, whispered, whistled, winked, wiped, wished, wobbled, wondered, worked, worried, wrapped, wrecked, wrestled, wriggled, x-rayed, yawned, yelled, zipped, zoomed, accepts, adds, admires, admits, advises, affords, agrees, alerts, allows, amuses, analyzes, announces, annoys, answers, apologizes, appears, applauds, appreciates, approves, argues, arranges, arrests, arrives, asks, attaches, attacks, attempts, attends, attracts, avoids, backs, bakes, balances, bans, bangs, bares, bats, bathes, battles, beams, begs, behaves, belongs, bleaches, blesses, blinds, blinks, blots, blushes, boasts, boils, bolts, bombs, books, bores, borrows, bounces, bows, boxes, brakes, branches, breathes, bruises, brushes, bubbles, bumps, burns, buries, buzzes, calculates, calls, camps, cares, carries, carves, causes, challenges, changes, charges, chases, cheats, checks, cheers, chews, chokes, chops, claims, claps, cleans, clears, clips, closes, coaches, coils, collects, colors, combs, commands, communicates, compares, competes, complains, completes, concentrates, concerns, confesses, confuses, connects, considers, consists, contains, continues, copies, corrects, coughs, counts, covers, cracks, crashes, crawls, crosses, crushes, cries, cures, curls, curves, cycles, dams, damages, dances, dares, decays, deceives, decides, decorates, delays, delights, delivers, depends, describes, deserts, deserves, destroys, detects, develops, disagrees, disappears, disapproves, disarms, discovers, dislikes, divides, doubles, doubts, drags, drains, dreams, dresses, drips, drops, drowns, drums, dries, dusts, earns, educates, embarrasses, employs, empties, encourages, ends, enjoys, enters, entertains, escapes, examines, excites, excuses, exercises, 
exists, expands, expects, explains, explodes, extends, faces, fades, fails, fancies, fastens, faxes, fears, fences, fetches, files, fills, films, fires, fits, fixes, flaps, flashes, floats, floods, flows, flowers, folds, follows, fools, forces, forms, founds, frames, frightens, fries, gathers, gazes, glows, glues, grabs, grates, greases, greets, grins, grips, groans, guarantees, guards, guesses, guides, hammers, hands, handles, hangs, happens, harasses, harms, hates, haunts, heads, heals, heaps, heats, helps, hooks, hops, hopes, hovers, hugs, hums, hunts, hurries, identifies, ignores, imagines, impresses, improves, includes, increases, influences, informs, injects, injures, instructs, intends, interests, interferes, interrupts, introduces, invents, invites, irritates, itches, jails, jams, jogs, joins, jokes, judges, juggles, jumps, kicks, kills, kisses, kneels, knits, knocks, knots, labels, lands, lasts, laughs, launches, learns, levels, licenses, licks, lies, lightens, likes, lists, listens, lives, loads, locks, longs, looks, loves, mans, manages, marches, marks, marries, matches, mates, matters, measures, meddles, melts, memorizes, mends, messes up, milks, mines, misses, mixes, moans, moors, mourns, moves, muddles, mugs, multiplies, murders, nails, names, needs, nests, nods, notes, notices, numbers, obeys, objects, observes, obtains, occurs, offends, offers, opens, orders, overflows, owes, owns, packs, paddles, paints, parks, parts, passes, pastes, pats, pauses, pecks, pedals, peels, peeps, performs, permits, phones, picks, pinches, pines, places, plans, plants, plays, pleases, plugs, points, pokes, polishes, pops, possesses, posts, pours, practices, prays, preaches, precedes, prefers, prepares, presents, preserves, presses, pretends, prevents, pricks, prints, produces, programs, promises, protects, provides, pulls, pumps, punches, punctures, punishes, pushes, questions, queues, races, radiates, rains, raises, reaches, realizes, receives, recognizes, 
records, reduces, reflects, refuses, regrets, reigns, rejects, rejoices, relaxes, releases, relies, remains, remembers, reminds, removes, repairs, repeats, replaces, replies, reports, reproduces, requests, rescues, retires, returns, rhymes, rinses, risks, robs, rocks, rolls, rots, rubs, ruins, rules, rushes, sacks, sails, satisfies, saves, saws, scares, scatters, scolds, scorches, scrapes, scratches, screams, screws, scribbles, scrubs, seals, searches, separates, serves, settles, shades, shares, shaves, shelters, shivers, shocks, shops, shrugs, sighs, signs, signals, sins, sips, skis, skips, slaps, slips, slows, smashes, smells, smiles, smokes, snatches, sneezes, sniffs, snores, snows, soaks, soothes, sounds, spares, sparks, sparkles, spells, spills, spoils, spots, sprays, sprouts, squashes, squeaks, squeals, squeezes, stains, stamps, stares, starts, stays, steers, steps, stirs, stitches, stops, stores, straps, strengthens, stretches, strips, strokes, stuffs, subtracts, succeeds, sucks, suffers, suggests, suits, supplies, supports, supposes, surprises, surrounds, suspects, suspends, switches, talks, tames, taps, tastes, teases, telephones, tempts, terrifies, tests, thanks, thaws, ticks, tickles, ties, times, tips, tires, touches, tours, tows, traces, trades, trains, transports, traps, travels, treats, trembles, tricks, trips, trots, troubles, trusts, tries, tugs, tumbles, turns, twists, types, undresses, unfastens, unites, unlocks, unpacks, untidies, uses, vanishes, visits, wails, waits, walks, wanders, wants, warms, warns, washes, wastes, watches, waters, waves, weighs, welcomes, whines, whips, whirls, whispers, whistles, winks, wipes, wishes, wobbles, wonders, works, worries, wraps, wrecks, wrestles, wriggles, x-rays, yawns, yells, zips, zooms, beats, becomes, begins, bends, bets, bids, blows, breaks, brings, builds, burns, buys, catches, chooses, comes, costs, cuts, digs, dives, does, draws, dreams, drives, drinks, eats, falls, feels, fights, finds, flies, 
forgets, forgives, gets, gives, goes, grows, hangs, hears, hides, hurts, keeps, knows, lays, leads, leaves, lends, lets, loses, makes, means, meets, pays, puts, reads, rides, rings, rises, runs, says, sees, sells, sends, shows, shuts, sings, sits, sleeps, speaks, spends, stands, swims, takes, teaches, tears, tells, thinks, throws, understands, wakes, wears, wins, writes
who
whose
whom