Last active March 11, 2019 03:27
source code (CS 839 Spring 2019, Project Stage 1, team23)
common_adj.txt
able | |
bad | |
best | |
better | |
big | |
black | |
certain | |
clear | |
different | |
early | |
easy | |
economic | |
federal | |
free | |
full | |
good | |
great | |
hard | |
high | |
human | |
important | |
international | |
large | |
late | |
little | |
local | |
long | |
low | |
major | |
military | |
national | |
new | |
old | |
only | |
other | |
political | |
possible | |
public | |
real | |
recent | |
right | |
small | |
social | |
special | |
strong | |
sure | |
true | |
white | |
whole | |
young | |
other | |
new | |
good | |
high | |
old | |
great | |
big | |
American | |
small | |
large | |
national | |
different | |
black | |
long | |
little | |
important | |
political | |
bad | |
white | |
real | |
best | |
right | |
social | |
only | |
public | |
sure | |
low | |
early | |
able | |
human | |
local | |
late | |
hard | |
major | |
better | |
economic | |
strong | |
possible | |
whole | |
free | |
military | |
true | |
federal | |
international | |
full | |
special | |
easy | |
clear | |
recent | |
certain | |
personal | |
open | |
red | |
difficult | |
available | |
likely | |
short | |
single | |
medical | |
current | |
wrong | |
private | |
past | |
foreign | |
fine | |
common | |
poor | |
natural | |
significant | |
similar | |
hot | |
dead | |
central | |
happy | |
serious | |
ready | |
simple | |
left | |
physical | |
general | |
environmental | |
financial | |
blue | |
democratic | |
dark | |
various | |
entire | |
close | |
legal | |
religious | |
cold | |
final | |
main | |
green | |
nice | |
huge | |
popular | |
traditional | |
cultural |
common_name.txt
Oliver | |
Jake | |
Noah | |
James | |
Jack | |
Connor | |
Liam | |
John | |
Harry | |
Callum | |
Mason | |
Robert | |
Jacob | |
Michael | |
Charlie | |
Kyle | |
William | |
Williams | |
Thomas | |
Shawn | |
Joe | |
Ethan | |
David | |
George | |
Reece | |
Michael | |
Richard | |
Oscar | |
Rhys | |
Alexander | |
Joseph | |
James | |
Charlie | |
James | |
Charles | |
Damian | |
Daniel | |
Thomas | |
Amelia | |
Margaret | |
Emma | |
Mary | |
Olivia | |
Samantha | |
Patricia | |
Isla | |
Bethany | |
Sophia | |
Jennifer | |
Emily | |
Elizabeth | |
Isabella | |
Elizabeth | |
Poppy | |
Joanne | |
Ava | |
Linda | |
Megan | |
Mia | |
Barbara | |
Isabella | |
Victoria | |
Susan | |
Jessica | |
Lauren | |
Abigail | |
Margaret | |
Lily | |
Michelle | |
Madison | |
Jessica | |
Sophie | |
Cooper | |
Tracy | |
Charlotte | |
Sarah | |
Murphy | |
Li | |
Smith | |
Jones | |
O'Kelly | |
Johnson | |
Jones | |
Wilson | |
O'Sullivan | |
Lam | |
Brown | |
Walsh | |
Martin | |
Taylor | |
Jones | |
Gelbero | |
Wilson | |
Taylor | |
Davies | |
O'Brien | |
Miller | |
Roy | |
Taylor | |
Byrne | |
Davis | |
Tremblay | |
Morton | |
Singh | |
Evans | |
O'Ryan | |
Garcia | |
Lee | |
White | |
Wang | |
Thomas | |
O'Connor | |
Rodriguez | |
Gagnon | |
Martin | |
Anderson | |
Roberts | |
O'Neill | |
Anderson | |
Clark | |
Wright | |
Mitchell | |
Johnson | |
Rodriguez | |
Lopez | |
Perez | |
Jackson | |
Lewis | |
Hill | |
Roberts | |
Jones | |
White | |
Scott | |
Turner | |
Brown | |
Harris | |
Walker | |
Green | |
Phillips | |
Hall | |
Adams | |
Campbell | |
Miller | |
Allen | |
Baker | |
Parker | |
Garcia | |
Young | |
Gonzalez | |
Evans | |
Moore | |
Martinez | |
Hernandez | |
Nelson | |
Edwards | |
Taylor | |
Robinson | |
Carter | |
Collins | |
George | |
Ronald | |
John | |
Richard | |
Kenneth | |
Anthony | |
Charles | |
Paul | |
Steven | |
Michael | |
Joseph | |
Mark | |
Thomas | |
Donald | |
Brian | |
Jeff | |
Mary | |
Jennifer | |
Lisa | |
Sandra | |
Michelle | |
Patricia | |
Maria | |
Nancy | |
Donna | |
Laura | |
Linda | |
Susan | |
Karen | |
Carol | |
Sarah | |
Barbara | |
Margaret | |
Betty | |
Ruth | |
Kimberly | |
Elizabeth | |
Dorothy | |
Helen | |
Sharon | |
Deborah | |
Sanders | |
Joy | |
Sean | |
Walton | |
Reznor | |
Antonio | |
Trump | |
Julia | |
Blair | |
Nobel | |
Johann | |
Ann | |
Lindsay | |
Laura | |
Sam | |
Kelly | |
Bill | |
Maya | |
Adriana | |
Lola | |
Ingrid | |
Clare | |
Emma | |
Isabella | |
Abigail | |
Charlotte | |
Lillian | |
Hannah | |
Samantha | |
Caroline | |
Sheeran | |
Madelyn | |
Kate | |
Hayes | |
Arianna | |
Maggie | |
Audrey | |
Luis | |
Paolo | |
Oliver | |
Emilio | |
Gustav | |
Tyler | |
Taylor | |
Javier | |
Kristian | |
Henrik | |
Stefan | |
Etienne | |
Johnson | |
Ferdinand | |
Hector | |
Catlin | |
Hugo | |
Ali | |
Raymond | |
Xavier | |
Harry | |
Potter | |
Evan | |
Elvis | |
Harrison | |
Jasper | |
Hitler | |
Scott | |
John | |
Patricia | |
Robert | |
Linda | |
Richard | |
Susan | |
Joseph | |
Jessica | |
Thomas | |
Sarah | |
Charles | |
Margaret | |
Christopher | |
Daniel | |
Nancy | |
Matthew | |
Lisa | |
Anthony | |
Betty | |
Donald | |
Dorothy | |
Paul | |
Ashley | |
Andrew | |
Donna | |
Kenneth | |
Carol | |
Joshua | |
Amanda | |
Brian | |
Melissa | |
Deborah | |
Ronald | |
Stephanie | |
Timothy | |
Rebecca | |
Jeffrey | |
Helen | |
Sharon | |
Gary | |
Kathleen | |
Nicholas | |
Amy | |
Eric | |
Shirley | |
Angela | |
Larry | |
Justin | |
Brenda | |
Scott | |
Pamela | |
Nicole | |
Frank | |
Katherine | |
Benjamin | |
Samantha | |
Gregory | |
Christine | |
Samuel | |
Virginia | |
Rachel | |
Jack | |
Janet | |
Dennis | |
Jerry | |
Carolyn | |
Maria | |
Aaron | |
Heather | |
Jose | |
Julie | |
Douglas | |
Joyce | |
Peter | |
Evelyn | |
Nathan | |
Victoria | |
Zachary | |
Walter | |
Christina | |
Kyle | |
Lauren | |
Harold | |
Frances | |
Carl | |
Martha | |
Judith | |
Gerald | |
Cheryl | |
Keith | |
Megan | |
Roger | |
Andrea | |
Arthur | |
Olivia | |
Terry | |
Ann | |
Jacqueline | |
Ethan | |
Austin | |
Doris | |
Kathryn | |
Albert | |
Gloria | |
Jesse | |
Teresa | |
Willie | |
Sara | |
Billy | |
Janice | |
Marie | |
Bruce | |
Noah | |
Jordan | |
Judy | |
Dylan | |
Theresa | |
Ralph | |
Madison | |
Roy | |
Beverly | |
Alan | |
Denise | |
Wayne | |
Marilyn | |
Eugene | |
Amber | |
Juan | |
Danielle | |
Gabriel | |
Rose | |
Louis | |
Brittany | |
Russell | |
Diana | |
Randy | |
Abigail | |
Vincent | |
Natalie | |
Philip | |
Jane | |
Logan | |
Lori | |
Bobby | |
Alexis | |
Tiffany | |
Johnny | |
Kayla | |
Boccaccio | |
Gruber | |
Huber | |
Bauer | |
Wagner | |
Pichler | |
Steiner | |
Moser | |
Mayer | |
Hofer | |
Leitner | |
Berger | |
Fuchs | |
Eder | |
Fischer | |
Schmid | |
Winkler | |
Weber | |
Schwarz | |
Maier | |
Schneider | |
Reiter | |
Mayr | |
Schmidt | |
Wimmer | |
Egger | |
Brunner | |
Lang | |
Baumgartner | |
Auer | |
Binder | |
Lechner | |
Wolf | |
Wallner | |
Aigner | |
Ebner | |
Koller | |
Lehner | |
Haas | |
Schuster | |
Heilig | |
Peeters | |
Janssens | |
Maes | |
Jacobs | |
Mertens | |
Willems | |
Claes | |
Goossens | |
Wouters | |
Dubois | |
Lambert | |
Dupont | |
Martin | |
Simon | |
Nielsen | |
Jensen | |
Hansen | |
Pedersen | |
Andersen | |
Christensen | |
Larsen | |
Rasmussen | |
Petersen | |
Madsen | |
Kristensen | |
Olsen | |
Thomsen | |
Christiansen | |
Poulsen | |
Johansen | |
Mortensen | |
Joensen | |
Hansen | |
Jacobsen | |
Olsen | |
Poulsen | |
Petersen | |
Johannesen | |
Thomsen | |
Nielsen | |
Johansen | |
Rasmussen | |
Simonsen | |
Djurhuus | |
Jensen | |
Danielsen | |
Mortensen | |
Mikkelsen | |
Dam | |
Andreasen | |
Johansson | |
Nyman | |
Lindholm | |
Karlsson | |
Andersson | |
Hendriks |
conjunctions.txt
or | |
but | |
nor | |
so | |
for | |
yet | |
after | |
although | |
as | |
as if | |
as long as | |
because | |
before | |
even if | |
even though | |
once | |
since | |
so that | |
though | |
till | |
unless | |
until | |
what | |
when | |
whenever | |
wherever | |
whether | |
while | |
why | |
if | |
after | |
from | |
by | |
for | |
with | |
but | |
of | |
to | |
and | |
before | |
how | |
which | |
a | |
an | |
the | |
these | |
our | |
i | |
he | |
she | |
they | |
there | |
are | |
is | |
be | |
you | |
able | |
about | |
across | |
all | |
almost | |
also | |
am | |
among | |
any | |
at | |
been | |
best | |
can | |
cannot | |
could | |
dear | |
did | |
do | |
does | |
either | |
else | |
ever | |
every | |
get | |
got | |
have | |
has | |
had | |
her | |
hers | |
him | |
his | |
however | |
in | |
into | |
it | |
its | |
just | |
least | |
let | |
like | |
likely | |
other | |
rather | |
me | |
might | |
most | |
must | |
my | |
neither | |
not | |
nor | |
often | |
off | |
on | |
only | |
should | |
some | |
then | |
that | |
their | |
then | |
this | |
too | |
us | |
we | |
who | |
whom | |
would | |
yet | |
here | |
there | |
bbc | |
abc | |
news | |
maybe | |
perhaps | |
man | |
men | |
woman | |
women | |
Out | |
yes | |
no | |
in | |
out |
countries.txt
Afghanistan | |
Albania | |
Algeria | |
America | |
Andorra | |
Angola | |
Antigua | |
Argentina | |
Armenia | |
Australia | |
Austria | |
Azerbaijan | |
Bahamas | |
Bahrain | |
Bangladesh | |
Barbados | |
Belarus | |
Belgium | |
Belize | |
Russians | |
Europeans | |
Benin | |
Bhutan | |
Bissau | |
Bolivia | |
Bosnia | |
Botswana | |
Brazil | |
British | |
Britan | |
Brunei | |
Bulgaria | |
Burkina | |
Burma | |
Burundi | |
Cambodia | |
Cameroon | |
Canada | |
Cape Verde | |
Central African Republic | |
Chad | |
Chile | |
China | |
Colombia | |
Comoros | |
Congo | |
Costa Rica | |
country debt | |
Croatia | |
Cuba | |
Cyprus | |
Czech | |
Denmark | |
Djibouti | |
Dominica | |
East Timor | |
Ecuador | |
Egypt | |
El Salvador | |
Emirate | |
England | |
Eritrea | |
Estonia | |
Ethiopia | |
Russian | |
Fiji | |
Finland | |
France | |
Gabon | |
Gambia | |
Georgia | |
French | |
Germany | |
Ghana | |
Great Britain | |
Europe | |
European | |
Britain | |
Greece | |
Grenada | |
Grenadines | |
Guatemala | |
Guinea | |
Guyana | |
Haiti | |
Herzegovina | |
Honduras | |
Hungary | |
Iceland | |
in usa | |
India | |
Indian | |
Indonesia | |
Iran | |
Iraq | |
Ireland | |
Israel | |
Italy | |
Ivory Coast | |
Jamaica | |
Japan | |
Jordan | |
Kazakhstan | |
Kenya | |
Kiribati | |
Korea | |
Kosovo | |
Kuwait | |
Kyrgyzstan | |
Laos | |
Latvia | |
Lebanon | |
Lesotho | |
Liberia | |
Libya | |
Liechtenstein | |
Lithuania | |
Luxembourg | |
Macedonia | |
Madagascar | |
Malawi | |
Malaysia | |
Maldives | |
Mali | |
Malta | |
Marshall | |
Mauritania | |
Mauritius | |
Mexico | |
Micronesia | |
Moldova | |
Monaco | |
Mongolia | |
Montenegro | |
Morocco | |
Mozambique | |
Myanmar | |
Namibia | |
Nauru | |
Nepal | |
Netherlands | |
New Zealand | |
Nicaragua | |
Niger | |
Nigeria | |
Norway | |
Oman | |
Pakistan | |
Palau | |
Panama | |
Papua | |
Paraguay | |
Peru | |
Philippines | |
Poland | |
Portugal | |
Qatar | |
Romania | |
Russia | |
Rwanda | |
Samoa | |
San Marino | |
Sao Tome | |
Saudi Arabia | |
scotland | |
scottish | |
Senegal | |
Serbia | |
Seychelles | |
Sierra Leone | |
Singapore | |
Slovakia | |
Slovenia | |
Solomon | |
Somalia | |
South Africa | |
Africa | |
South Sudan | |
Spain | |
Sri Lanka | |
St. Kitts | |
St. Lucia | |
St Kitts | |
St Lucia | |
Saint Kitts | |
Santa Lucia | |
Sudan | |
Suriname | |
Swaziland | |
Sweden | |
Switzerland | |
Syria | |
Taiwan | |
Tajikistan | |
Tanzania | |
Thailand | |
Tobago | |
Togo | |
Tonga | |
Trinidad | |
Tunisia | |
Turkey | |
Turkmenistan | |
Tuvalu | |
Uganda | |
Ukraine | |
United Kingdom | |
United States | |
Uruguay | |
USA | |
US | |
UK | |
Uzbekistan | |
Vanuatu | |
Vatican | |
Venezuela | |
Vietnam | |
wales | |
welsh | |
Yemen | |
Zambia | |
Zimbabwe | |
Afghan | |
Albanian | |
Algerian | |
American | |
Andorran | |
Angolan | |
Antiguans | |
Argentinean | |
Armenian | |
Australian | |
Austrian | |
Azerbaijani | |
Bahamian | |
Bahraini | |
Bangladeshi | |
Barbadian | |
Barbudans | |
Batswana | |
Belarusian | |
Belgian | |
Bourgeoi | |
Bourgeoisie | |
Belizean | |
Beninese | |
Bhutanese | |
Bolivian | |
Beverly Hills | |
Bosnian | |
Brazilian | |
British | |
Bruneian | |
Bulgarian | |
Burkinabe | |
Burmese | |
Burundian | |
Cambodian | |
Cameroonian | |
Canadian | |
Cape Verdean | |
Central African | |
Chadian | |
Chilean | |
Chinese | |
Colombian | |
Comoran | |
Congolese | |
Costa Rican | |
Croatian | |
Cuban | |
Cypriot | |
Czech | |
Danish | |
Djibouti | |
Dominican | |
Dutch | |
East Timorese | |
Ecuadorean | |
Egyptian | |
Emirian | |
Equatorial Guinean | |
Eritrean | |
Estonian | |
Ethiopian | |
Fijian | |
Filipino | |
Finnish | |
French | |
Gabonese | |
Gambian | |
Georgian | |
German | |
Ghanaian | |
Greek | |
Grenadian | |
Guatemalan | |
Guinea-Bissauan | |
Guinean | |
Guyanese | |
Haitian | |
Herzegovinian | |
Honduran | |
Hungarian | |
I-Kiribati | |
Icelander | |
Indian | |
Indonesian | |
Iranian | |
Iraqi | |
Irish | |
Israeli | |
Italian | |
Ivorian | |
Jamaican | |
Japanese | |
Jordanian | |
Kazakhstani | |
Kenyan | |
Kittian | |
Nevisian | |
Kuwaiti | |
Kyrgyz | |
Laotian | |
Latvian | |
Lebanese | |
Liberian | |
Libyan | |
Liechtensteiner | |
Lithuanian | |
Luxembourger | |
Macedonian | |
Malagasy | |
Malawian | |
Malaysian | |
Maldivian | |
Malian | |
Maltese | |
Marshallese | |
Mauritanian | |
Mauritian | |
Mexican | |
Micronesian | |
Moldovan | |
Monacan | |
Mongolian | |
Moroccan | |
Mosotho | |
Motswana | |
Mozambican | |
Namibian | |
Nauruan | |
Nepalese | |
New Zealander | |
Ni-Vanuatu | |
Nicaraguan | |
Nigerian | |
Nigerien | |
North Korean | |
Northern Irish | |
Norwegian | |
Omani | |
Pakistani | |
Palauan | |
Panamanian | |
Papua New Guinean | |
Paraguayan | |
Peruvian | |
Polish | |
Portuguese | |
Qatari | |
Romanian | |
Russian | |
Rwandan | |
Saint Lucian | |
Salvadoran | |
Samoan | |
San Marinese | |
Sao Tomean | |
Saudi | |
Scottish | |
Senegalese | |
Serbian | |
Seychellois | |
Sierra Leonean | |
Singaporean | |
Slovakian | |
Slovenian | |
Solomon Islander | |
Somali | |
South African | |
South Korean | |
Spanish | |
Sri Lankan | |
Sudanese | |
Surinamer | |
Swazi | |
Swedish | |
Swiss | |
Syrian | |
Taiwanese | |
Tajik | |
Tanzanian | |
Thai | |
Togolese | |
Tongan | |
Trinidadian | |
Tobagonian | |
Tunisian | |
Turkish | |
Tuvaluan | |
Ugandan | |
Ukrainian | |
Uruguayan | |
Uzbekistani | |
Uzbekistan | |
Venezuelan | |
Vietnamese | |
Welsh | |
Yemenite | |
Zambian | |
Zimbabwean | |
Monday | |
Tuesday | |
Wednesday | |
Thursday | |
Friday | |
Saturday | |
Sunday | |
Beijing | |
Chicago | |
Taoyuan | |
San Antonio | |
Toronto | |
New York | |
English | |
Pennsylvania | |
South Carolina | |
Texas | |
Wisconsin | |
St Paul | |
London | |
Soho | |
Brexit | |
Britain | |
Manchester | |
Middle Eastern | |
Taipei | |
Vienna | |
EU | |
Yemeni | |
Europe | |
European | |
South America | |
South American | |
Asia | |
Asian | |
Oceania | |
Oceanian | |
Africa | |
African | |
Antarctica | |
Pacific | |
Atlantic | |
Mediterranean | |
Scot | |
Scots | |
Korean | |
California | |
Swedes | |
Swede | |
Zurich | |
Yemenis | |
Western | |
Chicago | |
northeast | |
southeast | |
southwest | |
northwest | |
northern | |
western | |
eastern | |
southern | |
States | |
state | |
Limburger | |
Limburgers | |
Country | |
Countries | |
City | |
Cities | |
County | |
Counties | |
York | |
Madison |
featureIdentifier.py
def load_who_file():
    """
    Read file whos.txt and insert its data into a hash set
    :return: the hash set with all "who" words from whos.txt
    """
    # open() replaces the Python-2-only file() builtin used originally
    return set(who.strip('\n').lower() for who in open("data/whos.txt", 'r').readlines())

def load_common_name_file():
    """
    Read file common_name.txt and insert its data into a hash set
    :return: the hash set with all common names from common_name.txt
    """
    return set(common_name.strip('\n').lower() for common_name in open("data/common_name.txt", 'r').readlines())

def load_common_adj_file():
    """
    Read file common_adj.txt and insert its data into a hash set
    :return: the hash set with all common adjectives from common_adj.txt
    """
    return set(common_adj.strip('\n').lower() for common_adj in open("data/common_adj.txt", 'r').readlines())

def load_country_file():
    """
    Read file countries.txt and insert its data into a hash set
    :return: the hash set with all country names from countries.txt
    """
    return set(country.strip('\n').lower() for country in open("data/countries.txt", 'r').readlines())

def load_conjunction_file():
    """
    Read file conjunctions.txt and insert its data into a hash set
    :return: the hash set with all conjunctions from conjunctions.txt
    """
    return set(conjunction.strip('\n').lower() for conjunction in open("data/conjunctions.txt", 'r').readlines())

def load_prefix_library():
    """
    Generate a hash set with all name prefixes
    :return: the hash set with all prefixes from prefix.txt
    """
    return set(prefix.strip('\n').lower() for prefix in open("data/prefix.txt", 'r').readlines())

def load_organ_library():
    """
    Generate a hash set with all organization titles
    :return: the hash set with all organization titles from organization.txt
    """
    return set(organ.strip('\n').lower() for organ in open("data/organization.txt", 'r').readlines())

def load_month_file():
    """
    Generate a hash set with all months
    :return: the hash set with all months from month.txt
    """
    return set(month.strip('\n').lower() for month in open("data/month.txt", 'r').readlines())

def load_verb_file():
    """
    Read files irregular_verbs.txt and regular_verbs.txt and insert their data into a hash set
    :return: the hash set with all verbs from both files
    """
    return set(open("data/irregular_verbs.txt", 'r').read().split(', ')) | \
        set(open("data/regular_verbs.txt", 'r').read().split(', '))

def load_preposition_file():
    """
    Read file preposition.txt and insert its data into a hash set
    :return: the hash set with all prepositions from preposition.txt
    """
    return set(open("data/preposition.txt", 'r').read().split(', '))
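All the loaders above share one pattern: one entry per line, stripped and lower-cased into a set. A minimal sketch of that pattern, using a temporary file instead of the project's data/ directory (the three country names are just sample data):

```python
import os
import tempfile

# write a throwaway word list: one entry per line, mixed case
tmp = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
tmp.write("Afghanistan\nAlbania\nAlgeria\n")
tmp.close()

# same load pattern as the functions above: strip the newline, lower-case, collect into a set
countries = set(line.strip('\n').lower() for line in open(tmp.name, 'r').readlines())
os.remove(tmp.name)

print(countries == {"afghanistan", "albania", "algeria"})  # True
```

Lower-casing at load time lets every membership test later use `word.lower() in the_set`, so matching is case-insensitive throughout.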
def contains_country(ngram, country_set):
    """
    Identify if an n-gram contains a country name
    :param ngram: an n-gram
    :param country_set: a set containing all country names
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check every unigram, bigram, and trigram inside the n-gram against the country set
    words = ngram[0].split(' ')
    for word in words:
        if word.lower() in country_set:
            return 1
    if len(words) >= 2:
        for i in range(1, len(words)):
            if (words[i-1] + ' ' + words[i]).lower() in country_set:
                return 1
    if len(words) >= 3:
        for i in range(2, len(words)):
            if (words[i-2] + ' ' + words[i-1] + ' ' + words[i]).lower() in country_set:
                return 1
    return 0

def contains_common_name(ngram, common_name_set):
    """
    Identify if an n-gram contains a common name
    :param ngram: an n-gram
    :param common_name_set: a set containing all common names
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in common_name_set:
            return 1
    return 0

def contains_common_adj(ngram, common_adj_set):
    """
    Identify if an n-gram contains a common adjective
    :param ngram: an n-gram
    :param common_adj_set: a set containing all common adjectives
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in common_adj_set:
            return 1
    return 0
def contains_prefix(ngram, prefix_set):
    """
    Identify if an n-gram contains a name prefix
    :param ngram: an n-gram
    :param prefix_set: a set containing all prefixes
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in prefix_set:
            return 1
    return 0

def contains_month(ngram, month_set):
    """
    Identify if an n-gram contains a month
    :param ngram: an n-gram
    :param month_set: a set containing all months
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in month_set:
            return 1
    return 0

def contains_organization(ngram, organ_set):
    """
    Identify if an n-gram contains an organization title
    :param ngram: an n-gram
    :param organ_set: a set containing common organization titles
    :return: 1 (has feature) or 0 (no such feature)
    """
    # check unigrams and bigrams against the organization-title set
    words = ngram[0].split(' ')
    for word in words:
        if word.lower() in organ_set:
            return 1
    if len(words) >= 2:
        for i in range(1, len(words)):
            if (words[i-1] + ' ' + words[i]).lower() in organ_set:
                return 1
    return 0

def contains_conjunction(ngram, conjunctions_set):
    """
    Identify if an n-gram contains a conjunction
    :param ngram: an n-gram
    :param conjunctions_set: a set containing all conjunctions
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in conjunctions_set:
            return 1
    return 0

def contains_verb(ngram, verb_set):
    """
    Identify if an n-gram contains a verb
    :param ngram: an n-gram
    :param verb_set: a set containing all verbs
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in verb_set:
            return 1
    return 0
def is_all_upper(ngram):
    """
    Check whether every word in the n-gram starts with an upper-case letter
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if len(word) > 0 and word[0].islower():
            return 0
    return 1

def has_who(ngram, who_set):
    """
    Check whether any word in the n-gram is in the "who" set
    :param ngram: an n-gram
    :param who_set: a set containing the "who" words
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in who_set:
            return 1
    return 0

def no_more_than_one_lower(ngram):
    """
    Check that at most one word in the n-gram is entirely lower case
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    count = 0
    for word in ngram[0].split(' '):
        if word.islower():
            count += 1
        if count > 1:
            return 0
    return 1

def has_prefix_before_ngram(ngram, single_grams, prefix_set):
    """
    Check if the word in front of the input n-gram is a name prefix
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :param prefix_set: a set containing all prefixes
    :return: 1 (has feature) or 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams[ngram[2] - 1][0].lower()
        if preWord in prefix_set:
            return 1
    return 0
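The context features index into single_grams, the ordered token list of the article, where each entry's first element is the token text. A standalone sketch (has_prefix_before_ngram is reproduced so the example runs on its own; the tuple layout of (text, placeholder, start index, end index) is inferred from the indexing above, and the prefix set is a hypothetical example since prefix.txt is not shown in this excerpt):

```python
def has_prefix_before_ngram(ngram, single_grams, prefix_set):
    # copy of the feature above: look one token to the left of the n-gram's start
    if (ngram[2] - 1) >= 0:
        if single_grams[ngram[2] - 1][0].lower() in prefix_set:
            return 1
    return 0

# each single-gram mirrors the n-gram layout: (text, placeholder, start index, end index)
single_grams = [("Mr.", None, 0, 0), ("John", None, 1, 1), ("Smith", None, 2, 2)]
ngram = ("John Smith", None, 1, 2)
print(has_prefix_before_ngram(ngram, single_grams, {"mr.", "mrs.", "dr."}))  # 1
```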
def has_human_verb(ngram, single_grams, verb_set):
    """
    Check if the word after the input n-gram is a verb usually used for humans
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :param verb_set: a set containing verbs usually used for humans
    :return: 1 (has feature) or 0 (no such feature)
    """
    ngram_end_index = ngram[3]
    if (ngram_end_index + 1) < len(single_grams):
        if single_grams[ngram_end_index + 1][0] in verb_set:
            return 1
    return 0

def features_label_separator(ngrams, labels_set=None):
    """
    Separate features and labels from n-grams and return two lists
    :param ngrams: all n-grams from all articles
    :param labels_set: the hash set of all labels
    :return: two lists -- the features and labels of the n-grams
    """
    features = [ngram[4:] for ngram in ngrams]
    label = [1 if ngram[0] in labels_set else 0 for ngram in ngrams] if labels_set else []
    return features, label

def afterpreposition(ngram, single_grams, preposition_set):
    """
    Check if the word in front of the input n-gram is a preposition
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :param preposition_set: a set containing all prepositions
    :return: 1 (has feature) or 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        prepos = single_grams[ngram[2] - 1][0].lower()
        if prepos in preposition_set:
            return 1
    return 0

def before_who(ngram, single_grams, who_set):
    """
    Check if the word after the input n-gram is "who"
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :param who_set: a set containing the "who" words
    :return: 1 (has feature) or 0 (no such feature)
    """
    # ngram[3] is the end index; the original indexed from ngram[2] (the start
    # index), which matches the docstring only for single-word n-grams
    return 1 if (ngram[3] + 1) < len(single_grams) and single_grams[ngram[3] + 1][0].lower() in who_set else 0

def has_duplicate(ngram):
    """
    Check if the input n-gram contains any duplicate words
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    words = set()
    for word in ngram[0].split(' '):
        if word in words:
            return 1
        words.add(word)
    return 0

def count_occurrences(ngram, single_grams):
    """
    Count the word's occurrences in the article (only for single words)
    :param ngram: an n-gram
    :param single_grams: all words in an article, in order
    :return: the word's number of occurrences
    """
    # compare against each single-gram's text; the original counted ngram[0]
    # directly in the list of tuples, which always returned 0
    return sum(1 for gram in single_grams if gram[0] == ngram[0])
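features_label_separator relies on the n-gram tuple convention used throughout: text at index 0, positions at 2 and 3, and every slot from index 4 on being a computed feature. A sketch with hypothetical feature values (the function is reproduced so the example runs standalone; the three trailing feature columns are made up for illustration):

```python
def features_label_separator(ngrams, labels_set=None):
    # copy of the function above: columns 4+ are features; the label is 1
    # when the n-gram's text appears in the labeled-name set
    features = [ngram[4:] for ngram in ngrams]
    label = [1 if ngram[0] in labels_set else 0 for ngram in ngrams] if labels_set else []
    return features, label

# (text, placeholder, start, end, feature1, feature2, feature3)
ngrams = [("John Smith", None, 1, 2, 1, 0, 1),
          ("went home", None, 3, 4, 0, 0, 0)]
X, y = features_label_separator(ngrams, labels_set={"John Smith"})
print(X)  # [(1, 0, 1), (0, 0, 0)]
print(y)  # [1, 0]
```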
def start_end_dash(ngram):
    """
    Check if the n-gram starts or ends with a non-alphabetic token, or contains more than one dash
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    words = ngram[0].split(' ')
    if not words[0].isalpha() or (len(words) > 1 and not words[-1].isalpha()) or words.count('-') > 1:
        return 1
    count = 0
    for word in words:
        count += word.count('-')
        if count > 1:
            return 1
    return 0

def has_one_dash(ngram):
    """
    Check if the n-gram contains exactly one dash
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    words = ngram[0].split(' ')
    if words.count('-') == 1:
        return 1
    count = 0
    for word in words:
        count += word.count('-')
    if count == 1:
        return 1
    return 0

def all_upper_character(ngram):
    """
    Check if any word in the n-gram is written entirely in upper case
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.isupper():
            return 1
    return 0

def word_length(ngram):
    """
    Return the number of words in the n-gram
    :param ngram: an n-gram
    :return: the number of words
    """
    return len(ngram[0].split(' '))
def has_fullstop_before_ngram(ngram, single_grams2):
    """
    Check if the token in front of the input n-gram ends with a full stop
    :param ngram: an n-gram
    :param single_grams2: all tokens, including punctuation, in an article, in order
    :return: 1 (has feature) or 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams2[ngram[2] - 1][0].lower()
        if preWord.endswith("."):
            return 1
    return 0

def has_comma_before_ngram(ngram, single_grams2):
    """
    Check if the token in front of the input n-gram ends with a comma
    :param ngram: an n-gram
    :param single_grams2: all tokens, including punctuation, in an article, in order
    :return: 1 (has feature) or 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams2[ngram[2] - 1][0].lower()
        if preWord.endswith(","):
            return 1
    return 0

def has_comma(ngram, single_grams2):
    """
    Check if the last token of the n-gram ends with a comma
    :param ngram: an n-gram
    :param single_grams2: all tokens, including punctuation, in an article, in order
    :return: 1 (has feature) or 0 (no such feature)
    """
    lastWord = single_grams2[ngram[3]][0]
    if lastWord.endswith(","):
        return 1
    return 0

def is_name_suffix(ngram):
    """
    Check if any word of the n-gram is a name suffix (e.g. Jr., Sr., III)
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    suffixes = ['Sr', 'Sr.', 'Jr', 'Jr.', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X',
                'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX',
                'sr', 'sr.', 'jr', 'jr.', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x',
                'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx']
    for word in ngram[0].split(' '):
        if word in suffixes:
            return 1
    return 0

def start_with_suffix(ngram):
    """
    Check if the first word of the n-gram is a name suffix
    :param ngram: an n-gram
    :return: 1 (has feature) or 0 (no such feature)
    """
    suffixes = ['Sr', 'Sr.', 'Jr', 'Jr.', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X',
                'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX',
                'sr', 'sr.', 'jr', 'jr.', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x',
                'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx']
    words = ngram[0].split(' ')
    if words[0] in suffixes:
        return 1
    return 0
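As a quick sanity check, the feature functions above can be exercised on a hand-built n-gram tuple. contains_country is reproduced here so the sketch runs on its own; the tuple layout (text at index 0, start index at 2, end index at 3, with a placeholder in the unused second slot) is inferred from the indexing in the code, and the small country set is sample data:

```python
def contains_country(ngram, country_set):
    # copy of the feature above: check unigrams, bigrams, and trigrams
    words = ngram[0].split(' ')
    for word in words:
        if word.lower() in country_set:
            return 1
    if len(words) >= 2:
        for i in range(1, len(words)):
            if (words[i-1] + ' ' + words[i]).lower() in country_set:
                return 1
    if len(words) >= 3:
        for i in range(2, len(words)):
            if (words[i-2] + ' ' + words[i-1] + ' ' + words[i]).lower() in country_set:
                return 1
    return 0

country_set = {"costa rica", "chile", "china"}
# hypothetical n-gram: (text, placeholder, start index, end index)
print(contains_country(("visited Costa Rica", None, 7, 9), country_set))  # 1: matched as a bigram
print(contains_country(("went home", None, 0, 1), country_set))           # 0
```

Multi-word entries like "Costa Rica" are why the function slides bigram and trigram windows over the n-gram instead of testing single words only.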
irregular_verbs.txt
is, am, are, was, were, has been, have been, beat, beaten, become, became, begin, began, begun, bend, bent, bet, bid, bite, bit, bitten, blow, blew, blown, break, broke, broken, bring, brought, build, built, burn, burned, burnt, buy, bought, catch, caught, choose, chose, chosen, come, came, cost, cut, dig, dug, dive, dove, dived, do, did, done, draw, drew, drawn, dream, dreamed, dreamt, drive, drove, driven, drink, drank, drunk, eat, ate, eaten, fall, fell, fallen, feel, felt, fight, fought, find, found, fly, flew, flown, forget, forgot, forgotten, forgive, forgave, forgiven, freeze, froze, frozen, get, got, gotten, give, gave, given, go, went, gone, grow, grew, grown, hang, hung, have, had, hear, heard, hide, hid, hidden, hit, hold, held, hurt, keep, kept, know, knew, known, lay, laid, lead, led, leave, left, lend, lent, let, lie, lain, lose, lost, make, made, mean, meant, meet, met, pay, paid, put, read, ride, rode, ridden, ring, rang, rung, rise, rose, risen, run, ran, say, said, see, saw, seen, sell, sold, send, sent, show, showed, shown, shut, sing, sang, sung, sit, sat, sleep, slept, speak, spoke, spoken, spend, spent, stand, stood, swim, swam, swum, take, took, taken, teach, taught, tear, tore, torn, tell, told, think, thought, throw, threw, thrown, understand, understood, wake, woke, woken, wear, wore, worn, win, won, write, wrote, written
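Unlike the one-word-per-line lists above, this verb file is a single comma-separated line, which is why load_verb_file splits on ", " rather than reading lines. A minimal sketch of that parsing on a short sample of the data:

```python
# sample of the verb file's format: one long comma-separated line
raw = "is, am, are, was, were, beat, beaten, become, became"
verbs = set(raw.split(', '))

print('beaten' in verbs)  # True
print(len(verbs))         # 9
```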
from sklearn.model_selection import cross_val_score, ShuffleSplit
# from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import preprocessing
from ngramGenerator import *
from featureIdentifier import *
from mlModel import *
from postProcessing import *
import pandas as pd
from pandas import DataFrame
def main():
    articles, train_labels_set, test_labels_set = [], set(), set()
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Pre-processing '''
    ''' (1) Load data and split data into train/test sets '''
    ''' (2) Hashset the labels and remove labels from the data '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    # add all files' data into articles
    preprocessing.read_data(articles)
    # split the data into train and test sets
    train_set, test_set = preprocessing.data_split(articles)
    train_label_count, test_label_count = 0, 0
    # take off the labels and add the names to the label sets
    for i in range(len(train_set)):
        train_set[i], train_label_count, train_labels_set =\
            preprocessing.label_extraction_takeoff(paragraphs=train_set[i], count=train_label_count, labels=train_labels_set)
    for i in range(len(test_set)):
        test_set[i], test_label_count, test_labels_set =\
            preprocessing.label_extraction_takeoff(paragraphs=test_set[i], count=test_label_count, labels=test_labels_set)
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' N-gram generation '''
    ''' (1) Generate all n-grams (with first feature whether contains 's) '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    train_ngram_result, test_ngram_result = [], []
    train_single_gram, test_single_gram = [], []
    train_single_gram2, test_single_gram2 = [], []  # save single grams in order for later use
    for i in range(len(train_set)):
        ngrams, singles, singles2 = generate_ngrams(filename=train_set[i][0], content=train_set[i][1], n=5)
        train_ngram_result.append(ngrams)
        train_single_gram.append(singles)
        train_single_gram2.append(singles2)
    for i in range(len(test_set)):
        ngrams, singles, singles2 = generate_ngrams(filename=test_set[i][0], content=test_set[i][1], n=5)
        test_ngram_result.append(ngrams)
        test_single_gram.append(singles)
        test_single_gram2.append(singles2)
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Take out n-grams that are all lowercase (training data only) '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    for index in range(len(train_ngram_result)):
        train_ngram_result[index] = eliminate_all_lower(train_ngram_result[index])
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Create a test n-gram result without all-lowercase n-grams '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    test_ngram_result_without_all_lower = test_ngram_result[:]
    for index in range(len(test_ngram_result_without_all_lower)):
        test_ngram_result_without_all_lower[index] = eliminate_all_lower(test_ngram_result_without_all_lower[index])
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Feature creation '''
    ''' (1) 's (added during generation of ngram) '''
    ''' (2) contains country '''
    ''' (3) contains conjunction '''
    ''' (4) all capitalised '''
    ''' (5) prefix before n-gram '''
    ''' (6) verbs for humans '''
    ''' (7) prefix in n-gram '''
    ''' (8) after preposition '''
    ''' (9) contains organization '''
    ''' (10) has no more than 1 word without capitalised starting letter '''
    ''' (11) contains month '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    country_set, conjunction_set, prefix_set, verb_set, preposition_set, organ_set, month_set, who_set, common_name_set, common_adj_set = \
        load_country_file(), load_conjunction_file(), load_prefix_library(),\
        load_verb_file(), load_preposition_file(), load_organ_library(), load_month_file(), load_who_file(), load_common_name_file(), load_common_adj_file()
    for ngram_set_index in range(len(train_ngram_result)):
        article = ' '.join(a[0] for a in train_single_gram[ngram_set_index])
        for ngram_index in range(len(train_ngram_result[ngram_set_index])):
            ngram = train_ngram_result[ngram_set_index][ngram_index]
            train_ngram_result[ngram_set_index][ngram_index] = ngram +\
                (contains_country(ngram=ngram, country_set=country_set),
                 contains_conjunction(ngram=ngram, conjunctions_set=conjunction_set),
                 is_all_upper(ngram=ngram),
                 has_prefix_before_ngram(ngram=ngram, single_grams=train_single_gram[ngram_set_index], prefix_set=prefix_set),
                 has_human_verb(ngram=ngram, single_grams=train_single_gram[ngram_set_index], verb_set=verb_set),
                 contains_prefix(ngram=ngram, prefix_set=prefix_set),
                 afterpreposition(ngram=ngram, single_grams=train_single_gram[ngram_set_index], preposition_set=preposition_set),
                 contains_organization(ngram=ngram, organ_set=organ_set),
                 contains_common_name(ngram=ngram, common_name_set=common_name_set),
                 has_duplicate(ngram=ngram),
                 count_occurrences(ngram=ngram, single_grams=article),
                 no_more_than_one_lower(ngram=ngram),
                 contains_month(ngram=ngram, month_set=month_set),
                 contains_verb(ngram=ngram, verb_set=verb_set),
                 start_end_dash(ngram=ngram),
                 all_upper_character(ngram=ngram),
                 word_length(ngram=ngram),
                 has_fullstop_before_ngram(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 has_comma_before_ngram(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 before_who(ngram=ngram, single_grams=train_single_gram[ngram_set_index], who_set=who_set),
                 has_comma(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 has_who(ngram=ngram, who_set=who_set),
                 is_name_suffix(ngram=ngram),
                 has_one_dash(ngram=ngram),
                 start_with_suffix(ngram=ngram),
                 contains_common_adj(ngram=ngram, common_adj_set=common_adj_set),)
    for ngram_set_index in range(len(test_ngram_result_without_all_lower)):
        article = ' '.join(a[0] for a in test_single_gram[ngram_set_index])
        for ngram_index in range(len(test_ngram_result_without_all_lower[ngram_set_index])):
            ngram = test_ngram_result_without_all_lower[ngram_set_index][ngram_index]
            test_ngram_result_without_all_lower[ngram_set_index][ngram_index] = ngram +\
                (contains_country(ngram=ngram, country_set=country_set),
                 contains_conjunction(ngram=ngram, conjunctions_set=conjunction_set),
                 is_all_upper(ngram=ngram),
                 has_prefix_before_ngram(ngram=ngram, single_grams=test_single_gram[ngram_set_index], prefix_set=prefix_set),
                 has_human_verb(ngram=ngram, single_grams=test_single_gram[ngram_set_index], verb_set=verb_set),
                 contains_prefix(ngram=ngram, prefix_set=prefix_set),
                 afterpreposition(ngram=ngram, single_grams=test_single_gram[ngram_set_index], preposition_set=preposition_set),
                 contains_organization(ngram=ngram, organ_set=organ_set),
                 contains_common_name(ngram=ngram, common_name_set=common_name_set),
                 has_duplicate(ngram=ngram),
                 count_occurrences(ngram=ngram, single_grams=article),
                 no_more_than_one_lower(ngram=ngram),
                 contains_month(ngram=ngram, month_set=month_set),
                 contains_verb(ngram=ngram, verb_set=verb_set),
                 start_end_dash(ngram=ngram),
                 all_upper_character(ngram=ngram),
                 word_length(ngram=ngram),
                 has_fullstop_before_ngram(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 has_comma_before_ngram(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 before_who(ngram=ngram, single_grams=test_single_gram[ngram_set_index], who_set=who_set),
                 has_comma(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 has_who(ngram=ngram, who_set=who_set),
                 is_name_suffix(ngram=ngram),
                 has_one_dash(ngram=ngram),
                 start_with_suffix(ngram=ngram),
                 contains_common_adj(ngram=ngram, common_adj_set=common_adj_set),)
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Train DT, SVM, NB, RF, LR '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    train_ngrams = []
    while len(train_ngram_result):
        train_ngrams.extend(train_ngram_result.pop())
    train_ngrams = sorted(train_ngrams, key=lambda i: (int(i[1]), i[2], i[3] - i[2]), reverse=True)
    new_train, train_label = features_label_separator(ngrams=train_ngrams, labels_set=train_labels_set)
    decision_tree = build_decision_tree(data=new_train, label=train_label)
    support_vector_machine = build_support_vector_machine(data=new_train, label=train_label)
    nb_classifier = build_nb_classifier(data=new_train, label=train_label)
    rf_classifier = build_rf_classifier(data=new_train, label=train_label)
    lr_classifier = build_lr_classifier(data=new_train, label=train_label)
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Merge test n-gram results '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    test_ngrams = []
    while len(test_ngram_result_without_all_lower):
        test_ngrams.extend(test_ngram_result_without_all_lower.pop())
    test_ngrams = sorted(test_ngrams, key=lambda i: (int(i[1]), i[2], i[3] - i[2]), reverse=True)
    new_test, test_label = features_label_separator(ngrams=test_ngrams, labels_set=test_labels_set)
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Use DT, SVM, NB, RF, LR to predict the train set '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print("Train Set")
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print(" ")
    print("Number of Names: ")
    print(train_label_count)
    decision_tree_predict_train = decision_tree.predict(new_train)
    support_vector_machine_predict_train = support_vector_machine.predict(new_train)
    nb_classifier_predict_train = nb_classifier.predict(new_train)
    rf_classifier_predict_train = rf_classifier.predict(new_train)
    lr_classifier_predict_train = lr_classifier.predict(new_train)
    print("precision before post-processing:")
    print('lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, train_label)])) / sum(lr_classifier_predict_train)))
    print('dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, train_label)])) / sum(decision_tree_predict_train)))
    print('svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, train_label)])) / sum(support_vector_machine_predict_train)))
    print('nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, train_label)])) / sum(nb_classifier_predict_train)))
    print('rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, train_label)])) / sum(rf_classifier_predict_train)))
    print('')
    print("recall before post-processing:")
    print('lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, train_label)])) / sum(train_label)))
    print('dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, train_label)])) / sum(train_label)))
    print('svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, train_label)])) / sum(train_label)))
    print('nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, train_label)])) / sum(train_label)))
    print('rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, train_label)])) / sum(train_label)))
    print('')
    decision_tree_ngrams_train, decision_tree_predict_train, decision_tree_label_train = take_out_overlapped(train_ngrams, decision_tree_predict_train, train_label)
    support_vector_machine_ngrams_train, support_vector_machine_predict_train, support_vector_machine_label_train = take_out_overlapped(train_ngrams, support_vector_machine_predict_train, train_label)
    nb_classifier_ngrams_train, nb_classifier_predict_train, nb_classifier_label_train = take_out_overlapped(train_ngrams, nb_classifier_predict_train, train_label)
    rf_classifier_ngrams_train, rf_classifier_predict_train, rf_classifier_label_train = take_out_overlapped(train_ngrams, rf_classifier_predict_train, train_label)
    lr_classifier_ngrams_train, lr_classifier_predict_train, lr_classifier_label_train = take_out_overlapped(train_ngrams, lr_classifier_predict_train, train_label)
    decision_tree_predict_train = set_predict_value(ngrams=decision_tree_ngrams_train, predict=decision_tree_predict_train)
    support_vector_machine_predict_train = set_predict_value(ngrams=support_vector_machine_ngrams_train, predict=support_vector_machine_predict_train)
    nb_classifier_predict_train = set_predict_value(ngrams=nb_classifier_ngrams_train, predict=nb_classifier_predict_train)
    rf_classifier_predict_train = set_predict_value(ngrams=rf_classifier_ngrams_train, predict=rf_classifier_predict_train)
    lr_classifier_predict_train = set_predict_value(ngrams=lr_classifier_ngrams_train, predict=lr_classifier_predict_train)
    print("precision:")
    print('lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, lr_classifier_label_train)])) / sum(lr_classifier_predict_train)))
    print('dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, decision_tree_label_train)])) / sum(decision_tree_predict_train)))
    print('svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, support_vector_machine_label_train)])) / sum(support_vector_machine_predict_train)))
    print('nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, nb_classifier_label_train)])) / sum(nb_classifier_predict_train)))
    print('rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, rf_classifier_label_train)])) / sum(rf_classifier_predict_train)))
    print('')
    print("recall:")
    print('lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, lr_classifier_label_train)])) / sum(lr_classifier_label_train)))
    print('dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, decision_tree_label_train)])) / sum(decision_tree_label_train)))
    print('svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, support_vector_machine_label_train)])) / sum(support_vector_machine_label_train)))
    print('nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, nb_classifier_label_train)])) / sum(nb_classifier_label_train)))
    print('rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, rf_classifier_label_train)])) / sum(rf_classifier_label_train)))
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Use DT, SVM, NB, RF, LR to predict the test set '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print("Test Set")
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print(" ")
    print("Number of Names: ")
    print(test_label_count)
    decision_tree_predict = decision_tree.predict(new_test)
    support_vector_machine_predict = support_vector_machine.predict(new_test)
    nb_classifier_predict = nb_classifier.predict(new_test)
    rf_classifier_predict = rf_classifier.predict(new_test)
    lr_classifier_predict = lr_classifier.predict(new_test)
    print("precision before post-processing:")
    print('lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, test_label)])) / sum(lr_classifier_predict)))
    print('dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, test_label)])) / sum(decision_tree_predict)))
    print('svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, test_label)])) / sum(support_vector_machine_predict)))
    print('nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, test_label)])) / sum(nb_classifier_predict)))
    print('rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, test_label)])) / sum(rf_classifier_predict)))
    print('')
    print("recall before post-processing:")
    print('lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, test_label)])) / sum(test_label)))
    print('dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, test_label)])) / sum(test_label)))
    print('svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, test_label)])) / sum(test_label)))
    print('nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, test_label)])) / sum(test_label)))
    print('rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, test_label)])) / sum(test_label)))
    print('')
    decision_tree_ngrams, decision_tree_predict, decision_tree_label = take_out_overlapped(test_ngrams, decision_tree_predict, test_label)
    support_vector_machine_ngrams, support_vector_machine_predict, support_vector_machine_label = take_out_overlapped(test_ngrams, support_vector_machine_predict, test_label)
    nb_classifier_ngrams, nb_classifier_predict, nb_classifier_label = take_out_overlapped(test_ngrams, nb_classifier_predict, test_label)
    rf_classifier_ngrams, rf_classifier_predict, rf_classifier_label = take_out_overlapped(test_ngrams, rf_classifier_predict, test_label)
    lr_classifier_ngrams, lr_classifier_predict, lr_classifier_label = take_out_overlapped(test_ngrams, lr_classifier_predict, test_label)
    decision_tree_predict = set_predict_value(ngrams=decision_tree_ngrams, predict=decision_tree_predict)
    support_vector_machine_predict = set_predict_value(ngrams=support_vector_machine_ngrams, predict=support_vector_machine_predict)
    nb_classifier_predict = set_predict_value(ngrams=nb_classifier_ngrams, predict=nb_classifier_predict)
    rf_classifier_predict = set_predict_value(ngrams=rf_classifier_ngrams, predict=rf_classifier_predict)
    lr_classifier_predict = set_predict_value(ngrams=lr_classifier_ngrams, predict=lr_classifier_predict)
    print("precision:")
    print('lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, lr_classifier_label)])) / sum(lr_classifier_predict)))
    print('dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, decision_tree_label)])) / sum(decision_tree_predict)))
    print('svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, support_vector_machine_label)])) / sum(support_vector_machine_predict)))
    print('nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, nb_classifier_label)])) / sum(nb_classifier_predict)))
    print('rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, rf_classifier_label)])) / sum(rf_classifier_predict)))
    print('')
    print("recall:")
    print('lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, lr_classifier_label)])) / sum(lr_classifier_label)))
    print('dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, decision_tree_label)])) / sum(decision_tree_label)))
    print('svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, support_vector_machine_label)])) / sum(support_vector_machine_label)))
    print('nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, nb_classifier_label)])) / sum(nb_classifier_label)))
    print('rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, rf_classifier_label)])) / sum(rf_classifier_label)))
    # print("==========================================================================")
    # print("data frame:")
    # df = pd.DataFrame(columns=['words', 'predict', 'label'])
    # for i in range(len(rf_classifier_predict)):
    #     if not (rf_classifier_predict[i] == rf_classifier_label[i]) and rf_classifier_predict[i] == 1:
    #         df = df.append({'words': rf_classifier_ngrams[i], 'predict': rf_classifier_predict[i], 'label': rf_classifier_label[i]}, ignore_index=True)
    # DataFrame.to_csv(df, "rf_classifier_predict.csv", index=False)
    # scores = cross_val_score(svm.SVC(), new_train, train_label, cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=0))
    # print(scores)

if __name__ == "__main__":
    main()
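The precision/recall print blocks above repeat the same zip-and-sum expression for every classifier. A sketch of a helper that would compute both metrics in one place (the function name and the toy lists are hypothetical, not part of the project):

```python
# Precision and recall for binary 0/1 predictions, matching the inline
# expressions used above: a true positive is a position where prediction
# and gold label are both 1.
def precision_recall(predict, label):
    tp = sum(1 for p, g in zip(predict, label) if p == g == 1)
    predicted_pos = sum(predict)   # denominator for precision
    actual_pos = sum(label)        # denominator for recall
    precision = float(tp) / predicted_pos if predicted_pos else 0.0
    recall = float(tp) / actual_pos if actual_pos else 0.0
    return precision, recall

p, r = precision_recall([1, 1, 0, 1], [1, 0, 0, 1])  # 2 TP, 3 predicted, 2 actual
```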
from sklearn.linear_model import LogisticRegression
from sklearn import tree, svm
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

def build_decision_tree(data, label):
    """
    Build a decision tree from the data and its corresponding labels
    :param data: a list of tuples, each containing all features of one data point
    :param label: a list of labels for the data
    :return: a trained decision tree
    """
    dt_tree = tree.DecisionTreeClassifier()
    return dt_tree.fit(data, label)

def build_support_vector_machine(data, label):
    """
    Build a support vector machine from the data and its corresponding labels
    :param data: a list of tuples, each containing all features of one data point
    :param label: a list of labels for the data
    :return: a trained support vector machine
    """
    trained_svm = svm.SVC(gamma='scale', C=100)
    return trained_svm.fit(data, label)

def build_nb_classifier(data, label):
    """
    Build a naive Bayes classifier from the data and its corresponding labels
    :param data: a list of tuples, each containing all features of one data point
    :param label: a list of labels for the data
    :return: a trained naive Bayes classifier
    """
    classifier = BernoulliNB()
    return classifier.fit(data, label)

def build_rf_classifier(data, label):
    """
    Build a random forest classifier from the data and its corresponding labels
    :param data: a list of tuples, each containing all features of one data point
    :param label: a list of labels for the data
    :return: a trained random forest classifier
    """
    # pipe = make_pipeline(StandardScaler(), RandomForestClassifier())
    # param_grid = {'n_estimators': list(range(1, 30))}
    # gs = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid,
    #                   iid=False, n_jobs=-1, refit=True, scoring='accuracy', cv=10)
    # gs.fit(data, label)
    # n_estimators = gs.best_params_['n_estimators']
    classifier = RandomForestClassifier(n_estimators=34, n_jobs=-1, criterion='gini', class_weight={0: 1, 1: 1.45}, random_state=10)
    return classifier.fit(data, label)

def build_lr_classifier(data, label):
    """
    Build a logistic regression classifier from the data and its corresponding labels
    :param data: a list of tuples, each containing all features of one data point
    :param label: a list of labels for the data
    :return: a trained logistic regression classifier
    """
    classifier = LogisticRegression(solver='newton-cg', n_jobs=-1, class_weight={0: 1, 1: 1.5})
    return classifier.fit(data, label)
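All of the builders above follow the same fit-and-return pattern. A toy usage sketch for the decision tree variant (the single-feature toy data below is made up for illustration; the real `data` is a list of feature tuples produced by the feature identifiers):

```python
# Same pattern as build_decision_tree above: construct, fit, return.
from sklearn import tree

def build_decision_tree(data, label):
    dt_tree = tree.DecisionTreeClassifier()
    return dt_tree.fit(data, label)

# Hypothetical toy data: one binary feature, perfectly separable.
clf = build_decision_tree(data=[[0], [0], [1], [1]], label=[0, 0, 1, 1])
preds = list(clf.predict([[0], [1]]))
```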
January
February
March
April
May
June
July
August
September
October
November
December |
import re

def generate_ngrams(filename, content, n):
    """
    Generate n-grams (with a feature for whether each gram contains "'s") from the content
    :param filename: filename
    :param content: the whole article
    :param n: the maximum size of an n-gram
    :return: the generated list of n-grams, the cleaned single grams, and the single grams with punctuation kept
    """
    sentences = content.split(".")
    index, index2 = 0, 0
    n_grams, single_grams, single_grams2 = [], [], []
    for sentence in sentences:
        sections = sentence.split(",")
        for section in sections:
            parts = section.split(";")
            for part in parts:
                words = part.split()
                single_grams_temp, feature_single_quote_temp = [], []
                for i in range(len(words)):
                    words2 = words[:]
                    words2[i] = re.sub(r'[;@#$()\{\}:"]', '', words2[i])
                    single_grams2.append((words2[i], filename, index2, index2))
                    index2 += 1
                # first clean the data
                for i in range(len(words)):
                    # clean the data by removing special characters
                    words[i] = re.sub(r'[?;!@#$()\{\}:\,\."]', '', words[i])
                    # for the possessive cases 's and s', take off the suffix
                    if len(words[i]) >= 2 and words[i][-2] == "'":
                        words[i] = words[i][:-2]
                        feature_single_quote_temp.append(1)
                    elif len(words[i]) >= 2 and words[i][-2] == "s" and words[i][-1] == "'":
                        words[i] = words[i][:-1]
                        feature_single_quote_temp.append(1)
                    else:
                        feature_single_quote_temp.append(0)
                    single_grams_temp.append((words[i], filename, index, index))
                    index += 1
                n_grams_temp = []  # the return list
                for i in range(len(words)):
                    temp = words[i]
                    for j in range(1, n):
                        if (i + j) < len(words):
                            temp = temp + ' ' + words[i + j]
                            temp_with_first_index = (temp, filename, single_grams_temp[i][2], single_grams_temp[i + j][2], feature_single_quote_temp[i + j])
                            n_grams_temp.append(temp_with_first_index)
                for i in range(len(single_grams_temp)):
                    n_grams_temp.append(single_grams_temp[i] + (feature_single_quote_temp[i],))
                n_grams.extend(n_grams_temp)
                single_grams.extend(single_grams_temp)
    return n_grams, single_grams, single_grams2

def eliminate_all_lower(ngrams):
    """
    Take out n-grams that do not have any capitalised word
    :param ngrams: all n-grams
    :return: all n-grams in which at least one word is capitalised
    """
    new_ngram = []
    for ngram in ngrams:
        for word in ngram[0].split(' '):
            if len(word) > 0 and word[0].isupper():
                new_ngram.append(ngram)
                break
    return new_ngram
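A self-contained demo of the filter above: n-gram tuples whose text (position 0) contains no capitalised word are dropped. The function body is a behaviourally equivalent rewrite of `eliminate_all_lower`, and the tuples are hypothetical:

```python
# Keep only n-grams with at least one capitalised word, as above.
def eliminate_all_lower(ngrams):
    kept = []
    for ngram in ngrams:
        if any(word and word[0].isupper() for word in ngram[0].split(' ')):
            kept.append(ngram)
    return kept

filtered = eliminate_all_lower([('John Smith', 'a.txt', 0, 1),
                                ('the dog', 'a.txt', 2, 3)])
```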
university
college
association
commission
council
laboratory
government
committee
department
school
research
office
affairs
court
corporation
company
agency
organization
group
empire
league
music
hotel
hotels
white house
party
hilton
Walmart
Genk
Brugge
Concert
organisation
prize
rolling stones
White House
Virgin Galactic
Art Brut
amazon
walmart
art
following
club
people
human
guitar
violin |
def take_out_overlapped(ngrams, predict, label):
    """
    Take out any n-gram that is a subset of another n-gram
    :param ngrams: all n-grams, sorted so containing spans come first
    :param predict: predictions aligned with ngrams
    :param label: gold labels aligned with ngrams
    :return: the remaining n-grams, predictions, and labels
    """
    new_ngrams, new_predict, new_label, prev, prev_predict = [], [], [], None, 0
    for element_index in range(len(ngrams)):
        # keep if prev is None, or the filenames differ, or the span is not
        # fully contained in the previously kept, positively predicted span
        if not prev \
                or ngrams[element_index][1] != prev[1] \
                or ngrams[element_index][2] == 0 \
                or prev_predict == 0 \
                or not (prev[2] <= ngrams[element_index][2] <= prev[3]) \
                or not (prev[2] <= ngrams[element_index][3] <= prev[3]):
            prev = ngrams[element_index]
            prev_predict = predict[element_index]
            new_ngrams.append(ngrams[element_index])
            new_predict.append(predict[element_index])
            new_label.append(label[element_index])
    return new_ngrams, new_predict, new_label

def set_predict_value(ngrams, predict):
    # feature indices: 19: start_end_dash, 5: contains_country, 10: contains_prefix,
    # 12: contains_organization, 18: contains_verb, 6: contains_conjunction,
    # 29: start_with_suffix, 30: contains_common_adj
    for element_index in range(len(ngrams)):
        if ngrams[element_index][19] == 1 or ngrams[element_index][5] == 1 \
                or ngrams[element_index][29] == 1 or ngrams[element_index][6] == 1 \
                or ngrams[element_index][10] == 1 or ngrams[element_index][12] == 1 \
                or ngrams[element_index][18] == 1 or ngrams[element_index][30] == 1:
            predict[element_index] = 0
    return predict
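A simplified sketch of the overlap rule in `take_out_overlapped`: assuming the n-grams are sorted so that a containing span arrives before the spans inside it, a span is dropped when it sits entirely inside the previously kept span from the same file and that span was predicted positive. This sketch omits the `start == 0` special case of the original, and the tuples (text, filename, start, end) are hypothetical:

```python
# Drop n-grams fully contained in the previously kept positive span.
def drop_contained(ngrams, predict):
    kept, kept_predict, prev, prev_predict = [], [], None, 0
    for ng, p in zip(ngrams, predict):
        contained = (prev is not None and ng[1] == prev[1]
                     and prev_predict == 1
                     and prev[2] <= ng[2] <= prev[3]
                     and prev[2] <= ng[3] <= prev[3])
        if not contained:
            prev, prev_predict = ng, p
            kept.append(ng)
            kept_predict.append(p)
    return kept, kept_predict

kept, kp = drop_contained(
    [('John A Smith', 'a.txt', 0, 2), ('John A', 'a.txt', 0, 1), ('Smith', 'a.txt', 2, 2)],
    [1, 1, 0])
```

Both sub-spans fall inside the predicted-positive `'John A Smith'` span, so only the full span survives.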
adm | |
atty | |
baz | |
brother | |
capped | |
chief | |
cmdr | |
col | |
dean | |
dr | |
elder | |
father | |
gen | |
gov | |
hon | |
maj | |
msgt | |
mr | |
mrs | |
ms | |
prince | |
prof | |
rabbi | |
rev | |
king | |
queen | |
professor | |
maid | |
madam | |
princess | |
duke | |
duchess | |
baroness | |
baron | |
pope | |
popess | |
president | |
mother | |
saint | |
minister | |
doctor | |
major | |
general | |
marshal | |
officer | |
admiral | |
attorney | |
commander | |
colonel | |
governor | |
honorable | |
mister | |
reverend | |
actor | |
actress | |
writer | |
performer | |
journalism | |
dj | |
star | |
producer | |
engineer | |
coordinator | |
administrator | |
manager | |
agent | |
promoter | |
accompanist | |
bassist | |
busker | |
cellist | |
composer | |
drummer | |
fiddler | |
flautist | |
flutist | |
mpressionist | |
instrumentalist
keyboardist
leader
musician
pianist
player
saxophonist
soloist
timpanist
tuner
virtuoso
guitarist
organist
violinist
trumpeter
trombonist
percussionist
oboist
mandolinist
keytarist
harpsichordist
harpist
clarinetist
bassoonist
bagpiper
accordionist
master
by
winner
nominee
lord
sir
sculptor
uncle
co-star
representative
pilot
cinematographer
named
director
author
lady
maid
junior
stars
farmer
anchorwoman
nephew
newcomer
prodigy
brother
photographer
assistant
journalist
miss
novelist
father
agent
partner
lawyer
reporter
sisters
composer
Major
actor
captain
astronaut
commander
painter
musician
meets
champion
orphan
sheriff
writer
detective
artist
jr
army
attorney
commandant
filmmaker
filmmakers
guardian
ceo
cfo
cto
mayor
st
emperor
senator
administration
senators
representatives
representative
chancellor
dj
secretary
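This title list presumably backs the `contains_prefix`-style feature used in the rule filter. A minimal sketch of how such a dictionary lookup could work; the `preceded_by_title` helper and the inline `TITLES` subset are hypothetical, not the project's actual feature code.

```python
# Hypothetical feature helper: flag an n-gram whose preceding token is an
# honorific or profession from the dictionary above (small inline subset).
TITLES = {"dr", "mrs", "prof", "senator", "director", "guitarist"}

def preceded_by_title(tokens, start):
    """Return 1 if the token just before tokens[start] is a known title."""
    if start == 0:
        return 0
    prev = tokens[start - 1].lower().rstrip(".")
    return 1 if prev in TITLES else 0

tokens = "Dr. Jane Goodall spoke first".split()
print(preceded_by_title(tokens, 1))  # 1: "Dr." precedes "Jane"
print(preceded_by_title(tokens, 3))  # 0: "Goodall" is not a title
```

Lower-casing and stripping the trailing period lets the same dictionary match both "Dr." and "dr".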
after
from
by
for
with
but
of
to
and
before
import os
import re

from unidecode import unidecode


def read_data(articles):
    """
    Read every article file and append it to articles.
    :param articles: a list collecting (filename, text) tuples for all articles
    :return: None
    """
    def files(path):
        """
        Find all article files in a given path and yield their paths.
        :param path: the directory containing the files
        :return: yields (file path, base filename) tuples
        """
        for f in os.listdir(path):
            if len(f.split('.')[0]) == 3 and f.split('.')[1] == "txt" and os.path.isfile(os.path.join(path, f)):
                yield os.path.join(path, f), f.split('.')[0]

    for file_path, filename in files("data"):
        # Python 3: open with an explicit encoding instead of the
        # Python 2 file(...).read().decode("UTF-8") idiom.
        with open(file_path, 'r', encoding='utf-8') as fp:
            articles.append((filename, unidecode(fp.read())))


def data_split(articles):
    """
    Split the data into two sets, sending two articles to training
    for every one sent to testing.
    :param articles: a list of articles
    :return: two lists (training set, testing set)
    """
    train_set, test_set = [], []
    # stop before the end so a trailing group of fewer than 3 articles
    # cannot raise an IndexError
    for i in range(0, len(articles) - 2, 3):
        train_set.append(articles[i])
        train_set.append(articles[i + 1])
        test_set.append(articles[i + 2])
    return train_set, test_set


def label_extraction_takeoff(paragraphs, count, labels=None):
    """
    Strip the <person> and </person> labels and return the paragraph without them.
    :param paragraphs: a (filename, text) tuple whose text carries <person></person> labels
    :param count: running number of labels seen so far
    :param labels: an optional set collecting every labeled name across the input
    :return: the (filename, text) tuple without labels, the updated count, and labels
    """
    LABEL, LABEL_END = "<person>", "</person>"
    index, new_paragraph = 0, ""
    filename = paragraphs[0]
    paragraphs = paragraphs[1]
    while index < len(paragraphs):
        # find the index of the closest LABEL
        found = paragraphs.find(LABEL, index)
        # if a label is found
        if found != -1:
            # find the index (location) of the matching end label
            found_end = paragraphs.find(LABEL_END, found)
            # append the text up to the label, then the labeled name itself
            new_paragraph += paragraphs[index:found] + paragraphs[found + len(LABEL):found_end]
            # if labels is not None, record the name with punctuation stripped
            if labels is not None:
                labels.add(re.sub('[?;!@#$(){}\\,\\."]', '', paragraphs[found + len(LABEL):found_end]))
            # advance past the end label
            index = found_end + len(LABEL_END)
            count += 1
        else:
            new_paragraph += paragraphs[index:]
            break
    return (filename, new_paragraph), count, labels
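The behavior of label_extraction_takeoff can be checked on a small string. This sketch reproduces the equivalent result with two regexes (the sample sentence and names are made up for illustration):

```python
import re

# Collect the annotated names, then strip the <person> markup,
# mirroring what label_extraction_takeoff produces.
text = "Director <person>Jane Doe</person> hired <person>John Roe</person>."
names = sorted(re.findall(r"<person>(.*?)</person>", text))
clean = re.sub(r"</?person>", "", text)
print(names)  # ['Jane Doe', 'John Roe']
print(clean)  # Director Jane Doe hired John Roe.
```

The non-greedy `(.*?)` is what keeps each match from running past the first `</person>`, just as the character-by-character `find` loop does in the function above.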
accept, add, admire, admit, advise, afford, agree, alert, allow, amuse, analyze, announce, annoy, answer, apologize, appear, applaud, appreciate, approve, argue, arrange, arrest, arrive, ask, attach, attack, attempt, attend, attract, avoid, back, bake, balance, ban, bang, bare, bat, bathe, battle, beam, beg, behave, belong, bleach, bless, blind, blink, blot, blush, boast, boil, bolt, bomb, book, bore, borrow, bounce, bow, box, brake, branch, breathe, bruise, brush, bubble, bump, burn, bury, buzz, calculate, call, camp, care, carry, carve, cause, challenge, change, charge, chase, cheat, check, cheer, chew, choke, chop, claim, clap, clean, clear, clip, close, coach, coil, collect, color, comb, command, communicate, compare, compete, complain, complete, concentrate, concern, confess, confuse, connect, consider, consist, contain, continue, copy, correct, cough, count, cover, crack, crash, crawl, cross, crush, cry, cure, curl, curve, cycle, dam, damage, dance, dare, decay, deceive, decide, decorate, delay, delight, deliver, depend, describe, desert, deserve, destroy, detect, develop, disagree, disappear, disapprove, disarm, discover, dislike, divide, double, doubt, drag, drain, dream, dress, drip, drop, drown, drum, dry, dust, earn, educate, embarrass, employ, empty, encourage, end, enjoy, enter, entertain, escape, examine, excite, excuse, exercise, exist, expand, expect, explain, explode, extend, face, fade, fail, fancy, fasten, fax, fear, fence, fetch, file, fill, film, fire, fit, fix, flap, flash, float, flood, flow, flower, fold, follow, fool, force, form, found, frame, frighten, fry, gather, gaze, glow, glue, grab, grate, grease, greet, grin, grip, groan, guarantee, guard, guess, guide, hammer, hand, handle, hang, happen, harass, harm, hate, haunt, head, heal, heap, heat, help, hook, hop, hope, hover, hug, hum, hunt, hurry, identify, ignore, imagine, impress, improve, include, increase, influence, inform, inject, injure, instruct, intend, interest,
interfere, interrupt, introduce, invent, invite, irritate, itch, jail, jam, jog, join, joke, judge, juggle, jump, kick, kill, kiss, kneel, knit, knock, knot, label, land, last, laugh, launch, learn, level, license, lick, lie, lighten, like, list, listen, live, load, lock, long, look, love, man, manage, march, mark, marry, match, mate, matter, measure, meddle, melt, memorize, mend, mess up, milk, mine, miss, mix, moan, moor, mourn, move, muddle, mug, multiply, murder, nail, name, need, nest, nod, note, notice, number, obey, object, observe, obtain, occur, offend, offer, open, order, overflow, owe, own, pack, paddle, paint, park, part, pass, paste, pat, pause, peck, pedal, peel, peep, perform, permit, phone, pick, pinch, pine, place, plan, plant, play, please, plug, point, poke, polish, pop, possess, post, pour, practice, pray, preach, precede, prefer, prepare, present, preserve, press, pretend, prevent, prick, print, produce, program, promise, protect, provide, pull, pump, punch, puncture, punish, push, question, queue, race, radiate, rain, raise, reach, realize, receive, recognize, record, reduce, reflect, refuse, regret, reign, reject, rejoice, relax, release, rely, remain, remember, remind, remove, repair, repeat, replace, reply, report, reproduce, request, rescue, retire, return, rhyme, rinse, risk, rob, rock, roll, rot, rub, ruin, rule, rush, sack, sail, satisfy, save, saw, scare, scatter, scold, scorch, scrape, scratch, scream, screw, scribble, scrub, seal, search, separate, serve, settle, shade, share, shave, shelter, shiver, shock, shop, shrug, sigh, sign, signal, sin, sip, ski, skip, slap, slip, slow, smash, smell, smile, smoke, snatch, sneeze, sniff, snore, snow, soak, soothe, sound, spare, spark, sparkle, spell, spill, spoil, spot, spray, sprout, squash, squeak, squeal, squeeze, stain, stamp, stare, start, stay, steer, step, stir, stitch, stop, store, strap, strengthen, stretch, strip, stroke, stuff, subtract, succeed, suck, suffer, suggest,
suit, supply, support, suppose, surprise, surround, suspect, suspend, switch, talk, tame, tap, taste, tease, telephone, tempt, terrify, test, thank, thaw, tick, tickle, tie, time, tip, tire, touch, tour, tow, trace, trade, train, transport, trap, travel, treat, tremble, trick, trip, trot, trouble, trust, try, tug, tumble, turn, twist, type, undress, unfasten, unite, unlock, unpack, untidy, use, vanish, visit, wail, wait, walk, wander, want, warm, warn, wash, waste, watch, water, wave, weigh, welcome, whine, whip, whirl, whisper, whistle, wink, wipe, wish, wobble, wonder, work, worry, wrap, wreck, wrestle, wriggle, x-ray, yawn, yell, zip, zoom, accepted, added, admired, admitted, advised, afforded, agreed, alerted, allowed, amused, analyzed, announced, annoyed, answered, apologized, appeared, applauded, appreciated, approved, argued, arranged, arrested, arrived, asked, attached, attacked, attempted, attended, attracted, avoided, backed, baked, balanced, banned, banged, bared, batted, bathed, battled, beamed, begged, behaved, belonged, bleached, blessed, blinded, blinked, blotted, blushed, boasted, boiled, bolted, bombed, booked, bored, borrowed, bounced, bowed, boxed, braked, branched, breathed, bruised, brushed, bubbled, bumped, burned, buried, buzzed, calculated, called, camped, cared, carried, carved, caused, challenged, changed, charged, chased, cheated, checked, cheered, chewed, choked, chopped, claimed, clapped, cleaned, cleared, clipped, closed, coached, coiled, collected, colored, combed, commanded, communicated, compared, competed, complained, completed, concentrated, concerned, confessed, confused, connected, considered, consisted, contained, continued, copied, corrected, coughed, counted, covered, cracked, crashed, crawled, crossed, crushed, cried, cured, curled, curved, cycled, dammed, damaged, danced, dared, decayed, deceived, decided, decorated, delayed, delighted, delivered, depended, described, deserted, deserved, destroyed, detected,
developed, disagreed, disappeared, disapproved, disarmed, discovered, disliked, divided, doubled, doubted, dragged, drained, dreamed, dressed, dripped, dropped, drowned, drummed, dried, dusted, earned, educated, embarrassed, employed, emptied, encouraged, ended, enjoyed, entered, entertained, escaped, examined, excited, excused, exercised, existed, expanded, expected, explained, exploded, extended, faced, faded, failed, fancied, fastened, faxed, feared, fenced, fetched, filed, filled, filmed, fired, fitted, fixed, flapped, flashed, floated, flooded, flowed, flowered, folded, followed, fooled, forced, formed, founded, framed, frightened, fried, gathered, gazed, glowed, glued, grabbed, grated, greased, greeted, grinned, gripped, groaned, guaranteed, guarded, guessed, guided, hammered, handed, handled, hanged, happened, harassed, harmed, hated, haunted, headed, healed, heaped, heated, helped, hooked, hopped, hoped, hovered, hugged, hummed, hunted, hurried, identified, ignored, imagined, impressed, improved, included, increased, influenced, informed, injected, injured, instructed, intended, interested, interfered, interrupted, introduced, invented, invited, irritated, itched, jailed, jammed, jogged, joined, joked, judged, juggled, jumped, kicked, killed, kissed, kneeled, knitted, knocked, knotted, labeled, landed, lasted, laughed, launched, learned, leveled, licensed, licked, lied, lightened, liked, listed, listened, lived, loaded, locked, longed, looked, loved, manned, managed, marched, marked, married, matched, mated, mattered, measured, meddled, melted, memorized, mended, messed up, milked, mined, missed, mixed, moaned, moored, mourned, moved, muddled, mugged, multiplied, murdered, nailed, named, needed, nested, nodded, noted, noticed, numbered, obeyed, objected, observed, obtained, occurred, offended, offered, opened, ordered, overflowed, owed, owned, packed, paddled, painted, parked, parted, passed, pasted, patted, paused, pecked, pedaled, peeled, peeped, performed,
permitted, phoned, picked, pinched, pined, placed, planned, planted, played, pleased, plugged, pointed, poked, polished, popped, possessed, posted, poured, practiced, prayed, preached, preceded, preferred, prepared, presented, preserved, pressed, pretended, prevented, pricked, printed, produced, programmed, promised, protected, provided, pulled, pumped, punched, punctured, punished, pushed, questioned, queued, raced, radiated, rained, raised, reached, realized, received, recognized, recorded, reduced, reflected, refused, regretted, reigned, rejected, rejoiced, relaxed, released, relied, remained, remembered, reminded, removed, repaired, repeated, replaced, replied, reported, reproduced, requested, rescued, retired, returned, rhymed, rinsed, risked, robbed, rocked, rolled, rotted, rubbed, ruined, ruled, rushed, sacked, sailed, satisfied, saved, sawed, scared, scattered, scolded, scorched, scraped, scratched, screamed, screwed, scribbled, scrubbed, sealed, searched, separated, served, settled, shaded, shared, shaved, sheltered, shivered, shocked, shopped, shrugged, sighed, signed, signaled, sinned, sipped, skied, skipped, slapped, slipped, slowed, smashed, smelled, smiled, smoked, snatched, sneezed, sniffed, snored, snowed, soaked, soothed, sounded, spared, sparked, sparkled, spelled, spilled, spoiled, spotted, sprayed, sprouted, squashed, squeaked, squealed, squeezed, stained, stamped, stared, started, stayed, steered, stepped, stirred, stitched, stopped, stored, strapped, strengthened, stretched, stripped, stroked, stuffed, subtracted, succeeded, sucked, suffered, suggested, suited, supplied, supported, supposed, surprised, surrounded, suspected, suspended, switched, talked, tamed, tapped, tasted, teased, telephoned, tempted, terrified, tested, thanked, thawed, ticked, tickled, tied, timed, tipped, tired, touched, toured, towed, traced, traded, trained, transported, trapped, traveled, treated, trembled, tricked, tripped, trotted, troubled, trusted, tried,
tugged, tumbled, turned, twisted, typed, undressed, unfastened, united, unlocked, unpacked, untidied, used, vanished, visited, wailed, waited, walked, wandered, wanted, warmed, warned, washed, wasted, watched, watered, waved, weighed, welcomed, whined, whipped, whirled, whispered, whistled, winked, wiped, wished, wobbled, wondered, worked, worried, wrapped, wrecked, wrestled, wriggled, yawned, yelled, zipped, zoomed, accepts, adds, admires, admits, advises, affords, agrees, alerts, allows, amuses, analyzes, announces, annoys, answers, apologizes, appears, applauds, appreciates, approves, argues, arranges, arrests, arrives, asks, attaches, attacks, attempts, attends, attracts, avoids, backs, bakes, balances, bans, bangs, bares, bats, bathes, battles, beams, begs, behaves, belongs, bleaches, blesses, blinds, blinks, blots, blushes, boasts, boils, bolts, bombs, books, bores, borrows, bounces, bows, boxes, brakes, branches, breathes, bruises, brushes, bubbles, bumps, burns, buries, buzzes, calculates, calls, camps, cares, carries, carves, causes, challenges, changes, charges, chases, cheats, checks, cheers, chews, chokes, chops, claims, claps, cleans, clears, clips, closes, coaches, coils, collects, colors, combs, commands, communicates, compares, competes, complains, completes, concentrates, concerns, confesses, confuses, connects, considers, consists, contains, continues, copies, corrects, coughs, counts, covers, cracks, crashes, crawls, crosses, crushes, cries, cures, curls, curves, cycles, dams, damages, dances, dares, decays, deceives, decides, decorates, delays, delights, delivers, depends, describes, deserts, deserves, destroys, detects, develops, disagrees, disappears, disapproves, disarms, discovers, dislikes, divides, doubles, doubts, drags, drains, dreams, dresses, drips, drops, drowns, drums, dries, dusts, earns, educates, embarrasses, employs, empties, encourages, ends, enjoys, enters, entertains, escapes, examines, excites, excuses, exercises,
exists, expands, expects, explains, explodes, extends, faces, fades, fails, fancies, fastens, faxes, fears, fences, fetches, files, fills, films, fires, fits, fixes, flaps, flashes, floats, floods, flows, flowers, folds, follows, fools, forces, forms, founds, frames, frightens, fries, gathers, gazes, glows, glues, grabs, grates, greases, greets, grins, grips, groans, guarantees, guards, guesses, guides, hammers, hands, handles, hangs, happens, harasses, harms, hates, haunts, heads, heals, heaps, heats, helps, hooks, hops, hopes, hovers, hugs, hums, hunts, hurries, identifies, ignores, imagines, impresses, improves, includes, increases, influences, informs, injects, injures, instructs, intends, interests, interferes, interrupts, introduces, invents, invites, irritates, itches, jails, jams, jogs, joins, jokes, judges, juggles, jumps, kicks, kills, kisses, kneels, knits, knocks, knots, labels, lands, lasts, laughs, launches, learns, levels, licenses, licks, lies, lightens, likes, lists, listens, lives, loads, locks, longs, looks, loves, mans, manages, marches, marks, marries, matches, mates, matters, measures, meddles, melts, memorizes, mends, messes up, milks, mines, misses, mixes, moans, moors, mourns, moves, muddles, mugs, multiplies, murders, nails, names, needs, nests, nods, notes, notices, numbers, obeys, objects, observes, obtains, occurs, offends, offers, opens, orders, overflows, owes, owns, packs, paddles, paints, parks, parts, passes, pastes, pats, pauses, pecks, pedals, peels, peeps, performs, permits, phones, picks, pinches, pines, places, plans, plants, plays, pleases, plugs, points, pokes, polishes, pops, possesses, posts, pours, practices, prays, preaches, precedes, prefers, prepares, presents, preserves, presses, pretends, prevents, pricks, prints, produces, programs, promises, protects, provides, pulls, pumps, punches, punctures, punishes, pushes, questions, queues, races, radiates, rains, raises, reaches, realizes, receives, recognizes,
records, reduces, reflects, refuses, regrets, reigns, rejects, rejoices, relaxes, releases, relies, remains, remembers, reminds, removes, repairs, repeats, replaces, replies, reports, reproduces, requests, rescues, retires, returns, rhymes, rinses, risks, robs, rocks, rolls, rots, rubs, ruins, rules, rushes, sacks, sails, satisfies, saves, saws, scares, scatters, scolds, scorches, scrapes, scratches, screams, screws, scribbles, scrubs, seals, searches, separates, serves, settles, shades, shares, shaves, shelters, shivers, shocks, shops, shrugs, sighs, signs, signals, sins, sips, skis, skips, slaps, slips, slows, smashes, smells, smiles, smokes, snatches, sneezes, sniffs, snores, snows, soaks, soothes, sounds, spares, sparks, sparkles, spells, spills, spoils, spots, sprays, sprouts, squashes, squeaks, squeals, squeezes, stains, stamps, stares, starts, stays, steers, steps, stirs, stitches, stops, stores, straps, strengthens, stretches, strips, strokes, stuffs, subtracts, succeeds, sucks, suffers, suggests, suits, supplies, supports, supposes, surprises, surrounds, suspects, suspends, switches, talks, tames, taps, tastes, teases, telephones, tempts, terrifies, tests, thanks, thaws, ticks, tickles, ties, times, tips, tires, touches, tours, tows, traces, trades, trains, transports, traps, travels, treats, trembles, tricks, trips, trots, troubles, trusts, tries, tugs, tumbles, turns, twists, types, undresses, unfastens, unites, unlocks, unpacks, untidies, uses, vanishes, visits, wails, waits, walks, wanders, wants, warms, warns, washes, wastes, watches, waters, waves, weighs, welcomes, whines, whips, whirls, whispers, whistles, winks, wipes, wishes, wobbles, wonders, works, worries, wraps, wrecks, wrestles, wriggles, x-rays, yawns, yells, zips, zooms, beats, becomes, begins, bends, bets, bids, blows, breaks, brings, builds, burns, buys, catches, chooses, comes, costs, cuts, digs, dives, does, draws, dreams, drives, drinks, eats, falls, feels, fights, finds, flies, 
forgets, forgives, gets, gives, goes, grows, hangs, hears, hides, hurts, keeps, knows, lays, leads, leaves, lends, lets, loses, makes, means, meets, pays, puts, reads, rides, rings, rises, runs, says, sees, sells, sends, shows, shuts, sings, sits, sleeps, speaks, spends, stands, swims, takes, teaches, tears, tells, thinks, throws, understands, wakes, wears, wins, writes
who
whose
whom