sandersyen/common_adj.txt

## common_adj.txt
able
bad
best
better
big
black
certain
clear
different
early
easy
economic
federal
free
full
good
great
hard
high
human
important
international
large
late
little
local
long
low
major
military
national
new
old
only
other
political
possible
public
real
recent
right
small
social
special
strong
sure
true
white
whole
young
other
new
good
high
old
great
big
American
small
large
national
different
black
long
little
important
political
bad
white
real
best
right
social
only
public
sure
low
early
able
human
local
late
hard
major
better
economic
strong
possible
whole
free
military
true
federal
international
full
special
easy
clear
recent
certain
personal
open
red
difficult
available
likely
short
single
medical
current
wrong
private
past
foreign
fine
common
poor
natural
significant
similar
hot
dead
central
happy
serious
ready
simple
left
physical
general
environmental
financial
blue
democratic
dark
various
entire
close
legal
religious
cold
final
main
green
nice
huge
popular
traditional
cultural

## common_name.txt
Oliver
Jake
Noah
James
Jack
Connor
Liam
John
Harry
Callum
Mason
Robert
Jacob
Michael
Charlie
Kyle
William
Williams
Thomas
Shawn
Joe
Ethan
David
George
Reece
Michael
Richard
Oscar
Rhys
Alexander
Joseph
James
Charlie
James
Charles
Damian
Daniel
Thomas
Amelia
Margaret
Emma
Mary
Olivia
Samantha
Patricia
Isla
Bethany
Sophia
Jennifer
Emily
Elizabeth
Isabella
Elizabeth
Poppy
Joanne
Ava
Linda
Megan
Mia
Barbara
Isabella
Victoria
Susan
Jessica
Lauren
Abigail
Margaret
Lily
Michelle
Madison
Jessica
Sophie
Cooper
Tracy
Charlotte
Sarah
Murphy
Li
Smith
Jones
O'Kelly
Johnson
Jones
Wilson
O'Sullivan
Lam
Brown
Walsh
Martin
Taylor
Jones
Gelbero
Wilson
Taylor
Davies
O'Brien
Miller
Roy
Taylor
Byrne
Davis
Tremblay
Morton
Singh
Evans
O'Ryan
Garcia
Lee
White
Wang
Thomas
O'Connor
Rodriguez
Gagnon
Martin
Anderson
Roberts
O'Neill
Anderson
Clark
Wright
Mitchell
Johnson
Rodriguez
Lopez
Perez
Jackson
Lewis
Hill
Roberts
Jones
White
Scott
Turner
Brown
Harris
Walker
Green
Phillips
Hall
Adams
Campbell
Miller
Allen
Baker
Parker
Garcia
Young
Gonzalez
Evans
Moore
Martinez
Hernandez
Nelson
Edwards
Taylor
Robinson
Carter
Collins
George
Ronald
John
Richard
Kenneth
Anthony
Charles
Paul
Steven
Michael
Joseph
Mark
Thomas
Donald
Brian
Jeff
Mary
Jennifer
Lisa
Sandra
Michelle
Patricia
Maria
Nancy
Donna
Laura
Linda
Susan
Karen
Carol
Sarah
Barbara
Margaret
Betty
Ruth
Kimberly
Elizabeth
Dorothy
Helen
Sharon
Deborah
Sanders
Joy
Sean
Walton
Reznor
Antonio
Trump
Julia
Blair
Nobel
Johann
Ann
Lindsay
Laura
Sam
Kelly
Bill
Maya
Adriana
Lola
Ingrid
Clare
Emma
Isabella
Abigail
Charlotte
Lillian
Hannah
Samantha
Caroline
Sheeran
Madelyn
Kate
Hayes
Arianna
Maggie
Audrey
Luis
Paolo
Oliver
Emilio
Gustav
Tyler
Taylor
Javier
Kristian
Henrik
Stefan
Etienne
Johnson
Ferdinand
Hector
Catlin
Hugo
Ali
Raymond
Xavier
Harry
Potter
Evan
Elvis
Harrison
Jasper
Hitler
<<<<<<< HEAD
Scott
=======
John
Patricia
Robert
Linda
Richard
Susan
Joseph
Jessica
Thomas
Sarah
Charles
Margaret
Christopher
Daniel
Nancy
Matthew
Lisa
Anthony
Betty
Donald
Dorothy
Paul
Ashley
Andrew
Donna
Kenneth
Carol
Joshua
Amanda
Brian
Melissa
Deborah
Ronald
Stephanie
Timothy
Rebecca
Jeffrey
Helen
Sharon
Gary
Kathleen
Nicholas
Amy
Eric
Shirley
Angela
Larry
Justin
Brenda
Scott
Pamela
Nicole
Frank
Katherine
Benjamin
Samantha
Gregory
Christine
Samuel
Virginia
Rachel
Jack
Janet
Dennis
Jerry
Carolyn
Maria
Aaron
Heather
Jose
Julie
Douglas
Joyce
Peter
Evelyn
Nathan
Victoria
Zachary
Walter
Christina
Kyle
Lauren
Harold
Frances
Carl
Martha
Judith
Gerald
Cheryl
Keith
Megan
Roger
Andrea
Arthur
Olivia
Terry
Ann
Jacqueline
Ethan
Austin
Doris
Kathryn
Albert
Gloria
Jesse
Teresa
Willie
Sara
Billy
Janice
Marie
Bruce
Noah
Jordan
Judy
Dylan
Theresa
Ralph
Madison
Roy
Beverly
Alan
Denise
Wayne
Marilyn
Eugene
Amber
Juan
Danielle
Gabriel
Rose
Louis
Brittany
Russell
Diana
Randy
Abigail
Vincent
Natalie
Philip
Jane
Logan
Lori
Bobby
Alexis
Tiffany
Johnny
Kayla
Boccaccio
Gruber
Huber
Bauer
Wagner
Pichler
Steiner
Moser
Mayer
Hofer
Leitner
Berger
Fuchs
Eder
Fischer
Schmid
Winkler
Weber
Schwarz
Maier
Schneider
Reiter
Mayr
Schmidt
Wimmer
Egger
Brunner
Lang
Baumgartner
Auer
Binder
Lechner
Wolf
Wallner
Aigner
Ebner
Koller
Lehner
Haas
Schuster
Heilig
Peeters
Janssens
Maes
Jacobs
Mertens
Willems
Claes
Goossens
Wouters
Dubois
Lambert
Dupont
Martin
Simon
Nielsen
Jensen
Hansen
Pedersen
Andersen
Christensen
Larsen
Rasmussen
Petersen
Madsen
Kristensen
Olsen
Thomsen
Christiansen
Poulsen
Johansen
Mortensen
Joensen
Hansen
Jacobsen
Olsen
Poulsen
Petersen
Johannesen
Thomsen
Nielsen
Johansen
Rasmussen
Simonsen
Djurhuus
Jensen
Danielsen
Mortensen
Mikkelsen
Dam
Andreasen
Johansson
Nyman
Lindholm
Karlsson
Andersson
Hendriks

## conjunctions.txt
or
but
nor
so
for
yet
after
although
as
as if
as long as
because
before
even if
even though
once
since
so that
though
till
unless
until
what
when
whenever
wherever
whether
while
why
if
after
from
by
for
with
but
of
to
and
before
how
which
a
an
the
these
our
i
he
she
they
there
are
is
be
you
able
about
across
all
almost
also
am
among
any
at
been
best
can
cannot
could
dear
did
do
does
either
else
ever
every
get
got
have
has
had
her
hers
him
his
however
in
into
it
its
just
least
let
like
likely
other
rather
me
might
most
must
my
neither
not
nor
often
off
on
only
should
some
then
that
their
then
this
too
us
we
who
whom
would
yet
here
there
bbc
abc
news
maybe
perhaps
man
men
woman
women
Out
yes
no
in
out

## countries.txt
Afghanistan
Albania
Algeria
America
Andorra
Angola
Antigua
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Russians
Europeans
Benin
Bhutan
Bissau
Bolivia
Bosnia
Botswana
Brazil
British
Britan
Brunei
Bulgaria
Burkina
Burma
Burundi
Cambodia
Cameroon
Canada
Cape Verde
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo
Costa Rica
country debt
Croatia
Cuba
Cyprus
Czech
Denmark
Djibouti
Dominica
East Timor
Ecuador
Egypt
El Salvador
Emirate
England
Eritrea
Estonia
Ethiopia
Russian
Fiji
Finland
France
Gabon
Gambia
Georgia
French
Germany
Ghana
Great Britain
Europe
European
Britain
Greece
Grenada
Grenadines
Guatemala
Guinea
Guyana
Haiti
Herzegovina
Honduras
Hungary
Iceland
in usa
India
Indian
Indonesia
Iran
Iraq
Ireland
Israel
Italy
Ivory Coast
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Korea
Kosovo
Kuwait
Kyrgyzstan
Laos
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
Macedonia
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Marshall
Mauritania
Mauritius
Mexico
Micronesia
Moldova
Monaco
Mongolia
Montenegro
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands
New Zealand
Nicaragua
Niger
Nigeria
Norway
Oman
Pakistan
Palau
Panama
Papua
Paraguay
Peru
Philippines
Poland
Portugal
Qatar
Romania
Russia
Rwanda
Samoa
San Marino
Sao Tome
Saudi Arabia
scotland
scottish
Senegal
Serbia
Seychelles
Sierra Leone
Singapore
Slovakia
Slovenia
Solomon
Somalia
South Africa
Africa
South Sudan
Spain
Sri Lanka
St. Kitts
St. Lucia
St Kitts
St Lucia
Saint Kitts
Santa Lucia
Sudan
Suriname
Swaziland
Sweden
Switzerland
Syria
Taiwan
Tajikistan
Tanzania
Thailand
Tobago
Togo
Tonga
Trinidad
Tunisia
Turkey
Turkmenistan
Tuvalu
Uganda
Ukraine
United Kingdom
United States
Uruguay
USA
US
UK
Uzbekistan
Vanuatu
Vatican
Venezuela
Vietnam
wales
welsh
Yemen
Zambia
Zimbabwe
Afghan
Albanian
Algerian
American
Andorran
Angolan
Antiguans
Argentinean
Armenian
Australian
Austrian
Azerbaijani
Bahamian
Bahraini
Bangladeshi
Barbadian
Barbudans
Batswana
Belarusian
Belgian
Bourgeoi
Bourgeoisie
Belizean
Beninese
Bhutanese
Bolivian
Beverly Hills
Bosnian
Brazilian
British
Bruneian
Bulgarian
Burkinabe
Burmese
Burundian
Cambodian
Cameroonian
Canadian
Cape Verdean
Central African
Chadian
Chilean
Chinese
Colombian
Comoran
Congolese
Costa Rican
Croatian
Cuban
Cypriot
Czech
Danish
Djibouti
Dominican
Dutch
East Timorese
Ecuadorean
Egyptian
Emirian
Equatorial Guinean
Eritrean
Estonian
Ethiopian
Fijian
Filipino
Finnish
French
Gabonese
Gambian
Georgian
German
Ghanaian
Greek
Grenadian
Guatemalan
Guinea-Bissauan
Guinean
Guyanese
Haitian
Herzegovinian
Honduran
Hungarian
I-Kiribati
Icelander
Indian
Indonesian
Iranian
Iraqi
Irish
Israeli
Italian
Ivorian
Jamaican
Japanese
Jordanian
Kazakhstani
Kenyan
Kittian
Nevisian
Kuwaiti
Kyrgyz
Laotian
Latvian
Lebanese
Liberian
Libyan
Liechtensteiner
Lithuanian
Luxembourger
Macedonian
Malagasy
Malawian
Malaysian
Maldivian
Malian
Maltese
Marshallese
Mauritanian
Mauritian
Mexican
Micronesian
Moldovan
Monacan
Mongolian
Moroccan
Mosotho
Motswana
Mozambican
Namibian
Nauruan
Nepalese
New Zealander
Ni-Vanuatu
Nicaraguan
Nigerian
Nigerien
North Korean
Northern Irish
Norwegian
Omani
Pakistani
Palauan
Panamanian
Papua New Guinean
Paraguayan
Peruvian
Polish
Portuguese
Qatari
Romanian
Russian
Rwandan
Saint Lucian
Salvadoran
Samoan
San Marinese
Sao Tomean
Saudi
Scottish
Senegalese
Serbian
Seychellois
Sierra Leonean
Singaporean
Slovakian
Slovenian
Solomon Islander
Somali
South African
South Korean
Spanish
Sri Lankan
Sudanese
Surinamer
Swazi
Swedish
Swiss
Syrian
Taiwanese
Tajik
Tanzanian
Thai
Togolese
Tongan
Trinidadian
Tobagonian
Tunisian
Turkish
Tuvaluan
Ugandan
Ukrainian
Uruguayan
Uzbekistani
Uzbekistan
Venezuelan
Vietnamese
Welsh
Yemenite
Zambian
Zimbabwean
Monday
Tuesday
Wednesday
Thursday
Friday
Saturday
Sunday
Beijing
Chicago
Taoyuan
San Antonio
Toronto
New York
English
Pennsylvania
South Carolina
Texas
Wisconsin
St Paul
London
Soho
Brexit
Britain
Manchester
Middle Eastern
Taipei
Vienna
EU
Yemeni
Europe
European
South America
South American
Asia
Asian
Oceania
Oceanian
Africa
African
Antartica
Pacific
Atlantic
Mediterranean
Scot
Scots
Korean
California
Swedes
Swede
Zurich
Yemenis
Western
Chicago
northeast
southeast
southwest
northwest
northern
western
eastern
sourthern
States
state
Limburger
Limburgers
Country
Countries
City
Cities
County
Counties
York
Madison

## featureIdentifier.py
def load_who_file():
    """
    Read file whos.txt and insert data to a hash set
    :return: the hash set with all whos from file whos.txt
    """
    return set([who.strip('\n').lower() for who in file("data/whos.txt", 'r').readlines()])

def load_common_name_file():
    """
    Read file common_name.txt and insert data to a hash set
    :return: the hash set with all whos from file common_name.txt
    """
    return set([common_name.strip('\n').lower() for common_name in file("data/common_name.txt", 'r').readlines()])

def load_common_adj_file():
    """
    Read file common_adj.txt and insert data to a hash set
    :return: the hash set with all whos from file common_adj.txt
    """
    return set([common_adj.strip('\n').lower() for common_adj in file("data/common_adj.txt", 'r').readlines()])

def load_country_file():
    """
    Read file countries.txt and insert data to a hash set
    :return: the hash set with all countries name from file countries.txt
    """
    return set([country.strip('\n').lower() for country in file("data/countries.txt", 'r').readlines()])

def load_conjunction_file():
    """
    Read file conjunctions.txt and insert data to a hash set
    :return: the hash set with all conjunctions from file conjunctions.txt
    """
    return set([conjunctions.strip('\n').lower() for conjunctions in file("data/conjunctions.txt", 'r').readlines()])

def load_prefix_library():
    """
    Generate a hash set with all prefixes
    :return: the hash set with all prefixes
    """
    return set([prefix.strip('\n').lower() for prefix in file("data/prefix.txt", 'r').readlines()])

def load_organ_library():
    """
    Generate a hash set with all organization titles
    :return: the hash set with all organization titles
    """
    return set([organ.strip('\n').lower() for organ in file("data/organization.txt", 'r').readlines()])

def load_month_file():
    """
    Generate a hash set with all months
    :return: the hash set with all months
    """
    return set([month.strip('\n').lower() for month in file("data/month.txt", 'r').readlines()])

def load_verb_file():
    """
    Read files irregular_verbs.txt and regular_verbs.txt and insert data to a hash set
    :return: the hash set with all verbs from files
    """
    return set(file("data/irregular_verbs.txt", 'r').read().split(', ')) |\
           set(file("data/regular_verbs.txt", 'r').read().split(', '))

def load_preposition_file():
    """
    Read files preposition.txt and insert data to a hash set
    :return: the hash set with all preposition from files
    """
    return set(open("data/preposition.txt", 'r').read().split(', '))

def contains_country(ngram, country_set):
    """
    Identify if a n-gram has countries, return 1 (has feature) 0 (no such feature)
    :param ngram: a ngram
    :param country_set: a set contains all countries
    :return: 1 (has feature) or 0 (no such feature)
    """
    # find if any word in current ngram has country name
    words = ngram[0].split(' ')
    for word in words:
        if word.lower() in country_set:
            return 1
    if len(words) >= 2:
        for i in range(1, len(words)):
            if (words[i-1]+' '+words[i]).lower() in country_set:
                return 1
    if len(words) >= 3:
        for i in range(2, len(words)):
            if (words[i-2]+' '+words[i-1]+' '+words[i]).lower() in country_set:
                return 1
    return 0


def contains_common_name(ngram, common_name_set):
    """
    Identify if a n-gram has common name, return 1 (has feature) 0 (no such feature)
    :param ngram: a ngram
    :param common_name_set: a set contains all common name
    :return: 1 (has feature) or 0 (no such feature)
    """
    # find if any word in current ngram has common name
    for word in ngram[0].split(' '):
        if word.lower() in common_name_set:
            return 1
    return 0

def contains_common_adj(ngram, common_adj_set):
    """
    Identify if a n-gram has common adj, return 1 (has feature) 0 (no such feature)
    :param ngram: a ngram
    :param common_name_set: a set contains all common adj
    :return: 1 (has feature) or 0 (no such feature)
    """
    # find if any word in current ngram has common adj
    for word in ngram[0].split(' '):
        if word.lower() in common_adj_set:
            return 1
    return 0

def contains_prefix(ngram, prefix_set):
    """
    Identify if a n-gram has prefix, return 1 (has feature) 0 (no such feature)
    :param ngram: a ngram
    :param prefix_set: a set contains all prefixes
    :return: 1 (has feature) or 0 (no such feature)
    """
    # find if any word in current ngram has country name
    for word in ngram[0].split(' '):
        if word.lower() in prefix_set:
            return 1
    return 0

def contains_month(ngram, month_set):
    """
    Identify if a n-gram has month, return 1 (has feature) 0 (no such feature)
    :param ngram: a ngram
    :param month_set: a set contains all months
    :return: 1 (has feature) or 0 (no such feature)
    """
    # find if any word in current ngram has country name
    for word in ngram[0].split(' '):
        if word.lower() in month_set:
            return 1
    return 0

def contains_organization(ngram, organ_set):
    """
    Identify if a n-gram has organization titles, return 1 (has feature) 0 (no such feature)
    :param ngram: a ngram
    :param organ_set: a set contains commom organization titles
    :return: 1 (has feature) or 0 (no such feature)
    """
    # find if any word in current ngram has country name
    words = ngram[0].split(' ')
    for word in words:
        if word.lower() in organ_set:
            return 1

    if len(words) >= 2:
        for i in range(1, len(words)):
            if (words[i-1]+' '+words[i]).lower() in organ_set:
                return 1
    return 0

def contains_conjunction(ngram, conjunctions_set):
    """
    Identify if a n-gram has conjunctions, return 1 (has feature) 0 (no such feature)
    :param ngram: a ngram
    :param conjunctions_set: a set contains all conjunctions
    :return: 1 (has feature) 0 (no such feature)
    """
    # find if any word in current ngram has country name
    for word in ngram[0].split(' '):
        if word.lower() in conjunctions_set:
            return 1
    return 0

def contains_verb(ngram, verb_set):
    """
    Identify if a n-gram has conjunctions, return 1 (has feature) 0 (no such feature)
    :param ngram: a ngram
    :param conjunctions_set: a set contains all conjunctions
    :return: 1 (has feature) 0 (no such feature)
    """
    # find if any word in current ngram has country name
    for word in ngram[0].split(' '):
        if word.lower() in verb_set:
            return 1
    return 0

def is_all_upper(ngram):
    """
    Check all the words in the content if all words start with upper case
    :param ngram: a ngram
    :return: 1 (has feature) 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if len(word) > 0 and word[0].islower():
            return 0
    return 1

def has_who(ngram, who_set):
    """
    Check all the words in the content if it has who
    :param ngram: a ngram
    :return: 1 (has feature) 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.lower() in who_set:
            return 1
    return 0

def no_more_than_one_lower(ngram):
    """
    Check all the words in the content if all words has less than 2 lower case at each starting letter
    :param ngram: a ngram
    :return: 1 (has feature) 0 (no such feature)
    """
    count = 0
    for word in ngram[0].split(' '):
        if word.islower():
            count += 1
            if count > 1:
                return 0
    return 1

def has_prefix_before_ngram(ngram, single_grams, prefix_set):
    """
    Check if the word in front of the input ngram is a prefix for name
    :param ngram: a n-gram
    :param single_grams: all words in an article with order
    :param prefix_set: a set contains all prefixes
    :return: 1 (has feature) 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams[ngram[2] - 1][0].lower()
        if preWord in prefix_set:
            return 1
    return 0

def has_human_verb(ngram, single_grams, verb_set):
    """
    Check if the word after the input ngram is a verb usually used for human
    :param ngram: a n-gram
    :param single_grams: all words in an article with order
    :param verb_set: a set contains all verbs usually used for human
    :return: 1 (has feature) 0 (no such feature)
    """
    ngram_end_index = ngram[3]
    if (ngram_end_index + 1) < len(single_grams):
        if single_grams[ngram_end_index+1][0] in verb_set:
            return 1
    return 0

def features_label_separator(ngrams, labels_set=None):
    """
    Separate features and label from n-grams and return two lists
    :param ngrams: all n-grams from all articles
    :param labels_set: the hash set of all labels
    :return: two lists -- features and label from n-grams
    """
    features = [ngram[4:] for ngram in ngrams]
    label = [1 if ngram[0] in labels_set else 0 for ngram in ngrams] if labels_set else []
    return features, label

def afterpreposition(ngram, single_grams, preposition_set):
    """
    Check if the word in front of the input ngram is a preposition
    :param ngram: a n-gram
    :param single_grams: all words in an article with order
    :param preposition_set: a set contains all prefixes
    :return: 1 (has feature) 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        prepos = single_grams[ngram[2] - 1][0].lower()
        if prepos in preposition_set:
            return 1
    return 0

def before_who(ngram, single_grams, who_set):
    """
    Check if the word after the input ngram is "who"
    :param ngram: a n-gram
    :param single_grams: all words in an article with order
    :param who_set: a set contains who
    :return: 1 (has feature) 0 (no such feature)
    """
    return 1 if (ngram[2]+1) < len(single_grams) and (single_grams[ngram[2]+1][0]).lower() in who_set else 0

def has_duplicate(ngram):
    """
    Check if the words in input ngram has any duplicate words
    :param ngram: a n-gram
    :return: 1 (has feature) 0 (no such feature)
    """
    words = set()
    for word in ngram[0].split(' '):
        if word in words:
            return 1
        else:
            words.add(word)
    return 0

def count_occurrences(ngram, single_grams):
    """
    Count the word's occurrences in the article(only for single words)
    :param ngram: a n-gram
    :param single_grams: all words in an article with order
    :return: word's occurrences
    """
    # print (single_grams)
    # data = ' '.join(a[0] for a in single_grams)
    return single_grams.count(ngram[0])

def start_end_dash(ngram):
    """
    Check if the ngram contain string starts or ends with dash
    :param ngram: a n-gram
    :return: 1 (has feature) 0 (no such feature)
    """
    words = ngram[0].split(' ')
    if not words[0].isalpha() or (len(words) > 1 and not words[-1].isalpha()) or words.count('-') > 1:
        return 1
    count = 0
    for word in words:
        count += word.count('-')
    if count > 1:
        return 1
    return 0

def has_one_dash(ngram):
    """
    Check if the ngram contain exactly one dash
    :param ngram: a n-gram
    :return: 1 (has feature) 0 (no such feature)
    """
    words = ngram[0].split(' ')
    if words.count('-') == 1:
        return 1
    count = 0
    for word in words:
        count += word.count('-')
    if count == 1:
        return 1
    return 0

def all_upper_character(ngram):
    """
    Check all the words in the content if all character in words is upper case
    :param ngram: a ngram
    :return: 1 (has feature) 0 (no such feature)
    """
    for word in ngram[0].split(' '):
        if word.isupper():
            return 1
    return 0

def word_length(ngram):
    """
    Check number of words
    :param ngram: a ngram
    :return: number of words
    """
    words = ngram[0].split(' ')
    return len(words)

def has_fullstop_before_ngram(ngram, single_grams2):
    """
    Check if the word in front of the input ngram is a fullstop
    :param ngram: a n-gram
    :param single_grams2: all words including punctuation in an article with order
    :return: 1 (has feature) 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams2[ngram[2] - 1][0].lower()
        if preWord.endswith("."):
            return 1
    return 0

def has_comma_before_ngram(ngram, single_grams2):
    """
    Check if the word in front of the input ngram is a comma
    :param ngram: a n-gram
    :param single_grams2: all words including punctuation in an article with order
    :return: 1 (has feature) 0 (no such feature)
    """
    if (ngram[2] - 1) >= 0:
        preWord = single_grams2[ngram[2] - 1][0].lower()
        if preWord.endswith(","):
            return 1
    return 0

def has_comma(ngram, single_grams2):
    """
    Check if the word has a comma in the end
    :param ngram: a n-gram
    :param single_grams2: all words including punctuation in an article with order
    :return: 1 (has feature) 0 (no such feature)
    """
    lastWord = single_grams2[ngram[3]][0]
    if lastWord.endswith(","):
        return 1
    return 0

def is_name_suffix(ngram):
    """
    Check if the word has a suffix
    :param ngram: a n-gram
    :return: 1 (has feature) 0 (no such feature)
    """
    suffixes = ['Sr', 'Sr.', 'Jr', 'Jr.', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX',
                'sr', 'sr.', 'jr', 'jr.', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx']
    for word in ngram[0].split(' '):
        if word in suffixes:
            return 1
    return 0

def start_with_suffix(ngram):
    """
    Check if the word has a suffix
    :param ngram: a n-gram
    :return: 1 (has feature) 0 (no such feature)
    """
    suffixes = ['Sr', 'Sr.', 'Jr', 'Jr.', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX',
                'sr', 'sr.', 'jr', 'jr.', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx']
    word = ngram[0].split(' ')
    if word[0] in suffixes:
        return 1
    return 0

## irregular_verbs.txt
is, am, are, was, were, has been, have been, beat, beaten, become, became, begin, began, begun, bend, bent, bet, bid, bite, bit, bitten, blow, blew, blown, break, broke, broken, bring, brought, build, built, burn, burned, burnt, buy, bought, catch, caught, choose, chose, chosen, come, came, cost, cut, dig, dug, dive, dove, dived, do, did, done, draw, drew, drawn, dream, dreamed, dreamt, drive, drove, driven, drink, drank, drunk, eat, ate, eaten, fall, fell, fallen, feel, felt, fight, fought, find, found, fly, flew, flown, forget, forgot, forgotten, forgive, forgave, forgiven, freeze, froze, frozen, get, got, gotten, give, gave, given, go, went, gone, grow, grew, grown, hang, hung, have, had, hear, heard, hide, hid, hidden, hit, hold, held, hurt, keep, kept, know, knew, known, lay, laid, lead, led, leave, left, lend, lent, let, lie, lain, lose, lost, make, made, mean, meant, meet, met, pay, paid, put, read, ride, rode, ridden, ring, rang, rung, rise, rose, risen, run, ran, say, said, see, saw, seen, sell, sold, send, sent, show, showed, shown, shut, sing, sang, sung, sit, sat, sleep, slept, speak, spoke, spoken, spend, spent, stand, stood, swim, swam, swum, take, took, taken, teach, taught, tear, tore, torn, tell, told, think, thought, throw, threw, thrown, understand, understood, wake, woke, woken, wear, wore, worn, win, won, write, wrote, written

## main.py
from sklearn.model_selection import cross_val_score, ShuffleSplit
# from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import preprocessing
from ngramGenerator import *
from featureIdentifier import *
from mlModel import *
from postProcessing import *
import pandas as pd
from pandas import DataFrame


def main():
    articles, train_labels_set,  test_labels_set = [], set(), set()

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Pre-processing                                                   '''
    ''' (1) Load data and split data into train/test sets                '''
    ''' (2) Hashset the labels and remove labels on the data             '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    # add all files' data into articles
    preprocessing.read_data(articles)

    # split data to train and test sets
    train_set, test_set = preprocessing.data_split(articles)
    train_label_count, test_label_count = 0, 0

    # take off label and add names to labels
    for i in range(len(train_set)):
        train_set[i], train_label_count, train_labels_set =\
            preprocessing.label_extraction_takeoff(paragraphs=train_set[i], count=train_label_count, labels=train_labels_set)

    for i in range(len(test_set)):
        test_set[i], test_label_count, test_labels_set =\
            preprocessing.label_extraction_takeoff(paragraphs=test_set[i], count=test_label_count, labels=test_labels_set)

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' N-gram generation                                                '''
    ''' (1) Generate all n-gram (with first feature whether contains 's) '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    train_ngram_result, test_ngram_result = [], []
    train_single_gram, test_single_gram = [], []
    train_single_gram2, test_single_gram2 = [], []        # save single ones in order for later use

    for i in range(len(train_set)):
        ngrams, singles, singles2 = generate_ngrams(filename=train_set[i][0], content=train_set[i][1], n=5)
        train_ngram_result.append(ngrams)
        train_single_gram.append(singles)
        train_single_gram2.append(singles2)

    for i in range(len(test_set)):
        ngrams, singles, singles2 = generate_ngrams(filename=test_set[i][0], content=test_set[i][1], n=5)
        test_ngram_result.append(ngrams)
        test_single_gram.append(singles)
        test_single_gram2.append(singles2)

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Take out n-gram with only lowercase (only for training data)     '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    for index in range(len(train_ngram_result)):
        train_ngram_result[index] = eliminate_all_lower(train_ngram_result[index])

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Create a test ngram result without n-gram has only lowercase     '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    test_ngram_result_without_all_lower = test_ngram_result[:]
    for index in range(len(test_ngram_result_without_all_lower)):
        test_ngram_result_without_all_lower[index] = eliminate_all_lower(test_ngram_result_without_all_lower[index])

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Feature creation                                                 '''
    ''' (1) 's (added during generation of ngram)                        '''
    ''' (2) contains country                                             '''
    ''' (3) contains conjunction                                         '''
    ''' (4) all capitalised                                              '''
    ''' (5) prefix before n-gram                                         '''
    ''' (6) verbs for humans                                             '''
    ''' (7) prefix in n-gram                                             '''
    ''' (8) after preposition                                            '''
    ''' (9) contains organization                                        '''
    ''' (10) has no more than 1 word without capitalised starting letter '''
    ''' (11) contains month                                              '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    country_set, conjunction_set, prefix_set, verb_set, preposition_set, organ_set, month_set, who_set, common_name_set, common_adj_set = \
        load_country_file(), load_conjunction_file(), load_prefix_library(),\
        load_verb_file(), load_preposition_file(), load_organ_library(), load_month_file(), load_who_file(), load_common_name_file(), load_common_adj_file()

    for ngram_set_index in range(len(train_ngram_result)):
        article = ' '.join(a[0] for a in train_single_gram[ngram_set_index])
        for ngram_index in range(len(train_ngram_result[ngram_set_index])):
            ngram = train_ngram_result[ngram_set_index][ngram_index]

            train_ngram_result[ngram_set_index][ngram_index] = ngram +\
                (contains_country(ngram=ngram, country_set=country_set),
                 contains_conjunction(ngram=ngram, conjunctions_set=conjunction_set),
                 is_all_upper(ngram=ngram),
                 has_prefix_before_ngram(ngram=ngram, single_grams=train_single_gram[ngram_set_index], prefix_set=prefix_set),
                 has_human_verb(ngram=ngram, single_grams=train_single_gram[ngram_set_index], verb_set=verb_set),
                 contains_prefix(ngram=ngram, prefix_set=prefix_set),
                 afterpreposition(ngram=ngram, single_grams=train_single_gram[ngram_set_index], preposition_set=preposition_set),
                 contains_organization(ngram=ngram, organ_set=organ_set),
                 contains_common_name(ngram=ngram, common_name_set=common_name_set),
                 has_duplicate(ngram=ngram),
                 count_occurrences(ngram=ngram, single_grams=article),
                 no_more_than_one_lower(ngram=ngram),
                 contains_month(ngram=ngram, month_set=month_set),
                 contains_verb(ngram=ngram, verb_set=verb_set),
                 start_end_dash(ngram=ngram),
                 all_upper_character(ngram=ngram),
                 word_length(ngram=ngram),
                 has_fullstop_before_ngram(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 has_comma_before_ngram(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 before_who(ngram=ngram, single_grams=train_single_gram[ngram_set_index], who_set=who_set),
                 has_comma(ngram=ngram, single_grams2=train_single_gram2[ngram_set_index]),
                 has_who(ngram=ngram, who_set=who_set),
                 is_name_suffix(ngram=ngram),
                 has_one_dash(ngram=ngram),
                 start_with_suffix(ngram=ngram),
                 contains_common_adj(ngram=ngram, common_adj_set=common_adj_set),)

    for ngram_set_index in range(len(test_ngram_result_without_all_lower)):
        article = ' '.join(a[0] for a in test_single_gram[ngram_set_index])
        for ngram_index in range(len(test_ngram_result_without_all_lower[ngram_set_index])):
            ngram = test_ngram_result_without_all_lower[ngram_set_index][ngram_index]
            test_ngram_result_without_all_lower[ngram_set_index][ngram_index] = ngram +\
                (contains_country(ngram=ngram, country_set=country_set),
                 contains_conjunction(ngram=ngram, conjunctions_set=conjunction_set),
                 is_all_upper(ngram=ngram),
                 has_prefix_before_ngram(ngram=ngram, single_grams=test_single_gram[ngram_set_index], prefix_set=prefix_set),
                 has_human_verb(ngram=ngram, single_grams=test_single_gram[ngram_set_index], verb_set=verb_set),
                 contains_prefix(ngram=ngram, prefix_set=prefix_set),
                 afterpreposition(ngram=ngram, single_grams=test_single_gram[ngram_set_index], preposition_set=preposition_set),
                 contains_organization(ngram=ngram, organ_set=organ_set),
                 contains_common_name(ngram=ngram, common_name_set=common_name_set),
                 has_duplicate(ngram=ngram),
                 count_occurrences(ngram=ngram, single_grams=article),
                 no_more_than_one_lower(ngram=ngram),
                 contains_month(ngram=ngram, month_set=month_set),
                 contains_verb(ngram=ngram, verb_set=verb_set),
                 start_end_dash(ngram=ngram),
                 all_upper_character(ngram=ngram),
                 word_length(ngram=ngram),
                 has_fullstop_before_ngram(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 has_comma_before_ngram(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 before_who(ngram=ngram, single_grams=test_single_gram[ngram_set_index], who_set=who_set),
                 has_comma(ngram=ngram, single_grams2=test_single_gram2[ngram_set_index]),
                 has_who(ngram=ngram, who_set=who_set),
                 is_name_suffix(ngram=ngram),
                 has_one_dash(ngram=ngram),
                 start_with_suffix(ngram=ngram),
                 contains_common_adj(ngram=ngram, common_adj_set=common_adj_set),)

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' Train DT, SVM, NB                                                '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    train_ngrams = []
    while len(train_ngram_result):
        train_ngrams.extend(train_ngram_result.pop())
    train_ngrams = sorted(train_ngrams, key=lambda i: (int(i[1]), i[2], i[3]-i[2]), reverse=True)
    new_train, train_label = features_label_separator(ngrams=train_ngrams, labels_set=train_labels_set)

    decision_tree = build_decision_tree(data=new_train, label=train_label)
    support_vector_machine = build_support_vector_machine(data=new_train, label=train_label)
    nb_classifier = build_nb_classifier(data=new_train, label=train_label)
    rf_classifier = build_rf_classifier(data=new_train, label=train_label)
    lr_classifier = build_lr_classifier(data=new_train, label=train_label)

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' merge test ngram result                                          '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    test_ngrams = []
    while len(test_ngram_result_without_all_lower):
        test_ngrams.extend(test_ngram_result_without_all_lower.pop())
    test_ngrams = sorted(test_ngrams, key=lambda i: (int(i[1]), i[2], i[3]-i[2]), reverse=True)
    new_test, test_label = features_label_separator(ngrams=test_ngrams, labels_set=test_labels_set)

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' use DT, SVM, NB, RF, LR to predict test set                      '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print("Train Set")
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print(" ")
    print("Number of Name: ")
    print(train_label_count)
    decision_tree_predict_train = decision_tree.predict(new_train)
    support_vector_machine_predict_train = support_vector_machine.predict(new_train)
    nb_classifier_predict_train = nb_classifier.predict(new_train)
    rf_classifier_predict_train = rf_classifier.predict(new_train)
    lr_classifier_predict_train = lr_classifier.predict(new_train)

    print("precision before post processing:")
    print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, train_label)])) / sum(lr_classifier_predict_train))
    print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, train_label)])) / sum(decision_tree_predict_train))
    print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, train_label)])) / sum(support_vector_machine_predict_train))
    print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, train_label)])) / sum(nb_classifier_predict_train))
    print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, train_label)])) / sum(rf_classifier_predict_train))
    print('')
    print("recall before post processing:")
    print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, train_label)])) / sum(train_label))
    print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, train_label)])) / sum(train_label))
    print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, train_label)])) / sum(train_label))
    print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, train_label)])) / sum(train_label))
    print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, train_label)])) / sum(train_label))
    print('')
    decision_tree_ngrams_train, decision_tree_predict_train, decision_tree_label_train = take_out_overlapped(train_ngrams, decision_tree_predict_train, train_label)
    support_vector_machine_ngrams_train, support_vector_machine_predict_train, support_vector_machine_label_train = take_out_overlapped(train_ngrams, support_vector_machine_predict_train, train_label)
    nb_classifier_ngrams_train, nb_classifier_predict_train, nb_classifier_label_train = take_out_overlapped(train_ngrams, nb_classifier_predict_train, train_label)
    rf_classifier_ngrams_train, rf_classifier_predict_train, rf_classifier_label_train = take_out_overlapped(train_ngrams, rf_classifier_predict_train, train_label)
    lr_classifier_ngrams_train, lr_classifier_predict_train, lr_classifier_label_train = take_out_overlapped(train_ngrams, lr_classifier_predict_train, train_label)

    decision_tree_predict_train = set_predict_value(ngrams=decision_tree_ngrams_train, predict=decision_tree_predict_train)
    support_vector_machine_predict_train = set_predict_value(ngrams=support_vector_machine_ngrams_train, predict=support_vector_machine_predict_train)
    nb_classifier_predict_train = set_predict_value(ngrams=nb_classifier_ngrams_train, predict=nb_classifier_predict_train)
    rf_classifier_predict_train = set_predict_value(ngrams=rf_classifier_ngrams_train, predict=rf_classifier_predict_train)
    lr_classifier_predict_train = set_predict_value(ngrams=lr_classifier_ngrams_train, predict=lr_classifier_predict_train)

    print("precision:")
    print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, lr_classifier_label_train)])) / sum(lr_classifier_predict_train))
    print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, decision_tree_label_train)])) / sum(decision_tree_predict_train))
    print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, support_vector_machine_label_train)])) / sum(support_vector_machine_predict_train))
    print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, nb_classifier_label_train)])) / sum(nb_classifier_predict_train))
    print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, rf_classifier_label_train)])) / sum(rf_classifier_predict_train))
    print('')
    print("recall:")
    print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict_train, lr_classifier_label_train)])) / sum(lr_classifier_label_train))
    print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict_train, decision_tree_label_train)])) / sum(decision_tree_label_train))
    print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict_train, support_vector_machine_label_train)])) / sum(support_vector_machine_label_train))
    print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict_train, nb_classifier_label_train)])) / sum(nb_classifier_label_train))
    print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict_train, rf_classifier_label_train)])) / sum(rf_classifier_label_train))

    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    ''' use DT, SVM, NB, RF, LR to predict test set                      '''
    ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print("Test Set")
    print("''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''")
    print(" ")
    print("Number of Name: ")
    print(test_label_count)
    decision_tree_predict = decision_tree.predict(new_test)
    support_vector_machine_predict = support_vector_machine.predict(new_test)
    nb_classifier_predict = nb_classifier.predict(new_test)
    rf_classifier_predict = rf_classifier.predict(new_test)
    lr_classifier_predict = lr_classifier.predict(new_test)

    print("precision before post processing:")
    print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, test_label)])) / sum(lr_classifier_predict))
    print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, test_label)])) / sum(decision_tree_predict))
    print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, test_label)])) / sum(support_vector_machine_predict))
    print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, test_label)])) / sum(nb_classifier_predict))
    print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, test_label)])) / sum(rf_classifier_predict))
    print('')
    print("recall before post processing:")
    print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, test_label)])) / sum(test_label))
    print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, test_label)])) / sum(test_label))
    print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, test_label)])) / sum(test_label))
    print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, test_label)])) / sum(test_label))
    print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, test_label)])) / sum(test_label))
    print('')
    decision_tree_ngrams, decision_tree_predict, decision_tree_label = take_out_overlapped(test_ngrams, decision_tree_predict, test_label)
    support_vector_machine_ngrams, support_vector_machine_predict, support_vector_machine_label = take_out_overlapped(test_ngrams, support_vector_machine_predict, test_label)
    nb_classifier_ngrams, nb_classifier_predict, nb_classifier_label = take_out_overlapped(test_ngrams, nb_classifier_predict, test_label)
    rf_classifier_ngrams, rf_classifier_predict, rf_classifier_label = take_out_overlapped(test_ngrams, rf_classifier_predict, test_label)
    lr_classifier_ngrams, lr_classifier_predict, lr_classifier_label = take_out_overlapped(test_ngrams, lr_classifier_predict, test_label)

    decision_tree_predict = set_predict_value(ngrams=decision_tree_ngrams, predict=decision_tree_predict)
    support_vector_machine_predict = set_predict_value(ngrams=support_vector_machine_ngrams, predict=support_vector_machine_predict)
    nb_classifier_predict = set_predict_value(ngrams=nb_classifier_ngrams, predict=nb_classifier_predict)
    rf_classifier_predict = set_predict_value(ngrams=rf_classifier_ngrams, predict=rf_classifier_predict)
    lr_classifier_predict = set_predict_value(ngrams=lr_classifier_ngrams, predict=lr_classifier_predict)

    print("precision:")
    print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, lr_classifier_label)])) / sum(lr_classifier_predict))
    print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, decision_tree_label)])) / sum(decision_tree_predict))
    print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, support_vector_machine_label)])) / sum(support_vector_machine_predict))
    print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, nb_classifier_label)])) / sum(nb_classifier_predict))
    print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, rf_classifier_label)])) / sum(rf_classifier_predict))
    print('')
    print("recall:")
    print 'lr: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(lr_classifier_predict, lr_classifier_label)])) / sum(lr_classifier_label))
    print 'dt: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(decision_tree_predict, decision_tree_label)])) / sum(decision_tree_label))
    print 'svm: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(support_vector_machine_predict, support_vector_machine_label)])) / sum(support_vector_machine_label))
    print 'nb: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(nb_classifier_predict, nb_classifier_label)])) / sum(nb_classifier_label))
    print 'rf: ' + str(float(sum([1 if a == b and a == 1 else 0 for a, b in zip(rf_classifier_predict, rf_classifier_label)])) / sum(rf_classifier_label))

    # print ("==========================================================================")
    # print("data frame:")

    # df = pd.DataFrame(columns=['words', 'predict', 'label'])
    # for i in range(len(rf_classifier_predict)):
    #     if not (rf_classifier_predict[i] == rf_classifier_label[i])  and rf_classifier_predict[i] == 1:
    #         df = df.append({'words': rf_classifier_ngrams[i], 'predict': rf_classifier_predict[i], 'label':rf_classifier_label[i]}, ignore_index = True)
    # DataFrame.to_csv(df, "rf_classifier_predict.csv", index=False)
    # scores = cross_val_score(svm.SVC(), new_train, train_label, cv=ShuffleSplit(n_splits=5, test_size=0.3, random_state=0))
    # print (scores)


if __name__ == "__main__":
    main()

## mlModel.py
from sklearn.linear_model import LogisticRegression
from sklearn import tree, svm
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV


def build_decision_tree(data, label):
    """
    Build the decision tree based on the data and its corresponding label
    :param data: a list of tuple which contains all features of a data
    :param label: a list of label for data
    :return: a trained decision tree
    """
    dt_tree = tree.DecisionTreeClassifier()
    return dt_tree.fit(data, label)


def build_support_vector_machine(data, label):
    """
    Build the support vector machine based on the data and its corresponding label
    :param data: a list of tuple which contains all features of a data
    :param label: a list of label for data
    :return: trained support vector machine
    """
    trained_svm = svm.SVC(gamma='scale', C=100)
    return trained_svm.fit(data, label)


def build_nb_classifier(data, label):
    """
    Build the naive bayes classifier based on the data and its corresponding label
    :param data: a list of tuple which contains all features of a data
    :param label: a list of label for data
    :return: trained naive bayes classifier
    """
    classifier = BernoulliNB()
    return classifier.fit(data, label)


def build_rf_classifier(data, label):
    """
    Build the random forest classifier based on the data and its corresponding label
    :param data: a list of tuple which contains all features of a data
    :param label: a list of label for data
    :return: trained naive bayes classifier
    """
    # pipe = make_pipeline(StandardScaler(),RandomForestClassifier())
    # param_grid = {'n_estimators': list(range(1, 30))}
    # gs = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, \
    #     iid=False, n_jobs=-1, refit=True,scoring='accuracy',cv=10)
    # gs.fit(data, label)
    # n_estimators=gs.best_params_['n_estimators']
    classifier = RandomForestClassifier(n_estimators=34, n_jobs=-1, criterion='gini', class_weight={0: 1, 1: 1.45}, random_state=10)
    return classifier.fit(data, label)


def build_lr_classifier(data, label):
    """
    Build the logistic regression classifier based on the data and its corresponding label
    :param data: a list of tuple which contains all features of a data
    :param label: a list of label for data
    :return: trained logistic regression classifier
    """
    classifier = LogisticRegression(solver='newton-cg',n_jobs=-1,class_weight={0: 1, 1: 1.5})
    return classifier.fit(data, label)

## month.txt
January
February
March
April
May
June
July
August
September
October
November
December

## ngramGenerator.py
import re


def generate_ngrams(filename, content, n):
    """
    Generate n-grams (with a feature whether it contains "'s") from the content
    :param filename: filename
    :param content: the whole article
    :param n: the size of n gram
    :return: generated list of n-grams, single grams
    """
    sentences = content.split(".")
    index, index2 = 0, 0
    n_grams, single_grams, single_grams2 = [], [], []
    for sentence in sentences:
        sections = sentence.split(",")
        for section in sections:
            parts = section.split(";")
            for part in parts:
                words = part.split()
                single_grams_temp, feature_single_quote_temp = [], []
                for i in range(len(words)):
                    words2 = words[:]
                    words2[i] = re.sub('[;@#$()\{\}:"]', '', words2[i])
                    single_grams2.append((words2[i], filename, index2, index2))
                    index2 += 1

                # first clean the data
                for i in range(len(words)):
                    # clean data by removing special characters
                    words[i] = re.sub('[?;!@#$()\{\}:\,\."]', '', words[i])

                    # for cases 's, take off 's
                    if (len(words[i]) >= 2 and words[i][-2] == "'"):
                        words[i] = words[i][:-2]
                        feature_single_quote_temp.append(1)
                    elif (len(words[i]) >= 2 and words[i][-2] == "s" and words[i][-1] == "'"):
                        words[i] = words[i][:-1]
                        feature_single_quote_temp.append(1)
                    else:
                        feature_single_quote_temp.append(0)

                    single_grams_temp.append((words[i], filename, index, index))
                    index += 1

                n_grams_temp = []    # the return list
                for i in range(len(words)):
                    temp = words[i]
                    for j in range(1, n):
                        if (i + j) < len(words):
                            temp = temp + ' ' + words[i + j]
                            temp_with_first_index = (temp, filename, single_grams_temp[i][2], single_grams_temp[i + j][2], feature_single_quote_temp[i + j])
                            n_grams_temp.append(temp_with_first_index)

                # single_grams += n_grams
                for i in range(len(single_grams_temp)):
                    n_grams_temp.append(single_grams_temp[i] + (feature_single_quote_temp[i],))

                n_grams.extend(n_grams_temp)
                single_grams.extend(single_grams_temp)
    return n_grams, single_grams, single_grams2

def eliminate_all_lower(ngrams):
    """
    Take out n-gram which does not have any word capitalised
    :param ngrams: all n-grams
    :return: all n-grams for each n-gram has a least one word capitalised
    """
    new_ngram = []
    for ngram in ngrams:
        for word in ngram[0].split(' '):
            if len(word) > 0 and word[0].isupper():
                new_ngram.append(ngram)
                break

    return new_ngram

## organization.txt
university
college
association
commission
council
laboratory
government
committee
department
school
research
office
affairs
court
corporation
company
agency
organization
group
empire
league
music
hotel
hotels
white house
party
hilton
Walmart
Genk
Brugge
Concert
organisation
prize
rolling stones
White House
Virgin Galactic
Art Brut
amazon
walmart
art
following
club
people
human
guitar
violin

## postProcessing.py
def take_out_overlapped(ngrams, predict, label):
    """
    Take out n-gram which is the subset of another n-gram
    :param ngrams: all n-grams
    :return: the remaining n-grams
    """
    new_ngrams, new_predict, new_label, prev, prev_predict = [], [], [], None, 0
    for element_index in range(len(ngrams)):
        # if prev is None || (filenames are different) || (starting index are different)
        if not prev \
            or ngrams[element_index][1] != prev[1] \
            or ngrams[element_index][2] == 0 \
            or prev_predict == 0 \
            or (#ngrams_labels_predicts_sets[element_index][0][1] == prev[0][1] \
                # pre[element_index]==1 \
                # and prev[2]==1 \
                not(prev[2] <= ngrams[element_index][2] <= prev[3]) \
                or not(prev[2] <= ngrams[element_index][3] <= prev[3])):
            prev = ngrams[element_index]
            prev_predict = predict[element_index]
            new_ngrams.append(ngrams[element_index])
            new_predict.append(predict[element_index])
            new_label.append(label[element_index])

    return new_ngrams, new_predict, new_label


def set_predict_value(ngrams, predict):
    for element_index in range(len(ngrams)):
        # 19: start_end_dash, 5: contains_country, 10: contains_prefix, 12: contains_organization, 18: contains_verb,\
        # 6: contains_conjunction, 29: start_with_suffix, 30: contains_common_adj
        if ngrams[element_index][19] == 1 or ngrams[element_index][5] == 1 \
            or ngrams[element_index][29] == 1 or ngrams[element_index][6] == 1 \
            or ngrams[element_index][10] == 1 or ngrams[element_index][12] == 1 \
            or ngrams[element_index][18] == 1 or ngrams[element_index][30] == 1:
            predict[element_index] = 0

    return predict

## prefix.txt
adm
atty
baz
brother
capped
chief
cmdr
col
dean
dr
elder
father
gen
gov
hon
maj
msgt
mr
mrs
ms
prince
prof
rabbi
rev
king
queen
professor
maid
madam
princess
duke
duchess
baroness
baron
pope
popess
president
mother
saint
minister
doctor
major
general
marshal
officer
admiral
attorney
commander
colonel
governor
honorable
mister
reverend
actor
actress
writer
performer
journalism
dj
star
producer
engineer
coordinator
administrator
manager
agent
promoter
accompanist
bassist
busker
cellist
composer
drummer
fiddler
flautist
flutist
mpressionist
instrumentalist
keyboardist
leader
musician
pianist
player
saxophonist
soloist
timpanist
tuner
virtuoso
guitarist
organist
violinist
trumpeter
trombonist
percussionist
oboist
mandolinist
keytarist
harpsichordist
harpist
clarinetist
bassoonist
bagpiper
accordionist
master
by
winner
nominee
lord
sir
sculptor
uncle
co-star
representative
pilot
cinematographer
named
director
author
lady
maid
junior
stars
farmer
anchorwoman
nephew
newcomer
prodigy
brother
photographer
assistant
journalist
miss
novelist
father
agent
partner
lawyer
reporter
sisters
composer
Major
actor
captain
astronaut
commander
painter
musician
meets
champion
orphan
sheriff
writer
detective
artist
jr
army
attorney
commandant
filmmaker
filmmakers
guardian
ceo
cfo
cto
mayor
st
emperor
senator
administration
senators
representatives
representative
chancellor
dj
secretary

## preposition.txt
after
from
by
for
with
but
of
to
and
before

## preprocessing.py
import os
from unidecode import unidecode
import re

def read_data(articles):
    """
    read the file and append it to the articles
    :param articles: a list for all articles
    :return: None
    """

    def files(path):
        """
        find all file in a given path and yield the files' paths
        :param path: the path to the location of files
        :return: files' paths
        """
        for f in os.listdir(path):
            if len(f.split('.')[0]) == 3 and f.split('.')[1] == "txt" and os.path.isfile(os.path.join(path, f)):
                yield path + "/" + f, f.split('.')[0]

    for file_path, filename in files("data"):
        articles.append((filename, unidecode(file(file_path, 'r').read().decode("UTF-8"))))


def data_split(articles):
    """
    split data into two data-sets (training and testing)
    :param articles: a list of articles
    :return: two lists for two data-sets (training and testing)
    """
    train_set, test_set = [], []

    for i in range(0, len(articles), 3):
        train_set.append(articles[i])
        train_set.append(articles[i+1])
        test_set.append(articles[i+2])

    return train_set, test_set


def label_extraction_takeoff(paragraphs, count, labels=None):
    """
    Take off the label <person> and </person> and return the paragraph without labels
    :param paragraphs: string input data with <person></person> labels
    :param count: number of labels in articles
    :param labels: a set which contains all label among all input data
    :return: new paragraohs without labels, number of labels in articles
    """
    LABEL, LABEL_END = "<person>", "</person>"
    index, new_paragraph = 0, ""
    filename = paragraphs[0]
    paragraphs = paragraphs[1]

    while index < len(paragraphs):
        # find the index of the closest LABEL
        found = paragraphs.find(LABEL, index)

        # if the label is found
        if found != -1:
            # find the index (location) of the end of label
            found_end = paragraphs.find(LABEL_END, found)
            # append label to the return variable new_paragraph
            new_paragraph += paragraphs[index:found] + paragraphs[found+len(LABEL):found_end]

            # if labels is not None, add the label into it
            if labels is not None:
                labels.add(re.sub('[?;!@#$(){}\\,\\."]', '', paragraphs[found+len(LABEL):found_end]))

            # update the current index
            index = found_end + len(LABEL_END)
            count += 1

        else:
            new_paragraph += paragraphs[index:]
            break

    return (filename, new_paragraph), count, labels

## regular_verbs.txt
accept, add, admire, admit, advise, afford, agree, alert, allow, amuse, analyze , analyze , announce, annoy, answer, apologize, appear, applaud, appreciate, approve, argue, arrange, arrest, arrive, ask, attach, attack, attempt, attend, attract, avoid,  back, bake, balance, ban, bang, bare, bat, bathe, battle, beam, beg, behave, belong, bleach, bless, blind, blink, blot, blush, boast, boil, bolt, bomb, book, bore, borrow, bounce, bow, box, brake, branch, breathe, bruise, brush, bubble, bump, burn, bury, buzz, calculate, call, camp, care, carry, carve, cause, challenge, change, charge, chase, cheat, check, cheer, chew, choke, chop, claim, clap, clean, clear, clip, close, coach, coil, collect, color, comb, command, communicate, compare, compete, complain, complete, concentrate, concern, confess, confuse, connect, consider, consist, contain, continue, copy, correct, cough, count, cover, crack, crash, crawl, cross, crush, cry, cure, curl, curve, cycle,  dam, damage, dance, dare, decay, deceive, decide, decorate, delay, delight, deliver, depend, describe, desert, deserve, destroy, detect, develop, disagree, disappear, disapprove, disarm, discover, dislike, divide, double, doubt, drag, drain, dream, dress, drip, drop, drown, drum, dry, dust, earn, educate, embarrass, employ, empty, encourage, end, enjoy, enter, entertain, escape, examine, excite, excuse, exercise, exist, expand, expect, explain, explode, extend, face, fade, fail, fancy, fasten, fax, fear, fence, fetch, file, fill, film, fire, fit, fix, flap, flash, float, flood, flow, flower, fold, follow, fool, force, form, found, frame, frighten, fry,  gather, gaze, glow, glue, grab, grate, grease, greet, grin, grip, groan, guarantee, guard, guess, guide, hammer, hand, handle, hang, happen, harass, harm, hate, haunt, head, heal, heap, heat, help, hook, hop, hope, hover, hug, hum, hunt, hurry, identify, ignore, imagine, impress, improve, include, increase, influence, inform, inject, injure, instruct, intend, interest, interfere, interrupt, introduce, invent, invite, irritate, itch, jail, jam, jog, join, joke, judge, juggle, jump, kick, kill, kiss, kneel, knit, knock, knot, label, land, last, laugh, launch, learn, level, license, lick, lie, lighten, like, list, listen, live, load, lock, long, look, love, man, manage, march, mark, marry, match, mate, matter, measure, meddle, melt, memorize, mend, mess up, milk, mine, miss, mix, moan, moor, mourn, move, muddle, mug, multiply, murder, nail, name, need, nest, nod, note, notice, number, obey, object, observe, obtain, occur, offend, offer, open, order, overflow, owe, own, pack, paddle, paint, park, part, pass, paste, pat, pause, peck, pedal, peel, peep, perform, permit, phone, pick, pinch, pine, place, plan, plant, play, please, plug, point, poke, polish, pop, possess, post, pour, practice , practice , pray, preach, precede, prefer, prepare, present, preserve, press, pretend, prevent, prick, print, produce, program, promise, protect, provide, pull, pump, punch, puncture, punish, push, question, queue, race, radiate, rain, raise, reach, realize, receive, recognize, record, reduce, reflect, refuse, regret, reign, reject, rejoice, relax, release, rely, remain, remember, remind, remove, repair, repeat, replace, reply, report, reproduce, request, rescue, retire, return, rhyme, rinse, risk, rob, rock, roll, rot, rub, ruin, rule, rush, sack, sail, satisfy, save, saw, scare, scatter, scold, scorch, scrape, scratch, scream, screw, scribble, scrub, seal, search, separate, serve, settle, shade, share, shave, shelter, shiver, shock, shop, shrug, sigh, sign, signal, sin, sip, ski, skip, slap, slip, slow, smash, smell, smile, smoke, snatch, sneeze, sniff, snore, snow, soak, soothe, sound, spare, spark, sparkle, spell, spill, spoil, spot, spray, sprout, squash, squeak, squeal, squeeze, stain, stamp, stare, start, stay, steer, step, stir, stitch, stop, store, strap, strengthen, stretch, strip, stroke, stuff, subtract, succeed, suck, suffer, suggest, suit, supply, support, suppose, surprise, surround, suspect, suspend, switch, talk, tame, tap, taste, tease, telephone, tempt, terrify, test, thank, thaw, tick, tickle, tie, time, tip, tire, touch, tour, tow, trace, trade, train, transport, trap, travel, treat, tremble, trick, trip, trot, trouble, trust, try, tug, tumble, turn, twist, type, undress, unfasten, unite, unlock, unpack, untidy, use, vanish, visit, wail, wait, walk, wander, want, warm, warn, wash, waste, watch, water, wave, weigh, welcome, whine, whip, whirl, whisper, whistle, wink, wipe, wish, wobble, wonder, work, worry, wrap, wreck, wrestle, wriggle, x-ray, yawn, yell, zip, zoom, accepted, added, admired, admitted, advised, afforded, agreed, alerted, allowed, amused, analyze ed, analyze ed, announced, annoyed, answered, apologized, appeared, applauded, appreciated, approved, argued, arranged, arrested, arrived, asked, attached, attacked, attempted, attended, attracted, avoided,  backed, baked, balanced, banned, banged, bared, bated, bathed, battled, beamed, begged, behaved, belonged, bleached, blessed, blinded, blinked, blotted, blushed, boasted, boiled, bolted, bombed, booked, bored, borrowed, bounced, bowed, boxed, braked, branched, breathed, bruised, brushed, bubbled, bumped, burned, buried, buzzed, calculated, called, camped, cared, carried, carved, caused, challenged, changed, charged, chased, cheated, checked, cheered, chewed, choked, chopped, claimed, clapped, cleaned, cleared, clipped, closed, coached, coiled, collected, colored, combed, commanded, communicated, compared, competed, complained, completed, concentrated, concerned, confessed, confused, connected, considered, consisted, contained, continued, copied, corrected, coughed, counted, covered, cracked, crashed, crawled, crossed, crushed, cried, cured, curled, curved, cycled,  damed, damaged, danced, dared, decayed, deceived, decided, decorated, delayed, delighted, delivered, depended, described, deserted, deserved, destroyed, detected, developed, disagreed, disappeared, disapproved, disarmed, discovered, disliked, divided, doubled, doubted, dragged, drained, dreamed, dressed, dripped, dropped, drowned, drummed, dried, dusted, earned, educated, embarrassed, employed, emptied, encouraged, ended, enjoyed, entered, entertained, escaped, examined, excited, excused, exercised, existed, expanded, expected, explained, exploded, extended, faced, faded, failed, fancied, fastened, faxed, feared, fenced, fetched, filed, filled, filmed, fired, fitted, fixed, flapped, flashed, floated, flooded, flowed, flowered, folded, followed, fooled, forced, formed, founded, framed, frightened, fried,  gathered, gazed, glowed, glued, grabbed, grated, greased, greeted, grinned, griped, groaned, guaranteed, guarded, guessed, guided, hammered, handed, handled, hanged, happened, harassed, harmed, hated, haunted, headed, healed, heaped, heated, helped, hooked, hoped, hoped, hovered, hugged, hummed, hunted, hurried, identified, ignored, imagined, impressed, improved, included, increased, influenced, informed, injected, injured, instructed, intended, interested, interfered, interrupted, introduced, invented, invited, irritated, itched, jailed, jammed, jogged, joined, joked, judged, juggled, jumped, kicked, killed, kissed, kneeled, knitted, knocked, knotted, labeled, landed, lasted, laughed, launched, learned, leveled, licensed, licked, lied, lightened, liked, listed, listened, lived, loaded, locked, longed, looked, loved, maned, managed, marched, marked, married, matched, mated, mattered, measured, meddled, melted, memorized, mended, mess upped, milked, mined, missed, mixed, moaned, moored, mourned, moved, muddled, mugged, multiplied, murdered, nailed, named, needed, nested, nodded, noted, noticed, numbered, obeyed, objected, observed, obtained, occurred, offended, offered, opened, ordered, overflowed, owed, owned, packed, paddled, painted, parked, parted, passed, pasted, pated, paused, pecked, pedaled, peeled, peeped, performed, permitted, phoned, picked, pinched, pined, placed, planed, planted, played, pleased, plugged, pointed, poked, polished, popped, possessed, posted, poured, practice ed, practice ed, prayed, preached, preceded, preferred, prepared, presented, preserved, pressed, pretended, prevented, pricked, printed, produced, programed, promised, protected, provided, pulled, pumped, punched, punctured, punished, pushed, questioned, queued, raced, radiated, rained, raised, reached, realized, received, recognized, recorded, reduced, reflected, refused, regretted, reigned, rejected, rejoiced, relaxed, released, relied, remained, remembered, reminded, removed, repaired, repeated, replaced, replied, reported, reproduced, requested, rescued, retired, returned, rhymed, rinsed, risked, robed, rocked, rolled, rotted, rubbed, ruined, ruled, rushed, sacked, sailed, satisfied, saved, sawed, scared, scattered, scolded, scorched, scraped, scratched, screamed, screwed, scribbled, scribed, sealed, searched, separated, served, settled, shaded, shared, shaved, sheltered, shivered, shocked, shopped, shrugged, sighed, signed, signaled, sinned, sipped, skied, skipped, slapped, slipped, slowed, smashed, smelled, smiled, smoked, snatched, sneezed, sniffed, snored, snowed, soaked, soothed, sounded, spared, sparked, sparkled, spelled, spilled, spoiled, spotted, sprayed, sprouted, squashed, squeaked, squealed, squeezed, stained, stamped, stared, started, stayed, steered, stepped, stirred, stitched, stoped, stored, strapped, strengthened, stretched, striped, stroked, stuffed, subtracted, succeeded, sucked, suffered, suggested, suited, supplied, supported, supposed, surprised, surrounded, suspected, suspended, switched, talked, tamed, taped, tasted, teased, telephoned, tempted, terrified, tested, thanked, thawed, ticked, tickled, tied, timed, tipped, tired, touched, toured, towed, traced, traded, trained, transported, trapped, traveled, treated, trembled, tricked, tripped, trotted, troubled, trusted, tried, tugged, tumbled, turned, twisted, typed, undressed, unfastened, united, unlocked, unpacked, untidied, used, vanished, visited, wailed, waited, walked, wandered, wanted, warmed, warned, washed, wasted, watched, watered, waved, weighed, welcomed, whined, whipped, whirled, whispered, whistled, winked, wiped, wished, wobbled, wondered, worked, worried, wrapped, wrecked, wrestled, wriggled, yawned, yelled, zipped, zoomed, accepts, adds, admires, admits, advises, affords, agrees, alerts, allows, amuses, analyze s, analyze s, announces, annoys, answers, apologizes, appears, applauds, appreciates, approves, argues, arranges, arrests, arrives, asks, attaches, attacks, attempts, attends, attracts, avoids,  backs, bakes, balances, bans, bangs, bares, bats, bathes, battles, beams, begs, behaves, belongs, bleaches, blesses, blinds, blinks, blots, blushes, boasts, boils, bolts, bombs, books, bores, borrows, bounces, bows, boxes, brakes, branches, breathes, bruises, brushes, bubbles, bumps, burns, buries, buzzes, calculates, calls, camps, cares, carries, carves, causes, challenges, changes, charges, chases, cheats, checks, cheers, chews, chokes, chops, claims, claps, cleans, clears, clips, closes, coaches, coils, collects, colors, combs, commands, communicates, compares, competes, complains, completes, concentrates, concerns, confesses, confuses, connects, considers, consists, contains, continues, copies, corrects, coughs, counts, covers, cracks, crashes, crawls, crosses, crushes, cries, cures, curls, curves, cycles,  dams, damages, dances, dares, decays, deceives, decides, decorates, delays, delights, delivers, depends, describes, deserts, deserves, destroys, detects, develops, disagrees, disappears, disapproves, disarms, discovers, dislikes, divides, doubles, doubts, drags, drains, dreams, dresses, drips, drops, drowns, drums, dries, dusts, earns, educates, embarrasses, employs, empties, encourages, ends, enjoys, enters, entertains, escapes, examines, excites, excuses, exercises, exists, expands, expects, explains, explodes, extends, faces, fades, fails, fancies, fastens, faxes, fears, fences, fetches, files, fills, films, fires, fits, fixes, flaps, flashes, floats, floods, flows, flowers, folds, follows, fools, forces, forms, founds, frames, frightens, fries,  gathers, gazes, glows, glues, grabs, grates, greases, greets, grins, grips, groans, guarantees, guards, guesses, guides, hammers, hands, handles, hangs, happens, harasses, harms, hates, haunts, heads, heals, heaps, heats, helps, hooks, hops, hopes, hovers, hugs, hums, hunts, hurries, identifies, ignores, imagines, impresses, improves, includes, increases, influences, informs, injects, injures, instructs, intends, interests, interferes, interrupts, introduces, invents, invites, irritates, itches, jails, jams, jogs, joins, jokes, judges, juggles, jumps, kicks, kills, kisses, kneels, knits, knocks, knots, labels, lands, lasts, laughs, launches, learns, levels, licenses, licks, lies, lightens, likes, lists, listens, lives, loads, locks, longs, looks, loves, mans, manages, marches, marks, marries, matches, mates, matters, measures, meddles, melts, memorizes, mends, mess ups, milks, mines, misses, mixes, moans, moors, mourns, moves, muddles, mugs, multiplies, murders, nails, names, needs, nests, nods, notes, notices, numbers, obeys, objects, observes, obtains, occurs, offends, offers, opens, orders, overflows, owes, owns, packs, paddles, paints, parks, parts, passes, pastes, pats, pauses, pecks, pedals, peels, peeps, performs, permits, phones, picks, pinches, pines, places, plans, plants, plays, pleases, plugs, points, pokes, polishes, pops, possesses, posts, pours, practice s, practice s, prays, preaches, precedes, prefers, prepares, presents, preserves, presses, pretends, prevents, pricks, prints, produces, programs, promises, protects, provides, pulls, pumps, punches, punctures, punishes, pushes, questions, queues, races, radiates, rains, raises, reaches, realizes, receives, recognizes, records, reduces, reflects, refuses, regrets, reigns, rejects, rejoices, relaxes, releases, relies, remains, remembers, reminds, removes, repairs, repeats, replaces, replies, reports, reproduces, requests, rescues, retires, returns, rhymes, rinses, risks, robs, rocks, rolls, rots, rubs, ruins, rules, rushes, sacks, sails, satisfies, saves, saws, scares, scatters, scolds, scorches, scrapes, scratches, screams, screws, scribbles, scrubs, seals, searches, separates, serves, settles, shades, shares, shaves, shelters, shivers, shocks, shops, shrugs, sighs, signs, signals, sins, sips, skis, skips, slaps, slips, slows, smashes, smells, smiles, smokes, snatches, sneezes, sniffs, snores, snows, soaks, soothes, sounds, spares, sparks, sparkles, spells, spills, spoils, spots, sprays, sprouts, squashes, squeaks, squeals, squeezes, stains, stamps, stares, starts, stays, steers, steps, stirs, stitches, stops, stores, straps, strengthens, stretches, strips, strokes, stuffs, subtracts, succeeds, sucks, suffers, suggests, suits, supplies, supports, supposes, surprises, surrounds, suspects, suspends, switches, talks, tames, taps, tastes, teases, telephones, tempts, terrifies, tests, thanks, thaws, ticks, tickles, ties, times, tips, tires, touches, tours, tows, traces, trades, trains, transports, traps, travels, treats, trembles, tricks, trips, trots, troubles, trusts, tries, tugs, tumbles, turns, twists, types, undresses, unfastens, unites, unlocks, unpacks, untidies, uses, vanishes, visits, wails, waits, walks, wanders, wants, warms, warns, washes, wastes, watches, waters, waves, weighs, welcomes, whines, whips, whirls, whispers, whistles, winks, wipes, wishes, wobbles, wonders, works, worries, wraps, wrecks, wrestles, wriggles, x-rays, yawns, yells, zips, zooms, beats, becomes, begins, bends, bets, bids, blows, breaks, brings, builds, burns, buys, catches, chooses, comes, costs, cuts, digs, dives, does, draws, dreams, drives, drinks, eats, falls, feels, fights, finds, flies, forgets, forgives, gets, gots, give goes, grows, hangs, hears, hides, hurts, keeps, knows, lays, leads, leaves, lends, lets, loses, makes, means, meets, pays, puts, reads, rides, rings, rises, runs, says, sees, sells, sends, shows, shuts, sings, sits, sleeps, speaks, spends, stands, swims, takes, teaches, tears, tells, thinks, throws, understands, wakes, wears, wins, writes

## whos.txt
who
whose
whom
	able
	bad
	best
	better
	big
	black
	certain
	clear
	different
	early
	easy
	economic
	federal
	free
	full
	good
	great
	hard
	high
	human
	important
	international
	large
	late
	little
	local
	long
	low
	major
	military
	national
	new
	old
	only
	other
	political
	possible
	public
	real
	recent
	right
	small
	social
	special
	strong
	sure
	true
	white
	whole
	young
	other
	new
	good
	high
	old
	great
	big
	American
	small
	large
	national
	different
	black
	long
	little
	important
	political
	bad
	white
	real
	best
	right
	social
	only
	public
	sure
	low
	early
	able
	human
	local
	late
	hard
	major
	better
	economic
	strong
	possible
	whole
	free
	military
	true
	federal
	international
	full
	special
	easy
	clear
	recent
	certain
	personal
	open
	red
	difficult
	available
	likely
	short
	single
	medical
	current
	wrong
	private
	past
	foreign
	fine
	common
	poor
	natural
	significant
	similar
	hot
	dead
	central
	happy
	serious
	ready
	simple
	left
	physical
	general
	environmental
	financial
	blue
	democratic
	dark
	various
	entire
	close
	legal
	religious
	cold
	final
	main
	green
	nice
	huge
	popular
	traditional
	cultural
	Oliver
	Jake
	Noah
	James
	Jack
	Connor
	Liam
	John
	Harry
	Callum
	Mason
	Robert
	Jacob
	Michael
	Charlie
	Kyle
	William
	Williams
	Thomas
	Shawn
	Joe
	Ethan
	David
	George
	Reece
	Michael
	Richard
	Oscar
	Rhys
	Alexander
	Joseph
	James
	Charlie
	James
	Charles
	Damian
	Daniel
	Thomas
	Amelia
	Margaret
	Emma
	Mary
	Olivia
	Samantha
	Patricia
	Isla
	Bethany
	Sophia
	Jennifer
	Emily
	Elizabeth
	Isabella
	Elizabeth
	Poppy
	Joanne
	Ava
	Linda
	Megan
	Mia
	Barbara
	Isabella
	Victoria
	Susan
	Jessica
	Lauren
	Abigail
	Margaret
	Lily
	Michelle
	Madison
	Jessica
	Sophie
	Cooper
	Tracy
	Charlotte
	Sarah
	Murphy
	Li
	Smith
	Jones
	O'Kelly
	Johnson
	Jones
	Wilson
	O'Sullivan
	Lam
	Brown
	Walsh
	Martin
	Taylor
	Jones
	Gelbero
	Wilson
	Taylor
	Davies
	O'Brien
	Miller
	Roy
	Taylor
	Byrne
	Davis
	Tremblay
	Morton
	Singh
	Evans
	O'Ryan
	Garcia
	Lee
	White
	Wang
	Thomas
	O'Connor
	Rodriguez
	Gagnon
	Martin
	Anderson
	Roberts
	O'Neill
	Anderson
	Clark
	Wright
	Mitchell
	Johnson
	Rodriguez
	Lopez
	Perez
	Jackson
	Lewis
	Hill
	Roberts
	Jones
	White
	Scott
	Turner
	Brown
	Harris
	Walker
	Green
	Phillips
	Hall
	Adams
	Campbell
	Miller
	Allen
	Baker
	Parker
	Garcia
	Young
	Gonzalez
	Evans
	Moore
	Martinez
	Hernandez
	Nelson
	Edwards
	Taylor
	Robinson
	Carter
	Collins
	George
	Ronald
	John
	Richard
	Kenneth
	Anthony
	Charles
	Paul
	Steven
	Michael
	Joseph
	Mark
	Thomas
	Donald
	Brian
	Jeff
	Mary
	Jennifer
	Lisa
	Sandra
	Michelle
	Patricia
	Maria
	Nancy
	Donna
	Laura
	Linda
	Susan
	Karen
	Carol
	Sarah
	Barbara
	Margaret
	Betty
	Ruth
	Kimberly
	Elizabeth
	Dorothy
	Helen
	Sharon
	Deborah
	Sanders
	Joy
	Sean
	Walton
	Reznor
	Antonio
	Trump
	Julia
	Blair
	Nobel
	Johann
	Ann
	Lindsay
	Laura
	Sam
	Kelly
	Bill
	Maya
	Adriana
	Lola
	Ingrid
	Clare
	Emma
	Isabella
	Abigail
	Charlotte
	Lillian
	Hannah
	Samantha
	Caroline
	Sheeran
	Madelyn
	Kate
	Hayes
	Arianna
	Maggie
	Audrey
	Luis
	Paolo
	Oliver
	Emilio
	Gustav
	Tyler
	Taylor
	Javier
	Kristian
	Henrik
	Stefan
	Etienne
	Johnson
	Ferdinand
	Hector
	Catlin
	Hugo
	Ali
	Raymond
	Xavier
	Harry
	Potter
	Evan
	Elvis
	Harrison
	Jasper
	Hitler
	<<<<<<< HEAD
	Scott
	=======
	John
	Patricia
	Robert
	Linda
	Richard
	Susan
	Joseph
	Jessica
	Thomas
	Sarah
	Charles
	Margaret
	Christopher
	Daniel
	Nancy
	Matthew
	Lisa
	Anthony
	Betty
	Donald
	Dorothy
	Paul
	Ashley
	Andrew
	Donna
	Kenneth
	Carol
	Joshua
	Amanda
	Brian
	Melissa
	Deborah
	Ronald
	Stephanie
	Timothy
	Rebecca
	Jeffrey
	Helen
	Sharon
	Gary
	Kathleen
	Nicholas
	Amy
	Eric
	Shirley
	Angela
	Larry
	Justin
	Brenda
	Scott
	Pamela
	Nicole
	Frank
	Katherine
	Benjamin
	Samantha
	Gregory
	Christine
	Samuel
	Virginia
	Rachel
	Jack
	Janet
	Dennis
	Jerry
	Carolyn
	Maria
	Aaron
	Heather
	Jose
	Julie
	Douglas
	Joyce
	Peter
	Evelyn
	Nathan
	Victoria
	Zachary
	Walter
	Christina
	Kyle
	Lauren
	Harold
	Frances
	Carl
	Martha
	Judith
	Gerald
	Cheryl
	Keith
	Megan
	Roger
	Andrea
	Arthur
	Olivia
	Terry
	Ann
	Jacqueline
	Ethan
	Austin
	Doris
	Kathryn
	Albert
	Gloria
	Jesse
	Teresa
	Willie
	Sara
	Billy
	Janice
	Marie
	Bruce
	Noah
	Jordan
	Judy
	Dylan
	Theresa
	Ralph
	Madison
	Roy
	Beverly
	Alan
	Denise
	Wayne
	Marilyn
	Eugene
	Amber
	Juan
	Danielle
	Gabriel
	Rose
	Louis
	Brittany
	Russell
	Diana
	Randy
	Abigail
	Vincent
	Natalie
	Philip
	Jane
	Logan
	Lori
	Bobby
	Alexis
	Tiffany
	Johnny
	Kayla
	Boccaccio
	Gruber
	Huber
	Bauer
	Wagner
	Pichler
	Steiner
	Moser
	Mayer
	Hofer
	Leitner
	Berger
	Fuchs
	Eder
	Fischer
	Schmid
	Winkler
	Weber
	Schwarz
	Maier
	Schneider
	Reiter
	Mayr
	Schmidt
	Wimmer
	Egger
	Brunner
	Lang
	Baumgartner
	Auer
	Binder
	Lechner
	Wolf
	Wallner
	Aigner
	Ebner
	Koller
	Lehner
	Haas
	Schuster
	Heilig
	Peeters
	Janssens
	Maes
	Jacobs
	Mertens
	Willems
	Claes
	Goossens
	Wouters
	Dubois
	Lambert
	Dupont
	Martin
	Simon
	Nielsen
	Jensen
	Hansen
	Pedersen
	Andersen
	Christensen
	Larsen
	Rasmussen
	Petersen
	Madsen
	Kristensen
	Olsen
	Thomsen
	Christiansen
	Poulsen
	Johansen
	Mortensen
	Joensen
	Hansen
	Jacobsen
	Olsen
	Poulsen
	Petersen
	Johannesen
	Thomsen
	Nielsen
	Johansen
	Rasmussen
	Simonsen
	Djurhuus
	Jensen
	Danielsen
	Mortensen
	Mikkelsen
	Dam
	Andreasen
	Johansson
	Nyman
	Lindholm
	Karlsson
	Andersson
	Hendriks
	or
	but
	nor
	so
	for
	yet
	after
	although
	as
	as if
	as long as
	because
	before
	even if
	even though
	once
	since
	so that
	though
	till
	unless
	until
	what
	when
	whenever
	wherever
	whether
	while
	why
	if
	after
	from
	by
	for
	with
	but
	of
	to
	and
	before
	how
	which
	a
	an
	the
	these
	our
	i
	he
	she
	they
	there
	are
	is
	be
	you
	able
	about
	across
	all
	almost
	also
	am
	among
	any
	at
	been
	best
	can
	cannot
	could
	dear
	did
	do
	does
	either
	else
	ever
	every
	get
	got
	have
	has
	had
	her
	hers
	him
	his
	however
	in
	into
	it
	its
	just
	least
	let
	like
	likely
	other
	rather
	me
	might
	most
	must
	my
	neither
	not
	nor
	often
	off
	on
	only
	should
	some
	then
	that
	their
	then
	this
	too
	us
	we
	who
	whom
	would
	yet
	here
	there
	bbc
	abc
	news
	maybe
	perhaps
	man
	men
	woman
	women
	Out
	yes
	no
	in
	out
	Afghanistan
	Albania
	Algeria
	America
	Andorra
	Angola
	Antigua
	Argentina
	Armenia
	Australia
	Austria
	Azerbaijan
	Bahamas
	Bahrain
	Bangladesh
	Barbados
	Belarus
	Belgium
	Belize
	Russians
	Europeans
	Benin
	Bhutan
	Bissau
	Bolivia
	Bosnia
	Botswana
	Brazil
	British
	Britan
	Brunei
	Bulgaria
	Burkina
	Burma
	Burundi
	Cambodia
	Cameroon
	Canada
	Cape Verde
	Central African Republic
	Chad
	Chile
	China
	Colombia
	Comoros
	Congo
	Costa Rica
	country debt
	Croatia
	Cuba
	Cyprus
	Czech
	Denmark
	Djibouti
	Dominica
	East Timor
	Ecuador
	Egypt
	El Salvador
	Emirate
	England
	Eritrea
	Estonia
	Ethiopia
	Russian
	Fiji
	Finland
	France
	Gabon
	Gambia
	Georgia
	French
	Germany
	Ghana
	Great Britain
	Europe
	European
	Britain
	Greece
	Grenada
	Grenadines
	Guatemala
	Guinea
	Guyana
	Haiti
	Herzegovina
	Honduras
	Hungary
	Iceland
	in usa
	India
	Indian
	Indonesia
	Iran
	Iraq
	Ireland
	Israel
	Italy
	Ivory Coast
	Jamaica
	Japan
	Jordan
	Kazakhstan
	Kenya
	Kiribati
	Korea
	Kosovo
	Kuwait
	Kyrgyzstan
	Laos
	Latvia
	Lebanon
	Lesotho
	Liberia
	Libya
	Liechtenstein
	Lithuania
	Luxembourg
	Macedonia
	Madagascar
	Malawi
	Malaysia
	Maldives
	Mali
	Malta
	Marshall
	Mauritania
	Mauritius
	Mexico
	Micronesia
	Moldova
	Monaco
	Mongolia
	Montenegro
	Morocco
	Mozambique
	Myanmar
	Namibia
	Nauru
	Nepal
	Netherlands
	New Zealand
	Nicaragua
	Niger
	Nigeria
	Norway
	Oman
	Pakistan
	Palau
	Panama
	Papua
	Paraguay
	Peru
	Philippines
	Poland
	Portugal
	Qatar
	Romania
	Russia
	Rwanda
	Samoa
	San Marino
	Sao Tome
	Saudi Arabia
	scotland
	scottish
	Senegal
	Serbia
	Seychelles
	Sierra Leone
	Singapore
	Slovakia
	Slovenia
	Solomon
	Somalia
	South Africa
	Africa
	South Sudan
	Spain
	Sri Lanka
	St. Kitts
	St. Lucia
	St Kitts
	St Lucia
	Saint Kitts
	Santa Lucia
	Sudan
	Suriname
	Swaziland
	Sweden
	Switzerland
	Syria
	Taiwan
	Tajikistan
	Tanzania
	Thailand
	Tobago
	Togo
	Tonga
	Trinidad
	Tunisia
	Turkey
	Turkmenistan
	Tuvalu
	Uganda
	Ukraine
	United Kingdom
	United States
	Uruguay
	USA
	US
	UK
	Uzbekistan
	Vanuatu
	Vatican
	Venezuela
	Vietnam
	wales
	welsh
	Yemen
	Zambia
	Zimbabwe
	Afghan
	Albanian
	Algerian
	American
	Andorran
	Angolan
	Antiguans
	Argentinean
	Armenian
	Australian
	Austrian
	Azerbaijani
	Bahamian
	Bahraini
	Bangladeshi
	Barbadian
	Barbudans
	Batswana
	Belarusian
	Belgian
	Bourgeoi
	Bourgeoisie
	Belizean
	Beninese
	Bhutanese
	Bolivian
	Beverly Hills
	Bosnian
	Brazilian
	British
	Bruneian
	Bulgarian
	Burkinabe
	Burmese
	Burundian
	Cambodian
	Cameroonian
	Canadian
	Cape Verdean
	Central African
	Chadian
	Chilean
	Chinese
	Colombian
	Comoran
	Congolese
	Costa Rican
	Croatian
	Cuban
	Cypriot
	Czech
	Danish
	Djibouti
	Dominican
	Dutch
	East Timorese
	Ecuadorean
	Egyptian
	Emirian
	Equatorial Guinean
	Eritrean
	Estonian
	Ethiopian
	Fijian
	Filipino
	Finnish
	French
	Gabonese
	Gambian
	Georgian
	German
	Ghanaian
	Greek
	Grenadian
	Guatemalan
	Guinea-Bissauan
	Guinean
	Guyanese
	Haitian
	Herzegovinian
	Honduran
	Hungarian
	I-Kiribati
	Icelander
	Indian
	Indonesian
	Iranian
	Iraqi
	Irish
	Israeli
	Italian
	Ivorian
	Jamaican
	Japanese
	Jordanian
	Kazakhstani
	Kenyan
	Kittian
	Nevisian
	Kuwaiti
	Kyrgyz
	Laotian
	Latvian
	Lebanese
	Liberian
	Libyan
	Liechtensteiner
	Lithuanian
	Luxembourger
	Macedonian
	Malagasy
	Malawian
	Malaysian
	Maldivian
	Malian
	Maltese
	Marshallese
	Mauritanian
	Mauritian
	Mexican
	Micronesian
	Moldovan
	Monacan
	Mongolian
	Moroccan
	Mosotho
	Motswana
	Mozambican
	Namibian
	Nauruan
	Nepalese
	New Zealander
	Ni-Vanuatu
	Nicaraguan
	Nigerian
	Nigerien
	North Korean
	Northern Irish
	Norwegian
	Omani
	Pakistani
	Palauan
	Panamanian
	Papua New Guinean
	Paraguayan
	Peruvian
	Polish
	Portuguese
	Qatari
	Romanian
	Russian
	Rwandan
	Saint Lucian
	Salvadoran
	Samoan
	San Marinese
	Sao Tomean
	Saudi
	Scottish
	Senegalese
	Serbian
	Seychellois
	Sierra Leonean
	Singaporean
	Slovakian
	Slovenian
	Solomon Islander
	Somali
	South African
	South Korean
	Spanish
	Sri Lankan
	Sudanese
	Surinamer
	Swazi
	Swedish
	Swiss
	Syrian
	Taiwanese
	Tajik
	Tanzanian
	Thai
	Togolese
	Tongan
	Trinidadian
	Tobagonian
	Tunisian
	Turkish
	Tuvaluan
	Ugandan
	Ukrainian
	Uruguayan
	Uzbekistani
	Uzbekistan
	Venezuelan
	Vietnamese
	Welsh
	Yemenite
	Zambian
	Zimbabwean
	Monday
	Tuesday
	Wednesday
	Thursday
	Friday
	Saturday
	Sunday
	Beijing
	Chicago
	Taoyuan
	San Antonio
	Toronto
	New York
	English
	Pennsylvania
	South Carolina
	Texas
	Wisconsin
	St Paul
	London
	Soho
	Brexit
	Britain
	Manchester
	Middle Eastern
	Taipei
	Vienna
	EU
	Yemeni
	Europe
	European
	South America
	South American
	Asia
	Asian
	Oceania
	Oceanian
	Africa
	African
	Antartica
	Pacific
	Atlantic
	Mediterranean
	Scot
	Scots
	Korean
	California
	Swedes
	Swede
	Zurich
	Yemenis
	Western
	Chicago
	northeast
	southeast
	southwest
	northwest
	northern
	western
	eastern
	sourthern
	States
	state
	Limburger
	Limburgers
	Country
	Countries
	City
	Cities
	County
	Counties
	York
	Madison
	def load_who_file():
	"""
	Read file whos.txt and insert data to a hash set
	:return: the hash set with all whos from file whos.txt
	"""
	return set([who.strip('\n').lower() for who in file("data/whos.txt", 'r').readlines()])

	def load_common_name_file():
	"""
	Read file common_name.txt and insert data to a hash set
	:return: the hash set with all whos from file common_name.txt
	"""
	return set([common_name.strip('\n').lower() for common_name in file("data/common_name.txt", 'r').readlines()])

	def load_common_adj_file():
	"""
	Read file common_adj.txt and insert data to a hash set
	:return: the hash set with all whos from file common_adj.txt
	"""
	return set([common_adj.strip('\n').lower() for common_adj in file("data/common_adj.txt", 'r').readlines()])

	def load_country_file():
	"""
	Read file countries.txt and insert data to a hash set
	:return: the hash set with all countries name from file countries.txt
	"""
	return set([country.strip('\n').lower() for country in file("data/countries.txt", 'r').readlines()])

	def load_conjunction_file():
	"""
	Read file conjunctions.txt and insert data to a hash set
	:return: the hash set with all conjunctions from file conjunctions.txt
	"""
	return set([conjunctions.strip('\n').lower() for conjunctions in file("data/conjunctions.txt", 'r').readlines()])

	def load_prefix_library():
	"""
	Generate a hash set with all prefixes
	:return: the hash set with all prefixes
	"""
	return set([prefix.strip('\n').lower() for prefix in file("data/prefix.txt", 'r').readlines()])

	def load_organ_library():
	"""
	Generate a hash set with all organization titles
	:return: the hash set with all organization titles
	"""
	return set([organ.strip('\n').lower() for organ in file("data/organization.txt", 'r').readlines()])

	def load_month_file():
	"""
	Generate a hash set with all months
	:return: the hash set with all months
	"""
	return set([month.strip('\n').lower() for month in file("data/month.txt", 'r').readlines()])

	def load_verb_file():
	"""
	Read files irregular_verbs.txt and regular_verbs.txt and insert data to a hash set
	:return: the hash set with all verbs from files
	"""
	return set(file("data/irregular_verbs.txt", 'r').read().split(', ')) \|\
	set(file("data/regular_verbs.txt", 'r').read().split(', '))

	def load_preposition_file():
	"""
	Read files preposition.txt and insert data to a hash set
	:return: the hash set with all preposition from files
	"""
	return set(open("data/preposition.txt", 'r').read().split(', '))

	def contains_country(ngram, country_set):
	"""
	Identify if a n-gram has countries, return 1 (has feature) 0 (no such feature)
	:param ngram: a ngram
	:param country_set: a set contains all countries
	:return: 1 (has feature) or 0 (no such feature)
	"""
	# find if any word in current ngram has country name
	words = ngram[0].split(' ')
	for word in words:
	if word.lower() in country_set:
	return 1
	if len(words) >= 2:
	for i in range(1, len(words)):
	if (words[i-1]+' '+words[i]).lower() in country_set:
	return 1
	if len(words) >= 3:
	for i in range(2, len(words)):
	if (words[i-2]+' '+words[i-1]+' '+words[i]).lower() in country_set:
	return 1
	return 0


	def contains_common_name(ngram, common_name_set):
	"""
	Identify if a n-gram has common name, return 1 (has feature) 0 (no such feature)
	:param ngram: a ngram
	:param common_name_set: a set contains all common name
	:return: 1 (has feature) or 0 (no such feature)
	"""
	# find if any word in current ngram has common name
	for word in ngram[0].split(' '):
	if word.lower() in common_name_set:
	return 1
	return 0

	def contains_common_adj(ngram, common_adj_set):
	"""
	Identify if a n-gram has common adj, return 1 (has feature) 0 (no such feature)
	:param ngram: a ngram
	:param common_name_set: a set contains all common adj
	:return: 1 (has feature) or 0 (no such feature)
	"""
	# find if any word in current ngram has common adj
	for word in ngram[0].split(' '):
	if word.lower() in common_adj_set:
	return 1
	return 0

	def contains_prefix(ngram, prefix_set):
	"""
	Identify if a n-gram has prefix, return 1 (has feature) 0 (no such feature)
	:param ngram: a ngram
	:param prefix_set: a set contains all prefixes
	:return: 1 (has feature) or 0 (no such feature)
	"""
	# find if any word in current ngram has country name
	for word in ngram[0].split(' '):
	if word.lower() in prefix_set:
	return 1
	return 0

	def contains_month(ngram, month_set):
	"""
	Identify if a n-gram has month, return 1 (has feature) 0 (no such feature)
	:param ngram: a ngram
	:param month_set: a set contains all months
	:return: 1 (has feature) or 0 (no such feature)
	"""
	# find if any word in current ngram has country name
	for word in ngram[0].split(' '):
	if word.lower() in month_set:
	return 1
	return 0

	def contains_organization(ngram, organ_set):
	"""
	Identify if a n-gram has organization titles, return 1 (has feature) 0 (no such feature)
	:param ngram: a ngram
	:param organ_set: a set contains commom organization titles
	:return: 1 (has feature) or 0 (no such feature)
	"""
	# find if any word in current ngram has country name
	words = ngram[0].split(' ')
	for word in words:
	if word.lower() in organ_set:
	return 1

	if len(words) >= 2:
	for i in range(1, len(words)):
	if (words[i-1]+' '+words[i]).lower() in organ_set:
	return 1
	return 0

	def contains_conjunction(ngram, conjunctions_set):
	"""
	Identify if a n-gram has conjunctions, return 1 (has feature) 0 (no such feature)
	:param ngram: a ngram
	:param conjunctions_set: a set contains all conjunctions
	:return: 1 (has feature) 0 (no such feature)
	"""
	# find if any word in current ngram has country name
	for word in ngram[0].split(' '):
	if word.lower() in conjunctions_set:
	return 1
	return 0

	def contains_verb(ngram, verb_set):
	"""
	Identify if a n-gram has conjunctions, return 1 (has feature) 0 (no such feature)
	:param ngram: a ngram
	:param conjunctions_set: a set contains all conjunctions
	:return: 1 (has feature) 0 (no such feature)
	"""
	# find if any word in current ngram has country name
	for word in ngram[0].split(' '):
	if word.lower() in verb_set:
	return 1
	return 0

	def is_all_upper(ngram):
	"""
	Check all the words in the content if all words start with upper case
	:param ngram: a ngram
	:return: 1 (has feature) 0 (no such feature)
	"""
	for word in ngram[0].split(' '):
	if len(word) > 0 and word[0].islower():
	return 0
	return 1

	def has_who(ngram, who_set):
	"""
	Check all the words in the content if it has who
	:param ngram: a ngram
	:return: 1 (has feature) 0 (no such feature)
	"""
	for word in ngram[0].split(' '):
	if word.lower() in who_set:
	return 1
	return 0

	def no_more_than_one_lower(ngram):
	"""
	Check all the words in the content if all words has less than 2 lower case at each starting letter
	:param ngram: a ngram
	:return: 1 (has feature) 0 (no such feature)
	"""
	count = 0
	for word in ngram[0].split(' '):
	if word.islower():
	count += 1
	if count > 1:
	return 0
	return 1

	def has_prefix_before_ngram(ngram, single_grams, prefix_set):
	"""
	Check if the word in front of the input ngram is a prefix for name
	:param ngram: a n-gram
	:param single_grams: all words in an article with order
	:param prefix_set: a set contains all prefixes
	:return: 1 (has feature) 0 (no such feature)
	"""
	if (ngram[2] - 1) >= 0:
	preWord = single_grams[ngram[2] - 1][0].lower()
	if preWord in prefix_set:
	return 1
	return 0

	def has_human_verb(ngram, single_grams, verb_set):
	"""
	Check if the word after the input ngram is a verb usually used for human
	:param ngram: a n-gram
	:param single_grams: all words in an article with order
	:param verb_set: a set contains all verbs usually used for human
	:return: 1 (has feature) 0 (no such feature)
	"""
	ngram_end_index = ngram[3]
	if (ngram_end_index + 1) < len(single_grams):
	if single_grams[ngram_end_index+1][0] in verb_set:
	return 1
	return 0

	def features_label_separator(ngrams, labels_set=None):
	"""
	Separate features and label from n-grams and return two lists
	:param ngrams: all n-grams from all articles
	:param labels_set: the hash set of all labels
	:return: two lists -- features and label from n-grams
	"""
	features = [ngram[4:] for ngram in ngrams]
	label = [1 if ngram[0] in labels_set else 0 for ngram in ngrams] if labels_set else []
	return features, label

	def afterpreposition(ngram, single_grams, preposition_set):
	"""
	Check if the word in front of the input ngram is a preposition
	:param ngram: a n-gram
	:param single_grams: all words in an article with order
	:param preposition_set: a set contains all prefixes
	:return: 1 (has feature) 0 (no such feature)
	"""
	if (ngram[2] - 1) >= 0:
	prepos = single_grams[ngram[2] - 1][0].lower()
	if prepos in preposition_set:
	return 1
	return 0

	def before_who(ngram, single_grams, who_set):
	"""
	Check if the word after the input ngram is "who"
	:param ngram: a n-gram
	:param single_grams: all words in an article with order
	:param who_set: a set contains who
	:return: 1 (has feature) 0 (no such feature)
	"""
	return 1 if (ngram[2]+1) < len(single_grams) and (single_grams[ngram[2]+1][0]).lower() in who_set else 0

	def has_duplicate(ngram):
	"""
	Check if the words in input ngram has any duplicate words
	:param ngram: a n-gram
	:return: 1 (has feature) 0 (no such feature)
	"""
	words = set()
	for word in ngram[0].split(' '):
	if word in words:
	return 1
	else:
	words.add(word)
	return 0

	def count_occurrences(ngram, single_grams):
	"""
	Count the word's occurrences in the article(only for single words)
	:param ngram: a n-gram
	:param single_grams: all words in an article with order
	:return: word's occurrences
	"""
	# print (single_grams)
	# data = ' '.join(a[0] for a in single_grams)
	return single_grams.count(ngram[0])

	def start_end_dash(ngram):
	"""
	Check if the ngram contain string starts or ends with dash
	:param ngram: a n-gram
	:return: 1 (has feature) 0 (no such feature)
	"""
	words = ngram[0].split(' ')
	if not words[0].isalpha() or (len(words) > 1 and not words[-1].isalpha()) or words.count('-') > 1:
	return 1
	count = 0
	for word in words:
	count += word.count('-')
	if count > 1:
	return 1
	return 0

	def has_one_dash(ngram):
	"""
	Check if the ngram contain exactly one dash
	:param ngram: a n-gram
	:return: 1 (has feature) 0 (no such feature)
	"""
	words = ngram[0].split(' ')
	if words.count('-') == 1:
	return 1
	count = 0
	for word in words:
	count += word.count('-')
	if count == 1:
	return 1
	return 0

	def all_upper_character(ngram):
	"""
	Check all the words in the content if all character in words is upper case
	:param ngram: a ngram
	:return: 1 (has feature) 0 (no such feature)
	"""
	for word in ngram[0].split(' '):
	if word.isupper():
	return 1
	return 0

	def word_length(ngram):
	"""
	Check number of words
	:param ngram: a ngram
	:return: number of words
	"""
	words = ngram[0].split(' ')
	return len(words)

	def has_fullstop_before_ngram(ngram, single_grams2):
	"""
	Check if the word in front of the input ngram is a fullstop
	:param ngram: a n-gram
	:param single_grams2: all words including punctuation in an article with order
	:return: 1 (has feature) 0 (no such feature)
	"""
	if (ngram[2] - 1) >= 0:
	preWord = single_grams2[ngram[2] - 1][0].lower()
	if preWord.endswith("."):
	return 1
	return 0

	def has_comma_before_ngram(ngram, single_grams2):
	"""
	Check if the word in front of the input ngram is a comma
	:param ngram: a n-gram
	:param single_grams2: all words including punctuation in an article with order
	:return: 1 (has feature) 0 (no such feature)
	"""
	if (ngram[2] - 1) >= 0:
	preWord = single_grams2[ngram[2] - 1][0].lower()
	if preWord.endswith(","):
	return 1
	return 0

	def has_comma(ngram, single_grams2):
	"""
	Check if the word has a comma in the end
	:param ngram: a n-gram
	:param single_grams2: all words including punctuation in an article with order
	:return: 1 (has feature) 0 (no such feature)
	"""
	lastWord = single_grams2[ngram[3]][0]
	if lastWord.endswith(","):
	return 1
	return 0

	def is_name_suffix(ngram):
	"""
	Check if the word has a suffix
	:param ngram: a n-gram
	:return: 1 (has feature) 0 (no such feature)
	"""
	suffixes = ['Sr', 'Sr.', 'Jr', 'Jr.', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX',
	'sr', 'sr.', 'jr', 'jr.', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx']
	for word in ngram[0].split(' '):
	if word in suffixes:
	return 1
	return 0

	def start_with_suffix(ngram):
	"""
	Check if the word has a suffix
	:param ngram: a n-gram
	:return: 1 (has feature) 0 (no such feature)
	"""
	suffixes = ['Sr', 'Sr.', 'Jr', 'Jr.', 'III', 'IV', 'V', 'VI', 'VII', 'VIII', 'IX', 'X', 'XI', 'XII', 'XIII', 'XIV', 'XV', 'XVI', 'XVII', 'XVIII', 'XIX', 'XX',
	'sr', 'sr.', 'jr', 'jr.', 'iii', 'iv', 'v', 'vi', 'vii', 'viii', 'ix', 'x', 'xi', 'xii', 'xiii', 'xiv', 'xv', 'xvi', 'xvii', 'xviii', 'xix', 'xx']
	word = ngram[0].split(' ')
	if word[0] in suffixes:
	return 1
	return 0
	from sklearn.linear_model import LogisticRegression
	from sklearn import tree, svm
	from sklearn.naive_bayes import BernoulliNB
	from sklearn.ensemble import RandomForestClassifier
	from sklearn.pipeline import make_pipeline
	from sklearn.preprocessing import StandardScaler
	from sklearn.model_selection import GridSearchCV


	def build_decision_tree(data, label):
	"""
	Build the decision tree based on the data and its corresponding label
	:param data: a list of tuple which contains all features of a data
	:param label: a list of label for data
	:return: a trained decision tree
	"""
	dt_tree = tree.DecisionTreeClassifier()
	return dt_tree.fit(data, label)


	def build_support_vector_machine(data, label):
	"""
	Build the support vector machine based on the data and its corresponding label
	:param data: a list of tuple which contains all features of a data
	:param label: a list of label for data
	:return: trained support vector machine
	"""
	trained_svm = svm.SVC(gamma='scale', C=100)
	return trained_svm.fit(data, label)


	def build_nb_classifier(data, label):
	"""
	Build the naive bayes classifier based on the data and its corresponding label
	:param data: a list of tuple which contains all features of a data
	:param label: a list of label for data
	:return: trained naive bayes classifier
	"""
	classifier = BernoulliNB()
	return classifier.fit(data, label)


	def build_rf_classifier(data, label):
	"""
	Build the random forest classifier based on the data and its corresponding label
	:param data: a list of tuple which contains all features of a data
	:param label: a list of label for data
	:return: trained naive bayes classifier
	"""
	# pipe = make_pipeline(StandardScaler(),RandomForestClassifier())
	# param_grid = {'n_estimators': list(range(1, 30))}
	# gs = GridSearchCV(estimator=RandomForestClassifier(), param_grid=param_grid, \
	# iid=False, n_jobs=-1, refit=True,scoring='accuracy',cv=10)
	# gs.fit(data, label)
	# n_estimators=gs.best_params_['n_estimators']
	classifier = RandomForestClassifier(n_estimators=34, n_jobs=-1, criterion='gini', class_weight={0: 1, 1: 1.45}, random_state=10)
	return classifier.fit(data, label)


	def build_lr_classifier(data, label):
	"""
	Build the logistic regression classifier based on the data and its corresponding label
	:param data: a list of tuple which contains all features of a data
	:param label: a list of label for data
	:return: trained logistic regression classifier
	"""
	classifier = LogisticRegression(solver='newton-cg',n_jobs=-1,class_weight={0: 1, 1: 1.5})
	return classifier.fit(data, label)
	January
	February
	March
	April
	May
	June
	July
	August
	September
	October
	November
	December
	import re


	def generate_ngrams(filename, content, n):
	"""
	Generate n-grams (with a feature whether it contains "'s") from the content
	:param filename: filename
	:param content: the whole article
	:param n: the size of n gram
	:return: generated list of n-grams, single grams
	"""
	sentences = content.split(".")
	index, index2 = 0, 0
	n_grams, single_grams, single_grams2 = [], [], []
	for sentence in sentences:
	sections = sentence.split(",")
	for section in sections:
	parts = section.split(";")
	for part in parts:
	words = part.split()
	single_grams_temp, feature_single_quote_temp = [], []
	for i in range(len(words)):
	words2 = words[:]
	words2[i] = re.sub('[;@#$()\{\}:"]', '', words2[i])
	single_grams2.append((words2[i], filename, index2, index2))
	index2 += 1

	# first clean the data
	for i in range(len(words)):
	# clean data by removing special characters
	words[i] = re.sub('[?;!@#$()\{\}:\,\."]', '', words[i])

	# for cases 's, take off 's
	if (len(words[i]) >= 2 and words[i][-2] == "'"):
	words[i] = words[i][:-2]
	feature_single_quote_temp.append(1)
	elif (len(words[i]) >= 2 and words[i][-2] == "s" and words[i][-1] == "'"):
	words[i] = words[i][:-1]
	feature_single_quote_temp.append(1)
	else:
	feature_single_quote_temp.append(0)

	single_grams_temp.append((words[i], filename, index, index))
	index += 1

	n_grams_temp = [] # the return list
	for i in range(len(words)):
	temp = words[i]
	for j in range(1, n):
	if (i + j) < len(words):
	temp = temp + ' ' + words[i + j]
	temp_with_first_index = (temp, filename, single_grams_temp[i][2], single_grams_temp[i + j][2], feature_single_quote_temp[i + j])
	n_grams_temp.append(temp_with_first_index)

	# single_grams += n_grams
	for i in range(len(single_grams_temp)):
	n_grams_temp.append(single_grams_temp[i] + (feature_single_quote_temp[i],))

	n_grams.extend(n_grams_temp)
	single_grams.extend(single_grams_temp)
	return n_grams, single_grams, single_grams2

	def eliminate_all_lower(ngrams):
	"""
	Take out n-gram which does not have any word capitalised
	:param ngrams: all n-grams
	:return: all n-grams for each n-gram has a least one word capitalised
	"""
	new_ngram = []
	for ngram in ngrams:
	for word in ngram[0].split(' '):
	if len(word) > 0 and word[0].isupper():
	new_ngram.append(ngram)
	break

	return new_ngram
	university
	college
	association
	commission
	council
	laboratory
	government
	committee
	department
	school
	research
	office
	affairs
	court
	corporation
	company
	agency
	organization
	group
	empire
	league
	music
	hotel
	hotels
	white house
	party
	hilton
	Walmart
	Genk
	Brugge
	Concert
	organisation
	prize
	rolling stones
	White House
	Virgin Galactic
	Art Brut
	amazon
	walmart
	art
	following
	club
	people
	human
	guitar
	violin
	def take_out_overlapped(ngrams, predict, label):
	"""
	Take out n-gram which is the subset of another n-gram
	:param ngrams: all n-grams
	:return: the remaining n-grams
	"""
	new_ngrams, new_predict, new_label, prev, prev_predict = [], [], [], None, 0
	for element_index in range(len(ngrams)):
	# if prev is None \|\| (filenames are different) \|\| (starting index are different)
	if not prev \
	or ngrams[element_index][1] != prev[1] \
	or ngrams[element_index][2] == 0 \
	or prev_predict == 0 \
	or (#ngrams_labels_predicts_sets[element_index][0][1] == prev[0][1] \
	# pre[element_index]==1 \
	# and prev[2]==1 \
	not(prev[2] <= ngrams[element_index][2] <= prev[3]) \
	or not(prev[2] <= ngrams[element_index][3] <= prev[3])):
	prev = ngrams[element_index]
	prev_predict = predict[element_index]
	new_ngrams.append(ngrams[element_index])
	new_predict.append(predict[element_index])
	new_label.append(label[element_index])

	return new_ngrams, new_predict, new_label


	def set_predict_value(ngrams, predict):
	for element_index in range(len(ngrams)):
	# 19: start_end_dash, 5: contains_country, 10: contains_prefix, 12: contains_organization, 18: contains_verb,\
	# 6: contains_conjunction, 29: start_with_suffix, 30: contains_common_adj
	if ngrams[element_index][19] == 1 or ngrams[element_index][5] == 1 \
	or ngrams[element_index][29] == 1 or ngrams[element_index][6] == 1 \
	or ngrams[element_index][10] == 1 or ngrams[element_index][12] == 1 \
	or ngrams[element_index][18] == 1 or ngrams[element_index][30] == 1:
	predict[element_index] = 0

	return predict