@eoglethorpe
Created November 13, 2016 12:07
Creating a system for identifying keywords in textual humanitarian data

This is a brief crash course for building a system that identifies keywords in datasets where segments of text have known classifications. Although it is written in a humanitarian context, it can be used flexibly elsewhere.

Overview: What's in the data?

A sample dataset, in normalized form, could look like the following:

Column 1: A segment of text (a few words, a sentence, or a paragraph) that has been manually written by the user or copied from a source such as a newspaper or a piece of analysis.
Columns 2...x: Columns containing boolean or numerical values indicating whether a given sector is present.

For instance:

| Text | WASH | Food Security | Shelter |
| --- | --- | --- | --- |
| There was an acute shortage of water sanitation products, food supplies and housing as a result of the crisis. | X | X | X |
| The crisis has rendered 10,000 people without housing. | | | X |

The dataset could also be structured in the following format, where it is denormalized:

Column 1: A segment of text (a few words, a sentence, or a paragraph) that has been manually written by the user or copied from a source such as a newspaper or a piece of analysis.
Column 2: A list of the sectors under which the text has been classified.

For instance:

| Text | Classification |
| --- | --- |
| There was an acute shortage of water sanitation products, food supplies and housing. | WASH, Food Security, Shelter |
| The crisis has rendered 10,000 people without housing. | Shelter |

In this situation, you would first normalize the data by splitting each classification list into its individual categories based on a separator (in this case, a comma). There are a few caveats to watch out for:

  • The same sector could be misspelled, meaning it would be counted as a different sector. To prevent this:
    • First split your categories on the separator, then manually review the unique categories. Do a simple search and replace for any misspelled categories.
    • If you have many categories and manual review isn't practical, you can compute the Levenshtein distance between every pair of categories and flag pairs below a small distance threshold (see the sketch after this list). Note that this technique won't catch sectors that go by multiple names: for instance, "Water" and "WASH" are far apart by edit distance but can be considered the same sector.
  • Different separators could be used from entry to entry (commas in some rows, semicolons in others), so standardize on one before splitting.
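
As a rough illustration of the pairwise comparison, here is a minimal Python sketch. The category strings and the distance threshold are made up for the example, not taken from a real dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

raw = "WASH, Food Security, Shleter, Shelter, Food Secuirty"
categories = sorted({c.strip() for c in raw.split(",")})

# Flag pairs whose edit distance is small; likely misspellings of each other.
for i, a in enumerate(categories):
    for b in categories[i + 1:]:
        d = levenshtein(a.lower(), b.lower())
        if d <= 2:  # assumed threshold; tune for your data
            print(f"possible duplicate: {a!r} ~ {b!r} (distance {d})")
```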

Cleaning the data

Once you have sorted out your categories, the next step is to clean the text entries themselves.

Remove stop words. Stop words are what separate us from the robots. For this text classification exercise, and for most NLP, stop words are simply noise. There are many different stop word lists, so pick one that suits your environment and needs; many NLP packages have stop word removal built in. Whichever method you choose, remove all stop words from your data.

Remove punctuat!on. Like stop words, punctuation also creates a lot of noise for our classifier.
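
A minimal cleaning pass combining both steps might look like the following. The tiny stop word list here is a placeholder; in practice you would pull a fuller list (for example from an NLP package) suited to your corpus.

```python
import string

# Placeholder stop list; swap in a real one for production use.
STOP_WORDS = {"there", "was", "an", "of", "and", "as", "a", "the", "has"}

def clean(text: str) -> list[str]:
    # Strip punctuation, lowercase, then drop stop words.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("There was an acute shortage of water, sanitation products, "
            "food supplies and housing as a result of the crisis."))
# -> ['acute', 'shortage', 'water', 'sanitation', 'products', 'food',
#     'supplies', 'housing', 'result', 'crisis']
```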

Categorizing sectors

In this example, we will perform a basic form of proportional frequency classification, where we take words and count their frequency in relation to a given sector.

Absolute count. Essentially, we create a dictionary entry for every unique word present in our data and count the number of times it appears in an entry categorized under a given sector. The top-scoring words in our example would be:

| Word | Sector | Count |
| --- | --- | --- |
| crisis | Shelter | 2 |
| housing | Shelter | 2 |
| water | WASH | 1 |
| water | Food Security | 1 |

While 'housing' certainly pairs with Shelter and 'water' with WASH, 'water' is not a meaningful match for Food Security. This is certainly not a foolproof method, but with a larger dataset such obvious mistakes should not persist.
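
A minimal sketch of this count, run over the two example entries (already cleaned as in the step above):

```python
from collections import Counter

entries = [
    (["acute", "shortage", "water", "sanitation", "products",
      "food", "supplies", "housing", "result", "crisis"],
     ["WASH", "Food Security", "Shelter"]),
    (["crisis", "rendered", "10000", "people", "without", "housing"],
     ["Shelter"]),
]

counts = Counter()  # keyed by (word, sector)
for tokens, sectors in entries:
    for word in tokens:
        for sector in sectors:
            counts[(word, sector)] += 1

for (word, sector), n in counts.most_common(4):
    print(f"{word:10} {sector:15} {n}")
# crisis/Shelter and housing/Shelter both score 2, matching the table above.
```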

Percentage of words. While an absolute count can be useful, we can also express the count in relation to the total number of words present for a given sector. For instance:

| Word | Sector | Count | Total words in sector | Pct among words |
| --- | --- | --- | --- | --- |
| crisis | Shelter | 2 | 27 | .07 |
| housing | Shelter | 2 | 27 | .07 |
| water | WASH | 1 | 19 | .05 |
| water | Food Security | 1 | 19 | .05 |

Percentage of entries. We could also amend this method to score the occurrence of a given word among all entries for a sector, as opposed to its frequency among all words. For instance:

| Word | Sector | Count | % of entries containing word |
| --- | --- | --- | --- |
| crisis | Shelter | 2 | 100 |
| housing | Shelter | 2 | 100 |
| water | WASH | 1 | 100 |
| water | Food Security | 1 | 100 |

...this is where my examples fall short but in practice you will see percentages that aren't all 100.
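
Building on the counting sketch above, both proportional scores can be derived in a few lines. Note the word totals here come from the small cleaned example, so the exact figures differ from the illustrative tables.

```python
from collections import Counter

total_words = Counter()    # sector -> total words across its entries
entry_totals = Counter()   # sector -> number of entries
presence = Counter()       # (word, sector) -> entries containing the word

for tokens, sectors in entries:
    for sector in sectors:
        total_words[sector] += len(tokens)
        entry_totals[sector] += 1
        for word in set(tokens):  # presence, not frequency
            presence[(word, sector)] += 1

for (word, sector), n in counts.items():
    pct_words = n / total_words[sector]
    pct_entries = presence[(word, sector)] / entry_totals[sector]
    print(f"{word:10} {sector:15} {n}  {pct_words:.2f}  {pct_entries:.0%}")
```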

Other things to consider

  • How do you handle repetitions of the same word in one entry? Do you count it multiple times, or just record its presence in a boolean sense? There isn't a steadfast rule; it depends on how you want to analyze your text. You can try both methods, compare the results, and adjust as needed (see the snippet below).
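
For illustration, the same entry counted both ways:

```python
tokens = ["water", "water", "shortage"]

multi_count = tokens.count("water")       # frequency: counts repeats -> 2
bool_count = int("water" in set(tokens))  # presence: once per entry  -> 1
```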

What do you do now?

Once you have chosen whichever method you prefer, run the categorization on your data and you will come up with a whole host of classification scores for each sector.

One useful step here is to repeat the Levenshtein distance exercise described above to make sure that similarly spelled words are being counted together. Be careful about making wholesale changes, though: even words spelled in a similar manner can still be distinct words!

This list of words and scores is how you select which words are relevant to given sectors. Draw a line in the sand for when you consider a word a relevant classifier... it could be all words above a percentage score, or the top X words per sector (see the sketch below). There is no right or wrong way.
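
One way to draw that line, sketched with made-up scores (here, keeping the top X words per sector):

```python
from collections import defaultdict

# `scores` and X are assumptions standing in for whichever metric you chose.
scores = {("housing", "Shelter"): 0.07, ("crisis", "Shelter"): 0.07,
          ("water", "WASH"): 0.05, ("water", "Food Security"): 0.05}
X = 2

by_sector = defaultdict(list)
for (word, sector), score in scores.items():
    by_sector[sector].append((score, word))

keywords = {sector: [w for _, w in sorted(pairs, reverse=True)[:X]]
            for sector, pairs in by_sector.items()}
print(keywords)
# {'Shelter': ['housing', 'crisis'], 'WASH': ['water'], 'Food Security': ['water']}
```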

How to actually apply this to a working system

Let's say we have a platform for classifying text and users can manually say that a word should be added to a sector's list or should be removed.

What can be done in this situation is to maintain two separate lists: one containing the automatically generated words, and another containing the words users have added (or flagged for removal), as sketched below.
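
A rough sketch of how those two lists could be combined at lookup time; all names and structures here are assumptions about such a platform. Keeping user edits separate means regenerating the automatic list never clobbers them.

```python
auto_keywords = {"Shelter": {"housing", "crisis"}}   # regenerated from data
user_added   = {"Shelter": {"displacement"}}         # manual additions
user_removed = {"Shelter": {"crisis"}}               # manual removals

def effective_keywords(sector: str) -> set[str]:
    # User edits win: start from the generated list, then apply
    # additions and removals without touching the generated set itself.
    words = set(auto_keywords.get(sector, set()))
    words |= user_added.get(sector, set())
    words -= user_removed.get(sector, set())
    return words

print(effective_keywords("Shelter"))  # {'housing', 'displacement'}
```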

How could this be improved?

  • We could use n-grams to create groupings of consecutive words, as opposed to simply picking one word at a time.
    • ...we could also get fancy and look at the occurrence of n-grams where word ordering doesn't matter (see the sketch below).
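
A quick sketch of both ideas: sliding-window n-grams, plus a frozenset variant that ignores word order.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # Every run of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["acute", "shortage", "water", "sanitation"]
print(ngrams(tokens, 2))
# [('acute', 'shortage'), ('shortage', 'water'), ('water', 'sanitation')]

# Order-insensitive variant: ('water', 'shortage') and ('shortage', 'water')
# collapse to the same feature when used as frozenset keys.
unordered = [frozenset(g) for g in ngrams(tokens, 2)]
```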

Written with StackEdit.
