@eoglethorpe
Created November 13, 2016 12:07
Creating a system for identifying keywords in textual humanitarian data

This is a brief crash course for building a system that identifies keywords in datasets where segments of text have known classifications. Although it is written in a humanitarian context, it can be used flexibly elsewhere.

Overview: What's in the data?

A sample dataset, in normalized form, could look like the following:

Column 1: A segment of text (a few words, a sentence, or a paragraph) that has been manually written by the user or copied from a source such as a newspaper or a piece of analysis.
Columns 2...x: Columns containing boolean or numerical values indicating whether a given sector is present.

For instance:

| Text | WASH | Food Security | Shelter |
| --- | --- | --- | --- |
| There was an acute shortage of water sanitation products, food supplies and housing as a result of the crisis. | X | X | X |
| The crisis has rendered 10,000 people without housing. | | | X |

The dataset could also be structured in the following format, where it is denormalized:

Column 1: A segment of text (a few words, a sentence, or a paragraph) that has been manually written by the user or copied from a source such as a newspaper or a piece of analysis.
Column 2: A list of the sectors under which the text has been classified.

For instance:

| Text | Classification |
| --- | --- |
| There was an acute shortage of water sanitation products, food supplies and housing. | WASH, Food Security, Shelter |
| The crisis has rendered 10,000 people without housing. | Shelter |

In this situation, you would first normalize the data by splitting each classification list into its individual categories based on a separator (in this case, a comma). There are a few caveats to watch out for:

  • The same sector could be misspelled, meaning it would be counted as a different sector. To prevent this:
    • First split your categories on the separator, then manually review the unique categories. Do a simple search and replace for any misspelled categories.
    • If you have many categories and manual review isn't practical, you can compute the Levenshtein distance between every pair of categories and flag pairs below a small distance threshold (see the sketch after this list). Note that this technique won't catch sectors that go by multiple names: for instance, "Water" and "WASH" are far apart by edit distance but can be considered the same sector.
  • Different separators could be used from entry to entry (commas in some rows, semicolons in others), so standardize on one before splitting.
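
As a rough illustration of the pairwise comparison, here is a minimal Python sketch. The category strings and the distance threshold are made up for the example, not taken from a real dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

raw = "WASH, Food Security, Shleter, Shelter, Food Secuirty"
categories = sorted({c.strip() for c in raw.split(",")})

# Flag pairs whose edit distance is small; likely misspellings of each other.
for i, a in enumerate(categories):
    for b in categories[i + 1:]:
        d = levenshtein(a.lower(), b.lower())
        if d <= 2:  # assumed threshold; tune for your data
            print(f"possible duplicate: {a!r} ~ {b!r} (distance {d})")
```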

Cleaning the data

Once you have sorted out your categories, the next step is to clean the text entries themselves.

Remove stop words. Stop words are what separate us from the robots. For this text classification exercise, and for most NLP, stop words are simply noise. There are many different stop word lists, so pick one that suits your environment and needs; many NLP packages have stop word removal built in. Whichever method you choose, remove all stop words from your data.

Remove punctuat!on. Like stop words, punctuation also creates a lot of noise for our classifier.
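
A minimal cleaning pass combining both steps might look like the following. The tiny stop word list here is a placeholder; in practice you would pull a fuller list (for example from an NLP package) suited to your corpus.

```python
import string

# Placeholder stop list; swap in a real one for production use.
STOP_WORDS = {"there", "was", "an", "of", "and", "as", "a", "the", "has"}

def clean(text: str) -> list[str]:
    # Strip punctuation, lowercase, then drop stop words.
    table = str.maketrans("", "", string.punctuation)
    tokens = text.translate(table).lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(clean("There was an acute shortage of water, sanitation products, "
            "food supplies and housing as a result of the crisis."))
# -> ['acute', 'shortage', 'water', 'sanitation', 'products', 'food',
#     'supplies', 'housing', 'result', 'crisis']
```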

Categorizing sectors

In this example, we will perform a basic form of proportional frequency classification, where we take words and count their frequency in relation to a given sector.

Absolute count. Essentially, we create a dictionary entry for every unique word present in our data and count the number of times it appears in an entry categorized under a given sector. The top-scoring words in our example would be:

| Word | Sector | Count |
| --- | --- | --- |
| crisis | Shelter | 2 |
| housing | Shelter | 2 |
| water | WASH | 1 |
| water | Food Security | 1 |

While 'housing' certainly pairs with Shelter and 'water' with WASH, 'water' is not a meaningful match for Food Security. This is certainly not a foolproof method, but with a larger dataset such obvious mistakes should not persist.
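
A minimal sketch of this count, run over the two example entries (already cleaned as in the step above):

```python
from collections import Counter

entries = [
    (["acute", "shortage", "water", "sanitation", "products",
      "food", "supplies", "housing", "result", "crisis"],
     ["WASH", "Food Security", "Shelter"]),
    (["crisis", "rendered", "10000", "people", "without", "housing"],
     ["Shelter"]),
]

counts = Counter()  # keyed by (word, sector)
for tokens, sectors in entries:
    for word in tokens:
        for sector in sectors:
            counts[(word, sector)] += 1

for (word, sector), n in counts.most_common(4):
    print(f"{word:10} {sector:15} {n}")
# crisis/Shelter and housing/Shelter both score 2, matching the table above.
```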

Percentage of words. While an absolute count can be useful, we can also express the count in relation to the total number of words present for a given sector. For instance:

| Word | Sector | Count | Total words in sector | Pct among words |
| --- | --- | --- | --- | --- |
| crisis | Shelter | 2 | 27 | .07 |
| housing | Shelter | 2 | 27 | .07 |
| water | WASH | 1 | 19 | .05 |
| water | Food Security | 1 | 19 | .05 |

Percentage of entries. We could also amend this method to score the occurrence of a given word among all entries for a sector, as opposed to its frequency among all words. For instance:

| Word | Sector | Count | % of entries containing word |
| --- | --- | --- | --- |
| crisis | Shelter | 2 | 100 |
| housing | Shelter | 2 | 100 |
| water | WASH | 1 | 100 |
| water | Food Security | 1 | 100 |

...this is where my examples fall short but in practice you will see percentages that aren't all 100.
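
Building on the counting sketch above, both proportional scores can be derived in a few lines. Note the word totals here come from the small cleaned example, so the exact figures differ from the illustrative tables.

```python
from collections import Counter

total_words = Counter()    # sector -> total words across its entries
entry_totals = Counter()   # sector -> number of entries
presence = Counter()       # (word, sector) -> entries containing the word

for tokens, sectors in entries:
    for sector in sectors:
        total_words[sector] += len(tokens)
        entry_totals[sector] += 1
        for word in set(tokens):  # presence, not frequency
            presence[(word, sector)] += 1

for (word, sector), n in counts.items():
    pct_words = n / total_words[sector]
    pct_entries = presence[(word, sector)] / entry_totals[sector]
    print(f"{word:10} {sector:15} {n}  {pct_words:.2f}  {pct_entries:.0%}")
```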

Other things to consider

  • How do you handle repetitions of the same word in one entry? Do you count it multiple times, or just record its presence in a boolean sense? There isn't a steadfast rule; it depends on how you want to analyze your text. You can try both methods, compare the results, and adjust as needed (see the snippet below).
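
For illustration, the same entry counted both ways:

```python
tokens = ["water", "water", "shortage"]

multi_count = tokens.count("water")       # frequency: counts repeats -> 2
bool_count = int("water" in set(tokens))  # presence: once per entry  -> 1
```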

What do you do now?

Once you have chosen whichever method you prefer, run the categorization on your data and you will come up with a whole host of classification scores for each sector.

One useful step here is to repeat the Levenshtein distance exercise described above to make sure that similarly spelled words are being counted together. Be careful about making wholesale changes, though: even words spelled in a similar manner can still be distinct words!

This list of words and scores is how you select which words are relevant to given sectors. Draw a line in the sand for when you consider a word a relevant classifier... it could be all words above a percentage score, or the top X words per sector (see the sketch below). There is no right or wrong way.
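
One way to draw that line, sketched with made-up scores (here, keeping the top X words per sector):

```python
from collections import defaultdict

# `scores` and X are assumptions standing in for whichever metric you chose.
scores = {("housing", "Shelter"): 0.07, ("crisis", "Shelter"): 0.07,
          ("water", "WASH"): 0.05, ("water", "Food Security"): 0.05}
X = 2

by_sector = defaultdict(list)
for (word, sector), score in scores.items():
    by_sector[sector].append((score, word))

keywords = {sector: [w for _, w in sorted(pairs, reverse=True)[:X]]
            for sector, pairs in by_sector.items()}
print(keywords)
# {'Shelter': ['housing', 'crisis'], 'WASH': ['water'], 'Food Security': ['water']}
```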

How to actually apply this to a working system

Let's say we have a platform for classifying text and users can manually say that a word should be added to a sector's list or should be removed.

What can be done in this situation is to maintain two separate lists: one containing the automatically generated words, and another containing the words users have added (or flagged for removal), as sketched below.
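
A rough sketch of how those two lists could be combined at lookup time; all names and structures here are assumptions about such a platform. Keeping user edits separate means regenerating the automatic list never clobbers them.

```python
auto_keywords = {"Shelter": {"housing", "crisis"}}   # regenerated from data
user_added   = {"Shelter": {"displacement"}}         # manual additions
user_removed = {"Shelter": {"crisis"}}               # manual removals

def effective_keywords(sector: str) -> set[str]:
    # User edits win: start from the generated list, then apply
    # additions and removals without touching the generated set itself.
    words = set(auto_keywords.get(sector, set()))
    words |= user_added.get(sector, set())
    words -= user_removed.get(sector, set())
    return words

print(effective_keywords("Shelter"))  # {'housing', 'displacement'}
```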

How could this be improved?

  • We could use n-grams to create groupings of consecutive words, as opposed to simply picking one word at a time.
    • ...we could also get fancy and look at the occurrence of n-grams where word ordering doesn't matter (see the sketch below).
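
A quick sketch of both ideas: sliding-window n-grams, plus a frozenset variant that ignores word order.

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    # Every run of n consecutive tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["acute", "shortage", "water", "sanitation"]
print(ngrams(tokens, 2))
# [('acute', 'shortage'), ('shortage', 'water'), ('water', 'sanitation')]

# Order-insensitive variant: ('water', 'shortage') and ('shortage', 'water')
# collapse to the same feature when used as frozenset keys.
unordered = [frozenset(g) for g in ngrams(tokens, 2)]
```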

Written with StackEdit.
