
@primaryobjects
Created December 18, 2019 17:36
A summary of weak supervision and Snorkel.

Weak Supervision

Weak supervision lets you programmatically label millions of data points.

How does it work?

Ask domain experts for weak supervision signals (heuristics) to use as labeling functions, which can be programmatically implemented for automatic labeling.

Examples of labeling functions: regular expressions, dependency trees, knowledge bases, crowdsourcing.

Example Labeling Function

The following is an example labeling function for determining whether two individuals are married within a block of text.

//
// Example NLP labeling function to indicate whether two people are married.
//
const isMarried = (text, name1, name2) => {
  return between(text, name1, name2, 'and') && after(text, name2, 'their daughter');
};

// True when str1 and str2 appear in order, joined by the connector word.
const between = (text, str1, str2, connector) => {
  const expr = new RegExp(`${str1}.*${connector}.*${str2}`);
  return expr.test(text);
};

// True when str2 appears somewhere after str1 in the text.
const after = (text, str1, str2) => {
  const expr = new RegExp(`${str1}.*${str2}`);
  return expr.test(text);
};

Running the example produces the output label true.

const text = 'Barack and Michelle visited the museum with their daughter for a holiday tour.';
console.log(isMarried(text, 'Barack', 'Michelle'));

// true

About Snorkel

Snorkel trains a label model by analyzing conflicts between labeling functions to estimate their accuracy. A labeling function that all other labeling functions tend to agree with will have a high learned accuracy, compared to a labeling function that disagrees with others when voting on the same example.

Snorkel runs each labeling function on a data point, obtaining a vote, weighted by their estimated accuracies (per above). Based on the votes from each labeling function and the accuracy estimates, the label model can assign labels to each data point. Finally, a machine learning model can be trained on the resulting data-set, hopefully generalizing beyond the training data (and thus, the labeling functions).

See also Snorkel Tutorials.

Applications to Sentiment Analysis

The idea of weak supervision can be thought of as similar to the early days of Twitter sentiment analysis, where some of the largest data-sets were produced by automatically labeling tweets based upon emoticons present within the body of the text (:) = positive, :( = negative).

Additional heuristics for sentiment have since been utilized. Examples include keyword lists (such as the AFINN valence word list, where you simply sum the values of keywords found within the text; a total < 0 is negative, > 0 is positive), unsupervised learning techniques such as clustering, and domain-specific techniques involving manual labeling.

Example AFINN Word List

loss	-3
lost	-3
love	3
lovely	3
lowest	-1
luck	3
