@selbyk
Created December 23, 2015 05:03
Scraper/Content Extraction Training
Goal: Fetch relevant information sources, extract only the appropriate content, and save it as documents that serve as training data and are usable by Watson
Method:
Fetch a few pages from various data sources using PhantomJS, then parse each page's HTML and save it as JSON
Iterate over the text elements and extract features such as size, position, text, and CSS properties
Run the DBSCAN clustering algorithm over the document’s extracted feature data. Similar elements such as titles, headers, and article content should be grouped into the same clusters
Manually tag a portion of the documents to use as training data
Train a support vector machine (SVM) with a linear kernel on the tagged documents and evaluate it with 4-fold cross-validation; it should be capable of detecting the main content of a scraped page
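As a rough illustration of the clustering step, the sketch below runs a small pure-Python DBSCAN over hand-made feature vectors. The two features (relative font size and relative horizontal position), the element names, and the `eps` and `min_pts` values are assumptions made up for this example, not values from the scraper. A real run would likely use a library implementation and many more features.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels

# Hypothetical elements: (name, [relative font size, relative x position])
elements = [
    ("h1 title",    [0.80, 0.10]),
    ("h2 header A", [0.50, 0.12]),
    ("h2 header B", [0.52, 0.11]),
    ("paragraph 1", [0.30, 0.15]),
    ("paragraph 2", [0.31, 0.16]),
    ("paragraph 3", [0.29, 0.15]),
    ("footer link", [0.20, 0.90]),
]
labels = dbscan([features for _, features in elements], eps=0.05, min_pts=2)
print(labels)  # -> [-1, 0, 0, 1, 1, 1, -1]
```

Here the two headers and the three paragraphs each form a cluster, while the lone title and footer link are marked as noise.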
Credit: Ziyan Zhou (https://goo.gl/9OYasW)
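To make the classification step more concrete, here is a minimal sketch of a linear SVM evaluated with 4-fold cross-validation. It is not the credited author's implementation: the SVM is a tiny stand-in trained by sub-gradient descent on the regularized hinge loss, and the two features per element, the toy data, and the hyperparameters (`epochs`, `lr`, `lam`) are all assumptions for illustration. In practice a library SVM would be the more likely choice.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularized hinge loss (labels +1/-1)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:   # inside the margin: hinge term is active
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:                # outside the margin: only the regularizer acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def cross_validate(X, y, k=4):
    """Accuracy per fold; folds are contiguous chunks of len(X) // k samples."""
    fold = len(X) // k
    accuracies = []
    for f in range(k):
        test_idx = set(range(f * fold, (f + 1) * fold))
        train_idx = [i for i in range(len(X)) if i not in test_idx]
        model = train_linear_svm([X[i] for i in train_idx],
                                 [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies

# Hypothetical manually tagged elements:
# [relative text length, relative font size] -> +1 main content, -1 other
X = [[0.90, 0.80], [0.10, 0.20], [0.80, 0.90], [0.20, 0.10],
     [1.00, 0.70], [0.15, 0.30], [0.85, 0.75], [0.05, 0.10]]
y = [1, -1, 1, -1, 1, -1, 1, -1]

accuracies = cross_validate(X, y, k=4)
print(accuracies)
```

The samples are interleaved so that each contiguous fold holds one element of each class; with real data the folds would need shuffling or stratification.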
Statements to Questions Transformation Training
Goal: Turn statements from documents into questions
Identify the important whos, whys, whats, and wheres and collect them in a whitelist, WL, of terms to generate questions about. Running term frequency–inverse document frequency (tf-idf) over the corpus on its 1-, 2-, and 3-gram words and phrases should surface most of the terms we need
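A minimal sketch of the tf-idf step over 1-, 2-, and 3-grams, using only the standard library. The toy corpus, the whitespace tokenizer, the choice to normalize term frequency by the total number of n-grams in a document, and the `whitelist` helper are assumptions for illustration; a real corpus would need proper tokenization and stop-word handling.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All 1..n_max word n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf(docs, n_max=3):
    """One {term: score} dict per document."""
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    scores = []
    for c in counts:
        total = sum(c.values())
        scores.append({term: (k / total) * math.log(len(docs) / doc_freq[term])
                       for term, k in c.items()})
    return scores

def whitelist(scores, top_k=3):
    """Hypothetical helper: the top_k terms by their best score in any document."""
    best = {}
    for doc_scores in scores:
        for term, s in doc_scores.items():
            best[term] = max(best.get(term, 0.0), s)
    return sorted(best, key=lambda t: (-best[t], t))[:top_k]

docs = [
    "machine learning is fun",
    "machine learning solves hard problems",
    "cooking is fun",
]
scores = tfidf(docs)
print(whitelist(scores, top_k=3))
```

Terms that occur in only one document score highest, and very short documents are favored by this normalization, so the cutoff for WL would need tuning against real data.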
For each document
Use an NLP summarization module to extract the most important statements from the entire text or from subsets such as paragraphs.
Tag each sentence with its parts of speech and stem it
Search the special cases, then the regular cases; if the grammar of the sentence is an unrecognized pattern:
Save it to the database and ask for manual input of an appropriate question for the sentence
Tag and stem both statement and question
If there are words in the question that aren’t in the sentence
The question isn’t valid or it’s a special case
Normal: grammar transforms that need no additional words
ML is cool . . => NNP VBZ JJ . .
What is ML ? => WP VBZ NNP?
Irregular
ML seems hard. => NNP VBZ seems JJ.
What does ML seem like? => WP VBZ does NNP VB seem IN like?
Ask the user whether it’s a special case or to supply a better sentence; if it is special, save the grammar and the words of the transformed parts of speech in a separate list
Else
Tag statement
Search for special cases first
If not found, try a grammar match in normal cases
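The check on a manually supplied question (does it use words that are not in the statement?) could be sketched as follows. The `crude_stem` function is a deliberately rough stand-in for a real stemmer, and the set of question words that are always allowed is an assumption; neither comes from the notes above.

```python
# Assumed: these question words never count as "new" words.
QUESTION_WORDS = {"what", "who", "why", "where", "how"}

def crude_stem(word):
    """Very rough stand-in for a real stemmer: lowercase, drop a final 's'."""
    word = word.lower()
    return word[:-1] if len(word) > 3 and word.endswith("s") else word

def new_words(statement, question):
    """Words in the question whose stem does not occur in the statement."""
    known = {crude_stem(w) for w in statement.split() if w.isalpha()}
    return [w for w in question.split()
            if w.isalpha()
            and w.lower() not in QUESTION_WORDS
            and crude_stem(w) not in known]

def classify_pair(statement, question):
    """'normal' if the question needs no new words, else flag it for review."""
    return "special-or-invalid" if new_words(statement, question) else "normal"

print(classify_pair("ML is cool .", "What is ML ?"))
print(new_words("ML seems hard .", "What does ML seem like ?"))
```

A pair flagged as `special-or-invalid` would then go to the user, who decides whether to store it as a special case. Irregular verbs such as "has"/"have" defeat the crude stemmer, so a real stemmer or lemmatizer would be needed there.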
Example grammars. These will probably take a little trial and error, with a good bit of verification.
Normal
ML is a fun topic .
What is a fun topic?
Is ML a fun topic?
NNP VBZ DT NN NN .
WP VBZ DT NN NN ?
VBZ NNP DT NN NN . ?
ML can do anything . .
What can ML do . ?
NNP MD VB NN. .
WP MD NNP VB?
ML is a fun topic . .
What is a fun topic . ?
Is ML a fun topic . ?
NNP VBZ DT NN NN . .
WP VBZ DT NN NN . ?
VBZ NNP DT NN NN . ?
Irregular
ML solves hard problems
What does ML solve ?
NNP VBZ solves JJ NNS . .
WP VBZ does NNP VB solve . ?
ML has many applications . .
What does ML have . ?
NNP VBZ has JJ NNS. .
WP VBZ does NNP VB have . ?
ML kills anything in its way . .
What does ML kill . ?
NNP VBZ kills NN IN PRP NN.
WP VBZ does NNP VB kill?
ML seems hard . .
What does ML seem like . ?
How does ML seem . ?
NNP VBZ seems JJ. .
WP VBZ does NNP VB seem IN like . ?
WRB VBZ does NNP VB . ?
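A few of the example grammars above can be written as lookup tables and applied to a tagged statement, as in this sketch. The table layout is an assumption: each rule maps a tag pattern to question templates, where an integer copies the statement token at that index and a string is a literal word. The input is assumed to be already tagged by some part-of-speech tagger. Irregular cases are keyed on the pattern plus the verb and are searched first.

```python
# Normal cases: tag pattern -> question templates.
NORMAL = {
    ("NNP", "VBZ", "JJ", "."): [["What", 1, 0, "?"]],
    ("NNP", "VBZ", "DT", "NN", "NN", "."): [
        ["What", 1, 2, 3, 4, "?"],
        [1, 0, 2, 3, 4, "?"],
    ],
    ("NNP", "MD", "VB", "NN", "."): [["What", 1, 0, 2, "?"]],
}

# Irregular (special) cases: (tag pattern, verb) -> question templates.
IRREGULAR = {
    (("NNP", "VBZ", "JJ", "NNS", "."), "solves"): [
        ["What", "does", 0, "solve", "?"],
    ],
    (("NNP", "VBZ", "JJ", "."), "seems"): [
        ["What", "does", 0, "seem", "like", "?"],
        ["How", "does", 0, "seem", "?"],
    ],
}

def generate_questions(tagged):
    """tagged: list of (word, tag). Returns questions, or None if unrecognized."""
    words = [w for w, _ in tagged]
    tags = tuple(t for _, t in tagged)
    verb = next((w.lower() for w, t in tagged if t.startswith("VB")), None)
    # Special cases take priority over the normal grammar match.
    templates = IRREGULAR.get((tags, verb)) or NORMAL.get(tags)
    if templates is None:
        return None  # caller would save the sentence for manual input
    questions = []
    for template in templates:
        out = [words[item] if isinstance(item, int) else item
               for item in template]
        out[0] = out[0][0].upper() + out[0][1:]  # capitalize the first word
        questions.append(" ".join(out).replace(" ?", "?"))
    return questions

statement = [("ML", "NNP"), ("is", "VBZ"), ("a", "DT"),
             ("fun", "NN"), ("topic", "NN"), (".", ".")]
print(generate_questions(statement))
# -> ['What is a fun topic?', 'Is ML a fun topic?']
```

Checking the irregular table first matters for a sentence like "ML seems hard .": its tag pattern also matches a normal rule, which would otherwise produce "What seems ML?".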