tech roadmap plan/requirements for NLP Pipeline

NLP-ETL PIPELINE REQUIREMENTS

3 Primary Goals

  1. aggregate and tag all useful language-learning educational material on the web
  2. create enough output JSON docs to:
    • offer pre-made, high-quality content on any topic
    • serve as raw data for training generative ML models
  3. minimize the editorial review step as much as possible

INPUT & OUTPUTS

Each input typically has a "media-type" associated with it (video, audio, html, pdf, etc.). Every media type has an "extractor program" designed to extract the language data from it (a dispatch sketch follows the Input Types list below). E.g.:

Input Types

  • Webpage, articles, blogs [mvp]
  • Video (Youtube) [mvp]
  • Audio (soundcloud, podcasts, Spotify)
  • Files (pdf, docx, etc) [mvp]
  • Images (infographics, etc)
  • Games (typically a sub-type of webpage)
  • App (no extraction on these, we just track them)
  • APIs (dictionary api, news apis, etc)
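A minimal sketch of how the media-type → extractor dispatch could look; the registry and extractor names below are hypothetical placeholders, not the real codebase:

# Hypothetical media-type -> extractor dispatch; names are illustrative only.
from typing import Callable, Dict

def extract_youtube_subtitles(url: str) -> str:
    """Placeholder: download and return raw subtitle text for a video URL."""
    raise NotImplementedError

def extract_webpage_text(url: str) -> str:
    """Placeholder: fetch a page and return its readable text."""
    raise NotImplementedError

EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "video": extract_youtube_subtitles,
    "webpage": extract_webpage_text,
    # "audio": ..., "pdf": ..., etc.
}

def extract(media_type: str, url: str) -> str:
    try:
        return EXTRACTORS[media_type](url)
    except KeyError:
        raise ValueError(f"No extractor registered for media type: {media_type}")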

Input Sources

  • direct submissions to external resources table [mvp]
  • student/teacher internal message (e.g. our session recap) [mvp]
  • FC Webscrapers (e.g. our sports scraper, PRAW) [mvp]
  • cron jobs that pull from 3rd-party data providers like pushshift.io and Wikipedia
  • Authenticated API endpoint; FC staff CLI that posts to said endpoint
  • 3rd party APIs like newscatcher.api

Output Types

  • Activity JSON Structure [mvp]
  • LMS output types like SCORM
  • PDF & Google Docs

PROCESSING

This section describes how processing works, using a few representative examples.

Example Youtube Video Processing Steps

  1. url is added to the external-resources table
  2. url is submitted to the processing queue
  3. url's content-type is identified (youtube video)
  4. "video" media-type tag is added
  5. youtube subtitle extractor loads and extracts raw text
  6. language-identifier runs on raw text and adds "english" tag
  7. Verb-Tense analyzer runs on raw text
  8. CEFR Level Tagger runs on raw text (adds "A1" tag)
  9. Topic Detector runs on raw text (adds "food" tag w/ confidence)
  10. external-resource is flagged for manual review
  11. Admin reviews, makes adjustments, and hits "approve" button
  12. resource is indexed for search (Postgres, later ElasticSearch)
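
As a rough illustration of one pass through the queue, a processing function might look like the sketch below; every helper and field name here is a stand-in assumption, not the actual implementation:

# Hypothetical sketch of one pass through the processing queue; every helper
# here is a stand-in for the real extractor/tagger nodes described above.

def identify_content_type(url: str) -> str:
    return "video" if "youtube.com" in url or "youtu.be" in url else "webpage"

def extract_raw_text(media_type: str, url: str) -> str:
    return ""  # placeholder: the real extractor for this media type runs here

def process_resource(url: str) -> dict:
    resource = {"url": url, "tags": [], "status": "pending"}

    media_type = identify_content_type(url)      # e.g. "video" for a YouTube URL
    resource["tags"].append(media_type)

    resource["raw_text"] = extract_raw_text(media_type, url)

    # Each analyzer adds its own tag; shown here as placeholder values.
    resource["tags"].extend(["english", "A1", "food"])   # language, CEFR level, topic
    resource["topic_confidence"] = 0.87

    resource["status"] = "needs_review"           # flagged for manual review
    return resource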

Example Processed Resource to Lesson

STEP 1:
User or Admin defines desired lesson parameters
	- target language
	- taught in language
	- level
	- media type
	- activity type
	- pedagogical type
	- communicative objective
	- grammar objective
	- language specific objective
	- topic

STEP 2:
The parameters selected above filter the applicable source materials (a filtering sketch follows Step 5 below)

STEP 3:
(optional) User selects/discards recommended source materials

STEP 4:
materials are processed into activity JSON format
	- Supplementary 3rd party content is added (if applicable)
		- presaved idioms, vocab, phrases by topic
		- common errors (e.g. multiple-choice preposition distractors)
		- wikipedia lookups (e.g. named-entity supplements)
	- Generative models run, e.g. question prompts (if applicable)

STEP 5:
Lesson is auto-generated from the Step-4 output JSON.
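
A minimal sketch of the Step 2 filtering, assuming source materials are stored as tagged dicts; the field names are illustrative assumptions:

# Hypothetical filter for Step 2: keep only source materials whose tags
# match the requested lesson parameters. Field names are illustrative.

def filter_source_materials(materials, params):
    def matches(material):
        return all(
            material.get(key) == value
            for key, value in params.items()
            if value is not None
        )
    return [m for m in materials if matches(m)]

# Example usage:
params = {"target_language": "es", "level": "A1", "topic": "food", "media_type": "video"}
candidates = filter_source_materials(
    [
        {"target_language": "es", "level": "A1", "topic": "food", "media_type": "video"},
        {"target_language": "fr", "level": "B2", "topic": "travel", "media_type": "pdf"},
    ],
    params,
)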

Supplementary Sources of Information

OUTPUT

The output is the result of the Transform step of the pipeline. From a pure data perspective, the output is often synonymous with the final product; e.g. the Activity JSON output is a pure data representation of a learning activity.

Output Types

  • Activity JSON Structure [mvp]
  • LMS output types like SCORM
  • PDF & Google Docs
  • 3rd Party Integrations (e.g. typeform, quizlet)
  • ML Training Data

Activity Types

  • fill-in-blank
  • multiple-choice
  • true-false
  • matching
  • short-answer
  • video
  • audio
  • interactive-table
  • flash-cards
  • write-in-columns (deprecated)
  • roulette wheel
  • dialog

Structure of a Lesson

Lessons are a collection of one or more activities. You can think of a lesson as the input JS object you would pass into a top-level React component that generates a language lesson. The high-level component structure is:

Course
	Unit
		Lesson
			Section
				Activity

See Definitions Doc for details

  • Course: groups a collection of units (analogous to a book grouping chapters)
  • Unit: a series of lessons usually grouped around a topic (like a chapter)
  • Lesson: content covered in a teacher/student meeting (like a page)
  • Section: a collection of 1 or more activities (like a section of a page)
  • Activity: the actual interactive widget (see list of activity-types above)
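
For illustration only, the Course → Unit → Lesson → Section → Activity nesting could be modeled roughly like this; field names are assumptions, not the production schema:

# Rough sketch of the Course > Unit > Lesson > Section > Activity nesting.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Activity:
    activity_type: str                            # e.g. "fill-in-blank", "multiple-choice"
    payload: dict = field(default_factory=dict)   # the Activity JSON structure

@dataclass
class Section:
    activities: List[Activity] = field(default_factory=list)

@dataclass
class Lesson:
    title: str = ""
    sections: List[Section] = field(default_factory=list)

@dataclass
class Unit:
    topic: str = ""
    lessons: List[Lesson] = field(default_factory=list)

@dataclass
class Course:
    name: str = ""
    units: List[Unit] = field(default_factory=list)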



DATA PIPELINE

Data-lake architecture will be no more than a simple S3 bucket to start with. See the Appendix for an example of an ML model development pipeline that would eventually be used to deploy nodes into the Data Pipeline discussed in this section.

ETL overview:

Extract   => S3
Transform => The Data Pipeline
Load      => Fluent City Postgres DB

Pipeline Services

To start with, we are going to build a "dumb version" of the pipeline. Examples of the types of solutions under consideration are:

Pipeline Node Types

  • Extraction: extracts data from an input media-type (url, pdf, video, etc.)
  • Classification: loads a pickle file or model and runs classification
  • Supplementation: consults stored data in the FC DB or 3rd-party services
  • Transformation: pure Python code that transforms data from one form to another

Pipeline Node Examples

Practically speaking, nodes will likely take a few common forms; one possible shape is sketched below.
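
This is an assumption-level sketch of what a node interface could look like (take a document dict in, return an augmented dict out), not the actual FC design:

# Hypothetical common shape for a pipeline node.
from abc import ABC, abstractmethod

class PipelineNode(ABC):
    version = "0.1.0"   # every node carries a version tag (see Other Requirements)

    @abstractmethod
    def run(self, doc: dict) -> dict:
        """Transform or annotate the document and return it."""

class LanguageIdentifierNode(PipelineNode):
    version = "0.1.0"

    def run(self, doc: dict) -> dict:
        # Placeholder classification: a real node would load a pickled model
        # or a SpaCy pipeline here.
        doc.setdefault("tags", []).append("english")
        return doc

def run_pipeline(doc: dict, nodes: list) -> dict:
    for node in nodes:
        doc = node.run(doc)
    return doc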




APPENDIX

OTHER REQUIREMENTS

  • non-destructive data extraction (keep the raw data)
  • extractors/transformers include version tags
  • ability to quickly add new components to the processing pipeline
  • possibly DAG-style processing pipeline (eventually)
  • versioned schema validation
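
As an example of versioned schema validation, one option is a plain jsonschema check keyed by activity type and schema version; the toy schema below is an assumption, not the real Activity JSON schema:

# Sketch of versioned schema validation using the jsonschema package.
from jsonschema import validate, ValidationError

SCHEMAS = {
    "fill-in-blank": {
        "1.0": {
            "type": "object",
            "required": ["meta", "questions"],
            "properties": {
                "meta": {"type": "object"},
                "questions": {"type": "array"},
            },
        },
    },
}

def validate_activity(doc: dict, activity_type: str, schema_version: str) -> bool:
    schema = SCHEMAS[activity_type][schema_version]
    try:
        validate(instance=doc, schema=schema)
        return True
    except ValidationError:
        return False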

PROBLEMS

  • Heroku limitations (e.g. slug size)
  • how to provision pipeline processing steps that require large language models
  • how to switch between language models in SpaCy pipeline nodes
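
One way to handle switching between SpaCy language models in pipeline nodes is to lazily load and cache them by language code, as in this sketch (the model names are the standard small SpaCy models):

# Sketch: lazily load and cache SpaCy models per language so pipeline nodes
# can switch models without reloading them on every document.
import spacy

_MODEL_NAMES = {
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "fr": "fr_core_news_sm",
}
_loaded = {}

def get_nlp(lang: str):
    if lang not in _loaded:
        _loaded[lang] = spacy.load(_MODEL_NAMES[lang])
    return _loaded[lang]

# usage: doc = get_nlp("en")("Far out in the uncharted backwaters ...")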

Useful Links

  • NLP-Progress: a good overview of state-of-the-art NLP techniques
  • HuggingFace: source of many of the transformer models we use
  • SpaCy: our NLP library of choice
  • SpaCy Models: the core language models we use

Example Activity Output Structures

NOTE: these are not the real schemas; they just illustrate the basic notion that different "activity types" have different JSON structures.

Fill-in-blank Schema

{
    'meta': {
        'language': 'en',
        'activity_type': 'fill-in-blank',
        'activity_name': 'verb',
        'time_limit': 0,
        'duration': 0,
    },
    'results': {
        'score': 0,
        'grade': 'A',
        'total': 1,
        'questions': {
            '1': {
                'completed': True,
                'attempts': 1,
                'value': 0,
                'points': 0,
            }
        }
    },
    'history': [
        {
            'action': 'submit answer',
            'timestamp': '00:00:00',
            'details': {
                'question_id': 1,
                'answer_id': [1],
            },
        }
    ],
    'text': "\nFar out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun.\n\nOrbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape- descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.\n\nThis planet has - or rather had - a problem, which was this: most of the people living on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because on the whole it wasn't the small green pieces of paper that were unhappy.\n\nAnd so the problem remained; lots of the people were mean, and most of them were miserable, even the ones with digital watches.\n",
    'ents': [
        {'start': 95, 'end': 101, 'label': 'ORG'},
        {'start': 170, 'end': 202, 'label': 'QUANTITY'}
    ],
    'sents': [
        {'start': 0, 'end': 139},
        {'start': 139, 'end': 378},
        {'start': 378, 'end': 512},
        {'start': 513, 'end': 749},
        {'start': 749, 'end': 877}
    ],
    'tokens': [
        ...
    ],
    'questions': [
        {
            'id': 1,
            'label': 'a',
            'start': 0,
            'end': 0,
            'sent_start': 0,
            'sent_end': 25,
            'type': 'fill-in-blank',
            'title': 'English word for inu?',
            'description': '',
            'required': True,
            'evaluation_type': 'iregex',
            'answer': '(dog|canine|hound)',
        },
    ],
}
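
For context, the 'iregex' evaluation_type above could be checked with a case-insensitive regular-expression match along these lines (a sketch, not the actual grading code):

# Sketch of how an 'iregex' answer might be evaluated: case-insensitive
# full match of the student's response against the stored answer pattern.
import re

def check_iregex(answer_pattern: str, response: str) -> bool:
    return re.fullmatch(answer_pattern, response.strip(), flags=re.IGNORECASE) is not None

# check_iregex('(dog|canine|hound)', 'Dog')  -> True
# check_iregex('(dog|canine|hound)', 'cat')  -> False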

Multiple-choice Schema

{
    'meta': {
        'language': 'en',
        'activity_type': 'multiple-choice',
        'activity_name': 'name',
    },

    'questions': [
        {
            'id': 1,
            'label': 'A',
            'start': 0,
            'end': 5,
            'sent_start': 0,
            'sent_end': 25,
            'type': 'multiple-choice',
            'title': 'which one?',
            'description': 'pick the best answer.',
            'required': True,
            'evaluation_type': 'exact',

            # the possible answers to the question
            'answer': {
                'choices': [
                    {'id': 1, 'value': 'answer 1'},
                    {'id': 2, 'value': 'answer 2'},
                    {'id': 3, 'value': 'answer 3'},
                ],
                # ids or id combinations that must be True.
                # the following has two correct answers:
                # id 1 alone or ids 2 and 3 together.
                'answer_key': [1, [2, 3]],
            },

            # (optional) feedback to provide on success/failure.
            # this will often be dynamically generated via a function
            'feedback': {
                'correct': 'good job',
                'incorrect': 'shitty job',
            },
        },
    ],
}
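
A small sketch of how the answer_key convention above (single ids or id combinations that must be selected together) might be evaluated; this is an illustration, not the production grader:

# Sketch: evaluate a multiple-choice submission against an answer_key where
# each entry is either a single correct id or a list of ids that must all
# be selected together (e.g. [1, [2, 3]]).

def is_correct(selected_ids: list, answer_key: list) -> bool:
    selected = set(selected_ids)
    for key in answer_key:
        required = {key} if isinstance(key, int) else set(key)
        if selected == required:
            return True
    return False

# is_correct([1], [1, [2, 3]])     -> True
# is_correct([2, 3], [1, [2, 3]])  -> True
# is_correct([2], [1, [2, 3]])     -> False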



ML CONTINUOUS INTEGRATION

This section is included to give context to the pipeline. I.e., it would be ideal if whatever "pipeline" we build stays open to a future integration with an ML CI workflow.

Machine Learning

Source article

General outline of the ML development process (image)

Example of the aforementioned pipeline divided by responsibility (image)

Full view of an ML model creation pipeline (image)

Examples of tools we will use once we start training more of our own models:

  • MLaaS - Machine Learning as a Service
  • Data Science Version Control

Most of our models will be simple Python pickle files that are loadable via scikit-learn.
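
Loading such a pickled scikit-learn model in a classification node could look roughly like the sketch below; the file path and feature vector are placeholders:

# Sketch: load a pickled scikit-learn model and run a prediction.
import pickle

with open("models/cefr_level_classifier.pkl", "rb") as fh:
    model = pickle.load(fh)

features = [[0.1, 0.4, 0.2]]              # placeholder feature vector
predicted_label = model.predict(features)[0]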

annaknorris commented Sep 23, 2020:

Example Youtube Video Processing Steps

  1. url is added to the external-resources table
  2. url is submitted to the processing queue
  3. url's content-type is identified (youtube video)
  4. "video" media-type tag is added
  5. youtube subtitle extractor checks if there are subtitles
  6. youtube extractor downloads metadata json and all subtitle files for all languages (vtt format)
  7. subtitle converter converts vtt to plain-text transcripts, one per language (see the sketch below)
  8. language-identifier runs on raw text and adds "english" tag etc.
  9. Verb-Tense, Adjective-phrase, Noun-chunk, and vocabulary analyzers run on raw text
  10. CEFR Level Tagger runs on raw text (adds "A1" tag)
  11. Topic Detector runs on raw text (adds "food" tag w/ confidence); named-entity recognition also runs
  12. external-resource is flagged for manual review
  13. Admin reviews, makes adjustments, and hits "approve" button
  14. resource is indexed for search (Postgres, later ElasticSearch)
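
The vtt-to-plain-text conversion in step 7 could be approximated with a minimal parser like the sketch below; a real converter would also handle NOTE blocks, cue settings, and deduplication of rolling captions:

# Minimal sketch of converting a .vtt subtitle file to a plain-text transcript.
import re

def vtt_to_text(vtt_path: str) -> str:
    lines = []
    timestamp = re.compile(r"^(\d{2}:)?\d{2}:\d{2}\.\d{3} -->")
    with open(vtt_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line == "WEBVTT" or line.isdigit() or timestamp.match(line):
                continue
            lines.append(re.sub(r"<[^>]+>", "", line))   # strip inline tags like <c>
    return " ".join(lines)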
