
@manisnesan
Last active February 2, 2022 17:07
Prodigy Usage

Teach the model, i.e. annotate

$ prodigy textcat.teach troubleshoot-sample en_core_web_sm troubleshoot_sample.jsonl

Batch Train

$ prodigy textcat.batch-train troubleshoot-sample en_vectors_web_lg --output troubleshoot-sample-model --eval-split 0.2
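
The `--eval-split 0.2` option holds back 20% of the annotated examples for evaluation. The sketch below only illustrates that idea with a shuffled holdout; it is not Prodigy's implementation, and the function name `split_examples` and its `seed` parameter are hypothetical.

```python
import random


def split_examples(examples, eval_split=0.2, seed=0):
    """Shuffle examples and hold out a fraction of them for evaluation."""
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_split)
    # The first n_eval items are held out; the rest are used for training.
    return shuffled[n_eval:], shuffled[:n_eval]
```

With 10 examples and `eval_split=0.2`, this returns 8 training examples and 2 evaluation examples.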

Create the dataset

$ prodigy dataset --author msivanes openshift_troubleshoot "Dataset to classify openshift as usage or troubleshoot types."

Import the dataset

$ prodigy db-in openshift_troubleshoot annotated_openshift_troubleshoot.jsonl
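
Before importing, it can help to check that every line of the file is a JSON object with a non-empty "text" field. The following is a minimal sketch using only the standard library; the helper name `find_invalid_lines` is hypothetical, and the "text" requirement is an assumption based on the record layout produced by prepare_data.py.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, reason) pairs for lines that are not usable records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank lines are ignored rather than reported
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        # Assumption: each record is an object with a non-empty "text" value.
        if not isinstance(record, dict) or not record.get("text"):
            problems.append((number, "missing 'text' field"))
    return problems
```

To check a file, pass the open file handle: `find_invalid_lines(open('annotated_openshift_troubleshoot.jsonl'))`. An empty list means no problems were found.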

Drop the dataset

$ prodigy drop openshift_troubleshoot

Explore the source

$ prodigy textcat.print-stream annotated_openshift_troubleshoot.jsonl | less -r
$ prodigy pipe annotated_reddit-INSULT-textcat.jsonl
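
When scanning the source by eye is not enough, a quick label count shows how balanced the file is. This is a rough sketch, assuming each line is a JSON object with an optional "label" key as written by prepare_data.py; the helper name `count_labels` is hypothetical.

```python
import json
from collections import Counter


def count_labels(lines):
    """Count how often each label appears in an iterable of JSON Lines strings."""
    counts = Counter()
    for line in lines:
        if line.strip():  # skip blank lines
            # Records without a "label" key are counted under the empty string.
            counts[json.loads(line).get("label", "")] += 1
    return counts
```

For a file on disk, pass the open file handle, for example `count_labels(open('annotated_openshift_troubleshoot.jsonl'))`.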

Explore the dataset

$ prodigy textcat.print-dataset openshift_troubleshoot
$ prodigy pipe --from-dataset openshift_troubleshoot | less -r

Prepare the dataset

prepare_data.py

import json

USAGE_CATEGORIES = {'Install', 'Configure', 'Supportability', 'Learn more', 'Upgrade'}


def process(record_str):
    """Convert one pre-annotated JSON line into a Prodigy textcat record."""
    record = json.loads(record_str)
    category = record.get('category') or ''
    # Assumption: category is either a string or a list of strings.
    # For a list, only the first entry is considered.
    if isinstance(category, list):
        category = category[0] if category else ''
    if category == 'Troubleshoot':
        label = "Troubleshoot"
    elif category in USAGE_CATEGORIES:
        label = "Usage"
    else:
        label = ""
    return {"text": record['text'], "label": label, "meta": {"id": record.get('id', "")}}


with open('data/preannotated_openshift.jsonl') as file:
    # Skip blank lines so json.loads does not fail on a trailing newline.
    processed_data = [process(line) for line in file if line.strip()]

with open('data/annotated_openshift.json', 'w') as outfile:
    json.dump(processed_data, outfile)

Convert the above JSON into JSON Lines suitable for Prodigy

$ jq -c '.[]' annotated_openshift.json > annotated_openshift_titles.jsonl
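
If jq is not available, the same array-to-JSON-Lines conversion can be sketched in Python. The function name `json_array_to_jsonl` is hypothetical, and the output is equivalent record by record but not necessarily byte-identical to what jq writes.

```python
import json


def json_array_to_jsonl(array_path, jsonl_path):
    """Write each element of a JSON array file as one line of a JSONL file."""
    with open(array_path) as infile:
        records = json.load(infile)
    with open(jsonl_path, "w") as outfile:
        for record in records:
            # One compact JSON document per line.
            outfile.write(json.dumps(record, separators=(",", ":")) + "\n")
    return len(records)
```

For the files above, the call would be `json_array_to_jsonl('annotated_openshift.json', 'annotated_openshift_titles.jsonl')`, which returns the number of records written.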

Load the input file into annotation db

$ prodigy db-in openshift_usage_troubleshoot annotated_openshift_titles.jsonl

Start the text classification annotation session

$ python -m spacy download en_core_web_md
