manisnesan/prodigy-usages.md

## prodigy-usages.md

      
Teach the model (i.e., annotate)

$ prodigy textcat.teach troubleshoot-sample en_core_web_sm troubleshoot_sample.jsonl 
Batch Train

$ prodigy textcat.batch-train troubleshoot-sample en_vectors_web_lg --output troubleshoot-sample-model --eval-split 0.2
Create the dataset

$ prodigy dataset --author msivanes openshift_troubleshoot "Dataset to classify openshift as usage or troubleshoot types."
Import the dataset

$ prodigy db-in openshift_troubleshoot annotated_openshift_troubleshoot.jsonl
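
Before running db-in, it can be worth checking that the JSONL file is well formed. The following is a minimal sketch, assuming every record should be a JSON object with a non-empty "text" field, as the records produced by prepare_data.py further down are. The function name find_invalid_lines is illustrative and not part of Prodigy.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, reason) pairs for lines that are not usable records."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Blank lines carry no record, so they are skipped rather than reported.
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        # Assumption: each record must be an object with a non-empty "text" value.
        if not isinstance(record, dict) or not record.get("text"):
            problems.append((number, 'missing or empty "text"'))
    return problems


if __name__ == "__main__":
    sample = ['{"text": "pod stuck in CrashLoopBackOff", "label": "Troubleshoot"}',
              '{"label": "Usage"}']
    for number, reason in find_invalid_lines(sample):
        print(f"line {number}: {reason}")
```

To check a real file, pass an open file handle instead of the sample list, since iterating a file yields its lines.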
Drop the dataset

$ prodigy drop openshift_troubleshoot
Explore the source

$ prodigy textcat.print-stream annotated_openshift_troubleshoot.jsonl | less -r
$ prodigy pipe annotated_reddit-INSULT-textcat.jsonl
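
To complement the printed stream, a quick label count shows how balanced the source file is. This is a minimal sketch using only the standard library, assuming each line is a JSON object with an optional "label" key. The name label_counts is illustrative and not part of Prodigy.

```python
import json
from collections import Counter


def label_counts(lines):
    """Count the "label" value of each JSON Lines record, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if line.strip():
            # Records without a "label" key are counted under the empty string.
            counts[json.loads(line).get("label", "")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: pass the path of a JSONL file as the first argument.
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as source:
            for label, count in label_counts(source).most_common():
                print(f"{label or '(unlabelled)'}\t{count}")
```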

Explore the dataset

$ prodigy textcat.print-dataset openshift_troubleshoot
$ prodigy pipe --from-dataset openshift_troubleshoot | less -r

Prepare the dataset

prepare_data.py
import json

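# Categories that are collapsed into the single "Usage" label.
USAGE_CATEGORIES = {'Install', 'Configure', 'Supportability', 'Learn more', 'Upgrade'}

def process(record_str):
    """Map one pre-annotated JSON record to a Prodigy text classification example."""
    record = json.loads(record_str)
    # A missing, null, or empty category leaves the example unlabelled.
    category = record.get('category') or ''
    # The category may be a plain string or a list of strings; use its first entry.
    first = category if isinstance(category, str) else category[0]
    if first == 'Troubleshoot':
        label = "Troubleshoot"
    elif first in USAGE_CATEGORIES:
        label = "Usage"
    else:
        label = ""
    return {"text": record['text'], "label": label, "meta": {"id": record.get('id', "")}}
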

# Read the pre-annotated records, one JSON object per line, skipping blank lines.
with open('data/preannotated_openshift.jsonl') as file:
    processed_data = [process(line) for line in file if line.strip()]

# Write a single JSON array; the next step converts it to JSON Lines.
with open('data/annotated_openshift.json', 'w') as outfile:
    json.dump(processed_data, outfile)


Convert the JSON array above into JSON Lines suitable for Prodigy (run from the data directory, where prepare_data.py writes its output)

$ jq -c '.[]' annotated_openshift.json > annotated_openshift_titles.jsonl
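
Where jq is not installed, the same conversion can be sketched in Python with the standard library. This assumes the input file holds a single JSON array, as written by prepare_data.py. The function name json_array_to_jsonl is illustrative.

```python
import json
from pathlib import Path


def json_array_to_jsonl(src_path, dst_path):
    """Rewrite a JSON array file as JSON Lines, one compact object per line."""
    records = json.loads(Path(src_path).read_text(encoding="utf-8"))
    with open(dst_path, "w", encoding="utf-8") as out:
        for record in records:
            # Compact separators mirror the single-line output of `jq -c`.
            out.write(json.dumps(record, ensure_ascii=False, separators=(",", ":")) + "\n")
    return len(records)


if __name__ == "__main__":
    src = Path("annotated_openshift.json")
    if src.exists():
        count = json_array_to_jsonl(src, "annotated_openshift_titles.jsonl")
        print(f"wrote {count} records")
```

A variant of prepare_data.py could also write one record per line directly and skip the intermediate JSON array; the two-step form is kept here because it matches the workflow above.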

Load the input file into annotation db

$ prodigy db-in openshift_usage_troubleshoot annotated_openshift_titles.jsonl
Start the text classification annotation session

$ python -m spacy download en_core_web_md