Prodigy recipe for binary text classification at the sentence level, with each sentence highlighted within the context of its paragraph.
import prodigy
import spacy
from prodigy.components.loaders import JSONL


@prodigy.recipe(
    "textcat_sent_sequence",
    dataset=("Dataset to save answers to", "positional", None, str),
    examples=("Examples to load from disk", "positional", None, str),
    model=("spaCy model to load", "positional", None, str),
    label=("Label for annotated data", "positional", None, str),
)
def textcat_topic(dataset, examples, model, label):
    # load the spaCy pipeline (used here for sentence segmentation)
    nlp = spacy.load(model)

    # set up stream; may want get_stream() instead to hash/avoid dedup
    stream = JSONL(examples)

    # Render one task per sentence, highlighted within its paragraph
    def add_html(examples):
        for ex in examples:
            paragraph = ex["paragraph"]
            doc = nlp(paragraph)
            for sent in doc.sents:
                # highlight by character offsets so that a sentence occurring
                # twice in the paragraph is only marked at its own position
                highlighted = (
                    paragraph[: sent.start_char]
                    + f"<b style='background-color: yellow;'>{sent.text}</b>"
                    + paragraph[sent.end_char :]
                )
                # copy the input so each yielded task is an independent dict
                task = dict(ex)
                task["sentence"] = sent.text
                task["html"] = highlighted
                task["label"] = label
                yield task

    # delete html key in output data
    def before_db(examples):
        for ex in examples:
            ex.pop("html", None)
        return examples

    return {
        "before_db": before_db,
        "dataset": dataset,
        "stream": add_html(stream),
        "view_id": "classification",
    }
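The highlighting step inside add_html can be tried on its own, without Prodigy or spaCy. The following is a minimal sketch, not part of the recipe: highlight_span is a hypothetical helper, and escaping the text with html.escape is an added assumption (the recipe inserts the raw text). It wraps a span by character offsets, which marks a sentence only at its own position even when the same sentence appears twice in the paragraph.

```python
from html import escape


def highlight_span(text, start, end):
    """Wrap text[start:end] in a yellow <b> tag and HTML-escape all the text."""
    before, target, after = text[:start], text[start:end], text[end:]
    return (
        f"{escape(before)}"
        f"<b style='background-color: yellow;'>{escape(target)}</b>"
        f"{escape(after)}"
    )


paragraph = "It works. It works. Really?"
# Highlight only the second "It works." (character offsets 10 to 19).
html = highlight_span(paragraph, 10, 19)
print(html)
```

With spaCy, the offsets would come from sent.start_char and sent.end_char for each sentence in doc.sents.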
Comment by wesslen (author), Jan 27, 2023:
To run:
python -m prodigy textcat_sent_sequence sent_dataset input_paragraphs.jsonl en_core_web_sm MY_LABEL -F textcat_sent_sequence.py
Annotation examples, exported with: python -m prodigy db-out sent_dataset > sent_seq.jsonl
{
"paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
"sentence": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models.",
"label": "MY_LABEL",
"_input_hash": -1908851693,
"_task_hash": -2083698072,
"_view_id": "classification",
"answer": "accept",
"_timestamp": 1674846351
}
{
"paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
"sentence": "You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models.",
"label": "MY_LABEL",
"_input_hash": -1655152078,
"_task_hash": 1852110735,
"_view_id": "classification",
"answer": "reject",
"_timestamp": 1674846353
}
{
"paragraph": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation.",
"sentence": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts.",
"label": "MY_LABEL",
"_input_hash": -913072919,
"_task_hash": 1979525482,
"_view_id": "classification",
"answer": "accept",
"_timestamp": 1674846353
}
{
"paragraph": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation.",
"sentence": "Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end.",
"label": "MY_LABEL",
"_input_hash": -1635295063,
"_task_hash": 1202994399,
"_view_id": "classification",
"answer": "accept",
"_timestamp": 1674846354
}
{
"paragraph": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation.",
"sentence": "The web application is optimized for fast, intuitive and efficient annotation.",
"label": "MY_LABEL",
"_input_hash": -1883974235,
"_task_hash": -7895426,
"_view_id": "classification",
"answer": "reject",
"_timestamp": 1674846355
}
{
"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do.",
"sentence": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of.",
"label": "MY_LABEL",
"_input_hash": -1549905405,
"_task_hash": 1533273470,
"_view_id": "classification",
"answer": "accept",
"_timestamp": 1674846356
}
{
"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do.",
"sentence": "To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice.",
"label": "MY_LABEL",
"_input_hash": -2015913971,
"_task_hash": -1499993124,
"_view_id": "classification",
"answer": "reject",
"_timestamp": 1674846356
}
{
"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do.",
"sentence": "Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow.",
"label": "MY_LABEL",
"_input_hash": -1134729851,
"_task_hash": 1772709716,
"_view_id": "classification",
"answer": "accept",
"_timestamp": 1674846357
}
{
"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do.",
"sentence": "With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do.",
"label": "MY_LABEL",
"_input_hash": -745302604,
"_task_hash": 712420947,
"_view_id": "classification",
"answer": "accept",
"_timestamp": 1674846358
}
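Records like the ones above can be turned into binary training pairs. This is a minimal sketch under two assumptions: the exported file holds one JSON record per line, and any answer other than "accept" or "reject" (for example "ignore") should be skipped. The function name load_binary_labels and the short sample records are hypothetical, made up for illustration.

```python
import json


def load_binary_labels(lines):
    """Return (sentence, is_positive) pairs from JSONL annotation records."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        answer = record.get("answer")
        if answer not in ("accept", "reject"):
            continue  # skip ignored or unanswered tasks
        pairs.append((record["sentence"], answer == "accept"))
    return pairs


# Hypothetical records with the same keys as the export above.
sample = [
    '{"sentence": "First sentence.", "label": "MY_LABEL", "answer": "accept"}',
    '{"sentence": "Second sentence.", "label": "MY_LABEL", "answer": "reject"}',
    '{"sentence": "Third sentence.", "label": "MY_LABEL", "answer": "ignore"}',
]
pairs = load_binary_labels(sample)
print(pairs)
```

To read the real export, pass an open file handle for sent_seq.jsonl instead of the sample list.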
Input data: input_paragraphs.jsonl
{"paragraph": "Prodigy is a modern annotation tool for creating training and evaluation data for machine learning models. You can also use Prodigy to help you inspect and clean your data, do error analysis and develop rule-based systems to use in combination with your statistical models."}
{"paragraph": "The Python library includes a range of pre-built workflows and command-line commands for various tasks, and well-documented components for implementing your own workflow scripts. Your scripts can specify how the data is loaded and saved, change which questions are asked in the annotation interface, and can even define custom HTML and JavaScript to change the behavior of the front-end. The web application is optimized for fast, intuitive and efficient annotation."}
{"paragraph": "Prodigy’s mission is to help you do more of all those manual or semi-automatic processes that we all know we don’t do enough of. To most data scientists, the advice to spend more time looking at your data is sort of like the advice to floss or get more sleep: it’s easy to recognize that it’s sound advice, but not always easy to put into practice. Prodigy helps by giving you a practical, flexible tool that fits easily into your workflow. With concrete steps to follow instead of a vague goal, annotation and data inspection will change from something you should do, to something you will do."}
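An input file in this shape can be produced from a plain list of paragraphs. The sketch below is an assumption about how one might prepare the data, not part of the recipe: to_jsonl is a hypothetical helper that normalizes whitespace, drops empty paragraphs, and writes each paragraph under the "paragraph" key that the recipe reads. ensure_ascii=False keeps characters such as the curly apostrophe readable in the file.

```python
import json


def to_jsonl(paragraphs):
    """Serialize paragraphs as JSONL text, one {"paragraph": ...} record per line."""
    lines = []
    for text in paragraphs:
        text = " ".join(text.split())  # normalize runs of whitespace and line breaks
        if text:  # drop paragraphs that are empty after normalization
            lines.append(json.dumps({"paragraph": text}, ensure_ascii=False))
    return "\n".join(lines) + "\n"


paragraphs = [
    "Prodigy’s mission is to help.\nIt fits your workflow.",
    "   ",
]
jsonl_text = to_jsonl(paragraphs)
print(jsonl_text, end="")

# To save it where the run command expects it:
# with open("input_paragraphs.jsonl", "w", encoding="utf-8") as f:
#     f.write(jsonl_text)
```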