{
  "basics": {
    "name": "Thomas Hudson",
    "email": "thomas@gthudson.me",
    "label": "Machine Learning PhD Student",
    "picture": "https://avatars.githubusercontent.com/u/13795113?v=4",
    "summary": "Researching NLP problems in veterinary medicine",
    "location": {
      "address": "",
      "postalCode": "",
      "city": "Durham",
      "countryCode": "UK",
      "region": ""
    },
    "profiles": [
      {
        "url": "https://github.com/ghomasHudson",
        "username": "ghomasHudson",
        "network": "github"
      },
      {
        "url": "https://scholar.google.com/citations?user=S_bOPrsAAAAJ&hl=en",
        "username": "Thomas Hudson",
        "network": "Google Scholar"
      }
    ]
  },
"education": [
{
"startDate": "2018-09-01",
"studyType": "PhD",
"area": "Computer Science",
"institution": "Durham University"
},
{
"courses": [
"Machine Learning"
],
"endDate": "2018-07-01",
"startDate": "2014-09-01",
"studyType": "Masters",
"area": "Computer Science",
"institution": "Durham University"
}
],
"languages": [
{
"fluency": "Native speaker",
"language": "English"
},
{
"fluency": "Beginner",
"language": "Turkish"
}
],
"skills": [
{
"keywords": [
"Python",
"Pytorch",
"HuggingFace",
"LLMs"
],
"name": "Machine Learning"
}
],
"work": [
{
"company": "Durham University",
"website": "https://www.dur.ac.uk",
"summary": "Working on a project to explore how a range of techniques, including the innovative use of LLMs (Large Language Models such as chatGPT) can solve veterinary problems.",
"position": "Post Doctoral Research Associate",
"startDate": "2023-02-01"
},
{
"company": "Caspian",
"website": "https://www.caspian.co.uk",
"summary": "Researching how machine learning techniques can help solve NLP problems in the financial sector including information extraction from large unstructured documents.",
"position": "Data Scientist",
"startDate": "2018-07-01",
"endDate": "2023-01-20"
},
{
"summary": "Working on a range of projects including how my work in Native Language Identification can be applied to the national security agenda.",
"position": "Technologist",
"startDate": "2017-06-01",
"endDate": "2017-09-01"
},
{
"company": "Hut8 Strategic Mobile",
"website": "",
"summary": "Developed and tested a range of web-based software projects for clients in a small team. Lead a project to support solar roof installations. Developed key skills in client communication, scrum agile development, and unit testing.",
"position": "Software Engineer",
"startDate": "2015-05-01",
"endDate": "2015-09-01"
}
],
"publications": [
{
"name": "MuLD: The Multitask Long Document Benchmark",
"publisher": "The Language Resources and Evaluation Conference (LREC 2022)",
"releaseDate": "2022-06-25",
"url": "https://aclanthology.org/2022.lrec-1.392",
"summary": "MuLD (Multitask Long Document Benchmark) is a set of 6 NLP tasks where the inputs consist of at least 10,000 words. The benchmark covers a wide variety of task types including translation, summarization, question answering, and classification. Additionally there is a range of output lengths from a single word classification label all the way up to an output longer than the input text."
},
{
"name": "Ask me in your own words: paraphrasing for multitask question answering",
"publisher": "PeerJ Computer Science",
"releaseDate": "2021-06-14",
"url": "https://peerj.com/articles/cs-759",
"summary": "Multitask learning has led to significant advances in Natural Language Processing, including the decaNLP benchmark where question answering is used to frame 10 natural language understanding tasks in a single model. In this work we show how models trained to solve decaNLP fail with simple paraphrasing of the question. We contribute a crowd-sourced corpus of paraphrased questions (PQ-decaNLP), annotated with paraphrase phenomena. This enables analysis of how transformations such as swapping the class labels and changing the sentence modality lead to a large performance degradation. Training both MQAN and the newer T5 model using PQ-decaNLP improves their robustness and for some tasks improves the performance on the original questions, demonstrating the benefits of a model which is more robust to paraphrasing. Additionally, we explore how paraphrasing knowledge is transferred between tasks, with the aim of exploiting the multitask property to improve the robustness of the models. We explore the addition of paraphrase detection and paraphrase generation tasks, and find that while both models are able to learn these new tasks, knowledge about paraphrasing does not transfer to other decaNLP tasks."
},
    {
      "name": "On the Development of a Large Scale Corpus for Native Language Identification",
      "publisher": "Treebanks and Linguistic Theories (TLT17)",
      "releaseDate": "2018",
      "url": "https://ep.liu.se/en/conference-article.aspx?series=&issue=155&Article_No=12",
      "summary": "Native Language Identification (NLI) is the task of identifying an author’s native language from their writings in a second language. In this paper, we introduce a new corpus (italki), which is larger than the current corpora and can be used to train machine-learning-based systems for classifying and identifying the native language of authors of English text. To examine the usefulness of italki, we evaluate it by using it to train and test some of the well-performing NLI systems presented in the 2017 NLI shared task. We show the impact of varying the size of italki’s training dataset for some languages on system performance. From our empirical findings, we highlight the potential of italki as a large-scale corpus for training machine learning classifiers to classify the native language of authors from their written English text. We obtained promising results that show the potential of italki to improve the performance of current NLI systems. More importantly, we found that the current NLI systems generalize better when trained on italki than when trained on the current corpora."
    }
  ],
  "references": []
}
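
For reference, this file follows the JSON Resume schema (https://jsonresume.org/schema/). Below is a minimal sketch of consuming it in Python; the filename resume.json and the printed summary format are illustrative assumptions, not part of the gist.

import json

# Load this gist's contents, assumed saved locally as resume.json.
with open("resume.json", encoding="utf-8") as f:
    resume = json.load(f)

# Print a short plain-text overview of the basics and work history.
basics = resume["basics"]
print(f"{basics['name']} ({basics['label']})")
print(basics["summary"])

for job in resume.get("work", []):
    company = job.get("company", "unknown")  # the Technologist entry omits "company"
    end = job.get("endDate", "present")      # open-ended roles omit "endDate"
    print(f"- {job['position']} at {company}: {job['startDate']} to {end}")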