@rjurney
Last active May 30, 2024 19:58
Address label multiplication data augmentation strategy
System: I need your help with a data science data augmentation task. I am fine-tuning a sentence transformer paraphrase model to match pairs of addresses. I tried several embedding models and none of them performs well without fine-tuning for this task. I have created 27 example pairs of addresses to serve as training data for fine-tuning a SentenceTransformer model. Each record has the fields Address1, Address2, a Description of the semantic the pair expresses (e.g. 'different street number') and a Label (1.0 for a positive match, 0.0 for a negative one).
The training data covers two categories of corner cases. The first is when two addresses are close in string distance but are not the same address. The second is the opposite: when two addresses are far apart in string distance but are the same address. Your task is to read a pair of Addresses, their Description and their Label and generate 100 different examples that express a similar semantic. Your job is to create variations of these records. For some of the records, implement the logic in the Description directly. In others, include other categories of address variation that still fit the Label, that is, whether the addresses should match or not. Use what you know about postal addresses to accomplish this work. Distribute the example addresses you generate geographically: half should be from the United States and the rest from elsewhere in the world.
You should return the result as a valid JSON array of records and nothing else, using the fields Address1, Address2, Description and Label.
Human: Please generate 100 different examples that express the same semantic as the pair of addresses below, based on its description, label and the address pair.
Address 1: 2024 NW 5th Ave, Miami, FL 33127
Address 2: 2024 Northwest 5th Avenue, Miami, Florida 33127
Description: Standard Formatting Differences
Label: 1.0
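
The prompt above can be driven from code by filling the Human message from one seed record and checking that the reply really is a JSON array with the four requested fields. The following is only a minimal sketch under stated assumptions: the helper names `build_human_message` and `parse_augmented_records` and the `HUMAN_TEMPLATE` constant are hypothetical and not part of the original gist, and the call to the language model itself is deliberately left out.

```python
import json

# Field names requested in the System message.
REQUIRED_FIELDS = ("Address1", "Address2", "Description", "Label")

# Hypothetical template mirroring the Human message of the prompt.
HUMAN_TEMPLATE = (
    "Please generate 100 different examples that express the same semantic "
    "as the pair of addresses below, based on its description, label and "
    "the address pair.\n"
    "Address 1: {Address1}\n"
    "Address 2: {Address2}\n"
    "Description: {Description}\n"
    "Label: {Label}"
)


def build_human_message(record):
    """Fill the Human message from one seed record (a dict with the four fields)."""
    return HUMAN_TEMPLATE.format(**record)


def parse_augmented_records(raw):
    """Parse a model reply and keep only well-formed records.

    Raises ValueError (json.JSONDecodeError is a subclass) if the reply is
    not valid JSON or is not a JSON array. Records that are missing a field
    or whose Label is not 0.0 or 1.0 are dropped.
    """
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of records")
    valid = []
    for rec in records:
        if not isinstance(rec, dict):
            continue
        if any(field not in rec for field in REQUIRED_FIELDS):
            continue
        try:
            label = float(rec["Label"])
        except (TypeError, ValueError):
            continue
        if label not in (0.0, 1.0):
            continue
        # Normalize the label to a float so downstream code sees one type.
        valid.append({**rec, "Label": label})
    return valid


seed = {
    "Address1": "2024 NW 5th Ave, Miami, FL 33127",
    "Address2": "2024 Northwest 5th Avenue, Miami, Florida 33127",
    "Description": "Standard Formatting Differences",
    "Label": 1.0,
}
human_message = build_human_message(seed)
```

Dropping malformed records instead of failing the whole batch is a design choice for this sketch; a stricter pipeline might reject the reply and retry when any record is invalid or when fewer than 100 records come back.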