@o0101
Last active October 17, 2017 12:00
Text matching in adversarial conditions.
/**
The idea is that we want to "attach" something ( annotation, edit, image, link, whatever )
to a particular piece of text that is not necessarily delimited by an element.
In other words, some free-form text. Whether this text comes from HTML or from a text file
is unimportant. The point is to find this attachment point even when:
- the order of paragraphs is altered
- the order of sentences in a paragraph is altered
- the order of words in a sentence is altered
And we would like to still find the attachment point with high probability when:
- the words before, after and within the attachment point have changed, been deleted or been added to.
The basic idea is that, in order to create a memory
of a certain location in the source, we extract multiple layers
of features, or patterns, or signals.
Our 'matching function', which ranks candidate attachment points
by how closely we believe they match the remembered point,
is a combination of scores derived from these signals.
Some ideas I have for signals now are:
- bag of words / word vector, take inner product to produce score
- bag of letter trigrams / trigram vector, take inner product to produce score
- exact match / 0 or 1 for mismatch or exact match to produce score
- edit distance / alignment to produce score
- word bigram vector, take inner product to produce score
- paragraph index, symmetric difference to produce score
- sentence index relative to document, symmetric difference to produce score
- sentence index relative to paragraph, symmetric difference to produce score
- first, middle or last, sentence, 0 or 1 to produce score
- first, middle or last, paragraph, 0 or 1 to produce score
- sentence prior, sentence after
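A minimal sketch of the two inner-product signals at the top of this list, under stated assumptions.
The helper names ( trigramVector, wordVector, dot, cosine ) are hypothetical and not part of these notes,
and dividing the inner product by the vector lengths ( cosine similarity ) is an added assumption
so that scores from different signals land on the same 0 to 1 scale.

```javascript
// Hypothetical helpers; none of these names come from the original notes.

// Count overlapping letter trigrams in a lowercased string.
function trigramVector(text) {
  const counts = new Map();
  const s = text.toLowerCase();
  for (let i = 0; i + 3 <= s.length; i++) {
    const gram = s.slice(i, i + 3);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Count words, splitting on whitespace only (punctuation stays attached).
function wordVector(text) {
  const counts = new Map();
  for (const w of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts.set(w, (counts.get(w) || 0) + 1);
  }
  return counts;
}

// Inner product of two sparse count vectors.
function dot(a, b) {
  let sum = 0;
  for (const [key, value] of a) sum += value * (b.get(key) || 0);
  return sum;
}

// Assumption: normalize the inner product so the score lies in [0, 1].
function cosine(a, b) {
  const denom = Math.sqrt(dot(a, a) * dot(b, b));
  return denom === 0 ? 0 : dot(a, b) / denom;
}
```

With these, cosine( wordVector( a ), wordVector( b ) ) ignores word order entirely, which is the property wanted when
the words in a sentence are shuffled, while the trigram version still gives partial credit when individual words are altered.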
So to make a memory, we record the exact text of the sentence we are memorizing.
We also record the sentence and paragraph indices, and the values of the features we cannot
compute from the extracted text alone ( first, middle, last; sentence prior and sentence after ).
And then to compute a match we run the following algorithm:
- look for an exact match of the extract. If there is exactly one, we have found the attachment point; otherwise continue.
- compute values for all the signals for the extracted sentence, and compute values for all the signals from every other sentence,
possibly weighting each signal, and then compute match scores between the values of signals for the extracted sentence
and the values of signals for all other sentences. Rank these, breaking aggregate score ties in favor of the sentence that appears earliest in the document.
- attempt to apply the edit, annotation, modification, whatever, to the highest ranked sentence, and if it works, say:
"The sentence we're editing has changed, and this may not be the sentence we were looking for. Click here to see the next 10 best
candidates for the sentence we were looking for."
If it doesn't work, attempt to apply it to each of the next 10 best matches. If it works on one of them, display the same message as above.
If it still doesn't work, apply it anyway to the top ranked sentence and leave a note that says,
"The sentence we're editing has changed or moved, and we are not sure if this is the sentence we were looking for. Sorry.
This can happen when the document was edited after we marked it. Click here to see the next 20 best candidates
for the sentence we were looking for."
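The first two steps above can be sketched as follows. This is only an illustration under assumptions:
rankCandidates, wordOverlap and the { weight, score } signal shape are hypothetical names, and the single signal used
here is a simple shared-word ratio ( Jaccard ) standing in for the real set of signals.

```javascript
// Hypothetical names throughout; the weights and signal functions are placeholders.

// Stand-in signal: shared words divided by total distinct words (Jaccard).
const wordOverlap = (a, b) => {
  const wa = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  let shared = 0;
  for (const w of wa) if (wb.has(w)) shared++;
  const union = wa.size + wb.size - shared;
  return union === 0 ? 0 : shared / union;
};

function rankCandidates(extract, sentences, signals) {
  // Step 1: a unique exact match settles it.
  const exact = [];
  sentences.forEach((s, i) => { if (s === extract) exact.push(i); });
  if (exact.length === 1) return [{ index: exact[0], score: Infinity }];

  // Step 2: weighted sum of signal scores for every candidate sentence.
  const scored = sentences.map((s, index) => ({
    index,
    score: signals.reduce((sum, sig) => sum + sig.weight * sig.score(extract, s), 0),
  }));

  // Higher score first; ties go to the sentence that appears earlier.
  scored.sort((a, b) => (b.score - a.score) || (a.index - b.index));
  return scored;
}
```

For example, rankCandidates( extract, sentences, [ { weight: 1, score: wordOverlap } ] ) returns every sentence index ordered
from best to worst, so the "next 10 best candidates" in the messages above would be a slice of that array.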
// we break "sentences" on these marks
const SEN_MARK = {
  en: [ ".", "'", '"', ":", ";", "!", "?", "()", "[]", "“”", "‘’" ],
  zh: [ "。", "「」", "﹁ ﹂", ";", ":", "!", "?", "()", "[]", "【】", "“”", "‘’", "《》", "〈〉" ],
  es: [ ".", "'", '"', ":", ";", "¡!", "¿?", "()", "[]", "⟨⟩", "“”", "‘’", "‹›", "«»" ],
  hi: [ "|", ";", "?", "!", "”" ],
  ar: [ ".", "؟", ":", "“”" ]
};
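One possible way to use a list of marks like the ones above, as a rough sketch only. The function name splitSentences is
hypothetical, and it is an assumption here that only the single-character entries act as break points; the notes do not
say how paired entries such as "()" should be handled, so this sketch simply skips them.

```javascript
// Hypothetical helper: break text after any single-character mark.
// Paired entries such as "()" are ignored here (an assumption, see above).
function splitSentences(text, marks) {
  const terminators = new Set(marks.filter((m) => m.length === 1));
  const sentences = [];
  let current = "";
  for (const ch of text) {
    current += ch;
    if (terminators.has(ch)) {
      const trimmed = current.trim();
      if (trimmed) sentences.push(trimmed);
      current = "";
    }
  }
  // Keep any trailing text that has no closing mark.
  const rest = current.trim();
  if (rest) sentences.push(rest);
  return sentences;
}
```

Called as splitSentences( text, [ ".", "!", "?" ] ), this yields the sentence list that the sentence-index signals would be computed over.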
The aim is to approach the best we can do without understanding semantics.
**/
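// Stub: record the exact sentence text, its indices and the source-dependent features.
function remember( letter_index_from_source, sentence_text, source ) {
}
// Stub: rank the sentences of source against the remembered sentence and return the best match.
function find( sentence_text, source_dependent_scores, source ) {
}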