
@Astroneko404
Last active December 5, 2019 03:45
INFSCI 2420 Final Notes

Based on Speech and Language Processing (3rd edition draft) by Daniel Jurafsky et al.

Ch. 19 Word Senses and WordNet

Word Senses

Word sense:
A discrete representation of one aspect of the meaning of a word.

The meaning of a word can be defined by its co-occurrences, the counts of words that often occur nearby => Word embedding models like Word2Vec or GloVe.
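The intuition "similar co-occurrence patterns give similar vectors" can be sketched with cosine similarity over toy embeddings (the 4-dimensional vectors below are made-up illustrative values, not real Word2Vec/GloVe output):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy "embeddings" (illustrative values only)
vec = {
    "coffee": np.array([0.9, 0.8, 0.1, 0.0]),
    "tea":    np.array([0.8, 0.9, 0.2, 0.1]),
    "car":    np.array([0.1, 0.0, 0.9, 0.8]),
}

# Words that occur in similar contexts end up with similar vectors
print(cosine(vec["coffee"], vec["tea"]) > cosine(vec["coffee"], vec["car"]))  # True
```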

How dictionaries and thesauruses define senses

  1. Using glosses -- a translation or explanation of a word or expression;
  2. Defining a sense through its relationship with other senses.

Discrete senses:
We might consider two senses discrete if they have independent truth conditions, different syntactic behavior, and independent sense relations, or if they exhibit antagonistic meanings.

One practical technique for determining if two senses are distinct is to conjoin two uses of a word in a single sentence. (This kind of conjunction of antagonistic readings is called zeugma)

e.g. Given three sentences:

  • Which of those flights serve breakfast?
  • Does Air France serve Philadelphia?
  • 【?】Does Air France serve breakfast and Philadelphia?

For educational use, dictionaries tend to capture subtle meaning differences and use many fine-grained senses; for computational purposes, we often group or cluster senses instead.

Relation Between Senses

Synonym:
We say two senses of two different words (lemmas) are synonyms when their meanings are identical or nearly identical.

Antonym:
Antonyms are words with an opposite meaning.

Note: Automatically distinguishing synonyms from antonyms can be difficult, because although antonyms differ completely with respect to one aspect of their meaning (position on a scale or direction), they are otherwise very similar, sharing almost all other aspects of meaning.

Hyponym & Hypernym:
A word (or sense) is a hyponym of another word (or sense) if the first is more specific, denoting a subclass of the other.

e.g.

  • "dog" is a hyponym of "animal"
  • "animal" is a hypernym of "dog"

Since hyponym and hypernym are easily confused, we use superordinate & subordinate more often.

Meronym & Holonym:
Meronymy represents the part-whole relation.

e.g.

  • "wheel" is a meronym of "car"
  • "car" is a holonym of "wheel"

Structured Polysemy:
We call the relationship between semantically related senses of a word structured polysemy.

e.g. "bank" could represent:

  • An organization
  • The building associated with an organization

Metonymy:
The use of one aspect of a concept or entity to refer to other aspects of the entity or to the entity itself.

WordNet

WordNet is a lexical database, and the English WordNet contains three databases, one each for nouns and verbs, and a third for adjectives and adverbs. (Closed class words are not included)
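The hypernym hierarchy is WordNet's backbone, and following hypernym links upward is a common operation. A minimal sketch over a hand-made toy taxonomy (real WordNet would be queried through a library such as NLTK's `nltk.corpus.wordnet`; the dictionary below is illustrative only):

```python
# Toy WordNet-style noun taxonomy: each word maps to its direct hypernym
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
}

def hypernym_chain(word):
    """Follow hypernym links from a word up to the taxonomy root."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

print(hypernym_chain("dog"))  # ['dog', 'canine', 'carnivore', 'mammal', 'animal']
```

Because the relation is transitive, "dog" is a hyponym not just of "canine" but of everything on the chain, including "animal".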

Word Sense Disambiguation

The task of WSD

Input:
A word in context and a fixed inventory of potential word senses
Output:
The correct word sense in context

What kind of corpora?

Lexical sample task:
Given a small pre-selected set of target words and an inventory of senses for each word from the lexicon, disambiguate a small number of words.
All-words task:
Given an entire text and a lexicon with an inventory of senses for each entry, disambiguate all words in the text.

Semantic concordance:
A corpus in which each open-class word in each sentence is labeled with its word sense from a specific dictionary or thesaurus, most often WordNet.

Approaches

  • Supervised machine learning
  • Unsupervised machine learning
    • Thesaurus / Dictionary-based techniques
    • Selectional association
  • Lightly supervised

Evaluation

  1. Compute F1 score against hand-labeled sense tags in a held-out set, such as the SemCor corpus or SemEval corpora;
  2. Another strong baseline is majority vote;
  3. One sense per discourse: A word appearing multiple times in a text or discourse often appears with the same sense. It works better for coarse-grained senses and particularly for cases of homonymy rather than polysemy.

Contextual embedding algorithms

1-nearest-neighbor algorithm
Training

  • Embed each token in a sense-labeled training corpus
  • Average the embeddings of all tokens of each sense of each word to produce a sense embedding

Testing

  • Compare test embedding with training embeddings
  • Return sense of the nearest neighbor based on a similarity metric such as cosine

Note: For unseen test words, we could

  1. Fall back to the Most Frequent Sense baseline (majority vote);
  2. Impute the missing sense embeddings via WordNet taxonomy and supersenses.
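The train/test steps above can be sketched end to end. The 2-dimensional sense-labeled "token embeddings" below are toy values (a real system would use contextual embeddings, e.g. BERT outputs), and the sense keys are hypothetical:

```python
import numpy as np
from collections import defaultdict

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy sense-labeled token embeddings (illustrative values only)
labeled = [
    ("bank%finance", np.array([0.9, 0.1])),
    ("bank%finance", np.array([0.8, 0.2])),
    ("bank%river",   np.array([0.1, 0.9])),
    ("bank%river",   np.array([0.2, 0.8])),
]

# Training: average the token embeddings of each sense
by_sense = defaultdict(list)
for sense, emb in labeled:
    by_sense[sense].append(emb)
sense_emb = {s: np.mean(embs, axis=0) for s, embs in by_sense.items()}

def disambiguate(token_emb):
    """Testing: return the sense whose embedding is nearest by cosine."""
    return max(sense_emb, key=lambda s: cosine(sense_emb[s], token_emb))

print(disambiguate(np.array([0.7, 0.3])))  # bank%finance
```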

Imputation

Feature-Based WSD

Feature Vectors

A simple representation for each instance of a target word

What sort of features?

Collocational features:
Features about words at specific positions near target word.
e.g. ... guitar and bass player stand ... => [guitar, NN, and, CC, player, NN, stand, VB]
Bag-of-words features:
Features about words that occur anywhere in the window (regardless of position)
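Both feature types can be extracted with a few lines of code. This sketch assumes pre-tokenized, POS-tagged input and reuses the "bass" example above; the function name and window size are illustrative choices:

```python
def features(tokens, tags, i, window=2):
    """Extract simple WSD features for the target word at position i."""
    # Collocational features: words and tags at specific offsets from the target
    colloc = []
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        if 0 <= j < len(tokens):
            colloc += [tokens[j], tags[j]]
    # Bag-of-words features: window words with position information discarded
    bow = sorted(set(tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]))
    return colloc, bow

toks = ["guitar", "and", "bass", "player", "stand"]
tags = ["NN", "CC", "NN", "NN", "VB"]
colloc, bow = features(toks, tags, i=2)  # target word: "bass"
print(colloc)  # ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']
print(bow)     # ['and', 'guitar', 'player', 'stand']
```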

Evaluations and Baselines

Commonly used baselines

Most frequent sense, one sense per discourse, Lesk algorithm, ...

Simplified Lesk algorithm
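Simplified Lesk chooses the sense whose dictionary gloss has the most word overlap with the target word's context. A minimal sketch over a toy sense inventory (the glosses below are paraphrased for illustration, not quoted from WordNet):

```python
STOP = {"a", "an", "the", "of", "in", "on", "to", "and", "is"}

# Toy sense inventory for "bank" (glosses paraphrased for illustration)
SENSES = {
    "bank%finance": "a financial institution that accepts deposits and channels money into lending",
    "bank%river": "sloping land beside a body of water such as a river",
}

def simplified_lesk(context, senses=SENSES):
    """Choose the sense whose gloss overlaps most with the context words."""
    ctx = set(context.lower().split()) - STOP
    def overlap(sense):
        gloss = set(senses[sense].lower().split()) - STOP
        return len(ctx & gloss)
    return max(senses, key=overlap)

print(simplified_lesk("I sat on the river bank and watched the water"))  # bank%river
```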


Corpus Lesk algorithm

  • Assume we have some sense-labeled data (like SemCor);
  • Take all the sentences with the relevant word sense;
  • Now add these to the gloss + examples for each sense, call it the "signature" of a sense;
  • Choose sense with most word overlap between context and signature.

Semi-supervised Bootstrapping

If we don't have enough data to train a system:

  • Pick a word that might co-occur with the target word in a particular sense;
  • Grep through the corpus for the target word and the hypothesized word;
  • Assume that the target tag is the right one;
  • Generalize from a small hand-labeled seed set.
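The grep-and-assume step can be sketched as follows, in the style of Yarowsky's bootstrapping: hypothesized seed collocates label a first batch of examples, which would then seed a classifier that labels more (not shown). The seed words and sense tags below are hypothetical choices for the two senses of "bass":

```python
# Hypothetical seed collocates for two senses of "bass"
SEEDS = {"fish": "bass%fish", "play": "bass%music"}

def seed_label(sentences):
    """Grep for target + seed word; assume the seed's sense tag is correct."""
    labeled = []
    for s in sentences:
        words = s.lower().split()
        if "bass" in words:
            for seed, sense in SEEDS.items():
                if seed in words:
                    labeled.append((s, sense))
    return labeled

corpus = [
    "we caught a huge bass fish yesterday",
    "she can play the bass very well",
    "the bass was loud",  # no seed word: left unlabeled
]
print(seed_label(corpus))
```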

Using Thesauruses to Improve Embeddings

Current problem

Static word embeddings have a problem with antonyms. For example, "expensive" is often very similar in embedding cosine to its antonym "cheap".

Solutions

To improve both static and contextual word embeddings, we have two families of solutions:

  1. Retraining: Modify the static embedding loss function for Word2Vec, or modify contextual embedding training;
  2. Retrofitting / Counterfitting: After embeddings are trained, use a thesaurus to learn a second mapping that shifts antonyms apart and synonyms closer.
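The retrofitting half of option 2 can be sketched with the iterative update of Faruqui et al. (2015): each vector is repeatedly nudged toward its thesaurus neighbors while staying close to its original embedding. The 2-dimensional vectors and the tiny synonym graph below are toy values for illustration:

```python
import numpy as np

# Toy pre-trained embeddings: "expensive" starts close to its antonym "cheap"
orig = {
    "cheap":       np.array([1.0, 0.0]),
    "expensive":   np.array([0.9, 0.1]),
    "inexpensive": np.array([0.0, 1.0]),
}
synonyms = {"cheap": ["inexpensive"], "inexpensive": ["cheap"], "expensive": []}

def retrofit(orig, synonyms, alpha=1.0, beta=1.0, iters=10):
    """Pull each vector toward its synonyms, anchored to its original value."""
    q = {w: v.copy() for w, v in orig.items()}
    for _ in range(iters):
        for w, neighbors in synonyms.items():
            if not neighbors:
                continue
            nbr_sum = sum(q[n] for n in neighbors)
            q[w] = (alpha * orig[w] + beta * nbr_sum) / (alpha + beta * len(neighbors))
    return q

q = retrofit(orig, synonyms)
# "cheap" and "inexpensive" are pulled together; "expensive" is unchanged
```

Counterfitting extends this idea with a repulsion term that pushes antonym pairs apart, which this sketch omits.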

Word Sense Induction

Training


Testing

To disambiguate a particular token t of w we again have three steps:

Ch. 20 Semantic Role Labeling

Semantic Roles

Thematic roles are a way to capture the semantic commonality between "breakers" and "eaters"; below are some commonly used thematic roles and their examples:

Diathesis Alternations

Thematic grid/Case frame: The set of thematic role arguments taken by a verb.
e.g. Possible realization of arguments of the verb "break":

  • AGENT, THEME
  • AGENT, THEME, INSTRUMENT
  • INSTRUMENT, THEME
  • THEME

Verb alternations/Diathesis alternations: Sometimes verbs can realize the same arguments in different ways, and we call these multiple argument structure realizations verb alternations or diathesis alternations.

Problems with Thematic Roles

  1. It is difficult to come up with a standard set of roles, and equally difficult to produce a formal definition of roles like AGENT, THEME, or INSTRUMENT; (e.g. There seem to be at least two kinds of INSTRUMENTS)
  2. We would like to reason about and generalize across semantic roles, but the finite discrete lists of roles don't let us do this;
  3. It is difficult to formally define the thematic roles.

Solutions

There are alternative semantic role models that use either many fewer or many more roles.

  1. Define generalized semantic roles that abstract over the specific thematic roles;
  2. Define semantic roles that are specific to a particular verb or a particular group of semantically related verbs or nouns.

The Proposition Bank

Proposition bank/PropBank: A resource of sentences annotated with semantic roles.
In general:

  • arg0 - PROTO-AGENT
  • arg1 - PROTO-PATIENT
  • arg2 - The benefactive, instrument, attribute, or end state
  • arg3 - The start point
  • arg4 - The end point

PropBank focuses on verbs, while NomBank adds annotations to noun predicates.

FrameNet

Frame: The holistic background knowledge that unites groups of words. A frame in FrameNet is a background knowledge structure that defines a set of frame-specific roles, called frame elements.

Core roles: Semantic roles that are frame specific.
Non-core roles: Semantic roles which are more like the Arg-M arguments in PropBank, expressing more general properties of time, location, and so on.

FrameNet also codes relationships between frames, allowing frames to inherit from each other, or representing relations between frames like causation.

Semantic Role Labeling

Semantic role labeling (SRL) is the task of automatically finding the semantic roles of each argument of each predicate in a sentence.

The difference between FrameNet and PropBank

  • FrameNet employs many frame-specific frame elements as roles;
  • PropBank uses a smaller number of numbered argument labels that can be interpreted as verb specific labels, along with the more general ARGM labels.

Why semantic role labeling

  • A useful shallow semantic representation
  • Improves downstream NLP tasks (like machine translation and question answering)

Steps

  1. Pruning;
  2. Identification;
  3. Classification

A common final stage: joint inference

There is a common final stage to deal with global consistency since the algorithm classifies everything locally -- each decision about a constituent is made independently of all others.

To do the joint inference, we could rerank labels:

  • The first stage produces multiple possible labels for each constituent
  • The second stage chooses the best global label for all constituents

Evaluation

  • Each argument label must be assigned to exactly the correct word sequence or parse constituent
  • We could use precision, recall and F-measure
  • Two commonly used datasets for evaluation are CoNLL-2005 and CoNLL-2012
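The exact-match scoring above can be sketched over labeled argument spans. The span representation (predicate, role, start, end) is an illustrative choice:

```python
def prf1(gold, predicted):
    """Precision/recall/F1 over labeled argument spans.

    Each span is (predicate, role, start, end); a prediction counts as
    correct only if it matches a gold span exactly.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("break", "ARG0", 0, 1), ("break", "ARG1", 3, 5)}
pred = {("break", "ARG0", 0, 1), ("break", "ARG1", 3, 4)}  # one boundary error
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```

Note how strict the metric is: the ARG1 prediction gets no credit at all for a one-token boundary error.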

Selectional Restrictions

Problem: Consider the two interpretations of "I want to eat someplace nearby":

  1. "someplace nearby" is a location adjunct;
  2. The speaker is Godzilla, i.e., "someplace nearby" is the thing being eaten.

Representing selectional restrictions

We could add a new term to the representation:

However, it has two problems:

  1. Using FOL to perform the simple task of enforcing selectional restrictions is overkill;
  2. This approach presupposes a large, logical knowledge base of facts about the concepts that make up selectional restrictions.

A more practical approach is to state selectional restrictions in terms of WordNet synsets rather than as logical concepts.

Selectional preferences

  • Kullback-Leibler divergence: The difference between two distributions
  • Selectional preference: How much information the verb expresses about the semantic class of its argument
  • Selectional association of a verb with a class: The relative contribution of the class to the general preference of the verb

To compute selectional association:

  1. A probabilistic measure of the strength of association between a predicate and a semantic class of its argument
  2. A model represents the association of predicate v with a noun n
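The two measures above can be written down concretely, following Resnik's formulation (here $R$ is a grammatical relation such as direct object, $v$ a verb, and $c$ a semantic class):

```latex
% Selectional preference strength: the KL divergence between the
% distribution of classes given the verb and the prior class distribution
S_R(v) = D\big(P(c \mid v) \,\|\, P(c)\big)
       = \sum_c P(c \mid v) \log \frac{P(c \mid v)}{P(c)}

% Selectional association: the relative contribution of class c
% to the verb's overall preference
A_R(v, c) = \frac{1}{S_R(v)} \, P(c \mid v) \log \frac{P(c \mid v)}{P(c)}
```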

To evaluate:

  1. Pseudowords
  2. Compare to human preferences

Primitive Decomposition of Predicates

Primitive decomposition/Componential analysis is the idea of decomposing meaning into sets of primitive semantic elements or features.

Ch. 22 Coreference Resolution

Linguistic Background

What makes a text coherent?

  • Discourse structure
  • Rhetorical structure
  • Entity structure

Discourse models

When a referent is first mentioned in a discourse, a representation is evoked in the model.

What affects reference resolution

  • Lexical factors
    • Reference type: Inferrability, discontinuous set, generics, one anaphora, pronouns,...
  • Discourse factors
    • Recency
    • Focus/Topic structure, digression
    • Repeated mention
  • Syntactic factors
    • Agreement: Gender, number, person, case
    • Parallel construction
    • Grammatical role
  • Semantic/Lexical factors
    • Selectional restrictions
    • Verb semantics, thematic role

Task and Datasets

Reference resolution task

Finding in a text all the referring expressions that have one and the same denotation.

  • Input: Text
  • Output: All entities and the coreference links between them (create clusters)

Mention Detection

This is the first stage of coreference: finding the spans of text that constitute each mention.

Architectures for Coreference Algorithms

The mention-pair architecture

  • Input: A candidate anaphor and a candidate antecedent
  • Output: Probabilistic binary decision about coreference

Approaches

  • Machine learning supervised classifiers
  • Need a heuristic for sampling training examples due to class imbalance