05/05/2018

2018: Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

Projects audio segments that contain one spoken word into a high-dimensional space, just like Word2Vec. Uses forced alignment to split audio into words (which requires a transcript). The audio segments are padded with zeros, converted to MFCCs, and fed into an encoder-decoder trained with an RMSE loss. They also add noise to the signal and make the network denoise it. Trained on 500 hours of audio from LibriSpeech. Not sure how it can be incorporated into ASR or TTS systems: the audio has to be paired with text, otherwise Speech2Vec cannot split it into words using the forced-alignment method. It can be used to query whether a spoken word is similar to an existing word in the corpus.
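A minimal sketch of the preprocessing described above (cut one force-aligned word out of the waveform, compute MFCCs, zero-pad to a fixed length), assuming librosa for feature extraction; the frame and coefficient counts are illustrative, not the paper's settings.

```python
import numpy as np
import librosa

def word_segment_to_mfcc(audio, sr, start_s, end_s, n_mfcc=13, max_frames=100):
    """Cut one word out of the waveform (boundaries from forced alignment),
    compute MFCCs and zero-pad to a fixed number of frames."""
    segment = audio[int(start_s * sr):int(end_s * sr)]
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    mfcc = mfcc[:, :max_frames]
    padded = np.zeros((n_mfcc, max_frames), dtype=np.float32)
    padded[:, :mfcc.shape[1]] = mfcc                               # zero-padding
    return padded.T                                                # (frames, n_mfcc) for the encoder
```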

2016: Neural Machine Translation of Rare Words with Subword Units (BPE)

BPE is a data compression technique that iteratively replaces the most frequent pair of symbols with a single new symbol. It works well with named entities, loanwords and morphologically complex words, and handles rare and OOV words well. You can learn BPE on source and target separately, or transliterate the Russian vocabulary into Latin characters with ISO-9 to learn a joint BPE encoding, then transliterate the BPE merge operations back into Cyrillic to apply them to the Russian training text.
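A condensed sketch of the BPE merge-learning loop on a toy vocabulary (essentially the algorithm from the paper; the corpus and the number of merges are illustrative):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent-symbol pair frequencies over a {word as space-separated symbols: count} dict."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is split into characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                          # the number of merge operations is a hyperparameter
    pairs = get_pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)         # most frequent adjacent pair
    vocab = merge_pair(best, vocab)          # e.g. ('e', 's') -> 'es'
```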

2018: WORD TRANSLATION WITHOUT PARALLEL DATA -> MUSE: Multilingual Unsupervised and Supervised Embeddings

We show that we can build a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way.

2018: Phrase-Based & Neural Unsupervised Machine Translation

They present 2 methods of unsupervised MT:

1- Unsupervised NMT: A) Initialize using BPE on the joint corpus. B) Learn a LM using a denoising autoencoder network. C) Back-translation for both languages separately ... problem: this will split the latent space. D) Share encoder weights and parameters. (Sharing the decoder should give better regularization ... but not sure how this actually will work? How does the decoder know whether to translate to source or target?!)

2- Unsupervised PBSMT: A) Initialize separate word embeddings using MUSE and FastText. B) Use KenLM to learn a LM. C) Iterative back-translation to generate data from source to target and vice versa, and train on it (see the back-translation sketch below).
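A rough sketch of the iterative back-translation loop shared by both setups; ToyTranslator and its translate/train methods are hypothetical stand-ins for the actual NMT/PBSMT components, just to show the data flow:

```python
class ToyTranslator:
    """Hypothetical stand-in for an NMT/PBSMT model with translate/train methods."""
    def translate(self, sentence):
        return sentence                      # identity "translation", only to make the loop runnable
    def train(self, parallel_pairs):
        pass                                 # a real model would update its parameters here

def iterative_back_translation(mono_src, mono_tgt, model_s2t, model_t2s, rounds=3):
    """Alternate between generating synthetic parallel data and retraining each direction."""
    for _ in range(rounds):
        # (synthetic source, real target) pairs train the source->target model.
        synthetic_src = [model_t2s.translate(t) for t in mono_tgt]
        model_s2t.train(list(zip(synthetic_src, mono_tgt)))
        # (real source, synthetic target) pairs train the target->source model.
        synthetic_tgt = [model_s2t.translate(s) for s in mono_src]
        model_t2s.train(list(zip(mono_src, synthetic_tgt)))
    return model_s2t, model_t2s

s2t, t2s = iterative_back_translation(["a b c"], ["x y z"], ToyTranslator(), ToyTranslator())
```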

11/05/2018

2016: WaveNet: A Generative Model for Raw Audio (DeepMind - Google)

Audio generation depends on an autoregressive model:

P(x) = ∏t P(xt | x1, ..., xt−1)

WaveNet can generate raw speech signals with subjective naturalness.

Network:

Network uses dilated causal convolutions:

(figure: stacked dilated causal convolutions)

Causal: means the model does not depend on future inputs. Dilated: the filter is applied over an area larger than its length by skipping input values with a certain step.

Stacked dilated convolutions enable networks to have very large receptive fields with just a few layers, while preserving the input resolution throughout the network as well as computational efficiency.

The output layer is a softmax, but since 16-bit precision gives 65,536 possible values, they narrow it down using μ-law companding, which reduces the number of outputs to 256.
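A small numpy sketch of the μ-law companding and 256-way quantization (μ = 255), assuming the waveform has already been scaled to [−1, 1]:

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a [-1, 1] waveform with the mu-law transform and quantize to mu+1 levels."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)   # still in [-1, 1]
    return ((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int32)       # integers in [0, 255]

def mu_law_decode(indices, mu=255):
    """Invert the quantization back to an approximate waveform in [-1, 1]."""
    compressed = 2.0 * indices.astype(np.float64) / mu - 1.0
    return np.sign(compressed) * ((1.0 + mu) ** np.abs(compressed) - 1.0) / mu

x = np.sin(np.linspace(0, 8 * np.pi, 1000))          # toy waveform
labels = mu_law_encode(x)                            # 256-class targets for the softmax output layer
```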

Network uses gated activation units with residual connections and skip connections.

(figure: residual block with skip connections)

They also added 2 different types of conditioning:

  • Global conditioning: specifies the speaker ID.
  • Local conditioning: specifies linguistic features such as grapheme-to-phoneme (G2P) outputs.

P(x | h) = ∏t P(xt | x1, ..., xt−1, h)

Where h specifies the speaker ID and the linguistic features of the text. The network also uses "context stacks": a smaller network processes a long part of the audio signal and locally conditions a larger WaveNet that processes only a smaller part of the audio signal.

Experiments:

  1. Multi-Speaker Speech Generation: Using 44 hours of data from 109 speakers. Not a TTS.
  2. Text-To-Speech: The network was locally conditioned on linguistic features. External F0 (fundamental frequency) and phoneme-duration models were trained for each language. The English dataset consists of 24.6 hours, while the Chinese one is 34.8 hours.
  3. Music generation.
  4. Speech Recognition

28/05/2018

2015: Highway Networks

A new architecture designed to ease gradient-based training of very deep networks. Networks with this architecture are referred to as highway networks, since they allow unimpeded information flow across several layers on "information highways". Works like the cell-state gating in an LSTM, but on deep fully connected or convolutional networks.

out = H(in, WH)· T(in, WT) + in · (1 − T(in, WT))

If T(in, WT) is equal to 0 then the output is in; if it is 1 then the output is H(in, WH). The activation function of T is the sigmoid.
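A minimal numpy sketch of a single highway layer following the formula above; the weights are random here, just to show the shapes (the paper suggests a negative initial bias for the transform gate so the network starts by carrying the input through):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """out = H(x) * T(x) + x * (1 - T(x)); T gates between transforming and carrying the input."""
    H = np.tanh(x @ W_H + b_H)          # transform path (any nonlinearity; tanh here)
    T = sigmoid(x @ W_T + b_T)          # transform gate in (0, 1)
    return H * T + x * (1.0 - T)        # note: x and H must have the same dimensionality

d = 8
x = np.random.randn(4, d)               # batch of 4 examples
out = highway_layer(x,
                    np.random.randn(d, d) * 0.1, np.zeros(d),
                    np.random.randn(d, d) * 0.1, np.full(d, -1.0))   # negative gate bias
```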

03/06/2018

2017: TACOTRON: A Fully End-to-End TTS Synthesis Model

It is an end-to-end text-to-speech model. The benefits of having it end-to-end are:

  1. Remove feature engineering
  2. Condition on various attributes like speaker, language and sentiment
  3. Fewer errors, as it removes many models that can each introduce error
  4. Adaptation to new data might be easier

Can be trained from <audio, text> pairs from scratch. The main building component is CBHG (1-D Convolutional Bank, Highway Layers, and Bidirectional GRUs). The input sequence is first convolved with K sets of 1-D convolutional filters, where the k-th set contains Ck filters of width k (i.e. k = 1, 2, ..., K). These filters explicitly model local and contextual information. Batch normalization is applied to every conv layer.

Entire network looks as follows:

(figure: Tacotron network architecture)

The network predicts an 80-band mel-scale spectrogram as the target, which is later used to generate the waveform with the Griffin-Lim algorithm.

With the following parameters:

(table: hyperparameters)

Data is ~25 hours of speech from the same female speaker.

04/06/2018

2016: Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau Attention)

Encoding a source sentence into a fixed-length vector from which a decoder generates a translation can be an information bottleneck. Instead, this paper encodes the source sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.

Attention Architecture:

(figure: attention architecture)
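A small numpy sketch of the additive (Bahdanau-style) attention scoring and context vector, assuming a batch of one sentence; matrix names follow the paper loosely and all dimensions are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bahdanau_attention(s_prev, H, W_a, U_a, v_a):
    """e_j = v_a^T tanh(W_a s_{t-1} + U_a h_j); the context vector is the alpha-weighted
    sum of the encoder annotations h_j."""
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # (src_len,)
    alpha = softmax(scores)                          # attention weights over source positions
    context = alpha @ H                              # (enc_dim,)
    return alpha, context

src_len, enc_dim, dec_dim, attn_dim = 6, 10, 8, 12
H = np.random.randn(src_len, enc_dim)                # encoder annotations h_1 .. h_src_len
s_prev = np.random.randn(dec_dim)                    # previous decoder hidden state
alpha, context = bahdanau_attention(s_prev, H,
                                    np.random.randn(dec_dim, attn_dim),
                                    np.random.randn(enc_dim, attn_dim),
                                    np.random.randn(attn_dim))
```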

Output layer equations, which are based on maxout units (these work like max pooling in convolutional networks: an extra dimension of size l is added to the weight tensor and the maximum over it is selected):

(figure: output layer / maxout equations)

03/12/2018

2017: A Structured Self-attentive Sentence Embedding

Allows extracting different aspects of a sentence into multiple vector representations. Relieves some of the long-term memorization burden from the LSTM. The network consists of 2 components: a BiLSTM layer and a self-attentive mechanism.

The penalization term, as shown in the illustration below, makes each individual row in the matrix M focus on a single aspect of semantics, so we want the probability mass in the annotation softmax output to be more focused. It punishes redundancy between different summarization vectors.
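A short numpy sketch of that penalization term, assuming A is the (hops × seq_len) matrix of annotation weights produced by the self-attention layer, i.e. P = ||A Aᵀ − I||_F²:

```python
import numpy as np

def attention_penalty(A):
    """Frobenius-norm penalty that discourages different attention hops (rows of A) from overlapping."""
    overlap = A @ A.T                                # (hops, hops)
    identity = np.eye(A.shape[0])
    return np.sum((overlap - identity) ** 2)         # ||A A^T - I||_F^2

A = np.random.dirichlet(np.ones(20), size=5)         # 5 hops over 20 tokens, each row sums to 1
penalty = attention_penalty(A)                       # added to the downstream-task loss
```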

It is not able to train in an unsupervised way; there has to be an objective (downstream) task. They trained on author profiling, sentiment analysis and textual entailment.

(figure: self-attentive sentence embedding model)

07/12/2018

2016: Long Short-Term Memory-Networks for Machine Reading

It is a model that computes a semantic representation of a given sentence (like Sent2vec). Trained on downstream tasks like language modeling and sentiment analysis.

The model performs implicit relation analysis between tokens with an attention-based memory addressing mechanism at every time step. It is an augmented version of LSTM that memorizes a set of hidden states (h1 through ht) and a set of memory cells (c1 through ct). When a new word is given, ht is calculated using an attention module before being passed to the normal LSTM, as demonstrated in the diagram below:

(figure: LSTMN architecture)

Shallow fusion simply treats the LSTMN as a separate module that can be readily used in an encoder-decoder architecture, in lieu of a standard RNN or LSTM. Deep fusion combines inter- and intra-attention (initiated by the decoder) when computing state updates. As demonstrated below:

(figure: shallow and deep fusion)

It has been applied to:

  1. Language Model (Dataset: English Penn Treebank)
  2. Sentiment Analysis (Dataset: Stanford Sentiment Treebank)
  3. Textual Entailment (Dataset: SNLI)

23/12/2018

2017: Attention Is All You Need

Traditional sequence to sequence has the following problems:

  1. Needs more memory as sentence gets longer
  2. You cannot parallelize computation … has to be sequential
  3. Many of attention models do not take distance into consideration

Transformer networks:

  1. No need for recurrent neural networks
  2. No need for convolutional
  3. Faster … Needed 3.5 days to train (not a lot faster during inference)
  4. All about attention
  5. Mapping a query and a set of key-value pairs

Transformer network looks like this:

(figure: Transformer architecture)

The description below will go through the layers one by one:

1. Embedding:

They use learned embeddings to convert the input tokens and output tokens to vectors of dimension dmodel (512)

They used byte-pair encoding with a vocabulary of 37,000 shared tokens, hence the embedding matrix is shared between input and output.

2. Positional Encoding:

Inject some information about relative or absolute position.

PE(pos, 2i) = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

The dimension of this vector is equal to dmodel since the positional encoding vector is added to the token embedding.

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear function of PEpos.
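A compact numpy version of the sinusoidal positional encoding described above (d_model = 512 as in the paper; the sequence length is illustrative):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                    # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]                # even dimensions 0, 2, 4, ...
    angle = pos / np.power(10000.0, i / d_model)         # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe                                            # added element-wise to the token embeddings

pe = positional_encoding(max_len=50, d_model=512)
```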

3. Multi-Head Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

(figure: scaled dot-product attention and multi-head attention)

In a mathematical form it looks like this:

Attention(Q, K, V) = softmax(Q K^T / sqrt(dk)) V

We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.

(figure: multi-head attention)
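A single-head numpy sketch of scaled dot-product attention as defined above; multi-head attention applies this h times on learned linear projections of Q, K, V and concatenates the results:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional additive mask (0 or -inf) on the scores."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (len_q, len_k)
    if mask is not None:
        scores = scores + mask                            # -inf entries get zero attention weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                    # (len_q, d_v)

len_q, len_k, d_k, d_v = 4, 6, 64, 64
out = scaled_dot_product_attention(np.random.randn(len_q, d_k),
                                   np.random.randn(len_k, d_k),
                                   np.random.randn(len_k, d_v))
```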

4. Feed-Forward Network

FFN(x) = max(0, xW1 + b1)W2 + b2

W1: dmodel × dff (512 × 2048), W2: dff × dmodel (2048 × 512)

5. Masked Multi-Head Attention

Self-attention layers in the decoder allow each position in the decoder to attend to all positions in the decoder up to and including that position. Masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
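A tiny sketch of that mask for decoder self-attention; it can be passed as the mask argument of the attention sketch above:

```python
import numpy as np

def causal_mask(seq_len):
    """Additive mask: 0 on allowed (current and past) positions, -inf on future positions."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Position 0 may only attend to itself, position 3 to positions 0..3.
print(causal_mask(4))
```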

6. Decoder Multi-Head

Queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder. Intuitively, the encoder output says "here is a bunch of interesting things", and the query is how you address them.

Query is asking for information for the next word.

Results:

  • EN-FR: corpus of 36M sentence pairs, 41.8 BLEU
  • EN-DE: corpus of 4.5M sentence pairs, 28.4 BLEU

Discussion

Encoding is executed in parallel, while decoding has to be sequential: the next word cannot be decoded before the previous word is known. It is faster to train since we don't need to backpropagate all the way back to step 1 every time.

29/12/2018

2017: Unsupervised Neural Machine Translation (Mikel Artetxe)

NMT requires a large parallel corpus, and the lack of parallel corpora is a major problem. This paper discusses a novel approach to overcoming this obstacle.

The network uses a shared encoder and 2 decoders, one for each language:

(figure: shared encoder with two language-specific decoders)

Word embeddings are trained on a monolingual corpus for each language, then a linear transformation maps them into the shared space. This is done using a few thousand dictionary entries. They apply BPE for tokenization.
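One standard way to learn such a linear mapping from a small seed dictionary is the orthogonal Procrustes solution via SVD (as used in related cross-lingual embedding work); a sketch under that assumption, not necessarily the exact procedure of this paper:

```python
import numpy as np

def learn_orthogonal_mapping(X_src, Y_tgt):
    """Find an orthogonal W minimizing ||X_src W - Y_tgt||_F, where row i of X_src and Y_tgt
    holds the embeddings of the i-th seed dictionary pair."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt                                    # W maps source embeddings into the target space

dict_size, dim = 5000, 300                           # a few thousand entries, 300-dim embeddings
X = np.random.randn(dict_size, dim)                  # source-side embeddings of the dictionary entries
Y = np.random.randn(dict_size, dim)                  # target-side embeddings of their translations
W = learn_orthogonal_mapping(X, Y)
shared_space = X @ W                                 # source embeddings mapped into the shared space
```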

Training schemas:

  1. Denoising: swap words in the input sentence and train the decoder to output the original sentence. Solves the problem of copying input to output.
  2. Backtranslation:

(figure: back-translation training)

System Architecture:

  1. Dual structure: translates in both directions, L1 to L2 and L2 to L1.
  2. Shared encoder
  3. Fixed pre-trained cross-lingual embeddings

Notes:

  • Uses Byte Pair Encoding (BPE)
  • Embedding size of 300 (Word2vec SkipGram)
  • Greedy decoding during training => beam search with a width of 12 at inference
  • Cross-entropy loss function
  • Adding 100k parallel sentence pairs gave a huge boost to BLEU
  • Training is done by alternating between:
    1. Denoising L1
    2. Denoising L2
    3. Back-translating L1 -> L2
    4. Back-translating L2 -> L1
  • Results:

(table: results)

29/03/2019

2017: Arabic POS Tagging: Don’t Abandon Feature Engineering Just Yet (K. Darwish)

This paper focuses on comparing Support Vector Machine based ranking (SVMRank) and Bidirectional Long Short-Term Memory (bi-LSTM) neural-network based sequence labeling for building a state-of-the-art Arabic part-of-speech tagging system. Using SVMRank leads to state-of-the-art results, but with a fair amount of feature engineering. Using bi-LSTM, particularly when combined with word embeddings, may lead to competitive POS-tagging results by automatically deducing latent linguistic features. However, we show that augmenting bi-LSTM sequence labeling with some of the features that we used for the SVMRank-based tagger yields further improvements. We also show that gains realized using embeddings may not be additive with the gains achieved due to features. We are open-sourcing both the SVMRank and the bi-LSTM based systems for the research community.

We train the classifier using the following features, which are computed using the maximum-likelihood estimate on our training corpus:

  • p(POS|c0) and p(c0|POS)
  • p(POS|c−i..c−1) and p(POS|c1..cj); i, j ∈ [1, 4]
  • p(POS|POS of c−i .. POS of c−1) and p(POS|POS of c1 .. POS of cj); i, j ∈ [1, 4]. Since we don't know the POS tags of these clitics a priori, we estimate the conditional probability as a sum over their possible POS tags. For example, if the previous clitic could be a NOUN or ADJ, then p(POS|c−1) = p(POS|NOUN) + p(POS|ADJ).
  • p(POS|stem template)
  • p(POS|prefix) and p(POS|suffix)
  • p(POS|MetaType): which can be (NUM, FOREIGN, PUNCT, ARAB, PREFIX, SUFFIX)
  • p(POS|prefix, prev word prefix), p(POS|prev word suffix), p(POS|prev word POS)

Almost the same feature set is calculated for words. Results are as follows:

(table: results)

09/04/2019

2018: BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding

BERT stands for Bidirectional Encoder Representations from Transformers.

There are 2 main strategies for applying pre-trained language model representations to downstream tasks: feature-based and fine-tuning. Feature-based approaches are like ELMo or Word2Vec, where the trained parameters/representations are not adjusted to the downstream task. Fine-tuning is like the Generative Pre-trained Transformer (OpenAI GPT), which introduces minimal task-specific parameters and is trained on downstream tasks: pre-train some model architecture on a LM objective before fine-tuning that same model for a supervised downstream task.

For BERT there are 2 pre-training tasks: Masked Language Model (MLM) and "Next Sentence Prediction", which jointly pre-trains text-pair representations.

Contributions

  1. Importance of bidirectional pre-training for language representation. Uses MLM to enable pre-trained deep bidirectional representations.
  2. Those pre-trained representations eliminate the need of many heavily engineered task-specific architectures.
  3. BERT advances state-of-the-art for 11 NLP tasks

Architecture

A multi-layer bidirectional Transformer encoder, as shown in the figure.

(figure: model architecture)

The input representation is the sum of token embeddings, sentence/segment embeddings (the first sentence as SentA and the second as SentB) and positional embeddings. 30k-token vocabulary, 512 max tokens; the first token is [CLS] and sentences are separated by [SEP].

BERT network may take one sentence (A) as an input or a pair of sentences (A and B).

Pretraining Tasks
  1. Masked LM (MLM): 15% of all words in the sentence get masked at random with [MASK]. The output of the entire sentence (batch size, sequence length and hidden size) is used to predict the masked words. Each masked position's output is fed into a dense layer and then a softmax to predict the missing word. The loss is only calculated at the positions where a word is masked. There are 2 downsides to this approach. First, with this technique the pre-training data will always have [MASK] tokens while fine-tuning data wouldn't; to fix this, 80% of the time the selected token stays a [MASK], 10% of the time it changes into a random word and 10% of the time it stays the same word (see the masking sketch after this list). Second, it needs a lot of time to train.

  2. Next Sentence Prediction: the output at [CLS] is passed into a softmax to predict whether the two sentences are next to each other or not.
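A rough sketch of the 80/10/10 masking rule for the MLM objective; the token strings and vocabulary are placeholders, and a real implementation works on integer token IDs:

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Return (corrupted tokens, labels); labels are None except at the selected positions,
    where the loss is computed."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_token)              # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))    # 10%: replace with a random word
            else:
                corrupted.append(tok)                     # 10%: keep the original word
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, labels = mask_tokens("the cat sat on the mat".split(), vocab)
```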

Parameters

800M words from BooksCorpus and 2,500M words from Wikipedia text (excluding lists and tables). Batch size 256, max tokens 512, 1M steps ~ 40 epochs, GELU rather than ReLU, 12 layers, hidden size 768, 12 attention heads (BERT Large: 24 layers, hidden size 1024, 16 attention heads).

11 NLP Tasks

GLUE (General Language Understanding Evaluation):

  1. MNLI Entailment: input 2 sentences labels them as entailment, contradiction or neutral.
  2. QQP (Quora Question Pairs): input 2 questions labels them if they are equivalent or not.
  3. QNLI: Input Question and an Answer labels them as correct answer or not.
  4. SST-2: Sentiment of a single sentence positive or negative.
  5. CoLA: Predict if a single sentence is acceptable or not.
  6. STS-B: input 2 sentences, predicts how similar they are on a scale from 1 to 5.
  7. MRPC: input 2 sentences, predicts whether they are semantically equivalent or not (0/1).
  8. RTE: Textual entailment similar to MNLI.

(table: GLUE results)

SQuAD (Stanford Question Answering Dataset): a collection of 100k crowdsourced question/answer pairs. Given a paragraph from Wikipedia, find the span that contains the answer.

NER

SWAG

(table: results)

07/06/2019

2018: Universal Sentence Encoder

We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks.

We introduce the model architecture for our two encoding models in this section. Our two encoders have different design goals. One based on the transformer architecture targets high accuracy at the cost of greater model complexity and resource consumption. The other targets efficient inference with slightly reduced accuracy.

1- Transformer

The context aware word representations are converted to a fixed length sentence encoding vector by computing the element-wise sum of the representations at each word position. We then divide by the square root of the length of the sentence so that the differences between short sentences are not dominated by sentence length effects.
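A tiny numpy sketch of that length-normalized pooling (the representation dimensionality here is illustrative):

```python
import numpy as np

def pooled_sentence_embedding(word_reprs):
    """Element-wise sum of the context-aware word representations, divided by sqrt(sentence length)."""
    word_reprs = np.asarray(word_reprs)                  # (seq_len, dim)
    return word_reprs.sum(axis=0) / np.sqrt(word_reprs.shape[0])

emb = pooled_sentence_embedding(np.random.randn(7, 512))   # 7 tokens, 512-dim representations
```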

2- Deep Averaging Networks (DAN)

The second encoding model makes use of a deep averaging network (DAN) (Iyyer et al.,2015) whereby input embeddings for words and bi-grams are first averaged together and then passed through a feedforward deep neural network (DNN) to produce sentence embeddings.

Tasks:

  1. MR : Movie review snippet sentiment on a five star scale (Pang and Lee, 2005).
  2. CR : Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).
  3. SUBJ : Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
  4. MPQA : Phrase level opinion polarity from news data (Wiebe et al., 2005).
  5. TREC : Fine grained question classification sourced from TREC (Li and Roth, 2002).
  6. SST : Binary phrase level sentiment classification (Socher et al., 2013).
  7. STS Benchmark : Semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments (Cer et al., 2017).
  8. WEAT : Word pairs from the psychology literature on implicit association tests (IAT) that are used to characterize model bias (Caliskan et al.,2017).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment