@machelreid
Created April 28, 2021 00:43
Tips for CNN daily mail abstractive summarization

Abstractive Summarization (CNN-DM)

Here are things that I spent a lot of time on, so you don’t have to - especially with regard to preprocessing data for abstractive summarization. It will be pretty disorganized, but bear with me - there might be something useful in here.

Using Rouge

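Important: Don’t use pure Python implementations of Rouge!! Their scores can differ from those of the original Perl ROUGE-1.5.5 script, which makes your numbers hard to compare with published results. Use the following Python wrapper for the original Perl package: https://github.com/pltrdy/files2rouge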
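
Before scoring, it can help to confirm that the generated and reference files line up. As far as I understand it, files2rouge reads two plain-text files with one summary per line, so a mismatch in line counts usually means something went wrong upstream. Below is a minimal sketch of such a check; the function name `empty_hypothesis_lines` is hypothetical and not part of files2rouge.

```python
from typing import List, Sequence


def empty_hypothesis_lines(
    hypotheses: Sequence[str], references: Sequence[str]
) -> List[int]:
    """Check line alignment and return 1-based line numbers of empty hypotheses.

    Hypothetical helper: it only inspects the lines, it does not compute ROUGE.
    """
    if len(hypotheses) != len(references):
        raise ValueError(
            f"{len(hypotheses)} hypotheses vs {len(references)} references"
        )
    # An empty hypothesis is not necessarily an error, but it is worth a look.
    return [i for i, hyp in enumerate(hypotheses, start=1) if not hyp.strip()]
```

In practice you would pass in the lines of the two files (for example via `open(path).read().splitlines()`) right before running the scorer.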

Preprocessing Data for Abstractive Summarization

https://cs.nyu.edu/~kcho/DMQA/ - Download the stories portion for CNN and DailyMail (you can use gdown)

Preprocess into txt files using this script:

https://gist.github.com/machelreid/6f18b00c02c6d60bc7d8f2568aa3682e

or download it directly:

wget https://gist.githubusercontent.com/machelreid/6f18b00c02c6d60bc7d8f2568aa3682e/raw/1ddc2bd4260e503a03c133b4cf0956867a04dcd9/make_datafiles_cnn_dailymail.py
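
For orientation, here is a rough sketch of the core of this kind of preprocessing. It assumes the usual layout of a raw `.story` file: the article text first, followed by blocks that each start with an `@highlight` marker line. This is not the linked script, just an illustration, and `split_story` is a hypothetical name.

```python
from typing import List, Tuple


def split_story(story_text: str) -> Tuple[str, List[str]]:
    """Split the text of one .story file into (article, highlights).

    Assumes everything before the first "@highlight" marker is article text
    and every non-empty line after it (other than markers) is a highlight.
    """
    article_lines: List[str] = []
    highlights: List[str] = []
    seen_highlight = False
    for raw_line in story_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines only separate paragraphs and blocks
        if line == "@highlight":
            seen_highlight = True
        elif seen_highlight:
            highlights.append(line)
        else:
            article_lines.append(line)
    # One article per line, so source and target files stay line-aligned.
    return " ".join(article_lines), highlights
```

The article becomes one line of the source file and the highlights together become the matching line of the target file; how exactly the highlights are joined is up to the preprocessing script.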

Then, the following pointers:

  1. Learn the BPE vocabulary on the concatenated training source/target, without truncating either side (a 32K vocabulary should be good)
  2. During training/inference, it's common practice to truncate the source side to 400 tokens
  3. When evaluating on the test set, tokenize using the Stanford PTB Tokenizer as follows:
    export CLASSPATH=`pwd`/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
    cat $GEN | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $GEN.tokenized
    cat $REF | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $REF.target

(you can install Stanford CoreNLP, which contains the tokenizer, with the following script)

    wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
    unzip stanford-corenlp-full-2016-10-31.zip
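
Pointers 1 and 2 above can be sketched in a few lines. This is only an illustration under some assumptions: tokens are whatever is separated by whitespace in the line (for example, text that already has BPE applied), the BPE learner itself is not shown because I'm not prescribing a specific tool here, and the names `truncate_tokens` and `bpe_training_lines` are hypothetical.

```python
from typing import Iterable, Iterator

MAX_SOURCE_TOKENS = 400  # source-side truncation length from pointer 2


def truncate_tokens(line: str, max_tokens: int = MAX_SOURCE_TOKENS) -> str:
    """Keep at most max_tokens whitespace-separated tokens of a line."""
    return " ".join(line.split()[:max_tokens])


def bpe_training_lines(
    sources: Iterable[str], targets: Iterable[str]
) -> Iterator[str]:
    """Yield all source lines, then all target lines, without truncation.

    The result is the concatenated corpus on which BPE is learned (pointer 1).
    """
    for line in sources:
        yield line.rstrip("\n")
    for line in targets:
        yield line.rstrip("\n")
```

The point of keeping the two apart is that truncation only applies to the source side when feeding the model, while the BPE vocabulary is learned on the full, untruncated text of both sides.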