Large-scale dataset? | Year of release | Name | Reference | URL to ref | URL to data | Access | Price | License | Summarization type | Language | Summaries specifically written for the corpora | Need to generate data? | domain | multi-doc? | nb of texts LREC | nb of texts | nb of texts per topic | nb of gold summaries per text to summarize | input length | output length LREC | output length | generic? | Misc comments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Main summarization corpora | |||||||||||||||||||||||
n | 2001 | DUC 2001 | ? | http://www-nlpir.nist.gov/projects/duc/pubs.html | http://www-nlpir.nist.gov/projects/duc/data.html | Email request | 0 | Abstractive | English | n | news | both | 60x10 | 600 | 10 | 1 | 50, 100, 200, 400 | multi-doc: 50, 100, 200, 400 words; single-doc: 100 words | generic? | See "DUC in context" Table 1 for more details | |||
n | 2002 | DUC 2002 | ? | http://www-nlpir.nist.gov/projects/duc/pubs.html | http://www-nlpir.nist.gov/projects/duc/data.html | Email request | 0 | Abstractive, extractive | English | n | news | both | 60x10 | 600 | 10 | 2 | 10, 50, 100, 200, 400 | multi-doc: 10, 50, 100, 200 words; single-doc: 100 words | generic? | ||||
n | 2003 | DUC 2003 | ? | http://www-nlpir.nist.gov/projects/duc/pubs.html | http://www-nlpir.nist.gov/projects/duc/data.html | Email request | 0 | Abstractive | English | n | news | both | 60x10, 30x25 | 624 | ~10 | 1? | 10, 100 | both | |||||
n | 2004 | DUC 2004 | ? | http://www-nlpir.nist.gov/projects/duc/pubs.html | http://www-nlpir.nist.gov/projects/duc/data.html | Email request | 0 | Abstractive | English+Arabic | n | news | both | 100x10 | ~740 | ~10 | 4 | 10, 100 | 50, 100, 250 words | both | ||||
n | 2005 | DUC 2005 | ? | http://www-nlpir.nist.gov/projects/duc/pubs.html | http://www-nlpir.nist.gov/projects/duc/data.html | Email request | 0 | Abstractive | English | n | news | y | 50x32 | 25-50 | 250 | 250 words | query-focused | ||||||
n | 2006 | DUC 2006 | ? | http://www-nlpir.nist.gov/projects/duc/pubs.html | http://www-nlpir.nist.gov/projects/duc/data.html | Email request | 0 | Abstractive | English | n | news | y | 50x25 | 25-50 | 4 | 250 words | query-focused | ||||||
n | 2007 | DUC 2007 | ? | http://www-nlpir.nist.gov/projects/duc/pubs.html | http://www-nlpir.nist.gov/projects/duc/data.html | Email request | 0 | Abstractive | English | n | news | y | 25x10 | 100 | update | http://duc.nist.gov/duc2007/tasks.html | |||||||
n | 2008 | TAC 2008 | ? | https://tac.nist.gov/publications/index.html | https://tac.nist.gov/data/index.html | Email request | 0 | Abstractive | English | n | news | y | 48x20 | 960 | 20 | 100 | update,query | ||||||
n | 2009 | TAC 2009 | ? | https://tac.nist.gov/publications/index.html | https://tac.nist.gov/data/index.html | Email request | 0 | Abstractive | English | n | news | y | 44x20 | 880 | 20 | 100 | guided | https://tac.nist.gov/data/index.html | |||||
n | 2010 | TAC 2010 | ? | https://tac.nist.gov/publications/index.html | https://tac.nist.gov/data/index.html | Email request | 0 | Abstractive | English | n | news | y | 46x20 | 920 | 20 | 100 | guided | ||||||
n | 2011 | TAC 2011 | ? | https://tac.nist.gov/publications/index.html | https://tac.nist.gov/data/index.html | Email request | 0 | Abstractive | English | n | news | y | 44x20 | 880 | 20 | 100 | guided | ||||||
n | 2003 | ICSI | Janin, Adam, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin et al. "The ICSI meeting corpus." In Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP'03). 2003 IEEE International Conference on, vol. 1, pp. I-I. IEEE, 2003. | https://scholar.google.com/scholar?cluster=734196485602731249&hl=en&as_sdt=0,5 | Abstractive, extractive | English | transcribed meetings | n | 57 | 57 | 3 human abstractive and 3 human extractive summaries are available, of respective average sizes 390 words and 133 utterances. | 390 | |||||||||||
n | 2005 | AMI | McCowan, Iain, Jean Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot et al. "The AMI meeting corpus." In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, vol. 88. 2005. | https://scholar.google.com/scholar?cluster=9565292835176993645&hl=en&as_sdt=0,5 | Abstractive, extractive | English | transcribed meetings | n | 137 | 137 | 1 human-written abstractive summary (300 words on average) and 1 human-composed extractive summary (140 utterances on average). | 300 | |||||||||||
n | 2010 | Opinosis | Ganesan, K. A., C. X. Zhai, and J. Han, "Opinosis: A Graph Based Approach to Abstractive Summarization of Highly Redundant Opinions", Proceedings of the 23rd International Conference on Computational Linguistics (COLING '10), 2010. | http://kavita-ganesan.com/opinosis | http://kavita-ganesan.com/opinosis-opinion-dataset | Publicly available on website | 0 | Abstractive | English | y | n | product reviews | y | 51x100 | ~5100 (51 topics, each containing around 100 sentences) | 51 | 4 human abstracts | 1 sentence | 25 | ~25 words | |||
y | 2003 | Gigaword | Graff, David, and Christopher Cieri. English Gigaword LDC2003T05. Web Download. Philadelphia: Linguistic Data Consortium, 2003. | ? | https://catalog.ldc.upenn.edu/ldc2003t05 | Publicly available on website | 3000 | Abstractive | English | n | n | news | n | 4111240 | 4111240 | 1 | Headline | y | |||||
y | 2005 | Gigaword 2 | Graff, David, et al. English Gigaword Second Edition LDC2005T12. Web Download. Philadelphia: Linguistic Data Consortium, 2005. | ? | https://catalog.ldc.upenn.edu/LDC2005T12 | Publicly available on website | 400 | Abstractive | English | news | |||||||||||||
y | 2007 | Gigaword 3 | Graff, David, et al. English Gigaword Third Edition LDC2007T07. Web Download. Philadelphia: Linguistic Data Consortium, 2007. | ? | https://catalog.ldc.upenn.edu/LDC2007T07 | Publicly available on website | 4000 | Abstractive | English | news | |||||||||||||
y | 2009 | Gigaword 4 | Parker, Robert, et al. English Gigaword Fourth Edition LDC2009T13. Web Download. Philadelphia: Linguistic Data Consortium, 2009. | ? | https://catalog.ldc.upenn.edu/LDC2009T13 | Publicly available on website | 5000 | Abstractive | English | news | |||||||||||||
y | 2011 | Gigaword 5 | Parker, Robert, et al. English Gigaword Fifth Edition LDC2011T07. DVD. Philadelphia: Linguistic Data Consortium, 2011. | ? | https://catalog.ldc.upenn.edu/ldc2011t07 | Publicly available on website | 6000 | Abstractive | English | news | 9876086 | 9876086 | |||||||||||
y | 2015 | LCSTS | LCSTS: A Large Scale Chinese Short Text Summarization Dataset | http://www.emnlp2015.org/proceedings/EMNLP/pdf/EMNLP229.pdf | http://icrc.hitsz.edu.cn/Article/show/139.html | 0 | 2. The original copyright of all the data of the Large Scale Chinese Short Text Summarization Dataset belongs to writers of the Weiboes, Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School collects, organizes, filters and purifies them. LCSTS is free to the public. 3. If you want to use the dataset for depth study, data providers (Intelligent Computing Research Center, Harbin Institute of Technology Shenzhen Graduate School) should be identified in your results. 4. The dataset is only for the specified applicant or study groups for research purposes. Without permission, it may not be used for any commercial purposes. | Abstractive | Chinese | n | n | Chinese microblogging website SinaWeibo | n | 2400591 | 2400591 | 1 | short text | even shorter text | y | Also contains 10,666 human-labeled (short text, summary) pairs, with scores from 1 to 5 indicating the relevance between the short text and the corresponding summary, as well as 1,106 pairs scored by 3 persons simultaneously. | |||
y | 2015 | CNN/Daily Mail dataset | (Hermann et al., 2015; Nallapati et al., 2016) | https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Teaching+machines+to+read+and+comprehend.&btnG= | 0 | Abstractive | English | n | y | news | n | 312084 | 312084 | 1 | typical news article | a few sentences | y | ||||||
y | 2016 | MSR Abstractive Text Compression Dataset | A Dataset and Evaluation Metrics for Abstractive Compression of Sentences and Short Paragraphs. Kristina Toutanova, Chris Brockett, Ke M. Tran, and Saleema Amershi, EMNLP 2016 | https://scholar.google.com/scholar?cluster=11978909955936947219&hl=en&as_sdt=0,5 | https://www.microsoft.com/en-us/download/details.aspx?id=54262 | Publicly available on website | 0 | Abstractive | English | y | n | business letters, newswire, journals, and technical documents sampled from the Open American National Corpus (OANC). | n | 6000 | 6000 | 26000/6000 | two-sentence paragraphs | ||||||
Less commonly used datasets | |||||||||||||||||||||||
LREC 2016 | A Publicly Available Indonesian Corpora for Automatic Abstractive and Extractive Chat Summarization | http://www.lrec-conf.org/proceedings/lrec2016/pdf/366_Paper.pdf | Email request | 0 | Indonesian | 300 chat logs | 3 | ||||||||||||||||
LREC 2014 | Building a Dataset for Summarization and Keyword Extraction from Emails | http://www.lrec-conf.org/proceedings/lrec2014/pdf/1037_Paper.pdf | Email request | 0 | English | 349 emails and threads have been annotated. 100k words. | |||||||||||||||||
LREC 2014 | A Repository of State of the Art and Competitive Baseline Summaries for Generic News Summarization | http://www.lrec-conf.org/proceedings/lrec2014/pdf/1093_Paper.pdf | 0 | English | Automatically generated from DUC 2004 using summarization systems | ||||||||||||||||||
LREC 2014 | Priberam Compressive Summarization Corpus: A New Multi-Document Summarization Corpus for European Portuguese | http://www.lrec-conf.org/proceedings/lrec2014/pdf/187_Paper.pdf | 0 | European Portuguese | 100 words | ||||||||||||||||||
LREC 2014 | LQVSumm: A Corpus of Linguistic Quality Violations in Multi-Document Summarization | http://www.lrec-conf.org/proceedings/lrec2014/pdf/578_Paper.pdf | 0 | English | Automatically generated from TAC 2011, with summarization system output errors annotated | ||||||||||||||||||
LREC 2010 | A French Human Reference Corpus for Multi-Document Summarization and Sentence Compression | https://aclanthology.coli.uni-saarland.de/papers/L10-1626/a-french-human-reference-corpus-for-multi-document-summarization-and-sentence-compression | 0 | French | see abstract | ||||||||||||||||||
EACL 2017 | Ouyang, Jessica, Serina Chang, and Kathleen McKeown. "Crowd-Sourced Iterative Annotation for Narrative Summarization Corpora." EACL 2017 (2017): 46. | http://www.aclweb.org/anthology/E17-2008 | http://www.cs.columbia.edu/~ouyangj/aligned-summarization-data/ | 0 | Abstractive and extractive | English | 476 | ||||||||||||||||
A Publicly Available Annotated Corpus for Supervised Email Summarization | |||||||||||||||||||||||
Sentence compression | |||||||||||||||||||||||
2013 | Google sentence compression | Overcoming the Lack of Parallel Data in Sentence Compression, Katja Filippova and Yasemin Altun, Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP '13), pp. 1481-1491. | http://www.aclweb.org/anthology/D/D13/D13-1155.pdf | https://github.com/google-research-datasets/sentence-compression?files=1 | 0 | Compressive | English | 250k | |||||||||||||||
2015 | Google sentence compression 2 | Sentence Compression by Deletion with LSTMs. | https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Sentence+Compression+by+Deletion+with+LSTMs.&btnG= | 0 | ~2M (but only 10k released?) | ||||||||||||||||||
Sentence simplification | |||||||||||||||||||||||
2017 | The WebSplit Benchmark | https://arxiv.org/pdf/1707.06971.pdf | 0 | 1066115 | |||||||||||||||||||
Special kinds of summarization | |||||||||||||||||||||||
EMNLP 2017 | https://aclanthology.coli.uni-saarland.de/papers/D17-1223/d17-1223 | Overview | |||||||||||||||||||||
EMNLP 2017 | https://aclanthology.coli.uni-saarland.de/papers/D17-1322/d17-1322 | Concept maps | |||||||||||||||||||||
ACL 2013 | CMU Movie Summary Corpus | Learning Latent Personas of Film Characters. David Bamman, Brendan O'Connor, and Noah A. Smith. ACL 2013, Sofia, Bulgaria, August 2013 | http://www.cs.cmu.edu/~ark/personas/ | Publicly available on website | 0 | CC-BY-SA 3.0 | Abstractive | English | 42306 |
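
A rough sketch of how the rows above could be filtered programmatically, for example to list the corpora flagged as large-scale. The helper names `parse_row` and `large_scale_names` are hypothetical, and the sample rows are abridged to their first three cells (large-scale flag, year, name). Many rows omit empty cells, so the sketch assumes only that those leading cells line up, which holds in the main corpora section but not in the later sections.

```python
def parse_row(line):
    """Split one pipe-delimited table row into stripped cell strings."""
    return [cell.strip() for cell in line.strip().split("|")]


def large_scale_names(lines):
    """Return the Name cell of every row whose first cell is 'y'.

    Section-title rows and rows with fewer than three cells are skipped.
    """
    names = []
    for line in lines:
        cells = parse_row(line)
        if len(cells) >= 3 and cells[0] == "y":
            names.append(cells[2])
    return names


# Abridged sample rows (first three cells only), for illustration.
sample_rows = [
    "Main summarization corpora | |||",
    "n | 2010 | Opinosis",
    "y | 2003 | Gigaword",
    "y | 2015 | LCSTS",
]

print(large_scale_names(sample_rows))
```

Matching on the first cell alone keeps the filter independent of the misaligned trailing columns.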
Summarization corpora taken from: https://docs.google.com/spreadsheets/d/1b1-NpM1jDK7KVHd_CwrxhpNZ1zAE8m-7M0pZ0gfZTMQ/edit#gid=0