@leondz
Last active September 16, 2020 07:13

Data Statement for XX

How to use this document: Fill in each section according to the instructions. Give as much detail as you can, but there's no need to extrapolate. The goal is to help people understand your data when they approach it. This could be someone looking at it in ten years, or it could be you yourself looking back at the data in two years.

For full details, the best source is the original Data Statements paper: https://www.aclweb.org/anthology/Q18-1041/

Instruction fields are given as blockquotes; delete the instructions when you're done, and provide the file with your data, for example as "DATASTATEMENT.md". The lists in some blocks are designed to be filled in, but it's good to also leave a written description of what's happening, as well as the list. It's fine to skip some fields if the information isn't known.

Delete only the blockquoted content; leave the final "About this document" section intact.
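The clean-up described above can be checked mechanically before the file is shipped with the data. The following is a minimal sketch, not part of the Data Statements tooling: the function name `check_data_statement` is hypothetical, and it assumes that the instructions are Markdown blockquotes (lines starting with `>`) and that the section titles are kept exactly as they appear in this template.

```python
# Section titles copied from this template. They are matched as substrings,
# so the heading markup around them (e.g. a leading '#') does not matter.
SECTION_TITLES = [
    "A. CURATION RATIONALE",
    "B. LANGUAGE VARIETY/VARIETIES",
    "C. SPEAKER DEMOGRAPHIC",
    "D. ANNOTATOR DEMOGRAPHIC",
    "E. SPEECH SITUATION",
    "F. TEXT CHARACTERISTICS",
    "G. RECORDING QUALITY",
    "H. OTHER",
    "I. PROVENANCE APPENDIX",
    "About this document",
]


def check_data_statement(text):
    """Return a list of human-readable problems found in a filled-in statement.

    Hypothetical helper: it only looks for missing section titles and for
    leftover blockquoted instruction lines; it does not judge the content.
    """
    problems = []
    for title in SECTION_TITLES:
        if title not in text:
            problems.append(f"missing section: {title}")
    for number, line in enumerate(text.splitlines(), start=1):
        # Assumption: any line starting with '>' is an undeleted instruction.
        if line.lstrip().startswith(">"):
            problems.append(f"line {number}: leftover blockquoted instruction")
    return problems
```

Passing the contents of a finished `DATASTATEMENT.md` to `check_data_statement` should return an empty list. A statement that quotes source material in its own blockquotes would trigger false positives, so treat the output as a reminder rather than a verdict.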

Data set name: XX

Citation (if available):

Data set developer(s):

Data statement author(s):

Others who contributed to this document:

A. CURATION RATIONALE

Explanation. Which texts were included and what were the goals in selecting texts, both in the original collection and in any further sub-selection? This can be especially important in datasets too large to thoroughly inspect by hand. An explicit statement of the curation rationale can help dataset users make inferences about what other kinds of texts systems trained with them could conceivably generalize to.

B. LANGUAGE VARIETY/VARIETIES

Explanation. Languages differ from each other in structural ways that can interact with NLP algorithms. Within a language, regional or social dialects can also show great variation (Chambers and Trudgill, 1998). The language and language variety should be described with a language tag from BCP-47 identifying the language variety (e.g., en-US or yue-Hant-HK), and a prose description of the language variety, glossing the BCP-47 tag and also providing further information (e.g., "English as spoken in Palo Alto, California", or "Cantonese written with traditional characters by speakers in Hong Kong who are bilingual in Mandarin").

  • BCP-47 language tag:
  • Language variety description:
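To catch malformed tags early, the shape of a tag can be checked with a short script. This is only a sketch under stated assumptions: it covers a simplified subset of BCP-47 (language, optional script, optional region), which is enough for the two example tags above, and it ignores variants, extensions, private-use and grandfathered tags. It also does not check that a subtag is actually registered. The names `SIMPLE_TAG` and `split_simple_tag` are hypothetical.

```python
import re

# Simplified subset of BCP-47: language[-script][-region].
# Variants, extensions, private-use and grandfathered tags are NOT handled,
# and subtags are not looked up in the IANA registry.
SIMPLE_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"              # e.g. "en", "yue"
    r"(?:-(?P<script>[A-Za-z]{4}))?"            # e.g. "Hant"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"   # e.g. "US", "HK", "419"
)


def split_simple_tag(tag):
    """Split a tag into its subtags, or return None if the shape is wrong."""
    match = SIMPLE_TAG.fullmatch(tag)
    if match is None:
        return None
    # Keep only the subtags that are present in this tag.
    return {name: value for name, value in match.groupdict().items() if value}


print(split_simple_tag("en-US"))        # {'language': 'en', 'region': 'US'}
print(split_simple_tag("yue-Hant-HK"))  # {'language': 'yue', 'script': 'Hant', 'region': 'HK'}
print(split_simple_tag("english"))      # None
```

The split subtags are also a convenient starting point for the prose description, which should gloss each part of the tag.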

C. SPEAKER DEMOGRAPHIC

Explanation. Sociolinguistics has found that variation (in pronunciation, prosody, word choice, and grammar) correlates with speaker demographic characteristics (Labov, 1966), as speakers use linguistic variation to construct and project identities (Eckert and Rickford, 2001). Transfer from native languages (L1) can affect the language produced by non-native (L2) speakers (Ellis, 1994, Ch. 8). A further important type of variation is disordered speech (e.g., dysarthria). Specifications include:

  • Description:
  • Age:
  • Gender:
  • Race/ethnicity (according to locally appropriate categories):
  • First language(s):
  • Socioeconomic status:
  • Number of different speakers represented:
  • Presence of disordered speech:

D. ANNOTATOR DEMOGRAPHIC

Explanation. What are the demographic characteristics of the annotators and annotation guideline developers? Their own “social address” influences their experience with language and thus their perception of what they are annotating. Specifications include:

  • Description:
  • Age:
  • Gender:
  • Race/ethnicity (according to locally appropriate categories):
  • First language(s):
  • Training in linguistics/other relevant discipline:

E. SPEECH SITUATION

Explanation. Characteristics of the speech situation can affect linguistic structure and patterns at many levels. The intended audience of a linguistic performance can also affect linguistic choices on the part of speakers. The time and place provide broader context for understanding how the texts collected relate to their historical moment and should also be made evident in the data statement. Specifications include:

  • Description:
  • Time:
  • Place:
  • Modality (spoken/signed, written):
  • Scripted/edited vs. spontaneous:
  • Synchronous vs. asynchronous interaction:
  • Intended audience:

F. TEXT CHARACTERISTICS

Explanation. Both genre and topic influence the vocabulary and structural characteristics of texts (Biber, 1995), and should be specified.

G. RECORDING QUALITY

Explanation. For data that include audiovisual recordings, indicate the quality of the recording equipment and any aspects of the recording situation that could impact recording quality.

H. OTHER

Explanation. There may be other information of relevance as well. Please use this space to develop any further categories that are relevant for your dataset.

I. PROVENANCE APPENDIX

Explanation. For datasets built out of existing datasets, the data statements for the source datasets should be included as an appendix.

About this document

A data statement is a characterization of a dataset that provides context to allow developers and users to better understand how experimental results might generalize, how software might be appropriately deployed, and what biases might be reflected in systems built on the software.

Data Statements are from the University of Washington. Contact: datastatements@uw.edu. This document template is licensed as CC0.

This version of the Markdown Data Statement is from June 4th, 2020. The Data Statement template is based on worksheets distributed at the 2020 LREC workshop on Data Statements, by Emily M. Bender, Batya Friedman, and Angelina McMillan-Major. Adapted into a community Markdown template by Leon Derczynski.
