@aurimasv
Last active November 28, 2016 06:26

Background

Abbreviations (mostly for journal titles, though other fields need to be considered as well) are required by a number of citation styles, and the rules or vocabularies for these abbreviations vary from style to style. Thus, in order to produce correct citations, the Citation Style Language (CSL) should include a way to indicate which abbreviation format each style uses.

Styles requiring abbreviations

The following lists some citation styles that have specific rules for abbreviating various fields:

  • ICMJE/Vancouver (from Citing Medicine 2nd ed.)
    • Abbreviate titles using MEDLINE abbreviation list
      • Abbreviate and capitalize significant words in a journal title and omit other words, such as articles, conjunctions, and prepositions. For example: of, the, at, in, and, L'
      • Do not abbreviate journal titles that consist of a single word or titles written in a character-based language such as Chinese, Japanese, and Korean
      • Do not include journal subtitles as part of the abbreviated title
      • Omit any punctuation in a title
      • Ignore diacritics, accents, and special characters in titles. This rule ignores some conventions used in non-English languages to simplify rules for English-language publications
  • ACS
    • The journal name is an essential component of a periodical reference citation. Abbreviate the name according to the Chemical Abstracts Service Source Index (CASSI), and italicize it. One-word journal names are not abbreviated (e.g., Biochemistry, Macromolecules, Nature, Science). No punctuation is added to end this field; thus, a period will be there with an abbreviation but not with a spelled-out word.
  • CMoS
    • Titles of journals are italicized and capitalized headline-style. They are usually given in full--except for the omission of an initial The--in notes and bibliographies (e.g., Journal of Business). With foreign-language journals and magazines, an initial article should be retained (e.g., Der Spiegel).
  • Annual Reviews
    • "Abbreviate titles of journals, proceedings, symposia, and serial compendia (such as the Annual Review volumes) according to the ISSN List of Title Word Abbreviations (LTWA), published by the International Organization for Standardization".
  • Elsevier
  • Society of Biblical Literature (SBL)
    • Abbreviate titles of standard works in footnotes, but cite the complete titles in the bibliography. The SBL Handbook of Style offers two extensive lists of abbreviations for journals, series, and other standard reference works. The first abbreviation list is alphabetized by the source (SBLHS 8.4.1) and the second by the abbreviation (SBLHS 8.4.2). If the work you are citing is in these lists, use the standard abbreviation listed.

Abbreviation lists

  • LTWA
    • The List includes 55,650 words in about 70 languages
    • "The databases appearing on or accessible from the website "the ISSN International Centre" are the exclusive property of CIEPS and are protected under the provisions of the law of 1st July 1998 implementing in the Intellectual Property Code the European Directive of 11 March 1996 on the legal protection of databases. Any performance, whether total or partial, of this site by any company whatsoever, without the express authorization of the CIEPS is strictly forbidden and shall constitute an infringement sanctioned such as Intellectual Property Code."
    • The words are abbreviated in accordance with the ISO 4 standard
  • MEDLINE
    • "The List of Journals Indexed for MEDLINE publication ceased with the 2008 edition. The NLM Catalog can be used to obtain a list of currently indexed MEDLINE titles..."
    • As of March 1, 2007, NLM establishes title abbreviations based on the form used by the ISSN Centre as their abbreviated key title, whenever this is available, editing only for format as described below:
      • Each word in the title abbreviation is capitalized.
      • All punctuation is removed, except for parentheses used when a qualifier is supplied.
      • All diacritics are removed.
      • Qualifying elements which refer to format, such as (Print) or (Online) are omitted.
      • One word titles are never abbreviated.
      • At least two letters must be dropped from a word before it is abbreviated. Words from which only a single letter would be dropped are not abbreviated.
    • http://www.nlm.nih.gov/pubs/factsheets/constructitle.html
  • CASSI
    • "You are prohibited from using automated programs for systematic retrieval of CASSI content to create or compile, directly or indirectly, a collection, compilation, database, or directory. An example of automated retrieval is a script written to extract and download CASSI data in batches."
    • Could reference managers still have users "manually" fill in the list of their journals by clicking a button? Maybe for each journal?
  • ISI (Web of Science) Journal Title Abbreviations
  • SBLHS 8.4.1
    • Probably copyrighted (http://www.sbl-site.org/publications/publishingwithsbl.aspx)

Requirements for abbreviation system

CSL language

Abbreviation list/rule must be identified for each style

It seems that the abbreviation rules can be applied globally to the whole style. That is, it does not appear necessary to be able to define separate abbreviation rules for in-text citations and bibliography. Different fields may have different abbreviation vocabularies, but these can/should be handled within the same abbreviation list.

Because there is currently no indication of which abbreviation list must be used for each style, a default list should be defined so that most styles do not require updating. The LTWA list appears to be the most complete and most commonly used (it has been used to generate MEDLINE abbreviations, for instance), so I suggest that it become the default list.

Abbreviation lists should have a defined location

One must be able to obtain the abbreviation lists. An absolute URL could serve to both identify the list uniquely and supply its location. Alternatively, the CSL specification could define a URL prefix that could be prepended to the name of the list, which would then form the complete URL. However, that would mean that lists could only be hosted on CSL servers (i.e. no custom lists for custom styles).

Availability of abbreviated form should be testable

Some styles allow (but do not require) the use of hereinafter abbreviations for institutional authors (and probably other fields). In this case, the first citation should give the full name followed by the abbreviation in parentheses. Subsequent citations would give the abbreviated form only. Thus, in order to determine whether the abbreviated form should be used and included next to the full form on first citation, it should be possible to test whether an abbreviation exists or can be formulated via the abbreviation list.

CSL processors

Abbreviations should fall back to non-abbreviated title

If an abbreviation cannot be formulated using the abbreviation list (or, possibly, if the abbreviated title would be longer than the unabbreviated form), the processor should fall back to the unabbreviated title.

List definitions

Support abbreviations for multiple fields

Some styles (particularly legal styles) require that abbreviations be used for multiple fields, not just journal titles. These include series titles (CMoS, Society of Biblical Literature), institutional authors (these may not be required though), publishers (MLA), courts, and reporters.

Support mapping abbreviations for exact titles

Some fields use special abbreviations for particular journals (e.g. [Astrophysical Journal](https://forums.zotero.org/discussion/8278/text-substitution/)). Some journals may simply have conventional abbreviations that do not follow any rules.

Support word-by-word abbreviations

In most cases, abbreviations can be created on-the-fly by abbreviating each word in the string.

Support partial word matches

Word-by-word abbreviations can be done more reasonably by allowing partial matches. It seems that the matches are always anchored from the beginning of the string. I don't think we need to consider matching on any other part of the string.
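
A minimal sketch of anchored partial matching, assuming the trailing hyphen-minus convention for partial entries that this proposal uses for the "words" list. The function name `lookup_word` and the sample entries are hypothetical and are not taken from any real abbreviation list.

```python
def lookup_word(word, table):
    """Return the abbreviation for `word`, or None if the table has no match."""
    # An exact entry wins over any partial entry.
    if word in table:
        return table[word]
    # Partial keys end with "-" and are anchored at the start of the word.
    # Try the longest prefix first so that more specific entries win.
    for end in range(len(word), 0, -1):
        key = word[:end] + "-"
        if key in table:
            return table[key]
    return None


# Illustrative entries only (hypothetical, not copied from LTWA).
words = {"journal": "J.", "biolog-": "Biol.", "bio-": "Bio."}

print(lookup_word("biological", words))
print(lookup_word("biome", words))
print(lookup_word("physics", words))
```

Trying the longest prefix first means "biolog-" takes precedence over "bio-" for "biological", while "biome" still falls through to the shorter entry.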

Support fallback lists

For compactness and de-duplication reasons, the list should be able to supply a fallback list that would be queried if no abbreviation can be created using the current list. The fallbacks could be chained.
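
One possible reading of chained fallbacks, sketched for exact matches only. It assumes that all lists have already been downloaded and are keyed by their URI; the function name `lookup_with_fallback`, the example.org URIs, and the "J. Bus." entry are hypothetical. The order in which exact and word-by-word matching interact with fallbacks is still an open question and is not addressed here.

```python
def lookup_with_fallback(key, list_uri, lists_by_uri):
    """Look up `key` in the "exact" table, following the fallback chain."""
    seen = set()
    uri = list_uri
    while uri is not None and uri not in seen:
        seen.add(uri)  # guard against fallback cycles
        current = lists_by_uri.get(uri)
        if current is None:
            return None  # the fallback list is not available locally
        hit = current.get("exact", {}).get(key)
        if hit is not None:
            return hit
        uri = current.get("fallback")
    return None


# Hypothetical lists: a field-specific list that falls back to a general one.
LISTS = {
    "https://example.org/astro.json": {
        "exact": {"astrophysical journal": "ApJ"},
        "fallback": "https://example.org/general.json",
    },
    "https://example.org/general.json": {
        "exact": {"journal of business": "J. Bus."},
    },
}

print(lookup_with_fallback("journal of business", "https://example.org/astro.json", LISTS))
```

The `seen` set is a design choice to keep a misconfigured chain (two lists pointing at each other) from looping forever.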

Note

If a partial abbreviation can be constructed using the current list in a word-by-word fashion, should the remainder of the words be abbreviated using the fallback list? This may be a reasonable way to allow overriding only a subset of the words used for abbreviations.

What is the order of the fallback? Do you fall back for exact matches first, then restart from the top for word-by-word abbreviations?

Support skip-words

Some words in the titles should be skipped when performing word-by-word abbreviations (e.g. articles, prepositions, etc.), but it would seem that not all of them should be skipped at the beginning or end of the sentence. Some styles state that only non-significant words should be skipped. Is there a way for us to determine programmatically which words are non-significant?

Abbreviation processors

Abbreviations should be processed by the reference manager

This is more of a suggestion than a requirement. Obviously if the CSL processor decides to implement this, no one would be upset.

There are several reasons why the CSL processor should not be expected to handle the task of abbreviating fields.

  • CSL processors often perform in an isolated environment and do not have access to disk or network I/O. The lists could be passed to the processors, however.
  • Reference managers may want to provide a way for users to override certain abbreviations (be it by exact match or word-by-word abbreviations). Thus, they are best-suited to control how these overrides take place. Otherwise, they would be forced to edit existing lists passed to the CSL processor.
  • While this would place some burden on the reference managers (which are more numerous than CSL processors), the abbreviation algorithms are fairly straightforward and should not be a problem to implement.
  • Use of abbreviations may not be limited to citations. There is also a demand for automatic abbreviations in metadata exports (e.g. BibTeX), thus it is in the best interest of the reference managers to implement abbreviation algorithms for internal use as well as for CSL processors.

Rules for processing abbreviations

The abbreviation list to be used is determined by the list name indicated in the citation style and by the field that needs to be abbreviated. Once the abbreviation list is obtained, abbreviations are processed as follows:

  1. If the identifier is provided, look up the abbreviation in the identifier table. If a match is found, return that match. If no match exists, continue.
  2. Normalize the string to be abbreviated:
    • Convert to lower case
    • Remove articles (the, a, an, la, el, etc.)
    • Remove diacritics
    • Replace punctuation with spaces
    • Replace all white spaces with U+0020 (space)
  3. Find a string that exactly matches the normalized string in the full-title matching table (if such table is provided). If the string exists, return the mapped string as an abbreviation.
  4. If the normalized string consists of a single word, fail and indicate that an abbreviation could not be created.
  5. Split the normalized string and abbreviate each word individually by matching to the word-by-word abbreviation list (if such table is provided).
    1. Look up the exact word in the lookup table. If the exact string exists, replace it with the mapped value and continue to next word. (Note: DO NOT replace a word with a blank string if it is the first or the last word in the string)
    2. Look up a partial match in the lookup table starting from the longest possible partial match. Partial matches in the table terminate with a hyphen-minus (U+002D). If a match is found, replace with the mapped string and continue to next word.
      • If the resulting abbreviation is only shorter by 2 or fewer characters, do not abbreviate the word.
  6. If none of the words in the normalized string were replaced, fail and indicate that an abbreviation could not be created.
  7. If some of the words in the string were not replaced, replace the normalized words with the original form.
  8. Return the resulting string.
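
The steps above can be sketched roughly as follows. This is an illustration under several assumptions, not a reference implementation: the article set is only a sample (the list above ends in "etc."), the "2 or fewer characters" check is applied to partial matches only and ignores the trailing period, a word blanked by an empty mapping counts as replaced, and fallback and skip-word handling are omitted because they are still undecided. All function names, the identifier "0000-0000", and the sample entries are hypothetical.

```python
import re
import unicodedata

# Assumption: only a sample of the articles to remove.
ARTICLES = {"the", "a", "an", "la", "el"}


def tokenize(title):
    """Step 2: return (original, normalized) word pairs, with articles dropped."""
    pairs = []
    composed = unicodedata.normalize("NFC", title)
    # Splitting on anything that is not a letter or digit covers the
    # punctuation and whitespace rules in one pass.
    for original in re.findall(r"[^\W_]+", composed):
        decomposed = unicodedata.normalize("NFD", original.lower())
        norm = "".join(c for c in decomposed if not unicodedata.combining(c))
        if norm not in ARTICLES:
            pairs.append((original, norm))
    return pairs


def abbreviate_word(norm, words):
    """Step 5: exact word match first, then the longest partial match."""
    if norm in words:
        return words[norm]
    for end in range(len(norm), 0, -1):
        abbr = words.get(norm[:end] + "-")
        if abbr is not None:
            # Do not abbreviate if 2 or fewer characters would be saved.
            saved = len(norm) - len(abbr.rstrip("."))
            return abbr if saved > 2 else None
    return None


def abbreviate(title, abbr_list, identifier=None):
    """Return the abbreviated title, or None if no abbreviation can be made."""
    if identifier is not None:  # step 1
        hit = abbr_list.get("identifier", {}).get(identifier)
        if hit is not None:
            return hit
    pairs = tokenize(title)
    normalized = " ".join(norm for _, norm in pairs)
    hit = abbr_list.get("exact", {}).get(normalized)  # step 3
    if hit is not None:
        return hit
    if len(pairs) < 2:  # step 4
        return None
    words = abbr_list.get("words", {})
    out, replaced = [], False
    for i, (original, norm) in enumerate(pairs):
        abbr = abbreviate_word(norm, words)
        at_edge = i == 0 or i == len(pairs) - 1
        if abbr is None or (abbr == "" and at_edge):
            out.append(original)  # step 7: keep the original form
        else:
            replaced = True
            if abbr:
                out.append(abbr)
    return " ".join(out) if replaced else None  # steps 6 and 8


# Hypothetical list content for illustration.
SAMPLE_LIST = {
    "identifier": {"0000-0000": "Ex. J."},
    "exact": {"astrophysical journal": "ApJ"},
    "words": {"journal": "J.", "of": "", "biolog-": "Biol.", "chemistr-": "Chem."},
}

print(abbreviate("The Journal of Biological Chemistry", SAMPLE_LIST))
print(abbreviate("Journal of Zoology", SAMPLE_LIST))
print(abbreviate("Nature", SAMPLE_LIST))
```

Keeping the original and normalized form of each word side by side is what makes step 7 possible: a word with no match ("Zoology") is emitted in its original capitalization rather than in normalized lower case.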

Note

  • Add fallback handling once it has been decided how that will work
  • Skip word list?

CSL implementation

Abbreviation list declaration

Abbreviation lists can be declared in the "info" section using "link" tags. The "rel" attribute must be set to "abbreviations" and the "href" attribute must be a publicly reachable URL of the abbreviation list.
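
As an illustration, the sketch below embeds a small style with such a declaration and extracts the list URLs with Python's standard XML parser. The `rel="abbreviations"` value is the one proposed here and is not part of the current CSL schema; the style content, the example.org URLs, and the function name `abbreviation_list_urls` are hypothetical.

```python
import xml.etree.ElementTree as ET

CSL_NS = {"csl": "http://purl.org/net/xbiblio/csl"}

# Hypothetical style fragment using the proposed "abbreviations" link.
STYLE = """<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0" class="in-text">
  <info>
    <title>Example Style</title>
    <id>http://www.example.org/styles/example-style</id>
    <link href="https://example.org/abbreviations/example-list.json" rel="abbreviations"/>
    <updated>2013-10-08T00:00:00+00:00</updated>
  </info>
</style>
"""


def abbreviation_list_urls(style_xml):
    """Return the href of every link in <info> whose rel is "abbreviations"."""
    root = ET.fromstring(style_xml)
    links = root.findall("csl:info/csl:link[@rel='abbreviations']", CSL_NS)
    return [link.get("href") for link in links]


print(abbreviation_list_urls(STYLE))
```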

Short form variables

Abbreviated forms of standard variables are indicated by a "-short" suffix. When an abbreviated form is not available, the full form must be used.

Abbreviation lists

Location

Abbreviation lists officially supported by CSL will be hosted at https://github.com/citation-style-language/abbreviations (This should end up being a github.io page)

Format

There are several formats that could be considered for supplying the abbreviation lists. Note that, since abbreviation lists are handled by the reference manager, this does not imply anything about how abbreviation lists should be stored internally.

Simple delimited format

EndNote uses a plain, tab-delimited list for abbreviations. JabRef uses a list delimited by equals signs ("="). However, the need to distinguish abbreviation lists for different fields, to separate exact-match lists from word-match lists, to add metadata, etc. calls for a more flexible and better-structured format.

XML

XML could be an option, but it is not a very compact format. Considering that these lists can span well over 10,000 items, the XML format would add a lot of overhead.

JSON

JSON is a very popular format that is easily parsed and relatively compact. It is already used in Zotero and MLZ (and perhaps others).

Structure

This is largely based on the current list implementation in Zotero.

The whole file contains a single JSON object with the following properties:

info

This section contains the metadata describing the abbreviation list and must be the first object in the JSON output. The following properties are part of the info section:

  • URI: (string) a URI that resolves to the actual location of the abbreviation list and uniquely identifies it. It can be used for updating the list, and it is the value by which a style refers to the list.
  • name: (string) human-readable name of the abbreviation list.
  • lastModified: (string) a UTC datestamp indicating the last modification time of the list. This should be compared to the Last-Modified header returned by the server hosting the abbreviation list to determine whether the list needs to be updated.
  • disableUpdates: (boolean, optional, defaults to false) when true, the abbreviation list should not be updated automatically.
  • fallback: (string, optional) URI of the fallback abbreviation list that will be applied to all lists.
  • dependentLists: (string[], optional) an array of abbreviation list URIs that are used as fallback in the field-specific lists. This is required if such fallback lists are used, so that all dependent lists can be downloaded ahead of time.

None of these properties may contain a literal right curly brace "}" (U+007D). This makes it possible to parse the info section without reading the entire file: the first occurrence of a right curly brace marks the end of the "info" section.
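
A rough sketch of how a reader could take advantage of this constraint, assuming that "info" is the first property of the top-level object and that none of its values contains "}". The function name `read_info`, the chunk size, and the sample content are hypothetical.

```python
import io
import json


def read_info(stream, chunk_size=256):
    """Read only up to the first "}" and return the parsed info object."""
    buffer = ""
    while "}" not in buffer:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("no complete info section found")
        buffer += chunk
    end = buffer.index("}")
    # The first "{" opens the whole file; the second opens the info object.
    start = buffer.index("{", buffer.index("{") + 1)
    return json.loads(buffer[start:end + 1])


# Hypothetical file content; a real list would be far longer.
sample = io.StringIO(
    '{"info": {"URI": "https://example.org/abbreviations/example-list.json", '
    '"name": "Example list", "lastModified": "2013-10-08T00:00:00Z"}, '
    '"lists": {"default": {"words": {"journal": "J."}}}}'
)
info = read_info(sample, chunk_size=32)
print(info["name"])
```

Only the bytes up to the end of the info object are read, which matters when a list holds tens of thousands of entries and the caller only wants to check its metadata.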

lists

The list section contains properties corresponding to the fields that the abbreviation lists can be applied to. At least one list must be defined. The possible properties are:

  • default: (optional) default list to be used if the field-specific list is not defined.
  • Any of the following (optional): container-title, series-title, author (?), publisher (add more)

Each of these properties contains one of the following:

  • a boolean false, which indicates that the field should not be abbreviated (e.g. exclude from using default? would this ever be needed?)
  • a string indicating the name of one of the other fields that has been defined with a proper list
  • a list (object) of abbreviations
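
The three possible values could be resolved along these lines. This is a sketch under the assumption that an undefined field uses "default" and that a `false` value suppresses abbreviation entirely (whether `false` is needed at all is still an open question above). The function name `resolve_field_list` and the sample data are hypothetical.

```python
def resolve_field_list(lists, field):
    """Return the abbreviation list (dict) to use for `field`, or None."""
    seen = set()
    # A field without its own entry uses the default list, if there is one.
    name = field if field in lists else "default"
    while name not in seen:
        seen.add(name)
        value = lists.get(name)
        if value is False or value is None:
            return None  # explicitly disabled, or nothing defined
        if isinstance(value, str):
            name = value  # alias: follow the reference to another field
            continue
        return value
    return None  # aliases that point at each other never resolve


# Hypothetical "lists" section.
default_list = {"words": {"journal": "J."}}
lists = {
    "default": default_list,
    "container-title": "default",
    "publisher": False,
}

print(resolve_field_list(lists, "container-title"))
print(resolve_field_list(lists, "publisher"))
```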

abbreviation list

The abbreviation list is subdivided into "identifier", "exact", and "words". The list may also contain a "fallback" property with a URI linking to another abbreviation list. The list must contain at least one of these properties.
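
To make the overall shape concrete, here is a hypothetical file embedded in a short Python check. The URIs, the identifier "0000-0000", and all entries are made up for illustration, and `check_structure` covers only the constraints stated in this section (info first, at least one list, at least one of the four properties per abbreviation list).

```python
import json

LIST_KEYS = {"identifier", "exact", "words", "fallback"}

# Hypothetical abbreviation list file.
SAMPLE = """
{
  "info": {
    "URI": "https://example.org/abbreviations/example-list.json",
    "name": "Example list",
    "lastModified": "2013-10-08T00:00:00Z",
    "dependentLists": ["https://example.org/abbreviations/general.json"]
  },
  "lists": {
    "default": {
      "words": {"journal": "J.", "biolog-": "Biol.", "of": ""}
    },
    "container-title": {
      "identifier": {"0000-0000": "Ex. J."},
      "exact": {"astrophysical journal": "ApJ"},
      "fallback": "https://example.org/abbreviations/general.json"
    },
    "series-title": "default",
    "publisher": false
  }
}
"""


def check_structure(doc):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if next(iter(doc), None) != "info":
        problems.append("info must be the first property")
    lists = doc.get("lists", {})
    if not lists:
        problems.append("at least one list must be defined")
    for field, value in lists.items():
        if value is False:
            continue  # field is explicitly not abbreviated
        if isinstance(value, str):
            if not isinstance(lists.get(value), dict):
                problems.append(f"{field}: alias does not point to a proper list")
        elif isinstance(value, dict):
            if not LIST_KEYS.intersection(value):
                problems.append(f"{field}: needs one of {sorted(LIST_KEYS)}")
        else:
            problems.append(f"{field}: unsupported value")
    return problems


print(check_structure(json.loads(SAMPLE)))
```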

"identifier" list contains a list of identifiers (e.g. ISSN or ISBN depending on the type of reference being cited) mapping to the corresponding abbreviation in the same format as "exact" list.

"exact" list must contain a list of normalized strings that must exactly match the normalized string that is being abbreviated. The value for each match must be the properly formatted string (i.e. with correct capitalization). The value should also contain all terminating periods for the abbreviated words, because these will be stripped off (if necessary) by the CSL processor.

"words" list must contain either the exact-matching normalized words or partial matches indicated by a terminating hyphen-minus (U+002D) character. The value of these properties must be the properly formatted abbreviated string, including proper capitalization and terminating periods. The value may also be an empty string, which will cause the matching word to be skipped in the final abbreviation.

Delivery

The list may be hosted online for automatic updating at the URL indicated in the "info" section.

The server hosting the list should respond to HEAD and GET requests and return (at the very least) the Last-Modified HTTP header indicating the last modification time of the list. This is used for updating purposes.
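
A minimal sketch of the update check using only the Python standard library. The format of `lastModified` is not fixed above; ISO 8601 is assumed here. The function names are hypothetical, and the network call is shown but not exercised in the example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def needs_update(local_last_modified, header_value):
    """Compare the stored lastModified value with a Last-Modified header."""
    # Assumption: lastModified is an ISO 8601 string, e.g. 2013-10-08T00:00:00Z.
    local = datetime.fromisoformat(local_last_modified.replace("Z", "+00:00"))
    if local.tzinfo is None:
        local = local.replace(tzinfo=timezone.utc)
    remote = parsedate_to_datetime(header_value)
    if remote.tzinfo is None:
        remote = remote.replace(tzinfo=timezone.utc)
    return remote > local


def remote_last_modified(url):
    """Issue a HEAD request and return the Last-Modified header, if any."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get("Last-Modified")


def list_is_stale(info):
    """Decide whether the list described by `info` should be downloaded again."""
    if info.get("disableUpdates", False):
        return False
    header = remote_last_modified(info["URI"])
    return header is not None and needs_update(info["lastModified"], header)


print(needs_update("2013-10-08T00:00:00Z", "Tue, 08 Oct 2013 12:00:00 GMT"))
```

Using HEAD first means the full list, which may be large, is fetched with GET only when the header indicates a newer version.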
