aurimasv/csl-abbreviation-proposal.md

## csl-abbreviation-proposal.md

      
Background

Abbreviations (mostly for journal titles, but other fields need to be
considered as well) are required by a number of citation styles and
the rules or vocabularies for these abbreviations can vary from style
to style. Thus, in order to provide correct citations, the Citation
Style Language (CSL) should include a way to indicate which
abbreviation format should be used per style.

Styles requiring abbreviations

The following lists some citation styles that have specific rules for
abbreviating various fields:

ICMJE/Vancouver (from Citing Medicine 2nd ed.)

Abbreviate titles using MEDLINE abbreviation list

- Abbreviate and capitalize significant words in a journal
  title and omit other words, such as articles, conjunctions,
  and prepositions. For example: of, the, at, in, and, L'
- Do not abbreviate journal titles that consist of a single
  word or titles written in a character-based language such as
  Chinese, Japanese, and Korean
- Do not include journal subtitles as part of the abbreviated
  title
- Omit any punctuation in a title
- Ignore diacritics, accents, and special characters in titles.
  This rule ignores some conventions used in non-English
  languages to simplify rules for English-language publications


ACS

The journal name is an essential component of a periodical
reference citation. Abbreviate the name according to the Chemical
Abstracts Service Source Index (CASSI), and italicize it. One-word
journal names are not abbreviated (e.g., Biochemistry, Macromolecules,
Nature, Science). No punctuation is added to end this field; thus, a
period will be there with an abbreviation but not with a spelled-out
word.


CMoS

Titles of journals are italicized and capitalized headline-
style. They are usually given in full--except for the omission of an
initial The--in notes and bibliographies (e.g., Journal of Business).
With foreign-language journals and magazines, an initial article should
be retained (e.g., Der Spiegel).


Annual Reviews

"Abbreviate titles of journals, proceedings, symposia, and
serial compendia (such as the Annual Review volumes) according to the
ISSN List of Title Word Abbreviations (LTWA), published by the
International Organization for Standardization".


Elsevier

Elsevier Health guidelines for authors
(http://www.us.elsevierhealth.com/media/us/files/us/manuscript_guidelines_for_authors.pdf)

MEDLINE


Society of Biblical Literature

Abbreviate titles of standard works in footnotes, but cite the
complete titles in the bibliography. The SBL Handbook of Style offers
two extensive lists of abbreviations for journals, series, and other
standard reference works. The first abbreviation list is alphabetized
by the source (SBLHS 8.4.1) and the second by the abbreviation (SBLHS
8.4.2). If the work you are citing is in these lists, use the standard
abbreviation listed.


Abbreviation lists


LTWA

- The List includes 55,650 words in about 70 languages.
- "The databases appearing on or accessible from the website "the
  ISSN International Centre" are the exclusive property of CIEPS and are
  protected under the provisions of the law of 1st July 1998
  implementing in the Intellectual Property Code the European Directive
  of 11 March 1996 on the legal protection of databases. Any
  performance, whether total or partial, of this site by any company
  whatsoever, without the express authorization of the CIEPS is strictly
  forbidden and shall constitute an infringement sanctioned such as
  Intellectual Property Code."
- The words are abbreviated in accordance with the ISO 4 standard.


MEDLINE

"The List of Journals Indexed for MEDLINE publication ceased
with the 2008 edition. The NLM Catalog can be used to obtain a list of
currently indexed MEDLINE titles..."
As of March 1, 2007, NLM establishes title abbreviations based
on the form used by the ISSN Centre as their abbreviated key title,
whenever this is available, editing only for format as described below:

- Each word in the title abbreviation is capitalized.
- All punctuation is removed, except for parentheses used when
  a qualifier is supplied.
- All diacritics are removed.
- Qualifying elements which refer to format, such as (Print)
  or (Online) are omitted.
- One word titles are never abbreviated.
- At least two letters must be dropped from a word before it
  is abbreviated. Words from which only a single letter would be
  dropped are not abbreviated.


http://www.nlm.nih.gov/pubs/factsheets/constructitle.html
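
The formatting edits listed above can be sketched in a few lines. This is only an illustration: the function name `medline_format` and the sample inputs are assumptions, and the sketch covers the formatting rules only (capitalization, punctuation, diacritics, format qualifiers), not the decision of whether a word should be abbreviated in the first place.

```python
import re
import unicodedata

# Format qualifiers that the rules above say to omit.
FORMAT_QUALIFIERS = re.compile(r"\s*\((?:Print|Online)\)", re.IGNORECASE)


def medline_format(key_title: str) -> str:
    """Apply the formatting edits to an abbreviated key title (hypothetical helper)."""
    # Omit qualifiers that only describe the format.
    text = FORMAT_QUALIFIERS.sub("", key_title)
    # Remove diacritics: decompose, then drop the combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove punctuation, keeping parentheses for any remaining qualifier.
    text = re.sub(r"[^\w\s()]", "", text)
    # Capitalize the first letter of each word and leave the rest untouched.
    return " ".join(word[:1].upper() + word[1:] for word in text.split())
```

For example, `medline_format("J. biol. chem. (Print)")` gives `J Biol Chem`. A qualifier that does not describe the format, such as a place name in parentheses, is kept.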


CASSI

"You are prohibited from using automated programs for systematic
retrieval of CASSI content to create or compile, directly or
indirectly, a collection, compilation, database, or directory. An
example of automated retrieval is a script written to extract and
download CASSI data in batches."
Could reference managers still have users "manually" fill in the
list of their journals by clicking a button? Maybe for each journal?


ISI (Web of Science) Journal Title Abbreviations

http://images.webofknowledge.com/WOK46/help/WOS/A_abrvjt.html


SBLHS 8.4.1

Probably copyrighted
(http://www.sbl-site.org/publications/publishingwithsbl.aspx)


Requirements for abbreviation system

CSL language

Abbreviation list/rule must be identified for each style

It seems that the abbreviation rules can be applied globally to the
whole style. That is, it does not appear necessary to be able to
define separate abbreviation rules for in-text citations and
bibliography. Different fields may have different abbreviation
vocabularies, but these can/should be handled within the same
abbreviation list.

Because there is currently no indication as to which abbreviation list
must be used for each style, a default list should be defined, so that
most styles do not require updating. The LTWA list appears to be the
most complete and most commonly used (and has been used to generate
MEDLINE abbreviations, for instance), so I suggest that this become the
default list.

Abbreviation lists should have a defined location

One must be able to obtain the abbreviation lists. An absolute URL
could serve to both identify the list uniquely and supply its
location. Alternatively, the CSL specification could define a URL
prefix that could be prepended to the name of the list, which would
then form the complete URL. However, that would mean that lists could
only be hosted on CSL servers (i.e. no custom lists for custom styles).

Availability of abbreviated form should be testable

Some styles allow (but do not require) the use of hereinafter
abbreviations for institutional authors (and probably other fields).
In this case, the first citation should be given with the full name
followed by an abbreviation in parentheses. Subsequent citations
would list the abbreviated form only. Thus, in order to
determine if the abbreviated form should be used and included next to
the full form on first citation, it should be possible to test whether
an abbreviation exists or can be formulated via the abbreviation list.
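
A minimal sketch of how a caller might use such a test is shown below. The function `cite_institution` and its arguments are hypothetical; the abbreviation is assumed to be `None` when the list cannot supply one.

```python
from typing import Optional


def cite_institution(name: str, abbreviation: Optional[str], seen: set) -> str:
    """Render an institutional author, introducing its abbreviation on first use."""
    if abbreviation is None:
        # No abbreviation exists or can be formulated: always use the full name.
        return name
    if name in seen:
        # Subsequent citation: abbreviated form only.
        return abbreviation
    # First citation: full name followed by the abbreviation in parentheses.
    seen.add(name)
    return f"{name} ({abbreviation})"
```

The `seen` set stands in for whatever per-document state the processor keeps about earlier citations.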

CSL processors

Abbreviations should fall back to non-abbreviated title

If an abbreviation cannot be formulated using the abbreviation list
(or if the abbreviated title would be longer than the unabbreviated
form?), the processor should fall back to the unabbreviated title.

List definitions

Support abbreviations for multiple fields

Some styles (particularly legal styles) require that abbreviations be
used for multiple fields, not just journal titles. These include
series titles (CMoS, Society of Biblical Literature), institutional
authors (these may not be required, though), publishers (MLA), courts,
and reporters.

Support mapping abbreviations for exact titles

Journals that are specific to a certain field may use special
abbreviations for certain journals (e.g.
[Astrophysical Journal](https://forums.zotero.org/discussion/8278/text-substitution/)).
Some journals may just conventionally have abbreviations that do not
follow rules.

Support word-by-word abbreviations

In most cases, abbreviations can be created on the fly by abbreviating
each word in the string.

Support partial word matches

Word-by-word abbreviations can be done more reasonably by allowing
partial matches. It seems that the matches are always anchored from
the beginning of the string. I don't think we need to consider
matching on any other part of the string.
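
A partial match anchored at the beginning of a word amounts to a longest-prefix lookup. The sketch below assumes that partial-match keys end in a hyphen-minus, as described later in this proposal; the function name and the table contents are made up for illustration.

```python
from typing import Optional


def longest_prefix_match(word: str, table: dict) -> Optional[str]:
    """Return the value of the longest "prefix-" key that matches the start of `word`."""
    # Try the longest candidate prefix first, then shorten it one character at a time.
    for end in range(len(word), 0, -1):
        key = word[:end] + "-"
        if key in table:
            return table[key]
    return None


# Hypothetical table: both "bio-" and "biolog-" match "biological".
TABLE = {"bio-": "Bio.", "biolog-": "Biol.", "chem-": "Chem."}
```

With this table, `longest_prefix_match("biological", TABLE)` picks `Biol.` rather than `Bio.`, because the longer key wins.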

Support fallback lists

For compactness and de-duplication reasons, the list should be able to
supply a fallback list that would be queried if no abbreviation can be
created using the current list. The fallbacks could be chained.

Note

If a partial abbreviation can be constructed using the current list
in a word-by-word fashion, should the remainder of the words be
abbreviated using the fallback list? This may be a reasonable way to
allow overriding of only a subset of words used for abbreviations.
What is the order of the fallback? Do you fall back for exact matches
first, then restart from the top for word-by-word abbreviations?
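
Since the fallback order is still an open question, the following is only one possible reading: follow the chain for exact matches, one list at a time. The `registry` dictionary (URI to loaded list), the function name, and the example URIs are assumptions made for this sketch.

```python
from typing import Optional


def lookup_with_fallback(key: str, uri: Optional[str], registry: dict) -> Optional[str]:
    """Look up an exact match, following "fallback" URIs until one list answers."""
    visited = set()
    while uri is not None and uri not in visited:
        visited.add(uri)  # guards against fallback chains that loop
        current = registry.get(uri)
        if current is None:
            break  # the fallback list has not been downloaded
        value = current.get("exact", {}).get(key)
        if value is not None:
            return value
        uri = current.get("fallback")
    return None


# Hypothetical lists: a small override list that falls back to a general one.
REGISTRY = {
    "https://example.org/astro.json": {
        "exact": {"astrophysical journal": "ApJ"},
        "fallback": "https://example.org/general.json",
    },
    "https://example.org/general.json": {
        "exact": {"journal of examples": "J. Ex."},
    },
}
```

Word-by-word fallback would need a decision on the questions above before it could be sketched in the same way.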

Support skip-words

Some words in the titles should be skipped when performing
word-by-word abbreviations (e.g. articles, prepositions, etc.), but it
would seem that not all of them should be skipped at the beginning or
end of the title. Some styles state that only non-significant words
should be skipped. Is there a way for us to determine programmatically
which words are non-significant?

Abbreviation processors

Abbreviations should be processed by the reference manager

This is more of a suggestion than a requirement. Obviously, if the CSL
processor decides to implement this, no one would be upset.

There are several reasons why the CSL processor should not be expected
to handle the task of abbreviating fields.

- CSL processors often run in an isolated environment and do not
  have access to disk or network I/O. The lists could be passed to the
  processors, however.
- Reference managers may want to provide a way for users to override
  certain abbreviations (be it by exact match or word-by-word
  abbreviations). Thus, they are best suited to control how these
  overrides take place. Otherwise, they would be forced to edit existing
  lists passed to the CSL processor.
- While this would place some burden on the reference managers (which
  are more numerous than CSL processors), the abbreviation
  algorithms are fairly straightforward and should not be a problem to
  implement.
- Use of abbreviations may not be limited to citations. There is also
  a demand for automatic abbreviations in metadata exports (e.g.
  BibTeX), so it is in the best interest of the reference managers to
  implement abbreviation algorithms for internal use as well as for CSL
  processors.

Rules for processing abbreviations

Abbreviation list to be used is determined by the list name indicated
in the citation style and the field that needs to be abbreviated. Once
the abbreviation list is obtained, the abbreviations are processed as
follows:

1. If the identifier is provided, look up the abbreviation in the
   identifier table. If a match is found, return that match. If no match
   exists, continue.
2. Normalize the string to be abbreviated:
   - Convert to lower case
   - Remove articles (the, a, an, la, el, etc.)
   - Remove diacritics
   - Replace punctuation with spaces
   - Replace all white spaces with U+0020 (space)
3. Find a string that exactly matches the normalized string in the
   full-title matching table (if such a table is provided). If the string
   exists, return the mapped string as an abbreviation.
4. If the normalized string consists of a single word, fail and
   indicate that an abbreviation could not be created.
5. Split the normalized string and abbreviate each word individually
   by matching to the word-by-word abbreviation list (if such a table is
   provided).
   1. Look up the exact word in the lookup table. If the exact string
      exists, replace it with the mapped value and continue to the next
      word. (Note: DO NOT replace a word with a blank string if it is the
      first or the last word in the string.)
   2. Look up a partial match in the lookup table starting from the
      longest possible partial match. Partial matches in the table
      terminate with a hyphen-minus (U+002D). If a match is found, replace
      with the mapped string and continue to the next word.
      - If the resulting abbreviation is only shorter by 2 or
        fewer characters, do not abbreviate the word.
6. If none of the words in the normalized string were replaced, fail
   and indicate that an abbreviation could not be created.
7. If some of the words in the string were not replaced, replace the
   normalized words with the original form.
8. Return the resulting string.
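
The rules above can be sketched end to end as follows. This is a rough illustration under several assumptions, not a reference implementation: the article set is a small sample, lengths are compared with the trailing period of the abbreviation ignored, a skipped word (empty value) does not count as a replacement, and the sample list with its identifier is invented.

```python
import re
import unicodedata
from typing import Optional

ARTICLES = {"the", "a", "an", "la", "el"}  # illustrative sample, not exhaustive


def _fold(word: str) -> str:
    """Lower-case a word and strip its diacritics."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(title: str) -> list:
    """Return (original, normalized) word pairs; punctuation and articles are dropped."""
    words = re.findall(r"[^\W_]+", unicodedata.normalize("NFC", title))
    pairs = [(word, _fold(word)) for word in words]
    return [(orig, norm) for orig, norm in pairs if norm not in ARTICLES]


def abbreviate_word(norm: str, words: dict, is_edge: bool) -> Optional[str]:
    """Abbreviate one normalized word, or return None to keep the original."""
    value = words.get(norm)  # exact word match first
    if value is None:
        for end in range(len(norm), 0, -1):  # longest partial match first
            value = words.get(norm[:end] + "-")
            if value is not None:
                break
        # Assumption: the trailing period is ignored when comparing lengths.
        if value is not None and len(norm) - len(value.rstrip(".")) <= 2:
            return None
    if value == "" and is_edge:
        return None  # never blank out the first or last word
    return value


def abbreviate(title: str, abbr_list: dict, identifier: Optional[str] = None) -> Optional[str]:
    """Return the abbreviation, or None if one could not be created."""
    # Identifier lookup (for example by ISSN), when an identifier is given.
    if identifier is not None and identifier in abbr_list.get("identifier", {}):
        return abbr_list["identifier"][identifier]
    pairs = normalize(title)
    # Exact match on the whole normalized string.
    key = " ".join(norm for _, norm in pairs)
    if key in abbr_list.get("exact", {}):
        return abbr_list["exact"][key]
    # Single-word strings are not abbreviated.
    if len(pairs) < 2:
        return None
    # Word-by-word abbreviation.
    words = abbr_list.get("words", {})
    out, replaced = [], False
    for i, (orig, norm) in enumerate(pairs):
        value = abbreviate_word(norm, words, is_edge=i in (0, len(pairs) - 1))
        if value is None:
            out.append(orig)  # not replaced: restore the original form
        elif value:
            out.append(value)
            replaced = True
        # An empty value means the word is skipped.
    return " ".join(out) if replaced else None


# Hypothetical list; the identifier is a placeholder, not a real ISSN.
SAMPLE = {
    "identifier": {"1234-5678": "J. Ex. Stud."},
    "exact": {"astrophysical journal": "ApJ"},
    "words": {"journal": "J.", "of": "", "biolog-": "Biol.", "chem-": "Chem."},
}
```

With this sample list, `abbreviate("The Journal of Biological Chemistry", SAMPLE)` yields `J. Biol. Chem.`, while `abbreviate("Nature", SAMPLE)` yields `None`, so the caller falls back to the full title.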

Note


Add fallback handling once it is decided how that will work
Skip-word list?

CSL implementation

Abbreviation list declaration

Abbreviation lists can be declared in the "info" section using "link"
tags. The "rel" attribute must be set to "abbreviations" and the
"href" attribute must be a publicly reachable URL of the abbreviation
list.
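
As a sketch of how a reference manager might read such a declaration, the snippet below extracts the URLs with Python's standard `xml.etree.ElementTree`. The style fragment and the `example.org` URLs are invented, and `rel="abbreviations"` is the value proposed here rather than something existing styles already carry.

```python
import xml.etree.ElementTree as ET

CSL_NS = {"csl": "http://purl.org/net/xbiblio/csl"}


def abbreviation_list_urls(style_xml: str) -> list:
    """Return the href of every <link rel="abbreviations"> in the style's <info>."""
    root = ET.fromstring(style_xml)
    links = root.findall("csl:info/csl:link[@rel='abbreviations']", CSL_NS)
    return [link.get("href") for link in links if link.get("href")]


# Hypothetical style fragment used only for this example.
STYLE = """<style xmlns="http://purl.org/net/xbiblio/csl" version="1.0">
  <info>
    <title>Example Style</title>
    <link href="https://example.org/styles/example" rel="self"/>
    <link href="https://example.org/abbreviations/ltwa.json" rel="abbreviations"/>
  </info>
</style>"""
```

A style without such a link returns an empty list, which is where the default list discussed earlier would come in.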

Short form variables

Abbreviated forms of standard variables are indicated by a "-short"
suffix. When an abbreviated form is not available, the full form
must be used.
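
The fallback from the short form to the full form could look like the sketch below. The helper name `short_or_full` and the item dictionary are assumptions; only the "-short" suffix comes from the text above.

```python
def short_or_full(item: dict, variable: str) -> str:
    """Return the "-short" form of a variable, or the full form if none is available."""
    short = item.get(variable + "-short")
    return short if short else item.get(variable, "")


# Hypothetical item data.
ITEM = {
    "container-title": "Journal of Biological Chemistry",
    "container-title-short": "J. Biol. Chem.",
    "publisher": "Example Press",
}
```

Here `short_or_full(ITEM, "container-title")` returns the short form, while `short_or_full(ITEM, "publisher")` returns the full name because no short form is present.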

Abbreviation lists

Location

Abbreviation lists officially supported by CSL will be hosted at
https://github.com/citation-style-language/abbreviations (this should
end up being a github.io page).

Format

There are several formats that could be considered for supplying the
abbreviation lists. Note that, since abbreviation lists are handled by
the reference manager, this does not imply anything about how
abbreviation lists should be stored internally.

Simple delimited format

EndNote uses a plain, tab-delimited list for abbreviations. JabRef
uses an equals sign ("=") delimited list. However, the need to
distinguish abbreviation lists for different fields, to separate exact
match vs. word match lists, to add metadata, etc. calls for a more
flexible and better-structured format.

XML

XML could be an option; however, it is not a very compact format.
Considering that these lists can span well over 10,000 items, the XML
format adds a lot of overhead.

JSON

JSON is a very popular format that is easily parsed and relatively
compact. It is already used in Zotero and MLZ (and perhaps others).

Structure

This is largely based on the current list implementation in Zotero.

The whole file contains a single JSON object with the following
properties:

info

This section contains the metadata describing the abbreviation list
and must be the first property in the JSON output. The following
properties are part of the info section:

- URI: (string) a URI that resolves to the actual location of the
  abbreviation list and uniquely identifies the list. This can be used
  for updating the list. The URI is also used to identify the
  abbreviation list in the style.
- name: (string) human-readable name of the abbreviation list.
- lastModified: (string) a UTC datestamp indicating the last
  modification time for the list. This should be compared to the
  Last-Modified header returned by the server hosting the abbreviation
  lists to determine if the list needs to be updated.
- disableUpdates: (boolean, optional, defaults to false) a boolean
  value indicating whether automatic updating of the abbreviation list
  should be disabled.
- fallback: (string, optional) URI of the fallback abbreviation list
  that will be applied to all lists.
- dependentLists: (string[], optional) an array of abbreviation list
  URIs that are used as fallbacks in the field-specific lists. This is
  required if such fallback lists are used, so that all dependent lists
  can be downloaded ahead of time.

None of these properties may contain a literal right curly brace "}"
(U+007D). This assists with parsing out the info section from the file
without having to read the entire file: the first occurrence of the
right curly brace indicates the end of the "info" section.
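
A reader that stops at the first right curly brace might be sketched like this. It assumes that "info" is the first property of the file and contains no nested objects; the function name `read_info` and the sample content are made up.

```python
import io
import json


def read_info(stream, chunk_size: int = 256) -> dict:
    """Read only as much of `stream` as is needed to parse the "info" object."""
    buffer = ""
    while "}" not in buffer:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise ValueError("no closing brace found for the info section")
        buffer += chunk
    head = buffer[: buffer.index("}") + 1]
    # The first "{" opens the whole file; the second one opens the info object.
    start = head.index("{", head.index("{") + 1)
    return json.loads(head[start:])


# Hypothetical file content.
SAMPLE_FILE = (
    '{"info": {"URI": "https://example.org/list.json", "name": "Example list", '
    '"dependentLists": ["https://example.org/base.json"]}, '
    '"lists": {"default": {"words": {"journal": "J."}}}}'
)
```

Calling `read_info(io.StringIO(SAMPLE_FILE))` returns just the metadata, without parsing the potentially very large "lists" section.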

lists

The lists section contains properties corresponding to the fields that
the abbreviation lists can be applied to. At least one list must be
defined. The possible properties are:

- default: (optional) default list to be used if the field-specific
  list is not defined.
- Any of the following (optional): container-title, series-title,
  author (?), publisher (add more)

Each of these properties contains one of the following:

- a boolean false, which indicates that the field should not be
  abbreviated (e.g. exclude from using default? would this ever be
  needed?)
- a string indicating the name of one of the other fields that has
  been defined with a proper list
- a list (object) of abbreviations
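
Resolving which list applies to a field, given the three possible value types, could be sketched as follows. The function name and the sample "lists" object are assumptions; a return value of `None` means the field should not be abbreviated.

```python
from typing import Optional


def resolve_field_list(lists: dict, field: str) -> Optional[dict]:
    """Return the abbreviation list (object) that applies to `field`, if any."""
    # Use the field-specific entry if present, otherwise the default list.
    entry = lists.get(field, lists.get("default"))
    seen = set()
    # A string value names another field whose list should be used instead.
    while isinstance(entry, str) and entry not in seen:
        seen.add(entry)  # guards against aliases that point at each other
        entry = lists.get(entry)
    # A boolean false (or a missing list) means "do not abbreviate".
    return entry if isinstance(entry, dict) else None


# Hypothetical "lists" section.
LISTS = {
    "default": {"words": {"journal": "J."}},
    "container-title": {"exact": {"astrophysical journal": "ApJ"}},
    "series-title": "container-title",
    "publisher": False,
}
```

In this sample, "series-title" reuses the "container-title" list, "publisher" is never abbreviated, and any other field gets the default list.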

abbreviation list

The abbreviation list is subdivided into "identifier", "exact", and
"words". The list may also contain a "fallback" property with a URI
linking to another abbreviation list. The list must contain at least
one of these properties.

The "identifier" list contains identifiers (e.g. ISSN or ISBN,
depending on the type of reference being cited) mapped to the
corresponding abbreviation in the same format as the "exact" list.

The "exact" list must contain normalized strings that must exactly
match the normalized string that is being abbreviated. The value for
each match must be the properly formatted string (i.e. with correct
capitalization). The value should also contain all terminating periods
for the abbreviated words, because these will be stripped off (if
necessary) by the CSL processor.

The "words" list must contain either the exact-matching normalized
words or partial matches indicated by a terminating hyphen-minus
(U+002D) character. The value of these properties must be the properly
formatted abbreviated string, including proper capitalization and
terminating periods. The value may also be an empty string, which will
cause the matching word to be skipped in the final abbreviation.
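
To make the shape concrete, here is a hypothetical abbreviation list written as a Python dictionary and serialized to JSON, together with a check for the "at least one property" rule. All entries, including the placeholder identifier and the fallback URI, are invented for illustration.

```python
import json

# Hypothetical abbreviation list for one field.
EXAMPLE_LIST = {
    "identifier": {"1234-5678": "J. Ex. Stud."},       # placeholder identifier
    "exact": {"astrophysical journal": "ApJ"},         # normalized title -> formatted value
    "words": {
        "journal": "J.",      # exact word match
        "of": "",             # empty string: skip this word
        "stud-": "Stud.",     # partial match, marked by the trailing hyphen-minus
    },
    "fallback": "https://example.org/abbreviations/base.json",
}


def has_required_property(abbr_list: dict) -> bool:
    """Check that at least one of the four properties is present."""
    return any(key in abbr_list for key in ("identifier", "exact", "words", "fallback"))


EXAMPLE_JSON = json.dumps(EXAMPLE_LIST, indent=2)
```

Keys are stored in normalized form (lower case, no diacritics), while values carry the capitalization and periods that should appear in the output.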

Delivery

The list may be hosted online for automatic updating at the URL
indicated in the "info" section.
The server hosting the list should respond to HEAD and GET requests
and return (at the very least) the Last-Modified HTTP header
indicating the last modification time of the list. This is used for
updating purposes.
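
An update check along these lines might be sketched as below. The proposal does not fix the format of "lastModified", so the sketch assumes an ISO 8601 string; the function names are hypothetical, and `fetch_last_modified` is shown only to illustrate the HEAD request.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def fetch_last_modified(url: str) -> str:
    """Send a HEAD request and return the raw Last-Modified header value."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers["Last-Modified"]


def needs_update(local_last_modified: str, header_value: str) -> bool:
    """Compare the stored lastModified value with the server's Last-Modified header."""
    # Assumption: lastModified is an ISO 8601 string, treated as UTC if no offset is given.
    local = datetime.fromisoformat(local_last_modified)
    if local.tzinfo is None:
        local = local.replace(tzinfo=timezone.utc)
    remote = parsedate_to_datetime(header_value)
    if remote.tzinfo is None:
        remote = remote.replace(tzinfo=timezone.utc)
    return remote > local
```

A client would skip this check entirely when the list's "disableUpdates" property is true.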