@bnewbold
Created November 2, 2021 00:12

This is a list of raw "unstructured" citations extracted from the 2021-01 Crossref public metadata dump, which had issues when parsed with GROBID 0.7.

To clarify, these citation strings come directly from publishers, and were not extracted from PDFs. This subset of citations is biased in that neither the publisher nor Crossref made a successful match to an existing paper with a DOI.

The format of the list is the citation string followed by notes on the issue. Almost all of these were manually re-verified using a recent version of GROBID against the parseCitation API endpoint, via the web interface.
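For re-checking strings outside the web interface, a minimal sketch of the equivalent HTTP request follows. It rests on assumptions not stated above: a GROBID server running locally on port 8070, an endpoint path of `/api/processCitation`, and a form field named `citations`. These match my understanding of the GROBID REST service, but should be verified against the installed version. The helper name `build_citation_request` is hypothetical. The request is only built here, not sent.

```python
from urllib import parse, request

# Assumption: local GROBID server on its usual port; adjust as needed.
GROBID_URL = "http://localhost:8070/api/processCitation"


def build_citation_request(raw_citation, base_url=GROBID_URL):
    """Build (but do not send) a form-encoded POST for one raw citation string."""
    # Assumption: the service reads the string from a "citations" form field.
    form = parse.urlencode({"citations": raw_citation})
    return request.Request(
        base_url,
        data=form.encode("utf-8"),
        headers={"Accept": "application/xml"},
        method="POST",
    )


req = build_citation_request(
    "Edmund M. Clarke, Orna Grumberg, and Doron Peled. 1999. Model Checking. MIT Press."
)

# Sending requires a running server, so it is left commented out:
#   with request.urlopen(req) as resp:
#       tei_xml = resp.read().decode("utf-8")
```

The response, if the assumptions hold, would be a TEI XML fragment that can be compared against the notes for each citation below.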

Russell, B. (1906). On some difficulties in the theory of transfinite numbers and order types. Proceedings of London Mathematical Society, 4, 29–53.

"Proceedings of London Mathematical Society" should be tagged as the journal/monograph title, not as the publisher.

OpenAI. 2017. Multi Agent Particle Environment. https://github.com/openai/multiagent-particle-envs

"OpenAI" should be an orgName, not a persName? Not sure if GROBID supports this.

Castillo, R. E., Castro, P. J. M., Cayabyab, G. T., and Rachel Aton, M. 2018. "Blocksight: A mobile image encryption using advanced encryption standard and least significant bit algorithm," in ACM International Conference Proceeding Series, 2018.

Last author is split in two; should be "M. Rachel Aton". A bit tricky because "Rachel" is a common given name, but here is a surname.

Edmund M. Clarke, Orna Grumberg, and Doron Peled. 1999. Model Checking. MIT Press.

Authors / book title mixed together.

Mitchell C, editor. Clinical Gynecologic Endocrinology and Infertility. Baltimore: Williams and Wilkins, 1994:651–666.

GROBID outputs editor as a simple string, not a persName tag.

Centers for Disease Control and Prevention. HIV/AIDS surveillance report, vol. 8. Atlanta (GA): US Department Of Health and Human Services, Public Health Service, Centers for Disease Control and Prevention; 1996.

Centers for Disease Control and Prevention. Update: trends in AIDS incidence—United States, 1996. Morb Mortal Wkly Rep 1997;46:661–7.

HIV Medicine Association. Available at: http://www.idsociety.org/HIV/toc.htm. Accessibility verified, 1 December 2002.

JAMA HIV Resource Center. http://www.ama-assn.org/special/hiv. Accessibility verified, 1 December 2002.

None of these were parsed well.

Hughes E, Ferdorkow D, Collins J, Vandekerckhove P. Ovulation subpression vs placebo in the treatment of endometriosis. In: Lilford R, Hughes E, Vandekerckhove P, editors. Subfertility module of the cochrane database of systematic reviews. BMJ Publishing Group, 1996.

Editor as a single string, not separate persName for each.

The containing work ("Subfertility module of the cochrane database of systematic reviews") ends up in 'note' tag, not as some form of title.

Manns MP. Autoimmune hepatitis. In: Schiff ER, Sorrell MF, Maddrey WC, editors. Schiff’s Diseases of the Liver. 8th ed. Lippincott-Raven Publishers, Philadelphia, 1999. p. 919–35.

Publisher and edition text ends up in title ("monograph" title).

M.Monitor Who what why: how much gold can we get from mobile phones? https://www.bbc.com/news/blogs‐magazine‐monitor‐28802646(accessed: August2014).

Note lack of whitespace. Year (2014) does not get parsed; instead it ends up in an 'idno' field.

Yaoyao Sun: Motivation To Play Esports: Case of League of Legends[D].University of South Carolina. 2017.

Author's name not detected. Does seem like a strange citation style.

Mellier G. Incontinence urinaire d'effort. Editions médico-chirurgicales. Gynécologie, 300 A-10. Paris, 1994.

Journal name not detected? Not sure what is actually being cited in this style though (is this a chapter from a book in a series? an article in a journal?).

Von Mises, Human Action. 4th revised edition Fox and Wilkes. San Francisco, 1949.

Author surname parsed as separate forename/surname.

M.Zanin, M.Romance, S.Moral and R.Criado, "Credit card fraud detection through parenclitic network analysis", arXiv; 1706.01953v1, 1--8, 2017.

arxiv identifier not detected

S Khemapech Exercise for Health in older adult. JOPN Vol. 8 No. 2 July -December 2016.

Author name not detected

Peixoto, J.P. and Oort, A.H. 1992. Physics of Climate, (1st. ed.). American Institute of Physics, Springer-Verlag, New Yoyk, NY, USA.

Title includes part of edition (should be a note)

BMWi. EDL-G, 2010. URL http://www.gesetze-im-internet.de/edl-g/EDL-G.pdf. accessed on 2019-08-20.

Highsoft. Highcharts: Interactive JavaScript charts for your webpage. URL https://www.highcharts.com/. accessed on 2019-08-07.

LAMBRECHT meteo GmbH. Lambrecht Weather Station. URL https://www.lambrecht.net/loesungen-und-konzepte/systemloesungen/wetterstation-komplettloesung/. accessed on 2019-08-20.

In all these citations, the URL is mangled.

Source of first citation string: doi:10.1145/3366623.3368134 (in Crossref API)

Livingston , B. E. 1915 Atmometry and the porous-cup atmometer. Plant World, IS 21 30 51 74 96 143 149

Journal title ("Plant World") not extracted. The end of the string does seem to contain bogus numbers? The article is on pages 21 to 30, in volume 18 issue 2.

Source of metadata in Crossref: doi:10.2307/1940511

In the PDF (https://www.jstor.org/stable/pdf/1940511.pdf), citation looks like Livingston, B. E. I9I5. Atmometry and the porous-cup atmometer. Plant World, 18: 21-30, 51-74, 96-III, 143-149. so the mangled numbers seem to be poor OCR.

Pearson, K. 1901. On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11), 559--572. DOI=10.1080/14786440109462720.

DOI mangled ('=' prefix)
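One possible workaround, independent of GROBID, is to pull the DOI out of the raw string with a loose pattern before or after parsing. The sketch below is an illustration only: the pattern and the helper name `extract_doi` are my own, and the pattern is deliberately permissive rather than a full implementation of DOI syntax.

```python
import re

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# running up to the next whitespace. This is a heuristic, not the full DOI grammar.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(text):
    """Return the first DOI-looking substring without trailing punctuation, or None."""
    match = DOI_RE.search(text)
    if match is None:
        return None
    # Sentence punctuation often follows a DOI at the end of a citation.
    return match.group(0).rstrip(".,;)")


raw = "2(11), 559--572. DOI=10.1080/14786440109462720."
print(extract_doi(raw))  # prints 10.1080/14786440109462720
```

Because the search starts at "10.", any prefix such as `DOI=`, `doi:`, or a resolver URL is ignored without special handling.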

http://www.icmje.org/icmje-recommendations.pdf [Consulté le 2 janvier 2016].

Date not parsed. French text.

Ben Taskar Pieter Abbeel and Daphne Koller. 2012. Discriminative Probabilistic Models for Relational Data. arxiv:cs.LG/1301.0604

Meng Qu and Jian Tang. 2019. Probabilistic Logic Neural Networks for Reasoning. arxiv:cs.LG/1906.08495

Arxiv identifiers not extracted. Syntax here seems to be arxiv:, then section, then identifier.
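As a rough illustration of that syntax, the following sketch extracts new-style arXiv identifiers with a regular expression. The pattern and the helper name `find_arxiv_id` are hypothetical and not part of GROBID. It also tolerates the other variants that appear in this list (a space after the colon with the section in trailing brackets, and a semicolon instead of a colon). Old-style identifiers such as `hep-th/9901001` are not handled.

```python
import re

ARXIV_RE = re.compile(
    r"arxiv\s*[:;]\s*"                 # prefix, tolerating ':' or ';' and stray spaces
    r"(?:[a-z\-]+(?:\.[a-z]{2})?/)?"   # optional section such as "cs.LG/"
    r"(\d{4}\.\d{4,5}(?:v\d+)?)",      # new-style identifier with optional version
    re.IGNORECASE,
)


def find_arxiv_id(text):
    """Return the first new-style arXiv identifier found in text, or None."""
    match = ARXIV_RE.search(text)
    return match.group(1) if match else None


samples = [
    "Discriminative Probabilistic Models for Relational Data. arxiv:cs.LG/1301.0604",
    "Self-Normalizing Neural Networks. arxiv: 1706.02515 [cs.LG]",
    '"Credit card fraud detection ...", arXiv; 1706.01953v1, 1--8, 2017.',
]
for sample in samples:
    print(find_arxiv_id(sample))
```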

Bishan Yang Scott Wen-tau Yih Xiaodong He Jianfeng Gao and Li Deng. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In Proceedings of the International Conference on Learning Representations (ICLR) 2015(proceedings of the international conference on learning representations (iclr) 2015 ed.). https://www.microsoft.com/en-us/research/publication/embedding-entities-and-relations-for-learning-and-inference-in-knowledge-bases/

Author names not parsed.

Crossref source record: doi:10.1145/3366424.3391265

Christoph Molnar. 2020. Interpretable Machine Learning .Leanpub Victoria Canada. https://christophm.github.io/interpretable-ml-book/.

Publisher name is not separated from the title.

Source reference has whitespace issues.

Günter Klambauer Thomas Unterthiner Andreas Mayr and Sepp Hochreiter. 2017. Self-Normalizing Neural Networks. arxiv: 1706.02515 [cs.LG]

Byung-Hak Kim Ethan Vizitei and Varun Ganapathi. 2018. GritNet: Student Performance Prediction with Deep Learning. arxiv: 1804.07405 [cs.LG]

Sercan O. Arik and Tomas Pfister. 2019. TabNet: Attentive Interpretable Tabular Learning. arxiv: 1908.07442 [cs.LG]

Author names not extracted.

Arxiv identifier not extracted.

Wikipedia. 2019. PageRank, Retrieved September 5, 2019, from https://en.wikipedia.org/wiki/PageRank

Article title not extracted (it is a single word)

'Wikipedia' should be an orgName not a persName? Not sure.

Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, and Mohammed Zaki. [n.d.]. Link Prediction using Supervised Learning. ([n. d.]), 10.

Johan J Vossensteyn Andrea Kottmann Benjamin WA Jongbloed Franciscus Kaiser Leon Cremonini Bjorn Stensaker Elisabeth Hovdhaugen and Sabine Wollscheid. 2015. Dropout and completion in higher education in Europe: Main report. Technical Report. CHEPS and NIFU .

Author names not extracted.

Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. 2018. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. Proceedings - IEEE International Conference on Robotics and Automation (2018), 3803--3810.

Author names mangled

Max Pumperla. 2019. Hyperas. https://github.com/maxpumperla/hyperas .

Title (of project), which is a single word, not extracted.

Oktay K, Karlikaya G, Aydin BA. Ovarian transplantation now a reality? In: International Symposium on Storing Reproduction, 1999; Bologna. p. O23.

Title is missing the final question mark.
