lgalke/locdb-journals.ipynb

## locdb-journals.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Journal Analysis of the LOC-DB Project\n",
    "\n",
    "\n",
    "This notebook analyzes the journals covered by the LOC-DB project. We concentrate on the social sciences journals licensed by the Mannheim University Library and on their articles published in 2011.\n",
    "\n",
    "## Requirements and Etiquette\n",
    "\n",
    "You need Jupyter and Python, plus the Python library `crossrefapi`, which can be installed for example like this:\n",
    "```sh\n",
    "pip install crossrefapi\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'LOC-DB/ (https://locdb.bib.uni-mannheim.de/; mailto:Philipp Zumstein (Mannheim University Library)) BasedOn: CrossrefAPI/1.2.0'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from crossref.restful import Works, Journals, Etiquette\n",
    "my_etiquette = Etiquette('LOC-DB', '', 'https://locdb.bib.uni-mannheim.de/', 'Philipp Zumstein (Mannheim University Library)')\n",
    "str(my_etiquette)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Journal List\n",
    "\n",
    "We have 101 social sciences journals licensed in 2011, identified by their ISSNs. We store the ISSNs in a list:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "issnList =  ['0342-300X', '0033-5177', '0340-0425', '0023-2653', '0049-089x', '1550-3585', '0335-5322', '0065-2601', '0342-2275', '0092-6566', '0340-613x', '1869-8980', '0097-9740', '1864-9335', '0514-2776', '0038-6073', '0003-1224', '0018-7267', '0038-0164', '0027-3171', '0037-783X', '0035-2969', '0033-362X', '0038-0261', '0037-7732', '0022-2445', '0022-1031', '0038-0296', '0197-6664', '0894-3257', '0067-5830', '0146-1672', '0165-4896', '0278-016X', '0002-9602', '0007-1315', '0020-7152', '0011-3204', '0018-7259', '0037-8046', '0019-8676', '0022-3506', '0163-786X', '0378-8733', '0539-0184', '0038-0385', '0038-0407', '0066-6505', '0171-5860', '0933-9361', '0735-2751', '0266-7215', '0276-5624', '0749-5978', '0174-0202', '0048-8046', '0343-4109', '0001-6993', '0197-3533', '0360-0572', '0162-895x', '0002-7642', '1231-1413', '0046-2772', '0022-250x', '0340-1804', '0021-8308', '0094-3061', '0049-1241', '0048-3931', '1536-867X', '0730-8884', '0948-423X', '1749-5679', '0340-918X', '1043-4631', '1469-5405', '0891-2432', '0950-0170', '1477-996X', '0141-9889', '0098-7921', '0304-2421', '0011-3921', '0032-3292', '0044-118x', '0863-1808', '0263-2764', '0002-7162', '0038-609x', '0037-7791', '0012-155X', '0958-9287', '0951-6328', '1438-5627', '1864-3361', '1360-7804', '1435-9871', '0959-6801', '1468-0181', '0019-8692']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Activate only for testing:\n",
    "#issnList =  ['0342-300X', '0033-5177']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "101\n"
     ]
    }
   ],
   "source": [
    "numberOfJournals = len(issnList)\n",
    "print(numberOfJournals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Number of Articles Published in 2011"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "inCrossref = {}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(86, 5737, 66.70930232558139)\n"
     ]
    }
   ],
   "source": [
    "# k counts journals with DOIs issued in 2011; totalDois sums those DOIs.\n",
    "k = 0\n",
    "totalDois = 0  # not named sum, to avoid shadowing the built-in sum()\n",
    "for issn in issnList:\n",
    "    journal = Journals(etiquette=my_etiquette).journal(issn)\n",
    "    if journal:\n",
    "        inCrossref[issn] = True\n",
    "        # 'dois-by-issued-year' is a list of [year, count] pairs; tolerate missing keys\n",
    "        dois = (journal.get('breakdowns') or {}).get('dois-by-issued-year') or []\n",
    "        for year, count in dois:\n",
    "            if year == 2011:\n",
    "                k += 1\n",
    "                totalDois += count\n",
    "    else:\n",
    "        inCrossref[issn] = False\n",
    "doisPerJournal = totalDois/(k * 1.0)\n",
    "print(k, totalDois, doisPerJournal)\n",
    "\n",
    "# result: (86, 5737, 66.70930232558139)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus, we have found 86 journals in Crossref that registered DOIs with issue year 2011. Together they registered 5,737 DOIs for 2011, which gives an average of **67 DOIs per journal**. Assuming every journal article has a DOI, we can take this as an upper bound for the number of articles a journal publishes in a year. It is only an upper bound, because some journals register more DOIs than they publish articles."
   ]
  },
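  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The per-journal lookup in the loop above can be isolated into a small helper. This is only a sketch under assumptions: `dois_in_year` and `example_journal` are hypothetical names, and the record is made-up data in the shape the loop reads (a `breakdowns` dict whose `dois-by-issued-year` entry is a list of `[year, count]` pairs). It does not call Crossref.\n",
    "\n",
    "```py\n",
    "def dois_in_year(journal, year):\n",
    "    # Return the DOI count for `year`, or None if the record lists none.\n",
    "    breakdowns = journal.get('breakdowns') or {}\n",
    "    for issued_year, count in breakdowns.get('dois-by-issued-year') or []:\n",
    "        if issued_year == year:\n",
    "            return count\n",
    "    return None\n",
    "\n",
    "# Hypothetical record; the counts are invented for illustration.\n",
    "example_journal = {'breakdowns': {'dois-by-issued-year': [[2012, 70], [2011, 64], [2010, 58]]}}\n",
    "count_2011 = dois_in_year(example_journal, 2011)\n",
    "print(count_2011)\n",
    "```\n",
    "\n",
    "Returning `None` instead of 0 keeps journals without a 2011 entry out of the average, as in the loop above."
   ]
  },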
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References in one Journal Article\n",
    "\n",
    "Next, we count the number of references per journal article. This takes quite some time for the whole list of ISSNs, so we use the variable `inCrossref` created above to skip all journals that are not in Crossref. In the same loop we also measure several indicators of data quality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(3835, 80, 169361, 44.1619295958279)\n"
     ]
    }
   ],
   "source": [
    "k = 0\n",
    "z = 0\n",
    "sum = 0\n",
    "misclassified = 0\n",
    "onlyUnstructured = 0\n",
    "hardToResolve = 0\n",
    "occurrences = {}\n",
    "examplesUnstructured = []\n",
    "examplesHard = []\n",
    "examplesMisclassified = []\n",
    "publisherMisclassified = {}\n",
    "for issn in issnList:\n",
    "    if (inCrossref[issn]):\n",
    "        works = Journals(etiquette=my_etiquette).works(issn).filter(has_references=\"true\").filter(from_pub_date=2011).filter(until_pub_date=2011)\n",
    "        for article in works:\n",
    "            nref = article['reference-count']\n",
    "            k += 1\n",
    "            sum += nref\n",
    "            if ('reference' in article):\n",
    "                for reference in article['reference']:\n",
    "                    for label in reference.keys():\n",
    "                        if (label in occurrences):\n",
    "                            occurrences[label] += 1\n",
    "                        else:\n",
    "                            occurrences[label] = 1\n",
    "                    if ('key' in reference and 'unstructured' in reference and len(reference)==2):\n",
    "                        if (onlyUnstructured<10):\n",
    "                            examplesUnstructured.append(\"First examples of a reference with only unstructured text: \" + str(reference) + \" from the article \" + article['DOI'])\n",
    "                        onlyUnstructured += 1\n",
    "                    else:\n",
    "                        if ('DOI' not in reference and 'article-title' not in reference and  'volume-title' not in reference and 'first-page' not in reference):\n",
    "                            if (hardToResolve<10):\n",
    "                                examplesHard.append(\"First examples of a hard case but not with only unstructured text: \" + str(reference) + \" from the article \" + article['DOI'])\n",
    "                            hardToResolve += 1\n",
    "            else:\n",
    "                if (misclassified<10):\n",
    "                    examplesMisclassified.append(article['DOI'])\n",
    "                misclassified += 1\n",
    "                if (article['publisher'] in publisherMisclassified):\n",
    "                    publisherMisclassified[article['publisher']] += 1\n",
    "                else:\n",
    "                    publisherMisclassified[article['publisher']] = 1\n",
    "        if (works.count()>0): z += 1\n",
    "\n",
    "referencePerArticle = sum/(k * 1.0)\n",
    "print(k, z, sum, referencePerArticle)\n",
    "\n",
    "# result: (3835, 80, 169361, 44.1619295958279)"
   ]
  },
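  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two per-reference checks in the loop above can be read as a small classifier. The sketch below mirrors those conditions. The function name `classify_reference`, the three labels, and the sample references are hypothetical and only serve as illustration.\n",
    "\n",
    "```py\n",
    "def classify_reference(reference):\n",
    "    # Case 1: nothing but a key and free text.\n",
    "    if 'key' in reference and 'unstructured' in reference and len(reference) == 2:\n",
    "        return 'only-unstructured'\n",
    "    # Case 2: some structure, but none of the fields that identify a work.\n",
    "    identifying = ('DOI', 'article-title', 'volume-title', 'first-page')\n",
    "    if not any(field in reference for field in identifying):\n",
    "        return 'hard-to-resolve'\n",
    "    return 'resolvable'\n",
    "\n",
    "# Made-up references in the shape Crossref returns them.\n",
    "samples = [\n",
    "    {'key': 'ref1', 'unstructured': 'Some free-text citation (2005)'},\n",
    "    {'key': 'ref2', 'author': 'Doe', 'journal-title': 'Some Journal', 'year': '2010'},\n",
    "    {'key': 'ref3', 'DOI': '10.1000/example'},\n",
    "]\n",
    "labels = [classify_reference(r) for r in samples]\n",
    "print(labels)\n",
    "```"
   ]
  },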
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Crossref has references for k=3,835 journal articles (from z=80 different journals), summing up to sum=169,361 individual references, i.e. **44 references on average per article**. Journals that are not in Crossref, or that have no references in Crossref, are left out of this average: their articles also have reference lists, so counting them with zero references would distort the average."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Estimation of the Number of Overall References\n",
    "\n",
    "We can now combine these numbers to estimate the total number of references in all articles that appeared in 2011 in any of the 101 journals:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "297547.1627816015"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "numberOfJournals*doisPerJournal*referencePerArticle"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Thus, we estimate that there are about **298,000 references** in 2011 across all our social sciences journals."
   ]
  },
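  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For readers who want to retrace the estimate without querying Crossref, here is the same arithmetic with the numbers printed above. The variable names are illustrative. Note that the product extrapolates both averages to all 101 journals, so it implicitly assumes that the journals missing from Crossref resemble the ones that were found.\n",
    "\n",
    "```py\n",
    "journals_total = 101        # ISSNs in the list\n",
    "journals_with_dois = 86     # journals with DOIs issued in 2011\n",
    "dois_2011 = 5737            # DOIs those journals registered for 2011\n",
    "articles_with_refs = 3835   # articles returned with a reference count\n",
    "references_total = 169361   # sum of their reference counts\n",
    "\n",
    "# Multiply by 1.0 so the divisions are floating point under Python 2 as well.\n",
    "dois_per_journal = dois_2011 / (journals_with_dois * 1.0)\n",
    "references_per_article = references_total / (articles_with_refs * 1.0)\n",
    "estimate = journals_total * dois_per_journal * references_per_article\n",
    "print(round(estimate, -3))\n",
    "```"
   ]
  },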
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Misclassified Entries in Crossref\n",
    "\n",
    "It is possible that a Crossref entry matches the filter `has_references`, but its `reference` field contains no openly shared references. We counted these cases above and saved some examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1317\n",
      "{u'Max Planck Institute for Demographic Research': 58, u'Elsevier BV': 603, u'Hogrefe Publishing Group': 34, u'Nomos Verlag': 27, u'University of Chicago Press': 187, u'Guilford Publications': 43, u'Society for Applied Anthropology': 39, u'Oxford University Press (OUP)': 226, u'Annual Reviews': 47, u'Duncker & Humblot GmbH': 53}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[u'10.1007/s11578-011-0123-0',\n",
       " u'10.1007/s11578-011-0124-z',\n",
       " u'10.1007/s11578-011-0128-8',\n",
       " u'10.1007/s11578-011-0126-x',\n",
       " u'10.1007/s11578-011-0129-7',\n",
       " u'10.1007/s11578-011-0130-1',\n",
       " u'10.1007/s11578-011-0132-z',\n",
       " u'10.1007/s11578-011-0133-y',\n",
       " u'10.1007/s11578-011-0134-x',\n",
       " u'10.1007/s11578-011-0135-9']"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(misclassified)\n",
    "print(publisherMisclassified)\n",
    "examplesMisclassified"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These examples have a `reference-count` in Crossref but not the `reference` list itself."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Quality\n",
    "\n",
    "In the loop above we also measured several indicators of data quality."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7302\n",
      "1370\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{u'DOI': 59726,\n",
       " u'ISSN': 1,\n",
       " u'article-title': 21199,\n",
       " u'author': 73124,\n",
       " u'doi-asserted-by': 59726,\n",
       " u'edition': 763,\n",
       " u'first-page': 40145,\n",
       " u'issn-type': 1,\n",
       " u'issue': 11751,\n",
       " u'journal-title': 35229,\n",
       " u'key': 111490,\n",
       " u'series-title': 27,\n",
       " u'unstructured': 20310,\n",
       " u'volume': 34069,\n",
       " u'volume-title': 38623,\n",
       " u'year': 73205}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(onlyUnstructured)\n",
    "print(hardToResolve)\n",
    "occurrences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Out of these 169,361 references, 7,302 (4%) have only unstructured information, and another 1,370 have only sparse data (no DOI, no title, no first-page), which makes these publications really hard to identify.\n",
    "\n",
    "The most frequent structured fields of interest are:\n",
    "* 73,205 (43%) have a year\n",
    "* 73,124 (43%) have an author (possibly multiple authors)\n",
    "* 59,822 (35%) have a title (article-title or volume-title)\n",
    "* 59,726 (35%) have a DOI\n",
    "* 40,145 (24%) have a first-page\n",
    "* 35,229 (21%) have a journal-title\n",
    "* 34,069 (20%) have a volume (number)"
   ]
  },
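  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The shares in the list above can be recomputed from the `occurrences` counts. The sketch below hard-codes the printed counts and divides by the 169,361 references. The names `counts` and `shares` are illustrative, and adding `article-title` and `volume-title` into one title count assumes that no reference carries both fields.\n",
    "\n",
    "```py\n",
    "total_references = 169361\n",
    "counts = {\n",
    "    'year': 73205,\n",
    "    'author': 73124,\n",
    "    'title': 21199 + 38623,  # article-title + volume-title\n",
    "    'DOI': 59726,\n",
    "    'first-page': 40145,\n",
    "    'journal-title': 35229,\n",
    "    'volume': 34069,\n",
    "}\n",
    "# Share of all references that carry each field, in percent.\n",
    "shares = dict((field, 100.0 * n / total_references) for field, n in counts.items())\n",
    "for field in sorted(shares, key=shares.get, reverse=True):\n",
    "    print('%-14s %6d  %5.1f%%' % (field, counts[field], shares[field]))\n",
    "```"
   ]
  },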
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are some examples of references with only unstructured data and other references which are categorized as hard cases:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First examples of a reference with only unstructured text: {u'unstructured': u'Robert, P.: Measuring income in public opinion survey data. Paper presented at the ISA RC33 conference on social science methodology in the new millennium, Cologne, 2000', u'key': u'9636_CR12'} from the article 10.1007/s11135-011-9636-5\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'Treiman, D.J.: Industrialization and social stratification. In: Laumann, E.O. (ed.) Social Stratification: Research and Theory for the 1970s, pp. 207\\u2013234. Bobbs-Merrill, Indianapolis (1970)', u'key': u'9636_CR19'} from the article 10.1007/s11135-011-9636-5\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'World Bank: Growth, poverty, and inequality. Eastern Europe and the Former Soviet Union. The International Bank for Reconstruction and Development/The World Bank, Washington, DC, (2005)', u'key': u'9636_CR20'} from the article 10.1007/s11135-011-9636-5\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'Kuo, L.: Classical and prediction approaches to estimating distribution functions from survey data. In: Proceeding of the Section on Survey Research Methods. American Statistical Association, Alexandria, 280\\u2013285 (1988)', u'key': u'9625_CR11'} from the article 10.1007/s11135-011-9625-8\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'Kara, A.: Discrete stochastic dynamics of income inequality in education: an applied stochastic model and a case study. Discr. Dyn. Nat. Soc. 1\\u201315 (2007a)', u'key': u'9643_CR11'} from the article 10.1007/s11135-011-9643-6\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'Kara, A.: Stochastic equilibria in the public health care industry: a low quality trap and a resolution. In: European Applied Business Research Conference, Venice/Padova, Italy, June 2007, EABR conference\\u2014proceedings (2007b)', u'key': u'9643_CR12'} from the article 10.1007/s11135-011-9643-6\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'Colombian Department of Statistics (DANE).: Annual publication, survey of manufacturer sectors. DANE, Bogota (2005) (in Spanish)', u'key': u'9592_CR1'} from the article 10.1007/s11135-011-9592-0\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'Epstein, J., Duerr, D., Kenworthy, L., Ragin, C.: Comparative employment performance: a fuzzy-set analysis. In: Kenworthy, L., Hicks, A. (eds.) Method and Substance in Macrocomparative Analysis. Palgrave Macmillan, Houndmills (2007)', u'key': u'9592_CR3'} from the article 10.1007/s11135-011-9592-0\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'Nuschler, F.: Development policy. Informationszentrum Sozialwissenschaften, Bonn (2005)', u'key': u'9592_CR4'} from the article 10.1007/s11135-011-9592-0\n",
      "\n",
      "\n",
      "First examples of a reference with only unstructured text: {u'unstructured': u'Ragin, C.C.: User\\u2019s Guide to Fuzzy-Set/Qualitative Comparative Analysis 2.0. Tucson, Arizona: Department of Sociology, University of Arizona, Tucson, AZ (2006)', u'key': u'9592_CR7'} from the article 10.1007/s11135-011-9592-0\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for e in examplesUnstructured:\n",
    "    print(e)\n",
    "    print(\"\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "First examples of a hard case but not with only unstructured text: {u'journal-title': u'Journal of Statistical Software', u'key': u'bibr53-0003122411407748', u'author': u'Sekhon Jasjeet S.'} from the article 10.1177/0003122411407748\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'journal-title': u'Social Networks', u'key': u'bibr34-0003122410396196', u'author': u'Faris Robert'} from the article 10.1177/0003122410396196\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'year': u'1992', u'journal-title': u'Streetwise: Race, Class, and Change in an Urban Community', u'key': u'bibr1-0003122411409705', u'author': u'Anderson Elijah'} from the article 10.1177/0003122411409705\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'New York Times', u'key': u'bibr4-0003122411398443', u'author': u'Archibald Randal C.'} from the article 10.1177/0003122411398443\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'year': u'2005', u'journal-title': u'Desert News', u'key': u'bibr10-0003122411398443', u'author': u'Bulkeley Deborah'} from the article 10.1177/0003122411398443\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'year': u'2009', u'journal-title': u'Houston Chronicle', u'key': u'bibr11-0003122411398443', u'author': u'Carroll Susan'} from the article 10.1177/0003122411398443\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'San Francisco Chronicle', u'key': u'bibr16-0003122411398443', u'author': u'Egelko Bob'} from the article 10.1177/0003122411398443\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'Chronicle of Higher Education', u'key': u'bibr17-0003122411398443', u'author': u'Field Kelly'} from the article 10.1177/0003122411398443\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'year': u'2009', u'journal-title': u'Chronicle of Higher Education', u'key': u'bibr24-0003122411398443', u'author': u'Gonzalez Jennifer'} from the article 10.1177/0003122411398443\n",
      "\n",
      "\n",
      "First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'New York Times', u'key': u'bibr36-0003122411398443', u'author': u'Lewin Tamar'} from the article 10.1177/0003122411398443\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "for e in examplesHard:\n",
    "    print(e)\n",
    "    print(\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Acknowledgement\n",
    "\n",
    "This Jupyter notebook is heavily influenced by the nice demo [crossref-api-demo.ipynb](https://github.com/CrossRef/rest-api-doc/blob/master/demos/crossref-api-demo.ipynb) by Geoffrey Bilder. It is based on the Python library [crossrefapi](https://github.com/fabiobatalha/crossrefapi) by Fabio Batalha, which in turn builds on the great [Crossref API](https://github.com/CrossRef/rest-api-doc)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Journal Analysis of the LOC-DB Project\n",
	"\n",
	"\n",
	"Here we make an analysis of the journals covered by the LOC-DB project, where we concentrate on the social sciences journals licensed by the Mannheim University Library published in 2011.\n",
	"\n",
	"## Requirements and Etiquette\n",
	"\n",
	"You need to have jupyter installed together with python, and in python you need the crossrefapi, which can be installed for example like this:\n",
	"```py\n",
	"pip install crossrefapi\n",
	"```"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"'LOC-DB/ (https://locdb.bib.uni-mannheim.de/; mailto:Philipp Zumstein (Mannheim University Library)) BasedOn: CrossrefAPI/1.2.0'"
	]
	},
	"execution_count": 1,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"from crossref.restful import Works, Journals, Etiquette\n",
	"my_etiquette = Etiquette('LOC-DB', '', 'https://locdb.bib.uni-mannheim.de/', 'Philipp Zumstein (Mannheim University Library)')\n",
	"str(my_etiquette)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Journal List\n",
	"\n",
	"We have 101 journals from social sciences licensed in 2011 with their ISSNs. We save them here into a list:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [],
	"source": [
	"issnList = ['0342-300X', '0033-5177', '0340-0425', '0023-2653', '0049-089x', '1550-3585', '0335-5322', '0065-2601', '0342-2275', '0092-6566', '0340-613x', '1869-8980', '0097-9740', '1864-9335', '0514-2776', '0038-6073', '0003-1224', '0018-7267', '0038-0164', '0027-3171', '0037-783X', '0035-2969', '0033-362X', '0038-0261', '0037-7732', '0022-2445', '0022-1031', '0038-0296', '0197-6664', '0894-3257', '0067-5830', '0146-1672', '0165-4896', '0278-016X', '0002-9602', '0007-1315', '0020-7152', '0011-3204', '0018-7259', '0037-8046', '0019-8676', '0022-3506', '0163-786X', '0378-8733', '0539-0184', '0038-0385', '0038-0407', '0066-6505', '0171-5860', '0933-9361', '0735-2751', '0266-7215', '0276-5624', '0749-5978', '0174-0202', '0048-8046', '0343-4109', '0001-6993', '0197-3533', '0360-0572', '0162-895x', '0002-7642', '1231-1413', '0046-2772', '0022-250x', '0340-1804', '0021-8308', '0094-3061', '0049-1241', '0048-3931', '1536-867X', '0730-8884', '0948-423X', '1749-5679', '0340-918X', '1043-4631', '1469-5405', '0891-2432', '0950-0170', '1477-996X', '0141-9889', '0098-7921', '0304-2421', '0011-3921', '0032-3292', '0044-118x', '0863-1808', '0263-2764', '0002-7162', '0038-609x', '0037-7791', '0012-155X', '0958-9287', '0951-6328', '1438-5627', '1864-3361', '1360-7804', '1435-9871', '0959-6801', '1468-0181', '0019-8692']"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"# Activate only for testing:\n",
	"#issnList = ['0342-300X', '0033-5177']"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"101\n"
	]
	}
	],
	"source": [
	"numberOfJournals = len(issnList)\n",
	"print(numberOfJournals)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Number of Articles Published in 2011"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 13,
	"metadata": {
	"collapsed": true
	},
	"outputs": [],
	"source": [
	"inCrossref = {}"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"(86, 5737, 66.70930232558139)\n"
	]
	}
	],
	"source": [
	"k = 0\n",
	"sum = 0\n",
	"for issn in issnList:\n",
	" journal = Journals(etiquette=my_etiquette).journal(issn)\n",
	" if journal:\n",
	" inCrossref[issn] = True\n",
	" if (journal['breakdowns'] and journal['breakdowns']['dois-by-issued-year']):\n",
	" dois = journal['breakdowns']['dois-by-issued-year']\n",
	" #results[issn]['doisPerYear'] = {}\n",
	" for doi in dois:\n",
	" if doi[0]==2011:\n",
	" k += 1\n",
	" sum += doi[1]\n",
	" else:\n",
	" inCrossref[issn] = False\n",
	"doisPerJournal = sum/(k * 1.0)\n",
	"print(k, sum, doisPerJournal)\n",
	"\n",
	"# result: (86, 5737, 66.70930232558139)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Thus, we have found 86 journals in Crossref. The number of DOIs these journals registered in 2011 in Crossref is 5,737, which gives an average of 67 DOIs per journal. Because every journal article has a DOI, we can take this as an upper bound for the number of articles published in a year. Although some journals have more DOIs registered than they have published articles."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## References in one Journal Article\n",
	"\n",
	"Next, we count the number of references per journal article. This takes quite some time for the whole list of ISSNs. Therefore, we skip all journals not in Crossref beforehands by the variable `inCrossref` created above. Moreover, in the same loop we measure some indicator for data quality."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"(3835, 80, 169361, 44.1619295958279)\n"
	]
	}
	],
	"source": [
	"k = 0\n",
	"z = 0\n",
	"sum = 0\n",
	"misclassified = 0\n",
	"onlyUnstructured = 0\n",
	"hardToResolve = 0\n",
	"occurrences = {}\n",
	"examplesUnstructured = []\n",
	"examplesHard = []\n",
	"examplesMisclassified = []\n",
	"publisherMisclassified = {}\n",
	"for issn in issnList:\n",
	" if (inCrossref[issn]):\n",
	" works = Journals(etiquette=my_etiquette).works(issn).filter(has_references=\"true\").filter(from_pub_date=2011).filter(until_pub_date=2011)\n",
	" for article in works:\n",
	" nref = article['reference-count']\n",
	" k += 1\n",
	" sum += nref\n",
	" if ('reference' in article):\n",
	" for reference in article['reference']:\n",
	" for label in reference.keys():\n",
	" if (label in occurrences):\n",
	" occurrences[label] += 1\n",
	" else:\n",
	" occurrences[label] = 1\n",
	" if ('key' in reference and 'unstructured' in reference and len(reference)==2):\n",
	" if (onlyUnstructured<10):\n",
	" examplesUnstructured.append(\"First examples of a reference with only unstructured text: \" + str(reference) + \" from the article \" + article['DOI'])\n",
	" onlyUnstructured += 1\n",
	" else:\n",
	" if ('DOI' not in reference and 'article-title' not in reference and 'volume-title' not in reference and 'first-page' not in reference):\n",
	" if (hardToResolve<10):\n",
	" examplesHard.append(\"First examples of a hard case but not with only unstructured text: \" + str(reference) + \" from the article \" + article['DOI'])\n",
	" hardToResolve += 1\n",
	" else:\n",
	" if (misclassified<10):\n",
	" examplesMisclassified.append(article['DOI'])\n",
	" misclassified += 1\n",
	" if (article['publisher'] in publisherMisclassified):\n",
	" publisherMisclassified[article['publisher']] += 1\n",
	" else:\n",
	" publisherMisclassified[article['publisher']] = 1\n",
	" if (works.count()>0): z += 1\n",
	"\n",
	"referencePerArticle = sum/(k * 1.0)\n",
	"print(k, z, sum, referencePerArticle)\n",
	"\n",
	"# result: (3835, 80, 169361, 44.1619295958279)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Crossref has references of k=3,835 journal articles (coming from z=80 different journals) suming up to sum=169,361 single references, i.e. 44 references in average per article. Journals not in Crossref or without references in Crossref are not considered here for calculating the average number, because they will also have list of references."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Estimation of the Number of Overall References\n",
	"\n",
	"We can take these numbers now together to make an estimation for the number of all references from all articles appeared 2011 in any of the 101 journals:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"297547.1627816015"
	]
	},
	"execution_count": 8,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"numberOfJournalsdoisPerJournalreferencePerArticle"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Thus, we can estimate that there are 298,000 references in 2011 for all our journals in social sciences."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Misclassified Entries in Crossref\n",
	"\n",
	"It is possible that in Crossref an entry is classified to `has_references`, but there are no openly shared references in the field `reference`. We counted these cases above and saved some examples:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 9,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"1317\n",
	"{u'Max Planck Institute for Demographic Research': 58, u'Elsevier BV': 603, u'Hogrefe Publishing Group': 34, u'Nomos Verlag': 27, u'University of Chicago Press': 187, u'Guilford Publications': 43, u'Society for Applied Anthropology': 39, u'Oxford University Press (OUP)': 226, u'Annual Reviews': 47, u'Duncker & Humblot GmbH': 53}\n"
	]
	},
	{
	"data": {
	"text/plain": [
	"[u'10.1007/s11578-011-0123-0',\n",
	" u'10.1007/s11578-011-0124-z',\n",
	" u'10.1007/s11578-011-0128-8',\n",
	" u'10.1007/s11578-011-0126-x',\n",
	" u'10.1007/s11578-011-0129-7',\n",
	" u'10.1007/s11578-011-0130-1',\n",
	" u'10.1007/s11578-011-0132-z',\n",
	" u'10.1007/s11578-011-0133-y',\n",
	" u'10.1007/s11578-011-0134-x',\n",
	" u'10.1007/s11578-011-0135-9']"
	]
	},
	"execution_count": 9,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"print(misclassified)\n",
	"print(publisherMisclassified)\n",
	"examplesMisclassified"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"These examples have the information `reference-count` in Crossref but not the `reference`(s) itself."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Data Quality\n",
	"\n",
	"In the loop above we also have measured several indicators for the data quality."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"7302\n",
	"1370\n"
	]
	},
	{
	"data": {
	"text/plain": [
	"{u'DOI': 59726,\n",
	" u'ISSN': 1,\n",
	" u'article-title': 21199,\n",
	" u'author': 73124,\n",
	" u'doi-asserted-by': 59726,\n",
	" u'edition': 763,\n",
	" u'first-page': 40145,\n",
	" u'issn-type': 1,\n",
	" u'issue': 11751,\n",
	" u'journal-title': 35229,\n",
	" u'key': 111490,\n",
	" u'series-title': 27,\n",
	" u'unstructured': 20310,\n",
	" u'volume': 34069,\n",
	" u'volume-title': 38623,\n",
	" u'year': 73205}"
	]
	},
	"execution_count": 10,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"print(onlyUnstructured)\n",
	"print(hardToResolve)\n",
	"occurrences"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Out of these 169,361 references, 7,302 (4%) have only unstructured information, and another 1,370 have only sparse data (no DOI, no title, no first-page), which makes these publications very hard to identify.\n",
	"\n",
	"The most frequent structured fields of interest are:\n",
	"* 73,205 (43%) have a year\n",
	"* 73,124 (43%) have an author (possibly multiple authors)\n",
	"* 59,822 (35%) have a title (article-title or volume-title)\n",
	"* 59,726 (35%) have a DOI\n",
	"* 40,145 (24%) have a first-page\n",
	"* 35,229 (21%) have a journal-title\n",
	"* 34,069 (20%) have a volume (number)"
	]
	},
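	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"The two categories and the field counts above can be reproduced with a sketch like the one below. It is written under assumptions: the function names, the category labels and the third sample reference are hypothetical, and the exact criteria used in the original loop may differ in detail. The first two sample references are taken from the example outputs in this notebook.\n",
	"\n",
	"```python\n",
	"from collections import Counter\n",
	"\n",
	"# Fields whose presence makes a reference reasonably easy to identify\n",
	"# (assumption: 'title' means article-title or volume-title).\n",
	"IDENTIFYING_FIELDS = ('DOI', 'article-title', 'volume-title', 'first-page')\n",
	"\n",
	"\n",
	"def classify_reference(ref):\n",
	"    # Ignore the bookkeeping field 'key' when looking at the content.\n",
	"    fields = set(ref) - {'key'}\n",
	"    if fields == {'unstructured'}:\n",
	"        return 'only-unstructured'\n",
	"    if not any(name in ref for name in IDENTIFYING_FIELDS):\n",
	"        return 'hard-to-resolve'\n",
	"    return 'resolvable'\n",
	"\n",
	"\n",
	"def summarize(references):\n",
	"    # Count how often each field occurs and how many references fall in each class.\n",
	"    occurrences = Counter()\n",
	"    classes = Counter()\n",
	"    for ref in references:\n",
	"        occurrences.update(ref.keys())\n",
	"        classes[classify_reference(ref)] += 1\n",
	"    return occurrences, classes\n",
	"\n",
	"\n",
	"def share(count, total):\n",
	"    # Percentage of count in total, as a float.\n",
	"    return 100.0 * count / total\n",
	"\n",
	"\n",
	"sample_references = [\n",
	"    {'unstructured': 'Nuschler, F.: Development policy. Informationszentrum Sozialwissenschaften, Bonn (2005)',\n",
	"     'key': '9592_CR4'},\n",
	"    {'journal-title': 'Social Networks', 'key': 'bibr34-0003122410396196',\n",
	"     'author': 'Faris Robert'},\n",
	"    {'key': 'ref3', 'DOI': '10.1000/example', 'year': '2010'},  # hypothetical\n",
	"]\n",
	"\n",
	"occurrences, classes = summarize(sample_references)\n",
	"print(dict(classes))\n",
	"print(round(share(7302, 169361)))\n",
	"```\n",
	"\n",
	"The percentages in the list are each count divided by the total of 169,361 references and rounded to whole numbers."
	]
	},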
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Here are some examples of references with only unstructured data and other references which are categorized as hard cases:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 11,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"First examples of a reference with only unstructured text: {u'unstructured': u'Robert, P.: Measuring income in public opinion survey data. Paper presented at the ISA RC33 conference on social science methodology in the new millennium, Cologne, 2000', u'key': u'9636_CR12'} from the article 10.1007/s11135-011-9636-5\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'Treiman, D.J.: Industrialization and social stratification. In: Laumann, E.O. (ed.) Social Stratification: Research and Theory for the 1970s, pp. 207\\u2013234. Bobbs-Merrill, Indianapolis (1970)', u'key': u'9636_CR19'} from the article 10.1007/s11135-011-9636-5\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'World Bank: Growth, poverty, and inequality. Eastern Europe and the Former Soviet Union. The International Bank for Reconstruction and Development/The World Bank, Washington, DC, (2005)', u'key': u'9636_CR20'} from the article 10.1007/s11135-011-9636-5\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'Kuo, L.: Classical and prediction approaches to estimating distribution functions from survey data. In: Proceeding of the Section on Survey Research Methods. American Statistical Association, Alexandria, 280\\u2013285 (1988)', u'key': u'9625_CR11'} from the article 10.1007/s11135-011-9625-8\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'Kara, A.: Discrete stochastic dynamics of income inequality in education: an applied stochastic model and a case study. Discr. Dyn. Nat. Soc. 1\\u201315 (2007a)', u'key': u'9643_CR11'} from the article 10.1007/s11135-011-9643-6\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'Kara, A.: Stochastic equilibria in the public health care industry: a low quality trap and a resolution. In: European Applied Business Research Conference, Venice/Padova, Italy, June 2007, EABR conference\\u2014proceedings (2007b)', u'key': u'9643_CR12'} from the article 10.1007/s11135-011-9643-6\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'Colombian Department of Statistics (DANE).: Annual publication, survey of manufacturer sectors. DANE, Bogota (2005) (in Spanish)', u'key': u'9592_CR1'} from the article 10.1007/s11135-011-9592-0\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'Epstein, J., Duerr, D., Kenworthy, L., Ragin, C.: Comparative employment performance: a fuzzy-set analysis. In: Kenworthy, L., Hicks, A. (eds.) Method and Substance in Macrocomparative Analysis. Palgrave Macmillan, Houndmills (2007)', u'key': u'9592_CR3'} from the article 10.1007/s11135-011-9592-0\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'Nuschler, F.: Development policy. Informationszentrum Sozialwissenschaften, Bonn (2005)', u'key': u'9592_CR4'} from the article 10.1007/s11135-011-9592-0\n",
	"\n",
	"\n",
	"First examples of a reference with only unstructured text: {u'unstructured': u'Ragin, C.C.: User\\u2019s Guide to Fuzzy-Set/Qualitative Comparative Analysis 2.0. Tucson, Arizona: Department of Sociology, University of Arizona, Tucson, AZ (2006)', u'key': u'9592_CR7'} from the article 10.1007/s11135-011-9592-0\n",
	"\n",
	"\n"
	]
	}
	],
	"source": [
	"for e in examplesUnstructured:\n",
	"    print(e)\n",
	"    print(\"\\n\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 12,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"First examples of a hard case but not with only unstructured text: {u'journal-title': u'Journal of Statistical Software', u'key': u'bibr53-0003122411407748', u'author': u'Sekhon Jasjeet S.'} from the article 10.1177/0003122411407748\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'journal-title': u'Social Networks', u'key': u'bibr34-0003122410396196', u'author': u'Faris Robert'} from the article 10.1177/0003122410396196\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'year': u'1992', u'journal-title': u'Streetwise: Race, Class, and Change in an Urban Community', u'key': u'bibr1-0003122411409705', u'author': u'Anderson Elijah'} from the article 10.1177/0003122411409705\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'New York Times', u'key': u'bibr4-0003122411398443', u'author': u'Archibald Randal C.'} from the article 10.1177/0003122411398443\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'year': u'2005', u'journal-title': u'Desert News', u'key': u'bibr10-0003122411398443', u'author': u'Bulkeley Deborah'} from the article 10.1177/0003122411398443\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'year': u'2009', u'journal-title': u'Houston Chronicle', u'key': u'bibr11-0003122411398443', u'author': u'Carroll Susan'} from the article 10.1177/0003122411398443\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'San Francisco Chronicle', u'key': u'bibr16-0003122411398443', u'author': u'Egelko Bob'} from the article 10.1177/0003122411398443\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'Chronicle of Higher Education', u'key': u'bibr17-0003122411398443', u'author': u'Field Kelly'} from the article 10.1177/0003122411398443\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'year': u'2009', u'journal-title': u'Chronicle of Higher Education', u'key': u'bibr24-0003122411398443', u'author': u'Gonzalez Jennifer'} from the article 10.1177/0003122411398443\n",
	"\n",
	"\n",
	"First examples of a hard case but not with only unstructured text: {u'year': u'2010', u'journal-title': u'New York Times', u'key': u'bibr36-0003122411398443', u'author': u'Lewin Tamar'} from the article 10.1177/0003122411398443\n",
	"\n",
	"\n"
	]
	}
	],
	"source": [
	"for e in examplesHard:\n",
	"    print(e)\n",
	"    print(\"\\n\")"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Acknowledgement\n",
	"\n",
	"This Jupyter notebook is heavily influenced by the nice demo [crossref-api-demo.ipynb](https://github.com/CrossRef/rest-api-doc/blob/master/demos/crossref-api-demo.ipynb) by Geoffrey Bilder and is based on the Python library [crossrefapi](https://github.com/fabiobatalha/crossrefapi) by Fabio Batalha; all of this builds on the great [Crossref API](https://github.com/CrossRef/rest-api-doc)."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "Python 2",
	"language": "python",
	"name": "python2"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 2
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython2",
	"version": "2.7.12"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}