laurenbarker/download_share_search_csv.ipynb

## download_share_search_csv.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**GATHERING SHARE METADATA FROM A SINGLE INSTITUION**\n",
    "\n",
    "This notebook will allow you to generate a .csv spreadsheet file containing the metadata provided to SHARE by a single institution.\n",
    "\n",
    "A few general notes:\n",
    "\n",
    "  + To execute a section of code, place your cursor inside the shaded area and press *Shift* then press *Enter*.\n",
    "  \n",
    "  + Some sections of code do not need to be changed, but others will require you to edit their contents to ensure you \n",
    "   will obtain the results you desire. \n",
    "  \n",
    "  + Spacing is very important when using code, so check carefully for accuracy. \n",
    "  \n",
    "  + Boxes should be executed in the correct order, so if you make a mistake, you may need to go back to the beginning \n",
    "  to begin the process again. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Before we can generate a search, a few pieces of information must be loaded into memory. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**Execute the cell below by clicking in the shaded area, then pressing *Shift* followed by *Enter*. **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROD_SHARE_BASE_URL = 'https://share.osf.io/api/v2/search/creativeworks/'\n",
    "SEARCH_URL = '_search'\n",
    "FIELDNAMES_URL = 'https://share.osf.io/api/v2/search/_mappings'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that jupyter knows to gather its information from that URL, it needs a few more instructions. \n",
    "This next code tells the notebook how to perform some tasks. \n",
    "\n",
    "**Execute the cell below by pressing *Shift* then pressing *Enter*. **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import furl\n",
    "import requests\n",
    "import json\n",
    "\n",
    "search_url = furl.furl(PROD_SHARE_BASE_URL + SEARCH_URL)\n",
    "search_url.args['size'] = 10000\n",
    "\n",
    "def query_share(url, query):\n",
    "    # A helper function that will use the requests library,\n",
    "    # pass along the correct headers,\n",
    "    # and make the query we want\n",
    "    headers = {'Content-Type': 'application/json'}\n",
    "    data = json.dumps(query)\n",
    "    return requests.post(url, headers=headers, data=data, verify=False).json()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "These next instructions will require you to edit some of the code in the coding box below. \n",
    "\n",
    "**Replace the following red text in the box below with your institution's name, but keep the quotation marks:**\n",
    "\n",
    "**\"Washington University\"**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Searches are most successful when the metadata provided by an institution is consistent, but this is not always achieved. Institutions may use slight variations of a name, so it is important to complete a few trial searches using the SHARE website to determine which names are used for your institution. \n",
    "\n",
    "If your institution identifies itself consistently with only one name in SHARE metadata, **delete this piece of code from the box below: **\n",
    "\n",
    "\"query\" :  \"\\\"Washington University\\\"\"\n",
    "    \n",
    "and replace it with:\n",
    "\n",
    "\"query\": \"\\\"Your Institution Name\\\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "\n",
    "If your institution uses more than one name to identify itself, **replace the following piece of code from the box below with another name for your institution:** \n",
    "\n",
    "If your institution uses multiple names:\n",
    "\n",
    "\"query\": \"\\\"Your Institution Name\\\" OR \\\"ABBR\\\"\"\n",
    "\n",
    "\n",
    "*Remember to keep spacing consistent.*\n",
    "\n",
    "**When you are finished editing the code, execute the box below by pressing *Shift* and then *Enter*.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "affiliation_query = {\n",
    "    \"query\": {\n",
    "        \"bool\": {\n",
    "            \"must\": {\n",
    "                \"query_string\": {\n",
    "                    \"query\": \"*\"\n",
    "                }\n",
    "            },\n",
    "            \"filter\": [\n",
    "                {\n",
    "                    \"term\": {\n",
    "                        \"sources\": \"LawArXiv\"\n",
    "                    }\n",
    "                }\n",
    "            ]\n",
    "        }\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "If you have received any error messages after executing that cell, check one of the following:\n",
    "\n",
    "    * Did you spell the name of your institution correctly? Punctuation is important and must be exact. \n",
    "    * Did you remember to check the spacing of the punctuation marks? \n",
    "    \n",
    "\n",
    "**Execute the code in the box below in order to tell the program that you wish to perform the search. **"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/laurenbarker/.local/share/virtualenvs/notebook/lib/python3.5/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings\n",
      "  InsecureRequestWarning)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Number of results: 648\n"
     ]
    }
   ],
   "source": [
    "affiliation_results = query_share(search_url.url, affiliation_query)\n",
    "records = affiliation_results['hits']['hits']\n",
    "number_of_results = affiliation_results['hits']['total']\n",
    "\n",
    "print(\"Number of results:\", number_of_results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "This executed code may have generated several *InsecureRequestWarning* statements in a pink box, but these should not concern you. The statement appears because the SSL certificate that SHARE uses is more up-to-date than the SSL certificate being used. To get rid of this warning you can update the local SSL package.\n",
    "\n",
    "\n",
    "When the cell below is executed, it will generate a spreadsheet file with the search results and place the \n",
    "file on your computer. \n",
    "\n",
    "**Execute the box below by pressing *Shift* and then *Enter*.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dict_keys(['date_updated', 'funders', 'title', 'justification', 'hosts', 'description', 'date_created', 'lists', 'withdrawn', 'type', 'publishers', 'sources', 'registration_type', 'types', 'retracted', 'date_modified', 'date_published', 'contributors', 'id', 'language', 'tags', 'subjects', 'identifiers', 'date', 'subject_synonyms', 'affiliations'])\n"
     ]
    }
   ],
   "source": [
    "import csv\n",
    "fieldnames = requests.get(FIELDNAMES_URL).json()['share_customtax_1']['mappings']['creativeworks']['properties'].keys()\n",
    "print(fieldnames)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "---begin writing file---\n",
      "---done writing file---\n"
     ]
    }
   ],
   "source": [
    "import csv\n",
    "\n",
    "if not number_of_results:\n",
    "    raise ValueError('No results found. Try changing the query!')\n",
    "\n",
    "# get all Elasticsearch mappings for a creativework\n",
    "fieldnames = requests.get(FIELDNAMES_URL).json()['share_customtax_1']['mappings']['creativeworks']['properties'].keys()\n",
    "\n",
    "print('---begin writing file---')\n",
    "\n",
    "#set filename for generated csv\n",
    "SHARE_MATCHING_INSTITUTION_RECORDS = 'InstitutionalSHAREMetadata.csv'\n",
    "\n",
    "with open(SHARE_MATCHING_INSTITUTION_RECORDS, 'w') as csvfile:\n",
    "    writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction='ignore')\n",
    "    writer.writeheader()\n",
    "    for row in records:\n",
    "        writer.writerow(row['_source'])    \n",
    "        \n",
    "print('---done writing file---')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "If you see \n",
    "\n",
    "**---begin writing file---\n",
    "---done writing file---**\n",
    "\n",
    "under the box above, the code was successfully executed.\n",
    "\n",
    "*(Those warning messages may have appeared again, but ignore them.)*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Return to the file directory from which you launched this notebook. \n",
    "\n",
    "There you should find a file titled \"InstitutionalSHAREMetadata.csv\". Clicking on that file will open it in your default \n",
    "spreadsheet program, and you will be able to view the metadata for your institution. \n",
    "\n",
    "Each column will have a heading providing some context for the information. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Happy searching, and keep adding your metadata to the SHARE database!"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
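
The notebook asks readers to hand-edit the `"query"` line, including the escaped quotation marks and the `OR` between name variants. As a minimal sketch of an alternative, the same request body could be assembled from a plain list of names. The function `build_affiliation_query` and its `names` and `source` parameters are hypothetical and are not part of the notebook or of the SHARE API; only the dictionary layout is taken from the `affiliation_query` cell above.

```python
import json


def build_affiliation_query(names, source=None):
    """Build a search body that matches any of the given institution names.

    Hypothetical helper: mirrors the layout of the notebook's
    affiliation_query dictionary, but is not part of the notebook itself.
    """
    if not names:
        raise ValueError("Provide at least one institution name.")
    # Wrap each name in double quotes so it is searched as a phrase,
    # then join the phrases with OR, as the notebook text describes.
    phrases = ['"{}"'.format(name) for name in names]
    bool_query = {
        "must": {"query_string": {"query": " OR ".join(phrases)}}
    }
    if source is not None:
        # Optional: restrict results to one source, like the notebook's filter.
        bool_query["filter"] = [{"term": {"sources": source}}]
    return {"query": {"bool": bool_query}}


example_query = build_affiliation_query(["Your Institution Name", "ABBR"])
print(json.dumps(example_query, indent=4))
```

Assuming the helper cell has been run, the result could be passed to the notebook's own function as `query_share(search_url.url, example_query)`. Building the string in code avoids typing the backslash-escaped quotation marks by hand, which is where spacing and punctuation mistakes are most likely.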