@trevormunoz
Created November 18, 2016 19:58
CRGE Intersectional Research Database scrape
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extracting Intersectional Research Database Content"
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import csv\n",
"import requests\n",
"from bs4 import BeautifulSoup, NavigableString"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All the entries from the database can be accessed as a single HTML page, which can be parsed using the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library for easy access to the data. Use of class names is consistent and makes it easy to target different bits of the page."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"r = requests.get('http://ird.crge.umd.edu/list_entries.php')\n",
"parsed_html = BeautifulSoup(r.text, 'html.parser')\n",
"entries = parsed_html.find_all('div', 'ird_entry')"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"652"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(entries)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def get_type(entry):\n",
" it = entry.find('td', 'entry_num').stripped_strings\n",
" type_string = list(filter(lambda x: x != 'Datatype:', it))[0]\n",
" return type_string.lower()"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_citation(entry):\n",
" return entry.find('td', 'entry_citation').p.get_text()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(I noticed when eyeballing some of the annotations that there were some funky encoding issues — the data is not encoded as UTF-8 but rather using a funky Windows-specific encoding. Used an online [encoding detection widget](https://nlp.fi.muni.cz/projects/chared/) to guess the correct encoding so the requests library could handle it)"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_annotation(entry):\n",
" link = 'http://ird.crge.umd.edu/{0}'.format(entry.find('td', 'entry_controls').a.get('href'))\n",
" anno_request = requests.get(link)\n",
" anno_request.encoding = 'cp1252'\n",
" anno_div = BeautifulSoup(anno_request.text, 'html.parser').find('div', 'annotations')\n",
" data = list(filter(lambda x: x != 'Annotations', anno_div.strings))\n",
" anno_string = ''.join(data).strip()\n",
" return anno_string"
]
},
{
"cell_type": "code",
"execution_count": 148,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def get_id(entry):\n",
" return entry.find('td', 'entry_controls').a.get('href').split('=')[1]"
]
},
{
"cell_type": "code",
"execution_count": 166,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def get_keywords(entry):\n",
" kw_str = entry.find('td', 'entry_keywords').p.get_text()\n",
" return ';'.join(kw_str.split(','))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Turn each blob of HTML representing an entry into a dictionary …"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"db = []\n",
"for e in entries:\n",
" entry_parsed = {\n",
" 'id': get_id(e),\n",
" 'type': get_type(e),\n",
" 'citation': get_citation(e),\n",
" 'annotation': get_annotation(e),\n",
" 'keywords': get_keywords(e)\n",
" }\n",
" db.append(entry_parsed)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write the result to a CSV file"
]
},
{
"cell_type": "code",
"execution_count": 168,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"output_fieldnames = ['id', 'type', 'citation', 'annotation', 'keywords']\n",
"with open('/Users/umd-laptop/Downloads/crge_converted.csv', 'w') as outfile:\n",
" writer = csv.DictWriter(outfile, fieldnames=output_fieldnames)\n",
" \n",
" writer.writeheader()\n",
" \n",
" for item in db:\n",
" writer.writerow(item)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda env:html-tasks]",
"language": "python",
"name": "conda-env-html-tasks-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
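
The notebook's note about encoding can be illustrated offline. The following is only a sketch under stated assumptions, not part of the original notebook: the sample string and the helper name `decode_page` are hypothetical, and it uses plain `bytes.decode` from the standard library instead of the requests library. It shows why bytes written as Windows-1252 need to be decoded with `cp1252` rather than treated as UTF-8.

```python
# Hypothetical sample text with an en dash and curly quotes, the kind of
# punctuation that differs between Windows-1252 and UTF-8. It is not taken
# from the real database.
sample = 'Race, class \u2013 and \u201cintersectionality\u201d'

# Simulate the bytes a Windows-1252 page would send over the wire.
raw_bytes = sample.encode('cp1252')


def decode_page(raw, encoding):
    """Decode raw page bytes, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors='replace')


# Assuming UTF-8 turns the special punctuation into replacement characters.
wrong = decode_page(raw_bytes, 'utf-8')

# Using the codec the notebook sets on the response recovers the text.
right = decode_page(raw_bytes, 'cp1252')

print(repr(wrong))
print(repr(right))
```

With requests, the equivalent step is the one the notebook already takes: assign `'cp1252'` to the response's `encoding` attribute before reading its `text`.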