@psychemedia

psychemedia/apt.txt

Last active Nov 28, 2019
Sketches around National Archives indexes, PDFs etc
imagemagick
libmagickwand-dev
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# National Archives Index Explorer\n",
"\n",
"The National Archives provide a web based search interface for searching index catalogues of various National Archives collections.\n",
"\n",
"As well as a simple search box that does a free text search over all record columns (presumably?), we can also run advanced searches that can include reference and date limits.\n",
"\n",
"Search results containing the index records for your search hits can be downloaded as a CSV file.\n",
"\n",
"By searching for records associated with a particular collection tag / reference, we can obtain, and thence download, a copy of the collection's index records.\n",
"\n",
"We can then load these records into our own database and search them using our own search tools, as well as annotation the records using things like named entity recognition.\n",
"\n",
"So let's have a go at that..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Obtaining the Index Data\n",
"\n",
"Searching for index records associated with `HO-40-1` over the period `1800-15` leads us to a search results page with the URL:\n",
"\n",
"`https://discovery.nationalarchives.gov.uk/results/r?_cr=HO%2040-1&_dss=range&_sd=1810&_ed=1815&_ro=any&_st=adv`\n",
"\n",
"This HTTP GETs the URL `https://discovery.nationalarchives.gov.uk/results/r` with arguments:\n",
"\n",
"- `_cr:'HO 40-1'`\n",
"- `_dss:'range'`\n",
"- `_sd:1810`\n",
"- `_ed:1815`\n",
"\n",
"\n",
"To download the data records, we then need to click a form button, rather than a web link.\n",
"\n",
"We can automate this procedure by constructing the desired URL, with appropriate arguments, ensuring the correct form download options are set, \"click\" the download button and capture the response."
]
},
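{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check on how those arguments map onto the results page URL, here is a minimal sketch that uses only the Python standard library. It builds the query string but does not contact the server. Passing `quote_via=quote` is our own choice, made so that the space in the reference is encoded as `%20`, as it is in the URL above. The `_ro` and `_st` arguments that appear in that URL are left out here, as they are in the list above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from urllib.parse import urlencode, quote\n",
"\n",
"#Illustrative names: base_url and query are not used elsewhere in the notebook\n",
"base_url = 'https://discovery.nationalarchives.gov.uk/results/r'\n",
"query = {'_cr': 'HO 40-1', '_dss': 'range', '_sd': 1810, '_ed': 1815}\n",
"\n",
"#quote_via=quote percent-encodes the space as %20 rather than +\n",
"print(f'{base_url}?{urlencode(query, quote_via=quote)}')"
]
},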
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Mechanical soup is a combination of a simple virtual browser (mechanize) and\n",
"# a web scraping package (BeautifulSoup)\n",
"import mechanicalsoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Define the URL of the search results and download page:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"url='https://discovery.nationalarchives.gov.uk/results/r'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify the search limits around the collection we are interested in:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"params = {'_cr':'HO 40-1','_dss':'range','_sd':1810,'_ed':1815}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Open the page with those parameters:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Response [200]>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"browser = mechanicalsoup.StatefulBrowser()\n",
"browser.open(url, params=params)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Configure the search form:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"browser.select_form('form[action=\"/search/download\"]')\n",
"browser[\"expSize\"] = \"10\"\n",
"#browser.get_current_form().print_summary()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\"Click\" the download button:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"response = browser.submit_selected()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read the response into a *pandas* dataframe and preview the result, casting date fields into date format:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"#StringIO is a function for wrapping a file pointer around a string\n",
"from io import StringIO\n",
"\n",
"#Pandas is a package for working with tabular datasets\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Citable Reference</th>\n",
" <th>Context Description</th>\n",
" <th>Title</th>\n",
" <th>Description</th>\n",
" <th>Start Date</th>\n",
" <th>Start Date (num)</th>\n",
" <th>End Date</th>\n",
" <th>End Date (num)</th>\n",
" <th>Covering Dates</th>\n",
" <th>Held by</th>\n",
" <th>Catalogue level</th>\n",
" <th>References</th>\n",
" <th>Opening Date</th>\n",
" <th>Closure Status</th>\n",
" <th>Closure Type</th>\n",
" <th>Closure Code</th>\n",
" <th>Subjects</th>\n",
" <th>Digitised</th>\n",
" <th>ID</th>\n",
" <th>Score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>HO 40/1</td>\n",
" <td>Home Office: Disturbances Correspondence.</td>\n",
" <td>HO 40. The Luddite riots - reports</td>\n",
" <td>HO 40. The Luddite riots - reports.</td>\n",
" <td>1812-01-01</td>\n",
" <td>18120101</td>\n",
" <td>1855-12-31</td>\n",
" <td>18551231</td>\n",
" <td>1812-1855</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>Yes</td>\n",
" <td>C3083303</td>\n",
" <td>0.177554</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>HO 40/1/6</td>\n",
" <td>Home Office: Disturbances Correspondence. HO 4...</td>\n",
" <td>Lancashire. Lt. Gen. (copies of (1) above) Mai...</td>\n",
" <td>Lancashire. Lt. Gen. (copies of (1) above) Mai...</td>\n",
" <td>1812-05-01</td>\n",
" <td>18120501</td>\n",
" <td>1812-06-30</td>\n",
" <td>18120630</td>\n",
" <td>1812 May - June</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>7</td>\n",
" <td>\\r\\nFormer Reference Pro: HO 40/1/(6)</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>NaN</td>\n",
" <td>C6573173</td>\n",
" <td>0.158834</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HO 40/1/7</td>\n",
" <td>Home Office: Disturbances Correspondence. HO 4...</td>\n",
" <td>Yorkshire magistrates reports (copies of (1) a...</td>\n",
" <td>Yorkshire magistrates reports (copies of (1) a...</td>\n",
" <td>1812-03-01</td>\n",
" <td>18120301</td>\n",
" <td>1812-05-31</td>\n",
" <td>18120531</td>\n",
" <td>1812 Mar. - May</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>7</td>\n",
" <td>\\r\\nFormer Reference Pro: HO 40/1/(7)</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>NaN</td>\n",
" <td>C6573174</td>\n",
" <td>0.158834</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Citable Reference Context Description \\\n",
"0 HO 40/1 Home Office: Disturbances Correspondence. \n",
"1 HO 40/1/6 Home Office: Disturbances Correspondence. HO 4... \n",
"2 HO 40/1/7 Home Office: Disturbances Correspondence. HO 4... \n",
"\n",
" Title \\\n",
"0 HO 40. The Luddite riots - reports \n",
"1 Lancashire. Lt. Gen. (copies of (1) above) Mai... \n",
"2 Yorkshire magistrates reports (copies of (1) a... \n",
"\n",
" Description Start Date \\\n",
"0 HO 40. The Luddite riots - reports. 1812-01-01 \n",
"1 Lancashire. Lt. Gen. (copies of (1) above) Mai... 1812-05-01 \n",
"2 Yorkshire magistrates reports (copies of (1) a... 1812-03-01 \n",
"\n",
" Start Date (num) End Date End Date (num) Covering Dates \\\n",
"0 18120101 1855-12-31 18551231 1812-1855 \n",
"1 18120501 1812-06-30 18120630 1812 May - June \n",
"2 18120301 1812-05-31 18120531 1812 Mar. - May \n",
"\n",
" Held by Catalogue level \\\n",
"0 The National Archives, Kew 6 \n",
"1 The National Archives, Kew 7 \n",
"2 The National Archives, Kew 7 \n",
"\n",
" References Opening Date \\\n",
"0 NaN NaN \n",
"1 \\r\\nFormer Reference Pro: HO 40/1/(6) NaN \n",
"2 \\r\\nFormer Reference Pro: HO 40/1/(7) NaN \n",
"\n",
" Closure Status Closure Type \\\n",
"0 Open Document, Open Description Normal Closure before FOI Act: \n",
"1 Open Document, Open Description Normal Closure before FOI Act: \n",
"2 Open Document, Open Description Normal Closure before FOI Act: \n",
"\n",
" Closure Code Subjects Digitised ID Score \n",
"0 30 C10086 Public disorder Yes C3083303 0.177554 \n",
"1 30 C10086 Public disorder NaN C6573173 0.158834 \n",
"2 30 C10086 Public disorder NaN C6573174 0.158834 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(StringIO(response.text))\n",
"\n",
"#Force the start and end date columns into a date format\n",
"df['Start Date'] = pd.to_datetime(df['Start Date'],errors='coerce', dayfirst=True)\n",
"df['End Date'] = pd.to_datetime(df['End Date'],errors='coerce', dayfirst=True)\n",
"\n",
"df.head(3)"
]
},
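{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `errors='coerce'` and `dayfirst=True` arguments matter when date strings are ambiguous or malformed. A small sketch on made-up values (these strings are illustrative and are not taken from the download) shows the intended behaviour: a day-first string is read as day/month/year, and anything that cannot be parsed becomes `NaT` rather than raising an error."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"#Made-up example values, not taken from the National Archives download\n",
"sample = pd.Series(['01/05/1812', 'not a date'])\n",
"\n",
"#dayfirst=True reads 01/05/1812 as 1 May 1812; errors='coerce' maps unparseable values to NaT\n",
"pd.to_datetime(sample, errors='coerce', dayfirst=True)"
]
},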
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building Up A Larger Index\n",
"\n",
"We can build up a larger index by extending our search, or by combining the downloads from mutliple searches."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a function to do the download of a single index:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def get_index(reference, start=1810, end=1815, typ='ref'):\n",
" \"\"\"Download index for a specify reference and convert it to a dataframe.\"\"\"\n",
" \n",
" url='https://discovery.nationalarchives.gov.uk/results/r'\n",
" params = {'_dss':'range','_sd':start,'_ed':end}\n",
" \n",
" if typ=='search':\n",
" params['_q']=reference\n",
" else:\n",
" params['_cr']=reference\n",
" \n",
" browser = mechanicalsoup.StatefulBrowser()\n",
" browser.open(url, params=params)\n",
" \n",
" #No results\n",
" if browser.get_current_page().find(\"div\", {\"class\": \"emphasis-block no-results\"}):\n",
" return pd.DataFrame()\n",
" \n",
" browser.select_form('form[action=\"/search/download\"]')\n",
" browser[\"expSize\"] = \"10\"\n",
" \n",
" response = browser.submit_selected()\n",
" \n",
" _df = pd.read_csv(StringIO(response.text))\n",
"\n",
" #Force the start and end date columns into a date format\n",
" _df['Start Date'] = pd.to_datetime(_df['Start Date'], errors='coerce', dayfirst=True)\n",
" _df['End Date'] = pd.to_datetime(_df['End Date'], errors='coerce', dayfirst=True)\n",
" \n",
" return _df "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Citable Reference</th>\n",
" <th>Context Description</th>\n",
" <th>Title</th>\n",
" <th>Description</th>\n",
" <th>Start Date</th>\n",
" <th>Start Date (num)</th>\n",
" <th>End Date</th>\n",
" <th>End Date (num)</th>\n",
" <th>Covering Dates</th>\n",
" <th>Held by</th>\n",
" <th>Catalogue level</th>\n",
" <th>References</th>\n",
" <th>Opening Date</th>\n",
" <th>Closure Status</th>\n",
" <th>Closure Type</th>\n",
" <th>Closure Code</th>\n",
" <th>Subjects</th>\n",
" <th>Digitised</th>\n",
" <th>ID</th>\n",
" <th>Score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>HO 42</td>\n",
" <td>Home Office: Domestic Correspondence, George III.</td>\n",
" <td>Home Office: Domestic Correspondence, George III</td>\n",
" <td>Original Home Office domestic letters. PLEASE ...</td>\n",
" <td>1782-01-01</td>\n",
" <td>17820101</td>\n",
" <td>1820-12-31</td>\n",
" <td>18201231</td>\n",
" <td>1782-1820</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>C8906</td>\n",
" <td>0.032754</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>HO 42/108</td>\n",
" <td>Home Office: Domestic Correspondence, George III.</td>\n",
" <td>HO 42. Letters and Papers. Supplementary.</td>\n",
" <td>HO 42. Letters and Papers. Supplementary.</td>\n",
" <td>1810-07-01</td>\n",
" <td>18100701</td>\n",
" <td>1810-10-31</td>\n",
" <td>18101031</td>\n",
" <td>1810 July 01-1810 Oct 31</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>C1905727</td>\n",
" <td>0.026892</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HO 42/111</td>\n",
" <td>Home Office: Domestic Correspondence, George III.</td>\n",
" <td>HO 42. Letters and Papers</td>\n",
" <td>HO 42. Letters and Papers.</td>\n",
" <td>1811-04-01</td>\n",
" <td>18110401</td>\n",
" <td>1811-06-30</td>\n",
" <td>18110630</td>\n",
" <td>1811 Apr 01-1811 June 30</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>NaN</td>\n",
" <td>Yes</td>\n",
" <td>C1905730</td>\n",
" <td>0.026892</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Citable Reference Context Description \\\n",
"0 HO 42 Home Office: Domestic Correspondence, George III. \n",
"1 HO 42/108 Home Office: Domestic Correspondence, George III. \n",
"2 HO 42/111 Home Office: Domestic Correspondence, George III. \n",
"\n",
" Title \\\n",
"0 Home Office: Domestic Correspondence, George III \n",
"1 HO 42. Letters and Papers. Supplementary. \n",
"2 HO 42. Letters and Papers \n",
"\n",
" Description Start Date \\\n",
"0 Original Home Office domestic letters. PLEASE ... 1782-01-01 \n",
"1 HO 42. Letters and Papers. Supplementary. 1810-07-01 \n",
"2 HO 42. Letters and Papers. 1811-04-01 \n",
"\n",
" Start Date (num) End Date End Date (num) Covering Dates \\\n",
"0 17820101 1820-12-31 18201231 1782-1820 \n",
"1 18100701 1810-10-31 18101031 1810 July 01-1810 Oct 31 \n",
"2 18110401 1811-06-30 18110630 1811 Apr 01-1811 June 30 \n",
"\n",
" Held by Catalogue level References Opening Date \\\n",
"0 The National Archives, Kew 3 NaN NaN \n",
"1 The National Archives, Kew 6 NaN NaN \n",
"2 The National Archives, Kew 6 NaN NaN \n",
"\n",
" Closure Status Closure Type \\\n",
"0 NaN Normal Closure before FOI Act: \n",
"1 Open Document, Open Description Normal Closure before FOI Act: \n",
"2 Open Document, Open Description Normal Closure before FOI Act: \n",
"\n",
" Closure Code Subjects Digitised ID Score \n",
"0 30 NaN NaN C8906 0.032754 \n",
"1 30 NaN Yes C1905727 0.026892 \n",
"2 30 NaN Yes C1905730 0.026892 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_index('HO 42').head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that some searches seem to be quite wideranging against particular codes (rather than lookups *by reference*), and some responses also appear to contain transcipts in the `Description` field."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Report of Soulden Lawrence on 16 individual petitions (13 from the prisoner; H Neale, officer of marines; Mr Castle, Clerk of the Crown for Durham and A Graham) and 4 collective petitions (34 members of the corporation of Durham; 2 others (31 and 34 people) with similar signatories and 3 people, the prisoner and others of London) on behalf of John Davison, late a captain in the Royal Marines, convicted at the Somerset Assizes held in Taunton in August 1809, for the theft of 6 yards of muslin, va',\n",
" 'General registers, early warrant and entry books and other records covering the multifarious subjects for which the Home Office has had responsibility; also records of subjects which do not fit into other divisional categories. Broadly, the subjects and their series in this division are as follows: Addresses, HO 55, HO 57, HO 249 Admiralty, HO 28, HO 29 Advertisements, HO 174 Animals and wild birds, HO 183, HO 285 Automatic data processing, HO 337 Betting, gaming and lotteries, HO 320 Bouillon p',\n",
" \"Board and Committee minutes HO.RVI/1/1-7 Board of Governors minutes, Court of Governors before 1948. 1911 - 1971 (7 volumes) See also HO.RVI/47 for Joint Minutes with Regional Hospital Board. HO.RVI/2/1-61 House Committee minutes, 1751 - 1971 (61 volumes) Volumes 1-4, 6-11, 17-18 contain patients' admissiosn and discharges. HO.RVI/151/1-6 Rough House Committee minutes, 1752 - 1755 (6 volumes) HO.RVI/3/1-2 Anaesthetic Committee minutes, 1924 - 1948 (2 volumes) HO.RVI/4/1-2 Appeal Committee minute\",\n",
" 'Administration HO.PM/1/1-18 Minutes 1760 - 1945 From 1760 to 1822 weekly court minutes, to 1900 also House Committee minutes, from 1900 also Finance Committee and Management Committee minutes. (18 volumes, 27 papers) HO.PM/2 Charity for the Relief of Poor Women Lying-in at Their Own Homes, minutes 1787 - 1858 Lying-in hospital House Committee minutes, 1859 (1 volume) HO.PM/3/1-3 Medical Staff meetings minutes, 1917 - 1951 (3 volumes) HO.PM/45 Honorary medical staff meetings minutes, 1935 - 1949 ',\n",
" \"Report of Alexander Thomson on 1 individual petition (the prisoner [detailed, gives information concerning family and business]) on behalf of Peter Degraves, merchant of London and Manchester, Lancashire, tried at the 'last' Lancaster Assizes held in 1810 and convicted of stealing a large quantity of goods including French cambrics, value between £2,000-3,000, property of John Parson, merchant of Manchester, from the warehouse of Thomas Benbridge/Thomas Bainbridge. Evidences supplied by John Pa\",\n",
" '1707-1812 Watchet Harbour, copies of Acts, 1707-08, 1720-21. 1770. 1809 printed and ms.) with petitions etc.; copies of Minehead Harbour Act 1711; accounts and estimates for repair c.1720 with undated agreement to build a quay at Watchet by Wm. Rowe of Bridgwater, mason, 1708; petitions re. need for improvements, 1811 with correspondence, 1812. 1 bundle 1707-1809 ms copies of Watchet Harbour Acts (as above). 1 volume 1772-1808 Watchet Quay maintenance accounts 1772-1808 (1 volume) 1782-1808 (2 c',\n",
" \"HIL/1 Records of firm and Hillman family HIL/2 Co-partnership agreements HIL/3 Apprenticeship indentures HIL/4 Assignments of debts HIL/5 Wills and executorship papers HIL/6 Title deeds of clients' properties HIL/6/1-14 Lewes: St Thomas at Cliffe HIL/6/15-29 Lewes: other parishes HIL/6/30-32 Alfriston HIL/6/33 Arlington HIL/6/34-36 Barcombe HIL/6/37 Bishopstone HIL/6/38-40 Brighton HIL/6/41 Ditchling HIL/6/42-48 Eastbourne HIL/6/49 Framfield HIL/6/50 Friston HIL/6/51-53 Hailsham HIL/6/54 Helling\",\n",
" 'SUMMARY OF CONTENTS L/C Lieutenancy - County L/C/D Deputy Lieutenancy L/C/C County - commissions L/C/C/1 Original commissions; 1853-1861 L/C/C/2 Letters of royal approval; 1835-1913 L/C/C/3 Correspondence and papers; 1778-1915 L/C/C/4 Draft commissions and precedents; 1804-1870 L/C/C/5 Lists of Deputy Lieutenants; 1807-1852 L/C/G General meetings L/C/M Militia L/C/M/1 Lists of men enrolled; 1803-1855 L/C/M/2 Subdivisional returns; 1806-1831 L/C/M/3 Regimental returns; 1804-1874 L/C/M/4 Correspon']"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Pull out the first 500 characters of records longer than 3000 characters\n",
"[r[:500] for r in get_index('HO 42',typ='search')['Description'].to_list() if len(r)>2000]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use that function to download and combine indexes for multiple references:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"search_references = ['HO 40-1', 'HO 40-2', 'HO 43-19', 'HO 43-20', 'HO 43-21', 'HO 42-110']"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"HO 40-1: 9\n",
"HO 40-2: 10\n",
"HO 43-19: 1\n",
"HO 43-20: 1\n",
"HO 43-21: 1\n",
"HO 42-110: 1\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Citable Reference</th>\n",
" <th>Context Description</th>\n",
" <th>Title</th>\n",
" <th>Description</th>\n",
" <th>Start Date</th>\n",
" <th>Start Date (num)</th>\n",
" <th>End Date</th>\n",
" <th>End Date (num)</th>\n",
" <th>Covering Dates</th>\n",
" <th>Held by</th>\n",
" <th>Catalogue level</th>\n",
" <th>References</th>\n",
" <th>Opening Date</th>\n",
" <th>Closure Status</th>\n",
" <th>Closure Type</th>\n",
" <th>Closure Code</th>\n",
" <th>Subjects</th>\n",
" <th>Digitised</th>\n",
" <th>ID</th>\n",
" <th>Score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>HO 40/1</td>\n",
" <td>Home Office: Disturbances Correspondence.</td>\n",
" <td>HO 40. The Luddite riots - reports</td>\n",
" <td>HO 40. The Luddite riots - reports.</td>\n",
" <td>1812-01-01</td>\n",
" <td>18120101</td>\n",
" <td>1855-12-31</td>\n",
" <td>18551231</td>\n",
" <td>1812-1855</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>Yes</td>\n",
" <td>C3083303</td>\n",
" <td>0.177578</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>HO 40/1/1</td>\n",
" <td>Home Office: Disturbances Correspondence. HO 4...</td>\n",
" <td>Cheshire, Lancashire, Yorkshire ff 1-173 ff 17...</td>\n",
" <td>Cheshire, Lancashire, Yorkshire ff 1-173 ff 17...</td>\n",
" <td>1812-03-01</td>\n",
" <td>18120301</td>\n",
" <td>1812-06-30</td>\n",
" <td>18120630</td>\n",
" <td>1812 Mar. - June</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>7</td>\n",
" <td>\\r\\nFormer Reference Pro: HO 40/1/(1)</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>NaN</td>\n",
" <td>C6573168</td>\n",
" <td>0.152372</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HO 40/1/2</td>\n",
" <td>Home Office: Disturbances Correspondence. HO 4...</td>\n",
" <td>Cheshire magistrates reports (copies of (1) ab...</td>\n",
" <td>Cheshire magistrates reports (copies of (1) ab...</td>\n",
" <td>1812-03-01</td>\n",
" <td>18120301</td>\n",
" <td>1812-06-30</td>\n",
" <td>18120630</td>\n",
" <td>1812 Mar. - June</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>7</td>\n",
" <td>\\r\\nFormer Reference Pro: HO 40/1/(2)</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>NaN</td>\n",
" <td>C6573169</td>\n",
" <td>0.152372</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>HO 40/1/3</td>\n",
" <td>Home Office: Disturbances Correspondence. HO 4...</td>\n",
" <td>Lancashire magistrates reports (copies of (1) ...</td>\n",
" <td>Lancashire magistrates reports (copies of (1) ...</td>\n",
" <td>1812-03-01</td>\n",
" <td>18120301</td>\n",
" <td>1812-05-31</td>\n",
" <td>18120531</td>\n",
" <td>1812 Mar. - May</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>7</td>\n",
" <td>\\r\\nFormer Reference Pro: HO 40/1/(3)</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>NaN</td>\n",
" <td>C6573170</td>\n",
" <td>0.156902</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>HO 40/1/4</td>\n",
" <td>Home Office: Disturbances Correspondence. HO 4...</td>\n",
" <td>Lancashire magistrates reports (copies of (1) ...</td>\n",
" <td>Lancashire magistrates reports (copies of (1) ...</td>\n",
" <td>1812-03-01</td>\n",
" <td>18120301</td>\n",
" <td>1812-06-30</td>\n",
" <td>18120630</td>\n",
" <td>1812 Mar. - June</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>7</td>\n",
" <td>\\r\\nFormer Reference Pro: HO 40/1/(4)</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>NaN</td>\n",
" <td>C6573171</td>\n",
" <td>0.157119</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Citable Reference Context Description \\\n",
"0 HO 40/1 Home Office: Disturbances Correspondence. \n",
"1 HO 40/1/1 Home Office: Disturbances Correspondence. HO 4... \n",
"2 HO 40/1/2 Home Office: Disturbances Correspondence. HO 4... \n",
"3 HO 40/1/3 Home Office: Disturbances Correspondence. HO 4... \n",
"4 HO 40/1/4 Home Office: Disturbances Correspondence. HO 4... \n",
"\n",
" Title \\\n",
"0 HO 40. The Luddite riots - reports \n",
"1 Cheshire, Lancashire, Yorkshire ff 1-173 ff 17... \n",
"2 Cheshire magistrates reports (copies of (1) ab... \n",
"3 Lancashire magistrates reports (copies of (1) ... \n",
"4 Lancashire magistrates reports (copies of (1) ... \n",
"\n",
" Description Start Date \\\n",
"0 HO 40. The Luddite riots - reports. 1812-01-01 \n",
"1 Cheshire, Lancashire, Yorkshire ff 1-173 ff 17... 1812-03-01 \n",
"2 Cheshire magistrates reports (copies of (1) ab... 1812-03-01 \n",
"3 Lancashire magistrates reports (copies of (1) ... 1812-03-01 \n",
"4 Lancashire magistrates reports (copies of (1) ... 1812-03-01 \n",
"\n",
" Start Date (num) End Date End Date (num) Covering Dates \\\n",
"0 18120101 1855-12-31 18551231 1812-1855 \n",
"1 18120301 1812-06-30 18120630 1812 Mar. - June \n",
"2 18120301 1812-06-30 18120630 1812 Mar. - June \n",
"3 18120301 1812-05-31 18120531 1812 Mar. - May \n",
"4 18120301 1812-06-30 18120630 1812 Mar. - June \n",
"\n",
" Held by Catalogue level \\\n",
"0 The National Archives, Kew 6 \n",
"1 The National Archives, Kew 7 \n",
"2 The National Archives, Kew 7 \n",
"3 The National Archives, Kew 7 \n",
"4 The National Archives, Kew 7 \n",
"\n",
" References Opening Date \\\n",
"0 NaN NaN \n",
"1 \\r\\nFormer Reference Pro: HO 40/1/(1) NaN \n",
"2 \\r\\nFormer Reference Pro: HO 40/1/(2) NaN \n",
"3 \\r\\nFormer Reference Pro: HO 40/1/(3) NaN \n",
"4 \\r\\nFormer Reference Pro: HO 40/1/(4) NaN \n",
"\n",
" Closure Status Closure Type \\\n",
"0 Open Document, Open Description Normal Closure before FOI Act: \n",
"1 Open Document, Open Description Normal Closure before FOI Act: \n",
"2 Open Document, Open Description Normal Closure before FOI Act: \n",
"3 Open Document, Open Description Normal Closure before FOI Act: \n",
"4 Open Document, Open Description Normal Closure before FOI Act: \n",
"\n",
" Closure Code Subjects Digitised ID Score \n",
"0 30 C10086 Public disorder Yes C3083303 0.177578 \n",
"1 30 C10086 Public disorder NaN C6573168 0.152372 \n",
"2 30 C10086 Public disorder NaN C6573169 0.152372 \n",
"3 30 C10086 Public disorder NaN C6573170 0.156902 \n",
"4 30 C10086 Public disorder NaN C6573171 0.157119 "
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_combined = pd.DataFrame()\n",
"\n",
"for reference in search_references:\n",
" _df = get_index(reference)\n",
" print(f'{reference}: {len(_df)}')\n",
" df_combined = df_combined.append( _df )\n",
" \n",
"df_combined = df_combined.sort_values('Citable Reference').reset_index(drop=True)\n",
"df_combined.head()"
]
},
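{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the same record is returned by more than one search, it will appear more than once in the combined table. One possible way to handle that, sketched here on a toy table rather than on the real download, is to drop repeats using the `ID` column, on the assumption that `ID` uniquely identifies a record."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"#Toy rows for illustration only: the second search repeats one record from the first\n",
"toy_a = pd.DataFrame({'ID': ['C3083303', 'C6573168'], 'Citable Reference': ['HO 40/1', 'HO 40/1/1']})\n",
"toy_b = pd.DataFrame({'ID': ['C6573168', 'C6573169'], 'Citable Reference': ['HO 40/1/1', 'HO 40/1/2']})\n",
"\n",
"toy_combined = pd.concat([toy_a, toy_b])\n",
"\n",
"#Keep the first copy of each ID, then tidy the index\n",
"toy_combined.drop_duplicates(subset='ID').reset_index(drop=True)"
]
},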
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get a better view over the descriptions:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['HO 40. The Luddite riots - reports.',\n",
" 'Cheshire, Lancashire, Yorkshire ff 1-173 ff 174-283.',\n",
" 'Cheshire magistrates reports (copies of (1) above) ff 284-341.',\n",
" 'Lancashire magistrates reports (copies of (1) above) ff 342-371.',\n",
" 'Lancashire magistrates reports (copies of (1) above) ff 372-471.',\n",
" 'Enclosures to a letter dated (copies of (1) above) 16 May, 1812 in (4) above ff 472-485.',\n",
" \"Lancashire. Lt. Gen. (copies of (1) above) Maitland's reports ff 486-540.\",\n",
" 'Yorkshire magistrates reports (copies of (1) above) ff 541-596.',\n",
" 'Yorkshire Sir Francis Lindley (copies of (1) above) Wood, Vice Lt. West Riding; reports ff 597-624.',\n",
" 'HO 40. The Luddite riots - military reports.',\n",
" 'Cheshire ff 1a - 115.',\n",
" 'Lancashire ff 116-253.',\n",
" 'Yorkshire ff 254-399 ff 400-562.',\n",
" 'Chelmsford, London and miscellaneous ff 563-646.',\n",
" 'Notebook containing names of known and suspected Luddites.',\n",
" 'Notebook containing various payments to constables, etc.',\n",
" 'Copies of letters addressed to Lt. Gen. Maitland.',\n",
" 'Copies of letters addressed to Lt. Gen. Maitland.',\n",
" 'Copies of letters addressed to Lt. Gen. Maitland.',\n",
" 'HO 42. Letters and Papers.',\n",
" 'Domestic Letter Book.',\n",
" 'Domestic Letter Book.',\n",
" 'Domestic Letter Book.']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_combined['Description'].to_list()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extract Named Entities\n",
"\n",
"The title field appears to be a subset of the description field (up to the first N characters).\n",
"\n",
"We can parse named entities out of the description field to make searching the records easier.\n",
"\n",
"The `spacy` natural language processing (NLP) package provides a named entity tagger that is good enough to get us started."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"import spacy"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"#Install the package that provides the named entity model\n",
"#!python -m spacy download en_core_web_sm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's an example of running the named entity tagger:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Joseph Radcliffe 0 16 PERSON\n",
"the Home Office 36 51 ORG\n",
"March 5th, 1812 55 70 DATE\n",
"Luddites 81 89 GPE\n"
]
}
],
"source": [
"nlp = spacy.load(\"en_core_web_sm\")\n",
"\n",
"TEST_STRING = \"Joseph Radcliffe, wrote a letter to the Home Office on March 5th, 1812 about the Luddites.\"\n",
"\n",
"doc = nlp(TEST_STRING)\n",
"\n",
"for ent in doc.ents:\n",
" print(ent.text, ent.start_char, ent.end_char, ent.label_)"
]
},
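{
"cell_type": "markdown",
"metadata": {},
"source": [
"The start and end values printed for each entity can be checked directly in plain Python. Using offsets copied from the output above, slicing the original sentence with each pair recovers the entity text, which means the second number is the position just after the entity's last character."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Same sentence as the example above\n",
"sentence = 'Joseph Radcliffe, wrote a letter to the Home Office on March 5th, 1812 about the Luddites.'\n",
"\n",
"#(start, end, label) values copied from the printed output above\n",
"spans = [(0, 16, 'PERSON'), (36, 51, 'ORG'), (55, 70, 'DATE'), (81, 89, 'GPE')]\n",
"\n",
"for start, end, label in spans:\n",
"    #Python slices exclude the end position, so sentence[start:end] is the entity text\n",
"    print(sentence[start:end], label)"
]
},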
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*`GPE` is a \"geo-political entity\". There is also a related `NORP`: \"nationalities or religious or political groups\".* The numbers are the index values identifying the first and last character of the extracted string in the original string.\n",
"\n",
"We can create a simple function to pull out the elements we want, returning a list of all elements extracted from a block of text."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"def entity_rec(txt):\n",
"    \"\"\"Extract entities from text and return a list of (entity text, entity type) tuples.\"\"\"\n",
"\n",
"    doc = nlp(txt)\n",
"\n",
"    ents = []\n",
"    for ent in doc.ents:\n",
"        #ents.append((ent.text, ent.start_char, ent.end_char, ent.label_))\n",
"        #Exclude certain entity types from the returned list\n",
"        if ent.label_ not in ['CARDINAL']:\n",
"            ents.append((ent.text, ent.label_))\n",
"\n",
"    return ents"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can apply this function to the `Description` text associated with each row:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Citable Reference</th>\n",
" <th>Context Description</th>\n",
" <th>Title</th>\n",
" <th>Description</th>\n",
" <th>Start Date</th>\n",
" <th>Start Date (num)</th>\n",
" <th>End Date</th>\n",
" <th>End Date (num)</th>\n",
" <th>Covering Dates</th>\n",
" <th>Held by</th>\n",
" <th>...</th>\n",
" <th>References</th>\n",
" <th>Opening Date</th>\n",
" <th>Closure Status</th>\n",
" <th>Closure Type</th>\n",
" <th>Closure Code</th>\n",
" <th>Subjects</th>\n",
" <th>Digitised</th>\n",
" <th>ID</th>\n",
" <th>Score</th>\n",
" <th>Entities</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>HO 40/1</td>\n",
" <td>Home Office: Disturbances Correspondence.</td>\n",
" <td>HO 40. The Luddite riots - reports</td>\n",
" <td>HO 40. The Luddite riots - reports.</td>\n",
" <td>1812-01-01</td>\n",
" <td>18120101</td>\n",
" <td>1855-12-31</td>\n",
" <td>18551231</td>\n",
" <td>1812-1855</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>Yes</td>\n",
" <td>C3083303</td>\n",
" <td>0.177554</td>\n",
" <td>[(Luddite, NORP)]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>HO 40/1/6</td>\n",
" <td>Home Office: Disturbances Correspondence. HO 4...</td>\n",
" <td>Lancashire. Lt. Gen. (copies of (1) above) Mai...</td>\n",
" <td>Lancashire. Lt. Gen. (copies of (1) above) Mai...</td>\n",
" <td>1812-05-01</td>\n",
" <td>18120501</td>\n",
" <td>1812-06-30</td>\n",
" <td>18120630</td>\n",
" <td>1812 May - June</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>...</td>\n",
" <td>\\r\\nFormer Reference Pro: HO 40/1/(6)</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>NaN</td>\n",
" <td>C6573173</td>\n",
" <td>0.158834</td>\n",
" <td>[(Lancashire, ORG), (Maitland, GPE)]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>HO 40/1/7</td>\n",
" <td>Home Office: Disturbances Correspondence. HO 4...</td>\n",
" <td>Yorkshire magistrates reports (copies of (1) a...</td>\n",
" <td>Yorkshire magistrates reports (copies of (1) a...</td>\n",
" <td>1812-03-01</td>\n",
" <td>18120301</td>\n",
" <td>1812-05-31</td>\n",
" <td>18120531</td>\n",
" <td>1812 Mar. - May</td>\n",
" <td>The National Archives, Kew</td>\n",
" <td>...</td>\n",
" <td>\\r\\nFormer Reference Pro: HO 40/1/(7)</td>\n",
" <td>NaN</td>\n",
" <td>Open Document, Open Description</td>\n",
" <td>Normal Closure before FOI Act:</td>\n",
" <td>30</td>\n",
" <td>C10086 Public disorder</td>\n",
" <td>NaN</td>\n",
" <td>C6573174</td>\n",
" <td>0.158834</td>\n",
" <td>[(Yorkshire, PERSON)]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>3 rows × 21 columns</p>\n",
"</div>"
],
"text/plain": [
" Citable Reference Context Description \\\n",
"0 HO 40/1 Home Office: Disturbances Correspondence. \n",
"1 HO 40/1/6 Home Office: Disturbances Correspondence. HO 4... \n",
"2 HO 40/1/7 Home Office: Disturbances Correspondence. HO 4... \n",
"\n",
" Title \\\n",
"0 HO 40. The Luddite riots - reports \n",
"1 Lancashire. Lt. Gen. (copies of (1) above) Mai... \n",
"2 Yorkshire magistrates reports (copies of (1) a... \n",
"\n",
" Description Start Date \\\n",
"0 HO 40. The Luddite riots - reports. 1812-01-01 \n",
"1 Lancashire. Lt. Gen. (copies of (1) above) Mai... 1812-05-01 \n",
"2 Yorkshire magistrates reports (copies of (1) a... 1812-03-01 \n",
"\n",
" Start Date (num) End Date End Date (num) Covering Dates \\\n",
"0 18120101 1855-12-31 18551231 1812-1855 \n",
"1 18120501 1812-06-30 18120630 1812 May - June \n",
"2 18120301 1812-05-31 18120531 1812 Mar. - May \n",
"\n",
" Held by ... References \\\n",
"0 The National Archives, Kew ... NaN \n",
"1 The National Archives, Kew ... \\r\\nFormer Reference Pro: HO 40/1/(6) \n",
"2 The National Archives, Kew ... \\r\\nFormer Reference Pro: HO 40/1/(7) \n",
"\n",
" Opening Date Closure Status \\\n",
"0 NaN Open Document, Open Description \n",
"1 NaN Open Document, Open Description \n",
"2 NaN Open Document, Open Description \n",
"\n",
" Closure Type Closure Code Subjects \\\n",
"0 Normal Closure before FOI Act: 30 C10086 Public disorder \n",
"1 Normal Closure before FOI Act: 30 C10086 Public disorder \n",
"2 Normal Closure before FOI Act: 30 C10086 Public disorder \n",
"\n",
" Digitised ID Score Entities \n",
"0 Yes C3083303 0.177554 [(Luddite, NORP)] \n",
"1 NaN C6573173 0.158834 [(Lancashire, ORG), (Maitland, GPE)] \n",
"2 NaN C6573174 0.158834 [(Yorkshire, PERSON)] \n",
"\n",
"[3 rows x 21 columns]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Entities'] = df['Description'].apply(lambda x: entity_rec(x))\n",
"df.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then generate a long format data frame that associates each entity tuple with its record, as identified by the record `ID`:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Entities</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>C3083303</td>\n",
" <td>(Luddite, NORP)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>C6573173</td>\n",
" <td>(Lancashire, ORG)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>C6573173</td>\n",
" <td>(Maitland, GPE)</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Entities\n",
"0 C3083303 (Luddite, NORP)\n",
"1 C6573173 (Lancashire, ORG)\n",
"2 C6573173 (Maitland, GPE)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_entities = df.explode('Entities').reset_index(drop=True)[['ID','Entities']]\n",
"df_entities.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then split out the entity tuple elements into separate columns, noting that the entity type recognition, as well as the entity extraction, may be a bit ropey:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Entity</th>\n",
" <th>Type</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>C3083303</td>\n",
" <td>Luddite</td>\n",
" <td>NORP</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>C6573173</td>\n",
" <td>Lancashire</td>\n",
" <td>ORG</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>C6573173</td>\n",
" <td>Maitland</td>\n",
" <td>GPE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>C6573174</td>\n",
" <td>Yorkshire</td>\n",
" <td>PERSON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>C6573175</td>\n",
" <td>Yorkshire Sir</td>\n",
" <td>PERSON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>C6573175</td>\n",
" <td>Francis Lindley</td>\n",
" <td>PERSON</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>C6573175</td>\n",
" <td>West Riding</td>\n",
" <td>GPE</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>C6573171</td>\n",
" <td>Lancashire</td>\n",
" <td>ORG</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>C6573170</td>\n",
" <td>Lancashire</td>\n",
" <td>ORG</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>C6573168</td>\n",
" <td>Cheshire</td>\n",
" <td>ORG</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Entity Type\n",
"0 C3083303 Luddite NORP\n",
"1 C6573173 Lancashire ORG\n",
"2 C6573173 Maitland GPE\n",
"3 C6573174 Yorkshire PERSON\n",
"4 C6573175 Yorkshire Sir PERSON\n",
"5 C6573175 Francis Lindley PERSON\n",
"6 C6573175 West Riding GPE\n",
"7 C6573171 Lancashire ORG\n",
"8 C6573170 Lancashire ORG\n",
"9 C6573168 Cheshire ORG"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_entities[['Entity','Type']] = df_entities['Entities'].apply(pd.Series)\n",
"df_entities.drop(columns='Entities', inplace=True)\n",
"df_entities.head(10)"
]
},
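{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output above tags several counties as `ORG` or `PERSON`. One possible clean-up, sketched here under stated assumptions, is to post-process the `(text, type)` tuples returned by `entity_rec()`: relabel anything found in a hand-maintained set of county names, and try to turn `DATE` strings into `datetime.date` objects. The names `KNOWN_COUNTIES`, `relabel_counties()` and `parse_date_entity()` are hypothetical, the county set is deliberately tiny, and only two date formats are attempted, so this is an illustration rather than a robust parser."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"from datetime import datetime\n",
"\n",
"#Hypothetical, hand-maintained lookup set; extend as required\n",
"KNOWN_COUNTIES = {\"Cheshire\", \"Lancashire\", \"Yorkshire\"}\n",
"\n",
"def relabel_counties(ents, counties=KNOWN_COUNTIES):\n",
"    \"\"\"Relabel (text, type) tuples whose text is a known county as GPE.\"\"\"\n",
"    return [(text, \"GPE\") if text in counties else (text, label)\n",
"            for text, label in ents]\n",
"\n",
"def parse_date_entity(text, formats=(\"%B %d, %Y\", \"%d %B %Y\")):\n",
"    \"\"\"Try to parse a DATE entity string; return a date, or None on failure.\"\"\"\n",
"    #Strip ordinal suffixes, e.g. 5th -> 5\n",
"    cleaned = re.sub(r\"(\\d+)(st|nd|rd|th)\\b\", r\"\\1\", text)\n",
"    for fmt in formats:\n",
"        try:\n",
"            return datetime.strptime(cleaned, fmt).date()\n",
"        except ValueError:\n",
"            continue\n",
"    return None\n",
"\n",
"print(relabel_counties([(\"Lancashire\", \"ORG\"), (\"Maitland\", \"GPE\")]))\n",
"print(parse_date_entity(\"March 5th, 1812\"))"
]
},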
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we wanted to work on this a bit more, it would be handy to be able to recognise English county names and place names as such. We could also try to run any `DATE` elements through a robust date parser in order to get the dates into an actual date object.\n",
"\n",
"Another useful piece of information is the folio / page number range given in each description."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ff 1-173', 'ff 174-283']"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"\n",
"TEST_STRING_2 = \"Cheshire, Lancashire, Yorkshire ff 1-173 ff 174-283.\"\n",
"\n",
"FF_PATTERN = r\"ff \\d+-\\d+\"\n",
"\n",
"m = re.findall(FF_PATTERN, TEST_STRING_2, re.MULTILINE)\n",
"m"
]
},
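{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the numeric values are wanted directly, a variant of the pattern with capture groups makes `re.findall()` return `(start, end)` tuples of digit strings, which can then be cast to integers. This is a minimal sketch; `FF_RANGE_PATTERN` and `folio_ranges` are illustrative names that are not used elsewhere in the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"TEST_STRING_2 = \"Cheshire, Lancashire, Yorkshire ff 1-173 ff 174-283.\"\n",
"\n",
"#Capture groups around each number; findall then returns tuples of strings\n",
"FF_RANGE_PATTERN = r\"ff (\\d+)-(\\d+)\"\n",
"\n",
"folio_ranges = [(int(start), int(end))\n",
"                for start, end in re.findall(FF_RANGE_PATTERN, TEST_STRING_2)]\n",
"folio_ranges"
]
},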
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, we can capture these for each record, initially as a list in a new column:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Description</th>\n",
" <th>Pages</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>HO 40. The Luddite riots - reports.</td>\n",
" <td>[]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Lancashire. Lt. Gen. (copies of (1) above) Mai...</td>\n",
" <td>[ff 486-540]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Yorkshire magistrates reports (copies of (1) a...</td>\n",
" <td>[ff 541-596]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Yorkshire Sir Francis Lindley (copies of (1) a...</td>\n",
" <td>[ff 597-624]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Lancashire magistrates reports (copies of (1) ...</td>\n",
" <td>[ff 372-471]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Lancashire magistrates reports (copies of (1) ...</td>\n",
" <td>[ff 342-371]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Cheshire, Lancashire, Yorkshire ff 1-173 ff 17...</td>\n",
" <td>[ff 1-173, ff 174-283]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Cheshire magistrates reports (copies of (1) ab...</td>\n",
" <td>[ff 284-341]</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Enclosures to a letter dated (copies of (1) ab...</td>\n",
" <td>[ff 472-485]</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Description Pages\n",
"0 HO 40. The Luddite riots - reports. []\n",
"1 Lancashire. Lt. Gen. (copies of (1) above) Mai... [ff 486-540]\n",
"2 Yorkshire magistrates reports (copies of (1) a... [ff 541-596]\n",
"3 Yorkshire Sir Francis Lindley (copies of (1) a... [ff 597-624]\n",
"4 Lancashire magistrates reports (copies of (1) ... [ff 372-471]\n",
"5 Lancashire magistrates reports (copies of (1) ... [ff 342-371]\n",
"6 Cheshire, Lancashire, Yorkshire ff 1-173 ff 17... [ff 1-173, ff 174-283]\n",
"7 Cheshire magistrates reports (copies of (1) ab... [ff 284-341]\n",
"8 Enclosures to a letter dated (copies of (1) ab... [ff 472-485]"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Pages'] = df['Description'].apply(lambda x: re.findall(FF_PATTERN, x, re.MULTILINE))\n",
"df[['Description','Pages']].head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make the table longer by exploding multiple page references for any given record, and then also splitting out the first and last page reference:"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ID</th>\n",
" <th>Pages</th>\n",
" <th>Start</th>\n",
" <th>End</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>C6573168</td>\n",
" <td>ff 1-173</td>\n",
" <td>1</td>\n",
" <td>173</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>C6573168</td>\n",
" <td>ff 174-283</td>\n",
" <td>174</td>\n",
" <td>283</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>C6573169</td>\n",
" <td>ff 284-341</td>\n",
" <td>284</td>\n",
" <td>341</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>C6573170</td>\n",
" <td>ff 342-371</td>\n",
" <td>342</td>\n",
" <td>371</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>C6573171</td>\n",
" <td>ff 372-471</td>\n",
" <td>372</td>\n",
" <td>471</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>C6573172</td>\n",
" <td>ff 472-485</td>\n",
" <td>472</td>\n",
" <td>485</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>C6573173</td>\n",
" <td>ff 486-540</td>\n",
" <td>486</td>\n",
" <td>540</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>C6573174</td>\n",
" <td>ff 541-596</td>\n",
" <td>541</td>\n",
" <td>596</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>C6573175</td>\n",
" <td>ff 597-624</td>\n",
" <td>597</td>\n",
" <td>624</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ID Pages Start End\n",
"0 C6573168 ff 1-173 1 173\n",
"1 C6573168 ff 174-283 174 283\n",
"2 C6573169 ff 284-341 284 341\n",
"3 C6573170 ff 342-371 342 371\n",
"4 C6573171 ff 372-471 372 471\n",
"5 C6573172 ff 472-485 472 485\n",
"6 C6573173 ff 486-540 486 540\n",
"7 C6573174 ff 541-596 541 596\n",
"8 C6573175 ff 597-624 597 624"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_pages = df.explode('Pages').reset_index(drop=True)[['ID','Pages']].dropna()\n",
"df_pages[['Start', 'End']] = df_pages['Pages'].str.replace('ff','').str.strip().str.split('-').apply(pd.Series)\n",
"df_pages.sort_values(['ID','Start'], inplace=True)\n",
"df_pages.reset_index(drop=True, inplace=True)\n",
"df_pages.head(10)"
]
},
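{
"cell_type": "markdown",
"metadata": {},
"source": [
"With start and end folios available for each record, one use is to look up which record covers a given folio. The following is a minimal sketch using plain tuples copied from the first few rows of the table above; `FOLIO_RANGES` and `find_record_for_folio()` are hypothetical names, and the sketch assumes the folio values are integers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#(ID, start folio, end folio) rows copied from the table above\n",
"FOLIO_RANGES = [\n",
"    (\"C6573168\", 1, 173),\n",
"    (\"C6573168\", 174, 283),\n",
"    (\"C6573169\", 284, 341),\n",
"    (\"C6573170\", 342, 371),\n",
"]\n",
"\n",
"def find_record_for_folio(folio, ranges=FOLIO_RANGES):\n",
"    \"\"\"Return the ID of the first record whose folio range contains folio, else None.\"\"\"\n",
"    for record_id, start, end in ranges:\n",
"        if start <= folio <= end:\n",
"            return record_id\n",
"    return None\n",
"\n",
"find_record_for_folio(300)"
]
},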
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Referencing Into Actual PDF Documents\n",
"\n",
"When downloading a scanned collection from the National Archives, the scan associated with a reference, for example, the scan associated with `HO 40/1`, may be split into several separate PDF documents.\n",
"\n",
"We can merge these into a single document, which makes working with it slightly easier from a programmatic point of view, albeit at the cost of slightly heavier memory requirements when dealing with a particular collection...\n",
"\n",
"The following cell finds the filenames of all the PDFs I downloaded as part of the `HO-40-1` download and sorts them."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['../HO - Home Office/HO-40-1_01.pdf',\n",
" '../HO - Home Office/HO-40-1_02.pdf',\n",
" '../HO - Home Office/HO-40-1_03.pdf']"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from os import listdir\n",
"\n",
"reference = 'HO-40-1'\n",
"pdfs = [f'../HO - Home Office/{f}' for f in listdir('../HO - Home Office') if f.startswith(reference)]\n",
"pdfs.sort()\n",
"\n",
"pdfs[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then merge all these separate PDFs into a single PDF and save it as a new file:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"from PyPDF2 import PdfFileMerger\n",
"\n",
"merger = PdfFileMerger()\n",
"\n",
"for pdf in pdfs:\n",
"    merger.append(pdf)\n",
"\n",
"#Save the merged PDF\n",
"merger.write(f\"{reference}_result.pdf\")\n",
"merger.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can view a specified page of the merged PDF as an image, using ImageMagick to convert that page from the PDF."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"page_num = 500"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Displaying at PDF page 500.\n"
]
},
{
"data": {