{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How to download and read the Solvency 2 legislation\n",
"\n",
"In our first NLP project we will download, clean and read the Delegated Acts of the Solvency 2 legislation in all European languages."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import re\n",
"import requests\n",
"import nltk\n",
"import fitz\n",
"import PyPDF2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The languages of the European Union used in this notebook are\n",
"Bulgarian (BG),\n",
"Spanish (ES),\n",
"Czech (CS),\n",
"Danish (DA),\n",
"German (DE),\n",
"Estonian (ET),\n",
"Greek (EL),\n",
"English (EN),\n",
"French (FR),\n",
"Croatian (HR),\n",
"Italian (IT),\n",
"Latvian (LV),\n",
"Lithuanian (LT),\n",
"Hungarian (HU),\n",
"Maltese (MT),\n",
"Dutch (NL),\n",
"Polish (PL),\n",
"Portuguese (PT),\n",
"Romanian (RO),\n",
"Slovak (SK),\n",
"Slovenian (SL),\n",
"Finnish (FI),\n",
"Swedish (SV)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"languages = ['BG','ES','CS','DA','DE','ET','EL','EN','FR','HR','IT','LV','LT','HU','MT','NL','PL',\n",
" 'PT','RO','SK','SL','FI','SV']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The urls of the Delegated Acts of Solvency 2 are constructed for these languages."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"urls = ['https://eur-lex.europa.eu/legal-content/' + \n",
" lang + \n",
" '/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN' for lang in languages]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following for loop retrieves the pdfs of the Delegated Acts from the website of the European Union and stores them in da_path."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Retrieving BG from https://eur-lex.europa.eu/legal-content/BG/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Retrieving ES from https://eur-lex.europa.eu/legal-content/ES/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving CS from https://eur-lex.europa.eu/legal-content/CS/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving DA from https://eur-lex.europa.eu/legal-content/DA/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving DE from https://eur-lex.europa.eu/legal-content/DE/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving ET from https://eur-lex.europa.eu/legal-content/ET/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving EL from https://eur-lex.europa.eu/legal-content/EL/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving EN from https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving FR from https://eur-lex.europa.eu/legal-content/FR/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving HR from https://eur-lex.europa.eu/legal-content/HR/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving IT from https://eur-lex.europa.eu/legal-content/IT/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving LV from https://eur-lex.europa.eu/legal-content/LV/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving LT from https://eur-lex.europa.eu/legal-content/LT/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving HU from https://eur-lex.europa.eu/legal-content/HU/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving MT from https://eur-lex.europa.eu/legal-content/MT/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving NL from https://eur-lex.europa.eu/legal-content/NL/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving PL from https://eur-lex.europa.eu/legal-content/PL/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving PT from https://eur-lex.europa.eu/legal-content/PT/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving RO from https://eur-lex.europa.eu/legal-content/RO/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving SK from https://eur-lex.europa.eu/legal-content/SK/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving SL from https://eur-lex.europa.eu/legal-content/SL/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving FI from https://eur-lex.europa.eu/legal-content/FI/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n",
"Retrieving SV from https://eur-lex.europa.eu/legal-content/SV/TXT/PDF/?uri=OJ:L:2015:012:FULL&from=EN\n"
]
}
],
"source": [
"da_path = 'data/solvency ii/'\n",
"\n",
"# make sure the target directory exists before writing to it\n",
"os.makedirs(da_path, exist_ok=True)\n",
"\n",
"for language, url in zip(languages, urls):\n",
"\n",
"    print(\"Retrieving \" + language + ' from ' + url)\n",
"\n",
"    filename = 'Solvency II Delegated Acts - ' + language + '.pdf'\n",
"\n",
"    if not os.path.isfile(da_path + filename):\n",
"\n",
"        try:\n",
"            r = requests.get(url)\n",
"        except requests.exceptions.RequestException:\n",
"            print(\"Error with: \" + url)\n",
"        else:\n",
"            with open(da_path + filename, 'wb') as f:\n",
"                f.write(r.content)\n",
"\n",
"            # check that the download is a readable pdf and delete it otherwise\n",
"            # (PyPDF2 1.x API, as used when this notebook was run)\n",
"            try:\n",
"                with open(da_path + filename, 'rb') as fh:\n",
"                    PyPDF2.PdfFileReader(fh)\n",
"            except PyPDF2.utils.PdfReadError:\n",
"                print(\"invalid PDF file: \" + da_path + filename)\n",
"                os.remove(da_path + filename)\n",
"    else:\n",
"        print(\"--> already read.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data cleaning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you look at the pdfs, you will see that each page has a header with the page number and information about the legislation and the language. These headers interrupt the running text, so they must be deleted before we can access the articles."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"DA_dict = dict({'BG': 'Официален вестник на Европейския съюз',\n",
" 'CS': 'Úřední věstník Evropské unie',\n",
" 'DA': 'Den Europæiske Unions Tidende',\n",
" 'DE': 'Amtsblatt der Europäischen Union',\n",
" 'EL': 'Επίσημη Εφημερίδα της Ευρωπαϊκής Ένωσης',\n",
" 'EN': 'Official Journal of the European Union',\n",
" 'ES': 'Diario Oficial de la Unión Europea',\n",
" 'ET': 'Euroopa Liidu Teataja', \n",
" 'FI': 'Euroopan unionin virallinen lehti',\n",
" 'FR': \"Journal officiel de l'Union européenne\",\n",
" 'HR': 'Službeni list Europske unije', \n",
" 'HU': 'Az Európai Unió Hivatalos Lapja', \n",
" 'IT': \"Gazzetta ufficiale dell'Unione europea\",\n",
" 'LT': 'Europos Sąjungos oficialusis leidinys',\n",
" 'LV': 'Eiropas Savienības Oficiālais Vēstnesis',\n",
" 'MT': 'Il-Ġurnal Uffiċjali tal-Unjoni Ewropea',\n",
" 'NL': 'Publicatieblad van de Europese Unie', \n",
" 'PL': 'Dziennik Urzędowy Unii Europejskiej', \n",
" 'PT': 'Jornal Oficial da União Europeia', \n",
" 'RO': 'Jurnalul Oficial al Uniunii Europene', \n",
" 'SK': 'Úradný vestník Európskej únie', \n",
" 'SL': 'Uradni list Evropske unije', \n",
" 'SV': 'Europeiska unionens officiella tidning'})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following code reads the pdfs, deletes the headers from all pages and saves the clean text to a .txt file."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Reading language BG ES CS DA DE ET EL EN FR HR IT LV LT HU MT NL PL PT RO SK SL FI SV "
]
}
],
"source": [
"DA = dict()\n",
"\n",
"files = [f for f in os.listdir(da_path) if os.path.isfile(os.path.join(da_path, f))]\n",
"\n",
"print(\"Reading language \", end='')\n",
"\n",
"for language in languages:\n",
"\n",
"    print(language + \" \", end='')\n",
"\n",
"    if not (\"Delegated_Acts_\" + language + \".txt\" in files):\n",
"\n",
"        # reading pages from pdf file\n",
"        da_pdf = fitz.open(da_path + 'Solvency II Delegated Acts - ' + language + '.pdf')\n",
"        da_pages = [page.getText(output = \"text\") for page in da_pdf]\n",
"        da_pdf.close()\n",
"\n",
"        # deleting page headers: date, OJ page reference, journal title and language code\n",
"        # (the dots in the date are escaped so that they only match literal dots)\n",
"        header = \"17\\\\.1\\\\.2015\\\\s+L\\\\s+\\\\d+/\\\\d+\\\\s+\" + DA_dict[language].replace(' ','\\\\s+') + \"\\\\s+\" + language + \"\\\\s+\"\n",
"        da_pages = [re.sub(header, '', page) for page in da_pages]\n",
"        DA[language] = ''.join(da_pages)\n",
"\n",
"        # preliminary cleaning: remove soft hyphens followed by a space (more cleaning is needed)\n",
"        DA[language] = DA[language].replace('\\xad ', '')\n",
"\n",
"        # saving txt file\n",
"        with open(da_path + \"Delegated_Acts_\" + language + \".txt\", \"wb\") as da_txt:\n",
"            da_txt.write(DA[language].encode('utf-8'))\n",
"\n",
"    else:\n",
"\n",
"        # loading txt file\n",
"        with open(da_path + \"Delegated_Acts_\" + language + \".txt\", \"rb\") as da_txt:\n",
"            DA[language] = da_txt.read().decode('utf-8')"
]
},
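{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the header removal, the following sketch applies the same kind of pattern to a made-up page string. The sample text and the names `sample_page`, `journal_title` and `header_pattern` are assumptions for illustration only; the pages extracted from the real pdfs may look different."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# hypothetical page text, shaped after the header pattern used in this notebook\n",
"sample_page = ('17.1.2015 L 12/5 Official Journal of the European Union EN\\n'\n",
"               'Article 1\\nDefinitions\\n')\n",
"\n",
"journal_title = 'Official Journal of the European Union'\n",
"\n",
"# date, OJ page reference, journal title and language code, separated by whitespace\n",
"header_pattern = ('17\\\\.1\\\\.2015\\\\s+L\\\\s+\\\\d+/\\\\d+\\\\s+' +\n",
"                  journal_title.replace(' ', '\\\\s+') +\n",
"                  '\\\\s+EN\\\\s+')\n",
"\n",
"# the header, including the whitespace that follows it, is replaced by nothing\n",
"clean_page = re.sub(header_pattern, '', sample_page)\n",
"print(repr(clean_page))"
]
},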
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Retrieve the text within articles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Retrieving the text within articles is not straightforward. In English we have 'Article 1 some text', i.e. the word Article is put before the number. But some European languages put the word after the number, and two languages, HU and LV, put a dot between the number and the word for article. To be able to read the text within the articles we need to know this ordering (and, of course, the word for article in every language)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"art_dict= dict({'BG': ['Член', 'pre'],\n",
" 'CS': ['Článek', 'pre'],\n",
" 'DA': ['Artikel', 'pre'],\n",
" 'DE': ['Artikel', 'pre'],\n",
" 'EL': ['Άρθρο', 'pre'],\n",
" 'EN': ['Article', 'pre'],\n",
" 'ES': ['Artículo', 'pre'],\n",
" 'ET': ['Artikkel', 'pre'],\n",
" 'FI': ['artikla', 'post'],\n",
" 'FR': ['Article', 'pre'],\n",
" 'HR': ['Članak', 'pre'],\n",
" 'HU': ['cikk', 'postdot'],\n",
" 'IT': ['Articolo', 'pre'],\n",
" 'LT': ['straipsnis','post'],\n",
" 'LV': ['pants', 'postdot'],\n",
" 'MT': ['Artikolu', 'pre'],\n",
" 'NL': ['Artikel', 'pre'],\n",
" 'PL': ['Artykuł', 'pre'],\n",
" 'PT': ['Artigo', 'pre'],\n",
" 'RO': ['Articolul', 'pre'],\n",
" 'SK': ['Článok', 'pre'],\n",
" 'SL': ['Člen', 'pre'],\n",
" 'SV': ['Artikel', 'pre']})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we can define a regex to select the text within an article."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def retrieve_article(language, article_num):\n",
"\n",
"    word, method = art_dict[language]\n",
"    this_num, next_num = str(article_num), str(article_num + 1)\n",
"\n",
"    # \\b keeps e.g. 'Article 2' from matching the start of 'Article 29'\n",
"    if method == 'pre':\n",
"        string = word + ' ' + this_num + '\\\\b(.*?)' + word + ' ' + next_num + '\\\\b'\n",
"    elif method == 'post':\n",
"        string = '\\\\b' + this_num + ' ' + word + '(.*?)\\\\b' + next_num + ' ' + word\n",
"    elif method == 'postdot':\n",
"        # the dot is escaped so that it only matches a literal dot\n",
"        string = '\\\\b' + this_num + '\\\\. ' + word + '(.*?)\\\\b' + next_num + '\\\\. ' + word\n",
"\n",
"    r = re.compile(string, re.DOTALL)\n",
"\n",
"    match = r.search(DA[language])\n",
"    if match is None:\n",
"        raise ValueError('Article ' + this_num + ' not found for language ' + language)\n",
"\n",
"    # collapse all whitespace (including line breaks) to single spaces\n",
"    result = ' '.join(match[1].split())\n",
"\n",
"    return result"
]
},
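{
"cell_type": "markdown",
"metadata": {},
"source": [
"One thing to watch with patterns like these: a prefix such as 'Article 2' also matches the start of 'Article 29'. The following minimal sketch shows, on a made-up text, how a word boundary (`\\b`) after the number avoids this. The sample text and the names `toy_text`, `loose` and `strict` are hypothetical and not taken from the Delegated Acts."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# hypothetical text in which 'Article 29' appears before 'Article 2'\n",
"toy_text = 'See Article 29. Article 2 Scope of the rules. Article 3 Definitions.'\n",
"\n",
"# without a word boundary the match starts at the 'Article 2' inside 'Article 29'\n",
"loose = re.search('Article 2(.*?)Article 3', toy_text, re.DOTALL)\n",
"\n",
"# with \\b after the number only the exact article number matches\n",
"strict = re.search('Article 2\\\\b(.*?)Article 3\\\\b', toy_text, re.DOTALL)\n",
"\n",
"print(repr(loose[1]))\n",
"print(repr(strict[1]))"
]
},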
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Okay, where are we now? We have a function that retrieves the text of a given article of the Delegated Acts in each European language, as long as the heading of the next article can be found as well."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Summary 1. The solvency and financial condition report shall include a clear and concise summary. The summary of the report shall be understandable to policy holders and beneficiaries. 2. The summary of the report shall highlight any material changes to the insurance or reinsurance undertaking's business and performance, system of governance, risk profile, valuation for solvency purposes and capital management over the reporting period.\""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retrieve_article('EN', 292)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Zusammenfassung 1. Der Bericht über Solvabilität und Finanzlage enthält eine klare, knappe Zusammenfassung. Die Zusammenfassung des Berichts ist für Versicherungsnehmer und Anspruchsberechtigte verständlich. 2. In der Zusammenfassung werden etwaige wesentliche Änderungen in Bezug auf Geschäftstätigkeit und Leistung des Versicherungs- oder Rückversicherungsunternehmens, sein Governance-System, sein Risikoprofil, die Bewertung für Solvabilitätszwecke und das Kapitalmanagement im Berichtszeitraum herausgestellt.'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retrieve_article('DE', 292)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Synthèse 1. Le rapport sur la solvabilité et la situation financière contient une synthèse concise et claire. Cette synthèse est compréhensible par les preneurs et les bénéficiaires. 2. La synthèse met en évidence tout changement important survenu dans l'activité et les résultats de l'entreprise d'assurance ou de réassurance, son système de gouvernance, son profil de risque, la valorisation qu'elle applique à des fins de solvabilité et la gestion de son capital sur la période de référence.\""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retrieve_article('FR', 292)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Περίληψη 1. Η έκθεση φερεγγυότητας και χρηματοοικονομικής κατάστασης περιλαμβάνει σαφή και σύντομη περίληψη. Η περίληψη της έκθεσης πρέπει να είναι κατανοητή από τους αντισυμβαλλομένους και τους δικαιούχους. 2. Η περίληψη της έκθεσης επισημαίνει τυχόν ουσιώδεις αλλαγές όσον αφορά τη δραστηριότητα και τις επιδόσεις της ασφαλιστικής και αντασφαλιστικής επιχείρησης, το σύστημα διακυβέρνησης, το προφίλ κινδύνου, την εκτίμηση της αξίας για τους σκοπούς φερεγγυότητας και τη διαχείριση κεφαλαίου κατά την περίοδο αναφοράς.'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retrieve_article('EL', 292)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"\"Risicoprofiel 1. Het verslag over de solvabiliteit en financiële toestand bevat kwalitatieve en kwantitatieve informatie over het risicoprofiel van de verzekerings- of herverzekeringsonderneming, zulks in overeenstemming met de leden 2 tot en met 7 en afzonderlijk voor de volgende risicocategorieën: (a) verzekeringstechnisch risico; (b) marktrisico; (c) kredietrisico; (d) liquiditeitsrisico; (e) operationeel risico; (f) andere materiële risico's. 2. Het verslag over de solvabiliteit en financiële toestand bevat de volgende informatie over de risicoblootstelling van de verzekerings- of herverzekeringsonderneming, met inbegrip van de blootstelling die voortvloeit uit buitenbalansposities en de overdracht van risico aan special purpose vehicles: (a) een beschrijving van de maatregelen om deze risico's binnen die onderneming te beoordelen, met vermelding van alle in de loop van de rapportageperiode opgetreden materiële veranderingen; (b) een beschrijving van de materiële risico's waaraan die onderneming is blootgesteld, met vermelding van alle in de loop van de rapportageperiode opgetreden materiële veranderingen; (c) een beschrijving van de wijze waarop de activa in overeenstemming met het in artikel 132 van Richtlijn 2009/138/EG beschreven „prudent person”-beginsel zijn belegd, en de in dat artikel genoemde risico's naar behoren worden beheerd. 3. Wat risicoconcentratie betreft, bevat het verslag over de solvabiliteit en financiële toestand een beschrijving van de materiële risicoconcentraties waaraan de verzekerings- of herverzekeringsonderneming is blootgesteld. 4. Wat risicolimitering betreft, bevat het verslag over de solvabiliteit en financiële toestand een beschrijving van de gehanteerde risicolimiteringstechnieken en van de procedures voor het monitoren of deze risicolimiteringstechnieken doeltreffend blijven. 5. Wat het liquiditeitsrisico betreft, vermeldt het verslag over de solvabiliteit en financiële toestand het totaalbedrag van de in toekomstige premies vervatte verwachte winst, zoals berekend in overeenstemming met artikel 260, lid 2. 6. Wat risicogevoeligheid betreft, bevat het verslag over de solvabiliteit en financiële toestand een beschrijving van de gehanteerde methoden, de gemaakte veronderstellingen en de uitkomst van stresstests en gevoeligheidsanalyses met betrekking tot materiële risico's en gebeurtenissen. 7. Het verslag over de solvabiliteit en financiële toestand bevat in een afzonderlijke afdeling alle andere materiële informatie over het risicoprofiel van de verzekerings- of herverzekeringsonderneming.\""
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"retrieve_article('NL', 295)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}