Skip to content

Instantly share code, notes, and snippets.

@xflr6
Last active Dec 17, 2018
Embed
What would you like to do?
Glottolog with Python
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring *Glottolog* with Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sebastian Bank (sebastian.bank@uni-leipzig.de) http://www.uni-leipzig.de/~sbank/\n",
"\n",
"The latest version of this [IPython Notebook](http://ipython.org/notebook.html) is available at http://gist.github.com/xflr6/9050337.\n",
"\n",
"[Glottolog](http://glottolog.org) provides its comprehensive catalog of the world's languages, language families and dialects for [download](http://glottolog.org/meta/downloads) in linked data format.\n",
"\n",
"In this notebook, I will process this data set using the following tools:\n",
"\n",
"* [Python](http://www.python.org) (2.7)\n",
"* [rdflib](http://github.com/RDFLib/rdflib)\n",
"* [sqlite3](http://docs.python.org/2/library/sqlite3.html) (included with Python)\n",
"* [pandas](http://pandas.pydata.org) (using [matplotlib](http://matplotlib.org) for visualization)\n",
"\n",
"If you are new to scientific Python, the [Anaconda Python Distribution](http://continuum.io/downloads) is probably the fastest way to get Python installed with all the commonly used scientific packages. It supports all platforms (Linux, Mac, and Windows).\n",
"\n",
"If you are on Windows, there are [Unofficial Windows Binaries](http://www.lfd.uci.edu/~gohlke/pythonlibs/) for a lot of Python extension packages used in scientific computing."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting the file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the [RDF](http://en.wikipedia.org/wiki/Resource_Description_Framework) export file with Pythons built-in `urllib` module ([docs](http://docs.python.org/2/library/urllib.html))."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import urllib\n",
"\n",
"URL = 'http://glottolog.org/static/download/2.7/glottolog-language.n3.gz'\n",
"\n",
"filename, headers = urllib.urlretrieve(URL, URL.rpartition('/')[2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The file contains RDF in [Notation3](http://en.wikipedia.org/wiki/Notation3) compressed with gzip."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"glottolog-language.n3.gz\n"
]
}
],
"source": [
"print filename"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the size in megabytes."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.63014793396\n"
]
}
],
"source": [
"size = int(headers['Content-Length'])\n",
"\n",
"print size / 1024.0 ** 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A first look"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Read the first few bytes from the file with `gzip` ([docs](http://docs.python.org/2/library/gzip.html)) so we can get an impression of the format."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import gzip\n",
"\n",
"with gzip.open(filename) as fd:\n",
" sample = fd.read(4000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Split the sample into the namespaces definitions and the actual RDF triples. They are separated by a blank line."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(932, 3066)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"head, _, body = sample.partition('\\n\\n')\n",
"\n",
"len(head), len(body)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inspect the start of the namespaces."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"@prefix bibo: <http://purl.org/ontology/bibo/> .\n",
"@prefix dc: <http://purl.org/dc/elements/1.1/> .\n",
"@prefix dcterms: <http://purl.org/dc/terms/> .\n",
"@prefix dctype: <http://purl.org/dc/dcmitype/> .\n",
"@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n",
"@prefix frbr: <http://purl.org/vocab/frbr/core#> .\n",
"@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .\n",
"@prefix gold: <http://purl.org/linguistics/gold/> .\n",
"@prefix isbd: <http://iflastandards.info/ns/isbd/elements/> .\n",
"@prefix lexvo: <http://lexvo.org/ontology#> .\n",
"@prefix owl: <http://www.w3.org/2002/07/owl#> .\n",
"@prefix rdf: <http://www.w3.org/1999/02/22-r...\n"
]
}
],
"source": [
"print head[:600] + '...'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Glottolog uses well-known ontologies and some which are dedicated to linguistics like [Lexvo](http://www.lexvo.org/) and [GOLD](http://linguistics-ontology.org/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the first RDF triples."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<http://glottolog.org/resource/languoid/id/muni1258> a dcterms:LinguisticSystem,\n",
" gold:Language ;\n",
" rdfs:label \"Muniche\"@en ;\n",
" lexvo:iso639P3PCode \"myr\"^^xsd:string ;\n",
" dcterms:description <http://glottolog.org/resource/reference/id/10167>,\n",
" <http://glottolog.org/resource/reference/id/132589>,\n",
" <http://glottolog.org/resource/reference/id/135495>,\n",
" <http://glottolog.org/resource/reference/id/300702>,\n",
" <http://glottolog.org/resource/reference/id/303200>,\n",
" <http://glottolog.org/resource/reference/id/34227>,\n",
" <http://glottolog.org/resource/re...\n"
]
}
],
"source": [
"print body[:600] + '...'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The entry starts with the full URI of the languoid, followed by its types, label, ISO 639-3 code and description."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try to extract some meaningful information from this string just using Pythons regular expressions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using text processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the whole file uncompressed into memory."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"with gzip.open(filename) as fd:\n",
" data = fd.read()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the size in megabytes."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"32.698595047\n"
]
}
],
"source": [
"print len(data) / 1024.0 ** 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract the glottocode from the start of all `dcterms:LinguisticSystem` entries with the `re` module ([docs](http://docs.python.org/2/library/re.html)) and count them."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"24393"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"\n",
"GLOTTOCODE = '<http://glottolog.org/resource/languoid/id/(\\w+)> a dcterms:LinguisticSystem'\n",
"\n",
"gcodes = re.findall(GLOTTOCODE, data)\n",
"\n",
"len(gcodes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the glottocodes of the first five entries."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['muni1258', 'west1503', 'port1278', 'west2205', 'nilo1247']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gcodes[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks unordered, sort them alphabetically and display the first and last five entries."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['aala1237', 'aant1238', 'aari1238', 'aari1239', 'aari1240']\n",
"['zuti1239', 'zuwa1238', 'zwal1238', 'zyph1238', 'zyud1238']\n"
]
}
],
"source": [
"gcodes.sort()\n",
"\n",
"print gcodes[:5] \n",
"print gcodes[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract everything that looks like an ISO code. Count the results."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"7822"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ISO_CODE = 'iso639P3PCode \"(\\w+)\"'\n",
"\n",
"icodes = re.findall(ISO_CODE, data)\n",
"\n",
"len(icodes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the first ten ISO codes."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['myr', 'pko', 'oki', 'mwy', 'kqh', 'mwx', 'aam', 'spy', 'tec', 'kpz']"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"icodes[:10]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sort them as well and display the start and end."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['aaa', 'aab', 'aac', 'aad', 'aae', 'aaf', 'aag', 'aah', 'aai', 'aak']\n",
"['zun', 'zuy', 'zwa', 'zyb', 'zyg', 'zyj', 'zyn', 'zyp', 'zza', 'zzj']\n"
]
}
],
"source": [
"icodes.sort()\n",
"\n",
"print icodes[:10]\n",
"print icodes[-10:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Glottocodes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Glottocodes consist of four letters and some apparently recurring digit combinations.\n",
"\n",
"Display the five most common of those digits and their frequency with `collections.Counter` ([docs](http://docs.python.org/2/library/collections.html#collections.Counter))."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('1238', 3022), ('1239', 1192), ('1242', 1039), ('1241', 997), ('1237', 903)]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import collections\n",
"\n",
"collections.Counter(g[4:] for g in gcodes).most_common(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the most common inital parts."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[('nort', 563), ('sout', 560), ('nucl', 508), ('west', 461), ('east', 425)]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"collections.Counter(g[:4] for g in gcodes).most_common(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading into RDFlib"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `rdflib` ([docs](http://rdflib.readthedocs.org/en/latest/)) to load the whole graph into memory.\n",
"\n",
"This will take a while and fill a couple hundred megabytes of RAM."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<Graph identifier=N5f0224c79a154d14bd437619ecf4e397 (<class 'rdflib.graph.Graph'>)>"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import rdflib\n",
"\n",
"graph = rdflib.Graph()\n",
"\n",
"with gzip.open(filename) as fd:\n",
" graph.parse(fd, format='n3')\n",
"\n",
"graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the number of triples."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"670194"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(graph)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using the RDF graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display some of the triples (subject, predicate, object)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pwon1235 rdf:type http://purl.org/dc/terms/LinguisticSystem\n",
"guin1260 dcterms:spatial http://www.geonames.org/countries/GW/\n",
"mogu1251 dcterms:description http://glottolog.org/resource/reference/id/156942\n",
"tibe1272 dcterms:description http://glottolog.org/resource/reference/id/26288\n",
"barr1251 dcterms:isReferencedBy http://glottolog.org/valuesets/vitality-barr1251\n",
"nang1261 skos:altLabel nang1261\n",
"choc1278 dcterms:spatial North America\n",
"song1308 rdf:type http://purl.org/linguistics/gold/Dialect\n",
"pato1242 void:inDataset http://glottolog.org/\n",
"nort2855 rdf:type http://purl.org/dc/terms/LinguisticSystem\n",
"chil1280 skos:broader http://glottolog.org/resource/languoid/id/nort2940\n",
"sate1242 dcterms:title Saterfriesisch\n",
"stan1290 dcterms:description http://glottolog.org/resource/reference/id/37004\n",
"bord1246 skos:broader http://glottolog.org/resource/languoid/id/komb1273\n",
"marg1251 dcterms:description http://glottolog.org/resource/reference/id/54615\n"
]
}
],
"source": [
"import itertools\n",
"\n",
"for s, p, o in itertools.islice(graph, 15):\n",
" print s[42:], graph.qname(p), o"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show all available predicates."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"lexvo:iso639P3PCode\n",
"dcterms:description\n",
"dcterms:isReferencedBy\n",
"dcterms:isReplacedBy\n",
"dcterms:spatial\n",
"dcterms:title\n",
"void:inDataset\n",
"rdf:type\n",
"rdfs:label\n",
"owl:sameAs\n",
"geo:lat\n",
"geo:long\n",
"skos:altLabel\n",
"skos:broader\n",
"skos:broaderTransitive\n",
"skos:changeNote\n",
"skos:editorialNote\n",
"skos:narrower\n",
"skos:prefLabel\n",
"skos:scopeNote\n"
]
}
],
"source": [
"for p in sorted(set(graph.predicates())):\n",
" print graph.qname(p)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create shortcuts for querying glottocodes and ISO codes. Translate glottocodes into ISO codes."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"aala1237 -> ___, aant1238 -> ___, aari1238 -> ___, aari1239 -> aiw, aari1240 -> aay,\n"
]
}
],
"source": [
"glottocode = rdflib.Namespace('http://glottolog.org/resource/languoid/id/')\n",
"lexvo = rdflib.Namespace('http://lexvo.org/ontology#')\n",
"iso639 = lexvo.iso639P3PCode\n",
"\n",
"for g in gcodes[:5]:\n",
" i = graph.value(glottocode[g], iso639, default='___')\n",
" print '%s -> %s,' % (g, i),"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Translate ISO codes into glottocodes"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"aaa -> ghot1243, aab -> alum1246, aac -> arii1243, aad -> amal1242, aae -> arbe1236,\n"
]
}
],
"source": [
"string = rdflib.namespace.XSD.string\n",
"\n",
"for i in icodes[:5]:\n",
" g = graph.value(None, iso639, rdflib.Literal(i, datatype=string))\n",
" print '%s -> %s,' % (i, g[42:]),"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Retrieve the preferred label of languoids."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"aala1237 -> Aalawa, aant1238 -> Aantantara, aari1238 -> Aari-Gayil, aari1239 -> Aari, aari1240 -> Aariya,\n"
]
}
],
"source": [
"label = rdflib.namespace.RDFS.label\n",
"\n",
"for g in gcodes[:5]:\n",
" l = graph.value(glottocode[g], label)\n",
" print '%s -> %s,' % (g, l),"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lookup an arbitrary languoid with a given label. "
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://glottolog.org/resource/languoid/id/aala1237\n"
]
}
],
"source": [
"print graph.value(None, label, rdflib.Literal('Aalawa', lang='en'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the predicates and objects of an individual languoid."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"skos:prefLabel Aalawa\n",
"rdfs:label Aalawa\n",
"skos:broaderTransitive http://glottolog.org/resource/languoid/id/aust1307\n",
"dcterms:isReferencedBy http://glottolog.org/valuesets/fc42061\n",
"dcterms:title Aalawa\n",
"dcterms:isReferencedBy http://glottolog.org/valuesets/sc42061\n",
"skos:altLabel aala1237\n",
"void:inDataset http://glottolog.org/\n",
"skos:scopeNote language\n",
"skos:broader http://glottolog.org/resource/languoid/id/ramo1244\n",
"rdf:type http://purl.org/dc/terms/LinguisticSystem\n",
"dcterms:spatial Papunesia\n",
"rdf:type http://purl.org/linguistics/gold/Dialect\n"
]
}
],
"source": [
"for p, o in graph[glottocode['aala1237']]:\n",
" print graph.qname(p), o"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the nodes along a languoid's path up the tree."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Aalawa -> Ramoaaina -> Kandas-Duke of York -> Label-Bilur -> St George linkage -> New Ireland-Northwest Solomonic linkage -> Meso Melanesian linkage -> Western Oceanic linkage -> Oceanic -> Eastern Malayo-Polynesian -> Central-Eastern Malayo-Polynesian -> Malayo-Polynesian -> Nuclear Austronesian -> Austronesian\n"
]
}
],
"source": [
"broader = rdflib.namespace.SKOS.broader\n",
"\n",
"aalawa = graph.resource(glottocode['aala1237'])\n",
"\n",
"print ' -> '.join(b.label() for b in aalawa.transitive_objects(broader))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the nodes immediately below a languoid."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel\n"
]
}
],
"source": [
"narrower = rdflib.namespace.SKOS.narrower\n",
"\n",
"atlaco = graph.resource(glottocode['atla1278'])\n",
"\n",
"print '%s <- %s' % (atlaco.label(), ', '.join(n.label() for n in atlaco.objects(narrower)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count all nodes below a languoid."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"4608"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(list(atlaco.transitive_objects(narrower)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying with SPARQL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Retrieve rows of glottocode, ISO code, and label with RDFs query language [SPARQL](http://www.w3.org/TR/sparql11-query/). Also display the annotated language of the label."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"aala1237 | None | Aalawa | en\n",
"aant1238 | None | Aantantara | en\n",
"aari1238 | None | Aari-Gayil | en\n",
"aari1239 | aiw | Aari | en\n",
"aari1240 | aay | Aariya | en\n",
"aari1244 | aiz | Aari | en\n",
"aasa1238 | aas | Aasax | en\n",
"aata1238 | None | Aatasaara | en\n",
"abaa1238 | None | Aba | en\n",
"abab1239 | None | Ababda | en\n"
]
}
],
"source": [
"GIL = \"\"\"\n",
"SELECT\n",
" (substr(str(?s), 43) AS ?glottocode) ?iso ?label\n",
"WHERE\n",
" { ?s a dcterms:LinguisticSystem ; skos:prefLabel ?label \n",
" OPTIONAL { ?s lexvo:iso639P3PCode ?iso } }\n",
"ORDER BY ?s LIMIT 10\"\"\"\n",
"\n",
"for g, i, l in graph.query(GIL):\n",
" print '%s | %-4s | %-10s | %s' % (g, i, l, l.language)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the result as CSV (`json` and `xml` format are also supported)."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"glottocode,iso,label\r\n",
"aala1237,,Aalawa\r\n",
"aant1238,,Aantantara\r\n",
"aari1238,,Aari-Gayil\r\n",
"aari1239,aiw,Aari\r\n",
"aari1240,aay,Aariya\r\n",
"aari1244,aiz,Aari\r\n",
"aasa1238,aas,Aasax\r\n",
"aata1238,,Aatasaara\r\n",
"abaa1238,,Aba\r\n",
"abab1239,,Ababda\r\n",
"\n"
]
}
],
"source": [
"print graph.query(GIL).serialize(format='csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Determine the language families with the most child languages."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Atlantic-Congo\t1430\n",
"Austronesian\t1274\n",
"Indo-European\t583\n",
"Sino-Tibetan\t475\n",
"Bookkeeping\t391\n",
"Afro-Asiatic\t372\n",
"Nuclear Trans New Guinea\t315\n",
"Pama-Nyungan\t241\n",
"Otomanguean\t179\n",
"Sign Language\t168\n"
]
}
],
"source": [
"FAMILIES = \"\"\"\n",
"SELECT\n",
" ?label (count(*) as ?n)\n",
"WHERE\n",
" { ?s a gold:LanguageFamily ; rdfs:label ?label ; skos:narrower+/a gold:Language }\n",
"GROUP BY ?s\n",
"ORDER BY desc(?n) LIMIT 10\"\"\"\n",
"\n",
"for f, n in graph.query(FAMILIES):\n",
" print '%s\\t%s' % (f, n)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the immediate children for some families."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Arawan <- Suruahá, Paumarí, Aruá (Amazonas State), Madi-Madiha\n",
"Artificial Language <- Neo, Efate group based (Artificial Language), Kotava, Esperanto, Lingua Franca Nova, Talossan, Interlingua (International Auxiliary Language Association), Rennellese Sign Language, Ladakhi Sign\n",
"Athapaskan-Eyak-Tlingit <- Athapaskan-Eyak, Tlingit\n",
"Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel\n",
"Austroasiatic <- Nicobaric, Monic, Khmuic, Vietic, Mangic, Pearic, Bahnaric, Khasi-Palaung, Katuic, Mundaic, Aslian, Khmeric\n"
]
}
],
"source": [
"CHILDREN = \"\"\"\n",
"SELECT\n",
" ?label (group_concat(?o; separator=\", \") as ?children)\n",
"WHERE\n",
" { ?s a gold:LanguageFamily ; rdfs:label ?label ; skos:narrower/rdfs:label ?o }\n",
"GROUP BY ?s\n",
"ORDER BY ?label OFFSET 10 LIMIT 5\"\"\"\n",
"\n",
"for f, c in graph.query(CHILDREN):\n",
" print '%s <- %s' % (f, c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do the same for a specific languoid."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Atlantic-Congo <- Volta-Congo, North-Central Atlantic, Nalu, Mansoanka-Fore-Mboteni, Limba, Mel\n"
]
}
],
"source": [
"for l, c in graph.query(\"\"\"BASE <http://glottolog.org/resource/languoid/id/>\n",
"SELECT\n",
" ?label (group_concat(?o; separator=\", \") as ?children)\n",
"WHERE\n",
" { <atla1278> rdfs:label ?label ; skos:narrower/rdfs:label ?o }\"\"\"):\n",
" print '%s <- %s' % (l, c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a SPARQL query that retrieves most of the [functional properties](http://www.w3.org/TR/owl-ref/#FunctionalProperty-def) of the languoids."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"LANGUOIDS = \"\"\"\n",
"SELECT\n",
" (substr(str(?s), 43) AS ?id) ?label\n",
" (substr(str(?type), 34) AS ?level)\n",
" (substr(str(?broader), 43) AS ?parent)\n",
" (if(bound(?change_note), 1, 0) AS ?obsolete)\n",
" ?status ?iso639 ?latitude ?longitude\n",
"WHERE\n",
" { ?s a dcterms:LinguisticSystem ; skos:prefLabel ?label .\n",
" ?s a ?type FILTER (strstarts(str(?type), \"http://purl.org/linguistics/gold/\"))\n",
" OPTIONAL { ?s skos:broader ?broader }\n",
" OPTIONAL { ?s skos:changeNote ?change_note FILTER (?change_note = \"obsolete\") }\n",
" OPTIONAL { ?s skos:editorialNote ?status }\n",
" OPTIONAL { ?s lexvo:iso639P3PCode ?iso639 }\n",
" OPTIONAL { ?s geo:lat ?latitude; geo:long ?longitude } }\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display some results."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pwon1235 Pwo Northern Karen Language nort2704 0 established pww 18.016 98.2709\n",
"kwes1244 Kwese Language phee1234 0 established kws -5.60445 18.5759\n",
"abaw1238 Abawa Dialect gupa1248 0 None None None None\n",
"roto1247 Rotorua-Taupo Dialect maor1246 0 None None None None\n",
"nort2855 North Coast Mengen Dialect meng1267 0 None None None None\n",
"maca1260 Maca Language mata1290 0 established mca -25.0119 -57.3694\n",
"nyon1241 Nyong Language pere1234 0 established muo 7.27419 11.0615\n",
"fars1254 Farsic-Caucasian Tat LanguageSubfamily sout3157 0 established None None None\n",
"yeng1243 Yengi Hissar Dialect uigh1240 0 None None None None\n",
"thui1238 Thui Phum Dialect ngal1291 0 None None None None\n",
"west2339 Western Asturian Dialect astu1245 0 None None None None\n",
"kele1254 Kele (C.60) LanguageFamily None 1 established None None None\n",
"zumu1241 Zumu Dialect bata1314 0 None None None None\n",
"nort2742 Northern Isan Dialect nort2741 0 None None None None\n",
"tezo1238 Tezoatlán Mixtec Language mixt1427 0 established mxb 17.6155 -97.9002\n",
"sund1254 Sundi-Kamba LanguageSubfamily laad1234 0 established None None None\n",
"long1404 Long Bento' Dialect moda1244 0 None None None None\n",
"pouy1238 Pouye Language ramm1241 0 established bye -3.72704 141.864\n",
"gola1255 Gola Language mela1257 0 established gol 7.06193 -10.8138\n",
"supp1238 Suppire-Mamara LanguageFamily None 1 established None None None\n"
]
}
],
"source": [
"for row in itertools.islice(graph.query(LANGUOIDS), 20):\n",
" print '%s %-20s %-17s %-8s %s %-11s %-4s %-8s %s' % row"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write the results into a CSV file. Show the beginning of the file."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"id,label,level,parent,obsolete,status,iso639,latitude,longitude\n",
"pwon1235,Pwo Northern Karen,Language,nort2704,0,established,pww,18.016,98.2709\n",
"kwes1244,Kwese,Language,phee1234,0,established,kws,-5.60445,18.5759\n",
"abaw1238,Abawa,Dialect,gupa1248,0,,,,\n",
"roto1247,Rotorua-Taupo,Dialect,maor1246,0,,,,\n",
"nort2855,North Coast Mengen,Dialect,meng1267,0,,,,\n",
"maca1260,Maca,Language,mata1290,0,established,mca,-25.0119,-57.3694\n",
"nyon1241,Nyong,Language,pere1234,0,established,muo,7.27419,11.0615\n",
"fars1254,Farsic-Cau...\n"
]
}
],
"source": [
"CSV = 'glottolog.csv'\n",
"\n",
"graph.query(LANGUOIDS).serialize(CSV, format='csv')\n",
"\n",
"with open(CSV) as fd:\n",
" sample = fd.read(500)\n",
"\n",
"print sample + '...'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's put that into a relational database so we can reuse it later."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Export to SQLite"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create an [SQLite](http://www.sqlite.org/) database file connecting with `sqlite3` ([docs](http://docs.python.org/2/library/sqlite3.html)). Activate [foreign key checks](http://www.sqlite.org/foreignkeys.html) so we notice if something is inconsistent."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<sqlite3.Connection at 0x28c60858>"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import sqlite3\n",
"\n",
"DB = 'glottolog.sqlite3'\n",
"\n",
"conn = sqlite3.connect(DB)\n",
"conn.execute('PRAGMA foreign_keys = ON')\n",
"\n",
"conn.execute('PRAGMA synchronous = OFF')\n",
"conn.execute('PRAGMA journal_mode = MEMORY')\n",
"\n",
"conn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a table for the results of the languoids query with some additional sanity checks. Insert the query rows. Count them."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(24393,)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.execute(\"\"\"\n",
"CREATE TABLE languoid (\n",
" id TEXT NOT NULL PRIMARY KEY,\n",
" label TEXT NOT NULL,\n",
" level TEXT NOT NULL,\n",
" parent TEXT,\n",
" obsolete BOOLEAN NOT NULL,\n",
" status TEXT,\n",
" iso TEXT UNIQUE,\n",
" latitude REAL,\n",
" longitude REAL,\n",
" FOREIGN KEY(parent) REFERENCES languoid(id) DEFERRABLE INITIALLY DEFERRED,\n",
" CHECK (level IN ('LanguageFamily', 'LanguageSubfamily', 'Language', 'Dialect')),\n",
" CHECK (obsolete IN (0, 1)),\n",
" CHECK (status IN ('established', 'spurious', 'spurious retired', 'unattested',\n",
" 'provisional', 'retired'))\n",
")\"\"\")\n",
"\n",
"conn.executemany('INSERT INTO languoid VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',\n",
" graph.query(LANGUOIDS))\n",
"conn.commit()\n",
"\n",
"conn.execute('SELECT count(*) FROM languoid').fetchone()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Languoids may have *n* alternative labels. \n",
"\n",
"Create a table for the labels and their language. Retrieve them with SPARQL. Insert the query results into the table. Count rows."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(86463,)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.execute(\"\"\"\n",
"CREATE TABLE label (\n",
" id TEXT NOT NULL,\n",
" lang TEXT NOT NULL,\n",
" label TEXT NOT NULL,\n",
" PRIMARY KEY (id, lang, label),\n",
" FOREIGN KEY(id) REFERENCES languoid(id)\n",
")\"\"\")\n",
"\n",
"LABELS = \"\"\"\n",
"SELECT\n",
" (substr(str(?s), 43) AS ?id) (lang(?label) AS ?lang) ?label\n",
"WHERE\n",
" { ?s a dcterms:LinguisticSystem ; skos:altLabel ?label }\"\"\"\n",
"\n",
"conn.executemany('INSERT INTO label VALUES (?, ?, ?)',\n",
" graph.query(LABELS))\n",
"conn.commit()\n",
"\n",
"conn.execute('SELECT count(*) FROM label').fetchone()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Languoids may have *n* references.\n",
"\n",
"Create a table for the references. Retrieve them with SPARQL. Insert the query results into the table. Count."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(212614,)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.execute(\"\"\"\n",
"CREATE TABLE reference (\n",
" id TEXT NOT NULL,\n",
" reference INTEGER NOT NULL,\n",
" PRIMARY KEY (id, reference),\n",
" FOREIGN KEY(id) REFERENCES languoid(id)\n",
")\"\"\")\n",
"\n",
"REFERENCES = \"\"\"\n",
"SELECT\n",
" (substr(str(?s), 43) AS ?id) (substr(str(?o), 44) AS ?reference)\n",
"WHERE\n",
" { ?s a dcterms:LinguisticSystem ; dcterms:description ?o\n",
" FILTER (strstarts(str(?o), \"http://glottolog.org/resource/reference/id/\")) }\"\"\"\n",
"\n",
"conn.executemany('INSERT INTO reference VALUES (?, ?)',\n",
" graph.query(REFERENCES))\n",
"conn.commit()\n",
"\n",
"conn.execute('SELECT count(*) FROM reference').fetchone()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying with SQLite"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the number of languoids. Break it down by type and check the proportion of superseded entries. Most of the family entries are obsolete."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(24393,)\n"
]
},
{
"data": {
"text/plain": [
"[(u'Dialect', 10599, 185),\n",
" (u'Language', 8418, 21),\n",
" (u'LanguageFamily', 1505, 1263),\n",
" (u'LanguageSubfamily', 3871, 0)]"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print conn.execute('SELECT count(*) FROM languoid').fetchone()\n",
"\n",
"conn.execute('SELECT level, count(*), sum(obsolete) FROM languoid GROUP BY level').fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the distribution of status values by type. Only language entries distinguish it. "
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'Dialect', None, 10599),\n",
" (u'Language', u'established', 7945),\n",
" (u'Language', u'spurious', 199),\n",
" (u'Language', u'spurious retired', 192),\n",
" (u'Language', u'unattested', 61),\n",
" (u'Language', u'retired', 19),\n",
" (u'Language', u'provisional', 2),\n",
" (u'LanguageFamily', u'established', 1505),\n",
" (u'LanguageSubfamily', u'established', 3871)]"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.execute(\"\"\"SELECT level, status, count(*) AS n\n",
"FROM languoid GROUP BY level, status ORDER BY level, n DESC\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the number ISO codes. Break the proportions down by languoid type. ISO 639-3 also contains macrolanguages."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(24393, 7822)\n"
]
},
{
"data": {
"text/plain": [
"[(u'Dialect', 10599, 5),\n",
" (u'Language', 8418, 7789),\n",
" (u'LanguageFamily', 1505, 1),\n",
" (u'LanguageSubfamily', 3871, 27)]"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print conn.execute('SELECT count(*), count(iso) FROM languoid').fetchone()\n",
"\n",
"conn.execute('SELECT level, count(*), count(iso) FROM languoid GROUP BY level').fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check how many entries specify location. Only language entries do so."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'Dialect', 0),\n",
" (u'Language', 7634),\n",
" (u'LanguageFamily', 0),\n",
" (u'LanguageSubfamily', 1)]"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.execute('SELECT level, count(latitude) FROM languoid GROUP BY level').fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the first and last glottocodes and ISO codes."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"aala1237, aant1238, aari1238, aari1239, aari1240, aari1244, aasa1238, aata1238, abaa1238, abab1239\n",
"zyud1238, zyph1238, zwal1238, zuwa1238, zuti1239, zurr1238, zuri1238, zura1238, zuoj1238, zuni1245\n",
"aaa, aab, aac, aad, aae, aaf, aag, aah, aai, aak, aal, aam, aan, aao, aap, aaq, aar, aas, aat, aau\n",
"zzj, zza, zyp, zyn, zyj, zyg, zyb, zwa, zuy, zun, zum, zul, zuh, zua, zty, ztx, ztu, ztt, zts, ztq\n"
]
}
],
"source": [
"GLOTTOCODES = 'SELECT id FROM languoid ORDER BY id %s LIMIT 10'\n",
"\n",
"print ', '.join(g for g, in conn.execute(GLOTTOCODES % 'ASC'))\n",
"print ', '.join(g for g, in conn.execute(GLOTTOCODES % 'DESC'))\n",
"\n",
"ISO_CODES = 'SELECT iso FROM languoid WHERE iso NOT NULL ORDER BY iso %s LIMIT 20'\n",
"\n",
"print ', '.join(i for i, in conn.execute(ISO_CODES % 'ASC'))\n",
"print ', '.join(i for i, in conn.execute(ISO_CODES % 'DESC'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the number of labels. Break them down by language and entry type."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(86463,)\n",
"[(u'en', 45862), (u'x-clld', 24393), (u'fr', 899), (u'br', 671), (u'ru', 589)]\n"
]
},
{
"data": {
"text/plain": [
"[(u'Dialect', 14827),\n",
" (u'Language', 65685),\n",
" (u'LanguageFamily', 1723),\n",
" (u'LanguageSubfamily', 4228)]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print conn.execute('SELECT count(*) FROM label').fetchone()\n",
"\n",
"print conn.execute(\"\"\"SELECT lang, count(*) AS n\n",
"FROM label GROUP BY lang ORDER BY n DESC LIMIT 5\"\"\").fetchall()\n",
"\n",
"conn.execute(\"\"\"SELECT languoid.level, count(*) AS n\n",
"FROM label JOIN languoid ON languoid.id=label.id\n",
"GROUP BY languoid.level\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the minimal, mean, and maximal number of labels per entry. Check the languoids with the most labels."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1, 3.5445824621817734, 174)\n"
]
},
{
"data": {
"text/plain": [
"[(u'Standard French', 174), (u'Standard Spanish', 154), (u'Russian', 144)]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print conn.execute(\"\"\"SELECT min(n), avg(n), max(n) FROM\n",
"(SELECT count(*) AS n FROM label GROUP BY id)\"\"\").fetchone()\n",
"\n",
"conn.execute(\"\"\"SELECT languoid.label, count(*) AS n\n",
"FROM label JOIN languoid ON languoid.id=label.id\n",
"GROUP BY label.id ORDER BY n DESC LIMIT 3\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the minimal, mean, and maximal label length. Check the frequencies of the most common lengths."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(1, 9.443553890103281, 65)]\n"
]
},
{
"data": {
"text/plain": [
"[(4, 3882), (5, 6174), (6, 7481), (7, 6887), (8, 29717), (9, 4054), (10, 3336)]"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print conn.execute(\"\"\"SELECT min(s), avg(s), max(s) FROM\n",
"(SELECT length(label) AS s FROM label)\"\"\").fetchall()\n",
"\n",
"conn.execute(\"\"\"SELECT length(label) AS l, count(*) AS n\n",
"FROM label GROUP BY l HAVING n > 3200 ORDER BY l\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display the number of references. Break them down by entry type. There are much less references for non-languages."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(212614,)\n"
]
},
{
"data": {
"text/plain": [
"[(u'Dialect', 43),\n",
" (u'Language', 210178),\n",
" (u'LanguageFamily', 1663),\n",
" (u'LanguageSubfamily', 730)]"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print conn.execute('SELECT count(*) FROM reference').fetchone()\n",
"\n",
"conn.execute(\"\"\"SELECT l.level, count(*) AS n\n",
"FROM reference AS r JOIN languoid AS l ON l.id=r.id GROUP BY l.level\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the minimal, mean, and maximal number of references per entry. Check the most referenced languoids."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1, 25.640858658948385, 2728)\n"
]
},
{
"data": {
"text/plain": [
"[(u'Luxembourgish', 2728), (u'Standard French', 2160), (u'Swahili', 1840)]"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print conn.execute(\"\"\"SELECT min(n), avg(n), max(n) FROM\n",
"(SELECT count(*) AS n FROM reference GROUP BY id)\"\"\").fetchone()\n",
"\n",
"conn.execute(\"\"\"SELECT l.label, count(*) AS n FROM reference AS r\n",
"JOIN languoid AS l ON l.id=r.id GROUP BY r.id ORDER BY n DESC LIMIT 3\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Building the tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The languoids table only specifies the direct parent of each entry. However, we want to be able to traverse the tree and query the whole path.\n",
"\n",
"As SQLite supports [hierarchical queries](http://en.wikipedia.org/wiki/Hierarchical_and_recursive_queries_in_SQL) only with version 3.8.3+, we will use a more general approach and generate a table with all tree paths.\n",
"\n",
"In other words, we will compute the *transitive closure* of the parent relation, a.k.a. tree closure table.\n",
"\n",
"Since we won't use recursion *inside* the database, we will simply put together a bunch of SQL queries and feed the results back into a new table of our database."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"PATH = \"\"\"SELECT\n",
" i0 AS child, %(depth)d AS steps, i%(depth)d AS parent, i%(next)d IS NULL AS terminal\n",
"FROM (\n",
" SELECT %(select)s\n",
" FROM languoid AS l0\n",
" %(joins)s\n",
") WHERE parent IS NOT NULL\"\"\"\n",
"\n",
"def path_query(depth):\n",
" select = ', '.join('l%(step)d.id AS i%(step)d' % {'step': i} for i in range(depth + 2))\n",
" joins = ' '.join('LEFT JOIN languoid AS l%(next)d ON l%(step)d.parent = l%(next)d.id'\n",
" % {'step': i, 'next': i + 1} for i in range(depth + 1))\n",
" return PATH % {'depth': depth, 'next': depth + 1, 'select': select, 'joins': joins}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `path_query` function generates a query for a tree walk of the length given by `depth`. Note that we will omit zero step (*reflexive*) walks."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"SELECT\n",
" i0 AS child, 1 AS steps, i1 AS parent, i2 IS NULL AS terminal\n",
"FROM (\n",
" SELECT l0.id AS i0, l1.id AS i1, l2.id AS i2\n",
" FROM languoid AS l0\n",
" LEFT JOIN languoid AS l1 ON l0.parent = l1.id LEFT JOIN languoid AS l2 ON l1.parent = l2.id\n",
") WHERE parent IS NOT NULL\n"
]
}
],
"source": [
"print path_query(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each query returns the start glottocode, number of steps, end glottocode and a boolean indicating if there is no grandparent. "
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(u'aala1237', 1, u'ramo1244', 0), (u'aant1238', 1, u'nort2920', 0), (u'aari1238', 1, u'ahkk1235', 0)]\n",
"[(u'aala1237', 2, u'kand1307', 0), (u'aant1238', 2, u'tair1260', 0), (u'aari1238', 2, u'sout2845', 1)]\n"
]
}
],
"source": [
"print conn.execute('%s ORDER BY i0 LIMIT 3' % path_query(1)).fetchall()\n",
"print conn.execute('%s ORDER BY i0 LIMIT 3' % path_query(2)).fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When all paths in the query are terminal, we have arrived at the maximal depth."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'patw1249', 18, u'indo1319', 1),\n",
" (u'yeri1239', 18, u'atla1278', 1),\n",
" (u'cher1272', 18, u'atla1278', 1),\n",
" (u'wile1238', 18, u'atla1278', 1),\n",
" (u'biri1258', 18, u'atla1278', 1),\n",
" (u'doli1238', 18, u'atla1278', 1),\n",
" (u'fufu1238', 18, u'atla1278', 1),\n",
" (u'bule1242', 18, u'atla1278', 1),\n",
" (u'pato1243', 18, u'indo1319', 1)]"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.execute(path_query(18)).fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a table for the results. Insert path walks of increasing depth until all walks have ended. Count the walks. "
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(145822,)"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.execute(\"\"\"\n",
"CREATE TABLE tree (\n",
" child TEXT NOT NULL,\n",
" steps INTEGER NOT NULL,\n",
" parent TEXT NOT NULL,\n",
" terminal BOOLEAN NOT NULL,\n",
" PRIMARY KEY (child, steps),\n",
" UNIQUE (child, parent),\n",
" UNIQUE (parent, child),\n",
" FOREIGN KEY (child) REFERENCES languoid (id),\n",
" FOREIGN KEY (parent) REFERENCES languoid (id),\n",
" CHECK (terminal IN (0, 1))\n",
")\"\"\")\n",
"\n",
"depth = 1\n",
"while True:\n",
" rows = conn.execute(path_query(depth)).fetchall()\n",
" if not rows:\n",
" break\n",
" conn.executemany('INSERT INTO tree VALUES (?, ?, ?, ?)', rows)\n",
" depth += 1\n",
"conn.commit()\n",
"\n",
"conn.execute('SELECT count(*) FROM tree').fetchone()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the SQlite we use from Python is version 3.8.3 or later, we can also get the rows for the tree closure table with a single query:\n",
"\n",
"```sql\n",
"WITH RECURSIVE tree(child, steps, parent, terminal) AS (\n",
" SELECT l.id, 1, l.parent, 0\n",
" FROM languoid AS l\n",
" WHERE l.parent IS NOT NULL\n",
"UNION ALL\n",
" SELECT t.child, t.steps + 1, p.parent, gp.parent IS NULL\n",
" FROM languoid AS p\n",
" JOIN tree AS t ON p.id=t.parent\n",
" LEFT JOIN languoid AS gp ON gp.id=p.parent\n",
" WHERE p.parent IS NOT NULL\n",
") \n",
"SELECT * FROM tree```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Querying the tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show the minimal, mean, and maximal number of languages per family. Display the language familes with the most child languages."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(1, 33.781893004115226, 1430)\n"
]
},
{
"data": {
"text/plain": [
"[(u'Atlantic-Congo', 1430), (u'Austronesian', 1274), (u'Indo-European', 583)]"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print conn.execute(\"\"\"SELECT min(n), avg(n), max(n) FROM\n",
"(SELECT count(*) AS n FROM languoid AS p\n",
"JOIN tree AS w ON w.parent=p.id AND w.terminal\n",
"JOIN languoid AS c ON w.child=c.id AND c.level='Language'\n",
"WHERE p.level='LanguageFamily' GROUP BY p.id)\"\"\").fetchone()\n",
"\n",
"conn.execute(\"\"\"SELECT p.label, count(*) AS n FROM languoid AS p\n",
"JOIN tree AS w ON w.parent=p.id AND w.terminal\n",
"JOIN languoid AS c ON w.child=c.id AND c.level='Language'\n",
"WHERE p.level='LanguageFamily' GROUP BY p.id ORDER BY n DESC LIMIT 3\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Determine the languages with the most dialects."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(u'Gumuz', 19), (u'Basque', 11), (u'Kunama', 9), (u'Berta', 7)]"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conn.execute(\"\"\"SELECT p.label, count(*) AS n FROM languoid AS p\n",
"JOIN tree AS w ON w.parent=p.id AND w.terminal\n",
"JOIN languoid AS c ON w.child=c.id AND c.level='Dialect'\n",
"WHERE p.level='Language' GROUP BY p.id ORDER BY n DESC LIMIT 4\"\"\").fetchall()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Display some of the longest paths."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- North-West Dagaric <- Birifor <- Malba Birifor <= Birifor\n",
"\n",
"Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- Central-South Dagaric <- South Dagaric <- Wali (Ghana) <= 'Bulengee\n",
"\n",
"Atlantic-Congo <- Volta-Congo <- North Volta-Congo <- Gur <- Central Gur <- Northern Central Gur <- Bwamu-Oti-Volta <- Oti-Volta <- Nuclear Oti-Volta <- Gurma-Yom-Oti-Volta Occidental <- Western Oti-Volta <- Nuclear Oti-Volta Occidental <- Northwest Oti-Volta <- Safaliba-Dagaare <- Dagaaric <- Central-South Dagaric <- South Dagaric <- Wali (Ghana) <= Cherii\n",
"\n"
]
}
],
"source": [
"for child, path in conn.execute(\"\"\"SELECT c.label, (SELECT group_concat(parent, ' <- ')\n",
" FROM (SELECT g.child AS child , p.label AS parent\n",
" FROM tree AS g JOIN languoid AS p ON g.parent=p.id\n",
" WHERE child=c.id ORDER BY g.steps DESC)\n",
" GROUP BY child)\n",
"FROM languoid AS c JOIN tree AS w ON w.child=c.id AND w.terminal\n",
"ORDER BY w.steps DESC, c.id LIMIT 3\"\"\"):\n",
" print '%s <= %s\\n' % (path, child)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that with SPARQL the [number of steps is not available](http://www.w3.org/TR/sparql11-property-paths/#Outstanding_Issues), so it might be [difficult](http://stackoverflow.com/questions/5198889/calculate-length-of-path-between-nodes) to get the path in the right order like this."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis with pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Activate [inline plotting](http://nbviewer.ipython.org/github/jrjohansson/scientific-python-lectures/blob/master/Lecture-4-Matplotlib.ipynb#The-IPython-notebook-inline-backend) in this notebook."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the language labels into a `pandas` ([docs](http://pandas.pydata.org/pandas-docs/stable/)) `DataFrame`. Display the result."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>lang</th>\n",
" <th>label</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>aari1239</th>\n",
" <td>an</td>\n",
" <td>Luenga aari</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aari1239</th>\n",
" <td>ar</td>\n",
" <td>لغة آري</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aari1239</th>\n",
" <td>en</td>\n",
" <td>Aari language</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aari1239</th>\n",
" <td>en</td>\n",
" <td>Ara</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aari1239</th>\n",
" <td>en</td>\n",
" <td>Ari</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aari1239</th>\n",
" <td>en</td>\n",
" <td>Ari-Galila</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aari1239</th>\n",
" <td>en</td>\n",
" <td>Aro</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>zuoj1238</th>\n",
" <td>en</td>\n",
" <td>Zuojiang</td>\n",
" </tr>\n",
" <tr>\n",
" <th>zuoj1238</th>\n",
" <td>x-clld</td>\n",
" <td>zuoj1238</td>\n",
" </tr>\n",
" <tr>\n",
" <th>zyph1238</th>\n",
" <td>br</td>\n",
" <td>Zac'hringeg</td>\n",
" </tr>\n",
" <tr>\n",
" <th>zyph1238</th>\n",
" <td>en</td>\n",
" <td>Zophei</td>\n",
" </tr>\n",
" <tr>\n",
" <th>zyph1238</th>\n",
" <td>en</td>\n",
" <td>Zoptei</td>\n",
" </tr>\n",
" <tr>\n",
" <th>zyph1238</th>\n",
" <td>en</td>\n",
" <td>Zyphe language</td>\n",
" </tr>\n",
" <tr>\n",
" <th>zyph1238</th>\n",
" <td>x-clld</td>\n",
" <td>zyph1238</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>65685 rows × 2 columns</p>\n",
"</div>"
],
"text/plain": [
" lang label\n",
"id \n",
"aari1239 an Luenga aari\n",
"aari1239 ar لغة آري\n",
"aari1239 en Aari language\n",
"aari1239 en Ara\n",
"aari1239 en Ari\n",
"aari1239 en Ari-Galila\n",
"aari1239 en Aro\n",
"... ... ...\n",
"zuoj1238 en Zuojiang\n",
"zuoj1238 x-clld zuoj1238\n",
"zyph1238 br Zac'hringeg\n",
"zyph1238 en Zophei\n",
"zyph1238 en Zoptei\n",
"zyph1238 en Zyphe language\n",
"zyph1238 x-clld zyph1238\n",
"\n",
"[65685 rows x 2 columns]"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"pd.set_option('max_rows', 15)\n",
"\n",
"labels = pd.read_sql_query(\"\"\"SELECT label.*\n",
"FROM label JOIN languoid ON label.id=languoid.id\n",
"WHERE languoid.level='Language' ORDER BY label.id\"\"\", conn, index_col='id')\n",
"\n",
"labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Break the number of labels down by language."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAEgCAYAAABSGc9vAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHPVJREFUeJzt3X+wXGWd5/H3BxiIPwIFoyEajOBCMOhaECHqurs0UkNA\ntyDrSvYqK6EMsy4/RlZrrUncmcrFcteFKjA6s8nuCCNJRo2RWZao2RAZaH/sIAkCBkyEW6VBbiSZ\nGSKMP3acJHz2j35uPLnc5PbN7Xu7yfm8qro4/e3ndD+nO/TnnOec249sExER9XRUtzsQERHdkxCI\niKixhEBERI0lBCIiaiwhEBFRYwmBiIgaazsEJB0l6RFJ68r9pZIGJT1cbhdX2i6RNCBpm6SLKvU5\nkrZIelLSskr9WElryjoPSJrZqQ2MiIiDG8uRwA3AD4fVbrU9p9w2AEiaDSwAZgOXAMslqbRfASyy\nPQuYJWleqS8Cdts+A1gG3Hx4mxMREWPRVghIOgV4N3Db8IdGaH4ZsMb2XtvbgQFgrqTpwFTbm0u7\nVcD8yjory/KdwIVtb0FERBy2do8EPgN8HBj+58XXS3pU0m2STii1GcDTlTY7Sm0GMFipD5baAevY\n3gc8J+mktrciIiIOyzGjNZD0HmCX7UclNSoPLQc+aduSPgXcAlzdoX6NdISBpPzGRUTEYbA94vdq\nO0cC7wQulfRj4MvAuyStsv23/u0PD30emFuWdwCvq6x/SqkdrH7AOpKOBo63vfsgG9KR29KlSzv2\nXOlT+tSLferVfqVPk9+nQxk1BGx/wvZM228A+oD7bF9ZxviHvBd4vCyvA/rKFT+nAacDm2zvBJ6X\nNLecKL4SuLuyzsKyfDlw32j9ioiI8Rt1OOgQbpZ0NvACsB34MIDtrZLWAluBPcC1/m0UXQfcAUwB\n1rtcUQTcDqyWNAA8SytsIiJigo0pBGx/C/hWWb7yEO0+DXx6hPr3gX86Qv03tC4rnTSNRmMyX64t\n6VN70qf29WK/0qf2TFafNNp4US+R5JdSfyMieoEkPI4TwxERcYRKCERE1FhCICKixhICERE1lhCI\niKixhEBERI0lBCIiaiwhEBFRYwmBiIgaSwhERNRYQiAiosYSAhERNZYQiIiosYRARESNHXEhMH36\nqUjqyG369FO7vTkREROq7RCQdJSkhyWtK/dPlLRR0hOS7pF0QqXtEkkDkrZJuqhSnyNpi6QnJS2r\n1I+VtKas84CkmYe7Qbt2PQW4I7fWc0VEHLnGciRwA60pI4csBu61fSatOYGXAEg6i9YsYbOBS4Dl\nZU5hgBXAItuzgFmS5pX6ImC37TOAZcDNh7k9ERExBm2FgKRTgHcDt1XKlwEry/JKYH5ZvhRYY3uv\n7e3AADC3TEw/1fbm0m5VZZ3qc90JXDj2TYmIiLFq90jgM8DHaY2TDDnZ9i4A2zuBaaU+A3i60m5H\nqc0ABiv1wVI7YB3b+4DnJJ3U/mZERMThGHWieUnvAXbZflRS4xBNOzn574hzYQL09/fvX240Gj05\nQXRERDc1m02azWZbbUedaF7SfwX+HbAXeBkwFbgLOBdo2N5Vhnrutz1b0mLAtm8q628AlgJPDbUp\n9T7gfNvXDLWx/aCko4FnbE8b1pW2JppvnX7oVB6JTGwfES9145po3vYnbM+0/QagD7jP9geBrwFX\nlWYLgbvL8jqgr1zxcxpwOrCpDBk9L2luOVF85bB1Fpbly2mdaI6IiAk26nDQIfw3YK2kD9Hay18A\nYHurpLW0riTaA1xb2X2/DrgDmAKst72h1G8HVksaAJ6lFTYRETHBRh0O6iUZDoqIGLtxDQdFRMSR\nKyEQEVFjCYGIiBpLCERE1FhCICKixhICERE1lhCIiKixhEBERI0lBCIiaiwhEBFRYwmBiIgaSwhE\nRNRYQiAiosYSAhERNZYQiIiosYRARESNjRoCko6T9KCkRyQ9JmlpqS+VNCjp4XK7uLLOEkkDkrZJ\nuqhSnyNpi6QnJS2r1I+VtKas84CkmZ3e0IiIeLF25hj+DXCB7XOAs4FLJM0tD99qe065bQCQNJvW\nVJOzgUuA5WVOYYAVwCLbs4BZkuaV+iJgt+0zgGXAzR3avoiIOIS2hoNs/7osHkdrXuKhORdHmq7s\nMmCN7b22twMDwFxJ04GptjeXdquA+ZV1VpblO4ELx7IRERFxeNoKAUlHSXoE2Al8s/JFfr2kRyXd\nJumEUpsBPF1ZfUepzQAGK/XBUjtgHdv7gOcknXQ4GxQREe07pp1Gtl8AzpF0PHCXpLOA5cAnbVvS\np4BbgKs71K8RJ0QG6O/v37/caDRoNBodesmIiCNDs9mk2Wy21Va2R29VXUH6Y+BXtm+t1F4PfM32\nWyQtBmz7pvLYBmAp8BRwv+3Zpd4HnG/7mqE2th+UdDTwjO1pI7y2R+tv6/TD2LbpEM/GWN+fiIhe\nIwnbI+5ct3N10KuGhnokvQz4PeBHZYx/yHuBx8vyOqCvXPFzGnA6sMn2TuB5SXPLieIrgbsr6yws\ny5cD941pCyMi4rC0Mxz0GmClpKNohcZXbK+XtErS2cALwHbgwwC2t0paC2wF9gDXVnbfrwPuAKYA\n64euKAJuB1ZLGgCeBfo6sXEREXFoYx4O6qYMB0VEjN24hoMiIuLIlRCIiKixhEBERI0lBCIiaiwh\nEBFRYwmBiIgaSwhERNRYQiAiosYSAhERNZYQiIiosYRARESNJQQiImosIRARUWMJgYiIGksIRETU\nWEIgIqLG2ple8jhJD0p6RNJjkpaW+omSNkp6QtI9Q1NQlseWSBqQtE3SRZX6HElbJD0paVmlfqyk\nNWWdByTN7PSGRkTEi40aArZ/A1xg+xzgbOASSXOBxcC9ts+kNSfwEgBJZwELgNnAJcDyMqcwwApg\nke1ZwCxJ80p9EbDb9hnAMuDmTm1gREQcXFvDQbZ/XRaPozUvsYHLgJWlvhKYX5YvBdbY3mt7OzAA\nzC0T00+1vbm0W1VZp/pcdwIXHtbWRETEmLQVApKOkvQIsBP4ZvkiP9n2LgDbO4FppfkM4OnK6jtK\nbQYwWKkPltoB69jeBzwn6aTD2qKIiGjbMe00sv0CcI6k44G7JL2JF8/m3skZ2UecEBmgv79//3Kj\n0aDRaHTwZSMiXvqazSbNZrOttrLH9t0t6Y+BXwNXAw3bu8pQz/22Z0taDNj2TaX9BmAp8NRQm1Lv\nA863fc1QG9sPSjoaeMb2tBFe26P1t3X6oVN5JMb6/kRE9BpJ2B5x57qdq4NeNXTlj6SXAb8HbAPW\nAVeVZguBu8vyOqCvXPFzGnA6sKkMGT0vaW45UXzlsHUWluXLaZ1ojoiICdbOcNBrgJWSjqIVGl+x\nvV7S94C1kj5Eay9/AYDtrZLWAluBPcC1ld3364A7gCnAetsbSv12YLWkAeBZoK8jWxcREYc05uGg\nbspwUETE2I1rOCgiIo5cCYGIiBpLCERE1FhCICKixhICERE1lhCIiKixhEBERI0lBCIiaiwhEBFR\nYwmBiIgaSwhERNRYQiAiosYSAhERNZYQiIiosYRARESNtTOz2CmS7pP0Q0mPSfqDUl8qaVDSw+V2\ncWWdJZIGJG2TdFGlPkfSFklPSlpWqR8raU1Z5wFJMzu9oRER8WLtHAnsBT5m+03AO4DrJb2xPHar\n7TnltgFA0mxas4zNBi4BlpfpJAFWAItszwJmSZpX6ouA3bbPAJYBN3di4yIi4tBGDQHbO20/WpZ/\nSWt+4Rnl4ZFmqrkMWGN7r+3twAAwt0xGP9X25tJuFTC/ss7KsnwncOFhbEtERIzRmM4JSDoVOBt4\nsJSul/SopNuGJqOnFRBPV1bbUWozgMFKfZDfhsn+dWzvA56TdNJY+hYREWPXdghIeiWtvfQbyhHB\ncuANts8GdgK3dLBfI86FGRERnXVMO40kHUMrAFbbvhvA9t9Wmnwe+FpZ3gG8rvLYKaV2sHp1nZ9J\nOho43vbukfrS39+/f7nRaNBoNNrZhIiI2mg2mzSbzbbayvbojaRVwN/Z/lilNt32zrL8UeA82x+Q\ndBbwReBttIZ5vgmcYduSvgd8BNgMfAP4nO0Nkq4F3mz7Wkl9wHzbfSP0w6P1t3UOevRtao9o5/2J\niOhlkrA94gjLqEcCkt4JXAE8JukRWt+wnwA+IOls4AVgO/BhANtbJa0FtgJ7gGsr39zXAXcAU4D1\nQ1cUAbcDqyUNAM8CLwqAiIjovLaOBHpFjgQiIsbuUEcC+YvhiIgaSwhERNRYQiAiosYSAhERNZYQ\niIiosYRARESNJQQiImosIRARUWMJgYiIGksIRETUWEIgIqLGEgIRETWWEIiIqLGEQEREjSUEIiJq\nLCEQEVFjo4aApFMk3Sfph5Iek/SRUj9R0kZJT0i6R9IJlXWWSBqQtE3SRZX6HElbJD0paVmlfqyk\nNWWdByTN7PSGRkTEi7VzJLAX+JjtNwHvAK6T9EZgMXCv7TOB+4AlAGWO4QXAbOASYLla030BrAAW\n2Z4FzJI0r9QXAbttnwEsA27uyNZFRMQhjRoCtnfafrQs/xLYBpwCXAasLM1WAvPL8qXAGtt7bW8H\nBoC5kqYDU21vLu1WVdapPtedwIXj2aiIiGjPmM4JSDoVOBv4HnCy7V3QCgpgWmk2A3i6stqOUpsB\nDFbqg6V2wDq29wHPSTppLH2LiIixO6bdhpJeSWsv/Qbbv5Q0fAb2Ts7IPuKEyAD9/f37lxuNBo1G\no4MvGxHx0tdsNmk2m221lT36d7ekY4CvA//H9mdLbRvQsL2rDPXcb3u2pMWAbd9U2m0AlgJPDbUp\n9T7gfNvXDLWx/aCko4FnbE8boR8erb+t0w+dyiPRzvsTEdHLJGF7xJ3rdoeD/hzYOhQAxTrgqrK8\nELi7Uu8rV/ycBpwObCpDRs9LmltOFF85bJ2FZflyWieaIyJigo16JCDpncC3gcdo7WIb+ASwCVgL\nvI7WXv4C28+VdZbQuuJnD63ho42l/lbgDmAKsN72DaV+HLAaOAd4FugrJ5WH9yVHAhERY3SoI4G2\nhoN6RUIgImLsOjEcFBERR6CEQEREjSUEIiJqLCEQEVFjCYGIiBpLCERE1FhCICKixhICERE1lhCI\niKixhEBERI0lBCIiaiwhEBFRYwmBiIgaSwhERNRYQiAiosYSAhERNTZqCEi6XdIuSVsqtaWSBiU9\nXG4XVx5bImlA0jZJF1XqcyRtkfSkpGWV+rGS1pR1HpA0s5MbGBERB9fOkcAXgHkj1G+1PafcNgBI\nmg0sAGYDlwDLy3zCACuARbZnAbMkDT3nImC37TOAZcDNh785ERExFqOGgO3vAj8f4aGRpiq7DFhj\ne2+ZI3gAmCtpOjDV9ubSbhUwv7LOyrJ8J3Bh+92PiIjxGM85geslPSrpNkknlNoM4OlKmx2lNgMY\nrNQHS+2AdWzvA56TdNI4+hUREW065jDXWw580rYlfQq4Bbi6Q30acTLkIf39/fuXG40GjUajQy8b\nEXFkaDabNJvNttrK9uiNpNcDX7P9lkM9JmkxYNs3lcc2AEuBp4D7bc8u9T7gfNvXDLWx/aCko4Fn\nbE87SD88Wn9bpyBG36b2iHben4iIXiYJ2yPuYLc7HCQqe+hljH/Ie4HHy/I6oK9c8XMacDqwyfZO\n4HlJc8uJ4iuBuyvrLCzLlwP3tdmniIgYp1GHgyR9CWgAvyvpp7T27C+QdDbwArAd+DCA7a2S1gJb\ngT3AtZVd9+uAO4ApwPqhK4qA24HVkgaAZ4G+jmxZRESMqq3hoF6R4aCIiLHrxHBQREQcgRICERE1\nlhCIiKixhEBERI0lBCIiaiwhEBFRYwmBiIgaSwhERNRYQiAiosYSAhERNZYQiIiosYRARESNJQQi\nImosIRARUWMJgYiIGksIRETU2KghIOl2SbskbanUTpS0UdITku6RdELlsSWSBiRtk3RRpT5H0hZJ\nT0paVqkfK2lNWecBSTM7uYEREXFw7RwJfAGYN6y2GLjX9pm05gReAiDpLGABMBu4BFhe5hQGWAEs\nsj0LmCVp6DkXAbttnwEsA24ex/ZERMQYjBoCtr8L/HxY+TJgZVleCcwvy5cCa2zvtb0dGADmlonp\np9reXNqtqqxTfa47gQsPYzsiIuIwHO45gWm2dwHY3glMK/UZwNOVdjtKbQYwWKkPltoB69jeBzwn\n6aTD7FdERIzBMR16nk7Oxj7iZMhD+vv79y83Gg0ajUYHXzoi4qWv2WzSbDbbanu4IbBL0sm2d5Wh\nnr8p9R3A6yrtTim1g9Wr6/xM0tHA8bZ3H+yFqyEQEREvNnwH+cYbbzxo23aHg8SBe+jrgKvK8kLg\n7kq9r1zxcxpwOrCpDBk9L2luOVF85bB1Fpbly2mdaI6IiEkg+9AjOZK+BDSA3wV2AUuB/w18ldYe\n/FPAAtvPlfZLaF3xswe4wfbGUn8rcAcwBVhv+4ZSPw5YDZwDPAv0lZPKI/XFbfSXzo1OidFeLyKi\n10nC9ohD7aOGQC9JCEREjN2hQiB/MRwRUWMJgYiIGksIRETUWEJgkkyffiqSxn2bPv3Ubm9KRBxB\ncmL40M/WsRPDnetXTlZHxNjkxHBERIwoIRARUWMJgYiIGksIRETUWEIgIqLGEgIRETWWEIiIqLGE\nQEREjSUEIiJqLCEQEVFjCYGIiBobVwhI2i7pB5IekbSp1E6UtFHSE5LukXRCpf0SSQOStkm6qFKf\nI2mLpCclLRtPnyIion3jPRJ4AWjYPsf23FJbDNxr+0xa8wUvAZB0FrAAmA1cAiwv8w0DrAAW2Z4F\nzJI0b5z9ioiINow3BDTCc1wGrCzLK4H5ZflSYI3tvWUO4QFgrqTpwFTbm0u7VZV1IiJiAo03BAx8\nU9JmSVeX2sm2dwHY3glMK/UZwNOVdXeU2gxgsFIfLLWIiJhgx4xz/XfafkbSq4GNkp7gxT+a39Ef\nv+/v79+/3Gg0aDQanXz6iIiXvGazSbPZbKttxyaVkbQU+CVwNa3zBLvKUM/9tmdLWgzY9k2l/QZg\nKfDUUJtS7wPOt33NCK+RSWUyqUxEjNGETCoj6eWSXlmWXwFcBDwGrAOuKs0WAneX5XVAn6RjJZ0G\nnA5sKkNGz0uaW04UX1lZJyIiJtB4hoNOBu6S5PI8X7S9UdJDwFpJH6K1l78AwPZWSWuBrcAe4NrK\nbv11wB3AFGC97Q3j6FdERLQpcwwf+tkyHBQRL3mZYzgiIkaUEIiIqLGEQEREjSUEIiJqLCEQEVFj\nCYGIiBpLCERE1FhCICKixhICERE1lhCIiKixhEBERI0lBCIiaiwhEBFRYwmBiIgaSwhERNRYz4SA\npIsl/UjSk5L+sNv9qYPp009F0rhv06ef2u1NiYjD1BMhIOko4E+BecCbgPdLeuPEvmpzYp/+sDQn\n9dV27XqK1kQ3h7rdP2qb1vN0RqeCabLDqd1JvSdbL/YrfWrPZPWpJ0IAmAsM2H7K9h5gDXDZxL5k\nc2Kf/rA0u92BETQn9dXaC6albbTpXDi1E0wXXHDBpAZTu2HZTr86GZadeq+O9D61o24hMAN4unJ/\nsNQiuq4Xg6m9PrXXr04eyXXqvTrS+9ROMN14442TsmPRKyEQEVEbvbRj0RMTzUt6O9Bv++JyfzFg\n2zcNa9f9zkZEvAQdbKL5XgmBo4EngAuBZ4BNwPttb+tqxyIijnDHdLsDALb3Sboe2EhriOr2BEBE\nxMTriSOBiIjojpwYjoiosYRARESNJQRiP0lHS/pit/sxnKSjJC3odj8ijkS1OScg6dXA7wOnUjkh\nbvtDXejLnEM9bvvhyerLcJK+C7zL9j92qw8jkfSQ7XO73Y8qSV+gdbH2Abrxbwp6/t/VX9m+cLTa\nJPfpBuALwC+A24BzgMW2N3axT7OAFcDJtt8s6S3ApbY/NVGv2RNXB02Su4HvAPcC+7rcl1vKf6cA\n5wI/AAS8BXgIeEeX+gXwY+D/SloH/GqoaPvW7nUJgHsl/SfgKxzYr93d6xJfryxPAf418LMu9QV+\n++8KDgwnlfvvmtzugKQpwMuBV0k6sfQF4Hi6/6sAH7L9WUnzgBOBDwKraV2l2C2fBz4O/E8A21sk\nfQlICHTAy233xK+T2r4AQNL/AubYfqzcfzPQ340+SVpt+4PApcBnaA0VTu1GXw7i39L6Irt2WP0N\nXegLALb/snpf0peB73apO9V/Vy+j9T79c1rv2Xdo7V12w4eB/wi8Fvh+pf4LWj8a2U1DgfQeYLXt\nH0oa8Q+qJtHLbW8a1o29E/mCdQqBr0t6t+313e5IxZlDAQBg+3FJs7vUl7dKei3wU+BPutSHQzmL\nF3+x/Y+u9ujFzgCmdbsTwErg74HPlfsfAFYBk35exfZngc9K+gPgWA78/G6b7P4M831J99DakVgs\naSrwQpf79HeS/gnlSE7S+2j9Ae2EqdM5gV/QOiz9R2AP5RDZ9vFd7NOXaQ1t/EUpXQG80vb7u9CX\njwDXAKdx4JDG0PvUtT1uAElraX2xDZ24/gBwgu2unDAue4z7gF9WyjuBJcOPECabpK22zxqtNsl9\n+irwPD3y+ZU+HQX8EXCi7Y9Kmgm83vZ3utinNwB/Bvwz4OfAT4ArbHfu1+uGv2aNQuAoWl+yp9n+\nZPnAX2P7wS72aQqtL95/WUrfBlbY/ocu9mmF7Wu69foH06NfbI/bfnO3Xv9gJP0F8Ke2v1fuvw24\nzvaVXexTL35+K2jt+b/L9uxyzmKj7fO60JePDSu9jNaQ7K9gYs/J1Wk46L9TPnDgk7TGJP8SmPQP\nfEj5sv9MufWEXgyA4mFJbx/2xfZQl/v0fUnn2d7c5X4M91bgryX9tNyfCTwh6TFaR3Vv6UKfevHz\ne5vtOZIeAbD9c0nHdqkvQ+ffzqT1nXQ3raPwD9L6LbUJU6cQ6JkPfOh/xoM93qX/SXtS5b36HX77\nxWbg9cCPutk34G3AFZKeorXHNjR01u3P7+Iuv/5IejGY9pQfrxwaf381XTonYPvG0odv07pY5Bfl\nfj/wjYl87TqFQM984MC/6tLrvhT18ns1r9sdGMlEjh+PQy8G0+eAu4Bpkv4L8D5a5wi66WRa5y2H\n/GOpTZg6nRO4gtZlhnNoXT3xPuCPbH+1qx2LiK5Ray7zC2kdxf1Vt3+9WNJ/pnUV112lNB/4iu1P\nT9hr1iUEoHc+8HKl0khvfNevWIqI7ip/+f0vyt1v235kQl+vTiEQEREHyg/IdZGkt5c/UBm6P7Vc\nNRERMSlyJNBF5UqlOS4fQvlbhodsH/KHwCIiOiVHAt0lV1LY9gvU64qtiOiyhEB3/VjSRyT9Trnd\nQOtXPCMiJkVCoLv+A63fCNkBDNL646N/39UeRUSt5JxARESN5UigR0jq2qxPEVFfCYHe0e3JLCKi\nhhICXSSp+jO63yi1Rnd6ExF1lHMCXSTpcVpzmt5M6/fDbwLOtd3NOYYjokZyJNBdbwNeB/w1rd8M\n/xnwzq72KCJqJSHQXXuA/0frKGAK8JPyB2MREZMiIdBdm2mFwHm0fjXw/WUu1oiISZFzAl0k6Vzb\nDw2rfdD26m71KSLqJSEQEVFjGQ6KiKixhEBERI0lBCIiaiwhEDGKMid0xBEpIRAxulw9EUeshEBE\nmyS9QtK9kh6S9ANJl5b66yVtlfRnkh6XtEHSceWx80rbhyXdLOmx7m5FxIESAhHt+wdgvu1zgXcB\nt1QeOx34E9tvBp4H/k2p/znw+2Xe6H3kqCJ6TEIgon0CPi3pB8C9wGslTSuP/cT20F7+94FTJZ0A\nvNL2plL/0uR2N2J0mdQ8on1XAK8CzrH9gqSf0PrNJ4DfVNrtq9QzT0T0tBwJRIxu6Iv8BOBvSgBc\nALx+hDb72X4e+HtJ55VS38R2M2LsciQQMbqhcfwvAl8rw0EPAdtGaDPc1cBtkvYB36J1viCiZ+S3\ngyImkKRX2P5VWf5DYLrtj3a5WxH75UggYmK9R9ISWv+vbQeu6mpvIobJkUBERI3lxHBERI0lBCIi\naiwhEBFRYwmBiIgaSwhERNTY/wfN8iQiZTgNlAAAAABJRU5ErkJggg==\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x28c083c8>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"labels_lang = labels.groupby('lang').size().sort_values(ascending=False)\n",
"labels_lang[labels_lang > 400].plot.bar();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Show summary statistics on the **number of labels per languoid**. Plot the more common label count frequencies."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"count 8418.000000\n",
"mean 7.802922\n",
"std 10.729925\n",
"min 1.000000\n",
"25% 2.000000\n",
"50% 5.000000\n",
"75% 9.000000\n",
"max 174.000000\n",
"dtype: float64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAEACAYAAAC9Gb03AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAHURJREFUeJzt3XuQXWWZ7/Hvr7vTSYAkJASSIQkQhECCiIJEQNAWCRcv\nhJpyEM8BBSnGAsR4GYsEq4ZYU8WA5VEUgRkPyACKCM5IImIIDGzGOEA43IIkQEYNhAhNuAQCgaQ7\n+zl/vGvbO013utO9u9e+/D5Vq3r1u9da+8muzrPWft53vUsRgZmZ1a+mvAMwM7Oh5URvZlbnnOjN\nzOqcE72ZWZ1zojczq3NO9GZmda7PRC/pWkntklZ0a79A0ipJT0i6tKx9gaTV2WvHl7UfKmmFpGck\nXV7Zf4aZmfWmP1f01wEnlDdIagM+DRwcEQcD383aZwKnAjOBk4CrJCnb7Wrg7IiYAcyQtM0xzcxs\naPSZ6CNiGfBat+ZzgUsjojPb5uWsfS5wc0R0RsQaYDUwW9JkYExEPJRtdwNwSgXiNzOzPgy0Rj8D\n+IikByTdK+mwrH0KsLZsu3VZ2xTg+bL257M2MzMbYi2D2G98RBwh6XDgVmDfyoVlZmaVMtBEvxb4\nD4CIeEjSVkm7ka7g9yrbbmrWtg6Y1kN7jyR5Ah4zswGICHVv62/pRtlSchtwLICkGUBrRLwCLAY+\nK6lV0nRgP2B5RLwIvC5pdtY5+3lgUR/B1v1y8cUX5x5DtS/+jPz5+DPq/9KbPq/oJd0EtAG7SXoO\nuBj4CXCdpCeAzVniJiJWSroFWAl0AOdF17ufD/wbMAq4IyKW9PXeZmY2eH0m+oj4X728dEYv2/8z\n8M89tD8MHLxD0ZmZ2aD5ztgctbW15R1C1fNntH3+fPrmzwi0vbpOXiRFNcZlZlbNJBGD6Iw1M7Ma\n5URvZlbnnOjNzOqcE72ZWZ1zojczq3NO9GZmdc6J3syszjnRm5nVOSd6M7M650RvZlbnnOjNzOpc\n1Sb6LVvyjsDMrD5UbaJfsSLvCMzM6kPVJvply/KOwMysPlRtov/1r/OOwMysPlTtfPS77hq89lre\nkZiZ1Y6am4/+9ddh48a8ozAzq319JnpJ10pql/Su7lFJ35BUlDShrG2BpNWSVkk6vqz9UEkrJD0j\n6fK+3xceeWRH/ilmZtaT/lzRXwec0L1R0lRgDvBsWdtM4FRgJnAScJWk0teIq4GzI2IGMEPSu45Z\nrliEQqE//wQzM9uePhN9RCwDeqqWfx/4Zre2ucDNEdEZEWuA1cBsSZOBMRHxULbdDcApfb337bf3\ntYWZmfVlQDV6SScDayPiiW4vTQHWlv2+LmubAjxf1v581rZdK1cOJDozMyvXsqM7SBoNXEQq2wyp\nt9+G9eth992H+p3MzOrXDid64D3APsDjWf19KvCIpNmkK/i9yradmrWtA6b10L4dC4mAr34Vzjmn\njba2tgGEamZWvwqFAoV+dGb2axy9pH2AX0fEwT289mfg0Ih4TdIs4GfAh0ilmbuA/SMiJD0AfAV4\nCPgN8MOIWNLL+wWkuObNg8v7HKNjZmYDHkcv6Sbgv0kjZZ6TdFa3TQIQQESsBG4BVgJ3AOdF15nk\nfOBa4BlgdW9JvrulS/uzlZmZ9aZq74wtXdGPGAGbN6dx9WZm1ruauzO2ZOtWePbZvrczM7OeVX2i\nLxZh+fK8ozAzq11Vn+gBfvObvCMwM6tdNZHo77sv7wjMzGpX1XfGAjQ3pw7Z5uYcgzIzq3I12xkL\nEAFPPZV3FGZmtakmEn2xCPffn3cUZma1qSYSPfjRgmZmA1UTNXqAPfaA9vacAjIzqwG91ehrJtFL\naTbLkSNzCsrMrMrVdGcspET/+ON5R2FmVntqJtEXi7BsWd5RmJnVnppJ9ACLF+cdgZlZ7amZGj3A\nuHGwYUMOAZmZ1YCar9EDvPFGWszMrP9qKtFL8PDDeUdhZlZbairRF4tw7715R2FmVltqKtGDpyw2\nM9tRNdUZCzB6NGzaNMwBmZnVgME8HPxaSe2SVpS1fUfSKkmPSfp3SWPLXlsgaXX2+vFl7YdKWiHp\nGUmXD/Qfsnmzp0IwM9sR/SndXAec0K1tKXBQRLwfWA0sAJA0CzgVmAmcBFwl/fWx3lcDZ0fEDGCG\npO7H7JcIP1rQzGxH9JnoI2IZ8Fq3trsjopj9+gAwNVs/Gbg5IjojYg3pJDBb0mRgTEQ8lG13A3DK\nQAKOgKVLB7KnmVljqkRn7BeBO7L1KcDastfWZW1TgOfL2p/P2gbEid7MrP9aBrOzpG8BHRHx8wrF\nU2Zh2XpbtiR//nO6ste7uhzMzBpHoVCgUCj0uV2/Rt1I2hv4dUS8r6ztTOAc4NiI2Jy1zQciIi7L\nfl8CXAw8C9wbETOz9tOAj0bEub28X6+jbgCammD1ath33z5DNzNrGIOdAkHZUjrYicA3gZNLST6z\nGDhNUquk6cB+wPKIeBF4XdLsrHP288CiAf5bKBbhwQcHureZWWPpz/DKm4D/Jo2UeU7SWcAVwC7A\nXZIekXQVQESsBG4BVpLq9udF11eG84FrgWeA1RGxZDCB+8YpM7P+qbkbpkqmTYPnnhumgMzMakDN\nP0qwu6amdPNUy6C6k83M6kddTFPc3apVeUdgZlb9ajbRF4tw//15R2FmVv1qNtGDHy1oZtYfNVuj\nB5g4EdavH4aAzMxqQN11xqbt0pTFo0YNQ1BmZlWuLjtjJXjssbyjMDOrbjWd6ItFWLYs7yjMzKpb\nTSd6gEUDnkjBzKwx1HSNHmDsWHj99SEOyMysBtRljR5g40bYsCHvKMzMqlfNJ3qAhx/OOwIzs+pV\n84k+Au69N+8ozMyqV80nevCUxWZm21PznbGQbph6++0hDMjMrAbUbWcswJYt8MILeUdhZlad6iLR\nR8Dy5XlHYWZWneom0d95Z95RmJlVp7pI9AB33ZV3BGZm1ak/Dwe/VlK7pBVlbeMlLZX0tKQ7JY0r\ne22BpNWSVkk6vqz9UEkrJD0j6fJK/0PWrElX9mZmtq3+XNFfB5zQrW0+cHdEHADcAywAkDQLOBWY\nCZwEXCWp1AN8NXB2RMwAZkjqfsxBKRbhj3+s5BHNzOpDn4k+IpYBr3Vrngtcn61fD5ySrZ8M3BwR\nnRGxBlgNzJY0GRgTEQ9l291Qtk9FFIvw4IOVPKKZWX0YaI1+j4hoB4iIF4E9svYpwNqy7dZlbVOA\n58van8/aKur22yt9RDOz2tdSoeMMQXV8Ydl6W7Zs3+9+V/kozMyqVaFQoFAo9LndQBN9u6RJEdGe\nlWVeytrXAdPKtpuatfXWvh0LdzioF16Azk5oqdTpy8ysirW1tdHW1vbX37/97W/3uF1/SzfKlpLF\nwJnZ+heARWXtp0lqlTQd2A9YnpV3Xpc0O+uc/XzZPhUTAf04uZmZNZQ+57qRdBOpbrIb0A5cDNwG\n3Eq6Sn8WODUiNmTbLwDOBjqAeRGxNGs/DPg3YBRwR0TM28577tBcN+XGjIGXX4bW1gHtbmZWs3qb\n66YuJjXbdl847jhYurTCQZmZVbm6ntSsXES6S/b66/ve1sysEdTdFX1JczM89xzsuWeFgjIzq3IN\nc0VfUizCUUd5WgQzs7pN9BHpiv6CC/KOxMwsX3Vbuuk6Ftx3HxxzTEUOZ2ZWtRpm1E1PdtopDbkc\nPbpihzQzqzoNV6Mv9/bbMGdO3lGYmeWjIRJ9BPz+93DllXlHYmY2/BqidFPS1AT/8z8wfXrFD21m\nlruGrtF3HRcmTYJ161LSNzOrJw1doy+JgPZ2+OIX847EzGz4NNQVfbklS+CEij7M0MwsXy7ddDNy\nJKxfn2a7NDOrBy7ddLNlC5TN129mVrcaNtFHwCOPwKWX5h2JmdnQatjSTdd7wZNPwsyZw/J2ZmZD\nxjX6Xt8LJkxIo3Gam4flLc3MhoRr9L2IgFdfhc98Ju9IzMyGRsMnekjJ/rbb4Ior8o7EzKzyBpXo\nJX1N0h8krZD0M0mtksZLWirpaUl3ShpXtv0CSaslrZJ0/ODDr6x58+CBB/KOwsyssgZco5e0J7AM\nODAitkj6BXAHMAt4JSK+I+lCYHxEzJc0C/gZcDgwFbgb2D96CGA4a/TdjRyZpkjYbbdc3t7MbMCG\nqkbfDOwsqQUYDawD5gKlR3NfD5ySrZ8M3BwRnRGxBlgNzB7k+1dcRwccckh6FKGZWT0YcKKPiL8A\n/wd4jpTgX4+Iu4FJEdGebfMisEe2yxRgbdkh1mVtVaVYhL/8BT75ybwjMTOrjJaB7ihpV9LV+97A\n68Ctkv437665DLAGs7BsvS1bhkdEmgvnkkvgoouG7W3NzHZIoVCgUCj0ud1gavSfAU6IiHOy388A\njgCOBdoiol3SZODeiJgpaT4QEXFZtv0S4OKIeLCHY+dWo982Dli6FI47Lu9IzMz6NhQ1+ueAIySN\nkiTg48BKYDFwZrbNF4BF2fpi4LRsZM50YD9g+SDef8hFwCc+kTpnzcxq1YBLNxGxXNIvgUeBjuzn\nj4ExwC2Svgg8C5yabb9S0i2kk0EHcF5PI26qzdat8IEPpGQ/YkTe0ZiZ7biGnwKhPyT48Ifhd7/L\nOxIzs955CoRBiIBly+DCC/OOxMxsx/mKfgfddhvMnZt3FGZm7+bZKyukuRmeeQb23TfvSMzMtuVE\nXyESjB0LL74Io0blHY2ZWRfX6CskAjZuhKOOyjsSM7P+caIfgGIRHnsMzj0370jMzPrm0s0g3Xgj\nnH563lGYmblGP2QkOPZY+NSn0s+DDvIjCc0sH070Q6SpCVpbYfNmaGlJiX/mTJgzB046CWbPhl12\nyTtKM2sETvTDaPRo2LIlJf0I2HNPOPpo+PSn4ZhjYOrUvCM0s3rkRJ+j1taU9Ds60jeAnXdO8+dc\ncgkceWTe0ZlZvXCiryLNzanMs2UL/OEPMGtW3hGZWT1woq9CEuy+O7zwQrrSNzMbDN8wVYUiYP16\nj8c3s6HlK/oqcf/9cMQReUdhZrXMpZsqN3YsvPJKqt2bmQ2ESzdVbuNG+Oxn847CzOqRr+irzJIl\ncMIJeUdhZrXIpZsaMWpUKuHstFPekZhZrRmS0o2kcZJulbRK0pOSPiRpvKSlkp6WdKekcWXbL5C0\nOtv++MG8d73avDnNm2NmVimDrdH/ALgjImYChwBPAfOBuyPiAOAeYAGApFnAqcBM4CTgKknvOvM0\nugi49174+c/zjsTM6sWASzeSxgKPRsR7urU/BXw0ItolTQYKEXGgpPlARMRl2Xa/BRZGxIM9HLth\nSzclI0ZAezuMH593JGZWK4aidDMdeFnSdZIekfRjSTsBkyKiHSAiXgT2yLafAqwt239d1mY96OyE\nj3887yjMrB4MZtR2C3AocH5E/D9J3yeVbbpfig/w0nxh2XpbtjSOCHj0UbjySjj//LyjMbNqVCgU\nKBQKfW43mNLNJOD+iNg3+/1oUqJ/D9BWVrq5NyJm9lC6WQJc7NLN9jU3w3PPpamOzcy2p+Klm6w8\ns1bSjKzp48CTwGLgzKztC8CibH0xcJqkVknTgf2A5QN9/0ZRLMJHPpKu8M3MBmJQ4+glHQJcA4wA\n/gScBTQDtwDTgGeBUyNiQ7b9AuBsoAOYFxFLezmur+jLSPCP/wgLF+YdiZlVM98wVeMkePpp2H//\nvCMxs2rlRF/jJJg0Cf7yl7RuZtadJzWrcRFpXP0FF+QdiZnVGl/R1xgJHnoIDjss70jMrNq4dFNH\ndt0VXnop3T1rZlbi0k0deeMNOOgg2LQp70jMrBY40degYhH+9Kd0E9WaNXlHY2bVzom+Rm3dCm++\nmYZb/ud/5h2NmVUzJ/oatnVrWubMgR/+MO9ozKxauTO2jpx1FvzkJ3lHYWZ58aibBiDBBz8Iv/+9\nR+SYNSIn+gbR1AS77w6PP57upDWzxuHhlQ2iWISXX4a994blnhvUzHCir0tbt0JHBxx5JPz0p3lH\nY2Z5c6KvU8ViWs44A775zbyjMbM8uUbfACQ47jj47W/TE6vMrD65M7bBNTXBXnvBY4/BuHF5R2Nm\nQ8GdsQ2uWIS1a9NInG9/GzZvzjsiMxsuvqJvME3Zqb21Fb761fSIwtGj843JzCrDpRvbhpSWESPg\nvPPgn/4Jdt4576jMbDCGrHQjqUnSI5IWZ7+Pl7RU0tOS7pQ0rmzbBZJWS1ol6fjBvrcNXEQq52zZ\nAj/4AYwfD+efDxs35h2ZmVVaJWr084CVZb/PB+6OiAOAe4AFAJJmAacCM4GTgKskP/00b6WE39EB\n//qvMGECnH02bNiQd2RmVimDSvSSpgKfAK4pa54LXJ+tXw+ckq2fDNwcEZ0RsQZYDcwezPtbZW3d\nCp2dcMMNaRqF009Pd9maWW0b7BX994Fvsm1BfVJEtANExIvAHln7FGBt2XbrsjarMp2dafnFL2Dy\nZPi7v0sPJjez2tQy0B0lfRJoj4jHJLVtZ9MB9qouLFtvyxYbTp2d6eeiRfCrX8Fll8E3vpFvTGbW\npVAoUCgU+txuwKNuJF0CnA50AqOBMcCvgA8CbRHRLmkycG9EzJQ0H4iIuCzbfwlwcUQ82MOxPeqm\nSl1yCSxYkHcUZtaTio+6iYiLImKviNgXOA24JyLOAH4NnJlt9gVgUba+GDhNUquk6cB+gOdXrDEX\nXQTf+U7eUZjZjhhw6WY7LgVukfRF4FnSSBsiYqWkW0gjdDqA86IaB/Fbny68MI3B92RpZrXBN0zZ\ngH33u67Zm1UTz3VjFfcP/wDf/37eUZhZX5zobVC+/nW4/PK8ozCz7XGit0H72tfghz/MOwoz640T\nvVXEvHlwxRV5R2FmPXGit4r5ylfgyivzjsLMunOit4r68pfhqqvyjsLMyjnRW8Wdfz78y7/kHYWZ\nlTjR25A491z48Y/zjsLMwInehtCXvgTXXNP3dmY2tJzobUidc04aZ1+aCdPMhp+nQLAh19QELS3w\n/vfDZz4Dc+fC/vun+XLMrHL8cHDL3ejRsHlzSvw77QRHHw2f+xyccEJ6opWZDY4TvVWV5mYYMSI9\nnFxKT7I6/ng47TQ45ph0UjCzHeNEb1WttbXrQeUSzJgBp5wCF1yQTgJm1jcneqspo0alq32AE0+E\nq6+GvfbKNyazaudpiq2mvPNOurovFuGuu2D6dJgzB/74x7wjM6s9TvRW9To6UsK/775U0vnoR+Gp\np/KOyqx2ONFbzSgl/Pvvh1mz4IgjYMWKvKMyq34DTvSSpkq6R9KTkp6Q9JWsfbykpZKelnSnpHFl\n+yyQtFrSKknHV+IfYI2noyN13D76aBqbf9hh8PDDeUdlVr0G3BkraTIwOSIek7QL8DAwFzgLeCUi\nviPpQmB8RMyXNAv4GXA4MBW4G9i/pweEuzPWdkRra0r+731vmkztqKN2bP933oFXX4XXXoPXX4dD\nDoGddx6aWM2G0pCPupF0G/CjbPloRLRnJ4NCRBwoaT4QEXFZtv1vgYUR8WAPx3Kitx1WSvgHHADf\n+lZqKyXw9evhpZfSz1degQ0bYONGeOutVA5qbU1375Y6gC+/HP7+7/P995jtqCFN9JL2AQrAe4G1\nETG+7LVXI2KCpCuA+yPipqz9GuCOiPiPHo7nRG8DNmJEStYjR6Y5djo7U6mnpSXdqNXcnMbqF4vp\ntdIwzpKmprT9gQfC7bfDvvvm8+8w21FDNrwyK9v8EpgXEW/y7gztjG3DqqMDtm6FTZtSEi8WU+Lu\n6Ehlmrfegjff7Hq9u9L2q1enOXnmzUvHM6tVLYPZWVILKcnfGBGLsuZ2SZPKSjcvZe3rgGllu0/N\n2nqxsGy9LVvMhk9pxs0f/Qh++lO49VY49th8YzIrVygUKBQKfW43qNKNpBuAlyPi62VtlwGvRsRl\nvXTGfgiYAtyFO2OtRjQ3pyv9OXNSwh87Nu+IzN6t4jV6SR8G/gt4gpSVA7gIWA7cQrp6fxY4NSI2\nZPssAM4GOkilnqW9HNuJ3qpSS0uq4X/ve+mRiWbVxHPdmFVIaR79/fdPnbX7759vPGYlnuvGrEIi\n0vLnP6eROeee6ydoWXXzFb3ZIDU1pZr9jTfCJz/pJ2dZfly6MRtCpc7anXdOT84644w0vfKECXlH\nZo3Eid5sGDQ1pbtsS0/OmjYNPvWp9MjE2bNTZ67ZUHGiN8tBa2u60od0EjjssJT05871g1Ss8pzo\nzapA6clZEowfn8bln356OgFMnJhKQGYD5URvVmVKY/K3bu36ucsuqa4/aRJMnZqerLX33rDnnunZ\nuX/zN+mnH55uPXGiN6sBI0akBVLi7+xMpZ8RI9LVfkT6RtDaCrvuCnvskaZV/tjH0reCmTO79rfG\n40RvVkekNDtnU1PXDJ0tLenkMHUqfOADaV6eww+H970Pdtop74htODjRmzWIkSPTzy1b0tX91q2p\n/n/wwel5u0cemU4EHvpZf5zozRpYa2sq/Wze3DXmf+RIGDMmdQpPnJjKQHvuCVOmwO67p7bddks/\nJ05M27mzuLo50ZvZNkpj/puyiVBKfQJbt6aEXuosLr3W0ZE6gceMSZ3Fe+8NM2akZa+90jJtWnq9\nUopFeOONtEyZ4hNNX5zozWzQmpvTyaH0hK7Sg1tGjEhtHR1pfeLElJinT0/zAb3nPekkMHZserRj\n+fLqq9Denh71+MorqW3DhvRwmLff7noyWLEIH/pQuuv45JPTNxDblhO9mQ2L7h3FHR1d3xCam7se\n1VjqRC4NLy29XporqHQiKX+616hRqfzU1JROJCefnBL/Bz/Y9e2jkTnRm1ldKb/ruKUFjjoqJf1P\nfzr1LTQiJ3ozq1ulbxGlu4732itNM3H66WmEUamstHlzWt55p+/1zZth3LiuvocJE6p/ZlInejNr\nGCNHprJQqS8hIq2XSkelBbZN3qV0GLHtfqXy0YQJaWTSPvukvof99us6EUyblv/9Ck70ZtawpK4k\nPhijRnVNV1H69lDqiN6yJb2+++7pZDBhQhqSOmFCKiXtumsakTR2bM8/x4wZ/KgiJ3ozsyHW1JS+\nTZS+SRSL6aRQ6nAu/0ZR3ulcGtra0pKGsI4enZ5tsMsu6UQwblzXSWPixPR7TyeMgw6qkkQv6UTg\nctJjDK+NiMt62MaJ3swaUunmtlJpqVRG6s9JY9OmKnhmrKQm4EfACcBBwOckHTicMVSXQt4B1IBC\n3gFUuULeAdSAQt4B7JAtW9L9A2+9lZZNm1IncflQ02Kxq3P57bfTNps29X7M4R55OhtYHRHPRkQH\ncDMwd5hjqCKFvAOoAYW8A6hyhbwDqAGFvAPI3XAn+inA2rLfn8/azMxsiPheMjOzOjesnbGSjgAW\nRsSJ2e/zgejeIZs6Y83MbEflPupGUjPwNPBx4AVgOfC5iFg1bEGYmTWYluF8s4jYKunLwFK6hlc6\nyZuZDaGqvGHKzMwqx52xOZG0RtLjkh6VtDzvePIm6VpJ7ZJWlLWNl7RU0tOS7pQ0Ls8Y89bLZ3Sx\npOclPZItJ+YZY94kTZV0j6QnJT0h6StZe0P/LTnR56cItEXEByJidt7BVIHrSDfSlZsP3B0RBwD3\nAAuGParq0tNnBPC9iDg0W5YMd1BVphP4ekQcBBwJnJ/dlNnQf0tO9PkR/vz/KiKWAa91a54LXJ+t\nXw+cMqxBVZlePiNIf0sGRMSLEfFYtv4msAqYSoP/LTnR5CeAuyQ9JOmcvIOpUntERDuk/8CAHx7X\nsy9LekzSNY1WktgeSfsA7wceACY18t+SE31+PhwRhwKfIH29PDrvgGqARw6821XAvhHxfuBF4Hs5\nx1MVJO0C/BKYl13Zd//baai/JSf6nETEC9nP9cCvSPMA2bbaJU0CkDQZeCnneKpORKyPrqFz/xc4\nPM94qoGkFlKSvzEiFmXNDf235ESfA0k7ZVccSNoZOB74Q75RVQWxbb15MXBmtv4FYFH3HRrQNp9R\nlrRK/hb/HQH8BFgZET8oa2vovyWPo8+BpOmkq/gg3bT2s4i4NN+o8iXpJqAN2A1oBy4GbgNuBaYB\nzwKnRsSGvGLMWy+f0cdIdegisAb4UqkW3YgkfRj4L+AJ0v+vAC4i3YV/Cw36t+REb2ZW51y6MTOr\nc070ZmZ1zonezKzOOdGbmdU5J3ozszrnRG9mVuec6M3M6pwTvZlZnfv/ULnuxbWxzxcAAAAASUVO\nRK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x8bb7080>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"nlabels = labels.groupby(level='id').size()\n",
"nlabels_hist = nlabels.value_counts().sort_index()\n",
"\n",
"print nlabels.describe()\n",
"nlabels_hist[nlabels_hist > 30].plot.area();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do statistics on the **string length of the labels**."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"count 65685.000000\n",
"mean 9.834315\n",
"std 5.168938\n",
"min 1.000000\n",
"25% 6.000000\n",
"50% 8.000000\n",
"75% 13.000000\n",
"max 65.000000\n",
"Name: label, dtype: float64\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAEACAYAAABVtcpZAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3X+U3HV97/HnazfZJECA8CvRBCEYQ4PaKtbUU9vjtt7y\nw54CvZ4i1RZQTn8IiKCHQvQcCe2xFtt7CW3F1hY1/LDcBOslVgzIgbW1VwuIiBBMojSbZLOzJBES\nQiDZ7L7vH5/vZId1f87M7vc7M6/HOXP2u5/5fmff3wzMez6/FRGYmVlrass7ADMzy4+TgJlZC3MS\nMDNrYU4CZmYtzEnAzKyFOQmYmbWwcZOApNsk9Ul6coTnPi5pUNJxFWUrJG2W9IyksyrKz5T0pKRN\nklZVlHdIuju75ruSXlePGzMzs/FNpCbwJeDs4YWSFgG/BXRXlC0DLgSWAecCt0pS9vTngcsiYimw\nVFL5NS8DfhYRbwBWAZ+t8l7MzGySxk0CEfEd4PkRnroZuHZY2fnA3RFxKCK2AJuB5ZIWAHMj4tHs\nvNuBCyquWZ0d3wO8e1J3YGZmVauqT0DSecC2iPjRsKcWAtsqfu/JyhYC2yvKt2dlr7omIgaAFyqb\nl8zMbOrMmOwFkuYAnyA1BU0FjX+KmZnVw6STAPB64FTgh1l7/yLgcUnLSd/8Kzt2F2VlPcDJI5RT\n8dwOSe3A0RHxs5H+sCQvdGRmVoWIGPEL9kSbg5Q9iIinImJBRJwWEYtJTTtvjYjngHXA+7IRP4uB\nJcAjEVEC9khaniWOi4F7s9deB1ySHf8e8NA4N5LL44Ybbsjtb/uefD/N8mi2e2qU+xnLRIaIfgX4\nf6QRPVslfXD453JFgtgArAE2APcBl8dQBFcAtwGbgM0RsT4rvw04QdJm4Grg+vFiMjOz+hi3OSgi\n3j/O86cN+/0zwGdGOO/7wJtHKD9AGlZqZmbTzDOGJ6izszPvEOqu2e7J91N8zXZPzXA/Gq+9qEgk\nRSPFa2ZWBJKIGjuGzcysCTkJmJm1MCcBM7MW5iRgZtbCnATMzFqYk0AD2LwZ9u7NOwoza0YeItoA\nzj0X9u2D//iPvCMxs0Y01hDRahaQs2m2dSt0d49/npnZZLk5qAHs3An790MLVoLMbIo5CRRcBDz/\nfPr50kt5R2NmzcZJoOD27IFDh9Lxz0bcZcHMrHpOAgVXKg0dPz/STs9mZjVwEii4UgnasnfJNQEz\nqzcngYIrlWDmzHTsJGBm9eYkUHCl0tCoIDcHmVm9OQkUXG8v9Pen41278o3FzJqPk0DBdXdDe3s6\n3r4931jMrPk4CRTc9u1DfQLbtuUbi5k1HyeBguvtHaoJ9PXlG4uZNR8ngYLbvRsGB4eOzczqyQvI\nFdjAALz44tA8gT178o3HzJrPuDUBSbdJ6pP0ZEXZZyU9I+kJSV+VdHTFcyskbc6eP6ui/ExJT0ra\nJGlVRXmHpLuza74r6XX1vMFGtnMnzJo1tGzE/v35xmNmzWcizUFfAs4eVvYA8MaIeAuwGVgBIOkM\n4EJgGXAucKuk8hrWnwcui4ilwFJJ5de8DPhZRLwBWAV8tob7aSqVE8UAXnklv1jMrDmNmwQi4jvA\n88PKHoyIrKWa7wGLsuPzgLsj4lBEbCEliOWSFgBzI+LR7LzbgQuy4/OB1dnxPcC7q7yXplO5bhCk\n5qHynAEzs3qoR8fwh4D7suOFQOVAxp6sbCFQOcp9e1b2qmsiYgB4QdJxdYir4fX2DjUFQRol5FnD\nZlZPNXUMS/ok0B8R/1KneABG3AKtbOXKlYePOzs76ezsrOOfLpZSCQ4eHPpdSkngpJPyi8nMiq+r\nq4uurq4JnVt1EpB0KfAe4DcrinuAkyt+X5SVjVZeec0OSe3A0REx6lJplUmg2fX0pCagsggvImdm\n4xv+BfnGG28c9dyJNgeJim/oks4BrgXOi4gDFeetAy7KRvwsBpYAj0RECdgjaXnWUXwxcG/FNZdk\nx78HPDTBmJped/erO4adBMys3satCUj6CtAJHC9pK3AD8AmgA/hWNvjnexFxeURskLQG2AD0A5dH\nHN4Z9wrgy8Bs4L6IWJ+V3wbcIWkzsBu4qE731vB6elISKDcJOQmYWb0pGmj3cknRSPHWatGiNEFs\n376hsltugauuyi8mM2s8koiIEftbPWO4wHbvTp3BlYYPGzUzq4WTQEHt35+Gh1Z2DEPqJzAzqxcv\nIFdQfX0we/bQrmJlvb35xGNmzclJoKAqN5ivtHPn9MdiZs3LSaCgKvcWruQZw2ZWT04CBVUqjbxO\nUOVIITOzWjkJFFRv76uXjCh7+eXpj8XMmpeTQEGNtqn8wYMjNxOZmVXDSaCgtm6Fjo6fL29rS7uN\nmZnVg5NAQe3YMbTBfCXJS0eYWf04CRTUzp0/P1u4zCOEzKxePGO4gCLSB/1INQFwTcDM6sc1gQJ6\n4QWYMWPk0UGDg04CZlY/TgIFVCqN3CkMQ7UEM7N6cBIooFJp9P4AgF27pi8WM2tuTgIFVCqlZp/R\nbN06fbGYWXNzEiig4RvMDzfaRDIzs8lyEiig3t6R1w0q6+ubvljMrLk5CRRQd/fow0PBo4PMrH6c\nBApo+/bRRwcB7N07fbGYWXNzEiig0TaUKXvppemLxcyam5NAAe3aNfbooLE6jc3MJsNJoGAOHUob\nxxw4MPo5AwNjP29mNlHjJgFJt0nqk/RkRdk8SQ9I2ijpfknHVDy3QtJmSc9IOqui/ExJT0raJGlV\nRXmHpLuza74r6XX1vMFGs3Nn2mB+YGD0c9rbPWvYzOpjIjWBLwFnDyu7HngwIk4HHgJWAEg6A7gQ\nWAacC9wqHZ77+nngsohYCiyVVH7Ny4CfRcQbgFXAZ2u4n4ZXKqV1g8bjJGBm9TBuEoiI7wDDP3LO\nB1Znx6uBC7Lj84C7I+JQRGwBNgPLJS0A5kbEo9l5t1dcU/la9wDvruI+mkapNLHzPEzUzOqh2j6B\nkyKiDyAiSsBJWflCYFvFeT1Z2UKgcp7r9qzsVddExADwgqTjqoyr4ZVKqV9gLBFOAmZWH/XaT6Ce\nu96OsXQarFy58vBxZ2cnnZ2ddfzT+RtvyQhwEjCzsXV1ddHV1TWhc6tNAn2S5kdEX9bU81xW3gOc\nXHHeoqxstPLKa3ZIageOjohRP+Iqk0Az6ukZu1MYnATMbGzDvyDfeOONo5470eYg8epv6OuAS7Pj\nS4B7K8ovykb8LAaWAI9kTUZ7JC3POoovHnbNJdnx75E6mltWdzfMnDn+eb29Ux+LmTW/cWsCkr4C\ndALHS9oK3AD8FbBW0oeAbtKIICJig6Q1wAagH7g8IspNRVcAXwZmA/dFxPqs/DbgDkmbgd3ARfW5\ntcbU0zP6rmKVurunJx4za24a+owuPknRSPFWY9GitL3keEtDvOtdMMEmPzNrcZKIiBH7W73RfMHs\n3j32rmJlzz03/jlmZuNxEiiQ/fvT8NDxhohCqi2YmdXKawcVSF9fWjJiIvbtm9pYzKw1OAkUyHhL\nSFd65ZWpjcXMWoOTQIGUSmkOwET094+93LSZ2UQ4CRTIeHsLV2pr8w5jZlY7J4EC6e2d+IYxkmcN\nm1ntnAQKZNu28c+p5OWkzaxWTgIFsnXr2BvMV/L6QWZWD04CBdLbm3YNmwgnATOrByeBAtm5c+Ln\nRrg5yMxq5xnDBRGRZgFPtCYAk0saZmYjcU2gIJ5/fmKrh1baunXq4jGz1uAkUBCl0sQ7hcu2bx//\nHDOzsTgJFESpNLHVQyt5JVEzq5WTQEGUSuNvKzmcRweZWa2cBAqiVJr4khFlL744NbGYWetwEiiI\nHTsmnwT275+aWMysdTgJFMTWrWl00GQcODA1sZhZ63ASKIht22DmzMldEwEvvzw18ZhZa3ASKIi+\nvolvKFPW1uZZw2ZWGyeBgti1a/KbxEhOAmZWGy8bUQD9/fDSS5OfJ+BF5MysVjXVBCRdI+kpSU9K\nuktSh6R5kh6QtFHS/ZKOqTh/haTNkp6RdFZF+ZnZa2yStKqWmBrRzp1pg/nJzhNwEjCzWlWdBCS9\nFvgIcGZE/CKpVvH7wPXAgxFxOvAQsCI7/wzgQmAZcC5wq3T4u+/ngcsiYimwVNLZ1cbViEqlyY8M\ngtR85CRgZrWotU+gHThS0gxgDtADnA+szp5fDVyQHZ8H3B0RhyJiC7AZWC5pATA3Ih7Nzru94pqW\nMJkN5odzEjCzWlSdBCJiB/C/gK2kD/89EfEgMD8i+rJzSsBJ2SULgcoNFHuysoVA5VJo27OyllHN\nkhFlO3bUNxYzay1VdwxLOpb0rf8UYA+wVtIHgOHfaav8jjuylStXHj7u7Oyks7Ozni+fi1JpcktI\nV/Jy0mY2XFdXF11dXRM6t5bRQf8DeDYifgYg6WvArwJ9kuZHRF/W1FNe67IHOLni+kVZ2WjlI6pM\nAs2ip6f6mkCpVN9YzKzxDf+CfOONN456bi19AluBd0ianXXwvhvYAKwDLs3OuQS4NzteB1yUjSBa\nDCwBHsmajPZIWp69zsUV17SE7u7JzxYu27WrvrGYWWupuiYQEY9Iugf4AdCf/fwCMBdYI+lDQDdp\nRBARsUHSGlKi6AcujzjcHXoF8GVgNnBfRKyvNq5G1NOTkkA1TUIvvFD/eMysdSiqHZaSA0nRSPFO\n1MKFsGdPmjA2WXPnwt699Y/JzJqHJCJixOmonjFcALUM83zllfrFYWatx0kgZy+9BIcOpUc1+vtT\np3J7e33jMrPW4AXkctbXl5aMqFZbW2pKMjOrhpNAzkqlyS8hXUnyrGEzq56TQM5KpckvIT2cl5M2\ns2o5CeSsVKq+PwC8kqiZ1cZJIGe9vdUvGQFOAmZWGyeBnG3bNv45Y4lwc5CZVc9JIGdbt0JHR22v\n8dxz459jZjYSJ4Gc9fbWPsbfK4maWbWcBHK2c2ftr7F9+/jnmJmNxDOGcxSRFoCrtSZQj0RiZq3J\nNYEcPf982lu4ltFB4NFBZlY9J4EclUq1dwoDvPhi7a9hZq3JSaDOBgYmvml8qZSWfajV/v21v4aZ\ntSYngTp773thyRJ49tnxz61lg/lKBw9OPPGYmVVyEqijf/93+Pa3YcsWWLoU1qwZ+/ze3rQUdD28\n/HJ9XsfMWouTQJ0MDsLVV6cP48HB9Hjf++BP/mT0b+k7dtQnCbS1uXPYzKrjJFAna9akJSDKH/jl\nn//0T/CmN43cebt1axodVCvJS0eYWXWcBOrgwAG49tq01+/w4Z4RsHEjLFgAjz326ue2bUsbzNfK\ni8iZWbWcBOrg1ltTIhitk3dgID2/fDnccstQeV9fbRvKlDkJmFm1nARq9Pzz8Od/npp7xhrpUx46\nes018Du/k/oMdu+ufUMZSK/hJGBm1fCyETX69KfTt/lXXpnY+RFw331w8slpk/l6zBOAlFDMzCar\nppqApGMkrZX0jKSnJf2KpHmSHpC0UdL9ko6pOH+FpM3Z+WdVlJ8p6UlJmyStqiWm6bRlC3zhC5Mf\nnjk4mJqCBgfrM08A0kgjM7PJqrU56BbgvohYBvwS8GPgeuDBiDgdeAhYASDpDOBCYBlwLnCrdPh7\n8OeByyJiKbBU0tk1xjUtVqxIP6sZo1+vD/+y7u76vp6ZtYaqk4Cko4Ffj4gvAUTEoYjYA5wPrM5O\nWw1ckB2fB9ydnbcF2Awsl7QAmBsRj2bn3V5xTWE99lhq1qn3h3m1+vryjsDMGlEtNYHFwC5JX5L0\nuKQvSDoCmB8RfQARUQJOys5fCFRuptiTlS0EKlfE356VFVa5g/fgweKs2+M+ATOrRi0dwzOAM4Er\nIuIxSTeTmoKGz4+t66o2K1euPHzc2dlJZ2dnPV9+Qr7xDdiwYdr/7Jj27Mk7AjMriq6uLrq6uiZ0\nrqLKlcckzQe+GxGnZb//GikJvB7ojIi+rKnn4YhYJul6ICLipuz89cANQHf5nKz8IuBdEfHhEf5m\nVBtvvRw6BMuWpdm+te4DUE9HHgn79uUdhZkVkSQiYsSxiFU3B2VNPtskLc2K3g08DawDLs3KLgHu\nzY7XARdJ6pC0GFgCPJI1Ge2RtDzrKL644prC+eIX07fueozvr6cDB/KOwMwaUdU1AQBJvwT8MzAT\neBb4INAOrAFOJn3LvzAiXsjOXwFcBvQDH42IB7LytwFfBmaTRht9dJS/l2tNYN8+OO20tDxEET90\n+/vrsxaRmTWXsWoCNSWB6ZZ3Eli5Ej73udQJW7R/tra2NELohBPyjsTMisZJoA56e+H001OfQBHX\n7m9vT53VS5eOf66ZtZYp6RNoNZ/6VPq2XcQEUOblpM1sstyCPAE7dsC//EvxmoAqeSVRM6uGawIT\n8NWvQkdHcSaGjcRJwMyq4SQwAXfeOfLOYEUS4eYgM5s8J4Fx9PTAU081xtDLUinvCMys0TgJjOOe\ne2DWrInvF5CnrVvzjsDMGo2TwDjuuqv4TUFlPT15R2BmjcZJYAzbt8PTTzdGUxDArl15R2BmjcZJ\nYAz33JNGBTVCUxC4Y9jMJs9JYAx33pn2AW4UjdJsZWbF4SQwiq1b4Zln0nIMjaLIs5nNrJicBEbR\naE1BkPY3KPKsZjMrHieBUTRaU1BZI8ZsZvlxEhhBdzds2tRYTUGQFrjz0hFmNhlOAiO45x6YObOx\nmoLKPELIzCbDSWAEd9zRuM0qrgmY2WQ4CQyzZQv85CepaaXRDA46CZjZ5DTgR93UWrs2zRAu4h7C\n4/Fy0mY2WU4Cw9x5Z7H3DRjP7t15R2BmjcRJoMJ//zc8+2xjNgWVbd+edwRm1kga+OOu/tauTcNC\nG7EpqGzbtrwjMLNG4iRQ4Y47Gn/phb6+vCMws0ZScxKQ1CbpcUnrst/nSXpA0kZJ90s6puLcFZI2\nS3pG0lkV5WdKelLSJkmrao2pGj/9aWoOkvL46/XjPgEzm4x61AQ+Cmyo+P164MGIOB14CFgBIOkM\n4EJgGXAucKt0+CP388BlEbEUWCrp7DrENSnN0BQEsGdP3hGYWSOpKQlIWgS8B/jniuLzgdXZ8Wrg\nguz4PODuiDgUEVuAzcBySQuAuRHxaHbe7RXXTJs772zMGcLDNfLIJjObfrXWBG4GrgUq166cHxF9\nABFRAk7KyhcCld2WPVnZQqByTMv2rGza/OQnab2gZtAMiczMpk/VGydK+m2gLyKekNQ5xql1Xdx4\n5cqVh487Ozvp7BzrT0/M2rVpWOjBgzW/VO4GBqC/P619ZGatqauri66urgmdq6hyAXpJfwn8AXAI\nmAPMBb4G/DLQGRF9WVPPwxGxTNL1QETETdn164EbgO7yOVn5RcC7IuLDI/zNqDbesZxxRuoYboYk\n0N4OO3bASSeNf66ZtQZJRMSIw16qbg6KiE9ExOsi4jTgIuChiPhD4OvApdlplwD3ZsfrgIskdUha\nDCwBHsmajPZIWp51FF9ccc2U27y5uSZYSV46wswmrurmoDH8FbBG0odI3/IvBIiIDZLWkEYS9QOX\nV3ytvwL4MjAbuC8i1k9BXCNauzZ9cDZDLQDS+kFeTtrMJqrq5qA8TEVz0LJlaamIZkkCElx5Jfzt\n3+YdiZkVxZQ0BzWDjRuhp6e59uWNgL/7O/iLv8g7EjNrBC2dBL761fTNub8/70jq71Ofgr/8y7yj\nMLOia+kk8LWvNfe4+k9+Ej7zmbyjMLMia9k+gT17YMGCVBNo9EXjxnPTTfBnf5Z3FGaWF/cJjOCh\nh+DII5s/AQBcdx38zd/kHYWZFVHLJoFvfhP27s07iulz7bVw8815R2FmRdOSzUERsHBhSgIvvVSH\nwBrIzTfD1VfnHYWZTSc3Bw2zeTPs29fcncKjueaaNITUzAxaNAmsX58WjBsYyDuSfFx1FXzuc3lH\nYWZF0JJJ4Otfb71moOGuvBI+9jHvP2DW6lquT+DAAZg3L622uW9fnQJrUG1t6d/hD/4gDSM98cS8\nIzKzqeA+gQrf+Q4ccYQTAMDgYJotfccd8JrXwNlnw49/nHdUZjadpmIV0UJbv94JYLhDh9LPhx+G\nN74R3vxmWLUKJrJfz759sGlTWofppz+FY46B178eTjsNTj0VZs+eysjNrFYt1xx0+umwbVtrTBKr\nVkdHqiEsXAif/jS8//3p32zjxvTYsAF++MO0LefevalmNTiYRlsdOpR2NSuvyTR3Lrz2tbBkCbzp\nTWkDn9NOg+OOS7O29+5NP8uPF15I+yHs2pV+vvQSvPe98JGPeLc0s2qN1RzUUkmgtzd9Sx0YaJ6l\no6fSzJlDtYQjj0z9BwcPpg/7WbNgxoyUTMcaZTVjRkoqg4PpWim9brk/IiI9BgbSo/z32tvTtVLq\nx5kzJ818vu669HpmNnFOApnVq9M4eW+6MjlS/sttlxPGrFlpstunPuWmJrOJcsdw5hvfaK2lIuol\n7wQAqZYwOJhqBTfdBMceCx//uIf6mtWqZWoCg4NpaOjgoDuGm4GUHu3t8Ed/lJbMPvrovKMyKybX\nBIDHH0/t0J4c1Rwihoa4/uM/wvHHp87jwcG8IzNrLC2TBO6/P31g+EOi+ZQ7lG+9FV73ujSSycwm\npmWSwLp1zbmNpA0ZHIS+Pli8GG6/Pe9ozBpDS/QJlHcRc3NQaznvPPjXf039BmatrOX7BMq7iDkB\ntJZ/+7e0HMamTXlHYlZcVScBSYskPSTpaUk/knRVVj5P0gOSNkq6X9IxFdeskLRZ0jOSzqooP1PS\nk5I2SVpV2y39vG9+M9UGrLUMDqY5IcuWwd//fd7RmBVT1c1BkhYACyLiCUlHAd8Hzgc+COyOiM9K\nug6YFxHXSzoDuAt4O7AIeBB4Q0SEpP8CroyIRyXdB9wSEfeP8Dcn3RxU3kVszx7XBFqZBL/xG3Df\nfWnCmVkrmZLmoIgoRcQT2fE+4BnSh/v5wOrstNXABdnxecDdEXEoIrYAm4HlWTKZGxGPZufdXnFN\nzcq7iB04UK9XtEYUAd/+Nsyfn9Y9MrOkLn0Ckk4F3gJ8D5gfEX2QEgVwUnbaQqBy8F5PVrYQ2F5R\nvj0rq4v772/tXcRsyMBA+kLw1remuQVmVoelpLOmoHuAj0bEPknD22vqOvxo5cqVh487OzvpHGe9\n469/3TOEbUj5y8Cf/mlaqfSTn8w3HrOp0NXVRVdX14TOrWmIqKQZwL8B34yIW7KyZ4DOiOjLmnoe\njohlkq4HIiJuys5bD9wAdJfPycovAt4VER8e4e9Nqk/Au4jZeK6+Gm6+Oe8ozKbWVA4R/SKwoZwA\nMuuAS7PjS4B7K8ovktQhaTGwBHgkazLaI2m5JAEXV1xTk//8z7QEsROAjWbVKvjDP8w7CrP8VN0c\nJOmdwAeAH0n6AanZ5xPATcAaSR8ifcu/ECAiNkhaA2wA+oHLK77WXwF8GZgN3BcR66uNq9L69V5l\n0sZ3112we3daZVYjflcya15NPWP4F34BurvTJihmY5Hg7W+H7343DSQwayYtOWO4VIKtW71gnE1M\nBDz2WNr+0mtMWStp2iTwwANp5ylvI2kTNTiY5pUsXux+JGsdTZsEvIuYVWNwMNUiTzklDSE1a3ZN\n2ScwOAjHHTc0Ochsstra0siyp59OCcGskbVcn8Djj6eOPq8VZNUaHEwDCpYsgb/+62Lss2w2FZoy\nCXgXMauHgYH0uO46OPlkeOSRvCMyq7+mTALr1rlD2OojIj2eew7e8Q54z3u8LLk1l6ZLAo8/Dhs2\nwMyZeUdizaS/PyWDb30LTjwRPvMZNxFZc2iqjuGDB9MKkc8+6wliNrWktGXp2rXwznfmHY3Z2Fqm\nY/imm2DnTi8bbVMvIv239uu/DmedlXYwM2tETVMTePpp+NVfTbUB1wJsOs3IVuD67d+Ga65JicFL\nT1iRjFUTaIokMDAAv/IrqS/g5ZdzCMwM6OiAQ4fgiCPg/PPTMtVve5sXpbP8NX1z0C23wLZtbgay\nfB08mIYl798Pa9akLybHHQd//Mfw1FN5R2c2soavCfzkJ+nblpuBrIja21NNIAKOPx4+8AG46io4\n9dS8I7NW0rTNQYOD8K53wQ9+4H0DrPhmzBiad7BkCXz842lDmzlz8o7Mml3TNgd94QuwcWNqhzUr\nukOHUpPl4CBs2ZL2OT7mGDjvPHj0Uc87sHw0bE1g2zZ485tTE9CBAzkHZlaDcofyCSfAhz8MH/lI\najoyq5emaw6KgHPOSbtAvfhi3lGZ1ceMGamWIMEv/zJ87GPpi86iRTB3bt7RWSNruiRw++2pPXXf\nPncGW3OaNSstVdHRkX7OmJFqBwsWpKWt3/AGeP3r08J2ixal4yOOyDtqK6qmSgK9vcEb35g6gt0M\nZK1kzpxUSxgYGNoCc+bMVNbfD8cemzqczzwzLXb3pjfBsmXueLYmSwK/+7vBww/DCy/kHY1Zscyc\nmR7l+QozZ6a+hmOPTTWHt70tNTMtXJhqFPPnp34Iz25ufk2VBI4/Pti/3zODzSaqMjlEDNUeBgZS\nkjjqqDSp7cQTU4I45ZT0OPFEmDdv6HHccelnR0fed2ST1RBJQNI5wCrSsNXbIuKmEc6Jjo7wXgFm\ndVROEpBqEOUNmWbMGJrsVvlcezsceWRKHscem5LD/Pmpb2LRopQ8yo8TTkg/3V+Rr8InAUltwCbg\n3cAO4FHgooj48bDzQoqcxlN3AZ15/OEp1EVz3VMXvp/p0dGRkkFbW6pdlGsVg4OpfMaMoZnS/f3p\nvKOOgvb2Lo4+upNZs2D27NRfMWdOShJz5gwll6OPTglk3ryUaCprJMcem84rwppMXV1ddHZ25h3G\nuMZKAjOmO5hRLAc2R0Q3gKS7gfOBHw8/Mb+c1UVR/4esXhfNdU9d+H6mx1i18UOHfn4C58BAGswx\nONjFyy93Hi4vz6COSAmk/LO8NeyMGSmBlPstykklIiWNo45KCalca6n8Ofwxa1Yaanv00SmRHHts\n+v3II4eST/m4nJwqE9Xs2T+/WdVkk0BE2pnuuefSY+fOoeMdO2D79nTOKaekTv6FC+G1r02P17xm\najr5i5IEFgLbKn7fTkoMZtYkyoljMjP8y6OgRnLgQEoWY3Vsl780ln+WE0x55rY0VKNpa0u/V9Yw\nKpNSOe5hUpZ/AAADrklEQVRy81lHR4rhH/5h6Drp1a9T/j0C9u5NCaC9PV3b1jb0uuUmuI6OdE15\ndnk5qZX//WbNSs1vCxakxHDMMakWdcQRKXmVa1SzZ7/6MZaiJAEzs0kZqcYxWRGTf43+/nTNK6+k\nD+7J9FFGpPNHu2b4sPf+/lcnwpdfhr4+6O2F739/cnGPpih9Au8AVkbEOdnv1wMxvHNYUv7Bmpk1\noKJ3DLcDG0kdw73AI8DvR8QzuQZmZtbkCtEcFBEDkq4EHmBoiKgTgJnZFCtETcDMzPLhCeMTIGmL\npB9K+oGkR/KOpxqSbpPUJ+nJirJ5kh6QtFHS/ZKOyTPGyRjlfm6QtF3S49njnDxjnAxJiyQ9JOlp\nST+SdFVW3pDv0Qj385GsvCHfI0mzJP1X9hnwI0k3ZOUN+f5Uck1gAiQ9C7wtIp7PO5ZqSfo1YB9w\ne0T8YlZ2E7A7Ij4r6TpgXkRcn2ecEzXK/dwAvBgR/zvX4KogaQGwICKekHQU8H3SXJkP0oDv0Rj3\n8z4a9z06IiL2Z32Y/wlcBbyXBnx/KrkmMDGiwf+tIuI7wPAkdj6wOjteDVwwrUHVYJT7gfReNZyI\nKEXEE9nxPuAZYBEN+h6Ncj8Ls6cb9T3anx3OIvWnBg36/lRq6A+2aRTAtyQ9KumP8g6mjk6KiD5I\n/9MCJ+UcTz1cKekJSf/ciFVzAEmnAm8BvgfMb/T3qOJ+/israsj3SFKbpB8AJeBbEfEoTfD+OAlM\nzDsj4kzgPcAVWVNEM2r0tsFbgdMi4i2k/1EbscnhKOAe4KPZN+jh70lDvUcj3E/DvkcRMRgRbyXV\n0JZLeiMN/v6Ak8CERERv9nMn8DWaZ0mLPknz4XAb7nM5x1OTiNh5eBNq+Cfg7XnGM1mSZpA+MO+I\niHuz4oZ9j0a6n0Z/jwAiYi9pYadzaOD3p8xJYBySjsi+zSDpSOAs4Kl8o6qaeHV77Drg0uz4EuDe\n4RcU3KvuJ/ufsOx/0njv0xeBDRFxS0VZI79HP3c/jfoeSTqh3HQlaQ7wW6R+jkZ+fwCPDhqXpMWk\nb/9B6gy6KyL+Kt+oJk/SV0hLUh4P9AE3AP8XWAucDHQDF0ZEQ+zZNsr9/Aap7XkQ2AL8Sbm9tugk\nvRP4d+BHpP/WAvgEafb8GhrsPRrjft5PA75Hkt5M6vhtyx7/JyI+Lek4GvD9qeQkYGbWwtwcZGbW\nwpwEzMxamJOAmVkLcxIwM2thTgJmZi3MScDMrIU5CZiZtTAnATOzFvb/AUL0kJu9GPxnAAAAAElF\nTkSuQmCC\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x3c6e78d0>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"slabel = labels['label'].str.len()\n",
"slabel_hist = slabel.value_counts().sort_index()\n",
"\n",
"print slabel.describe()\n",
"slabel_hist[slabel_hist > 30].plot.area();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the languages and the full paths into data frames. Join them into one data frame and show the result."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>level</th>\n",
" <th>parent</th>\n",
" <th>obsolete</th>\n",
" <th>status</th>\n",
" <th>iso</th>\n",
" <th>latitude</th>\n",
" <th>longitude</th>\n",
" <th>steps</th>\n",
" <th>parent_tree</th>\n",
" <th>terminal</th>\n",
" </tr>\n",
" <tr>\n",
" <th>id</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>aari1239</th>\n",
" <td>Aari</td>\n",
" <td>Language</td>\n",
" <td>aari1238</td>\n",
" <td>0</td>\n",
" <td>established</td>\n",
" <td>aiw</td>\n",
" <td>5.95034</td>\n",
" <td>36.5721</td>\n",
" <td>3</td>\n",
" <td>sout2845</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>aari1240</th>\n",
" <td>Aariya</td>\n",
" <td>Language</td>\n",
" <td>book1242</td>\n",
" <td>0</td>\n",