This script makes a best-effort attempt at resolving inline references and citations. Wikipedia citations are complicated, so this is a quick-and-dirty hack. It's intended for research on citation resolution and natural language processing.
Similar tools:
curl "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2" | bzcat | python extract_citations.py
One JSON object per line, structured like this:
{
"page": "Page Title",
"paragraphs": [
{
"text":"The plaintext paragraph (with mediawiki style tags).",
"refs":[
{ "cite": 0, "offset":0},
{ "cite": 0, "offset":0, ... }
]
}
],
"reflist": [
"text citation",
{"@type":"template_name", ...}
]
}
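A minimal sketch of consuming this output, under two assumptions that the format description does not state outright: `cite` is a zero-based index into the page's `reflist`, and each `reflist` entry is either a plain string or a template object. The function name `iter_citations` and the sample record are hypothetical and only serve the illustration.

```python
import json


def iter_citations(lines):
    """Yield (page, paragraph_text, offset, citation) for every inline ref."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        reflist = record.get("reflist", [])
        for paragraph in record.get("paragraphs", []):
            for ref in paragraph.get("refs", []):
                cite = ref.get("cite")
                # Assumption: "cite" is an index into "reflist".
                in_range = isinstance(cite, int) and 0 <= cite < len(reflist)
                citation = reflist[cite] if in_range else None
                yield (record.get("page"), paragraph.get("text"),
                       ref.get("offset"), citation)


# Hypothetical record in the shape described above (not real script output).
sample = json.dumps({
    "page": "Example",
    "paragraphs": [
        {"text": "Water is wet.",
         "refs": [{"cite": 0, "offset": 13}, {"cite": 1, "offset": 13}]}
    ],
    "reflist": [
        "Smith, J. (2001). Water.",
        {"@type": "cite book", "title": "Water"},
    ],
})

rows = list(iter_citations([sample]))
for row in rows:
    print(row)
```

To run it over real output, pass an open file (or `sys.stdin`) instead of the one-element list.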
(see also: https://en.wikipedia.org/wiki/Wikipedia:Citing_sources) References can either be structured using templates like `{{cite}}`, or be plain text.
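To illustrate the difference between the two kinds of reference, here is a rough sketch that turns a flat template into an object with an `@type` key and falls back to `None` for plain text. It is not taken from the script: `parse_flat_template` is a hypothetical helper, and it deliberately refuses nested templates and wiki links, which need a real parser.

```python
def parse_flat_template(source):
    """Parse '{{name|key=value|...}}' into a dict, or return None.

    Returns None for plain-text citations and for anything containing
    nested templates or wiki links, which this sketch does not handle.
    """
    source = source.strip()
    if not (source.startswith("{{") and source.endswith("}}")):
        return None
    inner = source[2:-2]
    if "{{" in inner or "[[" in inner:
        return None
    name, *args = inner.split("|")
    result = {"@type": name.strip()}
    positional = 1  # MediaWiki numbers unnamed parameters from 1
    for arg in args:
        key, sep, value = arg.partition("=")
        if sep:
            result[key.strip()] = value.strip()
        else:
            result[str(positional)] = arg.strip()
            positional += 1
    return result


structured = parse_flat_template(
    "{{cite web |url=https://example.org/?a=1 |title=Example}}"
)
plain = parse_flat_template("Smith, J. (2001). Water.")
print(structured)
print(plain)
```

Splitting on the first `=` only keeps URLs with query strings intact when they are passed as named parameters; an unnamed parameter containing `=` would be misread, which is one of several reasons this is only a sketch.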
Inline with text, references can have the following forms:
- `<ref>` tags
- `{{r}}` or `{{rp}}` templates (TODO)

The `{{cite}}` templates are typically in a References section or inside the `<ref>` tag.
The `<ref>` tags generate reference lists that are displayed using `{{reflist}}`.
This list can itself contain `{{cite}}` templates (using `refs=`). (TODO!)
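The `<ref>` handling described above could be approximated as follows. This is an illustrative sketch, not the script's actual implementation: `REF_RE` and `strip_refs` are hypothetical names, the regular expression covers only paired `<ref>...</ref>` and self-closing `<ref ... />` tags, and the offset is assumed to be a character position in the text after the tags have been removed.

```python
import re

REF_RE = re.compile(
    r"<ref\b[^>]*?/>"              # self-closing: <ref name="x" />
    r"|<ref\b[^>]*>(.*?)</ref>",   # paired: <ref>...</ref>
    re.DOTALL | re.IGNORECASE,
)


def strip_refs(wikitext):
    """Return (plain_text, refs).

    Each ref is {"offset": int, "body": str or None}; body is None for
    self-closing tags, which reuse a reference defined elsewhere.
    """
    parts, refs = [], []
    pos = 0      # read position in the original wikitext
    out_len = 0  # number of characters emitted so far
    for match in REF_RE.finditer(wikitext):
        chunk = wikitext[pos:match.start()]
        parts.append(chunk)
        out_len += len(chunk)
        refs.append({"offset": out_len, "body": match.group(1)})
        pos = match.end()
    parts.append(wikitext[pos:])
    return "".join(parts), refs


text, refs = strip_refs(
    'Water is wet.<ref>{{cite book|title=Water}}</ref>'
    ' Fire is hot.<ref name="fire" />'
)
print(text)
print(refs)
```

Resolving the self-closing form back to its named definition, and handling `{{r}}` or `{{rp}}`, is left out here, matching the TODO status noted above.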
- recursive template parsing
- short ref templates