@reiyw
Last active January 17, 2016 17:22
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"http://www.cl.ecei.tohoku.ac.jp/nlp100/"
]
},
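{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Chapter 3: Regular Expressions\n",
"\n",
"The file [jawiki-country.json.gz](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/jawiki-country.json.gz) contains Wikipedia articles exported in the following format.\n",
"\n",
"- Each line stores the information for one article in JSON format\n",
"- Each line is a dictionary object with the article name under the \"title\" key and the article body under the \"text\" key, written out as JSON\n",
"- The whole file is compressed with gzip\n",
"\n",
"Write programs that perform the following tasks."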
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ファイル `jawiki-country.json.gz' はすでに存在するので、取得しません。\r\n",
"\r\n"
]
}
],
"source": [
"!wget -nc http://www.cl.ecei.tohoku.ac.jp/nlp100/data/jawiki-country.json.gz"
]
},
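{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 20. Reading JSON data\n",
"Read the JSON file of Wikipedia articles and display the body of the article about \"イギリス\" (the United Kingdom). Run problems 21-29 against the article body extracted here."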
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{{redirect|UK}}\r\n",
"{{基礎情報 国\r\n",
"|略名 = イギリス\r\n",
"|日本語国名 = グレートブリテン及び北アイルランド連合王国\r\n",
"|公式国名 = {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>英語以外での正式国名:<br/>\r\n",
"*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[スコットランド・ゲール語]])<br/>\r\n",
"*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[ウェールズ語]])<br/>\r\n",
"*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[アイルランド語]])<br/>\r\n",
"*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[コーンウォール語]])<br/>\r\n",
"*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[スコットランド語]])<br/>\r\n"
]
}
],
"source": [
"!zcat jawiki-country.json.gz | grep '\"title\": \"イギリス\"' | jq -r .text > british.txt\n",
"!head british.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 21. Extracting lines that contain category names\n",
"Extract the lines in the article that declare category names."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[Category:イギリス|*]]\n",
"[[Category:英連邦王国|*]]\n",
"[[Category:G8加盟国]]\n",
"[[Category:欧州連合加盟国]]\n",
"[[Category:海洋国家]]\n",
"[[Category:君主国]]\n",
"[[Category:島国|くれいとふりてん]]\n",
"[[Category:1801年に設立された州・地域]]\n"
]
}
],
"source": [
"%%bash\n",
"echo -e 'Category:\\ncategory:\\nカテゴリ:\\nカテゴリ:' | parallel grep {} british.txt | tee 021.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 22. Extracting category names\n",
"Extract the category names of the article (as names, not as whole lines)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extract only the matched part with grep (positive lookbehind is a Perl extension):"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"イギリス\r\n",
"英連邦王国\r\n",
"G8加盟国\r\n",
"欧州連合加盟国\r\n",
"海洋国家\r\n",
"君主国\r\n",
"島国\r\n",
"1801年に設立された州・地域\r\n"
]
}
],
"source": [
"!grep -Po '(?<=:)[^|\\]]+' 021.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Capture a group with sed and substitute:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"イギリス\r\n",
"英連邦王国\r\n",
"G8加盟国\r\n",
"欧州連合加盟国\r\n",
"海洋国家\r\n",
"君主国\r\n",
"島国\r\n",
"1801年に設立された州・地域\r\n"
]
}
],
"source": [
"!sed -r 's/^[^:]+:([^]|]+).+$/\\1/' 021.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 23. Section structure\n",
"Display the section names contained in the article together with their levels (for example, level 1 for \"== Section name ==\")."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1\t国名\n",
"1\t歴史\n",
"1\t地理\n",
"2\t気候\n",
"1\t政治\n",
"1\t外交と軍事\n",
"1\t地方行政区分\n",
"2\t主要都市\n",
"1\t科学技術\n",
"1\t経済\n",
"2\t鉱業\n",
"2\t農業\n",
"2\t貿易\n",
"2\t通貨\n",
"2\t企業\n",
"1\t交通\n",
"2\t道路\n",
"2\t鉄道\n",
"2\t海運\n",
"2\t航空\n",
"1\t通信\n",
"1\t国民\n",
"2\t言語\n",
"2\t宗教\n",
"2\t婚姻\n",
"2\t教育\n",
"1\t文化\n",
"2\t食文化\n",
"2\t文学\n",
"2\t哲学\n",
"2\t音楽\n",
"3\tイギリスのポピュラー音楽\n",
"2\t映画\n",
"2\tコメディ\n",
"2\t国花\n",
"2\t世界遺産\n",
"2\t祝祭日\n",
"1\tスポーツ\n",
"2\tサッカー\n",
"2\t競馬\n",
"2\tモータースポーツ\n",
"1\t脚注\n",
"1\t関連項目\n",
"1\t外部リンク\n"
]
}
],
"source": [
"%%script bash\n",
"< british.txt grep '^==' | sed -r 's/\\s+//g' | gawk -F'=' '{print (NF-3)/2\"\\t\"$((NF+1)/2)}'"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"国名\t1\r\n",
"歴史\t1\r\n",
"地理\t1\r\n",
"気候\t2\r\n",
"政治\t1\r\n",
"外交と軍事\t1\r\n",
"地方行政区分\t1\r\n",
"主要都市\t2\r\n",
"科学技術\t1\r\n",
"経済\t1\r\n",
"鉱業\t2\r\n",
"農業\t2\r\n",
"貿易\t2\r\n",
"通貨\t2\r\n",
"企業\t2\r\n",
"交通\t1\r\n",
"道路\t2\r\n",
"鉄道\t2\r\n",
"海運\t2\r\n",
"航空\t2\r\n",
"通信\t1\r\n",
"国民\t1\r\n",
"言語\t2\r\n",
"宗教\t2\r\n",
"婚姻\t2\r\n",
"教育\t2\r\n",
"文化\t1\r\n",
"食文化\t2\r\n",
"文学\t2\r\n",
"哲学\t2\r\n",
"音楽\t2\r\n",
"イギリスのポピュラー音楽\t3\r\n",
"映画\t2\r\n",
"コメディ\t2\r\n",
"国花\t2\r\n",
"世界遺産\t2\r\n",
"祝祭日\t2\r\n",
"スポーツ\t1\r\n",
"サッカー\t2\r\n",
"競馬\t2\r\n",
"モータースポーツ\t2\r\n",
"脚注\t1\r\n",
"関連項目\t1\r\n",
"外部リンク\t1\r\n"
]
}
],
"source": [
"!grep '^=' british.txt | sed -r -e 's/\\s+//g' -e 's/^=+([^=]+)====$/\\1\\t3/' -e 's/^=+([^=]+)===$/\\1\\t2/' -e 's/^=+([^=]+)==$/\\1\\t1/'"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"国名\t1\r\n",
"歴史\t1\r\n",
"地理\t1\r\n",
"気候\t2\r\n",
"政治\t1\r\n",
"外交と軍事\t1\r\n",
"地方行政区分\t1\r\n",
"主要都市\t2\r\n",
"科学技術\t1\r\n",
"経済\t1\r\n",
"鉱業\t2\r\n",
"農業\t2\r\n",
"貿易\t2\r\n",
"通貨\t2\r\n",
"企業\t2\r\n",
"交通\t1\r\n",
"道路\t2\r\n",
"鉄道\t2\r\n",
"海運\t2\r\n",
"航空\t2\r\n",
"通信\t1\r\n",
"国民\t1\r\n",
"言語\t2\r\n",
"宗教\t2\r\n",
"婚姻\t2\r\n",
"教育\t2\r\n",
"文化\t1\r\n",
"食文化\t2\r\n",
"文学\t2\r\n",
"哲学\t2\r\n",
"音楽\t2\r\n",
"イギリスのポピュラー音楽\t3\r\n",
"映画\t2\r\n",
"コメディ\t2\r\n",
"国花\t2\r\n",
"世界遺産\t2\r\n",
"祝祭日\t2\r\n",
"スポーツ\t1\r\n",
"サッカー\t2\r\n",
"競馬\t2\r\n",
"モータースポーツ\t2\r\n",
"脚注\t1\r\n",
"関連項目\t1\r\n",
"外部リンク\t1\r\n"
]
}
],
"source": [
"!grep '^=' british.txt | pandoc --from=mediawiki --to=org | grep '\\S' | awk -F' ' '{n=length($1); print $2\"\\t\"n-1}'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 24. Extracting file references\n",
"Extract all the media files referenced from the article."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Positive lookbehind is slow:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Battle of Waterloo 1815.PNG\n",
"The British Empire.png\n",
"Uk topo en.jpg\n",
"BenNevis2005.jpg\n",
"Elizabeth II greets NASA GSFC employees, May 8, 2007 edit.jpg\n",
"Palace of Westminster, London - Feb 2007.jpg\n",
"David Cameron and Barack Obama at the G20 Summit in Toronto.jpg\n",
"Soldiers Trooping the Colour, 16th June 2007.jpg\n",
"Scotland Parliament Holyrood.jpg\n",
"London.bankofengland.arp.jpg\n",
"City of London skyline from London City Hall - Oct 2008.jpg\n",
"Oil platform in the North SeaPros.jpg\n",
"Eurostar at St Pancras Jan 2008.jpg\n",
"Heathrow T5.jpg\n",
"Anglospeak.svg\n",
"Royal Coat of Arms of the United Kingdom.svg\n",
"CHANDOS3.jpg\n",
"The Fabs.JPG\n",
"PalaceOfWestminsterAtNight.jpg\n",
"Westminster Abbey - West Door.jpg\n",
"Edinburgh Cockburn St dsc06789.jpg\n",
"Canterbury Cathedral - Portal Nave Cross-spire.jpeg\n",
"Kew Gardens Palm House, London - July 2009.jpg\n",
"2005-06-27 - United Kingdom - England - London - Greenwich.jpg\n",
"Stonehenge2007 07 30.jpg\n",
"Yard2.jpg\n",
"Durham Kathedrale Nahaufnahme.jpg\n",
"Roman Baths in Bath Spa, England - July 2006.jpg\n",
"Fountains Abbey view02 2005-08-27.jpg\n",
"Blenheim Palace IMG 3673.JPG\n",
"Liverpool Pier Head by night.jpg\n",
"Hadrian's Wall view near Greenhead.jpg\n",
"London Tower (1).JPG\n",
"Wembley Stadium, illuminated.jpg\n"
]
}
],
"source": [
"%%bash\n",
"echo -e '(?<=File:)[^|]+\\n(?<=ファイル:)[^|]+' | parallel grep -Po {} british.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When extracting something with a regular expression, start by narrowing down the candidates with a simple regular expression:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Royal Coat of Arms of the United Kingdom.svg\n",
"CHANDOS3.jpg\n",
"The Fabs.JPG\n",
"PalaceOfWestminsterAtNight.jpg\n",
"Westminster Abbey - West Door.jpg\n",
"Edinburgh Cockburn St dsc06789.jpg\n",
"Canterbury Cathedral - Portal Nave Cross-spire.jpeg\n",
"Kew Gardens Palm House, London - July 2009.jpg\n",
"2005-06-27 - United Kingdom - England - London - Greenwich.jpg\n",
"Stonehenge2007 07 30.jpg\n",
"Yard2.jpg\n",
"Durham Kathedrale Nahaufnahme.jpg\n",
"Roman Baths in Bath Spa, England - July 2006.jpg\n",
"Fountains Abbey view02 2005-08-27.jpg\n",
"Blenheim Palace IMG 3673.JPG\n",
"Liverpool Pier Head by night.jpg\n",
"Hadrian's Wall view near Greenhead.jpg\n",
"London Tower (1).JPG\n",
"Wembley Stadium, illuminated.jpg\n",
"Battle of Waterloo 1815.PNG\n",
"The British Empire.png\n",
"Uk topo en.jpg\n",
"BenNevis2005.jpg\n",
"Elizabeth II greets NASA GSFC employees, May 8, 2007 edit.jpg\n",
"Palace of Westminster, London - Feb 2007.jpg\n",
"David Cameron and Barack Obama at the G20 Summit in Toronto.jpg\n",
"Soldiers Trooping the Colour, 16th June 2007.jpg\n",
"Scotland Parliament Holyrood.jpg\n",
"London.bankofengland.arp.jpg\n",
"City of London skyline from London City Hall - Oct 2008.jpg\n",
"Oil platform in the North SeaPros.jpg\n",
"Eurostar at St Pancras Jan 2008.jpg\n",
"Heathrow T5.jpg\n",
"Anglospeak.svg\n"
]
}
],
"source": [
"%%bash\n",
"echo -e 'File:\\nファイル:' | parallel grep {} british.txt | sed -r -e 's/^[^F]*File:([^|]+).+$/\\1/' -e 's/^[^フ]*ファイル:([^|]+).+$/\\1/'"
]
},
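{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 25. Extracting the template\n",
"Extract the field names and values of the \"基礎情報\" (basic information) template contained in the article and store them in a dictionary object."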
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"grep can extract the range between a start line and an end line (here by chaining the -A and -B context options):"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"略名 = イギリス\r\n",
"日本語国名 = グレートブリテン及び北アイルランド連合王国\r\n",
"公式国名 = {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>英語以外での正式国名:<br/>\r\n",
"国旗画像 = Flag of the United Kingdom.svg\r\n",
"国章画像 = [[ファイル:Royal Coat of Arms of the United Kingdom.svg|85px|イギリスの国章]]\r\n",
"国章リンク = ([[イギリスの国章|国章]])\r\n",
"標語 = {{lang|fr|Dieu et mon droit}}<br/>([[フランス語]]:神と私の権利)\r\n",
"国歌 = [[女王陛下万歳|神よ女王陛下を守り給え]]\r\n",
"位置画像 = Location_UK_EU_Europe_001.svg\r\n",
"公用語 = [[英語]](事実上)\r\n",
"首都 = [[ロンドン]]\r\n",
"最大都市 = ロンドン\r\n",
"元首等肩書 = [[イギリスの君主|女王]]\r\n",
"元首等氏名 = [[エリザベス2世]]\r\n",
"首相等肩書 = [[イギリスの首相|首相]]\r\n",
"首相等氏名 = [[デーヴィッド・キャメロン]]\r\n",
"面積順位 = 76\r\n",
"面積大きさ = 1 E11\r\n",
"面積値 = 244,820\r\n",
"水面積率 = 1.3%\r\n",
"人口統計年 = 2011\r\n",
"人口順位 = 22\r\n",
"人口大きさ = 1 E7\r\n",
"人口値 = 63,181,775<ref>[http://esa.un.org/unpd/wpp/Excel-Data/population.htm United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population]</ref>\r\n",
"人口密度値 = 246\r\n",
"GDP統計年元 = 2012\r\n",
"GDP値元 = 1兆5478億<ref name=\"imf-statistics-gdp\">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a= IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>\r\n",
"GDP統計年MER = 2012\r\n",
"GDP順位MER = 5\r\n",
"GDP値MER = 2兆4337億<ref name=\"imf-statistics-gdp\" />\r\n",
"GDP統計年 = 2012\r\n",
"GDP順位 = 6\r\n",
"GDP値 = 2兆3162億<ref name=\"imf-statistics-gdp\" />\r\n",
"GDP/人 = 36,727<ref name=\"imf-statistics-gdp\" />\r\n",
"建国形態 = 建国\r\n",
"確立形態1 = [[イングランド王国]]/[[スコットランド王国]]<br />(両国とも[[連合法 (1707年)|1707年連合法]]まで)\r\n",
"確立年月日1 = [[927年]]/[[843年]]\r\n",
"確立形態2 = [[グレートブリテン王国]]建国<br />([[連合法 (1707年)|1707年連合法]])\r\n",
"確立年月日2 = [[1707年]]\r\n",
"確立形態3 = [[グレートブリテン及びアイルランド連合王国]]建国<br />([[連合法 (1800年)|1800年連合法]])\r\n",
"確立年月日3 = [[1801年]]\r\n",
"確立形態4 = 現在の国号「'''グレートブリテン及び北アイルランド連合王国'''」に変更\r\n",
"確立年月日4 = [[1927年]]\r\n",
"通貨 = [[スターリング・ポンド|UKポンド]] (&pound;)\r\n",
"通貨コード = GBP\r\n",
"時間帯 = ±0\r\n",
"夏時間 = +1\r\n",
"ISO 3166-1 = GB / GBR\r\n",
"ccTLD = [[.uk]] / [[.gb]]<ref>使用は.ukに比べ圧倒的少数。</ref>\r\n",
"国際電話番号 = 44\r\n",
"注記 = <references />\r\n"
]
}
],
"source": [
"!cat british.txt | grep -EA 10000 '^\\{\\{基礎情報' | grep -EB 10000 '^\\}\\}$' | grep ' = ' | sed 's/^|//' | tee 025.txt"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"GDP統計年元 = 2012\n",
"ISO 3166-1 = GB / GBR\n",
"通貨 = [[スターリング・ポンド|UKポンド]] (&pound;)\n",
"国際電話番号 = 44\n",
"公用語 = [[英語]](事実上)\n",
"人口順位 = 22\n",
"人口密度値 = 246\n",
"日本語国名 = グレートブリテン及び北アイルランド連合王国\n",
"標語 = {{lang|fr|Dieu et mon droit}}<br/>([[フランス語]]:神と私の権利)\n",
"確立年月日2 = [[1707年]]\n",
"確立年月日3 = [[1801年]]\n",
"確立年月日1 = [[927年]]/[[843年]]\n",
"確立形態3 = [[グレートブリテン及びアイルランド連合王国]]建国<br />([[連合法 (1800年)|1800年連合法]])\n",
"確立形態2 = [[グレートブリテン王国]]建国<br />([[連合法 (1707年)|1707年連合法]])\n",
"人口大きさ = 1 E7\n",
"注記 = <references />\n",
"面積大きさ = 1 E11\n",
"国歌 = [[女王陛下万歳|神よ女王陛下を守り給え]]\n",
"元首等肩書 = [[イギリスの君主|女王]]\n",
"国章画像 = [[ファイル:Royal Coat of Arms of the United Kingdom.svg|85px|イギリスの国章]]\n",
"人口値 = 63,181,775<ref>[http://esa.un.org/unpd/wpp/Excel-Data/population.htm United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population]</ref>\n",
"夏時間 = +1\n",
"通貨コード = GBP\n",
"GDP/人 = 36,727<ref name=\"imf-statistics-gdp\" />\n",
"略名 = イギリス\n",
"面積値 = 244,820\n",
"最大都市 = ロンドン\n",
"公式国名 = {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>英語以外での正式国名:<br/>\n",
"ccTLD = [[.uk]] / [[.gb]]<ref>使用は.ukに比べ圧倒的少数。</ref>\n",
"首都 = [[ロンドン]]\n",
"GDP値 = 2兆3162億<ref name=\"imf-statistics-gdp\" />\n",
"確立形態4 = 現在の国号「'''グレートブリテン及び北アイルランド連合王国'''」に変更\n",
"GDP統計年 = 2012\n",
"GDP統計年MER = 2012\n",
"国章リンク = ([[イギリスの国章|国章]])\n",
"GDP値元 = 1兆5478億<ref name=\"imf-statistics-gdp\">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a= IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>\n",
"建国形態 = 建国\n",
"首相等氏名 = [[デーヴィッド・キャメロン]]\n",
"位置画像 = Location_UK_EU_Europe_001.svg\n",
"国旗画像 = Flag of the United Kingdom.svg\n",
"確立形態1 = [[イングランド王国]]/[[スコットランド王国]]<br />(両国とも[[連合法 (1707年)|1707年連合法]]まで)\n",
"元首等氏名 = [[エリザベス2世]]\n",
"時間帯 = ±0\n",
"首相等肩書 = [[イギリスの首相|首相]]\n",
"人口統計年 = 2011\n",
"面積順位 = 76\n",
"GDP順位MER = 5\n",
"GDP値MER = 2兆4337億<ref name=\"imf-statistics-gdp\" />\n",
"水面積率 = 1.3%\n",
"GDP順位 = 6\n",
"確立年月日4 = [[1927年]]\n"
]
}
],
"source": [
"with open('025.txt') as f:\n",
"    # Split on the first ' = ' only, so a value that itself contains ' = ' stays intact.\n",
"    D = dict(line.strip().split(' = ', 1) for line in f)\n",
"    for k, v in D.iteritems():\n",
"        print '{} = {}'.format(k, v)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With a command that supports multiline matching, this is a little easier:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"略名 = イギリス\r\n",
"日本語国名 = グレートブリテン及び北アイルランド連合王国\r\n",
"公式国名 = {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>英語以外での正式国名:<br/>\r\n",
"国旗画像 = Flag of the United Kingdom.svg\r\n",
"国章画像 = [[ファイル:Royal Coat of Arms of the United Kingdom.svg|85px|イギリスの国章]]\r\n",
"国章リンク = ([[イギリスの国章|国章]])\r\n",
"標語 = {{lang|fr|Dieu et mon droit}}<br/>([[フランス語]]:神と私の権利)\r\n",
"国歌 = [[女王陛下万歳|神よ女王陛下を守り給え]]\r\n",
"位置画像 = Location_UK_EU_Europe_001.svg\r\n",
"公用語 = [[英語]](事実上)\r\n",
"首都 = [[ロンドン]]\r\n",
"最大都市 = ロンドン\r\n",
"元首等肩書 = [[イギリスの君主|女王]]\r\n",
"元首等氏名 = [[エリザベス2世]]\r\n",
"首相等肩書 = [[イギリスの首相|首相]]\r\n",
"首相等氏名 = [[デーヴィッド・キャメロン]]\r\n",
"面積順位 = 76\r\n",
"面積大きさ = 1 E11\r\n",
"面積値 = 244,820\r\n",
"水面積率 = 1.3%\r\n",
"人口統計年 = 2011\r\n",
"人口順位 = 22\r\n",
"人口大きさ = 1 E7\r\n",
"人口値 = 63,181,775<ref>[http://esa.un.org/unpd/wpp/Excel-Data/population.htm United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population]</ref>\r\n",
"人口密度値 = 246\r\n",
"GDP統計年元 = 2012\r\n",
"GDP値元 = 1兆5478億<ref name=\"imf-statistics-gdp\">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a= IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>\r\n",
"GDP統計年MER = 2012\r\n",
"GDP順位MER = 5\r\n",
"GDP値MER = 2兆4337億<ref name=\"imf-statistics-gdp\" />\r\n",
"GDP統計年 = 2012\r\n",
"GDP順位 = 6\r\n",
"GDP値 = 2兆3162億<ref name=\"imf-statistics-gdp\" />\r\n",
"GDP/人 = 36,727<ref name=\"imf-statistics-gdp\" />\r\n",
"建国形態 = 建国\r\n",
"確立形態1 = [[イングランド王国]]/[[スコットランド王国]]<br />(両国とも[[連合法 (1707年)|1707年連合法]]まで)\r\n",
"確立年月日1 = [[927年]]/[[843年]]\r\n",
"確立形態2 = [[グレートブリテン王国]]建国<br />([[連合法 (1707年)|1707年連合法]])\r\n",
"確立年月日2 = [[1707年]]\r\n",
"確立形態3 = [[グレートブリテン及びアイルランド連合王国]]建国<br />([[連合法 (1800年)|1800年連合法]])\r\n",
"確立年月日3 = [[1801年]]\r\n",
"確立形態4 = 現在の国号「'''グレートブリテン及び北アイルランド連合王国'''」に変更\r\n",
"確立年月日4 = [[1927年]]\r\n",
"通貨 = [[スターリング・ポンド|UKポンド]] (&pound;)\r\n",
"通貨コード = GBP\r\n",
"時間帯 = ±0\r\n",
"夏時間 = +1\r\n",
"ISO 3166-1 = GB / GBR\r\n",
"ccTLD = [[.uk]] / [[.gb]]<ref>使用は.ukに比べ圧倒的少数。</ref>\r\n",
"国際電話番号 = 44\r\n",
"注記 = <references />\r\n"
]
}
],
"source": [
"!/home/ryo-t/.go/bin/sift --no-color -m '{{基礎情報.+?^}}' british.txt | grep ' = ' | sed 's/^|//' | tee 025.txt"
]
},
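{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 26. Removing emphasis markup\n",
"While doing the processing for problem 25, remove the MediaWiki emphasis markup (weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (reference: [Markup quick reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8))."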
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%%bash\n",
"cat << 'EOS' > 026_test.txt\n",
"1 a''b''c\n",
"2 a''b''c''d\n",
"3 '''a'''b''c''d\n",
"4 ''a''b'''c'''d\n",
"5 ''''''a'''b'''c''''''\n",
"6 ''''a'''''\n",
"7 ''a'b'\n",
"8 '''''a'''''b'''''c''d''e'''f'''g'''''\n",
"9 a\n",
"EOS"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 abc\n",
"2 abc''d\n",
"3 abcd\n",
"4 abcd\n",
"5 'abc'\n",
"6 ''''a'''''\n",
"7 ''a'b'\n",
"8 abcdefg\n",
"9 a\n"
]
}
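],
"source": [
"def eliminate_emphasis(s):\n",
"    # Strip quote runs used as emphasis markers, longest first: 5, 3, then 2 quotes.\n",
"    for n in (5, 3, 2):\n",
"        l = s.split(n * \"'\")\n",
"        if len(l) % 2 == 0:\n",
"            # Odd number of markers: keep the last, unpaired one behind a placeholder.\n",
"            s = \"{}<<<{}>>>{}\".format(''.join(l[:-1]), n, l[-1])\n",
"        else:\n",
"            s = ''.join(l)\n",
"    # Turn the placeholders back into the unpaired quote runs.\n",
"    for n in (5, 3, 2):\n",
"        token = '<<<{}>>>'.format(n)\n",
"        s = s.replace(token, n * \"'\")\n",
"    return s\n",
"\n",
"test = !cat 026_test.txt\n",
"for s in test:\n",
"    print eliminate_emphasis(s)"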
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 abc\n",
"2 abc''d\n",
"3 abcd\n",
"4 abcd\n",
"5 abc''\n",
"6 'a''\n",
"7 ''a'b'\n",
"8 abcdefg\n",
"9 a\n"
]
}
],
"source": [
"%%bash\n",
"pandoc --from mediawiki --to plain <(sed 's/$/\\n/' 026_test.txt) | grep '\\S'"
]
},
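{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 27. Removing internal links\n",
"In addition to the processing for problem 26, remove the MediaWiki internal link markup from the template values and convert them to text (reference: [Markup quick reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8))."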
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Overwriting 027_test.txt\n"
]
}
],
"source": [
"%%file 027_test.txt\n",
"1 [[hoge|fuga]]\n",
"2 [[foo#bar|baz]]\n",
"3 hoge\n",
"4 [[hoge]]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 fuga\n",
"2 baz\n",
"3 hoge\n",
"4 hoge\n"
]
}
],
"source": [
"import re\n",
"\n",
"def eliminate_link_to_a_section(s):\n",
"    return re.sub(r'\\[\\[([^|]+\\|)?([^]]+)\\]\\]', r'\\2', s)\n",
"\n",
"test = !cat 027_test.txt\n",
"for s in test:\n",
"    print eliminate_link_to_a_section(s)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 fuga\n",
"2 baz\n",
"3 hoge\n",
"4 hoge\n"
]
}
],
"source": [
"%%bash\n",
"pandoc --from mediawiki --to plain <(sed 's/$/\\n/' 027_test.txt) | grep '\\S'"
]
},
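{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 28. Removing MediaWiki markup\n",
"In addition to the processing for problem 27, remove MediaWiki markup from the template values as far as possible and format the country's basic information."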
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"GDP/人\": \"36,727[^1]\n",
"\n",
"[^1]: \n",
"\", \n",
" \"GDP値\": \"2兆3162億[^1]\n",
"\n",
"[^1]: \n",
"\", \n",
" \"GDP値MER\": \"2兆4337億[^1]\n",
"\n",
"[^1]: \n",
"\", \n",
" \"GDP値元\": \"1兆5478億[^1]\n",
"\n",
"[^1]: [http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=\n",
" IMF>Data and Statistics>World Economic Outlook Databases>By\n",
" Countrise>United Kingdom]\n",
"\", \n",
" \"GDP統計年\": \"2012\n",
"\", \n",
" \"GDP統計年MER\": \"2012\n",
"\", \n",
" \"GDP統計年元\": \"2012\n",
"\", \n",
" \"GDP順位\": \"6\n",
"\", \n",
" \"GDP順位MER\": \"5\n",
"\", \n",
" \"ISO 3166-1\": \"GB / GBR\n",
"\", \n",
" \"ccTLD\": \".uk / .gb[^1]\n",
"\n",
"[^1]: 使用は.ukに比べ圧倒的少数。\n",
"\", \n",
" \"人口値\": \"63,181,775[^1]\n",
"\n",
"[^1]: United Nations Department of Economic and Social\n",
" Affairs>Population Division>Data>Population>Total Population\n",
"\", \n",
" \"人口大きさ\": \"1 E7\n",
"\", \n",
" \"人口密度値\": \"246\n",
"\", \n",
" \"人口統計年\": \"2011\n",
"\", \n",
" \"人口順位\": \"22\n",
"\", \n",
" \"位置画像\": \"Location_UK_EU_Europe_001.svg\n",
"\", \n",
" \"元首等氏名\": \"エリザベス2世\n",
"\", \n",
" \"元首等肩書\": \"女王\n",
"\", \n",
" \"公式国名\": \"英語以外での正式国名: \n",
"\n",
"\", \n",
" \"公用語\": \"英語(事実上)\n",
"\", \n",
" \"国旗画像\": \"Flag of the United Kingdom.svg\n",
"\", \n",
" \"国歌\": \"神よ女王陛下を守り給え\n",
"\", \n",
" \"国章リンク\": \"(国章)\n",
"\", \n",
" \"国章画像\": \"85px|イギリスの国章\n",
"\", \n",
" \"国際電話番号\": \"44\n",
"\", \n",
" \"夏時間\": \"+1\n",
"\", \n",
" \"建国形態\": \"建国\n",
"\", \n",
" \"日本語国名\": \"グレートブリテン及び北アイルランド連合王国\n",
"\", \n",
" \"時間帯\": \"±0\n",
"\", \n",
" \"最大都市\": \"ロンドン\n",
"\", \n",
" \"標語\": \" \n",
"(フランス語:神と私の権利)\n",
"\", \n",
" \"水面積率\": \"1.3%\n",
"\", \n",
" \"注記\": \"\n",
"\", \n",
" \"略名\": \"イギリス\n",
"\", \n",
" \"確立年月日1\": \"927年/843年\n",
"\", \n",
" \"確立年月日2\": \"1707年\n",
"\", \n",
" \"確立年月日3\": \"1801年\n",
"\", \n",
" \"確立年月日4\": \"1927年\n",
"\", \n",
" \"確立形態1\": \"イングランド王国/スコットランド王国 \n",
"(両国とも1707年連合法まで)\n",
"\", \n",
" \"確立形態2\": \"グレートブリテン王国建国 \n",
"(1707年連合法)\n",
"\", \n",
" \"確立形態3\": \"グレートブリテン及びアイルランド連合王国建国 \n",
"(1800年連合法)\n",
"\", \n",
" \"確立形態4\": \"現在の国号「グレートブリテン及び北アイルランド連合王国」に変更\n",
"\", \n",
" \"通貨\": \"UKポンド (£)\n",
"\", \n",
" \"通貨コード\": \"GBP\n",
"\", \n",
" \"面積値\": \"244,820\n",
"\", \n",
" \"面積大きさ\": \"1 E11\n",
"\", \n",
" \"面積順位\": \"76\n",
"\", \n",
" \"首相等氏名\": \"デーヴィッド・キャメロン\n",
"\", \n",
" \"首相等肩書\": \"首相\n",
"\", \n",
" \"首都\": \"ロンドン\n",
"\"\n",
"}\n"
]
}
],
"source": [
"from prettyprint.prettyprint import pp\n",
"from pypandoc import convert\n",
"\n",
"def mediawiki2plain(s):\n",
"    return convert(s, 'plain', format='mediawiki')\n",
"\n",
"for key in D:\n",
"    D[key] = mediawiki2plain(D[key])\n",
"\n",
"pp(D)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"略名 = イギリス\n",
"日本語国名 = グレートブリテン及び北アイルランド連合王国\n",
"公式国名 = 英語以外での正式国名: \n",
" 国旗画像 = Flag of the United Kingdom.svg\n",
"国章画像 = 85px|イギリスの国章\n",
"国章リンク = (国章)\n",
"標語 = \n",
"(フランス語:神と私の権利)\n",
"国歌 = 神よ女王陛下を守り給え\n",
"位置画像 = Location_UK_EU_Europe_001.svg\n",
"公用語 = 英語(事実上)\n",
"首都 = ロンドン\n",
"最大都市 = ロンドン\n",
"元首等肩書 = 女王\n",
"元首等氏名 = エリザベス2世\n",
"首相等肩書 = 首相\n",
"首相等氏名 = デーヴィッド・キャメロン\n",
"面積順位 = 76\n",
"面積大きさ = 1 E11\n",
"面積値 = 244,820\n",
"水面積率 = 1.3%\n",
"人口統計年 = 2011\n",
"人口順位 = 22\n",
"人口大きさ = 1 E7\n",
"人口値 = 63,181,775[^1]\n",
"人口密度値 = 246\n",
"GDP統計年元 = 2012\n",
"GDP値元 = 1兆5478億[^2]\n",
"GDP統計年MER = 2012\n",
"GDP順位MER = 5\n",
"GDP値MER = 2兆4337億[^3]\n",
"GDP統計年 = 2012\n",
"GDP順位 = 6\n",
"GDP値 = 2兆3162億[^4]\n",
"GDP/人 = 36,727[^5]\n",
"建国形態 = 建国\n",
"確立形態1 = イングランド王国/スコットランド王国 \n",
"(両国とも1707年連合法まで)\n",
"確立年月日1 = 927年/843年\n",
"確立形態2 = グレートブリテン王国建国 \n",
"(1707年連合法)\n",
"確立年月日2 = 1707年\n",
"確立形態3 = グレートブリテン及びアイルランド連合王国建国 \n",
"(1800年連合法)\n",
"確立年月日3 = 1801年\n",
"確立形態4 =\n",
"現在の国号「グレートブリテン及び北アイルランド連合王国」に変更\n",
"確立年月日4 = 1927年\n",
"通貨 = UKポンド (£)\n",
"通貨コード = GBP\n",
"時間帯 = ±0\n",
"夏時間 = +1\n",
"ISO 3166-1 = GB / GBR\n",
"ccTLD = .uk / .gb[^6]\n",
"国際電話番号 = 44\n",
"注記 =\n",
"[^1]: United Nations Department of Economic and Social\n",
" Affairs>Population Division>Data>Population>Total Population\n",
"[^2]: [http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=\n",
" IMF>Data and Statistics>World Economic Outlook Databases>By\n",
" Countrise>United Kingdom]\n",
"[^3]: \n",
"[^4]: \n",
"[^5]: \n",
"[^6]: 使用は.ukに比べ圧倒的少数。\n"
]
}
],
"source": [
"%%bash\n",
"pandoc --from mediawiki --to plain <(sed 's/$/\\n/' 025.txt) | grep '\\S'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 29. Getting the URL of the national flag image\n",
"Use the contents of the template to get the URL of the national flag image. (Hint: call [imageinfo](http://www.mediawiki.org/wiki/API:Properties/ja#imageinfo_.2F_ii) of the [MediaWiki API](http://www.mediawiki.org/wiki/API:Main_page/ja) to convert the file reference into a URL.)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import requests\n",
"r = requests.get('https://commons.wikimedia.org/w/api.php',\n",
"                 {'action': 'query', 'prop': 'imageinfo', 'iiprop': 'url',\n",
"                  'format': 'json', 'titles': 'File:{[国旗画像]}'.format(D)})\n",
"with open('029.txt', 'w') as f:\n",
"    f.write(r.content)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg\r\n"
]
}
],
"source": [
"!cat 029.txt | jq -r '.query.pages[].imageinfo[].url'"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
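
For comparison with the shell pipelines that the notebook above uses for problems 22, 23, 24 and 27, the same extractions can be sketched with Python's standard `re` module. This is a minimal sketch under stated assumptions, not the notebook author's method: the function names and the `SAMPLE` text are hypothetical, and the code is written for Python 3, whereas the notebook itself runs on a Python 2 kernel.

```python
import re

# Hypothetical sample in MediaWiki markup; the notebook works on british.txt instead.
SAMPLE = """\
== 歴史 ==
[[ファイル:Battle of Waterloo 1815.PNG|thumb|説明]]
=== 気候 ===
[[File:Uk topo en.jpg|thumb]]
[[Category:イギリス|*]]
[[Category:島国|くれいとふりてん]]
"""


def category_names(text):
    """Problem 22: category names only, without the sort key after '|'."""
    return re.findall(r'\[\[(?:Category|カテゴリ):([^|\]]+)', text)


def sections(text):
    """Problem 23: (section name, level) pairs, where '== x ==' is level 1."""
    pattern = re.compile(r'^(={2,})\s*(.+?)\s*\1\s*$', re.MULTILINE)
    return [(m.group(2), len(m.group(1)) - 1) for m in pattern.finditer(text)]


def media_files(text):
    """Problem 24: file names referenced with the File: or ファイル: prefix."""
    return re.findall(r'(?:File|ファイル):([^|\]]+)', text)


def strip_internal_links(s):
    """Problem 27: replace [[target|label]] or [[target]] with the visible text."""
    return re.sub(r'\[\[(?:[^|\]]+\|)?([^\]]+)\]\]', r'\1', s)


if __name__ == '__main__':
    print(category_names(SAMPLE))
    print(sections(SAMPLE))
    print(media_files(SAMPLE))
    print(strip_internal_links('[[foo#bar|baz]]'))
```

Capturing the name with a group, rather than with a lookbehind, follows the same idea as the sed variants in the notebook: match a simple prefix first, then keep only the captured part.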