@friendoye
Created April 15, 2018 21:13
Information Retrieval HW 2
{
"cells": [
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import json\n",
"import networkx as nx\n",
"from operator import itemgetter\n",
"\n",
"class WikiPage:\n",
"    def __init__(self, url, title, snippet):\n",
"        self.url = url\n",
"        self.title = title\n",
"        self.snippet = snippet\n",
"\n",
"##### Parser #####\n",
" \n",
"def parse_wiki_json(file_path, use_only_wiki_page_nodes=True):\n",
"    graph = nx.DiGraph()\n",
"    wiki_pages_dict = dict()\n",
"\n",
"    # Close the file deterministically instead of leaking the handle\n",
"    with open(file_path) as json_file:\n",
"        nodes = json.load(json_file)\n",
"\n",
"    # First pass: register every crawled page as a node\n",
"    for node in nodes:\n",
"        url = node[\"url\"]\n",
"        wiki_pages_dict[url] = WikiPage(url, node[\"title\"], node[\"info\"])\n",
"        graph.add_node(url)\n",
"\n",
"    # Second pass: add links, optionally only between crawled pages\n",
"    for node in nodes:\n",
"        for url in node[\"out_urls\"]:\n",
"            if (url in wiki_pages_dict) or (not use_only_wiki_page_nodes):\n",
"                graph.add_edge(node[\"url\"], url)\n",
"\n",
"    return graph, wiki_pages_dict\n",
"\n",
"##### Ranking things #####\n",
"\n",
"def print_wiki_page(url, rank, wiki_pages_dict):\n",
"    if url in wiki_pages_dict:\n",
"        wiki_page = wiki_pages_dict[url]\n",
"        print(\"%s[rank=%s]\\n%s\\n%s\\n\" % (wiki_page.title, rank, wiki_page.url, wiki_page.snippet))\n",
"    else:\n",
"        # Node was only seen as a link target, so no title or snippet is known\n",
"        print(\"%s[rank=%s]\\n%s\\n%s\\n\" % (\"...\", rank, url, \"...\"))\n",
"\n",
"def print_top_ranks(ranks, wiki_pages_dict):\n",
"    # Sort (url, rank) pairs by rank, highest first\n",
"    top_to_bottom_ranks = sorted(ranks.items(), key=itemgetter(1), reverse=True)\n",
"\n",
"    for (url, rank) in top_to_bottom_ranks[:10]:\n",
"        print_wiki_page(url, rank, wiki_pages_dict)\n",
"\n",
"def print_pagerank_results(graph, wiki_pages_dict, alpha, tag):\n",
"    print(\"PageRank results [%s]:\\n\" % tag)\n",
"    print_top_ranks(nx.pagerank(graph, alpha=alpha), wiki_pages_dict)\n",
"\n",
"def analyze_wiki_graph_with_pagerank(graph, wiki_pages_dict):\n",
"    # Print default PageRank results\n",
"    print_pagerank_results(graph, wiki_pages_dict, 0.85, \"default\")\n",
"\n",
"    # Print PageRank results for different alphas\n",
"    alphas = [0.95, 0.5, 0.3]\n",
"    for alpha in alphas:\n",
"        tag = \"alpha = %s\" % alpha\n",
"        print_pagerank_results(graph, wiki_pages_dict, alpha, tag)\n",
"\n",
"def analyze_wiki_graph_with_hits(graph, wiki_pages_dict):\n",
"    hubs, authorities = nx.hits(graph, max_iter=500)\n",
"    # Per-page mean of the hub and authority scores\n",
"    average = { url: (value + authorities[url]) / 2 for url, value in hubs.items() }\n",
"    print(\"HITS results [hubs]\\n\")\n",
"    print_top_ranks(hubs, wiki_pages_dict)\n",
"    print(\"HITS results [authorities]\\n\")\n",
"    print_top_ranks(authorities, wiki_pages_dict)\n",
"    print(\"HITS results [average]\\n\")\n",
"    print_top_ranks(average, wiki_pages_dict)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== BUILDING GRAPH ONLY BETWEEN WIKI PAGE NODES ===\n",
"\n",
"PageRank results [default]:\n",
"\n",
"World War II[rank=0.025849148224995167]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"New York City[rank=0.01101301659274179]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"Paris[rank=0.007194827382879238]\n",
"https://en.wikipedia.org/wiki/Paris\n",
"Paris (French pronunciation: ​[paʁi] ( listen)) is the capital and most populous city in France, with an administrative-limits area of 105 square kilometres (41 square miles) and an official population of 2,206,488 (2015).[5] The city is a commune and dep...\n",
"\n",
"United States Senate[rank=0.006117081596585612]\n",
"https://en.wikipedia.org/wiki/United_States_Senate\n",
"The United States Senate is the upper chamber of the United States Congress, which along with the United States House of Representatives—the lower chamber—comprise the legislature of the United States....\n",
"\n",
"Research[rank=0.0023184024340878744]\n",
"https://en.wikipedia.org/wiki/Research\n",
"Research comprises \"creative and systematic work undertaken to increase the stock of knowledge, including knowledge of humans, culture and society, and the use of this stock of knowledge to devise new applications.\"[1] It is used to establish or confirm f...\n",
"\n",
"Alexandria[rank=0.0021892694069322307]\n",
"https://en.wikipedia.org/wiki/Alexandria\n",
"Alexandria (/ˌælɪɡˈzændriə/ or /-ˈzɑːnd-/;[3] Arabic: الإسكندرية al-ʾIskandariyya; Egyptian Arabic: إسكندرية Eskendria; Coptic: Ⲁⲗⲉⲝⲁⲛⲇⲣⲓⲁ, Ⲣⲁⲕⲟⲧⲉ Alexandria, Rakotə) is the second-largest city in Egypt and a major economic centre, extending about 32 km (...\n",
"\n",
"Afghanistan[rank=0.0020716039774599866]\n",
"https://en.wikipedia.org/wiki/Afghanistan\n",
"Coordinates: 33°N 65°E / 33°N 65°E / 33; 65...\n",
"\n",
"Cuba[rank=0.002056113979294804]\n",
"https://en.wikipedia.org/wiki/Cuba\n",
"Coordinates: 22°00′N 80°00′W / 22.000°N 80.000°W / 22.000; -80.000...\n",
"\n",
"Cardinal Richelieu[rank=0.00194137496688989]\n",
"https://en.wikipedia.org/wiki/Cardinal_Richelieu\n",
"Cardinal Armand Jean du Plessis, 1st Duke of Richelieu and Fronsac (French pronunciation: ​[aʁmɑ̃ ʒɑ̃ dy plɛsi]; 9 September 1585 – 4 December 1642), commonly referred to as Cardinal Richelieu (French: Cardinal de Richelieu [kaʁdinal d(ə) ʁiʃ(ə)ljø]), was...\n",
"\n",
"Funding of science[rank=0.0019269159110458798]\n",
"https://en.wikipedia.org/wiki/Funding_of_science\n",
"Research funding is a term generally covering any funding for scientific research, in the areas of both \"hard\" science and technology and social science. The term often connotes funding obtained through a competitive process, in which potential research p...\n",
"\n",
"PageRank results [alpha = 0.95]:\n",
"\n",
"World War II[rank=0.03123348513129211]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"New York City[rank=0.01186502003483322]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"Paris[rank=0.00844799729074052]\n",
"https://en.wikipedia.org/wiki/Paris\n",
"Paris (French pronunciation: ​[paʁi] ( listen)) is the capital and most populous city in France, with an administrative-limits area of 105 square kilometres (41 square miles) and an official population of 2,206,488 (2015).[5] The city is a commune and dep...\n",
"\n",
"United States Senate[rank=0.007640758009980322]\n",
"https://en.wikipedia.org/wiki/United_States_Senate\n",
"The United States Senate is the upper chamber of the United States Congress, which along with the United States House of Representatives—the lower chamber—comprise the legislature of the United States....\n",
"\n",
"Research[rank=0.00453511559816445]\n",
"https://en.wikipedia.org/wiki/Research\n",
"Research comprises \"creative and systematic work undertaken to increase the stock of knowledge, including knowledge of humans, culture and society, and the use of this stock of knowledge to devise new applications.\"[1] It is used to establish or confirm f...\n",
"\n",
"Funding of science[rank=0.003908987030502495]\n",
"https://en.wikipedia.org/wiki/Funding_of_science\n",
"Research funding is a term generally covering any funding for scientific research, in the areas of both \"hard\" science and technology and social science. The term often connotes funding obtained through a competitive process, in which potential research p...\n",
"\n",
"Alexandria[rank=0.0029078159910556]\n",
"https://en.wikipedia.org/wiki/Alexandria\n",
"Alexandria (/ˌælɪɡˈzændriə/ or /-ˈzɑːnd-/;[3] Arabic: الإسكندرية al-ʾIskandariyya; Egyptian Arabic: إسكندرية Eskendria; Coptic: Ⲁⲗⲉⲝⲁⲛⲇⲣⲓⲁ, Ⲣⲁⲕⲟⲧⲉ Alexandria, Rakotə) is the second-largest city in Egypt and a major economic centre, extending about 32 km (...\n",
"\n",
"Cardinal Richelieu[rank=0.002864173113157915]\n",
"https://en.wikipedia.org/wiki/Cardinal_Richelieu\n",
"Cardinal Armand Jean du Plessis, 1st Duke of Richelieu and Fronsac (French pronunciation: ​[aʁmɑ̃ ʒɑ̃ dy plɛsi]; 9 September 1585 – 4 December 1642), commonly referred to as Cardinal Richelieu (French: Cardinal de Richelieu [kaʁdinal d(ə) ʁiʃ(ə)ljø]), was...\n",
"\n",
"Sinking of Prince of Wales and Repulse[rank=0.002479133190558128]\n",
"https://en.wikipedia.org/wiki/Sinking_of_Prince_of_Wales_and_Repulse\n",
"The sinking of Prince of Wales and Repulse was a naval engagement in the Second World War, part of the war in the Pacific, that took place north of Singapore, off the east coast of Malaya, near Kuantan, Pahang, where the British Royal Navy battleship HMS ...\n",
"\n",
"Brooklyn[rank=0.0022832110582490083]\n",
"https://en.wikipedia.org/wiki/Brooklyn\n",
"Coordinates: 40°41′34″N 73°59′25″W / 40.69278°N 73.99028°W / 40.69278; -73.99028...\n",
"\n",
"PageRank results [alpha = 0.5]:\n",
"\n",
"World War II[rank=0.013661890376013277]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"New York City[rank=0.006570507639621641]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"Paris[rank=0.0036283680826236443]\n",
"https://en.wikipedia.org/wiki/Paris\n",
"Paris (French pronunciation: ​[paʁi] ( listen)) is the capital and most populous city in France, with an administrative-limits area of 105 square kilometres (41 square miles) and an official population of 2,206,488 (2015).[5] The city is a commune and dep...\n",
"\n",
"United States Senate[rank=0.002857831882705171]\n",
"https://en.wikipedia.org/wiki/United_States_Senate\n",
"The United States Senate is the upper chamber of the United States Congress, which along with the United States House of Representatives—the lower chamber—comprise the legislature of the United States....\n",
"\n",
"Afghanistan[rank=0.0013673824686505573]\n",
"https://en.wikipedia.org/wiki/Afghanistan\n",
"Coordinates: 33°N 65°E / 33°N 65°E / 33; 65...\n",
"\n",
"Cuba[rank=0.001132609644915879]\n",
"https://en.wikipedia.org/wiki/Cuba\n",
"Coordinates: 22°00′N 80°00′W / 22.000°N 80.000°W / 22.000; -80.000...\n",
"\n",
"Alexandria[rank=0.0009417109433915612]\n",
"https://en.wikipedia.org/wiki/Alexandria\n",
"Alexandria (/ˌælɪɡˈzændriə/ or /-ˈzɑːnd-/;[3] Arabic: الإسكندرية al-ʾIskandariyya; Egyptian Arabic: إسكندرية Eskendria; Coptic: Ⲁⲗⲉⲝⲁⲛⲇⲣⲓⲁ, Ⲣⲁⲕⲟⲧⲉ Alexandria, Rakotə) is the second-largest city in Egypt and a major economic centre, extending about 32 km (...\n",
"\n",
"Brooklyn[rank=0.0008705226446613252]\n",
"https://en.wikipedia.org/wiki/Brooklyn\n",
"Coordinates: 40°41′34″N 73°59′25″W / 40.69278°N 73.99028°W / 40.69278; -73.99028...\n",
"\n",
"John Kerry[rank=0.0006878141840255482]\n",
"https://en.wikipedia.org/wiki/John_Kerry\n",
"John Forbes Kerry (/ˈkɛri/; born December 11, 1943) is an American politician who served as the 68th United States Secretary of State from 2013 to 2017. A Democrat, he previously represented Massachusetts in the United States Senate from 1985 to 2013. He ...\n",
"\n",
"Harry S. Truman[rank=0.000675294327912508]\n",
"https://en.wikipedia.org/wiki/Harry_S._Truman\n",
"Harry S. Truman[b] (May 8, 1884 – December 26, 1972) was an American statesman who served as the 33rd President of the United States (1945–1953), taking the office upon the death of Franklin D. Roosevelt. A World War I veteran, he assumed the presidency d...\n",
"\n",
"PageRank results [alpha = 0.3]:\n",
"\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"World War II[rank=0.008124281431376135]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"New York City[rank=0.003989651613354816]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"Paris[rank=0.0020964907015691273]\n",
"https://en.wikipedia.org/wiki/Paris\n",
"Paris (French pronunciation: ​[paʁi] ( listen)) is the capital and most populous city in France, with an administrative-limits area of 105 square kilometres (41 square miles) and an official population of 2,206,488 (2015).[5] The city is a commune and dep...\n",
"\n",
"United States Senate[rank=0.001674357057331771]\n",
"https://en.wikipedia.org/wiki/United_States_Senate\n",
"The United States Senate is the upper chamber of the United States Congress, which along with the United States House of Representatives—the lower chamber—comprise the legislature of the United States....\n",
"\n",
"Afghanistan[rank=0.0008793114118364494]\n",
"https://en.wikipedia.org/wiki/Afghanistan\n",
"Coordinates: 33°N 65°E / 33°N 65°E / 33; 65...\n",
"\n",
"Cuba[rank=0.0006617585065416487]\n",
"https://en.wikipedia.org/wiki/Cuba\n",
"Coordinates: 22°00′N 80°00′W / 22.000°N 80.000°W / 22.000; -80.000...\n",
"\n",
"Alexandria[rank=0.0005473154206338475]\n",
"https://en.wikipedia.org/wiki/Alexandria\n",
"Alexandria (/ˌælɪɡˈzændriə/ or /-ˈzɑːnd-/;[3] Arabic: الإسكندرية al-ʾIskandariyya; Egyptian Arabic: إسكندرية Eskendria; Coptic: Ⲁⲗⲉⲝⲁⲛⲇⲣⲓⲁ, Ⲣⲁⲕⲟⲧⲉ Alexandria, Rakotə) is the second-largest city in Egypt and a major economic centre, extending about 32 km (...\n",
"\n",
"Brooklyn[rank=0.0005055067598450495]\n",
"https://en.wikipedia.org/wiki/Brooklyn\n",
"Coordinates: 40°41′34″N 73°59′25″W / 40.69278°N 73.99028°W / 40.69278; -73.99028...\n",
"\n",
"Hamburg[rank=0.0004331829669366821]\n",
"https://en.wikipedia.org/wiki/Hamburg\n",
"Hamburg (English: /ˈhæmbɜːrɡ/; German: [ˈhambʊɐ̯k] ( listen); locally: [ˈhambʊɪ̯ç] ( listen)), Low German/Low Saxon: Hamborg [ˈhambɔːç] ( listen), officially the Free and Hanseatic City of Hamburg (German: Freie und Hansestadt Hamburg),[5] is the second-l...\n",
"\n",
"Peptide[rank=0.000410259467342855]\n",
"https://en.wikipedia.org/wiki/Peptide\n",
"Peptides (from Gr.: πεπτός, peptós \"digested\"; derived from πέσσειν, péssein \"to digest\") are short chains of amino acid monomers linked by peptide (amide) bonds....\n",
"\n"
]
}
],
"source": [
"print(\"=== BUILDING GRAPH ONLY BETWEEN WIKI PAGE NODES ===\\n\")\n",
"graph, info_dict = parse_wiki_json(\"wiki_links.json\", use_only_wiki_page_nodes=True)\n",
"analyze_wiki_graph_with_pagerank(graph, info_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___Conclusions___: Changing $alpha$ does not change the top 3 results (probably because their ranks stand out clearly from the remaining ranks). The results in positions 4-10, however, swap places slightly as $alpha$ changes."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== BUILDING GRAPH USING ALL NODES AND ALL EDGES ===\n",
"PageRank results [default]:\n",
"\n",
"...[rank=8.32295165367056e-05]\n",
"https://en.wikipedia.org/wiki/United_States\n",
"...\n",
"\n",
"World War II[rank=4.870370331995472e-05]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"...[rank=4.143651987487163e-05]\n",
"https://en.wikipedia.org/wiki/France\n",
"...\n",
"\n",
"...[rank=3.817706147720798e-05]\n",
"https://en.wikipedia.org/wiki/Mathematics\n",
"...\n",
"\n",
"...[rank=3.574484291728748e-05]\n",
"https://en.wikipedia.org/wiki/United_Kingdom\n",
"...\n",
"\n",
"New York City[rank=3.2299184144809654e-05]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"...[rank=3.1074132903209096e-05]\n",
"https://en.wikipedia.org/wiki/England\n",
"...\n",
"\n",
"...[rank=3.0194270922979755e-05]\n",
"https://en.wikipedia.org/wiki/Germany\n",
"...\n",
"\n",
"...[rank=2.8601893830350863e-05]\n",
"https://en.wikipedia.org/wiki/German_language\n",
"...\n",
"\n",
"...[rank=2.6047961515101567e-05]\n",
"https://en.wikipedia.org/wiki/Greek_language\n",
"...\n",
"\n",
"PageRank results [alpha = 0.95]:\n",
"\n",
"...[rank=9.270452613362939e-05]\n",
"https://en.wikipedia.org/wiki/United_States\n",
"...\n",
"\n",
"World War II[rank=5.411685253843735e-05]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"...[rank=4.599470633510926e-05]\n",
"https://en.wikipedia.org/wiki/France\n",
"...\n",
"\n",
"...[rank=4.235178224360287e-05]\n",
"https://en.wikipedia.org/wiki/Mathematics\n",
"...\n",
"\n",
"...[rank=3.963342032369169e-05]\n",
"https://en.wikipedia.org/wiki/United_Kingdom\n",
"...\n",
"\n",
"New York City[rank=3.578238993092241e-05]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"...[rank=3.441321501383943e-05]\n",
"https://en.wikipedia.org/wiki/England\n",
"...\n",
"\n",
"...[rank=3.342983985946547e-05]\n",
"https://en.wikipedia.org/wiki/Germany\n",
"...\n",
"\n",
"...[rank=3.165012428535081e-05]\n",
"https://en.wikipedia.org/wiki/German_language\n",
"...\n",
"\n",
"...[rank=2.8795729344778055e-05]\n",
"https://en.wikipedia.org/wiki/Greek_language\n",
"...\n",
"\n",
"PageRank results [alpha = 0.5]:\n",
"\n",
"...[rank=5.006698294747138e-05]\n",
"https://en.wikipedia.org/wiki/United_States\n",
"...\n",
"\n",
"World War II[rank=2.9757681055265054e-05]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"...[rank=2.548286726403971e-05]\n",
"https://en.wikipedia.org/wiki/France\n",
"...\n",
"\n",
"...[rank=2.3565538794825842e-05]\n",
"https://en.wikipedia.org/wiki/Mathematics\n",
"...\n",
"\n",
"...[rank=2.2134821994872537e-05]\n",
"https://en.wikipedia.org/wiki/United_Kingdom\n",
"...\n",
"\n",
"New York City[rank=2.010796389341501e-05]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"...[rank=1.938734551600294e-05]\n",
"https://en.wikipedia.org/wiki/England\n",
"...\n",
"\n",
"...[rank=1.8869779645279812e-05]\n",
"https://en.wikipedia.org/wiki/Germany\n",
"...\n",
"\n",
"...[rank=1.793308723785104e-05]\n",
"https://en.wikipedia.org/wiki/German_language\n",
"...\n",
"\n",
"...[rank=1.6430774111233797e-05]\n",
"https://en.wikipedia.org/wiki/Greek_language\n",
"...\n",
"\n",
"PageRank results [alpha = 0.3]:\n",
"\n",
"...[rank=3.111696375362334e-05]\n",
"https://en.wikipedia.org/wiki/United_States\n",
"...\n",
"\n",
"World War II[rank=1.8931382618299524e-05]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"...[rank=1.6366494343564322e-05]\n",
"https://en.wikipedia.org/wiki/France\n",
"...\n",
"\n",
"...[rank=1.5216097262036003e-05]\n",
"https://en.wikipedia.org/wiki/Mathematics\n",
"...\n",
"\n",
"...[rank=1.4357667182064046e-05]\n",
"https://en.wikipedia.org/wiki/United_Kingdom\n",
"...\n",
"\n",
"New York City[rank=1.3141552321189534e-05]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"...[rank=1.270918129474229e-05]\n",
"https://en.wikipedia.org/wiki/England\n",
"...\n",
"\n",
"...[rank=1.2398641772308408e-05]\n",
"https://en.wikipedia.org/wiki/Germany\n",
"...\n",
"\n",
"...[rank=1.1836626327851142e-05]\n",
"https://en.wikipedia.org/wiki/German_language\n",
"...\n",
"\n",
"...[rank=1.0935238451880797e-05]\n",
"https://en.wikipedia.org/wiki/Greek_language\n",
"...\n",
"\n"
]
}
],
"source": [
"print(\"=== BUILDING GRAPH USING ALL NODES AND ALL EDGES ===\")\n",
"graph, info_dict = parse_wiki_json(\"wiki_links.json\", use_only_wiki_page_nodes=False)\n",
"analyze_wiki_graph_with_pagerank(graph, info_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___Conclusions___: Changing $alpha$ does not change the ordering of the top 10, since every page has a small rank and no rank clearly stands out. It also seems that the output lists the most popular pages (i.e., pages that other pages link to most often, or that are easiest to reach from any other page)."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== BUILDING GRAPH ONLY BETWEEN WIKI PAGE NODES ===\n",
"\n",
"HITS results [hubs]\n",
"\n",
"United States[rank=0.001155832205297175]\n",
"https://en.wikipedia.org/wiki/United_states\n",
"Coordinates: 40°N 100°W / 40°N 100°W / 40; -100...\n",
"\n",
"List of recurring The Simpsons characters[rank=0.001141421004976474]\n",
"https://en.wikipedia.org/wiki/List_of_recurring_The_Simpsons_characters\n",
"The Simpsons includes a large array of supporting characters: co-workers, teachers, family friends, extended relatives, townspeople, local celebrities, fictional characters within the show, and even animals. The writers originally intended many of these c...\n",
"\n",
"List of recurring The Simpsons characters[rank=0.001141421004976474]\n",
"https://en.wikipedia.org/wiki/Jimbo_Jones\n",
"The Simpsons includes a large array of supporting characters: co-workers, teachers, family friends, extended relatives, townspeople, local celebrities, fictional characters within the show, and even animals. The writers originally intended many of these c...\n",
"\n",
"Airline[rank=0.0011053363179029142]\n",
"https://en.wikipedia.org/wiki/Air_transport\n",
"An airline is a company that provides air transport services for traveling passengers and freight. Airlines utilize aircraft to supply these services and may form partnerships or alliances with other airlines for codeshare agreements. Generally, airline c...\n",
"\n",
"List of Empire ships (Ta–Te)[rank=0.001101062729131132]\n",
"https://en.wikipedia.org/wiki/MV_Aqueity_(1946)\n",
"The Empire ships were a series of ships in the service of the British government. Their names were all prefixed with \"Empire\". Mostly they were used during World War II by the Ministry of War Transport (MoWT), who owned the ships but contracted out their ...\n",
"\n",
"George Padmore[rank=0.0011004559933305495]\n",
"https://en.wikipedia.org/wiki/George_Padmore\n",
"George Padmore (28 June 1903 – 23 September 1959), born Malcolm Ivan Meredith Nurse in Trinidad, was a leading Pan-Africanist, journalist, and author. He left Trinidad in 1924 to study medicine in the United States, where he also joined the Communist Part...\n",
"\n",
"Yonkers, New York[rank=0.0010857808651206088]\n",
"https://en.wikipedia.org/wiki/Yonkers,_New_York\n",
"Coordinates: 40°56′29″N 73°51′52″W / 40.94139°N 73.86444°W / 40.94139; -73.86444...\n",
"\n",
"Asbury Park, New Jersey[rank=0.001084824579835818]\n",
"https://en.wikipedia.org/wiki/Asbury_Park,_New_Jersey\n",
"Asbury Park is a city in Monmouth County, New Jersey, United States, located on the Jersey Shore and part of the New York City Metropolitan Area. As of the 2010 United States Census, the city's population was 16,116,[11][12][13] reflecting a decline of 81...\n",
"\n",
"Camden, New Jersey[rank=0.0010839092542923874]\n",
"https://en.wikipedia.org/wiki/Camden,_New_Jersey\n",
"Camden is a city in Camden County, New Jersey. Camden is located directly across the Delaware River from Philadelphia, Pennsylvania. At the 2010 United States Census, the city had a population of 77,344.[10][12][13] Camden is the 12th most populous munici...\n",
"\n",
"Christo and Jeanne-Claude[rank=0.0010833939508442625]\n",
"https://en.wikipedia.org/wiki/Christo_and_Jeanne-Claude\n",
"Christo Vladimirov Javacheff and Jeanne-Claude were a married couple who created environmental works of art. Christo and Jeanne-Claude were born on the same day, June 13, 1935; Christo in Gabrovo, Bulgaria, and Jeanne-Claude in Morocco. They first met in ...\n",
"\n",
"HITS results [authorities]\n",
"\n",
"World War II[rank=0.17257107335593755]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"New York City[rank=0.030916064842761038]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"Paris[rank=0.009105445711514273]\n",
"https://en.wikipedia.org/wiki/Paris\n",
"Paris (French pronunciation: ​[paʁi] ( listen)) is the capital and most populous city in France, with an administrative-limits area of 105 square kilometres (41 square miles) and an official population of 2,206,488 (2015).[5] The city is a commune and dep...\n",
"\n",
"United States Senate[rank=0.006607997622266011]\n",
"https://en.wikipedia.org/wiki/United_States_Senate\n",
"The United States Senate is the upper chamber of the United States Congress, which along with the United States House of Representatives—the lower chamber—comprise the legislature of the United States....\n",
"\n",
"Afghanistan[rank=0.003915033647888848]\n",
"https://en.wikipedia.org/wiki/Afghanistan\n",
"Coordinates: 33°N 65°E / 33°N 65°E / 33; 65...\n",
"\n",
"Cuba[rank=0.0038227532008992793]\n",
"https://en.wikipedia.org/wiki/Cuba\n",
"Coordinates: 22°00′N 80°00′W / 22.000°N 80.000°W / 22.000; -80.000...\n",
"\n",
"Hamburg[rank=0.0037055317754697645]\n",
"https://en.wikipedia.org/wiki/Hamburg\n",
"Hamburg (English: /ˈhæmbɜːrɡ/; German: [ˈhambʊɐ̯k] ( listen); locally: [ˈhambʊɪ̯ç] ( listen)), Low German/Low Saxon: Hamborg [ˈhambɔːç] ( listen), officially the Free and Hanseatic City of Hamburg (German: Freie und Hansestadt Hamburg),[5] is the second-l...\n",
"\n",
"Brooklyn[rank=0.0031870245003373866]\n",
"https://en.wikipedia.org/wiki/Brooklyn\n",
"Coordinates: 40°41′34″N 73°59′25″W / 40.69278°N 73.99028°W / 40.69278; -73.99028...\n",
"\n",
"Harry S. Truman[rank=0.0029114570962381196]\n",
"https://en.wikipedia.org/wiki/Harry_S._Truman\n",
"Harry S. Truman[b] (May 8, 1884 – December 26, 1972) was an American statesman who served as the 33rd President of the United States (1945–1953), taking the office upon the death of Franklin D. Roosevelt. A World War I veteran, he assumed the presidency d...\n",
"\n",
"John Kerry[rank=0.002477389765454642]\n",
"https://en.wikipedia.org/wiki/John_Kerry\n",
"John Forbes Kerry (/ˈkɛri/; born December 11, 1943) is an American politician who served as the 68th United States Secretary of State from 2013 to 2017. A Democrat, he previously represented Massachusetts in the United States Senate from 1985 to 2013. He ...\n",
"\n",
"HITS results [average]\n",
"\n",
"World War II[rank=0.0863243046111793]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"New York City[rank=0.015975650191571145]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"Paris[rank=0.004558746339330339]\n",
"https://en.wikipedia.org/wiki/Paris\n",
"Paris (French pronunciation: ​[paʁi] ( listen)) is the capital and most populous city in France, with an administrative-limits area of 105 square kilometres (41 square miles) and an official population of 2,206,488 (2015).[5] The city is a commune and dep...\n",
"\n",
"United States Senate[rank=0.003305463218487022]\n",
"https://en.wikipedia.org/wiki/United_States_Senate\n",
"The United States Senate is the upper chamber of the United States Congress, which along with the United States House of Representatives—the lower chamber—comprise the legislature of the United States....\n",
"\n",
"Afghanistan[rank=0.002400954789940294]\n",
"https://en.wikipedia.org/wiki/Afghanistan\n",
"Coordinates: 33°N 65°E / 33°N 65°E / 33; 65...\n",
"\n",
"Cuba[rank=0.0019920136459442277]\n",
"https://en.wikipedia.org/wiki/Cuba\n",
"Coordinates: 22°00′N 80°00′W / 22.000°N 80.000°W / 22.000; -80.000...\n",
"\n",
"Harry S. Truman[rank=0.0019208589389176503]\n",
"https://en.wikipedia.org/wiki/Harry_S._Truman\n",
"Harry S. Truman[b] (May 8, 1884 – December 26, 1972) was an American statesman who served as the 33rd President of the United States (1945–1953), taking the office upon the death of Franklin D. Roosevelt. A World War I veteran, he assumed the presidency d...\n",
"\n",
"Hamburg[rank=0.0018550051083835056]\n",
"https://en.wikipedia.org/wiki/Hamburg\n",
"Hamburg (English: /ˈhæmbɜːrɡ/; German: [ˈhambʊɐ̯k] ( listen); locally: [ˈhambʊɪ̯ç] ( listen)), Low German/Low Saxon: Hamborg [ˈhambɔːç] ( listen), officially the Free and Hanseatic City of Hamburg (German: Freie und Hansestadt Hamburg),[5] is the second-l...\n",
"\n",
"Brooklyn[rank=0.001691992966565913]\n",
"https://en.wikipedia.org/wiki/Brooklyn\n",
"Coordinates: 40°41′34″N 73°59′25″W / 40.69278°N 73.99028°W / 40.69278; -73.99028...\n",
"\n",
"Messerschmitt Bf 110[rank=0.0014814048407007074]\n",
"https://en.wikipedia.org/wiki/Messerschmitt_Bf_110\n",
"The Messerschmitt Bf 110, often known non-officially as the Me 110,[2] was a twin-engine heavy fighter (Zerstörer—German for \"Destroyer\") and fighter-bomber (Jagdbomber or Jabo) developed in Nazi Germany in the 1930s and used by the Luftwaffe during World...\n",
"\n"
]
}
],
"source": [
"print(\"=== BUILDING GRAPH ONLY BETWEEN WIKI PAGE NODES ===\\n\")\n",
"graph, info_dict = parse_wiki_json(\"wiki_links.json\", use_only_wiki_page_nodes=True)\n",
"analyze_wiki_graph_with_hits(graph, info_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___Conclusions___: \n",
"1. HITS [hubs] ranks at the top the pages that contain many links to other pages;\n",
"2. HITS [average] and HITS [authorities] largely resemble the PageRank results."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"=== BUILDING GRAPH USING ALL NODES AND ALL EDGES ===\n",
"HITS results [hubs]\n",
"\n",
"History of Western civilization[rank=0.004372299909070321]\n",
"https://en.wikipedia.org/wiki/History_of_Western_civilization\n",
"Western civilization traces its roots back to Europe and the Mediterranean. It is linked to the Roman Empire and with Medieval Western Christendom which emerged from the Middle Ages to experience such transformative episodes as the Renaissance, the Reform...\n",
"\n",
"United States[rank=0.0030728305857598247]\n",
"https://en.wikipedia.org/wiki/United_states\n",
"Coordinates: 40°N 100°W / 40°N 100°W / 40; -100...\n",
"\n",
"New York City[rank=0.002209641967709587]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"New York (state)[rank=0.0019379334377949697]\n",
"https://en.wikipedia.org/wiki/New_York_State\n",
"New York is a state in the northeastern United States. New York was one of the original thirteen colonies that formed the United States. With an estimated 19.85 million residents in 2017,[4] it is the fourth most populous state. To differentiate from its ...\n",
"\n",
"New York (state)[rank=0.0019379334377949697]\n",
"https://en.wikipedia.org/wiki/State_of_New_York\n",
"New York is a state in the northeastern United States. New York was one of the original thirteen colonies that formed the United States. With an estimated 19.85 million residents in 2017,[4] it is the fourth most populous state. To differentiate from its ...\n",
"\n",
"Protestantism[rank=0.001385852468307719]\n",
"https://en.wikipedia.org/wiki/Protestants\n",
"Protestantism is the second largest form of Christianity with collectively more than 900 million adherents worldwide or nearly 40% of all Christians.[1][2][3][a] It originated with the Reformation,[b] a movement against what its followers considered to be...\n",
"\n",
"Slavery[rank=0.0012877244965683193]\n",
"https://en.wikipedia.org/wiki/Slaves\n",
"Slavery is any system in which principles of property law are applied to people, allowing individuals to own, buy and sell other individuals, as a de jure form of property.[1] A slave is unable to withdraw unilaterally from such an arrangement and works w...\n",
"\n",
"World War II[rank=0.0012821112440912377]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"List of recurring The Simpsons characters[rank=0.0012700339807072338]\n",
"https://en.wikipedia.org/wiki/List_of_recurring_The_Simpsons_characters\n",
"The Simpsons includes a large array of supporting characters: co-workers, teachers, family friends, extended relatives, townspeople, local celebrities, fictional characters within the show, and even animals. The writers originally intended many of these c...\n",
"\n",
"List of recurring The Simpsons characters[rank=0.0012699723059101777]\n",
"https://en.wikipedia.org/wiki/Jimbo_Jones\n",
"The Simpsons includes a large array of supporting characters: co-workers, teachers, family friends, extended relatives, townspeople, local celebrities, fictional characters within the show, and even animals. The writers originally intended many of these c...\n",
"\n",
"HITS results [authorities]\n",
"\n",
"...[rank=0.001664715502889755]\n",
"https://en.wikipedia.org/wiki/United_States\n",
"...\n",
"\n",
"World War II[rank=0.0015551820722154571]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"...[rank=0.0009893548077395376]\n",
"https://en.wikipedia.org/wiki/United_Kingdom\n",
"...\n",
"\n",
"...[rank=0.0008736171456686618]\n",
"https://en.wikipedia.org/wiki/World_War_I\n",
"...\n",
"\n",
"...[rank=0.000841658916825792]\n",
"https://en.wikipedia.org/wiki/France\n",
"...\n",
"\n",
"...[rank=0.0007516815715244549]\n",
"https://en.wikipedia.org/wiki/Soviet_Union\n",
"...\n",
"\n",
"...[rank=0.0007265484901776222]\n",
"https://en.wikipedia.org/wiki/Germany\n",
"...\n",
"\n",
"...[rank=0.0006562844948079634]\n",
"https://en.wikipedia.org/wiki/India\n",
"...\n",
"\n",
"New York City[rank=0.0006487934266546151]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"...[rank=0.0006483296158212844]\n",
"https://en.wikipedia.org/wiki/China\n",
"...\n",
"\n",
"HITS results [average]\n",
"\n",
"History of Western civilization[rank=0.0021902483546770274]\n",
"https://en.wikipedia.org/wiki/History_of_Western_civilization\n",
"Western civilization traces its roots back to Europe and the Mediterranean. It is linked to the Roman Empire and with Medieval Western Christendom which emerged from the Middle Ages to experience such transformative episodes as the Renaissance, the Reform...\n",
"\n",
"United States[rank=0.0015374220139068278]\n",
"https://en.wikipedia.org/wiki/United_states\n",
"Coordinates: 40°N 100°W / 40°N 100°W / 40; -100...\n",
"\n",
"New York City[rank=0.001429217697182101]\n",
"https://en.wikipedia.org/wiki/New_York_City\n",
"The City of New York, often called New York City or simply New York, is the most populous city in the United States.[9] With an estimated 2017 population of 8,622,698[7] distributed over a land area of about 302.6 square miles (784 km2),[10][11] New York ...\n",
"\n",
"World War II[rank=0.0014186466581533473]\n",
"https://en.wikipedia.org/wiki/World_War_II\n",
"World War II (often abbreviated to WWII or WW2), also known as the Second World War, was a global war that lasted from 1939 to 1945, although related conflicts began earlier. The vast majority of the world's countries—including all of the great powers—eve...\n",
"\n",
"New York (state)[rank=0.0009732630711461702]\n",
"https://en.wikipedia.org/wiki/New_York_State\n",
"New York is a state in the northeastern United States. New York was one of the original thirteen colonies that formed the United States. With an estimated 19.85 million residents in 2017,[4] it is the fourth most populous state. To differentiate from its ...\n",
"\n",
"New York (state)[rank=0.0009696481659161672]\n",
"https://en.wikipedia.org/wiki/State_of_New_York\n",
"New York is a state in the northeastern United States. New York was one of the original thirteen colonies that formed the United States. With an estimated 19.85 million residents in 2017,[4] it is the fourth most populous state. To differentiate from its ...\n",
"\n",
"...[rank=0.0008323577514448775]\n",
"https://en.wikipedia.org/wiki/United_States\n",
"...\n",
"\n",
"Protestantism[rank=0.0007240464819118317]\n",
"https://en.wikipedia.org/wiki/Protestants\n",
"Protestantism is the second largest form of Christianity with collectively more than 900 million adherents worldwide or nearly 40% of all Christians.[1][2][3][a] It originated with the Reformation,[b] a movement against what its followers considered to be...\n",
"\n",
"Slavery[rank=0.0006686473364468655]\n",
"https://en.wikipedia.org/wiki/Slaves\n",
"Slavery is any system in which principles of property law are applied to people, allowing individuals to own, buy and sell other individuals, as a de jure form of property.[1] A slave is unable to withdraw unilaterally from such an arrangement and works w...\n",
"\n",
"List of recurring The Simpsons characters[rank=0.0006351502677336068]\n",
"https://en.wikipedia.org/wiki/Jimbo_Jones\n",
"The Simpsons includes a large array of supporting characters: co-workers, teachers, family friends, extended relatives, townspeople, local celebrities, fictional characters within the show, and even animals. The writers originally intended many of these c...\n",
"\n"
]
}
],
"source": [
"print(\"=== BUILDING GRAPH USING ALL NODES AND ALL EDGES ===\")\n",
"graph, info_dict = parse_wiki_json(\"wiki_links.json\", use_only_wiki_page_nodes=False)\n",
"analyze_wiki_graph_with_hits(graph, info_dict)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"___Conclusions___: \n",
"1) HITS [authorities] most closely resembles the PageRank results, while HITS [hubs] and HITS [average] each contain a couple of pages from the PageRank top 3."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}