@AbdealiLoKo
Created September 24, 2016 10:24
Wikimedia Hackathon - BITS Pilani Hyderabad Campus
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pywikibot Introduction"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pywikibot is a Python library that makes it much easier to make automated edits on mediawiki sites.\n",
"\n",
"<span style=\"color: #cc0000\">**Warning**:</span> You are accountable for every edit that you or your Python script makes. Be careful and don't get banned!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pywikibot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Using a mediawiki site\n",
"\n",
"The first thing pywikibot needs to know is which mediawiki website to target. There are many official sites like en.wikipedia.org, commons.wikimedia.org, en.wiktionary.org, en.wikiquote.org, en.wikinews.org, en.wikisource.org, etc. Most of these also have versions in other languages, like ml.wikipedia.org, ml.wiktionary.org, etc.\n",
"\n",
"The default website on PAWS is test.wikipedia.org. To check the website out, go to <https://test.wikipedia.org>."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"testwiki = pywikibot.Site()\n",
"testwiki"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two parts identify a mediawiki website: the **code** and the **family**. Pywikibot supports a large number of official families and codes, and can also be configured for a local instance or a personal deployment of mediawiki.\n",
"\n",
"The **family** tells pywikibot which type of mediawiki site should be used, so that it can read and write data specific to that family. Examples of families are: wikipedia, wiktionary, wikisource, etc.\n",
"\n",
"The **code** tells pywikibot which variant of the family should be used. Common examples of codes are: en, es, ml, etc. The available codes depend on the family though. For example, the \"commons\" family has only the \"commons\" code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"enwiki = pywikibot.Site(code=\"en\", fam=\"wikipedia\")\n",
"enwiki"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"commons = pywikibot.Site(code=\"commons\", fam=\"commons\")\n",
"commons"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"wikidata = pywikibot.Site(code=\"wikidata\", fam=\"wikidata\")\n",
"wikidata"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"testwikidata = pywikibot.Site(code=\"test\", fam=\"wikidata\")\n",
"testwikidata"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Logging in\n",
"\n",
"In the PAWS interface, the user is set by default to the account that was used to log in to PAWS. But in a local script, we would need to modify the `user-config.py` file to add the username and password. We will see this later.\n",
"\n",
"We tell pywikibot to log in with the `login()` function. Then we check which user is logged in:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"testwiki.login()\n",
"print('Logged in user is:', testwiki.user())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Reading data on Pages\n",
"\n",
"To pull data from a wiki with pywikibot, we use the `Page` class, which holds information about a page on the mediawiki website.\n",
"\n",
"First, we create a Page object using the name of the page. Here, we use the page \"User:AbdealiJK/Pywikibot_Tutorial\" as an example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"demo_page = pywikibot.Page(testwiki, 'User:AbdealiJK/Pywikibot_Tutorial')\n",
"demo_page"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we use the class to fetch other information about the page. For example, to get the text of the page:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(demo_page.text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can get a lot of other information about the page by using various helper functions provided by pywikibot:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"Check if page exists:\", demo_page.exists())\n",
"print(\"Title of the page:\", demo_page.title())\n",
"print(\"Contributors of the page:\", demo_page.contributors())\n",
"print(\"Last edit made on page:\", demo_page.editTime())\n",
"print(\"Full URL to page:\", demo_page.full_url())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Writing data to Pages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In general, use the test wikipedia website for writing data, and make your changes in your user space (pages starting with `User:<Your user name>`), as these pages are meant for personal use such as testing these scripts :)\n",
"\n",
"For example, let's create the object for your personal Sandbox page on test wiki:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sandbox = pywikibot.Page(testwiki, 'User:' + testwiki.user() + '/Sandbox')\n",
"sandbox"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, let's try writing some wiki markup to the page. For example, let's try making your profile!\n",
"\n",
"**Note**: To get more information about the wikimarkup visit [Help:Wiki markup](https://en.wikipedia.org/wiki/Help:Wiki_markup)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Replace the placeholder values passed to format() with your own details.\n",
"sandbox.text = \"\"\"\n",
"== About Me ==\n",
"\n",
"Hello!\n",
"\n",
"My name is '''{name}'''.\n",
"\n",
"I am from {hometown} and am learning how to use pywikibot!\n",
"\n",
"This page has been written using the pywikibot API.\n",
"\"\"\".format(name=\"Your Name\",\n",
"           hometown=\"Your Hometown\")\n",
"sandbox.save()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's open up the webpage and see if our changes have been added there.\n",
"\n",
"Using Jupyter and IPython, we can even embed the webpage into the notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from IPython.display import IFrame\n",
"IFrame(sandbox.full_url(), width=\"100%\", height=\"400px\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Textlib functions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you can read and save page content, you will often want to get a list of the categories or templates used on a mediawiki page.\n",
"\n",
"A **category** is a page in a special namespace (similar to the user space) that is used to classify pages. For example, the \"Python (programming language)\" page on wikipedia has the categories \"Category:Class-based programming languages\", \"Category:Cross-platform free software\", \"Category:Dynamically typed programming languages\" and so on.\n",
"\n",
"To add a category to a page, a link to the category must be added to the mediawiki page. Hence, something like `[[Category:<name of category>]]` should be added according to the wiki markup.\n",
"\n",
"A **template** is a snippet of text which can be included in multiple other pages (something like a `#include` or `import`). The wiki markup to add a template is `{{<template name>}}` and it can also take arguments, for example `{{<template name>|arg1|arg2}}`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"python = pywikibot.Page(enwiki, 'Python_(programming_language)')\n",
"python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get a list of all categories added to the page:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"list(python.categories())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The textlib functions help to modify the text content of a page for specific needs like adding or removing categories. To do this, textlib has its own parsers, which read through the text and pull out all the category links they find based on the wiki markup."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pywikibot.textlib.getCategoryLinks(python.text)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(\"Text categories in page:\", len(pywikibot.textlib.getCategoryLinks(python.text)))\n",
"print(\"All categories associated with page:\", len(list(python.categories())))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try removing all the category links from the text using the textlib functions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"new_text = pywikibot.textlib.removeCategoryLinks(python.text)\n",
"print(\"List of categories after the remove function:\", pywikibot.textlib.getCategoryLinks(new_text))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other useful methods\n",
"Textlib contains many other functions that make editing the text of mediawiki pages easier. For example:\n",
" - `TimeStripper()` - Pulls out the time strings in the text and converts them into Python time objects\n",
" - `does_text_contain_section()` - Checks whether a section with the given name exists in the text\n",
" - `extract_templates_and_params()` - Fetches a list of templates with the arguments used in the template markup\n",
" - `glue_template_and_params()` - Takes a template and arguments and creates the appropriate wiki markup for it\n",
" - `removeHTMLParts()` - Cleans the data by removing all HTML code in the page\n",
" - `replaceCategoryInPlace()` - If a category needs to be changed to another category, this replaces it in place\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 6. Page Generators"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many instances where it is useful to create a \"page generator\", which helps iterate over multiple pages that share a common property. For example, suppose you want to find all pages in the \"Wikimedia projects\" category:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pywikibot.pagegenerators\n",
"\n",
"wiki_projects = pywikibot.pagegenerators.CategorizedPageGenerator(\n",
" pywikibot.Category(enwiki, 'Category:Wikimedia projects'),\n",
" recurse=False)\n",
"\n",
"from pprint import pprint\n",
"pprint(list(wiki_projects))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the Python module `pprint` (pretty print) to print each item on its own line rather than dumping the list on a single line.\n",
"\n",
"For more information on Page generators check [pywikibot documentation on pagegenerators](https://pywikibot.readthedocs.io/en/latest/pywikibot/#module-pywikibot.pagegenerators)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 7. Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1 - Write a script to remove trailing whitespace from a given page"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In many mediawiki pages, we see that editors leave trailing whitespace at the bottom of the page. While this does not matter when the page is rendered for viewing, it adds unnecessary length to the article when downloading the text and raw wikicode.\n",
"\n",
"Write a script to remove the trailing whitespace and keep only 1 newline at the end of the page. (Test this on the test wiki!)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2 - Write a script to find the number of devices using Android Operating System"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the number of pages in the category for devices that use the Android operating system."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 8. Setting up pywikibot locally\n",
"PAWS provides a way to run pywikibot and related commands through Jupyter notebooks. The requirements that pywikibot scripts need are already installed there, so it is an easy way to get users started. However, PAWS is a single shared server, so it becomes crowded and slow if everyone uses it at once. In such cases, it may be easier to run these scripts locally on your own desktop or laptop."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installing basic requirements\n",
"First, install the basic requirements. This depends on your specific OS.\n",
" \n",
" - Python and Pip - Use [anaconda](https://www.continuum.io/downloads) 3.5 or 2.7 preferably\n",
" - git - [installation instructions](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Installing pywikibot\n",
"Pywikibot is currently still a release candidate, hence rather than installing the rc5 from [pip](https://pypi.python.org/pypi/pywikibot), we will get the latest source code at the master branch using git. To do this, run the following command on your terminal or command prompt:\n",
"\n",
" /home/user/git_repos/$ git clone https://github.com/wikimedia/pywikibot-core.git\n",
"\n",
"You will find that the folder `pywikibot-core` has been created in the current working directory. If you want the folder somewhere else, simply move it to another directory, or use the `cd` command to change directory before running the above git command.\n",
"\n",
"Once the git repository has been downloaded, `cd` into the directory and run:\n",
"\n",
" /home/user/git_repos/pywikibot-core/$ pip install .\n",
"\n",
"This installs pywikibot from the repository into your Python installation. The `.` (dot) is required as it tells pip to find the Python package in the current directory. Pywikibot also has a lot of optional dependencies which are used to run specific scripts and unit tests. To install all of these (to avoid errors later) run:\n",
"\n",
" /home/user/git_repos/pywikibot-core/$ pip install -r dev-requirements.txt -r requirements.txt\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configuring pywikibot\n",
"\n",
"Once the pywikibot library has been installed, simply use the `pwb.py` script provided in the git repo:\n",
"\n",
" /home/user/git_repos/pywikibot-core/$ python pwb.py login\n",
"\n",
"Then answer the questions it asks to create a `user-config.py` file, which holds your configuration information."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### References\n",
"For more information about the configuration and other aspects of pywikibot, check the [Pywikibot manual](https://www.mediawiki.org/wiki/Manual:Pywikibot)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
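
Section 5 of the notebook says that textlib's parsers read through the page text and pull out category links based on the wiki markup `[[Category:<name of category>]]`. The sketch below illustrates that idea in plain Python so it can run without pywikibot. It is not pywikibot's implementation: the pattern `CATEGORY_LINK` and the helpers `find_category_names` and `strip_category_links` are hypothetical names made up for this example, and the pattern only handles the English `Category:` prefix with an optional sort key.

```python
import re

# Hypothetical, simplified pattern (an assumption for illustration):
# matches [[Category:Name]] and [[Category:Name|sort key]].
# Localized namespace names and nested markup are ignored here.
CATEGORY_LINK = re.compile(
    r"\[\[\s*Category\s*:\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def find_category_names(wikitext):
    """Return the category names linked in the given wiki markup."""
    return [match.group(1) for match in CATEGORY_LINK.finditer(wikitext)]


def strip_category_links(wikitext):
    """Return the markup with every category link removed."""
    return CATEGORY_LINK.sub("", wikitext)


sample = (
    "Python is a programming language.\n"
    "[[Category:Dynamically typed programming languages]]\n"
    "[[Category:Cross-platform free software|Python]]\n"
)

print(find_category_names(sample))
print(find_category_names(strip_category_links(sample)))
```

Because a parser like this only sees links written directly in the page text, it cannot find categories that a page receives indirectly, for example through a template. That is one reason the two counts printed in section 5 can differ.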