@yardenac
Forked from robla/spec_wikistuff.txt
Last active May 4, 2026 00:01
Very basic Python script for using Wikidata
wikidata_stuff.py command specification
=======================================

Goal
----

Provide a small command line tool for resolving Wikidata QIDs into useful
English Wikipedia identifiers, then using those identifiers for English
Wikipedia category queries.

The tool should be dependency-free Python and use the Wikidata and English
Wikipedia MediaWiki APIs directly.

Command overview
----------------

The CLI should use argparse subcommands:

    wikidata_stuff.py get-wikidata-label Q4675
    wikidata_stuff.py get-enwiki-page Q4675
    wikidata_stuff.py list-enwiki-categories Q4675
    wikidata_stuff.py list-enwiki-category-members "Category:Volcanoes of Washington (state)"

Subcommand names should be hyphenated verb phrases. Each subcommand should
also identify the relevant external system:

    wikidata
        Wikidata.
    enwiki
        English Wikipedia.

This avoids nouns that read like actions on local data. For example,
"get-wikidata-label" means "retrieve the Wikidata label", while "label" could
imply "apply a label".

Global options:
    -h, --help
        Show help.
    --json
        Emit machine-readable JSON instead of plain text.

Subcommand: get-wikidata-label
------------------------------

Usage:
    wikidata_stuff.py get-wikidata-label QID [-l LANGUAGE]

Purpose:
    Print the Wikidata label for a QID.

Arguments:
    QID
        Wikidata item ID, such as Q42 or Q4675.

Options:
    -l, --language LANGUAGE
        Label language code. Defaults to "en".

Behavior:
    Use the Wikidata wbgetentities API with:

        action=wbgetentities
        ids=<QID>
        props=labels
        languages=<LANGUAGE>
        languagefallback=1
        format=json

    The languagefallback=1 parameter is required so that
    multilingual/default labels, such as the "mul" label on Q4675, can be
    returned for English.

Plain text output:
    Mount St. Helens

JSON output:
    {
      "qid": "Q4675",
      "language": "en",
      "label": "Mount St. Helens"
    }

Subcommand: get-enwiki-page
---------------------------

Usage:
    wikidata_stuff.py get-enwiki-page QID

Purpose:
    Resolve a Wikidata QID to the corresponding English Wikipedia article.

Arguments:
    QID
        Wikidata item ID, such as Q42 or Q4675.

Behavior:
    First, use the Wikidata wbgetentities API with:

        action=wbgetentities
        ids=<QID>
        props=sitelinks
        sitefilter=enwiki
        format=json

    Read:

        entities[QID].sitelinks.enwiki.title

    Then resolve that title through the English Wikipedia API:

        action=query
        titles=<TITLE>
        redirects=1
        format=json

    The Wikipedia response provides the stable English Wikipedia pageid.

Plain text output:
    Mount St. Helens
    57064
    https://en.wikipedia.org/wiki/Mount_St._Helens

JSON output:
    {
      "qid": "Q4675",
      "title": "Mount St. Helens",
      "pageid": 57064,
      "url": "https://en.wikipedia.org/wiki/Mount_St._Helens"
    }

Notes:
    The URL is useful for humans, but the title and pageid are more useful
    for subsequent API queries.

Subcommand: list-enwiki-categories
----------------------------------

Usage:
    wikidata_stuff.py list-enwiki-categories QID [--hidden] [--limit N]

Purpose:
    List categories attached to the English Wikipedia article associated
    with a Wikidata QID.

Arguments:
    QID
        Wikidata item ID, such as Q42 or Q4675.

Options:
    --hidden
        Include hidden maintenance categories. By default, hidden
        categories should be excluded.
    --limit N
        Maximum number of categories to print. Defaults to all available
        categories by following continuation.

Behavior:
    Resolve QID to English Wikipedia pageid using the same behavior as the
    get-enwiki-page subcommand.

    Then call the English Wikipedia API:

        action=query
        pageids=<PAGEID>
        prop=categories
        clprop=hidden
        cllimit=max
        format=json

    If --hidden is not set, exclude categories returned with the hidden
    marker.

    Follow API continuation until there are no more category results or
    the requested --limit has been reached.

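The hidden-category rule above can be sketched against an already-parsed response. This is only a minimal sketch: it assumes that, with clprop=hidden, a hidden category object carries a "hidden" key and a visible one does not. The function name visible_categories and the sample list are illustrative and are not real API output.

```python
def visible_categories(categories, include_hidden=False, limit=None):
    """Normalize raw category objects to title/hidden dicts, filtering as requested."""
    result = []
    for category in categories:
        # Assumed marker: the mere presence of the key means "hidden".
        hidden = "hidden" in category
        if hidden and not include_hidden:
            continue
        result.append({"title": category["title"], "hidden": hidden})
        if limit is not None and len(result) >= limit:
            break
    return result


# Illustrative sample only; titles and shape are assumptions for the sketch.
sample = [
    {"ns": 14, "title": "Category:Active volcanoes"},
    {"ns": 14, "title": "Category:Articles with short description", "hidden": ""},
    {"ns": 14, "title": "Category:Cascade Volcanoes"},
]

for entry in visible_categories(sample):
    print(entry["title"])
```

Filtering on the client keeps the "hidden" flag available for JSON output when --hidden is set. Asking the server to filter with clshow=!hidden is an alternative that returns fewer rows.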
Plain text output:
    Category:Active volcanoes
    Category:Cascade Volcanoes
    Category:Mount St. Helens

JSON output:
    {
      "qid": "Q4675",
      "pageid": 57064,
      "title": "Mount St. Helens",
      "categories": [
        {
          "title": "Category:Active volcanoes",
          "hidden": false
        }
      ]
    }

Subcommand: list-enwiki-category-members
----------------------------------------

Usage:
    wikidata_stuff.py list-enwiki-category-members CATEGORY [--type TYPE] [--namespace NS] [--limit N]

Purpose:
    List members of an English Wikipedia category.

Arguments:
    CATEGORY
        English Wikipedia category title. The title should include the
        "Category:" prefix. Example:

            Category:Volcanoes of Washington (state)

Options:
    --type TYPE
        Restrict category member type. Allowed values:

            page
            subcat
            file

        If omitted, include all member types.
    --namespace NS
        Restrict results to a MediaWiki namespace number, such as 0 for
        articles or 14 for categories.
    --limit N
        Maximum number of members to print. Defaults to all available
        members by following continuation.

Behavior:
    Call the English Wikipedia categorymembers API:

        action=query
        list=categorymembers
        cmtitle=<CATEGORY>
        cmprop=ids|title|type
        cmlimit=max
        format=json

    Add cmtype=<TYPE> when --type is set.

    Add cmnamespace=<NS> when --namespace is set.

    Follow API continuation until there are no more members or the
    requested --limit has been reached.

Plain text output:
    Mount Adams
    Mount Baker
    Mount Rainier
    Mount St. Helens

JSON output:
    {
      "category": "Category:Volcanoes of Washington (state)",
      "members": [
        {
          "pageid": 12345,
          "ns": 0,
          "title": "Mount Adams",
          "type": "page"
        }
      ]
    }

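The continuation behavior described for this subcommand might look roughly like the generator below. It is a sketch under stated assumptions, not the script's actual implementation. The name iter_category_members and the injected fetch callable are hypothetical, and fake_fetch returns made-up two-batch data so the loop can be exercised without network access.

```python
from typing import Callable, Iterator, Optional


def iter_category_members(
    category: str,
    fetch: Callable[[dict], dict],
    member_type: Optional[str] = None,
    namespace: Optional[int] = None,
    limit: Optional[int] = None,
) -> Iterator[dict]:
    """Yield members, following "continue" until exhausted or the limit is hit."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmprop": "ids|title|type",
        "cmlimit": "max",
        "format": "json",
    }
    if member_type is not None:
        params["cmtype"] = member_type
    if namespace is not None:
        params["cmnamespace"] = str(namespace)
    count = 0
    while True:
        payload = fetch(dict(params))
        for member in payload.get("query", {}).get("categorymembers", []):
            yield member
            count += 1
            if limit is not None and count >= limit:
                return
        continuation = payload.get("continue")
        if not continuation:
            return
        # Merge the continuation values into the next request.
        params.update(continuation)


def fake_fetch(params: dict) -> dict:
    """Hypothetical stand-in for the API: two batches with invented page IDs."""
    if "cmcontinue" not in params:
        return {
            "continue": {"cmcontinue": "page|X|1", "continue": "-||"},
            "query": {"categorymembers": [
                {"pageid": 1, "ns": 0, "title": "Mount Adams", "type": "page"},
            ]},
        }
    return {
        "query": {"categorymembers": [
            {"pageid": 2, "ns": 0, "title": "Mount Baker", "type": "page"},
        ]},
    }


titles = [
    member["title"]
    for member in iter_category_members(
        "Category:Volcanoes of Washington (state)", fake_fetch
    )
]
print(titles)
```

Passing the fetch function in keeps the paging logic testable offline. In the real tool it could be a thin wrapper around whatever helper issues the HTTP request.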
Validation and errors
---------------------

QID validation:
    QIDs should match:

        ^Q[1-9][0-9]*$

    Input may be normalized to uppercase before validation.

Category validation:
    The list-enwiki-category-members command should require the
    "Category:" prefix. This keeps the command explicit and avoids
    silently querying the wrong title.

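A check for the category rule above could be as small as the following sketch. The name validate_category is hypothetical, and stripping surrounding whitespace and matching the prefix case-sensitively are assumptions that the specification does not state.

```python
CATEGORY_PREFIX = "Category:"


def validate_category(title: str) -> str:
    """Return the stripped title, or raise ValueError if the prefix is missing."""
    cleaned = title.strip()
    # Reject titles without the prefix, and the bare prefix with no name.
    if not cleaned.startswith(CATEGORY_PREFIX) or cleaned == CATEGORY_PREFIX:
        raise ValueError(
            f"invalid category {title!r}; expected a title starting with 'Category:'"
        )
    return cleaned


print(validate_category("Category:Volcanoes of Washington (state)"))
```

Raising ValueError lets a caller treat a bad category the same way as a bad QID and report it as an invalid command line argument.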
Exit codes:
    0
        Success.
    1
        Lookup, network, or API error.
    2
        Invalid command line arguments.

Error output:
    Errors should be printed to stderr and start with "error:".

Implementation notes
--------------------

Use urllib from the Python standard library. No third-party packages should be
required.

Use a descriptive User-Agent header for API requests.

Build URLs and query strings with urllib.parse.urlencode.

For Wikipedia page URLs, replace spaces with underscores and percent-encode
the title:

    https://en.wikipedia.org/wiki/<encoded_title>

Do not use the URL as the primary internal identifier. Prefer English
Wikipedia pageid for API calls, and keep the title for display.

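The page URL rule above can be illustrated with a short helper. This is a sketch: the name enwiki_page_url is hypothetical, and leaving "/" and ":" unescaped is an assumption made here for readability. The specification only asks for underscores and percent-encoding.

```python
import urllib.parse


def enwiki_page_url(title: str) -> str:
    """Build a human-facing English Wikipedia URL from a page title."""
    # Spaces become underscores first; everything else unsafe is percent-encoded.
    encoded = urllib.parse.quote(title.replace(" ", "_"), safe="/:")
    return f"https://en.wikipedia.org/wiki/{encoded}"


print(enwiki_page_url("Mount St. Helens"))
# https://en.wikipedia.org/wiki/Mount_St._Helens
```

With these settings parentheses are escaped, so "Category:Volcanoes of Washington (state)" ends in "Volcanoes_of_Washington_%28state%29".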
Deferred features
-----------------

Recursive category traversal should be added later as a separate command, such
as "tree" or "walk".

Reasons to defer recursion:
    Category graphs can contain cycles.
    Broad categories can produce very large result sets.
    Useful traversal needs explicit depth limits and type filters.

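To make the cycle and depth concerns concrete, here is a sketch of a bounded breadth-first walk. It is not part of the specified tool. The name walk_categories, the subcategories callable, and the in-memory graph are all hypothetical stand-ins for a future command and real API lookups.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, Tuple


def walk_categories(
    root: str,
    subcategories: Callable[[str], Iterable[str]],
    max_depth: int,
) -> Iterator[Tuple[str, int]]:
    """Breadth-first walk yielding (category, depth), visiting each title once."""
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        yield category, depth
        if depth >= max_depth:
            continue  # explicit depth limit: do not expand further
        for child in subcategories(category):
            if child not in seen:  # the seen set breaks cycles
                seen.add(child)
                queue.append((child, depth + 1))


# Hypothetical graph with a cycle: C links back to A.
graph = {
    "Category:A": ["Category:B", "Category:C"],
    "Category:B": ["Category:D"],
    "Category:C": ["Category:A"],
    "Category:D": [],
}

visited = list(
    walk_categories("Category:A", lambda c: graph.get(c, []), max_depth=1)
)
print(visited)
```

With max_depth=1 the walk stops before reaching "Category:D", and the back-link from C to A is ignored because A has already been seen.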
#!/usr/bin/env python3
"""Small read-only CLI for Wikidata and English Wikipedia lookups."""

from __future__ import annotations

import argparse
import json
import re
import sys
import urllib.error
import urllib.parse
import urllib.request

WIKIDATA_API_URL = "https://www.wikidata.org/w/api.php"
ENWIKI_API_URL = "https://en.wikipedia.org/w/api.php"
USER_AGENT = "wikidata-stuff/0.1"
# ASCII digits only, matching the pattern given in the specification
# (\d would also accept non-ASCII Unicode digits).
QID_RE = re.compile(r"^Q[1-9][0-9]*$")


def api_get(api_url: str, params: dict[str, str]) -> dict:
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(
        f"{api_url}?{query}",
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        return json.load(response)


def fetch_label(qid: str, language: str) -> str:
    payload = api_get(
        WIKIDATA_API_URL,
        {
            "action": "wbgetentities",
            "ids": qid,
            "props": "labels",
            "languages": language,
            "languagefallback": "1",
            "format": "json",
        },
    )
    entity = payload.get("entities", {}).get(qid)
    if not entity or entity.get("missing"):
        raise LookupError(f"{qid} was not found on Wikidata")
    label = entity.get("labels", {}).get(language, {}).get("value")
    if not label:
        raise LookupError(f"{qid} has no {language!r} label")
    return label


def fetch_enwiki_title(qid: str) -> str:
    payload = api_get(
        WIKIDATA_API_URL,
        {
            "action": "wbgetentities",
            "ids": qid,
            "props": "sitelinks",
            "sitefilter": "enwiki",
            "format": "json",
        },
    )
    entity = payload.get("entities", {}).get(qid)
    if not entity or entity.get("missing"):
        raise LookupError(f"{qid} was not found on Wikidata")
    title = entity.get("sitelinks", {}).get("enwiki", {}).get("title")
    if not title:
        raise LookupError(f"{qid} has no English Wikipedia article")
    return title


def fetch_enwiki_page(qid: str) -> dict:
    title = fetch_enwiki_title(qid)
    payload = api_get(
        ENWIKI_API_URL,
        {
            "action": "query",
            "titles": title,
            "redirects": "1",
            "format": "json",
        },
    )
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        if "missing" in page:
            break
        return {
            "pageid": page["pageid"],
            "title": page["title"],
        }
    raise LookupError(f"{title!r} was not found on English Wikipedia")


def fetch_enwiki_categories(pageid: int, include_hidden: bool, limit: int | None) -> list[dict]:
    categories = []
    params = {
        "action": "query",
        "pageids": str(pageid),
        "prop": "categories",
        "clprop": "hidden",
        "cllimit": "max",
        "format": "json",
    }
    if not include_hidden:
        params["clshow"] = "!hidden"
    while True:
        payload = api_get(ENWIKI_API_URL, params)
        pages = payload.get("query", {}).get("pages", {})
        page = pages.get(str(pageid), {})
        for category in page.get("categories", []):
            categories.append(
                {
                    "title": category["title"],
                    "hidden": "hidden" in category,
                }
            )
            if limit is not None and len(categories) >= limit:
                return categories
        continuation = payload.get("continue")
        if not continuation:
            return categories
        params.update(continuation)


def normalize_qid(qid: str) -> str:
    normalized = qid.upper()
    if not QID_RE.fullmatch(normalized):
        raise ValueError(f"invalid QID {qid!r}; expected something like Q42")
    return normalized


def positive_int(value: str) -> int:
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("must be at least 1")
    return number


def run_get_wikidata_label(args: argparse.Namespace) -> int:
    qid = normalize_qid(args.qid)
    label = fetch_label(qid, args.language)
    if args.output_json:
        print(
            json.dumps(
                {
                    "qid": qid,
                    "language": args.language,
                    "label": label,
                },
                indent=2,
            )
        )
    else:
        print(label)
    return 0


def run_list_enwiki_categories(args: argparse.Namespace) -> int:
    qid = normalize_qid(args.qid)
    page = fetch_enwiki_page(qid)
    categories = fetch_enwiki_categories(page["pageid"], args.hidden, args.limit)
    if args.output_json:
        print(
            json.dumps(
                {
                    "qid": qid,
                    "pageid": page["pageid"],
                    "title": page["title"],
                    "categories": categories,
                },
                indent=2,
            )
        )
    else:
        for category in categories:
            print(category["title"])
    return 0


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Read-only Wikidata and English Wikipedia lookup helpers."
    )
    parser.add_argument(
        "--json",
        action="store_true",
        dest="output_json",
        help="emit machine-readable JSON",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    label_parser = subparsers.add_parser(
        "get-wikidata-label",
        help="print the Wikidata label for a QID",
    )
    label_parser.add_argument("qid", help="Wikidata entity ID, for example Q42")
    label_parser.add_argument(
        "--json",
        action="store_true",
        dest="output_json",
        default=argparse.SUPPRESS,
        help="emit machine-readable JSON",
    )
    label_parser.add_argument(
        "-l",
        "--language",
        default="en",
        help="label language code to request (default: en)",
    )
    label_parser.set_defaults(func=run_get_wikidata_label)

    categories_parser = subparsers.add_parser(
        "list-enwiki-categories",
        help="list categories for the English Wikipedia article associated with a QID",
    )
    categories_parser.add_argument("qid", help="Wikidata entity ID, for example Q4675")
    categories_parser.add_argument(
        "--json",
        action="store_true",
        dest="output_json",
        default=argparse.SUPPRESS,
        help="emit machine-readable JSON",
    )
    categories_parser.add_argument(
        "--hidden",
        action="store_true",
        help="include hidden maintenance categories",
    )
    categories_parser.add_argument(
        "--limit",
        type=positive_int,
        help="maximum number of categories to print",
    )
    categories_parser.set_defaults(func=run_list_enwiki_categories)

    if len(sys.argv) == 1:
        parser.print_help()
        raise SystemExit(0)
    return parser.parse_args()


def main() -> int:
    args = parse_args()
    try:
        return args.func(args)
    except ValueError as error:
        print(f"error: {error}", file=sys.stderr)
        return 2
    except (LookupError, urllib.error.URLError, TimeoutError) as error:
        print(f"error: {error}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    raise SystemExit(main())