Skip to content

Instantly share code, notes, and snippets.

@robla
Last active May 4, 2026 02:27
Show Gist options
  • Select an option

  • Save robla/32c3eacdf5c89431e7d81833b7fa713a to your computer and use it in GitHub Desktop.

Select an option

Save robla/32c3eacdf5c89431e7d81833b7fa713a to your computer and use it in GitHub Desktop.
Very basic Python script for using Wikidata
wikistuff.py command specification
==================================
Goal
----
Provide a small read-only command line tool for Wikidata and English Wikipedia
lookups.
This specification describes only behavior currently implemented in
wikistuff.py.
Command overview
----------------
Implemented subcommands:
wikistuff.py help [SUBCOMMAND]
wikistuff.py get-wikidata-label QID
wikistuff.py list-enwiki-categories QID
wikistuff.py list-example-humans
Global options:
-h, --help
Show argparse help.
--json
Emit machine-readable JSON for subcommands that support JSON output.
Subcommand names are hyphenated verb phrases. Commands that query external
systems identify that system in the command name:
wikidata
Wikidata.
enwiki
English Wikipedia.
example
A small example query, currently backed by Wikidata Query Service.
The help subcommand is an exception. It does not query Wikidata or English
Wikipedia; it only prints local command usage text.
If wikistuff.py is run with no arguments, it prints top-level help and exits
successfully.
Subcommand: help
----------------
Usage:
wikistuff.py help [SUBCOMMAND]
Purpose:
Show help for the tool or for a specific implemented subcommand.
Arguments:
SUBCOMMAND
Optional subcommand name. Implemented choices are:
get-wikidata-label
help
list-example-humans
list-enwiki-categories
Behavior:
If SUBCOMMAND is omitted, print top-level argparse help.
If SUBCOMMAND is provided, print argparse help for that subcommand.
This command does not make network requests.
JSON output:
JSON output is not implemented for help.
Subcommand: list-example-humans
-------------------------------
Usage:
wikistuff.py list-example-humans [--json]
Purpose:
List ten example human Wikidata items using SPARQL.
Options:
--json
Emit the raw SPARQL JSON binding rows instead of plain text.
Behavior:
Query the Wikidata Query Service SPARQL endpoint:
https://query.wikidata.org/sparql
The implemented query is:
SELECT ?item ?itemLabel WHERE {
?item wdt:P31 wd:Q5.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
Request format=json.
Plain text output:
Each result is printed as:
<label><tab><item-uri>
JSON output:
Prints the raw list from:
results.bindings
Subcommand: get-wikidata-label
------------------------------
Usage:
wikistuff.py get-wikidata-label [--json] [-l LANGUAGE] QID
Purpose:
Print the Wikidata label for a QID.
Arguments:
QID
Wikidata item ID, such as Q42 or Q4675.
Options:
--json
Emit JSON instead of plain text.
-l, --language LANGUAGE
Label language code. Defaults to "en".
Behavior:
Normalize the QID to uppercase, then validate it.
Use the Wikidata wbgetentities API:
action=wbgetentities
ids=<QID>
props=labels
languages=<LANGUAGE>
languagefallback=1
format=json
The languagefallback=1 parameter allows multilingual/default labels, such
as the "mul" label on Q4675, to be returned for English.
Plain text output:
Mount St. Helens
JSON output:
{
"qid": "Q4675",
"language": "en",
"label": "Mount St. Helens"
}
Subcommand: list-enwiki-categories
----------------------------------
Usage:
wikistuff.py list-enwiki-categories [--json] [--hidden] [--limit N] QID
Purpose:
List categories attached to the English Wikipedia article associated with a
Wikidata QID.
Arguments:
QID
Wikidata item ID, such as Q42 or Q4675.
Options:
--json
Emit JSON instead of plain text.
--hidden
Include hidden maintenance categories. Hidden categories are excluded
by default.
--limit N
Maximum number of categories to print. N must be at least 1. If omitted,
follow API continuation until all available categories are returned.
Behavior:
Normalize the QID to uppercase, then validate it.
First, resolve the QID to an English Wikipedia title using the Wikidata
wbgetentities API:
action=wbgetentities
ids=<QID>
props=sitelinks
sitefilter=enwiki
format=json
Read:
entities[QID].sitelinks.enwiki.title
Next, resolve that title to an English Wikipedia page ID using the English
Wikipedia API:
action=query
titles=<TITLE>
redirects=1
format=json
Finally, request categories for that page ID using the English Wikipedia
API:
action=query
pageids=<PAGEID>
prop=categories
clprop=hidden
cllimit=max
format=json
If --hidden is not set, add:
clshow=!hidden
Follow API continuation until there are no more category results or the
requested --limit has been reached.
Plain text output:
Category:19th-century volcanic events
Category:20th-century volcanic events
Category:21st-century volcanic events
JSON output:
{
"qid": "Q4675",
"pageid": 36649,
"title": "Mount St. Helens",
"categories": [
{
"title": "Category:19th-century volcanic events",
"hidden": false
}
]
}
Validation and errors
---------------------
QID validation:
QIDs must match:
^Q[1-9][0-9]*$
Input is normalized to uppercase before validation.
Limit validation:
--limit must be an integer greater than or equal to 1.
Exit codes:
0
Success.
1
Lookup, network, timeout, or API error.
2
Invalid command line arguments or invalid QID.
Error output:
Runtime errors handled by wikistuff.py are printed to stderr and start with
"error:".
Argparse errors use argparse's default error format.
Implementation notes
--------------------
The implementation uses only the Python standard library:
argparse
json
re
sys
urllib
Network requests are read-only HTTPS GET requests to:
https://www.wikidata.org/w/api.php
https://en.wikipedia.org/w/api.php
https://query.wikidata.org/sparql
The current User-Agent header is:
wikistuff/0.1
Query strings are built with urllib.parse.urlencode.
#!/usr/bin/env python3
"""Small read-only CLI for Wikidata and English Wikipedia lookups."""
from __future__ import annotations
import argparse
import json
import re
import sys
import urllib.error
import urllib.parse
import urllib.request
WIKIDATA_API_URL = "https://www.wikidata.org/w/api.php"
ENWIKI_API_URL = "https://en.wikipedia.org/w/api.php"
WIKIDATA_SPARQL_URL = "https://query.wikidata.org/sparql"
USER_AGENT = "wikistuff/0.1"
QID_RE = re.compile(r"^Q[1-9]\d*$")
def api_get(api_url: str, params: dict[str, str]) -> dict:
query = urllib.parse.urlencode(params)
request = urllib.request.Request(
f"{api_url}?{query}",
headers={"User-Agent": USER_AGENT},
)
with urllib.request.urlopen(request, timeout=15) as response:
return json.load(response)
def fetch_label(qid: str, language: str) -> str:
payload = api_get(
WIKIDATA_API_URL,
{
"action": "wbgetentities",
"ids": qid,
"props": "labels",
"languages": language,
"languagefallback": "1",
"format": "json",
},
)
entity = payload.get("entities", {}).get(qid)
if not entity or entity.get("missing"):
raise LookupError(f"{qid} was not found on Wikidata")
label = entity.get("labels", {}).get(language, {}).get("value")
if not label:
raise LookupError(f"{qid} has no {language!r} label")
return label
def fetch_enwiki_title(qid: str) -> str:
payload = api_get(
WIKIDATA_API_URL,
{
"action": "wbgetentities",
"ids": qid,
"props": "sitelinks",
"sitefilter": "enwiki",
"format": "json",
},
)
entity = payload.get("entities", {}).get(qid)
if not entity or entity.get("missing"):
raise LookupError(f"{qid} was not found on Wikidata")
title = entity.get("sitelinks", {}).get("enwiki", {}).get("title")
if not title:
raise LookupError(f"{qid} has no English Wikipedia article")
return title
def fetch_enwiki_page(qid: str) -> dict:
title = fetch_enwiki_title(qid)
payload = api_get(
ENWIKI_API_URL,
{
"action": "query",
"titles": title,
"redirects": "1",
"format": "json",
},
)
pages = payload.get("query", {}).get("pages", {})
for page in pages.values():
if "missing" in page:
break
return {
"pageid": page["pageid"],
"title": page["title"],
}
raise LookupError(f"{title!r} was not found on English Wikipedia")
def fetch_enwiki_categories(pageid: int, include_hidden: bool, limit: int | None) -> list[dict]:
categories = []
params = {
"action": "query",
"pageids": str(pageid),
"prop": "categories",
"clprop": "hidden",
"cllimit": "max",
"format": "json",
}
if not include_hidden:
params["clshow"] = "!hidden"
while True:
payload = api_get(ENWIKI_API_URL, params)
pages = payload.get("query", {}).get("pages", {})
page = pages.get(str(pageid), {})
for category in page.get("categories", []):
categories.append(
{
"title": category["title"],
"hidden": "hidden" in category,
}
)
if limit is not None and len(categories) >= limit:
return categories
continuation = payload.get("continue")
if not continuation:
return categories
params.update(continuation)
def normalize_qid(qid: str) -> str:
normalized = qid.upper()
if not QID_RE.fullmatch(normalized):
raise ValueError(f"invalid QID {qid!r}; expected something like Q42")
return normalized
def positive_int(value: str) -> int:
number = int(value)
if number < 1:
raise argparse.ArgumentTypeError("must be at least 1")
return number
def run_get_wikidata_label(args: argparse.Namespace) -> int:
qid = normalize_qid(args.qid)
label = fetch_label(qid, args.language)
if args.output_json:
print(
json.dumps(
{
"qid": qid,
"language": args.language,
"label": label,
},
indent=2,
)
)
else:
print(label)
return 0
def run_list_enwiki_categories(args: argparse.Namespace) -> int:
qid = normalize_qid(args.qid)
page = fetch_enwiki_page(qid)
categories = fetch_enwiki_categories(page["pageid"], args.hidden, args.limit)
if args.output_json:
print(
json.dumps(
{
"qid": qid,
"pageid": page["pageid"],
"title": page["title"],
"categories": categories,
},
indent=2,
)
)
else:
for category in categories:
print(category["title"])
return 0
def run_list_example_humans(args: argparse.Namespace) -> int:
query = """
SELECT ?item ?itemLabel WHERE {
?item wdt:P31 wd:Q5.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""
payload = api_get(
WIKIDATA_SPARQL_URL,
{
"query": query,
"format": "json",
},
)
rows = payload.get("results", {}).get("bindings", [])
if args.output_json:
print(json.dumps(rows, indent=2))
else:
for row in rows:
label = row.get("itemLabel", {}).get("value", "")
item = row.get("item", {}).get("value", "")
print(f"{label}\t{item}")
return 0
def run_help(args: argparse.Namespace) -> int:
if args.topic:
args.command_parsers[args.topic].print_help()
else:
args.parser.print_help()
return 0
def parse_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(
description="Read-only Wikidata and English Wikipedia lookup helpers."
)
parser.add_argument(
"--json",
action="store_true",
dest="output_json",
help="emit machine-readable JSON",
)
subparsers = parser.add_subparsers(dest="command", required=True)
command_parsers = {}
label_parser = subparsers.add_parser(
"get-wikidata-label",
help="print the Wikidata label for a QID",
)
command_parsers["get-wikidata-label"] = label_parser
label_parser.add_argument("qid", help="Wikidata entity ID, for example Q42")
label_parser.add_argument(
"--json",
action="store_true",
dest="output_json",
default=argparse.SUPPRESS,
help="emit machine-readable JSON",
)
label_parser.add_argument(
"-l",
"--language",
default="en",
help="label language code to request (default: en)",
)
label_parser.set_defaults(func=run_get_wikidata_label)
categories_parser = subparsers.add_parser(
"list-enwiki-categories",
help="list categories for the English Wikipedia article associated with a QID",
)
command_parsers["list-enwiki-categories"] = categories_parser
categories_parser.add_argument("qid", help="Wikidata entity ID, for example Q4675")
categories_parser.add_argument(
"--json",
action="store_true",
dest="output_json",
default=argparse.SUPPRESS,
help="emit machine-readable JSON",
)
categories_parser.add_argument(
"--hidden",
action="store_true",
help="include hidden maintenance categories",
)
categories_parser.add_argument(
"--limit",
type=positive_int,
help="maximum number of categories to print",
)
categories_parser.set_defaults(func=run_list_enwiki_categories)
humans_parser = subparsers.add_parser(
"list-example-humans",
help="list ten example human Wikidata items using SPARQL",
)
command_parsers["list-example-humans"] = humans_parser
humans_parser.add_argument(
"--json",
action="store_true",
dest="output_json",
default=argparse.SUPPRESS,
help="emit machine-readable JSON",
)
humans_parser.set_defaults(func=run_list_example_humans)
help_parser = subparsers.add_parser(
"help",
help="show help for the tool or a subcommand",
)
help_topics = sorted([*command_parsers, "help"])
help_parser.add_argument(
"topic",
nargs="?",
choices=help_topics,
help="subcommand to describe",
)
command_parsers["help"] = help_parser
help_parser.set_defaults(
func=run_help,
parser=parser,
command_parsers=command_parsers,
)
if len(sys.argv) == 1:
parser.print_help()
raise SystemExit(0)
return parser.parse_args()
def main() -> int:
args = parse_args()
try:
return args.func(args)
except ValueError as error:
print(f"error: {error}", file=sys.stderr)
return 2
except (LookupError, urllib.error.URLError, TimeoutError) as error:
print(f"error: {error}", file=sys.stderr)
return 1
if __name__ == "__main__":
raise SystemExit(main())
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment