Very basic Python script for using Wikidata
wikidata_stuff.py command specification
=======================================

Goal
----

Provide a small command line tool for resolving Wikidata QIDs into useful
English Wikipedia identifiers, then using those identifiers for English
Wikipedia category queries.

The tool should be dependency-free Python and use the Wikidata and English
Wikipedia MediaWiki APIs directly.

Command overview
----------------

The CLI should use argparse subcommands:

    wikidata_stuff.py get-wikidata-label Q4675
    wikidata_stuff.py get-enwiki-page Q4675
    wikidata_stuff.py list-enwiki-categories Q4675
    wikidata_stuff.py list-enwiki-category-members "Category:Volcanoes of Washington (state)"

Subcommand names should be hyphenated verb phrases. Each subcommand should
also identify the relevant external system:

    wikidata
        Wikidata.

    enwiki
        English Wikipedia.

This avoids nouns that read like actions on local data. For example,
"get-wikidata-label" means "retrieve the Wikidata label", while "label" could
imply "apply a label".

Global options:

    -h, --help
        Show help.

    --json
        Emit machine-readable JSON instead of plain text.

Subcommand: get-wikidata-label
------------------------------

Usage:

    wikidata_stuff.py get-wikidata-label QID [-l LANGUAGE]

Purpose:

    Print the Wikidata label for a QID.

Arguments:

    QID
        Wikidata item ID, such as Q42 or Q4675.

Options:

    -l, --language LANGUAGE
        Label language code. Defaults to "en".

Behavior:

    Use the Wikidata wbgetentities API with:

        action=wbgetentities
        ids=<QID>
        props=labels
        languages=<LANGUAGE>
        languagefallback=1
        format=json

    The languagefallback=1 parameter is required so that multilingual/default
    labels, such as the "mul" label on Q4675, can be returned for English.

Plain text output:

    Mount St. Helens

JSON output:

    {
      "qid": "Q4675",
      "language": "en",
      "label": "Mount St. Helens"
    }

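The request described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names build_label_url and extract_label are hypothetical, and the sample payload is hand-written to match the shape the tool reads (labels keyed by the requested language), not a captured API response.

```python
import urllib.parse

WIKIDATA_API_URL = "https://www.wikidata.org/w/api.php"


def build_label_url(qid, language="en"):
    """Build the wbgetentities URL for a label lookup (hypothetical helper)."""
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels",
        "languages": language,
        "languagefallback": "1",  # lets a fallback label stand in for LANGUAGE
        "format": "json",
    }
    return f"{WIKIDATA_API_URL}?{urllib.parse.urlencode(params)}"


def extract_label(payload, qid, language="en"):
    """Return the label text, or None when the entity or label is absent."""
    entity = payload.get("entities", {}).get(qid, {})
    return entity.get("labels", {}).get(language, {}).get("value")


# Hand-written sample payload (an assumption about the response shape).
sample = {
    "entities": {
        "Q4675": {"labels": {"en": {"language": "mul", "value": "Mount St. Helens"}}}
    }
}

print(build_label_url("Q4675"))
print(extract_label(sample, "Q4675"))
```

Returning None for a missing label keeps the "not found" decision with the caller, which can then map it to exit code 1.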
Subcommand: get-enwiki-page
---------------------------

Usage:

    wikidata_stuff.py get-enwiki-page QID

Purpose:

    Resolve a Wikidata QID to the corresponding English Wikipedia article.

Arguments:

    QID
        Wikidata item ID, such as Q42 or Q4675.

Behavior:

    First, use the Wikidata wbgetentities API with:

        action=wbgetentities
        ids=<QID>
        props=sitelinks
        sitefilter=enwiki
        format=json

    Read:

        entities[QID].sitelinks.enwiki.title

    Then resolve that title through the English Wikipedia API:

        action=query
        titles=<TITLE>
        redirects=1
        format=json

    The Wikipedia response provides the stable English Wikipedia pageid.

Plain text output:

    Mount St. Helens
    57064
    https://en.wikipedia.org/wiki/Mount_St._Helens

JSON output:

    {
      "qid": "Q4675",
      "title": "Mount St. Helens",
      "pageid": 57064,
      "url": "https://en.wikipedia.org/wiki/Mount_St._Helens"
    }

Notes:

    The URL is useful for humans, but the title and pageid are more useful for
    subsequent API queries.

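The two-step resolution above can be illustrated without network access. The sketch below is an assumption-laden example: sitelink_title and first_existing_page are hypothetical helper names, and both payloads are hand-written samples that reuse the title and pageid from the example output, not captured API responses.

```python
def sitelink_title(payload, qid):
    """Read entities[QID].sitelinks.enwiki.title, or None if absent."""
    entity = payload.get("entities", {}).get(qid, {})
    return entity.get("sitelinks", {}).get("enwiki", {}).get("title")


def first_existing_page(payload):
    """Return pageid and title of the first page not marked missing."""
    for page in payload.get("query", {}).get("pages", {}).values():
        if "missing" not in page:
            return {"pageid": page["pageid"], "title": page["title"]}
    return None


# Step 1: hand-written stand-in for the Wikidata wbgetentities response.
wikidata_sample = {
    "entities": {
        "Q4675": {"sitelinks": {"enwiki": {"site": "enwiki", "title": "Mount St. Helens"}}}
    }
}

# Step 2: hand-written stand-in for the English Wikipedia query response.
enwiki_sample = {
    "query": {"pages": {"57064": {"pageid": 57064, "ns": 0, "title": "Mount St. Helens"}}}
}

title = sitelink_title(wikidata_sample, "Q4675")
page = first_existing_page(enwiki_sample)
print(title)
print(page["pageid"])
```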
Subcommand: list-enwiki-categories
----------------------------------

Usage:

    wikidata_stuff.py list-enwiki-categories QID [--hidden] [--limit N]

Purpose:

    List categories attached to the English Wikipedia article associated with a
    Wikidata QID.

Arguments:

    QID
        Wikidata item ID, such as Q42 or Q4675.

Options:

    --hidden
        Include hidden maintenance categories. By default, hidden categories
        should be excluded.

    --limit N
        Maximum number of categories to print. Defaults to all available
        categories by following continuation.

Behavior:

    Resolve QID to English Wikipedia pageid using the same behavior as the
    get-enwiki-page subcommand.

    Then call the English Wikipedia API:

        action=query
        pageids=<PAGEID>
        prop=categories
        clprop=hidden
        cllimit=max
        format=json

    If --hidden is not set, exclude categories returned with the hidden marker.

    Follow API continuation until there are no more category results or the
    requested --limit has been reached.

Plain text output:

    Category:Active volcanoes
    Category:Cascade Volcanoes
    Category:Mount St. Helens

JSON output:

    {
      "qid": "Q4675",
      "pageid": 57064,
      "title": "Mount St. Helens",
      "categories": [
        {
          "title": "Category:Active volcanoes",
          "hidden": false
        }
      ]
    }

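The continuation and filtering rules above can be sketched with an injected fetch function, so the loop can be exercised offline. Everything named here is an assumption for illustration: collect_categories and fake_fetch are hypothetical, the batches are invented data, and the continuation token is a made-up placeholder. This sketch filters hidden categories on the client side, which is one way to satisfy the rule.

```python
def collect_categories(fetch, pageid, include_hidden=False, limit=None):
    """Gather categories across batches, honoring --hidden and --limit."""
    params = {
        "action": "query",
        "pageids": str(pageid),
        "prop": "categories",
        "clprop": "hidden",
        "cllimit": "max",
        "format": "json",
    }
    results = []
    while True:
        payload = fetch(dict(params))
        page = payload.get("query", {}).get("pages", {}).get(str(pageid), {})
        for category in page.get("categories", []):
            hidden = "hidden" in category
            if hidden and not include_hidden:
                continue
            results.append({"title": category["title"], "hidden": hidden})
            if limit is not None and len(results) >= limit:
                return results
        continuation = payload.get("continue")
        if not continuation:
            return results
        # Merge the continuation values into the next request.
        params.update(continuation)


def fake_fetch(params):
    """Stand-in for the API call: two invented batches keyed on clcontinue."""
    if "clcontinue" not in params:
        return {
            "continue": {"clcontinue": "placeholder-token", "continue": "||"},
            "query": {"pages": {"57064": {"categories": [
                {"ns": 14, "title": "Category:Active volcanoes"},
                {"ns": 14, "title": "Category:Example hidden category", "hidden": ""},
            ]}}},
        }
    return {
        "query": {"pages": {"57064": {"categories": [
            {"ns": 14, "title": "Category:Cascade Volcanoes"},
            {"ns": 14, "title": "Category:Mount St. Helens"},
        ]}}},
    }


for entry in collect_categories(fake_fetch, 57064):
    print(entry["title"])
```

Passing the fetch function as an argument keeps the paging logic testable without touching the network.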
Subcommand: list-enwiki-category-members
----------------------------------------

Usage:

    wikidata_stuff.py list-enwiki-category-members CATEGORY [--type TYPE] [--namespace NS] [--limit N]

Purpose:

    List members of an English Wikipedia category.

Arguments:

    CATEGORY
        English Wikipedia category title. The title should include the
        "Category:" prefix. Example:

            Category:Volcanoes of Washington (state)

Options:

    --type TYPE
        Restrict category member type. Allowed values:

            page
            subcat
            file

        If omitted, include all member types.

    --namespace NS
        Restrict results to a MediaWiki namespace number, such as 0 for
        articles or 14 for categories.

    --limit N
        Maximum number of members to print. Defaults to all available members
        by following continuation.

Behavior:

    Call the English Wikipedia categorymembers API:

        action=query
        list=categorymembers
        cmtitle=<CATEGORY>
        cmprop=ids|title|type
        cmlimit=max
        format=json

    Add cmtype=<TYPE> when --type is set.

    Add cmnamespace=<NS> when --namespace is set.

    Follow API continuation until there are no more members or the requested
    --limit has been reached.

Plain text output:

    Mount Adams
    Mount Baker
    Mount Rainier
    Mount St. Helens

JSON output:

    {
      "category": "Category:Volcanoes of Washington (state)",
      "members": [
        {
          "pageid": 12345,
          "ns": 0,
          "title": "Mount Adams",
          "type": "page"
        }
      ]
    }

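The parameter rules above could be assembled roughly as follows. This is a sketch under assumptions: build_category_members_params is a hypothetical helper, and rejecting unknown member types with ValueError is one possible design, since argparse choices could enforce the same rule.

```python
ALLOWED_MEMBER_TYPES = ("page", "subcat", "file")


def build_category_members_params(category, member_type=None, namespace=None):
    """Build categorymembers query parameters, adding filters only when set."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmprop": "ids|title|type",
        "cmlimit": "max",
        "format": "json",
    }
    if member_type is not None:
        if member_type not in ALLOWED_MEMBER_TYPES:
            raise ValueError(f"invalid member type {member_type!r}")
        params["cmtype"] = member_type
    if namespace is not None:
        params["cmnamespace"] = str(namespace)
    return params


params = build_category_members_params(
    "Category:Volcanoes of Washington (state)", member_type="page", namespace=0
)
for key, value in params.items():
    print(f"{key}={value}")
```

The namespace is compared against None, not tested for truthiness, so that namespace 0 (articles) is still sent.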
Validation and errors
---------------------

QID validation:

    QIDs should match:

        ^Q[1-9][0-9]*$

    Input may be normalized to uppercase before validation.

Category validation:

    The list-enwiki-category-members command should require the "Category:"
    prefix. This keeps the command explicit and avoids silently querying the
    wrong title.

Exit codes:

    0
        Success.

    1
        Lookup, network, or API error.

    2
        Invalid command line arguments.

Error output:

    Errors should be printed to stderr and start with "error:".

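A minimal sketch of the validation rules and exit-code mapping above. The names require_category_title and exit_code_for are hypothetical, and treating every ValueError as an argument error is an assumption of this sketch, not something the specification mandates.

```python
import re

QID_RE = re.compile(r"^Q[1-9][0-9]*$")
CATEGORY_PREFIX = "Category:"


def normalize_qid(raw):
    """Uppercase the input, then validate it against the QID pattern."""
    qid = raw.upper()
    # fullmatch avoids accepting a trailing newline, which "$" alone allows.
    if not QID_RE.fullmatch(qid):
        raise ValueError(f"invalid QID {raw!r}; expected something like Q42")
    return qid


def require_category_title(raw):
    """Insist on an explicit, non-empty title after the Category: prefix."""
    if not raw.startswith(CATEGORY_PREFIX) or raw == CATEGORY_PREFIX:
        raise ValueError(f"invalid category {raw!r}; expected a 'Category:' prefix")
    return raw


def exit_code_for(error):
    """Map an exception to the exit codes listed above (2 for bad input)."""
    return 2 if isinstance(error, ValueError) else 1


print(normalize_qid("q4675"))
print(exit_code_for(ValueError("bad input")), exit_code_for(LookupError("not found")))
```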
Implementation notes
--------------------

Use urllib from the Python standard library. No third-party packages should be
required.

Use a descriptive User-Agent header for API requests.

Build URLs and query strings with urllib.parse.urlencode.

For Wikipedia page URLs, replace spaces with underscores and percent-encode
the title:

    https://en.wikipedia.org/wiki/<encoded_title>

Do not use the URL as the primary internal identifier. Prefer English
Wikipedia pageid for API calls, and keep the title for display.

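The page URL rule above might look like this in practice. The helper name enwiki_page_url is hypothetical, and keeping ":" and "/" unescaped is an assumption of this sketch made for readability; fully percent-encoded titles should resolve as well.

```python
import urllib.parse

ENWIKI_PAGE_BASE = "https://en.wikipedia.org/wiki/"


def enwiki_page_url(title):
    """Replace spaces with underscores, then percent-encode the title."""
    encoded = urllib.parse.quote(title.replace(" ", "_"), safe="/:")
    return ENWIKI_PAGE_BASE + encoded


print(enwiki_page_url("Mount St. Helens"))
print(enwiki_page_url("Category:Volcanoes of Washington (state)"))
```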
Deferred features
-----------------

Recursive category traversal should be added later as a separate command, such
as "tree" or "walk".

Reasons to defer recursion:

    Category graphs can contain cycles.

    Broad categories can produce very large result sets.

    Useful traversal needs explicit depth limits and type filters.

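To make the cycle and depth concerns concrete, here is a small sketch of a bounded breadth-first walk over an in-memory graph. It is not a proposed implementation of the deferred command: walk_categories is a hypothetical name, and the graph is invented data that deliberately contains a cycle.

```python
from collections import deque


def walk_categories(graph, root, max_depth):
    """Breadth-first walk that skips visited nodes and stops at max_depth."""
    seen = {root}
    queue = deque([(root, 0)])
    visited = []
    while queue:
        title, depth = queue.popleft()
        visited.append((title, depth))
        if depth >= max_depth:
            continue  # depth limit reached: do not expand this node
        for child in graph.get(title, []):
            if child not in seen:  # the seen set guards against cycles
                seen.add(child)
                queue.append((child, depth + 1))
    return visited


# Invented graph: Category:B links back to Category:A, forming a cycle.
graph = {
    "Category:A": ["Category:B", "Category:C"],
    "Category:B": ["Category:A", "Category:D"],
    "Category:C": [],
    "Category:D": ["Category:E"],
    "Category:E": [],
}

for title, depth in walk_categories(graph, "Category:A", max_depth=2):
    print(depth, title)
```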
#!/usr/bin/env python3
"""Small read-only CLI for Wikidata and English Wikipedia lookups."""

from __future__ import annotations

import argparse
import json
import re
import sys
import urllib.error
import urllib.parse
import urllib.request

WIKIDATA_API_URL = "https://www.wikidata.org/w/api.php"
ENWIKI_API_URL = "https://en.wikipedia.org/w/api.php"
USER_AGENT = "wikidata-stuff/0.1"
QID_RE = re.compile(r"^Q[1-9]\d*$")


def api_get(api_url: str, params: dict[str, str]) -> dict:
    """Send a GET request to a MediaWiki API and decode the JSON response."""
    query = urllib.parse.urlencode(params)
    request = urllib.request.Request(
        f"{api_url}?{query}",
        headers={"User-Agent": USER_AGENT},
    )
    with urllib.request.urlopen(request, timeout=15) as response:
        return json.load(response)


def fetch_label(qid: str, language: str) -> str:
    """Return the Wikidata label for qid, allowing language fallback."""
    payload = api_get(
        WIKIDATA_API_URL,
        {
            "action": "wbgetentities",
            "ids": qid,
            "props": "labels",
            "languages": language,
            "languagefallback": "1",
            "format": "json",
        },
    )
    entity = payload.get("entities", {}).get(qid)
    if not entity or entity.get("missing"):
        raise LookupError(f"{qid} was not found on Wikidata")
    label = entity.get("labels", {}).get(language, {}).get("value")
    if not label:
        raise LookupError(f"{qid} has no {language!r} label")
    return label


def fetch_enwiki_title(qid: str) -> str:
    """Return the English Wikipedia sitelink title for qid."""
    payload = api_get(
        WIKIDATA_API_URL,
        {
            "action": "wbgetentities",
            "ids": qid,
            "props": "sitelinks",
            "sitefilter": "enwiki",
            "format": "json",
        },
    )
    entity = payload.get("entities", {}).get(qid)
    if not entity or entity.get("missing"):
        raise LookupError(f"{qid} was not found on Wikidata")
    title = entity.get("sitelinks", {}).get("enwiki", {}).get("title")
    if not title:
        raise LookupError(f"{qid} has no English Wikipedia article")
    return title


def fetch_enwiki_page(qid: str) -> dict:
    """Resolve qid to the pageid and title of its English Wikipedia article."""
    title = fetch_enwiki_title(qid)
    payload = api_get(
        ENWIKI_API_URL,
        {
            "action": "query",
            "titles": title,
            "redirects": "1",
            "format": "json",
        },
    )
    pages = payload.get("query", {}).get("pages", {})
    # Only one title is requested, so the first page decides the outcome.
    for page in pages.values():
        if "missing" in page:
            break
        return {
            "pageid": page["pageid"],
            "title": page["title"],
        }
    raise LookupError(f"{title!r} was not found on English Wikipedia")


def fetch_enwiki_categories(pageid: int, include_hidden: bool, limit: int | None) -> list[dict]:
    """List categories for pageid, following continuation up to limit."""
    categories = []
    params = {
        "action": "query",
        "pageids": str(pageid),
        "prop": "categories",
        "clprop": "hidden",
        "cllimit": "max",
        "format": "json",
    }
    if not include_hidden:
        params["clshow"] = "!hidden"
    while True:
        payload = api_get(ENWIKI_API_URL, params)
        pages = payload.get("query", {}).get("pages", {})
        page = pages.get(str(pageid), {})
        for category in page.get("categories", []):
            categories.append(
                {
                    "title": category["title"],
                    "hidden": "hidden" in category,
                }
            )
            if limit is not None and len(categories) >= limit:
                return categories
        continuation = payload.get("continue")
        if not continuation:
            return categories
        params.update(continuation)


def normalize_qid(qid: str) -> str:
    """Uppercase qid and validate it against the QID pattern."""
    normalized = qid.upper()
    if not QID_RE.fullmatch(normalized):
        raise ValueError(f"invalid QID {qid!r}; expected something like Q42")
    return normalized


def positive_int(value: str) -> int:
    """argparse type for integers that must be at least 1."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("must be at least 1")
    return number


def run_get_wikidata_label(args: argparse.Namespace) -> int:
    qid = normalize_qid(args.qid)
    label = fetch_label(qid, args.language)
    if args.output_json:
        print(
            json.dumps(
                {
                    "qid": qid,
                    "language": args.language,
                    "label": label,
                },
                indent=2,
            )
        )
    else:
        print(label)
    return 0


def run_list_enwiki_categories(args: argparse.Namespace) -> int:
    qid = normalize_qid(args.qid)
    page = fetch_enwiki_page(qid)
    categories = fetch_enwiki_categories(page["pageid"], args.hidden, args.limit)
    if args.output_json:
        print(
            json.dumps(
                {
                    "qid": qid,
                    "pageid": page["pageid"],
                    "title": page["title"],
                    "categories": categories,
                },
                indent=2,
            )
        )
    else:
        for category in categories:
            print(category["title"])
    return 0


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Read-only Wikidata and English Wikipedia lookup helpers."
    )
    parser.add_argument(
        "--json",
        action="store_true",
        dest="output_json",
        help="emit machine-readable JSON",
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    label_parser = subparsers.add_parser(
        "get-wikidata-label",
        help="print the Wikidata label for a QID",
    )
    label_parser.add_argument("qid", help="Wikidata entity ID, for example Q42")
    # SUPPRESS keeps the subparser from overwriting a --json given before the
    # subcommand name.
    label_parser.add_argument(
        "--json",
        action="store_true",
        dest="output_json",
        default=argparse.SUPPRESS,
        help="emit machine-readable JSON",
    )
    label_parser.add_argument(
        "-l",
        "--language",
        default="en",
        help="label language code to request (default: en)",
    )
    label_parser.set_defaults(func=run_get_wikidata_label)

    categories_parser = subparsers.add_parser(
        "list-enwiki-categories",
        help="list categories for the English Wikipedia article associated with a QID",
    )
    categories_parser.add_argument("qid", help="Wikidata entity ID, for example Q4675")
    categories_parser.add_argument(
        "--json",
        action="store_true",
        dest="output_json",
        default=argparse.SUPPRESS,
        help="emit machine-readable JSON",
    )
    categories_parser.add_argument(
        "--hidden",
        action="store_true",
        help="include hidden maintenance categories",
    )
    categories_parser.add_argument(
        "--limit",
        type=positive_int,
        help="maximum number of categories to print",
    )
    categories_parser.set_defaults(func=run_list_enwiki_categories)

    if len(sys.argv) == 1:
        parser.print_help()
        raise SystemExit(0)
    return parser.parse_args()


def main() -> int:
    args = parse_args()
    try:
        return args.func(args)
    except json.JSONDecodeError as error:
        # JSONDecodeError subclasses ValueError; a malformed API response is
        # an API error (exit 1), not an invalid argument (exit 2).
        print(f"error: invalid API response: {error}", file=sys.stderr)
        return 1
    except ValueError as error:
        print(f"error: {error}", file=sys.stderr)
        return 2
    except (LookupError, urllib.error.URLError, TimeoutError) as error:
        print(f"error: {error}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    raise SystemExit(main())