@danielpunkass
Created April 15, 2026 03:15
Prompt for an LLM to build a personal statements/documents downloading infrastructure

Prompt — "Build a personal document-download harness"

A clean-room prompt for handing to any capable coding LLM to reproduce the shape of this project from scratch, without inheriting scar tissue or any private information about the institutions this particular repo targets.


I want you to help me build a Python tool that logs into websites on my behalf, downloads my documents (statements, bills, tax forms, invoices), and files them into a tidy folder tree. I will run it on my own machine with my own credentials. You will never read the downloaded files or handle real credentials — your job is to help me build and maintain the tool itself, working only from each site's public HTML structure.

Non-negotiable privacy rules

  • Never ask me to paste account data, downloaded file contents, or authenticated-page HTML.
  • Never ask for real passwords, TOTP secrets, or session cookies.
  • When debugging, reason from the public login page and generic DOM patterns. If you need to see a specific authenticated page's structure, tell me what selectors/text to look for and I'll report back in anonymized form.

Architecture to build

Harness + plugin pattern. A small core drives a site-independent lifecycle; each site is a plugin file that supplies navigation, selectors, and download logic. Plugins are auto-discovered from a plugins/ directory.

Core components:

  • harness.py — orchestrates per-site runs: launch browser → plugin.login() → plugin.handle_mfa() if needed → plugin.download_documents() → plugin.logout(). Catches exceptions per-site so one failure doesn't block others.
  • browser.py — Playwright persistent context per plugin (~/.statementsync/browser_profiles/<plugin>/) so cookies/localStorage survive between runs and MFA is rare. Include an init-script hook for anti-fingerprinting patches (some sites fingerprint navigator.userAgentData.brands and reject non-"Google Chrome" Chromium).
  • config.py — YAML config at ~/.statementsync/config.yaml. Per-site entries under sites.<label>. Unknown keys land in an extras dict passed to the plugin. Passwords resolved via system keyring first, plaintext YAML as fallback.
  • cli.py — commands: init, add-site <site> <user> (prompts for password, stores in keyring), sync [--include X] [--exclude Y] [--keep-open] [--full-refresh] [--delete-cache], and list-plugins, plus a global -v verbose flag that must come before the subcommand.
  • tmux_runner.py — sync re-invokes itself per site inside detached tmux sessions (even for single-site runs), with a curses TUI showing live previews and [a]ttach/[k]ill/[l]og/[q]uit. Each session tees to a timestamped log. This isolates per-site failures and keeps an MFA prompt on one site from blocking the others.
  • base.py — SitePlugin abstract base with meta: PluginMeta, login(), handle_mfa() (default no-op), download_documents(), logout(). Provide helpers: prepare_dest(download_dir, subfolder, filename, doc_date) that builds <download_dir>/<account_label>/<year>/<doc_type>/<filename>, and memo helpers (see below).
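
To make the lifecycle concrete, here is a minimal sketch of the plugin base class and the per-site loop. It is an illustration, not a required design: the PluginMeta fields, the page argument, and the run_all and open_page names are assumptions, and open_page stands in for whatever browser.py ends up providing. handle_mfa() is called unconditionally here because its default is a no-op.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class PluginMeta:
    # Hypothetical fields; the prompt only requires that a plugin has `meta`.
    name: str
    label: str


class SitePlugin(ABC):
    meta: PluginMeta

    @abstractmethod
    def login(self, page) -> None: ...

    def handle_mfa(self, page) -> None:
        """Default no-op; plugins override this when the site needs MFA."""

    @abstractmethod
    def download_documents(self, page) -> None: ...

    @abstractmethod
    def logout(self, page) -> None: ...


def run_all(plugins, open_page):
    """Run each plugin's lifecycle; return {plugin name: exception or None}."""
    results = {}
    for plugin in plugins:
        try:
            page = open_page(plugin)  # stand-in for launching the browser
            plugin.login(page)
            plugin.handle_mfa(page)
            plugin.download_documents(page)
            plugin.logout(page)
            results[plugin.meta.name] = None
        except Exception as exc:
            # Record the failure and move on so other sites still run.
            results[plugin.meta.name] = exc
    return results
```

In the real harness the tmux runner provides the isolation between sites; the try/except above only shows the in-process equivalent.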

Folder convention: <root>/<account_label>/<year>/<doc_type>/<filename>. Standard doc types: Statements, Payments, Invoices, Summaries, Tax_Forms, Confirmations, Misc. Skip downloads where destination already exists — this is the primary dedup mechanism.
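
As a rough sketch of the folder convention and the exists-check dedup, something like the following would do. It is written as free functions for brevity, so account_label is passed explicitly (in the base class it would presumably come from the plugin's config), and needs_download is a hypothetical helper name.

```python
from datetime import date
from pathlib import Path

STANDARD_DOC_TYPES = {
    "Statements", "Payments", "Invoices", "Summaries",
    "Tax_Forms", "Confirmations", "Misc",
}


def prepare_dest(download_dir, account_label, subfolder, filename, doc_date):
    """Build <download_dir>/<account_label>/<year>/<doc_type>/<filename>.

    Creates the parent directories but never the file itself.
    """
    dest = (
        Path(download_dir) / account_label / str(doc_date.year)
        / subfolder / filename
    )
    dest.parent.mkdir(parents=True, exist_ok=True)
    return dest


def needs_download(dest):
    """Primary dedup: skip any document whose destination already exists."""
    return not dest.exists()
```

A plugin would call prepare_dest() for each listing row and only trigger the download when needs_download(dest) is true.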

Per-site memo (optional, opt-in per plugin): JSON file at ~/.statementsync/memos/<label>.json, loaded into self.memo before download_documents and saved after. Plugins use it to early-exit paginated listings once they see a date older than last run. Invariants: always strict < for comparisons (never <=, so same-date new docs aren't missed); advance memo only after the loop returns without exception; dest.exists() remains the source of truth for individual-file dedup. Support --full-refresh to ignore memo, and a memo_max_age TTL.
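
The memo invariants can be sketched as below, assuming the listing is sorted newest first and each row is a (doc_date, name) tuple. The function name and the row shape are illustrative assumptions. The new memo value is only returned once the loop has finished, so an exception part-way through leaves the stored memo untouched.

```python
from datetime import date


def collect_new_rows(rows, last_seen, full_refresh=False):
    """Return (rows to process, new memo date) for a newest-first listing.

    rows: iterable of (doc_date, name) tuples.
    last_seen: date stored in the memo, or None on the first run.
    """
    cutoff = None if full_refresh else last_seen
    to_process = []
    newest = last_seen
    for doc_date, name in rows:
        # Strict <, never <=: rows dated exactly at the cutoff are still
        # visited, so a new document on the same date is not missed.
        if cutoff is not None and doc_date < cutoff:
            break
        to_process.append((doc_date, name))
        if newest is None or doc_date > newest:
            newest = doc_date
    return to_process, newest
```

Rows that come back here still go through the dest.exists() check, so revisiting same-date rows costs a listing step but never a duplicate download.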

The inspect tool (critical)

Build tools/inspect_site.py: a Playwright-based interactive inspector that uses the shared persistent profile. When I run it with a URL, it:

  • Launches a visible browser window using the same profile/anti-fingerprinting flags as the real harness.
  • On every navigation, auto-dumps a snapshot of the current DOM (annotated: visible form inputs with their names/ids/labels, buttons with text/aria-label, links grouped by href pattern, iframes, any data-testid attributes) to /tmp/inspect-<timestamp>.log.
  • Accepts interactive commands: dump (re-snapshot now), goto <url>, eval <js>, quit.
  • Never writes authenticated content to the conversation — only to the local log file, which I inspect and report back from in summarized form.
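
For the interactive commands, a small parser along these lines may be enough. This is a sketch under assumptions: parse_command is a hypothetical name, and the choice to reject goto and eval without an argument is mine rather than something the spec above requires.

```python
KNOWN_COMMANDS = {"dump", "goto", "eval", "quit"}
NEEDS_ARGUMENT = {"goto", "eval"}


def parse_command(line):
    """Split an inspector input line into (command, argument).

    Returns (None, "") for a blank line and raises ValueError for an
    unknown command or a missing argument.
    """
    parts = line.strip().split(maxsplit=1)
    if not parts:
        return None, ""
    cmd = parts[0].lower()
    arg = parts[1] if len(parts) > 1 else ""
    if cmd not in KNOWN_COMMANDS:
        raise ValueError(f"unknown command: {cmd}")
    if cmd in NEEDS_ARGUMENT and not arg:
        raise ValueError(f"{cmd} needs an argument")
    return cmd, arg
```

Splitting only once keeps everything after eval intact, so JavaScript containing spaces reaches the page unchanged.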

This tool is how I give you the information you need to write selectors without exposing real account data.

Workflow for adding a new site

When I say "let's add site X," you should drive this conversation in four phases:

Phase 1 — inspect. Tell me exactly what to do with inspect_site.py: which URLs to hit (public login page, then — after I log in manually — the document listing pages), and what selectors/structures to look for. I'll come back with anonymized findings.

Phase 2 — recon questions. Before writing any code, ask me:

  • What are the doc types on this site, and which of the standard folders do they map to? (Propose a mapping; if something doesn't fit, propose a new standard name and we'll decide together.)
  • Is there one account or several? If several, is there an account-picker page, an org switcher, or are they all visible at once?
  • Login: single-step or multi-step form? MFA? Does the site offer passkeys? Is the session long-lived?
  • Listing: is it paginated? Sortable by date (confirm order)? Is there a server-side date filter we can push work to?
  • Downloads: do clicks trigger a normal browser download, navigate to a PDF URL, open a new tab with a PDF viewer, or render an inline viewer in the current page? Are there XHR calls to clean REST endpoints we could replay directly?

Phase 3 — build. Write the plugin. Default choices:

  • Prefer stable selectors: name, data-testid, aria-label, label-anchored locators. Avoid generated ids, exact CSS-module hashed class names (match the stable part with [class*="fragment"] instead), and positional indices across postbacks.
  • Try the authenticated URL first in login(); fall back to the login form only on redirect.
  • Wait for positive authenticated markers (specific element, query param, heading text) — never wait for "login form disappeared."
  • For downloads, prefer in order: direct API replay (if you can capture a clean XHR) → page.context.request.get(href) for predictable PDF URLs → expect_download() → route interception → new-tab capture (last resort; steals focus on macOS).
  • If the site opens PDFs in new tabs, patch window.open and URL.createObjectURL via context.add_init_script to capture URLs/blobs without activating popups.
  • Decide whether a memo is worth it. If per-row cost is low or the listing isn't reliably date-sorted, skip it — dest.exists() is enough.
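
The download preference order lends itself to a simple fallback chain. The sketch below assumes each strategy is wrapped as a zero-argument callable that returns the file bytes; download_with_fallbacks and the strategy labels are hypothetical names, and the Playwright calls themselves would live inside those callables.

```python
def download_with_fallbacks(strategies):
    """Try (name, callable) pairs in order of preference.

    Returns (name, data) from the first strategy that yields non-empty
    bytes. Raises RuntimeError listing every failure if none succeeds.
    """
    errors = []
    for name, attempt in strategies:
        try:
            data = attempt()
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if data:
            return name, data
        errors.append(f"{name}: returned no data")
    raise RuntimeError("all download strategies failed: " + "; ".join(errors))
```

A plugin would list only the strategies that make sense for its site, in the order given above, so the focus-stealing new-tab capture only runs when everything else has failed.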

Phase 4 — verify and retro. After I test, ask what broke. Then:

  1. Look for patterns that could move into the base class / harness.
  2. Update the project's CLAUDE.md with the new lessons (selectors that surprised us, new download strategies, anti-automation tricks).
  3. Update the add-plugin skill/guide if the recon questions or build defaults need adjustment.
  4. Commit everything together.

Steady-state behavior

Once a plugin is written and working, runs should be silent and successful until the site changes its HTML. Failures should be loud (non-zero exit, traceback visible in the tmux pane, which pauses via read before closing so I can see it), and should not cascade to other sites.

Deliverables I want from you

  • The repo skeleton (harness, browser, config, cli, tmux_runner, base plugin, inspect tool) with one reference plugin against a fake site I'll describe.
  • A CLAUDE.md in the repo that captures architecture and conventions but does not name any real institution or include any real account data. Lessons-learned entries should be phrased generically ("sites built on framework X sometimes do Y") rather than "SiteName does Z."
  • An add-plugin guide that walks future-me (or future-you) through the four phases above.

Start by asking me any clarifying questions about scope, then propose the initial file layout.
