Monolith Architectural Overview

Monolith is a command-line tool designed to save web pages as single HTML files. It embeds CSS, images, JavaScript, and other assets directly into the HTML, creating a self-contained, portable representation of the web page. This overview describes the architecture based on the provided source code and supporting files.

High-Level Structure

The application is written in Rust and uses a variety of crates (libraries) for different parts of its functionality. The core logic resides in src/main.rs, which orchestrates the process of downloading, parsing, modifying, and serializing the HTML document. The src/ directory is further organized into modules that handle specific aspects of the process:

  • src/main.rs: Main entry point. Handles command-line argument parsing, orchestrates the overall process, manages input (stdin or URL), and handles output (stdout or file).
  • src/opts.rs: Defines the command-line options and their parsing logic using the clap crate.
  • src/html.rs: Contains functions for manipulating the HTML DOM (Document Object Model). This includes parsing, traversing, modifying attributes, embedding assets, adding metadata, and serializing the DOM back into HTML.
  • src/css.rs: Provides functionality for processing CSS, primarily embedding assets referenced within CSS (e.g., background images, fonts). It uses the cssparser crate.
  • src/js.rs: Offers utilities for working with JavaScript, specifically identifying event handler attributes.
  • src/url.rs: Handles URL manipulation, including resolving relative URLs, creating data URLs, cleaning URLs (removing fragments), and parsing data URLs.
  • src/cookies.rs: Handles cookies, including loading them from files in Netscape format and determining whether a cookie is valid.
  • src/utils.rs: Contains utility functions, such as retrieving assets (making network requests), detecting media types, and parsing Content-Type headers.

The tests/ directory contains unit and integration tests for various parts of the application. The tests/_data_/ directory stores sample HTML, CSS, JS, and other files used for testing.
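
In rough pseudocode, the pipeline described above looks like the minimal sketch below. This is not Monolith's actual code: save_page and embed_assets are hypothetical names, and the html5ever / markup5ever_rcdom calls assume their recent APIs (roughly the 0.26 / 0.2 era).

    // Minimal sketch of the fetch -> parse -> embed -> serialize pipeline.
    // Not Monolith's actual code; `save_page` and `embed_assets` are placeholders.
    use std::io::Write;

    use html5ever::serialize::{serialize, SerializeOpts};
    use html5ever::tendril::TendrilSink;
    use markup5ever_rcdom::{RcDom, SerializableHandle};

    fn save_page(target: &str, output_path: &str) -> Result<(), Box<dyn std::error::Error>> {
        // Retrieval: fetch the document (URL case only; file/stdin input omitted here).
        let body = reqwest::blocking::get(target)?.bytes()?;

        // HTML parsing: build a DOM tree with html5ever + markup5ever_rcdom.
        let dom = html5ever::parse_document(RcDom::default(), Default::default())
            .from_utf8()
            .read_from(&mut body.as_ref())?;

        // Asset embedding: walk the tree and rewrite src/href/style attributes
        // into data: URLs (placeholder for the traversal described under "Data Flow").
        embed_assets(&dom, target);

        // Serialization: turn the modified DOM back into a single HTML file.
        let mut html = Vec::new();
        let document: SerializableHandle = dom.document.clone().into();
        serialize(&mut html, &document, SerializeOpts::default())?;
        std::fs::File::create(output_path)?.write_all(&html)?;
        Ok(())
    }

    fn embed_assets(_dom: &RcDom, _base_url: &str) {
        // Placeholder: see the "Data Flow" section for what this step involves.
    }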

Data Flow

  1. Input: The application receives a target URL (or file path, or standard input) and command-line options.
  2. Retrieval:
    • If the target is a URL, the reqwest crate is used to fetch the content. Headers, including cookies, referer, and User-Agent, are set based on command-line options. Timeouts are also handled here.
    • If the target is a file path, the content is read from the local filesystem using std::fs.
    • If the target is "-", the content is read from stdin using std::io::stdin.
  3. HTML Parsing: The HTML content (whether fetched over the network or read from a file/stdin) is parsed into a DOM tree using the html5ever and markup5ever_rcdom crates. The character encoding is taken from the HTTP headers, from the command-line option if one is given, or auto-detected from the HTML content itself.
  4. Base URL Resolution: The base URL used for resolving relative URLs is determined. It can be set explicitly via the -b option, read from a <base> tag in the HTML, or default to the target URL (a sketch of relative-URL resolution follows this list).
  5. Asset Embedding:
    • The DOM tree is traversed, and elements with relevant attributes (e.g., src, href, srcset, style) are processed.
    • URLs in srcset attributes are processed separately to handle multiple image sources.
    • For each URL:
      • If it's a data URL, it's parsed.
      • If it's a local file URL, the content is read from the filesystem, subject to the security constraints that prevent remote documents from pulling in local files.
      • If it's a remote URL (http/https), the reqwest client fetches the asset. Domain whitelisting/blacklisting is applied here.
      • The retrieved asset (or the original data URL) is embedded into the DOM. This usually involves converting the asset to a data URL using create_data_url and updating the corresponding attribute (a data-URL sketch also follows this list).
      • CSS is processed to embed assets referenced within url() functions and @import rules using the embed_css function.
    • JavaScript event handler attributes are identified, but no JavaScript is executed. Options allow removing JavaScript entirely.
    • <noscript> tags can be handled so that their contents are extracted, with links inside them embedded as well.
  6. CSS Processing: The embed_css function in src/css.rs handles:
    • Parsing CSS using cssparser.
    • Identifying URLs within url() functions and @import rules.
    • Recursively embedding assets referenced within the CSS, using the same retrieval and embedding process as for HTML assets.
  7. DOM Modification:
    • A <meta> tag with a Content Security Policy (CSP) can be added to isolate the document or restrict resource loading.
    • A <meta> tag containing metadata (save time, original URL) is added (unless disabled).
    • The document's charset can be enforced.
    • A <base> tag can be added or updated.
    • <noscript> tags are unwrapped if option -n is set.
  8. Serialization: The modified DOM tree is serialized back into HTML using html5ever::serialize.
  9. Output: The resulting single HTML file is written to standard output or to a specified file.
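
As a small illustration of step 4, relative references can be resolved against the base URL with the url crate that Monolith already depends on. This is a hedged sketch; resolve_url is an illustrative name, not necessarily what src/url.rs exposes.

    // Sketch of relative-URL resolution and fragment stripping with the `url` crate.
    // `resolve_url` is an illustrative helper, not necessarily Monolith's own API.
    use url::Url;

    fn resolve_url(base: &Url, reference: &str) -> Option<Url> {
        // `join` also accepts absolute references and returns them unchanged.
        let mut resolved = base.join(reference).ok()?;
        // Fragments (#...) are never sent to servers and are dropped when cleaning URLs.
        resolved.set_fragment(None);
        Some(resolved)
    }

    fn main() {
        let base = Url::parse("https://example.com/articles/page.html").unwrap();
        let resolved = resolve_url(&base, "../img/logo.png#top").unwrap();
        assert_eq!(resolved.as_str(), "https://example.com/img/logo.png");
    }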

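Step 5's create_data_url boils down to base64-encoding the asset bytes and prefixing them with their media type. Below is a minimal sketch that assumes the base64 crate's Engine API (0.21 and later); the helper name and signature are illustrative.

    // Sketch of turning fetched bytes into a data: URL, as described in step 5.
    // Illustrative only; assumes the base64 crate's Engine API (0.21+).
    use base64::{engine::general_purpose::STANDARD, Engine as _};

    fn to_data_url(media_type: &str, bytes: &[u8]) -> String {
        format!("data:{};base64,{}", media_type, STANDARD.encode(bytes))
    }

    fn main() {
        let url = to_data_url("text/css", b"body { background: #fff; }");
        // -> data:text/css;base64,Ym9keSB7IGJhY2tncm91bmQ6ICNmZmY7IH0=
        println!("{url}");
    }
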
Key Dependencies

  • reqwest: For making HTTP requests (fetching web pages and assets).
  • html5ever: For parsing and serializing HTML.
  • markup5ever_rcdom: For representing the DOM as a tree structure.
  • cssparser: For parsing CSS.
  • url: For URL parsing and manipulation.
  • clap: For command-line argument parsing.
  • base64: For Base64 encoding and decoding (used in data URLs).
  • chrono: For handling timestamps (used in metadata).
  • encoding_rs: For handling character encodings.
  • regex: Used for unwrapping NOSCRIPT tags.
  • sha2: Used for integrity checks (SHA-256, SHA-384, SHA-512); see the sketch after this list.
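
To illustrate how the sha2 dependency is typically used, the sketch below compares an asset against a subresource-integrity style value of the form sha384-<base64 digest>. The helper name is illustrative, and this is not necessarily how Monolith structures its own check.

    // Sketch of a subresource-integrity style comparison with sha2 and base64.
    // `integrity_matches` is an illustrative helper, not Monolith's own API.
    use base64::{engine::general_purpose::STANDARD, Engine as _};
    use sha2::{Digest, Sha256, Sha384, Sha512};

    fn integrity_matches(integrity: &str, data: &[u8]) -> bool {
        // An integrity value looks like "sha384-<base64 of the raw digest>".
        let Some((algorithm, expected_b64)) = integrity.split_once('-') else {
            return false;
        };
        let digest = match algorithm {
            "sha256" => Sha256::digest(data).to_vec(),
            "sha384" => Sha384::digest(data).to_vec(),
            "sha512" => Sha512::digest(data).to_vec(),
            _ => return false,
        };
        STANDARD.encode(digest) == expected_b64
    }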

Build and Deployment

  • Cargo.toml: Defines the project's metadata, dependencies, and features. It specifies the vendored-openssl feature, which statically links OpenSSL.
  • Dockerfile: Provides instructions for building a Docker image. It uses a multi-stage build, first building the monolith binary in a clux/muslrust environment (for static linking with musl libc) and then copying the binary into a minimal alpine image.
  • Makefile: Contains convenience commands for building, testing, installing, and uninstalling the application.
  • monolith.nuspec: Specifies metadata for a Chocolatey package.

Strengths

  • Self-Contained Output: The primary goal of producing a single, self-contained HTML file is achieved effectively.
  • Robustness: Handles various edge cases in HTML and CSS parsing, URL resolution, and content types.
  • Security Considerations: Includes features for security, like CSP, integrity checks, and preventing access to local files from remote documents.
  • Flexibility: Offers many command-line options to customize the embedding process.
  • Portability: Uses muslrust in the Dockerfile, resulting in a statically linked binary that can run on many Linux systems without dependency issues.
  • Test Coverage: A comprehensive set of tests covering various aspects of the application.

Potential Improvements

  • Asynchronous Requests: The application currently uses the blocking reqwest client. Switching to the asynchronous version of reqwest could significantly improve performance, especially when embedding many assets (a sketch of this pattern follows this list).
  • JavaScript Execution: Monolith does not execute JavaScript. For websites that heavily rely on JavaScript to render content, adding support for a headless browser (like headless Chrome or Firefox) could be considered, though this would significantly increase complexity.
  • Error Handling: While there is some error handling, it could be made more robust and informative, providing more specific error messages to the user.
  • Caching Optimization: The caching mechanism could be improved. For example, consider using a persistent cache (e.g., on disk) to avoid re-downloading assets across multiple runs.
  • CSS Parsing Improvement: A more specialized CSS parser could better handle uncommon constructs and improve security (e.g., by preventing CSS injection).
  • Memory Usage: For very large web pages with numerous embedded assets, memory usage could become a concern. Optimizations like streaming data directly to the output file, rather than holding the entire DOM in memory, could be investigated.
  • Configuration File: Introduce configuration file support to load user settings and avoid repetition of the same options on the command line.
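
To make the first suggestion concrete, the sketch below shows the general pattern for fetching assets concurrently with the asynchronous reqwest client, tokio, and the futures crate. It is illustrative only: the function name, the concurrency limit, and the minimal error handling are assumptions, and none of this exists in Monolith today.

    // Sketch of concurrent asset fetching with async reqwest + futures.
    // Requires tokio (macros + runtime), futures, and reqwest; names are illustrative.
    use futures::stream::{self, StreamExt};

    async fn fetch_assets(urls: Vec<String>) -> Vec<(String, Option<Vec<u8>>)> {
        let client = reqwest::Client::new();
        stream::iter(urls)
            .map(|url| {
                let client = client.clone();
                async move {
                    let body = match client.get(url.as_str()).send().await {
                        Ok(response) => response.bytes().await.ok().map(|b| b.to_vec()),
                        Err(_) => None,
                    };
                    (url, body)
                }
            })
            // Keep up to 8 requests in flight instead of fetching one asset at a time.
            .buffer_unordered(8)
            .collect()
            .await
    }

    #[tokio::main]
    async fn main() {
        let assets = fetch_assets(vec![
            "https://example.com/style.css".to_string(),
            "https://example.com/logo.png".to_string(),
        ])
        .await;
        for (url, body) in assets {
            println!("{url}: {} bytes", body.map(|b| b.len()).unwrap_or(0));
        }
    }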

This overview should provide a good understanding of Monolith's architecture and its key components. It is a well-structured tool that effectively achieves its goal of creating self-contained HTML archives of web pages. The suggested improvements could further enhance its performance, robustness, and usability.
