Monolith is a command-line tool designed to save web pages as single HTML files. It embeds CSS, images, JavaScript, and other assets directly into the HTML, creating a self-contained, portable representation of the web page. This overview describes the architecture based on the provided source code and supporting files.
The application is written in Rust and uses a variety of crates (libraries) for different functionalities. The core logic resides in src/main.rs, which orchestrates the process of downloading, parsing, modifying, and serializing the HTML document. The src/ directory is further organized into modules that handle specific aspects of the process:
- `src/main.rs`: Main entry point. Handles command-line argument parsing, orchestrates the overall process, manages input (stdin or URL), and handles output (stdout or file).
- `src/opts.rs`: Defines the command-line options and their parsing logic using the `clap` crate.
- `src/html.rs`: Functions for manipulating the HTML DOM (Document Object Model): parsing, traversing, modifying attributes, embedding assets, adding metadata, and serializing the DOM back into HTML.
- `src/css.rs`: Processes CSS, primarily embedding assets referenced within CSS (e.g., background images, fonts). Uses the `cssparser` crate.
- `src/js.rs`: Utilities for working with JavaScript, specifically identifying event-handler attributes.
- `src/url.rs`: URL manipulation: resolving relative URLs, creating data URLs, cleaning URLs (removing fragments), and parsing data URLs.
- `src/cookies.rs`: Cookie handling: loading cookies from files (Netscape format) and determining cookie validity.
- `src/utils.rs`: Utility functions, such as retrieving assets (making network requests), detecting media types, and parsing Content-Type headers.
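As an illustration of the kind of helpers `src/url.rs` provides, here is a simplified, std-only sketch; the real implementation relies on the `url` crate, and these names and signatures are illustrative, not copied from the source:

```rust
/// Strip the fragment from a URL, as done when "cleaning" URLs.
/// (Illustrative; the real code operates on parsed `url::Url` values.)
fn clean_url(url: &str) -> &str {
    url.split('#').next().unwrap_or(url)
}

/// Split a data URL into (media type, payload); returns None for other schemes.
fn parse_data_url(url: &str) -> Option<(&str, &str)> {
    let rest = url.strip_prefix("data:")?;
    let (meta, payload) = rest.split_once(',')?;
    // Drop a trailing ";base64" marker to isolate the media type.
    let media_type = meta.strip_suffix(";base64").unwrap_or(meta);
    Some((media_type, payload))
}
```

A production version must also handle percent-encoding and relative-URL resolution, which the `url` crate provides.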
The `tests/` directory contains unit and integration tests for various parts of the application. The `tests/_data_/` directory stores sample HTML, CSS, JS, and other files used for testing.
- Input: The application receives a target URL (or file path, or standard input) and command-line options.
- Retrieval:
  - If the target is a URL, the `reqwest` crate is used to fetch the content. Headers, including cookies, referer, and User-Agent, are set based on command-line options. Timeouts are also handled here.
  - If the target is a file path, the content is read from the local filesystem using `std::fs`.
  - If the target is "-", the content is read from stdin using `std::io::stdin`.
- HTML Parsing: The fetched (or locally read) HTML content is parsed into a DOM tree using the `html5ever` and `markup5ever_rcdom` crates. The character encoding is either determined from HTTP headers, specified via a command-line option, or auto-detected from the HTML content.
- Base URL Resolution: The base URL for resolving relative URLs is determined. It can be set explicitly via the `-b` option, read from a `<base>` tag in the HTML, or defaults to the target URL.
- Asset Embedding:
  - The DOM tree is traversed, and elements with relevant attributes (e.g., `src`, `href`, `srcset`, `style`) are processed.
  - URLs in `srcset` attributes are processed separately to handle multiple image sources.
  - For each URL:
    - If it is a data URL, it is parsed.
    - If it is a local file URL, the content is read from the filesystem, subject to the security constraints on reading local files.
    - If it is a remote URL (http/https), the `reqwest` client fetches the asset. Domain whitelisting/blacklisting is applied here.
  - The retrieved asset (or the original data URL) is embedded into the DOM, usually by converting it to a data URL via `create_data_url` and updating the corresponding attribute.
  - CSS is processed to embed assets referenced within `url()` functions and `@import` rules, using the `embed_css` function.
  - JavaScript event-handler attributes are identified, but no JavaScript execution takes place. Options allow removing JavaScript entirely.
  - NOSCRIPT tags can be handled to extract their contents, also embedding links found inside them.
- CSS Processing: The `embed_css` function in `src/css.rs` handles:
  - Parsing CSS using `cssparser`.
  - Identifying URLs within `url()` functions and `@import` rules.
  - Recursively embedding assets referenced within the CSS, using the same retrieval and embedding process as for HTML assets.
- DOM Modification:
  - A `<meta>` tag with a Content Security Policy (CSP) can be added to isolate the document or restrict resource loading.
  - A `<meta>` tag containing metadata (save time, original URL) is added (unless disabled).
  - The document's charset can be enforced.
  - A `<base>` tag can be added or updated.
  - `<noscript>` tags are unwrapped if the `-n` option is set.
- Serialization: The modified DOM tree is serialized back into HTML using `html5ever::serialize`.
- Output: The resulting single HTML file is written to standard output or to a specified file.
- `reqwest`: For making HTTP requests (fetching web pages and assets).
- `html5ever`: For parsing and serializing HTML.
- `markup5ever_rcdom`: For representing the DOM as a tree structure.
- `cssparser`: For parsing CSS.
- `url`: For URL parsing and manipulation.
- `clap`: For command-line argument parsing.
- `base64`: For Base64 encoding and decoding (used in data URLs).
- `chrono`: For handling timestamps (used in metadata).
- `encoding_rs`: For handling character encodings.
- `regex`: Used for unwrapping NOSCRIPT tags.
- `sha2`: Used for integrity checks (sha256, sha384, sha512).
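As a hedged illustration, a dependency section pairing these crates might look like the following in `Cargo.toml`; the version numbers are illustrative, not the project's actual pins:

```toml
[dependencies]
base64 = "0.21"
chrono = "0.4"
clap = "4"
cssparser = "0.31"
encoding_rs = "0.8"
html5ever = "0.26"
markup5ever_rcdom = "0.2"
regex = "1"
reqwest = { version = "0.11", features = ["blocking"] }
sha2 = "0.10"
url = "2"
```

The `blocking` feature on `reqwest` matches the synchronous client the application currently uses.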
- `Cargo.toml`: Defines the project's metadata, dependencies, and features. It specifies the `vendored-openssl` feature, which statically links OpenSSL.
- `Dockerfile`: Provides instructions for building a Docker image. It uses a multi-stage build, first building the `monolith` binary in a `clux/muslrust` environment (for static linking with musl libc) and then copying the binary into a minimal `alpine` image.
- `Makefile`: Contains convenience commands for building, testing, installing, and uninstalling the application.
- `monolith.nuspec`: Specifies metadata for a Chocolatey package.
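Based on that description, the multi-stage build likely resembles this sketch; the image tags and the binary path are assumptions, not copied from the repository:

```dockerfile
# Stage 1: compile a statically linked binary against musl libc.
FROM clux/muslrust:stable AS builder
WORKDIR /src
COPY . .
RUN cargo build --release

# Stage 2: copy only the binary into a minimal runtime image.
FROM alpine:latest
COPY --from=builder /src/target/x86_64-unknown-linux-musl/release/monolith /usr/local/bin/monolith
ENTRYPOINT ["/usr/local/bin/monolith"]
```

The two-stage split keeps the Rust toolchain out of the final image, which contains little beyond the static binary.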
- Self-Contained Output: The primary goal of producing a single, self-contained HTML file is achieved effectively.
- Robustness: Handles various edge cases in HTML and CSS parsing, URL resolution, and content types.
- Security Considerations: Includes features for security, like CSP, integrity checks, and preventing access to local files from remote documents.
- Flexibility: Offers many command-line options to customize the embedding process.
- Portability: Uses `muslrust` in the Dockerfile, resulting in a statically linked binary that can run on many Linux systems without dependency issues.
- Test Coverage: A comprehensive set of tests covering various aspects of the application.
- Asynchronous Requests: Currently, the application uses the blocking `reqwest` client. Switching to the asynchronous version of `reqwest` could significantly improve performance, especially when embedding many assets.
- JavaScript Execution: Monolith does not execute JavaScript. For websites that rely heavily on JavaScript to render content, support for a headless browser (such as headless Chrome or Firefox) could be considered, though this would significantly increase complexity.
- Error Handling: While there is some error handling, it could be made more robust and informative, providing more specific error messages to the user.
- Caching Optimization: The caching mechanism could be improved. For example, consider using a persistent cache (e.g., on disk) to avoid re-downloading assets across multiple runs.
- CSS Parsing Improvement: Adopting a more specialized CSS parser could better handle uncommon constructs and improve security (e.g., preventing CSS injection).
- Memory Usage: For very large web pages with numerous embedded assets, memory usage could become a concern. Optimizations like streaming data directly to the output file, rather than holding the entire DOM in memory, could be investigated.
- Configuration File: Introduce configuration file support to load user settings and avoid repetition of the same options on the command line.
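As a sketch of the configuration-file idea above, a minimal key/value format could be parsed as follows; both the format and the function are hypothetical, since Monolith has no configuration-file support today:

```rust
use std::collections::HashMap;

/// Parse a hypothetical config file: one `key = value` per line,
/// blank lines and `#` comments ignored. Later keys override earlier ones.
fn parse_config(text: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        if let Some((key, value)) = line.split_once('=') {
            map.insert(key.trim().to_string(), value.trim().to_string());
        }
    }
    map
}
```

Values loaded this way would serve as defaults, with explicit command-line flags still taking precedence.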
This overview should provide a good understanding of Monolith's architecture and its key components. It is a well-structured tool that effectively achieves its goal of creating self-contained HTML archives of web pages. The suggested improvements could further enhance its performance, robustness, and usability.