Monolith is a command-line tool designed to save web pages as single HTML files. It embeds CSS, images, JavaScript, and other assets directly into the HTML, creating a self-contained, portable representation of the web page. This overview describes the architecture based on the provided source code and supporting files.
The application is written in Rust and uses a variety of crates (libraries) for different functionalities. The core logic resides in src/main.rs, which orchestrates the process of downloading, parsing, modifying, and serializing the HTML document. The src/ directory is further organized into modules that handle specific aspects of the process:
- src/main.rs: Main entry point. Handles command-line argument parsing, orchestrates the overall process, manages input (stdin or URL), and handles output (stdout or file).
- src/opts.rs: Defines the command-line options and their parsing logic using the clap crate.
- src/html.rs: Contains functions for manipulating the HTML DOM (Document Object Model). This includes parsing, traversing, modifying attributes, embedding assets, adding metadata, and serializing the DOM back into HTML.
- src/css.rs: Provides functionality for processing CSS, primarily embedding assets referenced within CSS (e.g., background images, fonts). It uses the cssparser crate.
- src/js.rs: Offers utilities for working with JavaScript, specifically identifying event handler attributes.
- src/url.rs: Handles URL manipulation, including resolving relative URLs, creating data URLs, cleaning URLs (removing fragments), and parsing data URLs.
- src/cookies.rs: Handles cookies. It allows loading from files (Netscape format) and determines cookie validity.
- src/utils.rs: Contains utility functions, such as retrieving assets (making network requests), detecting media types, and parsing Content-Type headers.
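To make the Content-Type parsing responsibility more concrete, here is a minimal sketch of what such a helper might look like. It is not taken from the Monolith source: the function name, the return shape, and the fallback values (text/plain and US-ASCII, borrowed from the data URL defaults) are assumptions made for illustration.

```rust
/// Hypothetical helper (not Monolith's actual code): splits a Content-Type
/// value into (media type, charset, is_base64).
/// Assumed fallbacks: "text/plain" and "US-ASCII" when a part is missing.
fn parse_content_type(value: &str) -> (String, String, bool) {
    let mut media_type = String::from("text/plain");
    let mut charset = String::from("US-ASCII");
    let mut is_base64 = false;

    for (i, part) in value.split(';').enumerate() {
        let part = part.trim();
        if i == 0 {
            // The first segment is the media type, e.g. "text/html".
            if !part.is_empty() {
                media_type = part.to_lowercase();
            }
        } else if part.eq_ignore_ascii_case("base64") {
            // Data URLs may carry a bare ";base64" marker.
            is_base64 = true;
        } else if let Some((key, val)) = part.split_once('=') {
            // Only the charset parameter is kept; others are ignored.
            if key.trim().eq_ignore_ascii_case("charset") {
                charset = val.trim().trim_matches('"').to_string();
            }
        }
    }

    (media_type, charset, is_base64)
}
```

Returning owned strings keeps the sketch simple; a real implementation might borrow from the input or use a dedicated MIME type.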
The tests/ directory contains unit and integration tests for various parts of the application. The tests/_data_/ directory stores sample HTML, CSS, JS, and other files used for testing.
The data flow through the application is as follows:
- Input: The application receives a target URL (or file path, or standard input) and command-line options.
- Retrieval:
  - If the target is a URL, the reqwest crate is used to fetch the content. Headers, including cookies, referer, and User-Agent, are set based on command-line options. Timeouts are also handled here.
  - If the target is a file path, the content is read from the local filesystem using std::fs.
  - If the target is "-", the content is read from stdin using std::io::stdin.
- HTML Parsing: The fetched HTML content (or the content read from a file or stdin) is parsed into a DOM tree using the html5ever and markup5ever_rcdom crates. The encoding is determined from the HTTP headers, specified by a command-line option, or auto-detected from the HTML content.
- Base URL Resolution: The base URL for resolving relative URLs is determined. It can be set explicitly via the -b option, read from a <base> tag in the HTML, or defaulted to the target URL.
- Asset Embedding:
  - The DOM tree is traversed, and elements with relevant attributes (e.g., src, href, srcset, style) are processed.
  - URLs in srcset attributes are processed separately to handle multiple image sources.
  - For each URL:
    - If it's a data URL, it's parsed.
    - If it's a local file URL, the content is read from the filesystem, subject to the security constraints on reading local files.
    - If it's a remote URL (http/https), the reqwest client fetches the asset. Domain whitelisting/blacklisting is applied here.
    - The retrieved asset (or the original data URL) is embedded into the DOM. This usually involves converting the asset to a data URL using create_data_url and updating the corresponding attribute.
  - CSS is processed to embed assets referenced within url() functions and @import rules using the embed_css function.
  - JavaScript event handler attributes are identified, but no JavaScript execution takes place. Options allow for removing JS entirely.
  - NOSCRIPT tags can be handled to extract their contents, also embedding links in noscript tags.
- CSS Processing: The embed_css function in src/css.rs handles:
  - Parsing CSS using cssparser.
  - Identifying URLs within url() functions and @import rules.
  - Recursively embedding assets referenced within the CSS, using the same retrieval and embedding process as for HTML assets.
- DOM Modification:
  - A <meta> tag with a Content Security Policy (CSP) can be added to isolate the document or restrict resource loading.
  - A <meta> tag containing metadata (save time, original URL) is added (unless disabled).
  - The document's charset can be enforced.
  - A <base> tag can be added or updated.
  - <noscript> tags are unwrapped if option -n is set.
- Serialization: The modified DOM tree is serialized back into HTML using html5ever::serialize.
- Output: The resulting single HTML file is written to standard output or to a specified file.
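The asset embedding step in the flow above comes down to turning fetched bytes into a data URL that can replace the original attribute value. The following is a rough sketch of that idea using only the standard library. The names make_data_url and base64_encode are hypothetical stand-ins, not Monolith's create_data_url (which, per the dependency list, relies on the base64 crate), and the sketch ignores charset handling.

```rust
// Standard Base64 alphabet (RFC 4648).
const B64: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical stand-in for the base64 crate: encodes bytes with padding.
fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the rest.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        // Missing input bytes are represented by '=' padding.
        out.push(if chunk.len() > 1 { B64[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { B64[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Hypothetical helper: builds a Base64 data URL for an asset.
fn make_data_url(media_type: &str, data: &[u8]) -> String {
    format!("data:{};base64,{}", media_type, base64_encode(data))
}
```

For example, make_data_url("text/plain", b"hi") yields data:text/plain;base64,aGk=, which could then be written back into a src or href attribute in place of the original URL.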
Key dependencies:
- reqwest: For making HTTP requests (fetching web pages and assets).
- html5ever: For parsing and serializing HTML.
- markup5ever_rcdom: For representing the DOM as a tree structure.
- cssparser: For parsing CSS.
- url: For URL parsing and manipulation.
- clap: For command-line argument parsing.
- base64: For Base64 encoding and decoding (used in data URLs).
- chrono: For handling timestamps (used in metadata).
- encoding_rs: For handling character encodings.
- regex: Used for unwrapping NOSCRIPT tags.
- sha2: Used for integrity checks (sha256, sha384, sha512).
Supporting files:
- Cargo.toml: Defines the project's metadata, dependencies, and features. It specifies the vendored-openssl feature, which statically links OpenSSL.
- Dockerfile: Provides instructions for building a Docker image. It uses a multi-stage build, first building the monolith binary in a clux/muslrust environment (for static linking with musl libc) and then copying the binary into a minimal alpine image.
- Makefile: Contains convenience commands for building, testing, installing, and uninstalling the application.
- monolith.nuspec: Specifies metadata for a Chocolatey package.
Strengths:
- Self-Contained Output: The primary goal of producing a single, self-contained HTML file is achieved effectively.
- Robustness: Handles various edge cases in HTML and CSS parsing, URL resolution, and content types.
- Security Considerations: Includes features for security, like CSP, integrity checks, and preventing access to local files from remote documents.
- Flexibility: Offers many command-line options to customize the embedding process.
- Portability: Uses muslrust in the Dockerfile, resulting in a statically linked binary that can run on many Linux systems without dependency issues.
- Test Coverage: A comprehensive set of tests covers various aspects of the application.
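The local-file protection mentioned under Security Considerations can be pictured as a simple scheme check: an asset with a file: URL is only read when the document that references it was itself loaded from a file: URL. The sketch below illustrates that rule under stated assumptions. The function names are hypothetical, and the scheme extraction is a simplified string check rather than the url crate's parser.

```rust
/// Hypothetical helper: returns the lowercased scheme of a URL-like string,
/// or None if the text before the first ':' is not a valid scheme.
fn scheme_of(url: &str) -> Option<String> {
    let (scheme, _) = url.split_once(':')?;
    let mut chars = scheme.chars();
    let first = chars.next()?;
    if first.is_ascii_alphabetic()
        && chars.all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '-' | '.'))
    {
        Some(scheme.to_ascii_lowercase())
    } else {
        None
    }
}

/// Hypothetical policy check: a file: asset may only be embedded when the
/// parent document is also a file: URL. Other schemes are not restricted here.
fn may_embed(parent_url: &str, asset_url: &str) -> bool {
    match scheme_of(asset_url).as_deref() {
        Some("file") => scheme_of(parent_url).as_deref() == Some("file"),
        _ => true,
    }
}
```

With this rule, a page fetched over https that references file:///etc/passwd is refused, while a locally saved page can still pull in images stored next to it.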
Potential improvements:
- Asynchronous Requests: Currently, the application uses the blocking reqwest client. Switching to the asynchronous version of reqwest could significantly improve performance, especially when embedding many assets.
- JavaScript Execution: Monolith does not execute JavaScript. For websites that heavily rely on JavaScript to render content, adding support for a headless browser (like headless Chrome or Firefox) could be considered, though this would significantly increase complexity.
- Error Handling: While there is some error handling, it could be made more robust and informative, providing more specific error messages to the user.
- Caching Optimization: The caching mechanism could be improved. For example, consider using a persistent cache (e.g., on disk) to avoid re-downloading assets across multiple runs.
- CSS Parsing Improvement: A more specialized CSS parser could better handle uncommon constructs and improve security (for example, by preventing CSS injection).
- Memory Usage: For very large web pages with numerous embedded assets, memory usage could become a concern. Optimizations like streaming data directly to the output file, rather than holding the entire DOM in memory, could be investigated.
- Configuration File: Introduce configuration file support to load user settings and avoid repetition of the same options on the command line.
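The performance gain hoped for from the Asynchronous Requests item comes from overlapping the wait time of many asset downloads. As a rough illustration of that idea, the sketch below fetches several assets concurrently using scoped threads from the standard library. This is a swapped-in technique, not asynchronous reqwest, and fetch_stub is a hypothetical placeholder that performs no network I/O.

```rust
use std::thread;

/// Hypothetical placeholder for a network fetch: it just echoes the URL
/// bytes. A real implementation would issue an HTTP request here.
fn fetch_stub(url: &str) -> Vec<u8> {
    url.as_bytes().to_vec()
}

/// Fetches every URL on its own scoped thread and returns the results in
/// the same order as the input slice.
fn fetch_all(urls: &[&str]) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        // Spawn all workers first so the fetches overlap in time.
        let handles: Vec<_> = urls
            .iter()
            .map(|&url| s.spawn(move || fetch_stub(url)))
            .collect();

        // Join in input order so results line up with `urls`.
        handles
            .into_iter()
            .map(|h| h.join().expect("fetch thread panicked"))
            .collect()
    })
}
```

One thread per URL is unbounded, so a page with hundreds of assets would need a worker pool or a concurrency limit. An async client would address the same problem without dedicating a thread to each in-flight request.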
This overview should provide a good understanding of Monolith's architecture and its key components. It is a well-structured tool that effectively achieves its goal of creating self-contained HTML archives of web pages. The suggested improvements could further enhance its performance, robustness, and usability.