This gist repository contains practical Java snippets that show how to extract data and resources from HTML pages and websites using Aspose.HTML for Java: images, icons, inline and external SVGs, text nodes, and full pages with controllable resource handling. These gists are used in the Aspose.HTML for Java documentation, particularly in the Data Extraction chapter.
- Select, navigate, and filter HTML to find target data using CSS selectors, `TreeWalker`, and a custom `NodeFilter`.
- Extract resources from web pages: download images, icons, and SVG (inline and external).
- Inspect and navigate HTML documents to read text, attributes, and structure.
- Save full web pages with control over which linked resources are retrieved via `HTMLSaveOptions`, `ResourceHandlingOptions`, `MaxHandlingDepth`, and `PageUrlRestriction`.
- Perform direct network requests to fetch arbitrary files via `RequestMessage` and `ResponseMessage`.
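The request, response, and save flow behind the last item can be illustrated without the library. The sketch below is a JDK-only stand-in and does not use `RequestMessage` or `ResponseMessage`; the class name `FetchFileSketch` and the method `download` are hypothetical names chosen for illustration. It opens a URL, streams the response body to a file, and reports the number of bytes written.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// JDK-only stand-in (assumption): illustrates the fetch-and-save flow,
// not the Aspose.HTML networking API.
class FetchFileSketch {

    // Streams the resource at `source` into `target`, replacing any
    // existing file, and returns the number of bytes copied.
    static long download(URI source, Path target) {
        try (InputStream in = source.toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: FetchFileSketch <url> <target-file>");
            return;
        }
        long bytes = download(URI.create(args[0]), Paths.get(args[1]));
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```

The same method accepts `http:`, `https:`, and `file:` URIs, which makes it easy to try against a local file before pointing it at a website.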
This gist repository contains individual Java examples focused on data extraction workflows:
- DOM Traversal. Navigate the DOM tree to inspect, read, and extract elements and text content from HTML documents.
- Node Filtering. Use custom filters and `TreeWalker` to process only specific nodes, such as images or selected tags.
- CSS Selectors. Use `querySelector` and `querySelectorAll` to quickly find elements by tag, class, or attribute.
- Extract Images and Icons. Collect and download images, site icons, and other visual resources from web pages.
- Download SVG from a Website. Extract inline SVG elements or download external SVG files for later use.
- Inspect and Analyze HTML. Parse and explore document structure, attributes, and metadata.
- Save Web Pages. Download entire websites or individual pages with advanced options like `MaxHandlingDepth` and `PageUrlRestriction` to control which linked resources are saved.
- Fetch Files by URL. Perform direct requests to download and save arbitrary files from the web.
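To make the Node Filtering idea concrete, here is a minimal sketch of a tree walker with a custom filter that visits only `img` elements. It deliberately uses the W3C DOM traversal interfaces bundled with the JDK (`org.w3c.dom.traversal`) rather than the Aspose.HTML classes, so it runs without the library; the class name `ImageWalkerSketch` and the method `imageSources` are hypothetical, and the input must be well-formed XHTML because the JDK parser is an XML parser. The library's own `TreeWalker` examples in the gists may differ in detail.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;

// JDK-only stand-in (assumption): shows the walker + filter pattern,
// not the Aspose.HTML API.
class ImageWalkerSketch {

    // Custom filter: accept <img> elements and skip everything else.
    // FILTER_SKIP ignores the node itself but still visits its children.
    static final NodeFilter ONLY_IMAGES = node ->
            "img".equalsIgnoreCase(node.getNodeName())
                    ? NodeFilter.FILTER_ACCEPT
                    : NodeFilter.FILTER_SKIP;

    // Returns the src attribute of every <img> in document order.
    static List<String> imageSources(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                    doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, ONLY_IMAGES, false);
            List<String> sources = new ArrayList<>();
            for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
                sources.add(((Element) n).getAttribute("src"));
            }
            return sources;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Logo: <img src=\"logo.png\"/></p>"
                + "<div><img src=\"photo.jpg\"/><span>text</span></div></body></html>";
        System.out.println(imageSources(page)); // prints [logo.png, photo.jpg]
    }
}
```

Returning `FILTER_SKIP` rather than `FILTER_REJECT` matters here: rejecting a `div` or `p` would also hide any images nested inside it.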
- Install the latest version of Aspose.HTML for Java from https://releases.aspose.com/html/java/.
- For instructions on setting up the Aspose repository configuration and defining the Aspose.HTML for Java API dependency, see the How to Install Aspose.HTML for Java article.
- Browse the available gists and copy the code samples you need.
- Configure paths, settings, and inputs to suit your environment.
You can download a free trial of Aspose.HTML for Java and use a temporary license for unrestricted access.
Aspose.HTML for Java is an on-premise Java library for parsing, navigating, processing, and converting HTML and related formats. It provides DOM APIs, CSS selector queries, XPath-style traversal helpers, a built-in networking stack, and configurable save options that make it straightforward to extract content and resources from web pages and HTML documents.
- Official Documentation – Aspose.HTML for Java
- Product page – Aspose.HTML for Java
- Blog – Aspose.HTML Product Family
- API Reference – Aspose.HTML for Java
- Free Support Forum – Aspose.HTML
- Java SE 8 (or higher).
- Supported operating systems: Windows, macOS, Linux.
- A Java development environment (IntelliJ IDEA, Eclipse, or similar).
- Build tool: Maven or Gradle for dependency management.
- Aspose.HTML for Java library.