@aspose-com-gists
Last active October 1, 2025 06:37
This gist contains examples of data extraction with Aspose.HTML.

Aspose.HTML for Java – Data Extraction Examples

This gist repository contains practical Java snippets that show how to extract data and resources from HTML pages and websites using Aspose.HTML for Java: images, icons, inline and external SVGs, text nodes, and full pages with controllable resource handling. These gists are used in the Aspose.HTML for Java documentation, particularly in the Data Extraction chapter.

Topics Covered

  1. Select, navigate, and filter HTML to find target data using CSS selectors, TreeWalker, and a custom NodeFilter.
  2. Extract resources from web pages: download images, icons, and SVG (inline and external).
  3. Inspect and navigate HTML documents to read text, attributes, and structure.
  4. Save full web pages with control over which linked resources are retrieved via HTMLSaveOptions, ResourceHandlingOptions, MaxHandlingDepth, and PageUrlRestriction.
  5. Perform direct network requests to fetch arbitrary files via RequestMessage and ResponseMessage.
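Topic 1 can be illustrated without the library. The following is a minimal sketch that uses only the JDK's `org.w3c.dom.traversal` package, not Aspose.HTML, so it needs well-formed XML input rather than arbitrary HTML. The class name `ImageWalkerSketch` and the method `imageSources` are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;

// Hypothetical helper: JDK-only illustration of a TreeWalker with a custom NodeFilter
final class ImageWalkerSketch {

    /** Returns the src attribute of every img element, in document order. */
    static List<String> imageSources(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input must be well-formed XML", e);
        }

        // FILTER_SKIP hides a node from the walker but still descends into its children
        NodeFilter onlyImages = node -> "img".equals(node.getNodeName())
                ? NodeFilter.FILTER_ACCEPT
                : NodeFilter.FILTER_SKIP;

        // Assumption: the JDK's built-in DOM implementation supports DocumentTraversal
        TreeWalker walker = ((DocumentTraversal) doc).createTreeWalker(
                doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, onlyImages, false);

        List<String> sources = new ArrayList<>();
        for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
            sources.add(((Element) n).getAttribute("src"));
        }
        return sources;
    }

    public static void main(String[] args) {
        String xhtml = "<body><p>Hello,</p><img src='image1.png'/>"
                + "<div><img src='image2.png'/></div><p>World!</p></body>";
        System.out.println(imageSources(xhtml)); // prints [image1.png, image2.png]
    }
}
```

The second image sits inside a `div` that the filter skips, yet it is still reported, because skipping a node does not prune its subtree.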

What's Included?

This gist repository contains individual Java examples focused on data extraction workflows:

  • DOM Traversal. Navigate the DOM tree to inspect, read, and extract elements and text content from HTML documents.
  • Node Filtering. Use custom filters and TreeWalker to process only specific nodes, such as images or selected tags.
  • CSS Selectors. Use querySelector and querySelectorAll to quickly find elements by tag, class, or attribute.
  • Extract Images and Icons. Collect and download images, site icons, and other visual resources from web pages.
  • Download SVG from a Website. Extract inline SVG elements or download external SVG files for later use.
  • Inspect and Analyze HTML. Parse and explore document structure, attributes, and metadata.
  • Save Web Pages. Download entire websites or individual pages with advanced options like MaxHandlingDepth and PageUrlRestriction to control resources.
  • Fetch Files by URL. Perform direct requests to download and save arbitrary files from the web.
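To make the effect of a depth limit and a same-host restriction more concrete, here is a conceptual sketch. It is a simplified model, not how Aspose.HTML implements `MaxHandlingDepth` or `PageUrlRestriction`: the in-memory link graph, the `HandlingDepthSketch` class, its methods, and the `example.com` URLs are all hypothetical.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical model of depth-limited, host-restricted page handling
final class HandlingDepthSketch {

    /**
     * Breadth-first walk over a link graph. Depth 0 keeps only the start page,
     * depth 1 adds the pages it links to directly, and so on.
     */
    static List<String> pagesToSave(Map<String, List<String>> links, String start,
                                    int maxDepth, boolean sameHostOnly) {
        String startHost = URI.create(start).getHost();
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        seen.add(start);
        frontier.add(start);

        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String page : frontier) {
                for (String target : links.getOrDefault(page, Collections.emptyList())) {
                    boolean allowed = !sameHostOnly
                            || Objects.equals(startHost, URI.create(target).getHost());
                    // seen.add returns false for pages that were already collected
                    if (allowed && seen.add(target)) {
                        next.add(target);
                    }
                }
            }
            frontier = next;
        }
        return new ArrayList<>(seen);
    }

    /** A tiny made-up site: the root links to /a and to another host; /a links to /b. */
    static Map<String, List<String>> demoLinks() {
        Map<String, List<String>> links = new HashMap<>();
        links.put("https://example.com/",
                Arrays.asList("https://example.com/a", "https://other.example.org/x"));
        links.put("https://example.com/a", Arrays.asList("https://example.com/b"));
        return links;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = demoLinks();
        String start = "https://example.com/";
        System.out.println(pagesToSave(links, start, 0, false)); // start page only
        System.out.println(pagesToSave(links, start, 1, false)); // plus both linked pages
        System.out.println(pagesToSave(links, start, 1, true));  // plus /a only
    }
}
```

In this model, raising the depth widens the set of pages one link hop at a time, while the host restriction prunes targets before they are followed.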

Quick Start

  1. Install the latest version of Aspose.HTML for Java from https://releases.aspose.com/html/java/.
  2. For instructions on setting up the Aspose repository configuration and defining the Aspose.HTML for Java API dependency, see the How to Install Aspose.HTML for Java article.
  3. Browse the available gists and copy the code samples you need.
  4. Configure paths, settings, and inputs to suit your environment.
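When configuring paths, note that several snippets call a `$o(path)` helper that is not defined in this gist and name each downloaded file after the last segment of its URL. The sketch below shows one possible way to handle both steps with JDK classes only. It assumes that the helper maps a file name into an output directory; `OutputPathSketch`, its methods, and the `example.com` URLs are hypothetical.

```java
import java.net.URI;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: derives local output paths from resource URLs
final class OutputPathSketch {

    /** Resolves a possibly relative src or href value against the page's base URI. */
    static URI resolve(String baseUri, String reference) {
        // URI.create throws IllegalArgumentException for values such as unescaped spaces
        return URI.create(baseUri).resolve(reference);
    }

    /** Returns the last path segment, ignoring any query string; empty if there is none. */
    static String fileNameOf(URI url) {
        String path = url.getPath();
        if (path == null) {
            return "";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** Places the file name derived from the URL inside the given output directory. */
    static Path outputPath(String outputDir, URI url) {
        return Paths.get(outputDir).resolve(fileNameOf(url));
    }

    public static void main(String[] args) {
        URI url = resolve("https://example.com/docs/page/", "../img/logo.svg?v=2");
        System.out.println(url.getPath());             // prints /docs/img/logo.svg
        System.out.println(fileNameOf(url));           // prints logo.svg
        System.out.println(outputPath("output", url)); // separator depends on the OS
    }
}
```

Unlike `split("/")`, which drops trailing empty segments, `fileNameOf` returns an empty string for a URL that ends with a slash, so callers may want to substitute a default name such as `index.html` in that case.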

You can download a free trial of Aspose.HTML for Java and use a temporary license for unrestricted access.

About Aspose.HTML for Java

Aspose.HTML for Java is an on-premise Java library for parsing, navigating, processing, and converting HTML and related formats. It provides DOM APIs, CSS selector queries, XPath-style traversal helpers, a built-in networking stack, and configurable save options that make it straightforward to extract content and resources from web pages and HTML documents.
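As a rough illustration of XPath-based selection, the sketch below evaluates the expression `//*[@class='happy']//span` with the JDK's `javax.xml.xpath` API instead of Aspose.HTML. It therefore requires well-formed XML input; `XPathSketch` and `spanTexts` are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: JDK-only illustration of selecting nodes with XPath
final class XPathSketch {

    /** Returns the trimmed text of every span nested inside an element with class='happy'. */
    static List<String> spanTexts(String xhtml) {
        try {
            NodeList spans = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "//*[@class='happy']//span",
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < spans.getLength(); i++) {
                texts.add(spans.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Input must be well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<body><div class='happy'><div><span>Hello!</span></div></div>"
                + "<p class='happy'><span>World!</span></p></body>";
        System.out.println(spanTexts(xhtml)); // prints [Hello!, World!]
    }
}
```

The `@class='happy'` predicate compares the whole attribute value, so an element with `class='happy big'` would not match, whereas the CSS selector `.happy` would match it.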

Documentation and Resources

Prerequisites

  • Java SE 8 (or higher).
  • Supported operating systems: Windows, macOS, Linux.
  • A Java development environment (IntelliJ IDEA, Eclipse, or similar).
  • Build tool: Maven or Gradle for dependency management.
  • Aspose.HTML for Java library.
Aspose.HTML for Java - Data Extraction
// Create custom NodeFilter to accept only image elements in Java
// Learn more: https://docs.aspose.com/html/java/html-navigation/
public static class OnlyImageFilter extends NodeFilter {
@Override
public short acceptNode(Node n) {
// The current filter skips all elements, except IMG elements
return "img".equals(n.getLocalName())
? FILTER_ACCEPT
: FILTER_SKIP;
}
}
// Download external SVG images from HTML using Java
// Learn more: https://docs.aspose.com/html/java/extract-svg-from-website/
// Open a document you want to download external SVGs from
final HTMLDocument document = new HTMLDocument("https://products.aspose.com/html/net/");
// Collect all image elements
HTMLCollection images = document.getElementsByTagName("img");
// Create a distinct collection of relative image URLs
java.util.Set<String> urls = new HashSet<>();
for (Element element : images) {
urls.add(element.getAttribute("src"));
}
// Filter out non SVG images
java.util.List<String> svgUrls = new ArrayList<>();
for (String url : urls) {
if (url.endsWith(".svg")) {
svgUrls.add(url);
}
}
// Create absolute SVG image URLs
java.util.List<Url> absUrls = svgUrls.stream()
.map(src -> new Url(src, document.getBaseURI()))
.collect(Collectors.toList());
// Download and save each SVG image
for (Url url : absUrls) {
// Create a downloading request
final RequestMessage request = new RequestMessage(url);
// Download SVG image
final ResponseMessage response = document.getContext().getNetwork().send(request);
// Check whether response is successful
if (response.isSuccess()) {
String[] split = url.getPathname().split("/");
String path = split[split.length - 1];
// Save file to a local file system
FileHelper.writeAllBytes($o(path), response.getContent().readAsByteArray());
}
}
// Download icons from website using Java
// Learn more: https://docs.aspose.com/html/java/extract-images-from-website/
// Open a document you want to download icons from
final HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Collect all <link> elements
HTMLCollection links = document.getElementsByTagName("link");
// Leave only "icon" elements
java.util.Set<Element> icons = new HashSet<>();
for (Element link : links) {
if ("icon".equals(link.getAttribute("rel"))) {
icons.add(link);
}
}
// Create a distinct collection of relative icon URLs
java.util.Set<String> urls = new HashSet<>();
for (Element icon : icons) {
urls.add(icon.getAttribute("href"));
}
// Create absolute image URLs
java.util.List<Url> absUrls = urls.stream()
.map(src -> new Url(src, document.getBaseURI()))
.collect(Collectors.toList());
// Download and save each icon
for (Url url : absUrls) {
// Create a downloading request
final RequestMessage request = new RequestMessage(url);
// Extract icon
final ResponseMessage response = document.getContext().getNetwork().send(request);
// Check whether a response is successful
if (response.isSuccess()) {
String[] split = url.getPathname().split("/");
String path = split[split.length - 1];
// Save file to a local file system
FileHelper.writeAllBytes($o(path), response.getContent().readAsByteArray());
}
}
// Extract images from website using Java
// Learn more: https://docs.aspose.com/html/java/extract-images-from-website/
// Open a document you want to download images from
final HTMLDocument document = new HTMLDocument("https://docs.aspose.com/svg/net/drawing-basics/svg-shapes/");
// Collect all <img> elements
HTMLCollection images = document.getElementsByTagName("img");
// Create a distinct collection of relative image URLs
java.util.Set<String> urls = new HashSet<>();
for (Element e : images) {
urls.add(e.getAttribute("src"));
}
// Create absolute image URLs
java.util.List<Url> absUrls = urls.stream()
.map(src -> new Url(src, document.getBaseURI()))
.collect(Collectors.toList());
// Download and save each image
for (Url url : absUrls) {
// Create an image request message
final RequestMessage request = new RequestMessage(url);
// Extract image
final ResponseMessage response = document.getContext().getNetwork().send(request);
// Check whether a response is successful
if (response.isSuccess()) {
String[] split = url.getPathname().split("/");
String path = split[split.length - 1];
// Save file to a local file system
FileHelper.writeAllBytes($o(path), response.getContent().readAsByteArray());
}
}
// How to extract inline SVG images from a webpage using Java
// Learn more: https://docs.aspose.com/html/java/extract-svg-from-website/
// Open a document you want to download inline SVG images from
final HTMLDocument document = new HTMLDocument("https://products.aspose.com/html/net/");
// Collect all inline SVG images
HTMLCollection images = document.getElementsByTagName("svg");
for (int i = 0; i < images.getLength(); i++) {
// Save every image to a local file system
FileHelper.writeAllText(i + ".svg", images.get_Item(i).getOuterHTML());
}
// Navigate the HTML DOM using Java
// Learn more: https://docs.aspose.com/html/java/html-navigation/
// Prepare HTML code
String html_code = "<span>Hello,</span> <span>World!</span>";
// Initialize a document from the prepared code
HTMLDocument document = new HTMLDocument(html_code, ".");
// Get the reference to the first child (first <span>) of the document body
Element element = document.getBody().getFirstElementChild();
System.out.println(element.getTextContent());
// @output: Hello,
// Get the reference to the second <span> element
element = element.getNextElementSibling();
System.out.println(element.getTextContent());
// @output: World!
// Download file from URL using Java
// Learn more: https://docs.aspose.com/html/java/save-file-from-url/
// Create a blank document; it is required to access the network operations functionality
final HTMLDocument document = new HTMLDocument();
// Create a URL with the path to the resource you want to download
Url url = new Url("https://docs.aspose.com/html/net/message-handlers/message-handlers.png");
// Create a file request message
final RequestMessage request = new RequestMessage(url);
// Download file from URL
final ResponseMessage response = document.getContext().getNetwork().send(request);
// Check whether response is successful
if (response.isSuccess()) {
String[] split = url.getPathname().split("/");
String path = split[split.length - 1];
// Save file to a local file system
FileHelper.writeAllBytes($o(path), response.getContent().readAsByteArray());
}
// Extract and save a web page with default save options in Java
// Learn more: https://docs.aspose.com/html/java/website-to-html/
// Initialize an HTML document from a URL
final HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Prepare a path to save the downloaded file
String savePath = "root/result.html";
// Save the HTML document to the specified file
document.save(savePath);
// Save a website with limited resource depth using Java
// Learn more: https://docs.aspose.com/html/java/website-to-html/
// Load an HTML document from a URL
final HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Create an HTMLSaveOptions object and set the MaxHandlingDepth property
HTMLSaveOptions options = new HTMLSaveOptions();
options.getResourceHandlingOptions().setMaxHandlingDepth(1);
// Prepare a path for downloaded file saving
String savePath = "rootAndAdjacent/result.html";
// Save the HTML document to the specified file
document.save(savePath, options);
// Save a website with restricted resource URLs using Java
// Learn more: https://docs.aspose.com/html/java/website-to-html/
// Initialize an HTML document from a URL
final HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Create an HTMLSaveOptions object and set MaxHandlingDepth and PageUrlRestriction properties
HTMLSaveOptions options = new HTMLSaveOptions();
options.getResourceHandlingOptions().setMaxHandlingDepth(1);
options.getResourceHandlingOptions().setPageUrlRestriction(UrlRestriction.SameHost);
// Prepare a path to save the downloaded file
String savePath = "rootAndManyAdjacent/result.html";
// Save the HTML document to the specified file
document.save(savePath, options);
// Download website using HTMLSaveOptions in Java
// Learn more: https://docs.aspose.com/html/java/website-to-html/
// Initialize an HTML document from a URL
final HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Create an HTMLSaveOptions object and set the JavaScript property
HTMLSaveOptions options = new HTMLSaveOptions();
options.getResourceHandlingOptions().setJavaScript(ResourceHandling.Embed);
// Prepare a path to save the downloaded file
String savePath = "rootAndEmbedJs/result.html";
// Save the HTML document to the specified file
document.save(savePath, options);
// Select HTML elements using CSS selector querySelectorAll method in Aspose.HTML for Java
// Learn more: https://docs.aspose.com/html/java/html-navigation/
// Prepare HTML code
String code = "<div class='happy'>\n" +
"    <div>\n" +
"        <span>Hello,</span>\n" +
"    </div>\n" +
"</div>\n" +
"<p class='happy'>\n" +
"    <span>World!</span>\n" +
"</p>\n";
// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");
// Create a CSS selector that matches all SPAN elements nested inside elements with the class 'happy'
NodeList elements = document.querySelectorAll(".happy span");
// Iterate over the resulted list of elements
elements.forEach(element -> {
System.out.println(((HTMLElement) element).getInnerHTML());
// @output: Hello,
// @output: World!
});
// Select HTML elements using XPath expression in Aspose.HTML for Java
// Learn more: https://docs.aspose.com/html/java/html-navigation/
// Prepare HTML code
String code = "<div class='happy'>\n" +
"    <div>\n" +
"        <span>Hello!</span>\n" +
"    </div>\n" +
"</div>\n" +
"<p class='happy'>\n" +
"    <span>World!</span>\n" +
"</p>\n";
// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");
// Evaluate an XPath expression that selects all descendant <span> elements of elements whose 'class' attribute equals 'happy'
IXPathResult result = document.evaluate("//*[@class='happy']//span",
document,
null,
XPathResultType.Any,
null
);
// Iterate over the resulted nodes
for (Node node; (node = result.iterateNext()) != null; ) {
System.out.println(node.getTextContent());
// @output: Hello!
// @output: World!
}
// Filter HTML elements using TreeWalker and custom NodeFilter in Aspose.HTML for Java
// Learn more: https://docs.aspose.com/html/java/html-navigation/
// Prepare HTML code
String code = "<p>Hello,</p>\n" +
"<img src='image1.png'>\n" +
"<img src='image2.png'>\n" +
"<p>World!</p>\n";
// Initialize a document based on the prepared code
HTMLDocument document = new HTMLDocument(code, ".");
// To start HTML navigation, we need to create an instance of TreeWalker
// The specified parameters mean that it starts walking from the root of the document, iterating all nodes, and using our custom implementation of the filter
ITreeWalker iterator = document.createTreeWalker(document, NodeFilter.SHOW_ALL, new NodeFilterUsageExample.OnlyImageFilter());
// Walk through the nodes accepted by the filter
while (iterator.nextNode() != null) {
// Since we are using our own filter, the current node will always be an instance of the HTMLImageElement
// So, we don't need the additional validations here
HTMLImageElement image = (HTMLImageElement) iterator.getCurrentNode();
System.out.println(image.getSrc());
// @output: image1.png
// @output: image2.png
}