Extract data from a web page or any HTML document with Aspose.HTML for .NET

Data Extraction Examples

This GitHub gist repository contains C# code examples used in the Aspose.HTML for .NET documentation, specifically within the Data Extraction section. These gists showcase various approaches and techniques for effectively parsing, navigating, and extracting information from HTML documents using the Aspose.HTML for .NET library.

Key Topics

  • DOM Traversal. Programmatically navigate and manipulate the DOM tree using W3C-compliant traversal interfaces to inspect and retrieve content from HTML documents.
  • Filter and process nodes. Select and manipulate specific parts of the HTML.
  • Apply XPath queries and CSS selectors. Navigate the HTML document structure to retrieve specific nodes.
  • Download images and SVGs from a website. Programmatically extract various types of images from a website using C#.
  • Download website. Programmatically extract and save websites, and customize the saving process with options such as MaxHandlingDepth and PageUrlRestriction.
  • Extract data from tables. Retrieve structured information from HTML tables.

How to Use These Examples

Each example is self-contained and demonstrates a particular aspect of extracting structured or media content from HTML sources.

  1. Ensure your .NET project references the Aspose.HTML for .NET library. You can get it via NuGet.
  2. Select the example you're interested in and copy its content.
  3. Paste the code into your project and execute it to see how data extraction works.
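The snippets reference `DataDir` and `OutputDir` without defining them, so a pasted example will not compile until both exist. The block below is a minimal sketch of one way to supply them. The class name `ExamplePaths`, the helper `PrepareOutputPath`, and the `data`/`output` folder layout are assumptions made for illustration and are not part of the gist or of Aspose.HTML.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the gist): supplies the DataDir and
// OutputDir values that the snippets use but never define.
public static class ExamplePaths
{
    // Assumed layout: "data" and "output" folders next to the executable.
    public static readonly string DataDir = Path.Combine(AppContext.BaseDirectory, "data");
    public static readonly string OutputDir = Path.Combine(AppContext.BaseDirectory, "output");

    // Builds a path under OutputDir and creates any missing parent folders,
    // so a relative name such as "root/result.html" can be written directly.
    public static string PrepareOutputPath(string relativePath)
    {
        string fullPath = Path.Combine(OutputDir, relativePath);
        var directory = Path.GetDirectoryName(fullPath);
        if (!string.IsNullOrEmpty(directory))
        {
            Directory.CreateDirectory(directory);
        }
        return fullPath;
    }
}
```

With `using static ExamplePaths;` at the top of a file, the `DataDir` and `OutputDir` names in the snippets resolve unchanged. Calling `PrepareOutputPath("links1.json")` instead of `Path.Combine(OutputDir, "links1.json")` additionally makes sure the target folder exists before a file is written.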

You can download a free trial of Aspose.HTML for .NET and use a temporary license for unrestricted access.

Related Documentation

These samples support the tutorials in the Data Extraction chapter of the official documentation.

Requirements

  • .NET 6.0+, .NET Core, or .NET Framework
  • Aspose.HTML for .NET library
Aspose.HTML for .NET – Extract Data from HTML
// Namespaces used by the snippets below (each snippet needs only a subset)
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;
using Aspose.Html;
using Aspose.Html.Collections;
using Aspose.Html.Dom;
using Aspose.Html.Dom.Traversal;
using Aspose.Html.Dom.Traversal.Filters;
using Aspose.Html.Dom.XPath;
using Aspose.Html.Net;
using Aspose.Html.Saving;
// Filter only <img> elements in HTML tree using C#
// Learn more: https://docs.aspose.com/html/net/html-navigation/
class OnlyImageFilter : Aspose.Html.Dom.Traversal.Filters.NodeFilter
{
    public override short AcceptNode(Node n)
    {
        // The current filter skips all nodes except <img> elements
        return string.Equals("img", n.LocalName)
            ? FILTER_ACCEPT
            : FILTER_SKIP;
    }
}
// Extract data from HTML table using C#
// Learn more: https://docs.aspose.com/html/net/data-extraction/
// Load the HTML document from a URL
using HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/edit-html-document/");
// Create a list to store extracted hyperlink data
List<Dictionary<string, string>> linkList = new List<Dictionary<string, string>>();
// Get all <table> elements in the document
HTMLCollection tables = document.GetElementsByTagName("table");
for (int t = 0; t < tables.Length; t++)
{
    // Access the table element
    Element element = tables[t];
    HTMLElement htmlTable = element as HTMLElement;
    if (htmlTable == null)
        continue;
    // Get all <a> elements (hyperlinks) within this table only
    HTMLCollection links = htmlTable.GetElementsByTagName("a");
    for (int i = 0; i < links.Length; i++)
    {
        Element link = links[i];
        string href = link.GetAttribute("href");
        string text = link.TextContent != null ? link.TextContent.Trim() : string.Empty;
        // Add hyperlink to the result list if href exists
        if (!string.IsNullOrEmpty(href))
        {
            Dictionary<string, string> item = new Dictionary<string, string>
            {
                { "text", text },
                { "href", href }
            };
            linkList.Add(item);
        }
    }
}
// Configure JSON serializer to output indented (pretty) format
JsonSerializerOptions options = new JsonSerializerOptions
{
    WriteIndented = true
};
// Serialize the link list to JSON
string json = JsonSerializer.Serialize(linkList, options);
// Ensure output directory exists
Directory.CreateDirectory(OutputDir);
// Save the JSON to a file
string outputPath = Path.Combine(OutputDir, "links1.json");
File.WriteAllText(outputPath, json);
Console.WriteLine("Hyperlinks successfully extracted and saved to: " + outputPath);
// Download external SVG images from HTML using C#
// Learn more: https://docs.aspose.com/html/net/extract-svg-from-website/
// Open a document you want to download external SVGs from
using HTMLDocument document = new HTMLDocument("https://products.aspose.com/html/net/");
// Collect all image elements
HTMLCollection images = document.GetElementsByTagName("img");
// Create a distinct collection of relative image URLs
IEnumerable<string> urls = images.Select(element => element.GetAttribute("src")).Distinct();
// Keep only SVG images; <img> elements without a src attribute are skipped
IEnumerable<string> svgUrls = urls.Where(url => !string.IsNullOrEmpty(url) && url.EndsWith(".svg"));
// Create absolute SVG image URLs
IEnumerable<Url> absUrls = svgUrls.Select(src => new Url(src, document.BaseURI));
foreach (Url url in absUrls)
{
    // Create a downloading request
    using RequestMessage request = new RequestMessage(url);
    // Download SVG image
    using ResponseMessage response = document.Context.Network.Send(request);
    // Check whether the response is successful
    if (response.IsSuccess)
    {
        // Save SVG image to the local file system
        File.WriteAllBytes(Path.Combine(OutputDir, url.Pathname.Split('/').Last()), response.Content.ReadAsByteArray());
    }
}
// Download icons from website using C#
// Learn more: https://docs.aspose.com/html/net/extract-images-from-website/
// Open a document you want to download icons from
using HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Collect all <link> elements
HTMLCollection links = document.GetElementsByTagName("link");
// Leave only "icon" elements
IEnumerable<Element> icons = links.Where(link => link.GetAttribute("rel") == "icon");
// Create a distinct collection of relative icon URLs
IEnumerable<string> urls = icons.Select(icon => icon.GetAttribute("href")).Distinct();
// Create absolute icon URLs
IEnumerable<Url> absUrls = urls.Select(src => new Url(src, document.BaseURI));
foreach (Url url in absUrls)
{
    // Create a downloading request
    using RequestMessage request = new RequestMessage(url);
    // Extract icon
    using ResponseMessage response = document.Context.Network.Send(request);
    // Check whether the response is successful
    if (response.IsSuccess)
    {
        // Save icon to the local file system
        File.WriteAllBytes(Path.Combine(OutputDir, url.Pathname.Split('/').Last()), response.Content.ReadAsByteArray());
    }
}
// Extract images from website using C#
// Learn more: https://docs.aspose.com/html/net/extract-images-from-website/
// Open a document you want to download images from
using HTMLDocument document = new HTMLDocument("https://docs.aspose.com/svg/net/drawing-basics/svg-shapes/");
// Collect all <img> elements
HTMLCollection images = document.GetElementsByTagName("img");
// Create a distinct collection of relative image URLs
IEnumerable<string> urls = images.Select(element => element.GetAttribute("src")).Distinct();
// Create absolute image URLs
IEnumerable<Url> absUrls = urls.Select(src => new Url(src, document.BaseURI));
foreach (Url url in absUrls)
{
    // Create an image request message
    using RequestMessage request = new RequestMessage(url);
    // Extract image
    using ResponseMessage response = document.Context.Network.Send(request);
    // Check whether the response is successful
    if (response.IsSuccess)
    {
        // Save image to the local file system
        File.WriteAllBytes(Path.Combine(OutputDir, url.Pathname.Split('/').Last()), response.Content.ReadAsByteArray());
    }
}
// How to extract inline SVG images from a webpage using C#
// Learn more: https://docs.aspose.com/html/net/extract-svg-from-website/
// Open a document you want to download inline SVG images from
using HTMLDocument document = new HTMLDocument("https://products.aspose.com/html/net/");
// Collect all inline SVG images
HTMLCollection images = document.GetElementsByTagName("svg");
for (int i = 0; i < images.Length; i++)
{
    // Save each SVG element as an individual .svg file
    File.WriteAllText(Path.Combine(OutputDir, $"{i}.svg"), images[i].OuterHTML);
}
// Access and navigate HTML elements in a document using C#
// Learn more: https://docs.aspose.com/html/net/html-navigation/
// Load a document from a file
string documentPath = Path.Combine(DataDir, "html_file.html");
using (HTMLDocument document = new HTMLDocument(documentPath))
{
    // Get the <html> element of the document
    Element element = document.DocumentElement;
    Console.WriteLine(element.TagName); // HTML
    // Get the last child element of the <html> element
    element = element.LastElementChild;
    Console.WriteLine(element.TagName); // BODY
    // Get the first child element of the <body> element
    element = element.FirstElementChild;
    Console.WriteLine(element.TagName); // H1
    Console.WriteLine(element.TextContent); // Header 1
}
// Navigate the HTML DOM using C#
// Learn more: https://docs.aspose.com/html/net/html-navigation/
// Prepare HTML code
string html_code = "<span>Hello,</span> <span>World!</span>";
// Initialize a document from the prepared code
using (HTMLDocument document = new HTMLDocument(html_code, "."))
{
    // Get the reference to the first child (first <span>) of the <body>
    Node element = document.Body.FirstChild;
    Console.WriteLine(element.TextContent); // output: Hello,
    // Get the reference to the whitespace text node between the two <span> elements
    element = element.NextSibling;
    Console.WriteLine(element.TextContent); // output: ' '
    // Get the reference to the second <span> element
    element = element.NextSibling;
    Console.WriteLine(element.TextContent); // output: World!
    // Serialize the whole document to an HTML string
    string html = document.DocumentElement.OuterHTML;
    Console.WriteLine(html); // output: <html><head></head><body><span>Hello,</span> <span>World!</span></body></html>
}
// Implement NodeFilter to skip all elements except images
// Learn more: https://docs.aspose.com/html/net/html-navigation/
// Prepare HTML code
string code = @"
<p>Hello,</p>
<img src='image1.png'>
<img src='image2.png'>
<p>World!</p>";
// Initialize a document based on the prepared code
using (HTMLDocument document = new HTMLDocument(code, "."))
{
    // To start HTML navigation, we need to create an instance of TreeWalker
    // The specified parameters mean that it starts walking from the root of the document, iterating all nodes and using our custom implementation of the filter
    using (ITreeWalker iterator = document.CreateTreeWalker(document, NodeFilter.SHOW_ALL, new OnlyImageFilter()))
    {
        while (iterator.NextNode() != null)
        {
            // Since we are using our own filter, the current node will always be an instance of HTMLImageElement
            // So, we don't need any additional validation here
            HTMLImageElement image = (HTMLImageElement)iterator.CurrentNode;
            Console.WriteLine(image.Src);
            // output: image1.png
            // output: image2.png
            // Serialize the whole document to an HTML string
            string html = document.DocumentElement.OuterHTML;
        }
    }
}
// Download file from URL using C#
// Learn more: https://docs.aspose.com/html/net/save-file-from-url/
// Create a blank document; it is required to access the network operations functionality
using HTMLDocument document = new HTMLDocument();
// Create a URL with the path to the resource you want to download
Url url = new Url("https://docs.aspose.com/html/net/message-handlers/message-handlers.png");
// Create a file request message
using RequestMessage request = new RequestMessage(url);
// Download file from URL
using ResponseMessage response = document.Context.Network.Send(request);
// Check whether the response is successful
if (response.IsSuccess)
{
    // Save file to the local file system
    File.WriteAllBytes(Path.Combine(OutputDir, url.Pathname.Split('/').Last()), response.Content.ReadAsByteArray());
}
// Extract and save a web page with default save options in C#
// Learn more: https://docs.aspose.com/html/net/website-to-html/
// Initialize an HTML document from a URL
using HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Prepare a path to save the downloaded file
string savePath = Path.Combine(OutputDir, "root/result.html");
// Save the HTML document to the specified file
document.Save(savePath);
// Save a website with limited resource depth using C#
// Learn more: https://docs.aspose.com/html/net/website-to-html/
// Load an HTML document from a URL
using HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Create an HTMLSaveOptions object and set the MaxHandlingDepth property
HTMLSaveOptions options = new HTMLSaveOptions
{
    ResourceHandlingOptions =
    {
        MaxHandlingDepth = 1
    }
};
// Prepare the output path for saving the downloaded content
string savePath = Path.Combine(OutputDir, "rootAndAdjacent/result.html");
// Save the document along with adjacent resources only
document.Save(savePath, options);
// Save a website with restricted resource URLs using C#
// Learn more: https://docs.aspose.com/html/net/website-to-html/
// Initialize an HTML document from a URL
using HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Configure HTMLSaveOptions with restricted resource handling
HTMLSaveOptions options = new HTMLSaveOptions
{
    ResourceHandlingOptions =
    {
        MaxHandlingDepth = 1,
        PageUrlRestriction = UrlRestriction.SameHost
    }
};
// Prepare the output path for the saved content
string savePath = Path.Combine(OutputDir, "rootAndManyAdjacent/result.html");
// Save the HTML document and allowed resources to the specified path
document.Save(savePath, options);
// Download website using HTMLSaveOptions in C#
// Learn more: https://docs.aspose.com/html/net/website-to-html/
// Initialize an HTML document from a URL
using HTMLDocument document = new HTMLDocument("https://docs.aspose.com/html/net/message-handlers/");
// Create an HTMLSaveOptions object and set the JavaScript property
HTMLSaveOptions options = new HTMLSaveOptions
{
    ResourceHandlingOptions =
    {
        JavaScript = ResourceHandling.Embed
    }
};
// Prepare a path to save the downloaded file
string savePath = Path.Combine(OutputDir, "rootAndEmbedJs/result.html");
// Save the HTML document to the specified file
document.Save(savePath, options);
// Extract nodes using a CSS selector in C#
// Learn more: https://docs.aspose.com/html/net/html-navigation/
// Prepare HTML code
string code = @"
<div class='happy'>
<div>
<span>Hello,</span>
</div>
</div>
<p class='happy'>
<span>World!</span>
</p>
";
// Initialize a document based on the prepared code
using (HTMLDocument document = new HTMLDocument(code, "."))
{
    // Create a CSS selector that matches all <span> elements nested inside elements with the 'happy' class
    NodeList elements = document.QuerySelectorAll(".happy span");
    // Iterate over the resulting list of elements
    foreach (HTMLElement element in elements)
    {
        Console.WriteLine(element.InnerHTML);
        // output: Hello,
        // output: World!
    }
}
// How to use XPath to select nodes using C#
// Learn more: https://docs.aspose.com/html/net/html-navigation/
// Prepare HTML code
string code = @"
<div class='happy'>
<div>
<span>Hello,</span>
</div>
</div>
<p class='happy'>
<span>World!</span>
</p>
";
// Initialize a document based on the prepared code
using (HTMLDocument document = new HTMLDocument(code, "."))
{
    // Evaluate an XPath expression that selects all <span> elements nested inside elements whose 'class' attribute equals 'happy'
    IXPathResult result = document.Evaluate("//*[@class='happy']//span",
        document,
        null,
        XPathResultType.Any,
        null);
    // Iterate over the resulting nodes
    for (Node node; (node = result.IterateNext()) != null;)
    {
        Console.WriteLine(node.TextContent);
        // output: Hello,
        // output: World!
    }
}