@calebhailey
Last active April 17, 2024 20:09
SOLVED: Mysterious HTMLRewriter Issue – async bug or PEBKAC?

Mysterious HTMLRewriter Issue – Async Bug or PEBKAC?

UPDATE: this was solved, thanks to the Cloudflare Community Discord. See comments for the solution.

I've been wanting to kick the tires with Cloudflare's HTMLRewriter to see if it could be used as an HTML parser. As a simple example, can Cloudflare Workers + HTMLRewriter be used to build an API to parse OpenGraph metadata and return the properties as a JSON document? Based on a cursory review of the documentation, it appears as if this should be quite simple.

However, I have observed what looks like a race condition: HTMLRewriter reports fewer matching elements than are actually present in the document (sometimes none at all) unless I simulate a 1 millisecond "sleep":

await new Promise( function(resolve) { setTimeout(resolve, 1); })

This can be reproduced with Miniflare (via wrangler dev) using the following example code (index.js). Run the worker as-is to observe the race condition, then uncomment line 47 (the simulated "sleep") to see it work as expected.

I'm either doing something wrong (very likely!), or there's a bug in HTMLRewriter...

index.js
let worker = {};

// HTMLRewriter Element Handler to get <meta> element key:value pair(s)
// Reference: https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/#element-handlers
class MetaElementHandler {

    // public class fields
    properties;

    // constructor
    constructor(metadata={}) {
        this.properties = new Object(metadata);
    };

    // element handler method
    element(e) {
        console.debug("<meta property=\"%s\" content=\"%s\">", e.getAttribute("property"), e.getAttribute("content"));
        let key = e.getAttribute("property").replace("og:", "");
        let value = e.getAttribute("content");
        this.properties[key] = value;
        return;
    };

};

// Fetch Event Handler
// Reference: https://developers.cloudflare.com/workers/runtime-apis/fetch-event/
worker.fetch = async function(request, env, context) {

    // Initialize the request & fetch the target
    let params = Object.fromEntries(new URL(request.url).searchParams);
    let url = new URL(params.target);
    let target = new Request(url, { 
        method: "GET", 
        redirect: "follow", 
        headers: {
            "User-Agent": env.USER_AGENT || "HTMLRewriter/1.0",
        }
    });
    let source = await fetch(target);

    // Use HTMLRewriter to extract target metadata
    // Reference: https://developers.cloudflare.com/workers/runtime-apis/html-rewriter/
    console.debug("Executing HTMLRewriter...");
    var metadata = new MetaElementHandler({ url: params.target, hostname: url.hostname });
    await new HTMLRewriter().on('head meta[property^="og:"]', metadata).transform(source);
    // await new Promise( function(resolve) { setTimeout(resolve, 1); }); // Sleep for 1ms... because Y U NO async? 
    console.debug("Executed HTMLRewriter...");
    console.debug(metadata.properties);

    // Return the response
    return new Response(JSON.stringify(metadata.properties).concat("\n"), {
        status: 200,
        headers: {
            "Content-Type": "application/json",
        }
    });
};

export default worker;

Expected Output

The expected output for GET /?target=https://theverge.com should include debug output from my ElementHandler element(e) method between the Executing HTMLRewriter... and Executed HTMLRewriter... debug output.

Executing HTMLRewriter...
<meta property="og:description" content="The Verge is about technology and how it makes us feel...">
# [ more meta element debug output... ]
Executed HTMLRewriter...
{ 
    url: "https://theverge.com", 
    hostname: "theverge.com", 
    description: "The Verge is about technology and how it makes us feel...", 
    type: "website", 
    image: "https://cdn.vox-cdn.com/.../the_verge_social_share.png",
    site_name: "The Verge"
}

Actual Output

The actual output for GET /?target=https://theverge.com reveals that HTMLRewriter is finding elements matching my selector, but the debug output from my ElementHandler element(e) method comes after the Executed HTMLRewriter... debug output, too late to be included in the JSON response.

Executing HTMLRewriter...
Executed HTMLRewriter...
{ 
    url: "https://theverge.com", 
    hostname: "theverge.com" 
}
<meta property="og:description" content="The Verge is about technology and how it makes us feel...">

NOTE: in some cases there will be no debug output from my ElementHandler element(e) method; this was what helped me realize there was a race condition.

@calebhailey
Author

Problem solved, thanks to @kian in the #workers-discussions channel of the Cloudflare Community Discord.

HTMLRewriter doesn’t actually run until you return the body, so I’m guessing it’s that
transform(...) returns a Response, you can await the text() method on that to make HTMLRewriter run
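
To illustrate the quoted explanation, here is a rough analogy built only on standard web streams. It is not HTMLRewriter's actual implementation: `lazyTransform` and `demo` are hypothetical names, and the `TransformStream` callback merely plays the role of an element handler. The sketch assumes a runtime with the `Response` and `TransformStream` globals (Workers, or Node 18+). The point is that `await` on a value that is not a promise does not drain the body, whereas awaiting `.text()` does.

```javascript
// Stand-in for HTMLRewriter#transform (an assumption for illustration only):
// wrap the body in a TransformStream whose callback records each chunk.
function lazyTransform(response, seen) {
  const tap = new TransformStream({
    transform(chunk, controller) {
      seen.push(chunk.byteLength); // plays the role of an element handler
      controller.enqueue(chunk);   // pass the chunk through unchanged
    },
  });
  // Returned synchronously: this is a Response, not a Promise.
  return new Response(response.body.pipeThrough(tap));
}

async function demo() {
  const seen = [];

  // Awaiting a non-promise just hands the same value back; nothing has
  // read the body yet, so the transform callback has not run.
  const transformed = await lazyTransform(new Response("<meta>"), seen);
  const before = seen.length;

  // Reading the body to completion pulls every chunk through the transform.
  await transformed.text();
  const after = seen.length;

  return { before, after };
}
```

Calling `demo()` should report no chunks seen before the body is read and at least one afterwards, which matches the ordering in the "Actual Output" section above.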

In the end, all I needed to do was append a call to .text() and everything started working as expected!

await new HTMLRewriter().on("head meta[property^='og']", metadata).transform(source).text();
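
For anyone who wants the fix in a reusable form, here is a minimal sketch of how the extraction step could be wrapped up. The names `OpenGraphCollector` and `extractOpenGraph` are hypothetical, and the `Rewriter` parameter is my own addition so the function can be exercised outside Workers; by default it uses the `HTMLRewriter` global that the Workers runtime provides. Note that `.text()` buffers the whole transformed document in memory, which seemed acceptable for typical HTML pages but is an assumption worth checking for very large ones.

```javascript
// Collects Open Graph <meta> properties, mirroring MetaElementHandler in index.js.
class OpenGraphCollector {
  constructor(initial = {}) {
    this.properties = { ...initial };
  }

  // Called once per element matching the selector.
  element(e) {
    const property = e.getAttribute("property");
    if (property === null) return; // defensive; the selector should already guarantee it
    this.properties[property.replace(/^og:/, "")] = e.getAttribute("content");
  }
}

// Hypothetical helper. `Rewriter` defaults to the HTMLRewriter global from the
// Workers runtime; it is injectable only to make the function easy to test.
async function extractOpenGraph(source, initial = {}, Rewriter = globalThis.HTMLRewriter) {
  const collector = new OpenGraphCollector(initial);
  const transformed = new Rewriter()
    .on('head meta[property^="og:"]', collector)
    .transform(source);

  // transform() hands back a Response right away; the handlers only fire as
  // that body is consumed, so read it to completion before using the results.
  await transformed.text();
  return collector.properties;
}
```

In the fetch handler above, this would replace the `new MetaElementHandler(...)` and `transform(source)` lines with something like `const properties = await extractOpenGraph(source, { url: params.target, hostname: url.hostname });`.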

🙌
