@ScumCoder
Created July 9, 2019 21:41
SSCCE for Gumbo parsing issue
#include <iostream>
#include <string>
#include <cassert>
#include <gumbo.h>
int main()
{
    // Result is the same if there is no doctype, or if some of the nodes are missing
    const char *data = "<!DOCTYPE html>\n<html>\n<head>\n</head>\n<body>\n</body>\n</html>";
    GumboOutput *output = gumbo_parse(data);
    // The following just navigates to the problematic node:
    assert(output->root->type == GUMBO_NODE_ELEMENT);
    const GumboElement &htmlNode = output->root->v.element;
    assert(htmlNode.tag == GUMBO_TAG_HTML);
    std::cout << "Root element is HTML and has " << htmlNode.children.length << " children" << std::endl;
    assert(htmlNode.children.length > 2);
    GumboNode *bodyNodeCont = static_cast<GumboNode*>(htmlNode.children.data[2]);
    assert(bodyNodeCont->type == GUMBO_NODE_ELEMENT);
    const GumboElement &bodyNode = bodyNodeCont->v.element;
    assert(bodyNode.tag == GUMBO_TAG_BODY);
    std::cout << "3rd of them is BODY which has " << bodyNode.children.length << " children" << std::endl;
    assert(bodyNode.children.length > 0);
    GumboNode *whitespaceCont = static_cast<GumboNode*>(bodyNode.children.data[0]);
    assert(whitespaceCont->type == GUMBO_NODE_WHITESPACE);
    const GumboText &whitespace = whitespaceCont->v.text;
    // ...and now the problem itself:
    std::cout << "1st of them is WHITESPACE which looks like this: \""
              << whitespace.text << "\", and original is "
              << whitespace.original_text.length << " bytes long and looks like this: \""
              << std::string(whitespace.original_text.data, whitespace.original_text.length)
              << "\"" << std::endl;
    // Release the parse tree; the node references above must not be used after this
    gumbo_destroy_output(&kGumboDefaultOptions, output);
    return 0;
}
Root element is HTML and has 3 children
3rd of them is BODY which has 1 children
1st of them is WHITESPACE which looks like this: "
", and original is 16 bytes long and looks like this: "
</body>
</html>"
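The output above shows original_text covering 16 bytes ("\n</body>\n</html>") for a whitespace node. As a possible stopgap, and only under the assumption that the caller wants just the leading whitespace run of that span, the length can be clamped by hand. The helper below is hypothetical and not part of the Gumbo API; it is a minimal sketch that works on a raw pointer and length such as original_text.data and original_text.length, and it treats space, tab, LF, FF and CR as whitespace.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, NOT part of Gumbo: returns the number of leading
// whitespace bytes (space, tab, LF, FF, CR) in the span [data, data + length).
// Assumption: the caller only cares about the whitespace run at the start of
// the span and wants to ignore whatever follows it.
std::size_t leadingWhitespaceLength(const char *data, std::size_t length)
{
    // A null pointer is only acceptable for an empty span.
    assert(data != nullptr || length == 0);

    std::size_t n = 0;
    while (n < length) {
        const char c = data[n];
        if (c != ' ' && c != '\t' && c != '\n' && c != '\f' && c != '\r')
            break;  // first non-whitespace byte ends the run
        ++n;
    }
    return n;
}
```

In the program above this could be called as leadingWhitespaceLength(whitespace.original_text.data, whitespace.original_text.length) before building the std::string. It works around the symptom only and does not explain why the span is that long.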