@flaki
Created May 12, 2019 13:31
RustWASM microformats v2 parser implementation - IndieWebCamp Düsseldorf 2019.05.11-12 (devlog)

apparently there is no down-to-the-metal microformats parser (C or Rust or similar) http://microformats.org/wiki/microformats2#Implementations

chewing through the microformats docs/spec http://microformats.org/wiki/microformats2

everything is a classname, prefixes denote the type of content the element has (plaintext, more elements, url attribute) http://microformats.org/wiki/microformats2-prefixes
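
The prefix rule above can be sketched as a small classifier. This is only a minimal sketch, not the parser's actual code: `ClassKind` and `classify` are made-up names, and the `dt-` (date/time) prefix is taken from the prefixes page rather than from the note above.

```rust
/// Kind of microformats2 class name, decided by its prefix.
/// (Hypothetical type, introduced for illustration only.)
#[derive(Debug, PartialEq)]
enum ClassKind {
    Root,      // h-*  : root of a microformat, e.g. h-entry, h-card
    PlainText, // p-*  : plain-text property, e.g. p-name
    Url,       // u-*  : URL property, e.g. u-photo
    DateTime,  // dt-* : date/time property, e.g. dt-published
    Embedded,  // e-*  : embedded markup property, e.g. e-content
    Other,     // any ordinary class name
}

/// Classify a single class name by the part before its first '-'.
fn classify(class: &str) -> ClassKind {
    let (prefix, name) = match class.split_once('-') {
        Some(parts) => parts,
        None => return ClassKind::Other,
    };
    // Require a name starting with a lowercase ASCII letter after the prefix.
    if !name.starts_with(|c: char| c.is_ascii_lowercase()) {
        return ClassKind::Other;
    }
    match prefix {
        "h" => ClassKind::Root,
        "p" => ClassKind::PlainText,
        "u" => ClassKind::Url,
        "dt" => ClassKind::DateTime,
        "e" => ClassKind::Embedded,
        _ => ClassKind::Other,
    }
}
```

Only the first `-` matters for the split, so a multi-part name like `u-in-reply-to` still classifies as a URL property.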

there is a node/browser parser https://github.com/glennjones/microformat-node

it appears to be nicely tested, this could be used to build a compliant rust/wasm parser https://github.com/glennjones/microformat-node/blob/master/test/mf-v2-h-card-justaname.js

the node testsuite seems to be derived from the upstream microformats testsuite https://github.com/microformats/tests

setting up basic rustwasm project https://rustwasm.github.io/

basing the project off of the wasm-bindgen quickstart https://rustwasm.github.io/docs/wasm-bindgen/examples/hello-world.html

got the wasm finally working

first just trying to return the uppercased passed in string

reading up on https://doc.rust-lang.org/std/string/struct.String.html

I vaguely remembered the difference between str and String but https://mgattozzi.github.io/2016/05/26/how-do-i-str-string.html was a nice refresher
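
For reference, the uppercasing experiment boils down to something like the following. It is a sketch with made-up function names (`shout`, `shout_owned`); the log does not show the real signature, and the wasm build would presumably add the `#[wasm_bindgen]` attribute on top.

```rust
/// Borrow the input as `&str`, hand back a freshly allocated `String`.
/// `&str` is a view into existing text; `String` owns its buffer.
pub fn shout(input: &str) -> String {
    input.to_uppercase()
}

/// An owned `String` can be lent out as `&str` with a plain `&`.
pub fn shout_owned(input: String) -> String {
    shout(&input)
}
```

Taking `&str` as the parameter keeps the function usable with both string literals and owned strings.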

okay so string output in the browser works

I might be able to use serde to construct the final JSON output https://github.com/serde-rs/json#constructing-json-values

got this working, the Serde json! macro is pretty powerful, takes embedded variables and function calls too

I need to JSON.parse() in the JS side but it seems to be working

rustwasm seems to have a new feature serde-serialize that can automatically take care of directly serializing (and de-serializing) Serde's JSON rustwasm/wasm-bindgen#171

yuss, got the JSON JavaScript Object out straight from the JS binding boilerplate's export!

working through making the binaries smaller via https://rustwasm.github.io/docs/wasm-bindgen/examples/add.html

using --target web to get rid of the webpack dependency https://rustwasm.github.io/docs/wasm-bindgen/reference/deployment.html

lto = true & opt-level = "s" got the code size to ~50k https://rustwasm.github.io/docs/book/reference/code-size.html

installed binaryen, wasm-opt shaved off another ~10k

reading up on using https://crates.io/crates/wee_alloc

tried using wee_alloc to shrink the binary further, it actually increased the size o.O - thinking that this is due to wee_alloc being added on top of the system allocator, and I would need to go no-std? no idea.

started reading up on no_std for serde https://serde.rs/no-std.html

so serde has a no_std mode, which actually avoids any allocation whatsoever? maybe this is a better approach for us (but I don't think I'll be able to find an alloc-free html parser)

but what is no_std, even? https://serde.rs/no-std.html

added default-features = false to serde, no dice, probably have to #![no_std] the library?

probably need to wasm-snip manually? rustwasm/team#19 (comment)

well, as expected my use of ::String prevents me from #![no_std], will have to come back to this when I know more

so wasm-pack has --dev / --release modes

managed to shave another kbyte off using wasm-snip by trimming rust formatting/panicking infra manually https://rustwasm.github.io/docs/book/reference/code-size.html#use-the-wasm-snip-tool

39k wasm without any kind of parsing, feels still way too much

in the --dev output (300k), twiggy names the top code size offenders as the function names subsection (20+%), two large data sections - 20k & 6k - (??), lots of alloc/dlmalloc calls and many calls to ryu (??)

apparently ryu is a library for stringifying floats https://crates.io/crates/ryu

will have to look into the data sections more later (and check if they also exist in the optimized output)

there's hundreds of 0% unnamed functions, wtf (??)

yep, the two top offenders in the wasm-opt-ed release binary (>10% each) are also large data sections

trying to look at wasm2wat source to figure out what's going on there (data[28] & data[3])

reading up on https://developer.mozilla.org/en-US/docs/WebAssembly/Understanding_the_text_format - not very helpful about .data sections

both 3 & 28 look like binary strings, tabling this for later revisit

moving on to parsing

scraper seems like a good way to go about this as a first try https://docs.rs/scraper/0.10.0/scraper/

trying to refactor to parse out the text content of an .h-entry from a baked in test html fragment

element.text() seems to be the right thing, it gives back scraper::element_ref::Text, which is an iterator over the text chunks

diving into the "how do I .join() strings in rust" https://users.rust-lang.org/t/connecting-joining-string-slices-without-a-temp-vec/1811

itertools::join looks like exactly what I need, though I will need to add it as a dependency https://docs.rs/itertools/0.7.5/itertools/fn.join.html
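
As a side note, plain std can probably do the same job without the extra dependency. The sketch below swaps itertools for std only, and assumes the text iterator yields `&str` items; `concat_text` and `join_text` are made-up helper names.

```rust
/// Concatenate text chunks with no separator and no temporary Vec:
/// `String` implements `FromIterator<&str>`, so `collect()` is enough.
fn concat_text<'a, I>(chunks: I) -> String
where
    I: IntoIterator<Item = &'a str>,
{
    chunks.into_iter().collect()
}

/// Join text chunks with a separator, pushing into one output buffer.
fn join_text<'a, I>(chunks: I, sep: &str) -> String
where
    I: IntoIterator<Item = &'a str>,
{
    let mut out = String::new();
    for (i, chunk) in chunks.into_iter().enumerate() {
        if i > 0 {
            out.push_str(sep); // separator goes between chunks only
        }
        out.push_str(chunk);
    }
    out
}
```

Whether dropping itertools makes a visible dent in the wasm size is untested here.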

that worked! holy shit, the scraper build is 500k wasm o.O

okay, next stop: be able to parse the h-entry in https://avocado.lol/dus/hello.html. I need to add: passing in source strings, support for urls (u-in-reply-to, u-author, u-photo), support for subtrees (root elements like h-card, element subtrees like e-content), p-name is plaintext-parsed

first, let's try to get a working rust binary that can fetch the page

according to SO reqwest is the way to go https://crates.io/crates/reqwest

but first: git commit!

creating a .gitignore, apparently Cargo.lock should be excluded in libraries https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lock.html

first commit is in! now on to create the binary app.

discovered cargo init. nice!

oh so Rust constants need their types declared explicitly, apparently https://doc.rust-lang.org/rust-by-example/custom_types/constants.html
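
A tiny illustration of the difference; the constant names and the `describe_target` helper are made up for this sketch, only the URL comes from the log.

```rust
// A `const` must spell out its type; leaving off `: &str` is a compile error.
const TEST_URL: &str = "https://avocado.lol/dus/hello.html";
const ROOT_CLASS: &str = "h-entry";

fn describe_target() -> String {
    // A `let` binding, unlike a constant, can leave the type to inference.
    let summary = format!("parse .{} from {}", ROOT_CLASS, TEST_URL);
    summary
}
```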

diving into cargo, for how to specify (local) dependencies https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html

= { path = "<path>" } in Cargo.toml dependencies seems to compile, but then the compiler complains about the missing dependency anyway

ugh, forgot the extern crate microformats2...but it still doesn't work, same error...

ah turns out the culprit is the cdylib https://stackoverflow.com/a/49762980

ended up temporarily switching cdylib to lib, will need to figure this out later (??) https://doc.rust-lang.org/reference/linkage.html

it started working with lib type, but now I need to re-enable passing in the src parameter

panicking on 'function not implemented on non-wasm32 targets', guess I'll have to decouple the rustwasm & rust native impls?

rrrriiight, I have removed the wasm-bindgen prelude and now it's failing on "JsValue". OBVIOUSLY. facepalm

switched it to String and serde_json::to_string() and now it works!
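
One possible way to keep both builds alive later is conditional compilation: a target-independent core that returns a `String`, plus a thin wrapper that only exists on wasm32. This is a sketch of the idea, not what the code does today; `parse_to_json`, the `wasm` module and the placeholder body are all assumptions, and the wrapper needs the wasm-bindgen crate when actually built for wasm32.

```rust
/// Target-independent core: plain `&str` in, JSON text out.
/// Placeholder body; the real function would run the parser here.
pub fn parse_to_json(src: &str) -> String {
    format!("{{\"length\":{}}}", src.len())
}

/// Compiled only for wasm32, so native builds never touch wasm-bindgen.
#[cfg(target_arch = "wasm32")]
mod wasm {
    use wasm_bindgen::prelude::*;

    /// Thin JS-facing wrapper around the shared core.
    #[wasm_bindgen]
    pub fn parse(src: &str) -> JsValue {
        JsValue::from_str(&super::parse_to_json(src))
    }
}
```

On a native target the whole `wasm` module is stripped before name resolution, so the binary app only ever sees `parse_to_json`.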

obviously the baked in selector breaks it so I'm gonna modify the library for .p-name

this worked, got back the name, now I have to rewrite the baked in json form and parser to output and query the things I want to have (as a first step)

guess shit is weird, all parser impls produce byte-wise different output, but Tantek says they should be equivalent as JSON (e.g. different key order is possible, but all the data should be there)
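
So comparing against other parsers probably has to happen on parsed values, not on bytes. A std-only sketch of what "equivalent as JSON" could mean, with a made-up `Json` type standing in for whatever value type the parser ends up using:

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value (hypothetical, strings only).
#[derive(Debug, PartialEq)]
enum Json {
    Str(String),
    Array(Vec<Json>),
    // A BTreeMap keeps keys sorted, so insertion order cannot affect equality.
    Object(BTreeMap<String, Json>),
}

/// Build an object from key/value pairs given in any order.
fn object(pairs: Vec<(&str, Json)>) -> Json {
    Json::Object(
        pairs
            .into_iter()
            .map(|(key, value)| (key.to_string(), value))
            .collect(),
    )
}
```

Two objects built with the same keys and values in a different order compare equal, while array order still matters.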

successfully copy-pasted one of the output JSONs into Serde's generator, for now, I will keep the output structure but try to get the values from the actual passed-in source

trying to select .h-entry was panicking -- as it turns out, I forgot that now I'm parsing a complete document (not just a fragment anymore), and fragment ignored the <body> tag (which had the .h-entry class on)
