apparently there is no down-to-the-metal microformats parser (C or Rust or similar) http://microformats.org/wiki/microformats2#Implementations
chewing through the microformats docs/spec http://microformats.org/wiki/microformats2
everything is a classname, prefixes denote the type of content the element has (plaintext, more elements, url attribute) http://microformats.org/wiki/microformats2-prefixes
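A minimal sketch of the prefix idea, using only the standard library. The names `ClassKind` and `classify` are made up for illustration, and the check is simplified: it only verifies that something follows the prefix, which is looser than the real parsing rules.

```rust
/// What kind of value a microformats2 class name announces.
#[derive(Debug, PartialEq)]
enum ClassKind {
    Root,           // h-*  : a root item such as h-card or h-entry
    PlainText,      // p-*  : plain-text property
    Url,            // u-*  : URL property, usually read from an attribute
    DateTime,       // dt-* : date/time property
    EmbeddedMarkup, // e-*  : property that keeps its element subtree
    Other,          // anything else is an ordinary CSS class
}

/// Classify a single class name by the part before the first '-'.
/// Simplification (assumption): any non-empty remainder is accepted.
fn classify(class_name: &str) -> ClassKind {
    match class_name.split_once('-') {
        Some((_, "")) | None => ClassKind::Other,
        Some(("h", _)) => ClassKind::Root,
        Some(("p", _)) => ClassKind::PlainText,
        Some(("u", _)) => ClassKind::Url,
        Some(("dt", _)) => ClassKind::DateTime,
        Some(("e", _)) => ClassKind::EmbeddedMarkup,
        Some(_) => ClassKind::Other,
    }
}
```

With this, `classify("u-in-reply-to")` yields `ClassKind::Url` and a plain class like `header` falls through to `ClassKind::Other`.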
there is a node/browser parser https://github.com/glennjones/microformat-node
it appears to be nicely tested, this could be used to build a compliant rust/wasm parser https://github.com/glennjones/microformat-node/blob/master/test/mf-v2-h-card-justaname.js
the node testsuite seems to be derived from the upstream microformats testsuite https://github.com/microformats/tests
setting up basic rustwasm project https://rustwasm.github.io/
basing the project off of the wasm-bindgen quickstart https://rustwasm.github.io/docs/wasm-bindgen/examples/hello-world.html
got the wasm finally working
first just trying to return the uppercased passed in string
reading up on https://doc.rust-lang.org/std/string/struct.String.html
I vaguely remembered the difference between str and String, but https://mgattozzi.github.io/2016/05/26/how-do-i-str-string.html was a nice refresher
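The str/String split in one function, as a rough sketch: borrow the input as `&str`, hand back an owned `String`. The name `shout` is hypothetical; in the wasm build the function would presumably also carry the `#[wasm_bindgen]` attribute like the quickstart's export does.

```rust
/// Borrow the input as a string slice, return a freshly allocated String.
fn shout(input: &str) -> String {
    // to_uppercase() allocates a new String; the borrowed input is untouched.
    input.to_uppercase()
}
```

A `&String` coerces to `&str` automatically, so callers can pass either kind.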
okay so string output in the browser works
I might be able to use serde to construct the final JSON output https://github.com/serde-rs/json#constructing-json-values
got this working, the Serde json! macro is pretty powerful, takes embedded variables and function calls too
I need to JSON.parse() on the JS side but it seems to be working
rustwasm seems to have a new feature serde-serialize that can automatically take care of directly serializing (and de-serializing) Serde's JSON rustwasm/wasm-bindgen#171
yuss, got the JSON JavaScript Object out straight from the JS binding boilerplate's export!
working through making the binaries smaller via https://rustwasm.github.io/docs/wasm-bindgen/examples/add.html
using --target web to get rid of the webpack dependency https://rustwasm.github.io/docs/wasm-bindgen/reference/deployment.html
lto = true & opt-level = "s" got the code size to ~50k https://rustwasm.github.io/docs/book/reference/code-size.html
installed binaryen, wasm-opt shaved off another ~10k
reading up on using https://crates.io/crates/wee_alloc
tried using wee_alloc to shrink the binary further, it actually increased the size o.O - thinking that this is due to wee_alloc being added on top of the system allocator, and I would need to go no-std? no idea.
started reading up on no_std for serde https://serde.rs/no-std.html
so serde has a no_std mode, which actually avoids any allocation whatsoever? maybe this is a better approach for us (but I don't think I'll be able to find an alloc-free html parser)
but what is no_std, even? https://serde.rs/no-std.html
added default-features = false to serde, no dice, probably have to #![no_std] the library?
probably need to wasm-snip manually? rustwasm/team#19 (comment)
well, as expected my use of ::String prevents me from #![no_std], will have to come back to this when I know more
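For the later revisit, a hedged note: `String` is defined in the `alloc` crate and only re-exported by `std`, so a `#![no_std]` library can still use it if it links `alloc` (a global allocator then has to be supplied by someone). The sketch below leaves the crate-level attribute as a comment so it also builds as an ordinary std program; `greeting` is a made-up name.

```rust
// In a real no_std library this attribute would sit at the crate root:
// #![no_std]
extern crate alloc;

use alloc::string::String;

/// Same type as std::string::String, reached through `alloc` instead of `std`.
pub fn greeting(name: &str) -> String {
    let mut out = String::from("hello, ");
    out.push_str(name);
    out
}
```

Whether dependencies such as the HTML parser also build without `std` is a separate question this does not answer.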
so wasm-pack has --dev / --release modes
managed to shave another kbyte off using wasm-snip by trimming rust formatting/panicking infra manually https://rustwasm.github.io/docs/book/reference/code-size.html#use-the-wasm-snip-tool
39k wasm without any kind of parsing, still feels like way too much
in the --dev output, twiggy names the top offenders to code size (300k) as the function names subsection (20+%), two large - 20k & 6k - data sections (??), lots of alloc/dlmalloc calls and many calls to ryu (??)
apparently ryu is a library for stringifying floats https://crates.io/crates/ryu
will have to look into the data sections more later (and check if they also exist in the optimized output)
there's hundreds of 0% unnamed functions, wtf (??)
yep, 2x >10% of the wasm-opt-ed release binary (top offenders) are also large data sections
trying to look at wasm2wat source to figure out what's going on there (data[28] & data[3])
reading up on https://developer.mozilla.org/en-US/docs/WebAssembly/Understanding_the_text_format - not very helpful about .data sections
both 3 & 28 look like binary strings, tabling this for later revisit
moving on to parsing
scraper seems like a good way to go about this as a first try https://docs.rs/scraper/0.10.0/scraper/
trying to refactor to parse out the text content of an .h-entry from a baked in test html fragment
element.text() seems to be the right thing, it gives back ::element_ref::Text which is a collection
diving into the "how do I .join() strings in rust" https://users.rust-lang.org/t/connecting-joining-string-slices-without-a-temp-vec/1811
itertools::join looks like exactly what I need, though I will need to add it as a dependency https://docs.rs/itertools/0.7.5/itertools/fn.join.html
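For comparison, a standard-library-only sketch of the same join without a temporary Vec. This is an alternative to itertools::join, not what the log ended up using; `join_fragments` is a hypothetical helper name.

```rust
/// Join string slices with a separator, using only the standard library.
fn join_fragments<'a, I>(fragments: I, separator: &str) -> String
where
    I: IntoIterator<Item = &'a str>,
{
    let mut joined = String::new();
    for (index, fragment) in fragments.into_iter().enumerate() {
        // Put the separator between items, never before the first one.
        if index > 0 {
            joined.push_str(separator);
        }
        joined.push_str(fragment);
    }
    joined
}
```

Any iterator that yields `&str` items can be passed in directly, and it saves a dependency if binary size matters.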
that worked! holy shit, the scraper build is 500k wasm o.O
okay, next stop: be able to parse the h-entry in https://avocado.lol/dus/hello.html. I need to add: passing in source strings, support for urls (u-in-reply-to, u-author, u-photo), support for subtrees (root elements like h-card, element subtrees like e-content), p-name is plaintext-parsed
first, let's try to get a working rust binary that can fetch the page
according to SO reqwest is the way to go https://crates.io/crates/reqwest
but first: git commit!
creating a .gitignore, apparently Cargo.lock should be excluded in libraries https://doc.rust-lang.org/cargo/guide/cargo-toml-vs-cargo-lock.html
first commit is in! now on to creating the binary app.
discovered cargo init. nice!
oh so Rust constants need their types declared explicitly, apparently https://doc.rust-lang.org/rust-by-example/custom_types/constants.html
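A small sketch of the rule: `const` items always need an explicit type, while `let` bindings can infer theirs. The names `TEST_PAGE_URL` and `describe_target` are invented for illustration; only the URL comes from these notes.

```rust
// A const item must spell out its type; the compiler will not infer it.
const TEST_PAGE_URL: &str = "https://avocado.lol/dus/hello.html";

fn describe_target() -> String {
    // A local binding, by contrast, needs no annotation.
    let label = "fetching";
    format!("{} {}", label, TEST_PAGE_URL)
}
```

Leaving the `: &str` off the `const` line is a compile error rather than a warning.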
diving into cargo, for how to specify (local) dependencies https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html
= { path = "<path>" } in Cargo.toml dependencies seems to compile, but then the compiler complains about the missing dependency anyway
ugh, forgot the extern crate microformats2 ...but it still doesn't work, same error...
ah turns out the culprit is the cdylib https://stackoverflow.com/a/49762980
ended up temporarily switching cdylib to lib, will need to figure this out later (??) https://doc.rust-lang.org/reference/linkage.html
it started working with lib type, but now I need to re-enable passing in the src parameter
panicking on 'function not implemented on non-wasm32 targets', guess I'll have to decouple the rustwasm & rust native impls?
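One possible way to decouple the two, sketched under assumptions: keep a target-independent core that only uses plain Rust types, and put the wasm-bindgen wrapper in a module that is compiled only for wasm32. The function names and the placeholder body are hypothetical; the wrapper assumes wasm-bindgen is a dependency of the wasm build.

```rust
/// Target-independent core: plain Rust types in, plain Rust types out.
/// The body is a placeholder, not a real parser.
pub fn parse_to_json(src: &str) -> String {
    format!("{{\"length\":{}}}", src.len())
}

/// Thin wrapper that only exists when compiling for wasm32, so native
/// builds never touch JsValue or any other wasm-bindgen type.
#[cfg(target_arch = "wasm32")]
mod wasm {
    use wasm_bindgen::prelude::*;

    #[wasm_bindgen]
    pub fn parse(src: &str) -> JsValue {
        JsValue::from_str(&super::parse_to_json(src))
    }
}
```

The native binary would then call `parse_to_json` directly, while the browser build only sees the exported `parse`.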
rrrriiight, I have removed the wasm-bindgen prelude and now it's failing on "JsValue". OBVIOUSLY. facepalm
switched it to String and serde_json::to_string() and now it works!
obviously the baked in selector breaks it so I'm gonna modify the library for .p-name
this worked, got back the name, now I have to rewrite the baked in json form and parser to output and query the things I want to have (as a first step)
guess shit is weird, all parser impls output different (byte-wise) output, but Tantek says they should be equivalent as JSONs (e.g. different key order is possible, but all should be there)
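A sketch of what "equivalent as JSON" could mean in a test: compare parsed structures instead of bytes. The `Json` enum here is a tiny stand-in written for illustration (not Serde's type); objects live in a `BTreeMap`, so the order in which keys were written does not affect equality, while array order still does.

```rust
use std::collections::BTreeMap;

/// Tiny stand-in for a JSON value, just enough for the comparison.
#[derive(Debug, PartialEq)]
enum Json {
    Str(String),
    Array(Vec<Json>),
    Object(BTreeMap<String, Json>),
}

/// Build a string value.
fn text(s: &str) -> Json {
    Json::Str(s.to_string())
}

/// Build an object from key/value pairs given in any order.
fn object(pairs: Vec<(&str, Json)>) -> Json {
    Json::Object(
        pairs
            .into_iter()
            .map(|(key, value)| (key.to_string(), value))
            .collect(),
    )
}
```

Two outputs that list `type` and `properties` in different orders then compare equal, which is the property the upstream test suite comparison needs.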
successfully copy-pasted one of the output JSONs into Serde's generator, for now, I will keep the output structure but try to get the values from the actual passed-in source
trying to select .h-entry was panicking -- as it turns out, I forgot that now I'm parsing a complete document (not just a fragment anymore), and fragment ignored the <body> tag (which had the .h-entry class on)