[jq docs](https://jqlang.github.io/jq/manual/v1.7/)

[jq playground](https://jqplay.org/)

## Normal mode

### Exploring HAR files

`export HAR_FILE="/path/to/har/file"`

- Example to dump responses for a given request URI
    - `REQUEST_URI="https://www.facebook.com/api/graphql/"; cat $HAR_FILE | jq -r ".log.entries[] | if .request.url | test(\"$REQUEST_URI\") then .response.content else empty end"`
        - Note the string passed to `jq` is in double quotes `"` so that the shell interpolates `$REQUEST_URI`
        - But jq wants us to use double quotes for `test("foo")`, therefore they must be escaped like `test(\"foo\")`
- Another way to do the same thing in bash using single quotes. Quotes can be tricky.
    - `REQUEST_URI="https://www.facebook.com/api/graphql/"; cat $HAR_FILE | jq -r '.log.entries[] | if .request.url | test("'$REQUEST_URI'") then { uri: .request.url, mimeType: .response.content.mimeType, content: .response.content.text | .[0:200] } else empty end'`
        - Note the string passed to `jq` is in three parts:
            - `'...etc...test("'`
            - `$REQUEST_URI`
            - `'") then...etc...else empty end'`
        - The content is truncated to the first 200 characters to make it more readable
- Dump the full response content, interpreted as JSON
    - `...todo...`

## Streaming mode

...todo

## Case studies

### YouTube

Goal: Extract URLs of all your playlists (under development)

- Go to https://music.youtube.com/library/playlists in browser, scroll *slowly* down to the bottom
- Chrome | DevTools | Network tab | Save all as HAR
- Extract response text for relevant requests
    - `cat $HAR_FILE | jq -r '.log.entries[] | select( .request.url | test("^https://music.youtube.com/youtubei") ) | .response.content.text' > $REQS_FILE`
- Approach 1: Loop over lines of file and extract playlistIDs (status: draft -- this gets playlist titles)
    - `cat $REQS_FILE | while read line; do echo "$line" | jq '.contents.singleColumnBrowseResultsRenderer.tabs[].tabRenderer.content.sectionListRenderer.contents[].musicCarouselShelfRenderer.contents[].musicTwoRowItemRenderer.title.runs[] | { name: .text, id: .navigationEndpoint.browseEndpoint.browseId }'; done > $PLAYLISTS_FILE`
    - Bugs:
        - duplicate values
        - jq errors
        - last 5 entries are irrelevant
        - missing most entries!
- Approach 2: Scan for all relevant playlist IDs, wherever they are in the document
    - `cat playlists.2 | jq -r 'getpath( paths | select(.[-1] == "browseId") ) | select(. | match("^VLPL"))'`
    - Bugs:
        - jq error: parse error: Invalid numeric literal at line 11, column 0
        - missing some entries
- Approach 3: Give up and use Perl regex
    - `cat $REQS_FILE | perl -lne'@ids = m/"browseId":"([^"]+)"/g; print $_ foreach map { s/^VL//; $_ } grep { /^VLPL/ && length($_) > 22 } @ids' | uniq > $PLAYLISTS_FILE`
    - Bugs:
        - This was supposed to be a jq cheat sheet, using Perl is cheating!
        - It **still** misses some playlists from the initial page load.
- Approach 4: Found another source of data in the page
    - `cat $HAR_FILE | jq -r '.log.entries[] | select( .request.url | test("^https://music.youtube.com/library/playlists") ) | .response.content.text' > $SCRIPT_DATA`
    - Decode it
        - `cat $SCRIPT_DATA | perl -plne's/(\\x[[:xdigit:]]{2})/qq{"$1"}/eeg' > $DECODED_SCRIPT_DATA`
        - Maybe a little bit of manual munging :/
    - ...TODO... extract the browseIDs

### AlternativeTo

Goal: Extract a list of alternative software

**Fetch JSON**

- Go to https://alternativeto.net/software/gmail
- F12 | DevTools | Network tab | filter by Fetch/XHR
- Scroll to the bottom and click 'Show more alternatives'. Repeat.
- DevTools | (down arrow) | Save

**Extract data**

- `export REGEX="software/gmail.json"; cat alternativeto.net.har | jq -r ".log.entries[] | if .request.url | test(\"$REGEX\") then .response.content.text else empty end" > page_per_line`
    - this results in 9 lines, one for each 'page' you loaded
    - change the `[]` above to `[0]` to get one page, and pipe the result through `jq` again or use the `fromjson` filter as follows:
- `export REGEX="software/gmail.json"; cat alternativeto.net.har | jq -r ".log.entries[0] | if .request.url | test(\"$REGEX\") then .response.content.text | fromjson else empty end" > one_page_one_line`
- Now browse this JSON data, preferably in an IDE like [vscode](https://code.visualstudio.com/) that can fold up sections easily, to discover the following structure:
    - `export REGEX="software/gmail.json"; cat alternativeto.net.har | jq -r ".log.entries[] | if .request.url | test(\"$REGEX\") then .response.content.text | fromjson | .pageProps.items[] | { name: .name, cost: .licenseCost, model: .licenseModel, desc: .shortDescriptionOrTagLine } else empty end" > software.json`

**Sample output**

```
{
  "name": "Mailfence",
  "cost": "Freemium",
  "model": "Proprietary",
  "desc": "Mailfence is a secure and private email service that fights for online privacy and digital freedom."
}
{
  "name": "Proton Mail",
  "cost": "Freemium",
  "model": "Open Source",
  "desc": "Secure email with absolutely no compromises, brought to you by MIT and CERN scientists."
}
...etc
```

## Tips, tricks and gotchas

### Decode HTML entities

e.g. converts `AT&amp;T Webmail` to `AT&T Webmail`

```
npm install -g he
cat software.json | jq '.name' -r | he --decode
```

### Debugging

jq parses its input as JSON, so for very simple test examples you must quote string inputs twice: shell single quotes around JSON double quotes, i.e. pass `"foo"` **with quotes**

```
echo '"hello"' | jq '.'
```

Regex: `gsub` is global substitution. Note the semicolon `;` that separates the arguments to `gsub()`.

```
echo '"foo\r\nbar"' | jq -r 'gsub("(\r\n.+)"; "")'
```
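
### Passing shell variables with `--arg`

A related quoting tip: the escaped and three-part quoting used in the HAR examples can usually be avoided with jq's `--arg name value` option, which binds a shell value to a jq variable (`$name`) so the whole filter can stay in single quotes. The sketch below is illustrative only: the two-entry HAR file, the `example.com` URLs, and the variable name `uri` are made up for the demonstration and do not come from a real capture.

```shell
# Hypothetical two-entry HAR, just enough structure for the filter below.
HAR_FILE="$(mktemp)"
cat > "$HAR_FILE" <<'EOF'
{"log":{"entries":[
  {"request":{"url":"https://example.com/api/graphql/"},
   "response":{"content":{"mimeType":"application/json","text":"{\"ok\":true}"}}},
  {"request":{"url":"https://example.com/static/app.js"},
   "response":{"content":{"mimeType":"text/javascript","text":"void 0"}}}
]}}
EOF

REQUEST_URI="https://example.com/api/graphql/"

# --arg binds the shell value to the jq variable $uri, so the filter stays
# in single quotes and needs no escaped double quotes.
matches="$(jq -r --arg uri "$REQUEST_URI" \
  '.log.entries[] | select(.request.url | test($uri)) | .response.content.text' \
  "$HAR_FILE")"

echo "$matches"   # prints {"ok":true}
rm -f "$HAR_FILE"
```

`test()` still treats the value as a regular expression, so a `.` in the URI matches any character. Here `select(...)` stands in for the `if ... then ... else empty end` form used earlier; both drop entries that do not match.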