@kristopolous
Last active July 24, 2023 04:12
hn job query search
// Usage:
// Copy and paste all of this into a debug console window of the "Who is Hiring?" comment thread
// then use as follows:
//
// query(term | [term, term, ...], term | [term, term, ...], ...)
//
// When arguments are in an array, that means "or"; when they are separate, that means "and".
//
// Term is of the format:
// ((-)text/RegExp) ( '-' means negation )
//
// A first argument of '+' signifies an additional pass on the filtered data as opposed to
// resetting everything.
//
// Example: Let's look for jobs in california that involve rust or python and not crypto:
//
// > query('ca', '-crypto', ['rust', 'python']);
// {filtered: '98.57%', query: 'ca AND NOT crypto AND (rust OR python)'}
//
// Then you see, "oh right, I don't care about blockchain either":
//
// > query('+', '-blockchain');
// {filtered: '98.57%', query: 'ca AND NOT crypto AND (rust OR python) AND NOT blockchain'}
//
// Another example:
// > query(['ca', 'sf', 'san jose', 'mountan view'])
// {filtered: '90.61%', query: '(ca OR sf OR san jose OR mountan view)'}
//
// COVID killed Silicon Valley. Quod Erat Demonstrandum!
//
// Changelog for 2022-08-02
//
// ADDED
//
// * Negation via '-'
//
// * Multi-pass querying via first argument being '+'
//
// * Debugging query string added in the response
//
// CHANGED
//
// * "or" and "and" works the opposite of how it did previously.
// This form seems to be more useful.
//
// * Whole word matching is default
//
// * Terms such as "c++" are properly escaped
//
// UPDATED
//
// * Rewrote as an absurd implementation.
// I had a fun afternoon writing this.
//
function query(...queryList) {
  // HN is done with very unsemantic classes.
  let jobList = [...document.querySelectorAll('.c5a,.cae,.c00,.c9c,.cdd,.c73,.c88')],
    // Traverses up the DOM tree trying to find a match of a specific class
    upto = (node, klass) => node.classList.contains(klass) ? node : upto(node.parentNode, klass),
    display = (node, what) => upto(node, 'athing').style.display = what,
    hide = node => { display(node, 'none'); node.show = false; },
    show = node => { display(node, 'block'); node.show = true; },
    // Use RegExp as is. Otherwise make it a case insensitive RegExp.
    // Returns [is negated, RegExp, original term].
    destring = what => [
      what[0] === '-',
      what.test ? what : new RegExp([
        '\\b',
        what.toString()
          .replace(/^-/,'')
          .replace(/[.*+?^${}()|[\]\\]/g, '\\$&'),
        '\\b'
      ].join(''), 'i'),
      what
    ];

  // This is our grand reset
  if (queryList[0] !== '+') {
    jobList.forEach(show);
    // Have fun with that. (It sets query.fn to [] and query.hidden to 0.)
    query.hidden = +!( query.fn = [] );
  } else {
    queryList.shift();
  }

  // The AND is an artifact of the design. It's just iteratively mapped subsets
  query.fn = query.fn.concat(queryList.map(arg => {
    // Make it an array if it isn't one and pass it through our destring
    let orList = Array.of(arg).flat().map(destring);

    // If we're showing the job, then go through the list of terms.
    // If all of them do not match, hide it, then return the length.
    query.hidden += jobList.filter(node => node.show
      && orList.every(([neg, r]) => neg ^ !(node.innerHTML.search(r) + 1))
    ).map(hide).length;

    // You're on your own here - this is just the construction of
    // the debug string. There's far more reasonable ways to do this
    // But what fun would that be?!
    // String(r) is needed because a RegExp term has no slice() method.
    return (
      ' ('[+!!(orList.length - 1)] +
      orList.map(([neg, ig, r]) => ['', 'NOT '][+neg] + String(r).slice(+neg)).join(' OR ') +
      ' )'[+!!(orList.length - 1)]
    ).trim();
  }));

  return {
    filtered: (100 * query.hidden / jobList.length).toFixed(2) + '%',
    query: query.fn.join(' AND ')
  };
}
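
The usage notes above describe three rules: separate arguments are ANDed, array arguments are ORed, and a leading '-' negates a term. Below is a minimal sketch of those rules applied to plain strings, so they can be tried outside the browser. `termToTest` and `matches` are hypothetical names that are not part of the gist, and the sketch deliberately ignores the DOM handling and the '+' multi-pass feature.

```javascript
// Hypothetical helper: turn a term into [negated, RegExp].
// Strings lose a leading '-', are regex-escaped and wrapped in \b...\b,
// mirroring how the gist builds its whole-word, case-insensitive patterns.
function termToTest(term) {
  if (term instanceof RegExp) return [false, term];
  const negated = term[0] === '-';
  const escaped = term.replace(/^-/, '').replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return [negated, new RegExp('\\b' + escaped + '\\b', 'i')];
}

// Hypothetical helper: every top-level argument must pass (AND);
// an array argument passes if any one of its terms passes (OR).
function matches(text, ...args) {
  return args.every(arg =>
    [arg].flat().some(term => {
      const [negated, re] = termToTest(term);
      return re.test(text) !== negated;
    })
  );
}

const post = 'Acme | San Francisco, CA | Rust and Python engineers | ONSITE';
console.log(matches(post, 'ca', '-crypto', ['rust', 'go'])); // true
console.log(matches(post, 'ca', '-python'));                 // false
```

One thing this makes easy to check: a term that ends in a non-word character, such as 'c++', is escaped correctly, but the trailing `\b` then needs a word character right after it, so `matches('C++ developer', 'c++')` is false. Passing a RegExp such as `/c\+\+/i` sidesteps the word-boundary wrapping.
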
@jeff303

jeff303 commented Aug 3, 2016

It seems like it might be nice to have an option to only search the first line for certain terms (ex: REMOTE), since usually occurrences in other lines or subsequent paragraphs aren't actually what you're after. Does anyone else agree? If so, I could try to take a crack at that.

@kristopolous
Author

kristopolous commented Aug 16, 2016

@jeff303 "line" is an amorphous term. I'm not trying to be a jerk, but the computer is and always is - that jerk. Anyway, without a more solid definition (as "line" depends on font size, monitor size etc), I've got no hope here and I want to make it useful for you.

I could do something like "Of the first 50 space-separated tokens (aka, words)" or "the first paragraph that contains more than 40 characters" or something.
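
Both heuristics can be sketched on plain text in a few lines. This is only an illustration: `firstTokens` and `firstLongParagraph` are hypothetical names, and the second one assumes paragraphs are separated by blank lines.

```javascript
// Hypothetical helper: the first `n` whitespace-separated tokens of a post.
function firstTokens(text, n = 50) {
  return text.trim().split(/\s+/).slice(0, n).join(' ');
}

// Hypothetical helper: the first paragraph longer than `minChars` characters.
// Assumes paragraphs are separated by blank lines in the plain text.
function firstLongParagraph(text, minChars = 40) {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim());
  return paragraphs.find(p => p.length > minChars) || '';
}

const post = 'Acme | London | REMOTE\n\n' +
  'We build rockets and we are hiring backend engineers for our platform team.';

console.log(/\bremote\b/i.test(firstTokens(post, 5)));        // true
console.log(firstLongParagraph(post).startsWith('We build')); // true
```

Note how the two disagree on this post: the token window catches the header line, while the 40-character rule skips it and lands on the description.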

The nuance here is that there are essentially the following formats:

1:

PLACE | JOB | TERMS

DESCRIPTION  

2:

PLACE
JOB
TERMS

DESCRIPTION  

3:

PLACE, JOB, TERMS, DESCRIPTION ...

etc.

So there has to be a flexible "do what I mean, not what I say" style definition that can accommodate these various degrees of freedom. It's far too easy to have a solution that is very complex, fragile, and useless (which I know is very trendy these days, but I'm not a trendy guy like that).

What would be most useful to you?

@ThomasRooney

ThomasRooney commented Sep 6, 2016

A slight divergence, but I saw this and was inspired to squeeze it into a single (ish) console command. Relies on curl, jq, pup. Further output can then be pretty easily filtered by jq or just grep'd. Outputs a big array of the top level comments of the most recent Who's hiring thread.

curl https://news.ycombinator.com/submitted\?id\=whoishiring |
  pup 'td.title a json{}' | jq '.[] | select(.text | contains("Who is hiring")) | .href' |
  head -n 1 |
  xargs -I{} curl https://news.ycombinator.com/\{\} |
  pup '.comtr json{}' |
  jq '.[] | select(.. | select(select(.class?=="ind") | .children[0].width == "0")) |
            select(.class="comment") |
            map(..| if .tag? != "font" then .text? else null end) | map(select(. != null)) |
            del(.[length-1])'

EDIT: To filter, pipe the output somewhere, then do something like this: cat results | jq 'select(.[2]? | contains("London"))'

@jeff303

jeff303 commented Sep 7, 2016

@kristopolous, I'm absolutely aware of the complexity in what seems like a simple thing on the surface. It was more of just a brainstorming idea (mostly driven by my own desire to brush up on Javascript), and not an urgent request for that particular functionality.

Attempting to parse the different sections based on commonly seen formats is pretty interesting, but would be fragile as you point out. I think, at a minimum, attempting to look in only the first "paragraph" (the breaks between which actually do correspond to new <p> elements) might be a start. And also, only looking at top-level posts would be useful, to filter out "Would you consider remote candidates?" type replies, where that word didn't occur in the parent posting.
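
A rough sketch of both ideas follows. It rests on two assumptions about HN's markup that should be checked against the live page: that every paragraph after the first starts with a `<p>` tag, and that each comment row carries an indent image under a cell with class "ind" whose width is 0 for top-level comments (the same thing the jq filter above relies on). `firstParagraph` and `isTopLevel` are hypothetical names.

```javascript
// Assumption: inside a comment's HTML, every paragraph after the first
// begins with a <p> tag, so everything before it is the first paragraph.
function firstParagraph(html) {
  return html.split(/<p\b[^>]*>/i)[0].replace(/<[^>]*>/g, ' ').trim();
}

// Assumption: `row` is a comment row element containing `.ind img`,
// and a width attribute of 0 marks a top-level comment.
function isTopLevel(row) {
  const img = row.querySelector('.ind img');
  return img !== null && Number(img.getAttribute('width')) === 0;
}

const html = 'Acme | Berlin | ONSITE<p>Sorry, we cannot consider REMOTE candidates.';
console.log(firstParagraph(html));                    // Acme | Berlin | ONSITE
console.log(/\bREMOTE\b/.test(firstParagraph(html))); // false
```

In the browser, `isTopLevel` would be called with the row that encloses each comment before the text is searched at all.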

@gotoc

gotoc commented Jun 2, 2017

place, job, terms, description... here is a crazy idea: why not have a form with required fields for remote/onsite?
i.e., it would have to be enforced as part of hn netiquette or just plain common sense. :)

just a thought. if they want the right people to apply, and not get flooded with mismatched applicants,
then at the least they should be specific about these basic items, right?

btw, @kristopolous, thanks for the script, it's way better than seeing the page raw.

@frosas

frosas commented Nov 3, 2017

Thanks @kristopolous and @meiamsome, browser search functionality is definitely not enough to search hundreds of job positions!

Because I wanted a mix of both scripts (i.e. nested criteria AND regular expressions being first-class objects), and because it was fun to write, I ended up creating yet another version, which looks like this:

// Non-Angular Javascript contract positions in London or remote
hn.filter(
  hn.or(/(javascript|typescript)/i, /ES\d/, 'JS'),
  hn.not(/angular/i),
  /contract/i,
  hn.or(hn.and('ONSITE', /london/i), 'REMOTE')
);

Details at https://gist.github.com/frosas/4cadd8392a3c4af82ef640cbedea3027

@Ivanca

Ivanca commented Dec 2, 2017

This script loads all pages via AJAX; you may execute it before the query script so that you search all pages instead of just the first one.

;(function ajaxLoadNextPage () {
    var more = document.querySelector('.comment-tree > tbody > tr:last-child a');
    if (more && more.innerHTML === "More") {    
        var httpRequest = new XMLHttpRequest();
        httpRequest.onreadystatechange = function () {
            if (httpRequest.readyState === XMLHttpRequest.DONE) {
                if (httpRequest.status === 200) {
                  more.remove();
                  var div = document.createElement('div');
                  div.innerHTML = httpRequest.responseText;
                  var nextHTML = div.querySelector('.comment-tree > tbody').innerHTML;
                  document.querySelector('.comment-tree > tbody').innerHTML += nextHTML;
                  ajaxLoadNextPage();
                } else {
                  alert('There was a problem with the request to ' + more.href);
                }
            }
        };
        httpRequest.open('GET', more.href);
        httpRequest.send();
    }
})();

@janklimo

janklimo commented Dec 2, 2017

Any plans to package this as an extension?

@kristopolous
Author

I was revisiting this this month ... I think what I really want these days is exclusion more than inclusion. For instance, I don't care about healthcare, remote e-learning or fintech (I find them to be hucksters trying to arbitrage broken markets with snake oil tech) but anyway ... a blacklist seems really useful ... I should do that instead.
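
With the '-' negation described in the gist's changelog, a blacklist can be expressed as plain query arguments. A minimal sketch, where `exclusions` is a hypothetical helper and the term list is only an example:

```javascript
// Hypothetical helper: turn a list of unwanted terms into query() arguments.
// Each term becomes its own '-term' argument, so they are ANDed together:
// NOT healthcare AND NOT e-learning AND NOT fintech.
const exclusions = terms => terms.map(term => '-' + term);

const blacklist = ['healthcare', 'e-learning', 'fintech'];
console.log(exclusions(blacklist)); // [ '-healthcare', '-e-learning', '-fintech' ]

// In the browser console, with query() from the gist already pasted in:
// query(...exclusions(blacklist));
```

The spread matters: passing the negated terms as a single array would make them an "or" group, which only hides posts that mention every blacklisted term.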

@kristopolous
Author

kristopolous commented Jul 6, 2022

This also works, replace the id with whatever you want.

curl 'https://hacker-news.firebaseio.com/v0/item/31947297.json?print=pretty' | jq '.kids' | grep -Po '[0-9]*' | xargs -n 1 -P 20 -I %% wget https://hacker-news.firebaseio.com/v0/item/%%.json\?print=pretty

Then you can grep that.
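
The same idea can be sketched in JavaScript against the same item endpoint. `fetchTopLevel` and `grepItems` are hypothetical names; the sketch assumes a runtime with a global `fetch` and that items carry `kids`, `text` and `deleted` fields. Unlike the `-P 20` above, it does not limit how many requests run at once.

```javascript
// Same endpoint as the curl command above, minus the pretty-printing.
const itemUrl = id => `https://hacker-news.firebaseio.com/v0/item/${id}.json`;

// Fetch the thread, then each of its direct children (the top-level posts).
// `fetchFn` is injectable so the function can be exercised without a network.
async function fetchTopLevel(threadId, fetchFn = fetch) {
  const thread = await (await fetchFn(itemUrl(threadId))).json();
  const kids = (thread && thread.kids) || [];
  return Promise.all(kids.map(async id => (await fetchFn(itemUrl(id))).json()));
}

// Keep items whose text matches. Missing or deleted items are skipped.
const grepItems = (items, re) =>
  items.filter(item => item && !item.deleted && re.test(item.text || ''));

// Usage (needs network access):
// fetchTopLevel(31947297).then(items => console.log(grepItems(items, /rust/i).length));
```

Note that `text` holds HTML, so a pattern can also match inside tags or entities.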

@kristopolous
Author

kristopolous commented Aug 3, 2022

ok I updated it to implement all the things I've been musing about for 7 years and to hopefully make you laugh out loud while reading it.

It is extremely silly but hopefully not stupid and still legible

@nemanjam

nemanjam commented Sep 1, 2022

image

@kristopolous
Author

Damn it

I'm still in bed. I'll look when I'm at my office

@kristopolous
Author

kristopolous commented Sep 1, 2022

You're right. I was so careful in this. damn it. That's extremely disappointing. I apparently foolishly introduced the bug here when I was just using a phone and their textbox interface: https://gist.github.com/kristopolous/19260ae54967c2219da8/revisions#diff-63e9a5e5dead19d4e7a3ee13c24221089b165a04e534e6e675d491e9422576d1

Fixed. Thanks for the report @nemanjam

@nemanjam

nemanjam commented Sep 1, 2022

Thank you.

@gabrielsroka

Updated pagination code using fetch and async/await, and while instead of recursion. Forked from @Ivanca

It'd be nice to merge this into query().

(async function () {
    var more;
    while (more = document.querySelector('a.morelink')) {
        const r = await fetch(more.href);
        more.remove();
        const div = document.createElement('div');
        div.innerHTML = await r.text();
        document.querySelector('.comment-tree > tbody').innerHTML += div.querySelector('.comment-tree > tbody').innerHTML;
    }
})();

Also, maybe a bookmarklet? You can drag/drop or copy/paste it to your bookmarks toolbar, e.g.:

javascript:
/* /Say hello# */
(function () {
  alert('Hello, HN');
})();
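
For a longer script, the bookmarklet URL can be generated rather than written by hand. A minimal sketch, where `toBookmarklet` is a hypothetical helper: it wraps a function's source in an immediately invoked expression and percent-encodes it.

```javascript
// Hypothetical helper: build a javascript: URL from a function.
// The function's source is wrapped as (function ...)(); and percent-encoded
// so that spaces, quotes and newlines survive being stored as a bookmark.
function toBookmarklet(fn) {
  return 'javascript:' + encodeURIComponent('(' + fn.toString() + ')();');
}

const url = toBookmarklet(function () { alert('Hello, HN'); });
console.log(url.startsWith('javascript:')); // true
```

The function passed in has to be self-contained, since only its own source text ends up in the URL.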

