
@jonschlinkert
Last active June 13, 2018 14:53

In an effort to improve the quality of search results on sites like https://npmjs.com, there has been growing discussion about factoring the ranking/scoring of a library's dependencies into search results. The general idea is that a library should be ranked not only on its own merits; its dependencies should weigh into the score as well.

I’m not sure what my opinion is on this yet. I was initially in favor of this, and still might be, but this document is a summary of some things that crossed my mind about the topic.

Should dependencies weigh into the search score for a library?

  1. Some very popular libraries on npm are very poorly written, but solve problems that are sufficiently complex that few alternatives exist. From a scoring standpoint, this alone wouldn't matter because the law of averages would even it out. However...
  2. It's possible for developers to write high quality, stable code on top of lesser quality libraries.
  3. The better the quality of the code, and the more knowledgeable the maintainers, the more likely it is that a codebase insulates itself from weaknesses in its dependency tree.
  4. Moreover, the higher quality the library, the more disproportionate the (predictable) difference will be between the quality of the library itself and its dependencies (again, because of the law of averages). Good libraries represent a spike in the asymptote; poor libraries are the proverbial "long tail". Thus, if we use a relative scoring scale, in comparison to "the very best library" all other dependencies are relatively "poor" and would lower the score of the best library. (Would a common "3" library improve its score by having a dependency with a score of "5", while a high-quality and rare "10" library has its score lowered by depending on a "6"? Should we just penalize them both? Or should we reward high quality libraries that score well despite having a weak dependency tree? A numeric sketch of this effect follows below.)
  5. Would version locking factor in? If so, how would it be weighted?
  6. I think it's debatable that this would encourage devs to contribute to poor quality modules. IMHO it's arguably more likely that this would encourage developers of high quality modules to drop low quality modules in favor of publishing their own modules to use as dependencies. The higher the quality of the code a developer produces, the more likely it is that scoring dependencies which are out of his/her control will put them in a position of "risk". In which case, to avoid being at a continual disadvantage and in a perpetual position of risk, taking control of said libraries (by replacing them with their own) is the most likely outcome.
  7. Even if the current scoring model for individual libraries were mature and stable, any potential flaws in the model that might typically be tolerable and smoothed out in the averages could be geometrically magnified when dependencies are factored in.

One big takeaway is this: depending on how the algorithm is written, higher quality libraries that go to great lengths to insulate themselves from bad upstream code might be disproportionately penalized when low quality libraries are in the tree.
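To make the arithmetic in point 4 concrete, here is a minimal sketch (TypeScript, purely illustrative; the 0-10 scale, the 30% dependency weight, and the simple averaging are all assumptions, not any registry's actual algorithm) of what happens when a library's own score is blended with the mean score of its dependencies:

```ts
// Minimal sketch: blend a library's own score with the mean score of its
// dependencies. Scores are hypothetical values on a 0-10 scale.
function blendedScore(ownScore: number, depScores: number[], depWeight = 0.3): number {
  if (depScores.length === 0) return ownScore;
  const depMean = depScores.reduce((sum, s) => sum + s, 0) / depScores.length;
  return (1 - depWeight) * ownScore + depWeight * depMean;
}

// A common "3" library with a "5" dependency is pulled up...
console.log(blendedScore(3, [5]));  // 3.6
// ...while a rare "10" library with a "6" dependency is pulled down.
console.log(blendedScore(10, [6])); // 8.8
```

The two calls show the asymmetry: the blend rewards the mediocre library and penalizes the excellent one, even though the excellent one may have chosen that dependency deliberately and insulated itself from it.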

Conversion funnel

A page-rank-like scoring model with a conversion funnel might be a good approach to scoring. In which case the algorithm should take the following into consideration:

| Conversion funnel equivalent | Metric | Description |
| --- | --- | --- |
| awareness | GitHub stars | Equivalent of "likes", but only slightly better than visits (it's possible to have many stars per one visitor). This metric only proves awareness; it has no causal relationship to actual interest, "real" popularity, code quality, trust, intent to use, or advancement in the conversion funnel. Should count for 10-15% of the score. |
| interest | dependants | Dependants signal awareness and commitment from the authors who chose to depend on the library, but they don't correlate with actual usage or with commitment from upstream implementors and users, and they are just as likely to reflect herd mentality or a reaction to recent awareness as to predict success. Should count for 10-15% of the score. |
| conversion | downloads | A sign of both commitment and success. Regardless of the number of direct dependants, libraries with higher downloads are vetted by more users and upstream implementors. Should count for 75-80% of the score. |
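As a rough illustration of how those weights could combine (using midpoint weights of 12.5% / 12.5% / 75%, and assuming each metric has already been normalized to a 0-1 range; the normalization itself is a separate and harder problem):

```ts
// Illustrative only: combine the three funnel metrics with midpoint weights.
interface FunnelMetrics {
  stars: number;      // awareness, normalized to 0-1
  dependants: number; // interest, normalized to 0-1
  downloads: number;  // conversion, normalized to 0-1
}

function funnelScore({ stars, dependants, downloads }: FunnelMetrics): number {
  return 0.125 * stars + 0.125 * dependants + 0.75 * downloads;
}

// A heavily downloaded library with modest stars still scores well:
console.log(funnelScore({ stars: 0.2, dependants: 0.3, downloads: 0.9 })); // 0.7375
```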

Weighing activity and issues

The frequency of issues and commits is often seen as a sign of a popular and healthy library. But that's arguably only true of:

  • complicated libraries
  • libs with a large API surface
  • libs with lots of moving parts that are a topic of discussion or are still open for debate
  • libs that have a lot of unanswered questions
  • libs that are subject to a lot of platform-specific edge cases

versus, for example:

  • mature libraries with predictable, deliberate release cycles
  • micro-libraries that do one thing, and do it well
  • libraries that strive to be immutable (why would we penalize this?)
  • libraries that don't receive bug reports and patches because they are well-written (IMHO, libraries with high downloads and few bug reports should receive disproportionately positive scores; one rough way to express that is sketched after this list)
  • libraries that many devs don't view as "star-worthy"
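
One rough way to express that "high downloads, few bug reports" point without penalizing quiet, stable libraries is to treat open issues per download as a crude defect-rate proxy and fold it into a bounded modifier. The constants below are arbitrary assumptions and the function name is hypothetical; this is a sketch, not a recommendation:

```ts
// Sketch: a bounded bonus that grows as open issues per download shrinks.
function stabilityBonus(monthlyDownloads: number, openIssues: number): number {
  if (monthlyDownloads === 0) return 0;
  const issuesPerMillionDownloads = (openIssues / monthlyDownloads) * 1_000_000;
  // Fewer issues per download -> bonus approaching 1; many -> approaching 0.
  return 1 / (1 + issuesPerMillionDownloads / 100);
}

console.log(stabilityBonus(5_000_000, 10).toFixed(2));    // "0.98"
console.log(stabilityBonus(5_000_000, 2_000).toFixed(2)); // "0.20"
```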
@BenJam commented Jul 12, 2017

A page-rank-like scoring mechanism (we unimaginatively call ours SourceRank) is what's needed here. I think such a measure would have far more positive effects than just ranking projects:

It would be possible to create a purely probabilistic SourceRank algorithm within open source using just dependencies. This would be equivalent to PageRank in its original and most academic form. But PageRank evolved: measures built into the algorithm started to value the content of a web page as well, and arguably this has led to a better-quality internet in terms of performance, legibility, accessibility and overall user experience.

We can do something similar: by introducing measures like 'does this project have a licence' into a PageRank-like score, we can accelerate convergence toward more (dare I say) 'sustainable' projects and gently incentivise developers to consider the implications of their choices.
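
A toy sketch of that idea in TypeScript: plain PageRank over the "depends on" graph, plus a flat bonus for having a licence. The damping factor, the 10% bonus, and the tiny example graph are all assumptions for illustration; this is not the actual SourceRank formula:

```ts
type Graph = Record<string, string[]>; // package -> the packages it depends on

function depRank(
  graph: Graph,
  hasLicence: Record<string, boolean>,
  iterations = 50,
  d = 0.85 // damping factor, as in classic PageRank
): Record<string, number> {
  const pkgs = Object.keys(graph);
  const n = pkgs.length;
  let rank: Record<string, number> = {};
  for (const p of pkgs) rank[p] = 1 / n;

  for (let i = 0; i < iterations; i++) {
    const next: Record<string, number> = {};
    for (const p of pkgs) next[p] = (1 - d) / n;
    for (const p of pkgs) {
      // Each dependant "votes" for its dependencies, splitting its rank evenly.
      // Simplification: rank held by packages with no dependencies is dropped
      // rather than redistributed (real PageRank handles dangling nodes).
      for (const dep of graph[p]) next[dep] += (d * rank[p]) / graph[p].length;
    }
    rank = next;
  }

  // "Content"-style adjustment: a small, flat bonus for having a licence.
  for (const p of pkgs) if (hasLicence[p]) rank[p] *= 1.1;
  return rank;
}

// Assumes every dependency also appears as a key in the graph.
const graph: Graph = { app: ['lodash', 'left-pad'], lodash: [], 'left-pad': [] };
console.log(depRank(graph, { app: true, lodash: true, 'left-pad': false }));
```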

@tomByrer commented Jul 13, 2017

I'd like to include execution speed. I don't think every line needs to be nerded out via micro-benchmarks, but I've seen devs get sucked into something that is 'cool' & easy to use, yet takes 10x as long to run:
https://medium.com/dailyjs/javascript-test-runners-benchmark-3a78d4117b4

Not only are there hosting costs & time inconveniences; large performance gaps are a 'code smell' for me.
