@CraigStuntz
Created January 15, 2017 05:11

Someone said, "...the purpose of this RFC is to rank crates against each other...". I think this puts the cart before the horse. The purpose should be to help visitors to crates.io find what they need. Ranking crates is one potential way to help, but it isn't the purpose.

Who visits crates.io and why?

Personas

Dee Veloper

Dee is a senior developer who is building a command line tool to calculate statistics about GitHub users. She would like to access the GitHub API, but would prefer not to reinvent the wheel with raw HTTP requests if possible.

Best experience for Dee: Such a specialized crate is unlikely to be on the homepage by default. Dee searches for GitHub, and a short list of relevant crates is returned. The best choice for her is somewhere near the top of the list, and there's enough information with each result (popularity, activity, license, CoC, etc.) that she can identify a strong candidate or two for manual analysis.

Bad experience for Dee: Searching for GitHub returns every crate which has "github" in its URL, which is roughly 95% of them.

Worst experience for Dee: Malicious crates in search results.

Vi Nary

Vi needs to find a crate linter.

Best experience for Vi: Searching for crate linter returns a short list of relevant results. The best choice for her is somewhere near the top of the list, and there's enough information with each result (popularity, activity, license, CoC, etc.) that she can identify a strong candidate or two for manual analysis.

Bad experience for Vi: Because binary crates have no inbound dependencies, the order of the results isn't helpful.

Worst experience for Vi: Malicious binary.

I. Ron Oxide

Ron is a Rust compiler hacker who usually knows which crate he wants and works mainly from the command line. He uses the site mostly to see what is new, offer assistance to new crate authors, and keep an eye out for malicious crates.

I'm not sure how to best serve Ron. It's probably a version 2 feature, though.

Javi Scripps

Javi is a seasoned Node.js developer who is trying Rust for the first time. He needs to left-pad a string, so he decides to search for a crate.

Best experience for Javi: crates.io searches the standard library and prioritizes built-in functionality. It recommends the format! macro.

Bad experience for Javi: A joke crate is returned instead.

Worst experience for Javi: Malicious crates in search results.

Ruby Gemz

Tired of rubygems.org being described as "a one way time machine to 2010," Heroku-A-Salesforce-Company™ has hired Ruby to revamp the home page and optimize the API. She's heard good things about Cargo, so she loads up crates.io to figure out what makes it special.

This is probably also a version 2 feature?

So how do we make this work?

There are a few insights from the personas above. Ranking and display of additional information like license are separate problems. The aim should be that the "right" crate is always on the first result page. If so, then the rank doesn't need to incorporate factors like license, documentation, and CoC -- those can simply be displayed alongside the result, and the user doing the searching can decide how important they are to her. This diminishes the urge to "game" these factors.
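To make the split between ranking and display concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `SearchResult` type, its fields, and the `first_page` and `render` helpers are hypothetical and are not part of crates.io. The point is only that ordering uses a single score, while license, documentation, and CoC travel with the result purely for display.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    # Hypothetical shape of one search hit.
    name: str
    score: float    # the only field used for ordering
    license: str    # display-only
    has_docs: bool  # display-only
    has_coc: bool   # display-only


def first_page(results, page_size=10):
    """Order by score alone and keep the first page of results."""
    return sorted(results, key=lambda r: r.score, reverse=True)[:page_size]


def render(result):
    """Show the display-only fields next to the crate name."""
    docs = "docs" if result.has_docs else "no docs"
    coc = "CoC" if result.has_coc else "no CoC"
    return f"{result.name} [{result.license}, {docs}, {coc}]"


# Made-up crates for illustration.
results = [
    SearchResult("alpha", 0.4, "MIT", True, False),
    SearchResult("beta", 0.9, "GPL-3.0", False, True),
]
page = first_page(results)
for result in page:
    print(render(result))
```

Because the display fields never feed into `score`, there is less incentive to game them; the searcher weighs them herself.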

On the other hand, the ranking problem is still hard! I agree with those who say a complex formula is probably not the right approach.

Google found a reasonable solution on the web via PageRank. But library dependencies tend to be acyclic, so it's not exactly the same problem. It will probably always be the case that the ratio of references to downloads will be higher for libc than for iron.

However, we can simplify the problem. For crates with few references (binary-only, or "high-level," like iron), we can ignore dependencies altogether. For crates which are more commonly downloaded as a transitive dependency than a primary dependency, references are very important.

For binary-only/high-level crates, use recent popularity. Say, last quarter or 6 months. jQuery is a frequently downloaded JS library, but it's not usually the best solution in 2017.
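A rough sketch of what "recent popularity" could mean in practice, assuming per-day download counts are available. The `recent_downloads` helper, the 90-day default, and the two sample crates are all made up for illustration.

```python
from datetime import date, timedelta


def recent_downloads(daily_downloads, today, window_days=90):
    """Sum the downloads that fall inside the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    return sum(
        count for day, count in daily_downloads.items() if cutoff < day <= today
    )


# Hypothetical data: an older crate with a large historical total,
# and a newer crate that is being picked up right now.
today = date(2017, 1, 15)
old_favorite = {date(2015, 6, 1): 900_000, date(2017, 1, 10): 1_000}
newcomer = {date(2016, 12, 20): 20_000, date(2017, 1, 10): 15_000}

ranked = sorted(
    {"old_favorite": old_favorite, "newcomer": newcomer}.items(),
    key=lambda item: recent_downloads(item[1], today),
    reverse=True,
)
print([name for name, _ in ranked])
```

With a window like this, the all-time total no longer dominates: the newer crate sorts first even though the older one has far more lifetime downloads.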

In a nutshell, I think it should come down to:

  1. Prefer the standard library when possible
  2. Otherwise, use what most Rust developers are using right now. This isn't a perfect measure: Angular is not the right answer to "what's the best JS framework for ____" for any ____. Still, it's probably good enough for the front page. "Popular" isn't the same as downloads, though, because of transitive dependencies. Hence, combine downloads with PageRank. A "common" crate is either commonly downloaded as a direct dependency, or transitively included by a frequently downloaded crate even if it is rarely used as a direct dependency.
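One way to picture "combine downloads with PageRank" is sketched below. It is a minimal illustration under stated assumptions, not a proposed formula: the crate names and numbers are invented, the 50/50 `weight` is arbitrary, and `pagerank` is a plain power-iteration version in which each crate "votes" for the crates it depends on.

```python
def pagerank(dependencies, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dependency graph.

    `dependencies` maps a crate name to the list of crates it depends on.
    """
    crates = set(dependencies)
    for deps in dependencies.values():
        crates.update(deps)
    n = len(crates)
    rank = {c: 1.0 / n for c in crates}
    for _ in range(iterations):
        # Crates with no dependencies spread their rank evenly over all crates.
        dangling = sum(rank[c] for c in crates if not dependencies.get(c))
        new_rank = {c: (1 - damping) / n + damping * dangling / n for c in crates}
        for crate, deps in dependencies.items():
            for dep in deps:
                new_rank[dep] += damping * rank[crate] / len(deps)
        rank = new_rank
    return rank


def combined_score(crate, rank, recent, weight=0.5):
    """Blend normalized recent downloads with normalized PageRank."""
    max_rank = max(rank.values())
    max_recent = max(recent.values()) or 1
    return (
        weight * recent.get(crate, 0) / max_recent
        + (1 - weight) * rank[crate] / max_rank
    )


# Hypothetical crates: two applications, a high-level framework,
# and a low-level library that is mostly pulled in transitively.
deps = {
    "app_a": ["framework"],
    "app_b": ["framework", "syslib"],
    "framework": ["syslib"],
    "syslib": [],
}
recent = {"app_a": 500, "app_b": 300, "framework": 50_000, "syslib": 60_000}

rank = pagerank(deps)
ordered = sorted(rank, key=lambda c: combined_score(c, rank, recent), reverse=True)
print(ordered)
```

The low-level library gets credit for being depended on, and the framework gets credit for being downloaded directly; how to weight the two is exactly the kind of thing that would need tuning against real data.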

This should make Dee, Vi, and Javi happy. They can see the top 10-20 on the first page and then use the additional information displayed to narrow down their search to a handful of crates.

Spam

There will always be spam. People will always game ratings. However, you can solve some of this by letting those who want to buy their way to the top do so openly. Have a clearly indicated "promoted crates" section on the front page. Charge lots! Enough to pay for a safety review and a sizeable donation to the Mozilla Foundation or similar.

Safety

Ideally, crates.io should never suggest a malicious crate. Easier said than done! Static analysis and VirusTotal are probably both useful for excluding potentially harmful crates until they can be manually reviewed. These checks should be expanded in the future.

Thoughts on signals

Regarding maintenance measures, I think having five separate categories addresses the needs of the maintainer but not the crates.io user. The crates.io user probably only needs three: mainstream, experimental, or unmaintained. This can be determined almost wholly from commits over the past n months and popularity. Lots of downloads and "recent" (say, last quarter or half-year?) commits? Mainstream. Few downloads, lots of commits? Experimental. No recent commits? Unmaintained.
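The three-way split above is simple enough to write down directly. This is only a sketch: the `maintenance_status` name and the `popular_threshold` value are assumptions, and what counts as "recent" or "lots of downloads" would have to be chosen from real crates.io data.

```python
def maintenance_status(recent_commits, recent_downloads, popular_threshold=10_000):
    """Classify a crate from activity over the same recent window.

    `recent_commits` and `recent_downloads` are counts over the last n months.
    `popular_threshold` is an arbitrary placeholder, not a measured cutoff.
    """
    if recent_commits == 0:
        return "unmaintained"
    if recent_downloads >= popular_threshold:
        return "mainstream"
    return "experimental"


# Made-up examples covering each category.
print(maintenance_status(recent_commits=40, recent_downloads=250_000))
print(maintenance_status(recent_commits=120, recent_downloads=300))
print(maintenance_status(recent_commits=0, recent_downloads=90_000))
```

Note that a heavily downloaded crate with no recent commits still comes out as unmaintained here, which matches the rule as stated; whether a stable, finished crate deserves that label is a separate question.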

I don't think the ratio of tests to code is any kind of proxy for quality. There is perhaps some value in measuring whether a project has tests, period, and dinging those which don't. But one of the best ways to improve quality is to delete code, and that probably hurts a project on the tests to code ratio metric.

Miscellaneous

Documentation

Documentation is super useful, but so are sample code, maintainer blogs, and Stack Overflow answers.

Amazon and Stack Overflow

Someone commented that Amazon and Stack Overflow solve this problem. I largely disagree. Neither has a great native search experience. They get most of their traffic from Google. "Leave it to Google" is certainly an option, but "do what Amazon and SO do" is probably not the best strategy.

"Test code coverage %: this tells you the % of all the code paths/expressions/... that are exercised by the tests. IMO more is objectively better, less is objectively worse."

Strong disagreement. While code coverage does have some use, ranking by test coverage is lines of code counting for people who don't think lines of code counting is cool. Static analysis does at least as much for quality and is ignored here.
