Issues with Scarf-js approach

My problems with Scarf-js

The @scarf/scarf package uses the postinstall option in package.json to run a script that makes a request to Scarf's servers with potentially sensitive data, and it can be configured to do so with no (reasonably visible) notice to the user. As with almost any technology, the tech involved isn't inherently bad.
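For context on the mechanism: npm packages can declare a postinstall script in package.json, and npm runs it automatically on the consumer's machine after the package is installed (unless installs are run with --ignore-scripts). A minimal sketch, using a hypothetical package name and script file rather than scarf-js's actual code:

```json
{
  "name": "example-library",
  "version": "1.0.0",
  "scripts": {
    "postinstall": "node report.js"
  }
}
```

Anyone who runs npm i example-library, directly or via a transitive dependency, executes report.js with no prompt. That lifecycle hook is what scarf-js relies on for its reporting.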

General principle

This is a fairly weak argument on its own, but it's the "gut feeling" one that leads to the others.

Libraries are expected to be installed with npm i, and to take no actions beyond what installation requires.

Seeing some other, non-essential task run at install time, especially one that transmits data, is a big red flag.

Precedent

There is an expectation that packages will follow the above principle, and an immediate negative reaction to anything that strays outside of it. This expectation could be changed, however. A single, widely accepted exception for Scarf-js is worth considering, and if that were possible, I might support the idea. However, I don't believe it's possible. Instead, the likely next step is alternative libraries running their own, different analytics in postinstall, which shifts the widespread expectation toward more quickly accepting all sorts of non-essential postinstall commands. That in turn opens the door to Big Business running its own analytics on libraries it controls, and aggregating the usage data to infer information far more sensitive than what is actually sent (consider how Facebook, from about 10 likes, can guess age, gender, religion, political views, and so on). I'm going to avoid doomsday thinking, but frankly, there are a lot of negative courses that Big Business can and would take with this data.

So, keeping the strong precedent that packages don't phone home protects us from a new source, with new data, of the same issues we already deal with from other tracking systems. Or to put it another way: if you could have stopped the first website from aggregating and selling internet traffic data, wouldn't you?

I recognize this may be dismissed as a slippery-slope fallacy, but I believe the way businesses have acted in the past, when given similar new technical options, shows that the outlined progression is very likely.

Sensitive Data

This is a much simpler argument: the data being sent may be considered non-sensitive by most people, but there are whole classes of users who would not want it shared any more widely than it already is. This is a new channel for sharing that data.

Also, the data that is sent can change without reasonable notice to the library consumer.

An ethical argument

This is specifically for the opt-out behavior.
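To make the opt-out concrete: per the scarf-js README (worth double-checking against the current documentation), a consuming project can disable reporting by adding a top-level scarfSettings block to its own package.json:

```json
{
  "scarfSettings": {
    "enabled": false
  }
}
```

or by setting the SCARF_ANALYTICS=false environment variable. Neither happens unless the consumer already knows the package is present and knows to look for the setting, which is the core of the objection here.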

Is it not unethical to automatically do something that most people would refuse if the option were presented on an equal footing?

For clarity on this point: there is no obligation to send the data. Use of the library is not contingent on leaving the analytics enabled; you can opt out and still use it.

And yes, I recognize this argument applies very broadly. Yes, I think most internet tracking is unethical.

Various risk-aversion arguments

Use of any technology carries risks: there could be bugs that send more data than intended, data could be incorrectly deleted or accidentally retained, and data systems could be breached. Frankly, these aren't strong arguments given the risks we already accept... except for the smaller class of users who consider the expected data to be very sensitive. They may have a much larger concern here. But then again, if they are that concerned, they'll be auditing their dependencies and would probably catch the need to opt out.

Problems I do not have with Scarf-js

I think it's very reasonable for library authors to have an option that allows them to match library downloads with public company IPs, giving them insight into who is or may be using the software they built.

I'm iffy on sending the dependency-tree data itself, but which library was installed is, at minimum, necessary information.

Alternatives

I haven't explored this space very much myself. However, one potential idea is using similar IP-to-company lookups for visitors to a library's documentation website or pages. This type of tracking is already widespread, so there's no loss of integrity, and since the stored data can be as simple as a company name with a hit count, the actual tracking is minimal.
