jordan-wright/rfi.md Secret

## rfi.md

      
Hi everyone!
I hope you'll pardon the long post. I'm excited about this effort, since it touches on a topic I've explored and thought about quite a bit in the past. I am looking forward to seeing the exciting results!
When it comes to malware on PyPI, my thoughts fall into two sections: prevention and detection.
Preventing Malware

As we've seen, malware on package managers frequently comes from:

Hijacking existing packages through account compromise
Hijacking existing packages that have been abandoned or deleted
Registering typo-squatted packages

I'd like to take a look at what might be done to help mitigate each of these.
Hijacking via Account Compromise

Encourage 2FA Adoption

It's very exciting to see the strides PyPI has already made in enabling 2FA for accounts, which is a great first step. Once 2FA is fully in production and stable, I would also consider encouraging maintainers to turn it on by showing a warning during package upload or login to PyPI if the account doesn't have 2FA enabled.
Enforce 2FA for Maintainers

I've seen some package managers, like npm, offer owners of a package the ability to force other maintainers to enable 2FA in order to publish a new version of a package. This would be a useful addition to PyPI as well. I didn't see anything on this thread that suggested this was in the works, but let me know if I'm missing something.
Monitoring for Leaked API Tokens

It's exciting to see the work being done leveraging Macaroons as API tokens. As this becomes a more widely used feature, I would recommend signing up for GitHub's Token Scanning service to identify and revoke API tokens that might be accidentally leaked in commits to GitHub. Since the tokens use the prefix "pypi", you should be able to craft a regex that reliably identifies them. It looks like this has been suggested in #6051, so consider this a +1.
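As a rough sketch of what such a regex might look like: the pattern below assumes a token is the "pypi" prefix, a hyphen, and a long URL-safe base64 body. The hyphen separator, the character set, and the 32-character minimum are illustrative assumptions, not PyPI's documented token format, and `find_candidate_tokens` is a hypothetical helper name.

```python
import re
from typing import List

# Assumption: a token is "pypi-" followed by a long URL-safe base64 body.
# The separator, alphabet, and minimum length are illustrative only.
TOKEN_PATTERN = re.compile(r"\bpypi-[A-Za-z0-9_-]{32,}")


def find_candidate_tokens(text: str) -> List[str]:
    """Return every substring of `text` that looks like a leaked API token."""
    return TOKEN_PATTERN.findall(text)
```

The length floor is there to avoid flagging ordinary strings such as package names that merely start with "pypi-".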
Abandoned Packages

After the left-pad incident a while back, npm created an unpublish policy which led to the following rules:

You can unpublish a package as long as it's less than 72 hours old
Otherwise, deprecation is highly recommended (I think you can still unpublish by contacting support)

I wasn't able to find a similar policy for PyPI, but the one from npm seems reasonable. I like that it offers an org like the PSF the chance to transfer the package to a holding space or otherwise find a middle ground with the original author. That said, I don't have metrics to indicate how many support tickets this would have caused in the past x months.
Registering Typo-Squatted Packages

There have been discussions around using metrics like Levenshtein distance to determine if a package being registered is too similar to an existing package. A response on a different thread suggests that this would result in too many false positives.
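For reference, the kind of check being discussed could be sketched as follows. This is a plain dynamic-programming edit distance plus a hypothetical `similar_names` helper. The threshold of 1 is an arbitrary assumption, and none of this reflects how PyPI actually handles registrations.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # delete ch_a
                current[j - 1] + 1,       # insert ch_b
                previous[j - 1] + cost,   # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similar_names(candidate: str, existing: Iterable[str],
                  max_distance: int = 1) -> List[str]:
    """Existing names within `max_distance` edits of `candidate` (hypothetical helper)."""
    return sorted(
        name for name in existing
        if name != candidate and levenshtein(candidate, name) <= max_distance
    )
```

With this sketch, `similar_names("colourama", {"colorama", "requests"})` returns `["colorama"]`. Any two short, legitimate names that happen to differ by one character would be flagged just the same, which is the false-positive concern.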
Instead, here's an alternative approach that may be worth considering: there are already metrics on (roughly) the number of downloads for each package. Assuming you don't have this already, adding internal metrics for the number of non-existent packages that people are attempting to download would give a prioritized list of things to consider blacklisting. My guess is that entries would surface that standard typo-squatting measures would not catch, such as people trying to install a package called requirements.txt because the -r was missed.
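A minimal sketch of that metric, assuming the index can see each requested project name and knows which names exist. `MissingPackageCounter` and its methods are hypothetical names for illustration; names are normalized the way PEP 503 describes, so `requirements.txt` is counted as `requirements-txt`.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def normalize(name: str) -> str:
    """PEP 503 name normalization: collapse runs of -, _, . and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


class MissingPackageCounter:
    """Counts requests for project names that do not exist (hypothetical sketch)."""

    def __init__(self, known_packages: Iterable[str]) -> None:
        self.known = {normalize(name) for name in known_packages}
        self.misses: Counter = Counter()

    def record_request(self, name: str) -> None:
        normalized = normalize(name)
        if normalized not in self.known:
            self.misses[normalized] += 1

    def top(self, n: int = 10) -> List[Tuple[str, int]]:
        """Most-requested missing names: candidates to review for blocking."""
        return self.misses.most_common(n)
```

The output is only a prioritized review list; deciding which names to reserve would still be a human call.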
Hopefully some of these changes could raise the barrier required for malware to both be uploaded to PyPI and be effective. From here, I'd like to talk about detecting what makes it through the cracks.
Detecting Malware

Right now, there's a fair bit of magic that goes into detecting malicious packages uploaded to package managers. In a post from a while back, I downloaded the metadata for all npm packages and essentially grep'd through the postinstall, preinstall, and install values. This is in line with the static analysis done by other folks to find malicious packages on PyPI. There have also been reports of promising work that looks for specific syscalls during dynamic analysis of npm modules.
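To give a feel for the grep-style pass over npm metadata, here is a small sketch that reads a `package.json` document and flags install-time hooks containing certain substrings. It is not the original script: the `SUSPICIOUS_SUBSTRINGS` list and the `flag_install_scripts` name are made up for illustration, and a real list would need tuning.

```python
import json
from typing import Dict, List

INSTALL_HOOKS = ("preinstall", "install", "postinstall")
# Hypothetical substrings worth a closer look; purely illustrative.
SUSPICIOUS_SUBSTRINGS = ("curl ", "wget ", "| sh", "eval(")


def flag_install_scripts(package_json_text: str) -> Dict[str, List[str]]:
    """Map each install-time hook to the suspicious substrings found in it."""
    metadata = json.loads(package_json_text)
    scripts = metadata.get("scripts") or {}
    findings: Dict[str, List[str]] = {}
    for hook in INSTALL_HOOKS:
        command = scripts.get(hook)
        if not isinstance(command, str):
            continue  # hook absent or malformed
        hits = [s for s in SUSPICIOUS_SUBSTRINGS if s in command]
        if hits:
            findings[hook] = hits
    return findings
```

A substring match like this is cheap to run across every package, but it only surfaces candidates for manual review and is easy to evade with obfuscation.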
But in general, I think it's important to decide and enforce what's in scope in terms of where PyPI wants to look for malware, and then what behavior is explicitly disallowed within that scope. Anything else will be a much more difficult task, and runs the risk of confusing users.
So let's talk about what, in my opinion, should be in scope.
Where to Look for Malware

In my personal opinion (I'm very open to changing my mind here - this is strictly where my head is at), a good boundary to set is behavior that occurs at the time of installation without the user's reasonable knowledge and without the user having an option to opt out.
While some languages have very clear places where malicious code could be executed during the installation process, with Python things are a bit less clear. Some malware has resorted to simply including executable code directly in the setup.py file, though it's unclear if this executes during installation. Instead, it seems the "recommended" approach to get code execution during installation is by using the cmdclass flag to specify your own install class as shown here (with a blog post here). For example, this approach appears to have been used by the malicious colourama package here.
Alternatively, you could create your own eggsecutable script as mentioned here, though I'm not exactly sure when that fires.
Just from the outset, I'd see value in more closely scrutinizing commands executing as part of the cmdclass overrides, since it seems to be a widely used method for existing malware. But more broadly, to find issues I'd probably consider leveraging dynamic analysis in a sandboxed installation, leading us to talk about what it is we'd look for.
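One cheap way to start on the cmdclass scrutiny is a static pass that parses setup.py without executing it and reports which commands a `setup()` call overrides. The sketch below uses only the standard `ast` module. `find_cmdclass_overrides` is a hypothetical helper, and it only sees a `cmdclass` written as a literal dict, so a package that builds the dict dynamically would slip past it.

```python
import ast
from typing import List


def find_cmdclass_overrides(setup_py_source: str) -> List[str]:
    """Return command names overridden via a literal cmdclass={...} in setup()."""
    tree = ast.parse(setup_py_source)  # parses only; nothing is executed
    overridden: List[str] = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Matches both setup(...) and setuptools.setup(...)
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "setup":
            continue
        for keyword in node.keywords:
            if keyword.arg == "cmdclass" and isinstance(keyword.value, ast.Dict):
                for key in keyword.value.keys:
                    if isinstance(key, ast.Constant) and isinstance(key.value, str):
                        overridden.append(key.value)
    return overridden


# A benign example of the override pattern described above.
SAMPLE_SETUP_PY = '''
from setuptools import setup
from setuptools.command.install import install

class CustomInstall(install):
    def run(self):
        print("this runs at install time")
        install.run(self)

setup(name="example", version="0.1", cmdclass={"install": CustomInstall})
'''
```

Packages flagged this way are not necessarily malicious, since overriding `install` has legitimate uses. The point is to narrow down what gets a closer look or a sandboxed run.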
What to Look For

At a high level, I think there should be some guidelines on what behavior is allowed (or, more likely, disallowed) during the installation process. Just recently, for example, npm decided that ads cannot be shown during installation as a response to a package using them as a potential source for OSS income. Some examples of things that might be considered are:

What data should the installation be allowed to access?
What data should the installation be allowed to modify?
Should network connections be allowed? If so, to where?

I don't have all the answers, but defining what behavior is expected and allowed will set the tone for the larger project to identify what constitutes abuse of the platform.
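To make the idea of an explicit policy a little more concrete, here is one way the three questions above could be written down and checked against events observed in a sandboxed install. Everything here is hypothetical: the `InstallPolicy` and `Event` types, the event kinds, and the simple prefix matching are assumptions for illustration, not an existing PyPI or pip feature.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class InstallPolicy:
    """What an installation may touch (hypothetical sketch)."""
    readable_prefixes: Tuple[str, ...] = ()
    writable_prefixes: Tuple[str, ...] = ()
    allowed_hosts: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Event:
    kind: str    # "read", "write", or "connect"
    target: str  # file path for read/write, hostname for connect


def violations(policy: InstallPolicy, events: Iterable[Event]) -> List[Event]:
    """Return the observed events that the policy does not allow."""
    disallowed = []
    for event in events:
        if event.kind == "read":
            allowed = event.target.startswith(policy.readable_prefixes)
        elif event.kind == "write":
            allowed = event.target.startswith(policy.writable_prefixes)
        elif event.kind == "connect":
            allowed = event.target in policy.allowed_hosts
        else:
            allowed = False  # unknown behavior is treated as disallowed
        if not allowed:
            disallowed.append(event)
    return disallowed
```

The prefix check assumes paths have already been resolved to absolute form. The useful part is less the code than the fact that the policy is written down, so "abuse" has a definition that can be tested.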
Learning from Others

Last but far from least, I was happy to read in the RFI outline that there was a goal to review what other package managers are doing in this space. In these notes I've mentioned the work from npm a few times, but more broadly I'd highly encourage us to proactively reach out to the maintainers of other package managers to collaborate on solutions. For example, I really enjoyed this talk from Adam Baldwin at npm that discusses some of the ongoing work they're doing in this space.
This is a problem where package managers have many overlapping goals and many seemingly identical problems, and as such would benefit from learning and building together.