John Speed Meyers jspeed-meyers

jspeed-meyers / pypi_repo_deps2repos_output.txt
Created May 11, 2023 19:59
Command and output of running deps2repos to convert top 1000 Python packages to a list of GitHub repos
(deps2repos) ➜ deps2repos git:(main) ✗ python main.py --no_deps --python top_1000_pypi_packages.txt
WARNING: Some of the packages found on PyPI do not have GitHubs:
protobuf
cffi
docutils
grpcio-status
beautifulsoup4
openpyxl
et-xmlfile
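The warning above lists packages for which deps2repos could not find a GitHub repository in their PyPI metadata. As an illustrative sketch (not deps2repos's actual logic), a lookup like this has to scan a package's `home_page` and `project_urls` fields from the standard PyPI JSON API:

```python
import json
import urllib.request

def github_url_from_metadata(metadata):
    """Scan a PyPI JSON metadata dict for a GitHub URL.

    Checks home_page first, then project_urls. Illustrative only;
    the real deps2repos matching rules may differ.
    """
    info = metadata.get("info", {})
    candidates = [info.get("home_page") or ""]
    candidates.extend((info.get("project_urls") or {}).values())
    for url in candidates:
        if url and "github.com" in url:
            return url
    return None

def lookup_package(name):
    """Fetch metadata for one package from the PyPI JSON API."""
    with urllib.request.urlopen(f"https://pypi.org/pypi/{name}/json") as resp:
        return json.loads(resp.read().decode())
```

Packages such as protobuf or cffi end up in the warning list when neither field points at github.com.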
jspeed-meyers / top_1000_pypi_packages.txt
Created May 10, 2023 20:17
A list of the top 1000 PyPI packages, collected as of May 2023.
boto3
urllib3
requests
botocore
charset-normalizer
idna
certifi
setuptools
s3transfer
python-dateutil
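The file is plain text, one package name per line, which is the format the deps2repos command above consumes. A minimal loader for such a list might look like:

```python
def load_package_list(text):
    """Parse a newline-delimited package list, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# Typical use: load_package_list(open("top_1000_pypi_packages.txt").read())
```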
jspeed-meyers / get_top_pypi_packages.py
Created May 10, 2023 20:15
A script to collect the names of the most downloaded python packages
# Script to retrieve the most downloaded packages on the Python Package Index.
# (ChatGPT drafted parts of this; I edited it. Not redundant yet.)
import json
import urllib.request
def get_top_packages(top_n=1000):
"""Identify top packages by download count on pypi.
jspeed-meyers / create_docker_image_distribution_dataset.py
Created October 24, 2022 15:11
Identify the OS distribution of each Docker image in a list
"""Create docker image distribution dataset."""
import csv
import logging
import re
import subprocess
# potential os locations for distribution data
# info on os-release: https://www.freedesktop.org/software/systemd/man/os-release.html
LOCATIONS = [
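The LOCATIONS list is truncated in the snippet. A sketch of the idea, assuming the two systemd-standard os-release paths and a hypothetical helper that reads them out of an image with `docker run` (the original gist's exact approach is not shown):

```python
import subprocess

# Candidate paths for distribution data. /etc/os-release is the
# systemd-standard location; /usr/lib/os-release is its fallback.
LOCATIONS = ["/etc/os-release", "/usr/lib/os-release"]

def parse_os_release(text):
    """Parse os-release KEY=VALUE lines into a dict, stripping quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields

def image_distribution(image):
    """Hypothetical helper: cat os-release out of a container image."""
    for path in LOCATIONS:
        result = subprocess.run(
            ["docker", "run", "--rm", "--entrypoint", "cat", image, path],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            return parse_os_release(result.stdout).get("ID", "unknown")
    return "unknown"
```

The `ID` field (e.g. `alpine`, `debian`, `ubuntu`) is the machine-readable distribution name defined by the os-release spec.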
jspeed-meyers / create_top_docker_image_dataset.py
Created October 24, 2022 15:08
Create dataset of top dockerhub images by popularity
"""Create csv of top dockerhub images by popularity.
Part of the dark matter/darkfiles/diffbom analysis
Help from this SO post: https://stackoverflow.com/questions/43426746/api-to-get-top-docker-hub-images
Created by: John Speed Meyers (jsmeyers@chainguard.dev)
"""
import csv
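The snippet ends after the imports. A sketch of how the Docker Hub query could proceed, based on the v2 repositories endpoint described in the linked SO post; the `page_size` and `ordering` parameters are assumptions about what the original script requested:

```python
import csv
import json
import urllib.request

# Endpoint per the referenced Stack Overflow post; parameters assumed.
HUB_URL = (
    "https://hub.docker.com/v2/repositories/library/"
    "?page_size=100&ordering=pull_count"
)

def extract_repos(payload):
    """Flatten one Docker Hub API page into (name, pull_count) rows."""
    return [(r["name"], r["pull_count"]) for r in payload["results"]]

def write_csv(rows, path="top_docker_images.csv"):
    """Write (image, pull_count) rows to a CSV with a header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "pull_count"])
        writer.writerows(rows)

def fetch_top_images():
    with urllib.request.urlopen(HUB_URL) as resp:
        return extract_repos(json.loads(resp.read().decode()))
```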
jspeed-meyers / calculate_cve_reduction.py
Created September 25, 2022 22:51
For rumble data, calculate the CVE reduction percentage
"""Calculate percentage reduction in cve's by image
Contact John Speed Meyers or Josh Dolitsky for further information.
"""
import pandas as pd
df = pd.read_csv("rumble-2022-08-16-2022-09-14.csv")
IMAGE_LIST = [
["cgr.dev/chainguard/php:latest", "php:latest"],
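IMAGE_LIST is cut off after the first Chainguard/upstream pair. The core calculation the gist title describes can be sketched like this; the `image` and `cve_count` column names are assumptions, since the real rumble CSV schema is not shown:

```python
import pandas as pd

def cve_reduction(df, chainguard_image, upstream_image):
    """Percentage CVE reduction of chainguard_image vs. upstream_image.

    Assumes a long-format frame with 'image' and 'cve_count' columns;
    these names are placeholders for the real rumble schema.
    """
    counts = df.groupby("image")["cve_count"].sum()
    upstream = counts[upstream_image]
    reduced = counts[chainguard_image]
    return 100 * (upstream - reduced) / upstream
```

For example, an upstream image with 50 CVEs against a Chainguard image with 5 yields a 90% reduction.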
jspeed-meyers / clean_rumble_data.py
Created September 25, 2022 22:49
Clean rumble latest.csv for making data public
"""Clean rumble data in preparation for making public.
The latest.csv represents a concatenation of daily vulnerability
scans of image data. This script prepares that csv for making
a subset of this data open source.
"""
import pandas as pd
df = pd.read_csv("latest.csv", parse_dates=["time"])
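The cleaning steps themselves are not shown. One plausible sketch, dropping assumed internal-only columns and keeping only the newest scan per image (both the column names and the dedup rule are guesses at the gist's intent):

```python
import pandas as pd

# Columns assumed internal-only; placeholders for the real schema.
INTERNAL_COLUMNS = ["customer_id", "internal_notes"]

def clean(df):
    """Drop internal columns and keep only the newest scan per image."""
    df = df.drop(columns=[c for c in INTERNAL_COLUMNS if c in df.columns])
    df = df.sort_values("time").drop_duplicates("image", keep="last")
    return df.reset_index(drop=True)
```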
jspeed-meyers / calculate_attack_surface_reduction.py
Created September 20, 2022 19:15
Calculate attack surface reduction percentage for pairs of container images.
"""Calculate attack surface reduction percentage for pairs of container images.
This script calculates the number of packages present in each image and then
calculates the reduction in "attack surface."
Note: Must install syft (https://github.com/anchore/syft) to use.
Author: John Speed Meyers (jsmeyers@chainguard.dev)
"""
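The body of the script is truncated after the docstring. A sketch of the described approach, counting packages per image with syft's JSON output and computing the reduction; the pipeline details are assumptions beyond what the docstring states:

```python
import json
import subprocess

def count_packages(image):
    """Count packages in an image using syft's JSON output.

    Requires syft (https://github.com/anchore/syft) on PATH.
    """
    result = subprocess.run(
        ["syft", image, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return len(json.loads(result.stdout)["artifacts"])

def reduction_percentage(baseline_count, reduced_count):
    """Percentage reduction in package count between two images."""
    return 100 * (baseline_count - reduced_count) / baseline_count
```

Fewer installed packages means fewer components that can carry vulnerabilities, which is the "attack surface" the title refers to.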
jspeed-meyers / deps_dev_retrieve_most_depended_upon_packages.sql
Created July 9, 2022 11:13
Measure the number of dependencies for each version of the most depended-upon packages, using deps.dev data (SQL query)
DECLARE LatestSnapshot TIMESTAMP;
SET LatestSnapshot = (SELECT MAX(Time) FROM `bigquery-public-data.deps_dev_v1.Snapshots`);
WITH
-- Releases includes every release of every package.
Releases AS (
SELECT
System,
Name,
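The SQL is cut off mid-query. Such queries are typically run from Python via the google-cloud-bigquery client; below is a sketch that assembles an illustrative query against the same public dataset. The aggregation is a guess at the gist's intent, not its exact query:

```python
def build_query(system="NPM", limit=100):
    """Assemble an illustrative deps.dev dependency-count query.

    Table names follow the public bigquery-public-data.deps_dev_v1
    dataset; the SELECT body is a sketch, not the original query.
    """
    return f"""
    DECLARE LatestSnapshot TIMESTAMP;
    SET LatestSnapshot = (
      SELECT MAX(Time) FROM `bigquery-public-data.deps_dev_v1.Snapshots`);
    SELECT Name, Version, COUNT(*) AS DependencyCount
    FROM `bigquery-public-data.deps_dev_v1.Dependencies`
    WHERE SnapshotAt = LatestSnapshot AND System = '{system}'
    GROUP BY Name, Version
    ORDER BY DependencyCount DESC
    LIMIT {limit}
    """
```

To execute it you would pass the string to `google.cloud.bigquery.Client().query(...)`, which requires GCP credentials.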
jspeed-meyers / create-scorecards-histogram.py
Created July 9, 2022 00:39
Analyze scorecards data and create a histogram
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("csv/FILENAME.csv")  # path to the scorecards results CSV
# create plot
fig, ax = plt.subplots(figsize=(6,4)) # size of sub-figures
n, _, _ = plt.hist(df.score, bins=[i/4 for i in range(0, 40)])
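The snippet stops before labeling or saving the figure. A self-contained sketch completing it; note the original bin edges (`range(0, 40)`) stop at 9.75, so this version extends them to 10.0 on the assumption that Scorecard scores run 0 to 10:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def plot_score_histogram(scores, out_path="scorecards_histogram.png"):
    """Histogram of OpenSSF Scorecard scores in quarter-point bins."""
    fig, ax = plt.subplots(figsize=(6, 4))
    bins = [i / 4 for i in range(0, 41)]  # 0.0 to 10.0 in 0.25 steps
    n, _, _ = ax.hist(scores, bins=bins)
    ax.set_xlabel("Scorecard score")
    ax.set_ylabel("Number of projects")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return n
```

Calling it with `df.score` from the snippet above reproduces the intended plot and writes it to disk.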