@carceneaux
Last active April 22, 2024 10:02
Script for removing GitLab Job Artifacts.
#!/bin/bash
#
# Written by Chris Arceneaux
# GitHub: https://github.com/carceneaux
# Email: carcenea@gmail.com
# Website: http://arsano.ninja
#
# Note: This code is a stop-gap to erase Job Artifacts for a project. I HIGHLY recommend you leverage
# "artifacts:expire_in" in your .gitlab-ci.yml
#
# https://docs.gitlab.com/ee/ci/yaml/#artifactsexpire_in
#
# Software Requirements: curl, jq
#
# This code has been released under the terms of the Apache-2.0 license
# http://opensource.org/licenses/Apache-2.0
# project_id, find it here: https://gitlab.com/[organization name]/[repository name] at the top underneath repository name
project_id="207"
# token, find it here: https://gitlab.com/profile/personal_access_tokens
token="9hjGYpwmsMfBxT-Ghuu7"
server="gitlab.com"
# Retrieving Jobs list page count
total_pages=$(curl -sD - -o /dev/null -X GET \
  "https://$server/api/v4/projects/$project_id/jobs?per_page=100" \
  -H "PRIVATE-TOKEN: ${token}" | grep -Fi X-Total-Pages | sed 's/[^0-9]*//g')
# Creating list of Job IDs for the Project specified with Artifacts
job_ids=()
echo ""
echo "Creating list of all Jobs that currently have Artifacts..."
echo "Total Pages: ${total_pages}"
for ((i=2;i<=${total_pages};i++)) #starting with page 2 skipping most recent 100 Jobs
do
    echo "Processing Page: ${i}/${total_pages}"
    response=$(curl -s -X GET \
      "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}" \
      -H "PRIVATE-TOKEN: ${token}")
    length=$(echo $response | jq '. | length')
    for ((j=0;j<${length};j++))
    do
        if [[ $(echo $response | jq ".[${j}].artifacts_file | length") -gt 0 ]]; then
            echo "Job found: $(echo $response | jq ".[${j}].id")"
            job_ids+=($(echo $response | jq ".[${j}].id"))
        fi
    done
done
# Loop through each Job erasing the Artifact(s)
echo ""
echo "${#job_ids[@]} Jobs found. Commencing removal of Artifacts..."
for job_id in ${job_ids[@]};
do
    response=$(curl -s -X DELETE \
      -H "PRIVATE-TOKEN: ${token}" \
      "https://$server/api/v4/projects/$project_id/jobs/$job_id/artifacts")
    echo "Processing Job ID: ${job_id} - Status: $(echo $response | jq '.status')"
done
@tamasgal

Thanks, still works fine on self-hosted GitLab EE 12.2 👍

@carceneaux
Author

Thanks! 🍻 Glad to hear!

@YoungPyDawan

YoungPyDawan commented Sep 30, 2019

@carceneaux
Author

Thanks @YoungPyDawan! I've modified the API call so that only the artifacts are deleted. Good to see that API call was added. 😄

@Kage-Yami

FYI... I came across this today in my search for an easy way to delete all artifacts for a project; unfortunately, it won't work in all cases (like mine) due to X-Total-Pages being omitted when the item count is greater than 10,000.

@carceneaux
Author

@Kage-Yami - Thanks for the heads up! I'll work on an updated version of the code. The fix is to not worry about X-Total-Pages and simply check for the X-Next-Page header and key off of it instead.

As I'm pretty busy right now, it'll take a week or two for me to get to this. If you get the code sorted before then, please share. 😄

Here's the link mentioning the new logic to be used if you're interested:

https://gitlab.com/gitlab-org/gitlab-foss/-/merge_requests/23931/diffs#34fe105b9f0ef77edad95de0c13084ff7f54c344_260_298
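
For reference, here is a minimal sketch of that X-Next-Page approach (not the promised update; it reuses the server, project_id and token variables from the script above and assumes GitLab leaves X-Next-Page empty or absent on the last page):

# Sketch: paginate by following the X-Next-Page header instead of X-Total-Pages.
# Starts at page 1; adjust if you want to keep the most recent Jobs.
page=1
while [ -n "$page" ]; do
    echo "Processing Page: ${page}"
    headers=$(mktemp)
    response=$(curl -s -D "$headers" -X GET \
      "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${page}" \
      -H "PRIVATE-TOKEN: ${token}")
    # ... collect Job IDs with Artifacts from $response here, as in the script above ...
    # X-Next-Page is empty/absent on the last page, which ends the loop.
    page=$(grep -Fi 'X-Next-Page' "$headers" | tr -d '[:space:]' | cut -d: -f2)
    rm -f "$headers"
done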

@Kage-Yami

I ended up writing my own version that accepts the number of pages as an argument (with project and token also being arguments); I manually determined the page count by trial-and-error beforehand. So not great, but it got the job done.

I could probably adapt it to loop endlessly and simply exit once X-Next-Page either vanishes or equals the current page (haven't looked into what GitLab sends on the last page)... But I don't really need the script anymore, so probably won't bother.

Though as a bonus, mine is parallelised a bit; I was lucky and didn't need to worry about rate-limiting as I was only averaging around 300 calls a minute (out of the maximum of 600).
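
For anyone wanting the same effect with the bash script above, one rough way to parallelise just the deletion step is a sketch like this (it assumes the job_ids array already collected by the original script; keep the worker count low enough to stay under the API rate limit):

# Sketch: delete Artifacts with a few parallel workers via xargs -P.
printf '%s\n' "${job_ids[@]}" | xargs -P 4 -I {} \
    curl -s -o /dev/null -X DELETE \
      -H "PRIVATE-TOKEN: ${token}" \
      "https://$server/api/v4/projects/$project_id/jobs/{}/artifacts"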

@Atarity

Atarity commented Apr 28, 2020

For some reason it did not remove artifacts from the first page of my pipelines list. It also missed the .status attribute in the console log. The rest is as advertised, thanks!

@philipptempel

philipptempel commented Jul 21, 2020

For some reason it did not remove artifacts from the first page of my pipelines list. It also missed the .status attribute in the console log. The rest is as advertised, thanks!

@Atarity Check the source code and you will find the hint #starting with page 2 skipping most recent 100 Jobs, so it is intended that the first page of artifacts is not removed.

@voiski

voiski commented Sep 23, 2020

The response can contain JSON with embedded line breaks (\n). Consider removing them like ${response//\\n/}:

response=${response//\\n/}
length=$(echo $response | jq '. | length')

Also, you can easily simulate next-page detection by checking [ $length -ne 0 ] and letting the page loop run to 1000 or more.
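
A rough sketch of that idea, reusing the variables from the script above (the 1000-page cap is arbitrary):

# Sketch: loop pages until an empty response instead of relying on X-Total-Pages.
for ((i=2;i<=1000;i++)); do
    response=$(curl -s -X GET \
      "https://$server/api/v4/projects/$project_id/jobs?per_page=100&page=${i}" \
      -H "PRIVATE-TOKEN: ${token}")
    response=${response//\\n/}               # strip escaped line breaks, as suggested above
    length=$(echo "$response" | jq '. | length')
    [ "$length" -eq 0 ] && break             # an empty page means we are past the last page
    # ... collect Job IDs with Artifacts here, as in the script above ...
done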

@mitar

mitar commented Jan 7, 2021

I made the following Python script, which works for over 10k jobs, too:

#!/usr/bin/env python3

import time

import requests

project_id = '...'
token = '...'
server = 'gitlab.com'

print("Creating list of all jobs that currently have artifacts...")
# We skip the first page.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=2"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )

    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue

    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        if job.get('artifacts_file', None):
            job_id = job['id']
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f"Processing job ID: {job_id} - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)

@tamasgal

tamasgal commented Nov 5, 2021

The if job.get('artifacts_file', None): needs to be changed to if job.get('artifacts', None): in the current version of the API, at least I don't see artifacts_file in any of the JSON responses.

@mitar

mitar commented Nov 7, 2021

@tamasgal

tamasgal commented Nov 7, 2021

I don't know why, but none of the jobs on our server had artifacts_file; they had artifacts instead, where the artifacts were listed along with their sizes etc.

@willstott101

willstott101 commented Dec 9, 2021

"artifacts_file" worked for me, but it's trivial to support both, I also tweaked the output so you can see what job failed if any, and made it start at the first page:

#!/usr/bin/env python3

import time

import requests

project_id = '...'
token = '...'
server = 'gitlab.com'
start_page = 1

print("Creating list of all jobs that currently have artifacts...")
# Start from the configured start_page (the first page by default).
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )

    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue

    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        if job.get('artifacts_file', None) or job.get('artifacts', None):
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)

@kbaran1998

While the script deletes jobs' artifacts, you can also delete the project's artifacts by adding this code:

url = f"https://{server}/api/v4/projects/{project_id}/artifacts"
delete_response = requests.delete(
    url,
    headers={
        'private-token': token,
    }
)
print(f" - status: {delete_response.status_code}")

@Muffinman

This does not work if your project has more than 10000 jobs, due to the removal of the X-Total-Pages header from the GitLab API responses.

@cmuller

cmuller commented Jul 7, 2023

Yes, I just found out that the X-Total-Pages header is now missing for performance reasons. Fortunately, when a page number is too high, an empty JSON list ([]) is returned, so it is quite easy to use a loop such as this (here in bash):

PER_PAGE=100
PAGE=1
while JOBS=$(curl -s --header "PRIVATE-TOKEN: $TOKEN" "$GITLAB_INSTANCE/$PROJECT_ID/jobs?per_page=$PER_PAGE&page=$PAGE&sort=asc") && [ "$JOBS" != "[]" ]
do
   for JOB in $(echo $JOBS | jq .[].id)
   do
      [...]
   done
   PAGE=$((PAGE+1))
done

@mikeller

mikeller commented Dec 7, 2023

Here's my slightly improved version for the 'do it in python' section (ignores job.log files which seem to be non-deletable, uses command line arguments to load the settings):

#!/usr/bin/env python3

import time
import requests
import sys

server = sys.argv[1]
project_id = sys.argv[2]
token = sys.argv[3]
start_page = sys.argv[4]

print("Creating list of all jobs that currently have artifacts...")
# We skip the first page.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )

    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue

    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        artifacts = job.get('artifacts_file', None)
        if not artifacts:
            artifacts = job.get('artifacts', None)

        has_artifacts = False
        for artifact in artifacts:
            if artifact['filename'] != 'job.log':
                has_artifacts = True
                break

        if has_artifacts:
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)

@Tim-Schwalbe

Here's my slightly improved version for the 'do it in python' section (ignores job.log files which seem to be non-deletable, uses command line arguments to load the settings):

#!/usr/bin/env python3

import time
import requests
import sys

server = sys.argv[1]
project_id = sys.argv[2]
token = sys.argv[3]
start_page = sys.argv[4]

print("Creating list of all jobs that currently have artifacts...")
# We skip the first page.
url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page={start_page}"
while url:
    print(f"Processing page: {url}")
    response = requests.get(
        url,
        headers={
            'private-token': token,
        },
    )

    if response.status_code in [500, 429]:
        print(f"Status {response.status_code}, retrying.")
        time.sleep(10)
        continue

    response.raise_for_status()
    response_json = response.json()
    for job in response_json:
        artifacts = job.get('artifacts_file', None)
        if not artifacts:
            artifacts = job.get('artifacts', None)

        has_artifacts = False
        for artifact in artifacts:
            if artifact['filename'] != 'job.log':
                has_artifacts = True
                break

        if has_artifacts:
            job_id = job['id']
            print(f"Processing job ID: {job_id}", end="")
            delete_response = requests.delete(
                f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                headers={
                    'private-token': token,
                },
            )
            print(f" - status: {delete_response.status_code}")

    url = response.links.get('next', {}).get('url', None)

I get this error:

remove_artifacts.py", line 38, in <module>
    if artifact['filename'] != 'job.log':
       ~~~~~~~~^^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'

@mikeller

@Tim-Schwalbe: Apologies, yes, I overlooked this case. I have amended the script to ignore artifacts_file, as this file seems to be contained in artifacts anyway.

I have improved my version a bit: it now automatically selects for deletion expired artifacts that (in my opinion) should be deleted in the first place, because they belong to jobs that were run on:

  • merge requests that have been merged or closed;
  • branches that have been merged.

It will also take a list of project IDs as the last arguments, making it easy to use in a cron job: Usage: {sys.argv[0]} <server> <token> <project id>...

#!/usr/bin/env python3

import time
import requests
import sys
from datetime import datetime, timezone
from dateutil import parser
import re

if len(sys.argv) < 4:
    print(f'Usage: {sys.argv[0]} <server> <token> <project id>...')

    exit(1)

server = sys.argv[1]
token = sys.argv[2]
project_ids = []
for i in range(3, len(sys.argv)):
    project_ids.append(sys.argv[i])


now = datetime.now(timezone.utc)

overall_space_savings = 0
for project_id in project_ids:
    print(f'Processing project {project_id}:')

    merge_request_url = f"https://{server}/api/v4/projects/{project_id}/merge_requests?scope=all&per_page=100&page=1"
    merge_requests = {}
    while merge_request_url:
        response = requests.get(
            merge_request_url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()

        for merge_request in response_json:
            iid = merge_request.get('iid', None)
            if iid:
                merge_requests[int(iid)] = merge_request['state']

        merge_request_url = response.links.get('next', {}).get('url', None)

    branch_url = f"https://{server}/api/v4/projects/{project_id}/repository/branches?per_page=100&page=1"
    unmerged_branches = []
    while branch_url:
        response = requests.get(
            branch_url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()

        for branch in response_json:
            is_merged = branch['merged']
            if not is_merged:
                unmerged_branches.append(branch['name'])

        branch_url = response.links.get('next', {}).get('url', None)


    url = f"https://{server}/api/v4/projects/{project_id}/jobs?per_page=100&page=1"

    job_count = 0
    artifact_count = 0
    artifact_size = 0
    deleted_artifact_count = 0
    deleted_artifact_size = 0
    while url:
        response = requests.get(
            url,
            headers={
                'private-token': token,
            },
        )

        if response.status_code in [500, 429]:
            print(f"Status {response.status_code}, retrying.")
            time.sleep(10)
            continue

        response.raise_for_status()
        response_json = response.json()
        for job in response_json:
            job_count += 1

            artifacts = job.get('artifacts', None)
            artifacts_expire_at_string = job.get('artifacts_expire_at', None)
            artifacts_expire_at = None
            if artifacts_expire_at_string:
                artifacts_expire_at = parser.parse(artifacts_expire_at_string)

            has_expired_artifacts = False
            deleted_job_artifact_count = 0
            deleted_job_artifact_size = 0
            if artifacts:
                for artifact in artifacts:
                    if artifact['filename'] != 'job.log':
                        size = artifact['size']

                        artifact_count += 1
                        artifact_size += size

                        if not artifacts_expire_at or artifacts_expire_at < now:
                            has_expired_artifacts = True
                            deleted_job_artifact_count += 1
                            deleted_job_artifact_size += size


            delete_artifacts = False
            if has_expired_artifacts:
                ref = job['ref']
                merge_request_iid_match = re.search(r'refs\/merge-requests\/(\d+)\/head', ref)
                if merge_request_iid_match:
                    merge_request_iid = merge_request_iid_match.group(1)
                    if merge_request_iid:
                        merge_request_status = merge_requests.get(int(merge_request_iid))
                        if merge_request_status in ['merged', 'closed', None]:
                            delete_artifacts = True
                            deleted_artifact_count += deleted_job_artifact_count
                            deleted_artifact_size += deleted_job_artifact_size

                elif ref not in unmerged_branches:
                    delete_artifacts = True
                    deleted_artifact_count += deleted_job_artifact_count
                    deleted_artifact_size += deleted_job_artifact_size

            if delete_artifacts:
                job_id = job['id']
                print(f"Processing job ID: {job_id}", end="")
                delete_response = requests.delete(
                    f"https://{server}/api/v4/projects/{project_id}/jobs/{job_id}/artifacts",
                    headers={
                        'private-token': token,
                    },
                )
                print(f" - status: {delete_response.status_code}\033[K", end = "\r")


        print(f'Processed page {url}.\033[K', end = "\r")

        url = response.links.get('next', {}).get('url', None)

    overall_space_savings += deleted_artifact_size

    print()
    print(f'Jobs analysed: {job_count}')
    print(f'Pre artifact count: {artifact_count}')
    print(f'Pre artifact size [MB]: {artifact_size / (1024 * 1024)}')
    print(f'Post artifact count: {artifact_count - deleted_artifact_count}')
    print(f'Post artifact size [MB]: {(artifact_size - deleted_artifact_size) / (1024 * 1024)}')
    print()

print(f'Overall savings [MB]: {overall_space_savings / (1024 * 1024)}')
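
A hypothetical invocation for a cron job (the script file name, the example server, and the second project ID are placeholders; the token is read from an environment variable):

python3 cleanup_artifacts.py gitlab.example.com "$GITLAB_TOKEN" 207 208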

@voiski

voiski commented Dec 13, 2023

@mikeller I suggest you put your script in a gist of its own, or even fork this one and replace it with your Python code =)

Each gist indicates which forks have activity, making it easy to find interesting changes from others.

@mikeller

@mikeller

New version that takes a GitLab group id as a parameter and then cleans up all repositories in the group: https://gist.github.com/mikeller/7034d99bc27c361fc6a2df84e19c36ff
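
For context, such a group-level cleanup presumably enumerates the group's projects via GET /groups/:id/projects and then runs a per-project cleanup on each. A rough curl/jq sketch in the style of the original script (group_id is a placeholder; server and token as above):

# Sketch: list all project IDs in a group (including subgroups), then clean each one.
page=1
while :; do
    projects=$(curl -s -H "PRIVATE-TOKEN: ${token}" \
      "https://$server/api/v4/groups/$group_id/projects?include_subgroups=true&per_page=100&page=${page}")
    [ "$projects" = "[]" ] && break
    for project_id in $(echo "$projects" | jq '.[].id'); do
        echo "Cleaning up Artifacts for project ${project_id}..."
        # ... run the per-project Artifact cleanup here ...
    done
    page=$((page+1))
done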
