@18alantom
Created May 7, 2024 08:22

FC Metrics Report 7th May, 24

Screenshot 2024-05-07 at 10 33 00

Time Period: 2024-04-30 to 2024-05-07
Period Average:  30 per day
Monthly Average: 27 per day


Remote Builds:           1656 (68%)
Local Builds:             770 (32%)
-----------------------------
Total number of builds:  2426


Remote Build Failures:    173 (72%)
Local Build Failures:      68 (28%)
-----------------------------
Total number of failures: 241

Right now, ~50% of Release Groups build remotely.

Reasons

Percent  Reason
47%      Inexplicable. The failures with no build output need a deeper look.
27%      Caused by users (application authors), and user-addressable.
21%      Caused by FC. Resource issues: network, space, etc.
 5%      Possibly caused by FC; I suspect these are cache-related issues.
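
As a sanity check, the percentages above can be reproduced by summing the per-cause counts from the Local and Remote tables under Specifics. The sketch below does only that arithmetic; the dictionary layout and the `percentages` helper are illustrative names, not part of any FC tooling.

```python
# Counts per cause, summed from the Local and Remote tables under "Specifics".
counts = {
    "??": 15 + 3 + 95,  # no build output (local + remote), unexplained end of output
    "User": 20 + 46,    # local + remote failures attributed to users
    "FC": 25 + 25,      # local + remote failures attributed to FC
    "FC?": 5 + 7,       # local + remote failures possibly caused by FC
}


def percentages(counts: dict[str, int]) -> dict[str, int]:
    """Return each cause's share of all failures, rounded to a whole percent."""
    total = sum(counts.values())
    return {cause: round(100 * n / total) for cause, n in counts.items()}


for cause, pct in percentages(counts).items():
    print(f"{pct:>3}% {cause}")
```

The counts sum to 241, which matches the total number of failures reported above.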

Specifics

Local Builds (68)

Cause  Count  Detail
FC        23  OSError: [Errno 28] No space left on device
??        15  No build output
User       6  The engine "node" is incompatible with this module.
User       3  Required frappe-dependency 'erpnext' not found. #stage-validate-dependencies
User       3  Installed frappe-dependency 'frappe' version '15.25.0' does not satisfy required version. #stage-validate-dependencies
??         3  Unexplained end of output
User       3  Building wheel for dlib (pyproject.toml) did not run successfully.
User       2  error in $APP_NAME setup command... Parse error at "'=>1.23.6'":
User       2  Package $PACKAGE_NAME requires a different Python: 3.7.17 not in '>=3.10'
FC?        2  A directory for the application 'frappe' already exists. #stage-apps-erpnext_shipping
FC         1  Build appears to have succeeded, but the status is Failure
User       1  getting requirements to build editable did not run successfully. ...No such file or directory: 'requirements.txt'
FC         1  failed to receive status: rpc error: code = Canceled desc = context canceled
FC?        1  failed to compute cache key: "/apps/frappe" not found
FC?        1  At least one invalid signature was encountered. (apt-get install, #stage-pre-cmake libgl1-mesa-glx)
FC?        1  RollupError: Could not resolve "../assets/css/components/ProjectCard.css" from "src/components/ProjectCard.vue"

Remote Builds (173)

Cause  Count  Detail
??        95  No build output
FC        25  failed to read downloaded context: rpc error: code = Unknown desc = no http response from session
User      12  Building wheel for dlib (pyproject.toml) did not run successfully.
User       9  Installed frappe-dependency 'frappe' version '15.25.0' does not satisfy required version. #stage-validate-dependencies
User       8  Required frappe-dependency 'erpnext' not found. #stage-validate-dependencies
User       8  ModuleNotFoundError: No module named
User       6  The engine "node" is incompatible with this module. #stage-apps-APP_NAME
FC?        2  Could not get lock /var/cache/apt/archives/lock. It is held by process 0 #stage-pre-python
FC?        2  No such file or directory: './apps/posawesome_pro/posawesome_pro/__init__.py' #stage-apps-posawesome
User       1  AttributeError: module 'frappe' has no attribute 'twofactor'
FC?        1  RollupError: Could not resolve "../assets/css/components/ProjectCard.css" from "src/components/ProjectCard.vue"
User       1  Could not build wheels for PyKCS11
FC?        1  Cannot find module '@vitejs/plugin-vue'
FC?        1  failed to compute cache key: failed to calculate checksum of ref ... "/apps/frappe": not found #stage-apps-frappe
User       1  Could not find a version that satisfies the requirement html-sanitizer~=2.2.0
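
The categorisation in the two tables above was done by hand, but a first pass could be automated by matching known error strings in the build output. The following is a minimal sketch, assuming the build output is available as a single string. The rule list is a subset of the Detail column above, and `RULES` and `classify` are hypothetical names, not part of press.

```python
import re

# (cause, pattern) pairs. The patterns are copied from the Detail column of
# the tables above; the list is illustrative and deliberately incomplete.
RULES = [
    ("FC", re.compile(r"No space left on device")),
    ("FC", re.compile(r"no http response from session")),
    ("FC", re.compile(r"context canceled")),
    ("User", re.compile(r'The engine "node" is incompatible')),
    ("User", re.compile(r"Required frappe-dependency '.+' not found")),
    ("User", re.compile(r"does not satisfy required version")),
    ("User", re.compile(r"ModuleNotFoundError: No module named")),
    ("User", re.compile(r"Building wheel for .+ did not run successfully")),
    ("FC?", re.compile(r"failed to compute cache key")),
    ("FC?", re.compile(r"Could not get lock")),
]


def classify(build_output: str) -> str:
    """Return the first matching cause, or "??" if nothing matches."""
    if not build_output.strip():
        return "??"  # corresponds to "No build output"
    for cause, pattern in RULES:
        if pattern.search(build_output):
            return cause
    return "??"
```

Anything that falls through to "??" would still need manual inspection, so this only reduces the manual work for the failure types that have already been seen.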

Action

  • User related issues (27%) being addressed in:
  • Resource issues (21%) will be addressed by:
    • Smarter retries depending on the issue.
    • Better UI so users get an explanation of what happened.
  • Cache related issues (5%) being addressed in:
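
One way to read "smarter retries depending on the issue" from the list above is a small policy that maps a failure's cause to a retry decision. The sketch below is an assumption about how such a policy could look, not how FC implements it. The `RetryDecision` type, the `decide_retry` function, the 30-second base delay and the limit of 3 attempts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    delay_seconds: int
    note: str  # explanation that could be surfaced to the user in the UI


def decide_retry(cause: str, attempt: int, max_attempts: int = 3) -> RetryDecision:
    """Decide whether to retry a failed build.

    `cause` uses the labels from the tables above ("User", "FC", "FC?", "??").
    `attempt` is the number of retries already made (0 for the first failure).
    """
    if cause == "User":
        # Retrying cannot fix an incompatible dependency or a missing file.
        return RetryDecision(False, 0, "The app needs a fix before rebuilding.")
    if cause in ("FC", "FC?"):
        if attempt < max_attempts:
            # Exponential backoff: 30s, 60s, 120s (assumed values).
            return RetryDecision(True, 30 * 2**attempt, "Platform issue, retrying.")
        return RetryDecision(False, 0, "Platform issue, retry limit reached.")
    # Unknown cause: retry once, then stop and flag for investigation.
    if attempt == 0:
        return RetryDecision(True, 0, "Unknown failure, retrying once.")
    return RetryDecision(False, 0, "Unknown failure, needs investigation.")
```

The `note` field ties into the second bullet: the same decision that drives the retry can also supply the explanation shown to the user.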

Regarding the inexplicable failures, I have a hunch about the cause of some of them, given how many more occur when building remotely:

  • Undelivered agent jobs
  • Failed to upload build context on remote
  • Status sync issues between press and remote builder

I'll dig deeper into these and check what's wrong.

Other

Stuck Builds

Another issue is users complaining that their builds are stuck.

This happens because the builds run remotely, and I suspect that when many agent jobs are pending, some get left out of polling.

This is how polling selection takes place:

import random

# Poll at most 100 pending jobs per cycle, chosen uniformly at random.
pending_ids = [j.job_id for j in pending_jobs]
random_pending_ids = random.sample(pending_ids, k=min(100, len(pending_ids)))
polled_jobs = agent.get_jobs_status(random_pending_ids)

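Prioritizing build jobs and increasing the polling rate for them should help. When more than 100 jobs are pending, the random sample above gives each job only a chance of being polled in a given cycle, so a job can be skipped repeatedly and its delay has no upper bound; a better heuristic could be used. That said, I don't think we often hit the limit of 100 pending jobs.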
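
A minimal sketch of what prioritizing build jobs during selection could look like: build jobs are always included first, and only the remaining slots are filled by random sampling. This assumes each pending job can be identified as a build job. The `Job` class, the `is_build` flag and `select_jobs_to_poll` are hypothetical names for illustration, not press code.

```python
import random
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    is_build: bool  # hypothetical flag marking remote build jobs


def select_jobs_to_poll(pending_jobs: list[Job], limit: int = 100) -> list[str]:
    """Pick up to `limit` job ids, taking build jobs before any others."""
    build_ids = [j.job_id for j in pending_jobs if j.is_build]
    other_ids = [j.job_id for j in pending_jobs if not j.is_build]

    # Build jobs take the first slots, in the order they appear in pending_jobs.
    selected = build_ids[:limit]

    # Fill whatever is left with a random sample of the other jobs.
    remaining = limit - len(selected)
    if remaining > 0:
        selected += random.sample(other_ids, k=min(remaining, len(other_ids)))
    return selected
```

With this selection a build job is skipped only when more than `limit` build jobs are pending at once, whereas non-build jobs keep the existing random behaviour.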

A longer-term fix would involve handling remote build jobs separately from regular agent jobs.

Deploys

Deploy metrics are missing; I haven't looked into those in depth.

That said, there are some improvements I want to make to how builds are handled.

Right now, Deploy Candidate doesn't actually track the build, just initiates it by creating a Deploy, which also doesn't track if all Deploys succeeded.

This isn't a priority for me right now, as only 4% of "New Bench" Agent Jobs failed in the given time period.

Screenshot 2024-05-07 at 13 38 45


I probably won't go this granular every time. This took half a day, since build outputs are massive.
