Time Period: 2024-04-30 to 2024-05-07
Period Average: 30 failures per day
Monthly Average: 27 failures per day
Remote Builds: 1656 (68%)
Local Builds: 770 (32%)
-----------------------------
Total number of builds: 2426
Remote Build Failures: 173 (72%)
Local Build Failures: 68 (28%)
-----------------------------
Total number of failures: 241
Right now, ~50% of Release Groups build remotely.
Percent | Reason |
---|---|
47% | Unexplained. Need to dive deeper into the ones with no build output. |
27% | Due to users (application authors), and user addressable. |
21% | Due to FC. Resource issues, network, space, etc. |
5% | Might be caused by FC. I suspect these are cache-related issues. |

Local build failures (68):

Cause | Count | Detail |
---|---|---|
FC | 23 | OSError: [Errno 28] No space left on device |
?? | 15 | No build output |
User | 6 | The engine "node" is incompatible with this module. |
User | 3 | Required frappe-dependency 'erpnext' not found. #stage-validate-dependencies |
User | 3 | Installed frappe-dependency 'frappe' version '15.25.0' does not satisfy required version. #stage-validate-dependencies |
?? | 3 | Unexplained end of output |
User | 3 | Building wheel for dlib (pyproject.toml) did not run successfully. |
User | 2 | error in $APP_NAME setup command... Parse error at "'=>1.23.6'": |
User | 2 | Package $PACKAGE_NAME requires a different Python: 3.7.17 not in '>=3.10' |
FC? | 2 | A directory for the application 'frappe' already exists. #stage-apps-erpnext_shipping |
FC | 1 | Build looks like it succeeded, but the status is Failure |
User | 1 | getting requirements to build editable did not run successfully. ...No such file or directory: 'requirements.txt' |
FC | 1 | failed to receive status: rpc error: code = Canceled desc = context canceled |
FC? | 1 | failed to compute cache key: "/apps/frappe" not found |
FC? | 1 | At least one invalid signature was encountered. (apt-get install, #stage-pre-cmake libgl1-mesa-glx ) |
FC? | 1 | RollupError: Could not resolve "../assets/css/components/ProjectCard.css" from "src/components/ProjectCard.vue" |

Remote build failures (173):

Cause | Count | Detail |
---|---|---|
?? | 95 | No build output |
FC | 25 | failed to read downloaded context: rpc error: code = Unknown desc = no http response from session |
User | 12 | Building wheel for dlib (pyproject.toml) did not run successfully. |
User | 9 | Installed frappe-dependency 'frappe' version '15.25.0' does not satisfy required version. #stage-validate-dependencies |
User | 8 | Required frappe-dependency 'erpnext' not found. #stage-validate-dependencies |
User | 8 | ModuleNotFoundError: No module named |
User | 6 | The engine "node" is incompatible with this module. #stage-apps-APP_NAME |
FC? | 2 | Could not get lock /var/cache/apt/archives/lock. It is held by process 0 #stage-pre-python |
FC? | 2 | No such file or directory: './apps/posawesome_pro/posawesome_pro/__init__.py' #stage-apps-posawesome |
User | 1 | AttributeError: module 'frappe' has no attribute 'twofactor' |
FC? | 1 | RollupError: Could not resolve "../assets/css/components/ProjectCard.css" from "src/components/ProjectCard.vue" |
User | 1 | Could not build wheels for PyKCS11 |
FC? | 1 | Cannot find module '@vitejs/plugin-vue' |
FC? | 1 | failed to compute cache key: failed to calculate checksum of ref ... "/apps/frappe": not found #stage-apps-frappe |
User | 1 | Could not find a version that satisfies the requirement html-sanitizer~=2.2.0 |
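
As a sanity check on the percentage breakdown, here is a small sketch that tallies the counts from the two tables above by cause. The `(cause, count)` pairs are copied from the table rows; the helper names `tally` and `percentages` are just for illustration.

```python
from collections import Counter

# (cause, count) pairs copied row by row from the local and remote tables.
local = [("FC", 23), ("??", 15), ("User", 6), ("User", 3), ("User", 3),
         ("??", 3), ("User", 3), ("User", 2), ("User", 2), ("FC?", 2),
         ("FC", 1), ("User", 1), ("FC", 1), ("FC?", 1), ("FC?", 1), ("FC?", 1)]
remote = [("??", 95), ("FC", 25), ("User", 12), ("User", 9), ("User", 8),
          ("User", 8), ("User", 6), ("FC?", 2), ("FC?", 2), ("User", 1),
          ("FC?", 1), ("User", 1), ("FC?", 1), ("FC?", 1), ("User", 1)]


def tally(rows):
    """Sum failure counts per cause."""
    totals = Counter()
    for cause, count in rows:
        totals[cause] += count
    return totals


def percentages(totals):
    """Convert per-cause counts into whole-number percentages of the total."""
    grand_total = sum(totals.values())
    return {cause: round(100 * n / grand_total) for cause, n in totals.items()}


combined = tally(local) + tally(remote)
print(sum(combined.values()))   # total failures across both tables
print(percentages(combined))    # share of failures per cause
```

The per-table sums come out to 68 and 173, matching the local and remote failure totals, and the combined shares round to 47% (??), 27% (User), 21% (FC), and 5% (FC?).
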
- User-related issues (27%) are being addressed in:
  - frappe/press#1735
  - frappe/press#1747
  - Future PRs that add more cases for user-addressable press notifications.
- Resource issues (21%) will be addressed by:
  - Smarter retries depending on the issue.
  - Better UI so users get an explanation of what happened.
- Cache-related issues (5%) are being addressed in:
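
To make the "smarter retries depending on the issue" idea from the list above a bit more concrete, here is a minimal sketch that classifies build output by matching it against known failure messages. The patterns are taken from the tables above, but the `RULES` list, the `classify` helper, and the retry decisions are all assumptions for illustration, not how press handles this today.

```python
import re

# Hypothetical rules: (pattern, cause, whether an automatic retry seems worthwhile).
# The retry flags are illustrative guesses, not measured behaviour.
RULES = [
    (re.compile(r"No space left on device"), "FC", True),
    (re.compile(r"no http response from session"), "FC", True),
    (re.compile(r"context canceled"), "FC", True),
    (re.compile(r"Could not get lock"), "FC?", True),
    (re.compile(r'The engine "node" is incompatible'), "User", False),
    (re.compile(r"Required frappe-dependency '.+' not found"), "User", False),
    (re.compile(r"requires a different Python"), "User", False),
]


def classify(output: str) -> tuple[str, bool]:
    """Return (cause, should_retry) for a build's output.

    Empty or unmatched output is reported as "??" and is not retried,
    since there is nothing to base a retry decision on.
    """
    if not output.strip():
        return ("??", False)
    for pattern, cause, should_retry in RULES:
        if pattern.search(output):
            return (cause, should_retry)
    return ("??", False)


print(classify("OSError: [Errno 28] No space left on device"))
print(classify('The engine "node" is incompatible with this module.'))
```

The same lookup could also drive the user-facing explanation: a "User" result maps to a notification the author can act on, while an "FC" result maps to a retry on our side.
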
Regarding the unexplained issues, I have a hunch about what the cause might be for some of these (given how often they occur when running remotely):
- Undelivered agent jobs
- Failed to upload build context on remote
- Status sync issues between press and remote builder
I'll dive deeper into these and check what's wrong.
Another issue is people complaining of builds being stuck.
This is because the builds are running remotely, and I suspect that when several agent jobs are pending, some get left out of polling.
This is how polling selection takes place:
```python
import random

# Poll at most 100 pending jobs per cycle, chosen uniformly at random.
pending_ids = [j.job_id for j in pending_jobs]
random_pending_ids = random.sample(pending_ids, k=min(100, len(pending_ids)))
polled_jobs = agent.get_jobs_status(random_pending_ids)
```
Prioritizing build jobs and increasing the polling rate for them should help. With random sampling, a given job can go unpolled for an indeterminate number of cycles once there are more than 100 pending jobs, so a better heuristic could be used. That said, I don't think we often hit the limit of 100 pending jobs.
A longer-term fix involves handling remote build jobs separately from regular agent jobs.
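
One possible shape for a build-first heuristic, as a rough sketch: always include pending build jobs, then fill the remaining slots with a random sample of the other jobs. The `PendingJob` class, its `is_build` flag, and `pick_jobs_to_poll` are hypothetical names for illustration; the actual job records and agent API may look different.

```python
import random
from dataclasses import dataclass


@dataclass
class PendingJob:
    job_id: int
    is_build: bool  # hypothetical flag marking remote build jobs


def pick_jobs_to_poll(pending_jobs, limit=100, rng=random):
    """Pick up to `limit` job ids, giving build jobs priority.

    Build jobs are always included while they fit within the limit;
    the remaining slots are filled by random sampling from other jobs.
    """
    build_ids = [j.job_id for j in pending_jobs if j.is_build]
    other_ids = [j.job_id for j in pending_jobs if not j.is_build]

    if len(build_ids) >= limit:
        # More build jobs than slots: sample among the build jobs only.
        return rng.sample(build_ids, k=limit)

    remaining = limit - len(build_ids)
    return build_ids + rng.sample(other_ids, k=min(remaining, len(other_ids)))


jobs = [PendingJob(i, is_build=False) for i in range(150)]
jobs += [PendingJob(1000 + i, is_build=True) for i in range(5)]
picked = pick_jobs_to_poll(jobs)
print(len(picked), all(1000 + i in picked for i in range(5)))
```

This keeps the existing cap of 100 per polling cycle, so it only changes behaviour when more than 100 jobs are pending, which is the case where builds can currently be skipped.
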
Deploy metrics are missing; I haven't dived deep into those.
That said, there are some improvements I want to make to how builds are handled.
Right now, Deploy Candidate doesn't actually track the build, just initiates it by creating a Deploy, which also doesn't track if all Deploys succeeded.
This isn't a priority for me right now as only 4% of "New Bench" Agent Jobs have failed in the given Time Period.
I mostly won't be going this granular every time. This took half a day, since build outputs are massive.