{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "status": {
      "type": "string",
      "description": "A regex pattern to match statuses, in the case of tests, ignored when matching a build node."
    },
    "valid_build": {
      "type": "boolean",
      "description": "Indicates if the build is valid, in the case of build type, ignored when matching a test node."
    },
    "tree_name": {
      "type": "string",
      "description": "A regex pattern to match the name of the tree."
    },
    "git_repository_url": {
      "type": "string",
      "description": "A regex pattern to match the URL of the Git repository."
    },
    "git_repository_branch": {
      "type": "string",
      "description": "A regex pattern to match the branch of the Git repository."
    },
    "architecture": {
      "type": "string",
      "description": "A regex pattern to match the architecture type."
    },
    "compiler": {
      "type": "string",
      "description": "A regex pattern to match the compiler used."
    },
    "origin": {
      "type": "string",
      "description": "A regex pattern to match the origin of the build or test."
    },
    "log": {
      "type": "string",
      "description": "A regex pattern used to track log lines. If 'PASS' is in the status, this won't be considered for PASS nodes, only for other statuses."
    },
    "path": {
      "type": "string",
      "description": "A regex pattern used to match paths, in the case of tests, ignored when matching a build node."
    },
    "config": {
      "type": "string",
      "description": "A regex pattern to match the configuration name."
    },
    "platform": {
      "type": "string",
      "description": "A regex pattern to match the platform name."
    },
    "output_file": {
      "type": "string",
      "description": "A regex pattern to match the name of one of the output files."
    }
  },
  "anyOf": [
    {"required": ["status"]},
    {"required": ["valid_build"]},
    {"required": ["tree_name"]},
    {"required": ["git_repository_url"]},
    {"required": ["git_repository_branch"]},
    {"required": ["architecture"]},
    {"required": ["compiler"]},
    {"required": ["origin"]},
    {"required": ["log"]},
    {"required": ["path"]},
    {"required": ["config"]},
    {"required": ["platform"]},
    {"required": ["output_file"]}
  ],
  "description": "This schema defines the structure for the auto-matching configuration, to auto-match incidents to the issue where this configuration is attached. All string fields are regex patterns by default. If a field is not specified, it will match all values. At least one property is required."
}
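For illustration, a pattern object conforming to this schema might look like the following (all values here are invented examples, not taken from a real issue):

```json
{
  "status": "FAIL|ERROR",
  "tree_name": "^mainline$",
  "path": "boot\\..*",
  "log": "Oops: Kernel access of bad area"
}
```

Since at least one of the `anyOf` properties is present, the object validates; any field left out (e.g. `architecture`) would match all values.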
- Load the arrived object (update) into the database, so it could do its job of merging updates.
- Retrieve the merged object.
What do you mean by "merging updates" and what's a "merged object" in this context? Do you mean new issues that are somehow merged with existing ones?
Can we work on a concrete example for each use case? See, the potential (but critical) problem I'm seeing is how to match test/build failures against issues, in particular. How to get the DB to perform the matches effectively if the fields to match against are custom-defined per issue.
In other words, what I'm asking is: can you find a way to do this that performs better than O(n) where n is the number of issues or the number of test / build failures in the DB?
What do you mean by "merging updates" and what's a "merged object" in this context? Do you mean new issues that are somehow merged with existing ones?
The KCIDB reporting protocol allows omitting most of the object fields (missing ones are stored as NULL in the database). If the same object (with the same ID) is sent multiple times, each field gets a non-deterministically chosen value from across all copies, with NULL chosen only if there are no other values. This allows a CI system to submit an incomplete object and send additional fields later, such as sending information on an incomplete build/test and then sending the results after it completes. It doesn't allow editing already-present values (you could certainly try, but you can't predict what you'll get).
This mechanism is implemented in the database, and to apply it you have to load the data in there. This concerns all types of objects.
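The merge semantics described above can be sketched in a few lines of Python (a hypothetical helper, not the actual KCIDB implementation):

```python
# Sketch of the merge rule: for each field, any non-NULL value submitted
# across the copies may win; NULL wins only if every copy has NULL.
import random

def merge_objects(copies):
    """Merge several partial copies of the same object (same ID)."""
    merged = {}
    fields = {f for copy in copies for f in copy}
    for field in fields:
        values = [copy[field] for copy in copies
                  if copy.get(field) is not None]
        # Non-deterministic choice among non-NULL values; NULL otherwise.
        merged[field] = random.choice(values) if values else None
    return merged

# A CI system first submits an incomplete build, then the result later:
first = {"id": "build:1", "architecture": "x86_64", "valid": None}
later = {"id": "build:1", "valid": True}
merged = merge_objects([first, later])
# merged["valid"] is True (the only non-NULL value); architecture survives.
```

This also shows why editing is unreliable: if two copies carried different non-NULL values for the same field, either one could end up in the merged object.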
Can we work on a concrete example for each use case? See, the potential (but critical) problem I'm seeing is how to match test/build failures against issues, in particular. How to get the DB to perform the matches effectively if the fields to match against are custom-defined per issue.
Yeah, I could write one, I suppose. Possibly later, as it would take some effort. Perhaps I could write an SQL query for pre-matching. For now, I think our join conditions could be built to handle NULLs (missing fields). Something like this: tests.status LIKE COALESCE(issues.test_pattern.status, '%'). We could tweak the actual expression for performance, but that's the way, I think.
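A runnable sketch of that NULL-tolerant join condition, using SQLite for brevity (the table and column names are illustrative, not the actual KCIDB schema):

```python
# An issue with a NULL pattern field should match everything; a non-NULL
# pattern restricts the join. COALESCE(pattern, '%') supplies the
# match-all wildcard for the missing field.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE issues (id TEXT, test_status_pattern TEXT);
    CREATE TABLE tests (id TEXT, status TEXT);
    INSERT INTO issues VALUES ('issue:1', 'FAIL'),  -- matches FAIL only
                              ('issue:2', NULL);    -- missing: matches all
    INSERT INTO tests VALUES ('test:1', 'FAIL'), ('test:2', 'PASS');
""")
rows = db.execute("""
    SELECT issues.id, tests.id
      FROM issues JOIN tests
        ON tests.status LIKE COALESCE(issues.test_status_pattern, '%')
     ORDER BY issues.id, tests.id
""").fetchall()
# issue:1 matches only the FAIL test; issue:2 (no pattern) matches both.
```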
In other words, what I'm asking is: can you find a way to do this that performs better than O(n) where n is the number of issues or the number of test / build failures in the DB?
Well, my bet is that we would be able to employ indexes in the database, at least for simple fields, which would get us something like O(log n); then we would concentrate on shifting the most frequent matching to the indexes by using simple and sometimes not-so-simple fields.
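To make the indexing idea concrete, here is a small sketch (SQLite shown for brevity; PostgreSQL behaves analogously, and the schema is illustrative): a plain B-tree index on a simple field lets the planner narrow the candidate set in O(log n) before any fancier pattern matching runs.

```python
# With an index on tree_name, an equality (or anchored-prefix) condition
# becomes an index search instead of a full table scan.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tests (id TEXT, tree_name TEXT, status TEXT);
    CREATE INDEX idx_tests_tree ON tests (tree_name);
""")
plan = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM tests WHERE tree_name = 'mainline'"
).fetchall()
# The plan's detail column should report an index search, not a scan.
```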
I'd like to add that we don't have to go through the whole process of formalizing and implementing this before LPC. We could discuss and decide on something for the start, stuff it into the misc field in issues, and implement incident generation on the side. We would get real incidents and be able to show them off, would get to experience the practical side, and then could formalize them. Most importantly, we would have a chance at having something working by LPC. We won't be able to show the pattern fields in the dashboards, but we could certainly show their results: incidents.
For now, I think our join conditions could be built to handle NULLs (missing fields). Something like this: tests.status LIKE COALESCE(issues.test_pattern.status, '%'). We could tweak the actual expression for performance, but that's the way, I think.
If there's a way to make queries work on custom sets of conditions, or a universal way to run queries based on any set of conditions, then that'd work. As long as we can properly leverage the DBMS to do the matching and the process is scalable I'm fine with it. But I really think this should come first, because having a matching process that'll get slower as the DB grows is not a solution.
But I really think this should come first, because having a matching process that'll get slower as the DB grows is not a solution.
Agreed.
The pattern schema should describe a finite list of possible fields. Missing JSON fields would be considered NULL in the database.
There are at least two events when we need to match issues to builds/tests:
In both of these cases, having indexes on the basic fields to match could really speed up narrowing down the range of issues/objects to match against each other using the remaining fields. Unfortunately, having a JSONB field makes things more difficult here too (in addition to the aforementioned upgrade conversion).
Regardless of which kind of storage we choose, JSONB or separate columns (which I prefer right now), I would say the processing could go like this (for a start, and only considering direct object matching, not indirect matching of connected objects, for simplicity):
We don't have to do this separately for every updated object, I think. We can actually match objects to opposite objects in a batch, and then narrow down the fancy matches on all of them in one go. We can also use ORM queries, so we get ORM-schema data out of the database, and use the OO representation to have an easier time matching things, and to support matching fancier things generated in the OO layer (like calculated test suite statuses). For that reason, it might be better to structure the matching fields in the issues to line up with ORM objects, not with the raw objects in the DB.
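One way to get a "universal" query over whatever subset of pattern fields an issue defines is to generate the join condition from a fixed list of known fields, with COALESCE handling the missing ones. A hypothetical sketch (column names are illustrative, not the real schema):

```python
# Build one combined SQL condition from a finite list of pattern fields.
# Fields absent from an issue are NULL in the DB and fall back to '%',
# so they match everything; present fields restrict the join.
def build_match_condition(issue_alias="issues", test_alias="tests",
                          fields=("status", "path", "config")):
    """Return a SQL fragment joining tests to issues on all pattern columns."""
    clauses = [
        f"{test_alias}.{f} LIKE COALESCE({issue_alias}.{f}_pattern, '%')"
        for f in fields
    ]
    return " AND ".join(clauses)

cond = build_match_condition()
print(cond)
```

Because the field list is finite (the schema above guarantees that), the condition is static per schema version and the planner can work with it, rather than with per-issue custom queries.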
For indirect object matching, such as checking a build config when matching tests, we would have to add more joins and make the condition generation more complex, but I think that's possible. We would only need to pray that PostgreSQL can figure out cases when we don't really care about those joins, because we have nothing to match against the indirect objects 😂 But if not, then we can figure something out client-side.
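The extra join for indirect matching might look like this hypothetical sketch (SQLite for brevity, illustrative schema): a test is matched against an issue's build-config pattern by joining through its parent build.

```python
# Indirect matching: check the parent build's config when matching a test.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE issues (id TEXT, build_config_pattern TEXT);
    CREATE TABLE builds (id TEXT, config_name TEXT);
    CREATE TABLE tests (id TEXT, build_id TEXT, status TEXT);
    INSERT INTO issues VALUES ('issue:1', 'defconfig');
    INSERT INTO builds VALUES ('build:1', 'defconfig'),
                              ('build:2', 'allmodconfig');
    INSERT INTO tests VALUES ('test:1', 'build:1', 'FAIL'),
                             ('test:2', 'build:2', 'FAIL');
""")
rows = db.execute("""
    SELECT issues.id, tests.id
      FROM issues
      JOIN tests  ON tests.status = 'FAIL'
      JOIN builds ON builds.id = tests.build_id
       AND builds.config_name LIKE COALESCE(issues.build_config_pattern, '%')
""").fetchall()
# Only the test whose parent build has a matching config survives the join.
```

When the issue has no indirect-field patterns at all, the COALESCE falls back to '%' and the builds join becomes a pure key lookup, which is the case we'd hope the planner recognizes as cheap.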