
@helen-fornazier
Last active July 12, 2024 11:57
kcidb automatching schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "status": {
      "type": "string",
      "description": "A regex pattern to match test statuses; ignored when matching a build node."
    },
    "valid_build": {
      "type": "boolean",
      "description": "Indicates whether the build is valid; ignored when matching a test node."
    },
    "tree_name": {
      "type": "string",
      "description": "A regex pattern to match the name of the tree."
    },
    "git_repository_url": {
      "type": "string",
      "description": "A regex pattern to match the URL of the Git repository."
    },
    "git_repository_branch": {
      "type": "string",
      "description": "A regex pattern to match the branch of the Git repository."
    },
    "architecture": {
      "type": "string",
      "description": "A regex pattern to match the architecture type."
    },
    "compiler": {
      "type": "string",
      "description": "A regex pattern to match the compiler used."
    },
    "origin": {
      "type": "string",
      "description": "A regex pattern to match the origin of the build or test."
    },
    "log": {
      "type": "string",
      "description": "A regex pattern used to match log lines. Not considered for PASS nodes; only applied to other statuses."
    },
    "path": {
      "type": "string",
      "description": "A regex pattern to match test paths; ignored when matching a build node."
    },
    "config": {
      "type": "string",
      "description": "A regex pattern to match the configuration name."
    },
    "platform": {
      "type": "string",
      "description": "A regex pattern to match the platform name."
    },
    "output_file": {
      "type": "string",
      "description": "A regex pattern to match the name of one of the output files."
    }
  },
  "anyOf": [
    {"required": ["status"]},
    {"required": ["valid_build"]},
    {"required": ["tree_name"]},
    {"required": ["git_repository_url"]},
    {"required": ["git_repository_branch"]},
    {"required": ["architecture"]},
    {"required": ["compiler"]},
    {"required": ["origin"]},
    {"required": ["log"]},
    {"required": ["path"]},
    {"required": ["config"]},
    {"required": ["platform"]},
    {"required": ["output_file"]}
  ],
  "description": "This schema defines the structure of the auto-matching configuration, used to automatically match incidents to the issue this configuration is attached to. All string fields are regex patterns by default. If a field is not specified, it matches all values. At least one property is required."
}
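For illustration, here is a minimal sketch in Python of how such a configuration could be applied to a flat build/test node. The `node_matches` helper is hypothetical (not part of kcidb); the field names come from the schema above:

```python
import re

def node_matches(config: dict, node: dict) -> bool:
    """Hypothetical matcher: every pattern in the config must match the
    corresponding node field; unspecified fields match all values."""
    for field, pattern in config.items():
        if field == "valid_build":
            # Boolean field: compared directly, not a regex.
            if node.get("valid_build") != pattern:
                return False
            continue
        value = node.get(field)
        if value is None or not re.search(pattern, value):
            return False
    return True

config = {"architecture": "x86_64|i386", "tree_name": "mainline"}
node = {"architecture": "x86_64", "tree_name": "mainline", "status": "FAIL"}
print(node_matches(config, node))  # True
print(node_matches({"architecture": "arm64"}, node))  # False
```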
@tales-aparecida

And also, as everything that is good in the KCIDB schema, a misc field to dump trash? 😁

@mh21

mh21 commented Jul 11, 2024

And also, as everything that is good in the KCIDB schema, a misc field to dump trash? 😁

heh, I said "to make it easy to adopt to new use cases that are not standardized yet", not trash 😂😛

@helen-fornazier
Author

Well, for reference, the log_url description says

If the test produced multiple outputs/files, this should point to the one containing the highest-level overview of the test's operation. The rest should go into "output_files"

So I'm pretty confident that it's better to match any output file, because some bugs show up in dmesg/journalctl and some tests generate a bunch of very granular files that also help to identify a specific bug

That's why I suggested output_file in the first place 🤔

But if output_file is present, should log match only the file that matched? Or should it just be used to check whether that output file is present?

If you would be so kind, please add some examples for each field in the description 🙈

I can do this kindness xD

If I may suggest another field... It might catch what you were trying to get out of configuration. CKI calls it package_name, but it's also referred to as variant, e.g. "kernel", "kernel-debug", "kernel-rt", "kernel-rt-debug", "kernel-64k-debug"

I guess more people use the term "variant" e.g. https://ubuntu.com/kernel/variants

So I guess we need to formalize the variant field.

And also, as everything that is good in the KCIDB schema, a misc field to dump trash? 😁

You mean a generic field to match a generic misc json if it is present? Ok, I can add it (maybe not that easy to match since we need to navigate through the dictionary tree, but feasible)

@spbnick

spbnick commented Jul 11, 2024

Since the incident, and thus the issue (in their present form), allow matching both a build and a test, we should have two separate objects matching those (which could use the same schema).

Also, I think it would be good to make it clear which object each field is matching.

And yes, I think we could keep this stuff in JSONB, as there will be a lot of churn there, which, BTW, we would have to convert using SQL on schema upgrades. I'm not sure how nice that experience will be.

Yes, it would be great to have status in builds soon. Any brave people here to do this while I'm on vacation? I would be able to help a bit.

Yes:
[image]

@spbnick

spbnick commented Jul 11, 2024

I have a strong itch to structure the hell out of this, but I gotta resist. That would be premature. Let's stick to the fields we definitely know we'll need.

However, we still gotta clarify/specify some things.

How about this for a definition of an issue pattern (matching either build, or test, or whatever else we need later, like checks):

"An issue pattern matches a subtree of the object hierarchy"; the pattern user decides which type of objects (nodes in the tree) to use out of the matched ones.

So, we could say a pattern matches checkouts coming from Maestro, with passing builds with x86_64 architecture, and particular tests failing. And then we say mark such builds as exhibiting this particular issue.

Or we could use the same pattern to mark the tests as exhibiting the issue.

This way we can use the same schema for both.

We could have two fields containing them, e.g.: match_builds and match_tests, both inside issue.

Having both specified would mean that both have to match for either to be marked as exhibiting the issue. That is, an object is marked as exhibiting the issue only if both of them match. That's a corner case, but we gotta handle it in a reasonable manner, since we allow it.

@spbnick

spbnick commented Jul 11, 2024

Having the same schema would allow us to implement matching only once. The difference would only be in returning the object of specified types out of the match.

But if output_file is present, should log match only the file that matched? Or should it just be used to check whether that output file is present?

Let's say that you could match the name of the output file and the text inside it separately and independently. This way you can match the name and/or the contents. And e.g. search for a particular string everywhere (useful while there's no consensus on file names), or just check if the file is there.
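As a sketch of that idea, a hypothetical helper could take output files as (name, contents) pairs and apply a name pattern and/or a contents pattern; when both are given, both must hold on the same file:

```python
import re

def output_files_match(files, name_regex=None, contents_regex=None):
    """True if any output file satisfies both given patterns;
    a missing pattern matches everything."""
    for name, contents in files:
        if name_regex and not re.search(name_regex, name):
            continue
        if contents_regex and not re.search(contents_regex, contents):
            continue
        return True
    return False

files = [("dmesg.log", "usb 1-1: device descriptor read error"),
         ("report.txt", "all tests passed")]
print(output_files_match(files, name_regex=r"dmesg"))           # True
print(output_files_match(files, contents_regex=r"descriptor"))  # True
print(output_files_match(files, name_regex=r"report",
                         contents_regex=r"error"))              # False
```

Calling it with only a name pattern checks whether the file is present; with only a contents pattern it searches for a string everywhere.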

@spbnick

spbnick commented Jul 11, 2024

The way to describe matching the misc field is by providing a JSON schema, which is essentially a JSON pattern. However, I can't quite decide if it's a good idea or not to support that, so far.

@spbnick

spbnick commented Jul 11, 2024

And, BTW, there's an extension for PostgreSQL for matching JSON schemas (like there is almost for everything else).

However, we should be careful with generalizing our pattern too much. It would be good to be able to do at least some matching with SQL.

@spbnick

spbnick commented Jul 11, 2024

So, taking everything I scribbled here into account, here's how I think an example issue could look (yeah, changed match_build/match_test to build_pattern/test_pattern for consistency):

{
    "id": "maestro:9871ab32334d98ef",
    "version": 1,
    "report_url": "https://bugzilla.kernel.org/show_bug.cgi?id=207065", /* ha-ha */
    "report_subject": "C-media USB audio device stops working from 5.2.0-rc3 onwards",
    "culprit": {"code": true, "tool": false, "harness": false},
    "test_pattern": {
        "checkout": {
            "git_repository_url": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git",
            "git_repository_branch": "master"
        },
        "build": {
            "architecture": "x86_64"
        },
        "test": {
            "path": "usb.audio",
            "status": ["FAIL", "ERROR"]
        }
    }
}

The above would match usb.audio tests with either a FAIL or an ERROR status, executing on x86_64 builds, made from commits out of the mainline tree.
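A rough sketch of how such a hierarchical pattern could be evaluated (hypothetical helper; exact string matching, with lists treated as alternatives as in the example above):

```python
def pattern_matches(pattern: dict, objects: dict) -> bool:
    """Check each per-object sub-pattern (checkout/build/test) against
    the corresponding object; a list means 'any of these values'."""
    for obj_name, sub_pattern in pattern.items():
        obj = objects.get(obj_name, {})
        for field, expected in sub_pattern.items():
            value = obj.get(field)
            if isinstance(expected, list):
                if value not in expected:
                    return False
            elif value != expected:
                return False
    return True

pattern = {
    "build": {"architecture": "x86_64"},
    "test": {"path": "usb.audio", "status": ["FAIL", "ERROR"]},
}
objects = {
    "build": {"architecture": "x86_64"},
    "test": {"path": "usb.audio", "status": "ERROR"},
}
print(pattern_matches(pattern, objects))  # True
```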

@spbnick

spbnick commented Jul 11, 2024

I think we should first support an exact (string/number) match for each field we're interested in, and name them exactly the same in the pattern. E.g. git_repository_url. We could also support matching options using arrays with the same fields, perhaps, as I put into the example above. If we want anything fancier, then we take that field name and add the fancy on the end, to make it clear. E.g. git_repository_url_regex. If we only need the fancy match, we can drop the plain field.
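A sketch of that naming convention, with a hypothetical `field_matches` dispatcher: a plain key means exact match, a list means alternatives, and a `_regex` suffix selects the fancy match:

```python
import re

def field_matches(pattern_key, pattern_value, obj):
    """Dispatch on the pattern key: '_regex' suffix -> regex search,
    list value -> set membership, otherwise exact equality."""
    if pattern_key.endswith("_regex"):
        value = obj.get(pattern_key[:-len("_regex")])
        return value is not None and re.search(pattern_value, value) is not None
    value = obj.get(pattern_key)
    if isinstance(pattern_value, list):
        return value in pattern_value
    return value == pattern_value

obj = {"git_repository_url": "https://git.kernel.org/x/linux.git"}
print(field_matches("git_repository_url_regex", r"kernel\.org", obj))   # True
print(field_matches("git_repository_url", "https://example.org", obj))  # False
```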

@spbnick

spbnick commented Jul 11, 2024

To match output files, we could support several name/contents "patterns" in an array, but we'd have to think about AND/OR.

Otherwise, one last thing I was thinking about is adding logical references to other issues. So we could say: "this issue only matches this test when this other issue doesn't match it", or: "this issue matches when either of these several issues match". And so on. We should, of course, use these sparingly, but in general we can optimize such matching by first matching the dependencies, and then querying for their incidents. We might also need to be able to downgrade issues so they would only participate in these conditions, and not be shown on dashboards/notified about, to reduce noise, but this needs more thinking.

@spbnick

spbnick commented Jul 11, 2024

The latter could help (automatic) issue processing, when we need to break down, or combine issues. (CC @hardboprobot)

@tales-aparecida

tales-aparecida commented Jul 11, 2024

But if output_file is present, should log match only the file that matched? Or should it just be used to check whether that output file is present?

Let's say that you could match the name of the output file and the text inside it separately and independently. This way you can match the name and/or the contents. And e.g. search for a particular string everywhere (useful while there's no consensus on file names), or just check if the file is there.

To be clear, CKI's use-case is to match a filename and its contents.
There are a few instances where we use filename_regex=.* to search for a particular string everywhere.

Honestly, it's a matter of performance. We have around 2000 regexes. Each build has a test plan that can generate hundreds of files, a lot of them larger than 10MB. It would simply take too many CPU cycles to match every file every time if we didn't set the filename_regex.

@spbnick

spbnick commented Jul 11, 2024

Alright, alright, shutting up. Please tell me what you think about all of this 😁

@spbnick

spbnick commented Jul 11, 2024

Honestly, it's a matter of performance.

Of course. This is all mostly theory, we gotta see what we need, and what we can afford.

@helen-fornazier
Author

The above would match usb.audio tests with either a FAIL or an ERROR status, executing on x86_64 builds, made from commits out of the mainline tree.

Ok, so we could use almost the same checkout, build, and test schemas from kcidb-io. lgtm.

The way to describe matching the misc field is by providing a JSON schema, which is essentially a JSON pattern. However, I can't quite decide if it's a good idea or not to support that, so far.

Maybe we could leave this for later; I would like to start simple, but make sure we can extend things later.

@tales-aparecida

tales-aparecida commented Jul 11, 2024

I think we should first support an exact (string/number) match for each field we're interested in, and name them exactly the same in the pattern. E.g. git_repository_url. We could also support matching options using arrays with the same fields, perhaps, as I put into the example above. If we want anything fancier, then we take that field name and add the fancy on the end, to make it clear. E.g. git_repository_url_regex. If we only need the fancy match, we can drop the plain field.

If we believe arrays will be easier to work with, I don't have complaints. Most CKI's regexes can be written as arrays without hassle.

To match output files, we could support several name/contents "patterns" in an array, but we'd have to think about AND/OR.

Otherwise one last thing I was thinking about is adding logical references to other issues. So we could say: "this issue only matches this test when this other issue doesn't match it", or: "this issue matches when either of these several issues match". And so on. We should, of course, use these sparingly, but in general we can optimize such matching by first matching the dependencies, and then querying for their incidents. We might also need to be able to downgrade issues so they would only participate in these condition, and not be shown on dashboards/notified about, to reduce noise, but this needs more thinking.

Both things sound complex for an MVP, but requirements for AND/OR in auto-matching-configs (regexes) did show up in CKI a few years ago.

@spbnick

spbnick commented Jul 12, 2024

If we believe arrays will be easier to work with, I don't have complaints. Most CKI's regexes can be written as arrays without hassle.

Just a thought that popped up for me, trying to optimize our triaging. I suspect that arrays for matching test statuses would be much faster than regexes, as they would utilize the index.

Both things sound complex for an MVP, but requirements for AND/OR in auto-matching-configs (regexes) did show up in CKI a few years ago.

Yeah, we can totally start with a single pattern for a name, and a pattern for contents. The array for patterns is not likely to be needed in the near future, or ever.

Regarding the logical operations on issues, I think it could be very powerful, with not that much complexity, and could help automatic issue generation, as I said. But it's totally not necessary for the start, or even the first half a year.

That's why I didn't elaborate on the schema for either of these.

@hardboprobot

I think this ties in with the discussion that Nikolai and myself were having about how issues are defined. Some comments:

  • From the schema description: "This schema defines the structure for the auto-matching configuration, to auto-match incidents to the issue where this configuration is attached". I understand you mean that test or build failures can be matched against issue descriptions and then create incidents to link them together, is that correct? The use of "incidents" in the description is kind of confusing because in the current schema an "incident" is already matched to an individual issue.

  • Before I move on to think about the fields, can we consider the practical implications of this and how the actual matching would be performed? Using this example that Nikolai wrote, for instance. According to the schema, each issue may specify a set of fields that defines how test or build results can be matched against it, right? But that set of fields isn't fixed, so how do you plan to do the matching?
    Let's say a test fails: how do we match it against all the defined issues? Or, in the other direction: a new issue is defined, how do we match it against existing results?

@spbnick

spbnick commented Jul 12, 2024

Before I move on to think about the fields, can we consider the practical implications of this and how the actual matching would be performed? Using this example that Nikolai wrote, for instance. According to the schema, each issue may specify a set of fields that defines how test or build results can be matched against it, right? But that set of fields isn't fixed, so how do you plan to do the matching?
Let's say a test fails: how do we match it against all the defined issues? Or, in the other direction: a new issue is defined, how do we match it against existing results?

The pattern schema should describe a finite list of possible fields. Missing JSON fields would be considered NULL in the database.

There are at least two events when we need to match issues to builds/tests:

  • When a new or updated build or test arrives. Here we already have the object we could match, and need to look up issues to match against.
  • When a new or updated issue arrives. Here we already have the issue, and need to look up the objects to match.

In both of these cases having indexes on basic fields to match could really speed up narrowing down the range of issues/objects to match against each other, using the remaining fields. Unfortunately having a JSONB field makes things more difficult here too (in addition to aforementioned upgrade conversion).

Regardless which kind of storage we choose, JSONB, or separate columns (which I prefer right now), I would say the processing could go like this (for the start, and only considering the direct object matching, not indirect matching of connected objects, for simplicity):

  • Load the arrived object (update) into the database, so it could do its job of merging updates.
  • Retrieve the merged object.
  • Construct an SQL query narrowing down the other side of the match:
    • Fetch the complete row for the opposite object (for simplicity).
    • Add an inner join with the merged object, and add a condition for each pair of "simple" fields in the joined objects.
    • Add a condition for whatever time range we use to improve performance.
  • Run the query, and for each returned opposite object:
    • Compare each "fancy" field between the objects, dropping objects with non-matches
  • Submit an incident (through the normal submission interface) for each remaining matched opposite object. We would have to be clever here to generate incident IDs in such a way that there's only one per pair of IDs of matched objects, including issue versions.

We don't have to do this separately for every updated object, I think. We can actually match objects to opposite objects in a batch, and then narrow down the fancy matches on all of them in one go. We can also use ORM queries, so we get ORM-schema data out of the database, and use the OO representation to have an easier time matching things, and to support matching fancier things generated in the OO layer (like calculated test suite statuses). For that reason, it might be better to structure the matching fields in the issues to line up with ORM objects, not with the raw objects in the DB.

For indirect object matching, such as checking a build config when matching tests, we would have to add more joins and make the condition generation more complex, but I think that's possible. We would only need to pray that PostgreSQL can figure out cases when we don't really care about those joins, because we have nothing to match against the indirect objects 😂 But if not, then we can figure something out client-side.

@hardboprobot

  • Load the arrived object (update) into the database, so it could do its job of merging updates.
  • Retrieve the merged object.

What do you mean by "merging updates" and what's a "merged object" in this context? Do you mean new issues that are somehow merged with existing ones?

Can we work on a concrete example for each use case? See, the potential (but critical) problem I'm seeing is how to match test/build failures against issues, in particular. How to get the DB to perform the matches effectively if the fields to match against are custom-defined per issue.

In other words, what I'm asking is: can you find a way to do this that performs better than O(n) where n is the number of issues or the number of test / build failures in the DB?

@spbnick

spbnick commented Jul 12, 2024

What do you mean by "merging updates" and what's a "merged object" in this context? Do you mean new issues that are somehow merged with existing ones?

The KCIDB reporting protocol allows omitting most of the object fields (missing ones are stored as NULL in the database). If the same object (with the same ID) is sent multiple times, then each field gets a non-deterministically chosen value from across all copies, with NULLs only chosen if there are no other values. This allows a CI system to submit an incomplete object and then send additional fields later, such as sending information on an incomplete build/test, and then sending results after it's complete. It doesn't allow editing already present values (you could certainly try, but you can't say what you'll get).

This mechanism is implemented in the database, and to apply it you have to load the data in there. This concerns all types of objects.
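The merge rule described above could be sketched like this (hypothetical Python helper; the real mechanism is implemented in the database, and makes no ordering guarantee about which non-NULL value wins):

```python
def merge_submissions(copies):
    """Merge several submissions of the same object: for each field,
    keep some non-None value (here, the first seen); None survives
    only if no copy supplies a value."""
    merged = {}
    for copy in copies:
        for field, value in copy.items():
            if merged.get(field) is None:
                merged[field] = value
    return merged

first = {"id": "ci:build1", "architecture": "x86_64", "status": None}
later = {"id": "ci:build1", "status": "PASS"}
print(merge_submissions([first, later]))
# {'id': 'ci:build1', 'architecture': 'x86_64', 'status': 'PASS'}
```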

Can we work on a concrete example for each use case? See, the potential (but critical) problem I'm seeing is how to match test/build failures against issues, in particular. How to get the DB to perform the matches effectively if the fields to match against are custom-defined per issue.

Yeah, I could write one, I suppose. Possibly later, as it would take some effort. Perhaps I could write an SQL query for pre-matching. For now, I think our join conditions could be built to handle NULLs (missing fields). Something like this: COALESCE(issues.test_pattern.status, '%') LIKE tests.status. We could tweak the actual expression for performance, but that's the way, I think.
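A hypothetical sketch of generating such NULL-tolerant conditions, assuming the pattern fields are stored as separate columns like `test_pattern_status` (one of the storage options discussed above; the column names are made up for illustration):

```python
def build_conditions(pattern_fields):
    """Generate one NULL-tolerant SQL condition per pattern field:
    an unset (NULL) pattern degrades to a match-all '%' for LIKE."""
    return " AND ".join(
        f"COALESCE(issues.test_pattern_{f}, '%') LIKE tests.{f}"
        for f in pattern_fields
    )

print(build_conditions(["status", "path"]))
# COALESCE(issues.test_pattern_status, '%') LIKE tests.status
#  AND COALESCE(issues.test_pattern_path, '%') LIKE tests.path
```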

In other words, what I'm asking is: can you find a way to do this that performs better than O(n) where n is the number of issues or the number of test / build failures in the DB?

Well, my bet is that we would be able to employ indexes in the database at least for simple fields, which will get something like O(log n), then we would concentrate on shifting most frequent matching to the indexes through using simple and sometimes not-so-simple fields.

@spbnick

spbnick commented Jul 12, 2024

I'd like to add that we don't have to go through the whole process of formalizing and implementing this before LPC. We could discuss and decide on something for the start, stuff it into the misc field in issues, and implement incident generation on the side. We would get real incidents and would be able to show them off, would get to experience the practical side, and then can formalize them. Most importantly we would have a chance at having something working by LPC. We won't be able to show the pattern fields in the dashboards, but we could certainly show their results: incidents.

@hardboprobot

For now, I think our join conditions could be built to handle NULLs (missing fields). Something like this: COALESCE(issues.test_pattern.status, '%') LIKE tests.status. We could tweak the actual expression for performance, but that's the way, I think.

If there's a way to make queries work on custom sets of conditions, or a universal way to run queries based on any set of conditions, then that'd work. As long as we can properly leverage the DBMS to do the matching and the process is scalable I'm fine with it. But I really think this should come first, because having a matching process that'll get slower as the DB grows is not a solution.

@spbnick

spbnick commented Jul 12, 2024

But I really think this should come first, because having a matching process that'll get slower as the DB grows is not a solution.

Agreed.
