@eguiraud
Last active June 18, 2024 08:47
A thread-safe stateful Filter for RDataFrame
import ROOT

ROOT.gInterpreter.Declare("""
#include <unordered_set>

// A thread-safe stateful filter that lets only one event pass for each value of
// "category" (here, a character derived from the entry number).
// It uses gCoreMutex, a read-write lock, to reduce contention between threads.
class FilterOnePerKind {
   std::unordered_set<char> _seenCategories;

public:
   bool operator()(char category) {
      {
         R__READ_LOCKGUARD(ROOT::gCoreMutex); // many threads can take a read lock concurrently
         if (_seenCategories.count(category) == 1)
            return false;
      }
      // `category` was not in _seenCategories when we checked, but another thread may have
      // inserted it after we released the read lock, so let insert() report who was first
      R__WRITE_LOCKGUARD(ROOT::gCoreMutex); // only one thread at a time can take the write lock
      return _seenCategories.insert(category).second;
   }
};
""")

ROOT.EnableImplicitMT()
df = ROOT.RDataFrame(100).Define("category", "char(rdfentry_ % 10)")
cols = ROOT.std.vector['string'](["category"])
df_with_unique_categories = df.Filter(ROOT.FilterOnePerKind(), cols)
print(df_with_unique_categories.Count().GetValue())
@dlanci

dlanci commented Jun 17, 2024

Hi @eguiraud ,

Thanks a lot for the nice snippet. Do you think it would be possible, with a similar strategy, to write a thread-safe stateful filter that retains, for each category, only the entry that satisfies a condition? For example, to make the RDataFrame unique by keeping, for each category, only the entry with the highest BDT output. I'm not sure whether there is already a function that does this.

Cheers,
Davide

@dlanci

dlanci commented Jun 17, 2024

I tried this

#include <iostream>
#include <vector>
#include <unordered_map>

class ElementTracker {
public:
    // Method to check if a given element (represented by id and value) has the highest value among elements with the same id
    bool operator()(unsigned long long id, double value) {
        // Check if the current element has a higher value than the stored value
        if (highestValues.find(id) == highestValues.end() || value > highestValues[id]) {
            highestValues[id] = value;
            return true;
        }
        return value == highestValues[id];
    } 

private:
    // Map to store the highest value for each id
    std::unordered_map<unsigned long long, double> highestValues;
};

However, the output contains just the highest element for each ID only when that element is also the first one seen (quite obviously). Any idea how to solve this?
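
A single-pass filter has to accept or reject each entry before it has seen the later ones, so it cannot know whether a higher value for the same id is still to come. One possible way around this, sketched here on plain vectors rather than on an RDataFrame, is to use two passes: first record the highest value per id, then keep only the entries that match it. The function names `highestPerId` and `selectHighest` are hypothetical. Mapping the two passes onto two RDataFrame event loops is an assumption and is not something the thread confirms.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using Id = unsigned long long;

// Pass 1: record the highest value seen for each id.
std::unordered_map<Id, double> highestPerId(const std::vector<Id> &ids,
                                            const std::vector<double> &values) {
   assert(ids.size() == values.size());
   std::unordered_map<Id, double> highest;
   for (std::size_t i = 0; i < ids.size(); ++i) {
      // emplace leaves the map untouched if the id is already present
      auto result = highest.emplace(ids[i], values[i]);
      if (!result.second && values[i] > result.first->second)
         result.first->second = values[i];
   }
   return highest;
}

// Pass 2: return the indices of the entries whose value is the highest for their id.
// Entries that tie for the highest value are all kept.
std::vector<std::size_t> selectHighest(const std::vector<Id> &ids,
                                       const std::vector<double> &values) {
   const auto highest = highestPerId(ids, values);
   std::vector<std::size_t> kept;
   for (std::size_t i = 0; i < ids.size(); ++i)
      if (values[i] == highest.at(ids[i]))
         kept.push_back(i);
   return kept;
}
```

For ids `{1, 1, 2, 2, 2}` with values `{0.2, 0.9, 0.5, 0.1, 0.7}`, this keeps the entries at indices 1 and 4, independently of the order in which the entries appear.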

@eguiraud
Author

Hi @dlanci,

Are you running with ROOT::EnableImplicitMT()? This code is not thread-safe (there is no mutex protecting the accesses to highestValues).
You can add print-outs to debug what's going on.

Also, I'm not sure you need a Filter for this; you could implement it as a Reduce or similar.
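
As a rough illustration of the reduction idea, the sketch below folds candidates into a "best entry per id" map and merges two such partial maps, which is the kind of combination a multi-threaded reduction would need. All names (`Best`, `BestPerId`, `addCandidate`, `mergeBest`) are hypothetical, and how these would be wired into RDataFrame is left open here.

```cpp
#include <cstddef>
#include <unordered_map>

using Id = unsigned long long;

// The best candidate found so far for one id: its entry number and its value.
struct Best {
   std::size_t entry;
   double value;
};
using BestPerId = std::unordered_map<Id, Best>;

// Fold one candidate into the accumulator; on a tie the earlier candidate stays.
void addCandidate(BestPerId &acc, Id id, std::size_t entry, double value) {
   auto it = acc.find(id);
   if (it == acc.end())
      acc.emplace(id, Best{entry, value});
   else if (value > it->second.value)
      it->second = Best{entry, value};
}

// Combine two partial results, e.g. accumulated independently by two threads.
BestPerId mergeBest(BestPerId a, const BestPerId &b) {
   for (const auto &kv : b)
      addCandidate(a, kv.first, kv.second.entry, kv.second.value);
   return a;
}
```

Because each partial map is only ever touched by the code that owns it, this shape needs no lock while accumulating; the partial maps are combined once at the end.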

Your best bet is to open a new topic on the forum where root devs can help out properly.

Cheers,
Enrico

@dlanci

dlanci commented Jun 18, 2024

Hi @eguiraud ,

Thanks for your answer! I have opened a thread here: https://root-forum.cern.ch/t/select-unique-candidates-based-on-their-id-and-the-value-of-a/59668/3

I'm not currently running with ROOT::EnableImplicitMT(). For now I just want to get the expected result; I will then extend it to MT execution.

Best,
Davide
