@vmarquez
Last active October 15, 2019 02:34
public class Event {
    @PartitionKey(0) public UUID accountId;
    @PartitionKey(1) public String yearMonthDay;
    @ClusteringColumn public UUID eventId;
    // other data...
}
public static void sampleUsage() {
    // We want to query ONLY the data from three years ago for a set of accounts,
    // so we generate the accounts and dates up front (however that is done).
    // Note that one token will likely yield many Events.
    Set<UUID> accounts = getRelevantAccounts();
    List<String> dateRange = generateDateRange("2016-01-01", "2017-01-01");
    // Token is not serializable; it can be represented by a custom class wrapping byte arrays.
    PCollection<Token> tokensToQuery = p.apply(generateMyTokens(accounts, dateRange));
    PCollection<Event> events = tokensToQuery.apply(
        CassandraIO.<Event>readAll("Select * from Event where token(accountId, yearMonthDay) = ?"));
    // The query above, or even just the table, could be specified with the builder pattern;
    // this is just an example.
}
/*
Currently CassandraIO queries over the entire token range and allows for filtering. That works poorly when we want to
exclude tens or hundreds of thousands of primary keys, so instead I'm proposing a way to supply a list of Tokens to query.
Similar to how CassandraIO currently bunches up token range queries as a List<List<Query>>, we can do the same under the
hood, ideally grouping by the node that owns the token.
I believe the current implementation could also be built on something similar under the hood: it would take a
PCollection<TokenRange> and a query, and in the case proposed above each token range would span only one actual token.
*/
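
To make two of the ideas above more concrete (a serializable stand-in for Token, and grouping tokens by the node that owns them), here is a minimal plain-Java sketch. Nothing in it is CassandraIO or driver API: `TokenBytes`, `TokenGrouping` and `groupByOwner` are hypothetical names, the ring is modelled as a simple sorted map from each node's token to a node name, tokens are assumed to be Murmur3-style 64-bit values, and replication is ignored.

```java
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical serializable stand-in for a driver Token: only the raw token bytes. */
final class TokenBytes implements Serializable {
    private static final long serialVersionUID = 1L;
    private final byte[] bytes;

    private TokenBytes(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Assumption: a Murmur3-style token, i.e. a signed 64-bit value. */
    static TokenBytes ofLong(long token) {
        return new TokenBytes(ByteBuffer.allocate(Long.BYTES).putLong(token).array());
    }

    long asLong() {
        return ByteBuffer.wrap(bytes).getLong();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof TokenBytes && Arrays.equals(bytes, ((TokenBytes) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes);
    }
}

/** Hypothetical helper that buckets tokens by the node owning them. */
final class TokenGrouping {
    /**
     * Simplified ring model: each key in {@code ring} is a node's token and marks the
     * inclusive upper bound of the range that node owns. Replication is ignored.
     */
    static Map<String, List<TokenBytes>> groupByOwner(
            TreeMap<Long, String> ring, List<TokenBytes> tokens) {
        if (ring.isEmpty()) {
            throw new IllegalArgumentException("ring must not be empty");
        }
        Map<String, List<TokenBytes>> byNode = new LinkedHashMap<>();
        for (TokenBytes token : tokens) {
            // Owner is the node with the smallest ring token >= this token...
            Map.Entry<Long, String> owner = ring.ceilingEntry(token.asLong());
            if (owner == null) {
                // ...wrapping around to the first node past the end of the ring.
                owner = ring.firstEntry();
            }
            byNode.computeIfAbsent(owner.getValue(), n -> new ArrayList<>()).add(token);
        }
        return byNode;
    }
}
```

Each resulting bucket could then back one batch of single-token queries, analogous to the inner lists of the existing List<List<Query>>. In a real implementation the ownership lookup would presumably come from the driver's cluster metadata rather than a hand-built map.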