timmolderez/gist:60d1165858d45e881ed13efef6114987

## gistfile1.txt
"1_3" \
"__2" |
"_00" |=> filter "000": output = {"_00", "0__", "00_"}
"_1_" |
"0__" |=> filter "123": output = {"1_3", "_2_", "1__"}
"_2_" |
"_01" |=> filter "__0": output = {"_00", "_1_", "0__", "_2_", "1__", "00_", "_20", "1_0"}
"1__" |
"00_" |=> filter "10_": output = {"1_3", "__2", "_00", "_01", "1__", "1_1", "_03", "1_2", "1_0"},
"1_1" |
"_22" |
"111" |
"_20" |
"_03" |
"1_2" |
"1_0" /

---------------------

Eugenio Bargiacchi [1:25 PM]
Hey Tim!
Long time no see :smile:
I have a question for you
I have a CS problem, and maybe you know the solution/somebody who can help me
it's a regex-like problem
I have a set of strings of length N, where they can contain at each position i a particular set of characters C_i. An additional wildcard character is also allowed at all positions in the strings; and indeed the strings are very very sparse (most characters in them are wildcards).
I get passed as input another string, and I need to find all of the strings which match the input, where matching means that at each position i they either have the same character, or at least one of the two has a wildcard there
Do you know of anything that could help me?
I already have a kind of naive solution where I keep N x ( | C_i | + 1 ) vectors, and in each I add the strings that match in that particular spot, and when I need to look up I do a set intersections of all of them (edited)
This works and is relatively fast, but I'd like it to be much faster
And I think that exploiting the fact that the strings are mostly sparse could help
I have a couple ideas but not sure if you know of already done work on this type of things
I stop at radix trees
Which I don't think in a naive form work as well as my set of strings has wildcards (edited)

Tim Molderez [1:38 PM]
Hey Eugenio, good to see you :slightly_smiling_face:
Do you have a simple example to help me understand? (dunno if I’ll have an answer, but I can forward your question on our Slack :slightly_smiling_face: )

Eugenio Bargiacchi [1:38 PM]
yess
Untitled
"1_3" \
"__2" |
"_00" |=> filter "000": output = {"_00", "0__", "00_"}
"_1_" |
"0__" |=> filter "123": output = {"1_3", "_2_", "1__"}
"_2_" |
"_01" |=> filter "__0": output = {"_00", "_1_", "0__", "_2_", "1__", "00_", "_20", "1_0"}
"1__" |
"00_" |=> filter "10_": output = {"1_3", "__2", "_00", "_01", "1__", "1_1", "_03", "1_2", "1_0"},
"1_1" |
"_22" |
"111" |
"_20" |
"_03" |
"1_2" |
"1_0" /
Collapse
These are some of my test cases
my goal is to have thousands of strings of hundreds of chars of length, mostly wildcards, and have this lightning fast (I have to do it tons of times)
Each position in the string has a domain, so for example first char could be in [0-5], second in [0-20], third in [0-9], etc etc
Not sure if it helps but just in case
I have atm found a very fast solution if the input has no wildcards, but I'm still thinking how to make it work with a wildcardish input..

Tim Molderez [1:46 PM]
a “_” is the wildcard character right?

Eugenio Bargiacchi [1:46 PM]
yes
otherwise, if this can't be done, it would be enough for me if, given an input with wildcards, to find in set all strings that are compatible with the input and with each other. There are possibly many such groups, and finding one in particular is probably a knapsack problem, so I'd be happy with any of them as long as it's the largest it can be
Untitled

input: "0__";
possible outputs:
- "__2", "_1_", "0__" (012)
- "_00", "0__", "00_" (000)
- "0__", "_03"    (003)
Click to expand inline (7 lines)
this as an example, any of these would be fine

Tim Molderez [1:53 PM]
I’ll ask around if there’s anyone at the lab who can help out :slightly_smiling_face: (i can have a closer look this evening otherwise)

Eugenio Bargiacchi [1:53 PM]
thanks!!
	"1_3" \
	"__2" \|
	"_00" \|=> filter "000": output = {"_00", "0__", "00_"}
	"_1_" \|
	"0__" \|=> filter "123": output = {"1_3", "_2_", "1__"}
	"_2_" \|
	"_01" \|=> filter "__0": output = {"_00", "_1_", "0__", "_2_", "1__", "00_", "_20", "1_0"}
	"1__" \|
	"00_" \|=> filter "10_": output = {"1_3", "__2", "_00", "_01", "1__", "1_1", "_03", "1_2", "1_0"},
	"1_1" \|
	"_22" \|
	"111" \|
	"_20" \|
	"_03" \|
	"1_2" \|
	"1_0" /

	---------------------

	Eugenio Bargiacchi [1:25 PM]
	Hey Tim!
	Long time no see :smile:
	I have a question for you
	I have a CS problem, and maybe you know the solution/somebody who can help me
	it's a regex-like problem
	I have a set of strings of length N, where they can contain at each position i a particular set of characters C_i. An additional wildcard character is also allowed at all positions in the strings; and indeed the strings are very very sparse (most characters in them are wildcards).
	I get passed as input another string, and I need to find all of the strings which match the input, where matching means that at each position i they either have the same character, or at least one of the two has a wildcard there
	Do you know of anything that could help me?
	I already have a kind of naive solution where I keep N x ( \| C_i \| + 1 ) vectors, and in each I add the strings that match in that particular spot, and when I need to look up I do a set intersections of all of them (edited)
	This works and is relatively fast, but I'd like it to be much faster
	And I think that exploiting the fact that the strings are mostly sparse could help
	I have a couple ideas but not sure if you know of already done work on this type of things
	I stop at radix trees
	Which I don't think in a naive form work as well as my set of strings has wildcards (edited)

	Tim Molderez [1:38 PM]
	Hey Eugenio, good to see you :slightly_smiling_face:
	Do you have a simple example to help me understand? (dunno if I’ll have an answer, but I can forward your question on our Slack :slightly_smiling_face: )

	Eugenio Bargiacchi [1:38 PM]
	yess
	Untitled
	"1_3" \
	"__2" \|
	"_00" \|=> filter "000": output = {"_00", "0__", "00_"}
	"_1_" \|
	"0__" \|=> filter "123": output = {"1_3", "_2_", "1__"}
	"_2_" \|
	"_01" \|=> filter "__0": output = {"_00", "_1_", "0__", "_2_", "1__", "00_", "_20", "1_0"}
	"1__" \|
	"00_" \|=> filter "10_": output = {"1_3", "__2", "_00", "_01", "1__", "1_1", "_03", "1_2", "1_0"},
	"1_1" \|
	"_22" \|
	"111" \|
	"_20" \|
	"_03" \|
	"1_2" \|
	"1_0" /
	Collapse
	These are some of my test cases
	my goal is to have thousands of strings of hundreds of chars of length, mostly wildcards, and have this lightning fast (I have to do it tons of times)
	Each position in the string has a domain, so for example first char could be in [0-5], second in [0-20], third in [0-9], etc etc
	Not sure if it helps but just in case
	I have atm found a very fast solution if the input has no wildcards, but I'm still thinking how to make it work with a wildcardish input..

	Tim Molderez [1:46 PM]
	a “_” is the wildcard character right?

	Eugenio Bargiacchi [1:46 PM]
	yes
	otherwise, if this can't be done, it would be enough for me if, given an input with wildcards, to find in set all strings that are compatible with the input and with each other. There are possibly many such groups, and finding one in particular is probably a knapsack problem, so I'd be happy with any of them as long as it's the largest it can be
	Untitled

	input: "0__";
	possible outputs:
	- "__2", "_1_", "0__" (012)
	- "_00", "0__", "00_" (000)
	- "0__", "_03" (003)
	Click to expand inline (7 lines)
	this as an example, any of these would be fine

	Tim Molderez [1:53 PM]
	I’ll ask around if there’s anyone at the lab who can help out :slightly_smiling_face: (i can have a closer look this evening otherwise)

	Eugenio Bargiacchi [1:53 PM]
	thanks!!