lakiw/gist:64d1a93106fd501d4d680fffad076e12

## gistfile1.txt
For me, the main challenge in detecting multi-words in passwords has been the lack of good wordlists/dictionaries.

Based on previous experience, my rule of thumb is that a "decent" dictionary will have about a 60% coverage rate for the training set. That number is based on very out-of-date experiments which, quite honestly, I need to update (if you are curious I can look up where in my dissertation I documented them), which is why I consider it more a rule of thumb than an accurate statement. You can get higher coverage by increasing the size of your dictionary, but at that point the amount of junk in your wordlist starts to make Markov-based brute-force sound more attractive. Still, while some people might quibble with that 60% coverage statement (rightfully so), I think it highlights the wordlist issue. If I look for multi-words but the "golden list" I use in training only has 60% coverage, then this becomes a harder problem to solve.
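As a rough illustration, here is a minimal sketch of how a coverage number like that could be measured. It assumes "coverage" means the share of alphabetic strings pulled out of the training passwords that appear in the dictionary; that definition, the function name `coverage_rate`, and the letters-only regex are my assumptions for the example, not how the original experiments were run.

```python
import re

# Assumption: a "string" is any maximal run of ASCII letters in a password.
ALPHA_RUN = re.compile(r"[A-Za-z]+")


def coverage_rate(training_passwords, dictionary):
    """Fraction of alpha strings in the training set found in the dictionary."""
    words = {w.lower() for w in dictionary}
    total = 0
    covered = 0
    for password in training_passwords:
        for run in ALPHA_RUN.findall(password):
            total += 1
            if run.lower() in words:
                covered += 1
    # Avoid dividing by zero when the training set has no letters at all.
    return covered / total if total else 0.0
```

Growing `dictionary` pushes this number up, but it says nothing about how much junk was added to get there.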

In general it seems like a better approach is to build custom dictionaries based on previously cracked passwords, aka extract 'someword' from the password '123someword123'. Just about every serious password cracker I've talked to does this.
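A minimal sketch of that extraction step, assuming a base word is simply a maximal run of ASCII letters, lowercased. The names `extract_alpha_strings` and `build_wordlist` are hypothetical helpers for this example.

```python
import re
from collections import Counter

# Assumption: base words are maximal runs of ASCII letters.
ALPHA_RUN = re.compile(r"[A-Za-z]+")


def extract_alpha_strings(password):
    """Return the lowercased letter runs found in one password."""
    return [run.lower() for run in ALPHA_RUN.findall(password)]


def build_wordlist(cracked_passwords):
    """Count how often each extracted string occurs across cracked passwords."""
    counts = Counter()
    for password in cracked_passwords:
        counts.update(extract_alpha_strings(password))
    return counts
```

For example, `extract_alpha_strings("123someword123")` gives `['someword']`, and the counts from `build_wordlist` can then be sorted to produce a custom dictionary.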

One challenge with this approach is dealing with l33t letter replacements, aka dealing with '123somew0rd123'.
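To make the problem concrete, here is a naive sketch that strips leading and trailing digits and then undoes a few common substitutions. The mapping in `LEET_MAP` and the function name `unleet` are assumptions for illustration only. Real l33t handling is ambiguous (a '1' may stand for 'l' or 'i', or just be a digit), so this is not a proposed solution.

```python
# Assumption: a small, fixed one-to-one substitution table.
LEET_MAP = str.maketrans({
    "0": "o",
    "1": "l",
    "3": "e",
    "4": "a",
    "5": "s",
    "7": "t",
    "@": "a",
    "$": "s",
})


def unleet(password):
    """Strip outer digits, then reverse the substitutions in LEET_MAP."""
    core = password.strip("0123456789")
    return core.translate(LEET_MAP).lower()
```

Here `unleet("123somew0rd123")` returns `'someword'`, but any password where a digit in the middle is meant literally gets mangled.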

More to the point of this gist, how do you go from 'someword' to identifying it as two words ['some', 'word']?

Ignoring the l33t replacements for now and focusing on the multi-word detection: Here is my current plan for support in the PCFG trainer. Note, this will almost certainly evolve over time when I start actually coding this.

1) First pass over the training set, identify all strings.
2) During the first pass, also keep track of string frequencies. Aka we saw 'some' 12 times, 'word' 50 times, and 'someword' 2 times.
3) Have a threshold for 'valid base words'. Aka if the threshold is 50, and we saw 'someword' 51 times, then it is considered just one word as far as our trainer is concerned. Yes, this is a false positive/negative (depending on how you look at it), but I could consider it a feature since we want to keep the relationship between certain common multi-words ;p Aka from a cracking standpoint 'correcthorsebatterystaple' should be treated as one word thanks to the popularity of the xkcd comic.
4) If a word falls below the threshold, then look for possible multi-word splits built from strings that were seen above the threshold. Requiring the base words to occur multiple times will hopefully limit false matches. Also there will need to be length requirements for the base words, since if ['s','o','m','e','w','r','d'] each show up in the training set we don't want to break up 'someword' that way.
5) The downside is this may have a high false negative rate. Aka if 'some' doesn't show up much by itself in passwords then 'someword' won't be categorized as a multi-word. Also 'iloveyou' should be classified as ['i','love','you'] but that will not happen since 'i' is too short. Therefore, there may need to be some hardcoded words/rules/exceptions added to deal with that.
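The plan above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the PCFG trainer's actual code: the names `count_strings` and `split_multiword`, the defaults `threshold=50` and `min_len=3`, treating a count equal to the threshold as "valid", and preferring the longest base word first are all choices I made for the example.

```python
import re
from collections import Counter

# Assumption: a "string" is a maximal run of ASCII letters, lowercased.
ALPHA_RUN = re.compile(r"[A-Za-z]+")


def count_strings(passwords):
    """Steps 1 and 2: one pass over the training set, counting every string."""
    counts = Counter()
    for password in passwords:
        counts.update(run.lower() for run in ALPHA_RUN.findall(password))
    return counts


def split_multiword(string, counts, threshold=50, min_len=3):
    """Steps 3 and 4: return the base words for `string`, or [string] if none."""
    # Step 3: a string seen often enough is kept as a single word.
    # Assumption: a count equal to the threshold qualifies.
    if counts[string] >= threshold:
        return [string]

    def is_base_word(piece):
        # Step 4: base words must be long enough and frequent enough.
        return len(piece) >= min_len and counts[piece] >= threshold

    memo = {}  # start index -> parse of string[start:], or None

    def parse(start):
        if start == len(string):
            return []
        if start in memo:
            return memo[start]
        result = None
        # Assumption: try the longest candidate base word first.
        for end in range(len(string), start + min_len - 1, -1):
            piece = string[start:end]
            if is_base_word(piece):
                rest = parse(end)
                if rest is not None:
                    result = [piece] + rest
                    break
        memo[start] = result
        return result

    words = parse(0)
    return words if words else [string]
```

With the counts from step 2 ('some' seen 12 times, 'word' 50 times, 'someword' 2 times) and a threshold of 50, `split_multiword('someword', counts)` leaves the string unsplit because 'some' is too rare, which is exactly the false negative described in step 5. 'iloveyou' also stays whole because 'i' is shorter than `min_len`, so hardcoded exceptions would have to be layered on top.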

Long story short, the above is going to have a high false positive and a high false negative rate. The open question is whether it can still add value to a cracking session while at the same time being automated enough to ease the work of an analyst.