See https://github.com/hmage/seeborg/issues/25. We need to chew the received message for seeborg so it can properly assimilate it.
Seeborg chews each message and tries to reply to it in SeeBorg::Reply. This applies to both the IRC and the console versions,
so it is a good enough place to start.
Let's look around https://github.com/hmage/seeborg/blob/4be4d2a1e4085a8a2645eb528116264f989ca5af/seeborg.cpp#L107:
wstring message = inmsg;
FilterMessage(message);
splitString(message, curlines, L". ");
sz = curlines.size();
for (i = 0; i < sz; i++) splitString(curlines[i], curwords);
There literally is a function called FilterMessage that sanitizes the text. It looks like a good place to start,
until you get annoyed by how impure it is: it edits the string in place, while our conversion throws the string away and hands back a whole new one...
Well, you can always assign the whole thing back.
As mentioned in the issue, we need two components in our evil thing: OpenCC for normalizing everything to a single script (currently preferring Trad. Chinese for its unfunny lack of ambiguity), and jieba for chewing the text up into words. Since word splitting is a separate step, let's do the OpenCC part first.
Assuming we got the import things done, we can just do:
/* This can be shared somewhere. */
opencc::SimpleConverter occ_conv("s2tw.json"); // or s2twp.json if you want some word convs
/* When you want to use it... in FilterMessage@seeutil.cpp */
// Time to get evil. widen() already returns a temporary, so plain assignment
// moves it into message; an explicit std::move would be redundant.
message = boost::nowide::widen(
    occ_conv.Convert(
        boost::nowide::narrow(message)));
// Add some housekeeping if you must. I personally suggest doing these at the top,
// before the "? " -> "?. " things. This saves us a change in sentence breaks.
// * "？" -> "?" (preserve consecutive usage)
// * "！" -> "!"
// * "。" -> ". " (keep splitting working)
// Flatten the puncts for Chinese:
// * “ -> 「 (or both to ' "' for some better zh/en mixing?)
// * ” -> 」 (or '" ')
// * ‘ -> 『 (or " '")
// * ’ -> 』 (or "' ")
// You will hate writing all these. Try some dank macros from C!
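The housekeeping list above could also be driven by a small replacement table instead of macros. The following is only a sketch under assumptions: `ReplaceAll` and `FlattenCjkPunctuation` are hypothetical names that do not exist in seeborg, and the table takes the corner-bracket option for the quotes rather than the ASCII one.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of seeborg: replace every occurrence of
// `from` in `text` with `to`, scanning left to right.
static void ReplaceAll(std::wstring &text, const std::wstring &from,
                       const std::wstring &to) {
    if (from.empty()) return;
    for (std::wstring::size_type pos = text.find(from);
         pos != std::wstring::npos;
         pos = text.find(from, pos + to.size())) {
        text.replace(pos, from.size(), to);
    }
}

// Hypothetical name. Flattens CJK punctuation in place so the existing
// "? " -> "?. " rewrites and the L". " sentence split keep working.
void FlattenCjkPunctuation(std::wstring &message) {
    static const std::vector<std::pair<std::wstring, std::wstring>> table = {
        {L"\uFF1F", L"?"},       // fullwidth question mark
        {L"\uFF01", L"!"},       // fullwidth exclamation mark
        {L"\u3002", L". "},      // ideographic full stop
        {L"\u201C", L"\u300C"},  // left double quote   -> left corner bracket
        {L"\u201D", L"\u300D"},  // right double quote  -> right corner bracket
        {L"\u2018", L"\u300E"},  // left single quote   -> left white corner bracket
        {L"\u2019", L"\u300F"},  // right single quote  -> right white corner bracket
    };
    for (const auto &entry : table)
        ReplaceAll(message, entry.first, entry.second);
}
```

Calling `FlattenCjkPunctuation(message)` at the top of FilterMessage would be one way to wire it in. One caveat with the last row: U+2019 also shows up as a typographic apostrophe in English text, so mixed zh/en input may want that entry dropped or handled separately.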
The jieba part is not hard either.
// Somewhere in seeborg.cpp, share this:
// see https://github.com/yanyiwu/cppjieba/blob/master/test/demo.cpp, too lazy to copy the paths
cppjieba::Jieba jieba(DICT_PATH,
HMM_PATH,
USER_DICT_PATH,
IDF_PATH,
STOP_WORD_PATH);
// Later on, replace the space loop with:
vector<string> curstrwords;
for (i = 0; i < sz; i++) {
    // Cut works on UTF-8, so narrow the line first; true enables the HMM model.
    jieba.Cut(boost::nowide::narrow(curlines[i]), curstrwords, true);
    // cppjieba seems to always clear curstrwords before filling it.
    for (const auto &sw : curstrwords)
        curwords.push_back(boost::nowide::widen(sw));
}
The SeeBorg::LearnLine part should get the same OpenCC and jieba treatment. For the jieba code, take the sample above and drop the outer for loop.
Apply the same punctuation flattening and OpenCC assignment to the input body. Actually, make that a separate util function.
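One possible shape for such a shared util function is sketched below. It is an assumption-laden outline, not seeborg code: `PrepareWords` is a hypothetical name, `convert` stands in for the narrow, OpenCC, widen round trip, and `cut` stands in for a wrapper around jieba.Cut that returns wide tokens. Passing both in as callables keeps the helper independent of how the converter and segmenter objects are shared. Dropping whitespace-only tokens is also an assumption, made in case the segmenter emits them where splitString would not have.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not in seeborg. Converts one line to the target
// script, cuts it into tokens, and returns the tokens worth learning.
template <typename Convert, typename Cut>
std::vector<std::wstring> PrepareWords(const std::wstring &line,
                                       Convert convert, Cut cut) {
    // convert: std::wstring -> std::wstring (e.g. the OpenCC round trip)
    // cut:     std::wstring -> std::vector<std::wstring> (e.g. jieba)
    const std::vector<std::wstring> tokens = cut(convert(line));

    std::vector<std::wstring> words;
    for (const std::wstring &token : tokens) {
        // Keep only tokens that contain something other than whitespace;
        // empty tokens are dropped by the same check.
        if (token.find_first_not_of(L" \t\r\n") != std::wstring::npos)
            words.push_back(token);
    }
    return words;
}
```

Both SeeBorg::Reply (once per sentence in curlines) and SeeBorg::LearnLine (once per line) could then call the same helper with the same two callables.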