@Artoria2e5
Last active July 4, 2017 04:57
quick notes for seeborg@zh

Adapting seeborg for Chinese

See https://github.com/hmage/seeborg/issues/25. We need to chew the received message for seeborg so it can properly assimilate it.

Inject the nanoprobes here, right?

Seeborg chews messages and tries to reply to them in SeeBorg::Reply. This applies to both the IRC and the console versions, so it is a good enough place to start.

Let's look around https://github.com/hmage/seeborg/blob/4be4d2a1e4085a8a2645eb528116264f989ca5af/seeborg.cpp#L107:

wstring message = inmsg;
FilterMessage(message);
splitString(message, curlines, L". ");

sz = curlines.size();
for (i = 0; i < sz; i++) splitString(curlines[i], curwords);

There is literally a function called FilterMessage that sanitizes the text. It looks like a good place to start, until you get annoyed by how impure it is: it edits the string in place, whereas our conversion throws the old string away and hands back a whole new one... Well, you can always assign the whole thing back.
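Why the word split needs replacing at all: Chinese text has no spaces between words, so a whitespace split hands the whole sentence over as one "word". A minimal sketch of that failure, assuming a plain split on spaces (splitOnSpaces is a hypothetical stand-in written for this note, not seeborg's actual splitString):

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a whitespace word split; NOT seeborg's real splitString.
std::vector<std::wstring> splitOnSpaces(const std::wstring &line) {
    std::vector<std::wstring> words;
    std::wstring::size_type pos = 0;
    while (pos < line.size()) {
        // Skip any run of spaces before the next word.
        const auto start = line.find_first_not_of(L' ', pos);
        if (start == std::wstring::npos) break;
        // The word ends at the next space or at the end of the line.
        auto end = line.find(L' ', start);
        if (end == std::wstring::npos) end = line.size();
        words.push_back(line.substr(start, end - start));
        pos = end;
    }
    return words;
}
```

Feeding it L"resistance is futile" gives three words, while the six-character Chinese sentence L"\u62b5\u6297\u662f\u5f92\u52b3\u7684" comes back as a single element. That is the gap jieba is supposed to fill.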

Building the evil filter

As mentioned in the issue, we need two components in our evil thing: OpenCC for normalizing everything to a single script (currently preferring Trad. Chinese because it is the less ambiguous of the two), and jieba for chewing the text up into words. Since word splitting can live in a separate step, let's do the OpenCC part first.

Assuming we got the import things done, we can just do:

/* This can be shared somewhere. */
opencc::SimpleConverter  occ_conv("s2tw");  // or s2twp if you want some word convs

/* When you want to use it... in FilterMessage@seeutil.cpp */
// Time to get evil. Convert() hands back a fresh string, so assign it over the old one.
// The result is already a temporary, so an explicit std::move would add nothing.
message = boost::nowide::widen(
  occ_conv.Convert(
    boost::nowide::narrow(message)));
        
// Add some housekeeping if you must. I personally suggest doing these at the top, before
// the existing "? " -> "?. " replacements, so the sentence-break logic needs no change:
// * "？" -> "?" (preserve consecutive usage)
// * "！" -> "!"
// * "。" -> ". " (keep splitting working)
// Flatten the puncts for Chinese:
// * “ -> 「 (or both to ' "' for some better zh/en mixing?)
// * ” -> 」 (or '" ')
// * ‘ -> 『 (or " '")
// * ’ -> 』 (or "' ")
// You will hate writing all these. Try some dank macros from C!
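If macros are not your thing, a small lookup table does the same job. The sketch below is only one way to do it: replaceAll and flattenPunctuation are hypothetical names, the first two rules assume the intended sources are the fullwidth question and exclamation marks, and it picks the corner-bracket variants for quotes. Code points are written as \u escapes so the source file's encoding does not matter.

```cpp
#include <string>
#include <utility>
#include <vector>

// Replace every occurrence of `from` in `text` with `to`, in place.
void replaceAll(std::wstring &text, const std::wstring &from, const std::wstring &to) {
    if (from.empty()) return;
    std::wstring::size_type pos = 0;
    while ((pos = text.find(from, pos)) != std::wstring::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
}

// Hypothetical helper: apply the flattening rules listed in the comments above.
void flattenPunctuation(std::wstring &message) {
    static const std::vector<std::pair<std::wstring, std::wstring>> rules = {
        {L"\uFF1F", L"?"},       // fullwidth question mark (assumed source form)
        {L"\uFF01", L"!"},       // fullwidth exclamation mark (assumed source form)
        {L"\u3002", L". "},      // ideographic full stop, keeps ". " splitting working
        {L"\u201C", L"\u300C"},  // left double quote  -> left corner bracket
        {L"\u201D", L"\u300D"},  // right double quote -> right corner bracket
        {L"\u2018", L"\u300E"},  // left single quote  -> left white corner bracket
        {L"\u2019", L"\u300F"},  // right single quote -> right white corner bracket
    };
    for (const auto &rule : rules) replaceAll(message, rule.first, rule.second);
}
```

One thing to watch: U+2019 also serves as an apostrophe in English text, so the last rule may need to go if mixed zh/en input matters.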

Playing with words

That's not hard either.

// Somewhere in seeborg.cpp, share this:
// see https://github.com/yanyiwu/cppjieba/blob/master/test/demo.cpp, too lazy to copy the paths
  cppjieba::Jieba jieba(DICT_PATH,
        HMM_PATH,
        USER_DICT_PATH,
        IDF_PATH,
        STOP_WORD_PATH);

// Later on, replace the space loop with:
vector<string> curstrwords;
for (i = 0; i < sz; i++) {
  jieba.Cut(boost::nowide::narrow(curlines[i]), curstrwords, true);
  // cppjieba seems to clear curstrwords on every Cut() call, so no manual reset here.
  for (const auto &sw : curstrwords)  // take a reference to avoid copying each word
    curwords.push_back(boost::nowide::widen(sw));
}

More spots

SeeBorg::LearnLine should get the same OpenCC and jieba treatment. For the jieba code, reuse the sample above minus the outer for loop. Apply the same punctuation flattening and OpenCC conversion to the input body. Actually, make all of it a separate util function so Reply and LearnLine can share it.
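One possible shape for that shared util function, as a rough sketch: take the normalizer and the segmenter as callbacks so the OpenCC and jieba calls can be plugged in from wherever they live. Everything named here (Normalizer, Segmenter, prepareWords) is hypothetical, and dropping empty and single-space tokens is my own assumption about what the learner wants, not something seeborg or cppjieba prescribes.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callback types: plug the OpenCC conversion and the jieba cut in here.
using Normalizer = std::function<std::wstring(const std::wstring &)>;
using Segmenter = std::function<std::vector<std::wstring>(const std::wstring &)>;

// Hypothetical shared helper for Reply and LearnLine: normalize one line,
// segment it, and keep only tokens that carry content.
std::vector<std::wstring> prepareWords(const std::wstring &line,
                                       const Normalizer &normalize,
                                       const Segmenter &segment) {
    std::vector<std::wstring> words;
    for (auto &word : segment(normalize(line))) {
        // Assumption: bare spaces and empty tokens are noise for the learner.
        if (!word.empty() && word != L" ") words.push_back(std::move(word));
    }
    return words;
}
```

In seeborg the normalizer would wrap the punctuation flattening plus the narrow/Convert/widen chain, and the segmenter would wrap jieba.Cut with the same narrow/widen dance. For a quick test, an identity lambda and a per-character splitter are enough.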
