@rocka
Created July 6, 2022 10:31
Split fcitx5-chinese-addons `sc.dict`
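// filter.mjs: split dict_sc.txt into entries with and without characters at or
// above U+20000 (CJK Extension B and the later supplementary-plane blocks).
import { open } from 'node:fs/promises';
import readline from 'node:readline';

const segmenter = new Intl.Segmenter('zh-CN', { granularity: 'grapheme' });

const file = await open('./dict_sc.txt');
const rl = readline.createInterface({
  input: file.createReadStream(),
  crlfDelay: Infinity // \r followed by \n will always be considered a single newline
});

for await (const line of rl) {
  // Each entry is tab-separated: hanzi, pinyin, weight. Only hanzi is inspected.
  const [hanzi] = line.split('\t');
  let hasExtB = false;
  for (const { segment } of segmenter.segment(hanzi)) {
    // U+20000 is the first code point of Extension B, so the bound is inclusive.
    if (segment.codePointAt(0) >= 0x20000) {
      hasExtB = true;
      break;
    }
  }
  if (hasExtB) {
    console.warn(line); // stderr: entries containing Extension B or later characters
  } else {
    console.log(line); // stdout: all other entries
  }
}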
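#!/bin/sh
# Stop at the first failing step so a bad download is not passed on to the filter.
set -e
curl -L https://download.fcitx-im.org/data/dict_sc.txt-20220628.tar.xz -o dict_sc.tar.xz
tar xf dict_sc.tar.xz
# stdout receives the ordinary entries, stderr the entries flagged by filter.mjs.
node filter.mjs 1>dict_simp.txt 2>dict_extb.txt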
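
The split that filter.mjs performs through stdout and stderr can also be sketched as a pure function, which is easier to check without downloading the dictionary. This is only a minimal sketch: `hasSupplementaryIdeograph` and `splitEntries` are hypothetical names that do not appear in the gist, the sample entries are made up, and the check looks at every code point rather than the first code point of each grapheme, so it may flag slightly more lines than the script does.

```javascript
// Hypothetical helper (not part of the gist): true if any code point in `text`
// is at or above U+20000, the start of CJK Unified Ideographs Extension B.
function hasSupplementaryIdeograph(text) {
  // for...of walks a string by code point, so surrogate pairs stay intact.
  for (const ch of text) {
    if (ch.codePointAt(0) >= 0x20000) return true;
  }
  return false;
}

// Hypothetical helper: partition tab-separated entries by their first field.
function splitEntries(lines) {
  const simp = [];
  const extb = [];
  for (const line of lines) {
    const [hanzi] = line.split('\t');
    (hasSupplementaryIdeograph(hanzi) ? extb : simp).push(line);
  }
  return { simp, extb };
}

// Made-up sample entries for illustration; not real dictionary data.
const sample = [
  "你好\tni'hao\t0",
  `${String.fromCodePoint(0x20000)}\the\t0`,
];
const { simp, extb } = splitEntries(sample);
console.log(simp.length, extb.length);
```

Keeping the predicate separate from the I/O means the boundary case (U+20000 itself) can be exercised directly, while the script above stays a thin wrapper around reading lines and choosing an output stream.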