Sample suggesting that the dispersion of character and word counts could be useful for detecting prank text
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"import MeCab\n",
"\n",
"# Output format: base form (feature 6) for known tokens, surface form for unknown tokens, separated by spaces.\n",
"mecab = MeCab.Tagger('-F\\\\s%f[6] -U\\\\s%m -E\\\\n')\n",
"\n",
"def stdchar(text):\n",
"    # Print the standard deviation of per-character occurrence counts.\n",
"    char2id = {}\n",
"    boc = np.zeros(len(set(text)))  # bag of characters\n",
"    for c in text:\n",
"        if c not in char2id:\n",
"            char2id[c] = len(char2id)\n",
"        i = char2id[c]\n",
"        boc[i] += 1\n",
"    std = np.std(boc)\n",
"    print(std)\n",
"\n",
"def stdtoken(text):\n",
"    # Print the standard deviation of per-token occurrence counts.\n",
"    token2id = {}\n",
"    tokens = mecab.parse(text).strip().split(' ')\n",
"    bow = np.zeros(len(set(tokens)))  # bag of words\n",
"    for token in tokens:\n",
"        if token not in token2id:\n",
"            token2id[token] = len(token2id)\n",
"        i = token2id[token]\n",
"        bow[i] += 1\n",
"    std = np.std(bow)\n",
"    print(std)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.4812948405151698\n",
"1.1230254958405943\n"
]
}
],
"source": [
"a = \"\"\"\n",
"コロナ自粛をしない私はおかしいですか?\n",
"会社の人との感覚のズレに悩んでいます。\n",
"現在、会社は一人の感染者も出さず、同僚は「一発目になると恥ずかしいから色々自粛している」そうです。\n",
"\n",
"当の私は、夜行バスで東京、大阪、福岡を飛び回りアイドルのコンサートに行く日々を送っています。\n",
"\n",
"感染したら「不謹慎な奴」のレッテルを貼られるのでしょうか?\n",
"\"\"\"\n",
"\n",
"stdchar(a)\n",
"stdtoken(a)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.0\n",
"0.0\n"
]
}
],
"source": [
"a = \"冬のボーナス支給されましたか?\"\n",
"\n",
"stdchar(a)\n",
"stdtoken(a)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.5983516452371671\n",
"0.4453617714151233\n"
]
}
],
"source": [
"a = \"\"\"\n",
"この人は誰ですか?\n",
"グラビアアイドルの方ですか?\n",
"\"\"\"\n",
"\n",
"stdchar(a)\n",
"stdtoken(a)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5.128867104710141\n",
"3.661767875640024\n"
]
}
],
"source": [
"a = \"\"\"\n",
"ダイワ ラテオ 100ML・Q\n",
"対応ルアー:7~40g\n",
"上記のシーバスロッドでプラグ20g前後、ジグはほぼ使いませんが、よくて30gまでしか投げません。\n",
"対象となる魚種がいるかどうかは別として、40g〜60gのジグを投げてみたいのですが、ダイワ ジグキャスターMXのMまたはMHで迷っています。\n",
"Mが10g〜60gなので、20g〜40gくらいのジグが主体なのかなと想像しています。\n",
"実際、それくらいのジグを投げる方が圧倒的に多いと思います。けれど、Mだと60gのジグをなかなか投げづらいと思います。\n",
"MHの対応ルアーは25g〜90gとなっています。\n",
"\n",
"ラテオを中1の息子が扱えたらの話です。身長、155cmの体重50kgくらいです。10ftはちょっと長いかなと感じています。8.6ftのエギングロッドはキャストのシャクリも上手にできています。1人で行く時はラテオでもライトショアジギングはできるので、差別化を図る意味でもMHがいいかなと思っているのですが、ネットのレビューを見るとMHは軽めのジグは投げづらいようです。\n",
"親子でサーフに行き、1人はライトショア、1人はショアジギングができればいいかなと思っています。\n",
"もちろん、ライトショアとショアジギングの違いは、ジグの重さや対応ロッドというより、どういった魚種を狙っているのかだと思います。\n",
"経験豊富な皆様に、Mを選べばいいのかMHを選べばいいのか、御指南いただきたいです。\n",
"関西でいうところのハマチサイズは釣れるサーフです。ただ、普段はツバスサイズ、シオサイズばかりです。\n",
"やはり、60gのジグはいらないでしょうか?\n",
"\"\"\"\n",
"\n",
"stdchar(a)\n",
"stdtoken(a)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"9.542303472199759\n",
"0.5921294486432991\n"
]
}
],
"source": [
"a = \"TIL of Jólabókaflóðið, an Icelandic tradition of giving books at Christmas. Books are so popular as gifts that, per capita, they read the most books on Earth and publishing occurs just months before Christmas. Many celebrate Christmas by lying in bed eating chocolates and reading one of their books!\"\n",
"\n",
"stdchar(a)\n",
"stdtoken(a)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"6.04737539005183\n",
"0.34992710611188255\n"
]
}
],
"source": [
"a = \"Is anyone roughly aware of how the research agenda of DeepMind is structured these days? Do they still believe that RL is the way through which AGI will be created?\"\n",
"\n",
"stdchar(a)\n",
"stdtoken(a)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.823529411764706\n",
"0.0\n"
]
}
],
"source": [
"a = \"【速報】ワイ(34)の12月給与wwwwwwwwwwwww\"\n",
"\n",
"stdchar(a)\n",
"stdtoken(a)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7.46125994722071\n",
"6.754555754266893\n"
]
}
],
"source": [
"a = \"\"\"\n",
"コロナだってのに忘年会やる奴らマジ勘弁っっっっっっっっっっっっっっっっっっっっっっっっっっっっっっっwっっっっっっっっっw\n",
"今日店にケーキ買いに来た奴ら絶対クリパやる気だタヒね\n",
"wwwwwwwwwwwwwwwwwwwwwwwww\n",
"今度来たらケーキにパブロン死ぬほど入れてやる\n",
"WWWWWWWWWWWWWWWWWWWWWWWWWWWW\n",
"\"\"\"\n",
"stdchar(a)\n",
"stdtoken(a)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
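The two functions in the notebook share the same idea: count how often each character or token occurs, then take the standard deviation of those counts. A minimal sketch of that idea follows, returning the value instead of printing it so it can be compared against a threshold. The names `count_std`, `char_std`, and `token_std` are hypothetical and not part of the notebook, and the default whitespace tokenizer is only a stand-in assumption; for Japanese text a MeCab-based tokenizer like the notebook's would be passed in instead.

```python
from collections import Counter

import numpy as np


def count_std(items):
    """Return the population standard deviation of occurrence counts."""
    counts = np.array(list(Counter(items).values()), dtype=float)
    # An empty input has no counts; return 0.0 instead of NaN.
    return float(np.std(counts)) if counts.size else 0.0


def char_std(text):
    """Dispersion of per-character counts (hypothetical helper)."""
    return count_std(text)


def token_std(text, tokenize=str.split):
    """Dispersion of per-token counts.

    `tokenize` defaults to whitespace splitting, which is an assumption
    made for this sketch; it is not how the notebook tokenizes.
    """
    return count_std(tokenize(text))
```

A text in which every character occurs once scores 0.0, while a long run of one repeated character (such as a string of "w") pushes the score up, which matches the pattern in the notebook outputs above.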