@BrikerMan
Last active April 10, 2020 21:04
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 357
},
"colab_type": "code",
"id": "MHXIom1tFv3E",
"outputId": "b551c791-7db9-49f8-c317-48ef505049eb"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2019-05-30 06:34:33-- https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip\n",
"Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.212.128, 2607:f8b0:4001:c05::80\n",
"Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.212.128|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 381892918 (364M) [application/zip]\n",
"Saving to: ‘chinese_L-12_H-768_A-12.zip’\n",
"\n",
"chinese_L-12_H-768_ 100%[===================>] 364.20M 105MB/s in 3.5s \n",
"\n",
"2019-05-30 06:34:37 (105 MB/s) - ‘chinese_L-12_H-768_A-12.zip’ saved [381892918/381892918]\n",
"\n",
"Archive: chinese_L-12_H-768_A-12.zip\n",
" creating: chinese_L-12_H-768_A-12/\n",
" inflating: chinese_L-12_H-768_A-12/bert_model.ckpt.meta \n",
" inflating: chinese_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001 \n",
" inflating: chinese_L-12_H-768_A-12/vocab.txt \n",
" inflating: chinese_L-12_H-768_A-12/bert_model.ckpt.index \n",
" inflating: chinese_L-12_H-768_A-12/bert_config.json \n",
"bert_config.json\t\t bert_model.ckpt.index vocab.txt\n",
"bert_model.ckpt.data-00000-of-00001 bert_model.ckpt.meta\n"
]
}
],
"source": [
"# Download BERT\n",
"!wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip\n",
"!unzip chinese_L-12_H-768_A-12.zip\n",
"!ls chinese_L-12_H-768_A-12"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "mKVBZdCrDacZ"
},
"outputs": [],
"source": [
"!pip install kashgari\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"colab_type": "code",
"id": "r3mGKW2Q7mli",
"outputId": "d963ba56-49af-4454-e16c-58af6152b56b"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"chinese_L-12_H-768_A-12\n"
]
}
],
"source": [
"# BERT_PATH = '/input0/BERT/chinese_L-12_H-768_A-12/'\n",
"BERT_PATH = 'chinese_L-12_H-768_A-12'\n",
"print(BERT_PATH)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 85
},
"colab_type": "code",
"id": "bkrSeQSI7pTT",
"outputId": "c4d3052e-9197-46d0-89c8-e82fc33f4b72"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Using TensorFlow backend.\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Creating data folder...\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"file_sizes: 0%| | 0.00/2.44M [00:00<?, ?B/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Downloading data from http://storage.eliyar.biz/corpus/china-people-daily-ner-corpus.tar.gz (2.3 MB)\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"file_sizes: 100%|██████████████████████████| 2.44M/2.44M [00:02<00:00, 1.10MB/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Extracting tar.gz file...\n",
"Successfully downloaded / unzipped to /root/.kashgari/corpus\n",
"train data count: 20864\n",
"validate data count: 2318\n",
"test data count: 4636\n"
]
}
],
"source": [
"from kashgari.corpus import ChinaPeoplesDailyNerCorpus\n",
"\n",
"train_x, train_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('train')\n",
"validate_x, validate_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('validate')\n",
"test_x, test_y = ChinaPeoplesDailyNerCorpus.get_sequence_tagging_data('test')\n",
"\n",
"print(f\"train data count: {len(train_x)}\")\n",
"print(f\"validate data count: {len(validate_x)}\")\n",
"print(f\"test data count: {len(test_x)}\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 221
},
"colab_type": "code",
"id": "xplap8lL7t2b",
"outputId": "620b940a-6747-47c4-f3f5-23b99cdf863a"
},
"outputs": [],
"source": [
"from kashgari.embeddings import BERTEmbedding\n",
"embedding = BERTEmbedding(BERT_PATH, 100)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "HmeueeKY7vv5"
},
"outputs": [],
"source": [
"from kashgari.tasks.seq_labeling import BLSTMCRFModel"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"model = BLSTMCRFModel(embedding)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 4267
},
"colab_type": "code",
"id": "_mzlD5GeCpUs",
"outputId": "031241ae-b37d-4138-f042-55575b23d898"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0018 - crf_accuracy: 0.9991 - val_loss: 0.0058 - val_crf_accuracy: 0.9965\n",
"Epoch 2/50\n",
"20/20 [==============================] - 62s 3s/step - loss: -0.0023 - crf_accuracy: 0.9993 - val_loss: 8.0986e-04 - val_crf_accuracy: 0.9977\n",
"Epoch 3/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0025 - crf_accuracy: 0.9993 - val_loss: 0.0055 - val_crf_accuracy: 0.9966\n",
"Epoch 4/50\n",
"20/20 [==============================] - 63s 3s/step - loss: -0.0028 - crf_accuracy: 0.9994 - val_loss: 0.0047 - val_crf_accuracy: 0.9968\n",
"Epoch 5/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0029 - crf_accuracy: 0.9994 - val_loss: 0.0057 - val_crf_accuracy: 0.9965\n",
"Epoch 6/50\n",
"20/20 [==============================] - 65s 3s/step - loss: -0.0030 - crf_accuracy: 0.9994 - val_loss: 0.0049 - val_crf_accuracy: 0.9967\n",
"Epoch 7/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0033 - crf_accuracy: 0.9995 - val_loss: 0.0054 - val_crf_accuracy: 0.9966\n",
"Epoch 8/50\n",
"20/20 [==============================] - 63s 3s/step - loss: -0.0037 - crf_accuracy: 0.9996 - val_loss: 0.0038 - val_crf_accuracy: 0.9970\n",
"Epoch 9/50\n",
"20/20 [==============================] - 63s 3s/step - loss: -0.0038 - crf_accuracy: 0.9996 - val_loss: 0.0049 - val_crf_accuracy: 0.9967\n",
"Epoch 10/50\n",
"20/20 [==============================] - 65s 3s/step - loss: -0.0039 - crf_accuracy: 0.9995 - val_loss: 0.0042 - val_crf_accuracy: 0.9967\n",
"Epoch 11/50\n",
"20/20 [==============================] - 62s 3s/step - loss: -0.0042 - crf_accuracy: 0.9996 - val_loss: 0.0033 - val_crf_accuracy: 0.9970\n",
"Epoch 12/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0044 - crf_accuracy: 0.9996 - val_loss: 0.0049 - val_crf_accuracy: 0.9967\n",
"Epoch 13/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0045 - crf_accuracy: 0.9996 - val_loss: 0.0033 - val_crf_accuracy: 0.9970\n",
"Epoch 14/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0046 - crf_accuracy: 0.9996 - val_loss: 0.0043 - val_crf_accuracy: 0.9968\n",
"Epoch 15/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0048 - crf_accuracy: 0.9997 - val_loss: 0.0052 - val_crf_accuracy: 0.9967\n",
"Epoch 16/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0050 - crf_accuracy: 0.9997 - val_loss: 0.0041 - val_crf_accuracy: 0.9968\n",
"Epoch 17/50\n",
"20/20 [==============================] - 62s 3s/step - loss: -0.0052 - crf_accuracy: 0.9997 - val_loss: 0.0031 - val_crf_accuracy: 0.9970\n",
"Epoch 18/50\n",
"20/20 [==============================] - 67s 3s/step - loss: -0.0053 - crf_accuracy: 0.9997 - val_loss: 0.0045 - val_crf_accuracy: 0.9967\n",
"Epoch 19/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0054 - crf_accuracy: 0.9997 - val_loss: 0.0050 - val_crf_accuracy: 0.9966\n",
"Epoch 20/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0056 - crf_accuracy: 0.9997 - val_loss: 0.0029 - val_crf_accuracy: 0.9970\n",
"Epoch 21/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0058 - crf_accuracy: 0.9998 - val_loss: 0.0039 - val_crf_accuracy: 0.9970\n",
"Epoch 22/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0059 - crf_accuracy: 0.9998 - val_loss: 0.0039 - val_crf_accuracy: 0.9969\n",
"Epoch 23/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0061 - crf_accuracy: 0.9998 - val_loss: 0.0026 - val_crf_accuracy: 0.9971\n",
"Epoch 24/50\n",
"20/20 [==============================] - 65s 3s/step - loss: -0.0062 - crf_accuracy: 0.9998 - val_loss: 0.0041 - val_crf_accuracy: 0.9968\n",
"Epoch 25/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0064 - crf_accuracy: 0.9998 - val_loss: 0.0043 - val_crf_accuracy: 0.9967\n",
"Epoch 26/50\n",
"20/20 [==============================] - 63s 3s/step - loss: -0.0064 - crf_accuracy: 0.9998 - val_loss: -0.0016 - val_crf_accuracy: 0.9976\n",
"Epoch 27/50\n",
"20/20 [==============================] - 68s 3s/step - loss: -0.0065 - crf_accuracy: 0.9997 - val_loss: 0.0040 - val_crf_accuracy: 0.9967\n",
"Epoch 28/50\n",
"20/20 [==============================] - 63s 3s/step - loss: -0.0068 - crf_accuracy: 0.9998 - val_loss: 0.0036 - val_crf_accuracy: 0.9967\n",
"Epoch 29/50\n",
"20/20 [==============================] - 67s 3s/step - loss: -0.0069 - crf_accuracy: 0.9998 - val_loss: 0.0041 - val_crf_accuracy: 0.9966\n",
"Epoch 30/50\n",
"20/20 [==============================] - 62s 3s/step - loss: -0.0072 - crf_accuracy: 0.9998 - val_loss: 0.0031 - val_crf_accuracy: 0.9968\n",
"Epoch 31/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0073 - crf_accuracy: 0.9998 - val_loss: 0.0039 - val_crf_accuracy: 0.9967\n",
"Epoch 32/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0074 - crf_accuracy: 0.9998 - val_loss: 0.0014 - val_crf_accuracy: 0.9972\n",
"Epoch 33/50\n",
"20/20 [==============================] - 65s 3s/step - loss: -0.0076 - crf_accuracy: 0.9998 - val_loss: 0.0026 - val_crf_accuracy: 0.9970\n",
"Epoch 34/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0077 - crf_accuracy: 0.9998 - val_loss: 0.0031 - val_crf_accuracy: 0.9969\n",
"Epoch 35/50\n",
"20/20 [==============================] - 63s 3s/step - loss: -0.0078 - crf_accuracy: 0.9998 - val_loss: -0.0026 - val_crf_accuracy: 0.9980\n",
"Epoch 36/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0080 - crf_accuracy: 0.9998 - val_loss: 0.0037 - val_crf_accuracy: 0.9966\n",
"Epoch 37/50\n",
"20/20 [==============================] - 63s 3s/step - loss: -0.0082 - crf_accuracy: 0.9998 - val_loss: 0.0020 - val_crf_accuracy: 0.9971\n",
"Epoch 38/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0083 - crf_accuracy: 0.9998 - val_loss: 0.0042 - val_crf_accuracy: 0.9967\n",
"Epoch 39/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0085 - crf_accuracy: 0.9999 - val_loss: 0.0019 - val_crf_accuracy: 0.9971\n",
"Epoch 40/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0086 - crf_accuracy: 0.9998 - val_loss: 0.0012 - val_crf_accuracy: 0.9971\n",
"Epoch 41/50\n",
"20/20 [==============================] - 65s 3s/step - loss: -0.0087 - crf_accuracy: 0.9998 - val_loss: 0.0051 - val_crf_accuracy: 0.9964\n",
"Epoch 42/50\n",
"20/20 [==============================] - 65s 3s/step - loss: -0.0089 - crf_accuracy: 0.9999 - val_loss: 0.0015 - val_crf_accuracy: 0.9971\n",
"Epoch 43/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0090 - crf_accuracy: 0.9998 - val_loss: 0.0025 - val_crf_accuracy: 0.9969\n",
"Epoch 44/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0092 - crf_accuracy: 0.9999 - val_loss: 3.8593e-04 - val_crf_accuracy: 0.9971\n",
"Epoch 45/50\n",
"20/20 [==============================] - 65s 3s/step - loss: -0.0094 - crf_accuracy: 0.9999 - val_loss: 0.0027 - val_crf_accuracy: 0.9968\n",
"Epoch 46/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0095 - crf_accuracy: 0.9999 - val_loss: -1.1398e-04 - val_crf_accuracy: 0.9971\n",
"Epoch 47/50\n",
"20/20 [==============================] - 66s 3s/step - loss: -0.0097 - crf_accuracy: 0.9999 - val_loss: 0.0024 - val_crf_accuracy: 0.9967\n",
"Epoch 48/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0099 - crf_accuracy: 0.9999 - val_loss: 0.0012 - val_crf_accuracy: 0.9970\n",
"Epoch 49/50\n",
"20/20 [==============================] - 64s 3s/step - loss: -0.0100 - crf_accuracy: 0.9999 - val_loss: 0.0011 - val_crf_accuracy: 0.9971\n",
"Epoch 50/50\n",
"20/20 [==============================] - 65s 3s/step - loss: -0.0101 - crf_accuracy: 0.9999 - val_loss: 0.0021 - val_crf_accuracy: 0.9968\n"
]
}
],
"source": [
"model.fit(train_x,\n",
" train_y,\n",
" x_validate=validate_x,\n",
" y_validate=validate_y,\n",
" epochs=100,\n",
" batch_size=1024)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 207
},
"colab_type": "code",
"id": "hb_6hCMcn7UK",
"outputId": "9071ea6e-532f-42df-a12e-d9f1a0dda460"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" LOC 0.9296 0.9359 0.9328 3431\n",
" PER 0.9629 0.9666 0.9647 1797\n",
" ORG 0.8680 0.8761 0.8720 2147\n",
"\n",
"micro avg 0.9197 0.9260 0.9228 7375\n",
"macro avg 0.9198 0.9260 0.9229 7375\n",
"\n"
]
},
{
"data": {
"text/plain": [
"' precision recall f1-score support\\n\\n LOC 0.9296 0.9359 0.9328 3431\\n PER 0.9629 0.9666 0.9647 1797\\n ORG 0.8680 0.8761 0.8720 2147\\n\\nmicro avg 0.9197 0.9260 0.9228 7375\\nmacro avg 0.9198 0.9260 0.9229 7375\\n'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.evaluate(test_x, test_y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "nCLC7TWTHWPG"
},
"outputs": [],
"source": []
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"collapsed_sections": [],
"include_colab_link": true,
"name": "Kashgar_Chinese_NER",
"provenance": [],
"version": "0.3.2"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
@RupertLuo

Hi, what are the APIs for saving and loading the model?

@BrikerMan (Author)

> Hi, what are the APIs for saving and loading the model?

Please see https://github.com/BrikerMan/Kashgari/
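For reference, a minimal save/load sketch assuming the `model.save()` / `load_model()` pair described in the Kashgari docs of that era (newer releases load via `kashgari.utils.load_model` instead); the folder name is just a placeholder:

```python
# Sketch only; assumes a trained BLSTMCRFModel as in this notebook.
from kashgari.tasks.seq_labeling import BLSTMCRFModel

model.save('ner_model')                                # write weights + config to a folder
loaded_model = BLSTMCRFModel.load_model('ner_model')   # reload for inference
# On Kashgari >= 1.x this would be: kashgari.utils.load_model('ner_model')
```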

@ZhyiXu commented Jul 23, 2019

How do I specify which GPU to use for training?

@BrikerMan (Author)

> How do I specify which GPU to use for training?

Just install tensorflow-gpu. For more details, see the docs at https://github.com/BrikerMan/Kashgari/
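As an illustration (this is plain TensorFlow/CUDA behaviour, not a Kashgari-specific API), you can pin training to one GPU by setting CUDA_VISIBLE_DEVICES before TensorFlow is imported:

```python
# Restrict TensorFlow to a single GPU; must run before importing tensorflow / kashgari.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'   # expose only GPU 0 ('0,1' would expose two)

import kashgari  # with tensorflow-gpu installed, training now runs on the selected GPU
```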

@weare1team

After the model is trained, how do I run prediction with it? What does the code look like?

@BrikerMan (Author)

You can use the predict or predict_entities API; docs: https://kashgari.bmio.net/api/tasks.labeling/#predict. Using the latest version is recommended.
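A rough usage sketch, assuming a trained sequence-labeling model like the one in this notebook; predict expects tokenized input (for Chinese, a list of character lists), and the sentence below is just an example:

```python
# Sketch only; `model` is the trained BLSTMCRFModel from the notebook.
sentence = list('中国人民银行位于北京')   # tokenize into characters
tags = model.predict([sentence])           # one BIO tag sequence per input sentence
print(tags)                                # e.g. [['B-ORG', 'I-ORG', ..., 'B-LOC', 'I-LOC']]

# In newer Kashgari releases, predict_entities groups the tags into entity spans:
# entities = model.predict_entities([sentence])
```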

@BrikerMan (Author)

> I'm getting the error ZeroDivisionError: Weights sum to zero, can't be normalized. What causes this?

This notebook is out of date; please write a new test following https://kashgari.bmio.net instead.

@xmy7216 commented Aug 13, 2019

Hi, can I use my own entity labels?

@BrikerMan (Author)

> Hi, can I use my own entity labels?

Yes, see https://kashgari-zh.bmio.net/tutorial/text-labeling/#_2
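To illustrate the expected shape with custom labels (the label names below are invented for the example): x is a list of tokenized sentences, y is a parallel list of tag sequences, and the tag vocabulary is built from whatever labels appear in the data.

```python
# Hypothetical custom-label data in the same layout as ChinaPeoplesDailyNerCorpus:
train_x = [
    list('张三在苹果公司上班'),                       # 9 characters
]
train_y = [
    ['B-NAME', 'I-NAME', 'O', 'B-COMP', 'I-COMP', 'I-COMP', 'I-COMP', 'O', 'O'],
]
# model.fit(train_x, train_y, ...) then works unchanged with these labels.
```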

@BrikerMan (Author)

> Thanks, it works now, but prediction seems quite slow. I timed it: predicting 2,457 samples took 511 seconds. Is that normal?

BERT inference is inherently slow; deploying with a GPU plus TF-Serving can speed it up somewhat.

@xmy7216 commented Aug 15, 2019

> Hi, can I use my own entity labels?
>
> Yes, see https://kashgari-zh.bmio.net/tutorial/text-labeling/#_2

I want to use my own dataset, annotated with my own entity labels rather than the original PER and so on. Once the data format matches, I don't need to specify the label set and the program will detect the entity labels automatically, right?

@BrikerMan (Author)

That's right: you can use your own labels and the program will handle them automatically. See this example:

https://github.com/BrikerMan/Kashgari/blob/8831993ff32efd91ab1689b864f7037310670e84/tests/corpus.py#L61
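If your corpus is stored in the usual two-column format (one "character tag" pair per line, blank lines between sentences), a loader along these lines produces the (x, y) lists Kashgari expects; the filename is a placeholder:

```python
# Sketch: read a CoNLL-style file into the (x, y) lists used in this notebook.
def load_corpus(path):
    x_data, y_data = [], []
    tokens, tags = [], []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:                     # blank line ends the current sentence
                if tokens:
                    x_data.append(tokens)
                    y_data.append(tags)
                    tokens, tags = [], []
                continue
            token, tag = line.split()        # rows like "京 B-LOC"
            tokens.append(token)
            tags.append(tag)
    if tokens:                               # flush the last sentence
        x_data.append(tokens)
        y_data.append(tags)
    return x_data, y_data

train_x, train_y = load_corpus('my_train.txt')   # placeholder filename
```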

@xmy7216 commented Aug 15, 2019

> That's right: you can use your own labels and the program will handle them automatically. See this example:
>
> https://github.com/BrikerMan/Kashgari/blob/8831993ff32efd91ab1689b864f7037310670e84/tests/corpus.py#L61

OK, thank you! I'll dig into it myself first and come back if I run into problems. Thanks again.

@xmy7216 commented Aug 19, 2019

> That's right: you can use your own labels and the program will handle them automatically. See this example:
>
> https://github.com/BrikerMan/Kashgari/blob/8831993ff32efd91ab1689b864f7037310670e84/tests/corpus.py#L61

Hi, in the predictions I see: (1) entities that never start with "B-xx" and jump straight to "I-xx"; (2) multi-character entities whose run of "I-xxx" tags is cut off by an "O" in the middle; (3) entities that start with "B-labelA" but are followed by "I-labelB".
What usually causes this? I'm using BiLSTM_CRF_Model; why would this happen?

@MaXXXXfeng

Hi, a simple question: after calling the model's fit method and training it, how do I save the model?
And when doing prediction later, do I just call the predict method directly?

@BrikerMan (Author)

> Hi, a simple question: after calling the model's fit method and training it, how do I save the model?
> And when doing prediction later, do I just call the predict method directly?

Yes. For detailed usage, see https://kashgari-zh.bmio.net

@hziheng commented Sep 25, 2019

Hi, can I use data annotated with the BMESO scheme, or only BIO-annotated data?

@BrikerMan (Author) commented Sep 25, 2019 via email

@hujukee commented Oct 12, 2019

In cell In [3] I get a timeout error; what might be the cause? Thanks.

@steamfeifei

Can I select multiple GPUs for training?

@steamfeifei

Is the TF 2.0 version of BERT + BiLSTM + CRF available yet?
