saqib-ahmed/language_agnostic_NER.md

Language Agnostic Chatbot

Step 1:

Clone the rasa_nlu repository and check out the lean-crf branch. This is only required until PR #1095 gets merged. Once it is merged, a plain clone will do the trick.
$ git clone https://github.com/RasaHQ/rasa_nlu.git
$ cd rasa_nlu
$ git fetch
$ git checkout lean-crf

Step 2:

Create NLU training data in the language of your choice using this online trainer, with predefined entities and intents.

Step 3:

Create an NLU config file with the following pipeline:
pipeline:
- name: "tokenizer_whitespace"
- name: "ner_crf"
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
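
None of the components in this pipeline depend on a pretrained language model, which is what makes the setup language agnostic: tokens come from splitting on whitespace, and the CRF and intent classifier learn only from your training data. A minimal sketch of whitespace tokenization with character offsets is shown below. The helper `whitespace_tokens` is a hypothetical name for illustration and is not Rasa's own implementation.

```python
import re


def whitespace_tokens(text):
    """Return (token, start, end) for every run of non-whitespace characters."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


# A sentence taken from the Arabic example data in this document.
sentence = "هل هناك أي مطاعم في شمال المدينة"

for token, start, end in whitespace_tokens(sentence):
    # start/end are character offsets, the same kind used for entity spans
    print(start, end, token)
```

The offsets produced this way line up with the `start` and `end` fields used for entities in the training data, so an entity should begin and end on token boundaries.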

Step 4:

Finally, train your bot using the training data and the config file created in steps 2 and 3 respectively.
$ python -m rasa_nlu.train -d <training file> -c <config file> --path <output path> --debug

You can use the following command to run the bot with the trained model:
$ python -m rasa_nlu.run -m <path/to/trained/model>

Example data (Arabic):

I tried it with a small Arabic dataset and it worked perfectly (identifying correct intents and entities).
testData.json

{
  "rasa_nlu_data": {
    "common_examples": [
      {
        "text": "مرحبا",
        "intent": "greet",
        "entities": []
      },
      {
        "text": "سلام",
        "intent": "greet",
        "entities": []
      },
      {
        "text": "هل هناك أي مطاعم في شمال المدينة",
        "intent": "restaurant_search",
        "entities": [
          {
            "start": 20,
            "end": 24,
            "value": "شمال",
            "entity": "location"
          }
        ]
      },
      {
        "text": "أنا أريد أن آكل البيتزا",
        "intent": "restaurant_search",
        "entities": [
          {
            "start": 16,
            "end": 23,
            "value": "البيتزا",
            "entity": "cuisine"
          }
        ]
      },
      {
        "text": "وداعا",
        "intent": "bye",
        "entities": []
      },
      {
        "text": "هل يوجد مطعم مكسيكي في كاليفورنيا؟",
        "intent": "restaurant_search",
        "entities": [
          {
            "start": 13,
            "end": 19,
            "value": "مكسيكي",
            "entity": "cuisine"
          },
          {
            "start": 23,
            "end": 33,
            "value": "كاليفورنيا",
            "entity": "location"
          }
        ]
      },
      {
        "text": "أعطني بعض الطعام التايلاندية في وسط المدينة.",
        "intent": "restaurant_search",
        "entities": [
          {
            "start": 17,
            "end": 28,
            "value": "التايلاندية",
            "entity": "cuisine"
          },
          {
            "start": 32,
            "end": 43,
            "value": "وسط المدينة",
            "entity": "location"
          }
        ]
      },
      {
        "text": "أراك لاحقاً",
        "intent": "bye",
        "entities": []
      }
    ]
  }
}
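
Entity offsets are easy to get wrong when writing them by hand, especially in right-to-left text. The sketch below checks that each entity's `start` and `end` slice out exactly its `value`. The function name `check_entity_offsets` is hypothetical, and the sample is a single example copied from the data above.

```python
def check_entity_offsets(training_data):
    """Return (text, entity) pairs whose start/end do not slice out the value."""
    mismatches = []
    for example in training_data["rasa_nlu_data"]["common_examples"]:
        text = example["text"]
        for entity in example.get("entities", []):
            if text[entity["start"]:entity["end"]] != entity["value"]:
                mismatches.append((text, entity))
    return mismatches


# One example copied from testData.json.
sample = {
    "rasa_nlu_data": {
        "common_examples": [
            {
                "text": "هل هناك أي مطاعم في شمال المدينة",
                "intent": "restaurant_search",
                "entities": [
                    {"start": 20, "end": 24, "value": "شمال", "entity": "location"}
                ],
            }
        ]
    }
}

# An empty list means every entity span matches its value.
print(check_entity_offsets(sample))
```

To check the whole file, load it with `json.load(open("testData.json", encoding="utf-8"))` and pass the result to the same function.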

Output

After training on this data with the pipeline above, I got the following output from rasa_nlu.run:
2018-05-26 19:02:03 INFO     __main__ - NLU model loaded. Type a message and press enter to parse it.
أعطني بعض الطعام التايلاندية في وسط المدينة
{
  "intent": {
    "name": "restaurant_search",
    "confidence": 0.9609272480010986
  },
  "entities": [
    {
      "start": 17,
      "end": 28,
      "value": "\u0627\u0644\u062a\u0627\u064a\u0644\u0627\u0646\u062f\u064a\u0629",
      "entity": "cuisine",
      "confidence": 0.8295018967592525,
      "extractor": "ner_crf"
    }
  ],
  "intent_ranking": [
    {
      "name": "restaurant_search",
      "confidence": 0.9609272480010986
    },
    {
      "name": "greet",
      "confidence": 0.01041296124458313
    },
    {
      "name": "bye",
      "confidence": -0.04000069573521614
    }
  ],
  "text": "\u0623\u0639\u0637\u0646\u064a \u0628\u0639\u0636 \u0627\u0644\u0637\u0639\u0627\u0645 \u0627\u0644\u062a\u0627\u064a\u0644\u0627\u0646\u062f\u064a\u0629 \u0641\u064a \u0648\u0633\u0637 \u0627\u0644\u0645\u062f\u064a\u0646\u0629"
}
2018-05-26 19:02:06 INFO     __main__ - Next message:
هل هناك أي مطاعم في شمال المدينة
{
  "intent": {
    "name": "restaurant_search",
    "confidence": 0.95904541015625
  },
  "entities": [
    {
      "start": 20,
      "end": 24,
      "value": "\u0634\u0645\u0627\u0644",
      "entity": "location",
      "confidence": 0.7714837162104833,
      "extractor": "ner_crf"
    }
  ],
  "intent_ranking": [
    {
      "name": "restaurant_search",
      "confidence": 0.95904541015625
    },
    {
      "name": "bye",
      "confidence": 0.016841422766447067
    },
    {
      "name": "greet",
      "confidence": -0.01291605830192566
    }
  ],
  "text": "\u0647\u0644 \u0647\u0646\u0627\u0643 \u0623\u064a \u0645\u0637\u0627\u0639\u0645 \u0641\u064a \u0634\u0645\u0627\u0644 \u0627\u0644\u0645\u062f\u064a\u0646\u0629"
}

The Arabic text in the output is shown as \uXXXX Unicode escape sequences. I cross-checked that these escapes match the desired entity texts exactly. Converting them back to readable characters isn't a problem; printing the strings in Python is enough:
>>> print("\u0627\u0644\u062a\u0627\u064a\u0644\u0627\u0646\u062f\u064a\u0629")
التايلاندية
>>> print("\u0634\u0645\u0627\u0644")
شمال
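
The escapes are most likely just a result of how the result is serialized to JSON (Python's `json.dumps` escapes non-ASCII characters by default), not a problem with the model. A minimal sketch, using a hand-copied fragment of the output above rather than a live parse result:

```python
import json

# A fragment of the parse result above, as JSON text with \uXXXX escapes.
raw = '{"value": "\\u0634\\u0645\\u0627\\u0644", "entity": "location"}'

# Parsing the JSON turns the escapes back into ordinary characters.
parsed = json.loads(raw)
print(parsed["value"])

# ensure_ascii=False keeps non-ASCII characters readable when re-serializing.
readable = json.dumps(parsed, ensure_ascii=False)
print(readable)
```

Both forms carry the same data, so any client that parses the JSON receives the Arabic text unchanged.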