NLU Comparison between Clare.AI, IBM Watson, and Dialogflow


Clare.AI builds an enterprise AI conversation platform powered by a proprietary natural language processing model that focuses on Asian languages. Enterprise clients use our platform to build digital assistants across the entire conversational customer journey, from queries and quotations to claims. We have launched digital assistants in four markets – Hong Kong, Taiwan, Malaysia and the Philippines – and envision expanding to additional countries across Asia. Powered by our constantly improving proprietary natural language understanding (NLU) engine, our digital assistant can respond in more than ten Asian languages (including Cantonese) and handle their mixed usage with English, making us a leading regional conversational commerce platform. This article therefore compares the NLU capabilities of Clare.AI and other chatbot solutions, such as IBM Watson and Dialogflow, in English, Cantonese, and Chinese, by presenting and explaining benchmark results.

Before we begin testing and presenting the comparison data, it is useful to understand why Asian languages such as Cantonese and Chinese are so much harder to parse than English. For more detail on this issue, see our earlier article, Understanding Natural Language Queries.

NLU enables computers to understand natural human language, so they can interpret commands without the formalized syntax of computer languages. NLU also helps computers communicate back to humans in their own languages.

Many commercial natural language understanding (NLU) services currently work in a similar way:

  • A natural language query is sent to the service (e.g. “I want to buy car insurance”)
  • The intention (the action behind the query) is detected (e.g. “buy car insurance”)
  • A structured representation of the user query is sent back

Each of these steps, however, can introduce uncertainty, and each requires a different model to be trained.

There are also several parameters to examine when comparing NLU engines:

  • Correct and Wrong Answers
    • The correct answer count is the number of right predictions the bot makes, while the wrong answer count is the number of false predictions.
  • Accuracy
    • The ratio of correct predictions to total predictions. This parameter shows how well variations – including mixed languages, long forms, and short forms – can be recognized as the original intent.
  • False Positives
    • The number of wrong predictions with a confidence level over 80%. In these cases the bot confidently answers the question even though the answer is actually wrong.
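As a rough illustration, the three parameters above can be computed from a list of (true intent, predicted intent, confidence) records. This is a minimal sketch with made-up intent names, not the evaluation code used in the benchmark:

```python
# Sketch: compute correct answers, accuracy, and false positives at an
# 80% confidence threshold. Intent names below are hypothetical examples.

def evaluate(predictions, threshold=0.8):
    """predictions: list of (true_intent, predicted_intent, confidence)."""
    correct = sum(1 for true, pred, _ in predictions if pred == true)
    wrong = len(predictions) - correct
    accuracy = correct / len(predictions)
    # A false positive: the bot answers confidently but with the wrong intent.
    false_positives = sum(
        1 for true, pred, conf in predictions
        if pred != true and conf > threshold
    )
    return {"correct": correct, "wrong": wrong,
            "accuracy": accuracy, "false_positives": false_positives}

sample = [
    ("buy_car_insurance", "buy_car_insurance", 0.95),  # correct
    ("open_account", "buy_car_insurance", 0.91),       # confident but wrong
    ("check_balance", "reset_password", 0.55),         # wrong, low confidence
]
print(evaluate(sample))  # 1 correct, 2 wrong, 1 false positive
```

Only the second item counts as a false positive: it is wrong and above the 80% threshold, whereas the third wrong answer stays below it.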


  1. Collect the Data
    • Our testing begins with a set of English FAQs from banks and insurers, compiled by the business team at Clare.AI. This set was kept from our data scientists and engineers throughout the development of our product to avoid bias in testing. The questions were then imported into each system so we could test how accurately each one matches an input to the right answer without any training or variations. The dataset used, raw performance metrics, and benchmark results are all openly accessible on GitHub.
  2. Clean the Data
    • English
      • From the available FAQs, we imported 76 questions into our system and created 282 variations in total (including the original 76 questions), so the final English dataset contains 282 questions.
    • Cantonese
      • From the available FAQs, we imported 36 questions into our system and created 294 variations in total (including the original 36 questions), so the final Cantonese dataset contains 294 questions.
    • Chinese
      • From the available FAQs, we imported 31 questions into our system and created 197 variations in total (including the original 31 questions), so the final Chinese dataset contains 197 questions.
  3. Evaluate the Model
    • At this stage, we have the final data ready for the learning algorithm. First, we split the dataset into two sets: a training set and a test set. The training set contains the majority of the rows, and the test set the minority. After the split, the rows in the training set are used to predict answers for the rows in the test set.
      • English
        • For English, a 6:4 ratio is used: 169 training questions and 113 test questions.
      • Cantonese
        • For Cantonese, the same 6:4 ratio gives 176 training questions and 118 test questions.
      • Chinese
        • For Chinese, however, a 16:84 ratio is used, giving 31 training questions and 166 test questions.
  4. Apply the Learning Algorithm
    • With all the datasets ready, we import them into each system and compare the correct answers, accuracy, false positives, and processing time across the three systems. For this test, we set the confidence threshold to 80%.
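The split described in step 3 can be sketched as follows, assuming the dataset is a list of (question, intent) pairs. This is a hypothetical illustration, not the actual pipeline; note how the 6:4 English ratio reproduces the 169/113 division:

```python
# Sketch: shuffle the dataset with a fixed seed, then split it into
# training and test sets at the given ratio.
import random

def split_dataset(rows, train_ratio=0.6, seed=42):
    rows = rows[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # deterministic shuffle
    cut = round(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

# 282 English questions built from 76 base intents (placeholder data).
english = [(f"question {i}", f"intent {i % 76}") for i in range(282)]
train, test = split_dataset(english, train_ratio=0.6)
print(len(train), len(test))  # 169 113
```

The same function with `train_ratio=0.6` on the 294 Cantonese rows yields the 176/118 split, and `train_ratio=31/197` on the Chinese rows yields 31/166.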


  1. English

The test shows that Clare.AI matches Watson's performance on accuracy, with both above 60%, while Dialogflow remains below that level.

2. Cantonese

Cantonese is a variety of Chinese spoken in regions such as Hong Kong and Macau. Its speakers, however, often mix Traditional Chinese and English. Therefore, this test also includes mixed-language questions such as:

  • “你地有咩product可以provide?”, which means “What products can you provide?”
  • “舖頭嘅電話number係咩?”, which means “What is the shop's telephone number?”

For Cantonese, Clare.AI shows the best capability compared to Watson and Dialogflow. Based on the testing, Clare.AI correctly answers 79 out of 118 questions, while Watson and Dialogflow correctly match only 51 and 41 questions respectively.

Furthermore, Clare.AI has the highest accuracy of the three bots: 66.9%, compared with 65.3% for Watson and 56.8% for Dialogflow.

3. Chinese

In Chinese, Clare.AI offers the second-best performance, with less than a 10% difference in the number of questions the bot is capable of answering.

Limitations and Learning

  1. Limitations
    • The datasets used in this study are all drawn from the finance industry, which may bias the results when assessing how well the bots answer general intents.
  2. Learning
    • In this study, a false positive occurs when the bot answers with high confidence (>80%) but has in fact recognized the wrong intent. However, since the imported questions are not randomized, we find that a more comprehensive way to measure the false positive rate would be to test against a random dataset – that is, irrelevant intents that do not appear in the training set.
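The suggested improvement can be sketched as follows: treat any confident answer to an out-of-scope question (an intent absent from training) as a false positive, alongside confident wrong answers to in-scope questions. This is a hypothetical illustration, not the benchmark's code:

```python
# Sketch: false positive rate over a mix of in-scope and out-of-scope
# queries. Each record is (is_in_scope, predicted_correctly, confidence).

def false_positive_rate(predictions, threshold=0.8):
    confident = [p for p in predictions if p[2] > threshold]
    # A confident answer is a false positive if the query was out of scope
    # or the predicted intent was wrong.
    fps = [p for p in confident if not p[0] or not p[1]]
    return len(fps) / len(predictions) if predictions else 0.0

mixed = [
    (True,  True,  0.92),  # in-scope, correct, confident: fine
    (True,  False, 0.85),  # in-scope, wrong, confident: false positive
    (False, False, 0.90),  # out-of-scope, answered confidently: false positive
    (False, False, 0.40),  # out-of-scope, low confidence: correctly abstained
]
print(false_positive_rate(mixed))  # 0.5
```

Under this measure, a bot is rewarded for abstaining (low confidence) on questions it was never trained to handle, which the fixed question set in this study cannot capture.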


From this test, we can see that different solutions perform better in different languages. Three languages were examined (English, Cantonese, and Chinese) across three parameters: the number of correct answers, accuracy, and false positives. Based on the methodology used, Clare.AI always ranks in the top two compared with the other two bots. As an improvement, since the data used for training could bias the bots' performance, a random testing dataset should be preferred. We will continue to improve our NLU capability for better performance in more languages and with different kinds of datasets.
