Clare.AI is an AI-powered digital assistant tailored to financial institutions, with a focus on Asian natural language processing. As an emerging chatbot solution in the Asia Pacific region, we have to research our competitors and the many other solutions out there. In doing so, we have come across multiple chatbots that claim to support languages such as Mandarin and Cantonese. Through developing our own proprietary Natural Language Processing engine, however, we have realised that it is easy to claim support for these Asian languages yet difficult to produce truly accurate results. This article tests and explores some of the other chatbot solutions currently on offer and benchmarks Clare.AI in A World of Chatbots. Languages compared in this benchmark include:
- English
- Chinese (Mandarin and Cantonese)
- Bahasa (Indonesian)
Before we begin testing and presenting data to compare chatbots, it is worth explaining why Asian languages such as Cantonese and Mandarin are so much harder to dissect than English. Generally, NLU systems handle language through tokenization, a process that chops sentences into pieces called tokens and throws away unimportant characters such as punctuation.
For Chinese and many other Asian languages, characters are not separated by spaces, which makes it much harder for machines to work out which groups of characters represent an intent. Different dialects of Chinese even require different tokenization; what we experience first hand is the distinction between Mandarin and Cantonese. For example:
The Cantonese sentence means "if I lost my bank card, what can I do". If you tokenize it with a Simplified Chinese NLU, however, it recognizes bank as a noun plus two other names, 張卡唔 and 左點算. Since 張 (Cheung) and 左 (Zuo) are common surnames in China, the tokenizer has "detected" two human names.
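To see why spacing matters so much, compare a naive whitespace tokenizer on English and on Chinese. This is a minimal sketch, not Clare.AI's actual segmentation pipeline, and the Cantonese sentence below is an illustrative stand-in rather than the exact test sentence from our benchmark.

```python
def whitespace_tokenize(sentence: str) -> list[str]:
    """Split on spaces and strip basic punctuation (works for English)."""
    punct = ".,?!。？"
    return [tok.strip(punct) for tok in sentence.split() if tok.strip(punct)]

english = "If I lost my bank card, what can I do?"
cantonese = "如果唔見咗張銀行卡可以點算?"  # illustrative Cantonese sentence

# English splits cleanly into word tokens.
print(whitespace_tokenize(english))

# The whole Chinese sentence comes back as a single token:
# there are no spaces for the tokenizer to work with, so any
# segmentation must come from a language-specific model instead.
print(whitespace_tokenize(cantonese))
```

This is why Chinese NLU needs a dedicated segmentation step, and why a segmenter trained on one dialect can mis-handle another.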
We also believe in the importance of fine-tuning the algorithm for specific industries, because the same word can mean different things in different industries. For example:
“I want to buy apple”
NLU engines struggle to understand the context of "apple" – whether it is a fruit, a gadget or a stock.
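One way to picture this industry dependence is a lookup keyed by context. The sense table and industry tags below are purely illustrative assumptions, not how Clare.AI's engine actually resolves ambiguity.

```python
# Toy sketch of industry-aware word-sense disambiguation:
# the same word maps to different meanings per industry context.
SENSES = {
    "grocery":     {"apple": "fruit"},
    "electronics": {"apple": "gadget brand"},
    "brokerage":   {"apple": "AAPL stock"},
}

def disambiguate(word: str, industry: str) -> str:
    """Look up a word's sense for a given industry, or 'unknown'."""
    return SENSES.get(industry, {}).get(word.lower(), "unknown")

print(disambiguate("apple", "brokerage"))  # → AAPL stock
print(disambiguate("apple", "grocery"))    # → fruit
```

A real engine would infer the context from the conversation and the deployment domain rather than take it as an argument, but the principle is the same: tuning to an industry narrows which senses are even considered.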
Understanding Natural Language Queries
All commercially available natural language understanding (NLU) services currently work in a similar way:
- A natural language query is sent to the service (e.g. "how do I change my password?")
- The intent is detected (e.g. "change password")
- A structured description of the user query is sent back
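The three steps above can be sketched as a single round trip. The keyword-matching "NLU" and the response fields (intent, confidence) here are a generic illustration, not the actual API schema of any of the services tested.

```python
def parse_query(query: str) -> dict:
    """Pretend NLU service: match the query against a tiny keyword table
    and send back a structured description of the result."""
    intents = {
        "change password": ["change", "password"],
        "lost card": ["lost", "card"],
    }
    tokens = query.lower().replace("?", "").split()
    best_intent, best_score = None, 0.0
    for intent, keywords in intents.items():
        # Fraction of the intent's keywords present in the query.
        score = sum(1 for kw in keywords if kw in tokens) / len(keywords)
        if score > best_score:
            best_intent, best_score = intent, score
    return {"query": query, "intent": best_intent, "confidence": best_score}

print(parse_query("how do I change my password?"))
# → intent 'change password' with confidence 1.0
```

Each real service trains a statistical model for the detection step, which is where the uncertainty mentioned above comes in: the returned confidence is an estimate, not a guarantee.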
These steps each introduce uncertainty and require different models to be trained.
The aim of this article is to help clients and our team understand how our technology stacks up against our competitors. To compare products, we will focus on testing the NLU of each solution.
Our testing begins with a set of questions in the language being tested. These questions were compiled by the business team at Clare.AI using public FAQs of banks and insurers, and the set was kept away from our data scientists and engineers throughout the development of our product to avoid bias in testing. The questions were then imported into each system so we could test how accurately it matches an input to the right answer without any additional training. The dataset, raw performance metrics and benchmark results are all openly accessible on GitHub.
The NLU comparison gauges both how accurate our system is relative to the others in English, Asian languages and mixed language, and how well each system handles the data. We created a set of questions and answers to import, then tested how accurately each system matches question variations to the correct answer. When comparing NLU, we need to examine the following aspects:
- Speed
- This aspect tests how well the system handles a large batch of questions. We wrote a script to measure how quickly each system answers all the questions in our dataset, with a separate test for each language.
- This is important for handling concurrent requests.
- Accuracy
- All inputs used when testing the NLU were variations of the imported questions, so we can measure how accurately the system matches each variation to the original question.
- Our variations included mixed language, long forms and short forms, to see whether these NLU engines recognise that the intent is the same. The dataset is accessible on GitHub.
- Correct Answers
- The number (or percentage) of returned answers that correctly matched our variations to the imported questions.
- False Positives
- Each system returns a confidence level indicating how confidently it matched the input to a question. Sometimes, however, a system confidently returns the wrong answer. These false positives are common, but should not make up a large percentage of results.
- We further test for false positives by including random questions that should not match any question in the system. A well-designed NLU should decline to match these.
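Putting these aspects together, the scoring loop can be sketched as follows. The `toy_classify` stand-in, the sample variations and the 0.5 confidence threshold are all illustrative assumptions, not the benchmark's actual implementation.

```python
THRESHOLD = 0.5  # illustrative confidence cutoff

def score(classify, variations):
    """Tally correct answers and false positives.

    variations: list of (input_text, expected_intent_or_None);
    None marks a random question that should match nothing.
    """
    correct = false_positive = 0
    for text, expected in variations:
        intent, confidence = classify(text)
        if confidence < THRESHOLD:
            continue  # system declined to answer; not counted as correct
        if intent == expected:
            correct += 1
        else:
            false_positive += 1  # confident, but wrong (or should not match)
    total = len(variations)
    return {"correct": correct, "false_positives": false_positive,
            "accuracy": round(100.0 * correct / total, 1)}

# Toy stand-in classifier: answers "lost card" confidently for anything
# mentioning "card", otherwise returns low confidence.
def toy_classify(text):
    return ("lost card", 0.9) if "card" in text else (None, 0.1)

variations = [
    ("I lost my card", "lost card"),    # correct match
    ("card fee waiver?", "card fees"),  # confident but wrong: false positive
    ("what's the weather?", None),      # random question, correctly declined
]
print(score(toy_classify, variations))
```

Speed is measured by timing this same loop over the full dataset for each system, one language at a time.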
The table above shows the results of our tests with English questions. We imported 86 questions into each system, then tested it with 86 variations of those questions.
As you can see, Clare.AI's NLU produced the fastest results, returning answers at three times the rate of IBM Watson. Nevertheless, speed is not everything, and would mean nothing if the answers returned were incorrect.
Clare.AI managed to correctly match 51 of the variations to the correct answer, only 1.1% below IBM Watson's performance. This means that, performance-wise, Clare.AI exceeds all other solutions, and is also on par with IBM Watson on accuracy in English.
The table above shows the results of our tests with Chinese questions. We imported 34 questions into each system, then tested it with 181 variations, including mixed-language variations.
Once again, our system returned results the quickest, much faster than the other solutions and almost 13 times faster than IBM Watson.
Clare.AI excelled in Chinese compared to the other platforms. Our system was 12.7% more accurate than IBM Watson, and our testing also showed that our ability to handle mixed language is much stronger. For example:
For the above mixed-language variation, Clare.AI matched it to the right question while IBM Watson matched it incorrectly.
Our testing has shown that Clare.AI is the fastest and one of the most accurate systems when using English. For Chinese, however, Clare.AI stands head and shoulders above its competition. Our high accuracy levels mean companies need to spend less time training the bot.
As for Bahasa, the main language of Indonesia, our NLU comparison pitted us only against WIT.AI, because the other two systems, IBM Watson and API.AI, do not support Bahasa. The format of our testing remained the same, except this time we imported 100 questions and tested each system with 300 variations of those initial questions. The table above shows the results; once again, the data can be found on GitHub.
As you can see, Clare.AI processed and answered all 300 question variations six times faster than WIT.AI, with almost double the accuracy. Clare.AI correctly matched 70% of the 300 variations to the original questions imported into the system, compared to WIT.AI's 35.3%, showcasing the strength of Clare.AI's NLU.
Limitations and Learnings
NLU is still difficult, and the field is in its early stages
The analysis and results in this post show that no solution is perfect. Every system has its flaws, and none can interpret every query correctly. Artificial intelligence in this industry is still far from replacing humans.
However, the purpose of this benchmarking article was not to show how developed AI and NLU are, but rather to see how our engine compares to the other leading solutions. Through testing, we have come to realise the importance of these examinations. It is vital, both for us and for everyone else in the industry, to continue running tests that differentiate between solutions and reveal each one's strengths, because in the end, how do you improve something when you don't know what needs to be improved?
These tests have been reassuring for our team at Clare.AI. We were committed to testing with unbiased data and therefore producing unbiased results, so it was wonderful to see such positive conclusions. Our English NLU matched the giant that is IBM Watson while running much faster. As for the Chinese NLU comparison, our "forte", we not only produced the highest accuracy by a large margin, but did so in the shortest time. We are glad to have been able to show the capabilities of our NLU.
As mentioned previously, as favourable as the results have been, there is still much to improve for all solutions in this industry. This sentiment goes for this benchmarking test as well. We know what we have tested and the ways we have gone about doing it can always be improved, so we welcome any questions or comments you may have. If you are curious about the data we used, you can find it here on GitHub.