Sander Hak
July 14, 2021

Hi, I’m Tracey. How are you feeling today?

Do you recognize yourself in the first illustration? Trying to explain your problem to a chatbot can be a tedious experience. A conversation similar to the second illustration might be preferable: one where your issues are understood and the chatbot you’re communicating with can respond to your needs. Recent advances in Artificial Intelligence make this possible, as chatbots are increasingly equipped with empathy. To this end, emotion recognition algorithms are used to detect emotions from text, speech, or facial expressions. These models are already being used in, among other areas, social media platforms, national security, and mental healthcare (source). Sophisticated AI algorithms in chatbots can be trained to react appropriately to customer problems, improving customer retention and satisfaction. Emotion is inherent to human beings, so you will hopefully agree with me that ethical considerations are important during the development of these applications.

Hello Tracey

At Tracey we are developing conversational AI for our chatbot, and we are exploring the field of empathetic AI in our conversational models. I was given the chance to research the possibility of providing our chatbot with emotion recognition abilities. As our customers are predominantly located in the Netherlands, the chatbot needs to communicate in Dutch. Artificial Intelligence (AI) algorithms require training data to learn their task, so I could only provide the conversational algorithms with emotion recognition abilities if a sufficiently large dataset of examples (pieces of text, paired with the emotions conveyed in them) was available. For Dutch, no such data is publicly available, so I decided to create it. Creating this corpus proved complex enough to take up the entire course of my thesis project. Through this corpus, my research contributes to the end goal of developing Dutch emotional-conversational AI for chatbots.


“Everyone knows what an emotion is, until asked to give a definition. Then, it seems, no one knows” (source). There are many different theories of emotion (this article gives a thorough overview of the six most important ones). Over the past century, well over 100 frameworks have been developed to describe which emotions exist, and psychologists have yet to reach consensus on which framework best explains the variety of emotions one can experience. Some argue that more than 20 basic emotions exist; others believe there are only six. Theories holding that some number n of basic emotions exist, and that more complex emotions are mixtures of these basic emotions, are referred to in the literature as basic emotions theories. In contrast, dimensional theorists believe that emotions are not discrete categories with clear-cut boundaries, but are formed from a combination of (n) dimensions and are thus continuous in nature; dimensional theories therefore assume relationships between emotion categories. Finally, there are hybrid forms, such as Plutchik’s model, which formulates a set of basic (primary) emotions derived from evolutionarily beneficial behavior, but also assumes relationships between them. Plutchik’s model thus contains elements of both basic emotions theory and dimensional theories.
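The contrast between the two families of theories can be made concrete in code. A minimal sketch, where the label set and the valence/arousal coordinates are rough illustrations (not values from any one theory):

```python
# Discrete (basic-emotions) view: an utterance carries a label from a fixed set.
BASIC_EMOTIONS = {"joy", "anger", "fear", "sadness", "surprise", "disgust"}

# Dimensional view: an emotion is a point in a continuous space, here
# valence (negative..positive) and arousal (calm..excited).
# The coordinates below are rough illustrations only.
DIMENSIONAL = {
    "joy":     ( 0.8,  0.5),
    "anger":   (-0.6,  0.8),
    "fear":    (-0.7,  0.7),
    "sadness": (-0.7, -0.4),
}

def distance(a, b):
    """Euclidean distance in the dimensional view; unlike discrete labels,
    a dimensional representation lets us quantify how related two emotions are."""
    (v1, a1), (v2, a2) = DIMENSIONAL[a], DIMENSIONAL[b]
    return ((v1 - v2) ** 2 + (a1 - a2) ** 2) ** 0.5

print(distance("anger", "fear"))  # anger lies much closer to fear than to joy
print(distance("anger", "joy"))
```

The discrete view offers nothing comparable to `distance`: two labels are simply equal or different, which is exactly the relationship-between-categories point that dimensional theories add.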

Creation of the corpus

Natural text was acquired by scraping the internet platforms Reddit and Trustpilot. After collection, the pieces of text (utterances) underwent light data cleaning. Using the annotation tool doccano, a team of 16 volunteer annotators assisted in labelling the data. The annotators were given a list of emotion labels, each with its definition. They were asked to place themselves in the position of the author of the utterance and label all emotions they were confident they recognized. In total, 2200 utterances were annotated by the volunteer team.
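The exact cleaning steps are not spelled out above, but a light cleaning pass over scraped text typically looks something like the following sketch (the rules and the minimum length are illustrative assumptions, not the thesis pipeline):

```python
import re
from typing import Optional

def clean_utterance(text: str) -> Optional[str]:
    """Lightly clean a scraped utterance; return None if it is unusable."""
    text = re.sub(r"https?://\S+", "", text)  # strip URLs left over from scraping
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    if len(text) < 10:                        # drop near-empty utterances
        return None
    return text

raw = "Geweldige service!   Zie https://example.com voor meer info."
print(clean_utterance(raw))
```

Keeping the cleaning minimal preserves the natural register of the text, which is exactly what annotators need to judge the emotion of the original author.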

But… GIGO?

Data scientists will be familiar with the term ‘garbage in, garbage out’. It means that if your training data is of insufficient quality, you cannot expect satisfactory performance from any model trained on it. For example, if one were to train a language-producing algorithm on a corpus containing exclusively offensive language, the model would produce exclusively offensive language. Accordingly, the corpus I was creating needed to be of sufficient quality to train a good model. At the time of writing, there is no clear metric (or set of metrics) for measuring corpus quality. Most corpus creators do, however, publish the validity of their annotations using Inter-Annotator Agreement: they assess data quality by looking at how often annotators agree with each other when labelling the same utterance.

The idea of Inter-Annotator Agreement was applied in this work by letting all annotators label a set of 70 identical utterances. A naïve approach would be to count how often the annotators pick the same label for each utterance. However, annotators can also agree purely by chance, so this approach is invalid. The way the probability of annotators choosing the same label by chance is modeled differs among Inter-Annotator Agreement metrics; for this work, annotations were validated using Krippendorff’s alpha.
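For nominal labels, Krippendorff’s alpha is 1 minus the ratio of observed to chance-expected disagreement, computed from a coincidence matrix. A self-contained sketch (the label names are illustrative; in practice one would use a library such as the `krippendorff` package):

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(data):
    """Krippendorff's alpha for nominal labels.

    `data` is a list of units; each unit is the list of labels assigned by
    the annotators who saw it (annotators who skipped a unit are simply absent).
    """
    coincidence = Counter()
    for unit in data:
        m = len(unit)
        if m < 2:
            continue  # a unit seen by fewer than two annotators carries no agreement info
        for a, b in permutations(range(m), 2):
            coincidence[(unit[a], unit[b])] += 1.0 / (m - 1)
    # Marginal label totals n_c and grand total n
    n_c = Counter()
    for (c, _), v in coincidence.items():
        n_c[c] += v
    n = sum(n_c.values())
    # Nominal disagreement: off-diagonal mass, observed vs expected by chance
    observed = sum(v for (c, k), v in coincidence.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n - 1)
    return 1.0 - observed / expected

# Two annotators in perfect agreement over three utterances -> alpha = 1
print(krippendorff_alpha_nominal([["joy", "joy"], ["anger", "anger"], ["joy", "joy"]]))
```

Unlike the naïve percent-agreement count, this statistic drops to zero (or below) when the agreement is no better than what the label frequencies would produce by chance.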

What about Bias?

If a group of annotators shares a similar bias, the group can attain high levels of agreement in an annotation task. The resulting corpus nevertheless contains this bias, and one could question whether the quality of such a corpus is sufficient. Therefore, additional analyses were performed to discover potential bias in the corpus. Several annotators were removed from the team because they were biased towards one or two emotion labels (e.g., one annotator chose the label Joy significantly more frequently than the pool as a whole). Transparency and accountability regarding potential bias in the dataset were ensured by applying two ethical frameworks: Datasheets for Datasets and Data Statements.
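One simple way to surface such label bias is to compare each annotator’s label frequencies against the pooled frequencies of the whole team. A sketch with an illustrative ratio threshold (the thesis used proper significance testing, not this heuristic):

```python
from collections import Counter

def flag_biased_annotators(annotations, ratio_threshold=1.5):
    """Flag annotators who use a label far more often than the pool does.

    `annotations` maps annotator -> list of chosen labels. An annotator is
    flagged for a label when their relative frequency for it exceeds
    `ratio_threshold` times the pooled frequency.
    """
    pooled = Counter(l for labels in annotations.values() for l in labels)
    total = sum(pooled.values())
    flagged = {}
    for annotator, labels in annotations.items():
        counts = Counter(labels)
        for label, c in counts.items():
            own_freq = c / len(labels)
            pool_freq = pooled[label] / total
            if own_freq > ratio_threshold * pool_freq:
                flagged.setdefault(annotator, []).append(label)
    return flagged

# Toy example: ann1 labels everything Joy, while ann2 and ann3 spread their labels
annotations = {
    "ann1": ["joy"] * 10,
    "ann2": ["joy"] * 3 + ["anger"] * 3 + ["fear"] * 2 + ["sadness"] * 2,
    "ann3": ["joy"] * 3 + ["anger"] * 3 + ["fear"] * 2 + ["sadness"] * 2,
}
flagged = flag_biased_annotators(annotations)
print(flagged)  # only ann1 is flagged, for "joy"
```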

What did we learn?

I successfully developed a Dutch emotion corpus. The inter-annotator agreement scores are higher than those reported in the literature, indicating state-of-the-art performance. One finding of the annotation process is that more thorough pre-processing of utterances might improve annotation: for instance, some utterances were annotated with opposite emotions (e.g., Joy and Anger). Additionally, almost 25% of all utterances were considered unusable by the annotators and were removed from the final corpus, which can be seen as a waste of the annotators’ time. We found that the vast majority of utterances were written in a joyous emotion, while Fear was almost never found (< 1%). This means that anyone using this corpus to create an emotion classification algorithm needs to take this class imbalance into account.
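A common way to take such class imbalance into account is to weight each class inversely to its frequency, w_c = N / (K · n_c), and pass these weights to the classifier’s loss. A minimal sketch; the label counts below are illustrative, not the corpus statistics:

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency class weights, w_c = N / (K * n_c),
    where N is the number of examples, K the number of classes,
    and n_c the count of class c. Rare classes get large weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

# Illustrative distribution: Joy dominates, Fear is very rare
labels = ["joy"] * 75 + ["anger"] * 15 + ["sadness"] * 9 + ["fear"] * 1
weights = class_weights(labels)
print(weights)  # fear receives the largest weight
```

With these weights, a misclassified Fear utterance costs the model far more than a misclassified Joy utterance, counteracting the skew toward the majority class.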

By creating this corpus, we came one step closer to developing emotion-recognising conversational AI.