Publications

You can also find my articles on my Google Scholar profile.

Preprints

Prompt reinforcing for long-term planning of large language models

Published in ArXiv preprint, 2025

Large language models (LLMs) have achieved remarkable success in a wide range of natural language processing tasks and can be adapted through prompting. However, they remain suboptimal in multi-turn interactions, often relying on incorrect early assumptions and failing to track user goals over time, which makes such tasks particularly challenging. Prior works in dialogue systems have shown that long-term planning is essential for handling interactive tasks. In this work, we propose a prompt optimisation framework inspired by reinforcement learning, which enables such planning to take place by only modifying the task instruction prompt of the LLM-based agent. By generating turn-by-turn feedback and leveraging experience replay for prompt rewriting, our proposed method shows significant improvement in multi-turn tasks such as text-to-SQL and task-oriented dialogue. Moreover, it generalises across different LLM-based agents and can leverage diverse LLMs as meta-prompting agents. This warrants future research in reinforcement learning-inspired parameter-free optimisation methods.

Text-to-SQL Task-oriented Dialogue Ontology Construction

Published in ArXiv, 2025

Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.

Post-Training Large Language Models via Reinforcement Learning from Self-Feedback

Published in ArXiv preprint, 2025

Large Language Models (LLMs) often produce plausible but poorly-calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model’s own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards. RLSF simultaneously (i) refines the model’s probability estimates – restoring well-behaved calibration – and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering. By turning a model’s own uncertainty into useful self-feedback, RLSF affirms reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and warrents further research in intrinsic rewards for LLM post-training.

Download Paper | Download Bibtex

Less is More: Local Intrinsic Dimensions of Contextual Language Models

Published in Accepted at NeurIPS 2025, 2025

Understanding the internal mechanisms of large language models (LLMs) remains a challenging and complex endeavor. Even fundamental questions, such as how fine-tuning affects model behavior, often require extensive empirical evaluation. In this paper, we introduce a novel perspective based on the geometric properties of contextual latent embeddings to study the effects of training and fine-tuning. To that end, we measure the local dimensions of a contextual language model’s latent space and analyze their shifts during training and fine-tuning. We show that the local dimensions provide insights into the model’s training dynamics and generalization ability. Specifically, the mean of the local dimensions predicts when the model’s training capabilities are exhausted, as exemplified in a dialogue state tracking task, overfitting, as demonstrated in an emotion recognition task, and grokking, as illustrated with an arithmetic task. Furthermore, our experiments suggest a practical heuristic: reductions in the mean local dimension tend to accompany and predict subsequent performance gains. Through this exploration, we aim to provide practitioners with a deeper understanding of the implications of fine-tuning on embedding spaces, facilitating informed decisions when configuring models for specific applications. The results of this work contribute to the ongoing discourse on the interpretability, adaptability, and generalizability of LLMs by bridging the gap between intrinsic model mechanisms and geometric properties in the respective embeddings.

Conference Papers

Learning from Noisy Labels via Self-Taught On-the-Fly Meta Loss Rescaling

Published in AAAI Conference on Artificial Intelligence 2025, 2025

Correct labels are indispensable for training effective machine learning models. However, creating high-quality labels is expensive, and even professionally labeled data contains errors and ambiguities. Filtering and denoising can be applied to curate labeled data prior to training, at the cost of additional processing and loss of information. An alternative is on-the-fly sample reweighting during the training process to decrease the negative impact of incorrect or ambiguous labels, but this typically requires clean seed data. In this work we propose unsupervised on-the-fly meta loss rescaling to reweight training samples. Crucially, we rely only on features provided by the model being trained, to learn a rescaling function in real time without knowledge of the true clean data distribution. We achieve this via a novel meta learning setup that samples validation data for the meta update directly from the noisy training corpus by employing the rescaling function being trained. Our proposed method consistently improves performance across various NLP tasks with minimal computational overhead. Further, we are among the first to attempt on-the-fly training data reweighting on the challenging task of dialogue modeling, where noisy and ambiguous labels are common. Our strategy is robust in the face of noisy and clean data, handles class imbalance, and prevents overfitting to noisy labels. Our self-taught loss rescaling improves as the model trains, showing the ability to keep learning from the model’s own signals. As training progresses, the impact of correctly labeled data is scaled up, while the impact of wrongly labeled data is suppressed.

Local Topology Measures of Contextual Language Model Latent Spaces With Applications to Dialogue Term Extraction

Published in SIGDial 2024, 2024

A common approach for sequence tagging tasks based on contextual word representations is to train a machine learning classifier directly on these embedding vectors. This approach has two shortcomings. First, such methods consider single input sequences in isolation and are unable to put an individual embedding vector in relation to vectors outside the current local context of use. Second, the high performance of these models relies on fine-tuning the embedding model in conjunction with the classifier, which may not always be feasible due to the size or inaccessibility of the underlying feature-generation model. It is thus desirable, given a collection of embedding vectors of a corpus, i.e., a datastore, to find features of each vector that describe its relation to other, similar vectors in the datastore. With this in mind, we introduce complexity measures of the local topology of the latent space of a contextual language model with respect to a given datastore. The effectiveness of our features is demonstrated through their application to dialogue term extraction. Our work continues a line of research that explores the manifold hypothesis for word embeddings, demonstrating that local structure in the space carved out by word embeddings can be exploited to infer semantic properties.

Infusing Emotions into Task-oriented Dialogue Systems: Understanding, Management, and Generation

Published in SIGDial 2024, 2024

Emotions are indispensable in human communication, but are often overlooked in task-oriented dialogue (ToD) modelling, where the task success is the primary focus. While existing works have explored user emotions or similar concepts in some ToD tasks, none has so far included emotion modelling into a fully-fledged ToD system nor conducted interaction with human or simulated users. In this work, we incorporate emotion into the complete ToD processing loop, involving understanding, management, and generation. To this end, we extend the EmoWOZ dataset (Feng et al., 2022) with system affective behaviour labels. Through interactive experimentation involving both simulated and human users, we demonstrate that our proposed framework significantly enhances the user’s emotional experience as well as the task success.

Dialogue Ontology Relation Extraction via Constrained Chain-of-Thought Decoding

Published in SIGDial 2024, 2024

State-of-the-art task-oriented dialogue systems typically rely on task-specific ontologies for fulfilling user queries. The majority of task-oriented dialogue data, such as customer service recordings, comes without ontology and annotation. Such ontologies are normally built manually, limiting the application of specialised systems. Dialogue ontology construction is an approach for automating that process and typically consists of two steps: term extraction and relation extraction. In this work, we focus on relation extraction in a transfer learning set-up. To improve the generalisation, we propose an extension to the decoding mechanism of large language models. We adapt Chain-of-Thought (CoT) decoding, recently developed for reasoning problems, to generative relation extraction. Here, we generate multiple branches in the decoding space and select the relations based on a confidence threshold. By constraining the decoding to ontology terms and relations, we aim to decrease the risk of hallucination. We conduct extensive experimentation on two widely used datasets and find improvements in performance on target ontology for source fine-tuned and one-shot prompted large language models.

ConvLab-3: A Flexible Dialogue System Toolkit Based on a Unified Data Format

Published in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2023

Task-oriented dialogue (TOD) systems function as digital assistants, guiding users through various tasks such as booking flights or finding restaurants. Existing toolkits for building TOD systems often fall short in delivering comprehensive arrays of data, model, and experimental environments with a user-friendly experience. We introduce ConvLab-3: a multifaceted dialogue system toolkit crafted to bridge this gap. Our unified data format simplifies the integration of diverse datasets and models, significantly reducing complexity and cost for studying generalization and transfer. Enhanced with robust reinforcement learning (RL) tools, featuring a streamlined training process, in-depth evaluation tools, and a selection of user simulators, ConvLab-3 supports the rapid development and evaluation of robust dialogue policies. Through an extensive study, we demonstrate the efficacy of transfer learning and RL and showcase that ConvLab-3 is not only a powerful tool for seasoned researchers but also an accessible platform for newcomers.

From Chatter to Matter: Addressing Critical Steps of Emotion Recognition Learning in Task-oriented Dialogue

Published in SIGDial 2023, 2023

Emotion recognition in conversations (ERC) is a crucial task for building human-like conversational agents. While substantial efforts have been devoted to ERC for chit-chat dialogues, the task-oriented counterpart is largely left unattended. Directly applying chit-chat ERC models to task-oriented dialogues (ToDs) results in suboptimal performance as these models overlook key features such as the correlation between emotions and task completion in ToDs. In this paper, we propose a framework that turns a chit-chat ERC model into a task-oriented one, addressing three critical aspects: data, features and objective. First, we devise two ways of augmenting rare emotions to improve ERC performance. Second, we use dialogue states as auxiliary features to incorporate key information from the goal of the user. Lastly, we leverage a multi-aspect emotion definition in ToDs to devise a multi-task learning objective and a novel emotion-distance weighted loss function. Our framework yields significant improvements for a range of chit-chat ERC models on EmoWOZ, a large-scale dataset for user emotions in ToDs. We further investigate the generalisability of the best resulting model to predict user satisfaction in different ToD datasets. A comparison with supervised baselines shows a strong zero-shot capability, highlighting the potential usage of our framework in wider scenarios.

EmoUS: Simulating User Emotions in Task-Oriented Dialogues

Published in SIGIR 2023, 2023

Existing user simulators (USs) for task-oriented dialogue systems only model user behaviour on semantic and natural language levels without considering the user persona and emotions. Optimising dialogue systems with generic user policies, which cannot model diverse user behaviour driven by different emotional states, may result in a high drop-off rate when deployed in the real world. Thus, we present EmoUS, a user simulator that learns to simulate user emotions alongside user behaviour. EmoUS generates user emotions, semantic actions, and natural language responses based on the user goal, the dialogue history, and the user persona. By analysing what kind of system behaviour elicits what kind of user emotions, we show that EmoUS can be used as a probe to evaluate a variety of dialogue systems and in particular their effect on the user’s emotional state. Developing such methods is important in the age of large language model chat-bots and rising ethical concerns.

ChatGPT for Zero-shot Dialogue State Tracking: A Solution or an Opportunity?

Published in ACL 2023, 2023

Recent research on dialogue state tracking (DST) focuses on methods that allow few- and zero-shot transfer to new domains or schemas. However, performance gains heavily depend on aggressive data augmentation and fine-tuning of ever larger language model based architectures. In contrast, general purpose language models, trained on large amounts of diverse data, hold the promise of solving any kind of task without task-specific training. We present preliminary experimental results on the ChatGPT research preview, showing that ChatGPT achieves state-of-the-art performance in zero-shot DST. Despite our findings, we argue that properties inherent to general purpose models limit their ability to replace specialized systems. We further theorize that the in-context learning capabilities of such models will likely become powerful tools to support the development of dedicated and dynamic dialogue state trackers.

Dynamic Dialogue Policy for Continual Reinforcement Learning

Published in Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), 2022

Continual learning is one of the key components of human learning and a necessary requirement of artificial intelligence. As dialogue can potentially span infinitely many topics and tasks, a task-oriented dialogue system must have the capability to continually learn, dynamically adapting to new challenges while preserving the knowledge it already acquired. Despite the importance, continual reinforcement learning of the dialogue policy has remained largely unaddressed. The lack of a framework with training protocols, baseline models and suitable metrics, has so far hindered research in this direction. In this work we fill precisely this gap, enabling research in dialogue policy optimisation to go from static to dynamic learning. We provide a continual learning algorithm, baseline architectures and metrics for assessing continual learning models. Moreover, we propose the dynamic dialogue policy transformer (DDPT), a novel dynamic architecture that can integrate new knowledge seamlessly, is capable of handling large state spaces and obtains significant zero-shot performance when being exposed to unseen domains, without any growth in network parameter size. We validate the strengths of DDPT in simulation with two user simulators as well as with humans.

Dialogue Term Extraction using Transfer Learning and Topological Data Analysis

Published in Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue (SigDial 2022), 2022

Goal oriented dialogue systems were originally designed as a natural language interface to a fixed data-set of entities that users might inquire about, further described by domain, slots and values. As we move towards adaptable dialogue systems where knowledge about domains, slots and values may change, there is an increasing need to automatically extract these terms from raw dialogues or related non-dialogue data on a large scale. In this paper, we take an important step in this direction by exploring different features that can enable systems to discover realisations of domains, slots and values in dialogues in a purely data-driven fashion. The features that we examine stem from word embeddings, language modelling features, as well as topological features of the word embedding space. To examine the utility of each feature set, we train a seed model based on the widely used MultiWOZ data-set. Then, we apply this model to a different corpus, the Schema-guided dialogue data-set. Our method outperforms the previously proposed approach that relies solely on word embeddings. We also demonstrate that each of the features is responsible for discovering different kinds of content. We believe our results warrant further research towards ontology induction, and continued harnessing of topological data analysis for dialogue and natural language processing research.

Dialogue Evaluation with Offline Reinforcement Learning

Published in Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue (SigDial 2022), 2022

Task-oriented dialogue systems aim to fulfill user goals through natural language interactions. They are ideally evaluated with human users, which however is unattainable to do at every iteration of the development phase. Simulated users could be an alternative, however their development is nontrivial. Therefore, researchers resort to offline metrics on existing human-human corpora, which are more practical and easily reproducible. They are unfortunately limited in reflecting real performance of dialogue systems. BLEU for instance is poorly correlated with human judgment, and existing corpus-based metrics such as success rate overlook dialogue context mismatches. There is still a need for a reliable metric for task-oriented systems with good generalization and strong correlation with human judgements. In this paper, we propose the use of offline reinforcement learning for dialogue evaluation based on static data. Such an evaluator is typically called a critic and utilized for policy optimization. We go one step further and show that offline RL critics can be trained for any dialogue system as external evaluators, allowing dialogue performance comparisons across various types of systems. This approach has the benefit of being corpus- and model-independent, while attaining strong correlation with human judgements, which we confirm via an interactive user trial.

GenTUS: Simulating User Behaviour and Language in Task-oriented Dialogues with Generative Transformers

Published in Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 2022

User simulators (USs) are commonly used to train task-oriented dialogue systems via reinforcement learning. The interactions often take place on semantic level for efficiency, but there is still a gap from semantic actions to natural language, which causes a mismatch between training and deployment environment. Incorporating a natural language generation (NLG) module with USs during training can partly deal with this problem. However, since the policy and NLG of USs are optimised separately, these simulated user utterances may not be natural enough in a given context. In this work, we propose a generative transformer-based user simulator (GenTUS). GenTUS consists of an encoder-decoder structure, which means it can optimise both the user policy and natural language generation jointly. GenTUS generates both semantic actions and natural language utterances, preserving interpretability and enhancing language variation. In addition, by representing the inputs and outputs as word sequences and by using a large pre-trained language model we can achieve generalisability in feature representation. We evaluate GenTUS with automatic metrics and human evaluation. Our results show that GenTUS generates more natural language and is able to transfer to an unseen ontology in a zero-shot fashion. In addition, its behaviour can be further shaped with reinforcement learning opening the door to training specialised user simulators.

EmoWOZ: A Large-Scale Corpus and Labelling Scheme for Emotion in Task-Oriented Dialogue Systems

Published in Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), 2022

The ability to recognise emotions lends a conversational artificial intelligence a human touch. While emotions in chit-chat dialogues have received substantial attention, emotions in task-oriented dialogues remain largely unaddressed. This is despite emotions and dialogue success having equally important roles in a natural system. Existing emotion-annotated task-oriented corpora are limited in size, label richness, and public availability, creating a bottleneck for downstream tasks. To lay a foundation for studies on emotions in task-oriented dialogues, we introduce EmoWOZ, a large-scale manually emotion-annotated corpus of task-oriented dialogues. EmoWOZ is based on MultiWOZ, a multi-domain task-oriented dialogue dataset. It contains more than 11K dialogues with more than 83K emotion annotations of user utterances. In addition to Wizard-of-Oz dialogues from MultiWOZ, we collect human-machine dialogues within the same set of domains to sufficiently cover the space of various emotions that can happen during the lifetime of a data-driven dialogue system. To the best of our knowledge, this is the first large-scale open-source corpus of its kind. We propose a novel emotion labelling scheme, which is tailored to task-oriented dialogues. We report a set of experimental results to show the usability of this corpus for emotion recognition and state tracking in task-oriented dialogues.

What Does The User Want? Information Gain for Hierarchical Dialogue Policy Optimisation

Published in ASRU 2021, 2021

The dialogue management component of a task-oriented dialogue system is typically optimised via reinforcement learning (RL). Optimisation via RL is highly susceptible to sample inefficiency and instability. The hierarchical approach called Feudal Dialogue Management takes a step towards more efficient learning by decomposing the action space. However, it still suffers from instability due to the reward only being provided at the end of the dialogue. We propose the usage of an intrinsic reward based on information gain to address this issue. Our proposed reward favours actions that resolve uncertainty or query the user whenever necessary. It enables the policy to learn how to retrieve the users’ needs efficiently, which is an integral aspect in every task-oriented conversation. Our algorithm, which we call FeudalGain, achieves state-of-the-art results in most environments of the PyDial framework, outperforming much more complex approaches. We confirm the sample efficiency and stability of our algorithm through experiments in simulation and a human trial.

Uncertainty Measures in Neural Belief Tracking and the Effects on Dialogue Policy Performance

Published in EMNLP 2021, 2021

The ability to identify and resolve uncertainty is crucial for the robustness of a dialogue system. Indeed, this has been confirmed empirically on systems that utilise Bayesian approaches to dialogue belief tracking. However, such systems consider only confidence estimates and have difficulty scaling to more complex settings. Neural dialogue systems, on the other hand, rarely take uncertainties into account. They are therefore overconfident in their decisions and less robust. Moreover, the performance of the tracking task is often evaluated in isolation, without consideration of its effect on the downstream policy optimisation. We propose the use of different uncertainty measures in neural belief tracking. The effects of these measures on the downstream task of policy optimisation are evaluated by adding selected measures of uncertainty to the feature space of the policy and training policies through interaction with a user simulator. Both human and simulated user results show that incorporating these measures leads to improvements both of the performance and of the robustness of the downstream dialogue policy. This highlights the importance of developing neural dialogue belief trackers that take uncertainty into account.

Domain-independent User Simulation with Transformers for Task-oriented Dialogue Systems

Published in SigDial 2021, 2021

Dialogue policy optimisation via reinforcement learning requires a large number of training interactions, which makes learning with real users time consuming and expensive. Many set-ups therefore rely on a user simulator instead of humans. These user simulators have their own problems. While hand-coded, rule-based user simulators have been shown to be sufficient in small, simple domains, for complex domains the number of rules quickly becomes intractable. State-of-the-art datadriven user simulators, on the other hand, are still domain-dependent. This means that adaptation to each new domain requires redesigning and retraining. In this work, we propose a domain-independent transformer-based user simulator (TUS). The structure of our TUS is not tied to a specific domain, enabling domain generalisation and learning of cross-domain user behaviour from data. We compare TUS with the state of the art using automatic as well as human evaluations. TUS can compete with rule-based user simulators on pre-defined domains and is able to generalise to unseen domains in a zero-shot fashion.

LAVA: Latent Action Spaces via Variational Auto-encoding for Dialogue Policy Optimization

Published in Proceedings of the 28th International Conference on Computational Linguistics. 2020, 2020

Reinforcement learning (RL) can enable task-oriented dialogue systems to steer the conversation towards successful task completion. In an end-to-end setting, a response can be constructed in a word-level sequential decision making process with the entire system vocabulary as action space. Policies trained in such a fashion do not require expert-defined action spaces, but they have to deal with large action spaces and long trajectories, making RL impractical. Using the latent space of a variational model as action space alleviates this problem. However, current approaches use an uninformed prior for training and optimize the latent distribution solely on the context. It is therefore unclear whether the latent representation truly encodes the characteristics of different actions. In this paper, we explore three ways of leveraging an auxiliary task to shape the latent variable distribution: via pre-training, to obtain an informed prior, and via multitask learning. We choose response auto-encoding as the auxiliary task, as this captures the generative factors of dialogue responses while requiring low computational cost and neither additional data nor labels. Our approach yields a more action-characterized latent representations which support end-to-end dialogue policy optimization and achieves state-of-the-art success rates. These results warrant a more wide-spread use of RL in end-to-end dialogue models.

Out-of-Task Training for Dialog State Tracking Models

Published in Proceedings of the 28th International Conference on Computational Linguistics. 2020, 2020

Dialog state tracking (DST) suffers from severe data sparsity. While many natural language processing (NLP) tasks benefit from transfer learning and multi-task learning, in dialog these methods are limited by the amount of available data and by the specificity of dialog applications. In this work, we successfully utilize non-dialog data from unrelated NLP tasks to train dialog state trackers. This opens the door to the abundance of unrelated NLP corpora to mitigate the data sparsity issue inherent to DST.

Knowing What You Know: Calibrating Dialogue Belief State Distributions via Ensembles

Published in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020

The ability to accurately track what happens during a conversation is essential for the performance of a dialogue system. Current state-of-the-art multi-domain dialogue state trackers achieve just over 55% accuracy on the current go-to benchmark, which means that in almost every second dialogue turn they place full confidence in an incorrect dialogue state. Belief trackers, on the other hand, maintain a distribution over possible dialogue states. However, they lack in performance compared to dialogue state trackers, and do not produce well calibrated distributions. In this work we present state-of-the-art performance in calibration for multi-domain dialogue belief trackers using a calibrated ensemble of models. Our resulting dialogue belief tracker also outperforms previous dialogue belief tracking models in terms of accuracy.

Adaptable Conversational Machines

Published in Adaptable Conversational Machines. AI Magazine, Vol. 41 No. 3: Fall 2020, 2020

In recent years we have witnessed a surge in machine learning methods that provide machines with conversational abilities. Most notably, neural-network–based systems have set the state of the art for difficult tasks such as speech recognition, semantic understanding, dialogue management, language generation, and speech synthesis. Still, unlike for the ancient game of Go for instance, we are far from achieving human-level performance in dialogue. The reasons for this are numerous. One property of human–human dialogue that stands out is the infinite number of possibilities of expressing oneself during the conversation, even when the topic of the conversation is restricted. A typical solution to this problem was scaling-up the data. The most prominent mantra in speech and language technology has been “There is no data like more data.” However, the researchers now are focused on building smarter algorithms — algorithms that can learn efficiently from just a few examples. This is an intrinsic property of human behavior: an average human sees during their lifetime a fraction of data that we nowadays present to machines. A human can even have an intuition about a solution before ever experiencing an example solution. The human-inspired ability to adapt may just be one of the keys in pushing dialogue systems toward human performance. This article reviews advancements in dialogue systems research with a focus on the adaptation methods for dialogue modeling, and ventures to have a glance at the future of research on adaptable conversational machines.

Recommended citation: Lubis, N. ., Heck, M., van Niekerk, C., & Gasic, M. (2020). Adaptable Conversational Machines. AI Magazine, 41(3), 28-44. https://doi.org/10.1609/aimag.v41i3.5322
Download Paper

TripPy: A Triple Copy Strategy for Value Independent Neural Dialog State Tracking

Published in Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 2020 [BEST PAPER AWARD], 2020

Task-oriented dialog systems rely on dialog state tracking (DST) to monitor the user’s goal during the course of an interaction. Multidomain and open-vocabulary settings complicate the task considerably and demand scalable solutions. In this paper we present a new approach to DST which makes use of various copy mechanisms to fill slots with values. Our model has no need to maintain a list of candidate values. Instead, all values are extracted from the dialog context on-thefly. A slot is filled by one of three copy mechanisms: (1) Span prediction may extract values directly from the user input; (2) a value may be copied from a system inform memory that keeps track of the system’s inform operations; (3) a value may be copied over from a different slot that is already contained in the dialog state to resolve coreferences within and across domains. Our approach combines the advantages of span-based slot filling methods with memory methods to avoid the use of value picklists altogether. We argue that our strategy simplifies the DST task while at the same time achieving state of the art performance on various popular evaluation sets including Multiwoz 2.1, where we achieve a joint goal accuracy beyond 55%.

Journal Articles

A Confidence-based Acquisition Model for Self-supervised Active Learning and Label Correction

Published in Transactions of the Association for Computational Linguistics Volume 13(2025), 2025

Supervised neural approaches are hindered by their dependence on large, meticulously annotated datasets, a requirement that is particularly cumbersome for sequential tasks. The quality of annotations tends to deteriorate with the transition from expert-based to crowd-sourced labeling. To address these challenges, we present CAMEL (Confidence-based Acquisition Model for Efficient self-supervised active Learning), a pool-based active learning framework tailored to sequential multi-output problems. CAMEL possesses two core features: (1) it requires expert annotators to label only a fraction of a chosen sequence, and (2) it facilitates self-supervision for the remainder of the sequence. By deploying a label correction mechanism, CAMEL can also be utilized for data cleaning. We evaluate CAMEL on two sequential tasks, with a special emphasis on dialogue belief tracking, a task plagued by the constraints of limited and noisy datasets. Our experiments demonstrate that CAMEL significantly outperforms the baselines in terms of efficiency. Furthermore, the data corrections suggested by our method contribute to an overall improvement in the quality of the resulting datasets.

Recommended citation: Carel van Niekerk, Christian Geishauser, Michael Heck, Shutong Feng, Hsien-chin Lin, Nurul Lubis, Benjamin Ruppik, Renato Vukovic, Milica Gašić; A Confidence-based Acquisition Model for Self-supervised Active Learning and Label Correction. Transactions of the Association for Computational Linguistics 2025; 13 167–187. doi: https://doi.org/10.1162/tacl_a_00734
Download Paper | Download Slides | Download Bibtex

Learning with an Open Horizon in Ever-Changing Dialogue Circumstances

Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

Task-oriented dialogue systems aid users in achieving their goals for specific tasks, e.g., booking a hotel room or managing a schedule. The systems experience various changes during their lifetime such as new tasks emerging or varying user behaviours and task requests, which requires the ability of continually learning throughout their lifetime. Current dialogue systems either perform no continual learning or do it in an unrealistic way that mostly focuses on avoiding catastrophic forgetting. Unlike current dialogue systems, humans learn in such a way that it benefits their present and future, while adapting their behaviour to current circumstances. In order to equip dialogue systems with the capability of learning for the future, we propose the usage of lifetime return in the reinforcement learning (RL) objective of dialogue policies. Moreover, we enable dynamic adaptation of hyperparameters of the underlying RL algorithm used for training the dialogue policy by employing meta-gradient reinforcement learning. We furthermore propose a more general and challenging continual learning environment in order to approximate how dialogue systems can learn in the ever-changing real world. Extensive experiments demonstrate that lifetime return and meta-gradient RL lead to more robust and improved results in continuously changing circumstances. The results warrant further development of dialogue systems that evolve throughout their lifetime.

Robust Dialogue State Tracking with Weak Supervision and Sparse Data

Published in Transactions of the Association for Computational Linguistics, Volume 10, 2022

Generalizing dialogue state tracking (DST) to new data is especially challenging due to the strong reliance on abundant and fine-grained supervision during training. Sample sparsity, distributional shift, and the occurrence of new concepts and topics frequently lead to severe performance degradation during inference. In this paper we propose a training strategy to build extractive DST models without the need for fine-grained manual span labels. Two novel input-level dropout methods mitigate the negative impact of sample sparsity. We propose a new model architecture with a unified encoder that supports value as well as slot independence by leveraging the attention mechanism. We combine the strengths of triple copy strategy DST and value matching to benefit from complementary predictions without violating the principle of ontology independence. Our experiments demonstrate that an extractive DST model can be trained without manual span labels. Our architecture and training strategies improve robustness towards sample sparsity, new concepts, and topics, leading to state-of-the-art performance on a range of benchmarks. We further highlight our model’s ability to effectively learn from non-dialogue data.

Theses

Uncertainty Estimation, Management, and Utilisation in Human-Computer Dialogue

Published in Heinrich Heine Universität, Düsseldorf, 2024

In the rapidly evolving field of human-computer interaction, there is an increasing demand for effective and reliable dialogue systems, computer programs engineered to converse with humans. However, these systems often fall short in unpredictable or ambiguous scenarios, a problem attributed to the absence of a comprehensive model for handling uncertainty. This limitation impacts the ability to communicate effectively with human users, thereby diminishing user experience and trust. Uncertainty is a fundamental aspect of human cognition and everyday decision-making processes, serving as both an obstacle and an opportunity in our constant pursuit of knowledge and effective communication. Despite its important role, uncertainty is underrepresented in the development of machine learning models for dialogue systems. To address these gaps, this thesis focuses on integrating and leveraging uncertainty within taskoriented dialogue systems, systems designed to assist users in accomplishing specific tasks. With the aim of achieving human-level interactive capabilities, we make three substantial contributions to this area. First, we enhance the system’s language understanding component, improving its accuracy in evaluating the certainty of its predictions. Secondly, we introduce SetSUMBT (Set Similarity based Slot Utterance Matching Belief Tracker), a model designed to capture various facets of uncertainty, bolstering the robustness and adaptability of the dialogue policy models responsible for generating system responses, as validated through simulated and real-user interactions. Thirdly, we present CAMELL (Confidence-based Acquisition Model for Efficient self-supervised active Learning with Label validation), an innovative framework which minimises the reliance of models on labelled data. By incorporating elements of self-supervision, where models learn from their own predictions, and label validation, CAMELL automates the rectification of unreliable human annotations, a feature with extensive applicability in various machine learning domains. Incorporating insights from psychological theories on human uncertainty management, this thesis emphasises the importance of integrating such insights into machine learning models for dialogue. Our methods will advance the field by introducing more reliable, robust, and effective dialogue systems that better handle uncertainties, ultimately enhancing the quality of human-computer interaction. Furthermore, this work challenges current limitations associated with data deficiencies, offering a data-driven approach for improving dataset quality, thereby paving the way for future research in machine learning and human-computer interaction.

Multiscale spatial modeling with applications in image analysis

Published in Dissertation (MSc)--University of Pretoria., 2018

Computer vision is a very important research area and is continuously growing. One of the prevalent research areas in computer vision is image matching. In image matching there are two main components, namely feature detection and feature matching. The aim of this this study is to determine whether Direct Sampling can be used for feature matching, and also if the combination of Direct Sampling and the Discrete Pulse Transform feature detector can be a successful image matching tool. In feature detection there are many strong methods including convolutional neural networks and scale-space models such as SIFT and SURF, which are very well-known feature detection algorithms. In this work we utilize another scale-space decomposition tool called the Discrete Pulse Transform (DPT). We particularly use the DPT decomposition to enable significant feature detection. We then concentrate on using the Direct Sampling algorithm, a stochastic spatial simulation algorithm, for modelling and matching of features. We do not consider convolutional neural networks or SIFT or SURF for texture matching in this work, this is because we particularly focus on the use of spatial statistics in image matching. We finally propose a novel multiscale spatial statistics feature detection and matching algorithm which combines the DPT feature detection with Direct Sampling for feature matching, specifically for texture classes of images. The performance of the proposed method is tested by comparing the distances obtained from the proposed algorithm between different texture images. We see that this proposed novel multiscale spatial modelling approach to feature matching with the focus on textures performs well at discriminating between difficult to discriminate between textures.