'Begin at the beginning,' the King said gravely, 'and go on till you come to the end: then stop.'

- King of Hearts, Alice in Wonderland (1865)

Quick Summary

The objective of this project is to compare methods for mining conversations from narrative fiction.


Dialogue systems need natural language data: a lot of it, and the richer the better. Exciting advances in dialogue systems such as Google Duplex and Microsoft Xiaoice have been powered by deep learning models trained on large quantities of tagged and highly structured conversational data.

Such data sources are hard to come by. Existing methods include mining online chat data from sites such as Reddit and Twitter, or crowdsourcing actual human conversations using tools like Amazon Mechanical Turk. However, corpora mined online are limited by content and style bias, and crowdsourced corpora are expensive and do not scale.

There is another way. A treasure trove of varied and life-like conversational data lies within the pages of narrative fiction. Fictional dialogue is socially and linguistically rich in ways that existing conversational corpora are not. Despite this, there are no large-scale efforts to mine dialogue from narrative fiction. Our goal for this challenge is to extract it using machine learning.

Why it is difficult

Identifying conversations in narrative prose is tricky because the stylistic and lexical features of dialogue vary widely. Furthermore, much of the information about a conversation is contained not in the utterances themselves but in the surrounding exposition.


Our data is Jane Austen's Pride and Prejudice, chosen for its realistic, linguistically rich, and socially complicated dialogue. We take the HTML file of the text as input and use its HTML tags to split it into paragraphs of utterances and narration. We then manually label the conversations and utterances, marking the beginning and continuation of each conversation as well as non-conversational text.
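
The parsing step above can be sketched with Python's standard library. This is a minimal illustration, not the project's actual code: it assumes that each paragraph sits in a `<p>` element and that any paragraph containing a double quotation mark is an utterance. The names `ParagraphExtractor` and `split_paragraphs` are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text content of every <p> element (assumed paragraph tag)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse line breaks and repeated spaces inside the paragraph.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


# Assumption: straight or curly double quotes signal an utterance.
QUOTE_CHARS = ('"', "\u201c", "\u201d")


def split_paragraphs(html_text):
    """Return a list of (kind, text) pairs, kind being 'utterance' or 'narration'."""
    parser = ParagraphExtractor()
    parser.feed(html_text)
    parser.close()
    return [
        ("utterance" if any(q in p for q in QUOTE_CHARS) else "narration", p)
        for p in parser.paragraphs
    ]
```

Calling `split_paragraphs` on the raw HTML string of a chapter yields the paragraph sequence that would then be labeled by hand. A quote-based split like this will misfile narration that quotes a letter or a title, which is one reason manual labeling is still needed.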

We frame the task as a sequence-labeling problem inspired by named-entity recognition (NER). We built a BERT + bi-LSTM sequence-labeling model using TensorFlow 2.0, and we construct utterance pairs from the tags it predicts. To evaluate the model's effectiveness, we compare it against a heuristic baseline for conversation sequence labeling.
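
To make the tagging idea concrete, here is a small sketch of how a tagged paragraph sequence could be turned into conversations and utterance pairs. The tag names `B` (beginning of a conversation), `I` (continuation), and `O` (non-conversational text) are an assumption borrowed from NER conventions, and both function names are hypothetical; the project's actual tag set and pairing rules may differ.

```python
def group_conversations(tagged):
    """Group (tag, text) items into conversations.

    Assumed tags: "B" starts a conversation, "I" continues it,
    "O" marks non-conversational text and closes any open conversation.
    """
    conversations = []
    current = None
    for tag, text in tagged:
        if tag == "O":
            current = None
        elif tag == "B" or current is None:
            # A stray "I" with no open conversation is treated as a start.
            current = [text]
            conversations.append(current)
        else:
            current.append(text)
    return conversations


def utterance_pairs(conversations):
    """Pair each utterance with the one that follows it in the same conversation."""
    return [
        (conv[i], conv[i + 1])
        for conv in conversations
        for i in range(len(conv) - 1)
    ]
```

Pairs never cross a conversation boundary, so the reply in each pair always belongs to the same exchange as the utterance it follows.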


Our sequence-labeling model strongly outperformed the heuristic in both recall and precision.
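
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute precision and recall for a single label over a tagged sequence. How the project computed its own scores is not stated here, so treat the per-label, per-paragraph definition and the function name `precision_recall` as assumptions.

```python
def precision_recall(gold, predicted, positive):
    """Precision and recall for one label over two aligned tag sequences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == positive and p == positive for g, p in pairs)
    false_pos = sum(g != positive and p == positive for g, p in pairs)
    false_neg = sum(g == positive and p != positive for g, p in pairs)
    # Report 0.0 when a denominator is empty rather than dividing by zero.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Precision is the share of predicted labels that are correct, and recall is the share of gold labels that were found. Running the function once on the model's tags and once on the heuristic's tags, against the same manual labels, gives a like-for-like comparison.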

Lessons Learnt

It was surprising that extracting sequences of conversation could be tackled effectively with a tagging approach inspired by the sequence labeling used in named-entity recognition (NER).

A model that combines the bi-LSTM and Transformer architectures is to some extent able to capture the highly complex linguistic features of written conversation and the sequential relationship between dialogue and narration.

We had a pleasant experience using TensorFlow 2.0 in this challenge. We liked being able to prototype quickly with Keras while retaining the low-level control of the original TensorFlow API.

What's next for ConvoMiner

We will build a more general model by training on a larger and more varied dataset. Then, we will use the model to generate more conversational data from the 58,000 texts on Project Gutenberg. The resulting data will be open-sourced for chatbot engineers and researchers.

Built With

  • jupyter-notebook
  • python
  • tensorflow
  • html
  • project-gutenberg