Building a Question Answering Model at Scale Using 🤗 Transformers

Introduction
We will build a neural question answering system using a transformer model (RoBERTa). This approach can perform Q&A across millions of documents in a few seconds.
Data
For this tutorial, I will use the abstracts of arXiv research papers for Q&A. The data is on Kaggle: Go to dataset. The dataset has many columns, such as id, author, title, and categories, but the two we are interested in are title and abstract. abstract contains a long summary of the research paper, and we will use this column to build our question answering model.
Let’s dive into the code.
Let’s Code
The data is in a nested JSON format. We will limit our analysis to 50,000 documents to stay within Kaggle's compute limits and avoid out-of-memory errors.
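Here is a minimal sketch of the loading step. The file path and column selection are assumptions based on the Kaggle arXiv metadata dump; adjust them to your setup.

```python
import json
import pandas as pd

# Assumed location of the arXiv metadata dump on Kaggle
DATA_PATH = "/kaggle/input/arxiv/arxiv-metadata-oai-snapshot.json"

# The file is newline-delimited JSON (one nested record per line);
# read it lazily and stop at 50,000 papers to avoid running out of memory.
records = []
with open(DATA_PATH) as f:
    for i, line in enumerate(f):
        if i >= 50_000:
            break
        paper = json.loads(line)
        records.append({"title": paper["title"], "abstract": paper["abstract"]})

df = pd.DataFrame(records)
print(df.shape)
```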

We will use the abstract column to build our QA model.
Haystack
Now, welcome Haystack! The secret sauce behind scaling up to thousands of documents is Haystack.

Haystack helps you scale QA models to large collections of documents! You can read more about this amazing library here: https://github.com/deepset-ai/haystack
For installation: ! pip install git+https://github.com/deepset-ai/haystack.git
To give some background, there are three major components to Haystack:
- Document Store: Database storing the documents for our search. Elasticsearch is recommended, but there are also more lightweight options for fast prototyping (SQL or In-Memory).
- Retriever: Fast, simple algorithm that identifies candidate passages from a large collection of documents. Algorithms include TF-IDF or BM25, custom Elasticsearch queries, and embedding-based approaches. The Retriever helps to narrow down the scope for Reader to smaller units of text where a given question could be answered.
- Reader: Powerful neural model that reads through texts in detail to find an answer. Use diverse models like BERT, RoBERTa or XLNet trained via FARM or Transformers on SQuAD like tasks. The Reader takes multiple passages of text as input and returns top-n answers with corresponding confidence scores. You can just load a pretrained model from Hugging Face’s model hub or fine-tune it to your own domain data.
And then there is the Finder, which glues together a Reader and a Retriever as a pipeline to provide an easy-to-use question answering interface.
Now, we can set up Haystack in three steps:
- Install haystack and import its required modules
- Set up the DocumentStore
- Set up the Retriever, Reader, and Finder
1. Install haystack
Let's install haystack and import all the required modules.
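The snippet below is a sketch based on the Haystack 0.x API this tutorial targets (Finder, FARMReader, ElasticsearchRetriever); module paths have moved around between releases, so adjust them for your installed version.

```python
# Install Haystack from source
! pip install git+https://github.com/deepset-ai/haystack.git

# Imports follow the 0.x-era module layout (may differ in newer releases)
from haystack import Finder
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.sparse import ElasticsearchRetriever
from haystack.reader.farm import FARMReader
from haystack.utils import print_answers
```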
2. Setting up DocumentStore
Haystack finds answers to queries within the documents stored in a DocumentStore. The current implementations of DocumentStore include ElasticsearchDocumentStore, SQLDocumentStore, and InMemoryDocumentStore.
ElasticsearchDocumentStore is recommended because it comes preloaded with features like full-text queries, BM25 retrieval, and vector storage for text embeddings.
So let's set up an ElasticsearchDocumentStore.
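A sketch of the setup on Kaggle/Colab. Downloading and launching Elasticsearch 7.9.2 as a background process is an assumption (any recent 7.x should work); if you already have an Elasticsearch server running, only the last line is needed.

```python
import os
import time
from subprocess import Popen, PIPE, STDOUT

# Download and start a local Elasticsearch server (version is an assumption)
! wget -q https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

es_server = Popen(["elasticsearch-7.9.2/bin/elasticsearch"],
                  stdout=PIPE, stderr=STDOUT,
                  preexec_fn=lambda: os.setuid(1))  # run as the daemon user
time.sleep(30)  # give the server time to start up

# Connect the DocumentStore to the local Elasticsearch instance
document_store = ElasticsearchDocumentStore(
    host="localhost", username="", password="", index="document"
)
```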
Once the ElasticsearchDocumentStore is set up, we will write our documents/texts to the DocumentStore.
Writing documents to ElasticsearchDocumentStore requires a specific format: a list of dictionaries, as shown below.
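The expected shape is roughly the following (field values are placeholders; older Haystack releases accept name at the top level, newer ones nest it under a meta dict):

```python
dicts = [
    {"name": "<title of paper 1>", "text": "<abstract of paper 1>"},
    {"name": "<title of paper 2>", "text": "<abstract of paper 2>"},
    # ...
]
```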
(Optionally, you can also add more key-value pairs here; they will be indexed as fields in Elasticsearch and can be accessed later for filtering or shown in the responses of the Finder.)
We will pass the title column as the name and the abstract column as the text.
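A sketch of building those dictionaries from the dataframe created earlier and writing them to the DocumentStore:

```python
# Map each row of the dataframe to the dict format expected by Haystack
documents = [
    {"name": row["title"], "text": row["abstract"]}
    for _, row in df.iterrows()
]

# Index the documents in Elasticsearch
document_store.write_documents(documents)
```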
3. Set up the Retriever, Reader, and Finder
Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered. They use simple but fast algorithms.
Here we use Elasticsearch's default BM25 algorithm.
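Setting up the BM25 retriever is a one-liner:

```python
# Sparse retriever backed by Elasticsearch's default BM25 ranking
retriever = ElasticsearchRetriever(document_store=document_store)
```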
A Reader scans the texts returned by the Retriever in detail and extracts the k best answers. Readers are based on powerful but slower deep learning models.
Haystack currently supports Readers based on the frameworks FARM and Transformers. With both you can either load a local model or one from Hugging Face’s model hub (https://huggingface.co/models).
Here we use a medium-sized RoBERTa QA model with a Reader based on FARM (https://huggingface.co/deepset/roberta-base-squad2).
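Loading the pretrained reader from the model hub:

```python
# Extractive QA model from the Hugging Face model hub, loaded via FARM
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
```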
And finally, the Finder sticks the Reader and Retriever together in a pipeline to answer our actual questions.
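And wiring them together:

```python
# The Finder chains Retriever and Reader into one question answering pipeline
finder = Finder(reader, retriever)
```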
🥳 Voilà! We're done.
Once our Finder is ready, we are all set to see the model fetch answers to our questions. Below is the list of questions I asked the model.
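A sketch of the query loop. The questions listed here are only illustrative placeholders (the original questions are not reproduced); top_k_retriever controls how many candidate abstracts the Retriever passes on, and top_k_reader how many answers the Reader returns per question.

```python
# Illustrative questions about the indexed abstracts -- replace with your own
questions = [
    "What is the Casimir effect?",
    "What are gravitational waves?",
    # ...
]

for question in questions:
    prediction = finder.get_answers(question=question, top_k_retriever=10, top_k_reader=1)
    print_answers(prediction, details="minimal")
```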
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.17 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.83 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.71 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.75 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.78 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.83 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00, 1.08s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:01<00:00, 1.09s/ Batches]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.83 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.57 Batches/s]
Let's try a few more examples:
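Again with illustrative questions in place of the originals:

```python
for question in ["What is dark matter?", "How do neural networks generalize?"]:  # illustrative
    prediction = finder.get_answers(question=question, top_k_retriever=10, top_k_reader=3)
    print_answers(prediction, details="minimal")
```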
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.17 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.83 Batches/s]
One more:
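And one last illustrative query:

```python
prediction = finder.get_answers(
    question="What is reinforcement learning?",  # illustrative question
    top_k_retriever=10,
    top_k_reader=3,
)
print_answers(prediction, details="minimal")
```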
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 1.17 Batches/s]
The results are promising. Please note that we used a pretrained model, deepset/roberta-base-squad2, for this tutorial. We could expect a significant improvement if we fine-tuned a QA model on our own dataset and then scaled it up to millions of documents using Haystack.