BERT on SQuAD

CC-BY-NC-4.0 PyTorch Kubeflow: v1.8 Kubernetes: v1.29.3

The guides on this page demonstrate how to fine-tune BERT on the SQuAD dataset to solve question answering tasks. They can run locally, on a GPU Notebook server, or leverage Kubeflow Pipelines (KFP) to scale and automate the experiment in a Kubeflow cluster.

About BERT

BERT (Bidirectional Encoder Representations from Transformers) is a revolutionary model developed by Google in 2018. Its introduction marked a significant advancement in the field, setting new state-of-the-art benchmarks across various NLP tasks.

BERT is pre-trained on a massive amount of text, learning general language representations and how words derive meaning from their surrounding context. This pre-trained model can then be fine-tuned for specific tasks such as sentiment analysis or question answering.
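For extractive question answering, fine-tuning adds a small span-prediction head on top of the encoder: one linear layer that scores every token as a potential answer start or answer end. The sketch below illustrates this idea in PyTorch; the random `sequence_output` tensor and the `hidden_size`/`seq_len` values are stand-ins for BERT's real encoder output, not the actual model.

```python
import torch
import torch.nn as nn

# Illustrative values standing in for BERT's real encoder output.
hidden_size = 768   # hidden dimension of BERT-base
seq_len = 16        # toy sequence length

# A single linear layer maps each token's hidden state to two scores:
# one for being the answer's start token, one for being its end token.
qa_head = nn.Linear(hidden_size, 2)

sequence_output = torch.randn(1, seq_len, hidden_size)  # fake encoder output
logits = qa_head(sequence_output)                       # shape (1, seq_len, 2)
start_logits, end_logits = logits.split(1, dim=-1)

# The predicted answer span is the argmax over token positions.
start_index = start_logits.squeeze(-1).argmax(dim=-1)
end_index = end_logits.squeeze(-1).argmax(dim=-1)
```

During fine-tuning, the start and end logits are trained with cross-entropy against the true answer span's token positions.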

About SQuAD

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

This is an example row from the SQuAD dataset:

Context:
Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a
golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a
copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the
Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto,
a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where
the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main
drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple,
modern stone statue of Mary.

Question:
The Basilica of the Sacred heart at Notre Dame is beside to which structure?

Answer:
{"text": ["the Main Building"], "answer_start": [279]}
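The answer is stored as a character span: `answer_start` is the character offset of the answer inside the context string. The sketch below recovers the span from the row above (with the context collapsed to the single string SQuAD stores, truncated here for brevity) and shows why the offset matters when the answer text occurs more than once in the passage:

```python
# The context from the example row above, as one string (tail truncated).
context = (
    "Architecturally, the school has a Catholic character. Atop the Main "
    "Building's gold dome is a golden statue of the Virgin Mary. Immediately "
    "in front of the Main Building and facing it, is a copper statue of "
    "Christ with arms upraised with the legend \"Venite Ad Me Omnes\". Next "
    "to the Main Building is the Basilica of the Sacred Heart. ..."
)
answer = {"text": ["the Main Building"], "answer_start": [279]}

# Recover the answer by slicing the context at the stored character offset.
start = answer["answer_start"][0]
text = answer["text"][0]
span = context[start:start + len(text)]  # "the Main Building"

# "the Main Building" appears several times in the passage; a plain
# substring search finds an earlier occurrence, so the explicit offset
# is what disambiguates which mention is the answer.
first_occurrence = context.find(text)
```

This character-level labeling is why QA preprocessing must map character offsets to token positions before the start/end indices can be used as training targets.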

In this experiment, we use SQuAD 1.1, which contains 100,000+ question-answer pairs on 500+ articles.

Conceptual Guides: