Cerego’s data science team has built a range of Smart Create™ tools that use AI to save time when creating content. They recently introduced a major upgrade to the quality and reliability of the Smart Answers tool by integrating a powerful deep learning model called BERT. Below, the team gives us an overview of what BERT is and how it helps.


First off, remind us—what does Cerego’s Smart Answers actually do?

Archana Ramalingam, Data Scientist:
When you write a multiple-choice quiz item, one of the most time-consuming tasks is coming up with good distractors—alternative, wrong choices that are plausible enough to make a learner really think about the question and actually learn the underlying principle. The Smart Answers feature in Cerego saves you that time by doing this automatically: write a question and answer, and it will suggest relevant distractors that you can add with a single click. Smart Answers also works when you Bulk Upload questions and answers into Cerego, or when you use Smart Suggestions to create learning items from key concepts in a video with a single click.

Iain Harlow, VP Science:
It’s a quietly powerful idea—all you need to tell Cerego is the question and correct answer, and we’re able to immediately generate quiz types of varying difficulty ranging from multiple choice to free response.


What approach were you using before?

One of the primary models we use is a word embedding model called word2vec. This type of model uses a neural network to encode words and phrases as vectors (embeddings), usually a few hundred values long—essentially, the embeddings reduce the linguistic space from millions of words to a lower-dimensional space that preserves the semantic relationships between them. This means that terms with similar usage or meaning will have vectors near each other in this space.
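As a rough sketch of that idea—with made-up toy vectors standing in for the few-hundred-dimensional embeddings a real word2vec model learns—finding related terms comes down to cosine similarity in the embedding space:

```python
import numpy as np

# Toy 4-dimensional "embeddings" (illustrative values only, not a
# trained model): semantically related terms get similar vectors.
embeddings = {
    "incubation": np.array([0.9, 0.1, 0.0, 0.2]),
    "gestation":  np.array([0.8, 0.2, 0.1, 0.3]),
    "keyboard":   np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 for identical directions, ~0 for unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(term):
    # Rank every other term by similarity to `term` and return the best.
    others = [(w, cosine(embeddings[term], v))
              for w, v in embeddings.items() if w != term]
    return max(others, key=lambda p: p[1])[0]
```

With these toy vectors, `nearest("incubation")` returns `"gestation"`—the same nearest-neighbor lookup that surfaces distractor candidates from a real embedding model.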

Although pre-trained word2vec models are available, we trained our own with a couple of differences that make it work better for our use case. First, we curated the training set to lean more heavily on declarative knowledge (like Wikipedia, or textbooks) and less on conversational corpora (like social media). We’re not trying to build a chatbot, but to understand how different concepts and facts relate to each other. Second, we incorporate the part of speech (POS) for each term in our model. This helps distinguish between similar words with different meanings: for example, the color orange (adjective) exists separately from the fruit (noun) in our model.
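One minimal way to realize that second idea—a sketch of the scheme, not the team’s actual preprocessing code—is to fold the POS tag into the training token itself, so homographs become distinct vocabulary entries:

```python
def tag_tokens(tagged_sentence):
    # `tagged_sentence` is a list of (word, POS) pairs, as produced by
    # any off-the-shelf POS tagger; we join each pair into a single
    # training token so the embedding model sees them as distinct words.
    return [f"{word.lower()}_{pos}" for word, pos in tagged_sentence]

tokens = tag_tokens([("The", "DET"), ("orange", "ADJ"), ("cat", "NOUN")])
# "orange_ADJ" is now a separate vocabulary entry from "orange_NOUN",
# so the color and the fruit get independent embeddings.
```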


OK, so what does BERT do differently?

John Braunlin, Data Scientist:
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art deep learning language model developed by Google AI. BERT has 110 million parameters and was trained on a large corpus of English text containing over 3.3 billion words, making its performance robust across a wide range of topics. BERT belongs to the same class of models as GPT (Generative Pre-trained Transformer), except that BERT is bidirectional: it learns a word’s context from both the left and the right of the text the word appears in.

The model was trained on a masked language modeling task: predicting a random word removed from a sentence. There are close parallels between this task and creating distractors for a question, so we can now generate distractors that are more relevant and contextually fitting. Rather than generating distractors based on their similarity to the correct answer, as word2vec (static embeddings) does, BERT (contextualized embeddings) predicts distractors from the context of the question.
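Conceptually, the question and answer become a cloze (fill-in-the-blank) input for a masked language model. Here is a sketch of just that framing step, with the model call itself left out:

```python
def to_cloze(sentence, answer, mask_token="[MASK]"):
    # Replace the correct answer in the question sentence with BERT's
    # mask token; the model's fill-mask predictions for that blank then
    # become raw distractor candidates. (A sketch of the idea, not
    # Cerego's actual pipeline.)
    return sentence.replace(answer, mask_token)

cloze = to_cloze(
    "The incubation period of the project lasted two weeks.",
    "incubation",
)
# -> "The [MASK] period of the project lasted two weeks."
```

Feeding the resulting string to a fill-mask model—for example, Hugging Face’s `pipeline("fill-mask")` with a BERT checkpoint—returns a ranked list of candidate words for the blank.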


Figure: the BERT architecture for pre-training and fine-tuning (from Devlin et al., 2019)

The most important attribute of the transformer architecture is a mechanism called “self-attention”, which prioritizes the most semantically important words in a sentence. BERT has 12 such layers of attention, making it a deep transformer model that can learn and retain long-term dependencies in the text. BERT’s bidirectional training style helps too, because it can ‘read’ the whole sentence before predicting the language context—unlike GPT-3, or a human reading from beginning to end. Finally, it is trained on subword units (WordPiece tokens) rather than whole words, meaning it can handle unseen (out-of-vocabulary, or OOV) words by breaking them into familiar pieces.
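The core of self-attention can be sketched in a few lines of NumPy. This toy single-head version omits the learned query/key/value projections a real transformer layer has, but shows the essential step: every token’s new representation is a similarity-weighted mix of all tokens in the sentence.

```python
import numpy as np

def self_attention(X):
    # X is an (n_tokens, d) matrix of token vectors. Each token attends
    # over every token (including itself), weighted by dot-product
    # similarity, then softmax-normalized per row.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                      # pairwise similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax per row
    return weights @ X, weights                        # mixed vectors, attention map
```

Each row of the attention map sums to 1, and the output has the same shape as the input—stacking 12 such layers (with learned projections) is what gives BERT its depth.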

All of this makes BERT much more aware of the local context of a word. While word2vec stores a static embedding for each target word + part-of-speech combination, BERT dynamically generates a context-dependent, instance specific representation of the word.

The video below captures how attention shifts between words as we progress along the sentence.

So how does this improve the output of the model in practice?

One place this really helps is for terms that have multiple meanings in different contexts. This is incredibly common in English! For example, one of our examples referred to the incubation period of a project. The word2vec model can only store one embedding for the term incubation, and since the most common context for incubation in the training corpus relates to biology, that’s where it draws potential distractors from: ovulation, gestation and other terms that would work well if the question was about biology. But BERT is able to dynamically take the whole question context into account and instead suggests less contextually jarring options like initiation or development.

BERT is also especially good at generating distractors for short phrases: its output for phrases typically conforms to the syntax of the sentence much more accurately than word2vec’s. It also provides higher-quality distractors for answers that are uncommon in English. If the answer is out of vocabulary for word2vec—it’s not one of the few million words stored in the model—word2vec can’t generate a distractor at all. BERT avoids this because it only needs to know a few words in the context sentence to produce an output, assembling unfamiliar words from subword pieces.
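Subword tokenization is why OOV words aren’t fatal. A greedy longest-match splitter in the spirit of WordPiece—a simplified sketch with a tiny made-up vocabulary, not BERT’s actual tokenizer—looks like this:

```python
def wordpiece_split(word, vocab):
    # Greedy longest-match subword split: an unseen word decomposes
    # into known pieces instead of becoming a single unknown token.
    # Non-initial pieces carry the "##" continuation prefix, as in BERT.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece matched at this position
    return pieces

vocab = {"in", "##cub", "##ation", "orange"}
# "incubation" splits into known pieces even if it was never seen whole:
# ["in", "##cub", "##ation"]
```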


So is BERT a magic bullet?

No, there’s a lot more to it than just plugging in the model and using the raw output it gives us. For example, BERT predicts missing words in a sentence independently. When we need to create distractors for an answer that is a multi-word phrase, BERT doesn’t take into account its prediction for other missing words in the sentence, leading to distractors that are not syntactically correct. We alleviate this issue by implementing a beam search algorithm. Our implementation of beam search iterates through each word to be predicted in a sentence, generates an output, and then inserts that word into the sentence before predicting candidates for the next word.
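In sketch form, the insert-then-predict loop looks like this. Here `predict_top_k` is a stand-in for the actual BERT fill-mask call—an assumed interface for illustration, not Cerego’s real code:

```python
def fill_masks_iteratively(tokens, predict_top_k, beam_width=3):
    # tokens: a sentence as a list of words, with gaps as "[MASK]".
    # predict_top_k(sentence, i): returns (word, log_prob) candidates
    # for the mask at position i, given the current sentence state.
    # We fill one mask at a time, left to right, re-inserting each
    # choice so later predictions can see earlier ones.
    beams = [(list(tokens), 0.0)]           # (sentence, cumulative log-prob)
    for i, tok in enumerate(tokens):
        if tok != "[MASK]":
            continue
        new_beams = []
        for sent, score in beams:
            for word, logp in predict_top_k(sent, i):
                filled = sent[:i] + [word] + sent[i + 1:]
                new_beams.append((filled, score + logp))
        # Keep only the best few hypotheses at each step.
        beams = sorted(new_beams, key=lambda b: -b[1])[:beam_width]
    return beams
```

Because each mask is filled with the previous choices already in place, multi-word distractors come out syntactically coherent rather than as independent guesses.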

Another limitation stems from an inherent difference between the model’s expected behavior and our specific use case. We’re not just trying to find a syntactically plausible alternative answer, but one that works well for learning: ideally a distractor is distinct in meaning from the correct answer, yet challenging enough that a learner has to engage deeply with the question. BERT’s output is based on the most probable missing word or phrase, which biases its predictions toward common English words. We adjust for this by generating a large number of candidate distractors and then ranking them on other learning-related factors, such as their semantic or perceptual similarity to the correct answer.
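A simplified version of that reranking step might look like the following, where `similarity` and `freq_penalty` are hypothetical stand-ins for whatever semantic-similarity and word-frequency signals are actually used:

```python
def rerank_distractors(candidates, answer, similarity, freq_penalty):
    # candidates: (word, model_score) pairs from the fill-mask model.
    # Drop near-synonyms of the correct answer (they would make the
    # item ambiguous), then down-weight very common words so rarer,
    # more specific distractors can surface.
    kept = []
    for word, score in candidates:
        if similarity(word, answer) > 0.9:   # too close to the answer
            continue
        kept.append((word, score - freq_penalty(word)))
    return [w for w, _ in sorted(kept, key=lambda p: -p[1])]
```

The thresholds and penalty function here are illustrative; the point is that the model’s raw probabilities are only one input into the final ranking.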

BERT is enormously powerful, but no deep learning model is perfect at every task. As John hinted above, there’s quite a lot of processing to do on the output of a BERT model before it reaches sufficient reliability to be used in our product.

In fact, our approach is to use a range of different models with different strengths and weaknesses, much like a great team. This means the really challenging part of the problem is figuring out when to apply each one and how to select the best of their respective outputs. For example, when the correct answer is an entity like “San Ramon” or “CNN”, we can use a knowledge graph to find other entities that share the same ontological categories: other Californian cities like “Concord”, or US news networks like “MSNBC”.

The significance of BERT is that it performs so well in some of the hardest conditions for these other models, like when the correct answer is a longer phrase or a rare technical term our other models haven’t seen before. That’s what makes the addition of BERT so effective—its strengths complement the rest of the ‘team’ of models.

For more on how Cerego works, check out our Resources.


Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL-HLT. doi:10.18653/v1/N19-1423