
Explained: Architecture of Facebook's Blender Open-Domain Chatbot

Narendra Prasath

NLP/ML researchers know how challenging it is to build an open-domain chatbot framework. Even enterprise chatbot solution providers struggle to build a chatbot with an open-domain knowledge space. We have always known that a good chatbot should converse like a human and keep track of the conversation state. Facebook AI Research (FAIR) launched an open-source chatbot framework called Blender. It is trained on large datasets, with up to a 9.4-billion-parameter model, and it outperforms other approaches in human evaluations of engagingness and humanness.

`Good conversation requires many skills that an expert conversationalist blends in a seamless way: engagingness, knowledge, empathy and personality`

There are two main recipes for building an open-domain chatbot that performs well in human evaluation: Blended Skill Talk (BST) and the generation strategy. BST emphasizes learning desirable traits, focusing on personality, engagingness, knowledge and empathy. The system must be able to switch between these skills when needed, for example adjusting tone when a person shifts from joking to serious. For the generation strategy, minimizing perplexity is one of the main factors when training the neural network model; perplexity measures how well the model can predict and generate the next word. Researchers found that the length of the bot's utterances affects the quality of human evaluation: 1. If the agent's response is too short, it is seen as dull and showing a lack of interest. 2. If the agent's response is too long, it appears inappropriate, as if the bot is not listening.

‘Constraining the minimum beam length gives the crucial control of the dull versus spicy spectrum of responses’

Even after controlling response length, the models still show three weaknesses:

  1. a lack of in-depth knowledge
  2. a tendency to stick to simpler language
  3. a tendency to repeat oft-used phrases

To address these problems, unlikelihood training and a retrieve-and-refine mechanism help. Let's try to understand, in some depth, how the architecture is built to tackle these scenarios.

Three types of Transformer-based architectures were proposed: 1. retriever, 2. generative, and 3. retrieve-and-refine.

Fig 1: Poly-encoder Transformer architecture for the retrieval mechanism.

Retriever: as the name suggests, it finds the best next dialogue utterance by scoring a large set of candidate responses and selecting the highest-scoring one. Fig 1 shows the retriever architecture: the Poly-encoder encodes global features of the context using multiple representations (n codes, where n is a hyperparameter) that each candidate response attends to. This final attention layer gives improved results compared to a single global vector representation (i.e. the bi-encoder). The mechanism achieved state-of-the-art performance on a number of dialogue tasks. The researchers trained two Poly-encoder sizes, 256M- and 622M-parameter models (with N = 64 codes).

It gives performance comparable to the winning generative models on the ConvAI2 competition task (Zhang et al., 2018) in terms of human evaluation (Li et al., 2019b).
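To make the scoring step concrete, here is a minimal PyTorch sketch of the poly-encoder idea, assuming the context has already been summarized into N code vectors and each candidate into a single vector; the candidate attends over the codes and a dot product gives its score. This is a toy illustration, not FAIR's ParlAI implementation.

```python
import torch
import torch.nn.functional as F

def poly_encoder_score(context_codes, candidate_vecs):
    """Simplified poly-encoder scoring (a sketch, not FAIR's ParlAI code).

    context_codes:  (n_codes, dim)  -- N learned "code" representations of the context
    candidate_vecs: (n_cands, dim)  -- one vector per candidate response
    Returns one score per candidate; the highest-scoring candidate is selected.
    """
    # Each candidate attends over the N context codes...
    attn = F.softmax(candidate_vecs @ context_codes.T, dim=-1)   # (n_cands, n_codes)
    attended_context = attn @ context_codes                      # (n_cands, dim)
    # ...and the final score is the dot product with the attended context.
    return (attended_context * candidate_vecs).sum(dim=-1)       # (n_cands,)

# Toy usage with random vectors standing in for real Transformer encodings.
n_codes, dim, n_cands = 64, 768, 5                 # N = 64 codes, as in the paper
scores = poly_encoder_score(torch.randn(n_codes, dim), torch.randn(n_cands, dim))
print("selected candidate:", scores.argmax().item())
```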

Generator: the generative models are standard Seq2Seq Transformers, trained at three sizes: 90M, 2.7B and 9.4B parameters.

Retrieve and Refine introduces a way to read or access external knowledge beyond what is embedded in the model's parameters. Generative models on their own still produce dull or repetitive responses; to overcome this, a retrieval step is added before generation. Two variants of the retrieval step were introduced: 1. dialogue retrieval and 2. knowledge retrieval.

In dialogue retrieval, a response is first retrieved for the utterance using the retrieval mechanism and then appended to the generator's input sequence after a special separator token. A generator conditioned this way can produce more vibrant language than simply emitting the highest-probability utterance. Knowledge retrieval instead retrieves from a large knowledge base, using the setup proposed for the Wizard of Wikipedia task; the resulting model is referred to as the Wizard Generative model and works as follows:

  1. An initial set of candidates is generated by a TF-IDF-based inverted index lookup, as used in the Wizard of Wikipedia task.
  2. A Transformer-based retriever ranks the candidate set and selects a single sentence on which the generation is conditioned.
  3. The system also needs a way to decide when to use external knowledge at all: a two-class Transformer classifier discriminates between contexts that require knowledge and those that do not, trained on the fine-tuning tasks. A minimal sketch of the whole flow is shown below.
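To make the wiring concrete, below is a minimal Python sketch of that flow. Every helper in it is a hypothetical stand-in (the TF-IDF lookup, ranker, classifier and generator are stubs); only the order of the steps mirrors the list above.

```python
# Minimal sketch of the Wizard Generative (knowledge retrieve-and-refine) flow.
# All helpers are hypothetical stand-ins, not functions from FAIR's ParlAI code.

SEPARATOR = " __knowledge__ "  # assumed special separator token

def tfidf_retrieve(context, top_k=20):
    # Stand-in for the TF-IDF inverted-index lookup over Wikipedia sentences.
    return ["Blender is an open-domain chatbot released by Facebook AI Research."][:top_k]

def rank_candidates(context, candidates):
    # Stand-in for the Transformer retriever that ranks candidate sentences.
    return sorted(candidates, key=len, reverse=True)

def needs_knowledge(context):
    # Stand-in for the two-class classifier deciding if external knowledge is needed.
    return "?" in context

def generate(sequence):
    # Stand-in for the Seq2Seq Transformer generator.
    return f"<reply conditioned on: {sequence!r}>"

def wizard_generative_response(context):
    if not needs_knowledge(context):                    # step 3: knowledge needed at all?
        return generate(context)
    candidates = tfidf_retrieve(context)                # step 1: TF-IDF candidate set
    best_sentence = rank_candidates(context, candidates)[0]  # step 2: keep one sentence
    # The selected sentence is joined to the dialogue context with a separator
    # token, and the generator conditions on the combined sequence.
    return generate(best_sentence + SEPARATOR + context)

print(wizard_generative_response("What is Blender?"))
```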

Hence, multiple Transformer-based architectures are used across the entire pipeline. One must also be careful about choosing the right objective while training the model. Let's see how the training process works by introducing some of the objectives discussed in the paper.

Ranking for Retrieval (retrieval model): cross-entropy is minimized to select the correct response, while the other responses in the same batch are treated as negatives. The paper notes that this use of in-batch negatives enables much faster training.
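A minimal PyTorch sketch of this objective, assuming the retriever produces one embedding per context and one per candidate response; the correct pairs sit on the diagonal of the in-batch score matrix, so every other response in the batch automatically serves as a negative.

```python
import torch
import torch.nn.functional as F

def in_batch_ranking_loss(context_emb, response_emb):
    """Cross-entropy over in-batch negatives (sketch of the ranking objective).

    context_emb, response_emb: (batch, dim) encodings from the retriever.
    Row i's correct response is response i; all other responses in the batch act
    as negatives, reusing the same encodings, which is why training is fast.
    """
    scores = context_emb @ response_emb.T        # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0))        # positives lie on the diagonal
    return F.cross_entropy(scores, labels)

# Toy usage with random encodings standing in for real Transformer outputs.
batch, dim = 16, 768
print(in_batch_ranking_loss(torch.randn(batch, dim), torch.randn(batch, dim)).item())
```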

Likelihood Training for Generation (generative model): standard maximum likelihood estimation (MLE), i.e. minimizing the negative log-likelihood over the dataset, where x is the gold input context and y is the gold next utterance.
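For reference, a small PyTorch sketch of this objective: the loss is the average negative log-likelihood of the gold next-utterance tokens given the context, and perplexity is simply its exponential. The shapes and vocabulary size are toy values, not the paper's.

```python
import torch
import torch.nn.functional as F

def mle_loss(logits, gold_next_utterance):
    """Standard MLE (negative log-likelihood) for the generator -- a sketch.

    logits:              (seq_len, vocab) generator scores for each position of y,
                         computed from the gold context x and gold prefix y_<t
    gold_next_utterance: (seq_len,) gold token ids of y
    """
    nll = F.cross_entropy(logits, gold_next_utterance)  # mean of -log p(y_t | x, y_<t)
    perplexity = torch.exp(nll)                          # the quantity the paper minimizes
    return nll, perplexity

# Toy usage with random logits and a toy vocabulary size.
seq_len, vocab = 12, 8000
nll, ppl = mle_loss(torch.randn(seq_len, vocab), torch.randint(0, vocab, (seq_len,)))
print(nll.item(), ppl.item())
```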

For retrieve and refine, simply appending the dialogue retrieval response to the context of the generative model and training with MLE unfortunately does not yield satisfying results.

Unlikelihood Training for Generation: the model tends to assign too-high probabilities to repetitive or high-frequency tokens, because the model's distribution does not match the human distribution. Unlikelihood training mitigates these over-represented tokens: where the likelihood term pushes the probability of a repeated or over-represented token up, the unlikelihood term pushes it down. During training, the negative candidate tokens are chosen from n-grams whose counts in the model's output exceed their counts in the human (gold response) distribution.
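A simplified PyTorch sketch of the token-level unlikelihood idea: probability mass is pushed away from a given set of negative tokens. How that negative set is built from n-gram counts is omitted here, and the mixing weight is an assumed value, not one taken from the paper.

```python
import torch
import torch.nn.functional as F

def unlikelihood_term(logits, negative_token_ids, eps=1e-8):
    """Token-level unlikelihood penalty (simplified sketch).

    logits:             (seq_len, vocab) generator scores at each position
    negative_token_ids: (n_neg,) ids of over-represented tokens chosen as negatives
                        (selection from n-gram counts is omitted in this sketch)
    Pushes probability mass away from the negative tokens, while the usual MLE
    term keeps pushing the gold tokens up.
    """
    probs = F.softmax(logits, dim=-1)             # (seq_len, vocab)
    neg_probs = probs[:, negative_token_ids]      # p(c | context) for each negative c
    return -torch.log(1.0 - neg_probs + eps).mean()

# Toy usage: combine with the MLE term using an assumed mixing weight.
seq_len, vocab = 12, 8000
logits = torch.randn(seq_len, vocab, requires_grad=True)
gold = torch.randint(0, vocab, (seq_len,))
negatives = torch.tensor([5, 17, 42])             # hypothetical over-used token ids
loss = F.cross_entropy(logits, gold) + 0.25 * unlikelihood_term(logits, negatives)
loss.backward()
```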

The decoding method is used at inference time to generate a response to the dialogue context given as input. The paper discusses multiple generation techniques for decoding; let's look at each of them now.

Beam Search: two types of search were used, greedy search and beam search. Greedy search selects the highest-probability token at each step, while beam search maintains a fixed number of partially decoded hypotheses.

Sampling: to avoid sampling low-probability tokens, sampling is restricted to a subset of the vocabulary at each time step.

Response length: the length of the response is another factor in evaluated performance, and it is controlled in two ways. 1. A hard constraint on the minimum length of the generated response. 2. Predicting the response length from the conversation data, e.g. a 4-class classifier that predicts length buckets (<10, <20, <30, or >30 tokens). This classifier has the same architecture as the retriever model; at test time it predicts the length of the next response, which makes the system more complex.

Subsequence Blocking: sequence models are known to repeat subsequences, so beam blocking of n-grams (n = 3) is implemented, applied to n-grams in both the generated response and the input sequence.
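These controls are easy to try in practice. The sketch below uses the Hugging Face port of BlenderBot to illustrate beam search with a minimum response length and 3-gram blocking; the checkpoint name and the specific parameter values are assumptions for the demo, not the exact settings used in the paper.

```python
# Decoding with beam search, a minimum response length, and 3-gram blocking,
# illustrated with the Hugging Face port of BlenderBot. The checkpoint and the
# parameter values are chosen for the demo, not taken from the paper.
from transformers import BlenderbotTokenizer, BlenderbotForConditionalGeneration

name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("Hello, how was your day?", return_tensors="pt")
reply_ids = model.generate(
    **inputs,
    num_beams=10,            # beam search keeps a fixed number of hypotheses
    min_length=20,           # hard constraint: no response shorter than 20 tokens
    no_repeat_ngram_size=3,  # block repeated 3-grams in the generated response
)
print(tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0])
```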

So far, we have discussed the techniques covered in the paper. The training process is not covered in this part of the article. Please give your support or leave a comment below with any improvements or suggestions; it helps motivate me to enhance my content in further articles. Thanks for reading! You can reach me on LinkedIn.

You can find the paper on arXiv for Recipes for Building an Open-domain Chatbot. Also, the code will be available here.

Source: https://chatbotslife.com/explained-architecture-of-facebook-blender-open-domain-chatbot-441a4201753d
