Creating CartmanBot: Part Two

Developing the chatbot with Microsoft DialoGPT

In the first part of the tutorial, we extracted the required raw data by HTML scraping and constructed a data frame that contains character names and their dialogues. If you would like to refer back to the first part, here’s the link.

In this part, we will continue with the following steps –

Cleaning the data
Modifying the data frame
Training the chatbot with the data
Testing the chatbot

But before these steps, let’s talk about the NLP model that we will use to train CartmanBot.

The Natural Language Processing model that we will use to train our chatbot is Microsoft DialoGPT, a large-scale pretrained dialogue response generation model trained on 147M multi-turn dialogues from Reddit discussion threads. According to its authors’ human evaluation, the responses generated by DialoGPT are comparable in quality to human responses under a single-turn conversation Turing test.

Now, let’s import the packages that we need –

import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import (
    MODEL_WITH_LM_HEAD_MAPPING,
    WEIGHTS_NAME,
    AdamW,
    AutoConfig,
    AutoModelWithLMHead,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
    get_linear_schedule_with_warmup,
)
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from tqdm.notebook import tqdm, trange
from pathlib import Path

The first thing we will check is if our data frame has rows with missing values. There are many ways to check this but we will simply create two temporary data frames. The first one is for the rows with missing character values –

df_test = df[df['Character']==""]
df_test

Checking if the data frame has missing character values

The second data frame is for the rows with missing dialogue values –

df_test2 = df[df['Line']==""]
df_test2

Checking if the data frame has missing dialogue values

So, our original data frame has both rows with missing character names and rows with missing dialogues. The rows with missing dialogues mainly belong to Kenny, who mumbles a lot. This could be why the Fandom website left these lines blank instead of writing out the gibberish.

To scrub our data frame, we will only remove the rows with missing character values. We will keep the rows with missing dialogues since they belong to an important character in the series.

df = df[df['Character']!=""]

The next thing we will check is if our data frame has character names that are similar to Cartman.

df[df['Character'].str.contains('Cartman')]

Checking if the data frame has character names similar to Cartman

As you can see, the first row of the resulting data frame has “Eric Cartman” as the character name. We will have to change this name to “Cartman” to keep our data consistent.

df.loc[df['Character'] == "Eric Cartman", 'Character'] = "Cartman"

I would suggest doing the same for other characters such as Stan, Kyle, and Kenny. I will skip these steps here so that we can save time.
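
If you do want to rename several characters at once, a lookup table keeps it to one pass. The sketch below uses plain Python so that it runs on its own. Only "Eric Cartman" is confirmed by the data above; the other long-form names are assumptions about what the scraped data might contain, and `normalize_name` is a hypothetical helper rather than part of the original notebook.

```python
# Hypothetical lookup table: only "Eric Cartman" is confirmed by the data above;
# the other long-form names are assumptions for illustration.
NAME_MAP = {
    "Eric Cartman": "Cartman",
    "Stan Marsh": "Stan",
    "Kyle Broflovski": "Kyle",
    "Kenny McCormick": "Kenny",
}

def normalize_name(name):
    """Return the short form of a character name, or the trimmed name unchanged."""
    cleaned = name.strip()
    return NAME_MAP.get(cleaned, cleaned)

characters = ["Eric Cartman", "Stan", "Kyle Broflovski ", "Butters"]
normalized = [normalize_name(c) for c in characters]
print(normalized)
```

With a data frame, the same dictionary can be passed to `df['Character'].replace(NAME_MAP)`.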

Next, we will convert our data frame so that it is suitable for training CartmanBot. We can modify the data frame so that each of Cartman’s lines is paired with the seven lines of dialogue that precede it. These seven lines provide context to our chatbot.

Coding this task is simple. First, we go through every row in the data frame and check if the dialogue belongs to Cartman. If it does, we add this line and the seven previous lines to a list. This list is later converted into a new data frame that contains a response column and seven context columns.

# Re-number the rows so that lookups by position work after the filtering above.
df = df.reset_index(drop=True)
contexted = []
n = 7
for i in range(n, len(df['Line'])):
    if df['Character'][i] == "Cartman":
        row = []
        prev = i - 1 - n  # stop index: the row holds the response plus n previous lines
        for j in range(i, prev, -1):
            row.append(df['Line'][j])
        contexted.append(row)
columns = ['response', 'context']
columns = columns + ['context_' + str(i + 2) for i in range(n - 1)]
df1 = pd.DataFrame.from_records(contexted, columns=columns)
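
To see what this windowing produces, here is a small sketch of the same idea on a toy transcript with a context of two lines instead of seven. The transcript is invented for illustration, and `build_context_rows` is a hypothetical helper, not a function from the original notebook.

```python
def build_context_rows(characters, lines, speaker, n):
    """Collect [response, previous line, ..., n-th previous line] for each line by `speaker`."""
    rows = []
    for i in range(n, len(lines)):
        if characters[i] == speaker:
            # Walk backwards from the response through the n lines before it.
            rows.append([lines[j] for j in range(i, i - n - 1, -1)])
    return rows

# Toy transcript (invented for illustration).
characters = ["Stan", "Kyle", "Cartman", "Stan", "Cartman"]
lines = ["Hi.", "Hello.", "Whatever.", "Dude.", "I'm going home."]

rows = build_context_rows(characters, lines, speaker="Cartman", n=2)
for row in rows:
    print(row)
```

Each row starts with Cartman's response, followed by the context from most recent to oldest. Any Cartman line that appears within the first `n` rows is skipped because it does not have a full context.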

Let’s see the resulting data frame –

df1.head(5)

Data Frame with Response and Context

We will then use Scikit-Learn’s “train_test_split()” function to split our new data frame into an 80% training set and a 20% testing set. The testing set will be used for evaluating the model later.

trn_df, val_df = train_test_split(df1, test_size = 0.2)

Now that we have a compatible data frame, we can move on to training CartmanBot. This section has a lot of complex code that I borrowed from Nathan Cooper’s tutorial on creating open-dialog chatbots for learning new languages. If you would like to refer to this tutorial, here’s the link.

Before training CartmanBot, we need to write some helper functions and the main function. The first thing we will write is an argument class that makes it easy to convert a Python script into a Google Colab notebook.
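
As a rough idea of what such an argument class can look like, here is a minimal sketch. It replaces command-line flags with attributes that can be edited in a notebook cell. The field names and default values are illustrative assumptions, not the settings used to train CartmanBot.

```python
from dataclasses import dataclass

@dataclass
class Args:
    """Training settings gathered in one place instead of command-line flags."""
    # All names and values below are illustrative assumptions, not the author's settings.
    output_dir: str = "output-small"
    model_name_or_path: str = "microsoft/DialoGPT-small"
    block_size: int = 512
    per_gpu_train_batch_size: int = 4
    learning_rate: float = 5e-5
    num_train_epochs: int = 3
    seed: int = 42

args = Args()
args.num_train_epochs = 1  # override a single setting in a notebook cell
print(args)
```

Because every setting is a plain attribute, the rest of the training code can read `args.learning_rate` exactly as it would read a parsed command-line argument.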

As the next step, we will write a function that joins Cartman’s response and its context strings with a special ‘end of string’ token between them. In this way, our model will understand where each line ends.
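
The joining step can be sketched as follows. To keep the example self-contained, it uses a hypothetical `ToyTokenizer` that assigns one id per word. With the real model, the tokenizer would come from the transformers library, and the ids would differ. The function name `construct_conv` and the sample row are also assumptions for illustration.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""
    eos_token_id = 0

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign ids starting at 1 so that 0 stays reserved for the end-of-string token.
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]

def construct_conv(row, tokenizer):
    """Flatten [response, context, context_2, ...] into one id list, oldest line first."""
    conv = []
    for text in reversed(row):
        conv.extend(tokenizer.encode(text) + [tokenizer.eos_token_id])
    return conv

tokenizer = ToyTokenizer()
row = ["screw you guys", "what now", "hey dude"]  # response first, oldest context last
ids = construct_conv(row, tokenizer)
print(ids)
```

The row is reversed so that the conversation reads in chronological order, with Cartman's response last, and every line is followed by the end-of-string id.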

We will also write a train function that loads batches of data as both inputs and labels. This is necessary because Microsoft DialoGPT is an auto-regressive model, meaning that it uses some context to predict the next token. Each prediction is then appended to the context and fed back into the model to predict the token after it.
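
The sketch below illustrates both halves of that idea in plain Python. `next_token_pairs` shows why one sequence can serve as inputs and labels: every prefix is an input, and the token that follows it is the label. `generate` shows the feedback loop used at prediction time. The `toy_predict_next` function is a made-up stand-in for the model, and all names here are hypothetical.

```python
def next_token_pairs(token_ids):
    """Pair each prefix with the token that should follow it."""
    return [(token_ids[:t], token_ids[t]) for t in range(1, len(token_ids))]

def generate(context, predict_next, eos_id, max_new_tokens=10):
    """Greedy auto-regressive loop: append each prediction and feed it back in."""
    output = list(context)
    for _ in range(max_new_tokens):
        token = predict_next(output)
        output.append(token)
        if token == eos_id:
            break
    return output

def toy_predict_next(tokens):
    """Made-up 'model': predicts the last token plus one, then the end token (0) after 4."""
    return tokens[-1] + 1 if tokens[-1] < 4 else 0

pairs = next_token_pairs([1, 2, 3, 0])
generated = generate([1, 2], toy_predict_next, eos_id=0)
print(pairs)
print(generated)
```

In the Hugging Face GPT-2 family of models, which DialoGPT belongs to, the shift between inputs and labels happens inside the model, which is why the training loop can pass the same batch for both.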

In terms of evaluation, we will use the perplexity metric, which measures how confused our model is in its choice of the next token. The more perplexed our model is, the higher the score will be, so a lower score is better.
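
Concretely, perplexity is the exponential of the average negative log-probability that the model assigns to the correct tokens. The sketch below computes it from a list of such probabilities. The probability values are invented for illustration, and `perplexity` is a hypothetical helper rather than the evaluation function from the original notebook.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

confident = perplexity([0.9, 0.8, 0.95])   # high probability on the right tokens
confused = perplexity([0.25, 0.25, 0.25])  # guessing among four equally likely tokens
print(confident, confused)
```

A model that guesses uniformly among four tokens has a perplexity of 4, so the score can be read as the effective number of choices the model is torn between.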

Now that we have all the necessary helper functions, we can combine them and write the main function.

Okay! The main function has been created. Let’s train CartmanBot with the training and testing datasets.

main(trn_df, val_df)

CartmanBot has been trained, so it’s time to test it out. We will use the following code to chat for 5 lines. The code lets the conversation switch back to us as soon as CartmanBot finishes its turn, that is, when it generates the [end_of_turn] token.
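
The turn-taking logic can be sketched in plain Python as below. This is not the chat code itself, only an illustration under stated assumptions: `EOS` is a hypothetical id for the end-of-turn token, and `toy_generate` is a stand-in for the trained model. With the real model, the ids would come from the tokenizer and the generation step would be a call to the model's `generate` method.

```python
EOS = 0  # hypothetical id of the end-of-turn token

def chat_turn(history, user_ids, generate):
    """Append the user's turn, let the bot generate, and return (new_history, reply)."""
    bot_input = history + user_ids + [EOS]
    full_output = generate(bot_input)
    # Everything after the prompt is the bot's reply; drop the trailing end token.
    reply = [t for t in full_output[len(bot_input):] if t != EOS]
    return full_output, reply

def toy_generate(input_ids):
    """Stand-in for a real model: echo the last user token twice, then end the turn."""
    last_user_token = input_ids[-2]
    return input_ids + [last_user_token, last_user_token, EOS]

history = []
replies = []
for user_ids in ([5, 6], [7]):
    history, reply = chat_turn(history, user_ids, toy_generate)
    replies.append(reply)
print(replies)
print(history)
```

The history keeps growing with every turn, so the bot always sees the earlier exchanges as context, and the end token marks the point where control returns to the user.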

Here’s a sample chat I had with CartmanBot in Google Colab.

Chatting with CartmanBot in Google Colab Notebook

In addition to chatting in Google Colab notebook, I also used JupyterDash to create a Dash app to chat with CartmanBot. You can refer to this guide if you want to create a similar chat app.

And, here are a few chats I had in the Dash app.

Chatting with CartmanBot in JupyterDash App

One more –

Chatting with CartmanBot in JupyterDash App

Great! Just like Eric Cartman, our CartmanBot replies with hilarious dialogues. Its responses are also almost comparable to how humans respond to each other. I’m pretty happy with the result.

In this tutorial, we created a chatbot that mimics Eric Cartman from the South Park animated series by using the Microsoft DialoGPT model. We started with HTML scraping of the raw data (character names and dialogues) in the first part of the tutorial. In this second part, we continued with data scrubbing, data frame modification, model training, and model testing.

With the help of this tutorial, I hope you will create chatbots of your favorite characters. If you do create one, please leave a comment and let me know!

Source: https://chatbotslife.com/creating-cartmanbot-part-two-de95c348aeb1?source=rss—-a49517e4c30b—4
