NLP Classification: Part 1

Jobeth Muncy

Wine or cocktail?

This project uses an API to collect submissions and comments from two different subreddits, then applies NLP to predict which subreddit a given string of text belongs to.

I chose r/cocktails, which has 176K members, and r/wine, which has 116K members. Thinking back to my past career as a bartender, when many people sit down their first choice is generally 'do I want a cocktail or wine?' I suspected there would be many similarities in the language used to describe and talk about wine and cocktails.

The first step of this process was to get the data off of Reddit so I could create DataFrames. These are the imports used for this section of the project.

import requests
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd
import time

I didn’t know whether I wanted to use the submissions or the comments from these subreddits, so I tried both. In the end, I decided to go with the comments, as many of the submissions were photos of bottles of wine, liquor, or crafted creations. The comments contained a much more substantial amount of actual text.

url = "https://api.pushshift.io/reddit/search/submission"

The next step was to send a request to the API, specifying the subreddit and how many records I wanted.

params = {
'subreddit' : 'cocktails', 
'size' : '500'
}
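Sending the request is one line with the requests library. As a minimal sketch (using the url and params shown above), the request can be prepared locally first to see how the params dict gets encoded into the query string, before actually sending it with requests.get(url, params=params):

```python
import requests

url = "https://api.pushshift.io/reddit/search/submission"
params = {
    'subreddit': 'cocktails',
    'size': '500'
}

# requests encodes the params dict into the URL's query string;
# preparing the request shows the final URL without sending anything
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)
# https://api.pushshift.io/reddit/search/submission?subreddit=cocktails&size=500
```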

I got a code 200 returned, which is a successful response. HTTP status codes fall into the following classes:

  1. 1xx Informational responses
  2. 2xx Success
  3. 3xx Redirects
  4. 4xx Client-side errors (think 404 errors)
  5. 5xx Server-side errors

I like to check that things are working one step at a time, so I used the following (per Riley Dallas) to make sure I would be able to build a DataFrame.

data = res.json()
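res.json() returns the parsed payload, and Pushshift wraps its results under a 'data' key, so a DataFrame can be built directly from that list. A minimal sketch with a stand-in payload (the two sample comments here are made up for illustration):

```python
import pandas as pd

# stand-in for res.json(): Pushshift returns results under the 'data' key
data = {
    'data': [
        {'subreddit': 'cocktails', 'body': 'Try a Negroni.', 'created_utc': 1587245000},
        {'subreddit': 'cocktails', 'body': 'Shaken, not stirred.', 'created_utc': 1587244000},
    ]
}

# each dict in the list becomes a row, keys become columns
df = pd.DataFrame(data['data'])
print(df.shape)  # (2, 3)
```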

Yes, I got the data I wanted, but I needed a function to create a DataFrame much larger than the 500 rows my request was currently set to collect. Tim Book (GA DSI Global Instructor) helped us write this function to run the request as many times as we wanted, collecting the desired number of comments.

The timestamps are in epoch time, which I later ran through a converter to see the date range of my pull. I added a timer to pause 3 seconds before the next pull of 500 comments. Some APIs have maximum limits per time window, so it is a good habit to use a timer and check the limits.

url_comments = "https://api.pushshift.io/reddit/search/comment"

def get_comments(subreddit, n_iter):
    df_list = []
    current_time = 1587245505

    for _ in range(n_iter):
        res = requests.get(
            url_comments,
            params={
                "subreddit": subreddit,
                "size": 500,
                "before": current_time
            }
        )
        time.sleep(3)
        df = pd.DataFrame(res.json()['data'])
        df = df.loc[:, ["subreddit", "body", "created_utc"]]
        df_list.append(df)
        # step back in time: the next pull fetches comments older than the oldest so far
        current_time = df.created_utc.min()

    return pd.concat(df_list, axis=0)

I wanted 50,000 comments so I ran the following:

df_comments_cocktails = get_comments('cocktails', 100)

Next, I checked the body column for duplicates; there were 48,527 unique values.

df_comments_cocktails[df_comments_cocktails['body'].duplicated() == False]
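The boolean mask above keeps the first occurrence of each body and filters out later repeats. A toy example of the same pattern (the sample rows are made up):

```python
import pandas as pd

df = pd.DataFrame({'body': ['great wine', 'nice pour', 'great wine', 'oaky finish']})

# duplicated() marks every repeat after the first as True,
# so comparing to False keeps only first occurrences
unique_rows = df[df['body'].duplicated() == False]
print(len(unique_rows))  # 3
```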

As mentioned before, I used the Epoch Converter to check the range of time collected.

print('min', df_comments_cocktails['created_utc'].min())
# Friday, November 29, 2019 8:59:06 PM
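Instead of a web-based converter, the same check can be done with Python's standard library. A small sketch using the 1587245505 starting timestamp from the function above as an example:

```python
from datetime import datetime, timezone

# convert an epoch timestamp to a human-readable UTC datetime
ts = 1587245505
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt.strftime('%A, %B %d, %Y %I:%M:%S %p'))
# Saturday, April 18, 2020 09:31:45 PM
```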

As the last step, I saved my DataFrame. I have learned from past experience that it is safest to comment out my to_csv call after exporting. I have accidentally overwritten files by re-running a notebook top to bottom; with the line commented out, I have to take an extra, intentional step to overwrite the existing file. Use index=False to avoid having to drop an Unnamed: 0 column in the next section.

# for data used for 50K comments pulled
# df_comments_cocktails.to_csv('...', index=False)
# commented out so I don't accidentally overwrite

The same process was done for wine. In the next section, I will show you how to clean the text data to prepare it for modeling.

Source: https://chatbotslife.com/nlp-classification-part-1-f0034d0a64a3?source=rss—-a49517e4c30b—4
