Hello everyone, and welcome to this quick introduction to NLP preprocessing. We will focus on a text dataset and on how to clean and preprocess the data before you start working on a model.
For the preprocessing part, we will look at:
- Tokenization
- Lowercasing
- Remove stop words and punctuation
- Stemming
Importing useful libraries
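import re
import string
import random
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords, twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
# download the corpora used below (only needed once)
nltk.download('twitter_samples')
nltk.download('stopwords')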
Data Cleaning
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
# select a sample tweet
tweet = positive_tweets[48]
# remove the old-style retweet marker "RT" at the start of the tweet
tweet_n = re.sub(r'^RT[\s]+', '', tweet)
# remove hyperlinks (this also drops any text after the link on the same line)
tweet_n = re.sub(r'https?://.*[\r\n]*', '', tweet_n)
# remove only the hash sign, keeping the hashtag word itself
tweet_n = re.sub(r'#', '', tweet_n)
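Here we use regular expressions to filter out the parts of a tweet that add little value to the NLP task, e.g. the retweet marker, hyperlinks, and the hash sign of hashtags. Note that only the # character is removed; the hashtag word itself is kept because it often carries meaning.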
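To see these substitutions at work without downloading the corpus, here is a minimal sketch on a made-up tweet. The function name clean_tweet and the sample string are hypothetical, and the URL pattern uses \S+ (a choice made for this sketch, not the pattern from the post) so that only the link itself is removed and the text after it survives.

```python
import re

def clean_tweet(text):
    """Strip the retweet marker, URLs, and hash signs from one tweet."""
    text = re.sub(r'^RT[\s]+', '', text)      # leading "RT " marker
    text = re.sub(r'https?://\S+', '', text)  # URLs, up to the next whitespace
    text = re.sub(r'#', '', text)             # '#' only, the word is kept
    # collapse the gaps left behind by the removals
    return re.sub(r'\s+', ' ', text).strip()

# hypothetical sample, not taken from twitter_samples
sample = "RT Loving the new #NLP course https://example.com/course so far"
cleaned = clean_tweet(sample)
print(cleaned)  # Loving the new NLP course so far
```

The final whitespace collapse is optional, since the tokenizer in the next step ignores extra spaces anyway.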
Tokenization
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)
tokens = tokenizer.tokenize(tweet_n)
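Here we convert the tweet into a list of tokens using the TweetTokenizer class from NLTK's tokenize module; whitespace such as blank spaces and tabs is dropped in the process. With preserve_case=False the tokens are lowercased, strip_handles=True removes @username handles, and reduce_len=True shortens runs of three or more repeated characters to exactly three.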
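To make the effect of the three options concrete, here is a rough approximation written with the standard library only. It is not NLTK's implementation: the function name rough_tweet_tokenize, the handle pattern, and the simple emoticon pattern are assumptions made for illustration, and the real TweetTokenizer handles many more cases.

```python
import re

def rough_tweet_tokenize(text):
    """Very rough imitation of the three TweetTokenizer options used above."""
    text = text.lower()                            # like preserve_case=False
    text = re.sub(r'@\w+', '', text)               # like strip_handles=True
    text = re.sub(r'(.)\1{2,}', r'\1\1\1', text)   # like reduce_len=True
    # simple emoticons first, then words, then single punctuation marks
    return re.findall(r'[:;=]-?[)(]|\w+|[^\w\s]', text)

# hypothetical sample tweet
sample = "@NLTK_user Sooooo HAPPYYYY about this!!! :)"
rough_tokens = rough_tweet_tokenize(sample)
print(rough_tokens)
# ['sooo', 'happyyy', 'about', 'this', '!', '!', '!', ':)']
```

For real data the NLTK tokenizer is the better choice; the sketch only shows why the handle disappears and why "Sooooo" ends up as "sooo".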
Remove stop words and punctuation
tweets_clean = []
stopwords_english = stopwords.words('english')
for word in tokens:
    if (word not in stopwords_english and word not in string.punctuation):
        tweets_clean.append(word)
We loop over the list of tokens and keep only those that appear neither in NLTK's English stop word list nor in string.punctuation, which comes from Python's standard library rather than from NLTK.
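The filtering step can also be written as a list comprehension over sets, which makes each membership test fast. This is a minimal sketch: STOP_WORDS is a tiny hypothetical stand-in for stopwords.words('english') so that the example runs without any download, and the function name is made up for illustration.

```python
import string

# Hypothetical stand-in; the real NLTK list is much longer.
STOP_WORDS = {"i", "am", "the", "a", "is", "to", "and", "of"}
# Set of single punctuation characters such as '!' and '#'.
PUNCTUATION = set(string.punctuation)

def remove_stopwords_and_punctuation(tokens):
    """Keep tokens that are neither stop words nor single punctuation marks."""
    return [t for t in tokens if t not in STOP_WORDS and t not in PUNCTUATION]

tokens = ["i", "am", "happy", "to", "share", "the", "news", "!", ":)"]
filtered = remove_stopwords_and_punctuation(tokens)
print(filtered)  # ['happy', 'share', 'news', ':)']
```

Notice that the emoticon ':)' survives because it is not a single punctuation character, which is usually what you want for sentiment tasks.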
Stemming
stemmer = PorterStemmer()
tweets_stem = []
for word in tweets_clean:
    stem_word = stemmer.stem(word)
    tweets_stem.append(stem_word)
We instantiate NLTK's PorterStemmer class, loop over the cleaned tokens from the previous step, and apply its stem method to each token. Stemming reduces a word to its root form, so that, for example, "running" becomes "run".
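Putting the four steps together, one possible way to reuse them on every tweet is a single function. This is a sketch under assumptions: the name preprocess_tweet and its parameters are hypothetical, the tokenizer and stemmer are passed in as arguments (any objects with a tokenize and a stem method will do), and the URL pattern uses \S+ rather than the pattern shown earlier.

```python
import re
import string

def preprocess_tweet(tweet, tokenizer, stop_words, stemmer):
    """Clean, tokenize, filter, and stem one tweet; returns a list of stems."""
    # 1. cleaning: retweet marker, URLs, hash signs
    text = re.sub(r'^RT[\s]+', '', tweet)
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'#', '', text)
    # sets make the membership tests below fast
    stop_words = set(stop_words)
    punctuation = set(string.punctuation)
    # 2. tokenize, 3. drop stop words and punctuation, 4. stem
    return [
        stemmer.stem(token)
        for token in tokenizer.tokenize(text)
        if token not in stop_words and token not in punctuation
    ]
```

With the objects created in this post, the call would look like preprocess_tweet(tweet, tokenizer, stopwords_english, stemmer), and the same function can then be applied to every entry of positive_tweets and negative_tweets.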