
How to preprocess text data for NLP tasks


Aniket Rai

Hello everyone, and welcome to this quick introduction to NLP preprocessing. We will focus on a text dataset and how to clean and preprocess the data before you start working on a model.

Image source: Pixy

For the preprocessing part, we will look at:

  1. Tokenization
  2. Lowercasing
  3. Remove stop words and punctuation
  4. Stemming

Importing useful libraries

import nltk
from nltk.corpus import twitter_samples, stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer
import string
import re
import matplotlib.pyplot as plt
import random

# download the sample tweets and the stopword list
nltk.download('twitter_samples')
nltk.download('stopwords')

Data Cleaning

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Here we use regular expressions to filter out the parts of tweets that add no value to the NLP task, e.g. retweet marks, hyperlinks, and hashtag signs.
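A minimal sketch of this cleaning step (the exact patterns in the original post may differ, and `remove_noise` is a hypothetical helper name):

```python
import re

def remove_noise(tweet):
    """Strip retweet marks, hyperlinks, and the '#' sign from hashtags."""
    tweet = re.sub(r'^RT[\s]+', '', tweet)         # old-style retweet text "RT"
    tweet = re.sub(r'https?://[^\s]+', '', tweet)  # hyperlinks
    tweet = re.sub(r'#', '', tweet)                # hash sign only, keep the word
    return tweet

print(remove_noise('RT Check this out #NLP https://t.co/xyz'))
# → 'Check this out NLP '
```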


Tokenization

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)

Here we convert each sentence to a list of words, lowercase them (preserve_case=False), strip Twitter handles (strip_handles=True), and shorten elongated words such as "sooooo" (reduce_len=True). This is achieved using NLTK's tokenize module.
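For example (the sample tweet below is made up for illustration):

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)

# strip_handles drops '@user', preserve_case=False lowercases,
# reduce_len shortens 'Sooooo' to 'sooo'
tokens = tokenizer.tokenize('@user Sooooo HAPPY today!!! :)')
print(tokens)
```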

Remove stop words and punctuation

tweets_clean = []
stopwords_english = stopwords.words('english')

First we loop over the list of tokens and drop the English stopwords (from NLTK's stopword corpus) and punctuation (from Python's string module).

Stemming

stemmer = PorterStemmer()

We instantiate NLTK's Porter stemmer, loop over the cleaned tokens from the previous step, and apply the stem method to each token.
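Applied to a few cleaned tokens, the stemming step looks like this:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

tokens = ['happy', 'learning', 'tweets']
stems = [stemmer.stem(t) for t in tokens]
print(stems)
# → ['happi', 'learn', 'tweet']
```

Stems such as 'happi' are not dictionary words; the point is that inflected variants ('happy', 'happier') collapse to one form for the model.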

Source: https://chatbotslife.com/getting-started-with-text-sentiment-analysis-part-1-preprocessing-355feea7f0b4?source=rss—-a49517e4c30b—4
