How To Preprocess Text Data For NLP Tasks

Hello everyone, and welcome to this quick introduction to NLP preprocessing. We will focus on a text dataset and on how to clean and preprocess the data before you start working on a model.


For the preprocessing part we will be looking at:

Tokenization
Lowercasing
Remove stop words and punctuation
Stemming

Importing useful libraries

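import re
import string
import random
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords, twitter_samples
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

# download the sample tweets and the English stop-word list
nltk.download('twitter_samples')
nltk.download('stopwords')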

Data Cleaning

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

# selecting a sample tweet
tweet = positive_tweets[48]
# remove the old-style retweet marker "RT"
tweet_n = re.sub(r'^RT[\s]+', '', tweet)
# remove hyperlinks
tweet_n = re.sub(r'https?://.*[\r\n]*', '', tweet_n)
# remove only the hash sign '#', keeping the hashtag word
tweet_n = re.sub(r'#', '', tweet_n)

Here we use regular expressions to filter out the parts of a tweet that add no value to the NLP task, such as the retweet marker, links, and the hash sign of hashtags.
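The three substitutions above can be wrapped in one helper, which makes them easy to try on other tweets. The sketch below uses only Python's re module; the function name clean_tweet, the final strip() call, and the sample tweet are made up for illustration and are not part of the dataset or of NLTK.

```python
import re


def clean_tweet(tweet):
    """Strip the retweet marker, hyperlinks, and '#' signs from a tweet."""
    # Remove the old-style retweet marker "RT" at the start of the tweet.
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # Remove hyperlinks. Note that this pattern deletes everything from
    # the link to the end of the line, not just the link itself.
    tweet = re.sub(r'https?://.*[\r\n]*', '', tweet)
    # Remove only the '#' sign, keeping the hashtag word itself.
    tweet = re.sub(r'#', '', tweet)
    # Extra step (an assumption, not in the article): trim leftover spaces.
    return tweet.strip()


# Hypothetical tweet, invented for this example.
sample = "RT Loving the new #NLP course :) https://example.com/course"
print(clean_tweet(sample))
```

Because the hyperlink pattern consumes the rest of the line, it works best when links sit at the end of the tweet, which is the common case.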

Tokenization

tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)
tokens = tokenizer.tokenize(tweet_n)

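Here we convert the tweet to a list of tokens using the TweetTokenizer class from NLTK's tokenize module. With preserve_case=False the tokens are lowercased, strip_handles=True removes Twitter handles such as @username, and reduce_len=True shortens runs of three or more repeated characters to three. Blank spaces and tabs between words are dropped in the process.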
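To see what the three tokenizer options do, it can help to run the tokenizer on a short made-up tweet. The sketch below assumes NLTK is installed; the sample text is invented for illustration and is not taken from the twitter_samples corpus.

```python
from nltk.tokenize import TweetTokenizer

# Same settings as in the article: lowercase the tokens, drop @handles,
# and shorten runs of three or more repeated characters.
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,
                           reduce_len=True)

# Hypothetical tweet, invented for this example.
sample = "@some_user This is WAAAAAYYYY too much fun!!!"
tokens = tokenizer.tokenize(sample)
print(tokens)
```

The handle should disappear, the words should come back lowercased, and the stretched word should be shortened to three repeats of each letter.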

Remove stop words and punctuation

tweets_clean = []
stopwords_english = stopwords.words('english')

for word in tokens:
    if (word not in stopwords_english and word not in string.punctuation):
        tweets_clean.append(word)

We loop over the list of tokens and keep only those that are neither in NLTK's English stop-word list nor in Python's string.punctuation.
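The same filter can be tried without downloading anything. The sketch below swaps NLTK's stop-word list for a tiny hand-picked set, which is an assumption made purely for illustration, and runs on a made-up token list.

```python
import string

# Tiny hand-picked stop-word set for illustration only (an assumption);
# the article uses the full list from stopwords.words('english').
demo_stopwords = {"i", "am", "the", "a", "to", "is", "and"}

# Hypothetical tokens, as a tokenizer might return them.
tokens = ["i", "am", "happy", "to", "learn", "nlp", "!", ":)"]

# Keep a token only if it is neither a stop word nor punctuation.
tweets_clean = [
    word for word in tokens
    if word not in demo_stopwords and word not in string.punctuation
]
print(tweets_clean)  # ['happy', 'learn', 'nlp', ':)']
```

Since string.punctuation is a plain string, the test `word not in string.punctuation` is a substring check. Single punctuation marks such as "!" are removed, while an emoticon such as ":)" survives because that exact character sequence does not occur in string.punctuation.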

Stemming

stemmer = PorterStemmer()
tweets_stem = []

for word in tweets_clean:
    stem_word = stemmer.stem(word)
    tweets_stem.append(stem_word)

We instantiate NLTK's PorterStemmer class, loop over the cleaned tokens from the previous step, and apply the stem method to each token.
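A quick way to get a feel for stemming is to run the stemmer on a few words by hand. The sketch below assumes NLTK is installed; the word list is invented for this example.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical cleaned tokens, invented for this example.
words = ["running", "tweets", "happiness"]

# Reduce each word to its stem.
stems = [stemmer.stem(word) for word in words]
print(stems)  # ['run', 'tweet', 'happi']
```

Note that a stem is not always a dictionary word: "happiness" becomes "happi". This is fine for most models, because the goal is only to map related word forms to the same token.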

Source: https://chatbotslife.com/getting-started-with-text-sentiment-analysis-part-1-preprocessing-355feea7f0b4?source=rss—-a49517e4c30b—4
