Big Data

Real-World Machine Learning Case Study: Clustering Transactions Based on Text Descriptions





We are living in the era of digital technologies. When was the last time you walked into a shop that didn’t have a PayTM or BHIM UPI? These digital transaction technologies have quickly become a key part of our daily lives.

And not just at an individual level: these digital technologies are at the core of every financial institution. Executing a payment transaction or fund transfer has become very smooth, with multiple options (internet banking, ATMs, credit or debit cards, UPI, POS machines, etc.) backed by reliable systems running at the backend.

For every transaction we make, an appropriate description message is generated against it.

In this article, we’ll talk about a real-world use case of a financial institution using clustering (a popular unsupervised machine learning technique) to customize its product offerings for its customer base.

Motivation Behind this Case Study

As a financial institution, it’s always important to engage the existing customer base with customized offers based on their varying interests. It is a significant challenge for any financial institution to capture the ideal 360-degree view of a customer.

Social media platforms like Twitter, WhatsApp, Facebook, etc. have become primary sources of information for profiling a customer’s interests and preferences. A financial institution often incurs huge costs for obtaining this data from third-party sources. Even then, it is very difficult to map a social media account to a unique customer.

So how do we solve this?


The above problem can be partially solved by using the in-house transaction data available with the institution.

We can cluster the transactions performed by a customer into discrete categories based on the transaction description message. This approach can be used to flag whether a transaction was performed for Food, Sports, Clothes, Bill Payment, Household, Others, etc. If most of a customer’s transactions fall into a particular category, we have a better estimate of their preferences.

Here’s the Approach We Took

Let’s understand how we approached this problem statement and the key steps we took to figure out a solution.

Determining the Number of Topics

We start the process with all transactions and their description messages mapped to each customer. The first important task is to finalize the number of clusters (also called categories or topics). To achieve this objective, we use Topic Modelling.

Topic Modelling is a method for unsupervised classification of documents which finds natural groups of items even when we’re not sure what we are looking for. Latent Dirichlet Allocation (LDA) is the method most commonly used for fitting a topic model.

LDA treats each document (i.e. each transaction description) as a mixture of topics, and each topic as a mixture of words. Here’s an example: the word “budget” might occur in both a movie topic and a politics topic. The underlying assumption of LDA is that every observation in the sample comes from an unknown distribution that can be explained by a generative statistical model.

Let us apply this methodology to our problem. We assume there is a generative statistical model that has produced all the words in the transaction descriptions from unknown distributions (i.e. unknown groups or topics). We then estimate this model so that it gives the probability of a word belonging to a particular topic.
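To make the "mixture of topics, mixture of words" idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written with the standard library only. It is an illustration rather than the model used in this case study: the four transaction descriptions, the function name `fit_lda`, and the values of `alpha`, `beta` and `n_iter` are all assumptions, and in practice a library implementation would normally be used.

```python
import random
from collections import defaultdict


def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]                # words of doc d in topic k
    topic_word = [defaultdict(int) for _ in range(n_topics)]  # count of word w in topic k
    topic_total = [0] * n_topics                              # all words in topic k
    assignments = []

    # Start from a random topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, then resample it.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # theta[d][k]: share of topic k in document d.
    theta = [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    # phi[k][w]: probability of word w under topic k.
    phi = [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + n_vocab * beta) for w in vocab}
        for k in range(n_topics)
    ]
    return theta, phi


# Hypothetical transaction descriptions, for illustration only.
descriptions = [
    "dominos pizza order",
    "subway sandwich order",
    "netflix monthly subscription",
    "spotify monthly subscription",
]
docs = [text.split() for text in descriptions]
theta, phi = fit_lda(docs, n_topics=2)

for text, mixture in zip(descriptions, theta):
    print(text, [round(p, 2) for p in mixture])
```

Each row of `theta` is one description's topic mixture, and each entry of `phi` is one topic's word distribution, which matches the two mixtures described above.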

Topic Coherence

Fixing the total number of topics by manually looking at the top keywords across topics can be inconsistent, so we need an objective way to assess the correct number of topics. We use the Topic Coherence measure for this.

Topic coherence is computed on the top N words of each topic. It is defined as the average (or median) of the pairwise word-similarity scores of those words. A good model will generate coherent topics, i.e., topics with high topic coherence scores.

Good topics are those that can be described by a short label, and this is what the topic coherence measure tries to capture.
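The definition above (the average of pairwise word-similarity scores over a topic's top words) can be sketched as follows. The similarity function is an assumption chosen for simplicity: two words count as similar when they tend to appear in the same descriptions (Jaccard overlap of the documents containing them). Published coherence measures use other similarity scores, and the sample descriptions and word lists here are hypothetical.

```python
from itertools import combinations
from statistics import mean


def build_doc_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = {}
    for doc_id, doc in enumerate(docs):
        for word in set(doc):
            index.setdefault(word, set()).add(doc_id)
    return index


def topic_coherence(top_words, index):
    """Average pairwise similarity of a topic's top words."""
    scores = []
    for w1, w2 in combinations(top_words, 2):
        docs1 = index.get(w1, set())
        docs2 = index.get(w2, set())
        union = docs1 | docs2
        # Jaccard overlap of the documents in which each word occurs.
        scores.append(len(docs1 & docs2) / len(union) if union else 0.0)
    return mean(scores) if scores else 0.0


# Hypothetical tokenized transaction descriptions.
coherence_docs = [
    ["dominos", "pizza", "order"],
    ["pizza", "delivery", "order"],
    ["netflix", "subscription", "monthly"],
    ["spotify", "subscription", "monthly"],
]
doc_index = build_doc_index(coherence_docs)

food_topic = ["pizza", "order", "dominos"]          # words that belong together
mixed_topic = ["pizza", "subscription", "netflix"]  # words from unrelated themes

print(topic_coherence(food_topic, doc_index))
print(topic_coherence(mixed_topic, doc_index))
```

One way to use such a score, under these assumptions, is to fit topic models for several candidate topic counts and prefer the count whose topics have the highest average coherence.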

Time for Clustering!

We have now fixed the total number of topics/clusters (7 topics in our case). The next step is to assign each transaction description message to a topic. Topic modelling alone might not yield accurate results when assigning a document to a topic.

So we use the output of topic modelling along with a few more features to cluster the transaction description messages using K-Means clustering. In this section, we’ll concentrate on building the feature set for K-Means clustering.


  • Basic Features
    • Word count, digit count, special symbol count
    • Longest digit sequence length, digit-to-character ratio
    • Average and maximum word lengths, etc.
    • Week, day and month of the transaction, whether a date is present, whether it is a weekend transaction, etc.
    • Whether the transaction was performed in the first 5 days or the last 5 days of the month
    • Public holiday and festival transactions, etc.
  • Lookup Features – Top brands in the industry and common nouns are used as lookup names. We count the number of words in the transaction description related to a particular industry.
    • Food: Vegetables, Dominos, FreshDirect, Subway, etc.
    • Sports: Baseball, Adidas, soccer, cleats, etc.
    • Health: Pharmacy, Hospital, Gym, etc.
    • Bill & EMI: Policy, power, statement, schedule, withdrawal, phone, etc.
    • Entertainment: Netflix, Prime shows, Spotify, Soundcloud, Bar, etc.
    • E-Commerce: Amazon, Walmart, eBay, Ticketmaster, etc.
    • Others: Uber, Airbus, packagers, etc.
  • Topic Modelling Features
    • Perform topic modelling on the document-term matrices (DTMs) generated using the TF-IDF measure, once for unigrams and once for bigrams. For each transaction description, this yields 2 sets of 7 topic probabilities: one set from the unigram DTM and one from the bigram DTM.
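The basic and lookup features listed above could be computed along these lines. This is a sketch under assumptions: the feature names, the abridged lookup lists, the sample description and the way the transaction date is passed in are all hypothetical, and holiday or festival flags are left out because they need an external calendar.

```python
import calendar
import re
from datetime import date

# Abridged, hypothetical lookup lists based on the examples in the list above.
LOOKUPS = {
    "food": {"vegetables", "dominos", "freshdirect", "subway"},
    "sports": {"baseball", "adidas", "soccer", "cleats"},
    "entertainment": {"netflix", "spotify", "soundcloud", "bar"},
}


def basic_features(text, txn_date):
    """Character, word and calendar features for one description."""
    words = text.split()
    digits = sum(ch.isdigit() for ch in text)
    letters = sum(ch.isalpha() for ch in text)
    special = sum(not ch.isalnum() and not ch.isspace() for ch in text)
    digit_runs = re.findall(r"\d+", text)
    days_in_month = calendar.monthrange(txn_date.year, txn_date.month)[1]
    return {
        "word_count": len(words),
        "digit_count": digits,
        "special_count": special,
        "longest_digit_run": max((len(run) for run in digit_runs), default=0),
        "digit_char_ratio": digits / letters if letters else 0.0,
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "max_word_len": max(map(len, words), default=0),
        "day_of_week": txn_date.weekday(),  # Monday is 0, Sunday is 6
        "month": txn_date.month,
        "is_weekend": int(txn_date.weekday() >= 5),
        # First 5 or last 5 days of the month.
        "is_month_edge": int(txn_date.day <= 5 or txn_date.day > days_in_month - 5),
    }


def lookup_features(text):
    """Count the words of a description found in each lookup list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        f"lookup_{name}": sum(token in names for token in tokens)
        for name, names in LOOKUPS.items()
    }


sample_text = "POS 4521 DOMINOS PIZZA #12"  # hypothetical description
sample_date = date(2020, 8, 1)              # hypothetical transaction date
features = {**basic_features(sample_text, sample_date), **lookup_features(sample_text)}
print(features)
```

The topic probabilities from the topic model would then be appended to this dictionary to form the full feature vector for each description.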

Final Thoughts

Around 30 features are built for every transaction description, and we perform K-Means clustering to assign each transaction description to one of the 7 clusters.
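As a rough illustration of this clustering step, the sketch below runs K-Means (Lloyd's algorithm) on a handful of toy feature vectors. It is not the pipeline used in the case study: the points have 2 features instead of around 30, `k` is 2 instead of 7, and the deterministic farthest-first initialization is an assumption made so that the example is reproducible.

```python
import math


def init_centers(points, k):
    """Farthest-first initialization: start at the first point, then keep
    adding the point that is farthest from all centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(farthest))
    return centers


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and center update steps."""
    centers = init_centers(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy 2-feature vectors standing in for the real feature set.
feature_vectors = [
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],        # one group of descriptions
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],  # another group
]
cluster_labels, cluster_centers = kmeans(feature_vectors, k=2)
print(cluster_labels)
print(cluster_centers)
```

K-Means relies on Euclidean distance, so features on very different scales (for example, word counts versus topic probabilities) would usually be standardized before clustering.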

Results show that observations near the cluster centers are mostly labeled with the correct topic, while a few observations far from the cluster centers have been assigned the wrong topic label. Out of 350 manually reviewed transaction descriptions, around 240 (~69% accuracy) were labeled with the appropriate topic.
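The accuracy figure follows directly from the review counts, and the observation about distance suggests a simple way to pick which descriptions to review first. The sketch below shows both; the helper names, the toy points and the distance threshold are assumptions for illustration.

```python
import math


def accuracy(n_correct, n_reviewed):
    """Share of reviewed descriptions that received the right label."""
    return n_correct / n_reviewed


def far_from_center(points, labels, centers, threshold):
    """Indices of points whose distance to their own cluster center
    exceeds the threshold; these are candidates for manual review."""
    return [
        i
        for i, (point, label) in enumerate(zip(points, labels))
        if math.dist(point, centers[label]) > threshold
    ]


print(f"{accuracy(240, 350):.1%}")  # prints 68.6%

# Toy example: the second point sits far from its assigned center.
toy_centers = [[0.0, 0.0], [10.0, 10.0]]
toy_points = [[0.0, 1.0], [3.0, 4.0], [10.0, 10.0]]
toy_labels = [0, 0, 1]
review_queue = far_from_center(toy_points, toy_labels, toy_centers, threshold=2.0)
print(review_queue)  # prints [1]
```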

Now we have at least a basic estimate of the in-house customers’ preferences and interests. We can send customized offers and options to keep them engaged and improve business.

Though the approach of using a topic model is relatively novel, using transactions to classify the interests of customers has been in practice for some time, mostly among credit card issuers. For instance, American Express has been using this approach to create interest graphs for its customers. Such interest graphs not only categorize transactions into major groups like food, travel, etc. but also create micro-segments like Thai food fans, wildlife enthusiasts, etc. And all this comes from the wealth of transaction data alone!

About the Author

Ravindra Reddy Tamma – Data Scientist (Actify Data Labs)

Ravi is a machine learning expert at Actify Data Labs. His expertise spans credit risk analytics, application fraud modelling, OCR, text mining and deployment of models as APIs. He has worked extensively with lenders on developing application, behaviour, and collection scorecards.

Ravi has also developed a national-level application fraud model for unsecured lending in India using unstructured credit bureau header information. In addition to credit risk, Ravi has deep expertise in OCR, image analytics and text mining. He also brings deep expertise in automating production data pipelines and deploying machine learning models as scalable APIs.

Related Articles


How Can Technology Help Fight the COVID-19 Pandemic?




Illustration: © IoT For All

As the COVID-19 pandemic continues unfolding, technology solutions and government initiatives are multiplying to help monitor and control the virus’s journey. Their aid includes reducing the load on the health system and reinforcing the efforts of overworked and burned-out healthcare workers.

While smart technologies cannot replace or compensate for public institution measures, they do play a crucial role in emergency responses. Let’s take a look at some promising use cases of how technology can help fight the novel coronavirus outbreak.

Technologies Used for Good

People tend to think of technology as a heartless machine, which is true, but only until it’s used for good. Just look at all the wonderful things we’ve managed to do with its help.

Telemedicine is gaining traction by offering remote patient monitoring and interactive remote doctor’s visits. At the same time, 3D printing and open-source solutions are facilitating the production of more affordable face masks, ventilators, and breathing filters as well as optimizing the supply of the medical equipment. Even more, the pandemic has driven scientists to desperate measures. They are now experimenting with gene editing, synthetic biology, and nanotechnology to develop and test vaccines faster than ever in the history of humanity.

Smart technologies like the Internet of things (IoT), big data, and artificial intelligence (AI) are being massively adopted to help track the disease spread and contagion, manage insurance payments, uphold medical supply chains, and enforce restrictive measures. Let’s go step by step to see how IoT, AI, big data, and mobile solutions are actually enhancing medical care.

IoT for Smart Patient Care Management and Home Automation

IoT has already found its use among healthcare providers. Today, connected patient imaging, health devices or applications, worker solutions, and ambulance programs are being adopted globally. But COVID-19 has pushed the technology into new applications to help the world combat the epidemic. Tracking quarantine, pre-screening and diagnosing, cleaning and disinfecting, using drones in innovative ways, and reducing in-home infections are all “new normals” thanks to IoT.


For example, the American health technology company Kinsa creates smart thermometers that screen and aggregate people’s temperature and symptom data in real time. Having gathered data from over one million connected thermometers, Kinsa rolled out its US HealthWeather™ Map.

The map is updated daily, highlighting how severely the population is being affected by influenza-like illness (ILI). This real-time information helps health authorities see an increase in fevers as an early indicator of the community spread of COVID-19 and streamline the allocation of health resources. These areas are marked in the “Atypical” mode of the map.

To slow down the spread of COVID-19, a team of Seattle engineers created Immutouch, a smart wristband vibrating every time a person wearing it tries to touch their face.


Smart speakers, lights, and security systems are being used to open doors and switch on lights to reduce in-home infections. These gadgets allow people to avoid touching the surfaces of doorknobs, switches, mail, packages, or anything that could easily spread germs.

The Role of Big Data in Fighting Coronavirus

Tapping into big data is a must to develop real-time forecasts and arm healthcare professionals with a profound database to help with decision-making.


IBM Clinical Development system is an advanced Electronic Data Capture (EDC) platform that allows an accelerated delivery of medications to market and reduces the time and cost of clinical trials thanks to cognitive computing, patient data assets, and IoT. Additionally, the U.S. government had been in active talks with Facebook, Google, and others to determine how to use location data to glean insights for combating the COVID-19 pandemic.

Could Mobile Apps be Used to Control the Pandemic?

The COVID-19 pandemic has become a game-changer for the healthcare continuum. Today’s mobile apps are on guard to help patients receive online therapy, get at-home testing, conduct self-checks, and improve mental well-being. Thanks to smartphone apps, it is now possible to trace the virus’s journey and help limit its spread.

Apple COVID-19, for instance, was created in partnership with the Centers for Disease Control and Prevention (CDC), the White House, and the Federal Emergency Management Agency (FEMA). The application contains vital and relevant information from trusted sources on the coronavirus pandemic: hand hygiene practices, social distancing FAQs, quarantine guidelines, self-checking tutorials, and tips on cleaning and disinfecting surfaces. On top of that, it has a screening tool that advises people on what to do when a person has COVID-19 symptoms, has just returned from abroad, or has come into close contact with someone who might be infected with the disease.

Meanwhile, health authorities in Abu Dhabi have created the TraceCovid app for Bluetooth-enabled smartphones to minimize the spread of the disease. The service allows tracing individuals who have come into proximity with a person who tested positive for COVID-19. Thanks to it, medical professionals can react faster and render the necessary healthcare. Germany, in turn, is going to roll out a smartphone app, which will use Bluetooth to alert people if they are close to someone with a confirmed viral infection.


Telemedicine has also proved to be an efficient tool for flattening the curve. The Sheba Medical Centre, the largest Israeli hospital, launched a telehealth program for remote patient monitoring to control the spread of the pandemic. Doctolib (a Franco-German company), Qare (France), Livi (Sweden), Push Doctor (the UK), and Compugroup Medical (Germany) are offering virtual doctors too.

Using AI to Identify, Track and Forecast Outbreaks

Artificial intelligence powered by natural language processing (NLP) and location monitoring is crucial for identifying, tracking, and scanning outbreaks, predicting hotspots, and helping make better decisions.

For example, Microsoft collaborated with the U.S. Centers for Disease Control and Prevention (CDC) to create an AI-based COVID-19 assessment bot to treat patients more effectively and allocate limited resources. The bot, nicknamed Clara, can evaluate symptoms, advise on the next steps to take, and track the users who need urgent care the most.

The Canadian startup BlueDot has applied AI to spot and track the spread of COVID-19 and predict outbreaks, and the Japanese company Bespoke rolled out Bebot, an AI-powered chatbot that was developed specifically for travelers. This mobile app informs and assists them with coronavirus-related questions as they move about.


There’s no doubt that the coronavirus pandemic has become a real-life test for everyone. It has caused tremendous damage, but at the same time, it has forced tech innovators to roll out advanced solutions, and it seems that they don’t plan on slowing down anytime soon.

Healthcare providers across the globe are continually switching to smart technologies. So if you are in the smart technology niche, consider the current trends to steer your business in the right direction.


Chatbots and Intelligent Automation Solutions Paving the Way towards Seamless Business Continuity




Frequent business disruptions in the form of storms, pandemics, lockdowns, etc., pose a risk to seamless operations and revenue generation in service industries. One day of operational disruption can lead to losses worth millions. Semi-automation is not able to stop the cascading effects of an unprecedented business disruption. Services such as banking, financial services, insurance, healthcare, information technology services, etc., cannot afford the risk of downtime. Chatbots powered by Intelligent Automation are an indispensable solution in the omni-channel customer interface, keeping the business moving 24×7 even in the face of a major business disruption such as a prolonged pandemic.

How do Intelligent Automation-powered chatbots offer seamless business continuity?

Chatbots engage diverse skill sets such as Robotic Process Automation (RPA) and Artificial Intelligence (AI) / Machine Learning (ML), in short Intelligent Automation, and offer a lifeline to service industry businesses. Chatbots are located on the key pages of a business website or on the social media pages of the business, and can be accessed by customers and prospects round the clock in different international languages. They augment the services of the regular service desk and help tide over most emergency situations.

Chatbots can handle complex queries, and their functioning depends on the training data set and the streamlined data in the CRM database. All chatbot interactions can be further cleaned, stored in the CRM, and analysed. Based on these interactions at different stages of the customer journey, the chatbots can make intelligent suggestions to the customer during subsequent interactions.

Chatbots offer tremendous business benefits. Their responses are highly accurate and relevant and have a minuscule turnaround time. Timely responses at every step from order booking to bill payment, while taking care of customer preferences, ensure high productivity and thereby generate high revenue even when a business executive is not able to interact directly with a customer.

In conclusion:

A chatbot solution powered by Intelligent Automation is an indispensable tool in the omni-channel customer service desk of a service industry business. It helps keep the business up and running even when customer executives are not able to interact directly with the customer due to unprecedented business disruptions. Chatbot solutions thereby enable businesses to stay up and functioning at all times in a 24x7x365 scenario.

How Hazelcast hopes to make digital transformation mainstream




Commentary: Even as the coronavirus pandemic has hastened digital transformation efforts, success remains elusive for many companies. This one-stop shop to digital transformation might help.

Image: metamorworks, Getty Images/iStockphoto

It’s no secret that, as CircleCI CEO Jim Rose put it, “The pandemic has compressed the time[line]” for digital transformation. What is perhaps surprising is just how broadly and deeply that transformation is spreading. In an interview, Hazelcast chief product officer David Brimley stressed that while Fortune 500 e-commerce and finance companies have historically paid the bills for Hazelcast, provider of an open source in-memory data grid (IMDG), mid-sized enterprises “are coming to us and saying, we want to start digitizing and [adding digital] channels for our business.”

How they get there, and how fast, is the question. 

A one-stop shop to digital transformation 

As keen as companies are to move workloads to the cloud to facilitate digital transformation, not all companies are alike in their readiness, Brimley said. In particular, these mid-sized enterprises may lack the personnel or other resources to push aggressively into the cloud, whatever their intentions. As such, he said, many companies are trying to figure out “the quickest way I can get the applications and hardware I’ve got today in my own data centers and add a digital channel on the top of it as quickly as I can.” 

No PhD in Computer Geekery required.

By pairing Hazelcast IMDG for distributed coordination and in-memory data storage with Hazelcast Jet for building streaming data pipelines, Brimley said, organizations can build digital integration hubs without having the technical chops of a Netflix or Facebook. “There are a lot of companies that can’t make head nor tail of this plethora of Cloud Native Computing Foundation products [Kubernetes, Envoy, Fluentd, etc.], and they just want to stand up a Java process, have it clustered together, have some way of running their ‘microservices’ on this Java cluster, and off they go.”

Once, a company (and open source project) like Hazelcast would have had to pitch themselves to banks and credit card companies for low-latency, high-performance distributed systems; these were the types of organizations that valued IMDGs. Today, however, such concerns span a much broader range of companies, particularly with this crushing need to achieve digital transformation.  

Brimley and Hazelcast are not pitching the company as a database or as any particular technology. Even the IMDG label might not fit particularly well. After all, the company isn’t positioning itself as being about technology, per se, but rather about solving business problems; about how developers can use Hazelcast to capture “interesting new architectural patterns,” in Brimley’s words. It’s taking on the “I need to embrace an event-driven architecture” crowd, and not selling a data cache or, yes, even an in-memory data grid.

Disclosure: I work for AWS, but these are my views and don’t reflect those of my employer.
