

Duena Blomstrom





Duena is the author of the book “Emotional Banking: Fixing Culture, Leveraging FinTech and Transforming Retail Banks into Brands”. She is a serial entrepreneur and intrapreneur, a mentor for start-ups, a LinkedIn Top Voice, a Forbes contributor, a blogger with a cutting-edge, unconventional and unapologetic opinion style, and an international keynote speaker at industry events who is named an industry influencer in most lists. She is the inventor of the EmotionalBanking™ and MoneyMoments concepts and the Co-Founder and CEO of PeopleNotTech Ltd, a software solution provider revolutionizing the organization of companies that need to use technology and new ways of work, with a focus on Psychological Safety in high-performing teams.

Over the past 18 years, Duena has worked with multiple large organizations, in particular banks, whether to assist them with their digital strategy or to help them transform. With a background in Psychology as well as Business, Duena is on a crusade to see lasting, meaningful and profound change in the way organizations work. In the case of banks, this means she is intensely passionate about getting them to embrace the concept of “Emotional Banking” (how to redesign around Money Moments, not financial products). She chiefly works with organizations on deep cultural change programs that enable them to become truly technology-driven, leverage Agile and put Experience at the heart of the proposition, building strong brands that deliver by focusing on their people.

Duena delivers different, entertaining, engaging and thought-provoking keynotes internationally and occasionally leads panels and fireside chats. Topics range from FinTech, Digital, CX and Innovation to the bigger themes of the Future of Work, Technology, Psychological Safety and Agile as they are reflected in Employee Engagement and Organizational Psychology and Design. She has a strong following of 200k+ subscribers and followers on Twitter and LinkedIn.




“Duena always has new, fresh and thought provoking content!”

“We had Duena as our Keynote Speaker twice – there was never a bored face in the audience and the feedback reflected that”

“Duena was a keynote speaker at our annual conference and our audience loved her! Both Duena and her team are the utmost professionals and it was a pleasure to deal with them for the preparation of her visit.”

“Both chairing a panel and as a keynote speaker, Duena has delivered. Delegates came interested and left fascinated. Looking forward to working together again.”

“Our board had executive briefing fatigue and we searched for the most innovative content and most comprehensive expertise packaged in a dynamic way which is what we found with Duena Blomstrom’s presentation.”



  • “How to put Emotion back in Banking and create true Money Moments™”
  • “Emotional Banking™- Financial institutions need to investigate customers’ feelings about their money and they need to do it NOW”
  • “FinTech – not the cure-all it says on the box”
  • “Everyone is a Challenger when you’re a bank” – GAFA, digital experiences, Open banking and the unhappy consumer


  • “#VUCA is here: will you bend and win or break and vanish?”
  • “Psychological Safety – Silicon Valley’s Secret Sauce”
  • “Team = Family. Psychologically Safe teams can make magic and move mountains together.”
  • “Servant Leadership and The Emotionally Intelligent Leader”
  • “What are you afraid of? – Impression Management and the fear of appearing incompetent, intrusive, negative or ignorant stops us from being courageous and speaking up at work”
  • “DevOps is the future of HR”
  • “Wake up or rest in peace” – Companies winning today use technology and work in different new ways that make them money and their employees happy. Everyone else is still in the 60s structure of paper-pushing offices. Change or face extinction.


  • “You can’t have the WoW without the WoT – Succeed with Agile only if you change mentalities”
  • “Agile as a religion not a re-org exercise”
  • “Technology is meaningless and customer-centricity dead without Agile”
  • “Digital Excellence and Innovation – It’s all about the People, not the Tech”
  • “Agile from the heart not a consultancy PowerPoint”
  • “Knowledge, Passion, and Courage – Sine qua non conditions to utilizing Tech and making your Digital customers happy”
  • “Time to ditch Agile ‘transformations’”




FUTURE OF WORK: SIBOS Future of Work Panel:



LUXEMBOURG: It’s all about the people, not the technology

Fireside chat about People and Security:

Interview about Emotional Banking:



BBC Interview: Breaking Banks: Minute 19:50 to 43:23 –

To have Duena delight your audience email BOOKINGS@DUENABLOMSTROM.COM


Big Data

Machine learning and its impact in various facets of healthcare




By Inderpreet Kambo

Data in healthcare settings is used for generating day-to-day insights as well as for continuous improvement of the overall healthcare system. Mandatory practices such as Electronic Health Records (EHR) have already improved traditional healthcare processes by incorporating big data to perform state-of-the-art data analytics. AI/ML tools are destined to add further value to this realm.

From analyzing radiographs, to identifying tissue abnormalities, to improving the accuracy of stroke prediction based on clinical signs, to supporting the clinical decisions of family practitioners and internists at the bedside, machine learning is providing the much-needed objective opinion to improve efficiency, reliability, and accuracy in the healthcare flow.

Predictive algorithms and machine learning models are only as good as the training data behind them. As more data becomes available, there will be better information with which to build these models. Hence much of the initial machine learning success has come from large organizations with big datasets that they are able to harness: Google, Aurora Healthcare and NIH of England, to name a few.

These firms, with their massive amounts of data, facilitate the development of centers of excellence (CoEs) that produce unique and powerful machine learning algorithms. Having centralized and synchronous data repositories enables deployment of these algorithms across use cases in healthcare. Below are some of the major use cases, although the list is not comprehensive, that outline the deployment and use of machine learning in healthcare.

Personalized Medicine

Medicine has come a long way, from a generalized broad-spectrum antibiotic treatment approach to an approach to disease treatment and prevention that takes into account each person’s individual variability in genes, environment, and lifestyle.

InsightRX, for example, combines quantitative pharmacology with machine learning to provide a customized, individualized view of a patient’s response to various treatments. By combining clinical, pharmaceutical, and socioeconomic data with machine learning algorithms, researchers and providers are able to observe patterns in the effectiveness of particular treatments and identify the genetic variations that may be correlated with success or failure.

Healthcare Management

From automating routine front-office tasks and reporting to analyzing pharmaceutical marketing research, machine learning is making strides in multiple areas of operations and management. Founded in 2010, LeanTaaS is using machine learning to optimize hospital resources such as waiting times and operating rooms. With Goldman Sachs leading the investment backing, the company has raised over 100 million dollars in funding and is one of the pioneers in using machine learning to improve healthcare management.

Machine learning guided diagnosis

Data scientists working at Google have developed machine-learning algorithms to detect breast cancer by training them to differentiate cancer patterns from otherwise healthy surrounding tissue. The algorithm was fed vast amounts of data and trained to differentiate abnormal tissue patterns from normal surrounding cells.

Studies show that this combination of machine learning, predictive analytics and pattern recognition technology has been adjudged to have over 89 percent accuracy, compared with below 75 percent for trained pathologists and medical radiologists.

Research and Development

Pharmaceutical firms and healthcare organizations have been spending billions of dollars on R&D to identify factors affecting patients’ responses and improve healthcare outcomes. However, machine learning has revolutionized research by using these factors, inter alia, to identify which patients will have better outcomes than others. From enabling early cancer detection to identifying COVID-19 patients who require ventilator support, machine learning is enhancing outcome-based research across the various facets of healthcare R&D.


AI and machine learning continue to grow in the healthcare industry alongside ever-evolving technological advancements. There are more healthcare-focused startups deploying machine learning than ever before. However, machine-learning models have not been implemented to the same extent in healthcare as they have been in other verticals. Firstly, machine learning is a recent technology and is still far from perfect.

Whether it’s FDA, ICMR or EMA approval, it is a long, arduous and expensive process to test, validate and approve the technology in a healthcare setting. Secondly, data privacy and security are among the biggest barriers to machine learning adoption in healthcare. In the healthcare industry, technologies and systems must be developed so that they comply with the respective data laws and the rules of governing organizations.

Despite the multitude of challenges, a breakthrough in healthcare delivery is much needed. With aging populations in various countries, a diminishing physician-to-patient ratio and higher scrutiny of diagnostic accuracy, the need for new, innovative solutions in healthcare is clear and explicit. The best opportunities for AI in healthcare are where clinicians are supported in diagnosis, treatment planning, and identifying risk factors, but where physicians retain ultimate responsibility for the patient’s care.



Big Data

Huobi announces the establishment of Huobi DeFi Labs





Huobi DeFi Labs is the platform for DeFi (Decentralized Finance) research, investment, incubation and ecosystem building in the DeFi space. It aims to build a better financial system for the future in collaboration with the global crypto and DeFi community.

“Huobi as the leading crypto financial services provider in Asia and worldwide, our mission is to provide the best crypto financial products and services to our users regardless it is CeFi or DeFi,” said Leon Li, founder and CEO of Huobi Group. “We are excited to join as a part of the global DeFi ecosystem and will be very honoured to work with the global community to provide the best support possible.”

The DeFi initiatives will be led by Huobi’s Chief Investment Officer Sharlyn Wu, a Wall Street veteran formerly of UBS who has also led blockchain investment at China Merchant Bank International.

“Over the past two years, we have witnessed the birth and exponential growth of DeFi. The width, depth and speed of innovations are unparalleled in human history. It is exciting to see the power of permissionless economy unleashed at global scale. However, there are still many problems to be solved at theoretical and technical level,” said Sharlyn Wu, Huobi’s Chief Investment Officer. “There is also a lot of investor education to do in order to bring crypto and DeFi to mainstream users. As DeFi is still in its infancy, it needs collective efforts from the global community to build and grow the space together.”

Huobi Group will allocate tens of millions of dollars to an initial investment fund, which will be managed by Huobi DeFi Labs. The team consists of 4 research and investment professionals initially.

The DeFi Labs will be focused on the following three areas:

– Research of underlying financial theories and technology

– Investment and incubation of DeFi projects

– Work with the best DeFi projects to service the entire ecosystem

DeFi and CeFi to Collectively Change the Landscapes of Traditional Finance

Sharlyn Wu explains why Huobi Group invests in the DeFi space and describes Huobi DeFi Labs’ mission:

  • DeFi brings many benefits, including transparency and composability, which will take the efficiency and governance of finance to the next level. More importantly, for the first time ever, it is possible to create a financial system without credit risk and principal-agent risk.
  • This system can provide people with trust, safety and certainty, which are not present in our society today. When financial institutions, which are professionals at pricing risk, come in and look at the risk parameters, they will see that this trustless model deserves better pricing because it removes the risks and uncertainties caused by human behavior.
  • This will also hugely benefit average users in the ecosystem, as every user, regardless of where they are, can tap into the global liquidity pool and all the financial products worldwide through their mobile.
  • Crypto is a perfect system for finance. As blockchain technology is optimized over time, DeFi and CeFi will collectively change the landscape of traditional finance and serve the use cases they are best suited for. Huobi strives to work with the entire crypto and DeFi ecosystem to reshape the global financial system.
  • Crypto will disrupt finance as the internet has changed many other industries. Today, every business operates with its own ledger. Society operates at huge cost because of account reconciliation and monopolies that shut out the long tail. The power of millions of ledgers merging into one will make every user, asset and piece of data accessible to the entire ecosystem at literally zero cost.




How To Use Machine Learning Models To Predict Loan Eligibility





Mridul Bhandari (@mridulrb)

Hi there! 👋 I’m a ☁️ Developer Advocate 👩‍💻👨‍💻from 👁️🐝Ⓜ️

Build predictive models to automate the process of targeting the right applicants.

GitHub Repository


Loans are the core business of banks, and the main profit comes directly from loan interest. Loan companies grant a loan after an intensive process of verification and validation. However, they still have no assurance that the applicant will be able to repay the loan without difficulty.

In this tutorial, we’ll build a predictive model to predict if an applicant is able to repay the lending company or not. We will prepare the data using Jupyter Notebook and use various models to predict the target variable.

Table of Contents

1. Getting the system ready and loading the data

2. Understanding the data

3. Exploratory Data Analysis (EDA)

i. Univariate Analysis

ii. Bivariate Analysis

4. Missing value and outlier treatment

5. Evaluation Metrics for classification problems

6. Model Building: Part 1

7. Logistic Regression using stratified k-folds cross-validation

8. Feature Engineering

9. Model Building: Part 2

i. Logistic Regression

ii. Decision Tree

iii. Random Forest

iv. XGBoost

Getting the system ready and loading the data

We will be using Python for this course along with the below-listed libraries.



For this problem, we have three CSV files: train, test, and sample submission.

Train file will be used for training the model, i.e. our model will learn from this file. It contains all the independent variables and the target variable.

Test file contains all the independent variables, but not the target variable. We will apply the model to predict the target variable for the test data.

Sample submission file contains the format in which we have to submit our predictions.

Reading data

train = pd.read_csv('Dataset/train.csv')
test = pd.read_csv('Dataset/test.csv')

Understanding the data

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'], dtype='object')

We have 12 independent variables and 1 target variable, i.e. Loan_Status in the training dataset.

Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area'], dtype='object')

We have similar features in the test dataset as the training dataset except for the Loan_Status. We will predict the Loan_Status using the model built using the train data.

Loan_ID object
Gender object
Married object
Dependents object
Education object
Self_Employed object
ApplicantIncome int64
CoapplicantIncome float64
LoanAmount float64
Loan_Amount_Term float64
Credit_History float64
Property_Area object
Loan_Status object
dtype: object

We can see there are three formats of data types:

object: Object format means variables are categorical. Categorical variables in our dataset are Loan_ID, Gender, Married, Dependents, Education, Self_Employed, Property_Area, Loan_Status.

int64: It represents the integer variables. ApplicantIncome is of this format.

float64: It represents the variable that has some decimal values involved. They are also numerical.

(614, 13)

We have 614 rows and 13 columns in the train dataset.

(367, 12)

We have 367 rows and 12 columns in test dataset.
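The column lists, data types and shapes shown above are the kind of output produced by the `columns`, `dtypes` and `shape` attributes of a pandas DataFrame. The following is a minimal sketch on a tiny stand-in frame; the column names mirror the tutorial, but the frame `df` and its values are made up for illustration and are not the real dataset.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values, not the tutorial's data)
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002", "LP003"],
    "ApplicantIncome": [5849, 4583, 3000],
    "LoanAmount": [128.0, 66.0, None],   # None becomes NaN in a float column
    "Loan_Status": ["Y", "N", "Y"],
})

print(df.columns)  # column labels
print(df.dtypes)   # one data type per column
print(df.shape)    # (number of rows, number of columns)
```

On the real data, the same three attributes would be read from `train` and `test`.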

train['Loan_Status'].value_counts()

Y 422
N 192
Name: Loan_Status, dtype: int64

Setting normalize=True prints proportions instead of counts.

train['Loan_Status'].value_counts(normalize=True)

Y 0.687296
N 0.312704
Name: Loan_Status, dtype: float64

The loans of 422 (around 69%) of the 614 applicants were approved.

Now, let’s visualize each variable separately. The variables fall into three types: categorical, ordinal, and numerical.

  • Categorical features: These features have categories (Gender, Married, Self_Employed, Credit_History, Loan_Status)
  • Ordinal features: Categorical variables that have some order involved (Dependents, Education, Property_Area)
  • Numerical features: These features have numerical values (ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term)

Independent Variable (Categorical)

train['Gender'].value_counts(normalize=True).plot.bar(title='Gender')

It can be inferred from the above bar plots that:

  • 80% of applicants in the dataset are male.
  • Around 65% of the applicants in the dataset are married.
  • Around 15% of applicants in the dataset are self-employed.
  • Around 85% of applicants have repaid their debts.

Independent Variable (Ordinal)

train['Dependents'].value_counts(normalize=True).plot.bar(title='Dependents')

The following inferences can be made from the above bar plots:

  • Most of the applicants don’t have any dependents.
  • Around 80% of the applicants are Graduate.
  • Most of the applicants are from the Semiurban area.

Independent Variable (Numerical)

So far we have seen the categorical and ordinal variables. Now let’s visualize the numerical variables, starting with the distribution of applicant income.


It can be inferred that most of the data in the distribution of applicant income is towards the left, which means it is not normally distributed. We will try to make it normal in later sections, as many algorithms work better if the data is normally distributed.

The boxplot confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in the society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education.

train.boxplot(column='ApplicantIncome', by='Education')
plt.suptitle("")

We can see that there is a higher number of graduates with very high incomes, and these appear to be the outliers.

Let’s look at the Co-applicant income distribution.


We see a distribution similar to that of the applicant’s income. The majority of co-applicants’ incomes range from 0 to 5000. We also see a lot of outliers in the co-applicant’s income, and it is not normally distributed.

For the LoanAmount variable, we see a lot of outliers and the distribution is fairly normal. We will treat the outliers in later sections.

Bivariate Analysis

Let’s recall some of the hypotheses that we generated earlier:

  • Applicants with high incomes should have a higher chance of loan approval.
  • Applicants who have repaid their previous debts should have a higher chance of loan approval.
  • Loan approval should also depend on the loan amount. If the loan amount is low, the chance of loan approval should be high.
  • The lower the monthly amount to be paid to repay the loan, the higher the chance of loan approval.

Let’s try to test the above-mentioned hypotheses using bivariate analysis.

After looking at every variable individually in univariate analysis, we will now explore them again with respect to the target variable.

Categorical Independent Variable vs Target Variable

First of all, we will find the relation between the target variable and categorical independent variables. Let us look at the stacked bar plot now which will give us the proportion of approved and unapproved loans.

Gender = pd.crosstab(train['Gender'], train['Loan_Status'])
Gender.div(Gender.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(4,4))

It can be inferred that the proportion of male and female applicants is more or less the same for both approved and unapproved loans.
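The stacked bars rely on turning a table of counts into row-wise proportions. Here is a small sketch of that step on made-up data; the frame `toy` and its six rows are hypothetical, and the use of `pd.crosstab` to build the count table is an assumption about how the tables in this section were produced.

```python
import pandas as pd

# Hypothetical mini dataset: six applicants
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y", "Y", "N", "Y"],
})

# Count applicants for every (Gender, Loan_Status) pair
counts = pd.crosstab(toy["Gender"], toy["Loan_Status"])

# Divide each row by its own total so that every row sums to 1
proportions = counts.div(counts.sum(axis=1).astype(float), axis=0)
print(proportions)

# proportions.plot(kind="bar", stacked=True) would draw the stacked bars
# (requires matplotlib).
```

Because each row is normalized independently, bars of equal height can be compared directly even when one group has far more applicants than the other.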

Now let us visualize the remaining categorical variables vs target variable.

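Married = pd.crosstab(train['Married'], train['Loan_Status'])
Dependents = pd.crosstab(train['Dependents'], train['Loan_Status'])
Education = pd.crosstab(train['Education'], train['Loan_Status'])
Self_Employed = pd.crosstab(train['Self_Employed'], train['Loan_Status'])
Married.div(Married.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(4,4))
Dependents.div(Dependents.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(4,4))
Education.div(Education.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(4,4))
Self_Employed.div(Self_Employed.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(4,4))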
  • The proportion of married applicants is higher for approved loans.
  • Distribution of applicants with 1 or 3+ dependents is similar across both the categories of Loan_Status.
  • There is nothing significant we can infer from Self_Employed vs Loan_Status plot.

Now we will look at the relationship between remaining categorical independent variables and Loan_Status.

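Credit_History = pd.crosstab(train['Credit_History'], train['Loan_Status'])
Property_Area = pd.crosstab(train['Property_Area'], train['Loan_Status'])
Credit_History.div(Credit_History.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True, figsize=(4,4))
Property_Area.div(Property_Area.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True)

  • It seems people with a credit history of 1 are more likely to get their loans approved.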
  • The proportion of loans getting approved in the semi-urban area is higher as compared to that in rural or urban areas.

Now let’s visualize numerical independent variables with respect to the target variable.

Numerical Independent Variable vs Target Variable

We will try to find the mean income of people whose loan has been approved vs the mean income of people whose loan has not been approved.


Here the y-axis represents the mean applicant income. We don’t see any change in the mean income. So, let’s make bins for the applicant income variable based on the values in it and analyze the corresponding loan status for each bin.

group = ['Low', 'Average', 'High', 'Very high']
Income_bin.div(Income_bin.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True)

It can be inferred that the applicant’s income does not affect the chances of loan approval, which contradicts our hypothesis that a high applicant income increases the chances of loan approval.
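The binning step itself can be sketched with `pd.cut`, which assigns each income to a labeled interval. In the sketch below, the bin edges in `bins` and the six rows of `toy` are hypothetical, chosen only to illustrate the technique; the labels in `group` and the column name `Income_bin` follow the tutorial.

```python
import pandas as pd

# Hypothetical incomes and outcomes, for illustration only
toy = pd.DataFrame({
    "ApplicantIncome": [1800, 2600, 3900, 5200, 7500, 12000],
    "Loan_Status": ["N", "Y", "Y", "N", "Y", "Y"],
})

bins = [0, 2500, 4000, 6000, 81000]  # assumed bin edges, not taken from the article
group = ["Low", "Average", "High", "Very high"]

# Each income falls into the interval (left edge, right edge]
toy["Income_bin"] = pd.cut(toy["ApplicantIncome"], bins, labels=group)

# Approval proportions within each income bin
Income_bin = pd.crosstab(toy["Income_bin"], toy["Loan_Status"])
share = Income_bin.div(Income_bin.sum(axis=1).astype(float), axis=0)
print(share)
```

The same pattern would apply to the co-applicant income, total income and loan amount bins, each with its own edges.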

We will analyze the co-applicant income and loan amount variable in a similar manner.

Coapplicant_Income_bin.div(Coapplicant_Income_bin.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True)

It shows that if the co-applicant’s income is lower, the chances of loan approval are higher. But this does not look right. A possible reason is that most applicants don’t have a co-applicant, so the co-applicant income for such applicants is 0 and loan approval does not depend on it. So, we can make a new variable that combines the applicant’s and co-applicant’s incomes to visualize the combined effect of income on loan approval.

Let us combine the Applicant Income and Co-applicant Income and see the combined effect of Total Income on the Loan_Status.

group = ['Low', 'Average', 'High', 'Very high']
Total_Income_bin.div(Total_Income_bin.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True)

We can see that the proportion of loans approved for applicants with low Total_Income is much lower than for applicants with Average, High and Very High income.

Let’s visualize the Loan Amount variable.

LoanAmount_bin.div(LoanAmount_bin.sum(axis=1).astype(float), axis=0).plot(kind='bar', stacked=True)

It can be seen that the proportion of approved loans is higher for Low and Average Loan Amount as compared to that of High Loan Amount which supports our hypothesis in which we considered that the chances of loan approval will be high when the loan amount is less.

Let’s drop the bins which we created for the exploration part. We will change the 3+ in the Dependents variable to 3 to make it a numerical variable. We will also convert the target variable’s categories into 0 and 1 so that we can find its correlation with numerical variables. One more reason to do so is that a few models, like logistic regression, take only numeric values as input. We will replace N with 0 and Y with 1.

train = train.drop(['Income_bin', 'Coapplicant_Income_bin', 'LoanAmount_bin', 'Total_Income_bin', 'Total_Income'], axis=1)
# Assign the result back instead of calling replace(..., inplace=True) on a column
train['Dependents'] = train['Dependents'].replace('3+', 3)
test['Dependents'] = test['Dependents'].replace('3+', 3)
# Encode the target: N becomes 0, Y becomes 1
train['Loan_Status'] = train['Loan_Status'].map({'N': 0, 'Y': 1})

Now let’s look at the correlation between all the numerical variables. We will use a heat map to visualize the correlation. Heatmaps visualize data through variations in coloring. Variables shown in a darker color are more strongly correlated.

matrix = train.corr(numeric_only=True)  # skip non-numeric columns; recent pandas versions raise an error otherwise
f, ax = plt.subplots(figsize=(9,6))
sns.heatmap(matrix, vmax=.8, square=True, cmap='BuPu', annot=True)

We see that the most correlated variables are (ApplicantIncome and LoanAmount) and (Credit_History and Loan_Status). LoanAmount is also correlated with CoapplicantIncome.

Missing value imputation

Let’s list out feature-wise count of missing values.

Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64

There are missing values in Gender, Married, Dependents, Self_Employed, LoanAmount, Loan_Amount_Term, and Credit_History features.

We will treat the missing values in all the features one by one.

We can consider these methods to fill the missing values:

  • For numerical variables: imputation using mean or median
  • For categorical variables: imputation using mode

There are very few missing values in Gender, Married, Dependents, Credit_History, and Self_Employed features so we can fill them using the mode of the features.

train['Gender'] = train['Gender'].fillna(train['Gender'].mode()[0])
train['Married'] = train['Married'].fillna(train['Married'].mode()[0])
train['Dependents'] = train['Dependents'].fillna(train['Dependents'].mode()[0])
train['Self_Employed'] = train['Self_Employed'].fillna(train['Self_Employed'].mode()[0])
train['Credit_History'] = train['Credit_History'].fillna(train['Credit_History'].mode()[0])

Now let’s try to find a way to fill the missing values in Loan_Amount_Term. We will look at the value count of the Loan amount term variable.

train['Loan_Amount_Term'].value_counts()

360.0 512
180.0 44
480.0 15
300.0 13
84.0 4
240.0 4
120.0 3
36.0 2
60.0 2
12.0 1
Name: Loan_Amount_Term, dtype: int64

It can be seen that in the loan amount term variable, the value of 360 is repeating the most. So we will replace the missing values in this variable using the mode of this variable.

train['Loan_Amount_Term'] = train['Loan_Amount_Term'].fillna(train['Loan_Amount_Term'].mode()[0])

Now we will see the LoanAmount variable. As it is a numerical variable, we can use mean or median to impute the missing values. We will use the median to fill the null values as earlier we saw that the loan amount has outliers so the mean will not be the proper approach as it is highly affected by the presence of outliers.

train['LoanAmount'] = train['LoanAmount'].fillna(train['LoanAmount'].median())

Now let’s check whether all the missing values are filled in the dataset.

train.isnull().sum()

Loan_ID 0
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
dtype: int64

As we can see, all the missing values have been filled in the train dataset. Let’s fill all the missing values in the test dataset too with the same approach.

# Fill the test set with statistics computed on the training set
test['Gender'] = test['Gender'].fillna(train['Gender'].mode()[0])
test['Married'] = test['Married'].fillna(train['Married'].mode()[0])
test['Dependents'] = test['Dependents'].fillna(train['Dependents'].mode()[0])
test['Self_Employed'] = test['Self_Employed'].fillna(train['Self_Employed'].mode()[0])
test['Credit_History'] = test['Credit_History'].fillna(train['Credit_History'].mode()[0])
test['Loan_Amount_Term'] = test['Loan_Amount_Term'].fillna(train['Loan_Amount_Term'].mode()[0])
test['LoanAmount'] = test['LoanAmount'].fillna(train['LoanAmount'].median())

Outlier Treatment

As we saw earlier in univariate analysis, LoanAmount contains outliers so we have to treat them as the presence of outliers affects the distribution of the data. Let’s examine what can happen to a data set with outliers. For the sample data set:


We find the following: mean, median, mode, and standard deviation

Mean = 2.58

Median = 2.5


Standard Deviation = 1.08

If we add an outlier to the data set:


The new values of our statistics are:

Mean = 35.38

Median = 2.5


Standard Deviation = 114.74
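The same effect can be reproduced with Python’s standard `statistics` module. This is only a sketch: the values in `sample` and the outlier of 300 are hypothetical and are not the data set behind the figures quoted above.

```python
import statistics

# Hypothetical sample (not the article's data) and the same sample plus one outlier
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5]
with_outlier = sample + [300]

for name, data in [("without outlier", sample), ("with outlier", with_outlier)]:
    print(
        name,
        "mean =", round(statistics.mean(data), 2),
        "median =", statistics.median(data),
        "stdev =", round(statistics.stdev(data), 2),
    )
```

In this toy case the median stays at 3 while the mean moves from 3 to 32.7, which is why the median is the more robust choice when outliers are present.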

It can be seen that outliers often have a significant effect on the mean and standard deviation, and hence on the distribution. We must take steps to remove outliers from our data sets. Because of these outliers, the bulk of the data in the loan amount is at the left and the right tail is longer. This is called right skewness. One way to remove the skewness is a log transformation. The log transformation does not affect the smaller values much but reduces the larger values, so we get a distribution similar to a normal distribution. Let’s visualize the effect of the log transformation. We will make similar changes to the test file simultaneously.
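A minimal sketch of the log transformation, assuming NumPy’s natural log is used: the array `loan_amount` holds made-up values, and the `skew` helper is a hypothetical function written here only to quantify the change in skewness.

```python
import numpy as np

# Made-up loan amounts with one large value in the right tail
loan_amount = np.array([66.0, 100.0, 120.0, 128.0, 141.0, 267.0, 700.0])
loan_amount_log = np.log(loan_amount)  # compresses large values far more than small ones


def skew(values):
    """Sample skewness: third central moment divided by variance ** 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


print("skew before:", skew(loan_amount))
print("skew after: ", skew(loan_amount_log))
```

On the real data, the transformed values would typically be stored in a new column of both `train` and `test` so that the model sees the same feature in each.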


Now the distribution looks much closer to normal and the effect of extreme values has subsided significantly. Let’s build a logistic regression model and make predictions for the test dataset.

Model Building: Part I

Let us make our first model to predict the target variable. We will start with Logistic Regression, which is used for predicting a binary outcome.

  • Logistic Regression is a classification algorithm. It is used to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables.
  • Logistic regression is an estimation of the logit function. The logit function is simply the log of the odds in favor of the event.
  • This function creates an S-shaped curve with the probability estimate, which is very similar to the required stepwise function.
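The relationship between the log of odds and the S-shaped curve can be sketched in a few lines. The function names `logit` and `sigmoid` are chosen here for illustration; the sigmoid is the inverse of the logit and maps any real number back to a probability between 0 and 1.

```python
import math


def logit(p):
    """Log of the odds in favor of an event with probability p (0 < p < 1)."""
    return math.log(p / (1 - p))


def sigmoid(z):
    """Inverse of the logit: maps a real number z to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))


for z in (-4, -2, 0, 2, 4):
    # The output rises smoothly from near 0 to near 1, tracing the S-shape
    print(z, round(sigmoid(z), 3))
```

A logistic regression model estimates the logit as a linear combination of the independent variables, then applies the sigmoid to turn it into a probability.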

To learn further on logistic regression, refer to this article:

Let’s drop the Loan_ID variable as it does not have any effect on the loan status. We will do the same changes to the test dataset which we did for the training dataset.


We will use scikit-learn (sklearn), an open source library for Python, to build the different models. It is one of the most efficient tools and contains many built-in functions that can be used for modeling in Python.

To learn further about sklearn, refer here:

Sklearn requires the target variable in a separate dataset. So, we will drop our target variable from the training dataset and save it in another dataset.

X = train.drop('Loan_Status', axis=1)
y = train.Loan_Status

Now we will make dummy variables for the categorical variables. The dummy variable turns categorical variables into a series of 0 and 1, making them a lot easier to quantify and compare. Let us understand the process of dummies first:

  • Consider the “Gender” variable. It has two classes, Male and Female.
  • As logistic regression takes only the numerical values as input, we have to change male and female into a numerical value.
  • Once we apply dummies to this variable, it will convert the “Gender” variable into two variables (Gender_Male and Gender_Female), one for each class, i.e. Male and Female.
  • Gender_Male will have a value of 0 if the gender is Female and a value of 1 if the gender is Male.

X = pd.get_dummies(X)
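To see what this does, here is a minimal sketch on a toy frame. The rows are made up for illustration. Only the Gender column is categorical, so only it is expanded.

```python
import pandas as pd

# Toy frame, made up for illustration; only Gender is categorical.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "ApplicantIncome": [5000, 3000, 4000],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

The numeric column is kept as it is and Gender is replaced by Gender_Female and Gender_Male. Depending on the pandas version, the dummy columns print as True/False or as 1/0.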

Now we will train the model on the training dataset and make predictions for the test dataset. But can we validate these predictions? One way is to divide our train dataset into two parts: train and validation. We can train the model on the training part and use it to make predictions for the validation part. In this way we can validate our predictions, as we have the true labels for the validation part (which we do not have for the test dataset).

We will use the train_test_split function from sklearn to divide our train dataset. So, first, let us import train_test_split.

from sklearn.model_selection import train_test_split
x_train, x_cv, y_train, y_cv = train_test_split(X,y, test_size=0.3)

The dataset has been divided into training and validation parts. Let us import LogisticRegression and accuracy_score from sklearn and fit the logistic regression model.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(x_train, y_train)

LogisticRegression()

Here the C parameter (1.0 by default) represents the inverse of regularization strength. Regularization applies a penalty to large parameter values in order to reduce overfitting. Smaller values of C specify stronger regularization. To learn about other parameters, refer here: http://scikit-
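The effect of C can be seen in a small sketch on synthetic data. The data comes from `make_classification` and is not the loan dataset, and the three C values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, generated only for this illustration.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

norms = []
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_demo, y_demo)
    # Size of the fitted coefficient vector for this value of C
    norms.append(float(np.linalg.norm(clf.coef_)))
    print("C =", C, "coefficient norm =", round(norms[-1], 3))
```

With the smallest C the penalty is strongest, so the coefficients are pulled closest to zero.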

Let’s predict the Loan_Status for the validation set and calculate its accuracy.

pred_cv = model.predict(x_cv)
accuracy_score(y_cv,pred_cv)

0.7891891891891892

So our predictions are about 79% accurate, i.e. we have identified the loan status correctly for about 79% of the validation set.

Let’s make predictions for the test dataset.

pred_test = model.predict(test)

Let’s import the submission file which we have to submit on the solution checker.

submission = pd.read_csv('Dataset/sample_submission.csv')

We only need the Loan_ID and the corresponding Loan_Status for the final submission. We will fill these columns with the Loan_ID of the test dataset and the predictions that we made, i.e., pred_test, respectively.


Remember we need predictions in Y and N. So let’s convert 1 and 0 to Y and N.

submission['Loan_Status'] = submission['Loan_Status'].replace({0: 'N', 1: 'Y'})

Finally, we will convert the submission to .csv format.

pd.DataFrame(submission, columns=['Loan_ID','Loan_Status']).to_csv('Output/logistic.csv')

Logistic Regression using stratified k-folds cross-validation

To check how robust our model is to unseen data, we can use validation. It is a technique that involves reserving a particular sample of the dataset on which you do not train the model. Later, you test your model on this sample before finalizing it. Some of the common methods for validation are listed below:

  • The validation set approach
  • k-fold cross-validation
  • Leave one out cross-validation (LOOCV)
  • Stratified k-fold cross-validation

If you wish to know more about validation techniques, then please refer to this article:

In this section, we will learn about stratified k-fold cross-validation. Let us understand how it works:

  • Stratification is the process of rearranging the data so as to ensure that each fold is a good representative of the whole.
  • For example, in a binary classification problem where each class comprises 50% of the data, it is best to arrange the data such that in every fold, each class comprises about half the instances.
  • It is generally a better approach when dealing with both bias and variance.
  • A randomly selected fold might not adequately represent the minor class, particularly in cases where there is a huge class imbalance.
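The difference is easy to see on a deliberately imbalanced toy label vector. The labels below are made up, and the helper `minority_share` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up imbalanced labels: 80 of class 1 followed by 20 of class 0.
y_demo = np.array([1] * 80 + [0] * 20)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split

def minority_share(splitter):
    """Fraction of class 0 in each validation fold."""
    return [float(np.mean(y_demo[val_idx] == 0))
            for _, val_idx in splitter.split(X_demo, y_demo)]

print("KFold:          ", minority_share(KFold(n_splits=5)))
print("StratifiedKFold:", minority_share(StratifiedKFold(n_splits=5)))
```

Plain KFold without shuffling puts all of class 0 into the last fold, while StratifiedKFold keeps the 20% share in every fold.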

Let’s import StratifiedKFold from sklearn and fit the model.

Now let’s make a cross-validation logistic model with stratified 5 folds and make predictions for the test dataset.

    from sklearn.model_selection import StratifiedKFold

    i = 1
    mean = 0
    # shuffle=True is required for random_state to take effect
    kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
    for train_index, test_index in kf.split(X, y):
        print('\n{} of kfold {}'.format(i, kf.n_splits))
        # Split the rows into this fold's training and validation parts
        xtr, xvl = X.loc[train_index], X.loc[test_index]
        ytr, yvl = y[train_index], y[test_index]
        model = LogisticRegression(random_state=1)
        model.fit(xtr, ytr)
        pred_test = model.predict(xvl)
        score = accuracy_score(yvl, pred_test)
        mean += score
        print('accuracy_score', score)
        i += 1
    # The model from the last fold gives the test predictions and validation probabilities
    pred_test = model.predict(test)
    pred = model.predict_proba(xvl)[:, 1]
    print('\n Mean Validation Accuracy', mean / (i - 1))

    1 of kfold 5
    accuracy_score 0.8048780487804879

    2 of kfold 5
    accuracy_score 0.7642276422764228

    3 of kfold 5
    accuracy_score 0.7804878048780488

    4 of kfold 5
    accuracy_score 0.8455284552845529

    5 of kfold 5
    accuracy_score 0.8032786885245902

    Mean Validation Accuracy 0.7996801279488205

The mean validation accuracy for this model turns out to be 0.80. Let us visualize the ROC curve.

    from sklearn import metrics
    import matplotlib.pyplot as plt

    # ROC curve and AUC on the last validation fold
    fpr, tpr, _ = metrics.roc_curve(yvl, pred)
    auc = metrics.roc_auc_score(yvl, pred)
    plt.plot(fpr, tpr, label="validation, auc="+str(auc))
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend(loc=4)
    plt.show()

We got an AUC value of 0.70.
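For intuition about what the AUC measures, here is a tiny sketch with four made-up labels and scores (not taken from the loan data). The AUC equals the fraction of positive/negative pairs in which the positive example receives the higher score.

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities of class 1

fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Count the positive/negative pairs that are ranked correctly
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
correct_pairs = sum(p > n for p in pos for n in neg)

print("AUC:", auc)
print("correctly ranked pairs:", correct_pairs, "of", len(pos) * len(neg))
```

Three of the four pairs are ranked correctly here, so the AUC is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.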

Remember we need predictions in Y and N. So let’s convert 1 and 0 to Y and N.

    submission['Loan_Status'] = submission['Loan_Status'].replace({0: 'N', 1: 'Y'})
    pd.DataFrame(submission, columns=['Loan_ID','Loan_Status']).to_csv('Output/Log1.csv')

Feature Engineering

Based on the domain knowledge, we can come up with new features that might affect the target variable. We will create the following three new features:

Total Income: As discussed during bivariate analysis, we will combine the Applicant Income and Co-applicant Income. If the total income is high, the chances of loan approval might also be high.

EMI: EMI is the monthly amount to be paid by the applicant to repay the loan. The idea behind making this variable is that people who have high EMIs might find it difficult to pay back the loan. We can approximate the EMI by taking the ratio of the loan amount to the loan amount term.

Balance Income: This is the income left after the EMI has been paid. The idea behind creating this variable is that if this value is high, the chances are high that a person will repay the loan, which increases the chances of loan approval.
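The three definitions can be sketched on a toy frame. The rows are made up. The column names follow the dataset, and the factor of 1000 assumes LoanAmount is recorded in thousands while incomes are in plain units.

```python
import pandas as pd

# Made-up rows using the dataset's column names.
df = pd.DataFrame({
    "ApplicantIncome": [5000, 3000],
    "CoapplicantIncome": [0, 1500],
    "LoanAmount": [120, 100],         # assumed to be in thousands
    "Loan_Amount_Term": [360, 180],   # in months
})

df["Total_Income"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["EMI"] = df["LoanAmount"] / df["Loan_Amount_Term"]
# Scale EMI by 1000 so it is in the same units as the income columns
df["Balance Income"] = df["Total_Income"] - df["EMI"] * 1000

print(df[["Total_Income", "EMI", "Balance Income"]])
```

Note that this EMI is a simple ratio and ignores interest, so it is only a proxy for the real monthly instalment.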

Let’s check the distribution of Total Income.

We can see it is shifted towards the left, i.e., the distribution is right-skewed. So, let’s take the log transformation to bring the distribution closer to normal.

Now the distribution looks much closer to normal and the effect of extreme values has been significantly reduced. Let’s create the EMI feature now.

Let’s check the distribution of the EMI variable.

    # Multiply EMI by 1000 so that its units match Total_Income
    train['Balance Income'] = train['Total_Income']-(train['EMI']*1000)
    test['Balance Income'] = test['Total_Income']-(test['EMI']*1000)
    sns.distplot(train['Balance Income'])

Let us now drop the variables which we used to create these new features. The reason is that the correlation between those old features and the new features will be very high, and logistic regression assumes that the variables are not highly correlated. We also want to remove noise from the dataset, and removing correlated features helps with that too.

    train=train.drop(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term'], axis=1)
    test=test.drop(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term'], axis=1)

Model Building: Part II

After creating new features, we can continue the model building process. We will start with the logistic regression model and then move on to more complex models like Random Forest and XGBoost. We will build the following models in this section.

  • Logistic Regression
  • Decision Tree
  • Random Forest
  • XGBoost

Let’s prepare the data for feeding into the models.

Logistic Regression

    i = 1
    mean = 0
    kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
    for train_index, test_index in kf.split(X, y):
        print('\n{} of kfold {}'.format(i, kf.n_splits))
        # Split the rows into this fold's training and validation parts
        xtr, xvl = X.loc[train_index], X.loc[test_index]
        ytr, yvl = y[train_index], y[test_index]
        model = LogisticRegression(random_state=1)
        model.fit(xtr, ytr)
        pred_test = model.predict(xvl)
        score = accuracy_score(yvl, pred_test)
        mean += score
        print('accuracy_score', score)
        i += 1
    # The model from the last fold gives the test predictions and validation probabilities
    pred_test = model.predict(test)
    pred = model.predict_proba(xvl)[:, 1]
    print('\n Mean Validation Accuracy', mean / (i - 1))

    1 of kfold 5
    accuracy_score 0.7967479674796748

    2 of kfold 5
    accuracy_score 0.6910569105691057

    3 of kfold 5
    accuracy_score 0.6666666666666666

    4 of kfold 5
    accuracy_score 0.7804878048780488

    5 of kfold 5
    accuracy_score 0.680327868852459

    Mean Validation Accuracy 0.7230574436891909

    submission['Loan_Status'] = submission['Loan_Status'].replace({0: 'N', 1: 'Y'})
    pd.DataFrame(submission, columns=['Loan_ID','Loan_Status']).to_csv('Output/Log2.csv')

Decision Tree

A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.

Decision trees use one of several algorithms to decide how to split a node into two or more sub-nodes. Each split is chosen to increase the homogeneity of the resulting sub-nodes. In other words, the purity of the nodes increases with respect to the target variable.
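One common way to measure purity is Gini impurity, which is the default criterion of sklearn's DecisionTreeClassifier. The sketch below computes it by hand for a made-up node and two candidate splits. The helper names `gini` and `weighted_gini` are introduced here for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

parent = ["Y", "Y", "Y", "N", "N", "N"]
good_split = (["Y", "Y", "Y"], ["N", "N", "N"])
poor_split = (["Y", "N", "Y"], ["N", "Y", "N"])

print("parent:", gini(parent))
print("good split:", weighted_gini(*good_split))
print("poor split:", weighted_gini(*poor_split))
```

The tree prefers the split with the lower weighted impurity, which here is the one that separates Y from N completely.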

For a detailed explanation visit

Let’s fit the decision tree model with 5 folds of cross-validation.

    from sklearn import tree

    i = 1
    mean = 0
    kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
    for train_index, test_index in kf.split(X, y):
        print('\n{} of kfold {}'.format(i, kf.n_splits))
        # Split the rows into this fold's training and validation parts
        xtr, xvl = X.loc[train_index], X.loc[test_index]
        ytr, yvl = y[train_index], y[test_index]
        model = tree.DecisionTreeClassifier(random_state=1)
        model.fit(xtr, ytr)
        pred_test = model.predict(xvl)
        score = accuracy_score(yvl, pred_test)
        mean += score
        print('accuracy_score', score)
        i += 1
    # The model from the last fold gives the test predictions and validation probabilities
    pred_test = model.predict(test)
    pred = model.predict_proba(xvl)[:, 1]
    print('\n Mean Validation Accuracy', mean / (i - 1))

    1 of kfold 5
    accuracy_score 0.7398373983739838

    2 of kfold 5
    accuracy_score 0.6991869918699187

    3 of kfold 5
    accuracy_score 0.7560975609756098

    4 of kfold 5
    accuracy_score 0.7073170731707317

    5 of kfold 5
    accuracy_score 0.6721311475409836

    Mean Validation Accuracy 0.7149140343862455

    submission['Loan_Status'] = submission['Loan_Status'].replace({0: 'N', 1: 'Y'})
    pd.DataFrame(submission, columns=['Loan_ID','Loan_Status']).to_csv('Output/DecisionTree.csv')

Random Forest

  • Random Forest is a tree-based bootstrapping algorithm wherein a certain number of weak learners (decision trees) are combined to make a powerful prediction model.
  • For every individual learner, a random sample of rows and a few randomly chosen variables are used to build a decision tree model.
  • The final prediction can be a function of all the predictions made by the individual learners.
  • In the case of a regression problem, the final prediction can be the mean of all the predictions. In a classification problem such as this one, it can be the class that most of the learners vote for.
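The idea in these bullets can be sketched by hand with bootstrap samples and a majority vote. This is a simplified illustration on synthetic data, not how sklearn's RandomForestClassifier is implemented. That class averages the trees' predicted probabilities instead of counting hard votes, and the number of trees (25) is an arbitrary choice here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, generated only for this illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap: sample row indices with replacement
    rows = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes each split consider a random subset of variables
    learner = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(learner.fit(X_demo[rows], y_demo[rows]))

# Majority vote across the 25 trees (an odd number, so no ties)
votes = np.stack([t.predict(X_demo) for t in trees])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy of the vote:", (forest_pred == y_demo).mean())
```

Each tree sees a different resample of the rows, so their individual errors differ and tend to cancel out in the vote.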

For a detailed explanation visit this article

    from sklearn.ensemble import RandomForestClassifier

    i = 1
    mean = 0
    kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
    for train_index, test_index in kf.split(X, y):
        print('\n{} of kfold {}'.format(i, kf.n_splits))
        # Split the rows into this fold's training and validation parts
        xtr, xvl = X.loc[train_index], X.loc[test_index]
        ytr, yvl = y[train_index], y[test_index]
        model = RandomForestClassifier(random_state=1, max_depth=10)
        model.fit(xtr, ytr)
        pred_test = model.predict(xvl)
        score = accuracy_score(yvl, pred_test)
        mean += score
        print('accuracy_score', score)
        i += 1
    # The model from the last fold gives the test predictions and validation probabilities
    pred_test = model.predict(test)
    pred = model.predict_proba(xvl)[:, 1]
    print('\n Mean Validation Accuracy', mean / (i - 1))

    1 of kfold 5
    accuracy_score 0.8292682926829268

    2 of kfold 5
    accuracy_score 0.8130081300813008

    3 of kfold 5
    accuracy_score 0.7723577235772358

    4 of kfold 5
    accuracy_score 0.8048780487804879

    5 of kfold 5
    accuracy_score 0.7540983606557377

    Mean Validation Accuracy 0.7947221111555378

We will try to improve the accuracy by tuning the hyperparameters for this model. We will use a grid search to get the optimized values of the hyperparameters. Grid search is a way to select the best combination from a grid of candidate hyperparameter values.

We will tune the max_depth and n_estimators parameters. max_depth decides the maximum depth of the tree and n_estimators decides the number of trees that will be used in the random forest model.

Grid Search

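    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import train_test_split

    # Candidate values: max_depth 1, 3, ..., 19 and n_estimators 1, 21, ..., 181
    paramgrid = {'max_depth': list(range(1,20,2)), 'n_estimators': list(range(1,200,20))}
    grid_search = GridSearchCV(RandomForestClassifier(random_state=1), paramgrid)

    x_train, x_cv, y_train, y_cv = train_test_split(X,y, test_size=0.3, random_state=1)
    # Fit the grid search on the training part
    grid_search.fit(x_train, y_train)

    GridSearchCV(estimator=RandomForestClassifier(random_state=1), param_grid={'max_depth': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19], 'n_estimators': [1, 21, 41, 61, 81, 101, 121, 141, 161, 181]})

    # Best estimator found by the search
    grid_search.best_estimator_

    RandomForestClassifier(max_depth=5, n_estimators=41, random_state=1)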
    i = 1
    mean = 0
    kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
    for train_index, test_index in kf.split(X, y):
        print('\n{} of kfold {}'.format(i, kf.n_splits))
        # Split the rows into this fold's training and validation parts
        xtr, xvl = X.loc[train_index], X.loc[test_index]
        ytr, yvl = y[train_index], y[test_index]
        model = RandomForestClassifier(random_state=1, max_depth=3, n_estimators=41)
        model.fit(xtr, ytr)
        pred_test = model.predict(xvl)
        score = accuracy_score(yvl, pred_test)
        mean += score
        print('accuracy_score', score)
        i += 1
    # The model from the last fold gives the test predictions and validation probabilities
    pred_test = model.predict(test)
    pred = model.predict_proba(xvl)[:, 1]
    print('\n Mean Validation Accuracy', mean / (i - 1))

    1 of kfold 5
    accuracy_score 0.8130081300813008

    2 of kfold 5
    accuracy_score 0.8455284552845529

    3 of kfold 5
    accuracy_score 0.8048780487804879

    4 of kfold 5
    accuracy_score 0.7967479674796748

    5 of kfold 5
    accuracy_score 0.7786885245901639

    Mean Validation Accuracy 0.8077702252432362

    submission['Loan_Status'] = submission['Loan_Status'].replace({0: 'N', 1: 'Y'})
    pd.DataFrame(submission, columns=['Loan_ID','Loan_Status']).to_csv('Output/RandomForest.csv')

Let us find the feature importance now, i.e. which features are most important for this problem. We will use the feature_importances_ attribute of the fitted sklearn model to do so.

    importances = pd.Series(model.feature_importances_, index=X.columns)
    importances.plot(kind='barh', figsize=(12,8))

We can see that Credit_History is the most important feature followed by Balance Income, Total Income, EMI. So, feature engineering helped us in predicting our target variable.


XGBoost

XGBoost is a fast and efficient algorithm and has been used by the winners of many data science competitions. It’s a boosting algorithm, and you may refer to the article below to know more about boosting:

XGBoost works only with numeric variables and we have already replaced the categorical variables with numeric variables. Let’s have a look at the parameters that we are going to use in our model.

  • n_estimators: This specifies the number of trees for the model.
  • max_depth: We can specify the maximum depth of a tree using this parameter.

If you see “XGBoostError: XGBoost Library (libxgboost.dylib) could not be loaded” on macOS, run `brew install libomp` in Terminal.

    from xgboost import XGBClassifier

    i = 1
    mean = 0
    kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
    for train_index, test_index in kf.split(X, y):
        print('\n{} of kfold {}'.format(i, kf.n_splits))
        # Split the rows into this fold's training and validation parts
        xtr, xvl = X.loc[train_index], X.loc[test_index]
        ytr, yvl = y[train_index], y[test_index]
        model = XGBClassifier(n_estimators=50, max_depth=4)
        model.fit(xtr, ytr)
        pred_test = model.predict(xvl)
        score = accuracy_score(yvl, pred_test)
        mean += score
        print('accuracy_score', score)
        i += 1
    # The model from the last fold gives the test predictions and validation probabilities
    pred_test = model.predict(test)
    pred = model.predict_proba(xvl)[:, 1]
    print('\n Mean Validation Accuracy', mean / (i - 1))

    1 of kfold 5
    accuracy_score 0.7804878048780488

    2 of kfold 5
    accuracy_score 0.7886178861788617

    3 of kfold 5
    accuracy_score 0.7642276422764228

    4 of kfold 5
    accuracy_score 0.7804878048780488

    5 of kfold 5
    accuracy_score 0.7622950819672131

    Mean Validation Accuracy 0.7752232440357191

    submission['Loan_Status'] = submission['Loan_Status'].replace({0: 'N', 1: 'Y'})
    pd.DataFrame(submission, columns=['Loan_ID','Loan_Status']).to_csv('Output/XGBoost.csv')

SPSS Modeler

To create an SPSS Modeler Flow and build a machine learning model using it, follow the instructions here:

Predict loan eligibility using IBM Watson Studio

Sign-up for an IBM Cloud account to try this tutorial.


In this tutorial, we learned how to create models to predict the target variable, i.e. whether the applicant will be able to repay the loan or not.
