# How to detect spurious correlations, and how to find the real ones

Specifically designed in the context of big data in our research lab, the new and simple strong correlation synthetic metric proposed in this article should be used whenever you want to check whether there is a real association between two variables, especially in large-scale automated data science or machine learning projects. Use this new metric now to avoid being accused of reckless data science, or even being sued for wrongful analytic practice.

In this paper, the traditional correlation is referred to as the weak correlation, as it captures only a small part of the association between two variables: relying on weak correlation leads to spurious correlations and predictive modeling deficiencies, even with as few as 100 variables. In short, our strong correlation (with a value between 0 and 1) is high (say above 0.80) only if the weak correlation is high (in absolute value) and the internal structures (auto-dependencies) of the two variables X and Y that you want to compare exhibit a similar pattern or correlogram. Yet this new metric is simple and involves just one parameter a (with a = 0 corresponding to weak correlation, and a = 1 being the recommended value for strong correlation). This setting is designed to avoid over-fitting.

Our strong correlation blends together the concept of ordinary or weak regression – indeed, an improved, robust, outlier-resistant version of ordinary regression (or see my book pages 130-140) – together with the concept of X and Y sharing similar bumpiness (or see my book pages 125-128).

In short, even nowadays, what makes two variables X and Y seem related in most scientific articles, and in pretty much all articles written by journalists, is based on ordinary (weak) regression. But there are plenty of other metrics that you can use to compare two variables. Including bumpiness in the mix (together with weak regression, in one single blended metric called strong correlation, to boost accuracy) guarantees that a high strong correlation means that the two variables are really associated: not just based on flawed, old-fashioned weak correlations, but also based on sharing similar internal auto-dependencies and structure. To put it differently, two variables can be highly weakly correlated yet have very different bumpiness coefficients, as shown in my original article, meaning that there might be no causal relationship (or see my book pages 165-168) or hidden factors explaining the link. An artificial example is provided below in figure 3.

Using strong, rather than weak correlation, eliminates the majority of these spurious correlations, as we shall see in the examples below. This strong correlation metric is designed to be integrated in automated data science algorithms.

1. Formal definition of strong correlation

Let’s define

• Weak correlation c(X, Y) as the absolute value of the ordinary correlation, with value between 0 and 1. This number is high (close to 1) if X and Y are highly correlated. I recommend using my rank-based, L-1 correlation (or see my book pages 130-140) to eliminate problems caused by outliers.
• c1(X) as the lag-1 auto-correlation for X, that is, if we have n observations X_1 … X_n, then c1(X) = c(X_1 … X_{n-1}, X_2 … X_n)
• c1(Y) as the lag-1 auto-correlation for Y
• d-correlation d(X, Y) = exp{ –a * | ln( c1(X) / c1(Y) ) | }, with possible adjustment if numerator or denominator is zero, and parameter a must be positive or zero. This number, with value between 0 and 1, is high (close to 1) if X and Y have similar lag-1 auto-correlations.
• Strong correlation r(X, Y) = min{ c(X, Y), d(X, Y) }

Note that c1(X), and c1(Y) are the bumpiness coefficients (or see my book pages 125-128) for X and Y. Also, d(X, Y) and thus r(X, Y) are between 0 and 1, with 1 meaning strong similarity between X and Y, and 0 meaning either dissimilar lag-1 auto-correlations for X and Y, or lack of old-fashioned correlation.

The strong correlation between X and Y is, by definition, r(X, Y). This is an approximation to having both spectra identical, a solution mentioned in my article The curse of Big Data (see also my book pages 41-45).
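The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: it assumes plain Pearson correlation in place of the recommended rank-based L-1 correlation, and it handles the zero numerator or denominator case with a small floor `eps`, which is one possible adjustment among many. The function names and the toy data are made up for this example.

```python
import math
import numpy as np


def weak_corr(x, y):
    """c(X, Y): absolute value of the ordinary correlation, in [0, 1].

    Assumption: Pearson correlation is a stand-in for the rank-based
    L-1 correlation recommended in the text.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Clip tiny floating-point overshoots above 1.
    return min(abs(float(np.corrcoef(x, y)[0, 1])), 1.0)


def lag1_autocorr(x):
    """c1(X) = c(X_1 ... X_{n-1}, X_2 ... X_n)."""
    x = np.asarray(x, dtype=float)
    return weak_corr(x[:-1], x[1:])


def d_corr(x, y, a=1.0, eps=1e-12):
    """d(X, Y) = exp(-a * |ln(c1(X) / c1(Y))|), with a >= 0.

    Assumption: the zero-case adjustment is a simple floor at eps.
    """
    if a < 0:
        raise ValueError("a must be positive or zero")
    cx = max(lag1_autocorr(x), eps)
    cy = max(lag1_autocorr(y), eps)
    return math.exp(-a * abs(math.log(cx / cy)))


def strong_corr(x, y, a=1.0):
    """r(X, Y) = min(c(X, Y), d(X, Y))."""
    return min(weak_corr(x, y), d_corr(x, y, a))


# Toy data (made up): a smooth trend, and the same trend with a
# large alternating term that makes the second series much bumpier.
t = np.arange(100, dtype=float)
x = t
y = t + 20.0 * np.where(t % 2 == 0, 1.0, -1.0)

print("weak  :", round(weak_corr(x, y), 3))
print("strong:", round(strong_corr(x, y, a=1.0), 3))
print("a = 0 :", round(strong_corr(x, y, a=0.0), 3))
```

On this toy pair the weak correlation stays fairly high because both series share the same trend, while the strong correlation drops because their lag-1 auto-correlations differ. With a = 0 the two metrics coincide.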

This definition of strong correlation was initially suggested in one of our weekly challenges.

2. Comparison with traditional (weak) correlation

When a = 0, weak and strong correlations are identical. Note that the strong correlation r(X, Y) still shares the same properties as the weak correlation c(X, Y): it is symmetric and invariant under linear transformations (such as re-scaling) of variables X or Y, regardless of a.
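These properties follow directly from the definitions in section 1. The short derivation below is a sketch that assumes c is the absolute value of the Pearson correlation; the constants α, β, γ, δ are notation introduced here for illustration only.

```latex
\begin{aligned}
a = 0 \;&\Rightarrow\; d(X,Y) = e^{0} = 1
        \;\Rightarrow\; r(X,Y) = \min\{c(X,Y),\,1\} = c(X,Y), \\
d(Y,X) &= \exp\{-a\,|\ln(c_1(Y)/c_1(X))|\}
        = \exp\{-a\,|\ln(c_1(X)/c_1(Y))|\} = d(X,Y), \\
c(\alpha X + \beta,\; \gamma Y + \delta) &= c(X,Y),
        \qquad c_1(\alpha X + \beta) = c_1(X)
        \qquad (\alpha, \gamma \neq 0).
\end{aligned}
```

Since r(X, Y) depends on X and Y only through c(X, Y), c1(X) and c1(Y), the last line implies that r is unchanged by such linear transformations for any value of a, and the second line together with the symmetry of c gives r(X, Y) = r(Y, X).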

# Seven Tools for Effective CDO Leadership

The position of Chief Data Officer (CDO) is relatively new in the federal government, and emerging regulations are providing leadership opportunities for the CDO. A new law, the Foundations for Evidence-Based Policymaking Act, went into effect on January 14, 2019, establishing a set of standards and practices for the United States federal government to modernize its data handling.

Title II of this act is called the Open, Public, Electronic and Necessary (OPEN) Government Data Act, which arose out of the 2013 Open Data Policy. The OPEN Government Data Act requires federal agencies to publish a comprehensive inventory of all data assets, made available as machine-readable data in an open format, under open licenses, as well as putting in place a non-politically appointed senior executive (now the CDO) responsible for actively managing data as an asset. “Not just to talk about it, not just try to leverage value for the enterprise, but to treat it like an asset,” said Corlan Budd, Manager of Data, and Analytics, and Technology Strategy with Ernst & Young. He discussed this during his presentation titled The Chief Data Officer as an Effective Leader at the DATAVERSITY® DGVision Conference. He shared seven tools that can help the CDO be a more effective leader, whether in a government agency, or in the private sector.

#### Key Responsibilities

Budd identified four key responsibilities of the CDO:

• Managing data as an asset
• Transforming how the agency interacts with data
• Value generation
• Regulatory Compliance

Previously, government agencies treated data like a by-product of the system without much concern about practices around the data. Now that the CDO is responsible for changing the culture and transforming the way the agency interacts with data, compliance with the Evidence-Based Policymaking Act, as well as a number of other data privacy acts, including HIPAA, is within the CDO’s purview. The CDO is also responsible for value generation, which is measured differently in the government space than it is in the private sector, he said. Rather than valuing the data and trying to monetize it, “we have to support the mission and improve public service,” he said.

#### Culture and the CDO Challenge

Budd quoted Peter Drucker: “Culture will eat strategy for breakfast.” Building an effective strategy is a waste of time if the culture puts up roadblocks to its success. The key to ensuring strategy is embraced rather than ‘eaten for breakfast,’ Budd said, is leadership, yet, “The culture and the organizational dynamics don’t necessarily line up for success immediately.” Cultural factors are dependent on context, and the organizational structure where the CDO resides, whether that is in finance, or risk, or another part of the organization. Support from the CIO and the dynamics of power above the CDO have an effect on autonomy. Culture issues below the CDO often stem from staff buy-in and stakeholder support.

#### Funding and Proving Value

The CDO must show the value of the data itself as well as the value of improving the organization’s relationship with data, while managing expectations about how and when this will happen.  Contracts that are project-based, or with more sophisticated capabilities tend to have an easier time getting funding than program-based proposals that could enhance customer value and provide better service company-wide. With some business units, he said, essentially the only value that they get is the ability to operate their program.

Innovation and transformation provide peak value when C-level
execs are able to make data-driven decisions, optimize performance, and reduce
costs. What often stands in the way of that is culture. The key is to change
from a program or business unit focus to an enterprise-wide approach. “Get
folks in a room and get them talking,” creating an environment that facilitates
conversation among data enthusiasts where they can discuss data issues and leverage
data sharing initiatives. This can provide a lot of value and open up
possibilities for positive cultural change, he said.

#### Assessing Culture: Hofstede’s 6 Dimensions of Culture

Budd suggests using three elements of social psychologist Geert Hofstede’s Six Dimensions of Culture as a guide to qualitatively assess the organizational culture: Individualism vs. collectivism, uncertainty avoidance, and long-term vs. short-term orientation.

• Individualism vs. Collectivism: An
individualistic culture values individual performance and recognition over
playing a role as part of larger extended team or group. Loyalties in an
individualistic culture are focused on the individual. Collectivist culture
loyalties are focused on groups or departments. When building a team
environment, everyone has to understand that in some circumstances they will be
recognized for individual accomplishment, but in relationship to data, each
person has a role as part of a team. “That helps the overall success of not
just the chief data officer, but how effectively we can utilize our data and
how much value we can get from our data for the entire organization, not just
in that C-suite area.”
• Short Term vs. Long Term Orientation: Budd
was surprised at how prevalent short-term orientation was throughout his
organization, with an almost complete lack of interest in any long-term
orientation for strategy. The value of a strategy happens over the course of
time, so he suggests finding some of the low-hanging fruit without sacrificing
longer-term goals. When focusing on moving the needle from short-term
orientation toward the long-term orientation side, “The only way I was able to
do that was to satisfy some of the short-term need, at least for the moment,”
which gave him enough momentum to focus in on some of the longer-term strategy
issues.
• Low vs. High Uncertainty Tolerance: Uncertainty avoidance can be a stumbling block or a wise choice depending on the situation. Concern about investments in new technology is a good idea if the tool is unproven. Stakeholders may have difficulty buying in if there’s a high level of uncertainty about the vision or the likelihood of success, especially if they previously saw a Chief Data Officer who tried something similar and didn’t succeed the first time. With uncertainty avoidance, he considered his efforts a success if there was any move across the halfway point toward risk.

“When you come across a situation where you’re on one extreme of the continuum, figure out how you can move that needle culture-wise back to an acceptable area for your strategy to succeed,” he said.

Budd found two leadership principles from John Maxwell’s 21 Irrefutable Laws of Leadership particularly useful for developing skills needed to adapt to the existing environment and connect with the people in it.

• The Law of the Lid: Leadership ability
determines a person’s level of effectiveness. Implementing required changes
without buy-in has a negative effect on culture, he said. “There are a lot of
things that you just can’t do unless you have consensus.” Understand the importance
of developing multiple leadership styles based on the existing culture, such as
using a transformative leadership style in some circumstances, and democratic
leadership in other circumstances. “When you need to develop consensus, you
might have to switch your leadership style to one that’s a little bit more
democratic,” he said.
• The Law of Connection: Leaders touch a
heart before they ask for a hand. A leader needs to develop a personal
connection before successfully affecting culture or leading individuals
in the organization, said Budd. “Followers don’t necessarily follow a
particular thing, but they will follow your vision, and if they connect with

#### Effective Leadership: Influence and Motivate

Three more of Maxwell’s laws, as well as Jim Collins’ Turning the Flywheel provide guidance for learning how to influence and motivate others:

• The Law of Explosive Growth: To add
essentially lead the entire agency, because everyone is a consumer of data, he
said. Identify a group of data consumers and empower them – enable them to the
impact for culture change has essentially multiplied.”
• The Law of Influence: The true measure of
build on one another and contribute to a leader’s level of influence.  “If we want to be effective, and the
measurement of our effectiveness is our influence, then that’s what we need to
make sure we’re honing in on.”
• The Law of the Big Mo: Momentum is the leader’s best friend. It’s the little
things that lead to the big things.
• The Flywheel Concept: Establish momentum
early on in the process by getting some wins and providing short-term value.
This is similar to riding a bike or turning a flywheel. “The first couple of
strides are always really, really difficult, but once you get that momentum
going when you’re riding the bike, then the machine does a lot of the work for
you.”

According to Jim Collins’ Good to Great, effectively leading an organization into greatness entails sustaining a certain level of performance and growth over time. “A leader’s lasting value is measured by how things continue after they’re gone,” said Budd, yet often when a leader leaves, their initiatives fall by the wayside. An effective leader uses Maxwell’s Law of Explosive Growth to build sustainability. “‘It takes a leader to raise a leader,’ so the essential strategy for sustainability is to develop leaders who will support your data initiatives into the future.”

#### Effective Leadership: First Things First

To manage short-term value expectations, Budd recommends Stephen
Covey’s concept of ‘first things first.’ With effective prioritizing, a leader
is able to focus on values, plan ahead, and have opportunities for networking,
relationship-building, and impacting the culture.

Budd uses the Eisenhower Decision Matrix as a tool for effectively determining which tasks are important but not urgent, and how to move from reactive to proactive, “Instead of trying to get through the day putting out fires.”

As new activities are added to his plate, Budd uses the chart to
ask himself where they fit in the matrix and whether they line up with his priorities
and strategy. This process, he said, “provides some pretty good immediate
value.” Socializing the Eisenhower matrix can create buy-in and ownership among
team members. When all members participate in thinking through where time
should be spent and work together to ensure that quadrant one
(Important/Urgent) and quadrant two (Important/Not Urgent) are balanced,
priorities are shared and value becomes apparent. “The key also is making sure
that when you do that, you track the value and you measure it, and you
celebrate your win whenever you get one.”

# Key Considerations for Executing a Successful M&A Data Migration or Carve-Out

Mergers, acquisitions, and divestitures
are just as much of an undertaking for a CIO as they are for a CFO; they
affect both the business and the technology side. Determining which SAP
systems and data sets to migrate, integrate, or carve out as part of the deal,
and then executing those migrations or carve-outs, can be costly, lengthy,
and incredibly complex, which in turn affects your overall timeline.
Missteps in the data migration process can result in unnecessary technical
debt, potential Transition Services Agreement (TSA) penalties, and even delays
in achieving your final goals for the project.

There are some key considerations that I would recommend to companies undergoing mergers, acquisitions, or divestitures when it comes to their data migration needs. Chief among them are the need to build automation into the heart of your migration or carve-out strategy and the importance of aligning with the right software-driven partner, which is integral to executing a data migration or carve-out that stays on track and achieves overall timelines and goals.

#### Create a Clear Plan of Action

Mergers, acquisitions, and
divestitures are incredibly complex processes. Obviously, no business
undertakes one without first outlining a clear plan of action and a timeline
for that plan to proceed along. But it’s crucial that that plan also prioritizes
the data migration side of the operation; it can’t just be a business-facing
process. Data migrations and carve-outs are among the most daunting tasks that
come with executing a merger, acquisition, or divestiture — so getting it right
is critical to accomplishing the broader mission at hand.

While every company’s situation is
different, there are a few key questions that businesses undergoing a merger,
acquisition, or divestiture need to ask themselves to ensure their data needs
aren’t being overlooked:

• Do we need to
integrate the company we just bought into our ERP systems? In the case of a
divestiture: Do we need to identify and carve out data from our systems?
• Does the company we’re
acquiring use SAP or another kind of ERP? Do both companies already share the
same kind of ERP?
• What regulatory issues
may come up that could lengthen, halt, or delay the process? Are there any
potential TSA compliance hurdles that we might come up against?

• One area to consider
is sales overlaps. With a 20 percent overlap from balance sheet to balance
sheet, this can present a significant potential regulatory obstacle.
• How quickly do we need
the data migration or carve-out done?

While these may seem like fundamental
first steps, they’re crucial ones. Without a clear outline of your data needs,
you could end up in a situation where a merger, acquisition, or divestiture
results in the new company taking on excessive levels of technical debt or
violating regulatory compliance — which itself carries a whole host of new
problems with which to deal.

#### Putting Automation Front and Center

Whether you’re integrating or carving out data, the process is incredibly labor-intensive and rife with repetitive tasks. More than that, each decision to be made carries potentially far-reaching consequences for everything from data history preservation and master data relevance to security and compliance. In other words, getting it right the first time is business-critical.

This is all the more reason why
automation needs to be treated as an integral part of these processes.
Automating data migrations or carve-outs ensures that the volume of menial
tasks is being executed both quickly and painlessly while leaving the more
weighted choices to be done manually. Automation ensures decision-makers are
essentially only spending their time and resources on the tasks that most
require their input — all of which enables IT teams to best allocate and
prioritize their resources for performing even the most challenging carve-out
or migration plans.

This also comes in handy in the
aftermath of the merger, where automation can speed up post-merger or
post-acquisition integration projects, accelerating how quickly and seamlessly
the migration can take hold while providing a level of insight and control over
the process that can’t be achieved through traditional, manual
approaches.

#### Executing with Minimal Business Disruptions

After building a plan of action
and wrapping it around an automation-driven strategy, the next consideration
ultimately turns to the go-live date: Can your business handle a disruption
that lasts longer than a weekend? How quickly do you need to execute the data
operations? Just how long is too long?

This might be the last step in the
process, but it’s no less critical. Being able to carry out your new data
migration or carve-out with minimal downtime or disruptions to the business is
essentially the first proving ground of how successful your new merger,
acquisition, or divestiture will be. To that end, businesses undergoing these
transformations need to ensure they’ve aligned themselves with the right
software partner ahead of time. Successful data migrations and carve-outs are
integral to the success of the newly merged or divested company and key to
averting the technical debt or TSA violations that can otherwise knock you off
track. Getting that done on time and in line with your goals requires getting
off on the right foot with the right partner.

With so much at stake, businesses
undergoing a merger, acquisition, or divestiture need nothing less than a
predictable process for executing their data migration and carve-out needs — a
software-driven, end-to-end, automated process that is predictable in its
speed, efficiency, and success rate in delivering on your goals within your
timetable.

# Parallel ways of Data Scientist and Machine Learning

👉 📊 There are endless conversations, debates, and discussions about this popular topic, and for everyone from data science experts to complete newbies, it can be a little overwhelming to know where to start.

🔥 For researchers, students, industry experts, and machine learning (ML) enthusiasts alike, keeping up with the best and latest machine learning research is a matter of finding reliable information. In this blog, we share how data science is evolving with the rising demand for machine learning.

### Inside 🎰 Machine Learning- 👇

In simple terms, every time we pick up our phones to seek information from a search engine like Google or a social media platform like Facebook or Instagram, machine learning is playing its role. Its job is to provide the most relevant information and recommendations to the searcher. From looking for good restaurants to finding skincare tips, we contribute to machine learning through our searches on the internet without realizing it.

🎯 Machine learning technology plays a big role in collecting and keeping track of users’ search behavior data for companies, so that data scientists or business personnel can take it into consideration when making important product- or service-related decisions.

🗨 So, that is how we interact with machine learning in our daily lives without even noticing. Now let us understand the role of data scientists and how it relates to machine learning.

### 📉 Who is a Data Scientist?

🚀 A data scientist can be described as an expert in extracting meaningful information from heaps of data. Data scientists are specialists in gathering and analyzing large sets of structured and unstructured data. Combining computer science, statistics, and mathematics, they are analytical experts who use their skills, both technologically and ethically, to find trends and manage data. They analyze, process, and model data, then translate the results into actionable plans for companies and other organizations.

👩‍💻 With sufficient knowledge of different machine learning techniques and of tools such as Python, SAS, R, and SQL/NoSQL databases, a data scientist can perform the task with very few challenges and easily outrank competitors.

### 🎰 Machine Learning for Data Scientists or Vice Versa? 👇

Considering the role of the data scientist discussed above, machine learning serves no purpose without data. This is how machine learning and data science go hand in hand: each is incomplete without the other.

🗨 Machine learning collects the data for data scientists to evaluate and extract meaning from. With the increased use of technology and the internet, ML acts as a spur that pushes data science into high demand.

In the world of 📈 data science, there is never a shortage of tools and algorithms to apply to data. Data science skills therefore also involve the ability to evaluate machine learning and to make machines smart enough to ease the analysis process. Going forward, an essential level of machine learning knowledge will become a benchmark for data scientists. 🔻

From a different perspective, machines need to be smart enough to match human abilities, and machine learning is the soul of artificial intelligence.

👨‍⚖️ Data scientists must understand machine learning to achieve the best outcomes and quality results. This understanding helps machines make the right decisions and take smarter actions in real time with zero human intervention. Hence, data scientists must acquire skills in machine learning. 👇

### 📖 Conclusion-

In the world of data science, machine learning has already proven its worth: it is turning out to be the best solution for deeper analysis of huge amounts of data. Data scientists must acquire knowledge of ML to stand out in a competitive market.

#### ✍ Author Bio :⤵

Senior Data Scientist and alumnus of IIM-C (Indian Institute of Management – Kolkata) with over 25 years of professional experience, specialized in Data Science, Artificial Intelligence, and Machine Learning.
PMP certified.
ITIL Expert certified; APMG, PEOPLECERT, and EXIN accredited trainer for all modules of ITIL up to Expert. Trained over 3,000 professionals across the globe and currently authoring a book on ITIL, “ITIL MADE EASY”.

Conducted myriad project management and ITIL process consulting engagements in various organizations. Performed maturity assessments, gap analysis, project management process definition, and end-to-end implementation of project management best practices.