

Feature Store as a Foundation for Machine Learning





With so many organizations now taking the leap into building production-level machine learning models, many lessons learned are coming to light about the supporting infrastructure. For a variety of important types of use cases, maintaining a centralized feature store is essential for higher ROI and faster delivery to market. In this review, the current feature store landscape is described, and you can learn how to architect one into your MLOps pipeline.

By German Osin, Senior Solutions Architect at Provectus.

Image by Yurchanka Siarhei from Shutterstock.

Artificial intelligence and machine learning have reached an inflection point. In 2020, organizations in diverse industries of various sizes began evolving their ML projects from experimentation to production on an industrial scale. While doing so, they realized they were wasting a lot of time and effort on feature definition and extraction.

A feature store is a fundamental component of the ML stack and of any robust data infrastructure because it enables efficient feature engineering and management. It also allows for simple re-use of features, standardization of features across the company, and consistency of features between offline and online models. A centralized, scalable feature store allows organizations to innovate faster and drive ML processes at scale.

The team at Provectus has built several feature stores, and we would like to share our experience and lessons learned. In this article, we will explore how feature stores can help eliminate rework and enforce data traceability from source to model across teams. We will look into the specifics of building a scalable feature store and discuss how to achieve consistency between real-time and training features, and to improve reproducibility with time traveling for data.

What Is a Feature Store?

A feature store is a data management layer for machine learning features. ML features are measurable properties of phenomena under observation, like raw words, pixels, sensor values, rows of data in a data store, fields in a CSV file, aggregates (min, max, sum, mean), or derived representations (embeddings or clusters).
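To make this concrete, here is a minimal Python sketch (with hypothetical records and feature names) deriving aggregate features from raw rows:

```python
# A minimal illustration of turning raw observations into ML features.
# The records and feature names here are hypothetical.
from statistics import mean

raw_orders = [
    {"user_id": 1, "amount": 20.0},
    {"user_id": 1, "amount": 35.0},
    {"user_id": 2, "amount": 15.0},
]

def aggregate_features(orders):
    """Derive per-user aggregate features (min, max, sum, mean) from raw rows."""
    by_user = {}
    for row in orders:
        by_user.setdefault(row["user_id"], []).append(row["amount"])
    return {
        user: {
            "order_count": len(amounts),
            "amount_min": min(amounts),
            "amount_max": max(amounts),
            "amount_sum": sum(amounts),
            "amount_mean": mean(amounts),
        }
        for user, amounts in by_user.items()
    }

features = aggregate_features(raw_orders)
```

In a real pipeline, aggregates like these are computed by a feature engineering job and then registered in the store rather than recomputed by every model.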

From a business perspective, feature stores offer two major benefits:

  1. Better ROI from feature engineering: collaboration, sharing, and re-use of features reduce the cost per model.
  2. Faster time to market for new models, driven by the increased productivity of ML engineers. A feature store decouples the storage implementation and the feature-serving API from ML engineers, freeing them to work on models rather than on latency problems for more efficient online serving.

Not every use case requires a centralized, scalable feature store. Feature stores pay off in projects where models need a stateful, constantly changing representation of the system. Examples of such use cases include demand forecasting, personalization and recommendation engines, dynamic pricing optimization, supply chain optimization, logistics and transportation optimization, fraud detection, and predictive maintenance.

Concepts of a Feature Store

A standardized feature store revolves around a few key concepts:

  1. Online Feature Store. Serves the feature vectors that online applications send to an ML model for predictions.
  2. ML Specific Metadata. Enables the discovery and re-use of features.
  3. ML Specific API and SDK. High-level operations for fetching training feature sets and online access.
  4. Materialized Versioned Datasets. Maintains versions of feature sets used to train ML models.

Image by Author.

All of the concepts are represented in the above image. Analytical data is taken from a data lake and pushed through the feature engineering pipeline, where output is stored in the feature store. From there, ML engineers can discover the features, use them for training new models, and then re-use the features for serving.
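This flow can be sketched with a toy in-memory stand-in for a feature store; the class and method names below are illustrative, not any particular product's API:

```python
# Hypothetical sketch of the flow described above: features are produced by an
# engineering pipeline, registered in a store, then discovered and re-used for
# both training (offline) and serving (online).
class FeatureStore:
    """Toy in-memory stand-in for a real feature store service."""
    def __init__(self):
        self._features = {}      # name -> values keyed by entity id
        self._metadata = {}      # name -> description, for discovery

    def register(self, name, values, description):
        self._features[name] = values
        self._metadata[name] = description

    def discover(self, keyword):
        # ML-specific metadata enables discovery and re-use of features.
        return [n for n, d in self._metadata.items() if keyword in d]

    def get_training_frame(self, names):
        # Offline access: full historical values for model training.
        return {n: self._features[n] for n in names}

    def get_online_vector(self, names, entity_id):
        # Online access: a single feature vector for one entity at serving time.
        return [self._features[n][entity_id] for n in names]

store = FeatureStore()
store.register("avg_order_value", {42: 27.5}, "mean order amount per user")
store.register("order_count", {42: 12}, "number of orders per user")

print(store.discover("order"))   # both features are discoverable by keyword
print(store.get_online_vector(["avg_order_value", "order_count"], 42))  # [27.5, 12]
```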

These four concepts are supported by multiple products. The market leaders are Feast, Tecton, Hopsworks, and the AWS SageMaker Feature Store. We will focus on the open source products: Feast and Hopsworks.

#1 Feast

Feast is an ML service that helps teams bridge the gap between data and machine learning models. It allows teams to register, ingest, serve, and monitor features in production.

This service is in the active development stage, but it has already been battle-tested with GoJek, Farfetch, Postmates, and Zulily. It can be integrated with Kubeflow and is backed by Kubeflow’s strong community.

As of 2020, Feast is a GCP-only service. It is a bit infrastructure-heavy, lacks composability, and does not offer data versioning. Note that these challenges are on the project's roadmap for 2021. Also, since November 2020, Feast has been positioned as the open source counterpart of Tecton.

Image by Author.

#2 Hopsworks

Hopsworks is an Enterprise Platform for developing and operating AI applications. It allows teams to manage ML features quickly and efficiently. The team behind Hopsworks are feature store evangelists, and they offer a lot of great educational content.

The Hopsworks feature store can be integrated with most Python libraries for ingestion and training. It also supports offline stores with time travel. Most importantly, it is AWS-, GCP-, Azure-, and on-premise ready.

What makes Hopsworks a challenge to use is its heavy reliance on the HopsML infrastructure. Also, the Hopsworks online store may not satisfy all latency requirements.

Image by Author.

Challenges of Modern Data Platforms

Before looking into the specifics of building feature stores, the challenges of modern data platforms have to be considered. Feature stores cannot be examined in isolation from the rest of the data and the ML infrastructure.

Traditionally, a canonical data lake architecture looks like this:

Image by Author.

This high-level architecture has components for sourcing, ingestion, processing, persisting, querying, and a consumer component. While it works fine for most tasks, it is not ideal.

Data access and discoverability can be a problem. Data is scattered across multiple data sources and services. This once helped protect data, but now it only adds a new layer of complexity and creates data silos. Such architecture entails the tedious process of managing AWS IAM roles, Amazon S3 policies, API Gateways, and database permissions. This becomes even more complicated in a multi-account setup. As a result, engineers are confused about what data exists and which is actually accessible to them since metadata is not discoverable by default. This creates an environment where investments in data and machine learning are curtailed due to data access issues.

Monolithic data teams are another issue to consider. Since data and machine learning teams are extremely specialized, they often have to operate out of context. Lack of ownership and domain context creates silos between data producers and data consumers, making it considerably more difficult for a backlogged data team to keep up with business demands. Data and ML Engineering fall victim to complex dependencies, failing to sync their operations. Any fast, end-to-end experimentation is not possible in such circumstances.

Machine learning experimentation infrastructure is uncharted territory. Traditional architecture lacks an experimentation component, which not only leads to data discovery and data access issues but also makes it more complex to maintain the reproducibility of datasets, ML pipelines, ML environments, and offline experiments. Although Amazon SageMaker and Kubeflow have made some progress here, reproducibility is problematic. Production experimentation frameworks are immature and cannot be entirely relied upon. As a result, it can take from three to six months to push one end-to-end experiment from data to production ML.

Scaling ML in production is complex and requires a lot of time and effort. While machine learning is mostly discussed as an offline activity (e.g., data collection, processing, model training, evaluating results, etc.), the ways models are used and served online in the real world are what actually matter. With a traditional architecture, you cannot access features during model serving in a unified and consistent way. You also cannot re-use features between multiple training pipelines and ML applications. The monitoring and maintenance of ML applications are also more complicated. As a result, the time and cost required to scale from 1 to 100 models in production grow exponentially, which limits the ability of organizations to innovate at the desired pace.

Emerging Architectural Shifts

To address these challenges, several architectural shifts have emerged:

  1. From Data Lakes to Hudi/Delta lakes. A data lake is not just a folder in Amazon S3. It is a feature-rich, fully managed layer for Data Ingestion, Incremental Processing, and Serving with ACID transactions and point-in-time queries.
  2. From Data Lakes to Data Mesh. Ownership of data domains, data pipelines, metadata, and APIs is moving away from centralized teams to product teams. Another impactful benefit is treating and owning data as a complete product rather than a side effect that nobody cares about.
  3. From Data Lakes to Data Infrastructure as a Platform. If the ownership of data is decentralized, platform components have to be unified and packaged into a reusable data platform.
  4. From Endpoint Protection to Global Data Governance. As part of the shift towards centralized data platforms, organizations are moving away from endpoint protection to global data governance, a higher-level control plane for managing security and data access policies across available data sources.
  5. From Metadata Store to Global Data Catalog. Metadata stores like the Hive Metastore cannot aggregate many data sources. The industry needs a global data catalog to support the user experience around data discovery, lineage, and versioning.
  6. Feature Store. Feature store is a new emerging component of the ML stack that enables scaling of ML experimentation and operations by adding a separate data management layer for ML Features.

All of these transformations are happening in parallel and should be thought of holistically. You cannot start designing a feature store and end up having separate data catalogs for features and for other data applications. While building a feature store, you have to rely on data versioning features, which can easily be part of other parallel projects.

Before moving forward, just a few words about four major components that drive the aforementioned transformations from disjointed ML jobs to MLOps, to provide a wider context for the importance of feature stores.

#1 Delta/Hudi Lakes

ACID data lakes enable managed ingestion, efficient dataset versioning for ML training, cheap "deletes" for GDPR/CCPA compliance, and "upserts" for data ingestion. They also offer an audit log to keep track of dataset changes and ACID transactions while enforcing data quality through schemas. Delta and Hudi lakes bring stream processing to big data, providing fresh data much more efficiently than traditional batch processing.
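A toy Python model of these semantics, assuming a simple in-memory table rather than a real Delta or Hudi implementation, might look like this:

```python
# Toy model of the ACID-lake semantics described above: upserts, deletes, and
# point-in-time ("time travel") reads over versioned snapshots. Real systems
# (Delta Lake, Apache Hudi) implement this at scale on object storage.
import copy

class VersionedTable:
    def __init__(self):
        self._versions = [{}]  # list of snapshots, keyed by primary key

    def _commit(self, snapshot):
        self._versions.append(snapshot)

    def upsert(self, key, row):
        snap = copy.deepcopy(self._versions[-1])
        snap[key] = row
        self._commit(snap)

    def delete(self, key):
        snap = copy.deepcopy(self._versions[-1])
        snap.pop(key, None)
        self._commit(snap)

    def read(self, version=None):
        """Read the latest snapshot, or 'time travel' to an earlier version."""
        return self._versions[-1 if version is None else version]

table = VersionedTable()
table.upsert("user-1", {"email": "a@example.com"})   # version 1
table.upsert("user-1", {"email": "b@example.com"})   # version 2 (upsert)
table.delete("user-1")                               # version 3 (delete)

print(table.read())            # {} -- the row is gone in the latest version
print(table.read(version=2))   # time travel: the pre-delete value is queryable
```

Note that a real GDPR-compliant delete also vacuums old snapshots; this sketch keeps them only to illustrate time travel.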

#2 Global Data Governance

Because it is no longer a standard to manage AWS IAM roles, Amazon S3 policies, API Gateways, and database permissions at the user level, a company-wide data governance structure should be used to:

  1. Accelerate privacy operations with data you already have. Automate business processes, data mapping, and PI discovery and classification for privacy workflows.
  2. Operationalize policies in a central location. Govern privacy policies to ensure policies are effectively managed across the enterprise. Define and document workflows, traceability views, and business process registers.
  3. Scale compliance across multiple regulations. Use a platform designed and built with privacy in mind that can be easily extended to support new regulations.

#3 Global Data Catalog

Although there is no single market leader here, Marquez, Amundsen, Apache Atlas, Collibra, Alation, and Data Hub are worth mentioning.

A global data catalog is extremely useful for answering questions such as Does this data exist? Where is it? What is the source of truth of this data? Do I have access? Who is the owner? Who are the users of this data? Are there existing assets I can re-use? Can I trust this data? Basically, it acts as a meta metadata store of sorts.
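As a sketch, a catalog answering a few of these questions could be as simple as a metadata map with lineage links (the dataset names and paths below are made up):

```python
# Hypothetical sketch of a global data catalog answering the questions above:
# existence, location, ownership, and lineage of a dataset.
catalog = {
    "orders": {
        "location": "s3://lake/orders",   # illustrative path, not a real bucket
        "owner": "payments-team",
        "source_of_truth": True,
        "upstream": [],                   # lineage: derived from nothing
    },
    "user_features": {
        "location": "s3://lake/user_features",
        "owner": "ml-platform-team",
        "source_of_truth": False,
        "upstream": ["orders"],           # lineage: derived from orders
    },
}

def exists(name):
    """Does this data exist?"""
    return name in catalog

def lineage(name):
    """Walk upstream dependencies: where does this data come from?"""
    chain = []
    for parent in catalog[name]["upstream"]:
        chain.append(parent)
        chain.extend(lineage(parent))
    return chain
```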

#4 Reproducible ML Pipelines

The final component is reproducible ML pipelines for experimentation.

Image by Author.

The above chart represents the architecture for MLOps and reproducible experimentation pipelines. It starts with four inputs: ML model code, ML pipeline code, Infrastructure as a code, and Versioned dataset. The versioned dataset — an input for your machine learning pipeline — should be sourced from the feature store.
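One way to think about reproducibility: pin all four inputs to immutable references and derive an experiment identifier from them. The commit hashes and module names below are purely illustrative:

```python
# Sketch of pinning the four inputs above so an experiment is reproducible:
# model code, pipeline code, infrastructure, and the versioned dataset are all
# identified by immutable references.
import hashlib
import json

experiment = {
    "model_code": "git:mymodels@3f2a9c1",           # hypothetical commit
    "pipeline_code": "git:mypipelines@8d04b77",     # hypothetical commit
    "infrastructure": "terraform-module:ml-stack/1.4.2",
    "dataset": {"feature_set": "user_features", "version": 42},  # from the feature store
}

def experiment_id(spec):
    """A stable hash of the pinned inputs; identical inputs yield an identical id."""
    canonical = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]
```

Re-running with the same pinned inputs reproduces the same experiment id; bumping the dataset version changes it, which makes silent input drift visible.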

Modern Data Infrastructure

Now let’s look at modern data infrastructure.

Image by Author.

We have batch processing for raw data and stream processing for event data. We store processed artifacts in cold storage for business reports and in a near-real-time, incrementally updated hot index for our API. The same data can be used in both scenarios; to keep it consistent, we use different pub/sub systems.

This is the traditional architecture for data platforms. Its goal is to provide consistency between cold and hot storage and to allow for discoverability from the data catalog, data quality, and global security with a fine-grained control on top of it.

Image by Author.

If we look at a feature store design, we will see features and infrastructure components almost identical to what we have in the data platform. In this case, the feature store is not a separate silo that brings yet another ingestion system, storage, catalog, and quality gates. It serves as a lightweight API between our data platform and ML tools. It can be nicely integrated with everything that has already been done in your data infrastructure. It should be composable, lightweight, and unopinionated in design.
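A minimal sketch of that idea, with hypothetical stub backends standing in for your existing offline and online stores:

```python
# Sketch of the "lightweight API, not a new silo" idea: the feature store
# delegates to whatever offline and online storage the data platform already
# has, instead of owning its own ingestion and storage. All names are made up.
class OfflineStoreStub:
    """Stand-in for an existing lake client (e.g., Hudi/Delta reader)."""
    def read(self, names, version=None):
        data = {"avg_order_value": [27.5, 31.0]}
        return {n: data[n] for n in names}

class OnlineStoreStub:
    """Stand-in for an existing key-value client (e.g., DynamoDB, Redis)."""
    def get(self, names, entity_id):
        data = {"avg_order_value": {42: 27.5}}
        return [data[n][entity_id] for n in names]

class FeatureStoreAPI:
    def __init__(self, offline_store, online_store, data_catalog=None):
        self.offline = offline_store    # re-uses the platform's cold storage
        self.online = online_store      # re-uses the platform's hot storage
        self.catalog = data_catalog     # re-uses the platform-wide catalog

    def training_set(self, feature_names, as_of_version=None):
        return self.offline.read(feature_names, version=as_of_version)

    def online_vector(self, feature_names, entity_id):
        return self.online.get(feature_names, entity_id)

api = FeatureStoreAPI(OfflineStoreStub(), OnlineStoreStub())
```

Swapping a backend then means replacing one stub with a real client, not rebuilding the store.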

Image by Author.

When you begin designing and building your data infrastructure, consider the following “lessons learned” so far:

  1. Start by designing a consistent ACID Data Lake before investing in a Feature Store.
  2. The value of existing open-source products does not always justify the investment in integration and the dependencies they bring.
  3. A feature store is not a new infrastructure and data storage solution but a lightweight API and SDK integrated into your existing data infrastructure.
  4. Data Catalog, Data Governance, and Data Quality components are horizontal for the entire Data Infrastructure, including the feature store.
  5. There are no mature open source or cloud solutions for Global Data Catalog and Data Quality monitoring.

Reference Architecture

Image by Author.

This chart depicts the reference architecture we have been using for our customers. It features the services we have opted to use, but you should not be limited by our choices. The idea here is that you have to choose cold and hot storage based on your data workloads and on your business needs.

For hot storage, you may choose from DynamoDB, Cassandra, HBase, a traditional RDBMS like MySQL or PostgreSQL, or even Redis. It is important that your hot storage be composable, pluggable, and aligned with your data infrastructure strategy.

For cold storage, Apache Hudi and Delta Lake are our favorites. They offer such features as time travel, incremental ingestion, and materialized views.

There are some blank spaces on the diagram, which we hope to fill soon. For example, so far, there is no off-the-shelf leader for the data catalog. Data quality tools are also in the early stages. For now, you can choose from Great Expectations or Apache Deequ, which are great tools, but they do not provide a complete solution.

Image by Author.

In the image above, the question marks occupy spaces where you can choose from solutions built by open source communities, build your own solution in-house, or collaborate with cloud providers (e.g., AWS’ latest addition — Amazon SageMaker Feature Store for Machine Learning).

Moving Forward with Feature Store

Although it is still early in the game for feature stores, organizations that are not just experimenting but actively moving machine learning projects to production have already realized the need to have a centralized repository to store, update, retrieve, and share features.

In this article, we have shown how to design and build such a repository. While some of the points featured here are debatable and open for feedback from the community, it is clear that:

  1. Your existing data infrastructure should cover at least 90% of feature store requirements, including streaming ingestion, consistency, data catalog, and versioning, to achieve the desired outcome.
  2. It makes sense to build a lightweight Feature Store API to integrate with your existing storage solutions in-house.
  3. You should collaborate with community and cloud vendors to maintain compatibility with standards and state-of-the-art ecosystems.
  4. You should be ready to migrate to a managed service or to an open-source alternative as the market matures.

Original. Reposted with permission.



EHR Data Launches Movement for Patients to Own Health Data




EHR Data has announced the official launch of EHR Data Wavemakers, a movement geared towards educating and empowering individuals to create waves that will push a much-needed change in the healthcare industry—for patients to be able to own and control their health data.

Many people from all walks of life have experienced miscommunication caused by delays and failures in retrieving and sharing their own health records, which may have negatively impacted their own or their loved ones' medical treatment and care. At the forefront of the EHR Data Wavemakers movement is the digital campaign My EHR Story, which encourages people to share their stories on social media using the hashtag #myEHRstory. This not only creates awareness about the current situation, wherein patients have difficulty accessing their personal health data when that data should be rightfully theirs to own and control, but also spearheads a movement towards responsible data management in a blockchain-based global healthcare database.

“EHR Data is a company that’s existed for 41 years in the U.S. It’s not a startup. It is a significant player in the world of healthcare data in the U.S. It’s bringing its 41 years of experience to follow Craig Wright’s lead in empowering people to have more control over their data, in this case, healthcare data. They want to create more patient safety. They’re building the concept of a global electronic health record so that all of your health data can live in one place on the blockchain. As opposed to the current systems we have in the U.S. and many countries where I go to my general practitioner doctor, and they have some of my health records; my dentist has some health records,” Bitcoin Association Founding President Jimmy Nguyen said in a presentation of enterprise solutions built on the Bitcoin SV blockchain during an event in Ljubljana, Slovenia last year.

This movement could not have come at a more appropriate time. As people realize the value of data during this time of the pandemic, now is the time to make waves and enact the changes needed for people to own their data and benefit from it. Not only that, the global healthcare database is built on the Bitcoin SV blockchain, which provides transparency, security, scalability, and immutability to data. Furthermore, the Bitcoin SV blockchain can accommodate big data and low-cost microtransactions and operates on an economically incentivized model, which makes it perfect for a global healthcare database.

“Times are changing, and a greater focus is being placed on interoperability and the patients’ absolute right to increased access to their health data. We will spearhead and shepherd this process; it’s high time that there was a centralized location for healthcare data, controlled and permissioned by the patient, that they and their team of providers can access at any time,” Ron Austring, EHR Data Chief Scientist, explained.

Because patients own their health data, they will be asked for permission whenever their data is needed for use by various institutions, and they will be paid for it. This is in contrast to the current system where only big businesses profit off of collecting people’s data. And this is why there is a need for change. People must come together to revolutionize the system so they may claim back ownership of their data.

Visit to become part of this movement and to learn more on how to share your stories.

Author: Makkie Maclang





Deep Learning vs Machine Learning: How an Emerging Field Influences Traditional Computer Programming




When two different concepts are greatly intertwined, it can be difficult to separate them as distinct academic topics. That might explain why it’s so difficult to separate deep learning from machine learning as a whole. Considering the current push for both automation as well as instant gratification, a great deal of renewed focus has been heaped on the topic.

Everything from automated manufacturing workflows to personalized digital medicine could potentially grow to rely on deep learning technology. Defining the exact aspects of this technical discipline that will revolutionize these industries is, however, admittedly much more difficult. Perhaps it’s best to consider deep learning in the context of a greater movement in computer science.

Defining Deep Learning as a Subset of Machine Learning

Machine learning and deep learning are essentially two sides of the same coin. Deep learning is a specific discipline belonging to a much larger field that includes a wide variety of trained artificially intelligent agents that can predict the correct response in an equally wide array of situations. What makes deep learning distinct from all of these other techniques, however, is that it focuses almost exclusively on teaching agents to accomplish a specific goal by learning the best possible action in a number of virtual environments.

Traditional machine learning algorithms usually teach artificial nodes how to respond to stimuli by rote memorization. This is somewhat similar to human teaching techniques that consist of simple repetition, and therefore might be thought of as the computerized equivalent of a student running through times tables until they can recite them. While this is effective in a way, artificially intelligent agents educated in such a manner may not be able to respond to any stimulus outside of the realm of their original design specifications.

That’s why deep learning specialists have developed alternative algorithms that are considered to be somewhat superior to this method, though they are admittedly far more hardware-intensive in many ways. Subroutines used by deep learning agents may be based around generative adversarial networks, convolutional neural node structures, or a practical form of restricted Boltzmann machine. These stand in sharp contrast to the binary trees and linked lists used by conventional machine learning firmware as well as a majority of modern file systems.

Self-organizing maps have also been widely used in deep learning, though their applications in other AI research fields have typically been much less promising. When it comes to defining the deep learning vs machine learning debate, however, it’s highly likely that technicians will be looking more for practical applications than for theoretical academic discussion in the coming months. Suffice it to say that machine learning encompasses everything from the simplest AI to the most sophisticated predictive algorithms, while deep learning constitutes a more selective subset of these techniques.

Practical Applications of Deep Learning Technology

Depending on how a particular program is authored, deep learning techniques could be deployed along supervised or semi-supervised neural networks. Theoretically, it’d also be possible to do so via a completely unsupervised node layout, and it’s this technique that has quickly become the most promising. Unsupervised networks may be useful for medical image analysis, since this application often presents unique pieces of graphical information to a computer program that have to be tested against known inputs.

Traditional binary tree or blockchain-based learning systems have struggled to identify the same patterns in dramatically different scenarios, because the information remains hidden in a structure that would have otherwise been designed to present data effectively. It’s essentially a natural form of steganography, and it has confounded computer algorithms in the healthcare industry. However, this new type of unsupervised learning node could virtually educate itself on how to match these patterns even in a data structure that isn’t organized along the normal lines that a computer would expect it to be.

Others have proposed implementing semi-supervised artificially intelligent marketing agents that could eliminate much of the concern over ethics regarding existing deal-closing software. Instead of trying to reach as large a customer base as possible, these tools would calculate the odds of any given individual needing a product at a given time. In order to do so, it would need certain types of information provided by the organization that it works on behalf of, but it would eventually be able to predict all further actions on its own.

While some companies are currently relying on tools that utilize traditional machine learning technology to achieve the same goals, these are often wrought with privacy and ethical concerns. The advent of deep structured learning algorithms has enabled software engineers to come up with new systems that don’t suffer from these drawbacks.

Developing a Private Automated Learning Environment

Conventional machine learning programs often run into serious privacy concerns because of the fact that they need a huge amount of input in order to draw any usable conclusions. Deep learning image recognition software works by processing a smaller subset of inputs, thus ensuring that it doesn’t need as much information to do its job. This is of particular importance for those who are concerned about the possibility of consumer data leaks.

Considering new regulatory stances on many of these issues, compliance has also quickly become an important concern. As toxicology labs begin using bioactivity-focused deep structured learning packages, it’s likely that regulators will express additional concerns in regards to the amount of information needed to perform any given task with this kind of sensitive data. Computer scientists have had to scale back what some have called a veritable fire hose of bytes that tell more of a story than most would be comfortable with.

In a way, these developments hearken back to an earlier time when it was believed that each process in a system should only have the amount of privileges necessary to complete its job. As machine learning engineers embrace this paradigm, it’s highly likely that future developments will be considerably more secure simply because they don’t require the massive amount of data mining necessary to power today’s existing operations.





What did COVID do to all our models?





An interview with Dean Abbott and John Elder about change management, complexity, interpretability, and the risk of AI taking over humanity.

By Heather Fyson, KNIME


After the KNIME Fall Summit, the dinosaurs went back home… well, switched off their laptops. Dean Abbott and John Elder, longstanding data science experts, were invited to the Fall Summit by Michael to join him in a discussion of The Future of Data Science: A Fireside Chat with Industry Dinosaurs. The result was a sparkling conversation about data science challenges and new trends. Since switching off the studio lights, Rosaria has distilled and expanded some of the highlights about change management, complexity, interpretability, and more in the data science world. Let’s see where it brought us.

What is your experience with change management in AI, when reality changes and models have to be updated? What did COVID do to all our models?

[Dean] Machine Learning (ML) algorithms assume consistency between past and future. When things change, the models fail. COVID has changed our habits, and therefore our data. Pre-COVID models struggle to deal with the new situation.

[John] A simple example would be the Traffic layer on Google Maps. After lockdowns hit country after country in 2020, Google Maps traffic estimates were very inaccurate for a while. It had been built on fairly stable training data but now that system was thrown completely out of whack.

How do you figure out when the world has changed and the models don’t work anymore?

[Dean] Here’s a little trick I use: I partition my data by time and label records as “before” and “after”. I then build a classification model to discriminate the “after” vs. the “before” from the same inputs the model uses. If the discrimination is possible, then the “after” is different from the “before”, the world has changed, the data has changed, and the models must be retrained.
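Dean's trick can be sketched in a few lines of NumPy: label "before" rows 0 and "after" rows 1, train a small classifier on the model's own inputs, and treat above-chance accuracy as evidence of drift. The data here is synthetic, and the hand-rolled logistic regression just keeps the example dependency-free:

```python
# Minimal sketch of the "before vs. after" drift check: if a classifier can
# tell old rows from new ones using the model's own inputs, the world changed.
import numpy as np

rng = np.random.default_rng(0)
before = rng.normal(0.0, 1.0, size=(500, 3))   # pre-change feature rows
after = rng.normal(0.8, 1.0, size=(500, 3))    # post-change rows: shifted mean

X = np.vstack([before, after])
y = np.concatenate([np.zeros(500), np.ones(500)])

# Tiny logistic regression trained by gradient descent (no ML library needed).
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / len(y))
    b -= 0.5 * np.mean(p - y)

accuracy = np.mean(((1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5) == y)
print(accuracy)  # well above 0.5 here, so "after" is distinguishable from "before"
```

If accuracy hovers near 0.5, the discriminator cannot separate the periods and retraining is likely unnecessary.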

How complicated is it to retrain models in projects, especially after years of customization?

[John] Training models is usually the easiest step of all! The vast majority of otherwise successful projects die in the implementation phase. The greatest time is spent in the data cleansing and preparation phase. And the most problems are missed or made in the business understanding / project definition phase. So if you understand what the flaw is and can obtain new data and have the implementation framework in place, creating a new model is, by comparison, very straightforward.

Based on your decades-long experience, how complex is it to put together a really functioning Data Science application?

[John] It can vary, of course, by complexity. Most of our projects get functioning prototypes at least in a few months. But for all, I cannot stress enough the importance of feedback: You have to talk to people much more often than you want to. And listen! We learn new things about the business problem, the data, or constraints each time. Not all of us quantitative people are skilled at speaking with humans, so it often takes a team. But the whole team of stakeholders has to learn to speak the same language.

[Dean] It is important to talk to our business counterpart. People fear change and don’t want to change the current status. One key problem really is psychological. The analysts are often seen as an annoyance. So, we have to build trust between the business counterpart and the analytics geeks. The start of a project should always include the following step: Sync up domain experts / project managers, the analysts, and the IT and infrastructure (DevOps) team so everyone is clear on the objectives of the project and how it will be executed. Analysts are number 11 on the top 10 list of people they have to see every day! Let’s avoid embodying data scientist arrogance: “The business can’t understand us/our techniques, but we know what works best”. What we don’t understand, however, is that the domain experts are actually experts in the domain we are working in! Translation of data science assumptions and approaches into language that is understood by the domain experts is key!

The latest trend now is deep learning; apparently, it can solve everything. I recently got a question from a student asking, “Why do we need to learn other ML algorithms if deep learning is the state of the art for solving data science problems?”

[Dean] Deep learning sucked a lot of the oxygen out of the room. It feels so much like the early 1990s, when neural networks ascended with similar optimism! Deep learning is a set of powerful techniques, for sure, but they are hard to implement and optimize. XGBoost and ensembles of trees are also powerful but currently more mainstream. The vast majority of problems we need to solve using advanced analytics really don’t require complex solutions, so start simple; deep learning is overkill in these situations. It is best to apply the Occam’s razor principle: if two models perform the same, adopt the simpler one.

About complexity. The other trend, opposite to deep learning, is ML interpretability. Here, you greatly (excessively?) simplify the model in order to be able to explain it. Is interpretability that important?

[John] I often find myself fighting interpretability. It is nice, sure, but often comes at too high a cost of the most important model property: reliable accuracy. But many stakeholders believe interpretability is essential, so it becomes a barrier for acceptance. Thus, it is essential to discover what kind of interpretability is needed. Perhaps it is just knowing what the most important variables are? That’s doable with many nonlinear models. Maybe, as with explaining to credit applicants why they were turned down, one just needs to interpret outputs for one case at a time? We can build a linear approximation for a given point. Or, we can generate data from our black box model and build an “interpretable” model of any complexity to fit that data.

Lastly, research has shown that if users have the chance to play with a model – that is, to poke it with trial values of inputs and see its outputs, and perhaps visualize it – they get the same warm feelings of interpretability. Overall, trust – in the people and technology behind the model – is necessary for acceptance, and this is enhanced by regular communication and by including the eventual users of the model in the build phases and decisions of the modeling process.
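John's suggestion of generating data from a black-box model and fitting an interpretable model to it is often called a surrogate model. Here is a minimal sketch of the idea with scikit-learn; the dataset, models, and thresholds are our illustration, not the interviewees':

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# A "black box": a random forest fit on synthetic data
X, y = make_regression(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Query the black box on fresh inputs and fit an interpretable
# surrogate (a shallow decision tree) to its *predictions*
X_new = np.random.default_rng(0).normal(size=(2000, 5))
y_box = black_box.predict(X_new)
surrogate = DecisionTreeRegressor(max_depth=3).fit(X_new, y_box)

# The surrogate is a small, human-readable approximation of the black box;
# its score against y_box measures fidelity to the black box, not to reality
print(surrogate.get_depth())
print(surrogate.score(X_new, y_box))
```

The surrogate's fidelity score tells you how faithfully the simple model mimics the complex one; a low score means the black box's behavior cannot be captured at that level of simplicity.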

[Dean] By the way, KNIME Analytics Platform has a great feature for quantifying the importance of the input variables in a Random Forest! The Random Forest Learner node outputs the statistics of candidate and splitting variables. Remember that when you use the Random Forest Learner node.
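Outside KNIME, the same kind of variable-importance readout is available in most Random Forest implementations. A minimal scikit-learn sketch (our example, not part of the interview):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

data = load_diabetes()
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(data.data, data.target)

# Impurity-based importances, one per input variable, normalized to sum to 1
for name, imp in sorted(zip(data.feature_names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

Note that impurity-based importances can favor high-cardinality features; permutation importance is a common cross-check.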

There is an increase in requests for explanations of what a model does. For example, for some security classes, the European Union is demanding verification that the model doesn’t do what it’s not supposed to do. If we have to explain it all, then maybe Machine Learning is not the way to go. No more Machine Learning?

[Dean] Maybe full explainability is too hard to obtain, but we can achieve progress by performing a grid search on model inputs to create something like a scorecard describing what the model does. This is akin to regression testing in hardware and software QA. If a formal proof of what a model is doing is not possible, then let’s test and test and test! Input shuffling and target shuffling can help to achieve a rough representation of the model's behavior.
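Target shuffling, which Dean mentions, can be sketched in a few lines: break the link between inputs and target by permuting the target, refit, and see how much apparent "skill" survives. The data and thresholds below are our illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X[:, 0] * 2.0 + rng.normal(scale=0.5, size=300)  # real signal in feature 0

model = LinearRegression()
real_score = cross_val_score(model, X, y, cv=5).mean()

# Target shuffling: permute y to destroy any real X->y relationship.
# Scores near zero here suggest the real score reflects genuine signal,
# not an artifact of the modeling process.
shuffled_scores = []
for _ in range(20):
    y_perm = rng.permutation(y)
    shuffled_scores.append(cross_val_score(model, X, y_perm, cv=5).mean())

print(real_score)                      # high: real relationship
print(float(np.mean(shuffled_scores))) # near (or below) zero
```

The gap between the real score and the shuffled-target distribution is the "rough representation of model behavior" Dean describes: it quantifies how easily your pipeline can manufacture skill from noise.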

[John] Talking about understanding what a model does, I would like to raise the problem of reproducibility in science. A huge proportion of journal articles in all fields — 65 to 90% — are believed to be unreplicable. This is a true crisis in science. Medical papers try to tell you how to reproduce their results. ML papers don’t yet seem to care about reproducibility. A recent study showed that only 15% of AI papers share their code.

Let’s talk about Machine Learning Bias. Is it possible to build models that don’t discriminate?

[John] (To be a nerd for a second: that word is unfortunately overloaded. To “discriminate” in the ML world is your very goal: to make a distinction between two classes.) But to your real question, it depends on the data (and on whether the analyst is clever enough to adjust for weaknesses in the data). The models will pull out of the data the information reflected therein. The computer knows nothing about the world except for what’s in the data in front of it. So the analyst has to curate the data — take responsibility for those cases reflecting reality. If certain types of people, for example, are under-represented, then the model will pay less attention to them and won’t be as accurate on them going forward. I ask, “What did the data have to go through to get here?” (that is, into this dataset) to think of how other cases might have dropped out along the way (that is survivor bias). A skilled data scientist can look for such problems and think of ways to adjust or correct for them.

[Dean] The bias is not in the algorithms. The bias is in the data. If the data is biased, we’re working with a biased view of the world. Math is just math, it is not biased.

Will AI take over humanity?!

[John] I believe AI is just good engineering. Will AI exceed human intelligence? In my experience anyone under 40 believes yes, this is inevitable, and most over 40 (like me, obviously): no! AI models are fast, loyal, and obedient. Like a good German Shepherd dog, an AI model will go and get that ball, but it knows nothing about the world other than the data it has been shown. It has no common sense. It is a great assistant for specific tasks, but actually quite dimwitted.

[Dean] On that note, I would like to report two quotes made by Marvin Minsky in 1961 and 1970, from the dawn of AI, that I think describe well the future of AI.

“Within our lifetime some machines may surpass us in general intelligence” (1961)

“In three to eight years we’ll have a machine with the intelligence of a human being” (1970)

These ideas have been around for a long time. Here is one reason why AI will not solve all the problems: we judge its behavior based on one number, and one number only: the model error. For example, models predicting stock prices over the next five years, built using root mean square error as the error metric, cannot possibly paint the full picture of what the data are actually doing, and that severely hampers the model's ability to flexibly uncover the patterns. We all know that RMSE is too coarse a measure. Deep learning algorithms will continue to get better, but we also need to get better at judging how good a model really is. So, no! I do not think that AI will take over humanity.
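Dean's point that a single error number hides what a model actually does can be seen in a toy example (ours, not his): two prediction vectors with identical RMSE but very different failure modes.

```python
import numpy as np

y_true = np.arange(10, dtype=float)

# Model A: small, uniform error on every point
pred_a = y_true + 1.0

# Model B: perfect on nine points, badly wrong on one
pred_b = y_true.copy()
pred_b[0] += np.sqrt(10.0)

def rmse(pred):
    return float(np.sqrt(np.mean((y_true - pred) ** 2)))

print(rmse(pred_a))  # 1.0
print(rmse(pred_b))  # 1.0, the same score despite very different behavior
```

An always-slightly-wrong model and a mostly-perfect-but-occasionally-disastrous model earn the same RMSE, yet in practice (trading, medicine, safety systems) they are not remotely interchangeable.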

We have reached the end of this interview. We would like to thank Dean and John for their time and their nuggets of knowledge. Let’s hope we meet again soon!

About Dean Abbott and John Elder

Dean Abbott is Co-Founder and Chief Data Scientist at SmarterHQ. He is an internationally recognized expert and innovator in data science and predictive analytics, with three decades of experience solving problems in omnichannel customer analytics, fraud detection, risk modeling, text mining & survey analysis. Included frequently in lists of pioneering data scientists, he is a popular keynote speaker and workshop instructor at conferences worldwide, also serving on Advisory Boards for the UC/Irvine Predictive Analytics and UCSD Data Science Certificate programs. He is the author of Applied Predictive Analytics (Wiley, 2014) and co-author of The IBM SPSS Modeler Cookbook (Packt Publishing, 2013).

John Elder founded Elder Research, America’s largest and most experienced data science consultancy, in 1995. With offices in Charlottesville VA, Baltimore MD, Raleigh NC, Washington DC, and London, they’ve solved hundreds of challenges for commercial and government clients by extracting actionable knowledge from all types of data. Dr. Elder co-authored three books — on practical data mining, ensembles, and text mining — two of which won “book of the year” awards. John has created data mining tools, was a discoverer of ensemble methods, chairs international conferences, and is a popular workshop and keynote speaker.

Bio: Heather Fyson is the blog editor at KNIME. Initially on the Event Team, her background is actually in translation & proofreading, so by moving to the blog in 2019 she has returned to her real passion of working with texts. P.S. She is always interested to hear your ideas for new articles.

Original. Reposted with permission.



Shapash: Making Machine Learning Models Understandable

Establishing an expectation for trust around AI technologies may soon become one of the most important skills provided by Data Scientists. Significant research investments are underway in this area, and new tools are being developed, such as Shapash, an open-source Python library that helps Data Scientists make machine learning models more transparent and understandable.

By Yann Golhen, MAIF, Lead Data Scientist.

Shapash Web App Demo

Shapash by MAIF is a Python toolkit that helps data scientists understand machine learning models. It makes it easier to share and discuss model interpretability with non-data specialists: business analysts, managers, and end-users.

Concretely, Shapash provides easy-to-read visualizations and a web app. Shapash displays results with appropriate wording (preprocessing inverse/post-processing). Shapash is useful in an operational context as it enables data scientists to use explicability from exploration to production: You can easily deploy local explainability in production to complete each of your forecasts/recommendations with a summary of the local explainability.

In this post, we will present the main features of Shapash and how it operates. We will illustrate the implementation of the library on a concrete use case.

Elements of context

Interpretability and explicability of models are hot topics. There are many articles, publications, and open-source contributions about them, but these contributions do not all deal with the same issues and challenges.

Most data scientists use these techniques for many reasons: to better understand their models, to check that they are consistent and unbiased, as well as for debugging.

However, there is more to it:

Intelligibility matters for pedagogic purposes. Intelligible Machine Learning models can be debated with people that are not data specialists: business analysts, final users…

Concretely, there are two steps in our Data Science projects that involve non-specialists:

Exploratory step & Model fitting

At this step, data scientists and business analysts discuss what is at stake and define the essential data they will integrate into the projects. It requires understanding the subject well and the main drivers of the problem we are modeling.

To do this, data scientists study global explicability, features importance, and which role the model’s top features play. They can also locally look at some individuals, especially outliers. A Web App is interesting at this phase because they need to look at visualizations and graphics. Discussing these results with business analysts is interesting to challenge the approach and validate the model.

Deploying the model in a production environment

That’s it! The model is validated, deployed, and gives predictions to the end-users. Local explicability can bring them a lot of value, but only if there is a way to provide them with a good, useful, and understandable summary. It will be valuable to them for two reasons:

  • Transparency brings trust: they will trust models if they understand them.
  • Humans stay in control: no model is 100% reliable. When they can understand the algorithm’s outputs, users can overturn the algorithm's suggestions if they think they rest on incorrect data.

Shapash has been developed to help data scientists to meet these needs.

Shapash key features

  • Easy-to-read visualizations, for everyone.
  • A web app: To understand how a model works, you have to look at multiple graphs, features importance, and global contribution of a feature to a model. A web app is a useful tool for this.
  • Several methods to show results with appropriate wording (preprocessing inverse, post-processing). You can easily add your data dictionaries, category_encoders object, or sklearn ColumnTransformer for more explicit outputs.
  • Functions to easily save Pickle files and to export results in tables.
  • Explainability summary: the summary is configurable to fit your needs and to focus on what matters for local explicability.
  • Ability to easily deploy in a production environment and to complete every prediction/recommendation with a local explicability summary for each operational app (batch or API).
  • Shapash is open to several ways of proceeding: it can be used to easily access results or to work on better wording. Very few arguments are required to display results, but the more you work on cleaning and documenting the dataset, the clearer the results will be for the end-user.

Shapash works for regression, binary classification, or multiclass problems. It is compatible with many models: Catboost, Xgboost, LightGBM, Sklearn Ensemble, Linear models, and SVM.

Shapash is based on local contributions calculated with Shap (Shapley values), Lime, or any technique that allows computing summable local contributions.
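To make "summable local contributions" concrete, here is a hand-rolled sketch (our illustration, not Shapash code): for a plain linear model, each feature's contribution to one prediction can be written so that the contributions sum to the prediction minus the average prediction, which is exactly the additivity property Shapash relies on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)

# For a linear model, the local contribution of feature j for a row x is
# coef_j * (x_j - mean_j). These contributions sum to
# (prediction for x) - (mean prediction).
x0 = X[0]
contribs = model.coef_ * (x0 - X.mean(axis=0))
reconstructed = model.predict(X).mean() + contribs.sum()

print(np.isclose(reconstructed, model.predict(x0[None])[0]))  # True
```

Shap and Lime generalize this idea to nonlinear models; the per-feature numbers differ, but the "contributions add up to the prediction" structure is the same.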


You can install the package through pip:

$ pip install shapash

Shapash Demonstration

Let’s use Shapash on a concrete dataset. In the rest of this article, we will show you how Shapash can explore models.

We will use the famous “House Prices” dataset from Kaggle to fit a regressor and predict house prices! Let’s start by loading the Dataset:

import pandas as pd
from shapash.data.data_loader import data_loading

house_df, house_dict = data_loading('house_prices')
# Separate the target from the features
y_df = house_df['SalePrice'].to_frame()
X_df = house_df[house_df.columns.difference(['SalePrice'])]

Encode the categorical features:

from category_encoders import OrdinalEncoder

categorical_features = [col for col in X_df.columns if X_df[col].dtype == 'object']
encoder = OrdinalEncoder(cols=categorical_features).fit(X_df)
X_df = encoder.transform(X_df)

Train, test split and model fitting:

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

Xtrain, Xtest, ytrain, ytest = train_test_split(X_df, y_df, train_size=0.75)
reg = RandomForestRegressor(n_estimators=200, min_samples_leaf=2).fit(Xtrain, ytrain)

And predict test data:

y_pred = pd.DataFrame(reg.predict(Xtest), columns=['pred'], index=Xtest.index) 

Let’s discover and use Shapash SmartExplainer.

Step 1 — Import

from shapash.explainer.smart_explainer import SmartExplainer 

Step 2 — Initialise a SmartExplainer Object

xpl = SmartExplainer(features_dict=house_dict) # Optional parameter 
  • features_dict: dict that specifies the meaning of each column name of the x pd.DataFrame.

Step 3 — Compile

xpl.compile(
    x=Xtest,
    model=reg,
    preprocessing=encoder,  # Optional: uses the encoder's inverse_transform method
    y_pred=y_pred           # Optional
)

The compile method also accepts an optional postprocess parameter, which makes it possible to apply custom functions (regex, mapping dict, …) for better wording.

Now, we can display results and understand how the regression model works!

Step 4 — Launching the Web App

app = xpl.run_app() 

The web app link appears in Jupyter output (access the demo here).

There are four parts to this Web App, and each interacts with the others to help explore the model easily:

Features Importance: you can click on each feature to update the contribution plot below.

Contribution plot: How does a feature influence the prediction? It displays a violin or scatter plot of each local contribution of the feature.

Local Plot:

  • Local explanation: which features contribute the most to the predicted value.
  • You can use several buttons/sliders/lists to configure the summary of this local explainability. The different parameters you can tune the summary with are described below, together with the filter method.
  • This web app is a useful tool to discuss with business analysts the best way to summarize the explainability to meet operational needs.

Selection Table: It allows the Web App user to select:

  • A subset to focus the exploration on this subset
  • A single row to display the associated local explanation

How do you use the data table to select a subset? At the top of the table, just below the name of the column that you want to use to filter, specify:

  • =Value, >Value, <Value
  • If you want to select every row containing a specific word, just type that word without “=”

There are a few options available on this web app (top right button). The most important one is probably the size of the sample (default: 1000). To avoid latency, the web app relies on a sample to display the results. Use this option to modify this sample size.

To kill the app:

app.kill()
Step 5 — The plots

All the plots are available in jupyter notebooks, and the paragraph below describes the key points of each plot.

Feature Importance

The features_importance plot accepts a selection parameter to compare the feature importance of a subset. This is useful for detecting specific behavior in a subset.

subset = [168, 54, 995, 799, 310, 322, 1374, 1106, 232, 645, 1170, 1229, 703, 66, 886, 160, 191, 1183, 1037, 991, 482, 725, 410, 59, 28, 719, 337, 36]
xpl.plot.features_importance(selection=subset)

Contribution plot

Contribution plots are used to answer questions like:

How does a feature impact my prediction? Does it contribute positively? Is the feature contributing increasingly? Decreasingly? Are there any threshold effects? For a categorical variable, how does each modality contribute? This plot complements the feature importance for interpretability and the global intelligibility of the model, helping you better understand the influence of a feature on the model.

There are several parameters on this plot. Note that the plot displayed adapts depending on whether you are interested in a categorical or continuous variable (Violin or Scatter) and depending on the type of use case you address (regression, classification).


Contribution plot applied to a continuous feature.

Classification Case: Titanic Classifier — Contribution plot applied to categorical feature.

Local plot

You can use local plots for local explainability of models.

The filter() and local_plot() methods allow you to test and choose the best way to summarize the signal that the model has picked up. You can use it during the exploratory phase. You can then deploy this summary in a production environment for the end-user to understand in a few seconds what are the most influential criteria for each recommendation.

We will publish a second article to explain how to deploy local explainability in production.

Combine the filter and local_plot methods

Use the filter method to specify how to summarize local explainability. You have four parameters to configure your summary:

  • max_contrib: maximum number of criteria to display
  • threshold: minimum value of the contribution (in absolute value) necessary to display a criterion
  • positive: display only positive contribution? Negative? (default None)
  • features_to_hide: list of features you don’t want to display

After defining these parameters, we can display the results with the local_plot() method, or export them with to_pandas().


Export to pandas DataFrame:

summary_df = xpl.to_pandas()

Compare plot

With the compare_plot() method, the SmartExplainer object makes it possible to understand why two or more individuals do not have the same predicted values. The most decisive criterion appears at the top of the plot.

xpl.plot.compare_plot(row_num=[0, 1, 2, 3, 4], max_features=8) 

We hope that Shapash will be useful in building trust in AI. Thank you in advance to all those who give us their feedback and ideas. Shapash is open source! Feel free to contribute by commenting on this post or directly in the GitHub discussions.

Original. Reposted with permission.


