By German Osin, Senior Solutions Architect at Provectus.
Artificial intelligence and machine learning have reached an inflection point. In 2020, organizations of various sizes, across diverse industries, began evolving their ML projects from experimentation to production on an industrial scale. While doing so, they realized they were wasting a lot of time and effort on feature definition and extraction.
A feature store is a fundamental component of the ML stack and of any robust data infrastructure because it enables efficient feature engineering and management. It also allows for simple re-use of features, feature standardization across the company, and feature consistency between offline and online models. A centralized, scalable feature store allows organizations to innovate faster and drive ML processes at scale.
The team at Provectus has built several feature stores, and we would like to share our experience and lessons learned. In this article, we will explore how feature stores can help eliminate rework and enforce data traceability from source to model across teams. We will look into the specifics of building a scalable feature store and discuss how to achieve consistency between real-time and training features, and to improve reproducibility with time traveling for data.
What Is a Feature Store?
A Feature Store is a data management layer for machine learning features. ML features are measurable properties of phenomena under observation, like raw words, pixels, sensor values, rows of data in a data store, fields in a CSV file, aggregates (min, max, sum, mean), or derived representations (embedding or cluster).
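As a minimal illustration (the schema and field names here are hypothetical), aggregate features such as min, max, sum, and mean can be derived from raw rows like this:

```python
from statistics import mean

# Raw observations: one row per user transaction (hypothetical schema).
transactions = [
    {"user_id": "u1", "amount": 20.0},
    {"user_id": "u1", "amount": 40.0},
    {"user_id": "u2", "amount": 10.0},
]

def aggregate_features(rows):
    """Derive per-user aggregate features (min, max, sum, mean) from raw rows."""
    by_user = {}
    for row in rows:
        by_user.setdefault(row["user_id"], []).append(row["amount"])
    return {
        user: {
            "amount_min": min(amounts),
            "amount_max": max(amounts),
            "amount_sum": sum(amounts),
            "amount_mean": mean(amounts),
        }
        for user, amounts in by_user.items()
    }

features = aggregate_features(transactions)
```

Each derived value, such as `features["u1"]["amount_mean"]`, is exactly the kind of measurable property a feature store manages and serves.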
From a business perspective, feature stores offer two major benefits:
- Better ROI from feature engineering: collaboration, sharing, and re-use of features reduce the cost per model
- Faster time to market for new models, driven by the increased productivity of ML engineers. A feature store decouples the storage implementation and the feature-serving API from their work, freeing them to focus on models rather than on the latency problems of efficient online serving.
Not every use case requires a centralized, scalable feature store. Feature stores pay off for projects where models need a stateful, ever-changing representation of the system. Examples of such use cases include demand forecasting, personalization and recommendation engines, dynamic pricing optimization, supply chain optimization, logistics and transportation optimization, fraud detection, and predictive maintenance.
Concepts of a Feature Store
A standardized feature store revolves around a few key concepts:
- Online Feature Store. Online applications look up a feature vector here, which is then sent to an ML model for predictions.
- ML Specific Metadata. Enables the discovery and re-use of features.
- ML Specific API and SDK. High-level operations for fetching training feature sets and online access.
- Materialized Versioned Datasets. Maintains versions of feature sets used to train ML models.
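These four concepts can be sketched as a toy Python API. This is not the API of any particular product; it only illustrates how metadata for discovery, an online store, and materialized versioned datasets fit behind one ML-specific SDK:

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    """Toy feature store illustrating the four concepts above (hypothetical API)."""
    _metadata: dict = field(default_factory=dict)   # ML-specific metadata for discovery
    _online: dict = field(default_factory=dict)     # online store: entity -> feature vector
    _datasets: dict = field(default_factory=dict)   # materialized versioned datasets

    def register_feature(self, name, description, owner):
        # ML-specific metadata enables discovery and re-use.
        self._metadata[name] = {"description": description, "owner": owner}

    def discover(self, keyword):
        return [n for n, m in self._metadata.items() if keyword in m["description"]]

    def ingest_online(self, entity_id, vector):
        self._online[entity_id] = vector

    def get_online_features(self, entity_id):
        # Online applications fetch the latest feature vector for serving.
        return self._online[entity_id]

    def materialize(self, dataset, version, rows):
        self._datasets[(dataset, version)] = rows

    def get_training_dataset(self, dataset, version):
        # Pinning a dataset version keeps model training reproducible.
        return self._datasets[(dataset, version)]

store = FeatureStore()
store.register_feature("amount_mean", "mean transaction amount", owner="ml-platform")
store.ingest_online("u1", {"amount_mean": 30.0})
store.materialize("churn_training", version=1,
                  rows=[{"user_id": "u1", "amount_mean": 30.0, "label": 0}])
```

A real implementation would delegate each of these methods to existing storage and catalog systems rather than holding data in memory.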
Image by Author.
All of these concepts are represented in the image above. Analytical data is taken from a data lake and pushed through the feature engineering pipeline, whose output is stored in the feature store. From there, ML engineers can discover features, use them to train new models, and then re-use them for serving.
These four concepts are supported by multiple products. The leaders of the market are Feast, Tecton, Hopsworks, and AWS SageMaker Feature Store. We will focus on open source products: Feast and Hopsworks.
Feast is an ML service that helps teams bridge the gap between data and machine learning models. It allows teams to register, ingest, serve, and monitor features in production.
This service is in the active development stage, but it has already been battle-tested with GoJek, Farfetch, Postmates, and Zulily. It can be integrated with Kubeflow and is backed by Kubeflow’s strong community.
As of 2020, Feast is a GCP-only service. It is a bit infrastructure-heavy, lacks composability, and does not offer data versioning. Note that the company plans to address these challenges in its 2021 roadmap. And since November 2020, Feast is backed by Tecton as the open source counterpart to its commercial platform.
Image by Author.
Hopsworks is an Enterprise Platform for developing and operating AI applications. It allows teams to manage ML features quickly and efficiently. The team behind Hopsworks are feature store evangelists, and they offer a lot of great educational content.
The Hopsworks feature store can be integrated with most Python libraries for ingestion and training. It also supports offline stores with time travel. Most importantly, it is AWS-, GCP-, Azure-, and on-premise ready.
What makes Hopsworks a challenge to use is its heavy reliance on the HopsML infrastructure. Also, the Hopsworks online store may not satisfy all latency requirements.
Image by Author.
Challenges of Modern Data Platforms
Before looking into the specifics of building feature stores, the challenges of modern data platforms have to be considered. Feature stores cannot be examined in isolation from the rest of the data and the ML infrastructure.
Traditionally, a canonical data lake architecture looks like this:
Image by Author.
This high-level architecture has components for sourcing, ingestion, processing, persisting, querying, and a consumer component. While it works fine for most tasks, it is not ideal.
Data access and discoverability can be a problem. Data is scattered across multiple data sources and services. This once helped protect data, but now it only adds a new layer of complexity and creates data silos. Such architecture entails the tedious process of managing AWS IAM roles, Amazon S3 policies, API Gateways, and database permissions. This becomes even more complicated in a multi-account setup. As a result, engineers are confused about what data exists and which is actually accessible to them since metadata is not discoverable by default. This creates an environment where investments in data and machine learning are curtailed due to data access issues.
Monolithic data teams are another issue to consider. Since data and machine learning teams are extremely specialized, they often have to operate out of context. Lack of ownership and domain context creates silos between data producers and data consumers, making it considerably more difficult for a backlogged data team to keep up with business demands. Data and ML Engineering fall victim to complex dependencies, failing to sync their operations. Any fast, end-to-end experimentation is not possible in such circumstances.
Machine learning experimentation infrastructure is uncharted territory. Traditional architecture lacks an experimentation component, which not only leads to data discovery and data access issues but also makes it more complex to maintain the reproducibility of datasets, ML pipelines, ML environments, and offline experiments. Although Amazon SageMaker and Kubeflow have made some progress here, reproducibility is problematic. Production experimentation frameworks are immature and cannot be entirely relied upon. As a result, it can take from three to six months to push one end-to-end experiment from data to production ML.
Scaling ML in production is complex and requires a lot of time and effort. While machine learning is mostly discussed as an offline activity (e.g., data collection, processing, model training, evaluating results, etc.), the ways models are used and served online in the real world are what actually matter. With a traditional architecture, you cannot access features during model serving in a unified and consistent way. You also cannot re-use features between multiple training pipelines and ML applications. The monitoring and maintenance of ML applications are also more complicated. As a result, the time and cost required to scale from 1 to 100 models in production grow exponentially, which limits the ability of organizations to innovate at the desired pace.
Emerging Architectural Shifts
To address these challenges, several architectural shifts have emerged:
- From Data Lakes to Hudi/Delta lakes. A data lake is not just a folder in Amazon S3. It is a feature-rich, fully managed layer for Data Ingestion, Incremental Processing, and Serving with ACID transactions and point-in-time queries.
- From Data Lakes to Data Mesh. Ownership of data domains, data pipelines, metadata, and APIs is moving away from centralized teams to product teams. Another impactful benefit is treating and owning data as a complete product rather than a side effect that nobody cares about.
- From Data Lakes to Data Infrastructure as a Platform. If the ownership of data is decentralized, platform components have to be unified and packaged into a reusable data platform.
- From Endpoint Protection to Global Data Governance. As part of the shift towards centralized data platforms, organizations are moving away from Endpoint Protection to Global Data Governance, which is a higher-level control plane for managing security and data access policies across available data sources.
- From Metadata Store to Global Data Catalog. Metadata stores like the Hive Metastore cannot aggregate many data sources. The industry needs a Global Data Catalog to support the user experience around data discovery, lineage, and versioning.
- Feature Store. Feature store is a new emerging component of the ML stack that enables scaling of ML experimentation and operations by adding a separate data management layer for ML Features.
All of these transformations are happening in parallel and should be thought of holistically. You cannot start designing a feature store and end up having separate data catalogs for features and for other data applications. While building a feature store, you have to rely on data versioning features, which can easily be part of other parallel projects.
Before moving forward, let's briefly review the four major components that drive the aforementioned transformation from disjointed ML jobs to MLOps, to provide wider context for the importance of feature stores.
#1 Delta/Hudi Lakes
ACID data lakes enable managed ingestion, efficient dataset versioning for ML training, cheap “deletes” to make it GDPR/CCPA compliant, and “upserts” for data ingestions. They also offer an audit log to keep track of dataset changes and ACID transactions while enforcing data quality through schemas. Delta and Hudi lakes bring stream processing to Big Data, providing fresh data much more efficiently than traditional batch processing.
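The upsert and time-travel semantics can be illustrated with a minimal commit-log sketch. This is in the spirit of Delta Lake and Apache Hudi but uses none of their actual APIs:

```python
class VersionedTable:
    """Minimal sketch of upserts, deletes, and point-in-time ("time travel")
    reads over a commit log. Not the API of Delta Lake or Hudi."""

    def __init__(self):
        self._commits = []  # each commit maps primary key -> row (or None for deletes)

    def upsert(self, rows):
        # Insert-or-update rows keyed by primary key; returns the commit version.
        self._commits.append({r["id"]: r for r in rows})
        return len(self._commits) - 1

    def delete(self, key):
        # GDPR/CCPA-style delete, recorded as a tombstone commit.
        self._commits.append({key: None})
        return len(self._commits) - 1

    def as_of(self, version):
        # Replay the commit log up to `version` (inclusive) to reconstruct state.
        state = {}
        for commit in self._commits[: version + 1]:
            state.update(commit)
        return {k: v for k, v in state.items() if v is not None}
```

Reading `as_of(old_version)` is what lets a training pipeline reproduce exactly the dataset a model originally saw, while the tombstone commits keep an audit trail of deletions.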
#2 Global Data Governance
Because managing AWS IAM roles, Amazon S3 policies, API Gateways, and database permissions at the individual user level is no longer the standard, a company-wide data governance structure should be used to:
- Accelerate privacy operations with data you already have. Automate business processes, data mapping, and PI discovery and classification for privacy workflows.
- Operationalize policies in a central location. Govern privacy policies to ensure policies are effectively managed across the enterprise. Define and document workflows, traceability views, and business process registers.
- Scale compliance across multiple regulations. Use a platform designed and built with privacy in mind that can be easily extended to support new regulations.
#3 Global Data Catalog
Although there is no single market leader here, Marquez, Amundsen, Apache Atlas, Collibra, Alation, and Data Hub are worth mentioning.
A global data catalog is extremely useful for answering questions such as Does this data exist? Where is it? What is the source of truth of this data? Do I have access? Who is the owner? Who are the users of this data? Are there existing assets I can re-use? Can I trust this data? Basically, it acts as a meta metadata store of sorts.
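A hypothetical catalog entry makes it clearer what kind of metadata answers those questions (the tables, owners, and lineage below are invented for illustration):

```python
# Sketch of catalog entries; real catalogs (Amundsen, DataHub, Atlas)
# store similar metadata about each dataset.
catalog = {
    "transactions": {
        "location": "s3://datalake/transactions/",   # Where is it?
        "owner": "payments-team",                    # Who is the owner?
        "source_of_truth": True,                     # Can I trust this data?
        "upstream": [],
        "consumers": ["fraud-model", "weekly-report"],  # Who are the users?
    },
    "user_features": {
        "location": "s3://feature-store/user_features/",
        "owner": "ml-platform",
        "source_of_truth": False,
        "upstream": ["transactions"],
        "consumers": ["churn-model"],
    },
}

def lineage(table):
    """Walk upstream dependencies to answer "what is the source of this data?"."""
    result = []
    for parent in catalog[table]["upstream"]:
        result.append(parent)
        result.extend(lineage(parent))
    return result
```

With such entries in place, "Are there existing assets I can re-use?" becomes a query over `consumers` and `lineage`, not a round of emails.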
#4 Reproducible ML Pipelines
The final component is reproducible ML pipelines for experimentation.
Image by Author.
The above chart represents the architecture for MLOps and reproducible experimentation pipelines. It starts with four inputs: ML model code, ML pipeline code, Infrastructure as Code, and a versioned dataset. The versioned dataset, the input to your machine learning pipeline, should be sourced from the feature store.
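One simple convention for reproducibility (hypothetical, not tied to any specific tool) is to pin all four inputs in a config and derive a deterministic experiment ID from it:

```python
import hashlib
import json

def experiment_id(config):
    """Derive a deterministic ID so a run can be reproduced exactly
    from the same pinned inputs."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

# The four pinned inputs from the chart above (all values are illustrative).
config = {
    "model_code_rev": "a1b2c3d",             # ML model code
    "pipeline_code_rev": "e4f5a6b",          # ML pipeline code
    "infra_template_rev": "0.3.1",           # Infrastructure as Code
    "dataset_version": "user_features@v42",  # versioned dataset from the feature store
}

run_id = experiment_id(config)
```

Any change to the dataset version (or any other input) yields a new ID, so results can always be traced back to the exact feature set that produced them.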
Modern Data Infrastructure
Now let’s look at modern data infrastructure.
Image by Author.
We have batch processing for raw data and stream processing for event data. We store processed artifacts in cold storage for business reports and in a near-real-time, incrementally updated hot index for our API. The same data can be used in both scenarios, and we use pub/sub systems to keep it consistent.
This is the traditional architecture for data platforms. Its goal is to provide consistency between cold and hot storage, and to allow for discoverability via the data catalog, data quality, and global security with fine-grained control on top of it.
Image by Author.
If we look at a feature store design, we will see features and infrastructure components almost identical to what we have in the data platform. In this case, the feature store is not a separate silo that brings yet another ingestion system, storage, catalog, and set of quality gates. It serves as a lightweight API between our data platform and ML tools, and it can be nicely integrated with everything that has already been done in your data infrastructure. It should be composable, lightweight, and free of opinionated design.
Image by Author.
When you begin designing and building your data infrastructure, consider the following “lessons learned” so far:
- Start by designing a consistent ACID Data Lake before investing in a Feature Store.
- Value from existing open-source products does not justify investments in integration and the dependencies they bring.
- A feature store is not a new infrastructure and data storage solution but a lightweight API and SDK integrated into your existing data infrastructure.
- Data Catalog, Data Governance, and Data Quality components are horizontal for the entire Data Infrastructure, including the feature store.
- There are no mature open source or cloud solutions for Global Data Catalog and Data Quality monitoring.
Image by Author.
This chart depicts the reference architecture we have been using for our customers. It features the services we have opted to use, but you should not be limited by our choices. The idea here is that you have to choose cold and hot storage based on your data workloads and on your business needs.
For hot storage, you may choose from DynamoDB, Cassandra, HBase, a traditional RDBMS like MySQL or PostgreSQL, or even Redis. It is important that your hot storage be composable, pluggable, and in alignment with your data infrastructure strategy.
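Pluggability can be achieved by coding the serving API against a thin interface. The interface below is a sketch; a production implementation would wrap Redis, DynamoDB, or Cassandra behind it:

```python
from abc import ABC, abstractmethod

class OnlineStore(ABC):
    """Pluggable hot-storage interface: the feature-serving API codes against
    this, so the backend can be swapped without touching ML code."""

    @abstractmethod
    def put(self, entity_id: str, features: dict) -> None: ...

    @abstractmethod
    def get(self, entity_id: str) -> dict: ...

class InMemoryStore(OnlineStore):
    # Stand-in backend for local development and tests; a production class
    # would instead delegate to e.g. a Redis or DynamoDB client.
    def __init__(self):
        self._data = {}

    def put(self, entity_id, features):
        self._data[entity_id] = dict(features)

    def get(self, entity_id):
        return self._data.get(entity_id, {})
```

Because models and serving code depend only on `OnlineStore`, switching from one hot-storage backend to another becomes a configuration change rather than a rewrite.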
For cold storage, Apache Hudi and Delta lake are our favorites. They offer such features as Time Travel, Incremental ingestion, and Materialized views.
There are some blank spaces on the diagram, which we hope to fill soon. For example, so far there is no off-the-shelf leader for the data catalog. Data quality tools are also in their early stages. For now, you can choose from Great Expectations or Apache Deequ, which are great tools, but they do not provide a complete solution.
Image by Author.
In the image above, the question marks occupy spaces where you can choose from solutions built by open source communities, build your own solution in-house, or collaborate with cloud providers (e.g., AWS’ latest addition — Amazon SageMaker Feature Store for Machine Learning).
Moving Forward with Feature Store
Although it is still early in the game for feature stores, organizations that are not just experimenting but actively moving machine learning projects to production have already realized the need to have a centralized repository to store, update, retrieve, and share features.
In this article, we have shown how to design and build such a repository. While some of the points featured here are debatable and open for feedback from the community, it is clear that:
- Your existing data infrastructure should cover at least 90% of feature store requirements, including streaming ingestion, consistency, data catalog, and versioning, to achieve the desired outcome.
- It makes sense to build a lightweight Feature Store API to integrate with your existing storage solutions in-house.
- You should collaborate with community and cloud vendors to maintain compatibility with standards and state-of-the-art ecosystems.
- You should be ready to migrate to a managed service or to an open-source alternative as the market matures.
Original. Reposted with permission.