By Nick Acosta, Developer Advocate, Alliances, Fivetran.
Converging data stacks
Although data analytics and data science are quite unique disciplines, there is considerable overlap in the data processing steps used to effectively achieve them. Both benefit from having access to large amounts of high-quality data stored in a centralized location as well as efficient and reliable processes to bring data from sources to these central repositories. Until recently, work has been duplicated with separate technologies for the different fields, in a data warehouse for analytics and business intelligence and as a data lake for data science and machine learning. A number of new services are working on merging these data stacks into a single environment, and this article will provide an overview of these services and the value they can add to a data organization.
Benefits of a Modern Data Science Stack
A modern data stack is a collection of technologies that bring and store multiple data sources to a centralized cloud data warehouse that has become popular in analytics. It can be extended to accommodate machine learning workloads into something called the modern data science stack. A modern data science stack removes silos and services performing duplicate work for data analytics and data science teams and moves models closer to data they are training with and using to make predictions, easing the shift from model-centric AI development to data-centric AI development. Many organizations have a considerable investment in data warehousing technologies to keep the environment secure, governed, operational, organized and performant, but data loses all of these qualities the moment it is sampled from a data warehouse to a data lake.
There are three more, less obvious benefits I would also like to highlight that I discovered since my transition to a modern data science stack. Having models stored in a data warehouse means their predictions can be stored as well and obtained via SQL queries. Performing table lookups instead of requiring embedded models or frameworks to use machine learning can go a long way in democratizing the use of machine learning in an organization. Also, because each step of the machine learning process happens in the same place on the same data, there is less of a chance for differences between data being sent to models at training time and at serving time, which means training-serving skew and the tools used to detect it can largely be avoided. Finally, since every step of the machine learning process can be performed as SQL, it becomes straightforward to compose the different steps together into a data pipeline with a tool like Apache Airflow.
Overview of In-Warehouse Machine Learning Services
BigQuery ML & Redshift ML
AWS and Google Cloud both recently added machine learning capabilities to their data warehouses, Redshift (left) and BigQuery (right).
BigQuery ML and Redshift ML add machine learning capabilities to BigQuery and Redshift, Google Cloud Platform’s and AWS’s respective data warehouses. AWS just recently announced the general availability of Redshift ML, and BigQuery ML has been available for some time.
Both extend SQL syntax with a CREATE MODEL command that allows for the creation of machine learning models and the specification of parameters such as model type, the table to be used as training data, and the target feature to generate predictions on. These new SQL commands leverage automated machine learning processes to provide data transformations and model tuning to identify the best performance among candidate models. Custom models can be used with each as well and offer considerable flexibility in model architecture and performance, but each has some restrictions in development. Custom models have to be saved as TensorFlow models to be used in BigQuery, and Redshift ML must use models deployed with the AWS data science development platform SageMaker. Once models are either trained or imported into the warehouse, SELECT statements can be used with FROM specifying a trained model in place of a table to invoke inferences, which can then easily be inserted into a predictions table in the warehouse for use, auditing, and error analysis.
Snowflake & Other Options
Snowflake has said that their “entire initiative in AI and ML has been to build extensibility into [their data warehouse] so you can interface with your tool of choice.” AWS’s Sagemaker platform mentioned earlier is an example of an ML tool Snowflake can integrate with, and Databricks is as well. More impressive development is happening at Databricks, which just released version 1.0.0 of Delta Lake, which converges data analytics and data science technology stacks from the opposite direction. Instead of bringing machine learning capabilities to a data warehouse, Delta Lake adds traditional analytics and business intelligence capabilities like ACID transactions to a data lake into a new data lakehouse architecture that provides similar benefits to a modern data science stack.
If your organization is interested in performing both data analytics and data science, there are a number of options to facilitate the two disciplines, but there’s too much in common between their data pipelines to have separate tooling for data ingestion, storage, and transformations for separate workloads. In-warehouse machine learning tools can be used to build a modern data science stack that can remove the silos that occur in the data engineering and model serving components of a data science practice by moving everything data and the practitioners operating on that data to a centralized location.
Bio: Nick Acosta is a Developer Advocate and Data Scientist at Fivetran and studied Computer Science at Purdue University and the University of Southern California. Fivetran automates data ingestion and is happy to be technology partners with a number of organizations listed in this article, including Amazon, Databricks, Google, and Snowflake.