
Big Data

In-Warehouse Machine Learning and the Modern Data Science Stack


As your organization matures its data science portfolio and capabilities, establishing a modern data stack is vital to enabling that growth. This article surveys several in-warehouse machine learning services and discusses the benefits and requirements of each.


By Nick Acosta, Developer Advocate, Alliances, Fivetran.


Converging data stacks

Although data analytics and data science are distinct disciplines, there is considerable overlap in the data processing steps each relies on. Both benefit from access to large amounts of high-quality data stored in a centralized location, as well as efficient and reliable processes for bringing data from sources to these central repositories. Until recently, this work has been duplicated across separate technologies for the two fields: a data warehouse for analytics and business intelligence, and a data lake for data science and machine learning. A number of new services are working on merging these data stacks into a single environment, and this article provides an overview of these services and the value they can add to a data organization.

Benefits of a Modern Data Science Stack

A modern data stack, an approach that has become popular in analytics, is a collection of technologies that bring data from multiple sources into a centralized cloud data warehouse and store it there. It can be extended to accommodate machine learning workloads, producing what is called the modern data science stack. A modern data science stack removes the silos and duplicated services that arise when data analytics and data science teams work separately, and it moves models closer to the data they are trained on and use to make predictions, easing the shift from model-centric AI development to data-centric AI development. Many organizations have invested considerably in data warehousing technologies to keep the environment secure, governed, operational, organized, and performant, but data can lose all of these qualities the moment it is copied out of a data warehouse into a data lake.

There are three more, less obvious benefits that I have discovered since moving to a modern data science stack. First, having models stored in a data warehouse means their predictions can be stored there as well and obtained via SQL queries. Performing table lookups, instead of requiring embedded models or frameworks to use machine learning, can go a long way toward democratizing the use of machine learning in an organization. Second, because each step of the machine learning process happens in the same place on the same data, there is less chance of differences between the data sent to models at training time and at serving time, which means training-serving skew, and the tools used to detect it, can largely be avoided. Finally, since every step of the machine learning process can be expressed as SQL, it becomes straightforward to compose the steps into a data pipeline with a tool like Apache Airflow.
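To make the table-lookup and pipeline ideas above concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for a cloud data warehouse. The table names, columns, and the hand-written scoring rule are all hypothetical assumptions for illustration, and the rule is not a trained model. The point is only that each step is plain SQL and that consumers read predictions with an ordinary query.

```python
import sqlite3

# Each pipeline step is plain SQL. sqlite3 stands in for a cloud warehouse.
# Table names, columns, and the scoring rule below are hypothetical.
PIPELINE_STEPS = [
    (
        "create_features",
        "CREATE TABLE features AS "
        "SELECT customer_id, orders_last_90d, support_tickets "
        "FROM raw_customers",
    ),
    (
        "score",
        # Stand-in for an in-warehouse model call: a hand-written rule,
        # not a trained model.
        "CREATE TABLE predictions AS "
        "SELECT customer_id, "
        "CASE WHEN orders_last_90d = 0 AND support_tickets > 2 "
        "THEN 1 ELSE 0 END AS churn_prediction "
        "FROM features",
    ),
]


def run_pipeline(conn):
    """Run each SQL step in order, as an orchestrator would run one task per step."""
    for _name, sql in PIPELINE_STEPS:
        conn.execute(sql)
    conn.commit()


def lookup_prediction(conn, customer_id):
    """Fetch a stored prediction with an ordinary table lookup."""
    row = conn.execute(
        "SELECT churn_prediction FROM predictions WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return None if row is None else row[0]


def build_demo_connection():
    """Create an in-memory database with a tiny, made-up source table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_customers ("
        "customer_id INTEGER, orders_last_90d INTEGER, support_tickets INTEGER)"
    )
    conn.executemany(
        "INSERT INTO raw_customers VALUES (?, ?, ?)",
        [(1, 0, 5), (2, 4, 0)],
    )
    return conn


if __name__ == "__main__":
    demo = build_demo_connection()
    run_pipeline(demo)
    print(lookup_prediction(demo, 1))
```

In an orchestrator such as Apache Airflow, each named step would typically become its own task, with the ordering of the list expressed as task dependencies.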

Overview of In-Warehouse Machine Learning Services

BigQuery ML & Redshift ML


AWS and Google Cloud both recently added machine learning capabilities to their data warehouses, Redshift (left) and BigQuery (right).

BigQuery ML and Redshift ML add machine learning capabilities to BigQuery and Redshift, the respective data warehouses of Google Cloud Platform and AWS. AWS recently announced the general availability of Redshift ML, while BigQuery ML has been available for some time.

Both extend SQL syntax with a CREATE MODEL command that creates a machine learning model and specifies parameters such as the model type, the table to use as training data, and the target column to predict. These commands leverage automated machine learning to handle data transformations and model tuning and to select the best performer among candidate models. Custom models can also be used with each service and offer considerable flexibility in model architecture and performance, but each imposes some restrictions on development: custom models have to be saved as TensorFlow models to be used in BigQuery, and Redshift ML must use models deployed with SageMaker, the AWS data science development platform. Once a model is trained in or imported into the warehouse, inferences are invoked with ordinary SELECT statements. In BigQuery ML, the FROM clause specifies the trained model (through ML.PREDICT) in place of a table, while Redshift ML exposes the model as a SQL function that is called in the SELECT list. Either way, the results can easily be inserted into a predictions table in the warehouse for use, auditing, and error analysis.
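As a rough illustration of the statements described above, the sketch below builds BigQuery ML-style SQL strings in Python. Nothing is executed against a warehouse. The dataset, model, table, and column names are hypothetical, and the helper functions are my own. The statement shapes follow BigQuery ML's CREATE MODEL and ML.PREDICT forms. Redshift ML uses a different CREATE MODEL syntax, so treat this as one example and not a template for both services.

```python
def create_model_sql(model, training_table, label_column, model_type="logistic_reg"):
    """Return a BigQuery ML-style CREATE MODEL statement as a string.

    All identifiers are supplied by the caller and are hypothetical here.
    Plain string formatting is used for readability only. Do not pass
    untrusted input to these helpers.
    """
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', "
        f"input_label_cols = ['{label_column}']) AS\n"
        f"SELECT * FROM `{training_table}`"
    )


def predict_sql(model, scoring_table, predictions_table):
    """Return a statement that stores model predictions in a warehouse table."""
    return (
        f"CREATE OR REPLACE TABLE `{predictions_table}` AS\n"
        f"SELECT * FROM ML.PREDICT(MODEL `{model}`, TABLE `{scoring_table}`)"
    )


if __name__ == "__main__":
    # Hypothetical names, chosen only for the demonstration.
    print(create_model_sql("demo.churn_model", "demo.training_data", "churned"))
    print()
    print(predict_sql("demo.churn_model", "demo.new_customers", "demo.predictions"))
```

Keeping the statements as strings like this is one way to hand them to whatever client or orchestrator a team already uses to run SQL.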

Snowflake & Other Options

Snowflake has said that their “entire initiative in AI and ML has been to build extensibility into [their data warehouse] so you can interface with your tool of choice.” AWS’s SageMaker platform, mentioned earlier, is one example of an ML tool Snowflake can integrate with, and Databricks is another. More impressive development is happening at Databricks itself, which recently released version 1.0.0 of Delta Lake. Delta Lake converges the data analytics and data science technology stacks from the opposite direction. Instead of bringing machine learning capabilities to a data warehouse, it adds traditional analytics and business intelligence capabilities, such as ACID transactions, to a data lake. The result is a new data lakehouse architecture that provides benefits similar to those of a modern data science stack.

Review

If your organization is interested in performing both data analytics and data science, there are a number of options to facilitate the two disciplines, but their data pipelines have too much in common to justify separate tooling for data ingestion, storage, and transformation. In-warehouse machine learning tools can be used to build a modern data science stack that removes the silos that occur in the data engineering and model serving components of a data science practice by moving everything, both the data and the practitioners operating on that data, to a centralized location.

Bio: Nick Acosta is a Developer Advocate and Data Scientist at Fivetran and studied Computer Science at Purdue University and the University of Southern California. Fivetran automates data ingestion and is happy to be a technology partner of a number of organizations listed in this article, including Amazon, Databricks, Google, and Snowflake.

Source: https://www.kdnuggets.com/2021/06/in-warehouse-machine-learning-modern-data-science-stack.html

Big Data

How much Mathematics do you need to know for Machine Learning?


Source: https://www.analyticsvidhya.com/blog/2021/07/how-much-mathematics-do-you-need-to-know-for-machine-learning/


Big Data

If you did not already know

ML Health


Deployment of machine learning (ML) algorithms in production for extended periods of time has uncovered new challenges such as monitoring and management of real-time prediction quality of a model in the absence of labels. However, such tracking is imperative to prevent catastrophic business outcomes resulting from incorrect predictions. The scale of these deployments makes manual monitoring prohibitive, making automated techniques to track and raise alerts imperative. We present a framework, ML Health, for tracking potential drops in the predictive performance of ML models in the absence of labels. The framework employs diagnostic methods to generate alerts for further investigation. We develop one such method to monitor potential problems when production data patterns do not match training data distributions. We demonstrate that our method performs better than standard ‘distance metrics’, such as RMSE, KL-Divergence, and Wasserstein at detecting issues with mismatched data sets. Finally, we present a working system that incorporates the ML Health approach to monitor and manage ML deployments within a realistic full production ML lifecycle. …

Guided Zoom


We propose Guided Zoom, an approach that utilizes spatial grounding to make more informed predictions. It does so by making sure the model has ‘the right reasons’ for a prediction, being defined as reasons that are coherent with those used to make similar correct decisions at training time. The reason/evidence upon which a deep neural network makes a prediction is defined to be the spatial grounding, in the pixel space, for a specific class conditional probability in the model output. Guided Zoom questions how reasonable the evidence used to make a prediction is. In state-of-the-art deep single-label classification models, the top-k (k = 2, 3, 4, …) accuracy is usually significantly higher than the top-1 accuracy. This is more evident in fine-grained datasets, where differences between classes are quite subtle. We show that Guided Zoom results in the refinement of a model’s classification accuracy on three fine-grained classification datasets. We also explore the complementarity of different grounding techniques, by comparing their ensemble to an adversarial erasing approach that iteratively reveals the next most discriminative evidence. …

UniParse


This paper describes the design and use of the graph-based parsing framework and toolkit UniParse, released as an open-source python software package. UniParse as a framework novelly streamlines research prototyping, development and evaluation of graph-based dependency parsing architectures. UniParse does this by enabling highly efficient, sufficiently independent, easily readable, and easily extensible implementations for all dependency parser components. We distribute the toolkit with ready-made configurations as re-implementations of all current state-of-the-art first-order graph-based parsers, including even more efficient Cython implementations of both encoders and decoders, as well as the required specialised loss functions. …

Sparse Constraint Preserving Matching (SPM)


Many problems of interest in computer vision can be formulated as a problem of finding consistent correspondences between two feature sets. Feature correspondence (matching) problem with one-to-one mapping constraint is usually formulated as an Integral Quadratic Programming (IQP) problem with permutation (or orthogonal) constraint. Since it is NP-hard, relaxation models are required. One main challenge for optimizing IQP matching problem is how to incorporate the discrete one-to-one mapping (permutation) constraint in its quadratic objective optimization. In this paper, we present a new relaxation model, called Sparse Constraint Preserving Matching (SPM), for IQP matching problem. SPM is motivated by our observation that the discrete permutation constraint can be well encoded via a sparse constraint. Comparing with traditional relaxation models, SPM can incorporate the discrete one-to-one mapping constraint straightly via a sparse constraint and thus provides a tighter relaxation for original IQP matching problem. A simple yet effective update algorithm has been derived to solve the proposed SPM model. Experimental results on several feature matching tasks demonstrate the effectiveness and efficiency of SPM method. …


Source: https://analytixon.com/2021/07/29/if-you-did-not-already-know-1461/


Big Data

Nokia lifts full-year forecast as turnaround takes root


HELSINKI (Reuters) - Telecom equipment maker Nokia reported a stronger-than-expected second-quarter operating profit on Thursday and raised its full-year outlook as promised, thanks to a turnaround of its business.

The Finnish company’s April-June comparable operating profit rose to 682 million euros ($808.51 million) from 423 million euros a year earlier, beating the 408-million euro mean estimate in a Refinitiv poll of analysts.

Shifting geopolitics and a sharp round of cost cutting have put Nokia firmly back in the global 5G rollout race just a year after CEO Pekka Lundmark took the reins, allowing it to gain ground on Swedish arch-rival Ericsson.

“We have executed faster than planned on our strategy in the first half which provides us with a good foundation for the full year,” Lundmark said in a statement on Thursday, but added that Nokia still expects the 2021 second-half results to be less pronounced.

Nokia said it now expects full-year net sales of 21.7 billion-22.7 billion euros, up from its prior estimate of 20.6 billion-21.8 billion euros, with an operating profit margin of 10-12% instead of the 7% to 10% expected previously.

The company had announced on July 13 that it would raise its outlook, but did not provide any details.

($1 = 0.8435 euros)

(Reporting by Essi Lehto; editing by Terje Solsvik and Sriraj Kaluvila)


Source: https://datafloq.com/read/nokia-lifts-full-year-forecast-turnaround-takes-root/16713


Big Data

Robinhood, gateway to ‘meme’ stocks, raises $2.1 billion in IPO


By Echo Wang and David French

(Reuters) - Robinhood Markets Inc, the owner of the trading app which emerged as the go-to destination for retail investors speculating on this year’s “meme” stock trading frenzy, raised $2.1 billion in its initial public offering on Wednesday.

The company was seeking to capitalize on individual investors’ fascination with cryptocurrencies and stocks such as GameStop Corp, which have seen wild swings after becoming the subject of trading speculation on social media sites such as Reddit. Robinhood’s monthly active users surged from 11.7 million at the end of December to 21.3 million as of the end of June.

The IPO valued Robinhood at $31.8 billion, a richer valuation relative to its revenue than that of many of its traditional rivals such as Charles Schwab Corp, but the offering priced at the bottom of the company’s indicated range.

Some investors stayed on the sidelines, citing concerns over the frothy valuation, the risk of regulators cracking down on Robinhood’s business, and even lingering anger with the company’s imposition of trading curbs when the meme stock trading frenzy flared up at the end of January.

Robinhood said it sold 55 million shares in the IPO at $38 apiece, the low end of its $38 to $42 price range. This makes it one of the most valuable U.S. companies to have gone public year-to-date, amid a red-hot market for new listings.

In an unusual move, Robinhood had said it would reserve between 20% and 35% of its shares for its users.

Robinhood’s platform allows users to make unlimited commission-free trades in stocks, exchange-traded funds, options and cryptocurrencies. Its simple interface made it popular with young investors trading from home during the COVID-19 pandemic.

Robinhood enraged some investors and U.S. lawmakers earlier this year when it restricted trading in some popular stocks following a 10-fold rise in deposit requirements at its clearinghouse. It has been at the center of many regulatory probes.

The company disclosed this week that it has received inquiries from U.S. regulators looking into whether its employees traded shares of GameStop and AMC Entertainment Holdings, Inc before the trading curbs were placed at the end of January.

In June, Robinhood agreed to pay nearly $70 million to settle an investigation by Wall Street’s own regulator, the Financial Industry Regulatory Authority, for “systemic” failures, including systems outages, providing “false or misleading” information, and weak options trading controls.

The brokerage has also been criticized for relying on “payment for order flow” for most of its revenue, under which it receives fees from market makers for routing trades to them and does not charge users for individual trades.

Critics argue the practice, which is used by many other brokers, creates a conflict of interest, on the grounds that it incentivizes brokers to send orders to whoever pays the higher fees. Robinhood contends that it routes trades based on what is cheapest for its users, and that charging a commission would be more expensive. The U.S. Securities and Exchange Commission is examining the practice.

Robinhood was founded in 2013 by Stanford University roommates Vlad Tenev and Baiju Bhatt. They will hold a majority of the voting power after the offering, these filings showed, with Bhatt having around 39% of the voting power of outstanding stock while Tenev will hold about 26.2%.

The company’s shares are scheduled to start trading on Nasdaq on Thursday under the ticker “HOOD.”

Goldman Sachs and J.P. Morgan were the lead underwriters in Robinhood’s IPO.

(Reporting by Echo Wang and David French in New York; Editing by Leslie Adler)


Source: https://datafloq.com/read/robinhood-gateway-meme-stocks-raises-21-billion-ipo/16712
