
Simple & Intuitive Ensemble Learning in R


Read about metaEnsembleR, a fully automated R package for heterogeneous ensemble meta-learning (classification and regression).


By Ajay Arunachalam, Orebro University

I have always believed in democratizing AI and machine learning, and in spreading knowledge in a way that reaches a broad audience, so that more people can potentially exploit the power of AI.

One such attempt is this fully automated R package for meta-level ensemble learning (classification and regression). It significantly lowers the barrier for practitioners, including non-experts, to apply heterogeneous ensemble learning techniques to their everyday predictive problems.

Before we delve into the package details, let's review a few basic concepts.

Why Ensemble Learning?

 
Generally, predictions become unreliable when the input sample lies outside the training distribution, when the model is biased toward a particular data distribution, or when the data are prone to noise. Most remedies require changes to the network architecture, fine-tuning, balanced data, larger models, and so on. The choice of algorithm also plays a vital role, and scalability and learning ability tend to decrease on complex datasets. Combining multiple learners is an effective alternative that has been applied to many real-world problems. Ensemble learners combine a diverse collection of predictions from individual base models to produce a composite predictive model that is more accurate and robust than its components. With meta-ensemble learning, generalization error can be reduced to some extent irrespective of the data distribution, the number of classes, the choice of algorithm, the number of models, or the complexity of the dataset. In short, the resulting predictive models generalize better.

How can we build models in a more stable fashion while minimizing underfitting and overfitting, both of which are critical to the overall outcome? The answer is ensemble meta-learning over a heterogeneous collection of base learners.

Common Ensemble Learning Techniques

 
The most popular ensemble techniques are shown in the figure below. Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In bagging, independent base models are trained on bootstrap samples of the original dataset. Boosting grows an ensemble iteratively in a dependent fashion, adjusting the weight of each observation according to the previous predictions. There are several extensions of both bagging and boosting.

[Figure: overview of common ensemble learning techniques (stacking, bagging, boosting)]
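To make the bagging idea concrete before turning to the package, here is a minimal, self-contained sketch (not part of metaEnsembleR; it only assumes the rpart package is installed) that fits several decision trees on bootstrap samples of iris and combines them by majority vote:

library(rpart)
set.seed(42)
idx    <- sample(nrow(iris), 100)
train  <- iris[idx, ]
test   <- iris[-idx, ]
# fit one tree per bootstrap sample of the training data
models <- lapply(1:25, function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  rpart(Species ~ ., data = boot)
})
# majority vote across the individual trees
votes <- sapply(models, function(m) as.character(predict(m, test, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == test$Species)   # accuracy of the bagged ensemble

A stacked ensemble, as built by metaEnsembleR, goes one step further: instead of a fixed vote, a final learner is trained on the base models' validation predictions to decide how to combine them.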

Overview

 
metaEnsembleR is an R package for automated meta-learning (classification and regression). Its functionality includes simple, user-driven predictive modeling with a free choice of algorithms, train-validation-test splitting, model evaluation, and guided prediction on unseen data, which helps users build stacked ensembles on the go. The core aim of the package is to serve as broad an audience as possible: metaEnsembleR significantly lowers the barrier for practitioners, including non-experts, to apply heterogeneous ensemble learning techniques to their everyday predictive problems.

Using metaEnsembleR

The package consists of the following components:

  • Ensemble classifier training and prediction
  • Ensemble regressor training and prediction
  • Model evaluation, model results (observation vs. prediction on the test data), prediction on new unseen data, and writing performance charts and prediction results to disk

All of these functions are intuitive to use, and their use is illustrated in the examples below, covering both classification and regression problems.

Getting Started

 
The package can be installed directly from CRAN.

Install from the R console:

install.packages("metaEnsembleR")


However, the latest stable version (if any) can be found on GitHub and installed using the devtools package.

Install from GitHub:

if (!require(devtools)) install.packages("devtools")
devtools::install_github(repo = 'ajayarunachalam/metaEnsembleR', ref = 'main')


Usage

library("metaEnsembleR")
set.seed(111)


Training an ensemble classification model is as simple as a one-line call to the ensembler.classifier function. You can pass either a CSV file or an already-imported data frame; the arguments are, in order: the dataset, the index of the outcome/response variable, the base learners, the final learner, the train-validation-test split ratios, and the unseen data.

ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, read.csv('./unseen_data.csv'))


OR

unseen_new_data_testing <- iris[130:150,]
ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing)


The function returns the following: the test data with predictions, the prediction labels, the model results, and the predictions on the unseen data.

testpreddata <- data.frame(ensembler_return[1])
table(testpreddata$actual_label)
table(ensembler_return[2])
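As a quick sanity check (a minimal sketch, assuming the returned test data frame exposes the actual_label and predictions columns used in the plotting example further below), the test-set accuracy can be computed directly:

# hedged sketch: share of correct predictions on the held-out test split
mean(as.character(testpreddata$predictions) == as.character(testpreddata$actual_label))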


Performance comparison

modelresult <- ensembler_return[3]
modelresult


Unseen data

unseenpreddata <- data.frame(ensembler_return[4])
table(unseenpreddata$unseenpreddata)


 
Training an ensemble regression model is likewise a one-line call to the ensembler.regression function. You can pass either a CSV file or an already-imported data frame; the arguments are, in order: the dataset, the index of the outcome/response variable, the base learners, the final learner, the train-validation-test split ratios, and the unseen data.

house_price <- read.csv(file = './data/regression/house_price_data.csv')
unseen_new_data_testing_house_price <- house_price[250:414,]
write.csv(unseen_new_data_testing_house_price, 'unseen_house_price_regression.csv', fileEncoding = 'UTF-8', row.names = F)
ensembler_return <- ensembler.regression(house_price[1:250,], 1, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, read.csv('./unseen_house_price_regression.csv'))


OR

ensembler_return <- ensembler.regression(house_price[1:250,], 1, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing_house_price)


The function returns the following: the test data with predictions, the predicted values, the model results, and the unseen data with its predictions.

testpreddata <- data.frame(ensembler_return[1])
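For a quick error estimate on the regression test split, a minimal sketch follows; the column names actual_value and predictions are assumptions, so check names(testpreddata) for the exact names that ensembler.regression returns:

# hedged sketch: RMSE on the test split (column names are assumptions)
sqrt(mean((testpreddata$actual_value - testpreddata$predictions)^2))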


Performance comparison

modelresult <- ensembler_return[3]
modelresult
write.csv(modelresult[[1]], "performance_chart.csv")


Unseen data

unseenpreddata <- data.frame(ensembler_return[4])


Examples

Classification

library("metaEnsembleR")
data("iris")
unseen_new_data_testing <- iris[130:150,]
write.csv(unseen_new_data_testing, 'unseen_check.csv', fileEncoding = 'UTF-8', row.names = F)
ensembler_return <- ensembler.classifier(iris[1:130,], 5, c('treebag','rpart'), 'gbm', 0.60, 0.20, 0.20, unseen_new_data_testing)
testpreddata <- data.frame(ensembler_return[1])
table(testpreddata$actual_label)
table(ensembler_return[2])


Performance comparison

modelresult <- ensembler_return[3]
modelresult
# qplot() and ggsave() come from ggplot2; tableGrob() and grid.arrange() from gridExtra
library(ggplot2)
library(gridExtra)
act_mybar <- qplot(testpreddata$actual_label, geom = "bar")
act_mybar
pred_mybar <- qplot(testpreddata$predictions, geom = "bar")
pred_mybar
act_tbl <- tableGrob(t(summary(testpreddata$actual_label)))
pred_tbl <- tableGrob(t(summary(testpreddata$predictions)))
ggsave("testdata_actual_vs_predicted_chart.pdf", grid.arrange(act_tbl, pred_tbl))
ggsave("testdata_actual_vs_predicted_plot.pdf", grid.arrange(act_mybar, pred_mybar))


Unseen data

unseenpreddata <- data.frame(ensembler_return[4])
table(unseenpreddata$unseenpreddata)
table(unseen_new_data_testing$Species)
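To compare the unseen-data predictions against the true species side by side, a small hedged sketch (assuming the prediction column keeps the name shown by table(unseenpreddata$unseenpreddata) above, and that the row order of the unseen data is preserved):

# hedged sketch: cross-tabulate predicted vs. actual species on the unseen rows
table(predicted = unseenpreddata$unseenpreddata, actual = unseen_new_data_testing$Species)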


Regression

library("metaEnsembleR")
data("rock")
unseen_rock_data <- rock[30:48,]
ensembler_return <- ensembler.regression(rock[1:30,], 4, c('lm'), 'rf', 0.40, 0.30, 0.30, unseen_rock_data)
testpreddata <- data.frame(ensembler_return[1])


Performance comparison

modelresult <- ensembler_return[3]
modelresult
write.csv(modelresult[[1]], "performance_chart.csv")


Unseen data

unseenpreddata <- data.frame(ensembler_return[4])


More Examples

 
Comprehensive demonstrations can be found in the Demo.R file. To reproduce the results, run Rscript Demo.R from the terminal.

If there is an implementation you would like to see here, or examples you would like to add, feel free to contribute. You can always reach me at ajay.aruanchalam08@gmail.com.

Always Keep Learning & Sharing Knowledge!!!

 
Bio: Ajay Arunachalam (personal website) is a Postdoctoral Researcher (Artificial Intelligence) at the Centre for Applied Autonomous Sensor Systems, Orebro University, Sweden. Prior to this, he worked as a Data Scientist at True Corporation, a communications conglomerate, working with petabytes of data and building and deploying deep models in production. He truly believes that addressing opacity in AI systems is the need of the hour, before we can fully accept the power of AI. With this in mind, he has always strived to democratize AI and is inclined toward building interpretable models. His interests lie in applied Artificial Intelligence, Machine Learning, Deep Learning, Deep RL, and Natural Language Processing, specifically in learning good representations. From his experience working on real-world problems, he fully acknowledges that finding good representations is the key to designing systems that can solve interesting, challenging real-world problems, go beyond human-level intelligence, and ultimately explain complicated data that we don't understand. To achieve this, he envisions learning algorithms that can learn feature representations from both unlabelled and labelled data, be guided with and/or without human interaction, and operate at different levels of abstraction in order to bridge the gap between low-level data and high-level abstract concepts.

Original. Reposted with permission.


Source: https://www.kdnuggets.com/2020/12/simple-intuitive-meta-learning-r.html


Researchers propose using the game Overcooked to benchmark collaborative AI systems


Deep reinforcement learning systems are among the most capable in AI, particularly in the robotics domain. However, in the real world, these systems encounter a number of situations and behaviors to which they weren’t exposed during development.

In a step toward systems that can collaborate with humans in order to help them accomplish their goals, researchers at Microsoft, the University of California, Berkeley, and the University of Nottingham developed a methodology for applying a testing paradigm to human-AI collaboration that can be demonstrated in a simplified version of the game Overcooked. Players in Overcooked control a number of chefs in kitchens filled with obstacles and hazards to prepare meals to order under a time limit.

The team asserts that Overcooked, while not necessarily designed with robustness benchmarking in mind, can successfully test potential edge cases in states a system should be able to handle, as well as the partners the system should be able to play with. For example, in Overcooked, systems must contend with scenarios like plates being accidentally left on counters, or a partner staying put for a while because they're thinking or away from their keyboard.

Above: Screen captures from the researchers’ test environment.

The researchers investigated a number of techniques for improving system robustness, including training a system with a diverse population of other collaborative systems. Over the course of experiments in Overcooked, they observed whether several test systems could recognize when to get out of the way (like when a partner was carrying an ingredient) and when to pick up and deliver orders after a partner has been idling for a while.

According to the researchers, current deep reinforcement agents aren’t very robust — at least not as measured by Overcooked. None of the systems they tested scored above 65% in the video game, suggesting, the researchers say, that Overcooked can serve as a useful human-AI collaboration metric in the future.

“We emphasize that our primary finding is that our [Overcooked] test suite provides information that may not be available by simply considering validation reward, and our conclusions for specific techniques are more preliminary,” the researchers wrote in a paper describing their work. “A natural extension of our work is to expand the use of unit tests to other domains besides human-AI collaboration … An alternative direction for future work is to explore meta learning, in order to train the agent to adapt online to the specific human partner it is playing with. This could lead to significant gains, especially on agent robustness with memory.”


Source: https://venturebeat.com/2021/01/15/researchers-propose-using-the-game-overcooked-to-benchmark-collaborative-ai-systems/


AI Weekly: Meet the people trying to replicate and open-source OpenAI’s GPT-3


In June, OpenAI published a paper detailing GPT-3, a machine learning model that achieves strong results on a number of natural language benchmarks. At 175 billion parameters — the part of the model that has learned from historical training data — it’s one of the largest of its kind. It’s also among the most sophisticated, with the ability to make primitive analogies, write in the style of Chaucer, and even complete basic code.

In contrast to GPT-3’s predecessors, GPT-2 and GPT-1, OpenAI chose not to open-source the model or training dataset, opting instead to make the former available through a commercial API. The company further curtailed access by choosing to exclusively license GPT-3 to Microsoft, which OpenAI has a business relationship with. Microsoft has invested $1 billion in OpenAI and built an Azure-hosted supercomputer designed to further OpenAI’s research.

Several efforts to recreate GPT-3 in open source have emerged, but perhaps the furthest along is GPT-Neo, a project spearheaded by EleutherAI. A grassroots collection of researchers working to open-source machine learning research, EleutherAI and its founding members — Connor Leahy, Leo Gao, and Sid Black — aim to deliver the code and weights needed to run a model similar, though not identical, to GPT-3 as soon as August. (Weights are parameters within a neural network that transform input data.)

EleutherAI

According to Leahy, EleutherAI began as “something of a joke” on TPU Podcast, a machine learning Discord server, where he playfully suggested someone should try to replicate GPT-3. Leahy, Gao, and Black took this to its logical extreme and founded the EleutherAI Discord server, which became the base of the organization’s operations.

“I consider GPT-3 and other similar results to be strong evidence that it may indeed be possible to create [powerful models] with nothing more than our current techniques,” Leahy told VentureBeat in an interview. “It turns out to be in fact very, very hard, but not impossible with a group of smart people, as EleutherAI has shown, and of course with access to unreasonable amounts of computer hardware.”

As part of a personal project, Leahy previously attempted to replicate GPT-2, leveraging access to compute through Google’s Tensorflow Research Cloud (TFRC) program. The original codebase, which became GPT-Neo, was built to run on tensor processing units (TPUs), Google’s custom AI accelerator chips. But the EleutherAI team concluded that even the generous amount of TPUs provided through TFRC wouldn’t be sufficient to train the GPT-3-like version of GPT-Neo in under two years.

EleutherAI’s fortunes changed when the company was approached by CoreWeave, a U.S.-based cryptocurrency miner that provides cloud services for CGI rendering and machine learning workloads. Last month, CoreWeave offered the EleutherAI team access to its hardware in exchange for an open source GPT-3-like model its customers could use and serve.

Leahy insists that the work, which began around Christmas, won’t involve money or other compensation going in either direction. “CoreWeave gives us access to their hardware, we make an open source GPT-3 for everyone to use (and thank them very loudly), and that’s all,” he said.

Training datasets

EleutherAI concedes that because of OpenAI’s decision not to release some key details of GPT-3’s architecture, GPT-Neo will deviate from it in at least those ways. Other differences might arise from the training dataset EleutherAI plans to use, which was curated by a team of 10 people at EleutherAI, including Leahy, Gao, and Black.

Language models like GPT-3 often amplify biases encoded in data. A portion of the training data is not uncommonly sourced from communities with pervasive gender, race, and religious prejudices. OpenAI notes that this can lead to placing words like “naughty” or “sucked” near female pronouns and “Islam” near words like “terrorism.” Other studies, like one published in April by Intel, MIT, and the Canadian Institute for Advanced Research (CIFAR) researchers, have found high levels of stereotypical bias in some of the most popular models, including Google’s BERT and XLNet, OpenAI’s GPT-2, and Facebook’s RoBERTa. Malicious actors could leverage this bias to foment discord by spreading misinformation, disinformation, and outright lies that “radicalize individuals into violent far-right extremist ideologies and behaviors,” according to the Middlebury Institute of International Studies.

For their part, the EleutherAI team says they’ve performed “extensive bias analysis” on the GPT-Neo training dataset and made “tough editorial decisions” to exclude some datasets they felt were “unacceptably negatively biased” toward certain groups or views. The Pile, as it’s called, is an 835GB corpus consisting of 22 smaller datasets combined to ensure broad generalization abilities.

“We continue to carefully study how our models act in various circumstances and how we can make them more safe,” Leahy said.

Leahy personally disagrees with the idea that releasing a model like GPT-3 would have a direct negative impact on polarization. An adversary seeking to generate extremist views would find it much cheaper and easier to hire a troll farm, he argues, as autocratic governments have already done. Furthermore, Leahy asserts that discussions of discrimination and bias point to a real issue but don’t offer a complete solution. Rather than censoring the input data of a model, he says the AI research community must work toward systems that can “learn all that can be learned about evil and then use that knowledge to fight evil and become good.”

“I think the commoditization of GPT-3 type models is part of an inevitable trend in the falling price of the production of convincing digital content that will not be meaningfully derailed whether we release a model or not,” Leahy continued. “The biggest influence we can have here is to allow more low-resource users, especially academics, to gain access to these technologies to hopefully better study them, and also perform our own brand of safety-focused research on it, instead of having everything locked inside industry labs. After all, this is still ongoing, cutting-edge research. Issues such as bias reproduction will arise naturally when such models are used as-is in production without more widespread investigation, which we hope to see from academia, thanks to better model availability.”

Google recently fired AI ethicist Timnit Gebru, reportedly in part over a research paper on large language models that discussed risks such as the impact of their carbon footprint on marginalized communities. Asked about the environmental impact of training GPT-Neo, Leahy characterized the argument as a “red herring,” saying he believes it’s a matter of whether the ends justify the means — that is, whether the output of the training is worth the energy put into it.

“The amount of energy that goes into training such a model is much less than, say, the energy that goes into serving any medium-sized website, or a single trans-Atlantic flight to present a paper about the carbon emissions of AI models at a conference, or, God forbid, Bitcoin mining,” Leahy said. “No one complains about the energy bill of CERN (The European Organization for Nuclear Research), and I don’t think they should, either.”

Future work

EleutherAI plans to use architectural tweaks the team has found to be useful to train GPT-Neo, which they expect will enable the model to achieve performance “similar” to GPT-3 at roughly the same size (around 350GB to 700GB of weights). In the future, they plan to distill the final model down to something “an order of magnitude or so smaller” for easier inference. And while they’re not planning to provide any kind of commercial API, they expect CoreWeave and others to set up services to make GPT-Neo accessible to users.

As for the next iteration of GPT and similarly large, complex models, like Google’s trillion-parameter Switch-C, Leahy thinks they’ll likely be more challenging to replicate. But there’s evidence that efficiency improvements might offset the mounting compute requirements. An OpenAI survey found that since 2012, the amount of compute needed to train an AI model to the same performance classifying images in a popular benchmark (ImageNet) has been decreasing by a factor of two every 16 months. But the extent to which compute contributes to performance compared with novel algorithmic approaches remains an open question.

“It seems inevitable that models will continue to increase in size as long as increases in performance follow,” Leahy said. “Sufficiently large models will, of course, be out of reach for smaller actors, but this seems to me to just be a fact of life. There seems to me to be no viable alternative. If bigger models equals better performance, whoever has the biggest computer will make the biggest model and therefore have the best performance, easy as that. I wish this wasn’t so, but there isn’t really anything that can be done about it.”


Thanks for reading,

Kyle Wiggers

AI Staff Writer


Source: https://venturebeat.com/2021/01/15/ai-weekly-meet-the-people-trying-to-replicate-and-open-source-openais-gpt-3/


Feature store repositories emerge as an MLOps linchpin for advancing AI


A battle for control over machine learning operations (MLOps) is beginning in earnest as organizations embrace feature store repositories to build AI models more efficiently.

A feature store is at its core a data warehouse through which developers of AI models can share and reuse the artifacts that make up an AI model as well as an entire AI model that might need to be modified or further extended. In concept, feature store repositories play a similar role as a Git repository does in enabling developers to build applications more efficiently by sharing and reusing code.

Early pioneers of feature store repositories include Uber, which built a platform dubbed Michelangelo, and Airbnb, which created a feature store dubbed Zipline. But neither of those platforms is available as open source code. Leading providers of feature store repositories trying to fill that void include Tecton, Molecula, Hopsworks, Splice Machine, and, most recently, Amazon Web Services (AWS). There is also an open source feature store project, dubbed Feast, that counts among its contributors Google and Tecton.

It can take a data science team six months or longer to construct a single AI model, so pressure to accelerate those processes is building. Organizations that employ AI models not only want to build more of them faster, but AI models deployed in production environments also need to be either regularly updated or replaced as business conditions change.

Less clear right now, however, is to what degree feature store repositories represent a standalone category versus being a foundational element of a larger MLOps platform. As investment capital starts to pour into the category, providers of feature store platforms are trying to have it both ways.

Splice Machine, for example, offers a SQL-based feature store platform that organizations can deploy apart from its platform for managing data science processes. “It’s important to modularize the feature store so it can be used in other environments,” said Splice Machine CEO Monte Zweben. “I think you’ll see adoption of feature stores in both manners.”

Over time, however, it will become apparent that feature stores one way or another need to be part of a larger platform to derive the most value, he added.

Fresh off raising an additional $17.6 million in funding, Molecula is also positioning its feature store as a standalone offering in addition to being a foundation around which MLOps processes will revolve. In fact, Molecula is betting that feature stores, in addition to enabling AI models to be constructed more efficiently, will also become critical to building any type of advanced analytics application, said Molecula CEO H.O. Maycotte.

To achieve that goal, Molecula built its own storage architecture to eliminate all the manual copy-and-paste processes that make building AI models and other types of advanced analytics applications so cumbersome today, he noted. “It’s not just for MLOps,” said Maycotte. “Our buyer is the data engineer.”

Tecton, meanwhile, appears to be more focused on enabling the creation of a best-of-breed MLOps ecosystem around its core feature store platform. “Feature stores will be at the center of an MLOps toolchain,” said Tecton CEO Mike Del Balso.

Casting a shadow over each of these vendors, however, are cloud service providers that will make feature store repositories available as a service. Most AI models are trained on a public cloud because of the massive amounts of data involved and the cost of the graphics processing units (GPUs) required. Adding a feature store repository to a cloud service that is already being employed to build an AI model is simply a logical extension.

However, providers of feature store platforms contend it’s only a matter of time before MLOps processes span multiple clouds. Many enterprise IT organizations are going to standardize on a feature store repository that makes it simpler to share AI models and their components across multiple clouds.

Regardless of how MLOps evolves, the need for a centralized repository for building AI models has become apparent. The issue enterprise IT organizations need to address now is determining which approach makes the most sense today, because whatever feature store platform they select now will have a major impact on their AI strategy for years to come.


Source: https://venturebeat.com/2021/01/15/feature-store-repositories-emerge-as-an-mlops-linchpin-for-advancing-ai/


Microsoft’s new settings let users contribute recordings to improve its speech recognition systems


Microsoft today announced it will give customers finer-grain control over whether their voice data is used to improve its speech recognition products. The new policy will allow customers to decide if reviewers, including Microsoft employees and contractors, can listen to recordings of what they said while speaking to Microsoft products and services that use speech recognition technology including Microsoft Translator, SwiftKey, Windows, Cortana, HoloLens, Mixed Reality, and Skype voice translation.

Maintaining privacy when it comes to voice recognition is a challenging task, given that state-of-the-art AI techniques have been used to infer attributes like intention, gender, emotional state, and identity from timbre, pitch, and speaker style. Recent reporting revealed that accidental voice assistant activations exposed private conversations, and a study by Clemson University School of Computing researchers found that Amazon Alexa and Google Assistant voice app privacy policies are often “problematic” and violate baseline requirements. The risk is such that law firms including Mishcon de Reya have advised staff to mute smart speakers when they talk about client matters at home.

Microsoft stopped storing voice clips processed by its speech recognition technologies on October 30, and Google Assistant, Siri, Cortana, Alexa, and other major voice recognition platforms allow users to delete recorded data. But this requires some (and in several cases, substantial) effort. That’s why over the next few months, Microsoft says it’ll roll out new settings for voice clip review across all of its applicable products. If customers choose to opt in, the company says people may review these clips to improve the performance of Microsoft’s AI systems “across a diversity of people, speaking styles, accents, dialects, and acoustic environments.”

“The goal is to make Microsoft’s speech recognition technologies more inclusive by making them easier and more natural to interact with,” Microsoft wrote in a pair of blog posts published this morning. “Voice clips will be de-identified as they are stored — they won’t be associated with [a] Microsoft account or any other Microsoft IDs that could tie them back to [a customer]. New voice data will no longer show up in [the] Microsoft account privacy dashboard.”

If a customer chooses to let Microsoft employees or contractors listen to their recordings to improve the company’s technology, in part by manually transcribing what they hear, Microsoft says it will retain the data for up to two years. If a contributed voice clip is sampled for transcription, the company says it might retain it for more than two years to “continue training and improving the quality of speech recognition AI.”

Microsoft says that customers who choose not to contribute their voice clips for review will still be able to use its voice-enabled products and services. However, the company reserves the right to continue accessing information associated with user voice activity, such as the transcriptions automatically generated during user interactions with speech recognition AI.

Tech giants including Apple and Google have been the subject of reports uncovering the potential misuse of recordings collected to improve assistants such as Siri and Google Assistant. In April 2019, Bloomberg revealed that Amazon employs contract workers to annotate thousands of hours of audio from Alexa-powered devices, prompting the company to roll out user-facing tools that quickly delete cloud-stored data. And in July, a third-party contractor leaked Google Assistant voice recordings for users in the Netherlands that contained personally identifiable data, like names, addresses, and other private information. Following the latter revelation, a German privacy authority briefly ordered Google to stop harvesting voice data in Europe for human reviewers.

For its part, Microsoft says it removes certain personal information from voice clips as they’re processed in the cloud, including strings of letters or numbers that could be telephone numbers, social security numbers, and email addresses. Moreover, the company says it doesn’t use human reviewers to listen to audio collected from speech recognition features built into its enterprise offerings.

Increasingly, privacy isn’t merely a question of philosophy, but table stakes in the course of business. Laws at the state, local, and federal levels aim to make privacy a mandatory part of compliance management. Hundreds of bills that address privacy, cybersecurity, and data breaches are pending or have already been passed in 50 U.S. states, territories, and the District of Columbia. Arguably the most comprehensive of them all — the California Consumer Privacy Act — was signed into law roughly two years ago. That’s not to mention the Health Insurance Portability and Accountability Act (HIPAA), which requires companies to seek authorization before disclosing individual health information. And international frameworks like the EU’s General Privacy Data Protection Regulation (GDPR) aim to give consumers greater control over personal data collection and use.


Source: https://venturebeat.com/2021/01/15/microsofts-new-settings-let-users-contribute-voice-clips-to-improve-its-speech-recognition-systems/
