Big Data

Here Is How To Selectively Backup Your Data




When it comes to data backup, a business can’t always back up all of its information. Sure, if a company is willing to spend enough money on its backup solution, it can theoretically back up as much data as it creates. Realistically, though, cost-effectiveness dictates that a business limit the amount of data it saves on backup servers. To this end, many enterprises have implemented selective backup methodologies for their data.

The Balance SMB mentions that a business’s continuity depends on its backup integrity. Because of this, companies need to be discerning about how they select the data for backup. Since there’s limited space available, the enterprise must have some rules defined to carry out its selective backup protocols. In this article, we’ll look at some of these particular data backup protocols in detail.

Before The Backups Start

A prerequisite to enabling backups on a VMware or Hyper-V server is configuring the server itself. Before the IT department can create any backups, it should verify the server connection and ensure that all authentication checks pass. Additionally, any software used for the backups should be installed and configured to run on the system. This step is necessary before deciding on a paradigm for your selective backup. Once a company has done that, it can proceed to the selection processes.
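As an illustrative pre-flight check, the reachability of each backup target can be sketched in a few lines. This is a minimal sketch: the host names and ports below are hypothetical, and real verification would go through your backup software or the hypervisor's own management APIs.

```python
import socket

def server_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check every backup target before any jobs are scheduled (hosts are examples).
backup_targets = [("vmware-host.local", 443), ("hyperv-host.local", 5986)]
unreachable = [h for h, p in backup_targets if not server_reachable(h, p, timeout=1.0)]
if unreachable:
    print("Fix connectivity before starting backups:", unreachable)
```

A real deployment would also confirm credentials and backup-agent status, which this sketch does not attempt.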

Selective Inclusion

Selective inclusion is the most common method for choosing which files get added to the backup queue. It most closely mirrors how humans think about selecting sets: it’s unlikely that a manual selection would include something the business doesn’t need in its backup, and the granularity of control the operator has is unmatched. Unfortunately, this method depends on an individual to execute and can be time-consuming. Each new virtual machine or physical device added to the system requires IT personnel to fill out its backup table manually, and if a new file is distributed and needs to be added to the backup schedule, that can also take time.
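A minimal sketch of selective inclusion, using a hand-maintained include list (the paths here are hypothetical examples):

```python
from pathlib import PurePosixPath

# Explicit include list, maintained by hand per machine (paths are examples).
INCLUDE = [
    PurePosixPath("/srv/databases/customers.db"),  # a single file
    PurePosixPath("/srv/config"),                  # a whole directory
]

def selected_for_backup(path: str) -> bool:
    """A file is backed up only if it, or one of its parent dirs, is on the list."""
    p = PurePosixPath(path)
    return any(p == inc or inc in p.parents for inc in INCLUDE)

print(selected_for_backup("/srv/config/app.yaml"))   # True: included via parent dir
print(selected_for_backup("/srv/scratch/tmp.dat"))   # False: not on the list
```

The manual-maintenance cost described above lives in that `INCLUDE` list: every new machine or distributed file means another entry someone has to add.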

Selective Exclusion

The best way to think of this method is as asking the backup server to back up “everything except” what the list says. Selective exclusion systems use rules that may define paths or extensions the business deems extraneous to its needs. These exclusions might cover any folder named “Temp,” or other temporary files recognizable by their *.tmp extensions. Selective exclusion is only useful when you already have a system that backs up the entire file system, which brings us to our next methodology.
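A minimal sketch of selective exclusion rules, using the “Temp” folder and *.tmp examples above (the extra *.bak pattern is a hypothetical addition):

```python
import fnmatch
from pathlib import PurePosixPath

# Rules the business deems extraneous (patterns are examples).
EXCLUDED_DIR_NAMES = {"Temp"}
EXCLUDED_GLOBS = ["*.tmp", "*.bak"]

def excluded(path: str) -> bool:
    """'Back up everything except' what these rules match."""
    p = PurePosixPath(path)
    # Any path component named "Temp" excludes the file.
    if any(part in EXCLUDED_DIR_NAMES for part in p.parts):
        return True
    # Otherwise, exclude by file-name pattern.
    return any(fnmatch.fnmatch(p.name, g) for g in EXCLUDED_GLOBS)

print(excluded("/data/Temp/cache.dat"))   # True: folder named "Temp"
print(excluded("/data/report.tmp"))       # True: *.tmp extension
print(excluded("/data/report.pdf"))       # False: gets backed up
```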

Automatic Inclusion

This methodology takes into account every file on every machine (both virtual and physical) added to the backup schedule. Automatic inclusion is ideal for a business with unlimited storage space and bandwidth for its backups, and it may even work for companies with only a small amount of data to back up. In most cases, however, automatic inclusion is simply a starting point: using selective exclusion, as mentioned above, lets a business mark which files and extensions the backup system can safely ignore. Automatic inclusion is a broad-brush solution that’s only useful by itself in some edge cases.
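Combining the two ideas, automatic inclusion can be sketched as walking everything under a root and dropping only what the exclusion rules match (the rules shown are hypothetical examples):

```python
import fnmatch
import os

EXCLUDED_GLOBS = ["*.tmp"]  # example exclusion rule

def automatic_inclusion(root: str) -> list[str]:
    """Start from everything under root, then drop what exclusion rules match."""
    selected = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not any(fnmatch.fnmatch(name, g) for g in EXCLUDED_GLOBS):
                selected.append(os.path.join(dirpath, name))
    return selected
```

With an empty `EXCLUDED_GLOBS` list this degenerates into pure automatic inclusion, which is exactly the “broad-brush starting point” the text describes.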

Tag-Based Inclusion

Tags are useful utilities in any software environment, and the same is true of backup, since tags can tell the system which files are deemed essential. The shorthand for tagging uses the hash (#) symbol to define those tags beforehand. So, for example, a business can tag its relevant databases as #essential, while its test databases get the #test tag. When the backup system selects tags for backup, it chooses anything tagged #essential, ensuring that the business never wastes storage space backing up test data.
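Tag-based selection can be sketched as a simple lookup (the database names and tag assignments are hypothetical):

```python
# Tags assigned beforehand (the databases and tags here are examples).
TAGS = {
    "customers.db": {"#essential"},
    "orders.db":    {"#essential"},
    "load_test.db": {"#test"},
}

def backup_queue(tags: dict[str, set[str]], wanted: str = "#essential") -> list[str]:
    """Select anything carrying the wanted tag; #test data never hits storage."""
    return sorted(name for name, t in tags.items() if wanted in t)

print(backup_queue(TAGS))  # ['customers.db', 'orders.db']
```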

Default Inclusion

As Black Fin Marketing notes, some clients have data they want to ensure gets backed up even if no other rules apply to it. Default inclusion is the solution that works best in this situation: if a file or folder isn’t already included or excluded by the backup schedule’s rules, it’s automatically added to the files and folders to back up. While this may lead to cases where one or more extraneous folders or files end up backed up inadvertently, it’s a calculated risk. A business can’t restore data that it doesn’t have a backup for.
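The precedence that default inclusion implies can be sketched as follows (the rule lists are hypothetical examples):

```python
# Precedence sketch: explicit include > explicit exclude > default include.
INCLUDE_RULES = {"/srv/databases"}
EXCLUDE_RULES = {"/srv/scratch"}

def decision(path: str) -> str:
    for rule in INCLUDE_RULES:
        if path.startswith(rule):
            return "backup (include rule)"
    for rule in EXCLUDE_RULES:
        if path.startswith(rule):
            return "skip (exclude rule)"
    # No rule matched: back it up anyway, per default inclusion.
    return "backup (default inclusion)"

print(decision("/home/alice/notes.txt"))  # backup (default inclusion)
```

The calculated risk described above is the final branch: anything the rules never mention still lands in the backup.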

Backing Up Data is Essential

Small Business Trends noted that in 2017, more than half of small businesses in America were woefully underprepared to deal with catastrophic data loss. Since then, with the rise of viruses and ransomware, many companies have taken a more proactive approach to data backups. These filtering methods for limiting the size of a backup (and, by extension, the amount of time it takes to create and restore that backup) can help a business be more efficient in saving its data. Many companies have opted to implement cloud-based backup solutions as more of them start using the cloud for their data storage and recovery. Whether we’ll see more businesses adopting this methodology in the future remains to be seen.



PyTorch Multi-GPU Metrics Library and More in New PyTorch Lightning Release




PyTorch Lightning, a very lightweight structure for PyTorch, recently released version 0.8.1, a major milestone. With incredible user adoption and growth, they are continuing to build tools to easily do AI research.

By William Falcon, Founder PyTorch Lightning

Recently we released 0.8.1, which is a major milestone for PyTorch Lightning. With incredible user adoption and growth, we’re continuing to build tools to easily do AI research.

This major release puts us on track for final API changes for our v1.0.0 coming soon!

PyTorch Lightning

PyTorch Lightning is a very lightweight structure for PyTorch — it’s more of a style guide than a framework. But once you structure your code, we give you free GPU, TPU, and 16-bit precision support and much more!


Lightning is just structured PyTorch


This release includes a major new package inside Lightning: a multi-GPU metrics package!

There are two key facts about the metrics package in Lightning.

  1. It works with plain PyTorch!
  2. It automatically handles multi-GPU setups for you via DDP. Whether you calculate accuracy on 1 GPU or 20, we handle that for you automatically.

The metrics package also includes mappings to sklearn metrics to bridge between numpy, sklearn and PyTorch, as well as a fancy class you can use to implement your own metrics.

class RMSE(TensorMetric):
    def forward(self, x, y):
        return torch.sqrt(torch.mean(torch.pow(x - y, 2.0)))

The metrics package has over 18 metrics currently implemented (including functional metrics). Check out our documentation for a full list!


This release also cleaned up really cool debugging tools we’ve had in lightning for a while. The overfit_batches flag can now let you overfit on a small subset of data to sanity check that your model doesn’t have major bugs.

The logic is that if you can’t even overfit 1 batch of data, then there’s no use in training the rest of the model. This can help you figure out if you’ve implemented something correctly, or make sure your math is correct.


If you do this in Lightning, this is what you will get:

Faster multi-GPU training

Another key part of this release is the speed-ups we made to distributed training via DDP. The change comes from allowing DDP to work with num_workers>0 in DataLoaders.

DataLoader(dataset, num_workers=8)

Today, when you use DDP by launching it via .spawn() and try to use num_workers>0 in DataLoader, your program will likely freeze and never start training (this is also true outside of Lightning).

The solution for most is to set num_workers=0, but that means your training is going to be reaaaaally slow. To enable num_workers>0 AND DDP, we now launch DDP under the hood without spawn. This removes a lot of other weird restrictions, like the need to pickle everything and the problem of model weights not being available once training has finished (because the weights were learned in a subprocess with different memory).

Thus, our implementation of DDP here is much much faster than normal. But of course, we keep both for flexibility:

# very fast :)
Trainer(distributed_backend='ddp')

# very slow
Trainer(distributed_backend='ddp_spawn')

Other cool features of the release

  • .test() now automatically loads the best model weights for you!
model = Model()
trainer = Trainer()
trainer.fit(model)

# automatically loads the best weights!
trainer.test()

  • Install lightning via conda now
conda install pytorch-lightning -c conda-forge

  • ModelCheckpoint tracks the path to the best weights
ckpt_callback = ModelCheckpoint(...)
trainer = Trainer(checkpoint_callback=ckpt_callback)
trainer.fit(model)

# path to the best weights
best_model_path = ckpt_callback.best_model_path

  • Automatically move data to correct device during inference
class LitModel(LightningModule):
    @auto_move_data
    def forward(self, x):
        return x

model = LitModel()
x = torch.rand(2, 3)
model = model.cuda(2)

# this works!
model(x)

  • Many more speed improvements, including single-TPU speed-ups (we already support multi-TPU out of the box as well)

Try Lightning today

If you haven’t yet, give Lightning a chance 🙂

This video explains how to refactor your PyTorch code into Lightning.

Bio: William Falcon is an AI Researcher, and Founder at PyTorch Lightning. He is trying to understand the brain, build AI and use it at scale.

Original. Reposted with permission.





MIT takes down 80 Million Tiny Images data set due to racist and offensive content




Creators of the 80 Million Tiny Images data set from MIT and NYU took the collection offline this week, apologized, and asked other researchers to refrain from using the data set and delete any existing copies. The news was shared Monday in a letter by MIT professors Bill Freeman and Antonio Torralba and NYU professor Rob Fergus published on the MIT CSAIL website.

Introduced in 2006 and containing photos scraped from internet search engines, 80 Million Tiny Images was recently found to contain a range of racist, sexist, and otherwise offensive labels, such as nearly 2,000 images labeled with the N-word and labels like “rape suspect” and “child molester.” The data set also contained pornographic content like non-consensual photos taken up women’s skirts. Creators of the 79.3 million-image data set said it was too large, and its 32 x 32 images too small, for visual inspection of its complete contents to be feasible. According to Google Scholar, 80 Million Tiny Images has been cited more than 1,700 times.

Above: Offensive labels found in the 80 Million Tiny Images data set

“Biases, offensive and prejudicial images, and derogatory terminology alienates an important part of our community — precisely those that we are making efforts to include,” the professors wrote in a joint letter. “It also contributes to harmful biases in AI systems trained on such data. Additionally, the presence of such prejudicial images hurts efforts to foster a culture of inclusivity in the computer vision community. This is extremely unfortunate and runs counter to the values that we strive to uphold.”

The trio of professors say the data set’s shortcomings were brought to their attention by an analysis and audit published late last month (PDF) by University College Dublin Ph.D. student Abeba Birhane and Carnegie Mellon University Ph.D. student Vinay Prabhu. The authors say their assessment is the first known critique of 80 Million Tiny Images.


Both the paper’s authors and the 80 Million Tiny Images creators say part of the problem comes from automated data collection and the nouns from the WordNet data set used for semantic hierarchy. Before the data set was taken offline, the coauthors suggested that the creators of 80 Million Tiny Images follow the example of the ImageNet creators and assess the labels used in the people category of the data set. The paper finds that large-scale image data sets erode privacy and can have a disproportionately negative impact on women, racial and ethnic minorities, and communities at the margins of society.

Birhane and Prabhu assert that the computer vision community must begin having more conversations about the ethical use of large-scale image data sets now in part due to the growing availability of image-scraping tools and reverse image search technology. Citing previous work like the Excavating AI analysis of ImageNet, the analysis of large-scale image data sets shows that it’s not just a matter of data, but a matter of a culture in academia and industry that finds it acceptable to create large-scale data sets without the consent of participants “under the guise of anonymization.”

“We posit that the deeper problems are rooted in the wider structural traditions, incentives, and discourse of a field that treats ethical issues as an afterthought. A field where in the wild is often a euphemism for without consent. We are up against a system that has veritably mastered ethics shopping, ethics bluewashing, ethics lobbying, ethics dumping, and ethics shirking,” the paper states.

To create more ethical large-scale image data sets, Birhane and Prabhu suggest:

  • Blur the faces of people in data sets
  • Do not use Creative Commons licensed material
  • Collect imagery with clear consent from data set participants
  • Include a data set audit card with large-scale image data sets, akin to the model cards Google AI uses and the datasheets for data sets Microsoft Research proposed

The work incorporates Birhane’s previous work on relational ethics, which suggests that the creators of machine learning systems should begin their work by speaking with the people most affected by machine learning systems, and that concepts of bias, fairness, and justice are moving targets.

ImageNet was introduced at CVPR in 2009 and is widely considered important to the advancement of computer vision and machine learning. Whereas previously some of the largest data sets could be counted in the tens of thousands, ImageNet contains more than 14 million images. The ImageNet Large Scale Visual Recognition Challenge ran from 2010 to 2017 and led to the launch of a variety of startups like Clarifai and MetaMind, a company Salesforce acquired in 2016. According to Google Scholar, ImageNet has been cited nearly 17,000 times.

As part of a series of changes detailed in December 2019, ImageNet creators including lead author Jia Deng and Dr. Fei-Fei Li found that 1,593 of the 2,832 people categories in the data set potentially contain offensive labels, which they said they plan to remove.

“We indeed celebrate ImageNet’s achievement and recognize the creators’ efforts to grapple with some ethical questions. Nonetheless, ImageNet as well as other large image datasets remain troublesome,” the Birhane and Prabhu paper reads.




Mozilla Common Voice updates will help train the ‘Hey Firefox’ wakeword for voice-based web browsing




Mozilla today released the latest version of Common Voice, its open source collection of transcribed voice data for startups, researchers, and hobbyists to build voice-enabled apps, services, and devices. Common Voice now contains over 7,226 total hours of contributed voice data in 54 different languages, up from 1,400 hours across 18 languages in February 2019.

Common Voice consists not only of voice snippets, but also of voluntarily contributed metadata useful for training speech engines, like speakers’ ages, sexes, and accents. It’s designed to be integrated with DeepSpeech, a suite of open source speech-to-text and text-to-speech engines and trained models maintained by Mozilla’s Machine Learning Group.

Collecting the over 5.5 million clips in Common Voice required a lot of legwork, namely because the prompts on the Common Voice website had to be translated into each language. Still, 5,591 of the 7,226 hours have been confirmed valid by the project’s contributors so far. And according to Mozilla, five languages in Common Voice — English, German, French, Italian, and Spanish — now have over 5,000 unique speakers, while seven languages — English, German, French, Kabyle, Catalan, Spanish, and Kinyarwanda — have over 500 recorded hours.

Today also saw the release of Mozilla’s first-ever data set target segment, which aims to collect voice data for specific purposes and use cases. This segment includes the digits “zero” through “nine” as well as the words “yes,” “no,” “hey,” and “Firefox,” spoken by 11,000 people for 120 hours collectively across 18 languages. Previously, Common Voice product lead Megan Branson said it would be used partly for “Hey Firefox” wakeword testing.


“This segment data will help Mozilla benchmark the accuracy of our open source voice recognition engine, DeepSpeech, in multiple languages for a similar task and will enable more detailed feedback on how to continue improving the dataset,” Branson wrote in a blog post. “With contributions from all over the globe, [our contributors] are helping us follow through on our goal to create a voice dataset that is publicly available to anyone and represents the world we live in.”

The Common Voice refresh follows a significant update to DeepSpeech that incorporated one of the fastest open source speech recognition models to date. The latest version added support for TensorFlow Lite, a distribution of Google’s TensorFlow machine learning framework that’s optimized for compute-constrained mobile and embedded devices, and cut down DeepSpeech’s memory consumption by 22 times while boosting its startup speed by over 500 times.

Both Common Voice and DeepSpeech inform work on Mozilla projects like Firefox Voice, a browser extension that adds voice recognition support to Firefox. Currently, Firefox Voice can understand commands like “What is the weather” and “Find the Gmail tab,” but the goal is to facilitate “meaningful interactions” with websites using voice alone.

