If You Can Write Functions, You Can Use Dask

This article is the second in an ongoing series on using Dask in practice. Each article in the series is simple enough for beginners but provides useful tips for real work. The first article in the series is about using LocalCluster.


By Hugo Shi, Founder of Saturn Cloud

I’ve been chatting with many data scientists who’ve heard of Dask, the Python framework for distributed computing, but don’t know where to start. They know that Dask can probably speed up many of their workflows by having them run in parallel across a cluster of machines, but the task of learning a whole new methodology can seem daunting. I’m here to tell you that you can start getting value from Dask without having to learn the entire framework. If you spend time waiting for notebook cells to execute, there’s a good chance Dask can save you time. Even if you only know how to write Python functions, you can take advantage of this without learning anything else! This blog post is a “how to use Dask without learning the whole thing” tutorial.

Dask, dataframes, bags, arrays, schedulers, workers, graphs, RAPIDS, oh no!

 
 
There is a lot of complicated content out there about Dask, which can be overwhelming. That's because Dask can utilize a cluster of worker machines to do many cool things! But forget about all that for now. This article focuses on simple techniques that can save you time, without having to change much about how you work.
 

For loops and functions

 
 
Pretty much every data scientist has done something like this, where you have a set of dataframes stored in separate files and you use a for loop to read them all, do some logic, then combine them:

results = []
for file in files:
    df = pd.read_csv(file)
    ## begin genius algorithm
    brilliant_features = []
    for feature in features:
        brilliant_features.append(compute_brilliant_feature(df, feature))
    magical_business_insight = make_the_magic(brilliant_features)
    results.append(magical_business_insight)

Over time, you end up with more files, or the genius algorithm gets more complicated and takes longer to run. And you end up waiting. And waiting.


 

Step 1 is to encapsulate your code in a function. You want to encapsulate the stuff that goes inside the for loop. This makes it easier to understand what the code is doing (converting a file to something useful via magic). More importantly, it makes it easier to use that code in ways besides for loops.

def make_all_the_magics(file):
    df = pd.read_csv(file)
    brilliant_features = []
    for feature in features:
        brilliant_features.append(compute_brilliant_feature(df, feature))
    magical_business_insight = make_the_magic(brilliant_features)
    return magical_business_insight

results = []
for file in files:
    magical_business_insight = make_all_the_magics(file)
    results.append(magical_business_insight)

Step 2 is to parallelize it with Dask. Now, instead of using a for loop, where each iteration happens after the previous one, Dask will run them in parallel on a cluster. This should give us results far faster, and is only three lines longer than the for-loop code!

from dask import delayed
from dask.distributed import Client

# same function but with a Dask delayed decorator
@delayed
def make_all_the_magics(file):
    df = pd.read_csv(file)
    brilliant_features = []
    for feature in features:
        brilliant_features.append(compute_brilliant_feature(df, feature))
    magical_business_insight = make_the_magic(brilliant_features)
    return magical_business_insight

results = []
for file in files:
    magical_business_insight = make_all_the_magics(file)
    results.append(magical_business_insight)

# new Dask code
c = Client()
results = c.compute(results, sync=True)

How it works:

  • The delayed decorator transforms your function. Now, when you call it, it isn’t evaluated. Instead, you get back a delayed object, which Dask can execute later.
  • Client().compute sends all those delayed objects to the Dask cluster, where they are evaluated in parallel! That’s it, you win!
  • Instantiating a Client automatically provisions a LocalCluster. This means that the Dask parallel workers are all processes on the same machine as the one calling Dask, which makes for a concise example. For real work, I recommend creating local clusters in the terminal and connecting to them from your notebook, as in the sketch below.
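
A minimal sketch of that workflow, assuming the dask-scheduler and dask-worker commands that ship with dask.distributed (the address below is just the scheduler's default; use whatever address your scheduler actually prints on startup):

# In a terminal, start a scheduler and one or more workers, for example:
#   dask-scheduler
#   dask-worker tcp://127.0.0.1:8786
from dask.distributed import Client

# connect the notebook to the already-running cluster instead of letting
# Client() spin up a LocalCluster for you
c = Client("tcp://127.0.0.1:8786")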

Practical Topics

 
 
The above is where most Dask tutorials stop. I've used this approach in my own work and with numerous customers, and a couple of practical issues always arise. These next tips will help you go from the textbook example above to methods that are more useful in practice, covering two topics that constantly come up: large objects and error handling.
 

Large Objects

 
In order to compute functions on a distributed cluster, the objects the functions are called on need to be sent to the workers. This can lead to performance issues, since those objects need to be serialized (pickled) on your computer and sent over the network. Imagine you were processing gigabytes of data: you don't want to have to transfer it each time a function runs on it. If you accidentally send large objects, you may see a message from Dask like this:

Consider scattering large objects ahead of time
with client.scatter to reduce scheduler burden and
keep data on workers
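
If you want a rough sense of how much data a given argument would ship, you can check its pickled size before handing it to Dask. This is just a back-of-the-envelope helper using the standard library, not something Dask requires; some_big_dataframe is a placeholder for whatever object you were about to pass into a delayed function:

import pickle

def approx_payload_mb(obj):
    # rough estimate of how many megabytes it costs to ship this object
    return len(pickle.dumps(obj)) / 1e6

print(approx_payload_mb(some_big_dataframe))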

There are two ways to stop this from happening: you can send smaller objects to workers so the burden isn’t so bad, or you can try to send each object to a worker only once, so you don’t keep having to make transfers.
Fix 1: send small objects when possible
This example is good because we're sending a file path (a small string) instead of the dataframe.

# good, do this
results = []
for file in files:
    magical_business_insight = make_all_the_magics(file)
    results.append(magical_business_insight)

Below is what not to do, both because you would be doing the CSV reading (expensive and slow) in the loop, where it is not parallel, and because we're now sending dataframes (which can be large).

# bad, do not do this
results = []
for file in files:
    df = pd.read_csv(file)
    magical_business_insight = make_all_the_magics(df)
    results.append(magical_business_insight)

Oftentimes, code can be rewritten to change where data is being managed, either on the client or on the workers. Depending on your situation, there may be huge time savings in thinking through what your functions take as input and how data transfers can be minimized.
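
For example, if all you need downstream is a small summary of each dataframe, have the worker compute that summary so only the small result travels back to the client. A hypothetical sketch (the summarize function and the revenue column are made up for illustration):

@delayed
def summarize(file):
    df = pd.read_csv(file)      # the full dataframe only ever lives on the worker
    return df['revenue'].sum()  # a single number travels back to the client

totals = c.compute([summarize(file) for file in files], sync=True)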
Fix 2: send objects only once
If you have to send a big object, don't send it multiple times. For example, if I need to send a big model object in order to compute, simply passing it as a parameter will serialize the model multiple times (once per file).

# bad, do not do this
results = []
for file in files:
    # big model has to be sent to a worker each time the function is called
    magical_business_insight = make_all_the_magics(file, big_model)
    results.append(magical_business_insight)

I can tell Dask not to do that by shipping the model to the workers once, ahead of time, with client.scatter (wrapping the model in a delayed object works too; see the sketch after the next example).

# good, do this
results = []
big_model = client.scatter(big_model)  # send the model to the workers first
for file in files:
    magical_business_insight = make_all_the_magics(file, big_model)
    results.append(magical_business_insight)
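
If you'd rather not manage scatter yourself, wrapping the model in delayed has a similar effect: the model becomes a single node in the task graph that every call depends on, rather than being re-serialized into each task. A minimal sketch, assuming the same big_model and two-argument make_all_the_magics as above:

from dask import delayed

big_model = delayed(big_model)  # one shared graph node instead of one copy per task
results = []
for file in files:
    results.append(make_all_the_magics(file, big_model))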

Handling Failure

 
As your computational tasks grow, you'll often want to be able to power through failures. In this case, maybe 5% of my CSVs have bad data that I can't handle. I'd like to process 95% of the CSVs successfully, but keep track of the failures so I can adjust my methods and try again.

The loop below does this.

import traceback
from distributed.client import wait, FIRST_COMPLETED, ALL_COMPLETED

queue = c.compute(results)
futures_to_index = {fut: i for i, fut in enumerate(queue)}
results = [None for x in range(len(queue))]

while queue:
    result = wait(queue, return_when=FIRST_COMPLETED)
    for future in result.done:
        index = futures_to_index[future]
        if future.status == 'finished':
            print(f'finished computation #{index}')
            results[index] = future.result()
        else:
            print(f'errored #{index}')
            try:
                future.result()
            except Exception as e:
                results[index] = e
                traceback.print_exc()
    queue = result.not_done

print(results)

Since this code is fairly complicated at first glance, let's break it down.

queue = c.compute(results)
futures_to_index = {fut: i for i, fut in enumerate(queue)}
results = [None for x in range(len(queue))]

We call compute on results, but since we're not passing sync=True, we immediately get back futures, which represent computations that have not completed yet. We also create a mapping from each future to the nth input argument that generated it. Finally, we populate a list of results, filled with Nones for now.

while queue:
    result = wait(queue, return_when=FIRST_COMPLETED)

Next, we wait for results, and we process them as they come in. When we wait for futures, they are separated into futures that are done, and those that are not_done.

        if future.status == 'finished':
            print(f'finished computation #{index}')
            results[index] = future.result()

If the future is finished, then we print that we succeeded, and we store the result.

        else:
            print(f'errored #{index}')
            try:
                future.result()
            except Exception as e:
                results[index] = e
                traceback.print_exc()

Otherwise, we call future.result(), which re-raises the worker's exception on the client; we catch it, store the exception in the results list, and print the stack trace.

    queue = result.not_done

Finally, we set the queue to those futures that have not yet been completed.
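
If you prefer not to manage the queue yourself, dask.distributed also ships an as_completed helper that hands you futures one at a time as they finish. A sketch of the same fail-tolerant pattern, under the same assumptions as above (c is the Client and results is the list of delayed objects):

from dask.distributed import as_completed

futures = c.compute(results)
futures_to_index = {fut: i for i, fut in enumerate(futures)}
gathered = [None] * len(futures)

for future in as_completed(futures):
    index = futures_to_index[future]
    if future.status == 'finished':
        gathered[index] = future.result()
    else:
        gathered[index] = future.exception()  # keep the error and keep going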
 

Conclusion

 
 
Dask can definitely save you time. If you spend time waiting for code to run, you should use these simple tips to parallelize your work. There are also many advanced things you can do with Dask, but this is a good starting point.

 
Bio: Hugo Shi is Founder of Saturn Cloud, the go-to cloud workspace to scale Python, collaborate, deploy jobs, and more.

Original. Reposted with permission.

Source: https://www.kdnuggets.com/2021/09/write-functions-use-dask.html



