

Conversational Assistants and Quality with Watson Assistant — The Measures




Daniel Toczala

In an earlier blog post on Conversational Assistants and Quality, I provided a link to a Python notebook that does an automated k-fold test and some intent and entity analysis of your Watson Assistant workspace. I have shared it with a few different Watson Assistant users, and they all seem to like the data that it provides. What they often have issues with is understanding what that data is telling them.

Among the common measures and scores returned when doing statistical testing of a chatbot are Accuracy, Precision, Recall, and F1 Score. We’ll go through these measures one by one.

Keep one thing in mind as you read this: don’t get focused on delivering the “perfect” chatbot. You want to test and measure your chatbot, but focus on the trends and the relative values of your various intents.

What are TP, TN, FP and FN?

These represent all possible outcomes for any given intent, for our testing data. The first result is the True Positive (TP). We like TP — it means that for a given phrase, we wanted it to resolve to intent X, and our chatbot DID CORRECTLY resolve it to intent X.

Our next result is the True Negative (TN). We also like TN — it means that for a given phrase, we wanted it to resolve to some intent other than X, and our chatbot DID CORRECTLY resolve it to that other intent.

Now we get to the areas where our chatbot isn’t doing so well. The first of these measures is the False Positive (FP). A false positive occurs when a phrase resolves to our intent X (it’s a positive match), even though it should have resolved to some other intent (so the match is false).

The final type of result is the False Negative (FN). A false negative occurs when a phrase resolves to some other intent (it’s a negative match), even though it should have resolved to intent X (so the miss is false).


Accuracy

Accuracy is a measure of how well the chatbot is determining proper intents. It is the number of times we get things correct, divided by the total number of measurements. Accuracy can be looked at as a percentage or a ratio; a perfect model would have an accuracy of 1.

You should think of accuracy as how well you are predicting the right user intents. An accuracy of greater than 80% is pretty good. If you need to improve accuracy, it usually means that you need to add more training data for your intents.


Precision

Precision is a measure of how well the chatbot is determining a particular intent: how confident can I be in a prediction of intent X? It is the number of times we correctly predicted intent X, divided by the total number of times (both right and wrong) we predicted intent X. Precision can be looked at as a percentage or a ratio; a perfect model would have a precision of 1. If my chatbot correctly predicted intent X 9 times, and once made an incorrect prediction of intent X, then we would have a precision of 0.9 (9 correct / (9 correct + 1 incorrect)).

You should think of precision as how well you are predicting a particular intent X. Using the above example, if my chatbot predicted intent X, I could be about 90% confident that it did so correctly. A precision of greater than 80% is pretty good. Improving precision means getting your intent better defined: your intent is “reaching,” and data that should not be classified to your intent is getting assigned to it. Better examples, along with finding and eliminating training data that is similar across different intents (ambiguous training data), can help this measure.


Recall

Recall is a measure of how well the chatbot surfaces a particular intent: how confident can I be that my chatbot will recognize intent X? It is the number of times we correctly predicted intent X, divided by the total number of times we SHOULD HAVE predicted intent X. Recall can be looked at as a percentage or a ratio; a perfect model would have a recall of 1. If my chatbot correctly predicted intent X 8 times, and twice incorrectly predicted intent Y instead of intent X, then we would have a recall of 0.8 (8 correct / (8 correct + 2 incorrect)).

You should think of recall as how well you are identifying a particular intent X. Using the above example, if I am worried about catching intent X, I could be about 80% confident that I have caught all instances of intent X. A recall of greater than 80% is pretty good. Better training examples, and more of them, can help recall. Balancing your training data, with similar numbers of examples for similar intents, can also help improve recall scores.

F1 Score

The F1 score is what you share with stakeholders. It combines the above measures into a single score, one which you should be tracking over time. It is the harmonic average of the Precision and Recall scores. Because a harmonic average is pulled toward the lower of the two values, F1 scores tend to sit at or below the precision and recall ratios they combine.

Consider an intent with a precision of 0.95 (which is really good) and a recall of 0.90 (which is pretty good): the resulting F1 score would be 0.92. An F1 score of 0.8 or better is pretty good (since a “perfect” chatbot would have an F1 score of 1.0).

Accuracy: Accuracy measures the ratio of correctly predicted user examples out of all user examples.
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision: Precision measures the ratio of correctly predicted positive observations out of total predicted positive observations.
Precision = TP / (TP + FP)

Recall: Recall measures the ratio of correctly predicted positive observations out of all observations of the target intent.
Recall = TP / (TP + FN)

F1 Score: F1 Score is the harmonic average of Precision and Recall.
F1 = (2 * (Precision * Recall) ) / (Precision + Recall)
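The four formulas above translate directly into Python. This minimal sketch reuses the worked numbers from earlier in the post:

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    return 2 * p * r / (p + r)

# the precision example from above: 9 correct, 1 incorrect prediction of X
print(precision(tp=9, fp=1))      # 0.9
# the recall example: 8 caught, 2 missed
print(recall(tp=8, fn=2))         # 0.8
# the F1 example: precision 0.95, recall 0.90
print(round(f1(0.95, 0.90), 2))   # 0.92
```

Computing these per intent, and watching how they trend over time, is usually more useful than any single aggregate number.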



Facebook uses Amazon EC2 to evaluate the Deepfake Detection Challenge




In October 2019, AWS announced that it was working with Facebook, Microsoft, and the Partnership on AI on the first Deepfake Detection Challenge. Deepfake algorithms are built on the same underlying technology that has given us realistic animation effects in movies and video games. Unfortunately, those same algorithms have been used by bad actors to blur the distinction between reality and fiction. Deepfake videos result from using artificial intelligence to manipulate audio and video to make it appear as though someone did or said something they didn’t. For more information about deepfake content, see The Partnership on AI Steering Committee on AI and Media Integrity.

In machine learning (ML) terms, the Generative Adversarial Network (GAN) algorithm has been the most popular approach for creating deepfakes. A GAN uses a pair of neural networks: a generative network that produces synthetic candidates from random noise, and a discriminative network that tries to tell those candidates apart from real data. The GAN pits one network against the other in an adversarial manner to generate new, synthetic instances of data that can pass for real data, which is what makes a deepfake so hard to distinguish from genuine footage.
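To make the adversarial setup concrete, here is a deliberately tiny sketch: each “network” is a single linear or logistic unit over 1-D data, with hand-derived gradient steps. All distributions and hyperparameters here are made up; a real deepfake GAN uses deep networks over images, but the alternating optimization is the same.

```python
import math
import random

random.seed(0)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-max(min(v, 30), -30)))  # clamp for stability

# Real data: samples from N(4, 0.5). Generator: G(z) = a*z + b with z ~ N(0, 1).
# Discriminator: logistic regression D(x) = sigmoid(w*x + c) on scalars.
a, b = 1.0, 0.0   # generator parameters
w, c = 0.0, 0.0   # discriminator parameters
lr = 0.05

for step in range(2000):
    z = random.gauss(0, 1)
    real = random.gauss(4, 0.5)
    fake = a * z + b

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    for x, label in ((real, 1.0), (fake, 0.0)):
        p = sigmoid(w * x + c)
        w += lr * (label - p) * x
        c += lr * (label - p)

    # Generator step: adjust (a, b) to push D(G(z)) toward 1.
    p = sigmoid(w * fake + c)
    g = (1.0 - p) * w   # gradient of log D(fake) with respect to fake
    a += lr * g * z
    b += lr * g

# After training, generated samples should cluster near the real mean (~4).
```

Once the two players reach equilibrium, the discriminator can no longer separate generated samples from real ones, which is exactly the property deepfake detectors have to fight against.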

The goal of this challenge was to incentivize researchers around the world to build innovative methods that can help detect deepfakes and manipulated media. The competition, which ended on March 31, 2020, was popular amongst the Kaggle data science community. The deepfake project emphasized the benefits of scaling and optimizing the cost of deep learning batch inference. Once the competition was complete, the team at Facebook hosted the deepfake competition data on AWS and made it available to the world, encouraging researchers to keep fighting this problem.

There were over 4,200 total submissions from over 2,300 teams worldwide. The participating submissions were scored with a log loss function, where a smaller score is better (for more information about scoring, see the contest rules).
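As a sketch, a binary log loss of this kind can be computed as follows; the clipping epsilon is an assumption on my part, not the contest’s exact scorer:

```python
import math

def log_loss(labels, probs, eps=1e-15):
    """Binary log loss: labels are 1 (fake) or 0 (real), probs are the
    submitted probability that each video is fake. Predictions are clipped
    so log(0) never occurs; the exact epsilon here is an assumption."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)

# A maximally uncertain submission (all 0.5) scores ln(2) ≈ 0.693:
print(round(log_loss([1, 0, 1], [0.5, 0.5, 0.5]), 3))  # 0.693
```

Note how the clipping matters: a confident wrong answer at probability 1.0 would otherwise score an infinite loss.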

Four groups of datasets were associated with the competition:

  • Training – The participating teams used this set for training their model. It consisted of 470 GB of video files, with real and fake labels for each video.
  • Public validation – Consisted of a sample of 400 videos from the test dataset.
  • Public test – Used by the Kaggle platform to compute the public leaderboard.
  • Private test – Held by the Facebook team, the competition host, outside of the Kaggle platform for scoring the competition. The results from the private test set were displayed on the competition’s private leaderboard. This set contains videos of a similar format and nature as the training, public validation, and public test sets, but includes real, organic videos as well as deepfakes.

After the competition deadline, Kaggle transferred the code for the two final submissions from each team to the competition host. The hosting team re-ran the submission code against this private dataset and returned prediction submissions to Kaggle to compute the final private leaderboard scores. The submissions were based on two types of compute virtual machines (VMs): GPU-based and CPU-based. Most of the submissions were GPU-based.

The competition hosting team at Facebook recognized several challenges in conducting an evaluation for the unexpectedly large number of participants. With over 4,200 total submissions, each requiring about 9 GPU hours of runtime on a p3.2xl Amazon Elastic Compute Cloud (Amazon EC2) P3 instance, they would need an estimated 42,000 GPU compute hours (almost 5 years’ worth of serial compute) to complete the evaluation. To make the project even more challenging, they needed to do 5 years of GPU compute in 3 weeks.

Given the tight deadline, the host team had to address several constraints to complete the evaluation within the time and budget allotted.

Operational efficiency

To meet the tight timeframes for the competition and keep the workload manageable for a small team, the solution had to be low-code. To address the low-code requirement, they chose AWS Batch for scheduling and scaling out the compute workload. The following diagram illustrates the solution architecture.

AWS Batch was originally designed for developers, scientists, and engineers to easily and efficiently manage large numbers of batch computing jobs on AWS with little coding or cloud infrastructure deployment experience. There’s no need to install and manage batch computing software or server clusters, which allows you to focus on analyzing and solving problems. AWS Batch provides scheduling and scales out batch computing workloads across the full range of AWS compute services, such as Amazon EC2 and Spot Instances. Furthermore, AWS Batch has no additional charges for managing cluster resources. In this use case, the host simply submitted 4,200 compute jobs, one for each Kaggle submission container; each job ran for about 9 hours. With a cluster of instances, all jobs completed in less than three weeks.


Elasticity

The tight timeframes for the competition, as well as needing those instances for only a short period, speak to the need for elasticity in compute. For example, the team estimated they would need a minimum of 85 Amazon EC2 P3 GPUs running in parallel around the clock to complete the evaluation. To account for restarts and other issues causing lost time, they allowed for up to an additional 50% in capacity. Facebook was able to quickly scale up the number of GPUs and CPUs needed for the evaluation and scale them down when finished, only paying for what they used. This was much more efficient in terms of budget and operations effort than acquiring, installing, and configuring the compute on-premises.
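The capacity arithmetic works out roughly as follows (a sketch using the article’s round numbers; the “minimum of 85” presumably includes a small margin over the raw division):

```python
import math

# Back-of-the-envelope capacity planning, using the article's round numbers.
gpu_hours = 42_000                     # estimated total GPU compute needed
serial_years = gpu_hours / (24 * 365)  # ≈ 4.8 years if run on one GPU
deadline_hours = 3 * 7 * 24            # the three-week evaluation window
min_instances = gpu_hours / deadline_hours     # GPUs needed around the clock
with_buffer = math.ceil(min_instances * 1.5)   # +50% for restarts/lost time

print(round(serial_years, 1), math.ceil(min_instances), with_buffer)  # 4.8 84 125
```

Elastic capacity means you provision for the buffered peak only while the evaluation runs, then release it.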


Security

Security was another significant concern. Submissions from such a wide array of participants could contain viruses, malware, bots, or rootkits. Running these containers in a sandboxed cloud environment avoided that risk. If the evaluation environment were exposed to various infectious agents, it could be terminated and easily rebuilt without exposing any production systems to downtime or data loss.

Privacy and confidentiality

Privacy and confidentiality are closely related to the security concerns. To address those concerns, all the submissions and data were held in a single, closely held AWS account with private virtual private clouds (VPCs) and restrictive permissions using AWS Identity and Access Management (IAM). To ensure privacy and confidentiality of the submitted models, and fairness in grading, a single, dedicated engineer was responsible for conducting the evaluation without looking into any of the Docker images submitted by the various teams.


Cost

Cost was another important constraint the team had to consider. A rough estimate of 42,000 hours of Amazon EC2 P3 instance runtime would cost about $125,000.

To lower the cost of GPU compute, the host team determined that the Amazon EC2 G4 (NVIDIA Tesla T4 GPUs) instance type was more cost-effective for this workload than the P3 (Volta V100 GPUs) instance type. Among the GPU instances in the cloud, Amazon EC2 G4 instances are cost-effective and versatile options for deploying ML models.

These instances are optimized for ML application deployments (inference), such as image classification, object detection, recommendation engines, automated speech recognition, and language translation, workloads that push the boundaries of AI innovation while demanding low latency.

The host team completed a few test runs with the G4 instance type. Each submission’s runtime was a little over twice its runtime on the P3 instances, putting the projected need at approximately 90,000 compute hours. However, the G4 instances cost up to 83% less per hour than the P3 instances. Even with longer runtimes per job on the G4 instances, the total compute cost decreased from $125,000 to just under $50,000. The following table illustrates the cost-effectiveness of the G4 instance type per inference.

                     p3.2xl      g4dn.8xl
Runtime (hours)      90,000      25,000
Cost (USD)           $125,000    $50,000
Cost per inference   $30         $12
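The cost-per-inference row follows from dividing each projected total cost by the roughly 4,200 submissions evaluated; a quick sketch:

```python
# Sanity-checking the cost-per-inference figures: total projected cost
# divided by the number of submissions evaluated.
submissions = 4200

def cost_per_inference(total_cost_usd):
    return total_cost_usd / submissions

print(round(cost_per_inference(125_000)))  # 30  (p3.2xl projection)
print(round(cost_per_inference(50_000)))   # 12  (g4dn.8xl projection)
```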

The host team shared that many of the submission runs completed with less compute time than originally projected. The initial projection was based upon early model submissions, which were larger than the average size for all models submitted. About 80% of the runs took advantage of the G4 instance type, while some had to be run on the P3 instances due to slight differences in available GPU memory between the two instance types. The final numbers were 25,000 G4 (GPU) compute hours, 5,000 C4 (CPU) compute hours, and 800 P3 (GPU) compute hours, totaling $20,000 in compute cost. After approximately two weeks of around-the-clock evaluation, the host team completed the challenging task of evaluating all the submissions early and consumed less than half of the $50,000 estimate.


The host team was able to complete the full evaluation of over 4,200 submissions in less time than was available, while meeting the grading fairness criteria and coming in under budget. The team replicated the evaluation environment with a success rate of 94%, which is high for a two-stage competition.

Software projects are often risk-prone due to technological uncertainties, and perhaps even more so due to inherent complexity and constraints. The breadth and depth of AWS services running on Amazon EC2 allow you to solve your unique challenges by reducing technology uncertainty. In this case, the Facebook team completed the deepfake evaluation challenge on time and under budget with only one software engineer. The engineer started by selecting a low-code solution, AWS Batch, a proven service for even larger-scale HPC workloads, and reduced the evaluation cost by roughly two-thirds by choosing the AI inference-optimized G4 EC2 instance type.

AWS believes there’s no one solution to a problem. Solutions often consist of multiple and flexible building blocks from which you can craft solutions that meet your needs and priorities.

About the Authors

Wenming Ye is an AI and ML specialist architect at Amazon Web Services, helping researchers and enterprise customers use cloud-based machine learning services to rapidly scale their innovations. Previously, Wenming had a diverse R&D experience at Microsoft Research, SQL engineering team, and successful startups.

Tim O’Brien is a Senior Solutions Architect at AWS focused on Machine Learning and Artificial Intelligence. He has over 30 years of experience in information technology, security, and accounting. In his spare time, he likes hiking, climbing, and skiing with his wife and two dogs.




Best Research Papers From ACL 2020




ACL is the leading conference in the field of natural language processing (NLP), covering a broad spectrum of research areas in computational linguistics. Due to COVID-19 risks, ACL 2020 took place entirely virtually, like the other big academic conferences this year.

However, as always, it was the best place to learn about the latest NLP research trends and cutting-edge research papers in language modeling, conversational AI, machine translation, and other NLP research topics.

Following the long-standing tradition, the best paper awards were announced during the last day of the main conference. In this article, we’ve summarized the key research ideas of the papers that received the Best Paper Award and Honorable Mentions at ACL 2020.


If you’d like to skip around, here are the papers we featured:

  1. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
  2. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
  3. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

ACL 2020 Best Paper Awards

1. Beyond Accuracy: Behavioral Testing of NLP models with CheckList, by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Original Abstract

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

Our Summary

The authors point out the shortcomings of existing approaches to evaluating performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList, a new evaluation methodology for testing of NLP models. The approach is inspired by principles of behavioral testing in software engineering. Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.


What’s the core idea of this paper?

  • Existing approaches to evaluation of NLP models have many significant shortcomings:
    • The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs.
    • The alternative approaches are usually designed for evaluation of specific behaviors on individual tasks and thus, lack comprehensiveness.
  • To address this problem, the research team introduces CheckList, a new methodology for evaluating NLP models, inspired by the behavioral testing in software engineering:
    • CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation.
    • Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types, such as prediction invariance or directional expectation tests under certain perturbations.
    • Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
  • The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily.
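An invariance test, one of the test types mentioned above, can be sketched in a few lines. This is an illustration of the idea, not the CheckList library’s actual API; the toy model and perturbation are made up:

```python
def invariance_test(model, texts, perturb):
    """INV-style check (a sketch): a label-preserving perturbation
    should not change the model's prediction."""
    failures = []
    for text in texts:
        before = model(text)
        variant = perturb(text)
        if model(variant) != before:
            failures.append((text, variant))
    return failures

# Hypothetical toy sentiment model and perturbation, for illustration only.
def toy_sentiment(text):
    return "positive" if "good" in text.lower() else "negative"

def add_exclamation(text):
    return text + "!"

print(invariance_test(toy_sentiment, ["good movie", "dull plot"], add_exclamation))
# []  (this toy model happens to be invariant to trailing punctuation)
```

Directional expectation tests work the same way, except the assertion is that the prediction should move in a known direction (for example, adding “not” should flip or lower a sentiment score).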

What’s the key achievement?

  • Evaluation of state-of-the-art models with CheckList demonstrated that even though some NLP tasks are considered “solved” based on accuracy results, the behavioral testing highlights many areas for improvement.
  • Applying CheckList to an extensively tested public-facing system for sentiment analysis showed that this methodology:
    • helps to identify and test for capabilities not previously considered;
    • results in a more thorough and comprehensive testing for previously considered capabilities;
    • helps to discover many more actionable bugs.

What does the AI community think?

  • The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.

What are possible business applications?

  • CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
  • Such comprehensive testing that helps in identifying many actionable bugs is likely to lead to more robust NLP systems.

Where can you get implementation code?

  • The code for testing NLP models with CheckList is available on GitHub.

2. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks, by Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith

Original Abstract

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

Our Summary

The research team from the Allen Institute for Artificial Intelligence investigates whether the leading language models trained on massive heterogeneous corpora work universally or whether it is still useful to build separate models pretrained for specific domains. They address this question by considering four different domains and eight classification tasks, spanning low- and high-resource settings. Furthermore, they consider domain-adaptive pretraining as well as task-adaptive pretraining. The findings of the researchers suggest that both pretraining approaches consistently improve the performance of RoBERTa, one of the leading language models. They also show that manual curation of datasets for specific tasks further enhances model performance.


What’s the core idea of this paper?

  • Today’s leading language models are trained on massive heterogeneous datasets and achieve strong performance across many tasks. At the same time, the benefits of domain-specific or task-specific pretraining are not well investigated.
  • The research team addresses this question for one of the leading language models, RoBERTa:
    • They consider four domains (biomedical research papers, computer science papers, news, and reviews) and eight classification tasks (two in each domain), in both high- and low-resource settings.
    • The experiments cover continued pretraining on the domain, known as domain-adaptive pretraining (DAPT), and pretraining on a directly task-relevant corpus, known as task-adaptive pretraining (TAPT).
    • Additionally, the researchers study the benefits of datasets for task-adaptive pretraining being manually curated by task designers.

What’s the key achievement?

  • Demonstrating the importance of domain-specific and task-specific pretraining. The experiments show that:
    • Domain-adaptive pretraining consistently improves performance on tasks from the target domain, in both low- and high-resource settings.
    • Task-adaptive pretraining significantly boosts the performance of the language model, with or without domain-adaptive pretraining.
    • Benefits from task-adaptive pretraining increase with additional unlabeled data that has been manually curated by task designers or annotators.

What does the AI community think?

  • The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing.

What are future research areas?

  • The authors suggest the following directions for future research:
    • better data selection for task-adaptive pretraining;
    • efficient adaptation of large pretrained language models to distant domains;
    • building reusable language models after adaptation.

What are possible business applications?

  • The approaches studied in this paper can be applied to any pretrained language model to further improve its performance in specific domains and for specific NLP tasks.

Where can you get implementation code?

  • The implementation code as well as pretrained models for multiple domains and tasks are publicly available on GitHub.

3. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics, by Nitika Mathur, Timothy Baldwin, Trevor Cohn

Original Abstract

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

Our Summary

The most recent Conference on Machine Translation (WMT) has revealed that, based on Pearson’s correlation coefficient, automatic metrics poorly match human evaluations of translation quality when comparing only a few best systems. Even negative correlations were exhibited in some instances. The research team from the University of Melbourne investigates this issue by studying the role of outlier systems, exploring how the correlation coefficient reflects different patterns of errors (type I vs. type II errors), and what magnitude of difference in the metric score corresponds to true improvements in translation quality as judged by humans. Their findings suggest that small BLEU differences (i.e., 1–2 points) have little meaning, and that other metrics, such as chrF, YiSi-1, and ESIM, should be preferred over BLEU. However, only human evaluations can be a reliable basis for drawing important empirical conclusions.


What’s the core idea of this paper?

  • Automatic metrics are used as a proxy for human translation evaluation, which is considerably more expensive and time-consuming.
  • However, evaluating how well different automatic metrics concur with human evaluation is not a straightforward problem:
    • For example, the recent findings show that if the correlation between leading metrics and human evaluations is computed using a large set of translation systems, it is typically very high (i.e., 0.9). However, if only a few best systems are considered, the correlation reduces markedly and can even be negative in some cases.
  • The authors of this paper take a closer look at this problem and discover that:
    • The identified problem with Pearson’s correlation is due to the small sample size and not specific to comparing strong MT systems.
    • Outlier systems, whose quality is much higher or lower than the rest of the systems, have a disproportionate effect on the computed correlation and should be removed.
    • The same correlation coefficient can reflect different patterns of errors. Thus, a better approach for gaining insights into metric reliability is to visualize metric scores against human scores.
    • Small BLEU differences of 1-2 points correspond to true improvements in translation quality (as judged by humans) only in 50% of cases.
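The outlier effect described above is easy to reproduce; the scores below are fabricated for illustration:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

# Hypothetical scores for five closely matched systems: metric and human
# judgments disagree, so the correlation is weak (here, negative).
metric = [60.1, 60.4, 60.2, 60.6, 60.3]
human = [0.12, 0.10, 0.15, 0.11, 0.14]

# Add one clearly terrible outlier system and the correlation looks excellent.
metric_out = metric + [30.0]
human_out = human + [-1.50]

print(round(pearson(metric, human), 2), round(pearson(metric_out, human_out), 2))
```

A single outlier dominates both variances and the covariance, which is why the paper recommends removing outlier systems (and plotting metric against human scores) before trusting a correlation number.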

What’s the key achievement?

  • Conducting a thorough analysis of automatic metrics vs. human judgments in machine translation, and providing key recommendations on evaluating MT systems:
    • Giving preference to such evaluation metrics as chrF, YiSi-1, and ESIM over BLEU and TER.
    • Moving away from using small changes in evaluation metrics as the sole basis to draw important empirical conclusions, and always ensuring support from human evaluations before claiming that one MT system significantly outperforms another one.

What does the AI community think?

  • The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing.

Where can you get implementation code?

  • The implementation code, data, and additional analysis will be released on GitHub.





Wireless aquatic robot could clean water and transport cells




Researchers at Eindhoven University of Technology developed a tiny plastic robot, made of responsive polymers, which moves under the influence of light and magnetism. In the future this ‘wireless aquatic polyp’ should be able to attract and capture contaminant particles from the surrounding liquid or pick up and transport cells for analysis in diagnostic devices. The researchers published their results in the journal PNAS.

The mini robot is inspired by a coral polyp, a small soft creature with tentacles that makes up the corals in the ocean. Doctoral candidate Marina Pilz Da Cunha: “I was inspired by the motion of these coral polyps, especially their ability to interact with the environment through self-made currents.” The stem of a living polyp makes a specific movement that creates a current to attract food particles, and the tentacles then grab the food particles floating by.

The wireless artificial polyp measures 1 by 1 cm and combines a stem that reacts to magnetism with light-steered tentacles. “Combining two different stimuli is rare since it requires delicate material preparation and assembly, but it is interesting for creating untethered robots because it allows for complex shape changes and tasks to be performed,” explains Pilz Da Cunha. The tentacles move when light is shone on them, with different wavelengths producing different results: the tentacles ‘grab’ under UV light and ‘release’ under blue light.


The device now presented can grab and release objects underwater, a new capability compared with the light-guided package-delivery mini robot the researchers presented earlier this year. That land-based robot couldn’t work underwater, because its polymers act through photothermal effects: the heat generated by the light, rather than the light itself, fueled the robot. Pilz Da Cunha: “Heat dissipates in water, which makes it impossible to steer the robot under water.” She therefore developed a photomechanical polymer material that moves under the influence of light alone, not heat.

And that is not its only advantage. Besides operating underwater, this new material can hold its deformation after being activated by light. While a photothermal material immediately returns to its original shape once the stimulus is removed, the molecules in the photomechanical material actually take on a new state. This allows different stable shapes to be maintained for a longer period of time. “That helps to control the gripper arm; once something has been captured, the robot can keep holding it until it is addressed by light once again to release it,” says Pilz Da Cunha.


A rotating magnet placed underneath the robot makes the stem circle around its axis. Pilz Da Cunha: “It was therefore possible to actually move floating objects in the water towards the polyp, in our case oil droplets.”

The position of the tentacles (open, closed or something in between) turned out to have an influence on the fluid flow. “Computer simulations, with different tentacle positions, eventually helped us to understand and get the movement of the stem exactly right. And to ‘attract’ the oil droplets towards the tentacles,” explains Pilz Da Cunha.


An added advantage is that the robot operates independently of the composition of the surrounding liquid. This is unique, because hydrogels, the dominant stimuli-responsive materials used for underwater applications today, are sensitive to their environment and therefore behave differently in contaminated water. Pilz Da Cunha: “Our robot also works in the same way in salt water, or water with contaminants. In fact, in the future the polyp may be able to filter contaminants out of the water by catching them with its tentacles.”


Pilz Da Cunha is now working on the next step: an array of polyps that can work together. She hopes to realize particle transport in which one polyp passes a package on to the next. A swimming robot is also on her wish list; here she thinks of biomedical applications such as capturing specific cells.

To achieve this, the researchers still have to work on the wavelengths to which the material responds. “UV light affects cells and the depth of penetration in the human body is limited. In addition, UV light might damage the robot itself, making it less durable. Therefore we want to create a robot that doesn’t need UV light as a stimulus,” concludes Pilz Da Cunha.


