Creating a complete TensorFlow 2 workflow in Amazon SageMaker




Managing the complete lifecycle of a deep learning project can be challenging, especially if you use multiple separate tools and services. For example, you may use different tools for data preprocessing, prototyping training and inference code, full-scale model training and tuning, model deployments, and workflow automation to orchestrate all of the above for production. Friction caused by switching tools can slow down projects and increase costs. This post shows how to efficiently manage the complete lifecycle of deep learning projects with Amazon SageMaker. TensorFlow 2 is the framework used in example code, although the concepts described are generally applicable to other frameworks as well.

This post also has an associated sample notebook, which you can run in less than an hour to demonstrate all of the features discussed here. For more information, see the GitHub repo.

Overview of the Amazon SageMaker workflow

Every data science project using TensorFlow 2 or another framework begins with a dataset: obtaining, exploring, and preprocessing it. In the context of an Amazon SageMaker workflow, data exploration typically occurs within notebooks. These notebooks preferably run on relatively small, less powerful, and inexpensive instance types because they typically run for most of the workday.

Accordingly, unless the dataset is relatively small, a notebook isn’t the best place to perform full-scale data processing, model training, and inference, because these tasks typically require substantial parallel computing resources. Instead, it’s much more practical and cost-effective to use Amazon SageMaker’s functionality for spinning up separate clusters of right-sized, more powerful instances that can complete these tasks promptly. These instances are billed by the second, and at job completion, Amazon SageMaker automatically shuts them down. As a result, in a typical Amazon SageMaker workflow, the most frequent charges are only for relatively inexpensive notebooks for data exploration and prototyping, rather than for more powerful and expensive GPU and accelerated compute instances.

When prototyping is complete, you can move beyond notebooks with workflow automation. An automated pipeline is necessary for orchestrating the complete workflow through model deployment in a robust and repeatable way. Amazon SageMaker provides a native solution for this as well. The following sections of this post introduce various features of Amazon SageMaker that you can use to implement these project lifecycle stages.

Data transformation with Amazon SageMaker Processing

Amazon SageMaker Processing helps you preprocess large datasets in a right-sized, managed cluster separate from notebooks. Amazon SageMaker Processing includes off-the-shelf support for Scikit-learn, and supports any other technology that is containerized. For example, you can launch transient Apache Spark clusters for feature transformations within Amazon SageMaker Processing.

To use Amazon SageMaker Processing with Scikit-learn, supply a Python data preprocessing script with standard Scikit-learn code. There is only a minimal contract for the script: input and output data must be placed in specified locations. Amazon SageMaker Processing automatically loads the input data from Amazon Simple Storage Service (Amazon S3) and uploads transformed data back to Amazon S3 when the job is complete.
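The input/output contract can be sketched as follows. This is a minimal illustration, not the script from the sample notebook: the function name preprocess, the output file name data.csv, and the 80/20 split are assumptions, and the standard library stands in for Scikit-learn calls so that the sketch runs anywhere. The default directories are the container paths used in the job configuration later in this section.

```python
import csv
import os
import random


def preprocess(input_dir="/opt/ml/processing/input",
               train_dir="/opt/ml/processing/train",
               test_dir="/opt/ml/processing/test",
               test_fraction=0.2, seed=0):
    """Read every CSV under input_dir, shuffle the rows, and write a train/test split.

    Returns the number of rows written to the train and test directories.
    """
    rows = []
    for name in sorted(os.listdir(input_dir)):
        if name.endswith(".csv"):
            with open(os.path.join(input_dir, name), newline="") as f:
                rows.extend(csv.reader(f))

    # Deterministic shuffle so repeated runs produce the same split.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    splits = {train_dir: rows[n_test:], test_dir: rows[:n_test]}

    # Anything written to the output directories is uploaded when the job ends.
    for out_dir, split_rows in splits.items():
        os.makedirs(out_dir, exist_ok=True)
        with open(os.path.join(out_dir, "data.csv"), "w", newline="") as f:
            csv.writer(f).writerows(split_rows)
    return len(rows) - n_test, n_test
```

Inside a processing container, the script would simply call preprocess() with its defaults; the directories are parameters only so the function can be tried locally.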

Before starting an Amazon SageMaker Processing job, instantiate a SKLearnProcessor object as shown in the following code example. In this object, specify the instance type to use in the job and the number of instances.

from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=get_execution_role(),
                                     instance_type='ml.m5.xlarge',
                                     instance_count=2)

To distribute the data files equally among the cluster instances for processing, specify the ShardedByS3Key distribution type in the ProcessingInput object. This makes sure that if there are n instances, each instance receives 1/n files from the specified S3 bucket. The ability to easily create a large cluster of instances for stateless data transformations is just one of the many benefits Amazon SageMaker Processing provides.

from sagemaker.processing import ProcessingInput, ProcessingOutput
from time import gmtime, strftime

# bucket, s3_prefix, and raw_s3 must be defined before this point.
processing_job_name = "tf-2-workflow-{}".format(strftime("%d-%H-%M-%S", gmtime()))
output_destination = 's3://{}/{}/data'.format(bucket, s3_prefix)

sklearn_processor.run(code='',  # path to the preprocessing script
                      job_name=processing_job_name,
                      inputs=[ProcessingInput(
                          source=raw_s3,
                          destination='/opt/ml/processing/input',
                          s3_data_distribution_type='ShardedByS3Key')],
                      outputs=[ProcessingOutput(output_name='train',
                                                destination='{}/train'.format(output_destination),
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test',
                                                destination='{}/test'.format(output_destination),
                                                source='/opt/ml/processing/test')])
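To make the effect of ShardedByS3Key concrete, the following sketch splits a list of object keys across instances. The round-robin assignment and the key names are assumptions for illustration only; the actual assignment is done by the service and may differ, but each instance still ends up with roughly 1/n of the files.

```python
def shard_keys(keys, instance_count):
    """Split object keys into instance_count roughly equal groups (round-robin)."""
    shards = [[] for _ in range(instance_count)]
    for i, key in enumerate(sorted(keys)):
        shards[i % instance_count].append(key)
    return shards


# Hypothetical object keys under the raw data prefix.
keys = ["raw/part-{:02d}.csv".format(i) for i in range(6)]
shards = shard_keys(keys, instance_count=2)
for index, shard in enumerate(shards):
    print("instance", index, "processes", shard)
```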

Prototyping training and inference code with local mode

When the dataset is ready for training, the next step is to prototype the training code. For TensorFlow 2, the most convenient workflow is to provide a training script for ingestion by the Amazon SageMaker prebuilt TensorFlow 2 container. This feature is named script mode, and works seamlessly with the Amazon SageMaker local mode training feature.

Local mode is a convenient way to make sure code is working locally on a notebook before moving to full-scale, hosted training in a separate right-sized cluster that Amazon SageMaker manages. In local mode, you typically train for a short time for just a few epochs, possibly on only a sample of the full dataset, to confirm the code is working properly and avoid wasting full-scale training time. Also, specify the instance type as either local_gpu or local, depending on whether the notebook is on a GPU or CPU instance.

import sagemaker
from sagemaker.tensorflow import TensorFlow

git_config = {'repo': '', 'branch': 'master'}
model_dir = '/opt/ml/model'
train_instance_type = 'local'
hyperparameters = {'epochs': 5, 'batch_size': 128, 'learning_rate': 0.01}

# Parameter names follow version 1 of the SageMaker Python SDK; version 2
# renames train_instance_type and train_instance_count to instance_type
# and instance_count.
local_estimator = TensorFlow(git_config=git_config,
                             source_dir='tf-2-workflow/train_model',
                             entry_point='',
                             model_dir=model_dir,
                             train_instance_type=train_instance_type,
                             train_instance_count=1,
                             hyperparameters=hyperparameters,
                             role=sagemaker.get_execution_role(),
                             base_job_name='tf-2-workflow',
                             framework_version='2.1',
                             py_version='py3',
                             script_mode=True)

Although local mode training is very useful to make sure training code is working before moving on to full-scale training, it’s also convenient to have an easy way to prototype inference code locally. One possibility is to fetch a TensorFlow SavedModel artifact or a model checkpoint saved in Amazon S3 and load it in a notebook for testing. However, an easier way to do this is to use local mode endpoints.

You can deploy a model in a local mode endpoint, which contains an Amazon SageMaker TensorFlow Serving container, by using the estimator object from the local mode training job. With one exception, this code is the same as the code for deploying a model to a separate hosted endpoint. Just invoke the local estimator’s deploy method, and again specify the instance type as either local_gpu or local, depending on whether the notebook is on a GPU or CPU instance.

local_predictor = local_estimator.deploy(initial_instance_count=1, instance_type='local')
local_results = local_predictor.predict(x_test[:10])['predictions']

Before using local mode, make sure that docker-compose or nvidia-docker-compose (for GPU instances) can be run on your instance. The GitHub repo for this blog post has a script you can use for this purpose.
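As a rough pre-flight check, a notebook could test for the required tools before starting a local mode job. This sketch is not the script from the GitHub repo: the helper names are hypothetical, and detecting a GPU by looking for nvidia-smi on the PATH is only a heuristic.

```python
import shutil


def local_mode_instance_type():
    """Return 'local_gpu' if the NVIDIA driver tool is on the PATH, else 'local'."""
    return "local_gpu" if shutil.which("nvidia-smi") else "local"


def compose_available():
    """Return True if docker-compose or nvidia-docker-compose is on the PATH."""
    tools = ("docker-compose", "nvidia-docker-compose")
    return any(shutil.which(tool) is not None for tool in tools)


print("instance type for local mode:", local_mode_instance_type())
print("compose tool found:", compose_available())
```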

Automatic model tuning

After prototyping is complete, the next step is to use Amazon SageMaker hosted training and automatic model tuning. Hosted training is preferred for doing full-scale training, especially large-scale, distributed training. Unlike local mode, for hosted training the actual training occurs not on the notebook itself, but on a separate cluster of machines that Amazon SageMaker manages. An estimator object for hosted training is similar to a local mode estimator, except that the instance type is a specific Amazon SageMaker ML instance type rather than local_gpu or local.

Also, because local mode prototyping proved the training code is working, you can modify the hosted training estimator to train for a larger number of epochs, and on the full dataset if you just used a sample in local mode.

However, running individual hosted training jobs and manually tweaking hyperparameters in search of the best model is likely to be a daunting, time-consuming task. Selecting the right combination of hyperparameters depends on the dataset and algorithm. Some algorithms have many different hyperparameters that you can tweak, some are very sensitive to the hyperparameter values selected, and most have a non-linear relationship between model fit and hyperparameter values. Automatic model tuning speeds up the tuning process: it runs multiple training jobs with different hyperparameter combinations to find the set with the best model performance.

As shown in the following code example, to use automatic model tuning, first specify the hyperparameters to tune, their tuning ranges, and an objective metric to optimize. A HyperparameterTuner object takes these as parameters. Each tuning job also must specify a maximum number of training jobs within the tuning job, in this case 15, and how much parallelism to employ, in this case five jobs at a time. With these parameters, the tuning job is complete after three series of five jobs in parallel are run. For the default Bayesian Optimization tuning strategy, the results of previous groups of training jobs inform the tuning search, so it’s preferable to divide them into groups of parallel jobs instead of running all in parallel. There is a trade-off: using more parallel jobs finishes tuning sooner, but likely sacrifices tuning search accuracy.

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(0.001, 0.2, scaling_type="Logarithmic"),
    'epochs': IntegerParameter(10, 50),
    'batch_size': IntegerParameter(64, 256),
}

# Regular expressions that extract metric values from the training logs.
metric_definitions = [{'Name': 'loss', 'Regex': r' loss: ([0-9\.]+)'},
                      {'Name': 'val_loss', 'Regex': r' val_loss: ([0-9\.]+)'}]
objective_metric_name = 'val_loss'
objective_type = 'Minimize'

tuner = HyperparameterTuner(estimator,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=15,
                            max_parallel_jobs=5,
                            objective_type=objective_type)

tuning_job_name = "tf-2-workflow-{}".format(strftime("%d-%H-%M-%S", gmtime()))
tuner.fit(inputs, job_name=tuning_job_name)
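The following sketch shows how the val_loss regular expression pulls the objective metric out of a training log line, and how the job limits translate into sequential rounds. The log line is a hypothetical Keras-style example, not output from the sample notebook.

```python
import math
import re

# Same pattern as the val_loss metric definition.
val_loss_regex = r" val_loss: ([0-9\.]+)"

# Hypothetical log line in the style Keras prints at the end of an epoch.
log_line = "782/782 - 5s - loss: 0.4312 - val_loss: 0.3987"

match = re.search(val_loss_regex, log_line)
val_loss = float(match.group(1))
print("objective metric:", val_loss)

# 15 jobs at 5 in parallel means 3 sequential rounds, so later rounds
# can use the results of earlier ones.
max_jobs, max_parallel_jobs = 15, 5
rounds = math.ceil(max_jobs / max_parallel_jobs)
print("sequential rounds:", rounds)
```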

Deployment and workflow automation with the AWS Step Functions Data Science SDK

A convenient option to deploy the best model from tuning is an Amazon SageMaker hosted endpoint, which serves real-time predictions (batch transform jobs also are available for asynchronous, offline predictions). The endpoint retrieves the TensorFlow SavedModel and deploys it to an Amazon SageMaker TensorFlow Serving container. You can accomplish this with one line of code by calling the HyperparameterTuner object’s deploy method:

tuning_predictor = tuner.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

However, although notebooks are great for prototyping, notebooks aren’t typically used for deployment in a production environment. Instead, a workflow orchestrator is preferable for running a pipeline with multiple steps including training and deployment. For example, a simple pipeline in Amazon SageMaker consists of four steps:

  1. Training the model.
  2. Creating an Amazon SageMaker Model object that wraps the model artifact for serving.
  3. Creating an Amazon SageMaker endpoint configuration specifying how the model should be served (including instance type and number of instances).
  4. Deploying the trained model to the configured Amazon SageMaker endpoint.
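The four steps map naturally onto a chain of task states. The sketch below builds a skeleton state machine definition as a Python dictionary, using the Step Functions service integrations for SageMaker. The state names are arbitrary, and the Parameters blocks that a real definition needs (job names, image URIs, instance settings) are deliberately omitted, so this is an outline rather than a deployable definition.

```python
import json


def task(resource, next_state=None):
    """Build a Task state that either moves to next_state or ends the workflow."""
    state = {"Type": "Task", "Resource": resource}
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "StartAt": "Train",
    "States": {
        # .sync makes the workflow wait until the training job finishes.
        "Train": task("arn:aws:states:::sagemaker:createTrainingJob.sync", "CreateModel"),
        "CreateModel": task("arn:aws:states:::sagemaker:createModel", "CreateEndpointConfig"),
        "CreateEndpointConfig": task("arn:aws:states:::sagemaker:createEndpointConfig", "CreateEndpoint"),
        "CreateEndpoint": task("arn:aws:states:::sagemaker:createEndpoint"),
    },
}


def execution_order(definition):
    """Follow the Next pointers from StartAt and return the state names in order."""
    order, name = [], definition["StartAt"]
    while name:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


print(execution_order(definition))
print(json.dumps(definition, indent=2))
```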

The AWS Step Functions Data Science SDK automates the process of creating and running such pipelines using Amazon SageMaker and AWS Step Functions, a serverless workflow orchestration service. This SDK enables workflow creation using short, simple Python scripts that define workflow steps and chain them together. AWS Step Functions coordinates all the workflow steps without any need for you to manage the underlying infrastructure.

Although the AWS Step Functions Data Science SDK provides various primitives to build up complex pipelines from scratch, it also has prebuilt templates for common workflows, including a simple TrainingPipeline workflow for model training and deployment. The following code configures such a pipeline with just a few parameters, primarily the training estimator and input and output locations in Amazon S3:

import stepfunctions
from stepfunctions.template.pipeline import TrainingPipeline

workflow_execution_role = "<StepFunctions-execution-role-arn>"

pipeline = TrainingPipeline(
    estimator=estimator,
    role=workflow_execution_role,
    inputs=inputs,
    s3_bucket=bucket
)

After you define a pipeline, you can visualize it as a graph, instantiate it, and execute it as many times as needed. In fact, you can run multiple workflows in parallel. While a workflow is running, you can check workflow progress either in the AWS Step Functions console or by calling the pipeline’s render_progress method. The following diagram shows a rendered workflow execution making progress on the training step.

The AWS Step Functions Data Science SDK enables many other possible workflows for automating TensorFlow 2 and other machine learning projects. One example is a workflow to automate model retraining periodically. Such a workflow could include a test of model quality after training, with subsequent conditional branches for the cases of passing the quality test (model is deployed) or failing (no model deployment). Other possible workflow steps include automatic model tuning, ETL with AWS Glue, and more. For more information about retraining workflows, see Automating model retraining and deployment using the AWS Step Functions Data Science SDK for Amazon SageMaker.


This post discussed Amazon SageMaker features for data transformation, prototyping training and inference code, automatic model tuning, and hosted training and inference. Additionally, you learned how the AWS Step Functions Data Science SDK helps automate workflows after project prototyping is complete. All these features are central elements of projects involving TensorFlow 2 and other deep learning frameworks in Amazon SageMaker.

In addition to these features, many others may be applicable. For example, to handle common problems during model training such as vanishing or exploding gradients, Amazon SageMaker Debugger is useful. To manage common problems such as data drift for models deployed in production, you can apply Amazon SageMaker Model Monitor. For more information about the Amazon SageMaker workflow features covered in this post, see the related GitHub repo.

About the Author

Brent Rabowsky focuses on data science at AWS, and leverages his expertise to help AWS customers with their own data science projects.



Facebook uses Amazon EC2 to evaluate the Deepfake Detection Challenge




In October 2019, AWS announced that it was working with Facebook, Microsoft, and the Partnership on AI on the first Deepfake Detection Challenge. Deepfake algorithms are based on the same underlying technology that has given us realistic animation effects in movies and video games. Unfortunately, those same algorithms have been used by bad actors to blur the distinction between reality and fiction. Deepfake videos result from using artificial intelligence to manipulate audio and video to make it appear as though someone did or said something they didn’t. For more information about deepfake content, see The Partnership on AI Steering Committee on AI and Media Integrity.

In machine learning (ML) terms, generative adversarial networks (GANs) have been the most popular algorithm for creating deepfakes. GANs use a pair of neural networks: a generative network that produces candidates by adding noise to the original data, and a discriminative network that evaluates the candidates until it judges that they aren’t synthesized. GANs match one network against the other in an adversarial manner to generate new, synthetic instances of data that can pass for real data. This means the deepfake is indistinguishable from a normal dataset.

The goal of this challenge was to incentivize researchers around the world to build innovative methods that can help detect deepfakes and manipulated media. The competition, which ended on March 31, 2020, was popular amongst the Kaggle data science community. The deepfake project emphasized the benefits of scaling and optimizing the cost of deep learning batch inference. Once the competition was complete, the team at Facebook hosted the deepfake competition data on AWS and made it available to the world, encouraging researchers to keep fighting this problem.

There were over 4,200 total submissions from over 2,300 teams worldwide. The participating submissions are scored with the following log loss function, where a smaller score is better (for more information about scoring, see the contest rules):
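For reference, the standard binary log loss averages -[y log(p) + (1 - y) log(1 - p)] over all videos, where y is 1 for a fake video and 0 for a real one, and p is the predicted probability that the video is fake. The sketch below assumes this usual definition; the clipping constant eps is an assumption added to avoid taking log(0), and the contest rules remain the authoritative source for the exact scoring.

```python
import math


def log_loss(labels, predictions, eps=1e-15):
    """Mean binary log loss; labels are 0 or 1, predictions are probabilities."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


labels = [1, 0, 1, 0]
print(log_loss(labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative guess
print(log_loss(labels, [0.9, 0.1, 0.8, 0.2]))  # confident and correct: lower score
```

A submission that always predicts 0.5 scores log(2), about 0.693, so any useful detector needs to score below that.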

Four groups of datasets were associated with the competition:

  • Training – The participating teams used this set for training their model. It consisted of 470 GB of video files, with real and fake labels for each video.
  • Public validation – Consisted of a sample of 400 videos from the test dataset.
  • Public test – Used by the Kaggle platform to compute the public leaderboard.
  • Private test – Held outside the Kaggle competition platform by the Facebook team, the competition host, and used for scoring the competition. The results from using the private test set were displayed on the competition’s private leaderboard. This set contains videos with a similar format and nature to the training, public validation, and public test sets, but also contains real, organic videos as well as deepfakes.

After the competition deadline, Kaggle transferred the code for the two final submissions from each team to the competition host. The hosting team re-ran the submission code against this private dataset and returned prediction submissions to Kaggle to compute the final private leaderboard scores. The submissions were based on two types of compute virtual machines (VMs): GPU-based and CPU-based. Most of the submissions were GPU-based.

The competition hosting team at Facebook recognized several challenges in conducting an evaluation for the unexpectedly large number of participants. With over 4,200 total submissions, each requiring 9 GPU hours of runtime on a p3.2xl Amazon Elastic Compute Cloud (Amazon EC2) P3 instance, they would need an estimated 42,000 GPU compute hours (or almost 5 years’ worth of compute hours) to complete the competition. To make the project even more challenging, they needed to do 5 years of GPU compute in 3 weeks.

Given the tight deadline, the host team had to address several constraints to complete the evaluation within the time and budget allotted.

Operational efficiency

To meet the tight timeframes for the competition and keep the workload manageable for a small team, the solution had to be low-code. To address the low-code requirement, they chose AWS Batch for scheduling and scaling out the compute workload. The following diagram illustrates the solution architecture.

AWS Batch was originally designed for developers, scientists, and engineers to easily and efficiently manage large numbers of batch computing jobs on AWS with little coding or cloud infrastructure deployment experience. There’s no need to install and manage batch computing software or server clusters, which allows you to focus on analyzing and solving problems. AWS Batch provides scheduling and scales out batch computing workloads across the full range of AWS compute services, such as Amazon EC2 and Spot Instances. Furthermore, AWS Batch has no additional charges for managing cluster resources. In this use case, the host simply submitted 4,200 compute jobs, each of which registered one Kaggle submission container and ran for about 9 hours. Using a cluster of instances, all jobs were complete in less than three weeks.


The tight timeframes for the competition, as well as the need for those instances for only a short period, speak to the need for elasticity in compute. For example, the team estimated they would need a minimum of 85 Amazon EC2 P3 GPUs running in parallel around the clock to complete the evaluation. To account for restarts and other issues causing lost time, there was the potential for an additional 50% in capacity. Facebook was able to quickly scale up the number of GPUs and CPUs needed for the evaluation and scale them down when finished, only paying for what they used. This was much more efficient in terms of budget and operations effort than acquiring, installing, and configuring the compute on-premises.
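The capacity figures quoted above can be checked with a few lines of arithmetic. All inputs come from the text; the only assumption is that the difference between the raw product (37,800 hours) and the 42,000-hour estimate is planning margin.

```python
import math

submissions = 4200
gpu_hours_each = 9
raw_gpu_hours = submissions * gpu_hours_each       # 37,800 before any margin
estimated_gpu_hours = 42_000                       # figure quoted in the text

# Running the estimate on a single GPU around the clock.
years_on_one_gpu = estimated_gpu_hours / (24 * 365)

# GPUs needed in parallel to finish within the 3-week deadline.
deadline_hours = 3 * 7 * 24
min_parallel_gpus = math.ceil(estimated_gpu_hours / deadline_hours)

# The team's figure of 85 GPUs plus the potential 50% extra capacity.
with_headroom = math.ceil(85 * 1.5)

print(raw_gpu_hours, round(years_on_one_gpu, 1), min_parallel_gpus, with_headroom)
```

The result of 84 GPUs is consistent with the team's estimate of a minimum of 85 running in parallel.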


Security was another significant concern. Submissions from such a wide array of participants could contain viruses, malware, bots, or rootkits. Running these containers in a sandboxed cloud environment avoided that risk. If the evaluation environment were exposed to various infectious agents, the environment could be terminated and easily rebuilt without exposing any production systems to downtime or data loss.

Privacy and confidentiality

Privacy and confidentiality are closely related to the security concerns. To address those concerns, all the submissions and data were held in a single, closely held AWS account with virtual private clouds (VPCs) and restrictive permissions using AWS Identity and Access Management (IAM). To ensure privacy and confidentiality of the submitted models, and fairness in grading, a single, dedicated engineer was responsible for conducting the evaluation without looking into any of the Docker images submitted by the various teams.


Cost was another important constraint the team had to consider. A rough estimate of 42,000 hours of Amazon EC2 P3 instance runtime would cost about $125,000.

To lower the cost of GPU compute, the host team determined that the Amazon EC2 G4 instance type (NVIDIA Tesla T4 GPUs) was more cost-effective for this workload than the P3 instance type (NVIDIA Tesla V100 GPUs). Among the GPU instances in the cloud, Amazon EC2 G4 instances are cost-effective and versatile for deploying ML models.

These instances are optimized for ML application deployments (inference), such as image classification, object detection, recommendation engines, automated speech recognition, and language translation, which push the boundary on AI innovation and latency.

The host team completed a few test runs with the G4 instance type. The test runtime for each submission resulted in a little over twice the comparative runtime of the P3 instances, resulting in the need for approximately 90,000 compute hours. The G4 instances cost up to 83% less per hour than the P3 instances. Even with longer runtimes per job with the G4 instances, the total compute cost decreased from $125,000 to just under $50,000. The following table illustrates the cost-effectiveness of the G4 instance type per inference.

                      p3.2xl      g4dn.8xl
Runtime (hours)       90,000      25,000
Cost (USD)            $125,000    $50,000
Cost per Inference    $30         $12
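The cost comparison can be reproduced from the estimates in the prose: 42,000 P3 hours for about $125,000, roughly 90,000 G4 hours for just under $50,000, and about 4,200 submissions. The hourly rates below are implied by those totals rather than taken from a price list, so treat them as approximations.

```python
submissions = 4200

p3_hours, p3_cost = 42_000, 125_000   # P3 estimate from the text
g4_hours, g4_cost = 90_000, 50_000    # G4 estimate from the text

# Cost per evaluated submission on each instance type.
p3_cost_per_inference = p3_cost / submissions
g4_cost_per_inference = g4_cost / submissions

# Hourly rates implied by the totals, and the relative hourly saving.
p3_hourly = p3_cost / p3_hours
g4_hourly = g4_cost / g4_hours
hourly_saving = 1 - g4_hourly / p3_hourly

# Overall saving even though G4 jobs run roughly twice as long.
total_saving = 1 - g4_cost / p3_cost

print(round(p3_cost_per_inference), round(g4_cost_per_inference))
print(round(hourly_saving * 100), round(total_saving * 100))
```

The implied hourly saving of about 81% is in line with the "up to 83% less per hour" figure, and it is large enough that the total cost still drops by 60% despite the longer runtimes.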

The host team shared that many of the submission runs completed with less compute time than originally projected. The initial projection was based upon early model submissions, which were larger than the average size for all models submitted. About 80% of the runs took advantage of the G4 instance type, while some had to be run on the P3 instances due to slight differences in available GPU memory between the two instance types. The final numbers were 25,000 G4 (GPU) compute hours, 5,000 C4 (CPU) compute hours, and 800 P3 (GPU) compute hours, totaling $20,000 in compute cost. After approximately two weeks of around-the-clock evaluation, the host team completed the challenging task of evaluating all the submissions early and consumed less than half of the $50,000 estimate.


The host team was able to complete a full evaluation of the more than 4,200 submissions in less time than was available, while meeting the grading fairness criteria and coming in under budget. The host team successfully replicated the evaluation environment with a success rate of 94%, which is high for a two-stage competition.

Software projects are often risk-prone due to technological uncertainties, and perhaps even more so due to inherent complexity and constraints. The breadth and depth of AWS services running on Amazon EC2 allow you to solve your unique challenges by reducing technology uncertainty. In this case, the Facebook team completed the deepfake evaluation challenge on time and under budget with only one software engineer. The engineer started by selecting a low-code solution, AWS Batch, which is a proven service for even larger-scale HPC workloads, and reduced the evaluation cost by 2/3 through the choice of the AI inference-optimized G4 EC2 instance type.

AWS believes there’s no one solution to a problem. Solutions often consist of multiple and flexible building blocks from which you can craft solutions that meet your needs and priorities.

About the Authors

Wenming Ye is an AI and ML specialist architect at Amazon Web Services, helping researchers and enterprise customers use cloud-based machine learning services to rapidly scale their innovations. Previously, Wenming gained diverse R&D experience at Microsoft Research, the SQL engineering team, and successful startups.

Tim O’Brien is a Senior Solutions Architect at AWS focused on Machine Learning and Artificial Intelligence. He has over 30 years of experience in information technology, security, and accounting. In his spare time, he likes hiking, climbing, and skiing with his wife and two dogs.


Best Research Papers From ACL 2020




ACL is the leading conference in the field of natural language processing (NLP), covering a broad spectrum of research areas in computational linguistics. Due to the COVID-19 risks, ACL 2020 took place 100% virtually, similar to other big academic conferences of this year.

However, as always, it was the best place to learn about the latest NLP research trends and cutting-edge research papers in language modeling, conversational AI, machine translation, and other NLP research topics.

Following the long-standing tradition, the best paper awards were announced during the last day of the main conference. In this article, we’ve summarized the key research ideas of the papers that received the Best Paper Award and Honorable Mentions at ACL 2020.

If you’d like to skip around, here are the papers we featured:

  1. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
  2. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
  3. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

ACL 2020 Best Paper Awards

1. Beyond Accuracy: Behavioral Testing of NLP models with CheckList, by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Original Abstract

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

Our Summary

The authors point out the shortcomings of existing approaches to evaluating performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList, a new evaluation methodology for testing of NLP models. The approach is inspired by principles of behavioral testing in software engineering. Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.


What’s the core idea of this paper?

  • Existing approaches to evaluation of NLP models have many significant shortcomings:
    • The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs.
    • The alternative approaches are usually designed for evaluation of specific behaviors on individual tasks and thus, lack comprehensiveness.
  • To address this problem, the research team introduces CheckList, a new methodology for evaluating NLP models, inspired by the behavioral testing in software engineering:
    • CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation.
    • Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types, such as prediction invariance or directional expectation tests in case of certain perturbations.
    • Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
  • The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily.
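The capability-by-test-type matrix can be illustrated with a toy example. This sketch does not use the CheckList library; the keyword-based model, the test sentences, and the helper names are all hypothetical. It fills three cells of the matrix: a minimum functionality test (MFT, the paper's term for simple unit-test-style checks) for negation, an invariance test (INV) for named entities, and a directional expectation test (DIR) for vocabulary.

```python
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "terrible", "hate"}


def score(text):
    """Hypothetical stand-in model: positive word count minus negative word count."""
    words = text.lower().replace(".", "").split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text):
    s = score(text)
    return "pos" if s > 0 else "neg" if s < 0 else "neutral"


def failure_rate(cases, passes):
    """Fraction of test cases for which the pass condition does not hold."""
    return sum(not passes(case) for case in cases) / len(cases)


# MFT / Negation: templated sentences with a known expected label.
mft_cases = [("The food was not {}.".format(w), "neg") for w in ("good", "great")]

# INV / Named entities: swapping the name must not change the prediction.
inv_cases = [("Maria said the film was great.", "John said the film was great.")]

# DIR / Vocabulary: appending a negative sentence must not raise the score.
dir_cases = [("The service was good.", "The service was good. It was terrible.")]

matrix = {
    ("Negation", "MFT"): failure_rate(mft_cases, lambda c: label(c[0]) == c[1]),
    ("Named entities", "INV"): failure_rate(inv_cases, lambda c: label(c[0]) == label(c[1])),
    ("Vocabulary", "DIR"): failure_rate(dir_cases, lambda c: score(c[1]) <= score(c[0])),
}

for (capability, test_type), rate in matrix.items():
    print("{:15s} {:4s} failure rate {:.0%}".format(capability, test_type, rate))
```

The toy model passes the invariance and directional tests but fails every negation case, which is the kind of capability-specific failure that a single accuracy number would hide.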

What’s the key achievement?

  • Evaluation of state-of-the-art models with CheckList demonstrated that even though some NLP tasks are considered “solved” based on accuracy results, the behavioral testing highlights many areas for improvement.
  • Applying CheckList to an extensively tested public-facing system for sentiment analysis showed that this methodology:
    • helps to identify and test for capabilities not previously considered;
    • results in a more thorough and comprehensive testing for previously considered capabilities;
    • helps to discover many more actionable bugs.

What does the AI community think?

  • The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.

What are possible business applications?

  • CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
  • Such comprehensive testing that helps in identifying many actionable bugs is likely to lead to more robust NLP systems.

Where can you get implementation code?

  • The code for testing NLP models with CheckList is available on GitHub.

2. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks, by Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith

Original Abstract

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

Our Summary

The research team from the Allen Institute for Artificial Intelligence investigates whether the leading language models trained on massive heterogeneous corpora work universally or whether it is still useful to build separate models pretrained for specific domains. They address this question by considering four different domains and eight classification tasks, spanning low- and high-resource settings. Furthermore, they consider domain-adaptive pretraining as well as task-adaptive pretraining. The findings of the researchers suggest that both pretraining approaches consistently improve the performance of RoBERTa, one of the leading language models. They also show that manual curation of datasets for specific tasks further enhances model performance.

What’s the core idea of this paper?

  • Today’s leading language models are trained on massive heterogeneous datasets and achieve strong performance across many tasks. At the same time, the benefits of domain-specific or task-specific pretraining are not well investigated.
  • The research team addresses this question for one of the leading language models, RoBERTa:
    • They consider four domains (biomedical research papers, computer science papers, news, and reviews) and eight classification tasks (two in each domain), in both high- and low-resource settings.
    • The experiments cover continued pretraining on the domain, known as domain-adaptive pretraining (DAPT), and pretraining on a directly task-relevant corpus, known as task-adaptive pretraining (TAPT).
    • Additionally, the researchers study the benefits of datasets for task-adaptive pretraining being manually curated by task designers.

What’s the key achievement?

  • Demonstrating the importance of domain-specific and task-specific pretraining. The experiments show that:
    • Domain-adaptive pretraining consistently improves performance on tasks from the target domain, in both low- and high-resource settings.
    • Task-adaptive pretraining significantly boosts the performance of the language model, with or without domain-adaptive pretraining.
    • Benefits from task-adaptive pretraining increase with additional unlabeled data that has been manually curated by task designers or annotators.

What does the AI community think?

  • The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing.

What are future research areas?

  • The authors suggest the following directions for future research:
    • better data selection for task-adaptive pretraining;
    • efficient adaptation of large pretrained language models to distant domains;
    • building reusable language models after adaptation.

What are possible business applications?

  • The approaches studied in this paper can be applied to any pretrained language model to further improve its performance in specific domains and for specific NLP tasks.

Where can you get implementation code?

  • The implementation code as well as pretrained models for multiple domains and tasks are publicly available on GitHub.

3. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics, by Nitika Mathur, Timothy Baldwin, Trevor Cohn

Original Abstract

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

Our Summary

The most recent Conference on Machine Translation (WMT) revealed that, based on Pearson’s correlation coefficient, automatic metrics match human evaluations of translation quality poorly when only the few best systems are compared. In some instances, the correlations were even negative. The research team from the University of Melbourne investigates this issue by studying the role of outlier systems, exploring how the correlation coefficient reflects different patterns of errors (type I vs. type II errors), and determining what magnitude of difference in the metric score corresponds to true improvements in translation quality as judged by humans. Their findings suggest that small BLEU differences (i.e., 1–2 points) have little meaning and that other metrics, such as chrF, YiSi-1, and ESIM, should be preferred over BLEU. However, only human evaluations can be a reliable basis for drawing important empirical conclusions.

What’s the core idea of this paper?

  • Automatic metrics are used as a proxy for human translation evaluation, which is considerably more expensive and time-consuming.
  • However, evaluating how well different automatic metrics concur with human evaluation is not a straightforward problem:
    • For example, recent findings show that if the correlation between leading metrics and human evaluations is computed using a large set of translation systems, it is typically very high (around 0.9). However, if only the few best systems are considered, the correlation drops markedly and can even be negative in some cases.
  • The authors of this paper take a closer look at this problem and discover that:
    • The identified problem with Pearson’s correlation is due to the small sample size and not specific to comparing strong MT systems.
    • Outlier systems, whose quality is much higher or lower than the rest of the systems, have a disproportionate effect on the computed correlation and should be removed.
    • The same correlation coefficient can reflect different patterns of errors. Thus, a better approach for gaining insights into metric reliability is to visualize metric scores against human scores.
    • Small BLEU differences of 1–2 points correspond to true improvements in translation quality (as judged by humans) in only 50% of cases.
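
The effect of an outlier system on Pearson’s correlation can be illustrated with a short Python sketch. All scores below are made up for illustration. This summary does not specify the paper's outlier procedure, so the median-absolute-deviation rule, the 1.483 scaling constant, and the 2.5 cutoff used in the hypothetical `robust_outliers` helper are assumptions.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson's correlation coefficient, computed from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def robust_outliers(values, cutoff=2.5):
    """Flag values whose robust z-score, based on the median absolute deviation (MAD),
    exceeds the cutoff. Assumes the MAD is non-zero."""
    med = statistics.median(values)
    mad = 1.483 * statistics.median([abs(v - med) for v in values])
    return [abs(v - med) / mad > cutoff for v in values]

# Made-up scores: five competitive systems plus one very weak system (last entry).
human = [0.10, 0.12, 0.14, 0.16, 0.18, -1.50]
metric = [30.0, 29.0, 31.0, 28.5, 30.5, 12.0]

r_all = pearson(human, metric)  # dominated by the single weak system, so it is high

flags = robust_outliers(human)
kept = [(h, m) for h, m, is_outlier in zip(human, metric, flags) if not is_outlier]
r_kept = pearson([h for h, _ in kept], [m for _, m in kept])  # near zero

print(f"with outlier: r = {r_all:.2f}; without outlier: r = {r_kept:.2f}")
```

With the weak system included, the metric appears to track human judgment almost perfectly. Once that system is removed, the same metric shows almost no correlation on the remaining competitive systems.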

What’s the key achievement?

  • Conducting a thorough analysis of automatic metrics vs. human judgments in machine translation, and providing key recommendations on evaluating MT systems:
    • Preferring evaluation metrics such as chrF, YiSi-1, and ESIM over BLEU and TER.
    • Moving away from using small changes in evaluation metrics as the sole basis to draw important empirical conclusions, and always ensuring support from human evaluations before claiming that one MT system significantly outperforms another one.
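
The pairwise-ranking analysis mentioned in the abstract, where a metric difference is accepted or rejected against human judgments, can be sketched as follows. The data and the `error_counts` helper are hypothetical and only illustrate how the two error types are counted for a given threshold. They are not the authors' implementation.

```python
# Each pair of systems: (metric score difference, did humans judge the difference significant?)
# All values are made up for illustration.
pairs = [
    (0.5, False), (0.8, True), (1.2, False), (1.8, True),
    (2.5, True), (3.1, False), (4.0, True), (5.2, True),
]

def error_counts(pairs, threshold):
    """Count errors when a metric difference >= threshold is accepted as a real improvement.

    Type I:  the difference is accepted, but humans found no significant difference.
    Type II: the difference is rejected, but humans found a significant difference.
    """
    type_i = sum(1 for delta, significant in pairs if delta >= threshold and not significant)
    type_ii = sum(1 for delta, significant in pairs if delta < threshold and significant)
    return type_i, type_ii

for threshold in (1.0, 3.0):
    type_i, type_ii = error_counts(pairs, threshold)
    print(f"threshold {threshold}: type I = {type_i}, type II = {type_ii}")
```

Raising the threshold trades type I errors for type II errors, which is why a fixed metric difference cannot replace human evaluation.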

What does the AI community think?

  • The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing.

Where can you get implementation code?

  • The implementation code, data, and additional analysis will be released on GitHub.

Wireless aquatic robot could clean water and transport cells




Researchers at Eindhoven University of Technology developed a tiny plastic robot, made of responsive polymers, which moves under the influence of light and magnetism. In the future this ‘wireless aquatic polyp’ should be able to attract and capture contaminant particles from the surrounding liquid or pick up and transport cells for analysis in diagnostic devices. The researchers published their results in the journal PNAS.

The mini robot is inspired by the coral polyp, a small soft creature with tentacles that makes up the corals in the ocean. Doctoral candidate Marina Pilz Da Cunha: “I was inspired by the motion of these coral polyps, especially their ability to interact with the environment through self-made currents.” The stem of a living polyp makes a specific movement that creates a current, which attracts food particles. The tentacles then grab the food particles floating by.

The wireless artificial polyp measures 1 by 1 cm and has a stem that reacts to magnetism and tentacles that are steered by light. “Combining two different stimuli is rare since it requires delicate material preparation and assembly, but it is interesting for creating untethered robots because it allows for complex shape changes and tasks to be performed,” explains Pilz Da Cunha. The tentacles move when light shines on them, and different wavelengths lead to different results. For example, the tentacles ‘grab’ under the influence of UV light, while they ‘release’ with blue light.


The device now presented can grab and release objects underwater, which is a new feature compared with the light-guided package delivery mini robot the researchers presented earlier this year. That land-based robot couldn’t work underwater, because the polymers it is made of act through photothermal effects: the heat generated by the light fueled the robot, rather than the light itself. Pilz Da Cunha: “Heat dissipates in water, which makes it impossible to steer the robot under water.” She therefore developed a photomechanical polymer material that moves under the influence of light only, not heat.

And that is not its only advantage. Besides operating underwater, this new material can hold its deformation after being activated by light. While the photothermal material immediately returns to its original shape after the stimulus has been removed, the molecules in the photomechanical material actually take on a new state. This allows different stable shapes to be maintained for a longer period of time. “That helps to control the gripper arm; once something has been captured, the robot can keep holding it until it is addressed by light once again to release it,” says Pilz Da Cunha.


By placing a rotating magnet underneath the robot, the stem circles around its axis. Pilz Da Cunha: “It was therefore possible to actually move floating objects in the water towards the polyp, in our case oil droplets.”

The position of the tentacles (open, closed, or something in between) turned out to influence the fluid flow. “Computer simulations, with different tentacle positions, eventually helped us to understand and get the movement of the stem exactly right. And to ‘attract’ the oil droplets towards the tentacles,” explains Pilz Da Cunha.


An added advantage is that the robot operates independently of the composition of the surrounding liquid. This is unique, because hydrogels, the dominant stimuli-responsive materials used for underwater applications today, are sensitive to their environment and therefore behave differently in contaminated water. Pilz Da Cunha: “Our robot also works in the same way in salt water, or water with contaminants. In fact, in the future the polyp may be able to filter contaminants out of the water by catching them with its tentacles.”


PhD student Pilz Da Cunha is now working on the next step: an array of polyps that can work together. She hopes to realize transport of particles, in which one polyp passes on a package to the other. A swimming robot is also on her wish list. Here, she thinks of biomedical applications such as capturing specific cells.

To achieve this, the researchers still have to work on the wavelengths to which the material responds. “UV light affects cells and the depth of penetration in the human body is limited. In addition, UV light might damage the robot itself, making it less durable. Therefore we want to create a robot that doesn’t need UV light as a stimuli,” concludes Pilz Da Cunha.


