
Amazon Textract recognizes handwriting and adds five new languages




Documents are a primary tool for communication, collaboration, record keeping, and transactions across industries, including financial, medical, legal, and real estate. The format of the data can pose an extra challenge to extraction, especially when content is typed, handwritten, or embedded in a form or table. Furthermore, extracting data from documents by hand is error-prone, time-consuming, expensive, and doesn’t scale. Amazon Textract is a machine learning (ML) service that extracts printed text and other data from documents, including tables and forms.

We’re pleased to announce two new features for Amazon Textract: support for handwriting in English documents, and expanding language support for extracting printed text from documents typed in Spanish, Portuguese, French, German, and Italian.

Handwriting recognition with Amazon Textract

Many documents, such as medical intake forms or employment applications, contain both handwritten and printed text, and the ability to extract both is something our customers have long asked for. Amazon Textract can now extract printed text and handwriting from documents written in English with high confidence scores, whether it’s free-form text or text embedded in tables and forms.
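As a sketch of what consuming this output can look like, the snippet below separates printed from handwritten words in a Textract response payload. It follows the Blocks structure returned by the text detection and analysis APIs; the 80% confidence floor is an arbitrary choice for illustration, not a recommended threshold.

```python
def split_by_text_type(response, min_confidence=80.0):
    """Group WORD blocks from a Textract response by TextType
    (PRINTED or HANDWRITING), dropping low-confidence words."""
    words = {"PRINTED": [], "HANDWRITING": []}
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        words[block.get("TextType", "PRINTED")].append(block["Text"])
    return words

# Example with a trimmed-down response payload:
sample = {
    "Blocks": [
        {"BlockType": "WORD", "Text": "Name:", "TextType": "PRINTED", "Confidence": 99.1},
        {"BlockType": "WORD", "Text": "Jane", "TextType": "HANDWRITING", "Confidence": 97.5},
    ]
}
print(split_by_text_type(sample))
# {'PRINTED': ['Name:'], 'HANDWRITING': ['Jane']}
```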

The following image shows an example input document containing a mix of typed and handwritten text, and its converted output document.

You can log in to the Amazon Textract console to test out the handwriting feature, or check out the new demo by Amazon Machine Learning Hero Mike Chambers.

Not only can you upload documents with both printed text and handwriting, but you can also use Amazon Augmented AI (Amazon A2I), which makes it easy to build workflows for a human review of the ML predictions. Adding Amazon A2I can help you get to market faster by having your employees or AWS Marketplace contractors review the Amazon Textract output for sensitive workloads. For more information about implementing a human review, see Using Amazon Textract with Amazon Augmented AI for processing critical documents. If you want to use one of our AWS Partners, take a look at how Quantiphi is using handwriting recognition for their customers.
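To give a flavor of how Textract and A2I fit together, here is a sketch that attaches a human-review loop to an AnalyzeDocument call. The flow definition ARN, loop name, bucket, and key are placeholders for your own A2I and S3 resources, and boto3 is imported lazily so the helper only needs AWS credentials when actually invoked.

```python
def build_human_loop_config(flow_definition_arn, human_loop_name):
    """Assemble the HumanLoopConfig block for Textract AnalyzeDocument.
    Both arguments are placeholders for your own A2I resources."""
    return {
        "HumanLoopName": human_loop_name,
        "FlowDefinitionArn": flow_definition_arn,
        "DataAttributes": {
            "ContentClassifiers": ["FreeOfPersonallyIdentifiableInformation"]
        },
    }

def analyze_with_review(bucket, key, flow_definition_arn):
    """Sketch: run AnalyzeDocument with a human-review loop attached."""
    import boto3  # requires AWS credentials at call time
    client = boto3.client("textract")
    return client.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["FORMS", "TABLES"],
        HumanLoopConfig=build_human_loop_config(flow_definition_arn, "review-loop-1"),
    )
```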

Additionally, we’re pleased to announce our language expansion. Customers can now extract and process documents in more languages.

New supported languages in Amazon Textract

Amazon Textract now supports processing printed documents in Spanish, German, Italian, French, and Portuguese. You can send documents in these languages, including forms and tables, for data and text extraction, and Amazon Textract automatically detects and extracts the information for you. You can simply upload the documents on the Amazon Textract console or send them using either the AWS Command Line Interface (AWS CLI) or AWS SDKs.
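As an illustration, a minimal boto3 call for a document stored in S3 might look like the following. The bucket and key names are placeholders, no language parameter is needed because Textract detects the language automatically, and boto3 is imported lazily so the pure helper works without credentials.

```python
def lines_from_response(response):
    """Pull detected LINE text out of a Textract response payload."""
    return [b["Text"] for b in response.get("Blocks", []) if b.get("BlockType") == "LINE"]

def detect_text(bucket, key):
    """Sketch: run DetectDocumentText on an S3 object (placeholder names)."""
    import boto3  # needs AWS credentials when actually called
    client = boto3.client("textract")
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return lines_from_response(response)

# The same operation from the AWS CLI (placeholder bucket/document):
#   aws textract detect-document-text \
#       --document '{"S3Object":{"Bucket":"my-bucket","Name":"factura.png"}}'
```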

AWS customer success stories

AWS customers like you are always looking for ways to overcome document processing challenges. In this section, we share what our customers are saying about Amazon Textract.


Intuit

Intuit is a provider of innovative financial management solutions, including TurboTax and QuickBooks, to approximately 50 million customers worldwide.

“Intuit’s document understanding technology uses AI to eliminate manual data entry for our consumer, small business, and self-employed customers. For millions of Americans who rely on TurboTax every year, this technology simplifies tax filing by saving them from the tedious, time-consuming task of entering data from financial documents. Textract is an important element of Intuit’s document understanding capability, improving data extraction accuracy by analyzing text in the context of complex financial forms.”

– Krithika Swaminathan, VP of AI, Intuit


Veeva

Veeva helps cosmetics, consumer goods, and chemical companies bring innovative, high-quality products to market faster without compromising compliance.

“Our customers are processing millions of documents per year and have a critical need to extract the information stored within the documents to make meaningful business decisions. Many of our customers are multinational organizations, which means the documents are submitted in various languages like Spanish or Portuguese. Our recent partnership with AWS allowed us early access to Amazon Textract’s new feature that supports additional languages like Spanish and Portuguese. This partnership with the Textract team has been key to working closely, iterating, and delivering exceptional solutions to our customers.”

– Ali Alemdar, Sr Product Manager, Veeva Industries

Baker Tilly

Baker Tilly is a leading advisory, tax and assurance firm dedicated to building long-lasting relationships and helping customers with their most pressing problems — and enabling them to create new opportunities.

“Across all industries, forms are one of the most popular ways of collecting data. Manual efforts can take hours or days to “read” through digital forms. Leveraging Amazon Textract’s Optical Character Recognition (OCR) technology, we can now read through these digital forms more quickly and effortlessly. We now leverage handwriting recognition as part of Textract to parse out handwritten entities. This allows our customers to upload forms with both typed and handwritten text and improve their ability to make key decisions through data quickly and in a streamlined process. Additionally, Textract easily integrates with Amazon S3 and RDS for instantaneous access to processed forms and near real-time analytics.”

– Ollie East, Director of Advanced Analytics and Data Engineering, Baker Tilly

ARG Group

ARG Group is the leading end-to-end provider of digital solutions for the corporate and government market.

“At ARG Group, we work with different transportation companies and their physical asset maintenance teams. Their processes have been refined over many years. Previous attempts to digitize the process caused too much disruption and consequently failed to be adopted. Textract allowed us to provide a hybrid solution to gain the benefits of predictive insights coming from digitizing maintenance data, whilst still allowing our customer workforce to continue following their preferred handwritten process. This is expected to result in a 22% reduction in downtime and 18% reduction in maintenance cost, as we can now predict when parts are likely to fail and schedule for maintenance to happen outside of production hours. We are also expecting the lifespan of our customer assets to increase, now that we are preventing failure scenarios.”

– Daniel Johnson, Business Segment Director, ARG Group

Belle Fleur

Belle Fleur believes the ML revolution is altering the way we live, work, and relate to one another, and will transform the way every business in every industry operates.

“We use Amazon Textract to detect text for our clients that have the three Vs when it pertains to data: Variety, Velocity, and Volume, and particularly our clients that have different document formats to process information and data properly and efficiently. The feature designed to recognize various formats, whether tables or forms, and now with handwriting recognition, is an AI dream come true for our medical, legal, and commercial real estate clients. We are so excited to roll out this new handwriting feature to all of our customers to further enhance their current solution, especially those with lean teams. We are able to let machine learning handle the heavy lifting via automation to read thousands of documents in a fraction of the time and allow their teams to focus on higher-order assignments.”

– Tia Dubuisson, President, Belle Fleur


Lumiq

Lumiq is a data analytics company with deep domain and technical expertise in building and implementing AI- and ML-driven products and solutions. Their data products are built like building blocks and run on AWS, which helps their customers scale the value of their data and drive tangible business outcomes.

“With thousands of documents being generated and received across different stages of the consumer engagement lifecycle every day, one of our customers (a leading insurance service provider in India) had to invest several manual hours for data entry, data QC, and validation. The document sets consisted of proposal forms, supporting documents for identity, financials, and medical reports, among others. These documents were in different, non-standardized formats and some of them were handwritten, resulting in an increased average lag from lead to policy issuance and a degraded customer experience.

“We leveraged Amazon’s machine learning-powered Textract to extract information and insights from various types of documents, including handwritten text. Our custom solution built on top of Amazon Textract and other AWS services helped in achieving a 97% reduction in human labor for PII redaction and a projected 70% reduction in work hours for data entry. We are excited to further deep-dive into Textract to enable our customers with an E2E paperless workflow and enhance their end-consumer experience with significant time savings.”

– Mohammad Shoaib, Founder and CEO, Lumiq (Crisp Analytics)

QL Resources

QL is among ASEAN’s largest egg producers and surimi manufacturers, and is building a presence in the sustainable palm oil sector with activities including milling, plantations, and biomass clean energy.

“We have a large amount of handwritten documents that are generated daily in our factories, where it is challenging to ubiquitously install digital capturing devices. With the custom solution developed by our AWS partner Axrail using Amazon Textract and various AWS services, we are able to digitize documents for both printed and handwritten hard copy forms that we generated on the production floor daily, especially in production areas where digital capturing tools are not available or economical. This is a sensible solution and completes the missing link for full digitization of our production data.”

– Chia Lik Khai, Director, QL Resources


We continually make improvements to our products based on your feedback, and we encourage you to log in to the Amazon Textract console and upload a sample document and use the APIs available. You can also talk with your account manager about how best to incorporate these new features. Amazon Textract has many resources to help you get started, like blog posts, videos, partners, and getting started guides. Check out the Textract resources page for more information.

You have millions of documents, which means you have a ton of meaningful and critical data within those documents. You can extract and process your data in seconds rather than days, and keep it secure by using Amazon Textract. Get started today.

About the Author

Andrea Morton-Youmans is a Product Marketing Manager on the AI Services team at AWS. Over the past 10 years she has worked in the technology and telecommunications industries, focused on developer storytelling and marketing campaigns. In her spare time, she enjoys heading to the lake with her husband and Aussie dog Oakley, tasting wine and enjoying a movie from time to time.



Trump pardons former Google self-driving car engineer Anthony Levandowski




(Reuters) — U.S. President Donald Trump said on Wednesday he had given a full pardon to a former Google engineer sentenced for stealing a trade secret on self-driving cars months before he briefly headed Uber’s rival unit.

Anthony Levandowski, 40, was sentenced in August to 18 months in prison after pleading guilty in March. He was not in custody but a judge had said he could enter custody once the COVID-19 pandemic subsided.

The White House said Levandowski had “paid a significant price for his actions and plans to devote his talents to advance the public good.”

Alphabet’s Waymo, a self-driving auto technology unit spun out of Google, declined to comment. The company previously described Levandowski’s crime as “a betrayal” and his sentence as “a win for trade secret laws.”

The pardon was backed by several leaders in the technology industry who have supported Trump, including investors Peter Thiel and Blake Masters and entrepreneur Palmer Luckey, according to the White House.

Levandowski transferred more than 14,000 Google files, including development schedules and product designs, to his personal laptop before he left, and while negotiating a new role with Uber.






Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines




We recently announced Amazon SageMaker Pipelines, the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). SageMaker Pipelines is a native workflow orchestration tool for building ML pipelines that take advantage of direct Amazon SageMaker integration. Three components improve the operational resilience and reproducibility of your ML workflows: pipelines, model registry, and projects. These workflow automation components enable you to easily scale your ability to build, train, test, and deploy hundreds of models in production, iterate faster, reduce errors due to manual orchestration, and build repeatable mechanisms.

SageMaker projects introduce MLOps templates that automatically provision the underlying resources needed to enable CI/CD capabilities for your ML development lifecycle. You can use a number of built-in templates or create your own custom template. You can use SageMaker Pipelines independently to create automated workflows; however, when used in combination with SageMaker projects, the additional CI/CD capabilities are provided automatically. The following screenshot shows how the three components of SageMaker Pipelines can work together in an example SageMaker project.

This post focuses on using an MLOps template to bootstrap your ML project and establish a CI/CD pattern from sample code. We show how to use the built-in build, train, and deploy project template as a base for a customer churn classification example. This base template enables CI/CD for training ML models, registering model artifacts to the model registry, and automating model deployment with manual approval and automated testing.

MLOps template for building, training, and deploying models

We start by taking a detailed look at which AWS services are provisioned when this build, train, and deploy MLOps template is launched. Later, we discuss how to modify the skeleton for a custom use case.

To get started with SageMaker projects, you must first enable it on the Amazon SageMaker Studio console. This can be done for existing users or while creating new ones. For more information, see SageMaker Studio Permissions Required to Use Projects.

In SageMaker Studio, you can now choose Projects on the Components and registries menu.

On the projects page, you can launch a preconfigured SageMaker MLOps template. For this post, we choose MLOps template for model building, training, and deployment.

Launching this template starts a model building pipeline by default, and while there is no cost for using SageMaker Pipelines itself, you will be charged for the services launched. Cost varies by Region. A single run of the model build pipeline in us-east-1 is estimated to cost less than $0.50. Models approved for deployment incur the cost of the SageMaker endpoints (test and production) for the Region using an ml.m5.large instance.

After the project is created from the MLOps template, the following architecture is deployed.

Included in the architecture are the following AWS services and resources:

  • The MLOps templates that are made available through SageMaker projects are provided via an AWS Service Catalog portfolio that automatically gets imported when a user enables projects on the Studio domain.
  • Two repositories are added to AWS CodeCommit:
    • The first repository provides scaffolding code to create a multi-step model building pipeline including the following steps: data processing, model training, model evaluation, and conditional model registration based on accuracy. As you can see in the file, this pipeline trains a regression model using the XGBoost algorithm on the well-known UCI Abalone dataset. This repository also includes a build specification file, used by AWS CodePipeline and AWS CodeBuild to run the pipeline automatically.
    • The second repository contains code and configuration files for model deployment, as well as test scripts required to pass the quality gate. This repo also uses CodePipeline and CodeBuild, which run an AWS CloudFormation template to create model endpoints for staging and production.
  • Two CodePipeline pipelines:
    • The ModelBuild pipeline automatically triggers and runs the pipeline from end to end whenever a new commit is made to the ModelBuild CodeCommit repository.
    • The ModelDeploy pipeline automatically triggers whenever a new model version is added to the model registry and the status is marked as Approved. Models that are registered with Pending or Rejected statuses aren’t deployed.
  • An Amazon Simple Storage Service (Amazon S3) bucket is created for output model artifacts generated from the pipeline.
  • SageMaker Pipelines uses the following resources:
    • This workflow contains the directed acyclic graph (DAG) that trains and evaluates our model. Each step in the pipeline keeps track of the lineage and intermediate steps can be cached for quickly re-running the pipeline. Outside of templates, you can also create pipelines using the SDK.
    • Within SageMaker Pipelines, the SageMaker model registry tracks the model versions and respective artifacts, including the lineage and metadata for how they were created. Different model versions are grouped together under a model group, and new models registered to the registry are automatically versioned. The model registry also provides an approval workflow for model versions and supports deployment of models in different accounts. You can also use the model registry through the boto3 package.
  • Two SageMaker endpoints:
    • After a model is approved in the registry, the artifact is automatically deployed to a staging endpoint followed by a manual approval step.
    • If approved, it’s deployed to a production endpoint in the same AWS account.

All SageMaker resources, such as training jobs, pipelines, models, and endpoints, as well as AWS resources listed in this post, are automatically tagged with the project name and a unique project ID tag.
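As a concrete example of the approval workflow, a registered model version can also be approved programmatically (rather than through the Studio UI) with a sketch like the one below. The model package ARN is a placeholder, and flipping the status to Approved is what the ModelDeploy pipeline reacts to.

```python
def build_approval_update(model_package_arn, status="Approved"):
    """Build the kwargs for UpdateModelPackage; the ARN is a placeholder."""
    assert status in ("Approved", "Rejected", "PendingManualApproval")
    return {"ModelPackageArn": model_package_arn, "ModelApprovalStatus": status}

def approve_model(model_package_arn):
    """Sketch: flip a registered model version to Approved, which
    triggers the ModelDeploy pipeline."""
    import boto3  # needs AWS credentials when actually called
    sm = boto3.client("sagemaker")
    return sm.update_model_package(**build_approval_update(model_package_arn))
```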

Modifying the sample code for a custom use case

After your project has been created, the architecture described earlier is deployed and the visualization of the pipeline is available on the Pipelines drop-down menu within SageMaker Studio.

To modify the sample code from this launched template, we first need to clone the CodeCommit repositories to our local SageMaker Studio instance. From the list of projects, choose the one that was just created. On the Repositories tab, you can select the hyperlinks to locally clone the CodeCommit repos.

ModelBuild repo

The ModelBuild repository contains the code for preprocessing, training, and evaluating the model. The sample code trains and evaluates a model on the UCI Abalone dataset. We can modify these files to solve our own customer churn use case. See the following code:

|-- codebuild-buildspec.yml
|-- pipelines
| |-- abalone
| | |--
| | |--
| | |--
| | |--
| |--
| |--
| |--
| |--
| |--
|-- sagemaker-pipelines-project.ipynb
|-- setup.cfg
|-- tests
| --
|-- tox.ini

We now need a dataset accessible to the project.

  1. Open a new SageMaker notebook inside Studio and run the following cells:
    !unzip -o
    !mv "Data sets" Datasets

    import os
    import boto3
    import sagemaker

    prefix = 'sagemaker/DEMO-xgboost-churn'
    region = boto3.Session().region_name
    default_bucket = sagemaker.session.Session().default_bucket()
    role = sagemaker.get_execution_role()

    RawData = boto3.Session().resource('s3')\
        .Bucket(default_bucket)\
        .Object(os.path.join(prefix, 'data/RawData.csv'))\
        .upload_file('./Datasets/churn.txt')

    print(os.path.join("s3://", default_bucket, prefix, 'data/RawData.csv'))

  2. Rename the abalone directory to customer_churn. This requires modifying the path inside codebuild-buildspec.yml, as shown in the sample repository. See the following code:
    run-pipeline --module-name pipelines.customer_churn.pipeline

  3. Replace the code with the customer churn preprocessing script found in the sample repository.
  4. Replace the code with the customer churn pipeline script found in the sample repository.
    1. Be sure to replace the “InputDataUrl” default parameter with the Amazon S3 URL obtained in step 1:
      input_data = ParameterString( name="InputDataUrl", default_value=f"s3://YOUR_BUCKET/RawData.csv",

    2. Update the conditional step to evaluate the classification model:
      # Conditional step for evaluating model quality and branching execution
      cond_lte = ConditionGreaterThanOrEqualTo(
          left=JsonGet(
              step=step_eval,
              property_file=evaluation_report,
              json_path="binary_classification_metrics.accuracy.value",
          ),
          right=0.8,
      )

    One last thing to note: the default ModelApprovalStatus is set to PendingManualApproval. If our model achieves greater than 80% accuracy, it’s added to the model registry, but not deployed until manual approval is complete.

  5. Replace the code with the customer churn evaluation script found in the sample repository. One piece of the code we’d like to point out is that, because we’re evaluating a classification model, we need to update the metrics we’re evaluating and associating with trained models:
    report_dict = {
        "binary_classification_metrics": {
            "accuracy": {"value": acc, "standard_deviation": "NaN"},
            "auc": {"value": roc_auc, "standard_deviation": "NaN"},
        },
    }

    evaluation_output_path = '/opt/ml/processing/evaluation/evaluation.json'
    with open(evaluation_output_path, 'w') as f:
        f.write(json.dumps(report_dict))

The JSON structure of these metrics is required to match the format of sagemaker.model_metrics for complete integration with the model registry.
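Because a malformed report can silently break that integration, a small sanity check before writing the file can help. The validator below only encodes the structure used in the snippet above; it is an illustrative check, not an official schema.

```python
def validate_report(report):
    """Check that an evaluation report matches the structure used above:
    metric name -> {"value": <number>, "standard_deviation": <str>}."""
    metrics = report.get("binary_classification_metrics")
    if not isinstance(metrics, dict) or not metrics:
        return False
    for name, entry in metrics.items():
        if set(entry) != {"value", "standard_deviation"}:
            return False
        if not isinstance(entry["value"], (int, float)):
            return False
    return True

good = {"binary_classification_metrics": {"accuracy": {"value": 0.91, "standard_deviation": "NaN"}}}
print(validate_report(good))  # True
```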

ModelDeploy repo

The ModelDeploy repository contains the AWS CloudFormation buildspec for the deployment pipeline. We don’t make any modifications to this code because it’s sufficient for our customer churn use case. It’s worth noting that model tests can be added to this repo to gate model deployment. See the following code:

├── buildspec.yml
├── endpoint-config-template.yml
├── prod-config.json
├── staging-config.json
└── test
    ├── buildspec.yml
    └──

Triggering a pipeline run

Committing these changes to the CodeCommit repository (easily done on the Studio source control tab) triggers a new pipeline run, because an Amazon EventBridge event monitors for commits. After a few moments, we can monitor the run by choosing the pipeline inside the SageMaker project.

The following screenshot shows our pipeline details. Choosing the pipeline run displays the steps of the pipeline, which you can monitor.

When the pipeline is complete, you can go to the Model groups tab inside the SageMaker project and inspect the metadata attached to the model artifacts.

If everything looks good, we can manually approve the model.

This approval triggers the ModelDeploy pipeline and exposes an endpoint for real-time inference.
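Once the production endpoint is live, client applications can call it through the SageMaker runtime API. The sketch below assumes the churn model's text/csv input format; the endpoint name and feature values are placeholders, and boto3 is imported lazily so the payload helper works standalone.

```python
def to_csv_payload(features):
    """Serialize one feature vector as the CSV body the endpoint expects."""
    return ",".join(str(f) for f in features)

def predict_churn(endpoint_name, features):
    """Sketch: invoke a deployed endpoint for real-time inference."""
    import boto3  # needs AWS credentials when actually called
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_payload(features),
    )
    return response["Body"].read().decode("utf-8")

print(to_csv_payload([186, 0.0, 137.8]))  # 186,0.0,137.8
```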



SageMaker Pipelines enables teams to leverage best practice CI/CD methods within their ML workflows. In this post, we showed how a data scientist can modify a preconfigured MLOps template for their own modeling use case. Among the many benefits is that the changes to the source code can be tracked, associated metadata can be tied to trained models for deployment approval, and repeated pipeline steps can be cached for reuse. To learn more about SageMaker Pipelines, check out the website and the documentation. Try SageMaker Pipelines in your own workflows today.

About the Authors

Sean Morgan is an AI/ML Solutions Architect at AWS. He previously worked in the semiconductor industry, using computer vision to improve product yield. He later transitioned to a DoD research lab where he specialized in adversarial ML defense and network security. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Addons.

Hallie Weishahn is an AI/ML Specialist Solutions Architect at AWS, focused on leading global standards for MLOps. She previously worked as an ML Specialist at Google Cloud Platform. She works with product, engineering, and key customers to build repeatable architectures and drive product roadmaps. She provides guidance and hands-on work to advance and scale machine learning use cases and technologies. Troubleshooting top issues and evaluating existing architectures to enable integrations from PoC to a full deployment is her strong suit.

Shelbee Eigenbrode is an AI/ML Specialist Solutions Architect at AWS. Her current areas of depth include DevOps combined with ML/AI. She has been in technology for 23 years, spanning multiple roles and technologies. With over 35 patents granted across various technology domains, her passion for continuous innovation combined with a love of all things data turned her focus to the field of data science. Combining her backgrounds in data, DevOps, and machine learning, her current passion is helping customers to not only embrace data science but also ensure all models have a path to production by adopting MLOps practices. In her spare time, she enjoys reading and spending time with family, including her fur family (aka dogs), as well as friends.




Labeling mixed-source, industrial datasets with Amazon SageMaker Ground Truth




Prior to using any kind of supervised machine learning (ML) algorithm, data has to be labeled. Amazon SageMaker Ground Truth simplifies and accelerates this task. Ground Truth uses pre-defined templates to assign labels that classify the content of images or videos or verify existing labels. Ground Truth allows you to define workflows for labeling various kinds of data, such as text, video, or images, without writing any code. Although these templates are applicable to a wide range of use cases in which the data to be labeled is in a single format or from a single source, industrial workloads often require labeling data from different sources and in different formats. This post explores the use case of industrial welding data consisting of sensor readings and images to show how to implement customized, complex, mixed-source labeling workflows using Ground Truth.

For this post, you deploy an AWS CloudFormation template in your AWS account to provision the foundational resources for implementing this labeling workflow. This provides you with hands-on experience for the following topics:

  • Creating a private labeling workforce in Ground Truth
  • Creating a custom labeling job using the Ground Truth framework with the following components:
    • Designing a pre-labeling AWS Lambda function that pulls data from different sources and runs a format conversion where necessary
    • Implementing a customized labeling user interface in Ground Truth using crowd templates that dynamically loads the data generated by the pre-labeling Lambda function
    • Consolidating labels from multiple workers using a customized post-labeling Lambda function
  • Configuring a custom labeling job using Ground Truth with a customized interface for displaying multiple pieces of data that have to be labeled as a single item

Prior to diving deep into the implementation, I provide an introduction into the use case and show how the Ground Truth custom labeling framework eases the implementation of highly complex labeling workflows. To make full use of this post, you need an AWS account on which you can deploy CloudFormation templates. The total cost incurred on your account for following this post is under $1.
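To make the pre-labeling component above more concrete: a Ground Truth pre-annotation Lambda receives one manifest entry in event["dataObject"] and must return a taskInput dict whose keys feed the custom UI template. The sketch below follows that request/response contract; the specific field names (image, sensorData) and the sensor-data-ref manifest key are illustrative choices for this welding use case, not prescribed by Ground Truth.

```python
def lambda_handler(event, context):
    """Sketch of a Ground Truth pre-annotation Lambda: map one manifest
    entry onto the variables the custom labeling UI template consumes."""
    data_object = event["dataObject"]  # one line of the input manifest
    return {
        "taskInput": {
            "image": data_object.get("source-ref", ""),
            "sensorData": data_object.get("sensor-data-ref", ""),  # illustrative key
        }
    }

event = {"dataObject": {"source-ref": "s3://bucket/images/weld_001.jpg"}}
print(lambda_handler(event, None))
```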

Labeling complex datasets for industrial welding quality control

Although the mechanisms discussed in this post are generally applicable to any labeling workflow with different data formats, I use data from a welding quality control use case. In this use case, the manufacturing company running the welding process wants to predict whether the welding result will be OK or whether anomalies occurred during the process. To implement this with a supervised ML model, you need labeled data with which to train the model: datasets representing welding processes, labeled to indicate whether each process was normal or anomalous. We implement this labeling process (not the ML or modeling process) using Ground Truth, which allows welding experts to assess the result of a welding run and assign that result to a dataset consisting of images and sensor data.

The CloudFormation template creates an Amazon Simple Storage Service (Amazon S3) bucket in your AWS account that contains images (prefix images) and CSV files (prefix sensor_data). The images contain pictures taken during an industrial welding process similar to the following, where a welding beam is applied onto a metal surface (for image source, see TIG Stainless Steel 304):

The CSV files contain sensor data representing current, electrode position, and voltage measured by sensors on the welding machine. For the full dataset, see the GitHub repo. A raw sample of this CSV data is as follows:


The first column of the data is a timestamp in milliseconds, normalized to the start of the welding process. Each row consists of various sensor values associated with the timestamp. The first value after the timestamp is the electrode position, the second is the current, and the third is the voltage (the other values are irrelevant here). For instance, the row with timestamp 1, 100 milliseconds after the start of the welding process, has an electrode position of 94.79, a current of 1464, and a voltage of 428.

Because it’s difficult for humans to make assessments using the raw CSV data, I also show how to preprocess such data on the fly for labeling and turn it into more easily readable plots. This way, the welding experts can view the images and the plots to make their assessment about the welding process.
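One way to do this on-the-fly conversion is sketched below: parse the raw rows into named series, then render them as plots. It assumes the relevant values occupy the first three columns after the timestamp, per the description above; the matplotlib import is deferred so the parser works standalone.

```python
def parse_sensor_rows(rows):
    """Split raw CSV rows into named series: per the layout above, the
    value after the timestamp is electrode position, then current, then voltage."""
    series = {"timestamp": [], "position": [], "current": [], "voltage": []}
    for row in rows:
        fields = [float(x) for x in row.split(",")]
        series["timestamp"].append(fields[0])
        series["position"].append(fields[1])
        series["current"].append(fields[2])
        series["voltage"].append(fields[3])
    return series

def plot_series(series, out_path="welding_plot.png"):
    """Render the series to a PNG a labeler can view next to the images."""
    import matplotlib.pyplot as plt  # deferred: only needed when plotting
    fig, axes = plt.subplots(3, 1, sharex=True)
    for ax, name in zip(axes, ("position", "current", "voltage")):
        ax.plot(series["timestamp"], series[name])
        ax.set_ylabel(name)
    axes[-1].set_xlabel("timestamp")
    fig.savefig(out_path)

demo = parse_sensor_rows(["0,94.31,1420,415", "1,94.79,1464,428"])
print(demo["current"])  # [1420.0, 1464.0]
```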

Deploying the CloudFormation template

To simplify the setup and configurations needed in the following, I created a CloudFormation template that deploys several foundations into your AWS account. To start this process, complete the following steps:

  1. Sign in to your AWS account.
  2. Choose one of the following links, depending on which AWS Region you’re using:
  3. Keep all the parameters as they are and select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  4. Choose Create stack to start the deployment.

The deployment takes about 3–5 minutes, during which time a bucket with data to label, some AWS Lambda functions, and an AWS Identity and Access Management (IAM) role are deployed. The process is complete when the status of the deployment switches to CREATE_COMPLETE.

The Outputs tab has additional information, such as the Amazon S3 path to the manifest file, which you use throughout this post. Therefore, it’s recommended to keep this browser tab open and follow the rest of the post in another tab.

Creating a Ground Truth labeling workforce

Ground Truth offers three options for defining workforces that complete the labeling: Amazon Mechanical Turk, vendor-specific workforces, and private workforces. In this section, we configure a private workforce because we want to complete the labeling ourselves. Create a private workforce with the following steps:

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
  2. On the Private tab, choose Create private team.

  3. Enter a name for the labeling workforce. For our use case, I enter welding-experts.
  4. Select Invite new workers by email.
  5. Enter your e-mail address, an organization name, and a contact e-mail (which may be the same as the one you just entered).
  6. Choose Create private team.

The console confirms the creation of the labeling workforce at the top of the screen. When you refresh the page, the new workforce shows on the Private tab, under Private teams.

You also receive an e-mail with login instructions, including a temporary password and a link to open the login page.

  7. Choose the link and use your e-mail address and temporary password to authenticate and change the password for the login.

It’s recommended to keep this browser tab open so you don’t have to log in again. This concludes all necessary steps to create your workforce.

Configuring a custom labeling job

In this section, we create a labeling job and use this job to explain the details and data flow of a custom labeling job.

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling jobs.
  2. Choose Create labeling job.

  3. Enter a name for your labeling job, such as WeldingLabelJob1.
  4. Choose Manual data setup.
  5. For Input dataset location, enter the ManifestS3Path value from the CloudFormation stack Outputs tab.
  6. For Output dataset location, enter the ProposedOutputPath value from the CloudFormation stack Outputs tab.
  7. For IAM role, choose Enter a custom IAM role ARN.
  8. Enter the SagemakerServiceRoleArn value from the CloudFormation stack Outputs tab.
  9. For the task type, choose Custom.
  10. Choose Next.

The IAM role is a customized role created by the CloudFormation template that allows Ground Truth to invoke Lambda functions and access Amazon S3.

  11. Choose to use a private labeling workforce.
  12. From the drop-down menu, choose the workforce welding-experts.
  13. For the task timeout and task expiration time, 1 hour is sufficient.
  14. For the number of workers per dataset object, keep 1.
  15. In the Lambda functions section, for Pre-labeling task Lambda function, choose the function that starts with PreLabelingLambda-.
  16. For Post-labeling task Lambda function, choose the function that starts with PostLabelingLambda-.
  17. Enter the following code into the templates section. This HTML code specifies the interface that the workers in the private labeling workforce see when labeling items. For our use case, the template displays four images, and the categories to classify welding results are as follows:
    <script src=""></script>
    <crowd-form>
      <crowd-classifier
        name="WeldingClassification"
        categories="['Good Weld', 'Burn Through', 'Contamination', 'Lack of Fusion', 'Lack of Shielding Gas', 'High Travel Speed', 'Not sure']"
        header="Please classify the welding process.">
        <classification-target>
          <div>
            <h3>Welding Image</h3>
            <p><strong>Welding Camera Image </strong>{{ task.input.image.title }}</p>
            <p><a href="{{ task.input.image.file | grant_read_access }}" target="_blank">Download Image</a></p>
            <p><img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.image.file | grant_read_access }}"/></p>
          </div>
          <hr/>
          <div>
            <h3>Current Graph</h3>
            <p><strong>Current Graph </strong>{{ task.input.current.title }}</p>
            <p><a href="{{ task.input.current.file | grant_read_access }}" target="_blank">Download Current Plot</a></p>
            <p><img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.current.file | grant_read_access }}"/></p>
          </div>
          <hr/>
          <div>
            <h3>Electrode Position Graph</h3>
            <p><strong>Electrode Position Graph </strong>{{ task.input.electrode.title }}</p>
            <p><a href="{{ task.input.electrode.file | grant_read_access }}" target="_blank">Download Electrode Position Plot</a></p>
            <p><img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.electrode.file | grant_read_access }}"/></p>
          </div>
          <hr/>
          <div>
            <h3>Voltage Graph</h3>
            <p><strong>Voltage Graph </strong>{{ task.input.voltage.title }}</p>
            <p><a href="{{ task.input.voltage.file | grant_read_access }}" target="_blank">Download Voltage Plot</a></p>
            <p><img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.voltage.file | grant_read_access }}"/></p>
          </div>
        </classification-target>
        <full-instructions header="Classification Instructions">
          <p>Read the task carefully and inspect the image as well as the plots.</p>
          <p>The image is a picture taken during the welding process. The plots show the corresponding sensor data for the electrode position, the voltage, and the current measured during the welding process.</p>
        </full-instructions>
        <short-instructions>
          <p>Read the task carefully and inspect the image as well as the plots.</p>
        </short-instructions>
      </crowd-classifier>
    </crowd-form>

The wizard for creating the labeling job has a preview function in the section Custom labeling task setup, which you can use to check if all configurations work properly.

  18. To preview the interface, choose Preview.

This opens a new browser tab and shows a test version of the labeling interface, similar to the following screenshot.

  19. To create the labeling job, choose Create.

Ground Truth sets up the labeling job as specified, and the dashboard shows its status.
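The same configuration can also be expressed against the SageMaker CreateLabelingJob API. The sketch below only assembles the request dictionary (to be passed as `sagemaker_client.create_labeling_job(**request)`); the workteam and Lambda ARNs, the manifest path, and the template location are hypothetical placeholders for the values from your account:

```python
# Sketch of the equivalent CreateLabelingJob request. ARNs, the manifest path,
# and the template S3 key are hypothetical placeholders.
bucket = "iiot-custom-label-blog-bucket-unn4d0l4j0"
account = "261679111194"
request = {
    "LabelingJobName": "WeldingLabelJob1",
    "LabelAttributeName": "WeldingLabelJob1",
    "InputConfig": {"DataSource": {"S3DataSource": {
        "ManifestS3Uri": f"s3://{bucket}/manifest/manifest.json"}}},  # ManifestS3Path output
    "OutputConfig": {"S3OutputPath": f"s3://{bucket}/output"},        # ProposedOutputPath output
    "RoleArn": f"arn:aws:iam::{account}:role/AmazonSageMaker-Service-role-unn4d0l4j0",
    "HumanTaskConfig": {
        "WorkteamArn": f"arn:aws:sagemaker:eu-west-1:{account}:workteam/private-crowd/welding-experts",
        "UiConfig": {"UiTemplateS3Uri": f"s3://{bucket}/template.liquid"},
        "PreHumanTaskLambdaArn": f"arn:aws:lambda:eu-west-1:{account}:function:PreLabelingLambda-example",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn":
                f"arn:aws:lambda:eu-west-1:{account}:function:PostLabelingLambda-example"},
        "TaskTitle": "Classify welding quality",
        "TaskDescription": "Please classify the welding process.",
        "NumberOfHumanWorkersPerDataObject": 1,   # one worker per item, as above
        "TaskTimeLimitInSeconds": 3600,           # 1 hour, as configured above
        "TaskAvailabilityLifetimeInSeconds": 3600,
    },
}
```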

Assigning labels

To finalize the labeling job that you configured, you log in to the worker portal and assign labels to different data items consisting of images and data plots. The details on how the different components of the labeling job work together are explained in the next section.

  1. On the Amazon SageMaker console, under Ground Truth, choose Labeling workforces.
  2. On the Private tab, choose the link for Labeling portal sign-in URL.

When Ground Truth is finished preparing the labeling job, you can see it listed in the Jobs section. If it’s not showing up, wait a few minutes and refresh the tab.

  3. Choose Start working.

This launches the labeling UI, which allows you to assign labels to mixed datasets consisting of welding images and plots for current, electrode position, and voltage.

For this use case, you can assign seven different labels to a single dataset. These different classes and labels are defined in the HTML of the UI, but you can also insert them dynamically using the pre-labeling Lambda function (discussed in the next section). Because we don’t actually use the labeled data for ML purposes, you can assign the labels randomly to the five items that are displayed by Ground Truth for this labeling job.

After labeling all the items, the UI switches back to the list with available jobs. This concludes the section about configuring and launching the labeling job. In the next section, I explain the mechanics of a custom labeling job in detail and also dive deep into the different elements of the HTML interface.

Custom labeling deep dive

A custom labeling job combines the data to be labeled with three components to create a workflow that allows workers from the labeling workforce to assign labels to each item in the dataset:

  • Pre-labeling Lambda function – Generates the content to be displayed on the labeling interface using the manifest file specified during the configuration of the labeling job. For this use case, the function also converts the CSV files into human-readable plots and stores these plots as images in the S3 bucket under the prefix plots.
  • Labeling interface – Uses the output of the pre-labeling function to generate a user interface. For this use case, the interface displays four images (the picture taken during the welding process and the three graphs for current, electrode position, and voltage) and a form that allows workers to classify the welding process.
  • Label consolidation Lambda function – Allows you to implement custom strategies to consolidate classifications of one or several workers into a single response. For our workforce, this is very simple because there is only a single worker whose labels are consolidated into a file, which is stored by Ground Truth into Amazon S3.

Before we analyze these three components, I provide insights into the structure of the manifest file, which describes the data sources for the labeling job.

Manifest and dataset files

The manifest file is a file conforming to the JSON lines format, in which each line represents one item to label. Ground Truth expects either a key source or source-ref in each line of the file. For this use case, I use source, and the mapped value must be a string representing an Amazon S3 path. For this post, we only label five items, and the JSON lines are similar to the following code:

{"source": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-1.json"}

For our use case with multiple input formats and files, each line in the manifest points to a dataset file that is also stored on Amazon S3. Our dataset is a JSON document, which contains references to the welding images and the CSV file with the sensor data:

{ "sensor_data": {"s3Path": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/sensor_data/weld.1.csv"}, "image": {"s3Path": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png"} }
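Both the manifest and the dataset documents follow a regular pattern, so they can be generated with a short script. A sketch, assuming the bucket name and file layout from the examples in this post:

```python
import json

bucket = "iiot-custom-label-blog-bucket-unn4d0l4j0"

manifest_lines = []
dataset_docs = {}
for i in range(1, 6):  # five items, as in this post
    # One dataset document per item, to be uploaded as dataset/dataset-{i}.json.
    dataset_docs[f"dataset/dataset-{i}.json"] = {
        "sensor_data": {"s3Path": f"s3://{bucket}/sensor_data/weld.{i}.csv"},
        "image": {"s3Path": f"s3://{bucket}/images/weld.{i}.png"},
    }
    # One JSON line per item in the manifest, pointing at that document.
    manifest_lines.append(json.dumps({"source": f"s3://{bucket}/dataset/dataset-{i}.json"}))

manifest = "\n".join(manifest_lines)
```

Uploading the documents and the manifest to Amazon S3 (for example with boto3's `put_object`) is all that remains before the manifest can be referenced as the input dataset location.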

Ground Truth takes each line of the manifest file and triggers the pre-labeling Lambda function, which we discuss next.

Pre-labeling Lambda function

A pre-labeling Lambda function creates a JSON object that is used to populate the item-specific portions of the labeling interface. For more information, see Processing with AWS Lambda.

Before Ground Truth displays an item for labeling to a worker, it runs the pre-labeling function and forwards the information in the manifest’s JSON line to the function. For our use case, the event passed to the function is as follows:

{ "version": "2018-10-06", "labelingJobArn": "arn:aws:sagemaker:eu-west-1:XXX:labeling-job/weldinglabeljob1", "dataObject": { "source": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-1.json" } }

Although I omit the implementation details here (for those interested, the code is deployed with the CloudFormation template for review), the function for our labeling job uses this input to complete the following steps:

  1. Download the dataset file that is referenced in the source field of the input (see the preceding code).
  2. Download the CSV file containing the sensor data that is referenced in the dataset file.
  3. Generate plots for current, electrode position, and voltage from the contents of the CSV file.
  4. Upload the plot files to Amazon S3.
  5. Generate a JSON object containing the references to the aforementioned plot files and the welding image referenced in the dataset file.

When these steps are complete, the function returns a JSON object with two parts:

  • taskInput – Fully customizable JSON object that contains information to be displayed on the labeling UI.
  • isHumanAnnotationRequired – A string representing a Boolean value (True or False), which you can use to exclude objects from being labeled by humans. I don’t use this flag for this use case because we want to label all the provided data items.
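Leaving aside the S3 downloads and the plot generation, the assembly of this return value can be sketched as follows; the helper's plot_paths argument is a hypothetical mapping from plot name to the S3 path produced in the earlier steps:

```python
def build_pre_label_response(dataset, plot_paths):
    """Assemble the pre-labeling Lambda's return value.

    dataset    -- the parsed dataset JSON document (image + sensor_data paths)
    plot_paths -- hypothetical mapping: plot name -> S3 path of the rendered plot
    """
    image_path = dataset["image"]["s3Path"]
    task_input = {"image": {"file": image_path, "title": "from image at " + image_path}}
    for name in ("current", "electrode", "voltage"):
        task_input[name] = {"file": plot_paths[name],
                            "title": "from file at " + plot_paths[name]}
    # Note: "isHumanAnnotationRequired" is a string, not a Boolean.
    return {"taskInput": task_input, "isHumanAnnotationRequired": "true"}
```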

For more information, see Processing with AWS Lambda.

Because I want to show the welding images and the three graphs for current, electrode position, and voltage, the result of the Lambda function is as follows for the first dataset:

{ "taskInput": { "image": { "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png", "title": " from image at s3://iiot-custom-label-blog-bucket-unn4d0l4j0/images/weld.1.png" }, "voltage": { "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-voltage.png", "title": " from file at plots/weld.1.csv-voltage.png" }, "electrode": { "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-electrode_pos.png", "title": " from file at plots/weld.1.csv-electrode_pos.png" }, "current": { "file": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/plots/weld.1.csv-current.png", "title": " from file at plots/weld.1.csv-current.png" } }, "isHumanAnnotationRequired": "true" }

In the preceding code, the taskInput object is fully customizable; the function returns the Amazon S3 paths to the images to display, along with a title for each containing some descriptive, non-functional text. Next, I show how to access these different parts of the taskInput JSON object when building the customized labeling UI displayed to workers by Ground Truth.

Labeling UI: Accessing taskInput content

Ground Truth uses the output of the Lambda function to fill in content into the HTML skeleton that is provided at the creation of the labeling job. In general, the contents of the taskInput output object are accessed using task.input in the HTML code.

For instance, to retrieve the Amazon S3 path where the welding image is stored from the output, you need to access the path taskInput/image/file. Because the taskInput object from the function output is mapped to task.input in the HTML, the corresponding reference to the welding image file is task.input.image.file. This reference is directly integrated into the HTML code of the labeling UI to display the welding image:

<img style="height: 30vh; margin-bottom: 10px" src="{{ task.input.image.file | grant_read_access }}"/>

The grant_read_access filter is needed for files in S3 buckets that aren’t publicly accessible. This makes sure that the URL passed to the browser contains a short-lived access token for the image and thereby avoids having to make resources publicly accessible for labeling jobs. This is often mandatory because the data to be labeled, such as machine data, is confidential. Because the pre-labeling function has also converted the CSV files into plots and images, their integration into the UI is analogous.

Label consolidation Lambda function

The second Lambda function configured for the custom labeling job runs when all workers have labeled an item or the time limit of the labeling job is reached. The key task of this function is to derive a single label from the responses of the workers. Additionally, the function can be used for any kind of further processing of the labeled data, such as storing it on Amazon S3 in a format ideally suited for the ML pipeline that you use.

Although there are different possible strategies to consolidate labels, I focus on the cornerstones of the implementation for such a function and show how they translate to our use case. The consolidation function is triggered by an event similar to the following JSON code:

{ "version": "2018-10-06", "labelingJobArn": "arn:aws:sagemaker:eu-west-1:261679111194:labeling-job/weldinglabeljob1", "payload": { "s3Uri": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/output/WeldingLabelJob1/annotations/consolidated-annotation/consolidation-request/iteration-1/2020-09-15_16:16:11.json" }, "labelAttributeName": "WeldingLabelJob1", "roleArn": "arn:aws:iam::261679111194:role/AmazonSageMaker-Service-role-unn4d0l4j0", "outputConfig": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/output/WeldingLabelJob1/annotations", "maxHumanWorkersPerDataObject": 1 }

The key item in this event is the payload, which contains an s3Uri pointing to a file stored on Amazon S3. This payload file contains the list of datasets that have been labeled and the labels assigned to them by workers. The following code is an example of such a list entry:

{ "datasetObjectId": "4", "dataObject": { "s3Uri": "s3://iiot-custom-label-blog-bucket-unn4d0l4j0/dataset/dataset-5.json" }, "annotations": [ { "workerId": "", "annotationData": { "content": "{\"WeldingClassification\":{\"label\":\"Not sure\"}}" } } ] }

Each entry lists, for each dataset item, which labels have been assigned, along with a workerId that you could use to determine which worker labeled the item. In the case of multiple workers, there are multiple entries in annotations; because I created a single worker that labeled all the items for this post, there is only a single entry. Here, the file dataset-5.json has been labeled with Not sure for the classifier WeldingClassification.

The label consolidation function has to iterate over all list entries and determine for each dataset a label to use as the ground truth for supervised ML training. Ground Truth expects the function to return a list containing an entry for each dataset item with the following structure:

{ "datasetObjectId": "4", "consolidatedAnnotation": { "content": { "WeldingLabelJob1": { "WeldingClassification": "Not sure" } } } }

Each entry of the returned list must contain the datasetObjectId for the corresponding entry in the payload file and a JSON object consolidatedAnnotation, which contains an object content. Ground Truth expects content to contain a key that equals the name of the labeling job (for our use case, WeldingLabelJob1). For more information, see Processing with AWS Lambda.
You can change this behavior when you create the labeling job by selecting I want to specify a label attribute name different from the labeling job name and entering a label attribute name.

The content inside this key equaling the name of the labeling job is freely configurable and can be arbitrarily complex. For our use case, it’s enough to return the assigned label, Not sure. If any of these formatting requirements are not met, Ground Truth assumes the labeling job didn’t run properly and marks it as failed.
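Putting these pieces together, the core of a consolidation function can be sketched as follows. It uses a simple majority vote (which, with a single worker, just picks that worker's label); the key nuance is that each worker's answer arrives as a JSON string inside content and must be parsed:

```python
import json
from collections import Counter

def consolidate(payload_entries, label_attribute_name):
    """Reduce worker annotations to one consolidated label per dataset item."""
    consolidated = []
    for entry in payload_entries:
        # "content" is a JSON *string*, so it needs an extra json.loads.
        labels = [
            json.loads(a["annotationData"]["content"])["WeldingClassification"]["label"]
            for a in entry["annotations"]
        ]
        winner = Counter(labels).most_common(1)[0][0]  # majority vote
        consolidated.append({
            "datasetObjectId": entry["datasetObjectId"],
            "consolidatedAnnotation": {
                "content": {label_attribute_name: {"WeldingClassification": winner}}
            },
        })
    return consolidated
```

With multiple workers per item, the majority vote would be replaced by whatever consolidation strategy fits your quality requirements.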

Because I specified output as the desired prefix during the creation of the labeling job, the requirements are met, and Ground Truth uploads the consolidated labels as a list of JSON entries into the specified bucket under that prefix.


You can use such files for training ML algorithms in Amazon SageMaker or for further processing.

Cleaning up

To avoid incurring future charges, delete all resources created for this post.

  1. On the AWS CloudFormation console, choose Stacks.
  2. Select the stack iiot-custom-label-blog.
  3. Choose Delete.

This step removes all files and the S3 bucket from your account. The process takes about 3–5 minutes.


Conclusion

Supervised ML requires labeled data, and Ground Truth provides a platform for creating labeling workflows. This post showed how to build a complex industrial IoT labeling workflow, in which data from multiple sources needs to be considered for labeling items. The post explained how to create a custom labeling job and provided details on the mechanisms Ground Truth requires to implement such a workflow. To get started with writing your own custom labeling job, see the custom labeling documentation for Ground Truth, and consider re-deploying the CloudFormation template from this post to get samples for the pre-labeling and consolidation Lambda functions. The blog post “Creating custom labeling jobs with AWS Lambda and Amazon SageMaker Ground Truth” provides additional insights into building custom labeling jobs.

About the Author

As a Principal Prototyping Engagement Manager, Dr. Markus Bestehorn is responsible for building business-critical prototypes with AWS customers, and is a specialist for IoT and machine learning. His “career” started as a 7-year-old when he got his hands on a computer with two 5.25” floppy disks, no hard disk, and no mouse, on which he started writing BASIC, and later C as well as C++ programs. He holds a PhD in computer science and all currently available AWS certifications. When he’s not on the computer, he runs or climbs mountains.




Facebook and Instagram’s AI-generated image captions now offer far more details




Every picture posted to Facebook and Instagram gets a caption generated by an image analysis AI, and that AI just got a lot smarter. The improved system should be a treat for visually impaired users, and may help you find your photos faster in the future.

Alt text is a field in an image’s metadata that describes its contents: “A person standing in a field with a horse,” or “a dog on a boat.” This lets the image be understood by people who can’t see it.

These descriptions are often added manually by a photographer or publication, but people uploading photos to social media generally don’t bother, if they even have the option. So the relatively recent ability to automatically generate one — the technology has only just gotten good enough in the last couple years — has been extremely helpful in making social media more accessible in general.

Facebook created its Automatic Alt Text system in 2016, which is eons ago in the field of machine learning. The team has since cooked up many improvements to it, making it faster and more detailed, and the latest update adds an option to generate a more detailed description on demand.

The improved system recognizes 10 times more items and concepts than it did at the start, now around 1,200. And the descriptions include more detail. What was once “Two people by a building” may now be “A selfie of two people by the Eiffel Tower.” (The actual descriptions hedge with “may be…” and will avoid including wild guesses.)

But there’s more detail than that, even if it’s not always relevant. For instance, in this image the AI notes the relative positions of the people and objects:

Image Credits: Facebook

Obviously the people are above the drums, and the hats are above the people, none of which really needs to be said for someone to get the gist. But consider an image described as “A house and some trees and a mountain.” Is the house on the mountain or in front of it? Are the trees in front of or behind the house, or maybe on the mountain in the distance?

In order to adequately describe the image, these details should be filled in, even if the general idea can be gotten across with fewer words. If a sighted person wants more detail they can look closer or click the image for a bigger version — someone who can’t do that now has a similar option with this “generate detailed image description” command. (Activate it with a long press in the Android app or a custom action in iOS.)

Perhaps the new description would be something like “A house and some trees in front of a mountain with snow on it.” That paints a better picture, right? (To be clear, these examples are made up, but it’s the sort of improvement that’s expected.)

The new detailed description feature will come to Facebook first for testing, though the improved vocabulary will appear on Instagram soon. The descriptions are also kept simple so they can be easily translated to other languages already supported by the apps, though the feature may not roll out in other countries simultaneously.

