

Huawei trained the Chinese-language equivalent of GPT-3




Join Transform 2021 this July 12-16. Register for the AI event of the year.

For the better part of a year, OpenAI’s GPT-3 has remained among the largest AI language models ever created, if not the largest of its kind. Via an API, people have used it to automatically write emails and articles, summarize text, compose poetry and recipes, create website layouts, and generate code for deep learning in Python. But GPT-3 has key limitations, chief among them that it’s only available in English. The 45-terabyte dataset the model was trained on drew exclusively from English-language sources.

This week, a research team at Chinese company Huawei quietly detailed what might be the Chinese-language equivalent of GPT-3. Called PanGu-Alpha (stylized PanGu-α), the 750-gigabyte model contains up to 200 billion parameters, 25 billion more than GPT-3, and was trained on 1.1 terabytes of Chinese-language ebooks, encyclopedias, news, social media, and web pages.

The team claims that the model achieves “superior” performance in Chinese-language tasks spanning text summarization, question answering, and dialogue generation. Huawei says it’s seeking a way to let nonprofit research institutes and companies gain access to pretrained PanGu-α models, either by releasing the code, model, and dataset or via APIs.

Familiar architecture

In machine learning, parameters are the part of the model that’s learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well.

Large language models like OpenAI’s GPT-3 learn to write humanlike text by internalizing billions of examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs.


Above: PanGu-α generating dialog for a video game.

Akin to GPT-3, PanGu-α is what’s called a generative pretrained transformer (GPT), a language model that is first pretrained on unlabeled text and then fine-tuned for tasks. Using Huawei’s MindSpore framework for development and testing, the researchers trained the model on a cluster of 2,048 Huawei Ascend 910 AI processors, each delivering 256 teraflops of computing power.

To build the training dataset for PanGu-α, the Huawei team collected nearly 80 terabytes of raw data from public datasets, including the popular Common Crawl dataset, as well as the open web. They then filtered the data, removing documents that contained less than 60% Chinese characters, had fewer than 150 characters, or consisted only of titles, advertisements, or navigation bars. Chinese text was converted into simplified Chinese, and 724 potentially offensive words, spam, and “low-quality” samples were filtered out.

One crucial difference between GPT-3 and PanGu-α is the number of tokens on which the models trained. Tokens, a way of separating pieces of text into smaller units in natural language, can be either words, characters, or parts of words. While GPT-3 trained on 499 billion tokens, PanGu-α trained on only 40 billion, suggesting it’s comparatively undertrained.
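The “undertrained” reading can be checked with a quick ratio of training tokens to parameters. The sketch below uses the token counts and the 200 billion parameter figure quoted in this article; the 175 billion figure for GPT-3 is its commonly cited size, and the ratio itself is only a rough heuristic, not a measure the Huawei team reports.

```python
# Token and parameter counts as quoted in the article.
# GPT-3's 175 billion parameters is the commonly cited figure (assumption here).
models = {
    "GPT-3": {"tokens": 499e9, "parameters": 175e9},
    "PanGu-Alpha": {"tokens": 40e9, "parameters": 200e9},
}


def tokens_per_parameter(tokens, parameters):
    """Rough training-data-to-model-size ratio."""
    return tokens / parameters


ratios = {name: tokens_per_parameter(**m) for name, m in models.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} training tokens per parameter")
```

By this measure GPT-3 saw roughly 14 times more training text per parameter than PanGu-α.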


Above: PanGu-α writing fiction.

Image Credit: Huawei

In experiments, the researchers say that PanGu-α was particularly adept at writing poetry, fiction, and dialog as well as summarizing text. Absent fine-tuning on examples, PanGu-α could generate poems in the Chinese forms of gushi and duilian. And given a brief conversation as prompt, the model could brainstorm rounds of “plausible” follow-up dialog.

This isn’t to suggest that PanGu-α solves all of the problems plaguing language models of its size. A focus group tasked with evaluating the model’s outputs found 10% of them to be “unacceptable” in terms of quality. And the researchers observed that some of PanGu-α’s creations contained irrelevant, repetitive, or illogical sentences.


Above: PanGu-α summarizing text from news articles.

The PanGu-α team also didn’t address some of the longstanding challenges in natural language generation, including the tendency of models to contradict themselves. Like GPT-3, PanGu-α can’t remember earlier conversations, and it lacks the ability to learn concepts through further conversation and to ground entities and actions to experiences in the real world.

“The main point of excitement is the extension of these large models to Chinese,” Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, told VentureBeat via email. “In other ways, it’s similar to GPT-3 in both its benefits and risks. Like GPT-3, it’s a huge model and can generate plausible outputs in a variety of scenarios, and so it’s exciting that we can extend this to non-English scenarios … By constructing this huge dataset, [Huawei is] able to train a model in Chinese at a similar scale to English models like GPT-3. So in sum, I’d point to the dataset and the Chinese domain as the most interesting factors, rather than the model architecture, though training a big model like this is always an engineering feat.”


Indeed, many experts believe that while PanGu-α and similarly large models are impressive with respect to their performance, they don’t move the ball forward on the research side of the equation. Rather, they’re prestige projects that demonstrate the scalability of existing techniques or serve as a showcase for a company’s products.

“I think the best analogy is with some oil-rich country being able to build a very tall skyscraper,” Guy Van den Broeck, an assistant professor of computer science at UCLA, said in a previous interview with VentureBeat. “Sure, a lot of money and engineering effort goes into building these things. And you do get the ‘state of the art’ in building tall buildings. But there is no scientific advancement per se … I’m sure academics and other companies will be happy to use these large language models in downstream tasks, but I don’t think they fundamentally change progress in AI.”


Above: PanGu-α writing articles.

Even OpenAI’s GPT-3 paper hinted at the limitations of merely throwing more compute at problems in natural language. While GPT-3 completes tasks from generating sentences to translating between languages with ease, it fails to perform much better than chance on a test — adversarial natural language inference — that tasks it with discovering relationships between sentences.

The PanGu-α team makes no claim that the model overcomes other blockers in natural language, like answering math problems correctly or responding to questions without paraphrasing training data. More problematically, their experiments didn’t probe PanGu-α for the types of bias and toxicity found to exist in models like GPT-3. OpenAI itself notes that GPT-3 places words like “naughty” or “sucked” near female pronouns and “Islam” near terms like “terrorism.” A separate paper by Stanford University Ph.D. candidate and Gradio founder Abubakar Abid details the inequitable tendencies of text generated by GPT-3, like associating the word “Jews” with “money.”

Carbon impact

Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models, examining who benefits from them and who’s disadvantaged. A paper coauthored by Gebru earlier this year spotlights the impact of large language models’ carbon footprint on minority communities and such models’ tendency to perpetuate abusive language, hate speech, microaggressions, stereotypes, and other dehumanizing language aimed at specific groups of people.

In particular, the effects of AI and machine learning model training on the environment have been brought into relief. In June 2019, researchers at the University of Massachusetts at Amherst released a report estimating that the amount of power required for training and searching a certain model involves the emissions of roughly 626,000 pounds of carbon dioxide, equivalent to nearly five times the lifetime emissions of the average U.S. car.


Above: PanGu-α creating poetry.

While the environmental impact of training PanGu-α is unclear, it’s likely that the model’s footprint is substantial — at least compared with language models a fraction of its size. As the coauthors of a recent MIT paper wrote, evidence suggests that deep learning is approaching computational limits. “We do not anticipate that the computational requirements implied by the targets … The hardware, environmental, and monetary costs would be prohibitive,” the researchers said. “Hitting this in an economical way will require more efficient hardware, more efficient algorithms, or other improvements such that the net impact is this large a gain.”

Antoniak says that it’s an open question as to whether larger models are the right approach in natural language. While the best performance scores on tasks currently come from large datasets and models, whether the pattern of dumping enormous amounts of data into models will pay off is uncertain. “The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets,” she said. “These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate.”

Future directions

The PanGu-α team’s choices aside, they might not have long to set standards that address the language model’s potential impact on society. A paper published by researchers from OpenAI and Stanford University found that large language model developers like Huawei, OpenAI, and others may only have a six- to nine-month advantage until others can reproduce their work. EleutherAI, a community of machine learning researchers and data scientists, expects to release an open source implementation of GPT-3 in August.

The coauthors of the OpenAI and Stanford paper suggest ways to address the negative consequences of large language models, such as enacting laws that require companies to acknowledge when text is generated by AI — perhaps along the lines of California’s bot law. Other recommendations include:

  • Training a separate model that acts as a filter for content generated by a language model
  • Deploying a suite of bias tests to run models through before allowing people to use the model
  • Avoiding some specific use cases

The consequences of failing to take any of these steps could be catastrophic over the long term. In recent research, the Middlebury Institute of International Studies’ Center on Terrorism, Extremism, and Counterterrorism claims that GPT-3 could reliably generate “informational” and “influential” text that might radicalize people into violent far-right extremist ideologies and behaviors. And toxic language models deployed into production might struggle to understand aspects of minority languages and dialects. This could force people using the models to switch to “white-aligned English,” for example, to ensure that the models work better for them, which could discourage minority speakers from engaging with the models to begin with.

Given Huawei’s ties with the Chinese government, there’s also a concern that models like PanGu-α could be used to discriminate against marginalized peoples including Uyghurs living in China. A Washington Post report revealed that Huawei tested facial recognition software that could send automated “Uighur alarms” to government authorities when its camera systems identified members of the minority group.

We’ve reached out to Huawei for comment and will update this article once we hear back.

“With PanGu-α, like with GPT-3, there are risks of memorization, biases, and toxicity in the outputs,” Antoniak said. “This suggests that perhaps we should try to better model how humans learn language.”


VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact. Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:

  • up-to-date information on the subjects of interest to you
  • our newsletters
  • gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
  • networking features, and more




Build a scalable machine learning pipeline for ultra-high resolution medical images using Amazon SageMaker




Neural networks have proven effective at solving complex computer vision tasks such as object detection, image similarity, and classification. With the evolution of low-cost GPUs, the computational cost of building and deploying a neural network has drastically reduced. However, most techniques are designed to handle pixel resolutions commonly found in visual media. For example, typical resolution sizes are 544 and 416 pixels for YOLOv3, 300 and 512 pixels for SSD, and 224 pixels for VGG. Training a classifier over a dataset consisting of gigapixel images (10^9+ pixels) such as satellite or digital pathology images is computationally challenging. These images cannot be directly input into a neural network because each GPU is limited by available memory. This requires specific preprocessing techniques such as tiling to be able to process the original images in smaller chunks. Furthermore, due to the large size of these images, the overall training time tends to be high, often requiring several days or weeks without the use of proper scaling techniques such as distributed training.

In this post, we explain how to build a highly scalable machine learning (ML) pipeline to fulfill three objectives:


In this post, we use a dataset consisting of whole-slide digital pathology images obtained from The Cancer Genome Atlas (TCGA) to accurately and automatically classify them as LUAD (adenocarcinoma), LUSC (squamous cell carcinoma), or normal lung tissue, where LUAD and LUSC are the two most prevalent subtypes of lung cancer. The dataset is available for public use by NIH and NCI.

The raw high-resolution images are in SVS format. SVS files are used for archiving and analyzing Aperio microscope images. You can apply the techniques and tools used in this post to any ultra high-resolution image dataset, including satellite images.

The following is a sample image of a tissue slide. This single image contains over a quarter of a billion pixels, and occupies over 750 MB of memory. This image cannot be fed directly to a neural network in its original form, so we must tile the image into many smaller images.

The following are samples of tiled images generated after preprocessing the preceding tissue slide image. These RGB 3-channel images are of size 512×512 and can be directly used as inputs to a neural network. Each of these tiled images is assigned the same label as the parent slide. Additionally, tiled images with more than 50% background are discarded.
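The tiling rule described above can be sketched in a few lines. This is a minimal illustration rather than the post's processing script: `tile_origins`, `background_fraction`, and `keep_tile` are hypothetical names, and treating grayscale values of 220 or more as background is an assumption made only for the example.

```python
from typing import Iterator, Sequence, Tuple

TILE = 512  # tile edge in pixels, as in the post


def tile_origins(width: int, height: int, tile: int = TILE) -> Iterator[Tuple[int, int]]:
    """Yield the top-left (x, y) corner of every full, non-overlapping tile."""
    for y in range(0, height - tile + 1, tile):
        for x in range(0, width - tile + 1, tile):
            yield x, y


def background_fraction(gray_pixels: Sequence[int], white_level: int = 220) -> float:
    """Fraction of pixels at or above white_level (assumed to be empty glass)."""
    if not gray_pixels:
        return 1.0
    return sum(1 for p in gray_pixels if p >= white_level) / len(gray_pixels)


def keep_tile(gray_pixels: Sequence[int], max_background: float = 0.5) -> bool:
    """Discard tiles whose background fraction exceeds 50%."""
    return background_fraction(gray_pixels) <= max_background


# A 1024x1100 slide region yields four full tiles; the partial bottom strip is dropped.
origins = list(tile_origins(1024, 1100))
print(origins)
```

Each kept tile would then inherit the label of its parent slide, as described above.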

Architecture overview

The following figure shows the overall end-to-end architecture, from the original raw images to inference. First, we use SageMaker Processing to tile, zoom, and sort the images into train and test splits, and then package them into the necessary number of shards for distributed SageMaker training. Second, a SageMaker training job loads the Docker container from Amazon Elastic Container Registry (Amazon ECR). The job uses Pipe mode to read the data from the prepared shards of images, trains the model, and stores the final model artifact in Amazon Simple Storage Service (Amazon S3). Finally, we deploy the trained model on a real-time inference endpoint that loads the appropriate Docker container (from Amazon ECR) and model (from Amazon S3) to process inference requests with low latency.

Data preprocessing using SageMaker Processing

The SVS slide images are preprocessed in three steps:

  • Tiling images – The images are tiled by non-overlapping 512×512 pixel windows, and tiles containing over 50% background are discarded. The tiles are stored as JPEG images.
  • Converting images to TFRecords – We use SageMaker Pipe mode to reduce our training time, which requires the data to be available in a proto-buffer format. TFRecord is a popular proto-buffer format used for training models with TensorFlow. We explain SageMaker Pipe mode and proto-buffer format in more detail in the following section.
  • Sorting TFRecords – We sort the dataset into test, train, and validation cohorts for a three-way classifier (LUAD/LUSC/Normal). The TCGA dataset can have multiple slide images corresponding to a single patient. We need to make sure all the tiles generated from slides corresponding to the same patient occupy the same split to avoid data leakage. For the test set, we create a per-slide TFRecord containing all the tiles from that slide so that we can evaluate the model in the way it will be used in deployment.
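The patient-level split in the last step can be sketched as follows. This is an illustration under stated assumptions, not the post's code: `patient_id` and `split_by_patient` are hypothetical helpers, the 70/15/15 fractions are arbitrary, and the patient is assumed to be identified by the first 12 characters of the slide name (the usual TCGA barcode layout).

```python
import random
from collections import defaultdict


def patient_id(slide_name: str) -> str:
    # Assumption: slide names start with a 12-character TCGA participant ID,
    # for example "TCGA-05-4244".
    return slide_name[:12]


def split_by_patient(slide_names, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole patients (never individual slides) to train/valid/test."""
    by_patient = defaultdict(list)
    for name in slide_names:
        by_patient[patient_id(name)].append(name)

    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # seeded for a reproducible split

    n_train = int(len(patients) * fractions[0])
    n_valid = int(len(patients) * fractions[1])
    groups = {
        "train": patients[:n_train],
        "valid": patients[n_train:n_train + n_valid],
        "test": patients[n_train + n_valid:],
    }
    return {split: [s for p in members for s in by_patient[p]]
            for split, members in groups.items()}


# Ten hypothetical patients with two slides each.
slides = [f"TCGA-05-{i:04d}-01Z-00-DX{k}.svs" for i in range(10) for k in (1, 2)]
splits = split_by_patient(slides)
```

Because the split is made over patients, every tile cut from any of a patient's slides lands in the same cohort.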

The following is the preprocessing code:

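import os
import random

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf


def generate_tf_records(base_folder, input_files, output_file, n_image, slide=None):
    """Write n_image randomly sampled 512x512 RGB tiles to one TFRecord file."""
    record_file = output_file
    count = n_image
    with tf.io.TFRecordWriter(record_file) as writer:
        while count:
            # input_files is a list of (filename, label) pairs
            filename, label = random.choice(input_files)
            temp_img = plt.imread(os.path.join(base_folder, filename))
            # Skip tiles that are not full-size RGB images
            if temp_img.shape != (512, 512, 3):
                continue
            count -= 1
            image_string = np.float32(temp_img).tobytes()
            slide_string = slide.encode('utf-8') if slide else None
            # image_example is a helper defined elsewhere in the processing script
            tf_example = image_example(image_string, label, slide_string)
            writer.write(tf_example.SerializeToString())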

We use SageMaker Processing for the preceding preprocessing steps, which allows us to run data preprocessing or postprocessing, feature engineering, data validation, and model evaluation workloads with SageMaker. Processing jobs accept data from Amazon S3 as input and store processed output data back into Amazon S3.

A benefit of using SageMaker Processing is the ease of distributing inputs across multiple compute instances. We can simply set the s3_data_distribution_type='ShardedByS3Key' parameter to divide the data equally among all processing containers.

Importantly, the number of processing instances matches the number of GPUs we will use for distributed training with Horovod (i.e., 16). The reasoning becomes clearer when we introduce Horovod training.

The processing script is available on GitHub.

from sagemaker import get_execution_role
from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

# image_name is the URI of the processing container image, defined elsewhere
processor = Processor(image_uri=image_name,
                      role=get_execution_role(),
                      instance_count=16,                # run the job on 16 instances
                      base_job_name='processing-base',  # should be unique name
                      instance_type='ml.m5.4xlarge',
                      volume_size_in_gb=1000)

processor.run(inputs=[ProcessingInput(
                  source=f's3://<bucket_name>/tcga-svs',       # s3 input prefix
                  s3_data_type='S3Prefix',
                  s3_input_mode='File',
                  s3_data_distribution_type='ShardedByS3Key',  # Split the data across instances
                  destination='/opt/ml/processing/input')],    # local path on the container
              outputs=[ProcessingOutput(
                  source='/opt/ml/processing/output',          # local output path on the container
                  destination=f's3://<bucket_name>/tcga-svs-tfrecords/'  # output s3 location
              )],
              arguments=['10000'],  # number of tiled images per TF record for training dataset
              wait=True,
              logs=True)

Distributed model training using SageMaker Training

Taking ML models from conceptualization to production is typically complex and time-consuming. We have to manage large amounts of data to train the model, choose the best algorithm for training it, manage the compute capacity while training it, and then deploy the model into a production environment. SageMaker reduces this complexity by making it much easier to build and deploy ML models. It manages the underlying infrastructure to train your model at petabyte scale and deploy it to production.

After we preprocess the whole-slide images, we still have hundreds of gigabytes of data. Training on a single instance (GPU or CPU) would take several days or weeks to finish. To speed things up, we need to distribute the workload of training a model across multiple instances. For this post, we focus on distributed deep learning based on data parallelism using Horovod, a distributed training framework, and SageMaker Pipe mode.

Horovod: A cross-platform distributed training framework

When training a model with a large amount of data, the data needs to be distributed across multiple CPUs or GPUs on either a single instance or multiple instances. Deep learning frameworks provide their own methods to support distributed training. Horovod is a popular framework-agnostic toolkit for distributed deep learning. It utilizes an allreduce algorithm for fast distributed training (compared with a parameter server approach) and includes multiple optimization methods to make distributed training faster. For more examples of distributed training with Horovod on SageMaker, see Multi-GPU and distributed training using Horovod in Amazon SageMaker Pipe mode and Reducing training time with Apache MXNet and Horovod on Amazon SageMaker.
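To make the allreduce idea concrete, the sketch below shows only its result: every worker ends up holding the same averaged gradient, with no central parameter server involved. It is a naive single-process illustration with a hypothetical function name; it does not reproduce the communication pattern Horovod actually uses between GPUs.

```python
def allreduce_average(per_worker_gradients):
    """Naive allreduce: every worker receives the element-wise mean gradient."""
    n_workers = len(per_worker_gradients)
    # Reduce: sum each gradient component across workers.
    summed = [sum(values) for values in zip(*per_worker_gradients)]
    mean = [total / n_workers for total in summed]
    # Broadcast: each worker gets its own copy of the averaged gradient.
    return [list(mean) for _ in range(n_workers)]


# Two workers, each with a two-component gradient from its own data shard.
averaged = allreduce_average([[1.0, 2.0], [3.0, 4.0]])
print(averaged)
```

With data parallelism, each worker then applies the same averaged gradient, so the model replicas stay in sync.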

SageMaker Pipe mode

You can provide input to SageMaker in either File mode or Pipe mode. In File mode, the input files are copied to the training instance. With Pipe mode, the dataset is streamed directly to your training instances. This means that the training jobs start sooner, compute and download can happen in parallel, and less disk space is required. Therefore, we recommend Pipe mode for large datasets.

SageMaker Pipe mode requires data to be in a protocol buffer format. Protocol buffers are language-neutral, platform-neutral, extensible mechanisms for serializing structured data. TFRecord is a popular proto-buffer format used for training models with TensorFlow. TFRecords are optimized for use with TensorFlow in multiple ways. First, they make it easy to combine multiple datasets and integrate seamlessly with the data import and preprocessing functionality provided by the library. Second, you can store sequence data—for instance, a time series or word encodings—in a way that allows for very efficient and (from a coding perspective) convenient import of this type of data.

The following diagram illustrates data access with Pipe mode.

Data sharding with SageMaker Pipe mode

You should keep in mind a few considerations when working with SageMaker Pipe mode and Horovod:

  • The data that is streamed through each pipe is mutually exclusive of the other pipes. The number of pipes dictates the number of data shards that need to be created.
  • Horovod wraps the training script for each compute instance. This means that data for each compute instance needs to be from a different shard.
  • With the SageMaker Training parameter S3DataDistributionType set to ShardedByS3Key, we can share a pipe with more than one instance. The data is streamed in round-robin fashion across instances.

To illustrate this better, let’s say we use two instances (A and B) of type ml.p3.8xlarge. Each ml.p3.8xlarge instance has four GPUs. We create four pipes (P1, P2, P3, and P4) and set S3DataDistributionType = 'ShardedByS3Key'. As shown in the following table, each pipe equally distributes the data between two instances in a round-robin fashion. This is the core concept needed in setting up pipes with Horovod. Because Horovod wraps the training script for each GPU, we need to create as many pipes as there are GPUs per training instance.
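The two-instance, four-pipe scenario can be simulated in a few lines. This is a simplified model of the behavior described above, not SageMaker's implementation: `assign_files` is a hypothetical helper, the file names are made up, and each pipe is assumed to hold exactly two TFRecords so that every GPU receives one.

```python
def assign_files(pipe_files, n_instances):
    """Return {(instance, pipe): files}, dealing each pipe's files round-robin."""
    return {
        (instance, pipe): files[instance::n_instances]
        for pipe, files in pipe_files.items()
        for instance in range(n_instances)
    }


# Two instances (A=0, B=1) with four GPUs each; four pipes, two TFRecords per pipe.
pipe_files = {
    f"P{p + 1}": [f"train_sharded/{p}/part-{k}.tfrecord" for k in range(2)]
    for p in range(4)
}
assignment = assign_files(pipe_files, n_instances=2)

for (instance, pipe), files in sorted(assignment.items()):
    print(f"instance {'AB'[instance]}, GPU reading {pipe}: {files}")
```

In this model, the GPU that reads pipe P1 on instance A and the GPU that reads P1 on instance B see disjoint files, which is why the number of pipes needs to match the number of GPUs per instance rather than the total GPU count.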

The following code shards the data in Amazon S3 for each pipe. Each shard should have a separate prefix in Amazon S3.

import boto3
import sagemaker

# Define distributed training hyperparameters
train_instance_type = 'ml.p3.8xlarge'
train_instance_count = 4
gpus_per_host = 4
num_of_shards = gpus_per_host * train_instance_count

distributions = {'mpi': {
    'enabled': True,
    'processes_per_host': gpus_per_host
    }
}

# Sharding
bucket = '<bucket_name>'  # bucket name only, without the s3:// scheme
client = boto3.client('s3')
s3 = boto3.resource('s3')
result = client.list_objects(Bucket=bucket, Prefix='tcga-svs-tfrecords/train/', Delimiter='/')

j = -1
for i in range(num_of_shards):
    copy_source = {
        'Bucket': bucket,
        'Key': result['Contents'][i]['Key']
    }
    print(result['Contents'][i]['Key'])
    # Start a new shard prefix every gpus_per_host files
    if i % gpus_per_host == 0:
        j += 1
    dest = 'tcga-svs-tfrecords/train_sharded/' + str(j) + '/' + result['Contents'][i]['Key'].split('/')[2]
    print(dest)
    s3.meta.client.copy(copy_source, bucket, dest)

# Define inputs to SageMaker estimator
svs_tf_sharded = f's3://<bucket_name>/tcga-svs-tfrecords'
shuffle_config = sagemaker.session.ShuffleConfig(234)
train_s3_uri_prefix = svs_tf_sharded
remote_inputs = {}

for idx in range(gpus_per_host):
    train_s3_uri = f'{train_s3_uri_prefix}/train_sharded/{idx}/'
    # s3_input is the SageMaker Python SDK input class (renamed TrainingInput in SDK v2)
    train_s3_input = s3_input(train_s3_uri, distribution='ShardedByS3Key', shuffle_config=shuffle_config)
    remote_inputs[f'train_{idx}'] = train_s3_input
    remote_inputs['valid_{}'.format(idx)] = '{}/valid'.format(svs_tf_sharded)
remote_inputs['test'] = '{}/test'.format(svs_tf_sharded)

We use a SageMaker estimator to launch training on four instances of ml.p3.8xlarge. Each instance has four GPUs. Thus, there are a total of 16 GPUs. See the following code:

from sagemaker.tensorflow import TensorFlow

local_hyperparameters = {'epochs': 5, 'batch-size': 16, 'num-train': 160000, 'num-val': 8192, 'num-test': 8192}

estimator_dist = TensorFlow(base_job_name='svs-horovod-cloud-pipe',
                            entry_point='src/',
                            role=role,
                            framework_version='2.1.0',
                            py_version='py3',
                            distribution=distributions,
                            volume_size=1024,
                            hyperparameters=local_hyperparameters,
                            output_path=f's3://<bucket_name>/output/',
                            instance_count=4,
                            instance_type=train_instance_type,
                            input_mode='Pipe')

# Launch training on the sharded Pipe-mode inputs defined earlier
estimator_dist.fit(remote_inputs, wait=True)

The following code snippet of the training script shows how to orchestrate Horovod with TensorFlow for distributed training:

import logging

import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint

mpi = False
hvd = None  # stays None unless MPI is enabled below
if 'sagemaker_mpi_enabled' in args.fw_params:
    if args.fw_params['sagemaker_mpi_enabled']:
        import horovod.keras as hvd
        mpi = True
        # Horovod: initialize Horovod.
        hvd.init()

        # Pin GPU to be used to process local rank (one GPU per process)
        gpus = tf.config.experimental.list_physical_devices('GPU')
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

callbacks = []
if mpi:
    # Broadcast initial weights from rank 0 and average metrics across workers
    callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0))
    callbacks.append(hvd.callbacks.MetricAverageCallback())
    # Only rank 0 writes checkpoints so that workers do not overwrite each other
    if hvd.rank() == 0:
        callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.ckpt',
                                         save_weights_only=True,
                                         verbose=2))
else:
    callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.ckpt',
                                     save_weights_only=True,
                                     verbose=2))

train_dataset = train_input_fn(hvd, mpi)
valid_dataset = valid_input_fn(hvd, mpi)
test_dataset = test_input_fn()
model = model_def(args.learning_rate, mpi, hvd)
logging.info("Starting training")
size = 1
if mpi:
    size = hvd.size()
# Each worker runs 1/size of the steps, because the data is split across workers
model.fit(train_dataset,
          steps_per_epoch=((args.num_train // args.batch_size) // size),
          epochs=args.epochs,
          validation_data=valid_dataset,
          validation_steps=((args.num_val // args.batch_size) // size),
          callbacks=callbacks,
          verbose=2)

Because Pipe mode streams the data to each of our instances, the training script cannot calculate the data size during training (which is needed to compute steps_per_epoch). The parameter is therefore provided manually as a hyperparameter to the TensorFlow estimator. Additionally, the number of data points must be specified so that it can be divided equally amongst the GPUs. An unequal division could lead to a Horovod deadlock, because the time taken by each GPU to complete the training process is no longer identical. To ensure that the data points are equally divided, we use the same number of instances for preprocessing as the number of GPUs for training. In our example, this number is 16.
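As a worked example of that arithmetic, the snippet below plugs in the hyperparameters used in this post (num-train of 160,000, num-val of 8,192, a batch size of 16, and 16 GPUs). The helper name `steps_per_worker` is hypothetical; the formula mirrors the one passed to the training call in the script.

```python
def steps_per_worker(num_examples, batch_size, num_workers):
    """Steps each Horovod worker runs per epoch when data is split evenly."""
    return (num_examples // batch_size) // num_workers


train_steps = steps_per_worker(num_examples=160000, batch_size=16, num_workers=16)
valid_steps = steps_per_worker(num_examples=8192, batch_size=16, num_workers=16)

print(train_steps, valid_steps)

# Both counts divide evenly, so no worker is left with a partial share.
assert 160000 % (16 * 16) == 0
assert 8192 % (16 * 16) == 0
```

Each GPU therefore runs 625 training steps and 32 validation steps per epoch. The training count would also match 16 processing instances each writing 10,000 tiles per TFRecord, as configured in the processing job.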

Inference and deployment

After we train the model using SageMaker, we deploy it for inference on new images. To set up a persistent endpoint to get one prediction at a time, use SageMaker hosting services. To get predictions for an entire dataset, use SageMaker batch transform.

In this post, we deploy the trained model as a SageMaker endpoint. The following code deploys the model to an m4 instance, reads tiled image data from TFRecords, and generates a slide-level prediction:

# Generate predictor object from trained model
# Generate predictor object from trained model
predictor = estimator_dist.deploy(initial_instance_count=1,
                                  instance_type='ml.m4.xlarge')

# Tile-level prediction
raw_image_dataset = tf.data.TFRecordDataset(f'images/{local_file}')  # read a TFrecord
# parse_image_function is a placeholder name for the TFRecord parsing function,
# which is not shown in this excerpt
parsed_image_dataset = raw_image_dataset.map(parse_image_function)  # Parse TFrecord to JPEGs

pred_scores_list = []
for i, element in enumerate(parsed_image_dataset):
    image = element[0].numpy()
    label = element[1].numpy()
    slide = element[2].numpy().decode()
    if i == 0:
        print(f"Making tile-level predictions for slide: {slide}...")

    print(f"Querying endpoint for a prediction for tile {i+1}...")
    pred_scores = predictor.predict(np.expand_dims(image, axis=0))['predictions'][0]
    pred_class = np.argmax(pred_scores)

    # Plot every tenth tile with its predicted class
    if i > 0 and i % 10 == 0:
        plt.figure()
        plt.title(f'Tile {i} prediction: {pred_class}')
        plt.imshow(image / 255)

    pred_scores_list.append(pred_scores)
print("Done.")

# Slide-level prediction (average score over all tiles)
mean_pred_scores = np.mean(np.vstack(pred_scores_list), axis=0)
mean_pred_class = np.argmax(mean_pred_scores)
print(f"Slide-level prediction for {slide}:", mean_pred_class)

The model is trained on individual tile images. During inference, the SageMaker endpoint provides classification scores for each tile. These scores are averaged out across all tiles to generate the slide-level score and prediction. The following diagram illustrates this workflow.

A majority vote scheme would also be appropriate.
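The two aggregation schemes can give different answers, which the following sketch illustrates on made-up scores. The function names and the three-tile example are hypothetical; only the averaging rule corresponds to what the deployment code in this post does.

```python
from collections import Counter


def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


def mean_score_prediction(tile_scores):
    """Average class scores over all tiles, then pick the best class."""
    n_tiles = len(tile_scores)
    mean_scores = [sum(column) / n_tiles for column in zip(*tile_scores)]
    return argmax(mean_scores)


def majority_vote_prediction(tile_scores):
    """Let each tile vote for its own best class, then pick the most common vote."""
    votes = [argmax(scores) for scores in tile_scores]
    return Counter(votes).most_common(1)[0][0]


# Three tiles, three classes: one very confident tile versus two mildly confident ones.
tile_scores = [
    [0.9, 0.1, 0.0],
    [0.4, 0.6, 0.0],
    [0.4, 0.6, 0.0],
]
print(mean_score_prediction(tile_scores), majority_vote_prediction(tile_scores))
```

Averaging lets a single highly confident tile outweigh several uncertain ones, while majority voting treats every tile equally, so the choice depends on how much weight tile-level confidence should carry.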

To perform inference on a large new batch of slide images, you can run a batch transform job for offline predictions on the dataset in Amazon S3 on multiple instances. Once the processed TFRecords are retrieved from Amazon S3, you can replicate the preceding steps to generate a slide-level classification for each of the new images.


In this post, we introduced a scalable machine learning pipeline for ultra high-resolution images that uses SageMaker Processing, SageMaker Pipe mode, and Horovod. The pipeline simplifies the convoluted process of large-scale training of a classifier over a dataset consisting of images that approach the gigapixel scale. With SageMaker and Horovod, we eased the process by distributing inputs across multiple compute instances, which reduces training time. We also provided a simple but effective strategy to aggregate tile-level predictions to produce slide-level inference.

For more information about SageMaker, see Build, train, and deploy a machine learning model with Amazon SageMaker. For the complete example to run on SageMaker, in which Pipe mode and Horovod are applied together, see the GitHub repo.


  1. Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L. Moreira, Narges Razavian, Aristotelis Tsirigos. “Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning”. Nature Medicine, 2018; DOI: 10.1038/s41591-018-0177-5

About the Authors

Karan Sindwani is a Data Scientist at Amazon Machine Learning Solutions where he builds and deploys deep learning models. He specializes in the area of computer vision. In his spare time, he enjoys hiking.

Vinay Hanumaiah is a Deep Learning Architect at Amazon ML Solutions Lab, where he helps customers build AI and ML solutions to accelerate their business challenges. Prior to this, he contributed to the launch of AWS DeepLens and Amazon Personalize. In his spare time, he enjoys time with his family and is an avid rock climber.

Ryan Brand is a Data Scientist in the Amazon Machine Learning Solutions Lab. He has specific experience in applying machine learning to problems in healthcare and the life sciences, and in his free time he enjoys reading history and science fiction.

Tatsuya Arai, Ph.D. is a biomedical engineer turned deep learning data scientist on the Amazon Machine Learning Solutions Lab team. He believes in the true democratization of AI and that the power of AI shouldn’t be exclusive to computer scientists or mathematicians.



Apply to join Transform’s annual Tech Showcase





VentureBeat’s annual Tech Showcase returns at Transform 2021: Accelerating Enterprise Transformation with AI and Data, hosted July 12-16.

VentureBeat will be selecting 5 companies to present the latest and greatest AI and data products on the main stage during the first day of Transform 2021.

We are looking for companies that have built the coolest products and solutions leveraging bleeding edge technologies to help businesses achieve real and tangible results using AI and data. Whether you are a stealth startup or Fortune 500, or anywhere in between, we welcome your submission.

If you have a story to tell, and an AI or Data product/solution with tangible business results and demonstrative use cases, please submit your application here before 5pm PST on Tuesday, June 1.

Selected companies will present on Transform’s main stage in front of hundreds of industry decision makers and will be featured in our on-demand video-hub following the event.

Attendees from across the globe will join online to hear from top industry experts on strategy and technology in the main application areas of AI/ML automation technology, data, analytics, intelligent automation, conversational AI, intelligent AI assistants, AI at the edge, IoT, and computer vision. Executives across industries are invited to join Transform 2021, so register today to join VentureBeat for five days of AI and data.


Contact with any questions regarding your application.





TikTok removes 500k+ accounts in Italy after DPA order to block underage users




Video sharing social network TikTok has removed more than 500,000 accounts in Italy following an intervention by the country’s data protection watchdog earlier this year ordering it to recheck the age of all Italian users and block access to any under the age of 13.

Between February 9 and April 21 more than 12.5M Italian users were asked to confirm that they are over 13 years old, according to the regulator.

Online age verification remains a hard problem and it’s not clear how many of the removed accounts definitively belonged to under 13s. The regulator said today that TikTok removed over 500k users because they were “likely” to be under the age of 13: around 400,000 because they declared an age under 13 and 140,000 through what the DPA describes as “a combination of moderation and reporting tools” implemented within the app.

TikTok has also agreed to take a series of additional measures to strengthen its ability to detect and block underage users — including potentially developing AI tools to help it identify when children are using the service.

Reached for comment, TikTok sent us a statement confirming that it is trialling “additional measures to help ensure that only users aged 13 or over are able to use TikTok”.

Here’s the statement, which TikTok attributed to Alexandra Evans, its head of child safety in Europe:

“TikTok’s top priority is protecting the privacy and safety of our users, and in particular our younger users. Following continued engagement with the Garante, we will be trialling additional measures to help ensure that only users aged 13 or over are able to use TikTok.

“We already take industry-leading steps to promote youth safety on TikTok such as setting accounts to private by default for users aged under 16 and enabling parents to link their account to their teen’s through Family Pairing. There is no finish line when it comes to safety, and we continue to evaluate and improve our policies, processes and systems, and consult with external experts.”

Italy’s data protection regulator made an emergency intervention in January — ordering TikTok to recheck the age of all users and block any users whose age it could not verify. The action followed reports in local media about a 10-year-old girl from Palermo who died of asphyxiation after participating in a “blackout challenge” on the social network.

Among the beefed up measures TikTok has agreed to take is a commitment to act faster to remove underage users — with the Italian DPA saying the platform has guaranteed it will cancel reported accounts it verifies as belonging to under 13s within 48 hours.

The regulator said TikTok has also committed to “study and develop” solutions — which may include the use of artificial intelligence — to “minimize the risk of children under 13 using the service”.

TikTok has also agreed to launch ad campaigns, both in app and through radio and newspapers in Italy, to raise awareness about safe use of the platform and get the message out that it is not suitable for under-13s, including targeting this messaging in a language and format that’s likely to engage underage users themselves.

The social network has also agreed to share information with the regulator relating to the effectiveness of the various experimental measures — to work with the regulator to identify the best ways of keeping underage users off the service.

The DPA said it will continue to monitor TikTok’s compliance with its commitments.

Prior to the Garante’s action, TikTok’s age verification checks had been widely criticized as trivially easy for kids to circumvent, with children merely needing to input a false birth date suggesting they were older than 13 to get past the age gate and access the service.

A wider investigation that the DPA opened into TikTok’s handling and processing of children’s data last year remains ongoing.

The regulator announced it had begun proceedings against the platform in December 2020, following months of investigation, saying then that it believed TikTok was not complying with EU data protection rules which set stringent requirements for processing children’s data.

In January the Garante also called for the European Data Protection Board to set up an EU taskforce to investigate concerns about the risks of children’s use of the platform — highlighting similar concerns being raised by other agencies in Europe and the U.S.

In February the European consumer rights organization, BEUC, also filed a series of complaints against TikTok, including in relation to its handling of kids’ data.

Earlier this year TikTok announced plans to bring in outside experts in the region to help with content moderation and said it would open a ‘transparency’ center in Europe where outside experts could get information on its content, security and privacy policies.



New book cites 4 forces to create disruption: Tech, policy, business models and social dynamics




The new book “Future Tech” asserts that this disruption is a good predictor of what’s coming next. For true change the forces must combine.

Future Tech by Trond Arne Undheim. Image: Kogan Page

According to Trond Arne Undheim, Ph.D., futurist and author, the combination of four forces (technology, policy, business models and social dynamics) is what creates industry disruption. An understanding of these phenomena, he said, is predictive of the future.


Undheim, the former director of MIT Startup Exchange, examines the results of disruption in his latest book “Future Tech: How to Capture Value from Disruptive Industry Trends.” Undheim noted that technology is often considered the single force for market disruption, but added that one only has to look at historical evidence to find tech stories that did not meet hyped expectations and failed. Undheim said that true change is only possible through the combination of science and technology, policy and regulation, new business models (such as the sharing economy), and social dynamics (whether or not people adopt it).

“Future Tech was written out of the realization that despite massive evidence to the contrary, the common understanding is that technology drives society when it is more of an interplay, especially for succeeding with innovation long term,” Undheim said. “It was part of my strategy to get the maximum out of learning from failure.” He cited his previous work, “Disruption Games: How To Thrive On Serial Failure,” where he pointed out that failing slowly and painfully is the best path to deep wisdom.

His most recent book, however, “is a lighter take on the subject, drilling in on very basic, but fundamental forces of change in our contemporary world. I’m deeply engaged in understanding the future but in order to do that well, you have to have a framework in place. Otherwise, you end up just inductively looking at any random signal out there, thinking it is more than a fad or trend.”


He describes his seven years at MIT, where he also taught at MIT’s Sloan School of Management, as “a 24/7 learning machine.” In “Future Tech,” Undheim asserted where disruptive forces will come from: “The next decade will be dominated by platform technologies such as AI, blockchain, robotics, synthetic biology and 3D printing.”

His discovery of the four forces, tech, policy, business models and social dynamics, “became the backbone of my work to try to structure the world’s information. Four is a simple enough concept that everyone can remember. Underlying each is a set of key dimensions, which I cover in the book. The point is to move beyond simply considering one force—technology. My work in government—as well as in standardization, which is important both to government and business—has sensitized me to the role that regulation plays in facilitating innovation. My work at MIT Sloan School of Management convinced me that business models no longer are a theoretical exercise carried out by management school professors; it literally makes or breaks businesses overnight. Finally, social dynamics is the correction that people provide on anything that is shiny. ‘Does it work for me? Will I share it with my peer group? Will I create a social movement to support or fight it?’ Society, ultimately, is a constellation made possible by social dynamics, meaning individuals acting in consort, forming teams, groups, networks and societies.”

As an example of the four forces at work, Undheim said, “AI is enabling analytics on epidemiologic patterns, drug discovery, advanced tracking of the supply chain and more, algorithms of some form will eventually facilitate nearly every human decision,” and added, “but perhaps not replace it.” 

Blockchain empowers “the next generation of peer-to-peer economic exchange in myriad fields, allowing this activity to occur without a middleman, thus encouraging the kind of distributed activity we need to foster in a post-centralized model where humans need to spread out to avoid contagion.” Despite the fact that blockchain is generally considered a financial solution, Undheim said it “is more akin to an exchange principle” and could empower human interactions and provide some sort of trackable ledger. 

In the time of a pandemic, it’s particularly relevant that “synthetic biology enables a much more efficient way to produce vaccines in the future and potentially could speed up clinical trials,” he said, and noted that “3D printing is enabling distributed production and consumption of manufactured items without going through a vulnerable, physical supply chain, even printing metals and now wood. Imagine what this might mean for sustainability if you can print physical goods using all kinds of recycled material, including organic.”

Undheim concluded: “The goal with all of my work is not only to help executives and policy makers better understand change but also to profit from it and make the world a better place.”

He’s looking toward the big picture: “I’m always going for magnitudes of impact, not just incremental improvements. We don’t have much time. I reckon we have about 500 years left on this planet, if we are lucky. This is the subject of a book I’m hopefully coming out with in 2022.”
