Essential Functionalities to Guide you While using AWS Glue and PySpark!

Introduction

In this post, I have written up AWS Glue and PySpark functionalities that can be helpful when you are building an AWS data pipeline and writing AWS Glue PySpark scripts.

AWS Glue is a fully managed extract, transform, and load (ETL) service for processing large datasets from various sources for analytics and data processing.

While creating an AWS Glue job, you can select between Spark, Spark Streaming, and Python shell. These jobs can run a proposed script generated by AWS Glue, an existing script that you provide, or a new script that you author. You can also select different monitoring options, job execution capacity, timeouts, delayed notification thresholds, and non-overridable and overridable parameters.


AWS recently launched Glue version 2.0, which features 10x faster Spark ETL job start times and reduces the minimum billing duration from 10 minutes to 1 minute.

With AWS Glue, you can create a development endpoint and configure SageMaker or Zeppelin notebooks to develop and test your Glue ETL scripts.


I created a SageMaker notebook connected to the development endpoint to author and test the ETL scripts. Depending on the language you are comfortable with, you can spin up a notebook with the corresponding kernel.


Now, let's talk about some specific features and functionalities in AWS Glue and PySpark that can be helpful.

1. Spark DataFrames

A Spark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database. You can create a DataFrame from an RDD or from file formats like CSV, JSON, and Parquet.

With the SageMaker Sparkmagic (PySpark) kernel notebook, the Spark session is created automatically.


To create a DataFrame:

# from CSV files
S3_IN = "s3://mybucket/train/training.csv"

csv_df = (
    spark.read.format("org.apache.spark.csv")
    .option("header", True)
    .option("quote", '"')
    .option("escape", '"')
    .option("inferSchema", True)
    .option("ignoreLeadingWhiteSpace", True)
    .option("ignoreTrailingWhiteSpace", True)
    .csv(S3_IN, multiLine=False)
)

# from PARQUET files
S3_PARQUET = "s3://mybucket/folder1/dt=2020-08-24-19-28/"
df = spark.read.parquet(S3_PARQUET)

# from JSON files
df = spark.read.json(S3_JSON)

# from multiline JSON files
df = spark.read.json(S3_JSON, multiLine=True)

2. GlueContext

GlueContext is the entry point for reading and writing DynamicFrames in AWS Glue. It wraps the Apache Spark SQL SQLContext object, providing mechanisms for interacting with the Apache Spark platform.

from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())


3. DynamicFrame

AWS Glue DynamicFrames are similar to Spark SQL DataFrames. A DynamicFrame represents a distributed collection of data without requiring you to specify a schema up front. It can also be used to read and transform data that contains inconsistent values and types.

A DynamicFrame can be created using the following methods:

  • create_dynamic_frame_from_rdd — created from an Apache Spark Resilient Distributed Dataset (RDD); see the sketch after this list
  • create_dynamic_frame_from_catalog — created using a Glue catalog database and table name
  • create_dynamic_frame_from_options — created with the specified connection and format. Example — The connection type, such as Amazon S3, Amazon Redshift, and JDBC
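
As a minimal sketch of the first method (assuming a SparkContext sc and the glueContext created above; the RDD contents and frame name are made-up examples):

from pyspark.sql import Row

rdd = sc.parallelize([
    Row(id=1, event="create"),
    Row(id=2, event="update")])

# name labels the resulting DynamicFrame
dyf = glueContext.create_dynamic_frame_from_rdd(rdd, name="dyf")
dyf.printSchema()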

DynamicFrames can be converted to and from DataFrames using toDF() and fromDF(). Use the following syntax:

# create DynamicFrame from S3 parquet files
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [S3_location]},
    format="parquet",
    transformation_ctx="datasource0")

# create DynamicFrame from the Glue catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="demo",
    table_name="testtable",
    transformation_ctx="datasource0")

# convert to Spark DataFrame
df1 = datasource0.toDF()

# convert to Glue DynamicFrame
df2 = DynamicFrame.fromDF(df1, glueContext, "df2")

You can read more about this here.

4. AWS Glue Job Bookmark

AWS Glue job bookmarks help process incremental data when rerunning a job on a scheduled interval, preventing reprocessing of old data.
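
As a minimal sketch of how a bookmark-enabled job is typically structured (assuming the job was created with the --job-bookmark-option parameter set to job-bookmark-enable; the database and table names reuse the earlier placeholder examples): bookmark state is loaded by job.init() and saved by job.commit(), and each source needs a transformation_ctx so Glue can track what has already been processed.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args['JOB_NAME'], args)  # loads the bookmark state

# transformation_ctx ties this source to the bookmark state
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="demo",
    table_name="testtable",
    transformation_ctx="datasource0")

# ... transforms and writes go here ...

job.commit()  # saves the bookmark state for the next run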

You can read more about this here. Also, you can read this.

5. Write out data

The DynamicFrame of the transformed dataset can be written out to S3 as non-partitioned (default) or partitioned. The "partitionKeys" parameter can be specified in connection_options to write the data out to S3 as partitioned. AWS Glue organizes these datasets in Hive-style partitions.

In the following code example, the AWS Glue DynamicFrame is partitioned by year, month, day, and hour, and written in Parquet format in Hive-style partitions to S3, producing paths like:

s3://bucket_name/table_name/year=2020/month=7/day=13/hour=14/part-000-671c.c000.snappy.parquet

S3_location = "s3://bucket_name/table_name"

datasink = glueContext.write_dynamic_frame_from_options(
    frame=data,
    connection_type="s3",
    connection_options={
        "path": S3_location,
        "partitionKeys": ["year", "month", "day", "hour"]
    },
    format="parquet",
    transformation_ctx="datasink")

You can read more about this here.

6. “glueparquet” format option

glueparquet is a performance-optimized Apache Parquet writer type for writing DynamicFrames. It computes and modifies the schema dynamically.

datasink = glueContext.write_dynamic_frame_from_options(
    frame=dynamicframe,
    connection_type="s3",
    connection_options={
        "path": S3_location,
        "partitionKeys": ["year", "month", "day", "hour"]
    },
    format="glueparquet",
    format_options={"compression": "snappy"},
    transformation_ctx="datasink")

You can read more about this here.

7. S3 Lister and other options for optimizing memory management

AWS Glue provides an optimized mechanism for listing files on S3 while reading data into a DynamicFrame. It can be enabled by setting "useS3ListImplementation" to true in the additional_options parameter.
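
A minimal sketch (reusing the placeholder catalog database and table from the earlier examples):

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="demo",
    table_name="testtable",
    # stream the S3 listing instead of building it all in memory
    additional_options={"useS3ListImplementation": True},
    transformation_ctx="datasource0")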

You can read more about this here.

8. Purge S3 path

purge_s3_path is a nice option available to delete files from a specified S3 path recursively, based on a retention period or other available filters. As an example, suppose you are running an AWS Glue job to fully refresh a table each day, writing the data to S3 with the naming convention s3://bucket-name/table-name/dt=<date-time>. Using the Glue job itself, you can then delete the dt=<date-time> S3 folders that fall outside the defined retention period. Another option is to set an S3 bucket lifecycle policy with the prefix.

# purge locations older than 3 days (retentionPeriod is in hours)
print("Attempting to purge S3 path with retention set to 3 days.")
glueContext.purge_s3_path(
    s3_path=output_loc,
    options={"retentionPeriod": 72})

You also have other options like purge_table, transition_table, and transition_s3_path available. The transition_table option transitions the storage class of the files stored on Amazon S3 for the specified catalog database and table.
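
As a hedged sketch of transition_s3_path (the target storage class, account ID, and role ARN below are placeholder assumptions; the AWS docs list accountId and roleArn as required for transitions):

glueContext.transition_s3_path(
    s3_path=output_loc,
    transition_to="GLACIER",  # target S3 storage class
    options={
        "retentionPeriod": 72,  # only files older than 72 hours
        "accountId": "123456789012",  # placeholder AWS account ID
        "roleArn": "arn:aws:iam::123456789012:role/glue-transition-role"  # placeholder
    })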

You can read more about this here.

9. Relationalize Class

The Relationalize class can help flatten nested JSON in a DynamicFrame: it flattens nested structs into top-level columns and pivots out array columns into separate frames, returning a collection of DynamicFrames.
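
A minimal sketch (the staging path, which Relationalize uses for intermediate output, and the frame name are assumptions here). Relationalize.apply returns a DynamicFrameCollection, from which individual frames are selected by name:

from awsglue.transforms import Relationalize

# returns a DynamicFrameCollection of flattened tables
dfc = Relationalize.apply(
    frame=datasource0,
    staging_path="s3://mybucket/glue-temp/",  # placeholder temp location
    name="root",
    transformation_ctx="relationalize")

root_dyf = dfc.select("root")  # the flattened top-level table
root_dyf.printSchema()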

You can read more about this here.

10. Unbox Class

The Unbox class unboxes a string field in a DynamicFrame into the specified format type, such as JSON.
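
A minimal sketch (the column name "jsonString" is a placeholder assumption): Unbox parses the string at the given path into the specified format.

from awsglue.transforms import Unbox

# parse the JSON text stored in the "jsonString" column into a struct
unboxed = Unbox.apply(
    frame=datasource0,
    path="jsonString",  # placeholder column holding JSON text
    format="json")
unboxed.printSchema()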

You can read more about this here.

11. Unnest Class

The Unnest class flattens nested objects to top-level elements in a DynamicFrame.

root
|-- id: string
|-- type: string
|-- content: map
| |-- keyType: string
| |-- valueType: string

With the content attribute/column being a map type, we can use the Unnest class to unnest each key element into its own top-level column.

unnested = UnnestFrame.apply(frame=data_dynamic_dframe)
unnested.printSchema()

root
|-- id: string
|-- type: string
|-- content.dateLastUpdated: string
|-- content.creator: string
|-- content.dateCreated: string
|-- content.title: string

12. printSchema()

To print the schema of a Spark DataFrame or Glue DynamicFrame in tree format, use printSchema().

datasource0.printSchema()

root
|-- ID: int
|-- Name: string
|-- Identity: string
|-- Alignment: string
|-- EyeColor: string
|-- HairColor: string
|-- Gender: string
|-- Status: string
|-- Appearances: int
|-- FirstAppearance: choice
| |-- int
| |-- long
| |-- string
|-- Year: int
|-- Universe: string

13. Fields Selection

select_fields can be used to select fields from a Glue DynamicFrame.

# From DynamicFrame
datasource0.select_fields(["Status", "HairColor"]).toDF().distinct().show()


To select fields from a Spark DataFrame, use "select":

# From DataFrame
datasource0_df.select(["Status", "HairColor"]).distinct().show()


14. Timestamp

For instance, suppose an application writes data into DynamoDB and has a last_updated attribute/column. DynamoDB does not natively support a date/timestamp data type, so you could store it either as a string or as a number. When stored as a number, it is usually done as epoch time: the number of seconds since 00:00:00 UTC on 1 January 1970. You would see something like "1598331963", which is 2020-08-25T05:06:03+00:00 in ISO 8601.

You can read more about Timestamp here.

How can you convert it to a timestamp?

When you read the data using an AWS Glue DynamicFrame and view the schema, it will show the field as the "long" data type.

root
|-- version: string
|-- item_id: string
|-- status: string
|-- event_type: string
|-- last_updated: long

To convert the last_updated long value into a timestamp data type, you can use the following code:

import pyspark.sql.functions as f
import pyspark.sql.types as t

new_df = (
    df
    .withColumn(
        "last_updated",
        # from_unixtime expects seconds; divide by 1000 when the
        # value is stored in milliseconds
        f.from_unixtime(f.col("last_updated") / 1000).cast(t.TimestampType()))
)

15. Temporary View from Spark DataFrame

In case you want to store the Spark DataFrame as a table and query it using Spark SQL, you can convert the DataFrame into a temporary view that is available only within that Spark session using createOrReplaceTempView.

df = spark.createDataFrame(
    [
        (1, ['a', 'b', 'c'], 90.00),
        (2, ['x', 'y'], 99.99),
    ],
    ['id', 'event', 'score'])

df.printSchema()

root
|-- id: long (nullable = true)
|-- event: array (nullable = true)
| |-- element: string (containsNull = true)
|-- score: double (nullable = true)

df.createOrReplaceTempView("example")
spark.sql("select * from example").show()

+---+---------+-----+
| id| event|score|
+---+---------+-----+
| 1|[a, b, c]| 90.0|
| 2| [x, y]|99.99|
+---+---------+-----+

16. Extract element from ArrayType

Suppose from the above example, you want to create a new attribute/column to store only the last event. How would you do it?

You can use the element_at function. It returns the element of the array at the given index if the column is an array, and the value for the given key if the column is a map. Negative indexing is supported, so index -1 returns the last element.

from pyspark.sql.functions import element_at

newdf = df.withColumn("last_event", element_at("event", -1))
newdf.printSchema()

root
|-- id: long (nullable = true)
|-- event: array (nullable = true)
| |-- element: string (containsNull = true)
|-- score: double (nullable = true)
|-- last_event: string (nullable = true)

newdf.show()
+---+---------+-----+----------+
| id| event|score|last_event|
+---+---------+-----+----------+
| 1|[a, b, c]| 90.0| c|
| 2| [x, y]|99.99| y|
+---+---------+-----+----------+

17. explode

The explode function in PySpark is used to explode an array or map column into rows. For example, let's explode the "event" column from the above example:

from pyspark.sql.functions import explode

df1 = df.select(df.id, explode(df.event))
df1.printSchema()

root
|-- id: long (nullable = true)
|-- col: string (nullable = true)

df1.show()
+---+---+
| id|col|
+---+---+
| 1| a|
| 1| b|
| 1| c|
| 2| x|
| 2| y|
+---+---+

18. getField

In a struct type column, if you want to get a field by name, you can use getField. The following is its syntax:

import pyspark.sql.functions as f
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(attributes=Row(Name='scott', Height=6.0, Hair='black')),
    Row(attributes=Row(Name='kevin', Height=6.1, Hair='brown'))
])

df.printSchema()

root
|-- attributes: struct (nullable = true)
| |-- Hair: string (nullable = true)
| |-- Height: double (nullable = true)
| |-- Name: string (nullable = true)

df.show()
+-------------------+
| attributes|
+-------------------+
|[black, 6.0, scott]|
|[brown, 6.1, kevin]|
+-------------------+

df1 = (
    df
    .withColumn("name", f.col("attributes").getField("Name"))
    .withColumn("height", f.col("attributes").getField("Height"))
    .drop("attributes")
)
df1.show()
+-----+------+
| name|height|
+-----+------+
|scott|   6.0|
|kevin|   6.1|
+-----+------+

19. startswith

In case you want to find records based on a string match, you can use "startswith".

In the following example, I am searching for all records where the value of the description column starts with "[{".

import pyspark.sql.functions as f

df.filter(f.col("description").startswith("[{")).show()

20. Extract year, month, day, hour

One of the common use cases is to write an AWS Glue DynamicFrame or Spark DataFrame to S3 in Hive-style partitions. To do so, you can extract the year, month, day, and hour and use them as partitionKeys when writing the DynamicFrame/DataFrame to S3.

import pyspark.sql.functions as f

df2 = (
    raw_df
    .withColumn('year', f.year(f.col('last_updated')))
    .withColumn('month', f.month(f.col('last_updated')))
    .withColumn('day', f.dayofmonth(f.col('last_updated')))
    .withColumn('hour', f.hour(f.col('last_updated')))
)

About the Author

Anand Prakash – 5x AWS Certified | 5x Oracle Certified

Avid learner of technology solutions around databases, big data, and machine learning.
Connect on Twitter @anandp86


Related Articles

Source: https://www.analyticsvidhya.com/blog/2020/08/essential-functionalities-to-guide-you-while-using-aws-glue-and-pyspark/


Allen Institute launches GENIE, a leaderboard for human-in-the-loop language model benchmarking


There’s been an explosion in recent years of natural language processing (NLP) datasets aimed at testing various AI capabilities. Many of these datasets have accompanying leaderboards, which provide a means of ranking and comparing models. But the adoption of leaderboards has thus far been limited to setups with automatic evaluation, like classification and knowledge retrieval. Open-ended tasks requiring natural language generation such as language translation, where there are often many correct solutions, lack techniques that can reliably automatically evaluate a model’s quality.

To remedy this, researchers at the Allen Institute for Artificial Intelligence, the Hebrew University of Jerusalem, and the University of Washington created GENIE, a leaderboard for human-in-the-loop evaluation of text generation. GENIE posts model predictions to a crowdsourcing platform (Amazon Mechanical Turk), where human annotators evaluate them according to predefined, dataset-specific guidelines for fluency, correctness, conciseness, and more. In addition, GENIE incorporates various automatic machine translation, question answering, summarization, and common-sense reasoning metrics including BLEU and ROUGE to show how well they correlate with the human assessment scores.

As the researchers note, human-evaluation leaderboards raise a couple of novel challenges, first and foremost potentially high crowdsourcing fees. To avoid deterring submissions from researchers with limited resources, GENIE aims to keep submission costs around $100, with initial submissions to be paid by academic groups. In the future, the coauthors plan to explore other payment models including requesting payment from tech companies while subsidizing the cost for smaller organizations.

To mitigate another potential issue — the reproducibility of human annotations over time across various annotators — the researchers use techniques including estimating annotator variance and spreading the annotations over several days. Experiments show that GENIE achieves “reliable scores” on the included tasks, they claim.

“[GENIE] standardizes high-quality human evaluation of generative tasks, which is currently done in a case-by-case manner with model developers using hard-to-compare approaches,” Daniel Khashabi, a lead developer on the GENIE project, explained in a Medium post. “It frees model developers from the burden of designing, building, and running crowdsourced human model evaluations. [It also] provides researchers interested in either human-computer interaction for human evaluation or in automatic metric creation with a central, updating hub of model submissions and associated human-annotated evaluations.”

The coauthors believe that the GENIE infrastructure, if widely adopted, could alleviate the evaluation burden for researchers while ensuring high-quality, standardized comparison against previous models. Moreover, they anticipate that GENIE will facilitate the study of human evaluation approaches, addressing challenges like annotator training, inter-annotator agreement, and reproducibility — all of which could be integrated into GENIE to compare against other evaluation metrics on past and future submissions.

“We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation,” the coauthors wrote in a paper describing their work. “This is a novel deviation from how text generation is currently evaluated, and we hope that GENIE contributes to further development of natural language generation technology.”


Source: https://venturebeat.com/2021/01/20/allen-institute-launches-genie-a-leaderboard-for-human-in-the-loop-language-model-benchmarking/


Swapp raises $7 million to automate construction planning with AI


Swapp, a company that leverages AI for construction planning, today announced that it raised $7 million in venture capital. The company plans to put the funds toward “continued market expansion” and growing its platform’s AI capabilities.

The construction industry and its broader ecosystem erects buildings, infrastructure, and industrial structures that form the foundation of whole economies. Private-equity firms raised more than $388 billion to fund infrastructure projects, including $100 billion in 2019 alone, a 24% increase from 2018. But construction, including various conception, architectural design, and engineering processes, requires consulting with experts including architects, engineers, and land surveyors.

Swapp, which former Autodesk Israel CEO Eitan Tsarfati cofounded in 2019, claims its AI-powered platform eliminates the need to work with outside experts by streamlining the construction planning phase. After uploading site, floor drawings, and requirements for the exterior or interior of a project, Swapp customers receive a selection of algorithmically generated planning options to maximize building efficiency and minimize construction costs.

Swapp’s product automates tasks like initial mass planning and analyzing architectural typologies, and it integrates with third-party geographic information platforms and different data sources from locations across the globe. All data relevant to a project is visualized in a dashboard that users can view on the web.

“Swapp’s AI solution is a game-changer in the field of real estate development and construction-planning,” Tsarfati said in a statement. “For the first time in the history of construction, real estate developers and construction companies can use a single platform to build their entire construction planning project and begin work within weeks instead of 9-12 months. We are already working … to replace the slow, tedious, and inflexible construction planning process with our smart, efficient, and flexible, planning solution. This investment will help us grow our customer base and expand our AI capabilities to advance the future of construction planning.”

Point72 Ventures and Entrée Capital led the seed round in Swapp, which has offices in Tel Aviv as well as London. “We believe Swapp has the ability to reinvent architecture by automating the entire construction planning process,” Daniel Gwak, partner at Point72 Ventures, said. “Swapp’s AI-powered platform is designed to help modernize real estate development by simplifying the slow and fragmented planning process, allowing developers to create a full set of architectural plans within weeks. We are pleased to support their continued growth.”


Source: https://venturebeat.com/2021/01/20/swapp-raises-7-million-to-automate-construction-planning-with-ai/


Researchers find conventional voice AI overlooks trans and nonbinary users


Voice-activated AI (VAI) is increasingly ubiquitous, whether in the form of conversational assistants or more generalized personal assistants like Alexa, Google Assistant, and Siri. Researchers to date have studied the social consequences of their design and deployment, with one focus being on the social implications of gendering VAIs. (Most VAIs have a feminine-sounding synthetic voice set by default.) A forthcoming study aims to advance previous work by undertaking a series of interviews with trans and/or nonbinary users of VAIs, a historically understudied group, to explore their experiences, wants, and needs. The coauthors say the results show that these needs are far more than improvements to representation and that users raise concerns around the framing of gender — even by well-intentioned developers.

As the researchers point out, voices are important elements of interaction to many trans people because of how deeply gendered different styles of speaking can be. An incongruence in voice can serve as a source of pain for a trans and/or nonbinary person’s sense of self. Moreover, voice can serve as a way in which people are identified as trans, potentially leading to discrimination and violence. A 2015 survey found that 46% of transgender people had experienced verbal harassment, 47% had been sexually assaulted, and 54% had experienced violence from a partner.

The researchers conducted interviews with a group of 15 participants recruited through a call to LGBTQ+ community centers along with LGBTQ+ Facebook groups based in two cities. They asked questions centered around core themes, specifically (1) the participants’ experiences of being represented by VAI, (2) their suggestions for VAI development, and (3) tensions with the expanding role of VAIs.

Thirteen out of the 15 participants were negative about the representativeness of VAIs' voices, with 11 stating that VAIs were not designed for them. They pushed back against the idea that gender should be treated as equivalent to "where your voice falls within a stereotypical range of pitch," but at the same time, they proposed alternative forms of representation like providing a wide spectrum of "ungendered" voice options.

The participants in the study also worried about how technologies like VAIs might amplify the hardships they experience on a daily basis. Several had mixed feelings about the idea of a system featuring trans representation in voice and gender-affirming design, expressing privacy concerns and a skepticism about developers’ ability to deliver on their promises. A single fixed voice aimed to represent trans and/or nonbinary people, the participants said, would be a reduction of the diversity within the community and perpetuate the notion that specific vocal ranges are linked to specific genders.

“Implicit in our participants’ understanding of representation is the importance of recognition: a way of affirming the humanity, moral agency and equality of each other — as individuals, and as communities,” the researchers wrote. “This approach chafes with our participants’ understanding of how VAI designers approach representation. They worry that designers’ understanding contains only a surface-level approach to trans needs, visible in their focus on developing gender-neutral voices.”

The researchers cite Q, a project spearheaded by media company Vice and its for-profit spinoff Virtue, as an example of poorly representative, exclusive design. While Q was billed as the "world's first nonbinary voice for technology" when it was announced in early 2019, the coauthors argue its design process — which partly entailed recording 6 people identifying as male, female, transgender, or nonbinary — "raises as many questions as it answers." The developers behind Q appear to treat trans voices as representing a "monolithic population," rather than those of men and women. Furthermore, by treating nonbinary voices as a "third gender option," Q risks reinforcing fixed, categorical ideas of gender, the researchers say.

To address these concerns and others, the coauthors recommend that VAIs at a minimum be designed with and for trans-specific privacy considerations, have features for trans-specific purposes, and be representative and gender-affirming. Most importantly, they say the development of VAI features must be grounded in a participatory process.

“The development of VAIs that explicitly ‘move beyond merely ‘allowing’ trans people to exist’ would help bridge disparities experienced by this population under structural cisnormativity by providing material improvements that enhance their … comfort in identity, selfhood, embodiment, sexuality,” the researchers said. “Considering constraints within the commercial contexts where these devices are currently developed, we suggest that researchers and technologists go further and work outside these structures towards developing grassroots VAIs driven by and accountable to trans communities, while employing strategies for disrupting the aggravation of digital privacy breaches through the use of voice analytics.”


Source: https://venturebeat.com/2021/01/20/researchers-find-conventional-voice-ai-overlooks-trans-and-nonbinary-users/


Microsoft researchers tap AI for anonymous data sharing for health care providers


The use of images to build diagnostic models of diseases has become an active research topic in the AI community. But capturing the patterns in a condition and an image requires exposing a model to a rich variety of medical cases. It’s well-known that images from a source can be biased by demographics, equipment, and means of acquisition, which means training a model on such images would cause it to perform poorly for other populations.

In search of a solution, researchers at Microsoft and the University of British Columbia developed a framework called Federated Learning with a Centralized Adversary (FELICIA). It extends a type of model called a generative adversarial network (GAN) to a federated learning environment using a "centralized adversary." The team says FELICIA could enable stakeholders like medical centers to collaborate with each other and improve models in a privacy-preserving, distributed data-sharing way.

GANs are two-part AI models consisting of a generator that creates samples and a discriminator that attempts to differentiate between the generated samples and real-world samples. As for federated learning, it entails training algorithms across decentralized devices holding data samples without exchanging those samples. Local algorithms are trained on local data samples and the weights, or learnable parameters of the algorithms, are exchanged between the algorithms at some frequency to generate a global model.

With FELICIA, the researchers propose duplicating the discriminator and generator architectures of a “base GAN” to other component generator-discriminator pairs. A privacy discriminator is selected to be nearly identical in design to the other discriminators, and most of the optimization effort is dedicated to training the base GAN on the whole training data to generate realistic — but synthetic — medical image scans.

In experiments, the researchers simulated two hospitals with different populations, considering both a "very restrictive" regulation preventing the sharing of images and models that had access to images. The team used a dataset of handwritten digits (MNIST) to see whether FELICIA could help generate high-quality synthetic data even when both data owners have biased coverage. They also sourced a more complex dataset (CIFAR10) to show how the utility could be significantly improved when a certain type of image was underrepresented in the data. And they tested FELICIA in a federated learning setting with medical imagery using a popular skin lesion image dataset.

According to the researchers, the results of the experiments show that FELICIA has potentially wide application in health care research settings. For example, it could be used to augment an image dataset to improve diagnostics, like the classification of cancer pathology images. “The data from one research center is often biased toward the dominating population of the available data for training. FELICIA could help mitigate bias by allowing sites from all over the world to create a synthetic dataset based on a more general population,” the researchers wrote in a paper describing their work.

In the future, the researchers plan to implement FELICIA with a GAN that can generate "highly complex" medical images, such as CT scans, X-rays, and histopathology slides, in real-world federated learning settings with "non-local" data owners.


Source: https://venturebeat.com/2021/01/20/microsofts-felicia-taps-ai-to-enable-health-providers-to-share-data-anonymously/
