CSAM found in large AI image-generator training dataset

A massive public dataset that served as training data for popular AI image generators including Stable Diffusion has been found to contain thousands of instances of child sexual abuse material (CSAM).

In a study published today, the Stanford Internet Observatory (SIO) said it pored over more than 32 million data points in the LAION-5B dataset and was able to validate, using the Microsoft-developed tool PhotoDNA, 1,008 CSAM images – some included multiple times. That number is likely “a significant undercount,” the researchers said in their paper.

LAION-5B doesn’t include the images themselves; it is instead a collection of metadata comprising a hash of the image identifier, a description, language data, whether it may be unsafe, and a URL pointing to the image. A number of the CSAM photos linked in LAION-5B were found hosted on websites like Reddit, Twitter, Blogspot, and WordPress, as well as on adult sites like XHamster and XVideos.
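
For illustration only, a single entry might look something like the sketch below; the field names and values are assumptions inferred from that description, not LAION's actual schema.

```python
# Hypothetical shape of one LAION-5B metadata record, inferred from the
# description above. Field names and values are illustrative only, not
# the dataset's real schema.
sample_record = {
    "sample_id": "d2a84f4b8b650937...",       # hash-style image identifier (truncated)
    "url": "https://example.com/photo.jpg",   # link to the externally hosted image
    "caption": "a short description of the image",  # scraped alt text / caption
    "language": "en",                         # detected caption language
    "nsfw": "UNSURE",                         # safety-classifier verdict, e.g. UNLIKELY / UNSURE / NSFW
}
```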

To find images in the dataset worth testing, SIO focused on images tagged by LAION’s safety classifier as “unsafe.” Those images were scanned with PhotoDNA to detect CSAM, and matches were sent to the Canadian Centre for Child Protection (C3P) to be verified.
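
A rough sketch of that filter-then-scan pipeline might look like the Python below. The parquet layout, the column names ("URL", "NSFW"), and the photodna_match helper are assumptions for illustration; PhotoDNA is a proprietary Microsoft hash-matching service, so the matching step is only a placeholder.

```python
import pandas as pd

def photodna_match(image_url: str) -> bool:
    """Placeholder for PhotoDNA hash matching.

    PhotoDNA is a proprietary Microsoft service; this stub only marks
    where a real integration would fetch the image and compare its
    perceptual hash against known-CSAM hash lists."""
    raise NotImplementedError("requires access to the PhotoDNA service")

# LAION metadata is distributed as parquet shards; the filename and the
# column names below are assumptions for illustration.
metadata = pd.read_parquet("laion5b-shard-00000.parquet")

# Step 1: narrow the candidate set to entries the safety classifier flagged.
candidates = metadata[metadata["NSFW"].isin(["NSFW", "UNSURE"])]

# Step 2: run each candidate URL through the hash matcher; confirmed
# matches would then go to C3P / NCMEC for verification and takedown.
suspected = [url for url in candidates["URL"] if photodna_match(url)]
```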

“Removal of the identified source material is currently in progress as researchers reported the image URLs to the National Center for Missing and Exploited Children (NCMEC) in the US and the C3P,” the SIO said.

LAION-5B was used to train the popular AI image generator Stable Diffusion, version 1.5 of which is well known in certain corners of the internet for its ability to create explicit images. While Stable Diffusion hasn’t been directly linked to cases like that of a child psychiatrist who used AI to generate pornographic images of minors, it’s that sort of technology that has made deepfake sextortion and other crimes easier.

According to the SIO, Stable Diffusion 1.5 remains popular online for generating explicit photos after “widespread dissatisfaction from the community” with the release of Stable Diffusion 2.0, which added filters to prevent unsafe images from slipping into the training dataset.

It’s unclear if Stability AI, which developed Stable Diffusion, knew about the presence of potential CSAM in its models due to the use of LAION-5B; the company didn’t respond to our questions.

Oops, they did it again

While it’s the first time German non-profit LAION’s AI training data has been accused of harboring child porn, the organization has caught flak for including questionable content in its training data before.

Google, which used a LAION-5B predecessor known as LAION-400M to train its Imagen AI generator, decided never to release the tool due to several concerns, including whether the LAION training data had helped it build a biased and problematic model.

According to the Imagen team, the generator showed “an overall bias towards generating images of people with lighter skin tones and … portraying different professions to align with Western gender stereotypes.” Modeling things other than humans didn’t improve the situation, causing Imagen to “encode a range of social and cultural biases when generating images of activities, events and objects.”

An audit of LAION-400M itself “uncovered a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes.”

A few months after Google decided to pass on making Imagen public, an artist discovered that medical images from a surgery she underwent in 2013 were present in LAION-5B, despite her never having given permission for them to be included.

LAION didn’t respond to our questions on the matter, but founder Christoph Schuhmann did tell Bloomberg earlier this year that he was unaware of any CSAM present in LAION-5B, while also admitting he “did not review the data in great depth.”

Coincidentally or not – the SIO study isn’t mentioned – LAION chose yesterday to introduce plans for “regular maintenance procedures,” beginning immediately, to remove “links in LAION datasets that still point to suspicious, potentially unlawful content on public internet.”

“LAION has a zero tolerance policy for illegal content,” the company said. “The public datasets will be temporarily taken down, to return back after update filtering.” LAION plans to return its datasets to the public in the second half of January. ®
