Massive AI training dataset removed after discovery of illegal content

A study by the Stanford Internet Observatory has revealed that LAION-5B, the largest openly available image dataset used to train AI image generation models, contains 3,226 entries suspected to point to child sexual abuse material (CSAM). In response, LAION has withdrawn the dataset from public access until it can ensure it is free of such content. The dataset, which consists of over 5.8 billion pairs of image URLs and corresponding captions scraped from the web via Common Crawl, is used to train models such as Stable Diffusion. The Stanford researchers filtered candidate images with NSFW classifiers and matched them against known material using perceptual hashing; the ‘definite matches’ were then validated by the Canadian Centre for Child Protection.
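
To make the matching step concrete, the sketch below shows the general idea behind perceptual-hash matching. The Stanford team used industry tools (such as PhotoDNA-style hashes) that are not publicly available, so this is only an illustration using the open-source `imagehash` library: similar images produce similar hashes, and a small Hamming distance to a known hash flags a likely match. The known-hash value, threshold, and file path here are placeholders, not details from the study.

```python
# Illustrative perceptual-hash matching; not the actual tooling used in the study.
from PIL import Image
import imagehash

# Hypothetical hashes of known flagged images, as would be supplied by a
# clearinghouse; the value below is a placeholder.
KNOWN_HASHES = [imagehash.hex_to_hash("d1c4b2a39e8f7061")]

MAX_DISTANCE = 5  # assumed threshold; real systems tune this carefully


def is_suspected_match(path: str) -> bool:
    """Return True if the image's perceptual hash is close to a known hash."""
    candidate = imagehash.phash(Image.open(path))
    # Subtracting two ImageHash objects gives their Hamming distance.
    return any(candidate - known <= MAX_DISTANCE for known in KNOWN_HASHES)


if __name__ == "__main__":
    print(is_suspected_match("example.jpg"))
```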

Following the publication of the study, Stability AI, the company behind Stable Diffusion, stated that it has internal filters in place to remove CSAM and other illegal and offensive material from its training data. Under US federal law, it is illegal to possess or transmit CSAM, including data that can be converted into such imagery. However, the legal status of datasets like LAION-5B, which contain only URLs rather than the actual images, is unclear. The study also highlights the difficulty of distinguishing AI-generated CSAM from real CSAM, and the potential impact of contaminated training data on generative AI models.
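
Stability AI has not published how its internal filters work, so the following is only a minimal sketch of the common pattern such pipelines use: records whose classifier-assigned "unsafe" probability exceeds a threshold are dropped before training. The record fields, score source, and cutoff value are all assumptions for illustration.

```python
# Hypothetical training-data filtering by a safety score; not Stability AI's pipeline.
from dataclasses import dataclass


@dataclass
class Record:
    url: str
    caption: str
    unsafe_prob: float  # assumed to come from an upstream NSFW classifier


UNSAFE_THRESHOLD = 0.1  # hypothetical cutoff; stricter is safer


def filter_records(records: list[Record]) -> list[Record]:
    """Keep only records the classifier considers very likely safe."""
    return [r for r in records if r.unsafe_prob < UNSAFE_THRESHOLD]


if __name__ == "__main__":
    sample = [
        Record("https://example.com/a.jpg", "a mountain lake", 0.02),
        Record("https://example.com/b.jpg", "flagged content", 0.93),
    ]
    print(len(filter_records(sample)))  # -> 1
```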

The proliferation of AI technology has raised difficult questions, and addressing them will require collaboration among lawmakers, law enforcement, the tech industry, academia, and the general public.
