Over 16,000 artists’ names have been linked with the non-consensual training of Midjourney’s image generation models.

The Midjourney artist database is attached as Exhibit J to an amended lawsuit filed against Stability AI, DeviantArt, and Midjourney, and also appears in a recently leaked public Google spreadsheet, part of which can be viewed on the Internet Archive here.

Artist Jon Lam shared screenshots on X from a Midjourney Discord chat where developers discuss using artist names and styles from Wikipedia and other sources.

The spreadsheet is believed to have originally been sourced from Midjourney’s development team and aligns with the leaked Discord chats from Midjourney developers, which allude to artists’ work being mapped to ‘styles.’

By encoding artists’ work as ‘styles,’ Midjourney can efficiently recreate work in their style. 

Lam writes, “Midjourney developers caught discussing laundering, and making a database of Artists (who’ve been dehumanized to styles).”

Lam also shared videos of lists of artists, including those used for Midjourney styles and another list of ‘proposed artists.’ Numerous X users stated their names were on these lists. 

One screenshot appears to show an announcement by Midjourney CEO David Holz celebrating the addition of 16,000 artists to the training program. 

Another shows a Midjourney developer saying you have to “launder it” through a “Codex,” though, without context, it’s tough to say whether that refers to artists’ work.

Others (not Midjourney employees) in that same conversation discuss how processing artwork through an AI model essentially detaches it from copyright.

One says, “all you have to do is just use those scraped datasets and conveniently forget what you used to train the model. Boom legal problems solved forever.”

How legal cases are developing

In legal cases filed against Midjourney, Stability AI, and also OpenAI, Meta, and Google (but for text-based work, rather than images), artists, writers, and others have found it difficult to prove their work is actually ‘inside’ the model verbatim.

That could be the smoking gun they need to prove copyright violations.  

Copyright, in general, remains poorly defined in the era of AI. AI models are trained on data that has to come from somewhere, and what better source for that data than the internet?

The developers ‘scrape’ what is termed ‘open,’ ‘open-source,’ or ‘public’ data from the internet, but again, these concepts are poorly defined. It’s probably fair to say that when AI developers smelled the coming gold rush, they seized as much ‘open’ data from the internet as they could and used it to train their models.

Legal processes are slow; AI moves at light speed by comparison. It was easy for developers to outflank copyright law and train models long before copyright holders and the law governing intellectual property could react.

The response is now underway, but both the AI training process and the technical process involved in generating AI outputs (e.g., text or images) from user inputs challenge the nature of intellectual property law.

Specifically, it’s a) hard to prove that AI models were definitely trained on copyrighted material and b) hard to prove that their outputs replicate copyrighted material closely enough.

There’s also the problem of accountability. AI companies like OpenAI and Midjourney at least partly used data harvested by others rather than harvesting it themselves. So, wouldn’t the original data scrapers be responsible for infringement?

In the context of this recent situation at Midjourney, Midjourney’s models, like others, will always reproduce a blend of the works contained within their training data. Artists can’t easily prove which of their pieces have been used. 

For example, when a recent copyright case against Midjourney, Stability AI, and DeviantArt was dismissed (it’s since been resubmitted with new plaintiffs), Federal Judge Orrick identified several defects in the way the claims were framed, particularly in their understanding of how AI image generators function. 

The original lawsuit alleged that Stability AI, in training its Stable Diffusion model, stored compressed copies of the images. 

Stability AI refuted this, clarifying that the training process involves extracting attributes such as lines, shades, and colours and developing parameters based on these attributes rather than storing copies of the images.
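The distinction Stability AI is drawing can be illustrated with a deliberately simplified toy sketch (this is not Stability AI’s actual training pipeline; the “images” and the averaging “model” here are invented for illustration): a trained model persists a small set of learned parameters summarizing attributes of the training data, not copies of the data itself.

```python
# Toy illustration only: a "model" that learns parameters from images
# rather than storing the images themselves.

# Hypothetical training "images": lists of pixel intensities.
images = [
    [0.1, 0.9, 0.2, 0.8],
    [0.2, 0.8, 0.1, 0.9],
    [0.0, 1.0, 0.3, 0.7],
]

# "Training" here just derives one parameter per pixel position
# (the mean intensity) -- attributes extracted from the data.
n = len(images)
params = [sum(img[i] for img in images) / n for i in range(len(images[0]))]

print(params)       # the learned parameters
print(len(params))  # 4 numbers, regardless of how many images were seen
# The images themselves are discarded; only the parameters persist,
# and no individual training image can be read back out of them.
```

Real diffusion models learn billions of parameters over billions of images, but the legal argument rests on the same point: what is stored after training is a parametric summary, not a compressed archive of the originals.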

Orrick’s ruling highlighted the need for the plaintiffs to amend their claims to more accurately represent how these AI models operate. 

This includes a need for a clearer explanation of whether the claim against Midjourney stems from its use of Stable Diffusion, its independent use of training images, or both (as Midjourney is also accused of using Stability AI’s models, which allegedly use copyrighted works). 

Another challenge for the plaintiffs is demonstrating that Midjourney’s outputs are substantially similar to their original artworks. Orrick noted that the plaintiffs themselves admitted that the output images from Stable Diffusion are unlikely to closely match any specific image in the training data. 

As of now, the case remains alive, with the court denying the AI companies’ most recent attempts to dismiss the artists’ claims. 

LAION dataset usage thrown into the mix

Legal cases filed against Midjourney and co. also emphasize their potential use of the LAION-5B dataset – a compilation of 5.85 billion internet-sourced images, including copyrighted content. 

Stanford recently blasted LAION for containing illicit sexual images, including child sexual abuse material and various sexist, racist, and otherwise deplorable content – all of which now also ‘lives’ inside the AI models that society is beginning to rely on for creative and professional uses. 

The long-term implications of this are hotly debated, but the fact that these AIs were potentially trained firstly on stolen work and secondly on illegal content doesn’t shed a positive light on AI development in general. 

The Midjourney developers’ comments have been widely lambasted on social media and the Y Combinator forum.

It’s very likely that 2024 will cook up more fiery legal debates, and the Wild West chapter of AI development may be coming to a close.


This article was originally published at dailyai.com