Did you know that Meta trained its AI models on data from a piracy site? Here's what a lawsuit has revealed

Meta came under scrutiny earlier this year after unredacted court documents suggested that the tech giant trained its artificial intelligence models using Library Genesis (Libgen), a well-known repository of pirated books. The revelations emerged as part of a lawsuit filed by a group of authors who accuse Meta of copyright infringement.

As Wired reported in January, the case, Kadrey et al. v. Meta Platforms, is among the first legal battles to challenge AI training practices. Its outcome, together with that of other similar lawsuits, could have a significant impact on how companies use copyrighted content in their models and on whether such practices fall under the “fair use” doctrine.

Meta's attempt to hide information criticized

A major development in the case came when Judge Vince Chhabria of the US District Court for the Northern District of California ordered the release of previously redacted court filings. Chhabria criticized Meta's approach to hiding the information, calling it “absurd” and stating that the company was trying to “avoid negative publicity”, rather than protecting trade secrets.

One of the most notable disclosures was an internal message from a Meta employee who was apparently concerned about a potential backlash. “If there is media coverage that suggests that we have used a data set that we know is pirated, such as Libgen, this can undermine our negotiating position with the regulatory authorities on these issues,” the employee wrote. Meta declined to comment on the matter.

Taken to court by novelists

The lawsuit was initially filed in July 2023 by novelists Richard Kadrey and Christopher Golden, along with comedian Sarah Silverman. They claim that Meta used their copyrighted works without permission to train its models. Meta has argued that its use of publicly available materials is protected under the doctrine of fair use, on the grounds that analyzing text to model language does not constitute direct copyright infringement.

Before these documents were released, Meta had acknowledged that it trained its language model using Books3, a data set containing almost 200,000 books. However, the company had not previously disclosed any use of data from Libgen. The recently disclosed documents suggest that Meta's AI team was aware of the nature of the database, with an engineer noting that “torrenting from a (Meta-owned) corporate laptop does not feel right.”

Zuckerberg said to have been aware of the data set

Other allegations in the lawsuit claim that Meta’s management, including CEO Mark Zuckerberg, was aware of the origins of the database. Internal communications apparently referred to him as “MZ” when discussing decisions about the use of Libgen data. The plaintiffs claim that these exchanges demonstrate that Meta knew about the questionable legality of the data set.

Meta pushed back against the plaintiffs’ attempts to amend their complaint, calling the move “an eleventh-hour gambit based on a false and inflammatory premise.” The company claims that it disclosed its use of Libgen in July 2024 and that the plaintiffs had ample opportunity to adjust their claims before discovery ended in December.

Legal experts are following the case closely, because its resolution could set important precedents for AI training. While some claims in the lawsuit, such as alleged violations of the Digital Millennium Copyright Act, were dismissed in 2023 due to insufficient evidence, the plaintiffs argue that the newly disclosed documents provide grounds for reviving those claims. They also argue that Meta went beyond using the pirated content for training and actively distributed it, a process known as “seeding” in torrent networks.

Libgen, which originated in Russia in 2008, remains one of the largest shadow libraries worldwide. US courts have long tried to shut it down, with a New York judge ordering the platform to pay $30 million in damages in 2024. Despite these legal challenges, the site continues to operate through alternative domains.

As the case progresses, Judge Chhabria has warned Meta against further attempts to redact court records, cautioning that any future overreach could lead to the release of all related materials. “If Meta submits an unjustified sealing request again, all the materials will simply be unsealed,” he said.

The case could have major implications for the development of AI, for copyright law and for the way technology companies navigate intellectual property issues in the era of generative AI.