Tumblr and WordPress are reportedly set to strike deals to sell user data to artificial intelligence companies OpenAI and Midjourney. 404 Media reports that the platforms’ parent company, Automattic, is close to finalizing a deal to provide data to help train the companies’ AI models.

It’s unclear what data will be included, but the report suggests that Automattic may have gone overboard initially. A purported internal post by Tumblr product manager Cyle Gage suggests that Automattic was prepared to send private or partner-related data that shouldn’t have been included in the deal. The questionable content reportedly included private posts on public blogs, posts from deleted or suspended blogs, unanswered questions (which were therefore never publicly posted), private replies, posts flagged as explicit, and content from premium partner blogs (such as Apple’s former music site).

The internal post suggests that Automattic engineers are preparing a list of post IDs that should have been excluded. It is not clear if the data has already been sent to the AI companies.

Engadget emailed Automattic seeking comment on the report. The company responded with a published statement claiming, “We will only share public content that is hosted on WordPress.com and Tumblr from sites that have not opted out.” The statement noted that no law currently requires AI companies’ web crawlers to honor users’ opt-out preferences.

The last line of Automattic’s statement appears to line up with the reported deals. “We also work directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control,” Automattic wrote. “Our partnerships will respect all opt-out settings. We also plan to take this a step further and regularly inform all partners of people who have recently opted out and request that their content be removed from past sources and future training.”

OpenAI CEO Sam Altman (Mike Coppola via Getty Images)

The company is reportedly planning to release a new opt-out tool on Wednesday that it claims will let users block third parties, including artificial intelligence companies, from training on their data. 404 Media reviewed a purported internal FAQ that Automattic prepared for the tool, which included the response: “If you opt out from the start, we will block robots from accessing your content by blacklisting your site. If you change your mind later, we also plan to inform all partners of people who have recently opted out and request that their content be removed from past sources and future training.”

The wording about “requesting” that AI companies remove data may be telling.

A purported internal document from Automattic’s head of AI, Andrew Spittle, written in response to a staff question about data removal guarantees when using the tool, explains: “We will regularly notify existing partners of anyone who has opted out since the last time we provided a list. I want this to be an ongoing process where we regularly advocate for excluding past content based on current preferences. We will request that the content be deleted and removed from any future training. I believe the partners will respect that based on our conversations with them so far. I don’t think they gain much overall by keeping it.”

So if a Tumblr or WordPress user requests to opt out of AI training, Automattic will “ask” and “advocate” for their removal. And Automattic’s head of AI “believes” that AI companies will find it in their best interest to comply “based on our conversations.” (How comforting!)

AI data training deals have become a lucrative opportunity for websites treading water in today’s slippery landscape of online publishing. (Tumblr’s staff was reportedly reduced to a skeleton crew at the end of 2023.) Last week, Google struck a deal with Reddit (ahead of the latter’s IPO) to train on the platform’s vast knowledge base of user-generated content. Meanwhile, OpenAI launched a partnership program last year to collect third-party datasets to help train its AI models.

Update, February 27, 2024, 3:56 PM ET: This story has been updated to add a statement released by WordPress and Tumblr’s parent company Automattic.

https://www.engadget.com/tumblr-and-wordpress-posts-will-reportedly-be-used-for-openai-and-midjourney-training-204425798.html?src=rss