Angry Bots: What the next generation of AI will learn from training on the toxic pool of social media

Post a comment on Reddit, answer coding questions on Stack Overflow, edit a Wikipedia entry, or share a baby photo on your public Facebook or Instagram feed and you are also helping to train the next generation of artificial intelligence.

Not everyone is OK with that — especially as the same online forums where they’ve spent years contributing are increasingly flooded with AI-generated commentary mimicking what real humans might say.

Some longtime users have tried to delete their past contributions or rewrite them into gibberish, but the protests have not had much effect. A handful of governments — including Brazil’s privacy regulator in June — have also tried to step in.

“A more significant portion of the population just kind of feels helpless,” said Reddit volunteer moderator Sarah Gilbert, who also studies online communities at Cornell University. “There’s nowhere to go except just completely going offline or not contributing in ways that bring value to them and value to others.”

Platforms are responding — with mixed results. Take Stack Overflow, the popular hub for computer programming tips. First, it banned ChatGPT-written responses due to frequent errors, but now it’s partnering with AI chatbot developers and has punished some of its own users who tried to erase their past contributions in protest.

It is one of a number of social media platforms grappling with user wariness — and occasional revolts — as they try to adapt to the changes brought by generative AI.

Software developer Andy Rotering of Bloomington, Minnesota, has used Stack Overflow daily for 15 years and said he worries the company “could be inadvertently hurting its greatest resource” — the community of contributors who have donated time to help other programmers.

“Keeping contributors incentivized to provide commentary should be paramount,” he said.

Stack Overflow CEO Prashanth Chandrasekar said the company is trying to balance rising demand for instant chatbot-generated coding assistance with the desire for a community “knowledge base” where people still want to post and “get recognized” for what they’ve contributed.

“Fast forward five years — there’s going to be all sorts of machine-generated content on the web,” he said in an interview. “There’s going to be very few places where there’s truly authentic, original human thought. And we’re one of those places.”

Chandrasekar readily likens Stack Overflow’s challenges to one of the “case studies” he learned about at Harvard Business School, of how a business survives, or doesn’t, after a disruptive technological change.

For more than a decade, users typically landed on Stack Overflow after typing a coding question into Google, then found the answer and copied and pasted it. The answers they were most likely to see came from volunteers who’d built up points measuring their credibility, which in some cases could help land them a job.

Now programmers can simply ask an AI chatbot — some of which are already trained on everything ever posted to Stack Overflow — and it can instantly spit out an answer.

ChatGPT’s debut in late 2022 threatened to put Stack Overflow out of business. So Chandrasekar carved out a special 40-person team at the company to rush out the launch of its own specialized AI chatbot, called Overflow AI. Then, the company made deals with Google and ChatGPT maker OpenAI, enabling the AI developers to tap into Stack Overflow’s question-and-answer archive to further improve their large language models.

That kind of strategy makes sense but may have come too late, said Maria Roche, an assistant professor at Harvard Business School. “I’m surprised that Stack Overflow wasn’t working on this earlier,” she said.

When some Stack Overflow users tried to delete their past comments after the OpenAI partnership was announced, the company responded by suspending their accounts, citing terms that make all contributions “perpetually and irrevocably licensed to Stack Overflow.”

“We quickly addressed it and said, ‘Look, that’s not acceptable behavior’,” said Chandrasekar, describing the protesters as a small minority in the “low hundreds” of the platform’s 100 million users.

Brazil’s national data protection authority on Tuesday took action to ban social media giant Meta Platforms from training its AI models on the Facebook and Instagram posts of Brazilians. It established a daily fine of 50,000 reais ($8,820) for non-compliance.

Meta in a statement called it a “step backwards for innovation” and said it has been more transparent than many industry counterparts doing similar AI training on public content, and that its practices comply with Brazilian laws.

Meta has also encountered resistance in Europe, where it recently put on hold its plans to start feeding people’s public posts into training AI systems. In the U.S., where there is no national law protecting online privacy, such training is likely already happening.

“The vast majority of people just have no idea that their data is being used,” Gilbert said.

Reddit has taken a different approach — partnering with AI developers like OpenAI and Google while also making clear that content can’t be taken in bulk without the platform’s approval by commercial entities “with no regard for user rights or privacy.”

The deals helped bring Reddit the money it needed to debut on Wall Street in March, with investors pushing the value of the company close to $9 billion seconds after it began trading on the New York Stock Exchange.

Reddit has not tried to punish users who protested, nor could it easily do so given how much say volunteer moderators have over what happens in their specialty forums, known as subreddits. But what worries Gilbert, who helps moderate the “AskHistorians” subreddit, is the increasing flow of AI-generated commentary that moderators must decide whether to allow or ban.

“People come to Reddit because they want to talk to people, they don’t want to talk to bots,” Gilbert said. “There’s apps where they can talk to bots if they want to. But historically Reddit has been for connecting with humans.”

She said it is ironic that the AI-generated content threatening Reddit was sourced from the comments of millions of human Redditors, and “there’s a real risk that eventually it could end up pushing people out.”