
Stack Overflow Will Charge AI Giants For Training Data (wired.com) 31
"Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive," Stack Overflow's Chandrasekar says. "We're very supportive of Reddit's approach." Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that will also help future chatbots, which need "to be trained on something that's progressing knowledge forward. They need new knowledge to be created." But fencing off valuable data also could deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs.
Chandrasekar says that LLM developers are violating Stack Overflow's terms of service. Users own the content they post on Stack Overflow, as outlined in its TOS, but it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they "are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license," Chandrasekar says. Neither Stack Overflow nor Reddit has released pricing information. "Both Stack Overflow and Reddit will continue to license data for free to some people and companies," notes Wired. "Chandrasekar says Stack Overflow only wants remuneration only from companies developing LLMs for big, commercial purposes."
"When people start charging for products that are built on community-built sites like ours, that's where it's not fair use," he says.