Copyright Group Takes Down Dutch Language AI Dataset (aol.com)

Dutch-based copyright enforcement group BREIN has taken down a large language dataset that was being offered for use in training AI models, the organization said on Tuesday. From a report: The dataset included information collected without permission from tens of thousands of books, news sites, and Dutch language subtitles harvested from "countless" films and TV series, BREIN said in a statement. Director Bastiaan van Ramshorst told Reuters it was not clear whether or how widely the dataset may already have been used by AI companies. "It's very difficult to know, but we are trying to be on time" to avoid future lawsuits, he said. He said the European Union's AI Act will require AI firms to disclose what datasets they have used to train their models.
Comments:
  • Off topic... but the link is to AOL.com, and I'm more shocked by that than by the article itself.
    • by ebunga ( 95613 )

      Before they were bought by Verizon in 2015, AOL was a multi-billion-dollar company, more than a decade after AOL classic had stopped being an important product. They are now owned by Yahoo, which is also still a multi-billion-dollar company with over 10,000 employees. Oh, and apparently AOL desktop is still a thing, and it costs $7/mo.

  • Now they need to track down any LLMs that may have been contaminated with the compromised training material, take those down, and inform any users of derivative works that they need to put up appropriate disclaimers.
    • by cowdung ( 702933 )

      In court they will argue that LLMs are transformative, not derivative. But I guess we'll have to wait and see how this plays out.

      • by cowdung ( 702933 )

        The dataset may violate copyright, but the LLM may not even if trained on the dataset.

      • Well, they will go to Dutch courts, and I don't know if that distinction matters there!

      • It's interesting to imagine a world in which LLMs are ruled to be derivative, and yet the copyright law we know stays unchanged.

        This would mean that illegally-trained LLMs can use a larger dataset and deliver much better results than legal ones, opening the door to all sorts of abuse. People would illegally mod their own voice assistants, search engines, and image generators to get better results. Startups would be tempted to violate the law and secretly use an illegal model on their servers to gain customers.
        • That's an interesting world. Also, police SWAT teams would break down doors and shoot or arrest people running illegal LLMs, electricity companies would monitor potentially illegal power surges, and the mob would offer protection rackets for LLM businesses.
        • Well, we have a similar situation with food. We track where it was grown, restrict certain practices, monitor storage and transportation conditions, and crack down on problematic products. We'll have to make sure any use of LLMs is documented, declared, and can be verified by independent sources, even if they have to sign NDAs for access. It's all a combination of LLM model, settings, seeds, and prompts. If you can't reproduce a result, you're doing something fishy. (A rough sketch of what such a record could look like follows the thread below.)
      • European copyright law does not draw such distinctions, especially not with the AI angle; copyright violation there is judged on intent and profit motives. They will consider damages and guilt primarily based on the size, mission, and origin of the company.
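
As a rough illustration of the "model, settings, seeds and prompts" point in the thread above, here is a minimal sketch of recording everything needed to re-run and verify a generation. It assumes the Hugging Face transformers library with greedy decoding; the model name and prompt are placeholders, not references to any real audited system.

    # Sketch: tie an LLM output to the model, settings, seed, and prompt that produced it,
    # so an independent party could re-run the generation and check the result.
    # Assumes the Hugging Face transformers library; MODEL_NAME is a placeholder.
    import hashlib
    import json

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "some-org/some-dutch-llm"   # placeholder, not a real checkpoint
    PROMPT = "Vat dit artikel samen:"        # example prompt ("Summarize this article:")
    SEED = 1234

    def generate_with_provenance(model_name: str, prompt: str, seed: int) -> dict:
        """Generate text and return it with the settings needed to reproduce it."""
        torch.manual_seed(seed)  # fix RNG so any sampling is repeatable
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)

        settings = {"max_new_tokens": 64, "do_sample": False}  # greedy decoding is deterministic
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, **settings)
        text = tokenizer.decode(output_ids[0], skip_special_tokens=True)

        # Record everything an auditor would need to re-run and verify the result.
        provenance = {"model": model_name, "seed": seed, "prompt": prompt,
                      "settings": settings, "output": text}
        provenance["digest"] = hashlib.sha256(
            json.dumps(provenance, sort_keys=True).encode()).hexdigest()
        return provenance

    if __name__ == "__main__":
        print(json.dumps(generate_with_provenance(MODEL_NAME, PROMPT, SEED), indent=2))

Given the same model weights, settings, seed, and prompt, the recorded output (and its digest) should match on a re-run; a mismatch would be the "something fishy" the comment describes.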
