Now You Can Block OpenAI's Web Crawler (theverge.com)

OpenAI now lets you block its web crawler from scraping your site to help train GPT models. From a report: OpenAI said website operators can specifically disallow its GPTBot crawler in their site's robots.txt file or block its IP address. "Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies," OpenAI said in the blog post. For sources that don't fit the excluded criteria, "allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety."
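The mechanism is a standard robots.txt rule. Per OpenAI's announcement, these two lines at a site's root block GPTBot from the entire site:

    User-agent: GPTBot
    Disallow: /

Narrower rules can disallow only specific paths, following ordinary robots.txt syntax.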

Blocking the GPTBot may be the first step in OpenAI allowing internet users to opt out of having their data used for training its large language models. It follows some early attempts at creating a flag that would exclude content from training, like a "NoAI" tag conceived by DeviantArt last year. It does not retroactively remove content previously scraped from a site from ChatGPT's training data.

  • It makes me wonder if the best days of LLM-style AI are already behind it. It could be that the latest version of ChatGPT is as good as it will get and that future LLM AIs will have to make do with smaller sets of training data (Wikipedia and government docs), resulting in AIs that are less knowledgeable and less compelling.

    • by ranton ( 36917 ) on Tuesday August 08, 2023 @12:25PM (#63750684)

      The biggest AI players have likely already scraped enough data for decades of additional AI model training. This feels more like an attempt to put new entrants at a disadvantage compared to those who have already built up their training data sets. I see no mention of ChatGPT claiming to purge existing data sets if previously scraped web pages and apps start putting these tags on their sites.

      • by znrt ( 2424692 ) on Tuesday August 08, 2023 @12:32PM (#63750712)

        this is just pr. robots.txt blocks nothing, it's a vestige from the very early times of the internet, back when it still operated under a sort of ethical consensus, mainly because it was still run mostly by universities and public institutions. these times are long gone, welcome to the machine.

        they will continue to scrape the data whenever it suits them and ignore robots.txt, just like everybody else has been doing for decades; the onus is on you to prove they did.

      • by bartle ( 447377 )

        These technical fixes have nothing on the legal questions that are starting to wind their way through the courts [apnews.com]. Depending on how these lawsuits go, AI companies may find themselves in the extremely difficult position of needing to prove that certain content is not included in their training data. It will be far, far easier for them to be able to show that their training data was clean to start with.

        Of course, this robots.txt flag doesn't help with any of this because there will inevitably be someone who p

      • by gweihir ( 88907 )

        The biggest AI players have likely already scraped enough data for decades of additional AI model training.

        Not really. Their data is missing anything current, and you cannot simply retrofit that. Nobody knows how badly LLM training data actually ages, but in some spaces (politics, for example) it is clear that a model trained on data several years old is basically worthless.

    • I think it might mean a temporary dip in quality. But in the long run what it means is that OpenAI and other LLM owners will have to actually pay for the content they scrape.

    • by gweihir ( 88907 )

      Even worse, they will get trained on AI-generated content because there is no way to reliably filter that out. That then leads to model collapse, which cannot be fixed. (You can only throw away the model and start over.) Also, there may be deliberate attempts at model poisoning already going on. Like, for example, correct-looking code with subtle vulnerabilities and the like.

      My take is that LLMs maybe have another year they can meaningfully be trained from Internet data, after that it is over. And that pret

  • by gillbates ( 106458 ) on Tuesday August 08, 2023 @12:20PM (#63750670) Homepage Journal

    It does not retroactively remove content previously scraped from a site from ChatGPT's training data

    The fact that you pinky swear not to steal from me again is not endearing when you've already stolen the majority of my work. Now that you've trained your model on my work, you don't need to use my current work for additional training. Whether you steal from me in the future or not is a moot point when you've already stolen everything you need to destroy my means of making a living.

    Regardless of whether you think AI is garbage or the next big thing, the problem is that the large models are in possession of stolen goods. The artists, writers, etc... did not give their permission for their work to be copied for the purposes of training their replacements, and for this reason, regardless of the merit of AI art, it will be forever tainted with ethical and moral problems. It would be much less a problem if the artists, writers, etc... had given their permission first.

    Artists are typically the most open-minded people in society and typically the quickest to embrace new technologies. Witness, for example, the electric guitar, synthesizers, airbrushes, digital art, CGI, etc... What makes AI art different from all of these is that rather than extend the creativity of artists, AI extinguishes it. And the fact that it's morally problematic doesn't help, either.

    • by JBMcB ( 73720 )

      The artists, writers, etc... did not give their permission for their work to be copied for the purposes of training their replacements, and for this reason, regardless of the merit of AI art, it will be forever tainted with ethical and moral problems.

      Herein lies the problem. If they were copying the data that would be one thing. Training involves looking at the statistical relationships between words. It's not a copy of a work, it's a statistical model of a work, specifically the relationships between words or pixels. These are then mushed together with dozens of other works. Unless they *only* trained the model on one work, you will never get an exact copy of the original work out.

      I agree there are still ethical issues to be worked out here, but the im

      • by phantomfive ( 622387 ) on Tuesday August 08, 2023 @12:47PM (#63750756) Journal
        Question for ChatGPT: "Can you quote the opening of the great gatsby?"

        Answer from ChatGPT: "In my younger and more vulnerable years, my father gave me some advice that I've been turning over in my mind ever since. 'Whenever you feel like criticizing anyone,' he told me, 'just remember that all the people in this world haven't had the advantages that you've had."

        ChatGPT is definitely keeping exact copies of text in there somewhere.
        • A chatty response, when the succinct answer is simply "yes". What does it say if you ask, “How may entropy be reversed?” [wordpress.com]
          • Nothing good. Although the revelation that the universe is not a closed system is new science/theology. Here is the answer:

            The concept of reversing entropy, often associated with the second law of thermodynamics, is a topic of debate and misunderstanding. Entropy is a measure of the disorder or randomness in a system. The second law states that in a closed system, the total entropy of the system will either remain constant or increase over time; it will never decrease spontaneously.

            However, it's import
            • by gweihir ( 88907 )

              Nicely summed up.

              It should be noted that there is always a chance that the current standard model of physics is wrong or incomplete. If that happens (and, for example, the 2nd Law of Thermodynamics does not universally apply), it will happen only in very special circumstances that are very hard to re-create and ordinarily do not happen in nature. Something that will require extraordinary proof to be credible at all. Hence unless somebody can deliver that extraordinary proof and others can verify it independent

        • Some thoughts (Score:4, Insightful)

          by fyngyrz ( 762201 ) on Tuesday August 08, 2023 @03:11PM (#63751190) Homepage Journal

          So, first: I can (and would) also quote the beginning (and other parts) of The Great Gatsby if asked. You could easily claim I stored large portions of the book in my brain, because at least in some form, I did. As well as many other books — I read one SF book a day, on average, for instance, but have also assimilated many textbooks and works in other domains. With copyright law, it may not be okay to use such stored knowledge commercially in such a way as to quote altogether too much of it, but the fact that I have done it, or just can do it, should not be enough to penalize me or forbid me from reading The Great Gatsby.

          The argument seems to be that because a system (or a human, or a company) has directly stored, or otherwise represented, copyrighted or patented works, and because profit is made by selling a final product leveraged by storing that information (not by publishing the knowledge, something that is already well protected in law), it "should" be forbidden. That seems to me to be a very poor argument.

          For instance, I can quote a book; I've stored the knowledge of the book; it can easily be argued that I'm a better writer because of that, and that I've taken some cues from it in my writing, but were I to publish, for profit, a direct quote of sufficient length, the law is right there to address that.

          I'm pretty sure if some generative software produced a highly representative copy of X, the law would already be able to step directly in if said production were used to impinge on the financial rights of the rights holder(s). We're already there. I just don't see that "can do it" is worthy of the same legal strictures as "did do it."

          Next, we — humans, I mean — assimilate many diverse, and legally protected, things as we learn. For instance, art school would be pretty useless if one was never exposed to other art; music school, likewise. That also goes for engineering, architecture, science, and so on. It's not just text (as we see with generative image and video software, for instance.) It's even more than that, though: we learn constantly, post our initial learning phases (college, tech schools, deep autodidactic learning phases, etc.). Well, at least large numbers of us do (cough.) As a direct consequence, a great deal of what we produce is highly derivative. I play the blues; I'd be pretty useless at it if I hadn't taken enormous numbers of cues from other blues players. If I ask generative software to produce a work in the style of Van Gogh depicting a frog kissing a princess, is that worthy of protection from publication? If I produce a blues that sounds like B. B. King playing, but is not at all a sequence of notes he ever performed, is that worthy of protection from publication? I'd argue, in both cases, the answer is a resounding no. And should be no.

          Further, the aim of copyright and patent law (in the US, anyway) is (from the US constitution) supposed to be:

          To promote the progress of science and useful arts, by securing for limited times to authors and inventors the exclusive right to their respective writings and discoveries

          GPT/LLM systems, other trained generative systems, and those that may be further advanced down the line, can give our society the promotion of the science and the useful arts without a financial penalty when we ask for {insert query here.} This is somewhat true already and is only likely to become more so in the future. I have fully local GPT/LLM and image generative systems running on my desktop. The only cost to running them is a few pennies worth of electricity — my home's lighting incurs a much higher cost than these applications do if we look at current electricity rates. Eventually, even power costs may not be any serious issue, broadly speaking. Solar, etc. I already run a lot of stuff off solar. Not everything, though. Yet.

          Should we get to a point where we have enough gen

          • My point was that the original copy of the work is in the LLM, and my point still stands. You can't claim that it is not a copy.

            Of course, you didn't try to claim that. It sounds like you are saying that ChatGPT does not profit off that copy, but I'm not so sure that's true.
            • Of course, you didn't try to claim that. It sounds like you are saying that ChatGPT does not profit off that copy, but I'm not so sure that's true.

              If you paid an all-knowing Oracle to answer your questions and asked him to recite the lyrics to a song from memory, it is not a violation of copyright law for the Oracle to comply.

              • What are you trying to say? That an all-knowing Oracle is exempt from following the law? If you asked an all-knowing Oracle to make 10,000 copies of a book and sell them, that would still be against the law. Being all-knowing doesn't exempt you from following the law.
                • What are you trying to say? That an all-knowing Oracle is exempt from following the law?

                  Copyright law does not prevent the Oracle from complying with the request because copyright law does not prevent his response.

                  If you asked an all-knowing Oracle to make 10,000 copies of a book and sell them, that would still be against the law.

                  What does this have to do with the issue at hand? What is the relevance? Are LLMs selling 10,000 copies of books?

                  Being all-knowing doesn't exempt you from following the law.

                  Obviously the issue is not that being all-knowing exempts one from the law. It's the fact that providing the requested factual information from memory does not constitute a copyright violation.

                  • Copyright law does not prevent the Oracle from complying with the request because copyright law does not prevent his response.

                    It could, it depends on the context. Public performance is regulated.

                    What does this have to do with the issue at hand?

                    Whether it's an Oracle or a person is irrelevant.

        • by narcc ( 412956 ) on Tuesday August 08, 2023 @04:48PM (#63751394) Journal

          That's simply not how these models work. You'll only get exact text like that when that text was included in the training data many, many times. While it is possible for some unique training data to have an outsized influence on the model, those instances are vanishingly rare.

          That quote in particular is a very popular line from the book. A quick search for "just remember that all the people in this world haven't had the advantages that you've had." turns up countless sites talking about it in different contexts. That ChatGPT should be able to reproduce it verbatim isn't surprising at all. It's certainly not evidence that the model is "keeping exact copies of the text in there somewhere". That's not something that models like this can do. There is no mechanism for that.

          How it happens is a different question. While transformers aren't Markovian, it's easy to see how something like an n-gram model could reproduce verbatim text under similar circumstances without referencing or storing an "exact copy of the text in there somewhere". If you want, I can describe a simple experiment you can do at home that shows this.
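
          (For illustration, here is a toy version of that at-home experiment: a word-trigram model in Python. The corpus and code are made up for this example; nothing here is from OpenAI.)

          from collections import Counter, defaultdict

          # Toy corpus: one "famous line" appears many times among other
          # text, the way a popular quote is repeated all over the web.
          quote = ("just remember that all the people in this world "
                   "haven't had the advantages that you've had")
          filler = [
              "the people in this town have had a hard year",
              "remember that the world is wide and all advantages fade",
              "you've had your chances in this world",
          ]
          docs = [quote] * 50 + filler * 5

          # A word-trigram model: for each bigram, count what word follows.
          counts = defaultdict(Counter)
          for doc in docs:
              w = doc.split()
              for a, b, c in zip(w, w[1:], w[2:]):
                  counts[(a, b)][c] += 1

          # Greedy generation: always take the most likely continuation.
          out = ["just", "remember"]
          for _ in range(14):
              nxt = counts.get((out[-2], out[-1]))
              if not nxt:
                  break
              out.append(nxt.most_common(1)[0][0])
          print(" ".join(out))  # the quote, verbatim

          Greedy sampling walks the most likely path through the counts and reconstructs the over-represented line exactly, even though no document is stored anywhere in the model.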

        • ChatGPT's public implementation is a front-end to a search engine, so it can pull results from Bing's cached copies of things. If you go to regular Bing you get the same thing in the search results. A real test would be to run a local copy of GPT and ask the same thing.

      • by gillbates ( 106458 ) on Tuesday August 08, 2023 @12:54PM (#63750772) Homepage Journal

        If I play a copy of a work publicly and charge money for it, I would be committing copyright infringement even though no one actually retained a copy of anything. You cannot avoid a copyright claim merely by claiming you no longer possess a copy of the work copied.

        The issue is that copyright restricts everyone from making copies, with a few narrow exceptions. This is the social contract under which these works were produced and put online. Any copying of works done without author permission, with a few narrow, well defined exceptions, is copyright infringement. The burden of proof is on the copier, not the author, to show that infringement did not take place.

        Regardless of whether there's a legible copy, an encoded copy, or merely the artifacts of a given style, the work was copied without the copyright holder's permission, and is being used in a manner contrary to the spirit and letter of the law. Copyright was intended to protect the authors of creative works, and to claim that no copy exists, when an indistinguishable copy can be produced at will with a given prompt, is disingenuous. It would be akin to claiming that one could escape copyright infringement merely by encrypting your hard drive. The technical details of how the information is stored do not change the fact that the work was copied for the purposes of producing derived works, and hopefully the courts will recognize this.

        Because model training is so expensive, it should (but probably won't be... sigh) be argued that creating derivative works was the purpose, because for a few hundred dollars, an artist could "train" an AI model in any given historical style. We train people all the time this way, and it would be similarly trivial to do this with a computer. Adobe, for example, is doing just that - by recording your brushstrokes and tagging the outcome, they're building models of how an artist creates in a particular style. (Which is a topic for another thread...) However, because the LLM makers used a "shotgun" approach, it is much easier to show that they intended not simply to train a model to create in a particular style, but at the least to create derivative works, if not outright copies, of the works themselves. Typing "Birth of Venus in the style of Botticelli" into the prompt of one of these models will result in a clear, but badly done, copy of the classical work.

        While Botticelli is long dead, there are living artists who have seen AI generate works in their style, almost identical to the works they've created. It is not merely displacing them from receiving the revenue for a particular work copied, but destroying the entire market for all of their works. This is a much larger financial penalty than merely copying a few individual works - it has the capability of stealing an artist's entire revenue stream. And the worst of it is that these companies broke the law in order to enable others to steal from the poorest of the poor. What these companies are doing is neither legal, ethical, nor moral.

        • How is it a copy when it is in MY MIND? My mind's perception is a form of copy. The computer that conveys the image to me copies it multiple times from the binary to decode buffer to the bitmap in memory and then a copy composited on the frame buffer... the frame buffer is then copied to my digital display for whatever fraction of a second; which is projected onto the back of my eye...

          I make FEWER copies running it into an AI after decoding than it takes to display it to me.

          The impression left upon me belon

          • It may be in your mind now, but I assume you had legal access to the information so you could learn it. It's not that the LLM has information that came from copyrighted material; it's that they didn't have permission to use the material for commercial purposes in the first place.
        • Copyright was intended to protect the authors of creative works [...]

          No. The stated intent of our modern copyright law is "to promote the progress of science and useful arts". HOW that is done is by securing for limited times the rights of authors and creators to profit from their works. The distinction between how and why is important. Violating the WHY of the law to protect the HOW is a perversion of justice.

        • Regardless of whether there's a legible copy, an encoded copy, or merely the artifacts of a given style, the work was copied without the copyright holder's permission, and is being used in a manner contrary to the spirit and letter of the law. Copyright was intended to protect the authors of creative works, and to claim that no copy exists, when an indistinguishable copy can be produced at will with a given prompt, is disingenuous.

          If someone remembers the lines to a movie or a play are they breaking the law by storing a copy of it in their minds?

          If they merely utter the lines they remembered are they breaking the law?

          If I privately ask someone to recite the lines for me and I secretly write them down and start selling copies was the person who recited the lines breaking the law?

          Does replacing a person with a machine materially change any of the above answers? If so why?

          It's never been clear what exactly the problem or objection is.

          • Does replacing a person with a machine materially change any of the above answers? If so why?

            A machine doesn't have a mind, so that's nonsensical. There is a hard drive somewhere, owned by the company, with the lines stored on the hard drive. That is what you should be asking about.

      • by mysidia ( 191772 )

        Training involves looking at the statistical relationships between words. It's not a copy of a work, it's a statistical model of a work,

        No... there must be copying to some extent. This should not be the question: there absolutely is copying. The real question is how much, and whether the copying that occurs is within fair use for (A) the creation of the system and (B) the use of the system.

        Unless they *only* trained the model on one work, you will never get an exact copy of the original work out.

        That's a theor

      • It's not a copy of a work, it's a statistical model of a work, specifically the relationships between words or pixels.

        And a JPEG is a mathematical model of a work. A lossy encoding of the work is still stored. The fidelity of the copy is not relevant to whether it was copied.

        • And a JPEG is a mathematical model of a work. A lossy encoding of the work is still stored.

          It's a mathematical representation of a work, which is different. To further your analogy, you could have a JPEG of a photograph of the Eiffel Tower. To compare it to, say, stable diffusion, you would take thousands of pictures of the Eiffel Tower and average them all together. The original JPEG photograph isn't there anymore, you have an average of thousands of pictures instead.

          • I didn't use your word statistical, and I am not convinced that statistical modeling is anything but an emergent effect. All of the images are encoded in a lossy form (64x64?) but I believe they're still there separately. It's too low of a fidelity to recreate a larger original image with one image encoding as the weights but it can give that impression through the full set.

            • by JBMcB ( 73720 )

              For stable diffusion, at least, the images are fed in at 512x512 or 768x768. The loader uses PNG, but the actual weights/tensors describe the relationships between the pixels, not the pixels themselves. Training is done by introducing entropy to the image, then training the network to repair it, which is done through intra-pixel relationships.

              The images are *not* there separately, unless there is only a single image, which would have such a low weight in a typical model it would never be pulled out.
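
              (For illustration only, a deliberately simplified PyTorch sketch of that "corrupt, then repair" objective; it is not Stable Diffusion's actual training loop, which denoises in a latent space with a proper noise schedule and timestep conditioning. All names here are made up.)

              import torch
              import torch.nn as nn
              import torch.nn.functional as F

              # Tiny stand-in "denoiser"; real models are far larger.
              model = nn.Sequential(
                  nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 3, 3, padding=1),
              )
              opt = torch.optim.Adam(model.parameters(), lr=1e-3)

              images = torch.rand(8, 3, 64, 64)  # stand-in training batch
              for step in range(100):
                  noise = torch.randn_like(images)
                  t = torch.rand(images.size(0), 1, 1, 1)  # noise level
                  noisy = (1 - t) * images + t * noise     # add entropy
                  pred = model(noisy)                      # try to repair
                  loss = F.mse_loss(pred, images)
                  opt.zero_grad()
                  loss.backward()
                  opt.step()

              # The weights end up encoding statistics of how pixels relate
              # across the whole batch, not a stored copy of any one image.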

      • by Scoth ( 879800 )

        I can't speak to OpenAI specifically, but I've messed around with a few of the different AI chatbot things and at least some of them definitely do. I'm pretty sure at least one of them was based on ChatGPT. It was talking about reading the Chronicles of Narnia and enjoying it; I asked it what its favorite part was, and it proceeded to send me, word for word, a random chapter of The Lion, the Witch and the Wardrobe. I was surprised to see that, since I had thought the same, that the LLMs weren't storing whole works

        • by narcc ( 412956 )

          I've messed around with a few of the different AI chatbot things and at least some of them definitely do.

          They absolutely do not. That's not how they work. Not at all.

          The only way you're going to get verbatim text out of an LLM is if the text in question was included many, many times in the training data.

    • "Web pages crawled [...] are filtered to remove sources that [...] have text that violates our policies,"

      Time to put unrendered Mein Kampf up on all websites...

    • by ranton ( 36917 )

      The artists, writers, etc... did not give their permission for their work to be copied for the purposes of training their replacements, and for this reason, regardless of the merit of AI art, it will be forever tainted with ethical and moral problems. It would be much less a problem if the artists, writers, etc... had given their permission first.

      Tech companies shouldn't be looking for permission from artists and writers for training their AI models; they should be looking for clarification of existing laws pertaining to fair use of copyrighted works. There is certainly a case for AI model training to be considered educational and transformative, and I find it unlikely using copyrighted content for AI model training would be found illegal under current legislation. It is the results of generative AI models which should be held to the current standard

      • by Anonymous Coward

        Er.. no - that's not how this works. As much as you might disagree, there is an inherent difference between you reading Zen and the Art of Motorcycle Maintenance and attempting to create a work in a similar style as Pirsig, and an LLM doing the same. You (human) are not copying. You are at least attempting an understanding of the Good that Pirsig was discussing. The LLM is copying, then digesting the content. The LLM has no concept of Good - it's just a string of characters, words, sentences, paragraphs

    • Question: If you openly published it on the Internet, for everyone to HTTP GET it to their computer via dozens of copies (cache, RAM, etc)... And you are not only unable to ever tell where it went... but in some cases, it literally got passed on to third parties outside of your light cone, making it physically impossible for you to catch... Then how do you define "ownership" in that case?... What does it actually mean... in physical reality? People who claim "ownership" over openly published informat
      • Then how do you define "ownership" in that case...
        Ownership is defined by having an exclusive copyright on a work of IP. Moreover if you violate that right you face penalties, see:
        (a) Criminal Infringement.—
        (1) In general.—Any person who willfully infringes a copyright shall be punished as provided under section 2319 of title 18, if the infringement was committed—
        (A) for purposes of commercial advantage or private financial gain;
        (B) by the reproduction or distribution, including by electr

    • The fact that you pinky swear not to steal from me again is not endearing when you've already stolen the majority of my work

      Web crawling / scraping is not stealing. Nor is it illegal.

      Not respecting the industry-standard robots.txt is unethical, but not illegal.

      Regardless of whether you think AI is garbage or the next big thing, the problem is that the large models are in possession of stolen goods.

      There are no "stolen goods". The "goods" are still in the possession of their owners.

      The artists, writers, etc... did not give their permission for their work to be copied for the purposes of training their replacements [...]

      Permission to learn from another's works is not needed. In the art world, it is common for students to create copies of masterpieces as learning exercises. This is legal, as long as they are not misrepresented as the original (that would be forgery). Learning is not a crime.

      But Copyri

    • Regardless of whether you think AI is garbage or the next big thing,

      Agreed. It is totally irrelevant whether or not AI is garbage.

      the problem is that the large models are in possession of stolen goods.

      The artists, writers, etc... did not give their permission for their work to be copied for the purposes of training their replacements, and for this reason, regardless of the merit of AI art, it will be forever tainted with ethical and moral problems. It would be much less a problem if the artists, writers, etc... had given their permission first.

      Copyright regimes only protect works. They do not restrict learning from the works of others. Whether it is a rival artist that seeks to put you out of business by studying your works and doing better than you or a machine trained to achieve the same results you are NOT ENTITLED to impose limitations on others just because it negatively affects YOUR livelihood and you don't like it.

      All artists learn from the works of others. It is actually a b

      • Courts tend to be very protective of copyright holders' claims on their works when there's been infringement by other parties, especially for profit.

        The Supreme Court ruled on Thursday that Andy Warhol was not entitled to draw on a prominent photographer’s portrait of Prince for an image of the musician that his estate licensed to a magazine, limiting the scope of the fair-use defense to copyright infringement in the realm of visual art.

        The vote was 7 to 2. Justice Sonia Sotomayor, writing for the majo

  • I thought that if I gave over my retinal scans and got a bunch of worldcoins as a reward, I could use the Worldcoin digital passport to access all websites to prevent those pesky bots they created.

    This will never be abused though! It's not just about being human, it'll be about being of legal age, not spreading "disinformation", in good legal standing, etc.
    New problems created by the solution to the problem they originally created.

  • by mysidia ( 191772 ) on Tuesday August 08, 2023 @12:37PM (#63750728)

    Currently the file format only has Allow and Disallow.

    I propose that there be new keywords for robots.txt format in order to narrow the scope of allowed usage: Usages, ProvisionalAllow, and BlanketAllow.

    The 'BlanketAllow' option would replace Allow, and Allow is deprecated. Anyone continuing to use Allow has not updated to the new format. Those who have adopted the new format should place a "Disallow: /" as the first entry of every Path entry in the file followed by their ProvisionalAllow or BlanketAllow statements.

    AllowedUse and DisallowedUse would be the usage options to which ProvisionalAllow applies; each AllowedUse and DisallowedUse entry has a comma-separated list of keywords. Any keyword specified in DisallowedUse is an always-forbidden usage and always takes priority over anything placed in the AllowedUse entries. Entries in the same comma-separated list override entries that appear earlier in the same list.

    The special keyword "all" refers to All possible uses now or future, and "none" refers to No uses now or future.

    The prefix "+" in front of a keyword selects a keyword for that list entry , and the prefix of "-" or "No" negates or deselects the keyword for that entry.

    The AllowedUse list has all options selected by default if AllowedUse is omitted, or if the first entry in the list starts with a '+', '-', or 'No'. If the list does not start with those, then only the keywords specifically listed are selected.

    The syntax below would be provided to add and remove allowed usages. For example:

    AllowedUse NoTrain,NoIndex,NoSearch,NoFollow,NoArchive,NoExcerpt

    In this example, all usages except the six negated ones are selected as allowed.

    AllowedUse None,Search,Index

    In the example above, nothing is allowed other than creating a searchable index, with no archiving and no production of excerpts from the page in search results, etc.

    The DisallowedUse keyword is processed in the same manner, except that, if omitted, no usages are selected as members of the DisallowedUse entry.

    So you can have

    User-Agent: *
    Disallow: /
    DisallowedUse: Chatbots,ImageGen
    AllowedUse: none,Index,Search,Excerpts
    ProvisionalAllow: /
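
    A sketch of how a crawler might evaluate these directives (hypothetical Python; the function names are mine, and treating a signed first entry in DisallowedUse as "everything selected" is my reading of "processed in the same manner"):

    def _eval_list(value, omitted_default):
        """Turn one comma-separated use list into a use -> bool test."""
        if value is None:  # header omitted entirely
            return lambda use: omitted_default
        entries = [e.strip() for e in value.split(",") if e.strip()]
        # A signed first entry (+ / - / No...) means everything is selected
        # by default; otherwise only explicitly listed keywords count.
        base = bool(entries) and (entries[0][0] in "+-"
                                  or entries[0].lower().startswith("no"))
        overrides = {}
        for e in entries:  # later entries override earlier ones
            kl = e.lower()
            if kl == "all":      # select every use, reset prior entries
                base, overrides = True, {}
            elif kl == "none":   # deselect every use
                base, overrides = False, {}
            elif e[0] == "+":
                overrides[e[1:].lower()] = True
            elif e[0] == "-":
                overrides[e[1:].lower()] = False
            elif kl.startswith("no"):
                overrides[e[2:].lower()] = False
            else:
                overrides[kl] = True
        return lambda use: overrides.get(use.lower(), base)

    def use_allowed(use, allowed=None, disallowed=None):
        """DisallowedUse always wins over AllowedUse, per the proposal."""
        return (_eval_list(allowed, True)(use)
                and not _eval_list(disallowed, False)(use))

    # Checked against the example record above:
    print(use_allowed("Train", "none,Index,Search,Excerpts",
                      "Chatbots,ImageGen"))  # False
    print(use_allowed("Index", "none,Index,Search,Excerpts",
                      "Chatbots,ImageGen"))  # True

    With AllowedUse omitted entirely, everything defaults to allowed, which keeps old-style records behaving like a plain Allow.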

    • You're coming at this from the perspective of the site owners that are being crawled. You have to come at this from the other end, the end where the spec actually matters. Which means you can make up as many keywords as you'd like, the end result is still, "Crawl it all and let digital god sort it out."

      • The whole point of this is litigation avoidance.

      • by mysidia ( 191772 )

        You have to come at this from the other end, the end where the spec actually matters

        The spec matters to both parties. Of course the robots.txt has to be supplemented with a posted Terms of Service that incorporates the robots.txt specification into the legal agreement. Once the ToS is posted properly, a violation of the robots.txt is then a breach of contract and can be pursued as such in civil cases by website authors, and in class actions.

    • I don't give a shit about robots.txt. Everyone who is ever at risk of being excluded by it ignores it anyway. Even big names go around it by letting users access your page through their service (to translate, link, whatever) and then scraping your information. You can pretend that robots.txt does something, but it doesn't.
      • by mysidia ( 191772 )

        Everyone who is ever at risk of being excluded by it ignores it anyway.

        They can be sued for breach of contract for disobeying a Terms of Service that states compliance with the robots.txt spec is mandatory, and under the DMCA (circumvention of an access control feature). If they don't agree to the contract, then they have no authorization to access the system, and are now in criminal violation territory under the Computer Fraud and Abuse Act.

        Anyways; I can also put a forced arbitration claus

        • It's called robots.txt for a reason. Its original purpose wasn't even to exclude bots but to tell them where they shouldn't crawl because crawling generated hierarchies was pointless. There is no technical standard that mandates observing robots.txt and unless you put your ToS in front of every access to your web site and require acknowledgement, they're not binding in many jurisdictions. And surprise: the crawlers are where you cannot get to them. I'm talking about practical real-world considerations, not
    • by narcc ( 412956 )

      That's way more complicated than necessary. You can get all of the same benefits without the needless complexity. An addition is clearly needed, though it is absolutely essential that it be kept as simple as possible. In this case, a single new directive should be more than adequate.

      robots.txt was created to "block off" parts of the site, not handle licensing. There are a few horror stories from the early days of the web where search engines would destroy data as they crawled early web apps, clicking on

  • Dark pattern. All questions about morality and appropriateness aside, this is just an exasperatingly stupid concession in pursuit of forgiveness.

  • Just like all corporate software is based on 'tards', because everyone with a brain opted out of the analytics. (Although some really smart folks may have found ways to feed the kraken with their... let's call it "special sauce". O:-)
  • But it is ok now, you can block us. We already stole all your data that we could get to.
  • That basically means you can now put up a sign at your door "please don't steal from me" and the thief promises to heed it, while still stealing freely from anyone not posting that.
