Why Extracting Data from PDFs Remains a Nightmare for Data Experts (arstechnica.com) 65

Posted by msmash on Tuesday March 11, 2025 @01:26PM from the tough-luck dept.

Businesses, governments, and researchers continue to struggle with extracting usable data from PDF files, despite AI advances. These digital documents contain valuable information for everything from scientific research to government records, but their rigid formats make extraction difficult.

"PDFs are a creature of a time when print layout was a big influence on publishing software," Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, told Ars Technica. This print-oriented design means many PDFs are essentially "pictures of information" requiring optical character recognition (OCR) technology.

Traditional OCR systems have existed since the 1970s but struggle with complex layouts and poor-quality scans. New AI language models from companies like Google and Mistral now attempt to process documents more holistically, with varying success. "Right now, the clear leader is Google's Gemini 2.0 Flash Pro Experimental," Willis notes, while Mistral's recent OCR solution "performed poorly" in tests.

How is this worse than dealing with (Score:3, Insightful)

by Tablizer ( 95088 ) writes: on Tuesday March 11, 2025 @01:37PM (#65225739) Journal

...paper?

- Re:How is this worse than dealing with (Score:5, Informative)
  
  by Anonymous Coward writes: on Tuesday March 11, 2025 @01:41PM (#65225745)
  
  There are three kinds of PDFs - bitmapped PDFs, which this article is talking about - basically a scan in a PDF. These are problematic in that you can't redo the scan later with better hardware/software. There's "searchable PDFs" which are decent. Then there's this god-awful thing where the searchable PDF format is abused to make them non-copiable, by mixing up the letters in the font table and in the content of the document. I would make the last of these a felony.
  
  - Re: (Score:2)
    
    by ctilsie242 ( 4841247 ) writes:
    
    I remember the problematic PDFs "handled" by "printing" them to a bitmap, then OCR-ing the bitmap. However, that means a ton of stuff is lost.
    PDFs are... just weird. You don't just have regular PDFs, you have PDF/A, PDF/X, and even PDFs for 3D models, as well as PDFs with scripting. They are like Word documents, a non-standard standard that only MS is 100% compatible with. Same with PDF files: in my experience, there are some files that only Acrobat can handle without crashing.
    - Re: (Score:2)
      
      by drinkypoo ( 153816 ) writes:
      
      PDFs are just shit and the worst part is that they don't have to be like that. We've all gotten PDFs where when you C&P the text it gets all fragmented and spaced out in bizarre ways, including PDFs made with Acrobat Pro! But the format allows you to do kerning and spacing that would prevent that, separating the content from the presentation... and somehow the people who invented the format (by bastardizing their prior creation, PostScript) still can't do it in a way that doesn't make it pure trash.
  - Re: (Score:1)
    
    by Tablizer ( 95088 ) writes:
    
    Then there's this god-awful thing where the searchable PDF format is abused to make them non-copiable, by mixing up the letters in the font table and in the content of the document. I would make the last of these a felony.
    What do you mean by "non-copiable"? You mean non-digital text mixed in with digital text? That's still not worse than paper that I see.
    Back to the original question, suppose you were given the task of scanning 100k documents. Would you rather have 100k of the above kind of PDF's -or- 100k
    - Re:How is this worse than dealing with (Score:4, Interesting)
      
      by UnknowingFool ( 672806 ) writes: on Tuesday March 11, 2025 @02:41PM (#65225951)
      
      Back to the original question, suppose you were given the task of scanning 100k documents. Would you rather have 100k of the above kind of PDF's -or- 100k paper documents, and why?
      I worked for a company that had lots of these PDFs they needed to convert to data. It was easier to print them out, scan them, and run them through OCR. The bulk of them were a form that had been filled out over decades by hand, typewriter, and on the computer. The ones that were filled out on the computer were the PDF ones. The issue was that PDF treated some filled-out sections as uncopiable. The work was outsourced to a company that specialized in digitization to handle them all. The existing paper forms were scanned. The PDF ones were printed out first, then scanned in like a paper form. OCR then converted the scans to data. Each of the documents was checked by a person.
      
      - Re: (Score:3)
        
        by dgatwood ( 11270 ) writes:
        
        The weird thing is that on macOS, I can generally copy text from screenshots in PNG format. Built-in OCR has rendered PDF largely moot.
    - Re: (Score:2)
      
      by mindwhip ( 894744 ) writes:
      
      Paper (or as that AC mentioned microfiche). Not only do PDFs have issues with internal layout, missing fonts resulting in font substitution, embedded low quality raster images etc. they also have all sorts of permission weirdness under the hood that can cause issues processing them in anything other than Acrobat. And Acrobat will enforce those permissions. For instance they have a "page extraction" permission which can block any attempt to separate the document into single pages and then turn the pages i
      - Re: (Score:1)
        
        by Tablizer ( 95088 ) writes:
        
        > For instance they have a "page extraction" permission which can block any attempt to separate the document into single pages and then turn the pages into images for an OCR process.
        In theory a print-driver could be devised to "print" each page into an image, such as BMP, PNG, etc. But I agree the devil's often in the details.
        Another permission can just block you from printing them. Combine these and you end up with a real problem when companies create "fillable" PDFs
        The ratio of problem-PDF's to "normal
        
        Re: (Score:1)
        
        by Tablizer ( 95088 ) writes:
        
        Correction: the following should have been quoted:
        Another permission can just block you from printing them. Combine these and you end up with a real problem when companies create "fillable" PDFs
  - Re: (Score:2)
    
    by shilly ( 142940 ) writes:
    
    You explained more than the actual article did. It's been *years* since I've seen a bitmapped pdf. I would be interested to know what percentage of pdf files globally are bitmapped, and what percentage of such files actually contain non-perished data. I can see why this might be a problem for a historian wanting to do a quant analysis of a big archive from the 90s, but I don't understand why it's such an issue outside those quite narrow bounds.
- Re:How is this worse than dealing with (Score:5, Interesting)
  
  by UnknowingFool ( 672806 ) writes: on Tuesday March 11, 2025 @02:21PM (#65225889)
  
  It is worse because it is already digitized. When I was dealing with this problem at a previous company, the first solution was to print out the PDF on paper, then run the document through a scanner and OCR, as that was easier than trying to extract the data in the PDF. Later on, to save paper, they "printed" the PDF as an image and then processed the image with OCR.
  
  - Re: (Score:2)
    
    by phantomfive ( 622387 ) writes:
    
    I worked for a while extracting data from PDFs. If you're a programmer, extracting it is fairly straightforward, since it is usually generated by a rather simple algorithm. 10 lines of code, or 30 if you're inelegant. But for non-programmers, it's a lot harder. Copy and Paste doesn't quite work the way you'd want it to.
    
    Of course, if the PDF is just a scanned image in a PDF format, that doesn't apply.
    - Re: (Score:2)
      
      by UnknowingFool ( 672806 ) writes:
      
      It is a simple algorithm if nothing changes from PDF to PDF. With PDFs from different software over decades of versions, the problem becomes complicated. The number of exceptions and corner cases can eventually be overcome with a dedicated programming staff given enough time and resources.
      For example, from what I remember: a PDF saved by a Spanish version of Adobe Acrobat on English Windows was slightly different than one saved from an English version of Acrobat on a Spanish version of Windows and different f
- Re: (Score:2)
  
  by Big Hairy Gorilla ( 9839972 ) writes:
  
  ha ha... I immediately thought of the promise of the "paperless office" when I read the summary.
  Ain't it wonderful?
  - Re: (Score:2)
    
    by wwphx ( 225607 ) writes:
    
    I remember when they first started talking about the paperless office back in the '80s. I laughed my butt off then, and if anyone talks about it now, I'll do an eye roll and leave.
- Re: (Score:2)
  
  by Tony Isaac ( 1301187 ) writes:
  
  Yes, worse. Because most PDFs come in two flavors:
  Flavor 1: A scanned document that has been OCR'ed. The OCR text is not grouped in paragraphs or connected, as you would expect when trying to analyze a document. It's strictly placed on the page based on position. The text in a field might be split up into fragments, because the OCR doesn't care about field boundaries.
  Flavor 2: A machine-generated document, like a fillable form. This is better, but still often lacks associations between fields and labels.
  Bec
Prompt injection that friggen easy? (Score:2)

by Tablizer ( 95088 ) writes:

Article: Despite their promise, LLMs introduce several new problems to document processing. Among them, they can introduce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions in the text (thinking they are part of a user prompt),... [Emph. added]
Prompt injection is a thing? I found the solution to Fermi's Paradox!
...and the detectives interviewed the distraught suspect, who started shouting, "Kill all humans, kill all humans!"...
- Re: (Score:2)
  
  by phantomfive ( 622387 ) writes:
  
  The suspect then insisted, "Let me tell you about my mother."
- Bitmap (Score:5, Informative)
  
  by JBMcB ( 73720 ) writes: on Tuesday March 11, 2025 @01:58PM (#65225809)
  
  They are talking about bitmapped PDFs, where the PDF is just a wrapper around a TIFF or JPEG2K image of a document. The heading is a bit misleading, as the issue isn't with PDFs themselves, the issue is people scanning in bitmaps of stuff without OCRing them.
  
- Re: (Score:3)
  
  by greytree ( 7124971 ) writes:
  
  1. PDFs contain Postscript code. They are not Postscript code.
  
  2. The problem documents are bad scans in a crappy format. The crappy format is documented. So you don't need Adobe.
- no (Score:1)
  
  by Anonymous Coward writes:
  
  It's not a program it's a page description language that happens to resemble Postscript. Postscript is a language, yes. If you actually care about the difference go find the specification for PDF 1.0 and take a look (none of the stuff added on later matters for basic documents). That said, the later PDF versions do allow you to embed Javascript but that's certainly not "every PDF file." --signed, someone who used to write Postscript programs and has also written PDF files.
For some reason the nonsense continues (Score:2)

by itsme1234 ( 199680 ) writes:

There are people insisting on having stuff in PDF, even if available in much better formats, for example Wikipedia pages. There's even a GitHub project to do that in bulk, for many thousands of these.
Master Yoda asks (Score:2)

by vbdasc ( 146051 ) writes:

If so powerful these AIs are... why can't read PDFs?
- Re: (Score:2)
  
  by greytree ( 7124971 ) writes:
  
  Just like people, AIs can read good scans, but have trouble with bad scans.
  
  Old fashioned non-AI OCR also has trouble with bad scans.
  
  Maybe combining the two will make things better.
More basic (Score:5, Insightful)

by sjames ( 1099 ) writes: on Tuesday March 11, 2025 @02:16PM (#65225861) Homepage Journal

The fundamental problem is that PDF is a program that draws graphics, not a data format. That confusion leads to the problem.
Simple example, a CSV file containing a table of observations is quite useful as input to a model for simulation, but may be hard to read. A PS or PDF file COULD contain that table and a program to display it as a nice human readable table with headings and nice spacing and margins, leaving the CSV easy to extract, or it could be just a series of calls to drawing functions that draw what the CSV should look like but not containing a copy of the CSV. MANY if not most PDFs are the latter, not the former.
Side note, in spite of strident denials, PDF is basically PS which is basically Forth with a few aliases, new defined operators, and a graphics module (that can be implemented in Forth).
When it comes to archiving data, text > simple HTML > "modern" HTML > PS or PDF
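
The CSV-versus-drawing-calls contrast above can be sketched in Python. This is only an illustration under stated assumptions: the `(x, y, string)` tuples are a hypothetical stand-in for positioned draw-text operations, not a real PDF API, and the layout numbers (`x0`, `y0`, `col_width`, `row_height`) are made up. It shows that once a table has been flattened into draw calls, the rows have to be guessed back from coordinates.

```python
import csv
import io

CSV_TEXT = "site,temp_c\nA,21.5\nB,19.0\n"


def parse_rows(csv_text):
    """The easy case: the data itself is present and parses directly."""
    return list(csv.reader(io.StringIO(csv_text)))


def to_draw_calls(rows, x0=72, y0=720, col_width=100, row_height=14):
    """Flatten a table into hypothetical (x, y, string) draw-text calls.

    After this step there are no rows or columns any more, only strings
    placed at coordinates (y grows upward, as on a PDF page).
    """
    calls = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            calls.append((x0 + c * col_width, y0 - r * row_height, cell))
    return calls


def rebuild_rows(calls, y_tolerance=2):
    """Guess the table back from positions: bucket by y, then sort by x."""
    lines = {}
    for x, y, text in calls:
        lines.setdefault(round(y / y_tolerance), []).append((x, text))
    # Larger y means higher on the page, so iterate buckets top to bottom.
    ordered = sorted(lines.items(), reverse=True)
    return [[text for _, text in sorted(cells)] for _, cells in ordered]
```

On this tidy grid `rebuild_rows(to_draw_calls(parse_rows(CSV_TEXT)))` recovers the original rows. Real documents with wrapped cells, merged columns, or skewed scans are where position-based guessing like this tends to break down.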

- Re:More basic (Score:4, Insightful)
  
  by Brett Buck ( 811747 ) writes: on Tuesday March 11, 2025 @05:07PM (#65226343)
  
  Right, it is more-or-less making images of each page. That's also the appeal of it, in that you don't ruin the formatting or render something incorrectly because you don't have a particular font, or some other local feature required to make something like WORD work. Even just the "font substitution" bug alone is enough to make people want to use PDF, and I haven't heard any better solution for archival documents.
  We had a Platinum-level trouble ticket with MS for the font substitution issue. They concluded it was insoluble and that, to keep our documents from getting corrupted, we should print them on paper, scan them as TIFFs, and save the TIFFs.
  Aside from paper, which works fantastically well for this purpose, from the many examples I have at hand, I still don't see an answer that keeps searchable electronic documents intact over time and program version changes. And certainly not one that is WYSIWYG when creating it in the first place. Various typesetting programs (TeX and LaTeX, formerly Runoff, etc.) are just as prone to bit rot over version changes over many years/decades and also *torturous to use in the first place*.
  
  - Re: (Score:2)
    
    by sjames ( 1099 ) writes:
    
    All of that is great until you want to do any sort of processing on the information. Then it's crap. Good old text is quite resilient and processable. It's just not "pretty".
    - Re: (Score:2)
      
      by Mr. Barky ( 152560 ) writes:
      
      The intent of PDFs wasn't for a computer to process the information, but for people. People are much better at acquiring information if it is well presented. There is a lot of information that is non-textual. Images, graphs and even the relative position of text on a page are important to people. (A good graph will be processed by a person far faster than a table of numbers even though it contains "the same" information.)
      If the goal is for the computer to process the information, a nice structured file is b
      - Re: (Score:2)
        
        by sjames ( 1099 ) writes:
        
        PDF also tends to be weak for searching (even searchable PDF is only searchable in the limited ways envisioned when it is created). Even if the end result is intended to be a human reading the document, good data mining is a useful tool for that human to find the relevant documents to read.
        Markup languages such as HTML strike a better balance. It is relatively easy to index an HTML document, or even feed it into an AI to index concepts.
        The crux of the problem is that many people feel that they have saved th
        
        Re: (Score:2)
        
        by Brett Buck ( 811747 ) writes:
        
        I am not sure why you say it is "particularly useless". It's not appropriate for storing large amounts of raw data. It is probably the best electronic format for storing actual documents, that is, things like engineering reports and analysis that have to be rendered properly and are primarily intended to be read by a person. It's still not nearly as good/safe as actual paper but if your application cannot tolerate math symbols showing as random font-substituted garbage, then it's about the only game in tow
        
        Re: (Score:2)
        
        by sjames ( 1099 ) writes:
        
        HTML is about as good for presentation but a lot easier to mine data out of for further processing. As Unicode support is now widespread (except for /.), garbled fonts are less problematic.
    - Re: (Score:2)
      
      by Brett Buck ( 811747 ) writes:
      
      We pass data files as text or FORTRAN binary, not PDF. We archive engineering reports as PDF, TIFF scans - or, the best, actual paper in a file cabinet, which so far has proven far and away the most reliable. PDF is hardly immune to corruption issues itself, depending on how you do it, it ALSO attempts to OCR or somehow convert information into something, and invariably corrupts the document. If it's not searchable, fine, at least it is *correct*.
- No, just no (Score:3, Informative)
  
  by Anonymous Coward writes:
  
  "PDF is basically PS which is basically Forth with a few aliases, new defined operators, and a graphics module (that can be implemented in Forth)." Bruh, NO.
  Postscript is a Turing-complete language. You can run software which does just about anything (subject to the limitations of the interpreter, which was not intended to be abused too much) and outputs its results on paper. Your software can loop and calculate and do if-thens and draw bezier curves to draw a field of pretty flowers. In theory you could wri
  - Re:No, just no (Score:5, Interesting)
    
    by sjames ( 1099 ) writes: on Tuesday March 11, 2025 @07:58PM (#65226729) Homepage Journal
    
    PDF at its core is PostScript stripped of flow control and the compiler, making it no longer Turing complete but, they hope, less likely to lock up a printer (since PS doesn't necessarily halt). Early on, PDF was viewed in Linux by pre-pending a PS stub and running it in GhostScript.
    In turn, in spite of strident claims that PS is not even influenced by Forth, it's close enough that any Forth programmer can look at PS and immediately understand it. PS, like Forth, is indeed Turing complete.
    
    - Re: (Score:1)
      
      by gdm ( 97336 ) writes:
      
      no longer Turing complete but, they hope, less likely to lock up a printer (since PS doesn't necessarily halt)
      That appears to be a veiled reference to (and a wild misunderstanding of) the Halting Problem [wikipedia.org]....
PDF/A for document archive "best practice" (Score:2)

by C0L0PH0N ( 613595 ) writes:

For a very long time, PDF/A was the "best practice" for archiving documents. Is this no longer true? What would be a current "best practice" for archiving documents? The primary idea behind archiving digital information is the possibly vain hope that the information will be "readable" and available to humans in 100 years, as paper documents, for example, are.
- Re: (Score:2)
  
  by C0L0PH0N ( 613595 ) writes:
  
  I have interrogated one of the AIs about this (ChatGPT), and its recommended new "best practice" for archiving documents is to keep PDF/As but add an associated "buddy" or "sidecar" file in XML format, reproducing the text and other important information in the PDF/A in machine-readable format. THAT is a good new "best practice". I think this makes sense. The PDF/A for human consumption, the associated XML file for machine consumption. I can see this working wonderfully for text. I am less certain a
  - Re: (Score:2)
    
    by PPH ( 736903 ) writes:
    
    That'll work. If you can guarantee that the XML content will always generate the same PDF. Or just distributing the XML together with a publishing format CSS which together can generate the readable page is better. Or even ... (apologies in advance) ... SGML. The schema guarantees conformity with an agreed-upon data type declaration.
    Take a "blueprint" for example. I am mind-boggled at the idea of expressing that as an XML file.
    Why? The source for the CAD program that produces the blueprint is a structured data file, either in ASCII (DXF for example) or binary (DWG).
    - Re: PDF/A for document archive "best practice" (Score:2)
      
      by Big Hairy Gorilla ( 9839972 ) writes:
      
      Yeah, you're right about that. I'd point out we are talking about proprietary vs open formats. So open formats are always the way to go for archiving.
      
      PDFs are the de facto preferred document interchange format for several vertical industries, afaik. Medical, real estate, lawyers... that is a lot of locked-in markets. Too bad for them. They will be stuck with PDF forever! It's easier for a rich man to get to heaven than it is to convince an Adobe user to give it up, in my experience.
  - Re: (Score:2)
    
    by mattr ( 78516 ) writes:
    
    At the risk of starting a terrible meme, I am wondering what would happen if you do that but also provide a folder of the image files or spreadsheets, whatever *plus* a readme.md or whatever file which explains it all to an AI "agent" or human archivist. If you have higher resolution images, maybe not even in that folder but online somewhere, potentially they could also be included. This file would be in English and we assume any future software would be able to interrogate it with at least as much understa
Acrobat is a sh*tshow (Score:2)

by RogueWarrior65 ( 678876 ) writes:

Seriously, Adobe needs to do something in a major way to fix this. The fact that you can't access all of Acrobat's Pro form features from InDesign is crazy. The fact that you can't build a form with a text box that spans multiple pages is dumb. Even on a top-of-the-line computer, Acrobat Pro is slow. The fact that there is no free API for generating PDFs is ludicrous. The fact that you have to jump through hoops to be able to submit a PDF form electronically via e-mail is bad if it even works at all.
ChatGPT can read them all (Score:2)

by nospam007 ( 722110 ) * writes:

Even bitmapped ones.
- Re: (Score:2)
  
  by dargaud ( 518470 ) writes:
  
  I seriously doubt that.
  - Re: (Score:2)
    
    by nospam007 ( 722110 ) * writes:
    
    It told me so.:-)
Shouldn't be hard to understand (Score:3)

by sentiblue ( 3535839 ) writes: on Tuesday March 11, 2025 @03:33PM (#65226091)

PDF files are unorganized data. They contain unpredictable formatting and cannot in any way be stored in a relational manner. The only way to store/retrieve/analyze data properly is that they are stored in a relational database. Since this cannot happen with PDF, we must understand that Acrobat is NOT the answer, period.

It's a feature, not a bug. (Score:2)

by superdave80 ( 1226592 ) writes:

PDF scans are done INTENTIONALLY by some organizations (*cough* my kids' school district *cough*) to make it harder to find the info you need. They are required by law to make the info available, but they don't have to make it convenient. I don't believe for one second that someone has the know-how to create a 100 page budget report... but just can't figure out how 'print to PDF' works, and would rather scan page by page.
- Re: (Score:2)
  
  by PPH ( 736903 ) writes:
  
  PDF scans
  Yeah. But that's a special case of taking a crap scan/photo of a document and encapsulating the bit-mapped graphic in a PDF file. Maybe the contrast is crap, the page lifted off the flatbed window or the moron doing the scanning set the page on it crooked.
  done INTENTIONALLY
  Submittals to the FAA. Been there, done that. Worse yet, on orders from management, the original document had to be hand-written. By a group of engineers with wildly inconsistent handwriting. There was _no_way_ that an FMEA would make it into a searchable
  - Re: (Score:2)
    
    by superdave80 ( 1226592 ) writes:
    
    Maybe the contrast is crap, the page lifted off the flatbed window or the moron doing the scanning set the page on it crooked.
    Oh, you can read it well enough, but there is no way to search through 100+ pages to find what you want. Again, done intentionally, since clicking the 'print to PDF' button is 100x easier than scanning all those pages. And don't get me started on the dozens of obscure acronyms/codes that are sprinkled throughout the budget to make it even more opaque for the public...
- Re: (Score:2)
  
  by phantomfive ( 622387 ) writes:
  
  I don't believe for one second that someone has the know-how to create a 100 page budget report... but just can't figure out how 'print to PDF' works, and would rather scan page by page.
  I do believe that.
I think there's a low tech solution for this (Score:2)

by zkiwi34 ( 974563 ) writes:

It's called humans.
- Re: (Score:2)
  
  by kqs ( 1038910 ) writes:
  
  The cause of, and solution to, all of our problems!
  Seriously, though, humans are terrible at this. We get bored and miss stuff; we mistype numbers, and we're really expensive to hire.
their... rigid formats... umm... (Score:3)

by zephvark ( 1812804 ) writes: on Tuesday March 11, 2025 @04:00PM (#65226165)

PDFs would be very easy to handle if they had rigid formats. They don't. They're the usual garbage bag of random things that slightly work. Adobe was noted for that. The original coders were half-assed and stoned all the time, and no one can figure out what exactly was going on in their drug-addled heads, so we try not to use things like "Macromedia Flash" or PDFs anymore. They were always broken and it's not reasonable to try to fix them.

Copy-Paste (Score:2)

by dargaud ( 518470 ) writes:

I teach some practical university courses. Students have to copy-paste code samples from PDF files. It's a nightmare. There are plenty of extraneous characters between *each* character. Also the lines are reorganized in weird columns so you have to delete the end of every line or some weird shit like that. They spend more time editing their copy-paste than adding the important code that is asked from them. PDFs suck.
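
One way to cut down on the hand-editing described above is to filter the pasted text before using it. The sketch below is a guess at the cause, not a diagnosis: it assumes the extraneous characters are invisible format or control characters (zero-width spaces, soft hyphens) or compatibility forms such as ligatures and non-breaking spaces. The function name `clean_pasted_code` is made up for illustration, and the sketch does nothing about lines reordered into columns.

```python
import unicodedata


def clean_pasted_code(text):
    """Strip invisible characters from text copied out of a PDF.

    NFKC normalization folds compatibility forms (for example the "fi"
    ligature and the non-breaking space) into plain characters. The loop
    then drops Unicode format (Cf) and control (Cc) characters, keeping
    only newlines and tabs, which code needs.
    """
    normalized = unicodedata.normalize("NFKC", text)
    kept = []
    for ch in normalized:
        if ch in "\n\t":
            kept.append(ch)
        elif unicodedata.category(ch) in ("Cf", "Cc"):
            continue  # zero-width space, soft hyphen, stray control chars
        else:
            kept.append(ch)
    return "".join(kept)
```

NFKC is a blunt tool: it also rewrites characters such as superscript digits, so it suits ASCII-oriented code samples better than prose or mathematical text.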
Azure Document Intelligence overlooked (Score:4, Informative)

by laughingskeptic ( 1004414 ) writes: on Tuesday March 11, 2025 @05:51PM (#65226453)

I have recently used Document Intelligence on crooked coffee-stained scans of 400 page documents from the 1970s and it worked far better than I expected. Biggest complaint is it did not always correctly choose between "/" or "l" for some table headers in some of these documents when reading categorical text that looks like alnum+/alnum+. It has its limitations, but I read this article and wondered if maybe the authors haven't used Document Intelligence.
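
The "/" versus "l" confusion in headers shaped like alnum+/alnum+ can sometimes be repaired after the fact. The sketch below assumes something the comment does not state: that a list of the expected category labels is available (`known_headers` is hypothetical). It treats "/" and "l" as interchangeable when matching, and it does not call or model the Document Intelligence API in any way.

```python
def confusable_key(label):
    """Collapse the two characters the OCR mixes up into one."""
    return label.replace("/", "l")


def repair_header(ocr_text, known_headers):
    """Return the known header that matches once "/" and "l" are merged.

    If no known header matches, or more than one does, the OCR text is
    returned unchanged rather than guessed at.
    """
    by_key = {}
    for header in known_headers:
        by_key.setdefault(confusable_key(header), []).append(header)
    matches = by_key.get(confusable_key(ocr_text), [])
    return matches[0] if len(matches) == 1 else ocr_text
```

For example, `repair_header("AClDC", ["AC/DC", "N/A"])` returns `"AC/DC"`, while a label that is not in the list comes back untouched.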

Nothing to Generate PDFs either (Score:1)

by jago25_98 ( 566531 ) writes:

Surprisingly little software to generate an A4 PDF. I prompt for LaTeX or CSS. It's quite a mess. I'm amazed at the lack of AI in DTP. Did I miss something?
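
For a sense of what generating an A4 page involves at the file level, here is a minimal sketch that writes a one-page PDF by hand using only the Python standard library. It is an illustration, not a replacement for a real library: the function name `make_a4_pdf`, the font size, and the text position are arbitrary choices, A4 is rounded to 595 x 842 points, and only text that encodes as Latin-1 is handled.

```python
def make_a4_pdf(text):
    """Build a one-page A4 PDF showing a single line of text."""
    # Backslash and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: select font F1 at 24 pt, move near the top left, draw.
    stream = f"BT /F1 24 Tf 72 770 Td ({escaped}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the result with `open("hello.pdf", "wb").write(make_a4_pdf("Hello"))` should produce a file a viewer can open. Anything beyond one line of text (wrapping, embedded fonts, images) is where a typesetting tool or PDF library earns its keep.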
PDF (Score:2)

by ledow ( 319597 ) writes:

As I tell my users a thousand times a year:
If you're extracting data from PDFs... you're doing something wrong.
It was always designed as a WORM format for print publishing. Always from source text in a far more usable format. If you've lost access to the original and it was anything important... that's on you.
If you're getting your data from OCR'ing of images... even worse. If you're then using that data for anything important... more fool you. You can't even find the original sources, you're going off
- Re:PDF-- Sometimes paper is the only source (Score:2)
  
  by laughingskeptic ( 1004414 ) writes:
  
  What is your suggestion then when the only source for some information is decades old piles of paper? OCR has finally become good enough that this old data can be digitally captured. This is a new situation which is resulting in new observations and complaints. When I run some of our old documents through the Adobe OCR software, I get a text file of individual letters that are not even always aligned properly ... not particularly helpful. But these new OCR systems with integrated LLMs work very well.
  - Re: (Score:2)
    
    by ledow ( 319597 ) writes:
    
    "What is your suggestion then when the only source for some information is decades old piles of paper?"
    Learn to do things better in the future, and keep reliable source data for whatever important thing is so necessary that you have to spend ridiculous amounts of money on scanning and fixing it all.
    And tell your users that PDF IS A WORM FORMAT so you don't ever run into this again.
The answer is obvious! (Score:2)

by kaatochacha ( 651922 ) writes:

Because Adobe is responsible.
If you told me that a new thing had been developed, which caused skin blisters, eye watering blindness, painful diarrhea; but gave you a slight benefit and charged you an obscene amount of money monthly, I'd place my bet on Adobe creating it.
