
Why Extracting Data from PDFs Remains a Nightmare for Data Experts (arstechnica.com)
Businesses, governments, and researchers continue to struggle with extracting usable data from PDF files, despite AI advances. These digital documents contain valuable information for everything from scientific research to government records, but their rigid formats make extraction difficult.
"PDFs are a creature of a time when print layout was a big influence on publishing software," Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, told Ars Technica. This print-oriented design means many PDFs are essentially "pictures of information" requiring optical character recognition (OCR) technology.
Traditional OCR systems have existed since the 1970s but struggle with complex layouts and poor-quality scans. New AI language models from companies like Google and Mistral now attempt to process documents more holistically, with varying success. "Right now, the clear leader is Google's Gemini 2.0 Flash Pro Experimental," Willis notes, while Mistral's recent OCR solution "performed poorly" in tests.
How is this worse than dealing with (Score:3, Insightful)
...paper?
Re:How is this worse than dealing with (Score:5, Informative)
Re: (Score:2)
I remember the problematic PDFs "handled" by "printing" them to a bitmap, then OCR-ing the bitmap. However, that means a ton of stuff is lost.
PDFs are... just weird. You don't just have regular PDFs; there's PDF/A, PDF/X, PDFs for 3D models, and even PDFs with scripting. They're like Word documents: a nominally standard format that only MS is 100% compatible with in practice. Same with PDF files; in my experience, there are some files that only Acrobat can handle without crashing.
Re: (Score:2)
PDFs are just shit, and the worst part is that they don't have to be like that. We've all gotten PDFs where copying and pasting the text leaves it fragmented and spaced out in bizarre ways, including PDFs made with Acrobat Pro! But the format allows kerning and spacing that would prevent that, separating the content from the presentation... and somehow the people who invented the format (by bastardizing their prior creation, PostScript) still can't do it in a way that doesn't make it pure trash.
Re: (Score:1)
What do you mean by "non-copiable"? You mean non-digital text mixed in with digital text? That's still not worse than paper that I see.
Back to the original question, suppose you were given the task of scanning 100k documents. Would you rather have 100k of the above kind of PDFs -or- 100k paper documents, and why?
Re: (Score:3)
Back to the original question, suppose you were given the task of scanning 100k documents. Would you rather have 100k of the above kind of PDF's -or- 100k paper documents, and why?
I worked for a company that had lots of these PDFs they needed to convert to data. It was easier to print them out, scan them, and run them through OCR. The bulk of them were a form that had been filled out over decades by hand, by typewriter, and on the computer. The ones filled out on the computer were the PDF ones. The issue was that the PDF treated some filled-out sections as uncopyable. The work was outsourced to a company that specialized in digitization to handle them all. The existed paper
Re: (Score:3)
The weird thing is that on macOS, I can generally copy text from screenshots in PNG format. Built-in OCR has rendered PDF largely moot.
Re: (Score:2)
Paper (or, as that AC mentioned, microfiche). Not only do PDFs have issues with internal layout, missing fonts resulting in font substitution, embedded low-quality raster images, etc., they also have all sorts of permission weirdness under the hood that can cause issues processing them in anything other than Acrobat. And Acrobat will enforce those permissions. For instance, they have a "page extraction" permission which can block any attempt to separate the document into single pages and then turn the pages i
Re: (Score:2)
You explained more than the actual article did. It's been *years* since I've seen a bitmapped pdf. I would be interested to know what percentage of pdf files globally are bitmapped, and what percentage of such files actually contain non-perished data. I can see why this might be a problem for a historian wanting to do a quant analysis of a big archive from the 90s, but I don't understand why it's such an issue outside those quite narrow bounds.
Re:How is this worse than dealing with (Score:5, Interesting)
Re: (Score:2)
Of course, if the PDF is just a scanned image in a PDF format, that doesn't apply.
Re: (Score:2)
Ain't it wonderful?
Re: (Score:2)
Prompt injection that friggen easy? (Score:2)
Prompt injection is a thing? I found the solution to Fermi's Paradox!
Re: (Score:2)
Bitmap (Score:5, Informative)
Re: (Score:3)
2. The problem documents are bad scans in a crappy format. The crappy format is documented. So you don't need Adobe.
no (Score:1)
For some reason the nonsense continues (Score:2)
There are people insisting on having stuff in PDF even when it's available in much better formats, for example Wikipedia pages. There's even a GitHub project to do that in bulk, for many thousands of them.
Master Yoda asks (Score:2)
If so powerful these AIs are... why can't read PDFs?
Re: (Score:2)
Old fashioned non-AI OCR also has trouble with bad scans.
Maybe combining the two will make things better.
More basic (Score:5, Insightful)
The fundamental problem is that PDF is a program that draws graphics, not a data format. That confusion leads to the problem.
Simple example, a CSV file containing a table of observations is quite useful as input to a model for simulation, but may be hard to read. A PS or PDF file COULD contain that table and a program to display it as a nice human readable table with headings and nice spacing and margins, leaving the CSV easy to extract, or it could be just a series of calls to drawing functions that draw what the CSV should look like but not containing a copy of the CSV. MANY if not most PDFs are the latter, not the former.
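To make that concrete, here's a toy sketch of the "drawing calls" case. The content stream below is hypothetical (real streams are usually compressed and far messier), but it shows why naive extraction hands you word fragments rather than the data:

```python
import re

# A made-up PDF content stream: the page "contains" a sentence, but only as
# separate text-showing (Tj) operators with manual positioning moves (Td).
# The original CSV/table is nowhere in the file -- just drawing commands.
content_stream = """
BT
/F1 12 Tf
72 700 Td (Hel) Tj
18 0 Td (lo,) Tj
22 0 Td (world) Tj
ET
"""

# Naive extraction: grab every literal string fed to Tj, in stream order.
fragments = re.findall(r"\((.*?)\)\s*Tj", content_stream)
print(fragments)            # ['Hel', 'lo,', 'world']
print("".join(fragments))   # 'Hello,world' -- word boundaries are gone
```

Whether two fragments belong to the same word is encoded only in the kerning numbers, which is exactly why copy-and-paste from such files comes out mangled.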
Side note, in spite of strident denials, PDF is basically PS which is basically Forth with a few aliases, new defined operators, and a graphics module (that can be implemented in Forth).
When it comes to archiving data, text > simple HTML > "modern" HTML > PS or PDF
Re: (Score:3)
Right, it is more or less making images of each page. That's also the appeal of it, in that you don't ruin the formatting or render something incorrectly because you don't have a particular font, or some other local feature required to make something like Word work. Even just the "font substitution" bug alone is enough to make people want to use PDF, and I haven't heard of any better solution for archival documents.
We had a Platinum-level trouble ticket with MS for the font substitution issue, they concluded i
Re: (Score:2)
All of that is great until you want to do any sort of processing on the information. Then it's crap. Good old text is quite resilient and processable. It's just not "pretty".
Re: (Score:2)
The intent of PDFs wasn't for a computer to process the information, but for people. People are much better at acquiring information if it is well presented. There is a lot of information that is non-textual. Images, graphs and even the relative position of text on a page are important to people. (A good graph will be processed by a person far faster than a table of numbers even though it contains "the same" information.)
If the goal is for the computer to process the information, a nice structured file is b
Re: (Score:2)
PDF also tends to be weak for searching (even searchable PDF is only searchable in the limited ways envisioned when it is created). Even if the end result is intended to be a human reading the document, good data mining is a useful tool for that human to find the relevant documents to read.
Markup languages such as HTML strike a better balance. It is relatively easy to index an HTML document, or even feed it into an AI to index concepts.
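For what it's worth, even the Python standard library can do a crude version of that. A minimal sketch (the class and the tag-tracking scheme are mine, not from any particular indexer):

```python
from html.parser import HTMLParser

class TextIndexer(HTMLParser):
    """Collect visible text from HTML, tagged with its enclosing element.

    Toy sketch only: a real indexer would also weight headings, skip
    <script>/<style>, stem words, and so on.
    """
    def __init__(self):
        super().__init__()
        self.stack, self.chunks = [], []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((self.stack[-1] if self.stack else None, text))

p = TextIndexer()
p.feed("<h1>PDF woes</h1><p>Markup keeps structure.</p>")
print(p.chunks)  # [('h1', 'PDF woes'), ('p', 'Markup keeps structure.')]
```

Because the markup carries structure, the indexer gets "this text was a heading" for free; with a PDF you'd have to guess that from font sizes and coordinates.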
The crux of the problem is that many people feel that they have saved th
No, just no (Score:3, Informative)
"PDF is basically PS which is basically Forth with a few aliases, new defined operators, and a graphics module (that can be implemented in Forth)." Bruh, NO.
PostScript is a Turing-complete language. You can run software which does just about anything (subject to the limitations of the interpreter, which was not intended to be abused too much) and outputs its results on paper. Your software can loop and calculate and do if-thens and draw Bezier curves to draw a field of pretty flowers. In theory you could wri
Re: (Score:3)
PDF at its core is PostScript stripped of flow control and the compiler, making it no longer Turing complete but, they hope, less likely to lock up a printer (since PS doesn't necessarily halt). Early on, PDF was viewed on Linux by prepending a PS stub and running it in Ghostscript.
In turn, in spite of strident claims that PS is not even influenced by Forth, it's close enough that any Forth programmer can look at PS and immediately understand it. PS, like Forth, is indeed Turing complete.
PDF/A for document archive "best practice" (Score:2)
Re: (Score:2)
Re: (Score:2)
That'll work, if you can guarantee that the XML content will always generate the same PDF. Or, better, distribute the XML together with a publishing-format CSS, which together can generate the readable page. Or even ... (apologies in advance) ... SGML. The schema guarantees conformity with an agreed-upon data type declaration.
Take a "blueprint" for example. I am mind-boggled at the idea of expressing that as an XML file.
Why? The source for the CAD program that produces the blueprint is a structured data file, either in ASCII (DXF for example) or binary (DWG).
Re: PDF/A for document archive "best practice" (Score:2)
PDFs are the de facto preferred document interchange format for several vertical industries, AFAIK. Medical, real estate, lawyers... that is a lot of locked-in markets. Too bad for them. They will be stuck with PDF forever! It's easier for a rich man to get to heaven than it is to convince an Adobe user to give it up, in my experience.
Acrobat is a sh*tshow (Score:2)
Seriously, Adobe needs to do something in a major way to fix this. The fact that you can't access all of Acrobat's Pro form features from InDesign is crazy. The fact that you can't build a form with a text box that spans multiple pages is dumb. Even on a top-of-the-line computer, Acrobat Pro is slow. The fact that there is no free API for generating PDFs is ludicrous. The fact that you have to jump through hoops to be able to submit a PDF form electronically via e-mail is bad if it even works at all.
ChatGPT can read them all (Score:2)
Even bitmapped ones.
Re: (Score:2)
Shouldn't be hard to understand (Score:3)
It's a feature, not a bug. (Score:2)
Re: (Score:2)
PDF scans
Yeah. But that's a special case of taking a crap scan/photo of a document and encapsulating the bit-mapped graphic in a PDF file. Maybe the contrast is crap, the page lifted off the flatbed window or the moron doing the scanning set the page on it crooked.
done INTENTIONALLY
Submittals to the FAA. Been there, done that. Worse yet, on orders from management, the original document had to be hand-written. By a group of engineers with wildly inconsistent handwriting. There was _no_way_ that an FMEA would make it into a searchable
Re: (Score:2)
Maybe the contrast is crap, the page lifted off the flatbed window or the moron doing the scanning set the page on it crooked.
Oh, you can read it well enough, but there is no way to search through 100+ pages to find what you want. Again, done intentionally, since clicking the 'print to PDF' button is 100x easier than scanning all those pages. And don't get me started on the dozens of obscure acronyms/codes that are sprinkled throughout the budget to make it even more opaque for the public...
Re: (Score:2)
I don't believe for one second that someone has the know-how to create a 100 page budget report... but just can't figure out how 'print to PDF' works, and would rather scan page by page.
I do believe that.
I think there's a low tech solution for this (Score:2)
Re: (Score:2)
The cause of, and solution to, all of our problems!
Seriously, though, humans are terrible at this. We get bored and miss stuff; we mistype numbers, and we're really expensive to hire.
their... rigid formats... umm... (Score:3)
Copy-Paste (Score:2)
Azure Document Intelligence overlooked (Score:3)
Nothing to Generate PDFs either (Score:1)
Surprisingly little software exists to generate an A4 PDF. I prompt for LaTeX or CSS. It's quite a mess. I'm amazed at the lack of AI in DTP. Did I miss something?
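You can actually get away with no libraries at all for trivial cases. Below is a stdlib-only sketch that hand-writes a minimal single-page A4 PDF; big caveats apply (no compression, no escaping of parentheses or backslashes in the text, ASCII only, built-in Helvetica only), so treat it as a demonstration of the format, not a generator:

```python
def minimal_pdf(text: str) -> bytes:
    """Hand-roll a one-page A4 PDF (595 x 842 pt) containing `text`."""
    # Page content stream: one text-showing operator, nothing else.
    stream = b"BT /F1 24 Tf 72 770 Td (%b) Tj ET" % text.encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n%b\nendstream" % (len(stream), stream),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))                    # byte offset for the xref
        out += b"%d 0 obj\n%b\nendobj\n" % (num, body)
    xref_at = len(out)
    out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
    for off in offsets:
        out += b"%010d 00000 n \n" % off            # 20-byte xref entries
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_at))
    return bytes(out)

pdf = minimal_pdf("Hello, A4")
print(len(pdf), "bytes")
```

The uncomfortable part is that everything beyond this toy (fonts, Unicode, line wrapping, tables) is exactly where the tooling gets thin, which is presumably why everyone falls back to LaTeX or a browser's print engine.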
PDF (Score:2)
As I tell my users a thousand times a year:
If you're extracting data from PDFs... you're doing something wrong.
It was always designed as a WORM format for print publishing. Always from source text in a far more usable format. If you've lost access to the original and it was anything important... that's on you.
If you're getting your data from OCR'ing of images... even worse. If you're then using that data for anything important... more fool you. You can't even find the original sources, you're going off