Why Extracting Data from PDFs Remains a Nightmare for Data Experts (arstechnica.com) 47

Businesses, governments, and researchers continue to struggle with extracting usable data from PDF files, despite AI advances. These digital documents contain valuable information for everything from scientific research to government records, but their rigid formats make extraction difficult.

"PDFs are a creature of a time when print layout was a big influence on publishing software," Derek Willis, a lecturer in Data and Computational Journalism at the University of Maryland, told ArsTechnica. This print-oriented design means many PDFs are essentially "pictures of information" requiring optical character recognition (OCR) technology.

Traditional OCR systems have existed since the 1970s but struggle with complex layouts and poor-quality scans. New AI language models from companies like Google and Mistral now attempt to process documents more holistically, with varying success. "Right now, the clear leader is Google's Gemini 2.0 Flash Pro Experimental," Willis notes, while Mistral's recent OCR solution "performed poorly" in tests.

    • by Anonymous Coward on Tuesday March 11, 2025 @01:41PM (#65225745)
      There are three kinds of PDFs - bitmapped PDFs, which this article is talking about - basically a scan in a PDF. These are problematic in that you can't redo the scan later with better hardware/software. There's "searchable PDFs" which are decent. Then there's this god-awful thing where the searchable PDF format is abused to make them non-copiable, by mixing up the letters in the font table and in the content of the document. I would make the last of these a felony.
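A toy model of the "mixed-up font table" trick described above, as a sketch only: the stored character codes are permuted, and the embedded font draws the right glyph for each permuted code, so the page looks correct while copy-paste returns the raw codes. The fixed rotation by 7 and the names `encode` and `render` are assumptions made for illustration; real PDFs do this through font encodings and missing or wrong ToUnicode maps.

```python
import string

LETTERS = string.ascii_lowercase
# Hypothetical permutation: a fixed rotation stands in for an arbitrary shuffle.
SHIFTED = LETTERS[7:] + LETTERS[:7]

# What the producer stores in the document for each letter it wants shown.
CODE_FOR_GLYPH = dict(zip(LETTERS, SHIFTED))
# What the embedded font draws for each stored code (the inverse mapping).
GLYPH_FOR_CODE = {code: glyph for glyph, code in CODE_FOR_GLYPH.items()}


def encode(text: str) -> str:
    """Return the scrambled codes that end up in the content stream."""
    return "".join(CODE_FOR_GLYPH.get(ch, ch) for ch in text)


def render(stored: str) -> str:
    """Return what a viewer displays: each stored code drawn with the font's glyph."""
    return "".join(GLYPH_FOR_CODE.get(ch, ch) for ch in stored)


stored = encode("budget report")
print("on screen: ", render(stored))  # reads normally
print("copy-paste:", stored)          # the raw codes, which are gibberish
```

In this model the only ways back to the text are to recover the permutation or to OCR the rendered page, which matches the print-then-OCR workarounds mentioned in the replies.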
      • I remember the problematic PDFs "handled" by "printing" them to a bitmap, then OCR-ing the bitmap. However, that means a ton of stuff is lost.

        PDFs are... just weird. You don't just have regular PDFs, you have PDF/A, PDF/X, and even PDFs for 3D models, as well as PDFs with scripting. They are like Word documents, which are a non-standard standard that only MS is 100% compatible with. Same with PDF files: in my experience, there are some files that only Acrobat can handle without crashing.

        • PDFs are just shit and the worst part is that they don't have to be like that. We've all gotten PDFs where when you C&P the text it gets all fragmented and spaced out in bizarre ways, including PDFs made with Acrobat Pro! But the format allows you to do kerning and spacing that would prevent that, separating the content from the presentation... and somehow the people who invented the format (by bastardizing their prior creation, PostScript) still can't do it in a way that doesn't make it pure trash.

      • by Tablizer ( 95088 )

        Then there's this god-awful thing where the searchable PDF format is abused to make them non-copiable, by mixing up the letters in the font table and in the content of the document. I would make the last of these a felony.

        What do you mean by "non-copiable"? You mean non-digital text mixed in with digital text? That's still not worse than paper that I see.

        Back to the original question, suppose you were given the task of scanning 100k documents. Would you rather have 100k of the above kind of PDF's -or- 100k

        • Back to the original question, suppose you were given the task of scanning 100k documents. Would you rather have 100k of the above kind of PDF's -or- 100k paper documents, and why?

          I worked for a company that had lots of these PDFs they needed to convert to data. It was easier to print them out, scan them, and run them through OCR. The bulk of them were a form that had been filled out over decades by hand, typewriter, and on the computer. The ones that were filled out on the computer were the PDF ones. The issue was that PDF treated some filled-out sections as uncopiable. The work was outsourced to a company that specialized in digitization to handle them all. The existed paper

        • Paper (or, as that AC mentioned, microfiche). Not only do PDFs have issues with internal layout, missing fonts resulting in font substitution, embedded low-quality raster images, etc., they also have all sorts of permission weirdness under the hood that can cause issues processing them in anything other than Acrobat. And Acrobat will enforce those permissions. For instance they have a "page extraction" permission which can block any attempt to separate the document into single pages and then turn the pages i

    • by UnknowingFool ( 672806 ) on Tuesday March 11, 2025 @02:21PM (#65225889)
      It is worse because it is already digitized. When I was dealing with this problem at a previous company, the first solution was to print out the PDF on paper, then run the document through a scanner and OCR, as that was easier than trying to extract the data in the PDF. Later on, to save paper, they "printed" the PDF as an image and then processed the image with OCR.
      • I worked for a while extracting data from PDFs. If you're a programmer, extracting it is fairly straightforward, since it is usually generated by a rather simple algorithm. 10 lines of code, or 30 if you're inelegant. But for non-programmers, it's a lot harder. Copy and Paste doesn't quite work the way you'd want it to.

        Of course, if the PDF is just a scanned image in a PDF format, that doesn't apply.
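As a rough illustration of the "10 lines of code" claim above, here is a minimal sketch that pulls the text out of an uncompressed page content stream by looking for the `Tj` and `TJ` text-showing operators. The sample stream and the function name `extract_strings` are made up for the example. Real files usually compress their streams and use font-specific encodings, and this sketch does not unescape backslash sequences, so treat it as a demonstration of the idea rather than a working extractor.

```python
import re

# Hypothetical, uncompressed content stream; real ones are usually Flate-compressed.
SAMPLE_STREAM = rb"""
BT
/F1 12 Tf
72 720 Td
(Station) Tj
100 0 Td
(Temp) Tj
-100 -14 Td
[(A) -20 (lpha)] TJ
100 0 Td
(21.5) Tj
ET
"""

# A literal string: parentheses around escaped bytes or ordinary bytes.
STRING = re.compile(rb"\(((?:\\.|[^\\()])*)\)")
# A text-showing operation: "(...) Tj" or "[ ... ] TJ" (strings mixed with kerning numbers).
SHOW = re.compile(rb"\((?:\\.|[^\\()])*\)\s*Tj|\[(?:[^\]\\]|\\.)*\]\s*TJ")


def extract_strings(stream: bytes) -> list[str]:
    """Return one text fragment per Tj/TJ operation, in stream order."""
    fragments = []
    for match in SHOW.finditer(stream):
        parts = STRING.findall(match.group(0))
        fragments.append(b"".join(parts).decode("latin-1"))
    return fragments


print(extract_strings(SAMPLE_STREAM))
```

Note how the `TJ` array splits "Alpha" into two pieces with a kerning adjustment between them; a copy-paste tool that treats such adjustments as spaces produces the fragmented text complained about elsewhere in this thread.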
    • ha ha... I immediately thought of the promise of the "paperless office" when I read the summary.
      Ain't it wonderful?
      • by wwphx ( 225607 )
        I remember when they first started talking about the paperless office back in the '80s. I laughed my butt off then, and if anyone talks about it now, I'll do an eye roll and leave.
  • Article: Despite their promise, LLMs introduce several new problems to document processing. Among them, they can introduce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions in the text (thinking they are part of a user prompt),... [Emph. added]

    Prompt injection is a thing? I found the solution to Fermi's Paradox!

    ...and the detectives interviewed the distraught suspect, who started shouting, "Kill all humans, kill all humans!"...

  • There are people insisting on having stuff in PDF, even if it is available in much better formats, for example Wikipedia pages. There's even a GitHub project to do that in bulk, for many thousands of these.

  • If these AIs are so powerful... why can't they read PDFs?

    • Just like people, AIs can read good scans, but have trouble with bad scans.

      Old fashioned non-AI OCR also has trouble with bad scans.

      Maybe combining the two will make things better.
  • More basic (Score:5, Insightful)

    by sjames ( 1099 ) on Tuesday March 11, 2025 @02:16PM (#65225861) Homepage Journal

    The fundamental problem is that PDF is a program that draws graphics, not a data format. That confusion leads to the problem.

    Simple example, a CSV file containing a table of observations is quite useful as input to a model for simulation, but may be hard to read. A PS or PDF file COULD contain that table and a program to display it as a nice human readable table with headings and nice spacing and margins, leaving the CSV easy to extract, or it could be just a series of calls to drawing functions that draw what the CSV should look like but not containing a copy of the CSV. MANY if not most PDFs are the latter, not the former.

    Side note, in spite of strident denials, PDF is basically PS which is basically Forth with a few aliases, new defined operators, and a graphics module (that can be implemented in Forth).

    When it comes to archiving data, text > simple HTML > "modern" HTML > PS or PDF
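The CSV example above can be sketched in a few lines. `draw_table` plays the role of the PDF producer: it turns a table into positioned text fragments and throws the row and column structure away. `guess_table` plays the role of the extractor, which has to infer the table back from coordinates. The data, the function names, and the page geometry (72-point margin, 100-point columns, 14-point rows) are all assumptions for illustration.

```python
import csv
import io

# Hypothetical observations table.
CSV_TEXT = "station,temp_c\nAlpha,21.5\nBravo,19.0\n"


def draw_table(csv_text, col_width=100, row_height=14, top=720, left=72):
    """Emit (x, y, text) drawing calls; nothing records which cell is which."""
    calls = []
    for r, row in enumerate(csv.reader(io.StringIO(csv_text))):
        for c, cell in enumerate(row):
            calls.append((left + c * col_width, top - r * row_height, cell))
    return calls


def guess_table(calls):
    """Rebuild rows by grouping on y (top of page first), then ordering by x."""
    rows = {}
    for x, y, text in calls:
        rows.setdefault(y, []).append((x, text))
    return [
        [text for _, text in sorted(cells)]
        for _, cells in sorted(rows.items(), reverse=True)
    ]


calls = draw_table(CSV_TEXT)
print(calls[:2])           # positioned fragments, no table structure
print(guess_table(calls))  # structure recovered only because the grid is perfectly regular
```

The recovery works here only because every cell sits on an exact grid. With wrapped cells, merged headers, or slightly different baselines per fragment, grouping by coordinates becomes guesswork, which is the situation the comment describes.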

    • Right, it is more-or-less making images of each page. That's also the appeal of it, in that you don't ruin the formatting or render something incorrectly because you don't have a particular font, or some other local feature required to make something like WORD work. Even just the "font substitution" bug alone is enough to make people want to use PDF, and I haven't heard any better solution for archival documents.

      We had a Platinum-level trouble ticket with MS for the font substitution issue, they concluded i

      • by sjames ( 1099 )

        All of that is great until you want to do any sort of processing on the information. Then it's crap. Good old text is quite resilient and processable. It's just not "pretty".

    • No, just no (Score:3, Informative)

      by Anonymous Coward

      "PDF is basically PS which is basically Forth with a few aliases, new defined operators, and a graphics module (that can be implemented in Forth)." Bruh, NO.

      PostScript is a Turing-complete language. You can run software which does just about anything (subject to the limitations of the interpreter, which was not intended to be abused too much) and outputs its results on paper. Your software can loop and calculate and do if-thens and draw Bezier curves to draw a field of pretty flowers. In theory you could wri

      • by sjames ( 1099 )

        PDF at its core is PostScript stripped of flow control and the compiler, making it no longer Turing complete but, they hope, less likely to lock up a printer (since PS doesn't necessarily halt). Early on, PDF was viewed in Linux by prepending a PS stub and running it in Ghostscript.

        In turn, in spite of strident claims that PS is not even influenced by Forth, it's close enough that any Forth programmer can look at PS and immediately understand it. PS, like Forth, is indeed Turing complete.

  • If you want to archive a document, for a very long time PDF/A was the "best practice" for archiving documents. Is this no longer true? What would be a current "best practice" for archiving documents? The primary idea behind archiving digital information is the possibly vain hope that the information will be "readable" and available to humans in 100 years, as paper documents, for example, are.
    • I have interrogated one of the AIs about this (ChatGPT), and it recommends, as a new "best practice" for archiving documents, keeping PDF/As but adding an associated "buddy" or "sidecar" file in XML format, reproducing the text and other important information in the PDF/A in machine-readable format. THAT is a good new "best practice". I think this makes sense. The PDF/A for human consumption, the associated XML file for machine consumption. I can see this working wonderfully for text. I am less certain a
      • by PPH ( 736903 )

        That'll work. If you can guarantee that the XML content will always generate the same PDF. Or just distributing the XML together with a publishing format CSS which together can generate the readable page is better. Or even ... (apologies in advance) ... SGML. The schema guarantees conformity with an agreed-upon data type declaration.

        Take a "blueprint" for example. I am mind-boggled at the idea of expressing that as an XML file.

        Why? The source for the CAD program that produces the blueprint is a structured data file, either in ASCII (DXF for example) or binary (DWG).

        • Yeah, you're right about that. I'd point out we are talking about proprietary vs open formats. So open formats are always the way to go for archiving.

          PDFs are the de facto preferred document interchange format for several vertical industries, afaik. Medical, real estate, lawyers... that is a lot of locked-in markets. Too bad for them. They will be stuck with PDF forever! It's easier for a rich man to get to heaven than it is to convince an Adobe user to give it up, in my experience.
  • Seriously, Adobe needs to do something in a major way to fix this. The fact that you can't access all of Acrobat Pro's form features from InDesign is crazy. The fact that you can't build a form with a text box that spans multiple pages is dumb. Even on a top-of-the-line computer, Acrobat Pro is slow. The fact that there is no free API for generating PDFs is ludicrous. The fact that you have to jump through hoops to be able to submit a PDF form electronically via e-mail is bad, if it even works at all.

  • Even bitmapped ones.

  • by sentiblue ( 3535839 ) on Tuesday March 11, 2025 @03:33PM (#65226091)
    PDF files are unorganized data. They contain unpredictable formatting and cannot in any way be stored in a relational manner. The only way to store/retrieve/analyze data properly is that they are stored in a relational database. Since this cannot happen with PDF, we must understand that Acrobat is NOT the answer, period.
  • PDF scans are done INTENTIONALLY by some organizations (*cough* my kids' school district *cough*) to make it harder to find the info you need. They are required by law to make the info available, but they don't have to make it convenient. I don't believe for one second that someone has the know-how to create a 100 page budget report... but just can't figure out how 'print to PDF' works, and would rather scan page by page.
    • by PPH ( 736903 )

      PDF scans

      Yeah. But that's a special case of taking a crap scan/photo of a document and encapsulating the bit-mapped graphic in a PDF file. Maybe the contrast is crap, the page lifted off the flatbed window or the moron doing the scanning set the page on it crooked.

      done INTENTIONALLY

      Submittals to the FAA. Been there, done that. Worse yet, on orders from management, the original document had to be hand-written. By a group of engineers with wildly inconsistent handwriting. There was _no_way_ that an FMEA would make it into a searchable

      • Maybe the contrast is crap, the page lifted off the flatbed window or the moron doing the scanning set the page on it crooked.

        Oh, you can read it well enough, but there is no way to search through 100+ pages to find what you want. Again, done intentionally, since clicking the 'print to PDF' button is 100x easier than scanning all those pages. And don't get me started on the dozens of obscure acronyms/codes that are sprinkled throughout the budget to make it even more opaque for the public...

    • I don't believe for one second that someone has the know-how to create a 100 page budget report... but just can't figure out how 'print to PDF' works, and would rather scan page by page.

      I do believe that.

    • by kqs ( 1038910 )

      The cause of, and solution to, all of our problems!

      Seriously, though, humans are terrible at this. We get bored and miss stuff; we mistype numbers, and we're really expensive to hire.

  • by zephvark ( 1812804 ) on Tuesday March 11, 2025 @04:00PM (#65226165)
    PDFs would be very easy to handle if they had rigid formats. They don't. They're the usual garbage bag of random things that slightly work. Adobe was noted for that. The original coders were half-assed and stoned all the time, and no one can figure out what exactly was going on in their drug-addled heads, so we try not to use things like "Macromedia Flash" or PDFs anymore. They were always broken and it's not reasonable to try to fix them.
  • I teach some practical university courses. Students have to copy-paste code samples from PDF files. It's a nightmare. There are plenty of extraneous characters between *each* character. Also the lines are reorganized in weird columns so you have to delete the end of every line or some weird shit like that. They spend more time editing their copy-paste than adding the important code that is asked of them. PDFs suck.
  • by laughingskeptic ( 1004414 ) on Tuesday March 11, 2025 @05:51PM (#65226453)
    I have recently used Document Intelligence on crooked coffee-stained scans of 400 page documents from the 1970s and it worked far better than I expected. Biggest complaint is it did not always correctly choose between "/" or "l" for some table headers in some of these documents when reading categorical text that looks like alnum+/alnum+. It has its limitations, but I read this article and wondered if maybe the authors haven't used Document Intelligence.
  • There is surprisingly little software to generate an A4 PDF. I prompt for LaTeX or CSS. It's quite a mess. I'm amazed at the lack of AI in DTP. Did I miss something?
