JPL Creates World's Largest PDF Archive to Aid Malware Research

Posted by BeauHD on Wednesday June 14, 2023 @10:02PM from the size-matters dept.

NASA's Jet Propulsion Laboratory (JPL) has created the largest open-source archive of PDFs as part of DARPA's Safe Documents program, with the aim of improving internet security. The corpus consists of approximately 8 million PDFs collected from the internet. From a press release: "PDFs are used everywhere and are important for contracts, legal documents, 3D engineering designs, and many other purposes. Unfortunately, they are complex and can be compromised to hide malicious code or render different information for different users in a malicious way," said Tim Allison, a data scientist at JPL in Southern California. "To confront these and other challenges from PDFs, a large sample of real-world PDFs needs to be collected from the internet to create a shared, freely available resource for software experts." Building the corpus was no easy task. As a starting point, Allison's team used Common Crawl, an open-source public repository of web-crawl data, to identify a wide variety of PDFs to be included in the corpus -- files that are publicly available and not behind firewalls or in private networks. Conducted between July and August 2021, the crawl identified roughly 8 million PDFs.

Common Crawl limits downloaded data to 1 megabyte per file, meaning larger files were incomplete. But researchers need the entire PDF, not a truncated version, in order to conduct meaningful research on them. The file-size limit reduced the number of complete, untruncated files extracted directly from Common Crawl to 6 million. To get the other 2 million PDFs and ensure the corpus was complete, the JPL team re-fetched the truncated files using specialized software that downloaded the whole files from the incomplete PDFs' web addresses. Various metadata, such as the software used to create each PDF, was extracted and is included with the corpus. The JPL team also relied on free, publicly available geolocation software to identify the server location of the source website for each PDF. The complete data set totals about 8 terabytes, making it the largest publicly available corpus of its kind.
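The truncation step described above can be sketched in a few lines of Python. This is only an illustration, not the JPL team's actual tooling: the `CrawledPdf` record, the `split_by_truncation` helper, and the example URLs are hypothetical, and the 1 megabyte cap is assumed here to mean 1 MiB. Any file whose stored size reaches the cap is treated as possibly cut off and queued for a full re-fetch from its original address.

```python
from dataclasses import dataclass

# Assumption: the 1 megabyte cap from the summary, taken here as 1 MiB.
CRAWL_LIMIT_BYTES = 1024 * 1024


@dataclass(frozen=True)
class CrawledPdf:
    """Hypothetical record for one PDF found in the crawl."""
    url: str
    stored_bytes: int  # number of bytes the crawl actually kept


def split_by_truncation(records, limit=CRAWL_LIMIT_BYTES):
    """Return (complete, needs_refetch) lists of records."""
    complete, needs_refetch = [], []
    for rec in records:
        # A file that reached the cap may have been cut off,
        # so its URL is queued for a fresh, full download.
        if rec.stored_bytes >= limit:
            needs_refetch.append(rec)
        else:
            complete.append(rec)
    return complete, needs_refetch


sample = [
    CrawledPdf("https://example.org/a.pdf", 250_000),
    CrawledPdf("https://example.org/b.pdf", CRAWL_LIMIT_BYTES),
]
complete, needs_refetch = split_by_truncation(sample)
print(len(complete), "complete;", len(needs_refetch), "to re-fetch")
```

With the figures in the summary, a split of this kind would leave roughly 6 million files in the complete list and roughly 2 million in the re-fetch list.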

The corpus will do more than help researchers identify threats. Privacy researchers, for example, could study these files to determine how file-creation and editing software can be improved to better protect personal information. Software developers could use the files to find bugs in their code and to check if old versions of software are still compatible with newer versions of PDFs. The Digital Corpora project hosts the huge data archive as part of Amazon Web Services' Open Data Sponsorship Program, and the files have been packaged in easily downloadable zip files.

JPL??? (Score:3)

by skogs ( 628589 ) writes: on Wednesday June 14, 2023 @10:21PM (#63603930) Journal

I accept that this is a good idea and useful. I question whether it should have been spearheaded by JPL and NASA.
Surely this could have been completed by a non-government organization of some sort.
Even if it was a US government project... why wouldn't it have been more under the purview of DHS/CISA/FBI or even DoC/NIST?

- Re: (Score:3)
  
  by awwshit ( 6214476 ) writes:
  
  JPL has a strong incentive to protect itself from malware. I'm sure JPL also deals in a lot of PDFs.
- Re: (Score:2)
  
  by GFS666 ( 6452674 ) writes:
  
  It may not be as crazy as you think. Someone correct me if I'm wrong, but JPL is in the business (space probes) of acquiring incredibly large amounts of data from multiple sources (from said space probes) and storing that data for long amounts of time (decades) for later analysis and use by interested people (scientists). Please note that from the summary (hey, I am NOT going to read the article) that JPL is only collecting this data, not analyzing it. And who would you rather trust to collect the data and
- Re: (Score:2)
  
  by RitchCraft ( 6454710 ) writes:
  
  "why wouldn't it have been more under the purview of DHS/CISA/FBI or even DoC/NIST?" - Because JPL needs an actual solution to the problem.
- USAF (Score:2)
  
  by JBMcB ( 73720 ) writes:
  
  Technically the USAF is in charge of governmental cybersecurity.
  Oh, and the NSA.
  And parts of the FBI.
  And there's an executive level cybersecurity "czar" who does... something.
  And, I guess, NASA now, too.
  What we really need is another organization to centralize everything related to cybersecurity. That would clear everything up.
- Re: (Score:1)
  
  by tallison314159 ( 10434182 ) writes:
  
  Personal opinion: As others noted, JPL has expertise in handling incredibly large amounts of data, generally. JPL also has expertise in web-crawling (https://github.com/nasa-jpl-memex) and in open source file parsers. Specifically, the Apache Tika project (https://tika.apache.org/) was co-founded by Chris Mattmann, a JPL'er.
Why not try fixing the PDF format first? (Score:2)

by thesjaakspoiler ( 4782965 ) writes:

That would have helped all of us and brought World Peace at the same time.
- Re:Why not try fixing the PDF format first? (Score:5, Informative)
  
  by Gimric ( 110667 ) writes: on Thursday June 15, 2023 @02:16AM (#63604282)
  
  According to the article they (DARPA) ARE trying to fix PDF:
  - Filed 117 disambiguating edits to the international standard for PDF (ISO 32000-2 AKA PDF 2.0), 88 of which have been fully resolved and approved by ISO with solutions publicly available;
  - Developed the Arlington PDF Model, the first vendor-neutral, open-source specification-derived, machine and human-readable definition of the PDF data objects;
  - Completed a security audit of the International Color Consortium’s (ICC) color profile format used in PDF and many image formats, resulting in updates to the ICC specifications and a move to incorporate machine-readable data descriptions to assist implementers. ICC color profiles are integral to the accurate rendering of images and can be used for malicious purposes, as River Loop Security and the PDF Association describe in this analysis;
  - Identified the need and directed the curation of a new PDF file corpus, CC-MAIN-2021-31-PDF-UNTRUNCATED, to support research and format awareness; and
  - Generated automatic tests/parsers for coding to address human error and reduce work time from three years to one day.
  
- Re: (Score:2)
  
  by greytree ( 7124971 ) writes:
  
  Came here to say this.
  
  When I learnt about how awful PDF was inside, I wrote a PDF parser as a challenge to myself.
  
  Now it scares me even more.
- Re: (Score:2)
  
  by NoWayNoShapeNoForm ( 7060585 ) writes:
  
  That would have helped all of us and brought World Peace at the same time.
  But does nothing about equity? Or did you forget to mention that?
Link to the (2-year-old) docs directory (Score:3)

by greytree ( 7124971 ) writes: on Thursday June 15, 2023 @02:42AM (#63604308)

They've been here since May 14th, and were crawled in "July/August of 2021".

https://downloads.digitalcorpora.org/corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/

TL;DR (Score:2)

by LindleyF ( 9395567 ) writes:

JPL accessed an existing database of crawled content and wrote a tool to get more of the incomplete stuff.
Click here to read the document... (Score:2)

by sabbede ( 2678435 ) writes:

I wonder how many of those phishing attachments will be in there.
In my never-humble opinion, they are a dumb attack. The very notion of attaching a document that's a link to the document you could/should have attached in the first place is a bit insulting.
PDF gone feral (Score:2)

by kaur ( 1948056 ) writes:

PDF is a print industry internal file format gone feral. It should never have hit end user workspace in the first place. But it escaped - and Adobe has been abusing it successfully ever since.
And now we try to use it as a database.
- Re: (Score:2)
  
  by drinkypoo ( 153816 ) writes:
  
  PDF is a print industry internal file format gone feral. It should never have hit end user workspace in the first place.
  We already had PostScript in the hands of end users, and PDF is just a more modern PostScript (mostly it has more stuff tacked on).
Not Searchable for Specific .PDF Titles (Score:1)

by RayDonaldPratt ( 6966556 ) writes:

I fell in love with pdfdrive.com because it allowed me to search for and download texts that sometimes are not even being sold anymore. Further, even when I have the actual book, I find that it is easier to both read and keep my bookmark automatically when reading a .pdf on my computer (I used to prefer books because it is easier to see where you are at in the book and jump around, but trying to read a book and keep it open can be a pain). So, when I saw this news article that terabytes of .pdf documents we
