Diving into Digital Ephemera: Identifying Defunct URLs in the Web Archives (loc.gov)

Olivia Meehan, who worked on the web archiving team at the US Library of Congress, evaluates how well the websites captured in the Papal Transition 2005 Collection have survived: Based on the results I have so far and conversations I've had with other web archivists, the lifecycle of websites is unpredictable to the extent that accurately tracking the status of a site inherently requires nuance, time, and attention -- which is difficult to maintain at scale. This data is valuable, however, and is worth pursuing when possible. Using a sample selection of URLs from larger collections could make this more manageable than comprehensive reviews.

Of the content originally captured in the Papal Transition 2005 Collection, 41% is now offline. Without the archived pages, the information, perspectives, and experiences expressed on those websites would potentially be lost forever. They include blogs, personal websites, individually-maintained web portals, and annotated bibliographies. They frequently represent small voices and unique perspectives that may be overlooked or under-represented by large online publications with the resources to maintain legacy pages and articles.

The internet is impermanent in a way that is difficult to quantify. The constant creation of new information obscures what is routinely deleted, overwritten, and lost. While the scope of this project is small within the context of the wider internet, and even within the context of the Library's Web Archive collections as a whole, I hope that it effectively demonstrates the value of web archives in preserving snapshots of the online world as it moves and changes at a record pace.

  • To prevent losing valuable history, there should be laws removing copyright from anything that the copyright owner cannot show is properly archived and ready to be opened to all on copyright expiry.

    Copyright must be earnt.

  • by Anonymous Coward

    "The internet" is nothing more than computers networked together. It serves as the transport.

    What's talked about here is the "world-wide web" of collections of documents interlinked through hyperlinks.

    You can't really duplicate "the internet" for archival, nor the entire world-wide web, but you can archive individual websites.

    Though even that is pretty hard to get right (looking at you, archive.org).

    It certainly doesn't help if the archivers can't tell the two apart.

  • by rnturn ( 11092 ) on Friday August 05, 2022 @05:20PM (#62765940)

    ... before attempts to visit defunct sites began resulting in replies containing various forms of "You can buy this domain..."

    I can barely imagine the problem that some archival sites would have trying to keep up with sites that are dropping off the internet. How would they deal with domains that existed, might have been archived somewhere, went dark, were re-purchased, and went live again (possibly with a purpose completely different from the original incarnation)? Whew!

    Even my little view of the internet is a mess to keep track of. Wasn't it Netscape that had a feature that would go through your bookmarks and identify the dead sites/pages, so that either you could remove them or the browser would remove them automagically? I have a huge number of bookmarks to sites/pages, many of which are now dead, but no way to clean them out short of taking a weekend to manually visit each one and delete those that respond with "404" (or whose domains are now for sale). Not something I relish doing.

    Slight aside: If you're looking for a list of defunct pages, you could begin by looking at the references at the end of most Wikipedia pages, especially those pointing to newspaper and magazine sites. Noting all the dead links in Wikipedia pages hammered home why -- when my daughters were still in school -- their teachers wouldn't allow use of Wikipedia pages as references. The links to the original source material are frequently dead.
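
    For what it's worth, the bookmark cleanup described above can be roughed out in a few lines of Python. This is only a sketch under some assumptions: the function names (fetch_status, classify, check_bookmarks) and the labels "dead", "moved" and "alive" are made up for illustration, only 404/410 responses and network failures count as dead, and a redirect to a different host is merely flagged rather than judged.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def fetch_status(url, timeout=10.0):
    """Return (HTTP status or None, final URL after redirects)."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code, url
    except (urllib.error.URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return None, url


def classify(url, status, final_url):
    """Label a bookmark 'dead', 'moved' or 'alive' (hypothetical labels)."""
    if status is None or status in (404, 410):
        return "dead"
    if urlsplit(url).hostname != urlsplit(final_url).hostname:
        # Redirected to another host: possibly re-registered or parked.
        return "moved"
    return "alive"


def check_bookmarks(urls, fetch=fetch_status):
    """Map each URL to its label; `fetch` is injectable for offline testing."""
    return {url: classify(url, *fetch(url)) for url in urls}
```

    Note that a parked "you can buy this domain" page usually answers with a plain 200 on the same host, so a check like this would still report it as alive. Catching those would presumably need a look at the page content, which is the part that takes the nuance and attention the article talks about.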
