Slashdot
National Archive File Format Time Bomb
Posted by
ScuttleMonkey
on Wed Jul 04, 2007 12:57 PM
from the cleaning-up-your-own-messes dept.
geordie_loz writes "The BBC is reporting that the UK National Archive is warning of old formats being a 'ticking time-bomb' where data is going to be lost because of incompatibility in newer versions of software, and software not existing at all. More surprisingly, Microsoft has offered a solution via the OOXML format."
Related Stories
Microsoft Announces OOXML-UOF Project with China 106 comments
Andy Updegrove writes "Today, Microsoft announced its own interoperability project to bridge the gap between China's domestically developed Uniform Office Format (UOF) and Microsoft's OOXML. In the continuing tit for tat battle between ODF and OOXML, this announcement tracks the intent of an already-existing 'harmonization' committee, hosted by OASIS, that is exploring interoperability options between ODF and UOF. Like the OOXML-ODF translator project announced by Microsoft last year, the new effort will be an open source project hosted by SourceForge. The announcement is, in one sense, no surprise. Microsoft has been waging a nation-by-nation battle for the hearts and minds of ISO/IEC JTC1 National Bodies, in an effort to win adoption of OOXML (now Ecma 376) as a global standard with equal status to ODF (now ISO 26300). In order to do so, it needs to offset the argument that one document format standard is not only enough, but preferable. With UOF representing a third entrant in the format race, easy translation of documents would obviously be key to lessen the burden on customers of products based upon one format or the other."
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Tagging beta... (Score:2, Insightful)
MS should not own the standard (Score:2)
Re:MS should not own the standard (Score:5, Informative)
Re: (Score:3, Informative)
While the GP may or may not have been exactly sure what they were referring to, it doesn't make them wrong.
Re:MS should not own the standard (Score:5, Funny)
Rubbish. I've worked at places with an Open Office format. Basically they open the office to any monkey who turns up for a job interview and a handful of people have to make up for their incompetence.
Such precise terms as (Score:4, Informative)
Because if you want to include bugs etc, then no, it doesn't support each and every 2007 feature.
If you mean supporting tables, nested documents, embedded graphs, scripting and so on, yes.
It may not be "click the same buttons" feature correct nor probably the "run the same VB code" compatible.
Take a look at some of the people on the board that devised ODF. They include the US National Archives. Print media. Archivists.
Y'know, people who KNOW DOCUMENTS.
As to the remainder of your questions, there is a process, it does have to go through committee (else how does everyone else know how to implement the new standard? MS doesn't have this problem since they only want themselves to know their updated standard). It is XML so it is extensible (decode the initialism). The process will take as long as it takes. Much the same as Vista will take as long as it takes to get SP 1 out.
I don't see how these latter issues are something that is a part of ODF and not any form of standardisation that OfficeXML will have to go through for anyone other than MS to implement...
Re:One thing I'd like to know (ODF question) (Score:5, Informative)
Yeah, it's XML. Also, unlike OOXML, ODF uses namespaces, so you can create a separate standard if you don't want to muck around with ODF.
It would depend. The thing about changing standards is that it causes problems for all sorts of people. There is a real need for a stable and standardized document format that just doesn't change, or if it does, very slightly.
Idiots (Score:4, Insightful)
There are so many idiots in this state of the affairs:
1. the idiots which decided to build a huge archive in an undocumented proprietary format
2. idiots which believe they can't find even a single copy of the software they need
3. idiots who didn't store a single copy of the software that reads the format, together with the archive (not very far from obvious, is it).
4. idiots who want to convince other idiots that OOXML is an open format (versus a straight XML serialization of whatever the binary DOC was in the MS source code base at the time)
Re:Idiots (Score:4, Interesting)
Please give me a link to a copy of the Professional Write 3 (PW) software app. for MSDOS 6.
Yep, I had that very problem some years ago when I was cleaning my room and found several 5 1/4 diskettes which contained the
Re:Idiots (Score:4, Insightful)
Re:Idiots (Score:5, Insightful)
The system wasn't thought up any more than a library thinks up all the books it contains.
Re:Doesn't matter. (Score:5, Interesting)
Why haven't they been converted? Really, all their DIGITAL archives should be in a single format by now.
No, they shouldn't. You usually want 3 formats:
- the original format of the document. Whatever whichever idiot happened to write (or record, or video) it in, you absolutely want the original in your records.
- a searchable format (eg OCR'd text from scanned image docs)
- a rendered format. (eg an image or pdf, or svg - something open enough that you can continue to show how the doc would have looked). The appropriate rendered format varies. Paper is not an appropriate format for storing CCTV footage, for example
If you're very, very lucky the original is both searchable and viewable; like, say, HTML. It gets more complicated too, because you often want to store a redacted copy of the document (think of the Onion story 'CIA realise they've been using black highlighter pen all these years') and you want that searchable too, so you have to keep a redacted searchable format too... and of course, some of the records are on actual paper. Have you started worrying about the fading inks in the originals yet?
BTW you can't restrict the format of the original. Consider an email from a corporate bidding for a govt contract, with attachments. They need to keep those.
- Mr. E
PS, posting anon because I have dealings with the national archives, and don't want to speak for my company.
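The three-format scheme described above can be sketched as a small record type. This is only an illustration of the idea, not how any real archive stores its data: the class name, field names, and helper function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ArchiveRecord:
    """One archived item kept in several parallel representations (hypothetical layout)."""
    original: bytes                          # the untouched source file, in whatever format it arrived
    original_format: str                     # free-text label for that format, e.g. "email with attachments"
    searchable_text: Optional[str] = None    # e.g. OCR output or extracted text
    rendered: Optional[bytes] = None         # e.g. a PDF or image showing how the document looked
    redacted_text: Optional[str] = None      # optional searchable copy with sensitive parts removed


def missing_representations(record: ArchiveRecord) -> List[str]:
    """Return the names of the derived representations this record still lacks."""
    missing = []
    if record.searchable_text is None:
        missing.append("searchable")
    if record.rendered is None:
        missing.append("rendered")
    return missing


# A freshly ingested email: the original is always present, the derived copies are not yet.
record = ArchiveRecord(original=b"raw message bytes", original_format="email with attachments")
todo = missing_representations(record)
```

The original is a required field because, as the comment says, you cannot restrict its format and you always keep it; everything else is derived and can be regenerated later.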
Re: (Score:3, Insightful)
1/2 pentabyte = 20 bits? (Score:5, Funny)
A pentabyte is 5 bytes, right? How hard is it to store 20 bits on paper? ;)
(I assume petabyte (10^15 or 2^50, depending on convention) is the word you're looking for.)
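For anyone who wants to check the arithmetic in the joke and in the two petabyte conventions, a quick sketch (the variable names are just for illustration):

```python
BITS_PER_BYTE = 8

# Half of a hypothetical 5-byte "pentabyte", expressed in bits.
half_pentabyte_bits = 5 * BITS_PER_BYTE // 2

# The two conventions for a petabyte mentioned above.
petabyte_decimal = 10**15
petabyte_binary = 2**50

# How much larger the binary convention is than the decimal one.
ratio = petabyte_binary / petabyte_decimal
```

The binary figure comes out roughly 12.6% larger than the decimal one, which is why the convention matters at this scale.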
Re: (Score:3, Insightful)
Er, even if you translate it into other languages, they'll evolve too. Try reading Old French much? And translation also leaves you with the headache of reconciling various translations and figuring out which is "more correct" (IIRC the Bible has this problem). It would be a much better idea to make redundant copies, to guard against bitrot and store them as physically apart as possible.
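A minimal sketch of how redundant copies can guard against bitrot: hash each copy and treat the most common digest as the good one. The function and site names are made up for illustration, and the approach assumes that a majority of the copies are still intact; with only two copies, or an even split, it cannot tell which side is damaged.

```python
import hashlib
from collections import Counter
from typing import Dict, List, Tuple


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def find_damaged_copies(copies: Dict[str, bytes]) -> Tuple[str, List[str]]:
    """Return the majority digest and the sites whose copy disagrees with it."""
    digests = {site: sha256_hex(data) for site, data in copies.items()}
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    damaged = sorted(site for site, digest in digests.items() if digest != majority_digest)
    return majority_digest, damaged


# Three copies stored far apart; one has a flipped character.
copies = {
    "site_a": b"Domesday Book, folio 1",
    "site_b": b"Domesday Book, folio 1",
    "site_c": b"Domesdby Book, folio 1",
}
good_digest, damaged_sites = find_damaged_copies(copies)
```

A damaged copy can then be overwritten from any site whose digest matches the majority.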
Re: (Score:3, Interesting)
For example, I keep a copy of DOS and Win3.1 ISOs (about 20MB total) and Norton Commander (3 floppy images!) on a DVDR, along with a copy of Virtual PC. This lets me recreate a Windows 3.1 virtual PC anytime I want.
Now.... You can do that now. However, in 100 years, will this be possible? You do not know what the future brings. Let's not even talk about 1000 years and beyond. Now; you backed this stuff up on a DVD and you die tomorrow. Your kids keep the data, and when they die a historian speciali
Re: (Score:3, Interesting)
My copy of Office XP won't activate on any of the computers I currently own (the hardware it was originally activated on is long-dead), and that's only 5 years old.
Re: (Score:2, Interesting)
Along with Professional File (database product)....
Re: (Score:3, Interesting)
Which seems reasonable at a time when "everyone" has a computer that'll read it, for example when it comes to image viewers there's software covering literally hundreds of formats without issue.
2. idiots which believe they can't find even a single copy of the software they need
It's supposed to be an archive, not a "well we'll have to dig up a copy of the software, I'll get back to you in some months".
3. idiots who didn't st
Re: (Score:3, Insightful)
That's easier said than done. You'd have to keep multiple copies of everything, including hardware, up to the point where you're confident you have a stable standard - probably the power mains - and that's if you're not worried about violating licenses. Of course, with the advent of online apps, there is no way to snapshot the entire ecosystem of servers and softwa
Re: (Score:3, Insightful)
It's not just about the software... (Score:3, Interesting)
I'm sure that most of the archive data created today is stored on something like DVDs but, as recently as the early 1990s, the official long-term storage medium for the UK government was Syquest 44MB removable cartridge hard drives [wikipedia.org].
I know that I have a working 44MB drive (well, when I last fired it up, which would have been sometime last decade) somewhere in my attic but I doubt that too many of these drives are still in existence.
I only hope that the
Re: (Score:3, Interesting)
It rests on
1) Physical storage medium -- whether this is Flash, Hard Drive, Optical Medium, [NV]RAM, etc., all these technologies may be very difficult to retrieve data from, especially if the level of technology happens to go down in the future (say, global thermonuclear war). Even if data is retrieved, there's no guarantee that it's intact after 1000 years (the dyes in CDs will have decomposed by that
Use SGML (Score:5, Funny)
Re: (Score:3, Funny)
2:2 And on the seventh day God said
2:3 And God watched gcc running and sanctified it, because it would have taken Him at least two weeks to write the whole thing in machine code.
The big lie... (Score:5, Informative)
To give it a proper name, the format is "Microsoft Open Office XML"; they deliberately went to a lot of trouble to pick a name that's as easy to confuse as possible with OpenOffice.
Re: (Score:3, Informative)
Re: (Score:3, Funny)
You don't understand the format then. Office Open XML is the ultimate in
upgrades free money for MS (Score:2)
Open Formats (Score:2)
Obviously... (Score:5, Funny)
Oh yeah, their solution? Virtualised Windows 3.1. And obviously in 15 years you'll have to virtualise Vista in order to run the Win3.1 virtual machine to run Word. And Microsoft will be paid a license for each application and level of virtualisation.
You couldn't make this stuff up.
Bright people don't make tech decisions (Score:5, Interesting)
Unfortunately, those bright people don't get to make technical decisions.
The British Library recently introduced SED [www.bl.uk], an electronic document delivery system. With SED, you can order electronic copies of journal papers and articles from their archives. Great idea! Previously, you had to wait for the documents to come through the post, and that would take a week or so. Now you get them by email in a couple of working days.
Except that the documents are crippled by Adobe DRM, which imposes the following restrictions:
- You can only view them using certain specific versions of Acrobat Reader (6 or 7) - the latest version is not recommended [www.bl.uk].
- The software only works on Windows 2000 or XP. No Linux support, no Mac support. Vista might work, but again, it's not recommended.
- You can only look at each document for a limited time, and you can only print it once.
So, if you want to use the service, you'd better hope that you have (a) the right version of Windows, (b) the right version of Acrobat Reader, (c) a reliable net connection, and, most importantly, (d) a very reliable printer that won't chew up the document. Unless you're a filthy dirty pirate, of course.
If Adobe managed to convince the British Library to put up with this ridiculous system, I am sure that Microsoft will have no difficulty convincing them about their archive "solution". If SED is anything to go by, it'll be another awful implementation of a great idea.
More surprisingly!? No, UNsurprisingly (Score:4, Interesting)
As "well intentioned" as Microsoft may be, Microsoft's Open XML cannot be anything but proprietary when its code references Windows and Office API functions rather than more precise data format information as with ODF. (For more information about this, you might search out the arguments against making OOXML an ISO standard.)
Doesn't open source solve this (Score:5, Informative)
Also--as noted, the OOXML format is a nonsolution for this nonproblem. It seems like it would be a waste of effort--why convert a bunch of files to a format that may die just as quickly as any other format, when you can just leave the file as is and open it in OOo (assuming I'm correct that they won't stop read support for dead formats)?
Also, it seems to me that no current format or any future format will ever solve this nonproblem because formats will always change as new functionality is continually added. The better solution is to keep this a nonproblem by having open source software that can read old file formats.
Real Issue (Score:2)
The only solution IMHO is _open and documented_ interfaces, protocols, programs, data types and hardware. In the future they won't be able to read our disks and files. They can only try to build a machine that reads our disks and files, and for that they need documentation of how they work.
surprise? (Score:5, Insightful)
Brilliant.
"What, the shit I sold you yesterday stinks? Try this new shit, it's great and it has none of the problems of the old one."
That's what you hire PR people for.
How about some *helpful* suggestions (Score:5, Insightful)
To kick things off here's one:
Keep EVERYTHING in the simplest possible format. ASCII would seem sensible, since it's the content we care about, not the formatting. (although that wouldn't help our Asiatic brethren much). Then keep decent records of HOW you can read that format. With examples of the software and hardware. Do this bit on PAPER. V. Tough Paper (or rock, or plastic or whatever). Update the explanations every other year, to put it in language the next gen will understand. Maybe also have instructions on how to translate the simple format to less simple things.
I guess, basically, it's a case of KISS and then *provide a persistent and regularly updated 'Rosetta Stone'* for latecomers to work from.
As a side branch, this kind of reminds me of discussions I read about a while back of how to warn future generations about Nuclear Waste dumps (y'know, the really nasty stuff with half-lives in the thousands of years range). I don't think anyone ever came up with a decent answer....
Re: (Score:2)
Re:How about some *helpful* suggestions (Score:5, Funny)
Re: (Score:3, Interesting)
ASCII would seem sensible, since it's the content we care about, not the formatting.
No, the formatting is important as well. Sometimes 'the medium is the message', and that whole bunch of artsy crap we geeks would prefer to ignore. - Just think of it as an engineering challenge in order to make the pain go away.
You always archive the original (unless you have a batch; then you sample one and call it the original), and that original can be in just about any format, hand-written, coffee-stained, in Sanskrit. When scanning a document into an electronic archive the ideal would be to have OCR
Re: (Score:3, Interesting)
plausible, and wrong." -- H.L.Mencken, The Divine Afflatus (1917).
Ok, you started by identifying one problem - Asian languages. In fact, pretty much every non-US language, since you said ASCII and not Latin1. So we can extend that to UTF-8 with no problems, except there's probably a huge table just for the 100000 characters or so, even though the spec is quite short.
But then, you have only characters, which is probably fine for basic text. How ab
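The ASCII versus Latin1 versus UTF-8 point can be checked directly. A small sketch, where the sample strings and the helper name are arbitrary choices for illustration:

```python
# Arbitrary sample strings for three languages.
samples = {
    "english": "archive",
    "french": "déjà vu",
    "japanese": "記録",
}


def encodable(text: str, encoding: str) -> bool:
    """True if every character of text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Which samples survive in each encoding.
support = {
    encoding: [name for name, text in samples.items() if encodable(text, encoding)]
    for encoding in ("ascii", "latin-1", "utf-8")
}

# Plain ASCII text is byte-for-byte identical when stored as UTF-8.
ascii_is_subset = "archive".encode("ascii") == "archive".encode("utf-8")
```

ASCII handles only the English sample, Latin1 adds the French one, and UTF-8 covers all three while leaving existing ASCII files unchanged.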
Re: (Score:3, Interesting)
We fought a lot with this at Siemens (Sietec) about fifteen years ago, when trying to decide what format to use on stackers full of 12" WORM disks, which were just nicely becoming useful for large-scale archival storage in those days. We needed a format that would outlast the disks, which probably meant 50-100 years assuming normal replacement/turnover.
We ended up with the bottom level being a WORM standard, which was served out to users via the NFS standard, which was reasonably close to a Unix filesystem
Microsoft lecturing about open standards?!? (Score:2)
You can almost hear the slime dripping. (Score:2)
The real problem seems to be the credulous morons in charge of the National Archives project.
I've never understood this argument... (Score:2, Informative)
Re: (Score:2, Insightful)
IBM (Score:3, Informative)
It has been a while since I worked on an AS/400 system... so anyone with updated info please feel free to correct me if things have changed.
It seems like a no-brainer.
Link: http://en.wikipedia.org/wiki/AS/400 [wikipedia.org]
Re: (Score:3, Insightful)
Re: (Score:2)
(Not if. When.)
Re: (Score:3, Informative)
Yes, that was sarcastic, but you deserved it.