Thursday, July 19, 2007

Open Formats, Archiving, and Retrievability in the Future of Electronic Documents

ISO, ECMA, IEEE, ITU, W3C et al. are all "standards" organizations that exist solely to publish - and enforce - a set of agreed methods of communication, so that what is sent by "A" is received by "B" the same way every time. In effect, each standard is a defined translator for a given function.
Office Open XML (OpenXML), OpenDocument (OpenDoc), ASCII, EBCDIC and Hollerith are all standards in that they tell others how to encode human speech and symbols into a machine readable form.
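As a rough illustration of what such a standard pins down, here is a minimal Python sketch that encodes the same four letters under ASCII and under one EBCDIC code page. The choice of `cp037` (one of several EBCDIC variants that ships with Python) and the helper name `decode_or_none` are my own assumptions for the example, not anything the standards bodies prescribe.

```python
# Illustration only: the same text under two character-encoding standards.
text = "NARA"

ascii_bytes = text.encode("ascii")    # 7-bit ASCII
ebcdic_bytes = text.encode("cp037")   # cp037 is one common EBCDIC code page (assumed here)

print(ascii_bytes.hex())
print(ebcdic_bytes.hex())


def decode_or_none(raw, encoding):
    """Hypothetical helper: decode raw bytes, or return None if they are invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# With the right standard named, the bytes turn back into the original text.
print(decode_or_none(ebcdic_bytes, "cp037"))

# Read under the wrong standard, the very same bytes are not text at all.
print(decode_or_none(ebcdic_bytes, "ascii"))
```

The bytes on their own say nothing about which standard produced them; "A" and "B" have to agree on that beforehand.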
All is well and good!
Wrong.
The problem with any machine readable encoding method is not how the data is coded but WHAT it is encoded (stored) on.
A Hollerith card takes up lots of room, but if it is stored properly you could STILL read the card 200 years from now - maybe not by machine, but with your eyes. Data stored on a magnetic tape may still be intact, but if there are no machines left to read the tape it is as good as never having been saved. Storing data is not very hard - use the latest and greatest medium you have at the time. Storing it so that you can still read it hundreds of years from now - or even 10 years from now - is the real problem.
The fight over OpenDoc vs MS OpenXML never addresses the REAL long term problem, nor the problems that THESE formats will create. They sound very good - an open standard that anyone can write to, and that anyone can write programs to read - but it comes down to this: you need a PROGRAM to read the documents AND a method to STORE and RETRIEVE them from a repository!
NARA - the National Archives and Records Administration (USA) - has been dealing with this problem since the 1950s: data given to them on tapes, cards, etc. can no longer be accessed, since there are no machines left to read them!
During WWII, every American captured by the Germans had a punch card created about them by the US Army, which tracked them through the various POW camps - all the information about a single POW was on a single card. By the time the boxes were discovered at NARA, the machines that read these cards were no longer around, and even the meaning of the card positions was not known. However, because the cards were physical, and because known data about specific POWs was available to match against THEIR cards, a researcher was able to reverse engineer the card layout, and now all those records are online at NARA - in a database.
Accessible, yes, but now that the records are in a database you MUST have an application to read them, you MUST have a storage method, and you MUST update the application and back end software continuously. On top of that, the data is likely in a relational database, which means you MUST document the relationships between all the fields and tables - and if you EVER lose that document, all the information is LOST FOREVER. It becomes a meaningless mass of data. And of course all this recovery information is ALSO stored electronically! In this case, since the physical cards are still around, you can always reconstruct the information.
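To make the "meaningless mass of data" point concrete, here is a small Python sketch of a fixed-width record and the layout document needed to read it. The field names, column positions and sample record are all invented for illustration; this is not the actual layout of the POW cards or of any NARA database.

```python
# Hypothetical layout document: field name -> (start column, end column),
# zero-based, end exclusive. Invented for illustration only.
LAYOUT = {
    "serial": (0, 8),
    "surname": (8, 20),
    "camp_code": (20, 23),
}


def parse_record(card, layout):
    """Slice one fixed-width record into named fields using the layout."""
    return {name: card[start:end].strip() for name, (start, end) in layout.items()}


# A made-up 23-column record: 8 columns, then 12, then 3.
card = "12345678" + "SMITH".ljust(12) + "04B"

# Without the layout, this is all anyone has: an undifferentiated run of characters.
print(repr(card))

# With the layout, the same characters become information again.
print(parse_record(card, LAYOUT))
```

Lose `LAYOUT` and nothing in the record itself says where one field ends and the next begins, or what any of them mean. The researcher who decoded the cards was effectively rebuilding that table by hand from known cases.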
The problem with these "open" document standards is that the documents MUST be stored in a machine readable format, which places them solely in an electronic realm where everything is programmatically driven just to get at the stored document. You can bet that governments will not store them in a single file folder but in a relational database, which means you have another layer of abstraction. They will certainly not be stored on a local drive but on a massive storage repository (a SAN, or whatever follows the SAN), and then you are dependent upon a private company (or, if it is open source, on the kindness of others) to ensure that you can even get at that data.
The US Patent Office has been scanning all the paper documents submitted with patents and then throwing them away to cut down on storage costs. Sounds great: now everything is online and accessible. However, they are now solely dependent, forever, on the electronic storage method they implemented, or on converting it again and again into a new storage medium or format as technology moves on. Any single document will NEVER be better than the initial scan of it, which was done using today's technology. And no one will ever be able to look at the true original document again (except for the dumpster divers who have been picking through dumpsters behind the Patent Office to find originals to save and later sell, if and when they deem them worth parting with).
The "open doc movement" heavyweights are using the "shaver" method to sell the open doc idea to the Government: give the format away free, then sell the means of reading the documents - the storage and the front end programs that allow the Government to find the "open" documents.
The open document format sounds good, but the back end is still very closed and locked.
