[MissoulaGov] archival and storage of government data
Michael Loftis
mloftis at wgops.com
Thu Jul 10 11:37:59 MDT 2008
Hello All,
Seth McClain pointed out to me there was some discussion about records
archiving sometime last month either during session or on this list. Being
an IT/Technology professional, employed as the Operations Manager and Sr.
Systems Administrator at Modwest, I understand some of the challenges
involved. Just as important as the where is the how. Data formats in
computers are constantly changing and evolving. One of the things that
must be avoided is any closed format for long term archives. This includes
*ALL* proprietary formats, whether by Apple or Microsoft, IBM or EMC. I
refer to long term as 20+ years, though some might say as little as 5 or 10.
Truly open formats are the only way to reliably archive data. PDF, DOC,
XLS, DOCX (despite what Microsoft claims), are all closed formats.
Microsoft's formats are extremely difficult to decode and render, and any
package other than the one that originally wrote it (and the same version)
may or may not render the same information when opening the same file.
Excel files are very vulnerable to this with all the multitude of internal
date and number storage formats that all vary slightly. It only takes one
programming error while reading or writing one of these formats for the
data to start to silently change, whether in display, or actually "on disk"
as the data is saved. With true text formats, any machine, and any human
can read the data. A few simple extra checks can verify the data you're
reading is the data on disk. You can even go a step further and digitally
sign archive data using GnuPG/OpenPGP, sort of like a wax seal.
Honestly, and this will raise eyebrows in tech, the only good archival
format is plain text.
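
As a rough illustration of the "simple extra checks" mentioned above, here is a minimal Python sketch that records a SHA-256 checksum for every file in an archive directory and later reports any file that no longer matches. The manifest file name and the function names are made up for this example; nothing here is a prescribed tool or layout.

```python
import hashlib
from pathlib import Path

MANIFEST_NAME = "MANIFEST.sha256"  # hypothetical name, chosen for this sketch


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_dir: Path) -> Path:
    """Write one 'digest  relative/path' line per archived file."""
    manifest = archive_dir / MANIFEST_NAME
    lines = []
    for path in sorted(archive_dir.rglob("*")):
        if path.is_file() and path.name != MANIFEST_NAME:
            rel = path.relative_to(archive_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {rel}\n")
    manifest.write_text("".join(lines), encoding="utf-8")
    return manifest


def verify_manifest(archive_dir: Path) -> list[str]:
    """Return the relative paths that are missing or whose digest changed."""
    problems = []
    manifest = archive_dir / MANIFEST_NAME
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line:
            continue
        expected, rel = line.split("  ", 1)
        target = archive_dir / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

The manifest is itself plain text, so it can be read by a person and stored alongside the records. A detached GnuPG signature over that one file (for example with `gpg --detach-sign`) would then act as the wax seal for the whole set, assuming the signing key is managed properly.
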
Next to that is, yes, PDF. Even though it's proprietary, Adobe has shown
interest in providing readers for a multitude of platforms, in not barring
those who would make their own viewers, and in ensuring backwards
compatibility. One step below PDF, and just as capable, but not as
efficient, is PostScript (PS). PS is open. PS has been around for a long
time, and it's designed for printing and formatting of documents. PDF is
actually based on PS. PDF and PS, both designed for portable layout and
print, make it harder for data to be hidden or accidentally manipulated by
errors in the reader. If a PDF or PS reader fails for some reason, you
more usually end up with garbage, rather than a date shifting a few days,
a number shifting either subtly or massively (such as in Excel), or
old text showing up in place of newer text (such as in Word).
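
One concrete way a spreadsheet date can shift silently: Excel workbooks store dates as day counts, and a workbook may use either the "1900" or the "1904" date system. The Python sketch below reads the same stored number under both systems. It is only an illustration of the general hazard described above, not a claim about which specific failure any particular reader has.

```python
from datetime import date, timedelta

# Day zero of each workbook date system. The 1900 epoch is written as
# 1899-12-30 so that serials from 1 March 1900 onward come out right,
# because that system counts a 29 February 1900 that never existed.
EPOCH_1900 = date(1899, 12, 30)
EPOCH_1904 = date(1904, 1, 1)


def serial_to_date(serial: int, epoch: date) -> date:
    """Interpret a stored day count relative to the given epoch."""
    return epoch + timedelta(days=serial)


stored = 39639  # the number actually saved in the file
as_1900 = serial_to_date(stored, EPOCH_1900)
as_1904 = serial_to_date(stored, EPOCH_1904)

print(as_1900.isoformat())       # 2008-07-10
print(as_1904.isoformat())       # 2012-07-11
print((as_1904 - as_1900).days)  # 1462
```

Nothing in the stored number says which reading is right, so a program that guesses the wrong system shows a plausible but wrong date. A text archive that spells out "2008-07-10" does not have this problem.
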
One of the newest formats that MAY be a solution is OpenDocument (NOT to be
confused with OpenDoc which was a Xerox PARC inspired competitor to a
similar Microsoft technology). It's a very new format, but it is fully
open and was developed by open standards groups with community input. It has
proven to be a very useful format and has gained quite a bit of community
support, so much so that it was published as an ISO standard, ISO/IEC
26300:2006,
<http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=43485>
(link checked today 2008/07/10). Microsoft introduced a competing DOCX
standard that is proprietary. OpenDocument has a much smaller, much
clearer specification for implementation than DOCX which will lead to wider
adoption, just because of ease of implementation. Don't get me wrong, if
you have to choose a proprietary format DOCX is better than DOC, because
DOCX under the hood is XML which is itself text. DOCX is also a published
spec, unlike DOC, XLS, etc, which have only been learned by much trial and
error.
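
To make "XML under the hood" a little more concrete: both DOCX and OpenDocument files are zip containers holding XML parts (the main body is `word/document.xml` in a DOCX and `content.xml` in an OpenDocument text file). The Python sketch below pulls the text out of one XML member of such a container using only the standard library. The helper name and the tiny stand-in container are invented for this example, and the sample XML is not valid DOCX or OpenDocument markup.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def text_runs(container: bytes, member: str) -> list[str]:
    """Return the non-empty text nodes of one XML member in a zip container."""
    with zipfile.ZipFile(io.BytesIO(container)) as archive:
        root = ET.fromstring(archive.read(member))
    return [text.strip() for text in root.itertext() if text.strip()]


# Build a tiny stand-in container in memory so the sketch runs without a
# real document. The XML below is made up and is NOT valid DOCX or ODF.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<doc><p>Resolution 7</p><p>Adopted</p></doc>")

print(text_runs(buffer.getvalue(), "content.xml"))  # ['Resolution 7', 'Adopted']
```

The point is that even without the original word processor, generic zip and XML tools can still get at the words. Recovering the exact layout is a much bigger job and depends on the published specification.
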
So, that's a whole lot about the how, and I don't know if anyone even
cares. A bit about the where, with the how of the where. Tape formats
degrade and change. DVDs degrade, CDs degrade. Common blank or burnable
CDs/DVDs may only last 20 years when stored carefully; they're designed for
convenience, not long term storage. That's assuming you even have a device
or system capable of reading them in 20 years. Long term digital storage
is a difficult issue. I don't have any good answers here. I'm not certain
the industry does either. This is outside the realm of my own experience,
but I can say it's not as easy as putting it onto a DVD or a digital tape,
putting it into a locker, and forgetting about it. 20 years from now that
DVD or that tape may be blank. Right now the best archive medium I know of
(and a digital archiving or optical archiving specialist may have a better
answer!) is specially designed hard drive libraries. The drives are
commodity, and will be available for a relatively long time. During
periods of non-use, these systems, unlike most storage arrays, are designed
to shut the drives down. They're also designed to occasionally, on their
own, start the drives, and check all the data in the archive, to make sure
it's still OK, and alert operators of any need for attention.
And along with this is geographic redundancy (if you want to take it that
far). Just like a paper records room getting flooded or burned, it's very
expensive, and very difficult, to recover data from computer systems that
have been exposed to similar hazards. The plus is that for
offline/emergency storage it's much easier to make digital copies than it
is to make paper ones.
Bringing this back to the direct scope. Choosing a free and open format
will always increase the likelihood that 20 years from now you will be
able to read and display the format, assuming your underlying storage
medium is viable. Store the specifics of the format WITH the stuff you're
storing, in text format. "Example" readers are also appropriate to store
WITH your archive when you're talking about long term archives. As for the
underlying storage medium, if you choose a removable medium, periodic checks
that the drives and media still function are a must. You may have to, over
time, migrate the data digitally from one medium to another.
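
A migration between media is only useful if the copy is checked. The following Python sketch copies every file from an old location to a new one and compares digests afterwards. The function names are hypothetical, and it assumes the two locations are separate directory trees that are both mounted on the machine doing the copy.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file; reads it whole, which is fine for a small sketch."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def migrate(old_root: Path, new_root: Path) -> list[str]:
    """Copy every file to the new medium; return paths whose copy differs."""
    mismatched = []
    for source in sorted(old_root.rglob("*")):
        if not source.is_file():
            continue
        rel = source.relative_to(old_root)
        target = new_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copies contents and timestamps
        if file_digest(source) != file_digest(target):
            mismatched.append(rel.as_posix())
    return mismatched
```

An empty result means every copy matched at the time of the check. Reading a file back immediately may be served from the operating system's cache rather than the new medium itself, so a second pass after remounting the medium is a sensible extra step.
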
I don't say this to scare the city off, but to show the long term goals.
Getting started and having the digital data are extremely important. Once
the data is there, and existing in a common format or common formats
(specific encoding for video/audio files, specific encoding for text, for
databases, for spreadsheets) then you've overcome the first, very large,
hurdle. You want to try to avoid having to migrate the data to different
formats, as this is where you can lose data (usually by accident) due to
mistakes made in the transcribing process, either by humans, or by machines.
Storing digitally might mean for near/short term a $3000 or $6000 commodity
server with commodity disks, giving an area where data can be stored very
reliably. It's a start, not the end all be all, and in 20 years that
machine will not be working. Digital storage, just like paper records, is an
ongoing stewardship process that, once started, must be maintained.
Once you're storing stuff digitally, you can start to look at long term
needs. They can be separate from your near term needs, or both might be
covered by one all encompassing solution.
Another difficult aspect for government is making sure the records are also
open, so that what you've stored is accessible to the public. Thanks to the
Internet, public libraries with Internet kiosk machines, coffee shops, etc.,
you can put the physical server and its software somewhere with reliable
hosting facilities, and then people can access it anywhere they can get to
a computer with Internet. The Internet has become a utility. Even if you
don't own a computer or don't know how to use one, you can go to the
library, and I'm certain any librarian or clerk would be more than happy to
help you find what you need.
But that requires open records formats, and durable, available digital
storage. For long term storage, durability is the biggest and most
important measurement. For access purposes, durability is important, but
availability is more important.
This has gone on far enough. And wide enough. There were only a few
messages on the list about this topic, and I don't know what the discussion
was that started it, so I'm making a lot of commentary here with very
little context to go on, so I hope it's relevant.
--
"Genius might be described as a supreme capacity for getting its possessors
into trouble of all kinds."
-- Samuel Butler