FutureKeep
Extensible Universal Data Media Archiving Standard
Draft outline notes
FutureKeep (tentative name) is a standard being devised to provide a mechanism for imaging any form of media used to store data. Such media include (but are not limited to) punch cards, paper tape, magnetic tape, magnetic disk, and ROM. The images are stored as plain text structured using XML. Each image will be self-contained, including not only the data being archived but also its meta data (i.e. identifying information). The standard will allow for any level of imaging of the original media, from the highest level (a single stream of data) all the way down to the lowest levels of the original media, including the ability to represent physical encoding features of the source media. Read the specification notes below for more information.
This webpage is just a temporary place to publish these draft outline notes until a more formal website running a content management system can be set up.
Recent Updates:
The End of Software: Contemplating a Standardized Software Preservation Methodology (white paper by Sellam Ismail of VintageTech)
Basic features:
- Well Documented
- Universal (not constrained to any particular hardware)
- All-inclusive - covers every physical form of recording media
- Ease of Implementation - be implementable on even the simplest architectures
- Unencumbered by license - Open source, public domain, etc.
- Extensible - Adaptable, expandable, revisable (for future extensions)
- Character-based - Text-based and stored in a commonly accessible character set
- Multi-level - Allow for the representation of media in logical or physical form
Notes for Basic Features:
- Should also allow for logical representation of data on a physical medium
- The original media source will in many cases have to be read on the original
hardware, which will be older and therefore possibly more difficult to operate or maintain.
- A copyright on the format may be held to prevent unauthorized extension or
pollution/adulteration of the standard. As well, the standard should not
incorporate any copyrighted or patented schemes or algorithms.
- Adaptable, expandable, revisable (for future extensions)
- A suitable subset or encoding of Unicode, e.g. ASCII or UTF-8, should be
specified for universality. Tag characters should be limited to a defined
subset, e.g. A-Z, 0-9, - (hyphen), = (equals), the period, the comma, and the
space character (subject to study).
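The tag character restriction in the last note could be checked mechanically. The following is only a sketch in Python: the name `is_valid_tag_text` is hypothetical, and the allowed set is assumed to be A-Z, 0-9, hyphen, equals, period, comma, and space, which the notes themselves mark as subject to study.

```python
import re

# Assumed allowed set, read from the note above: A-Z, 0-9, hyphen, equals,
# period, comma, and space. The final set is still "subject to study".
ALLOWED_TAG_CHARS = re.compile(r"[A-Z0-9\-=., ]+")


def is_valid_tag_text(text: str) -> bool:
    """Return True if text is non-empty and uses only allowed characters."""
    return ALLOWED_TAG_CHARS.fullmatch(text) is not None


print(is_valid_tag_text("MEDIA-TYPE=DISK"))  # uppercase, hyphen, equals
print(is_valid_tag_text("media_type"))       # lowercase and underscore
```

A check this small fits the "Ease of Implementation" goal: it needs nothing beyond a character-by-character comparison, so it could be reproduced on very simple hardware without a regular expression library.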
Documents Required:
- RFP
- Request For Proposal document to introduce the specification to the world.
- Specification
- This will be the actual specification definition.
- Best Practices
- A "Best Practices" document needs to be written that explains
the best way to create an archive.
Meta Data
- Original Media
- Title
- Author
- Publisher
- Distributor
- Creation Date
- Media Type
- Media Geometry (if applicable)
- Media Encoding (if applicable)
- Target Platform
- Archive
- Creator
- Creation Date
- Production Hardware Configuration
- Description
- Notes
- Specification Version Used
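To make the meta data outline concrete, here is a rough sketch of how the two groups above might be carried inside an XML image. No tag names have been defined yet, so every element name here (`IMAGE`, `METADATA`, `ORIGINAL-MEDIA`, `ARCHIVE`, and the field names) is a placeholder, and the field values are invented sample data. The sketch is in Python using the standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET


def build_metadata(original: dict, archive: dict) -> ET.Element:
    """Build an image element holding both meta data groups.

    All element names are hypothetical; the specification has not
    defined its markup tags yet.
    """
    root = ET.Element("IMAGE")
    meta = ET.SubElement(root, "METADATA")
    groups = (("ORIGINAL-MEDIA", original), ("ARCHIVE", archive))
    for group_name, fields in groups:
        group = ET.SubElement(meta, group_name)
        for name, value in fields.items():
            ET.SubElement(group, name).text = value
    return root


# Sample values are made up purely for illustration.
image = build_metadata(
    {"TITLE": "Example Program", "MEDIA-TYPE": "Paper tape"},
    {"CREATOR": "Example Archivist", "SPECIFICATION-VERSION": "0.0-draft"},
)
xml_text = ET.tostring(image, encoding="unicode")
print(xml_text)
```

Keeping the original-media fields and the archive fields in separate groups mirrors the outline above and avoids ambiguity between the two "Creation Date" entries.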
Media Scope
- All physical formats
- Punch card
- Paper tape
- Magnetic drum
- Magnetic disk
- Magnetic tape
- Core memory
- ROM
- Diode matrix
Data Scope
- File systems
- Files
- Structured Data
- Data blocks
Archive
- Allow multiple images per archive (?)
- Allow referencing/linking to internal data blocks
- Allow linking to external image files (?)
- Allow provision for hardware or application specific configuration
information (?) [Perhaps a configuration metafile]
Transcoding
- Media errors should be encoded
- Critical timing elements or media features must be capable of being
preserved (e.g. tape marks, timing tracks, sector synchronization on a
diskette, write protect status, etc.)
- Word size must be preserved
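One possible way to preserve word size in a text encoding is to write each word as a fixed number of hexadecimal digits, so that word boundaries survive regardless of the host's byte size. This is a sketch only: the function names and the space-separated layout are assumptions, not part of any agreed format.

```python
def encode_words(words, word_bits):
    """Encode each word as fixed-width uppercase hex, one token per word."""
    digits = (word_bits + 3) // 4  # hex digits needed to hold one word
    limit = 1 << word_bits
    tokens = []
    for word in words:
        if not 0 <= word < limit:
            raise ValueError(f"{word} does not fit in {word_bits} bits")
        tokens.append(f"{word:0{digits}X}")
    return " ".join(tokens)


def decode_words(text, word_bits):
    """Decode space-separated hex tokens back into a list of words."""
    limit = 1 << word_bits
    words = [int(token, 16) for token in text.split()]
    for word in words:
        if word >= limit:
            raise ValueError(f"{word} does not fit in {word_bits} bits")
    return words


# Three 12-bit words, written in octal as is customary for 12-bit machines.
encoded = encode_words([0o7402, 0o0001, 0o0000], 12)
print(encoded)  # F02 001 000
```

Because the word width is recorded separately from the digits, a reader on an 8-bit or 32-bit host can reconstruct the original 12-bit (or 18-bit, 36-bit, etc.) words without guessing how they were packed.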
Markup Tags
Data Encoding
- Simple compression (RLE?)
- Data block can consist of any (defined) delineation, from segregated blocks
as defined by the medium, to a single character/byte/word stream representing
the entirety of the medium, or anywhere in between
- Data block checksums (must decide at what level to base the checksum:
on the character representation data, or the decoded data itself?)
- Multiple concurrent layers possible within the same archive (e.g. physical,
logical, filesystem, etc.) to allow for the same archive to contain an image
in multiple interpretations
- Multiple encoding levels possible (e.g. to allow for the representation of
one section of a medium at a "finer" level than is required at another section)
- Integral archive processing language to include simple converters,
extractors, viewers within the archive itself (?) [reference Forth or PostScript]
- Multiple encoding schemes, e.g. 2-character hexadecimal, uuencode, base64
(?), octal (?), binary (strongly discouraged)
- Alias labels for data blocks (e.g. Track 0, Sector 0 can be aliased as text
string "boot sector") or data block lists (permits integral "filesystem")
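Several of the items above (2-character hexadecimal encoding, simple RLE compression, and the open question of where to base the checksum) can be illustrated together. The sketch below is one possible reading, not a proposed format: the function names are hypothetical, the RLE scheme is a plain list of (count, value) runs, and CRC-32 from Python's `zlib` module merely stands in for whatever checksum is eventually chosen.

```python
import zlib


def to_hex(data: bytes) -> str:
    """2-character hexadecimal encoding of a data block."""
    return data.hex().upper()


def from_hex(text: str) -> bytes:
    """Inverse of to_hex."""
    return bytes.fromhex(text)


def rle_encode(data: bytes) -> list:
    """Collapse runs of identical bytes into (count, value) pairs."""
    runs = []
    for value in data:
        if runs and runs[-1][1] == value:
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            runs.append((1, value))
    return runs


def rle_decode(runs) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in runs)


# A block with eight filler bytes followed by two data bytes.
block = bytes([0xE5] * 8 + [0x01, 0x02])
text = to_hex(block)

# The two checksum levels the notes ask about:
checksum_of_decoded = zlib.crc32(block)              # decoded data itself
checksum_of_text = zlib.crc32(text.encode("ascii"))  # character representation

print(text)
print(rle_encode(block))
```

The two checksums protect different things. A checksum over the character representation detects damage to the archive file as stored, while a checksum over the decoded data stays valid if the same block is later re-encoded in another scheme (for example base64 instead of hexadecimal).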
Interface
- Usable as emulator media representation
- Emulator drivers can be written to interpret archive at appropriate level
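As a rough illustration of an emulator driver reading an archive at the sector level, the sketch below parses a tiny hand-written image and serves sectors by address or by alias label. Everything about the markup is assumed for the example: the `IMAGE` and `BLOCK` elements, the `TRACK`, `SECTOR`, and `ALIAS` attributes, and the hexadecimal payloads are placeholders, and the class name `ArchiveDisk` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample archive; tag and attribute names are placeholders.
SAMPLE = """<IMAGE>
  <BLOCK TRACK="0" SECTOR="0" ALIAS="boot sector">DEADBEEF</BLOCK>
  <BLOCK TRACK="0" SECTOR="1">00010203</BLOCK>
</IMAGE>"""


class ArchiveDisk:
    """Minimal read-only disk backed by an XML archive (illustrative only)."""

    def __init__(self, xml_text: str):
        self._sectors = {}  # (track, sector) -> decoded bytes
        self._aliases = {}  # alias label -> (track, sector)
        for block in ET.fromstring(xml_text).iter("BLOCK"):
            key = (int(block.get("TRACK")), int(block.get("SECTOR")))
            self._sectors[key] = bytes.fromhex(block.text.strip())
            alias = block.get("ALIAS")
            if alias:
                self._aliases[alias] = key

    def read_sector(self, track: int, sector: int) -> bytes:
        """Return the decoded contents of one sector."""
        return self._sectors[(track, sector)]

    def read_alias(self, alias: str) -> bytes:
        """Return the sector that an alias label points to."""
        return self._sectors[self._aliases[alias]]


disk = ArchiveDisk(SAMPLE)
print(disk.read_sector(0, 1).hex())
print(disk.read_alias("boot sector").hex())
```

An emulator that only needs logical sectors can stop at this level, while one that models drive timing would need a driver that reads a lower, physical layer of the same archive.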
Miscellaneous
- Create master database of media formats (?)