Intermediate report of the DML Working Group on "Technical Standards"
(as of Nov. 4, 2002)

Cochairs:
  Thierry Bouche   Grenoble / NUMDAM   thierry.bouche@ujf-grenoble.fr
  Ulf Rehmann      Bielefeld           rehmann@mathematik.uni-bielefeld.de

Members:
  Pierre Berard    Pierre.Berard@ujf-grenoble.fr
  Jon Borwein      jborwein@cecm.sfu.ca
  Michael Doob     mdoob@cc.UManitoba.CA
  Keith Dennis     dennis@rkd.math.cornell.edu

Our recommendations concerning the Technical Standards for the Digital
Library are given with respect to the following topics. We hope that
further discussion will complete this list and make it more precise.

  1. Scanning Quality
  2. Archiving Formats
  3. File Name and URL Conventions
  4. Delivery Formats
  5. Download Units
  6. Server Techniques
  7. Further Recommendations

We also hope to be able to provide, later on, tools for carrying out
the different tasks described below.

-----------------------------------------------
1. Scanning Quality:

  600 dpi bitonal (minimum quality level)
  (300 dpi bitonal/grayscale is discouraged)

In special cases and in the long run, higher resolutions, grayscale,
or even color may be more suitable.

Obvious printing flaws, such as skewed printing areas, should be
corrected during the scanning process. The printing area of each page
should be positioned at the same place for all pages of a given
object, possibly reflecting the differences between "right" and "left"
pages. Page jumps, rotations, and varying margins and image dimensions
are discouraged.

(Note: one possible choice is the approach taken by Gallica: always
put the text into the smallest ISO A* format into which it fits.)

-----------------------------------------------
2. Archiving Formats

  Scanned raw data (pnm, tiff)
  (CCITT G4 for bitonal; lossless compression, LZW or ZIP, for
  grayscale and color)

Suggestion: the raw data should be accessible to the public as well.

Reasons:
  1. Somebody may come up with a better delivery format later.
  2. If the original server dies for some reason, chances are that
     somebody else has picked up a copy, so the data won't be lost.

-----------------------------------------------
3. File Name and URL Conventions:

(This is to be made precise and completed, for example in cooperation
with the metadata group?)

Among other things, the following should be guaranteed:

  -- unique and meaningful _name_ for all files
  -- stable URLs for all documents
  -- uniform appearance of web pages for all servers
     (possibly organized in a Math-Net like manner)
     (ordering scheme governed by MSC 2000)
  -- uniform access techniques for all documents

-----------------------------------------------
4. Delivery Formats:

  File Formats: djvu, pdf
  both made searchable with an underlying text layer

Non-ASCII letters, such as those with accents or diaereses, should be
encoded using Unicode.

Links to MR/ZBL should be added to the references.

"Garbage text" in the OCR output (e.g., from unrecognized material
like formulae) should be removed.

-----------------------------------------------
5. Download Units:

              (primary:)         (secondary:)
  Journals:   Single articles    (annual) volumes
  Books:      Whole book         (chapters?)

Comment: The choice between "primary" and "secondary" may depend on
the delivery format.

Browsing through tables of contents is desirable.
Download of single pages only is discouraged.
Download of page ranges is desirable.

-----------------------------------------------
6. Server Techniques:

For djvu, indirect documents should be delivered (this can be achieved
by setting up the server appropriately, at least for apache).

For pdf, byte-optimized files should be delivered, and the server
should be configured for "byte serving".

-----------------------------------------------
7. Further Recommendations:

(See http://www.library.cornell.edu/dmlib/rehmann.pdf)

It is suggested to set up public servers for

  -- format conversions
  -- performing OCR
  -- automatic supply of metadata for an article
     (using Dublin Core, Open Archive or similar encodings)
  -- upload of digitized material to the DML
  -- a registry of all projects, keeping track of
     ongoing/completed/planned digitizing projects and allowing
     unnoticed material to be entered
  -- (maybe even: scan servers? These would be places with good
     scanning equipment, to which people can send their paper
     material in order to get it scanned at high quality.)

Reason:

Setting up the DML is a task for many people and will take 10-15 years
or longer. Any individual or institutional contribution of digitized
material should therefore be welcome. Individuals should be encouraged
and enabled to help.

In order to enable many contributors to provide digitized material of
sufficiently high quality, it is necessary to provide public tools
that transform the material into the right format, which is sometimes
technically demanding, and that add text layers by OCR. The OCR should
be optimized for the language the manuscript is written in; it would
therefore be good to have public servers for the various language
areas.

Also, it should be easy for contributors to provide the scanned
material with (elementary) metadata, such as MSC codes, keywords and
phrases, on a Dublin Core and/or Open Archive basis.

In principle, this technology will be an advantage for any scientific
discipline (as well as for more general areas of electronic
literature), so proposing a set of servers like this as a basic
archiving infrastructure might help to convince funding agencies.

Public format conversion servers could also help with the long-term
archiving problem, since they provide a dynamic tool for addressing
it.

Of course, all these servers should be able to handle mass data upload
(script driven) as well as individual files.
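
The elementary metadata mentioned above could, for instance, be
written out as a simple Dublin Core record. The following Python
fragment is only a minimal sketch under stated assumptions, not a
format agreed on by the working group: the function name, the
<record> wrapper element, the "MSC2000:" prefix, the decision to put
both MSC codes and keywords into dc:subject, and the sample values
are all hypothetical. Only the Dublin Core element namespace is
taken from the Dublin Core specification.

```python
import xml.etree.ElementTree as ET

# Namespace of the simple Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dublin_core_record(title, creators, date, language, msc_codes, keywords):
    """Return a simple Dublin Core record as an XML string.

    The <record> wrapper and the "MSC2000:" prefix are illustrative
    assumptions, not part of any agreed DML convention.
    """
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")

    def add(element, text):
        # Append one dc:<element> child carrying the given text.
        child = ET.SubElement(record, f"{{{DC_NS}}}{element}")
        child.text = text

    add("title", title)
    for creator in creators:
        add("creator", creator)
    add("date", date)
    add("language", language)
    for code in msc_codes:
        add("subject", f"MSC2000:{code}")
    for keyword in keywords:
        add("subject", keyword)

    return ET.tostring(record, encoding="unicode")


if __name__ == "__main__":
    # Entirely made-up sample values, for illustration only.
    print(dublin_core_record(
        title="A sample article",
        creators=["A. Author"],
        date="1935",
        language="fr",
        msc_codes=["11-XX"],
        keywords=["number theory"],
    ))
```

A script-driven upload could call such a function once per article
and send the resulting record along with the scanned files.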

Remarks:

A kind of prototype server for special file conversion and OCR is the
any2djvu server:
  http://any2djvu.djvuzone.org/

A prototype for a metadata server is the MathNet MMM server:
  http://www.mathematik.uni-osnabrueck.de/cgi-bin/MMM3.1.cgi

Both servers work via web forms, but can also be driven (for mass
production) by LWP (libwww-perl) scripts.
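
Script-driven mass production of this kind might look roughly as
follows. The servers above are driven by Perl LWP scripts; Python's
standard urllib is used here only as an analogue. This is a sketch
under loud assumptions: the endpoint URL and the form field names
"url" and "ocr_language" are hypothetical placeholders, since this
report does not give the actual form parameters of either server.
They would have to be read off the servers' own web forms. The
sketch only builds the requests; nothing is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; replace with the real form action of a server.
ENDPOINT = "http://conversion-server.example.org/convert"


def build_conversion_request(source_url, ocr_language="en"):
    """Build (but do not send) a form POST asking for one conversion.

    The field names "url" and "ocr_language" are assumptions made
    for illustration only.
    """
    form = urlencode({"url": source_url, "ocr_language": ocr_language})
    return Request(
        ENDPOINT,
        data=form.encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def build_batch(source_urls, ocr_language="en"):
    """Build one request per document, as needed for mass production."""
    return [build_conversion_request(url, ocr_language) for url in source_urls]


if __name__ == "__main__":
    # Made-up document locations.
    batch = build_batch(
        [
            "http://archive.example.org/vol1/article1.tiff",
            "http://archive.example.org/vol1/article2.tiff",
        ],
        ocr_language="fr",
    )
    for request in batch:
        print(request.get_method(), request.full_url, request.data.decode("ascii"))
        # To actually submit a request: urllib.request.urlopen(request)
```

Keeping request construction separate from submission makes it easy
to inspect a whole batch before any data is sent to a server.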