Intermediate report of the DML Working Group on "Technical Standards"
                         (as of Nov. 4, 2002)

Cochairs: Thierry Bouche Grenoble / NUMDAM thierry.bouche@ujf-grenoble.fr 
          Ulf Rehmann Bielefeld            rehmann@mathematik.uni-bielefeld.de

Members:  Pierre Berard                    Pierre.Berard@ujf-grenoble.fr    
          Jon Borwein                      jborwein@cecm.sfu.ca
          Michael Doob                     mdoob@cc.UManitoba.CA
          Keith Dennis                     dennis@rkd.math.cornell.edu


Our recommendations concerning the Technical Standards for the Digital
Library will be given with respect to the following topics. We hope
that further discussions will complete this list and make it more
precise.

1. Scanning Quality
2. Archiving Formats
3. File Name and URL Conventions
4. Delivery Formats
5. Download Units
6. Server Techniques
7. Further Recommendations

We also hope we will later be able to provide tools for achieving the
different tasks described below.


-----------------------------------------------
1. Scanning Quality:

   600 dpi bitonal (minimum quality level) 
   (300 dpi bitonal/grayscale is discouraged)

   In special cases and in the long run, higher resolutions,
   grayscale, or even color may be more suitable.

   Obvious printing flaws, such as skewed printing areas, should be
   corrected during the scanning process.

   The printing area of each page should be positioned at the same
   place for all pages of a given object, possibly reflecting the
   differences for "right" and "left" pages.
   Jumps in position from page to page, rotations, and varying margins
   or image dimensions are discouraged.
   (Note: a possible choice could be the approach chosen by Gallica:
   Always put the text into the minimum ISO A* format into which it fits.)

-----------------------------------------------
2. Archiving Formats

   Scanned raw data (pnm, tiff)
   (CCITT G4 for bitonal;
   lossless compression, LZW or ZIP, for grayscale and color)

  Suggestion: the raw data should be accessible to the public as well. 

  Reasons: 
  1. Somebody may come up with a better delivery format later.
  2. If the original server dies for some reason, chances are
     that somebody else has picked up a copy, so the data
     won't be lost.

-----------------------------------------------
3. File Name and URL Conventions:

(This is to be made precise and completed, for example, in cooperation
with the metadata group?) 

Among other things, the following should be
guaranteed:
 
       unique and meaningful _name_ for all files

       stable urls for all documents
  
       uniform appearance of web pages for all servers
       (possibly organized in a Math-Net like manner)
       (ordering scheme governed by MSC 2000)
       
       uniform access techniques for all documents
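
As the conventions themselves are still to be fixed, the following is
only a sketch of what a "unique and meaningful" file name could look
like. The naming scheme (journal, year, volume, first page) and both
function names are hypothetical and not a proposal of the group. Note
that two short items starting on the same page would collide under
this scheme, so a real convention would need a further distinguishing
component.

```python
import re
import unicodedata


def slug(text):
    """Lower-case ASCII slug: strip accents, keep only a-z and 0-9."""
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "", ascii_text.lower())


def article_file_name(journal, year, volume, first_page, extension):
    """Build a file name from bibliographic data (hypothetical scheme)."""
    # Zero-padding keeps the names sortable in directory listings.
    return f"{slug(journal)}_{year:04d}_{volume:03d}_{first_page:04d}.{extension}"


print(article_file_name("Journal of Examples", 1950, 2, 5, "djvu"))
# journalofexamples_1950_002_0005.djvu
```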
   
-----------------------------------------------
4. Delivery Formats:

File Formats: djvu, pdf

	both made searchable with an underlying text layer.
	Non-ASCII letters, such as those with accents or
	diaereses, should be encoded using Unicode.

	Links to MR/ZBL should be added to the references.

	"Garbage text" (e.g., from unrecognized material like
	formulae) in OCR should be removed.

-----------------------------------------------
5.   Download Units:   (primary:)         (secondary:)
        Journals:    Single articles    (annual) volumes
        Books:       Whole book         (chapters?)

	Comment:     The choice between "primary" and "secondary"
	             may depend on the delivery format.

        Browsing through tables of contents is desirable.
	
	Download of single pages only is discouraged.
	Download of page ranges is desirable.

-----------------------------------------------
6. Server Techniques:
	
        for djvu, indirect documents should be delivered (this can be
        achieved by setting up the server appropriately, at least for
        apache).

        for pdf, byte optimized files should be delivered, and the
        server should be configured for "byte serving".
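
        "Byte serving" relies on HTTP range requests: the viewer asks
        for parts of the file with a header such as
        "Range: bytes=0-499" and the server returns only those bytes.
        Common web servers handle this themselves. The sketch below
        (the function name is hypothetical) only illustrates how a
        single-range header resolves to byte offsets; multi-range
        requests are deliberately not handled.

```python
def parse_byte_range(header_value, file_size):
    """Resolve a single 'bytes=...' range to inclusive (first, last) offsets.

    Returns None if the value is not understood or cannot be satisfied.
    """
    unit, equals, spec = header_value.strip().partition("=")
    if unit.strip().lower() != "bytes" or not equals or "," in spec:
        return None
    first_text, dash, last_text = spec.strip().partition("-")
    if not dash:
        return None
    try:
        if first_text == "":
            # Suffix form "bytes=-N": the last N bytes of the file.
            length = int(last_text)
            if length <= 0 or file_size == 0:
                return None
            return max(file_size - length, 0), file_size - 1
        first = int(first_text)
        # Open-ended form "bytes=N-": from offset N to the end of the file.
        last = int(last_text) if last_text else file_size - 1
    except ValueError:
        return None
    if first > last or first >= file_size:
        return None
    # A range reaching past the end of the file is clipped to the last byte.
    return first, min(last, file_size - 1)


print(parse_byte_range("bytes=0-499", 1000))  # (0, 499)
print(parse_byte_range("bytes=-200", 1000))   # (800, 999)
```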

-----------------------------------------------
7. Further Recommendations:

(See http://www.library.cornell.edu/dmlib/rehmann.pdf)

It is suggested to set up public servers for 

    --  format conversions
    --  performing OCR
    --  automatic supply of metadata for an article 
        (using Dublin Core, Open Archive or similar encodings)
    --  uploading digitized material to the DML
    --  a registry of all projects, keeping track of
        ongoing/completed/planned digitizing projects
        and allowing unnoticed material to be entered
    --  (maybe even: scan servers? These would be places with good
        scanning equipment, to which people can send their paper
        material in order to get it scanned at high quality.)

Reason: Setting up the DML is a task for many people and will take
10-15 years or longer. Any individual or institutional contribution of
digitizations should therefore be welcome. Individuals should be
encouraged and enabled to help.

In order to enable many contributors to provide digitized material of
sufficiently high quality, it is necessary to provide public tools
that transform the material into the right format, which is sometimes
technically demanding, and that provide text layers by OCR. (OCR should
be optimized for the language the document is written in, so it would
be good to have public servers for the various language areas.) Also,
it should be easy for contributors to provide the scanned material
with (elementary) metadata such as MSC, keywords and phrases on a
Dublin Core and/or Open Archive basis.
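
To illustrate how little is needed for such elementary metadata, here
is a minimal sketch that wraps a few fields in unqualified Dublin Core
elements inside an "oai_dc" container, as used by the Open Archives
Initiative. The function name, the choice of fields, and the
"MSC2000:" prefix on subject entries are assumptions made for the
example, not an agreed encoding; the sample values are placeholders.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"


def dublin_core_record(title, creators, date, msc_codes, keywords, language):
    """Return a small oai_dc record as an XML string."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")

    def add(name, value):
        # Each field becomes one element in the Dublin Core namespace.
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value

    add("title", title)
    for creator in creators:
        add("creator", creator)
    add("date", date)
    add("language", language)
    for code in msc_codes:
        # Hypothetical convention: mark MSC codes with a scheme prefix.
        add("subject", f"MSC2000: {code}")
    for keyword in keywords:
        add("subject", keyword)
    return ET.tostring(root, encoding="unicode")


print(dublin_core_record(
    title="Example title",
    creators=["A. Author"],
    date="1950",
    msc_codes=["11-XX"],
    keywords=["example keyword"],
    language="fr",
))
```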

In principle, this technology will be an advantage for any scientific
discipline, as well as for more general areas of electronic
literature. (The suggestion of a set of servers like this as a basic
archiving infrastructure might therefore help to convince funding
agencies.)

Public format conversion servers could also contribute to solving the
long-term archiving problem, since they provide a dynamic tool for
this purpose.

Of course, all these servers should be able to handle mass data upload
(script driven), as well as individual files.

Remarks:

A kind of prototype server for special file conversion and ocr'ing is
the any2djvu server: http://any2djvu.djvuzone.org/

A prototype for a metadata server is the MathNet MMM server:
http://www.mathematik.uni-osnabrueck.de/cgi-bin/MMM3.1.cgi

Both servers work via web forms, but can also be driven (for mass
production) by LWP scripts.