The contents of this page are based on a presentation given at a workshop on Linking and Searching in Distributed Digital Libraries held at the University of Michigan at Ann Arbor on March 18 - 20, 2002:

Searchable retrodigitised mathematical articles with linked references

Building a digital library that includes articles originally published only on paper, not in electronic form, places some requirements, or at least wishes, on the documents to be used.

First, one has to obtain a digital document from the paper version at all. This can certainly be done just by scanning. The result is a set of scanned images that, if produced in sufficient quality, are adequate for a human reader. (There are several ways of forming a coherent document from single scanned pages, which will not be discussed here.)
Next, one might want to search the library for documents containing certain key phrases. For this, one has to be able to get at the text content of an article. This can be achieved from the scanned images by using an OCR system for optical character, layout, and text recognition.
Once such an article has been found, a user reading it might also want to search electronically within that article. It is therefore an advantage if the results of the text recognition are part of the digital document and are in fact closely tied to the scanned images.
Moreover, as scientific papers often refer to one another, one might wish to get from one paper in a digital library to another just by following hyperlinks. Thus linking should be supported by the document format.

To bring all this desired functionality together in one document, at Essen we use the DjVu format, which makes it comparatively easy to add a text layer as well as links to an image. For more information on DjVu, please see the presentation of L. Vincent "Using DjVu for document compression" or visit www.djvuzone.org.

As an example of such a document, please take a look at the following:

W. Feit, J.G. Thompson
Solvability of Groups of Odd Order
(Pacific Journal of Mathematics Vol 13 No 3)

To view this document, you need to have the DjVu browser plugin installed. This can be found at www.lizardtech.com. For the latest plugin versions for Linux or Solaris, please visit djvu.sourceforge.net.

A remark on linking mathematical articles

For many mathematical articles (and most of the "more recent" ones), reviews are provided in digital form by review journals (Mathematical Reviews, Zentralblatt). Compared to this, the number of articles actually available in full in digital form is small. In addition, these electronic versions are usually not freely available but require a subscription with some organisation or publisher. Therefore, rather than linking articles directly, we provide links from the list of references of an article to the reviews of the cited articles at MathSciNet (with links to Zentralblatt in preparation). Moreover, these reviews sometimes contain a link to the corresponding full version of the reviewed article.

Please refer to the presentation of M. Kratzer "Automatic reference linking using MR Look UP" on how, starting from OCR on the list of references of an article, this linking can be done automatically.

An Example of linking

This example shows the results of the automated linking process. The image below shows a list of references from a mathematical article (W. Kimmerle, K.W. Roggenkamp, A Sylowlike theorem for integral group rings of finite solvable groups, Arch. Math. 60). For the cited articles (i.e. not the cited books or manuscripts), the corresponding area of the image contains a link to the review of that article in MathSciNet. (Just move the mouse there and click. When doing so, please note how the reviews are presented in some dialect of (La)TeX, which seems to be perfectly suitable for the mathematical community.)
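How such a link area might be written down can be sketched in a few lines of Python. The sketch assumes the annotation syntax of the DjVuLibre tools, in which a rectangular hyperlink is an expression of the form (maparea url comment (rect xmin ymin width height)). The bounding boxes, comments, and example.org URLs below are hypothetical placeholders, not data from the article shown above, and the helper names are chosen only for this illustration.

```python
def sexpr_string(value):
    """Quote a string for an annotation: escape backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def maparea(url, comment, xmin, ymin, width, height):
    """Build one rectangular hyperlink annotation.

    Assumption: coordinates are DjVu page coordinates in pixels,
    measured from the bottom-left corner of the page.
    """
    return "(maparea {} {} (rect {} {} {} {}))".format(
        sexpr_string(url), sexpr_string(comment), xmin, ymin, width, height
    )


# Hypothetical input: one bounding box per cited article, as a layout
# analysis of the reference list might deliver it, with placeholder URLs.
references = [
    ("https://example.org/review/1", "Review of reference [1]", 210, 1840, 1500, 60),
    ("https://example.org/review/2", "Review of reference [2]", 210, 1760, 1500, 60),
]

annotations = "\n".join(maparea(*ref) for ref in references)
print(annotations)
```

The resulting lines could then be attached to the page with an annotation editor such as djvused; that step is left out here because it depends on the local tool chain.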

Treating mathematics ...

As indicated above, OCR systems are used to obtain from the scanned images the text content that allows for searching in the resulting document. Mathematical articles tend to contain quite a lot of mathematical notation, which current commercial OCR systems have not been designed to deal with. The question thus arises of how one should treat mathematical notation in this context.
Let us take a look at some possibilities, using the following few lines from the paper of Feit and Thompson cited above as an example:

... like text

Of course, we can run our (text) OCR over the whole document, not caring whether it encounters "ordinary" text or mathematical notation, and see what happens.

The following picture shows (reversed) the areas to which the OCR system could assign some meaning, which is almost the whole text.

And here is the recognition result:


Proof. Since 0P,(X) < X and ^P n 0P>(X) = 1, O9,(H) is in 

Thus it suffices to show that if ft e MOP), then ftS0P'(X). Since 

is a group of order I^PHftl and ty is a Sp-subgroup of X, ft is a p'- 

group, as is ft0p/(X). In proving the lemma, we can therefore assume

In the example, the recognition result for the ordinary text is unaffected by the presence of mathematical notation. While in our experience this is not always the case, the text recognition is usually still good enough for searching within the text.

(For this example, FineReader 6.0 was used. The example omits the information given by the OCR system as to which results it considered unreliable; this information correlates quite well with the areas of mathematical notation.)

... omit

In the previous example, the recognition result on the mathematical notation, even for the rather simple structure of the formulas present, takes quite some imagination to be related to the original text. In general, one might thus not consider these results very helpful.

In the following example, the Infty system developed at the laboratory of M. Suzuki at Kyushu University, which is based on the Toshiba ExpressReader OCR, was used to separate mathematical notation from ordinary text and then include in the document only what is considered ordinary text. The results are as follows:


Proof. Since and is in . 

Thus it suffices to show that if , then . Since 

is a group of order and is a -subgroup of is a 

group, as is . In proving the lemma, we can therefore assume

In this way, we get a good description of the ordinary text, sufficient for searching, without any irritating recognition results on the formula part.

For more information on the Infty system, please see the presentations of M. Suzuki "Extraction of text data and hyperlink structure from scanned images of mathematical journals" and E. Ando "A recognition system of voluminous journals of mathematics". For information on ExpressReader, see the presentation of K. Yokota "ExpressReader Pro adapted to retrodigitization of mathematical documents".

... with specialised recognition engine

On the other hand, the recognition results included in our digital documents (in DjVu format) can be used not only for searching the document but can also be extracted from it. In the two preceding examples, the text that can be copied from the document in this way, however accurate it may be on the ordinary text, gives an incorrect or, respectively, incomplete description of the text of the article itself. One might thus want to fill the gaps of the previous example by using a specialised OCR for the mathematical formulas, as in the following example:


Proof. Since ${O }_{p '}^{}\left( \mathfrak{X} \right) \vartriangleleft \mathfrak{X}$ and $\mathfrak{P} \displaystyle{\cap _{}^{}}{O }_{p '}^{}\left( \mathfrak{X} \right) = 1 ,$ ${O }_{p '}^{}\left( \mathfrak{X} \right)$ is in $\leftrightarrow \left( \mathfrak{P} \right) ,$ . 

Thus it suffices to show that if $\mathfrak{H} \in \leftrightarrow \left( \mathfrak{P} \right)$ , then $\mathfrak{H} \subseteqq {O }_{p '}^{}\left( \mathfrak{X} \right)$ . Since $\mathfrak{P} \mathfrak{H} .$ 

is a group of order $| \mathfrak{P} | \cdot | \mathfrak{H} |$ and $\mathfrak{P}$ is a ${S }_{p }^{}$ -subgroup of $\mathfrak{X} ,$ $\mathfrak{H}$ is a ${p }_{}^{'}-$ 

group, as is $\mathfrak{H} {O }_{p '}^{}\left( \mathfrak{X} \right)$ . In proving the lemma, we can therefore assume

In this way, we get a fairly complete and rather good description of the text in question, with the mathematics part described in LaTeX.
Note, however, that the OCR had some problems with the rather uncommon symbol

which was consequently misrecognised.

For the mathematical formula recognition in this example, we used the specialised system developed at the laboratory of M. Okamoto at Shinshu University in Nagano. For more information on this system, please see the presentation of M. Okamoto "A mathematical formula recognition method and its performance evaluation".

Mathematics in native digital documents

Documents originally created in an electronic format (e.g. PDF) usually admit text extraction. However, as far as the structure of the contained formulas is concerned, this extracted text is usually about as useful as the text obtained by just running text OCR on a scanned page, as the following example is meant to illustrate. The image below shows a few lines from an article published in PDF (L. Paoluzzi, On $\pi$-hyperbolic knots and branched coverings, Comm. Math. Helvet. 74):

Here is how the extracted text for that section looks when copied into an editor:


are distinct, İ and İ0 cannot be conjugate. Since Iso+(M) has finite order, İ and İ0 generate a dihedral group where the element (İİ0) has even order, say, 2d; elseİ and İ0 would be conjugate. Define r := (İİ0) d . By [17, Corollary 1] both K and K0 admit n-periodic symmetries   h and   h 0 respectively whose actions on K and K0 give the trivial knot. Let h and h0 be lifts of these symmetries in Iso+(M) and note that İ and h (resp. İ0 and h0) commute. Note that p  h (K [ Fix(   h)) = p   h 0 (Fix(   h 0) [ K0) is a two trivial non exchangeable component link [17, Theorem 1]. Indeed if the components were exchangeable K and K0 would coincide.

Apart from the fact that there seems to be some problem with the font of the Greek letters, the formula structure is completely lost. One might wish for a description of the article like that in the third example above, with the formula structure described in LaTeX. For this, the corresponding programs for separating formulas from text and for recognising the formula structure should work with input taken directly from the PDF file, thus avoiding any error in character recognition as such. Unfortunately, this is not yet possible for these specialised systems. We have, however, tried to simulate the possible results of such a procedure by performing the recognition on the image obtained from the PDF file and then correcting by hand the errors that were due to faulty character recognition, with the following result:


are distinct, $ \tau $ and $ \tau ' $ cannot be conjugate. Since $ I s o_{+}^{} \left( M \right) $ has finite order, $ \tau $ and 
$ \tau ' $ generate a dihedral group where the element $\left( \tau \tau ' \right) $ has even order, say, $ 2 d $ ;else $ \tau $ 
and $ \tau ' $ would be conjugate. Define $ r : = {\left( \tau \tau ' \right) }_{}^{d } $ By [17, Corollary 1] both $ K $ and $ K' $ 
admit $ n $ -periodic symmetries $ { \bar h } $ and $ { \bar h } ' $ respectively whose actions on $ K $ and $ K ' $ give 
the trivial knot. Let $ h $ and $ h '$ be lifts of these symmetries in $  I s o_{+}^{} \left( M \right) $ and note that 
$ \tau $ and $ h $ (resp. $ \tau '$ and $ h ' $ ) commute. Note that $ p_{{\bar h}}^{} \left ( K \cup F i x \left ( {\bar h} \right ) \right ) = p_{ {\bar h} ' }^{} \left ( F i x \left ( {\bar h} ' \right ) \cup K ' \right )$ 
is a two trivial non exchangeable component link [17, Theorem 1]. Indeed if the 
components were exchangeable $ K $ and $ K ' $ would coincide.

We venture to say that in this way we get a description of the text that is more adequate than the one obtained from the original file.

A remark on formula search

Including mathematical formulas as LaTeX code in the text layer of our documents, as proposed above, has the effect that this code is visible to a search engine performing a text search. Thus a search for an expression like "math" or "left" might yield as results the regions described by expressions like "$\mathfrak{P} \mathfrak{H} .$" or "$ I s o_{+}^{} \left( M \right) $". However, this does not necessarily enable us to actually search for a specific mathematical formula: to perform such a search successfully, we would have to know quite exactly how the formula in question is represented in the LaTeX code of the text layer.
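The effect can be reproduced with a few lines of Python on one line of the text layer from the third example above. This is only a sketch: it assumes that formulas are delimited by single dollar signs, as in the examples on this page, and the function names are chosen for this illustration. It shows that a plain substring search for "left" hits the formula code, that the hit disappears once the formula spans are indexed separately from the prose, and that an exact search for a formula fails as soon as the spacing differs.

```python
import re

# One line of the text layer from the third example (formulas as LaTeX).
TEXT_LAYER = (
    r"Thus it suffices to show that if $\mathfrak{H} \in \leftrightarrow "
    r"\left( \mathfrak{P} \right)$ , then $\mathfrak{H} \subseteqq "
    r"{O }_{p '}^{}\left( \mathfrak{X} \right)$ . Since $\mathfrak{P} \mathfrak{H} .$"
)

# Assumption: every formula is enclosed in single dollar signs.
MATH_SPAN = re.compile(r"\$[^$]*\$")


def prose_only(text):
    """Return the text with every $...$ span removed and blanks collapsed."""
    return " ".join(MATH_SPAN.sub(" ", text).split())


def formulas(text):
    """Return the LaTeX source of every $...$ span, without the dollar signs."""
    return [m.group(0).strip("$").strip() for m in MATH_SPAN.finditer(text)]


# A plain text search sees the LaTeX commands ...
print("left" in TEXT_LAYER)                # True
# ... but not if only the prose part is searched.
print("left" in prose_only(TEXT_LAYER))    # False
# An exact formula search depends on the spacing used in the text layer.
print(r"\mathfrak{P}\mathfrak{H}" in TEXT_LAYER)   # False
print(formulas(TEXT_LAYER)[2])
```

Keeping prose and formulas in separate indexes is one possible design; it avoids spurious hits on command names but does not by itself make formulas searchable.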

For efficient searching for mathematical formulas, however, the search engine should be able to decide whether a formula contained in an article is the same as (or maybe equivalent to?) one that a user has entered as a search term. This task of determining the identity of two expressions is also a very basic task in computer algebra. Thus developing a search engine for mathematics might, at least partially, amount to developing a computer algebra interface for OCR-recognised text.
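A much weaker first step than a computer algebra comparison is a purely syntactic normalisation of the LaTeX code, so that two spellings of the same formula at least compare equal as token sequences. The following Python sketch illustrates the idea; the function name normalize and the list of tokens treated as layout-only are assumptions of this sketch, not part of any of the systems mentioned above.

```python
import re

# A token is a control word (\tau), a control symbol (\{), or one visible character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|\S")

# Tokens that this sketch assumes to carry layout only, not mathematical meaning.
DROPPED = {"$", r"\left", r"\right", r"\displaystyle"}


def normalize(latex):
    """Return the formula as a tuple of tokens, ignoring blanks and layout."""
    out = []
    for tok in TOKEN.findall(latex):
        if tok in DROPPED:
            continue
        out.append(tok)
        # Remove empty sub- and superscripts such as "_{}" or "^{}".
        if out[-3:] in (["_", "{", "}"], ["^", "{", "}"]):
            del out[-3:]
    return tuple(out)


recognised = r"$ I s o_{+}^{} \left( M \right) $"   # as in the text layer
query = r"Iso_{+}(M)"                               # as a user might type it

print(normalize(recognised) == normalize(query))    # True
print(normalize("a+b") == normalize("b+a"))         # False
```

The second comparison shows the limit of this approach: the two expressions are mathematically equivalent but not syntactically identical, and deciding such cases is where computer algebra methods would be needed.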