The contents of this page are based on a presentation given at a workshop
on Linking and Searching in Distributed Digital Libraries held
at the University of Michigan at Ann Arbor on March 18 - 20, 2002:
Searchable retrodigitised mathematical articles with linked references
Building a digital library that includes articles originally not published
in electronic form, i.e. on paper, leads to some requirements - or at least
wishes - on the documents to be used.
- First one has to obtain a digital document at all from the paper version.
This can certainly be done just by scanning. The result will be a set of scanned
images that, if produced in sufficient quality, will be adequate for
reading by a human reader.
(There are several ways of forming a coherent document from single scanned pages
which will not be discussed here.)
- Next, one might want to search within the library for documents containing
certain key phrases. For this one has to be able to get at the
text contents of an article. This can be achieved from the scanned
images by using an OCR system for optical character, layout, and text
recognition.
Once such an article has been found, a user might also want to search
electronically within it while reading. It is therefore an advantage
if the results of the text recognition are part of the digital document and
are in fact closely tied to the scanned images.
- Moreover, as scientific papers often refer to one another, one might
wish to get from one paper within a digital library to another just
by following hyperlinks. Thus linking should be supported
by the document format.
To bring all this wished-for functionality together in one document, at Essen we are using
the DjVu format, which allows a text layer as well as links to be added to an image
comparatively easily. For more information on DjVu please see the presentation
of L. Vincent "Using DjVu for document compression"
or visit www.djvuzone.org.
As an example of such a document please take a look at the following:
W. Feit, J.G. Thompson
Solvability of Groups of Odd Order
(Pacific Journal of Mathematics Vol 13 No 3)
To view this document, you need to have the DjVu browser plugin installed.
This can be found at
www.lizardtech.com .
For the latest plugin versions for Linux or Solaris please visit
djvu.sourceforge.net .
A remark on linking mathematical articles
For many mathematical articles (and most of the "more recent") reviews
are provided in digital form by review journals (Mathematical Reviews,
Zentralblatt).
Compared to this, the number of articles actually available in full in
digital form is comparatively small. In addition, these electronic versions
usually are not freely available but require a subscription with some
organisation or publisher. Therefore, rather than linking articles
directly, we provide links from the list of references of an article
to the reviews of the cited articles at MathSciNet (with links to
Zentralblatt in preparation). Moreover, sometimes from these reviews
there is a link to the corresponding full version of the reviewed
article.
Please refer to the presentation of M. Kratzer
"Automatic reference linking using MR Look UP" on how, starting from
OCR on the list of references of an article, this linking can be done
automatically.
An Example of linking
This example shows the results of the automated linking process. The image
below shows a list of references from a mathematical article
(W. Kimmerle, K.W. Roggenkamp, A Sylowlike theorem for integral group
rings of finite solvable groups, Arch. Math. 60).
For the cited articles (i.e. not the cited books or manuscripts) the
corresponding area of the image contains a link to the review of that
article in MathSciNet. (Just move the mouse there and click. When
doing so, please note how the reviews are presented in some dialect
of (La)TeX, which seems to be perfectly suitable for the mathematical
community.)
Treating mathematics ...
As indicated above, OCR systems are used to obtain from the scanned images
the text contents that allow searching in the resulting document. On the
one hand, mathematical articles tend to contain quite a lot of mathematical
notation; on the other hand, recent commercial OCR systems have not been
designed to deal with it. The question therefore arises of how one should treat
mathematical notation in this context.
Let us take a look at some possibilities, using the following few lines from the
paper of Feit and Thompson cited above as an example:
... like text
Of course, we can run our (text) OCR over the whole document, not caring
whether it encounters "ordinary" text or mathematical notation, and see
what happens.
The following picture shows (reversed) the areas to which the OCR system could
assign some meaning, which is almost the whole text.
And here is the recognition result:
Proof. Since 0P,(X) < X and ^P n 0P>(X) = 1, O9,(H) is in
Thus it suffices to show that if ft e MOP), then ftS0P'(X). Since
is a group of order I^PHftl and ty is a Sp-subgroup of X, ft is a p'-
group, as is ft0p/(X). In proving the lemma, we can therefore assume
In the example, the recognition result of the ordinary text is
unaffected by the presence of mathematical notation. While in our
experience this is not always the case, the text recognition is usually
still good enough for searching within the text.
(For this example, FineReader 6.0 was used. Omitted
from the example is the information given by the OCR system as to
which results it considered unreliable, which correlates quite well
with the areas of mathematical notation.)
omit
In the previous example, the recognition result on
the mathematical notation, even for the rather simple structure of the
formulas present, takes quite some imagination to be related to the
original text. In general, one might thus not consider these results
very helpful.
In the following example, the Infty system developed at the laboratory of M.
Suzuki at Kyushu University, which is based on the Toshiba ExpressReader OCR,
was used to separate mathematical notation from ordinary text and then to
include only what is considered ordinary text in the document. The results are
as follows:
Proof. Since and is in .
Thus it suffices to show that if , then . Since
is a group of order and is a -subgroup of is a
group, as is . In proving the lemma, we can therefore assume
In this way, we get a good description of the ordinary text, sufficient for
searching, without any irritating recognition results on the formula part.
For more information on the Infty system, please see the presentations of
M. Suzuki "Extraction of text data and hyperlink structure from scanned images
of mathematical journals" and E. Ando "A recognition system of voluminous
journals of mathematics".
For information on ExpressReader see the presentation of K. Yokota
"ExpressReader Pro adapted to retrodigitization of mathematical documents".
... with specialised recognition engine
On the other hand, the recognition results included in our digital documents
(in DjVu format) can not only be used for searching the document, but can also
be extracted from the document. In the two preceding examples, the text
that can be copied from the document in that way - however accurate it may be on the
ordinary text - gives an incorrect or, respectively, incomplete description of the text
of the article itself. One might thus want to fill the gaps of the
previous example by using a specialised OCR for the mathematical formulas,
as in the following example:
Proof. Since ${O }_{p '}^{}\left( \mathfrak{X} \right) \vartriangleleft \mathfrak{X}$ and $\mathfrak{P} \displaystyle{\cap _{}^{}}{O }_{p '}^{}\left( \mathfrak{X} \right) = 1 ,$ ${O }_{p '}^{}\left( \mathfrak{X} \right)$ is in $\leftrightarrow \left( \mathfrak{P} \right) ,$ .
Thus it suffices to show that if $\mathfrak{H} \in \leftrightarrow \left( \mathfrak{P} \right)$ , then $\mathfrak{H} \subseteqq {O }_{p '}^{}\left( \mathfrak{X} \right)$ . Since $\mathfrak{P} \mathfrak{H} .$
is a group of order $| \mathfrak{P} | \cdot | \mathfrak{H} |$ and $\mathfrak{P}$ is a ${S }_{p }^{}$ -subgroup of $\mathfrak{X} ,$ $\mathfrak{H}$ is a ${p }_{}^{'}-$
group, as is $\mathfrak{H} {O }_{p '}^{}\left( \mathfrak{X} \right)$ . In proving the lemma, we can therefore assume
This way, we get a quite complete and rather good description of the text
in question - with the mathematics part described in LaTeX.
Note however, that the OCR had some problems with the rather uncommon symbol
which was consequently misrecognised.
For the mathematical formula recognition in this example we used the specialised
system developed at the laboratory of M. Okamoto at Shinshu University at
Nagano. For more information on this system please see the presentation of M. Okamoto
"A mathematical formula recognition method and its performance evaluation".
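The relation between the last two variants can be made concrete. The following is a minimal sketch, not part of any of the systems mentioned: assuming a text layer in which every formula is delimited by single dollar signs, as in the example above, the text-only variant of the previous section can be derived by dropping the formula spans. The function name strip_math is hypothetical, and escaped dollar signs or display formulas are not handled.

```python
import re

# One inline formula delimited by single dollar signs. This delimiter
# convention is an assumption taken from the example text layer above.
MATH_SPAN = re.compile(r"\$[^$]*\$")


def strip_math(text_layer: str) -> str:
    """Remove $...$ spans and collapse the whitespace they leave behind."""
    without_math = MATH_SPAN.sub(" ", text_layer)
    return re.sub(r"\s+", " ", without_math).strip()


# A fragment of the third line of the LaTeX-annotated example.
sample = (
    r"and $\mathfrak{P}$ is a ${S }_{p }^{}$ -subgroup of "
    r"$\mathfrak{X} ,$ $\mathfrak{H}$ is a"
)
print(strip_math(sample))
```

Applied to the fragment above, this leaves only the ordinary words, which is the kind of text one would index for plain keyword search.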
Mathematics in native digital documents
Documents originally created in electronic form (e.g. PDF) usually admit
text extraction. However, as far as the structure of the contained
formulas is concerned, this extracted text is usually about as useful as the text
obtained by just using text OCR on a scanned page, as the following example is
meant to illustrate.
The image below shows a few lines from an article published in PDF
(L. Paoluzzi, On $\pi$-hyperbolic knots and branched coverings,
Comm. Math. Helvet. 74):
Here is how the extracted text for that section looks when copied into an
editor:
are distinct, Ý and Ý0 cannot be conjugate. Since Iso+(M) has finite order, Ý and Ý0 generate a dihedral group where the element (ÝÝ0) has even order, say, 2d; elseÝ and Ý0 would be conjugate. Define r := (ÝÝ0) d . By [17, Corollary 1] both K and K0 admit n-periodic symmetries h and h 0 respectively whose actions on K and K0 give the trivial knot. Let h and h0 be lifts of these symmetries in Iso+(M) and note that Ý and h (resp. Ý0 and h0) commute. Note that p h (K [ Fix( h)) = p h 0 (Fix( h 0) [ K0) is a two trivial non exchangeable component link [17, Theorem 1]. Indeed if the components were exchangeable K and K0 would coincide.
Apart from the fact that there seems to be some problem with the
font of the Greek letters, the formula structure is completely lost.
One might wish for a description of the article like that in the third
example above, with the formula structure described in LaTeX. For this,
the corresponding programs for separating formulas from text
and for recognising the structure of the formulas should work
with input directly from the PDF file, thus avoiding any error in
character recognition as such. Unfortunately, for these specialised systems
this is not yet possible. We have, however, tried to simulate the
possible results of such a procedure by performing the recognition
on the image obtained from the PDF file and then correcting by hand the errors
that were due to faulty character recognition, with the following
result:
are distinct, $ \tau $ and $ \tau ' $ cannot be conjugate. Since $ I s o_{+}^{} \left( M \right) $ has finite order, $ \tau $ and
$ \tau ' $ generate a dihedral group where the element $\left( \tau \tau ' \right) $ has even order, say, $ 2 d $ ;else $ \tau $
and $ \tau ' $ would be conjugate. Define $ r : = {\left( \tau \tau ' \right) }_{}^{d } $ By [17, Corollary 1] both $ K $ and $ K' $
admit $ n $ -periodic symmetries $ { \bar h } $ and $ { \bar h } ' $ respectively whose actions on $ K $ and $ K ' $ give
the trivial knot. Let $ h $ and $ h '$ be lifts of these symmetries in $ I s o_{+}^{} \left( M \right) $ and note that
$ \tau $ and $ h $ (resp. $ \tau '$ and $ h ' $ ) commute. Note that $ p_{{\bar h}}^{} \left ( K \cup F i x \left ( {\bar h} \right ) \right ) = p_{ {\bar h} ' }^{} \left ( F i x \left ( {\bar h} ' \right ) \cup K ' \right )$
is a two trivial non exchangeable component link [17, Theorem 1]. Indeed if the
components were exchangeable $ K $ and $ K ' $ would coincide.
We venture to say that in this way we get a description of the text that is
more adequate than that obtained from the original file.
A remark on formula search
Including mathematical formulas as LaTeX code in the text layer of our
documents, as was proposed above, will have the effect that this code
will be visible to a search engine performing text search. Thus
a search for expressions like "math" or "left" might yield as results the
regions described by expressions like "$\mathfrak{P} \mathfrak{H} .$" or
"$ I s o_{+}^{} \left( M \right) $". However, that does not necessarily
enable us to actually search for a specific mathematical formula, since to
perform such a search successfully we would have to know quite exactly
how the formula in question is represented in the LaTeX code of the
text layer.
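To illustrate how much depends on the exact representation, here is a minimal sketch of a purely syntactic normalisation that a search front end might apply to both the query and the stored code before comparing them. It is an assumption made for illustration, not something done by the systems described here; the names normalise_latex and formula_matches are hypothetical. It only removes three kinds of noise visible in the examples above: spacing, \left and \right, and empty sub- or superscripts such as ^{}.

```python
import re

# A token is a control word (\tau), a control symbol (\,) or any single
# non-space character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|[^\s\\]")


def normalise_latex(code: str) -> str:
    """Reduce a LaTeX fragment to a canonical token string."""
    tokens = TOKEN.findall(code.strip().strip("$"))
    # \left and \right only affect delimiter sizing, not the formula itself.
    tokens = [t for t in tokens if t not in (r"\left", r"\right")]
    result = []
    i = 0
    while i < len(tokens):
        # Skip empty scripts such as _{} or ^{} emitted by the recogniser.
        if tokens[i] in ("_", "^") and tokens[i + 1:i + 3] == ["{", "}"]:
            i += 3
        else:
            result.append(tokens[i])
            i += 1
    return " ".join(result)


def formula_matches(query: str, stored: str) -> bool:
    """Literal comparison after normalisation (no mathematical reasoning)."""
    return normalise_latex(query) == normalise_latex(stored)


stored = r"$ I s o_{+}^{} \left( M \right) $"
print(formula_matches(r"$Iso_{+}(M)$", stored))
print(formula_matches(r"$Iso_+(M)$", stored))
```

Even with this normalisation, a query written with an unbraced subscript, as in the second call, is not treated as the same formula as the stored one, which shows how fragile literal matching remains.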
For efficient searching for mathematical formulas, however, the search engine
should be able to decide whether a formula contained in an article is the
same as (or maybe equivalent to?) one a user has entered as a search term.
This task of determining the identity of two expressions is also a very
basic task in computer algebra. Thus developing a search engine for
mathematics might at least partially amount to the development of a
computer algebra interface for OCR-recognised text.
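As a toy illustration of what "equivalent" could mean beyond literal identity, the sketch below compares two expressions up to the commutativity and associativity of + and *. It is only an assumption-laden example: it expects the formulas to have been converted already into plain infix notation (Python expression syntax is used here for convenience), the name same_formula is hypothetical, and a real computer algebra system would recognise far more identities.

```python
import ast


def _flatten(node, op_type):
    """Collect the operands of a chain such as a + (b + c) into one list."""
    if isinstance(node, ast.BinOp) and isinstance(node.op, op_type):
        return _flatten(node.left, op_type) + _flatten(node.right, op_type)
    return [node]


def _canon(node):
    """Build a canonical nested tuple for a parsed expression."""
    if isinstance(node, ast.BinOp):
        op_type = type(node.op)
        if op_type in (ast.Add, ast.Mult):
            # Commutative and associative: flatten, then sort the operands.
            operands = sorted(
                (_canon(n) for n in _flatten(node, op_type)), key=repr
            )
            return (op_type.__name__, *operands)
        # Other binary operators (-, /, **) keep their operand order.
        return (op_type.__name__, _canon(node.left), _canon(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError("unsupported expression")


def same_formula(a: str, b: str) -> bool:
    """True if a and b agree up to reordering of sums and products."""
    tree_a = ast.parse(a, mode="eval").body
    tree_b = ast.parse(b, mode="eval").body
    return _canon(tree_a) == _canon(tree_b)


print(same_formula("a*b + c", "c + b*a"))
print(same_formula("a - b", "b - a"))
print(same_formula("a + a", "2*a"))
```

The last call shows the limit of this approach: a + a and 2*a are mathematically equal but have different canonical forms here, so deciding such cases needs genuine algebraic simplification.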