Wanli Ma, Computer Sciences Laboratory, RSISE, The Australian National University, Canberra, ACT 0200, Email: ma@cslab.anu.edu.au
Richard P. Brent, Computer Sciences Laboratory, RSISE, The Australian National University, Canberra, ACT 0200, Email: rpb@cslab.anu.edu.au
Figure 1: A Hyperlinked Document
Hypertext provides a way to transcend the limitations of linearly structured documents by giving users the freedom to browse a hyperlinked document in a sequence of their own choosing (see Figure 1 [2]). In other words, users are not restricted to reading head to tail as in a linear structure. Powerful as the hypertext structure is, a linear structure is still needed for hard-copy printing. When a user finds an interesting document, say, a book, on the Internet and would like a hard copy of it, its hierarchical structure makes printing a tedious task: the user has to manually follow each hyperlink in the main page of the document, e.g., each chapter (and then perhaps each section within a chapter), print each piece separately, and finally assemble the pieces to get the whole book. Even worse, the assembly might not be possible without physically cutting and pasting if a chapter is linked from the middle of the main text, or a section from the middle of a chapter. Therefore, a tool that can automatically follow the hyperlinks in a main document and replace each link with the actual sub-document in the final output is desirable.
In this paper, we present such a tool, called H2FDoc. The bulk of the paper is devoted to describing the tool and its capabilities. The automatic construction of the final flat output, also in HTML format, is based on a set of heuristic rules which specify the depth of link searching, the handling of cyclic links and images, and so on. We believe that this kind of document construction tool is a necessary supplement to WWW browsing and authoring tools.
The rest of the paper is structured as follows: Section 2 presents an outline of H2FDoc, and Section 3 discusses the searching algorithms and heuristic rules used to build it. Section 4 reports the results of testing H2FDoc on two examples, the Netscape handbook and the Third International WWW Conference Proceedings. Finally, Section 5 concludes the paper and outlines our future plans.
Figure 2: The Architecture of H2FDoc
The main components of H2FDoc are as follows:
<a href="URL">, comes, the Controller will ask the communication interface to contact the remote server specified by the "URL" and get the corresponding file to a temporary file, and then ask the temporary file manager to preserve the environment of the file currently being processed and switch the "current file pointer" to the new file. Processing of the old file will not be resumed until processing the current file is finished.
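The suspend-and-resume behaviour described above can be pictured as a stack of saved file environments. The following is a minimal sketch under that assumption, not the H2FDoc implementation; the names `FileContext`, `Controller`, `switch_to`, and `finish_current` are hypothetical, and fetching from a remote server is left out.

```python
from dataclasses import dataclass


@dataclass
class FileContext:
    """Saved environment of one file: where it came from and how far we got."""
    url: str
    lines: list
    position: int = 0  # index of the next line to process


class Controller:
    """Hypothetical controller that keeps suspended files on a stack."""

    def __init__(self):
        self.suspended = []   # environments preserved while a sub-file is processed
        self.current = None   # plays the role of the "current file pointer"

    def switch_to(self, url, lines):
        # Preserve the file being processed, then point at the new file.
        if self.current is not None:
            self.suspended.append(self.current)
        self.current = FileContext(url, lines)

    def finish_current(self):
        # Resume the most recently suspended file, if there is one.
        self.current = self.suspended.pop() if self.suspended else None


controller = Controller()
controller.switch_to("main.html", ["intro", "link to ch1", "outro"])
controller.current.position = 2          # two lines consumed when the link is met
controller.switch_to("ch1.html", ["chapter text"])
controller.finish_current()              # chapter done, main.html resumes
```

A stack is a natural fit here because the old file is only resumed after the new one is finished, so files are resumed in the reverse order of their suspension.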
from | line | col | font | face | type | spelling | comment |
from records the URL address of the original file;
line and col are the line and column numbers of the processed tag in the original file;
font is the current font of the string;
face is the current face of the string;
type gives the link type of the tag, for example plain text (PT), a hypertext link (HL), or an image link (IL);
spelling is the real spelling of the tag or of the plain text;
comment can be used to provide extra information if necessary.
<h1>, <br> and <p> etc.), presentation format (<em>, <b>, and so on), background and colors, special characters, tables, and forms. We simply pass uninteresting tags to the next stage without taking any action.
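The eight-field record described above (from, line, col, font, face, type, spelling, comment) could be modelled as a small data class. This is only an illustrative sketch: the field types, the `LinkType` enumeration, and the example values (including the example.org URL) are assumptions, and `from` is renamed `from_url` because `from` is a reserved word in Python.

```python
from dataclasses import dataclass
from enum import Enum


class LinkType(Enum):
    PT = "plain text"
    HL = "hypertext link"
    IL = "image link"


@dataclass
class Token:
    from_url: str      # URL address of the original file ("from" in the paper)
    line: int          # line number of the tag in the original file
    col: int           # column number of the tag in the original file
    font: str          # current font of the string
    face: str          # current face of the string
    type: LinkType     # PT, HL, or IL
    spelling: str      # real spelling of the tag or of the plain text
    comment: str = ""  # optional extra information


token = Token(
    from_url="http://example.org/book/index.html",  # hypothetical URL
    line=12,
    col=5,
    font="normal",
    face="plain",
    type=LinkType.HL,
    spelling='<a href="ch1.html">',
)
```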
<a href="URL">XXX</a>: in this case, the XXX should be replaced by the content this URL points to. When the tag is met, the Controller will interrupt the processing of the current document, get the new file pointed to by this URL, and work on it instead. Processing resumes from the interruption point after the new file is finished.
<a href="URL#abc">XXX</a>: if it points to a particular portion of a different document which has not yet been received by the temporary file manager, the same action will be taken as above; otherwise, the tag is regarded as an uninteresting one.
<a href="#abc">XXX</a>: it points to a different portion of the same document, so no action will be taken. The tag can be considered an uninteresting one, but it can be useful if an index is to be created. We will address this question in a future paper.
<img src="URL">: it points to an image. We need to replace it with the image, i.e., get the real data of the image.
<a href="URL">XXX</a>, it may find out new interesting tags and has to go further down, or recursively, to follow those new tags. A document most likely contains links pointing to its ancestor documents (in the sense of the hierarchical structure of hypertext), or there might be pointer loops in the documents. H2FDoc should be able to recursively follow the interesting tags without being trapped in the loops. In the following section, we discuss the rules for hyperlink searching.
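The four tag cases listed above can be summarised as a small classification function. The sketch below assumes tags have already been parsed into a lowercase name and an attribute dictionary; the function name `classify`, the action labels, and the `fetched` set are hypothetical and are not taken from H2FDoc.

```python
from urllib.parse import urldefrag


def classify(tag, attrs, fetched):
    """Return 'follow', 'fetch-image', or 'pass-through' for one tag.

    tag is a lowercase tag name, attrs a dict of its attributes, and
    fetched the set of document URLs that have already been retrieved.
    """
    if tag == "img" and "src" in attrs:
        return "fetch-image"            # <img src="URL">
    if tag == "a" and "href" in attrs:
        url, fragment = urldefrag(attrs["href"])
        if not url:
            return "pass-through"       # <a href="#abc">: same document
        if fragment and url in fetched:
            return "pass-through"       # <a href="URL#abc">, document already held
        return "follow"                 # <a href="URL"> or a not-yet-fetched URL#abc
    return "pass-through"               # every other tag is uninteresting
```

For example, `classify("a", {"href": "ch1.html#abc"}, {"ch1.html"})` yields `'pass-through'`, while the same tag with an empty `fetched` set yields `'follow'`.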
LEVEL threshold: in general, most publications tend to have fewer than 5 levels (e.g., chapter, section, subsection, etc.). Therefore, the search is restricted to depths less than this threshold. Take the examples of the Netscape handbook [HREF3, HREF4] and the 3rd International WWW Conference proceedings (3WWW) [HREF5]: both have a two-level hypertext structure, although they use different directory structures. LEVEL can be set by the user before H2FDoc starts.
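To illustrate how a depth threshold and a record of visited documents keep the recursive search from looping, here is a minimal sketch over an in-memory collection of pages. The page representation (a dict mapping a name to a list of text strings and `("link", target, text)` tuples), the function name `flatten`, and the choice to keep the anchor text of an unfollowed link are all assumptions made for the example, not details of H2FDoc.

```python
def flatten(pages, url, level, depth=0, visited=None):
    """Inline linked pages depth-first, limited by level and a visited set."""
    if visited is None:
        visited = set()
    visited.add(url)
    output = []
    for item in pages[url]:
        if isinstance(item, tuple):
            _, target, text = item
            follow = (
                depth + 1 < level          # stay below the LEVEL threshold
                and target in pages        # the target is available
                and target not in visited  # never trace a document twice
            )
            if follow:
                output.extend(flatten(pages, target, level, depth + 1, visited))
            else:
                output.append(text)        # keep only the anchor text
        else:
            output.append(item)
    return output


pages = {
    "main": ["Intro", ("link", "ch1", "Chapter 1"), ("link", "ch2", "Chapter 2")],
    "ch1": ["C1 text", ("link", "main", "Back"), ("link", "sec", "Section")],
    "ch2": ["C2 text"],
    "sec": ["S text"],
}
print(flatten(pages, "main", level=2))
# prints ['Intro', 'C1 text', 'Back', 'Section', 'C2 text']
```

With `level=2` the link back to `main` is not followed because `main` has already been visited, and the link to `sec` is not followed because it lies at the third level.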
Handbook" button in a Netscape browser. The logical structure of the handbook is shown in Figure 3, where the arrows represent the hyperlinks in the original text.
Figure 3: The Hyperlinked Structure of the Netscape Handbook
From the structure we can see that this is a network-linked text document. In this example, the level threshold LEVEL is set to 2. After the execution of H2FDoc, this document is flattened, as shown in Figure 4.
Figure 4: The Flattened Netscape Handbook
The image band "News & Reference" can still be clicked, and will take the user to the Netscape home page. The only problem with this example is that the Index section is lost. This is because, with the current heuristic rules, a tag that points to a document which has been traced before is ignored. Future development of H2FDoc should be able to pick up the index pointers and make the output similar to normal publications, such as a book.
http://www.igd.fhg.de/www/www95/proceedings/papers/abstract.html.
We believe such a tool is necessary for document construction in an environment such as the WWW. Authoring and electronic publishing tools using hypertext techniques are becoming popular. Unfortunately, the reverse need, converting hypertext to flat text, has not yet attracted much attention. When a user wants to download and/or print a hypertext document, H2FDoc is very helpful.
The implementation of H2FDoc is only at its initial stage. Some functions, such as error handling, duplicate removal, and heading and font adjustment, will be developed in the future. Among them, error handling is the most desirable. Because HTML documents are written by different authors with different backgrounds, errors are inevitable. Take the following specification as an example (from a handbook about WWW):
<a href=fig6-2.gif"><IMG SRC="fig6-2.thumb.gif"></A><BR>
Here the opening quotation mark after href= is missing. Netscape works well on it, but H2FDoc cannot correctly interpret the URL of "fig6-2.gif", and produces strange output.
H2FDoc currently assumes that all the related parts of a document are at the same site, but this is not always the case in the real world. For example, a company or a government might run a number of WWW sites. This assumption is an essential precondition for running the heuristic rules of this paper; otherwise, automatic browsing may never stop. The current heuristic rules are by no means perfect. A future version of H2FDoc will introduce an annotation mechanism, by which users can provide useful information in addition to the heuristic rules to guide the recursive browsing. With those annotations, H2FDoc can make better decisions, for example that it needs to visit some outside sites while some inside-pointing links need not be browsed. The annotations can also provide information on how to prepare the final output presentation.
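As a rough illustration of the two points above, the sketch below shows one possible way to tolerate a stray quotation mark in an attribute value and to test whether a link stays on the same site. It is an assumption about how such handling might look, not a description of H2FDoc; the helper names `clean_href` and `same_site` and the example.org and example.com URLs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def clean_href(raw):
    """Strip whitespace and unbalanced quotation marks around a URL value."""
    return raw.strip().strip("\"'")


def same_site(base_url, href):
    """True if href, resolved against base_url, points to the same host."""
    target = urljoin(base_url, clean_href(href))
    return urlparse(target).netloc == urlparse(base_url).netloc


base = "http://example.org/book/index.html"   # hypothetical document URL
repaired = clean_href('fig6-2.gif"')           # value as read from href=fig6-2.gif"
local = same_site(base, 'fig6-2.gif"')
remote = same_site(base, "http://other.example.com/x.html")
```

Here `repaired` is `'fig6-2.gif'`, `local` is `True`, and `remote` is `False`. Comparing host names is only one possible definition of "the same site"; an organisation running several servers would need a looser rule or user-supplied annotations.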
[2] K. Hughes. Entering the World-Wide Web: A Guide to Cyberspace. Honolulu Community College, Oct., 1993.