Indexing Tomorrow's
Web-Delivered
World Wide Library
Information-technology professionals joke that developers ignorant of
Unix are doomed to reinvent it. (We recall that Unix represented the
leading edge of information technology in the 1970s; that Unix, in all
but nomenclature, is alive and well today both in Linux and in the new
Mac OS X; and that the contemporary operating-system architect is
eventually forced either to adapt or to reinvent the core ideas, such
as permissions, which Unix so carefully articulated so long ago.)
Comparable to the venerable Unix in computing is the still more
venerable discipline of indexing in the world of
publishing. Back-of-book indexing was already known in Renaissance
times. The profession is alive and well today, with a few thousand
practitioners around the world, many of them grouped under such
professional umbrellas as the UK's Society of Indexers, or its
slightly younger sister societies in other countries. A typical
professional indexer, if working full time, might hope to index two or
three books a month, at - alas - something considerably less than two
thousand American dollars for a typical book, under contract either
with the author or with the publishing house. The professional is
inevitably found tracking hundreds of terminological decisions not
with index cards, and not with the absurdly misnamed "indexing
modules" of word processors, but with a tool such as CINDEX. The
professional indexer has in many cases completed a formal indexing
course, to be thought of as equivalent to at least a fifth or a
quarter of the North American full-time
university student's third-year semester
workload.
As modern operating-system architects have to study, or else to
reinvent, parts of Unix, so the builders of the emerging Internet will
have to study, or else to reinvent, aspects of traditional book
indexing.
What is this emerging Internet? We already have a World Wide Web (WWW)
for advertising and e-commerce. What will emerge, however, over coming
decades is a true World Wide Library (say, "WWL"), housing the full
text of all significant new scientific and scholarly work, and hosted
on some tens or hundreds of thousands of servers -
a subset of the millions that might by then be publishing
content of one kind or another on the WWW.
If the gods smile on humanity, the WWL will embrace today's
open-source software model, making scientific and scholarly content
viewable free of charge over the world. If the gods curse us, we shall
have a WWL burdened by "digital rights management," in which content
is viewable only for a fee, and in which centralized organs of
corporate or state security record even which specific files get
opened on which specific workstations. (The abyss that threatens if
the gods curse is sketched by Cambridge University computer scientist
Ross Anderson, in an examination at http://www.cl.cam.ac.uk/~rja14/tcpa-faq.html
of the Trusted Computing Platform Alliance (TCPA) and Microsoft "Palladium"
initiatives.) But in one way or another, the WWL is on its way.
Thanks in part to the excellent work being done by the tens of
thousands of volunteer cataloguers in the Open Directory Project
(ODP), or "Directory Mozilla", at http://dmoz.org,
and fed into various other
search sites, such as the "Directory" portion of Google,
we already have some idea
of what it will mean for the WWL to be subject-catalogued.
Among the signs of quality work at ODP are the
details incorporated in scope notes.
On examining, for example, the catalogue page
"Top: Business: Agriculture and Forestry: Livestock: Horses and Ponies:
Transportation: North America",
at or very close to the Coordinated Universal Time
20021019T205928Z, we find
a clear scope statement, with the warning that the presented material
is appropriate for readers investigating
horse-transport services, with
or without quarantine. The scope-note writer takes
care to supply instructions redirecting
those link providers who seek to advertise
their horse transport hardware, such as box trailers.
There is,
however, a difference between subject cataloguing,
even at its best, and indexing. In
twentieth-century library-science terms, subject cataloguing is a
matter of assigning a handful of descriptors to an entire book, to
generate a handful of cards for the subject-cards file drawers. In
traditional book indexing, by contrast, every single page of content
needs to be scrutinized, being liable to make its own individual
contribution to the stock of index entries. The old card-file
subject-catalogue "A" drawer would tell the medical student which
books to peruse in the broad area of Alzheimer's disease. It was, on
the contrary, back-of-book indexes that one had to consult in tracking
down references to narrow subareas in that broad expanse - in tracking
down page references, for instance, to the clockface-drawing test
uncovering those specific cognitive defects that are for the
gerontologist the signature of early Alzheimer's. Similarly, whereas
it was a handful of cards in the subject-catalogue "E" drawer that
directed the historian to books on Estonian interwar diplomacy, it was
back-of-book indexes that one had to consult in tracking down page
references to Jaan Poska's responsibilities in drafting the 1920
Treaty of Tartu. The old distinction persists, in that ODP
and similar human-maintained "Web directories" can only
hope to catalogue whole Web sites, not to scrutinize sites page
by page.
Today's Web search tools comprise not only subject catalogues,
but also machine-run search engines. The search engine offers retrieval at a
finer level of detail than the subject catalogue does, and indeed can
perform some of the work done in the twentieth century by the
back-of-book index. At or very close to the Coordinated Universal Time
20021018T194508Z, for example, I myself found the Google search engine
delivering numerous appropriate hits for the query Alzheimer clock
diagnosis, and delivering two mildly useful hits for the
Boolean-conjunction-of-two-exact-phrases construction "Jaan Poska"
"Treaty of Tartu".
Nevertheless, the search engine is not a substitute for the true
index, crafted by a human being. We may well need to find the
definition of a CATS point, a measure of university student workload
used in the UK. With existing search engines, this task is not
guaranteed to go smoothly: many a British Web academic course page
mentions the CATS point, and yet also sports some irrelevant
occurrence of the word "definition". Still worse, we may not always
realize what it is we seek - as when we investigate stars in the solar
neighbourhood without the realization, traditionally communicated to
the uninformed reader by the indexer's "See also" cross-reference,
that NASA-sponsored solar-neighbourhood stellar research uses the term
"NStars."
It is in fact the crafting of cross-references that gives the
back-of-book index much of its value, and consumes a correspondingly
large part (say, a fifth or a third) of the indexer's working hours.
Those working hours are expended in crafting not only the obvious
"Roncalli, Angelo. See John XXIII, Pope",
but also the recondite
"solar neighbourhood, stellar populations in. See
also NStars", or
again the recondite "vector product, triple. See also
scalar product, triple".
Two additional styles of cross-reference, rather unfashionable in late
twentieth-century printed-book praxis, but useful, and perhaps
destined to come into their own again with the heavy cross-referencing
demanded by eventual WWL readers, are the two "unders":
"syllogistic. See under logic, formal",
and "beaver. See also under
Québec, commercial history of".
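The four cross-reference styles just listed can be modelled as a small data structure. The following is a minimal Python sketch, not a description of any existing indexing tool: the names `CrossRef` and `expand_term` are hypothetical, and the sample table simply restates the examples quoted above. It assumes that a "See" or "See under" directive replaces the heading looked up, whereas a "See also" or "See also under" directive adds to it.

```python
from dataclasses import dataclass

# Hypothetical labels for the four cross-reference styles named in the text.
SEE, SEE_ALSO, SEE_UNDER, SEE_ALSO_UNDER = (
    "see", "see also", "see under", "see also under")


@dataclass(frozen=True)
class CrossRef:
    source: str  # heading the reader looked up
    kind: str    # one of the four labels above
    target: str  # heading the reader is sent to


# Sample table restating the examples quoted in the essay.
CROSS_REFS = [
    CrossRef("Roncalli, Angelo", SEE, "John XXIII, Pope"),
    CrossRef("solar neighbourhood, stellar populations in", SEE_ALSO, "NStars"),
    CrossRef("vector product, triple", SEE_ALSO, "scalar product, triple"),
    CrossRef("syllogistic", SEE_UNDER, "logic, formal"),
    CrossRef("beaver", SEE_ALSO_UNDER, "Québec, commercial history of"),
]


def expand_term(term, cross_refs):
    """Return the headings a reader should consult for `term`.

    Assumption: "see" and "see under" replace the looked-up heading
    (it carries no locators of its own); the two "also" styles keep
    the original heading and add the target.
    """
    headings = [term]
    for ref in cross_refs:
        if ref.source != term:
            continue
        if ref.kind in (SEE, SEE_UNDER):
            headings = [h for h in headings if h != term]
        if ref.target not in headings:
            headings.append(ref.target)
    return headings
```

Under these assumptions, looking up "Roncalli, Angelo" yields only the papal heading, while looking up the solar-neighbourhood heading yields that heading together with "NStars".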
A further indication of the value added by the back-of-book indexer is
the old rule of thumb (useful to publishing personnel managers in
segregating the possibly good indexers from the clearly bad) that no
one term in a single book is to have more than around five or seven
accompanying "reference locators", or page numbers. A good index
cannot, for example, say "daffodils 134,
156-89, 200, 207fig, 356,
398, 401, 444, 430", but must instead break the main heading
"daffodils" down into informative subheadings, each with some
conveniently small number of accompanying reference locators - as it
might be, "daffodils: fertilizers for", "daffodils: in literature",
and "daffodils: naturalizing of".
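The rule of thumb lends itself to a mechanical check. Here is a minimal Python sketch, assuming an index held as a mapping from heading to a list of reference locators; the function name `overloaded_headings`, the limit of seven, and the subheading locators in the sample are illustrative assumptions, not part of any established tool.

```python
MAX_LOCATORS = 7  # upper end of the "five or seven" rule of thumb


def overloaded_headings(index, limit=MAX_LOCATORS):
    """Return, in sorted order, headings with more than `limit` locators."""
    return sorted(h for h, locators in index.items() if len(locators) > limit)


# The undifferentiated "daffodils" entry from the text, plus two
# subheadings whose locators are invented purely for illustration.
sample_index = {
    "daffodils": ["134", "156-89", "200", "207fig", "356",
                  "398", "401", "444", "430"],
    "daffodils: fertilizers for": ["134", "200"],
    "daffodils: in literature": ["156-89", "356"],
}
```

A check of this kind can only flag a heading as a candidate for subdivision; deciding what the subheadings should be remains the indexer's work.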
Yet another suggestion of the care and attention needed in crafting an
index is conveyed by another rule of thumb: how many entries might the
indexer expect to post from a typical page of text in a typical
scholarly or scientific work? Workers in the profession, while
stressing that books vary, consider two or three entries per page
surprisingly low, and between five and ten entries per page typical.
In the publishing world as we knew it in the twentieth century, then,
the back-of-book index had worthy work to perform. (Indeed, even
further: something very like a back-of-book index helped the Bletchley
Park "Ultra" cryptanalysts - celebrated nowadays for their innovations
in computing machinery, and for their trimming perhaps a year from the
ghastly duration of World War II. The German radiotelegraphy decrypts
were indexed, with appropriately detailed cross-references, in a vast
system of cards. One speculates that without
meticulous "See" and "See also" cards, the
Ultra team would have been stymied in their
ultimately successful effort to convert the daily torrent of raw
information into an operationally relevant intelligence product! This
particular British military asset was backed up with a copy under
Oxford's Bodleian Library, as a precaution against the Luftwaffe.)
With the advent of the WWL, we shall continue to need indexers. WWL
readers in, say, the year 2051 can reasonably demand index-like tools
that deliver four-stage retrieval.
(a) Upon uttering, or typing, or mousing from a scrollable browse
list, a tentative index term ("solar neighbourhood, stellar
populations in") with a few other straightforward restrictors (say,
limiting the search to the English language, and to works published
only within the last decade), the reader should be presented with a
list of zero or more possibly relevant publications, together with a
list of zero or more "See", "See also", "See under", and
"See also under" directives.
(b) With a further few seconds' work, the reader should be able to
refine the tentative list of index terms into a safe list (in the case
imagined here, perhaps the Boolean OR of "solar neighbourhood, stellar
populations in" with "NStars"), and on the strength of this list to
obtain a first set of possibly relevant publications.
(c) With a further few minutes' work, the reader should be able to
make at least a tentative selection, on the strength of ODP-style
catalogue entries, of those documents whose individual indexes seem
worth skimming. The reader might, for instance, retrieve a thousand
English-language publications from the past decade, but with a couple
of mouse movements eliminate eight hundred as being popularizing and
derivative works, rather than peer-refereed original science.
(d) The reader should now be allowed to embark on a difficult and
hazardous kind of information retrieval without many twentieth-century
precedents, constructing an on-screen union index not from a
single publication, but from multiple publications. For each
individual publication, the union index will for a given indexing term
give at most five or seven or so reference locators, presenting each
reference locator as a hyperlink to an actual full-text paragraph or
sequence of paragraphs, on some WWL server or other.
I say here, advisedly, "without many", rather than "without any". The
terrifying task of building a union index did confront a tiny subset
of the professional indexing community before the advent of the
Internet, thanks to the need, even then, for cumulative indexing of
periodicals and multi-book sets. At the popularizer Time-Life Books,
for example, it was judged appropriate not only to index each book in
a set as it made its way to press, but also to publish an index for
the set as a whole, at the end of the project. In the "Special
Concerns" chapter of her Indexing Books (Chicago and
London: University of Chicago Press, 1994), indexing
teacher-practitioner-theorist Nancy Mulvany quotes former Time-Life
copy chief Diane Ullius on the difficulty of the task:
[I]t's not just a matter of merging the individual volume indexes. We
put as much time and effort into a 160-page index as we put into a
160-page narrative book. Work on the cumulative index begins long
before the last volume in the series has been
published. . . . Even
if a single indexer has produced all the individual indexes, there
will still be term consistency problems that must be resolved during
the editing phase.
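Ullius's point can be made concrete. A naive merge of individual indexes is a few lines of code, and for exactly that reason it leaves the hard problem of term consistency untouched. The following Python sketch is offered under stated assumptions: each volume's index is a mapping from heading to locators, the function names are hypothetical, and the one consistency test shown (headings built from the same words in a different order) is merely the crudest of the many checks an editor would need.

```python
from collections import defaultdict


def merge_indexes(volume_indexes):
    """Naively merge per-volume indexes into heading -> {volume: locators}."""
    merged = defaultdict(dict)
    for volume, index in volume_indexes.items():
        for heading, locators in index.items():
            merged[heading][volume] = list(locators)
    return dict(merged)


def word_key(heading):
    """Order-insensitive key: the sorted, case-folded words of a heading."""
    return tuple(sorted(heading.replace(",", " ").casefold().split()))


def inconsistent_headings(merged):
    """Group headings that use the same words in a different order."""
    groups = defaultdict(list)
    for heading in merged:
        groups[word_key(heading)].append(heading)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Invented two-volume example; volume names and locators are illustrative.
volumes = {
    "vol. 1": {"binaries, eclipsing": ["12", "40"]},
    "vol. 2": {"eclipsing binaries": ["7"], "shell stars": ["19"]},
}
merged = merge_indexes(volumes)
```

Here the merge succeeds mechanically, yet leaves "binaries, eclipsing" and "eclipsing binaries" as separate headings; `inconsistent_headings(merged)` reports the pair so that a human editor can choose between them or link them with a cross-reference.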
The last part in our four-stage model of a WWL reader's work in 2051
indicates the severity of the challenge eventually facing indexers. It
will no longer be possible to index publications in isolation from
each other. Instead, publications will need to be indexed with some
sensitivity to the standing of a document among its peers. So, for
instance, in indexing a graduate-level text on stellar astrophysics,
it will be necessary to consider the item not only as a self-contained
book, but as a book residing in a community of perhaps five hundred
relevantly similar books - as, so to speak, one book in twenty shelf
metres of cognate books.
It may be beyond the scope of human and machine intelligence to
produce good union indexes. However, since even a bad union index is
better than none, the effort must be made, and we must be prepared for
the slow discovery of what will over the decades prove to be at least
marginally acceptable practices. In the remainder of this essay, I
will develop one suggestion for a path to follow as we inch painfully
forward across unaccustomed terrain.
It is clear that there is no hope of centralized authorities doing the
bulk of the work that needs to be done in indexing, let alone in
trying to union-index, the WWL. The way forward lies in somehow
adapting the model of the ODP, or of the open-source GNU-cum-Linux
programming movement - or, for that matter, adapting the emerging
model of electricity generation, on which central electricity
generating authorities give way to grid-linked local provider
associations, each with its own humble wind turbine or photovoltaic
array. What, in the era of the WWL, could be analogous to the
neighbourhood operating its own wind turbine so as to buy power from
the continent-wide grid on some occasions, and to sell power into the
grid on others?
We start by considering the needs of the lone researcher keeping
private research notes. Such a person is bombarded with information -
the Astronomical Journal article read carefully in the
early morning, the Monthly Notices of the Royal Astronomical
Society article skimmed in haste before teaching started at ten
o'clock, the casual conversation over a sandwich in the departmental
lounge, the afternoon colloquium (although it proved largely
irrelevant, important issues came up for a couple of minutes at
question time), the popularizing Sky and Telescope or BBC
article savoured as a guilty pleasure over evening tea. The
meticulous researcher will clearly keep notes at the workstation day
by day. Clearly, too, the most retrieval-friendly notes will not be in
free-form text, but will be organized under some such formalism as
SGML or XML.
Here, to illustrate workstation note-taking tactics, is a sampling
from my own (essentially undergraduate- or beginning-MSc-level)
records. I keep my prefatory comments from the file intact, to supply
a bit of real-life historical context. I indicate the numerous
omissions from my actual 7744-line file with a "((SNIP))"
pseudo-tag. (The mysterious "HD21699" referred to now and again in the
notes is merely a rather bright helium-weak star in Perseus, known in
astrophysics by its "HD number", or Henry Draper survey
designation. We have here not some arcane library-science terminology,
but merely some routine discipline-specific language, on a par with
"sedum album" in botany, or with "elasticity" in economics. The mildly
mysterious "ApJ" and "PASP" are two printed-paper journals.)
<!--
consolidated personal reading list
of Tom Karmo = {tkarmo}
__rather experimental
__readers are encouraged to use this reading list
as a basis for their OWN experiments in
workstation bibliography management,
and to communicate their suggestions to {tkarmo}
__{tkarmo} communications particulars:
__<karmo@ungrad.astro.utoronto.ca> is deprecated e-mail address
__<verbum@interlog.com> is preferred e-mail address
__format is SGML (hand-coded)
__it may later be possible to develop an SGML
DTD (Document Type Definition)
on the strength of this hand-coding experiment
__the DTD will define various points so far left
undefined
__most notably, for each tag "FOO", whether
a record has to have
* exactly one FOO, or
* zero-or-one FOO, or
* zero-or-more FOOs
__with a DTD developed, Linux sgmls can be run
to check that this SGML document is DTD-compliant
__we thereby have formal machinery for detecting
certain kinds of data-entry errors
__but it would be better to find someone
in the astronomical community who has already developed
an SGML DTD for a consolidated personal reading list
__it may later be possible to develop Perl scripts
which will do some automatic processing of this SGML document
__example_a: retrieve from the document all and only the entries
which relate to the topic HD21699, and display
just the author names, titles, journals, and years
for those entries, suppressing other information
__example_b: convert this whole bloody document into some
quite different flat-ASCII format,
if a really good flat-ASCII format for
personal reading lists turns out to exist
__it is possible that some astronomy student, on some
campus somewhere, has gone through this exercise already,
and has invented a much superior flat-ASCII format,
perhaps even coded in some formalism other than SGML
__reading list was started 19990709
__reading list initially related to {rgarrison}{tkarmo}
collaboration on HD21699
__but it was intended
to have reading list grow to include many other topics
__reading list is sorted in forward order by year of publication,
and is sorted in forward alphabetical order by author surname
within any one given year
__{tkarmo} has consulted briefly on this reading-list format
with librarian ((SNIP)) at Dept of Astron,
University of ((SNIP))
-->
<BIBLIOG_TKARMO>
<!-- Here is a template for creating new records:
<RECORD>
<AUTHOR01>xxxx</AUTHOR01>
<AUTHOR02>xxxx</AUTHOR02>
<AUTHOR03>xxxx</AUTHOR03>
<YEAR>19xx</YEAR>
<JOURNAL>xxxxxx</JOURNAL>
<VOLUME>xxxx</VOLUME>
<COLLECTION>xxxx</COLLECTION>
<COLLECTION_EDITOR01>xxxx</COLLECTION_EDITOR01>
<COLLECTION_EDITOR02>xxxx</COLLECTION_EDITOR02>
<BOOK>xxxx</BOOK>
<EDITION>xxxx</EDITION>
<PUBLISHER_NAME>xxxx</PUBLISHER_NAME>
<PUBLISHER_CITY>xxxx</PUBLISHER_CITY>
<PAGE>xxxx</PAGE>
<URL>xxxx</URL>
<CONF SESSION=>xxxx</CONF>
<SEMINAR>xxxx</SEMINAR>
<COLLOQ>xxxx</COLLOQ>
<LECTURE>xxxx</LECTURE>
<SYMPOS>xxxx</SYMPOS>
<CNVRSATN>xxxx</CNVRSATN>
<VENUE>xxxx</VENUE>
<TITLE>xxxx</TITLE>
<UNIV_ESSAY>xxxx</UNIV_ESSAY>
<UNIV_THESIS>xxxx</UNIV_THESIS>
<COMMENT01></COMMENT01>
<COMMENT02></COMMENT02>
<COMMENT03></COMMENT03>
<COMMENT04></COMMENT04>
<NOTES>
</NOTES>
<TOPIC></TOPIC>
<TOPIC></TOPIC>
<TOPIC></TOPIC>
<TOPIC></TOPIC>
<LIBRARY></LIBRARY>
<DATEREC></DATEREC>
</RECORD>
__in this template, "URL" is used only for very special situations
__for example, for a publication which is available
by anonymous ftp from some server, and is NOT in print
__in this template, "VENUE" is used only for very special situations
__examples:
* colloquia
* conference presentations
__in this template, the role of "COMMENT01", . . . , "COMMENT04"
is not yet very sharply defined
__{tkarmo} has to think about
comments over the coming months and years,
not developing fully sharp comment concepts too quickly
__in this template, "NOTES" is intended for free-form comments
__"NOTES" is the place for storing substantive scientific
information
__for instance, the information that HD21699 is
not radio-bright at 6 cm
__with a good author, "NOTES" might store a great deal
of information
__"NOTES" is useful for guiding one's overall study programme
__"NOTES" may be expected to evolve rather radically as one
makes the transition from BSc to MSc to PhD, etc, etc, etc
__ :-)
__in this template, <TOPIC> means, essentially, "keyword"
__the following use of "SUBTOPIC" is legal:
<TOPIC>HD21699<SUBTOPIC>Zeeman effect</SUBTOPIC></TOPIC>
__in this template, "LIBRARY" is intended as a place to store
library call numbers
__keeping that kind of information on line can save time
in certain special cases
__it is not ALWAYS obvious where in the library system
an ink-and-paper artefact resides
__"LIBRARY" is only supposed to be used in unusual situations
__common sense and street smarts are ENTIRELY sufficient
to let one locate items like ApJ and PASP
__in this template, "DATEREC" stores the timestamp of the last
major modification to the given record
__a record started @19991223T235900Z, modified in a major way
@20010430T131345Z, and trivially modified
@20010501T123004Z, should be timestamped 20010430T131345Z
__note timestamping convention, with all times stated to
the second, and referred to "Z" (UTC = Universal Coordinated
Time): CCYYMMDDThhmmssZ
__this convention is prescribed by ISO in Geneva
__Linux boxes can be readily configured to generate
timestamps which conform to this convention
-->
<!-- Here are templates for creating new cross-refs:
<SEE>
<SEARCH>
xxxx
</SEARCH>
<RETRIEVE>
xxxx
</RETRIEVE>
</SEE>
<SEE_ALSO>
<SEARCH>
xxxx
</SEARCH>
<RETRIEVE>
xxxx
</RETRIEVE>
</SEE_ALSO>
-->
<!--
The document should have first all the records,
then all the "see" cross-refs,
then all the "see also" cross-refs.
-->
<RECORD>
<AUTHOR01>Struve, Otto</AUTHOR01>
<YEAR>1935</YEAR>
<JOURNAL>ApJ</JOURNAL>
<VOLUME>82</VOLUME>
<PAGE>252-260</PAGE>
<TITLE>
A Test of Thermodynamic Equilibrium in the Atmospheres
of Early-Type Stars
</TITLE>
<NOTES>
__following indicates two concepts of great importance
(the He anomaly, dilution) to {tk}'s overall study problem
<__QUOTE>
The intensities of He I lines show variations which are
contrary to Boltzmann's formula [fn of tempr], and Rudnick
has shown that this effect cannot be explained by turbulence.
It is suggested that the He anomaly results from an accumulation
of atoms in the triplet system, the lowest term of which is
metastable. Such an accumulation is possible if there are
departures from thermodynamic equilibrium, e.g., if the
photospheric radiation is appreciably diluted in the
absorbing atmosphere. Departures of this nature are not
improbable in the giants of early type.
</__QUOTE>
__{tk} did NOT understand 19991215T035200Z
</NOTES>
<TOPIC>dilution</TOPIC>
<TOPIC>He anomaly</TOPIC>
<DATEREC>19991215T035200Z</DATEREC>
</RECORD>
((SNIP))
<RECORD>
<AUTHOR01>Merrill, Paul W.</AUTHOR01>
<YEAR>1952</YEAR>
<JOURNAL>ApJ</JOURNAL>
<VOLUME>115 (no 2)</VOLUME>
<PAGE>145-153</PAGE>
<TITLE>Pleione: The Shell Episode</TITLE>
<NOTES>
__lovely figure (line graph)
<__QUOTE>
based on eye estimates
</__QUOTE>
of intensities of shell lines from before 1940 to after 1950
__<__QUOTE>
Before 1945 all the hydrogen lines had nearly the same
displacement, the amount of which corresponded closely to the
radial velocity of Pleione measured before the shell episode.
<!-- that's on the order of +7.5 or +8.5 km/sec -->
From 1946 on, the displacements of successive Balmer lines
exhibited an increasing negative progression which in 1951
became very pronounced.
</__QUOTE>
__Merrill assumes (cf p. 153) that
<__QUOTE>
increasing negative displacements (outward motions in the
stellar atmosphere) correspond to increasing effective heights
of absorbing zones above the photosphere
</__QUOTE>
__but this seems wrong
__we find that shorter members of Balmer series correspond
to greater negative displacements
__the shorter the wavelength of a Balmer line,
the DEEPER we are seeing into the absorbing gas,
and so the LOWER is the height above the photosphere
__this is explained in Underhill _The Early Type Stars_
p. 232:
<__QUOTE>
systematic increase of velocity of approach as one goes
up the Balmer series toward the Balmer limit was discovered
by Merrill and Sanford (1944) in the star 48 Librae during
one of its active periods. Merrill calls it the Balmer
progression. The simplest interpretation of the observations
is that material is ejected with a large velocity
and that the material is decelerated as it moves outward
from the star. Since the higher members of the Balmer
series are intrinsically weak lines, one looks through to
the deepest layers of the shell in these wave lengths.
Here the outward velocity is highest.
</__QUOTE>
__Merrill's last remark in his paper seems again to show
that he has the physical meaning of the Balmer progression
the wrong way around:
<__QUOTE>
Is the effective level of formation of the dark hydrogen lines
near the limit of the Balmer series higher than that for lines
of greater wave length? I hope to discuss this interesting
question in future papers.
</__QUOTE>
</NOTES>
<TOPIC>Pleione</TOPIC>
<TOPIC>shell stars</TOPIC>
<TOPIC>Balmer progression</TOPIC>
<DATEREC>20000414T020715Z</DATEREC>
</RECORD>
((SNIP))
<RECORD>
<AUTHOR01>Pych, Wojtek</AUTHOR01>
<YEAR>2002</YEAR>
<COLLOQ>Stellar Discussion Group</COLLOQ>
<TITLE>(binaries in globular clusters)</TITLE>
<NOTES>
__astrophysical importance of binaries in globular clusters
[_I did not understand this well, but
relevant is the great age of the globular clusters,
leading to some kind of opportunity to understand
evolutionary history well]
__until 1978, no binaries known in globular clusters
__in 1978, Niss, Jorgenson, Lautsen found eclipsing binary NJL5
in omega Cen
__followup in 1893-1950 archives found 250 photos, 14 eclipses
__a new era began with the advent of CCDs and accompanying
photometry tools such as Daophot, Dophot:
* 1990: eclipsing binary in field of NGC5466
* 1992-95: 29 eclipsing binaries in omega Cen, 14 in 47 Tuc
__mostly contact binaries
* 1996: Cluster AgeS Experiment (CASE) began
__we now know more than 100 eclipsing binaries in fields of
globular clusters
__in this talk, we focus on OGLEC-17
__light curve has flat bottom, so full eclipse
__mag 17.5
__new prospects:
* large scopes Magellan, Gemini, VLT, Subaru, Keck
* ISIS image-subtraction software (Alard & Lupton 1998),
good for very dense fields
[_in discussion {t.bolton} remarked that binaries in cluster core
have hard time surviving unless their periods are under a day]
[_in private post-discussion chat {w.pych} told {t.karmo},
in answ to {t.karmo} query, that the combo of ISIS and Magellan
makes it reasonable to look for eclipsing binaries all the way in
to the core of a globular cluster]
</NOTES>
<TOPIC>ISIS image-subtraction software</TOPIC>
<TOPIC>globular clusters, eclipsing binaries in</TOPIC>
<TOPIC>binaries, eclipsing</TOPIC>
<TOPIC>eclipsing binaries</TOPIC>
<DATEREC>20021002T175607Z</DATEREC>
</RECORD>
((SNIP))
<!-- "SEE" CROSS-REFS -->
<SEE>
<SEARCH>
speckle interferometry
</SEARCH>
<RETRIEVE>
interferometry, speckle
</RETRIEVE>
</SEE>
<SEE>
<SEARCH>
synthetic photometry
</SEARCH>
<RETRIEVE>
photometry, synthetic
</RETRIEVE>
</SEE>
((SNIP))
</BIBLIOG_TKARMO>
<! ----------------------------------------------------------------------->
<! --END-- >
<! ----------------------------------------------------------------------->
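The file's prefatory comments look forward to scripts for tasks such as "example_a": retrieving all and only the entries on a given topic, and displaying just authors, title, journal, and year. What follows is a minimal sketch of that retrieval, written in Python rather than the Perl the comments mention. It assumes the hand-coded tagging shown above and uses regular expressions instead of a real SGML parser, since the file has no DTD; the function names and the abbreviated `SAMPLE` string are hypothetical.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
RECORD_RE = re.compile(r"<RECORD>(.*?)</RECORD>", re.DOTALL)
SUBTOPIC_RE = re.compile(r"<SUBTOPIC>.*?</SUBTOPIC>", re.DOTALL)


def fields(record, tag_pattern):
    """Return whitespace-normalized contents of elements whose tag name
    matches the regular expression `tag_pattern`."""
    element = re.compile(rf"<({tag_pattern})>(.*?)</\1>", re.DOTALL)
    return [" ".join(text.split()) for _tag, text in element.findall(record)]


def first(values):
    """Return the first value, or None if the field is absent."""
    return values[0] if values else None


def records_on_topic(document, topic):
    """Return (authors, year, journal, title) for records tagged `topic`."""
    summaries = []
    # Comments are dropped first, so the commented-out template is skipped.
    for record in RECORD_RE.findall(COMMENT_RE.sub("", document)):
        # A nested SUBTOPIC is ignored when matching the main topic.
        if topic not in fields(SUBTOPIC_RE.sub("", record), "TOPIC"):
            continue
        summaries.append((
            fields(record, r"AUTHOR\d+"),
            first(fields(record, "YEAR")),
            first(fields(record, "JOURNAL")),
            first(fields(record, "TITLE")),
        ))
    return summaries


# Abbreviated stand-in for the full file, for illustration only.
SAMPLE = """
<!-- template, to be skipped: <RECORD><TOPIC>xxxx</TOPIC></RECORD> -->
<RECORD>
<AUTHOR01>Struve, Otto</AUTHOR01>
<YEAR>1935</YEAR>
<JOURNAL>ApJ</JOURNAL>
<TITLE>
A Test of Thermodynamic Equilibrium in the Atmospheres
of Early-Type Stars
</TITLE>
<TOPIC>dilution</TOPIC>
<TOPIC>He anomaly </TOPIC>
</RECORD>
<RECORD>
<AUTHOR01>Merrill, Paul W.</AUTHOR01>
<YEAR>1952</YEAR>
<JOURNAL>ApJ</JOURNAL>
<TITLE>Pleione: The Shell Episode</TITLE>
<TOPIC>Pleione</TOPIC>
<TOPIC>shell stars</TOPIC>
</RECORD>
"""
```

Calling `records_on_topic(SAMPLE, "Pleione")` returns a single summary, for the Merrill paper. Matching is by exact topic string after whitespace normalization, so the sketch is only as reliable as the consistency of the topic terms themselves.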
Many people in both academia and industry are no doubt now working at
about this level of unsophistication, some hundreds of thousands of
them with still less formalism than I have used here. Some tens or
hundreds of thousands have no doubt taken the dangerous path of
overmechanization, entrusting their precious records to a relational
database, or even to a closed-source, commercial, bibliography manager
(liable to keep records as machine-readable binary files, and to
change those binary formats in costly "upgrades" as the years go
by). Some (we may piously hope) have advanced to the next appropriate
stage in formalization, using hand-coded XML in place of my rather
archaic pseudo-SGML. In the best of all possible worlds, some will
already have constructed a proper XML schema or DTD, to define their
tagging scheme, and so to allow a validating parser to flag any
erroneous tags.
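Short of a proper schema or DTD, even a crude mechanical check catches some data-entry errors. The sketch below, in Python, is emphatically not a validating parser: it knows nothing about which tags are permitted or how often they may occur, and merely reports closing tags that do not match the innermost open tag. The function name `tag_errors` is hypothetical.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# An opening or closing tag: optional slash, a name, optional attributes.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][A-Za-z0-9_]*)[^>]*>")


def tag_errors(document):
    """Return human-readable messages for mismatched or unclosed tags."""
    errors = []
    stack = []  # names of tags opened and not yet closed
    for match in TAG_RE.finditer(COMMENT_RE.sub("", document)):
        closing, name = match.group(1) == "/", match.group(2)
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            expected = stack[-1] if stack else "nothing open"
            errors.append(
                f"unexpected </{name}> (innermost open tag: {expected})")
    errors.extend(f"<{name}> never closed" for name in reversed(stack))
    return errors
```

A well-formed fragment yields an empty list, whereas a slip such as closing a `<TOPIC>` element with `</TOPC>` is reported at once. A single slip tends to produce several follow-on messages, so the first message is the one to trust.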
I now suggest that we proceed a step beyond the hand coding of our
research notes, and instead develop a non-commercial tool to write the
tags. Part of what the tool does will be straightforward. Clearly it
will have to prompt for author, title, and the like, in some cases
offering rational defaults. Clearly, also, the tool will have to offer
a window into which one types the core of a typical entry in a file -
namely, the substantive, if telegraphic and private, notes on what of
research relevance was said in the particular publication,
conversation, or colloquium at hand.
However, what is important from the perspective of the eventual union
indexing of the WWL is the provision the envisaged tool makes for
index terms. We should have our tool display, in some appropriate
scrollable window, a list of all the index terms (in the present crude
pseudo-SGML formalism, the material enclosed in <TOPIC> tags) already
used by the researcher, with a further scrollable display of all the
researcher's previous decisions regarding cross-referencing (in the
present crude formalism, the material tagged <SEE>). Guided by the displays
- professional indexing software such as CINDEX will be a source of
ideas for our eventual interface design - the researcher will
sometimes accept a term already used, sometimes construct a new term.
There will also be provisions not for entering new records, but merely
for editing the entire accumulated set of terms, including the accumulated
hundreds or thousands of cross-references.
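The term display at the heart of such a tool might start from something as simple as the following Python sketch. It assumes the researcher's existing terms are available as a list of strings and offers, for a partly typed term, the existing terms that contain it; the function name `suggest_terms` and the cut-off of ten suggestions are assumptions made for illustration.

```python
def suggest_terms(partial, vocabulary, limit=10):
    """Return up to `limit` existing index terms containing `partial`,
    matched case-insensitively and listed in sorted order."""
    needle = partial.casefold()
    matches = sorted(t for t in set(vocabulary) if needle in t.casefold())
    return matches[:limit]


# Terms taken from the sample records shown earlier.
used_terms = [
    "binaries, eclipsing",
    "eclipsing binaries",
    "globular clusters, eclipsing binaries in",
    "ISIS image-subtraction software",
    "shell stars",
]
```

Typing "eclips" would bring up the three eclipsing-binary terms together, making it plain to the researcher that two of them are the same concept in inverted and direct form.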
We may imagine the researcher updating the workstation at the end of
every day in the light of that day's handwritten or laptop-written
library, colloquium, and conversation notes, and then at the end of
each academic semester taking some hours to review the all-important
indexing structure, breaking up excessively broad terms into subterms
and supplying appropriate new cross-references.
What should such a tool be called? Short, vivid terms like "Scindex"
and "Infotron" either are already in use, somewhere in a vast
commercial world armed to the teeth with trademark registries, or are
liable to come into unexpected use. But we may provisionally indulge
in the name IndiciaScientia, as a latinization, with the
neuter plural of the present participle sciens,
"being-in-possession-of-knowledge", for "smart (knowledgeable,
intelligent) pointers". Under classical, as distinct from the less
satisfactory Church, Latin pronunciation, the name becomes
"in-DIK-ee-ah skee-ENN-tee-ah".
The envisaged IndiciaScientia, as described so far, is a free-standing
application, analogous to the lone photovoltaic array or wind turbine,
powering a single electrical installation, but not participating in
the continental grid. Now, however, we can proceed to the full
IndiciaScientia concept, under which individual workstations run
IndiciaScientia clients capable of being pointed, at the user's
discretion, to public-access IndiciaScientia servers. We may imagine
any reasonably cohesive intellectual community offering an
IndiciaScientia server, just as leaders in intellectual communities
today offer e-mail listservs. Our hypothetical astronomical researcher
may be envisaged as connecting to some universally respected
IndiciaScientia server at, say, the International Astronomical Union
(IAU) in Paris. Analogously, biologists, electronics engineers, and
anglophone analytical philosophers may be imagined as working through,
respectively, the Council of Biology Editors (CBE), the Institute of
Electrical and Electronics Engineers (IEEE), and the American
Philosophical Association (APA).
At indexing-term selection time, our astrophysics researcher can
choose to keep the IndiciaScientia client offline, or to connect to
Paris, so as to display, and optionally to import, terms already
approved by a bibliography specialist at the IAU. Further, when (on,
as it were, the last day of the academic semester) all the accumulated
indexing on the local workstation has been fine-tuned to the
researcher's satisfaction, it is time to consider once again whether
to connect the IndiciaScientia client to the remote IndiciaScientia
server, this time uploading rather than downloading. In the upload,
the remote server is given, at the researcher's discretion, some or
all of the researcher's accumulated indexing terms and
cross-references. What is uploaded is (it must be stressed) not the
private mass of research notes, but only the indexing language, the
conceptual framework which has been found useful in referencing and
cross-referencing the notes. The staff maintaining the remote server
can then decide whether to update their indexing vocabulary, to reflect
some or all of the suggestions which have come in from that particular
upload.
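The two exchanges just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names LocalIndex, import_terms, and build_upload are hypothetical, no actual network protocol is shown, and the one point the sketch is meant to make concrete is that the upload carries vocabulary and cross-references, never the notes themselves.

```python
from dataclasses import dataclass, field


@dataclass
class LocalIndex:
    """A researcher's private workstation data (all names are hypothetical)."""
    notes: dict = field(default_factory=dict)       # note id -> private text
    terms: set = field(default_factory=set)         # indexing terms in use
    cross_refs: set = field(default_factory=set)    # (from-term, see-also-term)


def import_terms(local, approved, wanted):
    """Download step: import only the approved terms the researcher selects."""
    imported = approved & wanted
    local.terms |= imported
    return imported


def build_upload(local, share_terms):
    """Upload step: package selected vocabulary only; notes are never included."""
    terms = local.terms & share_terms
    # Keep a cross-reference only if both of its ends are being shared.
    refs = {(a, b) for (a, b) in local.cross_refs if a in terms and b in terms}
    return {"terms": sorted(terms), "cross_refs": sorted(refs)}


local = LocalIndex(
    notes={"n1": "private observing-run remarks"},
    terms={"variable stars", "my scratch term"},
    cross_refs={("variable stars", "Cepheids")},
)
# Display the server's approved list, but import only what is wanted.
import_terms(local, approved={"Cepheids", "quasars"}, wanted={"Cepheids"})
# At semester's end, offer some (not all) of the local vocabulary.
payload = build_upload(local, share_terms={"variable stars", "Cepheids"})
```

Here the researcher's "my scratch term" stays on the workstation because it was not selected for sharing, and the payload has no field for notes at all.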
Any one researcher can hope to make only a modest contribution to the
work of the remote server, perhaps contributing only ten useful new
indexing terms or new cross-references a year to the growing
server-side thesaurus. However, even in a small discipline such as
astronomy, the depth of the accumulated term-selection wisdom will be
considerable. Here are some possible numbers for astronomy: the
discipline as it exists today may comprise perhaps ten thousand core
scientists, as distinct from the large army of graduate students,
telescope assistants, computer maintainers, and the like supporting
the core. (I myself count as a mere support worker, not as a core
scientist!)
Human nature being what it is, only a minority of the scientists in
the core will favour the envisaged IndiciaScientia over the less
rational methods of notekeeping. Let us suppose, then, that just one
core scientist out of a hundred opts to use IndiciaScientia, and let
us make the further pessimistic assumption that IndiciaScientia finds
no users at all outside the core. That still leaves us with a hundred
IndiciaScientia users, each uploading perhaps ten useful suggestions a
year.
In its first year of operation, IndiciaScientia will on our
conservative scenario still generate a thousand-term thesaurus. As the
years roll by, the thesaurus will grow.
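The arithmetic behind the thousand-term figure can be set out as a short Python sketch. The three constants are the estimates given above. The function name thesaurus_size, the acceptance parameter, and the assumptions of a fixed user base and no duplicate suggestions are additions made purely for illustration.

```python
CORE_SCIENTISTS = 10_000     # the estimate above for astronomy
ADOPTION_ONE_IN = 100        # one core scientist in a hundred adopts the tool
SUGGESTIONS_PER_USER = 10    # useful terms or cross-references per user per year


def thesaurus_size(years, acceptance=1.0):
    """Cumulative accepted suggestions after `years`, with a fixed user base.

    `acceptance` is a hypothetical fraction of suggestions that the server
    staff adopt; the thousand-term figure corresponds to acceptance = 1.0.
    """
    users = CORE_SCIENTISTS // ADOPTION_ONE_IN
    return round(users * SUGGESTIONS_PER_USER * years * acceptance)


first_year = thesaurus_size(1)
after_decade_half_accepted = thesaurus_size(10, acceptance=0.5)
```

On these assumptions the first year gives 100 users times 10 suggestions, or 1000 terms. Even if the server staff accepted only half of what came in, a decade would still give 5000.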
Given a decade, anyone asking, "How should we index, say, a
graduate-level book-length survey of current knowledge in stellar
astronomy?" will have a credible answer: "Well, as our server makes
clear to the world, our own people, the very people competent to read
and write those stellar-astronomy books, have over the past decade
found themselves requiring this particular set of some hundred or some
thousand terms to index the various facets of this specific
astronomical topic."
When it comes time not to keep private research notes, but to write
for publication, the IndiciaScientia server will prove a useful
resource, supplying a controlled vocabulary for much of the
indexing. We may imagine authors, editors, and managers in the
reputable WWL publishing projects of 2051 priding themselves not just
on amassing meticulous bibliographies, as they did in twentieth-century
academic publishing, but also on consulting the appropriate
IndiciaScientia server (or, in cross-discipline writing, set of
servers). Where budgets and time permit, they will purchase indexing
services from freelance indexing professionals, as they do today. Just
as we do not now take scholarly and scientific writings seriously
unless they incorporate diligently constructed bibliographies, so the
readers of 2051 will not take World Wide Library publications
seriously unless they incorporate indexes diligently harmonized with
the controlled server-side vocabularies. Harmonization cannot be
perfect, since there will always be a need for indexing a given work
with a number of terms not (or not yet) present in the authoritative
server-side thesauri. But reputable indexes will display some
quality-control information, measuring their degree of divergence from
the thesaurus, perhaps in the style "Guiding thesaurus for this index:
IAU IndiciaScientia server at or very near Coordinated Universal Time
20511014T235521Z. Conformance metric for this index: 84.3 percent of
our headings and cross-references appear also in the thesaurus."
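One plausible reading of such a conformance metric, sketched in Python, is the share of an index's headings and cross-references that also occur in the server-side thesaurus. The function name conformance and the sample entries are hypothetical, and matching is by exact string, which a real tool would presumably relax (for case, plurals, and inversions).

```python
def conformance(index_entries, thesaurus):
    """Percentage of an index's entries that also appear in the thesaurus."""
    if not index_entries:
        raise ValueError("index has no entries")
    shared = index_entries & thesaurus
    return 100.0 * len(shared) / len(index_entries)


# Hypothetical thesaurus snapshot and book index.
thesaurus = {"Cepheids", "variable stars", "distance scale", "quasars"}
book_index = {"Cepheids", "variable stars", "distance scale", "blue loops"}

percent = conformance(book_index, thesaurus)
```

Three of the four entries in this small index appear in the thesaurus, so the metric is 75.0 percent. The fourth, "blue loops", is the kind of not-yet-approved term that keeps harmonization from being perfect.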
It is easy enough to see how to start programming IndiciaScientia
clients, and how to set up client-server communications. Today's
clients might well be Java applets, runnable in any browser that is
applet-capable, and reading XML files on the given workstation
conformant to some appropriate DTD. (So, in today's terms, even if
your browser is an archaic Netscape 4.x from 1998 or so, you're okay:
if you can read, for instance, the Java-applet moving-headlines ticker
at the world-news edition of http://news.bbc.co.uk/, you have all
the computing power you need.) The existing TCP/IP is all the bedrock
we need for communications with servers. The open-source software
model, with the GPL or some similar public licence, is all the legal
bedrock we need.
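To make the XML side concrete, here is a minimal sketch, in Python rather than in applet form, of reading a workstation index file. The element and attribute names (index, term, subterm, see, name, ref) are invented for illustration, since no DTD has yet been drafted. Note too that Python's standard xml.etree.ElementTree parser checks only well-formedness; validating against a DTD would need a separate tool.

```python
import xml.etree.ElementTree as ET

# A hypothetical workstation index file; the vocabulary is illustrative only.
SAMPLE = """\
<index>
  <term name="variable stars">
    <subterm name="Cepheids"/>
    <see ref="distance scale"/>
  </term>
  <term name="distance scale"/>
</index>
"""


def load_index(xml_text):
    """Parse the hypothetical <index> format into plain dictionaries."""
    root = ET.fromstring(xml_text)
    result = {}
    for term in root.findall("term"):
        result[term.get("name")] = {
            "subterms": [s.get("name") for s in term.findall("subterm")],
            "see_also": [r.get("ref") for r in term.findall("see")],
        }
    return result


index = load_index(SAMPLE)
```

The resulting dictionary maps each heading to its subterms and cross-references, which is all the structure a client needs before it can display, import, or upload terms.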
I'm learning Java now, and have for many months been working to
supplement my limited old SGML skills with XML. If you can help refine
the concepts proposed here, or can help with the programming, or
(above all) can help reach the eyes and hearts of the high
academic-publishing authorities in your discipline, do get in touch.
The most efficient means of communication is an e-mail to
verbum@interlog.com, with a subject header incorporating the acronym
"WWL".