(c) 2002 Toomas Karmo. You are welcome to distribute copies of this essay either in a paper printing not exceeding 10 copies or electronically, without asking my explicit permission, provided you do not modify or omit any of the contents, including this paragraph. Please contact me if you wish to use this material for other purposes, such as magazine publication. Network location of authoritative file version: http://www.interlog.com/~verbum/, literary-pages section. Revision history: 20021019T032717Z/version 0001.0000.

Indexing Tomorrow's
Web-Delivered
World Wide Library

Information-technology professionals joke that developers ignorant of Unix are doomed to reinvent it. (We recall that Unix represented the leading edge of information technology in the 1970s; that Unix, in all but nomenclature, is alive and well today both in Linux and in the new Mac OS X; and that the contemporary operating-system architect is eventually forced either to adapt or to reinvent the core ideas, such as permissions, which Unix so carefully articulated so long ago.)

Comparable to the venerable Unix in computing is the still more venerable discipline of indexing in the world of publishing. Back-of-book indexing was already known in Renaissance times. The profession is alive and well today, with a few thousand practitioners around the world, many of them grouped under such professional umbrellas as the UK's Society of Indexers, or its slightly younger sister societies in other countries. A typical professional indexer, if working full time, might hope to index two or three books a month, at - alas - something considerably less than two thousand American dollars for a typical book, under contract either with the author or with the publishing house. The professional is inevitably found tracking hundreds of terminological decisions not with index cards, and not with the absurdly misnamed "indexing modules" of word processors, but with a tool such as CINDEX. The professional indexer has in many cases completed a formal indexing course, to be thought of as equivalent to at least a fifth or a quarter of the North American full-time university student's third-year semester workload.

As modern operating-system architects have to study, or else to reinvent, parts of Unix, so the builders of the emerging Internet will have to study, or else to reinvent, aspects of traditional book indexing.

What is this emerging Internet? We already have a World Wide Web (WWW) for advertising and e-commerce. What will emerge, however, over coming decades is a true World Wide Library (say, "WWL"), housing the full text of all significant new scientific and scholarly work, and hosted on some tens or hundreds of thousands of servers - a subset of the millions that might by then be publishing content of one kind or another on the WWW.

If the gods smile on humanity, the WWL will embrace today's open-source software model, making scientific and scholarly content viewable free of charge over the world. If the gods curse us, we shall have a WWL burdened by "digital rights management," in which content is viewable only for a fee, and in which centralized organs of corporate or state security record even which specific files get opened on which specific workstations. (The abyss that threatens if the gods curse is sketched by Cambridge University computer scientist Ross Anderson, in an examination at http://www.cl.cam.ac.uk/~rja14/tcpa-faq.html of the Microsoft Trusted Computing Platform Alliance "Palladium" initiative.) But in one way or another, the WWL is on its way.

Thanks in part to the excellent work being done by the tens of thousands of volunteer cataloguers in the Open Directory Project (ODP), or "Directory Mozilla", at http://dmoz.org, and fed into various other search sites, such as the "Directory" portion of Google, we already have some idea of what it will mean for the WWL to be subject-catalogued. Among the signs of quality work at ODP are the details incorporated in scope notes. On examining, for example, the catalogue page "Top: Business: Agriculture and Forestry: Livestock: Horses and Ponies: Transportation: North America", at or very close to the Coordinated Universal Time 20021019T205928Z, we find a clear scope statement, with the warning that the presented material is appropriate for readers investigating horse-transport services, with or without quarantine. The scope-note writer takes care to supply instructions redirecting those link providers who seek to advertise their horse-transport hardware, such as box trailers.

There is, however, a difference between subject cataloguing, even at its best, and indexing. In twentieth-century library-science terms, subject cataloguing is a matter of assigning a handful of descriptors to an entire book, to generate a handful of cards for the subject-cards file drawers. In traditional book indexing, by contrast, every single page of content needs to be scrutinized, being liable to make its own individual contribution to the stock of index entries. The old card-file subject-catalogue "A" drawer would tell the medical student which books to peruse in the broad area of Alzheimer's disease. It was, on the contrary, back-of-book indexes that one had to consult in tracking down references to narrow subareas in that broad expanse - in tracking down page references, for instance, to the clockface-drawing test uncovering those specific cognitive defects that are for the gerontologist the signature of early Alzheimer's. Similarly, whereas it was a handful of cards in the subject-catalogue "E" drawer that directed the historian to books on Estonian interwar diplomacy, it was back-of-book indexes that one had to consult in tracking down page references to Jaan Poska's responsibilities in drafting the 1920 Treaty of Tartu. The old distinction persists, in that ODP and similar human-maintained "Web directories" can only hope to catalogue whole Web sites, not to scrutinize sites page by page.

Today's Web search tools comprise not only subject catalogues, but also machine-run search engines. The search engine offers retrieval at a finer level of detail than the subject catalogue does, and indeed can perform some of the work done in the twentieth century by the back-of-book index. At or very close to the Coordinated Universal Time 20021018T194508Z, for example, I myself found the Google search engine delivering numerous appropriate hits for Alzheimer clock diagnosis, and delivering two mildly useful hits for the Boolean-conjunction-of-two-exact-phrases construction "Jaan Poska" "Treaty of Tartu".

Nevertheless, the search engine is not a substitute for the true index, crafted by a human being. We may well need to find the definition of a CATS point, a measure of university student workload used in the UK. With existing search engines, this task is not guaranteed to go smoothly: many a British Web academic course page mentions the CATS point, and yet also sports some irrelevant occurrence of the word "definition". Still worse, we may not always realize what it is we seek - as when we investigate stars in the solar neighbourhood without the realization, traditionally communicated to the uninformed reader by the indexer's See also cross-reference, that NASA-sponsored solar-neighbourhood stellar research uses the term "NStars."

It is in fact the crafting of cross-references that gives the back-of-book index much of its value, and consumes a correspondingly large part (say, a fifth or a third) of the indexer's working hours. Those working hours are expended in crafting not only the obvious "Roncalli, Angelo. See John XXIII, Pope", but also the recondite "solar neighbourhood, stellar populations in. See also NStars", or again the recondite "vector product, triple. See also scalar product, triple".

Two additional styles of cross-reference, rather unfashionable in late twentieth-century printed-book praxis, but useful, and perhaps destined to come into their own again with the heavy cross-referencing demanded by eventual WWL readers, are the two "unders": "syllogistic. See under logic, formal", and "beaver. See also under Québec, commercial history of".

A further indication of the value added by the back-of-book indexer is the old rule of thumb (useful to publishing personnel managers in segregating the possibly good indexers from the clearly bad) that no one term in a single book is to have more than around five or seven accompanying "reference locators", or page numbers. A good index cannot, for example, say "daffodils 134, 156-89, 200, 207fig, 356, 398, 401, 444, 430", but must instead break the main heading "daffodils" down into informative subheadings, each with some conveniently small number of accompanying reference locators - as it might be, "daffodils: fertilizers for", "daffodils: in literature", and "daffodils: naturalizing of".
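The rule of thumb lends itself to a mechanical check. What follows is a minimal Python sketch, offered as one possible approach rather than as established indexing practice. The function name overloaded_headings, the dictionary layout (heading mapped to a list of locators), and the threshold of seven are all assumptions made for illustration.

```python
# Illustrative sketch only: the data layout and the threshold are assumptions.
MAX_LOCATORS = 7  # upper end of the "five or seven" rule of thumb


def overloaded_headings(index, limit=MAX_LOCATORS):
    """Return, in alphabetical order, the headings with more locators than the limit."""
    return sorted(h for h, locators in index.items() if len(locators) > limit)


# A toy index: one undifferentiated heading and three informative subheadings.
index = {
    "daffodils": ["134", "156-89", "200", "207fig", "356", "398", "401", "430", "444"],
    "daffodils: fertilizers for": ["134", "200"],
    "daffodils: in literature": ["356", "398", "401"],
    "daffodils: naturalizing of": ["156-89", "207fig"],
}

print(overloaded_headings(index))
```

A check of this kind can only flag the offending heading. Deciding how to break it into subheadings remains the indexer's work.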

Yet another suggestion of the care and attention needed in crafting an index is conveyed by another rule of thumb: how many entries might the indexer expect to post from a typical page of text in a typical scholarly or scientific work? Workers in the profession, while stressing that books vary, consider two or three entries per page surprisingly low, and between five and ten entries per page typical.

In the publishing world as we knew it in the twentieth century, then, the back-of-book index had worthy work to perform. (Indeed, even further: something very like a back-of-book index helped the Bletchley Park "Ultra" cryptanalysts - celebrated nowadays for their innovations in computing machinery, and for their trimming perhaps a year from the ghastly duration of World War II. The German radiotelegraphy decrypts were indexed, with appropriately detailed cross-references, in a vast system of cards. One speculates that without meticulous See and See also cards, the Ultra team would have been stymied in their ultimately successful effort to convert the daily torrent of raw information into an operationally relevant intelligence product! This particular British military asset was backed up with a copy under Oxford's Bodleian Library, as a precaution against the Luftwaffe.)

With the advent of the WWL, we shall continue to need indexers. WWL readers in, say, the year 2051 can reasonably demand index-like tools that deliver four-stage retrieval.

(a) Upon uttering, or typing, or mousing from a scrollable browse list, a tentative index term ("solar neighbourhood, stellar populations in") with a few other straightforward restrictors (say, limiting the search to the English language, and to works published only within the last decade), the reader should be presented with a list of zero or more possibly relevant publications, together with a list of zero or more See, See also, See under, and See also under directives.

(b) With a further few seconds' work, the reader should be able to refine the tentative list of index terms into a safe list (in the case imagined here, perhaps the Boolean OR of "solar neighbourhood, stellar populations in" with "NStars"), and on the strength of this list to obtain a first set of possibly relevant publications.

(c) With a further few minutes' work, the reader should be able to make at least a tentative selection, on the strength of ODP-style catalogue entries, of those documents whose individual indexes seem worth skimming. The reader might, for instance, retrieve a thousand English-language publications from the past decade, but with a couple of mouse movements eliminate eight hundred as being popularizing and derivative works, rather than peer-refereed original science.

(d) The reader should now be allowed to embark on a difficult and hazardous kind of information retrieval without many twentieth-century precedents, constructing an on-screen union index not from a single publication, but from multiple publications. For each individual publication, the union index will for a given indexing term give at most five or seven or so reference locators, presenting each reference locator as a hyperlink to an actual full-text paragraph or sequence of paragraphs, on some WWL server or other.

I say here, advisedly, "without many", rather than "without any". The terrifying task of building a union index did confront a tiny subset of the professional indexing community before the advent of the Internet, thanks to the need, even then, for cumulative indexing of periodicals and multi-book sets. At the popularizer Time-Life Books, for example, it was judged appropriate not only to index each book in a set as it made its way to press, but also to publish an index for the set as a whole, at the end of the project. In the "Special Concerns" chapter of her Indexing Books (Chicago and London: University of Chicago Press, 1994), indexing teacher-practitioner-theorist Nancy Mulvany quotes former Time-Life copy chief Diane Ullius on the difficulty of the task:

[I]t's not just a matter of merging the individual volume indexes. We put as much time and effort into a 160-page index as we put into a 160-page narrative book. Work on the cumulative index begins long before the last volume in the series has been published. . . . Even if a single indexer has produced all the individual indexes, there will still be term consistency problems that must be resolved during the editing phase.

The last part in our four-stage model of a WWL reader's work in 2051 indicates the severity of the challenge eventually facing indexers. It will no longer be possible to index publications in isolation from each other. Instead, publications will need to be indexed with some sensitivity to the standing of a document among its peers. So, for instance, in indexing a graduate-level text on stellar astrophysics, it will be necessary to consider the item not only as a self-contained book, but as a book residing in a community of perhaps five hundred relevantly similar books - as, so to speak, one book in twenty shelf metres of cognate books.
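To make stage (d) a little more concrete, here is a deliberately naive Python sketch of union-index construction. It is an assumption-laden toy, not a proposal for how the real work should be done. The function build_union_index, the publication identifiers, the paragraph-style locators, and the hypothetical cross-reference from "nearby stars" to "NStars" are all invented for illustration. The sketch merges per-publication indexes, folds non-preferred terms into preferred ones by way of See cross-references, and caps the locators kept for each publication.

```python
from collections import defaultdict

MAX_LOCATORS = 7  # assumed cap per publication, following the rule of thumb


def build_union_index(publication_indexes, see):
    """Merge per-publication indexes into one union index.

    publication_indexes: {publication_id: {term: [locator, ...]}}
    see: {non-preferred term: preferred term}
    Returns {preferred term: {publication_id: [locator, ...]}}.
    """
    union = defaultdict(dict)
    for pub_id, index in publication_indexes.items():
        for term, locators in index.items():
            preferred = see.get(term, term)  # follow a "See" cross-reference
            merged = union[preferred].setdefault(pub_id, [])
            for locator in locators:
                # Crude policy: drop duplicates, and drop anything past the cap.
                if locator not in merged and len(merged) < MAX_LOCATORS:
                    merged.append(locator)
    return dict(union)


# Hypothetical data: two publications using different terms for one concept.
see = {"nearby stars": "NStars"}
publication_indexes = {
    "pub-A": {"NStars": ["para-12", "para-40"]},
    "pub-B": {"nearby stars": ["para-3"], "NStars": ["para-9"]},
}

union = build_union_index(publication_indexes, see)
print(union["NStars"])
```

The mechanical merge is the easy part. The term-consistency problems that Ullius describes live in the see table, which the sketch simply takes as given.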

It may be beyond the scope of human and machine intelligence to produce good union indexes. However, since even a bad union index is better than none, the effort must be made, and we must be prepared for the slow discovery of what will over the decades prove to be at least marginally acceptable practices. In the remainder of this essay, I will develop one suggestion for a path to follow as we inch painfully forward across unaccustomed terrain.

It is clear that there is no hope of centralized authorities doing the bulk of the work that needs to be done in indexing, let alone in trying to union-index, the WWL. The way forward lies in somehow adapting the model of the ODP, or of the open-source GNU-cum-Linux programming movement - or, for that matter, adapting the emerging model of electricity generation, on which central electricity generating authorities give way to grid-linked local provider associations, each with its own humble wind turbine or photovoltaic array. What, in the era of the WWL, could be analogous to the neighbourhood operating its own wind turbine so as to buy power from the continent-wide grid on some occasions, and to sell power into the grid on others?

We start by considering the needs of the lone researcher keeping private research notes. Such a person is bombarded with information - the Astronomical Journal article read carefully in the early morning, the Monthly Notices of the Royal Astronomical Society article skimmed in haste before teaching started at ten o'clock, the casual conversation over a sandwich in the departmental lounge, the afternoon colloquium (although it proved largely irrelevant, important issues came up for a couple of minutes at question time), the popularizing Sky and Telescope or BBC article savoured as a guilty pleasure over evening tea. The meticulous researcher will clearly keep notes at the workstation day by day. Clearly, too, the most retrieval-friendly notes will be kept not as free-form text, but under some such formalism as SGML or XML.

Here, to illustrate workstation note-taking tactics, is a sampling from my own (essentially undergraduate- or beginning-MSc-level) records. I keep my prefatory comments from the file intact, to supply a bit of real-life historical context. I indicate the numerous omissions from my actual 7744-line file with a "((SNIP))" pseudo-tag. (The mysterious "HD21699" referred to now and again in the notes is merely a rather bright helium-weak star in Perseus, known in astrophysics by its "HD number", or Henry Draper survey designation. We have here not some arcane library-science terminology, but merely some routine discipline-specific language, on a par with "sedum album" in botany, or with "elasticity" in economics. The mildly mysterious "ApJ" and "PASP" are two printed-paper journals.)

<!-- 

consolidated personal reading list 
of Tom Karmo = {tkarmo} 
__rather experimental 
__readers are encouraged to use this reading list
  as a basis for their OWN experiments in 
  workstation bibliography management, 
  and to communicate their suggestions to {tkarmo} 
  __{tkarmo} communications particulars: 
    __<karmo@ungrad.astro.utoronto.ca> is deprecated e-mail address
    __<verbum@interlog.com> is preferred e-mail address 
__format is SGML (hand-coded) 
  __it may later be possible to develop an SGML 
    DTD (Document Type Definition) 
    on the strength of this hand-coding experiment 
    __the DTD will define various points so far left
      undefined
      __most notably, for each tag "FOO", whether
        a record has to have 
        * exactly one FOO, or 
        * zero-or-one FOO,  or 
        * zero-or-more FOOs 
    __with a DTD developed, Linux sgmls can be run 
      to check that this SGML document is DTD-compliant
      __we thereby have formal machinery for detecting 
        certain kinds of data-entry errors
    __but it would be better to find someone 
      in the astronomical community who has already developed
      an SGML DTD for a consolidated personal reading list 
  __it may later be possible to develop Perl scripts 
    which will do some automatic processing of this SGML document
    __example_a: retrieve from the document all and only the entries 
      which relate to the topic HD21699, and display 
      just the author names, titles, journals, and years 
      for those entries, suppressing other information 
    __example_b: convert this whole bloody document into some
      quite different flat-ASCII format, 
      if a really good flat-ASCII format for
      personal reading lists turns out to exist 
      __it is possible that some astronomy student, on some
        campus somewhere, has gone through this exercise already, 
        and has invented a much superior flat-ASCII format, 
        perhaps even coded in some formalism other than SGML 
__reading list was started 19990709
__reading list initially related to {rgarrison}{tkarmo}
  collaboration on HD21699
  __but it was intended 
    to have reading list grow to include many other topics
__reading list is sorted in forward order by year of publication, 
  and is sorted in forward alphabetical order by author surname
  within any one given year  
__{tkarmo} has consulted briefly on this reading-list format
  with librarian ((SNIP)) at Dept of Astron,
  University of ((SNIP))
-->
<BIBLIOG_TKARMO>

<!-- Here is a template for creating new records: 
<RECORD>
  <AUTHOR01>xxxx</AUTHOR01>
  <AUTHOR02>xxxx</AUTHOR02>
  <AUTHOR03>xxxx</AUTHOR03>  
  <YEAR>19xx</YEAR>
  <JOURNAL>xxxxxx</JOURNAL>
  <VOLUME>xxxx</VOLUME>
  <COLLECTION>xxxx</COLLECTION>
  <COLLECTION_EDITOR01>xxxx</COLLECTION_EDITOR01>
  <COLLECTION_EDITOR02>xxxx</COLLECTION_EDITOR02>
  <BOOK>xxxx</BOOK>
  <EDITION>xxxx</EDITION> 
  <PUBLISHER_NAME>xxxx</PUBLISHER_NAME>
  <PUBLISHER_CITY>xxxx</PUBLISHER_CITY>
  <PAGE>xxxx</PAGE>
  <URL>xxxx</URL>
  <CONF SESSION=>xxxx</CONF>
  <SEMINAR>xxxx</SEMINAR>
  <COLLOQ>xxxx</COLLOQ> 
  <LECTURE>xxxx</LECTURE>
  <SYMPOS>xxxx</SYMPOS>
  <CNVRSATN>xxxx</CNVRSATN>
  <VENUE>xxxx</VENUE> 
  <TITLE>xxxx</TITLE>
  <UNIV_ESSAY>xxxx</UNIV_ESSAY>
  <UNIV_THESIS>xxxx</UNIV_THESIS>
  <COMMENT01></COMMENT01>
  <COMMENT02></COMMENT02>
  <COMMENT03></COMMENT03> 
  <COMMENT04></COMMENT04>
  <NOTES>
  </NOTES> 
  <TOPIC></TOPIC>
  <TOPIC></TOPIC>
  <TOPIC></TOPIC>
  <TOPIC></TOPIC>
  <LIBRARY></LIBRARY> 
  <DATEREC></DATEREC>
</RECORD>




__in this template, "URL" is used only for very special situations
  __for example, for a publication which is available 
    by anonymous ftp from some server, and is NOT in print
__in this template, "VENUE" is used only for very special situations
  __examples: 
    * colloquia
    * conference presentations
__in this template, the role of "COMMENT01", . . . , "COMMENT04"
  is not yet very sharply defined 
  __{tkarmo} has to think about 
    comments over the coming months and years, 
    not developing fully sharp comment concepts too quickly 
__in this template, "NOTES" is intended for free-form comments 
  __"NOTES" is the place for storing substantive scientific
    information
    __for instance, the information that HD21699 is 
      not radio-bright at 6 cm 
    __with a good author, "NOTES" might store a great deal 
      of information 
    __"NOTES" is useful for guiding one's overall study programme 
    __"NOTES" may be expected to evolve rather radically as one
      makes the transition from BSc to MSc to PhD, etc, etc, etc
      __ :-)  
__in this template, <TOPIC> means, essentially, "keyword"
  __the following use of "SUBTOPIC" is legal: 
    <TOPIC>HD21699<SUBTOPIC>Zeeman effect</SUBTOPIC></TOPIC>
__in this template, "LIBRARY" is intended as a place to store
  library call numbers
  __keeping that kind of information on line can save time
    in certain special cases 
    __it is not ALWAYS obvious where in the library system 
      an ink-and-paper artefact resides 
  __"LIBRARY" is only supposed to be used in unusual situations
    __common sense and street smarts are ENTIRELY sufficient
      to let one locate items like ApJ and PASP 
__in this template, "DATEREC" stores the timestamp of the last
  major modification to the given record 
  __a record started @19991223T235900Z, modified in a major way
    @20010430T131345Z, and trivially modified 
    @20010501T123004Z, should be timestamped 20010430T131345Z
    __note timestamping convention, with all times stated to 
      the second, and referred to "Z" (UTC = Coordinated Universal
      Time): CCYYMMDDThhmmssZ 
      __this convention is prescribed by ISO in Geneva 
      __Linux boxes can be readily configured to generate 
        timestamps which conform to this convention 
  

-->

<!-- Here are templates for creating new cross-refs:

<SEE>
  <SEARCH>
  xxxx
  </SEARCH>
  <RETRIEVE>
  xxxx
  </RETRIEVE>
</SEE>


<SEE_ALSO>
  <SEARCH>
  xxxx
  </SEARCH>
  <RETRIEVE>
  xxxx
  </RETRIEVE>
</SEE_ALSO> 

-->


<!-- 

The document should have first all the records, 
then all the "see" cross-refs,
then all the "see also" cross-refs.

--> 

<RECORD>
  <AUTHOR01>Struve, Otto</AUTHOR01>
  <YEAR>1935</YEAR>
  <JOURNAL>ApJ</JOURNAL>
  <VOLUME>82</VOLUME>
  <PAGE>252-260</PAGE>
  <TITLE>
    A Test of Thermodynamic Equilibrium in the Atmospheres
    of Early-Type Stars
  </TITLE>
  <NOTES>
  __following indicates two concepts of great importance
    (the He anomaly, dilution) to {tk}'s overall study problem 
    <__QUOTE>
    The intensities of He I lines show variations which are
    contrary to Boltzmann's formula [fn of tempr], and Rudnick
    has shown that this effect cannot be explained by turbulence.
    It is suggested that the He anomaly results from an accumulation
    of atoms in the triplet system, the lowest term of which is
    metastable. Such an accumulation is possible if there are
    departures from thermodynamic equilibrium, e.g., if the 
    photospheric radiation is appreciably diluted in the 
    absorbing atmosphere. Departures of this nature are not
    improbable in the giants of early type. 
    </__QUOTE>
    __{tk} did NOT understand 19991215T035200Z
  </NOTES> 
  <TOPIC>dilution</TOPIC>
  <TOPIC>He anomaly</TOPIC>
  <DATEREC>19991215T035200Z</DATEREC>
</RECORD>
 
((SNIP))

<RECORD>
  <AUTHOR01>Merrill, Paul W.</AUTHOR01>
  <YEAR>1952</YEAR>
  <JOURNAL>ApJ</JOURNAL>
  <VOLUME>115 (no 2)</VOLUME>
  <PAGE>145-153</PAGE>
  <TITLE>Pleione: The Shell Episode</TITLE>
  <NOTES>
  __lovely figure (line graph) 
    <__QUOTE>
    based on eye estimates
    </__QUOTE>
    of intensities of shell lines from before 1940 to after 1950
  __<__QUOTE>
    Before 1945 all the hydrogen lines had nearly the same
    displacement, the amount of which corresponded closely to the
    radial velocity of Pleione measured before the shell episode.
    <!-- that's on the order of +7.5 or +8.5 km/sec --> 
    From 1946 on, the displacements of successive Balmer lines
    exhibited an increasing negative progression which in 1951
    became very pronounced. 
    </__QUOTE> 
  __Merrill assumes (cf p. 153) that
    <__QUOTE>
    increasing negative displacements (outward motions in the
    stellar atmosphere) correspond to increasing effective heights
    of absorbing zones above the photosphere
    </__QUOTE>
    __but this seems wrong
      __we find that shorter members of Balmer series correspond
        to greater negative displacements
        __the shorter the wavelength of a Balmer line, 
          the DEEPER we are seeing into the absorbing gas, 
          and so the LOWER is the height above the photosphere
          __this is explained in Underhill _The Early Type Stars_
            p. 232:
            <__QUOTE>
            systematic increase of velocity of approach as one goes 
            up the Balmer series toward the Balmer limit was discovered
            by Merrill and Sanford (1944) in the star 48 Librae during
            one of its active periods. Merrill calls it the Balmer
            progression. The simplest interpretation of the observations
            is that material is ejected with a large velocity 
            and that the material is decelerated as it moves outward
            from the star. Since the higher members of the Balmer 
            series are intrinsically weak lines, one looks through to 
            the deepest layers of the shell in these wave lengths. 
            Here the outward velocity is highest.        
            </__QUOTE>
        __Merrill's last remark in his paper seems again to show
          that he has the physical meaning of the Balmer progression
          the wrong way around:
          <__QUOTE>
          Is the effective level of formation of the dark hydrogen lines
          near the limit of the Balmer series higher than that for lines
          of greater wave length? I hope to discuss this interesting
          question in future papers. 
          </__QUOTE>
  </NOTES> 
  <TOPIC>Pleione</TOPIC>
  <TOPIC>shell stars</TOPIC>
  <TOPIC>Balmer progression</TOPIC>
  <DATEREC>20000414T020715Z</DATEREC>
</RECORD>

((SNIP))

<RECORD>
  <AUTHOR01>Pych, Wojtek</AUTHOR01>
  <YEAR>2002</YEAR>
  <COLLOQ>Stellar Discussion Group</COLLOQ> 
  <TITLE>(binaries in globular clusters)</TITLE>
  <NOTES>
  __astrophysical importance of binaries in globular clusters
    [_I did not understand this well, but
      relevant is the great age of the globular clusters, 
      leading to some kind of opportunity to understand
      evolutionary history well]  
  __until 1978, no binaries known in globular clusters
  __in 1978, Niss, Jorgensen, Laustsen found eclipsing binary NJL5
    in omega Cen
    __followup in 1893-1950 archives found 250 photos, 14 eclipses
  __a new era began with the advent of CCDs and accompanying
    photometry tools such as Daophot, Dophot:
    * 1990: eclipsing binary in field of NGC5466
    * 1992-95: 29 eclipsing binaries in omega Cen, 14 in 47 Tuc
      __mostly contact binaries
    * 1996: Cluster AgeS Experiment (CASE) began 
    __we now know more than 100 eclipsing binaries in fields of 
      globular clusters
  __in this talk, we focus on OGLEC-17
    __light curve has flat bottom, so full eclipse
    __mag 17.5 
  __new prospects: 
    * large scopes Magellan, Gemini, VLT, Subaru, Keck
    * ISIS image-subtraction software (Alard & Lupton 1998), 
      good for very dense fields
  [_in discussion {t.bolton} remarked that binaries in cluster core
    have hard time surviving unless their periods are under a day]
  [_in private post-discussion chat {w.pych} told {t.karmo}, 
    in answ to {t.karmo} query, that the combo of ISIS and Magellan
    makes it reasonable to look for eclipsing binaries all the way in
    to the core of a globular cluster] 
  </NOTES> 
  <TOPIC>ISIS image-subtraction software</TOPIC>
  <TOPIC>globular clusters, eclipsing binaries in</TOPIC>
  <TOPIC>binaries, eclipsing</TOPIC>
  <TOPIC>eclipsing binaries</TOPIC>
  <DATEREC>20021002T175607Z</DATEREC>
</RECORD>

((SNIP))

<!-- "SEE" CROSS-REFS --> 

<SEE>
  <SEARCH>
  speckle interferometry
  </SEARCH>
  <RETRIEVE>
  interferometry, speckle
  </RETRIEVE>
</SEE>

<SEE>
  <SEARCH>
  synthetic photometry
  </SEARCH>
  <RETRIEVE>
  photometry, synthetic 
  </RETRIEVE>
</SEE>


((SNIP))

</BIBLIOG_TKARMO> 

<! ----------------------------------------------------------------------->
<!                            --END--                                     > 
<! ----------------------------------------------------------------------->
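The prefatory comments in the file contemplate a script (there, a Perl script) that would retrieve all and only the entries on a given topic, displaying just the author names, titles, journals, and years. A minimal sketch of that retrieval follows, written in Python rather than Perl. It assumes that regular expressions are adequate for so loosely structured a file, and the function names fields, main_topics, and records_on_topic are my own inventions. The sample text is a trimmed version of two records from the file, preceded by a commented-out template that the sketch must ignore.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
RECORD_RE = re.compile(r"<RECORD>(.*?)</RECORD>", re.DOTALL)


def fields(record, tag):
    """Return the whitespace-normalized text of each <tag>...</tag> in a record."""
    found = re.findall(r"<{0}>(.*?)</{0}>".format(tag), record, re.DOTALL)
    return [" ".join(text.split()) for text in found]


def main_topics(record):
    """Return the TOPIC terms of a record, ignoring any nested SUBTOPIC."""
    return [t.split("<SUBTOPIC>")[0].strip() for t in fields(record, "TOPIC")]


def records_on_topic(text, topic):
    """Return brief citations for all and only the records on the given topic."""
    hits = []
    # Comments are removed first, so that commented-out templates are skipped.
    for record in RECORD_RE.findall(COMMENT_RE.sub("", text)):
        if topic in main_topics(record):
            hits.append({
                "authors": (fields(record, "AUTHOR01") + fields(record, "AUTHOR02")
                            + fields(record, "AUTHOR03")),
                "title": fields(record, "TITLE"),
                "journal": fields(record, "JOURNAL"),
                "year": fields(record, "YEAR"),
            })
    return hits


sample = """
<!-- template: <RECORD><TOPIC>Pleione</TOPIC></RECORD> -->
<RECORD>
  <AUTHOR01>Struve, Otto</AUTHOR01>
  <YEAR>1935</YEAR>
  <JOURNAL>ApJ</JOURNAL>
  <TITLE>
    A Test of Thermodynamic Equilibrium in the Atmospheres
    of Early-Type Stars
  </TITLE>
  <TOPIC>dilution</TOPIC>
</RECORD>
<RECORD>
  <AUTHOR01>Merrill, Paul W.</AUTHOR01>
  <YEAR>1952</YEAR>
  <JOURNAL>ApJ</JOURNAL>
  <TITLE>Pleione: The Shell Episode</TITLE>
  <TOPIC>Pleione</TOPIC>
</RECORD>
"""

for hit in records_on_topic(sample, "Pleione"):
    print(hit)
```

Exact string matching on the topic is itself a design choice. It rewards consistent terminology and silently punishes variants, which is one reason the cross-references matter.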

Many people in both academia and industry are no doubt now working at about this level of unsophistication, some hundreds of thousands of them with still less formalism than I have used here. Some tens or hundreds of thousands have no doubt taken the dangerous path of overmechanization, entrusting their precious records to a relational database, or even to a closed-source, commercial, bibliography manager (liable to keep records as machine-readable binary files, and to change those binary formats in costly "upgrades" as the years go by). Some (we may piously hope) have advanced to the next appropriate stage in formalization, using hand-coded XML in place of my rather archaic pseudo-SGML. In the best of all possible worlds, some will already have constructed a proper XML schema or DTD, to define their tagging scheme, and so to allow a validating parser to flag any erroneous tags.
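Short of a schema or DTD, even a crude tag-balance check will catch some of the slips that creep into hand coding. The Python sketch below is much weaker than a validating parser: it knows nothing about which tags are permitted, or how often, and it assumes that each tag sits on a single line. The function names and the report format are assumptions of mine. The sample contains a hypothetical mistyped closing tag.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<(/?)([A-Z_][A-Z0-9_]*)[^>]*>")


def blank_comments(text):
    """Replace each comment with the newlines it contained, so that the
    line numbers in the report still match the original file."""
    return COMMENT_RE.sub(lambda m: "\n" * m.group().count("\n"), text)


def tag_errors(text):
    """Return (line number, message) pairs for unbalanced upper-case tags."""
    stack, errors = [], []
    for lineno, line in enumerate(blank_comments(text).splitlines(), start=1):
        for closing, name in TAG_RE.findall(line):
            if not closing:
                stack.append(name)
            elif name in stack:
                # Unwind to the matching opener, reporting anything left open.
                while stack[-1] != name:
                    errors.append((lineno, "unclosed <%s>" % stack.pop()))
                stack.pop()
            else:
                errors.append((lineno, "stray </%s>" % name))
    # Anything still open at end of file is reported without a line number.
    errors.extend((None, "unclosed <%s>" % name) for name in reversed(stack))
    return errors


sample = """<RECORD>
  <COLLOQ>xxxx</COLOQ>
  <TITLE>xxxx</TITLE>
</RECORD>"""

for lineno, message in tag_errors(sample):
    print(lineno, message)
```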

I now suggest that we proceed a step beyond the hand coding of our research notes, and instead develop a non-commercial tool to write the tags. Part of what the tool does will be straightforward. Clearly it will have to prompt for author, title, and the like, in some cases offering rational defaults. Clearly, also, the tool will have to offer a window into which one types the core of a typical entry in a file - namely, the substantive, if telegraphic and private, notes on what of research relevance was said in the particular publication, conversation, or colloquium at hand.
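The straightforward part can be suggested in a few lines of Python. The sketch below is an illustration only, and no such tool yet exists. The function names timestamp and write_record, and the choice of which fields to prompt for, are assumptions. The record is stamped in the CCYYMMDDThhmmssZ convention used in my own file, and no escaping of markup characters is attempted.

```python
from datetime import datetime, timezone


def timestamp(moment=None):
    """Format a moment as CCYYMMDDThhmmssZ, referred to UTC."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")


def write_record(authors, year, title, notes, topics, moment=None):
    """Return one pseudo-SGML RECORD as text, with the tags written by machine."""
    lines = ["<RECORD>"]
    for n, author in enumerate(authors, start=1):
        lines.append("  <AUTHOR%02d>%s</AUTHOR%02d>" % (n, author, n))
    lines.append("  <YEAR>%s</YEAR>" % year)
    lines.append("  <TITLE>%s</TITLE>" % title)
    lines.append("  <NOTES>")
    lines.extend("  __" + note for note in notes)
    lines.append("  </NOTES>")
    lines.extend("  <TOPIC>%s</TOPIC>" % topic for topic in topics)
    lines.append("  <DATEREC>%s</DATEREC>" % timestamp(moment))
    lines.append("</RECORD>")
    return "\n".join(lines)


print(write_record(
    authors=["Pych, Wojtek"],
    year=2002,
    title="(binaries in globular clusters)",
    notes=["until 1978, no binaries known in globular clusters"],
    topics=["globular clusters, eclipsing binaries in", "binaries, eclipsing"],
    moment=datetime(2002, 10, 2, 17, 56, 7, tzinfo=timezone.utc),
))
```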

However, what is important from the perspective of the eventual union indexing of the WWL is the provision the envisaged tool makes for index terms. We should have our tool display, in some appropriate scrollable window, a list of all the index terms (in the present crude pseudo-SGML formalism, the material enclosed in <TOPIC> tags) already used by the researcher, with a further scrollable display of all the researcher's previous decisions regarding cross-referencing (in the present crude formalism, the material tagged <SEE>). Guided by the displays - professional indexing software such as CINDEX will be a source of ideas for our eventual interface design - the researcher will sometimes accept a term already used, sometimes construct a new term.
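How the tool might guide the choice between an existing term and a new one can be sketched as follows. This is a guess at one plausible behaviour, not a description of CINDEX or of any existing software. The function suggest_terms, the similarity cutoff of 0.6, and the use of the Python standard library's difflib for near matches are all assumptions.

```python
import difflib


def suggest_terms(candidate, used_terms, see):
    """Suggest existing index terms for a candidate typed by the researcher.

    used_terms: terms already used (the material in <TOPIC> tags)
    see: {search term: preferred term}, from the <SEE> cross-references
    """
    if candidate in see:
        return [see[candidate]]  # a "See" decision has already been made
    if candidate in used_terms:
        return [candidate]       # the term is already in the vocabulary
    # Otherwise offer near matches; an empty list invites a new term.
    return difflib.get_close_matches(candidate, sorted(used_terms), n=5, cutoff=0.6)


used_terms = {
    "Balmer progression",
    "binaries, eclipsing",
    "interferometry, speckle",
    "photometry, synthetic",
    "shell stars",
}
see = {
    "speckle interferometry": "interferometry, speckle",
    "synthetic photometry": "photometry, synthetic",
}

print(suggest_terms("speckle interferometry", used_terms, see))
print(suggest_terms("shell star", used_terms, see))
print(suggest_terms("Pleione", used_terms, see))
```

Spelling similarity catches the singular-for-plural slip, but it cannot know that "NStars" bears on the solar neighbourhood. That knowledge has to come from cross-references made by a human being.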

There will also be provisions not for entering new records, but merely for editing the entire accumulated set of terms, including the accumulated hundreds or thousands of cross-references.

We may imagine the researcher updating the workstation at the end of every day in the light of that day's handwritten or laptop-written library, colloquium, and conversation notes, and then at the end of each academic semester taking some hours to review the all-important indexing structure, breaking up excessively broad terms into subterms and supplying appropriate new cross-references.

What should such a tool be called? Short, vivid terms like "Scindex" and "Infotron" either are already in use, somewhere in a vast commercial world armed to the teeth with trademark registries, or are liable to come into unexpected use. But we may provisionally indulge in the name IndiciaScientia, as a latinization, with the neuter plural of the present participle sciens, "being-in-possession-of-knowledge", for "smart (knowledgeable, intelligent) pointers". Under classical, as distinct from the less satisfactory Church, Latin pronunciation, the name becomes "in-DIK-ee-ah skee-ENN-tee-ah".

The envisaged IndiciaScientia, as described so far, is a free-standing application, analogous to the lone photovoltaic array or wind turbine, powering a single electrical installation, but not participating in the continental grid. Now, however, we can proceed to the full IndiciaScientia concept, under which individual workstations run IndiciaScientia clients capable of being pointed, at the user's discretion, to public-access IndiciaScientia servers. We may imagine any reasonably cohesive intellectual community offering an IndiciaScientia server, just as leaders in intellectual communities today offer e-mail listservs. Our hypothetical astronomical researcher may be envisaged as connecting to some universally respected IndiciaScientia server at, say, the International Astronomical Union (IAU) in Paris. Analogously, biologists, electronics engineers, and anglophone analytical philosophers may be imagined as working through, respectively, the Council of Biology Editors (CBE), the Institute of Electrical and Electronics Engineers (IEEE), and the American Philosophical Association (APA).

At indexing-term selection time, our astrophysics researcher can choose to keep the IndiciaScientia client offline, or to connect to Paris, so as to display, and optionally to import, terms already approved by a bibliography specialist at the IAU. Further, when (on, as it were, the last day of the academic semester) all the accumulated indexing on the local workstation has been fine-tuned to the researcher's satisfaction, it is time to consider once again whether to connect the IndiciaScientia client to the remote IndiciaScientia server, this time uploading rather than downloading. In the upload, the remote server is given, at the researcher's discretion, some or all of the researcher's accumulated indexing terms and cross-references. What is uploaded is (it must be stressed) not the private mass of research notes, but only the indexing language, the conceptual framework which has been found useful in referencing and cross-referencing the notes. The staff on the remote server can then decide whether to update their indexing vocabulary, to reflect some or all of the suggestions which have come in from that particular upload.
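
The separation between private notes and uploadable indexing language can be sketched as follows, in Python. The Record structure, the build_upload function, and the withheld parameter (terms the researcher chooses not to share) are hypothetical names; the sketch assumes only that each record keeps its notes apart from its terms and cross-references.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    title: str
    notes: str  # private research notes; never placed in an upload
    topics: list = field(default_factory=list)
    cross_refs: list = field(default_factory=list)


def build_upload(records, withheld=frozenset()):
    """Gather terms and cross-references only, minus anything withheld."""
    topics, cross_refs = set(), set()
    for record in records:
        topics.update(record.topics)
        cross_refs.update(record.cross_refs)
    return {
        "topics": sorted(topics - withheld),
        "cross_refs": sorted(cross_refs - withheld),
    }


# Hypothetical records for illustration.
records = [
    Record("Hypothetical survey paper", "private remarks on chapter 3",
           ["variable stars", "photometry"], ["Cepheids: see variable stars"]),
    Record("Hypothetical colloquium", "private remarks on the speaker",
           ["photometry", "working title for my own project"]),
]

payload = build_upload(records, withheld={"working title for my own project"})
print(payload)
```

Because the payload is assembled from the term fields alone, no code path exists by which the notes could leak into it; the researcher's discretion is exercised through the withheld set.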

Any one researcher can hope to make only a modest contribution to the work of the remote server, perhaps contributing only ten useful new indexing terms or new cross-references a year to the growing server-side thesaurus. However, even in a small discipline such as astronomy, the depth of the cumulated term-selection wisdom will be considerable. Here are some possible numbers for astronomy: the discipline as it exists today may yield perhaps ten thousand core scientists, as distinct from the large army of graduate students, telescope assistants, computer maintainers, and the like supporting the core. (I myself count as a mere support worker, not as a core scientist!) Human nature being what it is, only a minority of the scientists in the core will favour the envisaged IndiciaScientia over the less rational methods of notekeeping. Let us suppose, then, that just one core scientist out of a hundred opts to use IndiciaScientia, and let us make the further pessimistic assumption that IndiciaScientia finds no users at all outside the core. That still leaves us with a hundred IndiciaScientia users, each uploading perhaps ten useful suggestions a year. In its first year of operation, IndiciaScientia will on our conservative scenario still generate a thousand-term thesaurus. As the years roll by, the thesaurus will grow. Given a decade, anyone asking, "How should we index, say, a graduate-level book-length survey of current knowledge in stellar astronomy?" will have a credible answer: "Well, as our server makes clear to the world, our own people, the very people competent to read and write those stellar-astronomy books, have over the past decade found themselves requiring this particular set of some hundred or some thousand terms to index the various facets of this specific astronomical topic."
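
The arithmetic of this conservative scenario can be checked in a few lines of Python. The three input figures are the ones supposed above; the ten-year figure is a ceiling only, resting on the added assumption that every suggestion is accepted by the server staff and that no two researchers suggest the same term.

```python
core_scientists = 10_000            # core astronomers, as supposed above
adoption_per_hundred = 1            # one core scientist in a hundred adopts
suggestions_per_user_per_year = 10  # useful suggestions from each user

users = core_scientists * adoption_per_hundred // 100
terms_per_year = users * suggestions_per_user_per_year


def cumulative_ceiling(years):
    """Upper bound on thesaurus size, assuming no duplicates or rejections."""
    return terms_per_year * years


print(users)                   # 100
print(terms_per_year)          # 1000
print(cumulative_ceiling(10))  # 10000
```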

When it comes time not to keep private research notes, but to write for publication, the IndiciaScientia server will prove a useful resource, supplying a controlled vocabulary for much of the indexing. We may imagine authors, editors, and managers in the reputable WWL publishing projects of 2051 priding themselves not just on amassing meticulous bibliographies, as they did in twentieth-century academic publishing, but also on consulting the appropriate IndiciaScientia server (or, in cross-discipline writing, set of servers). Where budgets and time permit, they will purchase indexing services from freelance indexing professionals, as they do today. Just as we do not now take scholarly and scientific writings seriously unless they incorporate diligently constructed bibliographies, so the readers of 2051 will not take World Wide Library publications seriously unless they incorporate indexes diligently harmonized with the controlled server-side vocabularies. Harmonization cannot be perfect, since there will always be a need for indexing a given work with a number of terms not (or not yet) present in the authoritative server-side thesauri. But reputable indexes will display some quality-control information, measuring their degree of divergence from the thesaurus, perhaps in the style "Guiding thesaurus for this index: IAU IndiciaScientia server at or very near Coordinated Universal Time 20511014T235521Z. Conformance metric for this index: 84.3 percent of our headings and cross-references appear also in the thesaurus."
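
One plausible reading of such a conformance metric is the share of an index's distinct headings and cross-references that appear also in the thesaurus. The Python sketch below computes that share and formats the timestamp in the compact style quoted above. The function name conformance_percent and the sample entries are hypothetical, and exact string matching is an assumption; a real tool would have to decide how to treat variant spellings and inversions.

```python
from datetime import datetime, timezone


def conformance_percent(index_entries, thesaurus_entries):
    """Percentage of distinct index entries also found in the thesaurus."""
    index_set = set(index_entries)
    if not index_set:
        raise ValueError("the index has no entries to measure")
    shared = index_set & set(thesaurus_entries)
    return round(100.0 * len(shared) / len(index_set), 1)


# Hypothetical entries for illustration.
index_entries = ["photometry", "variable stars", "CCD detectors",
                 "a heading not yet in the thesaurus"]
thesaurus_entries = ["photometry", "variable stars", "CCD detectors",
                     "spectroscopy"]

consulted = datetime(2051, 10, 14, 23, 55, 21, tzinfo=timezone.utc)
stamp = consulted.strftime("%Y%m%dT%H%M%SZ")
score = conformance_percent(index_entries, thesaurus_entries)

print("Thesaurus consulted at", stamp)
print("Conformance metric:", score, "percent")
```

Three of the four sample entries appear in the sample thesaurus, so the sketch reports 75.0 percent.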

It is easy enough to see how to start programming IndiciaScientia clients, and how to set up client-server communications. Today's clients might well be Java applets, runnable in any browser that is applet-capable, and reading XML files on the given workstation conformant to some appropriate DTD. (So, in today's terms, even if your browser is an archaic Netscape 4.x from 1998 or so, you're okay: if you can read, for instance, the Java-applet moving-headlines ticker at the world-news edition of http://news.bbc.co.uk/, you have all the computing power you need.) The existing TCP/IP is all the bedrock we need for communications with servers. The open-source software model, with the GPL or some similar public licence, is all the legal bedrock we need.

I'm learning Java now, and have for many months been working to supplement my limited old SGML skills with XML. If you can help refine the concepts proposed here, or can help with the programming, or (above all) can help reach the eyes and hearts of the high academic-publishing authorities in your discipline, do get in touch. The most efficient means of communication is an e-mail to verbum@interlog.com, with a subject header incorporating the acronym "WWL".