On the Theory of Library Catalogs and Search Engines
Supplementing a presentation on "Principles and Goals of Cataloging"
Nothing is more practical than a good
theory. A banal statement, considering that a theory should always
enable its users to easily derive the statements they need for practice.
Using Internet search engines, and
knowing their operation is fully automated, people tend to view with
skepticism all practical and theoretical effort invested in catalogs.
Any good search engine, however, has to be based on a good theory
- though that one may differ quite a bit from a catalog theory.

What do libraries and the Internet have in common?

Both provide access to collections of recordings. One need not use the difficult-to-define concepts of information and knowledge here. We may leave it open whether or not an "information society" exists, or a "knowledge economy", and whether everything squeezed between book covers or onto Web pages is information or knowledge. The "Pisa Studies" have reminded everybody that knowledge doesn't come without learning. Being in possession of printed matter does not mean possessing knowledge: printed text turns into living knowledge only after reading and understanding, and then this knowledge sits in the head of the reader, not on the paper or screen. Nobody will doubt that ours is a learning society, and recordings of experience and insight are of central importance for learning. One learns from direct interaction between humans, by one's own doing, by observation, or through studying, which mostly consists of taking in what others have recorded. In many cases, suitable recordings have
to be found first. Millions of humans, over millennia, have
recorded their experience and encounters, their findings, their
insights, and their
inspiration. When this started with the Greeks, Plato saw in it a
symptom of decline: people wouldn't exercise their memories any more
because they would now rely on inferior surrogates. But people did not
stop at making use of their own recordings, they started using those of
others as well. Collecting began. Libraries were created. After
collecting more than a few hundred papers or papyri, a system of
ordered shelving had to be invented or else the usefulness of the
collection would have suffered.

How did cataloging come about?

Once several thousand items have been collected, their physical arrangement, whatever the system, becomes tedious. One will need finding aids, i.e., secondary recordings (or meta-recordings), which will reveal where in the collection a particular item is located. This is the birth of cataloging: it shifts the process of ordering from the shelf to paper, to files or, nowadays, to databases. Unless one also invents a nice theory along with this, the usefulness of the catalog will diminish with its size rather than increase.

Once one has millions, the assembling
of the finding-aids in itself becomes quite a considerable effort. No
wonder there are attempts at automating the process, at least for
collections that exist in digital formats. The metaphoric term "search
engine" suggests, misleadingly, that a machine peruses the documents as
such, focusing on their content. The actual searching is, however,
always performed on surrogate files the system constructs for this
purpose. Software can only match character strings, not concepts or
ideas. This cannot be done in just some arbitrary way: there
has to be a systematic way, an algorithm, which means a theory.
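
The point that searching runs on a surrogate file, and that software matches character strings rather than concepts, can be illustrated with a minimal sketch. It assumes a toy inverted file built from three invented document texts; the function names (`tokenize`, `build_inverted_file`, `search_all`) are hypothetical and do not describe how any real search engine is implemented.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric character strings."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_file(documents):
    """Build the surrogate file: each character string maps to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search_all(index, query):
    """Return the ids of documents that contain every query string (an AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Invented sample texts, for illustration only.
documents = {
    1: "Rules for the catalog of printed books",
    2: "Printed catalogues and their makers",
    3: "A history of libraries",
}
index = build_inverted_file(documents)

print(sorted(search_all(index, "printed catalog")))    # [1]
print(sorted(search_all(index, "printed catalogue")))  # []
```

The second query finds nothing although document 2 is plainly about printed catalogues: the string "catalogue" equals neither "catalog" nor "catalogues". The search never touches the documents themselves, only the index derived from them.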

Contents of libraries and Internet

Combine libraries, archives, and the Internet, and they
comprise nothing less than the accumulated intellectual and artistic
recordings of humankind, inasmuch as these survive, from all periods,
all countries and cultures, in all languages and scripts and about all
subjects, by all individuals who ever wished to make a contribution.
The size and the complexity of this is staggering. It is naive to
expect that navigating this multidimensional universe might be easy or
might be made a
simple matter. One may try to simplify the description of the
world, but the world itself will not become any simpler that
way. Note that the initial enthusiasm of the metadata movements has
softened a bit...

Books or Internet - a matter of taste?

There is not really an either-or
situation. Only the combined contents of both worlds constitute
the complete universe of recorded knowledge and achievement. Library
catalogs on the Internet do not change this, be they as
comfortable as they may, because catalogs carry only descriptions, not
the publications themselves which exist only on paper or in microform.
To digitize all of these and make them full-text searchable is
presently utopian: there are many millions of texts and new ones
continue to be produced by the tens of thousands per year, a great many
of which are not in machine-readable formats, Google's efforts
notwithstanding. Catalogs contain only
very brief and standardized descriptions of the documents, whereas for
internet content full text is the norm. But: diversity is enormous, and
most documents lack a standardized description (a.k.a. "metadata").
From this it follows there will be a number of differences between
catalogs and internet search engines. In libraries, we not only have to
understand that but we should also be able to transfer this knowledge
to our readers. First, however, let us look at catalogs
as such, and at the difference between the contemporary device, the
OPAC, and the card catalog (now gathering dust, if not discarded). We
also have to ask
what consequences should be envisioned for cataloging rules. It goes
without saying that the OPAC is here to stay and that card catalogs are
history, but one may still learn from a comparison.

What is the principal problem today in searching?

What is a good catalog?

From all we know, we may characterize it like this (formulated originally by LC's Thomas Mann, as quoted in M. Yee's book):
The most decisive difference between conventional and online catalogs is this:

Card Catalog: a linear sequence of entries, i.e., a one-dimensional space, the ordering principle being the alphabet on the lowest level, names/titles/subjects on an upper level. Some libraries had several catalogs for two or more time periods or for otherwise defined parts of their holdings. Every document can be represented by more than one card in several places of the sequence, one of these being called the "main entry". It served two purposes. Firstly, that of collocating related works in one place (like an author's works under an established form of his/her name). Its second and probably more important function was to provide a predictable location for the item in the catalog: if one knew the principle, one was able to find with certainty what one was looking for in just one attempt. Practicability limited the number of cards per item to an average of well below ten. There are many conceivable ways of arranging a card catalog, and in particular, of determining the entries to be represented in it. The pattern, once chosen, has to be followed consistently in order for the catalog to be reliable. Therefore, a card catalog is the utmost extreme of pre-coordination. Very elaborate rules had proved to be necessary in order to establish the pre-coordination.

OPAC: in principle, it contains an unordered mass of structured records. Software, however, can easily produce a dozen or more different indexes, each being a linear, sorted sequence of certain parts of the records. Logically, these are still quite like card sequences, but then software, processing a user query, can extract arbitrary subsequences and merge or intersect them with subsequences from one or more of the other indexes, yielding subsets of the database which can then be presented in one or more different meaningful arrangements. Criteria like names, titles, numbers, subject terms etc. may thus be combined in all conceivable ways.
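
The merging and intersecting of index subsequences described for the OPAC can be sketched as follows. This is a minimal illustration under stated assumptions: the three records are invented, the field names (`author`, `title`, `subject`) are chosen only for this example, and real OPAC software is of course far more elaborate.

```python
from collections import defaultdict

# Invented sample records; the field names are assumptions made for this sketch.
RECORDS = {
    1: {"author": "Smith, Jane", "title": "Cataloging theory", "subject": "Cataloging"},
    2: {"author": "Smith, Jane", "title": "Search engines explained", "subject": "Web search"},
    3: {"author": "Miller, Tom", "title": "Cataloging practice", "subject": "Cataloging"},
}


def build_indexes(records, fields):
    """Build one index per field, mapping each (lower-cased) entry to a set of record ids."""
    indexes = {field: defaultdict(set) for field in fields}
    for rec_id, record in records.items():
        for field in fields:
            indexes[field][record[field].lower()].add(rec_id)
    return indexes


def lookup(indexes, field, entry):
    """Extract the set of record ids filed under one entry of one index."""
    return set(indexes[field].get(entry.lower(), set()))


def titles_sorted(records, rec_ids):
    """Present a result set in one possible arrangement: alphabetically by title."""
    return [records[i]["title"] for i in sorted(rec_ids, key=lambda i: records[i]["title"])]


indexes = build_indexes(RECORDS, ["author", "subject"])

by_author = lookup(indexes, "author", "Smith, Jane")
by_subject = lookup(indexes, "subject", "Cataloging")

both = by_author & by_subject     # intersect: author AND subject
either = by_author | by_subject   # merge: author OR subject

print(titles_sorted(RECORDS, both))
print(titles_sorted(RECORDS, either))
```

Nothing has to be decided in advance about which combinations will be asked for; the combination happens at query time. This is what post-coordination means, as opposed to the fixed filing order of cards.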
Indexes are thus like the axes of a multi-dimensional space in which software enables the user to navigate. Multi-dimensional spaces are abstract, mathematical entities and are therefore hard for many users to comprehend. In contrast to card catalogs, OPACs thus rely heavily on post-coordination. The actual arrangement of the pre-coordinated card sequence results from two decisions:
This is, however, a premature conclusion, becoming apparent when looking at the situations in which a catalog is consulted: The situation
most frequently encountered is probably the factual search: For
this, catalogs are not very helpful because they contain descriptions
of reference works only, not their contents. Search engines, however,
index the available documents directly and in their entirety and can
thus lead immediately to the facts contained therein. When looking for
facts, search engines are therefore the first stop for most anybody
these days: the engines serve as directory, dictionary,
encyclopedia, atlas, calendar, timetable, picture book, etc. Catalogs
can only point users to all those reference tools, which makes the
search for facts more cumbersome and time-consuming.

(a) Known item search ("I know exactly what I need"): looking for something cited or referred to in some other place, like a bibliography (before the advent of hyperlinks).

For cards, these rules had to be very restrictive because, for economic reasons, one could only ever produce and file a very limited number of cards for any given item. In contrast, OPACs produce and arrange their indexes automatically. Index entries, and thus access points, can therefore be very numerous. As one attempt fails, for whatever reason, another and yet another can be tried in rapid succession. Before long, a lack of reliability will be perceived, leading to the desire to have more things standardized (or under authority control) than ever before, like publishers' names or place names.

In addition, there have to be rules governing the description of items. Descriptions have to be brief but to the point: they have to ensure that the database user will be able to differentiate between dissimilar items, like different versions or editions of a document. The important principle is: meticulous transcription from the item at hand.

The only authoritative authority file in the AACR world is the one of the Library of Congress, for names of persons and corporate bodies as well as subjects. For persons, this file also contains the titles ("uniform titles") of many works that have been issued in numerous editions and translations. OCLC offers a new approach to putting these authorities to good use in its WorldCat Identities. In Germany, the Deutsche Bibliothek is running similar files, based on German cataloging rules (RAK = Regeln für die alphabetische Katalogisierung).

Situation (b) and its aspect of "editions of a work" often get overlooked or are not given much attention. It may occur less frequently than the others - how many works, after all, run into two or more editions?
One gets more of a sense for it when considering the following search situations, all of which can only be successful if the catalog does indeed "bring together what belongs together":
Perfection, however, is out of reach: for example, very often a library has only one edition of a work and the cataloger is unaware of the existence of others (especially before other editions have even been published!). Then, only this edition can be found in the catalog, but not under any other title by which it may be known to a searcher. Such cases are less frequent in large, shared databases.

Plus ça change, plus n'est pas la même chose ...

Technology enabling proliferation like
never before, it is now very common to encounter diverse
"manifestations" of a text: the same content can be presented in
different versions or file formats and with all sorts of modifications.
This can aggravate the difficulties with collocation searches
(situation (b)). And titles, though being the most important element
identifying a document or work, are not handled with a lot of care on
the Internet.

AACR are concerned with the formal level, not the subject level!

The AACR code of cataloging rules, like
the German RAK, deals
with situations (a) and (b) only. These pose
problems that can be solved by purely formal
or descriptive
means, whereas (c) requires attention to the content of things cataloged.

The problems described here have been known at least since Antonio Panizzi's work at the British Museum in the 19th century (his "Ninety-One Rules" were published in 1841). He had set himself the task of setting up the first complete catalog for the library. His employers found his ideas somewhat overly complicated and were reluctant to support him. This situation keeps repeating itself...

Attempts at formulating international principles for cataloging began only in the mid-20th century, the all-time highlight being the IFLA Conference of 1961 in Paris. The "Statement of Principles" promulgated there became the foundation for AACR as well as for RAK. Only as late as 1998 did IFLA come up with a new milestone paper, entitled "Functional Requirements for Bibliographic Records (FRBR)", which is gaining ground not just in library circles but also in metadata projects. Some of its main points are presented in a separate paper, "What should catalogs do?". New cataloging rules are in the making under the name of Resource Description and Access (RDA), to supersede AACR in 2009. FRBR is aimed at collocation search more than anything else, trying to say with all due precision in what ways entities can "belong together" and in what ways that fact should be reflected in catalogs.

Is AACR2 inextricably intertwined with MARC21 (and RAK with MAB)?

See also the
documentation "Was sind
und was sollen Bibliothekarische Datenformate?"
The MARC21 and MAB exchange formats were
created to serve the exchange of library data. The Deutsche Bibliothek
creates RAK records in MAB2 format, the Library of Congress produces
AACR2 records in MARC21. However: the Deutsche Bibliothek can and does
deliver the same data cast into the MARC mold. Format and rules are not
inextricably intertwined: a data format is nothing more than a
container. With a bit of goodwill, wrinkles can be ironed out. A
worldwide, unified exchange format can be envisioned, despite rules
remaining different. UNIMARC was
created for this purpose, but it has not caught on. Some samples
have been set up for demonstration.
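
The claim that a data format is nothing more than a container can be illustrated with a small sketch: one and the same description, as some cataloging code might prescribe it, is poured into two different containers. The record content is invented, the MARC21-style lines are heavily simplified (no leader, indicators, or ISBD punctuation), and the second container uses four element names from the Dublin Core set; both function names are hypothetical.

```python
# One logical description; the values are invented for illustration.
record = {
    "creator": "Example, Anna",
    "title": "An invented title",
    "publisher": "Sample Press",
    "year": "2001",
}


def to_marc_like(rec):
    """Cast the description into simplified MARC21-style tagged lines.

    Tags used: 100 (main entry, personal name), 245 (title statement),
    260 (publication: $b publisher, $c date). Indicators are omitted.
    """
    return [
        f"100 $a{rec['creator']}",
        f"245 $a{rec['title']}",
        f"260 $b{rec['publisher']},$c{rec['year']}",
    ]


def to_dublin_core(rec):
    """Cast the same description into Dublin Core element names."""
    return {
        "creator": rec["creator"],
        "title": rec["title"],
        "publisher": rec["publisher"],
        "date": rec["year"],
    }


for line in to_marc_like(record):
    print(line)
print(to_dublin_core(record))
```

Both outputs carry the same name form and the same title wording. What differs is only the labeling. Whether the name is given as "Example, Anna" or in some other form is decided by the rules, not by the container.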

Catalogs and search engines

Time and again, catalogs and search engines are juxtaposed in a pears vs. apples comparison. The intention here is not to find out which is the better gadget but to show what differences exist. It is not just librarians who may be interested in getting a clearer picture of strengths and weaknesses.

There is actually no competition, for catalogs and search engines cover different ground and cater to different needs. Most print material remains offline and thus inaccessible to harvesters, and on the other hand, many online resources have unprintable characteristics and thus could not be published in print.

There are, however, widening "grey" areas: Genuine internet resources are being cataloged to enrich catalogs. And search engines index files that contain book reviews, abstracts, whole chapters, descriptions, etc. Some categories of publications, like preprints and dissertations, which used to appear in print are now mounted on web servers. Important older books no longer subject to copyright are digitized and made freely available. The works of "classics" in many languages are freely available as text files, the most prominent example being "Project Gutenberg". And reference works that used to be published in book form are increasingly made available online and turned into databases or (in library cataloging parlance) "continuing integrating resources". And then, last but not least, there is Google's effort to digitize books on a grand scale. At the time of this writing, one cannot do much more than speculate about the potential of this project.
Catalogs | Search engines |
Document base, Coverage | ||
Describes a particular collection, predominantly books, located in one or several buildings. | Indexes documents distributed all over the planet. The majority of these "resources" are not very much like books. | |
Size | ||
The collection
is a selection from a much larger number of existing documents.
The selection will mostly be by objective and quality criteria but it
can also be subjective. However, lack of funds can cause the lack of
important materials. Union catalogs describe many more items than individual catalogs, but not everything is easily accessible. |
The intention
is for comprehensive and global coverage, but in reality no
more than some 30% of accessible materials are indexed by any one
search engine. Selection for quality is generally not possible. Size and currency of coverage are not obvious to the user, selection is an automatic process. Many documents covered have never been published conventionally, and most conventionally published material is not on the web. |
|
Objectives | ||
A catalog has clearly defined goals, laid down in its code of rules, one of which is to ensure reliable access for some types of queries. "Known item searches" and "Collocation searches" are deemed particularly important. In many cases, one has to know the right search terms with some accuracy in order to be able to ascertain presence or absence of an item in the collection. | Guiding principles for search engines would be difficult to work out, at least in the sense that one could know with a high degree of certainty how the presence or absence of something can be ascertained. In particular, "Subject searches" and "Collocation searches" cannot technically be made reliable. For "Known item searches", the situation is better: knowing two or three characteristic and not too common words the text must contain, an AND search is very reliable. The dominating use, however, may well be the factual search: with some luck, it is nowhere else that one can so swiftly find an address, a statistical figure, a historic date, a word's meaning, or a picture. | |
Understandability or explainability of results |
||
Interested users can learn everything about a
catalog's features and functions, to understand how a search result
comes about or why a search failed. Scientists and scholars can expect
to get reliable and
complete results, and to find all the best resources available. Subject searches will, of course, always suffer from the well-known conflicts between "recall" and "precision" that can never be completely solved in automated systems. (Libraries have no need of keeping anything secret because they are in complete control of their catalogs: no users can influence their data and functionality.) |
Search engine operators cannot afford to disclose
their methods of indexing and searching, for more than one reason. One
being that "Search Engine Optimizers" (SEOs) would use any such
knowledge to make up web pages in ways that hike their ranking. End-users are therefore necessarily in the dark about the reliability and completeness of results; they simply have no way of assessing these. Furthermore, the enormous amount of material on the web makes it necessary to compromise, for example between precision and speed. Hit numbers are therefore mostly estimates. Users, however, are not annoyed as long as they get something useful fast; they are not normally out to find everything or all the best things! |
|
Expectations of users | ||
Holdings of a library are usually smaller than users expect for their fields of interest though libraries usually try to build balanced collections of quality materials of long-term value. Union catalogs may be viewed as catalogs for a much larger yet virtual collection. | The number of "documents" indexed may be much larger than any user would imagine, but valuable resources are side by side with utter ephemera and all sorts of useless matter. There are various attempts to use formal criteria for "ranking". | |
Nature of data | ||
Data consist of
highly standardized brief descriptions, following elaborate
codes of rules. The most widely used code is AACR2. Every
item is represented by a structured record containing well-defined data
fields. The data formats have been designed to accommodate all elements
prescribed by the rules. The most widely used format is MARC21. Documents typically have formally definable parts (like a "title page") and useful metadata elements can thus be derived with relative ease. This serves to make metadata interoperable (e.g., for cross-database searching). Some examples are provided to illustrate how code and format complement each other. |
There are no standardized descriptions of the documents indexed. The database consists of nothing but large inverted files, derived directly from the documents but never shown as such (as a browsable index would be). Standardization in the sense of authority control is not possible because of a general lack of standardized metadata. Since hardly any formal characteristics apply across the board, metadata from different sources tend to be inhomogeneous. So, even where metadata exist, they are not always helpful: they are insufficiently standardized, too simple and meager. One widely advocated and applied semantic standard is the "Dublin Core", but this is a container, like MARC, and what matters is its content. For content, though, any standard like AACR2 is mostly absent. | |
Creation and content of the database | ||
Full texts are
not available for direct access or automatic indexing. Catalog records
are just very brief and artificial surrogates.
Descriptions are based on title pages or equivalents and little else. Record structure is still related to traditional catalog card structure in terms of content and layout. Automatic cataloging (scanning title pages etc.) is not feasible, descriptions have to be prepared by manual and intellectual input. |
Partly, but not only, because of a general lack of metadata, some search engines index the entire text of web documents. Things like title pages either do not exist or are not detectable by software. Programs can, however, evaluate the proximity of words and whether they are highlighted or specifically tagged (headlines, image tags). | |
Search criteria | ||
Searches can be
restricted to certain fields and boolean combinations thereof:
names, title words, title phrases, subjects etc. Some OPACs have an
"anyword" index allowing for keyword searches in the entire text of the
records. With regard to books and similar
documents, search criteria relate to a book as a whole, not to any of
its parts, like chapters or contributions. |
Full-text
searching is the default. There are mostly no fields for titles, names,
subjects, so these do not exist as search criteria. If a title search
is possible, then it operates on the titles "as is", and not all web
sources do have proper titles. Searches for URL components can be a
useful complement. Because of the full-text searching (which means more "depth"), using combinations of not-too-common words can often yield good results where no library catalog would turn up anything, but one can just as well get scores of irrelevant items. There may be additional functions like, for example, image searching, based on image tags in HTML text. Some engines do a kind of ranking that attributes more weight to words in the opening section. |
|
Browsing | ||
Instead of
direct queries, many OPACs also offer index browsing (up and
down, in sorted lists of terms).
Browsable indexes can assist in finding
words, terms and names the exact spelling of which is not known. Also, it can
be useful to see which inflected forms exist (plural, genitive, etc.),
for an untruncated word search will find only that particular spelling,
but titles can contain other forms. English may be the least afflicted
language in this regard, but think of spelling variations between UK and US. |
Search engines
generally do not feature browsable indexes. Although their absence is rarely
noticed, such indexes would be very helpful because of the total lack of
authority control. The enormous amount of data may make the production
of browsable indexes unfeasible. Because of full-text indexing, the inflection problem is less serious: the important words will usually occur in several inflected forms in any given text. But: there are prominent search engines not yet featuring truncation... |
|
Result set arrangement | ||
Result sets are
usually shown sorted by author, by title, or in reverse chronological order. Some systems offer a choice. For ranking, an OPAC might employ word proximity, language, number of pages, or facts like the existence of a uniform title or edition statement. Not many OPACs presently apply any ranking technique. This may be because the very brief textual content of catalog records severely limits the applicability of techniques developed for search engines. |
Some engines
present results in no predictable order. Some talk of relevance ranking, employing various formal techniques. Strictly speaking, relevance can be judged only by the person searching, not by a machine. The word is used only as a metaphor, like so many in the computing field. One ought to make users aware of it. Search engines can, however, use criteria like link evaluation that have no parallel in catalog data. Ordering by date or alphabet is not possible because there are no corresponding data fields. Standard HTML files do not even contain a creation date, and the <title> tag is frequently absent. |
|
Authority control | ||
Standardization
("authority control" or "controlled vocabulary") is applied to the more
important elements (names,
uniform titles, subject terms). Therefore, in some important cases, one may be pretty sure to get a precise result. This does not hold, however, for every case a user deems important: it does not hold for subject searches in general, because of the generally insufficient depth of indexing. Subject terms quite often use some amount of pre-coordination, e.g., form subdivision, which can be helpful in grouping results but less so in boolean searches. The cases of "works by one author" and "editions of a work" are usually well provided for. Authority control
also aids in ascertaining the fact that a document is not in
the
collection: one can tell under what heading it ought to be findable.
|
There is no
standardization because data harvesting and indexing are fully
automatic. There would be no way of managing the necessary input and
checking of data. For lack of standards, precise results are possible only if it is known ahead of time that a name or certain words (and their particular spelling) must be in the document looked for. Collocation searches cannot be supported. Pre-coordination is as good as unknown, given there is no intellectual indexing input. Authority control is entering the scene, very slowly, under the notion of the Semantic Web. It can be difficult to ascertain that a certain document has not been indexed, but this question will be a rare one. |
|
Availability of documents | ||
Any document found does very likely exist - but it may be on loan and thus not immediately available. It may also be lost. Holdings of other libraries can be accessed via interlibrary loan. Compared with a mouseclick to bring up things regardless of their location, ILL is extremely cumbersome, slow, and expensive. | Once found, a
document is, in principle, only a mouseclick away. Some links are
"broken", some documents no longer exist - as is well known. (This would correspond to a library constantly changing some call numbers or discarding books but not updating the catalog, or only with a long time lag.) |
|
Currentness of Material | ||
Library materials usually are expected to be
long-lived. The timespan between production and availability on the
shelves used to be long as well, but integrated workflows and shared
cataloging have shortened it considerably. Currentness is, however, not of utmost importance for very large parts of library materials. The emphasis in libraries is on recordings of proven knowledge, not of news. |
Search engines are quite good at making current materials accessible. Once something is placed online, search engines can index it right away without human labor. It may still take several weeks, due to the enormous size of the web, until something new becomes findable. There are, however, special engines for the daily indexing of news sources. |
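
The "Browsing" row of the comparison mentions index browsing and truncation. Both can be sketched on a sorted list of index terms. This is a minimal illustration with an invented term list and hypothetical function names (`browse`, `truncated`); it relies only on binary search from Python's standard `bisect` module and assumes that no index term contains the character U+FFFF.

```python
import bisect


def browse(index_terms, start, before=1, after=2):
    """Return a window of the sorted index around the place where `start` would be filed."""
    pos = bisect.bisect_left(index_terms, start)
    return index_terms[max(0, pos - before): pos + after]


def truncated(index_terms, prefix):
    """Return all index terms beginning with `prefix` (right truncation, as in 'catalog*')."""
    lo = bisect.bisect_left(index_terms, prefix)
    # Assumption: U+FFFF sorts after any character occurring in the terms.
    hi = bisect.bisect_left(index_terms, prefix + "\uffff")
    return index_terms[lo:hi]


# Invented index terms, sorted as a browsable index would present them.
terms = sorted([
    "catalog", "cataloging", "catalogs", "catalogue", "catalogues",
    "cataloguing", "category", "libraries", "library",
])

print(browse(terms, "catalogu"))
print(truncated(terms, "catalog"))
```

Browsing from "catalogu" shows the neighboring entries even though that exact string is not in the index, which is how a browsable index helps with uncertain spellings. The truncated search gathers the UK and US spellings and the inflected forms in one step, leaving out "category".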