OLAC: Accessing the World's Language Resources

  • Steven Bird, University of Melbourne, Australia
  • Gary Simons, Graduate Institute of Applied Linguistics, United States
    Language resources are the bread and butter of language documentation and linguistic investigation. They include the primary objects of study such as texts and recordings, the outputs of research such as dictionaries and grammars, and the enabling technologies such as software tools and interchange standards. Increasingly, these resources are maintained and distributed in digital form.

    Searching on the web for language resources in many languages is a hit-and-miss affair for one of three reasons:
    (i) resources are housed in archives that have never put their catalog online,
    (ii) resources are exposed to online search engines but inadequately described so that searches do not retrieve desired results with precision, or
    (iii) resources are exposed online but are hidden behind form-based interfaces such that search engines cannot find them.

    The Open Language Archives Community (OLAC) is addressing these problems by providing a standard set of language resource descriptors and a portal that permits users to query dozens of language archives simultaneously using a single search. However, the current coverage of OLAC is only the tip of the iceberg. New research is needed in order to tap the wealth of new digital library services and web-mining technologies, and to make the discovered language resources maximally accessible to linguists.

    We will describe new methods for greatly improving access to archived language resources, using new services that encourage best common practices among language archives, and new services that bridge the resource catalogs of the repository, library, and web domains.