Diversity and Identity: Categories for OAI data-providers in the Open Language Archives Network

15 Jun, 2023 17:00

This paper proposal presents an analysis of research involving a network of Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) data-providers who engage in ethnolinguistic information-resource stewardship. Through qualitative analysis and literature review, I address the gap in our understanding of OAI data-providers, explore the classification of data-providers, and examine the impact of classification/description on the social understanding of those providers. The classification of data-providers is an important business function across industries. For example, Bessembinder et al. (2019) discuss the classification of climate data-providers and the services that they enable via data sharing, while Exegy (2019) presents a classification of financial data-providers and the level of business services bundled with the data access. In these contexts, the classifications of data-providers are used to indicate the authority source of the data as well as assumptions about the metadata quality. Within the context of research on cultural heritage institutions (CHI), the social classification or “framing” of providers has implications for research evaluating CHI information retrieval systems. That is, the social assent of categories and divisions of Knowledge Organization (KO), using Hjørland’s (2008) broad sense of KO, impacts the interpretation of metadata records, which are in fact KO products in Hjørland’s (2008) narrow sense. I propose that a theory of KO must account for types of information resources as well as types of business models (social functions) used by organizations in a larger KO ecosystem. The research analysis is summarized as a taxonomy of data-providers applicable not only to the Open Language Archives Community (OLAC), which was the focus network, but also to other data-sharing networks in the cultural heritage space.

OLAC is a federation of 60+ data-providers sharing metadata records via OAI-PMH for consumption and display via a common aggregator (Bird and Simons 2022). The consumption of records from diverse types of data-providers suggests, at least in cases of metadata aggregation, that various KO practices (in the narrow sense) for resource description are brought together for display and engagement. These practices can, and often do, represent different kinds of KO approaches, which in turn affect metadata quality assessment (Manghi et al. 2010). Therefore, for studies which use OLAC metadata, it is important to account for these various local KO approaches.
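
As a minimal sketch of what harvesting from one such data-provider involves, the following builds an OAI-PMH `ListRecords` request URL. The verb and argument names come from the OAI-PMH specification; the base URL is a placeholder, the helper name `list_records_url` is hypothetical, and the `olac` metadata prefix is an assumption about how a given provider exposes its records.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "olac",
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent with the verb only, without metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Placeholder base URL; a real data-provider publishes its own endpoint.
first_page = list_records_url("https://example.org/oai")
next_page = list_records_url("https://example.org/oai", resumption_token="abc123")
```

An aggregator would issue the first request, then follow any resumption token returned in the response until the provider's record list is exhausted.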

Since its establishment in 2001, the OLAC application profile has allowed data-providers to identify their nature as either “personal” or “institutional” (Simons and Bird 2008:§3). At the same time, all data-providers are discussed as “archives”. OLAC documentation drafters have used inclusive language, choosing to cluster different types of institutional data-providers together. The choice maps well to the concepts of “open” and “community” found in the OLAC name, but the language of inclusion here does not acknowledge the diversity of the kinds of data-providers. OLAC’s use of the term “archive” leads to a very interesting question: “What is an archive?” Is an archive an institution with a preservation mandate, as the term is commonly used in scholarly literature (Featherstone 2006; Matthews 2016; Seyfeddinipur et al. 2019; Burke et al. 2022)? Or is an archive a set of records, often with a common origin and intra-record relationships, as is discussed by Duranti (1997) and others (Jenkinson 1937; Johnston and Schembri 2006)? The formative role that OLAC has had in the language-scholar community has strongly influenced the concept of “archive” among language-scholars. The purpose of this study was to investigate, through an empirical qualitative approach, what kinds of data-providers participate in the OLAC network and to propose a taxonomy which can be used in the interpretation of metadata records. Such an analysis is useful when evaluating issues in metadata quality (Bruce and Hillmann 2004), including the following: the possible access stability of a resource, the curatorial work conducted over a collection of records, and the currency of a metadata record.

Another useful purpose for the proposed taxonomy is in the evaluation of information retrieval systems. Within the field of language-resource stewardship and language scholarship, the different senses of the term “archive” have been conflated, with consequences for research results. For example, Yi et al. (2022) make a comparison of “language archives” and their web-facing user interfaces without differentiating the kind of information resource the website presents, e.g., institutions with diverse collections versus a single corpus. They further conflate websites presenting either sense of “archive” with actual digital asset management infrastructure. They present an analysis suggesting that the website is the archive.

Yi et al. (2022) demonstrate that language-scholars’ understanding of an archive is related to their experience of it, a nod to digital materiality (cf. Manoff 2006; Leonardi 2010; Jung and Stolterman 2012; Pink, Ardévol, and Lanzeni 2016). It also suggests that a scholar’s understanding of preservation is the ability to access resources. These are two important components in developing successful cultural preservation workflows which involve scholar-driven accessions and descriptions.

Digital infrastructure facilitating asset storage, discovery, and acquisition is actually a complex construct which varies from implementation to implementation. In contrast to Yi et al.’s (2022) “equal treatment” of different kinds of “language archives”, Ferreira et al. (2021) argue that archives and digital displays for interactions are different types of entities. They denote specific kinds of purposes for websites, highlighting the fact that some websites do not structure their existence around a preservation mandate. In a sense these websites are ephemeral, even as they provide access to resources that is meaningful to the community. These ephemeral websites often function as community-produced exhibits or as presentations of the scholarly outputs of research labs. Although these websites are not archives in the preservation-institution sense, they can equally provide data feeds to OLAC for aggregation and increased awareness of language-related cultural heritage resources.

The current limited two-value OLAC taxonomy is problematic. Some websites which present and make language resources accessible are built as part of a funded project or as part of a group without a clear linkage to an institution. Their “collective voice” is neither “personal” nor “institutional”. A more diverse taxonomy describing data-providers within the OLAC application profile, one with more specificity, serves the OLAC user communities more effectively. Acknowledging this diversity allows language-scholars and data users to form reasonable expectations. The proposed taxonomy serves at least three functions:

  • First, it allows for a gap analysis by revealing the kinds of OLAC data-providers which have found value through participation and the kinds of data-providers around which possible network growth opportunities exist.
  • Second, it allows for a more useful metadata quality evaluation. The current method for evaluating OLAC metadata (Hughes 2005) is conducted on the basis of the total number of Dublin Core elements per record. Profiling data feeds by kind of data-provider allows for a comparison within provider type, i.e., records for books do not have the same kinds of elements as learning resources or collections. Many of the current OLAC data-providers offer records related to specific kinds of digital artifacts.
  • Third, it raises awareness among language scholars and metadata specialists concerning the differences between data-providers.
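
The within-type comparison described in the second function can be sketched as follows. This is an illustration under stated assumptions, not the evaluation method of Hughes (2005): the records are invented, the function names are hypothetical, and scoring a record against the mean element count of its own provider type is one possible normalization chosen here for demonstration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (provider type, Dublin Core elements present).
records = [
    ("Library", ["title", "creator", "date", "publisher", "identifier", "language"]),
    ("Library", ["title", "creator", "date", "identifier"]),
    ("Typological Database", ["title", "subject", "language"]),
    ("Typological Database", ["title", "subject", "language", "identifier", "description"]),
]


def element_counts_by_type(records):
    """Group per-record element counts by provider type."""
    counts = defaultdict(list)
    for provider_type, elements in records:
        counts[provider_type].append(len(elements))
    return dict(counts)


def relative_scores(records):
    """Score each record against the mean count of its own provider type."""
    means = {t: mean(c) for t, c in element_counts_by_type(records).items()}
    return [(t, len(elements) / means[t]) for t, elements in records]


counts = element_counts_by_type(records)
scores = relative_scores(records)
```

Under a global count, the three-element typological record would simply look sparse next to a six-element library record; scored within its own type, it is compared only against other typological database records.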

To investigate classificatory distinctions and the necessary criteria for defining variation among data-providers, I adopt Besser’s observations distinguishing the institutional characteristics of memory institutions (Dietz et al. 2005:23). I look at four conditions across the sixty-three distinct OLAC data-providers:

  1. the name of the data-provider.
  2. the institutional context of the data-provider, i.e., organizational structure.
  3. the functional role of the data-provider, i.e., organization’s mission.
  4. whether the data provider made resources accessible or not.

From these factors, I derive three broad categories with a total of twelve specific taxonomic values as presented in Table 1.

Taxonomy Term                                  Broad Category       Instances in OLAC
Archive                                        Access Institutions  17
Repository                                     Access Institutions  20
Museum                                         Access Institutions  0
Gallery                                        Access Institutions  0
Library                                        Access Institutions  6
Special Collection                             Exhibits             1
Personal Portfolio                             Exhibits             1
Lab or Department Portfolio                    Exhibits             2
Project Portfolio                              Exhibits             1
Typological Database (Encyclopedic Resources)  Reference Resources  14
Bibliography                                   Reference Resources  1
Aggregator (Bibliographic Utilities)           Reference Resources  0
Table 1.
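
For readers who want to work with Table 1 programmatically, the taxonomy can be encoded as a small data structure. The sketch below copies the terms, categories, and counts from Table 1; the names `TAXONOMY` and `totals_by_category` are hypothetical and introduced only for illustration.

```python
from collections import Counter

# (taxonomy term, broad category, instances in OLAC), copied from Table 1.
TAXONOMY = [
    ("Archive", "Access Institutions", 17),
    ("Repository", "Access Institutions", 20),
    ("Museum", "Access Institutions", 0),
    ("Gallery", "Access Institutions", 0),
    ("Library", "Access Institutions", 6),
    ("Special Collection", "Exhibits", 1),
    ("Personal Portfolio", "Exhibits", 1),
    ("Lab or Department Portfolio", "Exhibits", 2),
    ("Project Portfolio", "Exhibits", 1),
    ("Typological Database (Encyclopedic Resources)", "Reference Resources", 14),
    ("Bibliography", "Reference Resources", 1),
    ("Aggregator (Bibliographic Utilities)", "Reference Resources", 0),
]


def totals_by_category(rows):
    """Sum the number of data-providers in each broad category."""
    totals = Counter()
    for _term, category, instances in rows:
        totals[category] += instances
    return dict(totals)


totals = totals_by_category(TAXONOMY)
grand_total = sum(totals.values())
```

Summing the rows gives 43 access institutions, 5 exhibits, and 15 reference resources, which together account for the sixty-three data-providers examined.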

Broadly, data-providers patterned into the categories of access institutions, exhibits, and reference resources. I identify twelve specific classifications of data-providers to OLAC. Based on investigations into the network typologies of Europeana (Neale and Charles 2020; Raemy 2020) and the Digital Public Library of America, two other large OAI data-sharing networks, it was found that some data-providers may act as Bibliographic Utilities (Hillmann et al. 2004; Hillmann 2008) or regional aggregators who pass materials on to a primary aggregator. These types of entities do not currently exist in the OLAC network but could. A forward-looking taxonomy should account for these kinds of functions. I propose that a more diverse typology of data-providers within the OLAC application profile serves user communities more effectively and impacts the social perspective on stewardship organizations and access channels.


Hugh Paterson III
Collaborative Scholar

My research interests include typological patterns in articulatory phonetics; User Experience design in language tools; and graph theory applied to language and linguistic resource discovery.