This paper proposal presents an analysis of research involving a network of Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) data-providers who engage in ethnolinguistic information-resource stewardship. Through qualitative analysis and literature review, I address the gap in our understanding of OAI data-providers, explore the classification of data-providers, and examine the impact of classification and description on the social understanding of those providers. The classification of data-providers is an important business function across industries. For example, Bessembinder et al. (2019) discuss the classification of climate data-providers and the services they enable through data sharing, while Exegy (2019) presents a classification of financial data-providers and the level of business services bundled with data access. In these contexts, the classifications of data-providers are used to indicate the authority source of the data as well as assumptions about metadata quality. Within the context of research on cultural heritage institutions (CHI), the social classification or “framing” of providers has implications for research evaluating CHI information retrieval systems. That is, social assent to the categories and divisions of Knowledge Organization (KO), in Hjørland’s (2008) broad sense of KO, impacts the interpretation of metadata records, which are themselves KO products in Hjørland’s (2008) narrow sense. I propose that a theory of KO must account for types of information resources as well as types of business models (social functions) used by organizations in a larger KO ecosystem. The research analysis is summarized as a taxonomy of data-providers applicable not only to the Open Language Archives Community (OLAC), the network studied here, but also to other data-sharing networks in the cultural heritage space.
OLAC is a federation of 60+ data-providers sharing metadata records via OAI-PMH for consumption and display via a common aggregator (Bird and Simons 2022). The consumption of records from diverse types of data-providers suggests, at least in cases of metadata aggregation, that various KO practices (in the narrow sense) for resource description are brought together for display and engagement. These practices can and often do represent different kinds of KO approaches, which in turn affect, for example, metadata quality assessment (Manghi et al. 2010). Therefore, studies which use OLAC metadata need to account for these various local KO approaches.
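To make the harvesting step concrete, the following minimal sketch shows how an aggregator might form an OAI-PMH `ListRecords` request for a single data-provider. The `verb` and `metadataPrefix` parameters come from the OAI-PMH specification. The base URL, the function name, and the choice of `olac` as the metadata prefix are assumptions for illustration only. No network request is made.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Return an OAI-PMH ListRecords request URL for one data-provider.

    `oai_dc` is the metadata format every OAI-PMH repository must support;
    a different prefix can be passed when the provider offers other formats.
    """
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hypothetical endpoint; a real data-provider publishes its own base URL.
EXAMPLE_BASE_URL = "https://example.org/oai"

# Assumption: the provider exposes records under an "olac" metadata prefix.
request_url = build_list_records_url(EXAMPLE_BASE_URL, metadata_prefix="olac")
print(request_url)
```

Because every data-provider answers the same request shape, the aggregator receives records produced under quite different local description practices through one uniform channel.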
Since its establishment in 2001, the OLAC application profile has allowed data-providers to identify their nature as either “personal” or “institutional” (Simons and Bird 2008:§3). At the same time, all data-providers are discussed as “archives”. OLAC documentation drafters have used inclusive language, choosing to cluster different types of institutional data-providers together. The choice maps well to the concepts of “open” and “community” found in the OLAC name, but the language of inclusion does not acknowledge the diversity of data-providers. OLAC’s use of the term “archive” raises the question: “What is an archive?” Is an archive an institution with a preservation mandate, as the term is commonly used in scholarly literature (Featherstone 2006; Matthews 2016; Seyfeddinipur et al. 2019; Burke et al. 2022)? Or is an archive a set of records, often with a common origin and intra-record relationships, as discussed by Duranti (1997) and others (Jenkinson 1937; Johnston and Schembri 2006)? Through its formative role in the language-scholar community, OLAC has strongly influenced the concept of “archive” among language-scholars. The purpose of this study was to investigate, through an empirical qualitative approach, what kinds of data-providers participate in the OLAC network and to propose a taxonomy which can be used in the interpretation of metadata records. Such an analysis is useful when evaluating issues in metadata quality (Bruce and Hillmann 2004), including the following: the possible access stability of a resource, the curatorial work conducted over a collection of records, and the currency of a metadata record.
The proposed taxonomy is also useful in the evaluation of information retrieval systems. Within the field of language-resource stewardship and language scholarship, the different senses of the term “archive” have been conflated, which affects research results. For example, Yi et al. (2022) compare “language archives” and their web-facing user interfaces without differentiating the kind of information resource the website presents, e.g., institutions with diverse collections versus a single corpus. They further conflate websites presenting either sense of “archive” with actual digital asset management infrastructure. Their analysis suggests that the website is the archive.
Yi et al. (2022) demonstrate that language-scholars’ understanding of an archive is related to their experience of it, a nod to digital materiality (cf. Manoff 2006; Leonardi 2010; Jung and Stolterman 2012; Pink, Ardévol, and Lanzeni 2016). Their analysis also suggests that scholars understand preservation as the ability to access resources. These are two important components in developing successful cultural preservation workflows which involve scholar-driven accessions and descriptions.
Digital infrastructure facilitating asset storage, discovery, and acquisition is a complex construct which varies from implementation to implementation. In contrast to Yi et al.’s (2022) “equal treatment” of different kinds of “language archives”, Ferreira et al. (2021) argue that archives and digital displays for interactions are different types of entities. They identify specific kinds of purposes for websites, highlighting the fact that some websites do not structure their existence around a preservation mandate. In a sense such websites are ephemeral, even as they provide communities with meaningful access to resources. These ephemeral websites often function as community-produced exhibits or as presentations of the scholarly outputs of research labs. Although these websites are not archives in the preservation-institution sense, they can equally provide data feeds to OLAC for aggregation and increased awareness of language-related cultural heritage resources.
The current two-value OLAC taxonomy is problematic. Some websites which present language resources and make them accessible are built as part of a funded project or by a group without a clear linkage to an institution. Their “collective voice” is neither “personal” nor “institutional”. A more diverse and more specific taxonomy describing data-providers within the OLAC application profile would serve the OLAC user communities more effectively. Acknowledging this diversity allows language-scholars and data users to form reasonable expectations. The proposed taxonomy serves at least three functions:
To investigate classificatory distinctions and the necessary criteria for defining variation among data-providers. I adopt Besser’s observations distinguishing the institutional characteristics of memory institutions (Dietz et al. 2005:23). I look at four conditions across the sixty-three distinct OLAC data-providers:
| Taxonomy Term | Broad Category | Instances in OLAC |
|---|---|---|
| Archive | Access Institutions | 17 |
| Repository | Access Institutions | 20 |
| Museum | Access Institutions | 0 |
| Gallery | Access Institutions | 0 |
| Library | Access Institutions | 6 |
| Special Collection | Exhibits | 1 |
| Personal Portfolio | Exhibits | 1 |
| Lab or Department Portfolio | Exhibits | 2 |
| Project Portfolio | Exhibits | 1 |
| Typological Database (Encyclopedic Resources) | Reference Resources | 14 |
| Bibliography | Reference Resources | 1 |
| Aggregator (Bibliographic Utilities) | Reference Resources | 0 |
Broadly, data-providers patterned into the categories of access institutions, exhibits, and reference resources. I identify twelve specific classifications of data-providers to OLAC. Based on investigations into the network typologies of Europeana (Neale and Charles 2020; Raemy 2020) and the Digital Public Library of America, two other large OAI data-sharing networks, I found that some data-providers may act as Bibliographic Utilities (Hillmann et al. 2004; Hillmann 2008) or as regional aggregators who pass materials on to a primary aggregator. These types of entities do not currently exist in the OLAC network but could. A forward-looking taxonomy should account for these kinds of functions. I propose that a more diverse typology of data-providers within the OLAC application profile serves user communities more effectively and impacts the social perspective on stewardship organizations and access channels.
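The tallies in the table above can be cross-checked with a short sketch that encodes each taxonomy term with its broad category and instance count, then sums by category. The term names and counts are taken from the table. The variable and function names are illustrative assumptions and not part of the OLAC application profile.

```python
from collections import defaultdict

# (taxonomy term, broad category, instances in OLAC), copied from the table.
TAXONOMY = [
    ("Archive", "Access Institutions", 17),
    ("Repository", "Access Institutions", 20),
    ("Museum", "Access Institutions", 0),
    ("Gallery", "Access Institutions", 0),
    ("Library", "Access Institutions", 6),
    ("Special Collection", "Exhibits", 1),
    ("Personal Portfolio", "Exhibits", 1),
    ("Lab or Department Portfolio", "Exhibits", 2),
    ("Project Portfolio", "Exhibits", 1),
    ("Typological Database (Encyclopedic Resources)", "Reference Resources", 14),
    ("Bibliography", "Reference Resources", 1),
    ("Aggregator (Bibliographic Utilities)", "Reference Resources", 0),
]


def tally_by_category(rows):
    """Sum the instance counts for each broad category."""
    totals = defaultdict(int)
    for _term, category, count in rows:
        totals[category] += count
    return dict(totals)


category_totals = tally_by_category(TAXONOMY)
total_providers = sum(category_totals.values())

# Terms with no current instances, kept so the taxonomy stays forward-looking.
unused_terms = [term for term, _category, count in TAXONOMY if count == 0]

print(category_totals)
print(total_providers)
print(unused_terms)
```

Keeping zero-count terms such as Museum, Gallery, and Aggregator in the structure mirrors the argument that the taxonomy should cover functions not yet represented in the network.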