Metadata Quality Assessment & OLAC

Metadata quality is a fascinating topic, and one I would like more professional engagement around. I am specifically interested in user-contributed metadata platforms and quality control in those digital library contexts. I have long noticed the need for spell checking, and for the ability to promote folksonomy terms to controlled vocabulary terms, at certain archives stewarding linguistic and language resources.

I read the following three articles for UNT work: Kumar, Maurya & Ji (2022); Petrucciani (2015); and Hider & Tan (2008).

Across these articles, I really appreciate the internationally diverse perspectives brought by the authors. Kumar et al. (2022) presents a very clear methodology (I feel I could replicate the experiment in my own context) and looks at spelling mistakes in a catalog where the language of instruction at the institution may be English, but the first language of many (maybe most) of the people there is not English. It seems natural that spelling mistakes would occur in these environments. At the archive where I worked (in Dallas, Texas, no less) I offered to do a spell-check analysis on titles and folksonomy terms. The staff were less than ecstatic to receive any feedback on the records, as that would mean more work for them to correct the records. (In that situation the archive's catalog has central staff supporting some collections and core infrastructure, and remote staff around the world supporting location-specific collections.) The collection in which a misspelling was found would determine which remote staff would need to approve the change. So I appreciate that the management at the institution where Kumar and colleagues did their work was willing to allow this kind of research, and I hope they then engaged with the results.

The archive at which I worked also has a remote accession tool for digital objects. I have suggested adding a spell-check system to that tool to catch errors during the accession process, but it has been argued that those features are not mission critical. This leads back to an analysis of the financial priorities of management. If an archive has no specific materials re-use objective, this can have a direct impact on the quality of its records in interesting ways. For example, an archive without a specific requirement to engage the public in innovative ways (or at all), or where that engagement process is disjunct from cataloging, curation, and preservation activities, may not seek to maintain its records in a way from which potential target audiences would benefit.
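To make concrete the kind of spell-check analysis I offered to run, here is a minimal sketch in Python. It assumes a CSV export of catalog records with "id" and "title" columns (the file and column names are hypothetical) and uses the pyspellchecker package to flag tokens that are not in its English word list.

```python
# Minimal sketch of a spell-check pass over catalog title fields.
# Assumptions: records exported to "titles.csv" with "id" and "title" columns,
# and the pyspellchecker package (pip install pyspellchecker) is installed.
import csv
import re
from spellchecker import SpellChecker

spell = SpellChecker()  # English word-frequency list by default

def tokenize(text):
    """Split a title into lowercase word tokens, dropping punctuation and digits."""
    return [t for t in re.findall(r"[A-Za-z']+", text.lower()) if len(t) > 2]

flagged = []
with open("titles.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        title = row.get("title", "")
        unknown = spell.unknown(tokenize(title))
        if unknown:
            flagged.append((row.get("id", ""), title, sorted(unknown)))

# Report candidates for human review.
for record_id, title, words in flagged:
    print(record_id, title, "->", ", ".join(words))
```

In a linguistics archive a large share of the flagged tokens will be language names and vernacular-language words, so the output is a triage list for human review, not a list of confirmed errors.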

Since I am interested in linguistics and language-resource records and their quality, I found the description of the types of errors in Italian libraries interesting. Petrucciani (2015), an article derived from a conference speech, says the following of the Italian situation (pp. 304 & 309):

In Italy, for example, the main factors may be the widespread outsourcing of cataloging activities (especially retrospective conversion and cataloging of backlogs, bequests, etc.) with very low remuneration and lack of quality control, and the participation in library networks (mainly the national network SBN, Servizio Bibliotecario Nazionale) of small libraries—public, religious, or special—run by unqualified people or volunteers, without skilled staff. In many countries, an apparent factor is the lack of competence in foreign languages among cataloging staff, resulting in wrong language coding (e.g., Spanish for Italian or vice versa, Russian for books in any language written in Cyrillic script) and the abundance of misprinted words.

These (and many more) examples are not errors. Or better, they are not simply errors. They are the result of cataloging rules and policies, of the design of OPAC interfaces, or of bibliographic formats. The low quality of large catalogs is not simply due to the vast amount of duplicated records, unprofessional work, errors of interpretation, miskeyings, and so on, but in the first place to the misunderstanding (and undervaluing) of the functions and aims of bibliographic information and to unwise cataloging standards, rules, and policies.

Again he attributes the errors to management practices established at institutions. Elsewhere in his paper Petrucciani points out that tools like Google Search and full-text search options exploit the problems that less-than-perfect catalogs have. That is, end-users, for a variety of reasons, turn to Google Search instead of catalog searches. I have long found it interesting that Google doesn't return results for books found in library catalogs. It's as if library catalogs are part of the dark web.

I found the approach by Hider and Tan (2008) inspiring. I often look at records found in the Open Language Archives Community (OLAC). One of the open questions there has been how to assess data providers and the quality of the data they provide. OLAC records are often sourced from digital repositories where accessions are described by the creators/depositors. This means that variation in management practices at the data providers impacts the quality of the records which OLAC has to aggregate.

The current status quo for OLAC metadata quality follows the method laid out by Hughes (2005): count the Dublin Core (DC) elements on each record, average that count across an archive's records, and assign that average to the data provider in order to rank the data providers. In this framework quality is measured only on the basis of the syntax present in a record (the number of elements) rather than any sort of syntax-semantics relationship. For example, does a date element really contain a valid date value? The method does not attempt to measure the validity or conformity of the data values to Dublin Core or OLAC recommendations, in terms of either semantics or well-formedness. In a sense the OLAC notion of quality is similar to what MARC record generators call Encoding Level.
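A minimal sketch of that element-count scoring, contrasted with the kind of semantic check it omits, might look like the following. The record structure, provider names, and values are invented for illustration; the date check simply tests whether a dc:date value looks like a W3CDTF-style date (YYYY, YYYY-MM, or YYYY-MM-DD).

```python
# Sketch of the Hughes-style score: average number of DC elements per record,
# aggregated per data provider, versus a simple semantic check on dc:date.
# Records here are hypothetical dicts mapping DC element names to value lists.
from datetime import date

records_by_provider = {
    "archive-a": [
        {"title": ["A wordlist"], "date": ["1998-05-04"], "subject": ["Numic"]},
        {"title": ["Field notes"], "date": ["no date"]},
    ],
    "archive-b": [
        {"title": ["Recording"], "creator": ["Anon."], "date": ["2003"], "format": ["audio/wav"]},
    ],
}

def element_count_score(records):
    """Hughes-style syntactic score: mean number of populated elements per record."""
    return sum(len(r) for r in records) / len(records)

def looks_like_w3cdtf(value):
    """Rough semantic check: does the value look like YYYY, YYYY-MM, or YYYY-MM-DD?"""
    parts = value.split("-")
    if not all(p.isdigit() for p in parts):
        return False
    if len(parts) == 1:
        return len(parts[0]) == 4
    try:
        date(int(parts[0]), int(parts[1]), int(parts[2]) if len(parts) == 3 else 1)
        return True
    except ValueError:
        return False

for provider, records in records_by_provider.items():
    score = element_count_score(records)
    bad_dates = [v for r in records for v in r.get("date", []) if not looks_like_w3cdtf(v)]
    print(f"{provider}: element-count score {score:.2f}, suspect dates {bad_dates}")
```

The point of the contrast is that the syntactic score says nothing about whether a value like "no date" in a date element is actually usable.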

In contrast to Hughes' approach, I have suggested that the DCMIType vocabulary, in conjunction with the dc:format, dc:relation, dc:medium and tableOfContents elements, could be used to provide continuity checks against the type of artifact being described. The purpose would be to check the completeness of a description when there is evidence in the record that a specific type is present. This would be a step towards enforcing the Dublin Core 1-to-1 principle. In a way, the suggested process builds a fingerprint for each DCMIType. Fingerprints for common linguistic artifact types could also be crafted and added to the checks. These crafted fingerprints would appear in the documentation wherever recommendations exist for describing materials.
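A sketch of what such a fingerprint check could look like follows, assuming a simple dictionary representation of a record. The expected co-occurring elements and format prefixes listed here are illustrative assumptions, not OLAC or Dublin Core recommendations.

```python
# Sketch of a DCMIType "fingerprint" continuity check.
# Each fingerprint lists elements expected to co-occur with a given type and,
# optionally, MIME-type prefixes expected among the dc:format values.
# These particular expectations are illustrative, not established policy.
FINGERPRINTS = {
    "Sound": {"expected_elements": {"format", "extent"}, "format_prefixes": ("audio/",)},
    "MovingImage": {"expected_elements": {"format", "extent"}, "format_prefixes": ("video/",)},
    "Text": {"expected_elements": {"format", "tableOfContents"}, "format_prefixes": ("text/", "application/pdf")},
    "Dataset": {"expected_elements": {"format", "relation"}, "format_prefixes": ()},
}

def fingerprint_report(record):
    """Return continuity warnings for one record.

    `record` is a dict mapping element names (e.g. "type", "format") to value lists.
    """
    warnings = []
    for dcmi_type in record.get("type", []):
        fp = FINGERPRINTS.get(dcmi_type)
        if fp is None:
            continue
        missing = fp["expected_elements"] - set(record)
        if missing:
            warnings.append(f"{dcmi_type}: missing expected elements {sorted(missing)}")
        prefixes = fp["format_prefixes"]
        formats = record.get("format", [])
        if prefixes and formats and not any(v.startswith(prefixes) for v in formats):
            warnings.append(f"{dcmi_type}: no dc:format value matching {prefixes}")
    return warnings

example = {"type": ["Sound"], "title": ["Elicitation session"], "format": ["application/pdf"]}
print(fingerprint_report(example))
# Prints two warnings: the expected 'extent' element is absent,
# and no dc:format value starts with "audio/".
```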

A third way to look at quality is to look at the records that people are visiting and clicking through. People found these records, and the records must be interesting for some reason. An investigation of which records are most successful at producing click-throughs to target websites is one way to assess the quality of a record. That is, is the record valuable to an audience? This third method of establishing quality is discussed by Hider and Tan (2008). They look at MARC records, while OLAC has Dublin Core records, so some generalization and abstraction of their arguments may be needed before application.

I find the idea of measuring record quality based on its use interesting. With nearly 450K records in OLAC but only some 8,000 views per month (Bird & Simons, 2021), a natural question arises: which records are being viewed, and are they the same ones month over month? Is there something "higher quality" about these records that makes them findable and used? Recent work by Husic (2014) and Osborn (2019) shows that adding Library of Congress Subject Headings (LCSH) to records does make them more visible to search engines. So maybe the highly used OLAC records have LCSH terms attached? This is something worth looking into and reporting on.
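If view logs were available, a first pass at that question could look something like the sketch below. The log structure, month labels, record identifiers, and the "lcsh" scheme tag are all hypothetical placeholders for whatever the aggregator actually records.

```python
# Sketch: month-over-month overlap of viewed records, and how many of the
# viewed records carry LCSH subject terms.
# `views` maps a month label to the set of record IDs viewed that month;
# `records` maps record IDs to subject entries tagged with a scheme.
# All data below is invented purely to illustrate the calculation.
views = {
    "2022-01": {"oai:archive-a:1", "oai:archive-a:7", "oai:archive-b:3"},
    "2022-02": {"oai:archive-a:1", "oai:archive-b:3", "oai:archive-b:9"},
}

records = {
    "oai:archive-a:1": {"subjects": [("lcsh", "Mixtec language")]},
    "oai:archive-a:7": {"subjects": []},
    "oai:archive-b:3": {"subjects": [("lcsh", "Quechua language")]},
    "oai:archive-b:9": {"subjects": [("olac:language", "quy")]},
}

months = sorted(views)
for earlier, later in zip(months, months[1:]):
    overlap = views[earlier] & views[later]
    print(f"{earlier} -> {later}: {len(overlap)} records viewed in both months")

viewed_ids = set().union(*views.values())
with_lcsh = [rid for rid in viewed_ids
             if any(scheme == "lcsh" for scheme, _ in records[rid]["subjects"])]
print(f"{len(with_lcsh)} of {len(viewed_ids)} viewed records have LCSH subject terms")
```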

References

Bird, Steven, and Gary Simons. 2021. “Towards an Agenda for Open Language Archiving.” In Proceedings of the International Workshop on Digital Language Archives: LangArc 2021, edited by Oksana Zavalina and Shobhana Lakshmi Chelliah, 25–28. Denton, Texas: University of North Texas. doi:10.12794/langarc1851171.

Hider, Philip, and Kah-Ching Tan. 2008. “Constructing Record Quality Measures Based on Catalog Use.” Cataloging & Classification Quarterly 46 (4). Routledge: 338–61. doi:10.1080/01639370802322515.

Hughes, Baden. 2005. “Metadata Quality Evaluation: Experience from the Open Language Archives Community.” In Digital Libraries: International Collaboration and Cross-Fertilization, edited by Zhaoneng Chen, Hsinchun Chen, Qihao Miao, Yuxi Fu, Edward Fox, and Ee-peng Lim, 320–29. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. doi:10.1007/978-3-540-30544-6_34.

Husic, Geoff. 2014. “Enhancing an Open-Access Linguistics Journal Archive with Library of Congress-like Metadata: A Case Study of the Effectiveness for Improving Discovery.” Kansas Working Papers in Linguistics 35 (December): 1–8. doi:10.17161/KWPL.1808.15945.

Kumar, Vinit, Sharad Maurya, and Gopal Ji. 2022. “An Assessment of Spelling and Typographical Errors in WebOPAC of Academic Libraries.” Journal of Library Metadata 22 (1–2). Routledge: 63–73. doi:10.1080/19386389.2022.2051978.

Osborn, Carlyn. 2019. “How Do Levels of Description Affect Discoverability of the Web Archives at the Library of Congress?” Poster presented at the SAA Research Forum, Austin, Texas. https://www2.archivists.org/sites/all/files/flattened_Osborn_SAA_Research_Forum_2019.pdf.

Petrucciani, Alberto. 2015. “Quality of Library Catalogs and Value of (Good) Catalogs.” Cataloging & Classification Quarterly 53 (3–4). Routledge: 303–13. doi:10.1080/01639374.2014.1003669.

User-Generated Metadata

The following is a quick extract of academic papers from my files on the topics of user-generated metadata and metadata quality.

Stvilia, Gasser & Twidale (2007). Metadata Quality Problems in Federated Collections. In Al-Hakim, L. (Ed.), Challenges of Managing Information Quality in Service Organizations (pp. 154–186). IGI Global. https://doi.org/10.4018/978-1-59904-420-0.ch008
Barrett (2021). Participatory Description and Metadata Quality in Rapid Response Archives. Collections, 17(3), 226–236. https://doi.org/10.1177/1550190620981038
Béquet (2016). Questionable Practices in Scholarly Publishing: The Stance of the ISSN Network. Retrieved from https://library.ifla.org/id/eprint/1462/
Chassanoff (n.d.). Metadata Quality Evaluation in Institutional Repositories: A Survey of Current Practices. The University of North Carolina at Chapel Hill University Libraries. https://doi.org/10.17615/w2kv-3c50
Gligorov (2012). User-generated metadata in audio-visual collections. Association for Computing Machinery. https://doi.org/10.1145/2187980.2187998
Gregg, Erdmann, Paglione, Schneider & Dean (2019). A literature review of scholarly communications metadata. Research Ideas and Outcomes, 5, e38698. https://doi.org/10.3897/rio.5.e38698
Guo, Zhang & Liu (2013). Scientific metadata quality enhancement for scholarly publications. https://doi.org/10.9776/13382
Hendricks, Tkaczyk, Lin & Feeney (2020). Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies, 1(1), 414–427. https://doi.org/10.1162/qss_a_00022
Hillmann & Phipps (2007). Application Profiles: Exposing and Enforcing Metadata Quality. Dublin Core Metadata Initiative & National Library Board Singapore. Retrieved from https://dcpapers.dublincore.org/pubs/article/view/866
Hillmann, Dushay & Phipps (n.d.). Improving Metadata Quality: Augmentation and Recombination. 8.
Hillmann (2008). Metadata Quality: From Evaluation to Augmentation. Cataloging & Classification Quarterly, 46(1), 65–80. https://doi.org/10.1080/01639370802183008
Hrynaszkiewicz & Grant (2018). Embedding Research Data Management Support in the Scholarly Publishing Workflow. https://doi.org/10.31219/osf.io/brzwm
Inácio, Ferreira & Couto (2017). Metadata Analyser: Measuring Metadata Quality. Springer International Publishing. https://doi.org/10.1007/978-3-319-60816-7_24
Khwileh, Ganguly & Jones (2016). Utilisation of Metadata Fields and Query Expansion in Cross-Lingual Search of User-Generated Internet Video. Journal of Artificial Intelligence Research, 55, 249–281. https://doi.org/10.1613/jair.4775
Koshoffer, Neeser, Newman & Johnston (2018). Giving Datasets Context: A Comparison Study of Institutional Repositories that Apply Varying Degrees of Curation. International Journal of Digital Curation, 13(1), 15–34. https://doi.org/10.2218/ijdc.v13i1.632
Krichel & Warner (2001). A metadata framework to support scholarly communication. International Conference on Dublin Core and Metadata Applications, 131–137. Retrieved from https://dcpapers.dublincore.org/pubs/article/view/651
Lammey, Mitchell & Counsell (2018). Metadata 2020: A collaborative effort to improve metadata quality in scholarly communications. Septentrio Conference Series, (1). https://doi.org/10.7557/5.4471
Liu, Guo & Zhang (2014). Scholar metadata and knowledge generation with human and artificial intelligence. Journal of the Association for Information Science and Technology, 65(6), 1187–1201. https://doi.org/10.1002/asi.23013
Lorenzini, Rospocher & Tonelli (2021). On assessing metadata completeness in digital cultural heritage repositories. Digital Scholarship in the Humanities, 36(Supplement_2), ii182–ii188. https://doi.org/10.1093/llc/fqab036
Manzo, Kaufman, Punjasthitkul & Flanagan (n.d.). "By the People, For the People": Assessing the Value of Crowdsourced, User-Generated Metadata. 17.
Memmel (2015). Socially Enhanced Access to Digital Resources. Technische Universität Kaiserslautern. Retrieved from https://kluedo.ub.uni-kl.de/frontdoor/index/index/docId/4120
Payant, Skeen, Arnljots & Williams (2020). Beyond crowdsourcing: Working with donors, student fieldworkers, and community scholars to improve cultural heritage collection metadata. Journal of Digital Media Management, 8(3), 242–253.
Quarati (2021). Open Government Data: Usage trends and metadata quality. Journal of Information Science, 01655515211027775. https://doi.org/10.1177/01655515211027775
Quarati & Raffaghelli (2022). Do researchers use open research data? Exploring the relationships between usage trends and metadata quality across scientific disciplines from the Figshare case. Journal of Information Science, 48(4), 423–448. https://doi.org/10.1177/0165551520961048
Rousidis, Garoufallou, Balatsoukas & Sicilia (2014). Metadata for Big Data: A preliminary investigation of metadata quality issues in research data repositories. Information Services & Use, 34(3-4), 279–286. https://doi.org/10.3233/ISU-140746
Schoenberger (2017). Who Wrote This? Creator Metadata Quality on Academia.Edu.
Spindler (2014). An Evaluation of Crowdsourcing and Participatory Archives Projects for Archival Description and Transcription. 28.
Stein, Applegate & Robbins (2017). Achieving and Maintaining Metadata Quality: Toward a Sustainable Workflow for the IDEALS Institutional Repository. Cataloging & Classification Quarterly, 55(7-8), 644–666. https://doi.org/10.1080/01639374.2017.1358786
Stvilia & Jörgensen (2009). User-generated collection-level metadata in an online photo-sharing system. Library & Information Science Research, 31(1), 54–65. https://doi.org/10.1016/j.lisr.2008.06.006
Stvilia & Gasser (2008). Value-based metadata quality assessment. Library & Information Science Research, 30(1), 67–74. https://doi.org/10.1016/j.lisr.2007.06.006
Syn & Spring (2013). Finding subject terms for classificatory metadata from user-generated social tags. Journal of the American Society for Information Science and Technology, 64(5), 964–980. https://doi.org/10.1002/asi.22804
Takase, Hasan, Matsumoto & Sasaki (2014). Capturing User-Generated Metadata. Sissa Medialab. https://doi.org/10.22323/1.210.0033
Hooland, Rodríguez & Boydens (2011). Between Commodification and Engagement: On the Double-Edged Impact of User-Generated Metadata within the Cultural Heritage Sector. Library Trends, 59(4), 707–720. https://doi.org/10.1353/lib.2011.0011
Hooland (2006). From Spectator to Annotator: Possibilities offered by User-Generated Metadata for Digital Cultural Heritage Collections. 16.
Weagley, Gelches & Park (2010). Interoperability and Metadata Quality in Digital Video Repositories: A Study of Dublin Core. Journal of Library Metadata, 10(1), 37–57. https://doi.org/10.1080/19386380903546984