Metadata Quality Assessment & OLAC
Metadata quality is a fascinating topic, and one I would like more professional engagement around. I am specifically interested in user-contributed metadata platforms and quality control in those digital library contexts. I have long noticed the need for spell checking, and for the ability to promote folksonomy terms to controlled vocabulary terms, at certain archives stewarding linguistic and language resources.
I read the following three articles for UNT work:
- Hider, P., & Tan, K.-C. (2008). Constructing Record Quality Measures Based on Catalog Use. Cataloging & Classification Quarterly, 46(4), 338–361. https://doi.org/10.1080/01639370802322515
- Kumar, V., Maurya, S., & Ji, G. (2022). An Assessment of Spelling and Typographical Errors in WebOPAC of Academic Libraries. Journal of Library Metadata, 22(1–2), 63–73. https://doi.org/10.1080/19386389.2022.2051978
- Petrucciani, A. (2015). Quality of Library Catalogs and Value of (Good) Catalogs. Cataloging & Classification Quarterly, 53(3–4), 303–313. https://doi.org/10.1080/01639374.2014.1003669
Across these articles, I really appreciate the internationally diverse perspectives the authors bring. Kumar et al. (2022) lays out a very clear methodology (I feel I could replicate the experiment in my own setting) and looks at spelling mistakes in a catalog at an institution where the language of instruction may be English, but the first language of many (maybe most) of the people there is not English. It seems natural that spelling mistakes would accumulate in such environments. At the archive where I worked (in Dallas, TX, no less) I offered to do a spell-check analysis on titles and folksonomy terms. The staff were less than ecstatic to receive any feedback on the records, as that would mean more work for them to correct the errors. (In that situation the archive's catalog has central staff supporting some collections and core infrastructure, and remote staff around the world supporting location-specific collections.) The collection in which a misspelling was found determined which remote staff would need to approve the change. So I appreciate that the management at the institution where Kumar and colleagues did their work was willing to allow this kind of research, and I hope they then engaged with the results.
The archive where I worked also has a remote accession tool for digital objects. I have suggested adding a spell-check system to that tool to catch errors during the accession process, but it has been argued that such features are not mission critical. This leads back to an analysis of management's financial priorities. If an archive has no specific materials re-use objective, that can affect record quality in interesting ways. For example, an archive without a specific requirement to engage the public in innovative ways (or at all), or where that engagement work is disjunct from cataloging, curation, and preservation activities, may not maintain its records in a way that benefits potential target audiences.
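To make the spell-check suggestion concrete, here is a minimal sketch of the kind of analysis I offered to run. It assumes the third-party pyspellchecker package and a hypothetical CSV export with id, title, and tags columns; none of those names come from the archive's actual tooling.

```python
import csv
import re

# Third-party package assumed for this sketch: pip install pyspellchecker
from spellchecker import SpellChecker

spell = SpellChecker(language="en")

def tokenize(text):
    """Split a metadata value into plain alphabetic word tokens."""
    return re.findall(r"[A-Za-z]+", text)

def report_misspellings(csv_path):
    """Print likely misspellings in the title and folksonomy-term columns of an export."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for field in ("title", "tags"):  # hypothetical column names
                words = [w.lower() for w in tokenize(row.get(field, ""))]
                for word in spell.unknown(words):
                    print(f"{row.get('id', '?')}\t{field}\t{word}\t-> {spell.correction(word)}")

report_misspellings("catalog_export.csv")  # hypothetical export of titles and tags
```

In practice, proper nouns, language names, and legitimate non-English vocabulary would need a project-specific exception list, which is exactly where folksonomy terms get interesting.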
Since I am interested in linguistics and language resource records and their quality, I found the description of the types of errors in Italian libraries interesting. Petrucciani (2015), an article derived from a conference address, says the following of the Italian situation (pp. 304 & 309):
In Italy, for example, the main factors may be the widespread outsourcing of cataloging activities (especially retrospective conversion and cataloging of backlogs, bequests, etc.) with very low remuneration and lack of quality control, and the participation in library networks (mainly the national network SBN, Servizio Bibliotecario Nazionale) of small libraries—public, religious, or special—run by unqualified people or volunteers, without skilled staff. In many countries, an apparent factor is the lack of competence in foreign languages among cataloging staff, resulting in wrong language coding (e.g., Spanish for Italian or vice versa, Russian for books in any language written in Cyrillic script) and the abundance of misprinted words.
These (and many more) examples are not errors. Or better, they are not simply errors. They are the result of cataloging rules and policies, of the design of OPAC interfaces, or of bibliographic formats. The low quality of large catalogs is not simply due to the vast amount of duplicated records, unprofessional work, errors of interpretation, miskeyings, and so on, but in the first place to the misunderstanding (and undervaluing) of the functions and aims of bibliographic information and to unwise cataloging standards, rules, and policies.
Again, he attributes the errors to management practices established at institutions. In other parts of his paper Petrucciani points out that tools like Google Search and full-text search expose the problems that less-than-perfect catalogs have; that is, end-users, for a variety of reasons, turn to Google Search instead of catalog searches. I have long found it interesting that Google doesn't return results for books found in library catalogs. It's as if library catalogs are part of the dark web.
I found the approach by Hider and Tan (2008) inspiring. I often look at records found in the Open Language Archives Community (OLAC). One of the open questions there has been how to assess data providers and the quality of the data provided. OLAC records are often sourced from digital repositories where accessions are described by the creators/depositors, which means that variation in management practices at the data providers feeding OLAC affects the quality of the records OLAC has to aggregate.
The current status quo for OLAC metadata quality follows the method laid out by Hughes (2005): count the Dublin Core elements on each record, average those counts across an archive's records, and assign that average to the data provider in order to rank the data providers. In this framework quality is measured only on the basis of the syntax (the number of elements) present in the record rather than any sort of syntax-semantics relationship. E.g., does a date really appear in a date element, and is it really a valid date? The method does not attempt to measure the validity or conformity of the data values to Dublin Core or OLAC recommendations in terms of the semantics or the well-formedness of the values. In a sense the OLAC notion of quality is similar to what MARC record creators call Encoding Level.
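As a rough illustration of the difference between these two notions of quality, here is a sketch under my own assumptions: each record is represented as a dict mapping DC element names to lists of string values, and the ISO-style date test stands in for a real conformance check. Neither is taken from Hughes (2005) or from OLAC tooling.

```python
import re
from statistics import mean

# A record is assumed here to be a dict mapping DC element names to lists of string values.
ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")  # e.g. 1998, 1998-05, 1998-05-20

def element_count_score(record):
    """Syntactic score in the Hughes style: how many DC elements are populated at all?"""
    return sum(1 for values in record.values() if any(v.strip() for v in values))

def dates_look_valid(record):
    """A small semantic check: do the dc:date values actually look like dates?"""
    dates = record.get("date", [])
    return all(ISO_DATE.match(v.strip()) for v in dates) if dates else None

def rank_providers(providers):
    """Average the per-record element counts and rank data providers by that average."""
    averages = {name: mean(element_count_score(r) for r in records)
                for name, records in providers.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

The first function rewards the mere presence of elements, which is the status quo; the second asks whether a value means what its element claims, which is what the status quo leaves unmeasured.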
In contrast to B. Hughes' approach, I have suggested that the DCMIType vocabulary, in conjunction with the dc:format, dc:relation, dc:medium, and tableOfContents elements, could be used to provide continuity checks on the type of artifact being described. The purpose would be to check the completeness of a description when there is evidence in the record that a specific type is present. This would be a step towards enforcing the Dublin Core 1-to-1 principle. In a way, the suggested process uses a fingerprint for each DCMIType. Fingerprints for common linguistic artifact types could also be crafted and added to the checks; these crafted fingerprints would appear in the documentation wherever recommendations exist for describing materials.
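A minimal sketch of what such a fingerprint check could look like follows. The fingerprint table is illustrative only (it is not a published OLAC recommendation), and the record representation is the same dict-of-lists assumption used in the sketch above.

```python
# Illustrative fingerprints: if a record asserts a DCMIType, which other elements
# would we expect to see before calling the description complete?
FINGERPRINTS = {
    "Sound":       {"format", "medium", "relation"},
    "MovingImage": {"format", "medium", "relation"},
    "Text":        {"format", "tableOfContents"},
    "Collection":  {"tableOfContents", "relation"},
}

def continuity_gaps(record):
    """Return the expected-but-missing elements for each DCMIType asserted in the record."""
    present = {name for name, values in record.items() if any(v.strip() for v in values)}
    gaps = {}
    for dcmi_type in record.get("type", []):
        expected = FINGERPRINTS.get(dcmi_type.strip())
        if expected and expected - present:
            gaps[dcmi_type] = sorted(expected - present)
    return gaps

# Hypothetical record: typed as Sound but missing format and medium information.
example = {"type": ["Sound"], "title": ["Wordlist recording"], "relation": ["oai:example:123"]}
print(continuity_gaps(example))  # {'Sound': ['format', 'medium']}
```

Checks like this are only coherent when each record describes exactly one resource, which is why they pair naturally with the 1-to-1 principle.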
A third way to look at quality is to look at the records that people are actually visiting and clicking through: people found these records, and they must be interesting for some reason. Investigating which records are most successful at generating click-throughs to target websites is one way to assess the quality of a record. That is, is the record valuable to an audience? This third method of establishing quality is discussed in Hider & Tan (2008). They look at MARC records, whereas OLAC has Dublin Core records, so some generalization of their arguments may be needed before application.
I find the idea of measuring record quality based on its use interesting. With nearly 450K records in OLAC but only some 8,000 views per month (Bird & Simons, 2021), a natural question arises: which records are being viewed, and are they the same ones month over month? Is there something “higher quality” about these records that makes them found and used? Recent work by Husic (2014) and Osborn (2019) shows that adding Library of Congress Subject Headings (LCSH) to records does make them more visible to search engines. So maybe the highly used OLAC records have LCSH terms attached? This is something worth looking into and reporting on.
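If per-record view statistics were available, the check I have in mind could be as simple as the sketch below. Both inputs are assumptions for illustration: a views mapping from record identifiers to monthly view counts, and records that carry an id string alongside dc:subject values flagged with a visible scheme marker such as “[LCSH]”.

```python
def has_lcsh_subject(record):
    """Assume LCSH-sourced subjects carry a scheme marker in the value, e.g. '[LCSH]'."""
    return any("lcsh" in v.lower() for v in record.get("subject", []))

def lcsh_share(records, views, top_n=100):
    """Compare the share of LCSH-bearing records among the most-viewed records and overall."""
    ranked = sorted(records, key=lambda r: views.get(r["id"], 0), reverse=True)
    share = lambda subset: sum(has_lcsh_subject(r) for r in subset) / len(subset)
    return {"top_viewed": share(ranked[:top_n]), "all_records": share(records)}
```

If the top-viewed share turned out to be noticeably higher than the overall share, that would be one piece of evidence that LCSH terms are doing discovery work in OLAC records.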
References
Bird, Steven, and Gary Simons. 2021. “Towards an Agenda for Open Language Archiving.” In Proceedings of the International Workshop on Digital Language Archives: LangArc 2021, edited by Oksana Zavalina and Shobhana Lakshmi Chelliah, 25–28. Denton, Texas: University of North Texas. doi:10.12794/langarc1851171.
Hider, Philip, and Kah-Ching Tan. 2008. “Constructing Record Quality Measures Based on Catalog Use.” Cataloging & Classification Quarterly 46 (4). Routledge: 338–61. doi:10.1080/01639370802322515.
Hughes, Baden. 2005. “Metadata Quality Evaluation: Experience from the Open Language Archives Community.” In Digital Libraries: International Collaboration and Cross-Fertilization, edited by Zhaoneng Chen, Hsinchun Chen, Qihao Miao, Yuxi Fu, Edward Fox, and Ee-peng Lim, 320–29. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. doi:10.1007/978-3-540-30544-6_34.
Husic, Geoff. 2014. “Enhancing an Open-Access Linguistics Journal Archive with Library of Congress-like Metadata: A Case Study of the Effectiveness for Improving Discovery.” Kansas Working Papers in Linguistics 35 (December): 1–8. doi:10.17161/KWPL.1808.15945.
Kumar, Vinit, Sharad Maurya, and Gopal Ji. 2022. “An Assessment of Spelling and Typographical Errors in WebOPAC of Academic Libraries.” Journal of Library Metadata 22 (1–2). Routledge: 63–73. doi:10.1080/19386389.2022.2051978.
Osborn, Carlyn. 2019. “How Do Levels of Description Affect Discoverability of the Web Archives at the Library of Congress?” Poster presented at the SAA Research Forum, Austin, Texas. https://www2.archivists.org/sites/all/files/flattened_Osborn_SAA_Research_Forum_2019.pdf.
Petrucciani, Alberto. 2015. “Quality of Library Catalogs and Value of (Good) Catalogs.” Cataloging & Classification Quarterly 53 (3–4). Routledge: 303–13. doi:10.1080/01639374.2014.1003669.
User-Generated Metadata
The following is a quick extract of academic papers from my files on the topics of user-generated metadata and metadata quality.
- Stvilia, Gasser & Twidale (2007)
- Stvilia, B., Gasser, L. & Twidale, M. (2007). Metadata Quality Problems in Federated Collections. In Al-Hakim, L. (Eds.), Challenges of Managing Information Quality in Service Organizations. (pp. 154–186). IGI Global. https://doi.org/10.4018/978-1-59904-420-0.ch008
- Barrett (2021)
- Barrett, S. (2021). Participatory Description and Metadata Quality in Rapid Response Archives. Collections, 17(3). 226–236. https://doi.org/10.1177/1550190620981038
- Béquet (2016)
- Béquet, G. (2016). Questionable Practices in Scholarly Publishing: the Stance of the ISSN Network. Retrieved from https://library.ifla.org/id/eprint/1462/
- Chassanoff (n.d.)
- Chassanoff, A. (n.d.). Metadata Quality Evaluation in Institutional Repositories: A Survey of Current Practices. The University of North Carolina at Chapel Hill University Libraries. https://doi.org/10.17615/w2kv-3c50
- Gligorov (2012)
- Gligorov, R. (2012). User-generated metadata in audio-visual collections. Association for Computing Machinery. https://doi.org/10.1145/2187980.2187998
- Gregg, Erdmann, Paglione, Schneider & Dean (2019)
- Gregg, W., Erdmann, C., Paglione, L., Schneider, J. & Dean, C. (2019). A literature review of scholarly communications metadata. Research Ideas and Outcomes, 5. e38698. https://doi.org/10.3897/rio.5.e38698
- Guo, Zhang & Liu (2013)
- Guo, C., Zhang, J. & Liu, X. (2013). Scientific metadata quality enhancement for scholarly publications. https://doi.org/10.9776/13382
- Hendricks, Tkaczyk, Lin & Feeney (2020)
- Hendricks, G., Tkaczyk, D., Lin, J. & Feeney, P. (2020). Crossref: The sustainable source of community-owned scholarly metadata. Quantitative Science Studies, 1(1). 414–427. https://doi.org/10.1162/qss_a_00022
- Hillmann & Phipps (2007)
- Hillmann, D. & Phipps, J. (2007). Application Profiles: Exposing and Enforcing Metadata Quality. Dublin Core Metadata Initiative & National Library Board Singapore. Retrieved from https://dcpapers.dublincore.org/pubs/article/view/866
- Hillmann, Dushay & Phipps (n.d.)
- Hillmann, D., Dushay, N. & Phipps, J. (n.d.). Improving Metadata Quality: Augmentation and Recombination. 8.
- Hillmann (2008)
- Hillmann, D. (2008). Metadata Quality: From Evaluation to Augmentation. Cataloging & Classification Quarterly, 46(1). 65–80. https://doi.org/10.1080/01639370802183008
- Hrynaszkiewicz & Grant (2018)
- Hrynaszkiewicz, I. & Grant, R. (2018). Embedding Research Data Management Support in the Scholarly Publishing Workflow. https://doi.org/10.31219/osf.io/brzwm
- Inácio, Ferreira & Couto (2017)
- Inácio, B., Ferreira, J. & Couto, F. (2017). Metadata Analyser: Measuring Metadata Quality. Springer International Publishing. https://doi.org/10.1007/978-3-319-60816-7_24
- Khwileh, Ganguly & Jones (2016)
- Khwileh, A., Ganguly, D. & Jones, G. (2016). Utilisation of Metadata Fields and Query Expansion in Cross-Lingual Search of User-Generated Internet Video. Journal of Artificial Intelligence Research, 55. 249–281. https://doi.org/10.1613/jair.4775
- Koshoffer, Neeser, Newman & Johnston (2018)
- Koshoffer, A., Neeser, A., Newman, L. & Johnston, L. (2018). Giving Datasets Context: a Comparison Study of Institutional Repositories that Apply Varying Degrees of Curation. International Journal of Digital Curation, 13(1). 15–34. https://doi.org/10.2218/ijdc.v13i1.632
- Krichel & Warner (2001)
- Krichel, T. & Warner, S. (2001). A metadata framework to support scholarly communication. International Conference on Dublin Core and Metadata Applications. 131–137. Retrieved from https://dcpapers.dublincore.org/pubs/article/view/651
- Lammey, Mitchell & Counsell (2018)
- Lammey, R., Mitchell, D. & Counsell, F. (2018). Metadata 2020: A collaborative effort to improve metadata quality in scholarly communications. Septentrio Conference Series(1). https://doi.org/10.7557/5.4471
- Liu, Guo & Zhang (2014)
- Liu, X., Guo, C. & Zhang, L. (2014). Scholar metadata and knowledge generation with human and artificial intelligence. Journal of the Association for Information Science and Technology, 65(6). 1187–1201. https://doi.org/10.1002/asi.23013
- Lorenzini, Rospocher & Tonelli (2021)
- Lorenzini, M., Rospocher, M. & Tonelli, S. (2021). On assessing metadata completeness in digital cultural heritage repositories. Digital Scholarship in the Humanities, 36(Supplement_2). ii182–ii188. https://doi.org/10.1093/llc/fqab036
- Manzo, Kaufman, Punjasthitkul & Flanagan (n.d.)
- Manzo, C., Kaufman, G., Punjasthitkul, S. & Flanagan, M. (n.d.). “By the People, For the People”: Assessing the Value of Crowdsourced, User-Generated Metadata. 17.
- Payant, Skeen, Arnljots & Williams (2020)
- Payant, A., Skeen, B., Arnljots, A. & Williams, R. (2020). Beyond crowdsourcing: Working with donors, student fieldworkers, and community scholars to improve cultural heritage collection metadata. Journal of Digital Media Management, 8(3). 242–253.
- Quarati (2021)
- Quarati, A. (2021). Open Government Data: Usage trends and metadata quality. Journal of Information Science. 01655515211027775. https://doi.org/10.1177/01655515211027775
- Quarati & Raffaghelli (2022)
- Quarati, A. & Raffaghelli, J. (2022). Do researchers use open research data? Exploring the relationships between usage trends and metadata quality across scientific disciplines from the Figshare case. Journal of Information Science, 48(4). 423–448. https://doi.org/10.1177/0165551520961048
- Rousidis, Garoufallou, Balatsoukas & Sicilia (2014)
- Rousidis, D., Garoufallou, E., Balatsoukas, P. & Sicilia, M. (2014). Metadata for Big Data: A preliminary investigation of metadata quality issues in research data repositories. Information Services & Use, 34(3-4). 279–286. https://doi.org/10.3233/ISU-140746
- Schoenberger (2017)
- Schoenberger, Z. (2017). Who Wrote This? Creator Metadata Quality on Academia.Edu.
- Spindler (2014)
- Spindler, R. (2014). An Evaluation of Crowdsourcing and Participatory Archives Projects for Archival Description and Transcription. 28.
- Stein, Applegate & Robbins (2017)
- Stein, A., Applegate, K. & Robbins, S. (2017). Achieving and Maintaining Metadata Quality: Toward a Sustainable Workflow for the IDEALS Institutional Repository. Cataloging & Classification Quarterly, 55(7-8). 644–666. https://doi.org/10.1080/01639374.2017.1358786
- Stvilia & Jörgensen (2009)
- Stvilia, B. & Jörgensen, C. (2009). User-generated collection-level metadata in an online photo-sharing system. Library & Information Science Research, 31(1). 54–65. https://doi.org/10.1016/j.lisr.2008.06.006
- Stvilia & Gasser (2008)
- Stvilia, B. & Gasser, L. (2008). Value-based metadata quality assessment. Library & Information Science Research, 30(1). 67–74. https://doi.org/10.1016/j.lisr.2007.06.006
- Syn & Spring (2013)
- Syn, S. & Spring, M. (2013). Finding subject terms for classificatory metadata from user-generated social tags. Journal of the American Society for Information Science and Technology, 64(5). 964–980. https://doi.org/10.1002/asi.22804
- Takase, Hasan, Matsumoto & Sasaki (2014)
- Takase, W., Hasan, A., Matsumoto, Y. & Sasaki, T. (2014). Capturing User-Generated Metadata. Sissa Medialab. https://doi.org/10.22323/1.210.0033
- Hooland, Rodríguez & Boydens (2011)
- Hooland, S., Rodríguez, E. & Boydens, I. (2011). Between Commodification and Engagement: On the Double-Edged Impact of User-Generated Metadata within the Cultural Heritage Sector. Library Trends, 59(4). 707–720. https://doi.org/10.1353/lib.2011.0011
- Hooland (2006)
- Hooland, S. (2006). From Spectator to Annotator: Possibilities offered by User-Generated Metadata for Digital Cultural Heritage Collections. 16.
- Weagley, Gelches & Park (2010)
- Weagley, J., Gelches, E. & Park, J. (2010). Interoperability and Metadata Quality in Digital Video Repositories: A Study of Dublin Core. Journal of Library Metadata, 10(1). 37–57. https://doi.org/10.1080/19386380903546984