Corpora, Collections, and Datasets

18 Jan, 2021 7 min read Research Note

There has been some discussion among language documentation practitioners of what constitutes a collection. Johnson (2004, p. 142) equates a collection and a corpus, saying:

“A collection, or corpus, is the body of documentary materials created by linguists and native speakers in the course of their research.”

Simons (2008) and Boerger (2011)¹ advance the idea that the materials generated as part of a language documentation effort are a corpus, without distinguishing between a corpus and a collection². Many other language documentation practitioners (Lüpke, 2010; Good, 2011; Austin, 2013) extend or modify the meaning of “corpus” when referring to a collection, using a phrase like “language documentation corpus”. In contrast to this coalescence of corpus and collection, Thieberger (2018), in a blog post, credits Jane Simpson with proposing the following terms: Assemblage – all material collected, working files, early sources, multiple versions and drafts; Collection – the archived material, a subset of the above, but curated with sufficient metadata to allow the user to know what all items are; Corpus – a crafted set of texts in the language that can be used for further analysis. For a recent treatment of these ideas see Sullivant (2020), which expounds on Simpson’s ideas. For a variety of reasons, I am not convinced that the distinction between assemblage and collection is relevant. My interest, however, is specifically in how the distinction might affect the crafting of a bibliographic reference. I suggest that reference formulations do not need to differ between an assemblage and a collection.

Lest I leave the definitions distinguishing a corpus and a collection undisputed, let me suggest some pragmatic distinctions. Biber (1993), in both “Using Register-Diversified Corpora for General Language Studies” and “Representativeness in Corpus Design”, sets up diversity-based criteria for corpus design. One interpretation of Biber’s work is that if a body of evidence does not meet the diversity threshold, then it does not qualify as a corpus. Diversity as a goal for the materials generated through language documentation efforts has been a long-standing tenet of creating an evidence-based documentary record for a language (Himmelmann, 1998). Biber’s diversity arguments have even been suggested as guidelines for elicitation in language documentation (Lüpke, 2010). In this sense it is clear that selection is a process which impacts both corpus design and what is collected during fieldwork. However, a second useful criterion for defining a corpus (in contrast to a collection) might be that there is an existing tool set which can be used to exploit the collection of artifacts as a single work. This could lead to an interpretation that a corpus is more like a dataset in the Dublin Core sense, while a collection is more like a collection in the Dublin Core sense. Indeed, Xia et al. (2016) blur the line between database and corpus in their work. Both datasets (of which databases are a type) and collections can be considered examples of aggregate works. I realize that linguists and language materials archivists have applied these labels loosely, but I side with the Society of American Archivists (2013), rule 2.3.19, in its suggested distinguishing terms for the highest level of an archival unit of materials: “Archival materials are frequently described by devised aggregate terms such as papers (for personal materials), records (for organizational materials), or collection (for topical aggregations)” (see Wahbeh, 2009, for an example of application). There are other terms which still need to be fleshed out among language archive practitioners, such as archival deposit, accession, and what one might call an artifact and its associated metadata file.

Bibliography

Austin (2013): Austin, P. (2013). Language documentation and meta-documentation. In Jones, M. & Ogilvie, S. (Eds.), Keeping Languages Alive: Documentation, Pedagogy and Revitalization. (pp. 3–15). Cambridge University Press. https://doi.org/10.1017/CBO9781139245890.003
Biber (1993): Biber, D. (1993). Representativeness in Corpus Design. Literary and Linguistic Computing, 8(4). 243–257. https://doi.org/10.1093/llc/8.4.243
Biber (1993): Biber, D. (1993). Using Register-Diversified Corpora for General Language Studies. Computational Linguistics, 19(2). 219–241. Retrieved from https://www.aclweb.org/anthology/J93-2001
Boerger (2011): Boerger, B. (2011). To BOLDly Go Where No One Has Gone Before. Language Documentation & Conservation, 5. 208–233. Retrieved from http://scholarspace.manoa.hawaii.edu/handle/10125/4499
Good (2011): Good, J. (2011). Data and language documentation. In Austin, P. & Sallabank, J. (Eds.), The Cambridge Handbook of Endangered Languages. (pp. 212–234). Cambridge University Press. https://doi.org/10.1017/CBO9780511975981.011
Himmelmann (1998): Himmelmann, N. (1998). Documentary and Descriptive Linguistics. Linguistics, 36(1). 161–195. https://doi.org/10.1515/ling.1998.36.1.161
Johnson (2004): Johnson, H. (2004). Language documentation and archiving, or how to build a better corpus. Language Documentation and Description, 2. 140–153. Retrieved from http://www.elpublishing.org/PID/026
Lüpke (2010): Lüpke, F. (2010). Research methods in language documentation. Language Documentation and Description, 7. 55–104.
Simons (2008): Simons, G. (2008). Documentary Linguistics and a New Kind of Corpus. Retrieved from https://scholars.sil.org/sites/scholars/files/gary_f_simons/presentation/doc_ling.pdf
Society of American Archivists (2013): Society of American Archivists. (2013). Describing archives: a content standard (2nd ed.). Society of American Archivists. Retrieved from http://files.archivists.org/pubs/DACS2E-2013_v0315.pdf
Sullivant (2020): Sullivant, R. (2020). Archival description for language documentation collections. Language Documentation & Conservation, 14. 520–578. Retrieved from http://hdl.handle.net/10125/24949
Thieberger (2018): Thieberger, N. (2018). Texts and more texts: corpora in the CoEDL [Blog]. Endangered Languages and Cultures. Retrieved from https://www.paradisec.org.au/blog/2018/02/texts-and-more-texts-corpora-in-the-coedl/
Wahbeh (2009): Wahbeh, F. (2009, August 10). Retrieved from https://blogs.cul.columbia.edu/schapiro/2009/08/10/on-records-papers-and-collection-a-dacs-case-in-point/
Xia, Lewis, Goodman, Slayden, Georgi, Crowgey & Bender (2016): Xia, F., Lewis, W., Goodman, M., Slayden, G., Georgi, R., Crowgey, J. & Bender, E. (2016). Enriching a massively multilingual database of interlinear glossed text. Language Resources and Evaluation, 50(2). 321–349. https://doi.org/10.1007/s10579-015-9325-4

¹ In footnote 2.
² A second consideration here is that these authors might be referring to a much stricter definition of “language documentation”, such as is found in Himmelmann (1998), where Himmelmann proposes a grid-like structure for collecting speech from various kinds of social events and various sectors of a speech community. By and large, when I hear people talking about “language documentation” more than two decades after Himmelmann’s approach was stated, I don’t see large-scale “corpus building” activities with wide topical and social coverage. Rather, I see many PhD-scoped projects and 1–3 year grant-funded projects with particular interests in an anthropological or grammatical theme, etc. Therefore the material created today, and perhaps since 1998, has not really fit the definitions that these early authors assume. That is, there has been semantic drift in the usage of the term “language documentation” as it applies to linguistic fieldwork and its outputs. Perhaps these other outputs are better classified as digital linguistics.

Tags: Corpora Archival Corpus Terminology Collections Language Documentation Corpus Dublin Core DCMIType
Categories: Research Note
Content Mediums: Text

Hugh Paterson III

Collaborative Scholar

I specialize in bespoke research at the intersection of Linguistics, Law, Languages, and Technology; specifically utility and life-cycle management for information products in these spaces.
