Building Multilingual Comparable Corpora

Name: Building Multilingual Comparable Corpora
Start: 2021-03-04T00:00:00Z
End: 2021-03-07T00:00:00Z
Location: University of Hawai‘i at Mānoa

Rebecca Paterson, Abbie Hantgan‑Sonko, Christian Chanard

Conference, Peer Reviewed

Abstract

Building on existing corpora and new audio/video documentary fieldwork from 12+ languages from across West Africa, we are creating a multilingual comparative corpus with input from 20+ collaborating researchers (Nikitina et al. 2020). We present a toolkit of technologies and three parallel workflows that can be used to mobilize language materials from diverse sources for a variety of purposes, particularly for the discovery of discourse patterns in legacy materials that could then be used in revitalization efforts.

Our toolkit includes the following technologies: ELAN-CorpA (Chanard 2015; 2019), FieldWorks Language Explorer (FLEx, SIL International), Toolbox (SIL International), ELAN Tools (Chanard et al. 2020, under development), SpeechReporting Template (Nikitina et al. 2019), and Tsakorpus (Arkhangelskiy 2019). The three workflows differ in the initial file format and in the software platform used for parsing and glossing texts. All three workflows lead to a collection of annotated files that can be queried with ELAN-CorpA (Hantgan 2019).

In the first workflow, (1) ELAN-CorpA is used for time-aligned transcription and translation of a recorded text; (2) FLEx is used to parse and gloss the text; and (3) ELAN Tools converts the .flextext export into the project template for use in ELAN-CorpA. (4) Once in the project template, the text is annotated for project categories, and complex queries can be run across all texts in any language using the search features of ELAN-CorpA. In the second workflow, (1) transcription, translation, parsing and glossing are done in Toolbox; (2) ELAN Tools converts the Toolbox file to the project template; and (3) ELAN-CorpA is used to time-align the text and annotate it for the project. In the third workflow, transcription, translation, parsing and glossing are all done in ELAN-CorpA, using the project template from the initial stages. Using these three workflows, data from various source file types is processed into a shared format that will be displayed on an online platform via Tsakorpus.
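The three workflows can be summarized as ordered lists of (tool, task) steps, keyed by the software used for parsing and glossing. The following Python sketch is purely illustrative: the names `Step`, `WORKFLOWS`, `plan`, and `describe` are hypothetical and are not part of ELAN-CorpA, ELAN Tools, FLEx, or Toolbox. It only restates the steps described above as data and does not call any of those programs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One stage of a workflow: which tool is used and for what (hypothetical type)."""
    tool: str
    task: str


# Steps as described in the abstract, keyed by the parsing/glossing platform.
WORKFLOWS = {
    "FLEx": [
        Step("ELAN-CorpA", "time-aligned transcription and translation"),
        Step("FLEx", "parsing and glossing"),
        Step("ELAN Tools", "convert .flextext export to the project template"),
        Step("ELAN-CorpA", "annotate for project categories"),
    ],
    "Toolbox": [
        Step("Toolbox", "transcription, translation, parsing and glossing"),
        Step("ELAN Tools", "convert Toolbox file to the project template"),
        Step("ELAN-CorpA", "time alignment and project annotation"),
    ],
    "ELAN-CorpA": [
        Step("ELAN-CorpA",
             "transcription, translation, parsing and glossing in the project template"),
    ],
}


def plan(glossing_tool: str) -> list[Step]:
    """Return the ordered steps for the workflow that glosses in `glossing_tool`."""
    try:
        return WORKFLOWS[glossing_tool]
    except KeyError:
        known = ", ".join(sorted(WORKFLOWS))
        raise ValueError(f"unknown glossing tool {glossing_tool!r}; expected one of: {known}")


def describe(glossing_tool: str) -> list[str]:
    """Render a workflow as numbered lines, mirroring the (1), (2), ... style above."""
    return [f"({i}) {s.tool}: {s.task}" for i, s in enumerate(plan(glossing_tool), start=1)]


if __name__ == "__main__":
    for name in WORKFLOWS:
        print(f"Workflow with glossing in {name}:")
        for line in describe(name):
            print("  " + line)
```

Laid out this way, the common endpoint is easy to see: every workflow finishes in ELAN-CorpA with the text in the project template, which is what allows queries to run across all texts.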

This methodology may be of interest to community members looking for ways to prepare already-collected language materials in order to display them on the internet, those interested in specific questions regarding discourse phenomena, typologists, and linguists in general.

Date

4 Mar, 2021 — 7 Mar, 2021

Event

The 7th International Conference on Language Documentation & Conservation (ICLDC)

Location

University of Hawai‘i at Mānoa

Bibliography

Arkhangelskiy (2019): Arkhangelskiy, T. (2019). Corpora of social media in minority Uralic languages. Retrieved from http://volgakama.web-corpora.net/Social_media_corpora_IWCLUL2019_final.pdf
Chanard, Hantgan & Third Author Missing (2020): Chanard, C., Hantgan, A. & Third Author Missing (2020). ELAN Tools. CNRS-LLACAN.
Chanard (2019): Chanard, C. (2019). ELAN-CorpA. CNRS-LLACAN. Retrieved from http://llacan.vjf.cnrs.fr/res_ELAN-CorpA.php
Hantgan (2019): Hantgan, A. (2019). La Construction d’un Corpus Comparative : Une Méthodologie Pour L’étude Des Langues Parlées en Afrique de l’Ouest. Retrieved from https://lift2019.sciencesconf.org
Chanard (2015): Chanard, C. (2015). ELAN-CorpA: Lexicon-aided annotation in ELAN. In Mettouchi, A., Vanhove, M. & Caubet, D. (Eds.), Corpus-based Studies of Lesser-described Languages: The CorpAfroAs corpus of spoken AfroAsiatic languages. (pp. 311–332). John Benjamins Publishing Company. https://doi.org/10.1075/scl.68.10cha
Nikitina, Hantgan & Chanard (2019): Nikitina, T., Hantgan, A. & Chanard, C. (2019). Reported speech annotation template for ELAN. LLACAN.
Nikitina, Perekhvalskaya, Hantgan, Voll, Aplonova & Paterson (2020): Nikitina, T., Perekhvalskaya, E., Hantgan, A., Voll, R., Aplonova, E. & Paterson, R. (2020). SpeechReporting Corpus: Discourse Reporting in Storytelling. LLACAN.

Tags: West Africa ELAN Corpora Methodology
Categories: Conference Peer Reviewed
Locations: Hawaii
Countries: United States
Content Mediums: Event
Organizations: CNRS-LLACAN

Rebecca Paterson

Associate Research Scholar

My research interests include field linguistics, grammatical description, and translation.

Abbie Hantgan‑Sonko

Linguist

Christian Chanard

Programmer
