Building Multilingual Comparable Corpora

Abstract

Building on existing corpora and new audio/video documentary fieldwork on more than 12 languages spoken across West Africa, we are creating a multilingual comparative corpus with input from more than 20 collaborating researchers (Nikitina et al. 2020). We present a toolkit of technologies and three parallel workflows that can be used to mobilize language materials from diverse sources for a variety of purposes, in particular the discovery of discourse patterns in legacy materials that could then be used in revitalization efforts.

Our toolkit includes the following technologies: ELAN-CorpA (Chanard 2015; 2019), FieldWorks Language Explorer (FLEx, SIL International), Toolbox (SIL International), ELAN Tools (Chanard et al. 2020, under development), SpeechReporting Template (Nikitina et al. 2019) and Tsakorpus (Arkhangelskiy 2019). The three workflows differ in the initial file format and in the software platform used for parsing and glossing texts. All three workflows lead to a collection of annotated files that can be queried with ELAN-CorpA (Hantgan 2019).

In the first workflow, (1) ELAN-CorpA is used for time-aligned transcription and translation of a recorded text; (2) FLEx is used to parse and gloss the text; (3) ELAN Tools converts the .flextext export into the project template for use in ELAN-CorpA; and (4) once in the project template, the text is annotated for project categories, and complex queries can be run across all texts in any language using the search features of ELAN-CorpA. In the second workflow, (1) transcription, translation, parsing and glossing are done in Toolbox; (2) ELAN Tools converts the Toolbox file to the project template; and (3) ELAN-CorpA is used to time-align the text and annotate it for the project. In the third workflow, transcription, translation, parsing and glossing are all done in ELAN-CorpA, using the project template from the initial stages. With these three workflows, data from various source file types are processed into a shared format that will be displayed on an online platform via Tsakorpus.
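To make the conversion step of the first workflow more concrete, the following is a minimal sketch of reading a parsed and glossed text from a .flextext-style XML export and flattening it into simple records. It is not ELAN Tools and does not reproduce the project template: the sample document is a hypothetical, heavily simplified fragment, the element and attribute names are assumed to follow the general layout of a .flextext export, and the helper names `item_text` and `flextext_to_records` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample modelled on a .flextext export.
# Real exports carry more attributes and item types; the word forms here
# are placeholders, not data from any language in the project.
SAMPLE_FLEXTEXT = """<document>
  <interlinear-text>
    <paragraphs>
      <paragraph>
        <phrases>
          <phrase>
            <words>
              <word>
                <item type="txt" lang="xx">wordA</item>
                <item type="gls" lang="en">dog</item>
              </word>
              <word>
                <item type="txt" lang="xx">wordB</item>
                <item type="gls" lang="en">run</item>
              </word>
            </words>
            <item type="gls" lang="en">The dog runs.</item>
          </phrase>
        </phrases>
      </paragraph>
    </paragraphs>
  </interlinear-text>
</document>"""


def item_text(parent, item_type):
    """Return the text of the first direct <item> child with the given type."""
    for item in parent.findall("item"):
        if item.get("type") == item_type:
            return (item.text or "").strip()
    return ""


def flextext_to_records(xml_string):
    """Flatten each <phrase> into a dict of words, glosses and free translation."""
    root = ET.fromstring(xml_string)
    records = []
    for phrase in root.iter("phrase"):
        words = phrase.findall("./words/word")
        records.append({
            "words": [item_text(w, "txt") for w in words],
            "glosses": [item_text(w, "gls") for w in words],
            "translation": item_text(phrase, "gls"),
        })
    return records


records = flextext_to_records(SAMPLE_FLEXTEXT)
for record in records:
    print(" ".join(record["words"]))
    print(" ".join(record["glosses"]))
    print(record["translation"])
```

Records of this kind, one per phrase with aligned word and gloss lists, are the sort of intermediate structure that a converter could then write out as tiers in a shared annotation template.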

This methodology may be of interest to community members who want to prepare already-collected language materials for display online, to researchers interested in specific questions about discourse phenomena, to typologists, and to linguists in general.

Date
4 Mar, 2021 — 7 Mar, 2021
Location
University of Hawai‘i at Mānoa

Bibliography

Arkhangelskiy (2019). Corpora of social media in minority Uralic languages. Retrieved from http://volgakama.web-corpora.net/Social_media_corpora_IWCLUL2019_final.pdf
Chanard, Hantgan & Third Author Missing (2020). ELAN Tools. CNRS-LLACAN.
Chanard (2019). ELAN-CorpA. CNRS-LLACAN. Retrieved from http://llacan.vjf.cnrs.fr/res_ELAN-CorpA.php
Hantgan (2019). La Construction d’un Corpus Comparative : Une Méthodologie Pour L’étude Des Langues Parlées en Afrique de l’Ouest. Retrieved from https://lift2019.sciencesconf.org
Chanard (2015). ELAN-CorpA: Lexicon-aided annotation in ELAN. In Mettouchi, A., Vanhove, M. & Caubet, D. (Eds.), Corpus-based Studies of Lesser-described Languages: The CorpAfroAs corpus of spoken AfroAsiatic languages (pp. 311–332). John Benjamins Publishing Company. https://doi.org/10.1075/scl.68.10cha
Nikitina, Hantgan & Chanard (2019). Reported speech annotation template for ELAN. LLACAN.
Nikitina, Perekhvalskaya, Hantgan, Voll, Aplonova & Paterson (2020). SpeechReporting Corpus: Discourse Reporting in Storytelling. LLACAN.
Rebecca Paterson
Affiliated

My research interests include field linguistics, grammatical description, and translation.

Abbie Hantgan‑Sonko
Linguist

Post-doctoral research fellow of the ERC project Speechreporting.
Ph.D. in African linguistics from Indiana University, Bloomington, Indiana.

Christian Chanard
Programmer
