SpeechTEX - Medilatin Project

This project was initiated at the request of historians who process medieval Latin documents originating from the East-Central European region. Its aim is to accelerate the digitization of historical documents from the Middle Ages through dictation based on Automatic Speech Recognition (ASR).

This page contains the material used for building and testing a Medieval Latin speech recognition system.

The following language resources may be downloaded and used for research purposes. Contributions are welcome; please contact Peter Mihajlik at mihajlik@thinktech.hu.

Text data

The text data we used to train and test the language model are in corpora.zip, in the file structure listed below. The 'raw' and 'clean' directories contain the raw and the normalized textual data, respectively.

The 'latinlibrary' and 'monasterium' parts of the training data were crawled from http://www.thelatinlibrary.com/medieval.html and http://monasterium.net/mom/HU-PBFL/archive, respectively. The former consists of literary and historical texts from the medieval era; the latter contains medieval charters originating from the region of the late Hungarian Kingdom.

The 'eval' (test) data consist of three charters, originating from the Kingdoms of Bohemia (3), Hungary (2) and Poland (1). The numbers in parentheses appear to be the charter numbers used in the filenames.

corpora/test_text/raw/eval_1.txt
corpora/test_text/raw/eval_3.txt
corpora/test_text/raw/eval.txt
corpora/test_text/raw/eval_2.txt
corpora/test_text/clean/eval_1.txt
corpora/test_text/clean/eval_3.txt
corpora/test_text/clean/eval.txt
corpora/test_text/clean/eval_2.txt
corpora/train_text/raw/monasterium_hu_PannHOSB.txt
corpora/train_text/raw/latinlibrary_medieval.txt
corpora/train_text/clean/monasterium_hu_PannHOSB.txt
corpora/train_text/clean/latinlibrary_medieval.txt
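
The raw/clean layout above can be navigated programmatically. The following is a minimal Python sketch, not part of the released material, that pairs each file under a 'raw' directory with its counterpart under 'clean'. The function names pair_raw_clean and read_text are hypothetical, and the UTF-8 default encoding is an assumption, since the encoding of the text files is not stated here.

```python
import zipfile
from pathlib import PurePosixPath


def pair_raw_clean(names):
    """Map each raw corpus path to its clean counterpart (None if missing).

    Assumes the layout corpora/<split>_text/<raw|clean>/<file>.txt,
    as shown in the file listing.
    """
    name_set = set(names)
    pairs = {}
    for name in names:
        parts = PurePosixPath(name).parts
        # Directory entries and unexpected paths do not have 4 components.
        if len(parts) == 4 and parts[2] == "raw":
            clean = str(PurePosixPath(parts[0], parts[1], "clean", parts[3]))
            pairs[name] = clean if clean in name_set else None
    return pairs


def read_text(zip_path, member, encoding="utf-8"):
    """Read one archive member as text (the encoding is an assumption)."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member).decode(encoding)
```

With a local copy of the archive, the input list could be obtained from zipfile.ZipFile("corpora.zip").namelist(), and read_text could then load any raw or clean member for comparison.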

Speech test data

The speech test recordings are in recordings.zip. The test data set was created by native speakers of Czech, Hungarian, Polish, Lithuanian and Slovak reading aloud the eval (test) texts in corpora.zip.

The first digit in each filename denotes the charter, as described in the Text data section above. The second digit is the index of the speaker within that native language, and the last two letters denote the speaker's native language.

The contents of recordings.zip:

recordings/test_wav/eval_1_1_CZ.wav
recordings/test_wav/eval_1_1_HU.wav
recordings/test_wav/eval_1_1_LT.wav
recordings/test_wav/eval_1_1_PL.wav
recordings/test_wav/eval_1_1_SK.wav
recordings/test_wav/eval_1_2_CZ.wav
recordings/test_wav/eval_1_2_LT.wav
recordings/test_wav/eval_2_1_CZ.wav
recordings/test_wav/eval_2_1_HU.wav
recordings/test_wav/eval_2_1_LT.wav
recordings/test_wav/eval_2_1_PL.wav
recordings/test_wav/eval_2_1_SK.wav
recordings/test_wav/eval_2_2_CZ.wav
recordings/test_wav/eval_2_2_LT.wav
recordings/test_wav/eval_3_1_CZ.wav
recordings/test_wav/eval_3_1_LT.wav
recordings/test_wav/eval_3_1_PL.wav
recordings/test_wav/eval_3_1_SK.wav
recordings/test_wav/eval_3_2_CZ.wav
recordings/test_wav/eval_3_2_HU.wav
recordings/test_wav/eval_3_2_LT.wav
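
The naming convention above can be decoded mechanically. The following is a minimal Python sketch that splits a recording filename into charter number, speaker index and native-language code. The names Recording, parse_recording_name and transcript_path are hypothetical, and pairing each recording with the 'clean' (rather than 'raw') text of the same charter is an assumption for illustration.

```python
import re
from typing import NamedTuple

# eval_<charter>_<speaker>_<LANG>.wav, optionally preceded by a directory path
_NAME_RE = re.compile(r"(?:^|/)eval_(\d+)_(\d+)_([A-Z]{2})\.wav$")


class Recording(NamedTuple):
    charter: int   # first digit: which charter was read
    speaker: int   # second digit: speaker index within the language
    language: str  # two-letter code of the speaker's native language


def parse_recording_name(path):
    """Parse a path such as 'recordings/test_wav/eval_3_2_HU.wav'."""
    match = _NAME_RE.search(path)
    if match is None:
        raise ValueError(f"unexpected recording name: {path!r}")
    charter, speaker, language = match.groups()
    return Recording(int(charter), int(speaker), language)


def transcript_path(rec):
    """Path of the charter text inside corpora.zip (using 'clean' is an assumption)."""
    return f"corpora/test_text/clean/eval_{rec.charter}.txt"
```

For example, parse_recording_name("recordings/test_wav/eval_3_2_HU.wav") yields charter 3, speaker 2 and language "HU", which transcript_path maps to corpora/test_text/clean/eval_3.txt.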

For details see the publications below:

Péter Mihajlik, Lili Szabó, Balázs Tarján, András Balog, Krisztina Rábai: First Results in Developing a Medieval Latin Language Charter Dictation System for the East-Central Europe Region, INTERSPEECH-2017, Stockholm, 20-24 Aug. 2017, pp. 2058-2062

Lili Szabó, Péter Mihajlik, András Balog, Tibor Fegyó: Unified Simplified Grapheme Acoustic Modeling for Medieval Latin LVCSR, TSD-2017, Prague, 27-31 Aug. 2017