This project was initiated at the request of
historians processing medieval Latin documents originating from the
East-Central European region. The aim of the project is to accelerate
the digitization of historical documents from the Middle
Ages by means of dictation based on Automatic Speech Recognition.
This page contains material used for building and testing a Medieval Latin speech recognition system.
The following language resources may be downloaded and used for research purposes. Contributions are welcome; please contact Peter Mihajlik at mihajlik@thinktech.hu.
The text data we used to train and test the language model are in corpora.zip. The 'raw' and 'clean' directories contain the raw and the normalized text, respectively; the file structure is listed below.
The 'latinlibrary' and 'monasterium' parts of the training data were crawled from http://www.thelatinlibrary.com/medieval.html and http://monasterium.net/mom/HU-PBFL/archive, respectively. The former consists of literary and historical texts from the medieval era; the latter contains medieval charters originating from the region of the late Hungarian Kingdom.
The 'eval' (test) data are three charters, originating from the Kingdoms of Bohemia (charter 3), Hungary (charter 2) and Poland (charter 1).
corpora/test_text/raw/eval_1.txt
corpora/test_text/raw/eval_3.txt
corpora/test_text/raw/eval.txt
corpora/test_text/raw/eval_2.txt
corpora/test_text/clean/eval_1.txt
corpora/test_text/clean/eval_3.txt
corpora/test_text/clean/eval.txt
corpora/test_text/clean/eval_2.txt
corpora/train_text/raw/monasterium_hu_PannHOSB.txt
corpora/train_text/raw/latinlibrary_medieval.txt
corpora/train_text/clean/monasterium_hu_PannHOSB.txt
corpora/train_text/clean/latinlibrary_medieval.txt
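Since every raw file in the listing has a normalized file of the same name under 'clean', the two can be paired by path alone. The following is a minimal sketch, not part of the released resources: the function names `clean_counterpart` and `read_pair` are hypothetical, and it assumes that the archive member names match the paths listed above and that the text files are UTF-8 encoded.

```python
import zipfile
from pathlib import PurePosixPath


def clean_counterpart(raw_path: str) -> str:
    """Map a path under .../raw/ to the same file name under .../clean/."""
    parts = list(PurePosixPath(raw_path).parts)
    # The directory directly above the file must be 'raw'.
    if len(parts) < 2 or parts[-2] != "raw":
        raise ValueError(f"not a raw corpus file: {raw_path}")
    parts[-2] = "clean"
    return str(PurePosixPath(*parts))


def read_pair(archive: str, raw_path: str, encoding: str = "utf-8"):
    """Return (raw_text, clean_text) for one corpus file inside the zip.

    Assumption: member names equal the listed paths and the encoding is UTF-8.
    """
    with zipfile.ZipFile(archive) as zf:
        raw = zf.read(raw_path).decode(encoding)
        clean = zf.read(clean_counterpart(raw_path)).decode(encoding)
    return raw, clean
```

For example, `read_pair("corpora.zip", "corpora/test_text/raw/eval_1.txt")` would return the raw and normalized text of the first charter, provided the assumptions above hold.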
The test speech recordings are in recordings.zip. The test set was created by native speakers of Czech, Hungarian, Polish, Lithuanian and Slovak reading aloud the eval (test) data in corpora.zip.
The first digit in each filename denotes the charter, as described in the TEXT DATA section above. The second digit is the speaker's number among the speakers of the same native language, and the last two letters indicate that native language.
Contents of recordings.zip:
recordings/test_wav/eval_1_1_CZ.wav
recordings/test_wav/eval_1_1_HU.wav
recordings/test_wav/eval_1_1_LT.wav
recordings/test_wav/eval_1_1_PL.wav
recordings/test_wav/eval_1_1_SK.wav
recordings/test_wav/eval_1_2_CZ.wav
recordings/test_wav/eval_1_2_LT.wav
recordings/test_wav/eval_2_1_CZ.wav
recordings/test_wav/eval_2_1_HU.wav
recordings/test_wav/eval_2_1_LT.wav
recordings/test_wav/eval_2_1_PL.wav
recordings/test_wav/eval_2_1_SK.wav
recordings/test_wav/eval_2_2_CZ.wav
recordings/test_wav/eval_2_2_LT.wav
recordings/test_wav/eval_3_1_CZ.wav
recordings/test_wav/eval_3_1_LT.wav
recordings/test_wav/eval_3_1_PL.wav
recordings/test_wav/eval_3_1_SK.wav
recordings/test_wav/eval_3_2_CZ.wav
recordings/test_wav/eval_3_2_HU.wav
recordings/test_wav/eval_3_2_LT.wav
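The naming convention described above can be decoded mechanically. The sketch below is illustrative only: the names `Recording`, `parse_recording_name` and `reference_text_path` are hypothetical, and pairing each recording with the normalized text `corpora/test_text/clean/eval_<charter>.txt` is an assumption based on the shared charter number, not something stated by the resource itself.

```python
import re
from typing import NamedTuple


class Recording(NamedTuple):
    charter: int   # first digit: which eval charter was read
    speaker: int   # second digit: speaker number within the language
    language: str  # two-letter code of the speaker's native language


# Matches names such as eval_3_2_HU.wav, with or without a directory prefix.
_NAME = re.compile(r"eval_(\d)_(\d)_([A-Z]{2})\.wav$")


def parse_recording_name(path: str) -> Recording:
    """Split a recording filename into charter, speaker and language."""
    match = _NAME.search(path)
    if match is None:
        raise ValueError(f"unexpected recording name: {path}")
    return Recording(int(match.group(1)), int(match.group(2)), match.group(3))


def reference_text_path(rec: Recording) -> str:
    """Assumed location of the text that was read for this recording."""
    return f"corpora/test_text/clean/eval_{rec.charter}.txt"
```

Grouping the parsed records by `language` or by `charter` then makes it straightforward to report recognition results per native-language group or per charter.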
Péter Mihajlik, Lili Szabó, Balázs Tarján, András Balog, Krisztina
Rábai: First Results in Developing a Medieval Latin Language Charter
Dictation System for the East-Central Europe Region, INTERSPEECH-2017,
Stockholm, 20-24 Aug. 2017, pp. 2058-2062
Lili Szabó, Péter Mihajlik, András Balog, Tibor Fegyó: Unified
Simplified Grapheme Acoustic Modeling for Medieval Latin LVCSR,
TSD-2017, Prague, 27-31 Aug. 2017