Data Subset Selection Based on Acoustic and Phonetic Similarity for Low-Resource Language Speech Recognition
DATA SELECTION BASED ON COMBINATION OF ACOUSTIC AND PHONETIC SIMILARITIES FOR LOW-RESOURCE SPEECH RECOGNITION

Abstract

Automatic Speech Recognition (ASR) has improved significantly thanks to pre-trained models, which are first trained on large-scale datasets and then fine-tuned for specific target languages. However, their performance tends to degrade on low-resource languages because little training data is available. In this study, we explore using non-target-language data to improve ASR for a low-resource target language, and we propose an effective combination of Spoken Language Identification (SLI) models to measure how similar each speech utterance is to the target language. Specifically, we combine SLI models based on acoustic and phonetic similarities. Experiments on Indic and several European languages from the Common Voice dataset show that phonetic similarity based on International Phonetic Alphabet (IPA) tokens achieves performance comparable to the conventional acoustic-similarity method in both SLI and ASR. Moreover, combining acoustic and phonetic similarities further reduces the character error rate.
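The selection idea described above can be illustrated with a minimal sketch. It assumes each non-target utterance already carries two similarity scores to the target language, one from an acoustic SLI model and one from an IPA-token SLI model, and that the two are combined by simple linear interpolation before keeping the top-ranked utterances. The abstract does not specify the combination rule, so the interpolation, the `weight` parameter, and all names (`Utterance`, `combined_score`, `select_subset`) are hypothetical, and the scores are made-up illustration values rather than results.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    utt_id: str
    # Hypothetical: target-language score from an acoustic SLI model.
    acoustic_score: float
    # Hypothetical: target-language score from an IPA-token (phonetic) SLI model.
    phonetic_score: float


def combined_score(utt: Utterance, weight: float = 0.5) -> float:
    """Linearly interpolate the two similarity scores (an assumed rule)."""
    return weight * utt.acoustic_score + (1.0 - weight) * utt.phonetic_score


def select_subset(utterances: list[Utterance], k: int, weight: float = 0.5) -> list[Utterance]:
    """Keep the k utterances with the highest combined similarity."""
    ranked = sorted(utterances, key=lambda u: combined_score(u, weight), reverse=True)
    return ranked[:k]


# Toy pool of non-target-language utterances with invented scores.
pool = [
    Utterance("utt_a", 0.90, 0.20),
    Utterance("utt_b", 0.60, 0.70),
    Utterance("utt_c", 0.10, 0.30),
    Utterance("utt_d", 0.40, 0.95),
]

selected_ids = [u.utt_id for u in select_subset(pool, k=2)]
print(selected_ids)
```

Setting `weight=1.0` reduces this sketch to selection by acoustic similarity alone, and `weight=0.0` to phonetic similarity alone, which makes it easy to compare the single-score and combined settings on the same pool.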

Researchers

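Name | Course | Laboratory | Position/Year
Jianan Chen | Intelligence Science and Technology Course | Speech Media Laboratory | 3rd-year doctoral student
Chenhui Chu | Intelligence Science and Technology Course | Language Media Laboratory | Associate Professor
Sheng Li | Other department/university | Institute of Science Tokyo | Assistant Professor
Kawahara Tatsuya | Intelligence Science and Technology Course | Speech Media Laboratory | Professor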