Automatic Speech Recognition (ASR) has improved significantly thanks to pre-trained models, which are first trained on large-scale datasets and then fine-tuned for specific target languages. However, their performance tends to degrade on low-resource languages because training data is limited. In this study, we explore using non-target-language data to improve ASR for a low-resource target language, and we propose an effective combination of Spoken Language Identification (SLI) models to measure how similar each speech utterance is to the target language. Specifically, we combine SLI models based on acoustic and phonetic similarities. Experiments on Indic and several European languages from the Common Voice dataset show that phonetic similarity based on International Phonetic Alphabet (IPA) tokens achieves performance comparable to the conventional acoustic-similarity method in both SLI and ASR. Moreover, combining acoustic and phonetic similarities further reduces the character error rate.
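
As a rough illustration of the idea of combining two similarity measures to pick non-target-language utterances, the sketch below ranks a pool of utterances by a weighted sum of an acoustic score and a phonetic score. The abstract does not state how the two SLI models are combined, so the linear interpolation, the `weight` parameter, the `Utterance` fields, and the function names are all assumptions made for illustration, and the scores are made-up numbers rather than experimental results.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """A non-target-language utterance with two hypothetical SLI scores."""
    utt_id: str
    acoustic_score: float  # assumed: target-language score from an acoustic SLI model
    phonetic_score: float  # assumed: target-language score from an IPA-token SLI model


def combined_similarity(utt: Utterance, weight: float = 0.5) -> float:
    """Linear interpolation of the two scores (an assumed combination rule)."""
    return weight * utt.acoustic_score + (1.0 - weight) * utt.phonetic_score


def select_similar(pool: list[Utterance], top_k: int, weight: float = 0.5) -> list[Utterance]:
    """Return the top_k utterances ranked by combined similarity, highest first."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    ranked = sorted(pool, key=lambda u: combined_similarity(u, weight), reverse=True)
    return ranked[:top_k]


# Illustrative scores only; these are not taken from the paper.
pool = [
    Utterance("a", acoustic_score=0.9, phonetic_score=0.2),
    Utterance("b", acoustic_score=0.6, phonetic_score=0.7),
    Utterance("c", acoustic_score=0.1, phonetic_score=0.95),
    Utterance("d", acoustic_score=0.3, phonetic_score=0.3),
]

selected = select_similar(pool, top_k=2, weight=0.5)
print([u.utt_id for u in selected])
```

Setting `weight=1.0` reduces this to selection by acoustic similarity alone and `weight=0.0` to phonetic similarity alone, which makes it easy to compare the single-score baselines against the combination in this toy setting.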

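| Name | Course | Laboratory | Position / Year |
|---|---|---|---|
| Jianan Chen | Intelligence Science and Technology Course | Speech Media Laboratory | 3rd-year doctoral student |
| Chenhui Chu | Intelligence Science and Technology Course | Language Media Laboratory | Associate Professor |
| Sheng Li | Other department or university | Institute of Science Tokyo | Assistant Professor |
| Kawahara Tatsuya | Intelligence Science and Technology Course | Speech Media Laboratory | Professor |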