A Model for Albanian Speech Recognition Using End-to-End Deep Learning Techniques
Keywords: Deep learning, Albanian language, End-to-end ASR, Speech recognition, Corpus
An end-to-end automatic speech recognition (ASR) system folds the acoustic model (AM), language model (LM), and pronunciation model (PM) into a single neural network. Jointly optimizing these components improves the performance of the model. In this paper, we introduce a model for Albanian speech recognition (SR) using end-to-end deep learning techniques. The model is built from two main modules: residual convolutional neural networks (ResCNN), which learn the relevant audio features, and bidirectional recurrent neural networks (BiRNN), which leverage the features learned by the ResCNN. To train and evaluate the model, we have built a Corpus for Albanian Speech Recognition (CASR), which contains 100 hours of audio data with the corresponding transcripts. In designing the corpus, we took into account speaker attributes such as age, gender, accent, speed of utterance, and dialect, so that the corpus is as heterogeneous as possible. The model is evaluated using the word error rate (WER) and character error rate (CER) metrics. It achieves a WER of 5% and a CER of 1%.
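For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and CER are conventionally computed; it is not the evaluation code used in this work. Both are the Levenshtein (edit) distance between the reference transcript and the recognized hypothesis, divided by the reference length, counted in words for WER and in characters for CER. The function names and the example sentence are illustrative assumptions, and no text normalization (casing, punctuation) is applied here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (minimum number of substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair, for illustration only.
    reference = "mirë se vini"
    hypothesis = "mirë se vin"
    print(f"WER = {wer(reference, hypothesis):.3f}")
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Because a single wrong character makes the whole word count as an error, WER is typically higher than CER for the same output, which is consistent with the 5% WER and 1% CER reported above.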
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.