Development of a speech translation system for two low-resource languages via an intermediate language

Abstract. Over the last decade, deep learning approaches and the wider availability of computing resources have brought tremendous progress in speech and language processing. These advances make it possible to address NLP tasks in new ways. One such task is speech translation: converting spoken audio in one language into text in another. For years the task was handled by the classical approach, known as "cascade" speech translation, in which the utterance is first transcribed by an automatic speech recognition (ASR) module and the transcript is then translated by a machine translation system. In recent years, however, a new approach has been proposed in which the speech is not transcribed but translated directly into text in a language different from that of the utterance.

The aim of the project: The goal of the project is to develop methods and algorithms for speech translation between two low-resource languages, Kazakh and Tatar, based on pre-trained models and an intermediate language. Recent work increasingly uses a single neural network to translate an input audio signal in one language into text in another. The goal is to be achieved using cascade solutions or end-to-end approaches. In particular, we are interested in comparing the performance of the cascade approach with that of the direct approach. A speech synthesis model for the target language can be added to either method.
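The difference between the two approaches can be illustrated with a minimal sketch. It is not the project's implementation: the class names, the callable signatures and the stub functions are all hypothetical, and the stubs return tagged placeholder strings rather than real Kazakh or Tatar text. The cascade chains a recognizer and two translators through the intermediate language, while the direct variant wraps a single model that maps audio to target-language text.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types: audio is a list of samples, text is a plain string.
Audio = List[float]
Recognizer = Callable[[Audio], str]   # speech -> source-language text
Translator = Callable[[str], str]     # text -> text in another language
DirectModel = Callable[[Audio], str]  # speech -> target-language text


@dataclass
class CascadeSpeechTranslator:
    """ASR followed by two MT steps through an intermediate language."""
    recognize: Recognizer
    source_to_pivot: Translator
    pivot_to_target: Translator

    def translate(self, audio: Audio) -> str:
        transcript = self.recognize(audio)             # Kazakh speech -> Kazakh text
        pivot_text = self.source_to_pivot(transcript)  # Kazakh -> intermediate language
        return self.pivot_to_target(pivot_text)        # intermediate language -> Tatar


@dataclass
class DirectSpeechTranslator:
    """A single model that maps audio straight to target-language text."""
    model: DirectModel

    def translate(self, audio: Audio) -> str:
        return self.model(audio)


# Stub components (assumptions for illustration only); real systems would
# plug trained models in here.
def stub_asr(audio: Audio) -> str:
    return "<kk> text"

def stub_kk_to_pivot(text: str) -> str:
    return text.replace("<kk>", "<pivot>")

def stub_pivot_to_tt(text: str) -> str:
    return text.replace("<pivot>", "<tt>")

def stub_direct(audio: Audio) -> str:
    return "<tt> text"


cascade = CascadeSpeechTranslator(stub_asr, stub_kk_to_pivot, stub_pivot_to_tt)
direct = DirectSpeechTranslator(stub_direct)

silence = [0.0] * 16000  # one second of placeholder audio at an assumed 16 kHz
cascade_output = cascade.translate(silence)
direct_output = direct.translate(silence)
print(cascade_output, "|", direct_output)
```

Because both classes expose the same `translate` method, the same test audio can be passed to either one, which is what a comparison of the cascade and direct approaches requires. A speech synthesis step for the target language could be appended to either output in the same way.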

Project objectives: To achieve this goal, the following tasks will be performed within the framework of the project:
1. Data collection and processing
1.1 Collection and processing of audio data for the automatic speech recognition and speech synthesis systems
1.2 Collection and processing of text data for the machine translation system
2. Development of pre-trained models
2.1 Development of pre-trained models for audio data
2.2 Development of pre-trained models for text data
3. Development of machine translation systems
3.1 Development of a machine translation system from Kazakh into an intermediate language
3.2 Development of a machine translation system from an intermediate language into Tatar
4. Development of a cascade speech translation system
5. Development of a system for end-to-end speech translation with pre-trained models
5.1 Design of the end-to-end speech translation model
5.2 Training and testing of the end-to-end speech translation system with pre-trained models
6. Development of software and web services
6.1 Development of a software module for a cascade speech translation system
6.2 Development of a software module for a system of end-to-end speech translation
6.3 Software integration and web service development

The main results of the project will be:

audio corpora of the Kazakh language;
parallel text corpora for the Kazakh and Russian languages;
pre-trained and fine-tuned models for Kazakh speech recognition, based on self-supervised learning with gradual specialization of the training data;
pre-trained and fine-tuned machine translation models for the Kazakh and Russian languages as well as for the Tatar and Russian languages;
software and a demonstration web service for the Kazakh and Tatar speech translation system, implementing the developed methods and algorithms;
publications in peer-reviewed journals with high impact factor and in the proceedings of top conferences.

The theoretical significance of this research is that its results can be used to develop speech translation systems for related languages (Uzbek, Kyrgyz) and other low-resource languages. In addition, with proper design, the proposed model can be used to create speech-to-speech systems built on a single neural network. The practical significance of the project is that it will enable domestic software developers to use the developed tools to build open and commercial speech translation programs for the Kazakh language in other domains (mobile phones, contact centers, etc.). In terms of socio-economic impact on Kazakhstan, the project is expected to contribute to the development of the Kazakh language, one of the key areas of the Kazakhstan-2050 Strategy, and to promote research in speech and natural language processing, as well as artificial intelligence, both in Kazakhstan and worldwide.