The development of unsupervised speech recognition methods and algorithms based on unaligned audio and text data.

IRN AP 08053085

Abstract

This project aims to develop unsupervised methods and algorithms for speech recognition based on unaligned audio and text data. The work rests on the assumption that the frequencies and contextual relationships of phonemes are similar in the audio and text domains of the same language. This similarity should allow a mapping between the acoustic and text spaces to be constructed that takes into account the structure of these spaces together with the concatenation operation on sequences. Generative methods based on variational autoencoders are chosen as the mathematical apparatus. The developed methods and algorithms will be evaluated on the task of automatic transcription of news in Kazakh and English. The data for the experiments will be collected as part of the project.

The goal of the project is to develop unsupervised methods and algorithms for speech recognition based on unaligned audio and text data.

Tasks

1) Data collection and processing. This task involves the automatic collection (crawling) of audio and text data in the Kazakh and English languages from various Internet and media resources.

2) Acoustic modeling. This task involves the development of unsupervised deep learning methods and algorithms for mapping audio data into a latent space.

3) Language modeling. This task involves the development of unsupervised deep learning methods and algorithms for mapping text data into a latent space.

4) Mapping of spaces. This task involves the development of unsupervised deep learning methods and algorithms for a homomorphic mapping between the acoustic and text latent spaces, which will take into account the structure of these spaces together with the concatenation operation on sequences (an illustrative sketch of this idea is given after this list).

5) Conducting final experiments. In this task, large-scale computational experiments will be conducted based on the collected data (Task 1) and the developed methods and algorithms (Tasks 2-4). Here, non-annotated and unaligned audio and text data of arbitrary length will be used.

6) Software and web service development. Based on the developed methods and algorithms, software and a demo web service will be developed. The web service will also contain the collected audio and text data. Python will be used as the programming language.
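As an illustration of the mapping idea in Task 4, below is a minimal, purely hypothetical sketch in Python (the project's stated implementation language): a linear map from the acoustic latent space to the text latent space is trained so that mapping the latent code of a concatenated utterance agrees with combining the mapped codes of its parts. The combine operator, the dimensions, and the loss are assumptions made for illustration, not the project's actual method.

```python
# Illustrative sketch only: a concatenation-aware ("homomorphic") mapping
# between latent spaces. All dimensions and operators are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 128
M = nn.Linear(latent_dim, latent_dim, bias=False)  # acoustic -> text latent map
# A learned stand-in for "concatenation" inside the text latent space.
combine = nn.Sequential(nn.Linear(2 * latent_dim, latent_dim), nn.Tanh())
opt = torch.optim.Adam(list(M.parameters()) + list(combine.parameters()), lr=1e-3)

def homomorphism_loss(z_u, z_v, z_uv):
    # z_u, z_v: acoustic latent codes of two segments;
    # z_uv: acoustic latent code of their concatenation.
    lhs = M(z_uv)                                       # map the whole utterance
    rhs = combine(torch.cat([M(z_u), M(z_v)], dim=-1))  # combine the mapped parts
    return F.mse_loss(lhs, rhs)

# One illustrative update with random stand-ins for real latent codes.
z_u, z_v, z_uv = (torch.randn(32, latent_dim) for _ in range(3))
loss = homomorphism_loss(z_u, z_v, z_uv)
opt.zero_grad()
loss.backward()
opt.step()
```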

The results of the project will be: (1) audio and text corpora of the Kazakh and English languages; (2) unsupervised methods and algorithms for speech recognition based on unaligned audio and text data; (3) software and a demonstration web service of the Kazakh and English speech recognition systems that implement the developed methods and algorithms; (4) publications in peer-reviewed journals with a high impact factor and in the proceedings of top conferences.

Data collection, acoustic and language modeling of sentences.
In this task, audio and text data in Kazakh and English were automatically collected (crawled) from various Internet and media resources. In particular, at least 1000 hours of audio data and at least 1 billion words of text data were collected for each language.
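For illustration, below is a minimal sketch of how a page-level crawler for such resources could look in Python, assuming the requests and BeautifulSoup libraries; the audio file extensions and page structure are assumptions, not a description of the actual collection pipeline.

```python
import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

AUDIO_EXTENSIONS = (".mp3", ".wav", ".ogg")  # assumed formats for illustration

def crawl_page(url, out_dir="downloads"):
    """Collect article text and links to audio files from one page."""
    os.makedirs(out_dir, exist_ok=True)
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # Plain text of the page body (to be cleaned and normalized later).
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

    # Absolute URLs of linked audio files.
    audio_links = [urljoin(url, a["href"])
                   for a in soup.find_all("a", href=True)
                   if a["href"].lower().endswith(AUDIO_EXTENSIONS)]
    return text, audio_links
```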

For the acoustic modeling of sentences, we trained a neural network based on a variational autoencoder. The model consists of five convolutional layers whose input is a sequence of vectors encoding the short-time Fourier transform (STFT) spectrogram extracted from the speech signal.
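A minimal sketch of such a convolutional variational autoencoder over STFT features, assuming a PyTorch implementation; the layer widths, kernel sizes, and the simplified reconstruction target are illustrative, not the project's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticVAE(nn.Module):
    def __init__(self, n_freq=257, latent_dim=128):
        super().__init__()
        # Five 1-D convolutional layers over the time axis of the STFT spectrogram
        # (channel sizes are assumptions for illustration).
        channels = [n_freq, 256, 256, 128, 128, 64]
        self.encoder = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(channels[i], channels[i + 1],
                                    kernel_size=5, stride=2, padding=2),
                          nn.ReLU())
            for i in range(5)
        ])
        self.to_mu = nn.Linear(64, latent_dim)
        self.to_logvar = nn.Linear(64, latent_dim)
        # Simplified decoder predicting a single spectrum vector (illustrative).
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, n_freq))

    def forward(self, spec):                 # spec: (batch, n_freq, time)
        h = self.encoder(spec).mean(dim=-1)  # pool over time -> (batch, 64)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

def vae_loss(recon, target, mu, logvar):
    # target: e.g. the time-averaged spectrum, (batch, n_freq); illustration only.
    rec = F.mse_loss(recon, target)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld
```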

For the language modeling of sentences, we used the same variational autoencoder architecture, but the network input consisted of vector embeddings produced by the BERT transformer. Unlike word2vec or fastText, BERT encodes entire sentences, which generally has a positive effect on the quality of the final neural network.
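A sketch of how BERT sentence embeddings could feed the same type of VAE, assuming a PyTorch implementation and the Hugging Face transformers library; the multilingual checkpoint name and the mean-pooling choice are assumptions, not necessarily what was used.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; any BERT model covering Kazakh and English would do.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased")

class TextVAE(nn.Module):
    def __init__(self, emb_dim=768, latent_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(emb_dim, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, latent_dim)
        self.to_logvar = nn.Linear(512, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(),
                                 nn.Linear(512, emb_dim))

    def forward(self, emb):
        h = self.enc(emb)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

@torch.no_grad()
def embed_sentences(sentences):
    # Mean-pooled BERT token embeddings as fixed sentence vectors.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = bert(**batch).last_hidden_state          # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)       # (batch, 768)

emb = embed_sentences(["Бүгін ауа райы жақсы.", "The weather is fine today."])
recon, mu, logvar = TextVAE()(emb)
```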

Mapping between spaces of individual words.
Based on the vector representations of words in the acoustic and text spaces collected in the previous year, we carried out preliminary work on their analysis and visualization using persistent homology methods. The results showed that the topological structure of the two spaces is quite similar, which supports the hypothesis of the similarity of the audio and text spaces and indicates that the research can be developed further. In particular, future experiments can compare the persistence diagrams of the two spaces directly: closeness of the diagrams in the sense of the Riemannian metric or the Wasserstein metric can also shed light on the topological similarity of both spaces.
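A hedged sketch of this kind of topological comparison, assuming the ripser and persim Python packages (the report does not state which tools were used); the random arrays below stand in for the actual acoustic and text word vectors.

```python
import numpy as np
from ripser import ripser
from persim import wasserstein, bottleneck

rng = np.random.default_rng(0)
acoustic_vecs = rng.normal(size=(200, 128))  # placeholder for acoustic word vectors
text_vecs = rng.normal(size=(200, 128))      # placeholder for text word vectors

# Persistence diagrams for H0 and H1 (the first two Betti numbers).
dgms_audio = ripser(acoustic_vecs, maxdim=1)["dgms"]
dgms_text = ripser(text_vecs, maxdim=1)["dgms"]

for dim in (0, 1):
    # Drop infinite bars before computing distances between diagrams.
    a = dgms_audio[dim][np.isfinite(dgms_audio[dim]).all(axis=1)]
    t = dgms_text[dim][np.isfinite(dgms_text[dim]).all(axis=1)]
    print(f"H{dim}: Wasserstein = {wasserstein(a, t):.3f}, "
          f"bottleneck = {bottleneck(a, t):.3f}")
```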

Development of a software module for visualizing the acoustic and text spaces.
Next, we developed a software module for visualizing the acoustic and text spaces as a web application using HTML/CSS/JavaScript technologies. The module loads audio and text data as sets of vectors and visualizes them; individual vectors can be selectively included or excluded, arbitrary dimensions can be chosen for display, and various vector transformations, such as PCA, are implemented. The functionality for visualizing persistence diagrams and histograms of the first two Betti numbers has also been implemented.
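A small sketch of the server-side preprocessing such a module could rely on: projecting the latent vectors with PCA and exporting them as JSON for the HTML/CSS/JavaScript front end. The file names and the JSON layout are hypothetical.

```python
import json
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical inputs: latent vectors and one label per vector.
vectors = np.load("latent_vectors.npy")          # shape (n_items, latent_dim)
with open("labels.json", encoding="utf-8") as f:
    labels = json.load(f)

# Project to 2-D for display in the web module.
proj = PCA(n_components=2).fit_transform(vectors)
payload = [{"label": lab, "x": float(x), "y": float(y)}
           for lab, (x, y) in zip(labels, proj)]

with open("projection.json", "w", encoding="utf-8") as f:
    json.dump(payload, f, ensure_ascii=False)
```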

Project team

  1. Zhandos Yessenbayev, PhD, senior researcher, PE “National Laboratory Astana”, ORCID: 0000-0002-6322-3848
    https://research.nu.edu.kz/en/persons/zhandos-yessenbayev
    https://scholar.google.com/citations?hl=en&user=oZlOmsAAAAAJ&view_op=list_works&sortby=pubdate
  2. Zhanibek Kozhirbayev, PhD, senior researcher, PE “National Laboratory Astana”, ORCID: 0000-0003-4235-9049
    https://research.nu.edu.kz/en/persons/zhanibek-kozhirbayev
    https://scholar.google.com/citations?user=qkucYS0AAAAJ&hl=en

Publications

  1. Kozhirbayev Zh., Yessenbayev Zh. Kazakh Text Normalization using Machine Translation Approaches. CEUR Workshop Proceedings, vol. 2780, pp. 115-122. CEUR-WS, 2020.
    URL: http://ceur-ws.org/Vol-2780/paper10.pdf
  2. Kozhirbayev Zh., Yessenbayev Zh. Named entity recognition for the Kazakh language. Journal of Mathematics, Mechanics and Computer Science, vol. 107, no. 3, pp. 57-66, 2020. ISSN 2617-4871.
    URL: https://doi.org/10.26577/JMMCS.2020.v107.i3.06
  3. Yessenbayev Zh., Kozhirbayev Zh., Makazhanov A. KazNLP: A Pipeline for Automated Processing of Texts Written in Kazakh Language. In: Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2020. Lecture Notes in Computer Science, vol. 12335, pp. 657-666. Springer, Cham, 2020.
    URL: https://doi.org/10.1007/978-3-030-60276-5_63