Transcription of Audio Recordings into Text (Přepis audiozáznamů do textové podoby)

Abstract

This master's thesis examines methods for transcribing audio recordings into text, with particular emphasis on transcription accuracy. It summarizes the principles of automatic speech recognition, from traditional approaches based on Hidden Markov Models and Gaussian Mixture Models to modern methods using deep neural networks and end-to-end architectures. Special attention is given to the Whisper model, which was implemented and experimentally evaluated. The experiments covered data processing techniques, model modifications, and adjustments to training parameters. The results show that fine-tuning the model, combined with audio augmentation and the addition of dense or adapter layers, significantly improves transcription accuracy as measured by word error rate (WER) and character error rate (CER). The thesis contributes a practical implementation of an efficient Czech speech transcription system and an analysis of how the individual experimental methods affect transcription quality.

Subject(s)

automatic speech recognition, audio-to-text transcription, Whisper model, deep learning, WER and CER metrics, audio augmentation, neural networks, Czech language

Citation