Speech recognition
ocotillo is nice.
Notes
Links
- HN: Facebook open-sources a speech-recognition system and a machine learning library (2018)
- DeepSpeech - Open source Speech-To-Text engine, using a model trained by machine learning techniques, based on Baidu's Deep Speech research paper. (Examples)
- Online speech recognition with wav2letter@anywhere (2020)
- wav2letter++ - Fast, open source speech processing toolkit from the Speech team at Facebook AI Research built to facilitate research in end-to-end models for speech recognition.
- Kaldi - Speech Recognition Toolkit.
- Building an end-to-end Speech Recognition model in PyTorch (HN)
- Real-Time Voice Cloning - Clone a voice in 5 seconds to generate arbitrary speech in real-time.
- Kaldi Active Grammar - Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time.
- SpecAugment with PyTorch - PyTorch Implementation of GoogleBrain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.
- Dragonfly - Speech recognition framework for Python that makes it convenient to create custom commands to use with speech recognition software.
- Gentle - Robust yet lenient forced-aligner built on Kaldi. A tool for aligning speech with text.
- Porcupine - On-device wake word detection powered by deep learning.
- Eesen - End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding.
- Ask HN: Is there any work being done in speech-to-code with deep learning? (2020)
- Silero Models - Pre-trained STT models and benchmarks made embarrassingly simple. (HN)
- High-quality pre-trained speech-to-text models now available on Torch Hub (HN)
- Wavenet For Speech Denoising - Neural network for end-to-end speech denoising, as described in: "A Wavenet For Speech Denoising".
- Vosk - Speech recognition toolkit with state-of-the-art accuracy and low latency in Rust.
- Voicegain - Speech-to-text Platform and APIs. Speech Recognition.
- LibreASR - On-Premises, Streaming Speech Recognition System. (HN)
- WORLD - High-quality speech analysis, manipulation and synthesis system. (Web)
- ESPnet - End-to-end speech processing toolkit. (Docs)
- Speaker Diarization - Process to answer the question of 'who spoke when?' in an audio file.
- SpeechRecognition - Local automatic speech recognition project based on Kaldi and ALSA.
- Athena - Open-source implementation of sequence-to-sequence based speech processing engine.
- PyTorch end-to-end speech recognition
- Cheetah - On-device streaming speech-to-text engine powered by deep learning.
- WaveRNN - PyTorch implementation of Deepmind's WaveRNN model from Efficient Neural Audio Synthesis.
- Conformer - PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition.
- A Review of End-to-End Architectures for Speech Recognition (2021)
- libfvad - Voice activity detection (VAD) library, based on WebRTC's VAD engine.
- ASR with PyTorch - Experimental code for speech recognition using PyTorch and Kaldi.
- YSDA Speech Processing Course
- Paper List for Speech Translation
- Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition (2020) (Code)
- Lyra: A New Very Low-Bitrate Codec for Speech Compression (2021)
- Parrot.PY - Computer interaction using audio and speech recognition.
- SpeechBrain Toolkit - PyTorch-based Speech Toolkit. (Web)
- Vosk API - Offline open source speech recognition toolkit. (Rust API)
- Lyra - Very Low-Bitrate Codec for Speech Compression.
- lasr - PyTorch Lightning implementation of Automatic Speech Recognition.
- Speech Recognition from Scratch
- Common Voice - Mozilla's initiative to help teach machines how real people speak.
- FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement (2021) (Code)
- DeepSpeech2 in PyTorch using PyTorch Lightning
- Speech and Language Processing Book (2021) - Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (2020 Version)
- voice2json - Command-line tools for speech and intent recognition on Linux. (Web)
- wav2vec Unsupervised: Speech recognition without supervision (2021)
- Online Speech recognition using RNN-Transducer
- Openspeech - Open-Source Toolkit for End-to-End Speech Recognition.
- Unsupervised Speech Decomposition via Triple Information Bottleneck (2020) (Code)
- AudioCLIP: Extending CLIP to Image, Text and Audio (2021) (Code)
- Wav2vec: Semi and Unsupervised Speech Recognition (HN)
- WeNet - Production First and Production Ready End-to-End Speech Recognition Toolkit. (Docs)
- Why Hasn’t the iPhone Moment Happened Yet for Voice UIs? (2021)
- LeBenchmark: a reproducible framework for assessing self-supervised learning (SSL) from speech
- INTERSPEECH 2021
- WER are we? - Tracking states of the art(s) and recent results on speech recognition.
- GigaSpeech - Large, modern dataset for speech recognition.
- Coqui STT - Deep learning toolkit for Speech-to-Text, battle-tested in research and production. (Docs) (Rust lib)
- Coqui - Startup providing open speech tech for everyone. (GitHub)
- Open Speech Corpora - List of accessible speech corpora for ASR, TTS, and other Speech Technologies.
- An Overview of Multi-Task Learning in Speech Recognition (2020)
- Coqui Inference Engine - Library for efficiently deploying speech models.
- PDF to Speech - Deep-learning powered accessibility application which turns PDFs into audio files.
- ASV-Subtools - Open Source Tools for Speaker Recognition.
- VoiceFixer - General Speech Restoration.
- speechmetrics - Wrapper around speech quality metrics MOSNet, BSSEval, STOI, PESQ, SRMR, SISDR.
- Silero VAD - Pre-trained enterprise-grade Voice Activity Detector, Language Classifier and Spoken Number Detector.
- A New AI Lexicon: Voice (2021) - The Legacies and Limits of Automated Voice Analysis.
- Octopus - On-device speech-to-index engine powered by deep learning.
- Open Audio Search - Full text search engine with automatic speech recognition for podcasts.
- HuBERT: How to Apply BERT to Speech, Visually Explained (2021)
- Happy Scribe - Audio Transcription & Video Subtitles.
- Speech Recognition Papers
- Steerable discovery of neural audio effects (2021) (Code)
- audapolis - Editor for spoken-word media with transcription.
- Shennong - Python toolbox for speech features extraction.
- Paderbox - Collection of utilities for audio / speech processing.
- Icefall - Speech recognition recipes using k2. (Docs)
- k2 - FSA/FST algorithms, differentiable, with PyTorch compatibility.
- ViSQOL (Virtual Speech Quality Objective Listener) - Objective, full-reference metric for perceived audio quality.
- Espresso - Fast End-to-End Neural Speech Recognition Toolkit.
- UniSpeech - Large Scale Self-Supervised Learning for Speech
- NISQA: Speech Quality and Naturalness Assessment
- Optimization techniques proposed in Improving RNN Transducer Modeling for End-to-End Speech Recognition
- Conformer: Convolution-augmented Transformer for Speech Recognition (2020) (Code)
- CAT: CRF-based ASR Toolkit - Complete workflow for CRF-based data-efficient end-to-end speech recognition.
- Neural HMMs are all you need (for high-quality attention-free TTS) (2022) (Code)
- End-to-End Speech Translation Progress - Tracking the progress in end-to-end speech translation.
- EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020) (Code)
- S3PRL - Self-Supervised Speech Pre-training and Representation Learning Toolkit.
- pyannote-audio - Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding.
- DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (2021) (Code)
- Speech recognition polyfill - Polyfill for the SpeechRecognition standard on the web, using Speechly as the underlying API.
- Speech-to-Text Benchmark
- Hyperion - Speaker Recognition Toolkit based on PyTorch and numpy.
- textlesslib - Library for Textless Spoken Language Processing.
- FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech (2021) (Code)
- HuggingSound - Toolkit for speech-related tasks based on HuggingFace's tools.
- hear - macOS speech recognition via the command line.
- PaddleSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.
- BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation (2021) (Code)
- Edinburgh Speech Tools
- rVADfast - Python library for an unsupervised, fast method for robust voice activity detection.
- NeuralSpeech - Research project in Microsoft Research Asia focusing on neural network based speech processing, including automatic speech recognition (ASR), text to speech (TTS), etc.
- Speech Super-resolution Evaluation and Benchmarking
- Real Time Speech Recognition with Gradio (HN)
- Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques (2021) (Code)
- CoVoST: A Large-Scale Multilingual Speech-To-Text Translation Corpus
- Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction (2022) (Code)
- Real Time Speech Enhancement in the Waveform Domain (2020) (Code)
- Vosk-Browser - Opinionated speech recognition library for the browser using a WebAssembly build of Vosk.
- VocalSound: A Dataset for Improving Human Vocal Sounds Recognition
- PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
- NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality (2022) (Code) (HN)
- George Hotz | Programming | speech recognition (2022)
- CoquiSTT + Signal = Love (death to voice messages) (2022)
- ocotillo - PyTorch-based ML model that does state-of-the-art English speech transcription.
- SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing (2021) (Code)
- pyctcdecode - Fast and lightweight python-based CTC beam search decoder for speech recognition.
- Avocodo: Generative Adversarial Network for Artifact-free Vocoder (2022) (Code)
- Squeezeformer - PyTorch implementation of "Squeezeformer: An Efficient Transformer for Automatic Speech Recognition".
- Masked Autoencoders that Listen (2022) (Code)
- SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech (2022) (Code)
- Speech Enhancement and Dereverberation with Diffusion-based Generative Models
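Trackers such as WER are we? and the Speech-to-Text Benchmark rank systems by word error rate (WER): the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and a system hypothesis, divided by the number of reference words. Below is a minimal Python sketch. The function name `word_error_rate` is hypothetical, and the sketch splits on whitespace without any casing or punctuation normalization. That choice is an assumption, and published benchmarks may normalize differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Hypothetical helper: WER = (S + D + I) / N via word-level Levenshtein distance.

    Assumes plain whitespace tokenization with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # At the start of row i, prev[j] is the edit distance between
    # the first i-1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))  # empty reference: j insertions
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)  # empty hypothesis: i deletions
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution (or match when cost == 0)
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# "sat" -> "sit" is one substitution and the second "the" is one deletion: 2 / 6.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count toward the numerator, WER can exceed 1.0. For example, a one-word reference scored against a two-word hypothesis with no matching words gives 2.0.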
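pyctcdecode implements CTC beam search. The simpler best-path (greedy) CTC decoding is a useful baseline to compare it against. It picks the highest-scoring label for each frame, collapses consecutive repeats, and then drops blank labels. The sketch below is not pyctcdecode's API. The function name `ctc_greedy_decode`, the convention that the blank has index 0, and the toy alphabet are all assumptions made for illustration.

```python
from typing import Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]], alphabet: str, blank: int = 0
) -> str:
    """Hypothetical best-path CTC decoder (not a beam search).

    frame_scores[t][k] is the score of label k at frame t; alphabet[k] is the
    character for label k (the character at the blank index is never emitted).
    """
    # 1. Best label per frame (argmax over the label axis).
    best = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]

    # 2. Collapse consecutive repeats, then 3. drop blanks.
    out = []
    prev = None
    for label in best:
        if label != prev and label != blank:
            out.append(alphabet[label])
        prev = label
    return "".join(out)


# Toy example: one-hot scores for the frame-level path h h e _ l l _ l o.
alphabet = "_helo"  # index 0 is the CTC blank (assumed convention)
path = [1, 1, 2, 0, 3, 3, 0, 3, 4]
scores = [[1.0 if k == p else 0.0 for k in range(len(alphabet))] for p in path]
print(ctc_greedy_decode(scores, alphabet))  # prints: hello
```

Repeats are collapsed before blanks are removed. That ordering is why the blank between the two `3` runs produces the double "l" in "hello", while the adjacent `3, 3` yields only a single "l".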