Speech recognition

ocotillo is nice.

Notes

One reason voice assistants don't seem to stick for most people is that they're actually command line interfaces, but even less discoverable because they don't provide any visible feedback at all.

Links

HN: Facebook open-sources a speech-recognition system and a machine learning library (2018)
DeepSpeech - Open source Speech-To-Text engine, using a model trained by machine learning techniques, based on Baidu's Deep Speech research paper. (Examples)
Online speech recognition with wav2letter@anywhere (2020)
wav2letter++ - Fast, open source speech processing toolkit from the Speech team at Facebook AI Research built to facilitate research in end-to-end models for speech recognition.
Kaldi - Speech Recognition Toolkit.
Building an end-to-end Speech Recognition model in PyTorch (HN)
Real-Time Voice Cloning - Clone a voice in 5 seconds to generate arbitrary speech in real-time.
Kaldi Active Grammar - Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time.
SpecAugment with PyTorch - PyTorch implementation of Google Brain's SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition.
Dragonfly - Speech recognition framework for Python that makes it convenient to create custom commands to use with speech recognition software.
Gentle - Robust yet lenient forced-aligner built on Kaldi. A tool for aligning speech with text.
Porcupine - On-device wake word detection powered by deep learning.
Eesen - End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding.
Ask HN: Is there any work being done in speech-to-code with deep learning? (2020)
Silero Models - Pre-trained STT models and benchmarks made embarrassingly simple. (HN)
High-quality pre-trained speech-to-text models now available on Torch Hub (HN)
Wavenet For Speech Denoising - Neural network for end-to-end speech denoising, as described in: "A Wavenet For Speech Denoising".
Vosk - Speech recognition toolkit with state-of-the-art accuracy and low latency in Rust.
Voicegain - Speech-to-text Platform and APIs. Speech Recognition.
LibreASR - On-Premises, Streaming Speech Recognition System. (HN)
WORLD - High-quality speech analysis, manipulation and synthesis system. (Web)
ESPnet - End-to-end speech processing toolkit. (Docs)
Speaker Diarization - Process to answer the question of 'who spoke when?' in an audio file.
SpeechRecognition - Local auto speech recognition project based on Kaldi and ALSA.
Athena - Open-source implementation of sequence-to-sequence based speech processing engine.
PyTorch end-to-end speech recognition
Cheetah - On-device streaming speech-to-text engine powered by deep learning.
WaveRNN - PyTorch implementation of DeepMind's WaveRNN model from Efficient Neural Audio Synthesis.
Conformer - PyTorch implementation of Conformer: Convolution-augmented Transformer for Speech Recognition.
A Review of End-to-End Architectures for Speech Recognition (2021)
libfvad - Voice activity detection (VAD) library, based on WebRTC's VAD engine.
ASR with PyTorch - Experimental code for speech recognition using PyTorch and Kaldi.
YSDA Speech Processing Course
Paper List for Speech Translation
Deep Contextualized Acoustic Representations For Semi-Supervised Speech Recognition (2020) (Code)
Lyra: A New Very Low-Bitrate Codec for Speech Compression (2021)
Parrot.PY - Computer interaction using audio and speech recognition.
SpeechBrain Toolkit - PyTorch-based Speech Toolkit. (Web)
Vosk API - Offline open source speech recognition toolkit. (Rust API)
Lyra - Very Low-Bitrate Codec for Speech Compression.
lasr - PyTorch Lightning implementation of Automatic Speech Recognition.
Speech Recognition from Scratch
Common Voice - Mozilla's initiative to help teach machines how real people speak.
FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement (2021) (Code)
DeepSpeech2 in PyTorch using PyTorch Lightning
Speech and Language Processing Book (2021) - Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. (2020 Version)
voice2json - Command-line tools for speech and intent recognition on Linux. (Web)
wav2vec Unsupervised: Speech recognition without supervision (2021)
Online Speech recognition using RNN-Transducer
Openspeech - Open-Source Toolkit for End-to-End Speech Recognition.
Unsupervised Speech Decomposition via Triple Information Bottleneck (2020) (Code)
AudioCLIP: Extending CLIP to Image, Text and Audio (2021) (Code)
Wav2vec: Semi and Unsupervised Speech Recognition (HN)
WeNet - Production First and Production Ready End-to-End Speech Recognition Toolkit. (Docs)
Why Hasn’t the iPhone Moment Happened Yet for Voice UIs (2021)
LeBenchmark: a reproducible framework for assessing SSL from speech
INTERSPEECH 2021
WER are we? - Tracking states of the art(s) and recent results on speech recognition.
GigaSpeech - Large, modern dataset for speech recognition.
Coqui STT - Deep learning toolkit for Speech-to-Text, battle-tested in research and production. (Docs) (Rust lib)
Coqui - Startup providing open speech tech for everyone. (GitHub)
Open Speech Corpora - List of accessible speech corpora for ASR, TTS, and other Speech Technologies.
An Overview of Multi-Task Learning in Speech Recognition (2020)
Coqui Inference Engine - Library for efficiently deploying speech models.
PDF to Speech - Deep-learning powered accessibility application which turns PDFs into audio files.
ASV-Subtools - Open Source Tools for Speaker Recognition.
VoiceFixer - General Speech Restoration.
speechmetrics - Wrapper around speech quality metrics MOSNet, BSSEval, STOI, PESQ, SRMR, SISDR.
Silero VAD - Pre-trained enterprise-grade Voice Activity Detector, Language Classifier and Spoken Number Detector.
A New AI Lexicon: Voice (2021) - The Legacies and Limits of Automated Voice Analysis.
Octopus - On-device speech-to-index engine powered by deep learning.
Open Audio Search - Full text search engine with automatic speech recognition for podcasts.
HuBERT: How to Apply BERT to Speech, Visually Explained (2021)
Happy Scribe - Audio Transcription & Video Subtitles.
Speech Recognition Papers
Steerable discovery of neural audio effects (2021) (Code)
audapolis - Editor for spoken-word media with transcription.
Shennong - Python toolbox for speech features extraction.
Paderbox - Collection of utilities for audio / speech processing.
Icefall - Speech recognition recipes using k2. (Docs)
k2 - FSA/FST algorithms, differentiable, with PyTorch compatibility.
ViSQOL (Virtual Speech Quality Objective Listener) - Objective, full-reference metric for perceived audio quality.
Espresso - Fast End-to-End Neural Speech Recognition Toolkit.
UniSpeech - Large Scale Self-Supervised Learning for Speech
NISQA: Speech Quality and Naturalness Assessment
Optimization techniques proposed in Improving RNN Transducer Modeling for End-to-End Speech Recognition
Conformer: Convolution-augmented Transformer for Speech Recognition (2020) (Code)
CAT: CRF-based ASR Toolkit - Complete workflow for CRF-based data-efficient end-to-end speech recognition.
Neural HMMs are all you need (for high-quality attention-free TTS) (2022) (Code)
End-to-End Speech Translation Progress - Tracking the progress in end-to-end speech translation.
EfficientTTS: An Efficient and High-Quality Text-to-Speech Architecture (2020) (Code)
S3PRL - Self-Supervised Speech Pre-training and Representation Learning Toolkit.
pyannote-audio - Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding.
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (2021) (Code)
Speech recognition polyfill - Polyfill for the SpeechRecognition standard on web, using Speechly as the underlying API.
Speech-to-Text Benchmark
Hyperion - Speaker Recognition Toolkit based on PyTorch and numpy.
textlesslib - Library for Textless Spoken Language Processing.
FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech (2021) (Code)
HuggingSound - Toolkit for speech-related tasks based on HuggingFace's tools.
hear - macOS speech recognition via the command line.
PaddleSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.
BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation (2021) (Code)
Edinburgh Speech Tools
rVADfast - Python library for an unsupervised, fast method for robust voice activity detection.
NeuralSpeech - Research project in Microsoft Research Asia focusing on neural network based speech processing, including automatic speech recognition (ASR), text to speech (TTS), etc.
Speech Super-resolution Evaluation and Benchmarking
Real Time Speech Recognition with Gradio (HN)
Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques (2021) (Code)
CoVoST: A Large-Scale Multilingual Speech-To-Text Translation Corpus
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction (2022) (Code)
Real Time Speech Enhancement in the Waveform Domain (2020) (Code)
Vosk-Browser - Opinionated speech recognition library for the browser using a WebAssembly build of Vosk.
VocalSound: A Dataset for Improving Human Vocal Sounds Recognition
PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition
NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality (2022) (HN) (Code)
George Hotz | Programming | speech recognition (2022)
CoquiSTT + Signal = Love (death to voice messages) (2022)
ocotillo - PyTorch-based ML model that does state-of-the-art English speech transcription.
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing (2021) (Code)
pyctcdecode - Fast and lightweight Python-based CTC beam search decoder for speech recognition.
Avocodo: Generative Adversarial Network for Artifact-free Vocoder (2022) (Code)
Squeezeformer - PyTorch implementation of "Squeezeformer: An Efficient Transformer for Automatic Speech Recognition".
Masked Autoencoders that Listen (2022) (Code)
SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech (2022) (Code)
Speech Enhancement and Dereverberation with Diffusion-based Generative Models
