Julius: Open-Source Large Vocabulary Continuous Speech Recognition Engine
“Julius” is a high-performance, small-footprint large vocabulary continuous speech recognition (LVCSR) decoder for speech-related researchers and developers. Based on word N-gram and context-dependent HMM, it can perform real-time decoding on a wide range of computers and devices, from microcomputers to cloud servers. The algorithm is based on a 2-pass tree-trellis search, which fully incorporates major decoding techniques such as tree-organized lexicon, 1-best / word-pair context approximation, rank/score pruning, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, and Gaussian selection. Beyond search efficiency, it is modularized to be independent of model structure, and supports a wide variety of HMM structures, such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phone sets. It can also run multi-instance recognition, performing dictation, grammar-based recognition, and isolated word recognition simultaneously in a single thread. Standard formats are adopted for the models so that Julius can interoperate with other speech and language modeling toolkits such as HTK and SRILM. Recent versions also support Deep Neural Network (DNN) based real-time decoding.
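As a rough illustration of the rank/score pruning mentioned above, the sketch below prunes a set of scored hypotheses the way a generic beam search might. It is not taken from the Julius source: the function name `prune_beam` and its parameters `max_rank` and `score_width` are hypothetical, and scores are assumed to be log-likelihoods where higher is better.

```python
from typing import Dict, Hashable


def prune_beam(
    scores: Dict[Hashable, float], max_rank: int, score_width: float
) -> Dict[Hashable, float]:
    """Keep hypotheses that survive both score pruning and rank pruning.

    `scores` maps a hypothesis id to its log-likelihood (higher is better).
    All names here are illustrative assumptions, not Julius identifiers.
    """
    if not scores:
        return {}
    best = max(scores.values())
    # Score pruning: drop anything more than `score_width` below the best.
    within_width = {h: s for h, s in scores.items() if s >= best - score_width}
    # Rank pruning: of the survivors, keep only the `max_rank` best.
    ranked = sorted(within_width.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:max_rank])


hyps = {"a": -10.0, "b": -12.5, "c": -30.0, "d": -11.0}
survivors = prune_beam(hyps, max_rank=2, score_width=5.0)
print(survivors)
```

Here "c" falls outside the score width, and "b" is then cut by the rank limit. A real decoder would apply this kind of pruning per frame, and Julius exposes its own search parameters for tuning it.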
The main platforms are Linux and other Unix-based systems. Julius also runs on Windows, Mac, Android, and other platforms.
Julius has been developed as research software for Japanese LVCSR since 1997, and the work was continued under the IPA Japanese dictation toolkit project (1997-2000), the Continuous Speech Recognition Consortium, Japan (CSRC) (2000-2003), and the Interactive Speech Technology Consortium (ISTC).
The main developer / maintainer is Akinobu Lee (email@example.com).
- Open-source LVCSR software (see the terms and conditions of the license).
- Real-time, high-speed, accurate recognition based on a 2-pass strategy.
- Low memory requirement: less than 32 MB required for the work area (less than 64 MB for 20k-word dictation with an on-memory 3-gram LM).
- Supports N-gram LMs with arbitrary N. Also supports rule-based grammars and word lists for isolated word recognition.
- Language- and unit-independent: any LM in ARPA standard format and any AM in HTK ASCII HMM definition format can be used.
- Highly configurable: various search parameters can be set, and alternate decoding algorithms (1-best / word-pair approximation, word trellis / word graph intermediates, etc.) can be chosen.
- List of major supported features:
  - On-the-fly recognition for microphone and network input
  - GMM-based input rejection
  - Successive decoding, delimiting input by short pauses
  - N-best output
  - Word graph output
  - Forced alignment on word, phoneme, and state level
  - Confidence scoring
  - Server mode and control API
  - Many search parameters for tuning performance
  - Character code conversion for result output
  - (Rev. 4) Engine provided as a library with a simple API
  - (Rev. 4) Long N-gram support
  - (Rev. 4) Run with forward / backward N-gram only
  - (Rev. 4) Confusion network output
  - (Rev. 4) Arbitrary multi-model decoding in a single thread
  - (Rev. 4) Rapid isolated word recognition
  - (Rev. 4) User-defined LM function embedding
- DNN-based decoding, using a front-end module for frame-wise state probability calculation for flexibility.
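To make the ARPA-format requirement in the feature list more concrete, here is a minimal sketch of reading such a file in Python. It assumes the common layout (a `\data\` header, `\N-grams:` sections with `log10_prob words [backoff]` lines, and a closing `\end\`), handles only well-formed input, and is unrelated to how Julius itself loads models. The function name `parse_arpa` and the tiny sample model are made up for illustration.

```python
from typing import Dict, Optional, Tuple

NGramTable = Dict[Tuple[str, ...], Tuple[float, Optional[float]]]


def parse_arpa(text: str) -> Dict[int, NGramTable]:
    """Parse ARPA LM text into {order: {ngram: (log10_prob, backoff)}}."""
    tables: Dict[int, NGramTable] = {}
    order = 0  # 0 means we are not inside an N-gram section yet
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue  # blank lines and the count header carry no entries
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            tables[order] = {}
            continue
        if order == 0:
            continue  # free-form text before the first section
        fields = line.split()
        logprob = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The backoff weight is optional and follows the words.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        tables[order][words] = (logprob, backoff)
    return tables


# Hypothetical toy model, not a real trained LM.
SAMPLE = r"""
\data\
ngram 1=3
ngram 2=2

\1-grams:
-0.5 <s> -0.3
-0.7 hello -0.2
-0.4 </s>

\2-grams:
-0.1 <s> hello
-0.2 hello </s>

\end\
"""

lm = parse_arpa(SAMPLE)
print(lm[2][("<s>", "hello")])
```

Keying each table by a tuple of words keeps lookups for any N uniform, which matches the "arbitrary N" support listed above. A production loader would also validate the counts declared in the header.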