ARANGA. KOTHAI NACHIYAR, M.C.A.,M.Phil.
Assistant Professor of Computer Science (SF)
Ayya Nadar Janaki Ammal College(Autonomous),
Sivakasi.
Abstract
Speech recognition aims to provide an efficient way for humans to communicate with computers. The extensive development of these systems permits users to talk almost naturally with computers. Hence, today's researchers are mainly focusing on developing systems for recognizing continuous speech to accomplish tasks such as answering emails and creating text documents. However, developing such a system is still a difficult task owing to its inherent complexity. This paper presents an Automatic Speech Recognition (ASR) system for the Tamil language using the Hidden Markov Model (HMM) approach. An HMM-based acoustic model is chosen to recognize a given set of sentences from a medium vocabulary. The results are found to be satisfactory, with 92% word recognition accuracy and 81% sentence accuracy for the proposed system.
Keywords
Feature extraction, pattern matching, ANN, HMM, DTW
1. INTRODUCTION
It is convenient for humans to interact with a computer, robot, or any machine through speech or vocalization rather than through difficult instructions; this convenience motivated the idea of building speech recognition systems. Human beings have long been inspired to create computers that can understand and talk like humans. Since the 1960s, computer scientists have been researching various ways and means to make computers record, interpret, and understand human speech.
The fundamental task of speech recognition is the translation of sound into text and commands. Speech recognition is the process by which a computer maps an acoustic speech signal to some form of abstract meaning of the speech. This process is highly difficult because the incoming sound must be matched against stored reference sounds, and since it rarely matches the pre-existing sound pieces exactly, further analysis is required. Various feature extraction methods and pattern matching techniques are used to build better-quality speech recognition systems; both play an important role in maximizing the recognition rate across different speakers.
2. CLASSIFICATION OF SPEECH RECOGNITION SYSTEMS
2.1 Types of speech recognition system based on utterances
2.1.1 Isolated Words
An isolated word recognition system recognizes single utterances, i.e. single words. Isolated word recognition is suitable for situations where the user is required to give only one-word responses or commands, but it is very unnatural for multiple-word input. It is the simplest and easiest to implement because word boundaries are obvious and the words tend to be clearly pronounced, which is the major advantage of this type.
2.1.2 Connected Words
A connected-words system is similar to isolated words, but it allows separate utterances to be "run together" with a minimal pause between them. An utterance is the vocalization of a word or words that represents a single meaning to the computer.
2.1.3 Continuous Speech
Continuous speech recognition systems allow users to speak almost naturally, while the computer determines the content. Basically, it is computer dictation. In continuous speech, adjacent words run together without pauses or any other division between them. Continuous speech recognition systems are difficult to develop.
2.1.4 Spontaneous Speech
A spontaneous speech recognition system recognizes natural speech, i.e. speech that comes out unplanned. An ASR system capable of spontaneous speech must handle a variety of natural speech features such as words being run together. Spontaneous speech may include mispronunciations, false starts, and non-words.
2.2 Types of speech recognition based on Speaker Model
Each speaker has a distinctive voice, due to his or her unique physiology and personality. Speech recognition systems are classified into three main categories as follows:
2.2.1 Speaker Dependent Models
Speaker-dependent systems are developed for a particular speaker. They are generally more accurate for that speaker, but may be less accurate for other speakers. These systems are usually cheaper, easier to develop, and more accurate, but they are not as flexible as speaker-independent systems.
2.2.2 Speaker Independent Models
A speaker-independent system can recognize a variety of speakers without any prior training; it is developed to operate for any speaker. It is used in Interactive Voice Response Systems (IVRS) that must accept input from a large number of different users. Its drawback is that it limits the number of words in the vocabulary. Speaker-independent systems are the most difficult to implement; they are also expensive, and their accuracy is lower than that of speaker-dependent systems.
2.2.3 Speaker Adaptive Models
A speaker-adaptive speech recognition system uses speaker-dependent data to adapt its models to the current speaker, decreasing the error rate through adaptation [6]. These systems adjust their operation according to the characteristics of individual speakers.
2.3 Types of speech recognition based on Vocabulary
The vocabulary size of a speech recognition system affects its complexity, processing requirements, and recognition rate. ASR systems are therefore classified by vocabulary size as follows:
- Small Vocabulary – 1 to 100 words or sentences
- Medium Vocabulary – 101 to 1000 words or sentences
- Large Vocabulary- 1001 to 10,000 words or sentences
- Very-large vocabulary – More than 10,000 words or sentences
3. FUNCTIONING OF SPEECH RECOGNITION SYSTEM
Fig: System Architecture for Automatic Speech Recognition
3.1 Pre-processing/Digital Processing
The recorded acoustic signal is an analog signal, which cannot be fed directly to an ASR system. The speech signal must first be transformed into a digital signal before it can be processed. The digital signal is then passed through a first-order filter to spectrally flatten it; this pre-emphasis increases the energy of the signal at higher frequencies. This is the pre-processing step.
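As a rough illustration, the pre-emphasis step described above can be sketched as a first-order filter y[n] = x[n] - a·x[n-1]; the coefficient a = 0.97 below is a common illustrative choice, not a value prescribed by this paper:

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# A short synthetic signal, just to show the effect
x = np.array([1.0, 2.0, 3.0, 4.0])
y = pre_emphasis(x)
```

Boosting the high-frequency content this way compensates for the natural spectral tilt of voiced speech before feature extraction.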
3.2 Feature Extraction
The feature extraction step finds a set of parameters of the utterance that have acoustic correlation with the speech signal; these parameters are computed by processing the acoustic waveform and are known as features. The main goal of the feature extractor is to keep the relevant information and discard the irrelevant. To perform this operation, the feature extractor divides the acoustic signal into frames of 10-25 ms. The data acquired in each frame is multiplied by a window function; many window functions can be used, such as Hamming, Rectangular, Blackman, Welch, or Gaussian. In this way features are extracted from every frame. Several methods exist for feature extraction, such as Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), Perceptual Linear Prediction (PLP), wavelets, and RASTA-PLP (Relative Spectral Transform) processing.
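The framing and windowing step can be sketched as follows; the 25 ms frame length, 10 ms hop, and Hamming window below are typical illustrative choices, not parameters taken from this paper:

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split a 1-D signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

# One second of a synthetic 440 Hz tone at 16 kHz
sr = 16000
t = np.arange(sr) / sr
frames = frame_signal(np.sin(2 * np.pi * 440 * t), sr)
```

Each windowed frame would then be passed to a spectral analysis stage such as MFCC computation.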
3.3 Acoustic Modeling
Acoustic modeling is a fundamental part of an ASR system. In acoustic modeling, the connection between the acoustic information and the phonetics is established. The acoustic model plays an important role in the performance of the system and is responsible for much of the computational load. Training establishes a correlation between the basic speech units and the acoustic observations. Training the system requires creating a representative pattern for the features of a class using one or more patterns that correspond to speech sounds of the same class. Many models are available for acoustic modeling; among them, the Hidden Markov Model (HMM) is widely used and accepted, as it offers efficient algorithms for training and recognition.
3.4 Language Modeling
A language model captures the structural constraints of the language to generate probabilities of occurrence: it gives the probability of a word occurring after a given word sequence. Each language has its own constraints. Generally, speech recognition systems use bi-gram, tri-gram, or more generally n-gram language models to find the correct word sequence by predicting the likelihood of the nth word from the n-1 preceding words. In speech recognition, the computer system matches sounds with word sequences, and the language model distinguishes between words and phrases that sound similar. For example, in American English, phrases like "recognize speech" and "wreck a nice beach" sound nearly identical but mean very different things. These ambiguities are easier to resolve when evidence from the language model is combined with the pronunciation model and the acoustic model.
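As a minimal illustration of n-gram language modeling, the following sketch estimates bigram probabilities from a tiny hypothetical corpus (the sentences are invented for the example):

```python
from collections import Counter

# Toy corpus, invented purely for illustration
corpus = [
    "recognize speech with a computer",
    "recognize speech quickly",
    "wreck a nice beach",
]

tokens = []
for sentence in corpus:
    tokens += ["<s>"] + sentence.split() + ["</s>"]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]
```

In this toy corpus "speech" always follows "recognize", so P(speech | recognize) = 1.0, while "a" is followed by "nice" only half the time, so P(nice | a) = 0.5. Real systems smooth these estimates to handle unseen word pairs.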
3.5 Pattern Classification
Pattern classification (or recognition) is the process of comparing an unknown test pattern with each sound class reference pattern and computing a measure of similarity between them. After the system has been trained, test patterns are classified at recognition time to recognize the speech.
4. DIFFERENT FEATURE EXTRACTION TECHNIQUES USED IN SPEECH RECOGNITION
Technique | Description | Advantages | Disadvantages |
Linear Predictive Coding (LPC) | Provides autoregression-based speech features; a static technique; the residual is very close to the vocal tract input signal | High rate of audio compression; short training time; the redundant signal can be removed | Due to its linear nature, LPC cannot extract noisy signals at high amplitude; feature extraction takes a long time |
Mel-Frequency Cepstral Coefficients (MFCC) | Widely used for speech processing tasks; mimics the human auditory system | High accuracy with low complexity; high performance rate | MFCC values are not robust in the presence of background or additive noise; large computations make it difficult to implement |
5. APPROACHES FOR PATTERN MATCHING IN SPEECH RECOGNITION
5.1 Template- Based Approach
The template-based approach maintains a collection of prototypical speech patterns, stored as reference patterns representing the dictionary of words. Speech is recognized by matching an unknown spoken utterance against each of these reference templates and selecting the category of the best-matching pattern. Normally, templates are constructed for entire words.
5.2 Knowledge-Based Approach
The use of a knowledge/rule-based approach to speech recognition has been proposed by several researchers and applied in practice. The knowledge-based approach uses information regarding linguistics, phonetics, and spectrograms. Expert knowledge about variation in speech is hand-coded into the system. The approach takes a set of features from the speech and then trains the system to generate a set of production rules automatically from the samples. These rules result from the parameters that provide useful information for a classification. The recognition effort is performed at the frame level, using an inference engine to implement the decision tree and classify observations through the firing of the rules. This approach has the benefit of explicitly modeling variation in speech; unfortunately, such expert knowledge is difficult to obtain and use successfully, so the approach is considered impractical, and automatic learning procedures have been sought instead.
5.3 Neural Network-Based Approach
Another approach for pattern matching in speech recognition is the use of neural networks. Neural networks are capable of solving more complicated recognition tasks, but do not perform as well as Hidden Markov Models (HMM) when it comes to large vocabularies. They can handle low-quality, noisy data and speaker independence. Systems of this type can achieve higher accuracy than HMM-based systems when training data is available and the vocabulary size is limited. A more familiar approach using neural networks is phoneme recognition; this is a dynamic area of research, and its results are generally better than those of HMMs. There are also NN-HMM hybrid systems that use the neural network for phoneme recognition and the HMM for language modeling.
Artificial neural network technology is used in speech recognition for the following reasons:
- It reduces the modeling unit, generally to phoneme models, advancing the recognition rate of the entire system by improving the recognition rate of phonemes.
- Deep learning of the acoustic model, inspired by the structure of brain operation and introducing context information, reduces the impact of voice variation on the speech signal.
- Various features are extracted from the speech signal, a hybrid network model (HMM + NN) is built, and a variety of knowledge sources, i.e. characteristics, vocabulary, and word meaning, are applied to speech recognition and understanding research, advancing system properties.
The application of artificial neural networks in the field of speech recognition has developed significantly in recent years. Work on artificial neural networks in speech recognition can be divided into the following areas: firstly, improving the performance of artificial neural networks themselves; secondly, combining them with other models into hybrid systems; thirdly, using mathematical methods that represent the unique nature of neural networks and applying them to speech recognition. Artificial neural networks in speech recognition have become a new emerging trend.
5.4 Dynamic Time Warping (DTW) Based Approach
Dynamic Time Warping is an algorithm for measuring similarity between two sequences which may vary in time or speed. In ASR it is used to cope with different vocalization speeds. In general, it is a method that allows a computer to find an optimal match between two given sequences under certain restrictions, i.e. the sequences are "warped" non-linearly to match each other. This sequence alignment method is also often used in the context of HMMs. The technique is useful for isolated word recognition and can be modified to recognize connected words as well.
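The DTW alignment described above can be sketched with the classic dynamic-programming recurrence; the two short sequences below are synthetic examples of the "same" contour spoken at different speeds:

```python
import math

def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences using a full cost matrix."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Two synthetic utterances of the "same" contour at different speeds
slow = [0, 1, 2, 3, 2, 1, 0]
fast = [0, 2, 3, 1, 0]
d = dtw_distance(slow, fast)
```

Because the warping path stretches and compresses the time axis, the two differently paced sequences yield a small distance, whereas a fixed frame-by-frame comparison would penalize the speed difference heavily.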
5.5 Statistical- Based Approach
In this approach, variations in speech are modeled statistically (e.g. with HMMs) using training methods. This approach represents the current state of the art: present general-purpose speech recognition systems are based on statistical acoustic and language models. Acoustic and language models for ASR in an unlimited domain require large amounts of acoustic and linguistic data for parameter estimation.
5.5.1 Hidden Markov Model (HMM)-Based Speech Recognition
HMM-based speech recognition systems have become popular because HMMs can be trained automatically and are computationally feasible to use. HMMs are simple networks that can generate speech using a number of states for each model and by modeling the short-term spectra associated with each state. The parameters of the model are the state transition probabilities and the means, variances, and mixture weights that represent the state output distributions. Each word or phoneme has a different output distribution; an HMM for a sequence of words or phonemes is made by concatenating the individually trained HMMs for the separate words and phonemes. Modern HMM-based large-vocabulary speech recognition systems are often trained on hundreds of hours of acoustic data. Given the word sequences and a pronunciation dictionary, the HMM training process can automatically determine the word models, which means it is relatively straightforward to use large training corpora. This is the main advantage of HMMs: it greatly reduces the time and complexity of the training and recognition process for large vocabularies.
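As a minimal sketch of HMM decoding, the following Viterbi implementation finds the most likely state sequence for a toy two-state model; the probabilities below are hypothetical and not taken from the paper:

```python
import math

# Toy two-state HMM (hypothetical numbers). States might represent two
# broad sound classes; observations are discretized acoustic symbols 0 and 1.
start = [0.6, 0.4]                      # initial state probabilities
trans = [[0.7, 0.3], [0.4, 0.6]]        # state transition probabilities
emit  = [[0.9, 0.1], [0.2, 0.8]]        # state output distributions

def viterbi(obs):
    """Most likely state sequence for an observation sequence (log domain)."""
    n_states = len(start)
    V = [[math.log(start[s]) + math.log(emit[s][obs[0]]) for s in range(n_states)]]
    back = []
    for o in obs[1:]:
        row, ptr = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row.append(V[-1][best] + math.log(trans[best][s]) + math.log(emit[s][o]))
            ptr.append(best)
        V.append(row)
        back.append(ptr)
    # Backtrack from the best final state
    state = max(range(n_states), key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

path = viterbi([0, 0, 1, 1, 1])
```

Working in the log domain avoids numerical underflow, which matters once real utterances contain hundreds of frames. In a full recognizer the emission probabilities would come from Gaussian mixtures over acoustic features rather than a discrete table.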
6. CONCLUSION
In this review paper, the basics of speech recognition systems and the different approaches available for feature extraction and pattern matching have been discussed. Using these various techniques, the rate of speech recognition can be improved and better-quality speech recognition systems can be developed. Future work will focus on the development of large-vocabulary speech recognition systems and speaker-independent continuous speech recognition systems. For developing such systems, Artificial Neural Networks (ANN) and Hidden Markov Models (HMM) will be used extensively, as these techniques have recently become popular in speech recognition. The benefit of speech recognition software is that it provides a faster method of writing on a computer, tablet, or smartphone, without typing: the user can speak into an external microphone, headset, or built-in microphone, and the words appear as text on the screen. The technology analyzes each sound and uses an algorithm to find the most probable word for that sound; finally, the sounds are transcribed into text. Speech recognition systems often use AI-based natural language algorithms to predict the probability of words in a language's vocabulary.