S.Ahilandeshwari, M.Sc.,B.Ed.,
Assistant Professor,
Ayyanadar Janaki Ammal College,
Sivakasi.
Abstract
Tamil is a Dravidian language predominantly spoken by the Tamil people of India and Sri Lanka. It is one of the longest-surviving classical languages in the world, with a recorded literature documented for over 2000 years. Creating innovative software for the Tamil language can involve a variety of applications, ranging from language processing and translation to educational tools and content creation. Predicting the exact software needs of the future is challenging, as it depends on technological advancements, emerging trends, and specific applications. However, we can anticipate that future software development for the Tamil language will involve cutting-edge technologies and tools such as AI and machine learning, NLP, and toolkits like CLTK. This paper surveys some of the software needs for the future of Tamil language development.
Keywords: AI, Machine Learning, NLP, CLTK
INTRODUCTION
Using AI and Machine Learning Frameworks
There are several AI and machine learning frameworks that are language-agnostic and can be used for developing applications in various languages, including Tamil. These frameworks provide tools and libraries for tasks such as natural language processing (NLP), machine translation, and speech recognition. Here are some frameworks that you might find useful for developing AI and machine learning applications for the Tamil language.
Hugging Face Transformers
The Transformers library from Hugging Face provides pre-trained models for various NLP tasks, including text classification, named entity recognition, and machine translation. These models can be fine-tuned for specific tasks in Tamil. One example is the GPT2-Tamil model, which is fine-tuned on a large corpus of Tamil data in a self-supervised fashion. This means it was pretrained on raw texts only, with no human labelling; an automatic process generates inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences: the inputs are sequences of continuous text of a certain length, and the targets are the same sequences shifted one token (a word or piece of a word) to the right. Internally, the model uses a masking mechanism to ensure that the prediction for token i uses only the inputs from 1 to i, not future tokens.
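The shifted-target setup described above can be sketched in a few lines of plain Python. This is a toy illustration with made-up token IDs, not the actual GPT2-Tamil training code:

```python
# Toy illustration of causal language-model training pairs:
# the target sequence is the input shifted one token to the right,
# so the model learns to predict token i+1 from tokens 1..i.
def make_lm_pairs(token_ids):
    inputs = token_ids[:-1]   # tokens 1..n-1
    targets = token_ids[1:]   # tokens 2..n (shifted one position right)
    return inputs, targets

sequence = [5, 12, 7, 3, 9]          # made-up token IDs
inputs, targets = make_lm_pairs(sequence)
print(inputs)    # [5, 12, 7, 3]
print(targets)   # [12, 7, 3, 9]
```

During training, the masking mechanism hides `targets[i]` and all later tokens when the model predicts position i, which is what makes the next-word objective self-supervised.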
OpenNMT
OpenNMT is an open source ecosystem for neural machine translation and neural sequence learning. We can use OpenNMT to develop machine translation models for Tamil, allowing the translation of text from Tamil to another language or vice versa.
Steps for Tamil Language Development using OpenNMT
Data Preparation
- Prepare parallel datasets containing pairs of sentences in Tamil and the target language (e.g., English).
- Split your data into training, validation, and test sets.
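The split can be sketched as follows; the 80/10/10 proportions, the fixed seed, and the placeholder sentence pairs are assumptions for illustration:

```python
import random

def split_corpus(pairs, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle Tamil-target sentence pairs and split them into
    training, validation, and test sets."""
    rng = random.Random(seed)      # fixed seed for a reproducible split
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_valid],
            pairs[n_train + n_valid:])

# Hypothetical parallel data: (Tamil sentence, English sentence)
data = [(f"ta_{i}", f"en_{i}") for i in range(100)]
train, valid, test = split_corpus(data)
print(len(train), len(valid), len(test))  # 80 10 10
```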
Tokenization
Tokenize your data into subword units using the OpenNMT Tokenizer.
onmt_tokenize --mode aggressive --joiner_annotate < source.txt > source_tokenized.txt
onmt_tokenize --mode aggressive --joiner_annotate < target.txt > target_tokenized.txt
Configuration
Create a configuration file (YAML format) specifying the model architecture, training parameters, and data paths. You can use a sample configuration file provided by OpenNMT or customize it based on your requirements.
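A minimal configuration might look like the sketch below. All file names, vocabulary paths, and hyperparameter values are placeholders; consult the OpenNMT-py documentation for the full set of options:

```yaml
# Sketch of an OpenNMT-py training configuration; adapt every
# path and value below to your own data and hardware.
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
data:
  corpus_1:
    path_src: source_tokenized.txt   # Tamil side of the parallel corpus
    path_tgt: target_tokenized.txt   # target-language side (e.g., English)
  valid:
    path_src: source_valid.txt
    path_tgt: target_valid.txt
save_model: run/model
train_steps: 10000
valid_steps: 1000
```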
Training
Train the model using the prepared data and the configuration file.
onmt_train -config your_config_file.yaml
- Adjust parameters such as the number of training steps, learning rate, and batch size based on your data and model requirements.
Translation
Translate sentences using the trained model.
onmt_translate -model model_step_10000.pt -src source_test.txt -output pred.txt -replace_unk -verbose
Evaluation
Evaluate the performance of the model using metrics such as the BLEU score, comparing the translations in pred.txt against the reference translations (for example, with an external scoring tool such as sacreBLEU rather than by re-running onmt_translate).
sacrebleu reference_test.txt -i pred.txt
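BLEU works by measuring n-gram overlap between the model's output and a reference translation. The toy computation below shows clipped unigram precision only; full BLEU also uses higher-order n-grams and a brevity penalty, so this is a simplified illustration, not the real metric:

```python
from collections import Counter

def unigram_precision(hypothesis, reference):
    """Clipped unigram precision: the fraction of hypothesis words
    that also appear in the reference, with per-word counts clipped
    so repeated words cannot be over-credited."""
    hyp_words = hypothesis.split()
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hyp_words)
    matched = sum(min(count, ref_counts[word])
                  for word, count in hyp_counts.items())
    return matched / len(hyp_words)

p = unigram_precision("the cat sat on the mat", "the cat is on the mat")
print(round(p, 2))  # 0.83  (5 of 6 hypothesis words matched)
```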
Fine-Tuning (Optional)
If necessary, you can fine-tune your model on domain-specific data to improve its performance.
onmt_train -config your_finetuning_config.yaml
Deployment
Deploy your trained model in your application or service for real-time translation.
Stanford NLP Libraries
The Stanford NLP (Natural Language Processing) Toolkit is a suite of natural language processing tools developed by the Stanford NLP Group. These tools perform various linguistic tasks, ranging from basic tokenization and part-of-speech tagging to more advanced tasks such as named entity recognition and sentiment analysis. The toolkit is implemented in Java and has a user-friendly interface, making it accessible to researchers and developers. Tools that support Dravidian languages are still scarce, mainly because these languages are low-resourced. In recent years, several Indian research institutes have started working on preprocessing tools and resources for the Tamil language.
Tokenizing
Tokenization is the process of breaking a stream of textual content into meaningful elements called tokens. These tokens can be words, terms, symbols, etc. Tokenization generally happens at the word level, but it is sometimes difficult to define what is meant by a 'word'. Standard tokenizers use simple heuristics such as:
- Punctuation and whitespace may or may not be returned with the tokens.
- Contiguous strings of alphabetic characters or digits are treated as a single token.
- Tokens are separated by whitespace or punctuation characters.
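These heuristics can be sketched with a single regular expression: runs of word characters form one token, and each punctuation mark becomes its own token, with whitespace discarded. This is a minimal illustration, not a production tokenizer (in particular, Tamil combining vowel signs and other scripts need extra care that this pattern does not provide):

```python
import re

def simple_tokenize(text):
    """Heuristic tokenizer: contiguous word characters (letters,
    digits, underscore) form one token; each non-space punctuation
    character is returned as a separate token; whitespace only
    separates tokens and is dropped."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Hello, world! It costs 100 rupees."))
# ['Hello', ',', 'world', '!', 'It', 'costs', '100', 'rupees', '.']
```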
Part of Speech Tagging (POS)
POS tagging is a vital process in understanding the meaning of a sentence: it helps to infer knowledge about neighbouring words and the syntactic structure woven around a word. POS tagging is crucial because the accuracy of an NLP tool depends on its POS tagger.
Several well-established POS tagging tools exist for languages like English. However, for a low-resourced language like Tamil, only a limited number of works have been carried out, and different approaches are yet to be tested. For a highly inflectional language like Tamil in particular, the complexity of the tagger increases.
POS tagging for the Tamil language is supported by RDRPOSTagger. This is a ripple-down rule-based POS tagger that comes with pre-trained POS tagging modules. Note that this library only supports Universal POS tags for the Tamil language.
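A ripple-down rule tagger starts from a default rule and layers exception rules on top, each of which can itself have exceptions. The miniature below illustrates only this control flow; the suffix rules are hypothetical simplifications, and RDRPOSTagger's actual rules for Tamil are learned from annotated data:

```python
# Toy ripple-down-rule tagger (illustrative only; the suffix rules
# below are hypothetical, not RDRPOSTagger's learned rules).
def rdr_tag(word):
    tag = "NOUN"                       # default rule
    if word.endswith("ான்"):           # exception: common past-tense
        tag = "VERB"                   # masculine verb ending
        if word == "மான்":             # exception to the exception:
            tag = "NOUN"               # "மான்" (deer) is a noun
    return tag

print(rdr_tag("வந்தான்"))   # the verb "he came"
print(rdr_tag("மான்"))      # the noun "deer"
```

The point of the ripple-down structure is that new exception rules can be added without disturbing the cases the existing rules already handle correctly.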
Morphological Analysis
Stemming is a computational procedure in which words with the same root are reduced to a common form, generally by stripping each word of its derivational and inflectional suffixes. Most information retrieval (IR) systems use stemming to identify root words and improve retrieval performance.
Morphological analysis (MA) produces information about the morphosyntactic properties of a word; it is a key component in machine translation.
Stemming is a simpler process than MA, but stemming alone cannot identify root words when the words are inflected. MA outperforms stemming because it handles additional analysis that stemming does not support. Stemming gives higher accuracy for languages with few word inflections, but MA performs better than algorithmic stemmers for languages with complex morphology. Given that Tamil is a morphologically rich language, MA is generally more suitable than stemming.
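The limitation of suffix stripping on Tamil can be seen in a deliberately naive stemmer. The one-entry suffix list here is illustrative; real Tamil stemmers use much larger suffix inventories and ordering rules:

```python
# Naive suffix-stripping stemmer (illustrative; the suffix list is
# a tiny placeholder, not a real Tamil suffix inventory).
SUFFIXES = ["கள்"]   # plural marker

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

# Stripping the plural marker from "மரங்கள்" (trees) yields "மரங்",
# not the actual root "மரம்" (tree): the stem changed by sandhi when
# the suffix attached. Recovering the true root is exactly what a
# morphological analyser handles and a plain stemmer does not.
print(naive_stem("மரங்கள்"))
```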
CONCLUSION
The approaches surveyed above stand as efforts towards fulfilling the needs of the future. The potential software needs for the future of Tamil language development are diverse and span various domains, from linguistic research and education to advanced technologies like artificial intelligence. Recognizing the significance of the Tamil language in both historical and contemporary contexts, the development of innovative software is crucial to preserving, promoting, and enhancing the language.
REFERENCES
Artificial Intelligence: A Modern Approach by Stuart Russell and Peter Norvig
Natural Language Processing in Action by Hobson Lane, Hannes Hapke, and Cole Howard
Practical Natural Language Processing by Sowmya Vajjala et al.