Sentiment Analysis in Tamil Language Using Hybrid Deep Learning Approach

S. Vaishnavi & P. Saranya

Assistant Professor

GTN Arts College (Autonomous), Dindigul.

Summary

With the rise of social media, the number of users is increasing day by day in every corner of the world. Social media is a place where many people share and discuss their opinions, and these opinions are expressed by different people in different languages. Recent advances in machine learning and deep learning have improved natural language processing for resource-rich languages such as English. However, this progress has not yet reached the Tamil language because of its limited resources. Tamil is a morphologically rich language, and only a limited number of studies have addressed sentiment analysis in Tamil, owing to the complexity of the task and the scarcity of resources. This research work proposes hybrid deep learning approaches that combine the capabilities of two different deep learning algorithms: CNN-BiLSTM, CNN-LSTM and CNN-BiGRU. To prepare the data, various tools and libraries that support the Tamil language were used. The proposed methods are evaluated and compared on metrics such as accuracy, recall, and F1-score to find the best-performing model, which classifies the sentiment of movie reviews in Tamil. The results show that CNN-BiLSTM achieved the highest accuracy, 80.2%, and the highest F1-score, 0.64, compared to the other two models.

Keywords: NLP, Sentiment Analysis, Deep Learning, Hybrid Deep Learning, FastText, Word Embedding, CNN-LSTM, CNN-BiLSTM, CNN-BiGRU

Introduction:
The growth of social media has allowed people to share their views on various topics, such as movies, television shows, and products. These texts provide other users with information about those topics, and because so many people use social media, the amount of information available in such texts has increased. Text that conveys subjective information is beneficial for both expert and general users, as it allows them to make informed decisions. Many companies analyse the reviews shared by people in order to understand their customers' needs. However, extracting this information from text is not as simple as it sounds. Challenges in analysing the text include newly introduced terms, spelling mistakes across multiple languages, punctuation, repeated characters and so on. Natural language processing (NLP) handles text through a series of processing steps. Sentiment analysis, which has been growing in popularity in recent years, is a part of NLP that focuses on extracting and refining the opinions expressed in a narrative.
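Some of the challenges listed above, such as repeated characters and punctuation, are usually handled in a cleaning step before any model sees the text. The following is a minimal sketch of such a step; the function name `normalize_review` and the choice to keep at most two repeated characters are assumptions made for illustration, not the pre-processing used in the studies reviewed here.

```python
import re
import string


def normalize_review(text: str, max_repeat: int = 2) -> str:
    """Lightly clean a raw review before tokenization (illustrative only)."""
    # Collapse runs of one character longer than max_repeat ("goooood" -> "good").
    pattern = r"(.)\1{%d,}" % max_repeat
    text = re.sub(pattern, lambda m: m.group(1) * max_repeat, text)
    # Replace ASCII punctuation with spaces; Tamil letters are left untouched.
    text = text.translate(str.maketrans({c: " " for c in string.punctuation}))
    # Squeeze the whitespace left behind by the previous steps.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_review("Sooooo goooood!!!  படம்..."))
```

Capping repetitions at two rather than one keeps legitimate double letters intact; whether that is the right threshold for Tamil text would need to be checked on real data.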

Sentiment analysis is useful in various fields because it categorizes opinions as either positive or negative. Opinions are very important for various activities, such as buying a product or watching a movie. It can be challenging to settle on an overall opinion as positive or negative, as everyone has their own opinion on the product or the movie. Opinions are private expressions, which cannot be directly observed by others. The three levels of sentiment analysis are: sentence level, aspect level, and document level.

One of the most used levels of sentiment analysis is the sentence level, which we have used to analyse the sentiment of movie reviews. It first determines whether the sentence is objective or subjective. If the sentence is subjective, sentence-level SA then determines whether the opinion is positive or negative. Although sentiment analysis is commonly applied to English, it is applied less to regional languages such as Tamil because of their poor resources. Yet as the number of people using social media increases, so does the number of people who share their opinions in their own languages.

Natural Language Processing:
This section introduces the various techniques and tools utilized in the development of natural language processing. NLP is focused on the study and development of methods that can be used to analyse language and perform tasks such as speech recognition, machine translation, text analysis and text summarization. Text manipulation is also widely used in the field of natural language processing.




Studies in the field of natural language processing go through various steps in the process of translating and processing text. These include analysing and recognizing words and their parts of speech, parsing, and identifying the patterns and meaning of the text. Sentiment analysis is one of the NLP techniques used to identify the sentiment of a text as positive, negative, or neutral; it also relies on these steps to find the meaning and pattern of the text and determine its sentiment.

Sentiment Analysis:
Sentiment analysis is a popular task in natural language processing. The goal of sentiment analysis is to classify text based on the mood or mentality it expresses, which can be positive, negative, or neutral. (“Aspect Modelling in Sentiment Analysis - GeeksforGeeks”) In this section, we discuss the various techniques that are used in sentiment analysis. According to one survey, 80% of the world’s data is unstructured. "The data needs to be analysed and be in a structured manner whether it is in the form of emails, texts, documents, articles, and many more." (“What is Sentiment Analysis? - GeeksforGeeks”)
  • Sentiment analysis is required as it stores data in an efficient, cost-friendly manner. (“What is Sentiment Analysis? - GeeksforGeeks”)
  • "Sentiment analysis solves real-time issues and can help you solve all real-time scenarios." (“What is Sentiment Analysis? - GeeksforGeeks”)


In the past, sentiment classification was done using manual annotators. Although this is more accurate, it can be very time-consuming and expensive to manually code each document. One study described two types of methods used in sentiment classification: one is dictionary-based, and the other is based on machine learning. Although both methods are commonly used, machine learning is more likely to perform better, because the dictionary-based method depends on the varying size of its lexicon, which also makes the task take longer to complete. To improve the efficiency of sentiment classification, another study implemented a rule-based approach. The main features of this approach are part-of-speech tagging, tokenization, and stemming. Its implementation involved various machine learning techniques such as KNN, Naïve Bayes, Random Forest, scikit-learn's SVC and Decision Tree. Random Forest achieved an accuracy of 80%, which is higher than the other techniques. However, the results of the experiment conducted on the Amazon Fine Food reviews show that Linear SVC performed well.
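To make the dictionary-based method concrete, the sketch below scores a sentence by counting words found in a positive and a negative word list. The two word sets and the function name `lexicon_sentiment` are hypothetical placeholders; a real system would load a curated sentiment lexicon for the target language.

```python
# Hypothetical toy lexicon, for illustration only.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "boring", "poor"}


def lexicon_sentiment(text: str) -> str:
    """Label a sentence by comparing counts of positive and negative words."""
    tokens = text.lower().split()
    # Each positive word adds 1 to the score, each negative word subtracts 1.
    score = sum((tok in POSITIVE) - (tok in NEGATIVE) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("Great story but boring songs and poor editing"))
```

The sketch also shows why lexicon coverage matters: any word missing from the two sets contributes nothing to the score, so a small lexicon labels many sentences as neutral.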

Sentiment Analysis Based on Deep Learning Approaches:
This section reviews studies that investigated the use of deep learning techniques in sentiment analysis. Zouzou and El Azami (2021) presented an approach that combines a text function model with sentiment-specific word embeddings. The authors used the word embedding technique known as "GloVe" instead of the "Amsterdam embeddings" used in an earlier study, and focused on developing a text sentiment classification algorithm that combines the CNN-GRU model with GloVe word embeddings. For this experiment, the researchers used the 50k movie reviews of the IMDB dataset. After pre-processing the data, they first created an embedding layer that represents each word as a 320-dimensional vector. They then trained CNN, GRU and CNN-GRU models on the data. The proposed GRU and CNN-GRU models were trained with two optimization functions, Adam and Adadelta. During the training phase, these models achieved accuracies of 86.34% for GRU and 82.25% for CNN-GRU.

Another study used the ChnSentiCorp dataset, which contains hotel reviews. In the experiment, various methods were applied, such as SVM, CNN and Att-CNN with word2vec, and CNN with BERT. The results show that BERT-CNN outperforms word2vec-CNN and word2vec-SVM, achieving the highest accuracy of 90%. Tan et al. (2022), in turn, adopted RoBERTa, a robustly optimized version of the BERT model, instead of BERT because of BERT's undertrained nature.
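In the CNN-based hybrids reviewed here, the convolutional layer slides filters over the sequence of word vectors and passes the resulting feature sequence on to a recurrent layer such as a GRU. The sketch below illustrates only the convolution and pooling step, in plain Python; the names `conv1d_relu` and `global_max_pool`, the two-dimensional toy vectors and the single filter are assumptions for illustration, not the configuration of any cited model.

```python
from typing import List

Vector = List[float]


def conv1d_relu(sequence: List[Vector], kernel: List[Vector],
                bias: float = 0.0) -> List[float]:
    """Slide one filter over a sequence of word vectors (no padding, stride 1)."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Dot product between the filter and the current window of vectors.
        total = bias + sum(
            w * x
            for k_row, x_row in zip(kernel, window)
            for w, x in zip(k_row, x_row)
        )
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def global_max_pool(features: List[float]) -> float:
    """Keep the strongest response of the filter across the sentence."""
    return max(features)


# Four "words", each a 2-dimensional toy embedding.
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]
bigram_filter = [[1, 0], [0, 1]]  # covers two adjacent words
features = conv1d_relu(sentence, bigram_filter)
print(features, global_max_pool(features))
```

In a hybrid model the list returned by `conv1d_relu` (one value per window and per filter) would be the input sequence of the recurrent layer, whereas a plain CNN classifier would typically pool it first.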

Tan et al. (2022) propose a hybrid approach that combines RoBERTa and LSTM. This hybrid model was evaluated on three different datasets, the US Airline Twitter dataset, the IMDB movie reviews, and Sentiment140, and achieves F1-scores of 0.93, 0.91 and 0.90, respectively. From this result we can see that the model's performance varies with the domain of the dataset.

Another study took a different approach by using a bi-directional propagation model called BiLSTM instead of the traditional LSTM. This model has the advantage of understanding the meaning of a sentence through its bi-directional propagation mechanism. The dataset consists of user reviews from the Amazon e-commerce platform. To convert each word into a vector, the authors used the well-known one-hot encoding approach. After performing various functions, the proposed system reaches an accuracy of 90%. Bi-LSTM was also used in an experiment to predict people's mental health condition. For that experiment, Bi-LSTM and CNN were combined into a hybrid model that uses Twitter data to classify whether a user is depressive or non-depressive. The authors also compared the hybrid model with traditional models such as RNN and CNN. The hybrid CNN-BiLSTM model performed well in terms of accuracy, but the BiLSTM-CNN model, which places the Bi-LSTM layer before the CNN layer, outperforms the CNN-BiLSTM model.

In this subsection, we discussed various deep learning mechanisms in sentiment analysis. However, these advanced hybrid approaches were tested mostly on the English language. The output of these models will vary depending on the domain they are working in and the language being processed. In the next section, we will explore the techniques that are used to find sentiment in different languages.
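The bi-directional propagation mechanism mentioned above can be sketched with a toy recurrent cell: one pass reads the sequence left to right, a second pass reads it right to left, and the two hidden states are paired at every position. A simple scalar tanh cell stands in for the LSTM here, and both passes share the same illustrative weights; real BiLSTM layers use gated cells and separate weights for each direction.

```python
import math
from typing import List, Tuple


def run_rnn(inputs: List[float], w_in: float = 0.5,
            w_rec: float = 0.5) -> List[float]:
    """Toy recurrent cell: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def bidirectional(inputs: List[float]) -> List[Tuple[float, float]]:
    """Pair a left-to-right state with a right-to-left state per position."""
    forward = run_rnn(inputs)
    # Run over the reversed sequence, then flip back to align positions.
    backward = run_rnn(inputs[::-1])[::-1]
    return list(zip(forward, backward))


for fwd, bwd in bidirectional([1.0, 0.0, -1.0]):
    print(round(fwd, 3), round(bwd, 3))
```

At the first position the forward state has seen only the first input, while the backward state already summarizes the whole sequence; this is the extra context a bi-directional layer contributes.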

Sentiment Analysis in various languages using deep learning:
In this section we investigate studies that apply sentiment analysis to text in different languages. One study proposed deep learning methods that classify movie and hotel reviews in the Persian language as either positive or negative. The authors tested these methods against various ML approaches, such as SVM and logistic regression, and against DL approaches such as CNN and LSTM. To train the models, they first converted each Persian word into a vector using fastText word embeddings, and then used CNN and LSTM layers to extract the features. Bi-LSTM has the highest accuracy on the movie review data, while the 2D CNN has the highest accuracy on the hotel data. This shows that the result of a model varies with the dataset.
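fastText builds a word's vector from its character n-grams, wrapping the word in boundary markers `<` and `>` before extracting them. The sketch below shows only the n-gram extraction; the function name `char_ngrams` is a placeholder, and the range of 3 to 6 characters follows fastText's default settings rather than anything stated in the studies above.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """List the character n-grams of a word, including boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    ]


print(char_ngrams("where", 3, 3))
```

Because a vector can be assembled from these sub-word pieces, a word that never appeared in the training corpus still receives a representation, which is relevant for morphologically rich languages.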




The objective of another study was to classify the sentiment of text in the Urdu language. The authors used the UCSA dataset to train the models and introduced LSTM and 1D CNN deep learning approaches. These were compared with traditional ML approaches (LR, SVM, RF, NB, MLP, AdaBoost). The results revealed that the machine learning model LR achieved the highest accuracy of all. The main reason for the poor performance of the deep learning models was the lack of vocabulary in the pre-trained fastText model.

Instead of fastText embeddings, a further study used word2vec. The goal of this study was to analyse the performance of various word embeddings on Roman Urdu and English. Its bidirectional LSTM uses two separate LSTM layers that process the text in opposite directions. The model was tested on four datasets, RUSA-19, MDPI, UCL, and RUSA, and the study found that SVM with word2vec and CBOW provided better results in analysing sentiment in Roman Urdu than in English. The combination of BERT, two layers of LSTM and SVM performed well in analysing the sentiment of English text.

Gowandi et al. (2021) show how hybrid architectures perform in analysing Indonesian sentiment in e-commerce reviews. They implemented combined models (LSTM-CNN, CNN-LSTM, GRU-CNN, CNN-GRU) as well as standard models (LSTM, CNN, GRU). The results show that the best combined models, CNN-LSTM and CNN-GRU, reach accuracies of 82.71% and 82.69%, with a third combined model at 82.56%, and that they perform better than the standard models. Similarly, models such as deep LSTM, GRU, and CNN were developed for Arabic sentiment analysis, extracting features from character-level representations. The results of that study also revealed that combining the architectures as CNN-LSTM can improve the performance of the models, with an accuracy of 95.14%.

Although the accuracy of some models was not the same as when they were trained on English, the hybrid models improved on the performance of the previous models. This motivated us to continue looking into the performance and capabilities of hybrid models for the Tamil language. The field of natural language processing is still in its early stages for Tamil due to the agglutinative nature of the language. In the following subsection, we therefore focus on recent work that has been performed on the Tamil language.

Sentiment Analysis in Tamil Language:
This section provides the information needed to understand and develop effective strategies for performing natural language processing in Tamil, whose agglutinative nature and grammatical structure make the task more challenging. One study compared the performance of LSTM and BiLSTM networks in terms of their ability to analyse Tamil tweets. Data pre-processing was done by removing various symbols and punctuation marks. The authors then trained a word2vec model to convert the words from the Tamil tweets into vectors. Since the language has special compound characters, these are treated as combined characters instead of individual characters, which makes the task different from other NLP tasks. The experiment shows that the BiLSTM model, with an accuracy of 86.2%, performed better than the LSTM model, which achieved an accuracy of 77.2%.

In another experiment, a method was proposed to classify users' sentiments from tweets in Tamil and Malayalam. After pre-processing the collected data, the models were trained using deep learning techniques. The results show that the LSTM approach achieved an accuracy of 97.71% on the Tamil dataset and 97.23% on the Malayalam dataset.

A further study considers the Tamil and English data set from the FIRE 2021 database. To address the class imbalance problem, resampling is performed. The proposed model was analysed with both pre-processed and raw data, and the results show that the pre-processing techniques improve the accuracy. This research focused on developing hybrid deep learning models composed of CNN+LSTM, LSTM+CNN, CNN+BiLSTM and BiLSTM+CNN. The results indicate that the CNN+BiLSTM hybrid deep learning model is the most effective at analysing the sentiment of Tamil code-mixed data, achieving an accuracy of 66%.
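The remark about compound characters can be illustrated with Python's standard `unicodedata` module. In Unicode, Tamil vowel signs and the pulli are combining marks that follow a base consonant, so a written letter such as "மி" is stored as two code points. The sketch below groups each base character with the marks that follow it; the function name `tamil_units` is a placeholder, and the rule is a simplification of full grapheme-cluster segmentation (it does not merge conjuncts, for example), not the method used in the study described above.

```python
import unicodedata
from typing import List


def tamil_units(word: str) -> List[str]:
    """Group each base character with its following combining marks."""
    units: List[str] = []
    for ch in word:
        # Vowel signs and the pulli belong to the Unicode "Mark" categories.
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units


word = "தமிழ்"
print(len(word), tamil_units(word))
```

The word has five code points but only three written units, which is why splitting Tamil text by individual code points would break letters apart.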

Other studies have addressed the Tamil language in terms of text classification using natural language processing. For instance, in one experiment Tamil news reports were collected and classified into topics such as politics, cinema and sports. The authors compared machine learning methods that extract features from words using TF-IDF with a deep learning CNN model that uses pre-trained Word2Vec word embeddings. The study revealed that the models trained with Word2Vec and CNN performed better than the machine learning technique. In addition, due to the presence of new tokens in the data, the recall and F1-score on the politics test set were lower than those on sports and cinema. In this section, we have seen a few studies that were conducted on the Tamil language in the field of sentiment analysis.
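The TF-IDF features used by the machine learning baseline can be sketched in a few lines. The version below uses the plain weighting tf × log(N / df); libraries such as scikit-learn apply smoothed variants, so the exact numbers differ. The function name `tf_idf` and the toy English documents are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by its frequency in a document and rarity overall."""
    n_docs = len(docs)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


docs = [["election", "vote", "minister"], ["film", "actor", "vote"]]
for weights in tf_idf(docs):
    print(weights)
```

A term that occurs in every document, such as "vote" here, receives a weight of zero, while topic-specific terms keep a positive weight. A token never seen during training has no feature at all, which is one way new tokens can lower recall.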

Tools and Libraries Used for Tamil Language:
Arora (2020) presents iNLTK, an NLP library that supports various Indic languages, including English, Hindi, Tamil, Malayalam, and Telugu. It also supports code-mixed languages such as Malayalam and English, Tamil and English, and so on. The iNLTK library also provides pre-trained models for data augmentation, word embeddings, tokenization and so on in various Indic languages. In that paper, the author used a pre-trained language model from the iNLTK library for text classification. The experimental results show that this helps to achieve more than 95% accuracy.

Another study proposes a methodology for identifying religious extremism-based threats in Sri Lanka using social media data from tweets. In this study, several tools were used while pre-processing the Tamil and Sinhala text. The pre-processing of Tamil was done with existing tools such as IndicNLP and Ripple Tagger. Ripple Tagger is a Python library used for part-of-speech (POS) tagging. A POS tagger identifies the part of speech of each word in a sentence.

A further study presents a state-of-the-art, contextual POS tagger called ThamizhiPOSt. The authors describe how they developed ThamizhiPOSt, a POS tagger for Tamil that uses the Stanza neural-based toolkit, and compare it with other Tamil POS taggers. In this comparison, ThamizhiPOSt reaches an accuracy of 93.27%. The authors also discuss the various tools and resources available for Tamil POS tagging. ThamizhiPOSt can also generate POS-tagged data that supports other applications such as spell checkers and translation. Using POS tags, they were able to identify the meaning of the neighbouring words.

Conclusion:
These studies revealed the techniques that are being used to understand the structure of language and its meaning. Taking them into account, this study presents hybrid methods and compares them to find the best model for analysing the sentiment of Tamil movie reviews. According to the above literature review, besides the CNN-BiLSTM model, hybrid models such as CNN-LSTM and CNN-BiGRU also perform well in classifying the sentiment of text, although the accuracy of these models varies with the domain and the language. In the study conducted on Tamil-English code-mixed data, CNN-BiLSTM performed better than the other models. However, in the proposed work, CNN-BiGRU achieves a better result than CNN-BiLSTM.
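Comparing models on accuracy, recall and F1-score, as planned for the hybrid models, comes down to counting the four outcomes of a binary classifier. The sketch below computes the metrics for one class; the function name `binary_metrics` and the toy labels are assumptions for illustration and do not reproduce any result reported in this paper.

```python
from typing import Dict, List


def binary_metrics(y_true: List[str], y_pred: List[str],
                   positive: str = "positive") -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for one class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


gold = ["positive", "positive", "negative", "negative", "positive"]
pred = ["positive", "negative", "negative", "positive", "positive"]
print(binary_metrics(gold, pred))
```

Accuracy counts all correct predictions, while F1 looks only at the chosen class, so the two can diverge noticeably when the classes are imbalanced.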

References:
  • Gowandi, T., Murfi, H. and Nurrohmah, S. (2021). Performance analysis of hybrid architectures of deep learning for Indonesian sentiment analysis, International Conference on Soft Computing in Data Science, Springer, pp. 18–27.
  • Chowdhary, K. (2020). Natural language processing, Fundamentals of Artificial Intelligence, pp. 603–649.
  • Arora, G. (2020). iNLTK: Natural language toolkit for Indic languages, arXiv preprint arXiv:2009.12534.
  • Ramanathan, V., Meyyappan, T. and Thamarai, S. (2021). Sentiment analysis: An approach for analysing Tamil movie reviews using Tamil tweets, Recent Advances in Mathematical Research and Computer Science 3: 28–39.
  • Tan, K. L., Lee, C. P., Anbananthen, K. S. M. and Lim, K. M. (2022). RoBERTa-LSTM: A hybrid model for sentiment analysis with transformer and recurrent neural network, IEEE Access 10: 21517–21525.
