NLP: Predicting the mass sentiment response to the Covid vaccination based on Tweets.

yrodriguezmd/tweet_sentiment_covid_vaccination_Rodriguez

Are people happy with Covid vaccination? A Sentiment Analysis on Twitter Data

Predicting the mass sentiment response to the Covid vaccination based on Tweets

Rodriguez, Maria | July 2021

Study Objective:

Twitter is a widely used venue for public expression. This study uses tweet data to gain perspective on the general response to Covid vaccination.

Two datasets were used:

  1. A Main data set containing details of tweets pertaining to Covid vaccination, and
  2. A Training data set containing tweet texts with positive or negative sentiment labels.

Main Data Set

The Main data set contained 16 variables and 125 906 observations. The contents are details of tweets pertaining to Covid vaccination posted from December 2020 to June 2021. Four variables have some missing values.

Data Source: https://www.kaggle.com/gpreda/all-covid19-vaccines-tweets

The Training Data Set

The Training set contained 2 variables (text and target) and 1 600 000 observations. This automatically generated training set is widely used for sentiment analysis, but I have not seen a study validating it, so its labels should be treated with caution.

Data Source: http://help.sentiment140.com/for-students

Methodology

The analyses and modelling were performed in a Jupyter notebook using the following libraries: pandas, numpy, matplotlib, seaborn, string, re, nltk, sklearn and wordcloud. Text preprocessing combined manual cleaning with nltk utilities. Vectorization was performed using TfidfVectorizer. The models compared were SGDClassifier and BernoulliNB. The final model used SGDClassifier with default settings.
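As a rough illustration of the workflow described above, the following sketch chains a simple cleaning function, TfidfVectorizer and SGDClassifier. It is not the notebook's actual code: the `clean_tweet` function, the tiny `STOPWORDS` set (a stand-in for nltk's stopword list) and the six training sentences are all assumptions made up for this example.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for nltk's English stopword list, kept tiny so the
# sketch runs without downloading any nltk corpora.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "to", "of", "this", "my", "i", "was", "so"}

URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Lowercase, drop URLs, mentions, punctuation and digits, then remove stopwords."""
    text = URL_OR_MENTION.sub(" ", text.lower())
    text = text.translate(PUNCT_TABLE)
    text = re.sub(r"\d+", " ", text)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


# Toy labelled examples invented for illustration: 1 = positive, 0 = negative.
train_texts = [
    "Got my vaccine today, feeling great and grateful!",
    "So happy the clinic was fast and friendly",
    "Love that my parents are finally protected",
    "My appointment was cancelled again, so frustrating",
    "Terrible wait and no updates from anyone",
    "Hate this awful booking system",
]
train_labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feed a linear classifier trained by stochastic gradient descent.
# Settings are the defaults, except for a fixed seed so repeated runs agree.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_tweet),
    SGDClassifier(random_state=0),
)
model.fit(train_texts, train_labels)

predictions = model.predict(["Feeling grateful and happy after my vaccine"])
```

Wrapping the vectorizer and classifier in one pipeline means that the same cleaning and vocabulary are applied to the training texts and to any tweets scored later.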

Results and Recommendations:

  1. The number of tweets on Covid vaccination has progressively increased since December 2020. Most users are from India and the US, with good representation from most continents.

Figure 1. Word Cloud of Covid Vaccination Tweets

--> Since the problem is global, it is good to analyze it with a tool that has wide reach, such as Twitter. Understandably, technological and social barriers limit who uses the application, but it is an acceptable tool for gaining a general picture.

  2. Using a data set of tweet texts labelled as either positive or negative, a Stochastic Gradient Descent classifier model was trained, yielding 75% accuracy in predicting text sentiment.

Figure 2. ROC_AUC curve for SGD Classifier

--> Because acceptance of vaccination by the general population is so important, it is incumbent on the Health department to monitor the public response. This model will enable staff to evaluate the distribution of positive and negative sentiments in tweets posted by the public.

--> Further improvement in accuracy may be obtained by adding a Neutral sentiment label, or by using Embedding models and deep learning. However, for general purposes, the SGD Classifier model should be effective.
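To make the two evaluation measures concrete, here is a small pure-Python sketch of how accuracy and ROC_AUC can be computed from true labels and classifier scores. The labels and scores are invented toy values, not the study's data, and the pairwise AUC loop is written for readability; for real use, `sklearn.metrics.accuracy_score` and `sklearn.metrics.roc_auc_score` would be the usual choice.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """ROC AUC as the chance that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy values: true labels and decision scores (positive score = positive class).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [2.1, 0.4, -0.3, -1.2, 0.6, -0.8, 1.5, -0.1]
y_pred = [1 if s > 0 else 0 for s in scores]

acc = accuracy(y_true, y_pred)  # share of correct labels at the 0 threshold
auc = roc_auc(y_true, scores)   # threshold-free ranking quality
```

Accuracy depends on the chosen decision threshold, while ROC_AUC summarizes how well the scores rank positive tweets above negative ones across all thresholds.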

  3. The predominant sentiment in the Covid vaccination tweets is positive.

Figure 3. Country distribution of positive and negative tweets (top 10)

Figure 4. Tweet sentiment in Canada

--> This suggests that current measures are well accepted.
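A country-level breakdown like the one in Figure 3 can be sketched as a simple tally once each tweet has a predicted label. The rows below are invented, and the `country` and `sentiment` field names are hypothetical; the original notebook may derive the country and store predictions differently.

```python
from collections import Counter, defaultdict

# Invented rows: sentiment 1 = positive, 0 = negative.
rows = [
    {"country": "India", "sentiment": 1},
    {"country": "India", "sentiment": 1},
    {"country": "India", "sentiment": 1},
    {"country": "India", "sentiment": 0},
    {"country": "US", "sentiment": 1},
    {"country": "US", "sentiment": 0},
    {"country": "Canada", "sentiment": 1},
    {"country": "Canada", "sentiment": 1},
    {"country": "Canada", "sentiment": 1},
]


def sentiment_share_by_country(rows, top_n=10):
    """Return tweet count and positive share for the top_n countries by volume."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["country"]][row["sentiment"]] += 1

    # Rank countries by their total number of tweets, largest first.
    ranked = sorted(counts.items(), key=lambda item: sum(item[1].values()), reverse=True)

    summary = {}
    for country, tally in ranked[:top_n]:
        total = sum(tally.values())
        summary[country] = {"tweets": total, "positive_share": tally[1] / total}
    return summary


summary = sentiment_share_by_country(rows, top_n=2)
```

Reporting the tweet count alongside the positive share helps avoid over-reading countries that contribute only a handful of tweets.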

  4. Popular tweets and users may be identified.

Figure 5. Plot of the number of users' friends and followers

--> Popular users who generally publish positive tweets should be encouraged to continue or increase their activity so that they act as advocates for vaccination.

--> Popular negative tweets may be analyzed to identify problems. This set of tweets shows discontent about cancelled appointments and frustration over a lack of updates. The Health department can address these problems to improve public acceptance.
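The two recommendations above can be sketched as simple filters over labelled tweets. Everything here is an assumption for illustration: the field names (`user_name`, `user_followers`, `retweets`, `sentiment`, `text`), the sample records, and the follower and positive-share thresholds are hypothetical and would need tuning against the real data.

```python
# Invented sample records: sentiment 1 = positive, 0 = negative.
tweets = [
    {"user_name": "alice", "user_followers": 50000, "retweets": 10, "sentiment": 1, "text": "Got my jab!"},
    {"user_name": "alice", "user_followers": 50000, "retweets": 5, "sentiment": 1, "text": "Clinic was great"},
    {"user_name": "bob", "user_followers": 200, "retweets": 1, "sentiment": 1, "text": "Vaccinated today"},
    {"user_name": "carol", "user_followers": 80000, "retweets": 40, "sentiment": 0, "text": "Appointment cancelled again"},
    {"user_name": "carol", "user_followers": 80000, "retweets": 15, "sentiment": 0, "text": "Still no update on dose two"},
    {"user_name": "dave", "user_followers": 12000, "retweets": 3, "sentiment": 1, "text": "Feeling fine"},
    {"user_name": "dave", "user_followers": 12000, "retweets": 2, "sentiment": 0, "text": "Long queue today"},
]


def positive_advocates(tweets, min_followers=10_000, min_positive_share=0.7):
    """Names of popular users whose tweets are mostly positive (thresholds are arbitrary)."""
    per_user = {}
    for t in tweets:
        stats = per_user.setdefault(
            t["user_name"], {"followers": t["user_followers"], "n": 0, "pos": 0}
        )
        stats["n"] += 1
        stats["pos"] += t["sentiment"]
    return sorted(
        name
        for name, s in per_user.items()
        if s["followers"] >= min_followers and s["pos"] / s["n"] >= min_positive_share
    )


def top_negative_tweets(tweets, n=3):
    """The n most retweeted tweets that were labelled negative."""
    negative = [t for t in tweets if t["sentiment"] == 0]
    return sorted(negative, key=lambda t: t["retweets"], reverse=True)[:n]


advocates = positive_advocates(tweets)
complaints = top_negative_tweets(tweets, n=2)
```

Because the classifier is only about 75% accurate, any shortlist produced this way would still need a manual read before acting on it.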

Conclusion

The sentiment of tweets may be determined with 75% accuracy. The tweets regarding Covid vaccination are predominantly positive. Details from the tweets may be mined to optimize vaccination practices.
