Since Twitter is a widely used venue for expression, the goal is to use the information available on the platform to gain perspective on the general response to Covid vaccination.
Two data sets were used:
- A Main data set containing tweet details pertaining to Covid vaccination, and
- A Training data set containing tweet texts with positive or negative sentiment labels.
The Main data set contained 16 variables and 125 906 observations. Its contents are details of tweets posted from December 2020 to June 2021 pertaining to Covid vaccination. Four of the variables have some missing values.
Data Source: https://www.kaggle.com/gpreda/all-covid19-vaccines-tweets
The Training set contained 2 variables (text and target) and 1 600 000 observations. It is an automatically generated training set that is widely used for sentiment analysis, but I have not seen a study validating it, so proceed with caution.
Data Source: http://help.sentiment140.com/for-students
The analyses and modelling were performed in a Jupyter notebook using the following libraries: pandas, numpy, matplotlib, seaborn, string, re, nltk, sklearn and wordcloud. Text preprocessing combined manual cleaning with nltk imports. Vectorization was performed using TfidfVectorizer. The models compared were SGD Classifier and BernoulliNB. The final model used SGD Classifier with default settings.
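The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `clean_tweet` helper, its cleaning rules, the toy sentences and the 0/1 label encoding are all assumptions made for the example, and the nltk steps are omitted.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical cleaning rules: drop URLs and @mentions, then punctuation.
URL_OR_MENTION = re.compile(r"https?://\S+|www\.\S+|@\w+")
PUNCTUATION = str.maketrans("", "", string.punctuation)


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions, punctuation and extra spaces."""
    text = URL_OR_MENTION.sub(" ", text.lower())
    text = text.translate(PUNCTUATION)
    return " ".join(text.split())


# Toy stand-in for the labelled training set (assumed: 1 = positive, 0 = negative).
train_texts = [
    "So happy I got my vaccine today!",
    "Great news, the clinic was quick and friendly",
    "My appointment was cancelled again, so frustrating",
    "Terrible experience, no updates at all",
]
train_labels = [1, 1, 0, 0]

# TF-IDF features feeding an SGD classifier left at its default settings
# (random_state is fixed only to make the sketch reproducible).
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean_tweet),
    SGDClassifier(random_state=0),
)
model.fit(train_texts, train_labels)

predictions = model.predict(["Feeling great after my vaccine!"])
```

Wrapping the vectorizer and the classifier in one pipeline means new tweets pass through the same cleaning and vectorization as the training texts.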
- Tweets on Covid vaccination have been progressively increasing since December 2020. Most users are from India and the US, with good representation on most continents.
Figure 1. Word Cloud of Covid Vaccination Tweets
--> Since the problem is global, it is good to analyze it with a tool that has wide reach, such as Twitter. Understandably, there are technological and social limitations on who uses the application, but it is an acceptable tool for gaining a general picture.
- Using a data set of Twitter texts labelled as either positive or negative, a Stochastic Gradient Descent classifier model was generated, yielding 75% accuracy in predicting text sentiment.
Figure 2. ROC-AUC curve for SGD Classifier
--> Because it is important that vaccination be accepted by the general population, it is incumbent on the Health department to monitor the public response. This model will enable staff to evaluate the distribution of positive and negative sentiments based on tweets generated by the public.
--> Further improvement in accuracy may be obtained by adding a Neutral sentiment label, or by using Embedding models and deep learning. However, for general purposes, the SGD Classifier model should be effective.
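As a rough illustration of how an accuracy figure and a ROC AUC of this kind can be obtained, the sketch below trains on a held-out split of a tiny made-up corpus. The sentences, the 0/1 labels and the split size are assumptions for the example, and the scores it yields say nothing about the 75% reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up labelled texts (assumed: 1 = positive, 0 = negative).
positive = [
    "love the vaccine rollout", "great clinic great staff",
    "happy to be vaccinated", "good news about the vaccine",
    "thankful for the nurses", "great day got my shot",
    "happy and relieved today", "love how quick it was",
]
negative = [
    "appointment cancelled again", "terrible wait at the clinic",
    "angry about no updates", "bad experience with booking",
    "frustrated by the delays", "terrible booking website",
    "angry and frustrated today", "bad news more delays",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# Stratify so that both classes appear in the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

model = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
model.fit(X_train, y_train)

# Accuracy uses hard predictions; ROC AUC uses the signed distance
# from the decision boundary as a continuous score.
accuracy = accuracy_score(y_test, model.predict(X_test))
auc = roc_auc_score(y_test, model.decision_function(X_test))
print(f"accuracy={accuracy:.2f}  roc_auc={auc:.2f}")
```

With the default hinge loss, SGDClassifier does not expose class probabilities, which is why the sketch scores the ROC with `decision_function` rather than `predict_proba`.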
- The predominant sentiment in the Covid vaccination tweets is positive.
Figure 3. Country distribution of positive and negative tweets (top 10)
Figure 4. Tweet sentiment in Canada
--> This suggests that current measures are well accepted.
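One way to tabulate sentiment by country, along the lines of Figures 3 and 4, is sketched below. The `country` and `sentiment` column names and the rows are hypothetical; in practice the predicted label would come from the classifier, and a country would first have to be derived from each user's free-text location.

```python
import pandas as pd

# Hypothetical frame: one row per tweet, with a country and a predicted label.
tweets = pd.DataFrame({
    "country": ["India", "India", "India", "USA", "USA",
                "Canada", "Canada", "Canada"],
    "sentiment": ["positive", "positive", "negative", "positive", "negative",
                  "positive", "positive", "positive"],
})

# Overall share of positive tweets.
overall_positive = tweets["sentiment"].eq("positive").mean()

# Counts per country and label, restricted to the 10 most active countries.
counts = pd.crosstab(tweets["country"], tweets["sentiment"])
top = counts.loc[counts.sum(axis=1).nlargest(10).index]

# Row-wise shares, so countries with different volumes can be compared.
shares = top.div(top.sum(axis=1), axis=0)
print(shares.round(2))
```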
- Popular tweets and users may be identified.
Figure 5. Plot of the number of users' friends and followers
--> Popular users who generally publish positive tweets should be encouraged to continue or increase their activity so that they act as advocates for vaccination.
--> Popular negative tweets may be analyzed to identify problems. This set of tweets shows discontent about cancelled appointments and frustration over the lack of updates. The Health department can address these problems to improve public acceptance.
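The two uses above could be approached roughly as in the following sketch. The column names (`user_name`, `user_followers`, `retweets`, `favorites`, `sentiment`), the engagement measure (retweets plus favorites) and all the rows are assumptions for illustration, not the notebook's actual method.

```python
import pandas as pd

# Hypothetical tweets with a predicted sentiment label attached.
tweets = pd.DataFrame({
    "user_name": ["a", "b", "c", "d", "e"],
    "text": [
        "appointment cancelled with no notice",
        "vaccinated today, thank you to the staff",
        "still no updates on my second dose",
        "long queue",
        "booked my shot",
    ],
    "user_followers": [200, 50000, 1500, 30, 900],
    "retweets": [5, 100, 40, 1, 3],
    "favorites": [10, 300, 60, 0, 7],
    "sentiment": ["negative", "positive", "negative", "negative", "positive"],
})

# Assumed popularity measure: retweets plus favorites.
tweets["engagement"] = tweets["retweets"] + tweets["favorites"]

# Most visible negative tweets: candidates for a closer read of the complaints.
top_negative = tweets[tweets["sentiment"] == "negative"].nlargest(2, "engagement")

# Positive tweeters with the largest audience: potential advocates.
advocates = tweets[tweets["sentiment"] == "positive"].nlargest(1, "user_followers")

print(top_negative[["user_name", "text", "engagement"]])
print(advocates[["user_name", "user_followers"]])
```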
The sentiment of tweets may be determined with 75% accuracy. The tweets regarding Covid vaccination are predominantly positive. Details may be harvested for optimization of vaccination practices.