Sentiment analysis and search for the most frequently used words in Donald Trump's tweets

January 4, 2021

Project Overview

the data has been downloaded from the website www.thetrumparchive.com;
the analysis was conducted on tweets published from January 20, 2017 to December 31, 2020;
only tweets in English were analyzed;
sentiment analysis was carried out using the TextBlob library;
the thematic modeling was performed with the Latent Dirichlet Allocation (LDA).

After retrieving the data, I had to clean it to make it usable for analysis. I made the following changes and created the following variables:

Created columns containing hashatgs, mentions, retweets;
To clear the text removed links, punctuation marks, numbers, emoji icons, multiple spaces and the characters ‘@’, ‘#’, ‘RT’;
Removed tweets with no text;
The languages in which the tweets were written were identified;
Tweets have been tokenized, lemmatized and stopwords removed.

As a result of the above operations, I received the following columns: ‘text’, ‘clean_tweet’, ‘date’, ‘retweeted’, ‘mentioned’, ‘hashtags’, ‘language’.

To perform the analysis, it was necessary to define for each tweet:

Subjectivity
Polarity
Sentiment

Then I checked how the number of tweets changed, taking into account the division into negative / neutral / positive tweets in individual years.

Finally, I conducted thematic modeling using Latent Dirichet Allocation and designated 10 topics that the analyzed tweets cover. For each topic, the 10 most frequently used words in tweets are highlighted.

Links:

GitHub repository

Open in Google Colab