Final project for Harvard CS109
This README gives an overview of what we are handing in: our project notebook, the non-standard Python libraries we used, and URLs to our project website and screencast video.
Project Notebook:
Main.ipynb is the project notebook we are handing in. It includes all the analysis we conducted. (Follow the link for the nbviewer version.)
Data:
All the data are saved in the raw_data folder:
- stop_word.txt: the list of stop words we used in one LDA analysis.
- emoji-data.txt: the table of emoji and what each emoji represents.
- negative-words.txt: the list of negative words.
- positive-words.txt: the list of positive words (a loading sketch is shown after this list).
- Raw data scraped from the Twitter API can be found here: Raw Data on Google Drive
- Working dataframes (dftokens.csv, dftweets.csv) can be found here: Raw Data on Google Drive
- dftokens.xls is the processed data for the LDA I analysis.
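As a rough illustration of how the lexicon files above might be loaded, here is a minimal sketch. It assumes the files contain one token per line (the Hu & Liu sentiment lexicons also start with ";" comment lines and use latin-1 encoding); the simple positive-minus-negative score is only illustrative, not necessarily the scoring used in the notebook.

```python
# Minimal sketch (file format assumed: one token per line, ';' lines are comments).
def load_wordlist(path):
    words = set()
    with open(path, encoding="latin-1") as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith(";"):
                words.add(line)
    return words

stop_words = load_wordlist("raw_data/stop_word.txt")
positive_words = load_wordlist("raw_data/positive-words.txt")
negative_words = load_wordlist("raw_data/negative-words.txt")

def sentiment_score(tokens):
    """Naive score for a tokenized tweet: positive hits minus negative hits."""
    return sum(t in positive_words for t in tokens) - sum(t in negative_words for t in tokens)
```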
Python Libraries:
- pytz: allows us to convert times to the correct timezone.
- gensim: word-processing and topic-modeling package.
- ast: allows us to use user-defined functions in Spark.
- tweepy: allows us to scrape Twitter data (see the sketch after this list).
- xlrd: extracts data from Excel spreadsheets.
- nltk: the Natural Language Toolkit.
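To show how the scraping and timezone libraries fit together, here is a minimal sketch. The credentials, the "thanksgiving" query, and the US/Eastern timezone are placeholder assumptions, and tweepy's search method name differs across versions; the actual scraping code lives in the notebook.

```python
import tweepy
import pytz

# Placeholder credentials; tweepy < 4 exposes api.search(...),
# while tweepy >= 4 renames it api.search_tweets(...).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

eastern = pytz.timezone("US/Eastern")

for tweet in api.search(q="thanksgiving", count=100):
    created = tweet.created_at
    # Older tweepy returns naive UTC datetimes; localize before converting.
    if created.tzinfo is None:
        created = pytz.utc.localize(created)
    print(created.astimezone(eastern), tweet.text[:80])
```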
Project Website: http://thanksgivingontwitter.weebly.com/
Project Screencast: https://www.youtube.com/watch?v=UCImCJhoTgc