A project written in R and Python to mine a Reddit corpus.
- Python 3
- PRAW
- requests
- bs4
- numpy
- fuzzywuzzy
- nltk
- matplotlib
Recommended: install the Python packages in a virtual environment. Install each one using `pip install -U <package-name>`. NLTK also requires that you download its tokenizer and stopword corpora for the English language.
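The NLTK corpora mentioned above can be fetched from Python once the package is installed. The following is a minimal sketch, not part of the project's scripts. The resource names `punkt` and `stopwords` are assumptions based on the description "tokens and stopwords", and `download_nltk_resources` is a hypothetical helper name.

```python
# Assumed resource names: "punkt" (tokenizer models) and "stopwords".
# The project may need different or additional resources.
NLTK_RESOURCES = ("punkt", "stopwords")


def download_nltk_resources(resources=NLTK_RESOURCES):
    """Download each resource and record whether NLTK reported success."""
    # Imported here so this file can be loaded before NLTK is installed.
    import nltk

    return {name: bool(nltk.download(name)) for name in resources}
```

Call `download_nltk_resources()` once from inside the virtual environment; it needs network access. Some recent NLTK releases ask for `punkt_tab` instead of `punkt`. If the tokenizer raises a `LookupError`, the message names the resource to download.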
- R
- sna
- ggnetwork
- svglite
- igraph
- intergraph
- rsvg
- ggplot2
Install using `install.packages("<package-name>")`.
- Create a Reddit account and, while logged in, navigate to preferences > apps.
- Click on the `Are you a developer? Create an app...` button.
- Fill in the details:
  - name: Name of your bot/script
  - Select the option 'script'
  - description: Put in a description of your bot/script
  - redirect uri: `http://localhost:8080`
- Click on `Create App`.
- You will be given a `client_id` and a `client_secret`. Keep them confidential.
- Sign up for or log in to Google BigQuery.
- Select or create a new project and click on 'Compose Query'.
- Paste the contents of the SQL script in the `subreddit-viz` folder into the editor and run it.
- Download the generated CSV file as `reddit-edge-list.csv` and save it within the `subreddit-viz` folder.
- To obtain the subreddit visualizations, run the R script using `R CMD BATCH reddit.R`. Before running it, create an empty folder called `subreddit-groups` in the same folder as the script.
- Create a file named `praw.ini` with its contents as:

      [<bot-name>]
      username: reddit username
      password: reddit password
      client_id: client_id that you got
      client_secret: client_secret that you got

- Run the script `getdata.py` via `python3 getdata.py`.
- It should scrape all the necessary data in approximately 20-25 minutes.
- Run `analysis.py` using `python3 analysis.py [args]`. The arguments the script accepts are:
  - no arguments - Runs sentiment analysis on the entire data.
  - `-h` or `--help` - Prints the usage details.
  - `-w string type` or `--words string type` - Generates a word distribution for the given string and type (positive or negative). Requires that sentiment analysis has already been run for that string.
  - `string` - Looks for similar strings in the corpus and performs sentiment analysis on them.
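To make the argument list above concrete, here is a minimal `argparse` sketch of a command line with the same shape. It is an illustration only and is not taken from `analysis.py`, which may parse its arguments differently. The names `build_parser` and `describe` are hypothetical, and `describe` only reports which mode would run rather than performing any analysis.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="analysis.py",
        description="Sentiment analysis over the scraped Reddit data.",
    )
    # -w/--words takes two values: the search string and the sentiment type.
    parser.add_argument(
        "-w", "--words",
        nargs=2,
        metavar=("STRING", "TYPE"),
        help="word distribution for STRING; TYPE is 'positive' or 'negative'",
    )
    # Optional positional: when omitted, the entire data set is analysed.
    parser.add_argument(
        "string",
        nargs="?",
        help="look for similar strings in the corpus and analyse them",
    )
    return parser


def describe(argv):
    """Return a description of the mode selected by the given arguments."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.words is not None:
        term, kind = args.words
        if kind not in ("positive", "negative"):
            parser.error("TYPE must be 'positive' or 'negative'")
        return f"word distribution: {kind} words for '{term}'"
    if args.string is not None:
        return f"sentiment analysis for strings similar to '{args.string}'"
    return "sentiment analysis on the entire data"
```

For example, `describe(["-w", "cats", "positive"])` selects the word-distribution mode, `describe(["cats"])` selects the similar-string mode, and `describe([])` falls back to analysing everything. `argparse` supplies `-h`/`--help` automatically.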
- Max Woolf's blog post on subreddit visualizations was of great help in making this project.