pranau97/reddit-opinion-mining

Reddit Opinion Mining and Sentiment Analysis

A project written in R and Python to mine a Reddit corpus.

Requirements

Python and its dependencies

  1. Python 3
  2. PRAW
  3. requests
  4. bs4
  5. numpy
  6. fuzzywuzzy
  7. nltk
  8. matplotlib

Recommended: install the Python packages in a virtual environment.

Install each package using pip install -U <package-name>. NLTK also requires that you download its English tokenizer and stopword corpora.
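
As a rough sketch, the setup above could be verified from Python before running any script. The import names below are the usual ones for the listed packages, and the NLTK resource names ("punkt" and "stopwords") are assumptions, since the README does not name them. The helper function names are hypothetical.

```python
import importlib.util

# Import names for the packages listed above (PRAW is imported as "praw").
PYTHON_PACKAGES = ["praw", "requests", "bs4", "numpy",
                   "fuzzywuzzy", "nltk", "matplotlib"]

# Assumed resource names; newer NLTK releases may ask for "punkt_tab"
# instead of "punkt".
NLTK_RESOURCES = ["punkt", "stopwords"]


def missing_packages(names=PYTHON_PACKAGES):
    """Return the packages from `names` that cannot be imported."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


def download_nltk_data(resources=NLTK_RESOURCES):
    """Download each NLTK resource and report success per resource."""
    import nltk  # imported lazily so missing_packages() works without NLTK
    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling missing_packages() lists anything that still needs a pip install. Once it returns an empty list, download_nltk_data() only has to be run once per environment.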

R and its dependencies

  1. R
  2. sna
  3. ggnetwork
  4. svglite
  5. igraph
  6. intergraph
  7. rsvg
  8. ggplot2

Install using install.packages("<package-name>").

Obtaining Reddit API access credentials

  1. Create a Reddit account, and while logged in, navigate to preferences > apps
  2. Click on the Are you a developer? Create an app... button
  3. Fill in the details:
    • name: Name of your bot/script
    • type: Select the option 'script'
    • description: Put in a description of your bot/script
    • redirect uri: http://localhost:8080
  4. Click on Create App.
  5. You will be given a client_id and a client_secret. Keep them confidential.
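
These credentials go into the praw.ini file described under Running the scripts. As a minimal sketch, a check like the following could confirm that the file has all four fields filled in before anything tries to log in. The function name and the sample section name are hypothetical.

```python
import configparser

# The four fields the praw.ini section in this README contains.
REQUIRED_KEYS = ("username", "password", "client_id", "client_secret")


def missing_praw_keys(ini_text, site_name):
    """Return required keys that are absent or empty in [site_name]."""
    # interpolation=None so a '%' in a password is read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(site_name):
        return list(REQUIRED_KEYS)
    section = parser[site_name]
    return [key for key in REQUIRED_KEYS
            if not section.get(key, "").strip()]
```

For example, missing_praw_keys(open("praw.ini").read(), "mybot") returns an empty list when the [mybot] section is complete. PRAW itself can load such a section by name, as in praw.Reddit("mybot", user_agent="..."). Whether getdata.py does exactly that is an assumption.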

Extracting edge data from the Pushshift Reddit dataset

  1. Sign up for, or log in to, Google BigQuery.
  2. Select or create a project and click on 'Compose Query'.
  3. Paste the contents of the SQL script from the subreddit-viz folder into the editor and run it.
  4. Download the generated CSV file as reddit-edge-list.csv and save it inside the subreddit-viz folder.
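
Before handing the download to the R script, a quick sanity check of the CSV can catch an empty or truncated export. The sketch below makes no assumption about column names. It only assumes the file has a header row and that the first column identifies one end of each edge. The function name is hypothetical.

```python
import csv
from collections import Counter


def summarize_edge_list(path):
    """Return (header, row_count, three most common first-column values)."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no rows
        sources = Counter()
        rows = 0
        for row in reader:
            if not row:  # skip blank lines
                continue
            rows += 1
            sources[row[0]] += 1
    return header, rows, sources.most_common(3)
```

Running summarize_edge_list("subreddit-viz/reddit-edge-list.csv") should show the header produced by the SQL script and a non-zero row count.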

Running the scripts

  1. To obtain the subreddit visualizations, first create an empty folder called subreddit-groups in the same folder as the R script, then run the script using R CMD BATCH reddit.R.
  2. Create a file named praw.ini with the following contents:
    [<bot-name>]
    username: reddit username
    password: reddit password
    client_id: client_id that you got
    client_secret: client_secret that you got
    
  3. Run getdata.py using python3 getdata.py.
  4. The script should scrape all the necessary data in approximately 20-25 minutes.
  5. Run analysis.py using python3 analysis.py [args]. The script accepts the following arguments:
    • no arguments - Runs sentiment analysis on the entire data.
    • -h or --help - Prints the usage details.
    • -w string type or --words string type - Generates a word distribution for the given string and type, where type is either positive or negative. Requires that sentiment analysis has already been run for that string.
    • string - Looks for similar strings in the corpus and performs sentiment analysis on them.
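
To make the argument list for analysis.py concrete, here is a sketch of how such an interface could be expressed with argparse. It is not taken from analysis.py. The function names, the help texts, and the returned mode tuples are all hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the arguments described above."""
    parser = argparse.ArgumentParser(
        prog="analysis.py",
        description="Sentiment analysis of the scraped Reddit data.")
    # Optional positional: a search term to match against the corpus.
    parser.add_argument("term", nargs="?",
                        help="analyse strings similar to this term")
    # -w takes exactly two values: the string and the sentiment type.
    parser.add_argument("-w", "--words", nargs=2,
                        metavar=("STRING", "TYPE"),
                        help="word distribution for STRING; "
                             "TYPE is positive or negative")
    return parser


def choose_mode(argv):
    """Map a list of command-line arguments to the requested mode."""
    args = build_parser().parse_args(argv)
    if args.words is not None:
        string, kind = args.words
        if kind not in ("positive", "negative"):
            raise ValueError("type must be 'positive' or 'negative'")
        return ("words", string, kind)
    if args.term is not None:
        return ("term", args.term)
    return ("all",)  # no arguments: analyse the entire data
```

With this layout, choose_mode([]) selects the full run, choose_mode(["bitcoin"]) selects the similar-string search, and choose_mode(["-w", "bitcoin", "positive"]) selects the word distribution. argparse supplies -h and --help automatically.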

Credits

About

Sentiment analysis and opinion mining of Reddit data.
