Python Instructions:
Anaconda Python 3.5 - https://www.continuum.io/downloads
Python EditDistance package - https://pypi.python.org/pypi/editdistance
Scikit Learn - http://scikit-learn.org/stable/install.html
NLTK - http://www.nltk.org/install.html
Beautiful Soup 4 - https://pypi.python.org/pypi/beautifulsoup4
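If you want to check that these packages are installed before running anything, a quick import check like the following works (this snippet is only a convenience and is not part of the project):

    # Quick check that the required packages can be imported.
    import importlib

    for name in ("editdistance", "sklearn", "nltk", "bs4"):
        try:
            importlib.import_module(name)
            print(name, "OK")
        except ImportError:
            print(name, "MISSING - install it before running the scripts")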
In the main directory, you will see the following files and directories:
- Data/ - This directory holds the dataset and all of the feature data
- Results/ - This directory holds the result files generated by Classifiers.py
- MutInfo.py - This file calculates the pointwise mutual information of features
- Classifiers.py - This file runs given features through the specified classifier and writes the output to the Results directory.
Other files and directories in the main directory include:
- FileGather.py - This script downloads entire pages of recipes from allrecipes.com
- html/ - This is the directory where FileGather.py stores the html files
- RecipeScraper.py - This file converts the files in the html directory into the correct format and saves them in the Data directory
To add recipes to the dataset, first you will need to run:
python FileGather.py [allrecipes cuisine URL] [cuisine] [page number]
Example of an allrecipes cuisine URL: http://allrecipes.com/recipes/695/world-cuisine/asian/chinese/
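For example, to download the first page of Chinese recipes from that URL (assuming page numbers start at 1 and the cuisine name is passed in lowercase), the call would be:
python FileGather.py http://allrecipes.com/recipes/695/world-cuisine/asian/chinese/ chinese 1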
Once you have downloaded all of the recipes you want to add to the dataset, run:
python RecipeScraper.py
This will generate all of the data and save it to the Data directory as [cuisineName].txt
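As a rough illustration only (the HTML selectors, the html/ file-naming scheme, and the output format below are assumptions, not necessarily what RecipeScraper.py actually does), the conversion step amounts to parsing each saved page with Beautiful Soup and appending the extracted text to the matching cuisine file:

    # Hypothetical sketch of the conversion step; the selectors, the html/
    # file-naming scheme, and the output format are all assumptions.
    import glob
    import os
    from bs4 import BeautifulSoup

    os.makedirs("Data", exist_ok=True)

    for path in glob.glob(os.path.join("html", "*.html")):
        # Assumes the downloaded files are named [cuisine]_[page].html
        cuisine = os.path.basename(path).split("_")[0]
        with open(path, encoding="utf-8") as html_file:
            soup = BeautifulSoup(html_file, "html.parser")
        # Keep any non-empty link text as a stand-in for the recipe fields
        # the real script extracts.
        lines = [a.get_text(strip=True) for a in soup.find_all("a") if a.get_text(strip=True)]
        with open(os.path.join("Data", cuisine + ".txt"), "a", encoding="utf-8") as out_file:
            out_file.write("\n".join(lines) + "\n")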
To run the classifier you want on the dataset, simply use the following command:
python Classifiers.py [classifier you want] [feature you want]
This will run the chosen classifier on the dataset using cross-validation folds.
The script goes through each fold and calculates its accuracy. When it finishes,
it saves the results to the Results directory.
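For reference, the fold-based evaluation described above follows the standard k-fold cross-validation pattern in scikit-learn (the model_selection module used here requires scikit-learn 0.18 or later). The sketch below shows the general shape of that loop; the toy data, classifier choice, and number of folds are placeholders rather than the settings Classifiers.py actually uses:

    # Generic k-fold cross-validation sketch; the data, classifier, and
    # fold count are placeholders, not the settings used by Classifiers.py.
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    X = np.random.randint(0, 5, size=(100, 20))  # toy feature counts
    y = np.random.randint(0, 4, size=100)        # toy cuisine labels

    accuracies = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        clf = MultinomialNB()
        clf.fit(X[train_idx], y[train_idx])
        accuracies.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))

    print("per-fold accuracy:", accuracies)
    print("mean accuracy:", sum(accuracies) / len(accuracies))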
To find the n words with the highest pointwise mutual information, use the following command:
python MutInfo.py [feature] [n] [cuisine]
You can also use 'all' in place of the cuisine name to get the top n words from
the entire corpus.
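For reference, the pointwise mutual information between a word and a cuisine is log(p(word, cuisine) / (p(word) * p(cuisine))), computed from co-occurrence counts over the corpus. A small self-contained illustration is below; the counts are toy values, whereas MutInfo.py derives its counts from the files in the Data directory:

    # Illustrative PMI computation on toy counts; MutInfo.py builds its
    # counts from the Data directory instead.
    import math
    from collections import Counter

    # Toy (word, cuisine) observations
    observations = [
        ("soy", "chinese"), ("soy", "chinese"), ("soy", "italian"),
        ("basil", "italian"), ("basil", "italian"), ("rice", "chinese"),
    ]

    total = len(observations)
    word_counts = Counter(word for word, _ in observations)
    cuisine_counts = Counter(cuisine for _, cuisine in observations)
    pair_counts = Counter(observations)

    def pmi(word, cuisine):
        p_pair = pair_counts[(word, cuisine)] / total
        p_word = word_counts[word] / total
        p_cuisine = cuisine_counts[cuisine] / total
        return math.log(p_pair / (p_word * p_cuisine))

    # Rank every observed (word, cuisine) pair by its PMI, highest first
    print(sorted(((pmi(w, c), w, c) for (w, c) in pair_counts), reverse=True))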