Data Intensive Architecture

The datasets can be downloaded from https://www.imdb.com/interfaces/ , which offers seven TSV files. For a description of each dataset, see https://github.com/ayo-nci/DIA/blob/main/IMDB%20Dataset%20description
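
Before uploading the files, it can help to look at a few rows locally. The following is a minimal sketch, not part of the repository: the function name `read_imdb_tsv` and its `limit` parameter are hypothetical. It assumes the IMDb files are tab-separated with a header row, optionally gzip-compressed, and use `\N` to mark missing values.

```python
import csv
import gzip


def read_imdb_tsv(path, limit=5):
    """Return up to `limit` rows of an IMDb TSV file as dictionaries.

    Hypothetical helper for inspection only; `\\N` fields become None.
    """
    # Downloads are usually gzip-compressed; fall back to plain text otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    rows = []
    with opener(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE: treat quote characters inside titles as ordinary text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for index, row in enumerate(reader):
            if index >= limit:
                break
            rows.append({key: (None if value == r"\N" else value)
                         for key, value in row.items()})
    return rows
```

For example, `read_imdb_tsv("title.ratings.tsv.gz", limit=3)` would return the first three rows of the ratings file, assuming it has been downloaded under that name into the current directory.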

Clone the repository from GitHub using https://github.com/ayo-nci/DIA.git, or download the repository as a zip file from the same URL.

Running Steps

1. Create a folder for the project called 'imdb' on your local HDFS and upload all seven files downloaded from IMDb to it.
2. Extract the contents of the cloned repository or zip file to a folder called 'imdb' on your local file system.
3. From inside the folder created in step 2, run the following command to carry out the Hadoop job:
   /bin/bash /imdb/compile.sh imdb imdb/titleratings.tsv imdb/titlebasics.tsv imdb/titlecrew.tsv imdb/titleoutput imdb/combinedoutput imdb/namebasics.tsv imdb/getdirectorsnameoutput

The output file produced by the Hadoop job is converted to a CSV file, which is then loaded into Jupyter for analysis. A copy of it is included in this zip.
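
The conversion to CSV could be done in several ways. The sketch below shows one possibility and is not necessarily how the author produced the file. It assumes the Hadoop output is plain text with tab-separated fields, one record per line. The function name `hadoop_output_to_csv`, the optional `header` argument, and the file names in the usage block are assumptions.

```python
import csv
from pathlib import Path


def hadoop_output_to_csv(src, dst, header=None):
    """Copy tab-separated rows from `src` into a comma-separated file `dst`.

    Returns the number of data rows written. Hypothetical helper; the
    column layout of the real job output is not assumed here.
    """
    rows_written = 0
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        # QUOTE_NONE: quote characters in the input are kept as ordinary text.
        reader = csv.reader(fin, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(fout)  # quotes fields that contain commas
        if header:
            writer.writerow(header)
        for row in reader:
            if not row:  # skip blank lines
                continue
            writer.writerow(row)
            rows_written += 1
    return rows_written


if __name__ == "__main__":
    # Assumed file names: a reducer output copied from HDFS to the local folder.
    source = Path("part-r-00000")
    if source.exists():
        count = hadoop_output_to_csv(source, "imdb.csv")
        print(f"wrote {count} rows")
```

The resulting file can then be opened from the notebook with any CSV reader. Passing a `header` list is optional and only useful if the column meanings are known.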

