System Requirements:
- 16 GB RAM
- NVIDIA GTX 1080 Ti GPU (11 GB VRAM)
- Intel Core i7-8700 @ 3.20 GHz
Dependencies (core imports sketched below):
- Custom NER (my bachelor-year project)
- Google word2vec
- gensim
- Mallet LDA
- spacy
- nltk
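A minimal sketch of the imports these dependencies provide; the spaCy model name is an assumption, and gensim's Mallet wrapper is only available in gensim versions below 4.0:

```python
# Core imports implied by the dependency list above; the spaCy model name
# and the use of gensim < 4.0 (needed for the Mallet wrapper) are assumptions.
import nltk
import spacy
from nltk.corpus import stopwords
from gensim import corpora
from gensim.models import LdaModel, LdaMulticore, CoherenceModel, KeyedVectors
from gensim.models.wrappers import LdaMallet   # only available in gensim < 4.0

nltk.download("stopwords")                     # one-time NLTK resource download
nlp = spacy.load("en_core_web_sm")             # assumed spaCy model
```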
The 114 blog instances (for 3 instances the blog URL returned no data) have to be grouped/classified into the following topics (also collected as a Python constant in the sketch after the list):
Marketing | Branding | Growth marketing | Growth strategies | Product Management |
Product discovery | Product Growth | Product Management Fundamentals | Agile principles | Company Culture |
Company Growth | People Management | Startup Fundamentals | Interpersonal skills | Business Fundamentals |
Business Growth | Sales Growth | Investment cycle |
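For the mapping step described later, the given labels can be kept as a simple constant; this list is illustrative and not code from the repository:

```python
# Target topic labels the predicted LDA topics are mapped onto.
GIVEN_TOPICS = [
    "Marketing", "Branding", "Growth marketing", "Growth strategies",
    "Product Management", "Product discovery", "Product Growth",
    "Product Management Fundamentals", "Agile principles", "Company Culture",
    "Company Growth", "People Management", "Startup Fundamentals",
    "Interpersonal skills", "Business Fundamentals", "Business Growth",
    "Sales Growth", "Investment cycle",
]
```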
Steps followed in the machine learning pipeline:
- Gather data into raw_blog_content.csv using gather_data.ipynb
- Clean data
- Build features
- Create model
- Predict topics
- Map the predicted topics to the given topics
To gather the data/blog content, requests and BeautifulSoup4 were used, and simple preprocessing was conducted in gather_data.ipynb.
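A minimal sketch of how this gathering step could look with requests and BeautifulSoup4; the URL list, column names, and error handling are assumptions rather than the exact contents of gather_data.ipynb:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

blog_urls = ["https://example.com/some-blog-post"]     # placeholder for the real URL list

def fetch_blog_text(url):
    """Download one blog page and reduce it to plain visible text."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):              # drop non-content tags
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

rows = []
for url in blog_urls:
    try:
        rows.append({"url": url, "content": fetch_blog_text(url)})
    except requests.RequestException:
        continue                                        # e.g. the 3 URLs that returned no data

pd.DataFrame(rows).to_csv("raw_blog_content.csv", index=False)
```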
Data preprocessing, feature extraction, and model creation are done in lda_topic_modeling.py.
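A sketch of the preprocessing and feature-building stage in the spirit of lda_topic_modeling.py; the exact cleaning steps and column names there may differ:

```python
import pandas as pd
import spacy
from gensim import corpora
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stop_words = set(stopwords.words("english"))

def preprocess(text):
    """Tokenize, lowercase, drop stopwords, and lemmatize one document."""
    tokens = [t for t in simple_preprocess(text, deacc=True) if t not in stop_words]
    doc = nlp(" ".join(tokens))
    # Keep only content-bearing parts of speech, a common choice for LDA.
    return [t.lemma_ for t in doc if t.pos_ in {"NOUN", "ADJ", "VERB", "ADV"}]

df = pd.read_csv("raw_blog_content.csv")
texts = [preprocess(t) for t in df["content"]]

# Term-document frequency (bag-of-words) features for the LDA models.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
```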
Three models were trained and compared on term-document frequency (bag-of-words) features, as sketched after this list:
- LDA
- LdaMulticore
- LDA Mallet
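A hedged sketch of training the three candidates with gensim, continuing from the dictionary and corpus built above; the topic count and Mallet path are assumptions, and the Mallet wrapper requires gensim < 4.0 plus a local Mallet installation:

```python
from gensim.models import LdaModel, LdaMulticore
from gensim.models.wrappers import LdaMallet

NUM_TOPICS = 18                                         # assumption: one topic per given label

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=NUM_TOPICS, passes=10, random_state=42)

lda_multicore = LdaMulticore(corpus=corpus, id2word=dictionary,
                             num_topics=NUM_TOPICS, passes=10, workers=3)

mallet_path = "/path/to/mallet-2.0.8/bin/mallet"        # local Mallet binary
lda_mallet = LdaMallet(mallet_path, corpus=corpus,
                       id2word=dictionary, num_topics=NUM_TOPICS)
```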
The coherence and perplexity scores of each model were compared, and the best model was picked to predict the topic of a given blog. In this case, LDA Mallet showed the best coherence, around 0.42. Due to time constraints, this metric could not be improved further.
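Model selection can then be done roughly as follows, reusing the texts and dictionary from the preprocessing sketch; the c_v coherence measure is an assumption about which coherence was reported:

```python
from gensim.models import CoherenceModel

def c_v_coherence(model):
    return CoherenceModel(model=model, texts=texts,
                          dictionary=dictionary, coherence="c_v").get_coherence()

scores = {"lda": c_v_coherence(lda),
          "ldamulticore": c_v_coherence(lda_multicore),
          "lda mallet": c_v_coherence(lda_mallet)}
print(scores)                                           # LDA Mallet scored ~0.42 here

# Perplexity is only exposed directly by the in-process gensim models,
# not by the Mallet wrapper.
print("lda perplexity:", lda.log_perplexity(corpus))

best_name = max(scores, key=scores.get)                 # higher coherence is better
```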
Lastly, the prominent topics of each blog were calculated and mapped to the given topics. For this I used word2vec: the predicted topic phrase and each given topic were embedded, and Word Mover's Distance [https://github.com/mkusner/wmd/] was used as the document distance between them. The results are written back to a JSON file, articles_topic.json.
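A sketch of this mapping step with the pretrained Google News word2vec vectors and gensim's Word Mover's Distance, continuing from the earlier sketches (GIVEN_TOPICS, df, corpus, lda_mallet); the file names, top-word count, and output schema are assumptions, and wmdistance additionally needs the pyemd package on older gensim versions:

```python
import json
from gensim.models import KeyedVectors

# Pretrained Google News vectors (assumed local path to the ~3.4 GB binary file).
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def top_topic_phrase(bow):
    """Join the top words of a document's dominant LDA Mallet topic into a phrase."""
    topic_id, _ = max(lda_mallet[bow], key=lambda t: t[1])
    return " ".join(word for word, _ in lda_mallet.show_topic(topic_id, topn=5))

def closest_given_topic(phrase):
    """Map a predicted topic phrase to the nearest given label via Word Mover's Distance."""
    tokens = phrase.lower().split()
    distances = {label: w2v.wmdistance(tokens, label.lower().split())
                 for label in GIVEN_TOPICS}
    return min(distances, key=distances.get)            # smaller WMD = more similar

results = []
for url, bow in zip(df["url"], corpus):
    phrase = top_topic_phrase(bow)
    results.append({"url": url, "predicted_topic": phrase,
                    "mapped_topic": closest_given_topic(phrase)})

with open("articles_topic.json", "w") as f:
    json.dump(results, f, indent=2)
```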