(Project_22)
- Our service provides an easier way for users to find relevant research papers in a vast corpus using semantic search.
- The existing non-ML status quo is arXiv's keyword-based search over titles, author names, and abstracts.
- The business metric is search efficiency: minimize the number of result entries a user must click through before reaching the paper text (i.e., maximize the click-through rate on the top-ranked results), thereby reducing time spent searching and serving highly relevant papers for each query.
Name | Responsible for | Link to team members' commits in this repo |
---|---|---|
Preetham Rakshith Prakash | Continuous X | https://github.com/Yugesh1620/Neura-Scholar/commits/main/?author=th-blitz |
Riya Garg | Data pipeline | https://github.com/Yugesh1620/Neura-Scholar/commits/main/?author=riyagarg30 |
Pranav Bhatt | Model serving and monitoring platforms | https://github.com/Yugesh1620/Neura-Scholar/commits/main/?author=pranav-bhatt |
Yugesh Panta | Model training and training platforms | https://github.com/Yugesh1620/Neura-Scholar/commits/main/?author=Yugesh1620 |
Name | How it was created | Conditions of use |
---|---|---|
arXiv Dataset (PDFs) | From arXiv's bulk access API | arXiv's nonexclusive-distrib/1.0 license for most individual papers; unknown for the rest
arXiv Dataset (Metadata) | From arXiv's Kaggle dataset | Creative Commons CC0 1.0 Universal Public Domain Dedication (metadata only)
Script to pull arXiv PDFs in bulk | A third-party script, mattbierbaum/arxiv-public-datasets, that pulls the arXiv dataset in bulk from arXiv's bulk access API, a Google bucket, or an AWS bucket, and generates plain text | MIT License
Base Embed Model | Linq-AI-Research/Linq-Embed-Mistral | Creative Commons Attribution Non Commercial 4.0 |
BART (facebook/bart-large-cnn or philschmid/bart-large-cnn-samsum) | Pretrained by Facebook on CNN/DailyMail and/or SAMSum datasets, a Large language model with encoder-decoder structure. Hosted on huggingface. | MIT License (Samsum variant) or Fairseq license for BART. Free for research use. |
Requirement | How many/when | Justification |
---|---|---|
m1.medium VMs | 1 for the entire project duration, 2 more during final setup | 1 for hosting a conventional DB, 1 for a vector DB, and 1 as a proxy
compute_liqid node with 2 GPUs OR gpu_mi100 node with 2 GPUs | 1 node with 2 A100s for 20 hours a week ( 2x 4-hour blocks and 2x 6-hour blocks a week ) | Hosting a Ray cluster for training and serving models
compute_liqid or gpu_mi100 or rtx8000 node with 1 GPU | 1 for the entire project duration | For training and serving models under development
Persistent storage | 120 GB for the entire project duration | Needed to persist models, databases, training datasets, metrics, etc.
Floating IPs | 1 for the entire project duration | 1 for everything ( gateway, dashboards, etc. )
Once inside a Jupyter notebook mounted on block storage, with the Ray cluster running on GPU: in the Neura-Scholar/queries/openchat3.5 folder, run Genrating_Storing _Queries_OpenChat3.5.ipynb to generate queries from chunks using the OpenChat-3.5 model. Then run phrases.ipynb to convert the queries into query phrases.

Next, in the Neura-Scholar/Training/main folder, run train_longformer_final_run.ipynb. This submits a Ray Train job; the longformer-base-4096 model is registered on MLflow and stored in MinIO and Postgres, with checkpoints stored in the /mnt/ directory. The registered model is stored under the name 'final' as version 1.

When retraining is triggered from Argo Workflows, a Ray job is submitted using retrain_without_ray.py in the Neura-Scholar/Retraining/ directory. The retrained model is stored under the name 'final' as version 2. A rough sketch of this submission follows the note below.
Note: Please change the floating IP and port numbers according to your setup.
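As a hedged sketch (the dashboard address and working directory are placeholders, per the note above), submitting the retraining script with the Ray Jobs API might look like this:

```python
# Hypothetical sketch of the retraining job submission; the address and
# working directory are placeholders -- adjust them to your setup.
from ray.job_submission import JobSubmissionClient

# Ray dashboard address: replace the floating IP and port per your setup.
client = JobSubmissionClient("http://<FLOATING_IP>:8265")

job_id = client.submit_job(
    entrypoint="python retrain_without_ray.py",
    runtime_env={"working_dir": "./Neura-Scholar/Retraining"},
)
print(f"submitted Ray job: {job_id}")
```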
Queries generated by OpenChat-3.5 and query phrases generated by KeyBERT are saved in Neura-Scholar/queries/openchat3.5/Data/. References are stored in refrences.txt. A minimal KeyBERT sketch follows.
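A minimal sketch of the phrase-extraction step, assuming standard KeyBERT usage (this is illustrative, not the exact code in phrases.ipynb):

```python
# Extract key phrases from a generated query with KeyBERT.
from keybert import KeyBERT

kw_model = KeyBERT()  # defaults to a sentence-transformers backbone
query = "methods for efficient attention over long documents"
phrases = kw_model.extract_keywords(
    query,
    keyphrase_ngram_range=(1, 3),  # allow phrases up to 3 words
    stop_words="english",
    top_n=3,
)
print(phrases)  # list of (phrase, score) tuples
```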
t5-large.ipynb and t5-large_storing _queries.ipynb are used to generate queries from chunks using the t5-large model. Generated queries are saved in Neura-Scholar/queries/t5-large/t5-large.txt. References are stored in refrences.txt.

t5 -small_queries_creating_2new tables.ipynb is used to generate queries from chunks using the t5-small model. Generated queries are saved in Neura-Scholar/queries/t5-large/t5-small.txt. References are stored in refrences.txt.

t5-xl_storing_queries.ipynb is used to generate queries from chunks using the t5-xl model. Generated queries are saved in Neura-Scholar/queries/t5-large/t5-xl.txt. References are stored in refrences.txt. A rough sketch of chunk-to-query generation follows.
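A rough sketch of generating a query from a chunk with a T5 model via Hugging Face transformers; the prompt format and generation settings here are assumptions, not the notebooks' exact code:

```python
# Chunk-to-query generation with a T5 model (illustrative settings).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # the notebooks also use t5-large and t5-xl
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

chunk = "Transformers struggle with long inputs because self-attention scales quadratically ..."
inputs = tokenizer("generate query: " + chunk, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```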
Ray used:
- code -> longformer_final_run.py
- jupyter notebook used to submit the Ray job -> longformer_final_run
- version '1' of model 'final' (see the MLflow loading sketch below)
- experiment -> final
- run -> clumsy-yak-807
- outputs_screenshots has logs and screenshots of the Ray dashboard
- References are stored in refrences.txt
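A sketch of loading this registered model back from MLflow; the tracking URI is a placeholder for your setup:

```python
# Load the registered model 'final', version 1, from the MLflow model registry.
import mlflow

mlflow.set_tracking_uri("http://<FLOATING_IP>:<PORT>")  # adjust to your setup
model = mlflow.pyfunc.load_model("models:/final/1")
```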
Folder openchat3.5: in Neura-Scholar/Training/experiments/openchat3.5/mlflow/
- jupyter notebook -> Train_mlflow
- version '1' of model 'arxiv-bi-encoder-distilbert'
- experiment -> arxiv-bi-encoder-distilbert
- run -> carefree-wren-174
- Used the distilbert-uncased model
- References are stored in refrences.txt
a) MLflow without Ray
- jupyter notebook -> Train_mlflow_2
- version '1' of model 'distilbert-arxiv-bi-encoder1'
- experiment -> distilbert-arxiv-bi-encoder1
- run -> carefree-wren-174
- Used the distilbert-uncased model
b) MLflow without Ray
- jupyter notebook -> Train_mlflow_2
- version '2' of model 'distilbert-arxiv-bi-encoder1'
- experiment -> distilbert-arxiv-bi-encoder1
- run -> vaunted-shark-785
- Used the distilbert-uncased model
- Outputs are in Training/experiments/openchat3.5/mlflow/output
- References are stored in refrences.txt
1) Ray used, in Training/experiments/openchat3.5/ray/1:

a) Ray used
- code -> longformer_8.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment

b) Ray used
- code -> longformer_9.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment

c) Ray used
- code -> longformer_10.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment
References are stored in refrences.txt
2) Ray used, in Training/experiments/openchat3.5/ray/2:

a) Ray used
- code -> check1.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment_2 (small dataset to test the code; it did not register properly because artifacts were logged twice)
- version '1' of model 'check'
- experiment -> arxiv-bi-encoder-longformer-ray
- run -> efficient-finch-1
- Used the longformer-base-4096 model
b) Ray used
- code -> final.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment_2
- (complete data, 3 epochs; stopped midway because artifacts were being logged twice and the model was not expected to register properly)
- version '2' of model 'check'
- experiment -> final
- run -> amazing-sloth-18
- Used the longformer-base-4096 model
- References are stored in refrences.txt
t5-small: the data was deleted from MLflow, but the code is still there.
Ray_tune folder:

Ray used
- code -> train_tune_2.py
- jupyter notebook used to submit the Ray job -> train_tune_2
- versions '2' and '3' of model 'final'
- experiment -> ray_tune
- run -> enchanting-pug-21 (lr=3e-05, version '3') and upbeat-boar-917 (lr=1e-05, version '2'), one run per learning rate (see the sweep sketch below)
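An illustrative Ray Tune sweep over the two learning rates tried above; the objective is a stand-in, not the actual training loop in train_tune_2.py:

```python
# Grid-search two learning rates with Ray Tune (stand-in objective).
from ray import tune

def train_fn(config):
    # Placeholder objective; the real trainable fine-tunes the model with config["lr"].
    loss = (config["lr"] - 2e-5) ** 2
    return {"loss": loss}  # returning a dict reports it as the final result

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.grid_search([1e-5, 3e-5])},
)
results = tuner.fit()
print(results.get_best_result(metric="loss", mode="min").config)
```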
A Ray job (using Ray Train) is submitted, triggered by an Argo workflow.
- code -> retrain_with_ray.py
- version '2' of model 'final'
- experiment -> final
- run -> gentle-mule-713
- The Output folder has logs and screenshots of the Ray dashboard
- References are stored in refrences.txt
The embed and summarizer models are served as API endpoints within an internal network, running on GPU nodes ( or CPU nodes ) with a Triton backend. These endpoints are only accessible internally.
Both the conventional DB ( i.e. SQL ) and the vector DB ( i.e. Faiss ) are served on CPU nodes as API endpoints within the network.
These internal endpoints handle the system flow: generating embeddings, vector DB lookups, SQL queries, and generating summaries.
A gateway node exposed to the public handles requests to and from the system. A client sketch for the internal Triton endpoint follows.
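A sketch of an internal call to the Triton-served embed model; the host, model name, and tensor names here are assumptions, not the deployed configuration:

```python
# Query the internal Triton HTTP endpoint for an embedding.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<internal-host>:8000")

token_ids = np.array([[101, 7592, 2088, 102]], dtype=np.int64)  # pre-tokenized query
inp = httpclient.InferInput("input_ids", token_ids.shape, "INT64")
inp.set_data_from_numpy(token_ids)

result = client.infer(model_name="embed-model", inputs=[inp])
embedding = result.as_numpy("embedding")  # output tensor name is assumed
```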
Platforms used:
- Docker to containerize software (i.e. DBs, models, monitoring platforms, etc.)
- Triton backend for serving both models separately in ONNX format.
- FastAPI to implement a gateway that interacts with the served models and databases (a minimal gateway sketch follows this list).
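A minimal gateway sketch; the internal endpoint URLs and JSON shapes are assumptions for illustration, not the deployed service contract:

```python
# Gateway endpoint: embed the query internally, then search the vector DB.
import httpx
from fastapi import FastAPI

app = FastAPI()
EMBED_URL = "http://embed.internal:8001/embed"       # hypothetical embed endpoint
SEARCH_URL = "http://vectordb.internal:8002/search"  # hypothetical vector DB endpoint

@app.get("/search")
async def search(q: str, k: int = 10):
    async with httpx.AsyncClient() as client:
        emb = (await client.post(EMBED_URL, json={"text": q})).json()["embedding"]
        hits = (await client.post(SEARCH_URL, json={"vector": emb, "k": k})).json()
    return {"query": q, "results": hits}
```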
A data pipeline that handles requests and transfers information between model endpoints and the databases with low latency.
- A persistent storage of 120 GB for the complete project.
- A conventional DB ( i.e. SQL ) running on an m1.medium, sharing a partition of around 40 GB with the persistent storage ( Offline data ).
- This conventional DB will house our entire dataset: (a) the corpus of research papers as PDFs, and (b) their metadata as text.
- A vector DB with its own 30 GB partition to store embeddings of the corpus metadata ( Offline data ).
- A second partition in our vector DB of around 20 GB for gradual replacement of embeddings from a new model ( Offline data ).
- The dataset for re-training ( Offline data ) will come from the conventional DB itself.
- The only online data in our pipeline is the user queries coming in as requests from the gateway, and the summaries generated by our summarizer model. A Faiss lookup sketch follows this list.
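A sketch of the vector DB lookup with Faiss; the embedding dimension and data are placeholders, not the corpus embeddings themselves:

```python
# Build a cosine-similarity Faiss index and query it.
import numpy as np
import faiss

dim = 768  # assumed embedding dimension
corpus = np.random.rand(10_000, dim).astype("float32")  # stand-in embeddings
faiss.normalize_L2(corpus)

index = faiss.IndexFlatIP(dim)  # inner product == cosine on normalized vectors
index.add(corpus)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # top-10 nearest papers
```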
We plan to use python-chi for installation, configuration, and deployment of our infrastructure, versioned in Git.
Implementation of a CI/CD pipeline to re-train our models, run tests, and deploy them.
Finally, setting up a staging environment for model deployment. A provisioning sketch with python-chi follows.
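A hypothetical python-chi outline of the provisioning step; the site, project name, image, and server name are placeholders, and this should be treated as a sketch rather than tested code:

```python
# Provision an m1.medium VM on Chameleon with python-chi (placeholder values).
import chi
from chi import server

chi.use_site("KVM@TACC")
chi.set("project_name", "<YOUR_PROJECT>")

s = server.create_server(
    "neura-scholar-db",
    image_name="CC-Ubuntu22.04",
    flavor_name="m1.medium",
)
server.wait_for_active(s.id)
```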