(Project_22)
- Our service provides an easier way for users to find relevant research papers in a vast corpus using semantic search.
- The existing non-ML status quo is arXiv's keyword-based search over titles, author names, and abstracts.
- The business metric is search efficiency: minimize the number of result entries a user must click through before reaching the paper text (i.e., maximize the click-through rate on the top-ranked results), thereby reducing time spent searching and serving highly relevant papers for each query.
Name | Responsible for | Link to team members' commits in this repo |
---|---|---|
Preetham Rakshith Prakash | Continuous X | https://github.com/Yugesh1620/Neura-Scholar/commits/main/?author=th-blitz |
Riya Garg | Data pipeline | https://github.com/Yugesh1620/Neura-Scholar/commits/main/?author=riyagarg30 |
Pranav Bhatt | Model serving and monitoring platforms | https://github.com/Yugesh1620/Neura-Scholar/commits/main/?author=pranav-bhatt |
Yugesh Panta | Model training and training platforms | https://github.com/Yugesh1620/Neura-Scholar/commits/main/?author=Yugesh1620 |
Name | How it was created | Conditions of use |
---|---|---|
arXiv Dataset (PDFs) | From arXiv's bulk access API | arXiv's nonexclusive-distrib/1.0 license for most individual papers; unknown for the rest
arXiv Dataset (Metadata) | From arXiv's Kaggle dataset | Creative Commons CC0 1.0 Universal Public Domain Dedication (metadata only)
Script to pull arXiv PDFs in bulk | A third-party script, mattbierbaum/arxiv-public-datasets, that pulls the arXiv dataset in bulk from arXiv's bulk access API, a Google bucket, or an AWS bucket, and generates plain text | MIT License
Base Embed Model | Linq-AI-Research/Linq-Embed-Mistral | Creative Commons Attribution Non Commercial 4.0 |
BART (facebook/bart-large-cnn or philschmid/bart-large-cnn-samsum) | Pretrained by Facebook on CNN/DailyMail and/or SAMSum datasets, a Large language model with encoder-decoder structure. Hosted on huggingface. | MIT License (Samsum variant) or Fairseq license for BART. Free for research use. |
Requirement | How many/when | Justification |
---|---|---|
m1.medium VMs | 1 for the entire project duration, 2 more during final setup | 1 for hosting a conventional DB, 1 for a vector DB, and 1 as a proxy
compute_liqid node with 2 GPUs OR gpu_mi100 node with 2 GPUs | 1 node with 2 A100s for 20 hours a week ( 2x 4-hour blocks and 2x 6-hour blocks a week ) | Hosting a Ray cluster for training and serving models
compute_liqid or gpu_mi100 or rtx8000 node with 1 GPU | 1 for the entire project duration | For training and serving models under development
Persistent storage | 120 GB for the entire project duration | Needed to persist models, databases, training datasets, metrics, etc.
Floating IPs | 1 for the entire project duration | 1 for everything ( gateway, dashboards, etc. )
Once inside a Jupyter notebook mounted on block storage, with the Ray cluster running on GPU: in the Neura-Scholar/queries/openchat3.5 folder, run Genrating_Storing _Queries_OpenChat3.5.ipynb to generate queries from chunks using the OpenChat-3.5 model. Then run phrases.ipynb to convert the queries into query phrases.

Next, in the Neura-Scholar/Training/main folder, run train_longformer_final_run.ipynb. This submits a Ray Train job; the longformer-base-4096 model is registered on MLflow and stored in MinIO and Postgres, with checkpoints stored in the /mnt/ directory. The registered model is stored under the name 'final' as version 1.

When retraining is triggered from Argo Workflows, a Ray job is submitted using retrain_without_ray.py in the Neura-Scholar/Retraining/ directory. The retrained model is stored under the name 'final' as version 2. A rough sketch of this submission follows the note below.
Note: Please change the floating IP and port numbers according to your setup.
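As a hedged sketch (the dashboard address and working directory are placeholders, per the note above), submitting the retraining script with the Ray Jobs API might look like this:

```python
# Hypothetical sketch of the retraining job submission; the address and
# working directory are placeholders -- adjust them to your setup.
from ray.job_submission import JobSubmissionClient

# Ray dashboard address: replace the floating IP and port per your setup.
client = JobSubmissionClient("http://<FLOATING_IP>:8265")

job_id = client.submit_job(
    entrypoint="python retrain_without_ray.py",
    runtime_env={"working_dir": "./Neura-Scholar/Retraining"},
)
print(f"submitted Ray job: {job_id}")
```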
Queries generated by OpenChat-3.5 and query phrases generated by KeyBERT are saved in Neura-Scholar/queries/openchat3.5/Data/. References are stored in refrences.txt. A minimal KeyBERT sketch follows.
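A minimal sketch of the phrase-extraction step, assuming standard KeyBERT usage (this is illustrative, not the exact code in phrases.ipynb):

```python
# Extract key phrases from a generated query with KeyBERT.
from keybert import KeyBERT

kw_model = KeyBERT()  # defaults to a sentence-transformers backbone
query = "methods for efficient attention over long documents"
phrases = kw_model.extract_keywords(
    query,
    keyphrase_ngram_range=(1, 3),  # allow phrases up to 3 words
    stop_words="english",
    top_n=3,
)
print(phrases)  # list of (phrase, score) tuples
```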
t5-large.ipynb and t5-large_storing _queries.ipynb are used to generate queries from chunks using the t5-large model. Generated queries are saved in Neura-Scholar/queries/t5-large/t5-large.txt. References are stored in refrences.txt.

t5 -small_queries_creating_2new tables.ipynb is used to generate queries from chunks using the t5-small model. Generated queries are saved in Neura-Scholar/queries/t5-large/t5-small.txt. References are stored in refrences.txt.

t5-xl_storing_queries.ipynb is used to generate queries from chunks using the t5-xl model. Generated queries are saved in Neura-Scholar/queries/t5-large/t5-xl.txt. References are stored in refrences.txt. A rough sketch of chunk-to-query generation follows.
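A rough sketch of generating a query from a chunk with a T5 model via Hugging Face transformers; the prompt format and generation settings here are assumptions, not the notebooks' exact code:

```python
# Chunk-to-query generation with a T5 model (illustrative settings).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # the notebooks also use t5-large and t5-xl
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

chunk = "Transformers struggle with long inputs because self-attention scales quadratically ..."
inputs = tokenizer("generate query: " + chunk, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```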
Ray used:
- code -> longformer_final_run.py
- jupyter notebook used to submit the Ray job -> longformer_final_run
- version '1' of model 'final' (see the MLflow loading sketch below)
- experiment -> final
- run -> clumsy-yak-807
- outputs_screenshots has logs and screenshots of the Ray dashboard
- References are stored in refrences.txt
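A sketch of loading this registered model back from MLflow; the tracking URI is a placeholder for your setup:

```python
# Load the registered model 'final', version 1, from the MLflow model registry.
import mlflow

mlflow.set_tracking_uri("http://<FLOATING_IP>:<PORT>")  # adjust to your setup
model = mlflow.pyfunc.load_model("models:/final/1")
```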
Folder openchat3.5: in Neura-Scholar/Training/experiments/openchat3.5/mlflow/
- jupyter notebook -> Train_mlflow
- version '1' of model 'arxiv-bi-encoder-distilbert'
- experiment -> arxiv-bi-encoder-distilbert
- run -> carefree-wren-174
- Used the distilbert-uncased model
- References are stored in refrences.txt
a) MLflow without Ray
- jupyter notebook -> Train_mlflow_2
- version '1' of model 'distilbert-arxiv-bi-encoder1'
- experiment -> distilbert-arxiv-bi-encoder1
- run -> carefree-wren-174
- Used the distilbert-uncased model
b) MLflow without Ray
- jupyter notebook -> Train_mlflow_2
- version '2' of model 'distilbert-arxiv-bi-encoder1'
- experiment -> distilbert-arxiv-bi-encoder1
- run -> vaunted-shark-785
- Used the distilbert-uncased model
- Outputs are in Training/experiments/openchat3.5/mlflow/output
- References are stored in refrences.txt
1) Ray used, in Training/experiments/openchat3.5/ray/1:

a) Ray used
- code -> longformer_8.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment

b) Ray used
- code -> longformer_9.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment

c) Ray used
- code -> longformer_10.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment
References are stored in refrences.txt
2) Ray used, in Training/experiments/openchat3.5/ray/2:

a) Ray used
- code -> check1.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment_2 (small dataset to test the code; it did not register properly because artifacts were logged twice)
- version '1' of model 'check'
- experiment -> arxiv-bi-encoder-longformer-ray
- run -> efficient-finch-1
- Used the longformer-base-4096 model
b) Ray used
- code -> final.py
- jupyter notebook used to submit the Ray job -> run_ray_longformer_experiment_2
- (complete data, 3 epochs; stopped midway because artifacts were being logged twice and the model was not expected to register properly)
- version '2' of model 'check'
- experiment -> final
- run -> amazing-sloth-18
- Used the longformer-base-4096 model
- References are stored in refrences.txt
t5-small: the data was deleted from MLflow, but the code is still there.
Ray_tune folder:

Ray used
- code -> train_tune_2.py
- jupyter notebook used to submit the Ray job -> train_tune_2
- versions '2' and '3' of model 'final'
- experiment -> ray_tune
- run -> enchanting-pug-21 (lr=3e-05, version '3') and upbeat-boar-917 (lr=1e-05, version '2'), one run per learning rate (see the sweep sketch below)
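An illustrative Ray Tune sweep over the two learning rates tried above; the objective is a stand-in, not the actual training loop in train_tune_2.py:

```python
# Grid-search two learning rates with Ray Tune (stand-in objective).
from ray import tune

def train_fn(config):
    # Placeholder objective; the real trainable fine-tunes the model with config["lr"].
    loss = (config["lr"] - 2e-5) ** 2
    return {"loss": loss}  # returning a dict reports it as the final result

tuner = tune.Tuner(
    train_fn,
    param_space={"lr": tune.grid_search([1e-5, 3e-5])},
)
results = tuner.fit()
print(results.get_best_result(metric="loss", mode="min").config)
```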
A Ray job (using Ray Train) is submitted, triggered by an Argo workflow.
- code -> retrain_with_ray.py
- version '2' of model 'final'
- experiment -> final
- run -> gentle-mule-713
- The Output folder has logs and screenshots of the Ray dashboard
- References are stored in refrences.txt
The embed and summarizer models are served as API endpoints within an internal network, running on GPU nodes ( or CPU nodes ) with a Triton backend. These endpoints are only accessible internally.
Both the conventional DB ( i.e. SQL ) and the vector DB ( i.e. Faiss ) are served on CPU nodes as API endpoints within the network.
These internal endpoints handle the system flow: generating embeddings, vector DB lookups, SQL queries, and generating summaries.
A gateway node exposed to the public handles requests to and from the system. A client sketch for the internal Triton endpoint follows.
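A sketch of an internal call to the Triton-served embed model; the host, model name, and tensor names here are assumptions, not the deployed configuration:

```python
# Query the internal Triton HTTP endpoint for an embedding.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="<internal-host>:8000")

token_ids = np.array([[101, 7592, 2088, 102]], dtype=np.int64)  # pre-tokenized query
inp = httpclient.InferInput("input_ids", token_ids.shape, "INT64")
inp.set_data_from_numpy(token_ids)

result = client.infer(model_name="embed-model", inputs=[inp])
embedding = result.as_numpy("embedding")  # output tensor name is assumed
```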
Platforms used:
- Docker to containerize software (i.e. DBs, models, monitoring platforms, etc.)
- Triton backend for serving both models separately in ONNX format.
- FastAPI to implement a gateway that interacts with the served models and databases (a minimal gateway sketch follows this list).
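A minimal gateway sketch; the internal endpoint URLs and JSON shapes are assumptions for illustration, not the deployed service contract:

```python
# Gateway endpoint: embed the query internally, then search the vector DB.
import httpx
from fastapi import FastAPI

app = FastAPI()
EMBED_URL = "http://embed.internal:8001/embed"       # hypothetical embed endpoint
SEARCH_URL = "http://vectordb.internal:8002/search"  # hypothetical vector DB endpoint

@app.get("/search")
async def search(q: str, k: int = 10):
    async with httpx.AsyncClient() as client:
        emb = (await client.post(EMBED_URL, json={"text": q})).json()["embedding"]
        hits = (await client.post(SEARCH_URL, json={"vector": emb, "k": k})).json()
    return {"query": q, "results": hits}
```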
A data pipeline that handles requests and transfers information between model endpoints and the databases with low latency.
- A persistent storage of 120 GB for the complete project.
- A conventional DB ( i.e. SQL ) running on an m1.medium, sharing a partition of around 40 GB with the persistent storage ( Offline data ).
- This conventional DB will house our entire dataset: (a) the corpus of research papers as PDFs, and (b) their metadata as text.
- A vector DB with its own 30 GB partition to store embeddings of the corpus metadata ( Offline data ).
- A second partition in our vector DB of around 20 GB for gradual replacement of embeddings from a new model ( Offline data ).
- The dataset for re-training ( Offline data ) will come from the conventional DB itself.
- The only online data in our pipeline is the user queries coming in as requests from the gateway, and the summaries generated by our summarizer model. A Faiss lookup sketch follows this list.
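A sketch of the vector DB lookup with Faiss; the embedding dimension and data are placeholders, not the corpus embeddings themselves:

```python
# Build a cosine-similarity Faiss index and query it.
import numpy as np
import faiss

dim = 768  # assumed embedding dimension
corpus = np.random.rand(10_000, dim).astype("float32")  # stand-in embeddings
faiss.normalize_L2(corpus)

index = faiss.IndexFlatIP(dim)  # inner product == cosine on normalized vectors
index.add(corpus)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)  # top-10 nearest papers
```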
We plan to use python-chi for installation, configuration, and deployment of our infrastructure, versioned in Git.
Implementation of a CI/CD pipeline to re-train our models, run tests, and deploy them.
Finally, setting up a staging environment for model deployment. A provisioning sketch with python-chi follows.
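A hypothetical python-chi outline of the provisioning step; the site, project name, image, and server name are placeholders, and this should be treated as a sketch rather than tested code:

```python
# Provision an m1.medium VM on Chameleon with python-chi (placeholder values).
import chi
from chi import server

chi.use_site("KVM@TACC")
chi.set("project_name", "<YOUR_PROJECT>")

s = server.create_server(
    "neura-scholar-db",
    image_name="CC-Ubuntu22.04",
    flavor_name="m1.medium",
)
server.wait_for_active(s.id)
```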