VLM-Based-Retrieval-Augmented-Generation

Stanford NLP Project Repo

VLM RAG pipeline based on ColPali.
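For context, a ColPali-style retriever embeds each page image as a set of patch vectors and the query as a set of token vectors, then ranks pages by a late-interaction MaxSim score. A minimal sketch of that scoring in plain PyTorch (shapes, sizes, and variable names here are illustrative assumptions, not this repo's API):

```python
import torch
import torch.nn.functional as F

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction MaxSim: match every query token to its best page patch,
    then sum the matches. Inputs are assumed L2-normalized, so dot products
    are cosine similarities.

    query_emb: (num_query_tokens, dim), page_emb: (num_patches, dim).
    """
    sim = query_emb @ page_emb.T           # (num_query_tokens, num_patches)
    return sim.max(dim=1).values.sum()     # scalar relevance score

# Rank a toy corpus of page embeddings for one query (random stand-ins for
# the embeddings a VLM retriever such as ColPali would produce).
dim = 128
query = F.normalize(torch.randn(12, dim), dim=-1)
pages = [F.normalize(torch.randn(1024, dim), dim=-1) for _ in range(5)]
scores = torch.stack([maxsim_score(query, p) for p in pages])
print(scores.topk(k=3).indices.tolist())   # indices of the best-matching pages
```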

Interpretable MaxSim Mapping:

Query: What is the hand-and-arm signal used for turning right while driving?

Max MaxSim-Score Token: driving

Min MaxSim-Score Token: What
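
Reading off the per-token scores behind the mapping above is straightforward: each query token's MaxSim contribution is its best match over the page's patch embeddings, and the argmax patch is what the heatmaps in interpreted_output visualize. A rough sketch (the tokens and embeddings below are placeholders, not the repo's actual variables):

```python
import torch
import torch.nn.functional as F

# Placeholder query tokens and random embeddings standing in for real model output.
query_tokens = ["What", "is", "the", "hand-and-arm", "signal", "for", "turning",
                "right", "while", "driving"]
query_emb = F.normalize(torch.randn(len(query_tokens), 128), dim=-1)
page_emb = F.normalize(torch.randn(1024, 128), dim=-1)   # one page's patch embeddings

sim = query_emb @ page_emb.T                 # (num_query_tokens, num_patches)
per_token = sim.max(dim=1).values            # each token's MaxSim contribution
best_patch = sim.argmax(dim=1)               # patch each token attends to most

print("max-score token:", query_tokens[per_token.argmax()])  # e.g. "driving"
print("min-score token:", query_tokens[per_token.argmin()])  # e.g. "What"
```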

Project Structure Tree:

VLM RAG/
│
├── benchmark_run_metrics/        # ranking metrics for benchmark
│   └── datasetName/
│       └── metrics.json
│
├── codes/
│   ├── finetune.py               # script for fine-tuning retriever using contrastive learning (see sketch below)
│   ├── run_benchmark.py          # script to run model on benchmark
│   └── utils                     # util functions
│
├── interpreted_output/           # heatmaps visualizing visual attention
│
├── main/                         # main RAG pipeline
│   ├── dbManager.py              # script for article vectorization
│   ├── gen.py                    # script for inference and synthetic question generation
│   ├── preprocessor.py           # script for doc preprocessing
│   ├── get_data.py               # scraper for evaluation set
│   └── pipeline.py               # script for RAG pipeline
│
├── dmv_example.png               # example image used for interpretable similarity mapping
│
└── requirements.txt              # Python dependencies
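
finetune.py is described above as contrastive fine-tuning of the retriever. As a rough illustration of what such an objective can look like, here is a minimal in-batch InfoNCE sketch over MaxSim scores; the names, shapes, and exact loss form are assumptions for illustration, not necessarily what the script implements:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_embs: torch.Tensor, page_embs: torch.Tensor) -> torch.Tensor:
    """In-batch contrastive (InfoNCE-style) loss over MaxSim scores.

    query_embs: (batch, q_tokens, dim), page_embs: (batch, patches, dim);
    query i's positive is page i, every other page in the batch is a negative.
    """
    # scores[i, j] = MaxSim(query i, page j) for all pairs in the batch.
    sim = torch.einsum("bqd,cpd->bcqp", query_embs, page_embs)   # (B, B, q, p)
    scores = sim.max(dim=-1).values.sum(dim=-1)                  # (B, B)
    targets = torch.arange(scores.size(0))                       # diagonal = positives
    return F.cross_entropy(scores, targets)

# Toy batch of 4 (query, positive page) pairs with random embeddings.
q = torch.randn(4, 12, 128, requires_grad=True)
p = torch.randn(4, 1024, 128, requires_grad=True)
loss = in_batch_contrastive_loss(F.normalize(q, dim=-1), F.normalize(p, dim=-1))
loss.backward()   # in real training, gradients flow back into the retriever
```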

About

VLM-based RAG (without parsing) for enhanced robustness on unstructured documents such as plots, infographics, and slides.
