Skip to content

Latest commit

 

History

History
177 lines (115 loc) · 8.25 KB

README.md

File metadata and controls

177 lines (115 loc) · 8.25 KB

Learning and actioning general principles of cancer cell drug sensitivity

Workflow

This repository contains the complete codebase required to reproduce the analysis presented in the paper Learning and actioning general principles of cancer cell drug sensitivity.

Repository Structure

The code is organized into four main directories, each tailored to facilitate specific aspects of the analysis:

  • CellHit: This is a custom library that encapsulates all the functions used throughout the analysis. The library is designed for reusability in further analyses, making it a versatile tool for similar research.

  • scripts: This folder contains various Python scripts that manage tasks ranging from data pre-processing to model training. An additional Markdown file is included in this folder, providing detailed descriptions and usage instructions for each script.

  • AsyncDistribJobs: This directory houses an auxiliary custom library crafted to efficiently manage asynchronous parallel jobs on HPC environments.

  • Data: This directory contains data needed to reproduce all the obtained results

1. Development Environment Setup

This guide provides detailed steps to set up the development environment necessary to replicate the results from our research. The setup includes creating a new Python environment, installing general libraries, large language model (LLM) libraries, and compiling XGBoost with GPU support.

**NOTE: overall time requirments to setup the enviroment may greatly vary from system to system

A. Creating the new enviroment through the yml file (recommended)

We freezed into the "cellHit.yml" file all of the libraries needed to run the code provided in this codebase. In order create a new enviroment with the yml just run

conda env create --name envname --file=cellHit.yml

B. Creating a New Python Environment (alternative from scratch)

First, create a new environment using Conda and install CUDA toolkit:

conda create -n CellHit python=3.11
conda install -c "nvidia/label/cuda-11.8" cuda-toolkit

General Libraries Installation

Install the following general purpose libraries using pip:

pip install biopython==1.82 \
			SQLAlchemy==2.0.23 \
			tqdm==4.66.1 \
			torch==2.1.2+cu118 \
			torchaudio==2.1.2+cu118 \
			torchvision==0.16.2+cu118 \
			numba==0.58.1 \
            openpyxl==3.1.2

LLM Libraries Installation

Install libraries specifically used for working with large language models:

pip install guidance==0.1.8 \
             openai==0.28.1 \
             requests==2.31.0 \
             transformers==2.31.0 \
             auto-gptq==0.6.0+cu118 \
             optimum==1.16.1 \
             peft==0.7.1

Installing vLLM with CUDA 11.8 requieres a specific procedure:

# Install vLLM with CUDA 11.8
export VLLM_VERSION=0.2.7
export PYTHON_VERSION=311
pip install https://github.com/vllm-project/vllm/releases/download/v${VLLM_VERSION}/vllm-${VLLM_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux1_x86_64.whl

# Re-install PyTorch with CUDA 11.8
pip uninstall torch -y
pip install torch --upgrade --index-url https://download.pytorch.org/whl/cu118

# Re-install xFormers with CUDA 11.8
pip uninstall xformers -y
pip install --upgrade xformers --index-url https://download.pytorch.org/whl/cu118

For additional informations refer to the original vLLM documentation

Machine Learning Libraries

Install additional machine learning libraries:

pip install scikit-learn==1.3.2 shap==0.43.0 optuna==3.3.0 

Compiling XGBoost with GPU Support

Follow these steps to compile XGBoost with GPU support, using CUDA 11.8 and gcc 10.2.0:

# Obtain the code
git clone --recursive https://github.com/dmlc/xgboost

# Compiling XGBoost
cd xgboost
mkdir build
cd build
cmake .. -DUSE_CUDA=ON
make -

# Then install
cd python-package/
pip install .

For additional informations refer to the XGBoost documentation

Make sure to follow these instructions sequentially to avoid any issues with dependencies and library versions. Once finished reproducing the enviroment, clone the repository with

git clone https://github.com/raimondilab/CellHit.git

2. Getting the data

Most of the data required to replicate the results is in the data folder. However, some files were too large for direct upload to GitHub, particularly

For transcriptomics data for CCLE and TCGA. To access this data, clone the repository and create a transcriptomics folder within the data folder, alongside metadata and reactome. Then download:

If you wish to bypass the celligner steps, a pre-aligned version of these transcriptomics datasets is available here.

For MOA data you can download prism_LLM_drugID_to_genes.json from here and place it the MOA_data inside the data folder

3. Getting LLMs weights

For the anlysis we used a GPTQ quantized version of Mixtral-8x7B-Instruct-v0.1 from Mistral.AI. The scripts search for this model in you home folder. To obtain the weights first install huggingface-cli and login with the following commands:

#instal
pip install -U "huggingface_hub[cli]"

#login
huggingface-cli login

To login you need a User Access Token from your Settings page (more on this on huggingface official documentation). Once logged-in you can obtain the weights of the used LLM with the following command

huggingface-cli download TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ --revision gptq-4bit-32g-actorder_True --local-dir <your_home_folder>

Hardware requirements

Our pipeline has primarily been developed and executed in High Performance Computing (HPC) environments. We have utilized the following types of nodes, which are significantly enhanced by GPU acceleration:

  • Daneel nodes on HPC@SNS, each equipped with 2 Intel Xeon CPUs, 36 cores (18 cores per socket), 1.5 TB of RAM (about 42 GB/core), 6TB local scratch space and 4 Tesla (V100) NVIDIA GPUs with 32 GB of RAM (each);
  • Gaia nodes on HPC@SNS, equipped with 2 AMD EPYC 7352, 48 cores (24 physical cores per socket), 512 GB of RAM (10.6 GB/core), 4 NVIDIA A100 and a local scratch area of ~890 GB;
  • BullSequana X2135 "Da Vinci" nodes on Leonardo Supercomputer @ CINECA, equipped with 1 x CPU Intel Xeon 8358 32 core, 2,6 GHz, 512 (8 x 64) GB RAM DDR4 3200 MHz and 4x Nvidia custom Ampere (A100) GPU 64GB HBM2.

Despite the high-performance hardware requirements for development and training phases, the final trained models are designed to be deployable on standard desktops or laptops without the need for GPU acceleration. These models will be made available upon publication.

Troubleshooting and Support

If you encounter any issues while setting up or using this environment, please do not hesitate to reach out for help or clarification:

  • Open an Issue: For problems or enhancements related to the code, please open an issue directly on the GitHub repository.

  • Contact via Email: If you have specific questions or need further assistance, you can email us at francesco.carli@sns.it.

We are committed to providing support and making continuous improvements to this project.