LLMSemanticAnnotator employs Semantic Textual Similarity (STS) to annotate scientific articles with controlled vocabulary, based on precise term definitions. This implementation primarily leverages ontologies from the Planteome project, while also incorporating other relevant sources, to address the lack of detailed annotations in scientific articles, particularly regarding experimental conditions and plant developmental stages.
- LLM Utilization: The annotator employs Large Language Models (LLMs) to deeply understand the context and content of scientific articles.
- STS Application: The system scores the semantic similarity between ontology term definitions and article text by comparing their embeddings.
- Ontology Sources: In addition to Planteome, the annotator integrates controlled vocabularies from other recognized sources in the field of plant biology, ensuring comprehensive coverage of relevant terms.
- Multi-level Annotation: The annotation process specifically targets:
- Experimental conditions
- Plant developmental stages
- Molecules of interest under study
- Semantic Association: Ultimately, the annotator establishes links between annotated terms, enabling the association of experimental conditions and developmental stages with the molecules of interest studied.
This approach aims to significantly enrich the metadata of scientific articles, thereby facilitating experimental reproducibility, comparative analysis of studies, and large-scale knowledge extraction in the field of plant biology.
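As a rough illustration of the STS idea described above, the sketch below ranks ontology term definitions against a text chunk by cosine similarity. The vectors are tiny hand-made stand-ins for real encoder embeddings. The names `cosine_similarity` and `rank_terms` and the term labels are hypothetical and are not taken from the LLMSemanticAnnotator code base.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_terms(chunk_vec, term_vecs):
    """Return (label, score) pairs sorted from most to least similar to the chunk."""
    scored = [(label, cosine_similarity(chunk_vec, vec)) for label, vec in term_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings of two term definitions (a real encoder would produce these).
term_vecs = {
    "drought exposure": [0.9, 0.1, 0.0],
    "flowering stage": [0.1, 0.9, 0.2],
}
# Hypothetical embedding of one sentence from an abstract.
chunk_vec = [0.8, 0.2, 0.1]

ranking = rank_terms(chunk_vec, term_vecs)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

In a real run the vectors would come from the configured encoder model, and only pairs scoring above the configured thresholds would be kept.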
pip install git+https://github.com/p2m2/encoder-ontology-match-abstract.git@20250120
curl -O https://raw.githubusercontent.com/p2m2/encoder-ontology-match-abstract/refs/heads/main/llm_semantic_annotator.sh
check versions available
export LLM_SEMANTIC_ANNOTATOR_REPO=$WORK/encoder-ontology-match-abstract
module purge
module load git
module load python/3.11.5
export PYTHONUSERBASE=$WORK/python_base
export HF_HOME=$WORK/hg_cache/huggingface
pip install -r requirements.txt
from huggingface_hub import snapshot_download
mkdir config_workdir
pushd config_workdir
wget http://purl.obolibrary.org/obo/po.owl
wget http://purl.obolibrary.org/obo/pso.owl
wget http://purl.obolibrary.org/obo/to.owl
wget http://purl.obolibrary.org/obo/ncbitaxon.owl
curl -O https://raw.githubusercontent.com/p2m2/encoder-ontology-match-abstract/refs/heads/main/config/foodon-demo.json
./llm_semantic_annotator.sh foodon-demo.json 1
Usage: ./llm_semantic_annotator.sh <config_file> <int_commande>
1. Pseudo workflow [2,4,5,6,7]
2. Populate OWL tag embeddings
3. Populate abstract embeddings
4. Compute similarities between tags and abstract chunks
5. Display similarities information
6. Build turtle knowledge graph
7. Build dataset abstracts annotations CSV file
2: Compute TAG embeddings for all ontologies defined in the populate_owl_tag_embeddings section
3: Compute ABSTRACT embeddings (title + sentences) for all abstracts in the dataset
4: Compute similarities between TAGS and ABSTRACTS
5: Display similarities information on the console
6: Generate turtle file with information {score, tag} for each DOI
7: Generate CSV file with [doi, tag, pmid, reference_id]
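To make the output of step 7 more concrete, here is a minimal sketch of writing rows with the four columns listed above. The function name `write_annotations_csv` and every value in `rows` are placeholders invented for illustration; the actual file produced by the tool may differ in column order or quoting.

```python
import csv
import io

COLUMNS = ["doi", "tag", "pmid", "reference_id"]

def write_annotations_csv(rows, handle):
    """Write annotation rows (dicts keyed by COLUMNS) to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Placeholder data, not real identifiers.
rows = [
    {"doi": "10.0000/example.1", "tag": "EX:0000001", "pmid": "00000001", "reference_id": "ref-1"},
    {"doi": "10.0000/example.1", "tag": "EX:0000002", "pmid": "00000001", "reference_id": "ref-2"},
]

buffer = io.StringIO()
write_annotations_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```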
example can be found :
{
  "encoder": string,
  "threshold_similarity_tag_chunk": number,
  "threshold_similarity_tag": number,
  "batch_size": number,
  "populate_owl_tag_embeddings": object,
  "populate_abstract_embeddings": object
}
- encoder: (string) Specifies the encoding model to use.
- threshold_similarity_tag_chunk: (number) Similarity threshold applied when matching OWL tags against abstract chunks.
- threshold_similarity_tag: (number) Similarity threshold between tags (keeps the best above this value).
- batch_size: (number) Batch size for processing.
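Before launching a long run, it can help to check a configuration against the top-level keys described above. The following is only a sketch: `check_config` is a hypothetical helper, not something the tool provides, and the encoder name and numeric values in the example are invented.

```python
import json

# Expected top-level keys and their JSON types, as listed in the schema above.
EXPECTED_TYPES = {
    "encoder": str,
    "threshold_similarity_tag_chunk": (int, float),
    "threshold_similarity_tag": (int, float),
    "batch_size": (int, float),
    "populate_owl_tag_embeddings": dict,
    "populate_abstract_embeddings": dict,
}

def check_config(config):
    """Return a list of human-readable problems; an empty list means the keys look fine."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(config[key], bool) or not isinstance(config[key], expected):
            problems.append(f"wrong type for key: {key}")
    return problems

# Invented example values, for illustration only.
example = json.loads("""
{
  "encoder": "hypothetical-encoder-name",
  "threshold_similarity_tag_chunk": 0.7,
  "threshold_similarity_tag": 0.8,
  "batch_size": 32,
  "populate_owl_tag_embeddings": {},
  "populate_abstract_embeddings": {}
}
""")

print(check_config(example))
```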
This section configures the ontologies to be used for populating OWL tag embeddings.
"populate_owl_tag_embeddings": {
  "ontologies": {
    "group_link": {
      "ontology_name": {
        "url": string,
        "prefix": string,
        "format": string,
        "label": string,
        "properties": [string],
        "constraints": object
      }
    }
  }
}
- url: (string) URL of the ontology.
- prefix: (string) Prefix of the ontology.
- format: (string) Format of the ontology (e.g., "xml").
- label: (string) Property used as the label (used to build embeddings).
- properties: (array of strings) Additional properties to include (also used to build embeddings).
- constraints: (object) Constraints to apply on the ontology.
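The nesting of this section (ontologies, then a group, then one entry per ontology) can be easier to see with a filled-in example. The sketch below builds one entry in Python and prints it as JSON. Only the `url` (one of the OWL files downloaded earlier) and the `"xml"` format come from this document; the group name, ontology name, `prefix`, and `label` values are assumptions made for illustration, and `missing_fields` is a hypothetical helper.

```python
import json

REQUIRED_FIELDS = ("url", "prefix", "format", "label", "properties", "constraints")

def missing_fields(entry):
    """List the documented fields that are absent from one ontology entry."""
    return [name for name in REQUIRED_FIELDS if name not in entry]

section = {
    "ontologies": {
        "example_group_link": {            # assumed group name
            "example_ontology": {          # assumed ontology name
                "url": "http://purl.obolibrary.org/obo/po.owl",
                "prefix": "http://purl.obolibrary.org/obo/PO_",  # assumed value
                "format": "xml",
                "label": "rdfs:label",     # assumed value
                "properties": [],          # no additional properties in this sketch
                "constraints": {},         # no constraints in this sketch
            }
        }
    }
}

# Walk the two nesting levels and report incomplete entries.
for group, ontologies in section["ontologies"].items():
    for name, entry in ontologies.items():
        print(group, name, missing_fields(entry))

section_json = json.dumps({"populate_owl_tag_embeddings": section}, indent=2)
print(section_json)
```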
This section configures the population of abstract embeddings.
"populate_abstract_embeddings": {
  "abstracts_per_file": number,
  "from_ncbi_api": object,
  "from_file": object
}
Configures fetching abstracts from the NCBI API.
- ncbi_api_chunk_size: (number) Chunk size for NCBI requests.
- debug_nb_ncbi_request: (number) Number of requests for debugging (-1 for unlimited).
- retmax: (number) Maximum number of results to return.
- selected_term: (array of strings) Selected search terms.
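One plausible reading of `ncbi_api_chunk_size` and `debug_nb_ncbi_request` is that identifiers are requested in fixed-size batches and that the debug setting caps how many batches are sent. The sketch below shows that interpretation only; it is an assumption about the behaviour, it performs no network calls, and `chunked` and `plan_requests` are hypothetical names.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]

def plan_requests(ids, ncbi_api_chunk_size, debug_nb_ncbi_request=-1):
    """Return the batches that would be requested; -1 means no limit on the batch count."""
    batches = chunked(ids, ncbi_api_chunk_size)
    if debug_nb_ncbi_request >= 0:
        batches = batches[:debug_nb_ncbi_request]
    return batches

# Placeholder identifiers, not real PubMed IDs.
ids = [str(n) for n in range(1, 8)]

print(plan_requests(ids, 3))
print(plan_requests(ids, 3, debug_nb_ncbi_request=1))
```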
Configures fetching abstracts from local files.
- json_files: (array of strings) List of JSON files to use.
- json_dir: (string) Directory containing JSON files.
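How `json_files` and `json_dir` combine is not spelled out here. The sketch below assumes the simplest behaviour: explicit files are taken first, then every `*.json` file found directly inside the directory. `collect_json_files` is a hypothetical helper, and the file names are invented.

```python
import tempfile
from pathlib import Path

def collect_json_files(json_files=(), json_dir=None):
    """Return explicit files first, then every *.json found directly in json_dir."""
    paths = [Path(p) for p in json_files]
    if json_dir is not None:
        paths.extend(sorted(Path(json_dir).glob("*.json")))
    return paths

# Demonstration on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.json", "a.json", "notes.txt"):
        (Path(tmp) / name).write_text("[]")
    found = collect_json_files(["extra.json"], tmp)
    found_names = [p.name for p in found]

print(found_names)
```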
To execute the test suite, you can use the following commands:
python3 -m venv llm_semantic_annotator_env
source llm_semantic_annotator_env/bin/activate
pip install -r requirements.txt
python -m unittest discover
Run a specific test file
python3 -m venv llm_semantic_annotator_env
source llm_semantic_annotator_env/bin/activate
pip install -r requirements.txt
python -m unittest tests/similarity/test_model_embedding_manager.py
python3 -m venv llm_semantic_annotator_env
source llm_semantic_annotator_env/bin/activate
pip install -r requirements.txt
python -m llm_semantic_annotator.similarity_evaluator
- -a : maximum number of articles
- -s : scroll time (e.g. 1m)
- -o : output directory
- 1000 articles per output file
. ./llm_semantic_annotator_env/bin/activate
python llm_semantic_annotator/misc/get_istex_corpus.py metabolite -s 20m -o data/istex -a 5000
check config/planteome-istex-pubmed.json