updated docs
penguine-ip committed Mar 10, 2025
1 parent 6550550 commit 8aaaec8
Showing 58 changed files with 1,026 additions and 438 deletions.
@@ -52,7 +52,7 @@ dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")
```

An `EvaluationDataset` accepts one mandatory and one optional argument:
An `EvaluationDataset` accepts **ONE** mandatory and **ONE** optional argument:

- `alias`: the alias of your dataset on Confident. A dataset alias is unique for each project.
- [Optional] `auto_convert_goldens_to_test_cases`: When set to `True`, `dataset.pull()` will automatically convert all goldens that were fetched from Confident into test cases and override all test cases you currently have in your `EvaluationDataset` instance. Defaulted to `False`.
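
For illustration, here is a minimal sketch of how both arguments can be combined, reusing the `"QA Dataset"` alias from the snippet above:

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

# Pull goldens from Confident AI and convert them into test cases in-place
dataset.pull(alias="QA Dataset", auto_convert_goldens_to_test_cases=True)
```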
2 changes: 1 addition & 1 deletion docs/confident-ai/confident-ai-human-feedback.mdx
@@ -41,7 +41,7 @@ deepeval.send_feedback(
)
```

There are two mandatory and four optional parameters when using the `send_feedback()` function:
There are **TWO** mandatory and **FOUR** optional parameters when using the `send_feedback()` function:

- `response_id`: a string representing the `response_id` returned from the `deepeval.monitor()` function.
- `rating`: an integer ranging from 1 - 5, inclusive.
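
For illustration, a minimal sketch of the two mandatory parameters (the four optional ones are truncated in this hunk and therefore omitted):

```python
import deepeval

# response_id is returned by a prior deepeval.monitor() call
deepeval.send_feedback(
    response_id=response_id,
    rating=4,  # integer from 1 to 5, inclusive
)
```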
2 changes: 1 addition & 1 deletion docs/confident-ai/confident-ai-llm-monitoring.mdx
@@ -171,7 +171,7 @@ sync_with_stream("Tell me a joke.")
The examples above use `OpenAI`, but realistically it will be whatever implementation your LLM application uses to generate the _observable_ outputs based on some input it receives.
:::

There are four mandatory and ten optional parameters when using the `monitor()` function to monitor responses in production:
There are **FOUR** mandatory and **TEN** optional parameters when using the `monitor()` function to monitor responses in production:

- `input`: type `str`, the input to your LLM application.
- `response`: type `str`, the final output of your LLM application.
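
For illustration, a hedged sketch using only the two parameters listed above; the remaining mandatory and optional parameters are truncated in this hunk, so they are left as a placeholder comment rather than invented:

```python
import deepeval

response_id = deepeval.monitor(
    input="Tell me a joke.",
    response="Why did the chicken cross the road? To get to the other side.",
    # ...remaining mandatory and optional parameters go here
)
```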
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-HumanEval.mdx
@@ -14,7 +14,7 @@ The **HumanEval** benchmark is a dataset designed to evaluate an LLM’s code ge

## Arguments

There are two optional arguments when using the `HumanEval` benchmark:
There are **TWO** optional arguments when using the `HumanEval` benchmark:

- [Optional] `tasks`: a list of tasks (`HumanEvalTask` enums), specifying which of the **164 programming tasks** to evaluate in the language model. By default, this is set to all tasks. Detailed descriptions of the `HumanEvalTask` enum can be found [here](#humaneval-tasks).
- [Optional] `n`: the number of code generation samples for each task for model evaluation using the pass@k metric. This is set to **200 by default**. A more detailed description of the `pass@k` metric and `n` parameter can be found [here](#passk-metric).
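
For illustration, a minimal sketch combining both arguments (the specific `HumanEvalTask` member and the enum import path are assumptions; see the task list linked above):

```python
from deepeval.benchmarks import HumanEval
from deepeval.benchmarks.tasks import HumanEvalTask

# Evaluate a single programming task with fewer samples than the 200 default
benchmark = HumanEval(
    tasks=[HumanEvalTask.HAS_CLOSE_ELEMENTS],
    n=100,
)
```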
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-MMLU.mdx
@@ -14,7 +14,7 @@ import Equation from "@site/src/components/equation";

## Arguments

There are two optional arguments when using the `MMLU` benchmark:
There are **TWO** optional arguments when using the `MMLU` benchmark:

- [Optional] `tasks`: a list of tasks (`MMLUTask` enums), specifying which of the **57 subject** areas to evaluate in the language model. By default, this is set to all tasks. Detailed descriptions of the `MMLUTask` enum can be found [here](#mmlu-tasks).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This is set to **5 by default** and cannot exceed this number.
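
For illustration, a minimal sketch combining both arguments (the specific `MMLUTask` members are assumptions; the full enum is listed in the section linked above):

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# Restrict evaluation to two subjects with 3-shot prompting
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3,
)
```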
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-arc.mdx
@@ -12,7 +12,7 @@ To learn more about the dataset and its construction, you can [read the original

## Arguments

There are three optional arguments when using the `ARC` benchmark:
There are **THREE** optional arguments when using the `ARC` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to all problems available in each benchmark mode.
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
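
For illustration, a minimal sketch of the two arguments shown above (the third optional argument is truncated in this hunk and therefore omitted):

```python
from deepeval.benchmarks import ARC

# Evaluate a 100-problem subset with 3-shot prompting
benchmark = ARC(
    n_problems=100,
    n_shots=3,
)
```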
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-bbq.mdx
@@ -16,7 +16,7 @@ sidebar_label: BBQ

## Arguments

There are two optional arguments when using the `BBQ` benchmark:
There are **TWO** optional arguments when using the `BBQ` benchmark:

- [Optional] `tasks`: a list of tasks (`BBQTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `BBQTask` enums can be found [here](#bbq-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-big-bench-hard.mdx
@@ -8,7 +8,7 @@ The **BIG-Bench Hard (BBH)** benchmark comprises 23 challenging BIG-Bench tasks

## Arguments

There are three optional arguments when using the `BigBenchHard` benchmark:
There are **THREE** optional arguments when using the `BigBenchHard` benchmark:

- [Optional] `tasks`: a list of tasks (`BigBenchHardTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `BigBenchHardTask` enums can be found [here](#big-bench-hard-tasks).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This number ranges strictly from 0-3, and is **set to 3 by default**.
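
For illustration, a minimal sketch combining both arguments (the specific `BigBenchHardTask` member is an assumption; the full enum is linked above):

```python
from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# Evaluate a single BBH task with the maximum of 3 shots
benchmark = BigBenchHard(
    tasks=[BigBenchHardTask.BOOLEAN_EXPRESSIONS],
    n_shots=3,
)
```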
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-bool-q.mdx
@@ -12,7 +12,7 @@ To learn more about the dataset and its construction, you can [read the original

## Arguments

There are two optional arguments when using the `BoolQ` benchmark:
There are **TWO** optional arguments when using the `BoolQ` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to 3270 (all problems).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-drop.mdx
@@ -12,7 +12,7 @@ sidebar_label: DROP

## Arguments

There are two optional arguments when using the `DROP` benchmark:
There are **TWO** optional arguments when using the `DROP` benchmark:

- [Optional] `tasks`: a list of tasks (`DROPTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `DROPTask` enums can be found [here](#drop-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-gsm8k.mdx
@@ -8,7 +8,7 @@ The **GSM8K** benchmark comprises 1,319 grade school math word problems, each cr

## Arguments

There are three optional arguments when using the `GSM8K` benchmark:
There are **THREE** optional arguments when using the `GSM8K` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to 1319 (all problems in the benchmark).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This number ranges strictly from 0-3, and is **set to 3 by default**.
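
For illustration, a minimal sketch of the two arguments shown above (the third optional argument is truncated in this hunk and therefore omitted):

```python
from deepeval.benchmarks import GSM8K

# Evaluate a 100-problem subset with 3-shot prompting
benchmark = GSM8K(
    n_problems=100,
    n_shots=3,
)
```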
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-hellaswag.mdx
@@ -12,7 +12,7 @@ sidebar_label: HellaSwag

## Arguments

There are two optional arguments when using the `HellaSwag` benchmark:
There are **TWO** optional arguments when using the `HellaSwag` benchmark:

- [Optional] `tasks`: a list of tasks (`HellaSwagTask` enums), which specifies the subject areas for sentence completion evaluation. By default, this is set to all tasks. The list of `HellaSwagTask` enums can be found [here](#hellaswag-tasks).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This is **set to 10** by default and **cannot exceed 15**.
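
For illustration, a minimal sketch combining both arguments (the specific `HellaSwagTask` member is an assumption; the full enum is linked above):

```python
from deepeval.benchmarks import HellaSwag
from deepeval.benchmarks.tasks import HellaSwagTask

# Evaluate a single sentence-completion task with the default 10 shots
benchmark = HellaSwag(
    tasks=[HellaSwagTask.TRIMMING_BRANCHES_OR_HEDGES],
    n_shots=10,
)
```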
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-lambada.mdx
@@ -12,7 +12,7 @@ The `LAMBADA` dataset is specifically designed so that humans cannot predict the

## Arguments

There are two optional arguments when using the `LAMBADA` benchmark:
There are **TWO** optional arguments when using the `LAMBADA` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to 5153 (all problems).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-logi-qa.mdx
@@ -12,7 +12,7 @@ LogiQA is derived from publicly available logical comprehension questions from C

## Arguments

There are two optional arguments when using the `LogiQA` benchmark:
There are **TWO** optional arguments when using the `LogiQA` benchmark:

- [Optional] `tasks`: a list of tasks (`LogiQATask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `LogiQATask` enums can be found [here](#logiqa-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-math-qa.mdx
@@ -12,7 +12,7 @@ sidebar_label: MathQA

## Arguments

There are two optional arguments when using the `MathQA` benchmark:
There are **TWO** optional arguments when using the `MathQA` benchmark:

- [Optional] `tasks`: a list of tasks (`MathQATask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `MathQATask` enums can be found [here](#mathqa-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-squad.mdx
@@ -12,7 +12,7 @@ SQuAD was constructed by sampling **536 articles from the top 10K Wikipedia arti

## Arguments

There are three optional arguments when using the `SQuAD` benchmark:
There are **THREE** optional arguments when using the `SQuAD` benchmark:

- [Optional] `tasks`: a list of tasks (`SQuADTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `SQuADTask` enums can be found [here](#squad-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-truthful-qa.mdx
@@ -8,7 +8,7 @@ sidebar_label: TruthfulQA

## Arguments

There are two optional arguments when using the `TruthfulQA` benchmark:
There are **TWO** optional arguments when using the `TruthfulQA` benchmark:

- [Optional] `tasks`: a list of tasks (`TruthfulQATask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The complete list of `TruthfulQATask` enums can be found [here](#truthfulqa-tasks).
- [Optional] `mode`: a `TruthfulQAMode` enum that selects the evaluation mode. This is set to `TruthfulQAMode.MC1` by default. `deepeval` currently supports 2 modes: **MC1 and MC2**.
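
For illustration, a minimal sketch combining both arguments (the `TruthfulQATask` member and the enum import paths are assumptions; see the task list linked above):

```python
from deepeval.benchmarks import TruthfulQA
from deepeval.benchmarks.tasks import TruthfulQATask
from deepeval.benchmarks.modes import TruthfulQAMode

# Evaluate a single category in MC2 mode instead of the default MC1
benchmark = TruthfulQA(
    tasks=[TruthfulQATask.ADVERTISING],
    mode=TruthfulQAMode.MC2,
)
```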
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-winogrande.mdx
@@ -12,7 +12,7 @@ Learn more about the construction of WinoGrande [here](https://arxiv.org/pdf/190

## Arguments

There are two optional arguments when using the `Winogrande` benchmark:
There are **TWO** optional arguments when using the `Winogrande` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to 1267 (all problems).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
4 changes: 2 additions & 2 deletions docs/docs/evaluation-introduction.mdx
@@ -145,7 +145,7 @@ And run the test file in the CLI:
deepeval test run test_example.py
```

There are two mandatory and one optional parameter when calling the `assert_test()` function:
There are **TWO** mandatory and **ONE** optional parameter when calling the `assert_test()` function:

- `test_case`: an `LLMTestCase`
- `metrics`: a list of metrics of type `BaseMetric`
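
For illustration, a minimal sketch of an `assert_test()` call inside a test file such as the `test_example.py` run above (the test case values are illustrative; the optional parameter is truncated here and omitted):

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    assert_test(test_case, [AnswerRelevancyMetric()])
```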
@@ -263,7 +263,7 @@ answer_relevancy_metric = AnswerRelevancyMetric()
evaluate(dataset, [answer_relevancy_metric])
```

There are two mandatory and thirteen optional arguments when calling the `evaluate()` function:
There are **TWO** mandatory and **THIRTEEN** optional arguments when calling the `evaluate()` function:

- `test_cases`: a list of `LLMTestCase`s **OR** `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot evaluate `LLMTestCase`/`MLLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseMetric`.
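
For illustration, a minimal sketch of the list-of-test-cases form (the `EvaluationDataset` form is shown in the snippet above; the thirteen optional arguments are truncated here and omitted):

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)

# Pass either a list of test cases or an EvaluationDataset
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric()])
```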
4 changes: 2 additions & 2 deletions docs/docs/evaluation-test-cases.mdx
@@ -458,7 +458,7 @@ def hyperparameters():
}
```

There are two mandatory and one optional parameter when calling the `assert_test()` function:
There are **TWO** mandatory and **ONE** optional parameter when calling the `assert_test()` function:

- `test_case`: an `LLMTestCase`
- `metrics`: a list of metrics of type `BaseMetric`
@@ -495,7 +495,7 @@ metric = HallucinationMetric(threshold=0.7)
evaluate([test_case], [metric])
```

There are two mandatory and thirteen optional arguments when calling the `evaluate()` function:
There are **TWO** mandatory and **THIRTEEN** optional arguments when calling the `evaluate()` function:

- `test_cases`: a list of `LLMTestCase`s **OR** `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot evaluate `LLMTestCase`/`MLLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseMetric`.
67 changes: 58 additions & 9 deletions docs/docs/metrics-answer-relevancy.mdx
@@ -6,7 +6,7 @@ sidebar_label: Answer Relevancy

import Equation from "@site/src/components/equation";

The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the `actual_output` of your LLM application is compared to the provided `input`. `deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.
The answer relevancy metric uses LLM-as-a-judge to measure the quality of your RAG pipeline's generator by evaluating how relevant the `actual_output` of your LLM application is compared to the provided `input`. `deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

:::tip
Here is a detailed guide on [RAG evaluation](/guides/guides-rag-evaluation), which we highly recommend as it explains everything about `deepeval`'s RAG metrics.
@@ -39,9 +39,11 @@ test_case = LLMTestCase(
actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])
```

There are **SEVEN** optional parameters when creating an `AnswerRelevancyMetric`:
@@ -52,18 +52,21 @@ There are **SEVEN** optional parameters when creating an `AnswerRelevancyMetric`
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](/docs/metrics-introduction#measuring-a-metric-in-async) Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.
- [Optional] `evaluation_template`: a class of type `AnswerRelevancyTemplate`, which allows you to override the default prompt templates used to compute the `AnswerRelevancyMetric` score. You can learn what the default prompts looks like [here](https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/answer_relevancy/template.py), and should read the [How Is It Calculated](#how-is-it-calculated) section below to understand how you can tailor it to your needs. Defaulted to `deepeval`'s `AnswerRelevancyTemplate`.
- [Optional] `evaluation_template`: a class of type `AnswerRelevancyTemplate`, which allows you to [override the default prompts](#customize-your-template) used to compute the `AnswerRelevancyMetric` score. Defaulted to `deepeval`'s `AnswerRelevancyTemplate`.
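
For illustration, a minimal sketch that sets a few of the optional parameters listed above (the values are arbitrary examples):

```python
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(
    strict_mode=False,   # keep the continuous 0-1 score
    async_mode=True,     # concurrent execution inside measure()
    verbose_mode=True,   # print intermediate steps to the console
)
```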

:::info
### As a standalone

If you are looking to generate a comprehensive evaluation report for your `test_case` or run multiple metrics on a single test case, use the `evaluate` function.
You can also run the `AnswerRelevancyMetric` on a single test case as a standalone, one-off execution.

```python
...

evaluate([test_case], [metric])
metric.measure(test_case)
print(metric.score, metric.reason)
```

:::caution
This is great for debugging or if you wish to build your own evaluation pipeline, but you will **NOT** get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) that the `evaluate()` function or `deepeval test run` offers.
:::

## How Is It Calculated?
@@ -74,7 +79,7 @@ The `AnswerRelevancyMetric` score is calculated according to the following equat

The `AnswerRelevancyMetric` first uses an LLM to extract all statements made in the `actual_output`, before using the same LLM to classify whether each statement is relevant to the `input`.

:::tip
:::note
You can set the `verbose_mode` of **ANY** `deepeval` metric to `True` to debug the `measure()` method:

```python
@@ -85,3 +90,47 @@ metric.measure(test_case)
```

:::

## Customize Your Template

Since `deepeval`'s `AnswerRelevancyMetric` is evaluated by LLM-as-a-judge, you can likely improve your metric accuracy by [overriding `deepeval`'s default prompt templates](/docs/metrics-introduction#customizing-metric-prompts). This is especially helpful if:

- You're using a [custom evaluation LLM](/guides/guides-using-custom-llms), especially for smaller models that have weaker instruction following capabilities.
- You want to customize the examples used in the default `AnswerRelevancyTemplate` to better align with your expectations.

:::tip
You can learn what the default `AnswerRelevancyTemplate` looks like [here on GitHub](https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/answer_relevancy/template.py), and should read the [How Is It Calculated](#how-is-it-calculated) section above to understand how you can tailor it to your needs.
:::

Here's a quick example of how you can override the statement generation step of the `AnswerRelevancyMetric` algorithm:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics.answer_relevancy import AnswerRelevancyTemplate

# Define custom template
class CustomTemplate(AnswerRelevancyTemplate):
    @staticmethod
    def generate_statements(actual_output: str):
        return f"""Given the text, break down and generate a list of statements presented.

Example:
Our new laptop model features a high-resolution Retina display for crystal-clear visuals.

{{
    "statements": [
        "The new laptop model has a high-resolution Retina display."
    ]
}}
===== END OF EXAMPLE ======

Text:
{actual_output}

JSON:
"""

# Inject custom template to metric
metric = AnswerRelevancyMetric(evaluation_template=CustomTemplate)
metric.measure(...)
```