updated docs
penguine-ip committed Mar 10, 2025
1 parent 6550550 commit 8aaaec8
Showing 58 changed files with 1,026 additions and 438 deletions.
@@ -52,7 +52,7 @@ dataset = EvaluationDataset()
dataset.pull(alias="QA Dataset")
```

An `EvaluationDataset` accepts one mandatory and one optional argument:
An `EvaluationDataset` accepts **ONE** mandatory and **ONE** optional argument:

- `alias`: the alias of your dataset on Confident. A dataset alias is unique for each project.
- [Optional] `auto_convert_goldens_to_test_cases`: When set to `True`, `dataset.pull()` will automatically convert all goldens that were fetched from Confident into test cases and override all test cases you currently have in your `EvaluationDataset` instance. Defaulted to `False`.
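
For illustration, here is a minimal sketch of how both arguments can be combined, reusing the `"QA Dataset"` alias from the snippet above:

```python
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()

# Pull goldens from Confident AI and convert them into test cases in-place
dataset.pull(alias="QA Dataset", auto_convert_goldens_to_test_cases=True)
```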
2 changes: 1 addition & 1 deletion docs/confident-ai/confident-ai-human-feedback.mdx
@@ -41,7 +41,7 @@ deepeval.send_feedback(
)
```

There are two mandatory and four optional parameters when using the `send_feedback()` function:
There are **TWO** mandatory and **FOUR** optional parameters when using the `send_feedback()` function:

- `response_id`: a string representing the `response_id` returned from the `deepeval.monitor()` function.
- `rating`: an integer ranging from 1 - 5, inclusive.
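
For illustration, a minimal sketch of the two mandatory parameters (the four optional ones are truncated in this hunk and therefore omitted):

```python
import deepeval

# response_id is returned by a prior deepeval.monitor() call
deepeval.send_feedback(
    response_id=response_id,
    rating=4,  # integer from 1 to 5, inclusive
)
```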
2 changes: 1 addition & 1 deletion docs/confident-ai/confident-ai-llm-monitoring.mdx
@@ -171,7 +171,7 @@ sync_with_stream("Tell me a joke.")
The examples above use `OpenAI`, but realistically it will be whatever implementation your LLM application uses to generate the _observable_ outputs based on some input it receives.
:::

There are four mandatory and ten optional parameters when using the `monitor()` function to monitor responses in production:
There are **FOUR** mandatory and **TEN** optional parameters when using the `monitor()` function to monitor responses in production:

- `input`: type `str`, the input to your LLM application.
- `response`: type `str`, the final output of your LLM application.
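
For illustration, a hedged sketch using only the two parameters listed above; the remaining mandatory and optional parameters are truncated in this hunk, so they are left as a placeholder comment rather than invented:

```python
import deepeval

response_id = deepeval.monitor(
    input="Tell me a joke.",
    response="Why did the chicken cross the road? To get to the other side.",
    # ...remaining mandatory and optional parameters go here
)
```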
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-HumanEval.mdx
@@ -14,7 +14,7 @@ The **HumanEval** benchmark is a dataset designed to evaluate an LLM’s code ge

## Arguments

There are two optional arguments when using the `HumanEval` benchmark:
There are **TWO** optional arguments when using the `HumanEval` benchmark:

- [Optional] `tasks`: a list of tasks (`HumanEvalTask` enums), specifying which of the **164 programming tasks** to evaluate in the language model. By default, this is set to all tasks. Detailed descriptions of the `HumanEvalTask` enum can be found [here](#humaneval-tasks).
- [Optional] `n`: the number of code generation samples for each task for model evaluation using the pass@k metric. This is set to **200 by default**. A more detailed description of the `pass@k` metric and `n` parameter can be found [here](#passk-metric).
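
For illustration, a minimal sketch combining both arguments (the specific `HumanEvalTask` member and the enum import path are assumptions; see the task list linked above):

```python
from deepeval.benchmarks import HumanEval
from deepeval.benchmarks.tasks import HumanEvalTask

# Evaluate a single programming task with fewer samples than the 200 default
benchmark = HumanEval(
    tasks=[HumanEvalTask.HAS_CLOSE_ELEMENTS],
    n=100,
)
```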
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-MMLU.mdx
@@ -14,7 +14,7 @@ import Equation from "@site/src/components/equation";

## Arguments

There are two optional arguments when using the `MMLU` benchmark:
There are **TWO** optional arguments when using the `MMLU` benchmark:

- [Optional] `tasks`: a list of tasks (`MMLUTask` enums), specifying which of the **57 subject** areas to evaluate in the language model. By default, this is set to all tasks. Detailed descriptions of the `MMLUTask` enum can be found [here](#mmlu-tasks).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This is set to **5 by default** and cannot exceed this number.
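
For illustration, a minimal sketch combining both arguments (the specific `MMLUTask` members are assumptions; the full enum is listed in the section linked above):

```python
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# Restrict evaluation to two subjects with 3-shot prompting
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3,
)
```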
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-arc.mdx
@@ -12,7 +12,7 @@ To learn more about the dataset and its construction, you can [read the original

## Arguments

There are three optional arguments when using the `ARC` benchmark:
There are **THREE** optional arguments when using the `ARC` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to all problems available in each benchmark mode.
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
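
For illustration, a minimal sketch of the two arguments shown above (the third optional argument is truncated in this hunk and therefore omitted):

```python
from deepeval.benchmarks import ARC

# Evaluate a 100-problem subset with 3-shot prompting
benchmark = ARC(
    n_problems=100,
    n_shots=3,
)
```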
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-bbq.mdx
@@ -16,7 +16,7 @@ sidebar_label: BBQ

## Arguments

There are two optional arguments when using the `BBQ` benchmark:
There are **TWO** optional arguments when using the `BBQ` benchmark:

- [Optional] `tasks`: a list of tasks (`BBQTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `BBQTask` enums can be found [here](#bbq-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-big-bench-hard.mdx
@@ -8,7 +8,7 @@ The **BIG-Bench Hard (BBH)** benchmark comprises 23 challenging BIG-Bench tasks

## Arguments

There are three optional arguments when using the `BigBenchHard` benchmark:
There are **THREE** optional arguments when using the `BigBenchHard` benchmark:

- [Optional] `tasks`: a list of tasks (`BigBenchHardTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `BigBenchHardTask` enums can be found [here](#big-bench-hard-tasks).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This number ranges strictly from 0-3, and is **set to 3 by default**.
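
For illustration, a minimal sketch combining both arguments (the specific `BigBenchHardTask` member is an assumption; the full enum is linked above):

```python
from deepeval.benchmarks import BigBenchHard
from deepeval.benchmarks.tasks import BigBenchHardTask

# Evaluate a single BBH task with the maximum of 3 shots
benchmark = BigBenchHard(
    tasks=[BigBenchHardTask.BOOLEAN_EXPRESSIONS],
    n_shots=3,
)
```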
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-bool-q.mdx
@@ -12,7 +12,7 @@ To learn more about the dataset and its construction, you can [read the original

## Arguments

There are two optional arguments when using the `BoolQ` benchmark:
There are **TWO** optional arguments when using the `BoolQ` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to 3270 (all problems).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-drop.mdx
@@ -12,7 +12,7 @@ sidebar_label: DROP

## Arguments

There are two optional arguments when using the `DROP` benchmark:
There are **TWO** optional arguments when using the `DROP` benchmark:

- [Optional] `tasks`: a list of tasks (`DROPTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `DROPTask` enums can be found [here](#drop-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-gsm8k.mdx
@@ -8,7 +8,7 @@ The **GSM8K** benchmark comprises 1,319 grade school math word problems, each cr

## Arguments

There are three optional arguments when using the `GSM8K` benchmark:
There are **THREE** optional arguments when using the `GSM8K` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to 1319 (all problems in the benchmark).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This number ranges strictly from 0-3, and is **set to 3 by default**.
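
For illustration, a minimal sketch of the two arguments shown above (the third optional argument is truncated in this hunk and therefore omitted):

```python
from deepeval.benchmarks import GSM8K

# Evaluate a 100-problem subset with 3-shot prompting
benchmark = GSM8K(
    n_problems=100,
    n_shots=3,
)
```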
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-hellaswag.mdx
@@ -12,7 +12,7 @@ sidebar_label: HellaSwag

## Arguments

There are two optional arguments when using the `HellaSwag` benchmark:
There are **TWO** optional arguments when using the `HellaSwag` benchmark:

- [Optional] `tasks`: a list of tasks (`HellaSwagTask` enums), which specifies the subject areas for sentence completion evaluation. By default, this is set to all tasks. The list of `HellaSwagTask` enums can be found [here](#hellaswag-tasks).
- [Optional] `n_shots`: the number of "shots" to use for few-shot learning. This is **set to 10** by default and **cannot exceed 15**.
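
For illustration, a minimal sketch combining both arguments (the specific `HellaSwagTask` member is an assumption; the full enum is linked above):

```python
from deepeval.benchmarks import HellaSwag
from deepeval.benchmarks.tasks import HellaSwagTask

# Evaluate a single sentence-completion task with the default 10 shots
benchmark = HellaSwag(
    tasks=[HellaSwagTask.TRIMMING_BRANCHES_OR_HEDGES],
    n_shots=10,
)
```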
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-lambada.mdx
@@ -12,7 +12,7 @@ The `LAMBADA` dataset is specifically designed so that humans cannot predict the

## Arguments

There are two optional arguments when using the `LAMBADA` benchmark:
There are **TWO** optional arguments when using the `LAMBADA` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to 5153 (all problems).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-logi-qa.mdx
@@ -12,7 +12,7 @@ LogiQA is derived from publicly available logical comprehension questions from C

## Arguments

There are two optional arguments when using the `LogiQA` benchmark:
There are **TWO** optional arguments when using the `LogiQA` benchmark:

- [Optional] `tasks`: a list of tasks (`LogiQATask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `LogiQATask` enums can be found [here](#logiqa-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-math-qa.mdx
@@ -12,7 +12,7 @@ sidebar_label: MathQA

## Arguments

There are two optional arguments when using the `MathQA` benchmark:
There are **TWO** optional arguments when using the `MathQA` benchmark:

- [Optional] `tasks`: a list of tasks (`MathQATask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `MathQATask` enums can be found [here](#mathqa-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-squad.mdx
@@ -12,7 +12,7 @@ SQuAD was constructed by sampling **536 articles from the top 10K Wikipedia arti

## Arguments

There are three optional arguments when using the `SQuAD` benchmark:
There are **THREE** optional arguments when using the `SQuAD` benchmark:

- [Optional] `tasks`: a list of tasks (`SQuADTask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The list of `SQuADTask` enums can be found [here](#squad-tasks).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-truthful-qa.mdx
@@ -8,7 +8,7 @@ sidebar_label: TruthfulQA

## Arguments

There are two optional arguments when using the `TruthfulQA` benchmark:
There are **TWO** optional arguments when using the `TruthfulQA` benchmark:

- [Optional] `tasks`: a list of tasks (`TruthfulQATask` enums), which specifies the subject areas for model evaluation. By default, this is set to all tasks. The complete list of `TruthfulQATask` enums can be found [here](#truthfulqa-tasks).
- [Optional] `mode`: a `TruthfulQAMode` enum that selects the evaluation mode. This is set to `TruthfulQAMode.MC1` by default. `deepeval` currently supports 2 modes: **MC1 and MC2**.
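
For illustration, a minimal sketch combining both arguments (the `TruthfulQATask` member and the enum import paths are assumptions; see the task list linked above):

```python
from deepeval.benchmarks import TruthfulQA
from deepeval.benchmarks.tasks import TruthfulQATask
from deepeval.benchmarks.modes import TruthfulQAMode

# Evaluate a single category in MC2 mode instead of the default MC1
benchmark = TruthfulQA(
    tasks=[TruthfulQATask.ADVERTISING],
    mode=TruthfulQAMode.MC2,
)
```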
2 changes: 1 addition & 1 deletion docs/docs/benchmarks-winogrande.mdx
@@ -12,7 +12,7 @@ Learn more about the construction of WinoGrande [here](https://arxiv.org/pdf/190

## Arguments

There are two optional arguments when using the `Winogrande` benchmark:
There are **TWO** optional arguments when using the `Winogrande` benchmark:

- [Optional] `n_problems`: the number of problems for model evaluation. By default, this is set to 1267 (all problems).
- [Optional] `n_shots`: the number of examples for few-shot learning. This is **set to 5** by default and **cannot exceed 5**.
4 changes: 2 additions & 2 deletions docs/docs/evaluation-introduction.mdx
@@ -145,7 +145,7 @@ And run the test file in the CLI:
deepeval test run test_example.py
```

There are two mandatory and one optional parameter when calling the `assert_test()` function:
There are **TWO** mandatory and **ONE** optional parameter when calling the `assert_test()` function:

- `test_case`: an `LLMTestCase`
- `metrics`: a list of metrics of type `BaseMetric`
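
For illustration, a minimal sketch of an `assert_test()` call inside a test file such as the `test_example.py` run above (the test case values are illustrative; the optional parameter is truncated here and omitted):

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="We offer a 30-day full refund at no extra cost.",
    )
    assert_test(test_case, [AnswerRelevancyMetric()])
```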
@@ -263,7 +263,7 @@ answer_relevancy_metric = AnswerRelevancyMetric()
evaluate(dataset, [answer_relevancy_metric])
```

There are two mandatory and thirteen optional arguments when calling the `evaluate()` function:
There are **TWO** mandatory and **THIRTEEN** optional arguments when calling the `evaluate()` function:

- `test_cases`: a list of `LLMTestCase`s **OR** `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot evaluate `LLMTestCase`/`MLLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseMetric`.
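
For illustration, a minimal sketch of the list-of-test-cases form (the `EvaluationDataset` form is shown in the snippet above; the thirteen optional arguments are truncated here and omitted):

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)

# Pass either a list of test cases or an EvaluationDataset
evaluate(test_cases=[test_case], metrics=[AnswerRelevancyMetric()])
```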
4 changes: 2 additions & 2 deletions docs/docs/evaluation-test-cases.mdx
@@ -458,7 +458,7 @@ def hyperparameters():
}
```

There are two mandatory and one optional parameter when calling the `assert_test()` function:
There are **TWO** mandatory and **ONE** optional parameter when calling the `assert_test()` function:

- `test_case`: an `LLMTestCase`
- `metrics`: a list of metrics of type `BaseMetric`
@@ -495,7 +495,7 @@ metric = HallucinationMetric(threshold=0.7)
evaluate([test_case], [metric])
```

There are two mandatory and thirteen optional arguments when calling the `evaluate()` function:
There are **TWO** mandatory and **THIRTEEN** optional arguments when calling the `evaluate()` function:

- `test_cases`: a list of `LLMTestCase`s **OR** `ConversationalTestCase`s, or an `EvaluationDataset`. You cannot evaluate `LLMTestCase`/`MLLMTestCase`s and `ConversationalTestCase`s in the same test run.
- `metrics`: a list of metrics of type `BaseMetric`.
67 changes: 58 additions & 9 deletions docs/docs/metrics-answer-relevancy.mdx
@@ -6,7 +6,7 @@ sidebar_label: Answer Relevancy

import Equation from "@site/src/components/equation";

The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the `actual_output` of your LLM application is compared to the provided `input`. `deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.
The answer relevancy metric uses LLM-as-a-judge to measure the quality of your RAG pipeline's generator by evaluating how relevant the `actual_output` of your LLM application is compared to the provided `input`. `deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

:::tip
Here is a detailed guide on [RAG evaluation](/guides/guides-rag-evaluation), which we highly recommend as it explains everything about `deepeval`'s RAG metrics.
@@ -39,9 +39,11 @@ test_case = LLMTestCase(
actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
# To run metric as a standalone
# metric.measure(test_case)
# print(metric.score, metric.reason)

evaluate(test_cases=[test_case], metrics=[metric])
```

There are **SEVEN** optional parameters when creating an `AnswerRelevancyMetric`:
@@ -52,18 +52,21 @@ There are **SEVEN** optional parameters when creating an `AnswerRelevancyMetric`
- [Optional] `strict_mode`: a boolean which when set to `True`, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to `False`.
- [Optional] `async_mode`: a boolean which when set to `True`, enables [concurrent execution within the `measure()` method.](/docs/metrics-introduction#measuring-a-metric-in-async) Defaulted to `True`.
- [Optional] `verbose_mode`: a boolean which when set to `True`, prints the intermediate steps used to calculate said metric to the console, as outlined in the [How Is It Calculated](#how-is-it-calculated) section. Defaulted to `False`.
- [Optional] `evaluation_template`: a class of type `AnswerRelevancyTemplate`, which allows you to override the default prompt templates used to compute the `AnswerRelevancyMetric` score. You can learn what the default prompts looks like [here](https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/answer_relevancy/template.py), and should read the [How Is It Calculated](#how-is-it-calculated) section below to understand how you can tailor it to your needs. Defaulted to `deepeval`'s `AnswerRelevancyTemplate`.
- [Optional] `evaluation_template`: a class of type `AnswerRelevancyTemplate`, which allows you to [override the default prompts](#customize-your-template) used to compute the `AnswerRelevancyMetric` score. Defaulted to `deepeval`'s `AnswerRelevancyTemplate`.
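
For illustration, a minimal sketch that sets a few of the optional parameters listed above (the values are arbitrary examples):

```python
from deepeval.metrics import AnswerRelevancyMetric

metric = AnswerRelevancyMetric(
    strict_mode=False,   # keep the continuous 0-1 score
    async_mode=True,     # concurrent execution inside measure()
    verbose_mode=True,   # print intermediate steps to the console
)
```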

:::info
### As a standalone

If you are looking to generate a comprehensive evaluation report for your `test_case` or run multiple metrics on a single test case, use the `evaluate` function.
You can also run the `AnswerRelevancyMetric` on a single test case as a standalone, one-off execution.

```python
...

evaluate([test_case], [metric])
metric.measure(test_case)
print(metric.score, metric.reason)
```

:::caution
This is great for debugging or if you wish to build your own evaluation pipeline, but you will **NOT** get the benefits (testing reports, Confident AI platform) and all the optimizations (speed, caching, computation) that the `evaluate()` function or `deepeval test run` offers.
:::

## How Is It Calculated?
@@ -74,7 +79,7 @@ The `AnswerRelevancyMetric` score is calculated according to the following equat

The `AnswerRelevancyMetric` first uses an LLM to extract all statements made in the `actual_output`, before using the same LLM to classify whether each statement is relevant to the `input`.

:::tip
:::note
You can set the `verbose_mode` of **ANY** `deepeval` metric to `True` to debug the `measure()` method:

```python
@@ -85,3 +90,47 @@ metric.measure(test_case)
```

:::

## Customize Your Template

Since `deepeval`'s `AnswerRelevancyMetric` is evaluated by LLM-as-a-judge, you can likely improve your metric accuracy by [overriding `deepeval`'s default prompt templates](/docs/metrics-introduction#customizing-metric-prompts). This is especially helpful if:

- You're using a [custom evaluation LLM](/guides/guides-using-custom-llms), especially for smaller models that have weaker instruction following capabilities.
- You want to customize the examples used in the default `AnswerRelevancyTemplate` to better align with your expectations.

:::tip
You can learn what the default `AnswerRelevancyTemplate` looks like [here on GitHub](https://github.com/confident-ai/deepeval/blob/main/deepeval/metrics/answer_relevancy/template.py), and should read the [How Is It Calculated](#how-is-it-calculated) section above to understand how you can tailor it to your needs.
:::

Here's a quick example of how you can override the statement generation step of the `AnswerRelevancyMetric` algorithm:

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics.answer_relevancy import AnswerRelevancyTemplate

# Define custom template
class CustomTemplate(AnswerRelevancyTemplate):
    @staticmethod
    def generate_statements(actual_output: str):
        return f"""Given the text, break down and generate a list of statements presented.

Example:
Our new laptop model features a high-resolution Retina display for crystal-clear visuals.

{{
    "statements": [
        "The new laptop model has a high-resolution Retina display."
    ]
}}
===== END OF EXAMPLE ======

Text:
{actual_output}

JSON:
"""

# Inject custom template to metric
metric = AnswerRelevancyMetric(evaluation_template=CustomTemplate)
metric.measure(...)
```