Commit c2e6e52

scosman authored and gitbook-bot committed
GITBOOK-151: No subject
1 parent 5031fa3 commit c2e6e52

3 files changed, +14 -1 lines changed


SUMMARY.md

+1 -1
@@ -8,9 +8,9 @@
 * [Models and AI Providers](docs/models-and-ai-providers.md)
 * [Synthetic Data Generation](docs/synthetic-data-generation.md)
 * [Fine Tuning Guide](docs/fine-tuning-guide.md)
+* [Evaluations](docs/evaluations.md)
 * [Guide: Train a Reasoning Model](docs/guide-train-a-reasoning-model.md)
 * [Reasoning & Chain of Thought](docs/reasoning-and-chain-of-thought.md)
-* [Evaluations](docs/evaluations.md)
 * [Prompts](docs/prompts.md)
 * [Reviewing and Rating](docs/reviewing-and-rating.md)
 * [Collaboration](docs/collaboration.md)

docs/evaluations.md

+7
@@ -413,6 +413,13 @@ Like mean squared error, but scores are normalized to the range 0-1. For example
 
 </details>
 
+#### Resolving "N/A" Correlation Scores
+
+If you see "N/A" scores in your correlation table, it means more data is needed. This can be one of two cases:
+
+* _**Simply not enough data**_: if your eval method dataset is very small (<10 items), it can be impossible to produce confident correlation scores. Add more data to resolve this case.
+* _**Not enough variation of human ratings in the eval method dataset**_: if you have a larger dataset but still get N/A, it's likely there isn't enough variation in your dataset for the given score. For example, if all of the golden samples for a score pass, the evaluator won't produce a confident correlation score, as it has no failing examples and everything is a tie. Add more content to your eval method dataset, designing the content to fill out the missing score ranges. You can use synthetic data gen [human guidance](synthetic-data-generation.md#human-guidance) to generate examples that fail.
+
 #### Select a Default Eval Method
 
 Once you have a winner, click the "Set as default" button to make this eval-method the default for your eval.
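
To illustrate the "everything is a tie" case described in the new "Resolving N/A Correlation Scores" section above, here is a minimal sketch using scipy's kendalltau as a stand-in rank-correlation metric (an assumption; it is not necessarily the metric Kiln computes). The ratings and scores are made up for illustration.

```python
# Why zero variation in human ratings yields an undefined ("N/A") correlation.
from scipy.stats import kendalltau

evaluator_scores = [0.9, 0.8, 0.95, 0.7, 0.85]

# Case 1: every golden sample passes. The human ratings are all ties, so the
# rank correlation is undefined and comes back as NaN, surfaced as "N/A".
human_all_pass = [1, 1, 1, 1, 1]
tau, _ = kendalltau(human_all_pass, evaluator_scores)
print(tau)  # nan

# Case 2: the dataset has both passing and failing examples, so a
# meaningful correlation can be computed.
human_varied = [1, 0, 1, 0, 1]
tau, _ = kendalltau(human_varied, evaluator_scores)
print(tau)  # a real number (positive here, since the evaluator ranks passes higher)
```

Adding failing examples, or simply more items, is what turns that NaN into a usable correlation score.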

docs/synthetic-data-generation.md

+6
@@ -58,6 +58,12 @@ Adding a short guidance prompt can quickly improve the quality of the generated
 
 <figure><img src="../.gitbook/assets/Screenshot 2025-02-07 at 9.31.39 AM.png" alt="" width="152"><figcaption><p>Click "Add Guidance"</p></figcaption></figure>
 
+{% hint style="info" %}
+Human guidance is often used to produce adversarial content: poor quality or inappropriate content. This is done to ensure an [evaluation](evaluations.md) can detect and fail this sort of content.
+
+However, LLMs will often do their best to avoid producing poor or inappropriate content, even when asked for it. If you find that's the case, use an uncensored and unaligned model like Dolphin 8x22B or Grok. These models will follow instructions more closely, and do not attempt to censor their content.
+{% endhint %}
+
 #### Interactive Curation UX
 
 Kiln synthetic data generation is designed to be used in our interactive UI.
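
As a purely hypothetical illustration of the guidance described in the hint above (not taken from the Kiln docs), an adversarial guidance prompt might read something like: "Generate low-quality examples: factually wrong, incomplete, or off-topic. These will be used as failing samples for our evaluation."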
