# Already Said That

This eval measures how robust models are to distractors when performing
sequential tasks. We construct a toy task where the model needs to determine
whether it has already seen a given word, and inject distractor questions into
the interaction, keeping track of model performance throughout.

## Usage

Run with:

```bash
oaieval <solver> already_said_that
```

We have found that `generation/direct/gpt-4-0125-preview` works well on this
eval. For more examples of tested solvers, see
[`./scripts/run_experiments.sh`](./scripts/run_experiments.sh).

## Dataset

The dataset consists of 500 samples, where each sample contains 100 unique words
randomly sampled from the [WordNet corpus](https://wordnet.princeton.edu/) via
the `nltk` library.

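The per-sample construction can be sketched as below. This is an illustrative reconstruction, not the eval's actual code; in the eval itself the vocabulary comes from WordNet via `nltk.corpus.wordnet`, while here a placeholder list stands in so the sketch stays self-contained.

```python
import random


def sample_unique_words(vocabulary: list[str], n_words: int = 100, seed: int = 0) -> list[str]:
    """Draw n_words distinct words from the vocabulary, mirroring how each
    dataset sample is built (random.sample guarantees uniqueness)."""
    rng = random.Random(seed)
    return rng.sample(vocabulary, n_words)


# Stand-in for the WordNet lemma list used by the real eval.
vocab = [f"word{i}" for i in range(1000)]
sample = sample_unique_words(vocab, n_words=100)
```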
We also rely on four sets of distractor questions, sourced directly from the
datasets of pre-existing evals. Specifically, we make use of the datasets of the
following evals from our evals registry:

- [`which-is-heavier`](../../registry/evals/which-is-heavier.yaml)
- [`first-letters`](../../registry/evals/first-letters.yaml)
- [`ambiguous-sentences`](../../registry/evals/ambiguous-sentences.yaml)
- [`reverse-sort-words-eng`](../../registry/evals/reverse-sort-words-eng.yaml)

## Evaluation Process

The evaluation process for a given sample from our dataset is as follows:

1. The `TASK_DESCRIPTION` prompt is shown to the solver.
2. For 100 turns, we show the solver either a word or a distractor question,
   with probability 2/3 and 1/3 respectively.
3. If a word is shown, we prefix it with `MAIN TASK -` to indicate that we are
   asking the solver to perform the main task of determining whether it has seen
   the word before.
4. When showing a word, we show a previously seen word with probability 1/2 and
   a new word with probability 1/2.
5. If we show a distractor question, we show the question directly to the
   solver.
6. The solver should respond with its answer wrapped in the format
   `[answer: <answer>]`.
7. The solver's response is parsed and compared to the correct answer.
8. If, on a main-task turn, the solver's response is incorrect or a violation is
   raised (the answer is not in the expected format), we stop the interaction
   and record the number of turns the solver lasted. Otherwise we continue to
   the next turn.

## Prompts

See [`./prompts.py`](./prompts.py) for the `TASK_DESCRIPTION` used in the eval,
and [`./distractors.py`](./distractors.py) for any cosmetic changes we make to
the distractor questions.

## Metrics

Below are the metrics returned by the eval:

<!-- prettier-ignore-start -->
| **Metric** | **Notes** |
| --- | --- |
| `avg_num_turns` | The average number of turns shown before the model fails, across the samples. Higher is better. Best possible is 100. |
| `stddev_num_turns` | The standard deviation on the above. |
| `median_num_turns` | The median number of turns shown before the model fails, across the samples. Higher is better. Best possible is 100. |
| `max_num_turns` | The maximum number of turns shown before the model fails, across the samples. |
| `min_num_turns` | The minimum number of turns shown before the model fails, across the samples. |
| `false_positive_rate` | How often the model answers “yes” when it should have answered “no” (i.e. a new word is shown, and the model claims to have seen it already). |
| `false_negative_rate` | How often the model answers “no” when it should have answered “yes” (i.e. a word is shown again, and the model claims to not have seen it). |
| `avg_distractor_accuracy` | For a given sample interaction, we measure whether each model response to a given distractor question is accurate. We then compute the accuracy on the distractor questions shown over the interaction, and average this accuracy across all samples. |
| `violation_rate` | How often the model responds in an invalid format, i.e. not using the `[answer: <answer>]` format. |
| `avg_num_distractors` | The average number of distractors shown before the model fails, across the samples. Higher is better. Best possible is around 33. |
| `stddev_num_distractors` | The standard deviation on the above. |
| `median_num_distractors` | The median number of distractors shown before the model fails, across the samples. Higher is better. Best possible is around 33. |
| `max_num_distractors` | The maximum number of distractors shown before the model fails, across the samples. |
| `min_num_distractors` | The minimum number of distractors shown before the model fails, across the samples. |
<!-- prettier-ignore-end -->

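The two-stage averaging behind `avg_distractor_accuracy` (per-sample accuracy first, then a mean across samples) can be sketched as follows; the function name and input shape are illustrative, not the eval's actual code.

```python
from statistics import mean


def avg_distractor_accuracy(per_sample_results: list[list[bool]]) -> float:
    """Each inner list holds correctness flags for the distractor questions
    shown in one sample interaction. Accuracy is computed per sample, then
    averaged across samples (samples with no distractors are skipped)."""
    return mean(mean(flags) for flags in per_sample_results if flags)
```

Note that this weights every sample equally, regardless of how many distractors it contained.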
## Variants

We consider each of the four distractor datasets mentioned in
[Dataset](#dataset) as a variant of the eval.

```bash
oaieval <solver> already_said_that.<distractor>
```

We also have a `distractorless` variant where we only show words to the solver.
We use this as a baseline to determine how robust the solver is to distractors.

```bash
oaieval <solver> already_said_that.distractorless
```

## Custom Solvers

We implement two custom solvers for this eval in [./solvers.py](./solvers.py):

1. `RandomBaselineSolver`: A solver that randomly answers `yes` or `no` for any
   input. We view this baseline as equivalent to randomly guessing.
2. `AlreadySaidThatHuman`: A helper solver class that wraps the `HumanCliSolver`
   class such that users do not have to wrap their answer in the
   `[answer: <answer>]` format and can instead just directly type the answer.

## Token Usage Estimates

Below are approximate token usage estimates for a given run (one run = all
samples) of the eval, for each of the distractor variants.

For Direct gpt-4-0125-preview:

| Distractor variant    | Input      | Output  | Total      |
| --------------------- | ---------- | ------- | ---------- |
| which-is-heavier      | 17,960,000 | 80,000  | 18,040,000 |
| ambiguous-sentences   | 27,750,000 | 110,000 | 27,860,000 |
| first-letters         | 19,850,000 | 80,000  | 19,940,000 |
| reverse-sort-words-en | 10,700,000 | 120,000 | 10,820,000 |
| distractorless        | 27,550,000 | 120,000 | 27,680,000 |

For Direct gpt-3.5-turbo-0125:

| Distractor variant    | Input     | Output | Total     |
| --------------------- | --------- | ------ | --------- |
| which-is-heavier      | 1,200,000 | 10,000 | 1,210,000 |
| ambiguous-sentences   | 1,540,000 | 20,000 | 1,550,000 |
| first-letters         | 2,120,000 | 20,000 | 2,140,000 |
| reverse-sort-words-en | 910,000   | 20,000 | 940,000   |
| distractorless        | 1,250,000 | 20,000 | 1,270,000 |

For Direct gpt-4-base:

| Distractor variant    | Input      | Output    | Total      |
| --------------------- | ---------- | --------- | ---------- |
| which-is-heavier      | 16,950,000 | 3,670,000 | 20,620,000 |
| ambiguous-sentences   | 23,100,000 | 4,390,000 | 27,490,000 |
| first-letters         | 25,310,000 | 4,870,000 | 30,180,000 |
| reverse-sort-words-en | 14,380,000 | 2,760,000 | 17,140,000 |
| distractorless        | 24,460,000 | 5,000,000 | 29,460,000 |

For CoT gpt-4-0125-preview:

| Distractor variant    | Input       | Output    | Total       |
| --------------------- | ----------- | --------- | ----------- |
| which-is-heavier      | 263,600,000 | 1,900,000 | 265,500,000 |
| ambiguous-sentences   | 383,500,000 | 2,700,000 | 386,200,000 |
| first-letters         | 251,700,000 | 1,700,000 | 253,400,000 |
| reverse-sort-words-en | 236,700,000 | 2,100,000 | 238,800,000 |
| distractorless        | 395,500,000 | 2,400,000 | 398,000,000 |

For CoT gpt-3.5-turbo-0125:

| Distractor variant    | Input      | Output  | Total      |
| --------------------- | ---------- | ------- | ---------- |
| which-is-heavier      | 10,100,000 | 190,000 | 10,280,000 |
| ambiguous-sentences   | 7,510,000  | 140,000 | 7,650,000  |
| first-letters         | 16,450,000 | 220,000 | 16,670,000 |
| reverse-sort-words-en | 4,690,000  | 150,000 | 4,840,000  |
| distractorless        | 30,230,000 | 310,000 | 30,540,000 |

## Future modifications

- Extending the range of distractors considered, either by incorporating more
  evals or designing new distractor variants.
- Experimenting with multiple distractor sources in a single eval run, to see
  whether the variety of distractors affects the model's robustness.

## Version History

- v0: Initial version released

## Contribution Statement

Eval design, implementation, and results evaluation were primarily conducted by
Giulio Starace, under the guidance of (alphabetically by last name) Steven
Adler, Andrei Alexandru, James Aung, and Chan Jun Shern, who provided research
input, report revisions, and project management support.