From 7dacf0a45a32c6de53c3363b7c6f099059bbb71e Mon Sep 17 00:00:00 2001
From: David Wu
Date: Sun, 11 Feb 2024 12:41:42 -0500
Subject: [PATCH] Fix one more math

---
 docs/KataGoMethods.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/KataGoMethods.md b/docs/KataGoMethods.md
index c2a100ad1..b916bb81e 100644
--- a/docs/KataGoMethods.md
+++ b/docs/KataGoMethods.md
@@ -177,8 +177,7 @@ By only averaging errors in a bucket rather than absolute utilities, we continue
 (This method was first experimented with in KataGo in early 2021, and released in June 2021 with v1.9.0).
 
 This method can be motivated and explained by a simple observation. Consider the PUCT formula that controls exploitation versus exploration in modern AlphaZero-style MCTS:
-
-
+$$\text{Next action to explore}=\text{argmax}_a \, Q(a) + c_{\text{PUCT}} P(a) \frac{\sqrt{\sum_b N(b)}}{1 + N(a)}$$
 Suppose for a given game/subgame/situation/tactic the value of the cPUCT coefficient is k. Then, consider a game/subgame/situation/tactic that is identical except all the differences between all the Q values at every node are doubled (e.g. the differences between the winrates of moves and the results of playouts are doubled). In this new game, the optimal cPUCT coefficient is now 2k because a coefficient of 2k is what is needed to exactly replicate the original search behavior, given that the differences in Q are all twice as large as before.
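The scaling argument in the patched passage can be checked directly: if every PUCT score is multiplied by the same positive constant, the argmax is unchanged. Below is a minimal sketch of the PUCT formula the patch adds, with purely illustrative values for Q, P, N, and cPUCT (none of these numbers come from KataGo itself):

```python
import math

def puct_select(Q, P, N, c_puct):
    """Pick the next action a maximizing
    Q(a) + c_puct * P(a) * sqrt(sum_b N(b)) / (1 + N(a))."""
    sqrt_total = math.sqrt(sum(N))
    scores = [q + c_puct * p * sqrt_total / (1 + n)
              for q, p, n in zip(Q, P, N)]
    return max(range(len(scores)), key=lambda a: scores[a])

# Illustrative (hypothetical) values for a 3-action node:
Q = [0.52, 0.49, 0.55]   # mean utilities
P = [0.6, 0.3, 0.1]      # policy priors
N = [10, 5, 2]           # visit counts
k = 1.2                  # some cPUCT coefficient

a1 = puct_select(Q, P, N, k)

# Doubling all Q differences (here, scaling Q by 2) while also doubling
# cPUCT multiplies every score by exactly 2, so selection is identical:
a2 = puct_select([2 * q for q in Q], P, N, 2 * k)
assert a1 == a2
```

This mirrors the text's claim: with Q differences doubled, a coefficient of 2k reproduces the original search behavior exactly.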