This is based on observing a lot of variation between runs with `n=1`
and `n=3`.
* With `n=8`, two runs on the same branch give results that seem close
enough to be reasonably consistent.
* With higher concurrency, running this many repetitions seems to make
language servers time out a lot, causing evals to fail.

Release Notes:

- N/A