Max Brunsfeld 36d02de784 Rework eval to support interpretable scores and efficient repetitions (#29197)
### Problem

We want to start continuously tracking our progress on agent evals over
time. As part of this, we'd like the *score* to have a clear,
interpretable meaning. Right now, it's a number from 0 to 5, but it's
not clear what any particular number works. In addition, scores vary
widely from run to run, because the agent's output is deterministic. We
try to stabilize the score using a panel of judges, but the behavior of
the agent itself varies much more widely than the judges' scores for a
given run.

### Solution

* **explicit meanings of scores** - In this PR, we're prescribing the
diff and thread criteria files so that they *must* be unordered lists of
assertions. For both the thread and the diff, rather than providing an
abstract score, the judge's task is simply to count how many of these
assertions are satisfied. A percentage score can be derived from this
number, divided by the total number of assertions.
* **repetitions** - Rather than running each example once, and judging
it N times, we'll **run** the example N times. Right now, I'm just
judging the output once per run, because I believe that with these more
clear scoring criteria, the main source of non-determinism will be the
*agent's* behavior, not the judge's

### Questions

* **accounting for diagnostic errors** - Previously, the judge was asked
to incorporate diagnostics into their abstract scores. Now that the
"score" is determined directly from the criteria, the diagnostic will
not be captured in the score. How should the diagnostics be accounted
for in the eval? One thought is - let's simply count and report the
number of errors remaining after the agent finishes, as a separate field
of the run (along with diff score and thread score). We could consider
normalizing it using the total lines of added code (like errors per 100
lines of code added) in order to give it some semblance of stability
between examples.

* **repetitions** - How many repetitions should we run on CI? Each
repetition takes significant time, but I think running more than one
repetition will make the scores significantly less volatile.

### Todo

* [x] Fix `--concurrency` implementation so that only N tasks are
spawned
* [x] Support `--repetitions` efficiently (re-using the same worktree)
* [x] Restructure judge prompts to count passing criteria, not compute
abstract score
* [x] Report total number of diagnostics in some way
* [x] Format output nicely

Release Notes:

- N/A *or* Added/Fixed/Improved ...

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
2025-04-22 14:00:09 +00:00
2025-04-01 22:35:15 +00:00
2025-04-18 19:28:14 +00:00
2025-04-21 22:28:32 +00:00
WIP
2023-12-14 09:25:14 -07:00
2025-03-10 01:06:11 -07:00
2024-09-30 17:46:21 -04:00
2025-04-01 14:19:10 -07:00
2025-02-04 09:02:59 -05:00
2025-02-04 09:02:59 -05:00
2024-08-08 17:21:38 -07:00
2024-09-05 15:39:16 -04:00
2025-03-10 01:06:11 -07:00

Zed

CI

Welcome to Zed, a high-performance, multiplayer code editor from the creators of Atom and Tree-sitter.


Installation

Packaging status

On macOS and Linux you can download Zed directly or install Zed via your local package manager.

Other platforms are not yet available:

Developing Zed

Contributing

See CONTRIBUTING.md for ways you can contribute to Zed.

Also... we're hiring! Check out our jobs page for open roles.

Licensing

License information for third party dependencies must be correctly provided for CI to pass.

We use cargo-about to automatically comply with open source licenses. If CI is failing, check the following:

  • Is it showing a no license specified error for a crate you've created? If so, add publish = false under [package] in your crate's Cargo.toml.
  • Is the error failed to satisfy license requirements for a dependency? If so, first determine what license the project has and whether this system is sufficient to comply with this license's requirements. If you're unsure, ask a lawyer. Once you've verified that this system is acceptable add the license's SPDX identifier to the accepted array in script/licenses/zed-licenses.toml.
  • Is cargo-about unable to find the license for a dependency? If so, add a clarification field at the end of script/licenses/zed-licenses.toml, as specified in the cargo-about book.
Description
Code at the speed of thought – Zed is a high-performance, multiplayer code editor from the creators of Atom and Tree-sitter.
Readme 586 MiB
Languages
Rust 94.7%
JSON-with-Comments 3.1%
Inno Setup 0.6%
Scheme 0.5%
Shell 0.3%
Other 0.4%