1 Commits

Author SHA1 Message Date
Jason Mancuso
56b3d3bad1 exploring SWE-bench-verified structure and eval plan
Co-Authored-By: Richard <richard@zed.dev>
2024-08-21 14:43:26 -04:00
2 changed files with 69 additions and 0 deletions


@@ -0,0 +1,39 @@
# SWE-Bench eval
## Zed "agent" flow
- Spin up Zed
- Open project with repo at `base_commit`
- Open assistant panel
- Add `/workflow` to context (e.g. in System message)
- (Out of band) LLM call to rephrase SWE-bench `problem_statement` into a Zed Assistant user query/prompt
  - Trying to simulate user prompt here
  - `user_query = rephrase(problem_statement)`
- Add `/auto {user_query}` to populate context
- Store benchmark outputs:
  - `/file` calls + `/search` output
  - Overlap of these files/snippets with `patch`
  - [Stretch]: Overlap of these files/snippets within one-hop of `patch` (hop resolved via LSP go-to-impl call)
- Add `user_query` at end of assistant context
- Run assistant on context
- Apply workflow step resolution
- Apply inline-edits
- Store benchmark outputs:
  - Success/failure of step resolution
  - Success/failure of "proper" indentation of inline edit
  - Success/failure of "overgeneration" of inline edit
- Finally, run tests from `test_patch`, observe results of `PASS_TO_PASS` + `FAIL_TO_PASS` tests
- Store benchmark outputs:
  - Number of patch files modified: all/any/none
  - Success/failure of `PASS_TO_PASS` + `FAIL_TO_PASS` tests
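
The final test step could be scored roughly as sketched below. This is only an illustration, not part of the plan itself: `score_tests` and `passed_tests` are hypothetical names, and the sketch assumes the harness can produce the set of test IDs that passed after applying the edits. It also assumes `PASS_TO_PASS` and `FAIL_TO_PASS` may arrive either as JSON-encoded strings or as already-decoded lists, and accepts both.

```python
import json


def parse_test_list(raw):
    """Return a list of test IDs from a JSON-encoded string or an existing list."""
    return json.loads(raw) if isinstance(raw, str) else list(raw)


def score_tests(fail_to_pass, pass_to_pass, passed_tests):
    """Score one instance against its FAIL_TO_PASS and PASS_TO_PASS lists.

    `passed_tests` is a hypothetical input: the IDs of the tests that
    passed after the inline edits were applied.
    """
    passed = set(passed_tests)
    f2p_ok = all(t in passed for t in parse_test_list(fail_to_pass))
    p2p_ok = all(t in passed for t in parse_test_list(pass_to_pass))
    return {
        "FAIL_TO_PASS": "OK" if f2p_ok else "fail",
        "PASS_TO_PASS": "OK" if p2p_ok else "fail",
        # Counted as resolved only if both groups pass in full.
        "resolved": f2p_ok and p2p_ok,
    }
```

Treating an instance as resolved only when both groups pass in full is one possible convention; per-test pass rates could be stored alongside it if partial credit is of interest.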
## Things to Report
- Rephrased user query (for test case validity)
### /workflow
- Step resolution: OK/fail
- Proper indents in inline edits: OK/fail per edit
- Overgeneration in inline edits: OK/fail per edit
- Number of patch files modified: all/any/none
- Success/failure of `PASS_TO_PASS` + `FAIL_TO_PASS` tests: OK/fail
### /auto {user_query}:
- Overlap of `/file` + `/search` outputs with `patch` snippets
- [Stretch]: Overlap of these files/snippets within one-hop of `patch` (hop resolved via LSP go-to-impl call)
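
At file granularity, the overlap metrics and the all/any/none bucket could be computed along these lines. This is a minimal sketch under stated assumptions: `patch` is taken to be a git-style unified diff whose files are introduced by `diff --git a/... b/...` headers, `candidate_files` is a hypothetical list of repo-relative paths (for example those surfaced by `/file` and `/search`, or those touched by the inline edits), and snippet-level or one-hop overlap is not covered.

```python
import re

# Matches git-style file headers such as "diff --git a/pkg/a.py b/pkg/a.py".
DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$", re.MULTILINE)


def patch_files(patch):
    """Return the set of file paths touched by a git-style unified diff."""
    return {m.group(2) for m in DIFF_HEADER.finditer(patch)}


def file_overlap(patch, candidate_files):
    """Compare candidate file paths against the files in the reference patch."""
    gold = patch_files(patch)
    candidates = set(candidate_files)
    hit = gold & candidates
    if gold and hit == gold:
        coverage = "all"
    elif hit:
        coverage = "any"
    else:
        coverage = "none"
    return {
        "coverage": coverage,
        # Fraction of patch files that appear among the candidates.
        "recall": len(hit) / len(gold) if gold else 0.0,
        # Fraction of candidates that are actually patch files.
        "precision": len(hit) / len(candidates) if candidates else 0.0,
    }
```

The same function can serve both reports: pass the context files to measure `/auto` retrieval, or the edited files to get the "patch files modified" bucket.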


@@ -0,0 +1,30 @@
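# %%
import pprint

import polars as pl

# Load the SWE-bench Verified test split from the Hugging Face Hub
df = pl.read_parquet('hf://datasets/princeton-nlp/SWE-bench_Verified/data/test-00000-of-00001.parquet')
print(df.head())
print(df.columns)
print(len(df))
# Inspect the head of specific columns (printed explicitly so it also shows when run as a script)
print(df.select(['repo', 'problem_statement', 'test_patch', 'hints_text']).head())
# Convert the first row to a dict mapping each column name to a one-element list
full_row = df.head(1).to_dict(as_series=False)
pp = pprint.PrettyPrinter(indent=4)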
print("Repo:")
pp.pprint(full_row['repo'])
print("\nPatch:")
pp.pprint(full_row['patch'])
print("\nTest Patch:")
pp.pprint(full_row['test_patch'])
print("\nProblem Statement:")
pp.pprint(full_row['problem_statement'])
print("\nHints Text:")
pp.pprint(full_row['hints_text'])
print("\nPASS_TO_PASS:")
pp.pprint(full_row['PASS_TO_PASS'])
print("\nFAIL_TO_PASS:")
pp.pprint(full_row['FAIL_TO_PASS'])