exploring SWE-bench-verified structure and eval plan

Co-Authored-By: Richard <richard@zed.dev>
2024-08-21 14:43:26 -04:00
2 changed files with 69 additions and 0 deletions
--- a/script/swe-bench-eval/swe-bench-eval.md
+++ b/script/swe-bench-eval/swe-bench-eval.md
@@ -0,0 +1,39 @@
+# SWE-Bench eval
+## Zed "agent" flow
+- Spin up Zed
+  - Open project with repo at `base_commit`
+  - Open assistant panel
+- Add `/workflow` to context (e.g. in the system message)
+- (Out of band) LLM call to rephrase SWE-bench `problem_statement` into a Zed Assistant user query/prompt
+  - Trying to simulate user prompt here
+  - `user_query = rephrase(problem_statement)`
+- Add `/auto {user_query}` to populate context
+- Store benchmark outputs:
+  - `/file` calls + `/search` output
+  - Overlap of these files/snippets with `patch`
+  - [Stretch]: Overlap of these files/snippets within one-hop of `patch` (hop resolved via LSP go-to-impl call)
+- Add `user_query` at end of assistant context
+- Run assistant on context
+- Apply workflow step resolution
+- Apply inline-edits
+- Store benchmark outputs:
+  - Success/failure of step resolution
+  - Success/failure of "proper" indentation of inline edit
+  - Success/failure of "overgeneration" of inline edit
+- Finally, run the tests from `test_patch` and observe the results of the `PASS_TO_PASS` + `FAIL_TO_PASS` tests
+- Store benchmark outputs:
+  - Number of patch files modified: all/any/none
+  - Success/failure of `PASS_TO_PASS` + `FAIL_TO_PASS` tests
+
+## Things to Report
+- Rephrased user query (for test case validity)
+
+### /workflow
+- Step resolution: OK/fail
+- Proper indents in inline edits: OK/fail per edit
+- Overgeneration in inline edits: OK/fail per edit
+- Number of patch files modified: all/any/none
+- Success/failure of `PASS_TO_PASS` + `FAIL_TO_PASS` tests: OK/fail
+### /auto {problem_statement}:
+- Overlap of `/file` + `/search` outputs with `patch` snippets
+- Overlap of these files/snippets within one-hop of `patch`
--- a/script/swe-bench-eval/swe_bench.py
+++ b/script/swe-bench-eval/swe_bench.py
@@ -0,0 +1,30 @@
+# %%
+import pprint
+
+import polars as pl
+
+df = pl.read_parquet('hf://datasets/princeton-nlp/SWE-bench_Verified/data/test-00000-of-00001.parquet')
+
+print(df.head())
+print(df.columns)
+print(len(df))
+
+# Inspect the head of specific columns; print explicitly, since a bare expression mid-cell is not displayed
+print(df.select(['repo', 'problem_statement', 'test_patch', 'hints_text']).head())
+full_row = df.head(1).to_dict(as_series=False)  # column name -> one-element list
+pp = pprint.PrettyPrinter(indent=4)
+
+print("Repo:")
+pp.pprint(full_row['repo'])
+print("\nPatch:")
+pp.pprint(full_row['patch'])
+print("\nTest Patch:")
+pp.pprint(full_row['test_patch'])
+print("\nProblem Statement:")
+pp.pprint(full_row['problem_statement'])
+print("\nHints Text:")
+pp.pprint(full_row['hints_text'])
+print("\nPASS_TO_PASS:")
+pp.pprint(full_row['PASS_TO_PASS'])
+print("\nFAIL_TO_PASS:")
+pp.pprint(full_row['FAIL_TO_PASS'])
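
The `patch` overlap metrics in the eval plan (files surfaced by `/file` and `/search`, and "Number of patch files modified: all/any/none") could be computed along these lines. This is a minimal sketch, not part of the commit: `files_in_patch` and `classify_overlap` are hypothetical helper names, and it assumes `patch` is a git-style unified diff whose `+++ b/<path>` headers name the modified files.

```python
def files_in_patch(patch: str) -> set[str]:
    """Collect file paths from the `+++ b/<path>` headers of a unified diff.

    Assumption: git-style headers. Deleted files (`+++ /dev/null`) are skipped,
    and an added content line that happens to start with `++ b/` would be
    misread as a header; a real implementation should parse hunks properly.
    """
    prefix = "+++ b/"
    paths = set()
    for line in patch.splitlines():
        if line.startswith(prefix):
            # Drop any tab-separated trailer (e.g. a timestamp) after the path.
            paths.add(line[len(prefix):].split("\t")[0].strip())
    return paths


def classify_overlap(touched: set[str], patch_files: set[str]) -> str:
    """Return 'all', 'any', or 'none' for how many patch files were touched."""
    hit = touched & patch_files
    if patch_files and hit == patch_files:
        return "all"
    return "any" if hit else "none"
```

For example, `classify_overlap(retrieved_files, files_in_patch(row_patch))` would give the all/any/none label for one benchmark row, where `retrieved_files` is whatever set of paths the run recorded and `row_patch` is that row's `patch` string (both names are placeholders).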