
SWE-bench Evaluation Helper

This repository contains a helper script to evaluate model predictions on the SWE-bench dataset.

Use ssh ubuntu@35.212.134.229 to connect to the evaluation server.

Usage

Set up the environment

  1. Prerequisites

    • Docker Desktop, with Allow the default Docker socket to be used (requires password) enabled (Docker Desktop -> Settings -> Advanced)
    • Python
    • pip
  2. Run the following commands in a terminal

      python3 -m venv .venv
      source .venv/bin/activate
      pip install -e .
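
Before running evaluations, it can help to confirm that the Docker socket is reachable from your shell. This is an optional sanity check, not part of the original setup steps:

  # Optional: verify Docker is running and the socket is accessible
  docker info --format '{{.ServerVersion}}'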

Run one of the following commands to start the evaluation script.

Mode 0 (default)

Input the path (or URL) of the combined patch file and the instance ID to evaluate the model prediction.

  python -m gru.evaluation

Mode 1

Input instance IDs and patch file links manually, following the on-screen instructions.

  python -m gru.evaluation --mode 1

Disable the Cache

To disable the cache, set the --disable-cache flag.

  --disable-cache value   Description
  0                       Enable the cache (default)
  1                       Disable the cache
  2                       Disable the cache for unresolved instances only
  python -m gru.evaluation --disable-cache 1

  # or
  python -m gru.evaluation --mode 1 --disable-cache 2

Modify Max Workers

You can set the number of workers with the --max-workers flag; the default is 0 (auto mode).

  --max-workers value   Description
  0                     Auto mode (default): 3/4 of the CPU cores
  1                     Single worker
  k (k > 1)             k workers
  python -m gru.evaluation --max-workers 2
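
For reference, here is a minimal sketch of how auto mode could derive the worker count from the "3/4 of the CPU cores" rule above. default_max_workers is a hypothetical name for illustration, not necessarily the function the script actually uses:

  import os

  def default_max_workers() -> int:
      # Hypothetical helper: auto mode uses 3/4 of the CPU cores,
      # falling back to at least one worker on small machines.
      return max(1, (os.cpu_count() or 1) * 3 // 4)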

Enable Chunk Mode

You can enable chunk mode by setting the --enable-chunk flag; the default is false (disabled).

  python -m gru.evaluation --enable-chunk true

Evaluation Results

All evaluation results are saved under the gru-result/evaluation directory. Each evaluation run is stored in its own subdirectory, named by timestamp (Month-Day-Hour-Minute-Second).

  • report.json: contains all evaluation results

  • predictions.json: contains the model predictions for each instance

  • test-instances.json: contains the test instances

  • log/: contains the evaluation log files, with details of the evaluation process, organized by instance ID
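
Putting these together, a run's output directory would look roughly like the sketch below (the timestamp is illustrative):

  gru-result/evaluation/
    06-21-14-30-05/          # Month-Day-Hour-Minute-Second of the run
      report.json            # all evaluation results
      predictions.json       # model predictions per instance
      test-instances.json    # test instances
      log/                   # per-instance evaluation logs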

Sync with SWE-bench Official Code

  1. Switch to the SWE-bench-official branch and click the Sync fork button in the upper-right corner
  2. Rebase the updates from the SWE-bench-official branch onto the main branch (one possible command sequence is sketched below)
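
A minimal sketch of step 2 on the command line, assuming the fork has already been synced through the GitHub UI; adjust the push strategy to your workflow:

  git checkout SWE-bench-official
  git pull                          # pick up the changes from Sync fork
  git checkout main
  git rebase SWE-bench-official     # replay main on top of the official code
  git push --force-with-lease       # rebase rewrites history, so force-push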
