https://deepswe.datacurve.ai/blog#limitations

DeepSWE

DeepSWE measures frontier coding agents on original, long-horizon software engineering tasks.





Apparently they measured it on a total of 91 repositories covering TS/JS/Go/Rust/Python.


It creates new issues and checks whether the agent can solve them.

The tasks reportedly include bug fixes, refactoring, and so on.


https://github.com/datacurve-ai/deep-swe


GitHub - datacurve-ai/deep-swe: Measuring frontier coding agents on original, long-horizon engineering tasks



You can see which problems the AI was asked to solve.





The result: GPT wins.