Recent advances in large language models (LLMs) have enabled the development of AI-based coding agents that can generate, modify, and understand software code. However, the evaluation of these systems remains limited, often constrained to synthetic or narrowly scoped benchmarks, primarily in Python. These benchmarks seldom reflect the structural and semantic diversity of real-world codebases, and consequently, many agents overfit to benchmark-specific patterns rather than demonstrating robust, transferable capabilities.
AWS Introduces SWE-PolyBench: A More Comprehensive Evaluation Framework
To address these challenges, AWS AI Labs has introduced SWE-PolyBench, a multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. The benchmark spans 21 GitHub repositories across four widely used programming languages (Java, JavaScript, TypeScript, and Python) and comprises 2,110 tasks that include bug fixes, feature implementations, and code refactorings.
Unlike prior benchmarks, SWE-PolyBench incorporates real pull requests (PRs) that close actual issues and include associated test cases, allowing for verifiable evaluation. A smaller, stratified subset, SWE-PolyBench500, has also been released to support quicker experimentation while preserving task and language diversity.
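For readers who want to experiment, both splits are distributed through the Hugging Face Hub. The sketch below shows one plausible way to load them with the datasets library; the dataset IDs and field names are assumptions, so verify them against the Hugging Face page linked at the end of this article.

```python
# Minimal sketch: load SWE-PolyBench from the Hugging Face Hub.
# NOTE: the dataset IDs, split name, and field names below are assumptions,
# not confirmed values; check the official Hugging Face dataset card.
from datasets import load_dataset

full = load_dataset("AmazonScience/SWE-PolyBench", split="test")       # assumed ID
small = load_dataset("AmazonScience/SWE-PolyBench_500", split="test")  # assumed ID

# SWE-bench-style benchmarks typically expose fields such as the source
# repository, an instance identifier, and the issue's problem statement.
task = full[0]
print(task.get("repo"), task.get("instance_id"))
print(task.get("problem_statement", "")[:200])
```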

Technical Structure and Evaluation Metrics
SWE-PolyBench adopts an execution-based evaluation pipeline. Each task includes a repository snapshot and a problem statement derived from a GitHub issue. The system applies the associated ground-truth patch in a containerized test environment configured for the respective language ecosystem (e.g., Maven for Java, npm for JS/TS). The benchmark then measures outcomes using two types of unit tests: fail-to-pass (F2P) tests, which fail on the original snapshot and must pass once the issue is fixed, and pass-to-pass (P2P) tests, which already pass and must continue to do so.
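To make the pass criterion concrete, here is a minimal illustrative sketch (not the official harness) of how F2P and P2P outcomes might combine into a single resolved/unresolved decision; the function and test names are hypothetical.

```python
# Illustrative sketch, not AWS's official evaluation harness: given per-test
# results collected after a candidate patch is applied inside the container,
# a task counts as resolved only if every F2P test now passes and every
# P2P test still passes.
def is_resolved(f2p: list[str], p2p: list[str],
                results_after: dict[str, bool]) -> bool:
    f2p_ok = all(results_after.get(t, False) for t in f2p)  # fixed by the patch
    p2p_ok = all(results_after.get(t, False) for t in p2p)  # not broken by it
    return f2p_ok and p2p_ok

# Hypothetical example with made-up test names:
print(is_resolved(
    f2p=["test_issue_regression"],
    p2p=["test_existing_behavior"],
    results_after={"test_issue_regression": True,
                   "test_existing_behavior": True},
))  # True
```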
To provide a more granular assessment of coding agents, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics. These include both file-level and node-level retrieval scores, assessing the agent's ability to locate and modify relevant sections of the codebase. These metrics offer insights beyond binary pass/fail outcomes, especially for complex, multi-file modifications.
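As an illustration of what such retrieval scores capture, the sketch below computes set-based precision and recall over the files an agent modified versus those touched by the ground-truth patch; the node-level variant applies the same idea to CST nodes (e.g., classes and functions identified by a parser such as tree-sitter). The function name and file paths are hypothetical.

```python
# Illustrative sketch in the spirit of the file-level retrieval metric:
# compare the files the agent edited against those in the ground-truth patch.
def retrieval_scores(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: the agent edited two files, one of which is correct.
p, r = retrieval_scores({"src/app.ts", "src/util.ts"}, {"src/app.ts"})
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=1.00
```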
Empirical Evaluation and Observations
Three open-source coding agents (Aider, SWE-Agent, and Agentless) were adapted for SWE-PolyBench. All used Anthropic's Claude 3.5 as the underlying model and were modified to handle the multilingual, repository-level requirements of the benchmark.
The evaluation revealed notable differences in performance across languages and task types. For instance, agents performed best on Python tasks (up to a 24.1% pass rate) but struggled with TypeScript (as low as 4.7%). Java, despite its higher complexity in terms of average node changes, achieved higher success rates than TypeScript, suggesting that pretraining exposure and syntax familiarity play a critical role in model performance.

Performance also varied with task complexity. Tasks limited to single-function or single-class changes yielded higher success rates (up to 40%), while those requiring mixed or multi-file changes saw a significant drop. Interestingly, high retrieval precision and recall, particularly for file and CST node identification, did not always translate to higher pass rates, indicating that code localization is necessary but not sufficient for problem resolution.

Conclusion: Toward Robust Evaluation of AI Coding Agents
SWE-PolyBench presents a robust and nuanced evaluation framework for coding agents, addressing key limitations in existing benchmarks. By supporting multiple programming languages, covering a wider range of task types, and incorporating syntax-aware metrics, it offers a more representative assessment of an agent's real-world applicability.
The benchmark shows that while AI agents exhibit promising capabilities, their performance remains inconsistent across languages and tasks. SWE-PolyBench provides a foundation for future research aimed at improving the generalizability, robustness, and reasoning capabilities of AI coding assistants.
Check out the AWS DevOps Blog, Hugging Face – SWE-PolyBench, and GitHub – SWE-PolyBench.