Amazon’s SWE-PolyBench just exposed a dirty secret about your AI coding assistant




Amazon Web Services today introduced SWE-PolyBench, a comprehensive multilingual benchmark designed to evaluate AI coding assistants across diverse programming languages and real-world scenarios. The benchmark addresses significant limitations in existing evaluation frameworks and offers researchers and developers new ways to understand how effectively AI agents navigate complex codebases.

“Now there is a benchmark that evaluates coding agents on more complex programming tasks,” said Anoop Deoras, director of applied sciences for generative AI applications and developer experiences at AWS, in an interview with VentureBeat. “The real world offers you more complex tasks. To fix a bug or build a feature, you need to touch more than one file, as opposed to a single file.”

The release comes as AI-powered coding tools have exploded in popularity, with major technology companies integrating them into development environments and standalone products. While these tools show impressive capabilities, evaluating their performance remains challenging, particularly across different programming languages and varying task complexities.

SWE-PolyBench contains more than 2,000 curated coding issues derived from real GitHub problems across four languages: Java (165 tasks), JavaScript (1,017 tasks), TypeScript (729 tasks) and Python (199 tasks). The benchmark also includes a stratified subset of 500 issues (SWE-PolyBench500) designed for quicker experimentation.
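The per-language counts quoted above can be sanity-checked in a few lines; they sum to 2,110 tasks, consistent with the “more than 2,000” figure (a minimal sketch using only the numbers reported in this article):

```python
# Task counts per language as reported for SWE-PolyBench.
TASKS = {
    "Java": 165,
    "JavaScript": 1017,
    "TypeScript": 729,
    "Python": 199,
}

total = sum(TASKS.values())
print(total)  # 2110 -- i.e. "more than 2,000" tasks overall

# JavaScript and TypeScript together dominate the benchmark.
js_ts_share = (TASKS["JavaScript"] + TASKS["TypeScript"]) / total
print(f"{js_ts_share:.0%}")  # roughly 83% of all tasks
```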

“The task diversity and the diversity of programming languages were missing,” Deoras explained of existing benchmarks. “In SWE-Bench today, there is only one programming language, Python, and one task type: bug fixes. In PolyBench, we have expanded this benchmark to include three additional languages.”

The new benchmark directly addresses the limitations of SWE-Bench, which has emerged as the de facto standard for coding agent evaluation, with more than 50 leaderboard entries. Despite its pioneering role, SWE-Bench focuses solely on Python repositories, consists predominantly of bug-fixing tasks and skews heavily toward a single codebase: the Django repository accounts for 45% of all its tasks.

“Intentionally, we decided to keep Python’s representation a little smaller, because Python tasks already exist,” Deoras said. “So rather than over-representing Python, we made sure JavaScript and TypeScript were well represented, in addition to Java.”

Why simple pass/fail metrics don’t tell the whole story about AI coding performance

A key innovation of SWE-PolyBench is its introduction of evaluation metrics more sophisticated than the traditional “pass rate,” which simply measures whether a generated patch successfully resolves a coding issue.

“The evaluation of these coding agents was previously done mainly through a metric called pass rate,” Deoras said. “Pass rate is the fraction of tasks that are successfully completed when the agent-generated patch is applied. But that number is a very high-level detail; it does not tell you how the agent got there.”

The new metrics include file-level localization, which measures an agent’s ability to identify which files in a repository require modification, and Concrete Syntax Tree (CST) node-level retrieval, which assesses whether the agent pinpoints the specific code structures that need to change.

“In addition to pass rate, we have precision and recall. To get at precision and recall, we rely on a program analysis tool called the concrete syntax tree,” Deoras said. “It tells you how your core file structure is set up: what the class nodes are, what the function nodes and variables are.”
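To make the distinction concrete, here is a minimal sketch (not Amazon’s implementation) of the two kinds of metric: a plain pass rate, and file-level precision/recall comparing the files an agent edited against the files the gold patch actually changed. SWE-PolyBench’s own retrieval metrics go further, operating on CST nodes such as classes and functions rather than whole files.

```python
def pass_rate(results):
    """Fraction of tasks whose agent-generated patch made the tests pass."""
    return sum(results) / len(results) if results else 0.0

def file_precision_recall(predicted_files, gold_files):
    """File-level localization: how well the files the agent edited
    match the files touched by the reference (gold) patch."""
    predicted, gold = set(predicted_files), set(gold_files)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# The agent edited two files, but the gold patch touched one of those
# plus a different one entirely:
p, r = file_precision_recall(["src/app.py", "src/utils.py"],
                             ["src/app.py", "src/models.py"])
print(p, r)  # 0.5 0.5 -- detail a bare pass/fail verdict would hide
```

A task can even pass the tests while the agent edits far more files than necessary; precision and recall surface that behavior where pass rate alone cannot.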

Python remains dominant as evaluations expose AI limitations on complex tasks

Amazon’s evaluation of several open-source coding agents on SWE-PolyBench revealed telling patterns. Python remains the strongest language for all tested agents, likely due to its prevalence in training data and existing benchmarks. Performance degrades as task complexity increases, particularly when changes to three or more files are required.

Different agents also show different strengths across task categories. Performance on bug-fixing tasks is relatively consistent, while there is more variability between agents on feature requests and code refactoring.

The benchmark also found that the clarity of the problem statement significantly affects success rates, suggesting that precise issue descriptions matter for effective AI assistance.

What SWE-PolyBench means for enterprise developers working across multiple languages

SWE-PolyBench arrives at a critical moment in the development of AI coding assistants. As these tools move from experimental use to production environments, the need for rigorous, diverse and representative benchmarks has intensified.

“Over time, not only have the LLMs evolved, but at the same time the tasks have gotten more complex,” Deoras said. “Developers need to solve increasingly complex tasks in a synchronous way using these agents.”

The expanded language support makes the evaluation especially valuable for enterprise environments, where polyglot development is common. Java, JavaScript, TypeScript and Python, the four languages covered by SWE-PolyBench, consistently rank among the most popular programming languages in enterprise settings, aligning the benchmark’s coverage with real-world development scenarios.

Amazon has made the entire SWE-PolyBench framework publicly available. The dataset is accessible on Hugging Face, and the evaluation harness is available on GitHub. A dedicated leaderboard tracks the performance of various coding agents on the benchmark.
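For readers who want to try the dataset, it can be pulled with the Hugging Face `datasets` library and sliced by language. The sketch below runs on a small in-memory sample so it stays self-contained; the dataset identifier and field names shown in comments are assumptions and should be checked against the actual Hugging Face listing.

```python
from collections import Counter

# With network access, real records would come from Hugging Face, e.g.:
#   from datasets import load_dataset
#   ds = load_dataset("AmazonScience/SWE-PolyBench", split="test")  # ID assumed
# Each record pairs a GitHub issue with a repository snapshot and a gold patch.

def language_breakdown(records):
    """Count benchmark tasks per language, e.g. to verify the split you pulled."""
    return Counter(r["language"] for r in records)

# Illustrative sample records (field names are assumptions, not the real schema):
sample = [
    {"instance_id": "repo-1", "language": "Python"},
    {"instance_id": "repo-2", "language": "TypeScript"},
    {"instance_id": "repo-3", "language": "TypeScript"},
]
print(language_breakdown(sample))  # Counter({'TypeScript': 2, 'Python': 1})
```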

“We have extended the SWE-Bench data acquisition pipeline to support these three additional languages,” Deoras said, adding that he hopes the community will further extend the process beyond the current four languages.

As the AI coding assistant market heats up, with every major technology company offering competing products, SWE-PolyBench provides an important reality check on their true capabilities. Its design acknowledges that real-world software engineering demands more than simple bug fixes in Python: it requires navigating complex codebases and solving diverse engineering problems across languages.

For enterprise decision-makers evaluating AI coding tools, SWE-PolyBench offers something invaluable: a way to separate marketing hype from genuine technical capability. After all, the true test of an AI coding assistant is not how well it performs in simplified demos, but whether it can wrestle every day with the multi-language complexity of actual software projects.



