Simple Science

Cutting edge science explained simply

Computer Science · Computation and Language · Artificial Intelligence · Machine Learning

Advancing Open-Source AI for Tool Manipulation

This paper discusses the challenges and opportunities of open-source LLMs in tool manipulation.

― 5 min read


Open-Source AI: Tool Manipulation Insights. Exploring the potential of open-source LLMs in task automation.

Recent developments in artificial intelligence have shown that large language models (LLMs) can help automate tasks through natural language commands. These models are capable of interacting with software tools, making them valuable for various applications. This paper discusses the challenges and opportunities associated with using open-source LLMs for tool manipulation.

Understanding Tool Manipulation

Tool manipulation refers to the ability of a model to translate a user's description of a goal into a series of actions that software can execute. For instance, a user could ask a model to find a specific item online, and the model would generate and execute the necessary commands to perform that search.
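As a minimal sketch of this goal-to-action translation, consider a toy shopping API. The `search_items` function and its catalog are invented for illustration; they are not from the paper's benchmark.

```python
# Hypothetical example: the model turns a natural-language goal into a tool call.
# search_items is a stand-in for a shopping-site API the model can invoke.

def search_items(query: str, max_price: float) -> list[str]:
    """Return catalog items matching the query under the price limit."""
    catalog = [("wireless mouse", 19.99), ("mechanical keyboard", 89.00)]
    return [name for name, price in catalog if query in name and price <= max_price]

# Goal: "find a wireless mouse under $25"
# A capable model would emit a call equivalent to:
results = search_items(query="wireless mouse", max_price=25.0)
print(results)
```

The hard part for the model is not running such a call but choosing the right function and filling its arguments correctly, which is exactly where the failure modes below arise.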

Traditionally, most research in this area has focused on closed models, where users are limited to what those models can do through a restricted set of commands. This presents challenges for businesses concerned about security and data privacy.

The Role of Open-Source Models

Open-source LLMs present a promising solution to the limitations of closed models. Since they are publicly available, they can be adapted and improved upon by anyone, fostering innovation and collaboration. However, there remains a significant gap in performance when comparing these open-source models to proprietary ones like OpenAI's GPT-4, especially in the area of tool manipulation.

Key Challenges in Tool Manipulation

To understand how to improve open-source LLMs, we must first identify the challenges they face in tool manipulation.

API Selection Issues

One major challenge is the difficulty in selecting the correct Application Programming Interface (API) commands. Open-source models often fail to identify the appropriate commands needed to achieve a user's goal, which can lead to errors in execution. In contrast, models like GPT-4 demonstrate a better ability to internalize the knowledge of API usage during their training.

Argument Population Errors

Once an API is selected, the model must fill in the necessary arguments. Open-source models frequently struggle to input the correct values for these arguments. This issue can stem from a lack of examples available during training, leading to inaccurate or nonsensical inputs.

Non-Executable Outputs

Another common problem is output that is not executable: responses that are overly verbose or that do not follow the required code format. For software tools to act on them, the outputs must be clear, concise code.

Enhancing Open-Source Models

To address these challenges, we can adopt several strategies to boost the capabilities of open-source LLMs in tool manipulation.

Adapting Existing Techniques

We can revisit established techniques from LLM literature and adapt them to suit the specific needs of tool manipulation. These strategies can be implemented without requiring a large amount of human oversight, which is crucial for practical implementation.

Model Alignment

Model alignment involves fine-tuning LLMs on examples drawn from potential API usage. By creating templates that pair goals with their corresponding actions, we can programmatically expand the training data available to the model, helping it internalize the necessary API knowledge.
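The template-based data generation can be sketched as follows. The template wording, the `search_homes` call, and the value lists are invented for illustration; the paper's actual templates differ.

```python
import itertools

# Programmatic training-data generation via templates, as in the model-alignment
# step: one goal template and one call template expand into many training pairs.

goal_template = "Find homes in {city} under {price} dollars"
call_template = 'search_homes(city="{city}", max_price={price})'

cities = ["Austin", "Denver", "Seattle"]
prices = [300000, 500000, 750000]

training_pairs = [
    (goal_template.format(city=c, price=p), call_template.format(city=c, price=p))
    for c, p in itertools.product(cities, prices)
]

print(len(training_pairs))  # 9 goal/API-call pairs from a single template
```

Because the expansion is combinatorial, a handful of human-written templates can yield a sizable fine-tuning set, which is what keeps the human supervision practical.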

In-Context Demonstration Retrieval

By incorporating retrieval-augmented generation techniques, we can enhance LLMs with a mechanism that selects similar examples from a curated repository during inference. This allows the model to leverage previously successful actions as demonstrations when generating outputs.
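A stripped-down retriever might look like the following. For self-containedness this sketch scores candidates by word overlap; a real system would use embedding similarity, and the repository entries here are made up.

```python
# Minimal demonstration retriever: pick the stored example whose goal text
# overlaps most with the new request, then prepend it to the prompt.

def overlap(a: str, b: str) -> int:
    """Count shared lowercase words between two goal descriptions."""
    return len(set(a.lower().split()) & set(b.lower().split()))

repository = [
    ("find homes under 500k in Austin", 'search_homes(city="Austin", max_price=500000)'),
    ("book a flight to Paris", 'book_flight(destination="Paris")'),
]

def retrieve(goal: str, k: int = 1):
    """Return the k most similar (goal, call) demonstrations."""
    return sorted(repository, key=lambda ex: overlap(goal, ex[0]), reverse=True)[:k]

demos = retrieve("find homes in Denver under 300k")
# demos[0] is the home-search example, which becomes the in-context demonstration.
```

The retrieved goal/call pair is inserted into the prompt at inference time, giving the model a worked example of the right API for similar goals.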

System Prompts

Introducing system prompts can help define the expectations for outputs, ensuring the model focuses on generating executable code. This structure helps regulate the style and form of the generated responses.
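The shape of such a prompt can be sketched as below. The wording is a guess at the kind of style-regulating instruction the paper describes, not its exact prompt, and `build_prompt` is an invented helper.

```python
# Illustrative system prompt that constrains the model to emit executable calls
# only, plus a helper that assembles it with retrieved demonstrations and a goal.

SYSTEM_PROMPT = (
    "You are a tool-use assistant. Respond ONLY with executable Python API "
    "calls, one per line. Do not add explanations or natural-language text."
)

def build_prompt(demonstrations: list[str], goal: str) -> str:
    """Join the system prompt, in-context demos, and the user goal."""
    parts = [SYSTEM_PROMPT, *demonstrations, f"Goal: {goal}", "Calls:"]
    return "\n\n".join(parts)

demo = 'Goal: find homes in Austin\n\nCalls:\nsearch_homes(city="Austin")'
prompt = build_prompt([demo], "book a flight to Paris")
```

Pinning down the output format this way directly targets the non-executable-output failure mode described earlier.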

Evaluating Tool Manipulation Techniques

To determine the effectiveness of these techniques, we developed ToolBench, a benchmark suite consisting of various real-world applications, including manipulating software tools for tasks like online shopping and data management.

Benchmark Overview

The benchmark comprises diverse tasks tailored to evaluate the performance of LLMs when manipulating tools. Each task is associated with specific goals, and the model's ability to generate appropriate API calls is assessed.

Performance Metrics

The primary evaluation metric for these tasks is the success rate, which reflects how often the model generates correct and executable actions. The benchmark allows for a quantitative comparison of the capabilities of open-source LLMs against leading closed models.

Results and Analysis

Through extensive testing on the benchmark, we found notable performance gaps between open-source models and GPT-4. Specifically, open-source models exhibited significantly lower success rates on more complicated tasks.

Improving Performance

By applying the proposed techniques, we were able to enhance the success rates of open-source LLMs substantially. With a practical amount of human supervision, roughly one developer day of data curation per tool, open-source models reached capabilities competitive with GPT-4 in 4 of the 8 benchmark tasks.

Detailed Examination of Tasks

Each task in the benchmark is designed with varying complexity levels, testing different aspects of tool manipulation capabilities.

Task Case Studies

Home Search Functionality

In the home search task, the model must generate a series of API calls to retrieve listings based on user-defined criteria. The challenge lies in selecting the correct function calls and filling in the parameters accurately.
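A home-search episode might take the following shape: the model must emit a short sequence of calls that set filters and then trigger the search. The function names mirror the task description but are hypothetical, not the benchmark's real API.

```python
# Hypothetical home-search tool: the model chains filter-setting calls before
# the final search, so both function selection and argument values must be right.

filters = {}

def set_location(city: str):
    filters["city"] = city

def set_max_price(price: int):
    filters["max_price"] = price

def search():
    # A real implementation would query a listing service;
    # here we just echo the accumulated filters.
    return dict(filters)

# Expected model output for "homes in Denver under $400k":
set_location("Denver")
set_max_price(400000)
listings = search()
```

A single wrong call in the chain (a missing filter, or a price passed as a string) yields either wrong results or a non-executable program, which is why multi-call tasks are where open-source models lag most.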

Trip Booking Functionality

This task involves more complex interactions as users may seek to book tickets or accommodations through multiple API calls. The relationships between the different parameters and functions make this task demanding for LLMs.

Google Sheets Manipulation

Manipulating spreadsheets presents its own unique set of challenges, requiring the model to understand the context and perform specific actions like cell updates or data sorting.

Conclusion

The findings from our evaluation reveal that while open-source LLMs face significant challenges in tool manipulation, there are effective strategies to enhance their performance. Through model alignment, in-context learning, and systematic prompts, we can bridge the gap and enable open-source models to be viable alternatives to closed models in this domain.

These advancements not only provide opportunities for better automation of tasks but also foster a more secure environment for businesses to adopt AI technologies. Continuous research and development in this area will help unlock further potential and improve the overall efficacy of open-source LLMs in tool manipulation.

Original Source

Title: On the Tool Manipulation Capability of Open-source Large Language Models

Abstract: Recent studies on software tool manipulation with large language models (LLMs) mostly rely on closed model APIs. The industrial adoption of these models is substantially constrained due to the security and robustness risks in exposing information to closed LLM API services. In this paper, we ask can we enhance open-source LLMs to be competitive to leading closed LLM APIs in tool manipulation, with practical amount of human supervision. By analyzing common tool manipulation failures, we first demonstrate that open-source LLMs may require training with usage examples, in-context demonstration and generation style regulation to resolve failures. These insights motivate us to revisit classical methods in LLM literature, and demonstrate that we can adapt them as model alignment with programmatic data generation, system prompts and in-context demonstration retrievers to enhance open-source LLMs for tool manipulation. To evaluate these techniques, we create the ToolBench, a tool manipulation benchmark consisting of diverse software tools for real-world tasks. We demonstrate that our techniques can boost leading open-source LLMs by up to 90% success rate, showing capabilities competitive to OpenAI GPT-4 in 4 out of 8 ToolBench tasks. We show that such enhancement typically requires about one developer day to curate data for each tool, rendering a recipe with practical amount of human supervision.

Authors: Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, Jian Zhang

Last Update: 2023-05-25

Language: English

Source URL: https://arxiv.org/abs/2305.16504

Source PDF: https://arxiv.org/pdf/2305.16504

Licence: https://creativecommons.org/licenses/by/4.0/

Changes: This summary was created with assistance from AI and may have inaccuracies. For accurate information, please refer to the original source documents linked here.

Thank you to arxiv for use of its open access interoperability.
