Inspiration

For agentic use cases, LLMs often go astray when given too many tools:

  • The LLM does not know what a tool does.
  • The LLM does not consider the costs associated with a tool (e.g., a human-intervention tool is very costly).

What it does

Toolpicker picks the right strategy of tool use to solve a given task.

Implemented Strategies:

  • answer: Directly answer the query.
  • chain-of-thought: Think step by step, then answer.
  • web-search: Search the web for results.

Policy: Use the least costly strategy (time, money) to fulfill the user request.
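The policy above can be sketched as a cost-ordered dispatcher. The strategy names come from the write-up; the numeric costs and the sufficiency check are illustrative assumptions, not the project's actual values:

```python
# Minimal sketch of the least-cost policy: try strategies in order of
# increasing cost and return the first one the picker deems sufficient.
# The costs below are illustrative assumptions.
STRATEGY_COSTS = {
    "answer": 1,            # single short LLM call
    "chain-of-thought": 3,  # longer generation, more tokens
    "web-search": 10,       # external API call plus extra latency
}

def pick_strategy(question: str, is_sufficient) -> str:
    """Return the cheapest strategy that `is_sufficient` accepts."""
    for strategy in sorted(STRATEGY_COSTS, key=STRATEGY_COSTS.get):
        if is_sufficient(question, strategy):
            return strategy
    # Fall back to the most capable (and most costly) strategy.
    return max(STRATEGY_COSTS, key=STRATEGY_COSTS.get)

# Toy sufficiency check (hypothetical): require web search only for
# questions about current information.
picker = lambda q, s: (s == "web-search") if "latest" in q else (s == "answer")
print(pick_strategy("What is 2 + 2?", picker))              # answer
print(pick_strategy("latest stock price of ACME", picker))  # web-search
```

In the actual project, the fine-tuned toolpicker model plays the role of the sufficiency check by emitting the strategy label directly.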

How we built it

Create a fine-tuning dataset as follows:

  • System prompt: 'Plan the next action. Options:\n"answer": Directly answer the question.\n"chain-of-thought": Think step by step and answer.\n"web-search": Use a search engine to find the answer.'
  • Input: question + choices from MMLU eval set
  • Output: "answer" | "chain-of-thought" | "web-search" → for each question, identify the least costly strategy that yields the correct result
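The steps above can be sketched as assembling chat-style JSONL rows. The system prompt is taken from the write-up; the field names, the choice formatting, and the helper function are assumptions about one plausible layout:

```python
import json

# System prompt as given in the write-up.
SYSTEM_PROMPT = (
    'Plan the next action. Options:\n'
    '"answer": Directly answer the question.\n'
    '"chain-of-thought": Think step by step and answer.\n'
    '"web-search": Use a search engine to find the answer.'
)

def make_example(question: str, choices: list[str], label: str) -> str:
    """Build one JSONL line from an MMLU-style question, its answer
    choices, and the least-costly strategy that yields the correct
    result (the training label)."""
    assert label in {"answer", "chain-of-thought", "web-search"}
    # Format choices as "A. ...", "B. ...", etc. (assumed layout).
    user = question + "\n" + "\n".join(
        f"{chr(65 + i)}. {c}" for i, c in enumerate(choices)
    )
    return json.dumps({
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user},
            {"role": "assistant", "content": label},
        ]
    })

line = make_example(
    "What is the capital of France?",
    ["Berlin", "Paris", "Rome", "Madrid"],
    "answer",  # cheapest strategy that gets this one right
)
```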

Use the fine-tuning dataset to train a toolpicker model.

Then compare the model's performance against baseline pickers (e.g., always choosing chain-of-thought).
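The comparison can be sketched as scoring each picker on two axes: how often it matches the least-costly-correct label, and what it spends on average. The cost table and the toy data here are illustrative assumptions:

```python
# Illustrative per-strategy costs (assumed, not the project's figures).
COST = {"answer": 1, "chain-of-thought": 3, "web-search": 10}

def evaluate(picker, dataset):
    """Score a picker on (accuracy, average cost).

    dataset: list of (question, label) pairs where label is the
    least-costly strategy that yields a correct result.
    picker: function mapping a question to a strategy name.
    """
    hits, total_cost = 0, 0
    for question, label in dataset:
        choice = picker(question)
        hits += choice == label
        total_cost += COST[choice]
    n = len(dataset)
    return hits / n, total_cost / n

# Toy labeled data (hypothetical) and the baseline from the write-up:
# always picking chain-of-thought.
data = [("q1", "answer"), ("q2", "chain-of-thought"), ("q3", "web-search")]
always_cot = lambda q: "chain-of-thought"
acc, avg_cost = evaluate(always_cot, data)
```

A fine-tuned toolpicker would be plugged in as another `picker` and compared on the same two numbers.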

Challenges we ran into

Creating a high-quality dataset to fine-tune on was the most difficult task. The chosen eval dataset (MMLU) and the mix of tools did not allow the model to outperform a baseline of always picking chain-of-thought reasoning.

Accomplishments that we're proud of

Getting a first version ready.

What we learned

How to use Groq, Tavily, and Llama 3 fine-tuning parameters, and the inference costs associated with generating synthetic data for a project like this.

What's next for toolpicker

Experiment with a different eval dataset, a larger fine-tuning dataset, and a wider variety of tools.

Built With

  • groq
  • llama-3
  • tavily