Inspiration
For agentic use cases, LLMs often go astray when given too many tools: (a) the LLM does not know what a tool does, and (b) the LLM does not consider the costs associated with a tool (e.g. a human-intervention tool is very costly).
What it does
Toolpicker picks the right strategy of tool use to solve a given task.
Implemented Strategies:
- answer: Directly answer the query.
- chain-of-thought: Think step by step, then answer.
- web-search: Search the web for results.
Policy: Use the least costly strategy (time, money) to fulfill the user request.
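The cost policy can be sketched as a simple lookup: assign each strategy a relative cost and pick the cheapest one expected to succeed. This is an illustrative sketch, not the project's code; the cost values and the `pick_strategy` helper are assumptions.

```python
# Relative costs per strategy (illustrative assumptions).
STRATEGY_COSTS = {
    "answer": 1,            # single LLM call, no reasoning tokens
    "chain-of-thought": 3,  # extra reasoning tokens -> more time and money
    "web-search": 10,       # external search API call plus an LLM call
}

def pick_strategy(viable: set[str]) -> str:
    """Return the least costly strategy among those expected to succeed."""
    return min(viable, key=lambda s: STRATEGY_COSTS[s])

print(pick_strategy({"chain-of-thought", "web-search"}))  # chain-of-thought
```

If direct answering is not expected to work, the picker falls through to the cheaper of the remaining options rather than defaulting to the most powerful tool.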
How we built it
Create a fine-tuning dataset as follows:
- System prompt: 'Plan the next action. Options:\n"answer": Directly answer the question.\n"chain-of-thought": Think step by step and answer.\n"web-search": Use a search engine to find the answer.'
- Input: question + choices from the MMLU eval set
- Output: "answer" | "chain-of-thought" | "web-search"; for each question, identify the least costly strategy that yields the correct result
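The labeling step above can be sketched as trying strategies cheapest-first and keeping the first one that gets the question right. This is a hedged sketch under assumptions: `run_strategy` is a hypothetical stand-in for the real inference calls, and the chat-message record layout is one common fine-tuning format, not necessarily the one the project used.

```python
SYSTEM_PROMPT = (
    'Plan the next action. Options:\n'
    '"answer": Directly answer the question.\n'
    '"chain-of-thought": Think step by step and answer.\n'
    '"web-search": Use a search engine to find the answer.'
)

# Strategies ordered from cheapest to most expensive.
STRATEGIES = ["answer", "chain-of-thought", "web-search"]

def run_strategy(strategy, question, choices):
    """Hypothetical placeholder: run the model with the given strategy."""
    raise NotImplementedError

def label_example(question, choices, correct, run=run_strategy):
    """Return the least costly strategy that yields the correct result, else None."""
    for strategy in STRATEGIES:
        if run(strategy, question, choices) == correct:
            return strategy
    return None  # no strategy succeeded; drop the example

def to_finetune_record(question, choices, label):
    """Assemble one training example in a chat-message fine-tuning format."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{question}\nChoices: {choices}"},
            {"role": "assistant", "content": label},
        ]
    }
```

Questions where even web-search fails are dropped rather than labeled, so the toolpicker only learns from strategies that actually worked.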
Use the fine-tuning dataset to train a toolpicker model.
Then compare the model's performance with other mechanisms.
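The comparison can be sketched along two axes: accuracy and average cost per question, measured against a fixed baseline that always picks chain-of-thought. The `evaluate` helper and the toy numbers below are illustrative assumptions, not the project's results.

```python
# Relative strategy costs (illustrative assumptions, matching the cost policy).
STRATEGY_COSTS = {"answer": 1, "chain-of-thought": 3, "web-search": 10}

def evaluate(picks, outcomes):
    """picks: chosen strategy per question; outcomes: whether each pick was correct."""
    accuracy = sum(outcomes) / len(outcomes)
    mean_cost = sum(STRATEGY_COSTS[p] for p in picks) / len(picks)
    return accuracy, mean_cost

# Toy run: a picker that sometimes answers directly vs. an always-CoT baseline.
picker_acc, picker_cost = evaluate(
    ["answer", "chain-of-thought", "answer", "web-search"],
    [True, True, True, False],
)
baseline_acc, baseline_cost = evaluate(
    ["chain-of-thought"] * 4,
    [True, True, True, True],
)
print(picker_acc, picker_cost)      # 0.75 3.75
print(baseline_acc, baseline_cost)  # 1.0 3.0
```

A toolpicker only pays off when its cost savings are not wiped out by accuracy losses, which is exactly the trade-off the MMLU experiment surfaced.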
Challenges we ran into
Creating a high-quality dataset to fine-tune on was the most difficult task. The chosen eval dataset (MMLU) and the mix of tools did not allow the model to outperform a baseline of always picking chain-of-thought reasoning.
Accomplishments that we're proud of
Getting a first version ready.
What we learned
How to use Groq, Tavily, and Llama 3 fine-tuning parameters, and the inference costs associated with generating synthetic data for a project like this.
What's next for toolpicker
Experiment with a different eval dataset, a larger fine-tuning dataset, and a wider variety of tools.