Modelwire
RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

RaTA-Tool introduces a retrieval-based framework enabling multimodal large language models to select and invoke external tools from open-world settings, moving beyond text-only, closed-world tool-use approaches that struggle with unseen APIs and diverse input modalities.

Modelwire context

Explainer

The core technical bet is that retrieval at inference time can substitute for exhaustive training coverage of every possible API. That matters most when the tool inventory keeps changing or is too large to bake into the model's weights. Most prior tool-use benchmarks assume a fixed, known set of tools, so RaTA-Tool targets a failure mode those benchmarks do not measure: selecting among tools the model has not seen before.
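To make the retrieval idea concrete, here is a minimal sketch of selecting a tool by similarity search over tool descriptions. It is not RaTA-Tool's method: the `ToolIndex` class, the tool names, and the bag-of-words `embed` function are all hypothetical stand-ins, and a real system would presumably use a learned (and multimodal) encoder. The point it illustrates is only that a newly registered tool becomes selectable by updating the index, with no retraining.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a learned encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToolIndex:
    """Hypothetical open-ended tool registry searched at inference time."""

    def __init__(self):
        self._tools = {}  # tool name -> embedded description

    def add(self, name, description):
        # Registering a tool only touches the index, never model weights.
        self._tools[name] = embed(description)

    def retrieve(self, query, k=1):
        # Rank every registered tool by similarity to the query.
        q = embed(query)
        ranked = sorted(
            self._tools.items(),
            key=lambda item: cosine(q, item[1]),
            reverse=True,
        )
        return [name for name, _ in ranked[:k]]


index = ToolIndex()
index.add("image_captioner", "describe the contents of an image in text")
index.add("speech_to_text", "transcribe spoken audio into text")

# A tool that appears later is usable as soon as it is indexed.
index.add("currency_converter", "convert an amount of money between currencies")

top = index.retrieve("convert 20 euros of money into dollars", k=1)
print(top)
```

In a closed-world setup the candidate set would be fixed ahead of time; here the candidate set is whatever the index holds when the query arrives, which is the property the open-world framing depends on.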

The retrieval-plus-reasoning pattern is showing up on multiple fronts in recent coverage. IG-Search (arXiv, same week) applies similar retrieval-augmented logic to question answering, rewarding models for effective search queries rather than only for correct final answers. Both papers push against the same wall: models that reason well in closed settings but degrade when the relevant context is not already in their weights. MM-WebAgent, from the same batch of arXiv papers, adds multimodal coordination on top, which is the modality challenge RaTA-Tool addresses for tool selection specifically. The papers are distinct, but they converge on a shared architectural intuition.

Watch whether any of the major agentic coding platforms, such as Codex or Cursor-adjacent tools like Schematik, adopt retrieval-based tool routing in the next two quarters. If they do, that validates the open-world framing; if they stick with fixed tool registries, it suggests the closed-world assumption is good enough for most production use cases.

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: RaTA-Tool · Multimodal Large Language Models · Large Language Models

Modelwire Editorial

This synthesis and analysis was prepared by the Modelwire editorial team. We use advanced language models to read, ground, and connect the day’s most significant AI developments, providing original strategic context that helps practitioners and leaders stay ahead of the frontier.

Modelwire summarizes, we don’t republish. The full content lives on arxiv.org. If you’re a publisher and want a different summarization policy for your work, see our takedown page.
