Blog · Jun 8, 2026 · 6 min read

NVIDIA LocateAnything: Visual Grounding for AI Agents

NVIDIA's open LocateAnything-3B gives AI agents the missing perception layer — fast, accurate visual grounding for GUI automation, document AI, and robotics.

Most conversations about AI agents focus on reasoning: can the model plan, call tools, and recover from mistakes? But there is a quieter bottleneck that breaks more real-world agents than bad planning ever does — perception. An agent that cannot reliably answer "where exactly is the Submit button on this screen?" will never click it correctly, no matter how good its reasoning is.

That is the gap NVIDIA's new LocateAnything-3B is built to close. Released as an open model in late May 2026, it is a compact 3-billion-parameter vision-language model dedicated to one job done extremely well: visual grounding — turning a natural-language description into precise pixel coordinates.

What visual grounding actually means

Visual grounding is the bridge between language and pixels. You give the model an image and a phrase, and it returns the location — a bounding box or a point — of the thing you described. "Locate all cats," "find the search field," "where is the invoice total?" Each query produces coordinates the rest of your system can act on.

This sounds simple, but it is the foundation for an entire class of agentic systems:

  • GUI and computer-use agents that click, type, and navigate real software interfaces
  • Robotics and embodied agents that need to point at and grasp objects
  • Document-understanding pipelines that extract fields, tables, and layout regions
  • OCR and text localization that find where text sits, not just what it says
  • Open-world detection where the categories are not known in advance

LocateAnything-3B is a generalist across all of these. Rather than training a separate detector per domain, it handles referring-expression grounding, multi-object detection, GUI element localization, and text detection from a single model.

The breakthrough: Parallel Box Decoding

The headline innovation is Parallel Box Decoding (PBD), and it solves a problem that has quietly throttled vision-language detectors.

Most VLMs that output coordinates do so the same way they write text: one token at a time, autoregressively. To emit a single box they generate x1, then y1, then x2, then y2 as a sequence. In a cluttered scene with dozens of objects, that serialized decoding becomes painfully slow.

PBD treats a bounding box as an atomic unit instead of a token stream. It predicts the complete set of coordinates for each box in a single parallel step, using a structured block-based output with dedicated Box, Semantic, Negative, and End blocks (unused positions are padded with <null> tokens). The geometry stays coherent, but the decoding parallelizes.
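As a rough illustration of why this matters, the toy sketch below counts sequential decoding steps only. It assumes each coordinate costs exactly one step in the autoregressive case and each box costs one step in the parallel case; the function names are hypothetical, and real latency also depends on factors this ignores.

```python
# Toy model only: counts sequential decoding steps, not real latency.
COORDS_PER_BOX = 4  # x1, y1, x2, y2


def autoregressive_steps(num_boxes: int) -> int:
    """One sequential step per coordinate, as in token-by-token decoding."""
    return num_boxes * COORDS_PER_BOX


def parallel_box_steps(num_boxes: int) -> int:
    """One sequential step per box when its four coordinates are emitted together."""
    return num_boxes


for n in (1, 10, 50):
    print(f"{n} boxes: {autoregressive_steps(n)} sequential steps "
          f"vs {parallel_box_steps(n)} with per-box decoding")
```

Under these assumptions the gap grows linearly with the number of objects, which is why cluttered scenes are where serialized decoding hurts most.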

The speed difference is dramatic. On an NVIDIA H100, LocateAnything reaches 12.7 boxes per second (BPS) in hybrid mode: over 10x the throughput of the textual autoregressive Qwen3-VL (1.1 BPS) and about 2.5x that of the quantized Rex-Omni (5.0 BPS). In dense scenes, the reported speedup over autoregressive methods ranges from 2x to 6x. For an agent that needs to scan a busy dashboard many times per task, that throughput is the difference between usable and unusable.
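To put those throughput figures in agent terms, here is a small back-of-the-envelope sketch. The BPS values are the ones quoted above; the 40-box dashboard is a made-up workload, and the calculation assumes throughput stays constant regardless of scene density, which is a simplification.

```python
# Throughput figures quoted above (boxes per second on an H100).
BPS = {
    "LocateAnything-3B (hybrid)": 12.7,
    "Rex-Omni (quantized)": 5.0,
    "Qwen3-VL (textual autoregressive)": 1.1,
}


def seconds_for(num_boxes: int, boxes_per_second: float) -> float:
    """Time to emit num_boxes at a steady boxes-per-second rate."""
    return num_boxes / boxes_per_second


def speedup(fast_bps: float, slow_bps: float) -> float:
    """Throughput ratio between two models."""
    return fast_bps / slow_bps


NUM_BOXES = 40  # hypothetical: UI elements on one busy dashboard
for name, bps in BPS.items():
    print(f"{name}: {seconds_for(NUM_BOXES, bps):.1f} s per scan")
```

For this hypothetical scan, the quoted rates work out to roughly 3.1 seconds for LocateAnything against roughly 36.4 seconds for the slowest baseline, a gap that compounds when the agent rescans the screen at every step.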

The numbers that matter

Speed would mean little without accuracy, and this is where LocateAnything earns attention. Built on a Moon-ViT vision encoder paired with a Qwen2.5 language decoder, it posts state-of-the-art or near-SOTA results across very different domains:

  • GUI grounding (ScreenSpot-Pro): 60.3 mean F1, state of the art on the benchmark that matters most for computer-use agents
  • Object detection (LVIS): a 3.8 percent gain in mean F1 over Rex-Omni, and a large jump at strict localization (31.1 vs 20.7 at IoU 0.95)
  • Document layout (M6Doc): 70.1 mean F1
  • Referring comprehension (HumanRef): 78.7 mean F1
  • Scene text (TotalText): 43.3 mean F1

The strict-IoU result is worth highlighting. Many detectors look fine at loose overlap thresholds but drift when you demand precise boxes. Going from 20.7 to 31.1 at IoU 0.95 is a relative improvement of roughly 50 percent, which means the boxes are tight enough to actually click on.
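For readers less familiar with the metric, the sketch below shows the standard intersection-over-union calculation and how unforgiving a 0.95 threshold is. The two boxes are made-up values for a 100x40 pixel button, not data from the benchmark; the last function simply reproduces the relative-improvement arithmetic from the LVIS numbers above.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def relative_improvement(new: float, old: float) -> float:
    """Fractional gain of new over old."""
    return (new - old) / old


truth = (100.0, 100.0, 200.0, 140.0)      # hypothetical ground-truth button
predicted = (102.0, 101.0, 202.0, 141.0)  # same size, shifted 2 px right, 1 px down

score = iou(truth, predicted)
print(f"IoU = {score:.3f}, hit at 0.50: {score >= 0.5}, hit at 0.95: {score >= 0.95}")
print(f"Relative gain at IoU 0.95: {relative_improvement(31.1, 20.7):.1%}")
```

With these illustrative boxes, a shift of only 2 pixels horizontally and 1 pixel vertically yields an IoU of about 0.915: a comfortable hit at the 0.5 threshold, but a miss at 0.95.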

This breadth comes from scale: the model was trained on a curated dataset of roughly 12 million images, 138 million language queries, and 785 million bounding boxes spanning general detection, GUI interaction, referring comprehension, text, layout, and point-based tasks.

Running it in practice

Inference uses the familiar Hugging Face transformers API. The model accepts images up to roughly 2.5K resolution and text prompts up to 24K tokens, and returns structured coordinates.

from transformers import AutoModel, AutoProcessor
from PIL import Image
 
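# Load the model and move it to the GPU so it sits on the same device
# as the inputs, which are sent to "cuda" further down.
model = AutoModel.from_pretrained(
    "nvidia/LocateAnything-3B",
    torch_dtype="auto",
    trust_remote_code=True,
).to("cuda").eval()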
 
processor = AutoProcessor.from_pretrained(
    "nvidia/LocateAnything-3B",
    trust_remote_code=True,
)
 
image = Image.open("dashboard.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Locate the export button"},
    ],
}]
 
text = processor.py_apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
images, _ = processor.process_vision_info(messages)
inputs = processor(text=[text], images=images, return_tensors="pt").to("cuda")
 
response = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    tokenizer=processor.tokenizer,
    max_new_tokens=2048,
    generation_mode="hybrid",
)

The output contains semantic labels plus coordinates in a structured form — boxes as x1,y1,x2,y2 and points as x,y. The generation_mode="hybrid" setting is the recommended balance of speed and accuracy. From there, the coordinates feed directly into whatever acts on the screen: a click controller, a crop-and-extract step, or a robot arm planner.
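As one possible way to hand a box to a click controller, the sketch below parses a plain "x1,y1,x2,y2" string and returns the box center in screen pixels. Both helper names are hypothetical, and it assumes the coordinates are absolute pixels in the image that was passed to the model; the exact wrapper text around the coordinates and the coordinate convention are not specified here, so check the model card before relying on this. The scale factors cover the case where the screenshot was downscaled before inference.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def parse_box(text: str) -> Box:
    """Parse a plain 'x1,y1,x2,y2' string (assumed format) into a box."""
    parts = [float(p) for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}")
    x1, y1, x2, y2 = parts
    if x2 <= x1 or y2 <= y1:
        raise ValueError("box must have positive width and height")
    return x1, y1, x2, y2


def click_point(box: Box, scale_x: float = 1.0, scale_y: float = 1.0) -> Tuple[int, int]:
    """Center of the box, mapped back to screen pixels.

    scale_x and scale_y are screen size divided by model-input size,
    for example 2.0 if a 3840-wide screenshot was resized to 1920.
    """
    x1, y1, x2, y2 = box
    return round((x1 + x2) / 2 * scale_x), round((y1 + y2) / 2 * scale_y)


box = parse_box("410, 52, 490, 84")  # made-up coordinates for an export button
print(click_point(box, scale_x=2.0, scale_y=2.0))
```

Clicking the center rather than a corner gives the most margin for small localization errors, which is the practical reason tight boxes matter.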

Why this matters for builders

If you are building agentic products, LocateAnything fills a specific and previously expensive slot. Until now, teams stitched together a patchwork — one model for OCR, another for layout, a brittle YOLO detector for objects, and a separate prompt-heavy VLM for screen understanding. A single fast grounding model collapses that stack and removes the latency tax of chaining several models per step.

For teams across the MENA region, the practical implications are concrete. Document-heavy workflows — invoices, contracts, government forms, multilingual paperwork — depend on reliably finding the right region before extracting it. Retail and operations teams piloting computer-use agents need grounding that survives real, cluttered enterprise dashboards rather than clean demos. And the fact that the model runs locally on your own GPU matters for data-sovereignty requirements, where sending screenshots of internal systems to a third-party API is a non-starter.

NVIDIA has also signaled where this is heading: LocateAnything serves as a perception foundation inside its larger production vision-language models, such as Nemotron 3 Nano Omni, supplying the grounding and GUI understanding those systems need for multimodal agentic work.

The catch to plan for

One important constraint: LocateAnything-3B ships under the NVIDIA License for non-commercial use — academic and non-profit research only. Commercial deployment is not permitted under the current terms. That makes it an excellent tool for prototyping, evaluation, benchmarking your own grounding pipeline, and research, but you will need to watch for a commercial-license track or a different model before shipping it into a paid product. Treat it as a preview of where open grounding is going, and a way to measure the ceiling, rather than a drop-in production component today.

The bigger picture

The agentic AI wave has spent two years obsessed with reasoning, planning, and tool protocols. LocateAnything is a reminder that the perception layer underneath all of that has been quietly improving too — and that grounding, not just intelligence, is what determines whether an agent can actually touch the world it is asked to operate in. Fast, accurate, open visual grounding is one of the missing pieces that turns impressive demos into agents that work on the messy screens and documents real businesses run on.