Skip to content

Conversation

@MiguelsPizza
Copy link

@MiguelsPizza MiguelsPizza commented Sep 17, 2025

This draft proposal outlines a declarative WebMCP API that enables web pages to expose tools via HTML, using minimal attributes like tool-name and standard form semantics. Currently, it's a compilation of my notes and ideas from developing this approach, and I'm sharing it to gather feedback before the September 18th working group meeting. I'm particularly interested in your thoughts on the open questions (e.g., JSON vs. HTML responses, elicitation flows), tradeoffs, and overall API design.

The proposed API was shaped by building a real application and polyfill during the MCP enterprise hackathon, where our team successfully implemented it (and took home the win, which was exciting validation!). You can see a video of a Rails app using declarative WebMCP tools to enable complex browser automation without client-side JavaScript: link.

Based on your feedback, I'll refine this draft to align more closely with the structured format and narrative style of other explainers in the repo.

@MiguelsPizza MiguelsPizza marked this pull request as ready for review September 17, 2025 04:08
@bwalderman
Copy link
Collaborator

This is great work. I do have one general question. Was reusing ARIA attributes instead of introducing new tool-* attributes considered?

There are already attributes such as aria-label and aria-description and others for labelling and describing elements and so it might be helpful to define WebMCP mappings/behaviors for these instead of introducing entirely new HTML attributes.

One benefit of using these existing attributes is that they are also surfaced in native accessibility APIs, so assistive tools that already use these APIs to access the page's accessibility tree would be able to access WebMCP tools declarations as well.

@MiguelsPizza
Copy link
Author

@bwalderman This is a good idea, I'll put the PR in draft while I re-implement the ARIA based polyfill.

The only thing I can think is that we still need a way to make exposing tools to the agent opt-in (or opt-out)

Maybe we still tag elements with a tool-name to expose them to the agent? This will help prevent duplicate tool names which causes errors in most inference providers

@vsakaria
Copy link

vsakaria commented Nov 7, 2025

The concern with the HTML method is of course the iFrame. Realistically speaking its stood up well for some years now. Would an iFrame in a browser be more trustworthy. I would prefer JSON and rendering on client. I am sure web components can be distributed with framework payloads and CSS. But the build process for this type of architecture would have to change. I would prefer that design.

The trade off really is that payload would need to be disputed more frequently.

@anssiko
Copy link
Member

anssiko commented Nov 11, 2025

@matatk to review for the accessibility group's perspective (aka APA WG).

@anssiko
Copy link
Member

anssiko commented Nov 25, 2025

A new paper and implementation experience:

https://arxiv.org/abs/2511.11287v1
https://svenschultze.github.io/VOIX/

@svenschultze & team, this W3C community group is developing a WebMCP API that is complemented with a declarative mechanism explored in this PR.

Let’s join forces to explore this space. Here’s how to join:
https://webmachinelearning.github.io/community/#join

@anssiko
Copy link
Member

anssiko commented Nov 25, 2025

That was fast. I’m excited to welcome @svenschultze to the WebML Community Group! 🎉

@svenschultze
Copy link

Hi @anssiko, thank you for making me aware of this project! It is great to see the community converging on this. I'm happy to share some insights from our work on VOIX, where we implemented a similar declarative framework and tested it with developers.

  1. We established a more explicit interface where MCP tools are separated from standard UI HTML elements. This ensures the agent only accesses data and actions the developer specifically intended to share. I think this is also relevant for the discussion about including ARIA attributes. I think it is important not to just reuse ARIA since this could lead to conflicts of interest between optimizing for accessibility or agents.
  2. Is there an equivalent idea for declarative context/resources in this spec? We found that it was really helpful to explicitly set agent-only text elements (in our case, specific <context name="mouse_position"> elements). This avoids long context inputs of the full html text, hides potentially sensitive data like credit card numbers, and enables high-fidelity synergetic multimodal interaction where UI hover/selection states can be explicitly exposed to the agent. This way, you can interact with websites using commands like "move this to here" without requiring long chains of tool calls.

@bwalderman
Copy link
Collaborator

Following up on my earlier suggestion to look at ARIA: ARIA attributes for browsing agents was discussed at TPAC and from what I understand, the consensus is this is not actually a good idea. One of the concerns raised was the same that @svenschultze mentioned above, that reusing ARIA could lead to conflicts of interest and possibly incentivize web developers to optimize for machines instead of people using a11y tools.

- To preserve correctness and avoid race conditions, the browser MUST enforce this ordering for a single tool invocation:
1) Execute the tool and fully evaluate its body (JS `execute` or form submission/response parsing).
2) Deliver the tool result to the client (agent) over MCP.
3) Recompute the catalog (scan DOM + JS registrations), compute deltas, and if changed, emit [`notifications/tools/list_changed`](https://modelcontextprotocol.io/specification/2025-06-18/schema#notifications%2Ftools%2Flist-changed) to all connected clients for this server.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user agent wouldn't have a step here to actively recompute the catalogue, right? I would assume that any side effects that result from executing the last tool (i.e., adding new elements to the DOM that hav a tool-name attribute, or invoking script which registers a new tool imperatively) would just keep the catalogue up-to-date always, right?

Apologies if I'm missing how MCP's notifications/tools/list_changed fits in here.

1) Execute the tool and fully evaluate its body (JS `execute` or form submission/response parsing).
2) Deliver the tool result to the client (agent) over MCP.
3) Recompute the catalog (scan DOM + JS registrations), compute deltas, and if changed, emit [`notifications/tools/list_changed`](https://modelcontextprotocol.io/specification/2025-06-18/schema#notifications%2Ftools%2Flist-changed) to all connected clients for this server.
4) Apply any navigation/unmounts/redirects (e.g., those implied by `_meta.uiRedirect` or HTTP redirects).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the only thing that can specify an action that the browser should take after the response is obtained? Having a full list of actions that could be done (navigations, displaying more UI / appending things to the DOM) would be great—I'm not sure what MCP specifies is possible outside of _meta.uiRedirect.

2) Deliver the tool result to the client (agent) over MCP.
3) Recompute the catalog (scan DOM + JS registrations), compute deltas, and if changed, emit [`notifications/tools/list_changed`](https://modelcontextprotocol.io/specification/2025-06-18/schema#notifications%2Ftools%2Flist-changed) to all connected clients for this server.
4) Apply any navigation/unmounts/redirects (e.g., those implied by `_meta.uiRedirect` or HTTP redirects).
- In particular, the tool that is currently executing MUST NOT be unmounted (e.g., DOM removal or JS unregistration) until after its result has been delivered to the client and any [`notifications/tools/list_changed`](https://modelcontextprotocol.io/specification/2025-06-18/schema#notifications%2Ftools%2Flist-changed) notification has been sent.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super clear about this part. Imagine the user submits a tool_name= form, but before the agent receives the response, some random script comes in and calls form.remove(), removing it from the DOM. Is this line saying that we shouldn't let that DOM API work? That's not really feasible. What is the intent here?


Notification shape
- Method: [`notifications/tools/list_changed`](https://modelcontextprotocol.io/specification/2025-06-18/schema#notifications%2Ftools%2Flist-changed)
- Use when: any change to the catalog (addition, removal, or metadata/schema change) is observed after a recomputation.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this make sense to me. Basically this would mean the browser sends a list_changed notification to the ... agent? ... whenever declarative tool is appended to or removed from the DOM, or whenever the tool-name attribute is added to an existing <form>, <a>, or <button>, right?

```
interface ToolListChangedNotification {
method: "notifications/tools/list_changed";
params?: { _meta?: { [key: string]: unknown }; [key: string]: unknown };

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just so I'm clear, is this the shape of the notification that's sent to the agent? (I'm still cutting my teeth on the whole MCP spec, so bear with me if this is a dumb question...

P-->>B: result
else Declarative form
B->>S: fetch(action, method, body/query) [Accept: JSON]
S-->>B: JSON result (optional Location)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is Location? Is this the meta UI redirect?

S-->>B: JSON result (optional Location)
end
B-->>A: Return CallToolResult
B->>P: Recompute catalog (DOM + JS)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this step is what I was commenting on earlier. I'm not totally sure why we'd need to recompute the list of available tools at this stage. I'd assumed this would be automatically kept up to date by operations that add/remove declarative tools (or imperative, JS-based tools) on its own. Is it possible that the CallToolResult JSON specifies new tools to be added?

- Slightly delays visible navigation until result delivery.
- Page authors can’t forcibly interrupt an in-flight call (must cancel cooperatively).

Strategy B — Navigate-early (allow unmount during execution)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure I'm clear on what "Navigate-early" means here—what is the navigation referring to? Are you referring to the action attribute's URL for the form? Or the navigation that the agent would tell the browser to do, once it receives the meta UI redirect? I think you're referring to the former.

On that assumption, I'm not sure what Strategy B really looks like. Given what you say below in "Matches default web navigation behavior", it sounds like submitting a form would trigger a navigation to the form's action attribute URL. So what does the agent get in return? Does it also fetch the Accept: JSON request alongside the navigation, race the two requests, and sometimes the agent receives the JSON before the navigation completes, and other times the navigation completes first and the agent receives nothing? Is that "Strategy B"?

- Visibility: Ignore `tool-elicit` on non-interactive controls such as `input[type=hidden]`, `disabled`, or `readonly` controls. Authors should instead expose an interactive control if they want user input.
- Merging: Final submitted values come from the elicitation UI for elicited fields (prefilled by agent/defaults); all other fields submit resolved values from agent/defaults/DOM as usual.
- Validation: Standard HTML constraints gate submission. `required` still applies; `tool-elicit` doesn’t change schema.
- UX: The browser chooses the UI, but SHOULD display each elicited field’s label or `tool-param-title` and (optionally) `tool-param-description`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the UI resource used for the elicitation of user input is entirely in the control of the browser, right? The agent does not provide any HTML/CSS resources to display to the user to collect such information?

@domfarolino
Copy link

Thinking through it a little more, I'm a little concerned about the Accept: application/json part of the proposal. Why does the proposal lean on it? It seems to lean on it to increase the likelihood that the agent gets structured JSON back from the action URL endpoint, instead of HTML. But we need to consider the fact that this header does not enforce JSON responses, it's just used for content negotiation. So:

  1. Do we really expect server authors to supply JSON/agent-readable responses alongside traditional HTML responses, at the same endpoint? Historically, https://wiki.whatwg.org/wiki/Why_not_conneg#Negotiating_by_format says this is pretty rare, so I'm less sure about leaning on it. At least initially, almost all front-end authors that slap tool-name on their forms will be feeding straight HTML into the agent, which presumably will be rejected either by the agent, or by WebMCP's spec text that processes the response before feeding it to the agent.
  2. Do we have to worry about old/legacy action URL endpoints that don't expect to be hit as a result of agent actuation, suddenly being hit by it? I guess not, maybe the whole WebMCP proposal relies on the assumption that the AI agent would only call tools (and submit forms) that the user would do themselves if they weren't working through an agent, so maybe we don't need to worry about it. I'm just trying to make sure we don't end up in a situation where people think slapping tool-name on a <form> only hits the "agent-ready"-version of the endpoint if it exists and nothing else, and therefore it's free to call and doesn't have the same destructive properties as traditional forms. (When this is not true, since Accept: application/json makes no such guarantees; it just calls the usual endpoint, and the server has to be ready to take the hint as a part of content negotiation).

If either of the above concerns are big enough, I wonder if makes sense to use something like tool-url that overrides action...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants