Grizzly

MCP security analysis: quickly check whether an MCP server attempts tool poisoning or prompt injection via its tools, resources, or prompts
LLM Playground: test your MCP server with an LLM (you can set up your model and API key; only Anthropic models are supported for now)
LLM Playground (DARK)
LLM Playground: your LLM can use the tools of the MCP server you connected to the inspector
Tools Evaluation: generate test cases to evaluate how prone your tools are to confusion (when tool descriptions are too similar)
Tools Evaluation: Grizzly explains why your tests failed so you can improve your MCP servers
Tools Evaluation: Check for model provider consumption problems and MCP specification completeness

Inspiration

While contributing to MCP and developing our own servers, we found Inspector useful but limited. Grizzly adds the features we wish we had.

Who it's for

Anyone who needs to verify an MCP server's safety and capabilities before using it
MCP server developers looking for a complete debugging toolset
MCP client developers looking to benchmark a group of MCP servers for their use-case

Try it out: npx @alpic-ai/grizzly

What it does

Grizzly evaluates MCP servers across three critical dimensions:

Stability: Tests protocol implementation accuracy
Performance: Detects potential tool confusion and checks that tools are well understood by the LLM
Security: Identifies prompt injection, tool poisoning, and other security-related issues
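To give a feel for the kind of completeness check the Stability dimension refers to, here is a minimal sketch in Python. It is not Grizzly's implementation. It assumes a list of tool definitions shaped like the `tools` array of an MCP `tools/list` result, and the helper name `find_incomplete_tools` is hypothetical. Note that the MCP specification treats `description` as optional; the sketch flags a missing one only because an LLM relies on it to choose a tool.

```python
from typing import Any


def find_incomplete_tools(tools: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Return a mapping of tool name -> list of problems found (hypothetical helper)."""
    problems: dict[str, list[str]] = {}
    for index, tool in enumerate(tools):
        name = tool.get("name") or f"<unnamed tool #{index}>"
        issues: list[str] = []
        if not tool.get("name"):
            issues.append("missing name")
        # Optional in the spec, but the LLM needs it to pick the right tool.
        if not (tool.get("description") or "").strip():
            issues.append("missing description")
        schema = tool.get("inputSchema")
        if not isinstance(schema, dict) or schema.get("type") != "object":
            issues.append("inputSchema is not an object schema")
        if issues:
            problems[name] = issues
    return problems


# Illustrative tool definitions, not taken from a real server.
tools = [
    {
        "name": "get_weather",
        "description": "Get the forecast for a city.",
        "inputSchema": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
    {"name": "search", "inputSchema": {"type": "object"}},
]
report = find_incomplete_tools(tools)
print(report)
```

Only the second tool is reported here, because it has no description.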

Grizzly also includes a chat playground so you can test your server's behavior with your chosen LLM directly from the web app.

How we built it

We built on top of the existing inspector: https://github.com/modelcontextprotocol/inspector. The original inspector is clear about its intention to remain LLM-agnostic and focused on the protocol itself. The added features require an LLM provider to:

evaluate tool, resource, and prompt descriptions and look for security issues
generate test cases
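As an illustration of what a tool confusion test case could look like, here is a minimal sketch in Python. The `ToolTestCase` shape, the `run_confusion_tests` helper, and the tool names are all assumptions made for this example, not Grizzly's actual format. The model call is replaced by a stub (`fake_pick_tool`) so the example runs without an API key.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolTestCase:
    prompt: str         # user request sent to the model
    expected_tool: str  # tool the model should select


def run_confusion_tests(
    cases: list[ToolTestCase],
    pick_tool: Callable[[str], str],
) -> list[tuple[ToolTestCase, str]]:
    """Return every (case, chosen_tool) pair where the wrong tool was chosen."""
    failures = []
    for case in cases:
        chosen = pick_tool(case.prompt)
        if chosen != case.expected_tool:
            failures.append((case, chosen))
    return failures


def fake_pick_tool(prompt: str) -> str:
    # Stand-in for an LLM call: a naive keyword rule that confuses two tools.
    return "search_issues" if "bug" in prompt.lower() else "search_docs"


cases = [
    ToolTestCase("Find the installation guide", "search_docs"),
    ToolTestCase("Is there an open bug about login?", "search_issues"),
    ToolTestCase("Look up the ticket for the login crash", "search_issues"),
]
failures = run_confusion_tests(cases, fake_pick_tool)
for case, chosen in failures:
    print(f"{case.prompt!r}: expected {case.expected_tool}, got {chosen}")
```

In a real run, `pick_tool` would send the prompt and the server's tool list to the model and return the name of the tool it decided to call.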

Challenges we ran into

Our initial strategy for detecting tool confusion was to compute embeddings of the tool descriptions and compare the distances between them against a threshold value. However, Anthropic lacks an embeddings API, and we didn't want to ask users to provide two keys to use our tool. We changed strategy and had Sonnet build tool test cases instead.
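The abandoned embedding approach can be sketched as follows. This is a minimal illustration in Python, assuming cosine distance as the metric and a threshold of 0.1; the source specifies neither. The vectors are toy values made up for the example (a real run would obtain them from an embeddings provider), and `confusable_pairs` is a hypothetical helper name.

```python
import math
from itertools import combinations


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def confusable_pairs(
    embeddings: dict[str, list[float]],
    max_distance: float = 0.1,  # assumed threshold, to be tuned
) -> list[tuple[str, str, float]]:
    """Flag every pair of tools whose description embeddings are too close."""
    flagged = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        distance = cosine_distance(vec_a, vec_b)
        if distance <= max_distance:
            flagged.append((name_a, name_b, distance))
    return flagged


# Toy description embeddings, invented for illustration only.
embeddings = {
    "search_docs": [0.9, 0.1, 0.0],
    "search_issues": [0.8, 0.2, 0.0],
    "send_email": [0.0, 0.1, 0.9],
}
pairs = confusable_pairs(embeddings)
for name_a, name_b, distance in pairs:
    print(f"{name_a} / {name_b}: distance {distance:.3f}")
```

With these toy vectors, only the two search tools fall under the threshold, which is the kind of pair the check was meant to surface.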

Accomplishments that we're proud of

The look and feel of this renewed inspector!

What we learned

Anthropic has no embeddings API
A one-day hackathon is short, but leaning heavily on vibe-coding is the go-to strategy!

What's next for Grizzly

Support for additional LLM providers beyond Anthropic
Add support for resources and prompts in the LLM playground in addition to tools
Export generated tool confusion test cases so they can be versioned and used in CI