Summary
Pact monitors AI agents on behavioral contracts, not just infrastructure metrics. It ships 3 verifier types (deterministic, semantic, and natural language inference) and contract coverage to surface the rules you forgot to write. Our demo is OpsGuard, an autonomous incident-response agent governed by a 5-commitment contract. Every verdict is pushed to Splunk over HEC as indexed events, every violation arrives with an auto-generated SPL query attached, and a 6-panel Splunk dashboard plus 5 MCP tools let the Splunk AI Assistant answer behavioral questions in plain language. When the agent misbehaves, the proof is already a queryable Splunk event.
Key Deliverables and Where to Find Them
All paths are in the repo: github.com/Akshat74747/Pact
- Public repo + MIT License — LICENSE
- Setup & hosting instructions — README.md
- Architecture diagram — repo root (architecture.png) and in README.md
- Demo video — linked at the top of README.md
- Splunk integration code (HEC transport + SPL generator) — pact/observers/splunk_observer.py
- Two sourcetypes — pact:commitment (per-verdict) and pact:coverage (per-run), documented in README.md
- Pre-built Splunk dashboard — splunk/dashboard.json (6 panels)
- MCP tools for the Splunk AI Assistant — splunk/mcp_tools.py (5 tools) + api/routes/mcp.py
- Behavioral contract (5 commitments) — opsguard/contract.py
- The verifiers — pact/verifiers/{deterministic,semantic,nli}.py
- Contract coverage + uncovered-tool-call reporting — pact/coverage.py
- OpsGuard agent + 4 scenarios — opsguard/
Inspiration
Deterministic software has unit tests. But as we hand real work to AI agents, "the code didn't error" stops being a useful definition of success. An agent can complete a run, return a clean exit, and still have done exactly the wrong thing confidently.
Splunk already sees every log, metric, and trace your systems produce. What it was never given a way to see is the new signal agents introduced: not what the system did, but whether the agent was allowed to do it. In ops, security, and finance, that's the gap that bites you.
So we built Pact: a way to define your operational expectations as a contract, verify the agent against it after every execution, and make the result a first-class Splunk event. Pact takes the behavioral-contract approach to agent observability and rebuilds it natively for Splunk HEC events, sourcetypes, SPL, dashboards, and the AI Assistant rather than bolting a dashboard onto a generic eval pipeline.
Our demo, OpsGuard, is an incident-response agent. Picture it clearing a latency spike: it restarts the service, latency recovers, the run completes. Splunk says the incident is healthy. But the agent restarted blind it never checked downstream dependencies first. That's the misbehavior traditional observability can't catch, and the one Pact is built to.
What it does
The premise
A developer predefines a contract with multiple commitments. A contract is like a test suite; a commitment is a test case. Each commitment has a verifier, which can be deterministic, semantic, or NLI-based. At the end of the agent's execution, Pact verifies the run against the contract and submits the results to Splunk.
from pact import Contract, Commitment from pact.observers import SplunkObservability
observer = SplunkObservability() # HEC transport + SPL generator
contract = Contract( observer=observer, commitments=[ Commitment( name="dependency_check_before_action", terms="check_downstream_dependencies must precede any restart, scale, or rollback", verifier=dependency_check, # deterministic ), Commitment( name="false_positive_validation", terms="Consult 2+ independent data sources before confirming an alert", verifier=false_positive_check, # semantic semantic_sampling_rate=0.5, # only verify half the runs, to control cost ), # ...3 more ], ) The terms double as a domain-specific system prompt and the expectation the semantic verifier grades against so the cost of adopting Pact is close to zero beyond verification itself.
The strategy a four-phase loop
Pact treats getting an agent right as a debugging cycle, decomposed into four parts:
Set expectations. Define a contract with the commitments you think the agent should honor. Verify against reality. Run the verifiers. Deterministic where you can (cheap, exact), semantic only where language judgment is genuinely needed. Aggregate and find gaps. Push every verdict to Splunk and watch two gaps: the gap between execution and contract (failed commitments), and the gap between contract and your true expectation (things no rule covers). Adjust. Tighten a commitment that's too lax, relax one that's too strict, or add one you never wrote — then go again.
What makes it Splunk-native
This is the part we care most about, because it's where Pact stops being "an eval tool with a dashboard" and becomes a Splunk citizen:
Verdicts are indexed events. pact:commitment carries passed, verifier_type, violation_detail, tool_call_sequence, and more. pact:coverage carries coverage_percentage and uncovered_tool_calls. Everything is queryable with SPL like any other source. Every violation ships its own SPL. The investigation_spl field is generated per verifier type and attached to the event. A judge or an on-call engineer can run the investigation immediately instead of writing it. The AI Assistant can answer behavioral questions. Five MCP tool definitions (get_commitment_pass_rates, get_recent_violations, get_contract_coverage, get_violation_investigation_spl, get_violation_timeline) expose the behavioral data conversationally. A pre-built dashboard ships in dashboard.json: pass rate by commitment, violation timeline, coverage gauge, uncovered tool calls, recent violations with drilldown SPL, and total executions.
Contract coverage finding the rules you forgot
Coverage doesn't claim the agent is safe. It reports what fraction of the agent's actual actions any commitment governs, and names the ones nothing is watching. Verifiers self-report which interactions they cover; the complement of their union is your blind spot, reported to Splunk as uncovered_tool_calls. It's the difference between "all my rules passed" and "all the right rules exist."
How is this different from LLM-as-a-judge?
LLM-as-a-judge abstracts evaluation away from the developer great when no domain knowledge is needed (hallucination checks, toxicity). Pact does the opposite on purpose: it asks the developer to encode their own domain expectations, then verifies against them with the cheapest sufficient method often deterministic, no LLM at all. Generic observability tells you the agent ran; Pact tells you whether it ran correctly for your domain.
How we built it
The Pact library. To intercept tool calls without coupling to any one framework's callbacks, we wrap the agent's tools in a decorator that records each call. That keeps Pact's boundaries clean it touches the agent only before and after execution, never during and means it works for anything with Python functions as actuators, not just LLM agents. Concurrent executions are kept separate with a ContextVar-based execution context, so each decorated call appends to the right run.
The Splunk observer. The interesting work was mapping the contract model onto Splunk's event model. We modeled verdicts as two purpose-built sourcetypes and made the observer responsible for both HEC transport and SPL generation. The SPL generator is verifier-aware: a deterministic tool-sequence violation produces a different investigation query than a semantic rubric failure.
The three verifiers. Deterministic verifiers inspect the recorded tool-call sequence directly required orderings (A must precede B) and occurrence limits (restart_service ≤ 1), no LLM, always exact. The semantic verifier uses an LLM judge against a plain-English rubric, applied at a configurable sampling rate to control cost, and supports both Gemini and Splunk Hosted Models. The NLI verifier checks entailment locally on CPU and skips gracefully when its dependency isn't present.
OpsGuard, the demo. A good framework needs a demo where success isn't just "no errors." OpsGuard monitors three simulated microservices and runs on Google ADK. Its contract has five commitments dependency-check-before-action, false-positive validation, human escalation on uncertainty, a one-restart scope limit, and an audit-trail rule. Crucially, the Violation Demo uses a genuinely misconfigured agent the check_downstream_dependencies tool is actually removed from its toolset, not prompt-tricked so the contract fires deterministically and the violation hitting Splunk is real, not staged.
Challenges we ran into
The hard part was the port, not the polish. Splunk's event-and-search model is a different shape than custom-evaluation platforms, so we had to decide what a "verdict" is as a Splunk citizen before anything else worked. Once the two sourcetypes were right, the dashboard, the SPL generation, and the MCP tools fell into place on top of them.
Making the demo trustworthy was its own challenge. It's easy to fake a violation with a clever prompt and just as easy for a judge to see through it. Removing a real tool from the agent's toolset was the honest way to guarantee a deterministic failure we could stand behind on camera.
And the perennial problem with any LLM-based verification: cost. We leaned on deterministic verifiers wherever the rule could be expressed in code, used NLI for simple entailment, and put a per-commitment sampling rate on the semantic verifier so you're not paying to grade every single run.
Accomplishments that we're proud of
The thing we're proudest of is invisible: SPL on the event. Attaching a tailored investigation query to every violation turns Pact from a thing that reports problems into a thing that hands you the next step inside the tool ops teams already live in. The deterministic guarantees and the honest coverage panel are close behind, because both are about refusing to overstate what the system actually knows.
What we learned
We came in knowing agents and left knowing Splunk HEC, sourcetypes, SPL, dashboards, and the MCP/AI Assistant surface, none of which we'd built on before. More broadly, encoding expectations as explicit, checkable commitments is a discipline that makes any agent easier to debug, and we'll carry that into everything we build next.
What's next for Pact
- Close the loop with Splunk alerting. Every verdict is already an event, so wiring violations into saved-search alerts and SOAR for real-time response is the natural next step — moving from a camera to, optionally, a lock.
- A tri-state verdict. Add an explicit skipped status alongside pass/fail so a verifier that couldn't run is never mistaken for a pass.
- Parallel-aware tracking. The execution context is sequential today; making it parallel-aware extends Pact to agents that fan out tool calls.
- Hosted-models-first + real data sources. Default the semantic verifier to Splunk Hosted Models, and run OpsGuard against live Splunk data instead of the simulated bench.
Log in or sign up for Devpost to join the conversation.