AI is not plug-and-play for businesses. Most don’t simply try out options during a review period, settle on a favorite, purchase it, set up AI security and access controls, deploy it, and call it done.
If you do, you may quickly find your solution lagging behind the competition. Increasingly, that could mean your version hallucinates more, reasons less effectively (if at all), and only helps with an isolated set of tasks compared to the newer and more autonomous releases that keep coming fast and furious.
I would call this approach a guaranteed way to end up with AI envy.
Even if you use a third-party platform that swaps models in and out for you, or build out your own model-agnostic software layer, you probably have systems coming and going.
Likely you run pilot programs, trial a number of different solutions, and probably aren’t entirely sure what all your employees are using across the full organization.
And even if you have adaptable, effective AI policies at your workplace that govern usage, how do you go about introducing new systems or solutions?
Again, traditional approaches seem to easily break down.
In today’s newsletter, I consider an alternative: an onboarding approach comparable to the one we use for new employees. AI may not really think and feel, but it does make mistakes, draw on real human knowledge, and do some things better than others (and in particular ways). And of course some solutions also work better than others and, depending on their underlying model, may change over time.
I also want to work AI KPIs and performance metrics into this, given that AI makers self-report performance irregularly and new models are often better at some things while worse at others. How do you identify and track changes in these varying areas?
What should you realistically expect, and how do you continue to monitor?
AI Agent Onboarding: The Case for Giving AIs Employee Treatment
For traditional, chatbot-based GenAI systems, where you’re using AI like a glorified knowledge base, an entire onboarding process may be overkill.
But agentic AI doesn’t just respond to your prompts and draw on confined sources.
It can plan, evaluate, and take autonomous action. And even if current models can’t yet go very far on their own without stumbling, each new release can chain together more actions without needing to check back in.
When Anthropic’s Claude Opus 4 was released earlier this summer, it marked a large jump in the number of steps it could actually take on its own.
Ars Technica’s Benj Edwards detailed how Anthropic’s Opus 4 played Pokémon coherently for 24 hours straight, while Claude Code famously refactored code without human input for seven straight hours.
This level of depth is extremely intriguing, but it also opens the door to a number of real concerns for business leaders.
For one, identity controls for AI suddenly become far more important. (This is one reason Palo Alto Networks spent a reported $25 billion acquiring CyberArk earlier this month.)
Around the same time Opus 4 arrived, Microsoft introduced Entra Agent ID, aimed at staying ahead of security issues from “AI agent sprawl.” It is designed to give AI agents secure identities that can be managed much as they are for people, with conditional access applied in real time.
And far more broadly, the International Organization for Standardization provides a number of AI-related standards, with ISO/IEC 42001 specifically providing AI governance best practices in the form of a management system.
It’s just one of many out there geared to help you structure your management of AI lifecycles.
Now let’s look more specifically at this idea of onboarding your agents like employees.
Responsible AI Deployment in Action
You can think of AI management like managing any other part of your workforce.
Consider these stages:
Formalized Description of Fit
This is roughly comparable to an offer letter stage, after you’ve decided you want the addition at the right price.
Here you would define the AI’s role, scope, and ownership. Is there a job description it is being deployed to meet? There should be.
With agentic AI, this process becomes more and more important: you can’t evaluate success without it. What outcomes will the AI own?
For enterprise AI risk management, what is the classification of the role? What are the expectations both in terms of accomplishments and required standards of quality?
Who is the human manager or backup for this process or system? Having a single human contact responsible for safety and monitoring is comparable to assigning a direct manager.
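To make this concrete, here is a minimal sketch of what such a formalized role description might look like in code. The structure and field names are illustrative assumptions, not a standard; adapt them to your own governance documents.

```python
from dataclasses import dataclass

@dataclass
class AgentRoleDescription:
    """Illustrative 'job description' for an AI agent; all fields are hypothetical."""
    name: str                       # e.g., "invoice-triage-agent"
    owned_outcomes: list[str]       # the outcomes this agent is accountable for
    scope: list[str]                # systems and data it is allowed to touch
    risk_classification: str        # e.g., "low", "medium", or "high"
    quality_bar: dict[str, float]   # required standards, e.g., {"routing_accuracy": 0.95}
    human_owner: str                # the single accountable human contact

role = AgentRoleDescription(
    name="invoice-triage-agent",
    owned_outcomes=["route inbound invoices to the correct approver"],
    scope=["erp:invoices:read", "email:drafts:create"],
    risk_classification="medium",
    quality_bar={"routing_accuracy": 0.95, "max_hallucination_rate": 0.02},
    human_owner="ap.manager@example.com",
)
```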
Administering Employee Accounts
New hires need a full system setup, with an employee ID.
The same is true for AI agents.
Non-human identities (NHI) already outnumber human IDs 50 to one in many organizations, per numbers from Oasis Security.
These can include service and system accounts, IAM roles, and identities that enable authentication activities (around API keys, tokens, and certificates).
AI agents are another category that needs unique identification for access and auditing. And these credentials, like employee IDs, need rotation, tracking, and the capacity to be revoked.
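As a rough illustration, the sketch below shows what issuing, rotating, and revoking agent credentials with an audit trail could look like. It is a toy stand-in for a real IAM or secrets-management system; the class and method names are assumptions for this example only.

```python
import secrets
from datetime import datetime, timedelta, timezone

class AgentCredentialRegistry:
    """Toy registry: each agent gets a unique ID and a token that can be rotated, audited, and revoked."""

    def __init__(self, rotation_period_days: int = 30):
        self.rotation_period = timedelta(days=rotation_period_days)
        self._records = {}   # agent_id -> {"token", "issued_at", "revoked"}
        self.audit_log = []  # append-only trail of credential events

    def issue(self, agent_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._records[agent_id] = {
            "token": token,
            "issued_at": datetime.now(timezone.utc),
            "revoked": False,
        }
        self.audit_log.append((datetime.now(timezone.utc), agent_id, "issued"))
        return token

    def needs_rotation(self, agent_id: str) -> bool:
        record = self._records[agent_id]
        return datetime.now(timezone.utc) - record["issued_at"] > self.rotation_period

    def rotate(self, agent_id: str) -> str:
        self.audit_log.append((datetime.now(timezone.utc), agent_id, "rotated"))
        return self.issue(agent_id)

    def revoke(self, agent_id: str) -> None:
        self._records[agent_id]["revoked"] = True
        self.audit_log.append((datetime.now(timezone.utc), agent_id, "revoked"))
```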
Walk Before You Run: Starting at Least Privilege
AI agents don’t yet arrive at the level of the most experienced humans, no matter what AI companies may say.
And just as you wouldn’t give most new employees lacking this expertise full access to your system without a trial period, the same holds for AI.
Begin with read-only access and least privilege: grant only the permissions that are absolutely necessary, communicate them clearly, and make sure the potential ramifications are fully understood.
From here you can expand privileges as a form of AI sandbox training where behavior is observed.
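One way to picture this is a simple permission gate around an agent’s tool calls, where anything not explicitly granted is blocked and every expansion of scope is an attributable human decision. The tool names and class below are hypothetical placeholders, not a reference implementation.

```python
class ToolPermissionGate:
    """Allow an agent only the tools it has been explicitly granted; start read-only."""

    def __init__(self, granted: set[str]):
        self.granted = set(granted)
        self.denied_attempts = []  # blocked calls, kept for review

    def call(self, tool_name: str, fn, *args, **kwargs):
        if tool_name not in self.granted:
            self.denied_attempts.append(tool_name)
            raise PermissionError(f"Agent is not granted '{tool_name}'")
        return fn(*args, **kwargs)

    def grant(self, tool_name: str, approved_by: str) -> None:
        # Expanding scope is an explicit, attributable human decision.
        print(f"'{tool_name}' granted by {approved_by}")
        self.granted.add(tool_name)


# Start with read-only tools; write access comes later, after observed behavior.
gate = ToolPermissionGate(granted={"crm.read", "docs.search"})
gate.call("crm.read", lambda: "customer record")   # allowed
# gate.call("crm.write", lambda: ...)              # would raise PermissionError
```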
AI Orientation and Review Period
Prompt injection is an ongoing issue for LLMs, where commands can be introduced as invisible or buried text, via emails, calendar invites, or third-party websites, for example.
We covered this in a recent PTP Report, and if your AI system has access to your data, third-party content, and a capacity for external communications, it can pose a risk.
You can think of orienting AI agents as you would an employee, with masked/de-identified data in a sandbox setting. Here you can verify guardrails like rate limits and kill switches are aligned and functioning as needed.
The Open Web Application Security Project (OWASP) is an example of an organization that provides GenAI security assistance that’s very useful in this stage, such as their list of the top 10 risks and defenses for LLMs. (Number one is prompt injection, but it also includes handling outputs, poisoned training data, disclosures, excessive agency, and more.)
Verify these mitigations are in place as you monitor in the sandbox.
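For a sense of how the guardrails themselves can be wired in, here is a minimal sketch of a rate limit plus a kill switch wrapped around an agent’s step function. It assumes a hypothetical `agent_step` callable and is intentionally simplistic; production guardrails would live in your orchestration or gateway layer.

```python
import time

class GuardrailWrapper:
    """Sketch of two basic guardrails: a rate limit and a kill switch."""

    def __init__(self, max_calls_per_minute: int = 30):
        self.max_calls = max_calls_per_minute
        self.call_times = []   # timestamps of recent calls
        self.killed = False    # flipped by a human or an automated monitor

    def kill(self, reason: str) -> None:
        self.killed = True
        print(f"Kill switch activated: {reason}")

    def run(self, agent_step):
        if self.killed:
            raise RuntimeError("Agent halted by kill switch")
        now = time.monotonic()
        # Keep only calls from the last 60 seconds, then enforce the cap.
        self.call_times = [t for t in self.call_times if now - t < 60]
        if len(self.call_times) >= self.max_calls:
            raise RuntimeError("Rate limit exceeded; pausing agent")
        self.call_times.append(now)
        return agent_step()
```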
I cover benchmarks a bit below, but KPIs will be specific to you (hallucination rate, accuracy, customer satisfaction, cost per task); define and monitor them as you would any employee evaluation.
Success? Promote Thoughtfully and Carefully
If your AI “hire” is meeting expectations and could do more with added scope, elevate privileges and responsibilities carefully, as you would with any promotion.
Keep people in the loop, and maintain access reviews and KPI evaluations.
Not Cutting It? Know How You’re Terminating and Archiving
If your system’s not making the grade, or is cost-prohibitive, deprecate it with a specified offboarding process.
Identify how you revoke access, disable accounts, and archive necessary logs.
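A minimal sketch of that offboarding step might look like the following, reusing the hypothetical credential registry sketched earlier: revoke access, record the termination, and archive the agent’s audit trail before anything is deleted.

```python
import json
from datetime import datetime, timezone

def offboard_agent(agent_id: str, registry, archive_path: str) -> None:
    """Revoke an agent's access and archive its logs (illustrative only)."""
    registry.revoke(agent_id)  # no further access from this point on
    record = {
        "agent_id": agent_id,
        "offboarded_at": datetime.now(timezone.utc).isoformat(),
        "audit_log": [
            (ts.isoformat(), aid, event)
            for ts, aid, event in registry.audit_log
            if aid == agent_id
        ],
    }
    with open(archive_path, "w") as f:  # retain the trail for later review
        json.dump(record, f, indent=2)
```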
This may sound like common sense, but enterprise AI adoption right now is often happening very fast, and some of these processes get dropped in the confusion over whether AI is more a tool or a member of the team.
Ensuring the full AI lifecycle is managed (as with your workforce) will prevent chaos later.
AI Benchmarking Strategies for Your Business
We’ve written before on the challenges of trying to benchmark AI systems, including the pros and cons of the public benchmarks that are available (we’ve also updated these in our AI roundups).
But with AI systems being trained to score well in targeted benchmarks (and AI companies self-reporting performance in many cases), these measures can fail to objectively and successfully measure real business needs.
With adoption of AI so widespread, every organization should develop its own tests for measuring new AI releases.
Wharton School professor, AI researcher, and writer Ethan Mollick regularly advocates for companies building their own benchmarks from realistic business tasks, and he recently shared a Salesforce example called CRMArena-Pro.
This research (released in May) is proposed as an alternative to the poor state of public benchmarks, which the authors say fall short on realism, agent-to-user interaction, and ongoing multi-turn engagement.
Their approach uses 19 expert-validated tasks across the business, pairing AI agents with various human personas, for turn-based dialogue. It also uses synthetic data in a sandboxed Salesforce setting.
You can look at their paper (referenced below) for detail on their evaluation, which breaks skills into workflow, policy, text, and database categories, and for results from testing the GPT line, Gemini, and Llama models with the ReAct agentic framework.
One takeaway: they found moderate success over single-turn tests (58% success on average), but that success fell fast over multiple turns (down to around 35% on average here).
Crafting Custom AI Benchmarks
When building your own measures, begin with tasks that tie as directly as possible to existing KPIs. Query your subject matter experts for their takes on how to best set performance metrics for AI agents in ways that reflect your real work.
Begin with real workflows, and model off real dialogues where possible.
Ensure that your measures cover all the bases and also rotate them (it’s easy and tempting to fall in love with a home run on a single metric, for example). Cost is an obvious measure, but what is the latency variation?
Create task banks where AI is useful, including checks on confidentiality if this is applicable for your use cases (such as handling SSNs or other PII).
Like the Salesforce example, build in checks for data leaks and safety violations, using realistic but anonymized data, and rotate the data too, to avoid it becoming stale.
Run human experts through your tests, as well as open-source models, for baselines. And expect to see varied results even from the top-tier, closed-model offerings.
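As a starting point, a custom benchmark can be as simple as a rotating task bank and a shared scoring function that you run against humans, open models, and closed models alike. The sketch below assumes a hypothetical `run_model` callable and placeholder tasks; your real harness would pull from actual workflows and score against your own KPIs.

```python
import random

# Placeholder tasks; replace with your real workflows and rotate them regularly
# so they do not go stale or leak into training data.
TASK_BANK = [
    {"id": "ticket-triage-017", "prompt": "Classify this support ticket...", "expected": "billing"},
    {"id": "policy-check-042", "prompt": "Does this refund request follow policy X?", "expected": "no"},
]

def evaluate(run_model, tasks, sample_size: int = 2, seed=None):
    """Score a model (or a human baseline) on a rotating sample of tasks."""
    rng = random.Random(seed)
    sample = rng.sample(tasks, k=min(sample_size, len(tasks)))
    results = []
    for task in sample:
        answer = run_model(task["prompt"])
        results.append({
            "task_id": task["id"],
            "correct": answer.strip().lower() == task["expected"],
        })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return {"accuracy": accuracy, "results": results}

# Run the same harness against human experts and open-source models for baselines:
# print(evaluate(my_model_callable, TASK_BANK, seed=42))
```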
Conclusion: AI Performance Evaluation is Now Critical
We chastise AI models that fail, but failure is inevitable.
What’s more useful is to know how they fail, how often, and how they compare to both other models and humans doing the same tasks.
In my experience, without a concerted effort, this kind of information is hard to get reliably.
GPT-5 came out last week, and among its changes is a built-in router that sends your request to the model it deems the best fit. Unless you’re on the $200 plan, you no longer manually pick which model to use for which task. Unsurprisingly, this has met with some frustration from professional users.
These kinds of sudden changes, along with changing (and highly varied) token costs, are exactly why companies must be proactive with how they measure AI.
Thinking of AI agents like employees is not a perfect system. After all, we do not have AGI systems capable of learning on their own, and they do not truly think or feel, and most importantly, they cannot take responsibility for their actions.
But this kind of consideration can provide a starting place for the process of bringing on, training, promoting, and even terminating AI-based systems safely and consistently.
References
New Claude 4 AI model refactored code for 7 hours straight, Ars Technica
5 Things To Know On Microsoft Entra Agent ID, CRN
ISO/IEC 42001:2023, International Organization for Standardization (ISO)
Why CIOs should think like HR leaders to onboard agentic AI, TechRadar
What are non-human identities and why do they matter?, CSO
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions, arXiv:2505.18878 [cs.CL]