The vendor-agnostic AI tool stack
Categories of AI tools L&D should help their teams evaluate, what to look for in each, and what to be skeptical of. No vendor recommendations — the goal is to teach you to evaluate, not to sell you on specific products.
We don’t recommend specific AI tools to mid-market customers. Two reasons. First, the tool category is moving fast enough that any specific recommendation will be out of date within months. Second, the right tool depends on your stack, your data classification, your security posture, and the work you actually do — none of which we know better than you.
What we can do is map the categories worth knowing about, what each one solves, what to look for in a serious option, and what to be skeptical of. If you’re an L&D buyer trying to decide which tools your program should center on, this page is the framework.
Category 01 — General assistants
What it solves. The 70% of work that’s drafting, summarizing, brainstorming, restructuring, or asking questions. The default surface most knowledge workers will use most of the time.
What to look for.
- A version control story for prompts, so your team’s reusable prompts don’t live in screenshots (a minimal sketch follows this list).
- A clear data-handling posture — specifically: does your input train future models, can you opt out, and is there an enterprise-tier guarantee in writing.
- Workspace-level controls — admin visibility into who’s using the tool, what controls are enforceable, what data flows out.
- A meaningful long-context window. Anything below 100K tokens stops being viable for serious internal-document work.
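On the prompt versioning point above: one lightweight pattern is a git-tracked prompt file that teams load by name, so changes go through review and history comes for free. A minimal sketch, assuming a simple YAML layout; the file name, fields, and `load_prompt` helper are illustrative, not any vendor’s format.

```python
# prompts.yaml (tracked in git, changed via pull requests) might look like:
#
#   summarize_meeting:
#     version: 3
#     owner: l-and-d@example.com
#     text: |
#       Summarize the following meeting notes into decisions,
#       action items (with owners), and open questions.
#
# Loading by name gives every team the same reviewed prompt, and
# `git log prompts.yaml` gives you the change history for free.

import yaml  # pip install pyyaml


def load_prompt(name: str, path: str = "prompts.yaml") -> str:
    """Fetch a named, reviewed prompt from the shared file."""
    with open(path) as f:
        prompts = yaml.safe_load(f)
    return prompts[name]["text"]
```

The point is not this exact format; it’s that prompts live somewhere reviewable, diffable, and shared.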
What to be skeptical of.
- Vendors that lead with productivity claims (“write 10x faster”) rather than substance. The serious vendors compete on reliability, governance, and developer experience.
- “Custom GPT” features sold without governance. If anyone in the org can publish a workspace assistant with zero review, you’ve created a sprawl problem the moment you scale past 50 seats.
- Pricing tiers that gate basic security controls (SSO, audit logs, data-handling guarantees) behind enterprise contracts. That’s a vendor signaling they don’t take mid-market seriously.
Category 02 — Code copilots
What it solves. Code completion, refactor suggestions, multi-file edits, and the recent shift to agentic coding (where the assistant proposes and applies a sequence of changes across files).
What to look for.
- Does the assistant understand the whole repository, not just the open file?
- Does it integrate with your code review process (PR descriptions, suggested reviews, automated nitpicks)?
- A clear story on training-data provenance and license compliance for the suggested code.
- Telemetry your engineering leadership can read — adoption per team, suggestions accepted vs rejected, changes in time-to-merge (a sketch of these aggregates follows this list).
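The telemetry point above needs nothing exotic: if the vendor exports per-suggestion events, the leadership-readable numbers are simple aggregates. A minimal sketch, assuming a hypothetical record shape; adapt the field names to your vendor’s export.

```python
from collections import defaultdict

# Each event: which team, and whether the suggestion was accepted.
# The record shape is hypothetical -- adapt to your vendor's export.
events = [
    {"team": "payments", "accepted": True},
    {"team": "payments", "accepted": False},
    {"team": "platform", "accepted": True},
]


def acceptance_by_team(events):
    """Acceptance rate per team: accepted suggestions / all suggestions shown."""
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for e in events:
        shown[e["team"]] += 1
        accepted[e["team"]] += e["accepted"]  # True counts as 1
    return {team: accepted[team] / shown[team] for team in shown}


print(acceptance_by_team(events))  # {'payments': 0.5, 'platform': 1.0}
```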
What to be skeptical of.
- “10x productivity” claims for engineering teams. Real measurements are more modest and more honest — typical reports cluster around 20–40% gains on routine tasks, and much less on architectural work.
- Tools that won’t tell you their training-data sources. If you ship code commercially, this matters.
- Vendors that don’t differentiate between completion and agentic behavior. The latter requires materially more governance — agents shipping code into your repo deserve more than the same review process you applied to autocomplete.
Category 03 — Evaluation and observability
What it solves. The “is the AI output any good?” problem at scale. When a single human is reviewing every output, evaluation is ad hoc. When the team is doing thousands of AI-assisted operations a day, evaluation needs tooling.
What to look for.
- Support for both human-graded and automated evaluation, ideally side by side so you can calibrate one against the other (see the calibration sketch after this list).
- Versioning of prompts, evaluation criteria, and model versions — so a regression in any of the three is visible.
- Trace-level visibility into multi-step systems. When an agent fails, you need to see which step failed, not just that the final output was wrong.
- Cost and latency observability alongside quality.
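Calibrating automated scoring against human grades can start very simply: grade a sample both ways and track agreement over time. A minimal sketch, assuming paired pass/fail labels; the data here is illustrative.

```python
def agreement_rate(human_labels, judge_labels):
    """Fraction of samples where the automated judge matched the human grade.

    If this drifts down after a prompt or model change, recalibrate
    before trusting the judge's scores on their own.
    """
    assert len(human_labels) == len(judge_labels)
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)


# Illustrative: grade ~50 outputs by hand, run the same 50 past the judge.
human = [True, True, False, True, False]
judge = [True, False, False, True, False]
print(agreement_rate(human, judge))  # 0.8
```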
What to be skeptical of.
- Pure “LLM-as-judge” scoring without any human calibration. The model’s own judgment of its output is correlated with quality but is not the same thing.
- Evaluation tools that lock you into one model vendor. Your evaluation harness should outlive any single model.
- Vendors selling “AI evaluation” as a separate product when your existing observability stack (Datadog, Honeycomb, etc.) is adding LLM features quickly. The dedicated tool may still win, but the comparison is worth doing.
Category 04 — Agentic platforms
What it solves. Building multi-step automated workflows where AI is one component — alongside API calls, database operations, and human-in-the-loop steps. The “let me automate this 5-step process” problem.
What to look for.
- A visual builder and a code interface. Visual for the people designing the workflow; code for the people debugging it at 2am.
- A clear story on long-running tasks, retries, and idempotency. Many real workflows take minutes to hours; the platform needs to handle that without losing state (the sketch after this list shows what idempotency buys you).
- Native integrations with the systems you actually use — your CRM, your ticketing, your data warehouse — versus generic webhook adapters.
- Observability and rollback. When an agentic workflow makes a mistake at scale, you need to see what happened and undo it.
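On the idempotency point above: the property you want is that a step which runs twice has the same effect as running once, which is what makes retries safe. A minimal sketch using an idempotency key; the in-memory `completed` store and the example step names are illustrative, and a real platform would persist this state durably.

```python
import time

completed = {}  # in production: durable storage, not an in-memory dict


def run_step_once(key: str, step, *args, retries: int = 3):
    """Run a workflow step at most once per idempotency key,
    retrying transient failures without repeating a completed step."""
    if key in completed:
        return completed[key]  # already done -- a retry is a no-op
    for attempt in range(retries):
        try:
            result = step(*args)
            completed[key] = result
            return result
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff


# Hypothetical usage: safe to call again after a crash mid-workflow.
# run_step_once(f"refund-{ticket_id}", issue_refund, ticket_id)
```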
What to be skeptical of.
- Demos of agents doing impressive one-off tasks (“book me a flight!”). The hard problems are reliability across hundreds of runs, governance over what the agent is allowed to touch, and recoverability when things break.
- “No-code” agentic platforms that hide what’s actually happening. The team that ships these workflows needs to understand them; opacity is a liability when one breaks.
Category 05 — Governance and DLP
What it solves. Visibility and enforcement of what data goes into which AI tools, and what comes out. The “are we leaking customer data into ChatGPT?” question turned into a system.
What to look for.
- Browser-level visibility (what tools are people actually using) and proxy-level enforcement (what data is being submitted).
- Policy templates aligned to your data classification scheme (a minimal sketch follows this list).
- Reporting that ties violations to the team, not the individual — the goal is to fix the policy, not punish people for being curious.
- An honest position on shadow AI. Pretending it doesn’t happen is a failure mode; making it visible and routing it through approved tools is the answer.
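A policy template aligned to a classification scheme can start as a lookup table mapping classification levels to approved tools. A minimal sketch; the labels and tool names are hypothetical, so substitute your own scheme.

```python
# Which classification levels each approved tool may receive.
# Labels and tool names are illustrative -- use your own scheme.
POLICY = {
    "public":       {"general_assistant", "code_copilot", "knowledge_base"},
    "internal":     {"general_assistant", "knowledge_base"},
    "confidential": {"knowledge_base"},  # only tools with enforced permissions
    "restricted":   set(),               # no AI tools, full stop
}


def is_allowed(classification: str, tool: str) -> bool:
    """May data at this classification level go into this tool?"""
    return tool in POLICY.get(classification, set())


print(is_allowed("internal", "code_copilot"))  # False -> block or warn
```

A table this small is also easy to publish in the usage policy itself, which is where most of its value lives.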
What to be skeptical of.
- Tools that frame governance as “blocking AI.” The teams that work with the program owner to enable safe AI usage win; the teams that try to lock it down lose adoption and end up with worse shadow AI than before.
- Vendors that sell governance as a checkbox for compliance without supporting the actual cultural work — training, communication, and feedback loops.
Category 06 — Internal knowledge bases
What it solves. Making your organization’s own documents — wikis, handbooks, runbooks, past projects — searchable and queryable through AI rather than keyword search.
What to look for.
- A serious story on permissions. The KB must respect existing access controls, not become the way confidential documents leak across teams.
- Source attribution on every answer. The team needs to know which document the AI is citing, with a link (see the sketch after this list for the shape to insist on).
- Re-indexing on document change, ideally near-real-time. Stale answers based on outdated handbook entries are worse than no answers.
- The ability to flag low-quality answers — the KB improves over time only if the team can correct it.
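Source attribution is less a feature than a data-shape requirement: every answer should carry the documents it drew from, and answers that cite nothing should be treated as ungrounded. A minimal sketch of that shape; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Citation:
    title: str    # e.g. "Expense policy v4"
    url: str      # deep link the reader can click through to
    snippet: str  # the passage the answer is grounded in


@dataclass
class KBAnswer:
    text: str
    citations: list[Citation] = field(default_factory=list)

    def is_grounded(self) -> bool:
        """Refuse to surface answers that cite nothing."""
        return len(self.citations) > 0
```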
What to be skeptical of.
- “Drop a folder of PDFs and start asking questions” demos. They look magical and break the moment the document set is large or the questions are nuanced.
- Pricing that scales linearly with document count rather than usage. The whole point of a knowledge base is to grow it; pricing that punishes that growth is a vendor mismatch.
How to use this list
For each category, the program owner’s job is to:
- Decide whether the category is in scope for your rollout this year.
- Run a structured evaluation of 2–3 candidates, scored against the “what to look for” criteria above.
- Add the chosen tool (or “none, not yet”) to your AI usage policy (starter template).
- Re-evaluate annually — sooner if the category undergoes a generational shift.
The categories above intentionally don’t include “AI training platforms” because that’s the category 174 itself sells in. The assessment is how we’d suggest you evaluate that one. Same principles apply — what it solves, what to look for, what to be skeptical of.
Where does your org actually stand?
Ten minutes. Three dimensions. A leadership-shareable baseline.