Core Building Blocks & Relationships
When a team says “use AI,” what exactly are we building?
Picture a common workplace moment: a manager wants “AI in customer support,” a data analyst wants “AI for invoice errors,” and IT is worried about data exposure. Everyone is pointing at something real—but they’re pointing at different layers of the same system. One person means a chat interface, another means a model, another means workflow automation, and another means governance.
This matters now because AI is no longer experimental in many organizations—it’s becoming a default suggestion. When you don’t share a mental model of the building blocks, teams make preventable mistakes: they buy the wrong tool, they send sensitive data to the wrong place, they over-trust fluent outputs, or they can’t reproduce results because prompts and data sources change.
So this lesson does one thing: it gives you a map of the core building blocks (what pieces exist) and the relationships between them (how they fit together). With that map, “we should use AI” turns into a concrete design conversation: Which model? Which inputs? What guardrails? What does success look like?
The AI stack: the pieces you can point to on a whiteboard
At beginner level, the most useful way to understand AI is as a stack of components. You’ve already seen why language matters: separating capability vs. application, and anchoring work in clarity, calibration, and control. This stack is the technical mirror of that mindset: it keeps you from confusing “the model” with “the product,” or “a prompt” with “a policy-compliant workflow.”
Here are the core terms we’ll use:
- Application (product/workflow): The business process and UI where AI shows up (support desk, invoice review, onboarding).
- Model: The learned system that maps inputs to outputs (e.g., an LLM for text, or an anomaly model for risk scoring).
- Prompt / instructions: The task specification you give a generative model (inputs, constraints, output format, quality bar).
- Context (grounding material): The information you provide at runtime—ticket text, policy excerpts, order status, invoice fields.
- Tools / functions (optional): Calls the app can make (lookup shipping status, fetch policy doc, create a draft response).
- Guardrails: Constraints and checks (don’t invent order IDs, don’t promise refunds, redact sensitive data, require human approval).
- Evaluation & monitoring: How you test quality, catch drift, and decide if it’s safe to expand automation.
A helpful analogy: if AI is “medicine,” the model is the active ingredient, the application is the treatment plan, the prompt and context are the dosage instructions, and the guardrails/evaluation are the safety protocols. You don’t get safe results by having the ingredient—you get them by designing the whole system around it.
To make the relationships concrete, here’s a simple “what connects to what” map:
| Building block | What it does | What it depends on | Beginner failure mode |
|---|---|---|---|
| Application / workflow | Embeds AI into real work (who uses it, when, and what happens next). | Business rules, stakeholders, system integration, review steps. | Shipping a demo that can’t survive real constraints (compliance, edge cases, approvals). |
| Model | Generates text or produces scores/labels based on patterns. | Training patterns, runtime inputs, and how the app constrains use. | Treating the model like an “oracle” instead of a system that outputs plausible text or probabilistic scores. |
| Prompt + context | Tells the model what job to do and what information to use. | Clear task definition; approved sources; consistent formatting. | Thinking a prompt is “just a question,” leading to inconsistent, untestable outputs. |
| Guardrails + evaluation | Keeps risk proportional to stakes; measures quality over time. | Policies, threat model, test cases, monitoring plan. | Over-trusting fluency; skipping calibration; automating too early and scaling errors. |
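One way to internalize the stack is to treat it as a checklist you fill in before any build begins. The sketch below is illustrative, not a standard API—the class and field names are invented for this lesson:

```python
from dataclasses import dataclass, field

@dataclass
class AIWorkflowSpec:
    """Illustrative checklist of the building blocks for one AI-assisted task."""
    application: str  # where the AI shows up (e.g., "support desk")
    model: str        # the engine (e.g., "LLM draft generator")
    prompt: str       # the task specification given to the model
    context_sources: list[str] = field(default_factory=list)  # approved runtime inputs
    guardrails: list[str] = field(default_factory=list)       # constraints and checks
    evaluation: str = ""  # how success is measured

    def is_complete(self) -> bool:
        # A spec is only reviewable when every building block is named.
        return all([self.application, self.model, self.prompt,
                    self.context_sources, self.guardrails, self.evaluation])

spec = AIWorkflowSpec(
    application="support desk",
    model="LLM draft generator",
    prompt="Draft a reply using only the facts provided.",
    context_sources=["ticket text", "refund policy excerpt"],
    guardrails=["human approval before send"],
    evaluation="policy-accurate and faster to verify than writing from scratch",
)
print(spec.is_complete())  # True: every block is filled in
```

If any field is empty, the conversation isn’t finished—which is exactly the point of the table above.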
Relationships that prevent category errors (and expensive rework)
Capability → application: why naming the job comes first
A reliable AI design starts with naming the capability, not just the area of the business. “Customer support” is an application; “draft a reply,” “summarize a ticket,” and “classify ticket type” are capabilities. This distinction sounds small, but it’s the difference between a testable requirement and a vague aspiration.
When teams skip capability naming, they often choose the wrong approach. For example, if the output you need is a flag/score (invoice anomaly risk), that’s analytical work. If the output you need is draft language (support reply), that’s typically augmentation. Generative AI is strong at drafting and transforming language; it’s not automatically the right engine for detecting anomalies in structured tables. The “AI umbrella” hides these differences, which is how projects end up with shiny demos and weak operational results.
Capability naming also forces clarity about inputs and constraints. “Draft a reply” implies you need: the ticket text, allowed customer facts (plan type, order status), tone guidance, and policy excerpts. “Flag anomalies” implies you need: vendor history, thresholds, and an agreed definition of “anomaly.” In both cases, the capability tells you how to evaluate success and what “wrong” looks like.
The practical best practice is to write one sentence that locks the job down:
- Capability statement: “Given X inputs, produce Y output, under Z constraints, for this decision point in the workflow.”
That one sentence becomes the anchor for prompts, guardrails, testing, and stakeholder alignment, and it directly supports the earlier mindset of clarity.
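Because the capability statement has a fixed shape, it can even be rendered mechanically. This helper is a sketch for this lesson—the function name and field choices are assumptions, not a standard:

```python
def capability_statement(inputs, output, constraints, decision_point):
    """Render the one-sentence anchor: given X, produce Y, under Z, for this decision point."""
    return (f"Given {', '.join(inputs)}, produce {output}, "
            f"under {', '.join(constraints)}, for {decision_point}.")

stmt = capability_statement(
    inputs=["the ticket text", "the refund policy excerpt"],
    output="a draft customer reply",
    constraints=["no invented facts", "no unapproved promises"],
    decision_point="the agent's review step",
)
print(stmt)
```

Filling in each slot forces the team to name inputs, output, constraints, and decision point explicitly—anything left blank is an unresolved design question.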
Generative outputs vs. analytical outputs: why “type of output” decides the design
A second relationship that clears confusion is: models behave differently depending on the output type you want. Generative AI produces plausible language; analytical ML produces scores/labels based on learned patterns. Both can be useful, but they invite different risks and require different controls.
Generative AI’s most common pitfall is fluency masquerading as accuracy. A language model can produce a confident-sounding response that subtly violates policy (“we’ve processed your refund”), invents missing details (fake order numbers), or implies certainty where none exists. That isn’t a rare edge case—it’s normal behavior for systems trained to continue text. The fix isn’t “better vibes” in the prompt; it’s high-quality constraints, grounded context, and a workflow that assumes human verification when stakes are high.
Analytical outputs have a different risk profile. A risk score might be consistently measurable, but it can drift over time when the environment changes (new vendors, new pricing patterns). It can also encode historical biases or errors if the training data reflects past mistakes. The fix is not “make it sound more cautious,” but monitoring, threshold tuning, and a clear calibration choice: do you prefer fewer false positives (precision) or fewer false negatives (recall)?
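The precision/recall trade-off in that calibration choice is simple arithmetic. Here is a minimal sketch with hypothetical counts (the numbers are invented for illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision: how many of our flags were real. Recall: how many real issues we caught."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical month of invoice flagging: 8 true anomalies caught,
# 2 false alarms raised, 4 real anomalies missed.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(round(p, 2), round(r, 2))  # 0.8 0.67
```

Raising the flagging threshold typically trades recall for precision and vice versa, which is why the threshold is a business decision, not just a technical one.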
Here’s a comparison that helps you pick the right mental model quickly:
| Dimension | Generative (LLM drafts) | Analytical (scores/flags/labels) |
|---|---|---|
| Typical output | Text, summaries, rewritten content, structured text (tables/JSON). | Risk score, anomaly flag, category label, ranking. |
| Strength | Language-heavy augmentation: drafting, summarizing, transforming. | Consistent detection and prioritization from structured or semi-structured data. |
| Common pitfall | Hallucination: inventing facts; policy drift; confident tone hides uncertainty. | Data drift; misdefined labels; optimizing the wrong metric for the business cost. |
| Best guardrail | Constrain with approved context, strict “don’t invent” rules, and human review for high-stakes outputs. | Clear anomaly definitions, threshold tuning, ongoing monitoring, and a human escalation path. |
Notice how this ties back to calibration: match trust to risk, and match design to the kind of output you truly need.
Prompt + context + guardrails: how “draft engine” becomes a safe system
A prompt is not a magic phrase—it’s a mini specification. In real workflows, the prompt is also only one piece. The model’s output is shaped by: the prompt, the context you provide, the allowed tools, and the guardrails that constrain what “done” means.
Start with the prompt as a specification. Strong prompts usually include:
- Inputs: What the model should use (ticket text, policy excerpt, order status).
- Output format: Email draft, bullet summary, JSON fields, word limit.
- Constraints: What it must not do (no invented facts, no unapproved promises).
- Quality bar: How you’ll judge it (policy-accurate, easy to verify, respectful tone).
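A prompt built from those four parts can be assembled mechanically rather than improvised each time. This template is a sketch under the assumptions of this lesson—the function and section headings are invented, not a standard prompt format:

```python
def build_prompt(inputs: dict, output_format: str, constraints: list[str], quality_bar: str) -> str:
    """Assemble a prompt from the four spec parts: inputs, output format, constraints, quality bar."""
    input_lines = "\n".join(f"- {name}: {value}" for name, value in inputs.items())
    constraint_lines = "\n".join(f"- {c}" for c in constraints)
    return (f"Use only these inputs:\n{input_lines}\n\n"
            f"Output format: {output_format}\n\n"
            f"Constraints:\n{constraint_lines}\n\n"
            f"Quality bar: {quality_bar}")

prompt = build_prompt(
    inputs={"ticket": "Customer asks about a late order.",
            "policy": "Refunds allowed within 30 days."},
    output_format="Email draft, 120-160 words, friendly tone",
    constraints=["No invented facts", "No unapproved promises"],
    quality_bar="Policy-accurate and easy to verify",
)
print(prompt)
```

Treating the prompt as generated-from-a-spec also makes it testable: you can version the spec, diff changes, and rerun evaluations when any part changes.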
Then add context discipline. Context is where teams accidentally create risk: they paste sensitive data, undefined internal notes, or entire documents without knowing what is allowed. Operationally, “context” should be the minimum necessary information to do the job, ideally from approved sources. This aligns with control: you limit what the model sees and what it’s allowed to say.
Finally, guardrails turn drafts into safe outcomes. Guardrails can be workflow-based (human approval), content-based (must quote policy verbatim), or system-based (block sending if the draft contains unverified claims). The misconception to avoid is thinking guardrails are only about “safety theater.” The real purpose is reliability: guardrails make outputs repeatable, auditable, and aligned with how your organization actually operates.
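A system-based guardrail of the kind described can be as simple as a check that holds a draft when it makes an unbacked claim. The phrase list and the "verified facts" mechanism below are hypothetical examples, not a real policy engine:

```python
# Example policy list: phrases that must be backed by a verified fact before sending.
RESTRICTED_PHRASES = ["refund processed", "refund has been processed"]

def guardrail_check(draft: str, verified_facts: set[str]) -> list[str]:
    """Return reasons to hold the draft for extra review; an empty list means it can go to the agent."""
    issues = []
    lowered = draft.lower()
    for phrase in RESTRICTED_PHRASES:
        if phrase in lowered and "refund confirmed" not in verified_facts:
            issues.append(f"restricted phrase without confirmation: '{phrase}'")
    return issues

issues = guardrail_check("Good news - your refund has been processed!", verified_facts=set())
print(issues)  # the claim is not backed by a verified fact, so the draft is held
```

In a real workflow this check would sit between the model and the send button, feeding flagged drafts into the human-review queue rather than blocking silently.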
[[flowchart-placeholder]]
Two workplace examples, built from the blocks
Example 1: Customer support replies (LLM augmentation with tight constraints)
Start with a concrete capability: draft a customer reply. This is augmentation, not full automation—at least at first—because customer-facing mistakes scale quickly. The application is the ticketing system, and the model is an LLM used to generate a draft email.
Step by step, the building blocks line up:
- Context: Provide the customer’s message, plus only the allowed facts (plan tier, order status from an internal system if permitted), plus an excerpt of the refund policy that the team approves. Keep context minimal so the model doesn’t “guess” from irrelevant noise.
- Prompt as specification: Require a 120–160 word reply, friendly tone, and explicitly forbid unverified claims (“Do not state a refund has been processed unless the ticket includes confirmation.”). Ask for a short “assumptions” line so the agent can spot invented details quickly.
- Guardrails in workflow: The draft cannot be sent automatically; an agent reviews and approves. If the draft includes restricted phrases (like “refund processed”), the system flags it for extra review.
- Evaluation: Success isn’t “sounds good.” It’s “policy-accurate, non-deceptive, and faster to verify than writing from scratch.”
Impact and limitations become clear with this design. Benefits include faster first response time and consistent tone, especially for repetitive ticket types. Limitations include edge cases (angry customers, unusual billing scenarios) and policy changes that require updating the context/prompt. This setup also keeps you honest about calibration: drafts are low-to-medium risk with review; fully automated sending is high risk and should be reserved for only the safest, checkable cases.
Example 2: Invoice anomaly detection (analysis first, then generative explanation second)
Now take a different capability: flag invoices for review. The output is a risk score or anomaly flag—not a paragraph. This is the analysis category of work, so the model choice and evaluation approach shift immediately.
Step by step:
- Define “anomaly” operationally: Is it a duplicate invoice, unusual amount vs. vendor history, mismatched tax rate, or invoice outside the normal cycle? Without this definition, you can’t label data or evaluate performance.
- Inputs: Use structured fields—vendor ID, amount, tax rate, category, date, approval path—and historical baselines. The model outputs a score/flag, and the workflow routes high-risk items to a human reviewer.
- Calibration: Choose thresholds based on business costs. If false positives waste hours, you tune for precision. If false negatives cost money, you tune for recall. Either way, you plan to revisit thresholds as patterns change.
- Where generative fits: An LLM can add value after detection by drafting a short analyst note: “Flagged because amount is 3× vendor median and outside the monthly cycle.” That’s augmentation layered on top of analysis—keeping each model type in the job it’s suited for.
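The "3× vendor median" rule mentioned in the analyst note can itself serve as a first, crude flagging heuristic. This is a sketch only—real systems learn thresholds from labeled data and calibration, not a hard-coded multiplier:

```python
from statistics import median

def flag_invoice(amount: float, vendor_history: list[float], multiplier: float = 3.0) -> bool:
    """Flag when the amount exceeds `multiplier` times the vendor's historical median.

    The 3x-median rule is illustrative; real thresholds come from calibration
    against labeled examples and business costs.
    """
    if not vendor_history:
        return True  # no baseline yet: route to a human by default
    return amount > multiplier * median(vendor_history)

history = [120.0, 135.0, 110.0, 128.0]  # hypothetical invoices for one vendor
print(flag_invoice(400.0, history))  # True: 400 > 3 x 124 (the median)
print(flag_invoice(150.0, history))  # False: within the normal range
```

Note the default behavior for unknown vendors: with no baseline, the safe choice is to escalate to a human, which is the calibration principle applied to missing data.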
Benefits include consistent screening at scale and prioritization for human attention. Limitations include data quality and drift: if vendor behavior changes, anomaly patterns change, and performance can degrade unless monitored. This example also protects you from a common misconception: using a chatbot to “find anomalies” when what you need is a measurable, auditable flagging system.
What to remember when you’re scoping real work
Three takeaways should stay top of mind:
- Build from capability to workflow: name the job (“draft,” “summarize,” “classify,” “flag”) before choosing tools or vendors.
- Match output type to method: generative drafts and analytical scores behave differently, so they need different evaluation and guardrails.
- Prompts don’t replace controls: the prompt is a spec, but reliability comes from the full system—context discipline, guardrails, and calibration to risk.
This sets you up perfectly for Workflows and Beginner Pitfalls [25 minutes].