Why AI portfolios stall: too many “good ideas,” not enough decision-grade choices

A common moment in AI programs is the “backlog explosion.” After a few pilots, every function arrives with plausible use cases: support wants a genAI assistant, finance wants invoice matching, sales wants a copilot, risk wants fraud detection. On paper they all sound valuable, and teams can usually produce a demo quickly. Then resourcing gets messy: the same data engineers are needed everywhere, legal reviews become a bottleneck, and leaders can’t tell which initiatives deserve scarce capacity—or which ones should be stopped before they accumulate hidden risk.

This matters now because the ease of starting AI work creates a portfolio problem: the organisation must choose among initiatives that differ not just in upside, but in feasibility constraints (data access, workflow readiness, integration complexity) and risk exposure (privacy, regulatory, model failure modes). If prioritisation is informal—driven by seniority, excitement, or the “loudest pain”—you get metric theatre: lots of activity, uneven impact, and risk discovered late.

Prioritisation is the bridge between measuring outcomes and actually executing strategy. Once outcomes and guardrails are defined, the next question is operational: Which use cases do we fund, sequence, and govern first—so the portfolio produces repeatable value without unacceptable exposure?

The three-part lens: value, feasibility, and risk (and what each really means)

Prioritisation is the process of ranking AI initiatives so funding, talent, and governance attention go to the best next bets. In AI, “best” is not just the biggest ROI estimate; it’s the best combination of value, feasibility, and risk under real constraints.

Value is the size and credibility of the outcome change you expect. It should trace through an outcome chain: what operational metric changes, what business outcome moves, and what financial or risk impact follows. Value is strongest when it’s measurable with clear baselines and counter-metrics, rather than assumed from model performance or vendor benchmarks.
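
To keep that chain inspectable rather than rhetorical, some teams write it down as structured data. A minimal sketch in Python, previewing the support-assistant example used later in this section; the field names are illustrative assumptions, not a standard:

    from dataclasses import dataclass, field

    @dataclass
    class OutcomeChain:
        """One value claim, traced end to end."""
        intervention: str                # what the AI changes in the workflow
        operational_metric: str          # the first measurable effect
        business_outcome: str            # the outcome leaders actually care about
        financial_or_risk_impact: str    # why it matters to the P&L or risk book
        baseline: str                    # how the pre-intervention level is set
        counter_metrics: list = field(default_factory=list)  # harm/gaming detectors

    support_chain = OutcomeChain(
        intervention="genAI drafts first replies for support agents",
        operational_metric="median time-to-first-response",
        business_outcome="cost per ticket and CSAT",
        financial_or_risk_impact="support cost reduction at stable quality",
        baseline="trailing 90-day averages per queue",
        counter_metrics=["repeat contacts within 7 days", "escalation rate"],
    )

A value claim that cannot fill in every field is not low value; it is not yet decision-grade.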

Feasibility is the likelihood you can deliver that value within a reasonable time and cost. In practice, feasibility is socio-technical: data quality and rights, integration, workflow adoption, ownership, and the ability to monitor and operate the model in production. A high-value use case that needs new data pipelines, a redesigned process, and heavy change management may be less feasible than a moderate-value use case that fits cleanly into an existing workflow.

Risk is the probability and severity of harm—legal, ethical, operational, financial, and reputational—and the cost of controls needed to keep the initiative within appetite. Risk is not just “will the model be wrong?” It includes privacy leakage, policy misstatements (especially in genAI), bias signals, auditability gaps, and failures under drift. A crucial principle: risk can erase value, so it must be evaluated alongside upside, not after the pilot “proves it works.”

A useful analogy is a product launch. Value is the expected market impact, feasibility is whether you can build and distribute reliably, and risk is whether the launch triggers recalls, compliance breaches, or brand damage. AI use cases are product launches embedded inside operations—so prioritisation must be equally multidimensional.

Turning a longlist into a ranked portfolio: a practical scoring model that stays honest

A workable approach is a triage-to-score system: first filter ideas that cannot be governed, then score the remaining ideas using a consistent rubric. The goal is not mathematical certainty; it’s decision integrity—leaders can explain why one initiative is ahead of another using the same language of outcomes, operational reality, and guardrails.

Start with a short triage that eliminates “non-starters.” Typical triage questions are binary: do we have the right to use the data, can we log decisions, do we have an accountable business owner, and can we define measurable outcomes and counter-metrics? If an initiative fails triage, the right move is not to argue about priority; it’s to define what must change (data rights, instrumentation, governance design) before it re-enters the funnel.
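
A minimal sketch of that triage gate as code, mirroring the four binary questions above (the flag names and remediation wording are illustrative):

    from dataclasses import dataclass

    @dataclass
    class Initiative:
        name: str
        has_data_rights: bool          # right to use the data for this purpose
        can_log_decisions: bool        # decisions and inputs are traceable
        has_accountable_owner: bool    # a named business owner, not "the AI team"
        has_measurable_outcomes: bool  # KPI, baseline, and counter-metrics defined

    def triage(initiative: Initiative) -> list[str]:
        """Return what must change before the idea re-enters the funnel;
        an empty list means the initiative may proceed to scoring."""
        failures = []
        if not initiative.has_data_rights:
            failures.append("resolve data rights")
        if not initiative.can_log_decisions:
            failures.append("add decision logging / instrumentation")
        if not initiative.has_accountable_owner:
            failures.append("assign an accountable business owner")
        if not initiative.has_measurable_outcomes:
            failures.append("define KPI, baseline, and counter-metrics")
        return failures

The point of returning remediation items rather than a yes/no is that a failed triage produces a to-do list, not a priority debate.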

Then score initiatives on value, feasibility, and risk in a way that reflects what actually makes AI succeed in production. AI often fails not because the model is impossible, but because adoption and workflow instrumentation are missing. Similarly, portfolios fail when leaders treat risk as a compliance checkbox rather than a first-class outcome with metrics (privacy incidents, policy violations, override rates, audit findings). A scoring model should force these realities into the conversation early, instead of letting them surface during scaling.

Use a simple scale (e.g., 1–5) with anchored definitions so teams don’t inflate scores. Anchors are critical: “5” should mean something observable (baseline exists, comparison method defined, integration path known, monitoring plan clear), not “we feel good about it.”

A decision-friendly rubric (with anchors you can reuse)

Value (outcomes)
  What “high” looks like (score 5): Outcome chain is clear from intervention → operational metric → business outcome; benefits are not double-counted; counter-metrics defined to detect harm/gaming.
  What “low” looks like (score 1): “Cool demo” or accuracy gains with unclear linkage to business outcomes; ROI is mostly assumptions; no guardrails.
  Evidence you should demand: Baseline + comparison method (A/B, before/after, matched cohorts); defined KPI + counter-metrics; agreed unit of analysis.

Feasibility (delivery + adoption)
  What “high” looks like (score 5): Data is accessible with clear rights; workflows are instrumented; integration path is known; ops owner commits to change management; monitoring and on-call ownership planned.
  What “low” looks like (score 1): Data is messy/unknown; needs major process redesign; dependencies unclear; adoption risks ignored; “deploy = done” mindset.
  Evidence you should demand: Data readiness assessment; integration diagram; RACI for ops + model ownership; rollout plan with adoption metrics (acceptance, rework, escalations).

Risk (exposure + control cost)
  What “high” looks like (score 5): Risk tier is known; controls are designed upfront (logging, human oversight, privacy safeguards); risk metrics are defined and monitored; auditability is achievable.
  What “low” looks like (score 1): Risk unknown or postponed; reliance on “we’ll review later”; weak traceability; high likelihood of policy/regulatory friction or customer harm.
  Evidence you should demand: Risk assessment; control design (human-in-the-loop, retrieval constraints, thresholds); monitoring plan for drift, incidents, fairness/bias indicators where applicable.
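
One way to keep the rubric honest is to make its anchors executable: a claimed 5 only stands when its observable evidence exists. A sketch, assuming purely illustrative anchor texts and weights that a real leadership team would negotiate; note that in this rubric a risk score of 5 means exposure is understood and controlled, so higher is better on every dimension:

    # Illustrative anchors and weights; a real rubric would be negotiated,
    # versioned, and owned by the portfolio governance forum.
    ANCHORS = {
        ("value", 5): "baseline, comparison method, and counter-metrics all defined",
        ("feasibility", 5): "data rights clear, integration path known, ops owner committed",
        ("risk", 5): "risk tier known, controls and monitoring designed upfront",
    }

    def anchored(dimension: str, claimed: int, evidence_present: bool) -> int:
        """Cap a claimed 5 at 3 when the anchor's observable evidence is missing."""
        if claimed == 5 and not evidence_present:
            return 3
        return claimed

    def composite(value: int, feasibility: int, risk: int,
                  weights: tuple = (0.40, 0.35, 0.25)) -> float:
        """Weighted sum of 1-5 scores; higher is better on every dimension."""
        wv, wf, wr = weights
        return wv * value + wf * feasibility + wr * risk

    # A "high value" claim without a baseline gets capped before ranking:
    print(composite(anchored("value", 5, evidence_present=False), 4, 3))  # 3.35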

A common misconception is that scoring should produce a “correct” list. In reality, it produces a negotiated and auditable choice: leadership can justify trade-offs (“we’re choosing moderate value but high feasibility to build momentum”) and can revisit scores as evidence improves. The scoring model is a governance tool as much as a planning tool.

Typical pitfalls—and how to avoid them

“Value-first” portfolios often overfund initiatives with impressive upside narratives but weak measurement integrity. If baselines and counter-metrics aren’t defined, teams can unintentionally report activity as impact, or claim ROI via proxy metrics without validating the proxy-to-outcome relationship. A practical fix is to require that any “high value” claim includes a clear outcome chain and an explicit measurement approach.

“Feasibility-blind” portfolios underestimate the socio-technical reality of AI. A use case can be technically easy and operationally hard: low adoption, poor incentives, limited training, or workflow friction can kill benefits even if the model performs well. The fix is to treat adoption and instrumentation metrics as part of feasibility—because without them, you can’t prove causality or manage degradation.

“Risk-late” portfolios defer governance until after pilots. In AI, that usually backfires because many controls must exist at design time: data rights, logging, explainability and traceability (where required), and human oversight for high-impact decisions. The fix is to score risk as a first-class dimension and to include the cost of controls in feasibility—because controls affect delivery time and operational burden.

[[flowchart-placeholder]]

From scores to sequencing: choosing the right first wave (and avoiding portfolio whiplash)

Once initiatives are scored, the next step is sequencing—which use cases go first, which run in parallel, and which wait. This is where many organisations stumble: they pick “top 5” by score and start all of them, ignoring shared constraints like data engineering capacity, legal review cycles, and dependency bottlenecks. A portfolio can be “high scoring” and still be impossible to execute if everything depends on the same scarce resources.

A practical sequencing principle is to create a balanced first wave (a sketch of one selection heuristic follows the list):

  • A few initiatives with high feasibility to prove repeatable delivery and measurement in production.

  • At least one initiative with high value that is hard but strategically important, so the portfolio doesn’t become a collection of small wins.

  • Nothing with unbounded risk unless controls and monitoring are ready from day one.
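
A sketch of one such balancing heuristic; the capacity, score fields, and example initiatives are all illustrative assumptions:

    def pick_first_wave(initiatives: list[dict], capacity: int = 3) -> list[dict]:
        """Select a balanced first wave from scored initiatives (1-5 scores)."""
        # Nothing with unbounded risk: require at least a designed control plan.
        eligible = [i for i in initiatives if i["risk"] >= 3]
        # Fill most slots with high-feasibility items to prove repeatable delivery.
        by_feasibility = sorted(eligible, key=lambda i: i["feasibility"], reverse=True)
        wave = by_feasibility[: capacity - 1]
        # Reserve the last slot for the highest-value bet not already chosen.
        remaining = [i for i in eligible if i not in wave]
        if remaining:
            wave.append(max(remaining, key=lambda i: i["value"]))
        return wave

    initiatives = [
        {"name": "support assistant", "value": 3, "feasibility": 5, "risk": 4},
        {"name": "credit decisioning", "value": 5, "feasibility": 3, "risk": 3},
        {"name": "fraud detection", "value": 4, "feasibility": 2, "risk": 2},
    ]
    print([i["name"] for i in pick_first_wave(initiatives)])
    # ['support assistant', 'credit decisioning'] -- fraud detection is held
    # back by its risk score until controls and monitoring are designed.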

This balance also supports organisational learning. Early initiatives should improve your ability to execute the next ones: reusable data pipelines, monitoring patterns (drift, latency, incident logging), governance routines, and change management muscle. If the first wave is all bespoke builds, the program doesn’t compound; it just accumulates.

Another sequencing idea: treat feasibility and risk as “gates,” not just scores. For example, a use case may rank high on value but be gated until data rights are resolved or until workflow logging is in place. This prevents the common failure mode where a team burns months building a model only to find it cannot be deployed with acceptable governance.
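
A minimal sketch of gates as hard preconditions checked outside the score (the gate names are illustrative):

    GATES = ["data_rights_resolved", "workflow_logging_live", "controls_designed"]

    def unmet_gates(flags: dict[str, bool]) -> list[str]:
        """Names of unmet gates. Any unmet gate blocks sequencing outright;
        no value score can compensate for a missing precondition."""
        return [gate for gate in GATES if not flags.get(gate, False)]

    credit_flags = {"data_rights_resolved": True, "workflow_logging_live": False,
                    "controls_designed": True}
    print(unmet_gates(credit_flags))  # ['workflow_logging_live'] -> build the
    # logging first; do not start model development against this use case.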

A simple way to communicate choices to executives

Leaders rarely need the full spreadsheet; they need clarity on the trade-off story. A compact narrative per initiative usually works:

  • What outcome changes (and how we’ll measure it): KPI + counter-metrics, baseline, comparison method.

  • What must be true to deliver: critical dependencies—data access, integration, process owner, adoption plan.

  • What could go wrong (and how we contain it): key risks, control design, monitoring signals, go/no-go thresholds.

This keeps prioritisation tied to decision-grade metrics rather than model-level excitement. It also makes “stop” decisions easier: if adoption is low, rework is high, or risk signals breach thresholds, the portfolio has a rational basis to pause or redesign—without politics.

Applied example 1: GenAI customer support—fast responses without quality or compliance collapse

Consider a genAI assistant for customer support that drafts replies and retrieves policy information. On value, the outcome chain is usually attractive: reduced drafting time should reduce time-to-first-response and potentially lower cost per ticket. But value estimates become credible only when paired with counter-metrics: repeat contacts within 7 days, escalation rate, and CSAT/NPS movement. If the assistant increases speed but also increases rework, the portfolio should treat that as value leakage, not “phase 2 refinement.”
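
A back-of-the-envelope check makes value leakage concrete; every number below is invented for illustration:

    def net_ticket_minutes(tickets: int, minutes_saved_per_ticket: float,
                           baseline_repeat_rate: float, new_repeat_rate: float,
                           minutes_per_repeat: float) -> float:
        """Drafting minutes saved, minus agent minutes lost to extra repeat
        contacts. A negative result means the speed gain is value leakage."""
        gross = tickets * minutes_saved_per_ticket
        extra_repeats = tickets * max(0.0, new_repeat_rate - baseline_repeat_rate)
        return gross - extra_repeats * minutes_per_repeat

    # 10,000 tickets, 2 minutes saved on each draft, but repeat contacts rise
    # from 8% to 15% and each repeat costs ~35 agent-minutes to resolve.
    print(net_ticket_minutes(10_000, 2.0, 0.08, 0.15, 35.0))  # -4500.0

In that illustrative scenario the assistant is faster per ticket yet net negative, which is exactly why the counter-metrics belong in the value score.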

Feasibility is often misunderstood here. The model may be easy to prototype, but production readiness depends on workflow instrumentation (acceptance rate, edit distance, escalation triggers), integration into the ticketing tool, and an operating model for monitoring. For genAI specifically, feasibility includes designing retrieval and citation behavior so agents can trust outputs, and setting latency and outage thresholds so the tool doesn’t slow work during peak volumes. If these elements are missing, the initiative’s feasibility score should drop—even if the demo is excellent.

Risk is not hypothetical in this use case. Common exposures include privacy leakage (customer PII in prompts), policy misstatements, and inconsistent disclosures. Risk scoring should reflect the control plan: redaction, access controls, logging, sampled hallucination checks, and rules for when the assistant must refuse or escalate. A practical portfolio decision might be: prioritize this use case if the control design is ready and the counter-metrics are in place; otherwise treat it as gated until governance instrumentation exists, because reputational harm can outweigh any handle-time win.

Applied example 2: ML credit decisioning—growth that stays inside appetite

Now consider an ML model for credit decisioning. Value can be large: improved approval efficiency, increased booked volume, or better risk-adjusted returns. But the outcome chain is longer and more delayed than most teams expect. Early “wins” like higher approval rate can be misleading if they simply shift thresholds and buy growth with future defaults. Value is decision-grade only when paired with lagging outcomes (loss rate, delinquency) and explicitly labeled leading indicators (30/60 day delinquency proxies) during early rollout.
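
A simple, purely illustrative calculation shows why threshold-driven growth can be value-negative once expected losses are counted:

    def marginal_cohort_profit(extra_loans: int, margin_per_loan: float,
                               cohort_default_rate: float,
                               loss_given_default: float) -> float:
        """Margin earned on marginally approved loans minus their expected
        credit losses. The marginal cohort typically defaults at a higher
        rate than the book average, which is what early 'wins' can hide."""
        revenue = extra_loans * margin_per_loan
        expected_loss = extra_loans * cohort_default_rate * loss_given_default
        return revenue - expected_loss

    # 1,000 extra approvals at $300 margin each looks like +$300k, but a 6%
    # marginal default rate with $8,000 loss given default costs $480k.
    print(marginal_cohort_profit(1_000, 300.0, 0.06, 8_000.0))  # -180000.0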

Feasibility here is tightly coupled to governance. A credit model needs reliable data pipelines, drift monitoring, calibration checks, and traceability of inputs and decisions. Operational metrics matter too: time-to-decision, exception handling volume, and override rates (how often underwriters reject the recommendation). If override rates are high, the portfolio should treat that as a feasibility signal: either the model is misaligned with policy, explainability is insufficient, or incentives and trust aren’t in place.

Risk is central, not peripheral. Depending on jurisdiction and policy, fairness and adverse-action consistency may be mandatory. Even where not legally required, internal risk appetite often demands monitoring for disparate outcomes across relevant segments, plus strong audit readiness: versioning, decision logs, and clear governance ownership. A sound prioritisation outcome might be: this is high value but medium feasibility and high risk-control cost, so it should be sequenced after the organisation has proven it can operate monitoring and audit trails in production—or it should be launched in a constrained segment with strict go/no-go thresholds.
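
A minimal sketch of such a go/no-go check; the metric names and thresholds are illustrative assumptions:

    def go_no_go(metrics: dict[str, float], limits: dict[str, float]) -> str:
        """'pause' when any monitored signal breaches its agreed threshold.
        All signals here are defined so that lower is better."""
        breaches = [name for name, value in metrics.items() if value > limits[name]]
        return "pause: " + ", ".join(breaches) if breaches else "continue"

    week6 = {"override_rate": 0.22, "delinquency_30d_proxy": 0.031, "fairness_gap": 0.04}
    limits = {"override_rate": 0.15, "delinquency_30d_proxy": 0.035, "fairness_gap": 0.05}
    print(go_no_go(week6, limits))  # pause: override_rate

In this illustrative week-six readout the delinquency proxy and fairness gap are within appetite, but the override rate breaches its threshold, so the rollout pauses to investigate alignment and trust before the segment expands.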

What strong prioritisation looks like in practice

Prioritisation works when it is evidence-led and repeatable. Value is not a story; it’s an outcome chain with baselines, comparison methods, and counter-metrics. Feasibility is not “we can build a model”; it includes adoption, integration, monitoring, and ownership. Risk is not “compliance later”; it’s exposure plus control design that can be operated day to day.

If you keep one principle, make it this: the best AI portfolio choices are the ones you can measure, deliver, and govern without surprises. That is how you scale impact without scaling chaos.

Next, we'll build on this by exploring Roadmap Design: Horizons and Dependencies [30 minutes].
