AI Product Manager Playbook
Author: Ryan Nadel
Table of Contents
- Introduction
- Core Principles of AI-Driven Products
- Defining “Good” in AI Products
- Identifying & Validating AI Use Cases
- Designing & Building the AI System
- Evaluating AI Outputs
- Launching & Iterating on AI Features
- Ensuring Product-Market Fit & Competitive Edge
- Collaborating with Stakeholders
- Playbook Checklist & Summary
1. Introduction
How to use this playbook
Read the introduction, review the table comparing traditional PM to AI PM and the spec template, and then use the rest as a reference. Or just read it top to bottom.
Why AI Product Management Is Different
Playbook Tip: Lead with output quality. In AI, the output is the “product.” Ensuring it meets user needs and quality benchmarks is the foundation upon which great UX can then be layered.
AI product management differs from traditional software in three critical ways:
- Probabilistic Outputs: AI generates variable results influenced by training data, prompts, and real-world usage.
- Quality First: If model outputs are poor, no UI/UX polish can compensate.
- Rapid Evolution: AI frameworks, best practices, and models change quickly, requiring constant adaptation.
Traditional PM vs. AI PM
| Dimension | Traditional Product Management | AI Product Management |
| --- | --- | --- |
| Definition of “Good” | Features are defined by a set of functional requirements and deterministic logic. If the feature meets specs, it’s “good.” | Quality is probabilistic; “good” is defined by metrics like accuracy, relevance, clarity, or user satisfaction. Continuous measurement and clear criteria (golden sets, test sets) are essential. |
| Spec & Requirements | Specifications center on predefined features, acceptance criteria, and deterministic logic. Requirements are mostly about how the system should behave under various conditions. | Specs must explicitly define what good looks like through sample prompts, golden sets, and evaluation metrics. AI PMs must provide annotated examples, success benchmarks, and clear criteria for acceptable vs. unacceptable outputs. |
| Empirical Mindset | Validation relies on predefined use cases, acceptance criteria, and manual QA. | Demands a data-driven, experimental approach. Product teams must continuously test, measure output quality, and refine based on real-world feedback and metrics. |
| Core Focus | The UI/UX and workflow design often take precedence. If the feature’s logic is correct, a polished experience is enough. | AI output quality is paramount, overshadowing UI design. A subpar model output can negate even the best-designed interface. |
| Feature Crew Disciplines | Primary collaboration: Product Managers, Engineers, UX Designers, and Copywriters. | Deep collaboration is needed with applied research (for AI model development, prompt engineering, data pipelines) and technical writers (to craft prompts, refine model responses), in addition to classic disciplines (UX, copy, eng). |
| Data Requirements | Mostly static requirements and configurations; data typically is for analytics or minimal business logic. | Robust, high-quality datasets drive output evaluation and improvement. |
| Iteration | Iteration is usually tied to feature roadmaps and version releases; updates are less frequent once feature logic stabilizes. | An ongoing cycle of prompt tuning, model retraining, and evaluation. AI features often see continuous updates as the model and data evolve. |
| Evaluation & Testing | Test cases and QA checklists ensure deterministic outcomes match the specification. | Golden sets, automated metrics, LLM-as-judge pipelines, and human reviews. Success is assessed against empirical benchmarks and user feedback loops. |
| Stakeholder Collaboration | Product, marketing, and user research typically align on messaging once core feature functionality is locked. | Tight cross-functional alignment is critical. Marketing must understand AI’s capabilities and limits; user research must inform ongoing prompt/model refinements. |
| Risk of Failure | Bugs or mismatched features can lead to user frustration, but issues are often binary and more predictable. | AI outputs can fail in subtle ways—incorrect facts, biased or confusing responses. Failures may be less predictable and require robust risk mitigation (e.g., human-in-the-loop evaluations). |
| User Expectations | Consistent functionality once a feature “works.” | Variable output quality; must manage expectations and clarify limitations. |
| Safety & RAI | Privacy & security requirements focus on data protection, regulatory compliance, and standard codes of conduct. | Goes beyond privacy/security to include algorithmic bias detection, content moderation, ethical usage guidelines, and frameworks for responsible AI (e.g., fairness, transparency, governance). |
AI Product Spec Template
Below is a suggested spec template for AI products, drawing from the key ideas covered throughout the playbook. This template outlines the sections and types of information an AI Product Manager should include when specifying a new AI feature or product.
Do a spec review with AI. Use this prompt prior to reviewing a spec with the team; it can be used for all specs, not just AI projects.
1. Overview & Goals
- Product/Feature Name: Clearly label the feature or initiative.
- Objective: Describe the problem you aim to solve. Specify why it’s important for users and the business.
- AI-Specific Rationale: Explain why AI (vs. a simpler approach) is critical to success (e.g., complexity, ambiguity, scalability).
2. User Needs & Use Cases
- Target Users: Identify the primary user segments or personas (e.g., business analysts, end consumers).
- Key User Scenarios: Provide 2–3 representative user stories or journeys highlighting the tasks the AI will address.
- Pain Points & Expected AI Contributions: Articulate where AI’s capabilities (pattern recognition, language generation, etc.) will alleviate user frustrations or open new opportunities.
3. “Definition of Good” & Quality Goals
- Quality Metrics: Outline the specific metrics you’ll track (accuracy, relevance, clarity, coverage, etc.).
- Golden Set & Benchmarks: Describe the curated input-output examples you’ll use to measure performance. List links to golden set artifacts if available.
- Success Criteria: Define clear thresholds for launch-readiness (e.g., 80% accuracy on golden set, under 2-second latency, etc.).
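For illustration only (not part of the template itself), success criteria like these can be encoded so launch readiness is checked mechanically rather than debated. A minimal sketch, assuming hypothetical metric names and placeholder thresholds:

```python
# Hypothetical launch-readiness gate; metric names and thresholds are placeholders,
# not prescribed values.
LAUNCH_THRESHOLDS = {
    "golden_set_accuracy": 0.80,   # at least 80% correct on the golden set
    "p95_latency_seconds": 2.0,    # 95th-percentile latency under 2 seconds
    "apology_rate": 0.05,          # "can't help" responses in under 5% of cases
}

def is_launch_ready(measured: dict) -> bool:
    """Return True only if every measured metric meets its threshold."""
    return (
        measured["golden_set_accuracy"] >= LAUNCH_THRESHOLDS["golden_set_accuracy"]
        and measured["p95_latency_seconds"] <= LAUNCH_THRESHOLDS["p95_latency_seconds"]
        and measured["apology_rate"] <= LAUNCH_THRESHOLDS["apology_rate"]
    )

print(is_launch_ready({"golden_set_accuracy": 0.83, "p95_latency_seconds": 1.7, "apology_rate": 0.03}))  # True
```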
4. ‘Science’ Approach
- Data Requirements:
- Types, sources, and volume of data needed for training or prompting.
- Plans for labeling, cleaning, or augmenting data if necessary.
- Model Selection:
- Whether you’ll use Cloud, Smaller Language Models (SLM), or Local deployments.
- Rationale for your choice (cost, latency, compliance, scalability).
- Prompting Strategy:
- Single-shot, few-shot, or chain-of-thought.
- Example prompts or instructions for the model.
- Evaluation Mechanisms:
- How you’ll test and compare outputs (e.g., side-by-side comparisons, LLM-as-judge, human-in-the-loop).
- Frequency of model retraining or prompt updates.
5. User Experience & Interface
- Interaction Flow:
- Overview of how the user initiates the AI feature (UI elements, voice commands, API calls, etc.).
- Steps or states the user sees from start to finish.
- Handling Variability:
- Expected or potential error states, disclaimers, or fallback options if the AI’s output is uncertain.
- How users can revise or refine AI outputs (e.g., re-prompt, correct suggestions, etc.).
- Design/Copy Guidelines:
- Any specific tone, style, or brand language requirements.
- Accessibility considerations or necessary disclaimers.
6. Responsible AI & Risk Mitigation
- Bias & Safety Checks:
- Content filters, detection of harmful or disallowed outputs.
- Plans for reviewing flagged outputs and iterative improvements.
- Ethical/Regulatory Requirements:
- Industry-specific compliance concerns (e.g., healthcare, finance).
- Steps to ensure transparency, data privacy, and user trust.
7. Launch & Iteration Plan
- Beta Testing:
- Criteria for selecting pilot/beta users.
- Metrics to watch (adoption, error rates, user satisfaction).
- Feedback Loops:
- How user input will be collected (surveys, in-product prompts, support channels).
- Cadence for integrating feedback into the model or prompts.
- Release Phases:
- Target timeline for beta, GA (general availability), and subsequent iterations.
8. Measurement & Post-Launch Monitoring
- KPIs & Ongoing Metrics:
- Usage frequency, satisfaction scores, error/apology rates, etc.
- Any domain-specific success measures (e.g., conversion rates, time-on-task improvements).
- Model Drift & Maintenance:
- Plan for continuous evaluation to detect performance dips.
- Strategy for re-training or re-tuning the model as data or user behavior changes.
9. Appendix / Supporting Documentation
- Golden Set Examples: Links or references to the full set of curated inputs/outputs.
- Technical References:
- Detailed architecture diagrams, data schema.
- Any relevant API docs or integration guides.
- Changelog: For tracking spec updates, pilot learnings, or model version releases.
2. Core Principles of AI-Driven Products
Playbook Tip: Unlike traditional software development, where you can rely on predefined logic and deterministic outcomes, AI is probabilistic. The model’s outputs can vary significantly based on context, prompt design, and real-world usage scenarios.
- Empirical Mindset
- Define Quality: Articulate clear output standards (accuracy, clarity, brevity).
- Measure Continuously: Track performance using metrics, user feedback, and test sets.
- User-Centric Approach
- Real-World Value: Solve genuine user pain points.
- Iterative Improvement: Launch, gather data, refine.
- Appropriate Model Selection
- Right Tool for the Task: Advanced models (e.g., GPT-4) can be powerful but expensive.
- Optimize for Scale: Factor in cost, latency, and feasibility before widespread deployment.
- Responsible AI
- Transparency & Accountability: Explain model limitations and have human oversight for errors.
- Safety Nets: Employ guardrails like content filters to prevent harmful or biased outputs.
- Continuous Governance: Regularly monitor and update models based on evolving ethical standards.
3. Defining “Good” in AI Products
Recognize Probabilistic Outputs
AI output quality varies by data, prompts, and context. Success depends on explicit quality definitions and performance thresholds.
Start with User Intent
- Core Task: Identify what the user wants—accuracy, creativity, brevity, etc.
- Metrics: Translate user goals into quantifiable measures (e.g., factual correctness, readability).
Use a “Golden Set”
- Curated Exemplars: Collect ideal inputs/outputs.
- Annotate “Good”: Explain why each output is high quality.
- Reference Constantly: Compare new model outputs to the golden set during development.
Establish Clear Evaluation Criteria
- Objective Measures: Accuracy, coverage, precision/recall.
- Subjective Measures: Readability, helpfulness, creativity.
- User-Centric Metrics: Align to real user stories and business goals (time saved, reduced errors).
Quantify with Evaluation Methods
- Human Evaluation: Ratings from general users or domain experts.
- Automated Metrics: Large language model (LLM)-as-judge pipelines for bulk scoring.
- Side-by-Side Comparisons: Compare two outputs directly to see which is better.
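If it helps to make the objective/subjective split concrete, a rating record plus a weighted score is one simple way to combine them. A minimal sketch; the dimensions, scale, and weights are illustrative assumptions, not a prescribed rubric:

```python
from dataclasses import dataclass

@dataclass
class Rating:
    """One human (or LLM-judge) rating of a single model output, 1-5 per dimension."""
    accuracy: int      # objective: factual correctness
    relevance: int     # objective: addresses the user's task
    clarity: int       # subjective: readable, concise
    helpfulness: int   # subjective: would the user act on it?

# Illustrative weights; tune them to your own definition of "good".
WEIGHTS = {"accuracy": 0.4, "relevance": 0.3, "clarity": 0.15, "helpfulness": 0.15}

def quality_score(r: Rating) -> float:
    """Weighted average of the dimensions, normalized to 0-1."""
    raw = sum(WEIGHTS[dim] * getattr(r, dim) for dim in WEIGHTS)
    return raw / 5.0

print(quality_score(Rating(accuracy=5, relevance=4, clarity=4, helpfulness=3)))  # 0.85
```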
Iterate, Iterate, Iterate
- Test After Every Change: Each model update or prompt tweak should be re-evaluated.
- Open Feedback Loops: Continuously gather feedback and usage data to confirm that quality definitions remain valid.
4. Identifying & Validating AI Use Cases
4.1 Pattern Matching User Needs to AI Capabilities
Finding the right use cases for AI begins with pattern-matching what your users need to what AI can do well. AI excels at tasks involving large-scale data processing, pattern recognition, language understanding/generation, and predictions or classifications in complex, variable scenarios.
- Identify Repetitive or Complex Tasks
- Pinpoint activities or workflows that demand significant manual effort, domain expertise, or repeated decision-making.
- Look for tasks where rules-based systems become unwieldy because of exceptions or ambiguities—situations where AI’s flexibility can outperform rigid logic.
- Find a Natural Fit for AI Strengths
- AI is valuable when it can provide better, faster, or cheaper solutions than existing methods.
- Examples: Generating summaries from lengthy text, classifying or labeling large volumes of data, making personalized recommendations.
- Hone Your Intuitions
- You need to use AI for ‘real work’ to hone your intuitions for where it excels and where it struggles. Push existing systems and integrations to the limit.
- Experiment with the latest models, apps, platforms, tools and features. Stay on top of the latest industry trends and think deeply about how the landscape is evolving.
- Balance Risk and Reward
- If the problem is critical (e.g., high-stakes decisions) or heavily regulated, ensure you can incorporate human oversight.
- For lower-stakes tasks, AI may still deliver cost or efficiency benefits if it can reduce manual effort.
- Look for Data-Driven Opportunities
- Assess whether you have sufficient data to train or prompt a model effectively.
- Check data integrity and diversity—clean, representative data is essential for reliable outputs.
Tip: Conduct brainstorming workshops with cross-functional teams (product, engineering, support, domain experts) to identify repetitive tasks or bottlenecks that align with AI’s strengths.
4.2 Evaluate for AI-Fit
- Check Data Availability & Quality
- AI typically requires robust datasets. Ensure you have enough relevant examples or labeled data.
- Determine Complexity
- Is a machine learning model truly necessary, or could a simpler rule-based approach solve 80% of the problem at lower cost/complexity?
- Consider Technical Feasibility & Resource Constraints
- Do you have the infrastructure, talent, and budget to build, deploy, and maintain an AI feature?
4.3 Estimate Impact & Feasibility
- Cost-Benefit Analysis
- Compare potential benefits (time saved, increased accuracy, scalability) to the resources and overhead needed for AI (compute, licensing, engineering, etc.).
- Rapid Prototypes
- Build small proof-of-concepts to gauge early results before committing to a full project.
Playbook Tip: Start with a small, well-defined use case to validate your hypothesis that AI meets user needs better than a simpler approach. Once proven, expand to more complex scenarios.
5. Designing & Building the AI System
- Define Prompting Strategy
- Single-Shot vs. Chain-of-Thought: Some tasks benefit from step-by-step instructions.
- Example Prompts: Provide structured examples to guide the model (few-shot learning); see the prompt sketch at the end of this section.
- Prepare Your Training Data
- Quality > Quantity: Poorly labeled data leads to unreliable performance.
- Diverse Use Cases: Include typical scenarios and edge cases for robustness.
- Model Type: Cloud vs. SLM vs. Local
- Cloud
- Pros: Easy updates, virtually limitless scalability, and integrated cost management (pay-as-you-go).
- Cons: Potential latency, dependency on internet connectivity, ongoing subscription costs.
- SLM (Smaller or Specialized Language Model)
- Pros: Tailored for specific tasks or domains, lighter resource requirements, faster inference times on limited hardware.
- Cons: May lack the broader capabilities of larger models and require careful tuning and domain-specific training data.
- Local
- Pros: Best for offline scenarios or strict compliance needs; offers full control over data and infrastructure.
- Cons: Hardware constraints can be significant, especially for large models; updates and maintenance can be more complex.
Playbook Tip: Remain flexible. The best model for a pilot may not be optimal for large-scale deployment.
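As a concrete illustration of the prompting strategies above, here is a minimal few-shot prompt sketch in the chat-message format that most hosted LLM APIs accept. The task, example tickets, and wording are assumptions for illustration, not a recommended prompt:

```python
# A minimal few-shot prompt for a summarization task, expressed as chat messages.
# The system instruction plus two worked examples guide the model's style and length.
FEW_SHOT_MESSAGES = [
    {"role": "system", "content": "You summarize support tickets in one neutral sentence."},
    # Example 1
    {"role": "user", "content": "Ticket: The export button does nothing when I click it on Safari."},
    {"role": "assistant", "content": "Customer reports the export button is unresponsive in Safari."},
    # Example 2
    {"role": "user", "content": "Ticket: I was charged twice for my March invoice, please refund one."},
    {"role": "assistant", "content": "Customer requests a refund for a duplicate March invoice charge."},
    # The real input goes last; the model should follow the pattern established above.
    {"role": "user", "content": "Ticket: The mobile app logs me out every time I switch networks."},
]
```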
6. Evaluating AI Outputs
6.1 Golden Set
Purpose A golden set is a curated collection of input-output pairs that serves as the baseline for measuring model accuracy and quality. It represents ideal or approved responses across diverse scenarios your AI should handle.
Key Steps to Building a Good Golden Set
- Collect Real-World Inputs
- Gather inputs from actual user queries, support tickets, logs, or interviews.
- Include both common scenarios and edge cases (e.g., rare inputs, ambiguous questions).
- Create “Ideal” Outputs
- For each input, define a correct or high-quality output. This may be best-in-class content produced by domain experts or existing high-performing systems.
- Provide annotations explaining why each output is correct or exemplary (e.g., factual accuracy, tone, style).
- Diversity & Coverage
- Ensure your golden set covers a wide range of topics, user intents, and complexities.
- Include corner cases (e.g., incomplete data, special formatting requirements) to ensure robust testing.
- Maintain & Update
- Continually refine your golden set as user behavior evolves or product scope changes.
- Add new input-output pairs when you encounter novel use cases or edge cases.
Tip: Aim for a balanced mix of straightforward and challenging examples. Too many easy examples can give a false sense of model performance; too many niche examples can skew results.
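To make this concrete, many teams keep the golden set in a simple, versionable format such as JSONL. Below is a minimal sketch of one entry and a loader; the field names are assumptions for illustration, not a standard schema:

```python
import json

# One golden-set entry per line in a hypothetical golden_set.jsonl file.
EXAMPLE_ENTRY = {
    "id": "gs-0042",
    "input": "Summarize this thread and list any action items.",
    "ideal_output": "The team agreed to ship the beta Friday. Action items: Dana updates docs; Lee files the rollout ticket.",
    "annotation": "Good because it is grounded in the thread, under 40 words, and separates summary from action items.",
    "tags": ["summarization", "action-items", "edge-case:multi-speaker"],
}

def load_golden_set(path: str) -> list:
    """Load all entries from a JSONL file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```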
6.2 LLM as Judge
Why Use an LLM as Judge? Manually reviewing AI outputs at scale can be time-consuming. Using a large language model to score or rank generated outputs against your golden set automates much of this evaluation, enabling rapid iteration.
How It Works
- Auto-Comparison
- The “judge” LLM is given the input and the candidate output (from your AI) along with the golden reference output.
- It then assesses how well the candidate matches or improves upon the golden reference, often producing a similarity or quality score.
- Scoring & Ranking
- The LLM as judge can rank multiple candidate outputs (e.g., from different prompt variations or model versions) to identify which is best.
- It may also provide a textual explanation or “reasoning” for why it picked one output over another.
- Speed & Scalability
- This method is efficient for large-scale testing, particularly when you have thousands of examples to evaluate regularly.
- It reduces the volume of manual reviews needed, allowing human evaluators to focus on the most uncertain or critical cases.
- Human Review
- Always pair automated LLM judgments with periodic human audits.
- People can catch nuances—like domain-specific accuracy issues or brand/style mismatches—that the LLM judge may miss.
Best Practices for Using an LLM as Judge
- Prompt Engineering: Provide the judge model with clear instructions and context (e.g., “Rate correctness on a 1–5 scale based on factual accuracy, relevance, clarity, and completeness.”).
- Evaluation Protocol: Use a consistent format for each evaluation request (input, candidate output, reference output, scoring instructions).
- Periodic Calibration: Compare the judge’s scores to expert human scores periodically. If the judge’s decisions drift or become inconsistent, adjust its prompt or re-train your judge model.
- Bias Awareness: LLMs may exhibit biases or overlook domain-specific criteria. Use domain experts for final sign-off in critical applications.
Tip: Consider rotating or experimenting with multiple LLM judge models to avoid overfitting to a single model’s biases.
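A minimal sketch of such a judge pipeline is below. `call_llm` and `generate` are stand-ins for whichever model API and feature-under-test you use (they are not real library calls), and the rubric wording is illustrative:

```python
import json

JUDGE_INSTRUCTIONS = (
    "You are grading a candidate answer against a reference answer.\n"
    "Rate the candidate from 1 (poor) to 5 (excellent) considering factual accuracy, "
    "relevance, clarity, and completeness. Reply with JSON only: "
    '{"score": <1-5>, "reasoning": "<one sentence>"}'
)

def judge_output(call_llm, user_input: str, candidate: str, reference: str) -> dict:
    """Score one candidate output against its golden reference via a judge model.

    `call_llm` is assumed to take a prompt string and return the judge model's text reply.
    """
    prompt = (
        f"{JUDGE_INSTRUCTIONS}\n\n"
        f"User input:\n{user_input}\n\n"
        f"Reference (golden) output:\n{reference}\n\n"
        f"Candidate output:\n{candidate}\n"
    )
    return json.loads(call_llm(prompt))

def judge_golden_set(call_llm, golden_set: list, generate) -> float:
    """Run the judge over every golden-set entry and return the mean score (1-5).

    `generate` is the AI feature under test: it maps an input string to a candidate output.
    """
    scores = [
        judge_output(call_llm, entry["input"], generate(entry["input"]), entry["ideal_output"])["score"]
        for entry in golden_set
    ]
    return sum(scores) / len(scores)
```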
6.3 Metrics & Methods
- Quality Metrics
- Accuracy: Rate of factual correctness or error-free classification.
- Relevance: Does the output address the user’s query or task directly?
- Clarity: Is the language understandable, concise, and well-structured?
- Task Completion: Identify direct measures or proxies for task completion.
- Apology Rates: For general-purpose, open-ended scenarios, measure how often the system cannot address the user need.
- Side-by-Side Comparisons
- When testing different model versions or prompt strategies, display outputs A and B (or more variants) to evaluators.
- Ask, “Which output better meets the user’s needs?” This helps reveal differences that may not be captured by a simple numeric score.
- Iterative Testing
- Prompt/Model Adjustments: Tweak the prompt or model parameters if you see repeated issues.
- Continuous Retesting: Each time you update the model or prompt, re-run the golden set through both the AI output and your judge pipeline (LLM or human-based) to confirm improvements.
By building a robust golden set and leveraging an LLM as judge, you create a systematic evaluation workflow. This ensures that as you iterate and experiment with prompts, training data, and models, you have consistent evidence on what’s working and what needs refinement.
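To tie these metrics together, a small post-run report over judged results might look like the sketch below; the pass threshold (score of 4 or above) and the keyword-based "apology" check are assumptions you would replace with your own definitions:

```python
def summarize_run(results: list) -> dict:
    """Aggregate per-example results into the headline metrics discussed above.

    Each result is assumed to look like:
      {"score": <1-5 judge or human score>, "output": <model text>, "latency_s": <float>}
    """
    apology_markers = ("i can't", "i cannot", "i'm unable", "i am unable")  # illustrative proxy
    n = len(results)
    return {
        "accuracy": sum(r["score"] >= 4 for r in results) / n,   # share judged 4 or 5
        "apology_rate": sum(
            any(m in r["output"].lower() for m in apology_markers) for r in results
        ) / n,
        "p95_latency_s": sorted(r["latency_s"] for r in results)[int(0.95 * (n - 1))],
        "examples_evaluated": n,
    }
```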
7. Launching & Iterating on AI Features
- Beta Testing & Soft Launch
- Release to a small user group for early feedback.
- Monitor real-time metrics (latency, error rates, and user satisfaction).
- Gather User Feedback
- Surveys, interviews, and in-product ratings to pinpoint confusion or unmet needs.
- Refinement Loop
- Identify failure modes and make prompt or model adjustments.
- Validate changes against the golden set and user feedback.
Playbook Tip: Track usage. Drop-offs may indicate quality or UX issues requiring immediate attention. Look for ways to determine if engagement or churn is a result of UX (lack of awareness) or value (AI not addressing the user need).
8. Ensuring Product-Market Fit & Competitive Edge
- User Expectations
- Avoid over-promising AI capabilities.
- Clearly communicate limitations and uncertainty.
- Competitive Analysis
- Understand how users solve the problem now (manual or automated).
- Differentiate by emphasizing clear benefits like speed or quality.
- Value Delivery
- Focus on measurable benefits (time saved, cost reduction).
- Continually improve to remain competitive.
9. Feature Crew & Stakeholders
AI product development requires close coordination among multiple disciplines—product management, engineering, applied research, marketing, design, user research, and beyond. Each function brings unique expertise, and effective collaboration ensures AI features are both technically viable and truly valuable to end users.
9.1 Deep Partnership with Applied Research
- Shared Understanding of Objectives
- Why: Applied research teams are the engine behind model development and experimentation. They need a clear picture of the user problem and success criteria to build effective solutions.
- How: Start each initiative with a deep-dive session on user needs, desired outcomes, and acceptance metrics (accuracy, relevance, etc.). Make sure researchers understand the “why” behind the user stories.
- Co-Development & Experimentation
- Iterative Approach: AI typically requires constant iteration—prompt engineering, data collection, retraining. Product managers should stay in close contact with researchers to swap insights from user feedback and share real-world test results.
- Regular Prototyping: Encourage short, time-boxed experiments (e.g., a 2-week sprint) to validate new approaches or model improvements. Evaluate results against agreed-upon metrics (golden set accuracy, side-by-side comparisons, etc.).
- Technical Feasibility & Constraints
- Trade-off Discussions: The product manager balances user needs with model complexity, latency, or cost constraints. Applied researchers can advise on which approaches are most viable and how trade-offs (e.g., smaller vs. larger models) might impact user experience.
- Emerging Research: Keep an open channel for new discoveries in AI (e.g., new model architectures, fine-tuning techniques) so the product can evolve quickly with minimal disruption.
Key Takeaway: Think of applied research as co-owners of the product vision. They aren’t just refining the system prompts in isolation; they’re shaping core user experiences through system prompt design decisions.
9.2 User Research & Design Teams
- Continuous Feedback on AI Interactions
- Why: User researchers and UX designers can spot usability bottlenecks or confusing output presentations.
- How: Integrate these teams into your beta programs and post-launch metrics reviews. They can run usability sessions or analyze user logs for friction points.
- Conversational AI Considerations
- Intent Detection: For chatbots or virtual assistants, ensuring the AI recognizes user intent accurately is critical. User research can validate whether the system is missing or misunderstanding queries.
- Frictionless Flow: Designers can optimize how the user interface handles clarifications, error messages, and next-step suggestions, reducing frustration and dropout rates.
Key Takeaway: For AI products, experience design is more than visuals. It must account for uncertain or variable outputs and ensure the user can quickly correct or refine AI-driven suggestions.
9.3 The Feature Crew Disciplines
Building and iterating on AI features requires a multidisciplinary “feature crew”:
- Product Management (PM)
- Define the customer problem and success metrics.
- Maintain the roadmap and ensure alignment across teams.
- Engineering
- Implement scalable systems for AI deployment.
- Integrate model APIs, build user interfaces, handle data pipelines.
- Applied Research / Data Science
- Focus on model selection, training, and evaluation.
- Innovate on prompt engineering and manage ongoing experimentation.
- Design & User Research
- Optimize UX flows around AI outputs.
- Gather qualitative and quantitative feedback on usability and clarity.
- Technical Writers / Content Specialists
- Create consistent, user-friendly prompts and system messages.
- Support data labeling and documentation for model training.
Key Takeaway: Effective AI products require each discipline to contribute their expertise. Product managers orchestrate this collaboration, ensuring the final experience meets user needs and maintains reliable model performance.
9.4 Marketing & Product Alignment
- Hands-On Demos & Training
- Why: Marketing teams must understand AI capabilities and constraints firsthand to craft accurate messaging and set realistic customer expectations.
- How: Schedule regular demo sessions—ideally with real customer scenarios—and provide “sandbox” environments where marketing can try the AI feature. Encourage them to document or record interesting use cases to highlight in future campaigns.
- Messaging & Reality
- Align Promotional Materials: Work with marketing to ensure claims (e.g., “automatic summarization” or “personalized recommendations”) match the actual performance of the AI. Over-promising leads to customer dissatisfaction; under-promising risks missing key differentiators.
- Ongoing Syncs: Hold frequent check-ins between product and marketing leads to review any new model updates or performance metrics. If the AI improves (or sees regressions), marketing materials must be updated accordingly.
Key Takeaway: Marketing is your megaphone to the market. Equip them with authentic examples and clear guardrails so they can confidently communicate your AI’s strengths and manage user expectations.
Below is a Playbook Checklist that reflects each major section of the document. You can tailor the level of detail to your specific team or project needs.
AI Product Manager Playbook: Comprehensive Checklist
1. Introduction
- Clarify AI Differences
- Ensure the team understands AI’s probabilistic nature and the importance of output quality.
- Align on the unique challenges (e.g., rapid evolution, continuous iteration) compared to traditional software.
2. Core Principles of AI-Driven Products
- Adopt an Empirical Mindset
- Plan how you’ll measure and monitor output quality (metrics, test sets, user feedback).
- Emphasize User-Centricity
- Identify real user problems; confirm you’re solving pain points that matter.
- Outline an iterative improvement process (launch → gather data → refine).
- Choose the Right Model
- Evaluate cost, latency, and feasibility of high-end vs. smaller models.
- Commit to Responsible AI
- Ensure transparency of AI limitations and have human oversight for errors or critical decisions.
- Set up governance to catch biases, moderate content, and stay compliant with ethical guidelines.
3. Defining “Good” in AI Products
- Set Explicit Quality Definitions
- Create clear benchmarks for accuracy, relevance, clarity, or brevity.
- Build a Golden Set (Intro Level)
- Collect initial exemplar inputs and ideal outputs.
- Outline Evaluation Criteria
- Decide on objective (accuracy, coverage) vs. subjective (readability, helpfulness) metrics.
- Plan for Iteration
- Confirm a process for re-testing and updating metrics as the model evolves.
4. Identifying & Validating AI Use Cases
- Map User Needs to AI Capabilities
- Pinpoint repetitive or complex tasks where AI adds unique value.
- Use cross-functional brainstorming to uncover hidden or high-impact problems.
- Assess Data Availability & Complexity
- Confirm that required data is sufficient, clean, and well-labeled.
- Conduct Cost-Benefit Analyses
- Weigh potential impact against resource requirements.
- Start Small, Prove Value
- Prioritize a narrow, well-defined use case to build momentum.
5. Designing & Building the AI System
- Define Prompting Strategy
- Choose between single-shot, chain-of-thought, or few-shot approaches.
- Prepare Training Data
- Focus on data diversity, quality labeling, and representative edge cases.
- Select Deployment Model (Cloud, SLM, Local)
- Weigh pros/cons of each in light of cost, latency, compliance, and update needs.
6. Evaluating AI Outputs
- Construct & Maintain a Robust Golden Set
- Curate a balanced mix of common and edge-case examples.
- Document why each example is “ideal.”
- Use an LLM as Judge (If Appropriate)
- Automate large-scale output comparisons against your golden set.
- Periodically calibrate against human reviews to guard against biases or drift.
- Define Metrics & Methods
- Track accuracy, relevance, clarity, apology rates, or other domain-specific signals.
- Employ side-by-side comparisons and continuous retesting to validate changes.
7. Launching & Iterating on AI Features
- Beta Testing & Soft Launch
- Roll out to a small user group for early feedback.
- Monitor real-time performance (latency, errors, satisfaction).
- Gather User Feedback
- Deploy surveys, interviews, or in-product ratings to identify UI/UX or quality gaps.
- Refinement Loop
- Adjust prompts, retrain models, and re-check with your golden set.
- Investigate drop-offs to differentiate between UX vs. AI quality issues.
8. Ensuring Product-Market Fit & Competitive Edge
- Manage User Expectations
- Clearly communicate capabilities and limitations—avoid over-promising.
- Analyze Competition
- Identify how users solve the same problem today.
- Differentiate through speed, accuracy, or unique features.
- Deliver Tangible Value
- Highlight quantifiable benefits (time saved, reduced costs, improved accuracy).
- Keep iterating to maintain a leading edge.
9. Collaborating with Stakeholders
- 9.1 Deep Partnership with Applied Research
- Share objectives and success metrics early and often.
- Conduct time-boxed experiments for new techniques; validate them with golden sets and user feedback.
- 9.2 User Research & Design Teams
- Involve them in beta testing to identify usability or flow issues specific to AI outputs.
- Refine conversational AI flows (if relevant) for clarity and friction reduction.
- 9.3 Feature Crew Disciplines
- Ensure PM, Engineering, Applied Research, Design, and Technical Writers each own key parts of the workflow.
- Keep lines of communication open; coordinate frequent check-ins.
- 9.4 Marketing & Product Alignment
- Schedule demos so marketing fully grasps the AI’s capabilities.
- Align promotional claims with model performance and keep marketing materials updated as the model evolves.
Final Word
AI product development is an iterative, data-driven journey. Define “good” early, validate relentlessly, and refine based on real-world feedback. Collaborate across teams—product, research, marketing—to ensure solutions are both technically robust and truly valuable to users. Keep this playbook updated as AI technologies evolve, ensuring your products stay relevant and effective.