The Middle Innings of AI

From Kevin Scott, CTO at Microsoft:

If anything, 2025 will be the first year of the middle innings of an AI game that, when complete, will have determined the shape of the technology industry through the 2030s and beyond. Whereas the early innings were defined by the discovery of a new set of foundational technologies and the construction of supporting infrastructure to accelerate their development and begin their patterns of scaling, the middle innings will be about a deep focus on efficient scaling and assessing the value of the things made newly possible by early inning breakthroughs. As with the many other things that have exhibited this pattern of growth in the past, whether social movements or startups or macroeconomic development, what got us through the early innings will not alone be sufficient for us to make progress and to thrive through the middle innings.

This memo is an attempt to describe our position as we exit the early innings of AI, in terms of both the state of the technology itself and the state of the products built on top of it. Although it is not my intention to criticize what we’ve done or how we’ve done it, nor to offend anyone by challenging closely-held beliefs that they might have, this memo is blunt. It is blunt not because I have absolute conviction in some proposed course of action that is the only way forward, but because I have absolute conviction that the way forward is not a line extension of where we are now, and that nothing short of a blunt conversation even has a chance of helping us change as quickly and as much as is necessary.

Aside from asking some difficult questions and trying to factually and simply state the constraints under which we are operating, I do have a point of view on a path forward. There are three categories of building to which we should pay attention: infrastructure; ecosystem; and product. We must have activity in each, and I have some thoughts about what that activity should look like.

In particular, we need to pay attention to both what it is we are attempting to accomplish in all three categories, as well as how we go about accomplishing those things.

In many cases, the thing preventing us from making the sort of progress that we need to be making is less about what we’re doing, and more about how we’re doing it. Unfortunately, the how is much harder to change than the what. That’s true in general for any organization and especially challenging for us given our top-down management style, and our unwavering confidence that we will always be able to manage through by appending more complexity onto what already exists.

The way that we currently choose our “what” is to understand and embrace all our existing complexity, to spend lots of time building mock-ups and debating complex product strategies, launching things in marketing moments, and sending the field out to see if they can close sales with what we’ve built. We have a disconnect between what our product managers and engineers perceive as a win (launching while minimizing the amount of derailment that happens leading up to the launch) and what other stakeholders perceive as a win. For customers, the win is being able to discover new things that we build and wanting to use them because they do something valuable. For the field, finance, and our shareholders, it’s having what we build be correlated to some kind of growth. (Being correlated is important. If the growth is going to come one way or the other, we don’t need to expend so many resources making uncorrelated shots on goal.)

Our “how” has been good to us historically. It yielded an epic win in the PC revolution where it’s hard to conceive a better strategy. It kept us in the game with the Internet and the Cloud, where, even though we didn’t spot the frontier first, we could use our “how” to get into position. It notably hasn’t worked for anything else.

We lost search. We lost mobile. And it is my contention that we’ll struggle to stay out front with AI unless we adapt and adapt quickly.

Why? Sometimes, the “what” is so complex and evolving so quickly that the “how” is really the only thing that matters. I believe that we are in one of those times.

We have a technology that’s changing faster than anything humans have ever invented. By heroics and unnatural acts, we’ve been able to carve out a leading position in platform, have partnered to create one clear product standout ourselves (GitHub Copilot), and are the platform provider for the only GenAI product with real consumer traction (ChatGPT). That said, it would be very easy to lose that winning position in platform, and the real AI-driven product impacts are still to be discovered by us or anyone else. When things are this complex and dynamic, with the right “how” you can start with the wrong “what” and still win. But if you start with the wrong “how” and refuse to improve that process, it doesn’t matter if you start with the perfect “what,” you’re simply going to have a hard time. When you don’t yet know what the winning hand is, it’s all about the game and how you play it.

Before diving into a strawman for infrastructure, ecosystem, and product, I want to take a moment to walk through some things that seem relatively clear in February 2025.

  1. We now have a reasonably clear path to building AI systems with greater and greater reasoning power over time. There is no obvious technological limit to our continued progress. Most of what is being written on limits to scaling laws is extrapolation from what has already been built and is easily observable, not from the point of view of scientists with track records of encountering obstacles and inventing their way around them. (Much the same was true with Moore’s Law and semiconductors; if Intel, AMD, Nvidia, and Qualcomm had stopped inventing when folks first proclaimed Moore’s Law was dead, we would live in a very, very different world.)
  2. This being said, there is a very clear economic limit on the scaling laws. The amount of capital we’ve collectively invested in reaching the current level of reasoning in AI systems will seem too high unless we see significant expansion of adoption and revenues this year.
  3. Despite explosive growth in the capabilities of AI models, AI product has not caught up to that full technological capability. With perhaps only two exceptions, GitHub Copilot and ChatGPT, we have not seen AI products that fully leverage the power of frontier AI models and that have also found product-market fit. I would argue that since the arrival of o1 and its successors, as of February 2025 we’ve seen no products making full use of the reasoning powers of the current frontier. One of the most striking things to me since the availability of o1 is that not only are its full reasoning capabilities not being leveraged in products, but folks seem generally unaware of where the limits of those reasoning capabilities even are.

(I will talk at length later in the memo about why that might be, and offer some very strong opinions about what can’t possibly be the cause, given that it will be those excuses that damn us to mediocrity. But first, it is interesting to note that no one has cracked this, including even teams like Google DeepMind and OpenAI, who are richly equipped with compute, with the scientists developing AI capabilities, and with very good modern product-making capabilities.)

There are folks here at Microsoft and at large who are pushing models in very interesting ways, closer to the limits of their capabilities, in service of trying to solve valuable problems. I’ll offer two examples.

First, at a recent Tech Leaders offsite I had Eric Horvitz demonstrate an advanced Bayesian diagnostic system he had developed simply by constructing an expert prompt to o1. Eric, who has a PhD in CS and an MD, wrote his PhD dissertation on Bayesian diagnostics and came to Microsoft decades ago in an acquisition of a company building Bayesian medical diagnostic tools. There aren’t many people in the world who could have written the four-page prompt that Eric constructed for o1. But with that expert prompt, o1 was able to perform medical diagnosis at superhuman levels of performance. No post-training was required. No special data aside from the prompt was required. All that was required was a domain expert with a complex and interesting problem and a suspicion that if you talk to the model like an expert, it will be able to do expert things. Real-world reception has been swift and positive: Eric showcased this technology at Columbia’s Department of Medicine, where experts were wowed, and is now working with a medical team on a trauma care version to aid time-sensitive decisions.
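For anyone who wants to try the same pattern, here is a minimal sketch of the “talk to the model like an expert” approach, assuming the OpenAI Python client. The model name, message structure, and prompt text are illustrative placeholders, not Eric’s actual prompt or deployment.

```python
# Sketch of the "expert prompt" pattern: no post-training, no special data,
# just a long, domain-expert-authored prompt sent to a reasoning model.
# The prompt below is a placeholder; the real version was a four-page prompt
# written by a domain expert.
from openai import OpenAI

client = OpenAI()

EXPERT_PROMPT = """You are an expert in Bayesian medical diagnosis.
Reason explicitly over priors, likelihoods, and differential diagnoses,
and state the evidence that would most change your posterior. [...]"""

def diagnose(case_description: str) -> str:
    response = client.chat.completions.create(
        model="o1",  # any frontier reasoning model exposed through this API
        messages=[{"role": "user", "content": EXPERT_PROMPT + "\n\nCase:\n" + case_description}],
    )
    return response.choices[0].message.content
```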

Second, just last week I was riding home one evening from an event and got a text message from my daughter asking me how convolutional neural networks worked. I asked why, and she told me that for her internship at Stanford she had high resolution images of rat retina and needed to be able to segment the images into blood vessel and non-vessel data. She was thinking about training a CNN to do this. Rather than having a long text thread with her about the pros, cons, and alternatives, I just sent the following to Deep Research:

I have high resolution images of retina and would like to segment the images to filter out everything other than the blood vessels in the images. I’d then like to compute the average intensity of the pixel values for the blood vessels. Can you write code to do that?

ChatGPT said:

To ensure the best approach, could you provide more details on:

  • The format of your high-resolution retina images (e.g., PNG, JPG, TIFF, etc.).
  • Whether you have labeled images for segmentation, or if you’d like to use a pre-trained model for blood vessel segmentation.
  • The preferred programming language or framework (e.g., Python with OpenCV, TensorFlow, or PyTorch).
  • If you need the output in a specific format or visualization.

With this information, I can tailor the code to your specific requirements.

I said:

The images are PNG formatted. I do not have a pretrained segmentation model, nor training data. I’d like the code in Python. And the output should just be the pixel intensity average for the blood vessels data as a single floating point number per input image.

ChatGPT said:

I will write a Python script that segments blood vessels from high-resolution retina images in PNG format, using traditional image processing techniques (such as thresholding and morphological operations) since no pretrained model or training data is available. The script will then compute the average pixel intensity of the segmented blood vessels and output a single floating-point number per image. I will provide the code shortly.

The output is in the Appendix. When I was in graduate school, starting with a reasonable set of foundations in computer science, programming, image processing, and AI, what the model produced in 5 minutes, augmented by 23 sources, could easily have been a week’s worth of work. For my 16-year-old 10th grader, the path from her problem to this solution would have otherwise been insurmountable in the amount of time she had.

This sort of capability unlock (the rendering of possibility from impossibility) and this sort of productivity gain (a week’s worth of expert work done in 5 minutes) is where the true power of these systems is.
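The full script is in the Appendix; for readers without it, here is a minimal sketch of the classical approach ChatGPT described, thresholding plus morphological operations with no trained model, assuming OpenCV and NumPy. Parameter values are illustrative, not tuned.

```python
# Minimal sketch of the classical pipeline ChatGPT described: segment vessels
# with contrast enhancement, morphology, and thresholding (no trained model),
# then report the mean pixel intensity over the vessel mask per input image.
# Parameter values are illustrative, not tuned for real retinal data.
import sys
import cv2
import numpy as np

def vessel_mean_intensity(path: str) -> float:
    img = cv2.imread(path, cv2.IMREAD_COLOR)
    green = img[:, :, 1]  # vessels typically have the best contrast in the green channel

    # Enhance local contrast, then pull out thin dark structures (vessels)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(green)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    blackhat = cv2.morphologyEx(enhanced, cv2.MORPH_BLACKHAT, kernel)

    # Threshold the vessel response and clean up small specks
    _, mask = cv2.threshold(blackhat, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

    # Average the original intensity over the segmented vessel pixels
    return float(green[mask > 0].mean())

if __name__ == "__main__":
    for image_path in sys.argv[1:]:
        print(image_path, vessel_mean_intensity(image_path))
```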

  4. While reasoning is necessary to build agentic AI software, it is insufficient. One of several theories for why we have a capability-to-product gap right now is that agents need more than just reasoning power to take on more complex tasks fully autonomously. (This is certainly intuitively true if you imagine delegating tasks to a human collaborator: if all they had was reasoning capability, they would be very limited in what they could take on for you.)

The three primary things that are missing in our AI capabilities right now are memory, entitlements, and actions. In brief: agents must be able to remember and recall in service of their interactions with you as well as over their “scratch work” done in service of the tasks they were assigned (as a human collaborator would); agents must be entitled to see and act upon information and systems that a human collaborator would be able to in service of accomplishing a task assigned to them; and agents must be able to use tools and take external actions necessary to complete their assigned tasks.

In addition to being necessary to build useful agents, memory, entitlements, and actions will not organically emerge from scaling up our reasoning engines (i.e., transformers). These are completely independent capabilities that must be built in conjunction with, but outside of, frontier models. Moreover, these capabilities need fully functioning ecosystems, with robust interoperability, tooling, and developer communities, in addition to services and libraries that implement the basic capabilities.

  5. First-mover advantages matter more than ever. When Satya and I were at the White House almost two years ago, shortly after the ChatGPT and GPT-4 launch, I asserted to everyone in the room that within 18-24 months the cost of training a GPT-4-equivalent model would go down by at least 10x, and that consequently, many would be producing GPT-4-caliber models. I think that prediction has mostly come true. For example, DeepSeek v3 is reported to achieve GPT-4o-level performance with under $6M in compute cost for training, a fraction of the expense of training GPT-4o. DeepSeek-R1’s open-sourcing of the recipes for building strong reasoning models will likewise accelerate community development. We’ve witnessed this pattern recently with Grok v3, as well.

Were there no further advances in AI systems capabilities coming, and the only thing valuable about AI systems were a set of GPT-4-ish model weights and inference APIs, then what you would see is the value of training going to zero, the cost of an inference API call getting set marginally higher than the cost of compute to serve it, and a huge race to drive the cost of that compute down as low as the laws of physics would permit. Thankfully, per 1-4, we are quite far away from the end of progress and have just barely scratched the surface of what must be built and how to translate capability to value. That notwithstanding, the past two years have affirmed my belief that for every advance we make, competition will catch up, and moreover that what is hard now will be easy two years from now.

  6. Efficiency matters in the middle innings. One of the truisms of the tech industry is that premature optimization often means death. Trying to wring efficiencies out of systems not yet proven valuable often wastes time and energy better spent on making the systems valuable. Moreover, optimizing a system that is not yet valuable, and that needs to rapidly change to become so, may well over-constrain the system in ways that slow you down (even if you do have money to burn), and will certainly, at the least, result in a bunch of throwaway work.

That said, we are at a moment where optimizing our systems, particularly our inference systems, matters a lot. The place where we need extreme flexibility is in training and post-training.

  7. Data matters, but not the way that most folks think. In the early innings, almost everyone’s AI efforts were dominated by accumulating as much compute and as much data as possible in service of maximally scaling a reasoning model. Too little compute relative to data meant smaller, less capable models and even less inference compute. Too little data relative to compute meant that you couldn’t get big models to converge.

We increasingly have a set of tools at our disposal that allow us to take high-quality data, whether indirect sources like textbooks or journal papers or direct sources like expert human feedback, and amplify them through a powerful frontier model to generate the optimal volume of tokens for training or post-training a model that hits a desired point in the tradeoff space between training compute budget, trained model inference performance, and model capability. Things like CoT and RLHF are ways to achieve parts of this kind of synthetic data generation.
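To make that amplification loop concrete, here is a hedged sketch of the general shape: a small pool of high-quality seed data is expanded by a frontier model and filtered by a grader before anything is added to a training set. The frontier_model and grade functions, the prompt, and the delimiter convention are placeholders, not a description of any actual pipeline.

```python
# Sketch of the amplification loop: a small pool of high-quality seed data
# (textbook passages, journal papers, expert feedback) is expanded by a
# frontier model into many candidate training examples, and a grading pass
# decides what is kept. frontier_model and grade are placeholders for
# whatever model client and verifier/reward model a real pipeline would use.
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    chain_of_thought: str
    answer: str

def frontier_model(prompt: str) -> str:
    """Placeholder for a call to a frontier reasoning model."""
    raise NotImplementedError

def grade(example: Example) -> float:
    """Placeholder for a grader: another model, a formal verifier, or a human."""
    raise NotImplementedError

def amplify(seed_passages: list[str], per_seed: int = 8, min_score: float = 0.8) -> list[Example]:
    kept = []
    for passage in seed_passages:
        for _ in range(per_seed):
            raw = frontier_model(
                "From the passage below, write a hard question, a step-by-step "
                "chain of thought, and a final answer, separated by ###.\n\n" + passage
            )
            parts = raw.split("###")
            if len(parts) < 3:
                continue  # skip malformed generations
            example = Example(*(p.strip() for p in parts[:3]))
            if grade(example) >= min_score:  # keep only what the grader trusts
                kept.append(example)
    return kept
```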

What does this mean? It means that we will be able to train bigger, more powerful reasoning models, which in turn makes it possible to more efficiently generate synthetic data. This is an extraordinarily important (and completely ill-understood) feedback loop that is existentially important to the middle innings. (Folks who are declaring that we can’t build bigger models because there aren’t enough training tokens are apocalyptically wrong.)

It also means that it is now sometimes sensible to train application-specific models for performance reasons, i.e., to achieve a level of performance otherwise impossible with the available training or inference compute budget. The majority of the impulses to do this will still likely be wrong, but the cost of being wrong now is less about catastrophic opportunity cost on wasted compute and more likely to manifest as wasted product development efforts from folks without a good product idea or reasonable product rhythm looking for a miracle in the wrong place.

  8. Product-making matters. This is perhaps the most important and most contentious point that this memo will make. Beyond ChatGPT, GitHub Copilot, and our API business, we have not yet discovered high-ambition AI-powered products, features, and services that are contributing to AI revenue growth. It seems very likely that the products, features, and services that generate truly transformational value for users, and hence usage intensity for us, will not arise first from incremental changes to existing products.

There are also few, arguably no, third-party products in the marketplace (other than ChatGPT) that have any kind of user signal that could help us focus our own efforts. And that puts us at a distinct disadvantage, given that not only are we not good at the types of modern product development practices which are the only things proven to generate disruptive new ideas “from scratch,” but we’re proud that we’re not good at them. We believe that the way that we do things is necessary for our success, as contrary as our methods appear to anyone who has built products elsewhere, even when those successes are greater and more recent than ours. Everyone empowered to make decisions about product direction and strategy, multiple levels deep in our biggest orgs, believes in our way of doing things. Even when they bring new talent from elsewhere with proven track records into their teams, we don’t change our product-making culture.

Now, the question to ask ourselves isn’t whether this dynamic is true or not, or whether things could have been different and better in retrospect if we’d just had a different product-making culture in the past. The question to ask is whether discovering new AI value is going to be more like all the net new value that’s been discovered in the software industry this century, or whether it will get discovered the way that we do things. If the latter, then great. We’re set. If the former, then we have two choices: get with the times; or, make sure that our infrastructure and fast-follow ability is so great that we can intercept change in the market as fast as it happens.

  9. Our infrastructure decisions and allocations matter. There’s an obvious financial connection between these allocation decisions and our ability to deploy valuable product. Thankfully, an enormous amount of the AI capacity that we’re already deploying and planning to deploy in the future has a paying customer: OpenAI. In addition to standing up AI infrastructure to meet OpenAI demand, we are also standing up AI infrastructure to support both revenue-generating 1P AI product growth, and 1P training.

This is effectively a bet that our product leaders can create enough new revenue from the benefits of that training to offset the billions in additional OPEX to build and manage that capacity, just to break even. Moreover, even if we are able to produce these levels of net new revenue due to 1P AI training by 28/29 when the expense hits the books, in order for this allocation to be rational, given that we will have access to OAI models through 2030, we would need to believe that 1P AI training will produce more valuable models than OpenAI (which will likely be spending an order of magnitude more on training by then), or that the OpenAI partnership will have become unfruitful in some other way.

Even with 1P training and current forecasted 3P and OAI infrastructure demands, by 2027, we may have a significant surplus of power globally. That’s a better position to be in than the deficit of power and capacity that has plagued us for the past two years. We can absorb that excess of power with revenue-producing workloads. On current course and speed, there is no way we will have enough demand on the 1P side to satisfy the level of overage. One upside of our big capacity swing is that if any product team can deliver engaging products based on AI tech that translate into real incremental revenue for the company, then we will absolutely be able to deliver the infrastructure needed to support it. This is true whether that team is ours or a competitor’s: we’ll have the capacity to fast-follow and meet the moment, something which Microsoft has a rich multi-decade history of doing and doing well. We can and should increase the speed at which we can do so.

Takeaway #9: Failure to deliver product value in the past has been mostly felt as opportunity cost; failure to deliver product value in the future, given the magnitude of our AI investment, will have costs that are much harder to bear. Given that we are choosing to invest in AI capacity with urgency and ambition and world-leading innovation, we also need to choose to do the same with our bets to create new product value.

Hopefully the moral of the story comes through loud and clear from the takeaways in the first part of this memo. We have an extraordinary opportunity ahead of us, and the most enviable position of any player in the game as we enter the middle innings. That said, we have some very significant challenges facing us, hard decisions to make, and the necessity to do some fundamental things differently than we have chosen to do them in the past. Rather than prescribe in detail exactly what I think we ought to do, I’m instead going to describe where I think the most interesting trailheads are for the three categories of AI strategy and investment that we need to be considering: infrastructure; ecosystems; and product.

The good thing about infrastructure in the middle innings is that we’re starting with wind in our sails, given our lead from the early innings and having the only real breakout successes in Generative AI running on our platform.

On the capacity front, we are rapidly transitioning from an untenable problem—a giant deficit of an asset that takes forever to build—to a more tenable one: a likely surplus of capacity that we need to manage super well.

The three dimensions of that management are: forecasting; allocation; and yield. I won’t say much about forecasting, other than it’s better in a highly competitive boom market to have too much capacity than too little. We should treat our impending surplus as an opportunity, although one that everyone should equally understand.

On allocation, we should make sure that our product leaders understand that they either need to do things with AI that make more revenue, or the size of their teams will unavoidably need to be reduced, no ifs, ands, or buts. Particularly if leaders are asking to have significant R&D assets allocated to them, like training compute or AI headcount, they need to show a plan for how they, or someone, is going to pay for those allocations. Where there is an opportunity to reuse an asset that someone else has already developed and paid for, whether that’s an internal piece of infrastructure or an OpenAI model, there needs to be a supremely clear and well-understood plan, which the team will be held accountable to, for offsetting the costs of their bespoke efforts.

On yield, we have a bunch of very interesting trailheads to explore. The problem here is to make sure that 100% of our infrastructure investment is producing maximum ROI. Part of that equation is about finding high-value workloads to run on available capacity. And part of the equation is about reducing the cost of the infrastructure itself.

We’ll talk more about finding high-value workloads in the product section. On the efficiency front, we have two frontiers: inference and training. Optimizing the cost of training for us is mostly going to be about avoiding doing it when unnecessary. This will be hard to do given that everyone’s intuition here is wrong. Almost everyone thinks that training models is the most important and interesting thing in the world right now. There are a variety of wrong theories that are used to justify unnecessary training efforts:

  1. The only way to make a good product is to train a model with my private and special data.
  2. I have a special idea about how to build a model that will make it better than what other folks are doing because I am cleverer than the other folks.
  3. If I don’t let my team train a model, they will quit.

Our default position when someone wants to train a new model should be “don’t,” similar to how we would tell a team not to write an operating system or compiler or query optimizer or storage layer from scratch unless there is literally no other way forward. The vast majority of the time, these three are bullshit excuses, and are not simply false, but are misdirecting attention from bigger, and often very obvious, problems that, if not fixed first, will result in failure. Ignoring those dynamics is, to quote a former boss, going 0 for 2.

When a team persists with their desire to train a bespoke model, we should have a pretty formal mechanism to have them justify the need, and one that ensures that these three template excuses aren’t being misused.

First, and trivially, they should have a plan that justifies the expense of training a brand new thing, one that shows what the new model will enable and the potential revenue upside against the magnitude of investment required in terms of compute, data acquisition, and people. In other words, in success, what’s the ROI they’re signing up to deliver?

Second, they should be able to show the experiments that they’ve run that show why they can’t achieve their product goal with something that exists or that could be more cheaply built by taking a dependency. We should force those conversations to be good faith, in the sense that the goal should not be building a model but solving the product problem. Accordingly, the case for the bespoke model and what they’ve done to disqualify other options should be widely circulated for feedback on their analysis and for suggestions on alternative solutions.

Third, if we allow a team to build a bespoke model, there should be an ongoing assessment of their progress against both their goals and the state of the alternatives. Should the originally assessed advantage of the bespoke model over the alternatives fail to materialize, we should discontinue the bespoke effort and force use of the alternative.

Finally, our AI tooling teams should be looking across all the legitimate bespoke model asks to determine whether there’s common infrastructure that could be built to help solve AI product problems in a more highly leveraged way.

For instance, in a world of CoT RL, it has become a lot cheaper to do post-training. We need standard post-training services that everyone uses, rather than allowing every team to adopt their own post-training pipeline. There is no good reason that this set of post-training services is not built by and operated on behalf of the whole company by CoreAI.
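To make “standard post-training service” concrete, here is a purely hypothetical sketch of what submitting a job to such a service could look like from a product team’s side; the module, fields, and registry URIs below are invented for illustration and are not a real internal API.

```python
# Hypothetical illustration only: what "everyone uses the same post-training
# service" could look like from a product team's point of view. Every name
# and field below is invented for this sketch; nothing here is a real API.
posttraining_job = {
    "base_model": "approved-frontier-snapshot",            # no bespoke pre-training
    "method": "cot_rl",                                     # e.g., CoT RL, DPO, SFT
    "dataset": "registry://shared/trauma-triage-v2",        # from the shared data registry
    "eval_suite": "registry://evals/triage-precision",
    "owner": "product-team-alias",
    "roi_justification": "link-to-approved-business-case",  # ties back to the review gate above
}

# A central team operates the pipeline; product teams only submit jobs and
# consume the resulting checkpoint plus its eval report.
# job_id = posttraining_service.submit(posttraining_job)
```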

In the limit, there should be vanishingly few reasons for folks to do their own pre-training, or to even post-train in a non-standard way. Every such thing we allow should be considered an exception, tracked, and as soon as the conditions no longer exist to justify the exception, permission for the exception should be revoked.

All of this is thankfully not rocket science, and aside from making our standard infrastructure more complete and usable, doesn’t require an extraordinary amount of development work. All of this is now just engineering, and discipline. The things that are not quite in this “just engineering” category are the extra-model infrastructure that we need to build to make training and post-training more efficient and to help us make useful agentic software. Those things are data management, memory, entitlements, and actions.

I’ll talk about the first one, data management, here, and the other three in the ecosystems section. The latter three are fundamentally about interoperability of agents with each other and with the broader world, so it’s less about the infrastructure for them, and more about ensuring that whatever infrastructure is needed to support them is very broadly deployed.

For data management we need a company-wide horizontal effort to ensure that we have a consistent way of procuring, generating, storing, assessing the value of, and reusing pre- and post-training tokens for our AI models. Without such an effort we incur a variety of risks, including buying or expensively generating tokens redundantly, unnecessarily starving training efforts of useful data, over-constraining the use of training tokens, and/or accidentally using tokens for training against policy. I think that we urgently need an effort, in the same spirit as Project Aristotle, to coordinate existing efforts and to fill in some pretty substantial gaps that exist.

We will need contributions to this effort from all parts of engineering, although given the horizontal nature of the program, it should be coordinated either from CoreAI or from OCTO. It could be coordinated elsewhere as long as the team signing up for ownership will treat it as a true horizontal, company-wide effort optimizing for MSFT-wide leverage.

Lastly on infrastructure, and this topic deserves significantly more space than I have here, is that we are going to need to fundamentally rethink the nature of our cloud. The cloud that we’ve built is an excellent match to the compute, networking, and storage needs of our first generation of cloud customers. The majority of these customers were not cloud-native, dipping their toes for the first time into cloud, and we gave them outstanding on-ramps to meet their needs. They represented most of our growth and revenue, and placed the majority of the demands on compute, storage, and networking infrastructure, which meant that the compute, storage, and networking infrastructure that we built needed to be for them.

That is not going to be the case going into the future. Overwhelmingly, our growth in compute, networking, and storage needs is going to come from AI workloads.

We have to date appended these core AI infrastructure needs onto our roadmaps and builds for our legacy cloud infrastructure. Perhaps that is a spot on the efficient frontier. I suspect believing so is wishful thinking, as is believing that we can get by without being on that efficient frontier.

There are many implications of this very significant shift in capital allocations and workloads. We should be having a thorough round of self-reflection on whether we are architected properly top to bottom, from our physical plant, power, and cooling designs, to our compute, storage, and networks, to operating and systems software, to the basics of our control plane and resource management.

At the very highest level, we will have three AI workloads to think about: training; inference; and apps.

We’ve done quite a bit of work on training architectures, and have quite a bit more to do in the future as the nature of training changes to be a heavier mix of synthetic token generation and chain of thought generation (both of which are inference workloads), grading (which is a general purpose workload), in addition to core model weight optimization (which is a classical homogeneous supercomputing workload). Now, even though we operate some of the largest training environments on the planet, we don’t have distinct advantages here in terms of our hardware or software. We run on someone else’s hardware platform, and the workload prefers the efficiencies and ease of use of bare metal. Our advantage to date has been that we saw how the workload was evolving and through partnerships moved more quickly than anyone else to build a winning, if not terribly efficient environment. The question we must ask and answer in the coming months and quarters is what, if anything, we should do to create advantages here. It may be as simple as what we are doing now: outsourcing training to a partner in an extremely capital efficient deal that dramatically reduces execution risk for us.

We’ve done quite a bit of work on the inference workload, and even though there’s quite a bit more to do, here I feel like our API offering and our middleware are world-class. There are legitimate questions about what the architecture below the middleware looks like. The message that we should clearly send our teams, given the magnitude of that inference workload, is that they should be absolutely unconstrained in what they can optimize to achieve greater efficiencies. They are heavily constrained now, which very soon, if not already, is the tail wagging the dog.

The place where we have the most thinking to do is on what our platform needs to look like for AI applications. Those AI applications most likely are going to need access to new pieces of the software stack, like agentic memory, a new entitlements pattern, and new ways to plan and take actions in the context of tasks that they are performing. We will talk about all three of those things in the ecosystems section. AI applications will also need to be able to very efficiently make inference calls, manage complex inference patterns across multiple models, and across models running in different regions and across the cloud and edge. AI applications will almost certainly have different latency expectations around rendering the user experience and potentially different modalities than just text and static graphic payloads.

It is highly likely that apps as we know them now cease to exist. In a world where agents have access to capabilities through action frameworks and rich abilities to get to user intent directly and formulate plans to take actions, the idea that we need product managers and designers serving as editors-in-chief for which functionality gets bound to which UX affordances is going to seem like an absurd anachronism in a few years. We’ll likely still need the bundles of capabilities in apps, and a way to deploy them. But they may just be headless capability containers that are there solely for the use of agents.

There will certainly be other changes as well, almost surely in how we permission and secure agents to do work on our behalf, to make sure the actions that they take and the interactions that they have with one another are safe and intended.

Even the patterns of scaling for agent-based AI applications will likely be different. Have an AI agent doing recruiting and you have a huge quarterly hiring target to hit? Great, spin up a huge number of agents to go hit those goals. Big holiday shopping event approaching, and you are using AI agents to help with marketing, sales, and customer service? Great, surge agents to go take care of all that work. Given how elastic agentic worker capacity is going to seem relative to human workforces, folks are going to want to avail themselves of that elasticity in unprecedented new ways.

As we imagine this world of new application patterns, we must be thinking about what the absolute best way is to provide agents with the services and environments they need to work optimally. That means thinking about the full stack from top to bottom, and allowing ourselves to challenge long held assumptions about what constitutes a good infrastructure stack. We may very well need a lot of what we’ve already built, and we should hope that is in fact the case. But we should not convince ourselves that the most convenient thing for us, which is to only append and extend onto what already exists, is going to be the optimal solution to the infrastructure problems of the future. We desperately need our folks to be first principles thinkers here, and to feel as if they have permission to optimize everything.

Lastly, for both the training and inference workloads, we will have very large amounts of compute, and potentially excess power, that we will have to manage very well. Obviously, to the extent that we can have 1P AI product success and continue to drive 3P inference growth, we may be able to use these surpluses. But we also have an opportunity to think creatively about net new uses for these surpluses. There may be opportunities to leverage AI compute in existing workloads that use classical compute, and there may be net-new uses of AI compute that no one has yet imagined. We should reserve some cycles in both MSR and in our product teams to do some research and experimentation here to see if we can find unlocks. We certainly cannot afford to wait to do this thinking until the surplus manifests.

In exponential growth it’s easy to lose sight of just how fast things can change. Reasonable impulses that you have to fit the new into the gigantic envelope of the existing can very quickly become existentially misguided when you realize that the new will easily swallow the old in an unreasonably short period of time. Another illustrative anecdote is how folks are thinking about power. In the first innings of the GenAI wave, we used the scale of what we had been building as a hyperscale cloud as the quickest way to bootstrap our AI infrastructure, including leveraging our well-known power playbook to add necessary additional capacity. Doing something different would be challenging. And it may also be the case that doing something different will not land us on either the efficient frontier or in a winning position. That said, the two most interesting feats of scaling going on right now in AI infrastructure are doing something idiosyncratic and first-principles with power capacity management. There’s what Elon has done in Memphis. And there’s what Crusoe is doing in Texas, likely with Oracle and/or OpenAI (“Startup Behind OpenAI’s Stargate Data Center Lands Record Power Deal,” The Information). Whether this is the right or wrong strategy for us, we should give ourselves full permission to go back to first-principles thinking on how we solve the problems of the future, versus being over-reliant on what we already know and are comfortable with.

The other bit of AI good news is that in the middle innings a significant amount of the net new work that needs to be done is figuring out a path not to higher-reasoning AI models, but to completing the stack of agentic capabilities. There are three agentic capability pillars that are missing or underdeveloped: memory; entitlements; and actions. We have significant work to do in the next 12 months and beyond to get these capabilities implemented.

For memory, the problem to be solved is how to allow an agent bound to an individual or a team to remember, the way a human colleague would, its past interactions with humans and its own past output, so that it can more easily complete complex tasks that today can only realistically be delegated to human colleagues. We have a variety of efforts here, from early ones like RAG and long contexts that are likely inadequate, to new efforts like AMP being built by Steve Lucco & co. The characteristic of an effective agentic memory system is that it be more like human memory than a database. It should be able to start with imprecise recollections, and iteratively refine them into high-precision remembrances with a variety of cognitive and externalized tools. RAG and longer contexts might be helpful, but they are insufficient.
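To make that shape concrete, here is a minimal sketch of what an “imprecise recollection, iteratively refined” memory interface might look like; the class and method names are assumptions for illustration, not AMP’s actual API.

```python
# Sketch of the shape of an agentic memory interface: recall starts broad and
# imprecise, then is iteratively narrowed with tools, more like human memory
# than a database query. Names are illustrative; this is not the AMP API.
from dataclasses import dataclass

@dataclass
class Recollection:
    summary: str        # fuzzy, low-precision memory ("we discussed retina images in June")
    confidence: float   # how much the agent trusts this recollection as-is
    pointers: list      # handles to raw material (threads, files, scratch work) for refinement

class AgentMemory:
    def remember(self, content: str, tags: list[str]) -> None:
        """Store an interaction, or a piece of the agent's own scratch work."""
        ...

    def recall(self, cue: str) -> list[Recollection]:
        """Return imprecise candidate recollections for a natural-language cue."""
        ...

    def refine(self, recollection: Recollection) -> str:
        """Follow the pointers to reconstruct a high-precision remembrance."""
        ...
```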

The reason that memory is so important is twofold. One is that it is wasteful to have to recompute the solution to every problem from first principles every time you encounter it. It takes time, forcing unnecessary waiting by the folks depending on an agent. And it is certainly wasteful of GPU cycles. More importantly, though, a lack of memory means that an agent is by necessity transactional and mostly impersonal. There is no ability for the agent to learn about you and your preferences.

There is no ability for it to really anticipate your needs on a personal level. There is no way for you or it to refer back to interactions prior to the current session that you’re in. And there’s no ability for the agent to have durable retention.

If all there is to your interactions with an agent is what’s in your current chat session with it, it is trivially easy for you to switch between agents, given that everyone is on equal ground contextually, and the retention game is then one of UX, model capability, and brand. Once someone cracks real agentic memory, nearly all of the retention game will be how quickly the agent can build up a set of interactions with you to personalize itself to you. Once that happens, most of that other stuff matters much, much, much less.

In order for Microsoft to have the best platform for agentic software development and agentic products across 1P and 3P, we have to have a common agentic memory infrastructure. With things like AMP, which should be ready for deployment in the Build timeframe, we have good starting points for how to organize, store, and retrieve memory. Think of AMP, though, as a library which invites multiple instantiations, and those multiple instantiations could mean that memory becomes fragmented in ways that are unnecessary.

There will be necessary fragmentation. For instance, there is likely going to be at the least fragmentation between your personal agent’s memory and your work agent’s memory, given that neither employee nor employer are going to want those memories intermingling.

There are two interesting problems here that we’ll have to tackle:

  1. We must make sure that we don’t have Conway’s Law fragmentation across work memory. It would be bad if, because we’re terrible at leverage and common infrastructure building, we allow the natural NIH tendencies in teams to result in separate work memory instances by product line (much in the same way that we’ve allowed teams to build fragmented and marginal-utility Copilot experiences all over the place). If your work agent can’t remember all your work context, whether you’re interacting with a horizontal copilot UX or an in-app copilot UX, things are going to seem broken. The value of this memory is going to be so high, and so integral to a good user experience, that this sort of fragmentation is going to make our agents feel broken if we allow it to exist.
  2. We are going to have to figure out a good way to allow different memory contexts to commingle. While it is likely unavoidable to have separate work and personal agent contexts because of personal preferences and the constraints of employers, it may be useful and productive to be able, in a controlled way, to allow memories to flow across this personal/work boundary. Again, to convince yourself you need go no further than imagining collaboration with your human colleagues, who technically can’t fragment memories. They have all of their personal memories and work memories intermingled because they can’t not, but they have rules that they follow to make sure that those memories are only used in specific ways. It is highly likely that we’ll need to think about how to emulate this for agents.

Lastly, if we solve this problem well, it will also make it possible to allow third parties to plug into the memory system so that you could theoretically have a single, coherent, monolithic memory across all of your work contexts, whether they are in M365, Power Platform, or Azure, or in 3rd-party products like ServiceNow, Salesforce, etc. We need to think carefully about what a memory API would look like, one that would be very closely tied to the entitlements and actions APIs and that would allow an agent doing work spanning Microsoft and non-Microsoft systems to store, recall, and process those multi-platform memories. If we don’t get this right, someone else will, because so much of the utility of agents will be bound to the breadth of their memories.

For entitlements, the problem to be solved is how to give an AI agent permission to see information and/or perform actions necessary to complete a task which has been delegated to it. On the one hand, this is super prosaic. We ought to be able to specify a scoped permissioning architecture that will allow an agent to request additional permissions to perform a delegated task, for the user to grant permissions with constraints, and for an administrator to review and/or further constrain those permissions. With that architecture in place, this is then an exercise of implementing it, everywhere, as quickly as possible, given that any system that doesn’t implement entitlements in a way that agents can navigate will be a system to which agents are blind.
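As a sketch of that request/grant/review flow, the shape could look something like the following; the types, fields, and constraint choices are placeholders for illustration, not a proposed schema.

```python
# Sketch of the scoped entitlement flow described above: the agent requests
# access, the delegating user grants it with constraints, and an administrator
# can review or further constrain it. All names are illustrative placeholders.
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class EntitlementRequest:
    agent_id: str
    task_id: str
    resource: str        # e.g., "calendar:read" or "crm:update-opportunity"
    justification: str   # why the delegated task needs this access

@dataclass
class EntitlementGrant:
    request: EntitlementRequest
    granted_by: str                       # the delegating user
    expires_at: datetime                  # constraint: time-bound
    scope: str = "this-task-only"         # constraint: scoped to the delegated task
    admin_constraints: list = field(default_factory=list)  # added on administrator review

def grant(request: EntitlementRequest, user: str, ttl_hours: int = 8) -> EntitlementGrant:
    """User grants the request, constrained by scope and an expiry."""
    return EntitlementGrant(
        request=request,
        granted_by=user,
        expires_at=datetime.utcnow() + timedelta(hours=ttl_hours),
    )
```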

For actions, the problem to be solved is how to enable an AI agent to, with high precision, call into systems to access information or actuate services to which it has been entitled access. Entitlements and actions are obviously deeply intertwined. Whereas entitlements are about reasoning over access, preventing unwanted or unintended access to systems, and ensuring that access is transparent to both the user and the administrators of systems being accessed, actions are mostly about determining what needs to be accessed in order to complete a delegated task. The agent must be aware of the entire action space to which it is entitled for a given task it is attempting to complete, and it must only trigger actions with high precision and transparency. Not triggering an action when the agent is entitled to do so and it is the best possible next action may result in the failure of the agent to complete a task properly. Triggering an action when the agent shouldn’t may have more dire consequences.
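Continuing the sketch above, an action layer could sit directly on top of those grants: the agent sees only the action space it is entitled to for the task, every invocation is checked, and everything is logged so access stays transparent. Again, the names are placeholders, not a real framework.

```python
# Sketch of an action layer on top of entitlements: the agent only sees the
# actions it is entitled to for this task, every invocation is checked against
# the active grants, and everything is logged for transparency to the user and
# to system administrators. Builds on the EntitlementGrant sketch above.
class ActionBroker:
    def __init__(self, grants, audit_log):
        self.grants = grants        # active EntitlementGrants for this task
        self.audit_log = audit_log  # reviewable by the user and by admins

    def action_space(self) -> list[str]:
        """The complete set of actions the agent may consider for this task."""
        return [g.request.resource for g in self.grants]

    def invoke(self, resource: str, operation, *args):
        if resource not in self.action_space():
            self.audit_log.append(("denied", resource))
            raise PermissionError(f"agent is not entitled to {resource}")
        self.audit_log.append(("invoked", resource, args))
        return operation(*args)     # the actual call into the external system
```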

Perhaps the most important thing from an ecosystem perspective about memory, entitlements, and actions is that the protocols that implement them be open and as simple as possible to encourage widespread adoption. Secondarily, but also very important, is that these protocols should compose, given that things will be in quite a state of flux for potentially the next decade. Taking lessons from the web, the enduring protocols really were the ones that solved broad problems as simply and as openly as possible as those problems emerged, and that were flexible enough to compose with systems that were going to be implemented in the future as new problems were discovered and had to be solved. It is a near certainty that this dynamic is going to play out with the web of agentic AI that emerges over the coming years.

A good starting point here, which is not complete, but is open, simple, and composable, is MCP, Anthropic’s Model Context Protocol. MCP itself does not have a memory API, an entitlements architecture, or specify a planning engine for actions. But all of these things can compose with MCP.

MCP provides a protocol for agents to access things called Resources. A “resource” could be a record in a database, a source file in a repo, a video, etc. It currently relies on both MCP servers and clients to implement their own entitlement systems. Clients that implement MCP (like Claude desktop) require that the user explicitly specify resources before the client can even consider accessing them. And servers implementing MCP must superimpose their own access control systems on top of incoming requests. It would be very straightforward to augment the MCP protocol with a real client- and server-side entitlements system. It’s possible and somewhat likely that multiple entitlement systems could be used simultaneously by clients and servers participating in a single MCP session. But there would be significant advantages to having a single, extensible entitlements architecture that was widely used by everyone.
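To illustrate the composition, here is a rough sketch of layering an entitlements check onto an MCP-style resource server. It deliberately uses a generic handler shape rather than the actual MCP SDK, whose interfaces should be taken from Anthropic’s spec; the URIs and session structure are invented for this example.

```python
# Rough sketch of composing an entitlements layer with an MCP-style resource
# server. This intentionally uses a generic handler shape rather than the real
# MCP SDK; consult Anthropic's spec and SDKs for the actual interfaces.
RESOURCES = {
    "db://customers/42": "customer record ...",
    "repo://service/main.py": "source file ...",
}

def entitled(session: dict, uri: str) -> bool:
    """Server-side check layered on top of the protocol: is this session's
    agent entitled to the resource, per the grants it carries?"""
    return uri in session.get("grants", set())

def read_resource(session: dict, uri: str) -> str:
    # Today MCP leaves access control to each server to superimpose; the memo
    # argues for one shared, extensible entitlements architecture instead.
    if not entitled(session, uri):
        raise PermissionError(f"session not entitled to {uri}")
    return RESOURCES[uri]

# Example: a session whose user granted access to a single resource.
session = {"agent": "work-assistant", "grants": {"db://customers/42"}}
print(read_resource(session, "db://customers/42"))
```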

All of the work that we’re doing with AMP, and presumably with other systems we’re building that implement a slice of Memory, Entitlements, and Actions, could compose with MCP. And it would be an enormous accelerant to 1P interoperability of our agents, and certainly to broader interop, for as many resource providers as possible to implement MCP. This appears to be the direction that the ecosystem is heading. The costs and risks of snapping to this are low.

In addition to the horizontal training data initiative mentioned earlier, I am going to insist that we horizontally coordinate and standardize our efforts on Memory, Entitlements, and Actions. It is fine for bits and pieces of this to be implemented in different teams, although our objective should be to use as much horizontal infrastructure as humanly possible, and we should create some real barriers internally to prevent fragmentation, which will naturally want to occur. This will be outrageously hard. There are a million excuses that we’ll hear for why we can’t do this, the two most significant of which will be “if I take a dependency on them, I won’t be able to move fast enough” or “if I take a dependency on them, they won’t build what I need.”

Those are bullshit excuses. It is true that we have forced dependencies in the past that have slowed folks down, and that have not completely solved the problems that we’ve most needed solved. A lot of that can be traced back to how we’ve structured our “how”, and the incentives that we have given our teams. It’s on us to change the incentives, both holding product teams accountable for building on top of standard platforms, and holding platform teams accountable for building what those teams need. If we can’t make accountability and prioritization work with the structure that we currently have, then we need to refactor organizations, switch leaders, change talent, and do everything in our power to ensure that we’re getting the right leverage here. That is far more important than preserving the status quo.

In addition to getting protocols for Memory, Entitlements, and Actions established and consistently implemented for 1P and 3P, there are pools of resources to which agents are going to need access that we ought to try to ensure have agentic accessibility. One of the most important of these pools of resources is the web. We are now collaborating with, and are likely to hire, one of the inventors of the semantic web to help us build out a thing called the “Natural Language Web,” or NLWeb for short.

NLWeb is closer to what the architects of the web hoped for than the version of the web that has emerged over the past two decades. Imagine if every web service in existence had something called an NLWeb server that used GenAI to organize the resources it provides and to expose access to those resources via a natural language interface. The required server would be surprisingly simple to implement, so simple that there’s already an implementation. If this implementation were open, with push-button deployment for Azure and other hyperscale clouds, it would be straightforward for organizations to deploy in parallel with their existing HTTP servers.

The thing that NLWeb would do most powerfully is provide a high-precision, standard way for web services to present what they offer to the world. NLWeb composes with MCP, and it is almost certain that we would want to make NLWeb servers implement MCP. The value of NLWeb to agents is that it connects them in a powerful, agentic way to the web. The value of NLWeb to web services is that it lets them take back control from the search engines. Rather than the search engine being an all-powerful, traffic-directing aggregator, you can have an ecosystem of agents all using the same protocol to dispatch prompts to web sites and services, bidding to be useful to the user in the context of the task the agent is trying to perform. Search engines will still be useful in an NLWeb world. If nothing else, we could use Bing as a very powerful way to help folks bootstrap their NLWeb implementation. If they don’t want to set up and run an NLWeb service, we can just run one on their behalf, backed by Bing, until they decide that they do.
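As a toy sketch of the idea: a site runs an NLWeb-style endpoint alongside its existing HTTP server, describing what it offers and answering natural-language prompts over its own resources. The endpoint shape, names, and Flask choice are assumptions for illustration, not the actual NLWeb design.

```python
# Toy sketch of an NLWeb-style endpoint running alongside an existing HTTP
# server: the site answers natural-language prompts over its own catalog.
# Endpoint shape and names are assumptions, not the actual NLWeb design;
# answer_with_genai is a placeholder for whatever model the operator deploys.
from flask import Flask, jsonify, request

app = Flask(__name__)

CATALOG = [
    {"title": "Trail running shoes", "price": 120, "in_stock": True},
    {"title": "Road running shoes", "price": 95, "in_stock": False},
]

def answer_with_genai(prompt: str, resources: list) -> str:
    """Placeholder: have a model answer the prompt grounded in the site's own resources."""
    raise NotImplementedError

@app.route("/nlweb", methods=["POST"])
def nlweb():
    prompt = request.get_json()["prompt"]  # e.g., dispatched by a user's agent
    return jsonify({"answer": answer_with_genai(prompt, CATALOG)})
```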

Product is likely the biggest challenge facing us. No one, neither Microsoft nor startup nor big tech competitor, is delivering AI value as fast as they could be. For the broader ecosystem, the reasons are likely twofold: the underlying power of the models has progressed faster than the imagination and execution of product builders, and some of the places where AI can already be transformative in impact, like medicine, have artificial rate limiters in the form of regulation. Those are things that we all must contend with, and even though they are not problems unique to Microsoft, we should go on the offensive to try to create more industry-wide energy to move faster and more ambitiously.

That said, there are product problems specific to Microsoft that, left unaddressed, will prevent us from delivering on the full value of AI as quickly as we must. Those problems, ironically, have very little to do with AI. The rate of change with AI is just shining the world’s biggest spotlight on problems that we’ve had since the incredible software development and delivery advantages we had in the early days of the company were bulldozed by the Internet. In short, those problems are:

  1. We talk too much about the software we want to build versus actually building and shipping software. The symptoms of this are that when we sit in product reviews, rather than looking at things that are about to be live or that are already being flighted, along with data about how they are performing, we look at mocks and videos. We have planning cycles where executives take two weeks to review the work we’re proposing to do for the next six months, and deliberate what makes the cut, versus just shipping experiments and adjusting course daily. Etc., etc. It makes total sense to do things this way from the vantage point of Windows XP, or in the context of super complex, highly-regulated things where the cost of being wrong is very high. When no one has yet discovered product truth, when we realistically cannot know prior to launch if our opinions are right or wrong, moving in this way and this slowly relative to folks in a more modern, product-driven growth posture will almost assuredly position us poorly for success.
  2. We don’t use data to make decisions. We spend exorbitant amounts of time making things and we can’t even answer basic questions about whether or not these things are useful. Sometimes that’s because there is no measurement. Sometimes it’s because we are measuring the wrong thing. Sometimes we measure the right thing, but we don’t know how to set thresholds for minimum utility. And sometimes, even when the measurement is there, we don’t use the data in our decision making.
  3. Our attention to product usability is not good. There are a whole bunch of little things that most people can see and recognize as “bad experience” that are everywhere in our products, that we either don’t see, or see and ignore. Each and every one of us runs into this multiple times per day. Maybe ignoring them is the only way not to despair, but when you are the frog who has jumped out of the boiling water, it’s striking how bad some of this stuff is. I had included anecdotes in an earlier draft of this memo, but stripped them out, because the anecdotes are easy to find and there are folks who share them frequently. An observation that I have about us as an SLT is that we too often look at anecdotes not as exemplars of a set of systemic issues that will only change with a broad change in incentives and culture, but as singular point problems that need to be solved. And yes, each anecdote can be fixed. We’re super clever fixers of point problems. The issue isn’t one paper cut. A large quantity of those paper cuts, however, sums up to lots of unnecessary friction that impedes our ability to get users to discover and use experiments and features so that we can cleanly get the data we need to continuously improve.
  4. We are way too slow, and it takes way too many people to do even the simplest things.
  5. We expose the seams between our product groups in ways that at best are not value-adding for users.

What does any of this have to do with AI? While this isn’t true of every group, we have a huge number of product makers spending their time showing recorded demos of things we may never launch, building features no one is ever going to use, features whose performance we are never going to measure, and whose ultimate waste of effort no one is going to be held accountable for. If I just think back to my early days here at Microsoft, when Windows was still its own division, I recall a bunch of things we were doing that fit this pattern. We did them because the group could afford to do them, because they were stuck on a slow release cadence, because they trusted their opinions more than they trusted data, and because their mission was to optimize the legacy, not to disrupt it. We let that continue longer than we should have because we could afford it, and because there was no existential desktop OS competition bearing down on us, aside from what amounted to random noise from ChromeOS/Chromebook, iPadOS/iPad, and MacOS/Mac.

The dynamics of our current AI fight are different. We are in a desperate fight against the clock to find product-market fit for things that will enhance the productivity of our customers, and their customers, by orders of magnitude, that will unlock things for customers that they will bet their whole enterprise on, and that can generate tens of billions of dollars in incremental revenue for us. We have to discover those things ourselves, or we have to be able to move at incredible speed to adapt to changes in the marketplace. In many places at the company, in places with huge potential for value creation from agentic AI breakthroughs, we do not have the systems, culture, and incentives to prospect effectively on the product frontier, nor to move as quickly as we need to move to react to the market. We have the pressure and the sense of urgency. Everyone gets it and feels it. We have lots and lots of hard work happening, and we should be extraordinarily grateful for it. But people are playing the game by the rules we’ve set for them. And it’s likely the wrong game.

Full caveat: I’m just the dumb engineer. I don’t even want to be the product guy. So you can ignore everything I’m saying. But I will say with 100% confidence that if you look at the difference between OpenAI and us, and the success of ChatGPT versus Copilot, 100% of their success relative to ours has nothing to do with their AI chops relative to ours. Most of the product innovation that they have planned does not require a bunch of heavy-duty model training. The biggest thing that they’re planning to do is find clever ways to expose their free users to more advanced functionality so that they can experimentally measure increases in engagement, and so that they in turn can provision upsell funnels and get more feedback from more users about what sorts of model optimizations to do in the future. That’s standard growth hacking 101. And while they are doing this incredibly smart thing, I’ve got folks busting Dee’s chops day and night because they think the mystery will be unlocked for them if they just get access to the model weights. That’s no more likely to happen than the deer at my house are to discover relativity if I leave my daughter’s calculus textbook in the yard for them to read.

So, what I’d love to see us doing with product, after we’ve built such an enviable position in infrastructure, such a great starting point in building out the ecosystem, and even such a great position in brand, is:

  1. Go back to school on proper data-driven product development. No more product reviews with mocks and videos and strategery papers. Let’s review things that are in a launchable state and the data about live experiments that are running. Let’s see evidence that folks are ramping winners and killing losers. Let’s see evidence that experiment rates are going up. And let’s hold people accountable. If they have big teams working on features and we’re never seeing data about how what they’ve launched is performing in terms of driving growth, engagement, and revenue, then let’s no longer let those people make resource allocation and roadmap decisions. There’s no way we’ll be able to find valuable things first until we clean this stuff up.
  2. Let’s push on ambition. If folks want to pitch things that they’ve not yet implemented, let’s let those things be ambitious plans with big commitments on how much engagement and revenue they can drive. If folks can’t make big commitments, let’s question why they should do the projects at all, and at the very least let’s constrain how much time and resources they’re allowed to spend chasing the idea. The rule ought to be: small potential impact = small team and small timeframe; big potential impact = larger team and larger timeframe, but still plan to push things out in small, agile chunks with falsifiable hypotheses we can experimentally evaluate to inform the rest of the project (a rough sketch of what that evaluation can look like follows this list).
  3. Let’s push on giving teams more leverage. Let’s have real conversations about what’s preventing them from executing high-ambition plans. Let’s call their bluffs on redirection. (E.g., when they say we can’t make progress because we can’t train our own model from scratch, or we don’t have access to model weights, or some other thing that is conveniently beyond their control, let’s demand the experiments they ran to reach that conclusion and subject them to real expert scrutiny.) Let’s measure the performance of our shared infrastructure the same way I’m asking us to measure the performance of our products. Let’s obsess over that measurement and use it to make that infrastructure better. And then let’s force teams to take dependencies. Let’s not resource folks to build duplicative efforts. And let’s change the incentive structure for our leaders to force them to think about leverage and product performance, not the number of features they ship or the size of their teams.
  4. Let’s start having some pride in the fit-and-finish and interoperability of our software. The real value that’s likely to get created with agentic AI is going to come from making single agents as capable as possible. We will get almost nothing in terms of value by having teams compete internally with each other to get control over the entry point for the agent. And if we can’t have our systems seamlessly interoperate with one another, and have that interoperation be fast and simple and easy, then we’re just leaving the game to be won by someone else. At some point, fragmentation and friction in a world where AI is making UX simpler is going to become absolutely intolerable.

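To make “falsifiable hypotheses we can experimentally evaluate” concrete, here is a rough, purely illustrative sketch in Python of the kind of readout a product review could center on: a pre-registered engagement-lift commitment, the live experiment’s numbers, and a ramp-or-kill decision driven by the data rather than by opinion. Every number, name, and threshold below is hypothetical; this is not a description of any existing measurement system.

```python
# Illustrative only: evaluating one falsifiable product hypothesis
# ("the new feature lifts engagement by at least 2 points") from a live
# A/B experiment. All figures are hypothetical.
from math import erf, sqrt

def engagement_lift(control_users, control_engaged, treat_users, treat_engaged):
    """Two-proportion z-test on engagement rate; returns (lift, one-sided p-value)."""
    p_c = control_engaged / control_users
    p_t = treat_engaged / treat_users
    p_pool = (control_engaged + treat_engaged) / (control_users + treat_users)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_users + 1 / treat_users))
    z = (p_t - p_c) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # chance of seeing a lift this large if there were none
    return p_t - p_c, p_value

# The "big commitment" made before the project started, not after the fact.
MIN_LIFT = 0.02   # at least a 2-point absolute lift in engagement rate
ALPHA = 0.05      # significance bar for calling the result real

lift, p = engagement_lift(control_users=50_000, control_engaged=6_100,
                          treat_users=50_000, treat_engaged=7_400)
if lift >= MIN_LIFT and p < ALPHA:
    print(f"Ramp the winner: {lift:+.1%} engagement (p = {p:.4f})")
else:
    print(f"Kill or iterate: {lift:+.1%} engagement (p = {p:.4f})")
```

The specific statistic matters less than the shape of the conversation: the review question (ramp, kill, or keep iterating) reduces to numbers that were agreed on before the work started.
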
At the end of the day this is all about culture and incentives. No one is doing any of what they’re doing right now because they’re stupid or lazy or they want to lose.

Everyone comes to Microsoft, give or take, every single day wanting to do something that matters, that will make lives better in some way, and that will benefit both their career and the company’s success. To the extent that we’re not getting the outcomes that we need, that’s on us. We’ve configured the pieces on the playing board. We’re telling people to play the game. We’re mostly still telling people how to play the last game. But the new game is upon us. And the clock is ticking.