<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Matt Eland</title>
    <description>The latest articles on Forem by Matt Eland (@integerman).</description>
    <link>https://forem.com/integerman</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F71302%2F74cf3bad-1dea-411b-b68b-d34cbcedd625.jpg</url>
      <title>Forem: Matt Eland</title>
      <link>https://forem.com/integerman</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/integerman"/>
    <language>en</language>
    <item>
      <title>Tracking AI system performance using AI Evaluation Reports</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 09 Sep 2025 20:08:51 +0000</pubDate>
      <link>https://forem.com/leading-edje/tracking-ai-system-performance-using-ai-evaluation-reports-376n</link>
      <guid>https://forem.com/leading-edje/tracking-ai-system-performance-using-ai-evaluation-reports-376n</guid>
      <description>&lt;p&gt;A few months ago I wrote about &lt;a href="https://blog.leadingedje.com/post/ai/evaluation.html" rel="noopener noreferrer"&gt;how the AI Evaluation Library can help automate evaluating LLM applications&lt;/a&gt;. This capability is tremendously helpful in measuring the quality of your AI solutions, but it's only a part of the picture in terms of representing your application quality. In this article I'll walk through the AI Evaluation Reporting library and show how you can build interactive reports that help share model quality with your whole team, including product managers, testers, developers, and executives.&lt;/p&gt;

&lt;p&gt;This article will start with an exploration of the final report and its capabilities, then dive into the handful of lines of C# code needed to generate the report in .NET using the &lt;a href="https://learn.microsoft.com/en-us/dotnet/ai/tutorials/evaluate-with-reporting" rel="noopener noreferrer"&gt;Microsoft.Extensions.AI.Evaluation.Reporting&lt;/a&gt; library before concluding with thoughts on where this capability fits into your day-to-day workflow.&lt;/p&gt;

&lt;h2&gt;The Extensions AI Evaluation Report&lt;/h2&gt;

&lt;p&gt;Let's start by taking a look at what we're talking about here: The AI Evaluation Report showcasing the performance of a series of different &lt;strong&gt;evaluators&lt;/strong&gt; as they grade a sample interaction produced by an LLM application:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yvdxans5xsweeiz5qzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6yvdxans5xsweeiz5qzw.png" alt="AI Evaluation Report showing a series of evaluation results" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This particular example features a single scenario where an AI agent is instructed to respond to interactions with humorous haikus related to the topic the user is mentioning:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;System Prompt&lt;/strong&gt;: You are a joke haiku bot. Listen to what the user says, then respond with a humorous haiku.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User&lt;/strong&gt;: I'm learning about AI Evaluation and reporting&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assistant&lt;/strong&gt;: I grade clever bots, reports spill midnight secrets, robots giggle on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While not the best interaction, the system came close to doing what it was instructed to do, and the report summarizes the strengths and weaknesses of the system in handling this interaction.&lt;/p&gt;

&lt;p&gt;Let's talk about how it works.&lt;/p&gt;

&lt;p&gt;This "report card" was generated by sending the conversation history to an LLM with instructions on how to evaluate it for different capabilities including coherence, English fluency / grammatical correctness, relevance, truthfulness, and completeness.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo2qlaxi0hahib6l8ice.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo2qlaxi0hahib6l8ice.png" alt="Communication diagram showing the different LLMs being used to generate and evaluate replies" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This evaluation is performed using an LLM and specially prepared prompts built for grading this kind of interaction. The evaluation LLM can be the same model you used for the conversation or a different one entirely.&lt;/p&gt;

&lt;p&gt;The results of this evaluation are persisted in a data store (such as on disk or on Azure) and are available both to show trends over time and to generate periodic reports in HTML format.&lt;/p&gt;

&lt;p&gt;Because the evaluation report is an HTML document, it allows for some interactive features. For example, you can click into a particular evaluator and see details on its evaluation, as shown here for the Fluency evaluator:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbw9y4pue7rghy7f9jjj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbw9y4pue7rghy7f9jjj.png" alt="Evaluation Details for the Fluency Metric" width="800" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we can see the fluency evaluator giving the response middling marks for English fluency, likely because that evaluator is designed more for conversational English and articles than for haikus like the one our bot generates.&lt;/p&gt;

&lt;p&gt;Note that we can also see some specific metrics on the tokens used, the time taken, and the specific model used for evaluation.&lt;/p&gt;

&lt;h2&gt;Implementing an AI Evaluation Report in .NET&lt;/h2&gt;

&lt;p&gt;There are a few more aspects of this evaluation report we'll highlight, and we'll talk later about the role this report plays in your organization, but for now let's talk about how to generate it.&lt;/p&gt;

&lt;p&gt;In this section I'll walk through the C# code needed to generate the report shown here in this article.&lt;/p&gt;

&lt;p&gt;This code is taken directly from &lt;a href="https://github.com/IntegerMan/AIEvaluationSamples" rel="noopener noreferrer"&gt;my GitHub repository&lt;/a&gt; and is specifically inside of the &lt;code&gt;EvaluationReportGeneration&lt;/code&gt; project.&lt;/p&gt;

&lt;h3&gt;Connecting to Chat and Evaluation Models&lt;/h3&gt;

&lt;p&gt;The first thing our application needs is a chat client for AI evaluation as well as one for chat completions. I'll set these up here with two &lt;code&gt;OpenAIClient&lt;/code&gt; objects representing our chat and evaluation models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Load Settings&lt;/span&gt;
&lt;span class="n"&gt;ReportGenerationDemoSettings&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConfigurationHelpers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LoadSettings&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ReportGenerationDemoSettings&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Connect to OpenAI&lt;/span&gt;
&lt;span class="n"&gt;OpenAIClientOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAIEndpoint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;ApiKeyCredential&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ApiKeyCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAIKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;IChatClient&lt;/span&gt; &lt;span class="n"&gt;evalClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OpenAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EvaluationModelName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsIChatClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;IChatClient&lt;/span&gt; &lt;span class="n"&gt;chatClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OpenAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatModelName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsIChatClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can connect chat and evaluation to any model provider with an &lt;code&gt;IChatClient&lt;/code&gt; implementation; implementations are available or in preview for all major model providers, including OpenAI, Azure, Ollama, Anthropic, and more.&lt;/p&gt;

&lt;p&gt;In this article I'm using &lt;code&gt;o3-mini&lt;/code&gt; as the chat model generating responses and &lt;code&gt;gpt-4o&lt;/code&gt; as the evaluation model (the model Microsoft recommends for evaluation as of this writing).&lt;/p&gt;

&lt;h3&gt;Creating a Report Configuration&lt;/h3&gt;

&lt;p&gt;Now that we've got our chat clients ready, our next step is to create a &lt;code&gt;ReportingConfiguration&lt;/code&gt;, which will store the raw metrics and conversations that are evaluated over time. This helps centralize reporting data and build trends over time in reports.&lt;/p&gt;

&lt;p&gt;There are currently two supported default options for this: &lt;code&gt;DiskBasedReportingConfiguration&lt;/code&gt;, which stores data on disk in a location you specify, and the &lt;code&gt;AzureStorageReportingConfiguration&lt;/code&gt; option found in the &lt;code&gt;Microsoft.Extensions.AI.Evaluation.Reporting.Azure&lt;/code&gt; package.&lt;/p&gt;

&lt;p&gt;We'll go with the disk-based configuration in this sample because it's far simpler to configure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Set up reporting configuration to store results on disk&lt;/span&gt;
&lt;span class="n"&gt;ReportingConfiguration&lt;/span&gt; &lt;span class="n"&gt;reportConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DiskBasedReportingConfiguration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;storageRootPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;@"C:\dev\Reporting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;chatConfiguration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ChatConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evalClient&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;executionName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;$"&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;yyyyMMddTHHmmss&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;evaluators&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RelevanceTruthAndCompletenessEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CoherenceEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FluencyEvaluator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we create our &lt;code&gt;ReportingConfiguration&lt;/code&gt; by telling it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where to store the raw report metrics on disk (not the location for the generated report file)&lt;/li&gt;
&lt;li&gt;Which chat connection it should use to evaluate the interactions&lt;/li&gt;
&lt;li&gt;A unique name for the evaluation run. This will be used to generate folder names so only certain characters are allowed.&lt;/li&gt;
&lt;li&gt;One or more &lt;code&gt;IEvaluator&lt;/code&gt; objects to use in generating evaluation metrics. This is equivalent to using a &lt;code&gt;CompositeEvaluator&lt;/code&gt; like I demonstrated in my prior article.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;More on Evaluators:&lt;/em&gt; If you're looking for more detail on the various evaluators you can use or how they work, I go into each of these evaluators more in my &lt;a href="https://blog.leadingedje.com/post/ai/evaluation.html" rel="noopener noreferrer"&gt;article on MEAI Evaluation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can also specify tags that apply to your entire evaluation run here, but I'll cover tags in a future article.&lt;/p&gt;

&lt;h3&gt;Defining a Scenario Run&lt;/h3&gt;

&lt;p&gt;Evaluation reports have one or more scenario runs associated with them, each representing a specific test case.&lt;/p&gt;

&lt;p&gt;We'll create a single "Joke Haiku Bot" scenario for our purposes here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Start a scenario run to capture results for this scenario&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ScenarioRun&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;reportConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;CreateScenarioRunAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="n"&gt;scenarioName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Joke Haiku Bot"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="c1"&gt;// Contents detailed in next few snippets...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that we're using an &lt;code&gt;await using&lt;/code&gt; around the whole context of our &lt;code&gt;ScenarioRun&lt;/code&gt; object. This makes sure the run is properly disposed, which causes its metrics to be reported to the reporting configuration object and persisted to disk.&lt;/p&gt;

&lt;p&gt;If we had additional scenarios, we could define each one sequentially so that we're aggregating our evaluation results into a single report. In this article we'll keep things simple and look only at a single case, but in our next article in the series I'll cover iteration, experimentation, and multiple scenarios.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Important Note:&lt;/em&gt; It's important that any &lt;code&gt;ScenarioRun&lt;/code&gt; objects you're using for your evaluation are disposed before you use their evaluation metrics to generate a report. This is why I'm using the &lt;code&gt;await using&lt;/code&gt; syntax here as well as explicitly declaring the scope of the object instead of using the newer "scopeless" style of defining the object in a &lt;code&gt;using&lt;/code&gt; statement.&lt;/p&gt;
&lt;/blockquote&gt;
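&lt;p&gt;As a rough sketch of what that might look like (the second scenario name here is purely hypothetical), several scenarios could be evaluated sequentially against the same reporting configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Hypothetical sketch: each run is disposed before any report is generated
string[] scenarioNames = ["Joke Haiku Bot", "Joke Limerick Bot"];

foreach (string scenarioName in scenarioNames)
{
    await using (ScenarioRun run = await reportConfig.CreateScenarioRunAsync(scenarioName))
    {
        // Get a chat response and evaluate it, as shown in the next snippets
    }
}
&lt;/code&gt;&lt;/pre&gt;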

&lt;h3&gt;Getting and Evaluating a Response&lt;/h3&gt;

&lt;p&gt;Now that we have an active &lt;code&gt;ScenarioRun&lt;/code&gt; object we need a list of &lt;code&gt;ChatMessage&lt;/code&gt; objects to send to the chat model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"You are a joke haiku bot. Listen to what the user says, then respond with a humorous haiku."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;userText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"I'm learning about AI Evaluation and reporting"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;systemPrompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that in place, we send the messages to the chat model using our chat client and get back a &lt;code&gt;ChatResponse&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Use our CHAT model to generate a response&lt;/span&gt;
&lt;span class="n"&gt;ChatResponse&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chatClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetResponseAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This particular example is using the &lt;code&gt;IChatClient&lt;/code&gt; defined in the &lt;code&gt;Microsoft.Extensions.AI&lt;/code&gt; (MEAI) package to do this, but you could use something else such as Semantic Kernel or another library, or even just hard-code a chat response you've observed in the wild.&lt;/p&gt;

&lt;p&gt;Once we have our list of messages and the model's response, we can send both of them to our &lt;code&gt;ScenarioRun&lt;/code&gt; for evaluation with a single line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Use the EVALUATION model to grade the response using our evaluators&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;EvaluateAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This call returns an &lt;code&gt;EvaluationResult&lt;/code&gt; object if you want to look at the immediate output of the evaluation, but the results will also be persisted to our reporting configuration, so we don't &lt;em&gt;need&lt;/em&gt; to take immediate action on them.&lt;/p&gt;
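&lt;p&gt;If you do want to act on the result immediately, a minimal sketch (assuming the &lt;code&gt;Metrics&lt;/code&gt; dictionary exposed by &lt;code&gt;EvaluationResult&lt;/code&gt;) could log each numeric score as it comes back:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;// Capture the return value of the evaluation call shown above
EvaluationResult result = await run.EvaluateAsync(messages, response);

// Log each numeric metric by name (these scores typically range from 1 to 5)
foreach ((string name, EvaluationMetric metric) in result.Metrics)
{
    if (metric is NumericMetric numeric)
    {
        Console.WriteLine($"{name}: {numeric.Value}");
    }
}
&lt;/code&gt;&lt;/pre&gt;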

&lt;h3&gt;Generating an AI Evaluation Report&lt;/h3&gt;

&lt;p&gt;We've now created our reporting configuration, started a scenario, gotten a chat response, and then used our evaluators to grade it. Let's talk about actually building an HTML report from our evaluation data.&lt;/p&gt;

&lt;p&gt;The first step of this is to identify the data that should be included in our report.&lt;/p&gt;

&lt;p&gt;While it may seem like we already have that data, the evaluation report can show trends across your different evaluations over time, which is handy for seeing how experiments are impacting your system's overall quality.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc488jjv47dpjv1itwvj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpc488jjv47dpjv1itwvj.png" alt="Trends over time showing some fluctuation in the overall metrics for a scenario" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I typically include the last 5 results in my reports, and use this snippet to grab that data from my reporting configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Enumerate the last 5 executions and add them to our list we'll use for reporting&lt;/span&gt;
&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ScenarioRunResult&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reportConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResultStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetLatestExecutionNamesAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;reportConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResultStore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ReadResultsAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we'll use these results from our scenarios to generate the output report file. Reports can be written in JSON or HTML format. I'll typically choose the HTML option because these reports include an option to export the underlying JSON if you need it.&lt;/p&gt;

&lt;p&gt;The code for this is fairly simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;reportFilePath&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Combine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Environment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CurrentDirectory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Report.html"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;IEvaluationReportWriter&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;HtmlReportWriter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reportFilePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteReportAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This generates a new report in the &lt;code&gt;Report.html&lt;/code&gt; file we specified. You can then open up that file manually and see the results, or you can start a process to open this report in your default web browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;Process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;ProcessStartInfo&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="n"&gt;FileName&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;reportFilePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;UseShellExecute&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When this executes, the user's operating system will handle the report just as if the user had double-clicked the file in their file system - potentially opening a web browser or asking what action they'd like to take with this file or type of file.&lt;/p&gt;

&lt;h2&gt;Practical uses for AI Evaluation Reports&lt;/h2&gt;

&lt;p&gt;Now that we've covered AI Evaluation reports and how to generate them using C#, let's close this article with a discussion of how this technology potentially fits into your workflow.&lt;/p&gt;

&lt;p&gt;First of all, if you're looking for a way of evaluating your AI systems, AI Evaluation reports are a fantastic option, even for a solo developer trying to understand the performance of their hobby projects. The graphical reports and the ability to click into details are easier to work with than the raw &lt;code&gt;EvaluationResult&lt;/code&gt; objects and their nested metric objects.&lt;/p&gt;

&lt;p&gt;For more serious usage, AI Evaluation has tremendous merit because it equips you to share something graphical with others to help them understand how your application behaves with different implementations. Instead of having conversations about your models being "good" or "not good enough", you can have targeted conversations about the specific interactions your system is succeeding with and those it is struggling with.&lt;/p&gt;

&lt;p&gt;Because these HTML files are interactive and intuitive, this technology enables people in your organization to explore the examples on their own and internalize more of the system's strengths and weaknesses. In a nutshell, these reports make it easy to see and share information about the state of your AI systems.&lt;/p&gt;

&lt;p&gt;I view AI evaluation as a vital part of integration testing and the MLOps process prior to any new deployment - or potentially even to block feature branches from rejoining the &lt;code&gt;main&lt;/code&gt; product branch as part of the pull request review process in development. Having a graphical report to go with it can help you understand the trends and performance of your models over time and how different changes impact that performance.&lt;/p&gt;

&lt;h2&gt;Final Recommendations&lt;/h2&gt;

&lt;p&gt;AI Evaluation and evaluation reporting are important aspects of your team's success in its AI offerings.&lt;/p&gt;

&lt;p&gt;Here are some closing recommendations I have when adopting AI evaluation tooling into your organization:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AI Evaluation and evaluation reports are key parts of any significant release that updates the behavior of an AI agent and should be part of your quality assurance and product management efforts.&lt;/li&gt;
&lt;li&gt;Automating AI Evaluation as part of your integration tests is worth the effort. You can optionally have significant degradation of evaluated quality fail your tests when run as an actual integration test (not covered in this article, but I plan to write more on this in the future).&lt;/li&gt;
&lt;li&gt;The quality of your evaluation model matters. It's worth using a more capable model for this as it's more likely to grasp the full context of the request and the response that was generated.&lt;/li&gt;
&lt;li&gt;Having automated evaluation in place frees you up to do more experimentation around your system prompts, model selection, and other parameters and settings. Make sure you have this automation in place to collect metrics before doing serious performance tuning of your models as these metrics can help guide your decision-making and refinement process.&lt;/li&gt;
&lt;li&gt;Store your model metrics in a centralized location for tracking over time. I recommend a dedicated shared location just for release candidates as well as local metrics storage for developers during development and testing.&lt;/li&gt;
&lt;li&gt;The resulting HTML reports should be shared with your entire team, including organizational leadership, quality assurance, and product owners. This practice helps cut through the hype and fear around AI systems and allows your full team to more meaningfully understand what your system is good and bad at.&lt;/li&gt;
&lt;li&gt;Just because your metrics are high doesn't mean that your system is performing well. It just means it's performing well for the interactions you're measuring and observing.&lt;/li&gt;
&lt;li&gt;As your system grows, its evaluation suite should grow as well. As you find new interactions it struggles with or add new capabilities to the system, add new scenarios to represent them.&lt;/li&gt;
&lt;/ol&gt;
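&lt;p&gt;The gating idea from recommendation 2 can be sketched in a few lines. The snippet below is a minimal illustration in Python rather than the actual Microsoft.Extensions.AI.Evaluation API; the metric names, scores, and threshold are all hypothetical:&lt;/p&gt;

```python
def check_quality_gate(current: dict, baseline: dict, max_drop: float = 0.5) -> list:
    """Return the metrics whose score dropped more than max_drop versus the baseline."""
    failures = []
    for metric, baseline_score in baseline.items():
        score = current.get(metric, 0.0)
        if baseline_score - score > max_drop:
            failures.append(f"{metric}: {score:.1f} vs baseline {baseline_score:.1f}")
    return failures

# Example scores on a 1-5 scale, as many LLM evaluators report them
baseline = {"Relevance": 4.5, "Coherence": 4.2, "Groundedness": 4.8}
current = {"Relevance": 4.4, "Coherence": 4.0, "Groundedness": 4.7}

# In an integration test, a non-empty failure list would fail the build
failures = check_quality_gate(current, baseline)
assert not failures, f"Evaluation quality degraded: {failures}"
```

&lt;p&gt;Wired into a test suite, a check like this turns your stored metrics into an automated release gate rather than a report someone has to remember to read.&lt;/p&gt;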

&lt;p&gt;I view AI Evaluation as a vital part of the development of AI systems, and evaluation reports make these systems far more understandable to your whole team.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>csharp</category>
      <category>testing</category>
      <category>llm</category>
    </item>
    <item>
      <title>Coding Agents are here: Is your team ready for AI devs?</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 05 Aug 2025 18:47:15 +0000</pubDate>
      <link>https://forem.com/leading-edje/coding-agents-are-here-is-your-team-ready-for-ai-devs-3dk2</link>
      <guid>https://forem.com/leading-edje/coding-agents-are-here-is-your-team-ready-for-ai-devs-3dk2</guid>
      <description>&lt;p&gt;In this post we'll explore the concept of AI agents as software engineers on your development team. The idea that you could write up an enhancement or bug fix, assign it to an AI team member, and review what it came up with a short while later would have sounded fantastical only a few years ago. And yet, with &lt;a href="https://blog.leadingedje.com/post/microsoft-build-2025-wrapped.html#github-copilot-coding-agent" rel="noopener noreferrer"&gt;the announcement and preview of GitHub Copilot Agents&lt;/a&gt;, this is a real technology that you can make use of today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing GitHub Copilot Coding Agents
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.github.com/en/copilot/how-tos/agents/copilot-coding-agent" rel="noopener noreferrer"&gt;GitHub Copilot Coding Agent&lt;/a&gt; is a new technology, currently in preview, associated with GitHub pro and enterprise accounts. With Coding Agents you can assign individual issues in a GitHub repository to GitHub Copilot, just like you were assigning it to a team member.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sa0bv2xuyfbzrhixrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa3sa0bv2xuyfbzrhixrt.png" alt="Assigning an issue to GitHub Copilot" width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copilot will then create a branch for your issue, just like a developer would, and begin to plan its approach to carrying out the work item.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65ny8xz66xjfa93j73ho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65ny8xz66xjfa93j73ho.png" alt="The newly created branch" width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Quota Usage Note:&lt;/strong&gt; GitHub Copilot Coding Agent uses part of your account's allocated monthly &lt;strong&gt;premium requests&lt;/strong&gt;. Check out &lt;a href="https://docs.github.com/en/copilot/concepts/copilot-billing/understanding-and-managing-requests-in-copilot" rel="noopener noreferrer"&gt;GitHub's documentation on current quota and billing information&lt;/a&gt; for this evolving product.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Coding agents at work
&lt;/h3&gt;

&lt;p&gt;After analyzing the issue and your repository, Copilot forms a plan of action to accomplish the work you've assigned to it. As it works, its progress comment updates to reflect accomplished tasks and remaining work. This helps you monitor its progress and acts as a reference that grounds the AI agent as it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bs54qo7ica9fxs6ooy6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bs54qo7ica9fxs6ooy6.png" alt="Copilot tracking and communicating its progress on a branch" width="800" height="641"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copilot will analyze your code as it works and can also make use of additional resources such as Model Context Protocol (MCP) servers you have &lt;a href="https://docs.github.com/en/copilot/tutorials/enhancing-copilot-agent-mode-with-mcp" rel="noopener noreferrer"&gt;configured on GitHub&lt;/a&gt; and &lt;a href="https://docs.github.com/en/copilot/how-tos/agents/copilot-coding-agent/best-practices-for-using-copilot-to-work-on-tasks#adding-custom-instructions-to-your-repository" rel="noopener noreferrer"&gt;optional additional documentation you provide for Copilot on the structure of your repository&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Interested in MCP Servers or AI Architectures?&lt;/strong&gt; Check out Leading EDJE's reference architecture articles on &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/codingassistant.html" rel="noopener noreferrer"&gt;augmenting development teams with MCP servers&lt;/a&gt; and &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/webchatassistant.html" rel="noopener noreferrer"&gt;team-productivity solutions with MCP servers&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I've also noticed Copilot using command line tools to find relevant strings in files in your repository, which can help it orient itself. Copilot is also capable of executing commands to build and test applications, and it can even resolve build issues caused by missing dependencies in its environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security and GitHub Copilot
&lt;/h3&gt;

&lt;p&gt;While all of these capabilities are interesting, they also raise important security questions. By default, Copilot has &lt;a href="https://docs.github.com/en/copilot/how-tos/agents/copilot-coding-agent/customizing-or-disabling-the-firewall-for-copilot-coding-agent" rel="noopener noreferrer"&gt;firewall rules in place&lt;/a&gt; that prevent it from working outside of GitHub's sandboxed ecosystem. Additionally, any violations of these policies will be logged for your review later on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl7wjrcyzw00olyvtn0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxl7wjrcyzw00olyvtn0k.png" alt="GitHub highlighting firewall rules that triggered during agent execution" width="672" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that these features are currently only available if your code is on GitHub and you have a paid plan that supports them, so you also benefit from all of GitHub's standard and enterprise security features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Completing a pull request
&lt;/h3&gt;

&lt;p&gt;When Copilot believes it is done, it notifies the person who assigned it the task, who can then review the pull request. The developer can either approve the pull request or request changes. If changes are requested, Copilot will respond to the comments and notify you when the work item is ready for review again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tk9sf41qrjf2jew0sav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tk9sf41qrjf2jew0sav.png" alt="Copilot summarizing its finished results" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you're satisfied with Copilot's work, you can mark the pull request as ready for review. This triggers more parts of your workflow, such as additional tests or having the standard GitHub Copilot system review the pull request, summarize it, and make suggestions. When everything looks good, you can approve the pull request and merge it; the change becomes part of your codebase and GitHub closes the issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can AI agents really serve as team members?
&lt;/h2&gt;

&lt;p&gt;So, how good is Copilot? Is this going to replace members of your team?&lt;/p&gt;

&lt;p&gt;Well, probably not, but it might change how you work or how you hire.&lt;/p&gt;

&lt;h3&gt;
  
  
  How coding agents change how I write code
&lt;/h3&gt;

&lt;p&gt;I'm early on in my journey working with Copilot, but I'm already impressed. As an experienced developer I can write out what I'm trying to accomplish in technical terms, assign it to Copilot, and see it produce a result that's close to what I envisioned. This does require me to think about how I would try to solve the problem, the type of solution I'd like to see, and any areas of concern I have with potential implementations. Interestingly, this is the kind of thing I'd normally communicate to a technical team member through a direct message or a comment on a work item anyway.&lt;/p&gt;

&lt;p&gt;Because AI agents work quickly, I've found myself gathering thoughts on relatively simple changes, sending them over to Copilot, and then coming back to the topic later when I have more focus. I can easily see senior engineers writing up a request, sending it to Copilot, attending a meeting, then reviewing and improving Copilot's result once they're free.&lt;/p&gt;

&lt;p&gt;I've also found myself starting development sessions by reviewing what Copilot sent me between work sessions, which can be a great way of getting into the flow of something - or of taking care of the more tedious aspects of software engineering while I focus on the strategic direction or the specific concerns I care about.&lt;/p&gt;

&lt;h3&gt;
  
  
  How coding agents might impact hiring
&lt;/h3&gt;

&lt;p&gt;If you look at the prior section, you'll see that a lot of what I'm doing is more senior or supervisory in nature rather than directly authoring code. While I'm still writing code and enjoying it, I'm finding that the code I write is less boilerplate or trivial and more specialized and strategic.&lt;/p&gt;

&lt;p&gt;This is good, but not every developer can do it. You need a certain level of experience to guide other developers and evaluate their code effectively, and the same holds true for guiding AI.&lt;/p&gt;

&lt;p&gt;Because of this, I view AI as filling similar roles to more junior team members: executing on well-defined and standardized tasks that can be easily communicated.&lt;/p&gt;

&lt;p&gt;While AI doesn't &lt;em&gt;replace&lt;/em&gt; the more junior team members in your organization, it does rival them in some ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI agents are becoming increasingly able to get unstuck on their own&lt;/li&gt;
&lt;li&gt;Copilot has the breadth of its training data available to it, so it likely knows libraries junior devs don't yet&lt;/li&gt;
&lt;li&gt;Copilot works fast and can produce a large volume of code in a short time, outpacing even senior developers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, junior team members have a lot of value as well, and can do things that AI can't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Greater domain knowledge of your organization, its products, its data, and its business context&lt;/li&gt;
&lt;li&gt;The ability to effectively consider the end user and the context in which code operates&lt;/li&gt;
&lt;li&gt;Human-level problem-solving, common sense, and decision-making&lt;/li&gt;
&lt;li&gt;The ability to debug problems or scenarios related to specific data situations&lt;/li&gt;
&lt;li&gt;The tendency to grow into senior engineers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good junior developer provides far more value to an organization than a solid AI agent can, but I can see the temptation for organizations to rely on AI agents instead of junior developers, and that scares me for our industry and for the many talented people already struggling to get a foot in the door.&lt;/p&gt;

&lt;p&gt;Ultimately, if you're an organization with busy senior engineers and code already on GitHub, Coding Agents are something you should try out to see how they impact your workflow and productivity. Just be cautious: although Coding Agent is usually cheaper than a junior engineer's salary, it's not a replacement for talented, growing, flexible human engineers on your team.&lt;/p&gt;

&lt;p&gt;I think at this point, it's best to talk about where AI agents fall short in more detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where AI agents fall short
&lt;/h2&gt;

&lt;p&gt;In conversations about AI and particularly about AI productivity there's a central truth that is often overlooked: &lt;strong&gt;Most of software engineering isn't about writing code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I've been writing code for the vast majority of my life, and over two decades professionally. While a lot of my job is around writing code, that code only comes after:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understanding the business need or current inadequate behavior&lt;/li&gt;
&lt;li&gt;Determining what an ideal solution should do&lt;/li&gt;
&lt;li&gt;Identifying several different ways of achieving this goal&lt;/li&gt;
&lt;li&gt;Selecting a leading candidate for implementation - often with collaboration from others who understand other areas and other needs&lt;/li&gt;
&lt;li&gt;Identifying places in code that will need to be adjusted to support the change&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Making code changes to support the new behavior&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Ensuring the code works&lt;/li&gt;
&lt;li&gt;Thinking of ways the code might break and edge cases we might not have considered, then making sure the code handles those as well&lt;/li&gt;
&lt;li&gt;Ensuring the code is as secure, testable, and performant as we expected it to be when we selected our candidate solution&lt;/li&gt;
&lt;li&gt;Communicating the change in documentation and to others.&lt;/li&gt;
&lt;li&gt;Ensuring the change flows through the processes for feedback, testing, and deployment to various environments&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can see, code changes are only a small portion of software engineering, yet we pay an inordinate amount of attention to them when we think about AI productivity solutions and even when we consider using offshore development resources.&lt;/p&gt;

&lt;p&gt;While I believe AI systems can already perform some of these steps to some degree, we tend to evaluate their effectiveness mostly on authoring new content, which is a strength area for AI systems. However, humans bring strong skills across all of these areas, and knowing what to change, understanding the implications of different approaches, and seeing how a change fits into the existing data and application architecture are critically important parts of software engineering.&lt;/p&gt;

&lt;p&gt;Also keep in mind that in modern software engineering a change is often needed across multiple services and databases. While an AI agent might be able to handle changes in one place, it may be less equipped to identify all the services that need to change and to make the requisite changes in each of those areas.&lt;/p&gt;

&lt;h3&gt;
  
  
  AI and Human Partnership
&lt;/h3&gt;

&lt;p&gt;Because of the complexity of software engineering and the relative strengths and weaknesses of AI and humans, I think that AI agents and AI tooling are best deployed for targeted tasks that have been thought through by an experienced engineer.&lt;/p&gt;

&lt;p&gt;An ideal flow might be:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Engineers vet an organization's needs and determine a series of technical changes needed to support the new goals&lt;/li&gt;
&lt;li&gt;These individual changes are written up as work items and assigned either to engineers or to AI agents&lt;/li&gt;
&lt;li&gt;AI agents or engineers work on the change and submit a draft pull request for review&lt;/li&gt;
&lt;li&gt;The change is manually tested and verified by another engineer who uses it as a starting point for the final pull request&lt;/li&gt;
&lt;li&gt;The developer makes additional improvements, changes, and tests to support the pull request and ensure it fully meets the organization's needs&lt;/li&gt;
&lt;li&gt;The PR is marked ready for review&lt;/li&gt;
&lt;li&gt;Other developers review the PR, familiarize themselves with the changes, and leave comments&lt;/li&gt;
&lt;li&gt;The change eventually merges into the main branch and reaches production, where it will be supported by a team that understands the changes and designed the approach.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Where software engineering may be headed with AI Copilots
&lt;/h2&gt;

&lt;p&gt;AI agents like GitHub Copilot are powerful and destined to change how organizations and engineers hire and work.&lt;/p&gt;

&lt;p&gt;I believe that software engineers can focus on the big picture and stay oriented around the technical changes going on in their systems while using AI to do the majority of the work on the well-defined tasks those engineers define, then customize the final behavior of those changes.&lt;/p&gt;

&lt;p&gt;Not every change will benefit from AI, and some more sensitive pieces of work reveal new things to think about with each line of code that needs to be modified or added, but a strategic deployment of AI can help busy engineers remain productive between meetings and make the most of a packed schedule.&lt;/p&gt;

&lt;p&gt;I also hope that the emergence of AI as a skilled solution implementer will help focus both experienced and new software engineers on the core competencies that are uniquely theirs: domain knowledge, communication skills, past experience, empathy for users and business stakeholders, and the ability to evaluate a number of different plans and possible implementations and select the one that is right for the business today and where it's going tomorrow.&lt;/p&gt;

&lt;p&gt;AI is advancing at a tremendous rate and being an efficient, experienced, well-rounded, and adaptable engineer is more important than ever, but I'm glad to have copilots along for the ride as we build new things together.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>github</category>
      <category>productivity</category>
      <category>agents</category>
    </item>
    <item>
      <title>Reference Architecture for Team AI Productivity</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 01 Jul 2025 20:58:28 +0000</pubDate>
      <link>https://forem.com/leading-edje/reference-architecture-for-team-ai-productivity-1dpi</link>
      <guid>https://forem.com/leading-edje/reference-architecture-for-team-ai-productivity-1dpi</guid>
      <description>&lt;p&gt;Let's discuss a sample reference architecture for providing a secure and convenient way for your organization to chat with pre-approved AI capabilities.&lt;/p&gt;

&lt;p&gt;Previously in this series we discussed &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/websiteragchat.html" rel="noopener noreferrer"&gt;Website RAG Chat&lt;/a&gt; and &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/codingassistant.html" rel="noopener noreferrer"&gt;Developer AI Productivity&lt;/a&gt; reference architectures. Those architectures are valid and helpful for delivering rich AI capabilities to your customers and developer team, but what about the rest of your organization?&lt;/p&gt;

&lt;p&gt;In this article we'll lay out a reference architecture that allows different members of your organization to safely enhance their workflows through AI, and do so with the knowledge that organizational data is being handled securely and intellectual property is being respected.&lt;/p&gt;

&lt;p&gt;While this architecture could work using a variety of different technologies, specific examples and screenshots will feature the popular &lt;a href="https://docs.openwebui.com/" rel="noopener noreferrer"&gt;Open WebUI&lt;/a&gt; conversational AI platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;You may not immediately think that providing a team-wide AI system would be extremely helpful, but when we unveiled our "Chat EDJE" conversational AI solution at &lt;a href="https://LeadingEDJE.com" rel="noopener noreferrer"&gt;Leading EDJE&lt;/a&gt;, we noticed immediate and profound impacts across all types of users on our teams.&lt;/p&gt;

&lt;p&gt;While developers were excited about these capabilities and took advantage of them for complex tasks like improving SQL performance by comparing a complex query with its execution plan, we also saw some tremendous benefits for non-developers.&lt;/p&gt;

&lt;p&gt;We saw project managers generating new ideas relevant to their teams, web designers gaining specialized insights into technical search engine optimization (SEO) considerations, and analysts quickly finding and fixing anomalies in data, while all team members benefited from summarization, email drafting, and proofreading capabilities.&lt;/p&gt;

&lt;p&gt;While we weren't initially sure that AI tooling would truly help all team members, we continue to be blown away by the impact of secure, reliable AI tooling made available to the entire organization and governed by the organization's IT staff.&lt;/p&gt;

&lt;p&gt;Let's talk about how this works.&lt;/p&gt;

&lt;h2&gt;
  
  
  A sample architecture
&lt;/h2&gt;

&lt;p&gt;A team productivity conversational web AI chat architecture consists of the following required components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;web chat portal&lt;/strong&gt; for hosting the conversations&lt;/li&gt;
&lt;li&gt;One or more registered &lt;strong&gt;model providers&lt;/strong&gt; that make public or private conversational AI models (typically LLMs) available to the team&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;management layer&lt;/strong&gt; allowing administrators to configure the system and control the available models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8sg64cbldiwlz1txf44.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8sg64cbldiwlz1txf44.png" alt="A team AI productivity architecture with only the required components" width="562" height="191"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conversational web AI chat architectures may also include the following optional components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;persistence layer&lt;/strong&gt; for storing past conversation sessions&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;context layer&lt;/strong&gt; that provides additional documents, tools, or capabilities that can augment the conversation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h61ssupecku4bdb94ni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3h61ssupecku4bdb94ni.png" alt="A team AI productivity architecture supported by additional context" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's walk through each of these required and optional components, talk about what they are and the various choices you might need to make with each one.&lt;/p&gt;
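&lt;p&gt;Before we do, the relationship between the required components can be sketched in a few lines of code. The sketch below is a hypothetical Python illustration (the class names and model names are invented, not part of Open WebUI or any specific product): it shows a portal surfacing only the models that a provider exposes and an administrator has approved.&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical sketch of the required components; all names are illustrative.

@dataclass
class ModelProvider:
    name: str             # e.g. "ollama" or "azure-openai" (illustrative names)
    models: list          # models this provider exposes

@dataclass
class ManagementLayer:
    approved_models: set  # models administrators have enabled for end users

@dataclass
class WebChatPortal:
    providers: list
    management: ManagementLayer

    def available_models(self) -> list:
        """Only surface models that a provider exposes AND an admin has approved."""
        exposed = {m for p in self.providers for m in p.models}
        return sorted(exposed.intersection(self.management.approved_models))

portal = WebChatPortal(
    providers=[ModelProvider("ollama", ["llama3", "mistral"]),
               ModelProvider("azure-openai", ["gpt-4o"])],
    management=ManagementLayer(approved_models={"llama3", "gpt-4o"}),
)
print(portal.available_models())  # ['gpt-4o', 'llama3']
```

&lt;p&gt;Note that "mistral" is exposed by a provider but never reaches users because the management layer hasn't approved it - the same policy-enforcement pattern discussed in the Model Provider section below.&lt;/p&gt;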

&lt;h3&gt;
  
  
  Web Chat Portal
&lt;/h3&gt;

&lt;p&gt;The first component is the most obvious: you need a centralized &lt;strong&gt;web chat portal&lt;/strong&gt; where users can go to ask questions of your system. This typically looks like a web page hosted on your organization's intranet or on the internet and is secured via your organization's preferred authentication options.&lt;/p&gt;

&lt;p&gt;The web chat portal allows you to select a prior conversation to continue (if a persistence layer is present) or start a new conversation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf8ywis1s466crcowv6o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgf8ywis1s466crcowv6o.png" alt="Starting or resuming a conversation in Open WebUI" width="800" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When starting a new conversation, users must select a model to interact with from the approved list of models, specify a textual query / prompt, and optionally include documents, files, or web links to act as additional sources of context before sending the message.&lt;/p&gt;

&lt;p&gt;Your web chat portal &lt;em&gt;may&lt;/em&gt; stream the completions your model provides so the user can see the response in real-time, but once the response is fully complete it should show up in your portal and include additional context, links, and feedback mechanisms.&lt;/p&gt;
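&lt;p&gt;That stream-then-finalize behavior is simple at its core. This hypothetical Python sketch (invented names, not tied to any particular portal or API) accumulates streamed text deltas into the growing message a user would see, with the final value being what gets stored and decorated:&lt;/p&gt;

```python
def render_stream(deltas):
    """Accumulate streamed text deltas, yielding the partial message after each chunk.

    In a real portal, each partial value would be pushed to the browser as it
    arrives; the final value is what gets persisted and decorated with links
    and feedback controls.
    """
    message = ""
    for delta in deltas:
        message += delta
        yield message  # the partial response shown to the user in real time

# Simulated deltas, similar in spirit to an OpenAI-style streaming response
chunks = ["The", " answer", " is", " 42."]
partials = list(render_stream(chunks))
print(partials[-1])  # The answer is 42.
```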

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvrwurxzc9vbqtta78j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvrwurxzc9vbqtta78j3.png" alt="A chat response in Open WebUI showing additional actions" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some web chat portals may also support asking the same question of multiple models to compare their responses or may give you the ability to edit your existing messages and resend them to regenerate a response.&lt;/p&gt;

&lt;p&gt;It's important to know that when a persistence layer is present, entire interactions with the system may be stored temporarily or permanently for auditing or quality control purposes, particularly when users provide positive or negative feedback. As a result, users should be informed that their conversations may be stored and reviewed by your organization's IT staff or potentially other personnel. The system should therefore be treated with the same level of professionalism you would expect on a communications platform like Slack, Teams, or Discord.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Provider
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;model provider&lt;/strong&gt; is a connector that links your web chat portal to the organizationally approved large language models (LLMs) that users are allowed to interact with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdc5779jvkd3wfkt7enl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffdc5779jvkd3wfkt7enl.png" alt="Model Providers allow the user to select from pre-approved models" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your organization may use public models like those deployed on OpenAI, instanced / dedicated models hosted on a service like Azure, or your team may self-host models using something like &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; or &lt;a href="https://lmstudio.ai" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt;. Your team could even use a combination of these by specifying multiple model providers.&lt;/p&gt;

&lt;p&gt;Your list of approved models will grow and shrink over time as new models arrive, are reviewed and approved, and as older models are retired and replaced with newer ones. The most important thing to remember with your model selection is that you &lt;strong&gt;only list models your organization is comfortable with people using&lt;/strong&gt;. If a model does not meet your organization's IP security needs (for example, it retains logs for training future models), it should not appear in the list of models presented to your end users. That way, users working with your solution know they are in compliance with organizational AI policies.&lt;/p&gt;

&lt;p&gt;You may wonder, "Why don't organizations just use a single approved model? Why give users choices?" While providing choices may raise the barrier to entry slightly for some users, the overall benefits of having different models are usually worth it. Because different models are good at different tasks and have different basic characteristics in terms of speed, accuracy, and cost, it can be helpful to allow your users to choose.&lt;/p&gt;

&lt;p&gt;Additionally, you may find that some models temporarily go offline - particularly if you're using multiple model providers - and it can be helpful to have backup resources for people to consider.&lt;/p&gt;

&lt;h3&gt;
  
  
  Management Layer
&lt;/h3&gt;

&lt;p&gt;Most AI chat solutions have some form of management or configuration associated with them. The &lt;strong&gt;management layer&lt;/strong&gt; allows your IT admin team to configure your web chat system and connect it to various models and other providers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdoi3mibzh4ry7kjj4hd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqdoi3mibzh4ry7kjj4hd.png" alt="Connecting Open WebUI to different model providers" width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once model providers are configured, you can also select the various models from your model providers that should be available to your users:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6h13dgeriuw74samme4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr6h13dgeriuw74samme4.png" alt="Configuring available models" width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Organizations using pay-per-usage models can sometimes use the management layer to limit the budgets of individual users, ensuring a predictable maximum expense per user per week.&lt;/p&gt;
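&lt;p&gt;As a rough illustration, a per-user budget check behind the management layer might boil down to logic like this. The limit, spend values, and method name here are hypothetical and not drawn from any specific product:&lt;/p&gt;

```csharp
using System;

// Hypothetical per-user weekly budget check behind a management layer.
// The limit, spend values, and method name are illustrative only.
decimal weeklyLimit = 25.00m;

// Returns true when a proposed request would push the user past the cap.
bool WouldExceedBudget(decimal spentSoFar, decimal estimatedCost) =>
    spentSoFar + estimatedCost > weeklyLimit;

// A user who has spent $24.50 this week:
Console.WriteLine(WouldExceedBudget(24.50m, 0.25m)); // False
Console.WriteLine(WouldExceedBudget(24.50m, 1.00m)); // True
```

&lt;p&gt;Real platforms typically track spend per request from provider billing data, but the gatekeeping decision reduces to a comparison like the one above.&lt;/p&gt;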

&lt;h3&gt;
  
  
  Persistence Layer
&lt;/h3&gt;

&lt;p&gt;Most web portals with a management layer will also have some form of a &lt;strong&gt;persistence layer&lt;/strong&gt; that stores past conversations. This is a convenience for users who wish to refer back to past conversations or resume them, and it can also help your organization's IT team manage and monitor its AI infrastructure.&lt;/p&gt;

&lt;p&gt;In evaluating models and compliance, admins may be able to see some or all of users' private interactions, depending on how the persistence layer is configured and whether any rolling-delete or anonymization capabilities are present. While this helps determine which models are actively being used and how they're performing, this capability and how it may be used should be disclosed to your employees, as some may include context they intended to be private even in legitimate interactions with the system. For example, an employee brainstorming a presentation with an LLM may disclose private medical information about physical or mental conditions that could impact their performance, simply to carry out their company-assigned tasks more effectively.&lt;/p&gt;

&lt;p&gt;Your persistence layer could be as simple as a series of configuration files, or it could be a relational or document database. Some persistence layers may even use a vector store to hold text embeddings, allowing past conversations or indexed documents to be searched. The capabilities of your persistence layer will be strongly tied to the overall solution you choose.&lt;/p&gt;
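&lt;p&gt;At the simple end of that spectrum, persisting a conversation can amount to serializing its messages to a JSON file. The message shape and file name below are illustrative only, not any product's actual schema:&lt;/p&gt;

```csharp
using System;
using System.IO;
using System.Text.Json;

// Minimal sketch: persist one conversation as a JSON file.
// The Role/Text shape and file name are illustrative, not any product's schema.
object[] conversation =
{
    new { Role = "assistant", Text = "How can I help you today?" },
    new { Role = "user", Text = "Summarize our onboarding workflow." }
};

string json = JsonSerializer.Serialize(conversation);
File.WriteAllText("conversation-0001.json", json);
Console.WriteLine(json);
```

&lt;p&gt;A production persistence layer adds indexing, access control, and retention policies on top of this basic read/write cycle.&lt;/p&gt;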

&lt;h3&gt;
  
  
  Context Layer
&lt;/h3&gt;

&lt;p&gt;Perhaps the most exciting of all the parts of an AI solution is the &lt;strong&gt;context layer&lt;/strong&gt;. The context layer is able to provide your LLM with additional knowledge and capabilities including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools / functions&lt;/strong&gt; that can be called to produce a result. Examples include checking the current weather in an area, tracking a package that's out for delivery, or searching the internet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; that define common text instructions for carrying out a task consistently across your team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; such as documents, web pages, and other pieces of information that provide additional context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These capabilities are typically integrated using a standardized format like &lt;a href="https://blog.leadingedje.com/post/ai/mcpvsa2a.html" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; or &lt;a href="https://docs.openwebui.com/getting-started/api-endpoints/" rel="noopener noreferrer"&gt;OpenAPI endpoints&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When a user sends a request to the LLM, these additional capabilities are sent along with it, and the LLM &lt;em&gt;may&lt;/em&gt; choose to take advantage of them to fulfill the user's request. This makes these capabilities a form of retrieval-augmented generation (RAG) data source.&lt;/p&gt;

&lt;p&gt;By integrating additional tools and capabilities into your AI systems, you offer your organization unique value that it cannot find in another tool. These capabilities are your way of adding your organization's own context so that employees can do their jobs more effectively. This context can include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; for looking up the status of different work items, orders, or customers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; documenting standard definitions, systems, and workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; that help generate output that's consistent with organizational branding or work standards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In short, the context layer is something that is uniquely yours and can be uniquely controlled by your organization.&lt;/p&gt;
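&lt;p&gt;To make this concrete, a tool exposed by an MCP server is described to the model with a name, a description, and a JSON Schema for its inputs. The sketch below shows the general shape for a hypothetical work-item lookup tool; the tool name and fields are invented for illustration:&lt;/p&gt;

```json
{
  "name": "lookup_work_item_status",
  "description": "Returns the current status of a work item by its identifier.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "workItemId": {
        "type": "string",
        "description": "The identifier of the work item to look up."
      }
    },
    "required": ["workItemId"]
  }
}
```

&lt;p&gt;The LLM uses the description and schema to decide when to call the tool and how to structure its arguments.&lt;/p&gt;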

&lt;p&gt;These capabilities can be so valuable that some organizations even offer a shared architecture that encapsulates these tools into an MCP server shared between the web AI chat tooling and individual developer productivity solutions, as shown here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1k1owg1q931flimkbrl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1k1owg1q931flimkbrl.png" alt="Using the same shared resources between developer teams and web application users" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this way all employees can take advantage of organizational knowledge, standards, and capabilities when performing their work, regardless of what that work entails.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional Integrations
&lt;/h2&gt;

&lt;p&gt;Some web AI chat systems may include additional integrations including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text to speech capabilities that read aloud responses from the system&lt;/li&gt;
&lt;li&gt;Speech to text capabilities that allow you to talk to your LLM&lt;/li&gt;
&lt;li&gt;Image generation via ComfyUI, OpenAI, Gemini, or other providers&lt;/li&gt;
&lt;li&gt;Web search capabilities (essentially a built-in tool provided by your platform)&lt;/li&gt;
&lt;li&gt;Direct code execution capabilities in sandboxed environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list of capabilities will vary depending on what web chat provider you selected and will change over time as industry trends evolve.&lt;/p&gt;

&lt;h2&gt;
  
  
  Securing your chat provider
&lt;/h2&gt;

&lt;p&gt;While providing your users with a curated list of models is fantastic for helping users interact with AI in approved ways, these same capabilities can be a target for attackers as well.&lt;/p&gt;

&lt;p&gt;If you do not properly secure your AI web chat capabilities, it is possible for an attacker to discover your endpoint and use it to cause damage, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incurring charges against pay-per-use AI models&lt;/li&gt;
&lt;li&gt;Conducting a denial-of-service attack by exhausting the rate limits of certain LLMs, denying legitimate users access to these resources&lt;/li&gt;
&lt;li&gt;Accessing sensitive information stored in prompts or resources&lt;/li&gt;
&lt;li&gt;Exploiting tools to perform additional attacks, such as searching your knowledgebase, querying data stores, or other actions depending on the exact nature of your implemented tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a number of ways to mitigate these vulnerabilities, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Properly researching the various web AI chat providers to ensure they meet your security and administration needs&lt;/li&gt;
&lt;li&gt;Requiring users to log in via an API key, LDAP, or some other form of authentication&lt;/li&gt;
&lt;li&gt;Configuring firewall rules to require a VPN to access AI tooling&lt;/li&gt;
&lt;li&gt;Setting sensible rate limits or access permissions on groups of users so a single compromised account cannot inflict massive damage on the organization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While any new system carries new attack vectors for malicious users, one of the realities of a world where AI tools are ubiquitous is that your users will find AI tooling that meets their needs. Your goal as an organization should be to make sure that when they do this, they do it in an approved way that also meets the organization's data stewardship and security needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Conversational AI systems are powerful ways of augmenting your entire team's capabilities, and a web AI chat portal is an effective way to provide a secure means for your organization to innovate with AI in approved and cost-effective ways. What's more, the ability to integrate your organization's context through resources, prompts, and tools is an offering that no other AI chat toolset will provide - and it can be easily integrated into other solutions such as &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/codingassistant.html" rel="noopener noreferrer"&gt;developer AI productivity architectures&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We've been amazed at the things our team at &lt;a href="https://LeadingEDJE.com" rel="noopener noreferrer"&gt;Leading EDJE&lt;/a&gt; has been able to do with properly governed AI - both internally and for our clients - and we'd love to discuss how you can move forward with AI.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>ai</category>
      <category>productivity</category>
      <category>security</category>
    </item>
    <item>
      <title>An LLM Evaluation Framework for AI Systems Performance</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 27 May 2025 22:19:26 +0000</pubDate>
      <link>https://forem.com/leading-edje/an-llm-evaluation-framework-for-ai-systems-performance-1lf9</link>
      <guid>https://forem.com/leading-edje/an-llm-evaluation-framework-for-ai-systems-performance-1lf9</guid>
      <description>&lt;p&gt;One of the challenges of AI systems development is ensuring that your system performs well not just when it is initially released, but as it grows and is deployed to the world. While &lt;a href="https://blog.leadingedje.com/post/aiprototyping/introducingaiprototyping.html" rel="noopener noreferrer"&gt;AI prototyping projects&lt;/a&gt; are fun and exciting, eventually systems need to make it to the real-world and evolve over time.&lt;/p&gt;

&lt;p&gt;These evolutions can come in the following forms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Changing the system prompt to try to improve performance or resolve issues&lt;/li&gt;
&lt;li&gt;Replacing the text completion or embedding model used by your system&lt;/li&gt;
&lt;li&gt;Adding new tools for AI systems to call in function-calling scenarios. This is particularly relevant when working with tooling like &lt;a href="https://blog.leadingedje.com/post/semantickerneldnd.html" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt; or &lt;a href="https://blog.leadingedje.com/post/ai/mcpvsa2a.html" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Changing the data that is accessible to models for Retrieval-Augmented Generation. This often happens naturally over time as new data is added.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regardless of the cause of change, organizations need repeatable and effective means for evaluating how their conversational AI systems respond to common contexts. This is where &lt;a href="https://learn.microsoft.com/en-us/dotnet/ai/conceptual/evaluation-libraries" rel="noopener noreferrer"&gt;Microsoft.Extensions.AI.Evaluation&lt;/a&gt; comes in as an open-source library that helps you gather and compare different metrics related to your AI systems.&lt;/p&gt;

&lt;p&gt;In this article we'll cover this LLM testing framework and cover the evaluation metrics of &lt;strong&gt;Equivalence&lt;/strong&gt;, &lt;strong&gt;Groundedness&lt;/strong&gt;, &lt;strong&gt;Fluency&lt;/strong&gt;, &lt;strong&gt;Relevance&lt;/strong&gt;, &lt;strong&gt;Coherence&lt;/strong&gt;, &lt;strong&gt;Retrieval&lt;/strong&gt;, and &lt;strong&gt;Completeness&lt;/strong&gt; using C# code in a small .NET application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Chat History and Context
&lt;/h2&gt;

&lt;p&gt;Our code in this article uses OpenAI to generate chat completions and then &lt;code&gt;Microsoft.Extensions.AI.Evaluation&lt;/code&gt; to grade its results on standardized metrics.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Microsoft.Extensions.AI and Microsoft.Extensions.AI.Evaluation can work with model providers other than OpenAI, including locally or network-hosted models via Ollama or services like LM Studio that expose an OpenAI-compatible API.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These metrics are produced by sending a chat session to OpenAI for grading, along with a list of evaluators to run against it. Because of this, we'll need to connect to OpenAI not just to get our chat completions but also to get the evaluation metrics for the interaction.&lt;/p&gt;

&lt;p&gt;We can connect to OpenAI by providing an API key and optionally an endpoint (if you're targeting a custom deployment of OpenAI). We'll do this using the &lt;code&gt;OpenAIClient&lt;/code&gt; and the &lt;code&gt;IChatClient&lt;/code&gt; interface defined in the &lt;code&gt;Microsoft.Extensions.AI&lt;/code&gt; NuGet package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;OpenAIClientOptions&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAIEndpoint&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;ApiKeyCredential&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ApiKeyCredential&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAIKey&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;IChatClient&lt;/span&gt; &lt;span class="n"&gt;chatClient&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;OpenAIClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextModelName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AsIChatClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building a conversation history
&lt;/h3&gt;

&lt;p&gt;Both evaluation and chat completions require a conversation history in order to function.&lt;/p&gt;

&lt;p&gt;We'll simulate one by building a list of &lt;code&gt;ChatMessage&lt;/code&gt; objects and populating it with a short interaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;greeting&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"How can I help you today?"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkupLineInterpolated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"[cyan]AI[/]: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;userText&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Is today after May 1st? If so, tell me what the next month will be."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkupLineInterpolated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"[yellow]User[/]: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;userText&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;ragContext&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"The current date is May 27th"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatMessage&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;$"&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SystemPrompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ragContext&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Assistant&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;greeting&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ChatRole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userText&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our history is composed of a system prompt, an assistant greeting, and a simulated message from the user. Normally the user would type the message themselves, but since our goal in this article is to evaluate a simple interaction, this message is hardcoded here instead.&lt;/p&gt;

&lt;p&gt;The system prompt is typically one of the things you'll want to try tweaking the most, so it's worth sharing the simple prompt used in this example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You are a chatbot designed to help the user with simple questions. Keep your answers to a single sentence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While the evaluation metrics we'll collect deal with the entire AI system, in reality one of the main things you'll use metrics for is tweaking your system prompt to improve its performance for targeted scenarios.&lt;/p&gt;

&lt;p&gt;Also note that I am simulating Retrieval-Augmented Generation (RAG) in this example by injecting a &lt;code&gt;ragContext&lt;/code&gt; variable containing a simple string with the current date. Our main focus in this article is not RAG, but RAG will be relevant for the Retrieval metric later on. If you're curious about RAG, I recommend checking out my recent article on &lt;a href="https://blog.leadingedje.com/post/ai/documents/kernelmemory.html" rel="noopener noreferrer"&gt;RAG with Kernel Memory&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Chat Completions
&lt;/h3&gt;

&lt;p&gt;Chat completions are relatively straightforward to retrieve once we have our chat history, and involve a simple call to the &lt;code&gt;IChatClient&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;ChatResponse&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chatClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;GetResponseAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkupLineInterpolated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"[cyan]AI[/]: &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This calls out to OpenAI with the simulated history, gets a response to the message, and then displays the response messages to the user.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I'm using Spectre.Console as a formatting library to make this sample application easier to read. You can see this here with the &lt;code&gt;MarkupLineInterpolated&lt;/code&gt; call, though this could just as easily have been a &lt;code&gt;Console.WriteLine&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Our sample scenario for this article is a short interaction where the user asks the AI system a simple date question, shown here:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AI&lt;/strong&gt;: How can I help you today?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User&lt;/strong&gt;: Is today after May 1st? If so, tell me what the next month will be.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI&lt;/strong&gt;: Yes, today is after May 1st, and the next month will be June.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since the date in this interaction is May 27th, this response is factually correct, though evaluation results will flag that it isn't quite as complete as it could be, since the AI doesn't state the current date in its response.&lt;/p&gt;

&lt;p&gt;With our &lt;code&gt;IChatClient&lt;/code&gt; ready and our chat completions available, we're now ready to get evaluation results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating chat completions
&lt;/h2&gt;

&lt;p&gt;Microsoft.Extensions.AI.Evaluation allows you to specify one or more evaluator objects to grade the performance of your AI system for a sample interaction.&lt;/p&gt;

&lt;p&gt;Here's a simple example using a &lt;code&gt;CoherenceEvaluator&lt;/code&gt;, which checks that the AI system's response is coherent and readable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;IEvaluator&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CoherenceEvaluator&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;ChatConfiguration&lt;/span&gt; &lt;span class="n"&gt;chatConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatClient&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;EvaluationResult&lt;/span&gt; &lt;span class="n"&gt;evalResult&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;EvaluateAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chatConfig&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces an &lt;code&gt;EvaluationResult&lt;/code&gt; object that contains a single metric for the coherence of your AI system.&lt;/p&gt;

&lt;p&gt;This metric will be a &lt;code&gt;NumericMetric&lt;/code&gt; with a numeric &lt;code&gt;Value&lt;/code&gt; property ranging from 1 to 5 with 1 being poor and 5 being nearly perfect.&lt;/p&gt;

&lt;p&gt;Each metric will also include a &lt;code&gt;Reason&lt;/code&gt; property including the justification the LLM provided for this ranking. This helps you understand what might be lacking about responses that scored below a 5 and can be good for reporting as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluating multiple metrics
&lt;/h3&gt;

&lt;p&gt;Most of the time you'll want to look at not just a single metric, but many different metrics.&lt;/p&gt;

&lt;p&gt;The following example illustrates this by using a &lt;code&gt;CompositeEvaluator&lt;/code&gt; composed of multiple component evaluators:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Set up evaluators&lt;/span&gt;
&lt;span class="n"&gt;IEvaluator&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CompositeEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CoherenceEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CompletenessEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;FluencyEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;GroundednessEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RelevanceEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RelevanceTruthAndCompletenessEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;EquivalenceEvaluator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RetrievalEvaluator&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Provide context to evaluators that need it&lt;/span&gt;
&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;EvaluationContext&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;RetrievalEvaluatorContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"The current date is May 27th"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;CompletenessEvaluatorContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Today is May 27th and the next month is June"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;EquivalenceEvaluatorContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"The current date is May 27th, which is after May 1st and before June."&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;GroundednessEvaluatorContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"May 27th is after May 1st. June is the month immediately following May."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="c1"&gt;// Evaluate the response&lt;/span&gt;
&lt;span class="n"&gt;ChatConfiguration&lt;/span&gt; &lt;span class="n"&gt;chatConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chatClient&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;EvaluationResult&lt;/span&gt; &lt;span class="n"&gt;evalResult&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;EvaluateAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chatConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You likely noticed that we're now defining a &lt;code&gt;context&lt;/code&gt; collection of &lt;code&gt;EvaluationContext&lt;/code&gt; objects and providing it to the &lt;code&gt;EvaluateAsync&lt;/code&gt; call.&lt;/p&gt;

&lt;p&gt;These context objects are needed for some of the more advanced evaluators. While some of the evaluators can work on their own, a few of them need you to provide additional details on what information should have been retrieved, what an ideal response to the request would look like, and what information absolutely needs to be in the response.&lt;/p&gt;

&lt;p&gt;Failing to provide these context objects will not cause errors but does result in missing metric values for the relevant evaluators.&lt;/p&gt;

&lt;h3&gt;
  
  
  Displaying evaluation results
&lt;/h3&gt;

&lt;p&gt;Evaluation metrics are helpful, but can be cumbersome to look at manually. Thankfully, they're not too hard to display in a table using Spectre.Console.&lt;/p&gt;

&lt;p&gt;The following C# code loops over each metric in an &lt;code&gt;EvaluationResult&lt;/code&gt;, adds it to a &lt;code&gt;Table&lt;/code&gt;, and displays it to the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;Table&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;Title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Evaluation Results"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddColumns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Metric"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Reason"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kvp&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;evalResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Metrics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;EvaluationMetric&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kvp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="s"&gt;"No Reason Provided"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="k"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;??&lt;/span&gt; &lt;span class="s"&gt;"No Value"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="n"&gt;NumericMetric&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;double&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="n"&gt;numValue&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numValue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HasValue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;numValue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"F1"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="c1"&gt;// Possibly missing a Context entry&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"No value"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kvp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will produce output like the following image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2cbzelppwo5u68gzw6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh2cbzelppwo5u68gzw6y.png" alt="Evaluation Metrics Table" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having a nicely formatted metrics display makes it easy to capture AI system performance and communicate it to others.&lt;/p&gt;

&lt;p&gt;Microsoft.Extensions.AI.Evaluation also includes some great HTML and JSON reporting capabilities, as well as the ability to examine multiple iterations and scenarios in the same evaluation run. These capabilities are beyond the scope of this article, but I plan to cover them in a future one.&lt;/p&gt;

&lt;p&gt;Now that we've seen how to pull evaluation metrics, let's discuss what each of these AI system evaluation metrics means.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Systems Metrics
&lt;/h2&gt;

&lt;p&gt;Let's cover the currently supported AI metrics available in &lt;code&gt;Microsoft.Extensions.AI.Evaluation&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Equivalence
&lt;/h3&gt;

&lt;p&gt;Equivalence verifies that the system's response roughly matches the sample response coded into the &lt;code&gt;EquivalenceEvaluatorContext&lt;/code&gt;. Put simply, it's a way of measuring that the system's response was close to what we'd expect it to be.&lt;/p&gt;

&lt;h3&gt;
  
  
  Groundedness
&lt;/h3&gt;

&lt;p&gt;Groundedness checks that the response is grounded in the relevant facts when responding to the user. This helps ensure that the LLM isn't providing an answer that's wildly different from what we'd expect it to offer. This can be helpful for organizations that have specific lists of points they want to cover, or organization-specific terms and definitions that may differ slightly from publicly-available information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fluency
&lt;/h3&gt;

&lt;p&gt;Fluency checks grammatical correctness and adherence to syntactical and structural rules. In short, it evaluates whether the system is producing valid English or complete gibberish.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I suspect that Fluency will also work with other languages your LLM supports, but I have not investigated this at the time of writing.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Coherence
&lt;/h3&gt;

&lt;p&gt;Coherence is a readability check that ensures your response is easy to read and flows well. While Fluency could be considered a grammar checker, Coherence could be considered an editor helping optimize your outputs for readability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retrieval
&lt;/h3&gt;

&lt;p&gt;Retrieval grades the effectiveness of RAG systems at providing relevant context to your AI system for responding to the query. The retrieved content is supplied via a &lt;code&gt;RetrievalEvaluatorContext&lt;/code&gt; object.&lt;/p&gt;

&lt;p&gt;Low retrieval scores may indicate that you have gaps in your content where your system is unable to find relevant information. Alternatively, you might have relevant content, but it's not organized in such a way that your embedding model and indexes are able to effectively retrieve it.&lt;/p&gt;
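&lt;p&gt;As a rough sketch, here's how the retrieved chunks might be handed to a &lt;code&gt;RetrievalEvaluator&lt;/code&gt;. The evaluator and context types come from Microsoft.Extensions.AI.Evaluation.Quality, the exact constructor shape may vary between versions, and &lt;code&gt;messages&lt;/code&gt;, &lt;code&gt;responses&lt;/code&gt;, and &lt;code&gt;chatConfig&lt;/code&gt; are the same values used earlier:&lt;/p&gt;

```csharp
// Sketch only: exact constructor arguments may differ between library versions
IEvaluator retrievalEvaluator = new RetrievalEvaluator();

// Provide the chunks your RAG pipeline actually retrieved for this query
EvaluationContext[] context =
[
    new RetrievalEvaluatorContext(
        "May 27th is after May 1st.",
        "June is the month immediately following May.")
];

EvaluationResult result = await retrievalEvaluator.EvaluateAsync(
    messages, responses, chatConfig, context);
```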

&lt;h3&gt;
  
  
  Completeness
&lt;/h3&gt;

&lt;p&gt;Completeness checks to make sure that the response to the user covers all of the major points that the sample response you provided to &lt;code&gt;CompletenessEvaluatorContext&lt;/code&gt; contains.&lt;/p&gt;

&lt;p&gt;In our example, the system's response was correct but incomplete, as it did not include the current date in its output.&lt;/p&gt;

&lt;h3&gt;
  
  
  RTC Evaluators (Relevance, Truth, Completeness)
&lt;/h3&gt;

&lt;p&gt;Microsoft also offers a newer and more experimental &lt;code&gt;RelevanceTruthAndCompletenessEvaluator&lt;/code&gt; (RTC) evaluator that bundles together the effects of &lt;code&gt;RelevanceEvaluator&lt;/code&gt;, &lt;code&gt;GroundednessEvaluator&lt;/code&gt;, and &lt;code&gt;CompletenessEvaluator&lt;/code&gt; into a single evaluator that does not require any context.&lt;/p&gt;

&lt;p&gt;This RTC evaluator is still in preview and may change significantly or go away, but it has some core advantages over using the three evaluators individually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It's faster to make a single evaluation request versus making three separate requests&lt;/li&gt;
&lt;li&gt;It consumes fewer tokens, resulting in lower costs when evaluating your system against token-based LLMs that charge based on usage&lt;/li&gt;
&lt;li&gt;It does not require you to provide additional context to the evaluator, making this evaluator easier to use&lt;/li&gt;
&lt;/ul&gt;
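&lt;p&gt;Because no context is required, using it is simpler than the earlier setup. This is a hedged sketch against the preview API, reusing the &lt;code&gt;messages&lt;/code&gt;, &lt;code&gt;responses&lt;/code&gt;, and &lt;code&gt;chatConfig&lt;/code&gt; values from earlier:&lt;/p&gt;

```csharp
// Sketch only: this evaluator is in preview and its API may change
IEvaluator rtcEvaluator = new RelevanceTruthAndCompletenessEvaluator();

// No EvaluationContext collection is needed for this evaluator
EvaluationResult rtcResult = await rtcEvaluator.EvaluateAsync(
    messages, responses, chatConfig);
```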

&lt;p&gt;Once the RTC evaluator leaves preview, it may be a good choice for teams who only want to work with a single evaluator for cost or performance reasons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;In this article we saw how you can gather a variety of metrics related to the performance of AI systems using a few lines of C# code.&lt;/p&gt;

&lt;p&gt;I'd encourage you to check out &lt;a href="https://github.com/IntegerMan/AIEvaluationSamples/tree/main/MattEland.Demos.AI.Evaluation" rel="noopener noreferrer"&gt;this article's source code on GitHub&lt;/a&gt; and play around with it yourself using your own API Key and endpoint. Microsoft also has &lt;a href="https://github.com/dotnet/ai-samples/tree/main/src/microsoft-extensions-ai-evaluation/api" rel="noopener noreferrer"&gt;numerous samples&lt;/a&gt; related to their own library that are worth checking out as well.&lt;/p&gt;

&lt;p&gt;This capability unlocks a number of interesting paths forward for organizations including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating integration tests that fail if metrics are below a given threshold for key interactions&lt;/li&gt;
&lt;li&gt;Experimenting with different system prompts in a manner similar to A/B testing, but using an LLM evaluator as a referee&lt;/li&gt;
&lt;li&gt;Integrating these tests into an MLOps workflow or CI/CD pipeline to ensure you don't ship suddenly degraded AI systems&lt;/li&gt;
&lt;/ul&gt;
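&lt;p&gt;As an illustration of the first bullet, a threshold-gated test might look like the following sketch. The xUnit shape and the 4.0 cutoff are my own assumptions rather than library defaults, and &lt;code&gt;evaluator&lt;/code&gt;, &lt;code&gt;messages&lt;/code&gt;, &lt;code&gt;responses&lt;/code&gt;, &lt;code&gt;chatConfig&lt;/code&gt;, and &lt;code&gt;context&lt;/code&gt; are the values from earlier in this article:&lt;/p&gt;

```csharp
// Sketch only: the 4.0 quality bar is an arbitrary illustrative threshold
[Fact]
public async Task KeyInteractionMeetsQualityBar()
{
    EvaluationResult evalResult = await evaluator.EvaluateAsync(
        messages, responses, chatConfig, context);

    foreach (var kvp in evalResult.Metrics)
    {
        if (kvp.Value is NumericMetric num && num.Value.HasValue)
        {
            // Fail the run if any numeric metric dips below our chosen bar
            Assert.True(num.Value.Value >= 4.0,
                $"{kvp.Key} scored {num.Value.Value:F1}, below the 4.0 bar");
        }
    }
}
```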

&lt;p&gt;As someone who is very cost-averse, having an automated framework for evaluating the performance of LLM-based systems and being able to provide my own models (including Ollama models) is a critical capability and unlocks many different workflows I wouldn't ordinarily have access to.&lt;/p&gt;

&lt;p&gt;While this article covers the evaluation capabilities of Microsoft.Extensions.AI.Evaluation, it barely scratches the surface of its capabilities. Stay tuned for part two of this article which will delve into its reporting options and more advanced A/B testing capabilities.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>csharp</category>
      <category>testing</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Document Search in .NET with Kernel Memory</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 20 May 2025 19:22:29 +0000</pubDate>
      <link>https://forem.com/leading-edje/document-search-in-net-with-kernel-memory-3ah</link>
      <guid>https://forem.com/leading-edje/document-search-in-net-with-kernel-memory-3ah</guid>
      <description>&lt;p&gt;I recently discovered the &lt;a href="https://github.com/microsoft/kernel-memory" rel="noopener noreferrer"&gt;Kernel Memory&lt;/a&gt; library for document indexing, web scraping, semantic search, and LLM-based question answering. The capabilities, flexibility, and simplicity of this library are so fantastic that it's quickly ascended my list of favorite AI libraries to work with for RAG search, document search, or AI-based question answering.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/h8bKn1nzjrQ"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;In this article I'll walk you through what Kernel Memory is and how you can use the C# version of this library to quickly index, search, and chat with knowledge stored in documents or web pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kernel Memory, a flexible document indexing and RAG search library
&lt;/h2&gt;

&lt;p&gt;At its core, Kernel Memory is all about ingesting information from various sources, indexing it, storing it in a vector storage solution, and providing a means for searching and question answering with this indexed knowledge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wu1uzt18klzgbqx2t5x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wu1uzt18klzgbqx2t5x.png" alt="Indexing data in different formats using Kernel Memory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We'll walk through a full small application in this article, but here's a simple implementation to help orient you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;IKernelMemory&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;KernelMemoryBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openAiConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ImportDocumentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"TheGuide.pdf"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"What is the answer to the question of life, the universe, and everything?"&lt;/span&gt;
&lt;span class="n"&gt;MemoryAnswer&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AskAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;reply&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reply&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this snippet we see that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kernel Memory uses a standard builder API allowing you to add in various sources that are relevant to you (here an OpenAI text and embedding model)&lt;/li&gt;
&lt;li&gt;Kernel Memory provides &lt;code&gt;Import&lt;/code&gt; methods allowing you to index documents, text, and web pages and store them in its current vector store&lt;/li&gt;
&lt;li&gt;Kernel Memory provides a convenient way of asking questions to an LLM and providing your information as a RAG data source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this short example we're using the default volatile memory vector store which is built into Kernel Memory for demonstration purposes, but you could just as easily use an existing vector storage provider such as Qdrant, Azure AI Search, Postgres, Redis, or others.&lt;/p&gt;

&lt;p&gt;Likewise, Kernel Memory supports a wide range of LLM providers including OpenAI, Anthropic, ONNX, and even locally-running Ollama models.&lt;/p&gt;

&lt;p&gt;This last point has me particularly excited because I can now use locally hosted LLMs and on-network vector storage solutions to ingest and search documents without needing to worry about data leaving my network or per-usage cloud hosting costs. This opens up new usage scenarios for me for experimentation, workshops at conferences, and business scenarios.&lt;/p&gt;
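&lt;p&gt;As a hedged sketch of what that local setup might look like (this assumes the Ollama connector package for Kernel Memory; the extension method names and model names are my best understanding and may differ between versions):&lt;/p&gt;

```csharp
// Sketch only: assumes an Ollama instance at its default local endpoint
// and that these connector extension methods match your package version
IKernelMemory memory = new KernelMemoryBuilder()
    .WithOllamaTextGeneration("llama3")
    .WithOllamaTextEmbeddingGeneration("nomic-embed-text")
    .Build();
```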

&lt;p&gt;Let's drill into a larger Kernel Memory app and see how it flows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating a Kernel Memory instance in C#
&lt;/h2&gt;

&lt;p&gt;Through the rest of this article we'll walk through a small C# console application from start to finish. This project is &lt;a href="https://github.com/IntegerMan/DocumentSearchWithKernelMemory" rel="noopener noreferrer"&gt;available on GitHub&lt;/a&gt; if you'd like to clone it locally and experiment with it as well, though you'll need to provide your own API keys.&lt;/p&gt;

&lt;p&gt;We start with some fairly ordinary C# code using the &lt;code&gt;Microsoft.Extensions.Configuration&lt;/code&gt; mechanism for reading settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;IConfiguration&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;ConfigurationBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddJsonFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"appsettings.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reloadOnChange&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddEnvironmentVariables&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddUserSecrets&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Program&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddCommandLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;DocSearchDemoSettings&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;DocSearchDemoSettings&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;()!;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reads information from a local JSON file, command-line arguments, user secrets, and environment variables and stores them into a settings object that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DocSearchDemoSettings&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;OpenAIEndpoint&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;OpenAIKey&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;TextModelName&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;EmbeddingModelName&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;init&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of these settings are set in the &lt;code&gt;appsettings.json&lt;/code&gt; file, though you should store your endpoint and key in user secrets or environment variables if you plan on working with your own keys and managing things in source control.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"OpenAIEndpoint"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YourEndpoint"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"OpenAIKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YourApiKey"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"TextModelName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o-mini"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"EmbeddingModelName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"text-embedding-3-small"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With our configuration loaded, we now jump into creating our &lt;code&gt;IKernelMemory&lt;/code&gt; instance, where we'll need to provide information on what model, endpoints and keys to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;OpenAIConfig&lt;/span&gt; &lt;span class="n"&gt;openAiConfig&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;APIKey&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAIKey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Endpoint&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OpenAIEndpoint&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;EmbeddingModel&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EmbeddingModelName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TextModel&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TextModelName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="n"&gt;IKernelMemory&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;KernelMemoryBuilder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WithOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;openAiConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Build&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="n"&gt;IAnsiConsole&lt;/span&gt; &lt;span class="n"&gt;console&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AnsiConsole&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Console&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkupLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[green]KernelMemory initialized.[/]"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates and configures our Kernel Memory instance using an OpenAI text and embeddings model. The text completion model will be used for conversations with our data using the &lt;code&gt;AskAsync&lt;/code&gt; method while the embeddings model is used to generate a vector representing different chunks of documents that are indexed as well as the search queries when the memory instance is searched.&lt;/p&gt;

&lt;p&gt;By default, Kernel Memory uses a volatile in-memory vector store that is completely discarded and recreated every time the application runs. This is not a production-level solution, but it is fine for quick demonstrations on low volumes of data. For larger-scale scenarios or production usage, you would use a dedicated vector storage solution and connect it to Kernel Memory when building your &lt;code&gt;IKernelMemory&lt;/code&gt; instance.&lt;/p&gt;
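&lt;p&gt;For example, pointing Kernel Memory at a Qdrant instance is a small change to the builder. This sketch assumes the Qdrant connector package for Kernel Memory and a locally-running Qdrant instance at its default port:&lt;/p&gt;

```csharp
// Sketch only: the endpoint is a placeholder for your own Qdrant instance
IKernelMemory memory = new KernelMemoryBuilder()
    .WithOpenAI(openAiConfig)
    .WithQdrantMemoryDb("http://localhost:6333")
    .Build();
```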

&lt;p&gt;Also note the use of &lt;code&gt;AnsiConsole&lt;/code&gt;. This is part of &lt;code&gt;Spectre.Console&lt;/code&gt;, a library I frequently use alongside .NET console apps for enhanced input and output capabilities. We'll see more of this later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Indexing Documents and Web Scraping with Kernel Memory
&lt;/h2&gt;

&lt;p&gt;With our Kernel Memory instance set up and running an empty vector store, we should ingest some data before we continue on.&lt;/p&gt;

&lt;p&gt;Kernel Memory supports importing data in the following formats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw strings&lt;/li&gt;
&lt;li&gt;Web pages via web scraping&lt;/li&gt;
&lt;li&gt;Documents in a supported format (PDF, Images, Word, PowerPoint, Excel, Markdown, text files, and JSON)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The API for importing each of these sources is exceptionally simple as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Index documents and web content&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkupLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[yellow]Importing documents...[/]"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ImportTextAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"KernelMemory allows you to import web pages, documents, and text"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ImportTextAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"KernelMemory supports PDF, md, txt, docx, pptx, xlsx, and other formats"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Doc-Id"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ImportDocumentAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Facts.txt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Repository-Facts"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ImportWebPageAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://LeadingEDJE.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Leading-EDJE-Web-Page"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ImportWebPageAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"https://microsoft.github.io/kernel-memory/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="s"&gt;"KernelMemory-Web-Page"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;TagCollection&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="s"&gt;"GitHub"&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkupLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"[green]Documents imported.[/]"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code indexes a pair of raw strings, a Facts.txt file included with the repository, and a pair of web pages: Leading EDJE's web site (my employer, an IT services consultancy in Columbus, Ohio) and the GitHub repository for Kernel Memory.&lt;/p&gt;

&lt;p&gt;Note that each import call can take just a data source, a data source plus a document Id, or additional tag or index metadata as well.&lt;/p&gt;

&lt;p&gt;Using tags and indexes, you can annotate the documents you insert as belonging to certain collections. This lets you filter down to specific groups of documents later when searching or asking questions, which supports critical scenarios such as restricting the information available to different users based on their organization or security role.&lt;/p&gt;
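&lt;p&gt;As a rough sketch of what that filtering can look like (the &lt;code&gt;dept&lt;/code&gt; tag name, its values, and the file here are hypothetical, and exact signatures may vary by Kernel Memory version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Tag a document as it's imported so it belongs to a collection
await memory.ImportDocumentAsync("HRPolicy.docx", "HR-Policy",
    new TagCollection { { "dept", "hr" } });

// Later, restrict a search to documents carrying that tag
SearchResult hrResults = await memory.SearchAsync("vacation policy",
    filter: MemoryFilters.ByTag("dept", "hr"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;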

&lt;p&gt;Nearly everything about Kernel Memory is customizable as well, so you can change how documents are decoded and partitioned, or substitute your own web scraping provider in lieu of Kernel Memory's default one, for example.&lt;/p&gt;

&lt;p&gt;These &lt;code&gt;Import&lt;/code&gt; calls will take a few seconds to complete, depending on the size of the data, your text embeddings model, and your choice of vector storage solution. Once they complete, your data will be available for search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Searching Documents with Kernel Memory and Text Embeddings
&lt;/h2&gt;

&lt;p&gt;With our data ingested, we can now query Kernel Memory for specific questions. This can happen in one of two ways:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;SearchAsync&lt;/code&gt;, which provides raw search results to be handled programmatically.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AskAsync&lt;/code&gt;, which performs a search and then has an LLM respond to the question asked given the search results.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;While the search results are more complex than the ask results, we should start by exploring search as this helps us understand what Kernel Memory is doing under the hood.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkc4r71v0i35o4z3b8bp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkc4r71v0i35o4z3b8bp.png" alt="Searching Text Embeddings with Kernel Memory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code to conduct the search itself is straightforward and intuitive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;search&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ask&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"What do you want to search for?"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkupLineInterpolated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"[yellow]Searching for '&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'...[/]"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;SearchResult&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;SearchAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SearchResult&lt;/code&gt; object organizes its results into citations representing the different documents searched. Within each citation are partitions, which represent the chunks of the document that were indexed and stored for data retrieval. This matters because documents and web pages can be very long, and you want to match only on the most relevant portions of a document when performing a search.&lt;/p&gt;

&lt;p&gt;Each partition has a relevance score stored as a decimal percentage value ranging from 0 to 1.&lt;/p&gt;
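&lt;p&gt;If you only want strong matches, &lt;code&gt;SearchAsync&lt;/code&gt; lets you set a floor on that score. A minimal sketch, assuming the &lt;code&gt;memory&lt;/code&gt; and &lt;code&gt;search&lt;/code&gt; variables from this article (the 0.7 threshold and result cap are arbitrary illustrations):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Only return partitions scoring 70% relevance or higher, capped at five results
SearchResult strongMatches = await memory.SearchAsync(search,
    minRelevance: 0.7,
    limit: 5);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;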

&lt;p&gt;Using &lt;code&gt;Spectre.Console&lt;/code&gt;, you can loop over these citations and partitions and create a display table using the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="n"&gt;Table&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Table&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddColumns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Document"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Partition"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Section"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Score"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Text"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;citation&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;foreach&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;citation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Partitions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Length&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;snippet&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;[..&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;+&lt;/span&gt; &lt;span class="s"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddRow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;citation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DocumentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PartitionNumber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SectionNumber&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;part&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Relevance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"P2"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;snippet&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Expand&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;Write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a formatted table resembling the following image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsovlk3ij7hxpifwnaj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffsovlk3ij7hxpifwnaj7.png" alt="Search results for 'Kernel Memory' displayed in a table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, each match has a document, partition, section within that partition, relevance score, and some associated text. Individual tags and source URLs are also available. Note that document Ids are not mandatory; Kernel Memory generates its own random Ids if you don't provide one yourself.&lt;/p&gt;

&lt;p&gt;You can use &lt;code&gt;SearchAsync&lt;/code&gt; to manually identify the most relevant documents and pieces of documents from your vector store. This can be useful for providing semantic search capabilities across your site, or for identifying text to inject into prompts as a form of Retrieval-Augmented Generation (RAG). However, if you're working with RAG, there's a chance you might be better off using the &lt;code&gt;AskAsync&lt;/code&gt; method instead, as we'll see next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question Answering with Kernel Memory and an LLM
&lt;/h2&gt;

&lt;p&gt;If your end goal is to provide a reply to the user from a query they sent you, you should consider using Kernel Memory's &lt;code&gt;AskAsync&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AskAsync&lt;/code&gt; uses the text model to summarize the result of a search and provide that string back to the user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctxj5u5dabj5qa9nvpr3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctxj5u5dabj5qa9nvpr3.png" alt="RAG Search with Kernel Memory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The code for this is extremely straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Ask&lt;/span&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="s"&gt;"What do you want to ask?"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;MarkupLineInterpolated&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;$"[yellow]Asking '&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s"&gt;'...[/]"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;MemoryAnswer&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AskAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;WriteLine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This provides a text output from your LLM as you would expect:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazxryd8zu8ela7gb2nfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fazxryd8zu8ela7gb2nfp.png" alt="Kernel Memory fielding a question on the difference between Kernel Memory and Semantic Kernel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you might imagine, the &lt;code&gt;AskAsync&lt;/code&gt; method takes significantly longer than &lt;code&gt;SearchAsync&lt;/code&gt; because it effectively performs the same search and then uses the results to chat with the underlying LLM in RAG fashion.&lt;/p&gt;

&lt;p&gt;If you need information about the sources cited, those are also available in the &lt;code&gt;MemoryAnswer&lt;/code&gt;. This can be helpful for diagnostic or logging purposes, or simply to let the user know which sources were used to answer their question and give them additional links to investigate.&lt;/p&gt;
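&lt;p&gt;As a brief sketch, assuming the &lt;code&gt;answer&lt;/code&gt; variable from the earlier snippet (property names reflect my understanding of &lt;code&gt;MemoryAnswer&lt;/code&gt; and may vary by Kernel Memory version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// List the documents the answer drew from
foreach (var source in answer.RelevantSources)
{
    console.MarkupLineInterpolated($"[blue]{source.DocumentId}[/] ({source.SourceUrl})");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;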

&lt;h2&gt;
  
  
  Kernel Memory Extensions and Integrations
&lt;/h2&gt;

&lt;p&gt;Kernel Memory is a very powerful and flexible library with a simple API and good out-of-the-box behaviors. It also offers a high degree of customizability over those default behaviors for the times when you need additional control.&lt;/p&gt;

&lt;p&gt;For example, you can customize the prompts Kernel Memory uses for fact extraction and summarization, giving you more control of its behavior as a chat partner.&lt;/p&gt;

&lt;p&gt;Additionally, Kernel Memory has a number of different deployment models, ranging from in-process &lt;code&gt;MemoryServerless&lt;/code&gt; implementations like the one described in this article to pre-built Docker containers, to web services.&lt;/p&gt;
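&lt;p&gt;Switching between these deployment models is largely a matter of swapping the client. For example, if you were running Kernel Memory as a web service, something like the following sketch should work in place of the serverless setup (the endpoint is a hypothetical local address):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Talk to a remote Kernel Memory web service instead of running in-process
var memory = new MemoryWebClient("http://localhost:9001/");

MemoryAnswer answer = await memory.AskAsync("What formats does Kernel Memory support?");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;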

&lt;p&gt;Kernel Memory was also built with &lt;a href="https://learn.microsoft.com/en-us/semantic-kernel/overview/" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt; at least partially in mind. While Semantic Kernel has its own built-in vector store capabilities, they're harder to use than Kernel Memory's options and don't have as many ingestion options. As a result, you can connect Kernel Memory into a Semantic Kernel instance, providing a RAG data source for your AI orchestration solution. In fact, there's even a &lt;a href="https://www.nuget.org/packages/Microsoft.KernelMemory.SemanticKernelPlugin/" rel="noopener noreferrer"&gt;pre-built SemanticKernelPlugin NuGet package&lt;/a&gt; built just for this purpose.&lt;/p&gt;
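&lt;p&gt;Wiring the two together takes only a small amount of code. The following is a hedged sketch based on my reading of that plugin package (the &lt;code&gt;"memory"&lt;/code&gt; plugin name is arbitrary and the constructor parameters may differ by version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;// Expose Kernel Memory to a Semantic Kernel instance as a plugin
var memoryPlugin = new MemoryPlugin(memory, waitForIngestionToComplete: true);
kernel.ImportPluginFromObject(memoryPlugin, "memory");
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;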

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I'm absolutely enamored with the Kernel Memory library and see a lot of uses for this technology including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple RAG search and question answering for web applications&lt;/li&gt;
&lt;li&gt;Indexing existing knowledge sources like Confluence or Obsidian vaults&lt;/li&gt;
&lt;li&gt;Providing a cost-effective and secure option for document ingestion, ensuring document data never leaves the network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're curious about Kernel Memory, I'd encourage you to take a look at the &lt;a href="https://github.com/IntegerMan/DocumentSearchWithKernelMemory" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt; containing this article's code and try things yourself.&lt;/p&gt;

&lt;p&gt;I'll be writing more about Kernel Memory in a chapter in my next technical book which releases in Q3 2025, and I'm looking at revising my &lt;a href="https://blog.leadingedje.com/post/semantickerneldnd.html" rel="noopener noreferrer"&gt;Digital Dungeon Master&lt;/a&gt; solution to take advantage of the Kernel Memory / Semantic Kernel integration. I can also see some very real ways where Kernel Memory can help offer some of &lt;a href="https://LeadingEDJE.com" rel="noopener noreferrer"&gt;Leading EDJE's&lt;/a&gt; current and prospective clients additional value, capabilities, and cost savings so I'm excited to share this technology with the broader technical community.&lt;/p&gt;

&lt;p&gt;It's a great time to be doing AI and ML in .NET and I'm elated to have Kernel Memory as a tool in my toolbox.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>MCP and A2A: Two bright modular futures for AI</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 13 May 2025 20:46:00 +0000</pubDate>
      <link>https://forem.com/leading-edje/mcp-and-a2a-two-bright-modular-futures-for-ai-16f</link>
      <guid>https://forem.com/leading-edje/mcp-and-a2a-two-bright-modular-futures-for-ai-16f</guid>
<description>&lt;p&gt;To say that AI is a field undergoing rapid innovation and expansion would be an extreme understatement. In recent years it's been hard to keep up with the deluge of new models, capabilities, and tools. At the moment there are perhaps no brighter rising stars than a pair of new open standards for AI development.&lt;/p&gt;

&lt;p&gt;In this article we'll introduce and explore the Model Context Protocol (MCP) and the Agent2Agent (A2A) protocol, compare and contrast the two approaches, and help you determine what's right for you and your team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Context Protocol (MCP)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;Model Context Protocol (MCP)&lt;/a&gt; is an open protocol that seeks to standardize the tools and capabilities that AI systems can rely on. Prior to MCP you could build custom tools into an AI application using tooling like &lt;a href="https://learn.microsoft.com/en-us/semantic-kernel/overview/" rel="noopener noreferrer"&gt;Microsoft Semantic Kernel&lt;/a&gt; or &lt;a href="https://www.langchain.com/" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; through custom code and connectors. This would allow you to marry together large language models and your own data or custom toolsets.&lt;/p&gt;

&lt;p&gt;This approach was effective and helped us build custom retrieval-augmented generation (RAG) solutions like a &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/websiteragchat.html" rel="noopener noreferrer"&gt;web chat agent&lt;/a&gt; or full-blown AI orchestration solutions like a hobby project I carried out to build an &lt;a href="https://blog.leadingedje.com/post/semantickerneldnd.html" rel="noopener noreferrer"&gt;AI-powered dungeon master&lt;/a&gt;. However, while these agents were capable, their capabilities were confined to the AI application itself.&lt;/p&gt;

&lt;p&gt;This meant that when new capabilities like &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/codingassistant.html" rel="noopener noreferrer"&gt;developer AI chat plugins&lt;/a&gt; or web chat portals began offering new places and new ways for your team to interact with AI, your systems couldn't easily integrate with them.&lt;/p&gt;

&lt;p&gt;Model Context Protocol solves this problem by separating the custom capabilities from the chat plugin and LLM. Instead of building a pipeline for processing messages that adds in tooling, MCP focuses on exposing tooling in a modular way that can be referenced by any number of other AI systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8tqaud2ujad0jut6z2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8tqaud2ujad0jut6z2t.png" alt="A web chat portal and an IDE AI agent connecting to similar resources through MCP servers" width="747" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means that developers can build an MCP server and connect their IDE chat plugins, web chat portal, and even future systems like OS or browser-level AI agents to the same MCP server.&lt;/p&gt;

&lt;p&gt;What this ultimately gets you is efficiency and modularity, as developers can focus their efforts on the specific capabilities they're trying to author versus the end-to-end experience (though this is still an important facet to consider in overall systems design). As a result, your AI systems and resources are more flexible to changes in tooling and changes in the LLMs they're connected to.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's on an MCP Server?
&lt;/h3&gt;

&lt;p&gt;MCP systems consist of an MCP client and an MCP server. The MCP client is typically integrated into the conversational AI interface the user is working in, so it doesn't contain much logic of its own.&lt;/p&gt;

&lt;p&gt;The MCP server, on the other hand, is what contains all of the custom logic to provide additional capabilities. MCP servers can contain the following types of resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; - functions that an AI system can invoke to accomplish a specific task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resources&lt;/strong&gt; - catalogued resources such as files, documents, or database records&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompts&lt;/strong&gt; - pre-defined interaction templates governing common interaction scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you author an MCP server, you provide a way of advertising what capabilities your server offers, ways of retrieving prompts and resources, and ways of executing the specified tools. This is done in a flexible manner that allows your server to dynamically discover and advertise resources, so they can change based on new files, database entries, or external API results.&lt;/p&gt;
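&lt;p&gt;To make this concrete, here's a hedged sketch of a tool using the C# MCP SDK's attribute-based registration (the &lt;code&gt;OrderTools&lt;/code&gt; class, &lt;code&gt;GetOrderStatus&lt;/code&gt; method, and its canned response are all hypothetical examples, not part of any real server):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;[McpServerToolType]
public static class OrderTools
{
    [McpServerTool, Description("Looks up the current status of an order by its Id.")]
    public static string GetOrderStatus(string orderId)
    {
        // Your custom logic goes here; a real tool might hit a database or API
        return $"Order {orderId} is being processed.";
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;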

&lt;p&gt;While I currently strongly encourage teams to consider MCP when authoring AI applications, that doesn't mean there's no longer a place for AI orchestration libraries like Semantic Kernel. These frameworks remain a great way to build highly controlled AI experiences with a pre-defined set of capabilities. In fact, &lt;a href="https://devblogs.microsoft.com/semantic-kernel/integrating-model-context-protocol-tools-with-semantic-kernel-a-step-by-step-guide/" rel="noopener noreferrer"&gt;Semantic Kernel recently added direct integration with MCP servers&lt;/a&gt;, so you can migrate existing SK apps to use MCP servers, or you can expose many of these systems' capabilities through an MCP server while keeping the existing application in place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying MCP Servers
&lt;/h3&gt;

&lt;p&gt;The final note about MCP servers I want to cover is that they are typically installed locally at the moment. Although there are protocols for remote MCP servers addressable via URLs, the more common approach today is to install and configure MCP servers locally, with those local servers calling out to external resources as needed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwpob2glxp7sx1djyx6t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwpob2glxp7sx1djyx6t.png" alt="MCP Server running locally but connecting to an external API" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is done to keep authentication logic out of the MCP client layer since your specific chat agent should not need to know how to authenticate with Azure's MCP server, GitHub's MCP server, and your team's MCP server. Instead, we deploy MCP servers locally and provide configuration to that specific server for how it should authenticate to get its remote resources.&lt;/p&gt;

&lt;p&gt;Remote MCP servers will likely increase in popularity as the protocol matures to standardize the authentication process, but for now the recommendation is to use small custom MCP servers that run locally and connect to an API that your team controls using whatever authentication mechanism your team prefers.&lt;/p&gt;
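&lt;p&gt;In practice, that local configuration is usually a small JSON entry in the host tool's settings. The sketch below is purely illustrative (the server name, project path, environment variable, and exact schema all vary between host tools):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "servers": {
    "my-team-mcp": {
      "command": "dotnet",
      "args": ["run", "--project", "./MyTeamMcpServer"],
      "env": { "TEAM_API_KEY": "your-key-here" }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;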

&lt;p&gt;With MCP covered, let's turn our attention now to Agent-to-agent (A2A).&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent2Agent (A2A)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://google.github.io/A2A/" rel="noopener noreferrer"&gt;Agent2Agent (A2A)&lt;/a&gt; is an open protocol that's emerged recently from Google that allows AI agents to communicate with each other in a standardized process.&lt;/p&gt;

&lt;p&gt;While MCP focuses on exposing modular capabilities, A2A focuses on providing a standard protocol for interacting with the entire AI agent, including its prompts, resources, tools, and model selection.&lt;/p&gt;

&lt;p&gt;In this way, you can think of MCP as a toolbox - exposing tools and capabilities in a standardized way - while A2A is more of a standardized way of finding a mechanic who might be able to solve your problems using the tools at their disposal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flg5l38hy98oqt2ba9fko.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flg5l38hy98oqt2ba9fko.png" alt="Model Context Protocol (MCP) offering a set of tools vs Agent2Agent (A2A) offering entire agents" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This standardization of inter-agent communications allows for more complex scenarios where agents can be authored to solve specific tasks and other agents can coordinate between these agents to handle complex requests.&lt;/p&gt;

&lt;p&gt;Think of A2A as a way for other AI agents to "phone a friend" if they find they don't have a sufficient amount of information or capabilities to fulfill a request.&lt;/p&gt;

&lt;p&gt;Using A2A you can author a series of AI agents and connect them to each other, or author a single agent that is able to call out to specialized third-party agents from trusted vendors you've selected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04y7hbw7mb1z4iy0m68o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04y7hbw7mb1z4iy0m68o.png" alt="A2A allows your agent to send requests to external agents" width="491" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Like MCP with its MCP client and MCP servers, A2A also uses an A2A client and A2A server architecture. However, A2A represents its work as &lt;strong&gt;tasks&lt;/strong&gt; and its interactions as &lt;strong&gt;messages&lt;/strong&gt; composed of multiple &lt;strong&gt;parts&lt;/strong&gt;, which are designed to produce one or more &lt;strong&gt;artifacts&lt;/strong&gt; representing deliverables to be returned to the calling agent.&lt;/p&gt;

&lt;p&gt;Overall, A2A allows for a more modular and open AI ecosystem while providing support for new interaction scenarios.&lt;/p&gt;

&lt;h2&gt;
  
  
  Should you choose MCP or A2A?
&lt;/h2&gt;

&lt;p&gt;While both A2A and MCP exist to provide more modular AI systems, these are not conflicting technologies.&lt;/p&gt;

&lt;p&gt;It is entirely feasible to create an AI agent that connects to a series of capabilities using MCP to accomplish its tasks and uses A2A to request help with tasks it's not capable of handling on its own.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5z1w5sxse6oaciedg8hj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5z1w5sxse6oaciedg8hj.png" alt="A2A and MCP working together" width="498" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the moment my general advice to organizations is to embrace MCP if they want to provide custom capabilities for AI agents that could integrate into different user interfaces such as developer IDE chatbots and web chat portals. Designing with MCP in mind opens up new options for organizations as AI tooling evolves and becomes available in more places.&lt;/p&gt;

&lt;p&gt;While A2A is a newer protocol, I encourage organizations to consider A2A as a consumer when there are complex interaction scenarios they want to support but don't want to build a custom solution for.&lt;/p&gt;

&lt;p&gt;Alternatively, organizations may find themselves in a position where they have specialized capabilities they want to offer to other organizations. In these cases it may make sense to use A2A to share your own agents with external callers so they can integrate them into their own AI solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Both MCP and A2A are open protocols that are gaining broad popularity and support, most recently in the form of &lt;a href="https://www.microsoft.com/en-us/microsoft-cloud/blog/2025/05/07/empowering-multi-agent-apps-with-the-open-agent2agent-a2a-protocol/?msockid=36378f17b78f6f8939599c30b6276e3b" rel="noopener noreferrer"&gt;Microsoft pledging support of both MCP and A2A protocols&lt;/a&gt; in their offerings going forward.&lt;/p&gt;

&lt;p&gt;In a rapidly evolving AI ecosystem it makes sense for standards to emerge for defining the ways in which AI systems can connect to capabilities and the ways in which AI systems can interact with each other.&lt;/p&gt;

&lt;p&gt;I, for one, welcome our new more modular and interchangeable AI assistants.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>agents</category>
      <category>llm</category>
    </item>
    <item>
      <title>Reference Architecture for AI Developer Productivity</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 06 May 2025 20:24:08 +0000</pubDate>
      <link>https://forem.com/leading-edje/reference-architecture-for-ai-developer-productivity-11b8</link>
      <guid>https://forem.com/leading-edje/reference-architecture-for-ai-developer-productivity-11b8</guid>
      <description>&lt;p&gt;In this article we'll lay out a reference architecture for an in-house AI assistant that can help development teams work with AI agents connected to their data directly from their integrated development environment (IDE) as well as providing a web portal for other team members to be able to interact with the same capabilities through their browser for non-programming tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer Productivity Opportunities with AI
&lt;/h2&gt;

&lt;p&gt;Since large language models (LLMs) gained mass popularity with the advent of GPT-3.5 Turbo (ChatGPT), many teams have been looking at LLMs as an opportunity to improve team productivity when writing and maintaining code. Teams have looked to integrate AI agents into their development workflows to help with the following tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scaffolding new code&lt;/strong&gt; by writing standardized but tedious pieces of code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generating unit tests&lt;/strong&gt; around specific existing code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drafting documentation&lt;/strong&gt; of public methods and classes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guided refactoring and code review&lt;/strong&gt; of existing code that is in an unmaintainable state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helping analyze error messages&lt;/strong&gt; to expedite troubleshooting of known issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Of course, anything involving AI must be done in a secure manner that protects the organization's intellectual property.&lt;/p&gt;

&lt;p&gt;Additionally, publicly available LLMs and coding assistants are missing organizational context such as your internal documentation, standards, work items, or even relevant configuration data or context from a local database. This eliminates some opportunities for AI systems to help your team at a deeper level, such as cases where the system could query cloud resources or configuration to provide answers to targeted queries.&lt;/p&gt;

&lt;p&gt;This means that a simple solution like an IDE plugin pointing directly to a public model may not always be viable for security or capability reasons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyewieiuxlsh520t1vxt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyewieiuxlsh520t1vxt.png" alt="A straightforward but limited approach with an IDE plugin connected to an LLM" width="494" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A custom solution is sometimes needed to protect organizational privacy needs and serve up relevant context to the organization through dedicated AI assistants connected to model context protocol (MCP) servers and optionally self-hosted LLMs provided by the organization.&lt;/p&gt;

&lt;p&gt;The rest of this article will outline the major parts of such a system.&lt;/p&gt;

&lt;h2&gt;
  
  
  A sample architecture
&lt;/h2&gt;

&lt;p&gt;An AI assistant agent system is composed of the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;model provider&lt;/strong&gt; that serves up one or more LLMs in an organization-approved manner&lt;/li&gt;
&lt;li&gt;One or more &lt;strong&gt;model context protocol (MCP)&lt;/strong&gt; servers that provide additional resources to your AI agent.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;IDE Plugin&lt;/strong&gt; connecting your IDE to the other parts of the system and providing a chat interface&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The model layer
&lt;/h2&gt;

&lt;p&gt;One of the easiest ways of gaining peace of mind over the use of your data is to customize which model your IDE plugin interacts with.&lt;/p&gt;

&lt;p&gt;There's a very real difference between the privacy concerns of a public OpenAI chat model versus one hosted on Azure in a more partitioned environment where your data will not be retained or used to train future models.&lt;/p&gt;

&lt;p&gt;For organizations wanting extreme control, a model can be deployed and hosted on your own network so your data never leaves your premises. Technologies like &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; and &lt;a href="https://lmstudio.ai/" rel="noopener noreferrer"&gt;LM Studio&lt;/a&gt; allow you to run LLMs on your own devices for free, though these do not typically provide access to the more recent commercial models.&lt;/p&gt;
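&lt;p&gt;As a quick sketch, here's how a request to a locally hosted model might look using Ollama's HTTP chat API. This assumes Ollama is running on its default port with a model such as "llama3.1" already pulled; adjust names to match your setup.&lt;/p&gt;

```python
# Sketch of talking to a locally hosted model through Ollama's HTTP API.
# Assumes Ollama is running on its default port (11434) and a model such
# as "llama3.1" has already been pulled; adjust names to your setup.

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_payload(model, system_prompt, user_text):
    """Assemble the JSON body for a single-turn chat request."""
    return {
        "model": model,
        "stream": False,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }

payload = build_chat_payload(
    "llama3.1",
    "You are an internal coding assistant. Data never leaves this network.",
    "Explain what an MCP server is in one sentence.",
)

# To actually send the request (requires a running Ollama instance):
# import json, urllib.request
# req = urllib.request.Request(
#     OLLAMA_URL,
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read())
```

&lt;p&gt;Because the endpoint lives at localhost, the conversation never leaves the developer's machine, which is the whole point of this layer of the architecture.&lt;/p&gt;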

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tlydusb4x4imuy84g4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4tlydusb4x4imuy84g4a.png" alt="Self-hosting LLMs on network" width="645" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, your hardware must be sufficient to fit these models into memory and handle incoming requests. This requires a sufficiently capable graphics card or chipset along with enough RAM to fit the model completely in memory. Underpowered machines or insufficient RAM may result in models that fail to function or that operate at only a fraction of their normal speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The IDE Plugin
&lt;/h2&gt;

&lt;p&gt;There are a growing number of IDE plugins that provide chat capabilities including &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;Continue.dev&lt;/a&gt;, and  &lt;a href="https://roocode.com/" rel="noopener noreferrer"&gt;Roo Code&lt;/a&gt;. There are even already some dedicated AI IDEs such as &lt;a href="https://www.cursor.com/" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;These tools allow you to have inline conversations with AI agents in your IDE and provide context about your code and structure. The agent in turn can chat back to you, edit code with your permission, or even act in an &lt;strong&gt;agentic&lt;/strong&gt; manner and make a series of iterative changes and observations about the results of these changes in order to achieve a larger result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frj3rmp0rlsc9j4p9iscx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frj3rmp0rlsc9j4p9iscx.png" alt="Continue.dev in action, hosted in VS Code" width="347" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Leading EDJE fully expects the trend of custom IDE plugins and IDEs to continue forward much in the same way that models continue to improve, so we don't want to focus on specific implementations in this article.&lt;/p&gt;

&lt;p&gt;Instead, we want to be prescriptive about what you should look for in an IDE plugin.&lt;/p&gt;

&lt;p&gt;An IDE plugin should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not send your data to a server beyond your control&lt;/li&gt;
&lt;li&gt;Allow you to customize which LLM you are interacting with, so you can provide your own LLMs&lt;/li&gt;
&lt;li&gt;Allow you to specify one or more model context protocol (MCP) servers to augment your agent with capabilities.&lt;/li&gt;
&lt;/ul&gt;
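&lt;p&gt;To illustrate these three requirements, here's a hypothetical plugin configuration pointing at a self-hosted model and a pair of MCP servers. The schema is invented for illustration; consult your plugin's documentation for its actual format.&lt;/p&gt;

```python
import json

# A hypothetical IDE-plugin configuration illustrating the requirements
# above: an on-network endpoint, a custom model, and MCP servers. This
# schema is invented for illustration and does not match any specific
# plugin's real configuration format.

plugin_config = {
    "models": [
        {
            "title": "Internal Llama",
            "provider": "openai-compatible",
            "apiBase": "http://localhost:11434/v1",  # stays on-network
            "model": "llama3.1",
        }
    ],
    "mcpServers": {
        # Hypothetical command names for organization-provided servers
        "org-knowledge": {"command": "org-mcp", "args": ["--scope", "company"]},
        "github": {"command": "github-mcp", "args": []},
    },
}

config_json = json.dumps(plugin_config, indent=2)
```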

&lt;p&gt;Additional features, such as letting the user choose which files are included in the conversation and manage conversation history, are also important from a usability and productivity perspective.&lt;/p&gt;

&lt;h2&gt;
  
  
  The context layer
&lt;/h2&gt;

&lt;p&gt;The final component of our AI agent architecture involves providing our system with additional context beyond the code the IDE provides and the capabilities of our LLMs.&lt;/p&gt;

&lt;p&gt;This additional context comes through Model Context Protocol servers which connect your AI agent to additional skills in a standardized manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm11fmu2hnqs6wq5u9wm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjm11fmu2hnqs6wq5u9wm.png" alt="Our AI coding assistant reference architecture showing MCP servers and a model provider" width="781" height="699"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, an MCP server might provide built-in skills such as random number generation or a way of getting the current date. Alternatively, an MCP server could call out to an internal or external API to provide additional data, such as relevant documents from a knowledgebase retrieved via RAG search in a manner similar to that discussed in our &lt;a href="https://blog.leadingedje.com/post/referencearchitecture/websiteragchat.html" rel="noopener noreferrer"&gt;web chat agent reference architecture&lt;/a&gt;.&lt;/p&gt;
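&lt;p&gt;The tool-registry idea behind these skills can be sketched in a few lines. Note this is not the MCP protocol itself (real servers speak JSON-RPC through an MCP SDK); it only illustrates how simple skills like dates and random numbers become agent-callable tools.&lt;/p&gt;

```python
import datetime
import random

# A minimal illustration of the kind of "skills" an MCP server exposes.
# This is not the MCP protocol itself (real servers speak JSON-RPC via
# an MCP SDK); it shows the tool-registry idea behind it.

TOOLS = {}

def tool(func):
    """Register a function as an agent-callable tool."""
    TOOLS[func.__name__] = func
    return func

@tool
def current_date():
    """Return today's date as an ISO-8601 string."""
    return datetime.date.today().isoformat()

@tool
def roll_die(sides=6):
    """Return a random integer from 1 to the given number of sides."""
    return random.randint(1, sides)

def invoke(name, **kwargs):
    """Dispatch a tool call by name, as an agent host would."""
    return TOOLS[name](**kwargs)
```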

&lt;p&gt;Your plugin may use multiple MCP servers running locally, with some of these servers provided by your organization and others provided by other organizations.&lt;/p&gt;

&lt;p&gt;For example, when working on a project a developer might use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An MCP server containing organization-wide information&lt;/li&gt;
&lt;li&gt;An MCP server for information specifically related to the developer's project or team&lt;/li&gt;
&lt;li&gt;An MCP server provided by their cloud hosting provider, allowing for querying cloud resources&lt;/li&gt;
&lt;li&gt;An MCP server provided by GitHub for querying version control history&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By combining internal and external MCP servers, developers can customize the skills available to them based on the tasks they're working on and their current organizational context. Integrating external MCP servers also saves implementation costs by reusing shared capabilities provided by vendors instead of rebuilding those capabilities in a custom MCP server.&lt;/p&gt;

&lt;p&gt;MCP servers may be run locally or may be referenced at an HTTP endpoint. At the time of this writing a true authentication-friendly standard for web-based MCP servers has not been fully defined, so most MCP servers are currently deployed locally, but we expect this to change as the technology evolves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Additional Concerns
&lt;/h2&gt;

&lt;p&gt;This document outlines an AI coding assistant architecture consisting of customizable models, MCP servers providing additional context and capabilities, and an IDE plugin to tie things together.&lt;/p&gt;

&lt;p&gt;At the moment, standards around MCP servers and the availability and capabilities of IDE plugins and IDEs are rapidly evolving. You may find that your organization's IDEs do not currently allow the degree of customization of model source and MCP servers mentioned in this article.&lt;/p&gt;

&lt;p&gt;In that case, or where non-developers want these same capabilities outside of an IDE, you may want to consider implementing a web portal for chatting with custom MCP servers and organization-approved LLMs hosted in approved locations, as we'll discuss in a future reference architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI assistants are powerful capabilities that can help improve the effectiveness of your team, but privacy and control are critical factors for organizations.&lt;/p&gt;

&lt;p&gt;Self-hosting models or specifying a custom location for a cloud-hosted model gives organizations additional control over where their data is going and how it is retained.&lt;/p&gt;

&lt;p&gt;The arrival of MCP servers also reduces the need for custom AI orchestration solutions like &lt;a href="https://blog.leadingedje.com/post/semantickerneldnd.html" rel="noopener noreferrer"&gt;Semantic Kernel&lt;/a&gt; and instead pushes AI skillsets to more centralized locations where they can be connected to from your IDE plugin. By integrating MCP servers into your workflow you allow for modular AI capabilities to augment your existing workflows and inject additional organizational, operational, or team contexts into your conversations.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>ai</category>
      <category>productivity</category>
      <category>development</category>
    </item>
    <item>
      <title>Reference Architecture for Website Chat Agents</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 01 Apr 2025 14:03:20 +0000</pubDate>
      <link>https://forem.com/leading-edje/reference-architecture-for-website-chat-agents-24ol</link>
      <guid>https://forem.com/leading-edje/reference-architecture-for-website-chat-agents-24ol</guid>
      <description>&lt;p&gt;This article serves as a &lt;a href="https://learn.microsoft.com/en-us/azure/architecture/browse/" rel="noopener noreferrer"&gt;reference architecture&lt;/a&gt; for an embedded web chat AI agent that can help visitors effectively find relevant content on a website.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conversational AI with Content Search
&lt;/h2&gt;

&lt;p&gt;A web chat AI agent offers a way to integrate conversational AI into an existing website. This gives the user a way of searching the site using conversational language and receiving factually correct responses based on actual data and documentation from the organization's business domain. Web Chat AI agents work by providing either a UI component (usually a pop-up in the lower right corner) or a dedicated page for the user to interact with the system. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbikgxccqk4ton8qfpc8s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbikgxccqk4ton8qfpc8s.png" alt="A website chat agent" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, a website for a parts manufacturer might use an AI agent to help users find an exact part by its description while a community organization's website could help connect families to events of interest or support resources related to special challenges faced by that family.&lt;/p&gt;

&lt;p&gt;When the user types something into the chat system it is relayed to a large language model (LLM) along with the conversation history and a guiding prompt. The LLM's response is shown to the user. Such interactions are now commonplace due to LLMs like ChatGPT, GPT-4o, and others.&lt;/p&gt;

&lt;p&gt;While LLMs are capable of offering conversational capabilities in a variety of languages, they are not trained on your specific business or its current data and so they're unlikely to represent your business well to the user and are especially unlikely to provide factual, accurate information. To address this, we need to provide additional data and context to the LLM through the process of Retrieval Augmented Generation (RAG).&lt;/p&gt;

&lt;p&gt;The rest of this article outlines a high-level reference architecture to accomplish this goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI Chat Agent System Architecture
&lt;/h2&gt;

&lt;p&gt;Let's talk about the components that go into a RAG chat agent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme4ugyycu3wefl2488y4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fme4ugyycu3wefl2488y4.png" alt="RAG Chat Agent Reference Architecture" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An AI Chat agent consists of the following key parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;Web Application&lt;/strong&gt; that hosts the rest of your content and houses the front end components of your AI Agent.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Web API&lt;/strong&gt; that the Web Application calls to with chat messages. The Web API accomplishes the necessary work to generate a response and returns it to the Web Application.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;File Storage&lt;/strong&gt; container containing documents that the AI agent should consider as source information.&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;Indexer&lt;/strong&gt; (not pictured) that generates &lt;em&gt;embeddings&lt;/em&gt; from portions of these data files and stores them in the Vector Database.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;Vector Database&lt;/strong&gt; that stores the &lt;em&gt;embeddings&lt;/em&gt; from the source documents and allows searching documents by their similarity to a search string or document.&lt;/li&gt;
&lt;li&gt;One or more &lt;strong&gt;Large Language Models (LLMs)&lt;/strong&gt; that can generate chat responses from conversations and relevant document information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's see how these pieces connect with each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65f31pt61pkckej1z3vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F65f31pt61pkckej1z3vo.png" alt="A Sequence Diagram illustrating the interaction between components" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level, these components communicate in the following pattern:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The web application makes an API call to the Web API using REST or some other web communication technology. This request contains the chat message and usually contains the chat history unless this history is also stored in the Web API.&lt;/li&gt;
&lt;li&gt;The Web API optionally uses an &lt;em&gt;embedding model&lt;/em&gt; to generate an embedding representing the conversation's contents. &lt;em&gt;Note: this is not necessary in all solutions, as some vector databases can generate these embeddings automatically on search.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;The embedding model generates and returns a large array of numbers representing the content of the chat. This is typically called a &lt;em&gt;vector&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The Web API calls the vector database with information on what to search for. This typically requires a vector, but in some systems may work with text. This step often defines the maximum number of results to return or the minimum amount of similarity needed for something to show up in the search results.&lt;/li&gt;
&lt;li&gt;The vector database returns information on any matching documents or portions of documents. This information typically includes a similarity score, a URL to the document, and relevant text from the document.&lt;/li&gt;
&lt;li&gt;The Web API takes its system prompt (textual instructions on how to behave), the chat history, and any relevant documents that came back from the vector search and sends them to a chat completion large language model.&lt;/li&gt;
&lt;li&gt;The LLM returns one or more responses to the query.&lt;/li&gt;
&lt;li&gt;The final response to the user is returned back to the web application which then displays it to the user.&lt;/li&gt;
&lt;/ol&gt;
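&lt;p&gt;The flow above can be condensed into a short sketch with the embedding model, vector database, and LLM stubbed out. The function names are illustrative; real implementations would call your chosen services.&lt;/p&gt;

```python
# A condensed sketch of the request flow above, with the embedding
# model, vector database, and LLM stubbed out. Function names are
# illustrative; real implementations call your chosen services.

def embed(text):
    """Stub: return a vector for the text (real code calls an embedding model)."""
    return [float(ord(c)) for c in text[:8]]

def search_vectors(vector, top_k=3):
    """Stub: return matching document chunks with similarity scores."""
    matches = [
        {"score": 0.92, "url": "/docs/faq", "text": "Our return window is 30 days."},
    ]
    return matches[:top_k]

def complete(system_prompt, history, documents):
    """Stub: call a chat-completion LLM with prompt, history, and context."""
    context = " ".join(doc["text"] for doc in documents)
    return "Based on our documentation: " + context

def handle_chat(message, history, system_prompt):
    """Steps 2 through 8: embed, search, augment the prompt, respond."""
    vector = embed(message)                              # steps 2-3
    documents = search_vectors(vector)                   # steps 4-5
    reply = complete(system_prompt, history, documents)  # steps 6-7
    return reply                                         # step 8
```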

&lt;h2&gt;
  
  
  Platform Specifics
&lt;/h2&gt;

&lt;p&gt;Let's look at some implementation specifics for these various components.&lt;/p&gt;

&lt;h3&gt;
  
  
  Front End Technologies
&lt;/h3&gt;

&lt;p&gt;The majority of web front ends use JavaScript as their programming language due to JavaScript's ubiquity. Plain JavaScript or a framework like React, Vue.js, or Angular is a common choice for implementing the front end of a RAG chat agent.&lt;/p&gt;

&lt;p&gt;Alternative implementations might take advantage of web assembly (WASM) or technologies like &lt;a href="https://dotnet.microsoft.com/en-us/apps/aspnet/web-apps/blazor" rel="noopener noreferrer"&gt;Blazor&lt;/a&gt; or &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_components" rel="noopener noreferrer"&gt;web components&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Additionally, if you are building an application for a non-web platform, such as a desktop or mobile application, your front end will usually involve the most common technologies on that platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Web Communication
&lt;/h3&gt;

&lt;p&gt;The majority of web communications between modern applications relies on REST with JSON as the content body for requests and responses. However, alternatives exist including SOAP and gRPC.&lt;/p&gt;

&lt;p&gt;If you find yourself in a performance-critical scenario and want live responses between client and server, you may want to consider a technology that can take advantage of &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API" rel="noopener noreferrer"&gt;web sockets&lt;/a&gt; for live updates over active connections. This is made more manageable by frameworks like SignalR and Socket.IO that manage these connections for you. However, an active connection carries more complexity and server-side performance concerns, and it may not be worth the investment unless you expect frequent back-and-forth interactions or want to stream responses back to the client as they're generated.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Components
&lt;/h3&gt;

&lt;p&gt;In our reference architecture we define the web application, web API, vector database, file storage, indexer, and LLM components. &lt;/p&gt;

&lt;p&gt;The exact specifics of these technologies will vary from cloud provider to cloud provider, but this section offers a high-level list of components you should consider based on each cloud provider:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When working with Azure, you might consider the following services for this architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Web Application&lt;/strong&gt; May involve hosting in an Azure App Service or in a Static Web App with a Blob Storage container containing the relevant files. App Service should be used if your application requires complex logic and processing by a server while Static Web App is appropriate for solutions consisting only of static files like HTML, CSS, and JavaScript files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web API&lt;/strong&gt; likely involves an Azure App Service hosting some form of API logic. You may also want to consider Azure API Management to give additional flexibility around this API. When working with Azure API Management, you can sometimes replace the Azure App Service with individual Azure Functions or Azure Logic Apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Database&lt;/strong&gt; and &lt;strong&gt;Indexer&lt;/strong&gt; can be accomplished using Azure AI Search as an indexer and vector storage solution. Alternatively, you could store embeddings in Cosmos DB if you have more complex needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Storage&lt;/strong&gt; is best accomplished using Blob Storage and an Azure Storage Account. Typically your Azure AI Search resource is configured to search one or more containers in blob storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Language Models&lt;/strong&gt; are easiest to get started with via the Azure OpenAI Service and the chat completion models hosted there. Alternatively, the model catalog in Azure Machine Learning Studio has a wide variety of models you can work with and deploy.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;AWS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Web Services follows the same basic structure as Azure, but the technologies change:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Web Applications&lt;/strong&gt; would use AWS Elastic Beanstalk for dynamic sites or Amazon S3 and Amazon CloudFront for static content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web API&lt;/strong&gt; could also involve AWS Lambda and Amazon API Gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Database&lt;/strong&gt; might involve Amazon Kendra or a backing store like DynamoDB or Amazon RDS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Storage&lt;/strong&gt; will typically involve Amazon S3, which can be searched and indexed through Amazon OpenSearch Service (formerly Amazon Elasticsearch Service).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Language Models&lt;/strong&gt; typically involve Amazon Bedrock or Amazon SageMaker.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Google Cloud&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Like AWS, Google also supports these capabilities, but its services differ:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Web Application&lt;/strong&gt; could involve Google App Engine or Google Cloud Storage with Firebase hosting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web API&lt;/strong&gt; might also involve Cloud Functions or Google Cloud Endpoints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Database&lt;/strong&gt; might involve Bigtable or Google Cloud Search.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Storage&lt;/strong&gt; involves Google Cloud Storage and potentially Document AI from Google Cloud's AI APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large Language Models&lt;/strong&gt; are available through Google Vertex AI, which offers LLM capabilities you might want to consider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The details of each cloud provider will be unique to that provider and its technologies, but at a high level this architecture is supported by most major cloud providers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Concerns
&lt;/h2&gt;

&lt;p&gt;When building a conversational AI agent with RAG capabilities, there are a few other things you should keep in mind.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;p&gt;Any time you expose an LLM to a user through a chat interface, there's a chance a percentage of your users might attack the AI system through prompt injection attacks. These attacks typically aim to give the agent new instructions in how it should operate or aim to discover more about the prompt and data available to the agent.&lt;/p&gt;

&lt;p&gt;While some attacks may be benign, such as trying to get the AI to say silly things or agree to ridiculous requests, others could constitute more serious threats.&lt;/p&gt;

&lt;p&gt;Major AI systems do get attacked by some very creative people who are curious about how your system works. These people will try to get access to your system prompt which gives textual instructions as to how the agent should behave. If attackers realize your system uses RAG to query for additional data, they may also try to exploit that to see if they can retrieve sensitive information.&lt;/p&gt;

&lt;p&gt;While there are ways of detecting and mitigating prompt injection attacks, it's best to treat everything in your system prompt as text that may eventually be leaked to the public. That means avoiding putting any sensitive or compromising instructions in your prompt, such as "Try to discourage the customer from buying cheaper products".&lt;/p&gt;

&lt;p&gt;Additionally, when offering RAG as a possibility to augment data, the data retrieved should not allow for accessing sensitive data the user themselves shouldn't be able to see. This means that you would want to keep any sensitive information out of the index your RAG agent searches and restrict your data to public-facing documents such as a public knowledgebase or training manuals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance and Quality
&lt;/h3&gt;

&lt;p&gt;Agentic chat systems, like a website chat agent using RAG, perform better than systems based on LLMs alone, but they still have their limitations.&lt;/p&gt;

&lt;p&gt;Like any AI or machine-learning system (or any human agent), an AI solution is going to make mistakes a certain percentage of the time. This may occur once in every 1,000 requests or once in every 10, but at some point your system will likely give an answer that's different from the one you wish it had given.&lt;/p&gt;

&lt;p&gt;This can be mitigated by refining your system prompt to account for the weak area your AI agent missed or by providing additional documents to be indexed and added to the searchable content the RAG agent looks at. Additionally, adding test cases around known weak areas can help shore up deficiencies in your AI systems.&lt;/p&gt;

&lt;p&gt;There is a danger that AI systems may make mistakes your organization never becomes aware of. There are two major ways of handling this problem:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Provide users with a "thumbs up" / "thumbs down" button that they can click to flag interactions for review. Negative interactions can be logged to a table or trigger an email send so that your team can review them and determine the best steps to handle cases like the one that occurred in the future.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatically log all interactions with the AI system, potentially anonymizing or redacting any user information. These logs can then be reviewed either by a human or by being fed into an AI system for automatic grading. In the latter case, the AI agent would have its own prompt and access to an LLM and could use it to analyze the interaction and flag potential issues for human review.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
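&lt;p&gt;As a rough illustration of the first approach, a feedback handler only needs to capture the rating and enough context to review the conversation later. The sketch below is a minimal, hypothetical example (the &lt;code&gt;log_feedback&lt;/code&gt; function, table, and column names are all assumptions, not a real API):&lt;/p&gt;

```python
import json
import sqlite3
from datetime import datetime, timezone

def log_feedback(db_path: str, conversation_id: str,
                 is_positive: bool, transcript: list[dict]) -> None:
    """Record a thumbs up / thumbs down rating so flagged chats can be reviewed."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS chat_feedback (
                   conversation_id TEXT,
                   is_positive INTEGER,
                   transcript TEXT,
                   recorded_at TEXT)""")
        conn.execute(
            "INSERT INTO chat_feedback VALUES (?, ?, ?, ?)",
            (conversation_id, int(is_positive), json.dumps(transcript),
             datetime.now(timezone.utc).isoformat()))

# A negative rating flags the conversation for later human review
log_feedback(":memory:", "conv-123", False,
             [{"role": "user", "content": "What are your hours?"}])
```

&lt;p&gt;In a real system this would sit behind your chat API, and the review queue would likely live in your existing database or ticketing system rather than SQLite.&lt;/p&gt;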

&lt;p&gt;In either case, your organization and its legal team should be mindful of data retention and storage of user interactions with your AI system, as users must be made aware that their interactions are being logged for review.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;As software systems grow in utilization there are often growing pains involved. RAG chat agents are no different in this regard.&lt;/p&gt;

&lt;p&gt;Most hosted LLMs bill you based on the number of tokens you consume with your requests, where tokens are typically fragments of words. This means that every time someone chats with the system, the request consumes a certain number of tokens for the text being sent to the LLM as well as the text it generates. This billed amount is typically very small, but it can add up over time - particularly when you send the entire conversation history over with every request.&lt;/p&gt;
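&lt;p&gt;To see why resending history matters, consider a back-of-the-envelope cost estimate. The per-token prices below are placeholders, not real rates for any particular model:&lt;/p&gt;

```python
def estimate_chat_cost(turns: list[tuple[int, int]],
                       input_price_per_1k: float = 0.0005,
                       output_price_per_1k: float = 0.0015) -> float:
    """Estimate the cost of a conversation where each turn resends prior history.

    turns: (prompt_tokens, completion_tokens) for each user message.
    Because the full history is sent with every request, earlier tokens are
    billed again as input on every later turn.
    """
    total = 0.0
    history_tokens = 0
    for prompt_tokens, completion_tokens in turns:
        input_tokens = history_tokens + prompt_tokens  # history rides along
        total += input_tokens / 1000 * input_price_per_1k
        total += completion_tokens / 1000 * output_price_per_1k
        history_tokens += prompt_tokens + completion_tokens
    return total

# Two identical turns: the second costs more because history is resent
cost = estimate_chat_cost([(100, 50), (100, 50)])
```

&lt;p&gt;With two turns of 100 prompt and 50 completion tokens each, the second turn is billed for 250 input tokens because the first turn's history rides along - which is how long conversations quietly multiply costs.&lt;/p&gt;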

&lt;p&gt;There are &lt;a href="https://accessibleai.dev/post/azure_openai_pricing/" rel="noopener noreferrer"&gt;some strategies&lt;/a&gt; you can use to reduce your bill, and options such as reserving a certain amount of capacity or hosting your own LLM can help with pricing at larger scales, but understand in general that as usage grows your costs will grow.&lt;/p&gt;

&lt;p&gt;Additionally, in shared environments like Azure, you may have a quota capping the number of tokens you can use per minute. You may not hit these quota limits at low scale, but as you grow they'll become a problem and you'll need to increase the quota, switch to a different model, reserve a certain token capacity, or host your own LLM to address this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model Replacement
&lt;/h3&gt;

&lt;p&gt;Anyone who has followed AI over the last few years can attest to the large volume of new models and model versions that come out every year, with each new model advertising better metrics than the last.&lt;/p&gt;

&lt;p&gt;It's important to understand that the LLM you select during prototyping might not be the LLM you wind up deploying to production at initial release. Furthermore, the LLM you initially release may be replaced several times during the lifespan of your application as new models become available and old models are retired.&lt;/p&gt;

&lt;p&gt;This is part of why I do not typically recommend organizations spend time and money fine-tuning a model, but instead suggest focusing on RAG, your system prompt, and the context in which a model is used: organizations should plan on periodically replacing the core LLM that their AI systems use.&lt;/p&gt;

&lt;p&gt;You may find yourself moving models for a variety of reasons including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A desire for higher quality output from your model&lt;/li&gt;
&lt;li&gt;Needing a model that produces responses faster&lt;/li&gt;
&lt;li&gt;Trying to control costs and understanding that a new model is cheaper to use than your current one&lt;/li&gt;
&lt;li&gt;Your old model being marked as deprecated and scheduled for retirement by the organization hosting it on a specific date&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons, it's important to have an identified core set of tasks or test cases for any model you use and a consistent way of evaluating the performance of each model. Being able to automate this testing process is important for the agility of any development team as this will give you confidence when making changes to your model, your system prompt, or how your RAG integration operates.&lt;/p&gt;
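&lt;p&gt;A minimal evaluation harness can be as simple as a fixed list of questions with required phrases in the answers. This is a deliberately simplified sketch - real evaluation typically uses LLM-based or semantic scoring rather than phrase matching - and &lt;code&gt;ask_model&lt;/code&gt; is a hypothetical stand-in for your chat pipeline:&lt;/p&gt;

```python
from typing import Callable

def evaluate_model(ask_model: Callable[[str], str],
                   test_cases: list[tuple[str, list[str]]]) -> float:
    """Score a model against a fixed test suite.

    Each case is (question, required_phrases); a case passes when the answer
    contains every required phrase. Re-run the suite whenever the model,
    system prompt, or RAG index changes to catch regressions.
    """
    passed = 0
    for question, required_phrases in test_cases:
        answer = ask_model(question).lower()
        if all(phrase.lower() in answer for phrase in required_phrases):
            passed += 1
    return passed / len(test_cases)

# A stub model shows the harness shape; in practice ask_model calls your LLM
cases = [
    ("What are your support hours?", ["9", "5"]),
    ("Do you offer refunds?", ["30 days"]),
]
score = evaluate_model(lambda q: "We are open 9 to 5 and refund within 30 days.", cases)
```

&lt;p&gt;Tracking this score across model versions gives you a consistent baseline when deciding whether a replacement model is ready for production.&lt;/p&gt;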

&lt;h3&gt;
  
  
  Opportunities for expansion
&lt;/h3&gt;

&lt;p&gt;There are a few key opportunities to expand the capabilities of RAG AI chat applications if you view your system's capabilities as insufficient for your desired user experience.&lt;/p&gt;

&lt;p&gt;You may determine that you want a RAG solution, but that the system should not have the same degree of available information for each user. Under such a system, you may need to add role-based access to different pieces of information or information sources and then conditionally enable access to those data sources based on the identity and role of your user.&lt;/p&gt;

&lt;p&gt;If you find you need additional data sources beyond a single search capability, you are likely moving past RAG and into AI orchestration systems. In an AI orchestration system, an AI agent has multiple potential data sources it can pull information from and must decide when to use each one. For example, an AI orchestration system might search the knowledgebase for some information while other queries might be better suited to running pre-defined SQL queries against a read replica of a SQL database. Common AI orchestration solutions include Semantic Kernel and LangChain.&lt;/p&gt;
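&lt;p&gt;To make the routing decision concrete, here is a toy sketch. A real orchestrator such as Semantic Kernel or LangChain would let the LLM choose among registered tools; a keyword heuristic stands in for that decision here, and the handler names and return values are invented for illustration:&lt;/p&gt;

```python
from typing import Callable

# Hypothetical handlers; real ones would hit a search index or run a
# pre-approved SQL statement against a read replica
def search_knowledgebase(query: str) -> str:
    return f"[KB results for: {query}]"

def run_sales_report(query: str) -> str:
    return f"[SQL report for: {query}]"

SOURCES: dict[str, Callable[[str], str]] = {
    "knowledgebase": search_knowledgebase,
    "sales_sql": run_sales_report,
}

def route(query: str) -> str:
    """Send the query to the most appropriate data source."""
    wants_sales = any(w in query.lower() for w in ("sales", "revenue", "orders"))
    source = "sales_sql" if wants_sales else "knowledgebase"
    return SOURCES[source](query)
```

&lt;p&gt;The key design point is the registry of named sources: adding a new capability means registering a new handler, not rewriting the agent.&lt;/p&gt;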

&lt;p&gt;If you find yourself using an AI orchestration solution and needing particularly complex problem-solving capabilities, you may need to introduce a planner that can define a series of RAG steps needed to produce comprehensive answers on user queries.&lt;/p&gt;

&lt;p&gt;If you find your system prompt gets too complex and multi-faceted, you may benefit from taking an agentic approach and splitting your system into multiple agents that work together to solve complex problems, with each agent focusing on one facet of the problem.&lt;/p&gt;

&lt;p&gt;If you want users to be able to have persistent conversations with your agent and return to those conversations later, you could store conversation history either in the browser's local storage or on the server and require users to log in. Alternatively, you could vectorize individual chat messages from a user and store them in a vector database as conversational memory. This is significantly more complex, but would allow users to ask things like "What were we talking about last week?" and get meaningful responses.&lt;/p&gt;
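&lt;p&gt;The conversational memory idea can be sketched in a few lines. A real implementation would use an embedding model and a vector database; here a toy bag-of-words vector stands in for the embedding so the retrieval flow is visible:&lt;/p&gt;

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a simple bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class ConversationMemory:
    """Stores past chat messages and recalls the ones most similar to a query."""
    def __init__(self) -> None:
        self.messages: list[tuple[str, Counter]] = []

    def add(self, message: str) -> None:
        self.messages.append((message, toy_embed(message)))

    def recall(self, query: str, top_k: int = 2) -> list[str]:
        q = toy_embed(query)
        ranked = sorted(self.messages, key=lambda m: cosine(q, m[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]

memory = ConversationMemory()
memory.add("We discussed shipping rates to Canada last week")
memory.add("The user asked about password resets")
relevant = memory.recall("what did we say about shipping?", top_k=1)
```

&lt;p&gt;Swapping &lt;code&gt;toy_embed&lt;/code&gt; for calls to a real embedding model and the list for a vector database gives you the production shape of this feature.&lt;/p&gt;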

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;AI systems are powerful and can offer a compelling experience to your users while helping people find the content or services that are most relevant to them. However, these systems take work and refinement and benefit from experience with their technologies. This article should give you some good ideas of common architectures and approaches involved in a conversational AI system. If you'd like some help bringing such a system to life, please &lt;a href="https://www.leadingedje.com/engage" rel="noopener noreferrer"&gt;contact us&lt;/a&gt; as Leading EDJE is happy to talk with you about your needs and plans and see if collaborating more makes sense.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>ai</category>
      <category>data</category>
      <category>azure</category>
    </item>
    <item>
      <title>Concluding AI Prototyping Projects</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 28 Jan 2025 14:23:38 +0000</pubDate>
      <link>https://forem.com/leading-edje/concluding-ai-prototyping-projects-752</link>
      <guid>https://forem.com/leading-edje/concluding-ai-prototyping-projects-752</guid>
      <description>&lt;p&gt;AI prototyping projects are short-lived and will typically have one of a few different conclusions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A finding that the project is viable and can be converted to a real system with additional engineering effort.&lt;/li&gt;
&lt;li&gt;Determining that the project is not currently viable with the data available to the organization and the currently available technology.&lt;/li&gt;
&lt;li&gt;A demonstration of partial progress, obstacles overcome, obstacles identified, and a request for additional time and resources to determine project viability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Note that none of the expected outputs of an AI prototyping project is a deployable AI product - your goal is to determine viability of a project with the organization’s data and technical team.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AI prototyping projects will typically end in a formal report and/or presentation with recommendations to senior leadership on what the next steps of the project should be. These steps will usually result in a direction to suspend or continue the project - potentially converting the prototype to a real AI application that is deployed to real users.&lt;/p&gt;

&lt;p&gt;In this fifth and final article in the &lt;a href="https://blog.leadingedje.com/post/aiprototyping/introducingaiprototyping.html" rel="noopener noreferrer"&gt;AI Prototyping Projects series&lt;/a&gt; let’s talk through the details of these steps as AI prototyping projects invariably conclude.&lt;/p&gt;

&lt;h2&gt;
  
  
  Presenting your Findings
&lt;/h2&gt;

&lt;p&gt;In the final phases of an AI prototyping project, the team should take some time to document the results of their project in a short executive summary.&lt;/p&gt;

&lt;p&gt;This summary acts to preserve the high-level information about the prototyping project and should include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A list of functional aspects of the prototype that are currently operational&lt;/li&gt;
&lt;li&gt;Accuracy metrics, with one or two key metrics highlighted and explained in plain English&lt;/li&gt;
&lt;li&gt;A review of the initial goals defined when &lt;a href="https://blog.leadingedje.com/post/aiprototyping/launchingprojects.html" rel="noopener noreferrer"&gt;launching the AI Prototyping project&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Areas of concern such as difficulties with certain types of requests, system limitations, or known deficiencies&lt;/li&gt;
&lt;li&gt;A high-level summary of the remaining work needed either to make the project a releasable AI product or to reach a viable AI prototype&lt;/li&gt;
&lt;li&gt;Recommended next steps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This document will serve as a very brief digest for executives and project managers looking at how the project impacts the planned portfolio of work for the coming quarter or year. It will also help the organization remember the project as time goes on and the details grow hazy.&lt;/p&gt;

&lt;p&gt;However, the main value of this summary is to facilitate the conversations organizational leadership will have on whether this project should continue or be tabled. These conversations may or may not include the project team, so having a concise and effective document is important.&lt;/p&gt;

&lt;p&gt;Some organizations may also want a presentation. The presentation should have roughly the same content as the document, but is typically delivered in a more interactive and engaging way - often highlighting a key use case or two and showing interactive demos of those capabilities if leadership has not yet seen them in action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchud3831b5lhwi4l5bv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fchud3831b5lhwi4l5bv2.png" alt="A female engineer presenting metrics from an AI prototyping project to a group of executives" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A key factor for both a written report and a verbal presentation is that these artifacts should be accurate and truthful. It is your responsibility to highlight where your system excels, its current areas of deficiency, and any known risks you feel the project still has. It’s acceptable to talk about plans to remedy these deficiencies and address the risks, but it is important to present your project in as unbiased a manner as possible.&lt;/p&gt;

&lt;p&gt;This approach is critical in a very “hyped” industry like artificial intelligence, where vendors and developers routinely make unachievable promises about their technologies while glossing over costs, performance gaps, and scalability issues.&lt;/p&gt;

&lt;p&gt;Remember: your goal on an AI prototyping project is to give the business an accurate look into what a technology looks like - along with its strengths, weaknesses, and risks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preserving your Prototype
&lt;/h2&gt;

&lt;p&gt;Another thing the team should do as an AI prototyping project winds down is to take steps to preserve the project by documenting key information like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The structure of the project&lt;/li&gt;
&lt;li&gt;How to build and launch the project&lt;/li&gt;
&lt;li&gt;What external resources are required to run the project&lt;/li&gt;
&lt;li&gt;How to configure the project and where to find credentials that are too sensitive to be stored in documentation&lt;/li&gt;
&lt;li&gt;How the data source for the project was assembled (and how it can be updated)&lt;/li&gt;
&lt;li&gt;Any additional data cleaning / preparation steps&lt;/li&gt;
&lt;li&gt;Why the team took the approach they did&lt;/li&gt;
&lt;li&gt;Other approaches that were considered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By documenting these pieces of information as the project winds down, the team reduces the risk of critical information being lost to turnover or simply forgotten as team members move on to other projects.&lt;/p&gt;

&lt;p&gt;Keep in mind that AI prototyping projects may not immediately be converted to full projects, or may be revisited later down the line once better data collection practices, new technological developments, or additional time become available.&lt;/p&gt;

&lt;p&gt;When the team documents the high-level approach and requirements of their project they set the organization up for success in the future as new people join the team or when the project is reactivated after months of being dormant. These actions ultimately increase the impact of the project on the organization by reducing the risk of that effort being wasted or unable to be replicated later.&lt;/p&gt;

&lt;h2&gt;
  
  
  Promoting your AI Prototype to an AI Project
&lt;/h2&gt;

&lt;p&gt;Let’s say you have an AI prototyping project that just wows your leadership team. You love its performance and capabilities, you feel the technical questions have been answered and have assurances that the system can scale to the level you need it to.&lt;/p&gt;

&lt;p&gt;It’s likely in this scenario that you and your organization are really excited to see people get real-world usage out of your new system.&lt;/p&gt;

&lt;p&gt;So what’s next?&lt;/p&gt;

&lt;p&gt;When you “promote” an AI prototyping project to an AI project, your focus changes from exploring feasibility, demonstrating a concept, and identifying risks to a focus typically reserved for software projects: building something reliable and scalable and deploying it to production in a repeatable manner.&lt;/p&gt;

&lt;p&gt;It’s unlikely that the approach your team took in a short prototyping project is the same approach that should be used to serve thousands of daily users.&lt;/p&gt;

&lt;p&gt;In many prototyping projects the team takes shortcuts simply to prototype out a concept in a short amount of time. This means that your software might handle only the “happy path” where no errors occur, users interact with it in an expected manner, and all resources are available.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z0kp2138h02ng93p60r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7z0kp2138h02ng93p60r.png" alt="A comparison of concerns at different phases of AI Projects" width="798" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In an AI project you need to care more about these factors. Expect to spend engineering time on concerns like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; - how many parallel requests or sessions the system can support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling&lt;/strong&gt; - responding to offline resources, networking issues, and other problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; - detecting issues that occur in production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; - detecting and preventing attacks on software or AI systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication &amp;amp; authorization&lt;/strong&gt; - restricting access to known users with valid permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fairness&lt;/strong&gt; - ensuring your AI system operates without bias towards genders, ethnicities, or ages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation&lt;/strong&gt; - allowing users to report issues with your AI system and get human intervention to help alleviate issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testability&lt;/strong&gt; - verifying that the system continues to perform well now and in the future&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Management&lt;/strong&gt; - the collection and &lt;em&gt;ongoing maintenance&lt;/em&gt; of any training or fine-tuning data, or any curated data sources referenced as RAG data sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of these factors might be things that you addressed in the prototyping phase, but even if that’s true they should likely be revisited. After all, the focus of a prototyping project should be to determine the viability of a system and any potential issues that might arise, not to write efficient, secure, and reliable software code.&lt;/p&gt;

&lt;p&gt;Additionally, the development pace on AI prototyping projects can be rapid, which can make it harder to spot edge cases or evaluate all possibilities. By deliberately revisiting these concerns later on, you free yourselves to throw away your “demo code” in favor of more reliable, secure, efficient, and testable production code.&lt;/p&gt;

&lt;p&gt;With AI systems it is not uncommon to start working with one dataset or model and switch to another as time goes on. You may have used a certain large language model during prototyping but wish to change to another one for production use cases. This type of work could require code changes, but will absolutely require a comprehensive retesting of the application under the new model.&lt;/p&gt;

&lt;p&gt;In some cases your team may have used an AI-as-a-Service offering like Azure AI Services’ capabilities around speech, language, and vision. Your team may choose to move away from this service in favor of custom-built models for increased control, performance, or improved operational costs. In these cases you likely have a significant amount of work ahead to train and evaluate your own models - which may benefit from an AI prototyping project of their own, scoped to developing a custom model for that specific purpose.&lt;/p&gt;

&lt;p&gt;Ultimately, your AI system should be flexible enough to change as your organization’s needs inevitably change over time. It’s normal to move from model to model or upgrade API versions as a project goes on or to retrain machine learning models on new data.&lt;/p&gt;

&lt;p&gt;AI projects are software projects and, just like software projects, you’ll need to be able to update your deployed systems as issues are found. However, unlike traditional software projects, your systems will likely require periodic updates to keep up with the world your models represent. For this reason, your AI project team will need to think through factors related to machine learning operations (MLOps) involving updating, re-evaluating, and deploying new models as data changes.&lt;/p&gt;

&lt;p&gt;As you can see, AI projects are similar to software projects in some respects, but in many ways they’re entirely their own creatures with distinct concerns.&lt;/p&gt;

&lt;p&gt;However, AI projects bring new capabilities to the table in terms of assisting users in new ways, automating processes that were previously difficult to automate, and ultimately delivering new value and new experiences to others in ways that help accomplish your organizational goals.&lt;/p&gt;

&lt;p&gt;If you’re interested in learning more about how your organization can succeed in AI prototyping projects or converting an AI prototyping project to an AI project, please &lt;a href="https://www.leadingedje.com/engage" rel="noopener noreferrer"&gt;get in touch&lt;/a&gt; as we’d love to talk.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>projects</category>
      <category>projectmanagement</category>
      <category>riskmanagement</category>
    </item>
    <item>
      <title>Identify Unknowns, Weaknesses, and Risks in AI</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 28 Jan 2025 14:20:04 +0000</pubDate>
      <link>https://forem.com/leading-edje/identify-unknowns-weaknesses-and-risks-in-ai-1gi6</link>
      <guid>https://forem.com/leading-edje/identify-unknowns-weaknesses-and-risks-in-ai-1gi6</guid>
      <description>&lt;p&gt;Previously, I covered &lt;a href="https://blog.leadingedje.com/post/aiprototyping/introducingaiprototyping.html" rel="noopener noreferrer"&gt;what AI prototyping projects are&lt;/a&gt;, how to &lt;a href="https://blog.LeadingEDJE.com/post/aiprototyping/launchingprojects.html" rel="noopener noreferrer"&gt;launch AI prototyping projects&lt;/a&gt;, and &lt;a href="https://blog.leadingedje.com/post/aiprototyping/runningaiprojects.html" rel="noopener noreferrer"&gt;running AI prototyping projects&lt;/a&gt;. In this article we'll discuss managing risks in AI.&lt;/p&gt;

&lt;p&gt;While communicating the vision of an AI project through an interactive demo is generally understood to be the main goal of an AI prototyping project, one of the key pieces of value is &lt;em&gt;the ability of these projects to uncover and resolve risks&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;With AI projects your organization may not know the things they don’t know. You don’t know what aspects of your data or prompts will prove insufficient to get the results you’re looking for. You don’t know how people will interact with your application or the specific areas in which your application may prove inaccurate.&lt;br&gt;
If you’re working with new models or new technologies, the performance, reliability, formats, and specific behaviors of these systems may not be known to you in advance.&lt;/p&gt;

&lt;p&gt;For example, I was part of an AI prototyping project team at Leading EDJE where we fed pictures of users to a GPT-4 model hosted on Azure OpenAI and had it remark on the user’s attire. While this project was successful, one of the things we discovered in testing was that our model frequently remarked on the user’s face being blurry or mysterious.&lt;/p&gt;

&lt;p&gt;We were initially confused by this behavior as our source images were fine, but the more it kept happening the more we realized that there was a hidden layer between us and the model that blurred out faces in images before they even reached our model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv12go8o3cgrjuu65pcim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv12go8o3cgrjuu65pcim.png" alt="A person with a blurred face" width="429" height="602"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While this behavior had a specific purpose (ensuring that said models aren’t being abused to endanger users), it was unexpected to us and noticeably altered our agent’s behavior. While we were able to minimize its impact with careful prompting, our prototyping phase informed us that we’d need to use a different model for real-world usage if we wanted to be completely free of the issue.&lt;/p&gt;

&lt;p&gt;In software engineering you rarely know all of the behaviors you may encounter, particularly when working with external APIs, services, and data sources. This is even more true with artificial intelligence, as AI systems are complex and have many hidden characteristics.&lt;/p&gt;

&lt;p&gt;Your mission in designing AI systems is to identify these unknown risks and either resolve them or create a plan for resolving them.&lt;/p&gt;

&lt;p&gt;Some of these risks can be found through basic interactions while others will only appear once you get a wide variety of users using your system in earnest as an internal prototype. Your role as a member of a prototyping team is to challenge your system to see what it’s good at as well as what it’s bad at.&lt;/p&gt;

&lt;p&gt;Additionally, some systems will perform adequately for a single user but slow to a crawl with several concurrent users or sudden usage spikes. Determining the scalability characteristics of your application is an important activity that should be done before an application is deployed to production, but it is not necessarily critical within the first few days of prototyping.&lt;/p&gt;

&lt;p&gt;Finally, many AI systems will be attacked by end users, including some internal users. Users may try to use prompt injection attacks or simply convince the AI system using logic to act in a certain manner. Keep in mind that your AI systems are real software systems and therefore are part of the surface an attacker might try to exploit when looking for ways of getting access to your data or systems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faj21tvw0e20cdg44g9f4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faj21tvw0e20cdg44g9f4.jpg" alt="A famous example of a user getting a chat agent to promise selling a truck for $1" width="474" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I strongly recommend you spend some time “red teaming” your AI systems by trying to find ways of getting them to behave in unacceptable ways before you release a system. You want to be able to detect and deter these attacks on the system, but the importance of this activity varies based on the data your system has access to.&lt;/p&gt;

&lt;p&gt;For example, a simple system to summarize information from your corporate web site has a different impact if compromised versus an AI orchestration system that is capable of retrieving data about sales and even inserting data into databases for subsequent processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Gaining Certainty with AI
&lt;/h2&gt;

&lt;p&gt;We’ve now discussed the importance of identifying risk in AI systems, but let’s talk now about resolving those unexpected issues that come up during AI systems prototyping.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Unstuck
&lt;/h3&gt;

&lt;p&gt;When working with new systems and APIs it can be normal to get stuck by the issues your system encounters.&lt;/p&gt;

&lt;p&gt;I remember a time when I was working on a RAG application prototype for a client of Leading EDJE and my application worked just fine on the initial request to an external API, but all subsequent interactions resulted in 400 Bad Request responses without any additional details about the request.&lt;/p&gt;

&lt;p&gt;Since the first request was very similar to the second one (in fact it was made using the exact same code) I was confused and blocked by this issue for a number of hours as I performed troubleshooting.&lt;/p&gt;

&lt;p&gt;Searching for the issue online didn’t yield any helpful results and what I was doing was cutting edge enough that there weren’t many help resources available at the time.&lt;/p&gt;

&lt;p&gt;I ultimately decided to simplify my solution bit by bit until the application fully worked. In my scenario, the culprit wound up being that I was explicitly giving the user and the AI agent a name in code, and this name was going to the external API and confusing it. I removed the custom name code, changed the name displayed on the user interface instead, and restored the rest of my code - and the application finally worked.&lt;/p&gt;

&lt;p&gt;I suspect that the more you and your team conduct AI prototyping projects the more you’ll have stories like mine.&lt;/p&gt;

&lt;p&gt;In general I’ve found the following approach helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Working in small batches of changes at a time and checking in code when it works&lt;/li&gt;
&lt;li&gt;Asking coworkers when I encounter unexpected problems&lt;/li&gt;
&lt;li&gt;Searching when our knowledge isn’t sufficient&lt;/li&gt;
&lt;li&gt;Retracing my steps to get back to a working piece of code&lt;/li&gt;
&lt;li&gt;Consulting external documentation and samples for additional insights&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list is suspiciously similar to the list of recommendations I used to give my students who were learning software engineering during my time as a teacher, but it turns out these steps are remarkably effective at helping us with not just the basics but advanced problems as well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Involve Experts
&lt;/h3&gt;

&lt;p&gt;One of the dangers of doing something completely new to you is that you don’t know what you don’t know. There are a number of pitfalls with AI projects that must be learned - just as new programmers must learn to ensure database connections and file handles are closed, guard against SQL injection attacks, and deal with unavailable external services.&lt;/p&gt;

&lt;p&gt;Machine learning and artificial intelligence are powerful tools, but the potential for inconsistent performance across different types of requests or types of users is real. What’s more, this inconsistency can manifest itself in terms of biased behavior that can be hard to detect if you’re not explicitly looking for it.&lt;/p&gt;

&lt;p&gt;Your AI projects will be similar to software engineering projects in some respects, but in others they’ll have entirely new concerns to worry about, such as dealing with model drift and detecting and preventing certain types of attacks such as data poisoning attacks (in systems that are retrained periodically) or prompt injection attacks.&lt;/p&gt;

&lt;p&gt;In this regard, it’s helpful to have someone on your team who has been through these problems before and can help you and your team identify the things they’ll need to watch out for with your projects - and how to successfully maintain AI projects that remain effective as time goes on and the world changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Targeted Spikes
&lt;/h2&gt;

&lt;p&gt;When you’ve identified key risk areas that you feel could make or break your projects, sometimes the only way to resolve these risks is to explore them in a dedicated manner through a “spike”.&lt;/p&gt;

&lt;p&gt;A spike is a targeted piece of work designed to explore the viability of an idea, approach, or service. Spikes can be used to determine areas that will be problematic for the organization going forward and will need additional engineering effort to resolve.&lt;/p&gt;

&lt;p&gt;Just as an AI prototyping project is essentially a functional prototype of an AI product, a spike is a technical prototype of a new capability needed to serve that larger product.&lt;/p&gt;

&lt;p&gt;An example spike would be a team that has never worked with a multimodal LLM before using that model for the first time, sending it both images and text content and dealing with its responses. The team knows this is possible and has seen documentation around it, however, their confidence in the service working reliably might not be high, or they may be worried about costs, latency, or the accuracy of the model with their data.&lt;/p&gt;

&lt;p&gt;In a spike, you spend a fixed period of time investigating a targeted area of risk, with the goal of resolving it into either a problem-free service or a list of issues that must be addressed to move forward.&lt;/p&gt;

&lt;p&gt;In the context of a short-lived AI prototyping project, a spike might be committing half a day for a single developer to investigate an area and learn all they can, then make and adjust plans based on what they found.&lt;/p&gt;

&lt;p&gt;Larger projects and problems would likely require more time, but the key point to emphasize is that you’re not building a fully-functioning feature with all of its bells and whistles; you’re just looking to see which areas will require additional time investments.&lt;/p&gt;




&lt;p&gt;In our next and final article in the series, we'll cover concluding AI prototyping projects.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>projects</category>
      <category>projectmanagement</category>
      <category>riskmanagement</category>
    </item>
    <item>
      <title>Running AI Prototyping Projects</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Fri, 20 Dec 2024 19:35:12 +0000</pubDate>
      <link>https://forem.com/leading-edje/running-ai-prototyping-projects-12cd</link>
      <guid>https://forem.com/leading-edje/running-ai-prototyping-projects-12cd</guid>
      <description>&lt;p&gt;Now that we’ve covered &lt;a href="https://blog.leadingedje.com/post/aiprototyping/introducingaiprototyping.html" rel="noopener noreferrer"&gt;what AI prototyping projects are&lt;/a&gt; and how to &lt;a href="https://blog.LeadingEDJE.com/post/aiprototyping/launchingprojects.html" rel="noopener noreferrer"&gt;launch AI prototyping projects&lt;/a&gt;, let’s talk about how to actually run one.&lt;/p&gt;

&lt;p&gt;While artificial intelligence is a broad and nebulous term, I find that AI projects benefit from keeping things as simple as possible - especially at the beginning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Low-fidelity Prototyping
&lt;/h2&gt;

&lt;p&gt;One of the best ways of conveying the vision of an AI system is through low-fidelity prototypes. A low-fidelity prototype involves simulating the application experience through visual aids.&lt;/p&gt;

&lt;p&gt;Low-fidelity prototypes can include paper prototypes, where you sketch the individual screens of your application and how your application responds to an input such as a chat message or an uploaded photo.&lt;/p&gt;

&lt;p&gt;Paper prototypes are easy to make and simple to understand. Additionally, because they’re literally sketched on paper, a paper prototype is not likely to be misinterpreted as a completed system. However, they can leave some ambiguity as far as styling goes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr89e9864g9o2byubpmm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr89e9864g9o2byubpmm.jpg" alt="A whiteboard sketch of a conversational AI system" width="600" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;More complex prototypes can be created using software like Figma or Balsamiq. These pieces of software allow business users to use drag and drop tooling to build functional user interface concepts. These prototypes can be arranged in a series of screens to illustrate sample interactions and how a system might behave. This can help you illustrate the vision of a software system to a developer team or to other members of the organization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18qgoojppnprufkf4f8b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F18qgoojppnprufkf4f8b.png" alt="A more polished mockup of the same AI system" width="684" height="611"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, it’s important to note that while these low-fidelity prototypes illustrate your application’s flow and concept, they do not deal with the realities of artificial intelligence since they were created by human intelligence.&lt;/p&gt;

&lt;p&gt;To evaluate what’s technically possible - and to discover the limitations of the systems you may work with - you’ll need to go ahead and create a technical prototype, ideally by focusing on the simplest possible implementation first.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simplest possible implementation
&lt;/h2&gt;

&lt;p&gt;With an AI project, you may be tempted to train a custom model to solve a particular problem you’re facing, but many AI as a Service offerings can help you prototype your idea with passable performance to determine if your approach is viable.&lt;/p&gt;

&lt;p&gt;For example, if I were building an AI system to use a camera and describe what it sees, I could work on training computer vision models to recognize specific objects using a large quantity of reference photos from different angles and different lighting conditions. Such an undertaking would require a significant amount of time and effort.&lt;/p&gt;

&lt;p&gt;Alternatively, I could use a pre-trained computer vision model, or a system wrapped around such a model like Azure AI Computer Vision. Using these pre-trained models, you can take advantage of models others have trained already to extract insight from your inputs. You can then run custom logic based on what you see in the image.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff586oo2ymva2sm394ybz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff586oo2ymva2sm394ybz.png" alt="Azure Computer Vision's pre-trained model recognizing objects in an image" width="600" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For example, if I were building an app for people with accessibility challenges, I could use the computer vision model to extract information about the image and mix it with a custom prompt telling the model to talk about trip hazards such as rugs, cables, or items on the ground.&lt;/p&gt;

&lt;p&gt;Alternatively, with the advent of multimodal models, you could use a model like GPT-4o and send it your image along with a request to highlight trip hazards. Such a request would likely give you a superior result to using computer vision on its own, because the model would have the additional context of your raw image.&lt;/p&gt;
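&lt;p&gt;To make this concrete, here’s a hedged sketch of what such a request can look like. It only builds the payload and assumes an OpenAI-style chat completions message format with base64-encoded image data; check your provider’s documentation for the exact shape it expects.&lt;/p&gt;

```python
import base64


def build_vision_request(image_bytes, instruction, model="gpt-4o"):
    """Assemble a chat request that pairs an image with a text instruction.

    Uses the OpenAI-style image_url content format; payload shapes
    differ by vendor, so treat this structure as illustrative.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            ],
        }],
    }


request = build_vision_request(
    b"\xff\xd8placeholder",  # stand-in bytes; a real app would read a JPEG file
    "Describe any trip hazards such as rugs, cables, or items on the ground.",
)
```

&lt;p&gt;The payload would then be sent to your provider’s chat endpoint; the prompt text carries the domain-specific instruction while the image supplies the raw context.&lt;/p&gt;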

&lt;p&gt;Contrast this approach to training a custom image model. While that training project might ultimately provide better results for users, it would take a significant amount of time, effort, storage, and computing resources to build such a model - and the model might not be as effective in all lighting conditions or environments. These inaccuracies would require additional images and training time to overcome, potentially adding months before you had a viable model of your own to deploy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuji0thi8687d6b08umir.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuji0thi8687d6b08umir.png" alt="A diagram of data being used to train a model which can interpret new images" width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Conversely, you could use a pre-built model with a custom prompt to evaluate the feasibility of your approach and a product that uses it before making the decision to train your own model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Illustrating failure points
&lt;/h2&gt;

&lt;p&gt;One of the things you’ll need to handle in a prototype is when the system fails.&lt;/p&gt;

&lt;p&gt;Failure in AI systems can come from a variety of factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The system housing your model may be offline or inaccessible due to networking issues&lt;/li&gt;
&lt;li&gt;You may encounter rate limits or intermittent error responses from external APIs&lt;/li&gt;
&lt;li&gt;Your system may generate a response, but with low confidence in its correctness&lt;/li&gt;
&lt;li&gt;You may deliver a response to the user and the user finds an issue with something your system said or did&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these scenarios represents a different way something could fail, and each of them requires a slightly different response.&lt;/p&gt;

&lt;p&gt;In scenarios where you are relying on an external resource that is unresponsive, rate limited, or erroring, your system will need to gracefully handle this error and surface it to the user in a human-readable manner. For example, a rate limit error response in a chat application could be handled with a response indicating that the system is experiencing a high volume of traffic and the user should try again in a specific period of time (typically included in a rate limit error response).&lt;/p&gt;
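&lt;p&gt;As a sketch of this pattern, the handler below turns a rate limit error into a human-readable message and a wait time. It assumes the provider sends the standard Retry-After header; the header name, default wait, and wording here are all illustrative.&lt;/p&gt;

```python
def handle_rate_limit(headers, default_wait=30):
    """Given response headers from a rate-limited (HTTP 429) call,
    return a user-facing message plus the number of seconds to wait.

    Falls back to default_wait when no Retry-After header is present.
    """
    wait = int(headers.get("Retry-After", default_wait))
    message = (
        "The system is experiencing a high volume of traffic. "
        f"Please try again in about {wait} seconds."
    )
    return message, wait


# The provider told us to wait 15 seconds before retrying.
msg, wait = handle_rate_limit({"Retry-After": "15"})
```

&lt;p&gt;The same structure extends to outages and intermittent errors: catch the failure at the boundary, translate it into plain language, and never surface a raw stack trace or API error body to the user.&lt;/p&gt;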

&lt;p&gt;If a system generates a response, but the response is low confidence, you can choose to filter out the response entirely or to indicate to the user that the system is unsure. For example, if you show a computer vision system a photo and the system is 80% certain that the photo contains a dog and 20% certain the photo contains a squirrel, you may want to remark only on the dog and omit information about low confidence objects from the results.&lt;/p&gt;

&lt;p&gt;Alternatively, you could redirect low confidence scores to a queue for a human to process. For example, in a particular system Leading EDJE helped a client implement, if the confidence score was below a certain threshold, the item being processed was redirected to a queue for manual review by a human.&lt;/p&gt;
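&lt;p&gt;Filtering low-confidence results and routing them to a human-review queue combine naturally into one routing step. The threshold, data shapes, and in-memory queue below are illustrative rather than taken from any specific system.&lt;/p&gt;

```python
CONFIDENCE_THRESHOLD = 0.5  # illustrative cutoff; tune for your system


def route_detections(detections, review_queue):
    """Keep high-confidence detections; send everything else to a
    queue for manual human review.

    detections is a list of (label, confidence_score) pairs.
    """
    kept = []
    for label, score in detections:
        if score >= CONFIDENCE_THRESHOLD:
            kept.append(label)
        else:
            review_queue.append((label, score))
    return kept


queue = []
kept = route_detections([("dog", 0.8), ("squirrel", 0.2)], queue)
# kept -> ["dog"]; the low-confidence squirrel lands in the review queue
```

&lt;p&gt;In a production system the queue would be a durable store rather than a list, but the decision logic stays the same.&lt;/p&gt;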

&lt;p&gt;Finally, there may be cases where your system simply gets it wrong. In these scenarios you should make it clear to users how to report and resolve these issues. In conversational AI systems this might simply be giving a “thumbs down” feedback on a specific response. In other applications, it might involve appealing an automated decision a system made so that a human is in the loop and can make an appropriate decision based on the information available.&lt;/p&gt;

&lt;p&gt;While it’s normal to only handle the “happy path” during prototyping, prototypes do break - and they particularly like to break during demos. I recommend you at least add basic error handling to your application and have a plan for handling additional types of failures.&lt;/p&gt;

&lt;p&gt;In our next article in this series, we'll talk more about these pieces of uncertainty and how we handle risks, uncertainty, and weak areas when developing AI prototypes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agile</category>
      <category>prototyping</category>
      <category>projectmanagement</category>
    </item>
    <item>
      <title>Launching AI Prototyping Projects</title>
      <dc:creator>Matt Eland</dc:creator>
      <pubDate>Tue, 17 Dec 2024 18:01:24 +0000</pubDate>
      <link>https://forem.com/leading-edje/launching-ai-prototyping-projects-2d8n</link>
      <guid>https://forem.com/leading-edje/launching-ai-prototyping-projects-2d8n</guid>
      <description>&lt;h2&gt;
  
  
  Define your AI Prototyping Goals
&lt;/h2&gt;

&lt;p&gt;In the first part of this series we discussed &lt;a href="https://blog.leadingedje.com/post/aiprototyping/introducingaiprototyping.html" rel="noopener noreferrer"&gt;what AI Prototyping Projects are&lt;/a&gt; and why you would launch one. In this article we'll move along in that process and discuss launching AI prototyping projects in a way that sets them up for success.&lt;/p&gt;

&lt;p&gt;In order to start your project, you’ll need to have a good idea of what your team needs to build. This could range anywhere from a paragraph description of a simple system to more complex definitions such as wireframes and component user stories the team can work on.&lt;/p&gt;

&lt;p&gt;A simple project statement might look like this:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We’re launching a half-week spike to investigate the viability of using Azure Document Intelligence to automate extracting these 10 pieces of information from documents in 12 different formats from various vendors so that we can reliably automate data extraction and better handle high-volume spikes of incoming documents. This project will involve an AI specialist on staff as well as a team lead responsible for the area of concern.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This goal statement makes the following things clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The problem type (document data extraction)&lt;/li&gt;
&lt;li&gt;The technology or technologies we’re considering using (Azure Document Intelligence)&lt;/li&gt;
&lt;li&gt;Who is working on the project&lt;/li&gt;
&lt;li&gt;How long the project will take&lt;/li&gt;
&lt;li&gt;Characteristics of the inputs to the system (documents from different sources that contain common pieces of information)&lt;/li&gt;
&lt;li&gt;Concerns about reliability, accuracy, and scalability of the resulting system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A more complex project might have an overall goal statement like this, but also a number of different user stories made available in pre-prioritized order.&lt;/p&gt;

&lt;p&gt;As you build a system and interact with it, you and your team will have additional ideas or find you may need to adjust the project direction to fit the realities of the interactions you’re seeing. This is normal and acceptable, but you want to start the project with a goal of what you’re building.&lt;/p&gt;

&lt;p&gt;Before we move on, I do want to state that if your changes in-flight are substantial, they may call for pausing your prototyping project and launching a new project with a modified goal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Usage Scenarios
&lt;/h2&gt;

&lt;p&gt;With your goal in hand, you now need to outline the main ways you envision your system being used.&lt;/p&gt;

&lt;p&gt;For a conversational AI system, I’d recommend documenting things you think a user might ask the system - and possibly how you would want the system to respond.&lt;/p&gt;

&lt;p&gt;Here’s a sample set of requests you might include for an automated chat system related to eCommerce:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What are the best electric shaving products for men designed for the head and neck area available below $60?&lt;/p&gt;

&lt;p&gt;What is the typical return policy like for an item I don’t like?&lt;/p&gt;

&lt;p&gt;Is the kilt hose in navy blue that I added to my wishlist back in stock?&lt;/p&gt;

&lt;p&gt;When is my board game expected to arrive?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each of these questions is specific to the business and, in some cases, to the individual user. Each request also shows how the system is designed to be used and represents a discrete task the application might need to support (or respond to by stating that it cannot do that yet).&lt;/p&gt;

&lt;p&gt;By documenting your sample requests to your system, you have a good way of explaining what your system was designed to do, a good way of testing your system after making changes, and the beginnings of some training data for optimizing your system’s behavior.&lt;/p&gt;

&lt;p&gt;In fact, if you were to expand your sample questions to also include sample answers, you could include these in your prompt to your conversational AI system as few-shot inference examples as a way of teaching your system how to structure its responses.&lt;/p&gt;
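&lt;p&gt;As a sketch, here’s one way to assemble those few-shot question-and-answer pairs into the chat message list most conversational AI APIs accept. The sample answer text is invented for illustration.&lt;/p&gt;

```python
def build_few_shot_messages(system_prompt, examples, user_question):
    """Build a chat message list that teaches the model how to
    structure its responses via worked question/answer examples."""
    messages = [{"role": "system", "content": system_prompt}]
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_question})
    return messages


examples = [
    ("What is the typical return policy like for an item I don't like?",
     "Most items can be returned within 30 days. "
     "Would you like a link to the full policy?"),  # invented sample answer
]
messages = build_few_shot_messages(
    "You are a friendly eCommerce support assistant.",
    examples,
    "When is my board game expected to arrive?",
)
```

&lt;p&gt;Because the example answers sit in the prompt as prior assistant turns, the model tends to mirror their tone and structure when answering the real question at the end.&lt;/p&gt;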

&lt;p&gt;Of course, with data-specific queries about products in stock or past user orders, you’re likely looking at a retrieval-augmented generation (RAG) or AI orchestration system, but these are details about how a solution is implemented, not how we expect the system to be used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking your Team and Timeline
&lt;/h2&gt;

&lt;p&gt;With your goal and sample usage scenarios defined, it’s important to figure out who on your team should be included in the project.&lt;/p&gt;

&lt;p&gt;The more people you have on the project, the less any one individual can focus on executing their tasks and getting things done, so add people to your project carefully.&lt;/p&gt;

&lt;p&gt;I recommend putting highly-skilled people with specific specializations in terms of skills or domain knowledge on the project. For best results, I recommend having people with complementary specializations to one another, such as an AI specialist, DevSecOps specialist, and a front end specialist on a small team together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ug71rce0juk73mvd3x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ug71rce0juk73mvd3x6.png" alt="A Sample AI Prototyping Team" width="800" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You’ll want your team members to be able to attack specific tasks independently while occasionally coordinating with each other to make sure their efforts are lining up well.&lt;/p&gt;

&lt;p&gt;For particularly complex problem areas you may want to have several people with the same or related skillsets so they can talk through complex obstacles or approaches with each other. In these scenarios it can be helpful to partition work areas in advance so individuals can work independently without impacting each other’s efforts.&lt;/p&gt;

&lt;p&gt;As far as a timeline goes, you should ask your team to think about the things they know are easily achievable versus the things that will take some additional time, research, and trial and error. I recommend your team plan out the first day or day and a half of the project in terms of what they’ll be working on. Based on your comfort with how much they’ll get done in that time and how much work is left, you can adjust the end date of the project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r5fvhtlggkxrub47yoc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5r5fvhtlggkxrub47yoc.png" alt="A high-level timeline of a prototyping project" width="800" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’d strongly recommend keeping your projects contained to a few days to a week and treating them almost like an in-office hackathon, because the longer your project goes, the slower your overall pace will be. Innovation projects tend to see a lot of early progress as people work through their plans, tackle known problems, and expand into new areas, but the high-intensity pace starts to wear on the team’s overall productivity as the days go on.&lt;/p&gt;

&lt;p&gt;By keeping projects tightly constrained in terms of time, you keep the team focused on the core goal and are more responsible with their overall energy level - as well as with the external commitments from their normal work that you need to keep at bay during a prototyping project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Data Needs
&lt;/h2&gt;

&lt;p&gt;One critical factor to consider when launching an AI prototyping project is the quality of your data.&lt;/p&gt;

&lt;p&gt;If you are using data to train an AI system for predictions or even simply as a resource to search, that data needs to be reliable or your resulting AI system will internalize the inaccuracies of your data.&lt;/p&gt;

&lt;p&gt;Depending on what you’re trying to achieve, inaccurate data could manifest as simply as incorrect recommendations or facts spouted by the AI system. However, in more sensitive scenarios, such as training machine learning models, incorrect data could significantly bias your model towards inaccurate predictions.&lt;/p&gt;

&lt;p&gt;These inaccurate predictions may only show up for certain combinations of data and may be hard to detect as a result. For example, if a data collection anomaly resulted in incorrect data points for the state of Ohio for a two-month period of time, your system may wind up being inaccurate for residents of Ohio during certain months. Because it’s unlikely you routinely test data in your systems from every state for every month of the year, these inaccuracies could very well reach production if you don’t find them in your data earlier.&lt;/p&gt;

&lt;p&gt;For this reason, you should be familiar with the data your team will likely want to use for the AI prototyping project and its limitations. Additionally, you should reserve time to analyze your dataset for anomalies and additional data cleaning opportunities as data reliability can make or break a project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining AI Project Deployment and Operation Parameters
&lt;/h2&gt;

&lt;p&gt;One final aspect of AI projects to consider is how the AI system will be deployed if the prototype succeeds, is promoted to a full project, and ultimately ships to production.&lt;/p&gt;

&lt;p&gt;While it may seem counterintuitive to think about deployment of a prototyping project designed for exploring viability, one of the concerns for a viable AI project is the ability to deploy and maintain such a system.&lt;/p&gt;

&lt;p&gt;For example, if your system relies on data, you should have a plan for how new data gets into your system and how you evaluate that system to ensure you’re not breaking any existing capabilities by adding new context.&lt;/p&gt;

&lt;p&gt;Even if your system doesn’t rely on changing data, you will want to patch it from time to time, move it to a different model, add capabilities, or even just style it differently or tweak its personality.&lt;/p&gt;

&lt;p&gt;Each of these operations involves deploying or updating the AI system and, in turn, verifying that the effectiveness of the system has not significantly diminished after the change.&lt;/p&gt;

&lt;p&gt;In the short term, this likely means a manual testing plan; however, more mature and automated solutions would involve automated tests that ensure your system acts properly and its responses continue to meet your organizational standards.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>prototyping</category>
      <category>agile</category>
      <category>projectmanagement</category>
    </item>
  </channel>
</rss>
