Forem: Mrunmayee Rane

This post is for engineers building an agentic harness, especially if you are thinking about tools, memory, evals, observability, and production reliability.

Mrunmayee Rane — Tue, 26 May 2026 08:01:36 +0000

The Agentic Harness: How to Build AI Agents in Production

#ai #agents #llm #agenticharness

Mrunmayee Rane — Mon, 25 May 2026 07:00:00 +0000

Most people are still building AI agents like demos.

They connect an LLM to a few tools, add a system prompt, wrap everything in a chat UI, and call it an agent.

That is not an agent system.

That is a model with tool access.

A real AI agent is not just a prompt, a model, or a framework. A real AI agent is an engineered runtime.

It needs a harness.

The agentic harness is the system around the model that makes agent behavior useful, repeatable, observable, and safe.

It decides:

How the model receives context
How it uses tools
How progress is persisted
How failures are handled
How work is evaluated
How the system improves over time

The mindset shift is simple:

The model is not the product.

The harness around the model is the product.

A stronger model can improve reasoning.

But the harness determines whether that reasoning turns into reliable action.

What Is an Agentic Harness?

An agentic harness is the runtime layer that enables a model to behave like an agent.

It receives a task, loads the right instructions and context, exposes the right tools, manages the execution loop, captures state, verifies progress, handles errors, records traces, and returns the final result.

A simple version looks like this:

1. Receive task
2. Load identity
3. Load task instructions
4. Load relevant context
5. Retrieve memory
6. Select tools
7. Plan next action
8. Execute tool call
9. Observe result
10. Update state
11. Verify outcome
12. Write durable progress
13. Return response
14. Record trace

The important part is not that every agent uses this exact loop.

The important part is that the loop exists outside the model.
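To make "the loop exists outside the model" concrete, here is a minimal sketch. It assumes the model is wrapped in a `decide` function that returns a dict describing the next action; `RunState`, `run_harness`, and `scripted_decide` are hypothetical names for illustration, and a real harness would add context loading, verification, and persistence.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    task: str
    steps: list = field(default_factory=list)  # (tool, args, result) tuples
    trace: list = field(default_factory=list)  # human-readable event log
    done: bool = False
    answer: str = ""


def run_harness(task, decide, tools, max_steps=5):
    """Drive the loop from outside the model: the harness owns state, limits, and the trace."""
    state = RunState(task=task)
    for _ in range(max_steps):
        action = decide(state)  # plan next action (a model call in practice)
        state.trace.append(f"decide: {action}")
        if action["type"] == "finish":
            state.answer, state.done = action["answer"], True
            break
        tool = tools.get(action["tool"])
        if tool is None:
            # The harness, not the model, rejects tools that were never exposed.
            result = "error: unknown tool"
        else:
            result = tool(**action["args"])  # execute tool call
        state.steps.append((action["tool"], action["args"], result))  # update state
    return state


def scripted_decide(state):
    """Stand-in for the model: add two numbers, then finish with the result."""
    if not state.steps:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": str(state.steps[-1][2])}


final = run_harness("add 2 and 3", scripted_decide, {"add": lambda a, b: a + b})
print(final.answer, final.done, len(final.trace))
```

The step limit, the tool lookup, and the trace all live in `run_harness`, so swapping the model changes `decide` and nothing else.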

A weak agent relies on the model to figure everything out inside one giant context window.

A strong agent externalizes responsibilities into the harness:

1. Identity lives in a stable instruction layer
2. Memory lives outside the prompt
3. Skills live as reusable procedures
4. Tools expose controlled actions
5. Policies constrain execution
6. Progress files preserve continuity
7. Traces capture behavior
8. Evals measure outcomes and trajectories
9. Governance defines ownership

The model should reason.

The harness should govern.

Practical rule: Do not put everything inside the prompt. Build the system around the prompt.

Start Simple, Then Add Agency Only Where It Pays Off

One of the biggest mistakes in agent development is adding autonomy too early.

Not every AI system needs to be an agent.

Some tasks are better served by:

1. A single model call
2. Retrieval-augmented generation
3. A deterministic workflow
4. A simple rules engine
5. A human review flow

Some tasks genuinely need an agent that can decide what to do next, use tools, and adapt across multiple turns.

A useful distinction:

Workflow: the system follows predefined code paths.
Agent: the model dynamically decides its process and tool usage.

A good harness lets you mix both.

User request
   ↓
Intent router
   ↓
Simple task?   → deterministic workflow
Complex task?  → agent loop
High-risk task? → human review gate
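The router above can be sketched in a few lines. This is only an illustration: the `risk` and `steps_known` fields are assumptions about how a request might be labeled upstream, not part of any specific framework.

```python
def route(request: dict) -> str:
    """Pick a handler for a request; field names are hypothetical."""
    if request.get("risk") == "high":
        return "human_review"  # gate first: risk outranks complexity
    if request.get("steps_known", False):
        return "workflow"  # predefined code path
    return "agent_loop"  # model decides the process


print(route({"risk": "low", "steps_known": True}))
print(route({"risk": "high", "steps_known": True}))
print(route({"risk": "low"}))
```

Checking risk before complexity means a high-risk task never reaches the agent loop just because it also happens to be open-ended.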

This gives you a practical architecture:

Keep deterministic paths deterministic.

Reserve agentic behavior for places where model-driven decision-making actually creates value.

Practical rule: Use the simplest system that can solve the task. Add agency only when flexibility is worth the cost.

Define the Agent’s Operating Identity

Before memory, tools, skills, and evals, the agent needs identity.

Identity is not personality decoration.

It is behavioral control.

A weak identity says:

You are a helpful AI assistant.

That does almost nothing.

A stronger identity says:

You are a pragmatic staff engineer operating in production systems. You optimize for correctness, reliability, maintainability, and small safe diffs. You read before editing. You verify before claiming completion. You preserve existing architecture unless the architecture itself is the failure. You surface uncertainty instead of hiding it.

This gives the model an operating posture.

In a real harness, this identity can live in a stable file such as:

1. SOUL.md
2. AGENTS.md
3. system profile
4. team-owned instruction file

It should define:

1. Who the agent is
2. What it optimizes for
3. How it communicates
4. What it refuses to do
5. How it uses tools
6. What it remembers
7. What it ignores
8. When it asks for help
9. When it stops

Example:

## Core Truths

- Read before writing.
  Existing systems contain context that the prompt does not.

- Small diffs beat broad rewrites.
  Local fixes are safer unless the abstraction itself is broken.

- Verification is part of the task.
  Never claim success without evidence.

- Production systems punish cleverness.
  Prefer explicit, observable, boring solutions.

- Uncertainty must be surfaced.
  A confident guess is worse than a clearly labeled assumption.

A next-level agent needs judgment, not just capability.

Identity is where that judgment starts.

Practical rule: Give the agent a stable operating profile before giving it powerful tools.

Make the Execution Contract Explicit

Every agent should have an execution contract.

The execution contract tells the agent how work moves from task to completion.

For a coding agent, the contract might be:

1. Understand the request.
2. Inspect relevant files.
3. Identify the smallest safe change.
4. Apply the change.
5. Run targeted tests.
6. Run broader tests if risk is high.
7. Summarize the diff.
8. Document verification.
9. List residual risk.

Without this contract, the agent improvises.

Improvisation is fine for chat.

It is dangerous for production systems.

A better coding-agent instruction looks like this:

You are debugging a production Python service.

Mission:
Find the smallest safe fix.

Workflow:

1. Read the exact error.
2. Inspect the file where the error originates.
3. Inspect the caller.
4. Search for similar patterns in the repository.
5. Identify the smallest local fix.
6. Apply the patch.
7. Run the narrowest relevant test.
8. If the touched surface is broad, run the related suite.
9. Report changed files, verification, and remaining risks.

Rules:

- Do not edit before reading.
- Do not introduce dependencies unless existing tools are insufficient.
- Do not rewrite modules for local bugs.
- Do not claim tests passed unless the command actually ran.
- Do not suppress uncertainty.

This is what separates an agent from a chatbot.

The chatbot answers.

The agent follows an execution contract.

Practical rule: Define how the agent starts, acts, verifies, and stops.

Treat Tools as Privileged Interfaces

Most agent demos expose tools too casually.

They give the model a shell, browser, database, file editor, or API client and trust the prompt to keep behavior sane.

That is not enough.

Tool use needs policy.

For every tool, define:

1. When to use it
2. When not to use it
3. Required preconditions
4. Allowed scope
5. Failure behavior
6. Retry limits
7. Logging requirements
8. Approval boundaries

Example:

## Shell Tool Policy

Use shell for:
- running tests
- inspecting repo structure
- searching files
- checking git state

Do not use shell for:
- destructive commands
- credential access
- broad file deletion
- installing dependencies without approval

Before mutation:
- inspect target files
- check git status
- prefer minimal commands

After mutation:
- run relevant verification
- summarize command output

Every tool expands the agent’s action space.

A larger action space means more capability, but also more failure modes.

The harness should make tool use scoped, observable, and reversible where possible.
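A policy like the one above only helps if the harness enforces it before the command runs. Here is a minimal sketch of such a check; the allowlist and the blocked git subcommands are assumptions chosen for illustration, and a real policy would be stricter and owned by the team.

```python
import shlex

ALLOWED = {"pytest", "ls", "grep", "git"}  # hypothetical allowlist
BLOCKED_GIT = {"push", "reset", "clean"}   # hypothetical: these need approval
SHELL_OPERATORS = ";|&><`$"                # reject chaining and substitution outright


def check_shell(command: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed shell command."""
    if any(ch in command for ch in SHELL_OPERATORS):
        return False, "shell operators are not allowed"
    try:
        parts = shlex.split(command)
    except ValueError:
        return False, "command could not be parsed"
    if not parts:
        return False, "empty command"
    if parts[0] not in ALLOWED:
        return False, f"{parts[0]} is not on the allowlist"
    if parts[0] == "git" and len(parts) > 1 and parts[1] in BLOCKED_GIT:
        return False, f"git {parts[1]} requires approval"
    return True, "ok"


print(check_shell("pytest tests/test_worker.py"))
print(check_shell("git push origin main"))
```

The check returns a reason along with the decision, so the refusal can be logged and shown to the model instead of failing silently.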

Practical rule: Tools should be powerful, scoped, observable, and policy-constrained.

Engineer Context Like a Runtime Resource

Context is not a giant text box.

Context is working memory.

If you treat the context window like a dumping ground, agent quality degrades.

The agent becomes distracted. Stale information competes with fresh information. The model starts to miss details that should have been obvious.

A better mental model is a memory hierarchy:

L0: stable identity
L1: task instructions
L2: active working context
L3: retrieved project context
L4: long-term memory
L5: external documents and tools
L6: durable progress artifacts

Each layer has a job.

The identity layer should be small and stable.

The task layer should be specific.

Retrieved context should be relevant and fresh.

Memory should contain durable facts, not noise.

Tool outputs should be summarized instead of blindly appended forever.

Progress artifacts should preserve state across sessions.

Context engineering asks:

What must be in the prompt?
What can be retrieved on demand?
What should be summarized?
What should be persisted?
What should be forgotten?
What should never enter context?

More context is not always better.

Better-routed context is better.
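One way to picture routed context is a budgeted assembler that walks the layers in priority order. This is a rough sketch: it uses character counts as a crude stand-in for tokens, and the layer names and budget are made up for the example.

```python
def assemble_context(layers, budget):
    """layers: list of (name, text) in priority order; budget: max characters."""
    included, used = [], 0
    for name, text in layers:
        if used + len(text) > budget:
            continue  # skip what does not fit; a smaller, lower-priority layer may still fit
        included.append(name)
        used += len(text)
    return included, used


layers = [
    ("identity", "a" * 100),
    ("task", "b" * 200),
    ("retrieved", "c" * 900),  # too large for the remaining budget
    ("memory", "d" * 300),
]
print(assemble_context(layers, budget=700))
```

In practice an oversized layer would be summarized or re-retrieved rather than dropped, but the point is the same: something outside the model decides what enters the window.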

Practical rule: Treat context like RAM, not storage.

Build Durable State Outside the Context Window

Long-running agents fail when all state lives in the chat.

Eventually, the context compresses, degrades, or disappears.

The agent forgets why it made a decision, repeats work, loses track of tests, or declares success without remembering what is still broken.

A serious harness needs durable progress artifacts.

Examples:

PROGRESS.md
PLAN.md
DECISIONS.md
RISKS.md
TODO.md
CHANGELOG.md
git commits
trace logs
test reports

A weak long-running agent does this:

1. Make many changes
2. Lose context
3. Forget why
4. Declare success
5. Leave broken state

A strong long-running agent does this:

1. Read progress
2. Select one task
3. Make a small change
4. Run verification
5. Commit or checkpoint
6. Update progress
7. Record risks
8. Continue

For coding agents, a good PROGRESS.md might look like this:

## Current Goal

Implement scoped retry handling for failed ingestion jobs.

## Completed

- Identified retry path in worker.py
- Added unit test for transient network failure
- Confirmed existing backoff utility exists

## In Progress

- Wiring retry policy into ingestion worker

## Blockers

- Need to confirm max retry count for production

## Next Step

- Add integration test for failed job replay

## Risks

- Duplicate processing if idempotency key is missing

This gives the next agent session a clean handoff.
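For the handoff to work, the harness has to read the file back at the start of the next session. A minimal sketch, assuming the `## Heading` plus `- item` layout shown above (the `parse_progress` helper is hypothetical):

```python
def parse_progress(text):
    """Split a PROGRESS.md-style document into {section heading: [items]}."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current and line.strip():
            sections[current].append(line.strip().removeprefix("- "))
    return sections


sample = """## Current Goal

Implement scoped retry handling for failed ingestion jobs.

## Blockers

- Need to confirm max retry count for production

## Next Step

- Add integration test for failed job replay
"""

progress = parse_progress(sample)
print(progress["Next Step"])
```

The next session can then start from `progress["Next Step"]` and surface `progress["Blockers"]` before doing anything irreversible.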

Practical rule: Long-running agents need state outside the model.

Separate Memory From Skills

Many agent systems confuse memory and skills.

They are not the same. Memory stores facts. Skills store procedures.

Memory answers:

What does the agent know?

Skills answer:

How does the agent do something?

Examples of memory:

The project uses Poetry.
The user prefers concise technical explanations.
The staging deploy requires manual approval.
The API gateway owns refresh-token handling.

Examples of skills:

How to debug a failing Kubernetes pod.
How to review a pull request.
How to investigate a latency regression.
How to create a database migration safely.
How to summarize a production incident.

A skill should be structured and reusable:

---
name: latency-regression-debug
description: Use when p95/p99 latency increases after a deploy.
version: 1.0.0
---

## When to Use

Use when latency regression is reported after a code, config, infra, or model change.

## Procedure

1. Identify affected endpoint or job.
2. Compare p50, p95, and p99 before and after deploy.
3. Check recent diffs.
4. Inspect dependency latency.
5. Check queue depth and saturation.
6. Reproduce with a controlled benchmark if possible.
7. Propose the smallest reversible fix.

## Pitfalls

- Optimizing average latency while ignoring p99.
- Blaming the database before checking queueing.
- Ignoring cold starts.
- Comparing different traffic windows.

## Verification

- Same traffic class.
- Same time window.
- p95/p99 restored.
- No regression in error rate.

This is procedural memory. It helps the agent avoid rediscovering workflows.

Practical rule: Facts go into memory. Repeatable procedures become skills.

Build the Evaluation Harness With the Agent Harness

Agent evals are harder than normal LLM evals.

A chatbot produces an answer. An agent produces a trajectory.

That trajectory includes:

1. Tool calls
2. File reads
3. Edits
4. API calls
5. Retries
6. Failures
7. Recoveries
8. Test runs
9. Final output
10. State changes

A final answer can look correct while the trajectory is bad.

For example, the test passes, but the agent:

1. Edited the wrong abstraction
2. Ignored an existing helper
3. Introduced duplicate logic
4. Skipped security-sensitive checks
5. Used 40 unnecessary tool calls
6. Failed to document risk

That should not be a full pass.

A serious eval harness should measure both outcome quality and process quality.

For agent systems, useful eval dimensions include:

1. Task success
2. Tool selection
3. Tool efficiency
4. State changes
5. Policy violations
6. Latency
7. Token cost
8. Retry behavior
9. Verification quality
10. Diff quality
11. Failure recovery

The key idea is simple:

Agent harness = runs the agent

Eval harness = runs the agent against tasks,
captures traces, grades outcomes,
and aggregates results

You need both.
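Here is a small sketch of grading a run on both outcome and process. The run format (`tests_passed`, `tool_calls`) and the two trajectory checks are assumptions for illustration; a real eval harness would grade many more dimensions.

```python
def grade_run(run, max_tool_calls=10):
    """run: dict with 'tests_passed' (bool) and 'tool_calls' (ordered list of tool names)."""
    calls = run["tool_calls"]
    read_first = "edit" not in calls or (
        "read" in calls and calls.index("read") < calls.index("edit")
    )
    checks = {
        "outcome": run["tests_passed"],
        "read_before_edit": read_first,
        "within_budget": len(calls) <= max_tool_calls,
    }
    return {"checks": checks, "full_pass": all(checks.values())}


def run_suite(runs):
    """Fraction of runs that pass every check."""
    results = [grade_run(r) for r in runs]
    return sum(r["full_pass"] for r in results) / len(results)


good = {"tests_passed": True, "tool_calls": ["read", "edit", "test"]}
bad = {"tests_passed": True, "tool_calls": ["edit", "test"]}  # edited before reading
print(grade_run(bad)["checks"], run_suite([good, bad]))
```

Both runs have passing tests, but only one is a full pass, which is exactly the gap an outcome-only eval would miss.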

Practical rule: Evaluate the trajectory, not just the answer.

Use Macro Evals to Debug Systemic Failures

Single-run debugging is not enough.

Agent systems fail in patterns.

Examples:

Planner delegates too late
Researcher over-collects sources
Coder edits before reading
Reviewer focuses on style instead of correctness
Memory retrieval injects stale context
Tool retry loop burns tokens
Subagents duplicate work
Escalation happens too late

Macro evals look across many traces to identify repeated failure modes.

The workflow looks like this:

1. Collect traces
2. Score individual runs
3. Compress traces into comparable summaries
4. Cluster recurring behavior patterns
5. Rank patterns by impact
6. Inspect representative examples
7. Patch system behavior
8. Rerun evals

This moves you from anecdotal debugging to distribution-level engineering.
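A very small version of the cluster and rank steps: if each scored run already carries failure labels, counting them across failed runs gives a first ranking. The run format and tag names here are assumptions; real clustering of free-form traces takes more than a counter.

```python
from collections import Counter


def rank_failure_patterns(runs):
    """runs: list of dicts with 'passed' (bool) and 'tags' (list of failure labels).

    Returns (tag, failed_run_count, share_of_failed_runs), most common first.
    """
    failed = [run for run in runs if not run["passed"]]
    if not failed:
        return []
    # sorted(set(...)) counts each tag once per run and keeps tie order deterministic
    counts = Counter(tag for run in failed for tag in sorted(set(run["tags"])))
    return [(tag, n, n / len(failed)) for tag, n in counts.most_common()]


runs = [
    {"passed": False, "tags": ["edit_before_read", "retry_loop"]},
    {"passed": False, "tags": ["edit_before_read"]},
    {"passed": False, "tags": ["stale_memory", "edit_before_read"]},
    {"passed": True, "tags": []},
]
for row in rank_failure_patterns(runs):
    print(row)
```

Here one pattern shows up in every failed run, so that is the system behavior to patch first, not whichever single trace was looked at last.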

Instead of asking:

Why did this one run fail?

Ask:

What class of runs fails, and what system behavior causes it?

That is the difference between debugging an example and improving a platform.

Practical rule: Beginners debug examples. Advanced teams debug failure distributions.

Measure Reliability, Not Just Capability

Agents are nondeterministic.

One successful run does not mean the system is reliable.

Two useful metrics are:

pass@k = did at least one of k attempts succeed?

pass^k = did all k attempts succeed?

These measure different things.

pass@k measures capability.

It asks whether the system can solve the task if given multiple chances.

pass^k measures consistency.

It asks whether the system succeeds every time.
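Both metrics are easy to compute from recorded attempts. The sketch below also shows how far apart they drift: if attempts were independent with per-attempt success rate p (an assumption that real agent runs only approximate), the chance of at least one success in k attempts is 1 - (1 - p)^k, while the chance of all k succeeding is p^k.

```python
def pass_at_k(attempts):
    """True if at least one attempt succeeded."""
    return any(attempts)


def pass_hat_k(attempts):
    """pass^k: True only if every attempt succeeded."""
    return all(attempts)


def expected_rates(p, k):
    """Expected pass@k and pass^k, assuming independent attempts with success rate p."""
    return 1 - (1 - p) ** k, p ** k


attempts = [True, False, True, True, True]
print(pass_at_k(attempts), pass_hat_k(attempts))

at_k, hat_k = expected_rates(0.8, 5)
print(round(at_k, 5), round(hat_k, 5))
```

Under that assumption, an agent that succeeds 80% of the time looks nearly perfect on pass@5 and succeeds on all five attempts only about a third of the time.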

A coding agent that solves a task once out of five attempts is capable.

It is not reliable.

A support agent that gives the correct policy once but fails randomly later is dangerous.

Track:

1. Success rate
2. Variance
3. Retry count
4. Cost per success
5. Latency per success
6. Tool calls per success
7. Failure categories
8. Recovery rate

Practical rule: Production agents need consistency, not occasional brilliance.

Design Failure Handling Explicitly

Most agent demos ignore failure handling.

Real systems cannot.

Every agent needs a failure model.

Define what happens when:

1. A tool call fails
2. Retrieval returns stale context
3. Tests fail
4. An API rate limit is hit
5. The agent loops
6. Required context is missing
7. Permissions are insufficient
8. Output confidence is low
9. Subagents disagree
10. Verification cannot be completed

A good failure policy looks like this:

## Failure Policy

If a tool fails:
- retry once if the failure is transient
- do not retry destructive actions automatically
- summarize the failure
- choose an alternate path if available

If tests fail:
- inspect the failure
- make at most one targeted fix
- rerun the narrow test
- if still failing, stop and report

If context is insufficient:
- state what is missing
- proceed only with clearly labeled assumptions
- avoid irreversible actions

Agents should not silently push through uncertainty.

A reliable agent knows when to continue, when to retry, and when to stop.
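The tool-failure part of that policy can be enforced in code rather than left to the prompt. A minimal sketch, where `TransientError`, `call_with_policy`, and `make_flaky` are hypothetical names and the retry limit of one mirrors the policy above:

```python
class TransientError(Exception):
    """Raised by a tool for failures worth retrying (timeouts, rate limits)."""


def call_with_policy(tool, *, destructive=False, max_retries=1):
    """Run a tool; retry transient failures, but never retry destructive actions."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return {"ok": True, "result": tool(), "attempts": attempts}
        except TransientError as exc:
            if destructive or attempts > max_retries:
                return {"ok": False, "error": str(exc), "attempts": attempts}
        except Exception as exc:  # non-transient: stop and report
            return {"ok": False, "error": str(exc), "attempts": attempts}


def make_flaky(failures):
    """Build a demo tool that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def tool():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("timeout")
        return "done"

    return tool


print(call_with_policy(make_flaky(1)))
print(call_with_policy(make_flaky(1), destructive=True))
```

The result always carries the attempt count and the error text, so the agent gets a summary of the failure instead of a raw exception.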

Practical rule: Failure handling is part of the harness, not an afterthought.

Use Multi-Agent Systems Only When Coordination Pays Off

Multi-agent systems sound advanced.

Often they are just expensive chaos.

Use multiple agents only when the task benefits from parallelism or specialization.

Good fits:

1. Broad research
2. Multi-source investigation
3. Red-team / blue-team review
4. Planner-coder-reviewer workflows
5. Independent verification
6. Large codebase exploration

Bad fits:

1. Simple Q&A
2. Small code edits
3. Basic summarization
4. Single-file changes
5. Narrow classification

A useful architecture:

Lead agent
  owns task framing, planning, and synthesis

Research agents
  explore independent branches

Coder agent
  makes implementation changes

Reviewer agent
  checks correctness, safety, and regressions

Verifier agent
  runs tests and validates outputs

Important harness rules:

1. Give each agent a narrow role
2. Set token and tool budgets
3. Require compressed findings
4. Avoid raw context dumps
5. Prevent duplicate work
6. Define handoff contracts
7. Evaluate the system as a whole

Multi-agent systems are not automatically better.

They are better only when coordination is cheaper than sequential work.

Practical rule: Add agents when specialization creates leverage, not because the diagram looks impressive.

Add Observability From Day One

If you cannot inspect an agent, you cannot improve it.

A production-grade harness should emit traces.

Capture:

1. Input task
2. Loaded context
3. Retrieved memories
4. Selected skills
5. Tool calls
6. Tool outputs
7. State transitions
8. Errors
9. Retries
10. Final answer
11. Cost
12. Latency
13. User feedback

Without traces, you cannot answer:

Why did the agent choose this tool?
Why did it ignore the relevant file?
Why did it retrieve stale memory?
Why did it loop?
Why did cost spike?
Why did the final answer look correct but fail?

Observability enables debugging, evals, macro analysis, cost control, policy enforcement, skill improvement, and memory cleanup.
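Tracing does not need a platform to get started. Here is a minimal sketch of a recorder that emits one JSON object per event; the `Tracer` class and its field names are assumptions for illustration, not the schema of any particular tracing tool.

```python
import json
import time


class Tracer:
    """Collect ordered events for one agent run and export them as JSON lines."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.events = []

    def record(self, kind, **fields):
        self.events.append(
            {"run_id": self.run_id, "seq": len(self.events), "ts": time.time(), "kind": kind, **fields}
        )

    def to_jsonl(self):
        return "\n".join(json.dumps(event, sort_keys=True) for event in self.events)


tracer = Tracer("run-1")
tracer.record("tool_call", tool="grep", args={"pattern": "retry"})
tracer.record("error", message="rate limit hit", retry=1)
print(tracer.to_jsonl())
```

Because every event has a run id and a sequence number, the same records can feed single-run debugging and the cross-run analysis described earlier.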

Practical rule: No traces, no serious agent engineering.

Put Governance Around the Harness

Agent adoption is not only technical.

It is organizational.

Without governance:

1. Every developer writes their own prompts
2. Permissions drift
3. Skills duplicate
4. Memory gets messy
5. Evals are missing
6. Tools are unsafe
7. Nobody owns regressions

With governance:

1. Shared configs
2. Shared skills
3. Shared evals
4. Clear permissions
5. Standard review process
6. Centralized observability
7. Safer rollout
8. Faster onboarding

Every serious agent platform needs a DRI.

Someone must own:

1. Identity files
2. Tool policies
3. Memory policy
4. Skill library
5. Eval suite
6. Permission model
7. Release process
8. Incident review
9. Documentation

Bottom-up experimentation creates energy.

Governance turns it into infrastructure.

Practical rule: If nobody owns the harness, nobody owns the agent.

Final Takeaway

Next-level AI agents are not built by writing bigger prompts.

They are built by engineering better harnesses.

The model is the reasoning engine.

The harness is the operating system around it.

A serious agentic harness needs:

1. Identity
2. Execution contracts
3. Tool policies
4. Context engineering
5. Memory discipline
6. Skills
7. Durable state
8. Failure handling
9. Trajectory evals
10. Macro evals
11. Observability
12. Governance

If you are a student, learn this early.

If you are a developer, practice this deliberately.

If you are building AI products, treat this as infrastructure.

The best AI developers will not be the ones who only know how to call an API.

They will be the ones who know how to design the system around the model.

That is how we move from:

“This AI agent helps me sometimes.”

To:

“This agentic harness is part of my engineering system.”

References

Anthropic: Demystifying evals for AI agents
Anthropic: Building effective agents
OpenAI Cookbook: Building Governed AI Agents
Anthropic: Effective harnesses for long-running agents
Anthropic: Effective context engineering for AI agents
OpenAI Cookbook: Macro Evals for Agentic Systems
OpenAI Cookbook: Getting started with OpenAI Evals
Anthropic: How we built our multi-agent research system

Personalized Food Recommendation RAG bot on WhatsApp

Mrunmayee Rane — Sun, 26 Apr 2026 14:01:07 +0000

Moving from New York City to the West Coast, I found it difficult to decide what to eat for my meals. It was also very challenging to find healthy, good restaurants in California. In New York City, it was easy to pick a spot and a cuisine, because every lane already had 20–25 good restaurants. I had covered a wide range of restaurants in New York, from the best of Chintan Pandya’s Dhamaka to the casual Thai at Up Thai, and everything was a quick walk or a few subway stops away. In California, I was in for a surprise.

Let’s face it: finding meal options that match personal preferences such as vegan, gluten-free, sugar-free, or pescatarian food across a variety of cuisines is like searching for a needle in a haystack. Having recognized this dilemma, and inspired by NVIDIA’s LLM Developer Day, I embarked on a mission to simplify this search.

The goal? To create a system that understands your craving and points you to the ideal meal.

The journey began with the Yelp academic dataset, a huge goldmine of user reviews and business information. We narrowed it to California, a hub of diverse and vibrant culinary culture, and sampled it down to 20k records for efficiency.

Leveraging Retrieval-Augmented Generation (RAG) for Personalized Recommendations

A key innovation in our system is the incorporation of Retrieval-Augmented Generation (RAG). RAG combines the strengths of both retrieval-based and generative AI models, enabling our system to provide highly accurate and personalized food recommendations. This approach works by first retrieving relevant information from our extensive dataset — in this case, the Yelp academic dataset — which includes a wide range of user reviews and business information. Then, using generative models, RAG synthesizes this information to produce coherent and context-specific recommendations. This method is particularly effective for catering to diverse dietary preferences and cuisines, as it can seamlessly integrate vast amounts of detailed data, including vegan, gluten-free, sugar-free, and pescatarian options. By leveraging RAG, we ensure that our recommendations are not just data-driven but also finely tuned to each user’s unique taste and preferences, truly embodying the essence of a personalized recommendation system.

We merged the business and user-review datasets, creating a detailed hashmap of businesses.

This hashmap contained detailed information for each business, including the name, ID, address, city, state, postal code, user reviews, operational hours, and categories: a treasure trove of information for any foodie. Recognizing the complexity of handling multiple user reviews and ratings for a single business ID, we employed an aggregation method. This approach averaged user ratings and consolidated multiple reviews per business, ensuring a more streamlined dataset. Subsequently, we transformed the hashmap back into a dataframe, and eventually into a CSV file, to facilitate easier referencing and mapping.
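As a rough illustration of the aggregation step, here is a plain-Python sketch that averages ratings and consolidates review text per business. It is not the project's original code: the field names (`business_id`, `stars`, `text`) and the `" | "` separator are assumptions, and the original pipeline worked with dataframes.

```python
from collections import defaultdict
from statistics import mean


def aggregate_reviews(reviews):
    """Group reviews by business: average the stars and join the review texts."""
    grouped = defaultdict(lambda: {"stars": [], "texts": []})
    for review in reviews:
        grouped[review["business_id"]]["stars"].append(review["stars"])
        grouped[review["business_id"]]["texts"].append(review["text"])
    return {
        business_id: {"avg_stars": mean(g["stars"]), "reviews": " | ".join(g["texts"])}
        for business_id, g in grouped.items()
    }


reviews = [
    {"business_id": "b1", "stars": 4, "text": "Great vegan tacos"},
    {"business_id": "b1", "stars": 5, "text": "Fresh and fast"},
    {"business_id": "b2", "stars": 3, "text": "Decent pho"},
]
print(aggregate_reviews(reviews))
```

Each business ends up as a single record, which is what makes it straightforward to write one row per business to a CSV afterwards.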

To create embeddings and load the entire CSV document, we used langchain.document_loaders.csv_loader. To manage the large volume of data, we divided the document into smaller chunks that the LLM can process efficiently. The RecursiveCharacterTextSplitter from LangChain was used for generic text splitting, ensuring the data was appropriately segmented.

Text Embeddings:

The model path sets the pre-trained model used for embeddings, which is sentence-transformers/all-MiniLM-l6-v2.

The code configures and initializes a sentence transformer model from Hugging Face for generating embeddings. It specifically uses the all-MiniLM-l6-v2 model, runs on the CPU, and produces non-normalized embeddings. Normalization is often used to standardize the length of the embedding vectors.

Chroma is a tool used for efficient similarity search and retrieval in large collections of data. It helps when you need to find the most similar items quickly among a large number of embeddings. from_documents is a method that creates a Chroma database from a set of documents and their embeddings. embeddings is an object initialized using HuggingFaceEmbeddings, which converts text documents into vector embeddings. The embeddings for the docs are generated and used by Chroma to enable efficient similarity searches.

Retrieved Data:

retriever is a retriever object created from the previously initialized Chroma database (db).

as_retriever is a method that transforms the database into a retriever capable of performing search operations. search_type="mmr" specifies the type of search algorithm used. "mmr" stands for Maximal Marginal Relevance. MMR is used to retrieve diverse results by balancing relevance and diversity, ensuring that the retrieved documents are not just relevant but also varied. get_relevant_documents is a method that takes a query and returns a list of documents that are most relevant to the query. num_results=7 specifies the number of results to return. It’s set to retrieve the top 7 relevant documents.
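To show what MMR is doing, here is a small pure-Python sketch of the selection rule. It is not Chroma's or LangChain's implementation; the `mmr` function, the `lambda_mult` weight, and the toy 2-D vectors are assumptions for illustration. At each step it picks the document with the best trade-off between similarity to the query and similarity to what has already been selected.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, docs, k, lambda_mult=0.5):
    """Return indices of k docs chosen by Maximal Marginal Relevance."""
    selected, candidates = [], list(range(len(docs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query, docs[i])
            # Redundancy: similarity to the closest document already selected.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.3]
docs = [[1.0, 0.2], [1.0, 0.25], [0.5, 1.0]]  # docs 0 and 1 are near-duplicates
print(mmr(query, docs, k=2))                   # balances relevance and diversity
print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With the balanced weight, the second pick skips the near-duplicate in favor of the different document. For restaurant recommendations, that is the difference between several reviews of the same kind of place and a varied shortlist.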

The statements above save the embeddings to a local persistent directory so that they can be easily retrieved when needed.

The code then loads the Chroma database for similarity searches and performs a search with a specified query. similarity_search_with_score is a method that searches for the documents most similar to the given query based on their embeddings and returns them with a similarity score. The results are then sorted in descending order so that the highest-ranked documents come first.

Prompt Creation:

I combined the details of the top 5 retrieved businesses with a detailed prompt using the prompt template in LangChain, then sent the complete prompt to the Llama2–70B or SteerLM Llama 70B model through NVIDIA’s Cloud Function (NVCF) API to generate a response.

Integration with WhatsApp through Twilio and Ngrok:

Twilio is a powerful platform for communications, enabling us to send and receive messages, make and receive phone calls, and more. In this project, we use Twilio to receive user queries via SMS and respond with food recommendations.

Setting Up a Twilio Account:

First, create a Twilio account and get a phone number that can send and receive SMS messages. Then obtain your Twilio Account SID, Auth Token, and phone number from the Twilio Console.

Install the Twilio Python helper library to handle messaging: pip install twilio

Configure Twilio to Forward Incoming Messages:

In the Twilio Console, configure your Twilio phone number to forward incoming messages to your FastAPI endpoint exposed by ngrok. This is typically done by setting the “Messaging” webhook URL to point to your /recommendation endpoint (e.g., https://your-domain.com/recommendation).

Processing Incoming Messages:

In the FastAPI app, we define an endpoint /recommendation that will handle incoming POST requests from Twilio. Twilio sends incoming messages to this endpoint. When a message is received, the content of the message is extracted and passed to the generate_answer function, which generates the food recommendation based on the user’s query. The response is then wrapped in a Twilio MessagingResponse object and sent back to the user.
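The request and response shapes involved are simple enough to sketch with the standard library. This is an illustration, not the project's FastAPI code: Twilio posts the message as form-encoded fields (the text arrives in `Body`), and the reply is a small TwiML XML document, which the Twilio helper library's MessagingResponse would normally build for you. Here the XML is built by hand, and `generate_answer` is passed in as a stand-in for the real recommendation function.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs


def extract_body(raw_form: str) -> str:
    """Pull the message text out of a form-encoded webhook payload."""
    return parse_qs(raw_form).get("Body", [""])[0]


def twiml_reply(text: str) -> str:
    """Build a TwiML document that replies with a single message."""
    response = ET.Element("Response")
    ET.SubElement(response, "Message").text = text  # special characters are escaped
    return '<?xml version="1.0" encoding="UTF-8"?>' + ET.tostring(response, encoding="unicode")


def handle_incoming(raw_form: str, generate_answer) -> str:
    return twiml_reply(generate_answer(extract_body(raw_form)))


payload = "Body=vegan+tacos&From=%2B15550001111"
print(handle_incoming(payload, lambda query: f"Try: {query}"))
```

In the real endpoint, the returned string would be sent back with an XML content type so Twilio delivers it to the user as a message.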

Setting Up Webhooks in Twilio:

To complete the integration, you need to set up a webhook in Twilio to point to your FastAPI endpoint exposed by ngrok. Here’s how you can do it:

Log in to your Twilio Console.
Navigate to the “Phone Numbers” section and select the number you want to use.
In the “Messaging” section, set the “A Message Comes In” webhook to your ngrok URL, e.g., https://your-domain.com/recommendation.

Why Use Ngrok?

Ngrok is a tool that creates a secure tunnel to your localhost, allowing you to expose a local server to the internet. When developing locally, your FastAPI application runs on localhost, which is not accessible from the internet. Twilio needs a publicly accessible URL to send webhook requests to your /recommendation endpoint. Ngrok provides this by tunneling requests from a public URL to your local development server.

Setting Up Ngrok:

Install Ngrok with the pip install pyngrok command.

Sign Up and Configure Ngrok:

Sign up for a free account on the Ngrok website to get your authentication token, which you need to configure Ngrok. Use the following command to add your auth token: ngrok authtoken YOUR_AUTH_TOKEN

First, run your FastAPI application on your local machine, then start Ngrok in a new terminal: ngrok http 5000. After starting Ngrok, you will see the forwarding link, which we will use to configure the Twilio webhook.

Update Twilio Webhook

Set the “A Message Comes In” webhook in the Messaging section to your ngrok public URL followed by the /recommendation endpoint.

The technologies behind these tastes were LangChain, Hugging Face, pandas, Chroma for vector storage, Streamlit for the user interface, and the SteerLM Llama 70B model through NVIDIA’s Cloud Function (NVCF).

What’s Next: Enhancing and Expanding:

My vision includes integrating user feedback mechanisms, map functionalities, and personalized dietary preferences into the system. We also plan to evaluate our method with a larger dataset not limited to California, refining the approach for even better accuracy.

Happy to connect on LinkedIn!

Building a Multi Agent Career and Workplace Assistant at Stanford Hackathon

Mrunmayee Rane — Tue, 28 Jan 2025 06:35:46 +0000

I participated in my first Hackathon for Women in AI at Stanford University, organized by Twelve Labs, Zilliz, GenAI Collective, and Women Who Do Data (W2D2). One of the key insights I gained is that integrity is the most underrated value in today’s workplaces, often leading to conflicts and misunderstandings when overlooked.

In today’s fast-paced professional landscape, employees face numerous challenges, ranging from workplace stress to navigating career transitions. Despite the availability of HR teams, training programs, and mentorship opportunities, there remains a significant gap in providing real-time, personalized, and scalable guidance.

This is where the Multi-Agent Career and Workplace Assistant comes in: an innovative application designed to address this gap. By leveraging AI-powered tools, it delivers personalized career coaching and workplace guidance, empowering employees to overcome challenges and achieve their professional goals.

The Problem

Employees often encounter challenges such as:

Workplace Stress: Interpersonal conflicts, unclear communication, or work overload.
Career Transition: Navigating skill gaps, identifying the right resources, and making informed career decisions.

Organizations face challenges too:

Scalability: Providing mentorship and career coaching to all employees.
Customization: Tailoring advice to individual needs.
Engagement: Delivering relevant, on-demand learning resources.

The Solution: Multi-Agent Career and Workplace Assistant

This application uses a multi-agent framework to classify employee queries into two categories and address each one:

Workplace Stress: Provides actionable advice and stress management resources.
Career Transition: Generates a structured learning path for transitioning into new fields.

Key Features

Intent Classification: Uses AI to determine whether the query is related to workplace stress or career transition.
Resource Retrieval: Embeds and retrieves relevant videos and PDFs using embedding models such as Twelve Labs’ Marengo Retriever 2.7 and Hugging Face sentence-transformers/all-MiniLM-L6-v2.
Generative AI Integration: Delivers personalized advice using the Gemini 2.0 Flash Experimental model.
Scalability: Supports multiple users with minimal human intervention.

Technology Stack

LangChain: For creating and managing document embeddings.
Twelve Labs: For video embeddings and processing.
Zilliz/Milvus: To store and retrieve vector embeddings efficiently.
Streamlit: For a user-friendly interface.
Google Generative AI: To generate natural language responses using the Gemini 2.0 Flash Experimental model.
PyPDF2: For PDF parsing and text extraction.

How Does It Work?

Step 1: Intent Classification

The assistant classifies user queries into two categories:

Career Transition: Queries about learning new skills or exploring new career paths.
Workplace Stress: Queries about workplace conflicts, communication issues, or stress management.

Step 2: Resource Embedding and Retrieval

The application preprocesses and embeds PDFs and videos into Zilliz/Milvus, enabling fast and efficient similarity searches.

Embedding PDFs:

Embedding Videos:

Similarity Search:

Step 3: Creating System Prompt

Based on the classified intent, the assistant queries the embedding database for relevant resources and uses Generative AI to provide personalized recommendations.
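As a rough sketch of this step, the prompt can be assembled from the classified intent and the retrieved resources. Everything here is hypothetical (the intent keys, the guidance wording, and the resource fields), and it is not the project's actual prompt, which can be found in the repository.

```python
GUIDANCE = {  # hypothetical instructions per intent
    "workplace_stress": "Give actionable advice and stress management resources.",
    "career_transition": "Lay out a structured learning path for the new field.",
}


def build_system_prompt(intent, resources):
    """Combine intent-specific guidance with the retrieved resources."""
    lines = [
        f"You are a career and workplace assistant. {GUIDANCE[intent]}",
        "Relevant resources:",
    ]
    lines += [f"- {r['title']} ({r['kind']})" for r in resources]
    return "\n".join(lines)


resources = [
    {"title": "Managing Conflict at Work", "kind": "video"},
    {"title": "Communication Basics", "kind": "pdf"},
]
print(build_system_prompt("workplace_stress", resources))
```

Keeping the guidance in a lookup keyed by intent means an unrecognized intent fails loudly with a KeyError instead of producing a generic prompt.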

Step 4: Streamlit UI

The application uses Streamlit for an intuitive UI where users input queries and receive personalized advice. The retrieved PDFs and videos are displayed with thumbnails and clickable links.

Conclusion

The Multi-Agent Career and Workplace Assistant bridges the gap between employees and scalable, personalized mentorship. By leveraging state-of-the-art AI tools, it provides timely and actionable guidance, ensuring both employees and organizations thrive in today’s dynamic professional environments.

Github: https://github.com/mrunmayee17/Career_and_Workplace_Assistant

Happy to connect on LinkedIn!