How We Trained Your AI Agent, Part 2: Reinforcement Learning

In our previous post, we talked about the initial training phase for AI agents. To give you a quick recap:
  • AI agents use systems that are trained on specialized data sets, and are capable of independently solving multi-step problems in service of a particular goal.
  • The initial training phase is all about giving an AI agent context. This is done through Retrieval-Augmented Generation (RAG), which teaches an AI agent to use business-specific data and constrains it to only the data that’s relevant to your organization or industry. This vastly improves its accuracy over general-purpose generative AI (a toy sketch of the idea follows this recap).
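To make that recap concrete, here’s a minimal, toy sketch of the RAG idea: retrieve the most relevant business documents for a question and constrain the prompt to them. The document list, similarity scoring, and prompt wording below are illustrative stand-ins rather than any vendor’s API; a production system would use vector embeddings and an actual LLM call.

```python
# Toy RAG sketch: ground the agent in business-specific data and tell it
# to answer ONLY from that data. Everything here is illustrative.
from difflib import SequenceMatcher

BUSINESS_DOCS = [
    "Refund policy: customers may return items within 30 days of delivery.",
    "Shipping policy: orders over $50 ship free within the US.",
]

def retrieve(question: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by rough textual similarity to the question."""
    ranked = sorted(
        docs,
        key=lambda d: SequenceMatcher(None, question.lower(), d.lower()).ratio(),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question: str) -> str:
    """Constrain the model to the retrieved, business-specific context."""
    context = "\n".join(retrieve(question, BUSINESS_DOCS))
    return (
        "Answer using ONLY the context below. If the answer is not there, "
        f"say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
    )

print(build_prompt("How long do customers have to return an item?"))
```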
This initial phase of training an AI agent is all about showing it what its job is, and how to do it. But to improve, AI agents need feedback on their performance (just like people do). This second phase is where reinforcement learning begins, and in this post we’ll focus on RLHF.

RLHF: Reinforcement Learning from Human Feedback

RLHF is a fancy way of saying that you look at the work the AI agent has done so far and grade it as “right” or “wrong.” You create a list of questions for the AI agent, review the answers the agent gives to those questions, and then mark whether or not the answers were correct.

Then the agent uses that data to improve its responses. Like an eager student, it craves the reward of getting a correct answer and alters its behavior to try to provide more of them.
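In practice, the grading step can be captured in a very simple structure: the question, the agent’s answer, and a binary human grade. The sketch below is a minimal illustration; agent_answer and human_grade are hypothetical stand-ins for your deployed agent and your human reviewer.

```python
# Minimal sketch of collecting right/wrong feedback for RLHF-style training.
# agent_answer and human_grade are hypothetical callables, not a real API.
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    question: str
    agent_answer: str
    correct: bool  # the human grade: True = "right", False = "wrong"

def grade_batch(questions, agent_answer, human_grade):
    """Ask the agent each question and attach a binary human grade."""
    records = []
    for q in questions:
        answer = agent_answer(q)
        records.append(FeedbackRecord(q, answer, human_grade(q, answer)))
    return records

def reward(records):
    """The fraction of answers graded 'right': the signal the agent is
    nudged to maximize during reinforcement learning."""
    return sum(r.correct for r in records) / max(len(records), 1)
```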

One key distinction between training an AI agent and coaching a person is that with a colleague you can be nuanced. An AI agent has a hard time understanding feedback beyond “right” and “wrong” (at least for now). So your training data should only include falsifiable answers rather than explanations.
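As an illustration (the field names here are made up), this is the kind of feedback item the agent can learn from versus the kind it can’t:

```python
# A falsifiable grade the agent can use...
good_item = {
    "question": "What was October churn?",
    "agent_answer": "4.2%",
    "correct": False,  # clear, checkable: the answer was wrong
}

# ...versus a nuanced comment it can't act on.
bad_item = {
    "question": "What was October churn?",
    "feedback": "Sort of right, but I'd frame it differently and add context.",
}
```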

A side benefit of this process is that the act of grading forces the humans creating the training set to become very specific. Just like with any human interaction, where both sides learn from each other, you’ll learn from grading the agent’s work how to shape your questions to get the answers you seek, all while the AI agent learns what your expectations are for its outputs.

RLHTF: Everyone Has an Opinion - Use That!

One danger of RLHF is that if you only have one person providing feedback, the model could skew towards that one viewpoint. So to keep your feedback from becoming lopsided, the best way to review it is with a group. We’ve invented our own acronym: RLHTF (Reinforcement Learning from a Human Team’s Feedback).

If you’ve ever worked on a team, you know that there are always a ton of different perspectives in the room. If two people were to evaluate a response from an AI agent, it’s possible that one person might feel like it was perfectly clear, while another might feel it was completely incoherent.

Additionally, different team members will have varying needs. Your VP might want a high-level summary at the top of each output, your Marketing colleague might think answers are only useful when accompanied by a graph or chart, and your Analyst might want all answers to include a list of possible exceptions.

One of the best ways to provide human feedback is to get multiple perspectives by selecting a team made up of people with different backgrounds. Each person on the team grades the agent’s responses independently, and then afterward the team discusses their scoring and uses their varied perspectives to build a single unified rubric that the AI can use going forward. This standardization process gives the agent a clear, shared standard to work toward rather than one person’s preferences.
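Here’s a rough sketch of that review step, assuming each reviewer records an independent right/wrong grade: unanimous answers pass straight through, while disagreements get flagged for the rubric discussion. The reviewer names and grades below are made up.

```python
# Sketch of team-based review (RLHTF): independent grades, then surface
# disagreements so the team can turn them into rubric rules.
from collections import Counter

team_grades = {
    "What was Q3 revenue?": {"vp": True, "marketing": True, "analyst": True},
    "Which region grew fastest?": {"vp": True, "marketing": False, "analyst": False},
}

def consensus(grades: dict) -> bool:
    """Majority vote across the team's independent right/wrong grades."""
    votes = Counter(grades.values())
    return votes[True] > votes[False]

for question, grades in team_grades.items():
    if len(set(grades.values())) == 1:
        print(f"consensus: {question!r} graded {'right' if consensus(grades) else 'wrong'}")
    else:
        print(f"DISCUSS: {question!r} split {dict(grades)} -> add a rubric rule")
```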

Your Data May Be Gold, but Your Feedback Is Diamond

You might read that process and think…wow, that’s a lot of person-hours. Which is a valid point: it does require people to spend their time grading an agent’s responses and figuring out how to teach it what it needs to know. But that time is well worth the investment.

Once your AI agent is trained, and repeatedly retrained, you’ve got an extremely valuable data set. One key benefit: if you ever need to retrain your model, or even switch the LLM your agent uses, that validated data lets you do it much faster.

If you want to, for example, transition from OpenAI to DeepSeek, or to some other LLM that hasn’t been invented yet, you’ll have a validated data set that you can feed into the new model, as well as a rubric to teach it how to answer questions. While there’s a good deal of up-front labor to create that data, it means you’ll have everything you need to save yourself time in the future.
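As a sketch, the switch can be as simple as replaying your graded questions through an adapter for the new model and scoring it against the answers your team already accepted. call_model and is_equivalent below are hypothetical stand-ins for your provider adapter and your comparison rule.

```python
# Sketch: reuse the validated Q&A set to evaluate a replacement LLM.
def evaluate_new_model(records, call_model, is_equivalent) -> float:
    """Replay every question the team graded as 'right' and report what
    fraction the candidate model answers equivalently."""
    accepted = [r for r in records if r["correct"]]
    if not accepted:
        return 0.0
    passed = sum(
        is_equivalent(call_model(r["question"]), r["agent_answer"])
        for r in accepted
    )
    return passed / len(accepted)

# Toy usage with stand-in callables:
records = [{"question": "Return window?", "agent_answer": "30 days", "correct": True}]
print(evaluate_new_model(records, lambda q: "30 days", lambda a, b: a == b))
```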

How Often Should You Give Feedback?

It’s good practice to develop a process around continuing to sharpen your AI agent’s skills by doing some reinforcement learning on a regular basis — weekly, monthly, or quarterly. (We recommend doing it weekly in the early stages of your model and then switching to monthly or quarterly once it’s live.) 

The questions we want to answer with our data, and the challenges that a business is facing, shift over time, sometimes imperceptibly. While humans will pick up on those shifts through meetings, casual conversation, or even just instinct, AI agents need to actually be told what’s going on with falsifiable data. Checking in with the AI agent and making sure it’s still giving relevant answers is important to ensure that it’s adapting at the same rate as your human team. 

Next week, we’ll be talking about steering committees, and how to organize a team to advance AI in the most effective way possible.
