AI Test Optimization
What are AI tests?
AI tests are like any other test, but they are designed to make more efficient use of AI than a standard test would. AI and machine learning systems almost universally share one trait: they improve by learning from and testing against large data sets. To get the most out of AI for your team, design your testing systems the same way. The initial pieces of copy written by an AI will be clunky and may not perform well at first, even if they seem okay. You must put time and effort into determining how to use AI in a way that improves your systems.
NOTE: This concept will be most easily implemented in AI copywriting systems.
The why and the how
One of the biggest inhibitors to testing more variations is the limit on our ability to develop the content we want to test. That inhibitor is significantly reduced by AI-generated content. It should be noted that AI is still in its early stages of development, and all copy should be reviewed for accuracy. Think of it like hiring a new team member to write copy and having no one review it. You would never do that.
How to optimize around AI-generated content:
Determine what instructions generate the general content you want.
Begin tracking different “keywords” you give the AI when it generates content.
Test and review what keywords lead the AI to generate the best content.
Test a lot and really try different methodologies: everything you wanted to do before but couldn’t.
Ask and test multiple versions of the same request to validate that the AI can develop many versions that all perform well.
Alter intent and prompts for different user cohorts and different points in the journey.
Create clear and consistent instructions
Most AI systems have limited memory; they remember only a certain number of requests. Keep track of the specific prompt phrasing you give the AI so that when you return to it, you can prompt it the same way or make deliberate changes for a specific test.
For example, a team that wants to write an email may give the AI the following prompt: “Write an email about Americans' need to exercise more to reduce the risk of heart disease.” This differs from “Write an email about people needing to exercise more to improve their health.” or “Write an email about adults’ need to run more to improve heart health.” We have no way of knowing in advance which prompt is better, but one will likely have a higher average conversion rate or a different variance in its range.
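Once each phrasing has been tested, the comparison itself is simple arithmetic. A minimal sketch, using entirely hypothetical conversion numbers for the three phrasings above:

```python
# Hypothetical conversion data for three prompt phrasings: (conversions, sends).
# The numbers are illustrative only, not real results.
variants = {
    "Americans' need to exercise more to reduce the risk of heart disease": (52, 1000),
    "people needing to exercise more to improve their health": (47, 1000),
    "adults' need to run more to improve heart health": (61, 1000),
}

# Compute the average conversion rate for each phrasing.
rates = {prompt: conversions / sends for prompt, (conversions, sends) in variants.items()}

# Identify the best performer so far.
best = max(rates, key=rates.get)
print(best, f"{rates[best]:.1%}")
```

In practice you would also look at the variance across repeated runs of each prompt, not just the single best average.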
Tips to track prompt phrasing
Keep a detailed list of all phrasing used in the prompt, the content developed by it, and the conversion rate observed.
Create specific instructions for how to format the content. “Make an email about X with a header section, two body paragraphs and less than 150 words.”
Reverse engineer some of your better-performing content pieces. Ask the AI what it sees when it reads the piece and try instructing it to execute those same points.
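The detailed list in the first tip can be as simple as an append-only CSV. A minimal sketch, where the field names and helper name (`log_prompt`) are assumptions rather than a fixed standard:

```python
import csv
import os

# Field names are assumptions; adapt them to whatever your team tracks.
FIELDS = ["prompt", "keywords", "output_summary", "sends", "conversions"]

def log_prompt(path, prompt, keywords, output_summary, sends, conversions):
    """Append one test record to a CSV log, writing a header row if the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "prompt": prompt,
            "keywords": keywords,
            "output_summary": output_summary,
            "sends": sends,
            "conversions": conversions,
        })
```

A plain file like this is enough to answer later questions such as "which keywords kept showing up in the winners?" without relying on anyone's memory.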
Utilize evolutionary algorithm strategies
Our article on Step Gain Optimizations discusses how tests on as few as 100 users can generate directional learnings. When you first create instructions, create a wide range of potential prompts and test them all a few times. Once you determine which prompts perform better, create variants of those and test again. Iterate on this until you find the incremental gains are insignificant, then start again by looking for a new step gain.
This strategy is similar to the evolutionary algorithms used in many AI systems: it pits many “prompts” against each other, chooses a few winners, creates variants of them, and tries again. Over time, the system’s rate of improvement will diminish. We can then use that opportunity to again create a wide range of variants from the final best product until we find new variants that compete with the original. For example, we would develop variants for tests 8-10 in our table of various tests.
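The select-the-winners, spawn-variants loop described above can be sketched in a few lines. This is a hypothetical skeleton: in real use, `score` would be a measured conversion rate from a live test and `mutate` would be a person (or the AI itself) rewording a winning prompt, not pure functions.

```python
def evolve_prompts(seed_prompts, score, mutate, generations=5, keep=2, children=3):
    """Evolutionary loop: score every prompt, keep the top performers,
    spawn variants of the winners, and repeat.

    `score` and `mutate` are caller-supplied stand-ins for "run a live test"
    and "write a reworded variant" respectively.
    """
    pool = list(seed_prompts)
    for _ in range(generations):
        # Keep the best-scoring prompts from this round.
        winners = sorted(pool, key=score, reverse=True)[:keep]
        # Next round: the winners plus several variants of each.
        pool = winners + [mutate(w) for w in winners for _ in range(children)]
    return max(pool, key=score)
```

When the best score stops moving between generations, that is the signal the article describes: incremental gains have become insignificant, and it is time to look for a new step gain.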
Test different “moods” or “emotions”
One easy way to develop many different versions of similar content pieces is to have the AI imply different “moods” and “emotions” and see which your audience responds to better.
NOTE: Make sure to review how different user cohorts at different journey points react. It is unlikely that there is a singular “mood” that wins with everyone all the time.
Tips for creating different mood prompts
Give your AI pieces of content you have seen that did well and ask what it perceives as the mood.
Consider how your users may feel at that specific journey point.
Consider how you want your users to feel at this point.
Start with very different emotions. Things like “happy” and “gleeful” are too similar to produce significant results.
Don’t hesitate to sprinkle in negative emotions at some points. These initial tests are small, and you may learn something.
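Generating one prompt per mood from a single base request is mechanical enough to script. A minimal sketch; the mood labels and prompt template below are illustrative assumptions, deliberately chosen to be distinct from each other as the tips suggest:

```python
# Deliberately distinct moods, per the tip above; "happy" vs. "gleeful"
# would be too similar to produce a meaningful difference.
MOODS = ["urgent", "reassuring", "playful", "matter-of-fact"]

def mood_prompts(base_request, moods=MOODS):
    """Produce one prompt per mood from a single base request."""
    return [f"In a {mood} tone, write {base_request}" for mood in moods]

for prompt in mood_prompts("an email about exercising more to improve heart health"):
    print(prompt)
```

Each generated prompt would then be logged and tested like any other variant, segmented by user cohort and journey point.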
Example tracker for AI-generated content
A Google Sheets document can be used to track AI-generated content and its performance. View the template.
It includes:
Prompts used
Specific prompt input into the AI
Output from the AI
Space for conversion data
A “key” and “tracker” sheet to understand the process as it has developed.
A sheet on observed lessons from the testing.
This table is not designed as a presentation of results but as a record of what was done and what was observed. AI is a black box of information. If you do not take detailed notes on the process, there is no one you can ask to learn “what they did and why.”
It is incredibly important to take notes not only on what was observed in the moment but also on what you think you learned from the test. By doing this, you can track consistent learnings and seemingly contradictory ones. With AI, you can run far more tests than you could before, and they may not always agree with each other. As a reminder, even a statistically significant result at the 95% level will still be wrong in about 1 out of every 20 cases.
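That 1-in-20 figure compounds quickly once AI lets you run many more tests. A quick illustration, using a hypothetical volume of 40 tests:

```python
# At a 95% significance threshold, a test with no real underlying effect
# still reads as "significant" about 5% of the time.
alpha = 0.05
num_tests = 40  # hypothetical number of AI-enabled tests run in a quarter

# Expected number of false positives across all tests.
expected_false_positives = num_tests * alpha

# Probability that at least one test is a false positive.
chance_of_at_least_one = 1 - (1 - alpha) ** num_tests

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Chance of at least one: {chance_of_at_least_one:.0%}")
```

This is exactly why the tracker records contradictory learnings instead of discarding them: with 40 tests, a couple of spurious "winners" are expected, and only repeated observation separates them from real effects.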
Related Information
Testing Groups
Utilizing our Audience Testing Matrix, you can create a subsection of your audience that receives a wide variety of content to ensure poor-performing tests do not significantly affect the organization. Just make sure to rotate through the groups so tests do not consistently bombard one section of people.
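The rotation itself can be a simple round-robin assignment. A minimal sketch, with placeholder group and test names:

```python
from itertools import cycle

# Placeholder names; in practice these would come from your
# Audience Testing Matrix segmentation.
groups = ["group_a", "group_b", "group_c", "group_d"]
rotation = cycle(groups)

# Assign each new test to the next group in the rotation, so no single
# section of the audience is bombarded with every experiment.
tests = [f"test_{i}" for i in range(1, 9)]
assignments = {test: next(rotation) for test in tests}

for test, group in assignments.items():
    print(test, "->", group)
```

With eight tests and four groups, each group receives exactly two tests; a test that performs poorly only touches a quarter of the rotating subsection, limiting its impact on the organization.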