Colgate-Palmolive spent $50,000 and six weeks every time they wanted to know if customers would buy a new toothpaste. Then a research team showed them they could get the same answer in three minutes for under a dollar.
The method is called Semantic Similarity Rating (SSR). Researchers at PyMC Labs and Colgate tested it against 57 real consumer surveys — 9,300 human responses — and hit 90% of human reliability. No fine-tuning. No training data. Just eight prompt iterations and a clever trick for getting LLMs to think like real buyers.
This isn’t theoretical. The research is peer-reviewed, the code is open source, and the implications for anyone doing market research, buyer persona work, or prospect qualification are massive.
Let’s start with the numbers every sales and marketing leader already knows but rarely says out loud.
A single consumer research study costs $15,000–$100,000 depending on scope. Enterprise B2B research panels run even higher — $50,000+ for a decent sample. Timelines stretch 4–8 weeks from design to delivery. By the time results arrive, the market has moved.
The quality isn’t great either. Survey panels have a well-documented positivity bias — respondents rate everything higher than they actually feel, especially when they know they’re being observed. Response rates keep declining. And the sample you get often doesn’t represent the buyers you actually care about.
So most B2B companies skip formal research entirely. They substitute intuition, anecdotal feedback from a few customers, and competitive guesswork. That’s the real cost — not the $50K you’d spend on a study, but the invisible cost of building products and campaigns based on assumptions.
The core insight behind SSR is simple: LLMs are terrible at picking numbers, but excellent at expressing opinions.
If you ask GPT-4o “On a scale of 1 to 5, how likely would you be to buy this product?” you get unrealistic distributions. The model gravitates toward middle values, doesn’t differentiate well between products, and produces patterns that look nothing like real human surveys.
But if you ask it to just talk about the product — “What do you think about this? Would you consider buying it?” — the response is remarkably human. The opinions are nuanced, they reflect real considerations (price sensitivity, brand trust, personal relevance), and they vary realistically across different demographics.
SSR exploits this gap with a two-step process:
Step 1: Get a natural text response. Give the LLM a persona (age, income, location) and a product description. Ask it to respond naturally about purchase intent. You get something like: “This looks interesting but I’d want to try it first. The price seems reasonable for what it is — probably worth a shot next time I’m at the store.”
Step 2: Map the text to a rating scale. Take that text response, convert it to an embedding vector, and measure its cosine similarity against reference statements anchored to each point on a 1-5 scale. “I would definitely not buy this” maps to 1. “This is exactly what I’ve been looking for” maps to 5. The response above would map to a probability distribution weighted toward 3 and 4.
The beauty is that instead of forcing a single number, SSR produces a probability distribution. A response that’s genuinely ambivalent might be 40% likely a 3, 35% likely a 4, 25% likely a 2. That’s how humans actually think about purchase decisions — with uncertainty — and SSR captures it.
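The scoring step can be sketched in a few lines. This is a toy illustration, not the PyMC Labs implementation: real SSR uses an LLM embedding model, while here a bag-of-words count vector stands in so the example runs without an API key. The anchor phrasings and the softmax temperature are assumptions for illustration.

```python
import math
from collections import Counter

# Reference statements anchoring each point of the 1-5 purchase-intent scale.
# These exact phrasings are illustrative assumptions, not the paper's anchors.
ANCHORS = {
    1: "I would definitely not buy this",
    2: "I probably would not buy this",
    3: "I might or might not buy this",
    4: "I would probably buy this",
    5: "This is exactly what I have been looking for, I would definitely buy it",
}

def embed(text):
    """Toy embedding: a word-count vector. Real SSR uses an LLM embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ssr_distribution(response, temperature=0.1):
    """Map a free-text response to a probability distribution over the 1-5 scale."""
    sims = {k: cosine(embed(response), embed(a)) for k, a in ANCHORS.items()}
    # Softmax over similarities: closer anchors get more probability mass.
    exps = {k: math.exp(s / temperature) for k, s in sims.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}

dist = ssr_distribution(
    "This looks interesting but I would want to try it first. "
    "I would probably buy this next time I am at the store."
)
```

Because the output is a distribution rather than a forced rating, a genuinely ambivalent response spreads its probability mass across adjacent scale points instead of collapsing to one number.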

The PyMC Labs team didn’t test this on toy data. They used 57 real product surveys from Colgate-Palmolive — personal care products with 150 to 400 real respondents each, totaling 9,300 human responses.
The results:
R = 0.72 correlation with human purchase intent rankings. That might not sound impressive until you realize the human test-retest ceiling (how consistently humans agree with themselves on repeated surveys) gives a maximum of about R = 0.80. SSR hit 90% of that ceiling.
Distribution similarity of 0.88. The shape of SSR’s response distributions closely matched human response patterns. Compare that to naive forced-choice prompting, which scored just 0.26 — basically random.
Realistic demographic effects. Lower-income synthetic respondents showed reduced intent for premium products. Age produced an inverted U-shape (younger buyers more open to novelty). Price-sensitive segments reacted to price changes. The model picked up cultural attitudes without being explicitly told about them.
As Thomas Wiecki, CEO of PyMC Labs, put it: “Something in the pre-training must make it think like humans do.”
The team iterated through roughly eight prompt variations to reach this level. No fine-tuning, no custom training. Both GPT-4o and Gemini 2.0 Flash produced comparable results.

“But this is consumer research,” you might say. “We sell B2B. This doesn’t apply.”
It applies more than you think. Here’s how:
Every good sales rep researches their prospects before outreach. They try to understand the buyer’s priorities, pain points, and likely objections. Today, this research is manual — reading annual reports, scanning LinkedIn posts, talking to contacts.
SSR-style synthetic research lets you generate structured buyer perspectives at scale. Want to understand how a VP of Sales at a 200-person fintech would react to your pitch? Generate that persona and ask. Want to compare reactions across different company sizes and industries? Run 50 synthetic personas in minutes.
This isn’t about replacing genuine prospect conversations. It’s about pre-qualifying your messaging before you spend time on real outreach. The same way Colgate screens product concepts before committing to a full consumer panel.
Most B2B companies build their ICP (Ideal Customer Profile) based on closed-won deal data and gut feel. A few run actual surveys. Almost none validate their assumptions against a representative sample of potential buyers.
SSR changes the economics. You can test your value proposition against hundreds of synthetic buyers across different segments — company sizes, industries, job titles, pain points — for the cost of a few API calls. Not to replace real customer interviews, but to identify which segments to focus those interviews on.
This is exactly what we see with our users at Onsa. The best ICP work combines AI-powered research with human judgment. The AI handles breadth — scanning hundreds of potential fits. The human handles depth — building relationships with the best matches.
Want to know how your messaging stacks up against a competitor? Generate buyer personas, show them both pitches, and analyze the responses. SSR gives you distribution data, not just thumbs up/thumbs down — you can see where buyers feel uncertain, which claims land, and which fall flat.
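A minimal sketch of that comparison, assuming you already have SSR-style 1-5 intent distributions for each pitch (the numbers below are invented placeholders, not real survey data). Mean intent tells you which pitch lands better; entropy is one simple way to quantify where buyers feel uncertain:

```python
import math

# Hypothetical SSR output: purchase-intent probabilities over the 1-5 scale,
# averaged across a panel of synthetic buyer personas, for two pitches.
# All figures are invented for illustration.
pitch_results = {
    "our_pitch":        [0.05, 0.10, 0.25, 0.40, 0.20],
    "competitor_pitch": [0.10, 0.20, 0.40, 0.20, 0.10],
}

def mean_intent(dist):
    """Expected score on the 1-5 scale (index 0 holds P(score=1))."""
    return sum((i + 1) * p for i, p in enumerate(dist))

def entropy_bits(dist):
    """Shannon entropy in bits: higher means the synthetic buyers are more uncertain."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

summary = {
    name: (round(mean_intent(d), 2), round(entropy_bits(d), 2))
    for name, d in pitch_results.items()
}
```

The same two metrics work per claim or per segment, so you can see not just which pitch wins overall but where each one leaves buyers on the fence.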
This mirrors what we described in the C.H. Robinson case study. When AI handles structured, repeatable research tasks, humans can focus on the creative and strategic work that actually moves deals forward.
The SSR paper is honest about what doesn’t work, and this honesty is part of what makes the research credible:
Demographics are critical. Without persona specifications (age, income, location), correlation dropped from R = 0.72 to R = 0.39. You can’t just ask the LLM “would someone buy this” — you have to specify who that someone is. This matters for B2B applications too: a synthetic “VP of Sales at a 500-person company” will give you better signal than a generic “business buyer.”
Niche products are harder. If there’s limited online discussion about a product category, the LLM has less pre-training data to draw from. This is a genuine limitation for novel B2B products in specialized verticals. For established categories (sales tools, CRM, marketing automation), the data is abundant.
Gender, religion, and ethnicity are poorly modeled. The LLMs struggled to reproduce response differences based on these demographics. Age and income worked well. This limitation is less relevant for B2B (where purchase decisions are driven by job function and company needs) but important to acknowledge.
It’s not a replacement for real research. SSR is a screening tool. Use it to narrow your focus, pre-test hypotheses, and prioritize where to invest in real customer conversations. Don’t use it as your only source of market intelligence.
If you want to apply this to your sales and marketing workflow:
1. Define your personas with specificity. Don’t just say “enterprise buyer.” Specify: CTO at a 200-person SaaS company, $20M ARR, Series B, based in San Francisco, currently using Salesforce and HubSpot. The more specific the persona, the more useful the output.
2. Ask open-ended questions first. “What would you think about a tool that automates your sales research?” is better than “Rate this tool 1-5.” Let the LLM generate a natural response, then analyze the sentiment and themes.
3. Run multiple personas per segment. Don’t trust a single synthetic response. Generate 10-20 personas per segment and look at the distribution. Where do they agree? Where do they diverge? The variance is as informative as the mean.
4. Compare segments systematically. Test the same messaging across different buyer types. If your pitch resonates with VPs of Sales but falls flat with CFOs, you know where to adjust — before you waste real outreach cycles learning the same lesson.
5. Validate with real conversations. Use synthetic research to generate hypotheses. Then test those hypotheses in actual sales calls. The combination is faster and cheaper than either approach alone.
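Steps 3 and 4 come down to aggregating distributions across personas. A minimal sketch, assuming each persona run has already produced a 1-5 intent distribution (the distributions below are made-up placeholders standing in for real SSR output):

```python
import statistics

# Hypothetical SSR output for one segment (e.g. "VP of Sales, 200-person
# fintech"): one 1-5 purchase-intent distribution per synthetic persona.
# In practice these come from LLM responses scored via embedding similarity;
# the numbers here are invented for illustration.
persona_distributions = [
    {1: 0.05, 2: 0.10, 3: 0.40, 4: 0.35, 5: 0.10},
    {1: 0.02, 2: 0.08, 3: 0.30, 4: 0.40, 5: 0.20},
    {1: 0.10, 2: 0.25, 3: 0.40, 4: 0.20, 5: 0.05},
]

def expected_score(dist):
    """Mean position on the 1-5 scale under one persona's distribution."""
    return sum(score * p for score, p in dist.items())

scores = [expected_score(d) for d in persona_distributions]
segment_mean = statistics.mean(scores)    # where the segment lands on average
segment_stdev = statistics.stdev(scores)  # disagreement within the segment
```

Run the same aggregation per segment and compare both numbers: the mean tells you which segments lean toward buying, and the spread tells you where your synthetic buyers disagree enough that a real customer conversation is worth the time.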
The SSR open-source library is on GitHub. If you want to build your own synthetic market research pipeline, start there.
For a more turnkey approach, the synthetic market research skill wraps SSR into an agent workflow: describe your product, generate personas, get structured purchase intent analysis with segmentation breakdowns.
What Colgate’s $50K-to-eight-prompts story really demonstrates isn’t that market research is dead. It’s that the cost of understanding your buyers just dropped by orders of magnitude.
For consumer companies, this means testing more concepts, faster, with less risk. For B2B sales teams, it means doing prospect research, ICP validation, and competitive analysis with the same rigor that used to require expensive consulting firms.
The pattern is the same one we keep seeing across sales automation: AI doesn’t eliminate the work, it eliminates the bottleneck. The research still needs to happen. The understanding still matters. The human judgment in interpreting results and acting on them is irreplaceable. But the manual, expensive, slow part — generating and collecting the data — is now something an AI agent can do in minutes.
The teams that figure this out first will have an information advantage that compounds with every deal.
I’m Bayram, founder of Onsa. We build AI agents for B2B sales — automating the research, qualification, and outreach that used to take hours per lead. If your team is spending more time researching than selling, let’s talk.