Cold Email A/B Testing: What to Test and How to Measure Results

Affiliate disclosure: some links in this article are partner links. If you start a paid plan through them, imisofts may earn a commission at no extra cost to you. We only recommend tools we actually use to run client campaigns.

Most people run cold email campaigns, get mediocre results, and move on.

At imisofts, we run 100+ A/B tests monthly. Each test teaches us something. Compound those lessons, and you get 65-75% open rates and 3-5% reply rates.

This is our A/B testing framework.

The Cold Email A/B Testing Hierarchy

Not all tests have equal impact. We prioritize:

Subject line (highest impact)
Opening line (second highest)
Value statement (medium impact)
CTA (medium-low impact)
Send time (lowest impact)

Never test low-impact variables first. You'll waste samples before finding the real wins.

Testing Subject Lines (Highest Impact)

Subject lines determine open rate. Open rate determines everything else.

Test structure:

Variant A (baseline): [Current subject line that gets 45% opens]

Variant B (test): [New subject line with different formula]

Sample size: 50-100 prospects each

Duration: 5-7 days
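Most sending platforms handle the split for you, but the idea is simple. Shuffle the prospect list, then deal out equal groups so that list order (lead source, import date, industry) can't bias one variant. Below is a minimal Python sketch. `split_ab` and the example addresses are hypothetical.

```python
import random

def split_ab(prospects, per_variant, seed=None):
    """Randomly assign prospects to two equal, non-overlapping variant groups."""
    if len(prospects) < 2 * per_variant:
        raise ValueError("need at least 2 * per_variant prospects")
    rng = random.Random(seed)  # seeded so a split can be reproduced later
    pool = list(prospects)     # copy: the caller's list keeps its original order
    rng.shuffle(pool)
    return pool[:per_variant], pool[per_variant:2 * per_variant]

# Hypothetical list of 200 prospects, 100 per variant
prospects = [f"prospect{i}@example.com" for i in range(200)]
group_a, group_b = split_ab(prospects, per_variant=100, seed=42)
print(len(group_a), len(group_b))  # 100 100
```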

Example test:

Variant A: "hi john, noticed you launched [product]" (45% open rate)

Variant B: "quick thought on [industry]" (52% open rate)

Winner: Variant B (+7 percentage points)

This 7-point improvement compounds. Across 10,000 prospects, that's 700 additional opens. If 1-5% of those opens turn into replies, that's 7-35 additional replies.

That's 7-35 additional conversations from changing one subject line.
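The same arithmetic fits in a small helper. Like the text, it assumes the 1-5% reply rate applies to the extra opens. The name `extra_outcomes` is hypothetical.

```python
def extra_outcomes(prospects, open_lift_pp, reply_rate_of_opens):
    """Extra opens and replies produced by an open-rate lift.

    open_lift_pp: lift in percentage points (45% -> 52% is 7).
    reply_rate_of_opens: share of the extra opens that reply (0.01 = 1%).
    """
    extra_opens = prospects * open_lift_pp / 100
    extra_replies = extra_opens * reply_rate_of_opens
    return extra_opens, extra_replies

opens, low = extra_outcomes(10_000, 7, 0.01)
_, high = extra_outcomes(10_000, 7, 0.05)
print(round(opens), round(low), round(high))  # 700 7 35
```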

Subject line tests we run:

Personalization (first name + achievement vs. generic)
Question format vs. statement format
Lowercase vs. Title Case
Short vs. specific
Different achievement angles

Testing Opening Lines (Second Highest Impact)

Opening line determines whether they read past the first sentence.

Test structure:

Email 1, Variant A: [Opening line A] + [rest of email unchanged]

Email 1, Variant B: [Opening line B] + [rest of email unchanged]

Sample size: 100 prospects each

Duration: 5-7 days

Metric: Ideally, the share of opens that read past the first sentence (hard to track without advanced tools)

Alternative: Track reply rate as a proxy for engagement with the content

Example test:

Variant A: "I noticed you launched [product] last month." (1.2% reply rate)

Variant B: "Most SaaS founders spend 15 hours/week on prospecting. You probably do too." (1.8% reply rate)

Winner: Variant B (+0.6 percentage points reply rate)

0.6 points seems small, but across 10,000 prospects it's 60 additional replies.
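Small reply-rate gaps are also the hardest to confirm. The sketch below uses the standard normal-approximation sample-size formula for comparing two proportions. The settings are a two-sided test, 5% significance and 80% power; those are our assumptions, not figures from this article. For 1.2% vs. 1.8%, the formula gives roughly 6,400 prospects per variant, far more than 100. A reply-rate result like this one needs a much larger list before it can be confirmed statistically.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Prospects needed per variant to detect p1 vs. p2 (two-sided test).

    Normal-approximation formula for two independent proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    root = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
            + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return math.ceil(root ** 2 / (p1 - p2) ** 2)

print(sample_size_per_variant(0.012, 0.018))  # reply-rate test: roughly 6,400
print(sample_size_per_variant(0.45, 0.52))    # open-rate test: roughly 800
```

The second call uses the 45% vs. 52% subject-line example. Higher baseline rates and bigger gaps need fewer prospects, which is why open-rate tests get away with smaller lists than reply-rate tests. Under these settings, both still need more than 100 per variant.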

Opening line tests we run:

Specific personalization vs. general observation
Problem-first vs. achievement-first
Curiosity gap vs. direct statement
Industry pattern vs. company-specific observation

Testing Value Statements (Medium Impact)

Value statement is your chance to prove relevance before the pitch.

Test structure:

Email 1, Variant A: [Value statement A]

Email 1, Variant B: [Value statement B]

Metric: Email 1 reply rate or Email 2 open rate (prospects who engage with Email 1 are more likely to engage with Email 2)

Example test:

Variant A: "We help SaaS teams automate their prospecting and save 12 hours/week." (2% Email 1 reply)

Variant B: "One of your competitors just booked 8 qualified deals this month using [tactic]." (2.8% Email 1 reply)

Winner: Variant B (social proof outperforms direct benefit)

This changes your entire Email 1 strategy. Across campaigns, social proof hooks outperform benefit hooks by 20-40%.
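Lift can be stated two ways. Going from 2.0% to 2.8% is +0.8 percentage points in absolute terms, but +40% relative to the baseline. Figures such as the 20-40% above are most naturally read as relative lift. A small helper that reports both (the name `lift` is ours):

```python
def lift(baseline_pct, variant_pct):
    """Return (absolute lift in percentage points, relative lift in percent)."""
    absolute_pp = variant_pct - baseline_pct
    relative_pct = absolute_pp / baseline_pct * 100
    return absolute_pp, relative_pct

abs_pp, rel_pct = lift(2.0, 2.8)
print(f"{abs_pp:+.1f} pp, {rel_pct:+.0f}% relative")  # +0.8 pp, +40% relative
```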

Value statement tests we run:

Direct benefit vs. social proof
Specific metric vs. general statement
Industry pattern vs. company-specific observation
Problem-agitation vs. opportunity-excitement

Testing CTAs (Medium-Low Impact)

CTA wording has lower impact than subject/opening, but still matters.

Test structure:

Email 2, Variant A: "[CTA A] or reply with your timeline."

Email 2, Variant B: "[CTA B] or reply with your timeline."

Metric: Email 2 reply rate

Example test (SaaS):

Variant A: "Book a 15-min strategy call" (3.2% reply)

Variant B: "Are you open to a quick conversation?" (2.8% reply)

Winner: Variant A (direct booking link outperforms vague ask)

But this varies by industry. Medicare-focused campaigns might see the opposite result (phone-first CTAs outperforming calendar links).

CTA tests we run:

Direct booking link vs. "reply to schedule"
Soft ask vs. hard ask
Specific time ("15 min") vs. vague ("quick call")
Phone number vs. calendar link (varies by industry)

Testing Send Times (Lowest Impact)

When you send affects open rate, but much less than what you send.

Test structure:

Group A: Send on Tuesday, 10 AM

Group B: Send on Thursday, 10 AM

Metric: Open rate

Our data across 50M+ emails:

Monday: 38% open rate
Tuesday: 45% open rate
Wednesday: 47% open rate
Thursday: 48% open rate
Friday: 40% open rate

Best day: Thursday, 10 AM

Worst day: Monday, 10 AM

Difference: ~10 percentage points

That matters, but nowhere near as much as subject line testing (which can change open rate by 30+ points).
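Summarizing the day-of-week figures takes only a couple of lines. The numbers below are copied from the list above. The code just picks the best and worst day and the spread between them, so you can swap in your own campaign's numbers.

```python
# Open rates (%) by send day, copied from the figures above
open_rates = {"Monday": 38, "Tuesday": 45, "Wednesday": 47, "Thursday": 48, "Friday": 40}

best = max(open_rates, key=open_rates.get)
worst = min(open_rates, key=open_rates.get)
spread_pp = open_rates[best] - open_rates[worst]
print(best, worst, spread_pp)  # Thursday Monday 10
```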

Send time tests we run:

Weekday vs. weekend
Morning vs. afternoon vs. evening
Time zone-specific sends
Industry-specific patterns (e.g., healthcare gets higher open rates on Friday due to weekly planning)
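For time zone-specific sends, the scheduling step is converting "Thursday, 10 AM in the prospect's time zone" into one UTC timestamp your sender can queue. Below is a minimal sketch using Python's standard `zoneinfo` module. It needs Python 3.9+ and either the system time zone database or the `tzdata` package. `next_local_send` is a hypothetical helper, and the Thursday 10 AM default is taken from the figures above.

```python
from datetime import datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo

def next_local_send(now_utc, tz_name, weekday=3, send_time=time(10, 0)):
    """Next weekday/send_time in the prospect's time zone, returned in UTC.

    weekday follows Python's convention: Monday = 0 ... Thursday = 3.
    """
    tz = ZoneInfo(tz_name)
    local_now = now_utc.astimezone(tz)
    days_ahead = (weekday - local_now.weekday()) % 7
    target_date = local_now.date() + timedelta(days=days_ahead)
    candidate = datetime.combine(target_date, send_time, tzinfo=tz)
    if candidate <= local_now:          # this week's slot already passed
        candidate += timedelta(days=7)  # wall-clock arithmetic keeps 10 AM local
    return candidate.astimezone(timezone.utc)

now = datetime(2024, 3, 4, 12, 0, tzinfo=timezone.utc)  # a Monday
for zone in ["America/New_York", "Europe/London", "Asia/Tokyo"]:
    print(zone, next_local_send(now, zone).isoformat())
```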

How to Measure Statistical Significance

You don't need a PhD in statistics. Here's the simple rule of thumb:

Use a sample of 50+ per variant. If you see a difference of 5 percentage points or more, it's probably real.

More rigorous approach:

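Use a two-proportion significance calculator (a two-proportion z-test or chi-square test). A one-sample binomial test compares a single rate against a fixed value, so it isn't the right tool for comparing two variants.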

Example:

Variant A: 45 opens out of 100 (45%)
Variant B: 52 opens out of 100 (52%)
Difference: 7 percentage points

Question: Is this real or random?

Plug the numbers into the calculator. If the p-value is below 0.05, the difference is statistically significant at the 95% confidence level, and you can treat the winner as real.
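You can reproduce the calculator step with a two-proportion z-test, the standard way to compare two rates. A chi-square test on the 2x2 table without continuity correction gives the same p-value. For the example above, 45/100 vs. 52/100 gives z of about 0.99 and a p-value of about 0.32. That is well above 0.05, so at 100 prospects per variant the 7-point gap is not statistically significant. A standard-library sketch (the function name is ours):

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

z, p = two_proportion_z_test(45, 100, 52, 100)
print(f"z = {z:.2f}, p = {p:.2f}")
```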

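For cold email, we also use this quicker rule of thumb (a first screen, not a substitute for the calculator):

Sample < 50 per variant: Don't trust the result. Run more samples.

Sample 50-100: If the difference is > 5 points, it's probably real.

Sample 100-200: If the difference is > 3 points, it's probably real.

Sample 200+: If the difference is > 2 points, it's probably real.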
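To apply the rule of thumb consistently across a test log, you can encode it in one function. This rests on two assumptions. First, the thresholds are read as percentage points. Second, boundary cases (exactly 100 or 200 per variant) go to the larger-sample band. Treat a "probably real" verdict as a prompt to run the formal test, not as a replacement for it.

```python
def rule_of_thumb(n_per_variant, diff_pp):
    """Apply the cold-email rule of thumb above (a heuristic, not a significance test)."""
    if n_per_variant < 50:
        return "don't trust it: run more samples"
    if n_per_variant < 100:
        threshold_pp = 5
    elif n_per_variant < 200:
        threshold_pp = 3
    else:
        threshold_pp = 2
    return "probably real" if abs(diff_pp) > threshold_pp else "inconclusive"

print(rule_of_thumb(100, 7))  # probably real
print(rule_of_thumb(60, 4))   # inconclusive
```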

The Weekly Testing Cycle

Monday: Review last week's tests. Declare winners.

Tuesday-Wednesday: Roll out winning variant to 50% of new prospects.

Wednesday-Thursday: Run new tests on remaining 50%.

Friday: Measure results.

Monday: Repeat.

This weekly cycle compounds. Each week you find one new winning variant. Month 1, you're at baseline. Month 3, you're 30-40% above baseline.
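Compounding is easiest to see as repeated multiplication. Suppose each week's winner improves the metric by the same relative amount. Real weeks are lumpier, and some produce no winner at all. Under that assumption, a 30-40% gain over roughly 12 weeks works out to about 2.2-2.8% per week. A sketch with hypothetical helper names:

```python
def compounded_lift(weekly_lift_pct, weeks):
    """Total relative gain (%) after `weeks` weekly wins of weekly_lift_pct each."""
    return ((1 + weekly_lift_pct / 100) ** weeks - 1) * 100

def weekly_lift_needed(target_pct, weeks):
    """Constant weekly relative gain (%) needed to reach target_pct over `weeks`."""
    return ((1 + target_pct / 100) ** (1 / weeks) - 1) * 100

print(f"{compounded_lift(2.5, 12):.1f}% total from 2.5% a week for 12 weeks")
print(f"{weekly_lift_needed(30, 12):.2f}% per week for +30%")
print(f"{weekly_lift_needed(40, 12):.2f}% per week for +40%")
```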

What Not to Test

Don't test too many things at once

Wrong: Test subject line, opening line, CTA, send time simultaneously.

You won't know which variable won. Testing several variables at once is called multivariate testing, and it requires huge sample sizes.

Right: Test one variable per week.

Subject line week 1. Opening line week 2. CTA week 3. You learn faster and with smaller samples.

Don't test on tiny samples

Wrong: Test on 10 people per variant.

Too much variance. Random chance plays a huge role.

Right: Test on 50+ people per variant minimum.

This gives you signal above the noise.
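Here is why 10 per variant is too few: even identical variants show large gaps from chance alone. The sketch below computes, exactly from the binomial distribution, how often two variants with the same true open rate differ by at least a given number of percentage points. The 45% rate and 10-point gap are illustrative assumptions. At 10 per variant, such a gap shows up roughly 80% of the time. Even at 100 per variant, it is far from rare.

```python
import math

def prob_gap_by_chance(n, true_rate, gap_pp):
    """P(|opens_A - opens_B| >= gap) when both variants have the same true rate.

    Each variant's opens follow Binomial(n, true_rate); the two are independent.
    """
    pmf = [math.comb(n, k) * true_rate**k * (1 - true_rate) ** (n - k)
           for k in range(n + 1)]
    min_gap = math.ceil(gap_pp * n / 100)  # gap in percentage points -> opens
    return sum(pmf[a] * pmf[b]
               for a in range(n + 1) for b in range(n + 1)
               if abs(a - b) >= min_gap)

for n in (10, 50, 100):
    print(n, round(prob_gap_by_chance(n, 0.45, 10), 2))
```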

Don't declare winners too early

Wrong: Run test for 24 hours. Declare winner.

Time of day matters. Day of week matters. One day isn't enough.

Right: Run test for 5-7 days minimum.

This accounts for daily/weekly patterns.

Testing Template: What We Track
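Below is a minimal sketch of a per-test log record. The fields are our assumption about what such a template holds, based on the test structures in this article: the variable tested, the two variants, sends and results per variant, and the metric. `TestRecord` is a hypothetical name.

```python
from dataclasses import dataclass

@dataclass
class TestRecord:
    variable: str        # e.g. "subject line", "opening line", "CTA"
    variant_a: str
    variant_b: str
    sent_a: int
    sent_b: int
    hits_a: int          # opens or replies, depending on `metric`
    hits_b: int
    metric: str = "open rate"

    def rates(self):
        """Return (rate A, rate B) in percent."""
        return 100 * self.hits_a / self.sent_a, 100 * self.hits_b / self.sent_b

    def leader(self):
        """Variant with the higher observed rate (not a significance verdict)."""
        rate_a, rate_b = self.rates()
        if rate_a == rate_b:
            return "tie"
        return "A" if rate_a > rate_b else "B"

record = TestRecord("subject line", "hi john, noticed you launched [product]",
                    "quick thought on [industry]", 100, 100, 45, 52)
print(record.rates(), record.leader())  # (45.0, 52.0) B
```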


Tools for A/B Testing

At imisofts, we use:

Instantly (built-in A/B testing)
SmartLead (rotation + analytics)
Clay + Apollo (data merge + manual testing)
Custom scripts (for complex multivariate tests)

Most platforms now offer native A/B testing. Use it.

Results: What Testing Gets You

Baseline campaign (no testing):

Subject line: 35% open rate
Reply rate: 1.5%

After 3 months of weekly testing:

Subject line: 55% open rate (+20 points)
Opening line: better engagement
CTA: better conversion
Reply rate: 3.5% (+2 points)

That 2-point improvement in reply rate is massive. It more than doubles your replies (3.5% is about 2.3 times 1.5%).

We run A/B testing for all managed clients:

Weekly testing cycles
Subject line, opening, CTA, send time
Multivariate testing for scaled campaigns
Statistical significance validation
Monthly optimization reports

Packages range from $497/month (Management with testing) to $2,450/year (Enterprise with full testing suite).

Explore imisofts Cold Email Packages

Frequently Asked Questions

How many people do I need to A/B test?

Minimum 50 per variant. With 50-person samples, a 5-point difference is meaningful. Larger samples (100+) let you detect smaller differences (2-3 points).

How long should I run an A/B test?

Minimum 5-7 days. This accounts for daily/weekly variations in open rates and behaviors. 24-hour tests are unreliable.

What should I test first in cold email?

Subject line (highest impact), then opening line, then value statement, then CTA, then send time. Don't test low-impact variables first.

Can I test multiple variables at once?

Not recommended. Test one variable per week. Testing multiple variables makes it impossible to know which one caused the change.

What's a "statistically significant" A/B test result?

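Use a two-proportion significance calculator (a two-proportion z-test or chi-square test). If the p-value is below 0.05, your result is statistically significant (95% confidence). As a cold email rule of thumb, a difference of 5+ percentage points on 50 samples per variant is likely real.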

Related Articles

Cold Email for Podcast Guest Booking: How to Fill Your Interview Calendar

Cold Email Sequences: The 3-5 Touch Framework That Gets Replies

Cold Email Subject Lines: 25 Formulas That Get Opens