• http://www.facebook.com/jamesjhuman James Hu

    Good post!

  • http://profile.yahoo.com/PEJPHQSQUGB2UV4XB4L2SJUKSE MikeV

    Not to be mean, but other than the typo’s in this article, what was the answer you were looking for in the interview? Obviously, no one is going to work through that in-depth an answer and those calculations in their head in an interview

  • logan9497

    Conrad, this is a good post, but I have two questions:

    1 – Same as MikeV: what is the correct answer?
    2 – How would this apply to A/B testing for organic search? Could the A/B Tester be applied there?

  • Pat Grady

    answer: delete B. dissect B for concepts (like ‘free shipping’), add to my ‘likely bad’ concepts list, mark down as x1 with date.  dissect A, add its concepts to ‘likely good’, count as x1, with date.  proven good, proven bad lists stay empty until x count rises.  write more ads with variants from ‘likely good’ theme list, recycle a ‘likely bad’ now and then, and add variants from ‘new ideas’ list.  rinse, lather, repeat – but slow the iteration cycle as the optimization curve flattens (and work on finding and overcoming bottlenecks elsewhere).  consider recycling proven concepts into tests for other campaigns and ad groups.  make test data and concepts lists available to SEO, PR and Branding team via Dropbox, use descriptive folder name so they can mine as needed.  don’t ask VP or engineering for their next ‘ad B’ idea to test, dive into Analytics and look busy whenever they walk by.  in coffee room, do share brief data summary with smart intern who writes the actual blog posts for VP, mention that interviewer suggested we share our latest “win”.  at next monthly meeting that somehow could not be avoided (turns out my leg is not broken, just sprained), sing praises of interviewer’s critical insight when asked about KPIs used to reach new performance plateau.  deposit nice slice of bonus into self-directed IRA.  take cert exam on weekend, network via outreach that is not a request for help, look for opening to escape working for the man.

  • http://www.alexanderlund.com Alexander Lund

    #Fail Grammar everywhere

  • http://www.facebook.com/eyal.josch Eyal Purvin-Josch

    Important post.
    Do you know of a similar tool for multivariate testing? (Say, for assessing the response rates of a combination of 2 experiments run on the same landing page)

  • http://www.facebook.com/people/Steve-Jones/564432613 Steve Jones

    2 way anova would do it
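A two-way ANOVA would separate each factor’s main effect from their interaction. As a simpler first pass with binary click data, a chi-square test of independence across the four variants can at least flag whether any combination differs. This is an illustrative sketch only: the two factors (headline, button color) and all counts are made up, and it does not decompose the interaction the way a full ANOVA would.

```python
import math

# Hypothetical counts for a 2x2 multivariate test: headline (H1/H2)
# crossed with button color (red/green). (clicks, impressions) per cell.
cells = {
    ("H1", "red"):   (120, 5000),
    ("H1", "green"): (145, 5000),
    ("H2", "red"):   (95,  5000),
    ("H2", "green"): (160, 5000),
}

# Chi-square test of independence on the 4x2 (variant x click/no-click) table.
total_clicks = sum(c for c, n in cells.values())
total_imps = sum(n for c, n in cells.values())
overall_rate = total_clicks / total_imps

chi2 = 0.0
for clicks, imps in cells.values():
    exp_clicks = imps * overall_rate            # expected clicks under the null
    exp_misses = imps * (1 - overall_rate)      # expected non-clicks
    chi2 += (clicks - exp_clicks) ** 2 / exp_clicks
    chi2 += ((imps - clicks) - exp_misses) ** 2 / exp_misses

# df = 4 variants - 1 = 3; the 95% critical value is about 7.815.
print(f"chi-square = {chi2:.2f} (df=3, reject equal CTRs if > 7.815)")
```

If you do want main effects and the interaction term separated, a stats package (e.g. statsmodels in Python, or R) is the practical route rather than hand-rolling the ANOVA.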

  • http://profile.yahoo.com/FYXR4GTHUVVYSDX5AVLBVTAUVI EricM

    Ok, are you going to let them go online and use this tool at the interview? Horrible.  Choosing B based on CTR still turns out to be the right answer.  #WIN

  • http://www.facebook.com/profile.php?id=749852742 Conrad Saam

    Mike and Logan – Almost everything you do with search can be A/B tested . . . as long as you have a control against which you can contrast any changes.  Any of the following scenarios would apply: you’ve done a link building campaign, changed title tags for a portion of your site, altered your footer links (ahh – see if that one works), or maybe added something that increases page weight. Simply look at your overall search volume for a term (Google Keyword research tool or whatever your favorite flavor may be) and then review a specific set of relevant entry terms or pages for changes in inbound traffic.  

    To be more concrete:  Let’s say I want to change title tags on part of Urbanspoon to include a city name – and for my “B” test I change those title tags for restaurants in Los Angeles.  I look at expected searches for those pages (your denominator – in our case, phrase match on the restaurant name and the restaurant name + Los Angeles, for example) and then compare inbound search traffic (your numerator).  Your comparison can be chronological (i.e. a before-and-after) or against a control set of pages on your site that you didn’t change.  I’d recommend both.  

    As for the answer to the interview question – as I mentioned in the article, I’m looking for the theoretical understanding of the interplay between sample size, variability and confidence levels.  If they can tactically tell me how to get there, that’s even better.  

    Hope this helps – oh and MikeV – “typo’s” is possessive, “typos” is plural.  A little bit of irony for my Thursday morning.  :)  

    -Conrad    
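For readers who want the tactical piece Conrad mentions, one standard way to get there is a pooled two-proportion z-test on the numbers from the interview question. This is a common textbook approach, not necessarily the exact calculation the article’s A/B Tester tool performs:

```python
import math

def two_proportion_z(clicks_a, imps_a, clicks_b, imps_b):
    """One-tailed two-proportion z-test: is CTR(B) > CTR(A)?"""
    p_a = clicks_a / imps_a
    p_b = clicks_b / imps_b
    # Pooled CTR under the null hypothesis that both ads perform the same.
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    z = (p_b - p_a) / se
    # One-tailed p-value from the standard normal survival function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Numbers from the interview question: Ad A 272/17,235, Ad B 41/2,253.
z, p = two_proportion_z(272, 17235, 41, 2253)
print(f"z = {z:.2f}, one-tailed p = {p:.3f}")
```

On these numbers z comes out below 1.96, i.e. Ad B’s higher CTR is not significant at the 95% confidence level despite looking better on raw CTR – exactly the sample-size/variability/confidence interplay the question is probing.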

  • ABTesterNewbie

    There may be a fundamental problem with the A/B test you describe above. Namely, the two versions of the ad copy ran at different points in time (or there was only a small overlap period). I don’t believe that this makes for a statistically valid test, so your confidence and z-scores may not be meaningful.

    Could you please elaborate on this? How do you statistically account for this timeframe difference?

  • http://www.facebook.com/profile.php?id=749852742 Conrad Saam

    Newbie:  “Take the data from the 3/6 send and plug it in to A/B Tester”.  I wouldn’t be concerned about the five minute time difference between 11:16 and 11:21 am.   

    -Conrad

  • ABTesterNewbie

    Sorry, should have made it clearer. I was referring to your interview question:  “Imagine you are running two different ads on a campaign with 50 keywords. We’ve been running Ad A for a while and have 17,235 impressions and 272 clicks. I started running Ad B last week and that has received 41 clicks on 2,253 impressions.”

    Sounds like the time difference here is quite significant (“for a while” vs. “last week”), not minutes. Am I missing something?

  • http://www.facebook.com/profile.php?id=749852742 Conrad Saam

    Got it.  (I’m glad you weren’t agonizing over a 5 minute interval.)  So this is where some common sense and first hand knowledge of your business comes in.  In general, I can’t imagine a huge difference over time in PPC performance.  But take Urbanspoon for example . . . search volume changes significantly on Friday and Saturdays (and before Valentine’s Day).  So, overall, your point is fair.

    -Conrad

  • Pat Grady

    but since the OP brought up confidence levels and z-scores, I think his name isn’t fair.  :-)

  • Mark Aitkin

    This is why it is very important to run the two campaigns over the same period of time. Your sample sizes can differ – the confidence interval narrows with the standard error, which is partly a function of sample size – but the time periods should be the same. This will minimise the effect of confounding variables on the conversions.
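Mark’s point about the standard error can be made concrete: the half-width of a confidence interval on an observed CTR shrinks only with the square root of the impression count, so quadrupling the sample merely halves the margin of error. A small sketch using the normal approximation to the binomial, with illustrative numbers:

```python
import math

def ci_half_width(ctr, n, z=1.96):
    """95% CI half-width for an observed CTR over n impressions
    (normal approximation to the binomial)."""
    return z * math.sqrt(ctr * (1 - ctr) / n)

# Margin of error vs. sample size for a ~1.6% CTR (roughly the
# article's ballpark); each 4x in impressions halves the interval.
for n in (1000, 4000, 16000):
    w = ci_half_width(0.016, n)
    print(f"n={n:>6}: CTR 1.6% +/- {w * 100:.2f} pct pts")
```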

  • Александр Ироничен

    basically it’s about computing the probabilities using a specific tool . . . too many words for a simple idea

  • http://www.facebook.com/eyal.josch Eyal Purvin-Josch

    I agree with Mark.
    Ignoring the time dimension may result in the wrong decision. 
    See Simpson’s paradox:
    http://lesswrong.com/lw/3q3/simpsons_paradox/
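A tiny made-up example of how Simpson’s paradox can bite a CTR comparison: Ad B beats Ad A in every period, yet looks far worse in the pooled totals, purely because most of B’s impressions landed in the low-CTR period. All numbers below are hypothetical:

```python
# Hypothetical data showing Simpson's paradox in ad CTRs: Ad B beats
# Ad A in *every* period, yet looks worse in the pooled totals because
# it ran mostly during the low-traffic (low-CTR) period.
data = {
    # period: {ad: (clicks, impressions)}
    "weekday": {"A": (10, 1000),  "B": (108, 9000)},   # A 1.0%, B 1.2%
    "weekend": {"A": (450, 9000), "B": (52, 1000)},    # A 5.0%, B 5.2%
}

for period, ads in data.items():
    for ad, (clicks, imps) in ads.items():
        print(f"{period} {ad}: {clicks / imps:.1%}")

# Pool the two periods per ad.
totals = {"A": [0, 0], "B": [0, 0]}
for ads in data.values():
    for ad, (clicks, imps) in ads.items():
        totals[ad][0] += clicks
        totals[ad][1] += imps

for ad, (clicks, imps) in totals.items():
    print(f"overall {ad}: {clicks / imps:.1%}")  # the ranking reverses
```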

  • ChristopherSkyi

    The question is straightforward for marketers who took, passed, and remember their basic statistics course, but those marketers are probably 2 std. dev. above the mean (i.e., not too many! :). 

    For those who don’t have the background, it seems to me, this is not a fair question. 

    I have the background, and I could easily throw back at you: yeah, but demonstrate to me that your control and treatment populations are really normally distributed – if they are not, your estimates are biased (over or under) as a function of the unknown non-normal distributions. But that’s probably going over your head right now, yet it’s a very real possibility.  In medical testing, you have to know the underlying distributions and make corrections to your test in the case of non-normality. In fact, the chances are the two populations are NOT normal, and you had better hope their deviations are not so large that the estimates are significantly off. 

    The truth is, there’s way WAY more to this whole issue than just normalizing the data into z-scores and running a 1-tailed test.  Most of the time you can get away with it, but if the stakes are high (i.e., if the cost of being wrong is high), you had better know what your underlying distributions are. There are statistical tests and procedures to handle this, but at that point you’ll need to hire an expert in statistics. 
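One distribution-free procedure along these lines is a permutation (randomization) test: instead of assuming normality, it builds the null distribution by repeatedly reshuffling which impressions got the clicks. A rough sketch on the interview-question numbers (Monte Carlo, so the p-value is approximate):

```python
import random

def permutation_test(clicks_a, imps_a, clicks_b, imps_b, trials=2000, seed=42):
    """One-tailed Monte Carlo permutation test for CTR(B) > CTR(A).
    Makes no normality assumption: under the null, clicks fall on
    impressions at random regardless of which ad served them."""
    rng = random.Random(seed)
    total_imps = imps_a + imps_b
    total_clicks = clicks_a + clicks_b
    observed_diff = clicks_b / imps_b - clicks_a / imps_a
    hits = 0
    for _ in range(trials):
        # Randomly place all clicks among all impressions; treat the
        # first imps_b impression "slots" as belonging to ad B.
        clicked = rng.sample(range(total_imps), total_clicks)
        sim_b = sum(1 for i in clicked if i < imps_b)
        sim_a = total_clicks - sim_b
        if sim_b / imps_b - sim_a / imps_a >= observed_diff:
            hits += 1
    return hits / trials

# Interview-question numbers: Ad A 272/17,235, Ad B 41/2,253.
p = permutation_test(272, 17235, 41, 2253)
print(f"approximate permutation p-value: {p:.3f}")
```

On well-behaved data like this the result lands close to the normal-approximation answer; the payoff of the permutation approach comes when the distributional assumptions are shaky.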

  • ChristopherSkyi

    I think Google Website Optimizer has built-in/automatic estimators . . .

  • ChristopherSkyi

    “There may be a fundamental problem with the A/B test you describe above. Namely, the two versions of the ad copy ran at different points in time (or there was only a small overlap period).”

    Definitely that’s a problem, a potentially big one. Ideally you’d never want to let time be an uncontrolled variable if you can help it. There’s no way, statistically, to control for that after the experiment. It’s a design flaw in the experiment, and it is a potential killer in terms of valid conclusions.  

    The only thing you could do would be to somehow come up with an independent estimate of the effects of those two time periods and then adjust all the original data by that factor.  Good luck successfully doing that. It’s a case of closing the barn door after the horse gets out. Better – essential, in fact – to control critical variables up front, right off the bat.