How to Design Smart Business Experiments

Managers now have the tools to conduct small-scale tests and gain real insight. But too many “experiments” don’t prove much of anything.

by Thomas H. Davenport
hbr.org | February 2009

EVERY DAY, managers in your organization take steps to implement new ideas without having any real evidence to back them up. They fiddle with offerings, try out distribution approaches, and alter how work gets done, usually acting on little more than gut feel or seeming common sense – “I’ll bet this” or “I think that.” Even more disturbing, some wrap their decisions in the language of science, creating an illusion of evidence. Their so-called experiments aren’t worthy of the name, because they lack investigative rigor. It’s likely that the resulting guesses will be wrong and, worst of all, that very little will have been learned in the process. Take the example of a major retail bank that set the goal of improving customer service. It embarked on a program hailed as scientific: Some branches

Katy Lemay

Harvard Business Review 69

were labeled “laboratories”; the new approaches being tried were known as “experiments.” Unfortunately, however, the methodology wasn’t as rigorous as the rhetoric implied. Eager to try out a variety of ideas, the bank changed many things at once in its “labs,” making it difficult if not impossible to determine what was really driving any improved results. Branches undergoing interventions weren’t matched to control sites for the most part, so no one could say for sure that the outcomes noted wouldn’t have happened anyway. Anxious to head off criticism, managers did provide a control in one test, which was designed to see if placing video screens showing television news over waiting lines would shorten customers’ perceived waiting time. But rather than looking at control and test groups, they compared just one control site with one test site. That wasn’t enough to ensure statistically valid results. Perceived waiting time did drop in the test branch, but it went up substantially in the control branch, despite no changes there. Those confounding data kept the test from being at all conclusive – but that’s not how the findings were presented to top management.

It doesn’t have to be this way. Thanks to new, broadly available software and given some straightforward investments to build capabilities, managers can now base consequential decisions on scientifically valid experiments. Of course, the scientific method is not new, nor is its application in business. The R&D centers of firms ranging from biscuit bakers to drug makers have always relied on it, as have direct-mail marketers tracking response rates to different permutations of their pitches. To apply it outside such settings, however, has until recently been a major undertaking. Any foray into the randomized testing of management ideas – that is, the random assignment of subjects to test and control groups – meant employing or engaging a PhD in statistics or perhaps a “design of experiments” expert (sometimes seen in advanced TQM programs). Now, a quantitatively trained MBA can oversee the process, assisted by software that will help determine what kind of samples are necessary, which sites to use for testing and controls, and whether any changes resulting from experiments are statistically significant.

Consumer-facing companies rich in transaction data are already routinely testing innovations well outside the realm of product R&D. They include banks such as PNC, Toronto-Dominion, and Wells Fargo; retailers such as CKE Restaurants, Famous Footwear, Food Lion, Sears, and Subway; and online firms such as Amazon, eBay, and Google. As randomized testing becomes standard procedure in certain settings – website analysis, for instance – firms build the capabilities to apply it in other circumstances as well. (See the sidebar “Stop Wondering” for a sampling of tests conducted recently.) To be sure, there remain many business situations where it is not easy or practical to structure a scientifically valid experiment. But while the “test and learn” approach might not always be appropriate (no management method is), it will doubtless gain ground over time. Will it do so in your organization? If it’s like many companies I have studied, an investment in software and training will yield quick returns of the low-hanging-fruit variety. The real payoff, however, will happen when the organization as a whole shifts to a test-and-learn mind-set.

IDEA IN BRIEF
» Too many business innovations are launched on a wing and a prayer – despite the fact that it’s now reasonable to expect truly valid tests.
» With a small investment in training, readily available software, and the right encouragement, an organization can build a “test and learn” capability.
» Companies that equip managers to perform small-scale yet rigorous experiments don’t only save themselves from expensive mistakes – they also make it more likely that great ideas will see the light of day.

When Testing Makes Sense
Formalized testing can provide a level of understanding about what really works that puts more intuitive approaches to shame. In theory, it makes sense for any part of the business in which variation can lead to differential results. In practice, however, there are times when a test is impossible or unnecessary. Some new offerings simply can’t be tested on a small scale. When Best Buy, for example, explored partnering with Paul McCartney on an exclusively marketed CD and a sponsored concert tour, neither component of the promotion could be tested on a small scale, so the company’s managers went with their intuition. At Toronto-Dominion, one of the largest and most profitable banks in Canada, testing is so well established that occasionally managers are reminded that, in the interests of speed, they can make the call without a test when they have a great deal of experience in the relevant business domain.

Generally speaking, the triumphs of testing occur in strategy execution, not strategy formulation. Whether in marketing, store or branch location analysis, or website design, the most reliable insights relate to the potential impact and value of tactical changes: a new store format, for example, or marketing promotion or service process. Scientific method is not well suited to assessing a major change in business models, a large merger or acquisition, or some other game-changing decision.

IDEA IN PRACTICE
YOU OR SOMEONE on your team is suggesting a change that just might work. But why act on a hunch when you can hold out for evidence? According to the author, the best way to support decision making on potential innovations is to…
» Design an experiment. Start with a hypothesis about how the change will help the business. If it’s a good one, you’ll learn as much by disproving it as you would by proving it. Put it to the test by measuring what happens in a test group versus a control group. From the outset, be clear on what you need to measure to produce a decisive result – and whether that’s a metric you even have the capability to track.
EXAMPLE Marketers at the Subway restaurant chain wanted to drum up business by putting foot-long subs on sale for only $5, but franchise owners worried that the promotion would lure existing customers away from higher-priced menu items. An experiment pitting test sites against control sites proved that the promotion would pay off – which it subsequently did.
» Act on the facts. Nothing but a success in a testing environment should be rolled out more broadly. But neither should failures simply be scrapped. Refine the hypothesis on the basis of the results, and consider testing a variation. Most important, capture what’s been learned, and make it available to others in the organization through a “learning library,” so resources aren’t wasted proving the same thing again.
» Make testing the norm. Create the training and infrastructure that will enable nonexperts in statistics to oversee rigorous experiments. Off-the-shelf software can walk them through the steps and help them analyze results. A core group of experts can lend resources and expertise and maintain the learning library. Leadership must cultivate a test-and-learn culture, in part by penalizing those who act without sufficient evidence.
As your managers become more comfortable with testing, they’ll discover that it paves the way for, rather than throwing up barriers to, promising new ideas.

Capital One’s experience hints at the natural limits of experimental testing in a business. The company has been one of the world’s most aggressive testers since 1988, when its CEO and cofounder, Rich Fairbank, joined its predecessor firm, Signet Bank. You could even say the firm was founded on the concept. One thing that appealed to Fairbank about the credit card industry was its “ability to turn a business into a scientific laboratory where every decision about product design, marketing, channels of communication, credit lines, customer selection, collection policies and cross-selling decisions could be subjected to systematic testing using thousands of experiments.”[1] Capital One adopted what Fairbank calls an information-based strategy, and it paid off: The company became the fifth-largest provider of credit cards in the United States.

Yet when it came time to make the largest decision the company had faced in recent years, Capital One’s management concluded that testing would not be useful. Realizing that the business would need other sources of capital to remain independent, the team considered acquiring some regional banks in order to transform itself from a monoline credit provider into a full-service bank. The decision was not tested for a couple of important reasons. First, the nature of the opportunity made it imperative to move quickly; no time was available for even a small-scale test. Second, and more critical, it was impossible to design an experiment that could reliably predict the outcomes of such a major change in business direction. Still, after making the acquisitions, Capital One reaffirmed its commitment to information-based strategy. Its managers immediately set about translating that ethos into the full-service banking context, which required pushing the method further, into tests involving customer service and employee behavior. As one employee told me, “It’s much easier to do randomized testing with direct-mail envelopes than with branch bankers.”

Sears Holdings provides another example of what can reasonably be tested and what can’t. Interestingly, this is another business with a heritage of testing. Robert E. Wood, who originally moved Sears out of the catalog business and into retail stores, said his favorite book was the Statistical Abstract of the United States. When he opened Sears’s first free-standing retail stores, in 1928, he placed two in Chicago. Asked why he needed two in one city, Wood said it was to reduce the risk of choosing a wrong location or store manager. Today Sears Holdings has embarked upon a new era: Its primary owner, financier Edward Lampert, who has been its chairman since Kmart acquired Sears, is exploring alternative ways to combine the two troubled chains. To my knowledge, Lampert didn’t test the idea of combining the retailers. That would have been difficult if not impossible to do (and the jury is still out on whether the acquisition was a good decision). However, he’s a strong advocate of testing at the tactical level. He wrote in a 2006 letter to shareholders, “One of the great advantages of having approximately 2,300 large-format stores at Sears Holdings is that we can test concepts in a few stores before undertaking the risk and capital associated with rolling out the concept to a larger number of stores or to the entire chain.” The retailer has tested, for example, various formats for including Sears merchandise in Kmart stores, and vice versa, as well as other formats, such as the arrangement of merchandise in Sears stores by rooms in a consumer’s home (kitchen, laundry room, bedroom, and so on).

Beyond using the tactical-versus-strategic criterion, there are other ways to decide whether formal testing makes sense. For instance, it is useful only in situations where desired outcomes are defined and measurable. A new sales training program might be proposed, but before you can test its efficacy, you’ll need to identify a goal (such as “We want to increase cross-selling”), and you must be able to measure that change (do you even track cross-selling?). Sales and conversion-rate changes are frequently used as dependent variables in tests and are reliably measured for separate purposes. Other outcomes, such as customer satisfaction and employee engagement, may require more effort and invasiveness to measure.

Tests are most reliable where many roughly equivalent settings can be observed. This might mean physical sites, as with Sears’s stores, or it might mean more ephemeral settings, such as alternative website versions. Among the earliest and most extensive users of testing are retail and restaurant chains. Because so much is held constant among their multitudinous sites, it is easy to designate which ones will serve as experiments and which will serve as controls and to attribute cause to effect. By the same token, workplace design changes are most readily tested in companies that have offices in many cities. Drawing statistical inferences from small numbers of test sites is much more difficult and represents the leading edge of the test-and-learn approach.

Finally, formal testing makes sense only if a logical hypothesis has been formulated about how a proposed intervention will affect a business. Although it’s possible to just make a change and then sit back and observe what happens, that process will inevitably lead to a hypothesis – and often the realization that it could have been formulated in advance and tested more precisely.

The Process of Testing
To begin incorporating more scientific management into your business, you’ll need to acquaint managers at all levels with your organization’s process of testing. It is probably simple to grasp (a typical depiction is shown in the exhibit “Put Your Ideas to the Test”), but it must be communicated in the same terms to people across the organization. Having a shared understanding of what constitutes a valid test enables the innovators to deliver on it and the senior executives to demand it.

(The exhibit “Put Your Ideas to the Test” is adapted from Applied Predictive Technologies’ “Test and Learn” Wheel.)

The process always begins with the creation of a testable hypothesis. (It should be possible to pass or fail the test based on the measured goals of the hypothesis.) Then the details of the test are designed, which means identifying sites or units

Put Your Ideas to the Test

1 CREATE OR REFINE HYPOTHESIS
ASCERTAIN that the hypothesized relationships haven’t already been tested and measured – and that they can be. MAKE sure the hypothesis could generate substantial economic value. DETERMINE whether it suggests an actual decision or action. (If not, go no further.)

2 DESIGN TEST
ENSURE that the number of test and control sites is sufficient for statistical significance. USE simulation to explore multiple strategies for creating control groups (for instance, they may be nearly identical but different on one key variable). ASSESS whether control group strategies previously used for similar tests will suffice; they usually do. CONDUCT statistical analysis to minimize the number of test cells needed. EXTEND testing period if key metrics are highly variable.

3 EXECUTE TEST
MEET with test and control site managers and analytical experts to discuss what might go wrong and what would constitute test-confounding events. INSTRUCT field personnel to report abnormal events. REMOVE sites from test if test-confounding events occur. ADJUST evaluation and compensation plans for managers so that they are not negatively affected by tests.

4 ANALYZE TEST
ENSURE that “lift” from interventions is statistically significant. USE software to analyze results and manage complex data from multiple test and control sites. DETERMINE need for further testing. EXAMINE as many site attributes as possible to see how key variables interact.

5 PLAN ROLLOUT
STUDY attributes of test sites to determine whether rollout should be universal or differentiated. BALANCE complexity of rollout with ease of implementation and management.

6 ROLLOUT
STAGGER the rollout and view it as a test in itself. (Are early-adopting sites yielding the desired result? If not, modify the approach in later-adopting sites.) ENCOURAGE site managers to share rollout strategies and tactics.

LEARNING LIBRARY
DEVELOP a summary of each test: hypotheses, test dimensions, key results, interactions, and rollout strategies and results. EMPLOY standard business taxonomy to allow easy searching of library. MAKE library widely accessible to employees; publicize tests and results of important studies to encourage a test-and-learn culture.
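
How many sites count as “sufficient” in the DESIGN TEST step depends on how variable the key metric is and how small a lift the test needs to detect. The sketch below uses one common textbook approximation (a two-group comparison of means under a normal approximation). It is not the method of any software vendor named in this article. The function name `sites_per_group`, the 5% significance level, and the 80% power default are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sites_per_group(min_lift, std_dev, alpha=0.05, power=0.80):
    """Approximate number of sites needed in EACH group (test and control).

    min_lift: smallest difference in the metric worth detecting
    std_dev:  site-to-site standard deviation of the metric
    alpha:    two-sided significance level (assumed default)
    power:    chance of detecting a true lift of size min_lift (assumed default)
    """
    if min_lift <= 0 or std_dev <= 0:
        raise ValueError("min_lift and std_dev must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * std_dev / min_lift) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical numbers: weekly sales vary by about 10 units across sites,
    # and a lift of 5 units would justify a rollout.
    print(sites_per_group(min_lift=5, std_dev=10))
    # Doubling the variability of the metric raises the requirement sharply.
    print(sites_per_group(min_lift=5, std_dev=20))
```

Because the required count grows with the square of the metric’s variability, a noisy metric calls for more sites or, as the exhibit suggests, a longer testing period that averages out the noise.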

to be tested, selecting the control groups, and defining the test and control situations. After the test is carried out for the specified period – which sometimes can take several months but is usually done in less time – the data are analyzed to determine the results and appropriate actions. The results are ideally put into some sort of “learning library” (although, unfortunately, many organizations skip this step). They might lead to a wider rollout of the experiment or further testing of a revised hypothesis.

More broadly, managers must understand how the testing process fits in with other business processes. They conduct tests in the context of, for example, order management, or site selection, or website development, and the testing feeds into various subprocesses. At CKE Restaurants, which includes the Hardee’s and Carl’s Jr. quick-service restaurant chains, the process for new product introduction calls for rigorous testing at a certain stage. It starts with brainstorming, in which several cross-functional groups develop a variety of new product ideas. Only some of them make it past the next phase, judgmental screening, during which a group of marketing, product development, and operations people will evaluate ideas based on experience and intuition. Those that make the cut are actually developed and then tested in stores, with well-defined measures and control groups. At that point, executives decide whether to roll out a product systemwide, modify it for retesting, or kill the whole idea. CKE has attained an enviable hit rate in new product introductions – about one in four new products is successful, versus one in 50 or 60 for consumer products – and executives say that their rigorous testing process is part of the reason why. If you have had occasion to enjoy a Monster Thickburger at Hardee’s, or a Philly Cheesesteak Burger or a Pastrami Burger at Carl’s Jr., you’ve been the beneficiary of CKE’s efforts. These are just three of the successful new products that were rolled out after testing proved they would sell well.

At eBay, there is an overarching process for making website changes, and randomized testing is a key component. Like other online businesses, eBay benefits greatly from the fact that it is relatively easy to perform randomized tests of website variations. Its managers have conducted thousands of experiments with different aspects of its website, and because the site garners over a billion page views per day, they are able to conduct multiple experiments concurrently and not run out of treatment and control groups. Simple A/B experiments (comparing two versions of a website) can be structured within a few days, and they typically last at least a week so that they cover full auction periods for selected items. Larger, multivariate experiments may run for more than a month. Online testing at eBay follows a well-defined process that consists of the following steps:
■ Hypothesis development
■ Design of the experiment: determining test samples, experimental treatments, and other factors
■ Setup of the experiment: assessing costs, determining how to prototype, ensuring fit with the site’s performance (for example, making sure the testing doesn’t slow down user response time)
■ Launch of the experiment: figuring out how long to run it, serving the treatment to users
■ Tracking and monitoring
■ Analysis and results
The company has also built its own application, called the eBay Experimentation Platform, to lead testers through the process and keep track of what’s being tested at what times on what pages.

As with CKE’s new product introductions, however, this online testing is only part of the overall change process for eBay’s website. Extensive offline testing also takes place, including lab studies, home visits, participatory design sessions, focus groups, and trade-off analysis of website features – all with customers. The company also conducts quantitative visual-design research and eye-tracking studies as well as diary studies to see how users feel about potential changes. No significant change to the website is made without extensive study and testing. This meticulous process is clearly one reason why eBay is able to introduce most changes with no backlash from its potentially fractious seller community. The online retailer now averages more than 113 million items for sale in more than 50,000 categories at any given time.

EBay performed extensive online and offline testing, for example, in 2007 and 2008, when it changed its page for viewing items on sale. The page had not been redesigned since 2003, and both customers and eBay designers felt it lacked organization, had inadequate photographs of items, and suffered from haphazard item placement and redundant functionality. After going through all the testing steps, eBay adopted a new site design. It posted photos 200% larger than those in the previous design, added a countdown timer for auctions with 24 hours or less to go, made more prominent the item condition and return policy, and included tabs to make shipping and payment fields easier to navigate. It also included new security features to prevent unauthorized changes in site content. Each new feature and function was tested independently with control pages. Measures of page views and bid counts suggest that the redesign was very successful.
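
A simple A/B experiment of the kind described in this section ends with one question: is the difference between the control page and the treatment page larger than chance alone would produce? The sketch below shows one standard way to answer it, a two-proportion z-test on conversion counts. It is an illustration only and does not describe the eBay Experimentation Platform; the function name and all of the counts are hypothetical.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of control (A) and treatment (B).

    Returns (lift, z, p_value), where lift is rate_B - rate_A and
    p_value is two-sided under a normal approximation.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return rate_b - rate_a, z, p_value


if __name__ == "__main__":
    # Hypothetical counts: 10,000 visitors per page version.
    lift, z, p = two_proportion_z_test(conv_a=200, n_a=10_000,
                                       conv_b=260, n_b=10_000)
    print(f"lift={lift:.4f}  z={z:.2f}  p={p:.4f}")
```

The normal approximation is reasonable only when each version has many visitors and conversions, which is one reason high-traffic sites can run such tests quickly.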

Building a Testing Capability
Establishing a standard process is the first step toward building an organizational test-and-learn capability, but it isn’t sufficient unto itself. Companies that want testing to be a reliable, effective element of their decision making need to create an infrastructure to make that happen. They need training programs to hone competencies, software to structure and analyze the tests, a means of capturing learning, a process for deciding when to repeat tests, and a central organization to provide expert support for all the above.

Managerial training. At the very least, managers should learn what constitutes a randomized test and when to employ it. Capital One, for example, offers a professional education program on testing and experiment design through its internal training function known as Capital One University. One benefit of hosting a program like this, rather than sending managers outside for training, is the greater emphasis on how the testing connects to upstream and downstream activities in the business.

Test-and-learn software. Some firms, such as Capital One and eBay, have built their own software for managing experiments, but several off-the-shelf options exist – the most common ones being broad statistical packages and analytical tools like SAS. With every passing year, these tools make it more possible for numerate – but not statistically expert – users to conduct truly defensible experiments. Ease of design and analysis has been a particular focus at Applied Predictive Technologies, whose product leads users through the test-and-learn process, keeps track of test and control groups, and provides a repository for findings to be usefully accessed in the future.

Some software tools are tailored to particular problems or industries. Several packaged tools, for example, are available for the analysis of manufacturing-quality experiments. Likewise, highly specialized tools exist for online-usage testing, such as the web analytics software sold by Omniture and WebTrends and the free tools provided by Google Analytics. As of yet, unfortunately, no single software tool can help organizations with all testing types and contexts.

Stop Wondering
TESTING is used to make tactical decisions in a range of business settings, from banks to retailers to dot-coms. Here are some questions various companies are examining:
■ Do lobster tanks increase lobster sales at Food Lion supermarkets?
■ Does a Kmart with a Sears store inside sell more than an all-Kmart format?
■ Do eBay users bid higher in auctions when they can pay by credit card?
■ What’s the optimum number of loose checks for a Wells Fargo ATM to accept?
■ Do Subway promotions on low-fat sandwiches increase sandwich sales?
■ Does a Famous Footwear store sell fewer shoes when there is a competitor in the same mall?
■ Does a Toronto-Dominion branch get significantly more deposits when open 60 hours a week compared with 40?
■ Which promotional offers will most efficiently drive checking account acquisition at PNC Bank?
As a result of their testing, these organizations are finding out whether supposedly better ways of doing business are actually better. Once they learn from their tests, they can spread confirmed better practices throughout their business.

Learning capture. If a firm does a substantial amount of testing, it will generate a substantial amount of learning about what works and what doesn’t. Ideally employees throughout the company would share that knowledge and use it to guide future initiatives. But that happens at few organizations. The head of testing at one online firm admitted, “All of that knowledge is in my head, and we’d be in tough shape if I were hit by a bus.” One bank executive justified a lack of shared learning, commenting, “We should probably do more, but we’ve found that people need to learn from doing the test themselves, even if we’ve done it before many times.” People do learn through personal experience, but one would hope that it’s not the only possible way.

Some organizations, however, have begun to address the issue. Capital One captures the learning from its thousands of tests in an online knowledge management system and has experimented with an even more ambitious system that would use such learning to guide product managers as they develop new offerings. Famous Footwear takes a “billboard” approach; for each test, it captures the results in a one-page document, circulates that throughout the organization, and posts it on the wall outside the testing office.

Regular revisiting. One tricky aspect of establishing a long-term testing approach is determining when to retest. There is no way to know for sure when a test has become obsolete; an experienced analyst needs to assess whether enough factors have changed in the environment to make previous results suspect. Famous Footwear executives feel that the retail store location context – their primary application area for testing – changes enough to merit retesting after about a year. Netflix concluded in 2006 that its five-year-old customer tests needed to be redone; the user base had evolved in that time from internet pioneers to mainstream society members. CKE Restaurants has difficulty deciding whether to retest pricing, particularly in times when commodity prices are increasing fast. Ironically, it is human intuition, not testing or analytics, that must be applied to determine the need for retesting.

Core resource group. Most of the firms that do extensive testing have established a small, somewhat centralized organization to supervise it. The group either actually does the testing, as at PNC Bank, Subway, and Famous Footwear, or – if testing is employed throughout the organization – serves as a resource for methodological and statistical questions, as at Capital One. At PNC Bank, the test-and-learn group (part of the bank’s knowledge management function, which reports to Marketing) views the promotion of its own services around the bank as a priority. It tries to build relationships and trust with key executives so that no major initiatives are undertaken without testing. Without a central coordination point, testing methods may not be sufficiently rigorous, and test and control groups across multiple experiments may confound one another. That said, it’s not always easy to influence or coordinate testing even when a central group exists.
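
To make concrete the kind of rigor check that test-and-learn software or a core resource group might apply, here is a minimal sketch of a permutation test on site-level results. It asks how often a random relabeling of sites as “test” or “control” would produce a difference at least as large as the one observed. This is one generic technique, not the algorithm of any product mentioned in this article; the function name, the seed, and the sales figures are hypothetical.

```python
import random


def permutation_test(test_values, control_values, n_permutations=10_000, seed=0):
    """Return (observed_lift, p_value) for test sites vs. control sites.

    The p-value estimates how often randomly reassigning the site labels
    yields a mean difference at least as extreme as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_test = len(test_values)
    pooled = list(test_values) + list(control_values)

    def mean(values):
        return sum(values) / len(values)

    observed = mean(test_values) - mean(control_values)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_test]) - mean(pooled[n_test:])
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible zero.
    return observed, (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical weekly sales at eight test sites and eight control sites.
    test_sites = [110, 112, 115, 111, 113, 114, 116, 118]
    control_sites = [100, 101, 99, 102, 98, 103, 100, 97]
    print(permutation_test(test_sites, control_sites))
    # One test site against one control site can never look significant.
    print(permutation_test([4], [6]))
```

With a single site in each group, every relabeling reproduces the same absolute difference, so the p-value is always 1. That is the statistical reason a design like the bank’s one-test-site, one-control-site comparison cannot be conclusive.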


Creating a Testing Mind-Set
In addition to making the requisite changes in process, technology, and infrastructure, organizations also need to establish a testing culture. Testing costs money (though not as much as widespread rollouts of new tactics that don’t work), and it takes time. Senior managers have to become accustomed to, and even passionate about, the idea that no major change in tactics should be adopted without being tested by people who understand testing.

Ask for evidence. CEOs who firmly believe in testing can change their entire organization’s perspective on the issue. When people claim that testing has confirmed the wisdom of their idea, have them walk you through the process they used, and demand at least the level of rigor outlined in the exhibit “Put Your Ideas to the Test.”

Give it teeth. Gary Loveman at Harrah’s Entertainment has said that “not using a control group” is sufficient rationale for termination at the company. Jeff Bezos of Amazon reportedly fired a group of web designers for changing the website without testing. Toronto-Dominion has a culture in which managers insist on tests for every major initiative involving customers or branches. The CEO, Ed Clark, is a PhD economist who once noted that although the bank might not be perfect, “nobody ever criticizes us for not running the numbers.”

Sponsor tests yourself. The best management teams in this regard have institutionalized the process of doing and reviewing tests. At Famous Footwear, Joe Wood and his senior management team meet with the testing head every two weeks to discuss past tests, upcoming tests, and preliminary and final results. Wood says that the company has made testing a part of management’s dialogue and the organization’s culture.
•••

Testing may not be appropriate for every business initiative, but it works for most tactical endeavors. And it just isn’t that difficult anymore. It needs to come out of the laboratory and into the boardroom. The key challenges are no longer technological or analytical; they have more to do with simply making managers familiar with the concepts and the process. Testing, and learning from testing, should become central to any organization’s decision making. The principles of the scientific method work as well in business as in any other sector of life. It’s time to replace “I’ll bet” with “I know.”
1. “Capital One Financial Corporation,” HBS case no. 9-700-124.

Thomas H. Davenport ([email protected]) is the President’s Distinguished Professor of Information Technology and Management at Babson College in Babson Park, Massachusetts. His newest book is Competing on Analytics: The New Science of Winning, with Jeanne G. Harris (Harvard Business Press, 2007).
Reprint R0902E To order, see page 111.
