How to Pose a Question
This post is inspired by two books I read during the last couple of days: The Elements of Data Analytic Style and The Art of Data Science (both are pay-what-you-want books and available via the links I provided). Among other topics, both deal with what might be the most important step in the analytic process: ensuring that the question at hand is the one you actually want to answer. Here I summarize the essence of the corresponding chapters [1]. We will look at the types of data analytic questions, what makes a good question, and how to convert a question into a data problem.
Types of Questions
A useful way to arrive at a good question is to first make yourself aware of which kinds of questions exist and then determine which category your specific question belongs to. According to Jeff Leek, author of The Elements of Data Analytic Style, there are six types of data analytic questions [2]. They differ in their purpose and in how their results are interpreted. Here they are: descriptive, exploratory, inferential, predictive, causal, and mechanistic.
The chart on the right also helps you determine the category of a question. The chart it is based on can be found on page 4 of Jeff’s book. Now let’s take a look at the concepts behind these types of questions.
Descriptive questions ask you to summarize a property of a set of data. Examples include the mean, median, mode, variance, standard deviation or skewness of a feature, as well as the frequency of an event and the expected value of a variable. The result itself needs no interpretation since it is a fact.
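As a minimal sketch of a descriptive analysis, the summaries named above can be computed with Python's standard `statistics` module. The `goals` data set is made up for illustration, and the population variants of variance and standard deviation are used on the assumption that the eight values are the whole data set.

```python
import statistics

# Hypothetical sample: number of goals scored in eight matches.
goals = [2, 4, 4, 4, 5, 5, 7, 9]

summary = {
    "mean": statistics.mean(goals),
    "median": statistics.median(goals),
    "mode": statistics.mode(goals),
    # Population variants: the eight values are treated as the whole data set.
    "variance": statistics.pvariance(goals),
    "std_dev": statistics.pstdev(goals),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Each number is simply reported; nothing is claimed beyond the data itself.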
An Exploratory question builds on a descriptive analysis. It looks for patterns, trends or relationships between multiple variables. Its main purpose is generating hypotheses. These can either stem from a general idea you already had or be inspired by the relationships found during the analysis. In contrast to inferential, predictive, causal and mechanistic questions, it does not test whether the hypothesis holds true.
An Inferential question builds on an exploratory analysis. It turns the previously found hypothesis into a question and seeks to answer it using a different data set. In essence, an inferential question is about verifying that evidence found in an exploratory analysis generalizes to a larger population.
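One common way to look for a relationship between two variables is a correlation coefficient. The sketch below computes the Pearson correlation by hand; the temperature and sales figures are invented, and the helper name `pearson` is my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical observations: daily temperature (degrees Celsius) and ice cream sales.
temperature = [12, 15, 19, 23, 27, 31]
ice_cream_sales = [110, 135, 180, 240, 290, 350]

r = pearson(temperature, ice_cream_sales)
print(f"correlation: {r:.2f}")
```

A strong correlation here only suggests a hypothesis ("sales rise with temperature"); it does not test it.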
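To illustrate the idea of checking a hypothesis on a different data set, here is a rough sketch of a two-sided permutation test on the difference in group means. The test itself is my choice of technique, not one prescribed by the books, and the confirmation sample is invented.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical confirmation sample, collected separately from the
# exploratory data: ice cream sales on warm days versus cool days.
warm_days = [290, 310, 305, 330, 298, 320, 315, 340]
cool_days = [180, 200, 190, 175, 210, 185, 195, 205]

p = permutation_p_value(warm_days, cool_days)
print(f"p-value: {p:.4f}")
```

A small p-value suggests that the difference found during exploration is unlikely to be a chance artifact of that one sample.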
A Predictive question also builds on an exploratory analysis. It seeks to generate a forecast for the values of a specific individual’s properties. This type of question is more concerned with the ‘what’ than the ‘why’ and ‘how’. Here it is more important to find which factors are useful for getting the prediction right than the reason behind their usefulness.
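A simple least-squares line is one possible way to produce such a forecast. This is only a sketch: the function names `fit_line` and `predict` and the data are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Forecast y for a new x using a (slope, intercept) pair."""
    slope, intercept = model
    return intercept + slope * x


# Hypothetical training data: temperature (degrees Celsius) and ice cream sales.
temperature = [12, 15, 19, 23, 27, 31]
ice_cream_sales = [110, 135, 180, 240, 290, 350]

model = fit_line(temperature, ice_cream_sales)
print(f"forecast for 25 degrees: {predict(model, 25):.0f}")
```

The model says nothing about why temperature is useful; it only uses it to get the forecast right.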
A Causal question asks if changing one factor, on average, results in a change in another factor. In contrast to a predictive question, what matters here is whether a change in one factor actually causes the change in the other. To see the difference it is useful to think of confounding factors. Murder rates and ice cream consumption in US cities might be good predictors of each other since they are highly correlated, but Popsicle sales hardly cause murder rates to soar (a predictive question would be fine with that). A causal question, though, would be more interested in finding that higher temperatures in summer are the actual reason behind rising ice cream consumption (and higher murder rates).
A Mechanistic question asks if changing one factor necessarily results in a change in another factor. It differs from a causal question in that it is interested not just in the ‘why’ but also in the ‘how’. For example, rising temperatures might be the cause of rising ice cream sales, but knowing this does not explain how exactly one leads to the other. In many cases a very specific test design is needed to answer a mechanistic question. The difference between causal and mechanistic questions is generally less about the analytic techniques used than about the test design used to generate the data set.
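The ice cream example can be simulated to show how a confounder produces a strong correlation that disappears once the confounder is accounted for. Everything in this sketch is an assumption: the data are synthetic, the coefficients are arbitrary, and the partial correlation is just one simple way to adjust for a third variable.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Synthetic world: temperature drives both quantities; they never
# influence each other directly.
rng = random.Random(42)
temperature = [rng.uniform(0, 35) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
murders = [0.5 * t + rng.gauss(0, 2) for t in temperature]

raw = pearson(ice_cream, murders)
adjusted = partial_correlation(ice_cream, murders, temperature)
print(f"raw correlation: {raw:.2f}")
print(f"after adjusting for temperature: {adjusted:.2f}")
```

The raw correlation is good enough for prediction, but the adjusted value shows that it says little about cause and effect.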
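The six descriptions above can be condensed into a small decision procedure. This is my own paraphrase of the text in this post, not a reproduction of the chart in the book, and the function and parameter names are hypothetical.

```python
def question_type(
    *,
    interprets_result,
    tests_hypothesis,
    about_changing_a_factor=False,
    effect_is_necessary=False,
    forecasts_individual=False,
):
    """Classify a data analytic question from a few yes/no answers."""
    if not interprets_result:
        # A bare summary of the data is reported as a fact.
        return "descriptive"
    if not tests_hypothesis:
        # Patterns are found, but no hypothesis is checked.
        return "exploratory"
    if about_changing_a_factor:
        # Average effect versus an effect that must always occur.
        return "mechanistic" if effect_is_necessary else "causal"
    if forecasts_individual:
        return "predictive"
    return "inferential"


print(question_type(interprets_result=True, tests_hypothesis=True,
                    about_changing_a_factor=True))
```

The keyword-only arguments force each answer to be spelled out at the call site, which keeps the classification readable.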
What makes a Good Question
Once you have figured out which type of question you have at hand, you should make sure it is also a good question before you proceed. Roger D. Peng and Elizabeth Matsui, authors of The Art of Data Science, suggest five criteria to do just that [4]. According to them, a question must be:
- Interesting: Before you spend any further effort on answering your question, you should make sure that there is (or potentially is) someone interested in finding the answer. This can be a broad criterion if you are doing academic work, but if you are working in industry it is advisable to confirm with your boss and/or colleagues that someone else is interested in the answer to your question.
- Unanswered: Many questions can be easily answered by a quick Google search. For example, agencies like the U.S. Census Bureau or the German Federal Statistical Office provide a lot of publicly available data and analyses. The answer might also be found among the thousands of academic articles published every year. If your question has already been answered, there is likely no need to repeat the analysis.
- Plausible: As with the ice cream/crime example above, it is advisable to make sure there is a plausible framework around your question. So before conducting an analysis it can be worth finding out to what extent the possible answers can be explained by reasoning.
- Answerable: Unfortunately, some of the best questions cannot be answered. This can be due to the cost or difficulty of collecting the needed data or due to ethical reasons. Still, a question worth pursuing should obviously be answerable.
- Specific: Having a specific question rather than a general one makes it easier to determine the type of question and thus to think about the test design and the analytic process. It also ensures that the question being answered is the one you and your audience really want to see answered. It’s worth checking with others, such as your colleagues, whether your question is specific. As an example, instead of the rather general ‘Are footballer salaries increasing?’ a specific question could be either:
- ‘Will the inflation-adjusted fixed seasonal salaries of professional football players in the NFL increase during the next 12 months?’ or
- ‘By how much did the nominal salaries of professional soccer players in the 1. Bundesliga, measured in euros, change over the course of the last 12 months?’
This example shows that the same general question could reasonably be assumed to mean many different specific ones. Unspecific questions are a potential source of frustration and wasted resources. Thus a good question is specific.
Converting Questions into Data Problems
Of course the aim of posing a question is answering it. In the case of a data analytic question you will answer it by analyzing a data set. So at some point you have to translate your question into a data problem. To do this you operationalize the question into a data analysis, which leads to a result.
Roger D. Peng and Elizabeth Matsui thought about how to make sure that your question survives this translation. Here is a summary of their thoughts in the form of a checklist:
- Make sure your data set provides close enough measures of the factors you are after. This is to ensure that there is only one interpretation of the results. When you analyze measures that are only loosely related to what you are really after, there are competing explanations for the results. For example, you might want to determine if footballer salaries have risen, but the measure you use is how much these footballers are spending compared to a year ago. If you find their expenditures have risen, there are still competing explanations, such as rising prices for comparable goods or a change in their preferences between spending and saving. To avoid this you have to use a measure that’s closer to the actual salaries, for example the staff expenditures the teams reported.
- Think about confounding factors. You might find a correlation between two factors and conclude that one causes the other, but you have to be careful with this, as the ice cream example above has shown. When posing your question, make sure you can account for possible confounding factors. Is there a factor that raises a sports team’s staff expenditure and also raises the salaries of its players? Inflation might be such a confounding factor. When you are aware of the confounding factors you can account for them (provided the data needed is available to you), and you can possibly change your question (asking for real increases in salary rather than nominal ones).
- When operationalizing your question into an analysis, make sure the way your data set was collected does not lead to biased results. According to Peng and Matsui, the two main biases to avoid are recall bias and selection bias. There are also more complete lists of sampling biases and of cognitive biases. I am going to write a follow-up article on biases and how to deal with them, since sometimes they cannot be avoided. But the way you pose your question and the way you operationalize the data collection and analysis can often avoid bias. So spend a couple of minutes thinking about which biases your method could produce and whether there is any way to avoid them.
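The checklist's point about asking for real rather than nominal salary increases comes down to a small calculation. Here is a minimal sketch, where the function name `real_change` and the rates are made up for illustration:

```python
def real_change(nominal_change, inflation):
    """Inflation-adjusted change, with both rates given as fractions."""
    return (1 + nominal_change) / (1 + inflation) - 1


# Hypothetical figures: salaries rose by 5 % while prices rose by 2 %.
nominal = 0.05
inflation = 0.02
print(f"real salary change: {real_change(nominal, inflation):.2%}")
```

Dividing the growth factors instead of subtracting the rates keeps the result exact when inflation is high.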
1. The corresponding chapters are Chapter 2, The data analytic question, in Jeff Leek’s book and Chapter 3, Stating and Refining the Question, in Roger Peng and Elizabeth Matsui’s book.
2. Jeff T. Leek, The Elements of Data Analytic Style, 2015, pp. 4-7
3. Jeff T. Leek and Roger D. Peng, What is the question? in Science, 20 March 2015: Vol. 347 no. 6228, pp. 1314–1315, DOI: 10.1126/science.aaa6146
4. Roger D. Peng and Elizabeth Matsui, The Art of Data Science, 2015, pp. 20-22