The Devil made me do it! It just had to happen sooner or later, and no one else seemed to be willing to bite the bullet. So, I figured it was up to me. I've written the definitive addition to the Dummies series:
O.K., I'm (uncharacteristically) exaggerating just a tad. I think the cover looks good, though; and I've even started to assemble some of the core material, as you'll see below. So what brought on this fit of enthusiasm for what at first blush might be misinterpreted as the neuronically challenged?
Well, last week I gave a talk in the Statistics seminar series in the Math. & Stats. department here at UVic, and this week I gave a similar talk in my own department's brown-bag seminar. The first of those talks was titled "Interpreting Indicator Covariates in Semi-logarithmic Regression Models". The talk for the Economics department was more succinctly called "Dummies for Dummies". The content of the two talks was pretty much the same, but I had to take into account a couple of differences in the language used by econometricians and statisticians. We say "regressor", and they say "covariate". They say "indicator variable", and we say "dummy variable"... you get the picture. There's another difference too - statisticians don't need to be cajoled into attending seminars by giving the talk a provocative (and possibly insulting) title!
On this occasion the economists noticeably self-selected, and there was a healthy turnout of the curious and homeless. Regrettably we can't afford to hand out free lunches at seminars in the way our colleagues in the Business School purport to. Perhaps it's because we know that such things don't exist! People actually turned up in spite of this. Curiosity got the better of them.
These seminars were based on a recently completed research paper of mine (Giles, 2011a). The main point of that paper is to derive the exact sampling distribution of a particular statistic that arises naturally when estimating a log-linear regression model with one or more dummy variables as regressors. The paper also shows what can go wrong if you don't do the job properly when interpreting that statistic - but more on this below.
Dummy variables are quite alluring when it comes to including them in regression models. However, they're rather special in certain ways. So, here are four things that your mother probably never taught you, but which will form the cornerstones of the forthcoming tome, Dummies for Dummies. Meanwhile, you keen users of dummy variables may want to keep them in mind.
1. Dummies in Log-Linear Models:
Interpreting a dummy variable's coefficient when the dependent variable has been log-transformed has to be undertaken with care. Trust me, the literature is full of empirical applications where the authors get it wrong, and most of the standard text books are no better. The way to interpret the coefficient of a continuous regressor in a regression model, where the dependent variable has been log-transformed, can be seen by considering the following regression model:
ln(Y) = a + bX + cD + ε . (1)
Here, X is a continuous regressor, and D is a zero-one dummy variable. The interpretation of the coefficient, b, is that it is the partial derivative of ln(Y) with respect to X. So, a one-unit increase in X will lead to a multiplicative change of exp(b) in Y, other things held equal (and a one-unit decrease will lead to a multiplicative change of exp(-b)). That is, Y will be scaled by exp(b).
There's another way to express this effect, though. If you recall the Taylor's series expansion for e^x that you learned in high school, you'll know that for small values of b, we have the approximation, exp(b) ≈ 1 + b. This implies that 100b is the expected percentage change in Y for a one-unit change in X. (This is different from an elasticity, of course.) You might find this link helpful if you need an elementary discussion of some of this.
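To get a feel for how "small" b needs to be, here is a minimal Python sketch that compares the approximate percentage change, 100b, with the exact one, 100[exp(b) - 1]. The function name approx_error and the grid of values for b are my own choices for illustration, nothing more.

```python
import math

def approx_error(b):
    """Return (approximate, exact) percentage change in Y for a one-unit change in X."""
    approx = 100.0 * b                    # based on exp(b) ≈ 1 + b
    exact = 100.0 * (math.exp(b) - 1.0)   # based on scaling Y by exp(b)
    return approx, exact

# Illustrative values of b only; the gap widens as b moves away from zero.
for b in (0.01, 0.05, 0.10, 0.30, 0.60):
    approx, exact = approx_error(b)
    print(f"b = {b:4.2f}: 100b = {approx:6.2f}%, 100[exp(b) - 1] = {exact:6.2f}%")
```

The two columns agree closely for the first few values and then drift apart, which is all the Taylor approximation promises.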
Unfortunately, lots of people (who really should know better) then apply the same "reasoning" to the interpretation of c. The trouble is, of course, that D is not continuous, so we can't differentiate ln(Y) with respect to D. The way to get the percentage effect of D on Y is pretty obvious. Curiously enough, those same people who go about this the correct way when computing marginal effects in the case of Logit and Probit models just don't seem to do it right in the present context. All we have to do is take the exponential of both sides of equation (1), then evaluate Y when D = 0 and when D = 1. The difference between these two values, divided by the expression for Y based on the starting value of D, gives you the correct interpretation immediately:
If D switches from 0 to 1, the % impact of D on Y is 100[exp(c) - 1]. (2)
If D switches from 1 to 0, the % impact of D on Y is 100[exp(-c) - 1]. (3)
Notice the asymmetry of the impacts - unlike the case of the continuous regressor. Also notice that, in general, these values will be quite different from the 100c that some of our chums insist on using. For example, if c = 0.6, the naïve econometrician will conclude that there is a 60% impact; whereas it is really an 82.2% positive impact as D changes from 0 to 1, and a 45.1% negative impact as D goes from 1 to 0! (Recalling the formula for the Taylor series expansion of exp(c) will make it really transparent why and when things go wrong by using c itself.)
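If you'd like to check the arithmetic for yourself, the following short Python sketch evaluates formulae (2) and (3) for c = 0.6. The function name dummy_impacts is just a label I've made up for this illustration.

```python
import math

def dummy_impacts(c):
    """Percentage impacts of a 0/1 dummy on Y in a log-linear model."""
    switch_on = 100.0 * (math.exp(c) - 1.0)    # formula (2): D goes from 0 to 1
    switch_off = 100.0 * (math.exp(-c) - 1.0)  # formula (3): D goes from 1 to 0
    return switch_on, switch_off

c = 0.6
on, off = dummy_impacts(c)
print(f"naive: {100 * c:.1f}%  0->1: {on:.1f}%  1->0: {off:.1f}%")
# prints: naive: 60.0%  0->1: 82.2%  1->0: -45.1%
```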
Let me hasten to spoil your day by assuring you that this is not breaking news. This little pearl of wisdom has been around in the mainstream economics/econometrics literature for at least 30 years. Hence the "Read Your History" byline on the cover of Dummies for Dummies. Moreover, even more care has to be taken when using an estimated value of c - say c* - after fitting model (1) using OLS. You might be tempted to simply replace c with c*, in the formulae in (2) and (3). Not a good plan, as we've known since at least 1981! The resulting estimator of the percentage impact is then biased, in a direction that you can figure out for yourself using Jensen's inequality. A nice practical solution - one that gives an almost-unbiased estimator of the % impact of the dummy on Y - was suggested by Kennedy (1981), assuming normal errors in (1). You just have to modify the formula in (2) to become 100[exp(c* - ½v*(c*)) - 1], where v*(c*) is the estimated variance of c* - i.e., it's the square of the standard error for c*. You make the corresponding adjustment to the formula in (3), though none of the writers back around 1980 (myself included) actually observed that there are these two separate cases. If you want to be really tricky, and use the exact minimum variance unbiased estimator, I derived the formula for this in Giles (1982). However, it's really messy, and in practice adds very little to Kennedy's estimator that I've just described. My colleague, Ken Stewart, has a nice discussion of this in his excellent book, Introduction to Applied Econometrics.
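Here is a rough Python sketch of the two estimators of the impact in (2), together with a small simulation of their average values. Everything in the simulation is an assumption made for illustration: c* is drawn directly from a normal distribution with true c = 0.6 and standard error 0.3, and the variance is treated as known, whereas Kennedy's estimator uses the estimated variance from the fitted regression. So this shows the flavour of the correction, not its exact finite-sample behaviour.

```python
import numpy as np

def naive_impact(c_hat):
    """Plug c* straight into formula (2)."""
    return 100.0 * (np.exp(c_hat) - 1.0)

def kennedy_impact(c_hat, var_c_hat):
    """Kennedy (1981) adjustment: subtract half the variance of c* first."""
    return 100.0 * (np.exp(c_hat - 0.5 * var_c_hat) - 1.0)

# Hypothetical set-up: true c and standard error chosen only for illustration.
rng = np.random.default_rng(12345)
c_true, se = 0.6, 0.3
c_hat = rng.normal(c_true, se, size=200_000)  # simulated sampling distribution of c*

true_impact = 100.0 * (np.exp(c_true) - 1.0)
naive_mean = naive_impact(c_hat).mean()
kennedy_mean = kennedy_impact(c_hat, se**2).mean()  # variance treated as known here

print(f"true impact:        {true_impact:.1f}%")
print(f"naive, on average:  {naive_mean:.1f}%")
print(f"Kennedy, on average:{kennedy_mean:.1f}%")
```

With a standard error this large relative to c, the gap between the two averages is easy to see; with a tight standard error the adjustment matters much less.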
So, this is something to think about the next time you're fitting a log-linear regression. If you want to go further than this, and worry about matters beyond point estimation - such as confidence intervals and the like - then you'll be thrilled to know that the sampling distribution of Kennedy's almost unbiased estimator is nowhere near normal. So be even more careful in this case, and maybe even read the paper on which my seminars were based.
2. Dummies That Take Only One Non-Zero Value:
Alright, now here's another trap for young players. I'll keep it really brief. You probably know already that if you have a dummy variable that is zero for all but one of the sample values, then your OLS estimates of the regression model's coefficients will be identical to those that you'd get if you simply dropped the "special" observation (for which the dummy is non-zero) from the regression altogether. I often set the proof of this as an exercise for my students. In addition, the residual for that one special observation will be exactly zero.
So, be careful how you interpret your OLS results if you choose to use such a dummy variable! I'm not saying that you shouldn't do so. In fact, the standard error for the estimated coefficient on the dummy variable is of some interest. It enables you to test if that observation makes a significant contribution. You could use this information to test if an apparent "outlier" in the sample is having a statistically significant impact on your estimated model.
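For the OLS case, the result is easy to check numerically. The Python sketch below uses made-up data (the sample size, coefficients, and the position of the "special" observation are all arbitrary assumptions) and compares the regression that includes the single-observation dummy with the regression that simply drops that observation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

k = 10            # index of the "special" observation (arbitrary choice)
y[k] += 5.0       # make it look like an outlier
d = np.zeros(n)
d[k] = 1.0        # dummy that is non-zero for just this one observation

# Regression with intercept, x, and the single-observation dummy.
Z_full = np.column_stack([np.ones(n), x, d])
beta_full, *_ = np.linalg.lstsq(Z_full, y, rcond=None)
resid_full = y - Z_full @ beta_full

# Regression with the special observation dropped and no dummy.
keep = np.arange(n) != k
Z_drop = np.column_stack([np.ones(n - 1), x[keep]])
beta_drop, *_ = np.linalg.lstsq(Z_drop, y[keep], rcond=None)

print("with dummy:   ", beta_full[:2])
print("obs. dropped: ", beta_drop)
print("residual for the special observation:", resid_full[k])
```

The intercept and slope agree to machine precision, and the residual for the special observation is zero apart from rounding error.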
Did you know, however, that this same result holds for lots of other estimation methods, beyond least squares? You won't find it discussed in your textbook, but it's something that is proven, and discussed in another recent paper of mine (Giles, 2011b). More specifically, the above result relating to the use of single-valued dummy variables also holds for GMM estimation; any generalized IV estimator (including 2SLS and LIML); the MLE for any of the standard count-data models, such as Poisson, Negative Binomial and Exponential; and even for quantile regression.
3. Dummies With a Fixed, Finite Number of Non-Zero Values:
I'll bet you didn't know that for many of the situations where you estimate a regression model with a dummy variable in it, the estimator of that variable's coefficient is inconsistent. This has nothing to do with random regressors, measurement error or omitted variables. The model can meet all of the usual "textbook assumptions". Guess what else? The problem I'm alluding to arises not just with OLS estimation, but also with any generalized instrumental variables (IV) estimator. And that's not all! The estimator of that coefficient has a non-normal sampling distribution - even for an infinite sample size! The asymptotic distribution is horribly skewed to the right, so this is really going to cause strife if you try to construct confidence intervals or test hypotheses about the dummy's coefficient, but ignore this fact. Remember - this is an asymptotic result, so it doesn't get any better even if you have a huge sample of data.
What on earth is this all about, and why didn't your mom warn you?
Well, notice that I said "....for many of the situations...". So this problem doesn't always arise. Also notice that I was referring only to the coefficient(s) of the dummy variable regressor(s) - not to estimators of the coefficients of the "regular" (measured) regressors in the model. Everything is just fine in their case. So what are these "...many situations.."? You probably won't like the answer to this, because unfortunately these are situations you'll have met many, many times - they're really common, and rather interesting. In a nutshell, any time that the dummy variable takes a non-zero (usually unit) value for a finite and fixed number of observations, the usual asymptotics don't apply and you get the problems I've just mentioned. Of course, the situation of OLS estimation when there is just a single non-zero value for the dummy variable in the sample is a special example of this, and this case is discussed by Hendry and Santos (2005). It doesn't seem to be widely known, however. I provide the generalization from one observation to any finite number of observations, and from OLS to IV estimation, in my recent paper, Giles (2011c).
So, consider the following situation, for example:
We want to fit a regression using a sample of data that covers the period 1939 to 1980, and we notice that there is an obvious structural break corresponding to the period of the 2nd World War - 1939 to 1945. So, when we estimate our regression model we include a dummy variable (either to shift the intercept, or multiplicatively to shift one or more of the slope parameters), and this dummy variable is zero except for the 7 years, 1939 to 1945 inclusive. Now, we can't re-write the history books, more's the pity. So, no matter how much more data were to become available, before 1939 or since 1980, our dummy variable will always have just 7 non-zero values. When we look at the coefficient of that dummy variable, the OLS estimator will still be "Best Linear Unbiased" (under our otherwise standard assumptions), but it will be inconsistent. It will be very unreliable even with an infinitely large sample size. We should also be really careful about constructing confidence intervals or tests relating to this coefficient, because of the non-normality of the sampling distribution for this particular OLS estimator, even asymptotically.
How many times have you seen empirical studies, perhaps using thousands of observations, where dummy variables of the type I've mentioned appear as regressors? Lots, I'll bet. Those large samples are not much help at all in this case, and you should be skeptical when the authors get all excited about the interpretation of the coefficients of their dummy variables. These numbers mean very little at all!
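A small Monte Carlo experiment makes the point about sample size concrete. The Python sketch below is only an illustration under assumptions of my own choosing: a dummy with 7 non-zero values, one continuous regressor, standard normal errors, and arbitrary true coefficients. It tracks how the spread of the OLS estimates changes as n grows. Because the errors here are normal, it says nothing about the non-normality issue; it only illustrates that the precision of the dummy's coefficient stops improving.

```python
import numpy as np

def simulate(n, m=7, reps=1000, seed=0):
    """Return the Monte Carlo std. devs. of the OLS slope on x and on the dummy."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    d = np.zeros(n)
    d[:m] = 1.0                                   # dummy is non-zero for m observations only
    Z = np.column_stack([np.ones(n), x, d])
    errors = rng.normal(size=(n, reps))           # one column of errors per replication
    Y = (1.0 + 0.5 * x + 2.0 * d)[:, None] + errors
    coefs, *_ = np.linalg.lstsq(Z, Y, rcond=None) # shape (3, reps): all replications at once
    return coefs[1].std(), coefs[2].std()

for n in (40, 400, 4000):
    sd_x, sd_d = simulate(n)
    print(f"n = {n:5d}: sd(slope on x) = {sd_x:.3f}, sd(dummy coefficient) = {sd_d:.3f}")
```

The spread of the slope on x shrinks steadily as n grows, while the spread of the dummy's coefficient settles down at a value governed by the 7 non-zero observations and refuses to budge.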
4. The Perils of Using Seasonal Dummy Variables:
Finally, ask yourself: "How many times have I estimated an OLS regression model using quarterly time-series data, and included seasonal dummy variables to deal with the observed seasonality in the dependent variable?" (Probably more times than you can recall.) Now ask yourself: "What on earth had I been inhaling?" (Don't answer that if you don't want to. Just send me an email and I promise - nudge, nudge - it won't go viral.)
Now, don't panic - I'm not about to launch into a boring little homily about the "dummy variable trap". Here's the thing. Do you recall the Frisch-Waugh Theorem? It was actually published in volume 1 of Econometrica, would you believe! In the context of our seasonal dummy variables this theorem tells us the following, as was pointed out by Lovell (1963). Suppose that we estimate the following regression model by OLS, where the Si's are the quarterly seasonal dummy variables:
Y = a + bX + c1S1 + c2S2 + c3S3 + e . (4)
Let b* be the OLS estimator of b.
Now, suppose that we decide to "seasonally adjust" the Y data by "explaining" the seasonal component in that variable using the seasonal dummies, and then eliminating that part of the series. So, we fit an OLS regression:
Y = a + c1S1 + c2S2 + c3S3 + v , (5)
and then treat the residuals as the seasonally adjusted Y series, Ysa. We do the same sort of thing to "seasonally adjust" the X series. We fit the OLS regression:
X = a' + c'1S1 + c'2S2 + c'3S3 + u , (6)
and treat these residuals as the seasonally adjusted series, Xsa. Finally, we regress Ysa on Xsa:
Ysa = a" + b"Xsa + e' . (7)
The Frisch-Waugh-Lovell Theorem tells us that the OLS estimator of b" in (7) will be identical to the OLS estimator of b, namely b*, in (4).
This is a purely algebraic result - it doesn't rely on any "statistics" per se, and it certainly doesn't rely on any assumptions about the random errors in any of the fitted models. In addition, it doesn't even require that OLS estimation be used throughout. I showed some years ago (Giles, 1984) that the same results emerge if you replace the OLS estimator with any IV estimator.
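Since the result is purely algebraic, it's easy to verify on any data set. The following Python sketch does this for the OLS case with simulated quarterly data; the sample length, the seasonal patterns, and the true coefficients are all invented for the purpose, and the helper function ols is just a thin wrapper of my own around NumPy's least squares routine.

```python
import numpy as np

def ols(Z, y):
    """OLS coefficients from regressing y on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

rng = np.random.default_rng(42)
n = 80                                   # 20 years of quarterly data (arbitrary)
quarter = np.arange(n) % 4
S = np.column_stack([(quarter == q).astype(float) for q in range(3)])  # S1, S2, S3
const = np.ones(n)

# Made-up seasonal patterns in both X and Y.
x = np.array([1.0, -0.5, 0.3, 0.0])[quarter] + rng.normal(size=n)
y = 2.0 + 1.5 * x + np.array([0.8, -0.2, 0.4, 0.0])[quarter] + rng.normal(size=n)

# Model (4): Y on a constant, X and the seasonal dummies.
b_star = ols(np.column_stack([const, x, S]), y)[1]

# Models (5) and (6): "seasonally adjust" Y and X by taking residuals.
W = np.column_stack([const, S])
y_sa = y - W @ ols(W, y)
x_sa = x - W @ ols(W, x)

# Model (7): regress adjusted Y on adjusted X.
b_two_step = ols(np.column_stack([const, x_sa]), y_sa)[1]

print("b* from (4):       ", b_star)
print("estimate from (7): ", b_two_step)
```

The two numbers coincide up to rounding error, whatever seed or seasonal pattern you choose.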
What you need to be aware of is that this is not just a rather quaint little result. The implications of what we've just seen are actually quite important. Let's see why this is. First, if we fit a regression with regular data and seasonal dummy variables, this is equivalent to "seasonally adjusting" all of the data (Y and X). Second, the variables have all been effectively "seasonally adjusted" in exactly the same way, which is totally unrealistic - this is not what happens when our statistical agencies seasonally adjust time-series using the Census X-12-ARIMA method (which you can download for free, and is a standard feature in EViews, if you use that package). Third, the data have not really been seasonally adjusted at all, because no account has been taken of the other components of the time-series, Y and X. In general, they will have trend and cyclical components that need to be taken into account, properly, and differently for each series, as is done when the X-12-ARIMA method is used.
So the bottom line is that including seasonal dummy variables makes sense only if: (a) you think that the dependent variable and all of the regressors in your model have a simple additive seasonal component; and (b) you don't think they have any trend or cyclical components! When could you last put your hand on your heart and swear that this was the case in practice?
Anyway, I hope that this sneak preview will whet your appetite somewhat, and I look forward to receiving the flood of orders for Dummies for Dummies when it rolls off the presses. You'll be the first to know - trust me!
Note: The links to the following references will be helpful only if your computer's IP address gives you access to the electronic versions of the publications in question. That's why a written References section is provided.
Frisch, R. and F. V. Waugh (1933). Partial time regression as compared with individual trends. Econometrica, 1, 387-401.
Giles, D. E. (1984). Instrumental variables regressions involving seasonal data. Economics Letters, 14, 339-343.
Giles, D. E. (2011a). Interpreting dummy variables in semi-logarithmic regression models: exact distributional results. Econometrics Working Paper EWP1101, Department of Economics, University of Victoria.
Giles, D. E. (2011b). Econometric models with single-valued dummy variables. Mimeo., Department of Economics, University of Victoria.
Giles, D. E. (2011c). On the inconsistency of instrumental variables estimators for the coefficients of certain dummy variables. Econometrics Working Paper EWP1106, Department of Economics, University of Victoria.
Hendry, D. F. and C. Santos (2005). Regression models with data-based indicator variables.
Kennedy, P. E. (1981). Estimation with correctly interpreted dummy variables in semilogarithmic equations. American Economic Review, 71, 801.
Lovell, M. C. (1963). Seasonal adjustment of economic time series. Journal of the American Statistical Association, 58, 993-1010.
Lovell, M. C. (2008). A simple proof of the FWL (Frisch, Waugh, Lovell) theorem. Journal of Economic Education, 39, 88-91.
(This one is definitely) © 2011, David Giles