title: A Guide to Econometrics
author: Kennedy, Peter
publisher: MIT Press
isbn10 | asin: 0262112353
print isbn13: 9780262112352
ebook isbn13: 9780585202037
language: English
subject: Econometrics
publication date: 1998
lcc: HB139.K45 1998eb
ddc: 330/.01/5195
A Guide to Econometrics
Fourth Edition
Peter Kennedy
Simon Fraser University
The MIT Press
Cambridge, Massachusetts
© 1998 Peter Kennedy
All rights reserved. No part of this book may be reproduced in any form
or by any electronic or mechanical means (including photocopying,
recording, or information storage and retrieval), without permission in
writing from the publisher.
Printed and bound in The United Kingdom by TJ International.
ISBN 0-262-11235-3 (hardcover), 0-262-61140-6 (paperback)
Library of Congress Catalog Card Number: 98-65110
Contents
Preface
1 Introduction
1.1 What is Econometrics?
1.2 The Disturbance Term
1.3 Estimates and Estimators
1.4 Good and Preferred Estimators
General Notes
Technical Notes
2 Criteria for Estimators
2.1 Introduction
2.2 Computational Cost
2.3 Least Squares
2.4 Highest R2
2.5 Unbiasedness
2.6 Efficiency
2.7 Mean Square Error (MSE)
2.8 Asymptotic Properties
2.9 Maximum Likelihood
2.10 Monte Carlo Studies
2.11 Adding Up
General Notes
Technical Notes
3 The Classical Linear Regression Model
3.1 Textbooks as Catalogs
3.2 The Five Assumptions
3.3 The OLS Estimator in the CLR Model
General Notes
Technical Notes
4 Interval Estimation and Hypothesis Testing
4.1 Introduction
4.2 Testing a Single Hypothesis: the t Test
4.3 Testing a Joint Hypothesis: the F Test
4.4 Interval Estimation for a Parameter Vector
4.5 LR, W, and LM Statistics
4.6 Bootstrapping
General Notes
Technical Notes
5 Specification
5.1 Introduction
5.2 Three Methodologies
5.3 General Principles for Specification
5.4 Misspecification Tests/Diagnostics
5.5 R2 Again
General Notes
Technical Notes
6 Violating Assumption One: Wrong Regressors, Nonlinearities, and Parameter Inconstancy
6.1 Introduction
6.2 Incorrect Set of Independent Variables
6.3 Nonlinearity
6.4 Changing Parameter Values
General Notes
Technical Notes
7 Violating Assumption Two: Nonzero Expected Disturbance
General Notes
8 Violating Assumption Three: Nonspherical Disturbances
8.1 Introduction
8.2 Consequences of Violation
8.3 Heteroskedasticity
8.4 Autocorrelated Disturbances
General Notes
Technical Notes
9 Violating Assumption Four: Measurement Errors and Autoregression
9.1 Introduction
9.2 Instrumental Variable Estimation
9.3 Errors in Variables
9.4 Autoregression
General Notes
Technical Notes
10 Violating Assumption Four: Simultaneous Equations
10.1 Introduction
10.2 Identification
10.3 Single-equation Methods
10.4 Systems Methods
10.5 VARs
General Notes
Technical Notes
11 Violating Assumption Five: Multicollinearity
11.1 Introduction
11.2 Consequences
11.3 Detecting Multicollinearity
11.4 What to Do
General Notes
Technical Notes
12 Incorporating Extraneous Information
12.1 Introduction
12.2 Exact Restrictions
12.3 Stochastic Restrictions
12.4 Pre-test Estimators
12.5 Extraneous Information and MSE
General Notes
Technical Notes
13 The Bayesian Approach
13.1 Introduction
13.2 What is a Bayesian Analysis?
13.3 Advantages of the Bayesian Approach
13.4 Overcoming Practitioners' Complaints
General Notes
Technical Notes
14 Dummy Variables
14.1 Introduction
14.2 Interpretation
14.3 Adding Another Qualitative Variable
14.4 Interacting with Quantitative Variables
14.5 Observation-specific Dummies
14.6 Fixed and Random Effects Models
General Notes
Technical Notes
15 Qualitative Dependent Variables
15.1 Dichotomous Dependent Variables
15.2 Polychotomous Dependent Variables
15.3 Ordered Logit/Probit
15.4 Count Data
General Notes
Technical Notes
16 Limited Dependent Variables
16.1 Introduction
16.2 The Tobit Model
16.3 Sample Selection
16.4 Duration Models
General Notes
Technical Notes
17 Time Series Econometrics
17.1 Introduction
17.2 ARIMA Models
17.3 SEMTSA
17.4 Error-correction Models
17.5 Testing for Unit Roots
17.6 Cointegration
General Notes
Technical Notes
18 Forecasting
18.1 Introduction
18.2 Causal Forecasting/Econometric Models
18.3 Time Series Analysis
18.4 Forecasting Accuracy
General Notes
Technical Notes
19 Robust Estimation
19.1 Introduction
19.2 Outliers and Influential Observations
19.3 Robust Estimators
19.4 Non-parametric Estimation
General Notes
Technical Notes
Appendix A: Sampling Distributions, the Foundation of Statistics
Appendix B: All About Variance
Appendix C: A Primer on Asymptotics
Appendix D: Exercises
Appendix E: Answers to Even-numbered Questions
Glossary
Bibliography
Author Index
Subject Index
Preface
In the preface to the third edition of this book I noted that upper-level
undergraduate and beginning graduate econometrics students are as
likely to learn about this book from their instructor as by word-of-
mouth, the phenomenon that made the first edition of this book so
successful. Sales of the third edition indicate that this trend has
continued - more and more instructors are realizing that students find
this book to be of immense value to their understanding of
econometrics.
What is it about this book that students have found to be of such value?
This book supplements econometrics texts, at all levels, by providing an
overview of the subject and an intuitive feel for its concepts and
techniques, without the usual clutter of notation and technical detail
that necessarily characterize an econometrics textbook. It is often said
of econometrics textbooks that their readers miss the forest for the
trees. This is inevitable - the terminology and techniques that must be
taught do not allow the text to convey a proper intuitive sense of
"What's it all about?" and "How does it all fit together?" All
econometrics textbooks fail to provide this overview. This is not from
lack of trying - most textbooks have excellent passages containing the
relevant insights and interpretations. They make good sense to
instructors, but they do not make the expected impact on the students.
Why? Because these insights and interpretations are broken up,
appearing throughout the book, mixed with the technical details. In their
struggle to keep up with notation and to learn these technical details,
students miss the overview so essential to a real understanding of those
details. This book provides students with a perspective from which it is
possible to assimilate more easily the details of these textbooks.
Although the changes from the third edition are numerous, the basic
structure and flavor of the book remain unchanged. Following an
introductory chapter, the second chapter discusses at some length the
criteria for choosing estimators, and in doing so develops many of the
basic concepts used throughout the book. The third chapter provides an
overview of the subject matter, presenting the five assumptions of the
classical linear regression model and explaining how most problems
encountered in econometrics can be interpreted as a violation of one of
these assumptions. The fourth chapter exposits some concepts of
inference to
provide a foundation for later chapters. Chapter 5 discusses general
approaches to the specification of an econometric model, setting the
stage for the next six chapters, each of which deals with violations of an
assumption of the classical linear regression model, describes their
implications, discusses relevant tests, and suggests means of resolving
resulting estimation problems. The remaining eight chapters and
Appendices A, B and C address selected topics. Appendix D provides
some student exercises and Appendix E offers suggested answers to the
even-numbered exercises. A set of suggested answers to odd-numbered
questions is available from the publisher upon request to instructors
adopting this book for classroom use.
There are several major changes in this edition. The chapter on
qualitative and limited dependent variables was split into a chapter on
qualitative dependent variables (adding a section on count data) and a
chapter on limited dependent variables (adding a section on duration
models). The time series chapter has been extensively revised to
incorporate the huge amount of work done in this area since the third
edition. A new appendix on the sampling distribution concept has been
added, to deal with what I believe is students' biggest stumbling block to
understanding econometrics. In the exercises, a new type of question
has been added, in which a Monte Carlo study is described and students
are asked to explain the expected results. New material has been added
to a wide variety of topics such as bootstrapping, generalized method of
moments, neural nets, linear structural relations, VARs, and
instrumental variable estimation. Minor changes have been made
throughout to update results and references, and to improve exposition.
To minimize readers' distractions, there are no footnotes. All references,
peripheral points and details worthy of comment are relegated to a
section at the end of each chapter entitled "General Notes". The
technical material that appears in the book is placed in end-of-chapter
sections entitled "Technical Notes". This technical material continues to
be presented in a way that supplements rather than duplicates the
contents of traditional textbooks. Students should find that this material
provides a useful introductory bridge to the more sophisticated
presentations found in the main text. Students are advised to wait until a
second or third reading of the body of a chapter before addressing the
material in the General or Technical Notes. A glossary explains
common econometric terms not found in the body of this book.
Errors in or shortcomings of this book are my responsibility, but for
improvements I owe many debts, mainly to scores of students, both
graduate and undergraduate, whose comments and reactions have
played a prominent role in shaping this fourth edition. Jan Kmenta and
Terry Seaks have made major contributions in their role as
"anonymous" referees, even though I have not always followed their
advice. I continue to be grateful to students throughout the world who
have expressed thanks to me for writing this book; I hope this fourth
edition continues to be of value to students both during and after their
formal course-work.
Dedication
To ANNA and RED
who, until they discovered what an econometrician was, were
very impressed that their son might become one. With apologies to
K. A. C. Manderville, I draw their attention to the following, adapted from
The Undoing of Lamia Gurdleneck.
"You haven't told me yet," said Lady Nuttal, "what it is your fiancé does for a
living."
"He's an econometrician," replied Lamia, with an annoying sense of being on
the defensive.
Lady Nuttal was obviously taken aback. It had not occurred to her that
econometricians entered into normal social relationships. The species, she would have
surmised, was perpetuated in some collateral manner, like mules.
"But Aunt Sara, it's a very interesting profession," said Lamia warmly.
"I don't doubt it," said her aunt, who obviously doubted it very much. "To
express anything important in mere figures is so plainly impossible that there
must be endless scope for well-paid advice on how to do it. But don't you think
that life with an econometrician would be rather, shall we say, humdrum?"
Lamia was silent. She felt reluctant to discuss the surprising depth of
emotional possibility which she had discovered below Edward's numerical veneer.
"It's not the figures themselves," she said finally, "it's what you do with them
that matters."
1 Introduction
1.1 What is Econometrics?
Strange as it may seem, there does not exist a generally accepted
answer to this question. Responses vary from the silly "Econometrics is
what econometricians do" to the staid "Econometrics is the study of the
application of statistical methods to the analysis of economic
phenomena," with sufficient disagreements to warrant an entire journal
article devoted to this question (Tintner, 1953).
This confusion stems from the fact that econometricians wear many
different hats. First, and foremost, they are economists, capable of
utilizing economic theory to improve their empirical analyses of the
problems they address. At times they are mathematicians, formulating
economic theory in ways that make it appropriate for statistical testing.
At times they are accountants, concerned with the problem of finding
and collecting economic data and relating theoretical economic
variables to observable ones. At times they are applied statisticians,
spending hours with the computer trying to estimate economic
relationships or predict economic events. And at times they are
theoretical statisticians, applying their skills to the development of
statistical techniques appropriate to the empirical problems
characterizing the science of economics. It is to the last of these roles
that the term "econometric theory" applies, and it is on this aspect of
econometrics that most textbooks on the subject focus. This guide is
accordingly devoted to this "econometric theory" dimension of
econometrics, discussing the empirical problems typical of economics
and the statistical techniques used to overcome these problems.
What distinguishes an econometrician from a statistician is the former's
pre-occupation with problems caused by violations of statisticians'
standard assumptions; owing to the nature of economic relationships
and the lack of controlled experimentation, these assumptions are
seldom met. Patching up statistical methods to deal with situations
frequently encountered in empirical work in economics has created a
large battery of extremely sophisticated statistical techniques. In fact,
econometricians are often accused of using sledgehammers to crack
open peanuts while turning a blind eye to data deficiencies and the
many
questionable assumptions required for the successful application of
these techniques. Valavanis has expressed this feeling forcefully:
Econometric theory is like an exquisitely balanced French
recipe, spelling out precisely with how many turns to mix the
sauce, how many carats of spice to add, and for how many
milliseconds to bake the mixture at exactly 474 degrees of
temperature. But when the statistical cook turns to raw
materials, he finds that hearts of cactus fruit are unavailable, so
he substitutes chunks of cantaloupe; where the recipe calls for
vermicelli he uses shredded wheat; and he substitutes green
garment dye for curry, ping-pong balls for turtle's eggs, and, for
Chalifougnac vintage 1883, a can of turpentine. (Valavanis,
1959, p. 83)
How has this state of affairs come about? One reason is that prestige in
the econometrics profession hinges on technical expertise rather than on
hard work required to collect good data:
It is the preparation skill of the econometric chef that catches
the professional eye, not the quality of the raw materials in the
meal, or the effort that went into procuring them. (Griliches,
1994, p. 14)
Criticisms of econometrics along these lines are not uncommon.
Rebuttals cite improvements in data collection, extol the fruits of the
computer revolution and provide examples of improvements in
estimation due to advanced techniques. It remains a fact, though, that in
practice good results depend as much on the input of sound and
imaginative economic theory as on the application of correct statistical
methods. The skill of the econometrician lies in judiciously mixing these
two essential ingredients; in the words of Malinvaud:
The art of the econometrician consists in finding the set of
assumptions which are both sufficiently specific and sufficiently
realistic to allow him to take the best possible advantage of the
data available to him. (Malinvaud, 1966, p. 514)
Modern econometrics texts try to infuse this art into students by
providing a large number of detailed examples of empirical application.
This important dimension of econometrics texts lies beyond the scope of
this book. Readers should keep this in mind as they use this guide to
improve their understanding of the purely statistical methods of
econometrics.
1.2 The Disturbance Term
A major distinction between economists and econometricians is the
latter's concern with disturbance terms. An economist will specify, for
example, that consumption is a function of income, and write C = f(Y)
where C is consumption and Y is income. An econometrician will claim
that this relationship must also include a disturbance (or error) term,
and may alter the equation to read
C = f(Y) + e, where e (epsilon) is a disturbance term. Without the
disturbance term the relationship is said to be exact or deterministic;
with the disturbance term it is said to be stochastic.
The word "stochastic" comes from the Greek "stokhos," meaning a
target or bull's eye. A stochastic relationship is not always right on
target in the sense that it predicts the precise value of the variable being
explained, just as a dart thrown at a target seldom hits the bull's eye.
The disturbance term is used to capture explicitly the size of these
"misses" or "errors." The existence of the disturbance term is justified in
three main ways. (Note: these are not mutually exclusive.)
(1) Omission of the influence of innumerable chance events Although
income might be the major determinant of the level of consumption, it is
not the only determinant. Other variables, such as the interest rate or
liquid asset holdings, may have a systematic influence on consumption.
Their omission constitutes one type of specification error:
the nature of
the economic relationship is not correctly specified. In addition to these
systematic influences, however, are innumerable less systematic
influences, such as weather variations, taste changes, earthquakes,
epidemics and postal strikes. Although some of these variables may
have a significant impact on consumption, and thus should definitely be
included in the specified relationship, many have only a very slight,
irregular influence; the disturbance is often viewed as representing the
net influence of a large number of such small and independent causes.
(2) Measurement error It may be the case that the variable being
explained cannot be measured accurately, either because of data
collection difficulties or because it is inherently unmeasurable and a
proxy variable must be used in its stead. The disturbance term can in
these circumstances be thought of as representing this measurement
error. Errors in measuring the explaining variable(s) (as opposed to the
variable being explained) create a serious econometric problem,
discussed in chapter 9. The terminology errors in variables is also used
to refer to measurement errors.
(3) Human indeterminacy Some people believe that human behavior is
such that actions taken under identical circumstances will differ in a
random way. The disturbance term can be thought of as representing
this inherent randomness in human behavior.
Associated with any explanatory relationship are unknown constants,
called parameters, which tie the relevant variables into an equation. For
example, the relationship between consumption and income could be
specified as C = b1 + b2Y + e,
where b1 and b2 are the parameters characterizing this consumption
function. Economists are often keenly interested in learning the values
of these unknown parameters.
The existence of the disturbance term, coupled with the fact that its
magnitude is unknown, makes calculation of these parameter values
impossible. Instead, they must be estimated. It is on this task, the
estimation of parameter values, that the bulk of econometric theory
focuses. The success of econometricians' methods of estimating
parameter values depends in large part on the nature of the disturbance
term; statistical assumptions concerning the characteristics of the
disturbance term, and means of testing these assumptions, therefore
play a prominent role in econometric theory.
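For readers who find a small numerical illustration helpful, the following sketch (in Python, with an entirely hypothetical consumption function C = 10 + 0.8Y and normally distributed disturbances) shows the difference between the deterministic and the stochastic versions of the relationship; only the latter resembles data actually observed.

    import numpy as np

    rng = np.random.default_rng(0)
    Y = np.linspace(100, 200, 25)          # hypothetical income data
    b1, b2 = 10.0, 0.8                     # hypothetical "true" parameter values
    e = rng.normal(0, 5, size=Y.shape)     # disturbance: net effect of many small influences

    C_exact = b1 + b2 * Y                  # exact (deterministic) relationship
    C = b1 + b2 * Y + e                    # stochastic relationship: what is actually observed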
1.3 Estimates and Estimators
In their mathematical notation, econometricians usually employ Greek
letters to represent the true, unknown values of parameters. The Greek
letter most often used in this context is beta (b). Thus, throughout this
book, b is used as the parameter value that the econometrician is
seeking to learn. Of course, no one ever actually learns the value of b,
but it can be estimated: via statistical techniques, empirical data can be
used to take an educated guess at b. In any particular application, an
estimate of b is simply a number. For example, b might be estimated as
16.2. But, in general, econometricians are seldom interested in
estimating a single parameter; economic relationships are usually
sufficiently complex to require more than one parameter, and because
these parameters occur in the same relationship, better estimates of
these parameters can be obtained if they are estimated together (i.e., the
influence of one explaining variable is more accurately captured if the
influence of the other explaining variables is simultaneously accounted
for). As a result, b seldom refers to a single parameter value; it almost
always refers to a set of parameter values, individually called b1, b2, . .
., bk where k is the number of different parameters in the set. b is then
referred to as a vector and is written as b = (b1, b2, . . ., bk).
In any particular application, an estimate of b will be a set of numbers.
For example, if three parameters are being estimated (i.e., if the
dimension of b is three), the estimate of b will be a particular set of three numbers.
In general, econometric theory focuses not on the estimate itself, but on
the estimator - the formula or "recipe" by which the data are
transformed into an actual estimate. The reason for this is that the
justification of an estimate computed
from a particular sample rests on a justification of the estimation
method (the estimator). The econometrician has no way of knowing the
actual values of the disturbances inherent in a sample of data;
depending on these disturbances, an estimate calculated from that
sample could be quite inaccurate. It is therefore impossible to justify the
estimate itself. However, it may be the case that the econometrician can
justify the estimator by showing, for example, that the estimator
"usually" produces an estimate that is "quite close" to the true
parameter value regardless of the particular sample chosen. (The
meaning of this sentence, in particular the meaning of "usually" and of
"quite close," is discussed at length in the next chapter.) Thus an
estimate of b from a particular sample is defended by justifying the
estimator.
Because attention is focused on estimators of b, a convenient way of
denoting those estimators is required. An easy way of doing this is to
place a mark over the b or a superscript on it. Thus b̂ (beta-hat) and b*
(beta-star) are often used to denote estimators of beta. One estimator,
the ordinary least squares (OLS) estimator, is very popular in
econometrics; the notation bOLS is used throughout this book to
represent it. Alternative estimators are denoted by b̂, b*, or something
similar. Many textbooks use the letter b to denote the OLS estimator.
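The distinction between an estimator and an estimate can be made concrete with a short sketch (Python, simulated data with hypothetical true parameters): the estimator is the formula, written here as a function of the sample; the estimate is the particular set of numbers that function returns for the sample at hand.

    import numpy as np

    def ols_estimator(x, y):
        # The estimator: a recipe that transforms any sample (x, y) into an estimate of b
        X = np.column_stack([np.ones_like(x), x])
        return np.linalg.solve(X.T @ X, X.T @ y)

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, 50)
    y = 2.0 + 3.0 * x + rng.normal(0, 1, 50)   # simulated data; hypothetical true b = (2, 3)

    b_estimate = ols_estimator(x, y)           # the estimate: just a pair of numbers for this sample
    print(b_estimate)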
1.4 Good and Preferred Estimators
Any fool can produce an estimator of b, since literally an infinite
number of them exists, i.e., there exists an infinite number of different
ways in which a sample of data can be used to produce an estimate of b,
all but a few of these ways producing "bad" estimates. What
distinguishes an econometrician is the ability to produce "good"
estimators, which in turn produce "good" estimates. One of these
"good" estimators could be chosen as the "best" or "preferred"
estimator and be used to generate the "preferred" estimate of b. What
further distinguishes an econometrician is the ability to provide "good"
estimators in a variety of different estimating contexts. The set of
"good" estimators (and the choice of "preferred" estimator) is not the
same in all estimating problems. In fact, a "good" estimator in one
estimating situation could be a "bad" estimator in another situation.
The study of econometrics revolves around how to generate a "good" or
the "preferred" estimator in a given estimating situation. But before the
"how to" can be explained, the meaning of "good" and "preferred" must
be made clear. This takes the discussion into the subjective realm: the
meaning of "good" or "preferred" estimator depends upon the subjective
values of the person doing the estimating. The best the econometrician
can do under these circumstances is to recognize the more popular
criteria used in this regard and generate estimators that meet one or
more of these criteria. Estimators meeting certain of these criteria could
be called "good" estimators. The ultimate choice of the "preferred"
estimator, however, lies in the hands of the person doing the estimating,
for it is
his or her value judgements that determine which of these criteria is the
most important. This value judgement may well be influenced by the
purpose for which the estimate is sought, in addition to the subjective
prejudices of the individual.
Clearly, our investigation of the subject of econometrics can go no
further until the possible criteria for a "good" estimator are discussed.
This is the purpose of the next chapter.
General Notes
1.1 What is Econometrics?
The term "econometrics" first came into prominence with the formation
in the early 1930s of the Econometric Society and the founding of the
journal
Econometrica. The introduction of Dowling and Glahe (1970)
surveys briefly the landmark publications in econometrics. Pesaran
(1987) is a concise history and overview of econometrics. Hendry and
Morgan (1995) is a collection of papers of historical importance in the
development of econometrics. Epstein (1987), Morgan (1990a) and Qin
(1993) are extended histories; see also Morgan (1990b). Hendry (1980)
notes that the word econometrics should not be confused with
"economystics," "economic-tricks," or "icon-ometrics."
The discipline of econometrics has grown so rapidly, and in so many
different directions, that disagreement regarding the definition of
econometrics has grown rather than diminished over the past decade.
Reflecting this, at least one prominent econometrician, Goldberger
(1989, p. 151), has concluded that "nowadays my definition would be
that econometrics is what econometricians do." One thing that
econometricians do that is not discussed in this book is serve as expert
witnesses in court cases. Fisher (1986) has an interesting account of this
dimension of econometric work. Judge et al. (1988, p. 81) remind
readers that "econometrics is fun!"
A distinguishing feature of econometrics is that it focuses on ways of
dealing with data that are awkward/dirty because they were not
produced by controlled experiments. In recent years, however,
controlled experimentation in economics has become more common.
Burtless (1995) summarizes the nature of such experimentation and
argues for its continued use. Heckman and Smith (1995) is a strong
defense of using traditional data sources. Much of this argument is
associated with the selection bias phenomenon (discussed in chapter 16)
- people in an experimental program inevitably are not a random
selection of all people, particularly with respect to their unmeasured
attributes, and so results from the experiment are compromised.
Friedman and Sunder (1994) is a primer on conducting economic
experiments. Meyer (1995) discusses the attributes of "natural"
experiments in economics.
Mayer (1993, chapter 10), Summers (1991), Brunner (1973), Rubner
(1970) and Streissler (1970) are good sources of cynical views of
econometrics, summed up dramatically by McCloskey (1994, p. 359) ".
. .most allegedly empirical research in economics is unbelievable,
uninteresting or both." More comments appear in this book in section
9.2 on errors in variables and chapter 18 on prediction. Fair (1973) and
Fromm and Schink (1973) are examples of studies defending the use of
sophisticated econometric techniques. The use of econometrics in the
policy context has been hampered
by the (inexplicable?) operation of "Goodhart's Law" (1978), namely
that all econometric models break down when used for policy. The
finding of Dewald et al. (1986), that there is a remarkably high
incidence of inability to replicate empirical studies in economics, does
not promote a favorable view of econometricians.
What has been the contribution of econometrics to the development of
economic science? Some would argue that empirical work frequently
uncovers empirical regularities which inspire theoretical advances. For
example, the difference between time-series and cross-sectional
estimates of the MPC prompted development of the relative, permanent
and life-cycle consumption theories. But many others view
econometrics with scorn, as evidenced by the following quotes:
We don't genuinely take empirical work seriously in economics.
It's not the source by which economists accumulate their
opinions, by and large. (Leamer in Hendry et al., 1990, p. 182);
Very little of what economists will tell you they know, and
almost none of the content of the elementary text, has been
discovered by running regressions. Regressions on government-
collected data have been used mainly to bolster one theoretical
argument over another. But the bolstering they provide is weak,
inconclusive, and easily countered by someone else's
regressions. (Bergmann, 1987, p. 192);
No economic theory was ever abandoned because it was
rejected by some empirical econometric test, nor was a clear cut
decision between competing theories made in light of the
evidence of such a test. (Spanos, 1986, p. 660); and
I invite the reader to try . . . to identify a meaningful hypothesis
about economic behavior that has fallen into disrepute because
of a formal statistical test. (Summers, 1991, p. 130)
This reflects the belief that economic data are not powerful enough to
test and choose among theories, and that as a result econometrics has
shifted from being a tool for testing theories to being a tool for
exhibiting/displaying theories. Because economics is a
non-experimental science, often the data are weak, and because of this
empirical evidence provided by econometrics is frequently
inconclusive; in such cases it should be qualified as such. Griliches
(1986) comments at length on the role of data in econometrics, and
notes that they are improving; Aigner (1988) stresses the potential role
of improved data.
Critics might choose to paraphrase the Malinvaud quote as "The art of
drawing a crooked line from an unproved assumption to a foregone
conclusion." The importance of a proper understanding of econometric
techniques in the face of a potential inferiority of econometrics to
inspired economic theorizing is captured nicely by Samuelson (1965, p.
9): "Even if a scientific regularity were less accurate than the intuitive
hunches of a virtuoso, the fact that it can be put into operation by
thousands of people who are not virtuosos gives it a transcendental
importance." This guide is designed for those of us who are not
virtuosos!
Feminist economists have complained that traditional econometrics
contains a male bias. They urge econometricians to broaden their
teaching and research methodology to encompass the collection of
primary data of different types, such as survey or interview data, and
the use of qualitative studies which are not based on the exclusive use
of "objective" data. See MacDonald (1995) and Nelson (1995). King,
Keohane and
Verba (1994) discuss how research using qualitative studies can meet
traditional scientific standards.
Several books focus on the empirical applications dimension of
econometrics. Some recent examples are Thomas (1993), Berndt (1991)
and Lott and Ray (1992). Manski (1991, p. 49) notes that "in the past,
advances in econometrics were usually motivated by a desire to answer
specific empirical questions. This symbiosis of theory and practice is
less common today." He laments that "the distancing of methodological
research from its applied roots is unhealthy."
1.2 The Disturbance Term
The error term associated with a relationship need not necessarily be
additive, as it is in the example cited. For some nonlinear functions it is
often convenient to specify the error term in a multiplicative form. In
other instances it may be appropriate to build the stochastic element
into the relationship by specifying the parameters to be random
variables rather than constants. (This is called the random-coefficients
model.)
Some econometricians prefer to define the relationship between C and Y
discussed earlier as "the mean of C conditional on Y is f(Y)," written as
E(C|Y) = f(Y). This spells out more explicitly what econometricians
have in mind when using this specification.
In terms of the throwing-darts-at-a-target analogy, characterizing
disturbance terms refers to describing the nature of the misses: are the
darts distributed uniformly around the bull's eye? Is the average miss
large or small? Does the average miss depend on who is throwing the
darts? Is a miss to the right likely to be followed by another miss to the
right? In later chapters the statistical specification of these
characteristics and the related terminology (such as "homoskedasticity"
and "autocorrelated errors") are explained in considerable detail.
1.3 Estimates and Estimators
An estimator is simply an algebraic function of a potential sample of
data; once the sample is drawn, this function creates an actual
numerical estimate.
Chapter 2 discusses in detail the means whereby an estimator is
"justified" and compared with alternative estimators.
1.4 Good and Preferred Estimators
The terminology "preferred" estimator is used instead of the term "best"
estimator because the latter has a specific meaning in econometrics.
This is explained in chapter 2.
Estimation of parameter values is not the only purpose of econometrics.
Two other major themes can be identified: testing of hypotheses and
economic forecasting. Because both these problems are intimately
related to the estimation of parameter values, it is not misleading to
characterize econometrics as being primarily concerned with parameter
estimation.
Technical Notes
1.1 What is Econometrics?
In the macroeconomic context, in particular in research on real business
cycles, a computational simulation procedure called calibration is often
employed as an alternative to traditional econometric analysis. In this
procedure economic theory plays a much more prominent role than
usual, supplying ingredients to a general equilibrium model designed to
address a specific economic question. This model is then "calibrated" by
setting parameter values equal to average values of economic ratios
known not to have changed much over time or equal to empirical
estimates from microeconomic studies. A computer simulation produces
output from the model, with adjustments to model and parameters made
until the output from these simulations has qualitative characteristics
(such as correlations between variables of interest) matching those of
the real world. Once this qualitative matching is achieved the model is
simulated to address the primary question of interest. Kydland and
Prescott (1996) is a good exposition of this approach.
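A toy sketch (Python; the "model" and all numbers are invented for illustration and bear no relation to any published calibration exercise) conveys the flavor of the procedure: a parameter is set so that the model's simulated output reproduces an average ratio taken from the data, and the calibrated model is then simulated to study the question of interest.

    import numpy as np

    rng = np.random.default_rng(2)
    observed_ratio = 0.6                   # an average ratio taken from the data (hypothetical)

    def simulate_model(s, n=10000):
        # Toy model: output y, and consumption c = s*y perturbed by shocks
        y = rng.lognormal(0.0, 0.1, n)
        c = s * y * np.exp(rng.normal(0.0, 0.05, n))
        return y, c

    def simulated_ratio(s):
        y, c = simulate_model(s)
        return np.mean(c / y)

    # "Calibrate": pick the parameter value whose simulated ratio best matches the observed one
    candidates = np.linspace(0.3, 0.9, 61)
    s_star = min(candidates, key=lambda s: abs(simulated_ratio(s) - observed_ratio))

    # The calibrated model is then simulated to address the question of interest,
    # for example the correlation between c and y that it implies
    y, c = simulate_model(s_star)
    print(s_star, np.corrcoef(y, c)[0, 1])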
Econometricians have not viewed this technique with favor, primarily
because there is so little emphasis on evaluating the quality of the
output using traditional testing/assessment procedures. Hansen and
Heckman (1996), a cogent critique, note (p. 90) that "Such models are
often elegant, and the discussions produced from using them are
frequently stimulating and provocative, but their empirical foundations
are not secure. What credibility should we attach to numbers produced
from their 'computational experiments,' and why should we use their
'calibrated models' as a basis for serious quantitative policy evaluation?"
King (1995) is a good comparison of econometrics and calibration.
2 Criteria for Estimators
2.1 Introduction
Chapter 1 posed the question, What is a "good" estimator? The aim of
this chapter is to answer that question by describing a number of criteria
that econometricians feel are measures of "goodness." These criteria are
discussed under the following headings:
(1) Computational cost
(2) Least squares
(3) Highest R2
(4) Unbiasedness
(5) Efficiency
(6) Mean square error
(7) Asymptotic properties
(8) Maximum likelihood
Since econometrics can be characterized as a search for estimators
satisfying one or more of these criteria, care is taken in the discussion of
the criteria to ensure that the reader understands fully the meaning of
the different criteria and the terminology associated with them. Many
fundamental ideas of econometrics, critical to the question, What's
econometrics all about?, are presented in this chapter.
2.2 Computational Cost
To anyone, but particularly to economists, the extra benefit associated
with choosing one estimator over another must be compared with its
extra cost, where cost refers to expenditure of both money and effort.
Thus, the computational ease and cost of using one estimator rather
than another must be taken into account whenever selecting an
estimator. Fortunately, the existence and ready availability of
high-speed computers, along with standard packaged routines for most
of the popular estimators, has made computational cost very low. As a
result, this criterion does not play as strong a role as it once did. Its
influence is now felt only when dealing with two kinds of estimators.
One is the case of an atypical estimation procedure for which there does
not exist a readily available packaged computer program and for which
the cost of programming is high. The second is an estimation method for
which the cost of running a packaged program is high because it needs
large quantities of computer time; this could occur, for example, when
using an iterative routine to find parameter estimates for a problem
involving several nonlinearities.
2.3 Least Squares
For any set of values of the parameters characterizing a relationship,
estimated values of the dependent variable (the variable being
explained) can be calculated using the values of the independent
variables (the explaining variables) in the data set. These estimated
values (called ŷ) of the dependent variable can be subtracted from the
actual values (y) of the dependent variable in the data set to produce
what are called the residuals (y - ŷ). These residuals could be thought
of as estimates of the unknown disturbances inherent in the data set.
This is illustrated in figure 2.1. The line labeled ŷ is the estimated
relationship corresponding to a specific set of values of the unknown
parameters. The dots represent actual observations on the dependent
variable y and the independent variable x. Each observation is a certain
vertical distance away from the estimated line, as pictured by the
double-ended arrows. The lengths of these double-ended arrows
measure the residuals. A different set of specific values of the
parameters would create a different estimating line and thus a different
set of residuals.
Figure 2.1 Minimizing the sum of squared residuals
It seems natural to ask that a "good" estimator be one that generates a
set of estimates of the parameters that makes these residuals "small."
Controversy arises, however, over the appropriate definition of "small."
Although it is agreed that the estimator should be chosen to minimize a
weighted sum of all these residuals, full agreement as to what the
weights should be does not exist. For example, those feeling that all
residuals should be weighted equally advocate choosing the estimator
that minimizes the sum of the absolute values of these residuals. Those
feeling that large residuals should be avoided advocate weighting large
residuals more heavily by choosing the estimator that minimizes the sum
of the squared values of these residuals. Those worried about misplaced
decimals and other data errors advocate placing a constant (sometimes
zero) weight on the squared values of particularly large residuals. Those
concerned only with whether or not a residual is bigger than some
specified value suggest placing a zero weight on residuals smaller than
this critical value and a weight equal to the inverse of the residual on
residuals larger than this value. Clearly a large number of alternative
definitions could be proposed, each with appealing features.
By far the most popular of these definitions of "small" is the
minimization of the sum of squared residuals. The estimator generating
the set of values of the parameters that minimizes the sum of squared
residuals is called the ordinary least squares estimator. It is referred to
as the OLS estimator and is denoted by bOLS in this book. This
estimator is probably the most popular estimator among researchers
doing empirical work. The reason for this popularity, however, does not
stem from the fact that it makes the residuals "small" by minimizing the
sum of squared residuals. Many econometricians are leery of this
criterion because minimizing the sum of squared residuals does not say
anything specific about the relationship of the estimator to the true
parameter value b that it is estimating. In fact, it is possible to be too
successful in minimizing the sum of squared residuals, accounting for so
many unique features of that particular sample that the estimator loses
its general validity, in the sense that, were that estimator applied to a
new sample, poor estimates would result. The great popularity of the
OLS estimator comes from the fact that in some estimating problems
(but not all!) it scores well on some of the other criteria, described
below, that are thought to be of greater importance. A secondary reason
for its popularity is its computational ease; all computer packages
include the OLS estimator for linear relationships, and many have
routines for nonlinear cases.
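The mechanics can be sketched in a few lines of Python (simulated data, hypothetical true parameters): the OLS estimator is the parameter vector that minimizes the sum of squared residuals, which for a linear relationship is given by the familiar closed-form solution to the normal equations.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 100)
    y = 1.0 + 2.0 * x + rng.normal(0, 1, 100)    # simulated data; hypothetical true b = (1, 2)
    X = np.column_stack([np.ones_like(x), x])    # regressors: an intercept and x

    def sum_of_squared_residuals(b):
        resid = y - X @ b                        # residuals for a candidate parameter vector b
        return resid @ resid

    # bOLS minimizes the sum of squared residuals; closed form: solve (X'X) b = X'y
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(b_ols, sum_of_squared_residuals(b_ols))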
Because the OLS estimator is used so much in econometrics, the
characteristics of this estimator in different estimating problems are
explored very thoroughly by all econometrics texts. The OLS estimator
always minimizes the sum of squared residuals; but it does not always
meet other criteria that econometricians feel are more important. As will
become clear in the next chapter, the subject of econometrics can be
characterized as an attempt to find alternative estimators to the OLS
estimator for situations in which the OLS estimator does
not meet the estimating criterion considered to be of greatest
importance in the problem at hand.
2.4 Highest R2
A statistic that appears frequently in econometrics is the coefficient of
determination, R2. It is supposed to represent the proportion of the
variation in the dependent variable "explained" by variation in the
independent variables. It does this in a meaningful sense in the case of a
linear relationship estimated by OLS. In this case it happens that the
sum of the squared deviations of the dependent variable about its mean
(the "total" variation in the dependent variable) can be broken into two
parts, called the "explained" variation (the sum of squared deviations of
the estimated values of the dependent variable around their mean) and
the "unexplained" variation (the sum of squared residuals). R2 is
measured either as the ratio of the "explained" variation to the "total"
variation or, equivalently, as 1 minus the ratio of the "unexplained"
variation to the "total" variation, and thus represents the percentage of
variation in the dependent variable "explained" by variation in the
independent variables.
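As a check on this definition, the following short sketch (Python, simulated data with hypothetical parameters) computes R2 both ways; for a linear relationship estimated by OLS with an intercept the two expressions give the same number.

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.uniform(0, 10, 100)
    y = 1.0 + 2.0 * x + rng.normal(0, 1, 100)     # simulated data; hypothetical parameters
    X = np.column_stack([np.ones_like(x), x])

    b_ols = np.linalg.solve(X.T @ X, X.T @ y)     # OLS estimates
    y_hat = X @ b_ols                             # estimated values of the dependent variable

    total = np.sum((y - y.mean()) ** 2)           # "total" variation
    explained = np.sum((y_hat - y.mean()) ** 2)   # "explained" variation
    unexplained = np.sum((y - y_hat) ** 2)        # "unexplained" variation (sum of squared residuals)

    print(explained / total, 1 - unexplained / total)   # both equal R2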
Because the OLS estimator minimizes the sum of squared residuals (the
"unexplained" variation), it automatically maximizes R2. Thus
maximization of R2, as a criterion for an estimator, is formally identical
to the least squares criterion, and as such it really does not deserve a
separate section in this chapter. It is given a separate section for two
reasons. The first is that the formal identity between the highest R2
criterion and the least squares criterion is worthy of emphasis. And the
second is to distinguish clearly the difference between applying R2 as a
criterion in the context of searching for a "good" estimator when the
functional form and included independent variables are known, as is the
case in the present discussion, and using R2 to help determine the
proper functional form and the appropriate independent variables to be
included. This latter use of R2, and its misuse, are discussed later in the
book (in sections 5.5 and 6.2).
2.5 Unbiasedness
Suppose we perform the conceptual experiment of taking what is called
a repeated sample: keeping the values of the independent variables
unchanged, we obtain new observations for the dependent variable by
drawing a new set of disturbances. This could be repeated, say, 2,000
times, obtaining 2,000 of these repeated samples. For each of these
repeated samples we could use an estimator b* to calculate an estimate
of b. Because the samples differ, these 2,000 estimates will not be the
same. The manner in which these estimates are distributed is called the
sampling distribution of b*. This is illustrated for the one-dimensional
case in figure 2.2, where the sampling distribution of the estimator is
labeled f(b*). It is simply the probability density function of b*,
approximated by using the 2,000 estimates of b to construct a histogram, which in turn
is used to approximate the relative frequencies of different estimates of
b from the estimator b*. The sampling distribution of an alternative
estimator, b̂, is also shown in figure 2.2.
Figure 2.2 Using the sampling distribution to illustrate bias
This concept of a sampling distribution, the distribution of estimates
produced by an estimator in repeated sampling, is crucial to an
understanding of econometrics. Appendix A at the end of this book
discusses sampling distributions at greater length. Most estimators are
adopted because their sampling distributions have "good" properties;
the criteria discussed in this and the following three sections are directly
concerned with the nature of an estimator's sampling distribution.
The first of these properties is unbiasedness. An estimator b* is said to
be an unbiased estimator of b if the mean of its sampling distribution is
equal to b, i.e., if the average value of b* in repeated sampling is b. The
mean of the sampling distribution of b* is called the expected value of
b* and is written Eb*; the bias of b* is the difference between Eb* and
b. In figure 2.2, b* is seen to be unbiased, whereas b̂ has a bias of size
(Eb̂ - b). The property of unbiasedness does not mean that b* = b; it
says only that, if we could undertake repeated sampling an infinite
number of times, we would get the correct estimate "on the average."
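The repeated-sample thought experiment is easily mimicked on a computer. In the sketch below (Python; the independent variable is held fixed, the hypothetical true slope is 2, and 2,000 repeated samples are drawn), the 2,000 slope estimates approximate the sampling distribution of bOLS, and their average lying close to 2 illustrates unbiasedness.

    import numpy as np

    rng = np.random.default_rng(4)
    x = np.linspace(0, 10, 30)                     # independent variable, held fixed across samples
    X = np.column_stack([np.ones_like(x), x])
    true_b = np.array([1.0, 2.0])                  # hypothetical true parameter vector

    slope_estimates = []
    for _ in range(2000):                          # 2,000 repeated samples
        y = X @ true_b + rng.normal(0, 1, x.size)  # a fresh set of disturbances each time
        b_star = np.linalg.solve(X.T @ X, X.T @ y)
        slope_estimates.append(b_star[1])

    slope_estimates = np.array(slope_estimates)
    print(slope_estimates.mean())                  # close to 2: the mean of the sampling distribution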
The OLS criterion can be applied with no information concerning how
the data were generated. This is not the case for the unbiasedness
criterion (and all other criteria related to the sampling distribution),
since this knowledge is required to construct the sampling distribution.
Econometricians have therefore
developed a standard set of assumptions (discussed in chapter 3)
concerning the way in which observations are generated. The general,
but not the specific, way in which the disturbances are distributed is an
important component of this. These assumptions are sufficient to allow
the basic nature of the sampling distribution of many estimators to be
calculated, either by mathematical means (part of the technical skill of
an econometrician) or, failing that, by an empirical means called a
Monte Carlo study, discussed in section 2.10.
Although the mean of a distribution is not necessarily the ideal measure
of its location (the median or mode in some circumstances might be
considered superior), most econometricians consider unbiasedness a
desirable property for an estimator to have. This preference for an
unbiased estimator stems from the hope that a particular estimate (i.e.,
from the sample at hand) will be close to the mean of the estimator's
sampling distribution. Having to justify a particular estimate on a "hope"
is not especially satisfactory, however. As a result, econometricians
have recognized that being centered over the parameter to be estimated
is only one good property that the sampling distribution of an estimator
can have. The variance of the sampling distribution, discussed next, is
also of great importance.
2.6 Efficiency
In some econometric problems it is impossible to find an unbiased
estimator. But whenever one unbiased estimator can be found, it is
usually the case that a large number of other unbiased estimators can
also be found. In this circumstance the unbiased estimator whose
sampling distribution has the smallest variance is considered the most
desirable of these unbiased estimators; it is called the best unbiased
estimator, or the efficient estimator among all unbiased estimators. Why
it is considered the most desirable of all unbiased estimators is easy to
visualize. In figure 2.3 the sampling distributions of two unbiased
estimators are drawn. The sampling distribution of the estimator b̂,
denoted f(b̂), is drawn "flatter" or "wider" than the sampling
distribution of b*, reflecting the larger variance of b̂. Although both
estimators would produce estimates in repeated samples whose average
would be b, the estimates from b̂ would range more widely and thus
would be less desirable. A researcher using b̂ would be less certain that
his or her estimate was close to b than would a researcher using b*.
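A small simulation makes the comparison concrete. In the sketch below (Python; normally distributed data with a hypothetical population mean of 5), the sample mean and the sample median are both unbiased estimators of the population mean, but repeated sampling shows that the mean has the smaller variance and is therefore the more efficient of the two.

    import numpy as np

    rng = np.random.default_rng(5)
    true_mean, n, reps = 5.0, 50, 2000

    means, medians = [], []
    for _ in range(reps):                          # repeated samples from the same population
        sample = rng.normal(true_mean, 2.0, n)
        means.append(sample.mean())
        medians.append(np.median(sample))

    means, medians = np.array(means), np.array(medians)
    print(means.mean(), medians.mean())            # both close to 5: both unbiased
    print(means.var(), medians.var())              # the mean's sampling distribution is tighter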
Sometimes reference is made to a criterion called "minimum variance."
This criterion, by itself, is meaningless. Consider the estimator b* = 5.2
(i.e., whenever a sample is taken, estimate b by 5.2 ignoring the
sample). This estimator has a variance of zero, the smallest possible
variance, but no one would use this estimator because it performs so
poorly on other criteria such as unbiasedness. (It is interesting to note,
however, that it performs exceptionally well on the computational cost
criterion!) Thus, whenever the minimum variance, or "efficiency,"
criterion is mentioned, there must exist, at least implicitly, some
additional constraint, such as unbiasedness, accompanying that
criterion.
Figure 2.3 Using the sampling distribution to illustrate efficiency
When the additional constraint accompanying the minimum variance criterion is
that the estimators under consideration be unbiased, the estimator is
referred to as the best unbiased estimator.
Unfortunately, in many cases it is impossible to determine
mathematically which estimator, of all unbiased estimators, has the
smallest variance. Because of this problem, econometricians frequently
add the further restriction that the estimator be a linear function of the
observations on the dependent variable. This reduces the task of finding
the efficient estimator to mathematically manageable proportions. An
estimator that is linear and unbiased and that has minimum variance
among all linear unbiased estimators is called the best linear unbiased
estimator (BLUE). The BLUE is very popular among econometricians.
This discussion of minimum variance or efficiency has been implicitly
undertaken in the context of a one-dimensional estimator, i.e., the case in
which b is a single number rather than a vector containing several
numbers. In the multidimensional case the variance of b̂ becomes a
matrix called the variance-covariance matrix of b̂. This creates special
problems in determining which estimator has the smallest variance. The
technical notes to this section discuss this further.
2.7 Mean Square Error (MSE)
Using the best unbiased criterion allows unbiasedness to play an
extremely strong role in determining the choice of an estimator, since
only unbiased estimators are considered.
Figure 2.4 MSE trades off bias and variance
It may well be the case that, by restricting
attention to only unbiased estimators, we are ignoring estimators that
are only slightly biased but have considerably lower variances. This
phenomenon is illustrated in figure 2.4. The sampling distribution of
b̂, the best unbiased estimator, is labeled f(b̂). b* is a biased estimator
with sampling distribution f(b*). It is apparent from figure 2.4 that,
although f(b*) is not centered over b, reflecting the bias of b*, it is
"narrower" than f(b̂), indicating a smaller variance. It should be clear
from the diagram that most researchers would probably choose the
biased estimator b* in preference to the best unbiased estimator b̂.
This trade-off between low bias and low variance is formalized by using
as a criterion the minimization of a weighted average of the bias and the
variance (i.e., choosing the estimator that minimizes this weighted
average). This is not a viable formalization, however, because the bias
could be negative. One way to correct for this is to use the absolute
value of the bias; a more popular way is to use its square. When the
estimator is chosen so as to minimize a weighted average of the
variance and the square of the bias, the estimator is said to be chosen on
the weighted square error criterion. When the weights are equal, the
criterion is the popular mean square error (MSE) criterion. The
popularity of the mean square error criterion comes from an alternative
derivation of this criterion: it happens that the expected value of a loss
function consisting of the square of the difference between b and its
estimate (i.e., the square of the estimation error) is the same as the sum
of the variance and the squared bias. Minimization of the expected
value of this loss function makes good intuitive sense as a criterion for
choosing an estimator.
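The equivalence referred to above follows from two lines of algebra, in the notation of this chapter (with V(b*) denoting the variance of b* and Eb* its expected value):

    E(b* - b)^2 = E[(b* - Eb*) + (Eb* - b)]^2
                = E(b* - Eb*)^2 + 2(Eb* - b)E(b* - Eb*) + (Eb* - b)^2
                = V(b*) + (square of the bias of b*)

since E(b* - Eb*) = 0 and (Eb* - b) is a constant.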
In practice, the MSE criterion is not usually adopted unless the best
unbiased criterion is unable to produce estimates with small variances.
The problem of multicollinearity, discussed in chapter 11, is an example
of such a situation.
2.8 Asymptotic Properties
The estimator properties discussed in sections 2.5, 2.6 and 2.7 above
relate to the nature of an estimator's sampling distribution. An unbiased
estimator, for example, is one whose sampling distribution is centered
over the true value of the parameter being estimated. These properties
do not depend on the size of the sample of data at hand: an unbiased
estimator, for example, is unbiased in both small and large samples. In
many econometric problems, however, it is impossible to find estimators
possessing these desirable sampling distribution properties in small
samples. When this happens, as it frequently does, econometricians may
justify an estimator on the basis of its
asymptotic properties - the nature
of the estimator's sampling distribution in extremely large samples.
The sampling distribution of most estimators changes as the sample size
changes. The sample mean statistic, for example, has a sampling
distribution that is centered over the population mean but whose
variance becomes smaller as the sample size becomes larger. In many
cases it happens that a biased estimator becomes less and less biased as
the sample size becomes larger and larger - as the sample size becomes
larger its sampling distribution changes, such that the mean of its
sampling distribution shifts closer to the true value of the parameter
being estimated. Econometricians have formalized their study of these
phenomena by structuring the concept of an asymptotic distribution
and defining desirable asymptotic or "large-sample properties" of an
estimator in terms of the character of its asymptotic distribution. The
discussion below of this concept and how it is used is heuristic (and not
technically correct); a more formal exposition appears in appendix C at
the end of this book.
Consider the sequence of sampling distributions of an estimator b̂, formed by calculating the sampling distribution of b̂ for successively larger sample sizes. If the distributions in this sequence become more and more similar in form to some specific distribution (such as a normal distribution) as the sample size becomes extremely large, this specific distribution is called the asymptotic distribution of b̂. Two basic estimator properties are defined in terms of the asymptotic distribution.
(1) If the asymptotic distribution of b̂ becomes concentrated on a particular value k as the sample size approaches infinity, k is said to be the probability limit of b̂ and is written plim b̂ = k; if plim b̂ = b, then b̂ is said to be consistent.
(2) The variance of the asymptotic distribution of b̂ is called the asymptotic variance of b̂; if b̂ is consistent and its asymptotic variance is smaller than the asymptotic variance of all other consistent estimators, b̂ is said to be asymptotically efficient.
Figure 2.5 How sampling distribution can change as the sample size grows
At considerable risk of oversimplification, the plim can be thought of as the large-sample equivalent of the expected value, and so plim b̂ = b is the large-sample equivalent of unbiasedness. Consistency can be crudely conceptualized as the large-sample equivalent of the minimum mean square error property, since a consistent estimator can be (loosely speaking) thought of as having, in the limit, zero bias and a zero variance. Asymptotic efficiency is the large-sample equivalent of best unbiasedness: the variance of an asymptotically efficient estimator goes to zero faster than the variance of any other consistent estimator.
Figure 2.5 illustrates the basic appeal of asymptotic properties. For sample size 20, the sampling distribution of b* is shown as f(b*)20. Since this sampling distribution is not centered over b, the estimator b* is biased. As shown in figure 2.5, however, as the sample size increases to 40, then 70 and then 100, the sampling distribution of b* shifts so as to be more closely centered over b (i.e., it becomes less biased), and it becomes less spread out (i.e., its variance becomes smaller). If b* were consistent, as the sample size increased to infinity the sampling distribution would shrink in width to a single vertical line, of infinite height, placed exactly at the point b.
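The following Python sketch (illustrative values only) mimics figure 2.5 numerically: the biased variance estimator Σ(x - x̄)2/T has a sampling distribution whose mean moves toward the true value and whose spread collapses as the sample size grows.

```python
# Illustrative sketch (made-up values, not from the text): the sampling
# distribution of the biased variance estimator sum((x - xbar)^2)/T
# becomes centred on the true value and collapses as T grows.
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0
for T in (20, 40, 70, 100, 500):
    draws = rng.normal(0.0, np.sqrt(true_var), size=(10000, T))
    est = draws.var(axis=1)          # divides by T, so biased downward
    print(T, round(est.mean(), 3), round(est.var(), 4))
# The mean approaches 4.0 and the variance of the estimator shrinks toward 0.
```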
It must be emphasized that these asymptotic criteria are only employed
in situations in which estimators with the traditional desirable small-
sample properties, such as unbiasedness, best unbiasedness and
minimum mean square error, cannot be found. Since econometricians
quite often must work with small samples, defending estimators on the
basis of their asymptotic properties is legitimate only if it is the case that
estimators with desirable asymptotic properties have more desirable
small-sample properties than do estimators without desirable asymptotic
properties. Monte Carlo studies (see section 2.10) have shown that in
general this supposition is warranted.
The message of the discussion above is that when estimators with
attractive small-sample properties cannot be found one may wish to
choose an estimator on the basis of its large-sample properties. There is
an additional reason for interest in asymptotic properties, however, of
equal importance. Often the derivation of small-sample properties of an
estimator is algebraically intractable, whereas derivation of large-sample
properties is not. This is because, as explained in the technical notes, the
expected value of a nonlinear function of a statistic is not the nonlinear
function of the expected value of that statistic, whereas the plim of a
nonlinear function of a statistic is equal to the nonlinear function of the
plim of that statistic.
These two features of asymptotics give rise to the following four
reasons for why asymptotic theory has come to play such a prominent
role in econometrics.
(1) When no estimator with desirable small-sample properties can be
found, as is often the case, econometricians are forced to choose
estimators on the basis of their asymptotic properties. An example is the
choice of the OLS estimator when a lagged value of the dependent
variable serves as a regressor. See chapter 9.
(2) Small-sample properties of some estimators are extraordinarily
difficult to calculate, in which case using asymptotic algebra can
provide an indication of what the small-sample properties of this
estimator are likely to be. An example is the plim of the OLS estimator
in the simultaneous equations context. See chapter 10.
(3) Formulas based on asymptotic derivations are useful approximations
to formulas that otherwise would be very difficult to derive and
estimate. An example is the formula in the technical notes used to
estimate the variance of a nonlinear function of an estimator.
(4) Many useful estimators and test statistics may never have been
found had it not been for algebraic simplifications made possible by
asymptotic algebra. An example is the development of LR, W and LM
test statistics for testing nonlinear restrictions. See chapter 4.
Figure 2.6
Maximum likelihood estimation
2.9 Maximum Likelihood
The maximum likelihood principle of estimation is based on the idea
that the sample of data at hand is more likely to have come from a "real
world" characterized by one particular set of parameter values than
from a "real world" characterized by any other set of parameter values.
The maximum likelihood estimate (MLE) of a vector of parameter
values b is simply the particular vector bMLE that gives the greatest
probability of obtaining the observed data.
This idea is illustrated in figure 2.6. Each of the dots represents an
observation on x drawn at random from a population with mean m and
variance s2. Pair A of parameter values, mA and (s2)A, gives rise in
figure 2.6 to the probability density function A for x while the pair B,
mB and (s2)B, gives rise to probability density function B. Inspection of
the diagram should reveal that the probability of having obtained the
sample in question if the parameter values were mA and (s2)A is very
low compared with the probability of having obtained the sample if the
parameter values were mB and (s2)B. On the maximum likelihood
principle, pair B is preferred to pair A as an estimate of m and s2. The
maximum likelihood estimate is the particular pair of values mMLE and
(s2)MLE that creates the greatest probability of having obtained the
sample in question; i.e., no other pair of values would be preferred to
this maximum likelihood pair, in the sense that pair B is preferred to
pair A. The means by which the econometrician finds this maximum likelihood estimate is discussed briefly in the technical notes to this section.
In addition to its intuitive appeal, the maximum likelihood estimator has
several desirable asymptotic properties. It is asymptotically unbiased, it
is consistent, it is asymptotically efficient, it is distributed
asymptotically normally, and its asymptotic variance can be found via a
standard formula (the Cramer-Rao lower bound - see the technical
notes to this section). Its only major theoretical drawback is that in
order to calculate the MLE the econometrician must assume
a specific (e.g., normal) distribution for the error term. Most
econometricians seem willing to do this.
These properties make maximum likelihood estimation very appealing
for situations in which it is impossible to find estimators with desirable
small-sample properties, a situation that arises all too often in practice.
In spite of this, however, until recently maximum likelihood estimation
has not been popular, mainly because of high computational cost.
Considerable algebraic manipulation is required before estimation, and
most types of MLE problems require substantial input preparation for
available computer packages. But econometricians' attitudes to MLEs
have changed recently, for several reasons. Advances in computers and
related software have dramatically reduced the computational burden.
Many interesting estimation problems have been solved through the use
of MLE techniques, rendering this approach more useful (and in the
process advertising its properties more widely). And instructors have
been teaching students the theoretical aspects of MLE techniques,
enabling them to be more comfortable with the algebraic manipulations
it requires.
2.10 Monte Carlo Studies
A Monte Carlo study is a simulation exercise designed to shed light on
the small-sample properties of competing estimators for a given
estimating problem. Such studies are called upon whenever, for that particular
problem, there exist potentially attractive estimators whose small-
sample properties cannot be derived theoretically. Estimators with
unknown small-sample properties are continually being proposed in the
econometric literature, so Monte Carlo studies have become quite
common, especially now that computer technology has made their
undertaking quite cheap. This is one good reason for having a good
understanding of this technique. A more important reason is that a
thorough understanding of Monte Carlo studies guarantees an
understanding of the repeated sample and sampling distribution
concepts, which are crucial to an understanding of econometrics.
Appendix A at the end of this book has more on sampling distributions
and their relation to Monte Carlo studies.
The general idea behind a Monte Carlo study is to (1) model the
data-generating process, (2) generate several sets of artificial data, (3)
employ these data and an estimator to create several estimates, and (4)
use these estimates to gauge the sampling distribution properties of that
estimator. This is illustrated in figure 2.7. These four steps are described
below.
(1) Model the data-generating process Simulation of the process
thought to be generating the real-world data for the problem at hand
requires building a model for the computer to mimic the data-generating
process, including its stochastic component(s). For example, it could be
specified that N (the sample size) values of X, Z and an error term
generate N values of Y according to Y = b1 + b2X + b3Z + e, where the
bi are specific, known numbers, the N values of X and Z are given, exogenous observations on explanatory variables, and the N values of e are drawn randomly from a normal distribution with mean zero and known variance s2. (Computers are capable of generating such random error terms.)
Figure 2.7 Structure of a Monte Carlo study
Any special features
thought to characterize the problem at hand must be built into this
model. For example, if b2 = b3-1 then the values of b2 and b3 must be
chosen such that this is the case. Or if the variance s2 varies from
observation to observation, depending on the value of Z, then the error
terms must be adjusted accordingly. An important feature of the study is
that all of the (usually unknown) parameter values are known to the
person conducting the study (because this person chooses these values).
(2) Create sets of data
With a model of the data-generating process
built into the computer, artificial data can be created. The key to doing
this is the stochastic element of the data-generating process. A sample
of size N is created by obtaining N values of the stochastic variable e
and then using these values, in conjunction with the rest of the model, to
generate N values of Y. This yields one complete sample of size N,
namely N observations on each of Y, X and Z, corresponding to the
particular set of N error terms drawn. Note that this artificially
generated set of sample data could be viewed as an example of
real-world data that a researcher would be faced with when dealing with
the kind of estimation problem this model represents. Note especially
that the set of data obtained depends crucially on the particular set of
error terms drawn. A different set of
error terms would create a different data set for the same problem.
Several of these examples of data sets could be created by drawing
different sets of N error terms. Suppose this is done, say, 2,000 times, generating 2,000 sets of sample data, each of sample size N. These are
called repeated samples.
(3) Calculate estimates Each of the 2,000 repeated samples can be used as data for an estimator b̂3, say, creating 2,000 estimates b̂3i (i = 1, 2, . . ., 2,000) of the parameter b3. These 2,000 estimates can be viewed as random "drawings" from the sampling distribution of b̂3.
(4) Estimate sampling distribution properties These 2,000 drawings from the sampling distribution of b̂3 can be used as data to estimate the properties of this sampling distribution. The properties of most interest are its expected value and variance, estimates of which can be used to estimate bias and mean square error.
(a) The expected value of the sampling distribution of b̂3 is estimated by the average of the 2,000 estimates b̂3i.
(b) The bias of b̂3 is estimated by subtracting the known true value of b3 from this average.
(c) The variance of the sampling distribution of b̂3 is estimated by applying the traditional formula for estimating a variance to the 2,000 estimates.
(d) The mean square error of b̂3 is estimated by the average of the squared differences between the b̂3i and the true value of b3.
At stage 3 above an alternative estimator b̃3 could also have been used to calculate 2,000 estimates. If so, the properties of the sampling distribution of b̃3 could also be estimated and then compared with those of the sampling distribution of b̂3. (Here b̂3 could be, for example, the ordinary least squares estimator and b̃3 any competing estimator such as an instrumental variable estimator, the least absolute error estimator or a generalized least squares estimator. These estimators are discussed in later chapters.) On the basis of this comparison, the person
conducting the Monte Carlo study may be in a position to recommend
one estimator in preference to another for the sample size N. By
repeating such a study for progressively greater values of N, it is
possible to investigate how quickly an estimator attains its asymptotic
properties.
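A minimal Python sketch of these four steps, using the relationship Y = b1 + b2X + b3Z + e from step (1), might look as follows; the sample size, parameter values and number of repeated samples are illustrative choices, not prescriptions.

```python
# A minimal Monte Carlo sketch of the four steps described above.
# The model, parameter values, and sample size are illustrative choices.
import numpy as np

rng = np.random.default_rng(42)
N, reps = 50, 2000
b1, b2, b3, sigma = 1.0, 0.5, 2.0, 3.0        # known to the experimenter

# (1) Model the data-generating process: fixed X, Z; normal errors.
X = rng.uniform(0, 10, N)
Z = rng.uniform(0, 10, N)
W = np.column_stack([np.ones(N), X, Z])       # regressor matrix

b3_hats = np.empty(reps)
for i in range(reps):
    # (2) Create a set of data by drawing a fresh set of N error terms.
    e = rng.normal(0, sigma, N)
    Y = b1 + b2 * X + b3 * Z + e
    # (3) Calculate the OLS estimate of b3 for this repeated sample.
    coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
    b3_hats[i] = coef[2]

# (4) Estimate the sampling distribution properties of the estimator.
print("average estimate:", b3_hats.mean())
print("estimated bias:  ", b3_hats.mean() - b3)
print("estimated var:   ", b3_hats.var(ddof=1))
print("estimated MSE:   ", np.mean((b3_hats - b3) ** 2))
```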
2.11 Adding Up
Because in most estimating situations there does not exist a "super-
estimator" that is better than all other estimators on all or even most of
these (or other) criteria, the ultimate choice of estimator is made by
forming an "overall judgement" of the desirableness of each available
estimator by combining the degree to which an estimator meets each of
these criteria with a subjective (on the part of the econometrician)
evaluation of the importance of each of these criteria. Sometimes an
econometrician will hold a particular criterion in very high esteem and
this will determine the estimator chosen (if an estimator meeting this
criterion can be found). More typically, other criteria also play a role in
the econometrician's choice of estimator, so that, for example, only
estimators with reasonable computational cost are considered. Among
these major criteria, most attention seems to be paid to the best
unbiased criterion, with occasional deference to the mean square error
criterion in estimating situations in which all unbiased estimators have
variances that are considered too large. If estimators meeting these
criteria cannot be found, as is often the case, asymptotic criteria are
adopted.
A major skill of econometricians is the ability to determine estimator
properties with regard to the criteria discussed in this chapter. This is
done either through theoretical derivations using mathematics, part of
the technical expertise of the econometrician, or through Monte Carlo
studies. To derive estimator properties by either of these means, the
mechanism generating the observations must be known; changing the
way in which the observations are generated creates a new estimating
problem, in which old estimators may have new properties and for
which new estimators may have to be developed.
The OLS estimator has a special place in all this. When faced with any
estimating problem, the econometric theorist usually checks the OLS
estimator first, determining whether or not it has desirable properties.
As seen in the next chapter, in some circumstances it does have
desirable properties and is chosen as the "preferred" estimator, but in
many other circumstances it does not have desirable properties and a
replacement must be found. The econometrician must investigate
whether the circumstances under which the OLS estimator is desirable
are met, and, if not, suggest appropriate alternative estimators.
(Unfortunately, in practice this is too often not done, with the OLS
estimator being adopted without justification.) The next chapter
explains how the econometrician orders this investigation.
General Notes
2.2 Computational Cost
Computational cost has been reduced significantly by the development
of extensive computer software for econometricians. The more
prominent of these are ET,
GAUSS, LIMDEP, Micro-FIT, PC-GIVE, RATS, SAS, SHAZAM,
SORITEC, SPSS, and TSP. The Journal of Applied Econometrics and
the Journal of Economic Surveys both publish software reviews
regularly. All these packages are very comprehensive, encompassing
most of the econometric techniques discussed in textbooks. For
applications they do not cover, in most cases specialized programs exist.
These packages should only be used by those well versed in
econometric theory, however. Misleading or even erroneous results can
easily be produced if these packages are used without a full
understanding of the circumstances in which they are applicable, their
inherent assumptions and the nature of their output; sound research
cannot be produced merely by feeding data to a computer and saying
SHAZAM.
Problems with the accuracy of computer calculations are ignored in
practice, but can be considerable. See Aigner (1971, pp. 99-101) and
Rhodes (1975). Quandt (1983) is a survey of computational problems
and methods in econometrics.
2.3 Least Squares
Experiments have shown that OLS estimates tend to correspond to the
average of laymen's "freehand" attempts to fit a line to a scatter of data.
See Mosteller et al. (1981).
In figure 2.1 the residuals were measured as the vertical distances from
the observations to the estimated line. A natural alternative to this
vertical measure is the orthogonal measure - the distance from the
observation to the estimating line along a line perpendicular to the
estimating line. This infrequently seen alternative is discussed in
Malinvaud (1966, pp. 7-11); it is sometimes used when measurement errors plague the data, as discussed in section 9.2.
2.4 Highest R2
R2 is called the coefficient of determination. It is the square of the correlation coefficient between y and its OLS estimate ŷ. The total variation of the dependent variable y about its mean, Σ(y - ȳ)2, is called SST (the total sum of squares); the "explained" variation, the sum of squared deviations of the estimated values of the dependent variable about their mean, is called SSR (the regression sum of squares); and the "unexplained" variation, the sum of squared residuals, is called SSE (the error sum of squares). R2 is then given by SSR/SST or by 1 - (SSE/SST).
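The following short Python sketch, using made-up data, confirms that for OLS with an intercept the two formulas SSR/SST and 1 - (SSE/SST) produce the same number.

```python
# Sketch (made-up data): for OLS with an intercept, SSR/SST = 1 - SSE/SST.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 2.0 + 1.5 * x + rng.normal(0, 3, 100)

X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ coef

SST = np.sum((y - y.mean()) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
print(SSR / SST, 1 - SSE / SST)   # identical (up to rounding error)
```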
What is a high R2? There is no generally accepted answer to this
question. In dealing with time series data, very high R2s are not
unusual, because of common trends. Ames and Reiter (1961) found, for
example, that on average the R2 of a relationship between a randomly
chosen variable and its own value lagged one period is about 0.7, and
that an R2 in excess of 0.5 could be obtained by selecting an economic
time series and regressing it against two to six other randomly selected
economic time series. For cross-sectional data, typical R2s are not
nearly so high.
The OLS estimator maximizes R2. Since the R2 measure is used as an
index of how well an estimator "fits" the sample data, the OLS
estimator is often called the "best-fitting" estimator. A high R2 is often called a "good fit."
Because the R2 and OLS criteria are formally identical, objections to
the latter apply
to the former. The most frequently voiced of these is that searching for
a good fit is likely to generate parameter estimates tailored to the
particular sample at hand rather than to the underlying "real world."
Further, a high R2 is not necessary for "good" estimates; R2 could be
low because of a high variance of the disturbance terms, and our
estimate of b could be "good" on other criteria, such as those discussed
later in this chapter.
The neat breakdown of the total variation into the "explained" and
"unexplained" variations that allows meaningful interpretation of the R2
statistic is valid only under three conditions. First, the estimator in
question must be the OLS estimator. Second, the relationship being
estimated must be linear. Thus the R2 statistic only gives the percentage
of the variation in the dependent variable explained linearly by
variation in the independent variables. And third, the linear relationship
being estimated must include a constant, or intercept, term. The
formulas for R2 can still be used to calculate an R2 for estimators other
than the OLS estimator, for nonlinear cases and for cases in which the
intercept term is omitted; it can no longer have the same meaning,
however, and could possibly lie outside the 01 interval. The zero
intercept case is discussed at length in Aigner (1971, pp. 8590). An
alternative R2 measure, in which the variations in y and are measured
as deviations from zero rather than their means, is suggested.
Running a regression without an intercept is the most common way of obtaining an R2 outside the 0-1 range. To see how this could happen, draw a scatter of points in (x,y) space with an estimated OLS line such that there is a substantial intercept. Now draw in the OLS line that would be estimated if it were forced to go through the origin. In both cases SST is identical (because the same observations are used). But in the second case the SSE and the SSR could be gigantic, because the residuals and the deviations (ŷ - ȳ) could be huge. Thus if R2 is calculated as 1 - SSE/SST, a negative number could result; if it is calculated as SSR/SST, a number greater than one could result.
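This can be verified with a few lines of Python (made-up data with a substantial intercept): forcing the fitted line through the origin drives 1 - SSE/SST below zero and SSR/SST above one.

```python
# Sketch (made-up data): forcing the line through the origin can push the
# two R-squared formulas outside the 0-1 range.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 50.0 - 1.0 * x + rng.normal(0, 2, 100)   # substantial intercept

slope = np.sum(x * y) / np.sum(x * x)        # OLS through the origin
y_hat = slope * x

SST = np.sum((y - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)
SSR = np.sum((y_hat - y.mean()) ** 2)
print(1 - SSE / SST)   # negative
print(SSR / SST)       # greater than one
```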
R2 is sensitive to the range of variation of the dependent variable, so
that comparisons of R2s must be undertaken with care. The favorite
example used to illustrate this is the case of the consumption function
versus the savings function. If savings is defined as income less
consumption, income will do exactly as well in explaining variations in
consumption as in explaining variations in savings, in the sense that the
sum of squared residuals, the unexplained variation, will be exactly the
same for each case. But in percentage terms, the unexplained variation
will be a higher percentage of the variation in savings than of the
variation in consumption because the latter are larger numbers. Thus the
R2 in the savings function case will be lower than in the consumption
function case. This reflects the result that the expected value of R2 is
approximately equal to b2V/(b2V + s2) where V is E(x - x̄)2.
In general, econometricians are interested in obtaining "good"
parameter estimates where "good" is not defined in terms of R2.
Consequently the measure R2 is not of much importance in
econometrics. Unfortunately, however, many practitioners act as though
it is important, for reasons that are not entirely clear, as noted by
Cramer (1987, p. 253):
These measures of goodness of fit have a fatal attraction.
Although it is generally conceded among insiders that they do
not mean a thing, high values are still a source of pride and
satisfaction to their authors, however hard they may try to
conceal these feelings.
Because of this, the meaning and role of R2 are discussed at some
length throughout this book. Section 5.5 and its general notes extend the
discussion of this section. Comments are offered in the general notes of
other sections when appropriate. For example, one should be aware that
R2 from two equations with different dependent variables should not be
compared, and that adding dummy variables (to capture seasonal
influences, for example) can inflate R2 and that regressing on group
means overstates R2 because the error terms have been averaged.
2.5 Unbiasedness
In contrast to the OLS and R2 criteria, the unbiasedness criterion (and
the other criteria related to the sampling distribution) says something
specific about the relationship of the estimator to b, the parameter being
estimated.
Many econometricians are not impressed with the unbiasedness
criterion, as our later discussion of the mean square error criterion will
attest. Savage (1954, p. 244) goes so far as to say: "A serious reason to
prefer unbiased estimates seems never to have been proposed." This
feeling probably stems from the fact that it is possible to have an
"unlucky" sample and thus a bad estimate, with only cold comfort from
the knowledge that, had all possible samples of that size been taken, the
correct estimate would have been hit on average. This is especially the
case whenever a crucial outcome, such as in the case of a matter of life
or death, or a decision to undertake a huge capital expenditure, hinges
on a single correct estimate. None the less, unbiasedness has enjoyed
remarkable popularity among practitioners. Part of the reason for this
may be due to the emotive content of the terminology: who can stand
up in public and state that they prefer biased estimators?
The main objection to the unbiasedness criterion is summarized nicely
by the story of the three econometricians who go duck hunting. The first
shoots about a foot in front of the duck, the second about a foot behind;
the third yells, "We got him!"
2.6 Efficiency
Often econometricians forget that although the BLUE property is
attractive, its requirement that the estimator be linear can sometimes be
restrictive. If the errors have been generated from a "fat-tailed"
distribution, for example, so that relatively high errors occur frequently,
linear unbiased estimators are inferior to several popular nonlinear
unbiased estimators, called robust estimators. See chapter 19.
Linear estimators are not suitable for all estimating problems. For
example, in estimating the variance s2 of the disturbance term,
quadratic estimators are more appropriate. The traditional formula
SSE/(T - K), where T is the number of observations and K is the number
of explanatory variables (including a constant), is under general
conditions the best quadratic unbiased estimator of s2. When K does not include the constant (intercept) term, this formula is written as SSE/(T - K - 1).
Although in many instances it is mathematically impossible to determine
the best unbiased estimator (as opposed to the best linear unbiased
estimator), this is not the case if the specific distribution of the error is
known. In this instance a lower bound, called the Cramer-Rao lower
bound, for the variance (or variance-covariance matrix)
of unbiased estimators can be calculated. Furthermore, if this lower
bound is attained (which is not always the case), it is attained by a
transformation of the maximum likelihood estimator (see section 2.9)
creating an unbiased estimator. As an example, consider the sample mean statistic x̄. Its variance, s2/T, is equal to the Cramer-Rao lower bound if the parent population is normal. Thus x̄ is the best unbiased estimator (whether linear or not) of the mean of a normal population.
2.7 Mean Square Error (MSE)
Preference for the mean square error criterion over the unbiasedness
criterion often hinges on the use to which the estimate is put. As an
example of this, consider a man betting on horse races. If he is buying
"win" tickets, he will want an unbiased estimate of the winning horse,
but if he is buying "show" tickets it is not important that his horse wins
the race (only that his horse finishes among the first three), so he will be
willing to use a slightly biased estimator of the winning horse if it has a
smaller variance.
The difference between the variance of an estimator and its MSE is that
the variance measures the dispersion of the estimator around its mean
whereas the MSE measures its dispersion around the true value of the
parameter being estimated. For unbiased estimators they are identical.
Biased estimators with smaller variances than unbiased estimators are easy to find. For example, if b̂ is an unbiased estimator with variance V(b̂), then 0.9b̂ is a biased estimator with variance 0.81V(b̂). As a more relevant example, consider the fact that, although SSE/(T - K) is the best quadratic unbiased estimator of s2, as noted in section 2.6, it can be shown that among quadratic estimators the minimum MSE estimator of s2 is SSE/(T - K + 2).
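A small simulation (illustrative design and parameter values) can be used to compare the bias and MSE of SSE/T, SSE/(T - K) and SSE/(T - K + 2) as estimators of s2; the Python sketch below exploits the fact that the OLS residuals equal Me, where M = I - X(X'X)^(-1)X'.

```python
# Sketch (made-up design): compare bias and MSE of three quadratic
# estimators of the error variance s2 in a small normal regression.
import numpy as np

rng = np.random.default_rng(5)
T, K, sigma2, reps = 20, 3, 4.0, 200000
X = np.column_stack([np.ones(T), rng.uniform(0, 10, T), rng.uniform(0, 10, T)])

# OLS residuals equal M @ e, where M = I - X(X'X)^(-1)X', so SSE can be
# simulated directly from the drawn error vectors.
M = np.eye(T) - X @ np.linalg.solve(X.T @ X, X.T)
E = rng.normal(0.0, np.sqrt(sigma2), size=(reps, T))
R = E @ M
SSE = np.sum(R * R, axis=1)

for name, denom in [("SSE/T", T), ("SSE/(T-K)", T - K), ("SSE/(T-K+2)", T - K + 2)]:
    est = SSE / denom
    print(name, "bias:", round(est.mean() - sigma2, 4),
          "MSE:", round(np.mean((est - sigma2) ** 2), 4))
# SSE/(T-K) is (essentially) unbiased; SSE/(T-K+2) achieves the smallest MSE.
```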
The MSE estimator has not been as popular as the best unbiased
estimator because of the mathematical difficulties in its derivation.
Furthermore, when it can be derived its formula often involves
unknown coefficients (the value of b), making its application
impossible. Monte Carlo studies have shown that approximating the
estimator by using OLS estimates of the unknown parameters can
sometimes circumvent this problem.
2.8 Asymptotic Properties
How large does the sample size have to be for estimators to display their
asymptotic properties? The answer to this crucial question depends on
the characteristics of the problem at hand. Goldfeld and Quandt (1972,
p. 277) report an example in which a sample size of 30 is sufficiently
large and an example in which a sample of 200 is required. They also
note that large sample sizes are needed if interest focuses on estimation
of estimator variances rather than on estimation of coefficients.
An observant reader of the discussion in the body of this chapter might
wonder why the large-sample equivalent of the expected value is
defined as the plim rather than being called the "asymptotic
expectation." In practice most people use the two terms synonymously,
but technically the latter refers to the limit of the expected value, which
is usually, but not always, the same as the plim. For discussion see the
technical notes to appendix C.
2.9 Maximum Likelihood
Note that
bMLE is not, as is sometimes carelessly stated, the most
probable value of b; the most probable value of b is b itself. (Only in a
Bayesian interpretation, discussed later in this book, would the former
statement be meaningful.) bMLE is simply the value of b that
maximizes the probability of drawing the sample actually obtained.
The asymptotic variance of the MLE is usually equal to the Cramer-Rao
lower bound, the lowest asymptotic variance that a consistent estimator
can have. This is why the MLE is asymptotically efficient.
Consequently, the variance (not just the asymptotic variance) of the
MLE is estimated by an estimate of the Cramer-Rao lower bound. The
formula for the Cramer-Rao lower bound is given in the technical notes
to this section.
Despite the fact that bMLE is sometimes a biased estimator of b
(although asymptotically unbiased), often a simple adjustment can be
found that creates an unbiased estimator, and this unbiased estimator
can be shown to be best unbiased (with no linearity requirement)
through the relationship between the maximum likelihood estimator and
the Cramer-Rao lower bound. For example, the maximum likelihood
estimator of the variance of a random variable x is given by the formula Σ(x - x̄)2/T, which is a biased (but asymptotically unbiased) estimator of the true variance. By multiplying this expression by T/(T - 1), this estimator can be transformed into a best unbiased estimator.
Maximum likelihood estimators have an invariance property similar to
that of consistent estimators. The maximum likelihood estimator of a
nonlinear function of a parameter is the nonlinear function of the
maximum likelihood estimator of that parameter: [g(b)]MLE = g(bMLE), where g is a nonlinear function. This greatly simplifies the
algebraic derivations of maximum likelihood estimators, making
adoption of this criterion more attractive.
Goldfeld and Quandt (1972) conclude that the maximum likelihood
technique performs well in a wide variety of applications and for
relatively small sample sizes. It is particularly evident, from reading
their book, that the maximum likelihood technique is well-suited to
estimation involving nonlinearities and unusual estimation problems.
Even in 1972 they did not feel that the computational costs of MLE
were prohibitive.
Application of the maximum likelihood estimation technique requires
that a specific distribution for the error term be chosen. In the context
of regression, the normal distribution is invariably chosen for this
purpose, usually on the grounds that the error term consists of the sum
of a large number of random shocks and thus, by the Central Limit
Theorem, can be considered to be approximately normally distributed.
(See Bartels, 1977, for a warning on the use of this argument.) A more
compelling reason is that the normal distribution is relatively easy to
work with. See the general notes to chapter 4 for further discussion. In
later chapters we encounter situations (such as count data and logit
models) in which a distribution other than the normal is employed.
Maximum likelihood estimates that are formed on the incorrect
assumption that the errors are distributed normally are called quasi-
maximum likelihood estimators. In
many circumstances they have the same asymptotic distribution as that
predicted by assuming normality, and often related test statistics retain
their validity (asymptotically, of course). See Godfrey (1988, p. 402) for
discussion.
Kmenta (1986, pp. 175-83) has a clear discussion of maximum likelihood estimation. A good brief exposition is in Kane (1968, pp. 177-80). Valavanis (1959, pp. 23-6), an econometrics text subtitled "An
Introduction to Maximum Likelihood Methods," has an interesting
account of the meaning of the maximum likelihood technique.
2.10 Monte Carlo Studies
In this author's opinion, understanding Monte Carlo studies is one of the
most important elements of studying econometrics, not because a
student may need actually to do a Monte Carlo study, but because an
understanding of Monte Carlo studies guarantees an understanding of
the concept of a sampling distribution and the uses to which it is put.
For examples and advice on Monte Carlo methods see Smith (1973) and
Kmenta (1986, chapter 2). Hendry (1984) is a more advanced
reference. Appendix A at the end of this book provides further
discussion of sampling distributions and Monte Carlo studies. Several
exercises in appendix D illustrate Monte Carlo studies.
If a researcher is worried that the specific parameter values used in the
Monte Carlo study may influence the results, it is wise to choose the
parameter values equal to the estimated parameter values using the data
at hand, so that these parameter values are reasonably close to the true
parameter values. Furthermore, the Monte Carlo study should be
repeated using nearby parameter values to check for sensitivity of the
results. Bootstrapping is a special Monte Carlo method designed to
reduce the influence of assumptions made about the parameter values
and the error distribution. Section 4.6 of chapter 4 has an extended
discussion.
The Monte Carlo technique can be used to examine test statistics as well
as parameter estimators. For example, a test statistic could be examined
to see how closely its sampling distribution matches, say, a chi-square.
In this context interest would undoubtedly focus on determining its size
(type I error for a given critical value) and power, particularly as
compared with alternative test statistics.
By repeating a Monte Carlo study for several different values of the
factors that affect the outcome of the study, such as sample size or
nuisance parameters, one obtains several estimates of, say, the bias of
an estimator. These estimated biases can be used as observations with
which to estimate a functional relationship between the bias and the
factors affecting the bias. This relationship is called a response surface.
Davidson and MacKinnon (1993, pp. 755-63) has a good exposition.
It is common to hold the values of the explanatory variables fixed
during repeated sampling when conducting a Monte Carlo study.
Whenever the values of the explanatory variables are affected by the
error term, such as in the cases of simultaneous equations, measurement
error, or the lagged value of a dependent variable serving as a regressor,
this is illegitimate and must not be done - the process generating the
data must be properly mimicked. But in other cases it is not obvious if
the explanatory variables should be fixed. If the sample exhausts the
population, such as would be the case for observations on all cities in
Washington state with population greater than 30,000, it would not
make sense to allow the explanatory variable values to change during
repeated sampling. On the other hand, if a sample of wage-earners is
drawn
from a very large potential sample of wage-earners, one could visualize
the repeated sample as encompassing the selection of wage-earners as
well as the error term, and so one could allow the values of the
explanatory variables to vary in some representative way during
repeated samples. Doing this allows the Monte Carlo study to produce
an estimated sampling distribution which is not sensitive to the
characteristics of the particular wage-earners in the sample; fixing the
wage-earners in repeated samples produces an estimated sampling
distribution conditional on the observed sample of wage-earners, which
may be what one wants if decisions are to be based on that sample.
2.11 Adding Up
Other, less prominent, criteria exist for selecting point estimates, some
examples of which follow.
(a) Admissibility An estimator is said to be admissible (with respect
to some criterion) if, for at least one value of the unknown b, it
cannot be beaten on that criterion by any other estimator.
(b) Minimax A minimax estimator is one that minimizes the
maximum expected loss, usually measured as MSE, generated by
competing estimators as the unknown b varies through its possible
values.
(c) Robustness An estimator is said to be robust if its desirable
properties are not sensitive to violations of the conditions under
which it is optimal. In general, a robust estimator is applicable to a
wide variety of situations, and is relatively unaffected by a small
number of bad data values. See chapter 19.
(d) MELO In the Bayesian approach to statistics (see chapter 13), a
decision-theoretic approach is taken to estimation; an estimate is
chosen such that it minimizes an expected loss function and is
called the MELO (minimum expected loss) estimator. Under
general conditions, if a quadratic loss function is adopted the mean
of the posterior distribution of b is chosen as the point estimate of b
and this has been interpreted in the non-Bayesian approach as
corresponding to minimization of average risk. (Risk is the sum of
the MSEs of the individual elements of the estimator of the vector
b.) See Zellner (1978).
(e) Analogy principle Parameters are estimated by sample statistics
that have the same property in the sample as the parameters do in
the population. See chapter 2 of Goldberger (1968b) for an
interpretation of the OLS estimator in these terms. Manski (1988)
gives a more complete treatment. This approach is sometimes
called the method of moments because it implies that a moment of
the population distribution should be estimated by the
corresponding moment of the sample. See the technical notes.
(f) Nearness/concentration Some estimators have infinite variances
and for that reason are often dismissed. With this in mind, Fiebig
(1985) suggests using as a criterion the probability of nearness (prefer b̂ to b* if the probability that b̂ is closer than b* to b exceeds one-half) or the probability of concentration (prefer b̂ to b* if b̂ is more likely than b* to lie within a given distance of b).
Two good introductory references for the material of this chapter are Kmenta (1986, pp. 9-16, 97-108, 156-72) and Kane (1968, chapter 8).
Technical Notes
2.5 Unbiasedness
The expected value of a variable x is defined formally as E(x) = ∫xf(x)dx, where f is the probability density function (sampling distribution) of x. Thus E(x) could be viewed as a weighted average of all possible values of x where the weights are proportional to the heights of the density function (sampling distribution) of x.
2.6 Efficiency
In this author's experience, student assessment of sampling distributions
is hindered, more than anything else, by confusion about how to
calculate an estimator's variance. This confusion arises for several
reasons.
(1) There is a crucial difference between a variance and an estimate
of that variance, something that often is not well understood.
(2) Many instructors assume that some variance formulas are
"common knowledge," retained from previous courses.
(3) It is frequently not apparent that the derivations of variance
formulas all follow a generic form.
(4) Students are expected to recognize that some formulas are
special cases of more general formulas.
(5) Discussions of variance, and appropriate formulas, are seldom
gathered together in one place for easy reference.
Appendix B has been included at the end of this book to alleviate this
confusion, supplementing the material in these technical notes.
In our discussion of unbiasedness, no confusion could arise from b
being
multidimensional: an estimator's expected value is either equal to b (in
every dimension) or it is not. But in the case of the variance of an
estimator confusion could arise. An estimator b* that is k-dimensional
really consists of k different estimators, one for each dimension of b.
These k different estimators all have their own variances. If all k of the variances associated with the estimator b* are smaller than their respective counterparts of the estimator b̂, then it is clear that the variance of b* can be considered smaller than the variance of b̂. For example, if b is two-dimensional, consisting of two separate parameters b1 and b2, an estimator b* would consist of two estimators b*1 and b*2. If b* were an unbiased estimator of b, b*1 would be an unbiased estimator of b1, and b*2 would be an unbiased estimator of b2. The estimators b*1 and b*2 would each have variances. Suppose their variances were 3.1 and 7.4, respectively. Now suppose b̂, consisting of b̂1 and b̂2, is another unbiased estimator, where b̂1 and b̂2 have variances 5.6 and 8.3, respectively. In this example, since the variance of b*1 is less than the variance of b̂1 and the variance of b*2 is less than the variance of b̂2, it is clear that the "variance" of b* is less than the "variance" of b̂. But what if the variance of b̂2 were 6.3 instead of 8.3? Then it is not clear which "variance" is smallest.
An additional complication exists in comparing the variances of
estimators of a multi-dimensional b. There may exist a nonzero
covariance between the estimators of the separate components of b. For
example, a positive covariance between b̂1 and b̂2 implies that, whenever b̂1 overestimates b1, there is a tendency for b̂2 to overestimate b2, making the complete estimate of b worse than would
be the case were this covariance zero. Comparison of the "variances" of
multidimensional estimators should therefore somehow account for this
covariance phenomenon.
The "variance" of a multidimensional estimator is called a variance-
covariance matrix. If b* is an estimator of k-dimensional b, then the
variance-covariance matrix of b*, denoted by V(b*), is defined as a k ×
k matrix (a table with k entries in each direction) containing the
variances of the k elements of b* along the diagonal and the covariances in the off-diagonal positions. Thus the kth diagonal element of V(b*) is V(b*k), the variance of the kth element of b*, and the ijth off-diagonal element is C(b*i, b*j), the covariance between b*i and b*j. All this variance-covariance matrix does is array the relevant variances and covariances in a table.
Once this is done, the econometrician can draw on mathematicians'
knowledge of matrix algebra to suggest ways in which the variance-
covariance matrix of one unbiased estimator could be considered
"smaller" than the variance-covariance matrix of another unbiased
estimator.
Consider four alternative ways of measuring smallness among variance-
covariance matrices, all accomplished by transforming the matrices into
single numbers and then comparing those numbers:
(1) Choose the unbiased estimator whose variance-covariance
matrix has the smallest trace (sum of diagonal elements);
(2) choose the unbiased estimator whose variance-covariance
matrix has the smallest determinant;
(3) choose the unbiased estimator for which any given linear
combination of its elements has the smallest variance;
(4) choose the unbiased estimator whose variance-covariance
matrix minimizes a risk function consisting of a weighted sum of
the individual variances and covariances. (A risk function is the
expected value of a traditional loss function, such as the square of
the difference between an estimate and what it is estimating.)
This last criterion seems sensible: a researcher can weight the variances
and covariances according to the importance he or she subjectively
feels their minimization should be given in choosing an estimator. It
happens that in the context of an unbiased estimator this risk function
can be expressed in an alternative form, as the expected value of a
quadratic function of the difference between the estimate and the true parameter value, i.e., E(b̂ - b)'Q(b̂ - b). This alternative interpretation also makes good intuitive sense as a choice criterion for use in the estimating context.
If the weights in the risk function described above, the elements of
Q,
are chosen so as to make it impossible for this risk function to be
negative (a reasonable request,
since if it were negative it would be a gain, not a loss), then a very
fortunate thing occurs. Under these circumstances all four of these
criteria lead to the same choice of estimator. What is more, this result
does not depend on the particular weights used in the risk function.
Although these four ways of defining a smallest matrix are reasonably
straightforward, econometricians have chosen, for mathematical
reasons, to use as their definition an equivalent but conceptually more
difficult idea. This fifth rule says, choose the unbiased estimator whose
variance-covariance matrix, when subtracted from the variance-
covariance matrix of any other unbiased estimator, leaves a
non-negative definite matrix. (A matrix A is non-negative definite if the
quadratic function formed by using the elements of A as parameters
(x'Ax) takes on only non-negative values. Thus to ensure a non-negative
risk function as described above, the weighting matrix Q must be
non-negative definite.)
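The following Python sketch applies these criteria to two made-up variance-covariance matrices (the diagonal variances echo the 3.1/7.4 and 5.6/8.3 example above; the covariances are invented): trace, determinant, the variance of an arbitrary linear combination, and the eigenvalues of the difference all point to the same ranking.

```python
# Sketch (made-up matrices): comparing two variance-covariance matrices by
# trace, determinant, an arbitrary linear combination, and the
# non-negative-definiteness of their difference.
import numpy as np

V1 = np.array([[3.1, 0.5],
               [0.5, 7.4]])   # variance-covariance matrix of one unbiased estimator
V2 = np.array([[5.6, 1.0],
               [1.0, 8.3]])   # variance-covariance matrix of a competitor

print("traces:      ", np.trace(V1), np.trace(V2))
print("determinants:", np.linalg.det(V1), np.linalg.det(V2))

w = np.array([2.0, -1.0])     # an arbitrary linear combination of the elements
print("variance of the combination:", w @ V1 @ w, w @ V2 @ w)

# Fifth rule: V2 - V1 should be non-negative definite (all eigenvalues >= 0)
# for the first estimator to be judged "smaller" on this criterion.
print("eigenvalues of V2 - V1:", np.linalg.eigvalsh(V2 - V1))
```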
Proofs of the equivalence of these five selection rules can be
constructed by consulting Rothenberg (1973, p. 8), Theil (1971, p. 121),
and Goldberger (1964, p. 38).
A special case of the risk function is revealing. Suppose we choose the
weighting such that the variance of any one element of the estimator
has a very heavy weight, with all other weights negligible. This implies
that each of the elements of the estimator with the "smallest" variance-
covariance matrix has individual minimum variance. (Thus, the example
given earlier of one estimator with individual variances 3.1 and 7.4 and
another with variances 5.6 and 6.3 is unfair; these two estimators could
be combined into a new estimator with variances 3.1 and 6.3.) This
special case also indicates that in general covariances play no role in
determining the best estimator.
2.7 Mean Square Error (MSE)
In the multivariate context the MSE criterion can be interpreted in
terms of the "smallest" (as defined in the technical notes to section 2.6)
MSE matrix. This matrix, given by the formula E(b̂ - b)(b̂ - b)', is a
natural matrix generalization of the MSE criterion. In practice, however,
this generalization is shunned in favor of the sum of the MSEs of all the
individual components of , a definition of risk that has come to be the
usual meaning of the term.
2.8 Asymptotic Properties
The econometric literature has become full of asymptotics, so much so
that at least one prominent econometrician, Leamer (1988), has
complained that there is too much of it. Appendix C of this book
provides an introduction to the technical dimension of this important
area of econometrics, supplementing the items that follow.
The reason for the important result that Eg(b̂) ≠ g(Eb̂) for g nonlinear is illustrated in figure 2.8. On the horizontal axis are measured values of b̂, the sampling distribution of which is portrayed by pdf(b̂), with values of g(b̂) measured on the vertical axis. Values A and B of b̂, equidistant from Eb̂, are traced to give g(A) and g(B). Note that g(B) is much farther from g(Eb̂) than is g(A): high values of b̂ lead to values of g(b̂) considerably above g(Eb̂), but low values of b̂ lead to values of g(b̂) only slightly below g(Eb̂). Consequently the sampling distribution of g(b̂) is asymmetric, as shown by pdf[g(b̂)], and in this example the expected value of g(b̂) lies above g(Eb̂).
Figure 2.8
Why the expected value of a nonlinear function is
not the nonlinear function of the expected value
If g were a linear function, the asymmetry portrayed in figure 2.8 would not arise and thus we would have Eg(b̂) = g(Eb̂). For g nonlinear, however, this result does not hold.
Suppose now that we allow the sample size to become very large, and suppose that plim b̂ exists and is equal to Eb̂ in figure 2.8. As the sample size becomes very large, the sampling distribution pdf(b̂) begins to collapse on plim b̂; i.e., its variance becomes very, very small. The points A and B are no longer relevant since values near them now occur with negligible probability. Only values of b̂ very, very close to plim b̂ are relevant; such values when traced through g(b̂) are very, very close to g(plim b̂). Clearly, the distribution of g(b̂) collapses on g(plim b̂) as the distribution of b̂ collapses on plim b̂. Thus plim g(b̂) = g(plim b̂), for g a continuous function.
For a simple example of this phenomenon, let g be the square function, so that g(b̂) = b̂2. From the well-known result that V(x) = E(x2) - [E(x)]2, we can deduce that E(b̂2) = (Eb̂)2 + V(b̂). Clearly E(b̂2) ≠ (Eb̂)2, but if the variance of b̂ goes to zero as the sample size goes to infinity then plim(b̂2) = (plim b̂)2. The case of b̂ equal to the sample mean statistic provides an easy example of this.
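The following Python sketch (illustrative values m = 2 and s2 = 9) demonstrates this for the sample mean: the simulated mean of x̄2 matches m2 + s2/T, so the bias of x̄2 as an estimator of m2 disappears as T grows.

```python
# Sketch: for the sample mean, E(xbar^2) exceeds (E xbar)^2 by V(xbar),
# and the gap disappears as the sample size grows.
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, reps = 2.0, 3.0, 20000
for T in (5, 50, 500):
    xbar = rng.normal(mu, sigma, size=(reps, T)).mean(axis=1)
    print(T, np.mean(xbar ** 2), mu ** 2 + sigma ** 2 / T)
# Both columns shrink toward mu^2 = 4 as T grows, illustrating
# plim(xbar^2) = (plim xbar)^2 even though E(xbar^2) != (E xbar)^2.
```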
Note that in figure 2.8 the modes, as well as the expected values, of the
two densities do not correspond. An explanation of this can be
constructed with the help of the "change of variable" theorem discussed
in the technical notes to section 2.9.
An approximate correction factor can be estimated to reduce the small-
sample bias discussed here. For example, suppose an estimator b̂ of b is distributed normally with mean b and variance V(b̂). Then exp(b̂) is distributed log-normally with mean exp[b + ½V(b̂)], suggesting that exp(b) could be estimated by exp[b̂ - ½V̂(b̂)], which, although biased, should have less bias than exp(b̂). If in this same example the original error were not distributed normally, so that b̂ was not distributed normally, a Taylor series expansion could be used to deduce an appropriate correction factor. Expand exp(b̂) around Eb̂ = b as exp(b) + (b̂ - b)exp(b) + ½(b̂ - b)2exp(b), plus higher-order terms which are neglected. Taking the expected value of both sides produces E exp(b̂) = exp(b)[1 + ½V(b̂)], suggesting that exp(b) could be estimated by exp(b̂)/[1 + ½V̂(b̂)].
For discussion and examples of these kinds of adjustments, see Miller
(1984), Kennedy (1981a, 1983) and Goldberger (1968a). An alternative
way of producing an estimate of a nonlinear function g(b) is to calculate
many values of g(b* + e), where e is an error with mean zero and
variance equal to the estimated variance of b*, and average them. For
more on this ''smearing" estimate see Duan (1983).
When g is a linear function, the variance of g(b̂) is given by the square of the slope of g times the variance of b̂; i.e., V(ax) = a2V(x). When g is a continuous nonlinear function its variance is more difficult to calculate. As noted above in the context of figure 2.8, when the sample size becomes very large only values of b̂ very, very close to plim b̂ are relevant, and in this range a linear approximation to g(b̂) is adequate. The slope of such a linear approximation is given by the first derivative of g with respect to b̂. Thus the asymptotic variance of g(b̂) is often calculated as the square of this first derivative times the asymptotic variance of b̂, with this derivative evaluated at b̂ = plim b̂ for the theoretical variance, and evaluated at b̂ for the estimated variance.
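The following Python sketch checks this approximation for the function exp, with b̂ drawn from a normal distribution with a small (made-up) variance: the simulated variance of exp(b̂) is close to the square of the derivative times the variance of b̂.

```python
# Sketch: the asymptotic-variance approximation for a nonlinear function,
# var[g(b_hat)] ~ [g'(b)]^2 * var(b_hat), for g = exp.
import numpy as np

rng = np.random.default_rng(9)
b, V, reps = 1.0, 0.01, 200000           # small variance: the approximation is good
b_hat = rng.normal(b, np.sqrt(V), reps)

g = np.exp(b_hat)
approx = (np.exp(b) ** 2) * V            # [dg/db]^2, evaluated at b, times V
print("simulated variance of g(b_hat):", g.var())
print("derivative-based approximation:", approx)
```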
2.9 Maximum Likelihood
The likelihood of a sample is often identified with the "probability" of
obtaining that sample, something which is, strictly speaking, not correct.
The use of this terminology is accepted, however, because of an implicit
understanding, articulated by Press et al. (1986, p. 500): "If the yi's take
on continuous values, the probability will always be zero unless we add
the phrase, '. . . plus or minus some fixed dy on each data point.' So let's
always take this phrase as understood."
The likelihood function is identical to the joint probability density
function of the given sample. It is given a different name (i.e., the name
"likelihood") to denote the fact that in this context it is to be
interpreted
as a function of the parameter values (since it is to be maximized with
respect to those parameter values) rather than, as is usually the case,
being interpreted as a function of the sample data.
The mechanics of finding a maximum likelihood estimator are explained
in most econometrics texts. Because of the importance of maximum
likelihood estimation in
the econometric literature, an example is presented here. Consider a
typical econometric problem of trying to find the maximum likelihood
estimator of the vector b in the relationship y = b1 + b2x + b3z + e, where T observations on y, x
and z are available.
(1) The first step is to specify the nature of the distribution of the
disturbance term e. Suppose the disturbances are identically and
independently distributed with probability density function f(e). For example, it could be postulated that e is distributed normally with mean zero and variance s2, so that f(e) = exp(-e2/2s2)/√(2πs2).
(2) The second step is to rewrite the given relationship as e = y - b1 - b2x - b3z, so that for the ith value of e we have ei = yi - b1 - b2xi - b3zi.
(3) The third step is to form the likelihood function, the formula for
the joint probability distribution of the sample, i.e., a formula
proportional to the probability of drawing the particular error terms
inherent in this sample. If the error terms are independent of each
other, this is given by the product of all the (e)s, one for each of the
T sample observations. For the example at hand, this creates the likelihood function as the product of the T densities f(ei), a complicated function of the sample data and the unknown parameters b1, b2 and b3, plus any unknown parameters inherent in the probability density function - in this case s2.
(4) The fourth step is to find the set of values of the unknown
parameters (b1, b2, b3 and s2), as functions of the sample data, that
maximize this likelihood function. Since the parameter values that
maximize L also maximize lnL, and the latter task is easier, attention
usually focuses on the log-likelihood function. In this example, lnL = -(T/2)ln(2πs2) - (1/2s2)Σ(yi - b1 - b2xi - b3zi)2. In some simple cases, such as this one, the maximizing values of this function (i.e., the MLEs) can be found using standard algebraic maximizing techniques. In
most cases, however, a numerical search technique (described in
section 6.3) must be employed to find the MLE.
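As an illustration of such a numerical search, the following Python sketch (simulated data, illustrative parameter values) maximizes the log-likelihood above using scipy.optimize; for this normal linear model the resulting coefficient estimates coincide with OLS and the MLE of s2 is SSE/T, which provides a check on the answer.

```python
# Sketch (simulated data): finding the MLE of (b1, b2, b3, s2) for the normal
# linear model y = b1 + b2*x + b3*z + e by numerical maximization of lnL.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(10)
T = 200
x, z = rng.uniform(0, 10, T), rng.uniform(0, 10, T)
y = 1.0 + 0.5 * x - 0.3 * z + rng.normal(0, 2.0, T)

def neg_loglik(params):
    b1, b2, b3, log_s2 = params   # parameterize s2 through its log to keep it positive
    s2 = np.exp(log_s2)
    e = y - b1 - b2 * x - b3 * z
    return 0.5 * T * np.log(2 * np.pi * s2) + 0.5 * np.sum(e ** 2) / s2

result = minimize(neg_loglik, x0=np.zeros(4), method="BFGS")
b1, b2, b3, log_s2 = result.x
print("MLE coefficients:", b1, b2, b3, "MLE of s2:", np.exp(log_s2))

# For this model the coefficient MLEs coincide with OLS, and the MLE of s2 is SSE/T.
X = np.column_stack([np.ones(T), x, z])
ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("OLS coefficients:", ols, "SSE/T:", np.sum((y - X @ ols) ** 2) / T)
```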
There are two circumstances in which the technique presented above
must be modified.
(1) Density of y not equal to density of e We have observations on
y, not e. Thus, the likelihood function should be structured from the
density of y, not the density of e. The technique described above
implicitly assumes that the density of y, f(y), is identical to f(e), the density of e with e replaced in this formula by y - Xb, but this is not necessarily the case. The probability of obtaining a value of e in the small range de is given by f(e)de; this implies an equivalent probability for y of f(y)|dy|, where f(y) is the density function of y and |dy| is the absolute value of the range of y values corresponding to de. Thus, because f(e)de = f(y)|dy|, we can calculate f(y) as f(e)|de/dy|.
In the example given above f(y) and f(e) are identical, since |de/dy| is one. But suppose our example were such that the dependent variable entered as y raised to the power l, where l is some (known or unknown) parameter. In this case |de/dy| is no longer one, and the likelihood function would become the product of the |de/dyi| terms multiplied by Q, where Q is the likelihood function of the original example, with each yi raised to the power l.
This method of finding the density of y when y is a function of
another variable e whose density is known, is referred to as the
change-of-variable technique. The multivariate analogue of |de/dy|
is the absolute value of the Jacobian of the transformation - the
determinant of the matrix of first derivatives of the vector e with
respect to the vector y. Judge et al. (1988, pp. 30-6) have a good
exposition.
(2) Observations not independent In the examples above, the
observations were independent of one another so that the density
values for each observation could simply be multiplied together to
obtain the likelihood function. When the observations are not
independent, for example if a lagged value of the regressand
appears as a regressor, or if the errors are autocorrelated, an
alternative means of finding the likelihood function must be
employed. There are two ways of handling this problem.
(a) Using a multivariate density A multivariate density function
gives the density of an entire vector of e rather than of just one
element of that vector (i.e., it gives the "probability" of obtaining
the entire set of ei). For example, the multivariate normal
density function for the vector e is given (in matrix terminology)
by the formula
f(e) = (2π)^(-T/2) |s2W|^(-1/2) exp[-e'(s2W)^(-1)e/2]
where s2W is the variance-covariance matrix of the vector e.
This formula itself can serve as the likelihood function (i.e.,
there is no need to multiply a set of densities together since this
formula has implicitly already done that, as well as taking
account of interdependencies among the data). Note that this
formula gives the density of the vector e, not the vector y. Since
what is required is the density of y, a multivariate adjustment
factor equivalent to the univariate |de/dy| used earlier is
necessary. This adjustment factor is |det de/dy| where de/dy is a
matrix containing in its ijth position the derivative of the ith
observation of e with respect to the jth observation of y. It is
called the Jacobian of the transformation from e to y. Watts
(1973) has a good explanation of the Jacobian.
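A minimal sketch (hypothetical function name) of evaluating this multivariate normal formula in logs, which is how it would typically enter a log-likelihood:

import numpy as np

def mvn_log_density(e, cov):
    # log of (2*pi)^(-T/2) |cov|^(-1/2) exp(-0.5 * e' cov^{-1} e),
    # where cov = s2*W is the variance-covariance matrix of the error vector e
    T = len(e)
    sign, logdet = np.linalg.slogdet(cov)
    quad = e @ np.linalg.solve(cov, e)
    return -0.5 * (T * np.log(2 * np.pi) + logdet + quad)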
(b) Using a transformation It may be possible to transform the
variables of the problem so as to be able to work with errors that
are independent. For example, suppose we have
yt = b1 + b2xt + b3zt + et
but e is such that et = r et-1 + ut where ut is a normally
distributed error with mean zero and variance s2u. The
es are not independent of one another, so the density for the
vector e cannot be formed by multiplying together all the
individual densities; the multivariate density formula given
earlier must be used, where W is a function of r and s2 is a
function of r and s2u. But the u errors are distributed
independently, so the density of the u vector can be formed by
multiplying together all the individual ut densities. Some
algebraic manipulation allows ut to be expressed as
ut = yt - r yt-1 - b1(1 - r) - b2(xt - r xt-1) - b3(zt - r zt-1).
(There is a special transformation for u1; see the technical notes
to section 8.3 where autocorrelated errors are discussed.) The
density of the y vector, and thus the required likelihood
function, is then calculated as the density of the u vector times
the Jacobian of the transformation from u to y. In the example at
hand, this second method turns out to be easier, since the first
method (using a multivariate density function) requires that the
determinant of W be calculated, a difficult task.
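The following sketch (hypothetical names; the parameters are taken as given rather than estimated, and |r| < 1 is assumed) shows how this transformation approach could be coded: the u's are formed by quasi-differencing, u1 gets the special sqrt(1 - r^2) treatment, and the Jacobian of the transformation contributes 0.5 ln(1 - r^2) to the log-likelihood.

import numpy as np

def log_likelihood_ar1(params, y, x, z):
    # With e_t = r*e_{t-1} + u_t, the quasi-differenced errors
    # u_t = e_t - r*e_{t-1} are independent N(0, s2u); the special treatment
    # of the first observation adds 0.5*ln(1 - r^2) to the log-likelihood.
    b1, b2, b3, s2u, r = params
    T = len(y)
    e = y - b1 - b2 * x - b3 * z
    u = np.empty(T)
    u[0] = np.sqrt(1 - r ** 2) * e[0]   # special transformation for u_1
    u[1:] = e[1:] - r * e[:-1]
    return (-0.5 * T * np.log(2 * np.pi * s2u)
            - np.sum(u ** 2) / (2 * s2u)
            + 0.5 * np.log(1 - r ** 2))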
Working through examples in the literature of the application of these
techniques is the best way to become comfortable with them and to
become aware of the uses to which MLEs can be put. To this end see
Beach and MacKinnon (1978a), Savin and White (1978), Lahiri and
Egy (1981), Spitzer (1982), Seaks and Layson (1983), and Layson and
Seaks (1984).
The Cramer-Rao lower bound is a matrix given by the formula
[-E(∂^2 lnL/∂q∂q')]^(-1)
where q is the vector of unknown parameters (including s2) for the
MLE estimates of which the Cramer-Rao lower bound is the asymptotic
variance-covariance matrix. Its estimation is accomplished by inserting
the MLE estimates of the unknown parameters. The inverse of the
Cramer-Rao lower bound is called the information matrix.
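In practice this estimation is often done by inverting the (negative) Hessian of the log-likelihood evaluated at the MLE, i.e., the observed rather than the expected second derivatives - a common substitution. A sketch (hypothetical helper, using finite differences; it can be applied to the neg_log_likelihood sketch given earlier):

import numpy as np

def asymptotic_cov(neg_log_likelihood, theta_mle, *args, h=1e-5):
    # Estimate the variance-covariance matrix of the MLE as the inverse of the
    # Hessian of -lnL evaluated at the MLE estimates, approximated here by
    # central finite differences.
    k = len(theta_mle)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            t = np.array(theta_mle, dtype=float)
            def f(di, dj):
                t2 = t.copy()
                t2[i] += di
                t2[j] += dj
                return neg_log_likelihood(t2, *args)
            H[i, j] = (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4 * h * h)
    return np.linalg.inv(H)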
If the disturbances are distributed normally, the MLE of s2
is SSE/T. Drawing on similar examples reported in preceding sections,
we see that the variance of a normally distributed
population can be estimated as SSE/(T - 1), SSE/T or SSE/(T + 1),
which are, respectively, the best unbiased estimator, the MLE, and the
minimum MSE estimator. Here SSE is Σ(x - x̄)^2.
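The trade-off among these three divisors can be seen in a small Monte Carlo sketch, in the spirit of section 2.10 (the sample size, true variance and replication count below are hypothetical):

import numpy as np

rng = np.random.default_rng(1)
T, true_var, reps = 10, 4.0, 50_000
estimates = {"T-1": [], "T": [], "T+1": []}
for _ in range(reps):
    x = rng.normal(0.0, np.sqrt(true_var), size=T)
    sse = np.sum((x - x.mean()) ** 2)
    estimates["T-1"].append(sse / (T - 1))   # best unbiased
    estimates["T"].append(sse / T)           # MLE
    estimates["T+1"].append(sse / (T + 1))   # minimum MSE
for name, vals in estimates.items():
    vals = np.asarray(vals)
    print(name, "bias:", vals.mean() - true_var,
          "MSE:", np.mean((vals - true_var) ** 2))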
2.11 Adding Up
The analogy principle of estimation is often called the method of
moments because typically moment conditions (such as that EX'e = 0,
the covariance between the explanatory variables and the error is zero)
are utilized to derive estimators using this technique. For example,
consider a variable x with unknown mean m. The mean m of x is the
first moment, so we estimate m by the first moment (the average) of the
data, x̄. This procedure is not always so easy. Suppose, for example,
that the density of x is given by f(x) = lx^(l-1) for 0 < x < 1 and zero
elsewhere. The expected value of x is l/(l + 1), so the method of
moments estimator l* of l is found by setting x̄ = l*/(l* + 1) and solving
to obtain l* = x̄/(1 - x̄). In general we are usually interested in estimating
several parameters and so will require as many of these moment
conditions as there are parameters to be estimated, in which case
finding estimates involves solving these equations simultaneously.
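As a minimal sketch of the single-parameter calculation just described (the data below are simulated from the assumed density with l = 3, so the estimate should land near 3):

import numpy as np

def lambda_mm(x):
    # Method-of-moments estimator for f(x) = lam * x**(lam - 1) on (0, 1):
    # equate the sample mean to the population mean lam/(lam + 1), solve for lam.
    xbar = np.mean(x)
    return xbar / (1.0 - xbar)

rng = np.random.default_rng(2)
x = rng.uniform(size=5000) ** (1 / 3)   # inverse-CDF draw, since F(x) = x**lam
print(lambda_mm(x))                      # should be close to 3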
Consider, for example, estimating a and b in y = a + bx + e. Because
e is specified to be an independent error, the expected value of the
product of x and e is zero, an "orthogonality" or "moment"
condition. This suggests that estimation could be based on setting
the average of the product of x and the residual e* = y - a* - b*x equal to zero,
where a* and b* are the desired estimates of a and b. Similarly, the
expected value of e (its first moment) is specified to be zero,
suggesting that estimation could be based on setting the average of
the e* equal to zero. This gives rise to two equations in two
unknowns:
Σ(y - a* - b*x) = 0 and Σx(y - a* - b*x) = 0,
which a reader might recognize as the normal equations of the
ordinary least squares estimator. It is not unusual for a method of
moments estimator to turn out to be a familiar estimator, a result
which gives it some appeal. Greene (1997, pp. 145-53) has a good
textbook exposition.
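A short sketch (the helper name mm_estimates is hypothetical) of solving these two sample moment conditions directly; because they are the OLS normal equations, the result matches what any least squares routine would give.

import numpy as np

def mm_estimates(y, x):
    # Solve the two sample moment conditions
    #   sum(y - a - b*x)     = 0
    #   sum(x*(y - a - b*x)) = 0
    # which are the normal equations of ordinary least squares.
    T = len(y)
    A = np.array([[T, np.sum(x)],
                  [np.sum(x), np.sum(x ** 2)]])
    c = np.array([np.sum(y), np.sum(x * y)])
    a_star, b_star = np.linalg.solve(A, c)
    return a_star, b_star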
This approach to estimation is straightforward so long as the number
of moment conditions is equal to the number of parameters to be
estimated. But what if there are more moment conditions than
parameters? In this case there will be more equations than
unknowns and it is not obvious how to proceed. The generalized
method of moments (GMM) procedure, described in the technical
notes of section 8.1, deals with this case.
3
The Classical Linear Regression Model
3.1 Textbooks as Catalogs
In chapter 2 we learned that many of the estimating criteria held in high
regard by econometricians (such as best unbiasedness and minimum
mean square error) are characteristics of an estimator's sampling
distribution. These characteristics cannot be determined unless a set of
repeated samples can be taken or hypothesized; to take or hypothesize
these repeated samples, knowledge of the way in which the
observations are generated is necessary. Unfortunately, an estimator
does not have the same characteristics for all ways in which the
observations can be generated. This means that a particular estimator
may have desirable properties in one estimating situation but not in
another. Because
there is no "superestimator" having desirable properties in all situations,
for each estimating problem (i.e., for each different way in which the
observations can be generated) the econometrician must determine
anew which estimator is preferred. An econometrics textbook can be
characterized as a catalog of which estimators are most desirable in
what estimating situations. Thus, a researcher facing a particular
estimating problem simply turns to the catalog to determine which
estimator is most appropriate for him or her to employ in that situation.
The purpose of this chapter is to explain how this catalog is structured.
The cataloging process described above is centered around a standard
estimating situation referred to as the classical linear regression model
(CLR model). It happens that in this standard situation the OLS
estimator is considered the optimal estimator. This model consists of
five assumptions concerning the way in which the data are generated.
By changing these assumptions in one way or another, different
estimating situations are created, in many of which the OLS estimator is
no longer considered to be the optimal estimator. Most econometric
problems can be characterized as situations in which one (or more) of