September 2003 — Volume 7, Number 2
Criterion-referenced Language Testing
James Dean Brown and Thom Hudson (2002)
Cambridge: Cambridge University Press
Pp. xvi + 320
Criterion-referenced Language Testing, by James Dean Brown and Thom Hudson, shows how criterion-referenced testing (CRT) can provide realistic and useful test development tools that will assist language teachers and language curriculum developers in their respective jobs. Over the past decades, CRT, which provides information about an individual’s mastery of a given criterion domain or ability level, has become an increasingly prominent issue in language assessment, especially in language achievement testing. The book addresses the wide variety of CRT and decision-making needs that more and more language-teaching professionals must consider in real-life testing situations. Each of the seven chapters contains a discussion of the theoretical and practical parameters involved in language testing situations. The book treats CRT at a simple statistical level: any reader who has taken an introductory statistics course will readily follow the concepts it presents. Since it assumes no previous technical knowledge of CRT as a mode of language testing, it provides a good introduction for laypersons to the issues surrounding language testing in general and CRT in particular.
To show the different phases of CRT, the authors of Criterion-referenced Language Testing take a focused approach to the issues involved in developing, implementing, and improving language tests under the criterion-referenced approach. In so doing, they explore what alternate paradigms are possible in language testing situations, what curriculum-related language testing is, what CRT items are, how basic descriptive and item statistics for CRT can be computed and interpreted, how reliability, dependability, and unidimensionality in CRT should be addressed, how the validity of CRT can be viewed, and how CRTs can be administered, how feedback can be given, and how results can be reported. [-1-]
In Chapter 1, entitled ‘Alternate paradigms,’ the authors identify the place of CRTs in language testing theory and research by examining what CRT is, what it can do, and how it relates to theoretical issues in language testing. Norm-referenced testing (NRT) has been researched extensively for many decades, but the last few decades have seen a surge of interest in CRT. The authors discuss the competing paradigms that NRT and CRT represent. In exploring the main question of what language tests are measuring, they consider the following four questions: “What makes language testing special? What is language proficiency? What is communicative language ability? What problems do CRT developers face?” (p. 15). These four sub-questions are linked to practical implications for CRT development and implementation. The chapter ends with the following four practical questions that CRT developers must face in serving the goals and objectives of CRT:
“1. How can item analysis be performed when: (a) no comparison group is designated as instructed or uninstructed group; (b) no externally identified masters and non-masters are defined; or (c) when mastery groups are defined and available?
2. How dependable are the decisions made on the basis of the test? How generalizable are the scores and analyses to those of other examinees on other forms of the test?
3. How can a standard, or cut-point, be rationally set?
4. What advantages and disadvantages accrue from application of the statistical approaches provided by NRT or CRT analyses?” (p. 27)
The subsequent chapters of this volume provide answers to these four questions as they arise in putting CRT into practice.
Chapter 2, entitled ‘Curriculum-related testing,’ first discusses the interrelationship between CRT and the curriculum. This chapter addresses how language testing is involved in needs analysis, goals and objectives, testing, materials, teaching, and evaluation, all of which function as components of language curriculum development. Then, by providing practical examples of both instructional and performance objectives, the chapter enumerates the relationships between the two types of objectives and CRT that language specialists may face in their testing situations. In this chapter, the authors also emphasize the value of the washback effect in CRT, which they link to the importance of drawing on multiple sources of information in language-related decision making. The remainder of the chapter is devoted to a comprehensive overview of how to adjust modes of assessment to the curriculum when there is a discrepancy between language curriculum and testing practice.
Chapter 3, ‘Criterion-referenced test items,’ offers caveats for constructing test item specifications, together with descriptions of what a test specification is and how it should be created. This is followed by a practical exploration of item quality and content analysis in relation to the problems we can expect in everyday test use in language programs. The chapter helps foster our ability to establish streamlined test specifications when implementing CRT. However, the discussion of test specifications is relatively unsophisticated in that it does not address situations in which “reverse engineering” (Davidson & Lynch, 2002, p. 41) is necessary to create test specifications from existing test items in language test development. As Davidson and Lynch argue, since not all language testing is specification-driven, further discussion of reverse engineering might have been a valuable channel for exploring some significant related topics: critical language testing, and certain philosophical stances on the use of tests and test change in relation to the language curriculum.
Chapter 4, entitled ‘Basic descriptive and item statistics for criterion-referenced tests,’ offers a detailed illustration of both NRT item statistics and criterion-referenced item analysis for describing and revising CRTs in light of their intended goals and objectives. The authors’ writing style is so straightforward that any motivated reader can learn how to interpret the results of item statistics for both NRT and CRT through a close reading of this chapter alone. It is generally, though not always, accepted that item response theory (IRT) has a statistical advantage over classical test theory for calibrating new items when constructing equivalent forms and for item banking. The authors do not miss this point, briefly discussing the practical applications of IRT in CRT construction. They also present, at a basic level, the multi-faceted Rasch model, which “locates an examinee’s ability and an item’s difficulty estimates on a common scale” (p. 145). However, their coverage of IRT and the multi-faceted Rasch model is so limited that readers may not grasp the overall features of these two topics.
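To give a flavor of the kind of criterion-referenced item analysis the chapter covers, the following minimal sketch computes two statistics commonly discussed in the CRT literature: the difference index (item facility after instruction minus item facility before instruction) and the B-index (item facility among passers minus item facility among failers, relative to a cut score). The data are invented for illustration; they are not the book’s own worked examples.

```python
# Hedged sketch of two CRT item statistics: the difference index (DI)
# and the B-index. All response data below are invented for illustration.

def item_facility(responses):
    """Proportion of examinees answering the item correctly (1 = correct)."""
    return sum(responses) / len(responses)

def difference_index(pretest, posttest):
    """DI = item facility on the posttest minus item facility on the pretest."""
    return item_facility(posttest) - item_facility(pretest)

def b_index(item_responses, total_scores, cut_score):
    """B = item facility among passers (total >= cut) minus that among failers."""
    passers = [r for r, t in zip(item_responses, total_scores) if t >= cut_score]
    failers = [r for r, t in zip(item_responses, total_scores) if t < cut_score]
    return item_facility(passers) - item_facility(failers)

# Ten examinees' responses to one item before and after instruction
pre  = [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
post = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]
print(round(difference_index(pre, post), 2))  # 0.6: item is sensitive to instruction

# Same item on the posttest, with each examinee's total score (out of 50)
totals = [42, 45, 38, 20, 41, 47, 39, 22, 44, 40]
print(b_index(post, totals, cut_score=30))    # 1.0: item separates masters well
```

A positive difference index suggests the item reflects what was taught, while a high B-index suggests it distinguishes masters from non-masters at the chosen cut score; both are interpreted against the test’s objectives rather than against a norm group.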
This volume may arguably be viewed as a slightly revised and expanded edition of Testing in Language Programs (Brown, 1996), published by Prentice Hall, in which James Dean Brown illustrated how to address proficiency, placement, diagnostic, and achievement tests, and how to design them for both program-level and classroom-level decisions. Readers familiar with the earlier book will find little difference between Testing in Language Programs and the first three and a half chapters of Criterion-referenced Language Testing. [-2-]
In Chapter 5, ‘Reliability, dependability, and unidimensionality,’ the authors address the three central issues involved in test consistency: reliability in NRT, dependability in CRT, and fit in IRT. Starting with a review of the traditional concepts of test reliability in NRT (test-retest reliability, equivalent-forms reliability, and internal-consistency reliability), the chapter carries the discussion on to threshold-loss methods and generalizability approaches to CRT dependability, highlighting their importance for making decisions based on CRT scores. The chapter also sets the stage for the treatment of validity in the chapter that follows. Because reliability indicates whether a measuring device measures a construct in the same way from context to context, any valid measure must first be reliable; unreliable measures obscure the construct they measure and hence may undermine validity. In that sense, the combined discussion of reliability and validity serves as a synthesis of the two issues for any research, including language testing.
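The threshold-loss idea mentioned above can be sketched concretely. In this hedged illustration (the scores and cut score are invented, not taken from the book), dependability is approached through the agreement coefficient, the proportion of examinees classified identically as master or non-master on two test administrations, and through Cohen’s kappa, which corrects that agreement for chance.

```python
# Hedged sketch of the threshold-loss approach to CRT dependability:
# the agreement coefficient (p0) and Cohen's kappa for master/non-master
# classifications across two forms. All data below are invented.

def classify(scores, cut):
    """True = master (score at or above the cut), False = non-master."""
    return [s >= cut for s in scores]

def agreement(scores1, scores2, cut):
    """p0: proportion of examinees classified identically on both forms."""
    c1, c2 = classify(scores1, cut), classify(scores2, cut)
    return sum(a == b for a, b in zip(c1, c2)) / len(c1)

def kappa(scores1, scores2, cut):
    """Agreement corrected for chance: (p0 - pc) / (1 - pc)."""
    c1, c2 = classify(scores1, cut), classify(scores2, cut)
    n = len(c1)
    p0 = sum(a == b for a, b in zip(c1, c2)) / n
    p1, p2 = sum(c1) / n, sum(c2) / n           # master proportions per form
    pc = p1 * p2 + (1 - p1) * (1 - p2)          # chance agreement
    return (p0 - pc) / (1 - pc)

# Ten examinees' scores on two forms of a 50-point test, cut score 30
form_a = [35, 42, 28, 45, 31, 22, 38, 40, 26, 44]
form_b = [37, 40, 31, 43, 29, 24, 36, 41, 27, 45]
print(agreement(form_a, form_b, cut=30))  # 0.8: same decision for 8 of 10
print(round(kappa(form_a, form_b, cut=30), 3))
```

The point of the threshold-loss view is visible here: dependability is about the consistency of the mastery decisions, not of the raw scores themselves.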
Chapter 6 introduces the ‘Validity of criterion-referenced tests’ from two perspectives: content validity and construct validity. The first comprises two approaches, theoretical arguments and expert judgments; the second comprises three kinds of studies: intervention studies, differential-group studies, and hierarchical-structural studies. Together with the examples provided in the chapter, these five validation strategies are presented to show how they can be applied in practice when running and maintaining language programs. Next, Messick’s (1988) and Cronbach’s (1988) expanded views of validity, which have led to a paradigm shift in the study of validity, are explored for the evidential and consequential bases of test interpretation and use, and for the functional, political, economic, and explanatory perspectives on test validity. The chapter thus serves as an introduction to a unified view of validity and an illustration of how these expanded validity concepts can be applied in CRT practice.
In covering standard setting, the authors also highlight some of the existing methods in the field of educational measurement and illuminate their applicability and utility for CRT in language program administration. Chapter 6 thereby differentiates this volume from other language testing books on the market that do not even mention the concept of standard setting. Given the stakes involved in program or school admissions, certification, personnel selection, and program evaluation, the discussion of this topic is an appropriate vehicle for leading readers to build up the concepts and procedures of standard setting.
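One widely used standard-setting procedure from educational measurement is the (modified) Angoff method, in which each judge estimates, for every item, the probability that a minimally competent examinee would answer it correctly, and the cut score is the sum of the item means. The sketch below illustrates the arithmetic with invented ratings; it is not an example taken from the book.

```python
# Hedged sketch of an Angoff-style cut-score computation.
# Judges and ratings are invented for illustration.

ratings = {  # judge -> per-item probability estimates for a 5-item test
    "judge_1": [0.6, 0.8, 0.5, 0.9, 0.7],
    "judge_2": [0.5, 0.7, 0.6, 0.8, 0.6],
    "judge_3": [0.7, 0.8, 0.4, 0.9, 0.8],
}

n_items = len(next(iter(ratings.values())))
# Mean rating per item across judges
item_means = [
    sum(judge[i] for judge in ratings.values()) / len(ratings)
    for i in range(n_items)
]
# The cut score is the expected raw score of a borderline ("minimally
# competent") examinee: the sum of the per-item means.
cut_score = sum(item_means)
print(round(cut_score, 2))  # 3.43 out of 5 items
```

In practice the resulting value would be rounded to a whole-number raw score and typically revisited after discussion among the judges, but the core computation is no more complicated than this.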
After the previous six chapters, however, an enduring challenge remains: how CRTs should be administered, how the results should be interpreted, and how they should be reported. The answers can be found in Chapter 7, entitled ‘Administering, giving feedback, and reporting on criterion-referenced tests.’ The authors provide practical suggestions based on their own experiences with real criterion-referenced assessment projects. The underlying logic of the CRT approach is to assess how much of the content in a course or program students have learned. Such assessment compares performance to well-defined criteria rather than to the performances of other students in a norm group. This connection of CRT to goals and objectives as particular standards or criteria ties it closely to a curriculum. Hence, this book on CRT should also be seen as a contribution to the notion of valuing both individual and contextual differences in pedagogical decision-making in language-related curriculum development.
To this reviewer, Criterion-referenced Language Testing is, overall, a well-written book that will appeal to upper-level undergraduate and graduate students preparing to become second language (L2) teaching professionals and L2 testing practitioners. The volume is also well suited to classroom teachers, language testing researchers, and curriculum developers who wish to develop new perspectives, maintain language programs, or conduct research in language testing from theory to practice. Providing a readable introduction to the issues surrounding CRT, the book guides readers in using the criterion-referenced approach to analyze language testing data and to construct systematic curriculum-related testing. Symbols and equations are well explained both graphically and verbally, and detailed examples and illustrations appear throughout nearly every chapter. With its clear examples, Criterion-referenced Language Testing not only provides an applied introduction for any language testing course but also serves as a valuable reference for readers ranging from undergraduate students to language testing professionals.
Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice-Hall.
Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 3-17). Hillsdale, NJ: Lawrence Erlbaum Associates.
Davidson, F. & Lynch, B. K. (2002). Testcraft: A teacher’s guide to writing and using language test specifications. New Haven, CT: Yale University Press.
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and consequences of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 33-45). Hillsdale, NJ: Lawrence Erlbaum Associates.
University of Illinois at Urbana-Champaign
© Copyright rests with authors. Please cite TESL-EJ appropriately.
Editor’s Note: Dashed numbers in square brackets indicate the end of each page for purposes of citation.