Monday, December 15, 2014

• Measurement and Assessment in Teaching: Summary

A summary of the book “Measurement and Assessment in Teaching” by Robert L. Linn & M. David Miller, presented as questions and answers.

Chapter 1: Educational Testing and Assessment: Context, Issues, and Trends
Briefly describe the major components and requirements of No Child Left Behind. Be sure to include the goals of the legislation that must be met by 2014 and what might be done regarding schools that fail to meet those goals.

The testing required by the NCLB Act is used to hold schools, districts, and states accountable for student achievement. The law requires states to set performance standards that are used to establish annual "adequate yearly progress" targets that ensure that all students will perform at the "proficient" level or above by the 2013-2014 school year. Schools that do not meet yearly goals two years in a row are placed into a "needs improvement" category. Tutoring, expanded time for instruction either after school or during the summer, and public school choice may be provided to students in low-performing schools. When initially identified as "needs improvement," schools must develop improvement plans, and districts must provide public school choice. Schools that continue to be identified are subject to more severe sanctions, leading finally to restructuring and the replacement of teachers and school administrators.

NCLB is intended to include all students. Describe the special needs and/or accommodations that are allowed under NCLB to ensure that all students can meet proficiency goals.

Many students who would likely have been excluded in the past can be included with minor accommodations to the assessment. Some students simply need more time to complete the tests. This can be accomplished on a student-by-student basis or by extending the time for all students to make the process fairer. English-language learners who are proficient in another language may be assessed in that language. Accommodations are also needed for some students with disabilities. The nature and extent of accommodation needed clearly depends on the kind and severity of a student's disability. Students with visual impairments may receive tests in Braille. Students with some types of physical handicaps may require someone to record their responses for them. By far the largest fraction of students who may need accommodations are students with learning disabilities. Many students with learning disabilities are likely to be able to take part or all of a standards-based assessment without special accommodations. Among the more common suggested accommodations for students with learning disabilities are shorter assessments, more time for completing assessment tasks, oral reading of directions, assessment in small groups or individually, and oral responses to tasks.

Discuss some concerns that many members of the public have regarding achievement testing in the schools. Be sure to include in your answer the concerns of the nature and quality of tests, effect of testing on students, fairness of tests to minorities and gender fairness. 

The text chapter discusses four areas of concern resulting in controversy over testing and assessment: (a) the nature and quality of tests, (b) the effects of testing on students, (c) fairness to minorities, and (d) gender fairness. One criticism of proficiency tests centers around the issue that the tests do not measure real-life problems, which tend to be "ill structured" rather than "well structured." Another criticism is that such tests measure only a narrow segment of the skills needed to be successful both in the classroom and in life. Critics argue that the tests create anxiety for test-takers and educators alike, that they tend to categorize and label students, that they can hurt children's self-esteem, and that they lead to self-fulfilling prophecies in academic achievement. Regarding fairness to minorities, critics have argued that the tests are biased, and that the ways they are administered and scored are also biased. Critics also argue that minorities do not have equal opportunities to learn the material and that test results are interpreted unfairly for minority children. Finally, critics argue that the tests are not gender-fair.

Chapter 2: The Role of Measurement and Assessment in Teaching

Define "assessment." Name and define its two main components. What are the main questions that assessment seeks to answer? 

Assessment is the general (and overarching) term for the procedures by which educators gain knowledge, insight, and information about student learning. There are two major components of assessment: tests and measurement. A test is a type of assessment that contains questions administered over a fixed time period under controlled and objective conditions for all students tested. Measurement is the assigning of numbers or scores to a test using a specific set of rules for scoring. Assessment seeks to answer the question "How well does the individual perform?" A test seeks to answer the question "How well does the individual perform—either in comparison with others or in comparison with a domain of performance tasks?" Finally, measurement seeks to answer the question "How much did the student score on the test?"

List and discuss the five general principles of good assessment. What type of error may occur when each of these principles is violated?

There are five general principles that govern good assessment. They are:
  1. Clearly specifying what is to be assessed has priority in the assessment process.
  2. An assessment procedure should be selected because of its relevance to the characteristics or performance to be measured.
  3. Comprehensive assessment requires a variety of procedures.
  4. Proper use of assessment procedures requires an awareness of their limitations.
  5. Assessment is a means to an end, not an end in itself.
Clearly specifying what is to be assessed should be done before any testing or measurement takes place. This ensures that the assessment is meaningful and has a clear purpose; to violate this principle runs the risk of assessing haphazardly or rendering any use of the assessment meaningless. One must make sure that the assessment procedures chosen are relevant to what is to be measured; if this principle is violated, the information gathered may turn out to be useless. No one test or type of test can adequately measure all the types of learning required in school. To violate this principle is to risk overgeneralizing the results of an assessment or measuring too narrow a sample of learning behavior. The proper procedures for assessment instruments should be followed. This means that assessments should be administered and scored objectively and fairly for all students. It also means that the educator must realize that all tests contain error. To violate test procedures, or to ignore testing error, may lead to overgeneralization, misinterpretation, or misuse of assessment results. Assessment must serve a stated goal or objective; it must be given for a purpose. To violate this principle is simply to waste student and educator time and to misuse funds.

How should assessment be used to meet sets of stated educational achievement goals? Name and define the five steps in this process. 

The main purpose of classroom instruction is to help students achieve a set of intended learning goals. The first step in this process is to identify exactly what those learning goals are going to be for the student. The next step is to preassess the student's current level of competency. The purpose of this step is to guarantee that the student possesses the prerequisite skills to benefit from instruction. The third step is for the educator to provide relevant instruction. During this stage, assessment should be ongoing to ensure that the learner is mastering content and to diagnose any learning difficulties. The fourth step is to assess learning outcomes. In this phase, assessment seeks to confirm that the child has mastered the content and has reached the learning goal. The fifth stage is to use the assessment results in some relevant way. This often involves giving feedback on progress to educators, administrators, and the students themselves, and it helps to set future learning goals.

Identify and define maximum performance, typical performance, fixed-choice tests and complex-performance assessments. What is an appropriate use or educational example for each type of assessment?

Maximum performance assessment occurs when the individual knows that he/she is going to be assessed, has time to prepare for the assessment, and is motivated to do well. An example of maximum performance assessment is a midterm examination that counts toward one's final grade. Typical performance assessment measures the types of behavior that an individual demonstrates daily, when the stakes are not high and there is no time to prepare. An example of typical performance might be an observation of a child's paying attention in class during a normal school day. Fixed-choice tests are those in which the individual must choose his/her answers from those provided; multiple-choice and true-false tests are examples of this format. Finally, complex-performance assessments are those requiring extended answers that are produced by the student rather than chosen from a fixed set of alternatives. Examples of complex-performance assessment are essay questions and science laboratory assignments.

Identify and define the four types of assessment outlined in the chapter. What would be an educationally appropriate use of each type?

The four types of assessment are: placement, formative, diagnostic, and summative. Placement assessment is used to determine student performance at the beginning of instruction. An example of placement assessment would be measuring a child's current reading skills before placing him/her in a reading group or with a certain level of reading text. Formative assessment is ongoing, day-to-day assessment of a child's educational progress. For example, it might be used to make sure that the child understands a given unit of math instruction before progressing on to further instruction. Diagnostic assessment is related to formative assessment and is designed to identify any difficulties that the child is having learning in day-to-day instruction. An example would be to identify that a child has trouble with two-digit addition before going on to three-digit addition. Summative assessment occurs at the end of a unit of instruction. Its purpose is to assess whether learning goals have been met. Examples of summative assessment would be a unit test or a final exam.

Chapter 3: Instructional Goals and Objectives: Foundation for Assessment

What does it mean to say that an instructional objective should be expressed behaviorally? Give an example of a behaviorally and non-behaviorally expressed instructional objective. Contrast the terms "product" and "process" and state which one is preferable in an instructional objective. 

An instructional objective is expressed behaviorally when it contains an action (or verb) that is directly observable and measurable. Two people viewing the learner should agree that the target behavior has taken place. An example of a behaviorally stated objective would be: "The student will list the four largest cities in Ohio." An example of a non-behaviorally stated objective would be: "The student will know his spelling words." "Product" refers to what the student is going to do (stated in behavioral terms) in order to indicate mastery of the objective. "Process" indicates the way that the child is going to reach the stated behavior goal. Instructional objectives are usually concerned with products.

Define the three different domains in the taxonomy of educational objectives. Give an example of each of the three domains. 

The three domains in the taxonomy of educational objectives are: cognitive, affective, and psychomotor. The cognitive domain deals with the kinds of knowledge and skills that one might learn in school and are related to facts, knowledge, and its applications and uses. The affective taxonomy is concerned with interests, attitudes, and values. The psychomotor domain deals with physical movements and skills. An example of a cognitive skill would be completing a set of arithmetic problems. An example of an affective skill would be liking and appreciating classical music. An example of a psychomotor skill would be hitting a baseball.

What are the six levels that make up the cognitive domain? Give an example of each. 

The six levels of the cognitive domain are: knowledge, understanding, application, analysis, synthesis, and evaluation. Examples of each are:
Knowledge: The student will know/recite the letters of the alphabet.
Understanding: The student will explain in his own words the main causes of the Civil War.
Application: The child will correctly solve math problems.
Analysis: The student will outline the major themes of Hamlet.
Synthesis: The student will write an essay.
Evaluation: The student will critique a work of art.

What are "unanticipated learning outcomes"? Give an example of such an outcome. What type or form do these outcomes take? What should the teacher do when such outcomes occur?

Unanticipated learning outcomes occur when student learning or behavior takes place that was not included in the learning goals or objectives. An example of such an outcome might be when a concept is presented in algebra and a student unexpectedly relates the concept to a particular problem she had the night before while trying to figure out a recipe. Unexpected learning outcomes can be positive (as described above) or negative, as when a child is rude to another child in class. Teachers should view unexpected learning outcomes as an opportunity to enrich material or teach new and needed concepts to students.

Distinguish "general instructional objectives" from "specific learning outcomes." Give examples of acceptable verbs for each. Describe when and how each is used.

General instructional objectives should be specific enough to provide direction for instruction but not so specific that instruction is reduced to training. Specific learning outcomes refer to the specific way that students will demonstrate that they have achieved the general objective. Verbs acceptable for general instructional objectives are "know," "understand," and "interpret." Examples of acceptable verbs for specific learning outcomes are "list," "identify," and "solve." In creating programming for students, general instructional objectives are stated first and then specific learning outcomes are created to detail how the general objectives will be reached.

Chapter 4: Validity

What is validity? What is reliability? What is usability? What is the relationship between validity and reliability?

Validity is the adequacy and appropriateness of the interpretations and uses of assessment results. In common terms, a test is valid if it adequately measures what it purports to measure. Reliability refers to consistency. The idea is that a person taking the same test twice without intervening variables (e.g., time to study) should score about the same on both of the test administrations. Usability refers to the practicality of the test or test procedures. In order for a test to be valid it must be reliable. However, not all reliable tests are, by definition, valid.

What are the four types of validity considerations? Define each. 

The four types of validity considerations are: content, construct, assessment-criterion relationship, and consequences. Content validity refers to the idea that a test adequately samples or measures a representative sample of the content presented. Construct validity refers to how well a test measures a hypothetical attribute that we believe exists and that is inferred from behavior. The assessment-criterion relationship refers to how well a test predicts future behavior or adequately describes a group of people presently demonstrating those behaviors. Consequences refer to the ways in which, and purposes for which, test information is used.

Give an example of the four types of validity considerations. What techniques might be used to adequately measure that a test possesses a high degree of validity in each of the four considerations? 

A test has content validity if it adequately samples the content presented. For example, a test might lack content validity if it asked questions from only three of the five book chapters assigned. Content validity is often assessed by using a table of specifications that maps the content presented onto the six levels of the cognitive taxonomy. Examples of construct validity would be tests of personality, self-esteem, or reading comprehension: while these constructs cannot be directly seen, they are inferred from the behaviors that persons with high levels of those constructs tend to show. Assessment-criterion considerations measure either predictive or concurrent validity. In predictive validity, the attempt is made to see how well a test given in the present can predict future behavior. A test has concurrent validity to the extent that persons already known to possess an attribute (e.g., musicians) score highly on a test designed to measure that attribute (a test of musical ability). Consequence considerations are most concerned with the useful, ethical, and appropriate use of assessment results; a test should be used for inclusive rather than exclusive purposes.

What is correlation? What does correlation measure? Give an example of correlation. What is the relationship between correlation and causation? 

Correlation is the degree of relationship between two events or phenomena; that is, it is a measure of how well one variable predicts another. An example of correlation is the relationship between grade level and math achievement: as students advance in grade level, the complexity of the math presented to them also tends to increase. Correlation does not imply causation. Causation means that one variable directly produces a change in another, as when heat causes atoms to move faster. Correlations are rarely perfect: an increase in grade level usually brings an increase in the complexity of the math curriculum, but that need not always be the case.
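The degree of relationship described above is usually expressed as Pearson's r. As a minimal sketch, r can be computed from two lists of paired observations; the grade-level and achievement-score data below are invented purely for illustration.

```python
# Computing a Pearson product-moment correlation (r) with the standard
# library only. The data are hypothetical, purely for illustration.
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

grade_level = [1, 2, 3, 4, 5, 6]
math_score = [12, 18, 25, 31, 34, 45]  # hypothetical achievement scores
r = pearson_r(grade_level, math_score)
# r here is strong but not perfect: a high correlation, not causation
```

A perfectly linear relationship yields r = 1.00; real educational data, like the hypothetical scores above, fall short of that.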

What are some of the factors affecting validity? How or why are they important? Give an example of each. 

A number of factors affect validity. These include factors of the test itself, factors of the task or teaching procedure, factors of administration and scoring, factors of student responses, and the nature of the group and the criterion. Examples of factors of the test itself are unclear directions or words and vocabulary that are too complex for test takers. An example of a teaching-procedure factor is a teacher teaching in such a way as to actually interfere with good achievement on the test. An example of an administration or scoring factor is a teacher invalidating a standardized test by giving his own students extra time to complete the exam. Factors of student responses include conditions such as illness or high anxiety that would not, at other times, interfere with the child's achieving a better score on the test. Finally, an example of factors of the group and criterion is a test being given to a group of individuals for whom the test was not designed, and these students' achievement then being misinterpreted as a result.

Chapter 5: Reliability and Other Desired Characteristics

What is reliability? What is its nature and major properties? 

Reliability refers to the consistency of measurement, that is, how consistent test scores or other assessment results are from one measurement to another. Reliability is the correlation between the scores of the same people on two similar assessments. As a correlation, it is measured by the Pearson product-moment statistic (r). Reliability can range from 0.00 to 1.00.

What are the six major types of reliability? How are they measured or assessed?

The six major types of reliability are: test-retest, equivalent forms, test-retest with equivalent forms, split-half, coefficient alpha, and interrater. Test-retest reliability is assessed by giving the same person the same test twice, with as short a time interval between the two administrations as possible. Equivalent-forms reliability refers to the idea that when a test producer creates two forms of a test (e.g., Form A and Form B), the two forms should measure the same material in the same way but with different questions; if this is the case, the reliability between the two forms will be high. Test-retest with equivalent forms occurs when the same person is tested twice (as in test-retest), but instead of administering the exact same test, equivalent forms of the test are administered. Split-half reliability does not require a test to be administered twice; rather, the single test is split in half, with odd-numbered items compared against even-numbered items (as if two tests existed). Coefficient alpha generalizes the split-half approach: instead of dividing the test into two fixed halves, it reflects the consistency among all items, and for dichotomously scored items it can be computed with the KR-20 formula. Finally, interrater reliability is used in behavioral or performance assessments and refers to the concept that raters should agree as to whether, and to what extent, a given behavior did or did not occur.

What is error as it applies to reliability? How is error assessed? What are confidence bands and what relationship do they have with error?

Error as it applies to reliability acknowledges that no test is 100% reliable (i.e., r never equals 1.00). To the extent that perfect reliability is never reached, there is error in the test or test situation. Error is measured by the standard error of measurement, which allows a confidence band or interval to be placed around a single test score. The lower and upper limits of that band represent the range within which the person's "true" score is likely to fall, taking error into account.
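The band can be sketched with the usual formula SEM = SD * sqrt(1 - r). The test characteristics below (SD of 10, reliability of 0.91, observed score of 75) are hypothetical numbers chosen for illustration.

```python
# Standard error of measurement and a confidence band around one score.
# The SD, reliability, and observed score are hypothetical.
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

def confidence_band(score, sd, reliability, width=1):
    """Band of +/- width SEMs around an observed score; width=1 gives
    roughly a 68% band under the usual normality assumptions."""
    e = width * sem(sd, reliability)
    return (score - e, score + e)

# SD = 10 and r = 0.91 give SEM = 10 * sqrt(0.09) = 3, so an observed
# score of 75 carries a one-SEM band from about 72 to about 78.
low, high = confidence_band(75, 10, 0.91)
```

Note how the band shrinks as reliability rises: with r = 1.00 the SEM is zero and the observed score equals the "true" score.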

What are four factors that influence reliability measures? How do they provide such influences? 

The four factors influencing reliability measures are: number of assessment tasks, spread of scores, objectivity, and method of estimating reliability. In most cases, as the number of assessment tasks increases, the reliability of the overall assessment also increases. Regarding spread of scores, as the range or spread of scores in a distribution increases, the reliability of the assessment also increases. Reliability also increases as the assessment becomes more objective; that is, as independent judges can more readily agree on whether, and to what extent, behaviors have taken place, the reliability of the assessment increases. Finally, the method used to estimate reliability can affect the result: different methods yield more liberal or more conservative values of r.
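The first factor, number of assessment tasks, is commonly quantified with the Spearman-Brown prophecy formula, which projects how reliability changes when a test is lengthened or shortened. The ten-item quiz and its reliability of 0.60 below are hypothetical values for illustration.

```python
# Spearman-Brown prophecy formula: projected reliability when a test
# is lengthened by a factor of k. The example numbers are hypothetical.

def spearman_brown(r, k):
    """Projected reliability r_k = k*r / (1 + (k - 1)*r), assuming the
    added tasks are comparable to the existing ones."""
    return k * r / (1 + (k - 1) * r)

# Doubling a 10-item quiz with r = 0.60 to 20 comparable items raises
# the projection to about 0.75; halving it (k = 0.5) drops it to
# about 0.43.
doubled = spearman_brown(0.60, 2)
halved = spearman_brown(0.60, 0.5)
```

This is why adding well-constructed items is one of the most direct ways a teacher can raise the reliability of a classroom test.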

What is the concept of "acceptable reliability"? Under what circumstances must the reliability be relatively high for assessment decisions to take place with confidence? 

Reliability can vary from a low of r = 0.00 (no reliability) to a high of r = 1.00 (perfect reliability). Virtually no assessment is 100% reliable, and thus every assessment contains error. The degree of acceptable error depends on the use and the decisions to be made from the test results. Reliability should be relatively high when the decisions to be made are important, final, irreversible, or unconfirmable, when they concern individuals, and when they have lasting consequences.

Chapter 6: Planning Classroom Tests and Assessments

What are the three purposes or types of classroom assessment and testing? When and how should each type be used by teachers?

The three purposes and/or types of classroom assessment and testing are: pretesting, testing and assessment during instruction, and post-testing. Pretesting (as its name implies) occurs before instruction has begun. The purpose of pretesting is to determine whether students have the prerequisite skills needed for the instruction (to determine readiness). Pretesting is also used to assess to what extent students have already achieved the objectives of the planned instruction (to determine student placement or modification of instruction). A second purpose of assessment is testing during instruction. This type of testing is used to monitor the progress of students and to see if there are areas in which material needs to be explained to aid learning. Additional purposes of assessment during instruction also include encouraging students to study, and providing feedback to students and teachers. Finally, post-testing is used to assess whether the learning goals for the student have been achieved.

What are the major steps involved in building a table of specifications for a unit of instruction? What is involved in each step?

There are three main steps in building a table of specifications for a unit of instruction. These are (a) preparing a list of instructional objectives, (b) outlining the course content, and (c) preparing the two-way chart. In preparing a list of instructional objectives, separate lists of general objectives and specific learning outcomes are made. In the second step, the content to be covered is broken down (outlined) into major topic areas, with each topic area further parsed into subtopics. Finally, in preparing the two-way chart, the topics and subtopics are listed down the y-axis while the objectives are carried across the x-axis. In this two-way table, it is important that the objectives cover the span of the taxonomy of educational objectives, ranging from knowledge through evaluation.
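The two-way chart can be sketched as a simple data structure with topics as rows and taxonomy levels as columns, each cell holding the number of planned test items. The topics and item counts below are invented examples, not drawn from the text.

```python
# A two-way table of specifications as a nested dict. Rows (y-axis) are
# content topics; columns (x-axis) are taxonomy levels; each cell holds
# the number of planned test items. All topics and counts are hypothetical.
LEVELS = ["knowledge", "understanding", "application",
          "analysis", "synthesis", "evaluation"]

table = {
    "Fractions":     {"knowledge": 4, "understanding": 3, "application": 3},
    "Decimals":      {"knowledge": 3, "understanding": 2, "application": 2,
                      "analysis": 1},
    "Word problems": {"application": 2, "analysis": 2, "synthesis": 1,
                      "evaluation": 1},
}

def row_total(topic):
    """Items planned for one content topic across all taxonomy levels."""
    return sum(table[topic].values())

def column_total(level):
    """Items planned at one taxonomy level across all topics."""
    return sum(row.get(level, 0) for row in table.values())

def grand_total():
    """Total number of items on the planned test."""
    return sum(row_total(t) for t in table)
```

Checking the column totals against the full range of LEVELS makes it easy to see whether a planned test leans too heavily on knowledge-level items and neglects the higher levels of the taxonomy.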

What are the three main types of test items? How are they defined? Describe the categories of test items that might fall in each type. 

What are four main concerns that a teacher should keep in mind when creating a test for students? What might be the result if each of these concerns is not addressed? 

The four main concerns that should be kept in mind when creating a teacher-made test are: matching items and tasks to intended outcomes, obtaining a representative sample of items and tasks, eliminating irrelevant barriers to performance, and avoiding unintended clues in objective test items. Violating any of these four considerations introduces its own special type of error, and such errors make proper interpretation of the test results problematic.

Chapter 7: Constructing Objective Test Items: Simple Forms

What are the three types of objective test items discussed in Chapter Seven? How are they defined? How are they "simple form"? 

The three types of test items discussed in Chapter Seven are short-answer, true-false, and matching. A short-answer item requires the test taker to supply the answer as a word, phrase, number, or symbol. A true-false item has the test taker choose between two alternatives as to the veracity of a declarative statement. Finally, matching items require the test taker to link two related concepts by choosing the entry in one column that "goes with" the corresponding entry in a parallel column. All three item types are considered "simple form" because they require a minimal (as opposed to an extended) response from the test taker and because they measure basic knowledge-level and/or factual skills.

What are the basic uses of short-answer questions? How do they differ from true-false and matching? 

Short-answer items are suitable for measuring a wide variety of relatively simple learning outcomes. These include assessing whether the student knows the definition of given terminology in a topic of instruction, and/or the ability to solve numerical or scientific problems. They differ from true-false and matching exercises in that short answer is a supply item while the other two are selection items.

What are the basic advantages and disadvantages of true-false items? How might the disadvantages be avoided? 

The major advantage of true-false questions is that they are time-efficient: a student can answer more true-false questions than any other type of item. By including more items in a test, the teacher can increase certain types of content validity as well as reliability (see Chapters Five and Six). The disadvantages are that true-false items are harder to construct than they first appear, that they cannot adequately measure higher-level learning outcomes such as analysis or synthesis, and that they are highly susceptible to guessing. Some of these disadvantages can be reduced during construction by not assessing trivial content and by avoiding clues that encourage guessing and increase error.

When are matching exercises best used? What are the recommendations given in the chapter as they refer to the number of items in each column and the number of times an alternative might be chosen? Why are these recommendations made? 

Matching items are best used when the teacher wants to see whether the student understands the relationship between two variables, concepts, or events. It is recommended that the two lists not be of the same length and that the instructions state that any alternative may be used once, more than once, or not at all. This prevents any one wrong answer from counting as two: if a student believes that an alternative can be used only once, choosing it wrongly for one item would make it unavailable for its correct partner in a corresponding match.

Chapter 8: Constructing Objective Test Items: Multiple-Choice Forms

What are the components of multiple-choice questions? How might multiple-choice questions be posed?

A multiple-choice item consists of a problem and a list of suggested solutions. The problem is called the stem of the item. The list of suggested solutions are called alternatives. The correct alternative in each item is called the answer, and the remaining alternatives are called distracters (also called decoys or foils). Multiple-choice questions may be posed as a direct question or as an incomplete statement.

What are the basic uses of multiple-choice items? Give a brief description of each use. 

The most common use of multiple-choice items is to measure verbal information. These items are good for measuring knowledge outcomes and for measuring outcomes at higher taxonomic levels. Knowledge outcomes include knowledge of terminology, knowledge of specific facts, knowledge of principles, and knowledge of methods and procedures. Outcomes at higher taxonomic levels include the ability to identify applications of facts and principles, the ability to apply facts and principles, and the ability to justify methods and procedures.

Identify the major advantages and limitations of multiple-choice items. Be as specific as possible. 

Multiple-choice items work best when they are assessing achievement, and their use should be restricted to verbal materials. One advantage of multiple-choice questions is their flexibility: within those parameters, they can be used to test virtually any subject matter. They are usually less vague than short-answer questions and are more objective in their scoring. Compared to true-false items, the student encountering a multiple-choice item must not only identify a wrong statement but must also know what the correct alternative is. Also compared to true-false questions, multiple-choice items are more resistant to guessing and to response set. The major limitation of multiple-choice questions is that they are of the selection format and thus less "real life" than supply-item questions.

What are some of the ways that multiple-choice questions can be written to strengthen them, reduce their liabilities and create fairer and more objective questions? Be as specific as possible. 

A major consideration in the construction of multiple-choice questions is the stem. A stem should be able to stand alone without the alternatives and pose a clear question or problem. As much as possible, the stem should be free of irrelevant material, which may confuse the reader and overload short-term memory for what the stem is asking. When possible, the stem should not be stated in the negative, and double negatives should certainly not be used. So as not to give clues, all alternatives should agree grammatically with the stem. Each item should have one and only one correct answer; items with two correct answers should be discarded. All alternatives, including the distracters, should be plausible. Implausible or silly foils give clues to the correct answer by reducing the number of plausible alternatives, and they encourage guessing.

Chapter 9: Measuring Complex Achievement: The Interpretive Exercise

What are interpretive exercises? What are the components of interpretive exercises? How are interpretive exercises posed?

Interpretive exercises are intended to measure learning outcomes based on the higher mental processes, such as understanding, thinking skills, and various problem-solving abilities. An interpretive exercise consists of a series of objective items based on a common set of stimuli. The stimuli may be in the form of written materials, tables, charts, graphs, maps, or pictures. The series of related test items may take various forms but most commonly consists of multiple-choice or true-false items, with multiple-choice the most widely used.

What are the basic uses of interpretive exercises? (List at least three.) What types and levels of educational outcomes do they assess best? 

Interpretive exercises are used to measure the abilities to recognize inferences, recognize warranted and unwarranted generalizations, recognize assumptions, recognize the relevance of information, apply principles, and use pictorial materials. Interpretive exercises are usually used to assess higher-order levels of learning, including understanding and application.

Identify the major advantages and limitations of interpretive exercises. Be as specific as possible. 

There are multiple advantages to using interpretive exercises. One is that the stimulus materials make it possible to measure the ability to interpret written materials, charts, graphs, maps, pictures, and other media encountered in everyday situations. Another is that the interpretive exercise makes it possible to measure more complex learning outcomes than can be measured with a single objective item. Third, by basing a series of related test items on a common set of data, greater depth and breadth can be obtained in the measurement of achievement skills. Finally, the interpretive exercise minimizes the influence of irrelevant factual information on the measurement of complex learning outcomes: students are sometimes unable to demonstrate their understanding of a principle simply because they do not know some of the facts concerning the situation to which it is to be applied, and the interpretive exercise remedies this by supplying those facts.
There are also a number of limitations. It is difficult to construct sound exercises. A second limitation is that when introductory material is in written form, there is a heavy demand on reading skills. Finally, because the interpretive exercises usually use selection items, they are confined to learning outcomes at the recognition level.

What are some of the ways that interpretive exercises can be written to strengthen them, reduce their liabilities, and create fairer and more objective questions? Be as specific as possible. 

It is essential that interpretive exercises be as strong and valid as possible. This means that the set of stimuli must be appropriate and that the objective questions that are drawn from the stimuli also be strong. In relation to the set of stimuli, it is important that the material is relevant to the objectives of the course. The set of stimuli should also be at an appropriate reading level for the students taking the test. The stimulus material should be new to students, not something they have encountered before. It should be brief but meaningful. It should be clear and concise. Regarding the questions themselves, they should reflect the stimulus materials. They should also conform to all of the rules for constructing sound objective items such as multiple choice.

Chapter 10: Measuring Complex Achievement: Essay Question

What is an essay? What is the distinctive feature of the essay question? What types of learning are essay questions best able to assess? 

An essay is an extended-response, supply-type assessment in which the student replies to a question or series of questions in connected prose. The distinctive feature of the essay question is its extended-response form: students are free to construct, relate, and present ideas in their own words. Essay questions are best used to measure higher-order learning objectives such as analysis, synthesis, and evaluation.

What are the two types of essay question format? How do they differ? Which type is usually more preferable over the other? Why? 

The two types of essay question formats are restricted response and extended response. A restricted-response question asks a specific question, gives specific instructions on how to answer it, and requires an answer that conforms to those instructions. The extended-response essay question is more open-ended and gives the student more latitude in answering. Restricted-response essays are usually preferable because they can be scored more objectively and lend themselves to greater reliability.

What are the major advantages and limitations to essay questions? Be as specific as possible. 

Among the advantages of essay questions is the fact that the essay allows for the measurement of complex learning outcomes that cannot be measured by other means. A second advantage of the essay is its emphasis on the integration and application of thinking and problem-solving skills. Finally, the potentially most important advantage of the essay question is its contribution to student learning.
Disadvantages of the essay question are that good essay questions can be difficult to construct and difficult to score. Regarding scoring, interrater reliability is often a problem, and essays are time consuming to score. Finally, since only a few essay questions can be asked on an exam, material may not be adequately sampled and the test can lack content validity.

How might the interrater reliability of essay scoring be improved? What are the two ways of scoring an essay? All things being equal, which type of scoring should be employed and why?

Interrater reliability in scoring essays can be improved by using scoring rubrics or scoring plans to which raters must adhere. The teacher may also write out a model answer to the essay question and use it as the rubric. Scoring of essays may be holistic or analytic. Holistic scoring requires reading the entire essay and giving it a single overall grade. Analytic scoring involves rating how well each section or criterion of the response answers the question and then basing the grade on those section scores. All things being equal, analytic scoring leads to more objective and more reliable scoring and gives students feedback on which aspects of the essay they did well or poorly on.
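The analytic approach can be sketched in code. The rubric criteria, point values, and function names below are hypothetical illustrations, not drawn from the book:

```python
# A minimal sketch of analytic essay scoring: each rubric criterion is
# rated separately, and the section scores are combined into the grade
# and reported back to the student. Criterion names and maximum point
# values here are hypothetical.
RUBRIC = {
    "thesis and organization": 10,
    "use of evidence": 10,
    "analysis and reasoning": 15,
    "mechanics": 5,
}

def analytic_score(section_scores):
    """Sum per-criterion scores, checking each against its maximum."""
    total = 0
    for criterion, max_points in RUBRIC.items():
        earned = section_scores[criterion]
        if not 0 <= earned <= max_points:
            raise ValueError(f"{criterion}: {earned} outside 0-{max_points}")
        total += earned
    return total

scores = {
    "thesis and organization": 8,
    "use of evidence": 7,
    "analysis and reasoning": 12,
    "mechanics": 4,
}
print(analytic_score(scores), "/", sum(RUBRIC.values()))  # → 31 / 40
```

Because each criterion is scored against an explicit maximum, two raters applying the same rubric have a common frame of reference, which is what drives the gain in interrater reliability.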

Is it a sound educational procedure to give students the option of which essays to answer? Is it a sound educational procedure to grade handwriting and spelling as part of the essay grade? Defend your positions. 

It is not a sound educational procedure to allow students to have options as to what essay questions to answer. If such options are given, students are answering different questions (actually taking different tests), and so the common basis for evaluating their achievement is lost. It is probably not advisable to score handwriting and spelling unless the essay test covers handwriting and spelling. If handwriting and spelling are counted on essays in other curricular areas, it is possible that a child could answer the essay adequately and show achievement in learning goals but nevertheless score poorly on the essay because of irrelevant scoring characteristics.

Chapter 11: Measuring Complex Achievement: Performance-Based Assessments

Besides essay tests, what are other types of performance assessments? Give some examples of performance assessment. How do these assessments differ from essays? 

Essay tests are the most common example of a performance-based assessment. However, there are others, including artistic productions, experiments in science, oral presentations, and the use of mathematics to solve real-world problems. Examples might include creating an art or music product, or producing work in vocational or industrial education courses such as auto repair, woodworking, or word processing. Performance assessment is also useful in mathematics, science, social studies, and foreign languages. While essay tests are based on written responses, the examples above require the student to "do" something or engage in specific behaviors.

Define process, product, restricted-performance assessment and extended-performance assessment. Give an example of each.

Performance assessments provide a basis for teachers to evaluate both the effectiveness of the process or procedure used and the product resulting from performance of a task. Unlike simple tests of factual knowledge, there is unlikely to be a single right or best answer. Restricted-response performance tasks are usually relatively narrow in definition: the instructions are generally more focused than those for extended-response performance tasks, and the limitations on the types of performance expected are likely to be indicated. The extended-performance task may require students to seek information from a variety of sources beyond those provided by the task itself. An example of process is a student showing the procedures used to complete a science experiment. An example of product is an apple pie that the student has baked. An example of a restricted-response performance assessment is the construction of graphs of the average amount of rainfall per month for two cities. An example of extended-response performance assessment is preparing and delivering a speech to persuade people to take actions to protect the environment.

Identify two advantages and two limitations of performance assessments. Explain why they are an advantage or a limitation. 

A major advantage of performance assessments is that they can clearly communicate instructional goals that involve complex performances in natural settings in and outside of school. By using tasks that require performances that correspond as closely as is feasible to major instructional objectives, they provide instructional targets and thereby can encourage the development of complex understandings and skills. A second advantage of performance assessments is that they can measure complex learning outcomes that cannot be measured by other means. They measure "real world" outcomes.
The most commonly cited limitation of performance assessments is the unreliability of ratings of performances across teachers or across time for the same teacher. Reliability can be greatly increased by clearly defining the outcomes to be measured, properly framing the tasks, and carefully defining and following rubrics for scoring performances. Another limitation of extended performance assessments is their time-consuming nature. This limitation may not be easily overcome. However, the need for fair and valid assessment may outweigh the time needed to create and score those assessments.

Identify ways for creating sound and useful performance assessments. Why are they important? 

A number of suggestions are given in the chapter for creating valid performance assessments. These include focusing on learning outcomes that require complex cognitive skills and student performances; time spent constructing performance assessments should probably not be devoted to lower-order knowledge objectives. Tasks should be selected that represent both the content and the skills central to important learning outcomes, and assessments should stress the interdependence of content and skills. Assessments should minimize the dependence of task performance on skills irrelevant to the intended purpose of the assessment task, so that only the most relevant abilities are assessed. The teacher should provide the scaffolding students need to understand the task and what is expected of them, and students should have the prerequisite skills to complete the task. Teachers should construct task directions so that the student's task is clearly indicated; students should clearly understand what is expected of them so that the assessment is valid and accurate. Finally, the teacher should clearly communicate performance expectations in terms of the scoring rubrics by which the performances will be judged. Explaining the criteria used in rating performances gives students guidance on how to focus their efforts and helps convey priorities among learning outcomes.

Define rating scales, scoring rubrics, and checklists. 

A rating scale presents a set of characteristics or qualities to be judged, along with a scale for indicating the degree to which each is present. A scoring rubric typically consists of verbal descriptions of a performance, or of aspects of student responses, that distinguish between advanced, proficient, partially proficient, and beginning levels of performance; both analytic and holistic scoring rubrics may be employed. A checklist is similar in appearance and use to the rating scale. The basic difference between them is in the type of judgment needed: on a rating scale, one indicates the degree to which a characteristic is present or the frequency with which a behavior occurs, whereas the checklist calls for a simple yes-no judgment.

Chapter 12: Portfolios

What is a portfolio? What qualifies as a portfolio of student work? 

A student portfolio is a purposeful collection of pieces of student work. However, it possesses several special attributes. A portfolio is a collection of student work selected to serve a particular purpose, such as the documentation of student growth. Unlike other examples of student work, a portfolio does not contain all the work a student does. Instead, a portfolio may contain examples of "best" works or typical examples from each of several categories of work.

What are some of the advantages and limitations of portfolios? 

There are a number of strengths and limitations of portfolios. An important advantage is that they can be readily integrated with instruction. Another is that they give students important opportunities to show what they can do. In doing this, they also help students become more reflective about and critical of their work, allowing them to adjust and improve it. Portfolios also give students a sense of responsibility and self-efficacy in collecting and submitting their work. Finally, portfolios give teachers products to use in communicating with parents about their child's work and what goes on in the classroom.
Among the disadvantages of portfolios are that they take considerable time to construct and score. They can also lead to problems with interrater reliability in scoring and are not easily convertible to summative evaluation grades.

What are some of the purposes of portfolio? 

Portfolios can be used in a variety of ways. Perhaps it is best to view their uses as poles along four main dimensions. Along one dimension is the use of the portfolio for instruction or for assessment. A second dimension is whether the portfolio is used to showcase current accomplishments or to document progress over time. A third dimension is whether it shows the student's best work or a sample of typical work. Finally, portfolio use can be seen along the dimension of whether the portfolio contains finished work or works in progress.

Should students evaluate and/or select the material in their portfolios? If so, what are some guidelines for the process? 

One legitimate use of portfolios is to have students evaluate and/or select the material that goes into their portfolios. However, some guidelines are necessary if such evaluation and choice of material is to take place. To some extent, the guidelines depend on the type of portfolio and its purpose. It is usually advisable that the student be given specific (and written) guidelines as to what is to go into the portfolio and how to critique it. Thus, evaluation and item choice should not be "open-ended." There should also be prompts intended to encourage students to think about what they planned to do and what they actually did, and to evaluate the strong and weak points of each entry. By asking students to say what they might do differently next time, they are encouraged to think about how their work might be improved.

How should portfolios be evaluated by teachers? Be as specific as possible. 

To evaluate portfolios, a teacher must be clear about the instructional goals for individual portfolio entries and for the portfolio as a whole. Teachers must know in advance whether they are going to score the portfolio analytically or holistically. Analytic scoring rubrics for individual portfolio entries are useful for formative evaluation purposes, while holistic scoring rubrics may be more appropriate for summative evaluations. The types of rating scales used to score performance assessments are, for the most part, also appropriate for scoring portfolios. To make scoring more objective, it is good practice to conceal the identity of the student. Biases such as the halo effect should be guarded against as much as possible.

Chapter 13: Assessment Procedures: Observational Techniques, Peer Appraisal, and Self-Report

What type of settings are best suited for observational techniques? What kinds of behaviors are best assessed by observational techniques?

Observational techniques are best suited for assessment in naturalistic environments, such as natural interactions in the classroom, on the playground, or in the lunchroom. Behaviors well suited for assessment by informal observation include important noncognitive outcomes, such as attitudes, appreciations, and personal-social development.

What are anecdotal records? How do they differ from random observations made by teachers? What are some of the uses of anecdotal records? 

Anecdotal records are factual descriptions of the meaningful incidents and events that the teacher has observed. Each incident is written down shortly after it happens. Anecdotal records differ from random observations in that they are both purposeful and systematic in collection and in scoring. The use of anecdotal records has frequently been limited to the area of social adjustment. Although they are especially appropriate for this type of reporting, they can usually be applied to any area of learning.

What are the advantages and limitations of anecdotal records?

Probably the most important advantage of anecdotal records is that they depict actual behavior in natural situations. Records of actual behavior provide a check on other assessment methods and also enable us to determine the extent of change in the student's typical patterns of behavior. Another advantage of anecdotal records is that they help gather evidence on events that are exceptional but significant. Anecdotal records can be used with very young students and with students who have limited basic communication skills. They are especially valuable in situations where paper-and-pencil tests, performance assessments, self-report techniques, and peer appraisals are likely to be impractical or of limited use. One limitation of anecdotal records is the amount of time required to maintain an adequate system of records. Another serious limitation of anecdotal records is the difficulty of being objective when observing and reporting student behavior. A third limitation is obtaining an adequate sample of behavior. This limitation can affect validity.

What are the uses of peer appraisal and self-report scales? What forms can the two techniques take? What are Likert scales? 

Peer appraisal and self-report scales are useful when assessment is not easily carried out by the teacher or when many of the behaviors to be assessed occur in more naturalistic environments, such as the playground or after school. In peer appraisal, the guess-who technique can be used, as can peer nominations. In self-assessment, rating scales and interviews are both appropriate techniques. A Likert scale is a rating scale with a number of choice points that differ along a continuum. An example is a five-point scale containing the choices: strongly disagree, disagree, don't know, agree, and strongly agree.
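Likert-scale responses are typically scored by mapping the five categories to the numbers 1 through 5, reverse-scoring negatively worded items, and then summing or averaging across items. A minimal sketch, with hypothetical item wording and function names:

```python
# Scoring a five-point Likert scale: each response category maps to a
# number from 1 to 5, negatively worded items are reverse-scored, and
# the item scores are averaged. The items and names below are
# hypothetical examples, not taken from the book.
POINTS = {"strongly disagree": 1, "disagree": 2, "don't know": 3,
          "agree": 4, "strongly agree": 5}

def likert_score(responses, reverse_keyed=()):
    """Return the mean item score, reverse-scoring the indicated items."""
    scores = []
    for i, response in enumerate(responses):
        value = POINTS[response]
        if i in reverse_keyed:
            value = 6 - value  # 5 becomes 1, 4 becomes 2, and so on
        scores.append(value)
    return sum(scores) / len(scores)

# Item 2 (say, "I dislike science class") is negatively worded
answers = ["agree", "strongly agree", "disagree", "don't know"]
print(likert_score(answers, reverse_keyed={2}))  # → 4.0
```

Reverse-scoring keeps the direction of the scale consistent, so a higher mean always indicates a more favorable attitude.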

What are interest inventories? What is their connection to aptitude? What are some techniques of personality assessment? 

As the name implies, interest inventories measure a student's interest, willingness, or enthusiasm to engage in some activity. They differ from aptitude tests in that a student may be interested in an activity but lack the necessary skills or talents to be successful in that activity. Some techniques of personality assessment are interviews, rating scales, and projective techniques.

Chapter 14: Assembling, Administering, and Appraising Classroom Tests and Assessments

When reviewing objective test items, what are some of the key things that a teacher should look for and/or correct?

There are a number of questions that teachers should ask themselves when reviewing objective test items being considered for inclusion in a test. The first is whether the format is appropriate for the learning outcome being measured; for example, if the student is expected to produce a definition, a short-answer item should be used instead of a true-false item. Another important question is whether the level of behavior required by the test item matches the taxonomic level of the objective: if the objective expects the student to apply knowledge, then a test item that requires only rote knowledge is inappropriate. A third requirement is that the point of the item be clear to the student. A fourth is that the item be as short as possible and free from excess verbiage. Another requirement is that experts in the field of inquiry would agree on the intended answer. The item should also be free from clues and technical errors. Finally, the item should be free of ethnic, racial, or gender bias.

What are some of the ways that test items can be arranged on a test?

One way that items may be arranged is by the type of items being used. In this system, for example, all of the multiple-choice items would be arranged first followed by true-false, etc. A second method is to arrange items by the goals, objectives, or learning outcomes that the test measures. For example, if four learning outcomes were being measured, all of the first objective's test questions would appear first followed by the second objective's test questions, etc. A third way would be by the difficulty of the items. In order to increase test motivation, the easiest items would appear first. Finally, test items may be arranged by the subject matter that the test covers.

What are some of the ways that a teacher can reduce test anxiety before and during testing?

There are a number of ways that teachers can reduce test anxiety. One is not to use tests, or the threat of tests, as punishment for classroom misbehavior or for not completing school assignments. Another is to avoid telling students that they need to do their best because the test is crucial for some aspect of their future life, such as getting into a good college. A third is to avoid telling students to work fast so that they will finish on time. Finally, the teacher should not warn students of harsh consequences if they do poorly or fail the test.

Describe the concepts of item discrimination and difficulty. What should be the ideal levels of each concept on a test item?

Item discrimination refers to the idea that students who do well on the entire test should tend to answer a particular item correctly, while students who score poorly on the entire test should tend to answer it incorrectly. Item difficulty is the percentage of students in the entire class who answered an item correctly. Perfect item discrimination (1.00) occurs if all of the top 27% of test scorers answer the item correctly and none of the bottom 27% do. Items should be of moderate difficulty (e.g., .75), with item discrimination as close to 1.00 as possible.
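Both indices can be computed directly from scored responses. The sketch below assumes items scored 1 (correct) or 0 (incorrect) and uses the traditional upper and lower 27% groups; the function names and data are illustrative, not from the book:

```python
def item_difficulty(item_scores):
    """Difficulty p = proportion of all students answering the item correctly."""
    return sum(item_scores) / len(item_scores)

def item_discrimination(item_scores, total_scores, fraction=0.27):
    """Discrimination index D = p(upper group) - p(lower group).

    Students are ranked by total test score; the top and bottom
    `fraction` (traditionally 27%) form the comparison groups.
    """
    n = max(1, round(len(total_scores) * fraction))
    ranked = sorted(range(len(total_scores)),
                    key=lambda i: total_scores[i], reverse=True)
    upper = [item_scores[i] for i in ranked[:n]]
    lower = [item_scores[i] for i in ranked[-n:]]
    return sum(upper) / n - sum(lower) / n

# Ten students: 1 = answered this item correctly, 0 = incorrectly
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
totals = [48, 45, 44, 40, 38, 30, 28, 25, 20, 15]

print(item_difficulty(item))                          # → 0.5
print(round(item_discrimination(item, totals), 3))    # → 0.667
```

Here the item is of moderate difficulty (half the class answered correctly) and discriminates well: the high scorers got it right far more often than the low scorers.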

What is a correction for guessing? When does it apply? Should it be used for most tests?

A correction for guessing is based on the assumption that some students will answer all questions (especially multiple-choice questions) even if they must guess at some answers, while other students will leave items blank rather than guess. Students who guess will get some items correct by chance, while students who do not guess automatically get those items wrong. Thus, the correction for guessing is an attempt to compensate for these different modes of test taking. A correction for guessing is superfluous when all students answer all items, so it is better to make sure that all students answer all items than to try to compensate after the fact with a mathematical correction.
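The conventional correction formula is S = R − W/(k − 1), where R is the number of right answers, W the number of wrong answers (omitted items are not counted), and k the number of alternatives per item. A minimal sketch:

```python
def corrected_score(num_right, num_wrong, num_choices):
    """Classic correction-for-guessing formula: S = R - W / (k - 1).

    Omitted items are neither rewarded nor penalized, so a student who
    guesses blindly and a student who omits expect the same score.
    """
    return num_right - num_wrong / (num_choices - 1)

# 40-item, 4-choice test: 28 right, 8 wrong, 4 omitted
print(round(corrected_score(28, 8, 4), 2))  # → 25.33
```

On a four-choice item, a blind guesser is right one time in four, so each wrong answer costs one third of a point, exactly offsetting the expected gain from guessing.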

Chapter 15: Grading and Reporting

What are the functions of grading and reporting systems? Why are these functions important or useful?

School grading and reporting systems are designed to serve a variety of functions in the school. These include instructional uses, reports to parents, and administrative and guidance uses. The main function of grades should focus on learning and student development. This function is strengthened when grades clarify instructional objectives, indicate the student's strengths and weaknesses in learning, provide information concerning the student's personal-social development, and contribute to the student's motivation. Informing parents (or guardians) of their child's school progress is a basic function of grading and reporting systems. These reports should help parents understand the objectives of the school and how well their children are achieving the intended learning outcomes of their particular program. Finally, grades and progress reports serve a number of administrative functions. They are used for determining promotion and graduation, awarding honors, determining athletic eligibility, and reporting information to other schools and prospective employers.

What are the main types of grading and reporting systems? What is an advantage and limitation to each system?

The traditional grading system has been letter grades. This system is concise and convenient; the grades are easily averaged and are useful in predicting future achievement. However, letter grades have several shortcomings: they typically combine achievement, effort, work habits, and good behavior; the proportion of students assigned each letter grade varies from teacher to teacher; and they do not indicate a student's specific strengths and weaknesses in learning.
Pass-fail is a two-category system in which the student receives either a passing or a failing grade, with no gradations in between. An advantage is that it permits students to take some courses, usually electives, under a pass-fail option that is not included in their grade-point average. A limitation is that it offers very little information about the extent of learning.
Checklists are ratings of progress toward the major objectives in each subject-matter area. An advantage of checklists is that they provide a detailed analysis of the student's strengths and weaknesses so that constructive action can be taken to help improve learning. Difficulties encountered with such reports are in keeping the list of statements down to a workable number and in stating them in such simple and concise terms that they are readily understood by all users of the reports.
Another method of grading and reporting is sending letters home to parents or guardians. Letters make it possible to report on the unique strengths, weaknesses, and learning needs of each student and to suggest specific plans for improvement. Among the limitations are that they require an excessive amount of time and skill, that descriptions of a student's learning weaknesses are easily misinterpreted by parents and that letters fail to provide a systematic and cumulative record of student progress toward the objectives of the school.
Portfolios can be an effective means of showing student progress, illustrating strengths, and identifying areas where greater effort is needed. Portfolios must be systematic and conform to all of the guidelines for maintaining good portfolios.
The parent-teacher conference is a flexible procedure that provides for two-way communication between home and school. The parent-teacher conference is an extremely useful tool, but it shares two important limitations with the informal letter. First, it requires a substantial amount of time and skill. Secondly, it does not provide a systematic record of student progress.

What is a multiple grading and reporting system? What components would multiple grading systems contain? What is the advantage of adopting such a system?

Rather than replace letter grades, many educators have advocated trying to improve the letter-grade system and supplement it with more detailed and meaningful reports of student learning progress. The typical multiple reporting system retains the use of traditional letter grades and supplements the grades with checklists of objectives. In some cases, two grades are assigned to each subject: one for achievement and the other for effort, improvement, or growth. When letter grades are supplemented by these other methods of reporting, the grades become more meaningful.

What are some of the questions that a teacher should answer for him/herself before adopting a letter system of grading and assigning letter grades to students?

A number of questions and issues must be resolved before the teacher adopts a letter grading system and begins assigning letter grades to students. These include: determining what should be included in a letter grade, answering questions as to how achievement data should be combined in assigning letter grades, determining the frame of reference to be used in grading, and answering issues as to how distribution of letter grades should be determined.

What is the benefit of parent teacher conferences? When and how often should parent-teacher conferences be held?

The face-to-face conference makes it possible to share information with parents or guardians. It helps to overcome any misunderstanding between home and school, and to plan cooperatively a program of maximum benefit to the student. At the elementary school level, conferences with parents are regularly scheduled. At the secondary level, the parent-teacher conference is typically used only when some special problem situation arises.

Chapter 16: Achievement Tests

What are the major types of published achievement tests? How are they similar? How do they differ? 

There are a variety of achievement tests, including achievement test batteries, achievement tests in individual subject areas, and individual achievement tests. These are alike in that they are all commercially available and standardized: they have standardized rules for administration and scoring, a test manual, and established reliability and validity. Most are norm-referenced and have been normed on a national group or groups of students, and most have equivalent forms. Achievement test batteries measure a number of different curricular areas, whereas standardized achievement tests in individual subject areas measure achievement in only one curricular area, such as reading. Individual achievement tests, unlike most standardized achievement tests, are given in a one-on-one setting.

What are the major differences between standardized achievement tests and informal tests? 

The main differences between standardized achievement tests and informal classroom tests are in the nature of the learning outcomes and content measured. A second difference is in the quality of the test items. Third, they differ in the established reliability and validity of the tests. Fourth, they differ in the procedures for administering and scoring. Finally, they differ in the interpretation of scores.

What are standardized achievement test batteries? Why are they useful? What is a limitation or disadvantage in using them?

Standardized achievement tests are frequently used in the form of survey test batteries. A battery consists of a series of individual tests all standardized on the same national sample of students. This makes it possible to compare test scores on the separate tests and thus determine the students' relative strengths and weaknesses in the different areas covered by the test. One limitation of test batteries is that all parts of the battery are usually not equally appropriate for measuring a particular school's objectives.

Compare standardized achievement test batteries to standardized achievement tests in a specific area. How do they compare in terms of reliability?

There are literally hundreds of separate tests designed to measure achievement in specific areas or single curricular topics such as reading or math. The majority of these can be classified as tests of course content or reading tests of the general survey type. Tests have also been developed for use in determining learning readiness. Since such a test can ask more questions in a given curricular area (e.g., reading) than the corresponding subtest of a standardized battery, it tends to have greater reliability in the particular area being assessed.
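The relationship between test length and reliability noted above is commonly quantified with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened (or shortened) by some factor. A minimal sketch, with illustrative values not taken from the book:

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability when a test is lengthened by factor k
    (k = new number of items / original number of items)."""
    return k * reliability / (1 + (k - 1) * reliability)

# A hypothetical 20-item reading subtest with reliability 0.70,
# tripled to a 60-item single-area reading test:
print(round(spearman_brown(0.70, 3), 3))  # → 0.875
```

This is why a full-length single-area test usually outperforms the shorter subtest of a battery in reliability, all else being equal.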

What is a customized achievement test? Why is it useful to the teacher? What precautions should the teacher take in using such a test?

Banks of objectives and related test items are maintained by most large test publishers and by some other organizations. These item banks are used for computer generation of customized tests. In some cases, the test publisher prepares the tests. In others, the publisher will sell or lease computer software that includes banks of items keyed to objectives and a program for constructing and printing locally prepared customized tests. The advantage of these customized tests is that they allow the teacher to choose questions particularly suited to, or in conformance with, classroom objectives. A limitation is that enough items keyed to each objective must appear on the test for the scores to be reliable. Regular achievement and customized achievement tests both measure what a student has learned, and both are useful for predicting success in learning new tasks. The main differences lie in the types of learning measured by each test and the types of prediction for which each test is most useful.

Chapter 17: Aptitude Tests

What are tests of aptitude? What are their uses? What are the limitations of aptitude tests?

Aptitude tests are designed to predict future performance in some activity. Aptitude tests can provide information that is useful in determining learning readiness, individualizing instruction, organizing classroom groups, identifying underachievers, diagnosing learning problems, and helping students with their educational and vocational plans. Contrary to popular belief, aptitude tests do not measure a fixed capacity nor can they predict future behavior with 100% accuracy.

What is the difference between aptitude and achievement? How is aptitude conceptualized or aptitude viewed today? 

Historically, aptitude was viewed as potential for acquiring some trait (e.g., learning), while achievement was viewed as past learning that occurred as a function of instruction or experience. More recently, this view has been modified: the present level of learned abilities can be useful in predicting future performance. Performance on aptitude tests is influenced by previous learning experiences, but it is less directly dependent on specific courses of instruction than is performance on achievement tests. The various types of learning measured by achievement and aptitude tests are best depicted when arranged along a continuum that classifies tests according to the degree to which the test content depends on specific learning experiences. At one extreme is the content-oriented achievement test that measures knowledge of specific course content. At the other extreme is the culture-oriented nonverbal aptitude test that measures a type of learning not much influenced by typical school experiences. As one moves along the continuum, the test content becomes less dependent on any particular set of learning experiences.

What is the relationship between scholastic aptitude, intelligence, and learning ability?

Tests designed to measure learning ability traditionally have been called "intelligence tests," and many people have historically equated learning aptitude with intelligence. This terminology is still used for some individually administered tests and for some group tests, but its use is declining. Today the terms learning ability tests, school ability tests, cognitive ability tests, and scholastic aptitude tests are used rather than intelligence tests. All these terms emphasize the fact that these tests measure developed abilities useful in learning, not innate capacity or undeveloped potential.

What are group learning ability tests? What are the two types of group learning ability tests and how do they differ? Name one major group test of learning ability from each type.

The majority of tests of learning ability administered in the schools are group tests. These are tests that, like standardized achievement tests, can be administered to many students at one time by persons with relatively little training in test administration. Some group tests yield a single score; others yield two or more scores based on measures of separate aspects of ability. An example of a group ability test that yields a single score is the Otis-Lennon School Ability Test. An example of a group ability test that yields separate scores is the Cognitive Abilities Test.

What are individual tests of ability? Why and with whom are they used? What are the two most often used individual tests of ability?

Learning abilities may be measured by individual tests. Sometimes these tests are called intelligence tests. Individual tests are administered to one examinee at a time in a face-to-face situation. The examiner presents the problems orally, and the examinee responds by pointing, giving an oral answer, or performing some manipulative task. The administrator of the test must usually be a licensed school psychologist. Because the individual test is administered to one student at a time, it is possible to control more carefully such factors as motivation and to assess more accurately the extent to which disabling behaviors are influencing the score. The influence of reading skill is deemphasized because the tasks are presented orally to the student. In addition, clinical insights concerning the student's method of attacking problems and persistence in solving them are more readily obtained with individual testing. These advantages make the individual test especially useful for testing young children, for retesting students whose scores on group tests are questionable, and for testing students with special problems. The two most popular individual tests of ability are the Stanford-Binet Intelligence Scale and the Wechsler Scales.

Chapter 18: Test Selection, Administration, and Use

Where might an educator go to obtain information about published tests? What types of information are contained in these sites?

There are a variety of places an educator may go to obtain information about published tests. The two most helpful are probably the Buros Mental Measurements Yearbook and Tests in Print. Both of these resources contain information about the test publisher, test cost, intended uses of the tests, technical information, and independent reviews. Another source is the Educational Testing Service Test Collection, which contains information and abstracts about thousands of tests. Still another resource is the publishers themselves; information may be obtained from their catalogues, although that information may not be completely objective. Finally, test information may be obtained from textbooks and from educational and psychological professional journals.

What are the steps involved in selecting appropriate tests?

The first step in selecting a test is to decide the purpose for which the test will be used. Defining testing needs should be the chief determinant in choosing a test. Another step is using available information in narrowing the choice of possible candidate tests. Next, one should locate suitable tests and obtain a specimen copy of the tests under consideration. Finally, the tests should be reviewed and evaluated before a final choice is made.

How should a test be administered? What are some fair and unfair practices in administering a published test?

The main requirement is that the testing procedures prescribed in the test manual be rigorously followed. Teachers do not have leeway to make special accommodations in test administration for their students, regardless of how much they personally want their students to succeed. Teachers may, however, try to reduce student test anxiety before the test is taken. Teaching specifically to the test, or giving long practice sessions with past tests, is an inappropriate practice.

What are some permissible uses of published test results?

The main aim of using published test results is to improve educational planning for students. Results help identify a student's current level of achievement, including strengths and weaknesses, and any discrepancies between perceived student ability and test results should be noted and addressed. Another legitimate use is sharing test results with parents to help them understand how their child is progressing toward learning goals. Finally, test results may be used to help students make educational and/or vocational choices, but the tests should not be the only criterion used in making these decisions.

What are some unwise uses of using published test information? 

Published test results should not be used to assign course grades; teacher-made tests are better designed and more valid for that purpose. Nor should published tests, by themselves, determine assignment to a remedial track or retention in a grade; rather, a variety of data and information should be used in these decisions. Finally, published test results should not be used to judge teacher effectiveness, fire teachers, or award raises. Students differ widely in their abilities and in the environmental contributors to their learning, and a teacher's soundness or unsoundness should not be judged on the basis of a single published test.

Chapter 19: Interpreting Test Scores and Norms

What are two differences between educational/psychological tests and tests in the natural sciences?

There are two primary differences between educational/psychological tests and tests in the natural sciences. The first is the issue of the true zero point. While it is possible, for example, to identify a point of zero length when measuring something concrete such as a table, there is no comparable point in learning where absolutely no learning has occurred. The second issue follows from the first: because educational tests lack a true zero point, one cannot be sure that the differences between scores are exactly equal and comparable. For example, while we can be sure that the interval between one inch and two inches is exactly equal to the interval between two inches and three inches, we cannot be sure that an IQ of 100 represents precisely twice the ability of an IQ of 50.

What is the difference between criterion-referenced and norm-referenced test scores? What are different types of each variety?

Criterion-referenced tests measure the extent to which a student has learned a set of specified objectives. Norm-referenced tests measure how well a student has done compared to other students in the norm group. Scores used to report criterion-referenced performance include raw scores, percentage-correct scores, and expectancy tables. Scores used to report norm-referenced performance include raw scores and derived scores such as percentile ranks and grade equivalents.

What are percentile ranks? How are they derived? What is the difference between percentile ranks and percentages?

A percentile rank indicates a student's relative position in the norm group in terms of the percentage of students scoring at or below that student's raw score. It is derived by counting the number of students in the norm group who score below a given raw score (plus half of those earning exactly that score), dividing by the total number of students, and multiplying by 100. A percentile rank should not be confused with a percentage-correct score: a percentage refers to the proportion of items a student answered correctly, whereas a percentile rank refers to the proportion of students in the norm group the student surpassed.

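The derivation of a percentile rank can be sketched in a few lines. This example uses the common midpoint convention (counting half of the tied scores) and an invented ten-student norm group, not data from the book:

```python
def percentile_rank(score, norm_group):
    """Percentile rank of a raw score within a norm group, counting
    scores below it plus half of the scores tied with it."""
    below = sum(1 for s in norm_group if s < score)
    tied = sum(1 for s in norm_group if s == score)
    return 100 * (below + 0.5 * tied) / len(norm_group)

norm_group = [12, 15, 15, 18, 20, 21, 21, 21, 24, 27]
print(percentile_rank(21, norm_group))  # → 65.0
```

Note that a student who answered 21 of, say, 30 items correctly has a percentage score of 70 but, in this group, a percentile rank of 65 — the two numbers answer different questions.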
What is the normal curve? How is the standard deviation related to the normal curve?

The normal curve is a symmetrical bell-shaped curve with many useful mathematical properties. One of the most useful from the viewpoint of test interpretation is that, when the curve is divided into standard deviation units, each portion under the curve contains a fixed percentage of cases. The standard deviation indicates how scores spread around the mean. In a normal distribution, practically all scores fall within three standard deviations above and below the mean, so the curve is commonly divided into six standard deviation units.
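The fixed percentages in each standard deviation band can be computed directly from the cumulative normal distribution; a minimal sketch using Python's standard library:

```python
import math

def normal_cdf(z):
    """Proportion of cases below z standard deviations from the mean."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Percentage of cases in each standard deviation band of the normal curve
for z1, z2 in [(-3, -2), (-2, -1), (-1, 0), (0, 1), (1, 2), (2, 3)]:
    pct = 100 * (normal_cdf(z2) - normal_cdf(z1))
    print(f"{z1:+d} to {z2:+d} SD: {pct:4.1f}%")
```

Running this reproduces the familiar bands (about 34.1% between the mean and one SD, 13.6% between one and two SDs, and 2.1% between two and three SDs on each side), which is what allows standard deviation units to be translated into percentages of students.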

What are standard scores? What are some of the standard score measures? Describe them.

Standard scores express test performance in terms of standard deviation units from the mean. The basic types of standard scores are z-scores, T-scores, normalized standard scores, stanines, normal-curve equivalents, and standard age scores. z-scores express test performance simply and directly as the number of standard deviation units a raw score is above or below the mean. T-scores are derived from z-scores (T = 50 + 10z) so that each score is a positive number. Normalized standard scores are obtained by converting raw scores to percentile ranks and then, using a table based on the normal curve, to the standard scores they would have in a normal distribution. Stanines are single-digit normalized scores on a nine-point scale ranging from 1 to 9. The normal-curve equivalent (NCE) is another normalized standard score, introduced to avoid some of the pitfalls of grade-equivalent scores; it has a mean of 50 and a standard deviation of 21.06. Finally, another widely used standard score for ability tests is the standard age score (SAS), for which the mean is set at 100 and the standard deviation at 16.
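The relationships among these score types can be sketched with a few conversions. This is illustrative only: the raw score, mean, and SD are invented, and the stanine and NCE lines use the simple linear forms, which coincide with the normalized versions only when scores are normally distributed:

```python
def standard_scores(raw, mean, sd):
    """Convert a raw score to several standard-score scales,
    assuming the score distribution is normal."""
    z = (raw - mean) / sd                       # SD units from the mean
    t = 50 + 10 * z                             # T-score: mean 50, SD 10
    stanine = min(9, max(1, round(2 * z + 5)))  # nine-point scale, 1-9
    nce = 50 + 21.06 * z                        # NCE: mean 50, SD 21.06
    sas = 100 + 16 * z                          # SAS: mean 100, SD 16
    return z, t, stanine, nce, sas

# A hypothetical raw score of 65 on a test with mean 50 and SD 10:
z, t, stanine, nce, sas = standard_scores(65, 50, 10)
print(z, t, stanine, round(nce, 1), sas)  # → 1.5 65.0 8 81.6 124.0
```

The same relative standing (1.5 SDs above the mean) thus appears as a T-score of 65, a stanine of 8, an NCE of about 82, or an SAS of 124, depending on the reporting scale chosen.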
