PUP Item Analysis
Additional Features:
Power Up Plus adds two powerful features to Break Out source code and Break Out Plus: a Test Performance Profile, using discrimination and difficulty, and a related student counseling matrix. These descriptive statistics tell a valid story about how each item and student performed on this test. With enough students to reduce sampling error, they also support predictions of future performance and point to items needing revision or to a needed change in instruction. The Test Performance Profile provides an overview of the test on one page.

The Test Performance Profile (sheet 7) divides the test into three sub-tests: Mastery/Easy, Unfinished, and Discriminating. Mastery/Easy items are needed to verify that students are mastering critical information or skills, or to adjust the average score of the test. In a perfect instructional system, in which all material is mastered, every item performs as a Mastery/Easy item and the test functions as a checklist. On this test, 23 of the 52 items are Mastery/Easy. The average score is 84%. This is expected in a mature course, that is, a course taught several times with a continuing effort to adjust instruction to the needs of the students, with pass/fail set at 75%.
An imperfect instructional system has two other groups of items: those that separate the students who know from those who do not (discriminating), and those that fail to make the separation (unfinished).
Discriminating items are also labeled with their power to discriminate. The 14 items labeled A, B, or C are the discriminating items on this test. They are more difficult than the Mastery/Easy items and easier than the Unfinished items. A test composed only of these 14 items would be labeled A, very discriminating. A test of equal discrimination ability with 50 items would be expected to have a test reliability of 0.92, which is characteristic of a standardized test. This is much higher than the 0.64 earned by the actual test, which is below the rule-of-thumb value of 0.70 for an acceptable classroom test. Discriminating items are also needed to create a usable score/grade distribution.
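The text does not say how the 0.92 projection was obtained. One standard way to project reliability to a different test length is the Spearman-Brown formula, so the sketch below assumes that method. The 14-item reliability of 0.76 is a hypothetical input chosen for illustration; it is not reported in the text.

```python
def spearman_brown(reliability: float, old_length: int, new_length: int) -> float:
    """Project test reliability to a new test length (Spearman-Brown formula)."""
    k = new_length / old_length  # lengthening factor
    return k * reliability / (1 + (k - 1) * reliability)

# Hypothetical value: the reliability of the 14-item sub-test is an assumption.
assumed_r14 = 0.76
projected = spearman_brown(assumed_r14, old_length=14, new_length=50)
print(round(projected, 2))
```

Under that assumption the projection lands near the 0.92 quoted above. A different starting reliability would give a different projection.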
The 15 Unfinished items require some change in instruction, learning, and/or the item itself. These items have little discriminating power and only dither the score distribution toward a "normal curve of error" (usually shortened to "normal curve"). They are candidates for revision into discriminating items or for instructional changes that will make them perform as Mastery/Easy items.
The above generalities provide the framework for a detailed examination of the test on a re-tabled student counseling mark matrix (sheet 3). The results with just 24 students are too unstable to make predictions of item performance but tell interesting stories about the current test.

The seven items with a difficulty of 96%, labeled at the bottom left of sheet 3, illustrate how item discrimination is calculated. One student missed each item. The three items labeled NB, NC, and NC (below DIFFICULTY) show negative discrimination because the three students who marked the wrong answers were in the top of the class. The next two items are not labeled because the students who missed them were in the middle of the score distribution. The last two items are labeled B and A because the students who missed them were at the bottom of the class. Statistically, it is very significant for the only student to miss an item to also be at the bottom of the class. From a practical point of view, none of these labels is of much value. That is why discriminating items with difficulties above 90% are not listed in the Discrimination column.
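A minimal sketch of why the sign of the discrimination depends only on where the one student who missed the item ranks. The 24 scores are hypothetical, and a plain Pearson correlation between the item marks and the scores on the remaining items stands in for the discrimination value. The letter labels (A, B, C, NB, NC) are not reproduced, because their cutoffs are not given in the text.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def single_miss_item(n_students, missed_by):
    """Item marks (1 = right, 0 = wrong) when exactly one student misses."""
    return [0 if i == missed_by else 1 for i in range(n_students)]

# Hypothetical scores on the remaining items for 24 students, best first.
rest_scores = list(range(51, 27, -1))

for label, index in [("top", 0), ("middle", 11), ("bottom", 23)]:
    item = single_miss_item(24, index)
    difficulty = round(100 * sum(item) / len(item))  # percent right
    r = pearson_r(item, rest_scores)
    print(label, difficulty, round(r, 2))
```

With 24 students, one miss gives a difficulty of 23/24, which rounds to 96%. The correlation is negative when the top student misses the item, near zero for a middle student, and positive for the bottom student.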
At-risk students can be found in the plots for the Unfinished (center) and Discriminating (right column) items. Their wrong marks are scattered among the Unfinished items but settle to the bottom among the Discriminating items, where lower-scoring students get the items wrong. This allows a prediction of which students are at risk: Kent, Murta, and Salton, who have most of their wrong answers in the Discriminating column. It also follows that their wrong answers made these items discriminating.
Usability Problems and Limits:
Sampling error occurs as a natural consequence of drawing small samples from the available collection. The results of each classroom test consist of scores from a small set of items, out of all possible items, answered by a small number of students, out of all possible students. The larger the number of items and students in relation to the total range, the smaller the effect of sampling error. The average score of five items, varying by one point, drawn from a total of 100 can range from 3% to 97%, 94 points. The average score of 30 items can range from 15% to 85%, 70 points. The average score of 60 items can range from 30% to 70%, 40 points. The average score for 90 items can range only from 45% to 55%, 10 points. Data stability increases with increased sample size.
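The shrinking range can be checked with a short sketch. It assumes the 100 items carry the values 1 through 100 (one reading of "varying by one point") and compares the lowest and highest possible sample averages. The function name is illustrative.

```python
def mean_range(sample_size, population=range(1, 101)):
    """Lowest and highest possible average for a sample of the given size."""
    values = sorted(population)
    low = sum(values[:sample_size]) / sample_size    # the smallest values
    high = sum(values[-sample_size:]) / sample_size  # the largest values
    return low, high

for n in (5, 30, 60, 90):
    low, high = mean_range(n)
    print(n, low, high, high - low)
```

Under this assumption the endpoints land within about a point of the figures quoted above. The exact values depend on how the 100 item values are numbered and rounded. The spread between the extremes falls steadily as the sample grows.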
Sampling error limits the usability of test results for test item revision. A minimum of 75 students or a consensus from three tests is required to avoid wasting teacher time. A medium sample of 150 students is better. An adequate sample of 300 is stable.
Predictability based on a correlation coefficient r is related to the square of the correlation coefficient. The rule of thumb of 0.7 for acceptable test reliability has a predictability of 0.49. The rule of thumb of 0.2 for acceptable item discrimination has a predictability of 0.04, which is nearly zero. Estimated discrimination ability varies almost randomly at low levels of r and with small class size. Increasing sample size may have little effect on discrimination other than to make the estimate more stable.
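The arithmetic behind the two rule-of-thumb values, as a small sketch (the function name is illustrative):

```python
def predictability(r: float) -> float:
    """Square of the correlation coefficient r."""
    return r * r

for label, r in [("test reliability", 0.7), ("item discrimination", 0.2)]:
    print(label, r, round(predictability(r), 2))
```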
Historical Problems:
The value for point biserial r is from the “corrected” formula in which the item under study is omitted from the calculation. At lower levels of r, this value is significantly less than the “uncorrected” value from old hand calculation methods that are often computerized.
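The difference between the two formulas can be seen on a small, hypothetical mark matrix (6 students by 5 items, 1 = right, 0 = wrong). This is a sketch of the general technique, not the Power Up Plus spreadsheet calculation. The uncorrected value correlates each item with the total score. The corrected value correlates it with the total score after the item itself is removed.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation; with a 0/1 item this is the point biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def item_discrimination(marks, item_index):
    """Return (uncorrected, corrected) point biserial r for one item."""
    item = [row[item_index] for row in marks]
    totals = [sum(row) for row in marks]
    rest = [t - i for t, i in zip(totals, item)]  # item under study omitted
    return pearson_r(item, totals), pearson_r(item, rest)

# Hypothetical marks: one row per student, one column per item.
marks = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
]

for j in range(len(marks[0])):
    uncorrected, corrected = item_discrimination(marks, j)
    print(j + 1, round(uncorrected, 2), round(corrected, 2))
```

On a short test each item makes up a large share of the total score, so the gap between the two values is easy to see.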
Standardized test makers use item discrimination as a valuable tool in selecting items for tests. The same techniques do not serve the classroom teacher well. Standardized test makers are most interested in using the fewest and most discriminating items to rank students. Teachers should be interested most in what students actually know and what they need to learn as part of a functional instructional system. Item discrimination is of value in describing results from each small classroom test but should not be the primary determiner in editing items. If you need the item to verify that your students have mastered a requirement, then keep it, even if it is not very discriminating.
Scoring only right marks on the test further clouds test results. With a pass/fail point set at 60%, a right mark falls about evenly between guessing and knowing. At the lower end of the scoring scale, where the need to know what a student knows and needs to learn is most critical, the results are muddied with forced guessing (and hope for a passing score).
One Solution:
Knowledge and Judgment Scoring solves the problem created by traditional scoring, which uses only part (right marks only) of the information available in multiple-choice items. Gambling is not required, and the student gets to report (and receive credit for) what he/she knows. This requires the use of higher levels of thinking on all items. Students receive scores for both knowledge and judgment: knowing and knowing what they know, quantity and quality.
Correct counseling is possible for low-scoring students. Students with good judgment need to spend more time with their current study methods. Those with poor judgment need, in general, to change study habits from rote memorization to questioning, relating, and verifying.
Power Up Plus scores both methods: traditional, and knowledge and judgment. Just change the test instructions and you are ready to administer a multiple-choice test for fair, honest, accurate, and meaningful scores:
TEST INSTRUCTIONS: Mark an answer for each question. Score 1 point for each right mark and zero for wrong. Or mark only answers to questions you know. Score 1 point for each right answer, 1/2 point for good judgment to not make a wrong mark (omit), and zero for wrong.
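The two scoring rules in the instructions can be sketched as follows. The mark codes ("R", "W", "O") and the function names are assumptions made for illustration. Only the point values come from the instructions above.

```python
RIGHT, WRONG, OMIT = "R", "W", "O"

def traditional_score(marks):
    """1 point for each right mark; wrong marks and omits earn nothing."""
    return sum(1 for mark in marks if mark == RIGHT)

def knowledge_judgment_score(marks):
    """1 point for right, 1/2 point for an omit, zero for wrong."""
    points = {RIGHT: 1.0, OMIT: 0.5, WRONG: 0.0}
    return sum(points[mark] for mark in marks)

# Hypothetical 10-item answer record: 6 right, 1 wrong, 3 omitted.
student = list("RRRRRRWOOO")
print(traditional_score(student))         # 6
print(knowledge_judgment_score(student))  # 7.5
```

In this example the three omits earn credit for judgment under the second rule, which the right-marks-only count cannot report.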
28 October 2008