3  Scoring Procedure For Physics Learning and Performance Outcomes

For all measures that required qualitative coding or scoring, the following procedure was used. First, a rubric was developed through group discussion with the research team. Next, a portion of the responses was coded independently by at least two coders. Discrepancies from this first round of coding were reviewed by the team, and the rubric was revised as needed. The remaining responses were then coded in full by two coders, and all discrepancies were resolved in the presence of a third coder. Any responses that could not be easily resolved were brought to the research team for review. For each measure that was coded or scored, we report two inter-rater reliability statistics, one for each round of coding.
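As a minimal illustration of this workflow (not the project's actual code; the data frame and column names are hypothetical), disagreements between two independent coders can be flagged programmatically before team review:

```python
# Illustrative sketch only: flag rows where two independent coders disagree so
# they can be reviewed by a third coder or the full research team.
import pandas as pd

scores = pd.DataFrame({
    "response_id": [101, 102, 103, 104, 105],
    "coder_1":     [2, 1, 0, 2, 1],   # hypothetical rubric scores
    "coder_2":     [2, 0, 0, 2, 1],
})

# Disagreements go back for discussion; agreements stand as the final code.
discrepancies = scores[scores["coder_1"] != scores["coder_2"]]
print(discrepancies)
```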

3.1 Physics Assessment, Part 1: Quantitative Problem Solving

The quantitative problems (one for each test version) were scored according to a rubric developed by the research team. Partial credit was given for incomplete or partially correct solutions. Responses from baseline and posttest were combined and randomized within each test version before scoring, and team members were blinded to condition. The intraclass correlation for the first round of coding was .74 for version A (20 responses) and .80 for version B (29 responses). The intraclass correlations for the remaining solutions were .70 for version A (129 responses) and .94 for version B (117 responses). Final scores were calculated as the proportion of points earned out of the total possible points. Three responses on the posttest quantitative problem-solving task were removed from analysis: in one case, additional work had clearly been cropped out of the uploaded image and could not be scored; in another, the uploaded file was a duplicate of the participant's baseline file; and in the last, the handwriting was deemed illegible by all of the raters.
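A hedged sketch of how these statistics could be computed is shown below. This is not the study's analysis script: the choice of ICC form, the column names, and the rubric point values are assumptions for illustration only.

```python
# Sketch only. Which ICC form the team used is not stated above; pingouin
# reports several (ICC1-ICC3 and their averaged variants) for comparison.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "response": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],   # response ID (target)
    "rater":    ["A", "B"] * 5,                   # two independent coders
    "points":   [4, 5, 2, 2, 6, 6, 3, 4, 5, 5],   # hypothetical rubric points
})

icc = pg.intraclass_corr(data=ratings, targets="response",
                         raters="rater", ratings="points")
print(icc[["Type", "ICC"]])

# Final score for each response: proportion of points earned out of the total
# possible (8 is a hypothetical rubric maximum).
ratings["final_score"] = ratings["points"] / 8
```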

3.2 Physics Assessment, Part 2: Problem Categorization

Individual items were scored dichotomously as 0 (incorrect) or 1 (correct). Final scores were calculated by taking the average accuracy across the five categorization items.
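For concreteness, the categorization score is simply the mean of the five dichotomous item scores (the values below are hypothetical):

```python
# Hypothetical responses: 1 = correct, 0 = incorrect for each of the five items.
item_scores = [1, 0, 1, 1, 0]
categorization_score = sum(item_scores) / len(item_scores)   # 0.6
```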

3.3 Physics Assessment, Part 3: Qualitative Problem Solving

The three multiple-choice questions were scored dichotomously as 0 (incorrect) or 1 (correct). For the purposes of the current research questions, we did not analyze the open-response explanations for the multiple-choice questions. The open-ended word problem was coded and scored by the research team. Responses were awarded a maximum of two points: one point for each component of the correct explanation. For example, the question in test version B described a scenario in which an elevator cable snaps and the emergency friction brakes are engaged. The student was asked to describe the types of energy transfer that occur in this scenario. One point was given for describing gravitational potential energy being converted to kinetic energy, and another for describing kinetic energy being converted to thermal and sound energy through the work done by friction. As with the quantitative problem, all responses were combined and randomized across timepoints, and experimental condition was removed from the data before scoring. The weighted Kappa for the first round of coding was .36 for version A (50 responses) and .58 for version B (53 responses). The weighted Kappas for the second round of coding were .71 for version A (99 responses) and .74 for version B (95 responses). When calculating the final score, the open-ended word problem was weighted equally with each multiple-choice question, such that two-point responses were scored as 1 and one-point responses as .5. Final scores for qualitative problem solving were calculated by taking the mean score across all items.
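A hedged sketch of both calculations follows. The weighting scheme for the kappa (linear vs. quadratic) is not specified above and is assumed here, and all codes are hypothetical rather than study data.

```python
# Sketch only: weighted kappa on the 0/1/2-point open-ended codes, and the
# rescaling used for the final qualitative problem-solving score.
from sklearn.metrics import cohen_kappa_score

coder_1 = [2, 1, 0, 2, 1, 2]
coder_2 = [2, 1, 1, 2, 0, 2]
kappa_w = cohen_kappa_score(coder_1, coder_2, weights="linear")  # weighting assumed

def qualitative_score(mc_items, open_ended_points):
    """Mean of the three 0/1 multiple-choice items and the rescaled open-ended item."""
    open_ended = open_ended_points / 2     # 2 points -> 1, 1 point -> .5, 0 -> 0
    return (sum(mc_items) + open_ended) / (len(mc_items) + 1)

print(kappa_w)
print(qualitative_score([1, 0, 1], 2))     # (1 + 0 + 1 + 1) / 4 = 0.75
```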

3.4 Physics Assessment, Part 4: Preparation for Future Learning (PFL)

There were two components to the response for each of the two PFL questions: a multiple-choice selection and an open-response explanation. We defined correctness on the PFL as selecting the correct multiple-choice response option and reasoning correctly in the open-response explanation. For both questions, student reasoning was considered correct if they mentioned the relative maximum heights or initial y velocities of the two trajectories as a justification for their answer, or if they mentioned that the y component was most important for determining time in the air. These questions were the only ones for which we did not complete two separate rounds of coding, because we built directly on our past work with a similar population (Weinlader et al., 2019). Otherwise, the coding procedure was identical to that used for the other measures: responses were randomized and coders were blinded to condition. Both PFL questions were coded in full by two coders as either correct or incorrect. The unweighted Kappa for both of the PFL questions was .88. All discrepancies were resolved through discussion in the presence of a third coder.
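The dichotomous scoring rule and the unweighted kappa could be computed along the following lines; this is a sketch under hypothetical codes, not the study data.

```python
# Sketch only: a PFL response is correct (1) only if both the multiple-choice
# selection and the open-response reasoning about the y component are correct.
from sklearn.metrics import cohen_kappa_score

def pfl_score(mc_correct: bool, explanation_correct: bool) -> int:
    return int(mc_correct and explanation_correct)

# Hypothetical correct/incorrect codes from two independent coders.
coder_1 = [1, 1, 0, 0, 1, 0, 1, 1]
coder_2 = [1, 1, 0, 1, 1, 0, 1, 1]
print(cohen_kappa_score(coder_1, coder_2))   # unweighted Cohen's kappa
```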

Supplementary Table 3.1: Summary of Item Types and Scoring Scales by Outcome

Outcome Measure                           | Items and Measurement                                    | Scoring
Physics Problem Solving Performance       |                                                          |
  Part 1: Quantitative Problem Solving    | 1 item, open response                                    | Points earned / points possible, continuous from 0 to 1
  Part 2: Problem Categorization          | 5 items, forced-choice                                   | Mean score, continuous from 0 to 1
  Part 3: Qualitative Problem Solving     | 3 multiple-choice items, 1 open response                 | Mean score, continuous from 0 to 1
Preparation for Future Learning           |                                                          |
  Part 4: Preparation for Future Learning | 2 items, multiple-choice with open-response explanation  | Dichotomous (0 or 1); 1 = correct multiple-choice response and correct attention to the y component in the explanation
Momentary Item-Level Perceptions          |                                                          |
  Confidence                              | 1 item, measured repeatedly                              | Continuous from 1 to 6
  Anxiety                                 | 1 item, measured repeatedly                              | Continuous from 1 to 6
  Difficulty                              | 1 item, measured repeatedly                              | Continuous from 1 to 6