"

14-3: Assessment & Grading


Looking Back

Part 2 examined the learning environment—how effective teachers maximize instructional time through efficient transitions & procedures, establish classroom management systems during critical initial class periods, & apply behavioral principles including functional assessment & applied behavior analysis to address serious misbehavior. Now we explore assessment & grading.

Developing & Using Learning Objectives

A learning objective is a statement of specific knowledge, skills, or dispositions students should demonstrate after instruction. Clear objectives guide instructional planning, focus teaching efforts, & enable meaningful assessment. Without explicit objectives, teachers may wander through content without direction, & assessment becomes arbitrary rather than aligned with intended outcomes. Learning objectives serve multiple functions: clarifying instructional intent for teachers, students, & stakeholders; guiding selection of instructional activities; providing criteria for assessing whether learning occurred; & communicating expectations so students understand what they should learn.

Effective learning objectives contain three elements. The behavior element specifies observable actions demonstrating learning. Vague terms like “understand,” “know,” or “appreciate” resist observation & measurement. Better verbs specify observable actions: identify, compare, calculate, construct, explain, predict, demonstrate. “Students will understand photosynthesis” provides no assessment guidance. “Students will diagram the process of photosynthesis, labeling inputs, outputs, & energy transformations” specifies observable behavior.

The conditions element describes circumstances under which performance occurs—what resources, constraints, or contexts apply. “Given a topographic map” or “without using a calculator” or “working in collaborative groups” specify conditions affecting performance expectations. The criterion element specifies acceptable performance levels—“with 80% accuracy” or “completing at least four of five steps correctly” or “within 10 minutes” establish standards enabling consistent evaluation. Without criteria, assessment becomes subjective—how good is good enough?

Bloom’s taxonomy provides a framework for writing objectives at different cognitive levels. The revised taxonomy (Anderson & Krathwohl, 2001) organizes cognitive processes hierarchically: remembering (retrieve relevant knowledge from memory), understanding (determine meaning of instructional messages), applying (use procedures in given situations), analyzing (break materials into components & determine relationships), evaluating (make judgments based on criteria & standards), & creating (put elements together to form coherent wholes or produce original work). The revision changed category names from nouns to verbs & reordered the highest levels, placing creating above evaluating. Effective instruction includes objectives across multiple levels, recognizing that lower levels provide necessary foundations for higher-order thinking.

The Importance & Functions of Assessment

Assessment serves multiple distinct functions in education, each requiring different approaches & interpretations. Understanding these functions prevents misusing assessment information & helps select appropriate strategies for particular purposes. Formative assessment, occurring during instruction, provides ongoing feedback guiding teaching & learning. Its purpose is improvement—identifying what students understand & where confusion remains so instruction can adapt. Quick checks, exit tickets, observations, questioning, & practice problems all serve formative purposes when teachers use results to modify instruction.

A systematic review of meta-analyses (Sortwell et al., 2024) synthesized 13 previous meta-analyses on formative assessment in K-12 settings, finding positive effects across studies, though with varying magnitude depending on assessment type & implementation features. An earlier meta-analysis (Xuan et al., 2022) found modest effects (ES = 0.19) specifically for reading achievement, with effectiveness varying by cultural context & instructional features. Summative assessment, occurring after instruction, evaluates achievement for reporting & decision-making. Unit tests, final exams, & end-of-course assessments serve summative purposes—determining what students learned & assigning grades or certifying competency.

The distinction between formative & summative assessment lies primarily in purpose & timing rather than instrument characteristics. The same quiz could serve formative purposes (identifying misconceptions to address) or summative purposes (contributing to course grades) depending on how results are used. Placement assessment determines appropriate instructional levels or program placements. Diagnostic assessment identifies specific learning difficulties requiring targeted intervention. These specialized functions require assessments designed for their particular purposes—a summative test may poorly serve diagnostic functions even if technically sound for its intended purpose.

The Power of Feedback

Feedback represents formative assessment’s core mechanism. A comprehensive meta-analysis (Wisniewski et al., 2020), synthesizing 435 studies with over 61,000 participants, found a medium overall effect (d = 0.48) of feedback on student learning. However, significant heterogeneity in the data reveals that feedback cannot be understood as a single consistent treatment—its effectiveness varies substantially based on feedback content, timing, & delivery method. Feedback targeting cognitive & motor skills showed larger effects than feedback targeting motivational outcomes, where 21% of effect sizes were actually negative—often when feedback consisted of uninformative rewards or punishments.

Effective formative assessment requires several features. Frequency ensures ongoing monitoring rather than delayed discovery of problems. Immediacy of feedback enables timely adjustment—feedback weeks later loses instructional value. Specificity identifies particular strengths & weaknesses rather than global judgments. Actionability suggests what to do differently, not just what went wrong. Generic feedback (“good job,” “needs improvement”) provides no guidance. Effective feedback identifies specific successes (“Your thesis clearly states your argument”) & specific improvements needed (“Your second paragraph lacks evidence supporting your claim—add a specific example”). Research consistently shows that specific, actionable feedback improves learning more than grades alone.

Norm-Referenced & Criterion-Referenced Evaluation

Norm-referenced evaluation compares students’ performance to each other, ranking individuals within a group. Scores indicate relative standing—how a student performed compared to peers. Grading on a curve exemplifies norm-referenced evaluation: regardless of absolute performance, some students receive high grades & some low grades based on relative position. Standardized tests typically use norm-referenced interpretation, reporting percentile ranks indicating what percentage of the norming group scored below a particular score. A student at the 75th percentile scored higher than 75% of the comparison group.
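To make the interpretation concrete, here is a minimal Python sketch using hypothetical norming data; it follows the common convention of counting only scores that fall strictly below the score of interest (other conventions also count half of any tied scores).

```python
# Illustrative sketch: percentile rank as the percentage of the norming
# group scoring below a given raw score (hypothetical data).
norm_group = [58, 62, 65, 70, 71, 74, 75, 78, 80, 83, 85, 88, 90, 92, 95, 97]

def percentile_rank(score, norms):
    below = sum(1 for s in norms if s < score)
    return 100 * below / len(norms)

print(percentile_rank(84, norm_group))  # 62.5 -> higher than 62.5% of the group
```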

However, norm-referenced evaluation has limitations. It provides no information about what students actually know or can do—only how they compare to others. If the comparison group performs poorly, a high percentile rank might reflect mediocre absolute achievement. Norm-referenced evaluation also creates inherent competition; one student’s success comes at another’s relative expense, potentially undermining collaborative learning cultures.

Criterion-referenced evaluation compares performance to established standards rather than to other students. Scores indicate what students know or can do regardless of how others performed. A student either meets the standard or does not, independent of peer performance. If all students master objectives, all can succeed; success is not zero-sum. Driver’s license tests exemplify criterion-referenced evaluation: passing requires demonstrating specific competencies—parallel parking, traffic law knowledge—regardless of how other test-takers perform. Criterion-referenced evaluation aligns with mastery learning philosophy—the belief that given sufficient time & appropriate instruction, most students can master most objectives.

Test Construction Principles

Well-constructed tests align with learning objectives, include appropriate item types, sample content representatively, & minimize construct-irrelevant variance—factors affecting scores unrelated to intended measurement. A mathematics test requiring extensive reading comprehension introduces construct-irrelevant variance; poor readers may score low despite mathematical competence. Test validity refers to whether score interpretations are justified for intended purposes. Contemporary views emphasize that validity concerns the inferences drawn from assessment data rather than properties of the test itself—no assessment tool is inherently valid or invalid; what matters is the appropriateness of conclusions drawn from results.

Reliability refers to consistency of measurement—whether a test produces similar results under similar conditions. Unreliable tests cannot support valid inferences; random measurement error obscures true performance. Content validity requires adequate sampling of the content domain—a test covering only Chapter 3 lacks validity for assessing the entire unit. Item difficulty should match purpose: competency assessments target the threshold level, while discriminating tests include items varying in difficulty. Very easy items (everyone correct) & very difficult items (everyone incorrect) provide no information distinguishing students.
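As one illustration of these item statistics, the sketch below assumes dichotomously scored (right/wrong) items & hypothetical response data: the difficulty index is the proportion of students answering correctly, & a simple discrimination index compares upper- & lower-scoring halves of the class.

```python
# Illustrative sketch: classical item analysis for right/wrong items.
# Rows are students, columns are items (1 = correct, 0 = incorrect); data are hypothetical.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]

def item_difficulty(item):
    """Proportion of students answering the item correctly."""
    return sum(row[item] for row in responses) / len(responses)

def item_discrimination(item):
    """Upper-group minus lower-group difficulty, splitting the class by total score."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[half:]
    p_upper = sum(row[item] for row in upper) / len(upper)
    p_lower = sum(row[item] for row in lower) / len(lower)
    return p_upper - p_lower

for i in range(4):
    print(f"Item {i + 1}: difficulty = {item_difficulty(i):.2f}, "
          f"discrimination = {item_discrimination(i):.2f}")
```

In this toy data set, the middle-difficulty second item discriminates most sharply between stronger & weaker students, while the hardest & easiest items discriminate less, echoing the point that extreme items provide little information distinguishing students.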

Writing Objective Test Items

Objective tests include items scored as right or wrong with no judgment required—scoring is objective in that different scorers reach identical conclusions. Multiple-choice questions represent the most versatile format, capable of assessing knowledge through application levels. Each item includes a stem (question or incomplete statement), one correct answer, & several distractors (plausible incorrect options). Effective distractors represent common misconceptions or errors; implausible distractors waste options & inflate guessing success. Stems should present complete, clear problems; avoid stems requiring students to read all options to understand the question.

Common multiple-choice item flaws include grammatical clues signaling correct answers (“an” before vowel-initial options), the longest option being correct, “all of the above” or “none of the above” providing strategic cues, & overlapping options creating multiple correct answers. Avoid absolute determiners (all, never, always, none), which make statements obviously false, & specific determiners (often, sometimes, generally), which make statements likely true—test-wise students exploit these patterns without content knowledge. True-false items allow 50% guessing success & limit assessment to recognition of factual accuracy. Matching items present premises & responses to associate; include more responses than premises to reduce guessing & keep content homogeneous within sets.

Developing Essay Questions

Essay questions require students to organize & express ideas in written form, assessing abilities difficult to measure with objective items: synthesis, evaluation, argumentation, extended explanation, & creative expression. Essays reveal thinking processes & communication skills invisible in selected-response formats. However, essays require substantial scoring time, introduce scorer subjectivity, & sample content narrowly given time constraints.

Effective essay questions include clear specific directions avoiding ambiguity. “Discuss the causes of World War I” provides insufficient guidance—discuss how many causes? In what depth? Better: “Identify three causes of World War I. For each cause, explain the historical events leading to it & analyze how it contributed to the outbreak of war.” Define scope appropriately for available time. Extended essay questions allowing 30+ minutes enable deeper analysis but limit content coverage; restricted-response questions allow broader sampling but shallower treatment.

Develop detailed rubrics before administering essays. Rubrics specify criteria & performance levels, reducing subjectivity & clarifying expectations. Analytic rubrics score multiple dimensions separately (content, organization, evidence, mechanics); holistic rubrics assign single overall scores based on general quality. Score essays anonymously using student ID numbers to reduce bias from expectations based on prior performance. Score all responses to one question before moving to the next, maintaining consistent standards & avoiding halo effects.
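As a minimal sketch of how analytic scoring might be tallied (the dimensions, weights, & 1-4 scale here are all hypothetical), each dimension is rated separately & combined into a weighted total; a holistic rubric would instead assign a single overall rating.

```python
# Illustrative sketch: combining analytic rubric ratings into one score.
# Dimensions, weights, & the 1-4 scale are hypothetical examples.
weights = {"content": 0.40, "organization": 0.25, "evidence": 0.25, "mechanics": 0.10}

def analytic_score(ratings, max_level=4):
    """Weighted percentage from per-dimension ratings on a 1..max_level scale."""
    weighted = sum(weights[dim] * ratings[dim] for dim in weights)
    return 100 * weighted / max_level

essay = {"content": 4, "organization": 3, "evidence": 2, "mechanics": 4}
print(f"{analytic_score(essay):.0f}%")  # 81%: strong content, weak evidence
```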

Performance-Based Assessment

Performance-based assessment evaluates students through demonstrations, products, or extended projects rather than traditional tests. Science experiments, oral presentations, artistic performances, research papers, constructed models, & athletic demonstrations exemplify authentic tasks resembling real-world applications. Performance assessment suits objectives involving complex skills difficult to assess through paper-and-pencil tests: laboratory techniques, speaking skills, physical performances, artistic creation, collaborative work, & extended inquiry. Knowing about swimming differs from swimming; performance assessment evaluates actual swimming rather than knowledge about it.

Advantages include assessing authentic complex skills, revealing student thinking processes, providing meaningful tasks motivating engagement, & enabling evaluation of skills invisible in traditional tests. Challenges include substantial time requirements for both completion & scoring, difficulty ensuring reliability across raters & occasions, limited content sampling given time demands, & potential for construct-irrelevant factors (presentation anxiety, group dynamics) to affect scores. Effective performance assessment requires detailed rubrics clearly specifying evaluation criteria & performance levels. Portfolios collect student work samples over time, demonstrating growth, achievement, & reflection. Effective portfolios include student selection & reflection—explaining why pieces were chosen & what they demonstrate about learning.

Peer & Self-Assessment

Peer assessment involves students evaluating each other’s work, providing feedback & sometimes assigning grades. A meta-analysis (Double et al., 2020) of 54 studies found a small to medium effect (g = 0.31) of peer assessment on academic performance across primary, secondary, & tertiary students. Research indicates peer assessment is most effective as formative feedback where students can apply suggestions to improve subsequent performance. As feedback givers, students analyze peers’ work against assessment criteria, make judgments, construct suggestions, & reflect on their own work by comparison. As feedback receivers, they evaluate multiple perspectives & self-regulate their learning based on peer input.

Self-assessment requires students to evaluate their own work against criteria, developing metacognitive awareness & self-regulation skills essential for lifelong learning. Combined self & peer assessment interventions show positive effects on academic performance, particularly when students receive explicit training on assessment criteria & feedback provision. Challenges include students’ tendency toward inflated self-ratings, discomfort with evaluating peers, & time required for training & implementation. Effective implementation requires explicit rubrics with concrete examples, practice opportunities with feedback, & clear expectations about how peer & self-assessments will be used.

Grading

Grading translates assessment information into symbols (letters, numbers, percentages) communicating achievement to students, parents, & institutions. Effective grading systems accurately represent learning, motivate continued effort, & provide useful information for decisions. Grading philosophies vary substantially. Some teachers believe grades should reflect achievement only—what students learned, regardless of effort, behavior, or improvement. Others include effort, participation, or behavior, believing grades should reflect the “whole student.” Including non-achievement factors confounds interpretation—does a B reflect moderate achievement, high achievement with poor behavior, or low achievement with exceptional effort?

Standards-based grading emphasizes criterion-referenced evaluation against explicit learning standards. Rather than averaging points across assignments, standards-based grading reports proficiency on specific learning outcomes. A cluster randomized control trial (Krier et al., 2024) evaluated PARLO (Proficiency-based Assessment & Reassessment of Learning Outcomes) in ninth-grade mathematics classrooms, finding positive effects particularly for previously low-performing students when the system included formative feedback & opportunities for reassessment. Key features included basing final grades on demonstrated proficiency rather than accumulated points & allowing students to reassess for full credit after further study—fostering a growth mindset where initial struggles do not permanently damage grades.

Grade calculation methods significantly affect outcomes & incentives. Averaging all assignments weights early performance, when students were still learning, as heavily as later performance demonstrating mastery. Some teachers weight recent assessments more heavily, use median rather than mean calculations, or drop lowest scores. Zero grades for missing work devastate averages disproportionately to the missing learning—a zero on a 100-point scale represents 60 points below failing, while an A represents only 10 points above a B. Extra credit raises philosophical questions about what grades should represent. These decisions should align with grading philosophy & be communicated clearly so students understand what grades mean.
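The arithmetic is easy to check with a short sketch; the scores below are hypothetical, & the missing assignment is entered as a zero.

```python
# Illustrative sketch: the same record graded by different calculation methods.
# Scores are hypothetical; the missing assignment is recorded as 0.
from statistics import mean, median

scores = [0, 85, 88, 90, 92]

print(f"Mean:              {mean(scores):.1f}")              # 71.0 (the zero dominates)
print(f"Median:            {median(scores):.1f}")            # 88.0
print(f"Mean, drop lowest: {mean(sorted(scores)[1:]):.1f}")  # 88.8
```

One missing assignment pulls an otherwise high-B student down to a 71 average under straight averaging, illustrating the disproportionate effect of a zero described above.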

Looking Forward

Module 14 comprehensively explored educational psychology’s application to classroom practice—research methods establishing evidence for effective teaching, student diversity shaping learning experiences, the learning environment maximizing instructional time & managing behavior, & assessment & grading evaluating & communicating student achievement. These foundations integrate learning principles developed across earlier modules, enabling application of behavioral science to the complex realities of education.

License

Psychology of Learning TxWes Copyright © by Jay Brown. All Rights Reserved.