The steps
About 45 min
Pick four dimensions tied to the rubric
Open the role's rubric and pick four dimensions. Three is fine if the role is narrow. Five is the absolute upper bound. The four should be the highest-leverage signals your team will actually use in the decision.
Write the level-5 anchor for each dimension
Start at the top. What does the strongest possible answer on this dimension look like — in behaviour, not adjective? Write one or two sentences per dimension. This is the hardest writing of the whole exercise.
Write the level-1 anchor for each dimension
Now the floor. What does an unambiguous 'no' look like on this dimension? Behaviour, not adjective. This sets the dynamic range of the scale.
Fill in levels 2, 3, and 4 by interpolating
With the floor and ceiling set, the middle three are easier. Level 3 is the median acceptable hire on this dimension. Levels 2 and 4 are halfway-to-floor and halfway-to-ceiling. Behaviour-anchored throughout.
Add the recommendation field with four options
Strong yes, yes, no, strong no. No 'maybe.' 'Maybe' is a non-answer that becomes a default and produces decisions by interviewer-count rather than by judgement.
Add one mandatory free-text field per dimension
Two sentences. What evidence in the interview pushed the score in this direction? This is what makes the score reviewable in the debrief and recallable a month later.
Test the scorecard with one interviewer on one real candidate
Run a 60-minute interview, then ask the interviewer to fill out the scorecard with a stopwatch. If completion takes more than 5 minutes after the interview, the scorecard is too long. Cut one dimension and re-test.
Roll out with a 48-hour feedback SLA
Every interviewer submits the scorecard within 48 hours of the interview. Memory decays sharply after that. The SLA is the social contract that makes the scorecard actually used.
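The steps above describe a small, fixed data shape, and it can help to see it written down. The following is a minimal sketch, not a reference implementation: the class and function names (`Recommendation`, `DimensionScore`, `Scorecard`, `validate`) are hypothetical and chosen for illustration. It encodes the constraints from the steps: three to five dimensions, levels from 1 to 5, a mandatory evidence note per dimension, and a recommendation field with no "maybe."

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Recommendation(Enum):
    """The four allowed recommendations. There is deliberately no 'maybe'."""
    STRONG_YES = "strong yes"
    YES = "yes"
    NO = "no"
    STRONG_NO = "strong no"


@dataclass(frozen=True)
class DimensionScore:
    dimension: str
    level: int     # 1 = floor anchor, 5 = ceiling anchor
    evidence: str  # mandatory free-text note, roughly two sentences


@dataclass(frozen=True)
class Scorecard:
    interviewer: str
    scores: Tuple[DimensionScore, ...]
    recommendation: Recommendation


def validate(card: Scorecard) -> List[str]:
    """Return a list of problems; an empty list means the card is complete."""
    problems = []
    if not 3 <= len(card.scores) <= 5:
        problems.append(f"expected 3 to 5 dimensions, got {len(card.scores)}")
    for score in card.scores:
        if not 1 <= score.level <= 5:
            problems.append(f"{score.dimension}: level must be 1 to 5")
        if not score.evidence.strip():
            problems.append(f"{score.dimension}: evidence note is required")
    return problems
```

Because the recommendation is an enum, `Recommendation("maybe")` raises a `ValueError`, so a "maybe" cannot be stored by accident.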
Most interview scorecards fail in the same two places: too many dimensions, and anchors written as adjectives instead of behaviours. The result is a scorecard interviewers fill in by averaging — every candidate gets a 3 — and a debrief that runs on vibes instead of evidence. This is how to write one that survives the first ten uses.
Why do most interview scorecards get abandoned?
The root cause is friction. A scorecard with seven dimensions, each requiring a score and a paragraph of justification, takes 15 to 20 minutes to fill out properly. After the third interview of the day, interviewers compress the bottom half of the form into rubber-stamp scores, and the scorecard's data becomes noise. According to recurring research from the Society for Industrial and Organizational Psychology and Harvard Business Review on structured interviewing, the predictive validity of a scorecard collapses when interviewer completion rates fall — and completion rates fall when the form takes too long.
The fix is fewer dimensions, behaviour-anchored levels, and a strict completion-time target. Four dimensions, 1-to-5 scale, two-sentence note per dimension. Five minutes per interview, after the interview ends. Anything longer is the design failing — not the interviewer.
TIP: The completion-time test
Time the first interviewer who uses your scorecard, after a real 60-minute interview. If it takes more than 5 minutes to fill out, cut one dimension. If it takes more than 10 minutes, the scorecard needs a redesign. Length is the single biggest predictor of whether it gets used.
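The two thresholds in the completion-time test can be written as a tiny triage function. This is only a sketch; the function name and the returned labels are hypothetical, and the 5- and 10-minute cut-offs are taken directly from the tip.

```python
def completion_time_verdict(minutes: float) -> str:
    """Map a measured fill-out time (in minutes) to the action the tip suggests."""
    if minutes > 10:
        return "redesign"
    if minutes > 5:
        return "cut one dimension"
    return "keep"
```

For example, a measured time of 7 minutes returns `"cut one dimension"`, while exactly 5 minutes still passes.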
What does a working scorecard look like, dimension by dimension?
Four dimensions cover the vast majority of roles. The specifics change; the structure does not.
| Dimension | Level 5 anchor (behaviour) | Level 1 anchor (behaviour) |
|---|---|---|
| Communication | Restates the question, gives the answer in one sentence, then supports with evidence | Cannot summarise their own point; meanders or stalls |
| Ownership | Talks about decisions they made, trade-offs they weighed, mistakes they corrected | Talks about what 'we' did with no individual contribution visible |
| Role-specific competence | Demonstrates depth through specific examples and named trade-offs | Surface-level mentions of relevant tools or methods, no depth |
| Wildcard (role-specific) | Custom per role — typically curiosity, judgement, or strong opinion loosely held | Custom per role — typically deflection, dogma, or absence of stance |
The dimensions stay the same across roles. The wildcard slot is where role specificity lives. For an engineering role, it might be "names the constraint that matters and ignores the rest." For a sales role, "asks better discovery questions than the interviewer." For ops, "spots the second-order failure mode." One slot, role-specific, behaviour-anchored.
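One way to keep the three constant dimensions fixed while swapping the wildcard per role is to store the anchors as data. The sketch below is an assumption about how a team might do that, not a prescribed format: `Anchor`, `CORE_ANCHORS`, and `build_rubric` are hypothetical names. The core anchor text is copied from the table; the wildcard floor anchor in the usage example is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Anchor:
    level_5: str  # behaviour at the ceiling
    level_1: str  # behaviour at the floor


# The three dimensions that stay constant across roles (text from the table).
CORE_ANCHORS: Dict[str, Anchor] = {
    "Communication": Anchor(
        "Restates the question, gives the answer in one sentence, "
        "then supports with evidence",
        "Cannot summarise their own point; meanders or stalls",
    ),
    "Ownership": Anchor(
        "Talks about decisions they made, trade-offs they weighed, "
        "mistakes they corrected",
        "Talks about what 'we' did with no individual contribution visible",
    ),
    "Role-specific competence": Anchor(
        "Demonstrates depth through specific examples and named trade-offs",
        "Surface-level mentions of relevant tools or methods, no depth",
    ),
}


def build_rubric(wildcard_name: str, wildcard: Anchor) -> Dict[str, Anchor]:
    """Return the core dimensions plus exactly one role-specific wildcard."""
    if wildcard_name in CORE_ANCHORS:
        raise ValueError("wildcard must not duplicate a core dimension")
    rubric = dict(CORE_ANCHORS)  # copy so the shared core is never mutated
    rubric[wildcard_name] = wildcard
    return rubric


# Usage: an engineering rubric. The level-1 text here is a made-up example.
engineering = build_rubric(
    "Constraint focus",
    Anchor(
        "Names the constraint that matters and ignores the rest",
        "Treats every constraint as equally important",
    ),
)
```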
Why behaviour anchors and not adjectives?
Adjective anchors — "strong," "satisfactory," "weak" — sound rigorous and aren't. The problem is that every interviewer interprets "strong" against their own personal bar, calibrated against the people they have personally worked with. Two interviewers who both score a candidate "strong" may have wildly different evidence in mind. The scorecard then says they agree when they don't.
Behaviour anchors fix this by describing what the interviewer should have heard or seen. "Restates the question, gives the answer in one sentence, then supports with evidence" is a behaviour. Two interviewers either both observed it or didn't. The disagreement, when it happens, is about what the candidate did — which is a recoverable disagreement, not a values one.
Writing behaviour anchors is the hardest part of building a scorecard. Most teams cheat and write something like "strong communicator." Resist this. The compounding cost of a vague anchor is a year of misaligned interview debriefs.
Should interviewer scores be visible to other interviewers?
After each interviewer has submitted their own, yes. Hidden-score interview loops were standard in 1990s structured-interview literature. Modern research from sources like the Society for Industrial and Organizational Psychology, the Harvard Business School faculty research on hiring decisions, and the recurring LinkedIn Talent Solutions Global Talent Trends reports has consistently shown that visible scores plus a written rubric outperform blind scoring on both calibration and decision time.
The mechanism is simple. A rubric forces interviewers to justify their score against named criteria. When interviewer A sees that interviewer B gave the candidate a 4 on communication while A gave a 2, A has to look at the rubric and the candidate's behaviour, not their gut feeling. That comparison is where calibration happens. Blind scoring just replaces "we disagree about this rubric" with "we disagree about our gut feelings."
The one rule: scores become visible only after each interviewer has independently submitted. Visible scores during an interview create cascading anchoring; visible scores after submission create calibration.
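The visibility rule is easy to enforce in software. The sketch below assumes one reading of it: an interviewer sees colleagues' scores only once their own scorecard is in. The function name and the dictionary shape (interviewer name mapped to a score) are hypothetical.

```python
from typing import Dict


def scores_visible_to(viewer: str, submitted: Dict[str, int]) -> Dict[str, int]:
    """Return the other interviewers' scores that `viewer` is allowed to see.

    `submitted` maps interviewer name to the score they have already submitted.
    A viewer who has not submitted yet sees nothing, which prevents anchoring.
    """
    if viewer not in submitted:
        return {}
    return {name: level for name, level in submitted.items() if name != viewer}
```

With `submitted = {"a": 2, "b": 4}`, interviewer `"a"` sees `{"b": 4}`, while interviewer `"c"`, who has not submitted, sees an empty dictionary.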
"The argument for blind interview scoring is usually 'to avoid bias.' In practice, blind scoring removes the rubric's protection and lets bias in through the side door. The structured-rubric version is the bias-resistant one."
— Collin, Founder, RecruitIn
How does the scorecard plug into an AI screener?
If the team uses AI candidate screening, the AI's output is the first signal on the candidate card — and it belongs in the same view as the interviewer scorecards, not on a separate tab. The simplest integration: the AI's verdict (strong yes / yes / maybe / no / strong no) and rating sit at the top of the candidate card; each interviewer's scorecard sits below as it is submitted; the aggregate of all signals — AI plus interviewer averages — is the team's decision input.
The human signal should outweigh the AI signal in the aggregate. The AI is a first-pass filter and a useful second opinion; it is not the decider. A 5-from-the-AI plus a 2-from-the-onsite-interviewer almost always means the onsite found something the AI missed — read the interviewer's notes, not the AI's reasoning, to decide. The scorecard architecture is what makes that comparison trivial.
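A weighted average is one simple way to make the human signal outweigh the AI signal. The sketch below is illustrative only: the 0.25 default weight for the AI and the 2-point divergence gap are assumptions, not values from any product, and the function names are hypothetical. Both ratings are assumed to be on the same 1-to-5 scale.

```python
from statistics import mean
from typing import List


def aggregate_signal(ai_rating: float, interviewer_ratings: List[float],
                     ai_weight: float = 0.25) -> float:
    """Blend the AI rating with the interviewer average, humans weighted higher."""
    if not 0 <= ai_weight < 0.5:
        raise ValueError("ai_weight must be below 0.5 so the human signal dominates")
    if not interviewer_ratings:
        raise ValueError("need at least one interviewer rating")
    human = mean(interviewer_ratings)
    return ai_weight * ai_rating + (1 - ai_weight) * human


def diverges(ai_rating: float, interviewer_ratings: List[float],
             gap: float = 2.0) -> bool:
    """Flag cases where the AI and the interviewers disagree by `gap` or more."""
    return abs(ai_rating - mean(interviewer_ratings)) >= gap
```

For the example in the text, a 5 from the AI and a 2 from one onsite interviewer, the aggregate is 0.25 × 5 + 0.75 × 2 = 2.75, and `diverges` returns `True`, which is the cue to read the interviewer's notes.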
A note on the recommendation field
Four options: strong yes, yes, no, strong no. No "maybe."
"Maybe" is the option interviewers reach for when they did not see enough evidence either way. The fix is reading the rubric and re-anchoring against the dimensions. If the dimensions all sit around a 3, the recommendation is "no" — the candidate did not show enough above-bar evidence on any one. If two dimensions sit above 4 and the rest are at 3, the recommendation can still be "yes" with explicit caveats in the note. The four-option recommendation is the forcing function that makes interviewers commit.
A team that allows "maybe" produces hiring decisions where the count of "yes" votes determines the outcome. A team that does not allow "maybe" produces hiring decisions where the evidence determines the outcome. The difference compounds across every hire.
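The re-anchoring rule can be sketched as a consistency check between the dimension scores and a positive recommendation. The code below makes two assumptions that the text does not state outright: an above-bar dimension is one scored 4 or higher, and a "yes" needs no dimension below 3. The function name is hypothetical.

```python
from typing import List


def positive_recommendation_supported(levels: List[int]) -> bool:
    """True if the dimension scores could back a 'yes' or 'strong yes'.

    Assumed rule: at least two dimensions at 4 or higher, and none below 3.
    """
    if not levels:
        raise ValueError("need at least one dimension score")
    above_bar = sum(1 for level in levels if level >= 4)
    return above_bar >= 2 and min(levels) >= 3
```

A card scored `[3, 3, 3, 3]` does not support a "yes", while `[4, 5, 3, 3]` does. The check is meant to prompt a second look at the rubric, not to replace the interviewer's judgement.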
Closing thought
A scorecard your team uses is a scorecard your team can complete in 5 minutes, calibrated against behaviour anchors, with a forced four-option recommendation at the end. The dimensions are constant across roles; the wildcard slot is where role specificity lives. The compounding value of getting this right is that every subsequent hire is a slightly better-calibrated decision than the last, because every interviewer is comparing the same kind of evidence.
To plug a scorecard into a working pipeline today, create a free RecruitIn workspace — structured interview feedback is built into the candidate card, with the rating, recommendation, and notes flowing back to the pipeline view. To prepare the rubric the scorecard is built against, read how to write a job description with AI (the rubric is upstream of the JD). To set up the pipeline the scorecard sits inside, see how to set up your first hiring pipeline in under 30 minutes. Pricing details on the pricing page.

