How to build an interview scorecard your team will actually use

A practical guide to interview scorecards your team will use. The four-dimension structure, the anchor language that works, and the 5-minute completion test.

Collin, Founder, RecruitIn | 6 min read
[Image: abstract scorecard with four rating rows, each showing 1 to 5 dots and a short comment field]

The steps

About 45 min
  1. Pick four dimensions tied to the rubric

    Open the role's rubric and pick four dimensions. Three is fine if the role is narrow. Five is the absolute upper bound. The four should be the highest-leverage signals your team will actually use in the decision.

  2. Write the level-5 anchor for each dimension

    Start at the top. What does the strongest possible answer on this dimension look like — in behaviour, not adjective? Write one or two sentences per dimension. This is the hardest writing of the whole exercise.

  3. Write the level-1 anchor for each dimension

    Now the floor. What does an unambiguous 'no' look like on this dimension? Behaviour, not adjective. This sets the dynamic range of the scale.

  4. Fill in levels 2, 3, and 4 by interpolating

    With the floor and ceiling set, the middle three are easier. Level 3 is the median acceptable hire on this dimension. Levels 2 and 4 are halfway-to-floor and halfway-to-ceiling. Behaviour-anchored throughout.

  5. Add the recommendation field with four options

    Strong yes, yes, no, strong no. No 'maybe.' 'Maybe' is a non-answer that becomes a default and produces decisions by interviewer-count rather than by judgement.

  6. Add one mandatory free-text field per dimension

    Two sentences. What evidence in the interview pushed the score in this direction? This is what makes the score reviewable in the debrief and recallable a month later.

  7. Test the scorecard with one interviewer on one real candidate

    Run a 60-minute interview, then ask the interviewer to fill out the scorecard with a stopwatch. If completion takes more than 5 minutes after the interview, the scorecard is too long. Cut one dimension and re-test.

  8. Roll out with a 48-hour feedback SLA

    Every interviewer submits the scorecard within 48 hours of the interview. Memory decays sharply after that. The SLA is the social contract that makes the scorecard actually used.
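The steps above can be sketched as a small data model. This is a minimal illustration, not RecruitIn's schema: the class names (`Dimension`, `Rating`, `Scorecard`) and the `validate` method are hypothetical, chosen only to show the constraints the steps describe (three to five dimensions, a behaviour anchor for every level from 1 to 5, a mandatory evidence note, and a four-option recommendation with no "maybe").

```python
from dataclasses import dataclass
from typing import Dict, List

# The four allowed recommendations; "maybe" is deliberately absent.
RECOMMENDATIONS = ("strong yes", "yes", "no", "strong no")


@dataclass(frozen=True)
class Dimension:
    name: str
    anchors: Dict[int, str]  # level (1 to 5) -> observable behaviour

    def __post_init__(self):
        if sorted(self.anchors) != [1, 2, 3, 4, 5]:
            raise ValueError(f"{self.name}: need one anchor for each level 1-5")


@dataclass(frozen=True)
class Rating:
    score: int
    evidence: str  # mandatory free-text note, about two sentences


@dataclass
class Scorecard:
    dimensions: List[Dimension]
    ratings: Dict[str, Rating]  # keyed by dimension name
    recommendation: str

    def validate(self) -> None:
        """Raise ValueError if the scorecard breaks any rule from the steps."""
        if not 3 <= len(self.dimensions) <= 5:
            raise ValueError("use three to five dimensions (four is the default)")
        for dim in self.dimensions:
            rating = self.ratings.get(dim.name)
            if rating is None:
                raise ValueError(f"missing rating for {dim.name}")
            if rating.score not in dim.anchors:
                raise ValueError(f"{dim.name}: score must be 1-5")
            if not rating.evidence.strip():
                raise ValueError(f"{dim.name}: evidence note is mandatory")
        if self.recommendation not in RECOMMENDATIONS:
            raise ValueError("recommendation must be one of: " + ", ".join(RECOMMENDATIONS))
```

Keeping the anchors on the dimension rather than on the rating means every interviewer scores against the same written behaviours, which is the point of steps 2 to 4.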

Most interview scorecards fail in the same two places: too many dimensions, and anchors written as adjectives instead of behaviours. The result is a scorecard interviewers fill in by averaging — every candidate gets a 3 — and a debrief that runs on vibes instead of evidence. This is how to write one that survives the first ten uses.

Why do most interview scorecards get abandoned?

The root cause is friction. A scorecard with seven dimensions, each requiring a score and a paragraph of justification, takes 15 to 20 minutes to fill out properly. After the third interview of the day, interviewers compress the bottom half of the form into rubber-stamp scores, and the scorecard's data becomes noise. According to recurring research from the Society for Industrial and Organizational Psychology and Harvard Business Review on structured interviewing, the predictive validity of a scorecard collapses when interviewer completion rates fall — and completion rates fall when the form takes too long.

The fix is fewer dimensions, behaviour-anchored levels, and a strict completion-time target. Four dimensions, 1-to-5 scale, two-sentence note per dimension. Five minutes per interview, after the interview ends. Anything longer is the design failing — not the interviewer.

TIP: The completion-time test
Time the first interviewer who uses your scorecard, after a real 60-minute interview. If it takes more than 5 minutes to fill out, cut one dimension. If it takes more than 10 minutes, the scorecard needs a redesign. Length is the single biggest predictor of whether it gets used.
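The completion-time test reduces to two thresholds. A minimal sketch, assuming the stopwatch reading is recorded in minutes; the function name `completion_verdict` and its return strings are hypothetical.

```python
def completion_verdict(minutes: float) -> str:
    """Map a measured fill-out time to the action the tip recommends."""
    if minutes > 10:
        return "redesign"           # more than 10 minutes: start over
    if minutes > 5:
        return "cut one dimension"  # more than 5 minutes: too long, trim and re-test
    return "keep"                   # 5 minutes or less: the scorecard passes
```

For example, a scorecard that took 7 minutes after a 60-minute interview would come back as "cut one dimension", and the test would be repeated on the shorter version.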

What does a working scorecard look like, dimension by dimension?

Four dimensions cover the vast majority of roles. The specifics change; the structure does not.

Dimension | Level 5 anchor (behaviour) | Level 1 anchor (behaviour)
Communication | Restates the question, gives the answer in one sentence, then supports with evidence | Cannot summarise their own point; meanders or stalls
Ownership | Talks about decisions they made, trade-offs they weighed, mistakes they corrected | Talks about what 'we' did with no individual contribution visible
Role-specific competence | Demonstrates depth through specific examples and named trade-offs | Surface-level mentions of relevant tools or methods, no depth
Wildcard (role-specific) | Custom per role: typically curiosity, judgement, or strong opinion loosely held | Custom per role: typically deflection, dogma, or absence of stance

The dimensions stay the same across roles. The wildcard slot is where role specificity lives. For an engineering role, it might be "names the constraint that matters and ignores the rest." For a sales role, "asks better discovery questions than the interviewer." For ops, "spots the second-order failure mode." One slot, role-specific, behaviour-anchored.

Why behaviour anchors and not adjectives?

Adjective anchors — "strong," "satisfactory," "weak" — sound rigorous and aren't. The problem is that every interviewer interprets "strong" against their own personal bar, calibrated against the people they have personally worked with. Two interviewers who both score a candidate "strong" may have wildly different evidence in mind. The scorecard then says they agree when they don't.

Behaviour anchors fix this by describing what the interviewer should have heard or seen. "Restates the question, gives the answer in one sentence, then supports with evidence" is a behaviour. Each interviewer either observed it or did not. The disagreement, when it happens, is about what the candidate did, which is a recoverable disagreement, not a values one.

Writing behaviour anchors is the hardest part of building a scorecard. Most teams cheat and write something like "strong communicator." Resist this. The compounding cost of a vague anchor is a year of misaligned interview debriefs.

Should interviewer scores be visible to other interviewers?

After each interviewer has submitted their own, yes. Hidden-score interview loops were standard in 1990s structured-interview literature. Modern research, including work from the Society for Industrial and Organizational Psychology, Harvard Business School faculty research on hiring decisions, and the recurring LinkedIn Talent Solutions Global Talent Trends reports, has consistently shown that visible scores plus a written rubric outperform blind scoring on both calibration and decision time.

The mechanism is simple. A rubric forces interviewers to justify their score against named criteria. When interviewer A sees that interviewer B gave the candidate a 4 on communication while A gave a 2, A has to look at the rubric and the candidate's behaviour, not their gut feeling. That comparison is where calibration happens. Blind scoring just replaces "we disagree about this rubric" with "we disagree about our gut feelings."

The one rule: an interviewer sees other interviewers' scores only after independently submitting their own. Scores that are visible before submission create cascading anchoring; scores that become visible after submission create calibration.
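The visibility rule can be expressed as a single gate. This is a sketch under one reading of the rule (an interviewer sees other people's scores only once their own scorecard is in); the function name `visible_scores` and the shape of the `submitted` mapping are assumptions made for illustration.

```python
from typing import Dict

Scores = Dict[str, int]  # dimension name -> score from 1 to 5


def visible_scores(viewer: str, submitted: Dict[str, Scores]) -> Dict[str, Scores]:
    """Return the other interviewers' submitted scores that `viewer` may see.

    `submitted` maps interviewer name -> scores, and contains only
    interviewers who have already submitted their scorecard.
    """
    if viewer not in submitted:
        # Not submitted yet: show nothing, so no one anchors on a colleague's score.
        return {}
    return {name: scores for name, scores in submitted.items() if name != viewer}
```

Usage: with `submitted = {"A": {"Communication": 2}, "B": {"Communication": 4}}`, interviewer A sees B's scores and can compare the 2 against the 4, while interviewer C, who has not submitted, gets an empty result.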

"The argument for blind interview scoring is usually 'to avoid bias.' In practice, blind scoring removes the rubric's protection and lets bias in through the side door. The structured-rubric version is the bias-resistant one."

— Collin, Founder, RecruitIn

How does the scorecard plug into an AI screener?

If the team uses AI candidate screening, the AI's output is the first signal on the candidate card — and it belongs in the same view as the interviewer scorecards, not on a separate tab. The simplest integration: the AI's verdict (strong yes / yes / maybe / no / strong no) and rating sit at the top of the candidate card; each interviewer's scorecard sits below as it is submitted; the aggregate of all signals — AI plus interviewer averages — is the team's decision input.

The human signal should outweigh the AI signal in the aggregate. The AI is a first-pass filter and a useful second opinion; it is not the decider. A 5-from-the-AI plus a 2-from-the-onsite-interviewer almost always means the onsite found something the AI missed — read the interviewer's notes, not the AI's reasoning, to decide. The scorecard architecture is what makes that comparison trivial.
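One way to combine the two signals is a weighted average plus a divergence flag. This is a hedged sketch: the 0.75 human weight and the 2-point divergence threshold are illustrative assumptions, not values from the article, and `aggregate_signal` is a hypothetical name. It also assumes the AI rating and each interviewer's average are on the same 1-to-5 scale.

```python
from statistics import mean
from typing import List, Tuple

HUMAN_WEIGHT = 0.75          # assumption: humans count three times as much as the AI
DIVERGENCE_THRESHOLD = 2.0   # assumption: a gap this large means "read the notes"


def aggregate_signal(
    ai_score: float,
    interviewer_scores: List[float],
    human_weight: float = HUMAN_WEIGHT,
) -> Tuple[float, bool]:
    """Return (combined score, needs_review) for one candidate."""
    if not interviewer_scores:
        raise ValueError("need at least one interviewer score")
    if not 0.5 < human_weight <= 1.0:
        raise ValueError("human weight must outweigh the AI weight")
    human = mean(interviewer_scores)
    combined = human_weight * human + (1 - human_weight) * ai_score
    # A large gap between AI and humans is a prompt to read the interviewer's notes.
    needs_review = abs(ai_score - human) >= DIVERGENCE_THRESHOLD
    return combined, needs_review
```

With the example from the text, an AI score of 5 and an onsite score of 2 give a combined 2.75 and set the review flag, so the aggregate leans toward the interviewer and the team is pointed at the interviewer's notes.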

A note on the recommendation field

Four options: strong yes, yes, no, strong no. No "maybe."

"Maybe" is the option interviewers reach for when they did not see enough evidence either way. The fix is reading the rubric and re-anchoring against the dimensions. If the dimensions all sit around a 3, the recommendation is "no" — the candidate did not show enough above-bar evidence on any one. If two dimensions sit above 4 and the rest are at 3, the recommendation can still be "yes" with explicit caveats in the note. The four-option recommendation is the forcing function that makes interviewers commit.

A team that allows "maybe" produces hiring decisions where the count of "yes" votes determines the outcome. A team that does not allow "maybe" produces hiring decisions where the evidence determines the outcome. The difference compounds across every hire.

Closing thought

A scorecard your team uses is a scorecard your team can complete in 5 minutes, calibrated against behaviour anchors, with a forced four-option recommendation at the end. The dimensions are constant across roles; the wildcard slot is where role specificity lives. The compounding value of getting this right is that every subsequent hire is a slightly better-calibrated decision than the last, because every interviewer is comparing the same kind of evidence.

To plug a scorecard into a working pipeline today, create a free RecruitIn workspace — structured interview feedback is built into the candidate card, with the rating, recommendation, and notes flowing back to the pipeline view. To prepare the rubric the scorecard is built against, read how to write a job description with AI (the rubric is upstream of the JD). To set up the pipeline the scorecard sits inside, see how to set up your first hiring pipeline in under 30 minutes. Pricing details on the pricing page.

Frequently asked

How many dimensions should an interview scorecard have?

Four for most roles, five at the absolute upper bound. The dimensions should be the ones the rubric was written against — communication, ownership, role-specific competence, and one wildcard ('strong opinion loosely held' or 'curiosity' or whatever the team values). Past five dimensions, interviewers start treating the lower rows as noise and fill them by averaging.

Should anchors be numbers or behaviours?

Behaviours, anchored to numbers. A '5' on communication is 'restates the question, gives the answer in one sentence, then supports with evidence'; a '3' is 'answers accurately but bypasses structure'; a '1' is 'cannot summarise their own point.' The number is the shorthand. The behaviour is the contract.

Should interviewers see each other's scores during the loop?

After they submit their own, yes. Hidden-score interview loops are a 1990s idea that has not held up under modern research. Visible scores plus a written rubric do more to reduce bias than blind scoring does, because the rubric forces interviewers to justify their score against the same criteria. Hidden scoring usually replaces structured criteria with vibes.

What is the right scoring scale — 1 to 3, 1 to 5, or 1 to 10?

1 to 5 is the sweet spot. 1 to 3 is too coarse to differentiate strong candidates; 1 to 10 is too granular and produces fake precision (no one can distinguish a 7 from an 8 reliably). 1 to 5 with named anchors gives enough range to surface disagreement while staying coarse enough to be calibratable.

Can AI screening feed into the same scorecard?

Yes, and it should. The AI's score and reasoning belong on the candidate card alongside the interviewer scorecards, not in a separate view. The aggregate signal is what drives the decision — and treating AI and human signals as comparable inputs (with the human signal weighted higher) is the simplest way to integrate them.
