| Parameter | Definition | Comments |
| --- | --- | --- |
| **Reliability** | | |
| Internal consistency | Evaluates the similarity of test items assessed within one domain. One measure is split-half reliability, which compares scores on two halves of a test in a single domain. | High internal consistency suggests that some items are too similar, so assessing them adds no further information; low internal consistency suggests the items may not be assessing the same domain. |
| Interobserver | Evaluates variability between different assessors of the same subject. | Systematic errors specific to a particular group of assessors may occur, so this parameter may not be generalisable when the tool is used by a different group of assessors. |
| Intraobserver | Evaluates variability within a single assessor on a single subject. | Commonly evaluated by the same assessor scoring video recordings of their own assessments. Not essential unless interobserver reliability is low. |
| Test-retest | Evaluates variability within the subject (influenced by random factors such as familiarity with the items and mood). | Difficult to interpret in early childhood, when development changes over a short time. The repeat assessment should usually be carried out within 2 weeks of the first test. |
| **Validity** | | |
| Content | Experts in the field reach consensus agreement on whether the individual items, and the range of items, adequately sample and represent the domain of interest. | A subjective measure that cannot be used in isolation to evaluate validity. |
| Criterion | Ideally assessed by comparison with an established 'gold standard' test assessing the same construct. | 'Gold standard' tests are usually unavailable, so the comparison is typically against another recognised test regularly used in the same population and thought to measure the same domain. |
| Discriminant/convergent | Evaluates expected positive and negative correlations between scores in different domains, or between different tests of the same or differing underlying constructs. | Scores from two independent tests of one domain (eg, one using a report method, the other a direct test) should correlate where neither test is considered a 'gold standard'. To ensure the test does not overlap with constructs not of interest, scores evaluating different constructs should correlate poorly; for example, 'fine motor' scores should correlate poorly with 'social emotional' scores. |
| Construct | Statistical evaluation of whether values of observed data fit a theoretical model of the constructs (confirmatory), or exploration of a possible model of the 'underlying traits' being measured. | Large numbers of assessments are required to evaluate this. |