To determine the reasons for inter-scorer variability in sleep staging of polysomnograms (PSGs).
Fifty-six PSGs were scored (5-stage sleep scoring) by 2 experienced technologists (first manual, M1). Months later, the technologists edited their own scoring (second manual, M2) based upon feedback from the investigators that highlighted differences between their scoring. The PSGs were then scored with an automatic system (Auto) and the technologists edited them, epoch-by-epoch (Edited-Auto). This resulted in 6 different manual scores for each PSG. Epochs were classified as scorer errors (one M1 score differed from the other 5 scores), scorer bias (all 3 scores of each technologist were similar, but differed from the other technologist) and equivocal (sleep scoring was inconsistent within and between technologists).
Percent agreement after M1 was 78.9% ± 9.0% and was unchanged after M2 (78.1% ± 9.7%) despite numerous edits (≈40/PSG) by the scorers. Agreement in Edited-Auto was higher (86.5% ± 6.4%, p < 1E−9). Scorer errors (< 2% of epochs) and scorer bias (3.5% ± 2.3% of epochs) together accounted for < 20% of M1 disagreements. A large number of epochs (92 ± 44/PSG) with scoring agreement in M1 were subsequently changed in M2 and/or Edited-Auto. Equivocal epochs, which showed scoring inconsistency, accounted for 28% ± 12% of all epochs, and up to 76% of all epochs in individual patients. Disagreements were largely between awake/NREM, N1/N2, and N2/N3 sleep.
Inter-scorer variability is largely due to epochs that are difficult to classify. Availability of digitally identified events (e.g., spindles) or calculated variables (e.g., depth of sleep, delta wave duration) during scoring may greatly reduce scoring variability.
Younes M, Raneri J, Hanly P. Staging sleep in polysomnograms: analysis of inter-scorer variability. J Clin Sleep Med 2016;12(6):885–894.
Inter-scorer variability in scoring polysomnograms is a well-recognized problem.1–12 It not only affects the diagnosis and management of sleep disorders but also confounds interpretation of outcome studies. The reasons for discrepancies between scorers are not well understood, and their identification may provide an opportunity for solutions.
Current Knowledge/Study Rationale: Inter-scorer variability in scoring polysomnograms is a well-recognized problem that impacts the diagnosis and management of sleep disorders and confounds interpretation of outcome studies. We wished to determine whether differences between highly qualified technologists in scoring sleep are related to inattention errors, scoring bias, or to the signals in a number of epochs being difficult to score definitively with current guidelines (equivocal epochs).
Study Impact: We found that inattention errors and bias contribute little, while the vast majority of scoring differences between qualified technologists result from the presence of a large number of equivocal epochs that can legitimately be assigned any of two, or even three, sleep stages by competent technologists. These findings suggest that digital identification of key staging variables (e.g., spindles, delta wave duration, objective sleep depth) is needed if inter-scorer variability is to be minimized and that better training or fine-tuning of the scoring guidelines are not likely to be effective.
In theory, the sleep scoring of two polysomnography technologists may differ for one of the following reasons:
Inadequate training of one or both scorers: While this is clearly a contributing factor in some cases, its solution is clear cut, namely through adequate training. However, ensuring adequate training will by no means solve the problem since substantial inter-scorer variability remains between highly experienced technologists.1–13
Inattention by one or both scorers: Polysomnography (PSG) scoring can be monotonous, and errors related to boredom and inattention may be expected. It may be anticipated that differences related to inattention would be eliminated if the scorers were asked to edit their own scores.
Bias in the interpretation of scoring guidelines: Many of the scoring guidelines are qualitative and their implementation is subject to bias. Differences related to bias in interpretation of qualitative guidelines would persist if the scorers were asked to edit the scoring of a third party. Thus, if an epoch is scored as stage 1 non-rapid eye movement sleep (N1) by one scorer and stage 2 non-REM sleep (N2) by the other, and each scorer is convinced of his/her score, the N2 scorer may be expected to change the third party's score to N2 if it were N1, and vice versa.
Equivocal epochs: Even when the guidelines are quantitative, a decision may require an unacceptably long time, or digital means, to determine whether the features meet the guidelines. Examples include whether a tentative delta wave is 75 μV in amplitude or total delta wave duration is > 6 seconds, or whether a brief high frequency burst has the requisite spindle frequency (11–16 Hz). In such cases, most scorers simply “eyeball” the signals. For most epochs, the score is unambiguous. However, in many equivocal epochs a technologist may be willing to accept either of two scoring options, and agreement, or lack of it, is left to chance. Scoring differences associated with equivocal epochs may be expected to disappear if the technologists were asked to edit preexisting scores, since both would agree with the third score unless it is not one of the scores they are considering.
In this study, two experienced polysomnography technologists scored 56 PSGs, and agreement between their scores was determined. Agreement was determined again after each scorer edited his/her own previous scoring, and after editing scoring performed by an automatic system.
This study reports the results of additional analysis of 56 polysomnograms (PSGs) used previously to validate a continuous index of sleep depth.14 PSGs were randomly selected from the sleep center database at the University of Calgary to include a broad spectrum of sleep pathology: severe obstructive sleep apnea (OSA) (apnea-hypopnea index [AHI] > 30, n = 8), moderate OSA (AHI 15–30, n = 10), mild OSA (AHI 5–15, n = 10), central sleep apnea (AHI > 15, n = 4), OSA on continuous positive airway pressure (CPAP) throughout (n = 5), periodic limb movement disorder (PLM index > 25, n = 4), insomnia (n = 5), narcolepsy (n = 5), and no sleep pathology (n = 5). The PSGs included two central and one occipital electroencephalography (EEG) signals, 2 electrooculograms, a chin electromyogram, an electrocardiogram, chest and abdomen respiratory inductance plethysmography bands, nasal pressure and oro-nasal thermistor, oxyhemoglobin saturation, and a microphone. They were recorded with a Sandman system (Natus Medical, USA). The PSG scoring performed for the initial clinical evaluation is not considered here.
The PSG files were scored anew by 2 registered PSG technologists, scorer 1 (S1) and scorer 2 (S2) (Manual 1, M1). Each technologist had more than 10 years of experience at an academic sleep center, one at the University of Calgary and one at the University of Manitoba. The scorers followed their standard scoring practice, which was based on the 2007 AASM guidelines.15 They did not compare their scoring practice and were blinded to each other's scoring results. Several months later, the technologists were told that there were substantial differences between their scores and were asked to review and, if necessary, revise their original scoring (Manual 2, M2; to be distinguished from re-scoring the same PSGs). Prior to M2, two areas were identified in which their scoring differed, namely awake vs. non-rapid eye movement (NREM) sleep disagreements and stage 2 vs. stage 3 non-REM sleep (N2/N3) disagreements. The PSGs were also exported in the European Data Format (EDF) and scored on site with a validated10 automatic scoring system (Michele Sleep Scoring (MSS), Younes Sleep Technologies (YST), Winnipeg, Canada) using a laptop operated by a company engineer. The unedited scores were left at the University of Calgary Sleep Centre, and were edited by the same 2 PSG technologists 3–4 months after M2. The technologists did not have access to their previous scores during this last step as the PSG files were coded. They were asked to correct every sleep score they disagreed with. For each PSG, there were accordingly 7 scoring results: the original and edited manual scores (M1 and M2) by the 2 scorers (S1M1, S1M2, S2M1, S2M2), an unedited automatic score (Auto), and 2 edited automatic scores (Edited-Auto-S1, Edited-Auto-S2).
Manual scoring and editing of the automatic scores were done by the 2 technologists at their respective homes outside working hours. The scoring was done in a quiet environment and without any distractions from other activities or household members. There were no limits on the amount of time that could be spent on this work, which was completed over several weeks.
Staging of sleep by MSS relies on algorithms that calculate the odds-ratio-product (ORP) and identify rapid eye movements (REMs), delta waves, sleep spindles, and K complexes in 30-s epochs. The ORP is a continuous index of depth of sleep that ranges from 0 (very deep sleep) to 2.5 (full wakefulness) and is calculated every 3 seconds.14 Staging takes place in 4 discrete steps. First, 30-s epochs deemed to be definitely “awake” are identified based primarily on the average and pattern of the odds ratio product (ORP) in the epoch. Second, epochs with REM sleep are identified from within the remaining epochs if there are rapid eye movements (REMs) and the chin electromyogram (EMG) is below a patient-specific threshold. Next, remaining epochs with definite sleep, based on the average ORP, are classified as NREM sleep and the specific stage within NREM sleep is determined based on total duration of delta waves within the 30 s and presence of spindles and/or K complexes. Finally, a decision is made on whether the remaining unclassified epochs are scored awake or NREM sleep depending on whether the epoch characteristics place it within one of several categories. These categories incorporate different combinations of features including average ORP, muscle activity, presence of spindles, REMs, snoring or respiratory events, and sleep stages in preceding or following epochs, among other variables. Epochs classified as NREM sleep using this process are then further divided into N1, N2 or N3 in the same way as epochs scored earlier as definite NREM sleep.
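The four-step cascade described above can be pictured with a minimal sketch. This is not the MSS implementation: the ORP cut-offs (`wake_orp`, `sleep_orp`), the chin EMG limit (`emg_rem_max`), the `EpochFeatures` container, and the `resolve_ambiguous` callback are all hypothetical placeholders, and the real system also uses the within-epoch ORP pattern and a patient-specific EMG threshold, which are not modeled here. The 6-second delta duration cut-off is taken from the scoring guideline quoted earlier, not from MSS itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EpochFeatures:
    """Hypothetical per-epoch summary of the digitally detected features."""
    mean_orp: float          # average ORP in the 30-s epoch (0 = deep sleep, 2.5 = fully awake)
    has_rems: bool           # rapid eye movements detected
    chin_emg: float          # chin EMG level, arbitrary units
    delta_seconds: float     # total delta wave duration within the epoch
    has_spindle_or_k: bool   # at least one spindle or K complex detected


def nrem_substage(f: EpochFeatures, delta_n3_seconds: float = 6.0) -> str:
    """Split NREM into N1/N2/N3 from delta duration and spindles/K complexes."""
    if f.delta_seconds > delta_n3_seconds:
        return "N3"
    return "N2" if f.has_spindle_or_k else "N1"


def stage_epoch(
    f: EpochFeatures,
    resolve_ambiguous: Callable[[EpochFeatures], str],
    *,
    wake_orp: float = 2.0,     # illustrative value, not the MSS threshold
    sleep_orp: float = 1.0,    # illustrative value, not the MSS threshold
    emg_rem_max: float = 1.0,  # stands in for the patient-specific EMG threshold
) -> str:
    # Step 1: definite wakefulness, here judged from average ORP only.
    if f.mean_orp >= wake_orp:
        return "W"
    # Step 2: REM sleep requires rapid eye movements and low chin EMG.
    if f.has_rems and f.chin_emg < emg_rem_max:
        return "REM"
    # Step 3: definite sleep by average ORP, then NREM sub-staging.
    if f.mean_orp <= sleep_orp:
        return nrem_substage(f)
    # Step 4: remaining epochs go to a caller-supplied wake/NREM decision,
    # standing in for the multi-feature categories that the text does not specify.
    return "W" if resolve_ambiguous(f) == "W" else nrem_substage(f)
```

Keeping step 4 as a callback makes explicit that the final wake/NREM decision depends on context (neighboring stages, respiratory events, snoring) that this sketch does not attempt to reproduce.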
The following comparisons were made: between scorer 1 and scorer 2 following each of the 3 scoring stages (M1, M2, Edited-Auto), between each scorer's M1 and M2 scores, between each scorer's M2 and Edited-Auto scores, between unedited Auto and each scorer's M2 score, and between unedited Auto and the 2 Edited-Auto scores. For each comparison, percent agreement was defined as (number of epochs with agreement × 100) / total number of epochs. Disagreements were classified into 6 categories: Wake/NREM, Wake/REM, NREM/REM, N1/N2, N2/N3, and N1/N3. Disagreements between Auto and Edited-Auto reflect the number of epochs where the automatic score was changed by the technologist.
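As an illustration of these definitions, the sketch below computes percent agreement and tallies the 6 disagreement categories for two epoch-by-epoch score lists. The stage labels (`"W"`, `"N1"`, `"N2"`, `"N3"`, `"REM"`) and the function names are assumptions made for the example, not part of the study's software.

```python
from collections import Counter

NREM = {"N1", "N2", "N3"}


def percent_agreement(a, b):
    """Epochs with agreement * 100 / total number of epochs."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    agree = sum(x == y for x, y in zip(a, b))
    return 100.0 * agree / len(a)


def disagreement_category(x, y):
    """Return one of the 6 disagreement categories, or None if the scores agree."""
    if x == y:
        return None
    pair = {x, y}
    if pair <= NREM:
        # Both scores are NREM sub-stages: N1/N2, N2/N3, or N1/N3.
        return "/".join(sorted(pair))
    if "W" in pair:
        return "Wake/REM" if "REM" in pair else "Wake/NREM"
    # Neither score is wake and the pair is not within NREM, so one score is REM.
    return "NREM/REM"


def tally_disagreements(a, b):
    """Count disagreements per category across all epochs."""
    cats = (disagreement_category(x, y) for x, y in zip(a, b))
    return Counter(c for c in cats if c is not None)


s1 = ["W", "N1", "N2", "N3", "REM", "N2"]
s2 = ["W", "N2", "N2", "N2", "N1", "W"]
print(round(percent_agreement(s1, s2), 1))
print(dict(tally_disagreements(s1, s2)))
```

The same two functions apply to any of the listed comparisons, for example a scorer's M1 against the same scorer's M2, or unedited Auto against an Edited-Auto score.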
Each 30-s epoch was classified into one of the following 4 categories based on the M1, M2 and Edited-Auto results (i.e., unedited Auto was not considered here):
Epochs with inattention error (Tech Error): These are epochs in which 5 of the 6 scores were similar and the odd score was in M1 (1/5 split, Pattern 1, Figure 1). This scenario indicates that the odd score was unintentional since it was reversed by the same technologist in the next round and there was no other disagreement between the 2 technologists' sleep scoring.
Epochs with Bias: Here, all 3 manual scores of each technologist were the same but differed from those of the other technologist, thereby showing consistency on the part of each technologist (3/3 split, Pattern 2, Figure 1).
Unequivocal epochs: All 6 manual scores were the same (6/0 split, Pattern 3, Figure 1).
Equivocal epochs: All remaining epochs. A variety of patterns are included in this category, all of which indicate scoring inconsistency between the 2 technologists as well as inconsistency among the different scores of at least one technologist. In the 5/1 split (Pattern 4, Figure 1), there is only one different score but, unlike Tech errors, it occurs during M2 or in Edited-Auto PSG. This pattern cannot be attributed to inattention since the score was actively changed in M2 or while editing Auto, or the scorer accepted a different Auto score, indicating that he/she was not committed to the original manual score. In the 4/2 split, one sleep stage was scored 4 times and another sleep stage was scored twice. In one subtype, (Pattern 5, Figure 1) both technologists were inconsistent, while in the other (Pattern 6, Figure 1) one technologist appears consistent and the other not. In the 3/3 split (Pattern 7, Figure 1), 3 counts of each of 2 stages are present in the 6 results but, unlike Bias, both stages are present within the 3 scores of each technologist. More complex patterns are also seen in which more than 2 different sleep stages were observed within the 6 scores (Patterns 8 and 9, Figure 1).
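The four categories above reduce to a small decision rule over the 6 scores of an epoch. The sketch below is one possible reading of that rule; the function name, the tuple layout (M1, M2, Edited-Auto for each scorer), and the stage labels are assumptions for illustration.

```python
from collections import Counter


def classify_epoch(s1, s2):
    """Classify one epoch from each scorer's (M1, M2, Edited-Auto) scores."""
    if len(s1) != 3 or len(s2) != 3:
        raise ValueError("each scorer must contribute exactly 3 scores")
    counts = Counter(list(s1) + list(s2))

    # 6/0 split: all six scores are identical.
    if len(counts) == 1:
        return "Unequivocal"

    # 3/3 split with each scorer internally consistent but differing from the other.
    if len(set(s1)) == 1 and len(set(s2)) == 1:
        return "Bias"

    # 5/1 split where the single odd score was given in M1 (index 0).
    if len(counts) == 2 and min(counts.values()) == 1:
        odd = min(counts, key=counts.get)
        if s1[0] == odd or s2[0] == odd:
            return "Tech Error"

    # Everything else, including a 5/1 split whose odd score is in M2 or Edited-Auto.
    return "Equivocal"


print(classify_epoch(("N1", "N2", "N2"), ("N2", "N2", "N2")))
print(classify_epoch(("N2", "N1", "N2"), ("N2", "N2", "N2")))
```

Under this reading, the first call corresponds to Pattern 1 (the odd score is in M1) and the second to Pattern 4 (the odd score appears in M2), so the two epochs fall into different categories despite sharing the same 5/1 split.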
Scoring patterns within the 4 manual and 2 post-edit scores in individual 30-s epochs.
The left 3 cells within each bar represent the scoring assigned to the epoch by one scorer in the first manual (S1M1), second manual (S1M2), and post-edit scores (S1Post), respectively, and the right cells show the scores of the second scorer (S2) in the same sequence. Different shades indicate that different sleep stages were assigned but do not reflect any specific sleep stages.
Results are given as the average number of epochs (SD) and/or % of epochs (SD) per PSG, except where otherwise indicated.
Patients were 35 females and 21 males, 51 ± 14 years in age. Body mass index was 35 ± 12 kg/m2. Total sleep time was 244 ± 106 minutes, sleep efficiency was 68% ± 21%, AHI was 21 ± 25/h, arousal/awakening index was 25 ± 14/h, and PLM index was 17 ± 31/h.
Inter- and Intra-scorer Agreement
Table 1 shows the agreement between and within technologists at different stages in the scoring protocol. Figure 2 describes the evolution of the different scoring patterns and explains the results of Table 1. The average number of epochs per PSG was 732 ± 190. Percent agreement following the first manual scoring (Column 1, Table 1 and top row, Figure 2) was 78.9% ± 9.0% (range: 46.0–97.6%). The most common types of disagreements were awake/NREM, N1/N2, and N2/N3, which were equally represented at ≈45 epochs/PSG (Column 1, Table 1). REM/NREM disagreements occurred in 13 ± 15 epochs/PSG. N1/N3 and Awake/REM disagreements were rare.
Scoring agreement between technologists.
Flow chart describing the evolution of different scoring patterns.
The numbers refer to the frequency of agreement/disagreement in an average PSG containing 732 epochs. Joined twin columns show the scoring of the 2 technologists at the different stages of the analysis. Similar shades in the twin columns indicate scoring agreement between the 2 technologists, while different shades indicate disagreement. Following the first manual scoring (M1) there were 149 epochs with disagreement and 583 epochs with agreement. The numbers above the columns in the Auto panels indicate the frequency of the Auto score being similar to, or different from, the scores in one or both of the earlier manual scoring. In the Auto panel to the right a black Auto column reflects disagreement with the common manual score. In the left Auto panel, the gray column represents disagreement with the earlier manual scores since black and white scores occurred in the manual phases. The frequency of each pattern is given in the first row of numbers below the Edited-Auto patterns. The last row of numbers gives the frequency of one scorer (left number) or both scorers (right number) altering the Auto score within each pattern.
During M2 the 2 technologists altered their original scores in 31 ± 24 and 50 ± 35 epochs/PSG (Columns 2 and 3, Table 1). These changes resulted in new disagreements in 37 ± 24 epochs where there was agreement in M1 (Figure 2). Concurrently, they resulted in agreement in 32 ± 22 epochs where there was disagreement in M1 (Figure 2, 2nd row). Interestingly, a third scenario was introduced in M2 in 7 ± 9 epochs, resulting in a different disagreement than existed in M1 (Figure 2, 2nd row). Because of the mixed positive (more agreement) and negative effects of M2 edits, agreement between the 2 scorers did not improve after M2 (78.1% ± 9.7%; 4th column, Table 1) despite a large number of changes made by the technologists (81 ± 49 per PSG).
Unedited Auto Results
Where all 4 scores of M1 and M2 were the same (546 ± 179 epochs/PSG) Auto was similar to the manual score in 476 ± 179 epochs (86% ± 8%; Figure 2, rightmost pathway). Where there was disagreement in either or both M1 and M2 (186 ± 79/PSG), Auto was similar to one of the manually scored sleep stages in 166 ± 67 epochs/PSG (89% ± 9%). In the remaining 20 epochs, Auto scoring introduced a third scenario into the mix (gray boxes, Figure 2, 3rd row). The scoring agreement between Auto and the second manual scores (M2) was 75.9 ± 7.9% for S1 (column 5, Table 1) and 76.4 ± 11.2% for S2 (column 6, Table 1). These were marginally lower (p = 0.02 for S1 and 0.04 for S2) than the agreement between the two scorers in M2 (78.1% ± 9.7%).
During editing of Auto, S1 and S2 changed Auto scoring in 60 and 113 epochs (Columns 7 and 8, Table 1). These are small proportions of the epochs in which there was disagreement between Auto and the two technologists in M2 (172 and 167 epochs; columns 5 and 6, Table 1), suggesting that technologists accepted Auto scoring in a large number of epochs where Auto scoring was different from their manual score. Agreement between Edited-Auto and the manual score of each scorer in M2 was improved relative to agreement between Auto and manual scores (81.3% vs. 75.9% for S1 (column 9 vs. column 5) and 86.9% vs. 76.4% for S2 (column 10 vs. column 6)). However, scoring agreement fell short of 100%, the value expected if all disagreements between Auto and manual scoring resulted from Auto scoring errors or technologist bias.
The last row of Figure 2 shows the various scoring outcomes after editing Auto scoring and Table 2 shows the distribution of different epoch patterns. Where there was agreement in both M1 and M2 (right stream, Figure 2; n = 546) only 24 ± 20 of the 70 ± 34 epochs where Auto scoring was different were corrected by both technologists to their original common manual score. These clearly represent true Auto scoring errors. The end result was 488 ± 176 (65% ± 13%) epochs with scoring agreement across all 6 manual scores (Pattern 3, Figure 2). These are considered unequivocal epochs. Fifty-five epochs in this category were considered equivocal epochs since: (a) in 15 ± 10 epochs with Auto-manual scoring disagreement both technologists accepted Auto scoring, resulting in post-edit agreement but in a different sleep stage (Pattern 5a, Figure 2); (b) in the remaining 31 epochs with Auto-manual scoring disagreement, one technologist accepted Auto scoring while the other changed it to the original manual score; and (c) in a further 9 epochs, one of the scorers paradoxically changed Auto scoring to a different sleep stage even though it was similar to the manual score. The last 2 actions resulted in 40 ± 19 epochs with post-edit disagreement (Pattern 4, Figure 2).
Frequency of different scoring patterns.
For epochs in which there was disagreement in M1, M2 or both (n = 186 ± 79) the Edited-Auto scoring outcome was highly variable depending on whether the Auto score was changed and, if so, whether only one or both technologists altered it (Figure 2). In 19 ± 15 of the epochs in which disagreement in M1 was corrected in M2 (n = 32 ± 22; leftmost stream, Figure 2) the technologists responded to Auto scoring in a manner consistent with their M2 scoring. The disagreement in M1 in these epochs was considered to be an error by one of the technologists that was corrected in M2 (Pattern 1, Figure 2). There were 8.7 ± 9.1 such epochs/PSG for S1 and 10.5 ± 11.6 epochs/PSG for S2.
For the 110 ± 55 epochs where similar disagreements were seen in M1 and M2, Auto scoring was necessarily different from one or both scorers. Only in 25 ± 15 of these epochs did the technologists respond in a way that preserved the same split seen in M1 and M2. These epochs were considered to be examples of a true technologist bias (Pattern 2, Figure 2). The split scores in these epochs were Awake/N1 (7.8 epochs), N2/N3 (7.0 epochs), and N1/N2 (6.9 epochs). Other disagreements were extremely rare. All other epochs in this category were considered equivocal since the response to Auto scoring indicated that the technologists were not committed to their manual scores.
Figure 3 illustrates 2 equivocal epochs in one patient. The top panel is an example of pattern 8 (3 different stages scored) and the bottom panel is an epoch that was given 4 different stages at different times (pattern 9). The legend outlines the assumed rationale for the scoring choices that were made.
Two examples of equivocal epochs.
C4/A1, C3/A2 and O1/A2 are electroencephalography electrodes (120 μV calibration bar common to all); EOG, electrooculogram; S1 and S2, first and second scorers; ORP, odds ratio product. (A) In the first manual round, S1 scored the epoch as NREM sleep stage 1 (N1) while S2 scored it as N2. In the second manual round S1 changed the stage to N2 while S2 changed the stage to awake (W). The automatic system (Auto) scored the epoch as awake based primarily on a high average ORP (2.0). Neither scorer corrected the Auto score, resulting in a common score of W. It is difficult to determine whether the duration of the awake pattern in this epoch is more or less than 15 seconds; hence the difficulty of distinguishing W from NREM sleep. There is a brief period of high EEG frequency which may or may not be a spindle; hence the difficulty of distinguishing N1 from N2. (B) The EEG in this epoch could visually be either awake or asleep. Whether the epoch is scored W, N1, N2, or REM depends on whether one scores the eye movement as slow or rapid, whether the brief high frequency bursts are considered spindles or brief beta bursts (subthreshold arousal), and whether the chin EMG is low or high for REM. All these features are questionable in this epoch; hence the 3 different assigned stages in the first 2 manual sessions. Auto scored the epoch as N2 because the ORP was closer to the definite sleep level (average ORP 1.33), the eye movement was too slow, and because the high frequency events were confirmed as spindles. Nonetheless, both scorers over-ruled Auto even though N2 was scored twice before manually.
In total there were 199 ± 87 equivocal epochs/PSG, representing 28% ± 12% (range 5–76%) of all epochs (Patterns 4 to 8, Figure 2 and Table 2). Of these, neither technologist changed the Auto scored sleep stage in 117 ± 60 epochs/PSG. In 11 ± 11 epochs both technologists changed the Auto score and in another 72 ± 39 epochs/PSG only one technologist changed the Auto score. The dual changes in Auto scoring (n = 11 ± 11) resulted in 2 epochs with disagreement, where agreement was present previously (Pattern 4, Figure 2), one epoch with agreement but in a different stage (Pattern 5a, Figure 2) and 8 epochs with agreement where there was no agreement during M1 and/or M2. Figure 4 shows the actions taken when only one technologist changed the Auto score in these equivocal epochs. In one pattern, the change made by the technologist appeared to show consistency with the original manual scores of the same technologist (Pattern A, Figure 4). This pattern accounted for 39 ± 23 of the 72 epochs in this category. On the other hand, in the remainder of equivocal epochs where a single change was made (33 ± 21 epochs/PSG) the change did not reflect consistency in that the Auto score was changed despite being similar to one or both manual scores of the same technologist (patterns B, D, and E, Figure 4) or, when Auto was different, it was changed to a completely new stage (pattern C, Figure 4), or the original manual scores were already different (pattern F, Figure 4). In all cases the intervention by one of the 2 technologists resulted in Edited-Auto disagreement (Figure 2).
Actions taken by technologists when they changed the Auto score of equivocal epochs.
In each case only one technologist changed the Auto score while the other accepted it. The columns in the first and second rows show the scoring in the first and second manual stages (M1 and M2) of the same scorer who changed Auto. In patterns A to C, the technologist scored the same stage in both M1 and M2, while in D to F the 2 scores were different. In pattern A, the Auto score was different to the common manual score of this technologist (although it was similar in most cases to the other technologist). The technologist changed the Auto score to his/her earlier manual score, suggesting consistency. However, in an approximately equal number of epochs the change in scoring did not reflect consistency (patterns B to F).
When all epochs are considered (732 ± 190 epochs/PSG) both technologists accepted the Auto scoring, thereby passively resulting in post-edit agreement, in 601 ± 178 epochs (82% ± 8%). Both changed Auto scoring in 37 ± 26 epochs and this was followed by post-edit agreement in 33 ± 25 epochs (patterns 1, 3, 4a, 5a, 6a, 7a, 8a, Figure 2) and disagreement in 4 ± 3 epochs (patterns 2, 4b, and 8b, Figure 2). In 95 ± 45 epochs only one technologist changed Auto scoring, resulting in post-edit disagreement in all (Figure 2). Thus, full editing was followed by 634 ± 179 agreements between the two technologists (87% ± 6%; column 11, Table 1), significantly higher than agreement between them following M2 (78.1% ± 9.7%; column 4, Table 1, p < 1E−9). However, had there been no editing at all, there would have been post-edit agreement in all epochs and the final stage would have been similar to a stage assigned by at least one technologist during M1 and M2 in 642 ± 169 epochs (87% ± 6%; Figure 2, 3rd row). Furthermore, the number of “true” errors (i.e., unacceptable difference from 2 technologists who agreed on the score) would have been only 24 ± 20 epochs/PSG, representing < 4% of epochs on average, with a maximum error in one PSG of 11.7%.
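The Edited-Auto agreement count follows from the mean epoch counts quoted above, as the short tally below shows. The variable names are illustrative. The pooled percentage computed from mean counts need not equal the reported value exactly, because the reported figure is an average of per-PSG percentages.

```python
# Mean epochs per PSG, taken from the text above.
total_epochs = 732
both_accepted_auto = 601      # passive post-edit agreement
both_changed_agree = 33       # both edited Auto and ended up agreeing
both_changed_disagree = 4     # both edited Auto and ended up disagreeing
one_changed = 95              # only one technologist edited Auto: always a disagreement

agreements = both_accepted_auto + both_changed_agree
disagreements = both_changed_disagree + one_changed
pooled_pct = 100.0 * agreements / total_epochs

print(agreements, disagreements, round(pooled_pct, 1))
```

The three groups (601, 37, and 95 epochs) sum to 733 rather than 732 because each is a rounded mean.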
The main finding in this study is that inter-scorer variability in sleep staging is largely related to the presence of a large and highly variable number of epochs that are difficult to score with confidence, such that the scorer may readily reverse the earlier score or accept a different score made by a third party. Since assigning a Rechtschaffen and Kales (R&K) stage to such epochs is somewhat arbitrary, differences between scorers will arise at random and their frequency depends on the frequency of equivocal epochs. The problem is therefore rooted in the PSG records and not in the scorers. As such, inter-scorer variability cannot be solved by more training or fine-tuning the scoring guidelines, but requires a major re-assessment of scoring methods. An important secondary finding is that editing an existing score made by a third party substantially reduces inter-scorer variability by offering an option that experienced scorers would be willing to accept for such equivocal epochs, thereby reducing the chance of the random staging differences that would otherwise occur.
Critique of Epoch Classification
In this study two registered, highly experienced polysomnography technologists scored the PSGs. Accordingly, scoring differences due to inadequate training can be ruled out. When a score in M1 is corrected in M2 by the technologist who made it and does not appear later in the third round of scoring it can be safely assumed that it was an inattention error. This type of error was rare, occurring in < 2% of epochs, thereby further confirming the scoring expertise of the technologists. Clearly, the contribution of such errors would be higher if one or both technologists are less experienced.
Epochs where both technologists consistently assigned a different sleep stage through three scoring exercises, including one instance where a different Auto score was actively changed to the sleep stage preferred by one of the scorers (Pattern 2, Figure 2), would be consistent with a scoring bias. The EEG in such epochs must clearly be different from the EEG in unequivocal epochs (Pattern 3, Figure 2) since it allows room for differences of opinion, due to different interpretation of qualitative scoring guidelines or different training. In other words, these epochs are also ambiguous. However, unlike in the equivocal epochs, differences between scorers here are hard-wired and consistent. We had expected this to be the most common type of disagreement. However, it occurred in only 25 ± 15 epochs/PSG (17% ± 9% of epochs with disagreement in M1).
There is much evidence that epochs assigned to the equivocal category are so ambiguous that scorers have little or no commitment to the sleep stage they initially assigned and that the choice between two, or sometimes three or four, possible sleep stages was nearly random (e.g., Figure 3). First, in all such epochs at least one experienced technologist changed his/her initial score and in a third (patterns 5, 7, 8, and 9, Figure 1 and Table 2), both technologists changed their original score later. That both technologists changed their score in only a third of these epochs is to be expected from this protocol since a change in the original score requires: (a) overruling one's own score during M2 (i.e., this was not re-scoring an unscored PSG, but simply reviewing one's previous score); (b) accepting an Auto score that is different; or (c) changing an Auto score that is similar to theirs. The chance of these conditions being met in the same epoch for both scorers is small.
Second, in the majority of equivocal epochs (173 ± 75/PSG), the choice was between only two sleep stages (Patterns 4–7, Figure 1 and Table 2), and in virtually all the rest (26 ± 25 epochs/PSG) it was between three sleep stages (Pattern 8). If the choice of sleep stage in these equivocal epochs were random, one would expect a nearly equal number of agreements in M1 to have been produced also at random, such that their sleep stage would be changed later. The number of epochs with agreement in M1 that showed disagreement later and were classified as equivocal was 92 ± 44 epochs/PSG (patterns 4b, 5a, and 37 disagreements that appeared first in M2, Figure 2). This is comparable to the number of equivocal epochs included in M1 disagreements (105 ± 54 epochs/PSG).
Third, in most equivocal epochs (117/199; 59%) both scorers accepted the Auto-scored sleep stage even though it was different from the sleep stage assigned by at least one of them. Where changes to Auto score were made the newly assigned sleep stage appeared to show consistency within the scorer in 39 ± 32 epochs/PSG and did not show consistency in the other 33 ± 21 epochs/PSG (Figure 4). Thus, it can be argued that these changes were also likely to be random.
Figure 5 shows the relationship between the number of equivocal epochs/PSG and the percent agreement between the two scorers in M1. Although an association may be expected because equivocal epochs contribute to scoring disagreements in M1, the correlation is nonetheless remarkable, particularly when considering that not all M1 disagreements are counted as equivocal (Patterns 1 and 2 are excluded), and only about half of equivocal epochs are included in M1 disagreements. This relationship suggests that inter-scorer agreement between experienced technologists is largely dependent on PSG file characteristics. The percent of equivocal epochs ranged from 3% to a remarkable 76% of total epochs in different PSGs and the corresponding percent agreement between scorers ranged from 45% to 98% (Figure 5). Given this fact, the percent agreement between two scorers or between a scorer and an automatic system is meaningless in terms of their competence or accuracy unless PSG characteristics are taken into account. Likewise, when disagreement between competent scorers is large, addition of other scorers to reach consensus does not add validity to the score since another random score of the equivocal epochs is simply added.
The kappa statistic (κ)16 is often used to correct the percent agreement between scorers for agreement by chance. Calculation of agreement by chance using Cohen's κ is based on the relative frequency of different sleep stages in the PSG. The current study shows that the frequency of equivocal epochs is another major determinant of agreement by chance and that agreement according to Cohen's κ may still overestimate the actual agreement between scorers, particularly when a large number of equivocal epochs is present. Because in these epochs the competent scorer's choice is nearly always between two sleep stages, equivocal epochs contribute nearly half their number to the total number of agreements.
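For reference, Cohen's κ can be computed from two epoch-by-epoch score lists as in the minimal sketch below. The stage labels and the eight-epoch example are hypothetical data chosen only to make the arithmetic easy to follow; the function name is likewise illustrative.

```python
from collections import Counter


def cohen_kappa(scores_a, scores_b):
    """Cohen's kappa for two equally long sequences of categorical scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) == 0:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(scores_a)

    # Observed proportion of epochs on which the two scorers agree.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n

    # Chance agreement uses only each scorer's overall stage frequencies.
    freq_a = Counter(scores_a)
    freq_b = Counter(scores_b)
    chance = sum(freq_a[stage] * freq_b[stage] for stage in freq_a) / (n * n)

    if chance == 1.0:
        # Both scorers used one and the same stage throughout; kappa is undefined.
        return float("nan")
    return (observed - chance) / (1.0 - chance)


# Hypothetical 8-epoch example (W = awake, R = REM).
scorer_1 = ["W", "W", "N1", "N2", "N2", "N2", "N3", "R"]
scorer_2 = ["W", "N1", "N1", "N2", "N2", "N3", "N3", "R"]
kappa = cohen_kappa(scorer_1, scorer_2)
print(f"kappa = {kappa:.3f}")
```

Note that the chance term depends only on the marginal stage frequencies. Agreements that arise by chance within equivocal epochs are therefore not removed by this correction, which is the limitation described above.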
Because inter-scorer variability is largely a function of ambiguous EEG characteristics, increasing scorers' attention or tweaking of scoring guidelines is not likely to solve the problem. In this study, the technologists scored the PSGs in a quiet environment without distractions or time limits. Their attentiveness is confirmed by the minimal contribution of differences due to inattention errors to the overall scoring differences. Furthermore, it is not likely that fine-tuning the guidelines will have much impact given the underlying reasons for uncertainty (exactly when alpha rhythm begins and ends, equivocal sleep spindles, or the precise duration of delta waves). Resolving these uncertainties requires digital analysis or unacceptably long manual scoring time. It is also not clear that forcing scorers to assign a specific R&K stage to epochs where the guidelines cannot be implemented with confidence is desirable. Scoring an epoch as either N2 or awake when the EEG is in between wakefulness and sleep may completely distort the result of the sleep study when equivocal epochs of the wake/NREM type are numerous. The same PSG may be reported as showing insomnia or normal sleep. Likewise, the merit of choosing N2 or N3 when delta wave duration is borderline is not clear. The current study points to some potential solutions, which are discussed below.
Figure 5. Relationship between frequency of equivocal epochs in different polysomnograms and the percent agreement between the 2 scorers. M1, first manual scoring.
Potential Approaches to Mitigate Inter-scorer Variability
We and others5–8,11,12 have found that disagreements occur primarily between wakefulness and sleep, between stages N1 and N2 and between stages N2 and N3 (Table 1). A number of digital techniques are available that can reduce these disagreements:
We have shown previously14 that most wake/sleep disagreements between scorers occur when ORP is between the clearly awake range (> 2.0) and the consolidated sleep range (< 1.0). By adding results of such an index to the scoring report, equivocal epochs of the wake/NREM type will receive a score between 1.0 and 2.0, thereby declaring their transitional nature. Furthermore, below 1.0, ORP showed an excellent correlation (r² = 0.98) with arousability, thereby providing an alternative or auxiliary system to N1–N3 staging for quantifying sleep depth. In another application, ORP values could appear on the screen while the epoch is being scored. This would help the scorer decide whether such an epoch should be scored as “awake” or asleep. For example, knowledge that average ORP in the epoch of Figure 3A is near 2.0 would help unify the score as awake (note that the 2 scorers did not revise the Auto score [awake], indicating they were both comfortable with it). On the other hand, the displayed ORP in Figure 3B (average = 1.33) would indicate this epoch to be light sleep. All that remains to finalize the scoring is to decide whether there are spindles or a rapid eye movement in the epoch (see below).
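The ORP ranges described above can be expressed as a simple banding rule. The sketch below is illustrative only: the cut-offs (> 2.0 and < 1.0) come from the text, whereas the function names, the treatment of values exactly equal to 1.0 or 2.0 as transitional, and the within-epoch sample values are assumptions.

```python
from statistics import mean

AWAKE_ABOVE = 2.0               # "clearly awake" range quoted in the text
CONSOLIDATED_SLEEP_BELOW = 1.0  # "consolidated sleep" range quoted in the text


def classify_orp(orp):
    """Map one ORP value to a descriptive band."""
    if orp > AWAKE_ABOVE:
        return "awake"
    if orp < CONSOLIDATED_SLEEP_BELOW:
        return "consolidated sleep"
    # Assumption: the boundary values 1.0 and 2.0 themselves count as transitional.
    return "transitional"


def classify_epoch(orp_samples):
    """Average the ORP values available for one epoch and band the mean."""
    epoch_orp = mean(orp_samples)
    return epoch_orp, classify_orp(epoch_orp)


# Hypothetical within-epoch ORP values averaging about 1.33, as in Figure 3B.
epoch_orp, band = classify_epoch([1.2, 1.4, 1.39])
print(f"mean ORP = {epoch_orp:.2f} -> {band}")
```

Displaying the band next to the epoch does not replace the stage decision; it only flags epochs in which a wake/NREM disagreement is to be expected.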
A number of validated fully automatic scoring systems are currently available.6,7,10,13,17 This study demonstrated that editing such a system yields better agreement between scorers than strictly manual scoring (column 11 vs. columns 1 and 4, Table 1). The time taken for full editing of the system used here, including arousal, respiratory and motor events (≈50 min)17 is less than the time taken to score the PSG de novo, so that the cost of using the automatic system is offset by reduced technician scoring time; furthermore, the results are more reliable and consistent.
This study also showed that epoch-by-epoch editing of an automatic system may be counterproductive in that the advantage of providing a unitary score for equivocal epochs that most technologists would agree with is partially offset by the tendency for technologists to modify some of the preexisting Auto scores even when those scores agreed with the scoring of another competent technologist or even with their own earlier score. As reported previously,17 agreement between the automatic score and the manual score of each technologist improved after editing (compare columns 9 vs. 5, and 10 vs. 6, Table 1). Although this may be considered an advantage of editing, it may also perpetuate inter-scorer variability. On the other hand, complete reliance on an automatic system may not be acceptable. One compromise is to perform an abbreviated edit in which epoch-by-epoch editing is abandoned in favor of editing specific areas identified by the automatic system as questionable and where errors in Auto scoring may impact clinical decisions. One such abbreviated editing system has been validated.17 Editing time was reduced from over 50 minutes to 6 minutes, and there were no clinically significant differences between the results of the abbreviated and full edits.17
N1/N2 disagreements must arise from assessment of sleep spindles and/or K complexes. Sleep spindles vary greatly between PSGs in their frequency, duration and amplitude, and in some PSGs sleep spindles are so infrequent that failing to score one may result in several epochs of N1 instead of N2 sleep. Furthermore, in some PSGs spindles have such low amplitude and/or short duration, or such visually uncertain frequency, that they are difficult to score with certainty in the absence of quantitative measurements. There are numerous algorithms for detecting spindles18–20 that can reduce these disagreements if incorporated as a pre-scoring module. For example, the events marked as “spindles?” in Figure 3B may visually appear to be either high frequency sleep spindles or beta activity (e.g., subthreshold arousal). The digital system identified these to be clearly spindles (dominant frequency = 14.3 Hz with the correct attributes of spindles). Such information, if available at the time of scoring, would help to unify the scoring around stage N2 (the eye movement in Figure 3B was deemed too slow to be a REM).
N2/N3 disagreements arise from differences in estimating total duration of delta waves in epochs where these waves are neither too few nor too frequent, particularly since there is no clear guideline as to how to measure the duration of a given delta wave. In many files, the EEG remains in a borderline state for extensive periods, setting the stage for major N2/N3 disagreements. If not already available, algorithms can be developed to determine the duration of delta waves in each epoch, thereby resolving this problem.
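As a minimal sketch of how a per-epoch delta-wave duration could feed the N2/N3 decision, the example below sums the time covered by already-detected delta waves and compares it with a threshold. Several things here are assumptions not stated in the text: the 30-second epoch, the 20% slow-wave criterion (taken from the AASM scoring manual), the borderline margin, and all function names. Detection of the waves themselves, including the amplitude criterion, is assumed to have been done upstream.

```python
EPOCH_SECONDS = 30.0     # assumed standard epoch length
N3_MIN_FRACTION = 0.20   # assumed AASM criterion: slow waves in >= 20% of the epoch


def total_covered_seconds(intervals):
    """Total time covered by (start, end) intervals, counting overlaps once."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if end <= start:
            raise ValueError("each interval needs end > start")
        if current_end is None or start > current_end:
            # A gap: close the previous merged interval and open a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def n2_or_n3(delta_intervals, borderline_margin=0.05):
    """Return (stage, delta fraction, borderline flag) for one NREM epoch.

    borderline_margin is a hypothetical tolerance for flagging epochs whose
    delta fraction lies close to the threshold.
    """
    fraction = total_covered_seconds(delta_intervals) / EPOCH_SECONDS
    stage = "N3" if fraction >= N3_MIN_FRACTION else "N2"
    borderline = abs(fraction - N3_MIN_FRACTION) <= borderline_margin
    return stage, fraction, borderline


# Hypothetical detections (seconds from epoch start); the first two overlap.
stage, fraction, borderline = n2_or_n3([(0.0, 2.0), (1.5, 4.0), (10.0, 12.5)])
print(stage, round(fraction, 3), borderline)
```

Reporting the fraction and a borderline flag alongside the stage makes explicit which epochs sit near the N2/N3 boundary instead of forcing a silent choice between the two.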
The results of the present study pertain to two specific scorers. However, they had excellent credentials and their agreement with each other (≈80% in 5-stage scoring) was within the range previously reported for two experienced scorers from different institutions.5–8,11 Furthermore, the rate of identified Tech errors was quite small (< 2% for both). Agreement between Auto and their individual scores was similar (columns 5 and 6, Table 1) indicating that there was no appreciable difference between the two technologists in scoring expertise. Thus, we believe that the manual scoring results are reliable and representative of conventional PSG scoring practice.
The PSGs in the current study did not contain frontal EEG signals. In the absence of frontal EEG signals, K complexes are easier to miss and some delta waves may not reach the 75 μV threshold.21 Although absence of frontal EEG signals may affect the scoring of N1, N2, and N3, it is not likely that this impacted the current findings. Differences between frontal and central EEG signals are simply a matter of amplitude.22,23 There is currently no amplitude requirement for K complexes other than that the wave should “stand out” against background activity.15 A slow wave with an amplitude of 50 μV in the central EEG may have an amplitude of 80 μV in the frontal EEG. Thus, such a wave would “stand out” more, or would be counted as a delta wave, if the frontal EEG were available. Since both scorers were looking at the same signal, one would expect both to miss the K complex or delta wave in the central EEG and to score a K complex or a delta wave in the frontal EEG. No scoring discrepancies should arise unless the visual estimation of amplitude is different between the scorers. A slow wave with borderline amplitude in the central EEG (e.g., 70 μV) may be counted by only one of the scorers if his/her visual estimate of amplitude is more liberal. The same wave would be scored by both if the frontal EEG were available. However, there would still be waves in the frontal EEG with borderline amplitudes and these would be scored differently by the two scorers if their amplitude estimates are different. Thus, the presence or absence of frontal EEG would simply result in the inter-scorer differences appearing in different epochs.
This study was supported by YRT Ltd, Winnipeg, Manitoba, Canada, and by the Faculty of Medicine, Sleep Research Program, University of Calgary. Dr. Younes is majority owner of YRT Ltd. The other authors have indicated no financial conflicts of interest.
AASM, American Academy of Sleep Medicine
Auto, automatic score by Michele sleep scoring software
CPAP, continuous positive airway pressure
EDF, European Data Format
Edited-Auto, automatic score by Michele sleep scoring software after being edited epoch by epoch by the technologists
M1, first manual scoring session
M2, second manual scoring session
MSS, Michele sleep scoring software
N1, stage 1 of non-rapid eye movement sleep
N2, stage 2 of non-rapid eye movement sleep
N3, stage 3 of non-rapid eye movement sleep
NREM, non-rapid eye movement sleep
ORP, odds ratio product
OSA, obstructive sleep apnea
PLMs, periodic limb movements
R&K, Rechtschaffen and Kales
YST, Younes Sleep Technologies
The authors thank Colleen Leslie and John Laprairie for scoring the PSGs, Marc Soiferman for performing the automatic scoring on site, and Michele Ostrowski for coordinating the study.
1. Ferri R, Ferri P, Colognola RM, Petrella MA, Musumeci SA, Bergonzi P. Comparison between the results of an automatic and a visual scoring of sleep EEG recordings. Sleep. 1989;12:354–62.
2. Whitney CW, Gottlieb DJ, Redline S, et al. Reliability of scoring respiratory disturbance indices and sleep staging. Sleep. 1998;21:749–57.
3. Norman RG, Pal I, Stewart C, Walsleben JA, Rapoport DM. Interobserver agreement among sleep scorers from different centers in a large dataset. Sleep. 2000;23:901–8.
4. Collop NA. Scoring variability between polysomnography technologists in different sleep laboratories. Sleep Med. 2002;3:43–7.
5. Danker-Hopfe H, Kunz D, Gruber G, et al. Interrater reliability between scorers from eight European sleep laboratories in subjects with different sleep disorders. J Sleep Res. 2004;13:63–9.
6. Pittman SD, MacDonald MM, Fogel RB, et al. Assessment of automated scoring of polysomnographic recordings in a population with suspected sleep-disordered breathing. Sleep. 2004;27:1394–403.
7. Anderer P, Gruber G, Parapatics S, et al. An E-health solution for automatic sleep classification according to Rechtschaffen and Kales: validation study of the Somnolyzer 24 × 7 utilizing the Siesta database. Neuropsychobiology. 2005;51:115–33.
8. Magalang UJ, Chen NH, Cistulli PA, et al. Agreement in the scoring of respiratory events and sleep among international sleep centers. Sleep. 2013;36:591–6.
9. Kuna ST, Benca R, Kushida CA, et al. Agreement in computer-assisted manual scoring of polysomnograms across sleep centers. Sleep. 2013;36:583–9.
10. Malhotra A, Younes M, Kuna ST, et al. Performance of an automated polysomnography scoring system vs. computer-assisted manual scoring. Sleep. 2013;36:573–82.
11. Zhang X, Dong X, Kantelhardt JW, et al. Process and outcome for international reliability in sleep scoring. Sleep Breath. 2015;19:191–5.
12. Rosenberg RS, Van Hout S. The American Academy of Sleep Medicine inter-scorer reliability program: sleep stage scoring. J Clin Sleep Med. 2013;9:81–7.
13. Punjabi NM, Shifa N, Dorffner G, Patil S, Pien G, Aurora RN. Computer-assisted automated scoring of polysomnograms using the Somnolyzer System. Sleep. 2015;38:1555–66.
14. Younes M, Ostrowski M, Soiferman M, et al. Odds ratio product of sleep EEG as a continuous measure of sleep state. Sleep. 2015;38:641–54.
15. Iber C, Ancoli-Israel S, Chesson AL, Quan SF. The AASM manual for the scoring of sleep and associated events: rules, terminology and technical specifications. Westchester, IL: American Academy of Sleep Medicine, 2007.
16. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
17. Younes M, Thompson W, Leslie C, Egan T, Giannouli E. Utility of technologist editing of polysomnography scoring performed by a validated automatic system. Ann Am Thorac Soc. 2015;12:1206–18.
18. Adamczyk M, Genzel L, Dresler M, Steiger A, Friess E. Automatic sleep spindle detection and genetic influence estimation using continuous wavelet transform. Front Hum Neurosci. 2015;9:624.
19. Ray LB, Sockeel S, Soon M, et al. Expert and crowd-sourced validation of an individualized sleep spindle detection method employing complex demodulation and individualized normalization. Front Hum Neurosci. 2015;9:507.
20. Parekh A, Selesnick IW, Rapoport DM, Ayappa I. Detection of K-complexes and sleep spindles (DETOKS) using sparse optimization. J Neurosci Methods. 2015;251:37–46.
21. Silber MH, Ancoli-Israel S, Bonnet MH, et al. The visual scoring of sleep in adults. J Clin Sleep Med. 2007;3:121–31.
22. Happe S, Anderer P, Gruber G, Klosch G, Saletu B, Zeitlhofer J. Scalp topography of the spontaneous K-complex and of delta-waves in human sleep. Brain Topogr. 2002;15:43–9.
23. McCormick L, Nielsen T, Nicolas A, Ptito M, Montplaisir J. Topographical distribution of spindles and K-complexes in normal subjects. Sleep. 1997;20:939–41.