Significant difference between three observers in the assessment of intraepidermal nerve fiber density in skin biopsy

Background The determination of Intraepidermal Nerve Fiber Density (IENFD) in skin biopsy is a useful method for the evaluation of different types of peripheral neuropathies. To allow a reliable use of the method it is necessary to determine interobserver reliability. Previous studies dealing with this topic used statistical methods of limited suitability. Methods In the present study three observers determined the IENFD and estimated the staining quality of the basement membrane for an adequate quantity of 120 skin biopsies (stained with indirect immunofluorescence technique) from 68 patients. More adequate statistical methods such as the intraclass correlation coefficient and the Bland Altman Plot were chosen to estimate interobserver reliability. Results We found an unexpected significant difference in IENFD between the observers (p < 0.05). With an intraclass correlation coefficient of 0.73, the results of this study are not in line with the high interobserver reliability reported before. The Bland Altman Plot showed a variance growing with rising mean. The difference in IENFD between the observers and the resulting low interobserver reliability is likely caused by different interpretations of the standard counting rules. There was no significant difference in IENFD between observers for biopsies with a well-defined basement membrane. Thus skin biopsies with an inexactly defined basement membrane should not be used diagnostically for the determination of IENFD. Conclusion These results emphasise that standardisation of the method is extremely important and that at least two observers should analyse skin biopsies with critical IENFD near the cut-off values.


Background
Although numerous patients in pain or neurology departments are admitted for typical neuropathic symptoms such as paraesthesia and dysaesthesia, conventional diagnostic methods such as nerve conduction studies and electromyography often do not show pathological findings [1][2][3][4]. Immunohistochemical visualisation of the intraepidermal nerve fibers (IENF) in skin biopsy and quantitative sensory testing (QST) are two new diagnostic methods to objectify the disorders of some of these patients [5]. In 2005 Lauria et al. published the guidelines of the European Federation of Neurological Societies (EFNS) on the use of skin biopsy and the determination of IENF density (IENFD) in the diagnosis of peripheral neuropathy [6]. For a reliable use of this method, a check of methodological quality criteria is essential. In particular, reliability as a measure of methodological accuracy has to be determined, e.g. by calculating the interobserver reliability. For this purpose two or more observers conduct the same test and their agreement is subsequently analysed.
A few previous studies have dealt with interobserver reliability [7][8][9][10]. Some of them calculated correlation coefficients [7,10]. The value of such correlation coefficients for determining interobserver reliability is limited, since systematic differences in level remain unnoticed and extreme values can feign a higher reliability [11]. Smith et al. calculated the intraclass correlation coefficient and the relative intertrial variability (RIV) to determine interobserver reliability. The RIV is calculated as $\mathrm{RIV} = \frac{\mathrm{IENFD}_1 - \mathrm{IENFD}_2}{\overline{\mathrm{IENFD}}} \times 100\,\%$, where $\overline{\mathrm{IENFD}}$ is the mean IENFD, and values less than 10% indicate a high degree of reproducibility. With this measure, small absolute differences at low IENFD values are presented as high percentage values while equivalent absolute differences at high IENFD values are presented as lower percentage values [9]. This approach can lead to an incorrect estimation of the reliability. Gøransson et al. estimated the interobserver reliability by calculating the absolute difference between the IENFD results of two observers.
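The scale dependence of the RIV can be illustrated with a minimal sketch. It assumes that the denominator is the mean of the two IENFD values being compared and that the absolute difference is used; the function name and the example densities are hypothetical and not taken from the study data.

```python
def relative_intertrial_variability(ienfd_1: float, ienfd_2: float) -> float:
    """RIV in percent: absolute difference divided by the mean of both values.

    Assumption: the mean in the denominator is the mean of the two
    measurements, and the sign of the difference is ignored.
    """
    mean_ienfd = (ienfd_1 + ienfd_2) / 2.0
    if mean_ienfd == 0:
        raise ValueError("RIV is undefined when both densities are zero")
    return abs(ienfd_1 - ienfd_2) / mean_ienfd * 100.0


# Same absolute difference (1 fiber/mm) at a low and at a high density level.
riv_low = relative_intertrial_variability(2.0, 3.0)     # 1 / 2.5 * 100
riv_high = relative_intertrial_variability(20.0, 21.0)  # 1 / 20.5 * 100
print(f"low IENFD:  {riv_low:.1f} %")
print(f"high IENFD: {riv_high:.1f} %")
```

In this made-up example an identical difference of 1 fiber/mm yields an RIV well above the 10% threshold at low density and well below it at high density, which is the distortion described above.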
Due to the limited suitability of the statistical methods so far applied, there is still some need to adequately demonstrate interobserver reliability of the IENFD determination by skin biopsy. To achieve this, three independent observers analysed a sufficient quantity of biopsies. Additionally, more appropriate statistical methods were chosen in order to confirm a reliable use of the skin biopsy in clinical diagnostics.

Patients
Skin biopsies were examined from 68 patients, all of whom had previously participated in several independent studies designed to examine the validity of QST and to determine the IENFD in skin biopsy. Patients suffered from polyneuropathy (n = 23), fibromyalgia (n = 18), arthritis (n = 13) or neuropathic pain after nerve injury at the lower limb (n = 14). The duration of symptoms of patients with nerve injury ranged from 14 to 294 months, with a median of 46 months. Affected nerves were either the common or superficial peroneal nerve (n = 11) or the lateral cutaneous nerve of the thigh (n = 3). The age of all patients ranged from 21 to 74 years (mean 52 ± 13 years) (Table 1). All studies were approved by the local ethics committee of the Ruhr University Bochum and the patients gave written informed consent.

Skin biopsy
The procedure of skin biopsy followed the protocol by Vlckova-Moravcova et al. [12], a modified version of the original Guidelines of the EFNS [6]. Indirect immunofluorescence technique was used. Two samples were taken from each patient, one from the affected and one from an unaffected skin area. In patients with polyneuropathy, fibromyalgia and arthritis, biopsies were therefore taken from the dorsolateral foot and the back (dermatome L4). The very distal biopsy site at the foot was chosen because all patients had complaints in this area, but not all had complaints at the lower leg, which would be the standard biopsy site recommended by the EFNS guidelines. As a level A recommendation those guidelines also suggest the sampling of an additional biopsy from an unaffected site in patients with generalised diseases to provide information about a length-dependent process. The L4 dermatome was chosen as the second site because it was the least affected area in most of the patients. In patients with nerve injury, biopsies were taken bilaterally from the foot (dorsolateral or dorsomedial) or the lateral thigh. After local injection of 2% lidocaine the removal was carried out under sterile conditions with a 3 mm biopsy punch (Stiefel GmbH, Offenbach, Germany). Tissue was fixed in 4% phosphate-buffered paraformaldehyde for 3-4 hours and cryoprotected in 10% sucrose at 4°C overnight. Subsequently the skin samples were embedded in TissueTek®, frozen in 2-methylbutane cooled in liquid nitrogen and stored at −70°C until further processing. Sections of 40 μm thickness were cut on a sliding microtome and immunostained with rabbit polyclonal antibodies to human PGP 9.5 (Ultraclone, UK, 1:800) as primary antibody and marked with Cyanine 3 (Jackson ImmunoResearch, USA).
The intraepidermal nerve fibers were counted manually in two sections of approximately 3 mm length each by three independent observers (MF, ISH, SW), who were professionally trained at an approved skin biopsy laboratory (Department of Neurology, University of Würzburg, Germany). To determine interobserver reliability, counting was conducted in a blinded fashion at 400× magnification with a Zeiss Axiophot 2 microscope, adhering to the standard counting rules [13] agreed on by the European guidelines 2005 [6]. Samples were only evaluated if the staining quality of both sections was judged to be satisfactory by all observers (e.g. distinct discrimination of dermis and epidermis, clearly illustrated nerve fibers). Samples were excluded from the determination of interobserver reliability if they were judged to be of bad quality for counting by at least one observer (e.g. nerve fibers or basement membrane stained badly). The epidermal length was measured using Image Pro Plus 4.0 software (Media Cybernetics, Leiden, The Netherlands). The average intraepidermal nerve fiber density (IENFD) per mm of epidermal length was then calculated. IENFD results from biopsies taken from the foot were compared with published control data [12], as done in a previous study [4], and classified as pathologic in case of an IENFD of less than 9 fibers/mm.
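The density calculation and the cut-off classification can be sketched as follows. This is a minimal illustration that assumes the average IENFD is the mean of the per-section densities (the text does not state whether densities were averaged or counts and lengths were pooled); the function names and the fiber counts are hypothetical.

```python
PATHOLOGIC_CUTOFF = 9.0  # fibers/mm, cut-off stated in the text for foot biopsies


def ienfd(fiber_counts, epidermal_lengths_mm):
    """Mean of per-section densities in fibers per mm of epidermal length.

    Assumption: densities are computed per section and then averaged.
    """
    if not fiber_counts or len(fiber_counts) != len(epidermal_lengths_mm):
        raise ValueError("need one epidermal length per counted section")
    densities = [c / l for c, l in zip(fiber_counts, epidermal_lengths_mm)]
    return sum(densities) / len(densities)


def is_pathologic(density, cutoff=PATHOLOGIC_CUTOFF):
    """A biopsy is classified as pathologic if the IENFD is below the cut-off."""
    return density < cutoff


# Two hypothetical observers counting the same two 3 mm sections.
density_a = ienfd([24, 30], [3.0, 3.0])  # sections: 8.0 and 10.0 fibers/mm
density_b = ienfd([24, 27], [3.0, 3.0])  # sections: 8.0 and 9.0 fibers/mm
print(density_a, is_pathologic(density_a))
print(density_b, is_pathologic(density_b))
```

In this made-up case a difference of three fibers in a single section moves the biopsy across the cut-off, which illustrates why values near the cut-off are sensitive to observer differences.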
Additionally every observer evaluated the definition of the basement membrane in each biopsy, classifying it as 'well', 'moderately' or 'inexactly' defined. In summary the basement membrane was rated 'well defined' if at least two observers ranked it so.
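The consensus rule for the basement membrane can be written down as a short sketch. The rating labels follow the three categories named above; the function name and the input format (one label per observer) are assumptions made for illustration.

```python
ALLOWED_RATINGS = {"well", "moderately", "inexactly"}


def consensus_well_defined(ratings):
    """Return True if at least two observers rated the membrane 'well' defined."""
    unknown = set(ratings) - ALLOWED_RATINGS
    if unknown:
        raise ValueError(f"unknown rating(s): {sorted(unknown)}")
    return sum(r == "well" for r in ratings) >= 2


# One label per observer for a single biopsy (hypothetical ratings).
print(consensus_well_defined(["well", "well", "inexactly"]))
print(consensus_well_defined(["well", "moderately", "inexactly"]))
```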

Data analysis
All statistical analyses were performed using the Statistica software package, release 7.1 for Windows (StatSoft Inc., USA) and the Statistical Package for the Social Sciences (SPSS 12). Differences between observers were analysed using a one-way analysis of variance (ANOVA). Because homogeneity of variance could not be demonstrated, post hoc comparisons were calculated using Dunnett T3 post hoc tests. P values < 0.05 were considered significant. Since IENFD from adjacent sections of one biopsy showed a high degree of association [7], the accuracy of each observer was estimated by calculating the standard deviation between both sections of one biopsy (intersection variability) and the relative standard deviation (SD/mean). To demonstrate the variance growing with rising mean the results were presented as a Bland Altman Plot [14]. Interobserver reliability was measured by calculating the intraclass correlation coefficient with absolute agreement definition [15]. To compare the results of this study with those of previous studies, correlation coefficients and RIV were also calculated. The RIV is given by $\mathrm{RIV} = \frac{\mathrm{IENFD}_1 - \mathrm{IENFD}_2}{\overline{\mathrm{IENFD}}} \times 100\,\%$, where $\overline{\mathrm{IENFD}}$ is the mean IENFD, and values of less than 10% indicate a high degree of reproducibility [9].
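For readers who want to reproduce this kind of analysis, the following minimal sketch computes Bland Altman statistics for one pair of observers and a single-measure intraclass correlation coefficient with absolute agreement from a two-way layout. It is not the Statistica or SPSS routine used in this study; the function names and the example values are hypothetical, and the two-way single-measure form of the coefficient is an assumption, since the text only specifies the absolute agreement definition.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Per-biopsy means and differences, bias and 95% limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]
    means = [(x + y) / 2 for x, y in zip(a, b)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return means, diffs, bias, (bias - spread, bias + spread)


def icc_absolute_agreement(ratings):
    """Single-measure ICC with absolute agreement, two-way layout.

    ratings: one row per biopsy, one column per observer.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical data: observer 2 counts exactly 1 fiber/mm more than observer 1.
offset_ratings = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
icc_offset = icc_absolute_agreement(offset_ratings)

_, _, bias, limits = bland_altman([10.0, 12.0, 20.0], [8.0, 11.0, 15.0])
print(round(icc_offset, 3), round(bias, 3), limits)
```

In the offset example the two observers are perfectly correlated, yet the absolute agreement coefficient is clearly below 1, because a constant level difference counts as disagreement. This is the property that makes it more suitable than a plain correlation coefficient for the question studied here.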

Results
A total of 120 biopsies from 68 patients (polyneuropathy: n = 44; nerve injury at the lower limb: n = 25; fibromyalgia: n = 30; arthritis: n = 21) were analysed. Sixteen biopsies had to be excluded due to bad quality.
Evaluation of the complete data showed a significant difference between the IENFD counted by different observers (Table 2). Variance increased with rising mean (Figure 1). However, even at low IENFD values, e.g. in biopsies taken from the foot, the difference between the observers remained significant. Overall, observer 2 counted the highest values for all biopsy sites, whereas observer 3 counted the lowest values for all biopsy sites. In conformity with these results the post hoc tests revealed that in all cases the significant difference lay only between these two observers.
The intersection variability differed significantly between the observers for the foot data and the complete data. Observer 3 had the lowest values, in contrast to observer 2, who had the highest ones (Table 2). In this case the post hoc tests revealed a significant difference between observer 3 and both other observers for the foot data. However, with respect to the overall data the intersection variability differed significantly only between observers 2 and 3.
The comparison of IENFD results of 71 foot biopsies with published control data showed that the significant interobserver difference would generate different rates of pathological results. The results from observer 3 would add up to 68 pathological biopsies, compared with lower numbers of pathological biopsies for the other two observers (62 and 63 respectively). Since the control data were taken from the distal calf [12], the accuracy of this comparison might be limited.
The intraclass correlation coefficient for all data was 0.73. Due to the significant difference between the observers, the correlation coefficients (Figures 2 and 3) and RIV with participation of observer 3 showed the lowest values. The RIV was 35.6% between observers 1 and 2, 61% between observers 1 and 3 and 63.8% between observers 2 and 3.

Discussion
Our results revealed an unexpected significant difference in IENFD between three observers. Despite having received the same training, the three observers most likely interpreted the standard counting rules [13] differently. Since we found the lowest values of intersection variability and therefore the highest accuracy for the observer stating the lowest IENFD values, the strict interpretation might be more reliable.
Other groups reported higher interobserver reliability, with correlation coefficients ranging from 0.86 to 0.96 [7,10]. In further studies the RIV was 9.6%, the intraclass correlation coefficient 0.98 [9] and the mean difference between the IENFD results of two observers 0.4 ± 1.5 fibers/mm [8].
The low interobserver reliability in our study was probably caused by the described significant interobserver difference in IENFD. Additionally we might have found higher interobserver reliability by counting three sections as recommended by the EFNS Guidelines [6]. Considering the pronounced significant difference between the observers in our study, the results would have probably been similar. Furthermore an accessory analysis of intraobserver reliability would allow a more accurate interpretation of the interobserver reliability.
Figure 1 Bland Altman Plot for all biopsies (n = 120).
The qualitative evaluation of the basement membrane before counting the intraepidermal nerve fibers could be an approach to improve the methodical accuracy. The results allow the conclusion that interobserver reliability is higher if the basement membrane is well defined. Consequently skin biopsies with inexact illustration of the basement membrane should not be used for the determination of IENFD in clinical diagnostics and scientific studies. However the number of biopsies with a well defined basement membrane was quite small in our study and there was only a little improvement of interobserver reliability.
Another possibility to avoid inaccurate IENF counting due to an inexactly defined basement membrane might be the use of antibodies against collagen IV with confocal microscopy to better visualise the basement membrane [6].

Conclusion
In summary, the determination of IENFD by skin biopsy is a useful method to investigate different types of peripheral neuropathy [16], but our results show that standardisation of the method is extremely important. However the number of biopsies was quite small in our study and we used a modified version of the original Guidelines of the EFNS. Therefore our results are limited to a small number of patients but lead us to the following conclusion. To avoid [...] the EFNS guidelines [6] recommend the application of the counting protocol which was described by Kennedy et al. [13]. Our results show that a consensus should be reached on the interpretation of the counting rules in biopsies with less accurate illustration of the skin innervation. We recommend that observers undergo thorough training and that intraobserver reliability be demonstrated by intra-lab assessment to avoid different interpretations of the counting rules by individuals. Nevertheless IENFD counting may still remain partially subjective. Skin biopsies with critical IENFD values (IENFD near the cut-off values) should be analysed by at least two observers together. Furthermore, mandatory external quality controls of skin biopsy laboratories, e.g. by interlaboratory comparison, should be enforced. Whilst in experienced laboratories the interobserver reliability may not be an issue, consensus data is still needed for application to all labs.
Figure 3 Correlation between IENFD measured by three independent observers for all biopsies (n = 120). (Correlation coefficient for observer 2/3).
Figure 4 Skin biopsy immunostained for PGP 9.5, which was evaluated in this study. Number of IENF stated by the observers (MF, ISH and SW) of this study: 0-2 fibers. The different results were probably caused by difficulties in determining the correct position of the fiber (e.g. the fiber approaches the basement membrane but does not cross it → 0 fibers; the fiber branches after crossing the basement membrane → 1 fiber; the fiber branches within the basement membrane → 2 fibers).
Figure 5 Skin biopsy immunostained for PGP 9.5, which was evaluated in this study. Number of IENF stated by the observers (MF, ISH and SW) of this study: 7-11 fibers. The different results were probably caused by difficulties in determining the correct position of the fibers due to the high number of fibers and inexact illustration of nerve fibers and basement membrane.