Central nervous system antiretroviral efficacy in HIV infection: a qualitative and quantitative review and implications for future research

Background: There is conflicting information as to whether antiretroviral drugs with better central nervous system (CNS) penetration (neuroHAART) assist in improving neurocognitive function and suppressing cerebrospinal fluid (CSF) HIV RNA. The current review aims to better synthesise the existing literature by using an innovative two-phase review approach (qualitative and quantitative) to overcome methodological differences between studies.
Methods: Sixteen studies, all observational, were identified using a standard citation search. They fulfilled the following inclusion criteria: conducted in the HAART era; sample size > 10; treatment effect involving more than one antiretroviral; and no retrospective design. The qualitative phase of the review consisted of (i) a blind assessment rating studies on features such as sample size, statistical methods and definitions of neuroHAART, and (ii) a non-blind assessment of the sensitivity of the neuropsychological methods to HIV-associated neurocognitive disorder (HAND). During the quantitative phase we assessed the statistical power of the studies that achieved a high rating in the qualitative analysis. The objective of the power analysis was to determine each study's ability to assess its proposed research aims.
Results: After studies with at least three limitations were excluded in the qualitative phase, six studies remained. All six found a positive effect of neuroHAART on neurocognitive function or CSF HIV suppression. Of these six studies, only two had statistical power of at least 80%.
Conclusions: Studies assessed as using more rigorous methods found that neuroHAART was effective in improving neurocognitive function and decreasing CSF viral load, but only two of those studies were adequately statistically powered. Because all of these studies were observational, they represent a less compelling evidence base than randomised controlled trials for assessing treatment effect. Therefore, large randomised trials are needed to determine the robustness of any neuroHAART effect. However, such trials must be longitudinal, include the full spectrum of HAND, ideally carefully control for co-morbidities, and be based on optimal neuropsychology methods.


Background
The possibility that some antiretroviral drugs (ARVs) with more efficient central nervous system (CNS) penetration, used as part of Highly Active Antiretroviral Therapy (HAART), may be associated with better neurocognitive (NC) functioning and more efficient cerebrospinal fluid (CSF) HIV RNA suppression than other ARVs has important clinical and therapeutic implications [1]. In this paper, we will refer to HAART with more efficient CNS penetration as neuroHAART.
First, if true, it means that a non-negligible number of individuals with HIV-associated neurocognitive disorders (HAND) are not receiving optimal treatment. Without proactive assessment of HAND, individuals with asymptomatic neurocognitive impairment (ANI) and mild neurocognitive disorder (MND) [2], who now represent the greatest proportion of HAND cases (as opposed to HIV-associated dementia, HAD, in the pre-HAART era), are unlikely to be considered for a specific therapeutic strategy. A French study [3] lends support to this potential sub-optimal treatment scenario. This study found that in the pre-HAART era, individuals with HAD were preferentially treated with antiretrovirals with greater CNS penetration. However, this was still sub-optimal as it used pre-HAART drugs. The authors then found that in the HAART era, the treatment strategy, as assessed retrospectively, did not favour neuroHAART for individuals with HAD, while ANI and MND were not considered at all.
Second, and perhaps most importantly, there are currently no HAART guidelines for HAND [4]. While this topic is hotly debated [5], an informative review of the literature has been missing.
Lastly, there is the prospect that some individuals at risk for HAND may benefit from preventative treatment. This question is currently being studied in a large-scale international trial (http://insight.ccbr.umn.edu/start/).
The possible superior efficacy of neuroHAART, however, remains highly controversial in the HIV research community because studies have provided conflicting results. More importantly, the definitions of what constitutes neuroHAART vary (see [1] for a review of neuroHAART definitions). The underlying premise that the brain is an HIV sanctuary site less amenable to effective systemic treatment is itself at the centre of a scientific debate [6]. However, evidence for the CNS being a virological sanctuary site can be drawn from clinical [7], CSF [8][9][10] and human neuropathological studies [11,12].
The aim of this review is to better synthesise the results of existing studies conducted to address the issue of the potential superior efficacy of neuroHAART on brain functions and CSF HIV RNA suppression. This review is based on an innovative staged review strategy which was designed to overcome variations in study design and neuroHAART definition. First, each study was reviewed using a blind quality scoring for the presence or absence of major design, methodological and analysis features (adapted from [13]) combined with a non-blind assessment of appropriate use of demographic and longitudinal (practice effect) corrections on neuropsychological data, as well as test sensitivity to HAND. Studies with at least three methodological limitations were excluded (incidentally retaining studies within the upper confidence interval limits of the quality scoring distribution). Remaining studies were then evaluated for statistical power, that is, their ability to detect a significant difference based on their stated aims. The implications of the findings are discussed for future research.
The following criteria were then used to select studies: 1. Reports had to be conducted in the HAART era (that is, after 1996). 2. Group comparisons had to have subject numbers of 10 or more; lower numbers provide unstable effect sizes. 3. Reports had to investigate the effect of more than one drug within an existing HAART regimen, because the current review was focused on the effect of multiple ARV agents. 4. Studies could not be based on retrospective data analyses. Some of those studies not only included a sub-optimal definition of HAND, but also were prone to systematic biases in the baseline clinical status of individuals starting a more or less efficient neuroHAART regimen [14].
Using these criteria 16 studies were identified and they are detailed in Table 1. Seven were excluded and their references are included in additional file 1.

Qualitative analysis
Blinded and non-blinded review for quality scoring
Because the quality of studies (as assessed by statistical power, sample size, reliability of assessment of HAND, definitions of CNS penetration efficiency (CPE) score, and neuropsychological methods) varied substantially, we computed a score to rank each study on these factors. This scoring form was adapted from [13]. The statistical methods and design used in each study were reviewed blindly by E.K.W., who was provided with a printed copy of the studies (after removal of authorship identifiers and without the title, the abstract and the discussion) and a scoring form (additional file 2). In brief, we used 15 quality criteria that fell into 5 categories (blinded: design, outcomes, subjects, controls; and unblinded outcomes). A not-applicable option was provided for criteria that might not apply to all studies (e.g., it made no sense to examine whether there were demographic differences between cases and controls in observational studies with no control arm). Importantly, a non-blinded review of the presence or absence of neuropsychological cross-sectional norms, longitudinal norms (correction for practice effect) and validity and sensitivity of tests used to assess NC performance in HAND (including size of the neuropsychological battery) was then performed by a neuropsychologist (L.A.C.). This review was non-blinded because it sometimes required the exploration of previous papers by the same research team. Therefore a choice was made that all papers would be treated equally with a non-blind assessment.
As mentioned above, in some instances, quality criteria were "not applicable" to some included studies. To account for these cases and still rank the studies on a similar scale, a total percentage score was developed. This total percentage score was computed based on the aforementioned criteria and studies were ranked (see Table 1). A score less than or equal to 80% meant that a study presented at least three significant methodological limitations. This continuous score was developed to be more transparent than assigning quality categories such as "poor", "medium" or "high" quality (additional file 3).
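The percentage score described above can be sketched as follows. This is a minimal illustration only: it assumes every applicable criterion carries equal weight, which the scoring form (additional file 2) may not do, and the function names and the example ratings are hypothetical.

```python
from typing import Optional, Sequence


def quality_percentage(ratings: Sequence[Optional[bool]]) -> float:
    """Percentage of applicable criteria that a study meets.

    True  = criterion met
    False = criterion not met (a methodological limitation)
    None  = criterion not applicable; dropped from the denominator
    Assumption: all applicable criteria are weighted equally.
    """
    applicable = [r for r in ratings if r is not None]
    if not applicable:
        raise ValueError("no applicable criteria to score")
    return 100.0 * sum(applicable) / len(applicable)


def is_higher_quality(score: float, cutoff: float = 80.0) -> bool:
    """Studies are retained only if they score strictly above the cutoff."""
    return score > cutoff


# Hypothetical study: 15 criteria, 2 not applicable, 11 of the other 13 met.
example = [True] * 11 + [False] * 2 + [None] * 2
score = quality_percentage(example)
print(f"{score:.1f}% -> retained: {is_higher_quality(score)}")
```

With all 15 criteria applicable, three unmet criteria give exactly 80%, which is consistent with the statement that a score of 80% or less corresponds to at least three limitations.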
Whilst our qualitative analysis necessarily involved subjectivity, when presented with studies of poor or variable quality some subjective assessment is needed to ensure that like is compared with like; a well-designed RCT should ideally not be grouped with a retrospective observational study in such an analysis. Whilst Finney argued in a seminal paper on meta-analysis that an assessment can be purely qualitative when studies are heterogeneous in nature or provide low-quality evidence [15], our scoring method enabled us to transparently identify qualitatively similar studies and analyse them further to provide some quantitative conclusions.

Power computations
The objective of the power analysis was to determine each study's ability to assess its proposed research aims. Accordingly, we developed the following strategy: the power computations were conducted [16] using G*Power version 3.1 [17] for the studies with a quality score greater than 80%. The power projections were made for conventionally small, medium and large effect sizes (namely, 0.20, 0.50 and 0.80) [18], and were made separately for cross-sectional and longitudinal designs and for univariate and multivariate designs (see Figure 1). The studies were then checked against an "acceptable" criterion: power of 80% for a two-tailed test at a significance level of 0.05. Using this criterion for a medium effect size (d = 0.50), we selected which studies in the review were conducted with acceptable power, though it should be acknowledged that the definition of acceptable power is inherently subjective [19].
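To make the criterion concrete, the following sketch approximates the power of a two-group, two-tailed comparison for the three conventional effect sizes. It is an illustration under stated assumptions, not a reproduction of the G*Power computations: it uses a normal approximation rather than the noncentral t distribution, covers only the simplest cross-sectional univariate design, and the function name and the group size of 64 are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and SD 1


def two_group_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed, two-sample comparison of means.

    d is Cohen's d (standardised mean difference). Assumes equal group
    sizes and a normal approximation to the test statistic.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected value of the z statistic
    # Probability of landing in either rejection region.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


for effect_size in (0.20, 0.50, 0.80):
    power = two_group_power(effect_size, n_per_group=64)
    verdict = "acceptable" if power >= 0.80 else "underpowered"
    print(f"d = {effect_size:.2f}: power = {power:.2f} ({verdict})")
```

Under this approximation a study with 64 participants per group just clears 80% power for a medium effect but falls well short for a small one, which is the kind of check each retained study was put through against the panel matching its design.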

Results
The quality scores from the 16 studies were normally distributed with a mean of 76.6% and a standard deviation of 12.8% (additional file 3). Of the 16 studies analysed, six had fewer than three methodological limitations, and none obtained a quality score of 100%, which would have reflected an optimal design for addressing their research aims (see Table 1). Four of the six studies retained were longitudinal; all included individuals who had been well characterised neuropsychologically and clinically to determine the full HAND spectrum, while using appropriate normative data when required. The four retained longitudinal studies included statistical methods taking into account the potential practice effect associated with repeated neuropsychological testing. All these studies found a positive effect of neuroHAART on NC, yielding small to large effect sizes (additional file 4). One cross-sectional study and one longitudinal study also found that CSF HIV RNA was more effectively suppressed as a function of a higher CPE.
Figure 1. In each panel, the six studies found to be of "appropriate quality" were rated against the power projection for their relevant design, and arrows indicate the N enrolled for each study. In addition, a red circle in each power panel indicates the criterion against which studies were defined as having "appropriate versus non-appropriate power"; this was selected for the medium effect size.
Studies' quality scoring highlights (see also Table 1)

Heterogeneity of study samples
A number of studies included individuals with various HIV-associated neurological conditions. Others did not carefully report inclusion or exclusion criteria. A few studies did not provide enough information to assess the baseline rate of neuropsychological impairment in their cohort. Lastly, convenience comparisons were sometimes made between samples on and off neuroHAART even though these samples differed on a number of important clinical characteristics.

Insensitive assessment tools in the case of neurocognitive studies
Among the studies investigating NC functions, five (out of 11) included either a small number of tests or clinical scales rather than standard neuropsychological assessment. Others used non-cognitive endpoints such as time to HAND and survival time, which do not directly address potential change in NC functions, but only represent surrogates that are less sensitive to neuropsychological change.

Lack of norms and practice effect correction
One study did not use adequate normative data correction for demographic factors, potentially misclassifying the impairment rate at baseline. Among the neuropsychological longitudinal studies, 50% did not correct for practice effects.

Assessment time points
One major point in the heterogeneity of the published study designs was the variation in test-retest intervals. Among studies with higher quality scoring, only one used a short test-retest interval of 3 months. The four others used a retest interval of at least 12 months.

Studies' quantitative analysis highlights (see also Table 1)

Power
Using our criterion for "appropriate power" we found that two of the six studies with higher quality scoring were conducted with appropriate power (medium effect size, that is d = .50; 80% power; two-tailed with p = .05). Figure 1 also illustrates which criterion was used for the four different study designs.

Other major issues in all studies
Non-reported p-values or data to compute effect sizes
The majority of studies reported p-values or data allowing conversion to effect sizes when appropriate. Two studies did not provide exact p-values or data for their non-significant findings (see legend for Table 1).

Various definitions of neuroHAART
Studies used various definitions of neuroHAART (see Table 1), sometimes including fewer drugs than the conventional definition of HAART without good rationale (i.e., fewer than three ARVs in some comparisons). Seven studies used a simple continuous aggregate of ARVs with good penetration. Moreover, five of 16 studies (excluding the CNS Penetration Effectiveness (CPE) validation study) used what has been termed the CPE score. This is an empirically derived score that is a summed aggregate of the individual scores for each ARV in a HAART regimen.
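The summed-aggregate idea behind a CPE-type score can be sketched in a few lines. The drug names and rank values below are placeholders invented for illustration; they are not published CPE ranks, and the function name is hypothetical.

```python
from typing import Mapping, Sequence

# Hypothetical penetration ranks for placeholder drugs.
# These are NOT published CPE values; substitute a validated table in practice.
HYPOTHETICAL_RANKS: Mapping[str, float] = {
    "drug_a": 1.0,
    "drug_b": 0.5,
    "drug_c": 0.0,
    "drug_d": 1.0,
}


def regimen_score(
    regimen: Sequence[str],
    ranks: Mapping[str, float] = HYPOTHETICAL_RANKS,
) -> float:
    """Sum the individual penetration ranks of every drug in a regimen."""
    missing = [drug for drug in regimen if drug not in ranks]
    if missing:
        raise KeyError(f"no rank available for: {', '.join(missing)}")
    return sum(ranks[drug] for drug in regimen)


print(regimen_score(["drug_a", "drug_b", "drug_c"]))
print(regimen_score(["drug_a", "drug_b", "drug_d"]))
```

Because the score is a plain sum, two regimens built from different drugs can be compared on one scale, which is what makes a uniform definition attractive for cross-study comparison.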

Discussion
Our search criteria identified 16 studies that addressed the issue of neuroHAART efficacy. Qualitative analysis of these studies for the presence of fewer than three major methodological limitations (which incidentally corresponded to selecting studies within the 5% upper bound of the Student t distribution) showed that 37.5% (6 of the 16) met the criterion for "higher quality". The main methodological limitations, which are relatively common to the field of clinical research in NeuroAIDS, were clinical heterogeneity of the sample studied or compared, unclear inclusion/exclusion criteria, insensitivity/brevity of the neuropsychological battery in 45% of studies, and lack of correction for practice effect (this was found in 50% of longitudinal studies). Other less common limitations were lack of normative correction for baseline demographic factors, and a definition of neuroHAART that was not described fully enough to allow direct replication.
Among the studies that scored above the quality scoring cut-off, it should be noted that none achieved a 100% score. This reflects the practical and scientific constraints of clinical research studies. One study did not provide clear enough inclusion/exclusion criteria; one did not provide a clear neuroHAART definition for direct replication; one used a medium-size neuropsychological battery and one a very brief neuropsychological battery; one included comparison groups with a lack of clinical homogeneity, and one did not include correction for practice effect. It should be noted here that the blind review allowed applying the same strictness to the studies' evaluation for all studies.
Moreover, our quantitative analysis on the remaining six studies which met an adequate quality scoring showed that two met the criterion for 80% a priori power. The advantage of our staged strategy was that only studies with a higher quality, hence greater comparability could be assessed in the quantitative phase.
Altogether, this review clearly demonstrates that methodological limitations and, in most studies, lack of power render the literature difficult to interpret without both the qualitative and quantitative approaches outlined in this paper. In favour of a positive neuroHAART effect are the six studies with higher quality scoring; however, most are underpowered and none were randomised. Nonetheless, all six used multivariate analyses to control for factors that may have been affected by non-randomisation and still yielded results in favour of a positive neuroHAART effect on NC functions or CSF HIV RNA suppression.
While these results are not definitive, they may assist the clinician in decision making, as no negative effect of neuroHAART was predominant. It seems reasonable, therefore, to consider neuroHAART whenever possible in a patient with HAND, notwithstanding issues of adherence and resistance to particular ARVs. However, the most definitive answer to the issue of the potential superior efficacy of neuroHAART will come from randomised controlled clinical trials conducted in different regions of the world to provide cumulative evidence.
Limitations of the current qualitative and quantitative review are that it was based only on published studies and did not account for any publication bias towards only reporting positive effects. However, because the existence of any neuroHAART effect is an area of great uncertainty in NeuroHIV and HIV/AIDS research, we contend that a negative finding is of substantial interest and as likely to lead to a publication as a positive finding. Finally, it could be argued that our qualitative phase should have selected a stricter level of quality. However, as discussed above, a stricter level of quality would have excluded almost all studies from the quantitative analysis. Moreover, despite methodological limitations, observational studies are an important phase of clinical research, providing preliminary evidence regarding treatment efficacy.
From our analysis of existing studies there are several key aspects that should be considered in the design of a potential future trial. 1. A priori power analyses should be conducted for the ability to detect at least a medium effect size. This implies that a future trial should include at least 100 subjects in each arm (see also Figure 1; multivariate graph). This also takes into account the full HAND spectrum including at least mild neurocognitive disorder (MND) [2] and that adequate power is needed to detect a relatively small effect (the treatment effect) in individuals who are not demented, but do have neurocognitive difficulties.
2. Adequate outcome measures should be selected. Even if a brief assessment is selected, it should be targeted to include tests of psychomotor speed, working memory, and learning and memory in order to capture a wider range of potential benefit from neuroHAART [20]. Related to this issue, and because of the high complexity of the neuropsychological data in this type of study, the inclusion of a senior neuropsychologist in the research team is essential. Moreover, the use of neuropsychology or psychology graduates to ensure high-quality, standardised administration of neuropsychological instruments, including computerised batteries, is highly recommended. Because the NC measure represents the main outcome of those studies, special care in data collection and data management should be a basic requirement. This will reduce the measurement error and systematic biases that are likely with poorly trained staff or staff who do not have the basic clinical skills to examine patients with cognitive impairment.
3. Adequate sampling: the study should carefully consider what types of patients are to be included, with clear inclusion and exclusion criteria. Principally, the current HAART status and HAART duration should be considered (see [21] for extensive consideration of this issue). Lastly, if the study includes a test of HIV RNA change, a baseline level of detection may be set as an inclusion criterion to improve the homogeneity in each arm on this aspect. Other aspects that may be considered to improve arm homogeneity are HIV duration, nadir CD4, and previous HAART history.
4. Adequate study time points: the study should be longitudinal and select an early follow-up to detect HIV RNA changes in the CSF and plasma as well as NC change (between 4 and 7 weeks after treatment initiation) and a longer follow-up, ideally at about 48 weeks, to detect long-term NC change. The risk of a shorter-term trial is that it produces negative or neutral findings when a positive effect is actually at play [21].
5. Adequate analytical strategies: in a randomised trial, which by design minimises systematic biases between treatment arms on the outcome measures, extra consideration is needed in the use of neuropsychological normative data to determine the baseline impairment rate. In some instances adaptive randomisation may offer a flexible solution (see [22] for further discussion of this issue).
6. We would recommend the use of the CPE score in preference to older definitions, pending improved definitions. This would allow preliminary analyses to be planned with the version of the CPE that is current at the time of enrolment in future clinical trials. Uniform use of the CPE score would make direct comparisons of the regimens used in different studies easier and enable a larger evidence base to be compiled for defining the most effective neuroHAART regimens. In future definitions, the role of a potentially impaired blood-brain barrier (BBB) should also be considered [23]. Also, the adverse effect of some ARVs on the cardiovascular system would need to be taken into account in a new version of the CPE score, because such effects have been associated with NC impairment in the HIV population [24].
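For the a priori power analysis recommended in point 1, the sketch below estimates the participants needed per arm for a simple two-arm, two-tailed comparison. It relies on a normal approximation and a univariate outcome, so it yields smaller numbers than the 100 per arm derived above from the multivariate panel of Figure 1; the function name is hypothetical and the defaults (80% power, significance level 0.05) mirror the criterion used in this review.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate participants per arm to detect Cohen's d.

    Assumptions: two equal arms, two-tailed test, one univariate outcome,
    normal approximation (exact t-based figures are slightly larger).
    """
    if d <= 0:
        raise ValueError("effect size d must be positive")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


for effect_size in (0.20, 0.50, 0.80):
    print(f"d = {effect_size:.2f}: about {n_per_arm(effect_size)} per arm")
```

The required sample grows with the inverse square of the effect size, so planning for the relatively small treatment effects expected in non-demented participants, and allowing for covariates and attrition, pushes enrolment well above the univariate minimum.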

Conclusions
Studies assessed as using more rigorous methods found that neuroHAART was effective in improving neurocognitive function and decreasing CSF viral load, but only two of those studies were adequately statistically powered. Because all of these studies were observational, they represent a less compelling evidence base than randomised controlled trials for assessing treatment effect. Therefore, large randomised trials are needed to determine the robustness of any neuroHAART effect. However, such trials must be longitudinal, include the full spectrum of HAND, ideally carefully control for co-morbidities, and be based on optimal neuropsychology methods.