Close this search box.

Multivariate canonical correlation analysis identifies additional genetic variants for chronic kidney disease – npj Systems Biology and Applications

NURTuRE-CKD and Salford Kidney Study participants

The NURTuRE-CKD cohort has linked genotype-phenotype data, and all participants provided written informed consent57. The study was approved by the South Central—Berkshire Research Ethics Committee, abides by the principles of the Declaration of Helsinki and is registered at (NCT04084145)57. The Salford Kidney Study (SKS) dataset, which has been described previously, received ethical approval from the North West Greater Manchester South Research Ethics Committee (REC15/NW/0818) and written informed consent was obtained from all patients58. For the genome-wide genotyping, blood samples were collected, stored as whole blood or centrifuged to separate plasma or serum, aliquoted and frozen at −80 °C. Deoxyribonucleic acid was extracted from the frozen whole blood. Of the NURTuRE cohort, 2903 CKD and 99 control participants were genotyped57. Of the Salford Kidney Study (SKS) cohort, 2409 participants were genotyped58. NURTuRE-CKD and SKS cohort participants had non-dialysis dependent CKD since they were recruited if their eGFR was below 60 ml/min/1.73 m2 (all SKS patients) or if urine albumin to creatinine ratio was >30 mg/mmol in those with eGFR above 60 ml/min/1.73 m2. End-stage kidney disease and renal replacement therapy were exclusion criteria.

Single nucleotide polymorphism genotype data processing

For each of the NURTuRE (including 99 controls) and SKS datasets, SNPs were genotyped using the Illumina Global Screening array v2.0 with additional multi-disease content and 2k custom sequences. After excluding duplicated probes using BCFTools version 1.9 (using htslib 1.9), totals of 671,485 and 672,412 variants remained, respectively59. The following steps were carried out using Plink version 260. Variants were excluded if the minor allele frequency (MAF) < 0.01 (162,483 and 170,554 variants excluded, respectively), missing SNP genotype call rate ≥1.5% (19,149 and 23,889 SNPs excluded, respectively), Hardy-Weinberg assumptions violated with P-value < 0.001% (116 and 71 SNPs excluded, respectively) and if they were located on mitochondrial or sex chromosomes (21,133 and 19,426 SNPs excluded, respectively). From the SKS dataset only, a total of 135 overlapping samples with NURTURE-CKD were excluded. Further samples were excluded if they were known first or second-degree relatives of another participant (13 and 0 excluded), showed gender mismatches (11 and 33 excluded), showed >10% low call rate SNPs (12 and 32 samples excluded, respectively), or showed any cryptic relations using KING cut-off of 0.177 (four and 51 samples excluded, respectively). To avoid potential confounding of results due to different genetic ancestries in the dataset, and to match with the genetic ancestries of the CKDGen dataset, non-European ancestry samples were excluded by using principal component analysis (PCA). This was computed using the The 1000 Genomes Project (1000GP), human genome build 38 (hg38) reference dataset and the plinkQC package in R, by adapting the published R script called “Processing 1000 Genomes reference data for ancestry estimation” on the plinkQC website ( for hg38 use61,62. After combining the sample and reference datasets and running PCA, the plinkQC “evaluate_check_ancestry” function was used to select individuals of European descent61. This function uses principal components 1 and 2 to find the center of the known European ancestry reference samples. It then labels study samples as non-European if their Euclidean distance from the center falls outside the radius specified by the maximum Euclidean distance of the reference samples multiplied by a scaling factor. A scaling factor of 2 was chosen as it was the minimum value that produced a radius that included all the 1000GP European ancestry reference samples (Supplementary Figure 2). We identified a small number of 10 NURTuRE-CKD samples that self-declared as White British but were outside of the European ancestry reference radius using PCA (Supplementary Fig. 2). To avoid any potential confounding, these 10 were excluded as non-European based on the reference population. Overall, using this strategy, 350 NURTuRE-CKD, 19 NURTuRE-controls and 108 SKS non-European ancestry samples were excluded (Supplementary Fig. 3). Our adapted script is available here: No batch effects were seen using PCA. Since CCA cannot handle any missing data, SNPs with any missing data were excluded. Remaining for analysis were 2505 of 2903 NURTuRE-CKD, 80 of 99 NURTuRE-controls and 2078 of 2409 SKS samples. For the NURTuRE-CKD (plus 80 NURTuRE-controls) dataset, 468,604 variants remained before SNP imputation. For the SKS dataset plus NURTuRE-controls dataset, 458,472 variants remained before SNP imputation. All calculations were performed in a 64-bit Linux conda environment.

Kidney function data

For each participant, the first eGFR and serum urea measurements taken on the same date, either on exactly or on the closest after the cohort recruitment date were used. Of the 2505 European ancestry NURTuRE-CKD, 2078 SKS and 80 NURTURE-control samples, totals of 2475, 1898, and 38, respectively, had available eGFR and serum urea data on the same date. eGFR was calculated using the CKD-Epidemiology Collaboration (CKD-EPI) equation (without ethnicity adjustment). For NURTuRE-CKD, the ranges of the eGFR and serum urea values were 3.3–139 ml/min/1.73 m2 and 1.7–63.7 mmol/L, respectively. For NURTuRE-controls, the ranges of the eGFR and urea values were 53–95 ml/min/1.73 m2 and 3.5–8 mmol/L, respectively. For SKS, the ranges of the eGFR and urea values were 6–88 ml/min/1.73 m2 and 3–47.7 mmol/L, respectively. Urea was converted to BUN by dividing by a factor63. BUN measurements were standardized using the common log transformation and Z-score, and eGFR measurements were standardized using the rank-based inverse normal transformation, as described previously. In the CCA input data matrix, the CKD cases (n = 2475 and n = 1898, respectively) and controls (a different set of n = 19 for each dataset) were each encoded as binary indicator variables, 1 and 0, respectively. Kidney function variables were not adjusted for CKD status since they were correlated with CKD status and controlling for CKD status may have introduced collider bias.

Single nucleotide polymorphism imputation

For the NURTuRE and SKS datasets, ungenotyped SNPs were imputed using Beagle version 5.4 with The 1000 Genomes Project hg38 genotype dataset as the reference dataset64,65. The 1000 Genomes Project hg38 Variant Call Format files were downloaded from the 1000 Genomes Project website ( Imputation is based on the fact that physically close markers are likely inherited together in a cluster and therefore result in the non-random association of alleles (linkage disequilibrium). Imputation in Beagle is performed by using a Hidden Markov Model to identify the most likely path through the haplotype cluster based on the non-missing genotypes present66. The Beagle R2 accuracy score approximates the squared correlation between the best estimate genotype (i.e. the allele dosage with the highest posterior probability) and the true genotype66. This is estimated from the posterior genotype probabilities when the true genotypes are not observed66. Unix scripting was used to run Beagle and Plink commands. After imputation, there were 6,419,966 NURTuRE-CKD and 6,290,407 SKS variants for analysis (including the NURTuRE-controls in each dataset).

Genome-wide association study summary statistics on kidney function

European ancestry CKDGen eGFR and BUN GWAS summary statistics were downloaded from the CKDGen Consortium website on 28/05/202017. For eGFR and BUN, there were 567,460 samples including 41,395 CKD cases (7%), as described on their website (

Published BioBank Japan GWAS summary statistics (6,108,953 SNPs) for eGFR (bbj-a-60) and BUN (bbj-a-11) for 143,658 and 139,818 individuals, respectively, were downloaded from the Medical Research Council Integrative Epidemiology Unit OpenGWAS project website ( on 20/04/202116,67. The East Asian ancestry dataset contained approximately 8586 CKD cases (5%). The eGFR range was 17.2–132.9 ml/min/1.73 m216,68.

For each of the three GWAS summary statistics datasets, only SNPs in chromosomes 1–22 were selected since these were the only ones available in the CKDGen dataset. SNPs with both eGFR and BUN statistics available were selected based on matching chromosome, position, reference and effect alleles. For the CKDGen and BioBank Japan datasets, any SNPs with AF < 0.01 were excluded from the metaCCA analyses. For each dataset, 8,346,783 and 5,837,593 SNPs remained for analysis, respectively.

Statistical analysis using canonical correlation analysis

CCA was used to compute canonical correlations between each SNP with both eGFR and BUN by using the “cancor” function in R v3.6.0, based on CCA scripts published by Seoane et al, Ferreira et al, and Tang et al.23,25,26. To analyze the NURTuRE and SKS datasets, we used our scripts available from Github ( In the CCA matrix, the same 26 controls were used for each of the NURTuRE and SKS dataset analyses. The controls were encoded as ‘0’ and the cases as ‘1’, as an additional binary indicator variable.

In CCA, linear combinations of two sets of variables, or views of the same object, X and Y, with the highest correlations are found. This corresponds to finding vectors a G and b P that maximize:

$${r}_{1}=frac{{left(X{{boldsymbol{a}}}_{1}right)}^{T}(Y{{boldsymbol{b}}}_{1})}{{||X}{{boldsymbol{a}}}_{1}{||; ||Y}{{boldsymbol{b}}}_{1}{||}}=frac{{{boldsymbol{a}}}_{1}^{T}{sum }_{{XY}}{{boldsymbol{b}}}_{1}}{sqrt{{{boldsymbol{a}}}_{1}^{T}{sum }_{{XX}}{{boldsymbol{a}}}_{1}}sqrt{{{boldsymbol{b}}}_{1}^{T}{sum }_{{YY}}{{boldsymbol{b}}}_{1}}}$$


This value r1 is called (the first) canonical correlation between X and Y, and the corresponding vectors a1 and b1 are called (the first) canonical weights. To identify the variables that provide a large contribution to the observed canonical correlation, the magnitudes of the canonical weights are used. These are found by firstly computing the matrix K from the covariance matrices:

$$K={sum }_{{XX}}^{-1/2}{sum }_{{XY}}{sum }_{{YY}}^{-1/2}.$$


Then, a Singular Value Decomposition (SVD) is applied, to decompose the matrix into eigenvectors and eigenvalues. The canonical correlation values can then be computed directly from the matrix K, and the canonical weights are determined based on the eigenvectors. To test the significance of each canonical correlation r, which equals the maximum correlation between the variant and the phenotypes, the test statistic was computed based on Wilks’ Lambda, as described previously23. For both CCA and metaCCA, for each test, the N parameter was set to the total number of samples analysed. Results were visualized using Manhattan plots. For univariate-SNP analyses, to account for multiple testing, the P-values were adjusted for multiple comparisons using a Bonferroni correction based on the number of SNPs analyzed. A cut-off of adjusted P-value < 0.05 was used to determine significance. For NURTuRE-CKD and SKS, our approximate power analysis based on univariate analysis suggested it was possible to identify a significant SNP with an effect size (canonical correlation r) of 0.12 or 0.14, respectively, with 80% power (Table 4). The sample size required to identify CCA correlation r of 0.1 with 90% power with two variables has been reported as approximately 1000 samples in Helmer et al.69

Table 4 Power analysis for two individual level genotype datasets

Statistical analysis using metaCCA

For each of the two GWAS summary statistics datasets, the metaCcaGp function provided by the metaCCA package in R v3.6.0 was used to compute the canonical correlation between each SNP and both eGFR and BUN28,70. Our scripts for analyzing the two GWAS summary statistics datasets using metaCCA are available from Github ( Before running metaCCA, SNPs were removed if they had a standard error of 0 for the eGFR or BUN beta coefficients, had reference and alternative alleles not containing ‘A’, ‘C’, ‘G’, or ‘T’, or if they were SNPs that were duplicates. A mathematical explanation of metaCCA is provided by Cichonska et al.28 After running metaCCA, any SNPs that showed a significant metaCCA P-value were assessed for kidney function relevance based on the published eGFR and BUN GWAS summary statistics. SNPs were excluded if the eGFR and BUN effect sizes (with respect to the same effect allele) were in the same direction, because they were unlikely to be relevant for kidney function as previously described17.

Single nucleotide polymorphism annotation

SNPs were annotated and prioritized by using the Functional Mapping and Annotation of Genome-Wide Association Studies (FUMA) program32. FUMA “SNP2GENE” was used to find lead (most signficant, independent) SNPs for each locus genomic region, using default r2 thresholds of 0.6 to define independent significant SNPs and 0.1 to define lead SNPs32. The HUGO Gene Nomenclature Committee (HGNC) online multi-symbol checker was used to verify gene symbols71. g:Profiler g:GOSt and stringDB were used for functional enrichment analyses72,73. The effects of any missense SNPs were predicted by using four in silico variant prediction tools including FATHMM-XF, Combined Annotation Dependent Depletion (CADD), Sorting Intolerant From Tolerant (SIFT) and PolyPhen-2 (run by using the ensembl Variant Effect Predictor online tool)74,75,76,77. To test for significant overlap between dataset results, the hypergeometric test provided by the package “hypeR” (v1.5.4) via Bioconductor in R (v4.0.2) was used78.

Colocalisation of CCA and eQTL signals

We applied Bayesian colocalisation analyses by using the R package ‘coloc’ (,34. We applied the COLOC function, which uses Approximate Bayes Factor computations, to lead SNPs identified in CKDGen and BioBank Japan by metaCCA, and in NURTURE-CKD by CCA, and to overlapping lead SNPs identified across 2 or 3 datasets. We used default priors that a random variant in the region is associated with either kidney function metaCCA/CCA or eQTL individually (prior probabilities = 1 × 10−4), and set the prior probability that the random variant is causal to both kidney function metaCCA/CCA and eQTL (prior probability = 1 × 10−6). As recommended by the authors of the method, we defined the variants as colocalised when the posterior probability of a colocalised signal (PP4) was >0.8. We used all eQTL data available in the EBI eQTL Catalogue database which included Gtex v8 ( These were downloaded for analysis by adapting an R script available from the EBI public eQTL Catalogue resources ( We followed the coloc example in this script thus included the lead SNP plus surrounding SNPs within 200 kB as input. The single-SNP matrixEQTL results for NephQTL glomerular and tubular cells were downloaded from the NephQTL2 browser (

Allele frequency analyses

For any shortlisted SNPs that showed a difference of at least 10% in allele frequency between the gnomAD general population and each of the same CKD groups in NURTURE-CKD and SKS, a Chi-square test with Yates’ correction was used to test for significant association of the SNP with CKD cases40. This was analyzed using the online GraphPad QuickCalcs tool (

Differential gene expression analyses

Using the Gene Expression Omnibus GEO2R application, the differential gene expression between CKD cases and healthy controls was computed for the “GSE66494: development of gene expression profiles in human chronic kidney disease” and “GSE37171: expression data from uremic patients and 20 healthy controls (normals)” datasets37,38. Uremia is the build-up of toxins in blood which occurs when the kidneys stop working and is a sign of CKD. The R scripts generated to reproduce the results are available from Github: Genes were considered to show significant differential expression between the CKD cases and controls where the log2(fold change)≥1 or log2(fold change)≤−1 and adjusted P-value based on default Benjamini & Hochberg false discovery rate method < 0.0539.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.