GCA: An R package for genetic connectedness analysis using pedigree and genomic data

Introduction

Genetic connectedness quantifies the extent of linkage between individuals across management units. The concept of genetic connectedness can be extended to measure the connectedness level between training and testing sets in whole-genome prediction. We developed the genetic connectedness analysis R package, GCA, which utilizes pedigree or genomic data to measure the connectedness between individuals across units.

Connectedness Statistics

The connectedness statistics implemented in this R package fall into two core groups: those derived from prediction error variance (PEV) and those derived from the variance of unit effect estimates (VE). The PEV-derived metrics are the prediction error variance of differences (PEVD), the coefficient of determination (CD), and the prediction error correlation (r). Each can be summarized at the unit level as the average PEV within and across units (GrpAve), as the average PEV of all pairwise differences between individuals across units (IdAve), or through a contrast vector (Contrast). The VE-derived metrics are the variance of differences in management unit effects (VED), the coefficient of determination of VED (CDVED), and the connectedness rating (CR). Each VE-derived metric accepts one of three correction factors that account for the number of fixed effects: no correction (0), correction for one fixed effect (1), and correction for two or more fixed effects (2). The R code is integrated with C++ through the Rcpp package [1] to improve computational speed. We expect this R package to provide a comprehensive and effective tool for genetic connectedness analysis and whole-genome prediction. Figure 1 summarizes all connectedness statistics implemented in the package; details of each statistic can be found in the paper [2].
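To make the PEV-derived quantities concrete, the following base-R sketch computes PEV, PEVD, CD, and r for one pair of individuals in a toy data set. It assumes the standard mixed-model definitions (PEV as the residual variance times the random-effect block of the inverse coefficient matrix, with one record per individual). The toy relationship matrix K, the unit labels, and all object names are illustrative and not part of the package; the exact forms used by GCA are those given in [2].

```r
set.seed(1)

# Toy setup (all values hypothetical): 6 individuals in 2 management units
n       <- 6
unit    <- factor(rep(1:2, each = 3))
sigma2a <- 0.6
sigma2e <- 0.4
lambda  <- sigma2e / sigma2a

# Toy positive-definite matrix standing in for a pedigree or genomic K
W <- matrix(rnorm(n * 50), n, 50)
K <- tcrossprod(W) / 50 + diag(0.05, n)

# Fixed-effect incidence matrix (unit effects only) and absorption matrix M
X <- model.matrix(~ -1 + unit)
M <- diag(n) - X %*% solve(crossprod(X)) %*% t(X)

# PEV matrix: sigma2e * (Z'MZ + lambda * K^-1)^-1 with Z = I
PEV <- sigma2e * solve(M + lambda * solve(K))

# Statistics for individual i (unit 1) and individual j (unit 2)
i <- 1; j <- 4
x <- numeric(n); x[i] <- 1; x[j] <- -1          # contrast u_i - u_j

pevd <- PEV[i, i] + PEV[j, j] - 2 * PEV[i, j]   # PEV of the difference
cd   <- 1 - pevd / (sigma2a * drop(t(x) %*% K %*% x))
r    <- PEV[i, j] / sqrt(PEV[i, i] * PEV[j, j])

round(c(PEVD = pevd, CD = cd, r = r), 4)
```

The unit-level summaries (GrpAve, IdAve, Contrast) differ only in how such individual-level quantities are aggregated over the members of each unit.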

Figure 1: An overview of connectedness statistics implemented in the GCA R package.


Application of the GCA package

Data preparation

The simulated dataset GCcattle contains two objects, cattle.pheno and cattle.W, which hold the phenotypic and marker information, respectively. Details can be obtained by typing ?GCcattle. The output below gives the dimensions of cattle.pheno and cattle.W, in that order.

## [1] 2500    6
## [1]  2500 10000

The heritability of the simulated phenotype was set to 0.6, with \(\sigma^2_a\) = 0.6 and \(\sigma^2_e\) = 0.4.

Below we construct the incidence matrix of fixed effects and a genomic relationship matrix.

## Genomic relationship matrix has been computed. Number of SNPs removed: 99
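The two inputs can be sketched in base R as follows. This is an illustration on simulated toy genotypes, not the package's own routine: the column names unit and sex, the 0.05 minor-allele-frequency threshold, and the VanRaden-type scaling are all assumptions made for the example.

```r
set.seed(42)

# Toy data (hypothetical): 40 individuals, 300 SNPs coded 0/1/2
n <- 40; m <- 300
p_true <- runif(m, 0.01, 0.5)
W <- sapply(p_true, function(p) rbinom(n, 2, p))   # n x m marker matrix
pheno <- data.frame(
  unit = factor(rep(1:4, each = 10)),
  sex  = factor(rep(c("F", "M"), times = 20))
)

# Fixed-effect incidence matrix without intercept, unit columns first
X <- model.matrix(~ -1 + unit + sex, data = pheno)

# Drop SNPs with low minor allele frequency (threshold is an assumption)
p    <- colMeans(W) / 2
keep <- pmin(p, 1 - p) >= 0.05
cat("Number of SNPs removed:", sum(!keep), "\n")

# Center by twice the allele frequency and scale by 2 * sum(p * (1 - p))
Wk <- W[, keep, drop = FALSE]
pk <- p[keep]
Zc <- sweep(Wk, 2, 2 * pk)
G  <- tcrossprod(Zc) / (2 * sum(pk * (1 - pk)))

dim(X)
dim(G)
```

Placing the unit columns first matters because the Xmatrix argument described later expects unit effects ahead of any other fixed effects.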

Available connectedness statistics in the GCA package

The following section lists all connectedness statistics available in the GCA package. Each is selected through the statistic argument.

  • PEVD_IdAve: Pairwise individual average PEVD. The optional argument ‘scale’ is available.
  • PEVD_GrpAve: Group average PEVD. The optional arguments ‘scale’ and ‘diag’ are available.
  • PEVD_contrast: Contrast PEVD. The optional argument ‘scale’ is available.
  • CD_IdAve: Pairwise individual CD.
  • CD_GrpAve: Group average CD. The optional argument ‘diag’ is available.
  • CD_contrast: Contrast CD.
  • r_IdAve: Pairwise individual r.
  • r_GrpAve: Group average r. The optional argument ‘diag’ is available.
  • r_contrast: Contrast r.
  • VED0: Variance of the estimated differences in unit effects. The optional argument ‘scale’ is available.
  • VED1: Variance of the estimated differences in unit effects, with correction for the unit effect. The optional argument ‘scale’ is available.
  • VED2: Variance of the estimated differences in unit effects, with correction for the unit effect and additional fixed effects. The argument ‘Uidx’ is required and gives the last column number of the unit effects in the fixed-effects incidence matrix. The optional argument ‘scale’ is available.
  • CDVED0: Coefficient of determination of VED. The optional argument ‘diag’ is available.
  • CDVED1: Coefficient of determination of VED, with correction for the unit effect. The optional argument ‘diag’ is available.
  • CDVED2: Coefficient of determination of VED, with correction for the unit effect and additional fixed effects. The argument ‘Uidx’ is required and gives the last column number of the unit effects in the fixed-effects incidence matrix. The optional argument ‘diag’ is available.
  • CR0: Connectedness rating.
  • CR1: Connectedness rating, with correction for the unit effect.
  • CR2: Connectedness rating, with correction for the unit effect and additional fixed effects. The argument ‘Uidx’ is required and gives the last column number of the unit effects in the fixed-effects incidence matrix.

Examples of connectedness measures across units

Below we list some examples of connectedness statistics available in the GCA package. The gca function is the main engine in the GCA package. The usage and detailed arguments of the gca function are explained below.

  • Kmatrix: Genetic relationship matrix constructed from either pedigree or genomic data.
  • Xmatrix: Fixed effects incidence matrix excluding intercept. The first column of the Xmatrix should start with unit effects followed by other fixed effects if applicable.
  • sigma2a and sigma2e: Estimates of additive genetic and residual variances, respectively.
  • MUScenario: A vector of management units, treated as a fixed factor.
  • statistic: Choice of connectedness statistic. Available options include
    1. PEV-derived functions: PEVD_IdAve, PEVD_GrpAve, PEVD_contrast, CD_IdAve, CD_GrpAve, CD_contrast, r_IdAve, r_GrpAve, and r_contrast.
    2. VE-derived functions: VED0, VED1, VED2, CDVED0, CDVED1, CDVED2, CR0, CR1, and CR2.
  • NumofMU: Return either pairwise unit connectedness (Pairwise) or overall connectedness across all units (Overall).
  • Uidx: An integer indicating the last column number of units in the Xmatrix. This Uidx is required for VED2, CDVED2, and CR2 statistics. The default is NULL.
  • scale (logical): Should sigma2a be used to scale the statistic (i.e., PEVD_IdAve, PEVD_GrpAve, PEVD_contrast, VED0, VED1, and VED2) so that connectedness is independent of the measurement unit? The default is TRUE.
  • diag (logical): Should the diagonal elements of the PEV matrix (i.e., PEVD_GrpAve, CD_GrpAve, and r_GrpAve) or the K matrix (CDVED0, CDVED1, and CDVED2) be included? The default is TRUE.
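Putting the arguments together, a call might look like the sketch below. Only the argument names and option strings come from the list above; the toy inputs are made up, and the call is guarded so that the snippet still runs when GCA is not installed. The structure of the returned object is not assumed here.

```r
set.seed(7)

# Hypothetical toy inputs: 30 individuals, 3 units, 2 sexes
n    <- 30
unit <- factor(rep(1:3, each = 10))
sex  <- factor(rep(c("F", "M"), length.out = n))
X    <- model.matrix(~ -1 + unit + sex)       # unit columns first, no intercept
W    <- matrix(rnorm(n * 200), n, 200)
K    <- tcrossprod(W) / 200 + diag(0.05, n)   # stand-in relationship matrix

# Arguments named as in the list above
gca_args <- list(
  Kmatrix    = K,
  Xmatrix    = X,
  sigma2a    = 0.6,
  sigma2e    = 0.4,
  MUScenario = unit,
  statistic  = "PEVD_GrpAve",
  NumofMU    = "Pairwise",
  scale      = TRUE,
  diag       = TRUE
)

# Run only if the package is available
if (requireNamespace("GCA", quietly = TRUE)) {
  res <- do.call(GCA::gca, gca_args)
  print(res)
}
```

Switching to another statistic should only require changing the statistic entry, plus Uidx for VED2, CDVED2, and CR2; in this toy layout the last unit column is column 3.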

PEV-derived statistic: PEVD_GrpAve (pairwise vs. overall)

The following example illustrates the group average PEVD between all pairs of units. Based on the results, units 1 and 8 are the most connected (PEVD_GrpAve = 0.0156) and units 4 and 6 are the least connected (PEVD_GrpAve = 0.0571). The PEVD statistic ranges from 0 to 1, with smaller values indicating greater connectedness.

Alternatively, the gca function can return an overall connectedness, which averages all pairwise PEVD_GrpAve between units by setting NumofMU to 'Overall'.

## [1] 0.03752627
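The relation between the pairwise and overall summaries can be sketched directly from a PEV matrix. The code below assumes that the group average PEVD of two units is the sum of their within-unit mean PEVs minus twice their across-unit mean PEV, with diagonal elements included and the result divided by sigma2a. The toy data and object names are hypothetical, and the package's exact definition is given in [2].

```r
set.seed(2)

# Toy data (hypothetical): 12 individuals in 3 units of unequal size
n       <- 12
unit    <- factor(rep(1:3, times = c(5, 4, 3)))
sigma2a <- 0.6; sigma2e <- 0.4
W <- matrix(rnorm(n * 100), n, 100)
K <- tcrossprod(W) / 100 + diag(0.05, n)

# PEV matrix under a model with unit as the only fixed effect
X   <- model.matrix(~ -1 + unit)
M   <- diag(n) - X %*% solve(crossprod(X)) %*% t(X)
PEV <- sigma2e * solve(M + (sigma2e / sigma2a) * solve(K))

# Mean PEV within units (diagonal) and across units (off-diagonal)
nu      <- colSums(X)
PEVmean <- (t(X) %*% PEV %*% X) / (nu %o% nu)

# Pairwise group average PEVD, scaled by sigma2a
d        <- diag(PEVmean)
pairwise <- (outer(d, d, "+") - 2 * PEVmean) / sigma2a

# Overall connectedness: mean over all distinct unit pairs
overall <- mean(pairwise[upper.tri(pairwise)])

round(pairwise, 4)
overall
```

With diagonal elements included, each pairwise value equals the scaled PEV of the contrast that weights the members of one unit by 1/n_a and the members of the other by -1/n_b.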

Following the above example, the pairwise individual average PEVD and contrast PEVD can be easily calculated by changing the argument statistic to 'PEVD_IdAve' and 'PEVD_contrast', respectively.

PEV-derived statistic: CD_GrpAve

The group average CD statistic between units is shown in the following example. The most connected pair was units 1 and 7 (CD_GrpAve = 0.7096), and the least connected pair was units 1 and 6 (CD_GrpAve = 0.6494). A larger CD statistic indicates greater connectedness. These results differ from the findings based on PEVD_GrpAve because CD penalizes connectedness measures for reduced genetic variability [3, 4]. Similarly, the pairwise individual average CD and contrast CD can be calculated by changing the argument statistic to 'CD_IdAve' and 'CD_contrast', respectively.
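A small numeric illustration shows why the PEVD and CD rankings can disagree. It assumes CD takes the usual form of one minus the ratio of the PEV of a contrast to the genetic variance of that contrast; the helper name and the numbers are made up for the example.

```r
# CD for a contrast: 1 - PEVD / genetic variance of the contrast
# (both arguments on the same scale; hypothetical helper)
cd_contrast <- function(pevd, genetic_var) 1 - pevd / genetic_var

# Two unit pairs with identical PEVD but different genetic variability
pevd        <- 0.02
genetic_var <- c(pair_A = 0.10, pair_B = 0.05)

cd_contrast(pevd, genetic_var)
# pair_A pair_B
#    0.8    0.6
```

Both pairs are equally connected by PEVD, yet pair B receives a lower CD because the contrast between its units carries less genetic variance.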

VE-derived statistic: VED

The following example illustrates the three VED statistics (VED0, VED1, and VED2) when sex and unit effects are both included in the model. Here, a smaller value indicates greater connectedness. Across all three VED statistics, the most connected pair is units 1 and 8 (VED0 = 0.0205, VED1 = 0.0156, VED2 = 0.0156), whereas units 4 and 6 (VED0 = 0.0728, VED1 = 0.0571, VED2 = 0.0571) show the least connectedness.
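The link between VED and the PEV-derived statistics can be checked numerically. The sketch below assumes that VED0 is the variance of the difference between two generalized least squares unit-effect estimates and that the one-fixed-effect correction subtracts sigma2e (X'X)^-1. Under those assumptions, and with unit as the only fixed effect, the corrected value coincides with the group average PEVD, which mirrors the matching values (0.0156 and 0.0571) reported for PEVD_GrpAve and VED1 in this document. Toy data and object names are hypothetical; see [2] for the package's definitions.

```r
set.seed(3)

# Toy data (hypothetical): 12 individuals in 3 units, unit is the only fixed effect
n       <- 12
unit    <- factor(rep(1:3, times = c(5, 4, 3)))
sigma2a <- 0.6; sigma2e <- 0.4
W <- matrix(rnorm(n * 100), n, 100)
K <- tcrossprod(W) / 100 + diag(0.05, n)
X <- model.matrix(~ -1 + unit)

# Variance of the GLS unit-effect estimates: (X' V^-1 X)^-1
V  <- sigma2a * K + sigma2e * diag(n)
VE <- solve(t(X) %*% solve(V) %*% X)

# Assumed correction for one fixed effect: remove sigma2e * (X'X)^-1
VE1 <- VE - sigma2e * solve(crossprod(X))

# Variance of pairwise differences between units, scaled by sigma2a
ved  <- function(S) (outer(diag(S), diag(S), "+") - 2 * S) / sigma2a
VED0 <- ved(VE)
VED1 <- ved(VE1)

# Group average PEVD from the PEV matrix, for comparison
M        <- diag(n) - X %*% solve(crossprod(X)) %*% t(X)
PEV      <- sigma2e * solve(M + (sigma2e / sigma2a) * solve(K))
nu       <- colSums(X)
PEVD_Grp <- ved((t(X) %*% PEV %*% X) / (nu %o% nu))

max(abs(VED1 - PEVD_Grp))   # numerically zero
```

In this sketch VED0 exceeds VED1 for every pair by sigma2e (1/n_a + 1/n_b) / sigma2a, so the gap between the uncorrected and corrected values grows as the units get smaller.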

VE-derived statistic: CDVED

An example of the CDVED statistics (CDVED0, CDVED1, and CDVED2) is shown below. Units 1 and 7 are the most connected (CDVED0 = 0.6284, CDVED1 = 0.7098, CDVED2 = 0.7096), whereas units 1 and 6 (CDVED0 = 0.5501, CDVED1 = 0.6494, CDVED2 = 0.6494) are the least connected.