Release 7.0 of the genomic pseudomolecule assembly of japonica cv. Nipponbare was downloaded from Michigan State University and used as the reference genome. Reads of all varieties were aligned to the pseudomolecules with BWA v0.7.12-r1039, and SNPs/INDELs were identified with GATK v3.3-0-g37228af. Reads were first mapped to the reference with BWA mem, and a per-sample GVCF was then generated with HaplotypeCaller (parameters: -T HaplotypeCaller --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000; only reads with mapping quality ≥20 were used). After the GVCF files were created, CombineGVCFs was used to generate the VCF file. The variations identified by GATK were further filtered: the allele count in the VCF file had to be >10 and the depth had to be ≥50.
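The final filtering step can be illustrated with a minimal Python sketch. It assumes that "allele count" and "depth" refer to the standard `AC` and `DP` keys of the VCF INFO column, and that the counts of multiple alternate alleles are summed; neither assumption is stated in the text, and the function names are hypothetical.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'AC=12;DP=80;DB' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def passes_filter(vcf_line, min_ac=10, min_dp=50):
    """Keep a site if its allele count is > min_ac and its depth is >= min_dp."""
    fields = vcf_line.rstrip("\n").split("\t")
    info = parse_info(fields[7])  # INFO is the 8th column of a VCF record
    if "AC" not in info or "DP" not in info:
        return False
    # Assumption: AC may hold one value per alternate allele; sum them.
    allele_count = sum(int(x) for x in info["AC"].split(","))
    depth = int(info["DP"])
    return allele_count > min_ac and depth >= min_dp
```

Applied line by line to the non-header records of a VCF file, `passes_filter` keeps only sites that satisfy both thresholds.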
After raw genotype calls were obtained from GATK, 33.4% of genotypes were missing because of low-coverage sequencing. We then performed imputation using an in-house modified k-nearest-neighbor algorithm. For imputation, heterozygous calls were set to missing and the variations were split into 4058 bins (5000 variations each). Genotypes that were missing in high-coverage regions were set to 'DEL'. After imputation, the overall missing-data rate was reduced to 2.32%, with an overall DEL rate of 9.57%. The precision rate and missing rate of each bin after imputation are shown below:
Figure 1. Precision rate statistic after imputation
Figure 2. Missing rate statistic after imputation
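The in-house algorithm is not described in detail, so the following Python sketch shows only a generic k-nearest-neighbor imputation within one bin, not the modified method used here. It assumes genotypes are coded as arbitrary hashable values with `None` for missing calls (including heterozygous calls set to missing), that distance is the mismatch fraction over shared called sites, and that a missing call is filled by majority vote of the k closest samples that have a call at that site. All function names and the default `k` are hypothetical.

```python
from collections import Counter


def split_into_bins(n_variants, bin_size=5000):
    """Return (start, end) index pairs covering n_variants in consecutive bins."""
    return [(s, min(s + bin_size, n_variants))
            for s in range(0, n_variants, bin_size)]


def distance(a, b):
    """Mismatch fraction over sites called in both samples; None if no shared sites."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum(x != y for x, y in shared) / len(shared)


def knn_impute(genotypes, k=3):
    """Impute missing calls (None) in one bin; genotypes is a list of per-sample lists."""
    imputed = [row[:] for row in genotypes]
    for i, row in enumerate(genotypes):
        # Rank all other samples by their distance to sample i.
        neighbors = []
        for j, other in enumerate(genotypes):
            if j == i:
                continue
            d = distance(row, other)
            if d is not None:
                neighbors.append((d, j))
        neighbors.sort()
        for site, call in enumerate(row):
            if call is not None:
                continue
            # Take the k nearest samples that have a call at this site.
            votes = [genotypes[j][site] for _, j in neighbors
                     if genotypes[j][site] is not None][:k]
            if votes:
                imputed[i][site] = Counter(votes).most_common(1)[0][0]
    return imputed
```

Sites for which no neighbor has a call stay missing, which is one way a residual missing rate can remain after imputation.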
To estimate the accuracy of the imputed genotypes, we genotyped 50 accessions using the Illumina Infinium array RiceSNP50. The array contains 41709 high-quality SNP markers, and 41709 SNPs are covered by the RiceVarMap v2 well-imputed SNPs. Because the accuracy of the Infinium array is well established, the concordance between genotypes from array hybridization and from sequencing can be used to estimate the accuracy of the raw genotypes from direct sequencing and of the genotypes after imputation. The results suggested an accuracy of 99.9% for raw genotypes and 99.8% for genotypes after imputation (Table 1).
| ID | Raw Prop. Concordance | Raw Num. Concordance | Raw Num. Difference | Imputed Prop. Concordance | Imputed Num. Concordance | Imputed Num. Difference |
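The three quantities reported per accession in Table 1 (proportion concordant, number concordant, number different) can be computed as in the sketch below. It assumes that array and sequencing genotypes are given as equal-length lists in the same site order and that sites missing (`None`) in either source are excluded from the comparison; the function name is hypothetical.

```python
def concordance(array_calls, seq_calls):
    """Compare two genotype lists site by site, skipping sites missing in either."""
    n_same = 0
    n_diff = 0
    for a, s in zip(array_calls, seq_calls):
        if a is None or s is None:
            continue  # only sites called by both methods are compared
        if a == s:
            n_same += 1
        else:
            n_diff += 1
    compared = n_same + n_diff
    prop = n_same / compared if compared else float("nan")
    return prop, n_same, n_diff
```

Running it once with the raw sequencing calls and once with the imputed calls would give the two column groups of the table for one accession.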
The population structure of the 4,726 accessions was inferred using ADMIXTURE based on 210,521 SNPs randomly selected from the genome (3 SNPs with MAF ≥0.01 were picked at random per 5 kb). The number of ancestral clusters K was set from 2 to 7 to obtain different inferences. Each accession was classified according to its maximum subpopulation component. Accessions whose maximum subpopulation component differed from the second-largest component by less than 0.4 were classified as intermediate.
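The SNP selection step (up to 3 random SNPs per 5 kb, MAF ≥0.01) could be sketched as follows. The input format of `(chrom, pos, maf)` tuples, the use of fixed non-overlapping windows starting at position 1, and the function name are assumptions for illustration; the text does not say how windows were defined.

```python
import random
from collections import defaultdict


def thin_snps(snps, window=5000, per_window=3, min_maf=0.01, seed=0):
    """Sample up to per_window SNPs from each window; snps holds (chrom, pos, maf)."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    windows = defaultdict(list)
    for chrom, pos, maf in snps:
        if maf >= min_maf:
            # Assumption: windows are fixed and 1-based (1-5000, 5001-10000, ...).
            windows[(chrom, (pos - 1) // window)].append((chrom, pos))
    picked = []
    for key in sorted(windows):
        candidates = windows[key]
        picked.extend(rng.sample(candidates, min(per_window, len(candidates))))
    return sorted(picked)
```

Windows with fewer than three qualifying SNPs simply contribute all of them, so the total can fall below three per window.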
When K=2, accessions were divided into indica and japonica varietal groups.
At K=3, the aus cluster (Aus) appeared within the indica varietal group.
At K=4, indica was further divided into two subgroups (indica I and indica III, also denoted IndI and IndIII); indica accessions with similar components of IndI and IndIII (difference <0.4) were classified as Indica Intermediate.
At K=5, IndI was further divided into two subgroups (indica I and indica II, also denoted IndI and IndII); indica accessions with similar components of IndI and IndII (difference <0.4) were classified as Indica Intermediate.
At K=6, japonica was divided into two subgroups, corresponding to tropical japonica (TrJ) and temperate japonica (TeJ); japonica accessions with similar components of TeJ and TrJ (difference <0.4) were classified as Japonica Intermediate.
At K=7, an independent group (VI) emerged, which is intermediate between indica and japonica. Only fourteen accessions belonged to VI, and nine of them carried a mutated fragrance gene fgr, which suggests that VI corresponds to the Group V/Aromatic group reported in other studies (Glaszmann et al. Theor Appl Genet, 1987, 74: 21-30; Garris et al. Genetics, 2005, 169: 1631-1638).
The set of 4,726 rice accessions sequenced in this study was accordingly classified into 595 IndI, 465 IndII, 913 IndIII, 786 indica intermediate, 767 TeJ, 504 TrJ, 241 japonica intermediate, 269 Aus, 96 VI, and 90 intermediate. The details of the classification and the values of the subpopulation components can be queried on the Cultivar Information page.
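The assignment rule used throughout (largest component wins unless it exceeds the second-largest by less than 0.4) can be written as a short Python sketch. It applies the rule once to a flat set of components, whereas the text applies it within indica and within japonica at successive values of K; the function name and the dictionary input are assumptions for illustration.

```python
def classify(components, min_gap=0.4):
    """Assign an accession from a dict of subpopulation name -> ancestry proportion."""
    ranked = sorted(components.items(), key=lambda kv: kv[1], reverse=True)
    top_name, top_value = ranked[0]
    second_value = ranked[1][1]
    # A gap below the threshold means no component clearly dominates.
    if top_value - second_value < min_gap:
        return "Intermediate"
    return top_name
```

For example, an accession with components of 0.9 indica and 0.1 japonica is assigned to indica, while one with 0.6 and 0.4 is classified as intermediate.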