Details of data processing:

Plant materials

A diverse worldwide collection of 533 O. sativa accessions including both landraces and elite varieties was obtained, including 200 varieties from a core/mini-core collection of Oryza sativa L. in China, 132 parental lines used in the International Rice Molecular Breeding Program, 148 varieties from a mini-core subset of the US Department of Agriculture rice gene bank, 18 varieties used for SNP discovery in the OryzaSNP project and 35 additional worldwide rice germplasm obtained from the Rice Germplasm Center at the International Rice Research Institute. Sequences of 950 accessions generated by Huang et al. (2012, Nat. Genet. 44:32-39) were downloaded from the EBI European Nucleotide Archive with accession number ERP000106 and ERP000729. Information about the accessions, including variety name, country of origin, longitude and latitude origin and subpopulation identity, can be queried in Cultivar Information page.

Sequence alignment and SNP/INDEL identification

The assembly release version 6.1 of genomic pseudomolecules of japonica cv. Nipponbare was downloaded from Michigan State University and used as the reference genome. Reads of all varieties were aligned to the pseudomolecules using the software BWA61. SNPs/INDELs were identified using SAMtools and BCFtools. Only alignments with mapping quality ≥40 were used, and bases with base quality ≥10 were used to identify SNPs (with parameters of -C50 -Q10 -q40 for the mpileup subcommand of SAMtools). The SNPs identified by BCFtools were further filtered: only a SNP with the proportion of heterozygous genotype called by BCFtools of less than 0.1 and the average depth for a variety (calculated by dividing the overall depth by the number of varieties) between 0.2 and 2.5 were considered. A total of 8,044,699 SNPs were identified. Some SNPs appeared to have three or more alleles, which might due to sequencing errors. We only considered a candidate SNP if, when sorting the alleles in descending order of supporting reads, the total number of reads supporting the third and fourth alleles was less than half of the number supporting the second allele, as well as less than ten, in which case only the first two alleles were retained; other alleles were considered sequencing errors and set to missing. Only SNPs with the number of varieties with minor allele greater than five in the whole 1,483 variety population and at least 20% varieties genotyped were considered as high-quality SNPs. Finally, we obtained 6,551,358 high quality SNPs. In the first set of 533 accessions, three were found with excessive heterozygosity and one with low mapping rate, which were excluded in further analysis.

Imputing missing genotype using an LD-KNN algorithm

After obtaining raw genotype calls from BCFtools, 47.1% of genotypes was missing due to low-coverage sequencing. We then performed imputation using an in-house modified k nearest neighbour algorithm. Four varieties with apparent heterozygosity were excluded in the imputation analysis, and the remaining heterozygous calls were set to missing. Only SNPs polymorphic in the remaining 529 varieties were selected for imputation. The performance of KNN imputation is determined by a number of features, including the probability of successfully identifying k-nearest neighbors and how similar the target sample is to the neighbors. To increase the performance of KNN imputation, we merged all 1,479 rice varieties together for the imputation. Briefly, we first obtained initial imputation genotypes of the missing genotypes using a classic KNN algorithm with relaxed parameters. Then, for each SNP, the 100 SNPs upstream and downstream were selected, and the r2 of LD was calculated. A threshold was selected to screen high-LD SNPs, and then the KNN algorithm with strict parameters was carried out and the final imputation genotype was determined. Because we used strict parameters, after a round of LD-KNN imputation, there were still a lot of missing genotypes. The LD-KNN imputation was run several times iteratively to reach user requirement. Parameters with window size of 100 SNPs, k-nearest of five SNPs, f = 1, s =-7, r2 = 0.09 and five iterations were selected to perform imputation in our study. After imputation, for the 6,551,358 SNPs, we got an overall missing data rate reduced to 0.42%. We further selected SNPs with rate of missing data less than 20% of the accessions resulting in a total of 6,428,770 SNPs, with the overall missing data rate about 0.38%. These SNPs were used to carry out GWAS.

Data evaluation:

We estimated the recall rate of SNP identification from five datasets, and all results show that the 6,551,358 SNPs cover over 91.8% of SNPs so far identified in rice (Table 1). The 14-fold Illumina sequences of Zhenshan 97B and Minghui63 are available in NCBI Sequence Read Archive with accession number SRA012177. Three SNP sets were derived from pairwise comparisons of Zhenshan 97B, Minghui63 and reference genome Nipponbare using SAMtools and BCFtools, respectively. Two SNP sets, 157,058 SNPs from OryzaSNP project and 36,901 high quality SNPs from rice 44K SNP array, were downloaded from Gramene (http://gramene.org/diversity/download_data.html).
Table 1. Recall rates of five SNP sets against 6,551,358 high quality SNPs of which the minor allele appears in at least five accessions.

Comparison

Num. SNPs

Num. recalled SNPs

Recall rate

Nipponbare/Minghui 63

1258436

1172123

93.14%

Nipponbare/Zhenshan 97B

1245925

1162072

93.27%

Minghui 63/Zhenshan 97B

572527

525618

91.81%

OryzaSNP

157058

145508

92.65%

Rice44k

36901

35680

96.69%


To estimate the accuracy of imputed genotype, we genotyped 48 accessions using Illumina Infinium array RiceSNP50. There are 43386 high quality SNP markers in the array and 42759 (98.6%) SNPs covered by the 6,428,770 well-imputed SNPs. The accurary of Infinium array is proved. Thus, the concordance of genotypes using array hybridization and sequencing can be used to estimate the accurary of raw genotypes from direct sequencing and after imputation. The results suggested an accuracy of 99.8% for raw genotypes and 99.3% for genotypes after imputation (Table 2).
Table 2. Concordance between genotyping results of array hybridization and sequencing on 42759 SNPs.
ID
Raw genotype
Imputed genotype
Num. Concordance Num. Difference Prop. Concordance Num. Concordance Num. Difference Prop. Concordance
C13736808430.998833417391460.996514
C143349111240.996461406594190.9898
C12336634340.999073415821870.995523
C006351211200.996595405844430.989202
C094335921000.997032405394470.989094
C05637195100.99973142248990.997662
C04334737980.997187408454040.990206
C131346551210.996521405874550.988914
C135352961740.995094407474790.988381
C00536855510.998618415052130.994894
C110345681220.996483405724650.988669
C14034618970.997206405283940.990372
C04135093870.997527407644000.990283
C038351831110.996855406074080.990052
C02837893230.999393421541140.997303
C025346811300.996266407234020.990225
C14436774300.999185417851610.996162
C12037714200.99947422621010.997616
C134372111090.997079416982360.994372
C124340981010.997047405994050.990123
C148343761110.996781407803800.990768
C027350551110.996844408623960.990402
C091338961030.99697408114430.989262
C037349921330.996214406704410.989273
C096337331260.996279404264190.989742
C129341011380.99597405054550.988892
C07436182350.999034415851850.995571
C114342911000.997092406184010.990224
C11334759950.997274406743760.99084
C05236930140.999621420751070.997463
C040348861090.996885407224180.98984
C00236096470.9987407652160.994729
C00337070240.999353413851820.995622
C065340541020.997014405704350.989392
C02636751270.999266422131180.997212
C10336364230.999368417191630.996108
C05034500990.997139410793700.991073
C06436754200.999456420511240.99706
C02936971410.998892418051770.995784
C05436828170.999539420181320.996868
C14937072170.99954242195940.997777
C06736327240.99934417341690.995967
C10636821380.998969417821740.995853
C10135957220.999389417371740.995848
C060342661040.996974407754290.989588
C06336595240.999345421571260.99702
C04837636170.999549420851080.99744
C042345571040.997407854330.989495
Total171145735300.997941979310140230.99296

The genetic structure and diversity of the rice germplasms:

The population structure of the 1479 accessions was inferred using ADMIXTURE based on 188,637 SNPs with even distribution sampled from the 6,428,770 well-imputed SNPs (missing data ≤20%). In SNP selection, we divided the genome into ~3.8 kb regions, at most two SNPs with minor allele frequencies (MAF) ≥0.01 were randomly chosen for each region. Parameter of the number of ancient clusters K was set from 2 to 7 to obtain different inferences. Each accession was classified based on its maximum subpopulation component. Accessions with the maximum subpopulation component value differing from the secondary value less than 0.4 were classified as intermediate.
When K=2, accessions were divided into indica and japonica varietal groups.
At K=3, the aus cluster (Aus) appeared within the indica varietal group.
At K=4, the indica were further divided into two sub groups (indica I and indica II, also denote as IndI and IndII), indica accessions with similar components of IndI and IndII (<0.4) were classified as Indica Intermediate.
At K=5, japonica were divided into two sub groups, corresponding to tropical japonica (TrJ) and temperate japonica (TeJ), japonica accessions with similar components of TeJ and TrJ (<0.4) were classified as Japonica Intermediate.
At K=6, an independent group (VI) emerged, which is an intermediate group between indica and japonica. Only fourteen accessions belonged to VI and we found that nine of them were with mutated fragrance gene fgr, which suggested that VI is corresponding to Group V/Aromatic group reported in other studies (Glaszmann et al. Theor Appl Genet, 1987, 74: 21-30; 1. Garris et al. Genetics, 2005, 169: 1631-1638).
The set of 529 rice accessions sequenced in this study was accordingly classified into 98 IndI, 105 IndII, 92 indica intermediate, 91 TeJ, 44 TrJ, 21 japonica intermediate, 46 Aus, 14 VI, and 18 intermediate, and the set of 950 rice accessions sequenced by Huang et al. was divided into 283 IndI, 111 IndII, 120 indica intermediate, 287 TeJ, 54 TrJ, 50 japonica intermediate, 21 Aus, and 24 intermediate. The details of classification and values of subpopulation component can be queried in Cultivar Information page.


Figure 1. Neighbor-joining tree of 1479 accessions constructed from matching distance of 188,637 even-distributed and randomly selected SNPs. Different subpopulations, indica I (IndI), indica II (IndII), Aus, temperate japonica (TeJ) and tropical japonica (TrJ) are shown in different color and the numbers of accessions in each subpopulation are marked. In this figure, the number of accessions of Intemediate contains VI group (denotes in purple).




Figure 2. The distribution of the estimated subpopulation components for each accession analyzing by ADMIXTURE under different assumptions of ancient clusters K = 2 to 7 for 1479 accessions.