Zephyrnet Logo

ddRAD sequencing based genotyping of six indigenous dairy cattle breeds of India to infer existing genetic diversity and population structure – Scientific Reports

Date:

Quality control, alignment and SNP calling

The ddRAD sequencing based genotyping of 58 individuals belonging to six native cattle breeds; Gir, Sahiwal, Tharparkar, Rathi, Red Sindhi and Kankrej cattle with their geographical and ecological distribution (Fig. 1) including the productive purpose, coat colour, representative agroclimatic zone, breeding tract, the geographical co-ordinate of each breeding tract along with animal ID and sex of each individual presented in Supplementary Table S1; resulted in 138.59 million raw reads that corresponded to 23 million reads per breed and 2.2 million reads per animal. After initial filtering on the basis of read quality and adaptor removal, majority of the reads (138.58 million reads; 99.9%) were retained (Supplementary Table S2). A high percentage of reads (94.53%) were mapped to the Bos taurus (ARS-UCD1.2) reference assembly (Supplementary Table S2). In this study, the effort was made to analyze only the SNPs across different cattle breeds, therefore all other variants were not considered in subsequent analysis. The number of SNPs in 6 cattle breeds ranged between 8,42,768 and 3,81,966 after individual variant calling. Maximum number of SNPs were observed in SAC (8,42,768), followed by GIC (8,34,780), KAC (8,10,279), RAC (8,05,020), RSC (6,72,632) and THC (3,81,966) (Table 1). The combined data set across 6 cattle breeds produced a total of 43,47,445 SNPs. Subsequently, the VCF file was processed in a stepwise manner to filter out low quality SNPs. Firstly, the SNPs were filtered at read depth of 2 (RD 2), read depth of 5 (RD 5) and read depth of 10 (RD 10). For further analysis, the data set of 9,82,174 SNPs identified at RD of 5, were utilized for subsequent analysis (Table 1). All those SNPs that were present at low coverage (RD < 5) were removed from the data set. The SNPs that were identified at RD of 5 were further filtered using various criteria’s such as proportion of missing genotypes, minor allele frequency and Hardy Weinberg Equilibrium (HWE). The series of filtering resulted in a total of 84,027 high quality SNPs. Post filtering, the number of SNPs across breeds varied considerably. The highest number of SNPs was observed in GIC (34,743), followed by RSC (13,092), KAC (12,812), SAC (8956), THC (7356) and RAC (7068) (Table 2).

Figure 1
figure 1

Geographical distribution of six cattle breeds included in this study (The map was generated using websites Map Chart https://www.mapchart.net/ and Paint Maps https://paintmaps.com/).

Table 1 Number of SNPs identified at read depth (RD) of 2, 5 and 10 in 6 Indian cattle breeds.
Table 2 The number of high-quality SNPs in each cattle breed post series of filtering criteria.

Functional annotation of variants

The merged high-quality SNPs dataset of all the 6 milch breeds was annotated to Bos taurus (ARS-UCD1.2) reference genome. With respect to their distribution in the genome, a large number of annotated SNPs were predicted to be in the intronic region (41,372 SNPs, 53.87%) followed by intergenic regions (26,834 SNPs, 34.94%). There were only 948 SNPs (1.23%) that were distributed in the exonic regions. Further, there were 3497 SNPs (4.55%) located within the 5 Kb region upstream and 3661 SNPs (4.77%) in the downstream of transcription start site. The analysis also resulted in 93 SNPs (0.121%) located in 5’UTR, 293 SNPs (0.38%) in 3’UTR region. A total of 8 SNPs (0.01%) were predicted to cause premature stop codon were also identified (Fig. 2).

Figure 2
figure 2

Overall partitioning of SNPs with respect to genomic distribution for all the breeds.

On the basis of the impact of SNPs on protein coding genes, the SNPs were categorized as having high impact (10 SNPs; 0.01%), moderate impact (298 SNPs; 0.39%), and low impact, (697 SNPs; 0.91%). Majority of the SNPs (75,801; 98.69%) were identified as modifier (Supplementary Table S3). Additionally, high proportion of SNPs (65.74%) were silent in nature, followed by missense (33.37%) and nonsense (0.89%), with an average missense/silent ratio of 0.507 (Supplementary Table S4). In addition, amongst all genotypes substituted identified in the present study, C/T and G/A genotypes were found to be predominant, whereas A/T genotype was found to be in lowest proportions (Supplementary Table S5). For individual breed, the annotation results are summarized in Fig. 3 and Supplementary Table S6. In GIC, highest number of SNPs 32,283 (53.96%) were predicted to be in the intronic region followed by intergenic region 20,395 (34.09%). Only 777 (1.3%) were detected in the exonic region. Similar to GIC, the highest number of SNPs were distributed in intronic region followed by intergenic and exonic region in all other cattle breeds. For example, in SAC, 53.87% of SNPs (8429) were predicted in the intronic region followed by intergenic region 33% (5163 SNPs) and only 1.75% (273 SNPs) in exonic region. A similar trend was observed for RAC, RSC, KAC and THC cattle breeds with 6834 (55.63%), 11,147 (52.12%), 8429 (53.87%), 6374 (52.58%) SNPs, respectively in the intronic region, 4186 (34.08%), 8192 (38.30%), 5163 (33%), 4507 (37.18%) SNPs respectively, in the intergenic region and only 142 (1.16%), 266 (1.24%), 273 (1.75%), 123 (1.02%) were predicted in the exonic regions (Fig. 3). The number of synonymous variants identified in GIC, KAC, RAC, RSC, SAC and THC were 570, 190, 101, 172, 213 and 87 respectively. On the other hand, the number of non-synonymous variants detected for the 6 cattle breeds were 165, 64, 31, 82, 53 and 30 respectively. The TS/TV ratio observed in GIC, KAC, RAC RSC SAC and THC was 2.55, 2.64, 2.33, 2.43, 2.51 and 2.19 respectively (Supplementary Table S6).

Figure 3
figure 3

Genomic distribution of SNPs across the genome of six Indian milch cattle breeds.

The numbers of intergenic SNPs were 4,639,873 (68.1%) and 1,676,710 (24.6%) were intronic. There were 230,365 (3.4%) SNPs located within 5 kb upstream and 197,827 (2.9%) in downstream of a transcription start site; 12,428 SNPs were located in the 5′ UTR and 2613 in the 3′ UTR. A total of 4356 SNPs were located in splice sites of 2966 genes: 142 were in splice-donor sites, 142 were splice-acceptor sites and 4072 were within the region of the splice site. We identified 45,776 SNPs affecting the coding sequences of 11,538 genes. There were 221 SNPs predicted to cause premature stop codon and 17 to cause gain in coding sequence. The numbers of SNPs predicted to be non-synonymous were 20,828. The numbers of intergenic SNPs were 4,639,873 (68.1%) and 1,676,710 (24.6%) were intronic. There were 230,365 (3.4%) SNPs located within 5 kb upstream and 197,827 (2.9%) in downstream of a transcription start site; 12,428 SNPs were located in the 5′ UTR and 2613 in the 3′ UTR. A total of 4356 SNPs were located in splice sites of 2966 genes: 142 were in splice-donor sites, 142 were splice-acceptor sites and 4072 were within the region of the splice site. The numbers of intergenic SNPs were 4,639,873 (68.1%) and 1,676,710 (24.6%) were intronic. There were 230,365 (3.4%) SNPs located within 5 kb upstream and 197,827 (2.9%) in downstream of a transcription start site; 12,428 SNPs were located in the 5′ UTR and 2613 in the 3′ UTR. A total of 4356 SNPs were located in splice sites of 2966 genes: 142 were in splice-donor sites, 142 were splice-acceptor sites and 4072 were within the region of the splice site. We identified 45,776 SNPs affecting the coding sequences of 11,538 genes. There were 221 SNPs predicted to cause premature stop codon and 17 to cause gain in coding sequence. The numbers of SNPs predicted to be non-synonymous were 20,828. The numbers of intergenic SNPs were 4,639,873 (68.1%) and 1,676,710 (24.6%) were intronic. There were 230,365 (3.4%) SNPs located within 5 kb upstream and 197,827 (2.9%) in downstream of a transcription start site; 12,428 SNPs were located in the 5′ UTR and 2613 in the 3′ UTR. A total of 4,356 SNPs were located in splice sites of 2966 genes: 142 were in splice-donor sites, 142 were splice-acceptor sites and 4072 were within the region of the splice site. We identified 45,776 SNPs affecting the coding sequences of 11,538 genes. There were 221 SNPs predicted to cause premature stop codon and 17 to cause gain in coding sequence. The numbers of SNPs predicted to be non-synonymous were 20,828.

Within breed diversity

The nucleotide diversity (π) was highest in THC (π = 0.458) followed by RSC (π = 0.364), SAC (π = 0.363), GIC (π = 0.356), KAC (π = 0.348) and RAC (π = 0.347). The mean nucleotide diversity value was 0.373 (Table 3). The Tajima’s D values were negative for 4 cattle breeds viz., RSC, RAC, SAC and THC except for GIC and SAC where positive D values was observed. The highest negative Tajima’s D value was observed in THC (-1.194) followed by RSC (− 1.088), RAC (− 0.295) and KAC (− 0.279).

Table 3 Nucleotide diversity and Tajima’s D values in six Indian milch cattle breeds.

The observed heterozygosity (HO) values ranged from 0.464 to 0.551 while the expected heterozygosity (HE) ranged from 0.448 to 0.535. The highest observed heterozygosity values were observed in THC (HO = 0.551) followed by RAC (HO = 0.523), RSC (HO = 0.5184), SAC (HO = 0.5180), GIC (HO = 0.499) and KAC (HO = 0.464) (Table 4). The average FIS (inbreeding coefficient) ranges from -0.253 in THC to 0.0513 in KAC. The FIS estimate amongst the six cattle breeds was highest in THC (FIS = − 0.253) followed by RAC (FIS = − 0.105), whereas the lowest FIS estimate was observed in KAC (FIS = 0.0513) followed by GIC (FIS = − 0.00063). The overall FIS analysis revealed excess of heterozygosity for all the cattle breeds except for KAC (Table 4). The heterozygosity and FIS estimates indicated presence of sufficient diversity within the six cattle breeds.

Table 4 Within breed diversity statistics in six Indian milch cattle breeds.

Between breed diversity

The genetic differentiation on the basis of fixation index (FST) ranged from 0.2840 to 0.3905, indicating sufficient between breed diversity. The highest divergence was observed between RAC-SAC pair (FST = 0.3905), followed by RSC-RAC breed pair (FST = 0.3790), RSC-SAC breed pair (FST = 0.3751). The least divergence was observed for KAC-THC breed pair (FST = 0.2840) (Table 5). Neighbour Joining (NJ) based tree constructed, grouped the individual animals of 6 cattle breeds as per their breed affiliations with GIC and RSC being the most diverse breed amongst the 6 studied cattle breeds. The phylogenetic relationship at individual level is shown in Fig. 4. The breed wise NJ tree depicted in Fig. 5, more or less corroborated with the individual level tree. Furthermore, UPGMA based phylogenetic tree was constructed at breed level using “phangorn” package in R platform with 100 bootstrap values. The bootstrap values of each node were close to 100% indicating high robustness of the constructed tree. UPGMA based phylogenetic tree reflected similar genetic relationship as revealed by NJ based genetic differentiation (individual wise and at breed level) where GIC and RSC appeared as the most distinct breeds. GIC appeared on major node and clustered as one group while the other populations formed two groups with RSC clustered on one node and RAC, THC, SAC and KAC formed other sub clusters (Fig. 6).

Table 5 Pairwise FST statistics indicating genetic differentiation amongst the 6 Indian milch cattle breeds.
Figure 4
figure 4

Neighbour-Joining based phylogenetic grouping of 58 animals of six Indian milch cattle breeds using Tassel software.

Figure 5
figure 5

Neighbour-Joining based grouping of 6 Indian milch cattle breeds using “phangorn” package of R platform.

Figure 6
figure 6

UPGMA based phylogenetic grouping of six Indian milch breed using “phangorn” package of R platform.

Population structure analysis

The admixture analysis was carried out by partitioning the genome of each individual into a predefined cluster. The analysis was performed at K = 3, 4, 5 and 6 (Fig. 7). The individuals could not be grouped at K = 3 as per their respective breed. Only GIC could distinctly be differentiated while the individuals of KAC and SAC appear as one group and RAC, THC and RSC are clustered together indicating their shared ancestry. At K = 4, and even at K = 5, THC, RAC and RSC clustered together indicating their strong shared ancestry, while all other individuals clustered in their own respective breed. The best K in population structure analysis is K = 6, whereby almost all the animals were grouped to their respective breed, clearly indicating their sperate ancestry, with the exception of RSC and THC which still clustered together. The genetic closeness between RSC and THC could be unveiled by further in-depth studies and by increasing the number of samples.

Figure 7
figure 7

Admixture analysis assuming 3 ≤ K ≤ 6.

The PCA based analysis also clustered 6 cattle breeds separately and reinforces the fact that these are distinct cattle breeds (Supplementary Fig. S1). Individuals of KAC were grouped together in one quadrant, while individuals of SAC RAC, THC and RSC cattle breeds fall in a different quadrant. Individuals of GIC cattle breed appeared as a distinct population.

spot_img

Latest Intelligence

spot_img