## Biometrical Models and Introduction to Genetic Analysis

Pak Sham, University of Hong Kong. 4th March 2019. The 2019 International Workshop on Statistical Genetics Methods for Human Complex Traits.

**What is biometrical genetics?**

• How do genes contribute to the biometrical (statistical) properties of continuous (quantitative) traits in populations?
• For a single trait, biometrical properties include
  • Means and variances in individuals
  • Covariances between relatives
• For multiple traits, biometrical properties also include
  • Covariances between different traits in the same individual
  • Covariances between different traits in different (related) individuals

**History of biometrical genetics**

• Mendel: genes
• Galton: biometrics
• Jinks, Mather: biometrical genetics
• Fisher: "Correlation between relatives on the supposition of Mendelian inheritance"
• Eaves, Fulker: statistical modelling in biometrical genetics

**Genes are discrete entities**

• Mendelian disorders are caused by mutations in a single gene
• Mendelian disorders are also discrete entities
• How can discrete entities produce continuous variation?

https://gameofthrones.fandom.com/wiki/Dwarfism

**Origin of continuous variation**

| Genes | Genotypes | Phenotypes |
|---|---|---|
| 1 | 3 | 3 |
| 2 | 9 | 5 |
| 3 | 27 | 7 |
| 4 | 81 | 9 |

• Continuous (quantitative) variation can be explained by polygenic inheritance
• The sum of independent and approximately equal influences approaches a continuous, normal distribution as the number of influences increases (central limit theorem)

https://www.youtube.com/watch?v=kDkmSI39sWQ

**Major loci and polygenes**

• Quantitative traits can be influenced by genetic mutations with very large effects (major loci) in addition to multiple genetic variants with small effects (polygenes)
• Adult males with achondroplasia have a mean height of 52 inches, compared to the population adult male mean of 69 inches.
This difference of 17 inches is almost 6 standard deviations of adult male height in the general population.
• Thus even the tallest adults with achondroplasia are seldom taller than the shortest adults without achondroplasia.

[Figure: Height for females with achondroplasia (mean/standard deviation [SD]) compared to normal standard curves. The graph is based on information from 214 females. Adapted from Horton WA, Rotter JI, Rimoin DL, et al. Standard growth curves for achondroplasia. J Pediatr. 1978 Sep; 93(3): 435-8.]

**Genetic polymorphisms, alleles, genotypes**

• A genetic polymorphism is a variable site in the genome (e.g. single nucleotide polymorphism, SNP)
• The alternative sequences at a locus are called alleles, often denoted by capital and small letters (e.g. A, a)
• The pair of alleles present at the polymorphic site (locus) of an individual is called his or her genotype (e.g. AA, Aa, aa)

**Analysis of variance**

• Fisher developed Analysis of Variance (ANOVA) for "factorial designs", where the factors have discrete levels (e.g. binary).
• The overall variance of a trait is decomposed into components due to the main effects of the factors, two-way interactions, three-way interactions, etc., in a hierarchical fashion

**Biometrical model for single locus**

• Consider the effects of a single locus on a quantitative trait
• All other influences are considered as "error" or "residual", which are assumed to be uncorrelated with, and to have no interaction with, the locus being considered
• In Fisher's convention, effects are measured from the "midpoint" of the two homozygous genotypes

Model: $Y = c + X + R$, where $X$ is the effect of the genotype (measured from the midpoint $c$) and $R$ denotes the residual influences.

[Figure: genotypic scale with aa at $-a$, the midpoint at 0, Aa at $d$ and AA at $+a$]

| Genotype | Effect $X$ | Genotype mean |
|---|---|---|
| AA | $a$ | $c + a$ |
| Aa | $d$ | $c + d$ |
| aa | $-a$ | $c - a$ |

Note: we do not distinguish paternally from maternally transmitted alleles, implicitly assuming that their effects are the same.

**Population genotype frequencies**

• We also need to specify the frequencies of the 3 genotypes in the population
• In a large population under random mating, the frequencies of genotypes AA, Aa and aa follow the binomial proportions $p^2 : 2pq : q^2$, where $p$ and $q$ ($= 1 - p$) are the frequencies of alleles A and a
• Genotypes in such proportions are said to be in Hardy-Weinberg equilibrium; deviation from such proportions is called Hardy-Weinberg disequilibrium (HWD)

**Derivation of Hardy-Weinberg proportions**

Parental frequencies (not necessarily in Hardy-Weinberg proportions)

**Random mating**

Under random mating, the mating type frequencies are

**Mendelian segregation**

• Mendel's law of segregation: when a parent has the heterozygous genotype Aa, the two alleles (A and a) have equal probability of being transmitted to an offspring

[Figure: parent Aa transmits A with probability 1/2 and a with probability 1/2]

**Segregation ratios**

According to Mendel's law of segregation, the offspring genotype frequencies for the mating types are:

**Offspring genotype frequencies**

Averaging over the mating types, the offspring genotype frequencies are

**Offspring allele frequencies**

Averaging over the genotypes, the offspring allele frequencies are

**Hardy-Weinberg equilibrium**

The genotype can be thought of as consisting of 2 independent factors, one from each parent (as in a 2-way factorial design)

**Biometrical model: mean**

The mean effect of genotype under Hardy-Weinberg equilibrium is thus

| Genotype | AA | Aa | aa |
|---|---|---|---|
| Frequency | $p^2$ | $2pq$ | $q^2$ |
| Effect | $a$ | $d$ | $-a$ |

Mean: $m = p^2(a) + 2pq(d) + q^2(-a) = (p-q)a + 2pqd$

**Biometrical model: variance**

The variance of the genotypic effect is therefore

| Genotype | AA | Aa | aa |
|---|---|---|---|
| Frequency | $p^2$ | $2pq$ | $q^2$ |
| $(X-m)^2$ | $(a-m)^2$ | $(d-m)^2$ | $(-a-m)^2$ |

Variance $= (a-m)^2 p^2 + (d-m)^2\, 2pq + (-a-m)^2 q^2 = 2pq[a+(q-p)d]^2 + (2pqd)^2$ (intermediate steps not shown)

**Average allele effect, additive variance**

• The first variance component is due to the additive effects of the two alleles of the genotype
• The presence of dominance (i.e. when $d \neq 0$) means that the effect of an allele depends on the other allele in the genotype:
  • When the other allele is A, the effect of allele A is $a - d$ (i.e. effect of AA minus effect of Aa)
  • When the other allele is a, the effect of allele A is $a + d$ (i.e. effect of Aa minus effect of aa)
• Therefore the average effect of allele A is $p(a-d) + q(a+d) = a + (q-p)d$
• If the genotype is coded additively as $G$ = number of A alleles in the genotype (i.e. $G$ = 0, 1 or 2), then the regression coefficient of the trait on $G$ is $a + (q-p)d$
• Thus the additive genetic variance is $(a+(q-p)d)^2\,\mathrm{Var}(G) = 2pq(a+(q-p)d)^2$
• The second component of the variance, $(2pqd)^2$, is therefore attributed to the dominance deviation (second-order interaction between the 2 alleles of the genotype at the same locus)

**Variance components and heterozygosity**

• $2pq$ is the expected heterozygosity of a biallelic locus under Hardy-Weinberg equilibrium
• When $p = q = 1/2$, the expected heterozygosity takes its highest value of 1/2.
As allele frequency approaches 0 or 1, heterozygosity approaches 0
• Additive genetic variance is proportional to the expected heterozygosity
• Dominance genetic variance is proportional to the square of the expected heterozygosity
• Dominance genetic variance declines much more rapidly than additive genetic variance as allele frequency approaches 0 or 1. (Why is this intuitively obvious?)

**Covariance between pairs of relatives**

Cross-products of genotypic effects for the two relatives:

| | AA | Aa | aa |
|---|---|---|---|
| AA | $(a-m)^2$ | | |
| Aa | $(a-m)(d-m)$ | $(d-m)^2$ | |
| aa | $(a-m)(-a-m)$ | $(-a-m)(d-m)$ | $(-a-m)^2$ |

The matrix is symmetrical, therefore the upper triangular elements are not shown. The covariance between relatives of a certain class is the weighted average of these cross-products, where each cross-product is weighted by its frequency in that class.

**Genetic identity-by-descent (IBD)**

For deriving the joint genotype frequencies of two relatives, the concept of genetic identity by descent is helpful

• DNA segments (e.g. genes) are identical-by-descent if they are descended from, and therefore replicates of, a single ancestral DNA segment
• IBD segments should have identical sequence (unless a new mutation has occurred)
• At any autosomal location, two individuals can share 0, 1 or 2 alleles IBD
• There are 3 genetic relationships where the IBD sharing is the same throughout the genome (What are these?)
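The statement that two individuals can share 0, 1 or 2 alleles IBD can be made concrete by exact counting. The following is a minimal sketch, assuming two full sibs whose parents carry four distinctly labelled alleles (AB and CD), so that matching labels mean IBD. The function name `sib_ibd_distribution` is hypothetical and introduced only for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def sib_ibd_distribution(father="AB", mother="CD"):
    """Exact distribution of the number of alleles shared IBD by two full sibs.

    Assumption: the four parental alleles carry distinct labels, so two sibs
    share an allele IBD exactly when they carry the same label.
    """
    # Each child gets one paternal and one maternal allele, each with
    # probability 1/2, giving 4 equally likely child genotypes.
    children = list(product(father, mother))
    counts = Counter()
    # The two sibs are drawn independently: 16 equally likely pairs.
    for sib1, sib2 in product(children, repeat=2):
        shared = int(sib1[0] == sib2[0]) + int(sib1[1] == sib2[1])
        counts[shared] += 1
    total = sum(counts.values())
    return {n: Fraction(c, total) for n, c in sorted(counts.items())}


dist = sib_ibd_distribution()
for n_shared, prob in dist.items():
    print(n_shared, prob)

# Expected number of alleles shared IBD
mean_shared = sum(n * prob for n, prob in dist.items())
print("mean:", mean_shared)
```

Exact fractions are used instead of simulation so that the probabilities come out as 1/4, 1/2 and 1/4 with no sampling noise.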
[Figure: pedigree with parents AB and CD and offspring AC and AD]

**IBD for MZ twins**

[Figure: pedigree with parents AB and CD and MZ twins AC and AC]

MZ twins share 2 alleles IBD at all loci

**IBD for parent-offspring (PO)**

[Figure: pedigree with parents AB and CD and offspring AC]

When the parents are unrelated to each other, PO pairs share 1 allele IBD at all loci

**IBD for unrelated individuals**

• Two unrelated individuals share 0 alleles IBD at all loci

**Covariance of MZ twins**

Joint genotype frequencies:

| | AA | Aa | aa |
|---|---|---|---|
| AA | $p^2$ | | |
| Aa | 0 | $2pq$ | |
| aa | 0 | 0 | $q^2$ |

Covariance $= (a-m)^2 p^2 + (d-m)^2\, 2pq + (-a-m)^2 q^2 = 2pq[a+(q-p)d]^2 + (2pqd)^2 = V_A + V_D$

**Covariance for parent-offspring (P-O)**

Joint genotype frequencies:

| | AA | Aa | aa |
|---|---|---|---|
| AA | $p^3$ | | |
| Aa | $p^2 q$ | $pq$ | |
| aa | 0 | $pq^2$ | $q^3$ |

Covariance $= (a-m)^2 p^3 + (d-m)^2 pq + (-a-m)^2 q^3 + (a-m)(d-m)\, 2p^2 q + (-a-m)(d-m)\, 2pq^2 = pq[a+(q-p)d]^2 = V_A/2$

**Covariance for unrelated pairs (U)**

Joint genotype frequencies:

| | AA | Aa | aa |
|---|---|---|---|
| AA | $p^4$ | | |
| Aa | $2p^3 q$ | $4p^2 q^2$ | |
| aa | $p^2 q^2$ | $2pq^3$ | $q^4$ |

Covariance $= (a-m)^2 p^4 + (d-m)^2\, 4p^2 q^2 + (-a-m)^2 q^4 + (a-m)(d-m)\, 4p^3 q + (-a-m)(d-m)\, 4pq^3 + (a-m)(-a-m)\, 2p^2 q^2 = 0$

**IBD: half sibs**

[Figure: pedigree in which parent CD has offspring AC with partner AB, and offspring CE or DE with partner EE]

| IBD sharing | Probability |
|---|---|
| 0 | 1/2 |
| 1 | 1/2 |

Average IBD sharing = 0(1/2) + 1(1/2) = 1/2

In terms of IBD sharing, half siblings are similar to

• Parent-offspring for ½ of the genome
• Unrelated individuals for ½ of the genome

**Covariance: half sibs**

Genotype frequencies are weighted averages:

• ½ Parent-offspring (when IBD = 1)
• ½ Unrelated (when IBD = 0)

Covariance $= \tfrac{1}{2}(V_A/2) + \tfrac{1}{2}(0) = \tfrac{1}{4}V_A$

**IBD: full sibs**

Combining the IBD status of the paternal alleles with that of the maternal alleles:

| IBD sharing | Probability |
|---|---|
| 0 | 1/4 |
| 1 | 1/2 |
| 2 | 1/4 |

Average IBD sharing = 0(1/4) + 1(1/2) + 2(1/4) = 1

In terms of IBD sharing, full siblings are similar to

• MZ twins for ¼ of the genome
• Parent-offspring for ½ of the genome
• Unrelated individuals for ¼ of the genome

**Covariance: full sibs**

Genotype frequencies are weighted averages:

• ¼ MZ twins (when IBD = 2)
• ½ Parent-offspring (when IBD = 1)
• ¼ Unrelated (when IBD = 0)

Covariance $= \tfrac{1}{4}(V_A + V_D) + \tfrac{1}{2}(V_A/2) + \tfrac{1}{4}(0) = \tfrac{1}{2}V_A + \tfrac{1}{4}V_D$

**Generalization: proportion of alleles IBD ($\pi$)**

• IBD can be expressed as a proportion $\pi$ (= number IBD / 2), thus $\pi$ = 0, 1/2 or 1
• The probability distribution is $\mathrm{Prob}(\pi=0)$, $\mathrm{Prob}(\pi=1/2)$, $\mathrm{Prob}(\pi=1)$
• $E(\pi) = \mathrm{Prob}(\pi=1) + \tfrac{1}{2}\mathrm{Prob}(\pi=1/2)$
• $\mathrm{Var}(\pi) = \mathrm{Prob}(\pi=1) + \tfrac{1}{4}\mathrm{Prob}(\pi=1/2) - (E(\pi))^2$

| Relationship | $E(\pi)$ | $\mathrm{Var}(\pi)$ | $\mathrm{Prob}(\pi=1)$ |
|---|---|---|---|
| MZ | 1 | 0 | 1 |
| Parent-offspring | 0.5 | 0 | 0 |
| Unrelated | 0 | 0 | 0 |
| Half sibs | 0.25 | 0.0625 | 0 |
| Full sibs | 0.5 | 0.125 | 0.25 |

**Covariance: general relative pair**

• The covariance is a weighted average of the covariances for MZ twins, parent-offspring and unrelated individuals
• Covariance $= \mathrm{Prob}(\pi=1)(V_A + V_D) + \mathrm{Prob}(\pi=1/2)(V_A/2) + \mathrm{Prob}(\pi=0)(0)$
• $= \left(\mathrm{Prob}(\pi=1) + \mathrm{Prob}(\pi=1/2)/2\right)V_A + \mathrm{Prob}(\pi=1)V_D$
• $= E(\pi)V_A + \mathrm{Prob}(\pi=1)V_D$

**Kinship coefficient**

• The kinship coefficient ($K$) between two individuals is defined as the probability that two alleles, one from each individual, drawn at random at an autosomal locus, will be identical-by-descent (IBD)
• Let the paternal and maternal alleles of individuals 1 and 2 be denoted $G_{1P}$, $G_{1M}$, $G_{2P}$, $G_{2M}$. The genotypes of the 2 individuals, additively coded (0, 1, 2), are $G_1 = G_{1P} + G_{1M}$ and $G_2 = G_{2P} + G_{2M}$
• The covariance between the two genotypes is $\mathrm{Cov}(G_1, G_2) = \mathrm{Cov}(G_{1P}, G_{2P}) + \mathrm{Cov}(G_{1P}, G_{2M}) + \mathrm{Cov}(G_{1M}, G_{2P}) + \mathrm{Cov}(G_{1M}, G_{2M})$
• In the absence of inbreeding, $\mathrm{Var}(G_1) = \mathrm{Var}(G_2) = 2pq$, and each covariance term is either $pq$ when the alleles are IBD or 0 when they are not. Also, each allele of one person can be IBD with at most 1 allele of the other person. In this scenario $E(\pi)$ is equivalent to $2K$ and represents the correlation between $G_1$ and $G_2$
• $K$ is of wider applicability than $E(\pi)$ when there is inbreeding

**"Attenuation" of kinship**

• If two individuals (A and B) have kinship coefficient $K$, what is the kinship coefficient between A and the offspring of B, assuming that the other parent of this offspring is unrelated to A?
• At any genomic location, the offspring of B will have inherited 1 of the 2 DNA segments of B.
• When a DNA segment is drawn at random from the offspring of B, there is a probability ½ that it was inherited from B, and a probability ½ that it was inherited from the other parent.
• If the segment was inherited from B, then there is probability $K$ that it is IBD with a segment drawn from the corresponding genomic location of A.
• If the segment is inherited from the other parent, then the probability is 0 because the other parent is unrelated to A.
• Therefore the kinship coefficient between A and the offspring of B is $\tfrac{1}{2}K$.
• Applying this result recursively, we can show that the kinship coefficient between two individuals sharing one common ancestor is equal to $(\tfrac{1}{2})^{g+1}$, where $g$ is the number of meioses separating the 2 individuals

**Inbreeding coefficient**

• The inbreeding coefficient of an individual, $I$, is the probability that the 2 alleles at any locus are IBD. It is equal to the kinship coefficient of his or her parents, since in meiosis an allele is randomly drawn from the genotype of a parent.
• Inbreeding inflates the variance of an additively coded genotype: $\mathrm{Var}(G) = \mathrm{Var}(G_P) + \mathrm{Var}(G_M) + 2\,\mathrm{Cov}(G_P, G_M)$
• Inbreeding also inflates the covariance between the additively coded genotypes of 2 individuals, since it is now possible for an allele in one person to be IBD with both alleles of the other person.

**Two-locus biometrical genetic model**

• Generalize the biometrical model to 2 loci
• This is necessary only when there is either correlation or interaction between the 2 loci; otherwise the loci can be considered separately
• Two-locus interactions include
  • second-order inter-locus interactions involving 1 allele from each locus, additive-additive (AA)
  • third-order interactions involving both alleles from one locus and 1 allele from the other, additive-dominance (AD)
  • fourth-order interactions involving both alleles from both loci, dominance-dominance (DD)
• For 2 loci, there are 3 × 3 = 9 genotypic groups (assuming no parent-of-origin effect). In principle, if we can write down the trait means and population frequencies of these 9 genotypic groups, then we can proceed with variance partitioning using a hierarchical ANOVA, when the two loci are not correlated.
This is straightforward by computer program but tedious by hand; see Sham (1997) Statistics in Human Genetics, Chapter 5.

**Covariance of epistatic components**

• The AA interaction between 2 alleles is shared by 2 individuals when the 2 alleles are both IBD, and not shared when at least one of them is not IBD. When the 2 loci are independent, the probability of sharing is the product of the proportions of IBD sharing at the 2 loci, $\pi_1 \pi_2$. For a particular class of relative pairs, the expected covariance is $E(\pi_1 \pi_2) = [E(\pi)]^2$
• Similarly, the expected covariance of the AD interactions for a class of relative pairs is $E(\pi)\,\mathrm{Prob}(\pi=1)$
• Finally, the expected covariance of the DD interactions for a class of relative pairs is $[\mathrm{Prob}(\pi=1)]^2$

**Covariance: general relative pair**

Including 2-locus interactions, the covariance for 2 relatives of a given class is:

Covariance $= E(\pi)V_A + \mathrm{Prob}(\pi=1)V_D + [E(\pi)]^2 V_{AA} + E(\pi)\,\mathrm{Prob}(\pi=1)V_{AD} + [\mathrm{Prob}(\pi=1)]^2 V_{DD}$

This can be further extended to epistasis involving more than 2 loci

**Genetic linkage - two-locus transmission**

• The correlation between 2 loci depends on "linkage"
• Given a heterozygous genotype Aa, the 2 possible haplotypes (A and a) are equally likely to be transmitted to an offspring (Mendel's first law)
• What about an individual heterozygous at two loci, AaBb: what are the probabilities of transmitting each of the 4 haplotypes AB, Ab, aB, ab?
• If segregation at the 2 loci is independent, then the transmission probability of each haplotype is ½ × ½ = ¼.
• This is true when the loci are on different chromosomes (Mendel's second law), but generally not when they are on the same chromosome.
• Which two types will be more likely to be transmitted?
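The independent case (each haplotype transmitted with probability ¼) and the departure from it for loci on the same chromosome can be sketched with a single parameter. This is only an illustration under stated assumptions: the parent is assumed to carry the haplotypes AB and ab, and `theta` is introduced here as the probability that a transmitted gamete is recombinant. The function name `gamete_probabilities` is hypothetical.

```python
def gamete_probabilities(theta):
    """Transmission probabilities of the 4 haplotypes from an AaBb parent.

    Assumptions (not from the slides): the parent carries haplotypes AB
    and ab, and theta is the probability that a gamete is recombinant.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    non_recombinant = (1.0 - theta) / 2.0  # AB or ab: the parental haplotypes
    recombinant = theta / 2.0              # Ab or aB: new allele combinations
    return {
        "AB": non_recombinant,
        "ab": non_recombinant,
        "Ab": recombinant,
        "aB": recombinant,
    }


independent = gamete_probabilities(0.5)  # independent segregation
linked = gamete_probabilities(0.1)       # hypothetical pair of nearby loci

print("independent:", independent)
print("linked:     ", linked)
```

With `theta = 0.5` all four haplotypes are equally likely, which matches the ½ × ½ = ¼ calculation. With a smaller `theta` the two haplotypes already present in the parent are transmitted more often than the other two.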
[Figure: an AaBb parent and the four possible gametes AB, Ab, aB and ab]

**Haplotypes and recombination**

[Figure: parental haplotypes A1Q1 and A2Q2; likely gametes (non-recombinants) A1Q1 and A2Q2; unlikely gametes (recombinants) A1Q2 and A2Q1]

• Haplotype = set of alleles inherited from the same parent
• Alleles that were inherited together from the previous generation are more likely to be transmitted together to the next generation, if the loci are on the same chromosome
• A gamete in which alleles of different parental origins are transmitted together is called "recombinant"
• The proportion of gametes of 2 loci that are recombinant is called the recombination fraction ($\theta$)
• Two loci are "linked" if their recombination fraction is less than 1/2

**Crossovers during meiosis**

• A chromosome inherited from a parent is usually not transmitted intact to an offspring
• Instead, crossovers between chromatids occur during meiosis, resulting in each transmitted chromosome being a hybrid of alternating segments of the paternal and maternal chromosomes

**Fully Informative Gametes**

• Recombinants and non-recombinants can be inferred in double backcross data.
• The offspring of the double backcross constitute fully informative gametes

[Figure: double backcross. AABB × aabb gives AaBb; AaBb × aabb gives offspring AaBb and aabb (non-recombinant) and Aabb and aaBb (recombinant)]

**Population haplotype frequencies**

• If there is no association between alleles of the two loci, then the frequency of each haplotype is equal to the product of the frequencies of its constituent alleles
• Two loci with such haplotype frequencies are said to be in linkage equilibrium

**Linkage Disequilibrium (LD)**

• Deviation of haplotype frequencies from the product of constituent allele frequencies is called linkage disequilibrium
• The deviation $D$ is a measure of linkage disequilibrium. With $p$, $q$ the frequencies of alleles A, a and $r$, $s$ the frequencies of alleles B, b, the haplotype frequencies are AB: $pr + D$, Ab: $ps - D$, aB: $qr - D$, ab: $qs + D$
• The normalized measure is $D' = D/D_{\max}$. When $D > 0$, $D$ cannot reach a value that makes either $ps - D < 0$ or $qr - D < 0$, so $D_{\max} = \min(ps, qr)$. A similar consideration applies when $D < 0$. $D' = 1$ implies that at least 1 of the 4 haplotypes is absent.
• The $r^2$ measure is $D^2/(pqrs)$ and represents the squared correlation between the alleles at the two loci of a haplotype, coded numerically. An $r^2$ of 1 implies that 2 of the 4 haplotypes are absent, and that the 2 loci have equal allele frequencies.

**Decay of LD through recombination**

• With recombination fraction $\theta$, the frequency of AB gametes in the next generation is $(1-\theta)(pr + D) + \theta\, pr = pr + (1-\theta)D$
• Thus, the LD measure $D$ decays by a factor of $(1-\theta)$ per generation.
• For unlinked loci, any LD will quickly decay to near 0, whereas for tightly linked loci, any LD will be maintained for many generations.
• In any case, once the haplotype frequency decays to $pr$, it will tend to stay at that frequency (other than random fluctuations), hence "linkage equilibrium"

| Gamete type | Probability | AB | Others |
|---|---|---|---|
| Non-recombinant | $1-\theta$ | $pr + D$ | $1 - (pr + D)$ |
| Recombinant | $\theta$ | $pr$ | $1 - pr$ |

**Impact of LD on biometrical model**

• Denote the additive genetic effects of loci 1 and 2 by $G_1$ and $G_2$, with additive variances $V_{1A}$ and $V_{2A}$ respectively.
• In the absence of LD, the variance of the total additive genetic effect $G = G_1 + G_2$ is simply $V_{1A} + V_{2A}$
• However, in the presence of LD, $G_1$ and $G_2$ are correlated, and the variance of $G$ becomes $V_{1A} + V_{2A} + 2\,\mathrm{Cov}(G_1, G_2) = V_{1A} + V_{2A} + 2r\sqrt{V_{1A}V_{2A}}$, where $r$ is the correlation between the trait-increasing alleles of the 2 loci.
• Denote the total additive effects of person 1 and person 2 as $(G_{11} + G_{12})$ and $(G_{21} + G_{22})$, respectively. The covariance between these total genetic effects is $\mathrm{Cov}(G_{11}, G_{21}) + \mathrm{Cov}(G_{12}, G_{22}) + \mathrm{Cov}(G_{11}, G_{22}) + \mathrm{Cov}(G_{12}, G_{21}) = E(\pi)\left[V_{1A} + V_{2A} + 2r\sqrt{V_{1A}V_{2A}}\right]$, where $\pi$ is the proportion of alleles shared IBD
• Thus the correlation between the additive effects remains unchanged at $E(\pi)$
• We do not attempt to address the impact of LD on dominance or epistasis.

**LD allows indirect association analysis**

• If the correlation between a trait and the un-genotyped causal locus is $\rho$ and the correlation between the causal locus and a genotyped marker locus is $r$, then the overall correlation between the trait and the genotyped marker locus is $\rho r$
• When $r$ is close to 1, testing the marker locus is almost equivalent to testing the causal locus; this makes indirect association feasible
• However, when $r$ is modest this results in a substantial reduction in the association signal, such that the sample size needs to be increased by a factor of $1/r^2$ to achieve the same statistical power as a direct association analysis of the causal SNP.

https://www.nature.com/articles/nrg1521
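The attenuation argument above reduces to two small calculations, sketched here for illustration. The function names, the symbol `rho` for the trait-causal correlation, and the example values (`rho = 0.1`, `r = 0.8`, a direct-analysis sample of 1000) are all hypothetical and not taken from the slides.

```python
import math


def marker_trait_correlation(rho, r):
    """Correlation between the trait and the marker.

    rho: trait to causal-locus correlation; r: causal-locus to marker correlation.
    """
    return rho * r


def sample_size_inflation(r):
    """Factor by which the sample size must grow for indirect association."""
    if r == 0:
        raise ValueError("r must be non-zero: an uncorrelated marker carries no signal")
    return 1.0 / r ** 2


def required_sample_size(n_direct, r):
    """Sample size for the marker analysis matching a direct analysis of size n_direct."""
    return math.ceil(n_direct * sample_size_inflation(r))


rho, r = 0.1, 0.8  # hypothetical values
attenuated = marker_trait_correlation(rho, r)
factor = sample_size_inflation(r)
n_marker = required_sample_size(1000, r)

print("trait-marker correlation:", attenuated)
print("sample size factor:", factor)
print("sample size needed instead of 1000:", n_marker)
```

The result is rounded up to a whole number of individuals, since a fractional sample cannot be collected and rounding down would leave the marker analysis slightly underpowered.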