Weekly Seminar: Amy Williams

October 31, 2024


When: November 7th @ 11am

Where: TMCB 1170

Talk Title: Reconstructing parental genomes and near-perfect phasing using data from millions of people

The advent of large genotyped cohorts from genetic testing companies and biobanks has opened the door to a host of analyses, and these cohorts implicitly include data for massive numbers of relatives. Genetic relatives share identity-by-descent (IBD) segments they inherited from common ancestors, and several methods have been developed to reconstruct ancestors’ DNA from relatives. We present HAPI-RECAP, a tool that reconstructs the DNA of parents from full siblings and their relatives. Given data for one parent, phasing alone with HAPI2 reconstructs large fractions of the missing parent’s DNA, between 77.6% and 99.97% among all families, and 90.3% on average in three- and four-child families. When reconstructing both parents, HAPI-RECAP infers between 33.2% and 96.6% of the parents’ genotypes, averaging 70.6% in four-child families. Reconstructed genotypes have average error rates < 10⁻³, comparable to those from direct genotyping. Besides relatives, massive genetic studies enable precise haplotype inference. We benchmarked state-of-the-art methods on > 8 million diverse, research-consented 23andMe, Inc. customers and the UK Biobank (UKB), finding that both perform exceptionally well. Beagle’s median switch error rate (after excluding single SNP switches) in white British trios from UKB is 0.026% compared to 0.00% for European ancestry 23andMe research participants; 55.6% of European ancestry 23andMe participants have zero non-single SNP switches, compared to 42.4% of white British trios in UKB. SHAPEIT and Beagle excel at ‘intra-chromosomal’ phasing but lack the ability to phase across chromosomes, motivating us to develop an inter-chromosomal phasing method called HAPTIC (HAPlotype TIling and Clustering) that assigns paternal and maternal variants discretely genome-wide. Our approach uses IBD segments to phase blocks of variants on different chromosomes.
We ran HAPTIC on 1022 UKB trio children, yielding a median phase error of 0.08% in regions covered by IBD segments (33.5% of sites), and on 23andMe trio children, finding a median phase error of 0.92% in Europeans (93.8% of sites) and 0.09% in admixed Africans (92.7% of sites). HAPTIC’s precision depends heavily on data from relatives, so it will increase as datasets grow larger and more diverse. HAPTIC and HAPI-RECAP enable analyses that require the parent-of-origin of variants, such as association studies and ancestry inference of untyped parents.

Amy L. Williams is a Senior Scientist at 23andMe in the Product Research and Development department. Prior to joining industry in 2022, she was an Associate Professor of Computational Biology at Cornell University. She received her PhD (2010) and SM (2005) degrees in Computer Science from the Massachusetts Institute of Technology, and a BS (2003) in Computer Science and in Mathematics from the University of Utah. From 2009 to 2013 she worked as a postdoctoral research fellow at Harvard Medical School, and from 2013 to 2014 she was a postdoctoral research associate at Columbia University. Her research interests span the intersection of computer science and genetics, and she is especially interested in characterizing genetic relatives in large datasets.