## Robust and scalable inference of population history from hundreds of unphased whole genomes (Nature Genetics)

It has recently been demonstrated that inference methods based on genealogical processes with recombination can uncover past population history in unprecedented detail. However, these methods scale poorly with sample size, limiting resolution in the recent past, and they require phased genomes, which contain switch errors that can catastrophically distort the inferred history. Here we present SMC++, a new statistical tool capable of analyzing orders of magnitude more samples than existing methods while requiring only unphased genomes (its results are independent of phasing).

SMC++ can jointly infer population size histories and split times in diverged populations, and it employs a novel spline regularization scheme that greatly reduces estimation error. We apply SMC++ to analyze sequence data from over a thousand human genomes in Africa and Eurasia, hundreds of genomes from a Drosophila melanogaster population in Africa, and tens of genomes from zebra finch and long-tailed finch populations in Australia.

Two populations were simulated under the recent expansion demography. Each population consisted of n = 10 lineages. Different colors correspond to different divergence times. From the point of divergence until the present, population 2 maintains a constant effective population size equal to the one it had at the time of the split. The solid colored lines represent the inferred demographies for population 1, which should follow the solid black line corresponding to the simulated demography. The dashed colored lines represent the inferred demographies for population 2, which should be flat from the time of the split onward. The vertical dotted lines represent the true values of the splits, whereas solid dots in corresponding colors correspond to the values of the inferred split times. These results show that our method is able to infer divergence times with low error over a wide range of split times, spanning approximately 6,000 to 120,000 years. kya, thousand years ago.
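The relationship between the two simulated size histories in this caption can be stated compactly: backward in time, population 2 has the size that population 1 had at the split until the split is reached, and shares population 1's history before that. The following is a minimal sketch of that rule, assuming a piecewise-constant history for population 1. The function names, the time units, and the numeric epochs are hypothetical illustrations, not values or code from the paper.

```python
from bisect import bisect_right

def size_at(history, t):
    """Size of population 1 at time t (measured backward from the present).

    `history` is a list of (epoch_start_time, size) pairs sorted by start
    time, with the first epoch starting at 0. Epoch starts are inclusive.
    """
    starts = [start for start, _ in history]
    return history[bisect_right(starts, t) - 1][1]

def population2_size(history, split_time, t):
    """Size of population 2 at time t.

    More recently than the split, population 2 stays at the size population 1
    had at the split; at or before the split the two share one history.
    """
    return size_at(history, max(t, split_time))

# Hypothetical recent-expansion history (illustrative numbers only).
history = [(0, 100_000), (1_000, 10_000), (5_000, 20_000)]
split_time = 2_000

recent_pop1 = size_at(history, 500)                        # expanded size
recent_pop2 = population2_size(history, split_time, 500)   # flat since split
ancient_pop2 = population2_size(history, split_time, 6_000)
```

With these illustrative inputs, population 2 misses the recent expansion entirely, which is the flat dashed trajectory the caption says the inferred curves should follow.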

Each step plot represents inference on a single simulated data set with sample size n = 50. The colors of the estimated size histories correspond to the ratio of recombination to mutation used in each simulation, which was not known to SMC++ during model fitting. The ratio ranged from 1:10 (black) to 10:1 (light blue). The true demography used for simulation is shown in bold black. The nested scatterplot compares the true versus estimated ratio of recombination to mutation rates. The mutation rate θ/2 was assumed to be known. SMC++ is able to fairly accurately estimate the recombination rate over two orders of magnitude with respect to the mutation rate and is most accurate when the mutation and recombination rates are approximately equal.

Blue lines are reproduced from Figure 5. Red lines represent the result of randomly downsampling the data to contain 90% of the original set of chromosomes and rerunning the analysis.
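The downsampling step in this robustness check can be sketched as drawing a random subset of chromosomes without replacement. This is an illustrative sketch only: the function name, the integer chromosome labels, and the seed are assumptions, and the caption does not specify how the authors drew their subset.

```python
import random

def downsample_chromosomes(chromosome_ids, fraction=0.9, seed=None):
    """Return a random subset holding `fraction` of the chromosomes.

    Sampling is without replacement, so no chromosome appears twice.
    """
    ids = list(chromosome_ids)
    rng = random.Random(seed)           # local generator for reproducibility
    keep = round(fraction * len(ids))   # number of chromosomes to retain
    return sorted(rng.sample(ids, keep))

# Hypothetical sample of 100 haploid chromosomes labelled 0..99.
subset = downsample_chromosomes(range(100), fraction=0.9, seed=1)
```

The analysis would then be rerun on `subset` and its estimates compared with those from the full data.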

The HMM used in PSMC tracks the hidden time to most recent common ancestor (TMRCA) of a pair of haploid lineages and emits binary symbols based on the heterozygosity of this pair at each block of sites. MSMC tracks the hidden time to first coalescence among several haploid lineages, as well as the identity (denoted by the bolded bars) of the two lineages that coalesce first. It considers as emissions the allelic state of all lineages in the sample. SMC++, like PSMC, tracks the TMRCA in only a pair of individuals and emits 2-tuples whose distribution is given by the conditioned SFS (Section S1).
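To make the PSMC-style emission concrete, the sketch below turns a pair of haploid sequences into one binary symbol per block: 1 if the pair differs anywhere in the block, 0 otherwise. It is a simplified illustration, not PSMC's implementation: the function name and the default block size are assumptions, and missing data is ignored.

```python
def heterozygosity_blocks(hap_a, hap_b, block_size=100):
    """Emit one binary symbol per block of sites for a pair of haplotypes.

    A block is coded 1 when the two haplotypes differ at any site in it
    (the pair is heterozygous there) and 0 when they are identical.
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must have the same length")
    symbols = []
    for start in range(0, len(hap_a), block_size):
        block_a = hap_a[start:start + block_size]
        block_b = hap_b[start:start + block_size]
        symbols.append(int(block_a != block_b))
    return symbols

# Toy example with 4-site blocks; the second block holds one difference.
symbols = heterozygosity_blocks("ACGTACGTACGT", "ACGTACCTACGT", block_size=4)
print(symbols)  # [0, 1, 0]
```

The resulting symbol sequence plays the role of the observations from which an HMM of this kind infers the hidden TMRCA along the genome.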

