Title: | The Breakdown of Genomic Ancestry Blocks in Hybrid Lineages |
---|---|
Description: | Individual based simulations of hybridizing populations, where the accumulation of junctions is tracked. Furthermore, mathematical equations are provided to verify simulation outcomes. Both simulations and mathematical equations are based on Janzen (2018, <doi:10.1101/058107>) and Janzen (2020, <doi:10.1101/2020.09.10.292441>). |
Authors: | Thijs Janzen [aut, cre] |
Maintainer: | Thijs Janzen <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.1.0 |
Built: | 2024-11-09 05:09:37 UTC |
Source: | https://github.com/thijsjanzen/junctions |
The theory of junctions is extended by this package by including the effect of a finite number of recombination sites along the chromosome. The package provides functions to calculate the estimated number of junctions, depending on the time since the onset of hybridization, population size, number of recombination sites, initial heterozygosity and the number of crossovers per meiosis.
This package provides individual based simulations in order to simulate the accumulation of junctions over time, both for chromosomes with a finite and an infinite number of recombination sites. Furthermore, the package provides mathematical tools to verify the outcomes of the individual based simulations.
Update version 2.1.0 : updated tbb::task_scheduler_init to tbb::global_control
Update version 2.0.2 : simplified some tests
Update version 2.0 : merged many functions with similar functionality, added vignette that provides overview of all functionality.
Update version 1.9 : added c++ versions of the unphased and phased likelihoods.
Update version 1.8 : added multithreading using the TBB library.
Update version 1.7 : further improved the recombination function following Hanno Hildenbrandt's suggestions
Update version 1.6 : improved the recombination function to run twice as fast
Update version 1.5.1: added option to track the true number of junctions
Update version 1.5: added support for inferring the time since admixture based on phased and unphased data. Also included are simulation functions to simulate appropriate data (e.g. phased and unphased).
Update version 1.4: added support for estimating the number of junctions, and simulating the number of junctions, under a backcrossing scheme, using the code supplied in Lavretsky et al. 2019.
Update version 1.3: added support for estimating the time since admixture using unphased data.
Update version 1.3: added individual based simulations returning phased and unphased data.
Update version 1.3: Updated entire package to Roxygen.
Update version 1.2: added support for estimating the expected number of junctions for arbitrarily distributed markers.
Update version 1.1: updated underlying random number generator for picking recombination sites. The previous generator had limited precision, which could generate duplicate recombination sites. This update fixes that
Maintainer: Thijs Janzen <[email protected]>
Janzen, T. , Nolte, A. W. and Traulsen, A. (2018), The breakdown of genomic ancestry blocks in hybrid lineages given a finite number of recombination sites. Evolution, 72: 735-750. doi:10.1111/evo.13436
Lavretsky, P, Janzen, T. and McCracken, KG. (2019) Identifying hybrids & the genomics of hybridization: Mallards & American black ducks of Eastern North America. Ecology and Evolution 9: 3470-3490. doi:10.1002/ece3.4981
Calculate the average number of junctions after an infinite number of generations, provided information on the initial heterozygosity, population size and the number of generations.
calc_k(N = Inf, R = Inf, H_0 = 0.5, C = 1)
calc_k(N = Inf, R = Inf, H_0 = 0.5, C = 1)
N |
population size |
R |
number of markers |
H_0 |
initial heterozygosity (at the time of admixture) |
C |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
The number of junctions for at time = infinity
k <- calc_k(N = 100, R = 1000, H_0 = 0.5, C = 1)
k <- calc_k(N = 100, R = 1000, H_0 = 0.5, C = 1)
Function that calculates the maximum time after hybridization after which the number of junctions can still be reliably used to estimate the onset of hybridization. This is following equation 15 in Janzen et al. 2018.
calculate_mat(N = Inf, R = Inf, H_0 = 0.5, C = 1)
calculate_mat(N = Inf, R = Inf, H_0 = 0.5, C = 1)
N |
Population Size |
R |
Number of genetic markers |
H_0 |
Frequency of heterozygosity at t = 0 |
C |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
The maximum accurate time
calculate_mat(N = Inf, R = 1000, H_0 = 0.5, C = 1)
calculate_mat(N = Inf, R = 1000, H_0 = 0.5, C = 1)
Estimate the time since the onset of hybridization, following equation 14 in Janzen et al. 2018
estimate_time(J = NA, N = Inf, R = Inf, H_0 = 0.5, C = 1)
estimate_time(J = NA, N = Inf, R = Inf, H_0 = 0.5, C = 1)
J |
The observed number of junctions |
N |
Population Size |
R |
Number of genetic markers |
H_0 |
Frequency of heterozygosity at t = 0 |
C |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
The number of generations passed since the onset of hybridization
cat("example calculate time") J <- number_of_junctions(N = 100, R = 1000, H_0 = 0.5, C = 1, t = 200) estimate_time(J = J, N = 100, R = 1000, H_0 = 0.5, C = 1) # should be 200 again
cat("example calculate time") J <- number_of_junctions(N = 100, R = 1000, H_0 = 0.5, C = 1, t = 200) estimate_time(J = J, N = 100, R = 1000, H_0 = 0.5, C = 1) # should be 200 again
Calculates the time since admixture, given unphased ancestry data.
estimate_time_diploid( ancestry_information, analysis_type = "individuals", phased = FALSE, pop_size = 1000, freq_ancestor_1 = 0.5, lower_lim = 2, upper_lim = 2000, num_threads = 1, verbose = FALSE )
estimate_time_diploid( ancestry_information, analysis_type = "individuals", phased = FALSE, pop_size = 1000, freq_ancestor_1 = 0.5, lower_lim = 2, upper_lim = 2000, num_threads = 1, verbose = FALSE )
ancestry_information |
a matrix with five columns: column 1) indicator of individual, column 2) indicator of chromosome, 3) location of marker in Morgan, 4) ancestry at chromosome 5) ancestry at chromosome 2. |
analysis_type |
how should the data be broken down? there are multiple options: "individuals" - time is inferred for each individual separately, grouping all chromosomes together that belong to the same individual. "chromosomes" - time is inferred for each chromosome separately, grouping chromosomes together belonging from separate individuals. "separate" - time is inferred for each chromosome from each individual separately, "all" - time is inferred jointly for all chromosomes and individuals, grouping all chromosomes and individuals together. |
phased |
is the data phased? |
pop_size |
population size |
freq_ancestor_1 |
Frequency of ancestor 1 at t = 0 |
lower_lim |
lower limit of the optimization algorithm. Increase if the expected admixture time is relatively ancient |
upper_lim |
upper limit of hte optimization algorithm. If set too large, recent admixture events can be overlooked - best to set as low as possible. |
num_threads |
num_threads, default is all threads. 5 threads is recommended. |
verbose |
display intermediate output? Default = FALSE |
Estimate the time since the onset of hybridization, for a haploid genome
estimate_time_haploid( ancestry_matrix, N = 1000, freq_ancestor_1 = 0.5, lower_lim = 2, upper_lim = 1000, verbose = FALSE )
estimate_time_haploid( ancestry_matrix, N = 1000, freq_ancestor_1 = 0.5, lower_lim = 2, upper_lim = 1000, verbose = FALSE )
ancestry_matrix |
matrix with 3 columns, column 1 = chromosome, column 2 = location in Morgan, column 3 = ancestry. |
N |
Population Size |
freq_ancestor_1 |
Frequency of ancestor 1 at t = 0 |
lower_lim |
lower limit of the optimization algorithm. Increase if the expected admixture time is relatively ancient |
upper_lim |
upper limit of the optimization algorithm. If set too large, recent admixture events can be overlooked - best to set as low as possible. |
verbose |
return verbose output |
The number of generations passed since the onset of hybridization
Estimate the time since the onset of hybridization, following equation 1 in Janzen et al. unpublished
estimate_time_one_chrom( J = NA, N = Inf, H_0 = 0.5, marker_distribution = NA, lower_lim = 2, upper_lim = 1000 )
estimate_time_one_chrom( J = NA, N = Inf, H_0 = 0.5, marker_distribution = NA, lower_lim = 2, upper_lim = 1000 )
J |
The observed number of junctions |
N |
Population Size |
H_0 |
Frequency of heterozygosity at t = 0 |
marker_distribution |
A vector containing the position of all markers in Morgan. |
lower_lim |
lower limit of the optimization algorithm. Increase if the expected admixture time is relatively ancient |
upper_lim |
upper limit of the optimization algorithm. If set too large, recent admixture events can be overlooked - best to set as low as possible. |
The number of generations passed since the onset of hybridization
cat("example estimate time one chrom") markers <- seq(from = 0, to = 1, length.out = 100) J <- number_of_junctions_markers(N = 100, H_0 = 0.5, t = 200, marker_distribution = markers) estimate_time_one_chrom(J = J, N = 100, H_0 = 0.5, marker_distribution = markers) #should be 200 again
cat("example estimate time one chrom") markers <- seq(from = 0, to = 1, length.out = 100) J <- number_of_junctions_markers(N = 100, H_0 = 0.5, t = 200, marker_distribution = markers) estimate_time_one_chrom(J = J, N = 100, H_0 = 0.5, marker_distribution = markers) #should be 200 again
Calculates the log likelihood of observing the phased data, given the population size, initial heterozygosity and time since admixture
log_likelihood_diploid( local_anc_matrix, pop_size, freq_ancestor_1 = 0.5, t, phased = FALSE, num_threads = 1 )
log_likelihood_diploid( local_anc_matrix, pop_size, freq_ancestor_1 = 0.5, t, phased = FALSE, num_threads = 1 )
local_anc_matrix |
a matrix with four columns: column 1) chromosome indicator, 2) location of marker in Morgan on respective chromosome 3) ancestry at chromosome 4) ancestry at chromosome 2. |
pop_size |
population size |
freq_ancestor_1 |
Frequency of ancestor 1 at t = 0 |
t |
time since admixture |
phased |
is the data phased or not? default is false. |
num_threads |
number of threads, default is one thread. Set to -1 to use all available threads. |
log likelihood
log likelihood of the time since admixture for a set of single chromosomes (for ex. in Yeast).
log_likelihood_haploid(ancestry_matrix, N = 1000, freq_ancestor_1 = 0.5, t = 2)
log_likelihood_haploid(ancestry_matrix, N = 1000, freq_ancestor_1 = 0.5, t = 2)
ancestry_matrix |
matrix with 3 columns, column 1 = chromosome, column 2 = location in Morgan, column 3 = ancestry. |
N |
Population Size |
freq_ancestor_1 |
Frequency of ancestor 1 at t = 0 |
t |
time since admixture |
loglikelihood
Calculate the average number of junctions in a single chromosome after t generations, provided information on the initial heterozygosity, population size and the number of generations.
number_of_junctions(N = Inf, R = Inf, H_0 = 0.5, C = 1, t = 100)
number_of_junctions(N = Inf, R = Inf, H_0 = 0.5, C = 1, t = 100)
N |
Population Size |
R |
Number of genetic markers |
H_0 |
Frequency of heterozygosity at t = 0 |
C |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
t |
Time since admixture |
Estimated number of junctions at time t
jt <- number_of_junctions(N = 100, R = 1000, H_0 = 0.5, C = 1, t = 1000) jt2 <- number_of_junctions(N = 100, R = 1000, H_0 = 0.5, C = 1, t = 0:1000)
jt <- number_of_junctions(N = 100, R = 1000, H_0 = 0.5, C = 1, t = 1000) jt2 <- number_of_junctions(N = 100, R = 1000, H_0 = 0.5, C = 1, t = 0:1000)
Calculate the expected number of junctions after t generations, in a backcrossing mating scheme.
number_of_junctions_backcross(H_0 = 0.5, C = 1, t = 100)
number_of_junctions_backcross(H_0 = 0.5, C = 1, t = 100)
H_0 |
Frequency of heterozygosity at t = 0 |
C |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
t |
Time since admixture |
Estimated number of junctions at time t
cat("example number of junctions backcross") jt <- number_of_junctions_backcross(H_0 = 0.1, C = 1, t = 5)
cat("example number of junctions backcross") jt <- number_of_junctions_backcross(H_0 = 0.1, C = 1, t = 5)
Calculate the expected number of junctions after t generations, provided information on the initial heterozygosity, population size, the number of generations since the onset of admixture and the distance between two markers.
number_of_junctions_di(N = Inf, H_0 = 0.5, t = 100, di = 1e-06)
number_of_junctions_di(N = Inf, H_0 = 0.5, t = 100, di = 1e-06)
N |
Population Size |
H_0 |
Frequency of heterozygosity at t = 0 |
t |
Time since admixture |
di |
Distance between two markers in Morgan |
Estimated number of junctions at time t
number_of_junctions_di(N = 100, H_0 = 0.5, t = 1000, di = 0.01)+
number_of_junctions_di(N = 100, H_0 = 0.5, t = 1000, di = 0.01)+
Calculate the expected number of junctions after t generations, provided information on the initial heterozygosity, population size, the number of generations since the onset of admixture and the distribution of markers.
number_of_junctions_markers( N = Inf, H_0 = 0.5, t = 100, marker_distribution = NA )
number_of_junctions_markers( N = Inf, H_0 = 0.5, t = 100, marker_distribution = NA )
N |
Population Size |
H_0 |
Frequency of heterozygosity at t = 0 |
t |
Time since admixture |
marker_distribution |
A vector containing the position of all markers in Morgan. |
Estimated number of observed junctions at time t
markers <- seq(from = 0, to = 1, length.out = 1000) jt <- number_of_junctions_markers(N = 100, H_0 = 0.5, t = 1000, marker_distribution = markers) random_markers <- sort(runif(1000, 0, 1)) jt2 <- number_of_junctions_markers(N = 100, H_0 = 0.5, t = 1000, marker_distribution = random_markers)
markers <- seq(from = 0, to = 1, length.out = 1000) jt <- number_of_junctions_markers(N = 100, H_0 = 0.5, t = 1000, marker_distribution = markers) random_markers <- sort(runif(1000, 0, 1)) jt2 <- number_of_junctions_markers(N = 100, H_0 = 0.5, t = 1000, marker_distribution = random_markers)
Individual based simulation of the accumulation of junctions, under a back crossing scheme
sim_backcrossing( population_size = 100, freq_ancestor_1 = 0.5, total_runtime = 5, size_in_morgan = 1, number_of_markers = 100, seed = 6, time_points = -1 )
sim_backcrossing( population_size = 100, freq_ancestor_1 = 0.5, total_runtime = 5, size_in_morgan = 1, number_of_markers = 100, seed = 6, time_points = -1 )
population_size |
Population size |
freq_ancestor_1 |
Frequency of ancestor 1 at t = 0 |
total_runtime |
Number of generations to simulate |
size_in_morgan |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
number_of_markers |
number of molecular markers |
seed |
Seed of the pseudo-random number generator |
time_points |
vector with time points at which local ancestry has to be recorded to be returned at the end of the simulation. If left at -1, ancestry is recorded at every generation (computationally heavy). |
List with five entries: average_junctions: average number of junctions over time, detected_junctions: average number of detected junctions, given the markers. markers: vector with the locations of the molecular markers, junction_distribution: distribution of junctions per time step average_heterozygosity: average heterozygosity.
sim_backcrossing(population_size = 100, total_runtime = 5, size_in_morgan = 1, number_of_markers = 100, seed = 6, time_points = 1:5)
sim_backcrossing(population_size = 100, total_runtime = 5, size_in_morgan = 1, number_of_markers = 100, seed = 6, time_points = 1:5)
Individual based simulation of the accumulation of junctions for a chromosome with regularly distributed markers.
sim_fin_chrom( pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 100, morgan = 1, seed = 42, R = 100 )
sim_fin_chrom( pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 100, morgan = 1, seed = 42, R = 100 )
pop_size |
Population Size |
freq_ancestor_1 |
Frequency of ancestor 1 at t = 0 |
total_runtime |
Maximum time after which the simulation is to be stopped |
morgan |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
seed |
Seed of the pseudo-random number generator |
R |
Number of regularly distributed markers |
avgJunctions |
vector of the average number of junctions at time = [0, total_runtime] |
cat("example sim_fin_chrom") sim_fin_chrom(pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 10, morgan = 1, seed = 42, R = 100)
cat("example sim_fin_chrom") sim_fin_chrom(pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 10, morgan = 1, seed = 42, R = 100)
Individual based simulation of the accumulation of junctions for a chromosome with an infinite number of recombination sites.
sim_inf_chrom( pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 100, morgan = 1, markers = -1, seed = 42 )
sim_inf_chrom( pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 100, morgan = 1, markers = -1, seed = 42 )
pop_size |
Population Size |
freq_ancestor_1 |
Frequency of ancestor 1 at t = 0 |
total_runtime |
Maximum time after which the simulation is to be stopped |
morgan |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
markers |
The number of genetic markers superimposed on the chromosome. If markers is set to -1, no markers are superimposed (faster simulation) |
seed |
Seed of the pseudo-random number generator |
avgJunctions |
vector of the average number of junctions at time = [0, total_runtime] |
cat("example sim inf chrom") v <- sim_inf_chrom(pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 10, morgan = 1, markers = 100, seed = 42) plot(v$avgJunctions, type = "l", xlab = "Generations", ylab = "Number of Junctions", main = "Example Infinite Chromosome") lines(v$detectedJunctions, col = "blue") legend("bottomright", c("Real number","Number detected"), lty = 1, col = c("black", "blue"))
cat("example sim inf chrom") v <- sim_inf_chrom(pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 10, morgan = 1, markers = 100, seed = 42) plot(v$avgJunctions, type = "l", xlab = "Generations", ylab = "Number of Junctions", main = "Example Infinite Chromosome") lines(v$detectedJunctions, col = "blue") legend("bottomright", c("Real number","Number detected"), lty = 1, col = c("black", "blue"))
Individual based simulation of the accumulation of junctions, returning phased and unphased data. Ancestry on both chromosomes of 10 randomly sampled individuals per generations is returned.
sim_phased_unphased( pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 100, size_in_morgan = 1, markers = 100, time_points = -1, num_threads = 1, verbose = FALSE, record_true_junctions = FALSE, num_indiv_sampled = 10, coverage = 1, error_rate = 0 )
sim_phased_unphased( pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 100, size_in_morgan = 1, markers = 100, time_points = -1, num_threads = 1, verbose = FALSE, record_true_junctions = FALSE, num_indiv_sampled = 10, coverage = 1, error_rate = 0 )
pop_size |
Population Size |
freq_ancestor_1 |
Frequency of ancestor 1 at t = 0 |
total_runtime |
Maximum time after which the simulation is to be stopped |
size_in_morgan |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
markers |
If a single number is provided, the number is used as the total number of markers generated either randomly, or using a regular distribution (a regular distribution is chosen if the number is negative). If a vector is provided, that vector is used. |
time_points |
vector with time points at which local ancestry has to be recorded to be returned at the end of the simulation. If left at -1, ancestry is recorded at every generation (computationally heavy). |
num_threads |
default is 1. -1 takes all available threads. |
verbose |
displays a progress bar |
record_true_junctions |
keep track of the true number of junctions? |
num_indiv_sampled |
the number of individuals sampled at each time point to be genotyped |
coverage |
fraction of markers that can be succesfully phased |
error_rate |
fraction of markers that are erroneously phased (e.g. swapped) |
a tibble with five columns: [time, individual, marker location, ancestry chromosome 1, ancestry chromosome 2]
## Not run: sim_phased_unphased(pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 10, size_in_morgan = 1, markers = 10, time_points = c(0, 5, 10), num_threads = 1) ## End(Not run)
## Not run: sim_phased_unphased(pop_size = 100, freq_ancestor_1 = 0.5, total_runtime = 10, size_in_morgan = 1, markers = 10, time_points = c(0, 5, 10), num_threads = 1) ## End(Not run)
Calculate the error in the estimate of the onset of hybridization, following Equations 3 & 4 in the Supplementary information of Janzen et al. 2018.
time_error(t = NA, N = Inf, R = Inf, H_0 = 0.5, C = 1, relative = TRUE)
time_error(t = NA, N = Inf, R = Inf, H_0 = 0.5, C = 1, relative = TRUE)
t |
Inferred time |
N |
Population Size |
R |
Number of genetic markers |
H_0 |
Frequency of heterozygosity at t = 0 |
C |
Mean number of crossovers per meiosis (e.g. size in Morgan of the chromosome) |
relative |
Boolean flag, if TRUE: return the relative error, if FALSE: return error in generations |
Expected error in the time estimate