Title: | Simulate DNA Alignments Using Node Substitutions |
---|---|
Description: | Simulate DNA sequences for the node substitution model. In the node substitution model, substitutions accumulate additionally during a speciation event, providing a potential mechanistic explanation for substitution rate variation. This package provides tools to simulate such a process, simulate a reference process with only substitutions along the branches, and provides tools to infer phylogenies from alignments. More information can be found in Janzen (2021) <doi:10.1093/sysbio/syab085>. |
Authors: | Thijs Janzen |
Maintainer: | Thijs Janzen <[email protected]> |
License: | GPL-3 |
Version: | 1.2.8 |
Built: | 2024-10-24 04:30:28 UTC |
Source: | https://github.com/thijsjanzen/nodesub |
Simulate DNA sequences for the node substitution model. In the node substitution model, substitutions accumulate additionally during a speciation event, providing a potential mechanistic explanation for substitution rate variation. This package provides tools to simulate such a process, simulate a reference process with only substitutions along the branches, and provides tools to infer phylogenies from alignments. More information can be found in Janzen (2021) <doi:10.1093/sysbio/syab085>.
Version History:
Version 1.2.7 - Removed beta calculation due to apTreeshape removal from
CRAN
Version 1.2.3 - Removed summary statistic tests for CRAN
Version 1.2.2 - Changed codedov links in README
Version 1.2.1 - Expanded depenency on RPANDA
Version 1.2 - Release on CRAN
Thijs Janzen Maintainer: Thijs Janzen <[email protected]>
Thijs Janzen, Folmer Bokma, Rampal S Etienne, Nucleotide Substitutions during Speciation may Explain Substitution Rate Variation, Systematic Biology, 2021; syab085
calculates the relative contribution of substitutions at the nodes
calc_fraction(phy = NULL, node_time = 0, model = "unlinked")
calc_fraction(phy = NULL, node_time = 0, model = "unlinked")
phy |
phylogenetic tree (optional) |
node_time |
time spent at the node |
model |
node substitution model |
expected fraction
calculates the required node time to obtain a desired fraction of substitutions at the node
calc_required_node_time(phy = NULL, s = 0.5, model = "unlinked")
calc_required_node_time(phy = NULL, s = 0.5, model = "unlinked")
phy |
phylogenetic tree |
s |
desired fraction |
model |
node substitution model, either "linked" or "unlinked". |
expected fraction
calculate summary statistics of a phylogenetic tree, compared with a reference tree. The following statistics are calculated: the beta statistic, gamma statistic, crown age, mean branch length, number of tips, the nLTT statistic and the laplacian difference, given by RPANDA's JSDtree. Because JSDtree can sometimes cause issues, some additional checks are performed to ensure that is possible to run this function.
calc_sum_stats(trees, true_tree, verbose = FALSE)
calc_sum_stats(trees, true_tree, verbose = FALSE)
trees |
a phyloList object containing multiple trees |
true_tree |
a phylo object containing the reference tree, preferably without extinct lineages. If extinct lineages are found, these are dropped. |
verbose |
verbose output if true (e.g. progressbars) |
list with two tibbles 1) containing the summary statistics of all trees and 2) containing the difference with the true tree
create a balanced tree out of branching times
create_balanced_tree(brts)
create_balanced_tree(brts)
brts |
vector of branching times |
phylo phylo object
function create an alignment with identical information content
create_equal_alignment( input_tree, sub_rate, alignment_result, sim_function = NULL, verbose = FALSE, node_time = NULL, input_alignment_type = "nodesub" )
create_equal_alignment( input_tree, sub_rate, alignment_result, sim_function = NULL, verbose = FALSE, node_time = NULL, input_alignment_type = "nodesub" )
input_tree |
phylogeny for which to generate alignment |
sub_rate |
substitution rate used in the original phylogeny |
alignment_result |
result of sim_normal, sim_linked or sim_unlinked |
sim_function |
function that accepts a tree, sequence length, rootsequence and substitution rate (in that order). Default is sim_normal |
verbose |
provide intermediate output |
node_time |
node time |
input_alignment_type |
was the input alignment simulated with a node substitution model or a normal substitution model? Used to calculate the twin mutation rate. Options are "nodesub" and "normal". |
list with four properties: 1) alignment: the alignment itself, 2) adjusted rate: the substitution rate used to obtain identical information content 3) total_accumulated_substitutions: the total number of substitutions accumulated. 4) total_node_substitutions: total number of substitutions accumulated on the nodes 5) total_branch_substitutions: total number of substitutions accumulated on the branches.
function create an alignment with identical information content, using the explicit method to simulate substitutions
create_equal_alignment_explicit( input_tree, sub_rate, alignment_result, verbose = FALSE )
create_equal_alignment_explicit( input_tree, sub_rate, alignment_result, verbose = FALSE )
input_tree |
phylogeny for which to generate alignment |
sub_rate |
substitution rate used in the original phylogeny |
alignment_result |
result of sim_normal, sim_linked or sim_unlinked |
verbose |
provide intermediate output |
list with four properties: 1) alignment: the alignment itself, 2) adjusted rate: the substitution rate used to obtain identical information content 3) total_accumulated_substitutions: the total number of substitutions accumulated. 4) total_node_substitutions: total number of substitutions accumulated on the nodes 5) total_branch_substitutions: total number of substitutions accumulated on the branches.
create an unbalanced tree out of branching times
create_unbalanced_tree(brts)
create_unbalanced_tree(brts)
brts |
vector of branching times |
phylo phylo object
estimate_marginal_models estimates the marginal likelihood of both the strict and the relaxed clock model, given the JC69 substitution model, using the NS package in BEAST, made available via the babette R package. The NS package performs nested sampling, and uses an MCMC approach to estimate the marginal likelihood. Sampling is performed until convergence of the MCMC chain. Unfortunately, currently the babette package is unavailable on CRAN, requiring installation through GitHub to enjoy the full functionality of this function.
estimate_marginal_models( fasta_filename, use_yule_prior = FALSE, rng_seed = 42, sub_rate = 1, verbose = FALSE )
estimate_marginal_models( fasta_filename, use_yule_prior = FALSE, rng_seed = 42, sub_rate = 1, verbose = FALSE )
fasta_filename |
file name of fasta file holding alignment for which the marginal likelihood is to be estimated |
use_yule_prior |
by default, a birth-death prior is used as tree prior, but if use_yule_prior is set to TRUE, a pure-birth prior will be used. |
rng_seed |
seed of pseudo-random number generator |
sub_rate |
substitution rate |
verbose |
boolean indicating if verbose intermediate output is to be generated |
data frame with marginal likelihoods and relative weights per clock model.
calculates the p matrix
get_p_matrix(branch_length, eig = phangorn::edQt(), rate = 1)
get_p_matrix(branch_length, eig = phangorn::edQt(), rate = 1)
branch_length |
branch length |
eig |
eigen object |
rate |
rate |
p matrix
infer the time calibrated phylogeny associated with the provided alignment. This function uses the R package babette to infer the phylogeny using BEAST2.
infer_phylogeny( alignment, treatment_name, tree_prior = beautier::create_bd_tree_prior(), clock_prior = beautier::create_strict_clock_model(), mcmc_seed = NULL, chain_length = 1e+07, sample_interval = 5000, burnin = 0.1, working_dir = NULL, sub_rate = 1 )
infer_phylogeny( alignment, treatment_name, tree_prior = beautier::create_bd_tree_prior(), clock_prior = beautier::create_strict_clock_model(), mcmc_seed = NULL, chain_length = 1e+07, sample_interval = 5000, burnin = 0.1, working_dir = NULL, sub_rate = 1 )
alignment |
Phydat object containing the focal alignment |
treatment_name |
string to be appended to BEAST files |
tree_prior |
tree prior used, default = birth-death prior |
clock_prior |
clock prior used, default = strict clock |
mcmc_seed |
seed of the mcmc chain, default is the system time |
chain_length |
length of the mcmc chain, default is 1e7. |
sample_interval |
interval of sampling, default is 5000 |
burnin |
burnin of posterior distribution |
working_dir |
beast2 working dir |
sub_rate |
substitution rate used to generate the original alignment (if available), default is 1 |
list with all trees, and the consensus tree
Function to remove speciation events occuring after an extinction event. Extinct species are pruned randomly, such that only a single extinct species per branching event (if any extinct species) remains.
reduce_tree(tree)
reduce_tree(tree)
tree |
phylo object |
pruned tree
simulate a sequence assuming conditional substitutions on the node.
sim_linked( phy, Q = rep(1, 6), rate = 0.1, node_mut_rate_double = 1e-09, l = 1000, bf = rep(0.25, 4), rootseq = NULL, node_time = 0.01 )
sim_linked( phy, Q = rep(1, 6), rate = 0.1, node_mut_rate_double = 1e-09, l = 1000, bf = rep(0.25, 4), rootseq = NULL, node_time = 0.01 )
phy |
tree for which to simulate sequences |
Q |
substitution matrix along the branches, default = JC |
rate |
mutation rate , default = 1 |
node_mut_rate_double |
mutation rate on the node, default = 1e-9 |
l |
number of base pairs to simulate |
bf |
base frequencies, default = c(0.25, 0.25, 0.25, 0.25) |
rootseq |
sequence at the root, simulated by default |
node_time |
time spent at the node |
list with four items
alignment Phydat object with the resulting alignment
rootseq the rootsequence used
total_branch_substitutions total number of substitutions accumulated on the branches
total_node_substitutions total number of substitutions accumulated at the nodes
simSeq
from the phangorn package.Simulate sequences for a given evolutionary tree, using a standard model
of sequence evolution along the branches. Code for this function was heavily
inspired by the function simSeq
from the phangorn package.
sim_normal(x, l = 1000, Q = NULL, bf = NULL, rootseq = NULL, rate = 1)
sim_normal(x, l = 1000, Q = NULL, bf = NULL, rootseq = NULL, rate = 1)
x |
a phylogenetic tree |
l |
length of the sequence to simulate. |
Q |
the rate matrix. |
bf |
base frequencies. |
rootseq |
a vector of length l containing the root sequence, other root sequence is randomly generated. |
rate |
mutation rate |
list with four items
alignment Phydat object with the resulting alignment
rootseq the rootsequence used
total_branch_substitutions total number of substitutions accumulated on the branches
total_node_substitutions total number of substitutions accumulated at the nodes
Klaus Schliep [email protected]
simulate a sequence assuming substitutions are only accumulated along the branches, using the explicit simulation method (e.g. reverse substitutions are modeled explicitly)
sim_normal_explicit(x, l = 1000, Q = NULL, bf = NULL, rootseq = NULL, rate = 1)
sim_normal_explicit(x, l = 1000, Q = NULL, bf = NULL, rootseq = NULL, rate = 1)
x |
a phylogenetic tree |
l |
length of the sequence to simulate. |
Q |
the rate matrix. |
bf |
base frequencies. |
rootseq |
a vector of length l containing the root sequence, other root sequence is randomly generated. |
rate |
mutation rate or scaler for the edge length, a numerical value greater than zero. |
list with four items
alignment Phydat object with the resulting alignment
rootseq the rootsequence used
total_branch_substitutions total number of substitutions accumulated on the branches
total_node_substitutions total number of substitutions accumulated at the nodes
Simulate a sequence assuming node substitutions are not shared amongst offspring, given two substitution matrices: one for substitutions occuring on the nodes, and one for substitutions occuring along the branches.
sim_unlinked( phy, Q1 = rep(1, 6), Q2 = rep(1, 6), rate1 = 0.1, rate2 = 0.1, l = 1000, bf = rep(0.25, 4), rootseq = NULL, node_time = 0.001 )
sim_unlinked( phy, Q1 = rep(1, 6), Q2 = rep(1, 6), rate1 = 0.1, rate2 = 0.1, l = 1000, bf = rep(0.25, 4), rootseq = NULL, node_time = 0.001 )
phy |
tree for which to simulate sequences |
Q1 |
substitution matrix along the branches, default = JC |
Q2 |
substitution matrix on the nodes, default = JC |
rate1 |
mutation rate along the branch, default = 0.1 |
rate2 |
mutation rate on the node, default = 0.1 |
l |
number of base pairs to simulate |
bf |
base frequencies, default = c(0.25, 0.25, 0.25, 0.25) |
rootseq |
sequence at the root, simulated by default |
node_time |
amount of time spent at the nodes |
list with four items
alignment Phydat object with the resulting alignment
rootseq the rootsequence used
total_branch_substitutions total number of substitutions accumulated on the branches
total_node_substitutions total number of substitutions accumulated at the nodes
Simulate a sequence assuming node substitutions are not shared amongst offspring, using the explicit simulation method (e.g. reverse substitutions are modeled explicitly)
sim_unlinked_explicit( phy, Q1 = rep(1, 6), Q2 = rep(1, 6), rate1 = 0.1, rate2 = 0.1, l = 1000, bf = rep(0.25, 4), rootseq = NULL, node_time = 0.001 )
sim_unlinked_explicit( phy, Q1 = rep(1, 6), Q2 = rep(1, 6), rate1 = 0.1, rate2 = 0.1, l = 1000, bf = rep(0.25, 4), rootseq = NULL, node_time = 0.001 )
phy |
phylogenetic tree for which to simulate sequences |
Q1 |
substitution matrix along the branches, default = JC |
Q2 |
substitution matrix on the nodes, default = JC |
rate1 |
mutation rate along the branch, default = 0.1 |
rate2 |
mutation rate on the node, default = 0.1 |
l |
number of base pairs to simulate |
bf |
base frequencies, default = c(0.25, 0.25, 0.25, 0.25) |
rootseq |
sequence at the root, simulated by default |
node_time |
amount of time spent at the nodes |
list with four items
alignment Phydat object with the resulting alignment
rootseq the rootsequence used
total_branch_substitutions total number of substitutions accumulated on the branches
total_node_substitutions total number of substitutions accumulated at the nodes
get_p_matrix
but provides a way to debug and verifythis function calculates the p matrix within R
this is slower than the C++ implementation in get_p_matrix
but provides a way to debug and verify
slow_matrix(eig, branch_length, rate)
slow_matrix(eig, branch_length, rate)
eig |
eigen object |
branch_length |
branch length |
rate |
substitution rate |
p matrix