Using QMaker - NQMaker to iteratively optimize and generate new substitution models #380

000generic · 2024-12-15T20:09:14Z

000generic
Dec 15, 2024

Hi!

I have a series questions regarding the generation of new amino acid substitution models using IQTree (IQ-TREE multicore version 2.3.6 for Linux x86 64-bit built Aug 4 2024) - and would greatly appreciate any guidance you might have.

I read in another issue here that two iterations of QMaker/NQMake model building is useful/generally sufficient for generating an optimized new model - and I am wondering how to best implement this - and how to best generate new substitution models in general using IQTree.

I am able to generate an initial model (Iteration 1 model_i1) following the IQTree instructions here:

QMaker/NQMaker examples

Specifically for time-reversible and time-non-reversible versions of Iteration 1 model_i1, I ran:

Step 1: iqtree2 -S output/2-output  --prefix output/3-output/3-model_training -seed 1 -mset LG,WAG,JTT -cmax 4 -T Auto --threads-max 50

Step 2 time reversible: iqtree2 -S output/3-output/3-model_training.best_model.nex -te output/3-output/3-model_training.treefile -seed 1 --model-joint GTR20+FO --init-model LG --prefix output/4-output/4-model_joint_reversible.GTR20 -T Auto --threads-max 50

Step 2 time non-reversible: iqtree2 -S output/3-output/3-model_training.best_model.nex -te output/3-output/3-model_training.treefile  -seed 1 --model-joint NONREV+FO --init-model LG --prefix output/5-output/5-model_joint_non_reversible.NONREV -T Auto --threads-max 50

Question 1:
I set -cmax to 4 based on the IQTree example that has -cmax 4 - but I anticipate C20 or C60 might be useful for some downstream/later data sets (long alignments and distant species). Is it better in my case to set -cmax to say 60 at Step 1? Was -cmax 4 set low (below default of 10 I think) to reduce computational load/time? I have reasonable computing resources - up to 128 CPUs with 968 Gb RAM on a node and up to 6 nodes if I use mpi. So if -cmax 4 was to reduce computation load - and -cmax 60 might be better for my data, I could reasonably implement this I think - but not sure if this is the correct way to think about things.

Question 2:
I set -mset to LG,WAG,JTT based on the IQTree example that has -mset LG,WAG,JTT. My target species for tree building are across animal phyla and unicellular outgroups. Given the improvement of Q, NQ, and ELM models over LG for major animal groups in recent Q, NQ, and GTRpmix papers - I am wondering if it makes sense in my case to include these models with -mset in Step 1 expanded to say -mset LG,WAG,JTT,Q.pfam,NQ.pfam,ELM. Or what would be recommended / how to assess think about this?

Question 3:
If I do expand -mset to include additional models like Q.pfam, NQ.pfam, and ELM in Step 1 - what would be the appropriate --model-joint to use in Step 2? I used the IQTree example setting of --model-joint GTR20+FO for time-reversible and --model-joint NONREV+FO for time non-reversible - but no idea if these would still be appropriate or optimal with the addition of Q.pfam,NQ.pfam, and/or ELM in Step 1.

Question 4:
If I do expand -mset to include additional models like Q.pfam, NQ.pfam, and ELM in Step 1 - what would be the appropriate --init-model to use in Step 2? I used the IQtree example of --init-model LG. However, I was thinking to replace LG with ELM in this case, as the GTRpmix paper indicated it was better for major animal groups - so would run --init-model ELM at Step 2, give an expansion of -mset LG,WAG,JTT,Q.pfam,NQ.pfam,ELM in Step 1.

For Iteration 2 model_i2 to further optimize the initial model, I am trying the following:

Step 1 time reversible: iqtree2 -S output/2-output  --prefix output/7-output/7-model_training -seed 1 -mset output/6-output/rQ.model_i1 -cmax 4 -T Auto --threads-max 50

Step 1 time non-reversible: iqtree2 -S output/2-output  --prefix output/8-output/8-model_training -seed 1 -mset output/6-output/nrQ.model_i1 -cmax 4 -T Auto --threads-max 50

Step 2 time reversible: iqtree2 -S output/7-output/7-model_training.best_model.nex -te output/7-output/7-model_training.treefile -seed 1 --model-joint GTR20+FO --init-model LG --prefix output/9-output/9-model_joint_reversible.GTR20 -T Auto --threads-max 50

Step 2 time non-reversible: iqtree2 -S output/8-output/8-model_training.best_model.nex -te output/8-output/8-model_training.treefile  -seed 1 --model-joint NONREV+FO --init-model LG --prefix output/10-output/10-model_joint_non_reversible.NONREV -T Auto --threads-max 50

Question 5
Here again, I have used settings from the IQTree example but I have questions similar to those above for Iteration 1 - specifically, what would be appropriate / optimal for Step 1 -mset and Step 2 --model-joint and --init-model, given that I am now running things using Iteration 1 model_i1 in Step 1 -mset.

I apologize for all the questions and specifics - and really appreciate your time and any guidance you might have.

Thank you :) Eric

roblanf · 2024-12-16T01:59:49Z

roblanf
Dec 16, 2024
Maintainer

Hi @000generic,

I'll try to answer your questions as best I can. I'll be brief because it's the end of the year and there's a rush to get things done here...

Question 1:
I set -cmax to 4 based on the IQTree example that has -cmax 4 - but I anticipate C20 or C60 might be useful for some downstream/later data sets (long alignments and distant species). Is it better in my case to set -cmax to say 60 at Step 1? Was -cmax 4 set low (below default of 10 I think) to reduce computational load/time? I have reasonable computing resources - up to 128 CPUs with 968 Gb RAM on a node and up to 6 nodes if I use mpi. So if -cmax 4 was to reduce computation load - and -cmax 60 might be better for my data, I could reasonably implement this I think - but not sure if this is the correct way to think about things.

Here you seem to be misunderstanding -cmax. -cmax refers to the maximum number of rate categories for the FreeRate model. So, higher is always better if you can afford the compute time. Lower numbers are usually OK if you're inferring models from a set of e.g. single-locus alignments, each with their own model and tree. But if you have a large concatenated alignment with one rate model across the whole alignment, larger values for -cmax will definitely be better.

C60 refers to profile mixture models. If you want to infer matrices using profile mixture models you should read the phylobayes papers, but also the recent GTRpmix paper here: https://academic.oup.com/mbe/article/41/9/msae174/7735827 That paper describes how to estimate a GTR model under the C60 profile mixture, and why it's better to do that than infer a GTR model with e.g. Qmaker (or just use e.g. LG or WAG) and then add C60 to that.

Question 2:
I set -mset to LG,WAG,JTT based on the IQTree example that has -mset LG,WAG,JTT. My target species for tree building are across animal phyla and unicellular outgroups. Given the improvement of Q, NQ, and ELM models over LG for major animal groups in recent Q, NQ, and GTRpmix papers - I am wondering if it makes sense in my case to include these models with -mset in Step 1 expanded to say -mset LG,WAG,JTT,Q.pfam,NQ.pfam,ELM. Or what would be recommended / how to assess think about this?

A longer list with better models is always better. This list determines how good your initial set of trees and branch lengths is for inferring the Q matrix.

Question 3:
If I do expand -mset to include additional models like Q.pfam, NQ.pfam, and ELM in Step 1 - what would be the appropriate --model-joint to use in Step 2? I used the IQTree example setting of --model-joint GTR20+FO for time-reversible and --model-joint NONREV+FO for time non-reversible - but no idea if these would still be appropriate or optimal with the addition of Q.pfam,NQ.pfam, and/or ELM in Step 1

I think this is fine, but I would like @bqminh or @thomaskf to weigh in here too. As above, all you are doing with -mset is setting up the trees and branch length for each locus, so I don't see any real issue here. One thing you can do (and that I would do) if you are unsure - run it this way, and the other way (with e.g. just LG in the mset), and compare the Q matrix you get at the end. If you run a few iterations, the two Q matrices you get should be very similar (identical, ideally). As you are really just optimising the same model from different starting points. They may not be quite identical - that depends a bit on the variance in your loci in terms of how well they can all be approximated by a single Q matrix, and what's left to mop up by models in the mset.

Question 4:
If I do expand -mset to include additional models like Q.pfam, NQ.pfam, and ELM in Step 1 - what would be the appropriate --init-model to use in Step 2? I used the IQtree example of --init-model LG. However, I was thinking to replace LG with ELM in this case, as the GTRpmix paper indicated it was better for major animal groups - so would run --init-model ELM at Step 2, give an expansion of -mset LG,WAG,JTT,Q.pfam,NQ.pfam,ELM in Step 1.

The best --init-model is the one that you guess is closest (in terms of parameter values) to the model you are trying to estimate. I may be wrong here, but my best guesses would be something like

Q.pfam for --model-joint GTR20+FO
NQ.pfam' for --model-joint NONREV+FO`
ELM if you are following the methods in the GTRpmix paper

In each case, these most closely match (by my guess at least) the model you are aiming for in your estimation, and that should help the optimiser in terms of speed and accuracy.

In an ideal world (i.e. where the optimisers find the global optimum every time, regardless of starting conditions) the initial conditions shouldn't matter to the final parameter estimates. This will be particularly the case if you run multiple iterations (as we do in the Qmaker papers), i.e. where the first init model is your best guess as above, but the in subsequent iterations the init model is the model you output from the previous iteration. In the QMaker papers we keep going until the model pretty much stops changing (I think we use a pearson correlation of >99.9% or something similar - you'd have to double check the papers...).

One thing you can do here is try a few starting conditions for each model you optimise, then compare the likelihoods of the final models, as well as their parameters. Highest likelihood wins!

Question 5
Here again, I have used settings from the IQTree example but I have questions similar to those above for Iteration 1 - specifically, what would be appropriate / optimal for Step 1 -mset and Step 2 --model-joint and --init-model, given that I am now running things using Iteration 1 model_i1 in Step 1 -mset.

Sorry, I'm lost on this question! If the answers are contained in the above, just let me know and see if you can explain in more detail.

Hope some of that helps,

Rob

0 replies

000generic · 2024-12-16T05:48:38Z

000generic
Dec 16, 2024
Author

Thank you for all the helpful details @roblanf! And for moving things to Discussion.

Your clarification on -cmax is super helpful! I will go through everything again in the manual and associated papers with this new understanding to see if things fall into place better.

Some follow-up questions - no problem if it should wait until after the holidays - but great to hear back.

ONE
Given the option of 1) building my own GTRpmix model (which is the less biologically correct time-reversible model) vs 2) building my own NQMaker model (which is the more biologically correct time non-reversible model) - QUESTION 1 which would you opt for / how do NQMaker vs GTRpmix models compare?

By the end of the GTRpmix paper I was wondering and unsure which of the two option is better, as the paper only evaluates against time-reversible LG (I thought) - as far as I understood, the paper does not evaluate it's time reversible GTRpmix model vs time non-reversible models, like those from NQMaker.

So I am wondering is one clearly better or are both important to build and test as a general approach, time and resources permitting.

QUESTION 2 I'm guessing QMaker might be dropped, given models built from NQmaker and/or GTRpmix - methods - does this seem reasonable?

TWO
I built substitution models using QMaker and NQMaker and 1000 of the longest (minus the 10 very longest) BUSCO gene alignments out of an available 14,000+ that I have for 25 species - coming out at 5-10 million alignment positions for consideration. I then used these two models along with LG, ELM, Q.pfam, NQ.pfam, and a few others to build trees for each of the 14,000 alignments - and my NR model (typically as NR+I+G4) was selected as best model based on BIC for all of the several hundred iqtree reports I scanned today. Given this, it seems like my NR model would be great to use in building a GTRpmix model on the same 1000 sequences - providing it under --init-model to initiate things / reduce the search space.

The pipeline used to generate the NR model also has the advantage of generating a tree that I believe can similarly be provided in building a GTRpmix model - under -te.

In the documents on IQTree and in the supplemental materials for the GTRpmix paper, the GTRpmix model is built using a single concatenated alignment under -s. QUESTION 3 Is this necessary, or could -S also work to build a GTRpmix model from individual gene alignments (similar to the option in QMaker/NQMaker) - if it is ok to do, how should I then provide the associated tree for each of the alignments.

QUESTION 4 Does it make sense to scale GTRpmix model building across nodes on different machines using mpi (gaining access to up to 768 CPUs)? Or is there no advantage to this? On a single machine, I can access 128 CPUs. However, it seems like there is NOT always an advantage to more CPUs in building IQTree trees. Is this also the case in building the GTRpmix model? The paper and online documentation highlight how GTRpmix model building can be a long intensive process - and I understand providing a tree of the input alignment and setting the optimization threshold -me to 0.99 helps speed things up - and I'm wondering if more CPUs will also help - or not.

Thank you again :) Eric

2 replies

roblanf Dec 16, 2024
Maintainer

GTRpmix vs. NQ model

This depends on your data, and can be tricky to assess. My hunch is that the C60 models (i.e. GTRpmix et al) are going to be better in a lot of cases, including (my guess only) yours. I'm not aware of a formal comparison, but if you wanted to do one then some form of model adequacy tests would be the way to go. A good option in a likelihood framework is to use parametric bootstraps under both models, then compare aspects (you can use any number, but they should be chosen to represent key properties of the data) of the simulated data to your real data to decide which model is best.

Dropping reversible Q

I probably wouldn't. For the simple reason that everything is a tradeoff. Depending on the size of your dataset, tree search is likely to be far more efficient under a reversible than an nonrev or a C60 model. Because tree search is often the limiting factor, it's not always the case that the best model (in terms of likelihood / AIC fit, or whatever) gives you the best trees. See e.g. this paper which shows that parsimony beats likelihood for certain types of dataset, even if you ONLY care about the maximum likelihood tree: https://academic.oup.com/sysbio/article/72/5/1039/7180192 This is only because in these cases parsimony and likleihood scores are highly correlated, but parsimony is orders of magnitude quicker to compute and do tree searching under, so it wins even if you are only trying to maximise the likelihood.

-S for GTRpmix

We are working on this, but it's not ready yet.

CPUs for GTRpmix

If you could use -S (you can't yet), then it would make sense to use at least as many CPUs as you have alignments. But with a concatenated alignment the tradeoff is less clear. Typically the overall execution time will decrease initially, then increase when the count gets too high. Where the sweet spot is depends on the data (basically on the number of site patterns). I think you probably won't see much gain (maybe you'd go backwards even) from 128 to 768, so it's probably not worth trying MPI (which I think is still under development anyway, so may be tricky to use). If you're going to be building lots of models, I'd probably try something like 128CPus, 64CPUs, and 8CPUs, run then for a fixed time each (there are likely to be some fixed overheads at low CPU at the start), then examine the output files to see which looks best. Once you have that, you can hone in on some other CPU counts to further optimise if you like.

000generic Dec 16, 2024
Author

Great - that is all super helpful - thank you!

000generic · 2025-01-10T02:20:54Z

000generic
Jan 10, 2025
Author

I have a followup big-picture / general approach question on building the substitution models - no problem if you don't have time / github is not the place:

My current pipeline is basically:

Collect BUSCO orthogroup gene sets
Clean sequence errors with Prequal
Align with Mafft (might switch to FAMSA based on recent feedback on OrthoFinder GitHub)
Clean sequence and alignment errors locally with Taper
Clean sequence and alignment errors locally with HMMCleaner
IQTree alignments to produce gene trees
DISCO gene trees to produce single-copy gene trees while retaining most sequences
Generate new unaligned gene set fastas based on sequences present per post-DISCO gene trees
Mafft align new gene set fastas
Clean alignment errors locally with Taper
Clean alignment errors locally with Spruceup
Filter for 1) minimum 250 aa alignments - 2) 50% alignment coverage sequences - 3) 75% species representation - and 4) ensure maximum representation of overall lowest represented species across trees
Run QMaker and nQmaker pipelines on filtered alignments (1,250 sequences across Species43 holozoans and animals)

For building the models, my thinking is to have as many positions with maximium species across trees to get more representative substitutions into the model - and tradeoff to have around 1000+ gene trees to work with (based on the 1000 gene trees used in nQ/QMaker papers). I am working now with a Species43 species set - but it was arrived at by a reduction from Species67 to Species51 based on removal of lower quality genome assemblies and annotations, and then a reduction from Species51 to Species43 based on species/phyla exhibiting long-branch attraction that I could not shake in species tree building using BUSCO sequences and VeryFastTree and Asteroid. My feeling was that in both cases these species and their sequences will generally increase incorrect alignments leading to incorrect substitutions in the model - and so should be avoided. Or that was the thinking going into Species43 for model building.

Given Species43 - there are still long-branches in the gene trees. Though it may not be a lot of them, technically, I think they would be producing incorrect alignments or regions of alignment that will skew the substitution model with artifacts. So - I am thinking to run Phylter or PhlyoPyPrunner before DISCO (or could be after DISCO) in the pipeline above to remove these long branching sequences prior to modeling building by nQ/QMaker.

But then I am thinking - outside of sequences and pipelines for model building - when its general use of all genes in all 43 species that are selecting / implementing the best of available substitution models in IQTree - they are going to be having long-branch attraction sequences in their alignments and trees - for species trees and for gene trees of interest etc. So given that in general use, there will alignments/trees that will be impacted by long-branch attraction sequences - would it be better to include sequences that produce long-branches within the Species43 sequences being used for model building?

It may not have a major impact one way or the other - but may be it could. And more generally, I'm wondering how people think/design in selecting/preparing sequences for use in building substitution models. It seems like removing as many potential artifacts as possible is best - but when the reality of usage is so different - I'm unsure.

I apologize for all the extended details - and that its higher-level / not specific to details of IQTree code issues etc - but if you have time, it would be great to get guidance!

Thank you very much, Eric

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using QMaker - NQMaker to iteratively optimize and generate new substitution models #380

{{title}}

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Using QMaker - NQMaker to iteratively optimize and generate new substitution models #380

000generic Dec 15, 2024

Replies: 3 comments · 2 replies

roblanf Dec 16, 2024 Maintainer

000generic Dec 16, 2024 Author

roblanf Dec 16, 2024 Maintainer

000generic Dec 16, 2024 Author

000generic Jan 10, 2025 Author

000generic
Dec 15, 2024

Replies: 3 comments 2 replies

roblanf
Dec 16, 2024
Maintainer

000generic
Dec 16, 2024
Author

roblanf Dec 16, 2024
Maintainer

000generic Dec 16, 2024
Author

000generic
Jan 10, 2025
Author