Using QMaker - NQMaker to iteratively optimize and generate new substitution models #380
-
Hi @000generic, I'll try to answer your questions as best I can. I'll be brief because it's the end of the year and there's a rush to get things done here...
Here you seem to be misunderstanding: C60 refers to profile mixture models. If you want to infer matrices under profile mixture models you should read the PhyloBayes papers, but also the recent GTRpmix paper here: https://academic.oup.com/mbe/article/41/9/msae174/7735827 That paper describes how to estimate a GTR model under the C60 profile mixture, and why it's better to do that than to infer a GTR model with e.g. QMaker (or just use e.g. LG or WAG) and then add C60 to it afterwards.
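A minimal sketch of what that protocol looks like on the IQ-TREE command line, assuming a concatenated alignment and a fixed guide tree (file names are placeholders, and whether GTR20+C60+G4 reproduces the paper's exact setup is an assumption - the GTRpmix supplement has the precise settings):

```sh
# estimate GTR exchangeabilities jointly under the C60 profile mixture with
# gamma rates, on a fixed guide tree; concat.faa and guide.treefile are placeholders
iqtree2 -s concat.faa -te guide.treefile -m GTR20+C60+G4 --prefix gtrpmix_fit
```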
A longer list containing better models is always better: this list determines how good your initial set of trees and branch lengths is for inferring the Q matrix.
I think this is fine, but I would like @bqminh or @thomaskf to weigh in here too. As above, all you are doing with the init model is setting the optimiser's starting conditions.
The best init model is whichever most closely matches (by my guess at least) the model you are aiming for in your estimation, and that should help the optimiser in terms of speed and accuracy. In an ideal world (i.e. where the optimisers find the global optimum every time, regardless of starting conditions) the initial conditions shouldn't matter to the final parameter estimates. This will be particularly the case if you run multiple iterations (as we do in the QMaker papers), i.e. where the first init model is your best guess as above, but in subsequent iterations the init model is the model you output from the previous iteration. In the QMaker papers we keep going until the model pretty much stops changing (I think we use a Pearson correlation of >99.9% or something similar - you'd have to double-check the papers...). One thing you can do here is try a few starting conditions for each model you optimise, then compare the likelihoods of the final models, as well as their parameters. Highest likelihood wins!
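A quick way to make that comparison from the command line (report file names are placeholders, assuming the standard .iqtree report files):

```sh
# pull the final log-likelihoods from runs started from different init models;
# the highest (least negative) value wins
grep "Log-likelihood of the tree" init_LG.iqtree init_ELM.iqtree init_Qpfam.iqtree
```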
Sorry, I'm lost on this question! If the answers are contained in the above, just let me know; if not, see if you can explain it in more detail. Hope some of that helps, Rob
-
Thank you for all the helpful details @roblanf! And for moving things to Discussion. Your clarification on C60 and profile mixtures helps a lot. Some follow-up questions - no problem if they wait until after the holidays - but great to hear back.
Question 1:
By the end of the GTRpmix paper I was unsure which of the two options is better, as the paper only evaluates against time-reversible LG (I thought) - as far as I understood, it does not evaluate its time-reversible GTRpmix model against time-non-reversible models, like those from NQMaker. So I am wondering: is one clearly better, or are both important to build and test as a general approach, time and resources permitting?
Question 2:
I'm guessing QMaker might be dropped, given models built from NQMaker and/or GTRpmix methods - does this seem reasonable?
Question 3:
The pipeline used to generate the NR (non-reversible) model also has the advantage of generating a tree that I believe can similarly be provided when building a GTRpmix model. In the IQTree documentation and in the supplemental materials for the GTRpmix paper, the GTRpmix model is built using a single concatenated alignment - can the tree from the NR pipeline be used there as well?
Question 4:
Does it make sense to scale GTRpmix model building across nodes on different machines using MPI (gaining access to up to 768 CPUs)? Or is there no advantage to this? On a single machine, I can access 128 CPUs. However, it seems like there is NOT always an advantage to more CPUs when building IQTree trees. Is this also the case when building the GTRpmix model? The paper and online documentation highlight how GTRpmix model building can be a long, intensive process - and I understand that providing a tree for the input alignment and setting the optimization threshold can help.
Thank you again :) Eric
-
I have a follow-up big-picture / general-approach question on building the substitution models - no problem if you don't have time / GitHub is not the place. My current pipeline is basically:
For building the models, my thinking is to have as many positions with the maximum number of species across trees as possible, to get more representative substitutions into the model - with the tradeoff of having around 1000+ gene trees to work with (based on the 1000 gene trees used in the NQ/QMaker papers).
I am working now with a Species43 species set - but it was arrived at by a reduction from Species67 to Species51, based on removal of lower-quality genome assemblies and annotations, and then a reduction from Species51 to Species43, based on species/phyla exhibiting long-branch attraction that I could not shake in species tree building using BUSCO sequences, VeryFastTree, and Asteroid. My feeling was that in both cases these species and their sequences would generally increase incorrect alignments, leading to incorrect substitutions in the model - and so should be avoided. Or that was the thinking going into Species43 for model building.
Given Species43, there are still long branches in the gene trees. Though there may not be a lot of them, technically I think they would be producing incorrect alignments, or regions of alignment, that will skew the substitution model with artifacts. So I am thinking to run Phylter or PhyloPyPruner before DISCO (or it could be after DISCO) in the pipeline above, to remove these long-branching sequences prior to model building with NQ/QMaker.
But then I am thinking - outside of the sequences and pipelines for model building - in general use of all genes in all 43 species, where people are selecting/implementing the best of the available substitution models in IQTree, they are going to have long-branch-attraction sequences in their alignments and trees - for species trees and for gene trees of interest etc. So given that in general use there will be alignments/trees that are impacted by long-branch-attraction sequences - would it be better to include sequences that produce long branches within the Species43 sequences being used for model building? It may not have a major impact one way or the other - but maybe it could.
And more generally, I'm wondering how people think/design when selecting/preparing sequences for use in building substitution models. It seems like removing as many potential artifacts as possible is best - but when the reality of usage is so different, I'm unsure.
I apologize for all the extended details - and that it's higher-level / not specific to details of IQTree code issues etc. - but if you have time, it would be great to get guidance! Thank you very much, Eric
-
Hi!
I have a series of questions regarding the generation of new amino acid substitution models using IQTree (IQ-TREE multicore version 2.3.6 for Linux x86 64-bit, built Aug 4 2024) - and would greatly appreciate any guidance you might have.
I read in another issue here that two iterations of QMaker/NQMaker model building is useful/generally sufficient for generating an optimized new model - and I am wondering how to best implement this, and how to best generate new substitution models in general using IQTree.
I am able to generate an initial model (Iteration 1 model_i1) following the IQTree instructions here:
QMaker/NQMaker examples
Specifically for time-reversible and time-non-reversible versions of Iteration 1 model_i1, I ran:
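(Commands reconstructed from the documented QMaker/NQMaker example; ALN_DIR and the --prefix names are placeholders for my actual paths.)

```sh
# Step 1: select the best standard model and infer a tree for every gene alignment
iqtree2 -S ALN_DIR -mset LG,WAG,JTT -cmax 4 -T AUTO

# Step 2, time-reversible: jointly estimate one GTR matrix across all genes
iqtree2 -S ALN_DIR.best_model.nex -te ALN_DIR.treefile \
        --model-joint GTR20+FO --init-model LG -T AUTO --prefix Qrev_i1

# Step 2, time-non-reversible: jointly estimate a non-reversible matrix
iqtree2 -S ALN_DIR.best_model.nex -te ALN_DIR.treefile \
        --model-joint NONREV+FO --init-model LG -T AUTO --prefix NQ_i1
```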
Question 1:
I set -cmax to 4 based on the IQTree example that has -cmax 4 - but I anticipate C20 or C60 might be useful for some downstream/later data sets (long alignments and distant species). Is it better in my case to set -cmax to, say, 60 at Step 1? Was -cmax 4 set low (below the default of 10, I think) to reduce computational load/time? I have reasonable computing resources - up to 128 CPUs with 968 GB RAM on a node, and up to 6 nodes if I use MPI. So if -cmax 4 was to reduce computational load, and -cmax 60 might be better for my data, I could reasonably implement this, I think - but I am not sure if this is the correct way to think about things.
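For reference, the two Step 1 variants in question (paths are placeholders; note that -cmax bounds the number of rate categories tried during model selection):

```sh
# documented example: allow at most 4 rate categories during model selection
iqtree2 -S ALN_DIR -mset LG,WAG,JTT -cmax 4 --prefix step1_c4

# variant under consideration: allow up to 60 rate categories
iqtree2 -S ALN_DIR -mset LG,WAG,JTT -cmax 60 --prefix step1_c60
```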
Question 2:
I set -mset to LG,WAG,JTT based on the IQTree example that has -mset LG,WAG,JTT. My target species for tree building span the animal phyla plus unicellular outgroups. Given the improvement of Q, NQ, and ELM models over LG for major animal groups in the recent Q, NQ, and GTRpmix papers, I am wondering if it makes sense in my case to expand -mset in Step 1 to, say, -mset LG,WAG,JTT,Q.pfam,NQ.pfam,ELM. Or what would be recommended / how should I think about assessing this?
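For concreteness, the expanded Step 1 I have in mind (a sketch; if ELM is not bundled with this IQ-TREE build, it would presumably need to be supplied via a model definition file with -mdef):

```sh
# Step 1 with an expanded candidate set; ELM availability by name is an assumption
iqtree2 -S ALN_DIR -mset LG,WAG,JTT,Q.pfam,NQ.pfam,ELM -cmax 4 --prefix step1_wide
```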
Question 3:
If I do expand -mset to include additional models like Q.pfam, NQ.pfam, and ELM in Step 1, what would be the appropriate --model-joint to use in Step 2? I used the IQTree example settings of --model-joint GTR20+FO for time-reversible and --model-joint NONREV+FO for time-non-reversible - but I have no idea whether these would still be appropriate or optimal with the addition of Q.pfam, NQ.pfam, and/or ELM in Step 1.
Question 4:
If I do expand -mset to include additional models like Q.pfam, NQ.pfam, and ELM in Step 1, what would be the appropriate --init-model to use in Step 2? I used the IQTree example of --init-model LG. However, I was thinking to replace LG with ELM in this case, as the GTRpmix paper indicated it was better for major animal groups - so I would run --init-model ELM at Step 2, given an expansion of -mset to LG,WAG,JTT,Q.pfam,NQ.pfam,ELM in Step 1.
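The Step 2 variant I am considering would look something like this (file names are placeholders, and ELM being available by name is again an assumption):

```sh
# Step 2 with ELM as the starting matrix instead of LG (assumed available)
iqtree2 -S step1_wide.best_model.nex -te step1_wide.treefile \
        --model-joint GTR20+FO --init-model ELM --prefix Qrev_i1_elm
```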
For Iteration 2 (model_i2), to further optimize the initial model, I am trying the following:
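(A sketch of the Iteration 2 commands, assuming the Iteration 1 matrix was written to a user model file loadable with -mdef; model_i1.nex is a placeholder name.)

```sh
# Iteration 2, Step 1: rerun model selection with model_i1 as the candidate,
# loading it from a user model definition file
iqtree2 -S ALN_DIR -mdef model_i1.nex -mset model_i1 -cmax 4 -T AUTO --prefix i2_step1

# Iteration 2, Step 2: re-estimate the joint matrix, starting from model_i1
iqtree2 -S i2_step1.best_model.nex -te i2_step1.treefile -mdef model_i1.nex \
        --model-joint GTR20+FO --init-model model_i1 -T AUTO --prefix Qrev_i2
```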
Question 5:
Here again, I have used settings from the IQTree example, but I have questions similar to those above for Iteration 1 - specifically, what would be appropriate/optimal for -mset in Step 1, and for --model-joint and --init-model in Step 2, given that I am now running things with the Iteration 1 model_i1 in Step 1's -mset?
I apologize for all the questions and specifics - and really appreciate your time and any guidance you might have.
Thank you :) Eric