Jonathan Romiguier, Marek L. Borowiec, Arthur Weyna, Quentin Helleu, Etienne Loire, Christine La Mendola, Christian Rabeling, Brian L. Fisher, Philip S. Ward, Laurent Keller

Institut des Sciences de l'Evolution de Montpellier (UMR ISEM), Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad)-École Pratique des Hautes Études (EPHE), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut de recherche pour le développement [IRD] : UR226-Centre National de la Recherche Scientifique (CNRS)-Université de Montpellier (UM), Université de Lausanne = University of Lausanne (UNIL), University of Idaho [Moscow, USA], Animal, Santé, Territoires, Risques et Ecosystèmes (UMR ASTRE), Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad)-Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE), Département Systèmes Biologiques (Cirad-BIOS), Centre de Coopération Internationale en Recherche Agronomique pour le Développement (Cirad), School of Life Sciences [Tempe, USA], Arizona State University [Tempe] (ASU), California Academy of Sciences, University of California [Davis] (UC Davis), University of California (UC)

We thank the FEBS (Federation of European Biochemical Societies) for the long-term fellowship granted to J.R., the funding from the French ANR (t-ERC grant RoyalMess) and ERC (RoyalMess, grant 948688) and the ERC and Swiss NSF to L.K., a grant from the US National Science Foundation (CAREER DEB-1943626) to C.R., and the three reviewers for their very useful comments. We thank the following for helping us to collect samples: C.S. Moreau, F.A. Esteves, S. Van Noort, N. Gunawardene, C. Poteaux, T. Bruscht, J.M. Gómez Durán, M. Molet, E. Vargo, Bui Tuan Viet, M.D. Goodisman, T. Delsinne, and J. Orive.
The content of each folder/archive is described in this README.

# phylogenies

Command lines used to produce the trees shown in the corresponding supplementary figures:

Fig S1: iqtree -s megaUce-auto-taxon75.fa -spp ./megaUce-auto-taxon75.partitions.txt -m GTR+F+I+G4 -bb 1000 -wbtl -pre megaUce-auto-taxon75.spp.gtrig4

Fig S2: iqtree -s megaAcuTri-autotrue.faa -wbtl -bb 1000 -m LG+C20+F+G -ft megaAcuTri-auto.faa.removeBad.spp.lgfg4.treefile -pre megaAcuTri-autotrue.truepmfs

Fig S3: iqtree -s megaAntBeeTri-auto.faa -spp ./megaAntBeeTri-auto.faa.partitions.txt -wbtl --symtest-remove-bad -bb 1000 -alrt 1000 -m LG+F+G4 -pre megaAntBeeTri-auto.faa.removeBad.spp.lgfg4.alrt.treefile

Fig S4: iqtree -s megaAntBeeTri-autotrue.faa -wbtl -bb 1000 -m LG+C20+F+G -ft megaAntBeeTri-auto.faa.removeBad.spp.lgfg4.alrt.treefile.treefile -pre megaAntBeeTri-autotrue.truepmfs

Fig S5: java -jar astral.5.7.4.jar -i ./genes500trees.orthoAntBeeTriModelSearch.nwk -o astral.orthoAntBeeTriModelSearch500.nwk

# random_outgroup_removal

The archive contains the amino-acid alignments (faa files) and results (treefiles) of the random outgroup removal analysis. The number of outgroups retained is given by the number following -o in the filename (e.g. the -o115 file contains all outgroups; the -o0 file contains none).

# mcmctree-independant-clock

For divergence time analyses we used a node-dating approach as implemented in MCMCTree, part of the PAML package v4.10 (Yang 2007). MCMCTree uses rapid approximate likelihood computation (dos Reis and Yang 2011), which makes it suitable for divergence dating of genome-scale data sets (dos Reis et al. 2012). Due to computational constraints, we used an alignment restricted to loci present in at least 95% of our 83 taxa, totaling 182,809 amino acid sites (available in the folder under the filename megaAntBeeTri-autotrue95.removeBad.faa). We fixed the topology to the one obtained from our analysis of the full alignment.
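The 95% occupancy criterion can be illustrated with a minimal Python sketch. It is not the script used to build the alignment: the function name `filter_loci_by_occupancy`, the in-memory data layout (locus name mapped to per-taxon aligned sequences) and the rule that a taxon counts as present when its sequence contains at least one non-gap character are all assumptions made for the example.

```python
def filter_loci_by_occupancy(loci, n_taxa, min_fraction=0.95):
    """Keep loci in which at least `min_fraction` of the taxa are present.

    loci: dict mapping a locus name to a dict {taxon: aligned sequence}.
    n_taxa: total number of taxa in the data set (83 in this README).
    A taxon is counted as present if its sequence is not made only of
    gap or missing-data characters (assumption for this sketch).
    """
    missing = set("-?")
    kept = {}
    for name, seqs in loci.items():
        present = sum(
            1 for seq in seqs.values() if any(c not in missing for c in seq)
        )
        if present / n_taxa >= min_fraction:
            kept[name] = seqs
    return kept
```

With 83 taxa and a 0.95 threshold, a locus needs at least 79 taxa to be kept (78/83 is about 0.94, 79/83 about 0.95).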
We constrained our root node with a soft-bound maximum age of 236 Ma, corresponding to the lower bound of the 95% highest posterior density (HPD) interval for that split in Peters et al. (2017). We also set soft bounds of 103 Ma and 169 Ma on the root of the Formicidae, corresponding respectively to the upper bound of the 95% HPD in Borowiec et al. (2019) and the lower bound in Economo et al. (2018), the two most divergent of the recent estimates of the age of the family (Borowiec et al. 2020).

We also used the following minimum node age constraints based on fossils:

• MRCA of Amblyopone australis and Apomyrma CF02 at least 34 Ma old, based on two species of Stigmatomma in Baltic amber (Dlussky 2009).
• MRCA of Lioponera sp. and Aenictus glabrinotum at least 34 Ma old, based on three species of Procerapachys in Baltic amber (Dlussky 2009).
• MRCA of Pseudomyrmex pallidus and Tetraponera allaborans at least 52 Ma old, based on three species of Tetraponera in Oise amber (Aria et al. 2011).
• MRCA of Aneuretus simoni and Tapinoma erraticum at least 78 Ma old, based on Chronomyrmex medicinehatensis in Grassy Lake amber (McKellar et al. 2013).
• MRCA of Myrmelachista zeledoni and Atta cephalotes at least 92 Ma old, based on Kyromyrma neffi in New Jersey amber (Grimaldi and Agosti 2000).
• MRCA of Anoplolepis gracilipes and Plagiolepis pygmaea at least 34 Ma old, based on six species of Plagiolepis in Baltic and other late Eocene ambers (Dlussky 2010).
• MRCA of Polyrhachis dives and Camponotus fellah at least 34 Ma old, based on Camponotus mengei in Baltic amber (Wheeler 1915).
• MRCA of Gnamptogenys sp. and Ectatomma ruidum at least 34 Ma old, based on two species of Gnamptogenys in Baltic amber (Dlussky 2009).
• MRCA of Myrmica rubra and Pogonomyrmex barbatus at least 34 Ma old, based on Myrmica species from Baltic and Saxonian ambers (Dlussky and Rasnitsyn 2009).
• MRCA of Temnothorax nylanderi and Vollenhovia emeryi at least 34 Ma old, based on six described species of Temnothorax in Baltic amber (Dlussky and Rasnitsyn 2009).

We ran each analysis unpartitioned, under the LG model, for 5 million generations. We examined each run’s statistics in Tracer and confirmed convergence and sufficient effective sample sizes (>>200) for all parameters. All results are available in the mcmctree-independant-clock folder.

Bibliography:

Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution, 24(8), 1586-1591.

dos Reis, M., & Yang, Z. (2011). Approximate likelihood calculation on a phylogeny for Bayesian estimation of divergence times. Molecular Biology and Evolution, 28(7), 2161-2172.

dos Reis, M., Inoue, J., Hasegawa, M., Asher, R. J., Donoghue, P. C., & Yang, Z. (2012). Phylogenomic datasets provide both precision and accuracy in estimating the timescale of placental mammal phylogeny. Proceedings of the Royal Society B: Biological Sciences, 279(1742), 3491-3500.

Peters, R. S., Krogmann, L., Mayer, C., Donath, A., Gunkel, S., Meusemann, K., ... & Niehuis, O. (2017). Evolutionary history of the Hymenoptera. Current Biology, 27(7), 1013-1018.

Borowiec, M. L., Rabeling, C., Brady, S. G., Fisher, B. L., Schultz, T. R., & Ward, P. S. (2019). Compositional heterogeneity and outgroup choice influence the internal phylogeny of the ants. Molecular Phylogenetics and Evolution, 134, 111-121.

Economo, E. P., Narula, N., Friedman, N. R., Weiser, M. D., & Guénard, B. (2018). Macroecology and macroevolution of the latitudinal diversity gradient in ants. Nature Communications, 9(1), 1-8.

Borowiec, M. L., Moreau, C. S., Rabeling, C., & Starr, C. K. (2020). Ants: phylogeny and classification. Encyclopedia of Social Insects. Springer International Publishing, Cham, 1-18.

Dlussky, G. M. (2009).
The ant subfamilies Ponerinae, Cerapachyinae, and Pseudomyrmecinae (Hymenoptera, Formicidae) in the late Eocene ambers of Europe. Paleontological Journal, 43(9), 1043-1086.

Aria, C., Perrichot, V., & Nel, A. (2011). Fossil Ponerinae (Hymenoptera: Formicidae) in Early Eocene amber of France. Zootaxa, 2870(1), 53-62.

McKellar, R. C., Glasier, J. R., & Engel, M. S. (2013). New ants (Hymenoptera: Formicidae: Dolichoderinae) from Canadian Late Cretaceous amber. Bulletin of Geosciences, 88(3), 583-594.

Grimaldi, D., & Agosti, D. (2000). A formicine in New Jersey Cretaceous amber (Hymenoptera: Formicidae) and early evolution of the ants. Proceedings of the National Academy of Sciences, 97(25), 13678-13683.

Dlussky, G. M. (2010). Ants of the genus Plagiolepis Mayr (Hymenoptera, Formicidae) from late Eocene ambers of Europe. Paleontological Journal, 44(5), 546-555.

Wheeler, W. M. (1915 ["1914"]). The ants of the Baltic Amber. Schriften der Physikalisch-Ökonomischen Gesellschaft zu Königsberg, 55, 1-142.

# absrelAnalysis

We provide three main folders containing the data and results of the absrel analysis for the three cleaning strategies of the article. The main cleaning strategy presented in the article is the most stringent one: hmmcleaner-75codons (hmmcleaner cleaning, followed by keeping only codons that are complete in 75% of the species of the alignment). For each strategy, we provide the cleaned fasta file of each alignment, the corresponding tree and the json output of the absrel analysis.
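The codon-completeness criterion can be sketched in Python as follows. This is an illustration only, not the script used for the article: the function names are hypothetical, a codon is assumed to be "complete" when it consists of three unambiguous nucleotides (A, C, G or T), and the threshold is assumed to mean "at least 75% of species".

```python
def is_complete_codon(codon):
    """Assumption: a complete codon is three unambiguous nucleotides."""
    return len(codon) == 3 and all(base in "ACGT" for base in codon.upper())


def keep_complete_codons(alignment, min_fraction=0.75):
    """Keep codon columns that are complete in >= min_fraction of species.

    alignment: dict {species: aligned coding sequence}; all sequences must
    have the same length, and that length must be a multiple of 3.
    Returns a new dict with only the retained codon columns.
    """
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if length % 3 != 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned, in frame, and equal in length")

    kept_starts = []
    for start in range(0, length, 3):
        n_complete = sum(is_complete_codon(s[start:start + 3]) for s in seqs)
        if n_complete / len(seqs) >= min_fraction:
            kept_starts.append(start)

    return {
        name: "".join(seq[i:i + 3] for i in kept_starts)
        for name, seq in alignment.items()
    }
```

Filtering whole codon columns rather than single sites keeps the reading frame intact, which codon models such as the one used by absrel require.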
All runs of absrel were performed with the following example command line:

hyphy absrel --code Universal --alignment alignmentEXAMPLE.fasta --tree treeEXAMPLE.tr --branches All

We also provide the scripts of hmmcleaner v1.8, which we used with the following example command:

./HMMcleanAA.pl alignmentEXAMPLE.fasta 5

Low effective population sizes tend to decrease the efficiency of purifying selection, which increases the rate of non-synonymous substitutions across the whole genome. To explore whether the particularly high number of positive selection events detected on the branch leading to formicoids is associated with particularly low effective population sizes, we also used the absrel output to compute the average omega (Baseline MG94xREV omega ratio, i.e. dN/dS, the non-synonymous substitution rate over the synonymous substitution rate) of each branch. To avoid biasing the average omega with extreme values, we excluded genes with omega ratios of more than 2. Results are available in Table S3. For every cleaning strategy, the branch leading to formicoids (i.e. branch id 25) does not display the highest average omega ratio (28th, 30th and 34th highest values for HmmCleaner only, 50% complete codon cleaning and 75% complete codon cleaning, respectively), which suggests that its high number of positive selection events is not associated with particularly low effective population sizes.

# GeneFamilyAnalysis

This folder contains 7 numbered folders:
1 - An example of the control files of the maker2 pipeline for gene prediction.
2 - The resulting protein gene sets, from genes with more than 1000 nucleotides.
3 - The main output files of orthofinder on these protein gene sets.
4 - Input files (gene counts table and chronogram newick file) for running the CAFE5 analysis.
5 - Output files of the CAFE5 analyses after excluding gene families with a difference of more than 50 between the maximum and minimum gene counts.
6 - eggNOG annotation of our protein gene sets.
7 - topGO analysis input (orthoGOall.csv for the GO terms associated with each gene family and orthoGOsignif.csv for only the gene families identified by CAFE5 as significantly expanded or contracted) and output table.

# assemblies.tar.gz

This archive contains all assemblies of the genomes sequenced for this study. Refer to Table S1 of the article to translate acronyms into species names.
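
Returning to the absrelAnalysis section, the per-branch average omega described there can be sketched in Python as follows. This is a hedged illustration rather than the script used for Table S3. It assumes that each absrel json file stores branch-level values under "branch attributes" then "0", with the key "Baseline MG94xREV omega ratio", and that branch labels are comparable across genes. The function names and the folder argument are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path

# Key named in this README; its location in the json is an assumption.
OMEGA_KEY = "Baseline MG94xREV omega ratio"


def branch_omegas(absrel_result):
    """Return {branch label: baseline omega} for one parsed absrel json."""
    attributes = absrel_result["branch attributes"]["0"]
    return {
        branch: values[OMEGA_KEY]
        for branch, values in attributes.items()
        if isinstance(values.get(OMEGA_KEY), (int, float))
    }


def average_omega_per_branch(results, max_omega=2.0):
    """Average omega per branch over genes, skipping values above max_omega."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for result in results:
        for branch, omega in branch_omegas(result).items():
            if omega <= max_omega:  # exclude extreme values, as in the README
                totals[branch] += omega
                counts[branch] += 1
    return {branch: totals[branch] / counts[branch] for branch in totals}


def load_results(folder):
    """Parse every *.json file in a folder (hypothetical layout)."""
    return [json.loads(p.read_text()) for p in sorted(Path(folder).glob("*.json"))]
```

Ranking the resulting dictionary by value would then show where a given branch (for example branch id 25) falls among all branches.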