Complete Chloroplast Genome and Comparative Analysis of Entada phaseoloides (Fabaceae)

Background: Entada phaseoloides (Fabaceae) is a large woody climber found widely in southern China and in other tropical and subtropical areas worldwide. The genus Entada contains ~30 species, of which E. phaseoloides is the most commonly found in China; E. phaneroneura and E. pervillei are endangered. Previous studies have focused on medicinal components, transcriptional regulation and the nuclear genome. Phylogenetic relationships within Entada are poorly understood and no chloroplast genome of Entada has been reported. In this study, we performed short-read sequencing of E. phaseoloides and assembled and analyzed its chloroplast genome. Methods: DNA was extracted from dried specimen leaves of E. phaseoloides and sequenced on the Illumina NovaSeq platform. The chloroplast genome was assembled using GetOrganelle and annotated using CPGAVAS2 and Geneious Prime. Long repeats and SSRs were analyzed using REPuter and MISA, respectively. Phylogenetic analyses were performed using IQ-TREE and MrBayes. Result: The complete chloroplast genome of E. phaseoloides was 159,963 bp in length and had a quadripartite structure, with a large single-copy (LSC) region of 89,972 bp and a small single-copy (SSC) region of 19,309 bp separated by two inverted repeats (IRs) of 25,341 bp each. The genome contained 112 unique genes, comprising 78 protein-coding genes, 30 transfer RNA genes and 4 ribosomal RNA genes. We carried out a phylogenetic analysis based on homologous protein-coding genes among 21 species derived from Fabaceae. The resulting phylogeny was largely congruent with prior hypotheses about the position of E. phaseoloides, which was most closely related to Piptadeniastrum africanum among the sampled species.


INTRODUCTION
Entada phaseoloides (L.) Merr. (1914) is a large woody vine in the family Fabaceae that grows in southern China and in other tropical and subtropical areas worldwide. The genus Entada contains about 30 species, of which E. phaseoloides is the most commonly found in China (Ohashi et al., 2010). Entada populations are declining due to over-harvesting and habitat destruction, and E. phaneroneura and E. pervillei are endangered species (www.iplant.cn/rep/protlist/4?key=Entada). The seeds and stems of E. phaseoloides are very large, and the seeds are often used as ornaments. The roots of Entada host bacteria capable of symbiotic nitrogen fixation. Therefore, E. phaseoloides can be used to help restore nitrogen-deficient soils and to improve and protect the environment (Diabate et al., 2005). Additionally, the stem of E. phaseoloides is widely used in traditional medicine because of its remarkable pharmacological activity. Its main bioactive components are triterpenoid saponins, and the representative saponins are oleanane-type triterpenoid saponins containing seven sugar chains (Liao et al., 2020).
There are few phylogenetic studies on Entada. Luckow et al. (2003) analyzed the phylogenetic relationships of 134 Mimosoideae species based on trnL intron, trnK intron and matK sequences, demonstrated that the tribe Mimoseae forms a paraphyletic grade in which both Acacieae and Ingeae are embedded, and showed that Entada is closely related to Piptadeniastrum. More recently, the chromosome-level genome of E. phaseoloides was reported and the evolution of its triterpenoid saponin biosynthetic genes was revealed (Lin et al., 2022). Although chloroplast genomes of many Fabaceae species have been published in recent years (Kim et al., 2016; Souza et al., 2019; Su et al., 2021), studies on the chloroplast genome and phylogeny of Entada are still lacking.

With the current development of sequencing technology, the study of chloroplast genomes provides a solid basis for understanding the phylogeny among species. Thus, using data from the Illumina NovaSeq platform, we assembled and examined the chloroplast genome of E. phaseoloides. Our main objectives were as follows: (1) to analyze the structural features of the complete chloroplast genome of E. phaseoloides; (2) to analyze simple sequence repeats (SSRs) and repeat sequences; and (3) to infer the phylogenetic position of E. phaseoloides.

Plant material, DNA extraction and sequencing
Dried specimen leaves of E. phaseoloides were collected from Xishuangbanna, Yunnan Province, China (22.01°N, 100.79°E). Our experimental studies, including the collection of plant material, were in accordance with institutional, national or international guidelines. The sample was deposited at the Herbarium of the Xinyang Agriculture and Forestry University (voucher number: EP001, hutt0716@163.com). Whole genomic DNA was extracted using the CTAB method (Doyle and Doyle, 1987). A next-generation sequencing library with an insert size of 300 bp was constructed and sequenced on the Illumina NovaSeq 6000 platform, yielding ~5 Gb of raw data; low-quality data were removed to obtain clean data.

Relative synonymous codon usage and IR boundary analysis
The relative synonymous codon usage (RSCU) was calculated using an online cloud platform (cloud.genepioneer.com). Furthermore, the borders of the IR, SSC and LSC regions were compared using IRscope (Amiryousefi et al., 2018).
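The RSCU of a codon is its observed count divided by the count expected if all synonymous codons for the same amino acid were used equally. The following Python sketch illustrates that calculation only. It assumes the standard codon assignments and in-frame coding sequences; the function name `rscu` and the toy input are hypothetical and do not reproduce the online platform used in this study.

```python
from collections import Counter
from itertools import product

# Standard genetic code with codons ordered T, C, A, G at each position.
# Assumption: these amino-acid assignments apply to the genes being analysed.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds_sequences):
    """Return {codon: RSCU} for every codon of each amino acid present in the input."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        # Read complete codons only; a trailing partial codon is ignored.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE and CODON_TABLE[codon] != "*":
                counts[codon] += 1

    # Group synonymous codons by the amino acid they encode (stop codons excluded).
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        expected = total / len(codons)  # count per codon under equal usage
        for c in codons:
            values[c] = counts[c] / expected
    return values


result = rscu(["ATGTGGCTGCTGCTT"])  # Met, Trp, Leu, Leu, Leu
```

In this toy input, CTG accounts for two of three leucine codons among six synonymous codons, so its RSCU is 4.0, while the single-codon amino acids methionine and tryptophan have an RSCU of 1.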

Phylogenetic analysis
The chloroplast genomes of 18 Fabaceae species and one Polygalaceae species were downloaded from GenBank. Polygala tenuifolia was used as the outgroup. We extracted and aligned 78 common protein-coding genes from the genome annotation files using PhyloSuite v. 1.2.2 (Zhang et al., 2020), and the 78 aligned sequences were then concatenated. Based on the concatenated sequence matrix, phylogenetic trees were constructed using the Maximum Likelihood (ML) and Bayesian inference (BI) methods.
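The concatenation step can be illustrated with a short Python sketch. It is not the PhyloSuite implementation: the function name `concatenate_alignments` and the input layout (one dictionary per gene, mapping taxon name to aligned sequence) are assumptions made for illustration. Gap padding for missing taxa is included only for generality, since the genes used here were shared by all species.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments into one supermatrix.

    gene_alignments: list of dicts mapping taxon name -> aligned sequence.
    Within a gene, all sequences must have the same aligned length.
    A taxon missing from a gene is padded with gaps ("-") for that gene.
    """
    taxa = sorted({taxon for gene in gene_alignments for taxon in gene})
    parts = {taxon: [] for taxon in taxa}
    for gene in gene_alignments:
        lengths = {len(seq) for seq in gene.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must be aligned to equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(gene.get(taxon, "-" * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy example with two genes and three taxa (hypothetical data).
genes = [
    {"A": "ATG", "B": "ATA"},
    {"A": "CC--", "B": "CCGG", "C": "CCGA"},
]
matrix = concatenate_alignments(genes)
```

Every row of the resulting matrix has the same length, which is what tree-building programs expect from a concatenated alignment.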

Chloroplast genome characterization of E. phaseoloides
The complete chloroplast genome of E. phaseoloides (GenBank accession number: OQ558908) was a circular molecule 159,963 bp in length with a GC content of 36.30% (Fig 1). It had a quadripartite structure comprising a large single-copy (LSC) region, a small single-copy (SSC) region and two inverted repeats (IRa and IRb). The LSC and SSC regions were 89,972 bp and 19,309 bp, respectively, while the IRa and IRb regions were 25,341 bp each (Table 1). The coding region was 66,765 bp long and represented 41.74% of the whole genome. The total number of unique genes was 112, comprising 78 protein-coding genes, 30 tRNA genes and 4 rRNA genes (Table 2). Among the protein-coding genes, ndhA, ndhB, petB, petD, atpF, rpl16, rpl2, rps16 and rpoC1 contained one intron each, whereas rps12, clpP and ycf3 had two introns each. The gene with the largest intron (2,657 bp) was trnK-UUU, and the matK gene was located within this intron.
Notes: Gene*: Gene with one intron; Gene**: Gene with two introns; #Gene: Pseudo gene; Gene (2): Number of copies of multi-copy genes.
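The reported region lengths can be checked for internal consistency with a few lines of Python. This is only an arithmetic sketch: the helper names `quadripartite_summary` and `gc_content` are hypothetical, and the GC helper is shown on a toy string because the genome sequence itself is not reproduced here.

```python
def quadripartite_summary(lsc, ssc, ir, coding):
    """Return total length (bp) and coding percentage for a quadripartite plastome."""
    total = lsc + ssc + 2 * ir  # the IR is present in two copies (IRa and IRb)
    coding_pct = round(100 * coding / total, 2)
    return total, coding_pct


def gc_content(seq):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


# Region lengths as reported in the text above.
total, coding_pct = quadripartite_summary(lsc=89_972, ssc=19_309, ir=25_341, coding=66_765)
print(total, coding_pct)  # 159963 41.74
```

The LSC, SSC and two IR copies sum to the reported genome length of 159,963 bp, and the coding region corresponds to the reported 41.74%.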

Repeat analysis
Repeat sequences play a role in the recombination and variation of chloroplast genomes. This chloroplast genome contained 11 long repeats, including 4 palindromic repeats (36.36%) and 7 forward repeats (63.64%) (Fig 2A). These long repeats were at least 30 bp in length, with the longest being 25,341 bp. In population genetic studies, the number and position of repeated DNA motifs (1-6 nucleotides long) have been routinely employed to detect polymorphisms in cp genomes. In the E. phaseoloides chloroplast genome, we identified 327 SSRs, of which dinucleotide repeats were the most abundant class; mono-, di-, tri-, tetra-, penta- and hexa-nucleotide SSRs accounted for 30.58%, 35.78%, 14.98%, 14.98%, 2.14% and 0.25% of all SSRs, respectively (Fig 2B).
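To make the notion of an SSR concrete, the sketch below scans a sequence for perfect tandem repeats of 1-6 nt motifs using regular expressions. It is a simplified illustration and not the MISA algorithm. The minimum repeat counts in `MIN_REPEATS` are assumed values chosen for the example, since the thresholds used in this study are not stated, and the function name `find_ssrs` is hypothetical.

```python
import re

# Minimum number of repeats per motif length (1-6 nt).
# Assumption: illustrative thresholds only, not the settings used in the study.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs (0-based start)."""
    seq = seq.upper()
    hits = []
    for k, n in min_repeats.items():
        # A lookahead lets every start position be tested, so a run of a
        # non-primitive motif (e.g. "AA") cannot hide a neighbouring SSR.
        pattern = re.compile(r"(?=(([ACGT]{%d})\2{%d,}))" % (k, n - 1))
        last_end = 0
        for match in pattern.finditer(seq):
            start, motif = match.start(), match.group(2)
            if start < last_end or not is_primitive(motif):
                continue
            tract = match.group(1)
            hits.append((start, motif, len(tract) // k))
            last_end = start + len(tract)
    return sorted(hits)


# Toy sequence containing one mono-, one di- and one trinucleotide SSR.
seq = "GC" + "A" * 10 + "CG" + "AT" * 6 + "CC" + "AAG" * 4 + "C"
ssrs = find_ssrs(seq)
```

Motifs that are repetitions of a shorter unit, such as "AA" or "ATAT", are skipped so that each tract is reported once, under its shortest motif.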

Relative synonymous codon usage (RSCU)
The 78 protein-coding genes were used to determine the RSCU of the E. phaseoloides chloroplast genome (Fig 3A). Leucine was the most frequent amino acid (10.52%), whereas cysteine was the least frequent (1.23%) (Fig 3B). The RSCU values in Table S2 showed that half of the codons had values > 1 (Fig 3C). Tryptophan (UGG) and methionine (AUG), each encoded by a single codon, had an RSCU value of 1, indicating no codon usage bias.

IR boundaries analysis
The IR-SC boundaries were compared among the 19 Mimoseae species (Fig 4). In general, the variation in length of the LSC and SSC regions was lower than that of the IRa/IRb regions. Compared to the chloroplast genomes of other Mimoseae species, the chloroplast genome of E. phaseoloides showed a contraction of the IR region and an expansion of the SSC region. The trnH gene varied in its location within the LSC region. The ycf1 gene spanned the SSC/IRa boundary in all 19 Mimoseae species, and in E. phaseoloides it extended 37 bp into the IRa region. Except in Cylicodiscus gabunensis, the ndhF gene was located in the SSC region. The location of the rps19 gene at the IR/LSC border also varied among the cp genomes; the rps19 gene spanned the LSC/IRb border. E. phaseoloides, Leucaena trichandra and Prosopis farcta had two copies of the rpl2 gene located in the inverted repeat regions.

Phylogenetic analysis
We used the 78 protein-coding genes for phylogenetic analysis and selected 21 angiosperm species, comprising 20 Fabaceae species and Polygala tenuifolia (Polygalaceae) as the outgroup. Phylogenetic analysis was performed by maximum likelihood and Bayesian inference. The two phylogenetic trees were topologically similar, with the majority of nodes having 100% bootstrap (BP) values and Bayesian posterior probabilities (PP) of 1.00. The phylogeny was largely congruent with prior hypotheses about the position of E. phaseoloides. E. phaseoloides and Piptadeniastrum africanum were most closely related and belonged to the same group (Fig 5).
This study presents the first chloroplast genome of E. phaseoloides. The length of the cp genome of E. phaseoloides was similar to that of the cp genomes of other Mimoseae species. A typical angiosperm chloroplast genome consists of 113 genes, including 79 protein-coding genes, 30 tRNA genes and four rRNA genes (Wicke et al., 2011). The E. phaseoloides chloroplast genome had a similar number of genes (112 genes), including 78 protein-coding genes, 30 tRNA genes and 4 rRNA genes.
Codons encoding leucine were the most common in the chloroplast genome of E. phaseoloides, while those encoding cysteine were the least common. Similar findings have been reported for the chloroplast genome of Balanites aegyptiaca. Several reports have shown the importance of chloroplast SSRs as reliable molecular markers for discriminating specimens at lower taxonomic levels and for studying population structure. The E. phaseoloides chloroplast genome had 327 SSRs. Dinucleotide AA/TT SSRs were the most frequent. Therefore, we recommend using the chloroplast genome to develop SSR markers and to study the population genetics of E. phaseoloides.
Although the plastid genome is conserved in angiosperms, as previously reported, several studies have reported variation in the size and boundaries of the IR/LSC and IR/SSC regions and variation in gene location (Al-Juhani et al., 2022; Ruhsam et al., 2016). In the present study, comparisons between the IR-LSC and IR-SSC boundaries in the 19 complete chloroplast genomes of Mimoseae showed clear variation in the inverted repeat region and significant expansion in the IR region in the chloroplast genome of E. phaseoloides.
Chloroplast genomes contain many informative genes that can help resolve phylogenetic problems at different levels of angiosperm taxonomy (Al-Juhani et al., 2022; Dong et al., 2017). In this study, we found that E. phaseoloides was most closely related to P. africanum among the sampled species.

CONCLUSION
In this research, we assembled the complete chloroplast genome of E. phaseoloides for the first time. The genome is 159,963 bp long and consists of an LSC region of 89,972 bp, an SSC region of 19,309 bp and two IR regions of 25,341 bp each. The chloroplast genome contains 112 unique genes: 78 protein-coding genes, 30 tRNA genes and 4 rRNA genes. Gene content and orientation are similar to those found in the chloroplast genomes of other Mimoseae species. This study also revealed the distribution of repeated structures and microsatellites along the chloroplast genome of E. phaseoloides and generated important genomic resources for Mimoseae and Entada. Based on 78 protein-coding genes, a well-supported phylogenetic tree for 19 Mimoseae species was constructed. With 100% bootstrap support and a posterior probability of 1.00, E. phaseoloides was most closely related to P. africanum. These results will not only help to clarify the evolution of Entada, but also help to explore more genetic information on and better utilize E. phaseoloides.

Fig 1: Chloroplast genome map of E. phaseoloides. The thick lines on the outer complete circle identify the inverted repeat regions (IRa and IRb). The innermost track of the chloroplast genome shows the GC content.

Fig 3: Codon content of 20 amino acids in all protein-coding genes of the E. phaseoloides chloroplast genome.

Fig 4: Comparison of the junction sites between the Large Single Copy (LSC, light blue), Small Single Copy (SSC, light green) and Inverted Repeat (IRa and IRb, orange) regions among the ten Mimoseae chloroplast genomes. JLB (IRb/LSC), JSB (IRb/SSC), JSA (SSC/IRa) and JLA (IRa/LSC) denote the junction sites between the corresponding regions of the chloroplast genome.

Fig 5: Phylogenetic relationships of E. phaseoloides inferred using the maximum likelihood (ML) and Bayesian inference (BI) methods. The numbers at the branch nodes represent the bootstrap (BP) values and Bayesian posterior probabilities (PP).

Table 2: Gene function of chloroplast genome of E. phaseoloides.

Table 1: Summary of the complete chloroplast genome of E.