Large Scale Genome-Centric Metagenomic Data from the Gut Microbiome of Food-Producing Animals and Humans

Data generation

Data were generated by the GUARANI (One Health Brazilian Group) network. Initially, the aims of the GUARANI network’s project were to quantify the abundance and diversity of antibiotic resistance genes (i.e., the resistome) in a large number of samples from Brazil (South America), distributed across the five major Brazilian geographical regions (North Region – Castanhal, 1°17′46.3776″ S–47°55′8.6016″ W; South Region – Blumenau, 26°55′10″ S–49°3.967′ W; Southeast Region – Bragança, 22°57′9.7″ S–46°32.651′ W; Midwest Region – Dourados, 22°13′16″ S–54°48.334′ W; Northeast Region – Fortaleza, 3°43′2″ S–38°32.584′ W), and to investigate, by metagenomic approaches, the relationship between human and food-producing animal microbiomes and the potential exchange of pathogenic and non-pathogenic microbes between these systems (Fig. 1A). Supplementary Table 1 describes the sex, species, and age of the animals, and the geographic location of the farms and cities where samples were collected. The experimental design was based on general, descriptive traits. To cover a large number of samples from all geographical regions of Brazil and to enable large-scale genome-centric metagenomic analyses, we chose to use the same samples for Illumina high-throughput sequencing. For this, 107 samples [humans (N = 34), cattle (N = 28), swine (N = 15), and poultry (N = 30)] were collected in triplicate from farms located in the five Brazilian geographical regions (Fig. 1B). For each region, properties were selected based on the criterion of simultaneous swine, poultry, and cattle rearing. Human samples were collected from healthy individuals who lived in the urban areas closest to the rural properties. The World Health Organization defines health as “complete physical, mental, and social well-being”. In this study, we followed this concept and defined healthy individuals as adults (>18 years old) without any physical disease or infirmity.
All human data were anonymized, and the authors affirm that human research participants provided informed consent for the publication of the microbiome data and that all information was approved by the research ethics committees. Data collection was approved by the Research Ethics Committee (CEP) and the Committee on Ethics in the Use of Animals (CEUA) of the Universidade Federal de São Paulo (UNIFESP), and by the National System of Genetic Resource Management and Associated Traditional Knowledge (SISGEN) (process numbers: 3,116.383, 2607170119 and AA1668A, respectively). All cattle, swine, and poultry samples were collected only from adult animals. To collect faecal samples from animals, a swab was introduced into the first 2 cm of the rectal region. Invasive rectal swabs were used only for animals (swine, cattle, and poultry). Human subjects were instructed to collect stool samples using a sterile faecal collection container with no preservative. A sterile charcoal swab was introduced into the stool specimen, and excess stool was quickly removed by pressing the swab against the container wall. The samples were stored and shipped to a central laboratory for DNA extraction.

Figure 1

The general concept of the large-scale One Health Project and sample site locations. (A) Global strategy to study the relationship between human and animal health and the transfer of microbial species (pathogens and non-pathogens) between these systems. (B) Five geographic regions in which samples were collected in Brazil.

DNA extraction and sequencing

DNA extractions were carried out under sterile conditions in a microbiological vertical laminar airflow hood. We did not use negative control samples (e.g., a “blank swab”) because reagent and laboratory contamination is most problematic in low microbial biomass microbiomes (e.g., the human placenta or lung microbiome) rather than in high microbial biomass microbiomes12,13, such as the faecal samples used in this study. DNA was extracted directly from swabs using the ZymoBIOMICS DNA Miniprep Kit (Zymo, USA). DNA integrity and concentration were assessed using a Qubit® 2.0 Fluorometer (Thermo Fisher Scientific, AU). All samples were quantified by Qubit and arranged on the sequencing plates according to the DNA concentrations obtained (Supplementary Table 2). Samples within the same range of DNA amount (ng) were placed on the same plate, because the number of PCR amplification cycles for the libraries depends on the initial amount of DNA, according to the Illumina protocol. Samples from the different hosts were processed together, with maximum attention to avoiding cross-contamination. In short, sequencing libraries were prepared with the Nextera DNA Flex Library Preparation Kit (Illumina, USA) according to the manufacturer’s protocol. Sequencing was carried out on the NextSeq 500 System (Illumina, USA) using the NextSeq 500/550 High Output Kit v2.5 (300 cycles), generating 2 × 150 bp reads.


First, raw reads were filtered using BBDuk software. Illumina adapters, PhiX, and reads with a Phred score below 20 were removed using the following parameters: minlength = 50, mink = 8, qout = auto, hdist = 1, k = 31, trimq = 10, qtrim = rl, ktrim = l, minavgquality = 20 and statscolumns = 5. Then, host-associated reads were filtered out using four reference genomes (Homo sapiens – GRCh38 v.38, Bos taurus – ARS-UCD 1.2, release 106_2108, Sus scrofa – Sscrofa11.1, release 106_2107 and Gallus gallus – GRCg6a, release 104a_2108). All alignments were performed in Bowtie 2.4.1 using the very-sensitive option14.
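The two filtering steps above can be sketched as command lines assembled in Python. This is a minimal sketch, not the authors' pipeline: the parameter values come from the text, while the file paths, the reference FASTA for adapters and PhiX, the Bowtie 2 index prefix, the thread count, and the helper function names are hypothetical. The functions only build argument lists; they do not run anything.

```python
from typing import List

# Parameter values as listed in the text; everything else below is assumed.
BBDUK_PARAMS = {
    "minlength": "50",
    "mink": "8",
    "qout": "auto",
    "hdist": "1",
    "k": "31",
    "trimq": "10",
    "qtrim": "rl",
    "ktrim": "l",
    "minavgquality": "20",
    "statscolumns": "5",
}


def bbduk_command(r1: str, r2: str, out1: str, out2: str,
                  ref_fasta: str, stats: str) -> List[str]:
    """Build a BBDuk call for one paired-end sample (paths are hypothetical)."""
    cmd = [
        "bbduk.sh",
        f"in1={r1}", f"in2={r2}",
        f"out1={out1}", f"out2={out2}",
        f"ref={ref_fasta}",   # assumed FASTA holding adapter and PhiX sequences
        f"stats={stats}",
    ]
    cmd += [f"{key}={value}" for key, value in BBDUK_PARAMS.items()]
    return cmd


def bowtie2_host_filter_command(index_prefix: str, r1: str, r2: str,
                                unaligned_prefix: str,
                                threads: int = 4) -> List[str]:
    """Build a Bowtie 2 call that keeps read pairs not aligning concordantly
    to the host genome (index prefix and thread count are assumptions)."""
    return [
        "bowtie2", "--very-sensitive",
        "-p", str(threads),
        "-x", index_prefix,
        "-1", r1, "-2", r2,
        "--un-conc-gz", unaligned_prefix,  # non-host pairs are written here
        "-S", "/dev/null",                 # the host alignments are discarded
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping the parameters in one dictionary makes it easy to check them against the values reported above.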

Metagenome assembly, binning, and genome quality control

To increase the throughput and maximize the number of MAGs in this dataset, we chose a strategy based on co-assembly. This strategy has been used in several studies, including the reconstruction of genomes from poultry15, cattle16, and human17 metagenomes. In this case, samples were merged by combining samples from the same host and region (see Supplementary Table 3 for each co-assembly dataset). Metagenomes were assembled using Megahit software18 with the meta-large option (--min-count 2 --k-list 27,37,47,57,67,77,87). A total of 4,861,910,960 high-quality reads were used to assemble 1,676,286 contigs greater than 2,500 bp (Table 1). Binning was used to reconstruct genomes from metagenomes based on the compositional traits of individual contigs (e.g., tetranucleotide frequency and coverage) using Metabat2 with default parameters19. To remove spurious and contaminated genomes, only genomes that passed rigorous quality control were used in the downstream analyses. Genomes with completeness ≥50.0% and contamination ≤10.0%, as estimated with CheckM software21 (lineage workflow), were retained, following the Minimum Information about a Metagenome-Assembled Genome (MIMAG) of bacteria and archaea standards20. A total of 2,915 MAGs were reconstructed (Table 2 and Supplementary Table 3). Of these MAGs, 1,273 are high-quality drafts (≥90% completeness and ≤5% contamination), and 1,642 are medium-quality drafts (≥50% completeness and ≤10% contamination) (Fig. 2). The mean and standard deviation of genome size were 3.1 ± 1.4 Mbp, while the number of contigs per MAG was 263 ± 263. The mean genome size is compatible with that described for human stool communities22. In addition, we assembled contigs greater than 2.02 Mbp in MAGs from poultry metagenomes, indicating the accuracy of the metagenome assembly.
All MAGs were submitted to the NCBI database and post-processed through NCBI’s Contamination Screen to remove adapter and cross-species contamination.
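The quality tiers described above can be expressed as a small classification function. This is an illustrative sketch using only the completeness and contamination thresholds stated in the text; the function names and the "rejected" label are hypothetical, and the full MIMAG high-quality definition also involves rRNA and tRNA gene criteria that are not modelled here.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify_mag(completeness: float, contamination: float) -> str:
    """Assign a quality tier from CheckM-style percentages (0-100)."""
    if completeness >= 90.0 and contamination <= 5.0:
        return "high"
    if completeness >= 50.0 and contamination <= 10.0:
        return "medium"
    return "rejected"  # excluded from downstream analyses


def summarize(mags: Iterable[Tuple[float, float]]) -> Dict[str, int]:
    """Count MAGs per tier from (completeness, contamination) pairs."""
    return dict(Counter(classify_mag(comp, cont) for comp, cont in mags))
```

Applied to a table of CheckM estimates, `summarize` yields the per-tier counts of the kind reported in Table 2; a genome with, say, 95% completeness but 7% contamination falls into the medium tier because it fails the stricter contamination bound.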

Table 1 Number of reads and metagenome assembly metrics of each individual data set.
Table 2 Number and quality of metagenome-assembled genomes (MAGs) of each individual dataset.
Figure 2

Quality determination of metagenome-assembled genomes (MAGs). (A) Completeness and (B) contamination were estimated by the identification of individual marker genes. (C) Genome size was calculated as the sum of bases present in all contigs of each MAG. (D) Number of contigs.

Taxonomy prediction

We used the standardized bacterial taxonomy based on genome phylogenomics proposed by Parks and collaborators23, using the GTDB-Tk v1.3.0 software24 (classify_wf workflow) and the most recent version of the Genome Taxonomy Database (GTDB), Release 05-RS95 (ref. 23). This workflow has been used to infer the taxonomy of MAGs because it improves the classification of new uncultivated lineages and standardizes taxonomic ranks based on phylogenetic information. The most representative phyla were Firmicutes, Bacteroidota, and Proteobacteria (Fig. 3A), which are extensively studied in host-associated microbiomes25. However, many of the MAGs described here are potential new genera or new families (Fig. 3B), offering an opportunity to gain new insights into the ecophysiology of these taxonomic groups. Regarding taxa shared between the microbial communities of the four hosts, 45 genera were shared among distinct hosts (Fig. 3C and Supplementary Table 5). These include environmental genera with ecological importance in digestive microbiomes (e.g., Cellulomonas and Azospirillum). Furthermore, four shared genera carry placeholder names (SZUA-444, SZUA-584, UBA1305, and UBA8346), demonstrating the value of this dataset for exploring new taxonomic groups.
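Counting genera shared between hosts, as described above, amounts to grouping genus-level assignments by host. The following is a minimal sketch that assumes each MAG comes with a host label and a GTDB-style lineage string with rank prefixes such as `g__`; the function names and input layout are hypothetical and are not taken from the authors' workflow.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def genus_of(lineage: str) -> Optional[str]:
    """Return the genus from a GTDB-style lineage, or None when the genus
    rank is empty (a MAG that could represent a novel genus)."""
    for field in lineage.split(";"):
        field = field.strip()
        if field.startswith("g__"):
            return field[3:] or None
    return None


def shared_genera(records: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each genus found in two or more hosts to the set of those hosts.

    `records` yields (host, lineage) pairs, one per MAG.
    """
    hosts_by_genus: Dict[str, Set[str]] = defaultdict(set)
    for host, lineage in records:
        genus = genus_of(lineage)
        if genus is not None:  # unassigned genera cannot be compared by name
            hosts_by_genus[genus].add(host)
    return {g: h for g, h in hosts_by_genus.items() if len(h) >= 2}
```

MAGs without a genus assignment are skipped here because they cannot be matched by name; they would instead count toward the potential novel lineages shown in Fig. 3B.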

Figure 3

Taxonomy and host distribution of MAGs. (A) Circos plot showing the abundance of phyla in each host microbiome. The external track indicates the relative number (%) of MAGs per phylum or host. The internal track shows the absolute number of MAGs grouped by phylum or host. (B) Number of potential novel lineages for each host microbiome. (C) Absolute number of shared MAGs assigned to the genus level between host-associated microbiomes (human, swine, cattle, and poultry).
