Chapter 2 Data preparation
2.1 Searching for the MAGs
- Choose the MAGs
In this case, we will be working with Parabacteroides distasonis. In the EHI database, select the MAGs with > 90% completeness and < 2.5 contamination. Do the same in the GTDB, but also filter for isolation source CONTAINS feces.
2.2 Downloading the MAGs and generating data
- Download genome indices and metadata
Download the EHI_MAG index for each species (in /data/mags_metadata folder) and the curl file and search metadata tsv from the GTDB and place in each species directory in Mjolnir.
2.5) Extract genome metadata
- Use the GTDB search tsv to run this script to obtain more metadata. (modify script a bit to make it scalable).
- Create the master index
Run the create_index.R script to create an index of all the MAGs and the download paths (needs a list of species as input and right now it is hardcoded in the script)
- Run the downloading_mags.smk to download all the genomes
snakemake -s snakefiles/downloading_mags.smk \
--executor slurm \
--jobs 50 \
--rerun-incomplete \
--keep-going
#testing with the HUMAN_EHI
snakemake -s snakefiles/downloading_and_unzipping.smk \
--executor slurm \
--jobs 50 \
--rerun-incomplete \
--keep-going- Unzip all the MAGs and make sure they are in the same folder: run
(maybe add this to the downloading_mags.smk)
bash scripts/unzip_mags.sh parabacteroides_distasonis
bash scripts/unzip_mags.sh phocaeicola_vulgatus- Make a screen session for each species.
- Run drakkar annotating_function.smk to re-annotate all the MAGs:
drakkar annotating -b /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/parabacteroides_distasonis/mags/unzipped -o /maps/projects/alberdilab/people/pjq449/comparative_metagenomics --env_path /projects/alberdilab/data/environments/drakkar --annotation-type function - Annotate with prokka and make phylogenetic tree with getphylo
Run genome_mapping.sh to map the mag ids to the corresponding names
Run drep
- Run create_pangolin_input.R to create the input file ppanggolin needs
conda activate mags_r_env
Rscript scripts/create_pangolin_input.R
#or if we want to use the annotation files by prokka:
Rscript scripts/create_pangolin_input_gff.R- Pangenome analysis with ppanggolin
conda activate ppanggolin
ppanggolin all --fasta /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/parabacteroides_distasonis/ppanggolin_input.tsv
#with annotation files:
ppanggolin all --anno maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/parabacteroides_distasonis/genome_gff_paths.tsv --rarefaction
##if you are runnning this in the species folder, write local path not absolute.2.3 R Studio- Data Analysis
2.3.1 EHI MAGs
Rows: 8857 Columns: 54
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (38): ID, mag_name, link_to_assembly, AB_batch, eha_number, GTDB_version, GTDB_release, domain, phylum, class, order, family, ...
dbl (15): fastani_ani, closest_placement_ani, closest_placement_af, completeness, contamination, size, N50, coding_density, contig...
lgl (1): annotated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2.3.2 Prepare color scheme
AlberdiLab projects use unified color schemes developed for the Earth Hologenome Initiative, to facilitate figure interpretation.
ehi_mags_p <- ehi_mags %>%
mutate(phylum=str_remove_all(phylum, "p__"))
phylum_colors <- read_tsv("https://raw.githubusercontent.com/earthhologenome/EHI_taxonomy_colour/main/ehi_phylum_colors.tsv") %>%
mutate(phylum=str_remove_all(phylum, "p__")) %>%
right_join(ehi_mags_p, by=join_by(phylum == phylum)) %>%
dplyr::select(phylum, colors) %>%
unique() %>%
arrange(phylum) %>%
pull(colors, name=phylum)2.3.3 Sample metadata
Rows: 69 Columns: 9
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): Accession, Organism Name, NCBI Taxonomy, GTDB Taxonomy, GTDB Type Material, Isolation Source
dbl (2): CheckM Completeness, CheckM2 Contamination
lgl (1): GTDB Representative of Species
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 67 Columns: 115
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (69): accession, checkm2_model, checkm_marker_lineage, gtdb_genome_representative, gtdb_taxonomy, gtdb_type_designation_ncbi_...
dbl (39): ambiguous_bases, checkm2_completeness, checkm2_contamination, checkm_completeness, checkm_contamination, checkm_marker_...
lgl (5): gtdb_representative, gtdb_type_species_of_genus, mimag_high_quality, mimag_low_quality, mimag_medium_quality
date (2): ncbi_date, ncbi_seq_rel_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ehi_metadata <- read_tsv("data/mags_metadata/parabacteroides_distasonis_metadata.tsv") %>%
mutate(GC = as.numeric(str_remove(GC, "%")))Rows: 25 Columns: 52
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (36): ID, mag_name, link_to_assembly, AB_batch, eha_number, GTDB_version, GTDB_release, domain, phylum, class, order, family, ...
dbl (15): fastani_ani, closest_placement_ani, closest_placement_af, completeness, contamination, size, N50, coding_density, contig...
lgl (1): annotated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2.3.4 Load genome mapping
Rows: 69 Columns: 2
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): genome_id, genome_file
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gtdb_accession_metadata_2 <- left_join(gtdb_accession_metadata, genome_mapping, by = c("search_accession" = "genome_id")
)
#Make a genome_mapping with all the MAGs (including EHI)
#First select the ehi IDs from ehi_metadata
ehi_ids <- ehi_metadata %>%
dplyr::select(ID, mag_name) %>%
dplyr::rename(genome_id = ID,
genome_file = mag_name)
ehi_ids$genome_file <- sub("\\.fa$", "", ehi_ids$genome_file)#remove .fa
genome_mapping_all <- genome_mapping %>%
bind_rows(ehi_ids)# Prepare EHI metadata with standardized column names
ehi_clean <- ehi_metadata_location%>%
dplyr::select(
ID,
species = species,
completeness,
contamination,
genome_size = size,
GC,
N50,
contigs,
host_species,
host_order,
host_class,
assembly_type,
isolation_source= sample_type,
mag_name,
eha_number,
locality,
country
) %>%
mutate(source = "EHI")
# Prepare GTDB metadata
# Join search and accession metadata
gtdb_metadata <- full_join(gtdb_accession_metadata_2, gtdb_search_metadata, by = c("search_accession" = "Accession")
)
# Prepare GTDB metadata
gtdb_clean <- gtdb_metadata %>%
dplyr::select(
ID = search_accession,
species = "Organism Name",
completeness = "CheckM Completeness",
contamination = "CheckM2 Contamination",
gtdb_taxonomy = "GTDB Taxonomy",
isolation_source = "Isolation Source",
genome_size,
GC = gc_percentage,
N50 = n50_contigs,
contigs = contig_count,
country = ncbi_country,
mimag_high_quality,
mimag_medium_quality,
gtdb_representative,
mag_name = genome_file,
ncbi_date
)%>%
mutate(source = "GTDB")
#Join
genome_metadata <- bind_rows(ehi_clean, gtdb_clean)
genome_metadata_clean <- genome_metadata %>%
separate_wider_delim(
cols = country,
delim = regex(":\\s*"),
names = c("country_clean", "extra_locality"),
too_few = "align_start"
) %>%
mutate(
locality = case_when(
!is.na(extra_locality) & !is.na(locality) ~ paste(extra_locality, locality, sep = ", "),
!is.na(extra_locality) ~ extra_locality,
TRUE ~ locality
)
) %>%
dplyr::rename(country = country_clean)
genome_metadata <- genome_metadata_clean %>%
mutate(
# United States to USA
country = case_match(
country,
"United States" ~ "USA",
.default = country
),
# Continent column
continent = case_when(
country %in% c("Spain", "Italy", "Greece", "Portugal", "Malta",
"Ireland", "Germany", "United Kingdom") ~ "Europe",
country %in% c("USA", "Canada", "Greenland") ~ "North America",
country %in% c("South Korea", "China", "Japan") ~ "Asia",
country == "Australia" ~ "Oceania",
TRUE ~ NA_character_ # Handle "none" and NA
)
)# Read master index
master_index <- read_tsv("data/master_mag_index.tsv") %>%
filter(species == "parabacteroides_distasonis") %>%
mutate(
# Extract the actual genome identifier from out_path
genome_identifier = case_when(
# For EHI: extract EHA00531_bin.1 from the filename
str_detect(out_path, "EHA") ~ str_extract(out_path, "EHA[0-9]+_bin\\.[0-9]+"),
# For NCBI: use the ID as is (GCA/GCF number)
TRUE ~ ID
)
)Rows: 287 Columns: 4
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): species, ID, MAG_url, out_path
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "Master index mapping:"
# A tibble: 10 × 3
ID genome_identifier out_path
<chr> <chr> <chr>
1 EHM016228 EHA00531_bin.1 data/parabacteroides_distasonis/mags/EHA00531_bin.1.fa.gz
2 EHM016318 EHA00539_bin.6 data/parabacteroides_distasonis/mags/EHA00539_bin.6.fa.gz
3 EHM016582 EHA00535_bin.6 data/parabacteroides_distasonis/mags/EHA00535_bin.6.fa.gz
4 EHM016991 EHA00726_bin.13 data/parabacteroides_distasonis/mags/EHA00726_bin.13.fa.gz
5 EHM020917 EHA01202_bin.66 data/parabacteroides_distasonis/mags/EHA01202_bin.66.fa.gz
6 EHM023585 EHA00928_bin.90 data/parabacteroides_distasonis/mags/EHA00928_bin.90.fa.gz
7 EHM023781 EHA01439_bin.44 data/parabacteroides_distasonis/mags/EHA01439_bin.44.fa.gz
8 EHM027637 EHA01634_bin.14 data/parabacteroides_distasonis/mags/EHA01634_bin.14.fa.gz
9 EHM028668 EHA01281_bin.18 data/parabacteroides_distasonis/mags/EHA01281_bin.18.fa.gz
10 EHM029564 EHA01845_bin.103 data/parabacteroides_distasonis/mags/EHA01845_bin.103.fa.gz
# Read contig-to-genome mapping
contig_to_genome <- read_tsv("data/p_dist_contig_to_genome.tsv",
col_names = c("contig", "genome_filename"))Rows: 13976 Columns: 2
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): contig, genome_filename
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2.3.5 Genome annotations
# Read annotations and add IDs
genome_annotations <- read_tsv("data/gene_annotations.tsv.xz") %>%
mutate(contig = sub("_[^_]*$", "", gene)) %>%
left_join(contig_to_genome, by = "contig") %>%
# Extract accession from genome filename
mutate(
accession = case_when(
# For NCBI genomes: extract GCA/GCF number
str_detect(genome_filename, "^GC[AF]_") ~ str_extract(genome_filename, "^GC[AF]_[0-9]+\\.[0-9]+"),
# For EHI genomes: the filename IS the identifier
TRUE ~ genome_filename
)
) %>%
# Join with master index using genome_identifier
left_join(
master_index %>% dplyr::select(mag_id = ID, genome_identifier),
by = c("accession" = "genome_identifier")
) %>%
dplyr::select(gene, mag_id, genome = genome_filename, contig, accession, everything()) %>%
filter(!is.na(mag_id))Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
Rows: 374229 Columns: 13
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (11): gene, strand, kegg, ec, pfam, cazy, resistance_type, resistance_target, vf, vf_type, signalp
dbl (2): start, end
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
2.3.6 Distill annotations into GIFTs
genome_gifts <- distill(genome_annotations,GIFT_db,genomecol= 2, annotcol=c(9,10,11,12), verbosity = F)
Identifiers in the annotation table: 1648
Identifiers in the database: 1547
Identifiers in both: 201
Percentage of annotation table identifiers used for distillation: 12.2%
Percentage of database identifiers used for distillation: 12.99%
2.3.7 Load getphylo trees
genome_metadata$mag_name <- sub("\\.fa$", "", genome_metadata$mag_name)#remove .fa from the mag names so that they match the tree ids
getphylo_tree <- read_tree("data/combined_alignment.tree")
getphylo_tree$tip.label <- gsub("'", "", getphylo_tree$tip.label)
tip_df <- data.frame(label = getphylo_tree$tip.label) %>%
left_join(genome_metadata, by = c("label" = "mag_name"))
#replace long mag names with IDs
getphylo_tree$tip.label <- tip_df$ID