Chapter 2 Data preparation

2.1 Searching for the MAGs

Choose the MAGs
In this case, we will be working with Parabacteroides distasonis. In the EHI database, select the MAGs with > 90% completeness and < 2.5 contamination. Do the same in the GTDB, but also filter for isolation source CONTAINS feces.

2.2 Downloading the MAGs and generating data

Download genome indices and metadata
Download the EHI_MAG index for each species (in /data/mags_metadata folder) and the curl file and search metadata tsv from the GTDB and place in each species directory in Mjolnir.

2.5) Extract genome metadata
- Use the GTDB search tsv to run this script to obtain more metadata. (modify script a bit to make it scalable).

 python scripts/extract_gtdb_metadata.py

Create the master index
Run the create_index.R script to create an index of all the MAGs and the download paths (needs a list of species as input and right now it is hardcoded in the script)

conda activate mags_r_env
Rscript scripts/create_index.R

Run the downloading_mags.smk to download all the genomes

snakemake -s snakefiles/downloading_mags.smk \
    --executor slurm \
    --jobs 50 \
    --rerun-incomplete \
    --keep-going
    
    
#testing with the HUMAN_EHI
snakemake -s snakefiles/downloading_and_unzipping.smk \
    --executor slurm \
    --jobs 50 \
    --rerun-incomplete \
    --keep-going

Unzip all the MAGs and make sure they are in the same folder: run
(maybe add this to the downloading_mags.smk)

bash scripts/unzip_mags.sh parabacteroides_distasonis  
bash scripts/unzip_mags.sh phocaeicola_vulgatus

Make a screen session for each species.

screen -S parabacteroides_distasonis

Run drakkar annotating_function.smk to re-annotate all the MAGs:

drakkar annotating -b /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/parabacteroides_distasonis/mags/unzipped -o /maps/projects/alberdilab/people/pjq449/comparative_metagenomics --env_path /projects/alberdilab/data/environments/drakkar --annotation-type function

Annotate with prokka and make phylogenetic tree with getphylo

snakemake -s snakefiles/prokka_and_pangenome.smk --profile snakefiles/slurm_profile

Run genome_mapping.sh to map the mag ids to the corresponding names
Run drep

 sbatch scripts/drep_compare.slurm

Run create_pangolin_input.R to create the input file ppanggolin needs

conda activate mags_r_env
Rscript scripts/create_pangolin_input.R

#or if we want to use the annotation files by prokka:
Rscript scripts/create_pangolin_input_gff.R

Pangenome analysis with ppanggolin

conda activate ppanggolin

ppanggolin all --fasta /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/parabacteroides_distasonis/ppanggolin_input.tsv

#with annotation files:
ppanggolin all --anno maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/parabacteroides_distasonis/genome_gff_paths.tsv --rarefaction 

##if you are runnning this in the species folder, write local path not absolute.

2.3 R Studio- Data Analysis

2.3.1 EHI MAGs

ehi_mags <- read_csv("data/ehi_metadata_location.csv")

Rows: 8857 Columns: 54
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (38): ID, mag_name, link_to_assembly, AB_batch, eha_number, GTDB_version, GTDB_release, domain, phylum, class, order, family, ...
dbl (15): fastani_ani, closest_placement_ani, closest_placement_af, completeness, contamination, size, N50, coding_density, contig...
lgl  (1): annotated

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.3.2 Prepare color scheme

AlberdiLab projects use unified color schemes developed for the Earth Hologenome Initiative, to facilitate figure interpretation.

ehi_mags_p <- ehi_mags %>%
  mutate(phylum=str_remove_all(phylum, "p__")) 

phylum_colors <- read_tsv("https://raw.githubusercontent.com/earthhologenome/EHI_taxonomy_colour/main/ehi_phylum_colors.tsv") %>%
    mutate(phylum=str_remove_all(phylum, "p__")) %>%
    right_join(ehi_mags_p, by=join_by(phylum == phylum)) %>%
    dplyr::select(phylum, colors) %>% 
    unique() %>%
    arrange(phylum) %>%
    pull(colors, name=phylum)

source_colors <- c(EHI = "#8BC63F", GTDB = "#2D522D")
#source_colors <- c(EHI = "#8BC63F", GTDB = "#A90D00")

2.3.3 Sample metadata

gtdb_search_metadata <- read_tsv("data/p_dist_gtdb_metadata.tsv")

Rows: 69 Columns: 9
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): Accession, Organism Name, NCBI Taxonomy, GTDB Taxonomy, GTDB Type Material, Isolation Source
dbl (2): CheckM Completeness, CheckM2 Contamination
lgl (1): GTDB Representative of Species

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

gtdb_accession_metadata<- read_tsv("data/parabacteroides_distasonis_GTDB_accession_metadata.tsv")

Rows: 67 Columns: 115
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (69): accession, checkm2_model, checkm_marker_lineage, gtdb_genome_representative, gtdb_taxonomy, gtdb_type_designation_ncbi_...
dbl  (39): ambiguous_bases, checkm2_completeness, checkm2_contamination, checkm_completeness, checkm_contamination, checkm_marker_...
lgl   (5): gtdb_representative, gtdb_type_species_of_genus, mimag_high_quality, mimag_low_quality, mimag_medium_quality
date  (2): ncbi_date, ncbi_seq_rel_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ehi_metadata <- read_tsv("data/mags_metadata/parabacteroides_distasonis_metadata.tsv") %>%
  mutate(GC = as.numeric(str_remove(GC, "%")))

Rows: 25 Columns: 52
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (36): ID, mag_name, link_to_assembly, AB_batch, eha_number, GTDB_version, GTDB_release, domain, phylum, class, order, family, ...
dbl (15): fastani_ani, closest_placement_ani, closest_placement_af, completeness, contamination, size, N50, coding_density, contig...
lgl  (1): annotated

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

location_data<- ehi_mags %>% dplyr::select(ID, locality, country)

ehi_metadata_location <- left_join(ehi_metadata, location_data, by = "ID")

2.3.4 Load genome mapping

genome_mapping <- read_tsv("data/genome_mapping.tsv")

Rows: 69 Columns: 2
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): genome_id, genome_file

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

gtdb_accession_metadata_2 <- left_join(gtdb_accession_metadata, genome_mapping, by = c("search_accession" = "genome_id")
  )



#Make a genome_mapping with all the MAGs (including EHI)

#First select the ehi IDs from ehi_metadata
ehi_ids <- ehi_metadata %>% 
  dplyr::select(ID, mag_name) %>%
  dplyr::rename(genome_id = ID,
         genome_file = mag_name)

ehi_ids$genome_file <- sub("\\.fa$", "", ehi_ids$genome_file)#remove .fa 

genome_mapping_all <- genome_mapping %>%
  bind_rows(ehi_ids)

# Prepare EHI metadata with standardized column names
ehi_clean <- ehi_metadata_location%>%
  dplyr::select(
    ID,
    species = species,  
    completeness,
    contamination,
    genome_size = size,
    GC,
    N50,
    contigs,
    host_species,
    host_order,
    host_class,
    assembly_type,
    isolation_source= sample_type,
    mag_name, 
    eha_number,
    locality, 
    country
  ) %>%
  mutate(source = "EHI")

# Prepare GTDB metadata
# Join search and accession metadata
gtdb_metadata <- full_join(gtdb_accession_metadata_2, gtdb_search_metadata,  by = c("search_accession" = "Accession")
  )



# Prepare GTDB metadata
gtdb_clean <- gtdb_metadata %>%
  dplyr::select(
    ID = search_accession,
    species = "Organism Name",
    completeness = "CheckM Completeness",
    contamination = "CheckM2 Contamination",
    gtdb_taxonomy = "GTDB Taxonomy",
    isolation_source = "Isolation Source",
    genome_size,
    GC = gc_percentage,
    N50 = n50_contigs,
    contigs = contig_count,
    country = ncbi_country,
    mimag_high_quality,
    mimag_medium_quality,
    gtdb_representative,
    mag_name = genome_file,
    ncbi_date
  )%>%
  mutate(source = "GTDB")

#Join
genome_metadata <- bind_rows(ehi_clean, gtdb_clean)

genome_metadata_clean <- genome_metadata %>%
  separate_wider_delim(
    cols = country, 
    delim = regex(":\\s*"),
    names = c("country_clean", "extra_locality"),
    too_few = "align_start"
  ) %>%
  mutate(
    locality = case_when(
      !is.na(extra_locality) & !is.na(locality) ~ paste(extra_locality, locality, sep = ", "),
      !is.na(extra_locality) ~ extra_locality,
      TRUE ~ locality
    )
  ) %>%
  dplyr::rename(country = country_clean)


genome_metadata <- genome_metadata_clean %>%
  mutate(
    # United States to USA
    country = case_match(
      country,
      "United States" ~ "USA",
      .default = country
    ),
    
    # Continent column
    continent = case_when(
      country %in% c("Spain", "Italy", "Greece", "Portugal", "Malta", 
                     "Ireland", "Germany", "United Kingdom") ~ "Europe",
      country %in% c("USA", "Canada", "Greenland") ~ "North America",
      country %in% c("South Korea", "China", "Japan") ~ "Asia",
      country == "Australia" ~ "Oceania",
      TRUE ~ NA_character_  # Handle "none" and NA
    )
  )

# Read master index
master_index <- read_tsv("data/master_mag_index.tsv") %>%
  filter(species == "parabacteroides_distasonis") %>%
  mutate(
    # Extract the actual genome identifier from out_path
    genome_identifier = case_when(
      # For EHI: extract EHA00531_bin.1 from the filename
      str_detect(out_path, "EHA") ~ str_extract(out_path, "EHA[0-9]+_bin\\.[0-9]+"),
      # For NCBI: use the ID as is (GCA/GCF number)
      TRUE ~ ID
    )
  )

Rows: 287 Columns: 4
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): species, ID, MAG_url, out_path

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Check the mapping
print("Master index mapping:")

[1] "Master index mapping:"

master_index %>% 
  dplyr::select(ID, genome_identifier, out_path) %>%
  head(10) %>%
  print()

# A tibble: 10 × 3
   ID        genome_identifier out_path                                                   
   <chr>     <chr>             <chr>                                                      
 1 EHM016228 EHA00531_bin.1    data/parabacteroides_distasonis/mags/EHA00531_bin.1.fa.gz  
 2 EHM016318 EHA00539_bin.6    data/parabacteroides_distasonis/mags/EHA00539_bin.6.fa.gz  
 3 EHM016582 EHA00535_bin.6    data/parabacteroides_distasonis/mags/EHA00535_bin.6.fa.gz  
 4 EHM016991 EHA00726_bin.13   data/parabacteroides_distasonis/mags/EHA00726_bin.13.fa.gz 
 5 EHM020917 EHA01202_bin.66   data/parabacteroides_distasonis/mags/EHA01202_bin.66.fa.gz 
 6 EHM023585 EHA00928_bin.90   data/parabacteroides_distasonis/mags/EHA00928_bin.90.fa.gz 
 7 EHM023781 EHA01439_bin.44   data/parabacteroides_distasonis/mags/EHA01439_bin.44.fa.gz 
 8 EHM027637 EHA01634_bin.14   data/parabacteroides_distasonis/mags/EHA01634_bin.14.fa.gz 
 9 EHM028668 EHA01281_bin.18   data/parabacteroides_distasonis/mags/EHA01281_bin.18.fa.gz 
10 EHM029564 EHA01845_bin.103  data/parabacteroides_distasonis/mags/EHA01845_bin.103.fa.gz

# Read contig-to-genome mapping
contig_to_genome <- read_tsv("data/p_dist_contig_to_genome.tsv",
                              col_names = c("contig", "genome_filename"))

Rows: 13976 Columns: 2
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (2): contig, genome_filename

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.3.5 Genome annotations

# Read annotations and add IDs
genome_annotations <- read_tsv("data/gene_annotations.tsv.xz") %>%
  mutate(contig = sub("_[^_]*$", "", gene)) %>%
  left_join(contig_to_genome, by = "contig") %>%
  # Extract accession from genome filename
  mutate(
    accession = case_when(
      # For NCBI genomes: extract GCA/GCF number
      str_detect(genome_filename, "^GC[AF]_") ~ str_extract(genome_filename, "^GC[AF]_[0-9]+\\.[0-9]+"),
      # For EHI genomes: the filename IS the identifier
      TRUE ~ genome_filename
    )
  ) %>%
  # Join with master index using genome_identifier
  left_join(
    master_index %>% dplyr::select(mag_id = ID, genome_identifier),
    by = c("accession" = "genome_identifier")
  ) %>%
  dplyr::select(gene, mag_id, genome = genome_filename, contig, accession, everything()) %>%
  filter(!is.na(mag_id))

Warning: One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
  problems(dat)

Rows: 374229 Columns: 13
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (11): gene, strand, kegg, ec, pfam, cazy, resistance_type, resistance_target, vf, vf_type, signalp
dbl  (2): start, end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2.3.6 Distill annotations into GIFTs

genome_gifts <- distill(genome_annotations,GIFT_db,genomecol= 2, annotcol=c(9,10,11,12), verbosity = F)


Identifiers in the annotation table: 1648 
Identifiers in the database: 1547 
Identifiers in both: 201 
Percentage of annotation table identifiers used for distillation: 12.2%
Percentage of database identifiers used for distillation: 12.99%

2.3.7 Load getphylo trees

genome_metadata$mag_name <- sub("\\.fa$", "", genome_metadata$mag_name)#remove .fa from the mag names so that they match the tree ids


getphylo_tree <- read_tree("data/combined_alignment.tree")
getphylo_tree$tip.label <- gsub("'", "", getphylo_tree$tip.label)
tip_df <- data.frame(label = getphylo_tree$tip.label) %>%
  left_join(genome_metadata, by = c("label" = "mag_name"))

#replace long mag names with IDs
getphylo_tree$tip.label <- tip_df$ID

2.4 Wrap working objects

All working objects are wrapped into a single Rdata object to facilitate downstream usage.

save(ehi_mags,
     phylum_colors,
     genome_annotations,
     genome_gifts,
     contig_to_genome,
     gtdb_metadata,
     ehi_metadata,
     master_index,
     genome_metadata,
     source_colors,
     getphylo_tree,
     file = "data/data.Rdata")