Chapter 2 Data preparation

2.1 Searching for the MAGs

  1. Choose the MAGs
    For this project, the following bacterial species were selected: Lactococcus lactis, Hafnia paralvei, Enterococcus faecalis, Bacteroides uniformis, Phocaeicola vulgatus, Parabacteroides goldsteinii, Citrobacter braakii, Akkermansia muciniphila, Enterococcus hirae and Bacteroides fragilis.

EHI MAGs: In the EHI database, select the MAGs with >90% completeness and <2.5% contamination for each species.
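These quality cutoffs can also be applied programmatically. A minimal sketch in Python, assuming a tab-separated export with hypothetical `completeness` and `contamination` columns (the real EHI headers may differ):

```python
import csv
import io

def select_mags(tsv_text, min_completeness=90.0, max_contamination=2.5):
    """Keep rows passing the >90% completeness / <2.5% contamination cutoffs.

    Column names "completeness" and "contamination" are assumptions; adjust
    them to match the actual EHI metadata export.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if float(row["completeness"]) > min_completeness
        and float(row["contamination"]) < max_contamination
    ]
```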

GTDB MAGs: In the GTDB search interface, apply the following filter for each species (shown here for Lactococcus lactis):

(“NCBI Taxonomy” CONTAINS “Lactococcus lactis” AND “CheckM2 Completeness” > “90” AND “CheckM Contamination” < “2.5” AND (“Isolation Source” CONTAINS “feces” OR “Isolation Source” CONTAINS “excrement” OR “Isolation Source” CONTAINS “metagenome” OR “Isolation Source” CONTAINS “microbiome” OR “Isolation Source” CONTAINS “gut” OR “Isolation Source” CONTAINS “faeces” OR “Isolation Source” CONTAINS “fecal”) AND “Isolation Source” IS NOT “N/A”)

NCBI MAGs: Refer to 03_downloading_mags.Rmd

2.2 Downloading the MAGs and generating data

  1. Download genome indices and metadata
    Download the EHI_MAG index for each species (into the data/mags_metadata folder), together with the curl file and the search-metadata TSV from the GTDB, and place them in each species directory on Mjolnir.

  2. Extract genome metadata
    Use the GTDB search TSV as input to this pipeline to obtain additional metadata:

snakemake -s /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/snakefiles/gtdb_metadata_pipeline.smk -j 1 --use-conda --rerun-incomplete

METADATA FILES:
- EHI metadata: data/mags_metadata/lactococcus_lactis_metadata.tsv
- GTDB metadata: data/mags_metadata/lactococcus_lactis_gtdb_final_metadata.tsv
- NCBI metadata: lactococcus_lactis_ncbi_metadata.rds
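Since the three sources use different layouts, it can help to collapse them into one minimal schema before indexing. A sketch, assuming hypothetical `accession` and `species` columns (the real headers may differ, and the NCBI .rds table would first need to be exported to TSV from R):

```python
import csv
import io

def harmonize(source, tsv_text, accession_col="accession", species_col="species"):
    """Reduce one source's metadata TSV to a common (source, accession, species) schema.

    The column names are assumptions, overridable per source via the keyword
    arguments.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {"source": source, "accession": r[accession_col], "species": r[species_col]}
        for r in reader
    ]
```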

  3. Create the master index
    Run the create_index.R script to build an index of all the MAGs and their download paths (it takes a list of species as input; at present the list is hardcoded in the script).
conda activate r_env
Rscript scripts/create_index.R 
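The hardcoded species list could instead be taken from the command line. A hypothetical Python sketch of the same indexing idea (the `accession`/`url` column names and the `<species>_metadata.tsv` naming are assumptions, not the actual create_index.R logic):

```python
import csv
from pathlib import Path

def build_index(species, metadata_dir):
    """Collect one (species, accession, url) row per MAG across per-species
    metadata TSVs. Species without a metadata file are skipped."""
    rows = []
    for sp in species:
        tsv = Path(metadata_dir) / f"{sp}_metadata.tsv"
        if not tsv.exists():
            continue
        with open(tsv, newline="") as fh:
            for r in csv.DictReader(fh, delimiter="\t"):
                rows.append({"species": sp, "accession": r["accession"], "url": r["url"]})
    return rows
```

The species list would then be passed as arguments (e.g. via sys.argv) rather than edited in the script itself.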
  4. Run the downloading snakefile (downloading_and_unzipping.smk) to download all the genomes
module load snakemake

snakemake -s snakefiles/downloading_and_unzipping.smk \
  --executor slurm \
  --jobs 50 \
  --rerun-incomplete \
  --keep-going \
  --rerun-triggers mtime

  5. Download the MAGs from the NCBI

# TODO: fold this step into downloading_and_unzipping.smk?
conda activate drakkar_env 

# Run this inside the mags folder
datasets download genome accession --inputfile ../phocaeicola_vulgatus_ncbi_selected_accessions.txt --include genome --filename phocaeicola_vulgatus_ncbi_selected_genomes.zip

# Then unzip, extracting one .fna per accession
zip="phocaeicola_vulgatus_ncbi_selected_genomes.zip"

mkdir -p unzipped

unzip -Z1 "$zip" | grep '\.fna$' | while read -r f; do
    acc=$(basename "$(dirname "$f")")
    echo "Extracting $acc"
    unzip -p "$zip" "$f" > "unzipped/${acc}.fna"
done
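The same extraction can be done without the unzip binary. A Python equivalent of the loop above, using only the standard library:

```python
import zipfile
from pathlib import Path

def extract_fnas(zip_path, out_dir):
    """Pull every .fna out of the NCBI datasets zip, naming each extracted
    file after its accession (the parent directory inside the archive)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".fna"):
                continue
            acc = Path(name).parent.name  # accession = parent directory name
            target = out / f"{acc}.fna"
            target.write_bytes(zf.read(name))
            written.append(target.name)
    return written
```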
  6. Run dRep
sbatch scripts/drep_compare.slurm
  7. Remove the MAGs that share <95% ANI with the species cluster (i.e. are not the same species)
python3 scripts/filter_by_cdb.py data data/clusters_to_drop.tsv 
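filter_by_cdb.py is project-specific, but the underlying idea can be sketched from dRep's Cdb.csv, which assigns each genome to a secondary cluster (formed at the ANI threshold). The column names below follow dRep's usual output; verify them against your version:

```python
import csv
import io

def genomes_to_drop(cdb_csv_text, clusters_to_drop):
    """Return the genomes whose secondary cluster is flagged for removal.

    Assumes Cdb.csv has "genome" and "secondary_cluster" columns, as in
    recent dRep releases.
    """
    drop = set(clusters_to_drop)
    reader = csv.DictReader(io.StringIO(cdb_csv_text))
    return [r["genome"] for r in reader if r["secondary_cluster"] in drop]
```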

  8. Re-run dRep without the removed MAGs

sbatch --export=SPECIES="$SPECIES" scripts/rerun_drep_removed.sh
  9. Make a screen session for each species.
screen -S parabacteroides_distasonis
  10. Run the drakkar annotating module (annotating_function.smk) to re-annotate all the MAGs:
drakkar annotating -b /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/phocaeicola_vulgatus/mags/unzipped -o /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/phocaeicola_vulgatus --env_path /projects/alberdilab/data/environments/drakkar --annotation-type function 

  11. Run contig-to-genome mapping

bash /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/scripts/build_contig_to_mag_maps_all.sh  # in practice this was run for a single species directly in the terminal
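For a single species, the contig-to-MAG map can also be built directly from the FASTA headers. A minimal sketch, assuming one `<accession>.fna` file per MAG as produced by the unzip step above:

```python
from pathlib import Path

def contig_to_mag(fasta_dir):
    """Map every contig ID to the MAG (file stem) it came from.

    Contig IDs are taken as the header text up to the first whitespace.
    """
    mapping = {}
    for fa in sorted(Path(fasta_dir).glob("*.fna")):
        mag = fa.stem
        for line in fa.read_text().splitlines():
            if line.startswith(">"):
                contig = line[1:].split()[0]
                mapping[contig] = mag
    return mapping
```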
  12. Pangenome analysis with PPanGGOLiN
snakemake --snakefile Snakefile.pangolin \
          --configfile snakefiles/pangolin_config.yaml \
          --use-conda \
          -j 40