Chapter 2 Data preparation

2.1 Searching for the MAGs

  1. Choose the MAGs
    For this project, the following bacterial species were selected: Lactococcus lactis, Hafnia paralvei, Enterococcus faecalis, Bacteroides uniformis, Phocaeicola vulgatus, Parabacteroides goldsteinii, Citrobacter braakii, Akkermansia muciniphila, Enterococcus hirae and Bacteroides fragilis.

EHI MAGs: In the EHI database, select the MAGs with >90% completeness and <2.5% contamination for each species.
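These quality cutoffs can also be applied programmatically. A minimal sketch in Python, assuming a tab-separated export with hypothetical `completeness` and `contamination` columns (the real EHI headers may differ):

```python
import csv
import io

def select_mags(tsv_text, min_completeness=90.0, max_contamination=2.5):
    """Keep rows passing the >90% completeness / <2.5% contamination cutoffs.

    Column names "completeness" and "contamination" are assumptions; adjust
    them to match the actual EHI metadata export.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row for row in reader
        if float(row["completeness"]) > min_completeness
        and float(row["contamination"]) < max_contamination
    ]
```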

GTDB MAGs: In the GTDB search interface, apply the following filter for each species (shown here for Lactococcus lactis):

(“NCBI Taxonomy” CONTAINS “Lactococcus lactis” AND “CheckM2 Completeness” > “90” AND “CheckM Contamination” < “2.5” AND (“Isolation Source” CONTAINS “feces” OR “Isolation Source” CONTAINS “excrement” OR “Isolation Source” CONTAINS “metagenome” OR “Isolation Source” CONTAINS “microbiome” OR “Isolation Source” CONTAINS “gut” OR “Isolation Source” CONTAINS “faeces” OR “Isolation Source” CONTAINS “fecal”) AND “Isolation Source” IS NOT “N/A”)

NCBI MAGs: Refer to 03_downloading_mags.Rmd

2.2 Downloading the MAGs and generating data

  1. Download genome indices and metadata
    Download the EHI_MAG index for each species (into the data/mags_metadata folder), together with the curl file and the search-metadata TSV from the GTDB, and place them in each species directory on Mjolnir.

  2. Extract genome metadata
    Use the GTDB search TSV as input to this pipeline to obtain additional metadata:

snakemake -s /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/snakefiles/gtdb_metadata_pipeline.smk -j 1 --use-conda --rerun-incomplete

METADATA FILES:
- EHI metadata: data/mags_metadata/lactococcus_lactis_metadata.tsv
- GTDB metadata: data/mags_metadata/lactococcus_lactis_gtdb_final_metadata.tsv
- NCBI metadata: lactococcus_lactis_ncbi_metadata.rds
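Since the three sources use different layouts, it can help to collapse them into one minimal schema before indexing. A sketch, assuming hypothetical `accession` and `species` columns (the real headers may differ, and the NCBI .rds table would first need to be exported to TSV from R):

```python
import csv
import io

def harmonize(source, tsv_text, accession_col="accession", species_col="species"):
    """Reduce one source's metadata TSV to a common (source, accession, species) schema.

    The column names are assumptions, overridable per source via the keyword
    arguments.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {"source": source, "accession": r[accession_col], "species": r[species_col]}
        for r in reader
    ]
```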

  3. Create the master index
    Run the create_index.R script to build an index of all the MAGs and their download paths (it takes a list of species as input; at present the list is hardcoded in the script).
conda activate r_env
Rscript scripts/create_index.R 
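The hardcoded species list could instead be taken from the command line. A hypothetical Python sketch of the same indexing idea (the `accession`/`url` column names and the `<species>_metadata.tsv` naming are assumptions, not the actual create_index.R logic):

```python
import csv
from pathlib import Path

def build_index(species, metadata_dir):
    """Collect one (species, accession, url) row per MAG across per-species
    metadata TSVs. Species without a metadata file are skipped."""
    rows = []
    for sp in species:
        tsv = Path(metadata_dir) / f"{sp}_metadata.tsv"
        if not tsv.exists():
            continue
        with open(tsv, newline="") as fh:
            for r in csv.DictReader(fh, delimiter="\t"):
                rows.append({"species": sp, "accession": r["accession"], "url": r["url"]})
    return rows
```

The species list would then be passed as arguments (e.g. via sys.argv) rather than edited in the script itself.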
  4. Run the downloading snakefile (downloading_and_unzipping.smk) to download all the genomes
module load snakemake

snakemake -s snakefiles/downloading_and_unzipping.smk \
  --executor slurm \
  --jobs 50 \
  --rerun-incomplete \
  --keep-going \
  --rerun-triggers mtime

  5. Download the MAGs from the NCBI

# TODO: fold this step into downloading_and_unzipping.smk?
conda activate drakkar_env 

# Run this inside the mags folder
datasets download genome accession --inputfile ../phocaeicola_vulgatus_ncbi_selected_accessions.txt --include genome --filename phocaeicola_vulgatus_ncbi_selected_genomes.zip

# Then unzip, extracting one .fna per accession
zip="phocaeicola_vulgatus_ncbi_selected_genomes.zip"

mkdir -p unzipped

unzip -Z1 "$zip" | grep '\.fna$' | while read -r f; do
    acc=$(basename "$(dirname "$f")")
    echo "Extracting $acc"
    unzip -p "$zip" "$f" > "unzipped/${acc}.fna"
done
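The same extraction can be done without the unzip binary. A Python equivalent of the loop above, using only the standard library:

```python
import zipfile
from pathlib import Path

def extract_fnas(zip_path, out_dir):
    """Pull every .fna out of the NCBI datasets zip, naming each extracted
    file after its accession (the parent directory inside the archive)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.endswith(".fna"):
                continue
            acc = Path(name).parent.name  # accession = parent directory name
            target = out / f"{acc}.fna"
            target.write_bytes(zf.read(name))
            written.append(target.name)
    return written
```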
  6. Run dRep
sbatch scripts/drep_compare.slurm
  7. Remove the MAGs that share <95% ANI with the species cluster (i.e. are not the same species)
python3 scripts/filter_by_cdb.py data data/clusters_to_drop.tsv 
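filter_by_cdb.py is project-specific, but the underlying idea can be sketched from dRep's Cdb.csv, which assigns each genome to a secondary cluster (formed at the ANI threshold). The column names below follow dRep's usual output; verify them against your version:

```python
import csv
import io

def genomes_to_drop(cdb_csv_text, clusters_to_drop):
    """Return the genomes whose secondary cluster is flagged for removal.

    Assumes Cdb.csv has "genome" and "secondary_cluster" columns, as in
    recent dRep releases.
    """
    drop = set(clusters_to_drop)
    reader = csv.DictReader(io.StringIO(cdb_csv_text))
    return [r["genome"] for r in reader if r["secondary_cluster"] in drop]
```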

  8. Re-run dRep without the removed MAGs

sbatch --export=SPECIES="$SPECIES" scripts/rerun_drep_removed.sh
  9. Make a screen session for each species.
screen -S parabacteroides_distasonis
  10. Run the drakkar annotating module (annotating_function.smk) to re-annotate all the MAGs:
drakkar annotating -b /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/phocaeicola_vulgatus/mags/unzipped -o /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/phocaeicola_vulgatus --env_path /projects/alberdilab/data/environments/drakkar --annotation-type function 

  11. Run contig-to-genome mapping

bash /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/scripts/build_contig_to_mag_maps_all.sh  # in practice this was run for a single species directly in the terminal
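For a single species, the contig-to-MAG map can also be built directly from the FASTA headers. A minimal sketch, assuming one `<accession>.fna` file per MAG as produced by the unzip step above:

```python
from pathlib import Path

def contig_to_mag(fasta_dir):
    """Map every contig ID to the MAG (file stem) it came from.

    Contig IDs are taken as the header text up to the first whitespace.
    """
    mapping = {}
    for fa in sorted(Path(fasta_dir).glob("*.fna")):
        mag = fa.stem
        for line in fa.read_text().splitlines():
            if line.startswith(">"):
                contig = line[1:].split()[0]
                mapping[contig] = mag
    return mapping
```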
  12. Pangenome analysis with PPanGGOLiN
snakemake --snakefile Snakefile.pangolin \
          --configfile snakefiles/pangolin_config.yaml \
          --use-conda \
          -j 40