Chapter 2 Data preparation
2.1 Searching for the MAGs
- Choose the MAGs
For this project, the following bacterial species were selected: Lactococcus lactis, Hafnia paralvei, Enterococcus faecalis, Bacteroides uniformis, Phocaeicola vulgatus, Parabacteroides goldsteinii, Citrobacter braakii, Akkermansia muciniphila, Enterococcus hirae and Bacteroides fragilis.
EHI MAGs: In the EHI database, select the MAGs with >90% completeness and <2.5% contamination for each species.
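The completeness/contamination screen can be sketched as a simple pass over the metadata table. Note the column names (`completeness`, `contamination`) and the demo file are assumptions, not the real EHI export; adjust them to the actual headers.

```shell
# Hypothetical demo input mimicking an EHI metadata TSV (real column names may differ)
printf 'mag_id\tcompleteness\tcontamination\nM1\t95.2\t1.1\nM2\t85.0\t0.5\nM3\t99.0\t3.0\n' > demo_metadata.tsv

# Keep MAGs with >90% completeness and <2.5% contamination, locating the
# columns by header name so the filter survives column reordering
awk -F'\t' '
NR == 1 { for (i = 1; i <= NF; i++) { if ($i == "completeness") c = i; if ($i == "contamination") k = i }
          print; next }
($c+0) > 90 && ($k+0) < 2.5
' demo_metadata.tsv > demo_filtered.tsv

cat demo_filtered.tsv
```

With the demo input, only M1 passes both thresholds.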
GTDB MAGs: example search query, shown for Lactococcus lactis (adapt per species): (“NCBI Taxonomy” CONTAINS “Lactococcus lactis” AND “CheckM2 Completeness” > “90” AND “CheckM Contamination” < “2.5” AND (“Isolation Source” CONTAINS “feces” OR “Isolation Source” CONTAINS “excrement” OR “Isolation Source” CONTAINS “metagenome” OR “Isolation Source” CONTAINS “microbiome” OR “Isolation Source” CONTAINS “gut” OR “Isolation Source” CONTAINS “faeces” OR “Isolation Source” CONTAINS “fecal”) AND “Isolation Source” IS NOT “N/A”)
NCBI MAGs: Refer to 03_downloading_mags.Rmd
2.2 Downloading the MAGs and generating data
- Download genome indices and metadata
Download the EHI_MAG index for each species (into the /data/mags_metadata folder), as well as the curl file and the search metadata tsv from the GTDB, and place them in each species directory on Mjolnir.
2.5) Extract genome metadata
- Run the following pipeline on the GTDB search tsv to obtain additional metadata:
snakemake -s /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/snakefiles/gtdb_metadata_pipeline.smk -j 1 --use-conda --rerun-incomplete

METADATA FILES:
- The EHI metadata files: data/mags_metadata/lactococcus_lactis_metadata.tsv
- The GTDB metadata files: data/mags_metadata/lactococcus_lactis_gtdb_final_metadata.tsv
- The NCBI metadata files: lactococcus_lactis_ncbi_metadata.rds
- Create the master index
Run the create_index.R script to create an index of all the MAGs and their download paths. It takes a list of species as input; at the moment the list is hardcoded in the script.
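One way to de-hardcode the species list would be a small wrapper like the sketch below. The script path and its argument handling are assumptions: create_index.R would need to read its species via `commandArgs(trailingOnly = TRUE)` for this to work.

```shell
# Hypothetical wrapper: feed the species list to create_index.R as arguments
# instead of hardcoding it (path and interface are assumptions; the echo makes
# this a dry run)
printf '%s\n' lactococcus_lactis hafnia_paralvei enterococcus_faecalis > species_list.txt
while read -r sp; do
  echo "Rscript scripts/create_index.R $sp"   # drop 'echo' to actually execute
done < species_list.txt
```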
- Run downloading_and_unzipping.smk to download all the genomes
module load snakemake
#testing
snakemake -s snakefiles/downloading_and_unzipping.smk \
--executor slurm \
--jobs 50 \
--rerun-incomplete \
--keep-going \
--rerun-triggers mtime
- Download the MAGs from the NCBI
### DOWNLOAD THE MAGS (TODO: fold this into downloading_and_unzipping.smk?)
conda activate drakkar_env
#Run this inside the mags folder
datasets download genome accession --inputfile ../phocaeicola_vulgatus_ncbi_selected_accessions.txt --include genome --filename phocaeicola_vulgatus_ncbi_selected_genomes.zip
#then unzip
zip="phocaeicola_vulgatus_ncbi_selected_genomes.zip"
mkdir -p unzipped
unzip -Z1 "$zip" | grep '\.fna$' | while read -r f; do
  acc=$(basename "$(dirname "$f")")   # accession = name of the parent directory in the archive
  echo "Extracting $acc"
  unzip -p "$zip" "$f" > "unzipped/${acc}.fna"
done

- Run dRep
- Remove the MAGs that share <95% ANI with the rest (i.e. they are not the same species)
6.2) Re-run dRep without the removed MAGs
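The <95% ANI screen above amounts to a pass over the pairwise ANI table. dRep writes pairwise ANI values to data_tables/Ndb.csv, but the exact column names vary by version, so the demo table and column layout below are assumptions to verify against your output.

```shell
# Demo ANI table mimicking dRep's data_tables/Ndb.csv (column layout assumed;
# check against your dRep version). ANI is a fraction, so <95% ANI means < 0.95.
printf 'querry,reference,ani\nM1.fna,M2.fna,0.97\nM1.fna,M3.fna,0.91\n' > demo_Ndb.csv

# List genome pairs below the 95% ANI species boundary (removal candidates)
awk -F',' 'NR > 1 && ($3+0) < 0.95 { print $1, $2, $3 }' demo_Ndb.csv > below_ani.txt
cat below_ani.txt
```

With the demo table, only the M1/M3 pair falls below the species boundary.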
- Make a screen session for each species.
- Run drakkar annotating_function.smk to re-annotate all the MAGs:
drakkar annotating -b /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/phocaeicola_vulgatus/mags/unzipped -o /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/data/phocaeicola_vulgatus --env_path /projects/alberdilab/data/environments/drakkar --annotation-type function

7.1) Run contig to genome mapping
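The repository script does the real work here; as a sketch of the underlying idea, a contig-to-MAG table can be derived from the FASTA headers of the unzipped genomes. The file layout (one unzipped/<MAG>.fna per genome) and the two-column TSV output are assumptions, and the genomes below are demo placeholders.

```shell
# Sketch: derive a contig -> MAG table from FASTA headers (demo genomes;
# the real unzipped/ directory holds the downloaded .fna files)
mkdir -p unzipped
printf '>c1 descr\nACGT\n>c2\nGGCC\n' > unzipped/MAG_A.fna
printf '>c3\nTTAA\n' > unzipped/MAG_B.fna

for fna in unzipped/*.fna; do
  mag=$(basename "$fna" .fna)
  # strip the '>' and any description after the first space, then pair with the MAG name
  grep '^>' "$fna" | sed 's/^>//; s/ .*//' | awk -v m="$mag" '{ print $1 "\t" m }'
done > contig_to_mag.tsv
cat contig_to_mag.tsv
```

Each output line maps one contig ID to the MAG it came from.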
bash /maps/projects/alberdilab/people/pjq449/comparative_metagenomics/scripts/build_contig_to_mag_maps_all.sh # in the end this was run directly in the terminal instead

- Pangenome analysis with ppanggolin
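PPanGGOLiN's workflow subcommand takes, via --fasta, a tab-separated file with one "genome name<TAB>fasta path" line per genome. The sketch below builds that list from the unzipped MAGs; the genomes are demo placeholders and the final ppanggolin call is left commented out.

```shell
# Build the genome list ppanggolin expects: one "name<TAB>fasta_path" per MAG
# (demo genomes; point this at the real unzipped/ directory in practice)
mkdir -p unzipped
printf '>c1\nACGT\n' > unzipped/MAG_A.fna
printf '>c2\nTTGG\n' > unzipped/MAG_B.fna

for fna in unzipped/*.fna; do
  printf '%s\t%s\n' "$(basename "$fna" .fna)" "$fna"
done > genomes.list
cat genomes.list

# ppanggolin workflow --fasta genomes.list --output ppanggolin_out
```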