AlphaFold

This documentation provides the information and templates needed to run AlphaFold.

The parameters needed to run AlphaFold are:

  • ALPHAFOLD_DATA_PATH: Absolute path to folder with databases.

  • ALPHAFOLD_MODELS: Absolute path to folder with models.

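  • pwd: Initial working directory inside the container (/app/alphafold in the templates below). On the command line it is followed by the path to the Singularity Image File (SIF).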

  • fasta_paths: Path to the input sequence in fasta format.

  • uniref90_database_path: Path to Uniref90 database for use by JackHMMER.

  • mgnify_database_path: Path to the MGnify database for use by JackHMMER.

  • bfd_database_path: Path to the BFD database for use by HHblits.

  • uniclust30_database_path: Path to Uniclust30 database for use by HHblits.

  • pdb70_database_path: Path to PDB70 database for use by HHsearch.

  • template_mmcif_dir: Path to a directory with template mmCIF structures, each named <pdb_id>.cif.

  • uniprot_database_path: Path to the UniProt database for AlphaFold Multimer.

  • obsolete_pdbs_path: Path to a file mapping obsolete PDB IDs to their replacements.

  • max_template_date: Maximum template release date to consider (ISO-8601 format - i.e. YYYY-MM-DD). Important if folding historical test sets. Default is None.

  • output_dir: Path to a directory that will store the results.

  • model_preset: [‘monomer’, ‘monomer_casp14’, ‘monomer_ptm’, ‘multimer’]. Controls which AlphaFold model to use: the original model used at CASP14 with no ensembling (‘monomer’); the original model used at CASP14 with num_ensemble=8, matching the CASP14 configuration (‘monomer_casp14’); the original CASP14 model fine-tuned with the pTM head, which provides a pairwise confidence measure (‘monomer_ptm’); or the AlphaFold-Multimer model (‘multimer’). To use the multimer model, provide a multi-sequence FASTA file.

  • db_preset: [‘reduced_dbs’, ‘full_dbs’, ‘casp14’]. Chooses the preset configuration: no ensembling with the smaller genetic database configuration (reduced_dbs), no ensembling with the full genetic database configuration (full_dbs), or the full genetic database configuration with 8 model ensemblings (casp14). Default is full_dbs.

  • benchmark: [True, False]. Runs multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required to run inference on many proteins. Default is False.

Important

The reference databases and models were downloaded to the directory /shared/work/NBD_Utilities/AlphaFold/databases, and the Singularity image file of AlphaFold (alphafold_2.1.0.sif) is available at /shared/work/NBD_Utilities/AlphaFold.

Important

The max_template_date flag is mandatory when running AlphaFold. If you are predicting the structure of a protein that is already in the PDB and you wish to avoid using it as a template, max_template_date must be set to a date before the release date of that structure. If you do not need a specific date, use today’s date. For example, for a run on August 7th 2021, set --max_template_date=2021-08-07.
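
The "today's date" default described above can be generated in the job script instead of being typed by hand. The following is a minimal sketch: the variable MAX_TEMPLATE_DATE and the helper is_iso_date are illustrative names and are not part of AlphaFold.

```shell
#!/bin/bash
# Hypothetical helper (not part of AlphaFold): succeeds only if $1 looks like YYYY-MM-DD.
# It checks the format and the month/day ranges, not whether the day exists in that month.
is_iso_date() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'
}

# Use today's date in ISO-8601 format unless a date was already provided.
MAX_TEMPLATE_DATE="${MAX_TEMPLATE_DATE:-$(date +%F)}"

if is_iso_date "$MAX_TEMPLATE_DATE"; then
  # This is the flag as it would appear in the singularity command.
  echo "--max_template_date=$MAX_TEMPLATE_DATE"
else
  echo "Invalid date: $MAX_TEMPLATE_DATE (expected YYYY-MM-DD)" >&2
fi
```

To exclude a structure that is already in the PDB, set MAX_TEMPLATE_DATE to a date before its release date before this snippet runs.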

Running AlphaFold within Singularity

Here is an example of how to run AlphaFold. First, we need a protein sequence in FASTA format.

>5ZE6_1
MNLEKINELTAQDMAGVNAAILEQLNSDVQLINQLGYYIVSGGGKRIRPMIAVLAARAVGYEGNAHVTIAALIEFIHTATLLHDDVVDESDMRRGKATANAA
FGNAASVLVGDFIYTRAFQMMTSLGSLKVLEVMSEAVNVIAEGEVLQLMNVNDPDITEENYMRVIYSKTARLFEAAAQCSGILAGCTPEEEKGLQDYGRYLG
TAFQLIDDLLDYNADGEQLGKNVGDDLNEGKPTLPLLHAMHHGTPEQAQMIRTAIEQGNGRHLLEPVLEAMNACGSLEWTRQRAEEEADKAIAALQVLPDTP
WREALIGLAHIAVQRDR

If we want to run AlphaFold in, for example, the directory AlphaFold/test:

user@login01:~$  cd AlphaFold/test
user@login01:AlphaFold/test$  mkdir alphafold_output # Create the directory for AlphaFold output
user@login01:AlphaFold/test$  ls # The directory should contain the singularity image file (.sif) and the input FASTA sequence
alphafold_output alphafold_2.1.0.sif input.fasta

To run AlphaFold, change the following parameters in the template below:

  • output_dir

  • fasta_paths

AlphaFold - v2.0.1 - Template to run

The memory needed for a job depends on the length of the input FASTA sequence and the number of models used.
Consider increasing the memory if you are working with a large sequence or with all the models.
#!/bin/bash
#SBATCH --job-name alphafold-run
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=40G

#set the environment PATH
export PYTHONNOUSERSITE=True
ALPHAFOLD_DATA_PATH=/shared/work/NBD_Utilities/AlphaFold/databases
ALPHAFOLD_MODELS=/shared/work/NBD_Utilities/AlphaFold/databases/params

module load CUDA/9.2.88-iccifort-2018.1.163-GCC-6.4.0-2.28
module load cuDNN
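# NOTE: a value of -1 hides all GPUs from CUDA. Remove the next line if the job
# should use the GPU requested above with --gres.
export CUDA_VISIBLE_DEVICES=-1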

#Run the command
singularity run --nv \
 -B $ALPHAFOLD_DATA_PATH:/data \
 -B $ALPHAFOLD_MODELS \
 -B .:/etc \
 --pwd  /app/alphafold alphafold_2.1.0.sif \
 --fasta_paths=/path/to/input/sequence/input.fasta  \
 --uniref90_database_path=/data/uniref90/uniref90.fasta  \
 --data_dir=/data \
 --mgnify_database_path=/data/mgnify/mgy_clusters.fa   \
 --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
 --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
 --pdb70_database_path=/data/pdb70/pdb70  \
 --template_mmcif_dir=/data/pdb_mmcif/mmcif_files  \
 --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
 --max_template_date=YYYY-MM-DD \
 --output_dir=/path/to/output/directory  \
 --model_preset='monomer'
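
Because a job with a wrong path only fails after it has been scheduled, it can help to check the two paths you edited before submitting. The sketch below is one way to do that; check_inputs is a hypothetical helper, and run_alphafold.sh is an assumed file name for the template above.

```shell
#!/bin/bash
# Hypothetical pre-flight check; not part of AlphaFold, Singularity or Slurm.
# Usage: check_inputs <fasta_file> <output_dir>
check_inputs() {
  fasta=$1
  outdir=$2
  # The FASTA file must exist and must not be empty.
  if [ ! -s "$fasta" ]; then
    echo "FASTA file missing or empty: $fasta" >&2
    return 1
  fi
  # A FASTA file starts with a header line beginning with '>'.
  if ! head -n 1 "$fasta" | grep -q '^>'; then
    echo "First line of $fasta is not a FASTA header" >&2
    return 1
  fi
  # The output directory must already exist.
  if [ ! -d "$outdir" ]; then
    echo "Output directory does not exist: $outdir" >&2
    return 1
  fi
  return 0
}

# Example with assumed file names: submit only if the check passes.
# check_inputs input.fasta alphafold_output && sbatch run_alphafold.sh
```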

AlphaFold output

The outputs will be in a subfolder of output_dir. They include the computed MSAs, unrelaxed structures, relaxed structures, ranked structures, raw model outputs, prediction metadata, and section timings. The output_dir directory will have the following structure:

<target_name>/
    |- input/
       |- features.pkl
       |- ranked_{0,1,2,3,4}.pdb
       |- ranking_debug.json
       |- relaxed_model_{1,2,3,4,5}.pdb
       |- result_model_{1,2,3,4,5}.pkl
       |- timings.json
       |- unrelaxed_model_{1,2,3,4,5}.pdb
       |- msas/
          |- bfd_uniclust_hits.a3m
          |- mgnify_hits.sto
          |- uniref90_hits.sto

The contents of each output file are as follows:

  • features.pkl: A pickle file containing the input feature NumPy arrays used by the models to produce the structures.

  • unrelaxed_model_x.pdb: A PDB format text file containing the predicted structure, exactly as outputted by the model.

  • relaxed_model_x.pdb: A PDB format text file containing the predicted structure, after performing an Amber relaxation procedure on the unrelaxed structure prediction (see Jumper et al. 2021, Suppl. Methods 1.8.6 for details).

  • ranked_x.pdb: A PDB format text file containing the relaxed predicted structures, after reordering by model confidence. Here ranked_0.pdb should contain the prediction with the highest confidence, and ranked_4.pdb the prediction with the lowest confidence. To rank model confidence, we use predicted LDDT (pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6 for details).

  • ranking_debug.json: A JSON format text file containing the pLDDT values used to perform the model ranking, and a mapping back to the original model names.

  • timings.json: A JSON format text file containing the times taken to run each section of the AlphaFold pipeline.

  • msas/: A directory containing the files describing the various genetic tool hits that were used to construct the input MSA.

  • result_model_x.pkl: A pickle file containing a nested dictionary of the various NumPy arrays directly produced by the model. In addition to the output of the structure module, this includes auxiliary outputs such as:

    • Distograms (distogram/logits contains a NumPy array of shape [N_res, N_res, N_bins] and distogram/bin_edges contains the definition of the bins).

    • Per-residue pLDDT scores (plddt contains a NumPy array of shape [N_res] with the range of possible values from 0 to 100, where 100 means most confident). This can serve to identify sequence regions predicted with high confidence or as an overall per-target confidence score when averaged across residues.

    • Present only if using pTM models: predicted TM-score (ptm field contains a scalar). As a predictor of a global superposition metric, this score is designed to also assess whether the model is confident in the overall domain packing.

    • Present only if using pTM models: predicted pairwise aligned errors (predicted_aligned_error contains a NumPy array of shape [N_res, N_res] with the range of possible values from 0 to max_predicted_aligned_error, where 0 means most confident). This can serve for a visualisation of domain packing confidence within the structure.
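
Once a job has finished, a quick way to confirm that the run completed is to check that the files listed above are present. The sketch below assumes five models and the file names shown in the directory tree (other model presets may name the per-model files differently); check_outputs is a hypothetical helper, and its argument is the folder that contains features.pkl.

```shell
#!/bin/bash
# Hypothetical helper: list the expected AlphaFold output files that are missing.
# Usage: check_outputs <results_dir>   (the folder that contains features.pkl)
# Returns 0 if every expected file is present, non-zero otherwise.
check_outputs() {
  dir=$1
  missing=0
  expected="features.pkl ranking_debug.json timings.json"
  # Assumption: five models, named as in the directory tree above.
  for i in 1 2 3 4 5; do
    expected="$expected relaxed_model_$i.pdb unrelaxed_model_$i.pdb result_model_$i.pkl"
  done
  for r in 0 1 2 3 4; do
    expected="$expected ranked_$r.pdb"
  done
  for f in $expected; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example with an assumed path:
# check_outputs alphafold_output/input && echo "all expected files are present"
```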

Running AlphaFold Multimer

The steps are the same as when folding a monomer, but you also need to:

  1. Provide an input FASTA file with multiple sequences.

  2. Set the --model_preset flag to ‘multimer’.

  3. Optionally set the --is_prokaryote_list flag with booleans that state whether all input sequences in the given FASTA file are prokaryotic. If that is not the case, or the origin is unknown, set it to false for that FASTA file.
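
Since the multimer model expects a multi-sequence FASTA file, the number of sequences can be checked before choosing the preset. This is a small sketch; count_fasta_sequences and choose_model_preset are hypothetical helpers, and the rule "more than one sequence means multimer" is an assumption made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count the sequences in a FASTA file by counting header lines.
count_fasta_sequences() {
  # grep -c prints 0 and exits non-zero when nothing matches; "|| true" keeps the
  # function from failing in that case.
  grep -c '^>' "$1" || true
}

# Hypothetical helper: suggest a model preset based on the number of sequences.
choose_model_preset() {
  n=$(count_fasta_sequences "$1")
  if [ "$n" -gt 1 ]; then
    echo multimer
  else
    echo monomer
  fi
}

# Example with an assumed file name:
# echo "--model_preset=$(choose_model_preset 7khd.fasta)"
```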

Example

In this tutorial we will fold a multimer using AlphaFold. We will be using the human GITR-GITRL complex (PDB ID: 7KHD).

1. Sequence file preparation: The multimer sequence can be downloaded from the PDB database.

>7KHD_1|Chains A, B|Tumor necrosis factor ligand superfamily member 18|Homo sapiens (9606)
QLETAKEPCMAKFGPLPSKWQMASSEPPCVNKVSDWKLEILQNGLYLIYGQVAPNANYNDVAPFEVRLYKNKDMIQTLTNKSKIQNVGGTYELHVGDTIDLIFNSEHQVLKNNTYWGIILLANPQFIS
>7KHD_2|Chains C, D|Tumor necrosis factor receptor superfamily member 18|Homo sapiens (9606)
QRPTGGPGCGPGRLLLGTGTDARCCRVHTTRCCRDYPGEECCSEWDCMCVQPEFHCGDPCCTTCRHHPCPPGQGVQSQGKFSFGFQCIDCASGTFSGGHEGHCKPWTDCTQFGFLTVFPGNKTHNAVCVPGSPPAEP

If the multimer has repeated chains, each copy must be listed as a separate entry, so the input FASTA file should be:

>7KHD_1|Chain A
QLETAKEPCMAKFGPLPSKWQMASSEPPCVNKVSDWKLEILQNGLYLIYGQVAPNANYNDVAPFEVRLYKNKDMIQTLTNKSKIQNVGGTYELHVGDTIDLIFNSEHQVLKNNTYWGIILLANPQFIS
>7KHD_2|Chain B
QLETAKEPCMAKFGPLPSKWQMASSEPPCVNKVSDWKLEILQNGLYLIYGQVAPNANYNDVAPFEVRLYKNKDMIQTLTNKSKIQNVGGTYELHVGDTIDLIFNSEHQVLKNNTYWGIILLANPQFIS
>7KHD_3|Chain C
QRPTGGPGCGPGRLLLGTGTDARCCRVHTTRCCRDYPGEECCSEWDCMCVQPEFHCGDPCCTTCRHHPCPPGQGVQSQGKFSFGFQCIDCASGTFSGGHEGHCKPWTDCTQFGFLTVFPGNKTHNAVCVPGSPPAEP
>7KHD_4|Chain D
QRPTGGPGCGPGRLLLGTGTDARCCRVHTTRCCRDYPGEECCSEWDCMCVQPEFHCGDPCCTTCRHHPCPPGQGVQSQGKFSFGFQCIDCASGTFSGGHEGHCKPWTDCTQFGFLTVFPGNKTHNAVCVPGSPPAEP

In our protein, chains A and B are identical, as are chains C and D.
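
For larger complexes, writing one entry per chain by hand is error-prone. The sketch below converts a PDB-style FASTA file, where one header lists several chains, into the one-entry-per-chain layout shown above. It assumes that the second "|"-separated header field has the form "Chains A, B" (or "Chain A") and that the identifier ends in "_<number>"; expand_chains is a hypothetical helper, not part of AlphaFold.

```shell
#!/bin/bash
# Hypothetical helper: write one FASTA entry per chain.
# Assumed header format: >ID_N|Chains A, B|description|organism
# Usage: expand_chains <pdb_style.fasta> > <per_chain.fasta>
expand_chains() {
  awk -F'|' '
    # Print the stored sequence once for every chain listed in the stored header.
    function emit(   i, n, ids) {
      if (header == "") return
      n = split(chains, ids, ",")
      for (i = 1; i <= n; i++) {
        gsub(/^ +| +$/, "", ids[i])   # trim spaces around the chain name
        count++
        printf ">%s_%d|Chain %s\n%s\n", prefix, count, ids[i], seq
      }
    }
    /^>/ {
      emit()                          # flush the previous entry, if any
      header = $1
      prefix = substr($1, 2)          # drop the leading ">"
      sub(/_[0-9]+$/, "", prefix)     # drop the entity number, e.g. "_1"
      chains = $2
      sub(/^Chains? /, "", chains)    # keep only the chain names
      seq = ""
      next
    }
    { seq = seq $0 }                  # join wrapped sequence lines
    END { emit() }
  ' "$1"
}

# Example with assumed file names:
# expand_chains rcsb_pdb_7KHD.fasta > 7khd.fasta
```

Entries are renumbered consecutively (7KHD_1 to 7KHD_4 in the example above), so each chain gets a unique identifier.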

Then submit the following script. Remember to change --output_dir and --fasta_paths to match your input FASTA file and output folder.

When running AlphaFold Multimer, the UniProt database file must also be defined (--uniprot_database_path).
#!/bin/bash
#SBATCH --job-name af_multimer
#SBATCH --cpus-per-task=8
#SBATCH --mem=20G

#set the environment PATH
export PYTHONNOUSERSITE=True
ALPHAFOLD_DATA_PATH=/shared/work/NBD_Utilities/AlphaFold/databases
ALPHAFOLD_MODELS=/shared/work/NBD_Utilities/AlphaFold/databases/params

module purge
module load CUDA/9.2.88-iccifort-2018.1.163-GCC-6.4.0-2.28
module load cuDNN
export CUDA_VISIBLE_DEVICES=-1

#Run the command
singularity run --nv \
 -B $ALPHAFOLD_DATA_PATH:/data \
 -B $ALPHAFOLD_MODELS \
 -B .:/etc \
 --pwd  /app/alphafold  alphafold_2.1.0.sif \
 --data_dir=/data \
 --fasta_paths=/shared/work/NBD_Utilities/AlphaFold/test_container/af_multimer/sep_chains/7khd.fasta \
 --uniref90_database_path=/data/uniref90/uniref90.fasta  \
 --mgnify_database_path=/data/mgnify/mgy_clusters.fa   \
 --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
 --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
 --pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
 --uniprot_database_path=/data/uniprot/uniprot.fasta \
 --template_mmcif_dir=/data/pdb_mmcif/mmcif_files  \
 --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
 --max_template_date=2021-03-03 \
 --model_preset='multimer' \
 --output_dir=/shared/work/NBD_Utilities/AlphaFold/test_container/af_multimer/sep_chains/7khd