==========================
AlphaFold
==========================

This documentation provides the information and templates to run AlphaFold.

The parameters needed to run AlphaFold are:

* **ALPHAFOLD_DATA_PATH:** Absolute path to the folder with the databases.
* **ALPHAFOLD_MODELS:** Absolute path to the folder with the models.
* **pwd:** Working directory inside the container (``/app/alphafold``). On the command line it is followed by the path to the Singularity Image File (SIF).
* **fasta_paths:** Path to the input sequence in FASTA format.
* **uniref90_database_path:** Path to the Uniref90 database for use by JackHMMER.
* **mgnify_database_path:** Path to the MGnify database for use by JackHMMER.
* **bfd_database_path:** Path to the BFD database for use by HHblits.
* **uniclust30_database_path:** Path to the Uniclust30 database for use by HHblits.
* **pdb70_database_path:** Path to the PDB70 database for use by HHsearch.
* **template_mmcif_dir:** Path to a directory with template mmCIF structures, each named ``<pdb_id>.cif``.
* **uniprot_database_path:** Path to the UniProt database for AlphaFold Multimer.
* **obsolete_pdbs_path:** Path to a file mapping obsolete PDB IDs to their replacements.
* **max_template_date:** Maximum template release date to consider (ISO-8601 format, i.e. YYYY-MM-DD). Important if folding historical test sets. Default is None.
* **output_dir:** Path to a directory that will store the results.
* **model_preset:** ['monomer', 'monomer_casp14', 'monomer_ptm', 'multimer']. Controls which AlphaFold model is used: the original model used at CASP14 with no ensembling ('monomer'); the original model used at CASP14 with num_ensemble=8, matching the CASP14 configuration ('monomer_casp14'); the original CASP14 model fine-tuned with the pTM head, which provides a pairwise confidence measure ('monomer_ptm'); or the AlphaFold-Multimer model ('multimer'). To use the multimer model, provide a multi-sequence FASTA file.
* **db_preset:** ['reduced_dbs', 'full_dbs', 'casp14']. Chooses the preset configuration: no ensembling and a smaller genetic database configuration ('reduced_dbs'), no ensembling and the full genetic database configuration ('full_dbs'), or the full genetic database configuration with 8 model ensemblings ('casp14'). Default is full_dbs.
* **benchmark:** [True, False]. Runs multiple JAX model evaluations to obtain a timing that excludes the compilation time, which should be more indicative of the time required to run inference on many proteins. Default is False.

:Important: The reference databases and models were downloaded to the directory **/shared/work/NBD_Utilities/AlphaFold/databases**, and the Singularity image file (alphafold_2.1.0.sif) of AlphaFold is available at **/shared/work/NBD_Utilities/AlphaFold**.

:Important: The max_template_date flag is mandatory when running AlphaFold. If you are predicting the structure of a protein that is already in the PDB and you wish to avoid using it as a template, max_template_date must be set to a date before the release date of that structure. If you do not need to restrict the date, set it to today's date. For example, when running the prediction on August 7th 2021, set ``--max_template_date=2021-08-07``.

=======================================
Running AlphaFold within Singularity
=======================================

Here is an example of how to run AlphaFold. First, we need a protein sequence in FASTA format.

::

    >5ZE6_1
    MNLEKINELTAQDMAGVNAAILEQLNSDVQLINQLGYYIVSGGGKRIRPMIAVLAARAVGYEGNAHVTIAALIEFIHTATLLHDDVVDESDMRRGKATANAA
    FGNAASVLVGDFIYTRAFQMMTSLGSLKVLEVMSEAVNVIAEGEVLQLMNVNDPDITEENYMRVIYSKTARLFEAAAQCSGILAGCTPEEEKGLQDYGRYLG
    TAFQLIDDLLDYNADGEQLGKNVGDDLNEGKPTLPLLHAMHHGTPEQAQMIRTAIEQGNGRHLLEPVLEAMNACGSLEWTRQRAEEEADKAIAALQVLPDTP
    WREALIGLAHIAVQRDR

If we want to run AlphaFold in, for example, the directory ``AlphaFold/test``

::

    user@login01:~$ cd AlphaFold/test
    user@login01:AlphaFold/test$ mkdir alphafold_output # Create the directory for AlphaFold output
    user@login01:AlphaFold/test$ ls # The directory should contain the singularity image file (.sif) and the input FASTA sequence
    alphafold_output alphafold_2.1.0.sif input.fasta

To run AlphaFold, change the following flags in the template below:

* output_dir
* fasta_paths

**AlphaFold - v2.1.0 - Template to run**

::

    The memory needed for a job depends on the length of the input FASTA sequence and the number of models used.
    Consider increasing the memory if you are working with a large sequence or with all the models.

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name alphafold-run
    #SBATCH --gres=gpu:1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=40G

    #set the environment PATH
    export PYTHONNOUSERSITE=True
    ALPHAFOLD_DATA_PATH=/shared/work/NBD_Utilities/AlphaFold/databases
    ALPHAFOLD_MODELS=/shared/work/NBD_Utilities/AlphaFold/databases/params

    module load CUDA/9.2.88-iccifort-2018.1.163-GCC-6.4.0-2.28
    module load cuDNN
    # Note: CUDA_VISIBLE_DEVICES=-1 hides all GPUs from CUDA; remove the next line to use the GPU requested with --gres
    export CUDA_VISIBLE_DEVICES=-1

    #Run the command
    singularity run --nv \
        -B $ALPHAFOLD_DATA_PATH:/data \
        -B $ALPHAFOLD_MODELS \
        -B .:/etc \
        --pwd /app/alphafold alphafold_2.1.0.sif \
        --fasta_paths=/path/to/input/sequence/input.fasta \
        --uniref90_database_path=/data/uniref90/uniref90.fasta \
        --data_dir=/data \
        --mgnify_database_path=/data/mgnify/mgy_clusters.fa \
        --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
        --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
        --pdb70_database_path=/data/pdb70/pdb70 \
        --template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
        --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
        --max_template_date=YYYY-MM-DD \
        --output_dir=/path/to/output/directory \
        --model_preset='monomer'

====================
AlphaFold output
====================

The outputs are written to a subfolder of ``output_dir``. They include the computed MSAs, unrelaxed structures, relaxed structures, ranked structures, raw model outputs, prediction metadata, and section timings. The ``output_dir`` directory will have the following structure:

::

    output_dir/
    |- input/
       |- features.pkl
       |- ranked_{0,1,2,3,4}.pdb
       |- ranking_debug.json
       |- relaxed_model_{1,2,3,4,5}.pdb
       |- result_model_{1,2,3,4,5}.pkl
       |- timings.json
       |- unrelaxed_model_{1,2,3,4,5}.pdb
       |- msas/
          |- bfd_uniclust_hits.a3m
          |- mgnify_hits.sto
          |- uniref90_hits.sto

The contents of each output file are as follows:

* **features.pkl:** A pickle file containing the input feature NumPy arrays used by the models to produce the structures.
* **unrelaxed_model_x.pdb:** A PDB format text file containing the predicted structure, exactly as output by the model.
* **relaxed_model_x.pdb:** A PDB format text file containing the predicted structure, after performing an Amber relaxation procedure on the unrelaxed structure prediction (see Jumper et al. 2021, Suppl. Methods 1.8.6 for details).
* **ranked_x.pdb:** A PDB format text file containing the relaxed predicted structures, after reordering by model confidence. Here ``ranked_0.pdb`` should contain the prediction with the highest confidence, and ``ranked_4.pdb`` the prediction with the lowest confidence. To rank model confidence, we use predicted LDDT (pLDDT) scores (see Jumper et al. 2021, Suppl. Methods 1.9.6 for details).
* **ranking_debug.json:** A JSON format text file containing the pLDDT values used to perform the model ranking, and a mapping back to the original model names.
* **timings.json:** A JSON format text file containing the times taken to run each section of the AlphaFold pipeline.
* **msas/:** A directory containing the files describing the various genetic tool hits that were used to construct the input MSA.
* **result_model_x.pkl:** A ``pickle`` file containing a nested dictionary of the various NumPy arrays directly produced by the model. In addition to the output of the structure module, this includes auxiliary outputs such as:

  * Distograms (**distogram/logits** contains a NumPy array of shape [N_res, N_res, N_bins] and **distogram/bin_edges** contains the definition of the bins).
  * Per-residue pLDDT scores (**plddt** contains a NumPy array of shape [N_res] with the range of possible values from 0 to 100, where 100 means most confident). This can serve to identify sequence regions predicted with high confidence or as an overall per-target confidence score when averaged across residues.
  * Present only if using pTM models: predicted TM-score (**ptm** field contains a scalar). As a predictor of a global superposition metric, this score is designed to also assess whether the model is confident in the overall domain packing.
  * Present only if using pTM models: predicted pairwise aligned errors (**predicted_aligned_error** contains a NumPy array of shape [N_res, N_res] with the range of possible values from 0 to **max_predicted_aligned_error**, where 0 means most confident). This can serve for a visualisation of domain packing confidence within the structure.

================================
Running AlphaFold Multimer
================================

The steps are the same as when folding a monomer, but you also need to:

1. Provide an input ``fasta`` file with multiple sequences.
2. Set the **--model_preset** flag to 'multimer'.
3. Optionally set the **--is_prokaryote_list** flag with booleans that determine whether all input sequences in the given ``fasta`` file are prokaryotic. If that is not the case or the origin is unknown, set it to ``false`` for that ``fasta``.

Example
#########

In this tutorial we will fold a multimer using AlphaFold. We will be using a human GITR-GITRL complex (PDB ID: 7KHD).

1. Sequence file preparation: The multimer sequence can be downloaded from the PDB database.

::

    >7KHD_1|Chains A, B|Tumor necrosis factor ligand superfamily member 18|Homo sapiens (9606)
    QLETAKEPCMAKFGPLPSKWQMASSEPPCVNKVSDWKLEILQNGLYLIYGQVAPNANYNDVAPFEVRLYKNKDMIQTLTNKSKIQNVGGTYELHVGDTIDLIFNSEHQVLKNNTYWGIILLANPQFIS
    >7KHD_2|Chains C, D|Tumor necrosis factor receptor superfamily member 18|Homo sapiens (9606)
    QRPTGGPGCGPGRLLLGTGTDARCCRVHTTRCCRDYPGEECCSEWDCMCVQPEFHCGDPCCTTCRHHPCPPGQGVQSQGKFSFGFQCIDCASGTFSGGHEGHCKPWTDCTQFGFLTVFPGNKTHNAVCVPGSPPAEP

If the multimer has repeated chains, each chain needs its own record, so the input ``fasta`` file should be:

::

    >7KHD_1|Chain A
    QLETAKEPCMAKFGPLPSKWQMASSEPPCVNKVSDWKLEILQNGLYLIYGQVAPNANYNDVAPFEVRLYKNKDMIQTLTNKSKIQNVGGTYELHVGDTIDLIFNSEHQVLKNNTYWGIILLANPQFIS
    >7KHD_2|Chain B
    QLETAKEPCMAKFGPLPSKWQMASSEPPCVNKVSDWKLEILQNGLYLIYGQVAPNANYNDVAPFEVRLYKNKDMIQTLTNKSKIQNVGGTYELHVGDTIDLIFNSEHQVLKNNTYWGIILLANPQFIS
    >7KHD_3|Chain C
    QRPTGGPGCGPGRLLLGTGTDARCCRVHTTRCCRDYPGEECCSEWDCMCVQPEFHCGDPCCTTCRHHPCPPGQGVQSQGKFSFGFQCIDCASGTFSGGHEGHCKPWTDCTQFGFLTVFPGNKTHNAVCVPGSPPAEP
    >7KHD_4|Chain D
    QRPTGGPGCGPGRLLLGTGTDARCCRVHTTRCCRDYPGEECCSEWDCMCVQPEFHCGDPCCTTCRHHPCPPGQGVQSQGKFSFGFQCIDCASGTFSGGHEGHCKPWTDCTQFGFLTVFPGNKTHNAVCVPGSPPAEP

In our protein, chains A and B are identical, as are chains C and D. Then submit the following script, remembering to change **--output_dir** and **--fasta_paths** to match your input ``fasta`` file and output folder:

::

    When running AlphaFold Multimer, you must provide the UniProt database file.

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name af_multimer
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=20G

    #set the environment PATH
    export PYTHONNOUSERSITE=True
    ALPHAFOLD_DATA_PATH=/shared/work/NBD_Utilities/AlphaFold/databases
    ALPHAFOLD_MODELS=/shared/work/NBD_Utilities/AlphaFold/databases/params

    module purge
    module load CUDA/9.2.88-iccifort-2018.1.163-GCC-6.4.0-2.28
    module load cuDNN
    # Note: CUDA_VISIBLE_DEVICES=-1 hides all GPUs from CUDA
    export CUDA_VISIBLE_DEVICES=-1

    #Run the command
    singularity run --nv \
        -B $ALPHAFOLD_DATA_PATH:/data \
        -B $ALPHAFOLD_MODELS \
        -B .:/etc \
        --pwd /app/alphafold alphafold_2.1.0.sif \
        --data_dir=/data \
        --fasta_paths=/shared/work/NBD_Utilities/AlphaFold/test_container/af_multimer/sep_chains/7khd.fasta \
        --uniref90_database_path=/data/uniref90/uniref90.fasta \
        --mgnify_database_path=/data/mgnify/mgy_clusters.fa \
        --bfd_database_path=/data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
        --uniclust30_database_path=/data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
        --pdb_seqres_database_path=/data/pdb_seqres/pdb_seqres.txt \
        --uniprot_database_path=/data/uniprot/uniprot.fasta \
        --template_mmcif_dir=/data/pdb_mmcif/mmcif_files \
        --obsolete_pdbs_path=/data/pdb_mmcif/obsolete.dat \
        --max_template_date=2021-03-03 \
        --model_preset='multimer' \
        --output_dir=/shared/work/NBD_Utilities/AlphaFold/test_container/af_multimer/sep_chains/7khd
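
The per-chain ``fasta`` layout described in the sequence file preparation step can also be produced with a short script instead of by hand. The following is a minimal sketch and is not part of AlphaFold: the function names ``read_fasta`` and ``expand_repeated_chains`` are hypothetical, and the code assumes headers of the form ``>ID_N|Chains A, B|...`` as downloaded from the PDB. Records without a chain field are written once, unchanged apart from renumbering.

.. code-block:: python

    import re


    def read_fasta(fasta_text):
        """Yield (header, sequence) pairs from FASTA-formatted text."""
        header, seq_lines = None, []
        for line in fasta_text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq_lines)
                header, seq_lines = line[1:], []
            else:
                seq_lines.append(line)
        if header is not None:
            yield header, "".join(seq_lines)


    def expand_repeated_chains(fasta_text):
        """Return FASTA text with one record per chain (hypothetical helper).

        Assumes PDB-style headers such as '>7KHD_1|Chains A, B|name|organism'.
        """
        records = []
        index = 1
        for header, sequence in read_fasta(fasta_text):
            fields = header.split("|")
            entry_id = fields[0].split("_")[0]
            chain_field = fields[1].strip() if len(fields) > 1 else ""
            # 'Chains A, B' or 'Chain A' -> ['A', 'B'] or ['A']
            match = re.match(r"Chains?\s+(.*)", chain_field)
            chains = [c.strip() for c in match.group(1).split(",")] if match else [None]
            for chain in chains:
                name = f"{entry_id}_{index}"
                if chain is not None:
                    name += f"|Chain {chain}"
                records.append(f">{name}\n{sequence}")
                index += 1
        return "\n".join(records) + "\n"


    if __name__ == "__main__":
        # Shortened sequences, for illustration only.
        example = (
            ">7KHD_1|Chains A, B|ligand|Homo sapiens (9606)\nQLETAK\n"
            ">7KHD_2|Chains C, D|receptor|Homo sapiens (9606)\nQRPTGG\n"
        )
        print(expand_repeated_chains(example), end="")

Save the returned text to a file and pass that file to **--fasta_paths**. Check the generated headers before submitting the job, since PDB headers that deviate from the assumed pattern are not handled.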