MSA format for "unpairedMsa" in fold_input.json #47

smg3d · 2024-11-15T03:51:12Z

Thanks for providing the AF3 source. It is really appreciated.

I could not find the format to use in order to provide our own MSA in the input json file.

The input documentation mentions "If the unpairedMsa field is set to a custom A3M string, AlphaFold 3 will use the provided MSA instead of building one as part of the data pipeline. This is considered an expert option." But what is the format of the "custom A3M string"?

The doc provides the following two examples, but does not show the string or list format for unpairedMsa:

{
  "protein": {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "modifications": [
      {"ptmType": "HY3", "ptmPosition": 1},
      {"ptmType": "P1L", "ptmPosition": 5}
    ],
    "unpairedMsa": ...,
    "pairedMsa": ...,
    "templates": [...]
  }
}

and

{
  "protein": {
    "id": "A",
    "sequence": ...,
    "unpairedMsa": "The A3M you want to run with",
    "pairedMsa": "",
    "templates": []
  }
}

For "unpairedMsa", I tried a filename and various list formats, but none of them work.


Hanziwww · 2024-11-15T04:25:11Z

Here is my suggestion:

Prepare Your MSA: Format your MSA in A3M, which is similar to FASTA but can include lowercase letters for insertions.
Embed MSA Content in JSON: Place your MSA content in the "unpairedMsa" field of the input JSON file. Ensure newline characters are correctly handled with \n.

Example:

{
  "protein": {
    "id": "A",
    "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF",
    "unpairedMsa": ">seq1\\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF\\n>seq2\\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTFFPHF",
    "pairedMsa": "",
    "templates": []
  }
}

Considerations:

Handling Newlines: In JSON strings, newlines should be represented by \\n (in the actual JSON file, it’s \n, but needs escaping in strings).
Direct Embedding: The "unpairedMsa" field should contain the actual MSA content string, not a filename or path.
Validate JSON Format: Make sure your JSON file is correctly formatted. You might want to use an online JSON validator for checking.

smg3d · 2024-11-15T05:53:11Z

Thanks @Hanziwww .

Does that input.json work for you? For me, it does not recognize the first sequence of the MSA (looks like it reads an empty sequence):

    raise ValueError(
ValueError: First MSA sequence  is not the query_sequence='MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF'

Hanziwww · 2024-11-15T06:30:11Z

Hi @smg3d,

You're absolutely right—I made a mistake in my previous response. The newline character in JSON strings should be represented as \n, not \\n. Using \\n will not correctly parse the newlines within the JSON string, leading to errors like the one you encountered.
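To see the difference concretely, here is a minimal sketch using Python's standard json module. The short sequences are made-up placeholders, not a real MSA. It decodes the same string escaped both ways and prints the resulting lines.

```python
import json

# Raw strings, so the backslashes reach the JSON parser exactly as typed in a file.
double_escaped = r'{"unpairedMsa": ">seq1\\nMVLS\\n>seq2\\nMVLT"}'
single_escaped = r'{"unpairedMsa": ">seq1\nMVLS\n>seq2\nMVLT"}'

wrong = json.loads(double_escaped)["unpairedMsa"]
right = json.loads(single_escaped)["unpairedMsa"]

# \\n decodes to a literal backslash followed by "n": there is no line break.
print(wrong.splitlines())

# \n decodes to a real newline, so headers and sequences land on separate lines.
print(right.splitlines())
```

With the double-escaped form the whole MSA decodes to one long header line, which is consistent with the error reporting an empty first sequence.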

Here's the corrected JSON input:

{
  "name": "My AlphaFold Job",
  "modelSeeds": [1],
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF",
        "unpairedMsa": ">seq1\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF\n>seq2\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTFFPHF",
        "pairedMsa": "",
        "templates": []
      }
    }
  ],
  "dialect": "alphafold3",
  "version": 1
}
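If you would rather not escape the newlines by hand, one option is to let a JSON library do it. The following is a rough sketch, not part of AlphaFold 3: `first_a3m_sequence` and `build_fold_input` are hypothetical helper names, and the query check only mirrors the error message quoted earlier in this thread. The field names are the ones from the JSON above.

```python
import json


def first_a3m_sequence(a3m_text):
    """Return the sequence of the first record in an A3M/FASTA-style string."""
    seq_lines = []
    seen_header = False
    for line in a3m_text.splitlines():
        if line.startswith(">"):
            if seen_header:
                break  # second record starts, the first one is complete
            seen_header = True
        elif seen_header:
            seq_lines.append(line.strip())
    return "".join(seq_lines)


def build_fold_input(name, chain_id, query_sequence, a3m_text):
    """Build the input dict; the A3M text is embedded as a plain string."""
    if first_a3m_sequence(a3m_text) != query_sequence:
        raise ValueError("First MSA sequence must equal the query sequence")
    return {
        "name": name,
        "modelSeeds": [1],
        "sequences": [
            {
                "protein": {
                    "id": chain_id,
                    "sequence": query_sequence,
                    "unpairedMsa": a3m_text,
                    "pairedMsa": "",
                    "templates": [],
                }
            }
        ],
        "dialect": "alphafold3",
        "version": 1,
    }


query = "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF"
a3m_text = (
    f">seq1\n{query}\n"
    ">seq2\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTFFPHF\n"
)

fold_input = build_fold_input("My AlphaFold Job", "A", query, a3m_text)
# json.dumps writes each newline inside the string as the two characters \n.
json_text = json.dumps(fold_input, indent=2)
print(json_text)
```

In practice `a3m_text` would be read from your own .a3m file, and `json_text` written to fold_input.json.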

Here's how you can run AlphaFold using Docker with the corrected JSON:

docker run -it \
  --volume /home/mars/disk3/af3input:/root/af_input \
  --volume /home/mars/disk3/af3output:/root/af_output \
  --volume /home/mars/disk3/af3md:/root/models \
  --volume /home/mars/disk3/af3db:/root/public_databases \
  --gpus all alphafold3 \
  python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --model_dir=/root/models \
  --output_dir=/root/af_output

output cif: my_alphafold_job_model.zip

Sorry for the misleading answer.

smg3d · 2024-11-15T06:41:38Z

Thanks @Hanziwww .

It works now.

I think it might be a good idea to show such an example in the input doc:

{
  "protein": {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "modifications": [
      {"ptmType": "HY3", "ptmPosition": 1},
      {"ptmType": "P1L", "ptmPosition": 5}
    ],
    "unpairedMsa": ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-",
    "pairedMsa": ...,
    "templates": [...]
  }
}

Hanziwww · 2024-11-15T08:07:36Z

I'm glad to hear that the input is working now.

By the way, I'd like to introduce a user-friendly graphical interface that I developed to simplify generating the input JSON and running AlphaFold 3 predictions. Feel free to check out the GUI repository.

sky1ove · 2024-11-18T05:17:16Z

It works. Thanks for sharing.

May I ask if it's OK to skip pairedMsa and templates in terms of model performance? I didn't see much difference on my end, though.

Hanziwww · 2024-11-18T09:28:00Z

Certainly. According to the guidelines, you can skip pairedMsa and templates, but when using unpairedMsa, these parameters still need to be present (even if they're left empty).
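As a quick pre-flight check, something like the sketch below could flag entries that set unpairedMsa but leave out the other two fields. It only encodes the rule as stated in this comment; `missing_msa_fields` is a hypothetical helper, not an AlphaFold 3 function, and the short sequences are placeholders.

```python
# Fields that, per the comment above, must accompany a custom unpairedMsa.
REQUIRED_WITH_CUSTOM_MSA = ("pairedMsa", "templates")


def missing_msa_fields(protein):
    """List the companion fields missing from a protein entry with unpairedMsa."""
    if "unpairedMsa" not in protein:
        return []  # no custom MSA supplied, so the rule does not apply
    return [key for key in REQUIRED_WITH_CUSTOM_MSA if key not in protein]


complete = {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "unpairedMsa": ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-",
    "pairedMsa": "",   # present but empty
    "templates": [],   # present but empty
}
incomplete = {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "unpairedMsa": ">seq1\nPVLSCGEWQL",
}

print(missing_msa_fields(complete))
print(missing_msa_fields(incomplete))
```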

Augustin-Zidek added the "documentation" (Improvements or additions to documentation) and "question" (Further information is requested) labels on Nov 15, 2024.
