MSA format for "unpairedMsa" in fold_input.json #47

Open
smg3d opened this issue Nov 15, 2024 · 7 comments
Labels: documentation, question

Comments

@smg3d

smg3d commented Nov 15, 2024

Thanks for providing the AF3 source; it is really appreciated.

I could not find the format to use in order to provide our own MSA in the input json file.

The input documentation mentions "If the unpairedMsa field is set to a custom A3M string, AlphaFold 3 will use the provided MSA instead of building one as part of the data pipeline. This is considered an expert option.". But what is the format of the "custom A3M string"?

The doc provides the following two examples, but does not show the string or list format for unpairedMsa:

{
  "protein": {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "modifications": [
      {"ptmType": "HY3", "ptmPosition": 1},
      {"ptmType": "P1L", "ptmPosition": 5}
    ],
    "unpairedMsa": ...,
    "pairedMsa": ...,
    "templates": [...]
  }
}

and

{
  "protein": {
    "id": "A",
    "sequence": ...,
    "unpairedMsa": "The A3M you want to run with",
    "pairedMsa": "",
    "templates": []
  }
}

For "unpairedMsa", I tried a filename and various list formats, but none of them worked.

@Hanziwww

Here is my suggestion:

  1. Prepare Your MSA: Format your MSA in A3M, which is similar to FASTA but can include lowercase letters for insertions.

  2. Embed MSA Content in JSON: Place your MSA content in the "unpairedMsa" field of the input JSON file. Ensure newline characters are correctly handled with \n.

Example:

{
  "protein": {
    "id": "A",
    "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF",
    "unpairedMsa": ">seq1\\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF\\n>seq2\\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTFFPHF",
    "pairedMsa": "",
    "templates": []
  }
}

Considerations:

  • Handling Newlines: In JSON strings, newlines should be represented by \\n (in the actual JSON file, it’s \n, but needs escaping in strings).

  • Direct Embedding: The "unpairedMsa" field should contain the actual MSA content string, not a filename or path.

  • Validate JSON Format: Make sure your JSON file is correctly formatted. You might want to use an online JSON validator for checking.
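As a local alternative to an online validator, the considerations above can be checked with a few lines of Python. This is only a sketch: `parse_a3m` and `check_unpaired_msa` are hypothetical helper names, not part of AlphaFold 3, and the checks (first record equals the query, every row has as many aligned columns as the query once lowercase insertions are dropped) are assumptions about what a well-formed A3M should satisfy.

```python
def parse_a3m(a3m: str) -> list[tuple[str, str]]:
    """Split an A3M string into (description, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in a3m.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_unpaired_msa(protein: dict) -> list[str]:
    """Return a list of problems found in a protein entry's unpairedMsa."""
    query = protein["sequence"]
    records = parse_a3m(protein.get("unpairedMsa") or "")
    if not records:
        return ["unpairedMsa contains no A3M records"]
    problems = []
    if records[0][1] != query:
        problems.append("first MSA sequence does not match the query sequence")
    for name, seq in records:
        # Lowercase letters are insertions and do not occupy an alignment column.
        aligned = "".join(c for c in seq if not c.islower())
        if len(aligned) != len(query):
            problems.append(
                f"{name}: {len(aligned)} aligned columns, expected {len(query)}"
            )
    return problems


# Hypothetical protein entry, already loaded from JSON into a dict.
example = {
    "sequence": "PVLSCGEWQL",
    "unpairedMsa": ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-",
}
print(check_unpaired_msa(example))
```

An empty list means none of the assumed checks failed; it does not guarantee that AlphaFold 3 will accept the input.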

@smg3d
Author

smg3d commented Nov 15, 2024

Thanks @Hanziwww.

Does that input.json work for you? For me, it does not recognize the first sequence of the MSA (looks like it reads an empty sequence):

    raise ValueError(
ValueError: First MSA sequence  is not the query_sequence='MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF'

@Hanziwww

Hi @smg3d,

You're absolutely right, I made a mistake in my previous response. The newline character in JSON strings should be represented as \n, not \\n. With \\n, the JSON parser yields a literal backslash followed by the letter n instead of a line break, so the A3M is never split into header and sequence lines, which leads to errors like the one you encountered.
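The difference can be reproduced with Python's standard json module. The snippet below is a small illustration, not AlphaFold 3 code; the two JSON fragments are made up for the demonstration and use raw strings so that the backslashes appear exactly as they would in the file.

```python
import json

# JSON text as it should appear in the file: one backslash before the n.
single = r'{"unpairedMsa": ">seq1\nPVLSCGEWQL"}'
# JSON text with a doubled backslash, as in the earlier example.
double = r'{"unpairedMsa": ">seq1\\nPVLSCGEWQL"}'

msa_single = json.loads(single)["unpairedMsa"]
msa_double = json.loads(double)["unpairedMsa"]

# \n decodes to a real line break, so the header and sequence are separate lines.
print(msa_single.splitlines())
# \\n decodes to a backslash plus "n", so everything stays on the header line.
print(msa_double.splitlines())
```

In the second case the parser sees a single header line and no sequence, which is consistent with the empty first sequence reported in the ValueError.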

Here's the corrected JSON input:

{
  "name": "My AlphaFold Job",
  "modelSeeds": [1],
  "sequences": [
    {
      "protein": {
        "id": "A",
        "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF",
        "unpairedMsa": ">seq1\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF\n>seq2\nMVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTFFPHF",
        "pairedMsa": "",
        "templates": []
      }
    }
  ],
  "dialect": "alphafold3",
  "version": 1
}
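Rather than escaping the newlines by hand, the JSON can be generated from the A3M text so that the serializer handles the escaping. The following is a sketch under that assumption: `build_fold_input` is a hypothetical helper, the field names are the ones used in this thread, and the A3M text is inlined here where a real script would read it from a file.

```python
import json


def build_fold_input(name: str, sequence: str, a3m_text: str, chain_id: str = "A") -> dict:
    """Assemble a single-protein fold input using the fields shown in this thread."""
    return {
        "name": name,
        "modelSeeds": [1],
        "sequences": [
            {
                "protein": {
                    "id": chain_id,
                    "sequence": sequence,
                    "unpairedMsa": a3m_text,  # real newlines; json.dumps escapes them
                    "pairedMsa": "",
                    "templates": [],
                }
            }
        ],
        "dialect": "alphafold3",
        "version": 1,
    }


# Inlined for the example; in practice this would come from an .a3m file.
a3m_text = ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-\n"
fold_input = build_fold_input("My AlphaFold Job", "PVLSCGEWQL", a3m_text)
json_text = json.dumps(fold_input, indent=2)
print(json_text)
```

The printed "unpairedMsa" value contains \n with a single backslash, and loading the text back with json.loads returns the original A3M unchanged.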

Here's how you can run AlphaFold using Docker with the corrected JSON:

docker run -it \
  --volume /home/mars/disk3/af3input:/root/af_input \
  --volume /home/mars/disk3/af3output:/root/af_output \
  --volume /home/mars/disk3/af3md:/root/models \
  --volume /home/mars/disk3/af3db:/root/public_databases \
  --gpus all alphafold3 \
  python run_alphafold.py \
  --json_path=/root/af_input/fold_input.json \
  --model_dir=/root/models \
  --output_dir=/root/af_output

output cif: my_alphafold_job_model.zip

Sorry for the misleading answer.

@smg3d
Author

smg3d commented Nov 15, 2024

Thanks @Hanziwww.

It works now.

I think it might be a good idea to show such an example in the input doc:

{
  "protein": {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "modifications": [
      {"ptmType": "HY3", "ptmPosition": 1},
      {"ptmType": "P1L", "ptmPosition": 5}
    ],
    "unpairedMsa": ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-",
    "pairedMsa": ...,
    "templates": [...]
  }
}

@Hanziwww

I'm glad to hear that the input is working now.

By the way, I'd like to introduce a user-friendly graphical interface that I developed for generating the input JSON and running AlphaFold 3 predictions. Feel free to check out the GUI repository.

@Augustin-Zidek added the documentation and question labels Nov 15, 2024

@sky1ove

sky1ove commented Nov 18, 2024

It works. Thanks for sharing.

May I ask whether skipping pairedMsa and templates affects model performance? I didn't see much difference on my end, though.

@Hanziwww

Certainly. According to the guidelines, you can skip pairedMsa and templates, but when using unpairedMsa, these parameters still need to be present (even if they're left empty).
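A small helper can make that requirement hard to forget. This sketch assumes, based only on the reply above, that empty values ("" for pairedMsa and [] for templates) are acceptable whenever a custom unpairedMsa is supplied; `fill_msa_defaults` is a hypothetical name, not an AlphaFold 3 function.

```python
def fill_msa_defaults(protein: dict) -> dict:
    """Return a copy with empty pairedMsa/templates added when unpairedMsa is set."""
    result = dict(protein)
    if result.get("unpairedMsa"):
        # Only fill in what is missing; values the user set are kept as-is.
        result.setdefault("pairedMsa", "")
        result.setdefault("templates", [])
    return result


protein = {
    "id": "A",
    "sequence": "PVLSCGEWQL",
    "unpairedMsa": ">seq1\nPVLSCGEWQL\n>seq2\nPILSCADWQ-",
}
completed = fill_msa_defaults(protein)
print(sorted(completed))
```

Entries without a custom unpairedMsa are returned unchanged, so in that case the data pipeline is left to build the MSA and search for templates itself.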
