Questions about using custom MSAs #289

ntnn19 · 2025-02-06T13:44:16Z

Hi,

Thank you very much for making your inference code publicly available.

My questions are as follows:

1. The documentation calls setting both unpairedMsa and pairedMsa to a custom non-empty A3M string an "expert option". Why is it considered an expert option?
2. The documentation recommends manual pairing or using the output of "appropriate software". Could you provide examples of such software, or recommended best practices for manual pairing?
3. The documentation recommends providing the paired MSA data using only the unpairedMsa field after manual pairing. This is counter-intuitive. Since the data is paired, why is it provided in the unpairedMsa field? Could the naming be clarified or the rationale explained more clearly?
4. For multimer prediction, is it possible to provide a custom, non-paired MSA for each chain and let your pipeline do the pairing? If so, which JSON key should hold these custom, non-paired MSAs?

I would appreciate your help on this.

Augustin-Zidek · 2025-02-06T14:52:58Z

Hi!

Here are answers to your questions:

1. The non-expert option is to let AlphaFold do the genetic search and pair the MSA as part of its data pipeline. Once you use a custom MSA, you need to make sure a lot of things are OK: the format, alignment, ordering by quality, deduplication, ... Moreover, for multimers, you have to make sure the pairing was done right. In other words, a lot of things can be subtly wrong in the MSA, leading to worse prediction quality, and we want to highlight that to our users by calling it an expert option.
2. See e.g. https://zhanggroup.org/cpxDeepMSA/, https://seq2fun.dcmb.med.umich.edu/DeepMSA2/, https://github.com/sokrypton/ColabFold, ... For manual pairing, put the sequences you want to be paired in the same row of the MSA (unpairedMsa) of each chain in the complex, and pad with gap-only sequences where a chain has no partner for that row. Typically you want to pair together sequences that are e.g. from the same organism or have some biological relationship.
3. Yes, the naming is unfortunately bad (sorry!). See MSA Pairing #257 for a detailed discussion.
4. Yes, there are two options. Either provide the unpairedMsa for each chain, in which case they will be paired row-wise in the order provided, or provide the pairedMsa for each chain and make sure the naming matches the UniProt naming scheme, in which case AlphaFold will pair using organism IDs. However, pairing should not matter that much in practice. I would simply provide a custom unpaired MSA for each chain and see what the prediction looks like. Only if the predictions are not good (i.e. low predicted accuracy metrics) would I tweak the MSA.

The simplest option though is to let AlphaFold build the MSA using its built-in data pipeline.
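
For readers who do want to try the manual pairing described above (partner sequences on the same row, gap-only padding elsewhere), here is a minimal sketch of one way it could be done. It is not AlphaFold code: the function name `build_row_paired_a3ms`, the `(pairing_key, aligned_sequence)` input layout, and the rule "keep a row if at least two chains have a hit for that key" are all assumptions made for illustration. It also assumes the hits are already aligned to their query and sorted best-first, and it drops hits that have no partner in any other chain.

```python
def build_row_paired_a3ms(queries, hits):
    """Return one A3M string per chain, with partner sequences on the same row.

    queries: dict of chain ID -> query sequence.
    hits: dict of chain ID -> list of (pairing_key, aligned_sequence), where the
        pairing key is whatever defines a pair (e.g. an organism identifier).
        Both the layout and the key are illustrative assumptions.
    """
    chain_ids = list(queries)

    # Keep only the first hit per pairing key in each chain
    # (assumes hits are ordered best-first).
    best = {chain_id: {} for chain_id in chain_ids}
    for chain_id in chain_ids:
        for key, aligned_seq in hits.get(chain_id, []):
            best[chain_id].setdefault(key, aligned_seq)

    # Collect pairing keys in order of first appearance across chains.
    ordered_keys = []
    for chain_id in chain_ids:
        for key in best[chain_id]:
            if key not in ordered_keys:
                ordered_keys.append(key)

    # Emit a row only for keys that occur in at least two chains.
    paired_keys = [
        key for key in ordered_keys
        if sum(key in best[chain_id] for chain_id in chain_ids) >= 2
    ]

    a3ms = {}
    for chain_id in chain_ids:
        gap_row = "-" * len(queries[chain_id])  # gap-only padding row
        lines = [">query", queries[chain_id]]
        for key in paired_keys:
            lines.append(f">{key}")
            lines.append(best[chain_id].get(key, gap_row))
        a3ms[chain_id] = "\n".join(lines) + "\n"
    return a3ms


# Toy example with made-up sequences and organism keys.
example_queries = {"A": "MKV", "B": "GSTL", "C": "AC"}
example_hits = {
    "A": [("ECOLI", "MRV"), ("HUMAN", "MKI")],
    "B": [("HUMAN", "GSSL"), ("YEAST", "GTTL")],
    "C": [("ECOLI", "AC")],
}
example_a3ms = build_row_paired_a3ms(example_queries, example_hits)
```

Every chain ends up with the same number of rows in the same order, so each resulting string can go into that chain's unpairedMsa field. Whether partial pairs (rows where some chains only have gaps) help or hurt is something to check against the prediction metrics rather than assume.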

GXcells · 2025-02-07T08:15:41Z

So basically we can run the MSA step of the data pipeline on each protein separately and then feed in the unpaired MSAs when predicting a multimer complex? Will the pipeline then automatically also build the paired MSA for the given complex? I am asking because recomputing the MSA every time is too time- and resource-consuming when the same protein is used in several different complexes.

Augustin-Zidek · 2025-02-07T11:39:23Z

Yes indeed, you can run the AlphaFold 3 data pipeline for each of the monomer chains individually, then compose input JSONs that include the MSAs for the various multimer combinations. See #171, where this was discussed in more detail.
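
As a rough illustration of the "compose the input JSON" step, the sketch below merges already-computed monomer results into one multimer input. It is not part of AlphaFold 3. It assumes that each monomer run produced a JSON whose top level has `sequences`, `dialect` and `version` keys and whose single `protein` entry already carries its MSA fields (e.g. unpairedMsa and pairedMsa); those key names reflect my reading of the AlphaFold 3 input format and should be checked against its documentation. The function names and the file names in the usage comment are hypothetical.

```python
import json


def protein_entry(monomer_data):
    """Return the single protein entry from one monomer data-pipeline result."""
    proteins = [s["protein"] for s in monomer_data["sequences"] if "protein" in s]
    if len(proteins) != 1:
        raise ValueError(f"expected exactly one protein chain, found {len(proteins)}")
    return proteins[0]


def compose_multimer_input(name, chain_to_monomer, model_seeds=(1,)):
    """Build a multimer input dict from per-chain monomer results.

    chain_to_monomer: dict of new chain ID -> loaded monomer JSON (a dict).
    The same monomer dict may be reused under several chain IDs.
    """
    sequences = []
    for chain_id, monomer_data in chain_to_monomer.items():
        # Shallow copy so the MSA fields are kept and the source dict is not mutated.
        entry = dict(protein_entry(monomer_data))
        entry["id"] = chain_id
        sequences.append({"protein": entry})

    # Reuse the dialect/version of the first monomer file instead of guessing them.
    first = next(iter(chain_to_monomer.values()))
    return {
        "name": name,
        "modelSeeds": list(model_seeds),
        "sequences": sequences,
        "dialect": first["dialect"],
        "version": first["version"],
    }


def load_json(path):
    """Load a JSON file into a dict."""
    with open(path) as f:
        return json.load(f)


# Usage with hypothetical file names:
#   complex_ab = compose_multimer_input(
#       "protA_protB",
#       {"A": load_json("protA_data.json"), "B": load_json("protB_data.json")},
#   )
#   with open("protA_protB.json", "w") as f:
#       json.dump(complex_ab, f, indent=2)
```

Because the per-chain entries are copied unchanged apart from the chain ID, the expensive genetic search runs once per protein and its result can be reused across any number of complexes.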

ntnn19 changed the title ~~Question about using custom MSAs~~ Questions about using custom MSAs Feb 6, 2025

Augustin-Zidek added the question Further information is requested label Feb 6, 2025

Augustin-Zidek closed this as completed Feb 10, 2025
