Computational studies continue to play an important role in modeling and understanding protein dynamics in biology. Molecular Dynamics (MD) can model the molecular structures of proteins and simulate their motion over nanosecond-to-microsecond time scales using classical mechanics. MD simulations can reveal insights into the folding process that are beyond present laboratory means. Once MD simulation has generated trajectories of a protein's motion, machine learning algorithms such as k-means, spectral, and subspace clustering help identify the structures and processes that are integral to folding, a task that is challenging to do by eye. We aimed to evaluate the performance of these algorithms, with a special focus on the recent hybrid spectral/subspace method, by comparing their normalized mutual information (NMI) scores over cumulative simulation time. Principal Component Analysis (PCA) was performed to visualize the trajectories and their clustering results. The theory of protein dynamics suggests that, given an infinite amount of time, the sampling space should become increasingly mixed. Algorithms that can still identify distinct structures as this mixing increases are therefore better suited for clustering MD data. We found that the hybrid spectral/subspace method delivered the best performance overall and provided the most conservative estimate of sampling adequacy.
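The evaluation described above (clustering, NMI over cumulative simulation time, and a PCA projection) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the trajectory is synthetic Gaussian data rather than MD output, the reference labels are the synthetic ground truth (the text does not state which labels the NMI was computed against), and the helper names `make_toy_trajectory` and `nmi_over_cumulative_time` are hypothetical. Only k-means is shown; the other algorithms would slot into the same loop.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score


def make_toy_trajectory(n_frames=600, n_features=30, n_states=3, seed=0):
    """Synthetic stand-in for an MD trajectory (assumption, not real data).

    Each frame is drawn around one of a few Gaussian "conformational states".
    """
    rng = np.random.default_rng(seed)
    centers = rng.normal(scale=5.0, size=(n_states, n_features))
    labels = rng.integers(0, n_states, size=n_frames)
    frames = centers[labels] + rng.normal(size=(n_frames, n_features))
    return frames, labels


def nmi_over_cumulative_time(frames, reference_labels, n_clusters, checkpoints):
    """Cluster the first t frames for each checkpoint t and score with NMI."""
    scores = []
    for t in checkpoints:
        model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        predicted = model.fit_predict(frames[:t])
        scores.append(normalized_mutual_info_score(reference_labels[:t], predicted))
    return scores


frames, labels = make_toy_trajectory()
checkpoints = [150, 300, 450, 600]  # cumulative frame counts (hypothetical)
scores = nmi_over_cumulative_time(frames, labels, n_clusters=3, checkpoints=checkpoints)

# Two-component PCA projection, suitable for a scatter plot colored by cluster.
projection = PCA(n_components=2).fit_transform(frames)

for t, score in zip(checkpoints, scores):
    print(f"first {t} frames: NMI = {score:.3f}")
```

Plotting `scores` against `checkpoints` gives one NMI curve per algorithm, which is the kind of comparison the study describes.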
MD simulations are a common tool in computational biology for simulating the dynamic behavior of biomolecular structures such as proteins. Classical mechanics models the forces acting on a protein at the molecular level, and computers simulate their effects, allowing exploration of the protein's energy landscape and, consequently, its conformational states. Of special interest is the protein’s native conformational state, which determines the protein’s function. This matters in many areas of science, including the design of safe and effective drugs, vaccine production, the study of neurodegenerative diseases, and other biomedical research. However, while MD simulation is an effective way of studying protein folding, it is computationally expensive. For folding that happens beyond the millisecond range, adequate exploration of the possible conformational states can take an infeasible amount of time (months or years). Proteins can also get stuck negotiating energy barriers on the way to their native conformations. Consequently, automated methods that can accurately organize protein conformations are especially useful to computational biologists studying proteins and protein folding. We investigate four clustering algorithms and assess their performance through comparative analysis.
For computational biologists, the more accurate the algorithm, the better.
• The hybrid spectral/subspace method performed well, achieving the highest NMI in the most difficult space.
• scikit-learn’s ordinary k-means algorithm delivered respectable performance and often outperformed the spectral and subspace algorithms. However, manual hyperparameter optimization may have hidden the clustering power of the spectral and subspace algorithms.
• Different force fields for the same protein may affect the robustness of the clustering algorithms. Further inquiry is required.
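One way to reduce dependence on manual hyperparameter tuning is an automated sweep. The sketch below is an illustration only, not the procedure used in this study: it runs on synthetic blobs rather than MD data, the candidate `gamma` values are arbitrary, and it selects by silhouette score, a label-free criterion chosen here as an assumption.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for featurized conformations (assumption, not MD data).
X, _ = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)
X = StandardScaler().fit_transform(X)

gammas = [0.1, 1.0, 10.0]  # hypothetical candidate RBF kernel widths
results = {}
for gamma in gammas:
    model = SpectralClustering(
        n_clusters=3, affinity="rbf", gamma=gamma, random_state=0
    )
    predicted = model.fit_predict(X)
    # The silhouette score is undefined when only one cluster is found.
    if len(np.unique(predicted)) < 2:
        continue
    results[gamma] = silhouette_score(X, predicted)

# Keep the setting with the highest silhouette score.
best_gamma = max(results, key=results.get)
best_score = results[best_gamma]
print(f"best gamma = {best_gamma}, silhouette = {best_score:.3f}")
```

The same loop could wrap other hyperparameters, such as the number of clusters, and a different selection criterion could replace the silhouette score.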