Performance Grading of Clustering Algorithms on Molecular Dynamics Simulations of Proteins

Research Poster

Abstract

Computational studies continue to play an important role in modeling and understanding protein dynamics in biology. Molecular Dynamics (MD) can model the molecular structures of proteins and simulate their motion over nanosecond-to-microsecond time scales using classical mechanics. MD simulations can reveal insights into the folding process that are beyond present laboratory means. Once MD simulation has generated trajectories of a protein's motion, machine learning algorithms such as k-means, spectral, and subspace clustering help identify the structures and processes that are integral to folding, a task that is challenging to do by eye. We aimed to evaluate the performance of these algorithms, with a special focus on the recent hybrid spectral/subspace method, by comparing their normalized mutual information (NMI) scores over cumulative simulation time. Principal Component Analysis (PCA) was performed to visualize the trajectories and their clustering results. The theory of protein dynamics suggests that, given an infinite amount of time, the sampling space should become increasingly mixed. Algorithms that can still identify distinct structures are therefore better suited for clustering MD data. We found that the hybrid spectral/subspace method delivered the best performance overall and provided the most conservative estimate of sampling adequacy.
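As a rough illustration of the metric named in the abstract, the sketch below computes NMI between two per-frame cluster labelings and then repeats the calculation on growing prefixes of the trajectory. It is not taken from the repository's notebooks. The function names, the `step` parameter, and the toy label lists are hypothetical, the choice of which two labelings to compare is an assumption, and so is the arithmetic-mean normalization (the abstract does not say which variant of NMI was used).

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same frames."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI normalized by the arithmetic mean of the two entropies (assumed variant)."""
    if len(a) != len(b):
        raise ValueError("labelings must cover the same frames")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        # Both labelings put every frame in one cluster, so they agree trivially.
        return 1.0
    return mutual_information(a, b) / ((h_a + h_b) / 2)


def cumulative_nmi(a, b, step):
    """NMI on growing prefixes, i.e. over cumulative simulation time."""
    return [(end, nmi(a[:end], b[:end])) for end in range(step, len(a) + 1, step)]


# Hypothetical per-frame labels from two clusterings of the same trajectory.
labels_first = [0, 0, 0, 1, 1, 1, 2, 2]
labels_second = [1, 1, 0, 0, 0, 0, 2, 2]
for end, score in cumulative_nmi(labels_first, labels_second, step=4):
    print(f"first {end} frames: NMI = {score:.3f}")
```

NMI depends only on how frames are grouped, not on the label values themselves, so two clusterings that differ only by a renaming of clusters score 1.0.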

Introduction and Background

MD simulations are a common tool in computational biology for simulating the dynamic behavior of biomolecular structures such as proteins. Classical mechanics models the forces acting on a protein at the molecular level, and computers simulate the resulting motion, which allows exploration of the protein's energy landscape and, consequently, its conformational states. Of special interest is the protein's native conformation, which determines the protein's function. This matters in many areas of research, including the design of safe and effective drugs, vaccine production, and the study of neurodegenerative diseases. However, while MD simulation is an effective way of studying protein folding, it is computationally expensive. For proteins that fold on time scales beyond the millisecond range, adequate exploration of the possible conformational states can take an infeasible amount of time (months or years). Proteins can also get stuck negotiating energy barriers on the way to their native conformations. Consequently, automated methods that can accurately organize protein conformations are especially useful to computational biologists studying proteins and protein folding. We investigate four clustering algorithms and assess their performance through comparative analysis.

Discussion

For computational biologists, the more accurate the algorithm, the better. The hybrid spectral/subspace method performed well, achieving the highest NMI in the most difficult space.
• scikit-learn's ordinary k-means algorithm delivered very respectable performance and often beat spectral and subspace clustering. However, manual hyperparameter optimization may have hidden the clustering power of the spectral and subspace algorithms.
• Different force fields for the same protein may affect the robustness of the clustering algorithms. Further inquiry is required.
