Lay Summary

Figure: An example of a structure determined by NMR (PDB entry 1A5R), consisting of a rigid central domain and two highly flexible terminal chains.

For most of the history of structural biology, we have tended to consider proteins only as static structures when explaining their function and properties, largely due to the early development of techniques such as X-ray crystallography. However, newer techniques, such as protein NMR, have allowed us to study the movements, or dynamics, of proteins, and the degree to which these dynamics contribute to protein function in nature has only recently been fully appreciated. Furthering our understanding of protein dynamics will be essential for understanding diseases caused by misfolding and aggregation, such as Alzheimer’s, for developing next-generation drugs, and for de novo protein design.

Substantial progress has been made recently in the field of protein structure prediction through the application of machine learning. However, these benefits have yet to be seen to the same extent in other areas of structural bioinformatics. Here, we combine recent developments in molecular simulation with a type of machine learning model that can natively take protein structures as input, and assess how well these methods perform at learning the dynamics of small proteins.

Our learnt dynamics model is able to model backbone flexibility and can simulate small proteins over short timesteps. However, in comparison with other force fields and known NMR data, our model tends to introduce too much flexibility into the protein and does not capture the complex interactions between side-chain and backbone atoms that are crucial for stabilising tertiary structure over longer periods. Most importantly, we demonstrate that neural-network-based approaches have the potential to learn accurate dynamics with short training times and small amounts of data, which will be important if these approaches are to be used in other applications (e.g. protein-drug interactions). We conclude by discussing possible directions for future research, specifically models that are better suited to learning on proteins (and their shapes) and that can also learn higher-order interactions in proteins.

Abbreviations and Glossary

Automatic Differentiation: A computational method to quickly determine the derivative of a function (also known as backpropagation when used in Deep Learning)
Dihedral Angle: The angle formed between two intersecting planes, a common way to describe the geometry of four atoms in space
Differentiable Molecular Simulators: A class of molecular simulator which is fully differentiable, meaning the parameters can be very quickly refined (see the first sketch after this glossary)
Graph Neural Networks: A type of neural network which takes graph-structured data as input
Intrinsically Disordered Proteins: Highly flexible proteins that do not form stable structures yet still perform biological functions
K-Nearest Neighbours Algorithm: A non-parametric classification method to determine the k closest neighbours of an object
Loss Function: The function by which we can measure the "cost" of some event. In machine learning, this function is minimised during training.
Multi-Layer Perceptron: The most basic kind of artificial neural network
Molecular Dynamics: A simulation method for the study of the movements of atoms and molecules
Protein Dynamics: The study of the transitions proteins make between stable structures
Root Mean Squared Deviation: A measure of the average distance between atoms of two superimposed proteins (see the second sketch after this glossary)
Root Mean Squared Fluctuations: A measure of the average flexibility of a protein or of individual residues
Simplicial Complexes: Topological objects which generalise graphs to higher dimensions
A novel message-passing layer for graphs that is interpretable
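
To make the Automatic Differentiation and Differentiable Molecular Simulators entries above concrete, the first sketch below shows how a differentiable framework can obtain forces as the negative gradient of a potential energy. This is an illustrative toy, not code from this work: the single harmonic bond term and all function and parameter names are my own, and PyTorch's autograd stands in for whichever framework a given simulator uses.

```python
import torch

def harmonic_bond_energy(coords, bonds, k=100.0, r0=1.5):
    # coords: (n_atoms, 3) positions; bonds: (n_bonds, 2) atom index pairs.
    # Toy harmonic bond potential: E = sum over bonds of 0.5 * k * (r - r0)^2.
    diffs = coords[bonds[:, 0]] - coords[bonds[:, 1]]
    dists = diffs.norm(dim=1)
    return (0.5 * k * (dists - r0) ** 2).sum()

coords = torch.randn(4, 3, requires_grad=True)   # 4 toy atoms
bonds = torch.tensor([[0, 1], [1, 2], [2, 3]])   # a short chain

energy = harmonic_bond_energy(coords, bonds)
# Forces are the negative gradient of the energy with respect to the
# coordinates; automatic differentiation provides them in one call.
forces = -torch.autograd.grad(energy, coords)[0]
print(forces.shape)  # torch.Size([4, 3])
```

Because every operation here is differentiable, the same machinery can also propagate gradients back to the potential's parameters (here k and r0), which is what allows a differentiable simulator to refine them quickly during training.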
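The second sketch illustrates the Root Mean Squared Deviation and Root Mean Squared Fluctuations entries. It assumes the structures have already been superimposed (the alignment step is omitted for brevity), and the array shapes and names are illustrative rather than taken from this work.

```python
import numpy as np

def rmsd(a, b):
    # a, b: (n_atoms, 3) coordinate arrays of two superimposed structures.
    # Average distance between corresponding atoms.
    return np.sqrt(((a - b) ** 2).sum(axis=1).mean())

def rmsf(trajectory):
    # trajectory: (n_frames, n_atoms, 3) aligned simulation frames.
    # Per-atom fluctuation about the mean position over the trajectory.
    mean_pos = trajectory.mean(axis=0)
    return np.sqrt(((trajectory - mean_pos) ** 2).sum(axis=2).mean(axis=0))

traj = np.random.rand(10, 5, 3)  # toy trajectory: 10 frames, 5 atoms
print(rmsd(traj[0], traj[1]))    # deviation between two frames
print(rmsf(traj))                # one flexibility value per atom
```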

Acknowledgements

My thanks go to Michael Bronstein and Bruno Correia for their supervision and generous access to computational resources.

I would also like to thank Freyr Sverrisson for his guidance throughout the project, Fabrizio Frasca for the discussions on simplicial complexes, and Joe Greener for providing the RMSF data for his model.