In eukaryotes, gene expression is regulated by enhancers: sequences adjacent to the gene coding region that recruit different constellations of transcription factors, which in turn give rise to specific patterns of gene expression. We will develop a strategy in which enhancers in different organisms are constructed bottom-up from a minimal number of transcription factor motifs (building blocks). We will then gradually increase the complexity of the synthetic enhancers and study the effects on gene-expression output.
We will also pursue a top-down approach by introducing variation into natural enhancers, for example by comparing evolutionarily close organisms and by perturbing natural enhancers in a systematic fashion. This will allow us to map hotspots of sequence variation, which can in turn be modelled with advanced computational tools to identify potential regulatory rules.
We will build a predictive nucleotide-level model of enhancer function in order to extract the underlying regulatory principles from the data obtained from both the top-down and bottom-up approaches. We will use two orthogonal modelling approaches. The first, a thermodynamic model, will be implemented initially on the minimal enhancer dataset. It will employ numerical algorithms that compute thermodynamically feasible configurations for DNA-protein (nucleoprotein) complexes, and will then use these ensembles to compute the probability of looping and estimate the resulting effects on gene expression. After being tuned on the bottom-up dataset, the thermodynamic model will be refined further on the top-down dataset. The second approach will employ equilibrium-based models that we and others have devised, which describe transcriptional regulation as an ‘equilibrium competition’ between DNA-binding molecules that gives rise to different configurations with varying probabilities. These statistical models allow us to compute the expression level of each sequence by weighting the expression output of each configuration by its probability.
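The configuration-weighting idea behind the second approach can be illustrated with a minimal Python sketch. This is not the model proposed here; it is a toy version under stated assumptions: each binding site is either empty or bound, each bound site contributes a dimensionless statistical weight (a hypothetical product of factor concentration and binding affinity), adjacent bound sites share a hypothetical cooperativity factor, and each bound site multiplies a basal expression level by a hypothetical fold-effect. The function names and parameters are illustrative only.

```python
from itertools import product


def configuration_weights(site_weights, cooperativity=1.0):
    """Return the statistical weight of every occupancy configuration.

    site_weights: one dimensionless weight per binding site
                  (assumed here to stand for concentration x affinity).
    cooperativity: assumed factor applied once per pair of adjacent bound sites.
    """
    weights = {}
    for config in product((0, 1), repeat=len(site_weights)):
        w = 1.0
        for bound, q in zip(config, site_weights):
            if bound:
                w *= q
        # Apply the cooperativity factor to each adjacent pair of bound sites.
        for left, right in zip(config, config[1:]):
            if left and right:
                w *= cooperativity
        weights[config] = w
    return weights


def expected_expression(site_weights, site_effects, basal=1.0, cooperativity=1.0):
    """Average the expression of all configurations, weighted by probability.

    site_effects: assumed fold-change each bound site applies to `basal`
                  (>1 for an activator, <1 for a repressor).
    """
    weights = configuration_weights(site_weights, cooperativity)
    partition = sum(weights.values())  # normalizes weights into probabilities
    total = 0.0
    for config, w in weights.items():
        level = basal
        for bound, effect in zip(config, site_effects):
            if bound:
                level *= effect
        total += (w / partition) * level
    return total


if __name__ == "__main__":
    # Two hypothetical activator sites, with and without cooperativity.
    print(expected_expression([1.0, 1.0], [2.0, 2.0]))
    print(expected_expression([1.0, 1.0], [2.0, 2.0], cooperativity=10.0))
```

Exhaustive enumeration scales as 2^n in the number of sites, so it only suits small synthetic enhancers; larger designs would need a dynamic-programming or sampling scheme.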
Notably, the large number of sequences measured and our ability to design the identity of each sequence should provide us with an unprecedented opportunity to accurately learn the details of such models and use them to generate and test hypotheses about mechanisms of transcriptional regulation.
Next, we will assess the model's predictions of enhancer activity in bacteria, yeast and mammalian cells. In bacteria and yeast, we will generate and study thousands of phenotypes, testing predictions for base-pair modifications of a given enhancer across hundreds of enhancer elements. In ES cells, we will use a loss-of-function library and a gain-of-function transcription factor library to validate which transcription factors’ activity is mediated through each sequence motif.
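As an illustration of what systematic base-pair modification of a single enhancer could look like, the following Python sketch enumerates every single-base substitution of a sequence. It is an assumption-laden toy, not the library design used in this work: the function name, the restriction to substitutions (no insertions or deletions), and the example sequence are all hypothetical.

```python
def single_base_variants(sequence, alphabet="ACGT"):
    """List every single-base substitution of `sequence`.

    Returns tuples of (position, original_base, new_base, variant_sequence),
    with 0-based positions. Only substitutions are generated.
    """
    sequence = sequence.upper()
    variants = []
    for position, original in enumerate(sequence):
        for base in alphabet:
            if base != original:
                variant = sequence[:position] + base + sequence[position + 1:]
                variants.append((position, original, base, variant))
    return variants


if __name__ == "__main__":
    # Hypothetical 6-bp motif; a sequence of length n yields 3 * n variants.
    for entry in single_base_variants("TGACTC"):
        print(entry)
```

Each variant can then be paired with a model prediction and a measured phenotype, so that every position in the element is tested against the model.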
The data obtained from all of the above stages will culminate in what we refer to as the grammar rules database: a comprehensive database comprising the grammar rules of the regulatory code. Once the grammar rules database is available, we will address the spatial and temporal activity of selected elements, with specific predictions examined in transgenic Drosophila embryos. All stages will both gain from and contribute to the other stages, and lessons learned in each will help refine the grammar rules database.
We believe that the hierarchical multi-level approach proposed here will allow us to reliably cover the regulatory phase space and to sample this space at the embryo level in a way that convincingly demonstrates an understanding of regulatory logic at both the cellular and the organism level.