Feature context-dependency and complexity-reduction in probability landscapes for integrative genomics

1 Institut des Hautes Études Scientifiques, Bures-sur-Yvette, France
2 Institut de Recherche Interdisciplinaire – CNRS USR3078 – Université Lille I, France

Theoretical Biology and Medical Modelling 2008, 5:21 doi:10.1186/1742-4682-5-21

The electronic version of this article is the complete one and can be found online at: http://www.tbiomed.com/content/5/1/21
© 2008 Lesne and Benecke; licensee BioMed Central Ltd.

Abstract

Background

The question of how to integrate heterogeneous sources of biological information into a coherent framework that allows the gene regulatory code in eukaryotes to be systematically investigated is one of the major challenges faced by systems biology. Probability landscapes, which include as reference set the probabilistic representation of the genomic sequence, have been proposed as a possible approach to the systematic discovery and analysis of correlations amongst initially heterogeneous and un-relatable descriptions and genome-wide measurements. Much of the available experimental sequence and genome activity information is de facto, but not necessarily obviously, context dependent. Furthermore, the context dependency of the relevant information is itself dependent on the biological question addressed. It is hence necessary to develop a systematic way of discovering the context-dependency of functional genomics information in a flexible, question-dependent manner.

Results

We demonstrate here how feature context-dependency can be systematically investigated using probability landscapes. Furthermore, we show how different feature probability profiles can be conditionally collapsed to reduce the computational and formal, mathematical complexity of probability landscapes. Interestingly, the possibility of complexity reduction can be linked directly to the analysis of context-dependency.

Conclusion

These two advances in our understanding of the properties of probability landscapes not only simplify subsequent cross-correlation analysis in hypothesis-driven model building and testing, but also provide additional insights into the biological gene regulatory problems studied. Furthermore, insights into the nature of individual features and a classification of features according to their minimal context-dependency are achieved.
The formal structure proposed contributes to a concrete and tangible basis for attempting to formulate novel mathematical structures for describing gene regulation in eukaryotes on a genome-wide scale. BackgroundThe deciphering of the gene regulatory code of eukaryotic cells and the inference of gene regulatory programs belong to the computationally "hard" problems that are very probably insoluble without using very large collections of experimental genome activity recordings under many different biological conditions in conjunction with empirical gene structure and function annotations [1-4]. Genomic sequence, gene structure and function annotation, as well as functional genomics experimental data, are of heterogeneous nature. In order to conceive computationally efficient algorithms capable of statistical integration of these different types of information, transformations of the different types of data into a continuous and homogeneous data structure have to be developed. We have recently proposed such a concept, which we refer to as probability landscapes [5]. Briefly, we have shown on theoretical grounds how any type of observable quantity (which we shall refer to hereafter as "feature") can, without loss of information, be transformed into a local probability with nucleotide resolution along the genome (creating what we define as a probability profile). For any feature, as for instance the predicted alpha-helicity of an inferred amino-acid sequence or the transcriptome of a cell recorded under a particular biological condition, such a local probability can be calculated for all nucleotides of the genome under study, resulting in a profile. If this procedure is repeated for many different features, a stack of probability profiles ("landscape") is obtained. 
While it might, at first sight, seem awkward to calculate a probability for every nucleotide in a genome to be part of an alpha-helix provided this nucleotide were part of an expressed codon, the advantage of translating any type of relevant experimental information into a homogeneous structure that can be used directly for statistical correlation analysis by far outweighs the apparent absurdity of having executed a secondary protein structure prediction algorithm on sequences that a priori are never even transcribed into RNA, let alone translated into protein. Furthermore, our information on transcribed sequences, for instance, is still incomplete – just consider the recent discoveries related to microRNAs – and hence a complete, unbiased probability annotation is more coherent [5]. Interestingly, a probabilistic framework also alleviates the problem of the formally undefined cause and effect relationship in the case of intrinsic stochasticity in the noisy experimental data by introducing the notion of fuzziness into the mapping; a process referred to as conditioning. The nature of biological experimentation imposes two general constraints that need to be taken into account, especially in the field of functional genomics. First, obviously, experimental information is never complete in that it is either a snap-shot of a dynamic reality, obtained as a mean measurement over large numbers of objects, biased by experimental or conceptual priors, or, most often, a combination of all the above, leading to context-dependency of the results. Second, the measurement itself introduces a non-negligible, albeit to some extent controllable, bias leading to further context-dependency of functional genomics data.
Moreover, biological systems themselves display a strong context-dependency, which is notably the object of study in functional genomics/systems biology: it is the combination of molecules in a cell that creates a biological function; hence the activity of a single molecule is context dependent. Thus, context-dependency of features is relevant for the comprehension of stimuli-responses and signals. Finally, context-dependency is itself question dependent. Consider the following example: determining whether or not a given cell is differentiated to some defined state requires investigation of the presence of state-specific gene products and functionalities and the concomitant absence of molecules and functions specific to other cell-states. It does not, however, require any knowledge about the time dependency of the changes in gene expression and cellular physiology. A time series of experiments conducted on a differentiating cell, in this case, can therefore be simply projected, eliminating the time-dimension in addressing the question. The projection thereby has an important advantage over a simple end-point comparison, as (i) intermediate events are not omitted from the analysis, and (ii) statistical power is improved. However, when one tries to infer gene regulatory circuits, the time dimension of the experimental data is of utmost importance, whereas for instance the estimates of absolute molecular species quantities are far less important. Furthermore, the available genomic information can often be analyzed in a hierarchical manner. For certain biological questions it will not be important to have a detailed knowledge of feature probability profiles themselves but rather a more integrated, coarse-grained, combination of individual features. Ideally, by combining different features the set-theoretic conditioning can be turned into an unambiguous and well-defined cause and effect mapping.
As studying different biological questions requires concomitant investigation of correlation and non-correlation, context-dependency and independency are similarly important. In conclusion, the very same set of information displays different context-dependencies as a function of the biological problem studied. We shall refer to this phenomenon from here on as "circumstantial context". We develop here a mathematical approach to the quantification and statistical significance testing of context dependency in functional genomics data using our previously developed probability landscape framework. As context-dependency is not an absolute but a relative quantity, a flexible approach depending on the biological problem studied has to be realized. We furthermore demonstrate how according to the circumstantial context even very large numbers of individual landscapes stemming from experimental recordings can be merged into a single, collapsed profile with greatly improved statistical properties. This procedure can therefore be used in a systematic and controlled manner to reduce the computational and formal complexity of probability landscapes. Increased algorithmic efficiency and statistical power result jointly with heightened understanding of the biological mechanisms. ResultsCircumstantial probability profilesCircumstantial context-dependency of functional genomics information does at the same time create important constraints, which need to be taken into consideration during statistical analysis, and simultaneously provides additional knowledge on the biological question studied. We have recently proposed probability landscapes as a means to integrate any relevant type of functional genomics information coherently and systematically into a structurally homogeneous object that can more easily be analyzed computationally. 
Here we asked whether or not the proposed structure of probability landscapes also permits systematic detection, analysis, and utilization of context-dependencies. Let X be an observable quantity under investigation, taking either discrete, possibly symbolic, or continuous values. We have shown how experimental information on X can be expressed in a homogeneous and universal way as a genome-wide probability profile [5]. Given the biological nature of the information (see Background), probability profiles thus de facto involve conditional probabilities: P(Xn = x|B) in case of a discrete-valued feature X or ρ(Xn = x|B)dx in case of a continuous-valued feature X. We shall use
In all that follows, we shall consider a discrete-valued feature X for the sake of simplicity, without restricting the generality. Considering a continuous-valued feature requires only replacing the sum ∑x ∈ χ by the integral ∫χ dx.

Eliminating spurious conditioning, detecting essential ones

Considering the set of all the conditions that can be controlled or at least identified during the experiment, each feature will depend on some of these conditions whereas it will be independent of others (cf. Background). We thus want to determine for each biological question and each feature the subset of factors actually conditioning its probability landscape, and hence its effective context C(X). If Ci does not add any information on X, it does not belong to the context C(X). Conversely, the proposed analysis allows features to be grouped in different subsets according to their circumstantial context. Finding the effective, thus minimal, context C(X) among the full conditionings of X ('minimax' entity) is a well-posed issue only in a hierarchical formulation: we have to investigate whether an additional condition C decreases the indeterminacy of X knowing B, and conversely whether data obtained under different conditions (B∧Cj)j can be grouped into a single condition B∧C where C is the union of conditions (Cj)j, or even into the single condition B if (Cj)j form a complete family, so that C in fact adds no additional prescription on B. This dual process can be iterated in both directions. The issue is thus to compare P(X|B) and P(X|B∧C) to see whether the additional prescription C on the experimental conditions adds constraints and information on X (knowing B) or not (Figure 2). The issue has a very concrete and in practice essential outcome: providing a criterion to assess whether it is relevant to pool the data, or conversely whether some additional condition requires the set of data to be partitioned into sub-groups for a relevant analysis.
Note that only explicitly controlled or described conditions, of which the experimentalist is aware, can be mentioned in the probabilities. A wealth of implicit conditions is also present, and one of the aims of this work is to develop a coherent way to bring forward the relevant ones. In confronting two probability landscapes P(X|B,1) and P(X|B,2) constructed from data recorded independently, one might guess that an additional condition C is present, that explains the discrepancy between the two landscapes, if any: Divergence of probability profilesAt each genome location n, the probabilities Note that it is meaningless to compare In the case that the feature probability profiles
Statistical significance testing

The Kullback-Leibler divergence thus provides a tool for calculating the difference of the individual conditional feature probability profiles where V(Pn, ε) is the ball of radius ε centered on the distribution Pn (distribution over the space χ); it is thus a neighborhood in a functional space, where the radius bounds the Kullback-Leibler divergence between an element and the center of the ball. We have recently investigated for a more general case how conjoint statistical significance testing for similarity and distinctness can be achieved on such a measure. Please refer to [7] for a more detailed description of the methodology. Briefly, any experimentally obtained signal (such as the fluorescence/chemiluminescence signal of a spot on a microarray) is interpreted as a random independent sample of some random variable, assumed normally distributed and with unknown average. The mean and variance estimates can be used to construct an unbiased maximum likelihood estimator, which is itself a random variable of Gaussian form. In order to formulate quantitative statements concerning the relative differences between different biological conditions, we introduce a cone Cα over the first diagonal of a signal estimate under two different biological conditions with half-angle α. The rationale for considering such cones rather than homogeneous error margins is to control the relative error. Using the so-called ratio distribution for independent normal distributions, we can then determine a likelihood of the mean estimates being within a distance smaller than Cα or not of the actual mean of the random variable. This distance measure is symmetric in the sense that we can estimate both similarity and distinctness. Moreover, the measure is also amenable to testing for statistical significance using serialized two-sided t-tests.
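The local comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of [7]: the function names, the `floor` guard against zero probabilities, the direction of the divergence (conditioned profile relative to the reference), and the toy numbers are all hypothetical, and the plain threshold `epsilon` only stands in for the statistical significance test.

```python
import math

def kl_divergence(p, q, floor=1e-12):
    """Kullback-Leibler divergence D(p || q) for two discrete distributions.

    `floor` (an assumption of this sketch) avoids division by zero when q
    assigns zero probability to a state that p allows.
    """
    return sum(pi * math.log(pi / max(qi, floor))
               for pi, qi in zip(p, q) if pi > 0.0)

def profile_divergence(conditioned, reference):
    """Divergence at each genome position n between two probability profiles."""
    return [kl_divergence(p, q) for p, q in zip(conditioned, reference)]

def in_ball(divergences, epsilon):
    """True if every local divergence stays within the ball of radius epsilon."""
    return all(d <= epsilon for d in divergences)

# Toy profiles over a two-valued feature at two genome positions.
reference = [[0.5, 0.5], [0.9, 0.1]]      # stands in for P(Xn | B)
conditioned = [[0.5, 0.5], [0.6, 0.4]]    # stands in for P(Xn | B and C)

local = profile_divergence(conditioned, reference)
print(local)
print(in_ball(local, 0.5), in_ball(local, 0.1))
```

With a generous radius the conditioned profile is treated as indistinguishable from the reference; with a tighter radius the second position alone is enough to keep the two profiles apart.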
By defining a single confidence interval on the above measure the decision on whether or not to collapse feature probability profiles then becomes straight-forward. Interestingly, the significance testing of distinctness and similarity, as we develop it in [7], takes into account the relative variance over the measure in case of massive-parallel data such as functional genomics experimental observations in form of the half-angle α of the cone Cα. In this case the quality, or better statistically perceived quality, of the measure on the observable under different biological conditions is directly taken into consideration when estimating the statistical significance of the Kullback-Leibler divergence. Extending the divergence analysis over the genomeSo far we have only discussed the context-dependency analysis locally; that is at any genome position n. As feature probability profiles extend over the entire genomic sequence of the organism under study, a generalization is required, which as shown below is straight-forward in our approach. Consider the case where a subset of feature probability profiles is known on biological grounds to reflect relevant measures on the biological and physical properties of a stretch I of the genome (e.g. the linear extension of a gene, possibly with gaps, such as transcriptome data, Figure 3). We compute for each n ∈ I a distance
Circumstantial and hierarchical complexity reductionAs discussed throughout this work, context-dependency of features is itself dependent on the biological question addressed. Given a biological question or context, any set of context-dependent conditions can be tested against a cumulative biological condition calculated as an average measure over the set of sub-conditions for its relative contribution to the overall information. This can be achieved in parallel for as many different (sub-)conditions as available. The relevance of any feature probability profile with respect to the biological question addressed is hereby and importantly solely defined through a statistical significance measure in the information theoretical divergence from the pooled information when considering larger and larger joint sets of conditions. This procedure can be hierarchically repeated (using a single confidence interval) to conditionally collapse individual profiles further and further (Figure 5). The schematic representation of different conditioned feature probability profiles, their inter-relationship, and the natural hierarchy of the different probability profiles with respect to a biological condition B are illustrated. Wherever the statistical significance of the distance measure exceeds a defined threshold the distance is considered insufficient to warrant the sub-condition being analyzed separately, and thus the corresponding profiles are collapsed. This procedure can be performed recursively. Consider for example the question of what the transcribed sequences in a given genome are (notably without any restriction of a particular biological condition). 
If one uses the many thousands of available microarray transcriptome studies, or in the near future, high throughput sequencing transcriptome data, which were all recorded under precise experimental and thus biological conditions, no significant context-dependency arises through the choice of the appropriate biological conditioning. Thus, all existing transcriptome data would successively be collapsed to give a single feature probability profile that could directly be seen as a probability of any nucleotide in the genome being transcribed (obviously only provided sufficiently divergent transcriptome data are available). Such an optimally conditioned profile could subsequently be used to search for correlations between the genomic sequence and the occurrences of all expressed sequences in order to search for sequence elements statistically significantly associated with transcribed sequences. While this example, as extreme as it is, might not seem appropriate, just consider that any level of acceptable divergence can be defined with respect to the biological question addressed, and that feature probability profiles can be regrouped into any number of not necessarily exclusive subsets the experimenter sees fit (Figure 6). Therefore, a continuum of nested profiles ranging from individual feature profiles to a totally collapsed landscape exists. This continuum needs to be explored for every biological question separately, which is why the complexity of the landscape cannot be reduced permanently. Essentially, for every new investigation of the structure, the feature probability landscape is at first totally uncompressed, and using the method described here, is then locally – with respect to the sub-conditions Ci – collapsed as a function of the biological conditions Bj. Different biological conditions B will lead to different combinations of Ci profiles being collapsed (Figure 6).
Genome probability landscapes are therefore a dynamic structure that can be locally collapsed as a function of the circumstantial context.
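The hierarchical, question-dependent collapse can be illustrated with a short Python sketch. It is only a sketch under assumptions: pooling is taken to be a position-wise average of the sub-condition profiles, a fixed threshold `epsilon` replaces the statistical significance test, and all function names and toy values are hypothetical.

```python
import math

def kl(p, q, floor=1e-12):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / max(qi, floor))
               for pi, qi in zip(p, q) if pi > 0.0)

def pool(profiles):
    """Average several profiles position by position (cumulative condition)."""
    count = len(profiles)
    return [[sum(values) / count for values in zip(*distributions)]
            for distributions in zip(*profiles)]

def collapse(profiles, epsilon):
    """Return [pooled] if every profile lies within epsilon of the pool at
    every position; otherwise return the profiles unchanged."""
    pooled = pool(profiles)
    for profile in profiles:
        if max(kl(p, q) for p, q in zip(profile, pooled)) > epsilon:
            return profiles
    return [pooled]

def restrict(profile, positions):
    """Keep only the genome positions relevant to the question addressed."""
    return [profile[n] for n in positions]

# Toy data: two replicates per condition, two positions, two feature values.
replicates_plus = [[[0.80, 0.20], [0.5, 0.5]], [[0.82, 0.18], [0.5, 0.5]]]
replicates_minus = [[[0.20, 0.80], [0.5, 0.5]], [[0.22, 0.78], [0.5, 0.5]]]

# First level: replicates within each condition collapse.
plus = collapse(replicates_plus, 0.01)
minus = collapse(replicates_minus, 0.01)

# Second level: the two conditions stay separate over the whole stretch...
whole = collapse([plus[0], minus[0]], 0.01)
# ...but collapse when the question only concerns position 1.
subset = collapse([restrict(plus[0], [1]), restrict(minus[0], [1])], 0.01)
print(len(plus), len(minus), len(whole), len(subset))
```

The same data therefore yield different degrees of collapse depending on which positions the question covers, which is why the reduction has to be redone for each new question rather than stored permanently.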
Circumstantial context illustrated with a theoretical example

To illustrate the applicability of the methodology developed here, let us consider a theoretical example: an analysis of different T-cell populations from a plausible human patient study, showing how context-dependency analysis is performed in a manner motivated by the biological question (Figure 7).
Let Px (x = 1, 2, 3) be a subject from whom a blood sample has been drawn. The peripheral blood mononuclear cell (PBMC) population has subsequently been separated by fluorescence activated cell sorting (FACS) and the two T-cell subpopulations CD4+CD25+, CD4+CD25- were enriched using the corresponding cell surface markers. Assume furthermore that the CD4+CD25+ (red) and CD4+CD25- (blue) cells, which are both involved for instance in the inflammatory response, have undergone brief exposure to an inflammation inducing agent such as an interleukin during ex vivo primary cell culture, before the cells were harvested and total RNA was extracted for transcriptome analysis using several technical replicates per subject (Figure 7A). Finally, assume that subject P3 carries an unknown genetic variant with limited but functional implication for the expression of some genes. For simplicity, consider the technical variability of the experiment to be sufficiently small to warrant the calculation of mean expression profiles for each T-cell subtype from each subject. Several biological questions might be addressed using such a dataset. The first set of questions could relate to the difference in the transcriptional responses of CD4+CD25+ and CD4+CD25- T-cells to stimulation using the interleukin (Figure 7B–D). Depending on the statistical significance of the Kullback-Leibler divergence between the different transcriptome probability profiles of the subjects in either the CD4+CD25+ or the CD4+CD25- cases (and therefore the heterogeneity between individuals), the probability profiles might either need to be considered separately (Figure 7B) or can be collapsed to a CD4+CD25+ and CD4+CD25- probability profile (Figure 7C). Note that any other combination of the data into subsets is theoretically possible as well. 
In the latter case (Figure 7C) one would conclude that the biological variability between subjects is sufficiently small with respect to the difference of the two cell-types to be neglected. Now assume that you restrict your analysis to genes targeted by the interferon gamma (IFNγ) pathway, which we shall consider equally active in both T-cell populations. In this case the Kullback-Leibler divergence calculated exclusively over the IFNγ target gene subset might indicate that the probability profiles of all six samples (across subjects and across cell types) can indeed be collapsed to give rise to a single profile (Figure 7D). Importantly, however, this total collapse of the data has only been calculated on, and is therefore only valid for, the IFNγ-regulated subset of genes. These two examples illustrate that feature probability profile complexity reduction depends on the biological phenomenon under study and the specific context. The example can be extended to the analysis of inter-subject variation (Figure 7E–H) independent of T-cell subpopulation. Again, the Kullback-Leibler divergence analysis will provide a statistically sound argument to either analyze the probability profiles individually (Figure 7E), collapse the two probability profiles available for each subject (Figure 7F), or combine all profiles into one (Figure 7G), or any combination thereof. Note that although the result of the operation shown in Figure 7D and 7G might appear to be identical, this is not the case as the statistical analysis leading to these similar results is based on distinct quantities: in the former case the similarity between gene expression responses between different cell types; in the latter the similarity between different individuals. Finally, assume that the genetic variation in subject P3 affects IFNγ signaling (which could be the case in some auto-immune disorders like allergy).
It is reasonable to believe that if you were to restrict your analysis to the IFNγ pathway as above (Figure 7D) you might find the analysis based on the Kullback-Leibler divergence to exceed the statistical significance threshold and hence to warrant separate analysis of the regrouped profiles from subjects P1 and P2 versus subject P3 (Figure 7H). Again, context-dependency and circumstantial context will require different analysis strategies. Circumstantial context analysis on actual transcriptome dataTo demonstrate practical applicability of our approach we present here an analysis of circumstantial context at a concrete example of transcriptome data. The dataset we used was recently generated in our laboratory and has been published [8]. All microarray experiments discussed hereafter are available from the GEO database using accession number GSE10795 (see also Methods). In [8] we present a transcriptome analysis of the apoptotic transcription program downstream of the delta splice-isoform of the TFIID associated factor TAF6δ in two human isogenic cell lines inactivated or not for the p53 gene. Briefly, we demonstrate that TAF6δ acts downstream and independently of p53 to control gene expression at the onset of apoptosis [8]. For the following demonstration we selected six experiments: GSM272658-60 (TAF6δ induction in the p53-/- background, hereafter referred to as biological condition B-, using three independent biological replicates referred to as C1-, C2-, and C3-), and GSM272664-6 (TAF6δ induction in the p53+/+ background, hereafter referred to as biological condition B+, using three independent biological replicates referred to as C1+, C2+, and C3+). The data were processed as described in the Methods section and in [5] in order to obtain probability profiles, and subsequently we calculated the Kullback-Leibler divergence at probe resolution for different contexts (Figures 8 &9). 
Note that certain simplifications were introduced into the calculation of the probability profiles. Those modifications are described and justified in the Methods section, and reflect the limited scope of the analysis presented here (focusing on the circumstantial context only), and the very limited amount of data used, sufficient for the demonstration but very far from fully exploiting the wider concept of probability landscapes. The corresponding data for the analysis discussed below are to be found as additional files SupDataFile01.txt, SupDataFile02.txt, and SupDataFile03.txt.
As shown in Figure 8A and 8B, we have first calculated the Kullback-Leibler divergence for the individual biological replicates versus a collapsed probability profile for the entire biological condition. As very few datasets were used, neither the calculation of statistical significance of individual divergences between the different biological replicates and the collapsed probability, nor the statistical significance of differences between the Kullback-Leibler divergence distributions was exploited, and we simply use the median of the divergences as well as its mean over a set of comparisons as comparative measures (Figure 8B). Having compared the individual biological replicates to the corresponding integrated probability profile of the biological condition, we also investigated the respective divergence distributions obtained when comparing the Ci+ of B+ to B- and vice versa the Ci- of B- to B+ (Figure 8C &8D). As can be easily appreciated, in all cases the divergence increases as would be expected for data from different biological conditions. The increase in the means for instance might appear relatively modest, but given the distribution of the Kullback-Leibler divergences (see for instance the histogram in Figure 9C), such differences are probably indeed significant. As mentioned above, a statistical analysis would, however, require a much larger dataset. We then decided to do two experiments in order to substantiate the claims made above using the theoretical example (Figure 7). First, we swapped the probabilities associated with 899 probes that we had previously found to detect statistically significant changes in gene expression when comparing the B+ (p53+/+) and B- (p53-/-) biological conditions [8]. In order to do so the probabilities calculated for the corresponding probes from Ci+ were assigned to the same probe in Ci- and vice versa (Figure 8E, indicated by the addition of "s" to the biological condition identifier). 
We thus exchanged 2.8% of the entire probability profile with its counterpart from the other biological condition. The corresponding divergence measures are shown in Figure 8F. As can be seen by comparison with the values in Figure 8B, we observe a modest increase of the Kullback-Leibler divergences, which, however, should – at least given the sample size – not be considered significant. Therefore, and unlike swapping the entire profiles (Figure 8C &8D), such a restricted modification of the profiles is not necessarily detectable (compare also our discussion of Figure 7G in the preceding section). If, however, one restricts the analysis of the Kullback-Leibler divergence to the 899 probes only (cf. our discussion of extending the analysis over the genome, Figures 3 &4), measurable differences between the normal and swapped situations again occur (Figure 9A &9B). These differences are the more striking if one compares the histograms over the entire divergence distribution for the first experiment (Figure 8E) and the second (Figure 9A) with their non-modified counterparts, as shown in Figure 9C and 9D. Whereas, in the first case where the 899 swapped probabilities have an almost undetectable effect on the median as well as the entire distribution (Figure 9D), the case where only the 899 probes are considered in isolation not only shows an increase in the median, but also a starkly modified overall distribution (Figure 9C). Note that both histograms are on a log scale and that the last bin encompasses all values greater than unity. Therefore, and as we had pointed out in our theoretical discussion of the properties of the Kullback-Leibler divergence and circumstantial context above, the biological question will condition the decision whether or not to collapse several profiles into one. 
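The logic of the swapping experiment can be mimicked on toy data. The following Python sketch is purely illustrative: the profiles are invented (100 hypothetical probes, of which 3 differ between conditions), not the GSE10795 data, and the helper names are assumptions. It shows only why a median taken over all probes can hide a swap that is obvious when the analysis is restricted to the swapped subset.

```python
import math
from statistics import median

def kl_divergence(p, q, floor=1e-12):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / max(qi, floor))
               for pi, qi in zip(p, q) if pi > 0.0)

def swap(profile_a, profile_b, indices):
    """Return copies of both profiles with the entries at `indices` exchanged."""
    a, b = list(profile_a), list(profile_b)
    for i in indices:
        a[i], b[i] = profile_b[i], profile_a[i]
    return a, b

# Invented toy profiles: most probes agree, a small subset responds differently.
n_probes = 100
responsive = [0, 1, 2]
plus = [[0.5, 0.5] for _ in range(n_probes)]
minus = [[0.5, 0.5] for _ in range(n_probes)]
for i in responsive:
    plus[i] = [0.9, 0.1]
    minus[i] = [0.1, 0.9]

plus_swapped, minus_swapped = swap(plus, minus, responsive)

# Divergence of the swapped profile from the unmodified one, probe by probe.
divergences = [kl_divergence(p, q) for p, q in zip(plus_swapped, plus)]
median_all = median(divergences)
median_subset = median([divergences[i] for i in responsive])
print(median_all, median_subset)
```

In this toy setting the median over all probes is unaffected by the swap, whereas the median over the responsive subset is large, so the decision to collapse depends on which set of probes the question refers to.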
Concretely, if one were exclusively interested in studying the p53 responsive genes in above dataset, as the latter swapping case demonstrates, a complexity reduction would not be advisable, whereas on the other hand, when studying the entire genomic response to the stimulus, the divergence over the swapped p53 target gene responses would not significantly affect the outcome of the analysis. This illustrates the applicability and feasibility of the methodology we develop here. DiscussionWe have introduced probability landscapes as a homogeneous and formally consistent representation of any type of functional genomics information in order to achieve a unique structure that can statistically be systematically interrogated using correlation measures [5]. To reduce unnecessary formal, mathematical and computational complexity we propose here to use the existing de facto context-dependency of features as a question-dependent measure for collapsing subsets of the landscapes. Consider the case where Ci refer to sub-conditions of the circumstantial context of the biological condition B in which the feature X has been recorded (Figure 1). We want to know whether it is necessary to consider them as distinct populations or whether it is meaningful to pool them. We pool the local measures Note that since we are comparing the distributions of the same random variable under different conditions, it is only the distance (or divergence) between the two distributions that is meaningful. A joint probability, such as mutual information, can not be envisioned. This also holds for the case of two different variables because the joint probability distribution is inaccessible. Eventually, one could envision considering mutual information in the context of the comparison of two probability distributions (rather than individual variables), thereby rejoining the concept of probabilities of probabilities we have previously developed [5]. However, this seems impractical in concrete terms. 
The methodology developed here represents a systematic and simple way of testing the statistical limits of complexity-reduction and hence the explanatory power of the integrative genomics data in their respective contexts (see for instance Figures 7 & 8). We note that our method represents an application of concepts related to context-trees to the probability landscape idea. Circumstantial context analysis and landscape collapse thereby operate in a manner similar to Markov chains with variable length for the analysis of time-series from t0 to -∞ (which can be considered the historic context) [9]. Markov chains and Hidden Markov Models (HMMs) have found widespread application in the analysis of genomic and gene sequences ([10] and references therein). In contrast to our approach, however, the probabilities assigned to individual nucleotides in HMMs reflect the linear sequence context ("horizontal" analysis of sequence statistics), whereas the probability landscape concept we advocate uses the nucleotide-based probabilities to integrate "vertically" sequence-dependent features such as activity. Both approaches share common ideas, such as the use of probabilities and a single-nucleotide resolution, but they differ significantly in their scope and methodology. HMMs are, for instance, quite constrained in that they require sequentiality (making them particularly interesting in the study of sequences) and are restricted in the number of sequential objects/variables under study. It does not seem feasible to develop HMM approaches for entire genomes. Probability landscapes, in contrast, neither require sequential organization per se, nor are they limited in the number of objects under study, as they can be decomposed. The complexity-reduction procedure for probability landscapes developed here can also be seen as an illustration of both of these features.
It is therefore clear that HMMs and probability landscapes are independent though complementary concepts that should acquire synergistic roles in genome analysis. We also note that the Kullback-Leibler divergence calculation provides measures that can be used directly for clustering of probability profiles. Clustering of probability profiles might help to establish and analyze relatedness among data otherwise not compared directly.
Conclusion
Feature context-dependency can be systematically investigated using probability landscapes. Furthermore, different, independent feature probability profiles can be collapsed as a function of circumstantial context to reduce the computational and formal complexity of probability landscapes. Interestingly, the possibility of complexity reduction can be linked directly to the analysis of context-dependency. Moreover, as the criteria for circumstantial complexity reduction are statistically controlled, an optimal probability landscape is created in a manner that depends on the biological question. These two advances in our understanding of the properties of probability landscapes not only simplify subsequent cross-correlation analysis in hypothesis-driven model building and testing, but also provide additional insights into the biological gene regulatory problems studied. The nature of individual features can be probed with respect to posed problems, and a classification of features according to their respective contexts can be achieved. Therefore, increased algorithmic efficiency and statistical power result jointly with heightened understanding of the biological mechanisms. Other features of circumstantial context, and of probability landscapes in general, still remain to be fully exploited.
Methods
Constructing