Abstract
Background
The identification of transcription factor binding sites (TFBSs) and cisregulatory modules (CRMs) is a crucial step in studying gene expression, but the computational method attempting to distinguish CRMs from NCNRs still remains a challenging problem due to the limited knowledge of specific interactions involved.
Methods
The statistical properties of cisregulatory modules (CRMs) are explored by estimating the similarword set distribution with overrepresentation (Zscore). It is observed that CRMs tend to have a thintail Zscore distribution. A new statistical thintail test with two thinness coefficients is proposed to distinguish CRMs from noncoding nonregulatory regions (NCNRs).
Results
As compared with the existing fluffytail test, the first thinness coefficient is designed to reduce computational time, making the novel thintail test very suitable for long sequences and large database analysis in the postgenome time and the second one to improve the separation accuracy between CRMs and NCNRs. These two thinness coefficients may serve as valuable filtering indexes to predict CRMs experimentally.
Conclusions
The novel thintail test provides an efficient and effective means for distinguishing CRMs from NCNRs based on the specific statistical properties of CRMs and can guide future experiments aimed at finding new CRMs in the postgenome time.
Keywords:
Statistical approach; Transcription factor binding sites (TFBSs); Cisregulatory modules (CRMs)Background
The identification of transcription factor binding sites (TFBSs) and cisregulatory modules (CRMs) is a crucial step in studying gene expression. The computational methods of predicting CRMs from noncoding nonregulatory regions (NCNRs) can be classified into three types: 1) TFBSbased methods, 2) homologybased methods and 3) contentbased methods. TFBSbased methods, such as ClusterBuster [1] and MCAST [2], use information about known TFBSs to identify potential CRMs. The methods of this type are limited to the recognition of similarly regulated CRMs, and generally unable to be applied to genes for which TFBSs have not yet been studied experimentally. Homologybased methods use information contained in the pattern of conservation among related sequences. The related sequences can come from single species [3], two species [4] and multiple species [5]. The methods of this type using the pattern of conservation alone are limited in their performance because TFBS conservation necessary to maintain regulatory function in binding sequences may not be significantly higher than in nonbinding sequences [6,7]. In addition, it still remains an open question that how many genomes are sufficient to the reliable extraction of regulatory regions. Contentbased methods assume that different genome regions (CRMs and NCNRs) have the different rates of evolutionary micro changes; therefore, they exhibit different statistical properties in nucleotide composition. TFBSs often occur together in clusters as CRMs. The binding site cluster causes a biased word distribution within CRMs, and this bias leaves a distinct “signature” in nucleotide composition. Contentbased methods detect this signature by statistical [8,9] or machinelearning [10,11] techniques in order to distinguish CRMs from nonCRMs. The methods of this type may be used to predict the CRMs which have not yet been observed experimentally, but the poor performance on noncoding sequences limits their applications [12]. A large number of CRM search tools have been reported in the literature, but the computational method attempting to distinguish CRMs from NCNRs still remains a challenging problem due to the limited knowledge of specific interactions involved [13].
The fluffytail test [9] is one of contentbased methods. It is a bootstrapping procedure to recognize statistically significant abundant similarwords in CRMs. There are two problems with the fluffytail test: 1) Due to its bootstrapping procedure, the computational time of calculating the fluffiness coefficient is determined by the number of realization. In order to get reliable results statistically, the number of realization is usually set as very large in the fluffytail test, so the computational time is expensive, especially for long sequences. This limits the use of the fluffytail test under the situation when more and more DNA sequences need to be analyzed in the postgenome time. 2) The separation performance between CRMs and NCNRs is far from satisfactory [12]. The reason of poor performance is that both CRMs and NCNRs contain repetitive elements such as poly(N) tracts (… TTT…) or long simple repeats (…CACACA…). These strings are less interesting than the overrepresented strings with more balanced AT/GC ratio. It is an interest to address these two issues of the fluffytail test and to develop a more efficient and effective CRM prediction method.
In this paper, the statistical properties of CRMs are explored by evaluating the overrepresentation value of similarword sets (motifs). Zscore is used as the measure of overrepresentation of similarword sets. Then, Zscore distribution is estimated to distinguish CRMs from NCNRs.
Methods
Training datasets
To estimate the statistical properties of distinguishing CRMs from NCNRs, two (positive and negative) training datasets are employed in this paper. The positive training dataset is a collection of 60 experimentallyverified functional Drosophila melanogaster regulatory regions [14]. The positive training dataset consists of CRMs located far from gene coding sequences and transcription start sites. It contains many binding sites and site clusters, including abdominalb, bicoid, caudal, deformed, distalless, engrailed, evenskipped, fushi tarazu, giant, hairy, huckebein, hunchback, knirps, krüppel, oddpaired, pleiohomeotic, runt, tailless, tramtrack, twist, wingless and zeste. The total size of the positive training dataset comprises about 99 kilobase (kb) sequences. The negative training dataset is 60 randomlypicked Drosophila melanogaster NCNRs: The NCNRs of length 1 kb upstream and downstream of genes are excluded by using the Ensembl genome browser. The negative training dataset contains 90 kb sequences in total.
Formulation of the thintail test
The thintail test is based on the assumption that each word (binding site) recognized by a given transcription factor belongs to its own family of similarword sets (binding site motifs) found in the same enhancer sequence and the redundancy of the binding sites within CRMs leaves distinct “signatures” in similarword set distribution. For a given mletter segment W_{m} as a seedword, all mletter words that differ from W_{m} by no more than j substitution comprise a corresponding similarword set Nj(W_{m}). Because the core of TFBSs is relatively short [15], a 5letter seedword is considered, allowing for 1 mismatch, i.e., m = 5 and j = 1. In order to distinguish CRMs from NCNRs, the thintail test is adopted to study the Zscore distribution shape and to predict the probable function of the original input sequence. The test features special statistics accounting for word overlaps in the same DNA strand. A flow chart of the thintail test is shown in Figure 1.
Figure 1. A flow chart of thintail test.
Step 1: Search for all different seedwords (Wm)
The input sequence is scanned to find all the different mletter words, allowing overlaps. As an example, consider a stretch of DNA: ACGACGCCGACT. For m = 5, all 5letter segments W_{5} are selected as seedwords, i.e., ACGAC, CGACG, …, CGACT.
Step 2: Number of similarwords with the same seedword (n)
For each seedword Wm, all mletter words with no more than j substitution comprise a corresponding similarword set Nj(Wm). In this example, the first seedword W_{5}, ACGAC, has 3 similarwords with no more than 1 mismatch: ACGAC, ACGCC, CCGAC. n is the cardinality, n = N_{j}(W_{m}) = N_{1}(ACGAC) = 3.
Step 3: Zscore with the same seedword (Z)
A similarword set that occurs significantly more often than chance expectation is said to be overrepresentation. A reasonable overrepresentation measure would reflect whether the actual occurrence number of similarword set is significantly greater than the number counted in a random sequence with the same composition of input sequence. For any seedword W_{m}, a statistical overrepresentation measure Zscore can be defined by
where E(W_{m}) and V(W_{m}) are, respectively, the occurrence expectation and variance of similarword set N_{j}(W_{m}), these being calculated for a random sequence with the same composition of input sequence [16]. In a random Bernoulli type sequence, both occurrence expectation and variance can be derived analytically by using a generating function technique [17]. The Zscore with more overlaps is smaller than one with less overlaps. For example, the Zscore corresponding to simple repeat strings, TTTTT or AAAAA, is smaller than one corresponding to the seedword with more balanced composition. Z (Zscore) forms X axis in Figures 2, 3, 4, 5.
Figure 2. Histogram of CRMs (m = 5, j = 1, k = 0.3).
Figure 3. Histogram of CRMs (m = 5, j = 1, k = 0.14) after random shuffle.
Figure 4. Histogram of NCNRs (m = 5, j = 1, k = 0.54).
Figure 5. Histogram of NCNRs (m = 5, j = 1, k = 0.15) after random shuffle.
Step 4: Number of seedwords with the same Zscore (f)
f(Z) is the number of the seedwords with Zscore and forms Y axis in Figures 2, 3, 4, 5, 6, 7.
Figure 6. Histograms for CRMs and NCNRs classified by E (m = 5, j = 1).
Figure 7. Histograms for CRMs and NCNRs classified by T_{50}(m = 5, j = 1).
Step 5: Kurtosis (k)
The kurtosis k of Zscore distribution f(Z) is evaluated as
where i is the ith seedword, M is the total number of seedwords, μ and σ are the mean and standard deviation of Zscore distribution f(Z) respectively.
Step 6: Two thinness coefficients (E and T_{r})
The first thinness coefficient E is defined as:
Here k_{0} denotes the kurtosis k of the original input sequence without random shuffle and ε is the standard error calculated by:
E is used to measure how strongly Zscore distribution deviates from the normal distribution. The 95% confidence interval is set between 2ε and 2ε.
A sequence is called “random” if it is obtained by randomly shuffling the original input sequence r times, preserving its single nucleotide composition. To measure how strongly the Zscore distribution deviates from randomness, the second thinness coefficient T_{r} is computed by comparing with all rtimes randomlyshuffled sequence versions of the original input sequence:
Here T_{r} can be regarded as measuring the degree of the difference between signal and noise, where the signal is regarded as the original input sequence, and the noise is regarded as randomized sequences.
In the fluffytail test [9], the fluffiness coefficient F_{r} is defined as:
where L_{r} is the number of the seedwords with the maximal similarwords for the rtimes randomlyshuffled sequences and s_{r} is the standard deviation of the similarword set distribution between the number g(n) of seedwords and the number n of similarwords. Here it is worth to mention to this end that both CRMs and NCNRs contain repetitive elements such as poly(N) tracts (… TTT…) or long simple repeats (…CACACA…), which are less interesting than the overrepresented strings with more balanced AT/GC ratio. Since Zscore measures the overrepresentation of similarword sets, the second thinness coefficient T_{r} based on Zscore distribution should be a more reasonable index than the fluffiness coefficient F_{r} based on similarword set distribution in order to distinguish CRMs from NCNRs.
Results
Distribution for CRMs
Figure 2 shows the Zscore distribution for all Drosophila melanogaster CRMs in the positive training dataset. It can be seen that some similarword sets have extreme positive/negative Zscore (Z > 3 or Z < 3). This means that some similarword sets are overrepresented or underrepresented.
To obtain a random distribution, the original sequence is randomly shuffled r = 50 times. Figure 3 shows a typical example of Zscore distribution after random shuffle. As compared with the original input sequence in Figure 2, the randomized sequence in Figure 3 lacks the overrepresented/underrepresented similarword set (i.e. similarword set with extreme Zscore, Z > 3 or Z < 3).
Distribution for NCNRs
Figure 4 shows the Zscore distribution for all randomlypicked Drosophila melanogaster NCNRs in the negative training dataset. The presence of short right tail is noted in Figure 4. Figure 5 shows a typical example of Zscore distribution after random shuffle. The distribution for the original input sequence notably differs from that for the randomized version. The difference degree of the distribution between the original and randomlyshuffled sequences for NCNRs is greater than that for CRMs.
Thintail test
In order to distinguish CRMs from NCNRs, E and T_{r} are calculated for 120 sequences in these two training datasets. Figure 6 shows that CRMs tend to have a smaller E than NCNRs. Table 1(a) lists functional classification based on E. Nearly 71.7% CRMs has E < 0.6, while only 41.7% NCNRs has E < 0.6. Figure 7 shows T_{50} for CRMs and NCNRs. For each sequence, its 50times randomlyshuffled versions are generated to calculate T_{50.} It can be seen that CRMs tend to have a smaller T_{50} than NCNRs. Table 1(b) lists functional classification based on T_{50}. Nearly 73.3% CRMs has T_{50} < 0, while only 40% NCNRs has T_{50} < 0.
Table 1. Classification of 120 sequences
Discussion
Some statistical properties of Zscore distribution in these two training datasets have been explored. Results show that CRMs have a thintail distribution, i.e., tend to have low thinness coefficients (E < 0.6, T_{r} < 0), while NCNRs lack a thintail distribution, i.e., tend to have high fatness coefficients. Thus, E and T_{r} can be used to distinguish CRMs from NCNRs effectively. CRMs are predominant if (E < 0.6, T_{r} < 0), while NCNRs are prevailing if (E > 0.6, T_{r} > 0). Thus, the regions with (E < 0.6, T_{r} < 0) are CRMs and those with (E > 0.6, T_{r} > 0) are NCNRs.
Comparison with fluffytail test
The thintail test is evaluated by comparison with the fluffytail test [9]. The performance of three parameters is assessed: 1) the first thinness coefficient E, 2) the second thinness coefficient T_{r} and 3) the fluffiness coefficient F_{r} based on the separation between CRMs and NCNRs.
These two training datasets are employed to evaluate the above three parameters. For comparison, the original input sequence is randomly shuffled 50 times to calculate T_{50} and F_{50}. The thresholds of E, T_{50} and F_{50} are set as 0.6, 0 and 2 respectively. For the thintail test, the original input DNA sequence with E < 0.6 and T_{50} < 0 is considered as predicted CRMs. For the fluffytail test, the original input DNA sequence with F_{50} > 2 is considered as predicted CRMs. The classification result of 120 sequences in these two training datasets by F_{50} is listed in Table 1(c). The fluffytail test F_{50} only identified 29 out of 60 NCNRs in the negative training dataset; while the thintail test identified 35 and 36 NCNRs based on E and T_{50} respectively (see Table 1). For each parameter, sensitivity (SN) (number of true positive/number of positive), specificity (SP) (number of true negative/number of negative) and accuracy (number of true positive + number of true negative)/(number of positive + number of negative) are calculated to distinguish CRMs from NCNRs (Table 2).
Table 2. Evaluation of E, T_{50 }and F_{50}
The thintail test with T_{50} has the best accuracy (66.7%), as compared with the other two parameters (E: 65%; F_{50}: 65%). Thus, the thintail test with T_{50} can effectively distinguish CRMs from NCNRs. Moreover, the thintail test (SP = 60% for T_{50} and SP = 58.3% for E) can more efficiently identify NCNRs than the fluffytail test (SP = 48.3% for F_{50}). The thintail test with E has the same accuracy as the fluffytail test. However, the computational time (CPU time) of calculating E for an original input DNA sequence length of 1000 is 50 times faster than that of calculating T_{50} and 6 times faster than that of calculating F_{50} for the same original input sequence due to no sequence shuffle. Thus, the thintail test with E is very suitable for long sequences and large database.
Time complexity
The second thinness coefficient T_{r} is gotten by bootstrapping procedure, the value is affected by the number of realization r. In order to get the more reliable estimation of T_{r}, a large r is needed, so that high computational time is expected. For the reliable result within reasonable computational time, the original input sequence is randomly shuffled 50 times to calculate T_{r}.
In Table 2(c), the computational time (CPU time) of calculating E for an original input DNA sequence length of 1000 is 50 times faster than that of calculating T_{50} and 6 times faster than that of calculating F_{50} for the same original input sequence due to no sequence shuffle. All computations are run on a 3.2 GHz Pentium IV processor with 1 G physical memory.
Large CRM datasets
The thintail algorithm has been tested on the current version 3 of the REDfly database [18], which contains 894 experimentallyverified CRMs from Drosophila. Results show that 72.5% CRMs has E < 0.6 and 70.8% CRMs has T_{50} < 0 passing the thintail test. It is worth to mention to the point that the fluffytail algorithm has never been tested on the large CRM datasets.
Conclusions
In the thintail test, the statistical properties of CRMs are investigated by examining Zscore distribution pattern. The special statistical method used for calculating Zscore can reduce the effect of poly N and other simple strings on the distribution pattern of similarword sets. Results show that the Zscore distribution of CRMs tends to be a thintail distribution as compared with that of NCNRs. Based on this observation, two thinness coefficients E and T_{r} are introduced here. By using E and T_{r}, the thintail test has the better separation accuracy of distinguishing CRMs from NCNRs than the fluffytail test [9]. Especially by using the first thinness coefficient E, the computational time is significantly decreased, in view of a bootstrapping procedure to be required for calculating T_{r} and F_{r}. For the example as r = 50, the thintail test with E is 50 times faster than the thintail test with T_{50}, and is 6 times faster than the fluffytail test with F_{50}. Thus, the novel thintail test greatly simplifies the function prediction of an original input DNA sequence and can guide future experiments aimed at finding new CRMs in the postgenome time [1923].
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors read and approved the final manuscript.
References

Frith MC, Li MC, Weng ZP: ClusterBuster: Finding dense clusters of motifs in DNA sequences.
Nucleic Acids Res 2003, 31(13):36663668. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Bailey TL, Noble WS: Searching for statistically significant regulatory modules.
Bioinformatics 2003, 19(2):II16II25. PubMed Abstract  Publisher Full Text

van Helden J, André B, ColladoVides J: Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies.
J Mol Biol 1998, 281(5):827842. PubMed Abstract  Publisher Full Text

Grad YH, Roth FP, Halfon MS, Church GM: Prediction of similarly acting cisregulatory modules by subsequence profiling and comparative genomics in Drosophila melanogaster and D. pseudoobscura.
Bioinformatics 2004, 20(16):27382750. PubMed Abstract  Publisher Full Text

Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM: Phylogenetic shadowing of primate sequences to find functional regions of the human genome.
Science 2003, 299(5611):13911394. PubMed Abstract  Publisher Full Text

Emberly E, Rajewsky N, Siggia ED: Conservation of regulatory elements between two species of Drosophila.

Li L, Zhu Q, He X, Sinha S, Halfon MS: Largescale analysis of transcriptional cisregulatory modules reveals both common features and distinct subclasses.
Genome Biol 2007, 8(6):R101. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Nazina AG, Papatsenko DA: Statistical extraction of Drosophila cisregulatory modules using exhaustive assessment of local word frequency.

Abnizova I, te Boekhorst R, Walter K, Gilks WR: Some statistical properties of regulatory DNA sequences, and their use in predicting regulatory regions in the Drosophila genome: The fluffytail test.

Chan BY, Kibler D: Using hexamers to predict cisregulatory motifs in Drosophila.

Kantorovitz MR, Kazemian M, Kinston S, MirandaSaavedra D, Zhu Q, Robinson GE, Göttgens B, Halfon MS, Sinha S: Motifblind, genomewide discovery of cisregulatory modules in Drosophila and mouse.
Dev Cell 2009, 17(4):568579. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shu JJ, Li Y: A statistical fattail test of predicting regulatory regions in the Drosophila genome.
Comput Biol Med 2012, 42(9):935941. PubMed Abstract  Publisher Full Text

Su J, Teichmann SA, Down TA: Assessing computational methods of cisregulatory module prediction.
PLoS Comput Biol 2010, 6(12):1001020. Publisher Full Text

Papatsenko DA, Makeev VJ, Lifanov AP, Régnier M, Nazina AG, Desplan C: Extraction of functional binding sites from unique regulatory regions: The Drosophila early developmental enhancers.
Genome Res 2002, 12(3):470481. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Wingender E, Chen X, Fricke E, Geffers R, Hehl R, Liebich I, Krull M, Matys V, Michael H, Ohnhäuser R, Prüβ M, Schacherer F, Thiele S, Urbach S: The TRANSFAC system on gene expression regulation.
Nucleic Acids Res 2001, 29(1):281283. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Leung MY, Marsh GM, Speed TP: Over and underrepresentation of short DNA words in herpesvirus genomes.
J Comput Biol 1996, 3(3):345360. PubMed Abstract  Publisher Full Text

Régnier M: A unified approach to word occurrence probabilities.

Gallo SM, Gerrard DT, Miner D, Simich M, Des Soye B, Bergman CM, Halfon MS: REDfly v3.0: Toward a comprehensive database of transcriptional regulatory elements in Drosophila.
Nucleic Acids Res 2011, 39(1):D118D123. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Shu JJ, Ouw LS: Pairwise alignment of the DNA sequence using hypercomplex number representation.
Bull Math Biol 2004, 66(5):14231438. PubMed Abstract  Publisher Full Text

Shu JJ, Li Y: Hypercomplex crosscorrelation of DNA sequences.
J Biol Syst 2010, 18(4):711725. Publisher Full Text

Shu JJ, Wang QW, Yong KY: DNAbased computing of strategic assignment problems.
Phys Rev Lett 2011, 106(18):188702. PubMed Abstract  Publisher Full Text

Shu JJ, Yong KY, Chan WK: An improved scoring matrix for multiple sequence alignment.

Shu JJ, Yong KY: Identifying DNA motifs based on match and mismatch alignment information.