Graphically Locating Coding Regions
By Daniel Lofaro
Introduction:
The purpose of this paper is to give a graphical method of locating the coding region in a nucleotide sequence.
In genomic signal processing there is always a search for correlations between different attributes of the genome. This paper will focus on the relationship between the maximum power spectrum peaks, over a given window, of a genome and the correspond CG bias. This has been chosen to be analyzed because of the interesting relationship that has been vied when using various different analysis methods.
Background:
In order to analyze the genome we need to represent the sequences composed of A, T, C, and G to a numerical representation. For the methods used in this section we will be using what is known as binary mapping.
Binary Mapping:
Binary mapping is a way of mapping the nucleotide is to map the sequence in to a binary string. In reality it is more of a binary array of size 4 by N where N is the number of nucleotides in the sequence. The vertical size is 4 because there are four nucleotides to choose from, A, T, C, and G. The way the mapping works is that a one is placed in the column that you are looking at in the in the row corresponding to the nucleotide in question. Please see Table 1 below for an example of converting the nucleotide string ATCGATGCCATG in to the binary array.

A 
T 
C 
G 
A 
T 
G 
C 
C 
A 
T 
G 
Adenosine 
1 
0 
0 
0 
1 
0 
0 
0 
0 
1 
0 
0 
Thymine 
0 
1 
0 
0 
0 
1 
0 
0 
0 
0 
1 
0 
Guanine 
0 
0 
0 
1 
0 
0 
1 
0 
0 
0 
0 
1 
Cytosine 
0 
0 
1 
0 
0 
0 
0 
1 
1 
0 
0 
0 
Table 1: Binary Mapping Example
Important Characteristics:
Two important traits of the genome is the peak at 1/3 on the power spectrum graph for nucleotides when inside of a coding region. The second important trait is the CG bias when you are in a coding region.
Power Spectrum Peak at 1/3:
After the sequence has been transformed into the binary indicator form we are able to analyze it using an FFT. By analyzing the system using an FFT we will be able to obtain the frequency power information. It has been well documented that in coding regions it is much more likely for there to be spectral peaks at a frequency of 1/3 nuclitide^{1}.
The FFT is taken of each of the strings in order to see the frequency content. It is important to note that the frequency content is not looked at directly. To amplify the repetition of the signal the power spectrum is looked at instead of just the spectrum. The power of the signal is taken by summing the square of the magnitude of the FFT of the signal for each of the binary sequences A, T, C, and G [3]. It is important to note that by squaring the magnitudes of the FFT the spectral peaks and repetition are accentuated. See Equation 1 below for taking the FFT of each nucleotide binary sequence.
_{}
Equation 1: FFT of each nucleotide binary sequence [3]
The power spectrum of each of the elements in Equation 1 above are then taken and gives us the form seen in Equation 2 below.
_{}
Equation 2: Power spectrum of the nucleotide sequence [3]
Figure 1 below is an example of the power spectrum of a coding region. It is important to note the two prominent spectral peaks at 1/3 nuclitides^{1}.
Figure 1: Power Spectrum of a Coding Region (Note the Peaks)
Figure 2 below is the power spectrum plot of a NonCoding region. Please note that it has no definitive spectral marker like what can be found in Figure 1 above.
Figure 2: Power Spectrum of a NonCoding Region
CG Bias:
When a CG bias occurs, either high or low CG content, you have a higher probability of being in a coding region.
A CG bias is just the percent of Cs and Gs you have in a given sequence. For example, if you have a nucleotide sequence of length Q where there were a total of N Cs and M Gs the CG bias would simply be (N+M)/Q or (N+M)/Q x100% if you want it in percent. For a given sequence section, if the CG bias is not around 50% then that probably means that you are in a coding region.
Figure 3: CG Percentage Vs. Index location of L48217
Figure 3 above shows how the CG percentage changes as you move along a sequence with a given read window size. It is important to note that the dip from the index value of 600 to 1200 indicates that in that area there is a coding region.
Extended Work:
In an attempt to gain further knowledge in the relationship between the different characteristics of a nucleotide sequence a set of tools was created that would allow for analyzing the sequence in a more graphical manor.
The first step was to create a set of Matlab functions that would find the CG percent and the maximum spectral peaks of the power of the given nucleotide sequence. The two latter functions are described below.
getCGs(seq, wL, jL): where seq is the NT sequence, wL is the window length and jL is the jump length. This function will take a strand and return the CG percent for that strand with a given window length. The CG percent is returned in a value of 01 in a vector of length “length(seq)wL.” Now I have a vector whose indices will match up with my other functions so I can plot them against each other in the future. Please see Code 1 below for the code for the function getCGs().
Code 1: Matlab code for function getCGs()
getCG(seq): where seq is the NT sequence. This function is only sent a nucleotide sequence and will return the CG percent in the value between 0 and 1. Please see Code 2 below for the code for the function getCG().
Code 2: Matlab code for function getCG()
getSpec(seq, wL, jL, cut): where seq is the NT sequence, wL is the window length, jL is the jump length, and cut is the amount to cut of the beginning and the end of the spectrum. It will return a vector of length “length(seq)wL” containing the maximum of the power spectrum calculations. Because it will always take the power spectrum over the same window length it will not have to be normalized. It is important to note that while we are going along the sequence we are not taking the maximum height of the power spectrum as a whole NOT just at 1/3 nucleotide^{1}. Please see Code 3 below for the code for the function getSpec().
Code 3: Matlab code for function getSpec()
Using the functions created above I was then able to create a set of uniform length vectors that I could the plot against each other. One of the important thing about the functions above is that you sent a given read window length and a given jump length for each. A read window length is the number of nucleotides that you will take in to account as your sequence that you are going to do the processing on for the given iteration. The jump length is how many nucleotides you will move towards the 3’ after each iteration. If you do not want any overlap and not have any missed nucleotides then make your jump length the same as your read window size. If you want to make sure you have every set of data you will want to sweep with a small jump length, preferably 1. Please note that both read window size and jump length are both integer values.
Now that the functions above are created the uniform length spectrum and CG content vectors can be plotted against each other. Code 4 below is the code for making the plots of the spectral peaks vs. index value, CG percent vs. index value and most importantly the CG Percent vs. Spectral Peaks.
The first sequence that was plotted using Code 4 below was that of the NCBI sequence L48217.gb in GenBank format. This strand was chosen because it is known to have a coding region in the middle of it, see Figure 3 above. Figure 4 below is the Spectral peaks vs. index value for the given sequence. As expected there are higher peaks between 600 and 1200, right when there is a dip in the CG percentage.
Figure 4: Spectral Peaks vs. index
The interesting part comes when a plot is made of the CG Percentage vs. Spectral Peaks is plotted with CG Percentage being the horizontal axis. Figure 5 below shows the latter described graph.
Figure 5: CG Percentage vs. Spectral Peaks
You will note that there is a definite value that the plots do not cross. This boundary seems to be a second order equation as seen in Equation 3 below. This boundary is known as the Lofaro Boundary. It is important to note that the parabolic boundary is independent of read window size when varied between 50 and 500 and as well as jump size. The latter statement will be shown to be true later on in the paper.
_{}
Equation 3: Lofaro Boundary
Code 4: Matlab code for generating the Spectrum, CG Percent, and Spectrum Vs. CG Percent Plots
Changing the Reading Window and Jump Length:
The four figures below were via the use of Code 4 above with different window sizes as well as different window jump lengths. The data used for the below plots is from the first 10,000 nucleotides of the NCBI NT_011387.8.
The four figures below are plots of the max power spectrum height, vertical axis, versus the CG percent, horizontal axis. The values for each dot is received by the max power spectrum height and the CG percentage for the given window.
Things to note:
The straight lines that you see in the four figures below, they will be referenced to as bins, are always 1.96% apart. This means that there is approximately 51, 100%/1.96%, separate bins from 0% to 100% CG.
The parabolic shape that this creates, referenced to as the Lofaro Boundary, is less of an approximation of where the Max Power Spectrum Height vs. CG Percent will lie, but rather a boundary where, out of 10,000 nucleotides from a real strand, not one fell outside the boundary.
Figure 6: Spectrum maximum power vs CG percentage, window length of 75 and slide length of 10
Please note that for Figure 6 the separation between each line is 0.0196%
Figure 7: Spectrum maximum power vs CG percentage, window length of 75 and slide length of 1
Please note that for Figure 7 the separation between each line is 0.0196%
Figure 8: Spectrum maximum power vs CG percentage, window length of 50 and slide length of 10
Please note that for Figure 8 the separation between each line is 0.0196%
Figure 9: Spectrum maximum power vs CG percentage, window length of 50 and slide length of 50
Please note that for Figure 9 the separation between each line is 0.0196%
Practical Use:
It is important to remember that this information is meant to be used by biologists in an attempt to find coding regions graphically. The way that this can help with that is by looking at the locations where there is a high, or low, CG content as well as a high spectral peak. Figure 10 below shows the areas that are attributed with coding regions and those that are not.
Figure 10: Coding Region Definitions
Example:
Take the whole sequence L48217 as used above. You will note that from Figure 11 below, the plot of the Spectrum Peak Vs. CG Percent for the whole sequence of L48217, that there is places with high spectral peaks as well as a CG bias. Thus we can conclude that there is a coding region in this sequence.
Figure 11: Spectrum Peak Vs. CG Percent for L48217 (Whole)
Next the region will be cut in to three pieces. The sequence is approximately 1600 long so the pieces will be cut into the following 1 to 525, 526 to 1050, and 1051 to 1575. Please see the plots below for the Spectral Peak Vs. CG Percent for each of the latter ranges.
Figure 12: Peak Vs. CG Percent for L48217 (Index:1525)
Figure 13: Peak Vs. CG Percent for L48217 (Index:5261050)
Figure 14: Peak Vs. CG Percent for L48217 (Index:10511575)
From viewing the latter three graphs it is easy to discern that Figure 13 (index: 5261050) is highly likely to contain a coding region when compared to the coding region definitions in Figure 10. This sequence is a known sequence and thus can be checked to see if the analysis is correct. By checking the GenBank files for the coding region indices we find that the longest coding region is found between 495 and 1346.
Animated Graphs:
To make viewing the coding region easier a tool has been created that will take a sequence and cut it down in to smaller parts and it will make the power spectrum peaks vs. CG content graphs for each of those chunks. The system will have a read window system like the functions above. The output graph will then be saved as a image file and can be viewed in rapid succession, like a movie, to see how the properties of the sequence change in real life. The two sets of code below are the tools used to create the latter. Screen shots of the demo movie, using L48217 as the sequence being read, can be seen.
Code 5: Code used to create images for animated graph
Code 6: Function num2chat(): converts a decimal number to the ascii decimal
Figure 15: Screen Shots of Movie created from images from Code 5
Conclusion:
This system has been shown to work as a graphical way of
determining the general location of a coding region. This tool, Code 5, can be
made more useful by making it take a whole chromosome and cut it up in to
smaller pieces, which will be then cut up into even smaller pieces and make
many of these plots out of it. Because using this method all of the data
points will stay near the Lofaro Boundary the movie that can be created using
each plot in sequence will show the points moving up and down the Lofaro
Boundary, see Figure 15. When it goes up the boundary you know that it is a
coding region so it can be zoomed in on and studied further. Thus this is a potentially
useful tool for biologists to locate coding regions.
Work Cited:
[1] Anastassiou, Dimitris: Genomic Signal Processing, IEEE Signal Processing Magazine. 2001.
[2] Anastassiou, Dimiris: DSP in Genomics, Processing and FrequencyDomain Analysis of Character Strings. IEEE 0780370414. 2001
[3] Lyshevski, Sergey E; Krueger, Frank E; Theodorou Elias: Nanoengineering Bioinformatics: Nanotechnology Paradigm and its Applications. IEEE 0780379764. 2003.
[4] Sussillo, David; Kunkaje, Anshul; Anastassiou, Dimitris: Spectrogram Analysis: EURASIP Journal on Applied Signal Processing 2004:1,2942
[5] Xiao, Xuan; Shao Shihuang; Ding, Yongsheng; Chen, Xiaojing: Digital codign for Amino Acid Based on Cellular Automata: IEEE International Conference on Systems, Man and Cybernetics. 2004