Using GRIL to determine the rearrangement structure of five genomes
To illustrate how GRIL can be used, we applied it to five enterobacterial
genome sequences: E. coli K-12 MG1655, E. coli O157:H7 EDL933, E. coli
O157:H7 Sakai VT-2, Salmonella enterica serovar Typhimurium LT2, and
Salmonella enterica serovar Typhi CT18. The genomes range in length from
4.6 to 5.5 Mbp. They contain substantial homologous regions that have been
rearranged, as well as numerous lineage-specific regions.
To determine the locally collinear blocks (LCBs), and hence the
rearrangement structure of these genomes, GRIL can be run with the
following command line:
> gril -m 23 -f 100000 -r 10000 ecolim52.fas ecolim52.sml EDL933.fas EDL933.sml o157sakai.fas o157sakai.sml typhimurium.fas typhimurium.sml styphi.fas styphi.sml
The -m 23 parameter specifies that GRIL should use exact, unique 23-bp
mers as the seed matches when searching for MUMs. The -f 100000 parameter
sets the maximum permissible difference in generalized offset between a
MUM and its neighboring MUMs to 100,000. The -r 10000 parameter specifies
that GRIL should remove LCBs that span less than 10,000 bp. For a detailed
description of the generalized offset, range, and identity filters, see
the GRIL Algorithms web page. The remaining parameters specify the input
sequence files, each followed by its sorted mer list file. If the sorted
mer list (.sml) files do not exist, GRIL will create them.
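The pairing of each sequence file with its sorted mer list file can be sketched in Python. The helper below, build_gril_command, is hypothetical and not part of GRIL. It assumes that each .sml file shares the base name of its sequence file, as in the command above, and it uses only the -m, -f, and -r flags described here.

```python
import shlex


def build_gril_command(fasta_files, mer_size=23, max_offset=100000, min_range=10000):
    """Return the GRIL argument list for the given sequence files.

    Hypothetical helper: only the -m, -f and -r flags from the example
    command are used.
    """
    args = ["gril", "-m", str(mer_size), "-f", str(max_offset), "-r", str(min_range)]
    for fasta in fasta_files:
        # Assumption: the sorted mer list shares the sequence file's base name.
        stem = fasta.rsplit(".", 1)[0]
        args += [fasta, stem + ".sml"]
    return args


genomes = ["ecolim52.fas", "EDL933.fas", "o157sakai.fas",
           "typhimurium.fas", "styphi.fas"]
command = build_gril_command(genomes)
print(shlex.join(command))
```

To launch GRIL from a script, the resulting list could be passed to subprocess.run(command, check=True), provided the gril executable is on the search path.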
The resulting list of LCBs generated by GRIL is shown below. Gen. 1
through Gen. 5 denote the five genomes in the order given on the command
line:
LCB | Coordinate | Gen. 1  | Gen. 2   | Gen. 3  | Gen. 4   | Gen. 5
--- | ---------- | ------- | -------- | ------- | -------- | --------
0   | Start      | 1078    | 1095     | 1095    | 1078     | 1078
0   | End        | 1210698 | 1701969  | 1617996 | 1325533  | 1235018
1   | Start      | 1223602 | 1753271  | 1665866 | -1913801 | -1835402
1   | End        | 1245807 | 1782945  | 1695540 | -1895884 | -1817487
2   | Start      | 1271008 | 1805062  | 1717127 | -1869318 | 1236924
2   | End        | 1311799 | 1844558  | 1756620 | -1829190 | 1276519
3   | Start      | 1313968 | -2266213 | 1816964 | -1825650 | 1280059
3   | End        | 1628062 | -1928536 | 2155935 | -1586945 | 1505444
4   | Start      | 1660659 | 2333722  | 2262438 | -1574763 | 1517474
4   | End        | 1878072 | 2552854  | 2481652 | -1351305 | 1762656
5   | Start      | 1885530 | 2560538  | 2489335 | 1914664  | 1836265
5   | End        | 3421295 | 4221880  | 4158663 | 3566860  | 3418174
6   | Start      | 3426685 | 4227281  | 4164060 | 3572293  | -4257207
6   | End        | 3938414 | 4799212  | 4734232 | 4099112  | -3750093
7   | Start      | 3944539 | 4805339  | 4740359 | 4105326  | -3552636
7   | End        | 4032184 | 4895478  | 4830715 | 4195123  | -3424381
8   | Start      | 4039571 | 4902706  | 4837943 | 4202772  | -3742567
8   | End        | 4205483 | 5081454  | 5016709 | 4394434  | -3558097
9   | Start      | 4211801 | 5087365  | 5022622 | 4400783  | 4263481
9   | End        | 4638208 | 5523431  | 5497437 | 4856419  | 4808024
Negative start and end coordinates signify a region that is inverted
relative to the first (reference) sequence. Using the GenVision software
package from DNASTAR, the block structure can be rendered graphically as
follows:
In this figure, each colored block represents a region of each
bacterium's chromosome that has not undergone significant rearrangement.
Blocks below a sequence's center line denote regions that are inverted
relative to the reference sequence, in this case E. coli K-12 MG1655. For
example, the yellow block represents a region that is homologous in all
sequences and is inverted in the Salmonella strains relative to E. coli
K-12 MG1655.
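The sign convention for LCB coordinates can be illustrated with a short Python sketch. The function describe_interval is hypothetical and not part of GRIL. It assumes that a negative sign only flags inversion relative to the reference, that the absolute values are ordinary positions in the genome, and that both endpoints are included in the length; none of these details is confirmed here.

```python
def describe_interval(start, end):
    """Interpret one genome's signed LCB start and end coordinates.

    Assumptions: the sign marks orientation only, absolute values are
    positions in the genome, and the length counts both endpoints.
    """
    if (start < 0) != (end < 0):
        raise ValueError("start and end must have the same sign")
    left, right = sorted((abs(start), abs(end)))
    return {
        "inverted": start < 0,
        "left": left,
        "right": right,
        "length": right - left + 1,
    }


# LCB 1 from the first table: Gen. 1 (reference) and Gen. 4.
reference = describe_interval(1223602, 1245807)
typhimurium = describe_interval(-1913801, -1895884)
print(reference)
print(typhimurium)
```

For LCB 1, the sketch reports the Gen. 1 interval as not inverted and the Gen. 4 interval as inverted, with lengths of 22206 and 17918 positions under the inclusive-endpoint assumption.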
As the graphic shows, GRIL locates 10 LCBs using the parameters
-r 10000 and -f 100000. Different parameter values can change the blocks
found. For example, if the minimum range (-r) parameter is reduced
to 1000, GRIL reports the following 21 LCBs:
LCB | Coordinate | Gen. 1  | Gen. 2   | Gen. 3  | Gen. 4   | Gen. 5   | Annotation
--- | ---------- | ------- | -------- | ------- | -------- | -------- | ----------
0   | Start      | 1078    | 1095     | 1095    | 1078     | 1078     |
0   | End        | 1029700 | 1249171  | 1159948 | 1173990  | 1083736  | 1/1
1   | Start      | 1031539 | 1296253  | 1211436 | 1885005  | 1806607  |
1   | End        | 1038839 | 1303553  | 1218736 | 1892269  | 1813871  | hya/app operon
2   | Start      | 1057724 | 1321671  | 1236854 | -4024121 | 3824605  |
2   | End        | 1060269 | 1324216  | 1239399 | -4021574 | 3827152  | tor operon
3   | Start      | 1062940 | 1326889  | 1242072 | 1199861  | 1109950  |
3   | End        | 1210698 | 1701969  | 1617996 | 1325533  | 1235018  | 1/2
4   | Start      | 1223602 | 1753271  | 1665866 | -1913801 | -1835402 |
4   | End        | 1245807 | 1782945  | 1695540 | -1895884 | -1817487 | 2/1, hya insert in Salmonella
5   | Start      | 1256020 | 1790084  | 1702679 | -1882834 | -1804436 |
5   | End        | 1267636 | 1801700  | 1714295 | -1871014 | -1792616 | 2/2
6   | Start      | 1271008 | 1805062  | 1717127 | -1869318 | 1236924  |
6   | End        | 1311799 | 1844558  | 1756620 | -1829190 | 1276519  | 3
7   | Start      | 1313968 | -2266213 | 1816964 | -1825650 | 1280059  |
7   | End        | 1588799 | -1969157 | 2115315 | -1629679 | 1468629  | 4/1
8   | Start      | 1599093 | -1959481 | 2124991 | 4283076  | -3662353 |
8   | End        | 1605058 | -1953516 | 2130956 | 4289144  | -3656285 | ambiguously annotated area
9   | Start      | 1606250 | -1952324 | 2132148 | -1604983 | 1488687  |
9   | End        | 1628062 | -1928536 | 2155935 | -1586945 | 1505444  | 4/2
10  | Start      | 1660659 | 2333722  | 2262438 | -1574763 | 1517474  |
10  | End        | 1878072 | 2552854  | 2481652 | -1351305 | 1762656  | 5
11  | Start      | 1885530 | 2560538  | 2489335 | 1914664  | 1836265  |
11  | End        | 2685024 | 3473207  | 3407370 | 2696428  | 2660528  | 6/1, cad insert in Salmonella
12  | Start      | 2685256 | 3473439  | 3407602 | 2703825  | 2667912  |
12  | End        | 3421295 | 4221880  | 4158663 | 3566860  | 3418174  | 6/2
13  | Start      | 3426685 | 4227281  | 4164060 | 3572293  | -4257207 |
13  | End        | 3865055 | 4723394  | 4658412 | 4010335  | -3838391 | 7/1, tor insert in Salmonella
14  | Start      | 3873910 | 4727497  | 4662515 | 4034380  | -3815805 |
14  | End        | 3938414 | 4799212  | 4734232 | 4099112  | -3750093 | 7/2
15  | Start      | 3944539 | 4805339  | 4740359 | 4105326  | -3552636 |
15  | End        | 4032184 | 4895478  | 4830715 | 4195123  | -3424381 | 8
16  | Start      | 4039571 | 4902706  | 4837943 | 4202772  | -3742567 |
16  | End        | 4108253 | 4975395  | 4910632 | 4280125  | -3665303 | 9/1, AAA insert in Salmonella
17  | Start      | 4108693 | 4975835  | 4911072 | 4290638  | -3654791 |
17  | End        | 4205483 | 5081454  | 5016709 | 4394434  | -3558097 | 9/2
18  | Start      | 4211801 | 5087365  | 5022622 | 4400783  | 4263481  |
18  | End        | 4348110 | 5230966  | 5204982 | 4550394  | 4395511  | 10/1
19  | Start      | 4353775 | 5236631  | 5210647 | -2702365 | -2666452 |
19  | End        | 4358667 | 5241524  | 5215540 | -2697483 | -2661582 | cad operon
20  | Start      | 4360149 | 5243006  | 5217022 | 4566599  | 4543093  |
20  | End        | 4638208 | 5523431  | 5497437 | 4856419  | 4808024  | 10/2
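The counts of broken pieces and new LCBs can be checked directly from the Annotation column. The Python sketch below is only an illustration: it assumes that labels beginning with a number, optionally followed by /piece, mark pieces of an original block, and that every other label marks a new LCB. The labels appear to count the original blocks from 1.

```python
import re
from collections import defaultdict

# Annotation column of the 21-LCB table, in LCB order (0 to 20).
annotations = [
    "1/1", "hya/app operon", "tor operon", "1/2",
    "2/1, hya insert in Salmonella", "2/2", "3", "4/1",
    "ambiguously annotated area", "4/2", "5",
    "6/1, cad insert in Salmonella", "6/2",
    "7/1, tor insert in Salmonella", "7/2", "8",
    "9/1, AAA insert in Salmonella", "9/2", "10/1", "cad operon", "10/2",
]

# Assumed label shape: "<original block>" or "<original block>/<piece>",
# optionally followed by a comma and free text.
label = re.compile(r"^(\d+)(?:/(\d+))?(?:,|$)")

pieces = defaultdict(list)   # original block number -> LCB numbers of its pieces
new_lcbs = []                # (LCB number, annotation) for unnumbered labels
for lcb_number, text in enumerate(annotations):
    match = label.match(text)
    if match:
        pieces[int(match.group(1))].append(lcb_number)
    else:
        new_lcbs.append((lcb_number, text))

piece_count = sum(len(members) for members in pieces.values())
print(len(pieces), "original blocks split into", piece_count, "pieces")
print(len(new_lcbs), "new LCBs:", new_lcbs)
```

Under these assumptions the tally gives 10 original blocks in 17 pieces plus 4 new LCBs, 21 in total.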
By inspecting the sequence annotation, we can validate the additional
LCBs reported by GRIL. Four new LCBs were found using the lower minimum
range threshold and have been labeled according to their sequence
annotation. Three of the new LCBs are 'mobile' operons that have relocated
across the five genome sequences, and the fourth is ambiguously annotated.
The four new LCBs break up the original 10 blocks into 17 collinear
blocks, for a total of 21 LCBs. The original LCBs are labeled with the
corresponding block number from the first comparison, and blocks that were
broken are labeled as [original number]/[piece number]. These labels count
the original blocks from 1, so LCB 0 of the first comparison appears as
block 1. The biological significance of the sequence contained in one of
the rearrangements is unknown. Using GenVision, the position of the new
rearrangements can be overlaid onto our visualization of the original set
of 10 rearrangements:
Thus, by adjusting the minimum identity, maximum generalized offset
distance, and especially the minimum range parameters, genome
rearrangements at different scales can be selectively identified or
ignored.
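The effect of the minimum range threshold can be illustrated with a small Python sketch. This is not GRIL's implementation. The functions span and passes_min_range are hypothetical, and the sketch assumes that an LCB is kept only if it spans at least the threshold in every genome, whereas GRIL's actual range filter may measure span differently and also re-evaluates neighboring blocks after a removal. The coordinates are taken from three rows of the 21-LCB table.

```python
def span(start, end):
    # Assumption: signs only mark orientation, so compare absolute positions.
    return abs(abs(end) - abs(start))


def passes_min_range(lcb, min_range):
    """Keep an LCB only if it spans at least min_range bp in every genome.

    Assumed rule for illustration; GRIL's own filter may differ.
    """
    return all(span(start, end) >= min_range for start, end in lcb["intervals"])


# (start, end) pairs for Gen. 1 to Gen. 5, copied from the 21-LCB table.
lcbs = [
    {"name": "hya/app operon",
     "intervals": [(1031539, 1038839), (1296253, 1303553), (1211436, 1218736),
                   (1885005, 1892269), (1806607, 1813871)]},
    {"name": "tor operon",
     "intervals": [(1057724, 1060269), (1321671, 1324216), (1236854, 1239399),
                   (-4024121, -4021574), (3824605, 3827152)]},
    {"name": "3",
     "intervals": [(1271008, 1311799), (1805062, 1844558), (1717127, 1756620),
                   (-1869318, -1829190), (1236924, 1276519)]},
]

kept = {}
for min_range in (10000, 1000):
    kept[min_range] = [lcb["name"] for lcb in lcbs if passes_min_range(lcb, min_range)]
    print("-r", min_range, "keeps", kept[min_range])
```

With a threshold of 10000 only block 3 survives in this sample, while a threshold of 1000 keeps all three, which mirrors how the two operons appear only in the second run.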