Genome Rearrangement and Inversion Locator

Using GRIL to determine the rearrangement structure of five genomes

To illustrate how GRIL can be used, we applied it to five enterobacterial genome sequences: E. coli K-12 MG1655, E. coli O157:H7 EDL933, E. coli O157:H7 Sakai VT-2, Salmonella enterica serovar Typhimurium LT2, and Salmonella enterica serovar Typhi CT18. The genomes range in length from 4.6 to 5.5 Mbp. They contain substantial rearranged homologous regions as well as numerous lineage-specific regions.

To determine the locally collinear blocks (LCBs), and hence the structure of rearrangement in these genomes, GRIL can be run with the following command line:

> gril -m 23 -f 100000 -r 10000 ecolim52.fas ecolim52.sml EDL933.fas EDL933.sml o157sakai.fas o157sakai.sml typhimurium.fas typhimurium.sml styphi.fas styphi.sml

The -m 23 parameter specifies that GRIL should use exact, unique 23-bp mers as the seed matches when searching for maximal unique matches (MUMs). The -f 100000 parameter sets 100000 as the maximum permissible difference in generalized offset between a MUM and its neighboring MUMs. The -r 10000 parameter specifies that GRIL should remove LCBs that span less than 10000 bp. For a detailed description of the generalized offset, range, and identity filters, see the GRIL Algorithms web page. The remaining parameters specify the input sequence files and the sorted mer list files. If the sorted mer list (.sml) files do not exist, GRIL will create them.
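The argument order described above (three options, then one sequence file and one sorted mer list file per genome) can be assembled programmatically. The following is a minimal sketch, not part of GRIL: the helper name `build_gril_command` is hypothetical, and it assumes each .sml file shares the base name of its FASTA file, as in the example command.

```python
from pathlib import Path


def build_gril_command(fasta_files, seed_size=23, max_offset=100000, min_range=10000):
    """Assemble the gril argument list in the order used in the example command.

    Hypothetical helper for illustration; only the -m, -f and -r options
    shown in the text are used.
    """
    args = ["gril", "-m", str(seed_size), "-f", str(max_offset), "-r", str(min_range)]
    for fasta in fasta_files:
        # Assumption: the sorted mer list has the same base name with a .sml suffix.
        args += [fasta, str(Path(fasta).with_suffix(".sml"))]
    return args


genomes = ["ecolim52.fas", "EDL933.fas", "o157sakai.fas", "typhimurium.fas", "styphi.fas"]
command = build_gril_command(genomes)
print(" ".join(command))
```

Changing `min_range` to 1000 in the call gives the argument list for the second run discussed later in this example.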

The resulting list of LCBs generated by GRIL is:

LCB | Start Gen. 1 | Start Gen. 2 | Start Gen. 3 | Start Gen. 4 | Start Gen. 5 | End Gen. 1 | End Gen. 2 | End Gen. 3 | End Gen. 4 | End Gen. 5
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 1078 | 1095 | 1095 | 1078 | 1078 | 1210698 | 1701969 | 1617996 | 1325533 | 1235018
1 | 1223602 | 1753271 | 1665866 | -1913801 | -1835402 | 1245807 | 1782945 | 1695540 | -1895884 | -1817487
2 | 1271008 | 1805062 | 1717127 | -1869318 | 1236924 | 1311799 | 1844558 | 1756620 | -1829190 | 1276519
3 | 1313968 | -2266213 | 1816964 | -1825650 | 1280059 | 1628062 | -1928536 | 2155935 | -1586945 | 1505444
4 | 1660659 | 2333722 | 2262438 | -1574763 | 1517474 | 1878072 | 2552854 | 2481652 | -1351305 | 1762656
5 | 1885530 | 2560538 | 2489335 | 1914664 | 1836265 | 3421295 | 4221880 | 4158663 | 3566860 | 3418174
6 | 3426685 | 4227281 | 4164060 | 3572293 | -4257207 | 3938414 | 4799212 | 4734232 | 4099112 | -3750093
7 | 3944539 | 4805339 | 4740359 | 4105326 | -3552636 | 4032184 | 4895478 | 4830715 | 4195123 | -3424381
8 | 4039571 | 4902706 | 4837943 | 4202772 | -3742567 | 4205483 | 5081454 | 5016709 | 4394434 | -3558097
9 | 4211801 | 5087365 | 5022622 | 4400783 | 4263481 | 4638208 | 5523431 | 5497437 | 4856419 | 4808024
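Some coordinates in this list are negative, which marks a region that is inverted relative to genome 1. As a minimal sketch of reading that sign convention, the snippet below lists, for each LCB, the genomes in which the block is inverted. The start coordinates are copied from the list above; the names `LCB_STARTS` and `inverted_genomes` are illustrative and not part of GRIL.

```python
# Start coordinates for genomes 1-5, copied from the 10-LCB list above.
LCB_STARTS = {
    0: (1078, 1095, 1095, 1078, 1078),
    1: (1223602, 1753271, 1665866, -1913801, -1835402),
    2: (1271008, 1805062, 1717127, -1869318, 1236924),
    3: (1313968, -2266213, 1816964, -1825650, 1280059),
    4: (1660659, 2333722, 2262438, -1574763, 1517474),
    5: (1885530, 2560538, 2489335, 1914664, 1836265),
    6: (3426685, 4227281, 4164060, 3572293, -4257207),
    7: (3944539, 4805339, 4740359, 4105326, -3552636),
    8: (4039571, 4902706, 4837943, 4202772, -3742567),
    9: (4211801, 5087365, 5022622, 4400783, 4263481),
}


def inverted_genomes(starts):
    """Return the 1-based genome numbers whose coordinate is negative."""
    return [number for number, coord in enumerate(starts, start=1) if coord < 0]


inversions = {lcb: inverted_genomes(starts) for lcb, starts in LCB_STARTS.items()}
for lcb, genomes in inversions.items():
    print(lcb, genomes if genomes else "same orientation as genome 1 in all genomes")
```

For example, LCB 1 is reported as inverted in genomes 4 and 5 (the two Salmonella strains), and LCB 3 as inverted in genomes 2 and 4.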

Negative start and end coordinates signify a region that has been inverted relative to the first (the reference) sequence. Using the GenVision software package from DNAstar, the block structure can be graphically rendered as follows:

In this figure, each colored block represents a region of each bacterium's chromosome that has not undergone significant rearrangement. A block drawn below a sequence's center line is inverted relative to the reference sequence, in this case E. coli K-12 MG1655. For example, the yellow block represents a region that is homologous in all sequences and is inverted in the Salmonella strains relative to E. coli K-12 MG1655. As shown in the above graphic, GRIL locates 10 LCBs using the parameters -r 10000 and -f 100000. Different parameters can change the blocks that are found. For example, if the minimum range (-r) parameter is reduced to 1000, GRIL reports the following 21 LCBs:

LCB | Start Gen. 1 | Start Gen. 2 | Start Gen. 3 | Start Gen. 4 | Start Gen. 5 | End Gen. 1 | End Gen. 2 | End Gen. 3 | End Gen. 4 | End Gen. 5 | Annotation
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
0 | 1078 | 1095 | 1095 | 1078 | 1078 | 1029700 | 1249171 | 1159948 | 1173990 | 1083736 | 1/1
1 | 1031539 | 1296253 | 1211436 | 1885005 | 1806607 | 1038839 | 1303553 | 1218736 | 1892269 | 1813871 | hya/app operon
2 | 1057724 | 1321671 | 1236854 | -4024121 | 3824605 | 1060269 | 1324216 | 1239399 | -4021574 | 3827152 | tor operon
3 | 1062940 | 1326889 | 1242072 | 1199861 | 1109950 | 1210698 | 1701969 | 1617996 | 1325533 | 1235018 | 1/2
4 | 1223602 | 1753271 | 1665866 | -1913801 | -1835402 | 1245807 | 1782945 | 1695540 | -1895884 | -1817487 | 2/1, hya insert in Salmonella
5 | 1256020 | 1790084 | 1702679 | -1882834 | -1804436 | 1267636 | 1801700 | 1714295 | -1871014 | -1792616 | 2/2
6 | 1271008 | 1805062 | 1717127 | -1869318 | 1236924 | 1311799 | 1844558 | 1756620 | -1829190 | 1276519 | 3
7 | 1313968 | -2266213 | 1816964 | -1825650 | 1280059 | 1588799 | -1969157 | 2115315 | -1629679 | 1468629 | 4/1
8 | 1599093 | -1959481 | 2124991 | 4283076 | -3662353 | 1605058 | -1953516 | 2130956 | 4289144 | -3656285 | ambiguously annotated area
9 | 1606250 | -1952324 | 2132148 | -1604983 | 1488687 | 1628062 | -1928536 | 2155935 | -1586945 | 1505444 | 4/2
10 | 1660659 | 2333722 | 2262438 | -1574763 | 1517474 | 1878072 | 2552854 | 2481652 | -1351305 | 1762656 | 5
11 | 1885530 | 2560538 | 2489335 | 1914664 | 1836265 | 2685024 | 3473207 | 3407370 | 2696428 | 2660528 | 6/1, cad insert in Salmonella
12 | 2685256 | 3473439 | 3407602 | 2703825 | 2667912 | 3421295 | 4221880 | 4158663 | 3566860 | 3418174 | 6/2
13 | 3426685 | 4227281 | 4164060 | 3572293 | -4257207 | 3865055 | 4723394 | 4658412 | 4010335 | -3838391 | 7/1, tor insert in Salmonella
14 | 3873910 | 4727497 | 4662515 | 4034380 | -3815805 | 3938414 | 4799212 | 4734232 | 4099112 | -3750093 | 7/2
15 | 3944539 | 4805339 | 4740359 | 4105326 | -3552636 | 4032184 | 4895478 | 4830715 | 4195123 | -3424381 | 8
16 | 4039571 | 4902706 | 4837943 | 4202772 | -3742567 | 4108253 | 4975395 | 4910632 | 4280125 | -3665303 | 9/1, AAA insert in Salmonella
17 | 4108693 | 4975835 | 4911072 | 4290638 | -3654791 | 4205483 | 5081454 | 5016709 | 4394434 | -3558097 | 9/2
18 | 4211801 | 5087365 | 5022622 | 4400783 | 4263481 | 4348110 | 5230966 | 5204982 | 4550394 | 4395511 | 10/1
19 | 4353775 | 5236631 | 5210647 | -2702365 | -2666452 | 4358667 | 5241524 | 5215540 | -2697483 | -2661582 | cad operon
20 | 4360149 | 5243006 | 5217022 | 4566599 | 4543093 | 4638208 | 5523431 | 5497437 | 4856419 | 4808024 | 10/2

Inspection of the sequence annotation validates the additional LCBs reported by GRIL. The lower minimum range threshold yields four new LCBs, which are labeled according to their sequence annotation. Three of the new LCBs are 'mobile' operons that have relocated among the five genome sequences, and the fourth is ambiguously annotated. The four new LCBs break the original 10 blocks into 17 collinear blocks, for a total of 21 LCBs. Each original LCB is labeled with its block number from the first comparison, counted from 1 (so LCB 0 of the first list is labeled 1), and blocks that were broken are labeled [original number]/[piece number]. The biological significance of the sequence contained in one of the rearrangements is unknown. Using GenVision, the positions of the new rearrangements can be overlaid onto our visualization of the original set of 10 rearrangements:

Thus, by adjusting the minimum identity, the maximum generalized offset distance, and especially the minimum range parameters, genome rearrangements at different scales can be selectively identified or ignored.
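To make the effect of the minimum range concrete, the sketch below computes the span in genome 1 of the four LCBs that appear only in the -r 1000 run, using coordinates copied from the 21-LCB list above. It assumes that the span of an LCB in one genome is simply the absolute difference between its end and start coordinates, and it looks at genome 1 only; GRIL's actual range filter is described on the GRIL Algorithms web page and may differ. The function names are illustrative.

```python
# (start, end) in genome 1 for the four LCBs found only with -r 1000,
# copied from the 21-LCB list above.
NEW_LCBS = {
    "hya/app operon": (1031539, 1038839),
    "tor operon": (1057724, 1060269),
    "ambiguously annotated area": (1599093, 1605058),
    "cad operon": (4353775, 4358667),
}


def span(start, end):
    # Assumption: the span in a single genome is |end - start|.
    return abs(end - start)


def passes_min_range(start, end, min_range):
    # LCBs that span less than min_range are removed, so keep spans >= min_range.
    return span(start, end) >= min_range


spans = {name: span(*coords) for name, coords in NEW_LCBS.items()}
for name, (start, end) in NEW_LCBS.items():
    print(
        name,
        spans[name],
        "kept at -r 1000:", passes_min_range(start, end, 1000),
        "kept at -r 10000:", passes_min_range(start, end, 10000),
    )
```

Under this assumption all four spans (7300, 2545, 5965, and 4892 bp) fall between the two thresholds, which is consistent with these blocks being reported at -r 1000 but not at -r 10000.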

For further discussion and analysis of this 5-way genome comparison please read this PDF document.