Growing sequencing and assembly efforts have been met by
advances in high-throughput machines. However, the presence of
massive amounts of repeats and transposons complicates the assembly
process. Given a library of possible repeats, this paper
considers the problem of identifying repeats and transposons in the
fragments (also called reads) generated from sequencing machines.
This is a difficult problem as the locations of the fragments on the
complete genome are not known. Furthermore, due to insertion,
deletion and other evolutionary factors, different copies of repeats
can diverge from each other. The presence of transposons (also
called jumping genes) makes the problem even harder, as they can
split other repeats and make them diverge from the actual repeats.
We develop a graph-based method named RepFrag that efficiently identifies repeats in a given set of fragments. We first align the fragments to the repeats in the given repeat library. We model the alignments as the vertices of the graphs. We create an edge between two vertices if together they express a potential repeat better than either does alone. We traverse the paths in this graph to find a path that has a high potential of representing a complete repeat. We mask the aligned regions on the fragments corresponding to the vertices on that path. Using the unaligned regions, we create a new fragment and align it with the repeats. We modify the existing graphs based on the new alignments, if there are any. We iterate this process of path selection and graph modification until no promising path remains.
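The select-and-mask loop described above can be illustrated with a small sketch. This is not the RepFrag implementation. The `Alignment` record, the edge rule in `compatible`, the coverage-based path score, and the `max_gap` and `min_cover` thresholds are all assumptions made for illustration. The realignment of unaligned regions is omitted because it requires an external aligner.

```python
from collections import namedtuple

# Hypothetical alignment record: fragment region [f_start, f_end)
# aligned to region [r_start, r_end) of library repeat `rep`.
Alignment = namedtuple("Alignment", "frag f_start f_end rep r_start r_end")


def compatible(a, b, max_gap=20):
    """Assumed edge rule: same repeat, and b starts shortly after a ends."""
    return a.rep == b.rep and 0 <= b.r_start - a.r_end <= max_gap


def overlaps(a, b):
    """True if two alignments cover overlapping regions of the same fragment."""
    return a.frag == b.frag and a.f_start < b.f_end and b.f_start < a.f_end


def best_path(alignments, max_gap=20):
    """Path with the largest total repeat coverage, found by dynamic
    programming over vertices sorted by position on the repeat."""
    nodes = sorted(alignments, key=lambda a: (a.rep, a.r_start, a.r_end))
    if not nodes:
        return [], 0
    score = [a.r_end - a.r_start for a in nodes]
    prev = [None] * len(nodes)
    for i, b in enumerate(nodes):
        for j in range(i):
            if compatible(nodes[j], b, max_gap):
                cand = score[j] + (b.r_end - b.r_start)
                if cand > score[i]:
                    score[i], prev[i] = cand, j
    k = max(range(len(nodes)), key=score.__getitem__)
    total, path = score[k], []
    while k is not None:
        path.append(nodes[k])
        k = prev[k]
    return path[::-1], total


def identify_repeats(alignments, min_cover=50, max_gap=20):
    """Repeatedly select the best path, report it, and mask its fragments."""
    remaining, found = list(alignments), []
    while True:
        path, cover = best_path(remaining, max_gap)
        if not path or cover < min_cover:
            break  # no promising path remains
        found.append((path[0].rep, cover, path))
        # Masking: drop the path's alignments and any alignment that
        # overlaps a masked fragment region.
        remaining = [a for a in remaining
                     if a not in path and not any(overlaps(a, p) for p in path)]
    return found
```

Scoring a path by total repeat coverage is only one possible measure of how well a path represents a complete repeat; alignment scores or divergence could be used instead.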
We compared the performance of our method to that of RepeatMasker on 30 different fragment datasets generated from five different chromosomes of Arabidopsis thaliana for varying coverage values and fragment lengths. On average, RepFrag had a 35% better true positive to false positive ratio and was 7.3 times faster than RepeatMasker. Thus, our results suggest that our method improves significantly over RepeatMasker in terms of both speed and accuracy.