Using gapped suffix arrays for approximate string matching

Posted on 2011-04-19
Last Modified: 2012-05-11
Hi there,

I have been tasked with implementingin an algorithm that builds an approximate index on a set of text (eg: a collection of websites) and perform some operations related to it (eg: Search, Find & Replace).

The suggested technique to implement this is to use gapped suffix arrays. I have searched online for more detail but with no luck. I understand what suffix arrays are about, but not the "gapped" piece.

Can anyone kindly assist me or point me to a web link with more information on this topic?

Thanks very much,
Question by:2ooth
    LVL 32

    Accepted Solution

    LVL 37

    Assisted Solution

    Gapped suffix arrays are a new thing as of 2010. They were introduced in this research paper:
    There's not a lot of implementation using them out there yet. They have not been accepted as an industry standard and are still primarily academic.

    If you want to launch into the paper and implement the new technology, then it may be a challenge, but it could be quite useful. There are more standard methods out there (the paper details one) so you could use one of those to fall back on. There's no guarantee that a new method will actually be better than the old ones until more people have tried to use it.
    So far that paper has only been cited by two others (according to Microsoft Research) so there's not a lot of third party testing involved yet. You'd be a first.

    Author Comment

    Thank you Sara and Tommy for your prompt responses.

    I will have a read-through of the wiki links shortly.


    Looking at the research paper, I agree that the gapped suffix arrays technique is probably not the path I should go down with, in the interest of time. Can you kindly mention the more standard methods that are definitely proven to work? Perhaps its worth my time doing my ground work on alternate "working" methods :)

    Thanks again..

    LVL 37

    Expert Comment

    Regular Suffix arrays would work. The original paper (Suffix arrays: A new method for on-line string searches) was published in 1989 and has over 1000 citations. There is a good Wikipedia page for suffix arrays (it is listed in Sara's first link under indexing methods too). I would start there.

    Author Comment

    Ok fantastic, thanks so much!

    Will give that a go and let you know if I need further input.


    Featured Post

    Highfive + Dolby Voice = No More Audio Complaints!

    Poor audio quality is one of the top reasons people don’t use video conferencing. Get the crispest, clearest audio powered by Dolby Voice in every meeting. Highfive and Dolby Voice deliver the best video conferencing and audio experience for every meeting and every room.

    Join & Write a Comment

    Suggested Solutions

    Using SQL Scripts we can save all the SQL queries as files that we use very frequently on our database later point of time. This is one of the feature present under SQL Workshop in Oracle Application Express.
    Why do we like using grid based layouts in website design? Let's look at the live examples of websites and compare them to grid based WordPress themes.
    Viewers will get an overview of the benefits and risks of using Bitcoin to accept payments. What Bitcoin is: Legality: Risks: Benefits: Which businesses are best suited?: Other things you should know: How to get started:
    This video teaches users how to migrate an existing Wordpress website to a new domain.

    729 members asked questions and received personalized solutions in the past 7 days.

    Join the community of 500,000 technology professionals and ask your questions.

    Join & Ask a Question

    Need Help in Real-Time?

    Connect with top rated Experts

    18 Experts available now in Live!

    Get 1:1 Help Now