[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 517
  • Last Modified:

Using gapped suffix arrays for approximate string matching

Hi there,

I have been tasked with implementingin an algorithm that builds an approximate index on a set of text (eg: a collection of websites) and perform some operations related to it (eg: Search, Find & Replace).

The suggested technique to implement this is to use gapped suffix arrays. I have searched online for more detail but with no luck. I understand what suffix arrays are about, but not the "gapped" piece.

Can anyone kindly assist me or point me to a web link with more information on this topic?

Thanks very much,
C
0
2ooth
Asked:
2ooth
  • 2
  • 2
2 Solutions
 
TommySzalapskiCommented:
Gapped suffix arrays are a new thing as of 2010. They were introduced in this research paper:
http://www.dcs.kcl.ac.uk/staff/mac/DOC/gsuf-short.pdf
There's not a lot of implementation using them out there yet. They have not been accepted as an industry standard and are still primarily academic.

If you want to launch into the paper and implement the new technology, then it may be a challenge, but it could be quite useful. There are more standard methods out there (the paper details one) so you could use one of those to fall back on. There's no guarantee that a new method will actually be better than the old ones until more people have tried to use it.
So far that paper has only been cited by two others (according to Microsoft Research) so there's not a lot of third party testing involved yet. You'd be a first.
0
 
2oothAuthor Commented:
Thank you Sara and Tommy for your prompt responses.

I will have a read-through of the wiki links shortly.

Tommy,

Looking at the research paper, I agree that the gapped suffix arrays technique is probably not the path I should go down with, in the interest of time. Can you kindly mention the more standard methods that are definitely proven to work? Perhaps its worth my time doing my ground work on alternate "working" methods :)

Thanks again..

C
0
 
TommySzalapskiCommented:
Regular Suffix arrays would work. The original paper (Suffix arrays: A new method for on-line string searches) was published in 1989 and has over 1000 citations. There is a good Wikipedia page for suffix arrays (it is listed in Sara's first link under indexing methods too). I would start there.
0
 
2oothAuthor Commented:
Ok fantastic, thanks so much!

Will give that a go and let you know if I need further input.

0

Featured Post

Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

  • 2
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now