Solved

What way is faster to check for a substring within a null-terminated string...

Posted on 2014-10-06
21
186 Views
Last Modified: 2014-11-17
Hi guys... assuming I have an array of AnsiChar called "buffer" that contains some characters, and I want to find out if a certain substring "substr" exists within there, what would be the fastest/most efficient way to do it?

- do a "SearchBuf" command on the array of Char looking for "substr"
- or convert both "buffer" and "substr" to PChar, and do a  AnsiStrPos command?
- or possibly some other way...?

Thanks!
    Shawn
0
Comment
Question by:shawn857
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 9
  • 4
  • 4
  • +2
21 Comments
 
LVL 19

Expert Comment

by:Thommy
ID: 40365311
Regarding the Delphi impementation of the different search functions SearchBuf is slower than Pos/PosEx.

But also take into account, that SearchBuf has options like start offset, forward/backward search, whole word search and case sensitive search, which could make it faster compared to Pos/PosEx with additional code.

Anyway, before making a decision you should try it out in your application with real data...
0
 
LVL 19

Accepted Solution

by:
MerijnB earned 500 total points
ID: 40365464
Hi Shawn,

as long as you are sure you don't need unicode, QStrings is still a very fast option. The code is very old, so you need to find replace stuff like:

string -> ansistring
char -> ansichar
pchar -> pansichar

I use it in XE7 and it works fine. Comments are unreadable, but function names tell a lot.
0
 

Author Comment

by:shawn857
ID: 40367038
Thanks guys. Merijn - so if I change string to ansistring, etc within QStrings, then it will work for Unicode?

Thanks
   Shawn
0
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:shawn857
ID: 40367580
Hmmm, I think should give you guys the "bigger picture" as to what I'm trying to do overall. Basically I have a large file of records (this could be gigabytes big), each record terminated by a CR/LF. I also have a file of search strings or "search criteria" that I need to check if any exist in each record of my file. These search criteria will usually be short text strings - up to 50 or so chars max. I could have a few hundred or more of these search criteria strings. I need to find the fastest way to compare each record of my disk file to all of these individual search criteria... to see if they exist within the record or not. Googling around, I saw some algorithms such as Aho-Corasick and Rabin-Karp... and I'm wondering if a pre-made Delphi unit exists that would implement one of these superfast algorithms.

Should I be asking this as a separate question...?

Thanks!
    Shawn
0
 
LVL 19

Expert Comment

by:MerijnB
ID: 40367773
so if I change string to ansistring, etc within QStrings, then it will work for Unicode?

No, it will never work for unicode, but if you do this you can use it for ansi strings in a unicode aware Delphi
0
 
LVL 37

Expert Comment

by:Geert Gruwez
ID: 40372742
ugh ... database ?
insert the text in a database and then run an sql

i thought that's what databases were for ... :)
0
 

Author Comment

by:shawn857
ID: 40375192
Does anyone know of a Delphi implementation or unit of a very fast/efficient way to scan for multiple sub-strings within a string?

Thanks
   Shawn
0
 

Author Comment

by:shawn857
ID: 40393351
Anybody still with me?

Cheers
    Shawn
0
 
LVL 37

Expert Comment

by:Geert Gruwez
ID: 40393740
just an idea

for very large texts
you could use gabr's omnithreadlibrary to start searching in several places at the same time
each thread searches a portion of text (including an overlap of x = search text length)
and returns info found in a message queue

the otl can keep threads live in background so it doesn't have to create/destroy them each time
it also has a non-blocking message queue

should work the same for search text in file
you can open a file and move to a certain place within each thread providing it gets the starting search point
0
 
LVL 7

Expert Comment

by:kfoster11
ID: 40396118
"I have a large file of records (this could be gigabytes big), each record terminated by a CR/LF"

Are you trying to search this large file of records on disk, or is it loaded into memory?
0
 

Author Comment

by:shawn857
ID: 40396192
Thanks guys...

Geert - I am still on D7, appears OmniThreadLibrary needs Delphi 2007 at least.

Kfoster - I am reading from disk always... I never load into memory. And I must search *each* line (or record) for all the substrings.

Thanks!
    Shawn
0
 
LVL 7

Expert Comment

by:kfoster11
ID: 40396350
So basically, you need to search gigabytes of strings for a given string on disk.  Personally, I would first get this working, and then try and put a memory cache in the middle.

While it is no longer being maintained, AsyncCalls is an open source Delphi 7 compatible library that manages a thread pool.  While I have  not tried to use it, I believe the following article on delphi.about.com is doing basically what you are trying to do.

http://delphi.about.com/od/kbthread/a/delphi-thread-pool-example-using-asynccalls.htm

Also, just because 10 threads speeds up the process, does not mean that 10,000 will speed it up 1000 times more.  The following blog explains thread optimization
http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx

Good Luck.
0
 

Author Comment

by:shawn857
ID: 40400543
Thanks kFoster... I will have a look at that AsyncCalls. I'm pretty new to threading though.. it might be over my head.

You also mentioned about "putting a memory cache in the middle". Could you explain a bit what that is exactly?

Thanks!
   Shawn
0
 
LVL 7

Expert Comment

by:kfoster11
ID: 40408321
A memory cache is used to put pieces of the file into memory as you are reading it.  When you do a search, your routines would need to first check to see if the file has changed since the data was stored (Cached) into memory.  If it was, then the cache is dumped and not used, if it hasn't changed then you search through the memory cache instead of reading the disk.  Disk reads are time expensive, where memory is not.

If you are reading a file that is changing a lot, then the cache is probably not useful, but if it doesn't change much then it will greatly improve your performance.  

If it is new, and you haven't done threading then it may be more than you want to tackle.  Also, with today's operating systems and smart drives a lot of this is cached anyway.
0
 

Author Comment

by:shawn857
ID: 40420154
Thanks kfoster. My file's data does not change - it's just a static "flat file".. but it could be very large (many gigs) so it's not practical to hold it all in memory at once. I do already read in a sizable chunk of this file at a time into a buffer using TFileStream (65000+ bytes for each read), and then parse each block line-by-line (looking for CR/LF's to detect the end of each line). This would kinda be like the "memory cache" approach you mention, no?

Thanks
    Shawn
0
 
LVL 37

Expert Comment

by:Geert Gruwez
ID: 40420987
depending on what you want to give your users
you could go for a text solution like this:
http://www.oracle.com/technetwork/database/12coracletexttwp-1961244.pdf
0
 

Author Comment

by:shawn857
ID: 40429265
Thanks Geert, but that would be a whole other can of worms, it seems. Maybe a bit "overkill".

Thanks
   Shawn
0
 
LVL 7

Expert Comment

by:kfoster11
ID: 40429924
Why not just import the text file into a table in SQL Express and let the RDMS do the work?

Select fieldname from tablename where fieldname like '%strtofind%'

No need to reinvent the wheel since your table is static.  SQL Express is free for databases up to 10GIGS
0
 
LVL 19

Expert Comment

by:MerijnB
ID: 40446724
Is this correct? I haven't been participating in the discussion for about a month but do get all the points?
0
 

Author Comment

by:shawn857
ID: 40447827
Yours was the best solution after all was said and done Merijn!

Thanks
   Shawn
0
 
LVL 19

Expert Comment

by:MerijnB
ID: 40447972
ok, glad to hear it worked :)
0

Featured Post

Enroll in May's Course of the Month

May’s Course of the Month is now available! Experts Exchange’s Premium Members and Team Accounts have access to a complimentary course each month as part of their membership—an extra way to increase training and boost professional development.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Introduction Raise your hands if you were as upset with FireMonkey as I was when I discovered that there was no TListview.  I use TListView in almost all of my applications I've written, and I was not going to compromise by resorting to TStringGrid…
In my programming career I have only very rarely run into situations where operator overloading would be of any use in my work.  Normally those situations involved math with either overly large numbers (hundreds of thousands of digits or accuracy re…
In a recent question (https://www.experts-exchange.com/questions/29004105/Run-AutoHotkey-script-directly-from-Notepad.html) here at Experts Exchange, a member asked how to run an AutoHotkey script (.AHK) directly from Notepad++ (aka NPP). This video…

739 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question