?
Solved

code:review case insensitive string search

Posted on 2011-04-21
17
Medium Priority
?
622 Views
Last Modified: 2012-08-14
i need to run a find on a huge string.

    void insert_ad(const char* data_str, off_t length, const char* tag) {
        string temp_str(data_str, length);
        size_t found = temp_str.find("</body>");
        temp_str.insert(found, tag);
        _msg_data.swap(temp_str);
    }

whats the best way to a case insensitive search so that i can search for both body and BODY in the most efficient way...the data_string is an output buffer string so i cannot convert the string itself in an case-insensitive form
0
Comment
Question by:Vlearns
  • 8
  • 6
  • 3
17 Comments
 
LVL 32

Expert Comment

by:phoffric
ID: 35444647
Here is a case insensitive program.
  "You could use `std::search` with a custom predicate."
#include <locale>  
#include <iostream>  
#include <algorithm>  
using namespace std;  
  
// templated version of my_equal so it could work with both char and wchar_t  
template<typename charT>  
struct my_equal {  
    my_equal( const std::locale& loc ) : loc_(loc) {}  
    bool operator() (charT ch1, charT ch2) {  
        return std::toupper(ch1, loc_) == std::toupper(ch2, loc_);  
    }  
private:  
    const std::locale& loc_;  
};  
  
// find substring (case insensitive)  
template<typename T>  
int ci_find_substr( const T& str1, const T& str2, const std::locale& loc = std::locale() )  
{  
    T::const_iterator it
       = std::search(
            str1.begin(), str1.end(),
            str2.begin(), str2.end(), my_equal<T::value_type>(loc) );  
    if ( it != str1.end() )  
        return it - str1.begin();  
    else  
        return -1; // not found  
}  
  
int main(int arc, char *argv[])   
{  
    // string test  
    std::string str1 = "FIRST HELLO";  
    std::string str2 = "hello";  
    int f1 = ci_find_substr( str1, str2 );  
  
    return 0;  
}

Open in new window

0
 

Author Comment

by:Vlearns
ID: 35444775
thanks for the reply, could you give me some explanation of whats happening in this code, the approach?
what are the performance impact of using templates vs calling the standard find twice ? once for body and other for BODY?

0
 
LVL 32

Expert Comment

by:phoffric
ID: 35445012
In the search last arg is the expression: my_equal<T::value_type>(loc)

my_equal is the constructor of the struct my_equal. If you are not concerned about locale, then you could leave out the default constructor from the struct my_equal definition.

Then the last arg of the search is a created object of type struct my_equal<T::value_type>
T is string
string::value_type is char_type, which for ASCII chars is just char

This last arg of the search (in this case, an object) is a predicate that return a bool. The search takes this object and applies the () operator. Usually you think of () operator as acting on function pointers, but a unary operator can be applied to objects as well. Line 10 shows the definition of this objects () operator. After the template is realized, you can see that it just compares two chars after normalizing them both to uppercase.

On performance, take a look at this brief statement (and other descriptions of functors earlier in the article):
       http://en.wikipedia.org/wiki/Function_object#Performance

If performance is a concern, then for this application, there is a good chance that using char * (or char[] ) types instead of string class, and using strstr may be more optimal. However, the tradeoff, as mentioned in previous questions, is that using  string is much easier to code without buffer overruns.
0
What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

 
LVL 32

Expert Comment

by:phoffric
ID: 35445324
I'll be back later to see if I can add more to your performance question.
0
 
LVL 12

Expert Comment

by:trinitrotoluene
ID: 35448288
this might be more efficient if you are looking at time and space complexity even though it might seem inelegant at first.

size_t found1 = temp_str.find("</body>");
size_t found2 = temp_str.find("</BODY>");

if(found1 != npos)  
      temp_str.insert(found1, tag);
else
        temp_str.insert(found2, tag);
0
 

Author Comment

by:Vlearns
ID: 35449822
is this not better? still thinking of the performance issue with tags.... not sure whats most efficient...  


void insert_ad(const char* data_str, off_t length, const char* tag) {
        string temp_str(data_str, length);
        size_t found = 0;
        found = temp_str.find("</body>");
        if (!found) {
            found = temp_str.find("</BODY>");
          }
        if (found) {
           temp_str.insert(found, tag);
           _msg_data.swap(temp_str);
          }
    }
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35449875
If efficiency is paramount in importance, then you should use a C approach and not create strings which are very useful in writing and maintaining code with less bugs than with C. That is the trade-off you need to make. Given that the problem is not too hard, going with the lower-level C approach is desireable if it will yield significant performance benefits.

The most efficient way to find </body> or </BODY> in const char* data_str is to:

Loop one char at a time for '<'. (no need to deal with upper/lower case issues here.)

If next char is '/', then check next char for 'b' or 'B'
else if next char is not '/', then continue loop looking one char at a time for '<'.

If you found the 'b' or 'B', then follow this pattern to find the 'd' and the 'y'. Then the next char must be '>'. Otherwise, start looking for a '<' from the last pattern failure.

With this approach, you would not create your C++ string temp_str.
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35449925
If you wish to use C library functions, then you can use strchr to find the '<'
    http://www.cplusplus.com/reference/clibrary/cstring/strchr/
char * pch;
pch=strchr(data_str,'<');

However, if this function is not inlined, then you pay the overhead price of calling the function and passing in the argument.
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35450296
I checked the C++ standard.

"17.4.4.3 Global or non-member functions"
"It is unspecified whether any global or non-member functions in the C + + Standard Library are defined as inline (7.1.2)."
Even if you explore the header files and find that an implementation of a global function is defined as inline..

"7.1.2 Function specifiers"
"A function declaration (8.3.5, 9.3, 11.4) with an inline specifier declares an inline function. The inline specifier indicates to the implementation that inline substitution of the function body at the point of call is to be preferred to the usual function call mechanism. An implementation is not required to perform this inline substitution at the point of call..."

So, if using a library function, you need to perform timing tests to see whether it performs as well as straight inlined non-function code. (In C, the library was permitted to use a function-like macro to give a guaranteed effect of inline; however, in C++, this practice is not allowed. Instead the use of inline is recommended - but as you can see, even if inline is used, it does not need to be implemented.)

On a practical note, usually the compiler will implement inline for very brief functions.
0
 

Author Comment

by:Vlearns
ID: 35450628
hmmm, tahnks for all your help! appreciate it.

one approach i thought was the following

        string search("</body>");
 
        string::iterator fpos = search(temp_str.begin(), temp_str.end(), search.begin(), search.end(), icase);
        if (fpos != temp_str.end())
           found = distance(temp_str.begin(), fpos);


but i am getting the following compile error

code.h: In method `void IMAP_Message::insert_ad(const char *, long long int, const char *)':
codee.h:283: no match for call to `(string) (char *, char *, char *, char *, {unknown type})'
gmake: *** [fsse.o] Error 1

0
 

Author Comment

by:Vlearns
ID: 35450641
where icase is defined as

   bool icase (char l, char r) {
        return (std::tolower(l) == std::tolower(r));
    }

0
 

Author Comment

by:Vlearns
ID: 35451048
   static bool icase (char l, char r) {
        return (std::tolower(l) == std::tolower(r));
    }

    void insert_ad(const char* data_str, off_t length, const char* tag) {
        string temp_str(data_str, length);
        size_t found = 0;
        string search("</body>");


        string::iterator fpos =  std::search(temp_str.begin(), temp_str.end(), search.begin(), search.end(), &IMAP_Message::icase);
        if (fpos != temp_str.end())
           found = distance(temp_str.begin(), fpos);
/*
        found = temp_str.find("</body>");
        if (!found) {
            found = temp_str.find("</BODY>");
          }

*/
        if (found) {
           temp_str.insert(found, tag);
           _msg_data.swap(temp_str);
          }

    }


this compiles fine
0
 

Author Comment

by:Vlearns
ID: 35451122
it seems to work......should be better from a performance perspective right?
0
 
LVL 32

Expert Comment

by:phoffric
ID: 35451972
>> search(temp_str.begin(), temp_str.end(), strPattern.begin(), strPattern.end(), &IMAP_Message::icase);
The search internally loops on each char in temp_str from begin to the first find of the string pattern, and calls icase, which, in turn, calls tolower twice.
Pro:    Easy to implement.
Con:   Multiple function calls (even if inlined, multiple operations)

BTW - if you use function object as shown in the first post instead of icase, then my understanding is that there is a better chance that the compiler will inline the object's bool operator(), and thereby improve performance.

======
For comparison below is a short program that illustrates the approach described in http:#35449875. (It would be useful if you could perform timing tests using the clock function, and post the results.)

It's using this string::find overloaded version:
     size_t find ( char c, size_t pos = 0 ) const;
                  http://www.cplusplus.com/reference/string/string/find/

Now consider the first step:
    size_t position =src.find('<', 0);
This just checks each char from the beginning of src until it finds a '<'. Unlike the search, there is no need for calling tolower . Also, using this version of find, it is likely that the STL implementation will put the '<' in a register for optimal performance.

After finding a '<', the find_end_body_tag function then checks for a '/', and then calls tolower on the tag, "body", one char at a time. Most of the time, the first or second test condition in the tag will fail, so the remaining 4-5 conditions in the if-statement will not be executed.
size_t find_end_body_tag( string src ) {
    size_t position = 0;

    while( ( position = src.find('<', position) ) != string::npos ) {
        if(               src[position+1]   == '/'
            && ::tolower( src[position+2] ) == 'b'
            && ::tolower( src[position+3] ) == 'o'
            && ::tolower( src[position+4] ) == 'd'
            && ::tolower( src[position+5] ) == 'y'
            &&            src[position+6]   == '>' ) {
                break;
        }
        ++position; // this could be further tweaked to skip more than 1 but
                    // probably negligible benefit for such a short pattern match
    }
    if( position == string::npos ) {
        cout << "ERROR </body> not found" << endl;
        // Maybe append </body> ?
    }
    return position;
}

int main(int arc, char *argv[])     
{    
    // string test
    string src   = "test<BODY>the true <b>body</b> of the program</BODY> rest of stuff";
    char * addin = "- ADDIN HERE - A VERY LONG MESSAGE GOES HERE";
    size_t pos = find_end_body_tag( src );
    if( pos != string::npos ) {
        for( int k=pos; k < pos+7; ++k ) {
            cout << src[k];
        }
        cout << endl;
    }
    return 0;    
}

Open in new window

0
 
LVL 12

Expert Comment

by:trinitrotoluene
ID: 35457734
Inlining doesn't necessarily mean better performance.

http://www.parashift.com/c++-faq-lite/inline-functions.html

vlearns : I would go with my approach which you modified in your subsequent post.

0
 
LVL 12

Assisted Solution

by:trinitrotoluene
trinitrotoluene earned 400 total points
ID: 35457735
however it would be good if you could time all the approaches listed here and see which scores the best
0
 
LVL 32

Accepted Solution

by:
phoffric earned 1600 total points
ID: 35458191
There were 3 approaches:
1. search using tolower
2. two finds, one for uppercase and one for lowercase
3. Use find_end_body_tag

Items 1 and 2 are easy to implement.

Item 1's search approach, as mentioned,  has extra tolower operations on every char during its icase call.

Item 2 may do two passes over the data when the first pass does not find the right case.

Item 3's find_end_body_tag function results in one pass over the bulk of the string. Not as easy as two find's, but not hard to implement.

Whether the inefiiciencies in items 1 and 2 are significant is, of course, application dependent. If you are examining huge amounts of data constantly to the point that two passes over the data adds up, then just use the one pass approach.
0

Featured Post

What does it mean to be "Always On"?

Is your cloud always on? With an Always On cloud you won't have to worry about downtime for maintenance or software application code updates, ensuring that your bottom line isn't affected.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Errors will happen. It is a fact of life for the programmer. How and when errors are detected have a great impact on quality and cost of a product. It is better to detect errors at compile time, when possible and practical. Errors that make their wa…
Many modern programming languages support the concept of a property -- a class member that combines characteristics of both a data member and a method.  These are sometimes called "smart fields" because you can add logic that is applied automaticall…
The goal of the video will be to teach the user the concept of local variables and scope. An example of a locally defined variable will be given as well as an explanation of what scope is in C++. The local variable and concept of scope will be relat…
The goal of the video will be to teach the user the difference and consequence of passing data by value vs passing data by reference in C++. An example of passing data by value as well as an example of passing data by reference will be be given. Bot…
Suggested Courses

862 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question