phoffric
asked on
Find keywords in text blocks using Regex
Given a set of keywords, I would like to list each one that I find in a text block (including, say, the offset). Then I will fill the block with new text and search again until there is no more text to search.
I would like to use Boost regex since my understanding is that it is probably more efficient than a simple approach that I could come up with by using only C++ and STL. (BTW, I cannot use C++11.) Boost should also allow the program to be built on both Windows and Linux without platform dependent code. I have hardly used regex, so, if there is a tidy C++ solution, then I hope you will throw in a few words to help me understand the regex part so that I can add appropriate comments.
If possible, could you show the C++ function code? I can only use Boost and standard STL libraries, and standard C++ code.
Thanks,
Paul
It's been a long time since I've done C++ so I can only provide pseudo-code to do what you want. The regex portion should be the same.
If you would prefer, I could provide Perl code to do this but buffer handling in perl is very different (no pre-allocating arrays/buffers).
foreach keyword {
    push(rx_array, concat('\b', keyword, '\b'));
    if (length(keyword) > max_len) { max_len = length(keyword) }
}
while not end_of_input_file {
    if (buf) { buf = strcpy(buf, -max_len) }
    buf = concat(buf, read(input_file, buf_size-size(buf)));
    foreach element of rx_array {
        if (rx_match(element, buf)) {
            print("found match for ", element);
            # regex normally can't provide match offset in buffer (unless that's particular to Boost)
        }
    }
}
ASKER
Hello kdo and farzanj. Thanks for the replies. I am travelling this week, so I may not be able to check things out immediately.
@kdo,
If I understand you right, in non-regex terms, the set of keywords might have been written as:
string keywords[] = {"this", "is", "the", "regex", "string"};
and then are you looping through each of the 5 keywords searching std::string text as in:
std::size_t offset = text.find( keywords[0] ); // then indexes 1..4
and in your example, this would find the keywords "this" and "is" in the text string.
Is there any reason to think that this regex solution will be faster than the std::find approach? (I'll time the two approaches. If no improvement, then maybe regex isn't the right approach.)
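For reference, the non-regex baseline described above could be sketched roughly as follows. This is only an illustration: `Hit` and `findWithStdFind` are made-up names, and the code sticks to C++03 to fit the no-C++11 constraint. It reports plain substring hits (so the "is" inside "this" counts), grouped by keyword rather than ordered by offset.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: which keyword was found, and where.
struct Hit {
    std::string keyword;
    std::size_t offset;
};

// Plain std::string::find baseline: one pass over the text per keyword.
std::vector<Hit> findWithStdFind(const std::string& text,
                                 const std::vector<std::string>& keywords)
{
    std::vector<Hit> hits;
    for (std::size_t k = 0; k < keywords.size(); ++k) {
        const std::string& kw = keywords[k];
        if (kw.empty())
            continue; // an empty keyword would "match" at every position
        std::size_t pos = text.find(kw);
        while (pos != std::string::npos) {
            assert(pos + kw.size() <= text.size());
            Hit h;
            h.keyword = kw;
            h.offset = pos;
            hits.push_back(h);
            // resume just past this hit instead of rescanning from the start
            pos = text.find(kw, pos + kw.size());
        }
    }
    return hits;
}
```

Timing this against a single-pattern regex on the real keyword set and block sizes is probably the only reliable way to settle the performance question.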
@farzanj,
I don't see a set of keywords in your example, such as:
string keywords[] = {"this", "is", "the", "regex", "string"};
>> make a block in such a way that is doesn't end at mid-word
If my blocks are, say 1KB, and let's say I am reading in a file, then it is possible that the last two bytes read in are "st", which is a candidate to the keyword "string". Since the max string length of all the keywords in the above example is 6, then before reading in the next 1KB block, I would copy the last 6 bytes of the previous block to the beginning of the buffer and then append it with the next 1KB bytes of the new text.
But maybe regex has a way to detect partial matches on keywords. But if this is possible I can save that for another question. I would be happy if I could get a performance improvement using regex even without the partial keyword.
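The carry-over idea above can be sketched like this. It is a minimal illustration under assumptions of my own: `Hit` and `scanInBlocks` are hypothetical names, matching is plain substring search with std::string::find (no word boundaries), and the code carries over one byte less than the longest keyword, which is enough for a keyword split across two blocks. Matches that end inside the carried tail are skipped because the previous round already reported them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // only needed for the usage note below
#include <string>
#include <vector>

struct Hit {
    std::string keyword;
    std::size_t offset;  // offset from the start of the whole stream
};

std::vector<Hit> scanInBlocks(std::istream& in,
                              const std::vector<std::string>& keywords,
                              std::size_t blockSize)
{
    std::vector<Hit> hits;
    std::size_t maxLen = 0;
    for (std::size_t k = 0; k < keywords.size(); ++k)
        maxLen = std::max(maxLen, keywords[k].size());
    if (maxLen == 0 || blockSize == 0)
        return hits;

    std::string buffer;           // carried tail followed by the new block
    std::size_t bufferStart = 0;  // stream offset of buffer[0]
    std::vector<char> block(blockSize);

    while (in.read(&block[0], static_cast<std::streamsize>(blockSize)) || in.gcount() > 0) {
        const std::size_t carried = buffer.size();
        buffer.append(&block[0], static_cast<std::size_t>(in.gcount()));

        for (std::size_t k = 0; k < keywords.size(); ++k) {
            const std::string& kw = keywords[k];
            if (kw.empty())
                continue;
            // Only accept matches that reach into the newly read bytes.
            std::size_t from = carried >= kw.size() ? carried - kw.size() + 1 : 0;
            for (std::size_t pos = buffer.find(kw, from);
                 pos != std::string::npos;
                 pos = buffer.find(kw, pos + 1)) {
                Hit h;
                h.keyword = kw;
                h.offset = bufferStart + pos;
                hits.push_back(h);
            }
        }

        // Keep just enough of the tail to complete a keyword split across blocks.
        const std::size_t keep = std::min(buffer.size(), maxLen - 1);
        assert(keep <= buffer.size());
        bufferStart += buffer.size() - keep;
        buffer.erase(0, buffer.size() - keep);
    }
    return hits;
}
```

As a usage example, wrapping "xxstringyy this" in a std::istringstream and scanning it with a block size of 4 forces "string" to straddle two blocks; the function still reports it once, at offset 2.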
ASKER
@wilcoxon, thanks for the reply. I am travelling now; will look at your post tomorrow. In light of what I just wrote with emphasis on performance, will your solution run faster than using a non-regex approach as I described? What language are you comfortable writing in? I appreciate your reply. Keep in mind that I may have to use boost if there is a performance increase over using std::find.
My language of choice anymore is Perl. Here's what the solution would be in Perl (should perform well but will be fairly different from C++ just due to language differences).
I coded it in such a way to give you several options for handling...
my @keywords = qw(this other word or another);
my $small_files = 1; # set to 0 if files are too big to easily fit in memory
my @buf;
if ($small_files) {
    open IN, 'input_file' or die "could not open input_file: $!";
    @buf = map { chomp; $_ } <IN>;
    close IN;
} else {
    # can handle any size file by not reading all of it into memory
    use Tie::File;
    tie @buf, 'Tie::File', 'input_file' or die "can not tie input_file: $!";
}
my $alt_method = 0; # set to 1 to try a different version
# built unconditionally: "my $rx = ... if $alt_method" has unreliable behavior in Perl
my $rx = '\b(' . join('|', @keywords) . ')\b';
# alternate - not sure which would be more efficient:
# my $rx = join '|', map { '\b' . $_ . '\b' } @keywords;
for my $i (0..$#buf) {
    if ($alt_method) {
        if ($buf[$i] =~ m{$rx}io) {
            print "matched $1 on line $i\n";
        }
    } else {
        foreach my $word (@keywords) {
            if ($buf[$i] =~ m{\b$word\b}i) {
                print "matched $word on line $i\n";
            }
        }
    }
}
ASKER
Thanks for all your input. Probably not much time this week to consider all of your replies. Just to let you know, there might be many keywords, and many text blocks from various sources.
>> Although my regular expressions were highly optimized, it still decreased the execution time by many times.
I find this very interesting - and it shows a major misconception on my part. Maybe there are other Boost libraries that are faster for the kind of searches I was talking about.
ASKER
fyi - because you mentioned that regex is slower than string operations, I did a search:
c++ is regex faster than string operations -python -perl
This article seems to confirm the assertion that string operations are faster than regex:
http://blogs.msdn.com/b/oanapl/archive/2009/04/04/performance-comparison-regex-versus-string-operations.aspx
This thread suggests the opposite (but only 1 vote, so taking it with a grain of salt):
http://stackoverflow.com/questions/15503372/using-tr1regex-search-to-match-a-big-list-of-strings
As per your second link, yes, it is possible in some specific cases. If you are doing a string search on many words and making one pass per word, you are making multiple passes over the text. However, if you use a single regex with alternation (string1|string2|string3) and it makes just one pass, it may be faster. It would certainly help if your regex engine is a DFA rather than an NFA.
With a backtracking (NFA) engine it also helps to factor out common prefixes. E.g., instead of writing (?:fast|fatter) you write (?:fa(?:st|tter)) -- just a trivial example.
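A quick way to convince yourself that the factored form accepts exactly the same words is to run both patterns side by side. The sketch below is an assumption-laden illustration: it uses std::regex (C++11) purely as a stand-in so it builds without Boost, and the function names are made up. With Boost the same patterns should work with boost::regex and boost::regex_match.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in: with Boost, spell these boost::regex / boost::regex_match.
bool matchesPlain(const std::string& word)
{
    static const std::regex re("(?:fast|fatter)");
    return std::regex_match(word, re);   // whole-string match
}

bool matchesFactored(const std::string& word)
{
    // Same language, but the shared prefix "fa" is tested only once.
    static const std::regex re("(?:fa(?:st|tter))");
    return std::regex_match(word, re);
}
```

Whether the factoring actually buys any speed depends on the engine and the keyword set, so it is worth measuring rather than assuming.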
Hi Phoffric,
Performance seems to be paramount here...
Will your application always be searching for the same key string, or just a small number of fixed strings? If so, or if the object being searched is rather large, you could easily benefit from using one of the more "exotic" searches instead of a common linear search. There's usually a significant startup overhead, but once they begin the actual search operation they can be much, much faster.
Kent
ASKER
Thanks for all your input. Some of this discussion is a bit beyond me at this time. As I delve into this, I may have some questions in the future. I will try to implement some of the suggested approaches to get my feet wet and maybe measure their performance against string operations. (I may be working this weekend, so there may be a week's delay.) I'll leave performance issues/questions for the future.
fyi -
I wanted to see what is DFA and NFA and came up with this link:
http://stackoverflow.com/questions/3978438/dfa-vs-nfa-engines-what-is-the-difference-in-their-capabilities-and-limitations
which led to some debate on performance and then to this link:
http://swtch.com/~rsc/regexp/regexp1.html
I skimmed the start of this last article, and the claims are amazing.
This is getting off-topic; just thought you might be interested in the article.
I am familiar with some of the more exotic algorithms. From discussions with co-workers I thought Boost might be incorporating some of them. I'll seek clarification from them.
Interesting article. However, it left me wondering what the article was not saying. It's unclear whether any of the NFA implementations support full PCRE functionality. There are a lot of smart people working on most (probably all) of the languages that use recursive backtracking regex implementations. There must be a reason that none of them have switched to using Thompson NFAs for regexes.
Also, in practice, it's certainly possible to occasionally find some slow regexes but virtually unheard of to see a properly written regex that results in anything approaching the example regex level of slowness (I've never run into one).
the std::string::find will make a sequential search, similar to runtime strstr (where the latter is a little faster as it has no template overhead).
a faster algorithm for searching keywords in huge text is to have an array of 256 elements (assuming ansi text only), that could be indexed by each char of the text to search. an array element would be length of key if the corresponding character is not in key. for the characters of the key, the array element would be the number of characters in the key that were right from the character. using this array you would not search sequentially in the text but you would start at keylength-1 and can increment the current pos by array[text[pos]] as long as this value is not zero. if the value is zero the character matches the last character of the key and you have to compare the before string whether you got a match.
i don't know whether for multiple keywords it makes sense to use one skip array for all keys or a matrix with a new column for each key or to have multiple searches. but i would assume any of these solutions should be much faster than the standard search.
Sara
ASKER
Today I installed boost on Cygwin and got a sample (non-regex) program to work, but when I saw that I had to build regex, I installed regex on Ubuntu (over VMware on Windows 7) using apt-get.
@farzanj
On Ubuntu, I started with your program since it looked like the quickest way for me to test my installation. I have attached a file showing linker errors. (It compiled fine.) If you see a quick solution, please let me know. If complicated (especially since I don't know all the Linux utilities), then I'll put the problem in another thread. Thanks.
The attached file shows my apt-get install command for boost; shows the installed boost libraries; shows g++ command; and linker errors.
Not sure if it matters, but I installed the 64-bit Ubuntu version since I have Windows 7 Home Premium 64-bit.
After getting this to work, I'll then try Kdo's solution and call it a week (or two). I may ask another question on the Perl solution provided (thanks Wilcoxon) since I would like to understand what you did and compare results (functional and timing) with the Boost solutions provided.
regex.txt
ASKER
Because of linker errors, I wasn't able to run the two regex solutions, so just started reviewing your code.
OP: >> I would like to list each one that I find in a text block (including, say, the offset).
I don't see the offset. In fact, Wilcoxon explicitly addressed this point:
>> regex normally can't provide match offset in buffer
Suppose I have 10 keywords and a text block of 100KB. I want to search the text to find the first occurrence of any of the 10 keywords. (Then I might do some processing using that keyword.) Naturally, to find the next occurrence of any of the 10 keywords, I wouldn't want to start the search at the beginning of the text; I would start from the end of the keyword that I just found. What if I found a keyword, say, 90KB into the text block? The next search should only be over the remaining 10KB. That's why I asked for an offset; I want to be able to skip over the part of the text block already processed. A pointer to the next portion to be searched would be an acceptable alternative. (Or, as in strtok(), I don't care if the pointer or offset is internal to the boost library, just as long as I don't have to start a search from the beginning of the text block every time.)
Here is a simple example:
Keywords:
"last it This time"
Text:
"This is it! This is the last time! When it rains, it pours! "
Results:
This 0
it 8
This 12
last 24
time 29
it 40
it 50
We can assume every word in the text is surrounded by non-letter characters. I am not interested in substrings. So, if "is" were a keyword, I would not be interested in "his" (however, for simplicity, I would not object if the "is" in "his" were returned, as I could then validate the result).
If, as Wilcoxon says, this cannot be done with Regex, then I would just like to know whether this can be done with some other Boost libraries. If so, I'll ask another question on that topic. Thanks again.
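For what it's worth, regex match objects generally do expose the position of each match. Below is a minimal sketch of the example above, under stated assumptions: std::regex (C++11) is used only as a stand-in so the code builds without Boost, and `Hit`, `escapeForRegex` and `findKeywords` are made-up names. Boost.Regex offers the same-shaped interface (boost::regex, boost::sregex_iterator, and position() and str() on each match), so swapping the namespace should be most of the port. The keywords are joined into one alternation wrapped in \b word boundaries, so the text is scanned once and substrings such as the "is" in "his" are not reported.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Hit {
    std::string keyword;
    std::size_t offset;
};

// Backslash-escape regex metacharacters so a keyword is matched literally.
std::string escapeForRegex(const std::string& s)
{
    static const std::string meta("\\^$.|?*+()[]{}");
    std::string out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (meta.find(s[i]) != std::string::npos)
            out += '\\';
        out += s[i];
    }
    return out;
}

std::vector<Hit> findKeywords(const std::string& text,
                              const std::vector<std::string>& keywords)
{
    std::vector<Hit> hits;

    // Build \b(?:kw1|kw2|...)\b, skipping empty keywords.
    std::string alternatives;
    for (std::size_t k = 0; k < keywords.size(); ++k) {
        if (keywords[k].empty())
            continue;
        if (!alternatives.empty())
            alternatives += '|';
        alternatives += escapeForRegex(keywords[k]);
    }
    if (alternatives.empty())
        return hits;
    const std::regex re("\\b(?:" + alternatives + ")\\b");

    // Each step resumes after the previous match; nothing is rescanned.
    std::sregex_iterator it(text.begin(), text.end(), re);
    const std::sregex_iterator end;
    for (; it != end; ++it) {
        Hit h;
        h.keyword = it->str();
        h.offset = static_cast<std::size_t>(it->position());
        hits.push_back(h);
    }
    return hits;
}
```

Called with the four keywords and the sample text from this post, it yields the seven hits listed under "Results" in the same order. Note that \b assumes keywords begin and end with word characters; keywords ending in punctuation would need a different boundary test.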
ASKER
BTW - I didn't mention input files in my OP and first post. But it could be. Or the input could come from a pipe. So, I can't generalize to using an mmap solution for all cases.
i made a quick test program to implement the algorithm i suggested, and it seems to work.
it probably is not so difficult to add the non-substring requirement. you also could read the text from a stream, though i would read it in blocks then, for performance reasons.
Sara
#include <cstring>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

struct SearchResult
{
    std::string key;
    unsigned int off;
};

bool keySearch(const std::vector<std::string> & keys, const std::string & searchText, std::vector<SearchResult> & results)
{
    // m = length of the shortest key (the largest skip that is always safe)
    size_t m = (size_t)(-1);
    for (int i=0; i < (int)keys.size(); ++i)
    {
        size_t nm;
        if ((nm = keys[i].length()) < m)
        {
            m = nm;
        }
    }
    // no keys or an empty key: nothing sensible to search for
    if (m == (size_t)(-1) || m == 0)
        return false;
    // the skip table holds unsigned chars, so cap the skip at 255
    if (m > 255)
        m = 255;
    // init all char codes with minimum length
    unsigned char skips[256];
    memset(skips, (int)m, 256);
    for (int i=0; i < (int)keys.size(); ++i)
    {
        int l = (int)keys[i].length();
        for (int n= 0; n < l; ++n)
        {
            // index with unsigned char so chars above 127 cannot give a negative index
            unsigned char & sc = skips[(unsigned char)keys[i][n]];
            int nsc = l - n - 1;
            if (nsc < (int)sc)
            {
                sc = (unsigned char)nsc;
            }
        }
    }
    size_t pos = m-1;
    const char * psz = searchText.c_str();
    while (pos < searchText.length())
    {
        unsigned char c = psz[pos];
        unsigned char off = skips[c];
        if (off != 0)
        {
            pos += off;
            continue;
        }
        // if we have 0 the last character of at least one key is a match
        for (int i=0; i < (int)keys.size(); ++i)
        {
            const std::string & key = keys[i];
            // a key longer than the text up to pos cannot end here
            // (without this check pos-k below would run before the start of the text)
            if (key.size() > pos + 1)
                continue;
            size_t k = 0;
            bool f = true;
            for (size_t n = key.size(); n != 0; k++)
            {
                if (key[--n] != psz[pos-k])
                {
                    f = false;
                    break;
                }
            }
            if (f == false)
                continue;
            // we have a match
            results.push_back(SearchResult());
            SearchResult & sr = results.back();
            sr.key = key;
            sr.off = (unsigned int)(pos - k + 1);
            // continue for further matches?
        }
        pos++;
    }
    return results.empty() == false;
}

int main(int argc, char **argv)
{
    std::vector<std::string> keys;
    keys.push_back("last");
    keys.push_back("it");
    keys.push_back("This");
    keys.push_back("time");
    std::string searchText = "This is it! This is the last time! When it rains, it pours! ";
    std::vector<SearchResult> results;
    if (keySearch(keys, searchText, results))
    {
        for (int i = 0; i < (int)results.size(); ++i)
        {
            std::cout << std::setw(20) << std::left << results[i].key << std::right << std::setw(10) << results[i].off << std::endl;
        }
    }
    return 0;
}
ASKER
Hi Sara,
I appreciate your trying to help make a good algorithm. Please understand that this question is strictly related to boost regex. From my recent survey, I see that there are multiple flavors of regex. (I will see if I can get the Perl regex solution provided here to conform to regex.) So, I am hoping for a regex solution since I am asked to use boost. I know of exotic algorithms that can do the work - there are whole books on the subject. I may get involved with them someday, but for now, it's boost.
As I mentioned earlier, if, as suggested here, boost regex may not be able to do the job, but some other boost library combinations can do the job, then if someone mentions that here, I may open another question if I need help.
Any further boost related comments pertaining to my comment http:#a39325051 would be appreciated.
Paul
ASKER
OK, pressure's off. I found a tested, optimized Aho-Corasick class in our infrastructure utilities. I still need to pursue regex to be able to handle Boost Spirit. Hope to get regex working this weekend.
Sara, Just a suggestion. How about taking your program and making an article or tutorial out of it for EE. Start out with the design and add in comments. If you do, let me know, and I will run it, try to find time to critique it before publication, and will mark it with a thumbs up.
Paul
ASKER
Got boost to work and tried the program by farzanj in http:#a39307669. But, unless I am missing something, it does not appear to address: "Given a set of keywords, I would like to list each one that I find in a text block". (And not doing that, there are no offsets to list either.) Is this problem not doable in Regex?
ASKER
@kdo,
Here is a program that I took from your post, but I get no output when I run it. Can you tell me what I have to change? Thanks.
#include <iostream>
#include <boost/regex.hpp>
using namespace std;
using namespace boost;
int main()
{
    std::string text(" this is a text sample ");
    const char* pattern = "this is the regex string";
    // Instantiate the regex class object and register the pattern string
    boost::regex MyText (pattern);
    // Now loop through the string finding each match.
    boost::sregex_iterator it(text.begin(), text.end(), MyText);
    boost::sregex_iterator end;
    for (; it != end; ++it) {
        cout << it->str() << endl; // contains the string
    }
}
ASKER
>> The regex string passed to boost would be more like "This|is|the|regex|string" .
>> const char* pattern = "[^| ]*is[ \t,\.\:\;";
Then do I need two loops, one to parse out the keywords (in a manner similar to farzanj's program), and then a second loop to apply each keyword, one at a time, to the text string?
Is there a way to do this with only one loop?
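As far as I can tell, one loop is enough once the keywords are joined with | into a single pattern. The sketch below is illustrative only: std::regex (C++11) stands in for boost::regex so it builds without Boost, and `matchesInOnePass` is a made-up helper name. It also shows why the earlier program printed nothing: "this is the regex string" is one literal phrase, and that phrase never occurs in the sample text.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collect every non-overlapping match of `pattern` in `text` with a single loop.
// Stand-in: with Boost, use boost::regex and boost::sregex_iterator instead.
std::vector<std::string> matchesInOnePass(const std::string& text,
                                          const std::string& pattern)
{
    std::vector<std::string> found;
    const std::regex re(pattern);
    std::sregex_iterator it(text.begin(), text.end(), re);
    const std::sregex_iterator end;
    for (; it != end; ++it)
        found.push_back(it->str());
    return found;
}
```

With the text " this is a text sample ", the pattern "this|is|the|regex|string" yields "this" and then "is", while the literal pattern "this is the regex string" yields nothing. Matching is case-sensitive by default, so "This" in a pattern would not match "this" in the text.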
The more "general" the search, the more complicated the regex.
What are you trying to find? Maybe we can put together a regex that will search for both in a single pass.
Kent
ASKER
Thanks!
In general, there will be just a bunch of keywords, and for this question, I don't have to worry about case sensitivity: matching is exact, so "This" will not match against "this". Let's say that each keyword ends in a known delimiter (say, '!') in the text to be searched (not sure if this helps at all, but thought I'd throw it in, in case it does). Here is a list of 7 keywords:
CMD! WRITE! READ! QUIT! OPEN! CLOSE! FILENAME!
The text string might look like this:
"Here begins the CMD!OPEN!a Quick Brown Fox!FILENAME!paul's filename!READ!sdfsdfafffsd asfdasfd!R EAD!765757 7READ!qwer ty!WRITE!R EAD!asdfgj jkll;jksdf a!WRITE!CL OSE!QUIT!a nd that's all there is!"
Ideally, I would like to somehow know what follows the keyword; for example, after FILENAME! is the expression "paul's filename!" (and that's why I asked for an offset in the OP). But for now, if that is too much for a beginner like myself, I'll be happy just to know what keywords are in the text string. I can always ask another question to advance my understanding. I even added the delimiter, !, at the end of the text following a keyword, in case that might help.
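One possible reading of this requirement, sketched under several assumptions: the keyword's end offset (match position plus match length) marks where the following text starts, and that text runs up to the next '!' unless another keyword follows immediately. `Command` and `parseCommands` are made-up names, and std::regex (C++11) again stands in for boost::regex, which provides the same position(), length() and sub-match accessors.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

struct Command {
    std::string keyword;   // e.g. "FILENAME" (without the '!')
    std::size_t offset;    // where the keyword starts in the text
    std::string argument;  // text after "KEYWORD!" up to the next '!'
};

std::vector<Command> parseCommands(const std::string& text)
{
    static const std::regex re("(CMD|WRITE|READ|QUIT|OPEN|CLOSE|FILENAME)!");
    std::vector<Command> cmds;
    std::vector<std::size_t> after;  // offset just past each keyword's '!'

    // Pass 1: locate every keyword and remember where it ends.
    std::sregex_iterator it(text.begin(), text.end(), re);
    const std::sregex_iterator end;
    for (; it != end; ++it) {
        Command c;
        c.keyword = (*it)[1].str();
        c.offset = static_cast<std::size_t>(it->position());
        cmds.push_back(c);
        after.push_back(c.offset + static_cast<std::size_t>(it->length()));
    }

    // Pass 2: slice out the text that follows each keyword.
    for (std::size_t i = 0; i < cmds.size(); ++i) {
        const bool keywordFollows =
            (i + 1 < cmds.size()) && cmds[i + 1].offset == after[i];
        if (keywordFollows)
            continue;  // e.g. "CMD!OPEN!": CMD has no text of its own
        const std::size_t bang = text.find('!', after[i]);
        assert(after[i] <= text.size());
        cmds[i].argument = text.substr(
            after[i],
            bang == std::string::npos ? std::string::npos : bang - after[i]);
    }
    return cmds;
}
```

For a shortened version of the text above, such as "Here begins the CMD!OPEN!a Quick Brown Fox!FILENAME!paul's filename!QUIT!", this gives CMD with an empty argument, OPEN with "a Quick Brown Fox", FILENAME with "paul's filename", and QUIT with an empty argument.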
ASKER
@wilcoxon,
I asked a question about your perl program here:
https://www.experts-exchange.com/questions/28190975/Need-Perl-program-decoded-for-newbie.html
Paul
sorry, I was out for a while.
Did you try my comment ID: 39307669
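You need to link against the Boost regex library using g++'s -l option (for example, -lboost_regex, placed after the source file on the command line). Do you still get a linking error?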
ASKER
I have to leave now for a week, but will try to respond in the evenings as I get a chance.
@farzanj, I did run your program and saw how you split the text into tokens. I had problems linking at first, but solved it here:
https://www.experts-exchange.com/questions/28184274/Boost-Regex-Linker-Error-Possible-bad-Boost-installation.html
I think I will close this question now, and ask add-on questions later. Thanks to all who tried to help this new boost regex programmer. (My regex experience consists of everything on this page; same true for boost. Thanks all for your patience.)
Hmmm.....
That's a whole different problem. :(
A regex defines matches. To get the data between matches you'll need a program such as awk or sed.
How do your keywords (READ! WRITE! etc.) appear? Can there be text immediately preceding them or does the exclamation point separate all of the tokens?
ASKER
I've requested that this question be closed as follows:
Accepted answer: 83 points for Kdo's comment #a39307641
Assisted answer: 84 points for farzanj's comment #a39307669
Assisted answer: 84 points for Kdo's comment #a39307917
Assisted answer: 83 points for farzanj's comment #a39307920
Assisted answer: 83 points for farzanj's comment #a39313902
Assisted answer: 83 points for Kdo's comment #a39343349
Assisted answer: 0 points for phoffric's comment #a39343538
for the following reason:
This has been a good start for learning a little about boost and regex. Thanks for your help.
ASKER
Accidentally accepted one of my posts as a solution.
ASKER
Thanks for all your assistance in helping me understand some of the basics of C++ boost/regex. I will be asking follow up questions as I better absorb your comments.