• Status: Solved

Start a regular expression match from the second word only

How can I instruct a regular expression to ignore the first word and then start its search for a pattern?
Asked by: PeterBaileyUk
 
duncanb7 commented:
You could try this: remove the first word of each line first, and then search for the pattern.

~ s/^\S+\s*//

s/.../.../            # Substitute command.
^                     # (Zero-width) beginning of line.
\S+                   # One or more non-space characters (the first word).
\s*                   # Zero or more whitespace characters after it.
//                    # Empty replacement, so the matched text is removed.
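
A minimal sketch of this two-step approach, written in Python rather than Perl purely for illustration. The helper name `search_after_first_word` is hypothetical, and the sketch assumes, as the Perl expression does, that the first word is not preceded by whitespace.

```python
import re

def search_after_first_word(pattern, line):
    """Drop the first word of `line`, then search what is left for `pattern`."""
    # Same idea as s/^\S+\s*//: remove the leading non-space run and the spaces after it.
    rest = re.sub(r"^\S+\s*", "", line, count=1)
    return re.search(pattern, rest)

SAMPLE = "207 HATCHBACK XR 1.6 VTi [120] 5dr Auto"
match = search_after_first_word(r"\d{3}", SAMPLE)
print(match.group() if match else None)
```

Because the first word is gone before the search starts, a three-digit model name such as 207 can no longer be picked up by `\d{3}`.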

I hope I understood your question completely. If not, please point it out.

Duncan
 
Dan Craciun (IT Consultant) commented:
Depending on your Regexp engine, this usually works:
^\b.*?\b(your search pattern here)


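\b.*?\b is meant to match the first word (a sequence of letters, digits or _). Because .*? is lazy, it can also match nothing at all when the search pattern itself can match at the very start of the line.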

or
^\s*\w+\s*(your search pattern here)


\s*\w+\s* will match some (optional) space characters, 1 or more word characters, then some more optional space characters.
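
As a rough comparison of the two prefixes, the sketch below substitutes `(\d+)` for "your search pattern here" and runs both in Python's `re` module, which is an assumption: the asker's engine had not been named at this point in the thread. The helper `first_group` and the short test strings are hypothetical.

```python
import re

WORD_BOUNDARY = r"^\b.*?\b(\d+)"    # first suggestion, with (\d+) as the search pattern
WORD_CLASS = r"^\s*\w+\s*(\d+)"     # second suggestion, with (\d+) as the search pattern

def first_group(pattern, text):
    """Return what the capturing group matched, or None if the pattern fails."""
    m = re.search(pattern, text)
    return m.group(1) if m else None

for text in ("Golf 120 TDI", "207 120"):
    print(text, "|", first_group(WORD_BOUNDARY, text), "|", first_group(WORD_CLASS, text))
```

Both prefixes skip an alphabetic first word such as "Golf". When the first word is itself numeric, the lazy `\b.*?\b` can match an empty string, so the group captures the first word instead of skipping it.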

HTH,
Dan
 
PeterBaileyUk (Author) commented:
OK, I will take a look. I've recently been involved in a project involving 750,000 word groups and have asked the experts a lot of questions. I am widening the data sets for testing and am now trying to create shorter patterns that cover most of what I need, as it's not possible to find one pattern that does everything I need.

It has become evident that the first word carries more weight in the description.
 
PeterBaileyUk (Author) commented:
I tried a simple pattern with respect to comment ID: 40192739:
~ s/^\S+\s*//\d+     (I've added \d+ as a simple pattern)

Test data: 207 HATCHBACK XR 1.6 VTi [120] 5dr Auto
I had hoped the 120 would be highlighted after ignoring the first word, the 207.

I also tried a simple pattern with respect to comment ID: 40192765:   ^\s*\w+\s*(\d+)
207 HATCHBACK XR 1.6 VTi [120] 5dr Auto
That highlights the first word; it should have ignored that and found the 120.

I want to ignore the first word because the manufacturers have named their models with integers, e.g. 207.

Would it maybe be simpler to reverse the search and start from the end?
 
PeterBaileyUk (Author) commented:
I have noticed that the first word is never preceded by a space, so is that how some of these work?
 
käµfm³d 👽 commented:
What programming language or text editor are you using to execute this? If we don't know which engine you use, we could be offering bad advice. For example, what duncanb7 suggested is a Perl-based expression. It would not work (the way you think it would) in a .NET regex.
 
PeterBaileyUk (Author) commented:
I am in VB in Access, using the VBScript.RegExp object.
 
Terry Woods (IT Guru) commented:
Peter, are you able to please post the code you've currently got? Any existing capturing groups (and their replacement) may need to be adjusted once we provide a change to your existing pattern.

"I have noticed that the first word is never preceded by a space so is that how some of these work?"
If the pattern you're talking about in that comment is the one starting with
^\s*\w+


then the \s* part of the pattern can match zero occurrences of a space character. To match one or more occurrences, rather than zero or more, you would use \s+ instead of \s*

The pattern
^\s*\w+\s*(\d+)


is matching the first word, but capturing the second one (well, at least it is if the second word is only made up of numbers, as it's \d+ rather than \w+). As I mentioned above, we need to know what's happening with the capturing groups to determine what happens from there on. I suspect this pattern isn't one that will work for you, as it only matches when the second "word" is made up of numbers.
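
A short check of why the first word ends up highlighted for the asker's test string. It uses Python's `re` as a stand-in for VBScript.RegExp, on the assumption that `^`, `\s`, `\w`, `\d` and backtracking behave the same way for this pattern. Because `\s*` may match nothing, the engine can backtrack and split "207" between `\w+` and `(\d+)`.

```python
import re

SAMPLE = "207 HATCHBACK XR 1.6 VTi [120] 5dr Auto"

# \s* can match zero characters, so \w+ gives back the "7" and (\d+) takes it.
loose = re.search(r"^\s*\w+\s*(\d+)", SAMPLE)

# Requiring at least one whitespace character stops that split.
strict = re.search(r"^\s*\w+\s+(\d+)", SAMPLE)

print(loose.group(0), loose.group(1))
print(strict)
```

With `\s+`, the pattern no longer matches this string at all, which is consistent with the point above that it only fits data whose second word is made up of digits.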
 
PeterBaileyUk (Author) commented:
I have a pattern which evolved around one manufacturer and model.

Ignoring the very first word is fundamental in case the vehicle itself is named with digits, like the Peugeot 207, but the description also contains the BHP, which can be two or three digits long (e.g. 90 or 109) and may have the word BHP as a separate word or attached, as in 110bhp.

I wanted to try to create a less complex pattern that was not vehicle specific.

StrPattern = "^\s*(\S+(?=\s).*?)(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{3})"

I have discovered a boundary: it must find values between 39 and 302 inclusive, but not pick up, for example, 3020 or 390.
Skipping the first word is paramount. The pattern must also find the values if they are at the end of the string.
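
One way to express the 39 to 302 band is to let the regex find standalone integers and do the range check in code. This is only a sketch under assumptions: the function name `bhp_candidates` is hypothetical, it is written in Python, and it relies on lookbehind, which VBScript.RegExp does not appear to support, so it would need adapting there.

```python
import re

# A run of digits that is not glued to another digit or to a decimal point,
# so "1.6" yields nothing and "3020" is read as one number, not as "302".
NUMBER = re.compile(r"(?<![\d.])\d+(?![\d.])")

def bhp_candidates(text, low=39, high=302):
    """Return integers after the first word that fall within low..high inclusive."""
    first_word = re.match(r"\s*\S+\s*", text)
    start = first_word.end() if first_word else 0
    values = (int(m.group()) for m in NUMBER.finditer(text, start))
    return [v for v in values if low <= v <= high]

print(bhp_candidates("207 GTi 1.6 16V 175 3dr"))
print(bhp_candidates("207 HATCHBACK XR 1.6 VTi [120] 5dr Auto"))
```

A regex-only version of the band could use the alternation `39|[4-9]\d|[12]\d\d|30[0-2]`, but it still needs a guard on each side so that it does not match inside a longer number.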
 
Terry Woods (IT Guru) commented:
Are you doing a replace with the replacement string being something like "$1"? We may need to change that as part of the solution. In fact, if you weren't doing a replace with "$1" as the replacement string, then try doing that as it may solve the problem :-)
 
PeterBaileyUk (Author) commented:
Yes, it has the bit you mentioned.

I believe (and this is the reason I started trying to simplify it) that it lets some values through and that it would be better to check bands.





Function removeBHP(ByVal txt As String) As String
    Dim StrPattern As String

    ' Strings with a single word are returned unchanged.
    If CountWordsInString(txt) > 1 Then

        ' Earlier pattern variants, kept for reference:
        'StrPattern = "^\s*(\S+\s.*?)(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{3})"
        'StrPattern = "^\s*(\S+(?=\s).*?)(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{3})"
        'StrPattern = "^\s*(\S+(?=\s).*?)(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{3}|\d{2})"
        'StrPattern ="^\s*(\S+(?=\s).*?)(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{2})"

        ' Remove the CO2 value before looking for the BHP.
        txt = removeC02(txt)

        StrPattern = "^\s*(\S+(?=\s).*?)(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{3})"

        With CreateObject("VBScript.RegExp")
            .IgnoreCase = True
            .Pattern = StrPattern
            .Global = True

            ' $1 puts back the text before the BHP; $2 puts back the
            ' character captured in front of a trailing number.
            removeBHP = .Replace(txt, "$1$2")
            'removeBHP = .Replace(removeBHP, "$1$2")

            Debug.Print "Input: " & txt & " Output: " & removeBHP
            'Debug.Print txt
        End With

    Else
        removeBHP = txt
    End If
End Function

 
Terry Woods (IT Guru) commented:
OK, so the pattern you've shown:
StrPattern = "^\s*(\S+(?=\s).*?)(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{3})"


captures the first word in the (\S+(?=\s).*?) part of the pattern. It then gets inserted back into the string as a replacement (it's the first of two capturing groups that are put back in, being $1 and $2), so the end result is that the first word shouldn't be removed.
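
The capture-and-restore behaviour described here can be checked with a small Python port of the pattern. This is a sketch, not the asker's code: `remove_bhp` is a hypothetical stand-in for the VBA function, it skips the CO2 step, and it assumes Python's `re` treats this pattern the same way VBScript.RegExp does.

```python
import re

BHP = re.compile(
    r"^\s*(\S+(?=\s).*?)"                                            # group 1: first word plus text up to the BHP
    r"(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{3})",  # group 2 only exists in the first alternative
    re.IGNORECASE,
)

def remove_bhp(text):
    """Mimic .Replace(txt, "$1$2"): keep both groups, drop the rest of the match."""
    # An unmatched group 2 is treated as an empty string.
    return BHP.sub(lambda m: m.group(1) + (m.group(2) or ""), text)

result = remove_bhp("207 HATCHBACK XR 1.6 VTi [120] 5dr Auto")
print(" ".join(result.split()))  # collapse the gap left where the BHP was removed
```

The leading "207" is part of group 1, so it is written back unchanged, and only the bracketed value is dropped.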

Are you sure it's this pattern that's giving the problem? (or is the data something other than "207 HATCHBACK XR 1.6 VTi [120] 5dr Auto"?)

I also see you're making a call to the removeC02 function, which could be a cause of the problem if my suggestions above don't help:
txt = removeC02(txt)

 
PeterBaileyUk (Author) commented:
What was happening, certainly in the first model case (Audi "A3"), was that if the string contained both a CO2 value and a BHP value, then the CO2 had to be removed first, hence the removeC02 call. The CO2 pattern was very specific (e.g. 107G), whereas BHP wasn't always so neat.

It appears that if, after the first word, it finds another number, say as part of the valves (16V), before the BHP, then it stops after identifying the first number, thus ignoring the BHP and leaving it in. This almost suggests that the other technical details should be removed first and the BHP pattern called afterwards. It might explain why some BHP values get left in the string.

Here is an example: 207 GTi 1.6 16V 175 3dr. The output string is: 207 GTi 1.6 V 175 3dr, so it took the 16 off the valves, then stopped and left in the BHP.

Same here: Insignia 1.8 16V 140 Elite 6Spd becomes Insignia 1.8 V 140 Elite 6Spd, so it took the 16 off the valves, then stopped and left in the BHP.

If I can band the pattern for BHP within a range, that would cure part of this.

I wonder whether I should now close this question and open a new one, as the pattern does currently ignore the first word.
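
The two outputs reported above can be reproduced with the pattern variant that ends in `|\d{3}|\d{2}`. That this variant was the one actually running is an inference from the outputs, not something the thread confirms. The sketch is in Python as a stand-in for VBScript.RegExp, and `strip_once` is a hypothetical name.

```python
import re

VARIANT = re.compile(
    r"^\s*(\S+(?=\s).*?)"
    r"(?:([^\d.])\d+$|\(?\d* ?bhp?\d*\)?|\(\d+\)|\[\d{2,3}\]|\d{3}|\d{2})",
    re.IGNORECASE,
)

def strip_once(text):
    """Return (new_text, number_of_replacements) for the $1$2 style replacement."""
    return VARIANT.subn(lambda m: m.group(1) + (m.group(2) or ""), text)

print(strip_once("207 GTi 1.6 16V 175 3dr"))
print(strip_once("Insignia 1.8 16V 140 Elite 6Spd"))
```

The replacement count is 1 in both cases. The pattern is anchored with `^`, so even a global replace can match only once per string, and the lazy `.*?` stops at the first number that fits any alternative, here the 16 of 16V.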
 
PeterBaileyUk (Author) commented:
I've shared the points out of fairness; I hope I didn't miss anyone.

I will post a new question about adding banding to the current pattern, as it's off topic for this question.
 
käµfm³d 👽 commented:
Not that I don't appreciate the gesture, but I didn't really contribute anything toward the actual solution.