RegEx N'th Occurrence

I have a file that has values separated by spaces. I only want to grab the third space on each line. How would I do that?

kaufmed

What programming language or text editor are you using?

You might try:

^ [^ ]+ [^ ]+( )

Also, I took your question quite literally (as a regex would!), so I'm sure the above isn't exactly what you are looking for. Can you clarify what you are after?

lconnell

ASKER

Sublime Text. It would also be nice to know how to do this in Vim.

That did not work when using the RegEx search in Sublime.

kaufmed

I don't know if you saw the edit in my comment, but can you clarify what you are after? It seems weird that you would want the third space. I suspect what you meant was what follows the third space.

lconnell

ASKER

So I want to edit a file using multi-selection. I have 100 lines of the following text.

data1 data2 data3 data4 data5
...
...
...

I want to use Sublime or any editor to find the 3rd space so I can edit every line at once at that space. That way I can modify data4 on every line at one time to say "test_data4". Data4 can be any value, which is why I want to match at the third space.

kaufmed

OK, I see where I went wrong. This should be correct now:

^[^ ]+ [^ ]+ [^ ]+

This pattern assumes that a line never starts with a space.
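
A minimal sketch of that assumption, using Python's `re` module as a stand-in (the thread itself is about Sublime Text and Vim, and the helper name `leading_three_fields` is made up for illustration). The match covers the first three fields and stops just before the third space. A line that starts with a space does not match at all.

```python
import re

# kaufmed's pattern: three runs of non-space characters, single spaces between them.
PATTERN = re.compile(r"^[^ ]+ [^ ]+ [^ ]+")

def leading_three_fields(line):
    """Return the text matched by the pattern, or None if the line does not match."""
    m = PATTERN.match(line)
    return m.group(0) if m else None

print(leading_three_fields("data1 data2 data3 data4 data5"))   # data1 data2 data3
print(leading_three_fields(" data1 data2 data3 data4 data5"))  # None
```

The same failure happens when two spaces appear in a row among the first three fields, because each `[^ ]+` needs at least one non-space character.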

aikimark

Here's an alternative pattern

\w+ \w+ \w+ (\w+)

You can then use the regex Replace method against the \1 capture group.

kaufmed

@aikimark

There's no perceived benefit to using the "word character" class over "not a space". In the worst case the pattern won't match if there are any characters other than alphabetic, numeric, or underscores.
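
To make the difference concrete, here is a small Python sketch (Python's `re` is an assumption, as are the hyphenated sample line and the helper name `fourth_field`). Both patterns are anchored at the start of the line via `match`.

```python
import re

WORD = re.compile(r"\w+ \w+ \w+ (\w+)")                 # "word character" version
NON_SPACE = re.compile(r"[^ ]+ [^ ]+ [^ ]+ ([^ ]+)")    # "not a space" version

def fourth_field(pattern, line):
    """Return the fourth space-separated field, or None if the pattern fails."""
    m = pattern.match(line)
    return m.group(1) if m else None

plain = "data1 data2 data3 data4 data5"
hyphenated = "data-1 data-2 data-3 data-4 data-5"  # hypothetical input

print(fourth_field(WORD, plain))            # data4
print(fourth_field(NON_SPACE, plain))       # data4
print(fourth_field(WORD, hyphenated))       # None
print(fourth_field(NON_SPACE, hyphenated))  # data-4
```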

aikimark

@kaufmed

I realize that. Normally, I would use the not-a-space pattern. But you'd already used it, and I find that \w+ is simpler to type than [^ ]+ (three characters versus five).

What I hope I've added is the grouping of the fourth 'word' that will allow the Replace method to be used.

aikimark

It looks like my pattern needed tweaking. It should be: (\w+ \w+ \w+ )(\w+)( .*?\r\n)
Example:

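    Dim strData As String
    Dim oRE As Object
    ' Late-bound VBScript regular expression object
    Set oRE = CreateObject("vbscript.regexp")
    ' Global = True replaces every match, not only the first one
    oRE.Global = True
    ' $1 = first three fields, $2 = fourth field, $3 = rest of the line plus CRLF
    oRE.Pattern = "(\w+ \w+ \w+ )(\w+)( .*?\r\n)"
    ' Sample data standing in for the contents of the file
    strData = "data1 data2 data3 data4 data5" & vbCrLf
    strData = strData & "data21 data22 data23 data24 data25" & vbCrLf
    strData = strData & "data31 data32 data33 data34 data35" & vbCrLf
    If oRE.test(strData) Then
        ' Put "test_" in front of the fourth field on each line
        Debug.Print oRE.Replace(strData, "$1test_$2$3")
    End If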

Contents of Immediate window after running the above code:

data1 data2 data3 test_data4 data5
data21 data22 data23 test_data24 data25
data31 data32 data33 test_data34 data35

aikimark

Yes. It is possible to use the not-a-space pattern: ([^ ]+ [^ ]+ [^ ]+ )([^ ]+)( .*?\r\n)

Surrano

vim pattern:

:%s/^\(\([^ ]* \)\{3\}\)\([^ ]*\)/\1test_\3/
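
For readers who want to try the same substitution outside Vim, here is a rough Python equivalent (a sketch only; `prefix_fourth_field` is a hypothetical helper name). In Vim, `[^ ]` never crosses a line end, so the Python character class excludes `\n` explicitly to keep each match on one line.

```python
import re

# Group 1: three "field followed by a space" runs. Group 2: the fourth field.
FOURTH_FIELD = re.compile(r"^((?:[^ \n]* ){3})([^ \n]*)", re.MULTILINE)

def prefix_fourth_field(text, prefix="test_"):
    """Insert `prefix` in front of the fourth field on every line that has one."""
    return FOURTH_FIELD.sub(lambda m: m.group(1) + prefix + m.group(2), text)

sample = "data1 data2 data3 data4 data5\ndata21 data22 data23 data24 data25"
print(prefix_fourth_field(sample))
```

Lines with fewer than three spaces are left unchanged, because the pattern cannot match them.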

lconnell

ASKER

Thanks for the assistance everyone. So there is still a problem here. I only want to select the actual white space in the third column, not the text up to the 3rd white space.
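
One way to pin down only the space itself, sketched in Python on the assumption that a capture group is acceptable: put the third space in its own group and read that group's offset. The name `third_space_columns` is hypothetical, and this illustrates the idea rather than anything Sublime Text does.

```python
import re

# Two "field + space" runs, a third field, then the third space captured as group 1.
THIRD_SPACE = re.compile(r"^(?:[^ \n]+ ){2}[^ \n]+( )", re.MULTILINE)

def third_space_columns(text):
    """Return the 0-based column of the third space on each matching line."""
    # With re.MULTILINE, m.start() is the start of the line, so the
    # difference is the column within that line.
    return [m.start(1) - m.start() for m in THIRD_SPACE.finditer(text)]

sample = "data1 data2 data3 data4 data5\ndata21 data22 data23 data24 data25\n"
print(third_space_columns(sample))  # [17, 20]
```

These columns are where a caret would need to sit on each line to edit at the third space.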

aikimark

@lconnell

Please test the code I posted

lconnell

ASKER

aikimark, it does not work. It actually doesn't match anything.

aikimark

"It actually doesn't match anything."

Does your actual data reflect the sample data you posted?

Have you changed my code to read your data, or are you expecting my sample code to change your file data? The code shows how to use a regular expression to do a replace. I used string literals that were meant to simulate the data in your example.

kaufmed

The problem you face is that ST uses the Boost regex engine, which does not support arbitrary-length lookbehinds, which is what you would need in order to effectively skip over the first two spaces without actually including them in the match. The only thing you can do at this point is to do a find/replace as aikimark described above, except that you would capture the whole string, not just the last non-space:

e.g.

Find