
Need help on a regex

Gene Klamerus
I need to do a somewhat complex text substitution in VIM.

For some reason the output file I have from a report contains some space characters I need to remove. I have a field that should start on a very specific column, but at times there are extra spaces before it (usually 1 or 2). I need to keep all the characters before those extra spaces, but remove the spaces themselves. I cannot simply search for spaces because the file is full of them. I just need to remove a space or two, keeping everything before and after.

For example, I might have several rows like the following:

asdfasfd           qwerqwer         asdfasdfa          asdfasdf
eascerdf           qwercsseas        serseresd          asdfasdf
wecasdf           asdf                    daease              se3dfsa

In this example you can see that the third field in the second row has an extra space I need to remove.  Once I've done that everything will be aligned again.  I need to keep all the characters before and after this unwanted space.  There is not a fixed number of spaces between fields 2 and 3.  I do know the exact position where field 3 should start though.

So, I'd like a regex that eats up a specific number of characters and then checks whether the next character is whitespace. When it is, I want to put back the characters read, but not the whitespace. The rest of the line will take care of itself. The third field starts in position 980.

I was thinking of using something like \(.\{980}\), but it seems to consume only repeats of a single character, not any character.


Alternatively, I could use a check on a character value at a specific point in the row.  If I could check if there is a space in column 980 and delete it (if there is) that works too.
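
For illustration only, the column-check idea could be sketched outside of Vim roughly like this. It is a minimal Python sketch, not anything the report tool provides: the function name drop_space_at is hypothetical, and it assumes columns are counted from 1.

```python
def drop_space_at(line, column):
    """Delete the character at 1-based `column`, but only if it is a space."""
    idx = column - 1  # assumption: columns are 1-based, Python indexes from 0
    if idx < len(line) and line[idx] == " ":
        # Keep everything before and after the unwanted space.
        return line[:idx] + line[idx + 1:]
    # Short lines and lines with a non-space at that column are left alone.
    return line
```

For the real file the call would be something like drop_space_at(line, 980), applied to every line, and applied a second time if two extra spaces are possible.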

Bill Prew
Test your restores, not your backups...
Expert of the Year 2019
Top Expert 2016

Commented:
If it were me, I'd probably approach it slightly differently; using regex to realign data in fixed-width columns feels like a tough challenge. That said, I'm interested to see what some of our regex gurus propose.

I'd probably just write a small AWK script that reads the existing file, parses the columns on space delimiters (treating multiple spaces as one), and then writes the output in fixed-width columns as needed.
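
As a rough idea of what that split-and-reformat approach looks like, here is a minimal sketch in Python rather than AWK. The function name realign and the widths list are hypothetical, and it assumes that no field contains embedded spaces and that no values are missing.

```python
def realign(line, widths):
    """Split on runs of whitespace and pad each field to a fixed width."""
    fields = line.split()  # treats any run of spaces as one delimiter
    # Left-justify each field in its column; a field longer than its
    # width is not truncated, so it would push later columns to the right.
    padded = [field.ljust(width) for field, width in zip(fields, widths)]
    return "".join(padded).rstrip()
```

With widths of [4, 4, 4], both "a   bb  c" and "a bb    c" come out as "a   bb  c". If a field can contain a space, the split step would break it into two fields, so the sketch only holds under the assumptions above.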

None of the data fields contain embedded spaces, is that correct?

And there are no missing values, like the second line below?

asdfasfd          qwerqwer         asdfasdfa          asdfasdf
eascerdf          qwercsseas                          asdfasdf
wecasdf           asdf             daease             se3dfsa


»bp
Gene Klamerus, Technical Architect

Author

Commented:
The problem is that spaces are legitimate values in the fields as well, hence the reason why the data is column aligned.  That's also the way the tool produces it, so I don't have choices on what I'm starting with.

A legitimate field 2 might be a person's name (with spaces).
Bill Prew

Commented:
Well, that makes it a much harder problem. Honestly, I'm not sure a computer can puzzle that out on its own. I'd probably need to see the real data to determine whether there are assumptions or constraints that could be used to work it out.

But if you can have single or multiple spaces in a data value, I don't know how you would tell the spaces in there apart from the one(s) between data values?


»bp
Gene Klamerus, Technical Architect

Author

Commented:
I'm giving it another shot.  I may have had a typo.

I'm starting to expand out on using:

:g/^\(.\{980}\) /s//\1/

It may be that this actually does match any 980 characters, not just repeats of one. I just need to try it again very carefully (I hope).
Terry Woods, IT Guru
Most Valuable Expert 2011

Commented:
When I copy and paste your source data, it looks like this:
asdfasfd           qwerqwer         asdfasdfa          asdfasdf
eascerdf           qwercsseas        serseresd          asdfasdf
wecasdf           asdf                    daease              se3dfsa


Was it meant to look like this perhaps?
asdfasfd          qwerqwer         asdfasdfa          asdfasdf
eascerdf          qwercsseas        serseresd          asdfasdf
wecasdf           asdf             daease             se3dfsa


If there could only be 1 extra space on a line, then we probably wouldn't need to worry about spaces between names, as the first character of a field should never be a space. However, 2 extra spaces on a line could be confused with a name that has an initial followed by a space, e.g. J Bloggs.

Is there always more than 1 space between fields?
Gene Klamerus, Technical Architect

Author

Commented:
Okay, it seems I was mistaken originally.  The substitution:

:g/^\(.\{1130}\) /s//\1/

will gobble up any character (not just repeats of a specific character).  The actual column I needed to align was 1130, which has some white spaces in it.
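
For anyone wanting the same substitution outside of Vim, here is a minimal Python sketch of the idea. The function name remove_extra_space is hypothetical, and prefix_len stands in for the 1130 used above. As in the Vim pattern, the dot matches any character and the count repeats the dot itself, so the prefix can be any mix of characters.

```python
import re

def remove_extra_space(line, prefix_len):
    """Drop one space that directly follows the first `prefix_len` characters."""
    # ^(.{N}) captures any N characters from the start of the line;
    # the trailing space is matched but left out of the replacement.
    pattern = re.compile(r"^(.{%d}) " % prefix_len)
    return pattern.sub(r"\1", line, count=1)
```

Each call removes at most one space, and lines that are too short or that have a non-space at that position are returned unchanged. To cover two extra spaces, the function can be applied twice, just as the Vim command can be run twice.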
Gene Klamerus, Technical Architect
Commented:
I appreciate the insight/comments.  Not sure how to recognize that since it wasn't really a solution.  Not sure if points get awarded with this mechanism, but I think maybe.