asked on

Regular Expression

Hi,

I have a question regarding regular expressions. I've tried looking it up, but haven't had any luck finding a solution so far. I have been given a comma separated value file that has some extra line breaks added into it. It's a 250MB text file so I don't want to go hunting through by hand to find all the extra line spaces and remove them myself.

Each new line should start out with one of three values:

NAYT,
CREN,
SF,

I need to find all the lines that do not start with those words so that I can remove the extra line breaks. The only thing I've been able to find so far would be something along the lines of \n[^NCS] which would only check the first character, when I need it to check the entire word.

I'm offering 500 points for this because I'm in kind of a rush.

Thanks a lot,

--Andrew

Headspace

...if you're only worried about extra line breaks, and not whitespace in front of any of the three values, why not just search for the lines beginning with whitespace?

StillUnAware

You could do this:

to find the lines that do not start with those three:

\n[^[(NAYT,)|(CREN,)|(SF,)]].*

BTW, are those expressions case sensitive? What kind of regex are You using, unix style or some customized?

Headspace

...or completely blank lines, for that matter (using ^$)...

jmgst116

^\n[^(NAYT|CERN|SF)]

this should find every line not starting with the above

cf_mi6

ASKER

I'm trying to do a replacement in Textpad before I can import the data. I'm not sure if it's texpad itself or not, but it's still skipping everything that starts with any of the characters within the square brackets rather than treating it as a string match.

jmgst116

i dont know the exact data you are working with but if CERN, SF,NAYT, appear only as leading data

you can probably remove all the \n first then readd them when NAYT,CREN,SF, are found

in textpad

to remove all the \n in doc
find: \n
replace:

to find CERN, SF,NAYT, and replace with \n then what ever was found
find: (CERN|SF|NAYT),
replace: \n\1,

cf_mi6

ASKER

Unfortunately the values show up at other points in the .csv as well.

Whitespace check won't work because it breaks in the middle of a description field, so the values are shown as leading characters on the next line. There's gotta be a regexp that'll work in Textpad for this...I think...

Headspace

if you'll respond to my comments, cf_mi6, i'll give you what you need : )

...let me restate them as questions.

1. do you have to worry about whitespace in front of NAYT, CREN, or SF?

2. is the only thing that has to be removed from the text file the extra lines that have no data in them?

mish33

You can get yourself grep (http://unxutils.sourceforge.net/) or perl (http://www.activestate.com):

grep "^(CERN|SF|NAYT)," 250MBfile.csv >good.csv

Headspace

well, I won't be around for a few days, so I'll post this now hoping that you'll answer 'yes' to #2 above:

Since you're using the replace tool in TextPad, let's do this one operation at a time to make the 'undo' easier if mistakes are made. I'll lay this out in a 'find' field, 'replace' field type fashion, and DON'T INCLUDE THE ' (single quotes) when you enter the values.

1. find: '\n'
replace: 'nothing' (that means don't put anything in the 'replace' field)
...this will find all of the line breaks in the file and remove them. Once we've removed unwanted spaces, we'll put the line breaks back in where they need to be.

2. find: ','
replace: ', ' (that's a comma followed by one space)
...this will ensure that at least one space exists after every comma. since this is a comma delimited file, then there shouldn't be any commas in any of the data. If there are, then there must be an escape character that you can include here to ensure that you don't replace commas with ', '. It'll be up to you to change this operation if you need to.

3. find: ', [ ]+' (there's a space inside of the brackets)
replace: ', ' (comma followed by one space)
...this will remove any extra whitespace between the comma delimiter and the next value, so that only one space exists after every comma.

4. find: ' WORD' (space in front of keyword)
replace: '\nWORD'
...this will replace the line breaks, placing them in the appropriate position before each keyword. You'll have to do this one 3 times, one for NAYT, one for CREN, and one for SF.

5. find: ',$'
replace: 'nothing' (again, don't put anything in the replace field)
...this will remove any left-over commas from the end of each line.

I know that it seems like alot to do, but i think this will be the easiest way to reformat your textfile using the replace tool, and you'll be able to monitor the changes as you go (which will make it easier to modify the expression if need be).

Oh, and the reason the other statements aren't working for you is due to TextPad's syntax; the standard 'or' (|) in regular expressions must be typed as '\|' in TextPad. TextPad also doesn't use parenthesis in a standard manner, which helps to compound the problem : ).

wil

ASKER CERTIFIED SOLUTION

jmgst116

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial