Link to home
Start Free TrialLog in
Avatar of hankknight
hankknightFlag for Canada

asked on

Delete lines from long file UNLESS the line starts with specified character (Perl)

Hello,

I have a file that is more than 800 mb (it is a .sql dump saved in UTF-8).

It has useful information but also allot of junk.

Most of the lines that I do NOT need start with funny characters like this:

ÂÈÒØÙæçè÷øÿ

So I want to REMOVE all lines UNLESS they start with one of the following:

abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'"`\/-#$*()

Lines may ALSO begin with a space or a tab.

If a line starts with ANYTHING else then I want the entire line DELETED.

How can I do this with a Perl script that I can exicute from a command line?

Thanks!
SOLUTION
Avatar of ravenpl
ravenpl
Flag of Poland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of Adam314
Adam314

You might want to use [\x20-\x7E].  This will get all normal characters.
See here for the normal characters that you are missing in the above:
http://www.asciitable.com/

You could use this in the command given my ravenpl, just change the characters matched.
Avatar of hankknight

ASKER

Thanks.

I tried both of your ideas and neither one worked.  They both created 800mb files, meaning they deleted nothing.

cat dump.sql | perl -ne 'next unless(m/^[\sa-zA-Z\"\`\\\/\-\#\$\*\(\)]/); print' > fix1.sql
cat dump.sql | perl -ne 'next unless(m/^[\x20-\x7E]/); print' > fix2.sql

I have placed a small except from the real file here for testing (just the first 1.5 mb)
https://filedb.experts-exchange.com/incoming/ee-stuff/4405-fixDump.zip


The dump.sql file has 46 lines, with lines 45 and 46 being about 1,000,000 characters long each.
Lines 45 and 46 are each an INSERT statement.
It looks as if must of that data is an embeded image (looks like JPEG data at quick look).

The fixed.sql file has 45 lines - only 1 INSERT statement.
The INSERT INTO that begins at byte 2142 is actually data within the 11 (content) column in the first INSERT.

I think if you try to use the fixed.sql, you'll get an error.
What are you trying to do?  Are you trying to remove the content column?  Obviously, that data will be lost if you do.  But if you do, you need to remove it properly, or the file will become useless.
I have a .sql dump that is very important to me.  My server crashed and I MUST restore this .sql file backup.  

But it is 800mb large and when I try to restore it from a command line I get a mySQL error.

I don't mind manually going in there and removing some junk but there is way too much junk to remove.

I have another idea?  Maybe remove ALL lines that are longer than 500 characters and leave the rest?  What was left would be more manageable for me.

Yes, most of the useless data are binary jpgs  or png images that I do not need.
In the sample that you posted, if you remove all lines > 500 characters, you will be left with only the create table - no data.

What is the error you get when you try to restore the file?  
This is the error I am getting:

ERROR 1064 (42000) at line 158: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''PNG\r\n\Z\n\0\0\0\rIHDR\0\0Ø\0\0\0\0\0´Âø·\0\0\0+tEXtCreation Time\0Thu 9' at line 1

Can you think of a way to fix it?
Well, if you don't care about the content column data, we could remove it from the .sql file.  You should still be able to get all other data.

You might want to ask in the MySQL area to see if anyone there knows of a way to solve the problem without losing your data.
I have the mySQL people and they can't help    :-(

How can I get rid of all content column data?
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Was this file ever transferred via FTP?
->>Was this file ever transferred via FTP?

Yes.  Maybe that is my problem!  I tried to transfer it three times and it was bust all three times. What is your idea?
I'm not certain on the format of the file, but I'm guessing the file needed to be transferred as BINARY.  If it was transferred as ASCII, that would have corrupted the data.

If the FTP transfer failed, that might have made things worse...