hankknight
Delete lines from long file UNLESS the line starts with specified character (Perl)
Hello,
I have a file that is more than 800 mb (it is a .sql dump saved in UTF-8).
It has useful information but also a lot of junk.
Most of the lines that I do NOT need start with funny characters like this:
ÂÈÒØÙæçè÷øÿ
So I want to REMOVE all lines UNLESS they start with one of the following:
abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ '"`\ /-#$*()
Lines may ALSO begin with a space or a tab.
If a line starts with ANYTHING else then I want the entire line DELETED.
How can I do this with a Perl script that I can execute from a command line?
Thanks!
SOLUTION
ASKER
Thanks.
I tried both of your ideas and neither one worked. They both created 800mb files, meaning they deleted nothing.
cat dump.sql | perl -ne 'next unless(m/^[\sa-zA-Z\"\`\\\/\-\#\$\*\(\)]/); print' > fix1.sql
cat dump.sql | perl -ne 'next unless(m/^[\x20-\x7E]/); print' > fix2.sql
I have placed a small excerpt from the real file here for testing (just the first 1.5 mb)
https://filedb.experts-exchange.com/incoming/ee-stuff/4405-fixDump.zip
The dump.sql file has 46 lines, with lines 45 and 46 being about 1,000,000 characters long each.
Lines 45 and 46 are each an INSERT statement.
It looks as if most of that data is an embedded image (looks like JPEG data at a quick glance).
The fixed.sql file has 45 lines - only 1 INSERT statement.
The INSERT INTO that begins at byte 2142 is actually data within column 11 (the content column) of the first INSERT.
I think if you try to use the fixed.sql, you'll get an error.
What are you trying to do? Are you trying to remove the content column? Obviously, that data will be lost if you do. But if you do, you need to remove it properly, or the file will become useless.
ASKER
I have a .sql dump that is very important to me. My server crashed and I MUST restore this .sql file backup.
But it is 800mb large and when I try to restore it from a command line I get a mySQL error.
I don't mind manually going in there and removing some junk but there is way too much junk to remove.
I have another idea: maybe remove ALL lines that are longer than 500 characters and keep the rest? What was left would be more manageable for me.
Yes, most of the useless data are binary jpgs or png images that I do not need.
In the sample that you posted, if you remove all lines > 500 characters, you will be left with only the CREATE TABLE statement - no data.
What is the error you get when you try to restore the file?
ASKER
This is the error I am getting:
ERROR 1064 (42000) at line 158: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''PNG\r\n\Z\n\0\0\0\rIHDR\ 0\0Ø\0\0\0 \0\0´Âø·\0 \0\0+tEXtC reation Time\0Thu 9' at line 1
Can you think of a way to fix it?
Well, if you don't care about the content column data, we could remove it from the .sql file. You should still be able to get all other data.
You might want to ask in the MySQL area to see if anyone there knows of a way to solve the problem without losing your data.
ASKER
I have asked the MySQL people and they can't help :-(
How can I get rid of all content column data?
ASKER CERTIFIED SOLUTION
Was this file ever transferred via FTP?
ASKER
->>Was this file ever transferred via FTP?
Yes. Maybe that is my problem! I tried to transfer it three times and it was bust all three times. What is your idea?
I'm not certain on the format of the file, but I'm guessing the file needed to be transferred as BINARY. If it was transferred as ASCII, that would have corrupted the data.
If the FTP transfer failed, that might have made things worse...
See here for the normal characters that you are missing in the above:
http://www.asciitable.com/
You could use this in the command given by ravenpl, just change the characters matched.