jculkincys
asked on
sed remove <html>
Hello
I am having trouble getting these sed commands to perform the desired results
I have to remove "<html>" and "</html>" from all the .txt files in a directory.
here is what I have so far.
sed -e 's/\<html\>//g' *.txt
Thanks for your help
OK - that looks long, but that's only because I've explained it really carefully, and provided examples:)
Of course, you'd spotted that deliberate mistake, .....hadn't you?
ASKER
Thanks for the in-depth answer
Hmm I must be missing something simple
I do
sed -e "s/<\/html>//g" *.txt
Then I do
grep "<\/html>" *.txt
in the same directory, and I get a lot of results where "</html>" is still present.
Any ideas?
Yes - all you are doing is reading in every *.txt file, and then copying it to the screen (with the appropriate bit removed). You are not altering it in the file.
Read through the solution above carefully, and you'll see that the file is being read, filtered and then saved to another file. The original files are then removed and replaced by the new ones (that have the html tags removed)
Just do it on 1 file first - the command line created is:
cat 1.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/1.txt
and you'll see that the appropriate tags are removed:
cat 1.txt
prints out the file "1.txt".
This is then piped to a sed expression. The first expression strips out the <html asdfsvas>, and the second strips out </html>. The final bit puts the end result into a new file (in the outputdir that you've previously created).
Obviously, you don't want to type this out for each file, so awk is used to build the command line from a list of *.txt files in your directory. Once you're happy that the command lines are all correct, you pipe them to /bin/bash in order to execute them.
HTH:)
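To see the screen-versus-file point for yourself, here is a small sketch. It assumes a throwaway directory from mktemp and a made-up three-line 1.txt (neither is from the thread); the sed expressions are the ones quoted above.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
printf '<html>\nHello\n</html>\n' > "$workdir/1.txt"

# sed on its own only prints the filtered text; 1.txt is left unchanged.
sed -e 's/<\/html>//g' "$workdir/1.txt" > /dev/null
still_there=$(grep -c '</html>' "$workdir/1.txt" || true)

# Redirect the filtered text to a new file, then replace the original with it.
sed -e 's/<html.*>//g' "$workdir/1.txt" | sed -e 's/<\/html.*>//g' > "$workdir/1.txt.new"
mv "$workdir/1.txt.new" "$workdir/1.txt"
remaining=$(grep -c 'html' "$workdir/1.txt" || true)

echo "lines with </html> after plain sed: $still_there"
echo "lines with html after redirect and mv: $remaining"
```

The first count stays at 1 because nothing was written back; the second drops to 0 once the filtered output has replaced the file.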
ASKER
OK, now I think I understand better - thanks
This should work if I take out the "bin", right?
find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\" > outputdir/" $0 '} | bash
Yes, no problems as long as bash is in your path. This technique (building the command from a find) is *EXTREMELY* POWERFUL. I normally get people to run the command first without the bash at the end so that they can check what they are running before it happens. Of course, always make a backup, and *ESPECIALLY* if the command is being run as root, be very careful when checking your commands.
I really like this type of question because it shows off the real power of linux, in that you can build these fantastically complicated commands from the little bricks that exist:)
Another really useful trick for this type of thing is if you need to run the command regularly, then set the command up as an alias, or stick it in a shell script.
HTH:)
ASKER
Yea I really like the answer too
Do you have any simpler useful examples of building a command from find?
find *.txt | awk {'print "cat " $0 '} | bash
This just prints the contents of all the .txt files, one after another. It is the simplest type of thing....you just build up the command as you need it. If you can work out the command for 1 file, then you can build it up for all of the files.
Key tricks are grepping for particular types of files before they are piped into the awk command line, and remembering that 'special' chars have to be escaped with a backslash when inside the awk statement.
awk is incredibly powerful: you can write whole programs in it, which can be put in a file and called from the command line. Likewise with sed. It's just a case of understanding how each of the little tools works. When combined together in the right way, they become incredibly powerful:)
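Another small one, as a sketch: count the lines in each .txt file. The file names and the scratch directory are invented for illustration, and it assumes a find that supports -maxdepth and file names without spaces (generated command lines break on spaces).

```shell
# Scratch directory with a few sample files (names are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\nb\n' > one.txt
printf 'c\n' > two.txt
touch notes.log

# Step 1: only print the commands. Nothing is executed yet.
commands=$(find . -maxdepth 1 -name '*.txt' | sort | awk '{print "wc -l < " $0}')
echo "$commands"

# Step 2: once the printed list looks right, hand it to a shell.
counts=$(echo "$commands" | bash)
echo "$counts"
```

Step 1 is the "run it without the bash at the end" check; notes.log never shows up in the list because find has already filtered on the name.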
http://www.student.northpark.edu/pemente/sed/sed1line.txt
http://www.cs.uu.nl/docs/vakken/st/nawk/nawk_41.html
Provide some great examples of uses for awk and sed
:)
ASKER
Thanks a lot
ASKER
Would I need all 3 \\\ in this statement?
find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\" > outputdir/" $0 '} | bash
or would this work?
find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\/html.*>//g\" > outputdir/" $0 '} | bash
The awk cmd prints what is between the " ".
In order to get it to print \/ (which is needed to produce the / inside the sed statement), you need the 3 \\\
If in doubt try it out without the | bash at the end, then you can see what is produced. You can then pick one line out and try it manually if you want to confirm.
As always, make backups before doing this type of thing:)
HTH:)
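If you want to see what each backslash count actually produces, here is a sketch that prints the generated sed command three ways. The comments describe gawk, which is the awk that printed the `\/' warning in this thread; other awk implementations may handle \/ inside a string differently, so treat the one- and three-backslash results as an assumption to check on your own system.

```shell
# Each awk program prints one sed command; only the backslash count differs.
# awk warnings are discarded so that just the generated text is captured.
one=$(echo x | awk '{print "sed -e \"s/<\/html>//g\""}' 2>/dev/null)
two=$(echo x | awk '{print "sed -e \"s/<\\/html>//g\""}' 2>/dev/null)
three=$(echo x | awk '{print "sed -e \"s/<\\\/html>//g\""}' 2>/dev/null)

# gawk: "one" loses the backslash, so sed would see s/</html>//g and fail.
# "two" is \\ followed by /, which prints \/ with no unknown escape involved.
# gawk: "three" is \\ followed by \/, which also prints \/ but triggers the warning.
printf 'one:   %s\ntwo:   %s\nthree: %s\n' "$one" "$two" "$three"
```

So under gawk the single backslash is not enough, three works with a warning, and two gives the same text without relying on how \/ is treated.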
ASKER
Hmmm
Still having a little trouble
find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '} | bash
results in
find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '} | bash
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: {print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 }
awk: cmd. line:1: ^ backslash not last character on line
find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '}
errors
find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" > $outputdir/" $0 '}
still errors
find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" > $outputdir/" $0 '}
still errors
find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" > $outputdir/" $0 '}
OK......
There is nothing wrong with your command....*EXCEPT* for the length. I'm getting a similar issue as well. You need to do the stripping as 2 separate processes: one to strip out the html bits, and then a second process to strip out the pre.
Alternatively, you can put the awk or sed elements into a script file (sedit):
--------------------------
#!/bin/sed -f
s/<html.*>//g
s/<\/html>//g
#Add other substitutions as necessary
--------------------------
chmod +x sedit
You may find this approach easier as you don't have to worry about all of the \ and "s
Now change your command line to be along the line of
find *.* | awk {'print "./sedit " $0 '} | /bin/bash
And you should be away:)
HTH...
In fact, that looks a lot neater for really complex scripts! Almost elegant in fact!
ASKER
I may go with the shorter version but I don't completely understand it yet
I have been working on it and I am really close
find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>//ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>//ig\" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>i//ig\" > $outputdir/"$0 '}
The issue I am having is with -----------> > $outputdir/"$0 '}
I think I will be able to get it to work this way hopefully
I really appreciate your help
I think that the problem is that the command line is longer than 255 chars, so you need to shorten the line by taking out some rules, *OR* do the process in a number of stages (First strip out the html bits, and then the pre bits)
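Before splitting the job up, it may be worth ruling out quoting as the cause. The failing commands earlier in the thread contain /ig" with no backslash before the quote, which ends the awk string early and would explain "backslash not last character on line". The sketch below is only a way to test both theories: the tag names and 1.txt are placeholders, and the first command is deliberately longer than 255 characters.

```shell
# 1) A print string well over 255 characters: does awk cope with the length?
long=$(echo 1.txt | awk '{print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<head.*>//g\" | sed -e \"s/<pre.*>//g\" | sed -e \"s/<title.*>//g\" | sed -e \"s/<body.*>//g\" | sed -e \"s/<meta.*>//g\" | sed -e \"s/<link.*>//g\" | sed -e \"s/<font.*>//g\" | sed -e \"s/<span.*>//g\" | sed -e \"s/<table.*>//g\" > outputdir/" $0}')
echo "${#long} characters generated"

# 2) A short program with one unescaped quote (right after /ig): does it parse?
if echo 1.txt | awk '{print "cat " $0 " | sed -e \"s/<pre.*>//ig" | sed -e \"s/<b.*>//ig\""}' 2>/dev/null; then
  quoting_ok=yes
else
  quoting_ok=no
fi
echo "short command with unescaped quote parses: $quoting_ok"
```

If the long command prints fine and the short one fails, the quote is the likelier culprit. Note also that several expressions in the failing commands end in >/ig rather than >//ig, which leaves the sed substitution one / short.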
ASKER
Hmm I guess I would need to go with the shorter version but I don't quite understand how I would output the results.
Try it! - The output is EXACTLY the same process - you've merely moved the sed commands to a file rather than trying to put all the commands on the command line:)
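As a sketch of how the output step looks with the rules in a file: the file name strip.sed, the sample 1.txt and the scratch directory are all invented here. It also swaps in two things that are not from the thread: a plain for loop in place of the find | awk | bash pipeline, and [^>]* in place of .* so that only the tag itself is removed rather than everything up to the last > on the line.

```shell
# Scratch directory with one sample file (contents are made up).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir outputdir
printf '<html lang="en">\n<pre>text</pre>\n</html>\n' > 1.txt

# The rules live in a plain file, so no awk or shell escaping is needed.
cat > strip.sed <<'EOF'
s/<html[^>]*>//g
s/<\/html>//g
s/<pre[^>]*>//g
s/<\/pre>//g
EOF

# Same redirect as before: read the original, write the cleaned copy.
for f in *.txt; do
  sed -f strip.sed "$f" > "outputdir/$f"
done

cat outputdir/1.txt
```

Adding another tag is then one more line in strip.sed, and the loop handles file names with spaces because "$f" is quoted.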
First, make a backup:
cp -R sourcedir backupdir
Now make a working copy:
cp -R sourcedir workingdir
The sed statements that you need are as follows:
sed -e "s/<\/html>//g" in order to remove </html>
and
sed -e "s/<html.*>//g" in order to remove <html and another bit of text>
We now have to automate it:
cd workingdir
mkdir outputdir
Check that the following line produces the desired output to remove the lines:
find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\" > outputdir/" $0 '}
You'll see that the produced command line takes the output and passes it through the 2 sed filtering processes, and then copies the file to the output directory. Once we are happy with the result, which should look like this:
--------------------------
cat 1.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/1.txt
cat 2.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/2.txt
cat 3.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/3.txt
cat 4.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/4.txt
--------------------------
then we send the commands that we have printed out to a bash shell:
find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\" > outputdir/" $0 '} | /bin/bash
Check the output directory and files to see if it has done exactly what you want. If it has:
mv sourcedir sourcedirbackup
mv outputdir sourcedir
...and your source dir now has all the html bits that you needed removing removed:)
HTH:)