sed remove <html>

Hello

I am having trouble getting these sed commands to produce the desired result.
I have to remove "<html>" and "</html>" from all the .txt files in a directory.

Here is what I have so far:

sed -e 's/\<html\>//g' *.txt

Thanks for your help

pjedmond

First, back up the directory:

cp -R sourcedir backupdir

Now make a working copy:

cp -R sourcedir workingdir

The sed statements that you need are as follows:

sed -e "s/<\/html>//g" in order to remove </html>
and
sed -e "s/<html.*>//g" in order to remove <html and another bit of text>
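
Both substitutions can be tried on a sample line before touching any files. The sketch below is only an illustration: the sample text is made up, and it compares `<html.*>` with the narrower `<html[^>]*>`, because `.*` is greedy and will also swallow any other tags that follow on the same line.

```shell
# Made-up sample line, used only for illustration.
sample='<html lang="en"><b>Hello</b></html>'

# Greedy version: .* runs to the LAST ">" on the line, so everything goes.
greedy=$(printf '%s\n' "$sample" | sed -e 's/<html.*>//g')

# Narrower version: [^>]* stops at the first ">", so only the tag is removed.
# Two -e expressions in one sed call handle the opening and closing tags.
narrow=$(printf '%s\n' "$sample" | sed -e 's/<html[^>]*>//g' -e 's/<\/html>//g')

printf 'greedy: [%s]\n' "$greedy"
printf 'narrow: [%s]\n' "$narrow"
```

If every tag in your files sits on a line of its own, the two patterns behave the same; the difference only shows up when other markup shares the line.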

We now have to automate it:

cd workingdir
mkdir outputdir

Check that the following line prints the commands you expect (nothing is run yet):

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\" > outputdir/" $0 '}

You'll see that the generated command line passes each file through the 2 sed filters, and then writes the result to the output directory. Once we are happy with the result, which should look like this:
----------------------------8X--------------------------------------
cat 1.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/1.txt
cat 2.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/2.txt
cat 3.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/3.txt
cat 4.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/4.txt
----------------------------8X--------------------------------------

then we send the commands that we have printed out to a bash shell:

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\" > outputdir/" $0 '} | bin bash

Check the output directory and files to see if it has done exactly what you want. If it has, go back up to the parent directory and swap the directories over:

cd ..

mv sourcedir sourcedirbackup

mv workingdir/outputdir sourcedir

...and your source dir now has all the html bits you wanted removed:)

HTH:)

pjedmond

OK - that looks long, but that's only because I've explained it really carefully, and provided examples:)

pjedmond

Of course, you'd spotted that deliberate mistake, .....hadn't you?

jculkincys

ASKER

Thanks for the in-depth answer.

Hmm, I must be missing something simple.

I do
sed -e "s/<\/html>//g" *.txt
Then I do
grep "<\/html>" *.txt
in the same directory, and I get a lot of results where "</html>" is still present.

Any ideas?

pjedmond

Yes - all you are doing is reading in every *.txt file and printing it to the screen (with the appropriate bit removed). You are not altering the file itself.

Read through the solution above carefully, and you'll see that the file is being read, filtered and then saved to another file. The original files are then removed and replaced by the new ones (that have the html tags removed)
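
The difference between printing and saving can be seen on a single throwaway file. This is a minimal sketch; the file name `demo.txt` and the temporary directory are made up for the example.

```shell
tmpdir=$(mktemp -d)
printf '<html>\nsome text\n</html>\n' > "$tmpdir/demo.txt"

# sed on its own only prints the filtered text; demo.txt is left unchanged.
sed -e 's/<\/html>//g' "$tmpdir/demo.txt" > /dev/null
before=$(grep -c '</html>' "$tmpdir/demo.txt" || true)

# Redirect the output to a new file, then replace the original with it.
sed -e 's/<html[^>]*>//g' -e 's/<\/html>//g' "$tmpdir/demo.txt" > "$tmpdir/demo.txt.new"
mv "$tmpdir/demo.txt.new" "$tmpdir/demo.txt"
after=$(grep -c 'html' "$tmpdir/demo.txt" || true)

echo "matching lines before: $before, after: $after"
```

Writing to a second file and then moving it into place avoids redirecting sed's output onto the file it is still reading, which would truncate it.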

Just do it on 1 file first - the command line created is:

cat 1.txt | sed -e "s/<html.*>//g" | sed -e "s/<\/html.*>//g" > outputdir/1.txt

and you'll see that the appropriate tags are removed:

cat 1.txt

prints out the file "1.txt"

This is then piped to a sed expression. The first expression strips out the <html asdfsvas>, and the second strips out </html>. The final bit puts the end result into a new file (in the outputdir that you've previously created).

Obviously, you don't want to type this out for each file, so awk is used to build the command line from a list of *.txt files in your directory. Once happy that all of the command line is correct, you pipe the command lines to /bin/bash in order to execute them.
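
For comparison, the same per-file steps can also be written as a plain shell loop instead of generating commands with awk. This is an alternative sketch, not the method described above; the temporary directory and sample files are created only for the demonstration, and the narrower `[^>]*` pattern is an assumption about what you want matched.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '<html>\nfirst\n</html>\n' > 1.txt
printf '<html lang="en">second</html>\n' > 2.txt
mkdir outputdir

# One sed call per file; quoting "$f" keeps names with spaces intact.
for f in *.txt; do
    sed -e 's/<html[^>]*>//g' -e 's/<\/html>//g' "$f" > "outputdir/$f"
done

ls outputdir
```

The loop has no dry-run step, so it is worth testing on a copy of the directory first.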

HTH:)

jculkincys

ASKER

OK, now I think I understand better - thanks

this should work if I take out the "bin" right?

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\" > outputdir/" $0 '} | bash

pjedmond

Yes, no problem as long as bash is in your path. This technique (building the command from a find) is *EXTREMELY* POWERFUL. I normally get people to run the command first without the bash at the end so that they can check what they are running before it happens. Of course, always make a backup, and *ESPECIALLY* if the command is being run as root, be very careful when checking your commands.
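
The "look first, then run" habit can be sketched like this. The file names and the generated `cat ... > ....bak` command are made up for the example, and `printf '%s\n' *.txt` stands in for the find; it also assumes file names without spaces or shell metacharacters, since the names are pasted into a command line as-is.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'one\n' > a.txt
printf 'two\nthree\n' > b.txt

# Step 1: build the commands and only look at them.
generated=$(printf '%s\n' *.txt | awk '{ print "cat " $0 " > " $0 ".bak" }')
printf '%s\n' "$generated"

# Step 2: once the printed commands look right, hand them to a shell.
printf '%s\n' "$generated" | sh
```

Keeping the generated text in a variable means the commands that were inspected are exactly the ones that get executed.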

I really like this type of question because it shows off the real power of Linux, in that you can build these fantastically complicated commands from the little bricks that exist:)

Another really useful trick for this type of thing is if you need to run the command regularly, then set the command up as an alias, or stick it in a shell script.

HTH:)

jculkincys

ASKER

Yea I really like the answer too

Do you have any simpler useful examples of building a command from find?

pjedmond

find *.txt | awk {'print "cat " $0 '} | bash

is the simplest type of thing: it just prints the contents of all the .txt files one after another. You just build up the command as you need it. If you can work out the command for 1 file, then you can build it up for all of the files.

Key tricks are grepping for particular types of files before they are piped into the awk command line, and remembering that 'special' chars have to be escaped with a backslash inside the awk statement.

awk is incredibly powerful: you can write whole programs in it, which can be put in a file and called from the command line. Likewise with sed. It's just a case of understanding how each of the little tools works. When combined together in the right way, they become incredibly powerful:)

pjedmond

http://www.student.northpark.edu/pemente/sed/sed1line.txt

http://www.cs.uu.nl/docs/vakken/st/nawk/nawk_41.html

Provide some great examples of uses for awk and sed

:)

jculkincys

ASKER

Thanks a lot

jculkincys

ASKER

Would I need all 3 \\\ in this statement?

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\\\/html.*>//g\" > outputdir/" $0 '} | bash

or would this work?

find *.txt | awk {'print "cat " $0 " | sed -e \"s/<html.*>//g\" | sed -e \"s/<\/html.*>//g\" > outputdir/" $0 '} | bash

pjedmond

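The awk cmd prints what is between the " ".
In order to get it to print \/ (which sed needs so that the / is not taken as the end of the pattern) you need the 3 \\\: the first two print a single \, and the third escapes the /. With only one backslash, awk prints a bare / and the sed expression breaks.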

If in doubt try it out without the | bash at the end, then you can see what is produced. You can then pick one line out and try it manually if you want to confirm.
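
The escaping can be checked in isolation with a one-line awk program. This sketch uses the two-backslash spelling `\\/`, which prints `\/` because `\\` is a backslash and the `/` needs no escape of its own inside an awk string; in gawk the three-backslash form prints the same text but may warn that the escape sequence `\/` is treated as a plain `/`. The file name `1.txt` is just a placeholder.

```shell
# Inside an awk string, \\ yields one backslash and \" yields a double quote.
# So \\/ prints \/ , which is what sed needs to see.
line=$(echo 1.txt | awk '{ print "sed -e \"s/<\\/html>//g\" " $0 }')
printf '%s\n' "$line"
```

The awk program is wrapped in single quotes here so that the shell passes every backslash through untouched.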

As always, make backups before doing this type of thing:)

HTH:)

jculkincys

ASKER

Hmmm

Still having a little trouble

find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '} | bash

results in

find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '} | bash
awk: cmd. line:1: warning: escape sequence `\/' treated as plain `/'
awk: cmd. line:1: {print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 }
awk: cmd. line:1: ^ backslash not last character on line

pjedmond

find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>/ig\" > $outputdir/" $0 '}

errors

find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" | sed -e \"s/<title.*>//ig\" > $outputdir/" $0 '}

still errors

find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>/ig" > $outputdir/" $0 '}

still errors

find client.zip | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>/ig\" | sed -e \"s/<pre.*>//ig\" > $outputdir/" $0 '}

OK......

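That last one runs without an awk error, and the bit I removed to get there is the closing-pre rule: in \"s/<\\\/pre.*>/ig" the quote after ig has no backslash in front of it, so awk thinks the string ends there, which is what the "backslash not last character on line" error is complaining about. Several of the closing-tag rules are also missing a slash (>/ig should be >//ig). In a line this long, slips like that are very hard to spot, so it is easier to do the stripping as 2 separate processes: one to strip out the html bits, and then a second process to strip out the pre.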

Alternatively, you can put the awk or sed elements into a script file (sedit):

-------------------------------------------8X-----------------------------------
#!/bin/sed -f
s/<html.*>//g
s/<\/html>//g

#Add other substitutions as necessary
-------------------------------------------8X-----------------------------------

chmod +x sedit

You may find this approach easier as you don't have to worry about all of the \ and "s

Now change your command line to be along the line of

find *.* | awk {'print "./sedit " $0 '} | /bin/bash

And you should be away:)

HTH...

pjedmond

In fact, that looks a lot neater for really complex scripts! Almost elegant in fact!

jculkincys

ASKER

I may go with the shorter version but I don't completely understand it yet

I have been working on it and I am really close

find *.* | awk {'print "cat " $0 " | sed -e \"s/<html.*>//ig\" | sed -e \"s/<\\\/html.*>//ig\" | sed -e \"s/<head.*>//ig\" | sed -e \"s/<\\\/head.*>//ig\" | sed -e \"s/<pre.*>//ig\" | sed -e \"s/<\\\/pre.*>//ig\" | sed -e \"s/<title.*>//ig\" | sed -e \"s/<\\\/title.*>i//ig\" > $outputdir/"$0 '}

The issue I am having is with -----------> > $outputdir/"$0 '}

I think I will be able to get it to work this way hopefully

I really appreciate your help

pjedmond

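I think the problem there is $outputdir rather than the length: everything inside the single quotes is passed to awk untouched, so awk prints the text $outputdir literally, and it only gets expanded later, by the bash that runs the generated commands. If that variable is not set and exported, the output path collapses to /filename. Use the literal directory name (outputdir/) as in the earlier examples. There is also a stray i in the last rule (>i//ig). If the line is getting too unwieldy to check, shorten it by taking out some rules, *OR* do the process in a number of stages (first strip out the html bits, and then the pre bits).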
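
To see how a shell variable such as `$outputdir` can be handed to awk explicitly instead of being left inside the single-quoted program, here is a minimal sketch using awk's standard `-v` option. The value `outputdir`, the awk variable name `out` and the file name `1.txt` are placeholders for the example.

```shell
outputdir=outputdir   # hypothetical value; set it to the real directory

# -v copies the shell variable into an awk variable named out, so the
# generated command contains the directory name rather than "$outputdir".
cmd=$(echo 1.txt | awk -v out="$outputdir" '{ print "cat " $0 " > " out "/" $0 }')
printf '%s\n' "$cmd"
```

With this form the printed commands show the real path, so a wrong or empty directory is visible during the dry run.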

jculkincys

ASKER

Hmm I guess I would need to go with the shorter version but I don't quite understand how I would output the results.

pjedmond

Try it! - The output is EXACTLY the same process - you've merely moved the sed commands to a file rather than trying to put all the commands on the command line:)
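
As a rough sketch of how the output step looks with the rules kept in a file: the redirect to outputdir is simply part of the text that awk prints. The file name `strip.sed`, the sample input and the `[^>]*` patterns are assumptions for the example, and the script is run with `sed -f` so that it does not depend on where sed is installed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '<html>\n<pre>text</pre>\n</html>\n' > 1.txt
mkdir outputdir

# The substitutions live in a file, one per line, with no shell quoting.
cat > strip.sed <<'EOF'
s/<html[^>]*>//g
s/<\/html>//g
s/<pre[^>]*>//g
s/<\/pre>//g
EOF

# Generate one command per file; the redirect is part of the printed text.
printf '%s\n' *.txt |
    awk '{ print "sed -f strip.sed " $0 " > outputdir/" $0 }' |
    sh

cat outputdir/1.txt
```

Dropping the final `| sh` shows the generated commands without running them, the same dry run as before.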
