Solved

get urls from sitemap (sed or grep)

Posted on 2011-10-16
Last Modified: 2012-05-12
How can I use sed or grep to create a list with only URLs from my sitemap file?

The sitemap file looks like this:
	<url>
		<loc>http://www.domain.com/bla</loc>
		<lastmod>2011-10-16</lastmod>
		<changefreq>monthly</changefreq>
		<priority>0.8</priority>
	</url>
	<url>
		<loc>http://www.domain.com/bla2</loc>
		<lastmod>2011-10-16</lastmod>
		<changefreq>monthly</changefreq>
		<priority>0.8</priority>
	</url>

Question by:Dennie
3 Comments

Assisted Solution

by:Papertrip
Papertrip earned 332 total points
ID: 36976502
Here is a simple awk one-liner that will do the trick.

[root@broken ee]# cat sitemap
 <url>
                <loc>http://www.domain.com/bla</loc>
                <lastmod>2011-10-16</lastmod>
                <changefreq>monthly</changefreq>
                <priority>0.8</priority>
        </url>
        <url>
                <loc>http://www.domain.com/bla2</loc>
                <lastmod>2011-10-16</lastmod>
                <changefreq>monthly</changefreq>
                <priority>0.8</priority>
        </url>
[root@broken ee]# awk -F'[<|>]' '/loc/{print $3}' sitemap
http://www.domain.com/bla
http://www.domain.com/bla2
[root@broken ee]#
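
For anyone wondering why `$3` is the URL: with `<` and `>` as field separators, a `<loc>` line splits into leading whitespace (`$1`), the tag name `loc` (`$2`), the URL (`$3`) and `/loc` (`$4`). The sketch below is a slightly stricter variant of the same idea, not a replacement for the answer above: it drops the `|` from the separator class and matches the literal `<loc>` tag. The file name `sitemap_sample.xml` and the function name `urls_awk` are made up for the example.

```shell
# Build a small sample file to try this on (hypothetical file name).
cat > sitemap_sample.xml <<'EOF'
	<url>
		<loc>http://www.domain.com/bla</loc>
		<lastmod>2011-10-16</lastmod>
	</url>
	<url>
		<loc>http://www.domain.com/bla2</loc>
		<lastmod>2011-10-16</lastmod>
	</url>
EOF

# Hypothetical helper: print the URL of every line that has a <loc> tag.
urls_awk() {
    # Fields after splitting on "<" or ">":
    #   $1 = leading whitespace, $2 = "loc", $3 = URL, $4 = "/loc"
    awk -F'[<>]' '/<loc>/ {print $3}' "$1"
}

urls_awk sitemap_sample.xml
```

Like the original, this assumes one `<loc>` element per line.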

Accepted Solution

by:Maciej S
Maciej S earned 336 total points
ID: 36976554
sed version:
sed '/loc/!d;s/.*>\([^<]*\)<.*/\1/' sitemap
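
A quick breakdown of the sed command, in case it helps: `/loc/!d` deletes every line that does not contain `loc`, and the substitution then replaces the whole line with the text captured between a `>` and the following `<`. A variant that anchors on the literal tags and uses `-n` with the `p` flag instead of `!d` might look like the sketch below. The file name `sitemap_sed_sample.xml` and the function name `urls_sed` are only there for illustration.

```shell
# Small sample file to try this on (hypothetical file name).
cat > sitemap_sed_sample.xml <<'EOF'
	<url>
		<loc>http://www.domain.com/bla</loc>
		<lastmod>2011-10-16</lastmod>
		<changefreq>monthly</changefreq>
	</url>
	<url>
		<loc>http://www.domain.com/bla2</loc>
		<lastmod>2011-10-16</lastmod>
	</url>
EOF

# Hypothetical helper: print only what sits between <loc> and </loc>.
urls_sed() {
    # -n turns off automatic printing; the trailing "p" prints a line
    # only when the substitution on it succeeded.
    sed -n 's/.*<loc>\([^<]*\)<\/loc>.*/\1/p' "$1"
}

urls_sed sitemap_sed_sample.xml
```

As with the original, this works line by line, so it expects each `<loc>` element on its own line.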

Assisted Solution

by:Gerwin Jansen, EE MVE
Gerwin Jansen, EE MVE earned 332 total points
ID: 36982656
grep on its own only selects whole lines (unless your grep supports the non-POSIX -o option), so here it is combined with sed:

grep "</*loc>" sitemap | sed 's/<\/*loc>//g;s/^[[:space:]]*//'

grep selects the lines containing the URLs, like this:

            <loc>http://www.domain.com/bla</loc>
            <loc>http://www.domain.com/bla2</loc>

The first sed command removes the opening and closing loc tags, like this:

            http://www.domain.com/bla
            http://www.domain.com/bla2

Adding the second sed command (after the ;) removes the white space at the beginning of the lines, like this:

http://www.domain.com/bla
http://www.domain.com/bla2
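
One thing to watch out for: all of the commands in this thread work line by line, so they expect every `<loc>` element on its own line. If the sitemap is minified onto a single line, a grep that supports `-o` (GNU and BSD grep do; it is not part of POSIX) can print each match on its own line first. A rough sketch, assuming such a grep is available; the file name `sitemap_oneline.xml` and the function name `urls_grep` are invented for the example:

```shell
# A sitemap squeezed onto one line (hypothetical file name).
printf '%s\n' '<url><loc>http://www.domain.com/bla</loc></url><url><loc>http://www.domain.com/bla2</loc></url>' > sitemap_oneline.xml

# Hypothetical helper: one URL per output line, even for one-line input.
urls_grep() {
    # grep -o prints each "<loc>URL" match on its own line;
    # sed then strips the leading <loc> tag.
    grep -o '<loc>[^<]*' "$1" | sed 's/^<loc>//'
}

urls_grep sitemap_oneline.xml
```

The same function also works on the multi-line layout shown in the question.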

Question has a verified solution.