Solved

Bash shell: grab the text in a web page between <title> and </title>, modify it, and store it in a variable

Posted on 2014-12-08
Last Modified: 2014-12-09
Hi,

What I want to do is grab the text in a web page between <title> and </title> and then modify it. Afterwards, I need to copy the modified text to a variable in a bash shell on Linux.

For example:

<title>En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr </title>

Copy this part: « En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr »
Edit it so that it looks like: « En Selle, Sakamichi – Grande Road – S02 Ép 08 »

It needs to use conditions: if the title contains a specific word (e.g. Saison, Épisode, Episode), that word needs to be changed automatically to « S01 », as in Season 01, or « Ép01 », for Épisode 01.

Also, it has to be able to delete a selection of words (e.g. vostfr). The same goes for special characters.

Thank you !
Question by:hexo dark
6 Comments
 
LVL 19

Accepted Solution

by:
simon3270 earned 500 total points
ID: 40487931
How about:
TITLE=$(grep '<title>' /path/to/file.html | sed -e 's,^.*<title>,,' -e 's,</title>.*,,' | \
    sed -e 's/Épisode \([0-9][^0-9]\)/Ép0\1/g' \
        -e 's/Episode \([0-9][^0-9]\)/Ép0\1/g' \
        -e 's/Saison \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Season \([0-9][^0-9]\)/S0\1/g' \
        -e 's/Épisode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Episode \([0-9][0-9]\)/Ép\1/g' \
        -e 's/Saison \([0-9][0-9]\)/S\1/g' \
        -e 's/Season \([0-9][0-9]\)/S\1/g' \
        -e 's/vostfr//g' \
        -e 's/badword//g' \
        -e 's/unwanted words//g' \
        -e 's/[ ][ ]*/ /g' \
        -e 's/[ ]*-[ ]*$//' | \
        tr -d "'"'"()^%~#/')

echo "TITLE is <${TITLE}>"


The "grep" finds any lines with "<title>" in them.

The first "sed" then removes the line up to and including the "<title>" bit, and from "</title>" to the end of the line.  This assumes that "<title>" and "</title>" are on the same text line.
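To see the extraction step on its own, here is a minimal sketch. It assumes the page content is held in a shell variable (the variable names html and raw_title are only for illustration) and uses the title from the question as sample data.

```shell
# Sample page content; in practice this would come from a file or a download.
html='<html><head>
<title>En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr </title>
</head></html>'

# Keep only the line containing <title>, then strip everything up to and
# including <title>, and everything from </title> onwards.
raw_title=$(printf '%s\n' "$html" | grep '<title>' |
    sed -e 's,^.*<title>,,' -e 's,</title>.*,,')

# The < and > make the trailing space inside the title visible.
echo "raw title is <${raw_title}>"
```

If the opening and closing tags were on different lines, this would return only the part of the title on the first line, which is the limitation mentioned above.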

The next sed manipulates the contents.  For example, if "Épisode" is followed by a single digit, this is replaced by "Ép0" followed by the digit.  The same is done for "Episode n", and for Saison or Season, which become "S0" followed by the digit.  If the same words are followed by two digits, the extra "0" is not inserted, so "Épisode 12" is replaced by "Ép" followed by the "12".  The next three lines remove any unwanted words (here "vostfr", "badword" and "unwanted words").  The next line compresses any multiple spaces to a single space character, and the last one removes a trailing " - " (e.g. if "vostfr" is removed in your example).
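As a rough check of the substitution step, the sketch below runs a shortened version of the sed command (only the rules needed for the sample) on the title from the question. The variable names raw and clean are assumptions for the demonstration.

```shell
# Title as extracted from the sample page, trailing space included.
raw='En Selle, Sakamichi - Grande Road - Saison 2 Épisode 8 - vostfr '

clean=$(printf '%s\n' "$raw" | sed \
    -e 's/Épisode \([0-9][^0-9]\)/Ép0\1/g' \
    -e 's/Saison \([0-9][^0-9]\)/S0\1/g' \
    -e 's/vostfr//g' \
    -e 's/[ ][ ]*/ /g' \
    -e 's/[ ]*-[ ]*$//')

echo "<${clean}>"
# prints: <En Selle, Sakamichi - Grande Road - S02 Ép08>
```

Note that the single-digit patterns need one more character after the digit, so a title that ends directly with "Épisode 8" (nothing after the 8) would be left unpadded by these rules.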

The "tr" then removes the specified "special" characters.  The long sequence of single and double quotes at the start says that any single or double quotes are removed.  Any other characters can be added, then a final single quote ends the set of characters.
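The quoting can be checked in isolation. In this small sketch the set of characters is first stored in a variable (chars, a name chosen for the example) so that it can be printed, then passed to tr on a made-up sample string.

```shell
# "'" is a double-quoted single quote; '"()^%~#/' is a single-quoted string
# that starts with a double quote. The shell joins both into one word.
chars="'"'"()^%~#/'
printf 'characters to delete: %s\n' "$chars"

# Made-up sample containing several of those characters.
sample="L'épisode \"pilote\" (100%) #1"
stripped=$(printf '%s\n' "$sample" | tr -d "$chars")
echo "<${stripped}>"
# prints: <Lépisode pilote 100 1>
```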

The "echo" line displays the result - the "<" and ">" are there to show any leading/trailing spaces in the result.

To modify, add any other words that might be used (e.g. copy the episode and season lines for any that you want to modify in the same way, and add any other words you want to remove).  Just keep the format the same, with each line ending in a "\".  You can add almost any character to the set of characters to be deleted too.
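If the list of unwanted words grows, one possible variation (not part of the answer above) is to loop over the words instead of writing one -e line per word. The word "HD" below is a hypothetical second entry, used only to show the loop.

```shell
title='En Selle, Sakamichi - Grande Road - S02 Ép08 - vostfr HD'

# Hypothetical word list; extend it with whatever needs removing.
for word in vostfr HD; do
    title=$(printf '%s\n' "$title" | sed -e "s/${word}//g")
done

# Same clean-up as in the answer: squeeze spaces, drop a trailing " - ".
title=$(printf '%s\n' "$title" | sed -e 's/[ ][ ]*/ /g' -e 's/[ ]*-[ ]*$//')
echo "<${title}>"
# prints: <En Selle, Sakamichi - Grande Road - S02 Ép08>
```

This only works as written for plain words: anything containing characters that sed treats specially (such as "/", "." or "*") would need escaping first.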
 
LVL 2

Author Comment

by:hexo dark
ID: 40487956
It works on a local HTML file, but the web page is on the net :)

Edit: it works!!

I added this: wget -O test.htm weblink
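For reference, the download and the extraction can also be chained without a temporary file. This is only a sketch: extract_title is a helper name made up for the example, and the URL in the comment is a placeholder.

```shell
# Read HTML on standard input and print the text between <title> and </title>.
extract_title() {
    grep '<title>' | sed -e 's,^.*<title>,,' -e 's,</title>.*,,'
}

# With a real page this would be (placeholder URL):
#   TITLE=$(wget -q -O - "http://example.com/page.html" | extract_title)

# Offline demonstration with a local string:
TITLE=$(printf '%s\n' '<head><title>Test page</title></head>' | extract_title)
echo "TITLE is <${TITLE}>"
# prints: TITLE is <Test page>
```

Here "-O -" tells wget to write the page to standard output and "-q" suppresses its progress messages, so only the HTML reaches the pipe.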
 
LVL 2

Author Closing Comment

by:hexo dark
ID: 40487964
Perfect, good!!! A++, 5 stars!!! Thank you ^^
 
LVL 2

Author Comment

by:hexo dark
ID: 40487973
Just one more question:
how do I remove this character ( - )?
 
LVL 19

Expert Comment

by:simon3270
ID: 40488605
Add \- to the tr command, so

    tr -d "'"'"()^%~#/\-'

The backslash stops the "-" from being treated as indicating a character range. It is not strictly necessary if the "-" is the last character in the list of characters, but it would become necessary if another character were added to the end of the list, so to be on the safe side I have put it in!
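The difference is easy to see on a made-up string. This sketch assumes GNU tr, which treats "\-" as a literal hyphen; the variable names are only for the example.

```shell
s='a-b#c+d$e'

# Escaped hyphen: deletes "#", "-" and "+" only.
escaped=$(printf '%s\n' "$s" | tr -d '#\-+')

# Unescaped hyphen between "#" and "+": read as the range "#" to "+",
# which also covers "$" but not "-" itself.
ranged=$(printf '%s\n' "$s" | tr -d '#-+')

echo "escaped: <${escaped}>"   # prints: escaped: <abcd$e>
echo "ranged:  <${ranged}>"    # prints: ranged:  <a-bcde>
```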
 
LVL 2

Author Comment

by:hexo dark
ID: 40489164
ty :)