SSupreme asked:

trim files in CMD and concatenate to one file

Hello everyone,

I have a folder containing a lot of different files.
I want to extract the same section from each of a specific set of files.
The names of these files run from News-x to News-xxx.
The content of each file looks like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<head>
<meta content="DYNAMIC" name="DOCUMENT-STATE" />
<meta content="Copyright (c) 'Церковь Вифлеем' - info@vflm.by" name="Copyright" />
<link href="http://xxxxxxxxx.com/App_Themes/main/layout.css" type="text/css" rel="stylesheet" />
<link href="http://xxxxxxxxx.com/App_Themes/main/style.css" type="text/css" rel="stylesheet" />
<link href="http://xxxxxxxxx.com/App_Themes/main/tables.css" type="text/css" rel="stylesheet" />
<meta content="Заметки церкви Вифлеем, Новый епископ Союза ЕХБ Беларуси – В.Н. Крутько" name="keywords" />
<meta content="Новости официального сайта церкви Вифлеем, 20 марта 2010 года в нашей церкви прошел XIV съезд Союза евангельских христиан-баптистов Беларуси." name="description" />
<link rel="icon" href="favicon.ico" type="image/x-icon" />
<link rel="shortcut icon" href="favicon.ico" type="image/x-icon" />
<meta content="info@xxxxxxxxx.com" name="Reply-to" />
<title>
	Новости - Церковь Вифлеем - Новый епископ Союза ЕХБ Беларуси – В.Н. Крутько
</title></head>
<body id="news_page">
<form method="post" action="http://xxxxxxxxx.com/News-xxx.aspx" id="aspnetForm_news">
<div>
</div>
<div id="main_wrapper"><div class="main_wr_l"><div class="main_wr_r"><div id="main"><div class="body-bg">
<!-- HEADER -->
<div id="header">
<div class="row-1">
<div id="logo">
<h1><a href="Default.html">Церковь «Вифлеем»</a></h1>
<h5>Библейская Церковь Евангельских Христиан Баптистов «Вифлеем», г.Минск, Беларусь.</h5>
</div>
<div id="top-menu">
<!-- top menu -->
<ul>
	<li class="resources"><a href="Resources.html">Ресурсы</a></li>
	<li class="beliefs"><a href="Beliefs.html">Вероучение</a></li>
	<li class="cal"><a href="Calendar.html">Календарь</a></li>
	<li class="about"><a href="About.html">О нас</a></li>
</ul>
<!-- end menu -->
</div>
</div><div class="nav-pass"><table cellpadding="0" cellspacing="0" style="border-width:0;" ><tr><td><a href="Default.html">Главная</a> ></td><td  style="white-space:nowrap;">Новости</td></tr></table></div>
</div>
<!-- END HEADER -->
<!-- CONTENT -->
<div id="content"><div class="box"><div class="border-top"><div class="border-right"><div class="border-bot"><div class="border-left"><div class="left-top-corner"><div class="right-top-corner"><div class="right-bot-corner"><div class="left-bot-corner"><div class="wrapper">
<div class="col-caption">
<h3>Новости</h3>
<h4>Новости Церкви «Вифлеем»</h4>
</div>
<div class="col-caption"><h2>Новый епископ Союза ЕХБ Беларуси – В.Н. Крутько</h2></div>
<div class="col-1">
 <div class="img-indent"><div class="img-box1"><img src="img/vflmNews/picture_14_syod.jpg" alt="picture_14_syod.jpg" /></div></div>
 <a href="News.html" class="link">читать все новости</a><br /><br />
</div>
<div class="col-2">
 <h4>20 марта 2010 г.</h4>
<b>20 марта 2010 года в нашей церкви прошел XIV съезд Союза евангельских христиан-баптистов Беларуси. </b><br /><p>В «Вифлееме» собрались представители всех церквей братства, всего 291 делегат.  По итогам съезда старший пресвитер нашей церкви Виктор Никодимович Крутько был избран новым председателем Союза ЕХБ Беларуси. Генеральным секретарем стал Николай Васильевич Синковец, который занимал пост председателя до этого. Заместителем председателя переизбран Иосиф Николаевич Рачковский.</p>                                                     
</div>
</div></div></div></div></div></div></div></div></div></div></div>
<!-- END CONTENT -->
<!-- FOOTER -->
<div id="footer">
<div class="indent">
<div class="wrapper"><div class="col-1">
	&nbsp;</div>
<div class="col-2">
	<ul>
		<li>
			<noindex><img alt="Минская Богословская Семинария" height="31" src="img/stxt/banner_seminary.jpg" width="88" /></noindex></li>
		<li>
			<noindex><img alt="Союз Евангельских Христиан-Баптистов" height="31" src="img/stxt/banner_baptist.jpg" width="88" /></a></noindex></li>
		<li>
			<noindex><img alt="журнал Крынiца Жыцця" height="31" src="img/stxt/banner_krinitsa.jpg" width="88" /></noindex></li>
		<li>
			<noindex><img alt="Евангелие и Реформация" height="31" src="../epbook.by/shop/images/epbook_banner_88x31.jpg" width="88" /></noindex></li>

	</ul>
</div>
<div class="col-3">
	<a href="index.html">Церковь &laquo;Вифлеем&raquo;</a> &copy; 2010 | <a href="SiteMap.html">Карта сайта</a></div>
</div>
</div>
</div>
<!-- END FOOTER -->
</div></div></div></div></div>
</form>
</body>
</html>


I need to extract only the following div boxes: <div class="col-caption">, <div class="col-1"> and <div class="col-2">.
It would be nice to combine the results from all files into one txt file, with a blank line between them.

I think there is an easy, old-school way to do it.

I appreciate your help.

SStory replied:

Get a copy of grep:
http://gnuwin32.sourceforge.net/packages/grep.htm

grep col-caption *.html

would return lines having col-caption in them.

grep is a very complex tool that can quickly find search terms in text files. It has a lot of options.

To see them type
grep --help

at the command line and hit enter.

Example:
grep -i (case-insensitive)

Another thing to note is that you can use regular expressions with grep.
http://www.opensourceforu.com/2012/06/beginners-guide-gnu-grep-basics-regular-expressions/

Once you've built the grep command the way you want, put

> youroutputfilename.txt

at the end of it to write to a file.

Example:
grep 'col-caption\|col-1\|col-2'  > outputfile.txt

or

grep -H 'col-caption\|col-1\|col-2'  > outputfile.txt

SSupreme (asker) replied:


'col-1\' is not recognized as an internal or external command,
operable program or batch file.


Thanks for your answer and help! I have used grep before on Linux, and I spent some time installing it and getting it to work here. But I cannot see how grep can do what I want: output the multiple rows that belong to the same div.

Output should be like this:
<div class="col-caption">
<h3>Новости</h3>
<h4>Новости Церкви «Вифлеем»</h4>
</div>
<div class="col-caption"><h2>Новый епископ Союза ЕХБ Беларуси – В.Н. Крутько</h2></div>
<div class="col-1">
 <div class="img-indent"><div class="img-box1"><img src="img/vflmNews/picture_14_syod.jpg" alt="picture_14_syod.jpg" /></div></div>
 <a href="News.html" class="link">читать все новости</a><br /><br />
</div>
<div class="col-2">
 <h4>20 марта 2010 г.</h4>
<b>20 марта 2010 года в нашей церкви прошел XIV съезд Союза евангельских христиан-баптистов Беларуси. </b><br /><p>В «Вифлееме» собрались представители всех церквей братства, всего 291 делегат.  По итогам съезда старший пресвитер нашей церкви Виктор Никодимович Крутько был избран новым председателем Союза ЕХБ Беларуси. Генеральным секретарем стал Николай Васильевич Синковец, который занимал пост председателя до этого. Заместителем председателя переизбран Иосиф Николаевич Рачковский.</p>                                                     
</div>


Well, I spent some time learning and practicing, but with no luck. I feel like grep is only a tiny part of the solution.
In Excel, for example, you can use the FIND function to locate the first and last characters, and MID to return the content between those locations.
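
The FIND/MID idea above can be sketched in Python with str.find and slicing. This is only an illustration under assumptions: the helper name slice_between and the shortened sample string are made up here, and the <!-- END CONTENT --> comment from the posted page is used as the end marker.

```python
def slice_between(text, start_marker, end_marker):
    """Return text from start_marker up to (not including) end_marker, or ''."""
    start = text.find(start_marker)        # like Excel FIND for the first location
    if start == -1:
        return ""
    end = text.find(end_marker, start)     # look for the end only after the start
    if end == -1:
        return ""
    return text[start:end]                 # like Excel MID between the two locations


# Shortened stand-in for one News page (hypothetical sample data).
sample = (
    '<div id="content"><div class="wrapper">\n'
    '<div class="col-caption"><h2>Title</h2></div>\n'
    '<div class="col-2">\n<p>Body</p>\n</div>\n'
    '</div></div>\n'
    '<!-- END CONTENT -->\n'
)

fragment = slice_between(sample, '<div class="col-caption">', "<!-- END CONTENT -->")
print(fragment)
```

Note that a slice like this still ends with the closing </div> tags of the wrapper divs that sit just before the end marker, so it needs trimming if only the three kinds of div are wanted.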

SStory replied:

Worked for me in Linux. Try double quotes " " instead and see if that does anything.
Grep can match and output every line containing any of the words in the list. The -A5 option would also output the 5 lines after each match.



Example:
grep 'col-caption\|col-1\|col-2'  News*.* > output.txt

Or grep "col-caption\|col-1\|col-2" News*.* > output.txt

Or grep -A5 'col-caption\|col-1\|col-2'  News*.* > output.txt

Or grep -A5 "col-caption\|col-1\|col-2" News*.* > output.txt

Now if it must go until it finds the ending div tag, that is another story.  Then you could use awk.  Or write a simple parser in VB or something.

The News*.* pattern should make grep search all your News files. I had forgotten to specify which files to grep before.
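
For the "until it finds the ending div tag" case, a simple parser could look roughly like the Python sketch below. It is not the only way to do it and rests on several assumptions: the function names, the News-*.* file pattern and the combined.txt output name are made up for illustration, the files are assumed to be UTF-8 (as the page's meta tag says), and every page is assumed to carry the <!-- CONTENT --> and <!-- END CONTENT --> comments seen in the posted sample. Restricting the search to that section matters because the footer also contains col-1 and col-2 divs.

```python
import glob
import re

WANTED = ("col-caption", "col-1", "col-2")
TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)    # opening or closing div tag
CLASS = re.compile(r'class="([^"]*)"', re.IGNORECASE)   # class attribute of a tag


def content_section(html):
    """Keep only the part between the CONTENT comments (whole text if absent)."""
    start = html.find("<!-- CONTENT -->")
    end = html.find("<!-- END CONTENT -->")
    return html[start:end] if start != -1 and end != -1 else html


def extract_divs(html, wanted=WANTED):
    """Return each wanted <div> block, from its opening tag to its matching </div>."""
    blocks = []
    start = None   # offset where the block being captured begins
    depth = 0      # div nesting depth inside that block
    for m in TAG.finditer(html):
        closing = m.group(1) == "/"
        if start is None:
            if not closing:
                cls = CLASS.search(m.group(0))
                if cls and set(cls.group(1).split()) & set(wanted):
                    start, depth = m.start(), 1
        elif closing:
            depth -= 1
            if depth == 0:                      # matching close of the wanted div
                blocks.append(html[start:m.end()])
                start = None
        else:
            depth += 1                          # nested div inside a wanted block
    return blocks


def combine(pattern="News-*.*", out_path="combined.txt", encoding="utf-8"):
    """Write the wanted divs of every matching file to one text file."""
    with open(out_path, "w", encoding=encoding) as out:
        for path in sorted(glob.glob(pattern)):   # plain alphabetical order
            with open(path, encoding=encoding) as f:
                blocks = extract_divs(content_section(f.read()))
            out.write("\n".join(blocks) + "\n\n")  # blank line between files
```

Calling combine() from the folder that holds the News files would then write everything into combined.txt. Counting nesting depth is what lets the parser stop at the right closing </div> even when a wanted div contains other divs, which a line-based grep -A cannot do.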

SSupreme (asker) replied:

"Now if it must go until it finds the ending div tag, that is another story."
Looks like I am looking for that other story.
I thought I could get a solution in a few hours but, as usual, there is no solution after a few days.
I know I could learn grep, sed and awk, but that would take a few days, and I use grep and awk about once a year. Processing those files manually would also take a few days. While doing it, I will be thinking of the computer as something that is hard to communicate with, which is why I cannot make my life simple.

Accepted solution by SStory (available to Experts Exchange members only).

Thanks, I'll try PHP and will post the solution here.