Solved

extract and modify data from text file

Posted on 2011-02-21
Last Modified: 2012-05-11
Hi,

I have a large text file (20k lines) with entries like below:
# disposition      protocol      source                   destination            operator      port-range ###header for explanation; does not exist in original file
      permit      tcp                 10_12_10_0      23      host_10_14_0_181      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_4_16            range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_217      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_218      range      1525      1527            

      permit      tcp      10_119_160_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_24_18_0      24      host_10_14_0_157      eq      1526                  

      permit      tcp      host_10_14_1_40            host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_44      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_40            host_10_13_5_46      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_46      range      1531      1550

I want to format the data as below:
source <unique_source> destination <unique_destination> application <protocol>_<port>[_<range>]

All IP subnets start with 10_ and are followed by a two-digit mask; they should be listed as subnet_mask in the final output. E.g., in the text above, 10_12_10_0      23 should be listed as 10_12_10_0_23

Groups of lines are separated by a blank line, as shown above. Within each group we want the unique host or subnet IPs, enclosed in [] square brackets when there is more than one for a given source or destination.
The same IP address may appear in two different groups, but such entries should not be merged across groups.
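
The grouping rule above can be sketched with Perl's paragraph mode. This is only an illustration under stated assumptions: the separator lines are truly empty (no spaces or tabs), the file uses Unix line endings, and the sample lines are hypothetical ones that contain only host_ addresses, so field 3 is always the source. The names $text and @groups are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample: two groups separated by one empty line.
my $text = join "\n",
    "permit tcp host_10_14_1_40 host_10_13_5_44 range 1531 1550",
    "permit tcp host_10_14_1_50 host_10_13_5_44 range 1531 1550",
    "",
    "permit tcp host_10_14_1_40 host_10_13_5_44 range 1531 1550",
    "";

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my @groups;    # one list of unique sources per group
{
    local $/ = "";    # paragraph mode: a record ends at one or more empty lines
    while ( my $para = <$fh> ) {
        my %sources;    # fresh hash per group, so groups are never merged
        for my $line ( split /\n/, $para ) {
            my @f = split ' ', $line;
            next unless @f >= 4;
            $sources{ $f[2] } = 1;
        }
        push @groups, [ sort keys %sources ];
    }
}
close $fh;

print "group ", $_ + 1, ": @{ $groups[$_] }\n" for 0 .. $#groups;
# group 1: host_10_14_1_40 host_10_14_1_50
# group 2: host_10_14_1_40
```

Because %sources is declared inside the loop, host_10_14_1_40 is reported once in each group rather than being combined across them.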

Every line in a group has the same port or port range, but the protocol may be both tcp and udp, for example:
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      udp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
In the above case, the application should be reported as tcp_udp_port[_range]. If this is tough to code, I can remove such lines and keep only lines where the port and protocol are the same within a single group.
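
One way to build that label is sketched below. The helper name application_label and its argument layout are hypothetical, and the sketch assumes the underscore-separated form used in the expected output (for example tcp_1525_1527).

```perl
use strict;
use warnings;

# Hypothetical helper: build the application label for one group from
# the protocols seen, the first port, and an optional end-of-range port.
sub application_label {
    my ( $protocols, $min_port, $max_port ) = @_;

    # Drop repeated protocols, then sort so tcp comes before udp.
    my %seen;
    my @unique = grep { !$seen{$_}++ } @$protocols;
    @unique = sort @unique;

    my $label = join '_', @unique, $min_port;
    $label .= "_$max_port" if defined $max_port && length $max_port;
    return $label;
}

print application_label( [ 'tcp', 'udp', 'tcp' ], 1526 ), "\n";    # tcp_udp_1526
print application_label( ['tcp'], 1525, 1527 ), "\n";              # tcp_1525_1527
```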

Working on text above, it needs to be formatted as:

source 10_12_10_0_23 destination [ host_10_14_0_181 host_10_14_4_16 host_10_14_5_217 host_10_14_5_218 ] application tcp_1525_1527

source [ 10_119_160_0_24 10_97_163_0_24 10_24_18_0_24 ] destination host_10_14_0_157 application tcp_1526

source [ host_10_14_1_40 host_10_14_1_50 10_14_42_0_24 ] destination [ host_10_13_5_44 host_10_13_5_46 ] application tcp_1531_1550

Sorry for the long question.
Question by:dpk_wal
LVL 16

Expert Comment

by:sjklein42
ID: 34945660
How close did I come?

sub FlushIt
{
	my @sources = sort(keys(%sources));
	my $sourceCount = @sources;
	my $sourceString = ($sourceCount > 1 ) ? ( "[ " . join(" ",@sources) . " ]" ) : $sources[0];

	my @destinations = sort(keys(%destinations));
	my $destinationCount = @destinations;
	my $destinationString = ($destinationCount > 1 ) ? ( "[ " . join(" ",@destinations) . " ]" ) : $destinations[0];

	my @protocols = sort(keys(%protocols));
	my $protocolCount = @protocols;
	my $protocolString = join("_",@protocols) . "_" . $minPort . (( $maxPort ne '' ) ? ("_" . $maxPort) : '');

	print "source $sourceString destination $destinationString application $protocolString\n";

	undef %sources;
	undef %destinations;
	undef %protocols;
}


while ( <> )
{
	s/[\r\n]//g;

	if ( $_ eq '' ) { FlushIt(); }
	else
	{
		# All IP subnets starting with 10_ and followed by two digit mask;
		# should get listed as subnet_mask in the final output.
		# Eg, in text above, 10_12_10_0      23 should get listed as 10_12_10_0_23

		s/(\s+10\_[0-9\_]+)\s+([0-9][0-9])(\s+)/$1\_$2$3/;

		#       permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550

		s/^\s+//;		# trim leading spaces
		s/\s+$//;		# trim trailing spaces

		($disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) = split(/\s+/);
		##print STDERR join("\n", $disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) . "\n\n";

		$sources{$source} = 1;
		$destinations{$destination} = 1;
		$protocols{$protocol} = 1;
	}
}

FlushIt();


Input:
      permit      tcp                 10_12_10_0      23      host_10_14_0_181      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_4_16            range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_217      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_218      range      1525      1527            

      permit      tcp      10_119_160_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_24_18_0      24      host_10_14_0_157      eq      1526                  

      permit      tcp      host_10_14_1_40            host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_44      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_40            host_10_13_5_46      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_46      range      1531      1550

      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      udp      10_97_163_0      24      host_10_14_0_157      eq      1526                  


Output:

c:\temp>perl foo.pl foo.txt
source 10_12_10_0_23 destination [ host_10_14_0_181 host_10_14_4_16 host_10_14_5_217 host_10_14_5_218 ] application tcp_1525_1527
source [ 10_119_160_0_24 10_24_18_0_24 10_97_163_0_24 ] destination host_10_14_0_157 application tcp_1526
source [ 10_14_42_0_24 host_10_14_1_40 host_10_14_1_50 ] destination [ host_10_13_5_44 host_10_13_5_46 ] application tcp_1531_1550
source 10_97_163_0_24 destination host_10_14_0_157 application tcp_udp_1526
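
Note that sort(keys(...)) lists the addresses alphabetically, whereas the expected output in the question lists them in the order they first appear. If that order matters (the question does not say), a first-seen list could be kept alongside the hash. A minimal sketch, with hypothetical variable names:

```perl
use strict;
use warnings;

# Keep each source once, in the order it is first seen.
my ( %seen, @ordered );
for my $source (qw( 10_119_160_0_24 10_97_163_0_24 10_24_18_0_24 10_97_163_0_24 )) {
    push @ordered, $source unless $seen{$source}++;
}

# Same bracket rule as FlushIt: brackets only when there is more than one entry.
my $source_string = @ordered > 1 ? "[ @ordered ]" : $ordered[0];
print "source $source_string\n";
# source [ 10_119_160_0_24 10_97_163_0_24 10_24_18_0_24 ]
```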

 
LVL 32

Author Comment

by:dpk_wal
ID: 34946088
Works great; thank you!
Just one problem: if I have a subnet in the destination, the subnet mask gets truncated.

For example, if I change the lines in the sample input as below:
Input:
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
Output:
-bash-2.05b$ perl flushit.pl subMask
source 10_12_10_0_23 destination 10_12_10_0 application tcp_range_1525

Also, in such cases I think the port range is not being captured either.

If the address starts with 10_, it is followed by a two-digit subnet mask; if it starts with host_, it is a single address. Both source and destination can be either host_ or 10_ addresses.

Thank you.
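
The truncated output can be reproduced in isolation. The sketch below applies the single substitution from the first script to one of the lines above and prints how the remaining fields shift; the variable names are chosen for the example only.

```perl
use strict;
use warnings;

my $line = "      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527";

# A single s/// joins only the first subnet/mask pair (the source).
( my $once = $line ) =~ s/(\s+10_[0-9_]+)\s+([0-9][0-9])(\s+)/${1}_$2$3/;
$once =~ s/^\s+|\s+$//g;

# The destination mask is still a separate field, so everything after it shifts by one.
my ( $disposition, $protocol, $source, $destination, $operator, $min_port, $max_port )
    = split /\s+/, $once;

print "source=$source destination=$destination operator=$operator ",
      "minPort=$min_port maxPort=$max_port\n";
# source=10_12_10_0_23 destination=10_12_10_0 operator=23 minPort=range maxPort=1525
```

With the operator slot holding the mask, the port slot holds the word range, which is where tcp_range_1525 comes from.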
 
LVL 32

Author Comment

by:dpk_wal
ID: 34946145
The mask following a 10_ address can even be a single digit; the format is always 10_x_x_x, then spaces or tabs, then the mask, e.g. 10_0_0_0     8

Thank you for all your help and support; really appreciate it!
 
LVL 16

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 34946284
Changed so that more than one of the 10_... mask pairs can appear on a single line.

Also changed so that the mask can be one or two digits, not just two digits.

sub FlushIt
{
	my @sources = sort(keys(%sources));
	my $sourceCount = @sources;
	my $sourceString = ($sourceCount > 1 ) ? ( "[ " . join(" ",@sources) . " ]" ) : $sources[0];

	my @destinations = sort(keys(%destinations));
	my $destinationCount = @destinations;
	my $destinationString = ($destinationCount > 1 ) ? ( "[ " . join(" ",@destinations) . " ]" ) : $destinations[0];

	my @protocols = sort(keys(%protocols));
	my $protocolCount = @protocols;
	my $protocolString = join("_",@protocols) . "_" . $minPort . (( $maxPort ne '' ) ? ("_" . $maxPort) : '');

	print "source $sourceString destination $destinationString application $protocolString\n";

	undef %sources;
	undef %destinations;
	undef %protocols;
}


while ( <> )
{
	s/[\r\n]//g;

	if ( $_ eq '' ) { FlushIt(); }
	else
	{
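		# All IP subnets starting with 10_ and followed by a one or two digit mask
		# should get listed as subnet_mask in the final output.
		# Eg, in text above, 10_12_10_0      23 should get listed as 10_12_10_0_23
		# Repeat the substitution until nothing is left to join, so that both
		# a source pair and a destination pair on the same line are handled.

		while ( s/(\s+10\_[0-9\_]+)\s+([0-9]+)(\s+)/$1\_$2$3/ ) {}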

		#       permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550

		s/^\s+//;		# trim leading spaces
		s/\s+$//;		# trim trailing spaces

		($disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) = split(/\s+/);
		##print STDERR join("\n", $disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) . "\n\n";

		$sources{$source} = 1;
		$destinations{$destination} = 1;
		$protocols{$protocol} = 1;
	}
}

FlushIt();
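
To see the effect of the repeated substitution on its own, the loop can be wrapped in a small helper and run on two sample lines. This is a sketch only: join_masks is a hypothetical name, and it also collapses runs of whitespace to single spaces for readability, which the script above does not do.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution loop.
sub join_masks {
    my ($line) = @_;

    # Repeat until no "10_... <mask>" pair is left to join.
    1 while $line =~ s/(\s+10_[0-9_]+)\s+([0-9]+)(\s+)/${1}_$2$3/;

    $line =~ s/^\s+|\s+$//g;                # trim both ends
    return join ' ', split /\s+/, $line;    # collapse inner whitespace
}

# Subnet in both source and destination.
print join_masks("   permit   tcp   10_12_10_0   23   10_12_10_0   23   range   1525   1527   "), "\n";
# permit tcp 10_12_10_0_23 10_12_10_0_23 range 1525 1527

# Single-digit mask.
print join_masks("   permit   udp   10_0_0_0   8   host_10_14_0_157   eq   1526   "), "\n";
# permit udp 10_0_0_0_8 host_10_14_0_157 eq 1526
```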

 
LVL 32

Author Closing Comment

by:dpk_wal
ID: 34948767
Worked like a charm! Many thanks! :)