Solved

extract and modify data from text file

Posted on 2011-02-21
Last Modified: 2012-05-11
Hi,

I have a large text file (20k lines) with entries like below:
# disposition      protocol      source                   destination            operator      port-range ###header for explanation; does not exist in original file
      permit      tcp                 10_12_10_0      23      host_10_14_0_181      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_4_16            range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_217      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_218      range      1525      1527            

      permit      tcp      10_119_160_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_24_18_0      24      host_10_14_0_157      eq      1526                  

      permit      tcp      host_10_14_1_40            host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_44      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_40            host_10_13_5_46      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_46      range      1531      1550

I want to format the data as below:
source <unique_source> destination <unique_destination> application <protocol>_<port>[-<range>]

All IP subnets start with 10_ and are followed by a two-digit mask; they should be listed as subnet_mask in the final output. E.g., in the text above, 10_12_10_0      23 should be listed as 10_12_10_0_23

Groups of lines are separated by a blank line, as shown above. Within each group we want the unique host or subnet IPs, enclosed in [ ] square brackets when there is more than one for a given source or destination.
The same IP address may appear in two different groups, but those groups should not be merged.

Every line in a group has the same port or port range, but the protocol within a group may be both tcp and udp, e.g.,
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      udp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
In the above case, the application should be reported as tcp_udp_port[-range]. If this is tough to code, I can remove such lines and keep only groups where the port and protocol are the same on every line.

For the text above, the output should be:

source 10_12_10_0_23 destination [ host_10_14_0_181 host_10_14_4_16 host_10_14_5_217 host_10_14_5_218 ] application tcp_1525_1527

source [ 10_119_160_0_24 10_97_163_0_24 10_24_18_0_24 ] destination host_10_14_0_157 application tcp_1526

source [ host_10_14_1_40 host_10_14_1_50 10_14_42_0_24 ] destination [ host_10_13_5_44 host_10_13_5_46 ] application tcp_1531_1550

Sorry for the long question.
Question by:dpk_wal

Expert Comment

by:sjklein42
ID: 34945660
How close did I come?

sub FlushIt
{
	my @sources = sort(keys(%sources));
	my $sourceCount = @sources;
	my $sourceString = ($sourceCount > 1 ) ? ( "[ " . join(" ",@sources) . " ]" ) : $sources[0];

	my @destinations = sort(keys(%destinations));
	my $destinationCount = @destinations;
	my $destinationString = ($destinationCount > 1 ) ? ( "[ " . join(" ",@destinations) . " ]" ) : $destinations[0];

	my @protocols = sort(keys(%protocols));
	my $protocolCount = @protocols;
	my $protocolString = join("_",@protocols) . "_" . $minPort . (( $maxPort ne '' ) ? ("_" . $maxPort) : '');

	print "source $sourceString destination $destinationString application $protocolString\n";

	undef %sources;
	undef %destinations;
	undef %protocols;
}


while ( <> )
{
	s/[\r\n]//g;

	if ( $_ eq '' ) { FlushIt(); }
	else
	{
		# All IP subnets starting with 10_ and followed by two digit mask;
		# should get listed as subnet_mask in the final output.
		# Eg, in text above, 10_12_10_0      23 should get listed as 10_12_10_0_23

		s/(\s+10\_[0-9\_]+)\s+([0-9][0-9])(\s+)/$1\_$2$3/;

		#       permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550

		s/^\s+//;		# trim leading spaces
		s/\s+$//;		# trim trailing spaces

		($disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) = split(/\s+/);
		##print STDERR join("\n", $disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) . "\n\n";

		$sources{$source} = 1;
		$destinations{$destination} = 1;
		$protocols{$protocol} = 1;
	}
}

FlushIt();




Input:
      permit      tcp                 10_12_10_0      23      host_10_14_0_181      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_4_16            range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_217      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_218      range      1525      1527            

      permit      tcp      10_119_160_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_24_18_0      24      host_10_14_0_157      eq      1526                  

      permit      tcp      host_10_14_1_40            host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_44      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_40            host_10_13_5_46      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_46      range      1531      1550

      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      udp      10_97_163_0      24      host_10_14_0_157      eq      1526                  



Output:

c:\temp>perl foo.pl foo.txt
source 10_12_10_0_23 destination [ host_10_14_0_181 host_10_14_4_16 host_10_14_5_217 host_10_14_5_218 ] application tcp_1525_1527
source [ 10_119_160_0_24 10_24_18_0_24 10_97_163_0_24 ] destination host_10_14_0_157 application tcp_1526
source [ 10_14_42_0_24 host_10_14_1_40 host_10_14_1_50 ] destination [ host_10_13_5_44 host_10_13_5_46 ] application tcp_1531_1550
source 10_97_163_0_24 destination host_10_14_0_157 application tcp_udp_1526
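
Note that the script sorts each group's addresses, so the second and third lines come out in a different order from the one shown in the question. If the first-seen order matters, a minimal sketch of order-preserving de-duplication follows. The helper name unique_in_order is hypothetical and not part of the script above; using it would mean pushing each address onto an array and calling the helper in FlushIt in place of sort(keys(...)).

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the distinct
# values of a list, keeping the order in which each value first appeared.
sub unique_in_order
{
	my %seen;
	# $seen{$_}++ is 0 (false) the first time a value is met, so only
	# first occurrences pass the grep filter.
	return grep { !$seen{$_}++ } @_;
}

my @groupSources = unique_in_order(qw(
	10_119_160_0_24 10_97_163_0_24 10_24_18_0_24 10_97_163_0_24
));

print join(" ", @groupSources), "\n";
# prints: 10_119_160_0_24 10_97_163_0_24 10_24_18_0_24
```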

Open in new window


Author Comment

by:dpk_wal
ID: 34946088
Works great; thank you!
Just one problem: if I have a subnet in the destination, its subnet mask gets truncated.

For example, if I change the lines in the sample input as below:
Input:
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
Output:
-bash-2.05b$ perl flushit.pl subMask
source 10_12_10_0_23 destination 10_12_10_0 application tcp_range_1525

Also, in such cases I think the port range is not getting captured either.

If the address starts with 10_, it is followed by a two-digit subnet mask; if it starts with host_, it is a single address. Both source and destination can be either host_ or 10_ addresses.

Thank you.

Author Comment

by:dpk_wal
ID: 34946145
The mask following a 10_ address can even be a single digit, but the form is always 10_x_x_x, then a space or tab, then the mask; e.g., 10_0_0_0     8

Thank you for all your help and support; really appreciate it!

Accepted Solution

by:sjklein42 (earned 500 total points)
ID: 34946284
Changed so that more than one of the 10_... mask pairs can appear on a single line.

Also changed so that the mask can be one or two digits, not just two digits.

use strict;
use warnings;

# Per-group state; cleared every time a group is flushed.
my (%sources, %destinations, %protocols);
my ($minPort, $maxPort);

# Print one output line for the current group, then reset the group state.
sub FlushIt
{
	return unless %sources;		# nothing collected, e.g. consecutive blank lines

	my @sources = sort(keys(%sources));
	my $sourceString = ( @sources > 1 ) ? ( "[ " . join(" ",@sources) . " ]" ) : $sources[0];

	my @destinations = sort(keys(%destinations));
	my $destinationString = ( @destinations > 1 ) ? ( "[ " . join(" ",@destinations) . " ]" ) : $destinations[0];

	# $maxPort is undefined for "eq" lines, which carry a single port.
	my @protocols = sort(keys(%protocols));
	my $protocolString = join("_",@protocols) . "_" . $minPort . (( defined $maxPort && $maxPort ne '' ) ? ("_" . $maxPort) : '');

	print "source $sourceString destination $destinationString application $protocolString\n";

	%sources = ();
	%destinations = ();
	%protocols = ();
}


while ( <> )
{
	s/[\r\n]//g;

	# A blank (or whitespace-only) line ends the current group.
	if ( /^\s*$/ ) { FlushIt(); }
	else
	{
		# All IP subnets starting with 10_ and followed by a one- or two-digit mask
		# should get listed as subnet_mask in the final output.
		# Eg, in text above, 10_12_10_0      23 should get listed as 10_12_10_0_23
		# The loop repeats the substitution so that both the source and the
		# destination subnet on a line are merged with their masks.

		while ( s/(\s+10\_[0-9\_]+)\s+([0-9]+)(\s+)/$1\_$2$3/ ) {}

		#       permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550

		s/^\s+//;		# trim leading spaces
		s/\s+$//;		# trim trailing spaces

		(my $disposition, my $protocol, my $source, my $destination, my $operator, $minPort, $maxPort) = split(/\s+/);
		##print STDERR join("\n", $disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) . "\n\n";

		$sources{$source} = 1;
		$destinations{$destination} = 1;
		$protocols{$protocol} = 1;
	}
}

# Flush the last group, which may not be followed by a blank line.
FlushIt();
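
The substitution in the script above has to be repeated in a loop because its trailing (\s+) consumes the whitespace that a second 10_ address on the same line would need in front of it. A possible alternative, shown here only as a sketch, uses lookarounds so that a single pass with /g catches every pair. The helper name merge_masks is hypothetical, and the {1,2} limit on the mask assumes masks are always one or two digits, as described in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): merges every
# "10_x_x_x <mask>" pair on a line into "10_x_x_x_<mask>".
sub merge_masks
{
	my ($line) = @_;

	# (?<!\S)   the address must start the line or follow whitespace,
	#           so the 10_ inside host_10_... is never matched
	# (?=\s|$)  the mask must be followed by whitespace or end of line;
	#           the lookahead does not consume it, so /g can go on to
	#           match a second pair later on the same line
	$line =~ s/(?<!\S)(10_[0-9_]+)\s+([0-9]{1,2})(?=\s|$)/${1}_$2/g;

	return $line;
}

print merge_masks("permit tcp 10_12_10_0 23 10_0_0_0 8 range 1525 1527"), "\n";
# prints: permit tcp 10_12_10_0_23 10_0_0_0_8 range 1525 1527
```

Lines that contain only host_ addresses pass through unchanged, so the result can be fed to the same split(/\s+/) as before.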


Author Closing Comment

by:dpk_wal
ID: 34948767
Worked like a charm!! Many thanks! :)