Solved

extract and modify data from text file

Posted on 2011-02-21
Last Modified: 2012-05-11
Hi,

I have a large text file (20k lines) with entries like below:
# disposition      protocol      source                   destination            operator      port-range ###header for explanation; does not exist in original file
      permit      tcp                 10_12_10_0      23      host_10_14_0_181      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_4_16            range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_217      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_218      range      1525      1527            

      permit      tcp      10_119_160_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_24_18_0      24      host_10_14_0_157      eq      1526                  

      permit      tcp      host_10_14_1_40            host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_44      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_40            host_10_13_5_46      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_46      range      1531      1550

I want to format the data as below:
source <unique_source> destination <unique_destination> application <protocol>_<port>[-<range>]

All IP subnets start with 10_ and are followed by a two-digit mask; they should be listed as subnet_mask in the final output. E.g., in the text above, 10_12_10_0      23 should be listed as 10_12_10_0_23.

Groups of lines are separated by a blank line, as shown above. Within each group we want the unique host or subnet IPs, placed in [] square brackets if there is more than one for a specific source or destination.
The same IP address might appear in two different groups, but those groups should not be merged.

All lines in a group have the same port or port range; however, a group might contain both tcp and udp lines, e.g.,
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      udp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
In the above case, the application should be reported as tcp_udp_port[-range]. If this is tough to code, I can remove such lines and keep only lines where the port and protocol are the same within a single group.

For the text above, the output should be:

source 10_12_10_0_23 destination [ host_10_14_0_181 host_10_14_4_16 host_10_14_5_217 host_10_14_5_218 ] application tcp_1525_1527

source [ 10_119_160_0_24 10_97_163_0_24 10_24_18_0_24 ] destination host_10_14_0_157 application tcp_1526

source [ host_10_14_1_40 host_10_14_1_50 10_14_42_0_24 ] destination [ host_10_13_5_44 host_10_13_5_46 ] application tcp_1531_1550

Sorry for the long question.
Question by:dpk_wal

Expert Comment

by:sjklein42
ID: 34945660
How close did I come?

sub FlushIt
{
	my @sources = sort(keys(%sources));
	my $sourceCount = @sources;
	my $sourceString = ($sourceCount > 1 ) ? ( "[ " . join(" ",@sources) . " ]" ) : $sources[0];

	my @destinations = sort(keys(%destinations));
	my $destinationCount = @destinations;
	my $destinationString = ($destinationCount > 1 ) ? ( "[ " . join(" ",@destinations) . " ]" ) : $destinations[0];

	my @protocols = sort(keys(%protocols));
	my $protocolCount = @protocols;
	my $protocolString = join("_",@protocols) . "_" . $minPort . (( $maxPort ne '' ) ? ("_" . $maxPort) : '');

	print "source $sourceString destination $destinationString application $protocolString\n";

	undef %sources;
	undef %destinations;
	undef %protocols;
}


while ( <> )
{
	s/[\r\n]//g;

	if ( $_ eq '' ) { FlushIt(); }
	else
	{
		# All IP subnets starting with 10_ and followed by two digit mask;
		# should get listed as subnet_mask in the final output.
		# Eg, in text above, 10_12_10_0      23 should get listed as 10_12_10_0_23

		s/(\s+10\_[0-9\_]+)\s+([0-9][0-9])(\s+)/$1\_$2$3/;

		#       permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550

		s/^\s+//;		# trim leading spaces
		s/\s+$//;		# trim trailing spaces

		($disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) = split(/\s+/);
		##print STDERR join("\n", $disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) . "\n\n";

		$sources{$source} = 1;
		$destinations{$destination} = 1;
		$protocols{$protocol} = 1;
	}
}

FlushIt();


Input:
      permit      tcp                 10_12_10_0      23      host_10_14_0_181      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_4_16            range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_217      range      1525      1527            
      permit      tcp      10_12_10_0      23      host_10_14_5_218      range      1525      1527            

      permit      tcp      10_119_160_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      tcp      10_24_18_0      24      host_10_14_0_157      eq      1526                  

      permit      tcp      host_10_14_1_40            host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_44      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_44      range      1531      1550
      permit      tcp      host_10_14_1_40            host_10_13_5_46      range      1531      1550
      permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550
      permit      tcp      10_14_42_0      24      host_10_13_5_46      range      1531      1550

      permit      tcp      10_97_163_0      24      host_10_14_0_157      eq      1526                  
      permit      udp      10_97_163_0      24      host_10_14_0_157      eq      1526                  


Output:

c:\temp>perl foo.pl foo.txt
source 10_12_10_0_23 destination [ host_10_14_0_181 host_10_14_4_16 host_10_14_5_217 host_10_14_5_218 ] application tcp_1525_1527
source [ 10_119_160_0_24 10_24_18_0_24 10_97_163_0_24 ] destination host_10_14_0_157 application tcp_1526
source [ 10_14_42_0_24 host_10_14_1_40 host_10_14_1_50 ] destination [ host_10_13_5_44 host_10_13_5_46 ] application tcp_1531_1550
source 10_97_163_0_24 destination host_10_14_0_157 application tcp_udp_1526
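
Note that the output above lists each group's addresses in sorted order, while the requested output in the question lists them in order of first appearance. If that order matters, one option is to de-duplicate with a "seen" hash instead of sorting hash keys. The following is only a minimal sketch; `unique_in_order` and `bracket` are hypothetical helper names that do not appear in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the thread's script): drop duplicates
# from a list while keeping the order in which items first appeared.
sub unique_in_order {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical helper: a single item prints bare, several items are
# wrapped in "[ ... ]", matching the format asked for in the question.
sub bracket {
    my @items = @_;
    return @items > 1 ? "[ " . join(" ", @items) . " ]" : $items[0];
}

my @sources = unique_in_order(qw(
    10_119_160_0_24 10_97_163_0_24 10_24_18_0_24 10_97_163_0_24
));
print "source ", bracket(@sources), "\n";
# prints: source [ 10_119_160_0_24 10_97_163_0_24 10_24_18_0_24 ]
```

To use this idea in the script, the addresses of a group would be pushed onto arrays as they are read and passed through `unique_in_order` at flush time, rather than being stored as hash keys and sorted.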


Author Comment

by:dpk_wal
ID: 34946088
Works great; thank you!
Just one problem: if I have a subnet in the destination, the subnet mask is getting truncated.

For example, if I change the lines in the sample input as below:
Input:
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
      permit      tcp      10_12_10_0      23      10_12_10_0      23      range      1525      1527            
Output:
-bash-2.05b$ perl flushit.pl subMask
source 10_12_10_0_23 destination 10_12_10_0 application tcp_range_1525

Also, in such cases I think the port range is not getting captured either.

If the address starts with 10_, it is followed by a two-digit subnet mask; if it starts with host_, it is a single address. Both source and destination can be either host_ or 10_ addresses.

Thank you.

Author Comment

by:dpk_wal
ID: 34946145
The mask following a 10_ address can even be a single digit, but the form is always 10_x_x_x, then a space or tab, then the mask; e.g., 10_0_0_0     8

Thank you for all your help and support; really appreciate it!

Accepted Solution

by:
sjklein42 earned 500 total points
ID: 34946284
Changed so that more than one of the 10_... mask pairs can appear on a single line.

Also changed so that the mask can be one or two digits, not just two digits.

sub FlushIt
{
	# Skip empty groups (consecutive blank lines, or a blank line at the
	# end of the file) so that no stray, empty output line is printed.
	return unless %sources;

	# One entry prints bare; several entries are wrapped in "[ ... ]".
	my @sources = sort(keys(%sources));
	my $sourceString = (@sources > 1) ? ( "[ " . join(" ",@sources) . " ]" ) : $sources[0];

	my @destinations = sort(keys(%destinations));
	my $destinationString = (@destinations > 1) ? ( "[ " . join(" ",@destinations) . " ]" ) : $destinations[0];

	# Join the protocols (e.g. tcp_udp), then append the port or port range.
	my @protocols = sort(keys(%protocols));
	my $protocolString = join("_",@protocols) . "_" . $minPort . (( defined($maxPort) && $maxPort ne '' ) ? ("_" . $maxPort) : '');

	print "source $sourceString destination $destinationString application $protocolString\n";

	# Reset the per-group state before the next group starts.
	undef %sources;
	undef %destinations;
	undef %protocols;
}


while ( <> )
{
	s/[\r\n]//g;

	if ( $_ eq '' ) { FlushIt(); }
	else
	{
		# Every 10_ subnet is followed by a one- or two-digit mask; join the
		# two with an underscore in the final output.
		# Eg, 10_12_10_0      23 should get listed as 10_12_10_0_23
		# The substitution is repeated because a line can hold two such
		# pairs (one for the source and one for the destination).

		while ( s/(\s+10\_[0-9\_]+)\s+([0-9]+)(\s+)/$1\_$2$3/ ) {}

		#       permit      tcp      host_10_14_1_50            host_10_13_5_46      range      1531      1550

		s/^\s+//;		# trim leading spaces
		s/\s+$//;		# trim trailing spaces

		($disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) = split(/\s+/);
		##print STDERR join("\n", $disposition, $protocol, $source, $destination, $operator, $minPort, $maxPort) . "\n\n";

		$sources{$source} = 1;
		$destinations{$destination} = 1;
		$protocols{$protocol} = 1;
	}
}

FlushIt();
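
A side note on the substitution loop above: a plain `/g` flag would not work with that pattern, because the trailing `(\s+)` consumes the whitespace that the next `10_` pair needs in front of it. A possible alternative, shown here only as a sketch, uses lookarounds so that no surrounding whitespace is consumed and a single `/g` pass handles both pairs. `join_masks` is a hypothetical helper name, and limiting the mask to one or two digits is an assumption based on the asker's description.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): join every
# "10_a_b_c <mask>" pair on a line with an underscore.
# (?<!\S) requires the address to start a field, so host_10_... is skipped.
# (?!\S)  requires the mask to end a field; as a lookahead it consumes no
#         whitespace, which lets /g find a second pair on the same line.
# Assumption: the mask is one or two digits, as described by the asker.
sub join_masks {
    my ($line) = @_;
    $line =~ s/(?<!\S)(10_[0-9_]+)\s+([0-9]{1,2})(?!\S)/${1}_${2}/g;
    return $line;
}

print join_masks("permit tcp 10_12_10_0 23 10_0_0_0     8 range 1525 1527"), "\n";
# prints: permit tcp 10_12_10_0_23 10_0_0_0_8 range 1525 1527
```

Because the mask is capped at two digits and must end a field, a four-digit port that happens to follow an address is not mistaken for a mask.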


Author Closing Comment

by:dpk_wal
ID: 34948767
Worked like a charm!! Many thanks! :)