Dada44

PHP: to parse a text file

Hi all

I need a script to read a text file's contents and check for errors.

An error is any line that is not built in this manner:

9 digits number|word|date hour||word

An example:
123456789|active|01/03/2011 10:03:26||web

I don't even know how to start, can anyone please give me a clue?

Please note that I'm a newbie, a piece of code will be easier for me than literature (sure I'll make the wrong interpretation).

Thanks a lot
cc108790

At a basic level you could do this:
<?php
$myfile = 'test.txt';
$lines = file($myfile, FILE_IGNORE_NEW_LINES); // one array entry per line, trailing newlines stripped
$elements = array();
for ($i = 0; $i < count($lines); $i++) {
    $elements = explode("|", $lines[$i]);
  /*
  then your individual elements are here:
  $elements[0] = 9 digit number
  $elements[1] = word
  $elements[2] = date hour
  $elements[3] = empty (the gap between the double pipes)
  $elements[4] = word
 
  you can then perform whatever check you want on each one to see if it's valid.
*/

}
?>



Of course, you could use a regex to achieve this, but splitting each line on the pipe character is by far the easiest method.
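For comparison, here is a rough sketch of the regex route mentioned above. The function name is_valid_log_line() and the pattern are my own assumptions based on the single sample line in the question (9 digits, a word, a dd/dd/dddd hh:mm:ss timestamp, an empty field, a word). It only checks the shape of each field, so an impossible date such as 97/43/0000 would still pass.

```php
<?php
// is_valid_log_line() is a hypothetical helper, not code from this thread.
// Expected shape: 9 digits | word | nn/nn/nnnn hh:mm:ss | (empty) | word
function is_valid_log_line(string $line): bool
{
    // ^ and $ anchor the pattern to the whole line; the D modifier stops $
    // from matching before a trailing newline.
    $pattern = '/^\d{9}\|[A-Za-z]+\|\d{2}\/\d{2}\/\d{4} \d{2}:\d{2}:\d{2}\|\|[A-Za-z]+$/D';

    // Strip the line ending that file() or fgets() leaves in place.
    return preg_match($pattern, rtrim($line, "\r\n")) === 1;
}

var_dump(is_valid_log_line('123456789|active|01/03/2011 10:03:26||web')); // bool(true)
var_dump(is_valid_log_line('12345678|active|01/03/2011 10:03:26||web'));  // bool(false)
```

The anchors are the important part: without them a longer number or trailing garbage would still match.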
ASKER CERTIFIED SOLUTION
palanee83
Do let me know if you need any help or if my understanding is not correct.

Thanks
Take a look at:

<?php
$logfilename = "log.txt"; /* Replace with the file name you want to check */
$lines = file($logfilename); /* Load each line of the file into an array. NOTE: does not work well with files > 10 MB */
foreach ($lines as $line) { // Loop through the array
	/* Check with a regular expression whether the line follows the standard.
	   The pattern is anchored (^ ... $) and requires exactly 9 leading digits. */
	if (preg_match('%^[0-9]{9}\|([a-zA-Z]+)\|(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)[0-9]{2} ([0-1]?\d|2[0-3]):([0-5]?\d):([0-5]?\d)\|\|([a-zA-Z]+)\s*$%', $line)) {
		echo $line . "<br/>"; /* If the line follows the standard, print it on the screen */
	} 
}
?>
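Since file() loads the whole file into memory, here is a minimal sketch of reading it one line at a time instead and collecting the lines that fail, which is what the question asks for. find_bad_lines() and the five-field check are illustrative assumptions of mine; plug in whatever validation you settle on.

```php
<?php
// find_bad_lines() is a hypothetical helper, not code from this thread.
// It streams the file with fgets() so only one line is held in memory at a time,
// and returns the failing lines keyed by their 1-based line number.
function find_bad_lines(string $path, callable $isValid): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $bad = [];
    $lineNo = 0;
    while (($line = fgets($handle)) !== false) {
        $lineNo++;
        $line = rtrim($line, "\r\n"); // drop the line ending before checking
        if (!$isValid($line)) {
            $bad[$lineNo] = $line;
        }
    }
    fclose($handle);
    return $bad;
}

// Placeholder check for the demo: exactly five pipe-separated fields.
$hasFiveFields = function (string $line): bool {
    return count(explode('|', $line)) === 5;
};

// Demo data written to a temporary file so the sketch runs on its own.
$path = tempnam(sys_get_temp_dir(), 'log');
file_put_contents(
    $path,
    "123456789|active|01/03/2011 10:03:26||web\n" .
    "This is my error 1\n" .
    "123456789|idle|02/03/2011 11:00:00||web\n" .
    "broken|line\n"
);
$bad = find_bad_lines($path, $hasFiveFields);
unlink($path);

foreach ($bad as $lineNo => $line) {
    echo "Line $lineNo is not valid: $line\n";
}
```

Keeping the line number next to each bad line makes it easier to find the problem in the original file.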


Where is your test data set?  We need to see that to test adequately.
<?php // RAY_temp_data44.php
error_reporting(E_ALL);
date_default_timezone_set('America/Chicago');

// SHOW HOW TO PROCESS ONE LINE

function my_valid_string($str)
{
    // SPLIT THE STRING AT THE PIPE CHARACTERS
    $arr = explode('|', $str);

    // THERE SHOULD BE EXACTLY 5 ELEMENTS
    if (count($arr) != 5) return FALSE;

    // THE FIRST ELEMENT MUST BE A 9-DIGIT NUMBER
    if (!preg_match('/^[0-9]{9}$/D', $arr[0])) return FALSE;

    // THE THIRD ELEMENT MUST BE A DATETIME STRING
    if (!strtotime($arr[2])) return FALSE;

    // THE FOURTH ELEMENT MUST BE EMPTY (DOUBLE PIPES)
    if ($arr[3] !== '') return FALSE;

    // THE SECOND AND LAST ELEMENTS MUST BE A WORD
    if (!preg_match('/^[A-Z \-]+$/Di', $arr[1])) return FALSE;
    if (!preg_match('/^[A-Z \-]+$/Di', $arr[4])) return FALSE;

    // ALL TESTS PASSED
    return TRUE;
}

// TEST CASES
$str = '123456789|active|01/03/2011 10:03:26||web';
if (my_valid_string($str)) echo "<br/>OK";
if (my_valid_string('Foo')) echo "<br/>Foo";


On another note, please do not use a date representation like this:

01/03/2011 10:03:26

Instead, use the ISO 8601 date/time representation.  It will look something like this, assuming that you mean January 3 rather than March 1.

2011-01-03T10:03:26

This article explains why you want to follow the standard.
https://www.experts-exchange.com/Web_Development/Web_Languages-Standards/PHP/A_201-Handling-date-and-time-in-PHP-and-MySQL.html
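As a rough illustration of the conversion, the sketch below parses the slash-separated timestamp with an explicit format and re-emits it in ISO 8601 form. to_iso8601() is a hypothetical helper of mine, and the default format assumes month/day/year (January 3); pass 'd/m/Y H:i:s' if the log is day-first.

```php
<?php
// to_iso8601() is a hypothetical helper, not code from this thread.
// Assumption: month comes first unless the caller passes another format.
function to_iso8601(string $value, string $format = 'm/d/Y H:i:s'): ?string
{
    $dt = DateTimeImmutable::createFromFormat($format, $value);

    // Out-of-range parts (e.g. month 97) may be rejected or rolled over into
    // a different date, so accept the value only if it round-trips unchanged.
    if ($dt === false || $dt->format($format) !== $value) {
        return null;
    }
    return $dt->format('Y-m-d\TH:i:s');
}

var_dump(to_iso8601('01/03/2011 10:03:26'));                // string(19) "2011-01-03T10:03:26"
var_dump(to_iso8601('01/03/2011 10:03:26', 'd/m/Y H:i:s')); // string(19) "2011-03-01T10:03:26"
var_dump(to_iso8601('97/43/0000 10:03:26'));                // NULL
```

The two different results for the same input show the ambiguity: the slash format cannot tell you which reading is right, while the ISO 8601 string can only be read one way.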
@Dada44: You accepted the wrong answer.  As you can see from the code snippet, it simply does not work to filter out several obvious bogus strings.  I will ask the moderators to reopen this question so you can accept a solution that does what you described in your original post.  

One of the foundation principles of successful programming is unit testing.  In order to do unit testing, you need to have test data (both good and bad) and you need to subject your code to the test set every time you make a change.  It sounds like a lot of work, but most of us who do this for a living automate the process.  Here in the EE environment you will find that some Experts test their code before they post it, and others do not.  You should really be careful to test code snippets before you accept a solution.  If you don't do that, you may find that you are building faulty software into your systems.

Best of luck with your project, ~Ray
<?php // RAY_temp_wrong_answer_data44.php
error_reporting(E_ALL);

// WE DO NOT HAVE THIS FILE - SO WE GET A WARNING.  IGNORE IT.
$loggOrgArray  = file('log.txt');

// SO WE COPY THIS SIMULATED TEST DATA FROM THE ACCEPTED ANSWER
$logtxt = <<<LOGTXT
This is my error 1
123456789|active|01/03/2011 10:03:26||web
This is my error 2
This is my error 3
12345678|something|01/03/2011 10:03:26||web
LOGTXT;
$loggOrgArray = explode(PHP_EOL, $logtxt);

// COPIED FROM THE ACCEPTED ANSWER
$logArray      =array_filter($loggOrgArray, 'getLogs');
foreach($logArray as $message)
{
	echo $message.'<br />';
}
function getLogs($value)
{
	$regex  = '/[0-9]+\|[a-zA-Z]+\|[0-9]{2}\/[0-9]{2}\/[0-9]{4}.*/';
	return !(preg_match($regex,$value));
}



// SOME NEW DATA THAT SHOWS HOW BOGUS STUFF GOES RIGHT THROUGH THE FILTERS
$logtxt = <<<LOGTXT
THE NEXT LINE SHOULD FAIL BECAUSE IT DOES NOT START WITH A NINE-DIGIT NUMBER
123456789123456789|active|01/03/2011 10:03:26||web
THE NEXT LINE SHOULD FAIL BECAUSE IT DOES NOT START WITH A NINE-DIGIT NUMBER
1|active|01/03/2011 10:03:26||web
THE NEXT LINE SHOULD FAIL BECAUSE IT DOES NOT HAVE A VALID DATE
123456789|something|97/43/0000 10:03:26||web
THE NEXT LINE SHOULD FAIL BECAUSE IT DOES NOT HAVE A VALID TIME
123456789|something|97/43/0000 GARBAGE HERE||web
THE NEXT LINE SHOULD FAIL BECAUSE IT DOES NOT HAVE A VALID SUFFIX
123456789|something|97/43/0000
LOGTXT;
$loggOrgArray = explode(PHP_EOL, $logtxt);

// FILTER THE TEST DATA
$logArray      =array_filter($loggOrgArray, 'getLogs');
foreach($logArray as $message)
{
	echo $message.'<br />';
}


@admin, as per Dada44's requirement, he wants to match all the lines that are not in the following pattern:
"123456789|active|01/03/2011 10:03:26||web"

I have tested it with sample data as well, and it works as expected. So how can it be a wrong answer?
The REGEX at ID:34466708 permits many bogus strings to pass.  You can see the demonstration of the problem at ID:34471990.  This is an object lesson in why unit testing is so important to the process of building software systems.  When we are given only ONE line of test data it is kind of hard to get it right, unless we build some test data on our own.
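To make the point about test data concrete, here is a small table-driven sketch. Each case pairs a line with the verdict a correct validator should give, and run_cases() (a hypothetical helper of mine) reports every line where the validator disagrees. The validator under test is the unanchored pattern quoted earlier in this thread; the test lines are taken from the demonstration data.

```php
<?php
// Each case: [input line, should a correct validator accept it?]
$cases = [
    ['123456789|active|01/03/2011 10:03:26||web',          true],
    ['123456789123456789|active|01/03/2011 10:03:26||web', false],
    ['1|active|01/03/2011 10:03:26||web',                  false],
    ['123456789|something|97/43/0000 GARBAGE HERE||web',   false],
    ['123456789|something|97/43/0000',                     false],
    ['This is my error 1',                                 false],
];

// run_cases() is a hypothetical helper: it returns the inputs
// for which the validator's verdict differs from the expected one.
function run_cases(callable $isValid, array $cases): array
{
    $failures = [];
    foreach ($cases as [$line, $expected]) {
        if ($isValid($line) !== $expected) {
            $failures[] = $line;
        }
    }
    return $failures;
}

// Validator under test: the unanchored pattern from the accepted answer.
$loose = function (string $line): bool {
    return preg_match('/[0-9]+\|[a-zA-Z]+\|[0-9]{2}\/[0-9]{2}\/[0-9]{4}.*/', $line) === 1;
};

$failures = run_cases($loose, $cases);
echo count($failures) . " of " . count($cases) . " cases gave the wrong verdict\n";
foreach ($failures as $line) {
    echo "  wrongly accepted or rejected: $line\n";
}
```

Rerunning the same table after every change to the pattern is the automated check described above, in miniature.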
Dada44

ASKER

This solution solves my question. Thanks a lot to all.
To quote, "Please note that I'm a newbie, a piece of code will be easier for me than literature (sure I'll make the wrong interpretation)."

Yep!

There is no sin in being a newbie - we all are at some level on some subject.  But it is inappropriate to accept an answer that is untested, and it is inappropriate to leave this as the accepted answer when it is demonstrably wrong -- that sort of thing damages the quality of the EE database.  I will ask the moderators to delete it so others are not drawn into thinking that it could be the right answer.
@Ray_Paseur,

This regex will never fail, because the input to the log file is DEFINITELY not going to come from a user. I'm sure the log file will not contain any bogus strings, so please don't make it more complicated.

We need to provide a simple solution based on the requirement.