rmirabelle asked:

Read specific line from file?

Is there a way to read a specific line number from a text file without first reading the entire file?

Thanks in advance
ASKER CERTIFIED SOLUTION
ddrudik

The link ddrudik posted does show a method for obtaining a single line, but the complete file is read into memory first.  Under the best of circumstances, you do not have to read the ENTIRE file, but you do have to read it up to the line you want.
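For reference, that approach (it is what Method 1 in the benchmarks further down uses) boils down to something like the following sketch, assuming $filename and $lineno (1-based) are already defined:

<?php
// file() reads the ENTIRE file into an array of lines; we then index the one we want
$lines = file($filename);
$line  = $lines[$lineno - 1];
?>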

If every line has the same length, you can also use fseek() to jump to the line you want without reading anything first.  This only works if you know exactly how many bytes to skip.
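A minimal sketch of that fixed-length idea, assuming every line is exactly $linelen bytes (newline included) and that $filename and $lineno are already defined:

<?php
// works only when every line is exactly $linelen bytes, newline included
$fp = fopen($filename, "r");
fseek($fp, ($lineno - 1) * $linelen);  // jump straight to the start of line $lineno
$line = fgets($fp);                    // read just that one line
fclose($fp);
?>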
If your script is hosted on a Linux system and you have shell_exec privileges there, you can try the following:

// $filename - the file name, including its path
// $lineno   - the desired line number
// $line     - will receive the contents of line $lineno

$line = shell_exec ( "head -n $lineno $filename | tail -n 1" );
Neat trick, but less efficient than using PHP's own functions to do the same job.  This should do the same thing without the overhead of spawning a shell plus two other programs:

<?php
// read forward until line $lineno is reached; only that much of the file is read
$fp = fopen($filename, "r");
for ($x = 0; $x < $lineno; $x++) {
  $line = fgets($fp);
}
fclose($fp);
?>


@routinet
That depends on how big the file is ... In your case he still needs to read the first $lineno lines into memory. Imagine the file is quite big. Why read hundreds of megabytes into memory?!? head and tail are optimized for exactly such jobs.
rmirabelle (ASKER):

Well, I'm on a Linux-derived box (CentOS), but shell isn't the way to go for me.  I'm reading a tab-delimited file, so line lengths are not going to be identical - I guess I'm stuck reading the whole file.  Not necessarily the end of the world, I suppose.

Thanks for the input
Thanks for the question and the points.
@tkalchev: I disagree.  With your method, you are first calling a shell, then calling head, which reads up to that line number.  Finally, it calls tail, which returns just the last line of head's output.  Overall, you're calling three external applications, and reading just as much as you would with my last snippet of code.  I'll see if I can't create some benchmarks to demonstrate.

@rmirabelle:  My last post shows a method that does not need to read the whole file...it reads only up to the line number you specify.  ddrudik's method is fine (I'd use it as well), but if you really need to avoid reading the whole file, fgets() will be the best option.  Good luck!
I've made a small test app here:

https://mohawk.abetterblind.com/test-ee.php

The code is shown below.  The very small file has 16,000 lines, the small file 32,000, and the normal file ~64,000 (a little less).  The normal file was built from the text of the US Constitutional Amendments appended over and over; the other two were built from it with `head -n`.  Repetitions are hard-coded to 100, which shows the difference pretty well using microtime(true).

Method 1 is ddrudik's solution, method 2 is mine, method 3 is tkalchev's.  Turns out we're all a little right, and a little wrong, depending on the size of the file and the line number you need.  Feel free to play with it a bit.  It's in an experiment-friendly sandbox.

The first thing I noticed is that ddrudik's solution is heavily impacted by the size of the file, and I do mean HEAVILY: on the order of 7x the execution time going from the very small file to the normal one.  My solution and tkalchev's are hardly affected by file size at all.  On the other hand, ddrudik's solution is minimally affected by the line number, whereas the other two get slower by degrees.

In the end, I'd say the average best solution would be a toss-up between tkalchev and myself, depending on the line number needed.  Once you get to about 500k or so of data needing to be read, tkalchev's solution performs better.  Up to that point, mine is in the lead.  ddrudik's solution reads the entire file, and it comes in last except when considering very tiny text files.  My very first test file was only 60k or so, and his solution was consistently in the lead until I increased the file size.  I'm not sure where that line is crossed, but it is at a relatively low file size.

Enjoy!
<?php
// hard-coded 100 repetitions
$reps = 100;
 
// get line number
if (isset($_GET['lineno'])) {
  $linenum = (integer)$_GET['lineno'];
} else {
  $linenum = 100;
}
if ($linenum <= 0) { $linenum = 100; }
 
// get file size
if (isset($_GET['fname'])) {
  $fnum = (integer)$_GET['fname'];
} else {
  $fnum = 1;
}
switch ($fnum) {
  case 3: $filename = '/web/mohawk/test.txt'; break;
  case 2: $filename = '/web/mohawk/smtest.txt'; break;
  default: $filename = '/web/mohawk/vsmtest.txt'; break;
}
echo "reps:$reps &nbsp; &nbsp;line:$linenum<br /><br />";
 
// start method 1
$t1 = microtime(true);
for ($x=0;$x<$reps;$x++) {
  $myfile = file($filename);
  $line = $myfile[($linenum - 1)];
  unset($myfile);
}
$t2 = microtime(true);
echo "<!-- $x: $line -->\n";
echo "<u>Method 1</u><br />t1:$t1, &nbsp;t2:$t2 &nbsp;diff:",($t2-$t1),"<br /><br />\n";
 
// start method 2
$t1 = microtime(true);
for ($x=0;$x<$reps;$x++) {
  $handle = fopen($filename, "r");
  for ($y=0;$y<$linenum;$y++) {
    $line = fgets($handle);
  }
  fclose($handle);
}
$t2 = microtime(true);
echo "<!-- $x: $line -->\n";
echo "<u>Method 2</u><br />t1:$t1, &nbsp;t2:$t2 &nbsp;diff:",($t2-$t1),"<br /><br />\n";
 
// start method 3
$t1 = microtime(true);
for ($x=0;$x<$reps;$x++) {
  $line = shell_exec ( "head -n $linenum $filename | tail -n 1" );
}
$t2 = microtime(true);
echo "<!-- $x: $line -->\n";
echo "<u>Method 3</u><br />t1:$t1, &nbsp;t2:$t2 &nbsp;diff:",($t2-$t1),"<br /><br />\n";
 
?>
<form method="GET" action="/test-ee.php">
  file: &nbsp;
  <select name="fname">
    <option value="1"<?=(($fnum==1)?" selected=\"selected\"":'');?>>Very small (1.1mb)</option>
    <option value="2"<?=(($fnum==2)?" selected=\"selected\"":'');?>>Small (2.1mb)</option>
    <option value="3"<?=(($fnum==3)?" selected=\"selected\"":'');?>>Normal (4.2mb)</option>
  </select>
  line: &nbsp;<input type="text" value="<?=$linenum;?>" name="lineno" /> &nbsp;
  <input type="submit" value="Test" />
</form>


I've just tested with a 150 MB file; the head|tail solution works MUCH faster than method 2:

time head -n 50000000 /var/log/httpd/access_log | tail -n1
real    0m2.342s
user    0m0.620s
sys     0m0.520s

time php test_linenr.php /var/log/httpd/access_log 50000000
real    1m15.133s
user    1m10.880s
sys     0m0.940s
Your test isn't an equal comparison because it runs from the shell.  Well, one command runs directly in the shell; the other launches PHP from the shell.  The environments are not equivalent.  I made some alterations to my test code to allow for a large file - 153 MB and 2.35 million lines.  The results are shown below.  The differences between our results highlight the necessity of equivalent test environments.

In any case, the results mesh with my earlier analysis.  Somewhere around 500 KB of data to be read, head/tail becomes the more efficient method; up to that point, fgets() is faster.  The file() method is only useful for extremely small files.
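Purely as an illustration, here is a rough sketch of a helper that follows that rule of thumb.  The function name, the 500 KB cutoff, and the assumed 64-byte average line length are my own placeholders, not figures taken from the benchmarks:

<?php
// Hypothetical helper sketched from the discussion above: pick fgets() for
// small reads and head|tail for large ones.  The 500 KB cutoff and the
// 64-byte average line length are rough guesses, not measured values.
function read_line($filename, $lineno, $avg_line_len = 64) {
  $estimated_bytes = $lineno * $avg_line_len;
  if ($estimated_bytes < 500 * 1024) {
    // small read: walk forward with fgets()
    $fp = fopen($filename, "r");
    $line = false;
    for ($i = 0; $i < $lineno; $i++) {
      $line = fgets($fp);
    }
    fclose($fp);
    return $line;
  }
  // large read: let head and tail do the work
  return shell_exec("head -n " . (int)$lineno . " " . escapeshellarg($filename) . " | tail -n 1");
}
?>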

The test code is still available, and the changes I made to the code are shown below as well.

https://mohawk.abetterblind.com/test-ee.php

reps:100    line:500000
 
Method 1 not completed due to memory constraints
 
Method 2
t1:1205270835.6673,  t2:1205270887.4271  diff:51.759772062302
 
Method 3
t1:1205270887.4271,  t2:1205270906.3365  diff:18.909373998642
 
<?php
if (isset($_GET['lineno'])) {
  $linenum = (integer)$_GET['lineno'];
} else {
  $linenum = 100;
}
if ($linenum <= 0) { $linenum = 100; }
$reps = 100;
if (isset($_GET['fname'])) {
  $fnum = (integer)$_GET['fname'];
} else {
  $fnum = 1;
}
switch ($fnum) {
  case 4: $filename = '/web/mohawk/lgtest.txt'; set_time_limit(300); break;
  case 3: $filename = '/web/mohawk/test.txt'; break;
  case 2: $filename = '/web/mohawk/smtest.txt'; break;
  default: $filename = '/web/mohawk/vsmtest.txt'; break;
}
error_log("ee-test hit from {$_SERVER['REMOTE_ADDR']} ($fnum:$linenum)");
echo "reps:$reps &nbsp; &nbsp;line:$linenum<br /><br />\n";
 
if ($fnum != 4) {
	$t1 = microtime(true);
	for ($x=0;$x<$reps;$x++) {
	  $myfile = file($filename);
	  $line = $myfile[($linenum - 1)];
	  unset($myfile);
	}
	$t2 = microtime(true);
	echo "<!-- $x: $line -->\n";
	echo "<u>Method 1</u><br />t1:$t1, &nbsp;t2:$t2 &nbsp;diff:",($t2-$t1),"<br /><br />\n";
} else {
	echo "Method 1 not completed due to memory constraints<br /><br />\n";
}
 
$t1 = microtime(true);
for ($x=0;$x<$reps;$x++) {
  $handle = fopen($filename, "r");
  for ($y=0;$y<$linenum;$y++) {
    $line = fgets($handle);
  }
  fclose($handle);
}
$t2 = microtime(true);
echo "<!-- $x: $line -->\n";
echo "<u>Method 2</u><br />t1:$t1, &nbsp;t2:$t2 &nbsp;diff:",($t2-$t1),"<br /><br />\n";
 
$t1 = microtime(true);
for ($x=0;$x<$reps;$x++) {
  $line = shell_exec ( "head -n $linenum $filename | tail -n 1" );
}
$t2 = microtime(true);
echo "<!-- $x: $line -->\n";
echo "<u>Method 3</u><br />t1:$t1, &nbsp;t2:$t2 &nbsp;diff:",($t2-$t1),"<br /><br />\n";
 
?>
<form method="GET" action="/test-ee.php">
  file: &nbsp;
  <select name="fname">
    <option value="1"<?=(($fnum==1)?" selected=\"selected\"":'');?>>Very small (1.1mb)</option>
    <option value="2"<?=(($fnum==2)?" selected=\"selected\"":'');?>>Small (2.1mb)</option>
    <option value="3"<?=(($fnum==3)?" selected=\"selected\"":'');?>>Normal (4.2mb)</option>
    <option value="4"<?=(($fnum==4)?" selected=\"selected\"":'');?>>Large (153mb)</option>
  </select>
  line: &nbsp;<input type="text" value="<?=$linenum;?>" name="lineno" /> &nbsp;
  <input type="submit" value="Test" />
</form>


Thanks for getting all geeky on this answer!  It's fascinating to see the differences in execution time.  In my case, I had not intended to process files that would in any way be called 'large' by your standards.  To me, a 1000-line file is large!  It's actually promising to see that considerably larger files could be processed in acceptably short amounts of time.  I'll definitely add this thread to my knowledge base and come back to it again.

Thanks again for the solid input
I'm glad you found it as informative as I did.  I'm always looking for ways to further optimize my code, and I'll certainly be keeping both of the partial-read solutions in mind moving forward.

I think the best lesson you can take from this is that the method you chose as the accepted solution is not well suited for scaling.  It will work for you right now, considering the file sizes you are dealing with, but you might want to take a moment to consider the long-term picture.  The server load will increase directly with the size of the file you are reading.  Notice that in the last example I could not even allow method 1 to run because of memory constraints.  Implementing that method could break your app in the future.