Solved

get specific line in large csv file using php

Posted on 2013-11-23
18
3,477 Views
Last Modified: 2013-11-27
Greetings,
I have a file.
It is standard with <email>,<tag>
There are over a million lines in the file like that.

<email>,<tag>
<email>,<tag>
..continued.

no headers.
If I have a random value, say 124535,

How can I get the 124535th line in the file without reading the whole thing into memory?
I can call local commands, but the execution order must be in PHP.
Thanks
0
Comment
Question by:Evan Cutler
  • 7
  • 4
  • 3
  • +3
18 Comments
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39671964
You can do this easily.  Just read up to the 124,535th line.  That's only about 1/8 of the lines. If you load the file into a database table, you can use the LIMIT clause to get the correct line.  Use this function: http://php.net/manual/en/function.fgets.php
0
 
LVL 13

Expert Comment

by:duncanb7
ID: 39671975
Following code  might be what you need for complete PHP code to read content at certain line-number,(Note: where 4096 in code area, you can change it with adding some margin if you know the max characters in one line of your file to speed up the code running time)  

Hope I understand your questionc completely, if no, please point it out


Duncan

php code , it is tested at my linux server
========================================
<?php
  $handle = @fopen("../../yourfile.csv", "r");
if ($handle) {
   // while (($buffer = fgets($handle, 4096)) !== false) { echo $buffer; }
   // if (!feof($handle)) {   echo "Error: unexpected fgets() fail\n";}
   $num=123;
	$line=readLine($num,$handle); ///123 is just exmple line-number;
	echo "Line Num:".$num."-----------------\n";
	echo $line."\n";
    fclose($handle);
}

  function readLine ($linenum,$fh) {
        $line = fgets ($fh, 4096);
        $pos = -1;
        $i = 0;
        while (!feof($fh) && $i<($linenum-1)) {
            $char = fgetc($fh);
            if ($char != "\n" && $char != "\r") {
                fseek($fh, $pos, SEEK_SET);
                $pos ++;
            }
            else $i ++; 
        }
        $line = fgets($fh);
        return $line;
    } 
?>

Open in new window

0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39671984
Please examine the descriptions of these functions:

fgetc()
fgets()
fseek()

It always pays to read the manual :-)
0
 
LVL 48

Expert Comment

by:Tintin
ID: 39672057
Depending on whether you have safe_mode enabled, you can always "cheat" and do:

<?php
$line = `sed -n 124535p filename`;
echo $line
?>

Open in new window

0
 
LVL 13

Expert Comment

by:duncanb7
ID: 39672108
Dear All,

Thanks Ray_Paseur's reminder, I used slower function that is not related to 4096
and it also run much slower since using fgetc from last code.
After Tintin's suggestion code that is really good code and really fast & easy, and I try to
improve my code as follows code ,as same speed as "sed"

 arcee123, hope you try this all code including Tintin's code for speed at your side.
I think I have no more room to improve the speed  faster by fgets.

Duncan

<?php
  $f = fopen ("../../data/yourfile.csv", "r");  $i= 0;// where 5000 is line number
  while ($line= fgets ($f)) { if ($i==5000) {print ($line);break;}; ++$i; }   fclose ($f);
?>

Open in new window

0
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39672207
Ok guys.Thanks much.
What happens if I need to read somewhere beyond 800000?  Doesn't it cost alt to get to the 800000 if I have to readline to that point?
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 39672307
Yes, it does cost.  But as long as it is a text file without any indexes that you can use to skip the unwanted lines, that is what you have to do.  Which is why Ray suggested that you load the CSV file into a database because databases do create indexes and provide the ability to search much more quickly than reading a text file line by line.

Years ago when I had to write a program that used a CSV file for a 'database', the first thing I did when the program started was to read it all into memory and create indexes to find the data I needed.  That worked then because the data was read many times while the program was running.  If you are doing this on a web site and only reading one line each time, you would be Much better off putting it a database because it would be Much quicker.
0
 
LVL 108

Accepted Solution

by:
Ray Paseur earned 167 total points
ID: 39672623
Yes, it does cost.  But the cost may be so small that optimizing this would be like milking a mouse -- no matter how much effort you put into it, you won't get very much out.

Please install this script and run it to see the effect.  On my server, the load process took about 3 seconds and the read/lookup took less than half a second.  You might extend this code to add a comparison of an indexed data base lookup, so you can see what is to be gained from optimizing a half-second process.  If the process is to be repeated a million times, it's clearly worth optimizing.

0800000Simulated email address,simulated tag
WRITE LINES STOPPED  3,184.101 ms
READ LINES STOPPED  343.783 ms

<?php // RAY_temp_arcee123.php
error_reporting(E_ALL);
echo '<pre>';

// INSTANTIATE THE TIMER
$sw  = new Stopwatch;

// SIMULATED DATA
$line = 'Simulated email address' . ',' . 'simulated tag';

// LOAD DATA INTO THE FILE
$sw->start('WRITE LINES');
$cnt = 1000000;
$fph = fopen('RAY_temp_arcee123.txt', 'w');
while ($cnt)
{
    $num = str_pad($cnt,7,'0', STR_PAD_LEFT);
    fwrite($fph, $num . $line . PHP_EOL);
    $cnt--;
}
fclose($fph);
$sw->stop('WRITE LINES');

// READ THE DATA TO FIND LINE #800,000
$sw->start('READ LINES');
$arr = file('RAY_temp_arcee123.txt', FILE_IGNORE_NEW_LINES);
foreach ($arr as $str)
{
    if (strpos($str, '0800000') === 0) break;
}
echo $str . PHP_EOL;
$sw->stop('READ LINES');


// A SCRIPT TIMER FOR ALL OR PART OF A SCRIPT PHP 5+
// MAN PAGE http://php.net/manual/en/function.microtime.php
class StopWatch
{
    protected $a; // START TIME
    protected $s; // STATUS - IF RUNNING
    protected $z; // STOP TIME

    public function __construct()
    {
        $this->a = array();
        $this->s = array();
        $this->z = array();
    }

    // A METHOD TO PROVIDE A FINAL READOUT, IF NEEDED
    public function __destruct()
    {
        $ret = $this->readout();
        if (!$ret) return FALSE;
        echo "<b>$ret</b>";
        echo PHP_EOL;
    }

    // A METHOD TO REMOVE A TIMER
    public function reset($name='TIMER')
    {
        // RESET ALL TIMERS
        if ($name == 'TIMER')
        {
            $this->__construct();
        }
        else
        {
            unset($this->a[$name]);
            unset($this->s[$name]);
            unset($this->z[$name]);
        }
    }

    // A METHOD TO CAPTURE THE START TIME
    public function start($name='TIMER')
    {
        $this->a[$name] = microtime(TRUE);
        $this->z[$name] = $this->a[$name];
        $this->s[$name] = 'RUNNING';
    }

    // A METHOD TO CAPTURE THE END TIME
    public function stop($name='TIMER')
    {
        $ret = NULL;

        // STOP ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                // IF THIS TIMER IS STILL RUNNING, STOP IT
                if ($this->s[$name])
                {
                    $this->s[$name] = FALSE;
                    $this->z[$name] = microtime(TRUE);
                }
            }
        }

        // STOP ONLY ONE OF THE TIMERS
        else
        {
            if ($this->s[$name])
            {
                $this->s[$name] = FALSE;
                $this->z[$name] = microtime(TRUE);
            }
            else
            {
                $ret .= "ERROR: CALL TO STOP() METHOD: '$name' IS NOT RUNNING";
            }
        }

        // RETURN AN ERROR MESSAGE, IF ANY
        return $ret;
    }

    // A METHOD TO READ OUT THE TIMER(S)
    public function readout($name='TIMER', $dec=3, $m=1000, $t = 'ms', $eol=PHP_EOL)
    {
        $str = NULL;

        // GET READOUTS FOR ALL THE TIMERS
        if ($name == 'TIMER')
        {
            foreach ($this->a as $name => $start_time)
            {
                $str .= $name;

                // IF THIS TIMER IS STILL RUNNING UPDATE THE END TIME
                if ($this->s[$name])
                {
                    $this->z[$name] = microtime(TRUE);
                    $str .= " RUNNING ";
                }
                else
                {
                    $str .= " STOPPED ";
                }

                // RETURN A DISPLAY STRING
                $lapse_time = $this->z[$name] - $start_time;
                $lapse_msec = $lapse_time * $m;
                $lapse_echo = number_format($lapse_msec, $dec);
                $str .= " $lapse_echo $t";
                $str .= $eol;
            }
            return $str;
        }

        // GET A READOUT FOR ONLY ONE TIMER
        else
        {
            $str .= $name;

            // IF THIS TIME IS STILL RUNNING, UPDATE THE END TIME
            if ($this->s[$name])
            {
                $this->z[$name] = microtime(TRUE);
                $str .= " RUNNING ";
            }
            else
            {
                $str .= " STOPPED ";
            }

            // RETURN A DISPLAY STRING
            $lapse_time = $this->z[$name] - $this->a[$name];
            $lapse_msec = $lapse_time * $m;
            $lapse_echo = number_format($lapse_msec, $dec);
            $str .= " $lapse_echo $t";
            $str .= $eol;
            return $str;
        }
    }
}

Open in new window

Best to all, over and out, ~Ray
0
 
LVL 34

Assisted Solution

by:gr8gonzo
gr8gonzo earned 167 total points
ID: 39672785
My $0.02 - I'd agree with most everything that's been said. The only thing I would add is that I often do a lot of large data imports at work, and some of them were so large that I needed to break them up into multiple runs (e.g. process 10,000 lines at a time). As I got towards the end of the file, each run took longer and longer because I kept having to re-read all the previous lines before reaching the desired line to resume the import.

I was able to get around this by using ftell() and fseek(). The ftell() function will return your current position in the file. So if you've read 1,000,000 lines that are around 50 - 150 bytes each, and you run ftell(), you might get  a number like something like 75024048.

Then, if you want to restart the script and zoom back to that position instantly, you simply use fseek() to set the file pointer back to that position without having to re-read all the previous lines.

You can use this method to build "indexes" of the major intervals (e.g. a separate file that holds the offsets for every 10,000 lines), and then you can quickly figure out the nearest 10,000th interval, fseek to it, and read far fewer lines to get to the desired line.

As Ray said, databases are far better when you're doing this repeatedly. You'll avoid locking issues and get far better performance overall.
0
Find Ransomware Secrets With All-Source Analysis

Ransomware has become a major concern for organizations; its prevalence has grown due to past successes achieved by threat actors. While each ransomware variant is different, we’ve seen some common tactics and trends used among the authors of the malware.

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39672989
Using @gr8gonzo's method the results are substantially faster.

WRITE LINES STOPPED  3,467.103 ms
READ LINES STOPPED  0.057 ms

This also appeared to be independent of the actual record chosen - line #200,000 was as fast as line #800,000.
0
 
LVL 82

Assisted Solution

by:Dave Baldwin
Dave Baldwin earned 166 total points
ID: 39672999
Ray, since you are counting down to create your lines in the text file, finding '08000000' only reads 20% of the file (200,000 lines), not 80% (800,000 lines).  So it would really take 4 times as long to get to the 800,000th line.  About 1.376 seconds.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39673025
Dave: I read the entire file ...

$arr = file('RAY_temp_arcee123.txt', FILE_IGNORE_NEW_LINES);

... and used an iterator to get to the selected line.  Using or omitting the iterator was undetectable in my tests -- iterating over hundreds of thousands of lines with foreach() did not even show up in the timing results.  The time to read the 45 million bytes was where the 344 milliseconds came from.

It appears "much faster" to use fseek() if you know what you want to locate.  And (serve this with a helping of mouse milk) we are talking about taking down something that needs 1/3 second to run ;-)
0
 
LVL 9

Author Comment

by:Evan Cutler
ID: 39673028
1.7 seconds is Ok.  It's having to read for 2 minutes that's the problem. I am trying it now.
Thanks
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 39673062
Oh... Ok.
0
 
LVL 9

Author Closing Comment

by:Evan Cutler
ID: 39681914
Hey, thanks guys.  This got me where I was going.
Appreciate it so much.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39681960
Thanks for the points and thanks for using EE, ~Ray
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 39681984
If this gets re-opened for any reason, I shouldn't get any points for that post because, as Ray pointed out, I was wrong.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39681993
Dave: It was still a useful contribution to the conversation.  Happy Thanksgiving, ~Ray
0

Featured Post

What Should I Do With This Threat Intelligence?

Are you wondering if you actually need threat intelligence? The answer is yes. We explain the basics for creating useful threat intelligence.

Join & Write a Comment

It’s 2016. Password authentication should be dead — or at least close to dying. But, unfortunately, it has not traversed Quagga stage yet. Using password authentication is like laundering hotel guest linens with a washboard — it’s Passé.
This article discusses how to create an extensible mechanism for linked drop downs.
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.
Connecting to an Amazon Linux EC2 Instance from Windows Using PuTTY.

757 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

20 Experts available now in Live!

Get 1:1 Help Now