Solved

Tampering with very large textfiles...

Posted on 2000-05-17
Last Modified: 2010-04-04
I have a text file of approx. 120,000 lines. How can I put the nth line into a string?

--johan
Question by:sageryd
 
LVL 7

Expert Comment

by:God_Ares
ID: 2818129
Perhaps something like this?

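procedure TForm1.Button1Click(Sender: TObject);
var
  s: string;
  text: TStringList;
begin
  text := TStringList.Create;
  try
    text.LoadFromFile('filename.txt');
    // nr_line is the wanted line; note that Strings[] is zero-based
    s := text.Strings[nr_line];
  finally
    text.Free; // release the list even if loading fails
  end;
end;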
 
LVL 6

Expert Comment

by:edey
ID: 2818427
Or, if you don't want to keep the entire file in RAM, you could try ReadLn'ing through to the nth line. From the help file:

Description

The Readln procedure reads a line of text and then skips to the next line of the file.
Readln(F) with no parameters causes the current file position to advance to the beginning of the next line if there is one; otherwise, it goes to the end of the file.

hence you could try something like:

ix := 1;
while (not EoF(F)) and (ix < n) do
begin
  ReadLn(F); // skip one line without storing it
  Inc(ix);
end;
if (not EoF(F)) and (ix = n) then
  ReadLn(F, my_string)
else
  MessageDlg('File contains fewer than N lines', mtError, [mbOk], 0);


GL
Mike
 
LVL 17

Expert Comment

by:inthe
ID: 2818837
this was from another similar question:

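var
  linenumber: array[1..maxlines] of Cardinal;
  s: string;
  f: file of Byte;
  c: Byte;
begin
  s := '';
  AssignFile(f, 'c:\mylargefile.txt');
  Reset(f);
  Seek(f, linenumber[1245]); // jump to the position stored for line 1245
  repeat
    Read(f, c);
    s := s + Chr(c);
  until (c = 13) or Eof(f); // stop at the carriage return or at the end of the file
  CloseFile(f);
end;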
 
LVL 1

Author Comment

by:sageryd
ID: 2819271
Gor_Ares, your solution does not work at all. I have already tried it a couple of times; a TStringList or any other kind of list is too small to handle such large text files.

Edey, I tried your solution before too. It gets very slow if you have to read, let's say, 100,000 lines before being able to read the 100,001st.

Barry, I think your solution is the best and the easiest. I haven't tried anything yet, but it looks promising. I'll get back to you as soon as I have the time to take a look at it.

Thanks everyone!

--johan
 
LVL 1

Author Comment

by:sageryd
ID: 2819273
God_ares, sorry for the misspelling of your nick, might have sounded a little bit unpleasant.."Gore_ares" ;) Cheers!
 
LVL 20

Expert Comment

by:Madshi
ID: 2820102
Ehm, Barry, who fills the linenumber array? As far as I understand that code, Seek is called with a linenumber that was not initialized, or am I missing something?

Regards, Madshi.
 
LVL 10

Expert Comment

by:Lischke
ID: 2820297
Madshi, this is what I wondered about too. I guess the array is filled with positions of the line starts in the file. This would need a specially prepared file or a way to get to this information otherwise.
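
A minimal sketch of how such an array of line-start positions could be produced, assuming lines end in LF (optionally preceded by CR). It is written against TStream so it can be tried on an in-memory stream; BuildLineIndex, ReadLineAt and TOffsetArray are illustrative names and not part of any code posted in this thread. The data is scanned once to build the index, after which any line can be fetched with a single seek.

```pascal
program LineIndexDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

type
  TOffsetArray = array of Int64; // hypothetical helper type

// Scans the stream once and records the byte offset at which every line starts.
// Assumption: a line ends with LF (#10), optionally preceded by CR (#13).
function BuildLineIndex(Stream: TStream): TOffsetArray;
var
  Buf: array[0..65535] of Byte;
  BytesRead, I, Count: Integer;
  BasePos: Int64;
begin
  SetLength(Result, 1024);
  Result[0] := 0; // line 1 starts at offset 0
  Count := 1;
  BasePos := 0;
  Stream.Position := 0;
  repeat
    BytesRead := Stream.Read(Buf, SizeOf(Buf));
    for I := 0 to BytesRead - 1 do
      if Buf[I] = 10 then
      begin
        if Count = Length(Result) then
          SetLength(Result, Count * 2); // grow geometrically
        Result[Count] := BasePos + I + 1; // the next line starts after the LF
        Inc(Count);
      end;
    Inc(BasePos, BytesRead);
  until BytesRead = 0;
  // Drop the entry that points at the end of the data (empty input or trailing break).
  if Result[Count - 1] >= Stream.Size then
    Dec(Count);
  SetLength(Result, Count);
end;

// Returns line LineNo (1-based) without its line break.
function ReadLineAt(Stream: TStream; const Offsets: TOffsetArray;
  LineNo: Integer): AnsiString;
var
  StartPos, EndPos: Int64;
begin
  if (LineNo < 1) or (LineNo > Length(Offsets)) then
    raise ERangeError.CreateFmt('Line %d is out of range', [LineNo]);
  StartPos := Offsets[LineNo - 1];
  if LineNo < Length(Offsets) then
    EndPos := Offsets[LineNo]
  else
    EndPos := Stream.Size;
  SetLength(Result, Integer(EndPos - StartPos));
  Stream.Position := StartPos;
  if Length(Result) > 0 then
    Stream.ReadBuffer(Result[1], Length(Result));
  // Strip the trailing LF and an optional CR.
  while (Length(Result) > 0) and (Result[Length(Result)] in [#10, #13]) do
    SetLength(Result, Length(Result) - 1);
end;

var
  Source: TStringStream;
  Offsets: TOffsetArray;
begin
  Source := TStringStream.Create('alpha'#13#10'beta'#13#10'gamma'#13#10);
  try
    Offsets := BuildLineIndex(Source);
    WriteLn(Length(Offsets));               // prints 3
    WriteLn(ReadLineAt(Source, Offsets, 2)); // prints beta
  finally
    Source.Free;
  end;
end.
```

For a file on disk, a TFileStream opened with fmOpenRead could be passed instead of the string stream. The index costs 8 bytes per line, so roughly 1 MB for 120,000 lines.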

sageryd,

what you said about TStringList is definitely not true. I have used this (and a rewritten variant for wide strings) to hold a one-million-line file (it needs a lot of memory, though).

In the case you want a fast AND memory inexpensive solution I recommend that you look at memory mapped files. Map the file to memory and iterate through the line breaks (via simple memory pointers).

Ciao, Mike
 
LVL 1

Author Comment

by:sageryd
ID: 2821124
OK, Mike, maybe I was wrong, but what I really meant was that it wouldn't work with a list because of the memory needed. I don't want the user's RAM to get filled up! Maybe you can give an example of what you mean by those memory pointers.

--johan
 
LVL 10

Expert Comment

by:Lischke
ID: 2821243
Well, I don't have ready to use code but the idea is:

1) create a file mapping for the text file:

var
  TextFile: THandle;
  MapHandle: THandle;

begin
  TextFile := OpenFile(...);
  MapHandle := CreateFileMapping(TextFile, nil, PAGE_READONLY, 0, 0, nil);
  :
end;

2) create views of this mapping in your memory and search/count there:

const
  // view offsets must be a multiple of the system allocation granularity
  // (typically 64 KB, see GetSystemInfo), so 4096 would fail on the second view
  MapSize = 65536;

var
  Data,
  Run: PChar;
  Offset: Cardinal;
  LineCount: Cardinal;
  Done: Boolean;

begin
  Offset := 0;
  LineCount := 0;
  Done := False;
  while not Done do
  begin
    Data := MapViewOfFile(MapHandle, FILE_MAP_READ, 0, Offset, MapSize);
    // Data points now to raw string (file) data, start searching the block
    Run := Data;
    while (Run - Data) < MapSize do
    begin
      if Run^ = #13 then
      begin
        Inc(LineCount);
        if LineCount = WantedLine then // WantedLine: the line number to find
        begin
          Done := True;
          Break;
        end;
      end;
      Inc(Run);
    end;
    UnmapViewOfFile(Data);
    Inc(Offset, MapSize);
  end;
end;

3) clean up
  CloseHandle(MapHandle);
  CloseHandle(TextFile);
  etc.

This code is not complete but should give you most of the stuff you need.

Ciao, Mike
 
LVL 10

Expert Comment

by:Lischke
ID: 2821249
Ah yes, you of course also need to stop the loop when you have processed the entire file and there weren't as many lines as you expected.

Ciao, Mike
 
LVL 1

Author Comment

by:sageryd
ID: 2821512
ok...but isn't there a simpler way?
 
LVL 10

Expert Comment

by:Lischke
ID: 2821571
Yes, TStringList.LoadFromFile.

Ciao, Mike
 
LVL 20

Expert Comment

by:Madshi
ID: 2821643
((-:   Hehe Mike...   :-))     (You're absolutely right of course)
 
LVL 10

Expert Comment

by:Lischke
ID: 2821697
;-)
 
LVL 1

Author Comment

by:sageryd
ID: 2821941
Very funny.....

Wouldn't it be just as fast to do something like this:

var
  F: TextFile;
  LineNo: integer;
  S: string;
begin
  LineNo := 5067;
  AssignFile(F, 'filename');
  Reset(F);
  Seek(F, LineNo);
  Readln(F, S);
  {S = the text of the 5067:th line?}
end;

Does the above work as fast as the other? Does this work at all?

--johan
 
LVL 17

Expert Comment

by:inthe
ID: 2821969
howdy fellas:
lischke you were also right about
< the array is filled with positions of the line number >

this is correct ;-)
fill the array with positions/linenumbers of file

array[1,34,124,etc]

Seek(f, linenumber[3]);
// to go to line 124 (the third entry of the array declared as 1..maxlines)

this would only really make sense if you only wanted to store the line numbers of about 50-100 misc lines to quickly jump to a line..otherwise the array could get a bit large :o)
 
LVL 1

Author Comment

by:sageryd
ID: 2822412
I believe it will work just great for my purpose, I'm going to pick a random line number and display the text at that line. btw, how can I retrieve how many lines there are in the text file?

--johan
 
LVL 10

Expert Comment

by:Lischke
ID: 2822582
Johan, that's exactly the point. There's no way to find lines in a text file other than iterating through the content and counting the line breaks. That's what I am trying to make clear here. If there were a way to just seek to a particular line, why the heck would I bother you with memory mapped files? The type TextFile is only a wrapper around a normal Windows file plus some extra handling like ReadLn (which also iterates through the content).
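
To illustrate the point about counting line breaks, here is a small sketch that answers the "how many lines are there" question by scanning the data in blocks. It assumes LF-terminated lines (CR/LF included) and counts a final line that has no trailing break; CountLines is an illustrative name, and TStream is used instead of a TextFile so the routine can be tried on an in-memory stream.

```pascal
program CountLinesDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Counts lines by looking for LF (#10) in 64 KB blocks.
// Assumption: every line break contains an LF; CR-only files are not handled.
function CountLines(Stream: TStream): Integer;
var
  Buf: array[0..65535] of Byte;
  BytesRead, I: Integer;
  LastByte: Byte;
begin
  Result := 0;
  LastByte := 10; // an empty stream has no unterminated last line
  Stream.Position := 0;
  repeat
    BytesRead := Stream.Read(Buf, SizeOf(Buf));
    for I := 0 to BytesRead - 1 do
      if Buf[I] = 10 then
        Inc(Result);
    if BytesRead > 0 then
      LastByte := Buf[BytesRead - 1];
  until BytesRead = 0;
  if LastByte <> 10 then
    Inc(Result); // the last line has no trailing line break
end;

var
  Source: TStringStream;
begin
  Source := TStringStream.Create('one'#13#10'two'#13#10'three');
  try
    WriteLn(CountLines(Source)); // prints 3
  finally
    Source.Free;
  end;
end.
```

For a real file, a TFileStream created with fmOpenRead or fmShareDenyWrite could be passed in. Once the total is known, Random(Total) + 1 (after a call to Randomize) yields a random 1-based line number.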

But you should not be distracted by the code I gave here. Actually, it is almost complete. You need only to add a little file open/close handling and link your own stuff into the search code. This is really not difficult.

Ciao, Mike
 
LVL 7

Expert Comment

by:God_Ares
ID: 2825206
What if you make every line of the text a fixed length? Then you could get a line really fast!
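
A rough sketch of the fixed-length idea, assuming every line (including the last) is padded with spaces to the same width and ends in CR/LF. LineWidth, RecordSize and ReadFixedLine are made-up names for illustration. Because all records have the same size, the start of line n is simply (n - 1) * RecordSize and no scanning is needed.

```pascal
program FixedLengthDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

const
  LineWidth = 8;              // hypothetical payload width, padded with spaces
  RecordSize = LineWidth + 2; // payload plus CR/LF

// Returns line LineNo (1-based) with the padding removed.
function ReadFixedLine(Stream: TStream; LineNo: Integer): AnsiString;
var
  Offset: Int64;
begin
  Offset := Int64(LineNo - 1) * RecordSize;
  if (LineNo < 1) or (Offset + RecordSize > Stream.Size) then
    raise ERangeError.CreateFmt('Line %d is out of range', [LineNo]);
  SetLength(Result, LineWidth);
  Stream.Position := Offset; // one seek, no scanning
  Stream.ReadBuffer(Result[1], LineWidth);
  // Remove the space padding on the right.
  while (Length(Result) > 0) and (Result[Length(Result)] = ' ') do
    SetLength(Result, Length(Result) - 1);
end;

var
  Source: TStringStream;
begin
  Source := TStringStream.Create(
    'alpha   '#13#10'beta    '#13#10'gamma   '#13#10);
  try
    WriteLn(ReadFixedLine(Source, 3));           // prints gamma
    WriteLn(Source.Size div RecordSize);         // prints 3, the line count
  finally
    Source.Free;
  end;
end.
```

The price is disk space: every line takes as much room as the longest one, so this only pays off when the file can be prepared in advance.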
 
LVL 1

Expert Comment

by:AJFleming
ID: 2834343
I was playing around with this sort of thing a couple of years ago, trying to alter text in a 4 GB file, and the main trick I found was to load as much as you can cope with into memory, look through it there, then dump it out to file again once you've done what you need to.

In your case, you're unfortunately going to have to look at every character in the file - at least until you get to the line you want. There are some cute tricks you can pull though.

Allocate yourself a decent-sized buffer, say a couple of megabytes, as an array of chars. BlockRead to fill this buffer, or close to it if you're at the end of the file.

Scan through it counting CR/LF pairs until you get to the line you want. If you get through an entire buffer and you haven't gotten to your insertion point then just blockwrite it back out.

Once you get to your insertion point, blockwrite everything in the buffer up to that point, write out your inserted line, then blockwrite the rest of the buffer - then blockread/blockwrite until you get to the end of the file.

Obviously, this isn't the most elegant solution - but if you don't have any extra knowledge about the layout of the file, that's all you got.
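
The buffer-and-copy steps described above could look roughly like the following sketch. It swaps BlockRead/BlockWrite for TStream.Read and WriteBuffer so it can be tried on in-memory streams, uses a 64 KB buffer rather than a couple of megabytes, and counts LF bytes as line ends; InsertLine is an illustrative name only.

```pascal
program InsertLineDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  SysUtils, Classes;

// Copies Source to Dest and inserts NewLine (plus CR/LF) before line LineNo (1-based).
// Returns False if Source has too few lines; Dest is then a plain copy.
function InsertLine(Source, Dest: TStream; LineNo: Integer;
  const NewLine: AnsiString): Boolean;
var
  Buf: array[0..65535] of Byte;
  BytesRead, I, SplitAt, BreaksSeen: Integer;
  Inserted: AnsiString;
begin
  Inserted := NewLine + #13#10;
  BreaksSeen := 0;
  Result := LineNo = 1;
  if Result then
    Dest.WriteBuffer(Inserted[1], Length(Inserted)); // goes in front of everything
  Source.Position := 0;
  repeat
    BytesRead := Source.Read(Buf, SizeOf(Buf));
    SplitAt := -1;
    if not Result then
      for I := 0 to BytesRead - 1 do
        if Buf[I] = 10 then
        begin
          Inc(BreaksSeen);
          if BreaksSeen = LineNo - 1 then
          begin
            SplitAt := I + 1; // the insertion point follows this line break
            Break;
          end;
        end;
    if SplitAt >= 0 then
    begin
      Dest.WriteBuffer(Buf, SplitAt);                  // buffer up to the split
      Dest.WriteBuffer(Inserted[1], Length(Inserted)); // the new line
      if BytesRead > SplitAt then
        Dest.WriteBuffer(Buf[SplitAt], BytesRead - SplitAt); // rest of the buffer
      Result := True;
    end
    else if BytesRead > 0 then
      Dest.WriteBuffer(Buf, BytesRead); // nothing to insert here, copy through
  until BytesRead = 0;
end;

var
  Source, Dest: TStringStream;
begin
  Source := TStringStream.Create('one'#13#10'three'#13#10);
  Dest := TStringStream.Create('');
  try
    if InsertLine(Source, Dest, 2, 'two') then
      Write(Dest.DataString); // prints the lines one, two, three
  finally
    Dest.Free;
    Source.Free;
  end;
end.
```

With files, Source and Dest would be two TFileStream objects, and the output file would replace the original once the copy has finished.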

Mike, I haven't used memory mapped files myself, but I'd be really interested to see a comparison between this and a more brute-force method.

cheers,

Adam...
 
LVL 20

Expert Comment

by:Madshi
ID: 2834404
I agree with Mike here. Memory mapped files should be the way to go, because Windows then does all the rest for you. And I'm quite sure that Windows will do it in the most performance optimized way. But I agree with Adam, I would like to see a benchmark...

Regards, Madshi.
 
LVL 10

Expert Comment

by:Lischke
ID: 2834463
I won't have the time this very day, but will give you some results tomorrow.

Ciao, Mike
 
LVL 10

Accepted Solution

by:
Lischke earned 100 total points
ID: 2836206
Hi guys,

here I am with my test results. The system I did the tests on is a double processor PII 350 running WinNT 4 SP6 with 128 MB. The test file is a ~11MB text file with > 300,000 lines (copied from Windows.pas) on a 2GB partition (IDE drive) containing also my NT system (the program itself is on another partition).

I inserted a line "This is our target line!!!" exactly on line position 300,000 (used Delphi's IDE) which is shown when found, otherwise a preset line is shown. I tested four ways in the order of increasing speed (my expectation was exactly my result). Note for implementors: To write the code given below I needed 1 hour this morning with 3 of the 4 routines running without any bug from the first moment on (I had no template to look at). I say this not to show how good I am but to point out how easy it is to write it.

Results:

ReadLn           1094 ms
File Stream      406 ms
Pure File API    391 ms
MMF              297 ms

Here's the code I used:

object Form1: TForm1
  Left = 400
  Top = 262
  HorzScrollBar.Visible = False
  BorderStyle = bsSingle
  Caption = 'Form1'
  ClientHeight = 240
  ClientWidth = 549
  Color = clBtnFace
  Font.Charset = ANSI_CHARSET
  Font.Color = clWindowText
  Font.Height = -13
  Font.Name = 'Arial'
  Font.Style = []
  KeyPreview = True
  OldCreateOrder = True
  Scaled = False
  Visible = True
  OnKeyPress = FormKeyPress
  PixelsPerInch = 96
  TextHeight = 16
  object Label1: TLabel
    Left = 8
    Top = 80
    Width = 97
    Height = 53
    AutoSize = False
    Caption = 'Time needed to find line number 300,000:'
    WordWrap = True
  end
  object Label2: TLabel
    Left = 260
    Top = 96
    Width = 28
    Height = 16
    Caption = 'Time'
  end
  object Label3: TLabel
    Left = 164
    Top = 96
    Width = 28
    Height = 16
    Caption = 'Time'
  end
  object Label4: TLabel
    Left = 352
    Top = 96
    Width = 28
    Height = 16
    Caption = 'Time'
  end
  object Label5: TLabel
    Left = 456
    Top = 96
    Width = 28
    Height = 16
    Caption = 'Time'
  end
  object Button1: TButton
    Left = 144
    Top = 28
    Width = 75
    Height = 25
    Caption = 'ReadLn'
    TabOrder = 0
    OnClick = Button1Click
  end
  object Button2: TButton
    Left = 240
    Top = 28
    Width = 75
    Height = 25
    Caption = 'Stream'
    TabOrder = 1
    OnClick = Button2Click
  end
  object Button3: TButton
    Left = 336
    Top = 28
    Width = 75
    Height = 25
    Caption = 'File API'
    TabOrder = 2
    OnClick = Button3Click
  end
  object Button4: TButton
    Left = 432
    Top = 28
    Width = 75
    Height = 25
    Caption = 'MMF'
    TabOrder = 3
    OnClick = Button4Click
  end
end




unit Unit1;

interface

uses
  Windows, SysUtils, Forms, Classes, StdCtrls, Controls, Buttons, Graphics, Dialogs, ComCtrls, Messages,
  ExtCtrls;


type
  TForm1 = class(TForm)
    Button1: TButton;
    Button2: TButton;
    Button3: TButton;
    Button4: TButton;
    Label1: TLabel;
    Label2: TLabel;
    Label3: TLabel;
    Label4: TLabel;
    Label5: TLabel;
    procedure FormKeyPress(Sender: TObject; var Key: Char);
    procedure Button1Click(Sender: TObject);
    procedure Button2Click(Sender: TObject);
    procedure Button3Click(Sender: TObject);
    procedure Button4Click(Sender: TObject);
  private
  public
  end;

var
  Form1: TForm1;

implementation

uses
  MMSystem;
 
{$R *.DFM}

procedure TForm1.FormKeyPress(Sender: TObject; var Key: Char);
begin
  if key = #27 then
  begin
    Key:=#0;
    Close;
  end;
end;

const
  FileName = 'C:\Temp\Test.txt'; // 305315 lines in ~11.4MB (this is mainly a Windows.pas copy)

var
  Buffer: array[0..1024 * 1024 - 1] of Byte;
 
procedure TForm1.Button1Click(Sender: TObject);

var
  Start: Cardinal;
  F: TextFile;
  S: String;
  Counter: Cardinal;

begin
  Screen.Cursor := crHourGlass;
  S := 'nothing found';
  try
    AssignFile(F, FileName);
    Reset(F);
    Counter := 0;
    Start := timeGetTime;
    while not EOF(F) do
    begin
      if Counter = 299999 then
      begin
        ReadLn(F, S);
        Break;
      end;
      ReadLn(F);
      Inc(Counter);
    end;
    Label3.Caption := Format('%d ms', [timeGetTime - Start]);
    CloseFile(F);
  finally
    Screen.Cursor := crDefault;
    ShowMessage(S);
  end;
end;

procedure TForm1.Button2Click(Sender: TObject);

var
  Start: Cardinal;
  S: String;
  Stream: TFileStream;
  LineCounter,
  CharCounter: Cardinal;
  Head, Tail: PChar;

begin
  Screen.Cursor := crHourGlass;
  S := 'nothing found';
  try
    Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyNone);
    LineCounter := 0;
    Start := timeGetTime;
    with Stream do
    begin
      while Position < Size do
      begin
        CharCounter := Read(Buffer, SizeOf(Buffer));
        Head := @Buffer;
        repeat
          while (CharCounter > 0) and (Head^ <> #13) do
          begin
            Inc(Head);
            Dec(CharCounter);
          end;

          if CharCounter > 0 then
          begin
            Inc(Head);
            Dec(CharCounter);
           
            Inc(LineCounter);
            if LineCounter = 299999 then
            begin
              // load the line
              if Head^ = #10 then Inc(Head);
              Tail := Head;
              // NOTE: here a buffer overrun should be checked too
              while Tail^ <> #13 do Inc(Tail);
              SetString(S, Head, Tail - Head);
              Break;
            end;
          end;
        until CharCounter = 0;
      end;
    end;
    Label2.Caption := Format('%d ms', [timeGetTime - Start]);
    Stream.Free;
  finally
    Screen.Cursor := crDefault;
    ShowMessage(S);
  end;
end;

procedure TForm1.Button3Click(Sender: TObject);

var
  Start: Cardinal;
  S: String;
  LineCounter,
  CharCounter: Cardinal;
  Head, Tail: PChar;
  FileHandle: THandle;
 
begin
  Screen.Cursor := crHourGlass;
  S := 'nothing found';
  try
    // note: access speed could still be improved by using the FILE_FLAG_NO_BUFFERING flag, but this requires
    //       a buffer sized and aligned to the current disk sector size.
    FileHandle := CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, nil, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
    LineCounter := 0;
    Start := timeGetTime;
    begin
      while True do
      begin
        ReadFile(FileHandle, Buffer, SizeOf(Buffer), CharCounter, nil);
        if CharCounter = 0 then Break;
        Head := @Buffer;
        repeat
          while (CharCounter > 0) and (Head^ <> #13) do
          begin
            Inc(Head);
            Dec(CharCounter);
          end;

          if CharCounter > 0 then
          begin
            Inc(Head);
            Dec(CharCounter);
           
            Inc(LineCounter);
            if LineCounter = 299999 then
            begin
              // load the line
              if Head^ = #10 then Inc(Head);
              Tail := Head;
              // NOTE: here a buffer overrun should be checked too
              while Tail^ <> #13 do Inc(Tail);
              SetString(S, Head, Tail - Head);
              Break;
            end;
          end;
        until CharCounter = 0;
      end;
    end;
    Label4.Caption := Format('%d ms', [timeGetTime - Start]);
    CloseHandle(FileHandle);
  finally
    Screen.Cursor := crDefault;
    ShowMessage(S);
  end;
end;

procedure TForm1.Button4Click(Sender: TObject);

var
  Start: Cardinal;
  S: String;
  LineCounter,
  CharCounter: Cardinal; // Int64 for files >= 4GB
  Base,
  Head, Tail: PChar;
  FileHandle: THandle;
  FileMapping: THandle;

begin
  Screen.Cursor := crHourGlass;
  S := 'nothing found';
  try
    FileHandle := CreateFile(FileName, GENERIC_READ, FILE_SHARE_READ, nil, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, 0);
    FileMapping := CreateFileMapping(FileHandle, nil, PAGE_READONLY, 0, 0, nil);
    CharCounter := GetFileSize(FileHandle, nil);
    // map entire file content into address space
    Base := MapViewOfFile(FileMapping, FILE_MAP_READ, 0, 0, 0);
    LineCounter := 0;
    Start := timeGetTime;
    begin
      Head := Base;
      while CharCounter > 0 do
      begin
        while (CharCounter > 0) and (Head^ <> #13) do
        begin
          Inc(Head);
          Dec(CharCounter);
        end;

        if CharCounter > 0 then
        begin
          Inc(Head);
          Dec(CharCounter);

          Inc(LineCounter);
          if LineCounter = 299999 then
          begin
            // load the line
            if Head^ = #10 then Inc(Head);
            Tail := Head;
            // NOTE: here a buffer overrun should be checked too
            while Tail^ <> #13 do Inc(Tail);
            SetString(S, Head, Tail - Head);
            Break;
          end;
        end;
      end;
      UnmapViewOfFile(Base);
    end;
    Label5.Caption := Format('%d ms', [timeGetTime - Start]);
    CloseHandle(FileMapping);
    CloseHandle(FileHandle);
  finally
    Screen.Cursor := crDefault;
    ShowMessage(S);
  end;
end;

end.




Ciao, Mike
 
LVL 20

Expert Comment

by:Madshi
ID: 2836242
Well, that looks like being worth a grade A (and perhaps even a point boost)...   :-))

Regards, Madshi.
 
LVL 1

Expert Comment

by:AJFleming
ID: 2836263
Nice code Mike, I hope you don't mind if I shamelessly lift some of it for my own purposes :)

Any idea what the difference in speed is between them?

cheers,

Adam...
 
LVL 1

Expert Comment

by:AJFleming
ID: 2836265
Sorry, should have said - "what was the difference in speed in your system?" I'm away from my Delphi compiler at the moment so I can't test it...

cheers,

Adam...
 
LVL 10

Expert Comment

by:Lischke
ID: 2836293
:-) thank you guys...

Adam, I'm not sure why you ask about the speed difference. I have included the results I got in the text above. See there!

Ciao, Mike
 
LVL 10

Expert Comment

by:Lischke
ID: 2850429
Johan, are you still with us?
 
LVL 1

Author Comment

by:sageryd
ID: 2851801
Comment accepted as answer
 
LVL 1

Author Comment

by:sageryd
ID: 2851802
Yep, I'm with ya! I've just had so much to do this week - last week in high school, had to work through that pile of homework before I could deal with anything else. But now I'm done with it! I feel soooo relieved! Lischke, you've done a very good job! You'll get the points! Thanks everyone else too!

--johan
