philsivyer

asked on

RUBY - Read and write to file - selected rows

Hello
I want to be able to select rows (or a size - say 40KB chunks of data) from a file and write the selection to a new file.
The original file is too big to read using Notepad, so I want to break it into separate, manageable files. Would it be better to select a number of rows per file, or to split by size?
Any help much appreciated
philsivyer

ASKER

Hello
Forgot to mention that the files are .txt.
I have managed to experiment, and this basic script seems to do the trick - but it would be great if there were a solution to automate the whole process and write out the files in manageable chunks from one script.
My attempt thus far..

# open the file, position at the current offset, and read the first 20 bytes
a = File.new("C:/RUBY_WORKING_MODELS/Files_test.txt")
a.sysseek(0, IO::SEEK_CUR)
p a.sysread(20)
a.close
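
One way to automate that by size would be to loop the sysread over fixed 40KB chunks, writing each chunk to its own numbered file - a rough sketch along the lines of the snippet above (the paths are just examples, and note that a chunk boundary can cut a row in half, which the discussion below comes back to):

CHUNK_SIZE = 40 * 1024  # 40KB per part file

src = File.new("C:/RUBY_WORKING_MODELS/Files_test.txt")
part = 1
begin
  loop do
    data = src.sysread(CHUNK_SIZE)  # raises EOFError at the end of the file
    File.open("C:/RUBY_WORKING_MODELS/Files_test_part_#{part}.txt", "w") { |f| f.write(data) }
    part += 1
  end
rescue EOFError
  # reached the end of the input file
ensure
  src.close
end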
Gertone (Geert Bormans)
So, binary access is not required.
How big will these files be?
Will they easily fit in memory (e.g. 100MB), or will they rather be in the gigabyte range?
I was wondering - maybe something like this could help.
Set the line counter according to the block size you want.
I am not sure whether IO.foreach goes through the file in a streaming fashion,
but even so, the in-memory object would be easy to handle in the 100MB range, far beyond what Notepad can swallow.

I chose a line-based approach; maybe you want to do some regular-expression logic on the part files afterwards.

I have done this with a 100MB file on an old laptop with no problems
Block_size = 500   # lines per output part
curr_block = 1
arr = []

# write the collected lines out as one numbered part file
def dump_in_file(lar, lno)
  File.open("F:\\12_Ruby\\part_#{lno}.txt", "w") do |res|
    lar.each do |a|
      res.puts a
    end
  end
end

IO.foreach('F:\12_Ruby\PHILDATA.txt') do |line|
  arr << line
  if arr.length == Block_size
    dump_in_file(arr, curr_block)
    arr = []
    curr_block += 1
  end
end
# flush the leftover lines (the last block is usually not full)
dump_in_file(arr, curr_block) unless arr.empty?


Gertone, what will happen if a single line is bigger than your Block_size?
The dump inside the if statement would never be reached, or am I wrong? In the worst case this would make the output file as big as the input file ;-).

leflon
Block_size is the number of lines in a block, actually.
If all the data is on one line, then the result file will be the input file - that is correct;
then we would need a mildly different approach.
For that I need more info on the content of the file.
I am still curious about the file size too.
Well, I have been dealing with Phil's data files before,
and a lot of it is dumps from Excel or databases.

Note that the original question stated:
> I want to be able to select rows (or a size - say 40KB chunks of data) from a file
so I think it is safe to assume that there will be enough lines in the data.
The reason I chose the line approach I mentioned with my answer: when choosing a memory-block approach, one might break a line in the middle, and that won't help in data analysis later.
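If a size-based split were really wanted, a compromise would be to read a fixed chunk and then read on to the end of the current line before cutting, so no row gets broken - a rough sketch only (same example paths as above, not tested against the real data):

CHUNK_SIZE = 40 * 1024  # aim for roughly 40KB per part

part = 1
File.open('F:\12_Ruby\PHILDATA.txt') do |src|
  until src.eof?
    data = src.read(CHUNK_SIZE)
    data << src.gets unless src.eof?  # extend the chunk to the end of the current line
    File.open("F:\\12_Ruby\\part_#{part}.txt", "w") { |out| out.write(data) }
    part += 1
  end
end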
Yep, I agree. Reading memory blocks may cause problems for further analysis.

And I have to admit that I was misinterpreting the arr.length statement. (Mea culpa - I have to refresh my old Ruby memories.)

Apart from this being a nice little project, the first question that came to my mind was: why use Notepad for reading a huge text file?

cheers
leflon
Hello
The size of the file is 5GB. I suggested using UltraEdit to open the files, but company politics and all that - it has to be Notepad.
Gertone - will your script write separate files, and with a naming convention?
UltraEdit will not handle 5GB files well either.
Yes, there is a naming convention... the block number will be in the file name.
It is not padded with zeroes, so the files won't sort well.
But if you want that, I can change it.
I am curious now - you won't be able to load 5GB in memory, so I hope that IO.foreach does the right thing there.
Hi Phil, I could not resist doing the tests.
On my laptop I created a 2GB text file and split it up into 2.5MB chunks using the above script.
Memory usage never went beyond 1.2GB on my two-year-old Windows XP laptop.
So we can safely assume that the process does not generate an in-memory copy of the file.
You should be able to tackle your task with this script.
Two notes:
- we should add zero padding in the numbering, otherwise we screw up the sort ordering of the files
- if you really need 40KB files, you will have too many to elegantly put in one directory, so we would need some extra code to generate a subdirectory structure
Let me know if you need help with either of the two tasks.
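The zero padding itself would be a small change in dump_in_file: format the counter with leading zeros so that the alphabetical order of the file names matches their numeric order. Something like this (five digits here is just an example):

# part_00001.txt, part_00002.txt ... part_00010.txt sort correctly,
# whereas part_1.txt, part_10.txt, part_2.txt do not
def dump_in_file(lar, lno)
  File.open(format("F:\\12_Ruby\\part_%05d.txt", lno), "w") do |res|
    lar.each do |a|
      res.puts a
    end
  end
end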
Gertone
I misled you with the file size - the file is 1.2GB.
I said 40KB files - what can Notepad handle with relative ease?
Yes - I would need help with the "zero padding" - what does that do?

Regards
ASKER CERTIFIED SOLUTION
Gertone (Geert Bormans)
Gertone - Brilliant!
Can you give me a brief summary of how this works, please?
I read the file in line by line (that is what IO.foreach does).
I push each line onto an array.
When the array reaches a certain size, I write the array to a file and clear the array.
I keep a counter for every file I write, so I can generate dynamic file names.

I have put the writing to the file in a separate method, because I have to call it again at the end
(most likely the last lines don't make up a full array, so I need to flush that out before the program ends).

Pretty straightforward, I think.
Gertone
Thanks for this - everything worked just great.

Regards
Many Thanks