katerina-p

asked on

Importing a very large 1.7 GB text file into Access

Hi Experts,

I'm trying to import a pipe-delimited txt file into Access 2007. It's a 1.7 GB Unicode file encoded as UCS-2 Big Endian with a byte order mark (BOM). Each line is approx. 200 characters; about 8 million lines.

How can I import it?

I've tried http://stackoverflow.com/questions/1376756/superfast-way-to-read-large-files-line-by-line-in-vba-please-critique and get error 5 (Invalid procedure call or argument) on the line "sString = Space([1.7Gb])".

I've tried http://www.vbforums.com/showthread.php?430424-Reading-a-Binary-File-into-Excel-using-VBA but get error 7 (Out of memory). Furthermore, if I restrict the import to an arbitrary smaller amount, every second character is Null (Chr(0)), presumably because of the Unicode encoding.


How can I import this file into a table?

Many thanks!

Katerina
Lee W, MVP

Not sure I can help, but keep in mind, Access has a 2GB file size limit.

You might try saving the file using a different character set.  Not sure, but something like Notepad++ might be able to do that (opening the file could take many minutes... or longer).
Notepad++ can definitely change the encoding... if it can open it.
mouseware

Similar situation with an even larger file... I used 7-Zip to split the file into manageable "chunks" using STORE mode (no compression). Then check the end of each file and the start of the next to ensure the split didn't happen in the middle of a line; if it did, I corrected it. Then I used EditPad Pro (which should open 4 GB+ files: http://www.editpadpro.com/manual/epponly.html) to update the encoding, and ended up bulk importing into SQL.
katerina-p

ASKER

Access has a 2GB file size limit.
Yeah, I know - I only need to save a few hundred thousand lines from the file, but they're at random line numbers.

Notepad++ can't open it either
i used 7zip to split the file into manageable "chunks"

Can I do this programmatically through VBA? I need to process a new 1.7 GB file weekly.
You can do it through a command line to split the files, but then you'd run into the issue of parsing the last/first lines of each file to work out whether a record was cut at the split, and then re-attaching the partials to one of the files... doing it that way is much more involved.
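In answer to the VBA question, a hypothetical sketch of splitting the file into fixed-size binary chunks from VBA (the path, part-file naming, and chunk size are illustrative; because the source is UCS-2, keep the chunk size even so characters aren't split, and the mid-line cut problem described above still applies):

' Hypothetical sketch only: split a large file into fixed-size binary chunks.
' Records will still be cut mid-line at chunk boundaries and need re-joining.
Public Sub SplitFile(ByVal sourcePath As String, ByVal chunkBytes As Long)
    Dim fIn As Integer, fOut As Integer
    Dim buffer() As Byte
    Dim bytesLeft As Long, bytesToRead As Long
    Dim part As Integer

    fIn = FreeFile
    Open sourcePath For Binary Access Read As #fIn
    bytesLeft = LOF(fIn)

    Do While bytesLeft > 0
        bytesToRead = chunkBytes
        If bytesToRead > bytesLeft Then bytesToRead = bytesLeft
        ReDim buffer(0 To bytesToRead - 1)
        Get #fIn, , buffer                 ' read the next chunk of raw bytes
        bytesLeft = bytesLeft - bytesToRead

        part = part + 1
        fOut = FreeFile
        Open sourcePath & ".part" & part For Binary Access Write As #fOut
        Put #fOut, , buffer                ' write the chunk to a numbered part file
        Close #fOut
    Loop

    Close #fIn
End Sub

Called, for example, as SplitFile "C:\data\FULL.txt", 100000000 (100 MB, even-sized chunks) - the path is hypothetical.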
Go to the creator of this file and simply ask them:
Does the file absolutely need the encoding?
Is it impossible to generate just the data and the pipe delimiter?
Why is it not possible to export this file to any other format that Access can import?
(Excel, tab, CSV, XML, dBASE, Paradox, Lotus 1-2-3, etc.)

Many times just asking will get you what you need...

This is all before the size issue. Then ask if the file can be split or filtered to create smaller separate files.

Again, just ask...


JeffCoachman
Can you post a few lines of the file?

Generally, you would
1. open the file
2. read a line
3. use the SPLIT() function to parse the line on the pipe delimiter
4. examine different items in the parsed array to decide whether to keep the data
5. conditionally insert the data into a new row
repeat steps 2-5 until you reach the end of the file
6. close the file

======
Notes:
* you can use Open and Line Input statements, or the FileSystemObject and TextStream objects, to read the file (see the sketch below).
* if the file contains a header row, it might be simpler to write a PowerShell script to extract the data you need and then do a regular import of the data into Access.
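A minimal sketch of that loop, assuming the file can be read with Line Input (i.e. ANSI or already converted; the Unicode issue comes up again below) and using illustrative table and field names (tblKeys, tblImport):

' Minimal sketch of steps 1-6. Table/field names are assumptions;
' tblKeys is assumed to hold the 500k reference keys as text.
Public Sub ImportWantedLines(ByVal filePath As String)
    Dim db As DAO.Database
    Dim fNum As Integer
    Dim sLine As String
    Dim fields() As String

    Set db = CurrentDb
    fNum = FreeFile
    Open filePath For Input As #fNum           ' 1. open the file

    Do Until EOF(fNum)
        Line Input #fNum, sLine                ' 2. read a line
        fields = Split(sLine, "|")             ' 3. parse on the pipe delimiter

        ' 4. examine the parsed items, e.g. the 13-digit reference in fields(0)
        If Not IsNull(DLookup("ISBN", "tblKeys", "ISBN='" & fields(0) & "'")) Then
            ' 5. conditionally insert the data into a new row
            db.Execute "INSERT INTO tblImport (ISBN, Publisher) VALUES ('" & _
                       fields(0) & "','" & fields(1) & "')", dbFailOnError
        End If
    Loop                                       ' repeat 2-5 until end of file

    Close #fNum                                ' 6. close the file
End Sub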
Go to the creator of this file and simply ask them:...

Tried, failed!


Can you post a few lines of the file?

OK - have attached a file which I created using the below code on the full file.

Public Function QuickRead(FName As String) As String
    Dim i As Integer
    Dim res As String
    Dim l As Long

    ' Read the first l bytes of the file into a string buffer
    i = FreeFile
    l = FileLen(FName) '1.7Gb, too large
    l = 50000 'just to get a snippet
    res = Space(l)
    Open FName For Binary Access Read As #i
    Get #i, , res
    Close #i

    ' Write the snippet out to a PART file alongside the FULL file
    i = FreeFile
    FName = Replace(FName, "FULL", "PART")
    Open FName For Output As #i
    Write #i, res
    Close #i

End Function



Thanks all
PART.txt
@katerina
There is no header row.

What criteria are you using for selecting which records to insert into your table?
Whether the 13-digit reference (first field) matches one of 500k keys I have.
There are no line breaks.

What constitutes the start/end of a record?
Although you stated that these are ~200 character records, I'm only seeing (what looks like) records in the high 80s-low 90s character length.
Although you stated that these are ~200 character records

Yes, apologies - I meant max 200 chars. The full spec is as follows:

Field            Type        Width / format
DS_ISBN          DOUBLE      13 or 15
DS_Publisher     CHAR        4
DS_Author_CN     VARCHAR     Variable (50)
DS_Author_SN     VARCHAR     Variable (50)
DS_Price         CURRENCY    99999.99
DS_VAT_Elem      CURRENCY    99999.99
DS_Public_Dat    CHAR        8
DS_Title         VARCHAR     Variable (60)
DS_Height        INTEGER     9999
DS_Width         INTEGER     9999
DS_Depth         INTEGER     9999
DS_Weight        LONG        99999
DS_Intrastat     CHAR        1
DS_FirmSale      CHAR        1
DS_Content81     CHAR        2
DS_Form07        CHAR        2
DS_Short_LT      CHAR        1

The start/end of each record should be a CR/LF combination. I posted the code I used above to get the snippet, so I don't know if that affected anything.
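For later reference, once a record is split on the pipe, the zero-based field positions from the spec above could be named with constants along these lines (the accepted code evidently does something similar, given the Const cDS_Width = 9 correction mentioned later; these exact declarations are only illustrative):

' Illustrative zero-based positions of the fields within Split(record, "|"),
' following the order of the spec above.
Const cDS_ISBN = 0
Const cDS_Publisher = 1
Const cDS_Author_CN = 2
Const cDS_Author_SN = 3
Const cDS_Price = 4
Const cDS_VAT_Elem = 5
Const cDS_Public_Dat = 6
Const cDS_Title = 7
Const cDS_Height = 8
Const cDS_Width = 9
Const cDS_Depth = 10
Const cDS_Weight = 11
Const cDS_Intrastat = 12
Const cDS_FirmSale = 13
Const cDS_Content81 = 14
Const cDS_Form07 = 15
Const cDS_Short_LT = 16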
I think it is the unicode (double byte) encoding that is confusing the VB statement/functions.  I opened the file in Word and it was able to display the line breaks after I specified that the file was unicode.  I'm still looking at this.
I think it is the unicode (double byte) encoding that is confusing the VB statement/functions.

Yes, I thought it might be. As you can see it's mainly ASCII, but there are occasional Unicode characters in there.
ASKER CERTIFIED SOLUTION
aikimark
(This solution is only available to Experts Exchange members.)
katerina-p,

Thanks for replying to my post.

My guess is that aikimark can lead you to the promised land...

;-)

JeffCoachman
aikimark,

Thanks for your response. Is a DLookup faster than opening a rst and .FindFirst?

Your code works fine on the Part file (small typo: Const cDS_Width = 9). However, when working with the full file, strBuffer is just Null chars (Chr(0)).
<<Thanks for your response. Is a DLookup faster than opening a rst and .FindFirst?>>

 It would be about the same, but in this case, a DCount() would be better.

 The only way to improve on that would be to open the table directly and use .Seek on an index.

Jim.
The only way to improve on that would be to open the table directly and use .Seek on an index.
That will be fine because it's a working table to some extent and I'm matching on the primary key.
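A minimal sketch of the .Seek approach with DAO, assuming a local key table named tblKeys with the default primary-key index name (.Seek only works on a table-type recordset opened on a native, non-linked table):

' Sketch: existence check via .Seek on a table-type recordset.
' "tblKeys" and the index name are assumptions.
Public Function KeyExists(ByVal isbn As String) As Boolean
    Static rs As DAO.Recordset

    If rs Is Nothing Then
        Set rs = CurrentDb.OpenRecordset("tblKeys", dbOpenTable)
        rs.Index = "PrimaryKey"      ' Access names the PK index "PrimaryKey" by default
    End If

    rs.Seek "=", isbn
    KeyExists = Not rs.NoMatch
End Function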
@Jim

re: http:#a39184396
I disagree with your domain aggregate function statement.  DLookup and DFirst should stop searching after the first match.  DCount will require additional I/O or processing to return the count.

You are quite right about a table seek being the fastest search method.  It also matches an earlier statement I made about having an index on the ISBN column.  If you get the chance, compare a table seek against a Dictionary.Exists method.  In katerina's case, the ISBN data only has a 6.5 MB memory footprint.
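For comparison, a minimal sketch of the Dictionary.Exists approach: load the ~500k ISBN keys once into memory, then probe with .Exists inside the import loop (table and field names are assumptions):

' Sketch: load the ISBN keys once into a Scripting.Dictionary, then test
' each record with .Exists. Table/field names are assumptions.
Public Function LoadKeys() As Object
    Dim dic As Object
    Dim rs As DAO.Recordset

    Set dic = CreateObject("Scripting.Dictionary")
    Set rs = CurrentDb.OpenRecordset("SELECT ISBN FROM tblKeys", dbOpenSnapshot)
    Do Until rs.EOF
        dic(CStr(rs!ISBN)) = True      ' key = ISBN, value unused
        rs.MoveNext
    Loop
    rs.Close
    Set LoadKeys = dic
End Function

' Usage inside the import loop:
'   If dicKeys.Exists(fields(cDS_ISBN)) Then ... insert the record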
@katerina

Maybe the byte order in your original file does not require swapping.  Please retest the code after commenting out the For...Next loop that does the byte-value swapping.

Thank you for catching my typo.  I've corrected that index value in the code snippet.

One of the reasons I used IsNull(DLookup()) is that the code was easy to write and to understand.  A recordset Find method is sometimes fast, but isn't reliably so.  Here is a good article on fast lookup functions in Access.
http:A_1921-Access-Techniques-Fast-Table-Lookup-Functions.html
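The accepted code itself isn't reproduced above, so the following is only a rough reconstruction of the chunked-read approach being discussed (the buffer size, the optional byte-swap loop, and carrying the partial trailing line are the moving parts; the names and details are illustrative, not aikimark's actual code):

' Rough illustration only - not the accepted solution's actual code.
' Reads the UCS-2 file in fixed-size binary chunks, optionally swaps byte
' pairs (only needed if the data really is big-endian), converts the bytes
' to a VBA string, splits on CrLf, and carries any partial trailing line
' over to the next chunk.
Public Sub ReadInChunks(ByVal filePath As String, ByVal bufferChars As Long)
    Dim fNum As Integer
    Dim bytBuffer() As Byte
    Dim bytTemp As Byte
    Dim strBuffer As String, strCarry As String
    Dim strLines() As String
    Dim bytesLeft As Long, bytesToRead As Long
    Dim i As Long, j As Long

    fNum = FreeFile
    Open filePath For Binary Access Read As #fNum
    bytesLeft = LOF(fNum)

    Do While bytesLeft > 0
        bytesToRead = bufferChars * 2                 ' two bytes per character
        If bytesToRead > bytesLeft Then bytesToRead = bytesLeft
        ReDim bytBuffer(0 To bytesToRead - 1)
        Get #fNum, , bytBuffer
        bytesLeft = bytesLeft - bytesToRead

        ' Swap byte pairs only if the data really is big-endian.
        ' (In katerina's case this turned out to be unnecessary.)
        'For i = 0 To UBound(bytBuffer) - 1 Step 2
        '    bytTemp = bytBuffer(i)
        '    bytBuffer(i) = bytBuffer(i + 1)
        '    bytBuffer(i + 1) = bytTemp
        'Next i

        strBuffer = bytBuffer                         ' byte-for-byte copy into a string
        strBuffer = strCarry & strBuffer
        strLines = Split(strBuffer, vbCrLf)

        strCarry = strLines(UBound(strLines))         ' partial last record, keep for next chunk

        For j = 0 To UBound(strLines) - 1
            ' process strLines(j): Split on "|", test the 13-digit key, insert, etc.
            ' (the first line of the first chunk still carries the BOM character)
        Next j
    Loop

    ' Post-loop: strCarry holds whatever followed the final CrLf (often empty
    ' or stray padding) and may still need to be processed.
    Close #fNum
End Sub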
Maybe the byte order in your original file does not require swapping.  Please retest the code after commenting out the For...Next loop that does the byte-value swapping.

Thanks - that did the job.

Initial testing shows

Buffer size of 1,000,000 took 49 seconds to iterate through 1,004,109 lines.
Buffer size of 500,000 took 21 seconds to iterate through 500,122 lines.
Buffer size of 250,000 took 25 seconds to iterate through 500,122 lines.
Buffer size of 100,000 took 29 seconds to iterate through 500,122 lines.
Buffer size of 75,000 took 27 seconds to iterate through 500,122 lines.
Buffer size of 50,000 took 24 seconds to iterate through 500,122 lines.

Will try running through to end.
<<@Jim

re: http:#a39184396
I disagree with your domain aggregate function statement.  DLookup and DFirst should stop searching after the first match.  DCount will require additional I/O or processing to return the count.>>

 A DCount() will always outperform a DLookup() for the purpose of checking for the existence of a record, as DLookup() needs to return a value from the record page, whereas DCount() does not.   There are two things at work that may not have been obvious:

1. You use an * as the first argument.
2. You're restricting the DCount() with a WHERE clause that results in one record.

Jim.
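For concreteness, the two existence tests being compared, using the same assumed key table as in the earlier sketches:

' Existence via DLookup: returns a field value (Null when there is no match).
Public Function ExistsViaDLookup(ByVal isbn As String) As Boolean
    ExistsViaDLookup = Not IsNull(DLookup("ISBN", "tblKeys", "ISBN='" & isbn & "'"))
End Function

' Existence via DCount("*", ...): only counts matching rows; no field value is retrieved.
Public Function ExistsViaDCount(ByVal isbn As String) As Boolean
    ExistsViaDCount = (DCount("*", "tblKeys", "ISBN='" & isbn & "'") > 0)
End Function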
@katerina

You need to check that the code correctly handles the last record.  There might be some trailing unicode characters, like the BOM codes at the start of the file.  Also, the loop terminates at EOF, and this may require some post-loop processing of the data in the last item in the strLines() array.

Thank you for testing different buffer sizes.

It is possible that the file might be read more efficiently, but I would need an accurate slice of the data (with the same byte order you are seeing in your source file).  However, given the size of the source file and the import times you observed, I think it is a reasonable processing time for your user to wait.

You can test the look-up times under different methods, both those in Harfang's article and a Dictionary object, which I mentioned in an earlier comment, to see if that makes much of a difference in your (total) import time.  Usually, I/O is the biggest time and resource component of such operations.
@Jim

I hadn't thought about using an asterisk as the field parameter value.  How does the presence of multiple duplicate values in the searched field affect the performance of DCount() ?

Are we assuming that there is an index on the field being searched?
@aikimark

Thanks so much for your help. It 'successfully', ie without error, ran through to the end, but I confess I didn't check the integrity of the last line. It took about 5 minutes as expected. I will check again later this evening and close off the question if all is well. Thank you again. Yes I will also then have a play around to find the fastest way of processing the data.
<<I hadn't thought about using an asterisk as the field parameter value.  How does the presence of multiple duplicate values in the searched field affect the performance of DCount() ?>>

  If it's running off an index, it adds very little.  Of course it would depend on the number of them, but typically a page or two of the index is all that needs to be read.  Even with a fairly long key of say 12 bytes, you still end up with hundreds of keys per page.

 But in the case of multiples, at some point a DLookup() will outpace it.

Jim.
I won't have time to check the final line properly this evening, but the following is the output from testing, so it at least finishes consistently (and with the same line count) without error:

Buffer size of 10,000 took 369 seconds to iterate through 7,589,200 lines.
Buffer size of 45,000 took 386 seconds to iterate through 7,589,200 lines.
Buffer size of 80,000 took 326 seconds to iterate through 7,589,200 lines.
Buffer size of 115,000 took 348 seconds to iterate through 7,589,200 lines.
Buffer size of 150,000 took 315 seconds to iterate through 7,589,200 lines.
Buffer size of 185,000 took 334 seconds to iterate through 7,589,200 lines.
Buffer size of 220,000 took 317 seconds to iterate through 7,589,200 lines.
Buffer size of 255,000 took 340 seconds to iterate through 7,589,200 lines.
Buffer size of 290,000 took 320 seconds to iterate through 7,589,200 lines.
Buffer size of 325,000 took 320 seconds to iterate through 7,589,200 lines.
Buffer size of 360,000 took 319 seconds to iterate through 7,589,200 lines.
Buffer size of 395,000 took 320 seconds to iterate through 7,589,200 lines.
Buffer size of 430,000 took 322 seconds to iterate through 7,589,200 lines.
Buffer size of 465,000 took 325 seconds to iterate through 7,589,200 lines.
Buffer size of 500,000 took 324 seconds to iterate through 7,589,200 lines.
Buffer size of 535,000 took 334 seconds to iterate through 7,589,200 lines.
Buffer size of 570,000 took 336 seconds to iterate through 7,589,200 lines.
Buffer size of 605,000 took 334 seconds to iterate through 7,589,200 lines.

So I guess I'll use a buffer around the 350k mark, as 220k seems like an anomaly. Thanks again so much, @aikimark.