Solved

Importing very large 1.7Gb text file into Access

Posted on 2013-05-20
905 Views
Last Modified: 2013-05-21
Hi Experts,

I'm trying to import a pipe-delimited txt file into Access 2007. It's a 1.7 GB Unicode file encoded as UCS-2 Big Endian with a Byte Order Mark. Each line is approx. 200 characters; there are about 8 million lines.

How can I import this in?

I've tried http://stackoverflow.com/questions/1376756/superfast-way-to-read-large-files-line-by-line-in-vba-please-critique and get error 5 (Invalid procedure call or argument) on the line "sString = Space([1.7Gb])".

I've tried http://www.vbforums.com/showthread.php?430424-Reading-a-Binary-File-into-Excel-using-VBA but get error 7 (Out of memory). Furthermore, if I restrict the import to an arbitrary smaller amount, every 2nd character is Null (Chr(0)), presumably because of the Unicode encoding.


How can I import this file into a table?

Many thanks!

Katerina
0
Comment
Question by:katerina-p
30 Comments
 
LVL 95

Expert Comment

by:Lee W, MVP
ID: 39182696
Not sure I can help, but keep in mind, Access has a 2GB file size limit.

You might try saving the file using a different character set.  Not sure, but something like Notepad++ might be able to do that (opening the file could take many minutes... or longer).
0
 
LVL 95

Expert Comment

by:Lee W, MVP
ID: 39182699
Notepad++ can definitely change the encoding... if it can open it.
0
 
LVL 4

Expert Comment

by:mouseware
ID: 39182763
I had a similar situation with an even larger file... I used 7zip to split the file into manageable "chunks" using STORE mode (no compression), then checked the end of each chunk and the start of the next to make sure the split didn't happen in the middle of a line. If it did, I corrected it. Then I used EditPad Pro (which should open 4GB+ files: http://www.editpadpro.com/manual/epponly.html) to update the encoding, and ended up bulk importing into SQL.
0
 

Author Comment

by:katerina-p
ID: 39182809
Access has a 2GB file size limit.
Yeah, I know - I only need to save a few hundred thousand lines from the file, but they're at random line numbers.

Notepad++ can't open it either
0
 

Author Comment

by:katerina-p
ID: 39182851
i used 7zip to split the file into manageable "chunks"

Can I do this programmatically through VBA? I need to process a new 1.7Gb weekly.
0
 
LVL 4

Expert Comment

by:mouseware
ID: 39182858
You can do it through a command line to split the files, but then you'd run into an issue parsing the last/first lines of each file, trying to find out whether a line has been cut/split and then adding it to one of the files after removing the partials... doing that is much more involved.
0
 
LVL 74

Expert Comment

by:Jeffrey Coachman
ID: 39183309
Go to the creator of this file and simply ask them:
Does the file absolutely need the encoding?
Is it impossible to generate just the data and the Pipe delimiter?
Why is it not possible to export this file to any other format that Access can import?
(Excel, tab, CSV, XML, dBase, Paradox, Lotus 1-2-3, etc.)

Many times just asking will get you what you need...

This is all before the size issue. Then ask if the file can be split or filtered to create smaller separate files.

Again, just ask...


JeffCoachman
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39183893
Can you post a few lines of the file?

Generally, you would
1. open the file
2. read a line
3. use the Split() function to parse the line on the pipe delimiter
4. examine different items in the parsed array to decide whether to keep the data
5. conditionally insert the data into a new row
repeat steps 2-5 until you reach the end of file
6. close the file

======
Notes:
* You can use Open and Line Input statements, or the FileSystemObject and TextStream objects, to read the file.
* If the file contains a header row, it might be simpler to write a PowerShell script to extract the data you need and then do a regular import of the data into Access.
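
For illustration, a rough VBA sketch of steps 1-6 (the path, table name tblImport, field name ISBN and the keep/discard test are all placeholders; note a TextStream reads ANSI, or UTF-16 little-endian if you pass TristateTrue, so the UCS-2 big-endian file in this question would first need converting):

Sub ImportPipeDelimited()
    'Sketch only: the file path, tblImport, field names and the keep test are placeholders
    Dim fso As Object, ts As Object
    Dim rs As DAO.Recordset
    Dim varFields As Variant

    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile("C:\data\feed.txt", 1)         'step 1: open the file (1 = ForReading)
    Set rs = CurrentDb.OpenRecordset("tblImport", dbOpenDynaset)

    Do While Not ts.AtEndOfStream
        varFields = Split(ts.ReadLine, "|")                  'steps 2-3: read a line, split on the pipe
        If Len(varFields(0)) = 13 Then                       'step 4: placeholder keep/discard test
            rs.AddNew                                        'step 5: insert the data into a new row
            rs!ISBN = varFields(0)
            'rs![other fields] = varFields(1), varFields(2), ... etc.
            rs.Update
        End If
    Loop                                                     'repeat 2-5 until end of file

    ts.Close                                                 'step 6: close the file
    rs.Close
End Sub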
0
 

Author Comment

by:katerina-p
ID: 39183953
Go to the creator of this file and simply ask them:...

Tried, failed!


Can you post a few lines of the file?

OK - have attached a file which I created using the below code on the full file.

Public Function QuickRead(FName As String) As String
    Dim i As Integer
    Dim res As String
    Dim l As Long

    i = FreeFile
    l = FileLen(FName) '1.7Gb, too large
    l = 50000 'just to get a snippet
    res = Space(l)
    Open FName For Binary Access Read As #i
    Get #i, , res
    Close i
    
    i = FreeFile
    FName = Replace(FName, "FULL", "PART")
    Open (FName) For Output As #i
    Write #i, res
    Close #i
    
End Function



Thanks all
PART.txt
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39183984
@katerina
There is no header row.

What criteria are you using for selecting which records to insert into your table?
0
 

Author Comment

by:katerina-p
ID: 39183991
Whether the 13-digit reference (first field) matches one of 500k keys I have.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39184002
There are no line breaks.

What constitutes the start/end of a record?
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39184007
Although you stated that these are ~200-character records, I'm only seeing (what look like) records in the high-80s to low-90s character range.
0
 

Author Comment

by:katerina-p
ID: 39184046
Although you stated that these are ~200 character records

Yes, apologies - I meant max 200 chars. The full spec is as follows:

Field            Type        Format/Size
DS_ISBN          DOUBLE      13 or 15
DS_Publisher     CHAR        4
DS_Author_CN     VARCHAR     Variable (50)
DS_Author_SN     VARCHAR     Variable (50)
DS_Price         CURRENCY    99999.99
DS_VAT_Elem      CURRENCY    99999.99
DS_Public_Dat    CHAR        8
DS_Title         VARCHAR     Variable (60)
DS_Height        INTEGER     9999
DS_Width         INTEGER     9999
DS_Depth         INTEGER     9999
DS_Weight        LONG        99999
DS_Intrastat     CHAR        1
DS_FirmSale      CHAR        1
DS_Content81     CHAR        2
DS_Form07        CHAR        2
DS_Short_LT      CHAR        1

The start/end of each record should be a CR/LF combination. I posted the code I used above to get the snippet, so I don't know if that affected anything.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39184083
I think it is the Unicode (double-byte) encoding that is confusing the VB statements/functions.  I opened the file in Word and it was able to display the line breaks after I specified that the file was Unicode.  I'm still looking at this.
0
 

Author Comment

by:katerina-p
ID: 39184092
I think it is the unicode (double byte) encoding that is confusing the VB statement/functions.

Yes, I thought it might be. As you can see it's mainly ASCII, but there are occasional Unicode characters in there.
0
 
LVL 45

Accepted Solution

by:
aikimark earned 500 total points
ID: 39184256
This should come close.  You will need to change the path/name of the file being read, make sure the DLookup() function is referencing the actual field and table names, and add code to insert the record.

I'm processing your file in 1000-byte chunks.  Once you know it is working, you can play with the size of the buffer to eke out a bit more performance (somewhere in the 10,000 - 1,000,000 range).
Sub Q_28133857()
    Dim intFN As Integer
    Dim strBuffer As String
    Dim strParsed() As String
    Dim strLines() As String
    Dim b() As Byte
    Dim lngLoop As Long
    Const cDS_ISBN = 0          'DOUBLE      13 or 15
    Const cDS_Publisher = 1     '    CHAR      4
    Const cDS_Author_CN = 2     '   VARCHAR      Variable (50)
    Const cDS_Author_SN = 3     '  VARCHAR      Variable (50)
    Const cDS_Price = 4         'CURRENCY      99999.99
    Const cDS_VAT_Elem = 5      '  CURRENCY      99999.99
    Const cDS_Public_Dat = 6    '   CHAR      8
    Const cDS_Title = 7         'VARCHAR      Variable (60)
    Const cDS_Height = 8        'INTEGER      9999
    Const cDS_Width = 9         'INTEGER      9999 -- corrected by zone advisor
    Const cDS_Depth = 10        'INTEGER      9999
    Const cDS_Weight = 11       ' LONG      99999
    Const cDS_Intrastat = 12    '   CHAR      1
    Const cDS_FirmSale = 13     ' CHAR      1
    Const cDS_Content81 = 14    ' CHAR      2
    Const cDS_Form07 = 15       'CHAR      2
    Const cDS_Short_LT = 16     ' CHAR      1

    intFN = FreeFile
    Open "c:\users\aikimark\downloads\part.txt" For Binary As #intFN
    ReDim b(1 To 3)
    Get #intFN, , b
    ReDim b(1 To 1000)
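    'seed strLines so the carry-over element strLines(UBound(strLines)) starts out empty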
    strLines = Split(" ")
    Do
        Get #intFN, , b
        'swap byte order
        For lngLoop = LBound(b) To UBound(b) - 1 Step 2
            b(lngLoop) = b(lngLoop + 1)
            b(lngLoop + 1) = 0
        Next
        strBuffer = b
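        'prepend the partial line carried over from the previous chunk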
        strBuffer = strLines(UBound(strLines)) & strBuffer
        strLines = Split(strBuffer, vbCr)
        For lngLoop = LBound(strLines) To UBound(strLines) - 1
            strParsed = Split(strLines(lngLoop), "|")

            If IsNull(DLookup("isbn", "yourtablename", "isbn='" & strParsed(cDS_ISBN) & "'")) Then
            Else
                'insert strParsed data into table
            End If
        Next
        
    Loop Until EOF(intFN)
    Close intFN
End Sub



Be sure that your lookup table has an index on the ISBN column.  Otherwise, your lookup times will kill performance.
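
If the column isn't indexed yet, a one-off DDL statement will add one (same placeholder table/field names as the snippet above):

CurrentDb.Execute "CREATE INDEX idxISBN ON yourtablename (isbn)", dbFailOnError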
0
 
LVL 74

Expert Comment

by:Jeffrey Coachman
ID: 39184329
katerina-p,

Thanks for replying to my post.

My guess is that aikimark can lead you to the promised land...

;-)

JeffCoachman
0
 

Author Comment

by:katerina-p
ID: 39184361
aikimark,

Thanks for your response. Is a DLookup faster than opening a rst and .FindFirst?

Your code works fine on the Part file (small typo: Const cDS_Width = 9). However, when working with the full file, strBuffer is just the Null chars (Chr(0)).
0
 
LVL 57
ID: 39184396
<<Thanks for your response. Is a DLookup faster than opening a rst and .FindFirst?>>

 It would be about the same, but in this case a DCount() would be better.

 The only way to improve on that would be to open the table directly and use .Seek on an index.
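
Roughly, a Seek-based existence check would look like this (tblKeys and the index name PrimaryKey are placeholders; .Seek only works on a table-type recordset in the local database):

Dim rsSeek As DAO.Recordset
Set rsSeek = CurrentDb.OpenRecordset("tblKeys", dbOpenTable)   'table-type recordset
rsSeek.Index = "PrimaryKey"                                    'name of the index to seek on
rsSeek.Seek "=", strParsed(cDS_ISBN)                           'convert the key if the indexed field isn't Text
If Not rsSeek.NoMatch Then
    'key found - insert the parsed record
End If
'...
rsSeek.Close

You would open the recordset once before the import loop and just call Seek for each parsed record.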

Jim.
0
 

Author Comment

by:katerina-p
ID: 39184417
The only way to improve on that would be to open the table directly and use .Seek on an index.
That will be fine because it's a working table to some extent and I'm matching on the primary key.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39184533
@Jim

re: http:#a39184396
I disagree with your domain aggregate function statement.  DLookup() and DFirst() should stop searching after the first match; DCount() will require additional I/O or processing to return the count.

You are quite right about a table seek being the fastest search method.  It also matches an earlier statement I made about having an index on the ISBN column.  If you get the chance, compare a table seek against a Dictionary.Exists method.  In katerina's case, the ISBN data only has a 6.5 MB memory footprint.
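
Loading the keys once and testing in memory would look roughly like this (tblKeys and isbn are placeholder names):

Dim dic As Object
Dim rsKeys As DAO.Recordset

'load the ~500k ISBN keys once, up front
Set dic = CreateObject("Scripting.Dictionary")
Set rsKeys = CurrentDb.OpenRecordset("SELECT isbn FROM tblKeys", dbOpenSnapshot)
Do While Not rsKeys.EOF
    dic(CStr(rsKeys!isbn)) = True
    rsKeys.MoveNext
Loop
rsKeys.Close

'then, inside the import loop, the existence test is just:
If dic.Exists(strParsed(cDS_ISBN)) Then
    'insert strParsed data into the table
End If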
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39184582
@katerina

Maybe the byte order in your original file does not require swapping.  Please retest the code after commenting out the For...Next loop that does the byte-value swapping.

Thank you for catching my typo.  I've corrected that index value in the code snippet.

One of the reasons I used IsNull(DLookup()) is that the code was easy to write and to understand.  A recordset Find method is sometimes fast, but isn't reliably so.  Here is a good article on fast lookup functions in Access.
http:A_1921-Access-Techniques-Fast-Table-Lookup-Functions.html
0
 

Author Comment

by:katerina-p
ID: 39184926
Maybe the byte order in your original file does not require swapping.  Please retest the code after commenting out the For...Next loop that does the byte-value swapping.

Thanks - that did the job.

Initial testing shows

Buffer size of 1,000,000 took 49 seconds to iterate through 1,004,109 lines.
Buffer size of 500,000 took 21 seconds to iterate through 500,122 lines.
Buffer size of 250,000 took 25 seconds to iterate through 500,122 lines.
Buffer size of 100,000 took 29 seconds to iterate through 500,122 lines.
Buffer size of 75,000 took 27 seconds to iterate through 500,122 lines.
Buffer size of 50,000 took 24 seconds to iterate through 500,122 lines.

Will try running through to end.
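
(Timings like the above can be captured by wrapping the import loop with VBA's Timer function; lngBufferSize and lngLines here are hypothetical counters, not part of aikimark's code:)

Dim sngStart As Single
sngStart = Timer
'... run the chunked import loop, incrementing lngLines for each line processed ...
Debug.Print "Buffer size of " & Format(lngBufferSize, "#,##0") & " took " & _
            Format(Timer - sngStart, "0") & " seconds to iterate through " & _
            Format(lngLines, "#,##0") & " lines."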
0
 
LVL 57
ID: 39184958
<<@Jim

re: http:#a39184396
I disagree with your domain aggregate function statement.  Dlookup and DFirst should stop searching after the first match.  DCount will require additional I/O or processing to return the count.>>

 A DCount() will always outperform a DLookup() for the purpose of checking for the existence of a record, as DLookup() needs to return a value from the record page, whereas DCount() does not.   There are two things at work that may not have been obvious:

1. You use an * as the first argument.
2. You're restricting the DCount() with a WHERE clause that results in one record.
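
In other words, using the names from aikimark's snippet, the existence test becomes:

If DCount("*", "yourtablename", "isbn='" & strParsed(cDS_ISBN) & "'") > 0 Then
    'insert strParsed data into table
End If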

Jim.
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39184976
@katerina

You need to check that the code correctly handles the last record.  There might be some trailing Unicode characters, like the BOM codes at the start of the file.  Also, the loop terminates at EOF, and this may require some post-loop processing of the data in the last item in the strLines() array.

Thank you for testing different buffer sizes.

It is possible that the file might be read more efficiently, but I would need an accurate slice of the data (with the same byte order you are seeing in your source file).  However, given the size of the source file and the import times you observed, I think it is a reasonable processing time for your user to wait.

You can test the look-up times under different methods, both those in Harfang's article and a Dictionary object, which I mentioned in an earlier comment, to see if that makes much of a difference in your (total) import time.  Usually, I/O is the biggest time and resource component of such operations.
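
For that last record, something along these lines added after the Loop should cover it (an untested sketch that strips any trailing Chr(0)/line-feed noise before parsing):

'after Loop Until EOF(intFN): process whatever is left in the carry-over element
strBuffer = Replace(strLines(UBound(strLines)), Chr(0), "")
strBuffer = Replace(strBuffer, vbLf, "")
If Len(Trim$(strBuffer)) > 0 Then
    strParsed = Split(strBuffer, "|")
    'apply the same lookup/insert as inside the main loop
End If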
0
 
LVL 45

Expert Comment

by:aikimark
ID: 39184987
@Jim

I hadn't thought about using an asterisk as the field parameter value.  How does the presence of multiple duplicate values in the searched field affect the performance of DCount() ?

Are we assuming that there is an index on the field being searched?
0
 

Author Comment

by:katerina-p
ID: 39185033
@aikimark

Thanks so much for your help. It 'successfully' (i.e. without error) ran through to the end, but I confess I didn't check the integrity of the last line. It took about 5 minutes, as expected. I will check again later this evening and close off the question if all is well. Thank you again. Yes, I will also then have a play around to find the fastest way of processing the data.
0
 
LVL 57
ID: 39185112
<<I hadn't thought about using an asterisk as the field parameter value.  How does the presence of multiple duplicate values in the searched field affect the performance of DCount() ?>>

  If it's running off an index, it adds very little.  Of course it would depend on the number of them, but typically a page or two of the index is all that needs to be read.  Even with a fairly long key of say 12 bytes, you still end up with hundreds of keys per page.

 But in the case of multiples, at some point a DLookup() will outpace it.

Jim.
0
 

Author Comment

by:katerina-p
ID: 39186048
I won't have time to check the final line properly this evening, but the following is the output from testing, so it at least finishes consistently (and with the same line count) without error:

Buffer size of 10,000 took 369 seconds to iterate through 7,589,200 lines.
Buffer size of 45,000 took 386 seconds to iterate through 7,589,200 lines.
Buffer size of 80,000 took 326 seconds to iterate through 7,589,200 lines.
Buffer size of 115,000 took 348 seconds to iterate through 7,589,200 lines.
Buffer size of 150,000 took 315 seconds to iterate through 7,589,200 lines.
Buffer size of 185,000 took 334 seconds to iterate through 7,589,200 lines.
Buffer size of 220,000 took 317 seconds to iterate through 7,589,200 lines.
Buffer size of 255,000 took 340 seconds to iterate through 7,589,200 lines.
Buffer size of 290,000 took 320 seconds to iterate through 7,589,200 lines.
Buffer size of 325,000 took 320 seconds to iterate through 7,589,200 lines.
Buffer size of 360,000 took 319 seconds to iterate through 7,589,200 lines.
Buffer size of 395,000 took 320 seconds to iterate through 7,589,200 lines.
Buffer size of 430,000 took 322 seconds to iterate through 7,589,200 lines.
Buffer size of 465,000 took 325 seconds to iterate through 7,589,200 lines.
Buffer size of 500,000 took 324 seconds to iterate through 7,589,200 lines.
Buffer size of 535,000 took 334 seconds to iterate through 7,589,200 lines.
Buffer size of 570,000 took 336 seconds to iterate through 7,589,200 lines.
Buffer size of 605,000 took 334 seconds to iterate through 7,589,200 lines.

So I guess I'll use a buffer around the 350k mark, as 220k seems like an anomaly. Thanks again so much, @aikimark.
0
