Solved

List Parsing

Posted on 2001-06-04
Last Modified: 2010-05-02
Hi,

I have a program that produces text files containing thousands of numbers, one per line. The numbers are between 4 and 7 digits long. Some of these files are upwards of 3 MB. Unfortunately, most of the numbers are duplicated. What would be the most efficient way to parse these files and remove all of the duplicates?

Zaphod.
Question by:Z_Beeblebrox
LVL 1

Expert Comment

by:superchook
ID: 6154517
Well...

One way that I have used in the past is to read the list into an array (or a database if the number of unique values is truly huge).

Then, for each new number you read, scan the array (or database) for it, and add the number if it is missing or discard it if it is already there.

Using an SQL-compliant database has a couple of other advantages: you can dump the list sorted or filtered by any number of criteria, whereas with an array you have to perform those operations yourself. An array in RAM is much faster, though, if the sample set fits into memory.
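The array version of the scan-and-add idea above might look roughly like the following. This is only a sketch: the procedure name DedupeByScan and its two path parameters are made up for illustration, and it assumes one non-negative whole number per line that fits in a Long.

```vb
' Sketch only: DedupeByScan, InPath and OutPath are hypothetical names.
Public Sub DedupeByScan(ByVal InPath As String, ByVal OutPath As String)
    Dim Seen() As Long          ' unique numbers kept so far
    Dim SeenCount As Long       ' how many slots of Seen are in use
    Dim ff As Integer
    Dim TextLine As String
    Dim ThisNum As Long
    Dim i As Long
    Dim Found As Boolean

    ReDim Seen(0 To 1023)

    ff = FreeFile
    Open InPath For Input As #ff
    Do While Not EOF(ff)
        Line Input #ff, TextLine
        If Len(Trim$(TextLine)) > 0 Then        ' skip blank lines
            ThisNum = CLng(Val(TextLine))
            ' Scan everything kept so far for this number
            Found = False
            For i = 0 To SeenCount - 1
                If Seen(i) = ThisNum Then
                    Found = True
                    Exit For
                End If
            Next i
            If Not Found Then
                ' Double the array when it is full, then keep the number
                If SeenCount > UBound(Seen) Then
                    ReDim Preserve Seen(0 To SeenCount * 2 - 1)
                End If
                Seen(SeenCount) = ThisNum
                SeenCount = SeenCount + 1
            End If
        End If
    Loop
    Close #ff

    ' Write the unique numbers in the order they were first seen
    ff = FreeFile
    Open OutPath For Output As #ff
    For i = 0 To SeenCount - 1
        Print #ff, CStr(Seen(i))
    Next i
    Close #ff
End Sub
```

Every input line is compared against every unique value kept so far, so the running time grows with the number of lines multiplied by the number of unique values. That is acceptable for small lists but likely to be slow on multi-megabyte files.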




Expert Comment

by:sunnysideandy
ID: 6154557
VERSION 5.00
Object = "{831FDD16-0C5C-11D2-A9FC-0000F8754DA1}#2.0#0"; "mscomctl.ocx"
Begin VB.Form frmSortRand
   Caption         =   "Form1"
   ClientHeight    =   5190
   ClientLeft      =   60
   ClientTop       =   345
   ClientWidth     =   5895
   LinkTopic       =   "Form1"
   ScaleHeight     =   5190
   ScaleWidth      =   5895
   StartUpPosition =   3  'Windows Default
   Begin MSComctlLib.ListView lstOrder
      Height          =   4965
      Left            =   2175
      TabIndex        =   3
      Top             =   150
      Width           =   1740
      _ExtentX        =   3069
      _ExtentY        =   8758
      View            =   3
      LabelWrap       =   -1  'True
      HideSelection   =   -1  'True
      FullRowSelect   =   -1  'True
      _Version        =   393217
      ForeColor       =   -2147483640
      BackColor       =   -2147483643
      BorderStyle     =   1
      Appearance      =   1
      NumItems        =   0
   End
   Begin MSComctlLib.ListView lstRandom
      Height          =   4965
      Left            =   300
      TabIndex        =   2
      Top             =   150
      Width           =   1665
      _ExtentX        =   2937
      _ExtentY        =   8758
      View            =   3
      LabelWrap       =   -1  'True
      HideSelection   =   -1  'True
      FullRowSelect   =   -1  'True
      _Version        =   393217
      ForeColor       =   -2147483640
      BackColor       =   -2147483643
      BorderStyle     =   1
      Appearance      =   1
      NumItems        =   0
   End
   Begin VB.CommandButton cmdList
      Caption         =   "List"
      Height          =   390
      Left            =   4050
      TabIndex        =   1
      Top             =   900
      Width           =   1665
   End
   Begin VB.CommandButton cmdGetRandom
      Caption         =   "Get Random"
      Height          =   390
      Left            =   4050
      TabIndex        =   0
      Top             =   225
      Width           =   1665
   End
End
Attribute VB_Name = "frmSortRand"
Attribute VB_GlobalNameSpace = False
Attribute VB_Creatable = False
Attribute VB_PredeclaredId = True
Attribute VB_Exposed = False
Option Explicit
Const MAX_RAND = 1000

' Fill lstRandom with MAX_RAND random values between 1 and MAX_RAND
Private Sub cmdGetRandom_Click()
    Dim nIndex As Integer
    lstRandom.ListItems.Clear
    For nIndex = 1 To MAX_RAND
        lstRandom.ListItems.Add , "id=" & nIndex, CStr(Int((MAX_RAND * Rnd) + 1))
    Next
End Sub

' Flag every value seen, then list the flagged values:
' the result is sorted and contains no duplicates
Private Sub cmdList_Click()
    Dim aSorted(MAX_RAND) As Integer    ' aSorted(n) = 1 once n has been seen
    Dim nIndex As Integer
    For nIndex = 1 To lstRandom.ListItems.Count
        aSorted(CInt(lstRandom.ListItems.Item(nIndex).Text)) = 1
    Next
    lstOrder.ListItems.Clear
    For nIndex = 0 To MAX_RAND
        If (aSorted(nIndex) = 1) Then
            lstOrder.ListItems.Add , "id=" & nIndex, CStr(nIndex)
        End If
    Next
End Sub

' Give each ListView a single column sized to fit the control
Private Sub Form_Load()
    lstRandom.ColumnHeaders.Add , , , lstRandom.Width - 265
    lstOrder.ColumnHeaders.Add , , , lstOrder.Width - 265
End Sub
 
LVL 3

Accepted Solution

by:
Hornet241 earned 50 total points
ID: 6154859
This way will take a minute or so, but you won't be hampered by the maximum item limits of a list box. (On a K6 266 with 80 MB of RAM, it took 20 seconds to loop through an array of 0 to 9999999.)

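Dim NumArray() As Boolean   ' NumArray(n) = True once n has been seen
Dim ff As Integer
Dim InNum As String
Dim ThisNum As Long
Dim LastHi As Long
Dim a As Long

ReDim NumArray(0)           ' allocate up front so an empty file still works

ff = FreeFile
Open "YourFileName" For Input As #ff

Do While Not EOF(ff)
    Line Input #ff, InNum
    If Len(Trim$(InNum)) > 0 Then       ' skip blank lines
        ThisNum = CLng(Val(InNum))
        If ThisNum > LastHi Then
            ' Grow the flag array to reach the new highest number
            LastHi = ThisNum
            ReDim Preserve NumArray(LastHi)
        End If
        NumArray(ThisNum) = True
    End If
Loop
Close #ff

ff = FreeFile
Open "NextFileName" For Output As #ff

' Walk the flags in order: the output is sorted and free of duplicates
For a = 0 To UBound(NumArray)
    If NumArray(a) Then
        Print #ff, CStr(a)
    End If
Next a

Close #ff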
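Since the question says the numbers have at most 7 digits, one possible variation on the flag-array idea is to size the array once instead of growing it with ReDim Preserve as larger numbers turn up. The sketch below assumes that 7-digit upper bound; MAX_NUM and DedupeFixed are hypothetical names, and a Byte array is used purely to halve the memory a Boolean array would need (roughly 10 MB here).

```vb
' Sketch only: MAX_NUM and DedupeFixed are hypothetical names.
Private Const MAX_NUM As Long = 9999999     ' assumes at most 7 digits

Public Sub DedupeFixed(ByVal InPath As String, ByVal OutPath As String)
    Dim Flags() As Byte         ' Flags(n) = 1 once n has been seen
    Dim ff As Integer
    Dim TextLine As String
    Dim Value As Double
    Dim n As Long

    ReDim Flags(0 To MAX_NUM)   ' allocated once, never resized

    ff = FreeFile
    Open InPath For Input As #ff
    Do While Not EOF(ff)
        Line Input #ff, TextLine
        If Len(Trim$(TextLine)) > 0 Then
            Value = Val(TextLine)
            ' Ignore anything outside the assumed range
            If Value >= 0 And Value <= MAX_NUM Then
                Flags(CLng(Value)) = 1
            End If
        End If
    Loop
    Close #ff

    ' Flagged indexes come out in ascending order
    ff = FreeFile
    Open OutPath For Output As #ff
    For n = 0 To MAX_NUM
        If Flags(n) = 1 Then
            Print #ff, CStr(n)
        End If
    Next n
    Close #ff
End Sub
```

The trade-off is that the full array is allocated even for a small file, in exchange for never copying it while reading.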
 
LVL 7

Author Comment

by:Z_Beeblebrox
ID: 6154926
Hi,

Just so you guys don't think I am ignoring this question: so far I prefer Hornet241's solution. In fact, it's pretty impressive. But just in case there is a better way, I will leave this question open until tomorrow evening. If there is no better answer by then, you can have the points.

Zaphod.
 
LVL 43

Expert Comment

by:TimCottee
ID: 6155448
Zaphod, here is another possible solution. It works equally well with strings and numbers, and it uses the Collection object, which lets you test directly, by key, whether an element already exists:

Private colNumbers As Collection
Private Const MAX_TESTS As Long = 1000000   ' loop limit; must stay within the range of a Long

Private Sub Command3_Click()
    Dim lngNumber As Long
    Dim lngCount As Long
    Set colNumbers = New Collection
    Do
        lngCount = lngCount + 1
        lngNumber = Rnd() * 10000
        ' Only add numbers that are not in the collection yet
        If Not NumberExists(lngNumber) Then
            colNumbers.Add lngNumber, CStr(lngNumber)
        End If
        Label1.Caption = colNumbers.Count & " / " & lngCount
        Label2.Caption = lngNumber
        DoEvents
    Loop Until lngCount = MAX_TESTS
    MsgBox colNumbers.Count
    Set colNumbers = Nothing
End Sub

' Returns True if Number is already stored under its string key
Private Function NumberExists(ByVal Number As Long) As Boolean
    Dim varItem As Variant
    On Error Resume Next
    varItem = colNumbers(CStr(Number))   ' raises an error if the key is missing
    NumberExists = (Err.Number = 0)
    Err.Clear
End Function

This example uses random numbers. If you run it (with the label controls, to see what is going on) you can see that the count of numbers tested goes up and up, while the count of elements in the collection climbs relatively slowly to its maximum and then just sits there.
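Applied to the files from the question rather than to random numbers, the same Collection idea could be sketched as follows. DedupeWithCollection and its path parameters are hypothetical names. The sketch keys on the trimmed text of each line, so "0123" and "123" would count as different entries; convert with Val first if that matters.

```vb
' Sketch only: DedupeWithCollection, InPath and OutPath are hypothetical names.
Public Sub DedupeWithCollection(ByVal InPath As String, ByVal OutPath As String)
    Dim colSeen As Collection
    Dim ff As Integer
    Dim TextLine As String
    Dim varEntry As Variant

    Set colSeen = New Collection

    ff = FreeFile
    Open InPath For Input As #ff
    Do While Not EOF(ff)
        Line Input #ff, TextLine
        TextLine = Trim$(TextLine)
        If Len(TextLine) > 0 Then           ' skip blank lines
            ' Adding a key that already exists raises an error,
            ' which is ignored here so duplicates are simply dropped
            On Error Resume Next
            colSeen.Add TextLine, TextLine
            On Error GoTo 0
        End If
    Loop
    Close #ff

    ' A Collection enumerates in insertion order, so the output keeps
    ' the order in which each number first appeared
    ff = FreeFile
    Open OutPath For Output As #ff
    For Each varEntry In colSeen
        Print #ff, varEntry
    Next varEntry
    Close #ff
End Sub
```

Unlike the flag-array approach, this does not sort the output and does not depend on the numbers staying below a fixed maximum.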
 
LVL 2

Expert Comment

by:Microsoft
ID: 6157663
Accept Hornet's code; very well written, even if I say so myself.

Well done, Horn.

cheers
Andy