• Status: Solved
• Priority: Medium
• Security: Public

List Parsing

Hi,

I have a program that produces text files containing thousands of numbers, one per line. The numbers are between 4 and 7 digits long, and some of the files are upwards of 3 MB. Unfortunately, most of the numbers are duplicated. What would be the most efficient way to parse these files and remove all of the duplicates?

Zaphod.

Asked by: Z_Beeblebrox
 
superchook commented:
Well...

One approach I have used in the past is to read the list into an array (or into a database, if the number of unique values is truly huge).

Then, for each new number you read, scan the array (or query the database): discard the number if it is already there, and add it if it is not.

A SQL-compliant database has a couple of other advantages: you can dump the list sorted or filtered by any number of criteria, whereas with an array you have to implement those operations yourself. An array in RAM is much faster, though, provided the data set fits in memory.
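
The array variant described above could be sketched roughly as follows. This is a minimal, untested sketch under stated assumptions, not superchook's actual code: the function name LoadUnique, its parameters, and the initial capacity of 1024 are all made up for illustration. It assumes one non-negative number per line and skips blank lines.

```vb
' Hypothetical helper: reads sPath (one number per line) and stores each
' distinct number once in aUnique. Returns how many distinct numbers were found.
' The caller passes a dynamic array, e.g. Dim aNums() As Long.
Private Function LoadUnique(ByVal sPath As String, ByRef aUnique() As Long) As Long
    Dim ff As Integer
    Dim sLine As String
    Dim lNum As Long
    Dim lCount As Long          ' number of distinct values stored so far
    Dim i As Long
    Dim bFound As Boolean

    ReDim aUnique(0 To 1023)    ' assumed starting capacity; doubled when full

    ff = FreeFile
    Open sPath For Input As #ff
    Do While Not EOF(ff)
        Line Input #ff, sLine
        If Len(Trim$(sLine)) > 0 Then
            lNum = CLng(Val(sLine))

            ' Scan the values kept so far for this number
            bFound = False
            For i = 0 To lCount - 1
                If aUnique(i) = lNum Then
                    bFound = True
                    Exit For
                End If
            Next i

            ' Add the number only if it has not been seen before
            If Not bFound Then
                If lCount > UBound(aUnique) Then
                    ReDim Preserve aUnique(0 To lCount * 2 - 1)
                End If
                aUnique(lCount) = lNum
                lCount = lCount + 1
            End If
        End If
    Loop
    Close #ff

    LoadUnique = lCount
End Function
```

Only the first LoadUnique(...) elements of the array are meaningful after the call. Because every new number is compared against all the values already kept, the running time grows with (lines read) x (distinct values), so this is likely to be slow on a 3 MB file with many distinct numbers.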



sunnysideandy commented:
VERSION 5.00
Object = "{831FDD16-0C5C-11D2-A9FC-0000F8754DA1}#2.0#0"; "mscomctl.ocx"
Begin VB.Form frmSortRand
   Caption         =   "Form1"
   ClientHeight    =   5190
   ClientLeft      =   60
   ClientTop       =   345
   ClientWidth     =   5895
   LinkTopic       =   "Form1"
   ScaleHeight     =   5190
   ScaleWidth      =   5895
   StartUpPosition =   3  'Windows Default
   Begin MSComctlLib.ListView lstOrder
      Height          =   4965
      Left            =   2175
      TabIndex        =   3
      Top             =   150
      Width           =   1740
      _ExtentX        =   3069
      _ExtentY        =   8758
      View            =   3
      LabelWrap       =   -1  'True
      HideSelection   =   -1  'True
      FullRowSelect   =   -1  'True
      _Version        =   393217
      ForeColor       =   -2147483640
      BackColor       =   -2147483643
      BorderStyle     =   1
      Appearance      =   1
      NumItems        =   0
   End
   Begin MSComctlLib.ListView lstRandom
      Height          =   4965
      Left            =   300
      TabIndex        =   2
      Top             =   150
      Width           =   1665
      _ExtentX        =   2937
      _ExtentY        =   8758
      View            =   3
      LabelWrap       =   -1  'True
      HideSelection   =   -1  'True
      FullRowSelect   =   -1  'True
      _Version        =   393217
      ForeColor       =   -2147483640
      BackColor       =   -2147483643
      BorderStyle     =   1
      Appearance      =   1
      NumItems        =   0
   End
   Begin VB.CommandButton cmdList
      Caption         =   "List"
      Height          =   390
      Left            =   4050
      TabIndex        =   1
      Top             =   900
      Width           =   1665
   End
   Begin VB.CommandButton cmdGetRandom
      Caption         =   "Get Random"
      Height          =   390
      Left            =   4050
      TabIndex        =   0
      Top             =   225
      Width           =   1665
   End
End
Attribute VB_Name = "frmSortRand"
Attribute VB_GlobalNameSpace = False
Attribute VB_Creatable = False
Attribute VB_PredeclaredId = True
Attribute VB_Exposed = False
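Option Explicit
' Demo range: values fall in 1..MAX_RAND, so Integer is enough here.
' The 4 to 7 digit numbers in the question would need Long instead.
Const MAX_RAND = 1000

' Fill lstRandom with MAX_RAND random numbers (duplicates are likely)
Private Sub cmdGetRandom_Click()
    Dim nIndex As Integer
    lstRandom.ListItems.Clear
    For nIndex = 1 To MAX_RAND
        lstRandom.ListItems.Add , "id=" & nIndex, CStr(Int((MAX_RAND * Rnd) + 1))
    Next
End Sub

' Show each distinct value from lstRandom once, in ascending order
Private Sub cmdList_Click()
    ' aSorted(n) is set to 1 once the value n has been seen
    Dim aSorted(MAX_RAND) As Integer
    Dim nIndex As Integer
    For nIndex = 1 To lstRandom.ListItems.Count
        aSorted(CInt(lstRandom.ListItems.Item(nIndex).Text)) = 1
    Next
    lstOrder.ListItems.Clear
    ' Walking the flags in index order yields the values already sorted
    For nIndex = 0 To MAX_RAND
        If (aSorted(nIndex) = 1) Then
            lstOrder.ListItems.Add , "id=" & nIndex, CStr(nIndex)
        End If
    Next
End Sub

' Give each ListView a single column sized to fit the control
Private Sub Form_Load()
    lstRandom.ColumnHeaders.Add , , , lstRandom.Width - 265
    lstOrder.ColumnHeaders.Add , , , lstOrder.Width - 265
End Sub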
 
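Hornet241 commented:
This way will take a minute or so, but you won't be hampered by the maximum size of a list box. Each number is used as an index into a Boolean array, so duplicates collapse into a single flag and the output comes out sorted. (On a K6 266 with 80 MB of RAM, it took 20 seconds to loop through an array of 0 to 9999999.)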

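Private Sub RemoveDuplicates()
    Dim NumArray() As Boolean   ' NumArray(n) = True once n has been seen
    Dim ff As Integer
    Dim InNum As String
    Dim ThisNum As Long
    Dim LastHi As Long          ' highest number seen so far
    Dim a As Long

    ReDim NumArray(0)           ' dimension up front so index 0 is always valid

    ff = FreeFile
    Open "YourFileName" For Input As #ff

    Do While Not EOF(ff)
        Line Input #ff, InNum
        ' Skip blank lines; Val("") would otherwise record a spurious 0
        If Len(Trim$(InNum)) > 0 Then
            ThisNum = CLng(Val(InNum))
            If ThisNum > LastHi Then
                ' Grow the flag array to reach the new highest number
                LastHi = ThisNum
                ReDim Preserve NumArray(LastHi)
            End If
            NumArray(ThisNum) = True
        End If
    Loop
    Close #ff

    ff = FreeFile
    Open "NextFileName" For Output As #ff

    ' Write every flagged index: each number appears once, in ascending order
    For a = 0 To UBound(NumArray)
        If NumArray(a) Then
            Print #ff, CStr(a)
        End If
    Next a

    Close #ff
End Sub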
 
Z_Beeblebrox (author) commented:
Hi,

Just so you guys don't think I am ignoring this question: so far I prefer Hornet241's solution. In fact, it's pretty impressive. But in case there is a better way, I will leave this question open until tomorrow evening. If there is no better answer by then, you can have the points.

Zaphod.
 
TimCottee (Head of Software Services) commented:
Zaphod, here is another possible solution. It works equally well with strings and numbers, and it uses the Collection object, which lets you test directly, by key, whether an element exists:

Private colNumbers As Collection

Private Sub Command3_Click()
    Dim lngNumber As Long
    Dim lngCount As Long
    Set colNumbers = New Collection
    Do
        lngCount = lngCount + 1
        lngNumber = Rnd() * 10000       ' random value from 0 to 10000
        ' Store each value once, keyed by its string form
        If Not NumberExists(lngNumber) Then
            colNumbers.Add lngNumber, CStr(lngNumber)
        End If
        ' Show "unique values / numbers tested" as the loop runs
        Label1.Caption = colNumbers.Count & " / " & lngCount
        Label2.Caption = lngNumber
        DoEvents
    ' Stop at a bound that fits in a Long (maximum 2,147,483,647)
    Loop Until lngCount = 1000000
    MsgBox colNumbers.Count
    Set colNumbers = Nothing
End Sub

' True if Number is already stored in colNumbers
Private Function NumberExists(ByVal Number As Long) As Boolean
    Dim varItem As Variant
    On Error Resume Next
    ' Looking up a missing key raises a run-time error
    varItem = colNumbers(CStr(Number))
    NumberExists = (Err.Number = 0)
    On Error GoTo 0
End Function

This example uses random numbers. If you run it (with the label controls, so you can see what is going on), you will see the count of numbers tested climb steadily, while the count of elements climbs relatively slowly to its maximum and then just sits there.
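
To apply the same Collection idea to the files in the question rather than to random numbers, something along these lines might work. It is an untested sketch under stated assumptions: the routine name DedupeWithCollection and its two path parameters are hypothetical, and it treats each line as a string key, so "0123" and "123" would count as different entries. Instead of looking each key up first, it relies on Collection.Add raising a run-time error when a key already exists.

```vb
' Hypothetical routine: copies sInPath to sOutPath, keeping only the
' first occurrence of each non-blank line.
Private Sub DedupeWithCollection(ByVal sInPath As String, ByVal sOutPath As String)
    Dim colSeen As Collection
    Dim ffIn As Integer
    Dim ffOut As Integer
    Dim sLine As String
    Dim sKey As String
    Dim bIsNew As Boolean

    Set colSeen = New Collection

    ffIn = FreeFile
    Open sInPath For Input As #ffIn
    ffOut = FreeFile            ' requested after the first Open, so it is a different number
    Open sOutPath For Output As #ffOut

    Do While Not EOF(ffIn)
        Line Input #ffIn, sLine
        sKey = Trim$(sLine)
        If Len(sKey) > 0 Then
            ' Adding a key that already exists raises a run-time error,
            ' so a successful Add means this is the first occurrence.
            On Error Resume Next
            colSeen.Add sKey, sKey
            bIsNew = (Err.Number = 0)
            On Error GoTo 0

            If bIsNew Then Print #ffOut, sKey
        End If
    Loop

    Close #ffOut
    Close #ffIn
    Set colSeen = Nothing
End Sub
```

Unlike the Boolean-array approach, this keeps the numbers in their original order rather than sorting them, and its memory use depends on the count of distinct values rather than on the largest number.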
 
Microsoft commented:
Accept Hornet's code; it's very well written, even if I say so myself.

Well done, Horn.

cheers
Andy