Christopher Wright (United States of America) asked:

User-form That Will Enable A Search Against A Massive Spreadsheet Being Used As A Database

Greetings, everyone.  I am reaching out for some help on a project that I hope will teach me a lot.  I have a couple of ideas/concepts; however, I am wide open to any new suggestions, ideas, constructive criticism, and/or anything else that will guide this project to fruition.

Okay, the basic concept is to have some sort of user interface for company users to perform searches based on particular item numbers and other attributes.  Currently, I have a Discoverer report that pulls a "Master Sheet" every night via a DOS batch file, programmed to save the file with the date in the file name.  I would like a centralized userform/spreadsheet where users can input some data and search against that master file without it being open.

The cause for this is that the file itself contains nearly 750,000 rows, each containing a specific/distinct item/part number and its attributes.  It takes forever to open the file, and it is somewhat inefficient to open such a large file just to find one attribute for a part number.  I have researched the web and found that there are options using ADO and/or DAO methods.  I am not fluent with these at all, hence my call to the Experts.

If there is a better way to have this done via Excel, I am all ears.  I greatly appreciate any help.  I can provide sample data/workbooks to whoever is brave enough to accept the challenge.  Thanks again, everyone, and God bless!
aikimark (United States of America) replied:

This might be a good place to use an HTA (HTML Application).  I created several HTA applets for one of my clients to do lightweight things, such as look up information.  In my case, the data is in Access databases, but the ADO code will be similar.

It would be helpful to post a small, but representational, sample of the workbook.

=========
If performance is still a problem, it might be worth considering a tiered approach to the data, where the rows are distributed to different files/workbooks/databases with different names that are related to their content.

At the end of the nightly run, you might post-process the workbook data to create a slimmed down version of the data with just enough data to validate the users' queries and speed up the data retrieval.

You would do your company well to populate a database rather than use a workbook as one.  There are several lightweight database engines, and some are free if cost is a concern.
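
For illustration only, here is a minimal sketch of what an ADO query against a lightweight database could look like from VBScript; the ACE provider shown works for Access, but the path, table name, and query value are hypothetical placeholders:

' Minimal sketch: querying a lightweight database via ADO.
' Path, table name, and the looked-up part number are hypothetical placeholders.
Dim cn, rs
Set cn = CreateObject("ADODB.Connection")
cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;" & _
        "Data Source=C:\Data\PartMaster.accdb;"
Set rs = CreateObject("ADODB.Recordset")
' With an index on the part number column, this stays fast even at 750k rows.
rs.Open "SELECT * FROM Items WHERE [Part Number] = 'D8267'", cn
If Not rs.EOF Then MsgBox rs.Fields("Item Description").Value
rs.Close
cn.Close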
Christopher Wright (Asker):

I am interested in any option that will make life easier for everyone.  I will get you that spreadsheet first thing in the AM.  Okay?  Thanks for helping me out!  God bless!
My apologies for the delay.  What would be a sufficient sample, and what in particular would you like to see?  Thanks again for the help!
Somewhere in the 100-200 row range will, hopefully, reveal any anomalies in the data and any inherent patterns.
Here you are.  Hope this will suffice.  What do you think would be the best direction to go with a userform here?  Thanks
ADO-Test-File.xlsx
1. Edit the following code to change the path of the Excel file in the connection string.
2. Open the Excel workbook and remove the leading space character from the worksheet name.  In fact, you will probably have a completely different name in production, so the worksheet name in the From clause will need to match your actual name (with a $ suffix).
3. Save the following code as Q_28004727.HTA
<html>
<head>
<title>Q_28004727</title>

<HTA:APPLICATION 
     ID="Q_28004727"
     APPLICATIONNAME="Q_28004727"
     SCROLL="yes"
     SINGLEINSTANCE="yes"
     WINDOWSTATE="maximize"
>
</head>
<style>
.mono {
	font-family:courier;
}
</style>
<SCRIPT Language="VBScript">
    sub ReturnRows
	Dim cn
	Dim Rs
	Dim strData
	Dim strRow
	Dim fld
	Set Rs = CreateObject("ADODB.Recordset")
	Set cn = CreateObject("ADODB.connection")
	cn.connectionstring = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\AikiMark\Downloads\ADO-Test-File.xlsx;Extended Properties=""Excel 12.0 Xml;HDR=YES;IMEX=1"";"  'Persist Security Info=False;"
	cn.Open
	Rs.Open "Select * from [Item Master (Test ADO)$] Where [Part Number] Like '%" & txtFilter.value & "%'", cn, 1, 3
	strData = "<table width=""100%"" border>"
        if not rs.eof then
	  strData = strData & "<tr>"
          strRow = ""
          For each fld in rs.fields
            strRow = strRow & "<th>" & fld.name & "</th>"
          Next
          strData = strData & strRow & "</tr>"
        end if 
	do until Rs.EOF
	  strRow = "<tr>"
	  For Each fld in Rs.Fields
		if fld.name = "Cat Code" then
			strRow = strRow & "<td class=""mono"">" & fld.value & "</td>"
		else
			strRow = strRow & "<td>" & fld.value & "</td>"
		end if
	  Next
	  strData = strData & strRow & "</tr>"
	  Rs.MoveNext
	Loop
	DataArea.InnerHTML = strData & "</table>"

    end sub
</SCRIPT>
<body>
<input type="text" name="txtFilter" size="50"><p>
<input id=runbutton  class="button" type="button" value="Get Data" name="run_button"  onClick="ReturnRows"><p>
<span id=DataArea>This is a span named DataArea.</span>
</body>
</html>

Note: the SQL statement uses wildcard matching of strings.  You will get better performance if you use exact matching.
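
For reference, the two forms differ only in the Where clause; here is a sketch of both, using the same sheet and field names as the code above:

' Wildcard (substring) match -- every value is scanned for a partial hit:
Rs.Open "Select * from [Item Master (Test ADO)$] Where [Part Number] Like '%" & txtFilter.value & "%'", cn, 1, 3

' Exact match -- whole-value comparison, noticeably cheaper:
Rs.Open "Select * from [Item Master (Test ADO)$] Where [Part Number] = '" & txtFilter.value & "'", cn, 1, 3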
This is working quite well so far.  It's fast and brings back the entire row with all the attributes.  Is there a possibility of having this data broken up into separate fields?  For instance, the search is performed and the info populates a list like this:

ADS Part Number: ___________________________
Inventory Item ID:___________________________
Item Description:____________________________
Vendor Part Number:________________________
Manufacturer Part Number:_________________

And so on....

What I am ultimately trying to achieve is an interface/environment where a user can search by item and find multiple pieces of information, but also a "dashboard"-like interface with aggregates of certain vendor sales, average lead times, and other data.  This would all pull from files such as the one I've provided.  The charts, or whatever is chosen, would auto-run when the file is opened, and the search would be on demand.  Is this a possibility?  Can such files have these types of objects?  Can they pull from multiple sources?  Thank you so much for opening up a new world for me here with the HTA route!
Is there only going to be one record retrieved?  If so, then yes -- the output can easily be displayed vertically, rather than in grid/table form.

I'm not sure what you're asking for as far as a dashboard is concerned.  Let's stick to the original question you asked about retrieving data.  After you have an acceptable solution to the current problem, you can ask another question about dashboard.
Here is a version of the HTA that produces a list output.  I changed the comparison from wildcard to exact match.  Remember to change the path.
<html>
<head>
<title>Q_28004727</title>

<HTA:APPLICATION 
     ID="Q_28004727"
     APPLICATIONNAME="Q_28004727"
     SCROLL="yes"
     SINGLEINSTANCE="yes"
     WINDOWSTATE="maximize"
>
</head>
<style>
.mono {
	font-family:courier;
}
</style>
<SCRIPT Language="VBScript">
    sub ReturnRows
	Dim cn
	Dim Rs
	Dim strData
	Dim strRow
	Dim fld
	Set Rs = CreateObject("ADODB.Recordset")
	Set cn = CreateObject("ADODB.connection")
	cn.connectionstring = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Users\AikiMark\Downloads\ADO-Test-File.xlsx;Extended Properties=""Excel 12.0 Xml;HDR=YES;IMEX=1"";"  'Persist Security Info=False;"
	cn.Open
	Rs.Open "Select * from [Item Master (Test ADO)$] Where [Part Number] =""" & txtFilter.value & """", cn, 1, 3
        strData = vbNullString
	do until Rs.EOF
	  For Each fld in Rs.Fields
		strData = strData & "<b>" & fld.Name & ": </b>" & fld.value & "<br>"
	  Next
	  Rs.MoveNext
	Loop
	DataArea.InnerHTML = strData

    end sub
</SCRIPT>
<body>
<input type="text" name="txtFilter" size="50"><p>
<input id=runbutton  class="button" type="button" value="Get Data" name="run_button"  onClick="ReturnRows"><p>
<span id=DataArea>This is a span named DataArea.</span>
</body>
</html>


My apologies for the ambiguity.  As far as the dashboards are concerned, I was only referencing the direction I planned to take after we concluded this project.  Thank you for helping me out.  I greatly appreciate it.
BTW:  This code is awesome.  Now, would I need to ask another question in order to have this formatted to show up in text boxes?  Pretty much create a display similar to a common search screen.  Thanks for all of the help.
Just a quick note: I applied this to my entire file, and it's quite slow.  Would there be an option to speed this up, or anything quicker?  I'll start looking at databases.  Would I need a multi-user one?
Why would you want to display the results in textboxes?  This is read-only data.  A textbox would make the user think they can update the data.

A database would most likely be the best technology to apply to this solution, since databases use indexes.  If you want to try something similar, you might do the following:
1. sort the Excel data in Part Number order
2. copy the Part Number column to a new workbook
3. save the workbook as a text file or tab-delimited file (example: PtNum.txt)
4. Save the original workbook as a text file
5. When the user wants to retrieve the part number row, the program reads the entire PtNum.txt file, uses the SPLIT() function to move the data into an array, and then does a binary search for the data.  With the 750k part number estimate, it will only take about 20 comparisons to find the part.  Since this is in memory, only the first search will be slow.
6. The big data file is opened with a FileSystemObject Textstream and the associated data line is read and displayed.

Note: the PtNum.txt file only needs to be processed once.  If the user does multiple searches, then the array can be searched from the already populated data.  In fact, you might do the I/O when the HTA form loads, so that the user never has a slow search, just a slow page load time.
If your data pattern is consistent, then based on the 200-row sample you posted, the PtNum.txt file will be about 8.7 MB.  For a brief moment, the raw data string will coexist with the array, roughly doubling the RAM footprint.  After the array is populated, the raw data string can be set to vbNullString.
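
A minimal VBScript sketch of step 5 above, assuming PtNum.txt is sorted in an order that matches VBScript's default string comparison (the path is a placeholder):

' Load the sorted part numbers once, then binary-search the array.
Dim oFS, strRaw, arrParts
Set oFS = CreateObject("Scripting.FileSystemObject")
strRaw = oFS.OpenTextFile("C:\Data\PtNum.txt", 1).ReadAll   ' 1 = ForReading
arrParts = Split(strRaw, vbCrLf)
strRaw = vbNullString   ' release the raw string once the array is populated

' Returns the zero-based index of strPart in arr, or -1 if not found.
Function BinarySearch(arr, strPart)
    Dim lngLo, lngHi, lngMid
    lngLo = LBound(arr)
    lngHi = UBound(arr)
    BinarySearch = -1
    Do While lngLo <= lngHi
        lngMid = (lngLo + lngHi) \ 2
        If arr(lngMid) = strPart Then
            BinarySearch = lngMid
            Exit Do
        ElseIf arr(lngMid) < strPart Then
            lngLo = lngMid + 1
        Else
            lngHi = lngMid - 1
        End If
    Loop
End Function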
Another approach would be to use a Dictionary object in place of the array.  The key would be the part number and the data would be a sequential number that is the line number of the associated data.

Note: although it is theoretically possible to store all the data in a dictionary object, the memory footprint would most likely be prohibitively large.  Based on the sample workbook, storing the data in a dictionary object would consume 153 MB of RAM.  If your users have sufficient free physical memory, then this would be the best non-database implementation.
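
As a sketch of that flow, assuming the big export and PtNum.txt share the same row order (both paths are placeholders): populate the dictionary once, then skip directly to the matching line of the data file.

' Build the part-number -> line-number dictionary once (e.g., at HTA load).
Dim oFS, oDic, oTS, lngCount, lngLine, i, strRow
Set oFS = CreateObject("Scripting.FileSystemObject")
Set oDic = CreateObject("Scripting.Dictionary")
Set oTS = oFS.OpenTextFile("C:\Data\PtNum.txt", 1)   ' 1 = ForReading
lngCount = 0
Do Until oTS.AtEndOfStream
    lngCount = lngCount + 1
    oDic(oTS.ReadLine) = lngCount   ' a duplicate key would silently overwrite here
Loop
oTS.Close

' On demand: look up the line number, then skip to that line in the big file.
If oDic.Exists("D8267") Then
    lngLine = oDic("D8267")
    Set oTS = oFS.OpenTextFile("C:\Data\BigData.txt", 1)
    For i = 1 To lngLine - 1
        oTS.SkipLine
    Next
    strRow = oTS.ReadLine   ' the full data row for the requested part
    oTS.Close
End If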
Hi there, aikimark.  Thanks for the support and the quick response on this.  Maybe I stated what I wanted poorly.  I don't want these data placed in textboxes per se, but in formatted fields.  The intent was only to create a structured user interface.

Using files formatted as .txt would be an option.  The nightly reports I pull have the option of being exported as such.  If that could speed up the search, I'm all for it.  Another export option is HTML.  Would that be a better option?

Just out of curiosity, are you familiar with Oracle Discoverer desktop?  This is what I am using to query the data each night and then export each report in sequence.  If there are options there, I am all ears as well.  Thank you so much for the help!
Pertaining to the dictionary object, this may be an option.  We have a public directory, accessible by everyone, which has 29.7 GB of 499 GB free.  Would this be doable?
Disk space isn't the problem.  Available physical memory on each user's PC is the limiting factor.

Create the PtNum.txt file and see how long it takes to populate a dictionary object, how much of a memory footprint it has, and see how long it takes to find a part number.
I hope you'll pardon my ignorance here, but I am not well versed in this area at all.  How can I get started on what you stated above?

Create the PtNum.txt file and see how long it takes to populate a dictionary object, how much of a memory footprint it has, and see how long it takes to find a part number.
You might create a file with just the data from the first column.
or
You can also open the big workbook, delete all the columns after column A, and Save As a text file.
I could simply create a new report in Discoverer with just the essential information, then have that export to a text file every night for updated data.  That could work.

How would it work if I went the database route?  Would I simply download the DB locally and grant everyone access, or would they need to download it as well?  Thanks again!
It depends on where your users are located.  If they are all on the same LAN, you only need one database.  Everyone should be able to do queries against it.
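
With a file-based engine such as Access, for example, each user's HTA would just point its connection string at the shared copy; a sketch, where the UNC path is a placeholder:

' Every PC on the LAN opens the same database file read-only.
Dim cn
Set cn = CreateObject("ADODB.Connection")
cn.Open "Provider=Microsoft.ACE.OLEDB.12.0;" & _
        "Data Source=\\FileServer\Public\PartMaster.accdb;" & _
        "Mode=Read;"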
Perfect.  Everyone is on the same network.  What would be the best route to take, database-wise?  Obviously, I would like to go the free/cheapest route.  Any suggestions?
Any names of databases that you think would best suit my needs here?  Also, how would I go about connecting this database to the spreadsheet that is updated every night, or should I have Oracle Discoverer export directly to the database?  Thanks
What about using the Oracle database you already have?
It is not user-friendly.  Also, users are not allowed to query against the database.
But this is a fixed query, not an ad hoc query.
This is something suitable for a view, a daily snapshot of the data, not even live data.

What is their problem?
Haha - I wish I knew; if I did, I would not be searching for this.  I assume you'd like for me to close this one out?  Do you mind taking a peek at another one of my questions, which has had zero responses?

Other Question
I didn't know if you were going to try one of the dictionary object solutions in this thread.
I'm not certain which thread you are referring to.  I beg your forgiveness.
Please look through the comments I posted in this thread.
Okay, I went back and found the thread where you stated:
A database would most likely be the best technology to apply to this solution, since databases use indexes.  If you want to try something similar, you might do the following:
1. sort the Excel data in Part Number order
2. copy the Part Number column to a new workbook
3. save the workbook as a text file or tab-delimited file (example: PtNum.txt)
4. Save the original workbook as a text file
5. When the user wants to retrieve the part number row, the program reads the entire PtNum.txt file, uses the SPLIT() function to move the data into an array, and then does a binary search for the data.  With the 750k part number estimate, it will only take about 20 comparisons to find the part.  Since this is in memory, only the first search will be slow.
6. The big data file is opened with a FileSystemObject Textstream and the associated data line is read and displayed.


I have made all of the suggested modifications as directed.  Now, what do I need to do to the code itself in order for it to search/use the *.txt files instead of the *.xlsx?  Please bear with me.  I apologize for the ignorance.  Thanks for the help aikimark.
The searches are still taking about 3 minutes to run.  Even after the first search has been conducted, it still maintains an average of around 3-4 minutes per search.  Just wanted to give a heads up.  Thanks again.
Put the PtNum.txt file in a zip file and upload it.
Here it is.
PtNum.zip
Alert: Of the 581089 part numbers, you have 1846 duplicate values.  Duplicate key values are not permitted in dictionary objects. You may have a similar problem with a database as well as the larger problem of retrieving data for the wrong part number.

In my VBA test environment, I measured the following performance:
Time to read 581089 lines:   0.9189453 sec

Time to read ONLY line 11:                 0 sec
Time to read ONLY line 101:                9.765625E-04 sec
Time to read ONLY line 1001:               1.953125E-03 sec
Time to read ONLY line 10001:              1.171875E-02 sec
Time to read ONLY line 100001:             0.1142578 sec
Time to read ONLY line 200001:             0.2441406 sec
Time to read ONLY line 300001:             0.3691406 sec
Time to read ONLY line 400001:             0.5 sec
Time to read ONLY line 500001:             0.6230469 sec

Time to fill dictionary object with 581089 part nums:        42.48242 sec
Time to do (above) nine dictionary item retrievals:              3.320313E-02 sec

===================
The file I am reading is the part number file you posted, so it is much smaller than the large file we will be accessing in production. I would expect those file/line read times to be larger than the above performance figures.

Once the dictionary object is filled, any single item lookup in the dictionary object would be about .0037 seconds, followed by another 1-4 seconds to retrieve the data line for that part number.

The big file can be used both to populate the dictionary object and hold the related part data. The PtNum.txt request was meant to minimize the size of the file you uploaded to EE.

===================
Here is my performance test code.  
Note: You need to add a Scripting Runtime reference to your VBA project in order to test this.
Option Explicit

Public Sub ReadPtNum()
    Dim oFS As New Scripting.FileSystemObject
    Dim oDic As New Scripting.Dictionary
    Dim oTS As TextStream
    Dim lngCount As Long
    Dim lngDupCount As Long
    Dim sngStart As Single
    Dim sngTotal As Single
    Dim strLine As String
'    lngCount = 0
'    sngStart = Timer
'    Set oTS = oFS.OpenTextFile("C:\Users\Mark\Downloads\PtNum.txt", ForReading, TristateFalse)
'    Do Until oTS.AtEndOfStream
'        strLine = oTS.ReadLine
'        lngCount = lngCount + 1
'    Loop
'    Debug.Print "Time to read " & lngCount & " lines:", Timer - sngStart
'    oTS.Close
    
'    lngCount = 0
'    sngStart = Timer
'    Set oTS = oFS.OpenTextFile("C:\Users\Mark\Downloads\PtNum.txt", ForReading, TristateFalse)
'    For lngCount = 1 To 500000
'        oTS.SkipLine
'    Next
'    strLine = oTS.ReadLine
'    Debug.Print "Time to read ONLY line " & lngCount & ":", Timer - sngStart, strLine
'    oTS.Close
    
    oDic.RemoveAll
    lngCount = 0
    sngStart = Timer
    Set oTS = oFS.OpenTextFile("C:\Users\Mark\Downloads\PtNum.txt", ForReading, TristateFalse)
    Do Until oTS.AtEndOfStream
        strLine = oTS.ReadLine
        lngCount = lngCount + 1
        If oDic.Exists(strLine) Then
            lngDupCount = lngDupCount + 1
        Else
            oDic.Add strLine, lngCount
        End If
    Loop
    Debug.Print "Time to fill dictionary object with " & lngCount & " part nums:", Timer - sngStart, "Duplicate Count:", lngDupCount
    oTS.Close
    
    sngStart = Timer
    Debug.Print oDic("1408015")
    Debug.Print oDic("D8267")
    Debug.Print oDic("P877-18.5X31")
    Debug.Print oDic("10-0010")
    Debug.Print oDic("CPR OCN1291SF590")
    Debug.Print oDic("E01228-12EW")
    Debug.Print oDic("P2-SP")
    Debug.Print oDic("GEN III MC L4 TOP XLXXL")
    Debug.Print oDic("XX67222S")
    Debug.Print "Time to do (above) nine dictionary item retrievals:", Timer - sngStart

    
End Sub


Awesome!  I will look into this one now.  I appreciate that!!
This might also be solved with XML/XSLT.
Actually, I was recently introduced to this and considered it, but how would I go about setting it up?  I am thinking it would be quicker too, correct?
I'm not sure about performance.  I thought about this because you might place the XML file on a server, and file I/O to a server might be slow.  I'm not sure how well this would perform, but it is technically possible to do this.
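
To give a feel for the XML route, the lookup itself could be a single XPath query; the file name, element, and attribute names below are pure assumptions, and an XSLT stylesheet could then render the matches as HTML for display:

' Sketch of an XML part lookup.  File, element, and attribute names are
' hypothetical; the real export's layout would dictate the XPath.
Dim oXml, oNode
Set oXml = CreateObject("MSXML2.DOMDocument.6.0")
oXml.async = False
If oXml.Load("C:\Data\ItemMaster.xml") Then
    Set oNode = oXml.selectSingleNode("//Item[@PartNumber='D8267']")
    If Not oNode Is Nothing Then
        MsgBox oNode.xml   ' the matching item and all its attributes
    End If
End If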
So, in your opinion, what is the best path to take at this point?
1.  place a .txt version of the big data file in a server directory
2. run some of the performance tests from within an Office product against the server resident big file.
3. place a copy of the .txt file (1) on your local hard drive
4. repeat the performance tests against the local copy of the .txt file.

You will need to address the duplicate part number problem I mentioned in my earlier comment.

Depending on the performance, you might be able to use this as a temporary solution until you implement a database solution. If the server-resident performance sucks and a local copy isn't feasible, we can look at the performance of an XML/XSLT solution.
I have tested this as directed.  While it was not the slowest in the world, it was in no way the fastest.  Can we look at testing the XML/XSLT?  I can simply compare performance with the .txt and make a call on which performs better and has the best dependability.  Thanks
What was the local performance like?
Sorry, I commented on the wrong question.  :(
Do you want me to provide results, as you did previously?
Yes.  I wanted you to run the performance tests, both server and locally.
Roger that.

As for the duplicate part number issue, the parts you call duplicates are in truth not duplicated.  These had their leading '0's chopped off due to .csv formatting.  Do you think we could search based on both the part number and the unique ID that our Oracle database assigns to each item?  But this would require the user to provide both, and oftentimes the user does not have the ID.  Is there a way to keep the '0's from being chopped off by the .csv format?

 I will provide the info requested in the server testing as well.  God bless!
ASKER CERTIFIED SOLUTION (aikimark)

I'm going to accept this as resolved since I have killed this project.  Thanks, aikimark.