
Solved

# Algorithm Challenge

Posted on 2002-03-15
Medium Priority
447 Views
Given:
1- A list of files and their sizes in bytes, and
2- a list of servers and their "processing power" in bytes per second, and
3- the fact that there are many more files than servers.
How can the files be processed in the least amount of time?
In other words what would be the best algorithm to divide the files into sub-lists, one list for each server so that the overall processing time would be minimal?

My solution so far is to sort both lists in descending order, then assign the biggest file to the fastest server, the second biggest file to the second fastest server, and so on. When the end of the server list is reached, the next file is assigned to the fastest server again, the file after that to the second fastest server, and so on, until the file list is used up.
It seems to me that this problem has been faced many times before, and that there might be a better solution.
This problem is not hypothetical. I am trying to use old computers and DCOM to process a huge number of large log files.  I want to preconfigure the lists to save network overhead. The servers are all fully dedicated. The “processing power” of each server would be determined by a standard benchmark test. Each of the files is simply converted from binary to text, before further processing.
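[Editor's note] The round-robin scheme described above can be sketched as follows. This is an illustrative Python sketch, not code from the thread; the function name and return shape are mine.

```python
def round_robin_assign(files, servers):
    """Round-robin assignment as described in the question:
    sort both lists descending, give the biggest file to the fastest
    server, the next file to the next fastest, wrapping around.
    `files` holds sizes in bytes; `servers` holds speeds in bytes/sec.
    Returns one sub-list of file sizes per server (original server order)."""
    files = sorted(files, reverse=True)
    # server indices ordered fastest to slowest
    order = sorted(range(len(servers)), key=lambda i: servers[i], reverse=True)
    queues = [[] for _ in servers]
    for k, size in enumerate(files):
        queues[order[k % len(servers)]].append(size)
    return queues

queues = round_robin_assign([100, 80, 70, 50], [50, 100])
# → [[80, 50], [100, 70]]: the 100-byte and 70-byte files go to the faster server
```

Note this ignores how long each server actually takes, which is exactly the weakness the experts discuss below.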
0
Question by:wsfindlater

LVL 22

Expert Comment

ID: 6868441
Just thinking out loud...

Certainly, the biggest should go to the fastest server.  After that, anything goes.

For example, suppose that the first server has a 3 GHz processor, and the remaining servers are all 3 MHz 8086 processors?

Given a file list whereby the first two files require a process time of 1000 units, and the rest require process times of 1 unit, you'd be much better off having the first two assigned to the fast processor than to place the second file on the second fastest processor.

I think the solution comes in determining a baseline unit time, then dividing by the processor speed (assuming that a 2X processor will process the files twice as fast as a 1X processor, which is not always the case). Once you have this ratio, you can determine how long each processor would take to process each file, then assign each file to the best machine that becomes available next, as long as that is faster than waiting for a better machine.

In some cases, the fastest path may have some machines sitting idle because they would take too long to process a file.  This is especially true as you approach the end of the list.
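[Editor's note] The "assign each file to the machine that finishes it soonest" idea above is essentially the classic longest-processing-time greedy. A minimal Python sketch (illustrative only; names are mine, not from the thread):

```python
def greedy_finish_first(files, speeds):
    """Take files biggest-first; give each one to the server that would
    finish it earliest given that server's current backlog (LPT-style greedy).
    Returns (queues, makespan_in_seconds)."""
    finish = [0.0] * len(speeds)          # current finish time per server
    queues = [[] for _ in speeds]
    for size in sorted(files, reverse=True):
        # pick the server whose finish time grows the least overall
        i = min(range(len(speeds)), key=lambda s: finish[s] + size / speeds[s])
        finish[i] += size / speeds[i]
        queues[i].append(size)
    return queues, max(finish)

# The scenario from this comment: one fast server, one very slow one,
# two huge files and some tiny ones.
queues, t = greedy_finish_first([1000, 1000, 1, 1, 1], [1000, 1])
# → queues == [[1000, 1000, 1], [1, 1]]: both big files land on the fast server
```

This reproduces the point above: the second-biggest file goes to the fast machine too, rather than to the second-fastest machine.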
0

LVL 8

Expert Comment

ID: 6868491
Hi!

Sort the file list in proportion to the 'processing power'.  In other words, if there are 3 servers, with processing powers of 5000 bps, 3000 bps and 1000 bps, they will handle file sizes of 50MB, 30MB and 10MB respectively : )

Rough algorithm:

- Get the size of each file and add it to a long integer, TotalWork
- Get the processing power of each server and add it to a long integer, TotalPower
>Loop from 1 to the number of servers
- Get this server's share of the work: Percent = ServerPower / TotalPower
- Find a file roughly equal to Percent * TotalWork
- If no single file is big enough:
  - Take the biggest remaining file
  - Find a 2nd file near the remaining difference from Percent * TotalWork
  - Loop while abs(filesize - difference) > 10MB 'or whichever tolerance you prefer
    - If the best file found exceeds the difference by too much, exit the loop and go to the next server
    - Else find the next file nearest to the difference
- Else:
  - Assign the biggest file to the current server
  - Go to the next server
>End of loop

That's it!
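[Editor's note] The proportional idea above can be sketched more concretely. This is an illustrative Python sketch of the general approach (byte budgets proportional to speed, filled largest-file-first), not the exact algorithm described; the function name is mine.

```python
def proportional_split(files, speeds):
    """Give each server a byte budget proportional to its speed,
    then fill each budget with the largest remaining files that fit."""
    total_bytes = sum(files)
    total_power = sum(speeds)
    remaining = sorted(files, reverse=True)
    queues = []
    for speed in speeds:
        budget = total_bytes * speed / total_power
        queue, used = [], 0
        for f in list(remaining):
            if used + f <= budget:
                queue.append(f)
                used += f
                remaining.remove(f)
        queues.append(queue)
    # anything that fit nowhere goes to the fastest server
    if remaining:
        queues[speeds.index(max(speeds))].extend(remaining)
    return queues

# The comment's own example: 5000/3000/1000 bps servers, 90 MB of files
# → budgets of 50, 30 and 10 MB respectively
split = proportional_split([50, 30, 10], [5000, 3000, 1000])
# → [[50], [30], [10]]
```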

0

LVL 44

Expert Comment

ID: 6868501
This sounds VERY close to the classic Operations Research problem called the "KNAPSACK problem".

In the Knapsack problem, you have some knapsacks (think backpacks) which have certain volumes, and a collection of boxes, each with its own size (length x width x height), and the problem is to pack the boxes into the knapsacks in the MOST efficient manner.

0

LVL 22

Expert Comment

ID: 6868513
So, Arthur, don't leave us hanging...what's the answer?  (I never got around to taking that class in college so I never found out :( )
0

LVL 8

Expert Comment

ID: 6868514
Hi!

Oops!  I didn't see any comments when I first looked at this question.  Anyway, I presume that the 'processing powers' you mentioned are obtained experimentally and not theoretically.

Also, forgive me if my algorithm is really messy.  I haven't written algorithms since my computer science teacher finished covering them some 3/4 of a year ago : )

That's it!

0

LVL 1

Expert Comment

ID: 6868651
There is a better solution, although you can't give it without knowing the values of the variables involved. It's a simple linear programming problem - one that Excel might be able to solve for you (as far as I can remember there is some goal-seek-with-constraints function). The algorithm is called the simplex method (there are derivatives of it as well) and it's the fundamental optimization algorithm. However, it's quite difficult to implement, and unless you are doing this a fair bit it's probably not worth it. There are freeware packages out there which you might be able to link to which would give the optimal solution - but I suppose it depends on how much effort you're willing to put in to save some time. Your rule of thumb should work fairly well (although sometimes it might not be best to give the largest file to the fastest server - it's possible to contrive an example, if I weren't so apathetic!).
0

Author Comment

ID: 6868661
glass cookie: Your algorithm is the first I've seen, but I have to wait and see if there is a better one.
How do you determine "exceeds the difference by too much"?
Yes, 'processing powers' would be determined experimentally, by some benchmark run.
Arthur: I too would like to know the solution. Yes, it does sound like a similar problem.
0

Author Comment

ID: 6868674
gd2000: Very interesting. thanks
0

Author Comment

ID: 6868725
Is this completely hypothetical? If not, then this seems overly simplistic. There would be a lot more variables that would come into play. Are the lists of files to be processed going to be preconfigured on each server, or does something have to send a request for each file? If the latter is true, then you also have to account for network load etc. Are the servers going to be fully dedicated to this process? If not, then current load and function need to come into play. Also, the differential of processing power between each server needs to be considered. In other words, with your solution you are assuming that the difference in file size is proportionately the same as the difference in available processing power of each server. This may not be the case. If the difference in processing power from server to server is greater than the difference in file size, it would possibly make sense to have the largest server do more of the larger files and the smallest do more of the smaller ones. Another thing - will each of the files be processed in the same way - similar data etc?

No, this is not hypothetical. I am trying to use old computers and DCOM to process a huge number of large log files.  I want to preconfigure the lists to save overhead. Yes, the servers are all fully dedicated. The “processing power” of each server would be determined by a standard benchmark test. Each of the files is simply converted from binary to text, before further processing.
0

LVL 44

Expert Comment

ID: 6868771
If you want to look at approaches to the general solution of the Knapsack Problem, go to www.google.com and search on "Knapsack Problem".  This is a classic problem in Operations Research, so the discussion is couched in HEAVY mathematical terms.
0

LVL 22

Expert Comment

ID: 6868786
Actually, the problem sounds like the SETI project, where various computers are solicited to process data about various stars.  I don't know if they do any optimization calculations, but it may be worthwhile to check out their site and see how they handle it.  It might give you some ideas.

Start at http://www.seti.org although I'm not sure if that's the right place to investigate the Computer-share project.
0

Author Comment

ID: 6868817
wsfindlater, use some solver, start here:

optimization problem
(definition)
Definition: A computational problem in which the object is to find the best of all possible solutions. More formally, find a solution in the feasible region which has the minimum (or maximum) value of the objective function.

0

Author Comment

ID: 6868871
wood: My problem seems similar to the "Knapsack Problem" (KP).  KP involves two variables, volume and weight; my files have only size. KP has one knapsack; I have N servers.
It would take me a long time to understand the math involved. I am hoping that I can find some code already written.
0

LVL 18

Expert Comment

ID: 6869140
I think that this can be simplified somewhat.

Assuming that all machines will start processing at the same time, the shortest overall processing time would be achieved if all computers finished their queue of tasks at exactly the same time.  If one machine finishes early then it could have been taking some of the processing from one of the other machines that is having to run longer.  So, this sounds kind of like a Time/Speed/Distance type of problem where you want all vehicles to finish at the same time.

If:

T = the time to finish
S = the speed of processor (say bytes per second it can process)
D = the distance to cover, in this case, it's bytes to process then the equation is this, I think:

D/S = T

So, if you want T to be equal for all machines, then you need to keep the ratio of D/S to be equal across all machines.

If you calculate the minimum T, it would be the slowest machine processing the smallest files (actually the minimum T would be the fastest machine processing the smallest files, but then what would the slowest machine do?  So, the fastest time for all computers to finish would be the slowest machine taking the smallest files).

Then, given that value for T, you can calculate how many bytes each machine can process in that time:

S * T = D

Then, add up all the D's and see if that comes out to your total for all files.  If not, you're going to have to increase T until you get all bytes covered by all machines.  Then, I'd say to take one machine at a time, maybe fastest to slowest, take the files largest to smallest, and start assigning them to the machine.  When a machine gets to its "D" value, if the last file is too big, just go down until you hit the first one that fits.  If you want to get fancy, you'll try all combinations of the remaining files to get the closest fit for the remaining portion.  Then, move on to the next machine.  By the time you get to the slowest machine, you should have mostly the smallest files remaining, so it should work out OK.
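[Editor's note] The D/S = T reasoning above has a closed form: the ideal equal-finish time is total bytes divided by total speed, and each machine's byte share is its speed times that time. A small illustrative Python sketch (names are mine):

```python
def byte_shares(files, speeds):
    """Ideal equal-finish time T = total bytes / total speed;
    each server's byte share is then D = S * T."""
    T = sum(files) / sum(speeds)
    return T, [s * T for s in speeds]

# two servers at 50 and 100 bytes/sec, 300 bytes of files in total
T, shares = byte_shares([100, 80, 70, 50], [50, 100])
# → T = 2.0 seconds, shares = [100.0, 200.0] bytes
```

The hard part, as the rest of the thread shows, is that whole files rarely add up exactly to these shares.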
0

LVL 22

Expert Comment

ID: 6869206
>So, the fastest time for all computers to finish would be the slowest machine and the smallest files

I'm not sure that's completely correct.

For example, if you have one large file, but 10,000 small files, it may be quicker for the slow machine to crank away on the large file while the fast machine whizzes through the small files.

Example:
Fast machine=1GHz
Slow machine=100MHz
Assumption: fast machine processes 10x faster than slow machine, and overhead of queueing things up is zero.

Filesizes:
One at 10MB
10,000 at 1MB

Say fast machine can process 10MB in 10 seconds and 1MB in 1 second each.  It would take 10010 seconds to do all.

Slow machine can process 10MB in 100 seconds and 1MB in 10 seconds each.

It would seem to be optimal to have the slow machine do the large file while the fast machine does a good chunk of the small files.

After 100 seconds, the large file would be done, and 100 of the 10,000 small files would be done.  Then the remaining 9900 files can be split as 9000 to the fast machine and 900 to the slow machine.

...did I get the numbers right?

--
But I think the idea of looking at it from the "they all end together" perspective is correct.
0

LVL 18

Expert Comment

ID: 6869346
Well, the problem with my previous post is thinking that I could estimate the shortest time.  The issue is that "the slowest machine and the smallest files" might be correct, but HOW MANY of the smallest files?

So, I decided to take a slightly different approach.  Assuming that we want to keep the ratio of D/S (bytes/speed) equal, we can calculate the total bytes processable given a specific value for T, and if that total isn't >= the total file bytes, increase T and run the sum again.  This turns out to be pretty easy to do.  Once you have the approximate T, you can figure out how many of that total number of bytes each processor should do, and then try to divide up the files to give each processor as close to its max as possible.  The only thing I found in practice that was different than what I expected is that when assigning the files, it was easier to get a good distribution if I filled the slowest machine with the smallest files first.  Here is the sample I was using to test:

Dim Processors(1 To 10) As Long
Dim Files(1 To 100, 1 To 2) As Long

Private Sub Command1_Click()
    Dim i As Long
    Dim j As Long
    Dim T As Long
    Dim lTotalBytes As Long
    Dim lTempBytes As Long
    Dim sTemp As String

    ' Loading sample array with bytes-per-second processing speeds, sorted fastest to slowest
    Processors(1) = 1000
    Processors(2) = 1000
    Processors(3) = 800
    Processors(4) = 800
    Processors(5) = 500
    Processors(6) = 500
    Processors(7) = 500
    Processors(8) = 300
    Processors(9) = 300
    Processors(10) = 200

    ' The second element is a flag set once we start assigning files to processors
    For i = 1 To 100
        Files(i, 1) = 150000 / i
        Files(i, 2) = 0
        lTotalBytes = lTotalBytes + (150000 / (i * 0.8))
    Next i

    ' Keep going until we find a value for Time (T) that will
    ' let us process all of the bytes
    T = 1
    Do
        lTempBytes = 0
        For i = 1 To 10
            lTempBytes = lTempBytes + (T * Processors(i))
        Next i
        If lTempBytes < lTotalBytes Then
            T = T + 1
        Else
            Exit Do
        End If
    Loop

    Me.Cls
    For i = 1 To 10
        Me.Print "Processor " & i & " " & T * Processors(i) & " bytes"
    Next i

    Me.Print "This will take approximately " & T & " seconds to complete."

    ' For each processor, start at the slowest
    For i = 10 To 1 Step -1
        lTempBytes = 0
        sTemp = ""
        ' For each file, start at the smallest
        For j = 100 To 1 Step -1
            ' If the file is not already assigned
            If Files(j, 2) = 0 Then
                ' If there is room left for this processor to do this file
                If (lTempBytes + Files(j, 1)) < (T * Processors(i)) Then
                    lTempBytes = lTempBytes + Files(j, 1)
                    ' Mark this file taken
                    Files(j, 2) = 1
                    sTemp = sTemp & j & ","
                End If
            End If
        Next j
        ' Truncate the trailing comma
        If Len(sTemp) > 0 Then
            sTemp = Left(sTemp, Len(sTemp) - 1)
        End If
        Me.Print i & " Processor (" & sTemp & ") " & vbTab & lTempBytes & " Bytes" & vbTab & (lTempBytes / Processors(i)) & " Seconds"
    Next i

End Sub
0

LVL 18

Expert Comment

ID: 6869362
Oops....

This:
For i = 1 To 100
    Files(i, 1) = 150000 / i
    Files(i, 2) = 0
    lTotalBytes = lTotalBytes + (150000 / (i * 0.8))
Next i

Should have been:

' just trying to get a wider distribution of values
For i = 1 To 100
    Files(i, 1) = (150000 / (i * 0.8))
    Files(i, 2) = 0
    lTotalBytes = lTotalBytes + (150000 / (i * 0.8))
Next i

Also, it turns out that the fastest machine usually doesn't have enough files left to fill its timeslot.  So, it might be necessary to go back, find the queues with the most bytes to process, and grab one or two files from them to even things out more.
0

Author Comment

ID: 6869369
I also posted this question to the C++ group, and I accepted the answer below.
I still have the problem of dividing the files into the proper "percents" of the job. In the example below the divisions came out neatly, but in general I still do not know how to divide up the files into the percents. This should not, however, be too hard.

“The idea is to preconfigure the files so the processing time that the processor will "work" will be
equal on all processors.

Look at this simple example:
2 processors - 50 bytes per second
100 bps

4 files - 100 bytes
80 b
70 b
50 b

The total bytes to process are 300.
The total "processing power" is 150 bps.
The result is that the first processor (50) gets 1/3 (50/150) of the job = the first file (100), and the second processor gets 2/3 (100/150) of the job = 200 bytes = all the other files.
Did you get the picture?
I asked you about splitting file to make it optimal, but you don't need it in this case at all. “
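[Editor's note] Dividing the files to match those percentages can be sketched as a greedy fill: walk the files largest-first and put each on the server with the most quota left. This is an illustrative Python sketch (names are mine), and the greedy is not always optimal, as its own output shows.

```python
def fill_quotas(files, speeds):
    """Each server's quota is its fraction of total power times total bytes;
    assign files largest-first to the server with the most quota remaining."""
    total = sum(files)
    power = sum(speeds)
    quota = [total * s / power for s in speeds]
    queues = [[] for _ in speeds]
    for f in sorted(files, reverse=True):
        i = max(range(len(speeds)), key=lambda s: quota[s])
        queues[i].append(f)
        quota[i] -= f
    return queues

# The example above: speeds 50 and 100 bps, files of 100, 80, 70, 50 bytes
split = fill_quotas([100, 80, 70, 50], [50, 100])
# → [[80], [100, 70, 50]]: 1.6 s and 2.2 s, vs 2.0 s / 2.0 s for the
#   hand-worked split above - a reasonable but not optimal fit
```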

0

Author Comment

ID: 6869380
mdougan:
0

LVL 18

Expert Comment

ID: 6869390
After grabbing an array sorting routine off the net, I made this more interesting by randomly calculating the File sizes, then sorting the files Array.  Now, I get a much better distribution of files, and if you compare the execution times for each processor, they're pretty close:

Dim Processors(1 To 10) As Long
Dim Files(1 To 100, 1 To 2) As Variant

Private Sub Command1_Click()
    Dim i As Long
    Dim j As Long
    Dim T As Long
    Dim lTotalBytes As Long
    Dim lTempBytes As Long
    Dim sTemp As String

    ' Loading sample array with bytes-per-second processing speeds, sorted fastest to slowest
    Processors(1) = 1000
    Processors(2) = 1000
    Processors(3) = 800
    Processors(4) = 800
    Processors(5) = 500
    Processors(6) = 500
    Processors(7) = 500
    Processors(8) = 300
    Processors(9) = 300
    Processors(10) = 200

    ' The second element is a flag set once we start assigning files to processors.
    ' Random sizes give a wider distribution of values.
    Randomize
    For i = 1 To 100
        ' Int((upperbound - lowerbound + 1) * Rnd + lowerbound)
        Files(i, 1) = (Int((32000 - 100 + 1) * Rnd + 100) * 100)
        Files(i, 2) = 0
        lTotalBytes = lTotalBytes + Files(i, 1)
    Next i

    ' Sort descending on file size, so the largest files come first
    SortArray Files(), 1

    ' Find the smallest whole-second T whose capacity covers all the bytes
    T = 1
    Do
        lTempBytes = 0
        For i = 1 To 10
            lTempBytes = lTempBytes + (T * Processors(i))
        Next i
        If lTempBytes < lTotalBytes Then
            T = T + 1
        Else
            Exit Do
        End If
    Loop

    Me.Cls
    For i = 1 To 10
        Me.Print "Processor " & i & " " & T * Processors(i) & " bytes"
    Next i

    Me.Print "This will take approximately " & T & " seconds to complete."

    ' Fill the slowest processor first, starting with the smallest files
    For i = 10 To 1 Step -1
        lTempBytes = 0
        sTemp = ""
        For j = 100 To 1 Step -1
            If Files(j, 2) = 0 Then
                If (lTempBytes + Files(j, 1)) < (T * Processors(i)) Then
                    lTempBytes = lTempBytes + Files(j, 1)
                    Files(j, 2) = 1
                    sTemp = sTemp & j & ","
                End If
            End If
        Next j
        ' Truncate the trailing comma
        If Len(sTemp) > 0 Then
            sTemp = Left(sTemp, Len(sTemp) - 1)
        End If
        Me.Print i & " Processor (" & sTemp & ") " & vbTab & lTempBytes & " Bytes" & vbTab & (lTempBytes / Processors(i)) & " Seconds"
    Next i

End Sub

Public Sub SortArray(ByRef DArray(), Element As Integer)
    ' Shell sort of the rows of a 2-D array, descending on column Element,
    ' so the largest values end up at index 1
    Dim gap As Long, doneflag As Long
    Dim Index As Long, acol As Long
    Dim vTemp As Variant

    gap = Int(UBound(DArray, 1) / 2)
    Do While gap >= 1
        Do
            doneflag = 1
            For Index = LBound(DArray, 1) To (UBound(DArray, 1) - gap)
                ' Swap whole rows if this row's key is smaller than the row gap places down
                If DArray(Index, Element) < DArray(Index + gap, Element) Then
                    For acol = LBound(DArray, 2) To UBound(DArray, 2)
                        vTemp = DArray(Index, acol)
                        DArray(Index, acol) = DArray(Index + gap, acol)
                        DArray(Index + gap, acol) = vTemp
                    Next acol
                    doneflag = 0
                End If
            Next Index
        Loop Until doneflag = 1
        gap = Int(gap / 2)
    Loop
End Sub
0

LVL 46

Expert Comment

ID: 6869391
I haven't read anything in this discussion about network traffic constraints.
"I never met an I/O that I liked."

Part of this problem's solution should also consider the size of the data pipe from data source to server.

Also, you need to consider the total performance of the server, not just its CPU clock speed.  Performance is affected by cache size, bus speed, and both memory size and speed.

I didn't read what kind of processing you need to perform on the servers.  Some solutions depend on the kind of work the servers perform.

=============================================
My personal recommendations to the problem are:
1. minimize network I/O.
2. approach the file list from both ends, with the most powerful servers processing the largest (closest) files and the least powerful server(s) processing the smallest (closest) files.
3. create a queue manager that will return the next file most appropriate to the server/source requester.
0

LVL 18

Expert Comment

ID: 6869398
Disclaimer: this is a quick and dirty pass at the algorithm.  I think there may be cases where not all files get assigned - for example, if there is a file that is larger than the largest byte queue on the fastest machine.  Some redundant checks should be done to ensure all files are assigned.
0

LVL 18

Accepted Solution

mdougan earned 800 total points
ID: 6869414
Sorry for the multiple posts, but I found that if I calculated the value for Time (T) to allow for processing just a little more than the total bytes for all files, it gave me just enough room to ensure all files were always assigned to a slot... so, change the code above to use this for calculating T:

T = 1
Do
    lTempBytes = 0
    For i = 1 To 10
        lTempBytes = lTempBytes + (T * Processors(i))
    Next i
    ' make the capacity a little bigger than the total
    If lTempBytes < (lTotalBytes * 1.05) Then
        T = T + 1
    Else
        Exit Do
    End If
Loop
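[Editor's note] That loop just searches for the smallest whole-second T whose capacity covers 105% of the bytes, which can be computed directly. An illustrative Python sketch (the function name is mine):

```python
import math

def min_time(total_bytes, speeds, slack=1.05):
    """Smallest whole-second T with T * sum(speeds) >= slack * total_bytes,
    the same T the VB loop above converges to."""
    return math.ceil(total_bytes * slack / sum(speeds))

# The sample setup from the thread: 10 processors totalling 5900 bytes/sec
# and 163016500 total bytes
T = min_time(163016500, [1000, 1000, 800, 800, 500, 500, 500, 300, 300, 200])
# → 29012
```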
0

LVL 22

Expert Comment

ID: 6869427
Another thing to keep in mind:  it may be quicker to keep a slower machine idle:

Case in point:

Two machines, one 10X faster

Two files, both 100K

If you apply both files to the 10X machine, you will complete faster than if you apply one file to each machine.

So again, the best solution may mean that some machines are idle once the smaller files are processed.
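[Editor's note] That two-file case works out numerically. An illustrative Python sketch (names are mine), assuming the stated 10x speed ratio:

```python
def makespan(assignment, speeds):
    """Finish time when each queue of file sizes runs on its server."""
    return max(sum(q) / s for q, s in zip(assignment, speeds))

speeds = [10, 1]                                     # fast machine is 10x faster
both_on_fast = makespan([[100, 100], []], speeds)    # slow machine sits idle
one_each = makespan([[100], [100]], speeds)
# → both_on_fast = 20.0, one_each = 100.0: idling the slow machine wins
```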
0

LVL 46

Expert Comment

ID: 6869436
That's why I recommended a queue manager.
0

LVL 4

Expert Comment

ID: 6869539
(listening)
0

Author Comment

ID: 6869541
mdougan: Your results are "the same" as the solution I got from the C++ group.

lTotalBytes: 163016500
Total processing power: 5900

Processor   Power   % of work   Bytes
1           1000    0.169492    27629915
2           1000    0.169492    27629915
3           800     0.135593    22103932
4           800     0.135593    22103932
5           500     0.084746    13814958
6           500     0.084746    13814958
7           500     0.084746    13814958
8           300     0.050847    8288975
9           300     0.050847    8288975
10          200     0.033898    5525983

Processor   1    27630000 bytes
Processor   2    27630000 bytes
Processor   3    22104000 bytes
Processor   4    22104000 bytes
Processor   5    13815000 bytes
Processor   6    13815000 bytes
Processor   7    13815000 bytes
Processor   8    8289000 bytes
Processor   9    8289000 bytes
Processor   10   5526000 bytes
0

Author Comment

ID: 6869572
lidorc of the C++ group had a better answer for the "bytes per server".
mdougan provided a total answer, however, and code as well.
Thank you, mdougan, and everyone else.
0

LVL 4

Expert Comment

ID: 6869596
ah, too many threads. Skipping quickly through what is left, I am about caught up, other than details. Here are my random thoughts (pardon any replication).

1) Old memories of such problems have it that you begin with the smallest sack - the slowest machine - first. For this machine you put on it the largest file that it can process. Assessing the 'can process' part would be addressed as prior contributors discussed: averaging, finding the overall mean or optimal time mathematically as if the fit were exactly perfect.

2) The goal remains to have only the teeniest files around at the end for whichever server is available.

3) By saturating the slowest server first, by the end of the game you may have, for example, 3 files to go and 6 of 10 servers available; you have mechanically increased the odds of having some faster servers left to process the remaining jobs (assuming the byte division is not perfectly even).

4) The focus here is too much on CPU power. My initial assumption (another thread) is that the designer has incorporated other features of the computers to come up with something called speed, not simply CPU. This can really be problematic for non-theoretical, real-world situations, so inside knowledge of the data is essential to get the best-fit algorithm. Put simply: can we assume that the benchmarking runs include things other than MHz, so that the manufactured 'speed' number has meaning in the real world?

5) If this was said, I missed it: can files be merged? Presorted? For example, a log file's name might carry the day while its records do not, versus each log record carrying all the information relevant to the problem, including the date. I thought it was said that some preprocessing was done - or is this question itself part of the preprocessing?
0

LVL 4

Expert Comment

ID: 6869603
>thank you

over?
done?
0

Author Comment

ID: 6869648
over
done
0
