Cluskitt

asked on

# Adding different values from a table to find a specific sum value

There are times when I need to find a certain value out of a number of records. For example, assume this table:

tblValues (ID_Value int PK,ID_Code PK,CalcValue real)

It can have values like:

1,1,123.52

1,2,15421.23

2,1,2367

3,3,4352.12

Usually, it will have a few thousand rows (the actual table has more fields, but they're not relevant).

Sometimes, I need to find rows that, when added, will generate a certain number. For example, using the above values, let's say I need to find a value of 2490.52. If I add the 1st and 3rd rows, I will get that value. If I need to find a value of 6842.64, I will get it with the 1st, 3rd and 4th rows.

Usually, when such a need arises, we need to see if any one row has that value. That is basic SQL and I won't even bother with that. Next, we move to two values. I usually do it this way:

```
SELECT t1.ID_Value,t1.ID_Code,t1.CalcValue,
t2.ID_Value,t2.ID_Code,t2.CalcValue
FROM tblValues t1
INNER JOIN tblValues t2
ON t1.ID_Value<>t2.ID_Value OR t1.ID_Code<>t2.ID_Code
WHERE t1.CalcValue+t2.CalcValue=2490.52
```

If we then need to move on to 3 rows, I add another table, and so on. This approach has, basically, two problems:

1- Records get repeated. That is, it will return row 1 + row 3 AND row 3 + row 1. This quickly escalates as you add more joined tables.

2- The query quickly gets too slow to be practical. We can only run a 4 rows sum on a small subset of the table and even then, it takes too long.

So, what is the best approach to this problem? Is there a more efficient way?

Even better, is there some way to return something like all possible one row combinations, then all possible two row combinations, then all of 3, etc, up to a reasonable max of, let's say, for example, 5?

Also important, is there any way not to get what are basically repeated records?
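The duplicate-row problem disappears if the rows are enumerated as unordered combinations rather than as self-joined permutations. A minimal Python sketch of that idea, assuming the rows are pulled into memory client-side (the function and variable names are illustrative, not from the thread):

```python
from itertools import combinations

def find_matches(rows, target, max_rows=5, tol=0.005):
    """Return every unordered combination of up to max_rows rows whose
    CalcValue fields sum to target (within a float tolerance).
    rows: list of (id_value, id_code, calc_value) tuples."""
    matches = []
    for k in range(1, max_rows + 1):
        # combinations() yields each subset exactly once, so
        # row 1 + row 3 and row 3 + row 1 can never both appear.
        for combo in combinations(rows, k):
            if abs(sum(r[2] for r in combo) - target) < tol:
                matches.append(combo)
    return matches

rows = [(1, 1, 123.52), (1, 2, 15421.23), (2, 1, 2367.0), (3, 3, 4352.12)]
print(find_matches(rows, 2490.52))   # the 1st and 3rd rows
print(find_matches(rows, 6842.64))   # the 1st, 3rd and 4th rows
```

This enumerates one-row combinations first, then two-row, and so on up to five, which is exactly the requested ordering; the tolerance guards against binary-float rounding on `real` values.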

as for the combinatoric sums, there is no good way to do that. The factorial expansion is going to make that an expensive operation.

you can help by adding some conditions to your joins so as to avoid obviously bad combinations.

for example if your target is 600

row 1 = 100

row 2 = 300

row 3 = 400

once you've added rows 1, 2 and 3 you know you don't need to join any other rows because you've already exceeded the target so joins for rows 4 and 5 can be avoided.
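The early-exit idea above can be sketched procedurally as a depth-first search that abandons any branch whose running total already exceeds the target. A hypothetical Python illustration (assuming all values are positive, as in the 100/300/400 example; the extra values are made up):

```python
def sums_with_pruning(values, target):
    """Depth-first search over ascending index combinations, abandoning
    any branch whose running total already exceeds the target.
    Assumes every value is positive."""
    results = []
    def extend(start, chosen, total):
        if total == target:
            results.append(list(chosen))
            return
        for i in range(start, len(values)):
            if total + values[i] > target:
                continue  # this branch can only overshoot: prune it
            chosen.append(values[i])
            extend(i + 1, chosen, total + values[i])
            chosen.pop()
    extend(0, [], 0)
    return results

print(sums_with_pruning([100, 300, 400, 200, 50], 600))
# [[100, 300, 200], [400, 200]]
```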

ASKER

I have actually thought of the > condition. However, that wouldn't solve the scalability problem. Trying to add 3 rows would still be a nightmare on the whole table. There should be a better solution, even if it's a stored proc with some sort of loop, or a temp table, or anything that will make adding 5 rows take less than a few hours.

it will help the scalability problem

there is really no solution to what you are looking for.

combinatoric expansion is big; there's no fighting the math. The best you can do is block some subset of the combinations from being generated by restricting the join conditions.

I think you can improve it considerably by saving the two-value results and using those to generate the 3-, 4-, etc., value results.

[You could probably add even more efficiency by sorting the values in order first, although I haven't worked out the code in my head to actually implement the savings yet :-) .]

OK, so say you generate the two-value summed result and no two-value total matches the target.

You can selectively join the two-value result to the original list to generate the three-value result.

You can selectively join the two-value result to the two-value result to get the four-value result.

You can selectively join the four-value result to the original list to generate the five-value result.

And so on.

I'll be happy to help with exact code if you provide some sample data, say a thousand rows or so.

Btw, I would assign a single id to each value, to ensure a distinct ascending key for the original values, for efficiency. That id can then be used to look up the original pair of ids.

For even a few thousand rows, an insert into a new table, even sorted, is very minor overhead.
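As a rough procedural reading of the staged approach described above (a sketch with made-up names, using small integer values to avoid float-comparison noise): build the two-value sums once, then reuse them for the three- and four-value levels instead of re-joining the full list each time.

```python
from collections import defaultdict
from itertools import combinations

def staged_search(values, target):
    """values: list of (row_id, amount). Returns the set of frozensets of
    row ids (sizes 1-4) whose amounts sum to target. The two-element sums
    are computed once and reused at the higher levels."""
    pair_sums = defaultdict(list)            # sum -> [frozenset of 2 ids]
    for (a, va), (b, vb) in combinations(values, 2):
        pair_sums[va + vb].append(frozenset((a, b)))

    found = set()
    for rid, v in values:                    # one-element matches
        if v == target:
            found.add(frozenset((rid,)))
    for pair in pair_sums.get(target, []):   # two-element matches
        found.add(pair)
    for rid, v in values:                    # three: single + stored pair
        for pair in pair_sums.get(target - v, []):
            if rid not in pair:
                found.add(pair | {rid})
    for s, pairs in pair_sums.items():       # four: stored pair + stored pair
        for other in pair_sums.get(target - s, []):
            for pair in pairs:
                if pair.isdisjoint(other):
                    found.add(pair | other)
    return found

print(staged_search([(1, 100), (2, 300), (3, 400), (4, 200)], 600))
```

The frozensets of ids make each unordered result appear only once, even when the same four rows are reachable as two different pair-plus-pair splits.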

isn't "selectively joining" just another way of saying "block some subset of the combinations from being generated by restricting the join"?

maybe I'm not understanding what you are suggesting, but I don't read anything in your description that suggests combinatoric explosion will be reduced.

for example, using just 1-20, there are 309 ways to generate sums of 37

if there are thousands of records and higher sums the variations can grow even more

here's a simple example showing the expansion

```
WITH numbers AS (
    SELECT 1 id, 1 n
    UNION ALL
    SELECT id + 1 id, n + 1 n
    FROM numbers
    WHERE n < 20
)
SELECT n1.n, n2.n, n3.n, n4.n, n5.n
FROM numbers n1
LEFT JOIN numbers n2 ON n1.id > n2.id
    AND n1.n + ISNULL(n2.n, 0) <= 37
LEFT JOIN numbers n3 ON n2.id > n3.id
    AND n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) <= 37
LEFT JOIN numbers n4 ON n3.id > n4.id
    AND n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) + ISNULL(n4.n, 0) <= 37
LEFT JOIN numbers n5 ON n4.id > n5.id
WHERE n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) + ISNULL(n4.n, 0) + ISNULL(n5.n, 0) = 37;
```

increasing the range of n and replacing the four 37s with other targets lets you test for other combinations.

1-100 with sums of 177 yields almost a quarter million different permutations (243339)

"selectively joining"

just another way of saying

"block some subset of the combinations from being generated by restricting the join"

maybe I'm not understanding what you are suggesting, but I don't read anything in your description that suggests combinatoric explosion will be reduced.

for example, using just 1-20, there are 309 ways to generate sums of 37

if there are thousands of records and higher sums the variations can grow even more

here's a simple example showing the expansion

WITH numbers

AS (SELECT 1 id, 1 n

UNION ALL

SELECT id + 1 id, n + 1 n

FROM numbers

WHERE n < 20)

SELECT n1.n, n2.n, n3.n, n4.n, n5.n

FROM numbers n1

LEFT JOIN numbers n2 ON n1.id > n2.id

AND n1.n + ISNULL(n2.n, 0) <= 37

LEFT JOIN numbers n3 ON n2.id > n3.id

AND n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) <= 37

LEFT JOIN numbers n4 ON n3.id > n4.id

AND n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) + ISNULL(n4.n, 0) <= 37

LEFT JOIN numbers n5 ON n4.id > n5.id

WHERE n1.n + ISNULL(n2.n, 0) + ISNULL(n3.n, 0) + ISNULL(n4.n, 0) + ISNULL(n5.n, 0) = 37;

increasing the range of n and replacing the four 37s with other targets lets you test for other combinations.

1-100 with sums of 177 yields almost a quarter million different permutations (243339)

I will throw out one significant caveat.

If your sums are small compared to the elements you are adding then the individual Join conditions will filter out many permutations very quickly.

for example

1-10 with target sum of 8

once I have n1 = 8, the first join will fail and hence kick out all of the rest

with n1 = 7 and n2 = 1, all third and following joins will fail

etc
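The 1-10 example can be checked by brute enumeration; a quick Python sanity check (not from the thread) confirms there are exactly six distinct combinations of up to five values from 1-10 that sum to 8:

```python
from itertools import combinations

# Enumerate every combination of distinct values 1..10, up to five
# values per combination, whose sum is exactly 8.
hits = [c for k in range(1, 6)
          for c in combinations(range(1, 11), k)
          if sum(c) == 8]
print(len(hits), hits)
# 6 combinations: (8,), (1, 7), (2, 6), (3, 5), (1, 2, 5), (1, 3, 4)
```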

You're right: you didn't understand what I said.

In fact, you seem to have completely ignored it, since you continue to do multiple joins solely to the original list of numbers. I explicitly stated a different approach.

nope, didn't ignore it at all. In fact, that's exactly why I wrote it the way I did.

I was simply illustrating the math.

no matter the method of generating them the number of permutations is the same.

If you have thousands of rows you could easily end up with billions of permutations.

Generate a million per second and it'll still take you half an hour and I doubt you can go that fast.

As the number of rows grows the permutations grow much, much faster.

If you have 10,000 rows you won't finish in a day.

Of course, as I already mentioned, there are data values and target sums that will allow for massive reduction in the joins needed. Even if you do this procedurally rather than with just SQL joins, the general idea is the same. The exact same methods that allow you to reduce (or not reduce) the join combinations will apply, to the benefit or detriment of the procedural methods.

So, while the problem isn't unsolvable, the practicality of doing so is dependent more on the nature of the data itself than the method for tackling it.
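For completeness (this is not any poster's method, just the standard textbook alternative): when the target can be scaled to a modest integer, say cents, a pseudo-polynomial subset-sum dynamic program bounds the work by max_items times the target value rather than by the combinatorial expansion, which makes the "nature of the data" point concrete.

```python
def dp_exists(values_cents, target_cents, max_items=5):
    """Pseudo-polynomial subset-sum check: reachable[k] holds every total
    attainable with exactly k of the (positive, integer) values. Work is
    bounded by max_items * target_cents, not by the number of subsets."""
    reachable = [set() for _ in range(max_items + 1)]
    reachable[0].add(0)
    for v in values_cents:
        # iterate k downward so each value is used at most once
        for k in range(max_items - 1, -1, -1):
            for total in reachable[k]:
                if total + v <= target_cents:
                    reachable[k + 1].add(total + v)
    return any(target_cents in reachable[k] for k in range(1, max_items + 1))

# 123.52 + 2367.00 = 2490.52, expressed in cents
print(dp_exists([12352, 1542123, 236700, 435212], 249052))  # True
```

This only answers whether some combination exists; recovering the rows themselves would take a small amount of extra bookkeeping, and the approach degrades when targets are large or values are not representable as small integers.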

NO, you did not do anything close to what I stated.

You're "illustrating math" for a different approach.

You need to read my comments again.

You're "illustrating math" for a different approach.

You need to read my comments again.

no need to be argumentative

if you really think you have something significantly different, please illustrate.

I collect algorithms. It would tickle me pink to be proven wrong.

More importantly, it would benefit the asker

Just to give us a consistent test bed, try this data:

```
Insert into NUMBERS (ID, N) Values (1, 527);
Insert into NUMBERS (ID, N) Values (2, 345);
Insert into NUMBERS (ID, N) Values (3, 197);
Insert into NUMBERS (ID, N) Values (4, 420);
Insert into NUMBERS (ID, N) Values (5, 964);
Insert into NUMBERS (ID, N) Values (6, 74);
Insert into NUMBERS (ID, N) Values (7, 104);
Insert into NUMBERS (ID, N) Values (8, 677);
Insert into NUMBERS (ID, N) Values (9, 24);
Insert into NUMBERS (ID, N) Values (10, 357);
Insert into NUMBERS (ID, N) Values (11, 40);
Insert into NUMBERS (ID, N) Values (12, 212);
Insert into NUMBERS (ID, N) Values (13, 478);
Insert into NUMBERS (ID, N) Values (14, 76);
Insert into NUMBERS (ID, N) Values (15, 244);
Insert into NUMBERS (ID, N) Values (16, 640);
Insert into NUMBERS (ID, N) Values (17, 481);
Insert into NUMBERS (ID, N) Values (18, 304);
Insert into NUMBERS (ID, N) Values (19, 91);
Insert into NUMBERS (ID, N) Values (20, 589);
Insert into NUMBERS (ID, N) Values (21, 676);
Insert into NUMBERS (ID, N) Values (22, 276);
Insert into NUMBERS (ID, N) Values (23, 369);
Insert into NUMBERS (ID, N) Values (24, 405);
Insert into NUMBERS (ID, N) Values (25, 161);
Insert into NUMBERS (ID, N) Values (26, 232);
Insert into NUMBERS (ID, N) Values (27, 356);
Insert into NUMBERS (ID, N) Values (28, 752);
Insert into NUMBERS (ID, N) Values (29, 170);
Insert into NUMBERS (ID, N) Values (30, 570);
Insert into NUMBERS (ID, N) Values (31, 775);
Insert into NUMBERS (ID, N) Values (32, 511);
Insert into NUMBERS (ID, N) Values (33, 379);
Insert into NUMBERS (ID, N) Values (34, 192);
Insert into NUMBERS (ID, N) Values (35, 418);
Insert into NUMBERS (ID, N) Values (36, 647);
Insert into NUMBERS (ID, N) Values (37, 592);
Insert into NUMBERS (ID, N) Values (38, 98);
Insert into NUMBERS (ID, N) Values (39, 35);
Insert into NUMBERS (ID, N) Values (40, 711);
Insert into NUMBERS (ID, N) Values (41, 45);
Insert into NUMBERS (ID, N) Values (42, 101);
Insert into NUMBERS (ID, N) Values (43, 980);
Insert into NUMBERS (ID, N) Values (44, 340);
Insert into NUMBERS (ID, N) Values (45, 69);
Insert into NUMBERS (ID, N) Values (46, 262);
Insert into NUMBERS (ID, N) Values (47, 552);
Insert into NUMBERS (ID, N) Values (48, 631);
Insert into NUMBERS (ID, N) Values (49, 459);
Insert into NUMBERS (ID, N) Values (50, 280);
Insert into NUMBERS (ID, N) Values (51, 501);
Insert into NUMBERS (ID, N) Values (52, 73);
Insert into NUMBERS (ID, N) Values (53, 961);
Insert into NUMBERS (ID, N) Values (54, 849);
Insert into NUMBERS (ID, N) Values (55, 737);
Insert into NUMBERS (ID, N) Values (56, 687);
Insert into NUMBERS (ID, N) Values (57, 576);
Insert into NUMBERS (ID, N) Values (58, 552);
Insert into NUMBERS (ID, N) Values (59, 172);
Insert into NUMBERS (ID, N) Values (60, 133);
Insert into NUMBERS (ID, N) Values (61, 643);
Insert into NUMBERS (ID, N) Values (62, 878);
Insert into NUMBERS (ID, N) Values (63, 188);
Insert into NUMBERS (ID, N) Values (64, 402);
Insert into NUMBERS (ID, N) Values (65, 548);
Insert into NUMBERS (ID, N) Values (66, 457);
Insert into NUMBERS (ID, N) Values (67, 858);
Insert into NUMBERS (ID, N) Values (68, 70);
Insert into NUMBERS (ID, N) Values (69, 172);
Insert into NUMBERS (ID, N) Values (70, 278);
Insert into NUMBERS (ID, N) Values (71, 351);
Insert into NUMBERS (ID, N) Values (72, 525);
Insert into NUMBERS (ID, N) Values (73, 404);
Insert into NUMBERS (ID, N) Values (74, 71);
Insert into NUMBERS (ID, N) Values (75, 928);
Insert into NUMBERS (ID, N) Values (76, 744);
Insert into NUMBERS (ID, N) Values (77, 537);
Insert into NUMBERS (ID, N) Values (78, 920);
Insert into NUMBERS (ID, N) Values (79, 36);
Insert into NUMBERS (ID, N) Values (80, 619);
Insert into NUMBERS (ID, N) Values (81, 510);
Insert into NUMBERS (ID, N) Values (82, 856);
Insert into NUMBERS (ID, N) Values (83, 825);
Insert into NUMBERS (ID, N) Values (84, 131);
Insert into NUMBERS (ID, N) Values (85, 419);
Insert into NUMBERS (ID, N) Values (86, 512);
Insert into NUMBERS (ID, N) Values (87, 197);
Insert into NUMBERS (ID, N) Values (88, 954);
Insert into NUMBERS (ID, N) Values (89, 743);
Insert into NUMBERS (ID, N) Values (90, 589);
Insert into NUMBERS (ID, N) Values (91, 779);
Insert into NUMBERS (ID, N) Values (92, 235);
Insert into NUMBERS (ID, N) Values (93, 975);
Insert into NUMBERS (ID, N) Values (94, 221);
Insert into NUMBERS (ID, N) Values (95, 113);
Insert into NUMBERS (ID, N) Values (96, 592);
Insert into NUMBERS (ID, N) Values (97, 558);
Insert into NUMBERS (ID, N) Values (98, 838);
Insert into NUMBERS (ID, N) Values (99, 409);
Insert into NUMBERS (ID, N) Values (100, 240);
Insert into NUMBERS (ID, N) Values (101, 539);
Insert into NUMBERS (ID, N) Values (102, 902);
Insert into NUMBERS (ID, N) Values (103, 157);
Insert into NUMBERS (ID, N) Values (104, 862);
Insert into NUMBERS (ID, N) Values (105, 350);
Insert into NUMBERS (ID, N) Values (106, 429);
Insert into NUMBERS (ID, N) Values (107, 816);
Insert into NUMBERS (ID, N) Values (108, 498);
Insert into NUMBERS (ID, N) Values (109, 769);
Insert into NUMBERS (ID, N) Values (110, 772);
Insert into NUMBERS (ID, N) Values (111, 306);
Insert into NUMBERS (ID, N) Values (112, 86);
Insert into NUMBERS (ID, N) Values (113, 407);
Insert into NUMBERS (ID, N) Values (114, 996);
Insert into NUMBERS (ID, N) Values (115, 263);
Insert into NUMBERS (ID, N) Values (116, 62);
Insert into NUMBERS (ID, N) Values (117, 390);
Insert into NUMBERS (ID, N) Values (118, 384);
Insert into NUMBERS (ID, N) Values (119, 357);
Insert into NUMBERS (ID, N) Values (120, 291);
Insert into NUMBERS (ID, N) Values (121, 47);
Insert into NUMBERS (ID, N) Values (122, 637);
Insert into NUMBERS (ID, N) Values (123, 304);
Insert into NUMBERS (ID, N) Values (124, 147);
Insert into NUMBERS (ID, N) Values (125, 498);
Insert into NUMBERS (ID, N) Values (126, 463);
Insert into NUMBERS (ID, N) Values (127, 118);
Insert into NUMBERS (ID, N) Values (128, 961);
Insert into NUMBERS (ID, N) Values (129, 908);
Insert into NUMBERS (ID, N) Values (130, 338);
Insert into NUMBERS (ID, N) Values (131, 983);
Insert into NUMBERS (ID, N) Values (132, 75);
Insert into NUMBERS (ID, N) Values (133, 822);
Insert into NUMBERS (ID, N) Values (134, 192);
Insert into NUMBERS (ID, N) Values (135, 481);
Insert into NUMBERS (ID, N) Values (136, 859);
Insert into NUMBERS (ID, N) Values (137, 285);
Insert into NUMBERS (ID, N) Values (138, 641);
Insert into NUMBERS (ID, N) Values (139, 628);
Insert into NUMBERS (ID, N) Values (140, 188);
Insert into NUMBERS (ID, N) Values (141, 284);
Insert into NUMBERS (ID, N) Values (142, 502);
Insert into NUMBERS (ID, N) Values (143, 40);
Insert into NUMBERS (ID, N) Values (144, 150);
Insert into NUMBERS (ID, N) Values (145, 585);
Insert into NUMBERS (ID, N) Values (146, 43);
Insert into NUMBERS (ID, N) Values (147, 297);
Insert into NUMBERS (ID, N) Values (148, 364);
Insert into NUMBERS (ID, N) Values (149, 604);
Insert into NUMBERS (ID, N) Values (150, 468);
Insert into NUMBERS (ID, N) Values (151, 882);
Insert into NUMBERS (ID, N) Values (152, 604);
Insert into NUMBERS (ID, N) Values (153, 475);
Insert into NUMBERS (ID, N) Values (154, 712);
Insert into NUMBERS (ID, N) Values (155, 386);
Insert into NUMBERS (ID, N) Values (156, 37);
Insert into NUMBERS (ID, N) Values (157, 364);
Insert into NUMBERS (ID, N) Values (158, 274);
Insert into NUMBERS (ID, N) Values (159, 823);
Insert into NUMBERS (ID, N) Values (160, 258);
Insert into NUMBERS (ID, N) Values (161, 766);
Insert into NUMBERS (ID, N) Values (162, 800);
Insert into NUMBERS (ID, N) Values (163, 572);
Insert into NUMBERS (ID, N) Values (164, 591);
Insert into NUMBERS (ID, N) Values (165, 963);
Insert into NUMBERS (ID, N) Values (166, 785);
Insert into NUMBERS (ID, N) Values (167, 944);
Insert into NUMBERS (ID, N) Values (168, 691);
Insert into NUMBERS (ID, N) Values (169, 637);
Insert into NUMBERS (ID, N) Values (170, 890);
Insert into NUMBERS (ID, N) Values (171, 249);
Insert into NUMBERS (ID, N) Values (172, 672);
Insert into NUMBERS (ID, N) Values (173, 885);
Insert into NUMBERS (ID, N) Values (174, 396);
Insert into NUMBERS (ID, N) Values (175, 439);
Insert into NUMBERS (ID, N) Values (176, 631);
Insert into NUMBERS (ID, N) Values (177, 679);
Insert into NUMBERS (ID, N) Values (178, 600);
Insert into NUMBERS (ID, N) Values (179, 510);
Insert into NUMBERS (ID, N) Values (180, 103);
Insert into NUMBERS (ID, N) Values (181, 930);
Insert into NUMBERS (ID, N) Values (182, 999);
Insert into NUMBERS (ID, N) Values (183, 565);
Insert into NUMBERS (ID, N) Values (184, 384);
Insert into NUMBERS (ID, N) Values (185, 50);
Insert into NUMBERS (ID, N) Values (186, 370);
Insert into NUMBERS (ID, N) Values (187, 111);
Insert into NUMBERS (ID, N) Values (188, 186);
Insert into NUMBERS (ID, N) Values (189, 464);
Insert into NUMBERS (ID, N) Values (190, 304);
Insert into NUMBERS (ID, N) Values (191, 118);
Insert into NUMBERS (ID, N) Values (192, 51);
Insert into NUMBERS (ID, N) Values (193, 441);
Insert into NUMBERS (ID, N) Values (194, 200);
Insert into NUMBERS (ID, N) Values (195, 778);
Insert into NUMBERS (ID, N) Values (196, 246);
Insert into NUMBERS (ID, N) Values (197, 287);
Insert into NUMBERS (ID, N) Values (198, 984);
Insert into NUMBERS (ID, N) Values (199, 840);
Insert into NUMBERS (ID, N) Values (200, 222);
Insert into NUMBERS (ID, N) Values (201, 932);
Insert into NUMBERS (ID, N) Values (202, 545);
Insert into NUMBERS (ID, N) Values (203, 36);
Insert into NUMBERS (ID, N) Values (204, 489);
Insert into NUMBERS (ID, N) Values (205, 863);
Insert into NUMBERS (ID, N) Values (206, 321);
Insert into NUMBERS (ID, N) Values (207, 235);
Insert into NUMBERS (ID, N) Values (208, 154);
Insert into NUMBERS (ID, N) Values (209, 312);
Insert into NUMBERS (ID, N) Values (210, 896);
Insert into NUMBERS (ID, N) Values (211, 138);
Insert into NUMBERS (ID, N) Values (212, 294);
Insert into NUMBERS (ID, N) Values (213, 273);
Insert into NUMBERS (ID, N) Values (214, 388);
Insert into NUMBERS (ID, N) Values (215, 641);
Insert into NUMBERS (ID, N) Values (216, 815);
Insert into NUMBERS (ID, N) Values (217, 170);
Insert into NUMBERS (ID, N) Values (218, 682);
Insert into NUMBERS (ID, N) Values (219, 776);
Insert into NUMBERS (ID, N) Values (220, 427);
Insert into NUMBERS (ID, N) Values (221, 90);
Insert into NUMBERS (ID, N) Values (222, 62);
Insert into NUMBERS (ID, N) Values (223, 741);
Insert into NUMBERS (ID, N) Values (224, 78);
Insert into NUMBERS (ID, N) Values (225, 90);
Insert into NUMBERS (ID, N) Values (226, 27);
Insert into NUMBERS (ID, N) Values (227, 917);
Insert into NUMBERS (ID, N) Values (228, 172);
Insert into NUMBERS (ID, N) Values (229, 380);
Insert into NUMBERS (ID, N) Values (230, 279);
Insert into NUMBERS (ID, N) Values (231, 852);
Insert into NUMBERS (ID, N) Values (232, 611);
Insert into NUMBERS (ID, N) Values (233, 145);
Insert into NUMBERS (ID, N) Values (234, 545);
Insert into NUMBERS (ID, N) Values (235, 590);
Insert into NUMBERS (ID, N) Values (236, 793);
Insert into NUMBERS (ID, N) Values (237, 320);
Insert into NUMBERS (ID, N) Values (238, 798);
Insert into NUMBERS (ID, N) Values (239, 537);
Insert into NUMBERS (ID, N) Values (240, 361);
Insert into NUMBERS (ID, N) Values (241, 265);
Insert into NUMBERS (ID, N) Values (242, 248);
Insert into NUMBERS (ID, N) Values (243, 480);
Insert into NUMBERS (ID, N) Values (244, 736);
Insert into NUMBERS (ID, N) Values (245, 691);
Insert into NUMBERS (ID, N) Values (246, 758);
Insert into NUMBERS (ID, N) Values (247, 864);
Insert into NUMBERS (ID, N) Values (248, 609);
Insert into NUMBERS (ID, N) Values (249, 881);
Insert into NUMBERS (ID, N) Values (250, 554);
Insert into NUMBERS (ID, N) Values (251, 672);
Insert into NUMBERS (ID, N) Values (252, 376);
Insert into NUMBERS (ID, N) Values (253, 46);
Insert into NUMBERS (ID, N) Values (254, 580);
Insert into NUMBERS (ID, N) Values (255, 299);
Insert into NUMBERS (ID, N) Values (256, 21);
Insert into NUMBERS (ID, N) Values (257, 571);
Insert into NUMBERS (ID, N) Values (258, 953);
Insert into NUMBERS (ID, N) Values (259, 660);
Insert into NUMBERS (ID, N) Values (260, 243);
Insert into NUMBERS (ID, N) Values (261, 599);
Insert into NUMBERS (ID, N) Values (262, 87);
Insert into NUMBERS (ID, N) Values (263, 764);
Insert into NUMBERS (ID, N) Values (264, 455);
Insert into NUMBERS (ID, N) Values (265, 441);
Insert into NUMBERS (ID, N) Values (266, 728);
Insert into NUMBERS (ID, N) Values (267, 88);
Insert into NUMBERS (ID, N) Values (268, 592);
Insert into NUMBERS (ID, N) Values (269, 187);
Insert into NUMBERS (ID, N) Values (270, 178);
Insert into NUMBERS (ID, N) Values (271, 175);
Insert into NUMBERS (ID, N) Values (272, 434);
Insert into NUMBERS (ID, N) Values (273, 929);
Insert into NUMBERS (ID, N) Values (274, 256);
Insert into NUMBERS (ID, N) Values (275, 163);
Insert into NUMBERS (ID, N) Values (276, 780);
Insert into NUMBERS (ID, N) Values (277, 818);
Insert into NUMBERS (ID, N) Values (278, 605);
Insert into NUMBERS (ID, N) Values (279, 686);
Insert into NUMBERS (ID, N) Values (280, 969);
Insert into NUMBERS (ID, N) Values (281, 581);
Insert into NUMBERS (ID, N) Values (282, 590);
Insert into NUMBERS (ID, N) Values (283, 547);
Insert into NUMBERS (ID, N) Values (284, 424);
Insert into NUMBERS (ID, N) Values (285, 858);
Insert into NUMBERS (ID, N) Values (286, 152);
Insert into NUMBERS (ID, N) Values (287, 631);
Insert into NUMBERS (ID, N) Values (288, 715);
Insert into NUMBERS (ID, N) Values (289, 498);
Insert into NUMBERS (ID, N) Values (290, 250);
Insert into NUMBERS (ID, N) Values (291, 36);
Insert into NUMBERS (ID, N) Values (292, 918);
Insert into NUMBERS (ID, N) Values (293, 884);
Insert into NUMBERS (ID, N) Values (294, 301);
Insert into NUMBERS (ID, N) Values (295, 815);
Insert into NUMBERS (ID, N) Values (296, 705);
Insert into NUMBERS (ID, N) Values (297, 975);
Insert into NUMBERS (ID, N) Values (298, 566);
Insert into NUMBERS (ID, N) Values (299, 328);
Insert into NUMBERS (ID, N) Values (300, 877);
Insert into NUMBERS (ID, N) Values (301, 935);
Insert into NUMBERS (ID, N) Values (302, 40);
Insert into NUMBERS (ID, N) Values (303, 44);
Insert into NUMBERS (ID, N) Values (304, 810);
Insert into NUMBERS (ID, N) Values (305, 809);
Insert into NUMBERS (ID, N) Values (306, 834);
Insert into NUMBERS (ID, N) Values (307, 156);
Insert into NUMBERS (ID, N) Values (308, 863);
Insert into NUMBERS (ID, N) Values (309, 185);
Insert into NUMBERS (ID, N) Values (310, 985);
Insert into NUMBERS (ID, N) Values (311, 990);
Insert into NUMBERS (ID, N) Values (312, 152);
Insert into NUMBERS (ID, N) Values (313, 542);
Insert into NUMBERS (ID, N) Values (314, 206);
Insert into NUMBERS (ID, N) Values (315, 666);
Insert into NUMBERS (ID, N) Values (316, 457);
Insert into NUMBERS (ID, N) Values (317, 238);
Insert into NUMBERS (ID, N) Values (318, 395);
Insert into NUMBERS (ID, N) Values (319, 170);
Insert into NUMBERS (ID, N) Values (320, 938);
Insert into NUMBERS (ID, N) Values (321, 977);
Insert into NUMBERS (ID, N) Values (322, 122);
Insert into NUMBERS (ID, N) Values (323, 510);
Insert into NUMBERS (ID, N) Values (324, 71);
Insert into NUMBERS (ID, N) Values (325, 478);
Insert into NUMBERS (ID, N) Values (326, 989);
Insert into NUMBERS (ID, N) Values (327, 139);
Insert into NUMBERS (ID, N) Values (328, 904);
Insert into NUMBERS (ID, N) Values (329, 822);
Insert into NUMBERS (ID, N) Values (330, 489);
Insert into NUMBERS (ID, N) Values (331, 657);
Insert into NUMBERS (ID, N) Values (332, 753);
Insert into NUMBERS (ID, N) Values (333, 644);
Insert into NUMBERS (ID, N) Values (334, 729);
Insert into NUMBERS (ID, N) Values (335, 779);
Insert into NUMBERS (ID, N) Values (336, 390);
Insert into NUMBERS (ID, N) Values (337, 424);
Insert into NUMBERS (ID, N) Values (338, 701);
Insert into NUMBERS (ID, N) Values (339, 287);
Insert into NUMBERS (ID, N) Values (340, 43);
Insert into NUMBERS (ID, N) Values (341, 136);
Insert into NUMBERS (ID, N) Values (342, 621);
Insert into NUMBERS (ID, N) Values (343, 866);
Insert into NUMBERS (ID, N) Values (344, 41);
Insert into NUMBERS (ID, N) Values (345, 456);
Insert into NUMBERS (ID, N) Values (346, 701);
Insert into NUMBERS (ID, N) Values (347, 375);
Insert into NUMBERS (ID, N) Values (348, 122);
Insert into NUMBERS (ID, N) Values (349, 694);
Insert into NUMBERS (ID, N) Values (350, 984);
Insert into NUMBERS (ID, N) Values (351, 644);
Insert into NUMBERS (ID, N) Values (352, 952);
Insert into NUMBERS (ID, N) Values (353, 688);
Insert into NUMBERS (ID, N) Values (354, 836);
Insert into NUMBERS (ID, N) Values (355, 947);
Insert into NUMBERS (ID, N) Values (356, 412);
Insert into NUMBERS (ID, N) Values (357, 29);
Insert into NUMBERS (ID, N) Values (358, 182);
Insert into NUMBERS (ID, N) Values (359, 714);
Insert into NUMBERS (ID, N) Values (360, 631);
Insert into NUMBERS (ID, N) Values (361, 323);
Insert into NUMBERS (ID, N) Values (362, 811);
Insert into NUMBERS (ID, N) Values (363, 616);
Insert into NUMBERS (ID, N) Values (364, 828);
Insert into NUMBERS (ID, N) Values (365, 713);
Insert into NUMBERS (ID, N) Values (366, 769);
Insert into NUMBERS (ID, N) Values (367, 541);
Insert into NUMBERS (ID, N) Values (368, 965);
Insert into NUMBERS (ID, N) Values (369, 907);
Insert into NUMBERS (ID, N) Values (370, 952);
Insert into NUMBERS (ID, N) Values (371, 499);
Insert into NUMBERS (ID, N) Values (372, 373);
Insert into NUMBERS (ID, N) Values (373, 15);
Insert into NUMBERS (ID, N) Values (374, 37);
Insert into NUMBERS (ID, N) Values (375, 978);
Insert into NUMBERS (ID, N) Values (376, 432);
Insert into NUMBERS (ID, N) Values (377, 822);
Insert into NUMBERS (ID, N) Values (378, 884);
Insert into NUMBERS (ID, N) Values (379, 192);
Insert into NUMBERS (ID, N) Values (380, 172);
Insert into NUMBERS (ID, N) Values (381, 973);
Insert into NUMBERS (ID, N) Values (382, 782);
Insert into NUMBERS (ID, N) Values (383, 857);
Insert into NUMBERS (ID, N) Values (384, 509);
Insert into NUMBERS (ID, N) Values (385, 326);
Insert into NUMBERS (ID, N) Values (386, 603);
Insert into NUMBERS (ID, N) Values (387, 165);
Insert into NUMBERS (ID, N) Values (388, 671);
Insert into NUMBERS (ID, N) Values (389, 910);
Insert into NUMBERS (ID, N) Values (390, 494);
Insert into NUMBERS (ID, N) Values (391, 21);
Insert into NUMBERS (ID, N) Values (392, 746);
Insert into NUMBERS (ID, N) Values (393, 513);
Insert into NUMBERS (ID, N) Values (394, 902);
Insert into NUMBERS (ID, N) Values (395, 870);
Insert into NUMBERS (ID, N) Values (396, 849);
Insert into NUMBERS (ID, N) Values (397, 390);
Insert into NUMBERS (ID, N) Values (398, 408);
Insert into NUMBERS (ID, N) Values (399, 6);
Insert into NUMBERS (ID, N) Values (400, 362);
Insert into NUMBERS (ID, N) Values (401, 652);
Insert into NUMBERS (ID, N) Values (402, 873);
Insert into NUMBERS (ID, N) Values (403, 494);
Insert into NUMBERS (ID, N) Values (404, 709);
Insert into NUMBERS (ID, N) Values (405, 21);
Insert into NUMBERS (ID, N) Values (406, 622);
Insert into NUMBERS (ID, N) Values (407, 384);
Insert into NUMBERS (ID, N) Values (408, 510);
Insert into NUMBERS (ID, N) Values (409, 720);
Insert into NUMBERS (ID, N) Values (410, 139);
Insert into NUMBERS (ID, N) Values (411, 583);
Insert into NUMBERS (ID, N) Values (412, 2);
Insert into NUMBERS (ID, N) Values (413, 963);
Insert into NUMBERS (ID, N) Values (414, 571);
Insert into NUMBERS (ID, N) Values (415, 140);
Insert into NUMBERS (ID, N) Values (416, 648);
Insert into NUMBERS (ID, N) Values (417, 415);
Insert into NUMBERS (ID, N) Values (418, 780);
Insert into NUMBERS (ID, N) Values (419, 499);
Insert into NUMBERS (ID, N) Values (420, 623);
Insert into NUMBERS (ID, N) Values (421, 263);
Insert into NUMBERS (ID, N) Values (422, 562);
Insert into NUMBERS (ID, N) Values (423, 711);
Insert into NUMBERS (ID, N) Values (424, 420);
Insert into NUMBERS (ID, N) Values (425, 854);
Insert into NUMBERS (ID, N) Values (426, 369);
Insert into NUMBERS (ID, N) Values (427, 221);
Insert into NUMBERS (ID, N) Values (428, 404);
Insert into NUMBERS (ID, N) Values (429, 443);
Insert into NUMBERS (ID, N) Values (430, 983);
Insert into NUMBERS (ID, N) Values (431, 794);
Insert into NUMBERS (ID, N) Values (432, 474);
Insert into NUMBERS (ID, N) Values (433, 757);
Insert into NUMBERS (ID, N) Values (434, 685);
Insert into NUMBERS (ID, N) Values (435, 880);
Insert into NUMBERS (ID, N) Values (436, 993);
Insert into NUMBERS (ID, N) Values (437, 404);
Insert into NUMBERS (ID, N) Values (438, 241);
Insert into NUMBERS (ID, N) Values (439, 19);
Insert into NUMBERS (ID, N) Values (440, 46);
Insert into NUMBERS (ID, N) Values (441, 741);
Insert into NUMBERS (ID, N) Values (442, 747);
Insert into NUMBERS (ID, N) Values (443, 672);
Insert into NUMBERS (ID, N) Values (444, 873);
Insert into NUMBERS (ID, N) Values (445, 64);
Insert into NUMBERS (ID, N) Values (446, 160);
Insert into NUMBERS (ID, N) Values (447, 394);
Insert into NUMBERS (ID, N) Values (448, 927);
Insert into NUMBERS (ID, N) Values (449, 682);
Insert into NUMBERS (ID, N) Values (450, 369);
Insert into NUMBERS (ID, N) Values (451, 472);
Insert into NUMBERS (ID, N) Values (452, 651);
Insert into NUMBERS (ID, N) Values (453, 968);
Insert into NUMBERS (ID, N) Values (454, 716);
Insert into NUMBERS (ID, N) Values (455, 781);
Insert into NUMBERS (ID, N) Values (456, 506);
Insert into NUMBERS (ID, N) Values (457, 242);
Insert into NUMBERS (ID, N) Values (458, 714);
Insert into NUMBERS (ID, N) Values (459, 113);
Insert into NUMBERS (ID, N) Values (460, 463);
Insert into NUMBERS (ID, N) Values (461, 605);
Insert into NUMBERS (ID, N) Values (462, 178);
Insert into NUMBERS (ID, N) Values (463, 983);
Insert into NUMBERS (ID, N) Values (464, 477);
Insert into NUMBERS (ID, N) Values (465, 822);
Insert into NUMBERS (ID, N) Values (466, 463);
Insert into NUMBERS (ID, N) Values (467, 993);
Insert into NUMBERS (ID, N) Values (468, 367);
Insert into NUMBERS (ID, N) Values (469, 811);
Insert into NUMBERS (ID, N) Values (470, 159);
Insert into NUMBERS (ID, N) Values (471, 693);
Insert into NUMBERS (ID, N) Values (472, 156);
Insert into NUMBERS (ID, N) Values (473, 527);
Insert into NUMBERS (ID, N) Values (474, 171);
Insert into NUMBERS (ID, N) Values (475, 496);
Insert into NUMBERS (ID, N) Values (476, 326);
Insert into NUMBERS (ID, N) Values (477, 721);
Insert into NUMBERS (ID, N) Values (478, 106);
Insert into NUMBERS (ID, N) Values (479, 346);
Insert into NUMBERS (ID, N) Values (480, 535);
Insert into NUMBERS (ID, N) Values (481, 737);
Insert into NUMBERS (ID, N) Values (482, 692);
Insert into NUMBERS (ID, N) Values (483, 55);
Insert into NUMBERS (ID, N) Values (484, 412);
Insert into NUMBERS (ID, N) Values (485, 699);
Insert into NUMBERS (ID, N) Values (486, 574);
Insert into NUMBERS (ID, N) Values (487, 979);
Insert into NUMBERS (ID, N) Values (488, 999);
Insert into NUMBERS (ID, N) Values (489, 399);
Insert into NUMBERS (ID, N) Values (490, 992);
Insert into NUMBERS (ID, N) Values (491, 456);
Insert into NUMBERS (ID, N) Values (492, 8);
Insert into NUMBERS (ID, N) Values (493, 418);
Insert into NUMBERS (ID, N) Values (494, 3);
Insert into NUMBERS (ID, N) Values (495, 523);
Insert into NUMBERS (ID, N) Values (496, 563);
Insert into NUMBERS (ID, N) Values (497, 211);
Insert into NUMBERS (ID, N) Values (498, 666);
Insert into NUMBERS (ID, N) Values (499, 239);
Insert into NUMBERS (ID, N) Values (500, 874);
Insert into NUMBERS (ID, N) Values (501, 318);
Insert into NUMBERS (ID, N) Values (502, 88);
Insert into NUMBERS (ID, N) Values (503, 83);
Insert into NUMBERS (ID, N) Values (504, 209);
Insert into NUMBERS (ID, N) Values (505, 539);
Insert into NUMBERS (ID, N) Values (506, 967);
Insert into NUMBERS (ID, N) Values (507, 976);
Insert into NUMBERS (ID, N) Values (508, 689);
Insert into NUMBERS (ID, N) Values (509, 821);
Insert into NUMBERS (ID, N) Values (510, 127);
Insert into NUMBERS (ID, N) Values (511, 41);
Insert into NUMBERS (ID, N) Values (512, 979);
Insert into NUMBERS (ID, N) Values (513, 406);
Insert into NUMBERS (ID, N) Values (514, 167);
Insert into NUMBERS (ID, N) Values (515, 873);
Insert into NUMBERS (ID, N) Values (516, 304);
Insert into NUMBERS (ID, N) Values (517, 752);
Insert into NUMBERS (ID, N) Values (518, 963);
Insert into NUMBERS (ID, N) Values (519, 476);
Insert into NUMBERS (ID, N) Values (520, 221);
Insert into NUMBERS (ID, N) Values (521, 455);
Insert into NUMBERS (ID, N) Values (522, 449);
Insert into NUMBERS (ID, N) Values (523, 374);
Insert into NUMBERS (ID, N) Values (524, 229);
Insert into NUMBERS (ID, N) Values (525, 161);
Insert into NUMBERS (ID, N) Values (526, 216);
Insert into NUMBERS (ID, N) Values (527, 718);
Insert into NUMBERS (ID, N) Values (528, 737);
Insert into NUMBERS (ID, N) Values (529, 836);
Insert into NUMBERS (ID, N) Values (530, 734);
Insert into NUMBERS (ID, N) Values (531, 200);
Insert into NUMBERS (ID, N) Values (532, 39);
Insert into NUMBERS (ID, N) Values (533, 193);
Insert into NUMBERS (ID, N) Values (534, 428);
Insert into NUMBERS (ID, N) Values (535, 744);
Insert into NUMBERS (ID, N) Values (536, 277);
Insert into NUMBERS (ID, N) Values (537, 659);
Insert into NUMBERS (ID, N) Values (538, 31);
Insert into NUMBERS (ID, N) Values (539, 101);
Insert into NUMBERS (ID, N) Values (540, 520);
Insert into NUMBERS (ID, N) Values (541, 701);
Insert into NUMBERS (ID, N) Values (542, 21);
Insert into NUMBERS (ID, N) Values (543, 977);
Insert into NUMBERS (ID, N) Values (544, 804);
Insert into NUMBERS (ID, N) Values (545, 159);
Insert into NUMBERS (ID, N) Values (546, 329);
Insert into NUMBERS (ID, N) Values (547, 311);
Insert into NUMBERS (ID, N) Values (548, 170);
Insert into NUMBERS (ID, N) Values (549, 965);
Insert into NUMBERS (ID, N) Values (550, 998);
Insert into NUMBERS (ID, N) Values (551, 783);
Insert into NUMBERS (ID, N) Values (552, 664);
Insert into NUMBERS (ID, N) Values (553, 115);
Insert into NUMBERS (ID, N) Values (554, 612);
Insert into NUMBERS (ID, N) Values (555, 103);
Insert into NUMBERS (ID, N) Values (556, 478);
Insert into NUMBERS (ID, N) Values (557, 303);
Insert into NUMBERS (ID, N) Values (558, 800);
Insert into NUMBERS (ID, N) Values (559, 945);
Insert into NUMBERS (ID, N) Values (560, 375);
Insert into NUMBERS (ID, N) Values (561, 701);
Insert into NUMBERS (ID, N) Values (562, 176);
Insert into NUMBERS (ID, N) Values (563, 728);
Insert into NUMBERS (ID, N) Values (564, 13);
Insert into NUMBERS (ID, N) Values (565, 554);
Insert into NUMBERS (ID, N) Values (566, 784);
Insert into NUMBERS (ID, N) Values (567, 256);
Insert into NUMBERS (ID, N) Values (568, 66);
Insert into NUMBERS (ID, N) Values (569, 197);
Insert into NUMBERS (ID, N) Values (570, 973);
Insert into NUMBERS (ID, N) Values (571, 822);
Insert into NUMBERS (ID, N) Values (572, 452);
Insert into NUMBERS (ID, N) Values (573, 982);
Insert into NUMBERS (ID, N) Values (574, 453);
Insert into NUMBERS (ID, N) Values (575, 25);
Insert into NUMBERS (ID, N) Values (576, 612);
Insert into NUMBERS (ID, N) Values (577, 777);
Insert into NUMBERS (ID, N) Values (578, 684);
Insert into NUMBERS (ID, N) Values (579, 398);
Insert into NUMBERS (ID, N) Values (580, 125);
Insert into NUMBERS (ID, N) Values (581, 214);
Insert into NUMBERS (ID, N) Values (582, 502);
Insert into NUMBERS (ID, N) Values (583, 401);
Insert into NUMBERS (ID, N) Values (584, 949);
Insert into NUMBERS (ID, N) Values (585, 347);
Insert into NUMBERS (ID, N) Values (586, 302);
Insert into NUMBERS (ID, N) Values (587, 516);
Insert into NUMBERS (ID, N) Values (588, 494);
Insert into NUMBERS (ID, N) Values (589, 228);
Insert into NUMBERS (ID, N) Values (590, 689);
Insert into NUMBERS (ID, N) Values (591, 651);
Insert into NUMBERS (ID, N) Values (592, 361);
Insert into NUMBERS (ID, N) Values (593, 207);
Insert into NUMBERS (ID, N) Values (594, 828);
Insert into NUMBERS (ID, N) Values (595, 532);
Insert into NUMBERS (ID, N) Values (596, 255);
Insert into NUMBERS (ID, N) Values (597, 804);
Insert into NUMBERS (ID, N) Values (598, 233);
Insert into NUMBERS (ID, N) Values (599, 869);
Insert into NUMBERS (ID, N) Values (600, 355);
Insert into NUMBERS (ID, N) Values (601, 302);
Insert into NUMBERS (ID, N) Values (602, 133);
Insert into NUMBERS (ID, N) Values (603, 621);
Insert into NUMBERS (ID, N) Values (604, 947);
Insert into NUMBERS (ID, N) Values (605, 451);
Insert into NUMBERS (ID, N) Values (606, 807);
Insert into NUMBERS (ID, N) Values (607, 277);
Insert into NUMBERS (ID, N) Values (608, 891);
Insert into NUMBERS (ID, N) Values (609, 296);
Insert into NUMBERS (ID, N) Values (610, 500);
Insert into NUMBERS (ID, N) Values (611, 602);
Insert into NUMBERS (ID, N) Values (612, 515);
Insert into NUMBERS (ID, N) Values (613, 302);
Insert into NUMBERS (ID, N) Values (614, 346);
Insert into NUMBERS (ID, N) Values (615, 325);
Insert into NUMBERS (ID, N) Values (616, 48);
Insert into NUMBERS (ID, N) Values (617, 477);
Insert into NUMBERS (ID, N) Values (618, 244);
Insert into NUMBERS (ID, N) Values (619, 507);
Insert into NUMBERS (ID, N) Values (620, 781);
Insert into NUMBERS (ID, N) Values (621, 473);
Insert into NUMBERS (ID, N) Values (622, 906);
Insert into NUMBERS (ID, N) Values (623, 425);
Insert into NUMBERS (ID, N) Values (624, 403);
Insert into NUMBERS (ID, N) Values (625, 801);
Insert into NUMBERS (ID, N) Values (626, 355);
Insert into NUMBERS (ID, N) Values (627, 706);
Insert into NUMBERS (ID, N) Values (628, 786);
Insert into NUMBERS (ID, N) Values (629, 685);
Insert into NUMBERS (ID, N) Values (630, 892);
Insert into NUMBERS (ID, N) Values (631, 967);
Insert into NUMBERS (ID, N) Values (632, 79);
Insert into NUMBERS (ID, N) Values (633, 816);
Insert into NUMBERS (ID, N) Values (634, 19);
Insert into NUMBERS (ID, N) Values (635, 72);
Insert into NUMBERS (ID, N) Values (636, 663);
Insert into NUMBERS (ID, N) Values (637, 309);
Insert into NUMBERS (ID, N) Values (638, 677);
Insert into NUMBERS (ID, N) Values (639, 840);
Insert into NUMBERS (ID, N) Values (640, 642);
Insert into NUMBERS (ID, N) Values (641, 801);
Insert into NUMBERS (ID, N) Values (642, 119);
Insert into NUMBERS (ID, N) Values (643, 10);
Insert into NUMBERS (ID, N) Values (644, 529);
Insert into NUMBERS (ID, N) Values (645, 36);
Insert into NUMBERS (ID, N) Values (646, 975);
Insert into NUMBERS (ID, N) Values (647, 408);
Insert into NUMBERS (ID, N) Values (648, 683);
Insert into NUMBERS (ID, N) Values (649, 71);
Insert into NUMBERS (ID, N) Values (650, 39);
Insert into NUMBERS (ID, N) Values (651, 36);
Insert into NUMBERS (ID, N) Values (652, 277);
Insert into NUMBERS (ID, N) Values (653, 139);
Insert into NUMBERS (ID, N) Values (654, 294);
Insert into NUMBERS (ID, N) Values (655, 757);
Insert into NUMBERS (ID, N) Values (656, 103);
Insert into NUMBERS (ID, N) Values (657, 487);
Insert into NUMBERS (ID, N) Values (658, 327);
Insert into NUMBERS (ID, N) Values (659, 733);
Insert into NUMBERS (ID, N) Values (660, 136);
Insert into NUMBERS (ID, N) Values (661, 700);
Insert into NUMBERS (ID, N) Values (662, 243);
Insert into NUMBERS (ID, N) Values (663, 969);
Insert into NUMBERS (ID, N) Values (664, 112);
Insert into NUMBERS (ID, N) Values (665, 518);
Insert into NUMBERS (ID, N) Values (666, 674);
Insert into NUMBERS (ID, N) Values (667, 179);
Insert into NUMBERS (ID, N) Values (668, 610);
Insert into NUMBERS (ID, N) Values (669, 23);
Insert into NUMBERS (ID, N) Values (670, 165);
Insert into NUMBERS (ID, N) Values (671, 689);
Insert into NUMBERS (ID, N) Values (672, 279);
Insert into NUMBERS (ID, N) Values (673, 361);
Insert into NUMBERS (ID, N) Values (674, 516);
Insert into NUMBERS (ID, N) Values (675, 311);
Insert into NUMBERS (ID, N) Values (676, 508);
Insert into NUMBERS (ID, N) Values (677, 881);
Insert into NUMBERS (ID, N) Values (678, 832);
Insert into NUMBERS (ID, N) Values (679, 86);
Insert into NUMBERS (ID, N) Values (680, 871);
Insert into NUMBERS (ID, N) Values (681, 393);
Insert into NUMBERS (ID, N) Values (682, 741);
Insert into NUMBERS (ID, N) Values (683, 63);
Insert into NUMBERS (ID, N) Values (684, 823);
Insert into NUMBERS (ID, N) Values (685, 186);
Insert into NUMBERS (ID, N) Values (686, 724);
Insert into NUMBERS (ID, N) Values (687, 182);
Insert into NUMBERS (ID, N) Values (688, 303);
Insert into NUMBERS (ID, N) Values (689, 345);
Insert into NUMBERS (ID, N) Values (690, 805);
Insert into NUMBERS (ID, N) Values (691, 798);
Insert into NUMBERS (ID, N) Values (692, 8);
Insert into NUMBERS (ID, N) Values (693, 919);
Insert into NUMBERS (ID, N) Values (694, 809);
Insert into NUMBERS (ID, N) Values (695, 753);
Insert into NUMBERS (ID, N) Values (696, 319);
Insert into NUMBERS (ID, N) Values (697, 791);
Insert into NUMBERS (ID, N) Values (698, 188);
Insert into NUMBERS (ID, N) Values (699, 139);
Insert into NUMBERS (ID, N) Values (700, 58);
Insert into NUMBERS (ID, N) Values (701, 139);
Insert into NUMBERS (ID, N) Values (702, 97);
Insert into NUMBERS (ID, N) Values (703, 961);
Insert into NUMBERS (ID, N) Values (704, 432);
Insert into NUMBERS (ID, N) Values (705, 554);
Insert into NUMBERS (ID, N) Values (706, 346);
Insert into NUMBERS (ID, N) Values (707, 784);
Insert into NUMBERS (ID, N) Values (708, 19);
Insert into NUMBERS (ID, N) Values (709, 126);
Insert into NUMBERS (ID, N) Values (710, 842);
Insert into NUMBERS (ID, N) Values (711, 973);
Insert into NUMBERS (ID, N) Values (712, 879);
Insert into NUMBERS (ID, N) Values (713, 68);
Insert into NUMBERS (ID, N) Values (714, 796);
Insert into NUMBERS (ID, N) Values (715, 958);
Insert into NUMBERS (ID, N) Values (716, 885);
Insert into NUMBERS (ID, N) Values (717, 966);
Insert into NUMBERS (ID, N) Values (718, 151);
Insert into NUMBERS (ID, N) Values (719, 415);
Insert into NUMBERS (ID, N) Values (720, 862);
Insert into NUMBERS (ID, N) Values (721, 478);
Insert into NUMBERS (ID, N) Values (722, 976);
Insert into NUMBERS (ID, N) Values (723, 617);
Insert into NUMBERS (ID, N) Values (724, 941);
Insert into NUMBERS (ID, N) Values (725, 972);
Insert into NUMBERS (ID, N) Values (726, 443);
Insert into NUMBERS (ID, N) Values (727, 597);
Insert into NUMBERS (ID, N) Values (728, 153);
Insert into NUMBERS (ID, N) Values (729, 702);
Insert into NUMBERS (ID, N) Values (730, 448);
Insert into NUMBERS (ID, N) Values (731, 564);
Insert into NUMBERS (ID, N) Values (732, 20);
Insert into NUMBERS (ID, N) Values (733, 928);
Insert into NUMBERS (ID, N) Values (734, 47);
Insert into NUMBERS (ID, N) Values (735, 303);
Insert into NUMBERS (ID, N) Values (736, 946);
Insert into NUMBERS (ID, N) Values (737, 86);
Insert into NUMBERS (ID, N) Values (738, 846);
Insert into NUMBERS (ID, N) Values (739, 841);
Insert into NUMBERS (ID, N) Values (740, 312);
Insert into NUMBERS (ID, N) Values (741, 566);
Insert into NUMBERS (ID, N) Values (742, 155);
Insert into NUMBERS (ID, N) Values (743, 182);
Insert into NUMBERS (ID, N) Values (744, 412);
Insert into NUMBERS (ID, N) Values (745, 600);
Insert into NUMBERS (ID, N) Values (746, 756);
Insert into NUMBERS (ID, N) Values (747, 892);
Insert into NUMBERS (ID, N) Values (748, 885);
Insert into NUMBERS (ID, N) Values (749, 958);
Insert into NUMBERS (ID, N) Values (750, 168);
Insert into NUMBERS (ID, N) Values (751, 181);
Insert into NUMBERS (ID, N) Values (752, 270);
Insert into NUMBERS (ID, N) Values (753, 164);
Insert into NUMBERS (ID, N) Values (754, 755);
Insert into NUMBERS (ID, N) Values (755, 998);
Insert into NUMBERS (ID, N) Values (756, 112);
Insert into NUMBERS (ID, N) Values (757, 539);
Insert into NUMBERS (ID, N) Values (758, 558);
Insert into NUMBERS (ID, N) Values (759, 583);
Insert into NUMBERS (ID, N) Values (760, 256);
Insert into NUMBERS (ID, N) Values (761, 793);
Insert into NUMBERS (ID, N) Values (762, 348);
Insert into NUMBERS (ID, N) Values (763, 38);
Insert into NUMBERS (ID, N) Values (764, 55);
Insert into NUMBERS (ID, N) Values (765, 888);
Insert into NUMBERS (ID, N) Values (766, 277);
Insert into NUMBERS (ID, N) Values (767, 825);
Insert into NUMBERS (ID, N) Values (768, 153);
Insert into NUMBERS (ID, N) Values (769, 641);
Insert into NUMBERS (ID, N) Values (770, 799);
Insert into NUMBERS (ID, N) Values (771, 197);
Insert into NUMBERS (ID, N) Values (772, 531);
Insert into NUMBERS (ID, N) Values (773, 305);
Insert into NUMBERS (ID, N) Values (774, 596);
Insert into NUMBERS (ID, N) Values (775, 274);
Insert into NUMBERS (ID, N) Values (776, 79);
Insert into NUMBERS (ID, N) Values (777, 732);
Insert into NUMBERS (ID, N) Values (778, 509);
Insert into NUMBERS (ID, N) Values (779, 826);
Insert into NUMBERS (ID, N) Values (780, 931);
Insert into NUMBERS (ID, N) Values (781, 610);
Insert into NUMBERS (ID, N) Values (782, 777);
Insert into NUMBERS (ID, N) Values (783, 421);
Insert into NUMBERS (ID, N) Values (784, 865);
Insert into NUMBERS (ID, N) Values (785, 203);
Insert into NUMBERS (ID, N) Values (786, 562);
Insert into NUMBERS (ID, N) Values (787, 131);
Insert into NUMBERS (ID, N) Values (788, 467);
Insert into NUMBERS (ID, N) Values (789, 604);
Insert into NUMBERS (ID, N) Values (790, 886);
Insert into NUMBERS (ID, N) Values (791, 202);
Insert into NUMBERS (ID, N) Values (792, 878);
Insert into NUMBERS (ID, N) Values (793, 194);
Insert into NUMBERS (ID, N) Values (794, 878);
Insert into NUMBERS (ID, N) Values (795, 365);
Insert into NUMBERS (ID, N) Values (796, 454);
Insert into NUMBERS (ID, N) Values (797, 431);
Insert into NUMBERS (ID, N) Values (798, 7);
Insert into NUMBERS (ID, N) Values (799, 564);
Insert into NUMBERS (ID, N) Values (800, 242);
Insert into NUMBERS (ID, N) Values (801, 555);
Insert into NUMBERS (ID, N) Values (802, 89);
Insert into NUMBERS (ID, N) Values (803, 416);
Insert into NUMBERS (ID, N) Values (804, 263);
Insert into NUMBERS (ID, N) Values (805, 763);
Insert into NUMBERS (ID, N) Values (806, 454);
Insert into NUMBERS (ID, N) Values (807, 348);
Insert into NUMBERS (ID, N) Values (808, 895);
Insert into NUMBERS (ID, N) Values (809, 264);
Insert into NUMBERS (ID, N) Values (810, 823);
Insert into NUMBERS (ID, N) Values (811, 43);
Insert into NUMBERS (ID, N) Values (812, 149);
Insert into NUMBERS (ID, N) Values (813, 335);
Insert into NUMBERS (ID, N) Values (814, 5);
Insert into NUMBERS (ID, N) Values (815, 121);
Insert into NUMBERS (ID, N) Values (816, 995);
Insert into NUMBERS (ID, N) Values (817, 909);
Insert into NUMBERS (ID, N) Values (818, 168);
Insert into NUMBERS (ID, N) Values (819, 521);
Insert into NUMBERS (ID, N) Values (820, 492);
Insert into NUMBERS (ID, N) Values (821, 162);
Insert into NUMBERS (ID, N) Values (822, 27);
Insert into NUMBERS (ID, N) Values (823, 32);
Insert into NUMBERS (ID, N) Values (824, 834);
Insert into NUMBERS (ID, N) Values (825, 677);
Insert into NUMBERS (ID, N) Values (826, 561);
Insert into NUMBERS (ID, N) Values (827, 984);
Insert into NUMBERS (ID, N) Values (828, 734);
Insert into NUMBERS (ID, N) Values (829, 602);
Insert into NUMBERS (ID, N) Values (830, 836);
Insert into NUMBERS (ID, N) Values (831, 319);
Insert into NUMBERS (ID, N) Values (832, 287);
Insert into NUMBERS (ID, N) Values (833, 597);
Insert into NUMBERS (ID, N) Values (834, 241);
Insert into NUMBERS (ID, N) Values (835, 194);
Insert into NUMBERS (ID, N) Values (836, 373);
Insert into NUMBERS (ID, N) Values (837, 231);
Insert into NUMBERS (ID, N) Values (838, 768);
Insert into NUMBERS (ID, N) Values (839, 761);
Insert into NUMBERS (ID, N) Values (840, 467);
Insert into NUMBERS (ID, N) Values (841, 385);
Insert into NUMBERS (ID, N) Values (842, 173);
Insert into NUMBERS (ID, N) Values (843, 615);
Insert into NUMBERS (ID, N) Values (844, 938);
Insert into NUMBERS (ID, N) Values (845, 889);
Insert into NUMBERS (ID, N) Values (846, 322);
Insert into NUMBERS (ID, N) Values (847, 874);
Insert into NUMBERS (ID, N) Values (848, 103);
Insert into NUMBERS (ID, N) Values (849, 46);
Insert into NUMBERS (ID, N) Values (850, 885);
Insert into NUMBERS (ID, N) Values (851, 944);
Insert into NUMBERS (ID, N) Values (852, 592);
Insert into NUMBERS (ID, N) Values (853, 33);
Insert into NUMBERS (ID, N) Values (854, 595);
Insert into NUMBERS (ID, N) Values (855, 76);
Insert into NUMBERS (ID, N) Values (856, 232);
Insert into NUMBERS (ID, N) Values (857, 649);
Insert into NUMBERS (ID, N) Values (858, 400);
Insert into NUMBERS (ID, N) Values (859, 997);
Insert into NUMBERS (ID, N) Values (860, 365);
Insert into NUMBERS (ID, N) Values (861, 290);
Insert into NUMBERS (ID, N) Values (862, 666);
Insert into NUMBERS (ID, N) Values (863, 183);
Insert into NUMBERS (ID, N) Values (864, 861);
Insert into NUMBERS (ID, N) Values (865, 65);
Insert into NUMBERS (ID, N) Values (866, 236);
Insert into NUMBERS (ID, N) Values (867, 521);
Insert into NUMBERS (ID, N) Values (868, 566);
Insert into NUMBERS (ID, N) Values (869, 772);
Insert into NUMBERS (ID, N) Values (870, 881);
Insert into NUMBERS (ID, N) Values (871, 462);
Insert into NUMBERS (ID, N) Values (872, 294);
Insert into NUMBERS (ID, N) Values (873, 339);
Insert into NUMBERS (ID, N) Values (874, 136);
Insert into NUMBERS (ID, N) Values (875, 430);
Insert into NUMBERS (ID, N) Values (876, 52);
Insert into NUMBERS (ID, N) Values (877, 348);
Insert into NUMBERS (ID, N) Values (878, 905);
Insert into NUMBERS (ID, N) Values (879, 936);
Insert into NUMBERS (ID, N) Values (880, 722);
Insert into NUMBERS (ID, N) Values (881, 446);
Insert into NUMBERS (ID, N) Values (882, 928);
Insert into NUMBERS (ID, N) Values (883, 326);
Insert into NUMBERS (ID, N) Values (884, 634);
Insert into NUMBERS (ID, N) Values (885, 431);
Insert into NUMBERS (ID, N) Values (886, 394);
Insert into NUMBERS (ID, N) Values (887, 518);
Insert into NUMBERS (ID, N) Values (888, 246);
Insert into NUMBERS (ID, N) Values (889, 640);
Insert into NUMBERS (ID, N) Values (890, 191);
Insert into NUMBERS (ID, N) Values (891, 738);
Insert into NUMBERS (ID, N) Values (892, 520);
Insert into NUMBERS (ID, N) Values (893, 434);
Insert into NUMBERS (ID, N) Values (894, 942);
Insert into NUMBERS (ID, N) Values (895, 327);
Insert into NUMBERS (ID, N) Values (896, 449);
Insert into NUMBERS (ID, N) Values (897, 408);
Insert into NUMBERS (ID, N) Values (898, 136);
Insert into NUMBERS (ID, N) Values (899, 504);
Insert into NUMBERS (ID, N) Values (900, 661);
Insert into NUMBERS (ID, N) Values (901, 203);
Insert into NUMBERS (ID, N) Values (902, 336);
Insert into NUMBERS (ID, N) Values (903, 396);
Insert into NUMBERS (ID, N) Values (904, 384);
Insert into NUMBERS (ID, N) Values (905, 21);
Insert into NUMBERS (ID, N) Values (906, 375);
Insert into NUMBERS (ID, N) Values (907, 642);
Insert into NUMBERS (ID, N) Values (908, 379);
Insert into NUMBERS (ID, N) Values (909, 499);
Insert into NUMBERS (ID, N) Values (910, 12);
Insert into NUMBERS (ID, N) Values (911, 953);
Insert into NUMBERS (ID, N) Values (912, 94);
Insert into NUMBERS (ID, N) Values (913, 329);
Insert into NUMBERS (ID, N) Values (914, 323);
Insert into NUMBERS (ID, N) Values (915, 998);
Insert into NUMBERS (ID, N) Values (916, 720);
Insert into NUMBERS (ID, N) Values (917, 60);
Insert into NUMBERS (ID, N) Values (918, 700);
Insert into NUMBERS (ID, N) Values (919, 107);
Insert into NUMBERS (ID, N) Values (920, 704);
Insert into NUMBERS (ID, N) Values (921, 426);
Insert into NUMBERS (ID, N) Values (922, 258);
Insert into NUMBERS (ID, N) Values (923, 86);
Insert into NUMBERS (ID, N) Values (924, 206);
Insert into NUMBERS (ID, N) Values (925, 823);
Insert into NUMBERS (ID, N) Values (926, 788);
Insert into NUMBERS (ID, N) Values (927, 742);
Insert into NUMBERS (ID, N) Values (928, 746);
Insert into NUMBERS (ID, N) Values (929, 271);
Insert into NUMBERS (ID, N) Values (930, 933);
Insert into NUMBERS (ID, N) Values (931, 712);
Insert into NUMBERS (ID, N) Values (932, 550);
Insert into NUMBERS (ID, N) Values (933, 240);
Insert into NUMBERS (ID, N) Values (934, 332);
Insert into NUMBERS (ID, N) Values (935, 106);
Insert into NUMBERS (ID, N) Values (936, 465);
Insert into NUMBERS (ID, N) Values (937, 303);
Insert into NUMBERS (ID, N) Values (938, 968);
Insert into NUMBERS (ID, N) Values (939, 13);
Insert into NUMBERS (ID, N) Values (940, 930);
Insert into NUMBERS (ID, N) Values (941, 405);
Insert into NUMBERS (ID, N) Values (942, 471);
Insert into NUMBERS (ID, N) Values (943, 339);
Insert into NUMBERS (ID, N) Values (944, 968);
Insert into NUMBERS (ID, N) Values (945, 513);
Insert into NUMBERS (ID, N) Values (946, 736);
Insert into NUMBERS (ID, N) Values (947, 241);
Insert into NUMBERS (ID, N) Values (948, 493);
Insert into NUMBERS (ID, N) Values (949, 642);
Insert into NUMBERS (ID, N) Values (950, 433);
Insert into NUMBERS (ID, N) Values (951, 153);
Insert into NUMBERS (ID, N) Values (952, 832);
Insert into NUMBERS (ID, N) Values (953, 393);
Insert into NUMBERS (ID, N) Values (954, 589);
Insert into NUMBERS (ID, N) Values (955, 866);
Insert into NUMBERS (ID, N) Values (956, 27);
Insert into NUMBERS (ID, N) Values (957, 124);
Insert into NUMBERS (ID, N) Values (958, 138);
Insert into NUMBERS (ID, N) Values (959, 130);
Insert into NUMBERS (ID, N) Values (960, 290);
Insert into NUMBERS (ID, N) Values (961, 308);
Insert into NUMBERS (ID, N) Values (962, 354);
Insert into NUMBERS (ID, N) Values (963, 928);
Insert into NUMBERS (ID, N) Values (964, 739);
Insert into NUMBERS (ID, N) Values (965, 343);
Insert into NUMBERS (ID, N) Values (966, 59);
Insert into NUMBERS (ID, N) Values (967, 559);
Insert into NUMBERS (ID, N) Values (968, 631);
Insert into NUMBERS (ID, N) Values (969, 290);
Insert into NUMBERS (ID, N) Values (970, 11);
Insert into NUMBERS (ID, N) Values (971, 650);
Insert into NUMBERS (ID, N) Values (972, 465);
Insert into NUMBERS (ID, N) Values (973, 171);
Insert into NUMBERS (ID, N) Values (974, 445);
Insert into NUMBERS (ID, N) Values (975, 672);
Insert into NUMBERS (ID, N) Values (976, 937);
Insert into NUMBERS (ID, N) Values (977, 993);
Insert into NUMBERS (ID, N) Values (978, 326);
Insert into NUMBERS (ID, N) Values (979, 698);
Insert into NUMBERS (ID, N) Values (980, 466);
Insert into NUMBERS (ID, N) Values (981, 221);
Insert into NUMBERS (ID, N) Values (982, 894);
Insert into NUMBERS (ID, N) Values (983, 578);
Insert into NUMBERS (ID, N) Values (984, 663);
Insert into NUMBERS (ID, N) Values (985, 522);
Insert into NUMBERS (ID, N) Values (986, 577);
Insert into NUMBERS (ID, N) Values (987, 576);
Insert into NUMBERS (ID, N) Values (988, 363);
Insert into NUMBERS (ID, N) Values (989, 470);
Insert into NUMBERS (ID, N) Values (990, 235);
Insert into NUMBERS (ID, N) Values (991, 755);
Insert into NUMBERS (ID, N) Values (992, 610);
Insert into NUMBERS (ID, N) Values (993, 321);
Insert into NUMBERS (ID, N) Values (994, 940);
Insert into NUMBERS (ID, N) Values (995, 668);
Insert into NUMBERS (ID, N) Values (996, 748);
Insert into NUMBERS (ID, N) Values (997, 528);
Insert into NUMBERS (ID, N) Values (998, 897);
Insert into NUMBERS (ID, N) Values (999, 599);
Insert into NUMBERS (ID, N) Values (1000, 802);
```

ASKER

Ok, maybe we need to work with an example that is closer to reality. But first, let me just say that I don't need combinations of 10 rows; at most, I will need 5. The best approach would be, I think, a stored procedure that takes a parameter accepting values 1-5 and generates the corresponding SQL. Most of the time, we need the least "expensive" solution.
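The "least expensive solution first" idea can be sketched outside SQL. This is my own illustration (not from the thread): try all 1-row matches, then 2-row, and so on up to a maximum size, stopping at the first size that yields any hit. The `eps` tolerance is an assumption to absorb floating-point rounding in the sums.

```python
# Hypothetical driver mirroring the proposed stored procedure: try k = 1,
# then k = 2, ... up to max_k, returning at the first size with any match.
from itertools import combinations

def find_sum(rows, target, max_k=5, eps=0.005):
    # rows: list of (key, value) pairs; eps absorbs float rounding
    for k in range(1, max_k + 1):
        hits = [c for c in combinations(rows, k)
                if abs(sum(v for _, v in c) - target) <= eps]
        if hits:
            return k, hits
    return None

rows = [((1, 1), 123.52), ((1, 2), 15421.23),
        ((2, 1), 2367.0), ((3, 3), 4352.12)]
print(find_sum(rows, 2490.52))  # 2-row hit: the 1st and 3rd rows
```

Because the sizes are tried in ascending order, the cheap cases short-circuit the expensive ones, which is exactly the behavior wanted from the stored procedure.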


I have attached some sample data with the largest subset at the moment (about 5.5k rows).

I also attached the execution plan (had to rename to .sql because of EE upload rules. Just rename it back to .sqlplan).

With the smaller subset of data (about 150 rows), a 5-row approach took about 7 seconds and returned one row.

When I try a 3-row approach (you can easily infer it from the larger one) on the larger subset, it quickly escalates (I cancelled after 15 minutes, with 4655 rows returned).

Trying a 4-row approach on the larger subset would be impossible (at the very least, it would take over 30 minutes, though the growth should actually be geometric rather than arithmetic).

Also, it should be noted that I'm using a relatively small number for the sum. If I use, for example, 1500, everything would escalate further, as the consecutive joins would filter out fewer rows.

ScottPletcher, I don't really understand what you mean by a selective join. The two-value table, if any, would contain only the pairs that add up to the desired value.

I sort of understand the LEFT JOIN approach, which would give me all the results at once, but I'm not sure how to code it, nor whether it would be efficient. Intuitively, it seems it would be slower than the INNER JOIN solution.

If I could get a decent solution for the 5-row sum, I could then devise a stored procedure that runs the 1-row solution, checks for values, and if there are none runs the 2-row solution, then the 3-row solution, and so on.

But this would require that running all 5 queries on the larger subset takes no longer than, at most, 5-10 minutes; anything slower wouldn't be practical. I could make do with a solution that meets that time only on the smaller subsets (most commonly somewhere between 300 and 800 rows), but that would be a smaller consolation.
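For the larger sizes, the chained joins can in principle be replaced by a hash-based meet-in-the-middle split: materialize every 2-row sum once, then find 4-row combinations by probing for `target` minus each 2-row sum. A rough sketch outside SQL (my own illustration, not something suggested in the thread), assuming values have been converted to integer cents so equality tests are exact:

```python
# Meet-in-the-middle for fixed-size subset sum: O(n^2) hashed pairs
# instead of an O(n^4) four-way join.
from itertools import combinations

def four_row_matches(values, target):
    """Return index quadruples i<j<k<l whose values sum to target.
    values: amounts in integer cents, indexed by row position."""
    n = len(values)
    # Index every 2-row sum by value: sum -> list of (i, j) pairs, i < j.
    pair_sums = {}
    for i, j in combinations(range(n), 2):
        pair_sums.setdefault(values[i] + values[j], []).append((i, j))
    hits = set()
    # For each right-hand pair, probe for complementary left-hand pairs.
    for k, l in combinations(range(n), 2):
        for i, j in pair_sums.get(target - values[k] - values[l], []):
            quad = tuple(sorted({i, j, k, l}))
            if len(quad) == 4:          # all four rows must be distinct
                hits.add(quad)
    return hits

vals = [100, 250, 400, 50, 200]       # cents
print(four_row_matches(vals, 950))    # {(0, 1, 2, 4)}
```

This trades memory for time: all 2-row sums are kept at once, so it should stay feasible for a few thousand rows, but at 10,000 rows the pair table alone approaches 50 million entries.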

Sample.txt

query.sql

The rows will usually number somewhere between 150 and 6000; at most, we don't expect a subset to exceed 10000.

The actual table, with the relevant fields, is:

rh_vcrt0 (vcr_emp_empresa PK, vcr_num_empregado PK, vcr_vnc_cod_venc PK, vcr_valor_bruto)

You can treat all values as numbers. vcr_emp_empresa is the company ID. It will be a fixed value for the query, using the @Emp variable; that is, we are always summing rows from the same company.

vcr_num_empregado is the employee number. Each employee has various codes that add to or subtract from his total salary (extra hours, night shift, holidays, sick leave, etc.). This is vcr_vnc_cod_venc.

So, after applying some of the suggestions, the fastest query I have for the 5-row combination is:

```
DECLARE @Emp int,@Vl real
SET @Emp=32
SET @Vl=112.41;
-- Pre-aggregate per key (other key fields exist but are irrelevant here)
-- and keep only base rows whose value is already below the target
WITH v AS (
SELECT vcr_emp_empresa,vcr_num_empregado,vcr_vnc_cod_venc,SUM(vcr_valor_bruto) vcr_valor_bruto
FROM rh_vcrt0
WHERE vcr_emp_empresa=@Emp AND vcr_valor_bruto<@Vl
GROUP BY vcr_emp_empresa,vcr_num_empregado,vcr_vnc_cod_venc)
SELECT v1.vcr_num_empregado 'Nº1',v1.vcr_vnc_cod_venc 'Venc1',v1.vcr_valor_bruto 'Vl1',
v2.vcr_num_empregado 'Nº2',v2.vcr_vnc_cod_venc 'Venc2',v2.vcr_valor_bruto 'Vl2',
v3.vcr_num_empregado 'Nº3',v3.vcr_vnc_cod_venc 'Venc3',v3.vcr_valor_bruto 'Vl3',
v4.vcr_num_empregado 'Nº4',v4.vcr_vnc_cod_venc 'Venc4',v4.vcr_valor_bruto 'Vl4',
v5.vcr_num_empregado 'Nº5',v5.vcr_vnc_cod_venc 'Venc5',v5.vcr_valor_bruto 'Vl5'
FROM v v1
-- Each join demands a strictly descending (empregado, cod_venc) key, so
-- every combination appears exactly once, and discards partial sums that
-- already reach @Vl
INNER JOIN v v2
ON (v1.vcr_num_empregado>v2.vcr_num_empregado
OR (v1.vcr_num_empregado=v2.vcr_num_empregado AND v1.vcr_vnc_cod_venc>v2.vcr_vnc_cod_venc))
AND v1.vcr_valor_bruto+v2.vcr_valor_bruto<@Vl
INNER JOIN v v3
ON (v2.vcr_num_empregado>v3.vcr_num_empregado
OR (v2.vcr_num_empregado=v3.vcr_num_empregado AND v2.vcr_vnc_cod_venc>v3.vcr_vnc_cod_venc))
AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto<@Vl
INNER JOIN v v4
ON (v3.vcr_num_empregado>v4.vcr_num_empregado
OR (v3.vcr_num_empregado=v4.vcr_num_empregado AND v3.vcr_vnc_cod_venc>v4.vcr_vnc_cod_venc))
AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto+v4.vcr_valor_bruto<@Vl
-- The last join closes the combination: the 5-term sum must hit @Vl exactly
INNER JOIN v v5
ON (v4.vcr_num_empregado>v5.vcr_num_empregado
OR (v4.vcr_num_empregado=v5.vcr_num_empregado AND v4.vcr_vnc_cod_venc>v5.vcr_vnc_cod_venc))
AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto+v4.vcr_valor_bruto+v5.vcr_valor_bruto=@Vl
ORDER BY 1,2,4,5,7,8,10,11,13,14
```

(The WITH part with the SUM is there because rows can share the keys shown here while differing in other key fields that aren't used in this case; I take the opportunity to filter out some undesired rows right away.)

I have attached some sample data with the largest subset at the moment (about 5.5k rows).

I also attached the execution plan (had to rename to .sql because of EE upload rules. Just rename it back to .sqlplan).

With the smaller subset of data (about 150 rows) and a 5 rows approach, it took about 7 seconds and returned one row.

When I try a 3 rows approach (you can easily infer it from the 5-row query above) on the larger subset, it quickly escalates (I cancelled it after 15 minutes, with 4655 rows returned).

Trying a 4 rows approach on the larger subset would be impossible (at the very least, it would take over 30 minutes, and the growth should actually be geometric rather than arithmetic).

Also, it should be noted that I'm using a relatively small number for the sum. If I use, for example, 1500, all of these figures would escalate, as the consecutive joins would filter fewer rows.

ScottPletcher, I don't really understand what you mean by a selective join. The two value table, if any, would have only the values that do add up to the desired value.

I sort of understand the LEFT JOIN approach, which would give me all the results at once, but I'm not sure how to code it nor if it will be efficient. It seems to me, intuitively, that it will be slower than the INNER JOIN solution.

If I could get a decent solution for the 5 rows sum, I could then devise a stored procedure which will run the 1 rows solution, check for values, if none run the 2 rows solution, check for values, if none run the 3 rows solution, etc...

But this would require that running all 5 queries on the larger subset wouldn't take longer than, at most, 5-10 minutes. Anything less wouldn't be practical. I could make do with a solution that runs for the smaller subsets (most common is something between 300 and 800) in the desired time, but that would just be a smaller consolation.

Sample.txt

query.sql

>> The two value table, if any, would have only the values that do add up to the desired value. <<

No. That's what I'm saying -- keep ALL results that are less than the total desired, because those intermediate results can be used to help generate later results.

ASKER

But isn't that what I did when I added the condition in the joins for v1.vcr_valor_bruto+v2.vcr_valor_bruto<@Vl (and subsequent)?

Or do you simply want to change the INNER JOINs to LEFT JOINs and build from there?

I don't think I'm understanding your point. Can you please provide some sort of example code or pseudocode so I can grasp it?

the join conditions above do keep the results that are less than the total desired.

that's not a matter of efficiency; it's necessary for it to work at all.

That same condition not only lets you find the solutions that are possible; more importantly, it eliminates joins where the sum is already too big.

That's where the speed comes from, because then you can let the math work for you rather than against you.

Of course, for small values and large sums the possible combinations could still grow unmanageably big but the asker seems to be suggesting that the relative size of value vs sum will make that unlikely.
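The early-filtering idea above translates directly into a depth-first search that abandons any partial combination once its running sum can no longer reach the target. A minimal Python sketch of that pruning, assuming all values are positive (the thread later notes that negatives break this particular cutoff); `find_sums` is a hypothetical name:

```python
def find_sums(values, target, max_rows=5, eps=0.005):
    """Depth-first search over combinations, pruning any partial sum
    that already reaches or exceeds the target (positives only)."""
    values = sorted(values)  # ascending, so one check cuts a whole branch
    results = []

    def extend(start, picked, running):
        if picked and abs(running - target) < eps:
            results.append(list(picked))
        for i in range(start, len(values)):
            if running + values[i] >= target + eps:
                break  # sorted input: every later value is bigger too
            if len(picked) < max_rows:
                picked.append(values[i])
                extend(i + 1, picked, running + values[i])
                picked.pop()

    extend(0, [], 0.0)
    return results
```

Against the four sample rows from the original question, a target of 2490.52 finds the first and third values, exactly as in the example there.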

>>> do you simply want to change the INNER JOINs to LEFT JOINs and build from there?

the difference here is INNER JOINS will only find 5 row sums (in your example of 5 joins)

the outer joins I showed above find 1, 2, 3, 4 and 5 rows sums.

if you parameterize your procedure and want to search for exactly one particular level, then INNER is the way to go. Same idea as what I already posted and you've used above.

You'll simply have 5 different queries in your procedure depending on the number of joins/rows you need to examine.

I didn't anticipate NEGATIVE numbers in the input, since there were none in the original q.

That VASTLY complicates things. You won't be able to gain nearly as much efficiency; you'll have to re-scan all negative values.

For only positive numbers, we have a decent chance of using some binary search logic to reduce future search requirements.

As for that specific sample data and search value, for 112.41 in the ~5500 rows, I found 94 separate matches (!) on the second level. That took ~20-24 seconds on my server; naturally, your time may vary.

```
USE tempdb
GO
--code assumes that tempdb.dbo.rh_vcrt0 already exists, loaded with the test data
DECLARE @value_to_find decimal(9, 2)
SET @value_to_find = 112.41;
IF OBJECT_ID('tempdb..orig_values') IS NOT NULL
DROP TABLE orig_values
IF OBJECT_ID('tempdb..result2') IS NOT NULL
DROP TABLE result2
-- orig values < @value_to_find
CREATE TABLE orig_values (
orig_ident int IDENTITY(1, 1) NOT NULL,
rh_ident int NOT NULL,
vcr_valor_bruto real NOT NULL
)
CREATE UNIQUE CLUSTERED INDEX orig_values__CL ON orig_values ( orig_ident )
CREATE TABLE result2 (
result2_ident int IDENTITY(1, 1) NOT NULL,
orig_ident1 int NOT NULL,
orig_ident2 int NOT NULL,
CHECK(orig_ident1 < orig_ident2),
result2 real NOT NULL
)
CREATE UNIQUE CLUSTERED INDEX result2__CL ON result2 ( result2_ident )
--CREATE UNIQUE CLUSTERED INDEX result2__CL ON result2 ( result2, result2_ident )
DECLARE @orig_count int
DECLARE @result2_count int
DECLARE @orig_ident int
DECLARE @orig_ident2 int
DECLARE @orig_ident3 int
DECLARE @orig_ident4 int
DECLARE @orig_ident5 int
DECLARE @rh_ident int
DECLARE @rh_ident2 int
DECLARE @rh_ident3 int
DECLARE @rh_ident4 int
DECLARE @rh_ident5 int
DECLARE @final_msg varchar(1000)
/*
IF EXISTS(SELECT 1 FROM rh_vcrt0 WHERE vcr_valor_bruto = @value_to_find)
BEGIN
SET @final_msg = 'Value to search for is in the original input, no need to join rows!'
GOTO ExitCode
END --IF
*/
INSERT INTO orig_values
SELECT
rh_ident, vcr_valor_bruto
FROM dbo.rh_vcrt0
-- can't use this because of NEGATIVE numbers in the input data!!
--WHERE
--vcr_valor_bruto <= @value_to_find
ORDER BY
vcr_valor_bruto
SET @orig_count = @@ROWCOUNT
PRINT '@orig_count = ' + CAST(@orig_count AS varchar(10))
/*
-- can't use this because of NEGATIVE numbers in the input data!!
SELECT @orig_ident = orig_ident
FROM orig_values
WHERE
orig_ident = @orig_count AND
vcr_valor_bruto = @value_to_find
IF @orig_ident IS NOT NULL
GOTO ExitCode
*/
INSERT INTO result2
SELECT ov1.orig_ident, ov2.orig_ident, ov1.vcr_valor_bruto + ov2.vcr_valor_bruto AS result2
FROM orig_values ov1
INNER JOIN orig_values ov2 ON
ov2.orig_ident > ov1.orig_ident AND
ov1.vcr_valor_bruto + ov2.vcr_valor_bruto <= @value_to_find
ORDER BY
ov1.vcr_valor_bruto + ov2.vcr_valor_bruto --result2
SET @result2_count = @@ROWCOUNT
PRINT '@result2_count = ' + CAST(@result2_count AS varchar(10))
SELECT TOP 1
@orig_ident = orig_ident1, @orig_ident2 = orig_ident2
FROM result2
WHERE
result2_ident = @result2_count AND
result2 = @value_to_find
IF @orig_ident IS NOT NULL
GOTO ExitCode
ExitCode:
IF @orig_ident IS NOT NULL
BEGIN
SELECT 'Match(es) found', rh.*
FROM orig_values ov
INNER JOIN rh_vcrt0 rh ON
rh.rh_ident = ov.rh_ident
WHERE ov.orig_ident IN ( @orig_ident, @orig_ident2, @orig_ident3, @orig_ident4, @orig_ident5 )
/*
SELECT COUNT(*)
FROM result2
WHERE
result2 = @value_to_find
*/
END --IF
PRINT @final_msg
```

using the sample data provided I found 111 combinations of 2-row sums (5 seconds)

I get 111 rows using the code in http:#a38513641

or using a small variant on my original query. Compensating for negatives was trivial. Simply generate an id sorted by the values. The negatives will sort to the beginning and allow the same < @vl logic I originally posted.

So, for 2 rows it looks like this... (4 seconds)

```
DECLARE @Emp int,@Vl real
SET @Emp=32
SET @Vl=112.41;
WITH v
AS (SELECT vcr_emp_empresa,
vcr_num_empregado,
vcr_vnc_cod_venc,
SUM(vcr_valor_bruto) vcr_valor_bruto
FROM rh_vcrt0
WHERE vcr_emp_empresa = @emp AND vcr_valor_bruto < @vl
GROUP BY vcr_emp_empresa, vcr_num_empregado, vcr_vnc_cod_venc),
numbers
AS (SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id,
vcr_valor_bruto,
vcr_valor_bruto n,
vcr_emp_empresa,
vcr_num_empregado,
vcr_vnc_cod_venc
FROM v)
SELECT v1.vcr_num_empregado "Nº1",
v1.vcr_vnc_cod_venc "Venc1",
v1.vcr_valor_bruto "Vl1",
v2.vcr_num_empregado "Nº2",
v2.vcr_vnc_cod_venc "Venc2",
v2.vcr_valor_bruto "Vl2"
FROM numbers v1
INNER JOIN numbers v2 ON v1.id > v2.id AND v1.n + v2.n = @vl
```
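For the 2-row level specifically, one can avoid a self-join entirely: a single pass with a hash map finds every complementary pair. A sketch of that idea in Python (`two_row_matches` is a hypothetical helper; values are converted to integer cents so equality tests are exact, sidestepping the `real` rounding issue raised later in the thread):

```python
from collections import defaultdict

def two_row_matches(values, target):
    """Find all index pairs (i, j), i < j, whose values sum to target.
    Values are converted to integer cents so equality is exact."""
    cents = [round(v * 100) for v in values]
    goal = round(target * 100)
    seen = defaultdict(list)   # value in cents -> indices already passed
    pairs = []
    for j, c in enumerate(cents):
        for i in seen[goal - c]:
            pairs.append((i, j))
        seen[c].append(j)
    return pairs
```

This is linear in the number of rows (plus one entry per match), so the 2-row level never needs to materialize the full pair join.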

I don't see an rh_ident column in the sample data, so the code in http:#a38516277 didn't run for me until I created a new table and generated ids for it. Not a problem, basically did the same thing I used for my own query but maybe my test was skewed.

The code in http:#a38516277 ran in 1:48 (108 seconds) and only printed 1 combination.

I'm willing to believe I messed up ScottPletcher's test, but I don't see how.

This is the table I used for my tests and the asker's test code

```
CREATE TABLE [dbo].[rh_vcrt0](
	[vcr_emp_empresa] [int],
	[vcr_num_empregado] [int],
	[vcr_vnc_cod_venc] [int],
	[vcr_valor_bruto] money
);
```

I simply ran the inserts and then code above.

To test ScottPletcher's code I renamed the table above to rh_vcrt1

then created this table...

```
CREATE TABLE [dbo].[rh_vcrt0](
	rh_ident [int],
	[vcr_emp_empresa] [int],
	[vcr_num_empregado] [int],
	[vcr_vnc_cod_venc] [int],
	[vcr_valor_bruto] money
);
```

and then populated it with this query

```
INSERT INTO [rh_vcrt0]
SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto),
	vcr_emp_empresa,
	vcr_num_empregado,
	vcr_vnc_cod_venc,
	vcr_valor_bruto
FROM [rh_vcrt1];
```

If there is a flaw in any of the test setup, I apologize and welcome correction.

ASKER

I didn't realize negative numbers would interfere with this. I just took them for granted. My apologies for this omission.

However, your comments and the commented part of the code made me realize that I actually can't filter the values beforehand.

Running sdstuber's code with this removed (AND vcr_valor_bruto < @vl) took 17 seconds and returned 132 rows. With it, it was 7 seconds and 116 rows, so it's a relevant issue.

I tried ScottPletcher's code, but couldn't quite get it to work. It seems to ignore the fact that the table has two keys, not just one. It starts by using rh_ident, then later uses orig_ident. I might be missing something, but when I tried changing this into both keys, I got the error "An explicit value for the identity column in table 'orig_values' can only be specified when a column list is used and IDENTITY_INSERT is ON."

Either way, it seems that I can't get a workable solution that will return 5 rows combinations in less than 5 minutes. Adapting sdstuber's code to 3 rows will run for more than 10 minutes. This seems to imply that there isn't any chance that a workable 5 rows solution will be forthcoming. Not even for the smaller subset of data. I tried:

```
DECLARE @Emp int,@Vl real
SET @Emp=101
SET @Vl=112.41;
WITH v
AS (SELECT vcr_emp_empresa,
vcr_num_empregado,
vcr_vnc_cod_venc,
SUM(vcr_valor_bruto) vcr_valor_bruto
FROM rh_vcrt0
WHERE vcr_emp_empresa=@Emp
GROUP BY vcr_emp_empresa,
vcr_num_empregado,
vcr_vnc_cod_venc),
numbers
AS (SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id,
vcr_valor_bruto,
vcr_valor_bruto n,
vcr_emp_empresa,
vcr_num_empregado,
vcr_vnc_cod_venc
FROM v)
SELECT v1.vcr_num_empregado "Nº1",
v1.vcr_vnc_cod_venc "Venc1",
v1.vcr_valor_bruto "Vl1",
v2.vcr_num_empregado "Nº2",
v2.vcr_vnc_cod_venc "Venc2",
v2.vcr_valor_bruto "Vl2",
v3.vcr_num_empregado "Nº3",
v3.vcr_vnc_cod_venc "Venc3",
v3.vcr_valor_bruto "Vl3",
v4.vcr_num_empregado "Nº4",
v4.vcr_vnc_cod_venc "Venc4",
v4.vcr_valor_bruto "Vl4",
v5.vcr_num_empregado "Nº5",
v5.vcr_vnc_cod_venc "Venc5",
v5.vcr_valor_bruto "Vl5"
FROM numbers v1
INNER JOIN numbers v2
ON v1.id>v2.id
INNER JOIN numbers v3
ON v2.id>v3.id
INNER JOIN numbers v4
ON v3.id>v4.id
INNER JOIN numbers v5
ON v4.id>v5.id
AND v1.vcr_valor_bruto+v2.vcr_valor_bruto+v3.vcr_valor_bruto+v4.vcr_valor_bruto+v5.vcr_valor_bruto=@Vl
```

I cancelled it after 4 minutes, with 108 rows returned.

If that is the case, that no feasible solution exists, I will award points to both anyway, because you have both helped me improve my original query, not only making it feasible to actually use a 2 row solution on the larger subset, but also allowing me to get 4 rows solutions on the smaller subset (45 seconds, 15 rows).

>>> This seems to imply that there isn't any chance that a workable 5 rows solution will be forthcoming.

yes, like I said originally, you're fighting math. The only way any solution (regardless of method) could possibly be fast is with data that implicitly prevents most of the combinatoric explosion.

ScottPletcher's method is essentially (despite protests that it isn't) the same thing presented in SQL. I'm not really sure how any method could be significantly different.

If you need to check all combinations, then you need to check all combinations. There really aren't any shortcuts to "do all"
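"Fighting math" is easy to quantify: the number of k-row candidates from n rows is C(n, k), which explodes well before k = 5. A quick illustration using the row counts mentioned in the thread:

```python
from math import comb

# Candidate k-row combinations for the subset sizes discussed here.
for n in (150, 800, 5500):
    print(n, [comb(n, k) for k in range(1, 6)])

# For the ~5.5k-row subset, even the 2-row level has over 15 million
# candidates, and the 5-row level runs to tens of quadrillions --
# no join order can enumerate that in minutes without heavy pruning.
```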

>> It seems it ignored the fact that the table has two keys, not just one <<

I inserted the values into another table so that (1) the values could be sorted and (2) there would only have to be one key. The new ident I use can be used to look up the original table row and all its columns, as I did upon exit from my code.

My method should have worked exactly as coded for any rows loaded into the rh_vcrt0 table. I didn't go past the second level because, with negative values present, you don't gain anything.

Yes, my method will be slower when 2 values generate a match, but the stored intermediate results will save time when further iterations are needed (vastly more so when the numbers are all POSITIVE).

With only positive values, it's clear you could potentially save time by -- as I stated earlier -- using a binary search to restrict the number of rows you have to iterate over.

>> Compensating for negatives was trivial. <<

Only if you can't see any approach but plowing straight ahead, with no thought about maximizing efficiency. With that limited vision, no improvements would be possible for this (or much of anything else).

For the three-level total, we can selectively join the 2-level table to the original input values.

For the four-level total, we can selectively join the 2-level table I created to itself -- I would expect that to be noticeably faster than a four-table join ... which was one of the points of creating an intermediate table.
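Joining the 2-level table to itself is, in effect, a meet-in-the-middle strategy: materialize every 2-row sum once, then find 4-row matches by pairing complementary 2-row sums. A rough Python sketch of that idea (`four_row_matches` is a hypothetical name; integer cents keep equality exact, and index sets are deduplicated so each 4-row combination appears once):

```python
from collections import defaultdict
from itertools import combinations

def four_row_matches(values, target):
    """4-row subset sums via meet-in-the-middle on a table of 2-row sums."""
    cents = [round(v * 100) for v in values]
    goal = round(target * 100)

    pair_sums = defaultdict(list)          # sum -> list of (i, j), i < j
    for i, j in combinations(range(len(cents)), 2):
        pair_sums[cents[i] + cents[j]].append((i, j))

    found = set()
    for s, pairs in pair_sums.items():
        for (i, j) in pairs:
            for (k, l) in pair_sums.get(goal - s, ()):
                combo = (i, j, k, l)
                if len(set(combo)) == 4:   # the four rows must be distinct
                    found.add(tuple(sorted(combo)))
    return sorted(found)
```

Building the pair table costs O(n²), but the 4-row lookup is then driven by the pair table rather than a four-way join.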

With all POSITIVE values, we could do binary search of the ALREADY SORTED (in BOTH tables) values first to reduce the values that had to be joined ... but with negative numbers present, this can't really be done.
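The binary-search idea for all-positive inputs can also be sketched: with the values sorted, the last row of a pair can be located with a log-time lookup instead of a join. A hypothetical helper, positives only, reporting the first complement found for each value:

```python
from bisect import bisect_left

def complete_pair(sorted_vals, target):
    """For each value, binary-search the (ascending, positive) list for
    its complement among the later entries; returns (i, j) index pairs."""
    pairs = []
    for i, v in enumerate(sorted_vals):
        j = bisect_left(sorted_vals, target - v, i + 1)
        if j < len(sorted_vals) and sorted_vals[j] == target - v:
            pairs.append((i, j))
    return pairs
```

With negative values in the list, `target - v` no longer bounds where the complement can sit, which is exactly the complication described above.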

>>Only if you can't see any potential but plow-straight-ahead, no thought about maximizing efficiency.

Not sure where you're going with that. The sorting to allow for early filtering (i.e. efficiency) is precisely why I'm doing it. And, based on your code, looks like you're following the same idea.

Given that our approaches are effectively identical, I'm not sure where the "no thought" or "limited vision" comes in unless you are suggesting you have some other method than what you have proposed above.

If so, please illuminate us. The asker in particular appears to be willing to "settle" for what has currently been presented. If you have something better, please post.

ASKER

ScottPletcher, when I try your code as is, I get the following error:

Msg 207, Level 16, State 1, Line 54

Invalid column name 'rh_ident'.

Msg 207, Level 16, State 1, Line 105

Invalid column name 'rh_ident'.

Also, if I interpret it correctly, your code is matching all rows regardless of vcr_emp_empresa. Although I only provided data for vcr_emp_empresa=32, which is the largest subset, there are plenty of other values. All combinations have to be restricted to the same vcr_emp_empresa, which I did on my code using @Emp. Combinations with different vcr_emp_empresa are not desired.

>> The sorting to allow for early filtering (i.e. efficiency) is precisely why I'm doing it. <<

Where did you sort the values? And so that SQL knows they are sorted?

>>> Where did you sort the values? And so that SQL knows they are sorted?

when I create the "numbers" cte I create an ID which is sorted by the values

ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id

this ensures my "candidate" combinations look at negatives first then apply positives afterward.

the early filtering will still apply because adding negatives will still produce sums less than the target. It will work even in cases of zero value too.

If, however, the values all happen to be positive, the ordering still works and the early filtering becomes more efficient ~for free~, simply because the number of candidate combinations that *might* work will shrink, making the amount of work in later checks smaller.

The idea being: check for combination sums that "might" work. If they do, keep them; if they don't, drop them. At the next level, check whether adding another value is still valid: if so, keep that combination; if not, drop it, and so on. Each additional level (join) either moves closer to a solution or eliminates a particular combination from further consideration of additional sums.

>> Also, if I interpret it correctly, your code is matching all rows regardless of vcr_emp_empresa. <<

That would be implemented when doing the initial INSERT into the "orig_values" table. I left it off just for convenience during testing, as that is trivial.

>> ScottPletcher, when I try your code as is, I get the following error: <<

You must be running a subset of the code, not the entire thing.

Btw, yes, I did actually output only one matching result. I thought you only needed one match and that any match was as good as another.

```
INSERT INTO orig_values
SELECT
    rh_ident, vcr_valor_bruto
FROM dbo.rh_vcrt0
-- can't use this because of NEGATIVE numbers in the input data!!
--WHERE
--    vcr_valor_bruto <= @value_to_find
WHERE --<<--
    vcr_emp_empresa = 32 --<<--
ORDER BY
    vcr_valor_bruto
```

sorry, just remembered I never addressed this comment...

With negatives in the list, you can't include the initial filter

AND vcr_valor_bruto < @vl (ScottPletcher mentioned this above too)

sorry, I should have taken that out myself.

it works for positives only, but with a mix it doesn't.

simple example

target=6

input 1 = -1

input 2 = 7

if I keep the filter "AND vcr_valor_bruto < 6", the 7 is thrown out and a valid combination (-1 + 7 = 6) is eliminated
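That two-value example can be checked directly; a tiny sketch showing the pre-filter discarding the valid pair:

```python
values, target = [-1, 7], 6

# Pre-filtering rows to value < target throws away the 7 ...
filtered = [v for v in values if v < target]
pairs_filtered = [(a, b) for i, a in enumerate(filtered)
                  for b in filtered[i + 1:] if a + b == target]

# ... while the unfiltered search still finds -1 + 7 = 6.
pairs_full = [(a, b) for i, a in enumerate(values)
              for b in values[i + 1:] if a + b == target]
```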

ASKER

>> You must be running a subset of the code, not the entire thing.

I clicked "Select All" from your code box, pasted on SSMS and only changed the db name from tempdb to rhxxi on the top and on both drop checks. I don't know if I need to create any other table before I run this code, though.

>>WHERE --<<--

vcr_emp_empresa = 32 --<<--

Yeah, I had figured out it would be there. :)

on a related note to the above, I am also assuming your target sum will always be a positive number. ScottPletcher's code above has the same limitation.

ASKER

Actually, though it would be rare, there could be occasions where we would search for negative numbers.

ASKER

And I do mean, very very rare. So rare that, if it doesn't work for negative sums, it won't matter.

I don't understand the current challenge I guess.

I thought if you found a 2nd level-match, you quit searching; from the original q:

>> Usually, when such a need arises, we need to see if any one row has that value. That is basic SQL and I won't even bother with that. Next, we move to two values. <<

The 2-level logic runs so quickly on the original data that you can directly test that.

The example you gave has a lot of 2-level matches.

At any rate, I think the negative numbers prevent nearly all optimizations for higher levels. With negatives in there, you really do have to go through almost every permutation at every level.

[A genuine mathematician might be able to speed things up for you, but I'm not one so I can't say for sure.]

Btw, you **definitely should NOT** be using data type "real" for this, since real is an approximate value only. You should use decimal, which provides an exact data value, and thus an exact, accurate match.

ASKER

What we do is search for the least "expensive" solution. We search for 1 row results, then move on to 2 row results, etc, until, after 5 row results, we try different approaches (or so we wish). However, two things should be noted:

1- This case has a lot of 2 level matches, but there are many cases where there aren't any.

2- Even when there are 2 level matches, we sometimes need to delve further.

Basically, we are trying to find out why something is amiss, out of some grouped grand totals that are calculated by the application. We sometimes have low level matches which turn out not to be relevant to the issue, so we need to keep digging.

We had set up a 5 row sum as the limit, since we thought that could be done in a reasonable amount of time. This will force us to re-think our limit, and maybe how we can debug these situations, but the fact stands that, even when there are lower level matches, we sometimes need higher level matches as well.

Which is why I was sort of emphasizing a working 5 rows solution. If we have the final one, we can infer from that and resize to generate lower level ones. However, as it doesn't seem to be possible to create one with 5 rows, I guess we'll have to settle for a working 2nd level solution for larger subsets, and a 4 level solution for smaller ones. We'll have to try and think of other solutions for this. Maybe a desktop based app in VB.Net or even plain C would be faster.

ASKER

>> Btw, you definitely should NOT be using data type "real" for this, since real is an approximate value only.

>> You should use decimal, which provides an exact data value, and thus an exact, and accurate, match.

Actually, that was my bad. The DB was created by another company and I thought it was real. Turns out it was decimal(14,2). That will actually force me to change lots of code unrelated to this question, but which should, nevertheless, be optimized.

>>> Maybe a desktop based app in VB.Net or even plain C would be faster.

it's not the database that is slow. it's the combinatorics.

I'm not saying you can't possibly write a C program that would be faster, but it will be a marginal difference against the fundamental combinatorial issue.

you can try tackling the problem by running multiple threads, each examining a different set of values. That will consume more total cpu resources but may find solutions in faster "wall clock" time.

When I say combinatorics is the issue, maybe showing the math will help explain why.

Assuming 5000 rows you have the following number of combinations to examine at each level.

5000

25000000

125000000000

625000000000000

3125000000000000000

Hopefully with over 3 quintillion possible combinations that makes it clear why the algorithm or language used for analysis isn't particularly important. There are simply too many possible combinations.

The ONLY way this can be solved to 5 levels in any kind of reasonable time is IF AND ONLY IF, the number of valid combinations at each level is small. Due to exponential growth, it must be a very, very small number of possibilities otherwise you'll still get unmanageable growth.

Including negatives means you are unlikely to ever get that level of filtering at any level.
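The figures above are 5000^k, i.e. ordered picks with repetition; the deduplicated counts are the binomial coefficients C(5000, k). A quick Python check (illustrative only) shows both side by side:

```python
from math import comb

n = 5000
for k in range(1, 6):
    # n**k counts ordered picks with repetition; comb(n, k) counts true row combinations
    print(k, n**k, comb(n, k))
```

Either way, the level-5 count lands in the 10^16 to 10^18 range, so the scale argument holds.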

While that is an issue, you're overstating the problem by at least double. He doesn't need duplicate combinations, and you don't join a value to itself.

So the 2nd level is really more like:

12497500.

yes, I didn't specify any of the filters we've already identified.

plus, we already know it's not exactly 5000 rows.

and I don't know how much the filtering will really reduce it.

I wasn't trying to be precise, there are too many unknowns to make that possible.

I'm trying to indicate scale.

But, your comment helps to illustrate the problem.

"at least double" makes it sound like I'm way off so maybe it's not as bad as I'm suggesting.

but I say "double" is completely inconsequential here. The magnitudes are so large as to make a factor of two laughable.

Let's say my simplification is off by a factor of 100000000 (one hundred million) rather than 2.

That means there are many billions of possible combinations, which still makes the problem intractable for most systems and certainly for real-time, user interactive computation.

Again, the math makes the algorithm inconsequential. If the data allows for significant combinations at each level then you're fighting exponentiation.

You WERE way off. Off by a MINIMUM of a factor of 2 is not a trivial amount. That is only at the 1st level. That means by the 5th level it is off by at least 32x.

Again, it's the negative values that really prevent any kind of useful optimization for higher levels. Otherwise you'd have a good chance of drastically reducing the number of computations required at upper levels.

>>> You WERE way off. Off by a MINIMUM of a factor of 2 is not a trivial amount.

You say 2 is a lot, I say 2 is laughably, trivially inconsequential. Sure, 1.56 quintillion is a lot smaller than 3.1 quintillion. But who cares? Being off by double makes no practical difference at all. Could you actually examine a quintillion different combinations? If not, then it's effectively no difference at all to an implementable solution.

I'm not sure why you highlighted the word minimum. I already addressed the consequences of making the error much, much larger. Since I already said that I was intentionally being imprecise; because it simply doesn't matter except under special circumstances, why bother?

If your goal is simply to get me to concede that there are times when it can be better than double, I already did; but I'll say so again. In fact, I'll concede the factor could be even higher than 100 million. It could be off by quadrillions; in fact, it could be 100% wrong. There might be 0 possible combinations. But who cares? The point was not to say it will always be "X" many combinations. The point was the problem-space is potentially big, very big. So big that even reductions that might "feel" like a lot: half, a tenth, a millionth are still not enough to make a practical dent in it.

As far as I can tell, we're in agreement as to the intractability of this problem. I'll try again to explain in a way that is hopefully more agreeable.

>>> If the data allows for significant combinations at each level then you're fighting exponentiation.

I've tried saying it in different ways but that line is the key to everything I've said above, both as to how/when it can be solved and when/why it can't be solved.

The reason a positive-only set of data helps is because it can drastically reduce the number of legal combinations at each level. This is still not a guarantee, but certainly better than when negatives are included. Positives only can still produce an unwieldy number of combinations per level.

The reason tiny data sets work is even the worst case scenario doesn't have enough data for exponentiation to explode.

The reason small targets with large values work is that (even with a few negatives or relatively small absolute values) you can still filter a significant number of combinations early enough.

Large totals and small values, or lots of negatives or large negatives don't work because the number of combinations at each level is too many so you'll "effectively" get exponential growth. As ScottPletcher pointed out, it's not "true" exponentiation, but it can easily be large enough to not matter in a practical sense.

My use of the imprecise terms "large", "small" and "tiny" will likely elicit more complaints but we simply don't have enough information to say how many is a "large amount" and how many is a "small amount". It's all relative. You could have a million rows, but if they are all positive values and your target sum is 0.12 then you'll be able to filter out "many" of the candidates at each level, thus exponentiation should not be a problem.

Reduce the number of rows, increase the target sum - some will be solvable, some won't. It's a balance in the data between growth and shrinking.

Add negatives, few/many, small/large absolute values. The combinations will increase, but how much is, again, a relative matter based on the other values and the target sum.

Given a specific set of data constraints it's possible to calculate more precisely how many rows you might expect in the 5-level sums; but we don't have those, so I'm reduced to saying it "could" be large, thus making it a problem with an unreliable solution.

The possible combinations could be anywhere in the quintillions down to zero. So, a factor of 2, a factor of a billion, a factor of a trillion - whatever it is. The algorithm used to find them does not, in any way, change the number of results.

I hope that helps
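The "balance in the data" argument can be made concrete with a small sketch. This is hypothetical Python (integer values standing in for the decimal amounts, not the posted T-SQL): with positive-only values sorted ascending, a depth-first search can cut off an entire branch as soon as the partial sum overshoots the target.

```python
def find_sums(values, target, max_rows=5):
    """Depth-first subset-sum search limited to max_rows picks.
    Assumes all values are positive, which is what makes the break valid."""
    vals = sorted(values)
    results = []

    def search(start, partial, chosen):
        if partial == target and chosen:
            results.append(tuple(chosen))
        if len(chosen) == max_rows:
            return
        for i in range(start, len(vals)):
            if partial + vals[i] > target:
                break  # sorted ascending: every later value overshoots too
            chosen.append(vals[i])
            search(i + 1, partial + vals[i], chosen)
            chosen.pop()

    search(0, 0, [])
    return results
```

With negatives in the data the `break` is no longer sound (a later negative could bring the sum back down below the target), which is exactly why negatives kill the optimization.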

Just a recap of what you've said before. Still false, but a nice recap.

You can't seem to grasp the concept of a binary search on values, so we'll never agree on this.

As stated, though, negative values render much of the possible efficiencies moot anyway.

There might be other mathematical tricks that could speed this up (similar to the algorithms used to speed up finding the places of pi, for example), but, even if they exist, I don't have the pure mathematics knowledge to apply those to something like this.

If you want other options to make better filters to "fight" expansion at each level, try looking for minimum and maximum values and only pursue further levels if the partial sum could reach the target within the known range of the data.

For example: target 100, after 3 levels I have a partial sum of 200

If the minimum value is -25, then I know I don't have to search any further because even adding the minimum twice (whether it exists twice or not isn't important) I can't reach the sum, so I don't need to look any further.
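That cutoff generalizes to a simple bounds test. A hypothetical Python sketch (the function and parameter names are made up) using the data's known min and max:

```python
def can_still_reach(partial, target, picks_left, min_val, max_val):
    """Prune a branch when even the extreme remaining values can't close the gap."""
    lowest = partial + picks_left * min_val
    highest = partial + picks_left * max_val
    return lowest <= target <= highest

# the example above: target 100, partial sum 200 after 3 levels, minimum value -25;
# even two more picks of -25 only get down to 150, so stop searching this branch
print(can_still_reach(200, 100, 2, -25, 500))
```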

>>> Still false, but a nice recap.

Thanks, but that's not helpful.

Please correct any place I made a mistake. Don't let my error become part of the PAQ.

>>> You can't seem to grasp the concept of a binary search on values, so we'll never agree on this.

I understand a binary search - I do concede that I don't know how you're trying to apply it here.

Please illustrate.

If you have code that is better than what is above, please post it, or at least pseudocode and I'll write it myself. Preferably using the asker's existing table structure so as to reduce ambiguity and errors in translation.

ASKER

@ScottPletcher: I have pinpointed where your code goes wrong. You use this part:

```
INSERT INTO orig_values
SELECT
    rh_ident, vcr_valor_bruto
FROM dbo.rh_vcrt0
-- can't use this because of NEGATIVE numbers in the input data!!
--WHERE
--    vcr_valor_bruto <= @value_to_find
ORDER BY
    vcr_valor_bruto

SET @orig_count = @@ROWCOUNT
```

to insert values in the orig_values table, but rh_vcrt0 doesn't have any rh_ident field. I'm not sure how you're planning to make this work though, so I don't think I can correct it myself.

You can test his code with the changes I made above.

or, if you don't want to change your own objects.

Do a search and replace on his code to replace rh_vcrt0 with ScottPletcher (or other name)

then create a new ScottPletcher table and populate as shown in http:#a38516676

ASKER

@sdstuber: I adapted your code to do filtering, which improved the speed quite a lot. Basically, I reversed the sign for the joins and added the check again:

We can filter it this way because the subset is ordered: if the next value is negative, then all previous ones were as well. v3 will always be a larger number than v2, so if v1+v2 is already larger than our sum, this is an undesired value. And if v3 is negative, then v1+v2 is also negative, so the total will only decrease, and thus will always stay smaller than the desired target, since we already discarded negative sums as unimportant.

This took 1m30s on the smaller subset, returning 225 rows. So, for smaller subsets, this is feasible, at least for certain rows/sums.

On the larger subset, a 3 row solution was cancelled after 6 minutes, returning 1226 rows so far, so this level remains impractical for larger subsets.

So, you can filter the results, as long as you order the subset. If only the query could be sped up, or some other filtering could be achieved. At this point, I'd be happy with a 3 row solution for the larger subset.

```
DECLARE @Emp int, @Vl decimal(14,2)  -- decimal, not real: real is approximate and can miss exact matches
SET @Emp = 101
SET @Vl = 112.41;
WITH v AS (
    SELECT vcr_emp_empresa,
           vcr_num_empregado,
           vcr_vnc_cod_venc,
           SUM(vcr_valor_bruto) vcr_valor_bruto
    FROM rh_vcrt0
    WHERE vcr_emp_empresa = @Emp
    GROUP BY vcr_emp_empresa,
             vcr_num_empregado,
             vcr_vnc_cod_venc
),
numbers AS (
    SELECT ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto) id,
           vcr_valor_bruto,
           vcr_valor_bruto n,
           vcr_emp_empresa,
           vcr_num_empregado,
           vcr_vnc_cod_venc
    FROM v
)
SELECT v1.vcr_num_empregado "Nº1", v1.vcr_vnc_cod_venc "Venc1", v1.vcr_valor_bruto "Vl1",
       v2.vcr_num_empregado "Nº2", v2.vcr_vnc_cod_venc "Venc2", v2.vcr_valor_bruto "Vl2",
       v3.vcr_num_empregado "Nº3", v3.vcr_vnc_cod_venc "Venc3", v3.vcr_valor_bruto "Vl3",
       v4.vcr_num_empregado "Nº4", v4.vcr_vnc_cod_venc "Venc4", v4.vcr_valor_bruto "Vl4",
       v5.vcr_num_empregado "Nº5", v5.vcr_vnc_cod_venc "Venc5", v5.vcr_valor_bruto "Vl5"
FROM numbers v1
INNER JOIN numbers v2
        ON v1.id < v2.id
       AND v1.vcr_valor_bruto + v2.vcr_valor_bruto < @Vl
INNER JOIN numbers v3
        ON v2.id < v3.id
       AND v1.vcr_valor_bruto + v2.vcr_valor_bruto + v3.vcr_valor_bruto < @Vl
INNER JOIN numbers v4
        ON v3.id < v4.id
       AND v1.vcr_valor_bruto + v2.vcr_valor_bruto + v3.vcr_valor_bruto + v4.vcr_valor_bruto < @Vl
INNER JOIN numbers v5
        ON v4.id < v5.id
       AND v1.vcr_valor_bruto + v2.vcr_valor_bruto + v3.vcr_valor_bruto + v4.vcr_valor_bruto + v5.vcr_valor_bruto = @Vl
```

>> So, you can filter the results, as long as you order the subset. <<

Which is EXACTLY what I've been saying all along. It's sdstuber that stated differently, that you couldn't "fight math" and the inevitable growth of extrapolations.

I ordered the initial input IMMEDIATELY in a table keyed that way. This would allow max efficiency later.

That's also why I changed the three controlling columns to one ident, for efficiency. You can use the one ident to lookup the original three values when you're done computing.

>>> So, you can filter the results, as long as you order the subset. <<

that's exactly what I did originally (and all along) ! reversing the signs doesn't really do anything

>>> It's sdstuber that stated differently,

nope - If that's how you read what I've posted, then it's no wonder you think you're doing something different than I am. That explains everything

well, not everything, but a lot

>> Basically, v3 will be a larger number than v2, so if v1+v2 are larger than our sum, this is already an undesired value. <<

Not with negative numbers present. v1+v2+v3 (where v3 is negative) could still yield your desired total.

That is the big problem. Once you introduce negative numbers, you lost almost all efficiency gains in searching.

You do perform some bounds checks on the negative values, but they seemed to cover quite a spread, so I don't think even that helped on the extremely large input sample.

The only sort you're doing is part of the query -- you can't possibly exclude large numbers of rows from consideration if they're already part of the query.

You never stored any intermediate results sorted in a table so that value ranges could be checked BEFORE doing the next level of calcs.

This tells you IN ADVANCE whether you have too many possibilities to proceed, BEFORE running the next query and letting it time out.

ok, I'll try one more time with a concrete (silly, but mathematically sound) example

my data consists of 100 rows, all of which have vcr_valor_bruto = 1

my target value is 6.

the only filter that will actually do anything to reduce the number of combinations evaluated is the simple v1.id < v2.id (or v2.id < v1.id - there is no difference) until you get the final set which will be checked against 6 and kick them out.

At level 3 that will produce 161,700 combinations to search through. None of which will be returned, but every one of them must be evaluated.

That's worst case scenario. I fully expect the real data will be much, much better for us in a practical sense. However, unless there are some special constraints that haven't been mentioned, we have no reason to believe that the combinations won't grow in the direction of worst case and become unsolvable in a practical sense
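That worst-case count is just C(100, 3); checking it with a line of Python (illustrative only):

```python
from math import comb

# 100 rows all equal to 1, target 6: the id ordering reduces the search to
# combinations, but every 3-row combination still sums to 3 and must be tested
print(comb(100, 3))  # → 161700
```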

>>> You never stored any intermediate results sorted in a table so that value ranges could be checked BEFORE doing the next level of calcs.

the nature of joins is my "storage" operation.

there is no such thing as a 5-way join. There are only 2 way joins done in sequence.

When I join the table to itself the first time, that produces some set of legal combinations (maybe large, maybe small, we don't know. it's immaterial)

Then I join THOSE results (without needing to store them independently) to the table for the second time (third instance) I don't need to have the results of the first join stored in a table simply to read them back out again. I'm using the results of that first join already.

that's likely why all of the SQL approaches have been so much faster than your procedural approach. They are doing the same thing, except without the overhead of extra insert/select operations.

>> None of which will be returned but every one of them must be evaluated. <<

NO, they DON'T.

Add the highest level 2 value (keyed lookup, one read ... if stored) and the highest original value (not in the level 2 combination) -- if that total is less than the total you are seeking, you KNOW that NO level 3 combination will work.

Add the two highest level 2 values -- if that is less than the total you need, then you know that NO level 4 will equal your value.

All of that can be determined with almost no additional I/O.

For positive numbers only, you could do a binary search on the level 2 values to eliminate some combinations -- hopefully a lot, but that depends on the specific values.
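The "binary search on the values" idea can be sketched like this (hypothetical Python, not ScottPletcher's actual procedure): with the values sorted, the partner needed to complete each pair is looked up directly instead of scanned.

```python
from bisect import bisect_left

def two_level_matches(vals, target):
    """Return index pairs (i, j), i < j, with vals[i] + vals[j] == target.
    vals must be sorted ascending; bisect replaces the inner full scan."""
    pairs = []
    for i, v in enumerate(vals):
        need = target - v
        j = bisect_left(vals, need, i + 1)  # binary search in the tail only
        while j < len(vals) and vals[j] == need:
            pairs.append((i, j))
            j += 1
    return pairs
```

Each row then costs O(log n) instead of O(n), and the same sorted structure supports the max-value cutoffs described above.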

ASKER

>> Not with negative numbers present. v1+v2+v3 (where v3 is negative) could still yield your desired total.

If you ensure that they're ordered, it couldn't, because if v3 is negative, then so are v1 and v2. And if v1+v2 is bigger than the sum, then any v3 will be bigger than v2 (negatives would have been passed by this stage). The only way this condition wouldn't work would be with a negative sum. But not only did we discard this option already, I actually think negative sums are easy: all I need to do is reverse the signs on all values and search for the positive.

>> reversing the signs doesn't really do anything

Actually, it's reversing them that allows you to filter results. If you start from negative to positive, you can know that, once you've passed the desired result, further numbers aren't desirable. If you start from positive to negative, any number (since the largest negative will be the last record) can still yield a desired result.

I've actually thought that you could further filter results by storing the MAX(vcr_valor_bruto) in a variable. Then you could check: if v1+v2+@MaxVl > @Vl, you can proceed (for a 3-row solution; for a 5-row solution it would have to check @MaxVl*3). If v1+v2+@MaxVl < @Vl, then you can't possibly have a solution for that particular join. I believe this was suggested somewhere above.

ScottPletcher, I'm still not completely sure how you're hoping to achieve this. I think you want to create an extended 2 rows solution, that is, a table that has 2 row combinations that either yield the desired result, or that are still valid (not larger than the sum), hoping to later combine it with itself. If that is the case, then I can understand how negatives would hurt your solution. Nevertheless, it was still a valid one, given your knowledge of the OQ, though I'm not sure how you would generate a 3 or 5 row solution from it.

As I stated earlier, you can gen the 3-level solution by *selectively* (with only pos values) joining the necessary parts of the 2-level solution to the original values.

The four-level solution can be obtained by selectively joining the 2-level solution to itself.

With neg values, you could just compute the top two values in the 2-level solution, and use those to determine if ANY four-level solution is even possible. If not, you could exit the code if there are any significant number of original values, because as we all know a *full* explosion to five levels would produce an essentially unsolvable number of rows.

>>> you start from negative to positive, you can know that, once you've passed the desired result, further numbers aren't desirable.

yes, that's what I did in my query

ROW_NUMBER() OVER (ORDER BY vcr_valor_bruto)

that puts the values in increasing order, negatives first, positives last

>>>Add the highest level 2 value (keyed lookup, one read ... if stored) and the highest original value (not in the level 2 combination) -- if that total is less than the total you are seeking, you KNOW that NO level 3 combination will work.

what you just explained IS evaluating all of them. When you KNOW that NO level 3 combination will work, it's because you looked at each of them and they all dropped out of the join condition or loop condition if done procedurally or where clause on the extra select to previously saved results.

once the level 3 join eliminates rows the level 4 and 5 have nothing to do. That's the whole point of why I sorted and joined the way I did.

I tested your way for level 2 and it took 1:48; I tested my way and it took 0:04, using the same data on the same server in the same database. Please, if you have a method that is faster, post it and let us test it.

Clearly our words aren't convincing each other. Whether that's a failing of the writers or the readers or both is losing importance. Let the code do the talking, we can all run it and see for ourselves what works and what doesn't. Again, I apologize if I messed up testing your code; but I've posted exactly what I did to test it. If there is a failure in my test that skewed the results against you, please correct me.

I already stated, try the 2-level FIRST BEFORE STORING ANY RESULTS.

You only need to store results if you need to go to the 3 level AND the orig values is beyond a certain count.

>> what you just explained IS evaluating all of them. When you KNOW that NO level 3 combination will work, it's because you looked at each of them <<

NO, that's false again.

If I've stored the two level results, I can just read the two highest results and the highest original result (also stored keyed in my example), the 3-level impossibility can be done with NO other 3-level combinations tried.

Interesting discussion. :) I wish that I'd found it sooner....

For my money, I'd skip the SQL solutions and write this in C. It is, after all, a problem with a recursive solution and C (or a derivative) offers some of the best recursive solutions. Certainly the fastest. I would expect a C based solution to solve this with sub-second speed. Some DBMS actually let you write functions and stored procedures in C, but I don't believe that SQL Server is one of them.

One of the problems suggested above was to take the integer values from 1 to 20 and finding all combinations (/permutations) that sum to 37. In C it's trivial if you're comfortable with recursion.

Step 1 is to read the database table into a C array. That's 1 SQL query, no joins or other overhead passed off to the database. Even with a million rows, there's only a few MB of storage required. Far less than the DBMS needs to store and manage the rows.

Step 2 is to create an array to contain the values that make up the solution. In theory, this could be as large as the original array, but there's nothing preventing a rule that limits the number of elements in the solution to a fixed size. It was stated that 5 was the expected upper limit, so let's be generous and use 10. :)

Step 3 is to sort the array ascending. Note that there are no restrictions on signage, nor is there anything to prevent duplication of values. 10+10+10+7=37 and if there are 3 tens and a 7 in the data list, this algorithm will find that solution.

Overkill, I know, but the program below seems to solve this in a couple of milliseconds. Modify this to run the SQL to load the table, and perhaps to store the solutions and it's done.

```
//---------------------------------------------------------------------------
#include <stdio.h>
#pragma hdrstop
//---------------------------------------------------------------------------
#pragma argsused
int TValues[40];
int Limit;
int SValues[10];
int MaxValues;
int Target;
int SCount = 0;

void PrintSolution (int Depth)
{
    int idx;

    printf (" Found: %4d :", ++SCount);
    for (idx = 0; idx <= Depth; ++idx)
        printf (" %d", SValues[idx]);
    printf ("\n");
}

int FindOperand (int Sum, int Depth, int Limit)
{
    int TPosition;

    // Find the first value in the array that, when added to the running sum, is GE the target
    for (TPosition = 0; TPosition < Limit; ++TPosition)
        if (Sum + TValues[TPosition] >= Target)
            break;

    // If a solution is not found and sum(items) is less than the target,
    // then we need to keep the largest value and search for another token
    if (TPosition >= Limit)
    {
        if (Limit == 0)
            return 0;
        SValues[Depth] = TValues[Limit-1];  // Relative 0 indexing, doncha know....
        Sum += SValues[Depth];              // Carry the new sum to the next pass
        FindOperand (Sum, Depth+1, Limit-1);
        return 0;
    }

    // If a solution is not found and sum(items) is GE the target,
    // we need to grab a value less than the current value (previous in the list)
    // and add another item to the solution check
    if (Sum + TValues[TPosition] == Target)
    {
        SValues[Depth] = TValues[TPosition];
        PrintSolution (Depth);
    }
    for (--TPosition; TPosition > 0; --TPosition)
    {
        SValues[Depth] = TValues[TPosition];
        FindOperand (Sum + TValues[TPosition], Depth+1, TPosition-1);
    }
    return 0;
}

int main(int argc, char* argv[])
{
    int idx;

    // Set the value that we want to find.
    Target = 37;
    // Load 1 through 20 into the TValues array, sorted ascending
    for (idx = 0; idx < 20; ++idx)
        TValues[idx] = idx+1;
    // Set the number of items in the array
    Limit = 20;
    // Set the maximum number of values in the solution
    MaxValues = 10;
    FindOperand (0, 0, Limit);
    return 0;
}
//---------------------------------------------------------------------------
```

ASKER

At home now, so I can't really test that properly. However, I think you missed the point of the max we're talking about. What we need is ALL 5-value combinations. On a 150-row sample, that's pretty manageable. However, when you try to find ALL 5-row combinations, it does get slow. We're not limiting the number of results; we want them all. What we're limiting is the number of values to combine.

That is, you can run the array and combine it with itself to find 2 row solutions (solutions of 2 values that achieve the desired value). That is pretty standard. But what we need is to run the array and combine it with itself, then combine it with itself, then combine it with itself, then combine it with itself, filtering undesired results on the way.

It's likely that a C solution is faster than an SQL one. I just don't know if it's THAT fast. Can a 5-row solution return results from a 5k-row sample? If you wish to test it, there is sample data above that you can use (a bit over 5k rows). You can use the value 122.41, which is the one we've been using so far. Try to find all 5-row combinations in about 5 minutes. I don't think you can do it with pure recursion alone; there are too many combinations to run through them all, even ordering and filtering as you go. However, I'm curious to know whether a C solution can achieve a 5-row solution in any decent time (anything that takes more than 5 minutes isn't practical, other than to satisfy my curiosity ;)

Hi Cluskitt,

Perhaps two key variables in that code need some description.

*Limit* is the number of data points to examine. If there are 20 data points in the table, its value is 20. If there are 50,000 data points, its value is 50,000, etc.

*MaxValues* is the most number of items that will be returned for a solution. If 1, 2, 3, 4, 5, 6 sums to the correct target (21) it will not be shown because *MaxValues* is set to 5 and that sequence contains 6 points. However, 2, 3, 4, 5, 7 will be returned as it contains only 5 points.

I believe that's what you're looking for.

Modern processors are capable of multi-billion calculations per second. At such speeds, every item in a 1,000-item table can be compared to the other items more than 1,000,000 times in well under a second. That kind of performance isn't possible in SQL, where a Cartesian product is first created for each item in the table, then rows are filtered out. Multiple joins result in multiple Cartesian products and a huge number of rows.

It would probably take a whiteboard (or equivalent) to properly show the algorithm that the C code uses, but I'll be glad to give it a stab here.

Kent

I added

if (Depth >= MaxValues) return 0;

to the beginning of FindOperand so the recursion would stop after searching for 5 numbers.

If you set the limits higher, make sure you increase the array size too.

However, even though the original code doesn't appear to observe MaxValues, the search stops after a depth of 5 for the 1-20 set and target 37.

The code runs fast, but only returns a tiny subset of the possible answers. For example, running 1-20 with a target of 37 returns only 18 combinations out of 309.

Found: 1 : 20 17

Found: 2 : 20 16 1

Found: 3 : 20 15 2

Found: 4 : 20 14 3

Found: 5 : 20 13 4

Found: 6 : 20 13 3 1

Found: 7 : 20 12 5

Found: 8 : 20 12 4 1

Found: 9 : 20 11 6

Found: 10 : 20 11 5 1

Found: 11 : 20 11 4 2

Found: 12 : 20 10 7

Found: 13 : 20 10 6 1

Found: 14 : 20 10 5 2

Found: 15 : 20 10 4 2 1

Found: 16 : 20 9 7 1

Found: 17 : 20 8 6 3

Found: 18 : 20 7 5 4 1

It's always starting with the largest value in the set, but even with that limit it's still missing some of the expected results. For example: 20,11,3,2,1 isn't found.

Hi Cluskitt,

I've done some more work on this (I find it interesting) and several things come to mind.

In your data sample (5,000+) rows, there are duplicate values for CalcValue. If the duplicates are not filtered out, there will be duplicates in the answer.

I'll post the complete program, but it appears that we have a LOT more combinations that generate the target value than we thought.

When we limit the results to only 2-column sums AND we delete the rows that duplicate CalcValue there are 99 results. When we limit the results to 3 column sums there are 65,024 results. The 65,024 rows are written to a file in about a second.

The results are growing exponentially. Perhaps this whole thing needs to be rethought.

Kent

ASKER

Yes, I know there are duplicates. They have to be taken into account. That is, each is to be treated as an individual value. On the whole, there are no two rows that are exactly the same. One of the keys will change.

The fact that a 5 row combinatory calculation explodes into the billions and trillions of possible solutions is the problem we're fighting with here. We tried early filtering, to reduce this number, but it still seems a bit unmanageable, even for C.

Let's try looking at the problem a different way.

Let's say we give you a magic computer that has infinite speed and for some particular target sum you get a million different possible combinations.

What are you going to do with that information? Is it useful or is the answer only of use when it's something smaller and more manageable?

If a million is considered small (and, relative to what it could be, it's minuscule), what if it's a billion? Are the results still useful? What if there are 100 billion? A trillion?

At some point, the growth will be so high that there is no possible way you could ever use the results in a meaningful way, even if you could generate them instantly you couldn't display them or write them to disk fast enough.

or - Can you say with reasonable confidence that the data will not have that kind of growth? Are there other conditions that haven't been specified that will let us generate many tens of thousands of combinations without it ever exploding into mega-results?

I can create gigantic input sets (millions of rows) that can still be solved in a few seconds (the 0.12 sum-of-positives example above). I can create small sets (thousands of rows) that cannot be solved within hours (the all-1s example above; if you don't like a target of 6, how about a target of 5 - every row combination will be legal).

So, since the data will either make or break you here, can you tell us anything more about it?

ASKER

Well, obviously, if we get millions of records, it won't be useful. We will have to restrict the data further.

The point is to discover codes that aren't correctly parametrized in the application and thus generate a different total. We then have to check which values/codes might cause this. And if it were just this, it would be simpler, because we could check a few, eliminate some possibilities, and generate a smaller result. But in some cases, it's something that affects only a few employees (vcr_num_empregado), in which case codes that are correctly parametrized can still be part of the difference.

So, while some codes can later be eliminated, that will have to be due to human examination and the full results have to be calculated or, I guess, we can establish some ceiling of, for example, 500 results to be evaluated. Also, we first run the 1 row solution. If there are no results, then we move on to the 2 row solution, etc...

From experience, I'd say that most of the time, a 3 row solution would solve the case. However, we sometimes need to perform a 5 row one, for more complicated cases.

The real problem here is that, other than pure math, there really isn't any condition that can be used to exclude data automatically.

Maybe an easier path would be to create 5 row solutions for 1 code at a time. That is, the first table/array would be composed only of one code, which would then be fully combined. I suppose that would cut down quite a few possibilities, especially as some of the codes exist in a smaller number.

>> The real problem here is that, other than pure math, there really isn't any condition that can be used to exclude data automatically.

I think that that's been the point of most everyone that has posted, though the explanations come to that conclusion from different angles.

Recursively spinning over a modest volume of data that (in this case) probably fits into the CPU cache is the fastest solution that you'll get. But the number of possible solutions is huge.

The algorithm in that C program searches for combinations, not permutations. It does it by searching the data in such a way that the value list consists of items in decreasing order. (Technically, non-ascending if duplicates are included.) It may be possible to approximate the number of solutions for a given set of data, but it sure seems impractical (or impossible) to generate all of the solutions in a meaningful form.

Does it do you any good to hunt for solutions where the largest value is X? Or 100 solutions where the largest value is X? Searching for "contains x" doesn't get us anywhere, as "contains 1" means we still have to search all possible combinations of the entire list.

Kent

Oh. Here's the last C program that I used. I took some liberties with the data, squeezing out the duplicate values and converting the values to non-fractions by multiplying by 100. Those are performance boosters, but it's still a huge number of answers in the result set.

```
//---------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#pragma hdrstop
//---------------------------------------------------------------------------
#pragma argsused
double *TValues;          // the value table, in whole cents
double SValues[10];       // values of the current candidate solution
double Target;
int Limit;                // number of entries in TValues
int MaxValues;            // maximum solution depth
int SCount = 0;
FILE *R;

void PrintSolution (int Depth)
{
  int idx;

  fprintf (R, " Found: %d :", ++SCount);
  for (idx = 0; idx <= Depth; ++idx)
    fprintf (R, " %.0f", SValues[idx]);
  fprintf (R, "\n");
}

int FindOperand (double Sum, int Depth, int Bound)
{
  int TPosition;

  // Scan the list backwards from the starting point (Bound)
  for (TPosition = Bound - 1; TPosition >= 0; --TPosition)
    if (Sum + TValues[TPosition] == Target)
    {
      SValues[Depth] = TValues[TPosition];
      PrintSolution (Depth);
    }
    else if (Sum + TValues[TPosition] < Target)
    {
      if (Depth + 1 < MaxValues)
      {
        SValues[Depth] = TValues[TPosition];
        FindOperand (Sum + TValues[TPosition], Depth + 1, TPosition);
      }
      else
        return 0;  // sorted data: every remaining value is smaller still
    }
  return 0;
}

void LoadTable (void)
{
  FILE *F;
  int A, B, C;
  float V;
  int Status;
  int TableSize = 1000;

  F = fopen ("C:\\sqldata.txt", "rt");
  Limit = 0;
  if (F)
  {
    TValues = malloc (TableSize * sizeof (double));
    while (1)
    {
      Status = fscanf (F, "(%d,%d,%d,%f)\n", &A, &B, &C, &V);
      if (Status <= 0)
        break;
      if (Limit + 1 >= TableSize)
      {
        TableSize *= 2;
        TValues = (double*) realloc (TValues, TableSize * sizeof (double));
      }
      TValues[Limit++] = floor (V * 100.0 + 0.5);  // round to whole cents
    }
    fclose (F);
  }
}

void SortTable (void)  // insertion sort, ascending
{
  int x, y;
  double V;

  for (x = 1; x < Limit; ++x)
  {
    if (TValues[x-1] > TValues[x])
    {
      for (y = x - 2; y >= 0; --y)
        if (TValues[x] > TValues[y])
          break;
      ++y;
      V = TValues[x];
      memmove (&TValues[y+1], &TValues[y], (x - y) * sizeof (double));
      TValues[y] = V;
    }
  }
}

void PackTable (void)  // squeeze out duplicate values (table must be sorted)
{
  int Source;
  int Destination;

  Source = 1;
  Destination = 0;
  while (Source < Limit)
  {
    if (TValues[Source] != TValues[Destination])
      TValues[++Destination] = TValues[Source];
    ++Source;
  }
  if (Limit > 0)
    Limit = Destination + 1;
}

int main (int argc, char* argv[])
{
  int idx;
  FILE *F;

  Target = 11241;
  MaxValues = 3;
  LoadTable ();
  SortTable ();
  PackTable ();
  F = fopen ("C:\\Values.txt", "wb");
  for (idx = 0; idx < Limit; ++idx)
    fprintf (F, "%.0f\n", TValues[idx]);
  fclose (F);
  R = fopen ("C:\\Results.txt", "wb");
  FindOperand (0, 0, Limit);
  fclose (R);
  return 0;
}
//---------------------------------------------------------------------------
```

>>>I think that that's been the point of most everyone that has posted

That's also why I've been saying the algorithm and language don't matter. Since C can only do a few billion operations per second (in theory, and an "operation" doesn't equate to "evaluate one combination"), it's just way too slow for the bad cases. Of course a really bad algorithm can make the problem worse, but nothing "good" will ever be "fast" when the math is against you.

I hate to keep playing the role of doomsayer, but when the combinations are bad there simply is no way to tackle it in a practical sense.

>>> The algorithm in that C program searches for combinations, not permutations.

I'm pretty sure that applies to all of the code above. Other than the query in the original question text, everything does some sort of sorting to avoid duplicate permutations and only does combination searches.

>>> That is, the first table/array would be composed only of one code

If it's valid to only evaluate a subset of the rows you've presented, then by all means do that. Any rule, no matter how trivial it seems, that cuts the amount of data will help. In fact, the reduction helps exponentially in the same way that more data hurts exponentially: less data means fewer combinations, which inhibits the mathematical explosion.

If, however, you still need to check for cross-code sums, then it doesn't really help the exhaustive search.

I'm still a little fuzzy about what you want to do with large result sets. If we find 500 or even 50000 combinations quickly, but there are still another 146 million more after that, do the 50000 we found even matter?

ASKER

The only preferred data is the one that contains as few different codes as possible. That is, a solution that has 3 values with code 1 and 2 with code 6 is more relevant than one that has 5 values from 5 different codes.

From past experience, at least for level 3 solutions, I can say that, for the values used, most samples will return about 100-200 solutions. This is a manageable number which can be analysed by sight, given human intuition and all that.

Usually the differences (desired sums) for the larger subsets are larger, which actually generates fewer possible solutions. But that only helps with the number of results, not the processing power required, because, while larger numbers do exclude many possibilities, they only exclude them after they've all been calculated.

For example, if, given the sample provided, you were to search for a sum of 8952.32, there wouldn't be as many possible combinations as for 112.42, because the vast majority of values aren't large enough to produce that sum.

I will try to study your code and adapt it into my VB.Net project, which I'm experimenting with, since I'm more at home with VB than C, though I can use both. One thing I did notice is that you use a sorting procedure which, in the real case, won't be necessary, as the data would come from a DataTable (or similar) and I can have SQL sort it in advance.

A C program will definitely be faster than SQL; math calcs are not SQL's strong point. Even staying in SQL, the fastest option would likely be C in the CLR, but even that will be slower than native C.

ASKER

>> If we find 500 or even 50000 combinations quickly, but there are still another 146 million more after that, do the 50000 we found even matter?

If we find 500, but there are still 50000 more after that (our tolerance is lower ;) ), we just study the first 500 to check if there is anything to be made of it, otherwise we just drop it. At this point, we would have exhausted the 1, 2 and 3 row combinations (and possibly 4), so expectations would be low anyway and some other method would have to be found for discovering the problem.

>>> C program will definitely be faster than SQL; math calcs are not SQL's strong point.

Sure, but it's the same argument as my approximation errors before: if the C is a million times faster than SQL (which it isn't), it's still too slow to tackle result sets in the quadrillions.

But, based on the asker's last comment that 200 results is all that is expected, the early filtering that we've all been hoping would work might actually do it.

If that's the case, then the data is working for you, not against you, and the math works in your favor: you'll rarely get the mega-results, and if you do start down that path, 50000 is too many to worry about anyway, so we never have to get anywhere near the full combination.

I have some alterations to kdo's code that should make it more efficient, but with these new constraints I'm not sure it's even necessary. I expect we can either solve the small sets or give up on the big ones fairly quickly.

ASKER

I've been working on my own code as well, to see if I can come up with something reasonable (I'm actually also curious if I can get the full results without filtering, but I'll try at home where my computer is better ;)).

It seems to me that there are a few ways to speed up the filtering, especially working with an array:

-Working in ascending order (we tackle negatives first), we can check for possibilities. For example, let's suppose that the smallest value is -5000. We then add it to the 4 largest values in the array to see if we can possibly achieve our goal. If not, we just move on to the next position. If so, we go down one level and combine it with the next value, for example, -4500. Then we check whether -9500 + the 3 largest values lets us achieve our goal. And so on...

-As we've been using, we check if the current sum is already larger than the desired total.

These two conditions alone will filter many combinations, which can serve to end the current depth level prematurely.

One thing that has to be taken into account, though, is that we first generate a 0 depth level (or 1 row solutions), then a 1 depth level, etc. This maximum is a parameter which will be input by the user. In SQL, I was thinking of a stored procedure with a parameter; pure console C would have to read the value as input; VB.Net would most likely use a DropDownList.

I don't know if there is much of a loss in using VB.Net. Other than the initial stage, which is fixed, I just print into a label, instead of console. Processing power should be similar.

LOL. Suddenly now the math is working in our favor, when it was an insurmountable obstacle all along before.

>>> LOL. Suddenly now the math is working in our favor, when it was an insurmountable obstacle all along before.

I've always said that the data determines the success or failure. Not the method.

If the data will produce few combinations then it might be feasible to solve in real time

-i.e. the math works

If the data will produce many combinations then it won't be feasible.

- i.e. the math does not work

Nothing posted until half an hour ago gave any indication that the data would work in our favor. There was only the hope that it might sometimes. The "trick" now is that with a hard cutoff of 50000 we effectively no longer care if it's bad or not.

Up until the last half hour, the request has been to find them all; which meant bad cases were not solvable. Now "I give up" is a legal solution.

So "suddenly" it is in our favor and no longer an obstacle. The rules changed, but not my reasonings.

>>> For example, let's suppose that the smalles value is -5000. We then add it with the 4 larger values in the array to see if we can possibly achieve our goal.

Yes, this is where I was going when I suggested keeping track of the minimum and maximum values to allow early detection of impossible conditions without needing to try the intermediate values.

Wow, that's exactly the opposite of what you've been saying all along, that the only way to know which combinations would/would not work was to test them ALL.

A genuine mathematician might be able to use logs or some other math "trick" to speed this up considerably. You might want to post this problem on a mathematics site and see if they can give you some algorithm that would help.

Hi Cluskitt,

>> I've been working on my own code as well, to see if I can come up with something reasonable (I'm actually also curious if I can get the full results without filtering, but I'll try at home where my computer is better ;)).

I've got some killer processors at my disposal. It doesn't seem to help.

>>It seems to me that there are a few ways to speed up the filtering, especially working with an array:

-Working in ascending order (we tackle negatives first), we can check for possibilities. For example, let's suppose that the smallest value is -5000. We then add it to the 4 largest values in the array to see if we can possibly achieve our goal. If not, we just move on to the next position. If so, we go down one level and combine it with the next value, for example, -4500. Then we check whether -9500 + the 3 largest values lets us achieve our goal. And so on...

That's essentially what the C program does, except that it starts at the highest value and works down. This approach has the advantage of pre-filtering all items that are larger than the target value. They're identified in the initial loop and once found are no longer part of the equation.

>> I don't know if there is much of a loss in using VB.Net. Other than the initial stage, which is fixed, I just print into a label, instead of console. Processing power should be similar.

I don't think so. C is a sort of macro-language that compiles almost directly to assembly code. VB.Net is intended to manage visual (GUI) objects. I expect quite a large difference in performance.

Kent

>>> Wow, that's exactly the opposite of what you've been saying all along, that the only way to know which combinations would/would not work was to test them ALL.

nope

But I think maybe I understand your confusion. You're confusing potential and actual results.

I'll take the blame for that for failing to differentiate them well enough in my previous attempts.

My concern is when there ARE lots of combinations to be returned.

If there are only 10 legal combinations then I fully expect we'll be able to find those quickly because the joins, loops, filters, sorts, whatever will quickly throw out lots and lots of invalid combinations early and we'll only need to actually check a handful of 5 value sums to find those 10.

If there are 10 trillion legal combinations, then I fully expect we will NOT be able to find those quickly because there is little to be thrown out on each pass.

You can only validate that a 5 value sum equals the target by adding all 5 values, which, in the bad case, means doing that 10 trillion times. Even kdo's billion-operations-per-second system would still take hours to do that.

But you might be able to say a 5 value sum does not equal the target after looking at only a couple of values, thereby skipping a lot of that work.

Does that help?

It doesn't help; you're the one that keeps flopping all over the place depending on the latest iteration.

I'VE SAID ALL ALONG that there ARE mechanisms that can be used to quickly determine:

IF there is NO chance of a 3, 4 or 5 level solution, WITHOUT going thru all the permutations.

And that certain math techniques should be able to selectively filter the number of rows required to compute at each level.

Unfortunately, negative values massively limit the reduction.

With only positive numbers, much more filtering is possible.

ASKER

>> That's essentially what the C program does, except that it starts at the highest value and works down. This approach has the advantage of pre-filtering all items that are larger than the target value. They're identified in the initial loop and once found are no longer part of the equation.

Actually, we've concluded somewhere up there that you can't do that because of negative values. That is, a value can be larger than the desired sum, but negatives can bring that value down to a point where it becomes a valid solution. The only way you can filter them out is using ascending data. That way, you can be sure that, once you're over the sum, no further values are valid.

>>I'VE SAID ALL ALONG that there ARE mechanisms that can be used to quickly determine:

>> IF there is NO chance a of 3, 4 or 5 level solution, WITHOUT going thru all the permutations

sure, we've all said that from the very beginning

go back to the very first post in this thread

t1.ID_Value > t2.ID_Value

what's the point of that condition? To eliminate duplicate evaluation of effectively identical solutions.

go back to the first example of complete code in this thread http:#a38512067

Look at each join I posted, they each have a condition like this...

AND n1.n + ISNULL(n2.n, 0) <= 37

why? because I don't want to continue evaluating further joins if I already know the partial sum has exceeded the target. I'm skipping all the later work for 3, 4 and 5 value sums.

I've been consistent all along. Please reread my posts and feel free to ask for any clarifications. I'm sure there are some typos or other mistakes that might make my point confusing, but the intent has been the exact same thing every time.

>>> IF there is NO chance a of 3, 4 or 5 level solution,

go back to my previous post. I'm not really concerned as much about the case when there are only a few solutions. I'm concerned when there are many solutions, i.e. when there is not only a chance for 3, 4 and 5 level solutions but a certainty of millions, billions or trillions of them.

ASKER

>> I don't think so. C is a sort or macro-language that compiles almost directly to assembly code. VB.Net is intended to manage visual (gui) objects. I expect quite a large difference in performance.

I would expect C to be faster. Obviously, VB.Net would have to draw the objects, which would make it slower. But the part that is pure computation (just calculating valid results and dumping them into an array) shouldn't be much slower than C. Slower, but not much slower. I could be wrong, though, but I seem to have read somewhere that both C and VB are compiled to binary. VB.Net would be slower due to all the graphics involved, and dumping the array into a label would be slower than just printing to a console. But the math itself, which involves nothing more than memory addresses exchanging values with the CPU, should take about the same time whether in C, VB, Delphi, or similar.

There would be slower languages, namely all the interpreted ones like Flash, Java, PHP, etc. But the only one I could see being faster is pure assembly.

>> Actually, we've concluded somewhere up there that you can't do that because of negative values. That is, a value can be larger than the desired sum, but negatives can bring that value down to a point where it becomes a valid solution. The only way you can filter them out is using ascending data. That way, you can be sure that, once you're over the sum, no further values are valid.

You're correct. I missed that. Offhand, it would seem that attempts to simply bias/unbias the data to use only positive values may be more trouble than it's worth. But if all of the values are biased so that the lowest value is 0 (eliminating the need for handling negative numbers) the target value changes with the number of terms in the solution. (It's effectively OriginalTarget+(Bias*Depth)). There's no reason that the target can't increment as the depth increases just as the subtotal (Sum) does. Something else to try! :)

>> I'VE SAID ALL ALONG that there ARE mechanisms that can be used to quickly determine: IF there is NO chance a of 3, 4 or 5 level solution, WITHOUT going thru all the permutations

That's an interesting exercise. I think that I'll test the theory with the program (above) by creating a 5,000 item dataset that consists of the even numbers from 2 to 10,000, then searching for a sum that is an odd number. It can't possibly happen. The algorithm does have a short-stop to terminate the current loop when the depth (number of terms) exceeds the maximum, but there is no short-exit that lets it skip the sums that can't possibly succeed once the loop passes a certain threshold. I'm wondering if that's more overhead than gain? Hmmm....

>>I'VE SAID ALL ALONG that there ARE mechanisms that can be used to quickly determine:

>> IF there is NO chance of a 3, 4 or 5 level solution, WITHOUT going through all the permutations

>> sure, we've all said that from the very beginning <<

Geez, baloney. I'm saying NO further joins are required, NONE AT ALL, not filtered ones. You always present some non sequitur and then act as if it's a revelation to me.

>>> I'm saying NO further joins are required, NONE AT ALL, not filtered ones.

yep - exactly. I agree, I'm sorry we're talking past each other - thanks for participating though.

I think you could quickly tell whether ANY 3-level, 4-level or 5-level (or n-level) solution can ever reach the desired total as follows:

SELECT the n highest original values (this is quick, even with 5000 rows).

Add them, keeping each total separate.

That alone can tell you if:

1) ANY 3-level total can reach the value needed

2) ANY 4-level total can reach the value needed

3) ANY 5-level total can reach the value needed

From that, you can get some idea of the level of effort to find a match.

If NO 5-level could possibly match, then you know you'd have to go to at least the 6th level, and you could quickly calc the extreme number of permutations you'd have to have to find a solution.

Going back to the math problem....

I programmed and ran a test where the Values table contains 5400 values, the even numbers from 2 to 10,800, and the target value is odd. This is a good process for testing search timings as no solution can be found, so the process runs through all possible combinations.

I first tested for 11241, a value slightly higher than the largest value in the table. On a 3+ GHz processor, the program took about 13 seconds to complete. When the target value was reduced to 1121, the search ended almost instantaneously. When the value was increased to 112241 the search required about 1 second.

Searching for a small number is lightning fast due to the numbers larger than the target being disqualified (and ignored) in the initial loop. The search actually considered only 556 of the 5,400 values.

Searching for a large number took longer than I had expected, suggesting that there may be some improvement possible in the "we don't need to pursue this particular loop any longer" logic.

I'm going back to testing, but wanted to pass this along.

Kent

Don't forget about the negative values issue -- I still think negs are a serious problem to gaining performance, unlike sdstuber. We just have a different opinion on that.

Hi Scott,

I haven't forgotten. Before the night's over I'll test the changes to bias the entire table so that all values are positive. Pretty easy change, actually.

Just an FYI, but the conversation keeps drifting between Combinations and Permutations. Early on the OP indicated that he's looking for combinations.

Kent

>>> unlike sdstuber. We just have a different opinion on that.

nope!

Please email me offline if you want to pursue your line of thought more. I'll be happy to keep trying to explain it to you but I've taken enough space in this thread so far.

Ok. We have a bit more to report.

Setting the maximum number of terms to 4, I find 20,997,754 solutions in the original data. The C program, using biased values to eliminate the issue of negative numbers, took 2:34 to find and write the solutions to disk.

That's also using a packed data table, where the duplicates are collapsed into a single row per unique value. In an offline discussion, sdstuber correctly pointed out that collapsing the duplicates can result in not finding correct solutions. If the target is 100 and the raw data contains 2 or more rows with a value of 50, the correct solution would report 50+50, but that solution is lost when only unique values are considered.

So we have 21M solutions of 1, 2, 3, or 4 columns added together. I propose that finding and counting the solutions for up to 5 columns is beyond this discussion. Actually using the results is waaaaaayyy out of bounds.

ASKER

First: Yes, what I want is combinations, not permutations. Solutions that are essentially the same values in a different order are not desired.

Second: No, you can't pack the data table. Each row has to be taken into account, even if they have the same value (even though, 99.9(9)% of the time, either the number or the code will vary, so the row is still unique, despite the value being the same).

Third: The sum I presented in the above example (112.42) was a real life example of a sum we needed to find... but on the smaller subset. There is a very low chance we ever need to find such a small sum on the larger subset. So, for the data presented above, you can search for values between 5k-10k. Sometimes we may drop to 2k, but that is rarer. As an example for your program, you can use the value 5234.2.

Fourth: Once you find 50k solutions, you can break and exit. No more will be evaluated. If necessary, you can add some line stating the percentage evaluated so far.

Hi Cluskitt,

That does leave the question of "how to break" out of the search. The C algorithm finds the solutions "in order". That is, the first solution found will have the largest value to the left and move in that fashion. Targeting 100 could result in:

100

99 1

98 2

98 1 1

97 3

etc.

If you "just break" out of the search, the solutions are all clustered where the leftmost column contains the largest number possible. Most searches would behave similarly as the search is typically an ordered process. Is that adequate or do you need/want a representative sampling across the data?

And I reran the search with a maximum of 4 columns, with the duplicated data. 149,800,308 combinations found in 15 minutes, though some should filter out as the list contains both solutions with duplicated values (98+1+1) as well as duplicate solutions (99+1) when a value (in this case, 1) occurs twice.

Kent

ASKER

You need to run each solution independently, by user input. That is, user inputs level 1, you only run 1 row solutions. User inputs level 3, you only run 3 row solutions (disregarding 1 and 2 row solutions), etc. I said this was necessary somewhere up above. In which case, stopping after a certain number has been achieved is easy.

The way I'm doing the code is basically add a check for Current Level and Max Level (this last is a user input variable).

By the way, I'm curious how long would a non-packed 3 row solution take for you. At this exact moment, I'm running a 3 row solution for the sum 7554.07. I'm waiting to see how long it will take on VB.Net. I'll post my code soon, to see if any change should be made.

Also, I realized that the condition to search for current sum + highest values (to see if sum can be achieved) is pretty much irrelevant. Almost always, there will be a few values that are way higher than the desired sum, as you can see in the sample (there is a value over 20k, one over 10k, etc).

4 columns in 15 minutes? wow, that definitely beats anything I've tried (sql or c) and by wide enough margin that I don't think I can chalk it up to hardware.

what's your current code look like?

ASKER

Ok, I don't think my code is running correctly. I must be doing something wrong, though I'm not exactly sure. A 3 column search for 7554.07 was taking over 45 minutes (I went to lunch and it was still running when I got back). It shouldn't take so long, even if this code is running slower. This is my code:

```
Imports System.Data.SqlClient
Public Class frmMain
    Private iVal(,) As Double
    Private iCol(15) As String
    Private CurrentLine, TotalLines, CurrentLevel, MaxLevel As Integer
    Private MaxVal, DesiredSum, CurrentSum As Double
    Protected Sub GetValues()
        Dim sConn As New SqlConnection("server=SERVER\SERVER;uid=rhxxi;pwd=rhxxi;database=rhxxi;")
        Dim sqlDA As New SqlDataAdapter("SELECT vcr_num_empregado,vcr_vnc_cod_venc,vcr_valor_bruto " & _
                                        "FROM rh_vcrt0 " & _
                                        "WHERE vcr_emp_empresa=@Emp " & _
                                        "ORDER BY vcr_valor_bruto,vcr_num_empregado*1,vcr_vnc_cod_venc", _
                                        sConn)
        sqlDA.SelectCommand.Parameters.Add("@Emp", SqlDbType.Int).Value = txt_Emp.Text
        Dim Table1 As New DataTable
        sqlDA.Fill(Table1)
        TotalLines = Table1.Rows.Count - 1
        CurrentLine = 0
        ReDim iVal(3, TotalLines + 1)
        For a = 0 To TotalLines
            For b = 0 To 2
                iVal(b, a) = Table1.Rows(a).Item(b)
            Next
        Next
    End Sub
    Private Sub btnCalc_Click(sender As System.Object, e As System.EventArgs) Handles btnCalc.Click
        If Not Integer.TryParse(txt_Emp.Text, New Integer) Or Not Double.TryParse(txt_Sum.Text, New Double) Then
            lbl_Msg.Text = "Insira valores válidos para empresa e soma"
        Else
            lbl_Msg.Text = ""
            Dim timNow As DateTime = Now
            DesiredSum = txt_Sum.Text
            MaxLevel = cbo_Nivel.SelectedIndex + 1
            GetValues()
            lbl_Msg.Text = "SQL: Demorou " & DateDiff(DateInterval.Second, timNow, Now) & " segundos (" & TotalLines & " resultados)"
            timNow = Now
            SumArr(0, 1, 0)
            lbl_Msg.Text &= vbNewLine & "Cálculo+Print: Demorou " & DateDiff(DateInterval.Second, timNow, Now) & " segundos (" & CurrentLine & " resultados)"
        End If
    End Sub
    Protected Sub SumArr(CurSum As Double, CurLevel As Integer, CurIndex As Integer)
        Dim CurSum2, CurSumMax As Double
        For a = CurIndex To TotalLines
            CurSum2 = CurSum + iVal(2, a)
            If CurSum2 > DesiredSum Then
                Exit Sub
            End If
            If CurLevel < MaxLevel Then
                CurSumMax = CurSum2
                For r = TotalLines To TotalLines - (MaxLevel - CurLevel) Step -1
                    CurSumMax += iVal(2, r)
                Next
                If CurSumMax < DesiredSum Then
                    Exit Sub
                End If
                For x = 0 To 2
                    iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
                Next
                SumArr(CurSum2, CurLevel + 1, a + 1)
            Else
                If CurSum2 = DesiredSum Then
                    For x = 0 To 2
                        iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
                    Next
                    GenRow(iCol)
                    If CurrentLine >= 500 Then
                        'TODO: %
                        Exit Sub
                    End If
                End If
            End If
        Next
    End Sub
    Protected Sub GenRow(iCol() As String)
        For Each ctr As Label In pnlMain.Controls.OfType(Of Label)()
            If Microsoft.VisualBasic.Left(ctr.Name, 4) = "lbl_" Then
                ctr.Text &= iCol(Microsoft.VisualBasic.Right(ctr.Name, Len(ctr.Name) - 4)) & vbNewLine
            End If
        Next
        CurrentLine += 1
    End Sub
    Private Sub frmMain_Load(sender As Object, e As System.EventArgs) Handles Me.Load
        cbo_Nivel.Items.Clear()
        cbo_Nivel.Items.Add("1")
        cbo_Nivel.Items.Add("2")
        cbo_Nivel.Items.Add("3")
        cbo_Nivel.Items.Add("4")
        cbo_Nivel.Items.Add("5")
        cbo_Nivel.SelectedIndex = 0
    End Sub
End Class
```

I don't know if I have a flaw in my logic, or if I can do things more efficiently. I don't have much experience with recursive combinations, but it seems to me, logically, that this should be a correct way of doing so. But maybe there's a simpler way which I'm not considering.

I'll rerun the test to look for 5235.2. My guess is that the time jumps way up. WAY up.

Grouping the items by the number of rows is trivial. 1 or 2 columns returns the results instantaneously, and 3 columns is still sub-second. Running the search once for 1, 2, 3, and 4 column solutions takes about 1 second longer than running it once for all solutions. Far superior to sorting the results later.

It's trivial to make the number of columns a command line argument. If you use this program we can make that happen.

findstuff -1

Currently, the program reads the input from your SQL source, though I have removed the text so that just the data points remain. The data file looks like this:

(32,1,1,3641.00)

(32,1,6,133.44)

(32,10014,1,585.46)

Apples to Apples. Looking for all 1, 2, and 3 column solutions where the target is 7554.07 takes about a second. 33 results are found.

Looking at your code, I believe that the SumArr function is doing entirely too much work, resulting in the high search time. All it needs to do is scan the relevant portion of the data list from the position of the previous element +1 to the end (line 49 does this) and take one of three actions depending on the sum of the previous columns and the current value. If the sum is greater than the target, ignore it. (Since you're examining the values in increasing magnitude you can exit the function.) If the sum is equal to the target, print it (and exit the function since the rest of the list can't possibly produce a unique result). If the sum is less than the target, add another column or continue the loop depending on the current column number.

ASKER

I already do this. If the sum is greater, exit sub, if the sum is the same, print it (I don't exit because I want other similar values), if it's smaller, keep going.

The point of limiting the number of columns to user input isn't to save time. It's to keep results separate. We always search 1 row solutions first. When we examine them, if we still need to find results, then we move on to 2 row solutions. At this point, seeing as we've already examined 1 row solutions, all we want is 2 row solutions. No need to repeat data that was already examined and that will only add to the total number (which will cut off at X lines, 500 in my code).

ASKER

Without printing anything anywhere, just running the calculations (commented the iCol and GenRow lines), a 3 row solution for 7445.2 calculated 0 results after 1145 seconds, or just a bit over 19 minutes.

This is the only important part of the code:

```
Protected Sub SumArr(CurSum As Double, CurLevel As Integer, CurIndex As Integer)
    Dim CurSum2 As Double
    For a = CurIndex To TotalLines
        CurSum2 = CurSum + iVal(2, a)
        If CurSum2 > DesiredSum Then
            Exit Sub
        End If
        If CurLevel < MaxLevel Then
            'For x = 0 To 2
            '    iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
            'Next
            SumArr(CurSum2, CurLevel + 1, a + 1)
        Else
            If CurSum2 = DesiredSum Then
                'For x = 0 To 2
                '    iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
                'Next
                'GenRow(iCol)
                CurrentLevel += 1
                If CurrentLine >= 500 Then
                    'TODO: %
                    Exit Sub
                End If
            End If
        End If
    Next
End Sub
```

One thing that could be causing it is that maybe an array of doubles isn't as effective, especially because it has to be a two-dimensional one (to store number, code and value). I don't know if your code is taking that into account.

Could you try to translate the above to C and see if there's any significant difference in performance to your own? If there is, I imagine I implemented this the wrong way. If not, then VB.Net is WAY slower. Possibly due to iterating a multi-dimensional array.

Hi Cluskitt,

Lines 55 - 64 have a couple of inefficiencies.

It looks like they pre-determine if calling the function to process the next column will be successful. If the answer is "no", the call isn't made, but if the answer is "yes" the effort is redundant since the function will need to recalculate things to process the next column. I'd just call the function and let it make its own determination.

Lines 62 - 64 need to be moved outside of the function, probably to GenRow(). It looks like you're doing a "binary to display" conversion (float to string) every time the function advances to the next column. Depending on the number of rows in the search, the function could be spending more time in needless data conversion than in searching.

These are the critical lines:

```
If CurLevel < MaxLevel Then
    CurSumMax = CurSum2
    For r = TotalLines To TotalLines - (MaxLevel - CurLevel) Step -1
        CurSumMax += iVal(2, r)
    Next
    If CurSumMax < DesiredSum Then
        Exit Sub
    End If
    For x = 0 To 2
        iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
    Next
    SumArr(CurSum2, CurLevel + 1, a + 1)
Else
```

Move the float to string conversion outside of the loop and those lines can be shortened to:

```
If CurLevel < MaxLevel Then
    SumArr(CurSum2, CurLevel + 1, a + 1)
Else
```

Kent

GenRow increments the number of solutions found so commenting out the call to the function means that the program never acknowledges that a solution was found.

And taking 20 minutes is a good indication that VB is doing a lot of object management (probably memory) that's taking processor time away from the task at hand.

Kent

I thought of another option to take advantage of the 50000 cutoff.

Ignore the negatives.

kdo's original C code worked from largest to smallest so he'd get that "for free"

but if you're working from smallest to largest, start with whatever element is 0.

If, by chance, the data and target are such that you can't find 50000 results using only positive numbers then you can go ahead and use negatives to finish the set.

However, due to that complexity, it might be easier to just use kdo's top-down search.

The fact that it might drop out some results isn't particularly important if we know we can stop early anyway.

Of course, I'm still looking at the "bad cases" where there are many, many combinations - basically because that's the more interesting part of the problem. We can brute-force 1 and 2 value searches and a couple of trivial filters will let 3-value searches finish quickly too.

ASKER

When I commented GenRow, I moved the CurrentLine incrementation to just above it.

I'm converting floats to string to print them to a label. But even commenting those doesn't help either. Running just this code:

```
Protected Sub SumArr(CurSum As Double, CurLevel As Integer, CurIndex As Integer)
    Dim CurSum2 As Double
    For a = CurIndex To TotalLines
        CurSum2 = CurSum + iVal(2, a)
        If CurSum2 > DesiredSum Then
            Exit Sub
        End If
        If CurLevel < MaxLevel Then
            'For x = 0 To 2
            '    iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
            'Next
            SumArr(CurSum2, CurLevel + 1, a + 1)
        Else
            If CurSum2 = DesiredSum Then
                'For x = 0 To 2
                '    iCol((CurLevel - 1) * 3 + x) = iVal(x, a)
                'Next
                'GenRow(iCol)
                CurrentLevel += 1
                If CurrentLine >= 500 Then
                    'TODO: %
                    Exit Sub
                End If
            End If
        End If
    Next
End Sub
```

(basically, just counting results), still takes a long time. I tried removing the panel and running just the count. It took 811 seconds, about 13.5 minutes. It should be noted that:

1- This laptop isn't very good. In fact, I have a hard time with VS2010, because it constantly hogs most of the memory (1GB).

2- I'm running the exe that is created when I debug (i.e., not debugging, but using the exe directly).

However, while it might be a bit slower, these are the conditions I have to deal with, so a solution will have to work reasonably well in this computer.

Can you, using the VB.Net code, adapt your C to retrieve the data into a multi-dimensional array (always writing something like: Num1;Code1;Value1;Num2;Code2;Value2;etc...)?

I ask this because I have a feeling a multi-dimensional array will slow things down (even though we're only actively using one of the dimensions) and I would like to compare and see how it runs in this computer. If I can achieve a 3 row solution in a few seconds, then that is a LOT better than it was before. At the very least, a 5 row solution to smaller subsets would be possible (about 200-400 rows is the most common value. 5k+ is only for one company, though it's also important).

ASKER

Ignoring the negatives, while it would sound good for programming, wouldn't be advisable for finding a real solution. There are many codes that only have negative values, and those would be ignored. The only gain would be on gigantic results, but on manageable ones it would actually slow things down, I think. After all, the biggest problem is running the combinations. The number of results is secondary to that.

>>> There are many codes that only have negative values, and those would be ignored.

ok, I don't know your real distributions so thought I'd throw it out there.

data trumps algorithm :)

ASKER

The sample I provided is a real life sample. You can check this situation (the negatives) on codes like 160, 161, 182, etc.

Go to lunch, miss a lot. :)

VB does array bounds checking while C does not. Every single array reference requires additional computation to ensure that the reference is in bounds. So while C merely dereferences a pointer (Array + offset), which can be done in a single instruction (depending on the processor), the bounds check at least doubles the necessary CPU instructions, perhaps even more.

Something to note is that I'm mostly running this on my desktop computer. It's an AMD Phenom processor with 64KB of L1 cache. The entire program should fit into cache.

I suspect that the use of a binary search to determine the starting point of the loop within the recursive function may help. If only a few items are skipped, the trade-off is next to zero, but if the number of skipped values is more than a handful, there's a boost to be had. It's microseconds per pass, but we're performing a LOT of passes.

An earlier test described looking for 7554.07 in the original data. The 33 solutions that I find (with MaxColumns == 3) are returned in about a second. The recursive function is called 847,873,440 times to examine the data. So saving a few instructions per pass could be significant when adding that 4th column.

I resolved the issue with negative numbers by biasing all of the data (adding a value to all of the items so that there are no negative numbers). The search takes the bias into account when searching through the data.

Here's the code as it exists now. Be glad to explain anything.

```
//---------------------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#pragma hdrstop
//---------------------------------------------------------------------------
#pragma argsused
int TableType = 0;
double *TValues;
double SValues[10];
double Target;
double Bias = 0;
int Limit;
int MaxValues;
int SCount = 0;
long long Passes = 0;
FILE *R; // Results Stream
//---------------------------------------------------------------------------
void PrintSolution (int Depth)
{
int idx;
fprintf (R, " Found: %d :", ++SCount);
for (idx = 0; idx <= Depth; ++idx)
fprintf (R, " %.2f", (SValues[idx] - Bias) / 100.0);
fprintf (R, "\n");
}
//---------------------------------------------------------------------------
// FindOperand -- find the next suitable column from the list
//---------------------------------------------------------------------------
int FindOperand (double Sum, int Depth, int Limit, double Target) /* double: values and Bias are stored as (double) cents */
{
int TPosition;
++Passes;
// Scan the list backwards from the starting point (Limit)
for (TPosition=Limit-1; TPosition >= 0; --TPosition)
if (Sum + TValues[TPosition] == Target)
{
SValues[Depth] = TValues[TPosition];
PrintSolution (Depth);
}
else if (Sum + TValues[TPosition] < Target)
{
if (Depth+1 < MaxValues)
{
SValues[Depth] = TValues[TPosition];
FindOperand (Sum + TValues[TPosition], Depth+1, TPosition, Target + Bias);
}
else
return 0;
}
return 0;
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
void LoadTable (void)
{
FILE *F;
int A, B, C;
float V;
int Status;
int TableSize;
int i;
switch (TableType)
{
case 1:
TableSize = 6000;
TValues = malloc (TableSize * sizeof (double));
Limit = 5400;
for (i = 0; i < Limit; i++)
TValues[i] = i * 2 + 2;
break;
default:
F = fopen ("E:\\Documents and Settings\\Kent.YOUR-0B3AB6E8FA\\My Documents\\RAD Studio\\sqldata.txt", "rt");
Limit = 0;
if (F)
{
TableSize = 1000;
TValues = malloc (TableSize * sizeof (double));
while (1)
{
Status = fscanf (F, "(%d,%d,%d,%f)\n", &A, &B, &C, &V);
if (Status <= 0)
break;
if (Limit +1 >= TableSize)
{
TableSize *= 2;
TValues = (double*) realloc (TValues, TableSize * sizeof (double));
}
TValues[Limit++] = (double)(int)(V * 100 + 0.5); /* store as whole cents so == comparisons are exact */
}
fclose (F);
}
}
}
//---------------------------------------------------------------------------
//---------------------------------------------------------------------------
void SortTable (void)
{
int x, y;
double V;
for (x = 1; x < Limit; ++x)
{
if (TValues[x-1] > TValues[x])
{
for (y = x-2; y>=0; --y)
if (TValues[x] > TValues[y])
break;
++y;
V = TValues[x];
memmove (&TValues[y+1], &TValues[y], (x-y) * sizeof (double));
TValues[y] = V;
}
}
}
//---------------------------------------------------------------------------
// PackTable -- Squeeze out duplicated values
//---------------------------------------------------------------------------
void PackTable (void)
{
int Source;
int Destination;
Source = 1;
Destination = 0;
while (Source < Limit)
{
if (TValues[Source] != TValues[Destination])
TValues[++Destination] = TValues[Source];
++Source;
}
Limit = Destination + 1; /* count of unique values kept */
}
//---------------------------------------------------------------------------
// BiasTable -- Add ABS(lowestvalue) to every item in the table
// when it is negative so that all values are positive
// and increasing.
//---------------------------------------------------------------------------
void BiasTable (void)
{
int idx;
if (Limit)
if (TValues[0] < 0)
{
Bias = -TValues[0];
for (idx = 0; idx < Limit; ++idx)
TValues[idx] += Bias;
}
}
//---------------------------------------------------------------------------
int main(int argc, char* argv[])
{
int TPosition;
int idx;
int i;
FILE *F;
time_t rawtime;
struct tm * timeinfo;
time ( &rawtime );
timeinfo = localtime ( &rawtime );
printf ( "Start: %s\n", asctime (timeinfo) );
// Target = 11241; /* alternate test target: 112.41, stored as cents */
Target = 755407; /* 7554.07, stored as cents */
MaxValues = 3;
LoadTable ();
SortTable ();
// PackTable ();
F = fopen ("C:\\Values.txt", "wb");
for (idx = 0; idx < Limit; ++idx)
fprintf (F, "%.2f\n", TValues[idx]/100.0);
fclose (F);
BiasTable ();
R = fopen ("C:\\Results.txt", "wb");
FindOperand (0, 0, Limit, Target+Bias);
fclose (R);
time ( &rawtime );
timeinfo = localtime ( &rawtime );
printf ("End : %s\n", asctime (timeinfo) );
printf (" %d Solutions found.\n", SCount);
printf (" %lld passes.\n", Passes);
return 0;
}
//---------------------------------------------------------------------------
```

Below is the binary search logic.

First it loads the data from the rh_vcrt0 table into a temp table in order to add an identity and sort by vcr_valor_bruto. That only needs to be done ONCE per set of data to be analyzed; once you've loaded it, comment out the load section to run further binary searches.

For the initial run, I chose:

@value_to_find = 7554.07

@count_of_values_to_join = 3 --3-level join

The code returns rh_ident 5416 (of a total of 5470 values).

[I have a fairly fast server, so this runs in 00:00:00 on my machine, but it should be fast enough on virtually any hardware.]

That means that when doing the 3-level join, one level can be restricted to ONLY values from #5416 on. Other values can be IGNORED, because it's known that it's impossible for those values to reach the desired total.

Something like:

```
FROM dbo.search_values sv1
INNER JOIN dbo.search_values sv2 ON
    sv2.rh_ident > sv1.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto < @value_to_find
INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident >= 5416 AND
    sv3.rh_ident > sv2.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto + sv3.vcr_valor_bruto = @value_to_find
```

That will eliminate a lot of unnecessary permutations.

Just by changing the two initial params, you can pre-test for any total and level.

```
--USE <_db_containing__rh_vcrt0__table>
DECLARE @value_to_find decimal(14, 2)
DECLARE @count_of_values_to_join int
SET @value_to_find = 7554.07
SET @count_of_values_to_join = 3
DECLARE @search_count int
DECLARE @search_low_entry int
DECLARE @search_high_entry int
DECLARE @search_midpoint int
DECLARE @sum decimal(14, 2)
/* run this code ONCE for each new set of search values, then comment it out for further analysis
IF OBJECT_ID('tempdb.dbo.rh_vcrt0_sorted') IS NOT NULL
DROP TABLE tempdb.dbo.rh_vcrt0_sorted
IF OBJECT_ID('tempdb.dbo.rh_search_values') IS NOT NULL
DROP TABLE tempdb.dbo.rh_search_values
CREATE TABLE tempdb.dbo.rh_search_values (
rh_ident smallint NOT NULL,
vcr_valor_bruto decimal(14, 2) NOT NULL
)
CREATE UNIQUE CLUSTERED INDEX rh_search_values__CL ON tempdb.dbo.rh_search_values ( rh_ident )
-- insert the original values into a new table to: add an identity and sort by vcr_valor_bruto.
SELECT
IDENTITY(smallint, 1, 1) AS rh_ident, *
INTO tempdb.dbo.rh_vcrt0_sorted
FROM dbo.rh_vcrt0
-- can't use this because of NEGATIVE numbers in the input data!!
--WHERE
--vcr_valor_bruto <= @value_to_find
WHERE
vcr_emp_empresa = 32
ORDER BY
vcr_valor_bruto
-- insert the sorted values into a search table to: make search table as small as possible.
INSERT INTO tempdb.dbo.rh_search_values
SELECT
rh_ident, vcr_valor_bruto
FROM tempdb.dbo.rh_vcrt0_sorted
ORDER BY
rh_ident
*/
----------------------------------------------------------------------------------------------------
-- actual binary search logic
SET @search_count = (SELECT COUNT(*) FROM tempdb.dbo.rh_search_values)
SET @search_low_entry = 0
SET @search_high_entry = @search_count
IF @search_high_entry = 0
GOTO ExitCode
WHILE 1 = 1
BEGIN
SET @search_midpoint = (@search_low_entry + @search_high_entry) / 2
SELECT @sum = SUM(vcr_valor_bruto)
FROM tempdb.dbo.rh_search_values
WHERE rh_ident BETWEEN @search_midpoint AND @search_midpoint + (@count_of_values_to_join - 1)
IF @sum = @value_to_find
BREAK
IF @sum < @value_to_find
SET @search_low_entry = @search_midpoint
ELSE
SET @search_high_entry = @search_midpoint
IF @search_high_entry - @search_low_entry <= 2 --@count_of_values_to_join
BREAK
END
ExitCode:
SELECT 'End', @search_count AS Total_#Values, @search_low_entry AS starting_rh_ident, @value_to_find AS value_to_find,
sum_of_bruto, row_count
FROM (
SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count
FROM tempdb.dbo.rh_search_values
WHERE rh_ident BETWEEN @search_low_entry AND @search_low_entry + (@count_of_values_to_join - 1)
) AS rh_values
/*
-- show next values, to find very last starting_rh_ident that is valid
SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 1 AS starting_rh_ident, @value_to_find AS value_to_find,
sum_of_bruto, row_count
FROM (
SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count
FROM dbo.rh_search_values
WHERE rh_ident BETWEEN @search_low_entry + 1 AND @search_low_entry + 1 + (@count_of_values_to_join - 1)
) AS rh_values
SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 2 AS starting_rh_ident, @value_to_find AS value_to_find,
sum_of_bruto, row_count
FROM (
SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count
FROM dbo.rh_search_values
WHERE rh_ident BETWEEN @search_low_entry + 2 AND @search_low_entry + 2 + (@count_of_values_to_join - 1)
) AS rh_values
SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 3 AS starting_rh_ident, @value_to_find AS value_to_find,
sum_of_bruto, row_count
FROM (
SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count
FROM dbo.rh_search_values
WHERE rh_ident BETWEEN @search_low_entry + 3 AND @search_low_entry + 3 + (@count_of_values_to_join - 1)
) AS rh_values
SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 4 AS starting_rh_ident, @value_to_find AS value_to_find,
sum_of_bruto, row_count
FROM (
SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count
FROM dbo.rh_search_values
WHERE rh_ident BETWEEN @search_low_entry + 4 AND @search_low_entry + 4 + (@count_of_values_to_join - 1)
) AS rh_values
SELECT 'End', @search_count AS Total_#Values, @search_low_entry + 5 AS starting_rh_ident, @value_to_find AS value_to_find,
sum_of_bruto, row_count
FROM (
SELECT SUM(vcr_valor_bruto) AS sum_of_bruto, COUNT(*) AS row_count
FROM dbo.rh_search_values
WHERE rh_ident BETWEEN @search_low_entry + 5 AND @search_low_entry + 5 + (@count_of_values_to_join - 1)
) AS rh_values
*/
```

Hi Cluskitt,

I just saw something in your code that might help.

https://www.experts-exchange.com/questions/27905416/Adding-different-values-from-a-table-to-find-a-specific-sum-value.html?anchorAnswerId=38534888#a38534888

At line 19 I think that you want to increment CurrentLine.

Line 22 is an "Exit Sub". That only ends the iteration on the current column. It doesn't unwind the recursive calls already on the stack so the calling iteration will just advance to the next value for the previous column and recall the function. Perhaps the function should normally return 0 and return 1 when it's found the 500 solutions. Of course, the function would have to test the return value from the recursive call. :)

Kent

Hi Scott,

I'm not following your description very well. :( Can you help me out? And are you using Cluskitt's sample data?

>> That means that when doing the 3-level join, one level can be restricted to ONLY values from #5416 on. Other values can be IGNORED, because it's known that it's impossible for those values to reach the desired total.

I think that's data dependent. If the maximum value in the data is 2,600, three rows with values between 2,500 and 2,600 could exist that sum to 7554.07. In the actual data, it looks like only the 5 highest values cannot possibly be part of a 3-column solution.

A couple of observations about the SQL:

-- In the first join, the first term should probably be 'sv2.rh_ident >= sv1.rh_ident'. If we're searching for 100, then 20+20+60 is a valid solution. That will also require another term to filter rows by PK where the row would be joined to itself.

-- But I think that the SQL will generate entirely too many intermediate rows to be effective. The nature of the problem requires that the join between sv1 and sv2 occurs before the join of sv3. That should result in about 12,000,000 intermediate rows that must all be joined to sv3 and filtered. That's gonna take some time.

Kent

Ident is just an identity value to uniquely identify each row, which is assigned AFTER the input values are sorted, viz:

```
-- insert the original values into a new table to: add an identity and sort by vcr_valor_bruto.
SELECT
    IDENTITY(smallint, 1, 1) AS rh_ident, *
INTO tempdb.dbo.rh_vcrt0_sorted
FROM dbo.rh_vcrt0 --<<-- assumed to contain sample data posted earlier by Author
WHERE vcr_emp_empresa = 32
ORDER BY vcr_valor_bruto
```

>> In the first join, the first term should probably be 'sv2.rh_ident >= sv1.rh_ident' <<

No, as that would join a row to itself. Duplicate values may appear, of course, but any single row cannot be joined to itself in the solution.

Ident is NOT the amount being added, it's JUST an identity value.

>> I think that's data dependent. <<

Yes, of course. That's the point of the binary search -- to determine, *from this specific set of data*, what is the minimum value that needs to be included to reach the desired total.

What the binary search tells you is that AT LEAST ONE of the values in the total must be at least this large. So, what 5416 means is that when I add values #5415+#5414+#5413, the sum is LESS than I need; therefore, no combination of values where ALL rows have idents less than 5416 needs to be tested, because it's IMPOSSIBLE for those rows to ever reach the desired total.

>> In the actual data, it looks like only the 5 highest values cannot possibly be part of a 3-column solution. <<

Note that I'm applying the restriction to only ONE level. The other two levels can be ANY ident.

IOW, say I have the values:

1; 7; 11; 21; 45; 49.

And I say give me 3 rows that total 57.

When I add 7+11+21 I get only 39, less than the total sought, therefore value#5 or larger MUST be included to produce the desired result. That is, I DO NOT EVEN NEED TO CONSIDER:

1+2+3

1+2+4

2+3+4

That's what the binary search code above determines for the specified input.

If the 3 (or N) *HIGHEST* values in a subset cannot reach the total, then any values guaranteed to be lower could not possibly reach it either.

Likewise, if I said give me three rows that total 81, since 11+21+45=77, then I know that value#6 (or later) MUST be included, and I can ignore all 3-values combinations/permutations of only idents 1-5.

The KEY is that you have SORTED the input values BEFORE making the determination.

Note that negative numbers do NOT affect this computation.

Does this make sense now?

Hi Scott.

Thanks. The SQL makes more sense now. I didn't realize that you were generating an identity column for the table as the original data already had a composite PK.

That looks like it could be an effective way to establish the search's starting point.

Now the question is the performance.

If the search target is sufficiently large that sum of the values in the first join are less than the target, all possible combinations will be generated. The OP's sample data contains 5,470 lines. This could explode into 14,954,980 rows that must then be joined with the third column data.

It takes a little while to generate 15 million intermediate rows from the cross product. And considerably more time to join them into another cross product and filter the results. That's potentially 81 billion rows.

I'm not aware of any DBMS capable of generating and filtering that many rows in anywhere close to a reasonable amount of time. Or is there another filter/short-exit that I'm not seeing?

Kent

>> It takes a little while to generate 15 million intermediate rows from the cross product <<

That's actually proven to be remarkably quick in SQL, since some results are instantly ignored as they exceed the max value being sought.

>> That's potentially 81 billion rows. <<

Ah, but that's the importance of the binary search result. With it, you only have to join the 2-level results to the rows from 5416 on -- so the possible results are 2nd-level * 55 instead of 2nd-level * 5470 -- BIG difference.

Again, because of the binary search result, we KNOW that combinations where all idents are lower than 5416 CANNOT yield a valid result. So we can safely FORCE AT LEAST ONE of the values to be #5416 or higher.

(Again, remember that 5416 is the relative SORTED value #, not the value itself, and NOT an unsorted id.)

>> I didn't realize that you were generating an identity column for the table as the original data already had a composite PK. <<

Yes, but being a composite key, it is more overhead to work with. Moreover, the original values are unsorted, and hence offer no chance to streamline the computations.

Therefore, I insert all the original data into a SORTED table with a single smallint identity that is a pointer back to all of the original data. [smallint handles up to 32767; the requester has specified a max number of values of 10,000; realistically, permutations on anything above that would be too prohibitive anyway]

Finally, for maximum search/computational efficiency, I take just the identity pointer and the actual value to be summed and store them, SORTED, into a new table (what I call the "search" table). That makes each row only 11 data bytes, with no variable length columns.

ASKER

@ScottPletcher: I really don't follow your code at all. Nevertheless, I tried running it as is, but all I succeeded in doing is getting this line:

```
(No column name)  Total_#Values  starting_rh_ident  value_to_find  sum_of_bruto  row_count
End               5897           5842               7554.07        7446.00       3
```

If I uncomment the last ones, what I get is further lines all similar, but with the sum_of_bruto column incremented. It takes virtually no time to run, but how can I get my results?

@Kdo: I'll try to adapt your code to get the data from the DB, though I'm a bit rusty on C. I'll have to install GCC as well. I won't probably do that today, so most likely, I'll try it on Monday, when I have a bit more free time.

At this point, though I'd love to have a working solution, I'm actually more interested in the theoretical part of this, and getting something to work as fast as possible. Right now, I'd just love to see how much faster I can make this. Even if a 5 row solution can only be achieved in 2 hours, or even 30 minutes, it won't be practical for use, but I would feel very gratified :)

ASKER

One idea I did have, which I probably won't be able to work out on my own:

Isn't it possible to join both solutions? I mean, an SQL intermediate solution that is then further treated in C? I don't necessarily need a "pure" solution.

Hi Cluskitt,

Actually, since SQL Server doesn't really support Stored Procedures in C (at least I don't think that it does) any solution that uses a C program is a hybrid solution.

The performance of these algorithms/implementations is hugely dependent on the data. Scott's technique starts by selecting the minimum value (row) that must be part of any solution, mine works in reverse and eliminates those rows that can't possibly be part of any solution. Test both of them side-by-side (in the same language) and one will perform better as the target value increases, the other will perform better as the target decreases.

For a 5-value solution, I'm concerned over disk space. The 4-value solution required 147MB to store the solutions in ASCII. I can envision the 5-value solution not fitting onto a hard drive.

I'm going to modify the C program to count the 5-values solutions and see where it is after about an hour. :)

Kent

ASKER

Disk space isn't an issue. We did specify that we want a max of 50k rows. So, once that number has been reached, the program will exit and no further space is wasted. This would be true whether in C or SQL.

I was thinking more about the magnitude of the problem, not the first 50,000 results. :)

The theory goes that with N data points, there are up to N^y ordered candidate combinations to examine. 5,000 data points means up to 5,000 1-column candidates, 25,000,000 2-column candidates, 125,000,000,000 3-column candidates, 625,000,000,000,000 4-column candidates, and 3,125,000,000,000,000,000 5-column candidates.

That's why the 4 and 5 column solutions are so daunting. The algorithms need to identify things to not bother checking without investing many resources in the filter.

Kent

ASKER

I know the processing part of it is the hardest part of this. But you were concerned over disk space and that isn't an issue (although memory usage might be).

>> End 5897 5842 7554.07 7446.00 3 <<

Yep, that IS the result. What that tells you is that the last-level join can START at 5842 (out of 5897 values).

So instead of having to code:

FROM dbo.search_values sv1
INNER JOIN dbo.search_values sv2 ON
    sv2.rh_ident > sv1.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto < @value_to_find
INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident > sv2.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto + sv3.vcr_valor_bruto = @value_to_find

You can change the third join to this:

INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident >= 5842 AND
    sv3.rh_ident > sv2.rh_ident AND
    sv1.vcr_valor_bruto + sv2.vcr_valor_bruto + sv3.vcr_valor_bruto = @value_to_find

Eliminating a huge number of permutations from consideration.

>> For a 5-value solution, I'm concerned over disk space. <<

Are you using my stripped-down "search" table, or the original table with all columns?

Again, that's why I created the **absolutely-minimum-columns**-necessary "search" table, to reduce the search overhead as much as possible.
Additional short-cuts could be obtained by using the SORTED input values, again with a binary search. I don't have time to lay out the full code, but it's the same general search.

For example, if in the first 2 levels I add value #1 and value #2 (both likely negative), only an extremely small subset of the original rows could ever possibly yield a result.

The SORTED original values only help with one level of the search, though. For higher levels, you need the 2-level values stored (SORTED, of course), so you can use a similar technique/logic to reduce more than one level of searches.

ASKER

I liked Kdo's solution of biasing the results so they're all positive. I imagine that would be very helpful for your code, as that would eliminate the negatives problem. As long as you account for the bias when searching for the target, I can't see any inconvenience in this.

>> I resolved the issue with negative number by biasing all of the data (adding a value to all of the items so that there are no negative numbers). The search takes the bias into account when searching through the data. <<

That's tricky.

What *specifically* are you doing to "take the bias into account when searching"? You have to do bias_amount * #_of_values_combined at that level, right?

>>> Eliminating a huge number of permutations

are you really trying to eliminate permutations? I don't see anywhere in your code or anyone else's for that matter where permutations will be a problem except in the code in the question itself. That's, in part, why we sort.

Permutations aren't what we want anyway. We want combinations, which are fewer in number.

You've used the word "permutation" previously - is it a mistake or are you using that word intentionally? If it is intentional, why are you trying to generate/filter permutations and how is that supposed to be happening?


>>> What specifically are you doing to "take the bias into account when searching"?

yes, you're right, the bias has to be included once for each level.

You can see where that happens in the recursion.

Each call deeper increments the target by the bias amount.

FindOperand (Sum + TValues[TPosition], Depth+1, TPosition, Target + Bias); (line 59)

he also does it at line 199 to get it started.


I think we need to index the data by value, rather than ident (or by value also, if the ident is still useful for other purposes). Then for the *final* level we just select a matching value(s), with NO further combinations required.

CREATE TABLE tempdb.dbo.rh_search_values (
    rh_ident smallint NOT NULL,
    vcr_valor_bruto decimal(14, 2) NOT NULL
)

CREATE UNIQUE CLUSTERED INDEX rh_search_values__CL ON tempdb.dbo.rh_search_values ( vcr_valor_bruto, rh_ident )

3-level search:

FROM dbo.search_values sv1
INNER JOIN dbo.search_values sv2 ON
    sv2.rh_ident > sv1.rh_ident --AND
    --sv1.vcr_valor_bruto + sv2.vcr_valor_bruto < @value_to_find --not valid w/neg #s
INNER JOIN dbo.search_values sv3 ON
    sv3.rh_ident > sv2.rh_ident AND
    sv3.vcr_valor_bruto = @value_to_find - sv2.vcr_valor_bruto - sv1.vcr_valor_bruto

Then the last level does NO explosion at all, it just does a keyed lookup for the needed value.

That should make the 3-level lookup almost as fast as the 2-level lookup.

It also makes the 4-level lookup only require gen'ing the 3-level number of combinations.

Hi Scott,

That's closer to what I had thought your previous solution was trying to do.

But the index is doomed to failure. The data has duplicates that can't be ignored.

Kent


Ok. One more round to report.

I modified the C code and came up with some more results to report.

-- All of the data items are stored as integers, at 100 times the original value. Integer operations (math and comparison) are generally faster than FP data types.

-- When iterating through the data, ignore duplicate values for the current column. (If 121 occurs 4 times, the 2nd - 4th checks of the value are redundant. If the value was rejected the first pass, it will be the next 3, too. If the value was a solution the first pass, it will be the next 3, too. This eliminates duplicate results but retains rows with duplicate values. It also significantly reduces the number of data items that are tested.) It has a high cost in that the first test of each data item is to see if it is a repeat, but the overall result is a faster program.

-- Duplicate results are ignored.

-- Rows with duplicate values are found.

-- Solutions are found and grouped by number of columns.

Oddly, the program runs about 1% slower when it's compiled for speed than for debug. 82 seconds vs 81 seconds searching for all 1, 2, 3, and 4 column solutions. And these timings are repeatable.


>>> Duplicate results are ignored.

is that appropriate?

The combinations are based on numbers, but the numbers are tied to other data.

I can buy ice cream for $1

I can buy candy for $1

I can buy a drink for $2

how many ways can I spend $3?

If I'm reading your explanation correctly I'll get one solution of $1 + $2

but the asker needs both solutions.

ice cream + drink

and

candy + drink


I suppose you could keep track of all the duplicates and then apply another round of combination searching on the other fields to expand a collapsed duplicate into all variations of the same amounts.

But that seems like more complexity when you could just include them in the original solution.

Maybe it'd be worth it though.


>> But the index is doomed to failure. The data has duplicates that can't be ignored. <<

Huh? The index won't ignore duplicates. On the contrary, it will join every value that matches, regardless of how many times it appears in the input.

Both indexes are needed on the search table, as follows:

CREATE TABLE tempdb.dbo.rh_search_values (
    rh_ident smallint NOT NULL,
    vcr_valor_bruto decimal(14, 2) NOT NULL
)

CREATE UNIQUE CLUSTERED INDEX rh_search_values__CL ON tempdb.dbo.rh_search_values ( rh_ident )

CREATE UNIQUE NONCLUSTERED INDEX rh_search_values__IX_vcr_valor_bruto ON tempdb.dbo.rh_search_values ( vcr_valor_bruto, rh_ident )


Yeah, we could DISTINCT out the duplicates in the search values, then re-expand them at the end to include all the original rows with that value.

Hi stuber,

I'm trying to squeeze as much performance out as I can so that a 5-column solution has a chance to actually complete. If that kind of process results in a significant enough improvement, it's worthwhile to try. (I know that I'm losing focus on the 50,000 limitation, but I'm still curious about the general solution.)

By filtering the search to avoid finding duplicate rows, the search time drops from about 350 seconds to 81. If that extrapolates to the 5-column solution, it's the difference between a 10 hour run (run overnight) and 2+ days. I'm giving up after the second night. :)

Hi Scott,

Rats. Don't know why I have such trouble reading the fine print in your SQL. Apologies.

Try running that and generating an explain plan. With only 5,400 numeric rows, the data will fit into a handful of pages and the index may actually be larger than the data. The optimizer may decide that the two-tier access to each item is more expensive than scanning the data blocks.

It'd be interesting to know.

Kent


>> Don't know why I have such trouble reading the fine print in your SQL. Apologies. <<

LOL. No problem, you followed it well really. Based on his comments, I don't think sdstuber *even now* understands anything I wrote.

>> The optimizer may decide that the two-tier access to each item is more expensive than scanning the data blocks. <<

There's no 2-tier access in this case, because I made the index a covering index. That is, the index has all data needed to satisfy the query, so SQL can use the index w/o linking back to the main table. SQL actually favors covering indexes when available.

>> All of the data items are stored as integers <<

Technically they need to be big integers (8 bytes), as 14 digits -- decimal(14, 2) -- obviously won't fit into a standard (4 byte) int.

Other improvements are possible, with similar binary searches / pre-stored values at key points. I think the best improvement for higher levels -- 4 and 5 level + -- will come from having the 2-level results physically stored (fyi, sdstuber disputes this too).

The overhead of storing the sorted level 2 results is not trivial, but on a reasonably fast machine -- or in a C++ array -- it shouldn't be at all prohibitive.

That allows an exact value match search to replace the last *two* levels of the join, meaning that a 4-way join is not much more overhead than a 2-way join, and a 5-way join not much more overhead than a full 3-way join.
>>> (fyi, sdstuber disputes this too).

as suggested before, please contact me offline if you want to discuss what I do or do not believe

as for this particular "dispute" - I have no idea why you'd suggest such a thing since I've never commented on your storage - reread my comments.

to this point you've posted 2 sets of code - the first was 30 times slower than anything posted to that point and didn't provide the requested results so I didn't really bother investigating it, let alone commenting on it (other than to mention the relative performance).

I have yet to test what you posted yesterday so I have no comments on that either.

All of my participation thus far has been trying to explain and reexplain my methods and a little commentary on some of kdo's first work that I did get a chance to test. And of course my "doomsayer" stuff about big sets.

Again, feel free to contact me offline if you'd like to pursue anything further, I'm happy to try to explain what I've done above, and anything I've actually tested.

You may have noted I haven't posted much code in the last couple days, partially because I've been busy, but mostly because Kdo has set the bar. If I can't beat his times then I won't post. Why bother?


ScottPletcher - thank you for finally posting the code example. Apparently I read your code better than I read your words.

I never disputed a binary search was fast for finding a single number from a set, I simply had no idea what you wanted to do with it. Now I do. Thanks!

I'm still not sure why you've picked other random statements as either being flip flops or ascribing views I've never stated but I'll, again, take that as a failing of mine that I'm not explaining myself well enough for others to understand what I'm talking about.

Now onto Kdo's stuff, because like him, I'm a math guy and solving for the whole set (even when it's awful) is still the more interesting part of this question for me.


I was curious just how much the bias approach helped in kdo's code.

I changed the timing to use struct timeval in order to record sub-second measures

Here are my first results for 3-value on target 11241 (112.41)

Elapsed time: 664569 microseconds
177527 Solutions found.
Bias: 0.000000

Elapsed time: 665221 microseconds
177527 Solutions found.
Bias: 124602.000000

bias was slower but that was only one sample.

after repeated tests the biased version did end up slightly faster but only by 2660.6 microseconds.

I did one run each for 4-value sums

Elapsed time: 459879342 microseconds
151737464 Solutions found.
139931314708480 passes.
Bias: 0.000000

Elapsed time: 458895217 microseconds
151737464 Solutions found.
139971610013696 passes.
Bias: 124602.000000

Bias does seem to help; but only marginally and the benefit diminishes as the results grow. I haven't waited for 5-values to finish but the benefit on 2-value was greater, not that it needed it. 26052.4 vs 26798.2 microseconds.

of course the exact values will vary for others but I imagine the pattern should remain.


The C code above doesn't find all solutions.

I changed these lines (and the file paths, but nothing else):

Target = 11241;
MaxValues = 2;

These results were NOT found but should have been:

1: -89.00 201.41
2: -40.59 153.00
3: -35.87 148.28
4: -26.31 138.72
5: -17.79 130.20
6: -17.79 130.20
7: -15.75 128.16
8: -15.75 128.16
9: -12.22 124.63
10: -12.16 124.57
11: -12.16 124.57
12: -12.16 124.57

It's not a problem with the bias: from the performance tests above I already knew it didn't change the results found, but I did check again explicitly for these combinations and confirmed that, with or without bias, they are still skipped.


Wow. Great catch!

That's because the biasing, as implemented, doesn't work.

Searching for : 201.41 -89.00 (Larger number to the left....)

Target is : 112.41

Recapping, the original problem is that when adding terms A and B, the result is less than A when B is negative. The original algorithm was fooled by the negative values and didn't bother moving right a column when the sum of the columns already checked (in this case, 1 column with a value of 201.41) is larger than target.

So Biasing was invented. The actual data has a smallest value of -1246.02, so 1246.02 is added to all of the data values, making 0 the smallest value.

After biasing everything, testing the first column for (201.41, -89.00) uses these terms:

Searching for : 1447.43 1157.02 (Larger number still to the left....)

Target is : 1358.43

The first term is still larger than the target because both the term and the target were incremented by the same value.

Gotta rethink this one, now.....


I'm not sure the biasing is the cause of skipped combinations. If I comment out the bias call (so bias=0) I get the same results as with the bias.

I wasn't going to bother posting my code because kdo's code was so much faster than mine but since the results are different, it might still be of value.

I'm sure I'm overlooking something; but assuming bias = 0, I don't see a significant difference between the two programs. I sum bottom-up, kdo sums top-down. I see no reason why one should be faster than the other or why one should find more results than the other. Our loops, our if/else are basically the same.

Of course, since kdo's code isn't finding all of the results, it doesn't have to do as much work. So it "should" be faster. But it's 5-6 times the speed of mine while doing far more than 1/5 of the work.


```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SOURCE_ROWS 5470
#define MAX_DEPTH 2
#define BASE_TARGET 112.41
#define TARGET (int) (BASE_TARGET * 100.00)
#define MAX_SOLUTIONS 50000 //not used yet
int rows = 0;
int *numbers;
int solution_count = 0;
int solution_value[MAX_DEPTH];
FILE *outfile;
long long function_call_count = 0;
int intcomp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}
void read_input_file (void) {
    FILE *infile;
    int dummy1, dummy2, dummy3, ret;
    float temp_value;
    infile = fopen ("sqldata.txt", "rt");
    numbers = malloc (SOURCE_ROWS * sizeof (int));
    while (rows < SOURCE_ROWS) {
        ret = fscanf (infile, "(%d,%d,%d,%f)\n", &dummy1, &dummy2, &dummy3, &temp_value);
        if (ret < 4) break; // stop on EOF or a malformed line
        // For efficiency, scale and store all values as integers (cents)
        numbers[rows] = (int) (100 * temp_value);
        rows++;
    }
    if (rows < SOURCE_ROWS) numbers = realloc (numbers, rows * sizeof (int));
    fclose(infile);
    qsort(numbers, rows, sizeof(int), intcomp);
    // printf("Read %d rows from input file\n", rows);
}
void print_solution(int depth, int number) {
    int i;
    FILE *out = outfile ? outfile : stdout;
    fprintf(out,"%10i:",solution_count);
    for (i=1;i<depth;i++) {
        fprintf(out," %.2f",solution_value[i-1]/100.0);
    }
    fprintf(out," %.2f\n",number/100.0);
}
void find_sums(int depth, int partialsum, int index) {
    int i, sum;
    function_call_count++;
    for (i=index;i< rows;i++) {
        sum = partialsum + numbers[i];
        if (sum > TARGET) return; // sorted input: later values only grow the sum
        if (sum == TARGET) {
            ++solution_count;
            //print_solution(depth,numbers[i]);
        }
        else if (MAX_DEPTH > depth) {
            solution_value[depth-1] = numbers[i];
            find_sums(depth+1,sum,i+1);
        }
    }
}
int main(int argc, char* argv[])
{
    struct timeval start_time, end_time;
    long int elapsed;
    read_input_file();
    // outfile = fopen ("solutions.txt", "w");
    gettimeofday(&start_time,NULL);
    find_sums(1,0,0);
    gettimeofday(&end_time,NULL);
    elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);
    printf("Elapsed time: %ld microseconds\nSolutions found: %d\nCall count: %lld\n",
        elapsed,solution_count,function_call_count);
    if (outfile) {
        fprintf(outfile,"Elapsed time: %ld microseconds\nSolutions found: %d\n",elapsed,solution_count);
        fclose(outfile);
    }
    return 0;
}
```

ASKER

Wow, lots of posts this weekend. I'll have to try and find some time to check them all. At first glance, though, I can see Kdo's problem right away, not from his code, but from his example:

>> After biasing everything, testing the first column for (201.41, -89.00) uses these terms:

>> Searching for : 1447.43 1157.02 (Larger number still to the left....)

>> Target is : 1358.43

You're using a bias of 1246.02. In which case, the target isn't correct. It should be 1246.02 * 2 + 112.41 (because we're on a 2 row solution). So, checking for 2604.45 wouldn't skip these values. If you don't account for double the bias (one for each column), then you'll be skipping not only the negatives, but also some positives.

I still feel that some sort of mixed solution would be the fastest. Maybe adapting Scott's solution of storing the two column results and using those for expanding. Implementing that solution on C may prove promising.


The target does increment by the bias every time that you move right a column. The problem is that if the target is less than current value, biasing both items by the same value doesn't change that.

I've got a solution in mind. Just need a couple of minutes to implement. :)

Kent


SOLUTION


ASKER

That query took 20 seconds to complete. From what I understand, though, I have to use your other query first to populate and to define the binary search value indicated in the comments. I suppose it would be easier to run both codes at once and store the value in a variable.

Is there any way to return the total number of combinations (to compare with other solutions and check for accuracy)?

I was going to get a count of the total solutions (see the "--COUNT(*)" in my query above :-) ), but once it ran over 2.5 mins I gave up and cancelled it.

Of course if you let it run it would eventually give you its count.

>> From what I understand, though, I have to use your other query first to populate and to define the binary search value indicated on the comments. <<

Yes, that adds a lot of efficiency (esp. on your box) with the 4_5 table, by eliminating most of the rows from one level of the joins.

>> I suppose it would be easier to run both codes at once and store the value in a variable. <<

Yep. Sorry, just running this catch-as-catch-can right now, trying to piece together a working, viable solution. I was careful, however, to use variables for the value to find and for the level, to make it much easier to put it into combined/straight-line code later.

If you want a different subset, you could limit the rh_ident on one or more of the other levels -- another reason I like the idea of an identity for these rows instead of the multiple original rows.

For example, if you specified:

WHERE s1.rh_ident > 100

Then none of the first 100 rows would be included in s1, s2 or s3.

ASKER

I tried commenting the TOP 50000 line and letting it run. After 7 minutes I cancelled, having nearly 7M records.

Analysing different subsets would be useful, because I think the data presented will be quite biased, starting from negative numbers as it does. A random sample would be more representative, but not only would that be harder and slower, having this many results would decrease the likelihood of finding a solution using the combinatorial method anyway.

Once I assemble your code together, I'll try to study it to understand it. If I have questions I'll ask them here.

I'm curious to see if Kdo can achieve either a faster solution (not really necessary. Anything under 5 minutes is good enough) or one that can return all possible solutions in a decent time. I love problem solving, and this is one that's piqued my curiosity.

As to "check for accuracy":

I show you the original relative value# (ident).

I re-add the numbers in the display and the total matches.

I also verified the first 50,000 results I got, going back to make sure the listed values -- s1, s2, etc. -- really were in the original input.

Not sure what else I can do as far as verifying "accuracy".

C'mon, 5 levels in 20 secs, you're not happy?? I thought you were talking about *hoping* to be able to get 5-level results in HOURS just to see them.

ASKER

I am extremely happy with this solution. I just wanted to compare with Kdo's as well. Also, I wanted to ascertain whether both solutions return the same number of results (which would be a good indication that none are missing any solutions, like was happening with Kdo's).

One thing I'm not sure of, though, is how to adapt your code to use 2, 3 or 4 level solutions, which would have to be run before going for the 5 second solution (most likely using a stored procedure to populate/drop the table and select which level to run).

We could adapt it easily enough for the different levels -- again, that's why I used variables for the value to find and the levels to use.

I would include the binary search even for smaller values and levels. Even for the small example value you used earlier of 112.41, the 2-level total can start at #2952 (out of 5470), knocking out over half the rows from one level almost instantly.

You can vary the s1.rh_ident starting point to get different result sets.

For example:

WHERE s1.rh_ident > 100

WHERE s1.rh_ident > 500

etc..

They still aren't random, of course. Don't know any way to easily do that. You could randomly delete many values from the table, which would reduce the possible result sets randomly -- maybe.

ASKER

Don't worry too much about the random values. They aren't too important. The chances of hitting a relevant value from that many results doesn't improve by much, if at all, by randomizing them. It's just the programmer in me that likes these types of solutions ;)

LOL, I got ya.

However, it occurs to me now, the only values that genuinely must be sorted are for doing the binary search.

The values in the final search for s1, s2 and s3 do NOT have to be sorted, since the selection for them is done by ident and not by value. [The s4_5 is doing an indexed lookup, so the physical order is irrelevant, as long as the bruto total has an index on it.]

So you could randomly insert the values into the search_values table to produce different results. And that IS easy AND FAST:

SELECT ... INTO dbo.rh_search_values ORDER BY NEWID()

That could be done even, say, 10 times, and you could produce 10 reasonably quick random top 50,000 (or whatever) results ... even, if you wanted, store those results and randomly pick some number from the 500,000 random results.

In C(++), I suspect you're right, or at least very close ... my only potential issue with that would be the indexed lookup to match the value ... I'm not sure how well C deals with doing that vs SQL (that is THE strength of SQL, after all).

I've confirmed for myself that 50,000 random values could be obtained in nearly the same amount of time as the last run (maybe a second or two on the slower boxes).

The original ~5,500 values must be sorted and stored, but that is a relatively trivial amount of time for SQL. Then the binary search finds the "cut off" point, which is very fast in SQL.

Finally, the qualifying two-level combinations must be stored and sorted -- much less trivial, but SQL is still fast enough to do this in less than 30 seconds for almost all reasonable input lists and search values.

None of the advanced techniques are necessary for the 2-level (although the binary search could still yield savings, sometimes a lot, and is so fast it probably should be done anyway).

I would use them for the 3-level checks -- sometimes it won't have been needed, but other times it could help a lot.

I would definitely use some type of "advanced" techniques for 4- and 5-level searches, because I think it will yield noticeable savings there.

More to report. :)

No 5-column solution uses any of the top 68 values in the list. That's certainly not intuitive, and an algorithm to detect the starting point without missing any values is quite a task. (Note that different data might use the top value and none of the next 50, or none of the top 200, etc.)

Still, starting the search just a modest way into the list makes a huge difference in run times. The first 5-column solution is found using the 69th largest value in the list. So starting the search after just ignoring the first 68 items yields these times (for 50,000 results) for the 2, 3, 4, and 5 column searches.

0, 5, 15, 46. Seconds.

That's an impressive reduction. Skipping the first 100 values yields:

0, 5, 11, 7. Seven seconds to find 50,000 5-column solutions!

Skipping the first 200 values yields:

0, 5, 7, 9.

Skipping the first 400 values yields:

0, 5, 10, 12.

There's a strong indication that performance will vary as the data changes, but simply ignoring the top 3 or 4% of the values gets all of the generated answers in about 10 seconds.

If the TOP 50K results are needed, it's not possible to pick an arbitrary number of values to ignore. But if 50K consecutive (or related) solutions are sufficient, starting just a short way into the list is a very workable technique.

Kent

>> No 5-column solution uses any of the top 68 values in the list. That's certainly not intuitive, and an algorithm to detect the starting point without missing any values is quite a task. <<

Indeed. I suspect the algorithm for that might take longer than generating the first solutions :-).

That's why I went at it from a different angle: for an n-level solution, what is the SMALLEST value that MUST be included to reach the desired total? That's what my binary search determines (within a few entries), and a binary search will ALWAYS BE VERY FAST.

Now, this only helps for one level, but it often helps BIG for that level, allowing even SQL to gen a solution in a very reasonable time.

This reminds me of the old "super queen" problem. The lessons are relevant here.

Back in the 70s, Scientific American posed the question of placing N super queens (can move like a queen and a knight) on an NxN chess board. At the time of publishing, they had solutions up to N=16 (or thereabouts) and within the next couple of months readers submitted solutions for perhaps 3 more values of N.

At the time, I worked for Control Data and had access to some of the most powerful computers in the world. I wrote a small (recursive) program in assembly to solve as N increased. Within a short period of time, I had solutions for all boards up to N=22. N=23 was daunting. Days later the program was still running, without finding a solution. I didn't have a solution for N=23, but I was the only mathematician in the world with answers for all N < 23. And they were, of course, useless. :)

That program was looking for the "least solution". Put a super-queen in the lower left corner and advance to the next column. Put a super-queen legally in the next column and move on. When no legal move is found retreat a column and move the super-queen to the next legal column. The base-N numeric value of the solution (with each column being a digit in the solution) was the "least solution". I was hoping to establish a pattern by which solutions could be derived for larger values of N.

But that didn't work.

What did work, at least as far as finding SOME solution, was to start in the middle and work out. Start by placing a queen in the center square, then move to the column to the left of center and search up for a legal move. Then move to the column to the right of center and search down for a legal move. Move away from center one more column and repeat. (Searches are circular so that all positions are tested, and the left/down, right/up technique lent symmetry to the solution. There is a slight variation on the start depending on whether N is odd or even.)

That technique found solutions very, very fast. And remembering how I've paid those dues made me think that if we start toward the middle of the list and work out, we can find 50,000 solutions lightning fast. I'm going to try it.

ASKER

I'm actually very fascinated by all this and I'm loving the progress being made in the C code (and also the new approach, though I'm not sure if it will be fruitful. I'll wait and see).

However, it seems that the SQL solution is pretty much stable and finalized. I would ask if ScottPletcher can assemble the solution into some sort of stored procedure which I can then adapt. Is there any way the procedure can detect if you need to refill the table? Also, is it possible to use the number of results returned as a variable? I seem to remember that TOP @NumberOfRows wouldn't work on SQL and I had to use dynamic SQL before, but I can't be sure now, and, since you're using an incrementing index, you might not need it.

My biggest difficulty, so far, is determining how you get the results back from the ident. For each column, I need vcr_num_empregado, vcr_vnc_cod_venc and vcr_valor_bruto, but I can't seem to understand how you built the ident so I can reverse it.

Kdo, when you do have a more final solution, please post it here, so I can compare both. Please take into account that the program should retrieve the values from a DB. It should take only a few microseconds for this, but it should be taken into account.

When this is over, I'll be opening a new question to award 500 points to each working solution. I feel like you guys deserve it.

I keep trying to make the code simpler

I have more testing to do on the larger sets to confirm they are correct.

I did try adding the binary search at the last level and, as expected, it is faster, but I'm less confident in the results. Not because of a flaw in the idea but a flaw in my code.

My latest timings - without binary search.

I'll try to get the kinks worked out of both versions and post final results and code this weekend.

```
--------------------------------
Target sum: 112.410000
--------------------------------
Depth: 1
Elapsed time: 7 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 12585 microseconds
Solutions found: 122
Call count: 3332
--------------------------------
Depth: 3
Elapsed time: 13513323 microseconds
Solutions found: 256735
Call count: 5500396
--------------------------------
Depth: 4
Elapsed time: 11481604585 microseconds
Solutions found: 277343509
Call count: 6094087776
--------------------------------
```

I made a few more improvements and fixed the binary search bug, which does add some complexity because if a value exists multiple times in the set, the binary search will only find one of them so I have to search the neighbors exhaustively. That is still significantly faster than iterating through all remaining values but does take a little more code.

I also brought back a previous tuning effort that exits early if I can determine the target is unreachable. Adding those checks slows things down for the two-value solutions, but those only take a few hundred microseconds anyway, so the loss isn't appreciable. However, on solutions with 4 or more values the early exit can reduce the total execution time by quite a bit. The relative amount will vary by data and target, though.

But, despite all the successes I'm having with this code, I'd say the times still prove my original point: the data beats the algorithm. I can find simple solutions with convenient data in almost no time at all, but big solutions with lots of combinations take a long time, and "lots of combinations" with the existing data is still relatively small compared to what they could be. For example: the 4-value solutions below are only 277 million, but with different data and a different target the solutions could be in the billions or trillions, and that would take a lot longer to sift through, let alone trying to tackle the 5-value solutions.

However, the 50000 limit provides a convenient "out" that lets you dodge some of the math, and that, as already shown above by others, can be incredibly powerful.

Like Kdo though, finding all solutions is more interesting and putting the 50000 cap in the middle is a trivial addition so I'm still pursuing the exhaustive solution.

I'm running for 5-value solutions now, these are timings I have up through 4 while I wait

```
--------------------------------
target sum: 112.410000
--------------------------------
Depth: 1
Elapsed time: 0 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 713 microseconds
Solutions found: 122
Call count: 2953
--------------------------------
Depth: 3
Elapsed time: 450369 microseconds (half a second)
Solutions found: 256735
Call count: 4390127
--------------------------------
Depth: 4
Elapsed time: 448023735 microseconds (a little under 8 minutes)
Solutions found: 277343509
Call count: 4446882814
--------------------------------
```

I lost my connection to the server so my 5-value search from above didn't finish.

But I made a couple of small tweaks and added the early exit.

It can do 50000 5-value sums in about 8 seconds.

Below are timings for 5,000,000 at each level (if there are that many)

I only check at the top of the function so duplicates found within one call will be counted, hence the 5000001 in the last set.

```
--------------------------------
target sum: 112.410000
--------------------------------
Depth: 1
Elapsed time: 0 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 277 microseconds
Solutions found: 122
Call count: 2953
--------------------------------
Depth: 3
Elapsed time: 411936 microseconds
Solutions found: 256735
Call count: 4390127
--------------------------------
Depth: 4
Elapsed time: 25090999 microseconds
Solutions found: 5000000
Call count: 326857516
--------------------------------
Depth: 5
Elapsed time: 183320310 microseconds
Solutions found: 5000001
Call count: 2720381708
--------------------------------
```

This is the code used to generate the above.

You might note I did not use the bsearch function. I tried it but what I have below actually ended up slightly faster. I think it's because I don't force searching down to a single value; which saves a tiny bit on extra division and offsets. Instead, if I get down to 2 values I simply evaluate both.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <getopt.h>
#define DEPTH_LIMIT 5
#define SOURCE_ROWS 5470
#define BASE_TARGET 112.41
#define MAX_SOLUTIONS 5000000
#define separator printf("--------------------------------\n")
int target = (BASE_TARGET * 100.00);
char *in_filename;
unsigned rows = 0;
int *numbers;
int solution_value[DEPTH_LIMIT];
long max_solution[DEPTH_LIMIT];
FILE *outfile;
unsigned long solution_count = 0;
unsigned long function_call_count = 0;
long sum;
unsigned max_depth;
int intcomp(const int *a, const int *b) {
return *a - *b;
}
void read_input_file (void) {
FILE *infile;
int dummy2, dummy3, ret, i,j;
float dummy1, temp_value;
infile = fopen ("sqldata.txt", "rt");
numbers = malloc (SOURCE_ROWS * sizeof (int));
while (rows < SOURCE_ROWS) {
ret = fscanf (infile, "(%d,%d,%d,%f)\n", &dummy1, &dummy2, &dummy3, &temp_value);
if (ret <= 0) break;
// For efficiency, adjust and cast all values to integer cents;
// the small offset ensures proper truncation
numbers[rows] = (int) (100.0 * temp_value + (temp_value >0?0.1:-0.1) ) ;
rows++;
}
if (rows < SOURCE_ROWS) numbers = realloc (numbers, rows * sizeof (int));
fclose(infile);
qsort(numbers, rows, sizeof(int), (int (*)(const void *,const void *))intcomp);
for (i=0;i<DEPTH_LIMIT;i++) {
for (j=rows-1;j>=rows-1-i;j--)
max_solution[i] += numbers[j];
}
//printf("Read %ld rows from input file\n", rows);
//for (i=0;i< rows;i++) printf("%4i: %i %i\n",i,numbers[i],target);
}
void print_solution(unsigned depth, int number) {
int i;
for (i=1;i<depth;i++) {
fprintf(outfile,"%.2f ",solution_value[i-1]/100.0);
}
fprintf(outfile,"%.2f\n",number/100.0);
}
// Given a depth and current position in the list
// what is the smallest sum that could be generated?
// Since the list is sorted, the smallest sum is
// found by adding the next elements in sequence
long min_solution(unsigned depth, unsigned index) {
int j;
long temp=0;
for (j=index+1;j<=index+max_depth-depth;j++)
temp += numbers[j];
return temp;
}
void find_sums(unsigned depth, long partialsum, unsigned index) {
unsigned i, j, lo, hi;
function_call_count++;
if (solution_count >= MAX_SOLUTIONS) return;
if (depth < max_depth) {
for (i=index;i< rows;i++) {
sum = partialsum + numbers[i];
if (sum > target) return;
// Are we so far from the target that no values could
// possibly get us there? If so, give up.
if (sum + max_solution[depth-1] < target) break;
if (sum + min_solution(depth,i) > target) break;
// We're close enough that it's worth trying to continue
solution_value[depth-1] = numbers[i];
find_sums(depth+1,sum,i+1);
}
}
else {
lo = index;
hi = rows-1;
while (hi-lo > 1) {
i = (hi+lo)>>1;
if (partialsum + numbers[i] < target) lo = i;
else if (partialsum + numbers[i] > target) hi = i;
else break;
}
if (partialsum + numbers[lo] == target) i = lo;
else if (partialsum + numbers[hi] == target) i = hi;
if (partialsum + numbers[i] == target) {
++solution_count;
if (outfile) print_solution(depth,numbers[i]);
// We found a hit but there may be more than one element in the set with the same value
// so count the neighbors until we exhaust them all
for (j=i-1;(j > index) && (numbers[j] == numbers[i]);j--) {
++solution_count;
if (outfile) print_solution(depth,numbers[j]);
}
for (j=i+1;(j < rows) && (numbers[j] == numbers[i]);j++) {
++solution_count;
if (outfile) print_solution(depth,numbers[j]);
}
}
}
}
int main(int argc, char* argv[])
{
struct timeval start_time, end_time;
unsigned long elapsed;
int c;
while ((c = getopt(argc, argv, ":t:i:o:?")) != -1) {
switch (c) {
case 't':
{ double t = atof(optarg); target = (int) (t * 100.00 + (t < 0 ? -0.5 : 0.5)); }
break;
case 'i':
in_filename = optarg;
break;
case 'o':
outfile = fopen (optarg, "w");
break;
case '?':
printf("\n%s [-t target] [-i infile] [-o outfile]\n",argv[0]);
printf("-t target is a floating point value, default=%.2f\n",BASE_TARGET);
printf("-i input file name, default sqldata.txt\n");
printf("-o output file name\n");
return 0;
default:
if (isprint (optopt))
fprintf(stderr, "Unknown option `-%c'.\n", optopt);
else
fprintf(stderr,"Unknown option character `\\x%x'.\n",optopt);
return 0;
}
}
read_input_file();
separator;
printf("target sum: %f\n",target/100.0);
separator;
for (max_depth=1;max_depth<=DEPTH_LIMIT;max_depth++) {
memset(&start_time,0,sizeof(start_time));
memset(&end_time,0,sizeof(end_time));
function_call_count = 0;
solution_count = 0;
gettimeofday(&start_time,NULL);
find_sums(1,0,0);
gettimeofday(&end_time,NULL);
elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);
printf("Depth: %i\nElapsed time: %ld microseconds\nSolutions found: %ld\nCall count: %ld\n",
max_depth,elapsed,solution_count,function_call_count);
separator;
}
separator;
return 0;
}
```

Below is essentially the same code, except the array of "numbers" isn't just the numbers: I maintain the entire record. That doesn't really impact performance much, because positional references are simple address arithmetic, though there is a slight overhead in all of the element dereferencing.

I'm not sure how important this is, but since other code and discussion above has mentioned special complexity for negative values, I'll note how they are handled here: the initial sort is all that is required. Negative values don't add any additional complexity or require any special consideration with the methods used in this code. They do, of course, add more possible options to explore, but you don't have to code for them; they just fall out naturally. That may be a feature or a failing, depending on your point of view.

As for implementing this with your database: you could either write the data from sql to a file and then use this pretty much as-is, or replace the read file function with a read database routine to build the array.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <getopt.h>
#include <ctype.h> /* isprint() */
#define DEPTH_LIMIT 5
#define SOURCE_ROWS 5470
#define BASE_TARGET 112.41
#define MAX_SOLUTIONS 50000
#define separator printf("--------------------------------\n")
struct rh_vcrt0 {
int vcr_emp_empresa,
vcr_num_empregado,
vcr_vnc_cod_venc,
vcr_valor_bruto;
};
int target = (BASE_TARGET * 100.00 + 0.5); /* round to cents rather than truncate */
char *in_filename;
unsigned rows = 0;
struct rh_vcrt0 *numbers;
struct rh_vcrt0 solution_value[DEPTH_LIMIT];
long max_solution[DEPTH_LIMIT];
FILE *outfile;
unsigned long solution_count = 0;
unsigned long function_call_count = 0;
long sum;
unsigned max_depth;
int intcomp(const struct rh_vcrt0 *a, const struct rh_vcrt0 *b) {
return a->vcr_valor_bruto - b->vcr_valor_bruto;
}
void read_input_file (void) {
FILE *infile;
int ret, i, j;
float temp_value;
struct rh_vcrt0 temp;
infile = fopen (in_filename ? in_filename : "sqldata.txt", "rt");
numbers = malloc (SOURCE_ROWS * sizeof (struct rh_vcrt0));
while (rows < SOURCE_ROWS) {
ret = fscanf (infile, "(%d,%d,%d,%f)\n",
&temp.vcr_emp_empresa,
&temp.vcr_num_empregado,
&temp.vcr_vnc_cod_venc,
&temp_value);
if (ret <= 0) break;
// For efficiency, adjust and cast all values to integer cents;
// the small offset ensures proper truncation
temp.vcr_valor_bruto = (int) (100.0 * temp_value + (temp_value >0?0.1:-0.1) );
memcpy(&numbers[rows],&temp,sizeof(struct rh_vcrt0));
rows++;
}
if (rows < SOURCE_ROWS) numbers = realloc (numbers, rows * sizeof (struct rh_vcrt0));
fclose(infile);
qsort(numbers, rows, sizeof(struct rh_vcrt0), (int (*)(const void *,const void *))intcomp);
for (i=0;i<DEPTH_LIMIT;i++) {
for (j=rows-1;j>=rows-1-i;j--)
max_solution[i] += numbers[j].vcr_valor_bruto;
}
//printf("Read %ld rows from input file\n", rows);
//for (i=0;i< rows;i++) printf("%4i: %i %i\n",i,target,numbers[i].vcr_valor_bruto);
}
void print_solution(unsigned depth, struct rh_vcrt0 *last) {
int i;
for (i=1;i<depth;i++) {
fprintf(outfile,"(%d,%d,%d,%.2f) ",
solution_value[i-1].vcr_emp_empresa,
solution_value[i-1].vcr_num_empregado,
solution_value[i-1].vcr_vnc_cod_venc,
solution_value[i-1].vcr_valor_bruto/100.0);
}
fprintf(outfile,"(%d,%d,%d,%.2f)\n",
last->vcr_emp_empresa,
last->vcr_num_empregado,
last->vcr_vnc_cod_venc,
last->vcr_valor_bruto/100.0);
}
// Given a depth and current position in the list
// what is the smallest sum that could be generated?
// Since the list is sorted, the smallest sum is
// found by adding the next elements in sequence
long min_solution(unsigned depth, unsigned index) {
int j;
long temp=0;
for (j=index+1;j<=index+max_depth-depth;j++)
temp += numbers[j].vcr_valor_bruto;
return temp;
}
void find_sums(unsigned depth, long partialsum, unsigned index) {
unsigned i, j, lo, hi;
function_call_count++;
if (solution_count >= MAX_SOLUTIONS) return;
if (depth < max_depth) {
for (i=index;i< rows;i++) {
sum = partialsum + numbers[i].vcr_valor_bruto;
if (sum > target) return;
// Are we so far from the target that no values could
// possibly get us there? If so, give up.
if (sum + max_solution[depth-1] < target) break;
if (sum + min_solution(depth,i) > target) break;
// We're close enough that it's worth trying to continue
memcpy(&solution_value[depth-1], &numbers[i], sizeof(struct rh_vcrt0));
find_sums(depth+1,sum,i+1);
}
}
else {
lo = index;
hi = rows-1;
while (hi-lo > 1) {
i = (hi+lo)>>1;
if (partialsum + numbers[i].vcr_valor_bruto < target) lo = i;
else if (partialsum + numbers[i].vcr_valor_bruto > target) hi = i;
else break;
}
if (partialsum + numbers[lo].vcr_valor_bruto == target) i = lo;
else if (partialsum + numbers[hi].vcr_valor_bruto == target) i = hi;
if (partialsum + numbers[i].vcr_valor_bruto == target) {
++solution_count;
if (outfile) print_solution(depth,&numbers[i]);
// We found a hit but there may be more than one element in the set with the same value
// so count the neighbors until we exhaust them all
for (j=i-1;(j > index) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j--) {
++solution_count;
if (outfile) print_solution(depth,&numbers[j]);
}
for (j=i+1;(j < rows) && (numbers[j].vcr_valor_bruto == numbers[i].vcr_valor_bruto);j++) {
++solution_count;
if (outfile) print_solution(depth,&numbers[j]);
}
}
}
}
int main(int argc, char* argv[])
{
struct timeval start_time, end_time;
unsigned long elapsed;
int c;
while ((c = getopt(argc, argv, ":t:i:o:?")) != -1) {
switch (c) {
case 't':
{ double t = atof(optarg); target = (int) (t * 100.00 + (t < 0 ? -0.5 : 0.5)); }
break;
case 'i':
in_filename = optarg;
break;
case 'o':
outfile = fopen (optarg, "w");
break;
case '?':
printf("\n%s [-t target] [-i infile] [-o outfile]\n",argv[0]);
printf("-t target is a floating point value, default=%.2f\n",BASE_TARGET);
printf("-i input file name, default sqldata.txt\n");
printf("-o output file name\n");
return 0;
default:
if (isprint (optopt))
fprintf(stderr, "Unknown option `-%c'.\n", optopt);
else
fprintf(stderr,"Unknown option character `\\x%x'.\n",optopt);
return 0;
}
}
read_input_file();
separator;
printf("target sum: %f\n",target/100.0);
separator;
for (max_depth=1;max_depth<=DEPTH_LIMIT;max_depth++) {
memset(&start_time,0,sizeof(start_time));
memset(&end_time,0,sizeof(end_time));
function_call_count = 0;
solution_count = 0;
gettimeofday(&start_time,NULL);
find_sums(1,0,0);
gettimeofday(&end_time,NULL);
elapsed = (end_time.tv_sec * 1000000 + end_time.tv_usec) - (start_time.tv_sec * 1000000 + start_time.tv_usec);
printf("Depth: %i\nElapsed time: %ld microseconds\nSolutions found: %ld\nCall count: %ld\n",
max_depth,elapsed,solution_count,function_call_count);
separator;
}
separator;
return 0;
}
```

Here are timings for 50000 limit

```
--------------------------------
target sum: 112.410000
--------------------------------
Depth: 1
Elapsed time: 1 microseconds
Solutions found: 0
Call count: 1
--------------------------------
Depth: 2
Elapsed time: 667 microseconds
Solutions found: 122
Call count: 2953
--------------------------------
Depth: 3
Elapsed time: 83836 microseconds
Solutions found: 50001
Call count: 889437
--------------------------------
Depth: 4
Elapsed time: 1366452 microseconds
Solutions found: 50000
Call count: 18983831
--------------------------------
Depth: 5
Elapsed time: 8349257 microseconds
Solutions found: 50000
Call count: 126549329
--------------------------------
--------------------------------
```

What about a random 50000 vs (nearly) always finding the lowest matches?

As to duplicates on the binary search: the code normally exits just ABOVE the last value, which should cover duplicates. Or you can simply search upward until you stop getting a dup; in SQL, with the table indexed, that will be extremely fast.

Complexity/speed of random selection would depend on degree of randomness required.

It wouldn't be hard to randomize selection of the first value, but you'll still get clusters if there are multiple combinations that use that value.

If eliminating those clusters were also required, then I suppose that could be ameliorated by exiting as soon as a solution was found.

Since finding one solution doesn't seem to take very long, with a limit of 50000 the random hopping around, while slower, might still be acceptable. My concern is that if a particular target and data set did not have 50000 solutions, forcing early exit might cause a loss of records.

I guess in that case there could always be a second pass to do an exhaustive search if the randomized early exits couldn't fill the 50000 buckets.

However, it might simply be easier to over-generate and then randomly select from what was found. I can do half a million in 3 minutes. Randomly select 10% of those. You wouldn't even need to store the values as you go, the random selection could be done on the fly. Simply apply a 10% randomizer to each solution found, if you're within 10%, then that solution counts, if not, skip it and move on. Doing that would also easily handle the duplicates without requiring early exit. The downside is you'll still get randomized results that are skewed toward the beginning of your set.

If the randomization must be evenly distributed across the entire field of all combinations, then it'll be more complex and slower. Still possible, but I'm skeptical of the practical value of that.

Especially for the "bad cases", which haven't been mentioned much recently. So let me put my doomsayer cap back on: all of the "good and fast" stuff above is for data and a target that play nicely.

We only have 122 two-value solutions but could have had millions. We only have 256735 three-value solutions but could have had billions; and so on.

Anything that forces exploration of more combinations will always be susceptible to intractability when the combinations explode. If I have to select 50000 combinations randomly distributed out of quadrillions of legal solutions then it's simply not going to happen in any kind of practical time.

Skewed results? maybe

Skewed results with convenient data? more likely

I think the original list of values could be stored "sorted" randomly (in SQL, by NEWID(), for example).

The only list that needs to be sorted by the values is the 2-level mini-joined lookup table. The other levels could all be random, thus inherently providing a random result with only a single processing of 50,000 (or however many) rows.

Purely in SQL, that would appear to take, based on these tests, typically 30 secs or less.

It can be done faster in C, no question, but only the OP can decide if the speed difference is worth the switch to C, and the resulting loss of control/understanding.

ASKER

Randomization isn't important. It's just a little something my programmer mindset would prefer, but it's not essential.

As for the large datasets: most of the time, hopefully, there won't be that many results returned. I said 50k records as a threshold, but it might as well be 5k. I'd say that about 85% of the time there will be under 50-100 valid solutions, 10% of the time 100-1k, and 5% of the time 1k+; in those last cases we usually search for a different solution.

So, if it helps, you can set the threshold to 5k, or even to 1k. It's not likely we'll be searching further, for various reasons, not the least of which is the time required to effectively search all those solutions.

I'd guess that this would speed ScottPletcher's code as well.

To effectively use this, you need to drop the 112.41 value. That was from a real-life example of a value we needed to find, but on the smallest set. You can use a much higher value for the larger set. Something around 7k-25k will be good enough, even though we sometimes need values as high as 100k.

Using a fixed target lets us compare results, to be sure we're not only making progress on speed but can also check for correctness.

However, once the code is verified it's easy to switch.

If you compile the C code above (simply gcc the file with defaults), you can run the resulting binary with the -t flag to specify whatever target you want.

For example, save the code above to somefile.c, then:

```
gcc somefile.c -o testit
./testit -t 12345.67
```

Basically, I'm saying that randomization is free, in that it does NOT cost any noticeable time difference. You could even give them the option of ordered or random results.

@ScottPletcher

>>Purely in SQL, that would appear to take, based on these tests, typically 30 secs or less.

can you post a complete example of what you suggest?

I agree it's doable but I don't share the same degree of optimism on the results; I'd love to test something and be proven wrong though.

@Cluskitt

>> I'd say that about 85% of the time, there will be about under 50-100 valid solutions.

Is the example data above abnormal? I've tried lots of random target values and unless I pick a gargantuan number there are almost always many combinations that add up to the target sum.

If you're right, and your normal real data and real targets produce so few results, then randomization isn't needed at all, because you can search exhaustively.

>>> randomization is free

I don't get how you made that leap. Please post some code that shows that.

Either you're not correct, or I'm misunderstanding what you mean - I'm very willing to believe the fault is mine.

Possibly the confusion is just imprecision in the use of the word "randomization".

Exactly what are you randomizing and to what degree? Are you saying you can pick 50000 (or 1000 or whatever) results with an even distribution from all possible valid solutions?

And, you can do that even for inconvenient data? (e.g. select N results out of 154 quadrillion?)

if so, I think we'd all love to see that

Or, are you saying you can pick a skewed randomized selection of a subset of results for convenient data? (pick N results out of 32 million) an example here would be nice, the asker even asked for this as a nice-to-have. I'll post skewed randomization for my code later.

Or are you saying something else entirely?

again, testable code is always more helpful than words and leaves zero ambiguity

ASKER

Well, I must admit that we cheat a bit. We sometimes clear a few codes in advance on the larger sets to make them more manageable. For example, if you pick the given sample of data and remove all vcr_vnc_cod_venc IN (1,4,5), you'll be effectively removing LOTS of possible combinations. Not only that, we usually remove a few of the most common negative codes. We do this gradually, though. We first remove codes 1 and 5. Then, when we still don't have a reasonable result we remove a few more, etc. It isn't always clear cut, so it can't be hard-coded, which is why I haven't mentioned it before, other than in passing.

So, normally, we'll work with smaller sets (about 2/3) of the larger set. Come to think of it, I actually think it would be simpler to remove the codes first and then start adding them back if we don't get proper results. This would have the advantage of eliminating a lot of similar values, I suppose (one of the codes we also eliminate fairly often is 6, which almost always has the same value).

If you want to test a reasonably common filter, remove these codes and test with the remaining ones: 1,4,5,6,12 and all between 139-169 (which will remove most of the negatives)

As I stated for "randomization", I'm saying that the original rows can be inserted in a random order prior to doing the searches. SQL seems to strongly bias its results toward rows that appear earlier in the result set. Thus, by storing the initial values randomly, we're likely to get a random sample of results. Certainly much more so than storing the values sorted.

As to 30 secs: the 5-level results took ~20 secs in SQL, as stated by the OP. Based on the question's comments, I took that to be a typical case. The extra 10 secs is a typical IT "cushion" :-) .

Probably not a bad idea to eliminate dups from the input values: since we're only generating a subset of matches anyway, dropping dups gives us more unique input values to match before reaching the search limit.

>>> I'm saying that the original rows can be inserted in a random order prior to doing the searches.

That's effectively picking the first values randomly, which, as noted above, should work just fine but the results are still skewed with clusters.

But because of the initial randomization the skewing will hopefully not be a recognizable pattern.

But if there are a lot of valid combinations that happen to use value X, then no matter how randomized your method of picking X is, once you have it, unless you also include early exit, you'll still get a big cluster.

Easy example: X=0.

Similarly, every 3-value solution clusters around 0 in 4-value solutions.

Other values can also produce clusters but 0 is an obvious candidate for maximal clustering.

Perhaps, but given that the requestor has repeatedly stated that he doesn't need ultra-scientific randomization, just a better sampling of results, then I don't see why an initial random insert isn't sufficient.

I give up on your constant assaults on my perception, implying I have no clue what I'm talking about.

I believe my perception here is more accurate than yours: the requestor doesn't need

*absolute* randomization. In fact, he explicitly stated that: it's just a "nicety". So the almost-no-cost randomization of the initial values seems a practical solution to me, with or without clustering.

>>> requestor has repeatedly stated that he doesn't need ultra-scientific randomization, just a better sampling of results, then I don't see why an initial random insert isn't sufficient.

I didn't say or imply it wasn't sufficient. I was pointing out that your suggested sql method was functionally the same as my suggested c method.

I didn't say skewed clusters were wrong, but I did think it was appropriate to note the distinction between those results and random selection from a uniform distribution.

>>> I give up on your constant assaults on my perception, implying I have no clue what I'm talking about.

I'm truly sorry you have that impression, but I've neither said nor implied any such thing. In fact, I've gone out of my way to assume the fault is mine for misunderstanding, and have pointed out multiple times when we were following the same approach, as in this case. So, if I did somehow slight you, I was poking at myself as well; but again, it was never my intent.

Please, please reread my comments and contact me offline; I'll be happy to discuss whatever you want and then come back here to publicly apologize and correct or remove anything that is inappropriate.

>> I believe my perception here is more accurate than yours

Yes, I know - you've been very clear about that

But more or less accurate is irrelevant, since we agree. I also recognized that randomization was asked for as a "nice-to-have" and not a strict requirement.

I hadn't been pursuing randomization at all because it was only a "nicety".

After I posted my last example - YOU asked about how I would produce randomized results. So I was answering you. Then when your response wasn't clear to me (again I was willing to take the fault there) I asked for more clarification. I would have taken the entire sidebar offline but since Cluskitt "did" ask about it (even as a "nicety") it seemed appropriate to keep it in thread.

Then when you suggested you already had a solution to do it, I asked you to post it. Even if it is just a "nicety" - if you really have a viable solution, please share it. Give us something we can test.

I'm not attacking you - again I apologize that you've somehow come to that conclusion.

I'm simply asking you to post what you suggest. What you say sounds great, so now that you've described it - please show it.

No hostility is intended, implied, nor should be construed in any way from anything in this post. My requests are not challenges to your skill, your knowledge, your perception, or any other slight you may have felt to this point.

My requests are for my own curiosity (I don't know TSQL nearly as well as you do, I'm hoping to learn something) and for the asker who is looking for a complete solution and I don't think there is one available in the posts above.

I didn't say or imply it wasn't sufficient. I was pointing out that your suggested sql method was functionally the same as my suggested c method.

I didn't say skewed clusters were wrong, but I did think it was appropriate to note the distinction between those results and random selection from a uniform distribution.

>>> I give up on your constant assaults on my perception, implying I have no clue what I'm talking about.

I'm truly sorry you have that impression, but I've neither said nor implied any such thing. In fact, I've gone out of my way to assume the fault is mine for misunderstanding, and have pointed out multiple times when we were following the same approach, as in this case. So, if I did somehow slight you, I was poking at myself as well; but again, it was never my intent.

Please, please reread my comments and contact me offline. I'll be happy to discuss whatever you want and then come back here to publicly apologize and correct/remove anything that is inappropriate.

>> I believe my perception here is more accurate than yours

Yes, I know - you've been very clear about that.

But more or less accurate is irrelevant since we agree. I also recognized that randomization was asked for as a "nice-to-have" and not a strict requirement.

I hadn't been pursuing randomization at all because it was only a "nicety".

After I posted my last example - YOU asked about how I would produce randomized results. So I was answering you. Then when your response wasn't clear to me (again I was willing to take the fault there) I asked for more clarification. I would have taken the entire sidebar offline but since Cluskitt "did" ask about it (even as a "nicety") it seemed appropriate to keep it in thread.

Then when you suggested you already had a solution to do it, I asked you to post it. Even if it is just a "nicety" - if you really have a viable solution, please share it. Give us something we can test.

I'm not attacking you - again I apologize that you've somehow come to that conclusion.

I'm simply asking you to post what you suggest. What you say sounds great, so now that you've described it - please show it.

No hostility is intended, implied, nor should be construed in any way from anything in this post. My requests are not challenges to your skill, your knowledge, your perception, or any other slight you may have felt to this point.

My requests are for my own curiosity (I don't know TSQL nearly as well as you do, and I'm hoping to learn something) and for the asker, who is looking for a complete solution - I don't think there is one available in the posts above.

Hi Scott,

I must admit to my own confusion here. How does randomization tie in with your initial step(s) of omitting the impossible low-values sums?

Kent


Hi Kent,

can you compile and run my code above on your hardware?

I'm curious to get a benchmark when the hardware differences are removed.

Since I have multiple versions above, let me know which one you tried (although they should be roughly equal).

If you post your latest, I'll reciprocate.


>> How does randomization tie in with your initial step(s) of omitting the impossible low-values sums? <<

In order to do the binary search and __quickly__ find the limiting value, the values __must__ be sorted. For the sake of convenience, I used that same sorted-in-value-order table as in my earlier post.
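As a rough illustration of that pruning step (a Python sketch, not the actual TSQL from the thread; the helper name and sample data are mine), sorting the values lets a single binary search discard every row whose value alone already exceeds the target:

```python
from bisect import bisect_right

def prune_candidates(values, target):
    """Return only the values that could still take part in a sum equal
    to `target`, assuming all values are positive. Any single value
    larger than the target can never be part of such a sum, so one
    binary search on the sorted list finds the limiting value."""
    vals = sorted(values)
    cutoff = bisect_right(vals, target)  # first index with value > target
    return vals[:cutoff]

# Using the question's sample data and the target 2490.52:
print(prune_candidates([123.52, 15421.23, 2367, 4352.12], 2490.52))
# → [123.52, 2367]  (the 15421.23 and 4352.12 rows are eliminated
#    before any join or combination work starts)
```

The same cutoff extends to larger combinations: once a partial sum is fixed, a binary search for `target - partial_sum` bounds which remaining rows are still worth trying.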

For example, instead of

t1.ID_Value <> t2.ID_Value

use

t1.ID_Value > t2.ID_Value

You'll get the same effect of not joining a row to itself, but you'll also prevent joining a to b and b to a.
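The effect of that inequality join can be sketched outside SQL as well (illustrative Python; the sample keys and values are mine, and floating-point sums are compared with a tolerance): requiring an ordering between the two row keys yields each unordered pair of rows exactly once.

```python
from itertools import combinations

# ID -> CalcValue, loosely based on the question's sample rows
rows = {1: 123.52, 2: 2367.0, 3: 4352.12}

# Equivalent of joining on t1.ID_Value > t2.ID_Value: each unordered
# pair of distinct rows appears exactly once, never both (a, b) and (b, a).
pairs = list(combinations(rows, 2))
sums = {p: rows[p[0]] + rows[p[1]] for p in pairs}

target = 2490.52
matches = [p for p, s in sums.items() if abs(s - target) < 1e-9]
print(matches)  # → [(1, 2)]
```

The same ordering condition chains to larger joins (`t2.ID_Value > t3.ID_Value`, and so on), so three-row and four-row combinations stay duplicate-free for the same reason.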