Link to home
Start Free TrialLog in
Avatar of SybaseUk
SybaseUk

asked on

Sybase IQ generating distinct rows for dummy table

Hi,

I am trying to generate a table containing 25 columns and populate with some sample data. My aim is to get around 300k rows. I have generated the table with 300 distinct rows of sample data using powerdesigner. Does anybody know a nice way of blowing up this table to 300k rows without duplicating any of the data?? I tried using the STUFF function to take the last 2 digits of the ID and append to each column as follows: -

insert into test1 (C1)
select  stuff(C1,5,2,substring(convert(char, ID), length(convert(varchar, ID))-1,2))
from test1

But this method gives me duplicate values when I execute distinct count. Can some1 please help?

But this


create table TEST1 (
   ID                   int                            null default autoincrement,
   C1                   char(6)                        null,
   C2                   char(6)                        null,
   C3                   char(6)                        null,
   C4                   char(6)                        null,
   C5                   char(6)                        null,
   C6                   char(6)                        null,
   C7                   char(6)                        null,
   C8                   char(6)                        null,
   C9                   char(6)                        null,
   C10                  char(6)                        null,
   C11                  char(6)                        null,
   C12                  char(6)                        null,
   C13                  char(6)                        null,
   C14                  char(6)                        null,
   C15                  char(6)                        null,
   C16                  char(6)                        null,
   C17                  char(6)                        null,
   C18                  char(6)                        null,
   C19                  char(6)                        null,
   C20                  char(6)                        null,
   C21                  char(6)                        null,
   C22                  char(6)                        null,
   C23                  char(6)                        null,
   C24                  char(6)                        null,
   C25                  char(6)                        null
);
 
insert into TEST1 (C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19, C20, C21, C22, C23, C24, C25) values ('H0A9HT', 'SGYO53', '27V8L2', '9EM508', ' MX12I', '5KP34L', '2S40AO', '2XIYW6', '0G5COU', 'T2QLOI', '4AQXA8', 'YIA80X', ' 6L8S2', 'IFR7NN', '8223LK', 'XHEKTW', '304GVR', 'W EMVT', '08G1JX', 'LT3CVB', 'GT38TX', '21DHHX', '5MKAWW', 'BNCISN', '1XDDJ2');
 
insert into TEST1 (C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19, C20, C21, C22, C23, C24, C25) values ('RU812M', 'M6 7WG', '3HUU4A', 'QU55G9', 'RJMNOY', 'WRDJMH', 'V8KGCT', 'UWONM1', '6C3O Q', '9Y  6G', '2H927L', 'TVUT U', 'R5RG0N', '9SUJ37', 'VTOO2Y', '4XPLMN', 'OLMPU8', 'WU6VLD', 'P70A3X', ' 12VPO', '0MI3TX', 'G2E1W2', 'NNBMI5', 'NOH5C9', '73MKPD');

Open in new window

Avatar of alpmoon
alpmoon
Flag of Australia image

You can use rand() function for that purpose. There is a short article about the usage:

http://www.sybase.com/detail?id=648

You can develop a similar loop to create rows.
Avatar of SybaseUk
SybaseUk

ASKER

Hi,

Thanks for that info. How would I create the loop to insert new rows (i.e. double the rows with distinct values) instead of updating the existing column values?

Thanks!
Just as an example:

declare @max int
declare @cur int

select @cur = 1, @max = 300000 -- test with a smaller number first as I haven't tested it

while @cur < @max
begin

insert into TEST1 values (str(rand()), str(rand(@cur)), str(rand()), ....)

select @cur = @cur + 1

end

You can use different seed values to manipulate random generator as well, I just put @cur as seed as an example.
You have another problem as well.  Doing 300K individual inserts in Sybase IQ will take a very long time.  IQ simply is not made for fast individual record manipulation.

If you have to do all this inside IQ, create the table without any indexes (except the default FPs of course) and populate it.  Start with a few thousand rows of random data.  For now, leave the IDs blank.  Then, use INSERT/SELECT to double the number of rows each time.  If you start with 9375 seed rows and double it 5 times, you will wind up with 300K.

After you have all the rows you need, you can go back and assign the IDs in bulk.    You can either use something like UPDATE table SET IDcol = NUMBER(*).  Alternatively, you can use the ROWID with something like... UPDATE table SET IDcol = ROWID(table).  Of course, if you need a character prefix or something else, you can add that into the expression you use in the UPDATE

Of course, you can always use scripts or another database (ASA or ASE) to generate the data, BCP it out and do a LOAD TABLE to get it into IQ.

Regards,
Bill
Hi Bill,

Many thanks for your reply. As always, your help is very much valued & appreciated.

Its a little more tricky, because my table has 25 columns which are char(6). I designed it with powerdesigner and also used PD to generate 300 rows of sample data. I have been using INSERT/SELECT to double the rows within IQ but the problem is that I end up with duplicate rows. I then tried to use RAND() within my INSERT/SELECT statement but I am still ending up with duplicate rows.

insert into test1 (C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19, C20, C21,
C22, C23, C24, C25)
select  stuff(C1,5,2,substring(convert(char, rand()), 2,2))
select  stuff(C2,5,2,substring(convert(char, rand()), 2,2))
select  stuff(C3,5,2,substring(convert(char, rand()), 2,2))
........................
etc

Running this statement twice now gives me 1200 rows with only 900 distinct values in each column.

select top 5 c1 from test1

c1
'H0A9HT'
'RU812M'
'FACK8O'
'UBC86S'
'3REOAK'

select top 5 c1 from test1 where ROWID("Test1") > 300

c1
'H0A9.5'
'RU81.5'
'FACK.5'
'UBC8.5'
'3REO.5'

select top 5 c1 from test1 where ROWID("Test1") > 600

c1
'H0A9.9'
'RU81.9'
'FACK.9'
'UBC8.9'
'3REO.9'

select top 5 c1 from test1 where ROWID("Test1") > 900

c1
'H0A9.9'
'RU81.9'
'FACK.9'
'UBC8.9'
'3REO.9'

Any ideas why my data is being duplicated and not giving me different values using the rand() function???

Thanks
applmoon,

I need to use stuff to append the first 2 s.f's of the generated random value to the existing column C1, C2, C3 respectively.....

So I can't use literal values, I must use INSERT/SELECT. However, I am still getting duplicates........
ASKER CERTIFIED SOLUTION
Avatar of grant300
grant300

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I agree that "Using a RAND function will not guarantee the uniqueness". But possibility of uniqueness depends on how that function is used. A dice has 6 sides, a two digit number has only 99 possibilities. If you use rand function 6 times to generate 6 different characters, you would have 308,915,776 possibilities and the chance of uniqueness would be high enough even though it is not guaranteed.

I think using some seed rows is not a good idea.
The problem is that the procedure SybaseUK is using does not generate six individual random characters for each of the 25 fields for each row.  It looks as if he is simply stuffing a couple of characters in.  Besides, if you did try calling RAND 150 times for each row, I suspect there would be a performance issue.  I am pretty sure that RAND is an ASA function, not an IQ function, so all the data would have to pass back and forth between the two engines.  This kind of thing should be better documented in IQ but as of now it is more experiencial than anything else.

Bottom line, if you need guaranteed uniqueness, you cannot use random functions.

Regards,
Bill
I might be wrong, but my understanding is this is a one off process to run and prepare a sample data for a test. So performance shouldn't be an issue. Also data can be prepared on another server and then loaded into IQ. Also if it is the case and there are a few duplicate rows, they can be deleted.

You are right saying that instead of using seed rows generating all rows at once would be a better approach. I am not really sure whether using rand function to create six character data is easier or your method is easier. But, I think they are both valid methods.

There might be around 300 duplicate rows out of 300K rows when rand() is used, but they can be deleted easily. Also it depends on how many charaters are used. If small caps and numbers are used, the possibility of uniqueness would be much higher.
Hi alpmoon,

I have tried using the loop method you have suggested but I am still getting the same results as I did without the loop. Can you please look at my code and tell me whats wrong?

declare @max int
declare @cur int
declare @char char(2)

select @cur = 1, @max = 3

while @cur < @max
begin
select @char = substring(convert(char, rand(@cur)),2,2)

insert into Test1 (C1)
select
stuff(C1,5,2,@char) from Test1

select @cur = @cur + 1

end

As you can see I am using stuff to append the last 2 random numbers to replace the last 2 characters of C1. I cannot use literal values therefore I am using INSERT/SELECT...

Any help appreciated please....
Hi Bill,

alpmoon is right in that the sample data does not need to 100% unique and I agree with him. However, the loop method isn't working presently and I am stuck. I desperately need to get around 300k rows and as alpmoon suggested, I can easily remove any duplicates.

The problem I am facing is that the rand() function is not producing random values which I can then stuff into C1, C2, C3 etc....

SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Bill you are correct that RAND() will not guarantee uniqueness but a combination of 25 columns with random values appended should be unique (across a single row)....

INSERT/SELECT is doing this process in batches of 300 and thats the reason I think I am not getting a true random value. I need a way of stuffing a new random value (even though there are only 99 possibilities) into the respective columns for every row.

Can you suggest any solution please?

P.S. I'm just trying to generate a modest size table for testing purposes...
Now, I understand what you want better. If you only want to generate only two digit, you should multiply the result of rand() with at least 100:

insert into test1 (C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19, C20, C21,
C22, C23, C24, C25)
select  stuff(C1,5,2,substring(convert(char, rand()*100), 2,2))
select  stuff(C2,5,2,substring(convert(char, rand()*100), 2,2))
select  stuff(C3,5,2,substring(convert(char, rand()*100), 2,2))
....

or use module additionally:

insert into test1 (C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19, C20, C21,
C22, C23, C24, C25)
select  stuff(C1,5,2,substring(convert(char, rand()*100000 % 100), 2,2))
select  stuff(C2,5,2,substring(convert(char, rand()*100000 % 100), 2,2))
select  stuff(C3,5,2,substring(convert(char, rand()*100000 % 100), 2,2))
......
alpmoon,

Yeah, this gets a log clearer now that we know sybaseUK does not each column value to be unique; just the entire row.

That code will work 90 percent of the time but you still have a problem with single digit numbers.  You need to ensure there is a leading zero or you will wind up with about 10 percent of the values being too short.

Try...
select stuff(C1,5,2,right('0'+convert(varchar,rand()*100000 % 100),2))
Come to think of it, STUFF is slower than substring and concatenation.  We could use...
select substring(C1,1,4)+right('0'+convert(varchar,rand()*100000 % 100),2)

Finally, instead of 100000, I would choose a prime number in that vicinity.  Something tells me you'll get better (more random) results than using a multiplier that is an even multiple of the modulo.

Regards,
Bill
hi guys,

i was trying to convert the rand() output to char and then take a substring of that i.e. 2 digits and stuff it into C1.

so for example:

select stuff('H0A9HT', 5,2,substring(convert(char, rand()), 2,2))

would produce the result:

H0A935

the last 2 digits being random values
Hi,

I get a syntax error from executing the following statement: -

select stuff(C1,5,2,right('0'+convert(varchar,rand()*100000 % 100),2))

If I remove % 100 then the statement executes but I dont think its doing what it should be????

select stuff(C1,5,2,right('0'+convert(varchar,rand()*100000),2))
The text of the syntax error might be useful but I think I know what the issue is.

The modulo operator only works on integers.  You will have to add a CONVERT(INT,..... to the expression to get it to function properly.

You know, since you are going to grab two digits by position and then convert them into a string anyway, why bother with the modulo.  Try something like....

select stuff(C1,5,2,substring(convert(varchar,rand(),3,2)))

Regards,
Bill
Hi Bill,

This is what I began with!

I appreciate that I will get some column values which will be duplicates but the chances of the entire row being a duplicate (or identical) should be should be very less likely. But I am getting many rows which are completely identical so I know there is a bug in my code or the loop.

I was thinking the purpose of the loop is to do single row by row inserts and thus giving me a true random value each time. But I am not doing single row by row inserts, rather I am doubling the amount of rows.

How do I do single row by row inserts but only change/replace the last 2 characters of each value with a random value????

This is getting very frustrating now.

Thanks for all your help.  
Hi Guys,

Please disregard all the previous messages. I have decided to scrap my 300 rows of sample data and start from scratch.

My table has an ID (auto-identity) column and 25 columns of char(6). I would like to populate this table with random data and have atleast 300k rows that are unique....please advise the best way to go about

many thanks!
I am correcting the syntax there:

declare @max int
declare @cur int
declare @char6 char(6)

select @cur = 1, @max = 300

while @cur < @max
begin
select @char6 = char((rand()*100000 % 26) + 65)+char((rand()*100000 % 26) + 65) +
  char((rand()*100000 % 26) + 65)+char((rand()*100000 % 26) + 65) +
  char((rand()*100000 % 26) + 65)+char((rand()*100000 % 26) + 65)

insert into Test1 (C1)
select @char6

select @cur = @cur + 1

end

I have tested it for one column and it works fine(covers A to Z). You can repeat @char6 variable as many as you can.
alpmoon,

I have tried the above and I get a syntax error. If I execute this on its own: -

select char((rand()*100000 % 26) + 65)

I get syntax error near '(end of line)' message. Please help me resolve this.
Hi guys,

I now have the following and it works fine but it takes a very long time and also eventually runs out of space. Is there a quicker way of doing this?

declare @max int
declare @cur int
//declare @char6 char(6)

select @cur = 1, @max = 100000

while @cur < @max
begin


insert into Test1 (C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12, C13, C14, C15, C16, C17, C18, C19, C20, C21,
C22, C23, C24, C25)
select stuff(inttohex(convert(int,rand()*100000)),1,3,'A'),
       stuff(inttohex(convert(int,rand()*100000)),1,3,'B'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'C'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'D'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'E'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'F'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'G'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'H'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'I'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'J'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'K'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'L'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'M'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'N'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'O'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'P'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'Q'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'R'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'S'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'T'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'U'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'V'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'W'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'X'),
 stuff(inttohex(convert(int,rand()*100000)),1,3,'Y')

select @cur = @cur + 1

end
Do you think it would be easier to write a shell script to create a data file containing delimited rows and then load these into IQ???

Can anyone please help me write a C shell script to create a test data file??

Thanks,
P.S. I need your help to write a shell script that will create the data (300K rows) each containing a unique id and random data made up of 25 columns of char(6).....

Please let me know whether I shud ask a new question and I would be happy to do that....

Many Thanks,
Actually the code I have tested was for ASE. If you have an ASE server. You can extend it for 25 columns and run for 300K rows. Then bcp out the data and insert into IQ. Inserting data into IQ one row at a time is not a good idea at all as Bill said before.

Or if you have ASA, you can do it in ASA as well. If you don't both, directing output to a text file and then eliminating unnecessary lines and tabs by sed might be a better option without writing a shell script.
All I can do is repeat my suggestion from early in the thread....

So what I would do is create the list of numbers from 1 to 300,000, then generate each column value from that by adding the column prefix (A-Z) and calculating a hex number from the ID (1-300000) and the offset for that column.You can do this by creating a single column table with the IDs and a helper table with 26 columns of INT and the offset for each one.  You can then generate the entire target table in an INSERT/SELECT by joining your ID table with the Column Offset table and using an expression to generate each column value, e.g.INSERT INTO TargetTableSELECT ID,                'A' + RIGHT(INTTOHEX(ID+CS.COL_A),5),                'B'.........FROM ID_Table,            Column_Offset CS
This will do all the inserts in one pass and be at least 100 and probably closer to 1000 times faster than the individual inserts you were doing.

How you create the ID_Table rows is irrelevant.  You could create a list of 300000 numbers in a shell script, stuff it in a flat file, and then do a LOAD TABLE to get it in.

Regards,
Bill
Hi,

I have now inserted the data into ASE and I have unique rows of data...

I want to bcp the data out and have it pipe delimited and row delimited by new line. Can you please show me how I can format the data so it is pipe delimited and row delimited by new line? I am using windows.....

I will then load this flat file into IQ

Thanks,
Hi,

Can anybody explain why a simple query like the following takes around 6 seconds in ASE but is taking around 4 mins in IQ????????????

select C1, count(*)
from test1
group by C1
having count(*) > 1

I have loaded the IQ table by generating the random data in ASE and BCP'ing the data out. I am very confused why it is taking so long for this query to run.....

Can some1 please have a look at my query plan also and help me?
queryplan.doc
That one sounds like a whole new topic and question.  Why don't you close off this one an open a new one.  Give the points to Alpmoon.

Regards,
Bill
I am splitting the points 50/50 as both of you participated equally in the final solution..