[Okta Webinar] Learn how to a build a cloud-first strategyRegister Now

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 950
  • Last Modified:

NewID() not being random for each record.

I am trying to create a random address generator in TSQL.  I thought I had done it, but I get non random results.  I am on Ms SQL Server 2005.

I know you cannot use RAND() or NEWID() in a user defined function , but the work around is to put them in a view and reference the view from the user defined function.  You can see this in the code section.

I use both of those in a function called UDF_RandomAddress and it seems to work except I sometimes get repeated addresses.  Below is an example of the result I get

DebtorID      Street1      Street2      City      State      Zipcode
13268      62051 Whitehorse Bay      Apt. 271      Toledo      OH      43614
19278      27680 Revere Lane            Hazel Park      MI      48030
396498      P.O. Box 173            Pensacola      FL      32534
425924      14167 Snug Harbor Drive      Ste. 45065      Tuscarawas      OH      44682
459657      14167 Snug Harbor Drive      Ste. 45065      Tuscarawas      OH      44682

Notice how the bottom two records have the same address.  Anyone have any idea why? or how to fix it?
CREATE VIEW [dbo].[vNewID]
AS
SELECT     NEWID() AS NID
 
CREATE VIEW [dbo].[vRandNumber]
AS
SELECT RAND() as RandNumber
 
CREATE FUNCTION [dbo].[UDF_RandomNumber] 
(
	@Low INT,
	@High INT
)
RETURNS INT
AS
BEGIN
	-- Declare the return variable here
	DECLARE @Result INT
 
	-- Add the T-SQL statements to compute the return value here
	SELECT @Result = CONVERT(INT,@Low + (@High-@Low)*(SELECT RandNumber FROM vRandNumber))
 
	-- Return the result of the function
	RETURN @Result
END
 
CREATE FUNCTION [dbo].[UDF_NewID] ()
RETURNS uniqueidentifier
AS
BEGIN
	-- Declare the return variable here
	DECLARE @Result uniqueidentifier
 
	-- Add the T-SQL statements to compute the return value here
	SELECT @Result = (SELECT NID FROM vNewID)
 
	-- Return the result of the function
	RETURN @Result
END
 
CREATE FUNCTION [dbo].[UDF_RandomAddress] 
(	
	-- Add the parameters for the function here
	@AddressType VARCHAR(25) = null,
	@State VARCHAR(2) = null 
)
RETURNS @AddressTable table (
			[Street1] VARCHAR(30),
			[Street2] VARCHAR(30),
			[City] VARCHAR(30),
			[State] VARCHAR(2),
			[Zipcode] VARCHAR(10)
)
AS
BEGIN
	DECLARE @Street1 VARCHAR(30)
	DECLARE @Street2 VARCHAR(30)
	DECLARE @City VARCHAR(30)
	DECLARE @Zipcode VARCHAR(10)
 
	-- Capitalize to simlify the code.
	SET @State = UPPER(@State)
 
	-- Select a state in case none were selected.
	IF (@State IS NULL)
		SET @State = UPPER((SELECT TOP 1 Abbreviation FROM Scrambler.dbo.Common_States ORDER BY Scrambler.dbo.UDF_NewID()))
 
	-- If @State is listed as Unknown, then we will randomly select one.
	IF (@State = '--')
		SET @State = UPPER((SELECT TOP 1 Abbreviation FROM Scrambler.dbo.Common_States ORDER BY Scrambler.dbo.UDF_NewID()))
 
	-- Make sure the @State is in the list of states, if not 
	IF ((SELECT Abbreviation FROM Scrambler.dbo.Common_States WHERE UPPER(Abbreviation) = @State) IS NULL)
		SET @State = ''
 
	-- Capitalize @AddressType to simplify the code
	SET @AddressType = UPPER(@AddressType)
 
	-- Select an Address Type in case none were selected.
	IF (@AddressType IS NULL)
		SET @AddressType = (SELECT TOP 1 AddressType FROM Scrambler.dbo.Enum_AddressTypes WHERE AddressType <> 'Random' ORDER BY Scrambler.dbo.UDF_NewID())
 
	-- Make sure the @AddressType is in the list of states, if not 
	IF ((SELECT AddressType FROM Scrambler.dbo.Enum_AddressTypes WHERE UPPER(AddressType) = UPPER(@AddressType)) IS NULL)
		SET @AddressType = ''
 
	IF ((@State = '') OR (@AddressType = ''))
	BEGIN
		SET @Street1 = ''
		SET @Street2 = ''
		SET @City = ''
		SET @State = ''
		SET @Zipcode = ''
	END
	ELSE
	BEGIN
		-- If we want a random address type, then select it.
		IF (@AddressType = 'RANDOM')
			SET @AddressType = (SELECT TOP 1 AddressType FROM Scrambler.dbo.Enum_AddressTypes WHERE AddressType <> 'Random' ORDER BY Scrambler.dbo.UDF_NewID())
 
		-- Create Single Line Address
		IF (@AddressType = 'ONELINE')
		BEGIN
			SET @Street1 = CONVERT(VARCHAR,(select dbo.UDF_RandomNumber(1000,99990))) + ' ' + 
				(SELECT TOP 1 StreetName FROM Scrambler.dbo.Address_StreetNames ORDER BY Scrambler.dbo.UDF_NewID()) + ' ' +
				(SELECT TOP 1 StreetSuffix FROM Scrambler.dbo.Address_StreetSuffixes ORDER BY Scrambler.dbo.UDF_NewID())
			SET @Street2 = ''
		END
		ELSE IF ((@AddressType = 'TWOLINERESIDENTIAL') OR (@AddressType = 'TWOLINECOMMERCIAL'))
		-- Create two line addresses
		BEGIN
			SET @Street1 = CONVERT(VARCHAR,(select dbo.UDF_RandomNumber(1000,99990))) + ' ' + 
				(SELECT TOP 1 StreetName FROM Scrambler.dbo.Address_StreetNames ORDER BY Scrambler.dbo.UDF_NewID()) + ' ' +
				(SELECT TOP 1 StreetSuffix FROM Scrambler.dbo.Address_StreetSuffixes ORDER BY Scrambler.dbo.UDF_NewID())
 
			IF (@AddressType = 'TWOLINERESIDENTIAL')
				SET @Street2 = 'Apt. ' + CONVERT(VARCHAR,(select dbo.UDF_RandomNumber(100,399)))
			ELSE
				SET @Street2 = 'Ste. ' + CONVERT(VARCHAR,(select dbo.UDF_RandomNumber(1000,99100)))
		END
		ELSE
		-- Create P.O. Box Addresses
		BEGIN
			SET @Street1 = 'P.O. Box ' + CONVERT(VARCHAR,(select dbo.UDF_RandomNumber(5,200)))
			SET @Street2 = ''
		END
 
		-- City and State are selected regardless of address type.
		SELECT Top 1 @City = City, @Zipcode = Zipcode FROM Scrambler.dbo.Address_Zipcodes WHERE State = @State ORDER BY Scrambler.dbo.UDF_NewID()
	END
 
	-- Add the SELECT statement with parameter references here
	INSERT INTO @AddressTable
	SELECT @Street1, @Street2, @City, @State, @Zipcode
 
	RETURN
END

Open in new window

0
DeHaynes
Asked:
DeHaynes
  • 6
  • 5
  • 2
1 Solution
 
8080_DiverCommented:
For openers, no matter how you define your Random Number it is actually what is known in Mathematics as a Psuedo Random Number, so there is a finite chance that you will hit a repeating sequence of "random numbers".
So, as you'[re going along using your Random Number Generator, you are selecting a sequence of components for your address.  If you should happen to get the same basic sequence of random numbers and that sequence is of just the right length (i.e. the same number as the number of components you are selecting), you will select the same options for each of the components.
Also, if you have a finite number of options for any one of those selections, much less for each one, then you increase your odds of repeating the selection.
Picture the flipping of a coin 100 times.  If you look at the sequence of Heads and Tails, you will see what appear to be patterns of repeating sequences.
Now, picture rolling a 6 sided die 100 times.  There may be patterns of repeating sequences of numbers but, because there are 6 options each time rather than 2, the patterns seem shorter and farther apart.
Now, picture rolling a 20 sided die 100 times.  There may be patterns of repeating sequences of numbers but, because there are 20 options each time rather than 2, the patterns seem even shorter and even farther apart.
If you want to generate more random addresses, include more options wherever possible and, perhaps, generate a random number that you use for the house/building number in the street address.  You might also adjust so that you have a street number component, a street name component, and a street type (e.g. Cove, Avenue, Place, Blvd, Terrace, Street, etc.).  In other words, increase the granularity of your selection of the component pieces and increase the number of options.
If you have 4 components, your options are M x N x O x P where M, N, O, and P represent the number of optons for each component.  If there are 5 of each, then you only have 625 possible "addresses".  If, on the tother hand, you add the steet type with the 6 specific ones I indicated, you now have 3750 possible addresses.  If you add a street number of from 1 to 4 digits, you have just increased your number of addresses to almost 37,500,000.
0
 
Anthony PerkinsCommented:
That is because your code is using RAND().  Use NEWID() instead.
0
 
Anthony PerkinsCommented:
On second thoughts, since you have basically limited the results of using something like NEWID() from a virtually unlimited arbitrary result to a paltry range, you are still going to get duplicates that way.
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
Anthony PerkinsCommented:
And finally, I suspect you may find this more reliable:
SQL Data Generator
http://www.red-gate.com/products/SQL_Data_Generator/index.htm
0
 
DeHaynesAuthor Commented:
Thanks, I think.

8080_Diver

"If you want to generate more random addresses, include more options wherever possible and, perhaps, generate a random number that you use for the house/building number in the street address.  You might also adjust so that you have a street number component, a street name component, and a street type (e.g. Cove, Avenue, Place, Blvd, Terrace, Street, etc.).  In other words, increase the granularity of your selection of the component pieces and increase the number of options."

Actually if you look at the code, that is how I generate the streets values, using a table of the top 1000 most common street names and another table with the 17 most common street prefixes.  The City, State and Zip are pulled from an 42,192 record table of zipcodes.  That way I can create a fictional address but keep the demographics the same.

One Line Address - 98990 * 1000 * 17 * 42192 = 71,001,963,360,000 different addresses.
Two Line Residential Address - 98990 * 1000 * 17 * 299 * 42192 = 21,229,587,044,640,000 different addresses.
Two Line Commericial Address - 98990 * 1000 * 17 * 921000 * 42192 = 69,652,926,056,160,000,000 different addresses
PO Box Address - 192 * 42192 = 8,227,440 different addresses.

If you add those together you will get 69,674,226,645,176,227,440 different addresses

I don't think that lack of divirsity is the cause of the problem, but thanks.

0
 
DeHaynesAuthor Commented:
ACperkins

"That is because your code is using RAND(). Use NEWID() instead."

My code uses both RAND() and NEWID(). It uses RAND() to generate a number and uses NEWID() to select a random record.
"On second thoughts, since you have basically limited the results of using something like NEWID() from a virtually unlimited arbitrary result to a paltry range, you are still going to get duplicates that way."
 Like I answered above, I don't think 69,674,226,645,176,227,440 different addresses is a paltry range.
"And finally, I suspect you may find this more reliable:
SQL Data Generator
http://www.red-gate.com/products/SQL_Data_Generator/index.htm"

1.  My company don't want to spend the money.
2.  I am a programmer, so I should be able to do it.
0
 
DeHaynesAuthor Commented:
Guys what I am looking for is if anyone knows why the NewID() funciton wouldn't work on a per record basis, like it is supposed to. If I cannot solve this issue using TSQL.  I will have to drop down to using CLR User Defined Functions.  So I have a plan B, just trying to find why plan A didn't work.

Thanks in advance.
0
 
Anthony PerkinsCommented:
>>Like I answered above, I don't think 69,674,226,645,176,227,440 different addresses is a paltry range.<<
That would be true if you were just using NEWID(), unfortunately as I stated you have limited the results to a range, hence the result.
>>1. My company don't want to spend the money.<<
Consider how much time they have wasted on this already and you will realize they obviously do not value your time very much.  By the time you are through add up all the time you have spent on this and then factor in the cost of the license.  The result will be how much they value your time.

>> I am a programmer, so I should be able to do it<<
You would think so.  Unfortunately and again you cannot implement it that way.  You have basically crippled NEWID() beyond recognition.
0
 
DeHaynesAuthor Commented:
I ended up using a cursor.  When I used a cursor, this fixed the issue with RAND() and GETID().
0
 
Anthony PerkinsCommented:
>>Nobody gave me viable advice.  In fact, they gave me advice to do what I had already done.  <<
Unfortunately that is not true.  You were not prepared to listen.

>>I ended up using a cursor.<<
My condolencies and you should be glad they allow you to use CURSORs in your shop.  They are only allowed here as a last resort where performance is not an issue and no other viable solution can be found.

P.S. Don't forget to compute the hours you spent on this boondogle.  This should give you ammunition to convince your company to spring the "big bucks" next time for a better solution.
0
 
DeHaynesAuthor Commented:
Ok, I am fully prepared to listen. You said,

"That is because your code is using RAND(). Use NEWID() instead."

My code used NEWID() anytime it was required to get a random file from a recordset.  So I could understand your point if I was getting duplicate numbers for something like the numberic part of the address.  But the street name of the address is pulled from a 1000 record list of street names and the street suffixes is 100 or so records and both of those use NEWID().  So yes, my code uses RAND() but but is also uses NEWID() where appropriate and that is what this post is about.  Why isn't NEWID() generating a new street name between records?
You also said,
"On second thoughts, since you have basically limited the results of using something like NEWID() from a virtually unlimited arbitrary result to a paltry range, you are still going to get duplicates that way."

Your statement is pretty vague. I posted the code I am using and it doesn't reference it anywhere. So please show me an example of what your talking about in my code.

You also said,
">>Like I answered above, I don't think 69,674,226,645,176,227,440 different addresses is a paltry range.<<
That would be true if you were just using NEWID(), unfortunately as I stated you have limited the results to a range, hence the result.
"
To answer this, I refer you to the above paragraph that addressed when you said my code used RAND() was the reason it wasn't working.
You also said,
">>1. My company don't want to spend the money.<<
Consider how much time they have wasted on this already and you will realize they obviously do not value your time very much.  By the time you are through add up all the time you have spent on this and then factor in the cost of the license.  The result will be how much they value your time.
"
Although I do appreciate the initial reference, once I point out that is not an option then this is not your concern and not what this post is about.  It is pointless.
You also said,
">> I am a programmer, so I should be able to do it<<
You would think so.  Unfortunately and again you cannot implement it that way.  You have basically crippled NEWID() beyond recognition.
"
Again I see no refence to any code I posted pointing out what I did wrong.  Seeing has you have said nothing tangible except to use NEWID() which I already did, I don't see how you can say "again you cannot implement it that way."  Please show where I have crippled NEWID() beyond recognition.
You also said,
">>I ended up using a cursor.<<
My condolencies and you should be glad they allow you to use CURSORs in your shop.  They are only allowed here as a last resort where performance is not an issue and no other viable solution can be found.
"
Bingo.  You hit the nail on the head.  This is for a test environment that depends on an third-party, canned application to create the data.  The application doesn't follow database normalization rules or other standard practices.  So there is a ton of data duplication in it.  The unfortunate situation is that my company uses this system as their core database.  So when we write other applications to interact with it, we have to copy the production database into a test environment.  In order to minimize the exposure that might happen by having real data in a test environment, I am writing an application that goes through the database and scrambles the data.  Hence the need to randomly generate names, addresses, phone numbers, SSNs etc.  So I have basically accomplished it.  I wanted to speed it up as it runs for 7 hours.  Hence my post here.  
I was really hoping to use this to build up some more of my SQL Server knowledge but I found your posts to be derogatory and unfounded.  I tried to point that out with the hope that you would back up your statements and then we could move forward, but I found the next responses more of the same.  Seeing as I have not gotten any knowledge that helped me solve this problem, I don't see any reason to award the points to anyone.
0
 
8080_DiverCommented:
As I initially stated:
For openers, no matter how you define your Random Number it is actually what is known in Mathematics as a Psuedo Random Number, so there is a finite chance that you will hit a repeating sequence of "random numbers".
Another characteristic of the kinds of random numbers you are dealing with is that, since they are between 0 and 1, it is entirely possible to get 2 "different" random numbers that are just not that different.  You calculation of the value for your street selection is such that it is possible to generate 2 relatively close "random numbers" that resultin the same outcome.  This, combined with the simple fact that any random number generator will tend to have good seeds (those that produce "long sequences of numbers") and bad seeds (those that produce "short sequence of numbers"), means that you can expect to have repeating instances of your "random" choices . . . no matter how large the theoretical population from which you are sampling.
If you absolutely have to have X unique addresses, then I would crank up your address generator, putting the addresses in a table, and generate about twice as many addresses as you need.  Then simply take the top X from a DISTINCT selection of them.  That should preclude your having any duplicates in the X addresses you use.
You might also want to play with providing different seed values on the initial call of the Rand() function.  Now, as to what seed might be really good . . . well, that is where the "art" of this activity comes into play.  Having done studies of random number generators, all I can tell you is that a) what is a good or bad seed for one RNG is not neccessarily a good or bad seed for the next RNG and b) unless you play with it a lot and have a pretty strong background in the mathematics involved, it is far easier to treat the symptoms (e.g try some different seeds or over generate and cull, as I suggested above) than it is to try to actually figure out what would be a Good Seed.
0
 
Anthony PerkinsCommented:
>>I don't see any reason to award the points to anyone.<<
Fair enough.

Good luck.
0

Featured Post

Nothing ever in the clear!

This technical paper will help you implement VMware’s VM encryption as well as implement Veeam encryption which together will achieve the nothing ever in the clear goal. If a bad guy steals VMs, backups or traffic they get nothing.

  • 6
  • 5
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now