i have an array with randomly placed numbers from 1 to 1000000 ,there is an integer duplicated in the array what is the easiest , perf. efficient way to find that value

Hi,

I have been recently asked a question by someone and it drove me crazy as i could not give an efficient answer to a question.

so the trick is; there is an array with numbers  placed from 1 to 1 million randomly but one of the integers inside the array has been duplicated and i need to find this value in the most efficient way. what could this be?

Obvious answer is to get the first value of the array and compare this with the rest and then move to the next one etc.. but this is not what is wanted, there needs to be a way to very efficiently retrieve this but how?

could you help me on this puzzle like algorithm question? thanks
nicedoneAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

ozoCommented:
If the array contains 1000001 entries, with one each of all integer values from 1 to 1000000, except for one value that is duplicated, then you can take the sum of all elements and subtract 500000500000
0
Dave BaldwinFixer of ProblemsCommented:
In databases, the 'standard' method is to sort the rows on the field you want to find the duplicates in.  Then duplicates will be in consecutive rows.  If you can sort the array, you can look for the occurrence of the same value in two consecutive positions.  I don't know C# but PHP has an array_sort function that I can use.  Sort functions are usually much more efficient than going thru the array and making comparisons to each and every element.  Then I can use 'foreach' to go thru the sorted array to find items that are duplicated.
0
ozoCommented:
If the array can contain some missing values as well as a duplicate value, then you might use a hash table or a bit vector to record numbers that you have seen and detect a duplicate when you find a number you have seen before
0
Get your problem seen by more experts

Be seen. Boost your question’s priority for more expert views and faster solutions

phoffricCommented:
You could have a bit array of 1 million bits. As you go through each integer value in the array, you see if the corresponding bit in the array has been set. If it has, then you now know the index in the array where the duplicate value has occurred and you know the value. If the bit is not set, then set it.

For a faster routine, and for only 8 times the cost in memory resources, you could change the bit array into a 1 million byte array and 0 means not yet found, and 1 means, the number is already found.
0
Michael FowlerSolutions ConsultantCommented:
You could move the array to a dictionary and either stop when you get an error or even get a count of the number of duplicates eg

Dictionary dict = new Dictionary<int, int>();    
foreach (int i in array)
{
   if (dict.ContainsKey(i))
      dict[i]++;
    else
        dict.Add (i, 1)
 }

Open in new window

Dictionary dict = new Dictionary<int, int>(); 
int dup = 0   
foreach (int i in array)
{
     if (dict.ContainsKey(i))
         dup = i;
         break;
}

Open in new window

0
ozoCommented:
If there are no missing values, and exactly one duplicate value, and no values outside of the range 1 to 1000000,
and you don't want to overflow a 32 bit int while adding up all numbers from 1 to 1000000,
then you could use xor instead of add, and xor the result with 1000000.
0
Michael FowlerSolutions ConsultantCommented:
Second code block from my post is wrong. 2nd time lucky

Dictionary dict = new Dictionary<int, int>();
int dup = 0 
foreach (int i in array)
{
   if (dict.ContainsKey(i))
      dup = i;
      break;
    else
       dict.Add (i, 1)
 }

Open in new window

0
Garnik GiloyanCommented:
Hello

You can find the duplicate values using the following algorithm.

int my_array[] = { 50, 51, 32, 50, 7, 3, 7 };
	int size = sizeof(my_array) / sizeof(my_array[0]);

	for (int i = 0; i < size; i++)
	{
		for (int j = i + 1; j < size; j++)
		{
			if (my_array[i] == my_array[j])
				printf(" %d ", my_array[i]);
		}
	}

Open in new window


But also you can use 'Set' Associative containers which will ignore the duplicate values
0
AndyAinscowFreelance programmer / ConsultantCommented:
I'd have suggested a dictionary - lot of the work already done for you - and check as you add to the dictionary if that key is in use.  BUT, Michael74 has already suggested that.
I note you request efficient.
You also talk of 1 to 1 million as the range of numbers being stored in the array but it is no clear how many slots there are in your array.  10, 100, 1000, 100,000,000 (extreme but not excluded from your wording as I understand it)
If the array has relatively few values then the 'double' looping is easy to code and understand and probably pretty much as efficient as the dictionary.  However with 100 million entries in the array (that still leaves a reasonable possibility of some numbers only appearing once) then that is very poor on performance.
0
Richard LeeSoftware EnthusiastCommented:
Hi,

Here is an approach in C# that is more commonly used in an RDBMS. Its effective and quite fast. Simply using Grouping.

var Duplicates = Numbers
	.GroupBy(x => x)
	.Where(x => x.Count() > 1)
	.Select(x => x.Key)
	.ToArray();

Open in new window


Any numbers found in the resulting array are duplicates. If you want to know how often each number of duplication then you can try the following query.

var Duplicates = Numbers
	.GroupBy(x => x)
	.Where(x => x.Count() > 1)
	.Select(x => new { Number = x.Key, Count = x.Count() })
	.ToArray();

Open in new window


A different approach from the previous answers but very fast.

Hope this helps.
0
käµfm³d 👽Commented:
I'd have to agree with ozo's and phoffric's suggestion if performance is truly critical. A bit array would have you only looping the array once to locate the duplicate. So in terms of runtime cost, you should be looking at O(n). Memory cost is, of course, greater because you'd be introducing a second array/vector to track duplicate values.

Again, if performance is paramount, then the LINQ and sorting suggestions will not be as performant. LINQ has a bit of overhead associated with it--that's on top of the overhead you introduce by creating the new anonymous objects in the GroupBy's and Select's. Sorting is a bit better on performance than LINQ would be, but you've still got the overhead of executing the sort. You can mitigate some of this by choosing the most-performant sort, but you'll still have some degree of overhead.

The dictionary approach is the bit array solution reimagined. It's easier to read, but it may not be as performant. You'd have to time it.

If performance is not critical, then the LINQ or dictionary solutions are good ways to go in terms of readability. These days, readability trumps performance save for a select few areas.

As a C# example of what ozo and phoffric suggest, you could use the BitArray class:

e.g.

using System;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication34
{
    class Program
    {
        static void Main(string[] args)
        {
            const int MAX_VALUE = 1000000;
            BitArray vector = new BitArray(MAX_VALUE + 1, false);
            int[] array = SetupArray(MAX_VALUE);
            Stopwatch clock = Stopwatch.StartNew();

            for (int i = 0; i < array.Length; i++)
            {
                int value = array[i];

                if (vector[value])
                {
                    Console.WriteLine("Duplicate value: {0}", value);
                }
                else
                {
                    vector[value] = true;
                }
            }

            clock.Stop();

            Console.WriteLine("Elapsed MS: {0}", clock.ElapsedMilliseconds);

            System.Diagnostics.Debugger.Break();
        }

        static int[] SetupArray(int maxItems)
        {
            Random rand = new Random();
            int[] array = new int[maxItems];

            for (int i = 1; i < maxItems; i++)
            {
                rand.Next(rand.Next());
                array[i] = rand.Next(1, maxItems + 1);
            }

            return array;
        }
    }
}

Open in new window


There's probably a way you can pull of the same with an array of integers and bit masking, but I don't know that it would be any more performant. That would be something you'd have to time.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Richard LeeSoftware EnthusiastCommented:
Bringing the figures to the table might clarify matters a bit. Using an example of 1 million randomly generated numbers such as the following yielded an elapsed time of 531.0978ms.

class Program
{
	static void Main(string[] args)
	{
		var rand = new Random((int) DateTime.Now.Ticks);
		var sw = new Stopwatch();

		var array = new int[1000000];

		for (int i = 0; i < array.Length; i++)
			array[i] = rand.Next(1000000);

		sw.Start();
			
		var duplicates = array
			.GroupBy(x => x)
			.Where(x => x.Count() > 1)
			.Select(x => x.Key)
			.ToArray();

		sw.Stop();

		Console.WriteLine("Elapsed Time: {0}", sw.Elapsed.TotalMilliseconds);
		Console.ReadKey();
	}
}

Open in new window


Hope this helps clarify matters.
0
käµfm³d 👽Commented:
@DaTribe

Oh, it's on like Donkey Kong now...    ; )

My code is slower because of the printing to the console. Take that bit out, and you're looking at around 15 ms (on my i7-3770).
0
phoffricCommented:
1/2 second is a very long time, IMO.
0
Richard LeeSoftware EnthusiastCommented:
Okay I concede. :)
0
ozoCommented:
I'm still wondering if there is a reason the question author said "randomly placed numbers from 1 to 1000000" instead of "random numbers from 1 to 1000000"
0
nicedoneAuthor Commented:
@kaufman ,thank you for your great help and code of course !! it was really helpful
@ozo ,thanks for your help
@phoffric, thank you for your support
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Algorithms

From novice to tech pro — start learning today.

Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.