Solved

i have an array with randomly placed numbers from 1 to 1000000 ,there is an integer duplicated in the array what is the easiest , perf. efficient way to find that value

Posted on 2014-11-19
17
200 Views
Last Modified: 2016-02-18
Hi,

I have been recently asked a question by someone and it drove me crazy as i could not give an efficient answer to a question.

so the trick is; there is an array with numbers  placed from 1 to 1 million randomly but one of the integers inside the array has been duplicated and i need to find this value in the most efficient way. what could this be?

Obvious answer is to get the first value of the array and compare this with the rest and then move to the next one etc.. but this is not what is wanted, there needs to be a way to very efficiently retrieve this but how?

could you help me on this puzzle like algorithm question? thanks
0
Comment
Question by:nicedone
  • 4
  • 3
  • 2
  • +6
17 Comments
 
LVL 84

Expert Comment

by:ozo
ID: 40453837
If the array contains 1000001 entries, with one each of all integer values from 1 to 1000000, except for one value that is duplicated, then you can take the sum of all elements and subtract 500000500000
0
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 40453866
In databases, the 'standard' method is to sort the rows on the field you want to find the duplicates in.  Then duplicates will be in consecutive rows.  If you can sort the array, you can look for the occurrence of the same value in two consecutive positions.  I don't know C# but PHP has an array_sort function that I can use.  Sort functions are usually much more efficient than going thru the array and making comparisons to each and every element.  Then I can use 'foreach' to go thru the sorted array to find items that are duplicated.
0
 
LVL 84

Expert Comment

by:ozo
ID: 40453892
If the array can contain some missing values as well as a duplicate value, then you might use a hash table or a bit vector to record numbers that you have seen and detect a duplicate when you find a number you have seen before
0
 
LVL 32

Assisted Solution

by:phoffric
phoffric earned 100 total points
ID: 40453899
You could have a bit array of 1 million bits. As you go through each integer value in the array, you see if the corresponding bit in the array has been set. If it has, then you now know the index in the array where the duplicate value has occurred and you know the value. If the bit is not set, then set it.

For a faster routine, and for only 8 times the cost in memory resources, you could change the bit array into a 1 million byte array and 0 means not yet found, and 1 means, the number is already found.
0
 
LVL 23

Expert Comment

by:Michael74
ID: 40454166
You could move the array to a dictionary and either stop when you get an error or even get a count of the number of duplicates eg

Dictionary dict = new Dictionary<int, int>();    
foreach (int i in array)
{
   if (dict.ContainsKey(i))
      dict[i]++;
    else
        dict.Add (i, 1)
 }

Open in new window

Dictionary dict = new Dictionary<int, int>(); 
int dup = 0   
foreach (int i in array)
{
     if (dict.ContainsKey(i))
         dup = i;
         break;
}

Open in new window

0
 
LVL 84

Assisted Solution

by:ozo
ozo earned 100 total points
ID: 40454197
If there are no missing values, and exactly one duplicate value, and no values outside of the range 1 to 1000000,
and you don't want to overflow a 32 bit int while adding up all numbers from 1 to 1000000,
then you could use xor instead of add, and xor the result with 1000000.
0
 
LVL 23

Expert Comment

by:Michael74
ID: 40454225
Second code block from my post is wrong. 2nd time lucky

Dictionary dict = new Dictionary<int, int>();
int dup = 0 
foreach (int i in array)
{
   if (dict.ContainsKey(i))
      dup = i;
      break;
    else
       dict.Add (i, 1)
 }

Open in new window

0
 

Expert Comment

by:Garnik Giloyan
ID: 40454282
Hello

You can find the duplicate values using the following algorithm.

int my_array[] = { 50, 51, 32, 50, 7, 3, 7 };
	int size = sizeof(my_array) / sizeof(my_array[0]);

	for (int i = 0; i < size; i++)
	{
		for (int j = i + 1; j < size; j++)
		{
			if (my_array[i] == my_array[j])
				printf(" %d ", my_array[i]);
		}
	}

Open in new window


But also you can use 'Set' Associative containers which will ignore the duplicate values
0
Free Trending Threat Insights Every Day

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 
LVL 44

Expert Comment

by:AndyAinscow
ID: 40454374
I'd have suggested a dictionary - lot of the work already done for you - and check as you add to the dictionary if that key is in use.  BUT, Michael74 has already suggested that.
I note you request efficient.
You also talk of 1 to 1 million as the range of numbers being stored in the array but it is no clear how many slots there are in your array.  10, 100, 1000, 100,000,000 (extreme but not excluded from your wording as I understand it)
If the array has relatively few values then the 'double' looping is easy to code and understand and probably pretty much as efficient as the dictionary.  However with 100 million entries in the array (that still leaves a reasonable possibility of some numbers only appearing once) then that is very poor on performance.
0
 
LVL 18

Expert Comment

by:Richard Lee
ID: 40455212
Hi,

Here is an approach in C# that is more commonly used in an RDBMS. Its effective and quite fast. Simply using Grouping.

var Duplicates = Numbers
	.GroupBy(x => x)
	.Where(x => x.Count() > 1)
	.Select(x => x.Key)
	.ToArray();

Open in new window


Any numbers found in the resulting array are duplicates. If you want to know how often each number of duplication then you can try the following query.

var Duplicates = Numbers
	.GroupBy(x => x)
	.Where(x => x.Count() > 1)
	.Select(x => new { Number = x.Key, Count = x.Count() })
	.ToArray();

Open in new window


A different approach from the previous answers but very fast.

Hope this helps.
0
 
LVL 74

Accepted Solution

by:
käµfm³d   👽 earned 300 total points
ID: 40455318
I'd have to agree with ozo's and phoffric's suggestion if performance is truly critical. A bit array would have you only looping the array once to locate the duplicate. So in terms of runtime cost, you should be looking at O(n). Memory cost is, of course, greater because you'd be introducing a second array/vector to track duplicate values.

Again, if performance is paramount, then the LINQ and sorting suggestions will not be as performant. LINQ has a bit of overhead associated with it--that's on top of the overhead you introduce by creating the new anonymous objects in the GroupBy's and Select's. Sorting is a bit better on performance than LINQ would be, but you've still got the overhead of executing the sort. You can mitigate some of this by choosing the most-performant sort, but you'll still have some degree of overhead.

The dictionary approach is the bit array solution reimagined. It's easier to read, but it may not be as performant. You'd have to time it.

If performance is not critical, then the LINQ or dictionary solutions are good ways to go in terms of readability. These days, readability trumps performance save for a select few areas.

As a C# example of what ozo and phoffric suggest, you could use the BitArray class:

e.g.

using System;
using System.Collections;
using System.Diagnostics;

namespace ConsoleApplication34
{
    class Program
    {
        static void Main(string[] args)
        {
            const int MAX_VALUE = 1000000;
            BitArray vector = new BitArray(MAX_VALUE + 1, false);
            int[] array = SetupArray(MAX_VALUE);
            Stopwatch clock = Stopwatch.StartNew();

            for (int i = 0; i < array.Length; i++)
            {
                int value = array[i];

                if (vector[value])
                {
                    Console.WriteLine("Duplicate value: {0}", value);
                }
                else
                {
                    vector[value] = true;
                }
            }

            clock.Stop();

            Console.WriteLine("Elapsed MS: {0}", clock.ElapsedMilliseconds);

            System.Diagnostics.Debugger.Break();
        }

        static int[] SetupArray(int maxItems)
        {
            Random rand = new Random();
            int[] array = new int[maxItems];

            for (int i = 1; i < maxItems; i++)
            {
                rand.Next(rand.Next());
                array[i] = rand.Next(1, maxItems + 1);
            }

            return array;
        }
    }
}

Open in new window


There's probably a way you can pull of the same with an array of integers and bit masking, but I don't know that it would be any more performant. That would be something you'd have to time.
0
 
LVL 18

Expert Comment

by:Richard Lee
ID: 40455913
Bringing the figures to the table might clarify matters a bit. Using an example of 1 million randomly generated numbers such as the following yielded an elapsed time of 531.0978ms.

class Program
{
	static void Main(string[] args)
	{
		var rand = new Random((int) DateTime.Now.Ticks);
		var sw = new Stopwatch();

		var array = new int[1000000];

		for (int i = 0; i < array.Length; i++)
			array[i] = rand.Next(1000000);

		sw.Start();
			
		var duplicates = array
			.GroupBy(x => x)
			.Where(x => x.Count() > 1)
			.Select(x => x.Key)
			.ToArray();

		sw.Stop();

		Console.WriteLine("Elapsed Time: {0}", sw.Elapsed.TotalMilliseconds);
		Console.ReadKey();
	}
}

Open in new window


Hope this helps clarify matters.
0
 
LVL 74

Expert Comment

by:käµfm³d 👽
ID: 40455946
@DaTribe

Oh, it's on like Donkey Kong now...    ; )

My code is slower because of the printing to the console. Take that bit out, and you're looking at around 15 ms (on my i7-3770).
0
 
LVL 32

Expert Comment

by:phoffric
ID: 40456294
1/2 second is a very long time, IMO.
0
 
LVL 18

Expert Comment

by:Richard Lee
ID: 40456400
Okay I concede. :)
0
 
LVL 84

Expert Comment

by:ozo
ID: 40456422
I'm still wondering if there is a reason the question author said "randomly placed numbers from 1 to 1000000" instead of "random numbers from 1 to 1000000"
0
 

Author Closing Comment

by:nicedone
ID: 40463427
@kaufman ,thank you for your great help and code of course !! it was really helpful
@ozo ,thanks for your help
@phoffric, thank you for your support
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Suggested Solutions

For those of you who don't follow the news, or just happen to live under rocks, Microsoft Research released a beta SDK (http://www.microsoft.com/en-us/download/details.aspx?id=27876) for the Xbox 360 Kinect. If you don't know what a Kinect is (http:…
The greatest common divisor (gcd) of two positive integers is their largest common divisor. Let's consider two numbers 12 and 20. The divisors of 12 are 1, 2, 3, 4, 6, 12 The divisors of 20 are 1, 2, 4, 5, 10 20 The highest number among the c…
Excel styles will make formatting consistent and let you apply and change formatting faster. In this tutorial, you'll learn how to use Excel's built-in styles, how to modify styles, and how to create your own. You'll also learn how to use your custo…
When you create an app prototype with Adobe XD, you can insert system screens -- sharing or Control Center, for example -- with just a few clicks. This video shows you how. You can take the full course on Experts Exchange at http://bit.ly/XDcourse.

743 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

13 Experts available now in Live!

Get 1:1 Help Now