Link to home
Start Free TrialLog in
Avatar of sudhakar_koundinya
sudhakar_koundinya

asked on

Help needed to make this collections method much efficient

 private java.util.List[] getURLList(java.util.List list) {
    Hashtable ht = new Hashtable();
    for (int i = 0; i < list.size(); i++) {
      String url = list.get(i).toString();
      String array[] = url.split("/");

      String domain = array[0];
      if (array[0].toLowerCase().startsWith("http")) {
        domain = array[2];
        domain = domain + array[4];
      }
      else {
        domain = domain + array[2];
      }

      ArrayList sublist = (ArrayList) ht.get(domain);
      if (sublist == null) {
        sublist = new ArrayList();
      }
      sublist.add(url);
      ht.put(domain, sublist);
    }
    return (List[]) Collections.list(ht.elements()).toArray(new ArrayList[0]);
  }


Thanks
Koundinya
SOLUTION
Avatar of CEHJ
CEHJ
Flag of United Kingdom of Great Britain and Northern Ireland image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I used Lists of type URL as they're more versatile, but you may want to use String
Avatar of sudhakar_koundinya
sudhakar_koundinya

ASKER

suppose if the list contains both http and https based urls then how can we handle them??
Not quite sure what you mean. They'd appear together under the same host list
list will be some thing like this


            al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0101.EML");
          al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0102.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0103.EML");
      al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0104.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Deleted Items/TEST Message 0056.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Deleted Items/TEST Message 0055.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0105.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0106.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0107.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0108.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0109.EML");
        al.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0110.EML");
        al.add("http://192.12.0.26/exchange/t1/Inbox/testmailfromAD.EML");
        al.add("http://192.12.0.3/exchange/vijaya/Inbox/testattachment.EML");
        al.add("https://192.12.0.38:403/exchange/Administrator/Inbox/TEST Message 01.EML");
        al.add("https://192.12.0.3/exchange/Administrator/Inbox/TEST Message 011.EML");
        al.add("https://192.12.0.3/exchange/Administrator/Inbox/TEST Message 012.EML");
        al.add("https://192.12.0.38:403/exchange/Administrator/Inbox/TEST Message 013.EML");
(slightly unprettily printed, but you'll get the idea ;-))
following is the TestCase created with my method and your method by adding url.getProtocol() as extra

Now your  results are much faster

My Case : above 7 Seconds
Your Case : above 5 Seconds

Is it possible to reduce still??

Thanks
Sudha



import junit.framework.*;
import java.util.*;
import java.net.*;
import junit.extensions.*;

/**
 * A sample test case, testing <code>java.util.ArrayList</code>.
 *
 */
public class TestURLList extends TestCase {
      
      ArrayList vect=new ArrayList();
      public TestURLList(String name) {
            super(name);
      }

      public static void main (String[] args) {
            junit.textui.TestRunner.run (suite());
      }
      protected void setUp() {
      }

      public static Test suite() {
            return new TestSuite(TestURLList.class);
      }
      public void testURLS()
      {
            for(int i=0;i<100;i++)
            {
                  for(int j=0;j<1000;j++)
                  {
                        vect.add("http://Test"+i+"/"+j+"/"+(i+j));
                  }
            }
            //List[] array=getUrlList(vect); //7501 seconds
            List[] array=getURLList(vect); // 5258 seconds
      }

//7501
        private java.util.List[] getUrlList(java.util.List list) {
    Hashtable ht = new Hashtable();
    for (int i = 0; i < list.size(); i++) {
      String url = list.get(i).toString();
      String array[] = url.split("/");

      String domain = array[0];
      if (array[0].toLowerCase().startsWith("http")) {
        domain = array[2];
        domain = domain + array[4];
      }
      else {
        domain = domain + array[2];
      }

      ArrayList sublist = (ArrayList) ht.get(domain);
      if (sublist == null) {
        sublist = new ArrayList();
      }
      sublist.add(url);
      ht.put(domain, sublist);
    }
    return (List[]) Collections.list(ht.elements()).toArray(new ArrayList[0]);
  }

// 5258 milli seconds
   private static java.util.List[] getURLList(java.util.List list) {
          Hashtable hostToUrls = new Hashtable();
          try {
               for (int i = 0; i < list.size(); i++) {
                    URL url = new URL((String) list.get(i));
                    String hostKey = url.getHost();
                              String protocol=url.getProtocol();
                        //      System.err.println(url.getProtocol() );
                    // Could be a Set if no duplicates required
                    ArrayList urlsForHost = (ArrayList)hostToUrls.get(protocol+hostKey);
                    if (urlsForHost == null) {
                         urlsForHost = new ArrayList();
                         urlsForHost.add(url);
                         hostToUrls.put(protocol+hostKey, urlsForHost);
                    }
                    else {
                         urlsForHost.add(url);
                    }
               }
          }
          catch (MalformedURLException e) {
               return null;
          }
          int numberOfLists = hostToUrls.size();
          return (java.util.List[]) Collections.list(hostToUrls.elements()).toArray(new ArrayList[numberOfLists]);
     }
}
I'll think about it. So you want to keep http and https separate?

>>domain = domain + array[4];

What makes you think the above will always be in range?
there is a possibility of giving urls some thing like this

192.12.0.39/exchange/MailBox/Mail.eml or
http://192.12.0.39/exchange/MailBox/Mail.eml
>> So you want to keep http and https separate?

Yes including domain
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>>The function is called every iteration in your loop

That's not the case in fact

>> Change the return type from List[] to Collection.

Friend,
If that is the case why I should write this method?? I could somply use either ArrayList or Vector. Thanks for helping me

How Efficiently I could change this line??
 return (java.util.List[]) Collections.list(hostToUrls.elements()).toArray(new ArrayList[numberOfLists]);

Thanks and Best Regards
Sudhakar

>>results would be a Collection of Array lists.

My apologies, I didn't read entire comment
Let me try your Idea
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I compqred your idea with my logic (modified version from CEHJ)

And here are the sample results

My code : 1484 milli seconds
Your  code : 1485 milli seconds

And I tested the codes  on  Pentium 4/256 MB Ram. running On multiple tests what I have identified is some times ur idea is fast and some times my idea is fast and that too a difference of few milli seconds. Maximum difference I have identified is +/- 25 milli seconds. This may be because of other processes that are running.  So I don't think returning as collection is right solution.

More or less we are doing the same in both codes. My idea converts the list into list array by calee  whereas ur idea converts the list into list array by caller

Here is the test case

import java.util.*;
import java.net.*;
public class ArrayTest
{
      


   private static Collection getURLList(java.util.List list) {
          Hashtable hostToUrls = new Hashtable();
          try {
               for (int i = 0; i < list.size(); i++) {
                    URL url = new URL((String) list.get(i));
                    String hostKey = url.getHost();
                         String protocol=url.getProtocol();
                    //     System.err.println(url.getProtocol() );
                    // Could be a Set if no duplicates required
                    ArrayList urlsForHost = (ArrayList)hostToUrls.get(protocol+hostKey);
                    if (urlsForHost == null) {
                         urlsForHost = new ArrayList();
                         urlsForHost.add(url);
                         hostToUrls.put(protocol+hostKey, urlsForHost);
                    }
                    else {
                         urlsForHost.add(url);
                    }
               }
          }
          catch (MalformedURLException e) {
               return null;
          }
          int numberOfLists = hostToUrls.size();
       
      return hostToUrls.values();

     }


  private static java.util.List[] getUrlList(java.util.List list) {
          Hashtable hostToUrls = new Hashtable();
          try {
               for (int i = 0; i < list.size(); i++) {
                    URL url = new URL((String) list.get(i));
                    String hostKey = url.getHost();
                         String protocol=url.getProtocol();
                    //     System.err.println(url.getProtocol() );
                    // Could be a Set if no duplicates required
                    ArrayList urlsForHost = (ArrayList)hostToUrls.get(protocol+hostKey);
                    if (urlsForHost == null) {
                         urlsForHost = new ArrayList();
                         urlsForHost.add(url);
                         hostToUrls.put(protocol+hostKey, urlsForHost);
                    }
                    else {
                         urlsForHost.add(url);
                    }
               }
          }
          catch (MalformedURLException e) {
               return null;
          }
          int numberOfLists = hostToUrls.size();
          return (java.util.List[]) Collections.list(hostToUrls.elements()).toArray(new ArrayList[numberOfLists]);
     }


public static void main(String s[])
{
   
      java.util.Date start=new java.util.Date();
ArrayList al=new ArrayList();

  for(int i=0;i<100;i++)
          {
               for(int j=0;j<1000;j++)
               {
                    al.add("http://Test"+i+"/"+j+"/"+(i+j));
               }
          }

/**
   Following is your Idea
*/
Collection c=getURLList(al);
Object[] array=c.toArray();
for(int i=0;i<array.length;i++)
{

      List array1=(List)array[i];
//      System.err.println("Array "+i+"\n");
      for(int j=0;j<array1.size();j++)
      {
            //System.err.println(array1.get(j));
      }
      
} // This had taken 1485 milli seconds


/*
    Following is My Idea
*/

/*List array[]=getUrlList(al);
for(int i=0;i<array.length;i++)
{

      List array1=(List)array[i];
//      System.err.println("Array "+i+"\n");
      for(int j=0;j<array1.size();j++)
      {
            //System.err.println(array1.get(j));
      }
      
}*/ //this had taken 1484 milli seconds

java.util.Date end=new java.util.Date();
System.err.println(end.getTime()-start.getTime());

}
}
i.e. return

List<List<URL>>

(per my example)

or

List<List<String>>
CEHJ,

I think even ur second idea also gives the same results more or less.
Obviously try running these things repeatedly and take averages
How Efficiently I could change this line??
 return (java.util.List[]) Collections.list(hostToUrls.elements()).toArray(new ArrayList[numberOfLists]);
>>List<List<URL>>

The current JDK Version is 1.4.2, So that Idea may not be helpful for me
>>The current JDK Version is 1.4.2, So that Idea may not be helpful for me

No - i mean conceptually - not necessarily literally in code ;-)

What are your ideal return types?
>>What are your ideal return types?

I did not get you
What types do you want returned from the method, i.e. what would you like to work with ideally, or does it not matter?
Basically I need to return List[]

If this is not possible, I am open to know other solutions. Main thing is I need to have optimized solution

At runtime, method  need to work with 10000 - 1000000 mail objects(URLs) depending on conditions.

It's time to sleep now (1.55 AM). I will discuss with you tomorrow from my office PC.

Good Night

BTW,

What is your full name and what is your Native Place?? Just Curious

I am from INDIA
Hi,

Back again
Were you asking me that q?
nope just intimating
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
:-)
ASKER CERTIFIED SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
url.getFile().split("/")[2]; do the job

But I believe it also reduces the performance as discussed  already here
https://www.experts-exchange.com/questions/21109363/StringTokenizer-Vs-String-split.html

Thanks
Sudhakar
As mentioned in another q., could you post the link to a sizeable chunk of test data?
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Thanks CEHJ,

didn't see the comment.

I will test that and let you know
I am seeing the changes you have done, trying to understand the code
Sorry - it's almost completely uncommented. More or less the same as before apart from counting slashes by String index
Although the change  improved the performance

The code is failed for following test case  :(

array.add("http://192.12.0.38/exchange/Administrator/Inbox/TEST Message 0237.EML");
array.add("http://192.12.0.39/exchange/Administrator/Inbox/TEST Message 0237.EML");

LOL sorry - i've forgotten in the meantime that the host is part of the key too!
Just wrote this code

I beg your help to check if this is giving fast results than previous methods at your end

public static final List getUrlList7(List l)
    {
          //http://192.12.0.62/exchange/Administrator
          Hashtable ht=new Hashtable();
          for(int i=0;i<l.size();i++)
          {
                String url=(String)l.get(i);
                int hashCount=0;
                StringBuffer key=new StringBuffer();
                for(int j=0;j<url.length();j++)
                {
                      char c=url.charAt(j);
                      key.append(c);
                      if(c=='/')
                            hashCount++;
                      if(hashCount==5)
                            break;
                      
                }
                List array=(List)ht.get(key.toString());
                if(array==null)
                {
                      array=new ArrayList();
                      ht.put(key.toString(),array);
                }
                array.add(url);
                
          }
          return Collections.list(ht.elements());
   
          
    }

thanks,
sudhakar
nope that is  bad code i think :( giving poor results
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>>nope that is  bad code i think :( giving poor results

Well it's (getUrlListSud2) about 3 times faster than your original one (getURLList):
Following results in milliseconds for 1000 invocations against the test data. If you're wondering what the '3' is it's just me checking that my method is returning the right number of lists ;-)


C:\java\tiger\lang>jr TestUrlLists 1000
Executing getURLList...
40208.0
Executing getUrlListSud2...
12739.0
Executing getUrlList8...
8481.0
3

C:\java\tiger\lang>jr TestUrlLists 1000
Executing getURLList...
39727.0
Executing getUrlListSud2...
13149.0
Executing getUrlList8...
8671.0
3

C:\java\tiger\lang>jr TestUrlLists 1000
Executing getUrlList8...
8572.0
Executing getUrlListSud2...
12437.0
Executing getURLList...
40459.0
3
Ok I willtry that

And I  have done a small change in above code

If my assumtion is correct, this should work fast. I need to check with Exchange administrator for this.

I am assuming exchange is common.

public static final List getUrlList7(List l)
    {
          //http://192.12.0.62/exchange/Administrator
          Hashtable ht=new Hashtable();
          for(int i=0;i<l.size();i++)
          {
                String url=(String)l.get(i);
                int hashCount=0;
                StringBuffer key=new StringBuffer();
                for(int j=4;j<url.length();j++)
                {
                      //char c=url.charAt(j);
                      key.append(url.charAt(j));
                      if(key.charAt(key.length()-1)=='/')
                            hashCount++;
                      if(hashCount==3)
                            j=j+8;
                      if(hashCount==5)
                            break;
                      
                }
                List array=(List)ht.get(key.toString());
                if(array==null)
                {
                      array=new ArrayList();
                      ht.put(key.toString(),array);
                }
                array.add(url);
                
          }
            Enumeration e= ht.keys();
         while(e.hasMoreElements())
          System.err.println(e.nextElement());
          return Collections.list(ht.elements());
   
          
    }
it  improved performance by 300 milli seconds at my side

for folllowing test case

 public Vector getURLS() {
        Vector vect=new Vector();
        for(int i=0;i<100;i++) {
            for(int j=0;j<1000;j++) {
                vect.add("http://Test"+i+"/"+j+"/"+(i+j));
            }
        }
        //List[] array=getUrlList(vect); //7501 seconds
       return vect;
}
i mean this

public Vector getURLS() {
        Vector vect=new Vector();
        for(int i=0;i<100;i++) {
            for(int j=0;j<1000;j++) {
                vect.add("http://Test"+i+"/exchange/"+(i+j));
            }
        }
looking at ur code

if (forwardSlashCount == 2) {
                       // Compute hashcode of host
                       hostHashCode = 31 * hostHashCode + url.charAt(ixChar);
                  }
                  else if (forwardSlashCount == 4) {
                       accountBuffer.append(url.charAt(ixChar));
                  }


what you are trying to do on above snippet??

>>hostHashCode = 31 * hostHashCode + url.charAt(ixChar);

will this be unique for both http and https based ??
Your last is actually slower than your previous:

C:\java\tiger\lang>jr TestUrlLists 10000
Executing getUrlListSud3...
130687.0
Executing getUrlList8...
84761.0
Executing getUrlListSud2...
126582.0
Executing getURLList...
394578.0

>>what you are trying to do on above snippet??

First block takes the hashcode of the host name
Second block distinguishes between http & https

The variable named 'accountBuffer' should better be called 'keyBuffer'
could you please post your whole test code??

I wanted to test the same at my end also

thanks
Sudhakar
>>hostHashCode = 31 * hostHashCode + url.charAt(ixChar);

Why specifically this one?? OR can we use any other logic also?? in identifying the hashcode
>>Why specifically this one??

That's actually the central code of String's hashCode()

>>could you please post your whole test code??

Well there's quite a lot of it, but here are the essentials:

http://www.cehjohnson.uklinux.net/aire/MethodTimer.java

, which is 'work in progress' the obvious thing that's missing (great if you can put it in) is that it should be able to return the function's return value to the caller, as well as the time (simple custom class required) and this is being called from main thus:

            Class[] methodTypes = new Class[]{java.util.List.class};
            Object[] methodArgs = new Object[]{al};
            // Put test method names here
            final String[] METHOD_NAMES = {"getUrlList8", "getUrlListSud9"};
            //final String[] METHOD_NAMES = {"testStringTokenizer", "testSplit"};
            Stack methodStack = new Stack();
            methodStack.addAll(Arrays.asList(METHOD_NAMES));
            Collections.shuffle((java.util.List) methodStack);
            double aggregatedTime = 0.0;
            // Number of invocations passed in on the command line
            final int NUMBER_OF_INVOCATIONS = Integer.parseInt(args[0]);
            while (!methodStack.isEmpty()) {
                  String currentMethodName = (String) methodStack.pop();
                  System.out.println("Executing " + currentMethodName + "...");
                  for (int i = 0; i < NUMBER_OF_INVOCATIONS; i++) {
                        aggregatedTime += MethodTimer.timeStaticMethod("TestUrlLists", currentMethodName, methodTypes, methodArgs);
                  }
                  System.out.println(aggregatedTime);
                  aggregatedTime = 0.0;
            }



it seems ur last code is only the final way. Anyhow I will  wait for some other experts inputs for some time. If nobody  participate here then the points will be ur's ;-)

thanks
sudhakar

 
I'll have a look when I get in the office tomorrow, whats the current state?
CEHJ gave me this code


     public final static List getUrlList8(java.util.List urls) {
          Hashtable hostToUrls = new Hashtable(urls.size());
          StringBuffer accountBuffer = new StringBuffer(32);
          for (int i = 0; i < urls.size(); i++) {
               accountBuffer.setLength(0);
               String url = (String) urls.get(i);
               int forwardSlashCount = 0;
               int hostHashCode = 0;
               for (int ixChar = 4; forwardSlashCount < 5 && ixChar < url.length(); ixChar++) {
                    if (url.charAt(ixChar) == '/') {
                         forwardSlashCount++;
                    }
                    if (forwardSlashCount == 2) {
                         // Compute hashcode of host
                         hostHashCode = 31 * hostHashCode + url.charAt(ixChar);
                    }
                    else if (forwardSlashCount == 4) {
                         accountBuffer.append(url.charAt(ixChar));
                    }
               }
               // end for
               accountBuffer.append(url.charAt(4)).append(hostHashCode);
               final String key = accountBuffer.toString();
               ArrayList urlsForHost = (ArrayList) hostToUrls.get(key);
               if (urlsForHost == null) {
                    urlsForHost = new ArrayList();
                    urlsForHost.add(url);
                    hostToUrls.put(key, urlsForHost);
               }
               else {
                    urlsForHost.add(url);
               }
          }
         

After discussing the different test cases. At this moment it seems to better to me
> At this moment it seems to better to me

how much improvement is that giving you?
Get rid of that Hashtable for a start and replace it with a HashMap.
And give your ArrayList an initial size (you could make it as big as the number of elements remaining in urls).
My first posting took  7  seconds

where as CEHJ's took 1.2 seconds currently for 20 Milliion URLs  grouping

Thanks
Sudhakar
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
On higher end, no doubt that will help me.

curently I am working on exchange server project in which the current task is to download the 25 mails per minute using slide

Now currently I am able to reach the target of 15/min.

So I am currently searching for the code optimizations

thanks
sudhakar

 
>> Get rid of that Hashtable for a start and replace it with a HashMap.

Just discussing the same here
https://www.experts-exchange.com/questions/21111028/Hashtable-Vs-Hashmap.html
>>where as CEHJ's took 1.2 seconds currently for 20 Milliion URLs  grouping

Can you try the last one on the 20,000,000 and see if the result is accurate?
>> Get rid of that Hashtable for a start and replace it with a HashMap.
> Just discussing the same here

It will be faster.

yeah that's true took 985 milli secs
specifying an initial capacity for the list will give you further improvements.
Give it as much as you can afford (<= number of elements left in list)
wheres the list coming from anyway?
and do you have any analysis on the typical breakdown of the data?

CEHJ on several test runs

What I have identified is ur last  code is almost double the spped.

>>wheres the list coming from anyway?

Acutally we are developing the project along with the client. client's programmers are developing their own API and we need to integrate the API with our's. And we did not get their API also yet for testing. So it is difficult to get the actual data.

So what i did is, I have created test environment on 10 machines and    testing my API

>> specifying an initial capacity for the list will give you further improvements.
 I will do that.  assuming 10,000 to 50,1000 urls for single list. How much initial capacity I can give to that??

Suppose if the list contains only 10 urls and if set the size to 100, will it occupy extra  memory??



Thanks
sudhakar
How and where are you planning to set the capacity of the List?
Your last code is really a good Idea :-)
> Suppose if the list contains only 10 urls and if set the size to 100, will it occupy extra  memory??

Sure but your goal is to make it go as fast as you can isn't it, not to use less memory :)
that code does not seem to behave exactly the same as the requirements were specified earlier.
Exactly what is the possible formats that the code would need to handle, and how should they be broken down.

>  client's programmers are developing their own API and we need to integrate the API with
> our's. And we did not get their API also yet for testing. So it is difficult to get the actual data.

If you really need to squeeze your msec out of this, then you should be discussing with them to ensure they pass it to you in a form that is as ideal as possible for processing.

Earlier you said you could also recieve the format:
192.12.0.39/exchange/MailBox/Mail.eml

Is this still the case?

What is the definition of possible formats?
>>192.12.0.39/exchange/MailBox/Mail.eml

Sorry objects,  I forgot to mention that. This case was removed
>>Your last code is really a good Idea :-)

Was that directed at me sudhakar? (sometimes i don't know which of us you're addressing ;-))
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>>Was that directed at me sudhakar?

Yes :-)
Actually, the last code i posted should not be used. It's late here - more later ;-)
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
>> Actually, the last code i posted should not be used. It's late here - more later ;-)


why??
>> load factor used by your Map.

Sorry, I didn't get you
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Objects,
As earlier said, at following link
https://www.experts-exchange.com/questions/21111028/Hashtable-Vs-Hashmap.html

other experts are getting negative results only regarding Hashmap. So I am not yet/able decide which to use
Both Maps use a load factor so you'd get an improvement whichever you use.
Can you email me a large set of test data and I'll do some benchmarking.
>> Can you email me a large set of test data and I'll do some benchmarking

I will try to do that tomorrow. As today is Festival (Rakhee) Holiday in India

thanks
Sudhakar
>>why??

Let's carry on as normal and i'll explain this later:

>>Can you email me a large set of test data

Please don't email anything, but post the link instead.
>>other experts are getting negative results only regarding Hashmap

HashMap will give you (in theory) only slightly better results, as the cost of synchronization is not as high as is commonly imagined
> Let's carry on as normal and i'll explain this later

????

> Please don't email anything

email is fine for me
Moderators generally don't like seeing e-mail IDs or questions being solved through e-mail.
not solving it thru email, just need the test data.
And its too much to post here so email is the easiest, and its not directly relevant to the solution anyway.
Yeah, I agree to that :)
There's no problem with posting a link to test data, and absolutely no need for emails, quite apart from it being discouraged by the moderators.
ok guys,

I have uploaded the testurls zip file at http://www.prithvisolutions.com/urllist.zip  I may remove this with in one or two days as I have uploaded at my comany's official site. I just get 2 days permission to place that zip file there.

OK coming to the scenario,

With the mail client that sends the multiple emails at a time I was able to send the mails to different exchange servers. All these are just test servers. So I am sorry because I am not able to give realtime Email URLS for this.

And using Slide API, I got the urllist that I have placed in a text file. It contains arround 16 lacs of urls list

Sridhar,
Welcome to this thread participation :-). And I hope u understand what exactly I am looking for by seeing the previous discussions.


Thanks,
Sudhakar
I am increasing the points to 500 as you guys are too patience in helping me :-)
> There's no problem with posting a link to test data

And no difference :-D Except some inconvenience on suds.
No more of your petty complaints please, noone else had a problem so why should you.
It not like you even asked for any test data.
>> And I hope u understand what exactly I am looking for by seeing the previous discussions

I've been listening to this but haven't been able to take out time for bench-marking something. BTW, I tried clicking on the link you posted (the ZIP file). It said Page cannot be displayed.
sudhakar - it would be useful if you could post how many lists are meant to be produced by the test data
>> It said Page cannot be displayed.
???. It is working fine at my side. I just tested on both IE and NS

>>Let's carry on as normal and i'll explain this later:

CEHJ, will that won't work for some conditions??
>>CEHJ, will that won't work for some conditions??

I'm not going to prejudice you against it at the moment - keep testing it ;-)

a. does it produce the right number of lists for the test data?
b. what *is* the right number of lists?
Please let me know if you have the copy of that zip file. I am revoing that by this evening

thanks
Sudhakar
I've got it
objects shall I remove that file??
Ok CEHJ,

for the contents of that zip file

here is the key information

http://www.geocities.com/sudhakar_koundinya/urls.html
Did my code produce the right number of lists? (i haven't tested)
CEHJ,

for 1823794 URLS your method get 1860 lists
where as actuals are 1984

Still trying to identify the problem
CEHJ,

here are the results http://www.geocities.com/sudhakar_koundinya/results.htm

for the following test case

import java.io.*;
import java.util.*;
class ReadURLS
{
static long urls=0;
      public static void main(String[] args)
      {
      try {
        BufferedReader in = new BufferedReader(new FileReader("c:\\urllist.txt"));
        String str;
        while ((str = in.readLine()) != null) {
                    getUrlList10(str);    //cehj          
                             process(str); //sudhakar
        }
                        System.err.println(ht.size());
                        Enumeration e=ht.keys();
                        while(e.hasMoreElements())
            {
                              String str1=(String)e.nextElement();
                              System.out.println("Key  : "+str1+"   Value : "+ht.get(str1));
            }
            
                        System.err.println("Current Size :"+ht.size()+" -> No of URLS "+urls);
        in.close();
            List l=Collections.list(hostToUrls.keys());
            long value=0;
            for(int i=0;i<l.size();i++)
            {
                  Long l1=(Long)l.get(i);
                              System.out.println("Key  : "+l1+"   Value : "+hostToUrls.get(l1));
            }
            System.out.println("No of Lists :"+l.size());


    } catch (IOException e) {
    }
      }
      public static void process(String str)
      {
            if(str==null) return;
            if(str.trim().length()<=0) return;
            StringBuffer sb=new StringBuffer();
            int counter=0;
            for(int i=0;i<str.length();i+=1)
            {
                  sb.append(str.charAt(i));
                  if(sb.charAt(sb.length()-1)=='/')
                  {
                        counter++;
                        if(counter==5)
                        {
                              Long i1=(Long)ht.get(sb.toString());
                              if(i1==null)
                              {
                                    i1=new Long(1);
                              }
                              else
                              {
                                    i1=new Long(i1.longValue()+1);
                              }
                              ht.put(sb.toString(),i1);
                              break;
                        }
                  }
            }
            urls++;
            if(urls%50000==0)
            System.err.println("Current Size :"+ht.size()+" -> No of URLS "+urls);
      }
      public final static void getUrlList10(String str) {

          //for (int i = 0; i < urls.size(); i++)
                    {
               String url = (String) str;
                                 if(str==null) return;
            if(str.trim().length()<=0) return;
               int forwardSlashCount = 0;
               int hostPlusAccountPlusProtocolHashcode = 0;
               for (int ixChar = 4; forwardSlashCount < 5 && ixChar < url.length(); ixChar++) {
                    if (url.charAt(ixChar) == '/') {
                         forwardSlashCount++;
                    }
                    if (forwardSlashCount == 2 || forwardSlashCount == 4) {
                         // Compute hashcode of host
                         hostPlusAccountPlusProtocolHashcode = 3 * hostPlusAccountPlusProtocolHashcode + url.charAt(ixChar);
                    }
               }
               // end for
               if (url.charAt(4) == 's') {
                    hostPlusAccountPlusProtocolHashcode = ~hostPlusAccountPlusProtocolHashcode;
               }
               Long key = new Long(hostPlusAccountPlusProtocolHashcode);
               Long urlsForHost = (Long) hostToUrls.get(key);
               if (urlsForHost == null) {
                    urlsForHost = new Long(1);
                    //urlsForHost.add(url);
                    hostToUrls.put(key, urlsForHost);
               }
               else {
                    urlsForHost=new Long(urlsForHost.longValue()+1);
                                                  hostToUrls.put(key, urlsForHost);
               }
                     _urls++;
                     if(_urls%100000==0)
                     System.err.println("Key :"+key+" Value :"+urlsForHost+"Urls Count :"+_urls);
          }


     }

    static      Hashtable hostToUrls = new Hashtable();
          static      Hashtable ht= new Hashtable();
            static long _urls=0;

}
CEHJ,

any updates??
Hello objects,

Are you able to find some benchmark solution??
>>
for 1823794 URLS your method get 1860 lists
where as actuals are 1984
>>

Is that the one using String as the hash key or Integer?
that is Integer
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
But the later one is faster know??. Whay it is failing to get the right hashcode??
I know it's faster, but there's an  problem inherent to it: Integer's hashCode is essentially 'synonymous' with its equals method. That means that if a hashcode collision *does* occur, and this is more likely with such large amounts of data as you have [too large for me to handle here unfortunately], then the equals method will not 'save' it from replacing the previous entry in the table. This is not the case with String, as the String needs to be equal as well as returning the same hashcode before the previous entry will get kicked out of the table.

Of course, i'm not certain that this is why the anomaly occurs - it could be an error in my code ;-)
OK CEHJ,

I end up with ur StringBuffer related code

thanks
sudhakar
:-)