?
Solved

Sparse Matrices in Java

Posted on 2005-04-03
15
Medium Priority
?
349 Views
Last Modified: 2008-02-01
Hi

I am struck with a problem.

What I have to do is....

I have a set  of 10000 URLs(webpages). I have to build a DIRECTED GRAPH using the 10000 webpages. If there is a link in webpage A pointing to webpage B, then we include a edge in the graph from node A to node B and vice versa.

This is all like a sparse adjacency matrix. After I build the matrix, I have to do INTENSE calculations on the matrix.  


One obvious solution to do this is by taking a node class which represents a webpage, and include two member vectors. One vector contains the urls(OUTLINKS) which this node point to and the other vector contains urls(INLINKS) which point to this node. After this I have to do INTENSE calculations, but in this method, it gets too slow and performance might be too bad.

What is the best way to do this?? Are there any tools which can make my work easy???

Hope I explained enough. I have to do all this is JAVA.

Thank You.
0
Comment
Question by:sumantedla
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 7
  • 5
  • 3
15 Comments
 
LVL 15

Expert Comment

by:aozarov
ID: 13693722
The data structure you suggested (node with list for adjacent pages and list for parent pages) sounds good.
You might want to consider switching from Vector to List (probably LinkedList in your case as long as you don't need a direct access by index),
to remove the sychronized overhead of Vector. What is the INTENSE calculation you are doing?
I think it is probably slow because of the calculation algorithm and not because of the picked data structure.
There are many sparse matrix implementation in java on the web: (e.g. http://www.math.uib.no/~bjornoh/mtj/smt/) but again I am not
sure this is your problem.
You can also run the java program with -verbosegc just to make sure you are not spending your time on G.C because of a small heap size (can be changed -Xmx)
0
 
LVL 6

Expert Comment

by:guitaristx
ID: 13697582
If you're doing a directed graph, you're double-entering data by having nodes with lists of in-links and out-links.  Pick one, and it will suffice.  If I were doing it, I'd probably choose in-links.
0
 
LVL 15

Expert Comment

by:aozarov
ID: 13697846
guitaristx, that is true though sometimes knowing the parent can help several algorithms (even though the network flow is outbound only) but you don't suggest
that this is the cause of its slowness, right?
0
Industry Leaders: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 
LVL 6

Expert Comment

by:guitaristx
ID: 13701295
I suppose it's a matter of what's more important - speed vs. data size.  You've got twice as much data if you store in-links and out-links, but, as you stated, it speeds up some algorithms.  It's just that when I saw that, I had a mental "red flag" go up.  Just make sure that the data in both places (in vs. out) stays in synch.
0
 

Author Comment

by:sumantedla
ID: 13701373
Hi ,

Storing both the outlinks and inlinks is important for the algorithm.

In my last post,I missed an important point.

Getting the 5000 webpages thro java and then analyzing the webpages to extract the possible outlinks which point to webpages within the 5000 is much more time consuming than the actual algorithm calculation.

What can I do for this. Will Multithreading alone improves the performance??

0
 
LVL 6

Expert Comment

by:guitaristx
ID: 13701559
Multithreading might help, depending on how you implement everything.  Can you give a brief synopsis of how it works now?
0
 
LVL 15

Expert Comment

by:aozarov
ID: 13701584
guitaristx :
>> You've got twice as much data if you store in-links and out-links
It is not twice the data but twice the link to the data (so it isn't that bad)

>> Will Multithreading alone improves the performance??
Yes definetly!!  Also try using asynchrous IO (if you have 1.4 and abovev)
see: http://grexengine.com/sections/externalgames/articles/Adam%20Martin-Java%20NIO%20Networking%20for%20Games-1.html
or http://www.cse.wustl.edu/~cdgill/courses/cs6785/Java_NIO.ppt
Using java.util.concurrent or Doug Lea library util.concurrent (if you don't have 1.5) can help you as well.
0
 

Author Comment

by:sumantedla
ID: 13701688
An initial pool of 5000 urls will be collected.

Now on this I have to build the adjacency matrix or graph.

All the 5000 urls will be stored in a HashMap(or anyother datastructure) with the value as a class containing inlinks and outlinks to the particular url(webpage)

foreach url
(1) get the webpage
(2) extract the links
(3) Match the extracted links with the 5000 links
(4) If there are any matches update the HaspMap Value objects with the found inlinks and outlinks.

Thats it.This completes the first phase.Second phase is applying the algorithm.
0
 
LVL 15

Expert Comment

by:aozarov
ID: 13702399
you can have a thread pool (with some throtteling) to dispatch collecting tasks (again, see the concurrrent library PooledExecutor or 1.5 ThreadPoolExecutor).
The collecting tasks can then send the requests / receive response and populate the common Map. (you can use asynchronous io here to make the collection more efficient)
It seems that you will be using a lot of memory. You can use -Xmx to change the max memory settings. (e.g -Xmx512m)
0
 

Author Comment

by:sumantedla
ID: 13702447
How to use the asynchronous io??

Can you give me one example.I didnt use it till now.

-Xmx ??? what is this??How to set the heap size before execution??

0
 
LVL 15

Expert Comment

by:aozarov
ID: 13702533
Did you read the links I gave you above (regarding asynchronous io [packaed as nio])
This one is actually better: http://java.sun.com/j2se/1.4.2/docs/guide/nio/example/

>> -Xmx ??? what is this??How to set the heap size before execution??
java -Xmx512m <your_java_program>
will run your java program with maximum of 512 MB.
0
 

Author Comment

by:sumantedla
ID: 13702647
I read the given links.

Are you sure that asynchronous io improves the performance of my program.

The first phase is all about connecting to 5000 urls and extracting the links and matching with urls in haspmap.

In what way i can apply nio to my program.

Thank You.
0
 
LVL 15

Expert Comment

by:aozarov
ID: 13703523
instead of blocking on 5000 io reads you can send at once all (or throttled) the requests and then use Selector to read the response.
Once a response comes back you use the thread pool to handle it. This way you can have less threads to handle more sockets.
0
 

Author Comment

by:sumantedla
ID: 13703560
Probably I should increse the points to 500 as this is getting complex.

>> instead of blocking on 5000 io reads you can send at once all (or throttled) the requests and then use Selector to read the response.

How to do this?? I read the sample code from http://java.sun.com/j2se/1.4.2/docs/guide/nio/example/

But understood very little. My experience in java is limited.

Can u write some sample code. By the way you used the term throttled. What that mean??
0
 
LVL 15

Accepted Solution

by:
aozarov earned 2000 total points
ID: 13703798
>> Can u write some sample code.
http://javaalmanac.com/egs/java.nio/NbClient.html

>> By the way you used the term throttled. What that mean??
means that you can have a property (which you can configure) to control how many requests you are seding at once
and also a property to configure how many threads can handle response from the selector. you do that to prevent overloading
your machine TCP stack which can handle so many requests at once as well as the usage of your memeory /cpu.
Later on you can fine-tune by changing those properties to match your system capabilities.
0

Featured Post

[Webinar] How Hackers Steal Your Credentials

Do You Know How Hackers Steal Your Credentials? Join us and Skyport Systems to learn how hackers steal your credentials and why Active Directory must be secure to stop them.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

By the end of 1980s, object oriented programming using languages like C++, Simula69 and ObjectPascal gained momentum. It looked like programmers finally found the perfect language. C++ successfully combined the object oriented principles of Simula w…
In this post we will learn different types of Android Layout and some basics of an Android App.
Viewers will learn about the different types of variables in Java and how to declare them. Decide the type of variable desired: Put the keyword corresponding to the type of variable in front of the variable name: Use the equal sign to assign a v…
Viewers will learn about arithmetic and Boolean expressions in Java and the logical operators used to create Boolean expressions. We will cover the symbols used for arithmetic expressions and define each logical operator and how to use them in Boole…
Suggested Courses
Course of the Month8 days, 2 hours left to enroll

765 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question