Link to home
Start Free TrialLog in
Avatar of Jack McKenzie
Jack McKenzie

asked on

Distributed computing in Java: how to design a distributed application for stream processing?

How do you build a distributed system in Java? I want to distribute a stream on multiple nodes.

I've started with Java RMI. It works, but rmiregistry is an unwanted complication. Is there something better in Java, without rmiregistry?

I figured the system needs a dynamic leader election, so now I have the bully algorithm implemented with RMI. When multiple instances are executed to represent nodes, they correctly vote and elects the leader, including handling of reelection when some of the nodes fails or disconnects.

Now, I'm not sure how to distribute stream processing on multiple nodes.
I can have one or more streams, and one or more operations to be performed on the streams.

I want he leader to be the central task allocator. The leader node should distribute tasks on nodes and monitor the progress.
Distributed systems typically use some partitioning of data.

I'd expect distributed stream processing to have some well-known architecture, but somehow I can't find it. It's probably going to use the same components as every other distributed system. (Well, a scheduler - that's the central task allocator,  a partitioner - to partitoin the data perhaps using some hashing function, leader election - done, ...)

With your answer, it should be clearer how to distribute an event stream on multiple nodes.
Avatar of krakatoa
krakatoa
Flag of United Kingdom of Great Britain and Northern Ireland image

My comment is just a comment, but given that concurrency ultimately requires interprocess state conformity, streams per se don't appear to hold many favours in store for you -perhaps one reason why you've had trouble tracking down examples of this part of the art.

On a yet more general level, if one thread is in charge of the duty roster, aren't you looking at a client-server model anyway? I'm not sure whether your system aims to feed different parameters but identical data to the nodes to arrive at some (optimum?) result, or whether the nodes are working on different parts of the same task / data? Maybe I shouldn't be saying any of this, as I haven't, and probably won't, understand your question. If so, feel free to ignore it entirely. But maybe others would ask you : what's the end aim and purpose in mind for the distributed system?
If RMI is too complicated I'd suggest you move to micro services, where you can add and delete nodes at will. You can probably replace your RMI solution with Apache's Zookeeper. And here is a small example on how you can implement a dynamic leader application.
Avatar of Jack McKenzie
Jack McKenzie

ASKER

The distributed system is for pattern matching in streams. Every stream needs matching against all other streams. (scalability must be horizontal by adding new nodes, for example to match 10,000 streams in real-time, and more as the amount of streams increases over the years)

For example:
stream1, stream2, stream3                  stream4, stream5, stream6
node1                                                   node2

The algorithm needs to match stream1 with stream2, stream3, stream4, stream5, stream6
and also match stream2 with stream1, stream3, stream4, stream5, stream6, etc.
The matching needs to be done as quickly as possible and new data can arrive quickly (backpressure issue / load shedding as the solution).

How to match everything with everything at the lowest cost? Do I need some middle node, or stages (they are costly), or is there some other option?
Undoubtedly I am being a bit slow here, but otoh it's not yet apparent to me what the currency or ultimate objective is that you are dealing with in this.

So you envisage x nodes . . . what are they all supposed to be doing . . . matching locally, then comparing their matches universally - i.e. all-to-all? What's the point in that . . . ? . . . either there's already a match or there isn't, so why the need to have that checked by, or announced to, the peergroup ?  Or it is that a local match is fine,  and when it's found, a node reports in, and gets on with the next task in the queue? I don't get why all nodes need to know the results of all others ?
It is hard to tell if we do not know the requirements. Can you not match the streams on each node and then match the "winning" stream on node1 with the "winning" stream on node 2?
No such concept as "winning stream" exists. You've invented a concept that isn't there.

Please be careful while I explain the concept one more time.

The question asks how to partition data in a distributed stream processing system.

Data = multiple event streams.
Queries = multiple user-defined sql-like queries to match something from the data.

There is one scenario to match every data stream against all other data streams (it's one query) and for this query it's required to work similarly to brute force, i.e. test all combinations of data streams. It will test them always in couples. For example stream1 tested against stream2, stream2 against stream1, stream1 against stream3, stream3 against stream1, etc.

There are more streams than one computer can handle. Therefore, streams must be distributed on multiple nodes. In that case it's a distributed system, and it requires some scheduler, partitioner, remote method invocation, heartbeat, replication, and similar concepts.

The question is about the partitioning of data streams (or distribution of data streams?) to be able to match these streams against each other with minimal communication complexity. (we want to minimize data shuffling, or eliminate it fully if possible)

What would be the way to partition data and queries, so that it can be matched as above?
Data = multiple event streams.

How do you come to that view? An event is something after the fact - its only relationship with data is that the event itself is a datum.
First, why do you need to test data streams against each other a second time? If you test stream1 against stream2 why do you also need to test stream2 against stream1?

Second, if you need to test all data streams against all other data streams, then there is no easy solution; you will have to test all of the possible combinations. Without knowing more about what "matching" you're doing I cannot think of anything else.

What I meant by saying "winning node" is that if, for example, you're looking to find the biggest number and you have the following node with a set of numbers each:

node1: 2,5,9
node2: 3,4,0

then you can find the "winning" number in node 1 (9) and the "winning" number in node2 (4) and then compare with each other. In this example you can fully utilise the power of distribution, since you break down the problem into smaller subsets and you work on each subset independently.
This question needs an answer!
Become an EE member today
7 DAY FREE TRIAL
Members can start a 7-Day Free trial then enjoy unlimited access to the platform.
View membership options
or
Learn why we charge membership fees
We get it - no one likes a content blocker. Take one extra minute and find out why we block content.