infodigger

Present data from different sources in realtime

Hello,

I have several data sources which, to keep things simple, send data in the following format:

source, id, key, value
.
.
.

I have about 200 such data sources, which all send this kind of data when I request it. When the data arrives, I need to do some calculations. For example, I need to map each of their IDs to my IDs in the database.

What would be the best practice for that? Shall I keep all data in memory and do the calculations there? Currently I use curl to get data with php and then perform all the calculations once all the sources have completed sending data, which is too slow.
smeghammer

Hi infodigger,

Several questions:

 - How are your data sources accessed (ODBC, HTTP, text files etc.)? - I assume HTTP
 - What format are your data sources sending the data as (DB records, JSON, ASCII text, SOAP etc.)? - I assume SOAP/JSON/ASCII
 - What is the frequency that the data is updated?
 - Are the different data sources related (i.e. are there dependencies where data from A is needed before data from B etc.)?
 - Do you need to keep a history of old data (i.e. calculations on CURRENT data depend on OLD data)?
 - How complex are the calculations? (it is likely that the time consuming bit will be the retrieval of data)
 - What do you need to do with the results after you have calculated them (present dynamic web page, write into your own DB etc.)?

It is unclear from your description whether the data sources are PUSHING the data and you have listeners in your PHP code, or whether you are PULLING the data at specified time intervals (regardless of whether the data source has updated), or whether the data is retrieved for every web page request.

In general:
CURL is getting data from an HTTP location, which can be slow. If there is no other way of accessing the data, then you are stuck with this method.

If the webpage currently retrieves the source data and does all the necessary calculations each time it is called in a browser, then a remote HTTP call is made by CURL, and possibly complex calculations performed, for every web page request. This can slow down the page load considerably. A much better solution is to retrieve the data periodically on your server, process it and cache the result somewhere (in RAM if there is sufficient, as local files, or in your local DB). The webpage should then retrieve this LOCAL copy of the processed data, which will be perceived as much faster page loads.
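To make the caching idea concrete, here is a minimal in-memory sketch in Python (the same pattern carries over to PHP). It is a lazy variant of the approach: instead of a scheduled job, the first request after the entry expires pays for the refresh. The names `TTLCache` and `get_or_compute` are made up for illustration, and the right lifetime depends on how often your sources really change.

```python
import time


class TTLCache:
    """Keep processed results for a limited time so page requests avoid remote calls."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable clock, which makes the expiry easy to test
        self._entries = {}      # key -> (expiry_time, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]     # still fresh: serve the local copy
        value = compute()       # missing or stale: do the slow work once
        self._entries[key] = (now + self.ttl_seconds, value)
        return value
```

A page handler would call something like `cache.get_or_compute("hotel-1", fetch_and_process)`, where `fetch_and_process` stands in for whatever slow retrieval and calculation you do today.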

The format and how that format is processed to retrieve the data can potentially affect performance as well - make sure any string/XML manipulation is as efficient as it can be.

The calculations cannot be done anywhere except in memory, so I am not sure what you mean there - if you mean should you cache the data from the data source, then that will depend on expected frequency of update of the source.

You also suggest that you DO need all the data before you can calculate the result - "Currently I use curl to get data with php and then perform all the calculations once all the sources have completed sending data, which is too slow" - in which case, optimisation of the data processing, calculation and output code should be your first area for investigation.

I have made a lot of assumptions here, so apologies if some of the above is obvious or wrong...

Given what I understand, I would initially look into some kind of asynchronous process to retrieve, process and store LOCALLY the results. I would then code the web pages to retrieve the local result.
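As a rough illustration of the retrieval part, the sketch below requests all sources in parallel rather than one after another, so the total wait is closer to the slowest source than to the sum of all of them. It is written in Python with a thread pool; in PHP the closest equivalent would be the `curl_multi_*` functions. The `fetch` callable is an assumption: it stands in for whatever HTTP request you make per source.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(sources, fetch, max_workers=20):
    """Call fetch(source) for every source in parallel.

    Returns {source: result}, or {source: exception} where a source failed.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {source: pool.submit(fetch, source) for source in sources}
        for source, future in futures.items():
            try:
                results[source] = future.result()
            except Exception as exc:    # one failing source should not sink the rest
                results[source] = exc
    return results
```

With around 200 sources, `max_workers` caps how many requests are in flight at once; the value 20 here is arbitrary and would need tuning.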

Cheers
infodigger

ASKER

Hi smeghammer,

Thank you very much for your extensive answer.

The sources I am getting data from send XML when I send a request. The data changes all the time, like flight tickets for example. So every time a user makes a request, I have to check with every source and get its XML file, and it's no use caching the results on my server.

I can have asynchronous loading of these sources, but somehow they have to be matched with each other, so I guess it will need to be a combination of JavaScript and server-side processing. If you have any example/case/scenario of such a thing, please let me know. It is the same way that all the meta search engines work, like flights/hotels/insurance/price comparison/etc.

Thanks a lot!
Ah - it's a portal?

You say you already have async caching, but the cached data need to be matched? What is it that matches them together? Is it just the act of searching for something? Logged-in user ID? If it is possible to 'pre-link' some of this cached data that might help.

Sorry, can't really suggest more without knowing a bit more about the logic of how your process works.

I suspect the deciding factor will be how often the source data is updated. If these are updated very frequently, or at random, then the only choice you have is to make requests in real time on each page load. You can probably do some optimisation for the data sources that are unlikely to change very often (addresses, maps or whatever) and cache these results locally.

Cheers
smeghammer,

Let's say it's a hotel comparison site (it's a little more complicated, but that will help me explain the process).

You have your own hotels in database with their ids, and each one has another id for each of the data sources you are receiving data from. For example you can have:

hotelid = 1, expedia_id =203, booking_id =394, etc.

When the user requests the data, you hit each source and the pricing for their id comes back as a result. For example:

expedia_id = 203 | price = $45
booking_id = 394 | price = $59

you need to match those ids with your id so that you present:

hotelid = 1| price_expedia= $45, price_booking = $59

but the prices from expedia and booking arrive at different times (for example, booking might send its response faster).

While you wait for all the sources to complete, you need to process in the background the data that you have already received and present it.
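The id matching described above can be sketched with a lookup table keyed by (source, source id). This is only an illustration in Python using the numbers from the example; `ID_MAP` and `merge_prices` are made-up names, and in practice the table would be loaded from the database once rather than queried per row.

```python
# (source, source_id) -> our own hotel id; values taken from the example above.
ID_MAP = {
    ("expedia", 203): 1,
    ("booking", 394): 1,
}


def merge_prices(rows, id_map):
    """rows: iterable of (source, source_id, price). Returns {hotel_id: {source: price}}."""
    merged = {}
    for source, source_id, price in rows:
        hotel_id = id_map.get((source, source_id))
        if hotel_id is None:
            continue    # unknown id: skip it rather than guess
        merged.setdefault(hotel_id, {})[source] = price
    return merged


rows = [("expedia", 203, 45), ("booking", 394, 59)]
print(merge_prices(rows, ID_MAP))  # {1: {'expedia': 45, 'booking': 59}}
```

Because each row is matched independently, the same function works whether the rows arrive all at once or one source at a time.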

Here is a good example:
http://www.kayak.co.uk/hotels/Crowne-Plaza-London,Kensington,London,England,United-Kingdom-c28501-h168980-details/2013-12-03/2013-12-06/2guests/expanded/#overview

You will see that the page loads and as it loads it does this job connecting the data it receives with their database and presenting them in realtime.
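I do not know how that site is actually built, but the "present results as they arrive" behaviour can be approximated on the server side roughly like this. It is a Python sketch under assumptions: `fetch(source)` is a hypothetical callable returning a list of (source_id, price) pairs, and `id_map` maps (source, source_id) to my own hotel id.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def stream_prices(sources, fetch, id_map, max_workers=20):
    """Yield (hotel_id, source, price) as soon as each source answers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, source): source for source in sources}
        for future in as_completed(futures):    # whichever source finishes first
            source = futures[future]
            try:
                rows = future.result()          # list of (source_id, price)
            except Exception:
                continue                        # skip a failed source, keep the others
            for source_id, price in rows:
                hotel_id = id_map.get((source, source_id))
                if hotel_id is not None:
                    yield hotel_id, source, price
```

Each yielded tuple could then be stored where the browser can pick it up, for example through an endpoint that the page's JavaScript polls until all sources have reported.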
ASKER CERTIFIED SOLUTION
smeghammer

This solution is only available to members.
Thank you very much for your time to answer this question in so much detail.
You are welcome.

Cheers