Present data from different sources in real time


I have several data sources which, let's say (for simplicity), send data in the following format:

source, id, key, value

I have about 200 such data sources, which all send this kind of data when I request it. When the data arrive I need to do some calculations. For example, I need to map each of their IDs to my IDs in the database.

What would be the best practice for this? Should I keep all the data in memory and do the calculations there? Currently I use cURL to get the data with PHP and then perform all the calculations once all the sources have finished sending data, which is too slow.
smeghammer Commented:
OK... I see the issue clearly now :-)

Other than code optimisation, the obvious approach - unless you do this already, in which case I apologise - is to use AJAX for each of the positions where the comparison price is to be rendered. Your main page will be rendered, and you will get each comparison price displaying as and when it is delivered - this is exactly what the example URL you sent appears to be doing.

The big issue of course with this is cross domain access. You would need to create a bunch of server-side proxy scripts that called each remote service and simply returned the XML. Your AJAX code would call these proxy PHP (I guess..) files, rather than trying to call the remote URLs directly.

Using the above methodology, the ACTUAL HTTP load time would not be any different, but the PERCEIVED page load time would be considerably better, as the main page would return quickly and each data point would fill in over the next few seconds as each AJAX call completed. This way, you don't have to wait for all the data to return before performing the calculations and sending the page to the browser.
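A minimal sketch of that pattern (all names here are illustrative, not from your code): fire one request per comparison source and render each result the moment its proxy responds, so the slowest source never blocks the others. The fetch function is injected, so in the browser it would be `window.fetch` against your server-side proxy scripts.

```javascript
// Sketch: one independent request per comparison source; each result is
// rendered as soon as it arrives instead of waiting for all of them.
// `fetchFn(url)` returns a Promise for the parsed source data;
// `onResult(err, url, data)` renders (or marks unavailable) one slot.
function loadSources(sourceUrls, fetchFn, onResult) {
  const requests = sourceUrls.map(url =>
    fetchFn(url)
      .then(data => onResult(null, url, data)) // render this price now
      .catch(err => onResult(err, url, null))  // show "unavailable" for this source
  );
  // Resolves once every source has either rendered or failed.
  return Promise.all(requests);
}
```

In the browser, `onResult` would write the price into the matching DOM element (e.g. by a per-source element id), so the page skeleton appears immediately and prices fill in one by one.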

Hi infodigger,

Several questions:

 - How are your data sources accessed (ODBC, HTTP, text files etc.)? - I assume HTTP
 - What format are your data sources sending the data as (DB records, JSON, ASCII text, SOAP etc.)? - I assume SOAP/JSON/ASCII
 - What is the frequency that the data is updated?
 - Are the different data sources related (i.e. are there dependencies where data from A is needed before data from B etc.)?
 - Do you need to keep a history of old data (i.e. calculations on CURRENT data depend on OLD data)?
 - How complex are the calculations? (it is likely that the time consuming bit will be the retrieval of data)
 - What do you need to do with the results after you have calculated them (present a dynamic web page, write into your own DB etc.)?

It is unclear from your description whether the data sources are PUSHING the data and you have listeners in your PHP code, whether you are PULLING the data at specified time intervals (regardless of whether the data source has updated), or whether the data is retrieved for every web page request.

In general:
cURL gets data from an HTTP location, which will potentially be slow. If there is no other way of accessing the data, then you are stuck with this method.

If the web page currently retrieves the source data and does all the necessary calculations each time it is called in a browser, then a remote HTTP call is made by cURL, and possibly complex calculations are performed, for every web page request. This will potentially slow down the page load quite considerably. A much better solution is to retrieve the data periodically on your server, process it and cache the result somewhere (in RAM if there is sufficient, as local files, or in your local DB). The web page should then retrieve this LOCAL copy of the processed data - this will be perceived as much faster page loads.
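A rough sketch of that caching idea (all names hypothetical, and the refresh interval is something you'd tune per source): each page request is served from the local copy, and the remote source is only hit again once the cached copy is older than the chosen TTL.

```javascript
// Sketch: wrap a (possibly slow) remote fetch in a TTL cache, so repeated
// page requests within `ttlMs` are served from memory instead of re-hitting
// the source. `now` is injectable to make the logic easy to test.
function makeCachedGetter(fetchFn, ttlMs, now = () => Date.now()) {
  const cache = new Map(); // key -> { data, fetchedAt }
  return function get(key) {
    const entry = cache.get(key);
    if (entry && now() - entry.fetchedAt < ttlMs) {
      return entry.data;             // fast path: local copy is fresh enough
    }
    const data = fetchFn(key);       // slow path: hit the remote source
    cache.set(key, { data, fetchedAt: now() });
    return data;
  };
}
```

The same shape works whether the cache lives in RAM, in local files, or in your own DB - only the `Map` would be swapped out.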

The format and how that format is processed to retrieve the data can potentially affect performance as well - make sure any string/XML manipulation is as efficient as it can be.

The calculations cannot be done anywhere except in memory, so I am not sure what you mean there - if you mean should you cache the data from the data source, then that will depend on expected frequency of update of the source.

You also suggest that you DO need all the data before you can calculate the result - "Currently I use curl to get data with php and then perform all the calculations once all the sources have completed sending data, which is too slow" - in which case, optimisation of the data processing, calculation and output code should be your first area for investigation.

I have made a lot of assumptions here, so apologies if some of the above is obvious or wrong...

Given what I understand, I would initially look into some kind of asynchronous process to retrieve, process and store the results LOCALLY. I would then code the web pages to retrieve the local result.

infodigger (Author) Commented:
Hi smeghammer,

Thank you very much for your extensive answer.

The sources I am getting data from send XML when I send a request. The data change all the time, like flight tickets for example. So every time a user makes a request, I have to check with every source and get its XML file, so there is no point caching the results on my server.

I can have asynchronous loading of these sources, but somehow the results should be matched with each other, so I guess it will need to be a combination of JavaScript and server-side processing. If you have any example/case/scenario of such a thing, please let me know. This is the way all the meta search engines work: flights/hotels/insurance/price comparison/etc.

Thanks a lot!

Ah - it's a portal?

You say you already have async caching, but the cached data need to be matched? What is it that matches them together? Is it just the act of searching for something? Logged-in user ID? If it is possible to 'pre-link' some of this cached data that might help.

Sorry, can't really suggest more without knowing a bit more about the logic of how your process works.

I suspect the deciding factor will be how often the source data is updated. If these are updated very frequently, or at random, then the only choice you have is to make requests in real time on each page load. You can probably do some optimisation for the data sources that are unlikely to change very often (addresses, maps or whatever) and cache these results locally.

infodigger (Author) Commented:

Let's say it's a hotel comparison site (it's a little more complicated, but that will help me explain the process).

You have your own hotels in database with their ids, and each one has another id for each of the data sources you are receiving data from. For example you can have:

hotelid = 1, expedia_id = 203, booking_id = 394, etc.

When the user requests the data, you hit each source and the pricing for its ID comes back as a result. For example:

expedia_id = 203 | price = $45
booking_id = 394 | price = $59

You need to match those IDs with your own ID so that you present:

hotelid = 1 | price_expedia = $45, price_booking = $59

but the prices from Expedia and Booking arrive at different times (for example, Booking might send its response faster).

While you wait for all the sources to complete, you need to process the data you have already received in the background and present it.
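In JavaScript-ish pseudocode, the matching step I mean looks roughly like this (the IDs are taken from the example above; the function and map names are made up):

```javascript
// Sketch: map each source's own hotel id back to our internal hotelid and
// merge prices into one record per hotel, in whatever order sources respond.
const idMap = {
  expedia: { 203: 1 },   // expedia_id 203 -> our hotelid 1
  booking: { 394: 1 },   // booking_id 394 -> our hotelid 1
};

function mergePrice(results, source, sourceId, price) {
  const map = idMap[source] || {};
  const hotelId = map[sourceId];
  if (hotelId === undefined) return results;   // unknown id: ignore this row
  if (!results[hotelId]) results[hotelId] = { hotelid: hotelId };
  results[hotelId]['price_' + source] = price; // e.g. price_expedia, price_booking
  return results;
}
```

Because each source's reply is merged independently, it would not matter that Booking answers before Expedia - the row for `hotelid = 1` just fills in field by field.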

Here is a good example:,Kensington,London,England,United-Kingdom-c28501-h168980-details/2013-12-03/2013-12-06/2guests/expanded/#overview

You will see that as the page loads, it connects the data it receives with its database and presents it in real time.

infodigger (Author) Commented:
Thank you very much for your time to answer this question in so much detail.
You are welcome.

Question has a verified solution.
