Matt_T_hat

SimpleXML, Date Data Order, Stream, Efficiency and all that

I wish to create a class that will act as an API to a stream of data, such as a Twitter clone, a Facebook wall clone or whatever. The store of preference is XML, so I will be using SimpleXML as I have a sound grasp of how to use it. The store needs to cope with any size and might become very large over time.

These items are likely to be of the form <something date="" id=""></something>, but inside that the items might get quite complex. The most important point is that, as part of the bigger system, they are time sensitive, with the newest items being the ones we are largely going to be interested in.

That said, while I can probably come up with some method of getting a unique ID system in place, I am fairly sure it would end up as a hack of some kind. So my thinking is towards using the date/time, accurate to the millisecond, as the ID. That's not my question but...

Now the output process needs to return a SimpleXML object with the newest x items.

The easiest way would be to scoop the first 20 entries from the XML store, but this would only work if the newest records were at the top. SimpleXML appends new entries at the end of the parent element, so this does not work.

So the next option would be to scoop the last 20 although I can not currently conceive of anything that might do this efficiently.
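
One way to "scoop the last 20" without a PHP-level foreach is an XPath position predicate. The following is only a minimal sketch: the element name `item`, the attribute names and the helper `newestItems()` are assumptions for illustration, not part of the project described here. Note that SimpleXML still parses the whole document into memory, so this avoids the loop but not the cost of loading a large file.

```php
<?php
// Hypothetical helper: return the newest $count <item> elements, newest first.
// Assumes new items are always appended at the end of the root element.
function newestItems(SimpleXMLElement $stream, int $count, int $offset = 0): array
{
    // position() is 1-based; select the window that ends $offset items from the end.
    $query = sprintf(
        'item[position() > last() - %d and position() <= last() - %d]',
        $count + $offset,
        $offset
    );
    // xpath() returns matches in document order (oldest first), so reverse them.
    return array_reverse($stream->xpath($query) ?: array());
}

$xml = '<stream>'
     . '<item id="1" date="2010-01-01T10:00:00"/>'
     . '<item id="2" date="2010-01-01T11:00:00"/>'
     . '<item id="3" date="2010-01-01T12:00:00"/>'
     . '<item id="4" date="2010-01-01T13:00:00"/>'
     . '</stream>';
$stream = simplexml_load_string($xml);

foreach (newestItems($stream, 2) as $item) {
    echo $item['id'], "\n";
}
```

The `$offset` argument is one possible way to page back through older items: `newestItems($stream, 20, 20)` would return the second page of 20.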

The ideal solution would not care if the data was unsorted but this is not going to happen as some maintenance or index is needed for simplicity of data fetching (the same reason SQL databases have indexes).

At all costs I want to avoid doing a foreach through the entire collection as this will eventually kill any server as the file gets bigger and the output more popular.

As a secondary factor I am sure that I will want to "page" through the "archives".

So I need a logically sound pattern (and some pointers as to how I can implement the fellow) for storing and retrieving from such a store.

To use MySQL would solve these problems as they are already solved in MySQL but this makes demands of the final implementation I am not happy with. The ideal of this being that we have a specification for the process and display but it could be created in any language and in the future replaced with ease if the next maintainer feels Ruby on Rails or whatever is the way to go forward.

So it has to be XML. It has to be efficient even with big files and it has to be simple enough for me to understand.

I plan to create a Factory singleton that will maintain a cache of each stream-specific API class. However, a web site is effectively multi-threaded (each child process could be running the same code), so this is only so much use. Therefore I need to work out exact rules for the API class for read and write that would allow multiple threads to not bum out the system with identical dates or dates out of order...
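
For the worry about identical timestamps from concurrent processes, one common approach is to combine the millisecond timestamp with a short random suffix. This is a sketch under assumptions: the function name `makeItemId()` and the ID layout are hypothetical, and it assumes a 64-bit PHP build.

```php
<?php
// Hypothetical ID generator: millisecond timestamp plus a random suffix, so two
// processes writing in the same millisecond are very unlikely to collide.
// Assumes 64-bit integers (a millisecond timestamp overflows a 32-bit int).
function makeItemId(): string
{
    $millis = (int) floor(microtime(true) * 1000);
    // Zero-padding to a fixed 13 digits keeps plain string comparison
    // in chronological order.
    return sprintf('%013d-%s', $millis, bin2hex(random_bytes(4)));
}

$first  = makeItemId();
$second = makeItemId();
echo $first, "\n", $second, "\n";
```

The suffix makes collisions improbable rather than impossible, so a writer could still check for an existing ID before saving.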

I considered a once a minute cron to take from a buffer file and as a virtual single threaded singleton write to the store but this is overly complex and opens up a whole new can of possible problems. I need to create a plan for a data format and read-write method that plays well with other instances.

I considered file locking but don't know enough about it to use it with confidence and, to be honest, aside from identical date-time IDs there should be no issue that requires it. However...

So the read write process also needs sanity checks.
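
File locking in PHP is mostly a matter of `flock()`. The sketch below shows one way a writer might append an item under an exclusive lock, with a basic sanity check before overwriting. The function name `appendItem()`, the `<stream>`/`<item>` layout and the temporary file path are all assumptions for illustration.

```php
<?php
// Hypothetical writer: append one item to an XML store under an exclusive lock.
function appendItem(string $path, string $id, string $date, string $text): bool
{
    // 'c+' opens for read/write, creating the file if needed, without truncating.
    $handle = fopen($path, 'c+');
    if ($handle === false) {
        return false;
    }
    // Block until no other process holds the lock.
    if (!flock($handle, LOCK_EX)) {
        fclose($handle);
        return false;
    }
    $contents = stream_get_contents($handle);
    $stream = simplexml_load_string($contents !== '' ? $contents : '<stream/>');
    if ($stream === false) {
        // Sanity check: refuse to overwrite a store that no longer parses.
        flock($handle, LOCK_UN);
        fclose($handle);
        return false;
    }
    // addChild() does not escape ampersands itself, hence htmlspecialchars().
    $item = $stream->addChild('item', htmlspecialchars($text));
    $item->addAttribute('id', $id);
    $item->addAttribute('date', $date);

    // Rewrite the file in place while still holding the lock.
    rewind($handle);
    ftruncate($handle, 0);
    fwrite($handle, $stream->asXML());
    fflush($handle);
    flock($handle, LOCK_UN);
    fclose($handle);
    return true;
}

$path = sys_get_temp_dir() . '/stream-' . uniqid() . '.xml';
appendItem($path, 'a1', '2010-01-01T10:00:00', 'first post');
appendItem($path, 'a2', '2010-01-01T10:05:00', 'second & last post');
echo file_get_contents($path);
```

Readers that want a consistent view could take a shared lock (`LOCK_SH`) on the same file before loading it. Rewriting the whole file on every append still gets slower as the store grows.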

I'm not looking for complete solutions but I do need to form a sound plan to work from with the potential problems mitigated or avoided in theory at least.

I'm shooting for stable and reliable even if lots of random developers of different skills start making calls to the API class

That's why this question is getting top points possible - I need a solid working plan.
ASKER CERTIFIED SOLUTION
Ray Paseur
This solution is only available to members.
Matt_T_hat

ASKER

On a larger application MySQL and a public API would probably be the perfect setup, but we are talking about something that the average guy might run on shared hosting, with the API being programmatic for the benefit of other PHP developers.

I can see how everything will work but not the low footprint "most recent x from y" when using XML. I'm convinced that there should be a way to do this with a reasonably well structured XML file (or collection thereof).

One idea that occurs is to store the data as "data objects": one file per object, using the timestamp, and archiving older data to a single file as a structured operation later. I'm not sure how efficient or elegant this would be though. This matters especially as I'm going to have to write everything myself and from scratch.
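
If each item lives in its own file and the file name starts with a sortable timestamp, "most recent x" becomes a directory listing rather than an XML query. A minimal sketch, assuming a hypothetical layout of one `<sortable id>.xml` file per item (the directory, file names and `newestFiles()` are made up for illustration):

```php
<?php
// Hypothetical layout: one XML file per item, named "<sortable id>.xml".
function newestFiles(string $dir, int $count): array
{
    // SCANDIR_SORT_DESCENDING gives reverse alphabetical order, which is
    // newest-first when names start with a zero-padded timestamp.
    $names = scandir($dir, SCANDIR_SORT_DESCENDING);
    if ($names === false) {
        return array();
    }
    $xmlFiles = array_values(array_filter($names, function ($name) {
        return substr($name, -4) === '.xml';
    }));
    return array_slice($xmlFiles, 0, $count);
}

$dir = sys_get_temp_dir() . '/wall-' . uniqid();
mkdir($dir);
foreach (array('0001262340000000', '0001262343600000', '0001262347200000') as $id) {
    file_put_contents("$dir/$id.xml", "<item id=\"$id\"/>");
}

foreach (newestFiles($dir, 2) as $name) {
    $item = simplexml_load_file("$dir/$name");
    echo $item['id'], "\n";
}
```

`scandir()` still reads every name in the directory, so this only stays cheap if older files are moved out to an archive from time to time.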
SOLUTION
This solution is only available to members.
I'm not so sure that I want to use MySQL just because it can do this one thing well.

Actually that last idea of mine combined with yours could work quite well. In many respects the data being processed fits the email service design pattern which could be mimicked by an internal data handler and connected to via the presentation layer.

Then if some bright sparks want to write a highly efficient PERL, Apache, C or C++ service then this would just speed up the "back office" work flow.

I've knocked up a quick diagram to try and show what I'm getting at.

Do I want to use a big file or lots of little ones? Same question but new thoughts about ways it can be processed.

What's the best and most efficient process for getting to "most recent X from Y" in an XML environment.

MyWall.png
SOLUTION
This solution is only available to members.
You put forth some very compelling arguments in favour of MySQL. I can be quite the pro-MySQL geek sometimes (like when designing a simple search engine), but other times it seems like a lot to ask, and the very tree-like nature of the XML might leave me with a table:

table data(
xml bigtext;
date timestamp;
);

Which just feels silly. That leads me to a similar place where we end up with a BigText MetaData section... As a data modelling geek that feels not so much silly as outright wrong.

The fact is that I am looking at data that, aside from date ordering, need never have any other MySQL function used with it. So I feel loath to fire up a heavy server to do one thing (albeit very well).

The project in question is going to process social media output. For example, it might take your Facebook, Twitter, Flickr and blog RSS and create a personal public timeline or lifestream that is platform separate.

http://mywall.lordmatt.co.uk/ is where I try to explain what I'm trying to get to.

I want to be very flexible as to what meta-data I store along with the basic title, date, source, text block. As long as the items can end up in time order and/or be fetched out as such without hurting even a weak hosting solution (or my very busy server) then I'm happy.

Furthermore, in my mind, the data is best appreciated as an XML tree - it's just this blasted date order business that I can't get a clear idea of.

Have a look at the website and see if that makes it much clearer. I figure I'm just telling the story badly or I'm missing something and just figuring that out might be worth this question.
SOLUTION
This solution is only available to members.
Now having said that, it is also possible to "sort" XML.  It is ugly and I do not recommend it, but I have forced it to work.  This is not likely to be a good choice for rapidly changing data, since you would need to force some kind of transaction-lock around the sort and retrieve process.

I guess my overall sense is that data storage in MySQL is not really that burdensome, and I've not read of anyone converting a MySQL data base into XML, except for the presentation layer.  I have, OTOH, heard of several installations that tried to do data storage in XML and have given up, going instead to MySQL or equivalent data base technology.

Best of luck with it, ~Ray
<?php // RAY_sort_XML.php
 
// TEST DATA
$response = '<all>
<CUST>
<NAME>cheese co</NAME>
<LVL>E1</LVL>
<STATE>FL</STATE>
</CUST>
<CUST>
<NAME>ABC Co.</NAME>
<LVL>A1</LVL>
<STATE>CA</STATE>
</CUST>
<CUST>
<NAME>ACME</NAME>
<LVL>A2</LVL>
<STATE>CA</STATE>
</CUST>
</all>';
 
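// CONVERT TO OBJECT
$xmlobj = simplexml_load_string($response);
if ($xmlobj === false) {
	die('Could not parse the XML');
}
 
// ITERATE OVER THE OBJECT
$order = array();
$point = 0;
foreach ($xmlobj->CUST as $thing)
{
// EXTRACT THE NAME FOR SORTING, KEYED BY ITS POSITION IN THE DOCUMENT
	$order[$point] = (string) $thing->NAME;
// INSERT A POINTER (RECORDS THE ORIGINAL POSITION INSIDE THE XML)
	$thing->ORDER = (string) $point;
	$point++;
}
 
// SORT THE NAMES IN SENSIBLE ORDER (CASE-INSENSITIVE, KEYS ARE PRESERVED)
natcasesort($order);
 
// ITERATE BY ORDER, USING EACH KEY TO FIND THE ORIGINAL ELEMENT
foreach ($order as $key => $value)
{
	$my_obj = $xmlobj->CUST[$key];
	$my_nom = (string) $my_obj->NAME;
	echo "<br/>$my_nom";
}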

Hmmm... lots for me to think about over the weekend.

Thank you.
I've been doing this my whole life. I wanted Relational Data (not that I had any idea that was what I was looking for) when flat file was the only option. I wanted multi-sheet or 3D spreadsheet systems when 2D single sheet was the only option. I wanted a visual OO language when basic was all that was being offered to me.

Now I want ordered XML when it's not yet available.

Why me?

But seriously, aside from my thinking being on the edge of what is there, there is some work being done towards this end.

Dynamic labeling schemes for ordered XML based on type information, 2006
http://portal.acm.org/citation.cfm?id=1151736.1151743

Sketch-Based Summarization of Ordered XML Streams
http://whitepapers.techrepublic.com.com/abstract.aspx?docid=889013

Which has left me feeling, if not enlightened then at least no longer utterly insane.

There is also work to suggest that your idea might be the way to go too:

Storing and Querying Ordered XML Using a Relational Database System, by Kevin Beyer, Igor Tatarinov, Jayavel Shanmugasundaram
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.5927

I don't think the right answer is to go with an XML to array, array sort and then array to XML device unless we were dealing with fairly small bites of XML.

http://scripts.ringsworld.com/development-tools/openbiz-2.0/openbiz/bin/xmltoarray.php.html

I'm still pondering as it seems such a waste to use an entire service of the complexity of MySQL just for the fast sorting. Anyway, I thought I'd put my ideas down for you to read - I'm going to poke about the net some more and have a good think.
Are we basically saying that XPath and DomDocument would _NOT_ be up to the task?

http://uk2.php.net/manual/en/function.simplexml-element-xpath.php
http://us2.php.net/domdocument

If so could you cover a little of the why?

Because the only thing that worries me is date order (okay so file size is an issue too but...)

Could xpath get me all the timestamped today, then yesterday etc until the count is reached? Further would not treating XML objects as files also enable the generation of object files with a created timestamp of x?
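
The "today, then yesterday" idea can be sketched with an XPath `starts-with()` test on the date. This is only an illustration under assumptions: it presumes ISO 8601 dates in a `date` attribute on `item` elements, and `newestByDay()` and its `$maxDays` cut-off are hypothetical names. Each `xpath()` call still scans every item, so the cost grows with the size of the file.

```php
<?php
// Hypothetical helper: walk backwards one day at a time until $count items are found.
function newestByDay(SimpleXMLElement $stream, int $count, string $today, int $maxDays = 30): array
{
    $day = new DateTimeImmutable($today);
    $found = array();
    for ($i = 0; $i < $maxDays && count($found) < $count; $i++) {
        // Assumes the date attribute is ISO 8601, so it starts with YYYY-MM-DD.
        $query = sprintf("item[starts-with(@date, '%s')]", $day->format('Y-m-d'));
        $matches = $stream->xpath($query) ?: array();
        // Within a day the newest items are last in the file, so reverse them.
        $found = array_merge($found, array_reverse($matches));
        $day = $day->modify('-1 day');
    }
    return array_slice($found, 0, $count);
}

$stream = simplexml_load_string(
    '<stream>'
    . '<item id="1" date="2010-01-03T09:00:00"/>'
    . '<item id="2" date="2010-01-04T18:30:00"/>'
    . '<item id="3" date="2010-01-05T08:00:00"/>'
    . '<item id="4" date="2010-01-05T12:00:00"/>'
    . '</stream>'
);

foreach (newestByDay($stream, 3, '2010-01-05') as $item) {
    echo $item['id'], ' ', $item['date'], "\n";
}
```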

After all the XML will be a 1:1 representation of the object logic and so storing it as XML means that the logic is indicated.

Actually I don't like using the file system; it feels like an error waiting to happen and breaks the "do not multiply entities beyond the minimum" rule.

All the same would DOM or XPath not help here?

You could be right about SQL - I just want to make the best choice for the project without assuming anything.
Having thought about things for a while I have questions about your answer. Should I go with MySQL some issues are raised that I need to get clear in my mind.

For example if I maintain an archive collection and a current items collection then I am only ever dealing with a small collection of XML nodes. Thus ugly sorts might work.
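
The split between a small "current" collection and an archive could be maintained by a periodic rollover. A rough sketch using DOMDocument, assuming both documents share a `<stream>` root with `<item>` children appended oldest-first; the function `rollOver()` is hypothetical:

```php
<?php
// Hypothetical rollover: keep only the newest $keep items in the "current"
// document and move the older ones into the archive document.
function rollOver(DOMDocument $current, DOMDocument $archive, int $keep): int
{
    // Collect only direct <item> children, so nested items stay with their parent.
    $items = array();
    foreach ($current->documentElement->childNodes as $node) {
        if ($node instanceof DOMElement && $node->tagName === 'item') {
            $items[] = $node;
        }
    }
    // Items are assumed to be appended oldest-first, so the surplus is at the front.
    $surplus = array_slice($items, 0, max(0, count($items) - $keep));
    foreach ($surplus as $item) {
        // importNode() copies the node and its subtree into the archive document.
        $archive->documentElement->appendChild($archive->importNode($item, true));
        $item->parentNode->removeChild($item);
    }
    return count($surplus);
}

$current = new DOMDocument();
$current->loadXML('<stream><item id="1"/><item id="2"/><item id="3"/></stream>');
$archive = new DOMDocument();
$archive->loadXML('<stream/>');

$moved = rollOver($current, $archive, 2);
echo $moved, "\n", $current->saveXML(), $archive->saveXML();
```

With the current collection kept this small, sorting it in PHP on each request is cheap; the rollover itself would need the same write locking as any other update.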

However I'm left wondering if the issue of concurrent writes to the XML might create data corruption? With locking only one thread can write at a time and this is not exactly efficient.

It is starting to look like your MySQL suggestion is the only way forward as a way to manage XML.

It's ironic that after creating the "perfect specification" whereby the data maps so directly to the output in such an elegant fashion that the data must be maintained (while live) in another storage service.

That said by wrapping that in a manager class I can always allow for a time when PHP and XML can work together enough to not need MySQL. With little or no lead time.
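
The manager-class idea might look something like the following: callers depend on an interface, and the storage engine behind it can be swapped later. Everything here is an assumed sketch; `StreamStore`, `XmlStreamStore` and their methods are hypothetical names, and the XML version is kept in memory to stay short.

```php
<?php
// Hypothetical contract: callers only see this interface, never the storage engine.
interface StreamStore
{
    public function add(string $date, string $text): void;

    /** @return string[] the newest $count texts, newest first */
    public function newest(int $count): array;
}

// XML-backed implementation held in memory; a MySQL-backed class could
// implement the same interface and be swapped in without touching callers.
class XmlStreamStore implements StreamStore
{
    private $xml;

    public function __construct()
    {
        $this->xml = new SimpleXMLElement('<stream/>');
    }

    public function add(string $date, string $text): void
    {
        $item = $this->xml->addChild('item', htmlspecialchars($text));
        $item->addAttribute('date', $date);
    }

    public function newest(int $count): array
    {
        $items = $this->xml->xpath('item') ?: array();
        // Sort by the date attribute, newest first
        // (ISO 8601 strings compare in chronological order).
        usort($items, function ($a, $b) {
            return strcmp((string) $b['date'], (string) $a['date']);
        });
        return array_map('strval', array_slice($items, 0, $count));
    }
}

$store = new XmlStreamStore();
$store->add('2010-01-05T08:00:00', 'breakfast');
$store->add('2010-01-05T12:00:00', 'lunch');
$store->add('2010-01-04T18:30:00', 'yesterday');
print_r($store->newest(2));
```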
Thanks for the feedback and time. I've spent a day diagramming everything (available on mywall.lordmatt.co.uk) and I think I'm going to have to live with MySQL being used. However, some of the things you have suggested have provided me with some great caching and query reduction ideas.

XML manipulation has a way to go yet but in the mean time I can get on with doing what I do best. Object manipulation.
Please take just a moment to read the EE grading guidelines and explain why, after all this work, you marked the answer grade down to a "B" -- I would love to know what I did wrong that disappointed you.  It would help me to know whether to spend time on your questions in the future.

https://www.experts-exchange.com/help.jsp#hi403

Thanks for your input on this matter,
~Ray
While your answer was perfectly correct on a technical level, I was looking for a solution to an issue with XML. Your answers guided me to a solution with lots of wrapping and a whole other server engine, not to mention a whole other set of cache issues. The server in question is going to already be running near the limit of LAMP on a single server, and having the memory-hungry MySQL out of the stack for this solution would have been very good on lots of levels.

You helped, don't get me wrong. However, the answer was "it can't be done (yet)", which is not the same as "this is how you might do it". You have clearly put some quality time into helping me and I don't want to not recognise that, because I appreciate it. A grade A in this case would have taken me to a place where I could have solved the ordered XML issue, albeit with dire warnings that, in your opinion, I might not want to do that.

Does that make sense?

It's not my intention to be rude or insult, as I appreciate the effort you have made for me. On the other hand, the primary problem was not solved but offloaded to another server application to take care of.

The project's XML objects are made up of two segments, head and body, and body can contain any number of additional XML objects that can contain...

Also, the head section can contain any number of segments relating to the data and what process is meant to deal with it. The result is that a "stream" is a tree of unlimited depth with the newest items at the top of the list at any given level. So I will be mostly just storing XML in a text field and tracking links with autonumber keys on a self-referential table. That's nowhere near the optimal solution I was shooting for.

However, if that's the best I can currently hope for from XML and PHP then I have to go with that.
MO: Thanks.  

Matt_T_hat, going forward, please remember that it costs you nothing more to give a grade of "A" unless the answer is really incomplete - in which case a "B" makes sense.

best to all, ~Ray