Solved

SimpleXML, Date Data Order, Stream, Efficiency and all that

Posted on 2009-03-29
Last Modified: 2012-06-21
I wish to create a class that will act as an API to a stream of data, such as a Twitter clone or a Facebook wall clone. The store of preference is XML, so I will be using SimpleXML as I have a sound grasp of how to use it. The store needs to cope with any size and might become very large over time.

These items are likely to be of the form <something date="" id=""></something>, but inside that the items might get quite complex. The most important point is that, as part of the bigger system, they are time sensitive, with the newest items being the ones we are largely going to be interested in.

That said, while I can probably come up with some method of getting a unique ID system in place, I am fairly sure it would end up as a hack of some kind. So my thinking is towards using the date/time, accurate to the millisecond, as the ID. That's not my question but...
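A minimal sketch of what a millisecond-based ID could look like. Two items can arrive within the same millisecond, so this sketch appends a random suffix; the function name `make_stream_id` and the ID layout are assumptions for illustration, not part of any specification in this thread.

```php
<?php
// Hypothetical helper: builds a sortable, time-based ID for a stream item.
// The layout (zero-padded milliseconds + random hex suffix) is an assumption.
function make_stream_id(): string
{
    // Milliseconds since the Unix epoch, zero-padded so IDs sort as plain strings.
    $millis = (int) floor(microtime(true) * 1000);
    // A random suffix makes a collision within the same millisecond unlikely.
    $suffix = bin2hex(random_bytes(4));
    return sprintf('%015d-%s', $millis, $suffix);
}

echo make_stream_id(), "\n";
```

Because the time part is zero-padded, a plain string sort of these IDs is also a chronological sort, which is what the "newest x" queries need.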

Now the output process needs to return a SimpleXML object with the newest x items.

The easiest way would be to scoop the first 20 entries from the XML store, but this would only work if the newest records were at the top. SimpleXML appends new entries at the end of the space, so this does not work.

So the next option would be to scoop the last 20, although I cannot currently conceive of anything that might do this efficiently.
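One way to "scoop the last 20" without a PHP-level foreach is an XPath position test. This is a sketch only: it assumes an append-only document whose items are `<item>` children of the root, and `newest_items` is a hypothetical name.

```php
<?php
// Sketch: return the newest $count <item> elements, newest first, assuming
// new items are always appended at the end of the document.
function newest_items(SimpleXMLElement $stream, int $count): array
{
    // position() > last() - N keeps only the final N <item> children.
    $tail = $stream->xpath("item[position() > last() - {$count}]");
    // Reverse so the most recently appended item comes first.
    return array_reverse($tail);
}

$stream = simplexml_load_string(
    '<stream>'
    . '<item date="2009-03-27" id="1">one</item>'
    . '<item date="2009-03-28" id="2">two</item>'
    . '<item date="2009-03-29" id="3">three</item>'
    . '</stream>'
);

foreach (newest_items($stream, 2) as $item) {
    echo $item['id'], ': ', $item, "\n";
}
```

Note that SimpleXML still parses the whole document into memory before the query runs, so this removes the PHP loop but not the cost of loading a very large file.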

The ideal solution would not care if the data was unsorted but this is not going to happen as some maintenance or index is needed for simplicity of data fetching (the same reason SQL databases have indexes).

At all costs I want to avoid doing a foreach through the entire collection as this will eventually kill any server as the file gets bigger and the output more popular.

As a secondary factor I am sure that I will want to "page" through the "archives".
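Paging through the archives can be sketched with an XPath position range counted from the end of the document. The function name `stream_page` and the `<item>` layout are assumptions, and the sketch relies on items having been appended in date order.

```php
<?php
// Sketch: page backwards through an append-only stream.
// Page 1 is the newest $perPage items, page 2 the $perPage before those, etc.
function stream_page(SimpleXMLElement $stream, int $page, int $perPage): array
{
    $skip = ($page - 1) * $perPage;   // number of items newer than this page
    $upto = $page * $perPage;         // items newer than or on this page
    $nodes = $stream->xpath(
        "item[position() > last() - {$upto} and position() <= last() - {$skip}]"
    );
    return array_reverse($nodes);     // newest first within the page
}

// Build a five-item sample stream.
$stream = new SimpleXMLElement('<stream/>');
for ($i = 1; $i <= 5; $i++) {
    $stream->addChild('item', "entry $i")->addAttribute('id', (string) $i);
}

foreach (stream_page($stream, 2, 2) as $item) {
    echo $item['id'], "\n";
}
```

A page past the end of the stream simply comes back as an empty array, which gives the caller an easy "no more archives" signal.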

So I need a logically sound pattern (and some pointers as to how I can implement the fellow) for storing and retrieving from such a store.

Using MySQL would solve these problems, as they are already solved in MySQL, but this makes demands of the final implementation I am not happy with. The ideal is that we have a specification for the process and display that could be implemented in any language and, in the future, replaced with ease if the next maintainer feels Ruby on Rails or whatever is the way forward.

So it has to be XML. It has to be efficient even with big files and it has to be simple enough for me to understand.

I plan to create a Factory singleton that will maintain a cache of each stream-specific API class. However, a web site is effectively multi-threaded (each child process could be running the same code), so this is only so much use. Therefore I need to work out exact rules for the API class for read and write that would stop multiple threads from bumming out the system with identical dates or dates out of order...

I considered a once a minute cron to take from a buffer file and as a virtual single threaded singleton write to the store but this is overly complex and opens up a whole new can of possible problems. I need to create a plan for a data format and read-write method that plays well with other instances.

I considered file locking but don't know enough about it to use it sanely, and to be honest, aside from identical date-time IDs, there should be no issue that requires it. However...

So the read write process also needs sanity checks.
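For the locking question, a minimal sketch using PHP's `flock()` is shown here. The function name `append_item_locked` and the `<stream>`/`<item>` layout are assumptions. `flock()` is advisory, so it only protects the file if every writer goes through the same routine.

```php
<?php
// Sketch: append one item to an XML stream file while holding an exclusive lock,
// so two PHP processes cannot interleave their read-modify-write cycles.
function append_item_locked(string $path, string $text, string $date): bool
{
    $fh = fopen($path, 'c+');            // create if missing, do not truncate
    if ($fh === false) {
        return false;
    }
    if (!flock($fh, LOCK_EX)) {          // blocks until other writers finish
        fclose($fh);
        return false;
    }
    $raw = stream_get_contents($fh);
    $xml = ($raw === '' || $raw === false)
        ? new SimpleXMLElement('<stream/>')
        : new SimpleXMLElement($raw);    // throws if the file is not well-formed

    // addChild() expects "&" and "<" to be escaped already.
    $item = $xml->addChild('item', htmlspecialchars($text, ENT_XML1));
    $item->addAttribute('date', $date);

    // Rewrite the file in place while still holding the lock.
    rewind($fh);
    ftruncate($fh, 0);
    fwrite($fh, $xml->asXML());
    fflush($fh);
    flock($fh, LOCK_UN);
    fclose($fh);
    return true;
}
```

A call such as `append_item_locked('/path/to/stream.xml', 'hello', '2009-03-29T10:00:00')` (the path is a placeholder) serialises writers, at the price of rewriting the whole file on every append.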

I'm not looking for complete solutions but I do need to form a sound plan to work from with the potential problems mitigated or avoided in theory at least.

I'm shooting for stable and reliable even if lots of random developers of different skills start making calls to the API class.

That's why this question is getting top points possible - I need a solid working plan.
Question by:Matt_T_hat
 
LVL 108

Accepted Solution

by:Ray Paseur
Ray Paseur earned 500 total points
ID: 24013640
From what you're describing it sounds like the underlying data needs to be kept in a MySQL data base, and the results need to be returned in XML.  If you want a good overview of RESTful interfaces, which are the best for this sort of thing, have a look at the Yahoo man page on the topic:

http://developer.yahoo.com/maps/rest/V1/geocode.html

In the REST model, the URL "get" string contains all the arguments and the "browser output" contains all the responses (note how easy that is for testing, especially when compared to SOAP!).

Please post back here with specific questions, and I'll try to help you piece it together.

best regards, ~Ray
 
LVL 1

Author Comment

by:Matt_T_hat
ID: 24071835
On a larger application MySQL and a public API would probably be the perfect setup, but we are talking about something that the average guy might run on shared hosting, with the API being programmatic for the benefit of other PHP developers.

I can see how everything will work but not the low footprint "most recent x from y" when using XML. I'm convinced that there should be a way to do this with a reasonably well structured XML file (or collection thereof).

One idea that occurs to me is to store the data as "data objects": one file per object, using the timestamp, and archive older data to a single file as a structured operation later. I'm not sure how efficient or elegant this would be though. This matters especially as I'm going to have to write everything myself and from scratch.
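The one-file-per-object idea might be sketched as follows. File names start with a zero-padded millisecond timestamp, so a reverse sort of the names is newest-first. The directory layout and both function names are assumptions.

```php
<?php
// Sketch of "one file per item": the file name carries the timestamp.
function store_item_file(string $dir, int $millis, string $xml): string
{
    // The random suffix avoids clashes when two items share a millisecond.
    $path = sprintf('%s/%015d-%s.xml', $dir, $millis, bin2hex(random_bytes(4)));
    file_put_contents($path, $xml, LOCK_EX);
    return $path;
}

// Returns the paths of the newest $count item files, newest first.
function newest_item_files(string $dir, int $count): array
{
    $files = glob($dir . '/*.xml') ?: [];
    rsort($files, SORT_STRING);          // zero-padded names sort chronologically
    return array_slice($files, 0, $count);
}
```

Listing the directory still costs time proportional to the number of files, which is one reason the later archiving step into a single file matters.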
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 500 total points
ID: 24071898
When you get into data retrieval with query logic like "most recent x from y" you are not really in the realm of what XML is built for, so it is no surprise that it would be difficult to visualize an XML solution to that sort of question.  Data bases handle that sort of thing well.  XML just handles the presentation layer - identifying the different data fields in a way that makes them easy to find.

The essential "sanity checks" are all part of the transaction feature of MySQL, and I urge you to use that tool rather than try to program around it - everybody has MySQL and everybody understands it.  If you think of this application in the model-view-controller paradigm, you may find yourself drawn to a solution that puts the right technologies under the right parts of the problem.

APIs are neither hard to write nor hard to use if you go with REST.  If you choose SOAP, well, consider yourself forewarned ;-)

HTH, ~Ray
 
LVL 1

Author Comment

by:Matt_T_hat
ID: 24071942
I'm not so sure that I want to use MySQL just because it can do this one thing well.

Actually that last idea of mine combined with yours could work quite well. In many respects the data being processed fits the email service design pattern which could be mimicked by an internal data handler and connected to via the presentation layer.

Then if some bright sparks want to write a highly efficient PERL, Apache, C or C++ service then this would just speed up the "back office" work flow.

I've knocked up a quick diagram to try and show what I'm getting at.

Do I want to use a big file or lots of little ones? Same question but new thoughts about ways it can be processed.

What's the best and most efficient process for getting to "most recent X from Y" in an XML environment?

[Attachment: MyWall.png]
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 500 total points
ID: 24071981
@Matt_T_hat: Trust me, I am not trying to make your life more complicated by suggesting the use of a PHP+MySQL solution.  It works for Yahoo, Digg, Facebook and lots of others.  XML is not an "environment" any more than HTML is an environment - it is just a markup language, so it does not have the ability to offer multivariate controls of the data selection process.  As to efficiency, you cannot do better than MySQL, especially not by writing your own code - as a bright fellow once observed, "The hundred smartest people in the world don't work for you!"  In fact, a lot of them work on optimizing MySQL for high performance. (Sidebar prediction: IBM will buy MySQL or maybe all of Sun).

If I understood why you want to avoid a data base, maybe I could help.  But in any case, consider a data abstraction layer that isolates the data base.  You need to support the four CRUD functions (Create Read Update Delete) for your data elements.  You can do this with simple function calls that isolate the data base from the in-line code.  Something conceptually like "get_data($user)" or "put_data($user)".  If you isolate your data base this way, you have more flexibility to tune for performance using triggers, indexes, memcacheD and similar tools.
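The data abstraction layer described above could be sketched like this. Only the create and read operations are shown. The interface name, method names and the in-memory backend are all hypothetical and serve only to show that callers never see the storage technology.

```php
<?php
// Sketch of a storage-agnostic layer: callers see only these two methods, so the
// backend (XML file, MySQL, cache) can be swapped later without touching them.
interface StreamStore
{
    public function put(string $date, string $xml): void;

    /** @return string[] XML fragments, newest first */
    public function newest(int $count): array;
}

// Trivial in-memory backend, only to demonstrate the contract.
final class ArrayStreamStore implements StreamStore
{
    private $items = [];

    public function put(string $date, string $xml): void
    {
        $this->items[] = ['date' => $date, 'xml' => $xml];
    }

    public function newest(int $count): array
    {
        $sorted = $this->items;
        // ISO 8601 date strings compare correctly as plain strings.
        usort($sorted, function (array $a, array $b): int {
            return strcmp($b['date'], $a['date']);
        });
        return array_column(array_slice($sorted, 0, $count), 'xml');
    }
}

$store = new ArrayStreamStore();
$store->put('2009-03-28T09:00:00', '<item>a</item>');
$store->put('2009-03-29T09:00:00', '<item>b</item>');
print_r($store->newest(1));
```

A MySQL-backed or XML-backed class implementing the same interface could replace `ArrayStreamStore` without any change to calling code.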
 
LVL 1

Author Comment

by:Matt_T_hat
ID: 24073159
You put forth some very compelling arguments in favour of MySQL. I can be quite the pro-MySQL geek sometimes (like when designing a simple search engine), but other times it seems like a lot to ask, and the very tree-like nature of the XML might leave me with a table:

table data(
xml bigtext;
date timestamp;
);

Which just feels silly. That leads me to a similar place where we end up with a BigText MetaData section... As a data modelling geek that feels not so much silly as outright wrong.

The fact is that I am looking at data that, aside from date ordering, need never have any other MySQL function used with it. So I feel loath to fire up a heavy server to do one thing (albeit very well).

The project in question is going to process social media output. For example it might take your Facebook, Twitter, Flickr and blog RSS feeds and create a personal public timeline or lifestream that is separate from any one platform.

http://mywall.lordmatt.co.uk/ is where I try to explain what I'm trying to get to.

I want to be very flexible as to what meta-data I store along with the basic title, date, source, text block. As long as the items can end up in time order and/or be fetched out as such without hurting even a weak hosting solution (or my very busy server) then I'm happy.

Furthermore, in my mind, the data is best appreciated as an XML tree; it's just this blasted date order business that I can't get a clear idea of.

Have a look at the website and see if that makes it much clearer. I figure I'm just telling the story badly or I'm missing something and just figuring that out might be worth this question.
 
LVL 108

Assisted Solution

by:Ray Paseur
Ray Paseur earned 500 total points
ID: 24073225
I would expect that there would be better functionality in the application if there were greater detail than this in the data model...

table data(
xml bigtext;
date timestamp;
);

FWIW, a data base that uses column names for XML tags is highly feasible, and the code to convert a row to the xml is drop dead simple.
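// WRITE XML HEADERS
/* XML THING: the XML declaration and opening root tag go here */

// ITERATE OVER THE DATA TO BUILD XML
// $res is assumed to be a mysqli_result; the original post used the old
// mysql_fetch_assoc(), which was removed in PHP 7
while ($row = mysqli_fetch_assoc($res))
{
   echo "<item>\n";
   foreach ($row as $k => $v)
   {
      // COLUMN NAMES BECOME TAG NAMES; ESCAPE VALUES SO THE XML STAYS WELL-FORMED
      echo "<$k>" . htmlspecialchars((string) $v, ENT_XML1) . "</$k>\n";
   }
   echo "</item>\n";
}

// WRITE XML TRAILERS
/* XML THING: the closing root tag goes here */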

 
LVL 108

Expert Comment

by:Ray Paseur
ID: 24073247
Now having said that, it is also possible to "sort" XML.  It is ugly and I do not recommend it, but I have forced it to work.  This is not likely to be a good choice for rapidly changing data, since you would need to force some kind of transaction-lock around the sort and retrieve process.

I guess my overall sense is that data storage in MySQL is not really that burdensome, and I've not read of anyone converting a MySQL data base into XML, except for the presentation layer.  I have, OTOH, heard of several installations that tried to do data storage in XML and have given up, going instead to MySQL or equivalent data base technology.

Best of luck with it, ~Ray
<?php // RAY_sort_XML.php

// TEST DATA
$response = '<all>
<CUST>
<NAME>cheese co</NAME>
<LVL>E1</LVL>
<STATE>FL</STATE>
</CUST>
<CUST>
<NAME>ABC Co.</NAME>
<LVL>A1</LVL>
<STATE>CA</STATE>
</CUST>
<CUST>
<NAME>ACME</NAME>
<LVL>A2</LVL>
<STATE>CA</STATE>
</CUST>
</all>';

// CONVERT TO OBJECT
$xmlobj = simplexml_load_string($response);

// ITERATE OVER THE OBJECT
$order = array();
$point = 0;
foreach ($xmlobj->CUST as $thing)
{
	// EXTRACT THE NAME FOR SORTING, KEYED BY POSITION IN THE DOCUMENT
	$order[$point] = (string) $thing->NAME;

	// INSERT A POINTER
	$thing->ORDER = "$point";
	$point++;
}

// SORT THE NAMES IN SENSIBLE ORDER (natcasesort PRESERVES THE KEYS)
natcasesort($order);

// ITERATE BY ORDER
foreach ($order as $key => $value)
{
	$my_obj = $xmlobj->CUST[$key];
	$my_nom = (string) $my_obj->NAME;
	echo "<br/>$my_nom";
}

 
LVL 1

Author Comment

by:Matt_T_hat
ID: 24106575
Hmmm... lots for me to think about over the weekend.

Thank you.
 
LVL 1

Author Comment

by:Matt_T_hat
ID: 24107777
I've been doing this my whole life. I wanted Relational Data (not that I had any idea that was what I was looking for) when flat file was the only option. I wanted multi-sheet or 3D spreadsheet systems when 2D single sheet was the only option. I wanted a visual OO language when basic was all that was being offered to me.

Now I want ordered XML when it's not yet available.

Why me?

But seriously, aside from this being thinking on the edge of what is there, there is some work being done towards this end.

Dynamic labeling schemes for ordered XML based on type information, 2006
http://portal.acm.org/citation.cfm?id=1151736.1151743

Sketch-Based Summarization of Ordered XML Streams
http://whitepapers.techrepublic.com.com/abstract.aspx?docid=889013

Which has left me feeling, if not enlightened then at least no longer utterly insane.

There is also work to suggest that your idea might be the way to go too.

Storing and Querying Ordered XML Using a Relational Database System, by Kevin Beyer, Igor Tatarinov, Jayavel Shanmugasundaram
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.2.5927

I don't think the right answer is to go with an XML to array, array sort and then array to XML device unless we were dealing with fairly small bites of XML.

http://scripts.ringsworld.com/development-tools/openbiz-2.0/openbiz/bin/xmltoarray.php.html

I'm still pondering as it seems such a waste to use an entire service of the complexity of MySQL just for the fast sorting. Anyway, I thought I'd put my ideas down for you to read - I'm going to poke about the net some more and have a good think.
 
LVL 1

Author Comment

by:Matt_T_hat
ID: 24123719
Are we basically saying that XPath and DomDocument would _NOT_ be up to the task?

http://uk2.php.net/manual/en/function.simplexml-element-xpath.php
http://us2.php.net/domdocument

If so could you cover a little of the why?

Because the only thing that worries me is date order (okay so file size is an issue too but...)

Could xpath get me all the timestamped today, then yesterday etc until the count is reached? Further would not treating XML objects as files also enable the generation of object files with a created timestamp of x?
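The "today, then yesterday, until the count is reached" idea can be sketched with XPath 1.0's `starts-with()`. The function name `newest_by_day`, the `$maxDays` cut-off and the assumption that each `<item>` has a `date` attribute beginning with `YYYY-MM-DD` are all illustrative.

```php
<?php
// Sketch: collect items day by day, newest day first, until $count is reached.
function newest_by_day(SimpleXMLElement $stream, string $today, int $count, int $maxDays = 30): array
{
    $found = [];
    $day = new DateTimeImmutable($today);
    for ($i = 0; $i < $maxDays && count($found) < $count; $i++) {
        $prefix = $day->format('Y-m-d');
        // starts-with() is part of XPath 1.0, which SimpleXML supports.
        $hits = $stream->xpath("item[starts-with(@date, '{$prefix}')]");
        // Within a day, later-appended items are assumed to be newer.
        $found = array_merge($found, array_reverse($hits));
        $day = $day->modify('-1 day');
    }
    return array_slice($found, 0, $count);
}

$stream = simplexml_load_string(
    '<stream>'
    . '<item date="2009-03-27T08:00:00" id="1"/>'
    . '<item date="2009-03-29T08:00:00" id="2"/>'
    . '<item date="2009-03-29T09:00:00" id="3"/>'
    . '</stream>'
);

foreach (newest_by_day($stream, '2009-03-29', 3) as $item) {
    echo $item['id'], "\n";
}
```

Each query still walks every `<item>` in the loaded document, so this moves the scan into the XML library rather than removing it. That is the core reason a large, unindexed XML store struggles with "most recent x".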

After all the XML will be a 1:1 representation of the object logic and so storing it as XML means that the logic is indicated.

Actually I don't like using the file system; it feels like an error waiting to happen and breaks the "do not multiply entities beyond necessity" rule.

All the same would DOM or XPath not help here?

You could be right about SQL - I just want to make the best choice for the project without assuming anything.
 
LVL 1

Author Comment

by:Matt_T_hat
ID: 24139208
Having thought about things for a while I have questions about your answer. Should I go with MySQL some issues are raised that I need to get clear in my mind.

For example if I maintain an archive collection and a current items collection then I am only ever dealing with a small collection of XML nodes. Thus ugly sorts might work.
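The split between a small "current" collection and an "archive" could be sketched as a rotation step. The function name `rotate_to_archive` and the `<item>` layout are assumptions, and the sketch relies on items being appended oldest-first.

```php
<?php
// Sketch: keep only the newest $keep items in the "current" document and move
// older ones into the "archive" document. Returns how many items were moved.
function rotate_to_archive(SimpleXMLElement $current, SimpleXMLElement $archive, int $keep): int
{
    $moved = 0;
    $curDom = dom_import_simplexml($current);
    $arcDom = dom_import_simplexml($archive);
    while (count($current->item) > $keep) {
        // The first <item> is the oldest in an append-only document.
        $oldest = dom_import_simplexml($current->item[0]);
        // importNode() makes a deep copy that belongs to the archive document.
        $arcDom->appendChild($arcDom->ownerDocument->importNode($oldest, true));
        $curDom->removeChild($oldest);
        $moved++;
    }
    return $moved;
}

$current = simplexml_load_string(
    '<stream><item id="1"/><item id="2"/><item id="3"/><item id="4"/></stream>'
);
$archive = simplexml_load_string('<archive/>');
echo rotate_to_archive($current, $archive, 2), " items archived\n";
```

With the current collection capped this way, even an "ugly" sort only ever touches a handful of nodes. The rotation itself would still need to run under whatever locking scheme protects the files.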

However I'm left wondering if the issue of concurrent writes to the XML might create data corruption? With locking only one thread can write at a time and this is not exactly efficient.

It is starting to look like your MySQL suggestion is the only way forward as a way to manage XML.

It's ironic that after creating the "perfect specification" whereby the data maps so directly to the output in such an elegant fashion that the data must be maintained (while live) in another storage service.

That said, by wrapping that in a manager class I can always allow for a time when PHP and XML can work together well enough to not need MySQL, with little or no lead time.
 
LVL 1

Author Closing Comment

by:Matt_T_hat
ID: 31564078
Thanks for the feedback and time. I've spent a day diagramming everything (available on mywall.lordmatt.co.uk) and I think I'm going to have to live with MySQL being used. However, some of the things you have suggested have given me some great caching and query reduction ideas.

XML manipulation has a way to go yet but in the mean time I can get on with doing what I do best. Object manipulation.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 24140583
Please take just a moment to read the EE grading guidelines and explain why, after all this work, you marked the answer grade down to a "B" -- I would love to know what I did wrong that disappointed you.  It would help me to know whether to spend time on your questions in the future.

http://www.experts-exchange.com/help.jsp#hi403

Thanks for your input on this matter,
~Ray
 
LVL 1

Author Comment

by:Matt_T_hat
ID: 24147621
While your answer was perfectly correct on a technical level, I was looking for a solution to an issue with XML. Your answers guided me to a solution with lots of wrapping and a whole other server engine, not to mention a whole other set of cache issues. The server in question is already going to be running near the limit of LAMP on a single server, and having the memory-hungry MySQL out of the stack for this solution would have been very good on lots of levels.

You helped, don't get me wrong. However the answer was "it can't be done (yet)", which is not the same as "this is how you might do it". You have clearly put some quality time into helping me and I don't want to not recognise that, because I appreciate it. A grade A in this case would have taken me to a place where I could have solved the ordered XML issue, albeit with dire warnings that in your opinion I might not want to do that.

Does that make sense?

It's not my intention to be rude or insulting, as I appreciate the effort you have made for me. On the other hand the primary problem was not solved but offloaded to another server application to take care of.

The project's XML objects are made up of two segments, head and body, and body can contain any number of additional XML objects that can contain...

Also the head section can contain any number of segments relating to the data and what process is meant to deal with it. The result is that a "stream" is a tree of unlimited depth with the newest items at the top of the list at any given level. So I will be mostly just storing XML in a text field and tracking links with autonumber keys on a self-referential table. That's nowhere near the optimal I was shooting for.

However, if that's the best I can currently hope for from XML and PHP then I have to go with that.
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 24149130
MO: Thanks.  

Matt_T_hat, going forward, please remember that it costs you nothing more to give a grade of "A" unless the answer is really incomplete - in which case a "B" makes sense.

best to all, ~Ray