SimpleXML, Date Data Order, Stream, Efficiency and all that
Posted on 2009-03-29
I wish to create a class that will act as an API to a stream of data, such as a Twitter clone or a Facebook wall clone. The preferred store is XML, so I will be using SimpleXML, as I have a sound grasp of how to use it. The store must cope with any size and might become very large over time.
These items are likely to be of the form <something date="" id=""></something>, but the content inside them might get quite complex. The most important point is that, as part of the bigger system, they are time sensitive: the newest items are the ones we are largely going to be interested in.
That said, while I can probably come up with some method of getting a unique ID system in place, I am fairly sure it would end up as a hack of some kind. So my thinking is towards using the date/time, precise to the millisecond, as the ID. That's not my question but...
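To make the millisecond-ID idea concrete, here is a minimal sketch in Python (chosen only because the design is meant to be language-independent; the same logic ports to PHP). The class name `MillisecondId` and the injectable `clock` parameter are hypothetical, made up for illustration. It only guarantees uniqueness within one process; separate web processes would still need a shared lock or counter.

```python
import time


class MillisecondId:
    """Hand out strictly increasing millisecond timestamps as IDs (hypothetical helper)."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so the behaviour can be tested
        self._last = 0       # last ID issued by this instance

    def next_id(self):
        now_ms = int(self._clock() * 1000)
        # If two calls land in the same millisecond (or the clock steps
        # backwards), move one past the previous ID instead of repeating it.
        self._last = max(now_ms, self._last + 1)
        return self._last
```

Two calls inside the same millisecond therefore yield consecutive IDs rather than a duplicate, at the cost of the ID drifting very slightly ahead of the real clock under heavy load.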
Now the output process needs to return a SimpleXML object with the newest x items.
The easiest way would be to scoop the first 20 entries from the XML store, but this would only work if the newest records were at the top. SimpleXML appends new entries at the end of the parent element, so this does not work.
So the next option would be to scoop the last 20, although I cannot currently think of anything that would do this efficiently.
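One way to scoop the last 20 without walking the whole collection is to read the file backwards from the end. The sketch below is in Python and rests on an assumption that is not part of the plan above: the store is an append-only file with exactly one complete item element per line and no wrapping root element, so the tail can be parsed on its own. The function names and the `<stream>` wrapper are hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def last_lines(path, n, chunk_size=4096):
    """Return the last n non-empty lines of a file, reading backwards in chunks."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Pull chunks from the end until more than n line breaks have been
        # seen, which guarantees the last n lines are complete.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data
    lines = [ln for ln in data.split(b"\n") if ln.strip()]
    return lines[-n:]


def newest_items(path, n):
    """Parse only the tail of the store into a <stream> element, newest first."""
    root = ET.Element("stream")
    for line in reversed(last_lines(path, n)):
        root.append(ET.fromstring(line))
    return root
```

The work done depends on the size of the last n items, not on the size of the file, which is the property a plain loop over the whole document lacks.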
The ideal solution would not care if the data was unsorted, but this is not going to happen, as some maintenance or index is needed to keep data fetching simple (the same reason SQL databases have indexes).
At all costs I want to avoid doing a foreach through the entire collection as this will eventually kill any server as the file gets bigger and the output more popular.
As a secondary factor I am sure that I will want to "page" through the "archives".
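Paging through the archives comes down to index arithmetic counted from the newest end. A minimal sketch in Python, assuming the items are held oldest first (as they would be when new entries are appended at the end); the function names and the default page size of 20 are hypothetical.

```python
def page_bounds(total, page, per_page=20):
    """Return (start, stop) indices into an oldest-first list.

    Page 1 covers the newest per_page items, page 2 the ones before, and so on.
    """
    stop = max(total - (page - 1) * per_page, 0)
    start = max(stop - per_page, 0)
    return start, stop


def get_page(items, page, per_page=20):
    """Return one page of items, newest first; empty once the archive runs out."""
    start, stop = page_bounds(len(items), page, per_page)
    return list(reversed(items[start:stop]))
```

With 45 items, page 1 holds items 45 down to 26, page 3 holds the oldest five, and page 4 is empty.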
So I need a logically sound pattern (and some pointers as to how I can implement the fellow) for storing and retrieving from such a store.
Using MySQL would solve these problems, as they are already solved there, but it makes demands of the final implementation that I am not happy with. The ideal is that we have a specification for the process and display that could be implemented in any language and easily replaced in the future if the next maintainer feels Ruby on Rails or whatever is the way forward.
So it has to be XML. It has to be efficient even with big files and it has to be simple enough for me to understand.
I plan to create a Factory singleton that will maintain a cache of each stream-specific API class. However, a web site is multi-threaded (each child process could be running the same code), so this is only so much use. Therefore I need to work out exact rules for the API class for read and write that would stop multiple threads bumming out the system with identical dates or dates out of order...
I considered a once-a-minute cron job that takes from a buffer file and, as a virtual single-threaded singleton, writes to the store, but this is overly complex and opens up a whole new can of possible problems. I need to create a plan for a data format and read-write method that plays well with other instances.
I considered file locking, but I don't know enough about it to use it safely and, to be honest, aside from identical date-time IDs there should be no issue that requires it. However...
So the read-write process also needs sanity checks.
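For reference, an advisory file lock is enough to rule out identical or out-of-order IDs between processes. The sketch below is in Python and uses `fcntl.flock`, which exists only on Unix-like systems. It makes several assumptions that are not part of the plan above: one item per line in an append-only store, a sidecar `.lock` file that also remembers the last ID issued, and a hypothetical `append_item` function with a `now_ms` parameter exposed for testing.

```python
import fcntl
import time
from xml.sax.saxutils import escape


def append_item(store_path, text, now_ms=None):
    """Append one <item> line while holding an exclusive lock; return its ID."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Assumption of this sketch: the lock file doubles as storage for the
    # last ID issued, so every process sees the same high-water mark.
    with open(store_path + ".lock", "a+") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks until other writers finish
        try:
            lock.seek(0)
            last_id = int(lock.read().strip() or 0)
            item_id = max(now_ms, last_id + 1)  # never equal, never backwards
            with open(store_path, "a", encoding="utf-8") as store:
                store.write('<item id="%d">%s</item>\n' % (item_id, escape(text)))
            # Record the new high-water mark before releasing the lock.
            lock.seek(0)
            lock.truncate()
            lock.write(str(item_id))
            lock.flush()
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return item_id
```

Because the ID is chosen and the line is written inside the same locked section, two writers arriving in the same millisecond get consecutive IDs, and the file order always matches the ID order. The lock is advisory, so it only protects against writers that go through the same function.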
I'm not looking for complete solutions, but I do need to form a sound plan to work from, with the potential problems mitigated or avoided, in theory at least.
I'm shooting for stable and reliable, even if lots of random developers of different skills start making calls to the API class.
That's why this question is getting the top points possible - I need a solid working plan.