Calculating probable occurence of events

Dear Experts,

I have created a database which logs issues as they are submitted by our users. These issues are related to specific job items which they have been assigned to complete, and n this sense each issue has an identifier and each job item has an identifier.

As far as the job items are concerned, the time user is alerted to the item assignment is logged.

Likewise the time of the reported issue is logged, and also the corresponding job item identifier is logged along with this record.

What I am aiming to create is an equation which will calculate the chance of issues occuring considering the current frequency and count of assigned items.

For example (made up example):

12 items assigned on Tuesday, and for this day 2 issues were reported
72 items were assigned on Thursday and there were 32 errors

In this case I can see that the relationship between the number of assigned items and issues reported is not linear, and I am finding it difficult to understand how I can predict the number of  issues expected for a certain frequency if items assigned.

Considering I have easy access to historical information linking the number of assigned items with the number issues reported over time is there some method I can use as a foundation for doing my calculation?

As a basis, I will have a table (View/Stored Procedure generated from SQL Server 2005) as described in the code snipped area Table A. There will be multiple records in this.

Could anyone recommend how I could harvest this information to retrieve variables to progress towards what I need, which is essentially the answer to the question:

"If X items are being processed today then how many issues can I expect?"

I'm not sure any answers need to be in programming language immediately as I have not finalised what my development language would be, but if there are any SQL Server 2005 queries and generic formulae offered I would be greatly pleased to see them!

Many thanks.

Table A:

Open in new window

Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

With only 2 sample data points it's hard to see what might be happening.

Since you have access to historical data, try querying the min,max and average number of errors by number of items processed daily.

Possibly including day of week, if you have processing spikes and/or dips by schedule as well.

butterhookAuthor Commented:
Do you mean for example splitting the data into day (and potentially AM/PM), and when aware of the current number of items being processed to then look for a day with a similar amount of items for the estimate?

At this stage I would like to predict the chance of error, no more drill-down is necessary.

What other data points could be valid to use, for example?
What I meant by 2 data points was only 2 example days were shown.

So, with the limited data you have right now.....

I'd say on Tuesdays you have a 1/6 chance of error
On Thursdays you have a 4/9 chance of error

On other days I guess I'd extrapolate from those two points and say you have 17/42 chance of error.

With more points you can extrapolate whether you have trends by day or trends by quantity or both.
Cloud Class® Course: Amazon Web Services - Basic

Are you thinking about creating an Amazon Web Services account for your business? Not sure where to start? In this course you’ll get an overview of the history of AWS and take a tour of their user interface.

butterhookAuthor Commented:
Aha, my data about days was just an example. In reality I have data about every day since monitoring began. How can I use these statistics to predict the frequncy of issues relating to the frequency of items assigned?
what does your database table look like?  I can help you write queries to extrapolate reasonable predictions.  Can you attach a csv file with more sample data?
butterhookAuthor Commented:
I will have to create the tables and remove any business sensitive information, but I will endeavour to sort this ASAP. Thank you very much.
Plot issues (y axis) vs items (x axis).
Do least squares fit to a straight line, parabola, and any other curve which looks like it might fit. Pick the best one. There exist goodness of fit parameters. Pick the one you think best.
You can also use various interpolation algorithms ( linear, quadratic. moving average (not good here), bicubic, etc). Extrapolation is risky. If data points show lots of scatter, you can try the above procedures on sub sets of data (by day, week, month, etc). There is no best way. Judgment is required.
sort by user might be useful too.
If you have a log that states, for each item, whether an issue was raised, then you can perform a logistic regression on the data ( This will work best if this log contains a number of variables that may influence the probability of an issue arising, and will give you a model such that, if you feed in the values of these variables for a given item, will tell you a predicted probability of an issue arising from it, and from that you can calculate an expected number of issues each day.

I don't think SQL can do logistic regression (if it can do Maximum Likelihood Estimation then I will probably be proven wrong), but there are procedures and functions in most statistical packages like SAS and R that will do it, and a little googling suggests that you can even pull it off in spreadsheet programs like Excel and Lotus 1-2-3.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
butterhookAuthor Commented:
Thanks Confusing, this sounds like a good idea. I don't have too many variables to play with at the moment, just number of items, whether there was an issue and what kind of issue.

In terms of getting historical data out which could be relevant I have been thinking of doing something like this:

1. Figuring out how many items are expected in tomorrow

2. Running a query to retrieve the average frequency of issues for days when a similar number of items was processed, in pseudo code/incorrect SQL:


If I can get either a table or the average issue frequency as a stat. I can do coding in whatever language, probably PHP (in order to get the info into a web format for a 'dashboard'). I can then fine tune the tolerance according to the accuracy of the predictions.

I can also do other calculations within PHP as appropriate.

Does this sound like the right idea?
are the items pre-populated in your table and you want to estimate what the issues will be updated to be or is it a parameter to the query?

similarly tolerance is a parameter? or a value in a table.  If the latter, is it stored by day?  That is, does the tolerance change from one day to the next?
Assuming they are both parameters then your pseudo-query above is almost correct

select avg(issues) from a
where items >= p_tomorrows_items - p_tolerance
and items <= p_tomorrows_items + p_tolerance

If you need to look up tomorrow's items then something like this...

select avg(issues) from a
where items >= (select items from a where day = dateadd(dd,1, datediff(dd,0,getdate())) - p_tolerance
and items <= (select items from a where day = dateadd(dd,1, datediff(dd,0,getdate()))  + p_tolerance

if the tolerance is also from a table and not a parameter, please provide the table structure and join/lookup conditions and I can write that for you
butterhookAuthor Commented:
I think the tolerance may eventually become a parameter and could eventually be tuned by judging retrospectively the accuracy of past predictions, but for now will be an arbitrary figure which will get fine tuned until the stats look right.

Since I started having passion for this analysis we have solved the problem which was causing the issues initially! However it's good to be thinking about predictive analytics from our production database because it's easier get involved with a large, real dataset than some theoretical maths. For me anyway.

I will shortly be assigning shared points between those of you who have helped me with this, thank you very much.
butterhookAuthor Commented:
Excellent, thank you everyone. The process of discussing this helped me get my head round the question and possible solutions.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Web Development

From novice to tech pro — start learning today.