Best Fit Equation for a Line

I am working on a module in our application that will calculate the volume of a tank given a height. Let us say X = height in inches of Tank 001, the user will specify a formula to calculate volume Y (note that computed volume would vary depending on the density of what is being stored in the tank, but kind of irrelevant to this question). As an example:

Y = C * Power(X,3) + B * Power(X,2) + A * X + k

A, B, C are coefficients in this 3rd order polynomial equation, X is the height in inches, and k is a constant.

The users used Excel to come up with this equation by having Excel plot a line for a series of X, Y coordinates and then it gave them a best fit equation to interpolate the line.

I already have an algorithm (in SQL) that will calculate Y based on the above user supplied formula (for given A, B, C, and k).

Now what they are asking for is the ability to enter in a series of sample points, let us say between 15 and 40 samples of X and Y, and have our routine spit back out a formula. In other words, it would determine what the A, B, C, and k are for the polynomial equation.

Note that not all of their example formulas are third order polynomials, some are just linear equations, Y = [A] * X + [k],   and possibly other stuff is theoretically possible.

I did find a routine that can calculate a very rough (VERY rough) 3rd order polynomial for a sample set but it does it by splitting the data set (ordered) into 4 regions and solves a system of 4 equations using the averages of X for each region.  This actually didn't turn out to be that bad, but on the lower and upper ends of the data set it's just too much error.

Has anyone already solved this before and if so, any advice or an algorithm?

This would not have to be an algoritm in TSQL, I can convert it from whatever. Or suppose I could make this a CLR if it was .NET.
Who is Participating?
TommySzalapskiConnect With a Mentor Commented:
Which method is best depends on the data. If the "strapping table" is very accurate, then some kind of interpolating is by far the best method. There are interpolation methods that produce nice curves between the anchor points.

If the strapping table is mostly accurate but has some errors that are way off (like typos), then you certainly want some kind of outlier detection. Your mention of "obvious weeds" makes me think this is likely.

If the strapping table is rough estimates and they just want a simple formula that gets kinda close, then a best fit curve would probably be good enough especially since that's what they have been using.

Excel uses a least squares linear regression approach. The LINEST function in Excel is what the chart tool uses in the background. Here is more information on that (from MS) than you probably want

There are lots of ways to do linear regression to get a least squares best fit polynomial. Here's the mathematical definitions
Here's an applet that shows what it can look like
And here's some sample code (pick your favorite language)
>>  volume would vary depending on the density of what is being stored in the tank

That seems strange.  Are you talking about volume (gallons, cubit feet, cubic meters) or mass/weight.

The characteristics of a tank should be determined by it geometry/measurements.
I don't see that any sort of fitting would help.
What shape is the tank?
My guess is they came up with that formula because that's the best fit polynomial line Excel could find given the parameters. If the tank is a cylinder lying on it's side, then there's a "right answer" that doesn't involve best fit guessing but is mildly complicated.

If the tank is an upright cylinder, then the formula is really easy.

If the tanks are random shapes, then I can see how you might be forced to guess. In that case though, you can trust your data points that you use to find the equation, so you won't want the smoothing that you would get with a best fit polynomial. You'd be better off storing the known values and picking some weighted average of the two closest points.
virtuadeptAuthor Commented:
I already have plenty of equations on how to calculate the volume of different types (shapes) of tanks. That is not the question I have. Many of our tanks have data sufficient enough to use a volume formula to compute volume, such as the volume of a partial sphere for a spherical tank to compute unadjusted volume and then adjust for pressure, temperature, specific gravity, etc.

That's not how the petroleum industry always does it, however. A lot of times what you have to work with is a "strapping table", this is where you have a chart that was probably generated in the 1970's or some time long ago that says for every inch or partial inch X, the volume for product P at a given pressure/temperature will be Y.  In other words, it plots out like a line graph where X is the horizontal axis and Y is the vertical axis.

My question is, given a good sampling list of 15 to 40 or so X values (height, but that is irrelevant to the question) and associated Y values (volume), is there an algorithm that can generate the best fit equation for that set of points? Obviously I know there are algorithms that exist for this, I've seen some libraries on the web, and Excel has this feature. But if you had to code this for a in-house application, what is the approach you would take?

This may not be optimal. However, this is what they are ALREADY DOING using Excel. My mission, should I choose to accept it (have no choice ;P) is to give the the functionality they desire in our own custom in-house application. Excel lets them type in a series of X and Y values and plot a curve and then help them come up with a formula for that curve. That's fine, they can keep doing that, if it comes to it. But if it is not terribly difficult to implement, and by that, I mean not more than a weekend or so, I would like for our app to be able to do that functionality.

Also, thank you TommySzalapski for the advice on interpolating based on closest points, that is an option I was going to add in (I have about 6 or so calculation methods so far, this would be another one). This would probably require a lot more data to get right, much more than 40 points, but we may indeed have that data already in a digital form based on the history of the tank(s) so that could be a method they decided to do (if I can figure out how to clean out the "obvious weeds in the grass" data points).
virtuadeptAuthor Commented:
Thanks very much, those links were extremely helpful.
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.