# Data smoothing in N dimensions¶

Published on 2010-09-20 00:00:00

A question popped up in thread on the Scipy mailing list: how to smooth scattered data? I got slightly interested and quickly wrote a nonparametric smoother.

It seems to work: For those interested:

It will probably land in Scipy at some point. Some code examples are listed below.

The basic idea is the usual one: given a scattered set of coordinates in N-dimensions and scalar data at them, we want to find smoothed data values at the coordinates by minimizing with some sort of weighing balancing the two. It’s easy to see what “deviation from data” means, with some weights , but now that the coordinates are scattered, it is not so straightforward to see what “smoothness” is.

One solution is to construct a minimum norm network [Nielson83]: one imagines that the data set is interpolated by a smooth spline interpolant. A suitable smoothness penalty then would be However, this is difficult to calculate, and so one tries to mimic its qualitative features by (i) constructing a triangulation of the coordinate set (e.g. via Delaunay triangulation), and (ii) writing an approximative penalty term in the form where are edges (between vertices) in the triangulation, and is the interpolant restricted to the edges. Now, suppose that in addition to the smoothed values , also the gradient is known at the coordinates. is by definition an 1-d polynomial, and if we assume it is a 3rd order one, it is uniquely determined by and . In fact, where and are the values at vertices between which edge lies. So by including the gradients in the fitting, the curvature term can be defined in a rigorous fashion. As a side effect, the gradient of the data is estimated.

Since the problem is quadratic in the unknowns, the total smoothing operation can be done by solving a linear equation: where the matrix is the curvature penalty, and contains the weights. It is interesting to note how all geometrical information of the coordinates , the triangulation, etc., has been swallowed by the matrix . This smells heavily of (Tikhonov) regularization — the minimum norm network gives a certain way to construct a regularization matrix natural to the problem.

The matrix equation can be solved in the standard way. It’s not too large: the problem size is (ndim+1)*npoints × (ndim+1)*npoints which stays manageable (on a puny Eee laptop) at least to 10000 data points in 2-D. In principle, iterative methods could be used to solve it, but in practice since I didn’t want to spend time looking for a good preconditioner, direct sparse solvers will have to do.

All in all, this is probably a known smoothing method, but since I did this for “fun”, I didn’t bother trying to find it in the literature. Nevertheless, the minimum norm network idea is beautiful, and evidently indicates Correct Thought.

References

 [Nielson83] G. Nielson, “A method for interpolating scattered data based upon a minimum norm network”. Math. Comp., 40, 253 (1983).

## … in practice?¶

Ok, so far so good. But does it actually work? A simple test problem should be enough to check that:

>>> import numpy as np
>>> points = np.random.rand(1500, 2)
>>> signal = np.cos(5*points[:,0])*np.sin(5*points[:,1])
>>> noise = np.random.randn(points.shape)
>>> data = signal + noise

>>> from scipy.interpolate import NDSmoother, griddata
>>> smoothed = NDSmoother(points, data, scale=1e-5)


The scale parameter controls the curvature term — it essentially determines the length scale of smoothing. If scale is large, the curvature term dominates. If it is small, the curvature term does not do much. Indeed, if we make scale small, there is essentially no smoothing:

>>> print abs(smoothed.values - data).max()
2.23206311856e-06


On the other hand, if we set scale to a big value, we get a least square fit of a plane to the data:

>>> smoothed = NDSmoother(points, data, scale=10.0)
warning: (almost) singular matrix! (estimated cond. number: 1.12e+13)
>>> coef = np.c_[points, np.ones_like(points[:,0])]
>>> sol, res, rank, s = np.linalg.lstsq(coef, data)    # hyperplane fit
>>> print abs(smoothed.values - np.dot(coef, sol)).max()
6.86110242736e-05


Note how the large-scale limit was a bit problematic for the linear solver: the curvature term is huge, but it only forces the smoothed points to lie on a plane. Instead, everything is determined by the small deviation-from-data term, which makes things nasty from the numerical point of view.

So the basic limits seem to be in order. We can now expect that sensible length scales would also work:

>>> smoothed = NDSmoother(points, data, scale=0.1)
>>> smoothed_2 = NDSmoother(points, data, scale=0.5)


Which is probably best illustrated with pictures:

>>> import matplotlib.pyplot as plt
>>> plt.subplot(141)
>>> xx, yy = np.mgrid[0:1:80j,0:1:80j]
>>> zz = griddata(points, signal, (xx, yy))
>>> plt.imshow(zz, extent=(0,1,0,1))
>>> plt.xticks([0, 0.5, 1.0])
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Signal")

>>> plt.subplot(142)
>>> xx, yy = np.mgrid[0:1:80j,0:1:80j]
>>> zz = griddata(points, data, (xx, yy))
>>> plt.imshow(zz, extent=(0,1,0,1))
>>> plt.xticks([0, 0.5, 1.0])
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Signal+Noise")

>>> plt.subplot(143)
>>> plt.imshow(smoothed((xx, yy)), extent=(0,1,0,1))
>>> plt.xticks([0, 0.5, 1.0])
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Scale=0.1")

>>> plt.subplot(144)
>>> plt.imshow(smoothed_2((xx, yy)), extent=(0,1,0,1))
>>> plt.xticks([0, 0.5, 1.0])
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Scale=0.5")

>>> plt.show() Weighing

Suppose now some of the data points contain more or less bogus values.

>>> data = 200
>>> smoothed = NDSmoother(points, data, scale=0.1)


Luckily, adding data point specific weights is also quite easy:

>>> weights = np.ones_like(data)
>>> weights = 1e-4
>>> smoothed_2 = NDSmoother(points, data, scale=0.1, weights=weights)


And so, uncertainties can be controlled to some degree:

>>> xx, yy = np.mgrid[0:1:80j,0:1:80j]
>>> plt.subplot(131)
>>> plt.imshow(smoothed((xx, yy)), extent=(0,1,0,1))
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Unweighted")
>>> plt.subplot(132)
>>> plt.imshow(smoothed_2((xx, yy)), extent=(0,1,0,1))
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Weighted")
>>> plt.show() In principle, this feature should work, but I didn’t really test it yet.

## Higher dimensions?¶

So, 2-d seemed to be OK. How about 3-d?

Suppose we have 500 samples in 3-d in a box [0,1]x[0,1]x[0,1]:

>>> np.random.seed(1234)
>>> points = np.random.rand(500, 3)


The samples contain some signal plus gaussian noise:

>>> def func(x, y, z):
...     return (np.cos(3*x) + 2*y**2)*np.sin(5*x + z)

>>> signal = func(points[:,0], points[:,1], points[:,2])
>>> noise = 0.3*np.random.randn(500)
>>> data = signal + noise


We would like to smooth the noise away to recover the signal. As above:

>>> from scipy.interpolate import NDSmoother
>>> smoothed = NDSmoother(points, data, scale=0.1)
>>> smoothed_2 = NDSmoother(points, data, scale=0.5)


We can plot the slice at z=0.5 to check how smoothing went:

>>> from scipy.interpolate import griddata
>>> x, y = np.mgrid[0:1:30j,0:1:30j]
>>> data_grid = griddata(points, data, (x, y, 0.5))

>>> import matplotlib.pyplot as plt
>>> plt.subplot(141)
>>> plt.imshow(func(x, y, 0.5).T, extent=(0,1,0,1), origin='lower')
>>> plt.title('signal')
>>> plt.subplot(142)
>>> plt.imshow(data_grid.T, extent=(0,1,0,1), origin='lower')
>>> plt.title('data')
>>> plt.subplot(143)
>>> plt.imshow(smoothed((x, y, 0.5)).T,
...            extent=(0,1,0,1), origin='lower')
>>> plt.title('scale=0.1')
>>> plt.subplot(144)
>>> plt.imshow(smoothed_2((x, y, 0.5)).T, extent=(0,1,0,1),
...            origin='lower')
>>> plt.title('scale=0.5')
>>> plt.show() So it works also in 3-D. Thanks to Qhull providing triangulations in arbitrary dimension, it will also work in 4-D, 5-D, 6-D, 7-D, etc., but don’t expect to be able to handle large data sets.

## End of story¶

The bottom line: Functional minimization is fun, and minimum norm networks are cool.