Data smoothing in N dimensions

Published on 2010-09-20 00:00:00

A question popped up in thread on the Scipy mailing list: how to smooth scattered data? I got slightly interested and quickly wrote a nonparametric smoother.

It seems to work:


For those interested:

It will probably land in Scipy at some point. Some code examples are listed below.

The basic idea is the usual one: given a scattered set of coordinates \{\mathbf{x}_i\}_i in N-dimensions and scalar data f_i at them, we want to find smoothed data values g_i at the coordinates by minimizing

||\text{deviation from data}||^2 + ||\text{deviation from smoothness}||^2

with some sort of weighing balancing the two. It’s easy to see what “deviation from data” means,

\sum_i w_i |f_i - g_i|^2

with some weights w_i, but now that the coordinates \mathbf{x}_i are scattered, it is not so straightforward to see what “smoothness” is.

One solution is to construct a minimum norm network [Nielson83]: one imagines that the data set is interpolated by a smooth spline interpolant. A suitable smoothness penalty then would be

\int d\mathbf{x} ||\text{curvature of the interpolant}||^2

However, this is difficult to calculate, and so one tries to mimic its qualitative features by (i) constructing a triangulation of the coordinate set (e.g. via Delaunay triangulation), and (ii) writing an approximative penalty term in the form

\sum_E \int_0^1 dx |W''(x)|^2

where E are edges (between vertices) in the triangulation, and W is the interpolant restricted to the edges. Now, suppose that in addition to the smoothed values g_i, also the gradient \nabla g_i is known at the coordinates. W is by definition an 1-d polynomial, and if we assume it is a 3rd order one, it is uniquely determined by g_i and \nabla g_i. In fact,

\int_0^1 dx ||W''(x)||^2 = L^{-3}
\begin{pmatrix}g_1\\g_2\\E\cdot\nabla g_1\\-E\cdot\nabla g_2\end{pmatrix}^T
12 & -12 & 6 & -6 \\
-12 & 12 & -6 & 6 \\
6 & -6 & 4 & -2 \\
-6 & 6 & -2 & 4
\begin{pmatrix}g_1\\g_2\\E\cdot\nabla g_1\\-E\cdot\nabla g_2\end{pmatrix}

where g_1 and g_2 are the values at vertices between which edge E lies. So by including the gradients in the fitting, the curvature term can be defined in a rigorous fashion. As a side effect, the gradient of the data is estimated.

Since the problem is quadratic in the unknowns, the total smoothing operation can be done by solving a linear equation:

(A + B) \begin{pmatrix}g\\\nabla g\end{pmatrix} = B \begin{pmatrix}f\\0\end{pmatrix}

where the matrix A is the curvature penalty, and B_{ij}
= w_i\delta_{ij} contains the weights. It is interesting to note how all geometrical information of the coordinates \mathbf{x}_i, the triangulation, etc., has been swallowed by the matrix A. This smells heavily of (Tikhonov) regularization — the minimum norm network gives a certain way to construct a regularization matrix natural to the problem.

The matrix equation can be solved in the standard way. It’s not too large: the problem size is (ndim+1)*npoints × (ndim+1)*npoints which stays manageable (on a puny Eee laptop) at least to 10000 data points in 2-D. In principle, iterative methods could be used to solve it, but in practice since I didn’t want to spend time looking for a good preconditioner, direct sparse solvers will have to do.

All in all, this is probably a known smoothing method, but since I did this for “fun”, I didn’t bother trying to find it in the literature. Nevertheless, the minimum norm network idea is beautiful, and evidently indicates Correct Thought.


[Nielson83]G. Nielson, “A method for interpolating scattered data based upon a minimum norm network”. Math. Comp., 40, 253 (1983).

... in practice?

Ok, so far so good. But does it actually work? A simple test problem should be enough to check that:

>>> import numpy as np
>>> points = np.random.rand(1500, 2)
>>> signal = np.cos(5*points[:,0])*np.sin(5*points[:,1])
>>> noise = np.random.randn(points.shape[0])
>>> data = signal + noise
>>> from scipy.interpolate import NDSmoother, griddata
>>> smoothed = NDSmoother(points, data, scale=1e-5)

The scale parameter controls the curvature term — it essentially determines the length scale of smoothing. If scale is large, the curvature term dominates. If it is small, the curvature term does not do much. Indeed, if we make scale small, there is essentially no smoothing:

>>> print abs(smoothed.values - data).max()

On the other hand, if we set scale to a big value, we get a least square fit of a plane to the data:

>>> smoothed = NDSmoother(points, data, scale=10.0)
warning: (almost) singular matrix! (estimated cond. number: 1.12e+13)
>>> coef = np.c_[points, np.ones_like(points[:,0])]
>>> sol, res, rank, s = np.linalg.lstsq(coef, data)    # hyperplane fit
>>> print abs(smoothed.values -, sol)).max()

Note how the large-scale limit was a bit problematic for the linear solver: the curvature term is huge, but it only forces the smoothed points to lie on a plane. Instead, everything is determined by the small deviation-from-data term, which makes things nasty from the numerical point of view.

So the basic limits seem to be in order. We can now expect that sensible length scales would also work:

>>> smoothed = NDSmoother(points, data, scale=0.1)
>>> smoothed_2 = NDSmoother(points, data, scale=0.5)

Which is probably best illustrated with pictures:

>>> import matplotlib.pyplot as plt
>>> plt.subplot(141)
>>> xx, yy = np.mgrid[0:1:80j,0:1:80j]
>>> zz = griddata(points, signal, (xx, yy))
>>> plt.imshow(zz, extent=(0,1,0,1))
>>> plt.xticks([0, 0.5, 1.0])
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Signal")
>>> plt.subplot(142)
>>> xx, yy = np.mgrid[0:1:80j,0:1:80j]
>>> zz = griddata(points, data, (xx, yy))
>>> plt.imshow(zz, extent=(0,1,0,1))
>>> plt.xticks([0, 0.5, 1.0])
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Signal+Noise")
>>> plt.subplot(143)
>>> plt.imshow(smoothed((xx, yy)), extent=(0,1,0,1))
>>> plt.xticks([0, 0.5, 1.0])
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Scale=0.1")
>>> plt.subplot(144)
>>> plt.imshow(smoothed_2((xx, yy)), extent=(0,1,0,1))
>>> plt.xticks([0, 0.5, 1.0])
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Scale=0.5")


Suppose now some of the data points contain more or less bogus values.

>>> data[30] = 200
>>> smoothed = NDSmoother(points, data, scale=0.1)

Luckily, adding data point specific weights is also quite easy:

>>> weights = np.ones_like(data)
>>> weights[30] = 1e-4
>>> smoothed_2 = NDSmoother(points, data, scale=0.1, weights=weights)

And so, uncertainties can be controlled to some degree:

>>> xx, yy = np.mgrid[0:1:80j,0:1:80j]
>>> plt.subplot(131)
>>> plt.imshow(smoothed((xx, yy)), extent=(0,1,0,1))
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Unweighted")
>>> plt.subplot(132)
>>> plt.imshow(smoothed_2((xx, yy)), extent=(0,1,0,1))
>>> plt.clim(signal.min(), signal.max())
>>> plt.title("Weighted")

In principle, this feature should work, but I didn’t really test it yet.

Higher dimensions?

So, 2-d seemed to be OK. How about 3-d?

Suppose we have 500 samples in 3-d in a box [0,1]x[0,1]x[0,1]:

>>> np.random.seed(1234)
>>> points = np.random.rand(500, 3)

The samples contain some signal plus gaussian noise:

>>> def func(x, y, z):
...     return (np.cos(3*x) + 2*y**2)*np.sin(5*x + z)
>>> signal = func(points[:,0], points[:,1], points[:,2])
>>> noise = 0.3*np.random.randn(500)
>>> data = signal + noise

We would like to smooth the noise away to recover the signal. As above:

>>> from scipy.interpolate import NDSmoother
>>> smoothed = NDSmoother(points, data, scale=0.1)
>>> smoothed_2 = NDSmoother(points, data, scale=0.5)

We can plot the slice at z=0.5 to check how smoothing went:

>>> from scipy.interpolate import griddata
>>> x, y = np.mgrid[0:1:30j,0:1:30j]
>>> data_grid = griddata(points, data, (x, y, 0.5))
>>> import matplotlib.pyplot as plt
>>> plt.subplot(141)
>>> plt.imshow(func(x, y, 0.5).T, extent=(0,1,0,1), origin='lower')
>>> plt.title('signal')
>>> plt.subplot(142)
>>> plt.imshow(data_grid.T, extent=(0,1,0,1), origin='lower')
>>> plt.title('data')
>>> plt.subplot(143)
>>> plt.imshow(smoothed((x, y, 0.5)).T,
...            extent=(0,1,0,1), origin='lower')
>>> plt.title('scale=0.1')
>>> plt.subplot(144)
>>> plt.imshow(smoothed_2((x, y, 0.5)).T, extent=(0,1,0,1),
...            origin='lower')
>>> plt.title('scale=0.5')

So it works also in 3-D. Thanks to Qhull providing triangulations in arbitrary dimension, it will also work in 4-D, 5-D, 6-D, 7-D, etc., but don’t expect to be able to handle large data sets.

End of story

The bottom line: Functional minimization is fun, and minimum norm networks are cool.


blog comments powered by Disqus