Data models in which the data points must satisfy multiple linear relations (i.e. they lie on a lower dimensional subspace) can often be handled by treating each dependent coordinate separately, but this is only appropriate for noise models in which the coordinates are independent (e.g. Gaussian noise).
So if we have a collection of points (xi,yi) defining a regression problem, we can think of them instead as each defining a different line b=(-xi)a+yi in the a-b plane. What we want to find is a point in the a-b plane that is "near" all of these lines; that is, we expect the lines to roughly look like they all go through a common point, and we want to find that common point.
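To make the duality concrete, here is a small Python sketch (the points and the line y=2x+1 are made up for illustration); for noise-free data every dual line passes through the common point (a,b)=(2,1).

    # Point-line duality: each data point (xi, yi) becomes the line
    # b = (-xi) * a + yi in the a-b plane.
    def dual_line(xi, yi):
        return (-xi, yi)   # (slope, intercept) of the dual line

    # Noise-free points on y = 2x + 1: every dual line passes through (a, b) = (2, 1).
    points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
    a_star, b_star = 2.0, 1.0
    for xi, yi in points:
        slope, intercept = dual_line(xi, yi)
        assert abs(slope * a_star + intercept - b_star) < 1e-9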
Similar methods transform any higher dimensional regression problem of fitting a single hyperplane to a set of data points into an equivalent problem of fitting a single point to an arrangement of hyperplanes.
The big advantage of least squares is that it's easy to solve: E is just a positive definite quadratic form, so it has a unique minimum. Since E is a smooth function of m and b, its minimum occurs at the point where its partial derivatives in each coordinate are all zero. This gives a system of d equations in d unknowns, which one can solve by Gaussian elimination, or one can instead use local improvement methods to approximate the optimal fit. Thus the time is linear in the number of data points, and at worst cubic in the dimension.
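As a sketch of how this works in practice (using numpy; the data is invented), the normal equations can be formed in time linear in the number of points and solved by Gaussian elimination in time cubic in the dimension:

    import numpy as np

    def least_squares_fit(X, y):
        """Fit y ~ X @ m + b by minimizing the sum of squared vertical residuals.
        Setting the partial derivatives of E to zero gives the normal equations
        A^T A z = A^T y for z = (m, b)."""
        A = np.hstack([X, np.ones((X.shape[0], 1))])  # extra column of ones for the intercept b
        z = np.linalg.solve(A.T @ A, A.T @ y)         # Gaussian elimination on the normal equations
        return z[:-1], z[-1]                          # slope vector m, intercept b

    X = np.array([[0.0], [1.0], [2.0], [3.0]])        # made-up data near y = 3x - 2
    y = np.array([-2.1, 1.0, 4.1, 6.9])
    m, b = least_squares_fit(X, y)

(In floating point one would normally prefer a QR-based solver such as numpy's lstsq, but the normal equations match the derivation above.)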
More generally the Lp error is formed by using pth powers in place of squares in the definition of E. The limit of this as p goes to infinity, the Linfinity norm, is just the maximum error of any individual data point. The most commonly used of these are the L1 norm (also known as "least absolute deviation" or LAD regression), which is less sensitive to outliers than least squares, and the Linfinity norm, which is more sensitive to outliers and therefore useful in contexts where one wants to detect them, e.g. measuring flatness of a surface.
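For concreteness, the error functions being compared look like this (a minimal sketch; the candidate line and data are again invented):

    import numpy as np

    def lp_error(m, b, X, y, p):
        """Lp error of the fit y ~ X @ m + b: the sum of pth powers of the
        vertical residuals, or the maximum residual when p is infinity."""
        r = np.abs(y - (X @ m + b))
        return np.max(r) if np.isinf(p) else np.sum(r ** p)

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([-2.1, 1.0, 4.1, 6.9])
    m, b = np.array([3.0]), -2.0                      # a candidate line y = 3x - 2
    e1, e2, einf = (lp_error(m, b, X, y, p) for p in (1, 2, np.inf))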
The Linfinity estimator can be expressed as a low-dimensional linear program: find (m,b,e), minimizing e, and satisfying the linear constraints -e <= m.xi+b-yi <= e. The dimension (number of variables to be found) is just d+1. Therefore, it can be solved in time linear in the number of data points (but exponential in the dimension) using low-dimensional linear programming techniques. [citations to be filled in later]. There also exists a similar linear time bound for L1 estimation [Z84]. All Lp estimators are invariant to coordinate changes and affine transformations among only the dependent variables.
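Here is a sketch of that linear program using scipy's general-purpose LP solver rather than the specialized low-dimensional techniques; the variable order (m, b, e) and the data are chosen purely for illustration:

    import numpy as np
    from scipy.optimize import linprog

    def linf_fit(X, y):
        """Linfinity fit: minimize e over (m, b, e) subject to
        -e <= X @ m + b - y <= e, written as two blocks of <= constraints."""
        n, d = X.shape
        c = np.zeros(d + 2)
        c[-1] = 1.0                                            # objective: minimize e
        A_ub = np.vstack([np.hstack([ X, np.ones((n, 1)), -np.ones((n, 1))]),   #  X m + b - e <= y
                          np.hstack([-X, -np.ones((n, 1)), -np.ones((n, 1))])]) # -X m - b - e <= -y
        b_ub = np.concatenate([y, -y])
        bounds = [(None, None)] * (d + 1) + [(0, None)]        # m and b free, e >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:d], res.x[d], res.x[d + 1]               # m, b, and the optimal error e

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([-2.1, 1.0, 4.1, 6.9])
    m, b, e = linf_fit(X, y)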
One advantage of assuming dependent noise is that it is preserved by the duality transformation described above: the vertical distance of a point (xi,yi) from the line y=ax+b is the same as the vertical distance of the point (a,b) from the line b=(-xi)a+yi. (Just work out the equations: both distances come out to |yi-(a xi+b)|.)
In the discussion, one of the students suggested using a convex combination of different Lp metrics, which might e.g. come from a truncated power series for some harder-to-estimate metric. Alternatively, the coefficients in the convex combination might themselves come from some sort of statistical estimation, to fit the noise actually apparent in the data. However, I don't know of any references related to these ideas.
There's a nice connection between L1 estimation in the independent noise model and the famous "k-sets" or "k-level" problem from computational geometry: if you fix a choice of slope, the problem is just a one-dimensional L1 problem which can be solved by the median. If we use projective duality, what we want to know is the median among the points where the dual hyperplanes cross each possible vertical line; this sequence of medians forms a surface in the arrangement and is known as the "k-level" (here k=n/2). In two dimensions, this surface is just a polygonal path, formed by walking left-to-right through the arrangement and turning up or down at every crossing point you pass through. A famous problem posed by Erdős and Lovász asks how many edges this path can have; the known bounds are Omega(n log k) and O(nk^(1/3)).
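The fixed-slope observation is easy to state in code (a tiny sketch; the outlier value is invented to show the median's insensitivity):

    import numpy as np

    def best_intercept_l1(m, x, y):
        """For a fixed slope m, the intercept minimizing sum_i |yi - (m*xi + b)|
        is a median of the values yi - m*xi."""
        return np.median(y - m * x)

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.0, 2.9, 5.2, 7.0, 30.0])   # last point is a gross outlier
    b = best_intercept_l1(2.0, x, y)           # median residual, unaffected by the outlier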
Linfinity estimation is equivalent to computing the width of a set (minimum distance between parallel supporting hyperplanes); it can be computed in O(n log n) time in the plane, but the best time known in three dimensions is O(n^(3/2)) [citations to be filled in later].
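A simple way to compute the planar width (slower than the O(n log n) bound; this sketch relies on scipy's convex hull code and on the standard fact that some minimum-width slab has a hull edge on its boundary):

    import numpy as np
    from scipy.spatial import ConvexHull

    def width(points):
        """Minimum distance between two parallel lines enclosing all the points.
        Checks every convex hull edge as a candidate slab boundary."""
        hull = points[ConvexHull(points).vertices]          # hull vertices in order
        best = np.inf
        for i in range(len(hull)):
            p, q = hull[i], hull[(i + 1) % len(hull)]
            edge = q - p
            normal = np.array([-edge[1], edge[0]]) / np.linalg.norm(edge)
            best = min(best, np.max(np.abs((hull - p) @ normal)))
        return best

    pts = np.random.default_rng(0).normal(size=(100, 2))    # made-up test data
    w = width(pts)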
These independent noise estimators are generally invariant under arbitrary translations or rotations of the coordinate system; however, they are not invariant under general affine transformations.
With that definition, one can make any sort of Lp regression etc. The L1, L2, and Linfinity regressions can all be solved easily using simple modifications of the algorithms for dependent noise models. (This is one big advantage of dual independence over primal independence.)
I don't know of any references, or work by actual statisticians, on this sort of model.
It is commonly claimed that this estimator provides the maximum possible robustness to outliers -- up to half the data may be arbitrarily corrupted. This claim seems to rely on an assumption that the data is well spread out along the independent coordinates, rather than being tightly clustered along lower dimensional subspaces in a way that would allow an adversary to set up false estimates using large numbers of valid data points. But the same assumption must be made for any other robust regression method.
Unfortunately algorithms for computing the LMS estimator are relatively slow: O(n^2) even in two dimensions [SS87, ES90]. There are theoretical reasons for believing that no exact algorithm can be faster than this. However Mount et al [MNRSW97] provide an O(n log n) time approximation as well as a randomized algorithm that they believe should work well in practice (although it lacks theoretical performance guarantees).
In the dual framework, we are given an arrangement of lines, and we wish to find the shortest possible vertical line segment that crosses at least half of the lines.
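To make the estimator concrete, here is a naive sketch of exact planar LMS (much slower than the algorithms cited above). It relies on the standard observation that some optimal strip has two of the data points on one boundary line, so only slopes determined by pairs of points need to be tried; taking h = n//2 + 1 points as "half the data" is one common convention.

    import numpy as np
    from itertools import combinations

    def lms_line(x, y):
        """Least median of squares fit in the plane, by brute force:
        for each candidate slope, find the shortest vertical strip containing
        h of the n points; return the line through the middle of the best strip."""
        n = len(x)
        h = n // 2 + 1
        best = (np.inf, 0.0, 0.0)                  # (strip height, slope, intercept)
        for i, j in combinations(range(n), 2):
            if x[i] == x[j]:
                continue                           # skip vertical candidate slopes
            m = (y[j] - y[i]) / (x[j] - x[i])
            r = np.sort(y - m * x)                 # vertical offsets of the points at this slope
            heights = r[h - 1:] - r[:n - h + 1]    # heights of all strips of h consecutive offsets
            k = int(np.argmin(heights))
            if heights[k] < best[0]:
                best = (heights[k], m, (r[k] + r[k + h - 1]) / 2)
        return best[1], best[2]

    x = np.arange(8.0)
    y = 2 * x + 1 + np.array([0.05, -0.04, 0.02, 0.0, -0.03, 9.0, -7.0, 11.0])  # three outliers
    m, b = lms_line(x, y)                          # recovers roughly y = 2x + 1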
This estimator assumes a dependent noise model; one can define LMS-like estimators in similar ways for independent noise (in which case one is trying to find a line minimizing the median distance to the points) or for dual-independent noise (in which case one is trying to find a minimum-radius circle that crosses at least half of the lines).
Here's my guess, based on the name -- I haven't looked the paper itself up yet. LMS estimation can be interpreted as finding a set of outliers to throw away and ignore (namely the points further away than the median distance it finds). But once it throws away those points, it then does something fairly stupid with the rest of the points: just the Linfinity estimator, which is very sensitive to outliers. So if the data that's left after throwing away half the points is still suspect, it seems more reasonable to instead apply the same outlier detection strategy recursively.
There's not much to say about algorithms for this, at least not without more progress on algorithms for the LMS estimator. The reason is that if you have a quadratic algorithm for LMS, and apply it recursively as above, you can describe the recursion's time by the recurrence T(n)=T(n/2)+O(n^2), which is again quadratic. So asymptotically, the time for this recursive approach would be the same as the time for simple LMS estimation.
In the dual framework, we are given an arrangement of lines in the a-b plane. Each pair of lines determines a point where the two lines cross; we wish to find the median a-coordinate among these (n choose 2) crossing points.
This method is only robust to n(1-1/sqrt(2)) ~ 0.29n outliers rather than n/2 for the LMS estimator (proof: if there are n/sqrt(2) valid data points, they define roughly n^2/4 slopes, too many for the outliers to perturb the median). However, it can be computed more quickly: O(n log n) time by any of several algorithms [BC98, CSSS89, DMN92, KS93, M91b].
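A brute-force sketch of this estimator (quadratic time, unlike the algorithms cited; choosing the intercept as a median of residuals is one common convention, not forced by the definition):

    import numpy as np
    from itertools import combinations

    def median_slope_fit(x, y):
        """Slope = median of all pairwise slopes, i.e. the median a-coordinate
        of the dual crossing points; intercept = median of the residuals."""
        slopes = [(y[j] - y[i]) / (x[j] - x[i])
                  for i, j in combinations(range(len(x)), 2) if x[i] != x[j]]
        m = np.median(slopes)
        return m, np.median(y - m * x)

    x = np.arange(7.0)
    y = 2 * x + 1 + np.array([0.0, 0.1, -0.1, 0.0, 12.0, -8.0, 0.05])  # two outliers
    m, b = median_slope_fit(x, y)                  # still roughly y = 2x + 1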
I guess in higher dimensions one could apply any robust single point estimation technique (e.g. centerpoint or least median of squares) to the normal vectors to hyperplanes defined by d-tuples of points. But I don't know of any work on such generalizations.
The dual version of this is much easier to understand. The depth of a point in an arrangement is just the minimum number of lines (or hyperplanes) that you would need to cross to get from the point out to infinity. A good fit according to this measure is then just a point with high depth; that is, one that's enclosed by a large number of hyperplanes in every possible direction.
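Here is a quadratic-time sketch of computing this dual depth for lines in the a-b plane (a straightforward implementation, not one of the algorithms from the literature); it assumes the query point lies on none of the lines.

    import numpy as np

    def dual_depth(a, b, x, y):
        """Depth of the point (a, b) in the arrangement of the dual lines
        b = (-xi) a + yi: the minimum over all rays out of (a, b) of the
        number of lines the ray crosses.  The count can only change at
        directions parallel to one of the lines, so one test direction
        inside each angular interval between those critical directions suffices."""
        f = b + x * a - y                          # which side of each line (a, b) lies on
        ang = np.arctan2(-x, np.ones_like(x))      # direction parallel to line i
        crit = np.sort(np.mod(np.concatenate([ang, ang + np.pi]), 2 * np.pi))
        mids = (crit + np.roll(crit, -1)) / 2.0
        mids[-1] += np.pi                          # fix the interval that wraps around 2*pi
        best = len(x)
        for t in mids:
            # the ray in direction t crosses line i exactly when f_i and
            # sin(t) + xi*cos(t) have opposite signs
            best = min(best, int(np.sum(f * (np.sin(t) + x * np.cos(t)) < 0)))
        return best

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])        # made-up data, roughly y = 2x + 1
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
    depth = dual_depth(2.0, 1.0, x, y)             # the good fit (2, 1) has depth 2 here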
In one dimension, a hyperplane is just a point, a vertical hyperplane is just a point at infinity, and so the median of a point set gives optimal depth n/2. In two dimensions, a very simple construction, the "catline" [HR98] has depth n/3. This is optimal, since one can find sets of points for which no line is better than n/3. It turns out that any point set in any dimension has a hyperplane of depth at least n/(d+1), exactly matching the quality bounds known for centerpoints [ABET98]. However the proof is nonconstructive.
Just as with the centerpoint, you probably want the deepest possible plane, not just any deep plane. This can be found efficiently in the plane [KMRSSS99] and less efficiently in higher dimensions. One can also approximate the deepest plane in linear time in any dimension using geometric sampling techniques [SW98].
Because (the primal formulation of) this estimator depends on a definition of a vertical direction, it makes sense for the dependent or dual-independent noise models but not for independent noise. It is invariant under affine transformations among only independent variables, arbitrary affine transformations of the dual space, or more generally any continuous deformation of space that doesn't ever flatten a tetrahedron (produce d+1 coplanar points) or make a vertical triangle (d points forming a vertical plane).
David Eppstein,
Theory Group,
Dept. Information & Computer Science,
UC Irvine.
Last update: