Thursday, March 29, 2007

Data Fitting and Polynomial Interpolation

Today Professor Overton was out of town, so the acting chair of the CS Department, Professor Marsha Berger, handled the lecture. It was a good experience to have a completely different professor teach a session of a class this far into the semester; it helps me disconnect the information from the presentation a bit.

We covered issues surrounding trying to find a function, typically a polynomial, which interpolates some set of data points. This is a very basic problem, and there are lots of effective strategies in various disciplines for approaching it. Statistics, for example, uses linear regression on R^2 data to find the line which minimizes the sum of the squared errors over all points. We have also covered several ways of doing this sort of work in Machine Learning this semester.

We took a very general approach to the problem in class today. The core idea is that, for n+1 data points, you can find a degree-n polynomial which interpolates those points using n+1 coefficients. As an aside, this strategy can only go so far. Very high-order polynomials (starting around degree 10) behave a bit counter-intuitively, and end up not being very useful. In any case, it's more often than not possible to find a good low-order fit for enormous datasets. Linear regression is an excellent example of this.

Our basic example was a set of two data points in R^2: (1,1) and (2,3). In order to do interpolation, you need one equation per data point. Explicitly, you need to find the coefficients {Cj} such that p(Xi) = Yi for i = 0..n. So for our example we need to fit the polynomial p1(X) = C0 + C1X to our two data points:

C0 + C1(1) = 1
C0 + C1(2) = 3

This is obviously a linear system, and as such can be represented by a matrix:

|1 1||C0|=|1|
|1 2||C1| |3|

This is (trivially) solved to give C0 = -1 and C1 = 2, so the polynomial p(X) = 2X - 1 is our interpolation of the test data. We really have no way of knowing, though, whether a line is the proper way to represent the underlying distribution of the data. It might be that a quadratic or a cubic or some other higher-order polynomial is a better fit. We can't know until we have more data points to evaluate, and it turns out that there's no good algorithmic way to determine what is proper even when that data arrives. It's basically a qualitative judgment.
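
Just to make this concrete, here's the same solve done in numpy (my choice of tool, not something from the lecture):

    import numpy as np

    # Interpolation conditions for the points (1,1) and (2,3): each row is [1, x_i]
    A = np.array([[1.0, 1.0],
                  [1.0, 2.0]])
    y = np.array([1.0, 3.0])

    c = np.linalg.solve(A, y)   # coefficients [C0, C1]
    print(c)                    # -> [-1.  2.], i.e. p(X) = -1 + 2X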

The general case of this technique generates a square matrix of size n+1, with 1 in the first column (what I wouldn't give for a fixed-width font here):

|1 X0^1 X0^2 ... X0^n||C0| |Y0|
|1 X1^1 X1^2 ... X1^n||C1|=|Y1|
|... ... ... ... ... ||..| |..|
|1 Xn^1 Xn^2 ... Xn^n||Cn| |Yn|

This is called a Vandermonde Matrix. It's important to note that it would be singular if any two of the X values were identical, so those values must all be distinct. If it turns out that coincident points matter, one way to keep the matrix from becoming singular is to add information about the derivative at each point into the matrix. This would make something like the two points at the self-intersection of a loop distinct, since they'd have different slopes. Augmenting the Vandermonde matrix with this derivative information gives what's called a Hermite matrix (this is the setup for Hermite interpolation).
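
As a sketch of the general case (again in numpy, and again my own example rather than anything shown in class), you can build the Vandermonde matrix for a handful of distinct X values and solve for the coefficients in one shot:

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0])   # n+1 distinct sample points
    y = np.array([1.0, 2.0, 0.0, 5.0])   # values to interpolate

    V = np.vander(x, increasing=True)    # rows are [1, x_i, x_i^2, ..., x_i^n]
    c = np.linalg.solve(V, y)            # coefficients C0..Cn

    print(np.allclose(V @ c, y))         # sanity check: the polynomial hits every point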

I mentioned earlier that there's no way to know which polynomial best interpolates a given set of data. It does turn out to be easy to show that the interpolating polynomial of degree at most n is unique: any two polynomials of that degree which interpolate the same data are actually the same polynomial, just possibly written in different forms.

Suppose there exist two interpolating polynomials P(X) and Q(X) for some data, both of degree at most n. Their difference is then another polynomial D(X) of degree at most n:

D(X) = P(X) - Q(X)

Since P(Xi) = Q(Xi) = Yi, D(Xi) must equal 0 for i = 0..n. At this point my notes break down a bit, but I think the missing step is this: a nonzero polynomial of degree at most n can have at most n roots, and D vanishes at n+1 distinct points, so D must be identically zero. That forces P(X) = Q(X) everywhere, which is the uniqueness claim.

In any case, the fact that the interpolating polynomial is unique (even though it can be written in different forms, with different-looking coefficients) is useful primarily when you want to shift or scale the data away from very large or very small magnitudes to reduce computational error.

The last thing I wanted to note is that these polynomials end up looking a lot like Taylor series. In fact, a Taylor polynomial is just an extreme, degenerate case of this kind of interpolation in which all of the information (the value and the 1st, 2nd, ..., nth derivatives of P) comes from a single point.

Sometimes you just have to step up.

It's been almost two months since I came back to training at the temple, and I feel as good about my physical condition as I did after six months of training the first time. Part of it is certainly that I retained some flexibility over my hiatus, but I don't think that can be all of it. I'm doing a few things differently this time around:

  1. I always take a *cold* shower as soon as I get home from practice. This was a suggestion that came from Runner's World, and the idea is that the cold water causes the muscles to contract and squeeze out some of the lactic acid so you're less sore the next day. This obviously feels better, but it also helps you recover faster. I've cheated a couple of times and skipped the cold shower and definitely felt a difference, so I think there's real benefit here.
  2. I quit drinking on evenings after training. I know that alcohol has an effect on muscle recovery, but I've been surprised at how much better I feel in the morning after just coming home and eating a little something, drinking a bunch of water, and going to bed.
  3. I stopped messing around doing little attempts at techniques during the day. I've gotten pretty serious about stretching out and not doing anything if my muscles are cold. I've only had one minor pull this time around (nothing like the major pop I had in my hamstring the first time, though there was a little bruising), and I think that's partially because I'm not straining cold muscles in between practices.
  4. I'm paying a *lot* more attention to proper technique. If something hurts, I assume I'm doing it wrong and try to watch people to see what detail I might be missing. The best example here is the pain I used to get in my right groin and lower back from the downward kick in caijiow. I used to get horrible pain, and could feel my lower back grinding every time I brought my leg down. After one particularly intense practice where Sifu made me understand what it means to "pop," these pains have simply not come back. I've gotten benefits along these lines in several other basic techniques, which is awesome.

All of this work came together tonight with me feeling strong, flexible, and confident enough to just step up and do Chuji Quantao. I had completed the form when I trained before, but that was several years ago, and this time I've only been taught the first three moves. I've been watching closely during practice, though, and really felt like I could get through it.

Sifu was briefly at the Temple tonight, but Josh ran practice. I've been to one of Josh's practices before, and I really like the way he focuses on teaching details like foot placement and wrist position. After stretching, the group of us working on basic techniques went up through Erqjiau, learning a lot in the process, at which point Josh asked who knew Chuji Quantao. I said that I knew the first three movements, and a couple of other people said the same thing. Two people had been taught the whole thing and will be testing, so when he divided us off and asked if there was anyone else who had been taught the form, I decided to just go for it and walked over.

Josh had me do the form first, which was unsettling, but I took a deep breath and made a strong start with what I knew. I made it through the entire form, even making some of the subtle balance points correctly, though overall it was horribly sloppy. After the other two students had gone through it as well, he said that we all basically knew the form but needed to work on technique. He went over some transitions and pointed out places where we need to be able to pause and keep our balance, overall spending about 10 minutes working with us.

I thought I might explode with excitement when practice was over. We did pushups, situps, and back-extensions, which Sifu doesn't seem to do anymore, and that was a nice blast from the past. At the end of practice, having just completed the first form for the first time, I was honestly on fire inside; really just a fantastic feeling of accomplishment and pride.

Wednesday, March 28, 2007

Principal Component Analysis

Monday's lecture on K-Means Clustering and today's on Principal Component Analysis cover algorithms which are used to achieve dimensionality reduction in a data set. This sort of reduction is mainly useful for compression, the applications for which are easy to see, and for feature extraction. Professor LeCun talked about a project in his lab which uses dimensionality expansion in order to try and identify multiple voices in a piece of orchestral or jazz music, but we didn't talk about any actual mechanics.

Principal Component Analysis attempts to reduce the number of dimensions needed to represent the data by finding positional correlation between two or more of its dimensions. Think about it like this: With two-dimensional data, if all of the points lie on a straight line then their position in one direction is completely correlated with their position in the other, and you only have to store one of those positions in order to record each point with no error. In reality, even with very strong correlation each point will lie to one or another side of the line, which is what we call the principal component, creating loss. This is much easier to visualize in two dimensions than it is in ten, but the same basic idea applies as long as you keep in mind that there might be more than one principal component.

In order to find the principal components we need to compute a covariance matrix for the data. Remember that stronger correlation means better compression, since it means less variance in the directions orthogonal to the principal component. With this in mind, the covariance matrix A for p (mean-centered) data points x^(k) = (x1^(k), x2^(k)) in R^2 is generated with the following sums:

a11 = (1/p) {sum from k=1 to p} (x1^(k))^2
a12 = a21 = (1/p) {sum from k=1 to p} x1^(k) * x2^(k)
a22 = (1/p) {sum from k=1 to p} (x2^(k))^2

The off-diagonal terms a12 and a21 indicate correlation. With no correlation at all, as with independent random data, these terms will be zero.

The question remains: what use is this covariance matrix? The answer is that it allows us to compute the projection of each data vector onto the principal components, which in turn allows us to discard the components lying in the orthogonal directions. We do this by computing two new matrices: one containing the eigenvalues of A along the diagonal (typically called Lambda, though here I'll call it L), and the matrix Q whose columns are the normalized and mutually orthogonal eigenvectors of A. Q is an orthogonal (rotation) matrix, and it can readily be seen that A = QLQ'.

Now that we have these matrices we can construct a rectangular matrix Qk made up of the k most important eigenvectors in Q, which are the principal components of A. The definition of "most important" in this case is situational, but often there are a few eigenvalues which are of a much larger magnitude than the rest, and these values' eigenvectors make up Qk.

Multiplying a vector by Qk (with the chosen eigenvectors laid out as rows) gives the projections of the vector onto the principal eigenvectors of A. It's now easy and inexpensive to compute the k PCA features of any vector in the original space by simply performing this multiplication. This efficiency has a price, however, in the computation of Q and L: computing the eigenvalues and eigenvectors of a matrix is roughly O(n^3) in the number of dimensions, so there is an upper bound on the dimensionality of the data sets for which PCA is practical.
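
Here's a minimal PCA sketch in numpy, assuming the data points are stored as the rows of X; the function and variable names (and the toy data) are mine, not from the lecture:

    import numpy as np

    def pca_project(X, k):
        """Project the rows of X onto the top-k principal components."""
        Xc = X - X.mean(axis=0)                # mean-center the data
        A = (Xc.T @ Xc) / Xc.shape[0]          # covariance matrix
        eigvals, Q = np.linalg.eigh(A)         # symmetric eigendecomposition
        order = np.argsort(eigvals)[::-1]      # sort eigenvalues, largest first
        Qk = Q[:, order[:k]]                   # top-k eigenvectors as columns
        return Xc @ Qk                         # k PCA features per data point

    # toy usage: 100 points in R^3 that mostly vary along one direction
    rng = np.random.default_rng(1)
    t = rng.normal(size=(100, 1))
    X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.05 * rng.normal(size=(100, 3))
    print(pca_project(X, 1).shape)             # -> (100, 1)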

Debugging Memory Problems

Working in a language like C or C++ exposes you to all the glory and misery of memory management. The introductory programming classes at NYU are now taught in Java, and from talking to people I get the impression that most high school programming classes are as well. It's interesting to me that people can be in their third or fourth year of a computer science degree without ever having seen what happens when you double-free a pointer, but clearly this is now the majority case.

We talked about the fundamental problem of memory management, and the advantages manual management has over using a garbage-collected language and VM environment like Java (or Perl, or Python, or Ruby, or...); with the power available on modern machines those advantages are honestly fairly thin. In most cases the only reasons to give up the benefits of automatic memory management involve performance or resource constraints in embedded or similarly restrictive environments.

That said, we went over a few historical ways of finding memory leaks, and the conceptual underpinnings of automatic code annotation. In particular we talked about strategies for recording and monitoring the state of various regions of memory, and implementation details in C++ for doing this sort of thing yourself. When it comes down to it, though, the right thing to do is use valgrind. I'd have more to say, but the lecture notes are returning 404s so the details aren't available to me to go over.

Tuesday, March 27, 2007

The first three movements of Chuji Quantao

Today was a good practice overall. I'm finally (after almost two months) past what I hope are the majority of my not-yet-flexible-enough injuries, and generally feel strong. I had a little bit of a tweak in my right quad left over from the excellent instruction I had last week in pubu chuanzhang lunbi zaquan, but it didn't really hinder me.

I've started doing wall stretch with people, and the results after just a couple of attempts are noticeable. I really like the extra work on the muscles along my ribs, and I feel like I'm getting extra range in my groin without hurting myself like I was when I was doing it on my own.

After the break I made it through all of the basic forms under Sifu's watchful eye, and was pulled aside to learn chuji quantao. Unfortunately, we only had time to go over the first three movements, which has so far happened four times. The benefit is that my form on these opening three movements is very good, but I admit to hoping that the next time around I can make it into the rest of the form.

Eigenvalues, Matrix Properties, and the Singular Value Decomposition

It's interesting to me that my Numerical Computing class is in some ways an advanced Linear Algebra class. It's appropriate, though, since so many continuous problems are really only solvable on computers if represented as linear systems. Today was an extended discussion of the properties of various types of matrices, and an introduction to the Singular Value Decomposition of a matrix.

Some properties of various types of matrices:
  • A general (non-symmetric) real matrix can have complex eigenvalues; when it does, those eigenvalues occur in conjugate pairs.

  • A symmetric matrix, on the other hand, always has all real eigenvalues. If all of those eigenvalues are also positive, the matrix is called Symmetric Positive Definite.

  • Symmetric matrices also have the interesting property that their eigenvectors can be chosen to be mutually orthogonal.

  • A stochastic matrix is one in which each column consists of nonnegative real values which sum to 1. Note that because of this property each column vector can also be interpreted as a probability vector. The largest eigenvalue (in magnitude) of a stochastic matrix is always exactly 1; see the quick check just after this list.
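
A quick numerical sanity check of that last property (my own toy example, not from the lecture): build a random column-stochastic matrix and look at the magnitude of its largest eigenvalue.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((5, 5))
    A /= A.sum(axis=0)                 # normalize each column to sum to 1

    eigvals = np.linalg.eigvals(A)
    print(max(abs(eigvals)))           # -> 1.0 (up to rounding)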


The Singular Value Decomposition

Any matrix A can be represented in the following way:

A = U S V^T

U and V both have the property that multiplying them by their own transpose yields the identity (they are orthogonal matrices), while S (which is actually represented with a capital Sigma) is zero everywhere except the diagonal, which holds the singular values in decreasing order: s1 >= s2 >= ... >= sn >= 0, with sn > 0 if A is non-singular.

This is a little difficult to represent with the text tools available here, but this means that A can be represented in the following way:

A = {sum from i=1 to n} si * ui * vi^T

This essentially says that A can be constructed by multiplying each corresponding pair of vectors from U and V by their singular value si and adding the resulting rank-one matrices together. If you cut the summation off after k terms, the resulting matrix, call it Bk, approximates A and has rank k. It can also be proven that Bk is the closest possible matrix to A with rank k, meaning that it minimizes ||A - B|| over all matrices B of rank k.
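
A small numpy sketch of this truncated sum (my own example): build a random matrix, form the rank-k partial sums, and watch the approximation error shrink as k grows.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.random((6, 6))

    U, s, Vt = np.linalg.svd(A)            # s holds the singular values, descending

    def rank_k_approx(k):
        # B_k = sum_{i=1..k} s_i * u_i * v_i^T
        return (U[:, :k] * s[:k]) @ Vt[:k, :]

    for k in (1, 3, 6):
        err = np.linalg.norm(A - rank_k_approx(k), 2)
        print(k, err)                      # error drops to ~0 at full rank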

Monday, March 26, 2007

K-Means Clustering

We talked today about unsupervised learning, with some time spent on dimensionality reduction and principal component analysis, but the most interesting subtopic was k-means clustering.

Suppose you have some data set organized into clusters. The first and most apparent problem is that you can't know ahead of time how many clusters you have; this means you can't know how many categories new data will fall into. The basic idea of k-means clustering is that you can algorithmically derive a location (called a prototype) for each cluster which minimizes the sum of the squared distances to the points in that cluster.

The procedure is fairly straightforward:

1. You need to initialize some number of prototypes (Mk). In general this is done by randomly choosing some number of points from the training data.
2. For each Mk, compute the set of all points in the data for which the squared norm of the distance to Mk is lower than to any other prototype. This produces a number of sets equal to the number of initial Mk, each of which is the 'cluster' associated with that prototype.
3. Re-compute the location of each Mk to be the position which minimizes the sum of the squared distances to all of the points in its matching set Sk (for squared Euclidean distance this is just the mean of Sk).
4. Iterate 2 and 3 until convergence.

Once this algorithm has converged, classification of new data is extremely cheap: any new data point belongs to the category of the closest prototype, with error equal to the squared distance between the two.
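
Here's a bare-bones numpy sketch of the procedure described above; the names and the choice of squared Euclidean distance are mine.

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        M = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random prototypes
        for _ in range(iters):
            # step 2: assign each point to the nearest prototype (squared distance)
            d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # step 3: move each prototype to the mean of its assigned points
            new_M = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else M[j]
                              for j in range(k)])
            if np.allclose(new_M, M):                      # step 4: stop at convergence
                break
            M = new_M
        return M, labels

    # classifying a new point: it belongs to the cluster of the closest prototype
    def classify(x, M):
        return ((M - x) ** 2).sum(axis=1).argmin()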

There are some great advantages to this scheme, first and foremost that it works with any distance measure and makes sense for data of any dimension. Another important advantage is that the algorithm is extremely straightforward to implement efficiently.

One major problem, though, needs to be considered. With poor initial prototype selection it's very possible to get stuck in extremely bad solutions. Consider data with some high number of clusters where initial prototype selection places several prototypes in one or more clusters and leaves several other clusters empty. Some clusters will go unrepresented in the final list of prototypes (their points getting matched to some other nearby prototype instead), while other clusters will be incorrectly split between two or more prototypes.

This can be partially compensated for by performing what's called hierarchical k-means: select exactly two prototypes and run normal k-means, then run k-means again with k=2 on each of the resulting sub-clusters, and so on. This procedure tends to avoid the problems of missed clusters and of multiple prototypes converging into the same cluster, but it can be significantly more expensive.

GDB for the win!

I was pleasantly surprised when it turned out that my Software Engineering professor was a fan of linux, gcc, and the rest of the GNU development toolchain, but I was a little worried about gdb. There was a time when I thought it was absolutely the king of the debugging heap, but then I watched excellent developers try to use it to debug multi-threaded C++ code; it was not a pleasant sight. I had gone largely sour on gdb in general until today's lecture, which presented a few macro techniques that combine with gdb features to produce fairly advanced behavior (particularly for a command-line debugger).

First, there are a few gdb features that I never really knew about:

fin - Finish the current function call (run until it returns).

cond n [expr] - Make breakpoint n conditional on expr.

jump line - Go to line and continue execution from there.

call function - Call the given function.

Big headaches like getting out of a function where you only cared about the behavior at the beginning, being able to step across functions without having to evaluate them, and so on go away with these features. Handy.

We also went over a few interesting gdb tricks. The best one:

You can combine ignore and run to step a program to the breakpoint hit immediately before it crashes. You set a breakpoint n at the place you want to stop, then ignore n with a very large count (more times than the breakpoint could possibly be hit). You run, and after the crash you can use info b to see how many times breakpoint n was actually hit. You can then restart, ignore n that count minus one, run again, and you're at the breakpoint right before your program crashes.
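
Here's roughly what that session looks like; the function name, breakpoint number, and hit counts below are made up for illustration.

    (gdb) break suspect_function
    (gdb) ignore 1 1000000
    (gdb) run
    ... program crashes ...
    (gdb) info breakpoints
    ... shows "breakpoint already hit 4023 times" ...
    (gdb) ignore 1 4022
    (gdb) run
    ... stops at the last hit of breakpoint 1 before the crash ...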

Introduction

I've got a whole lot going on. I'm a full-time sysadmin at a certain well-known internet company, I attend NYU full-time working on my bachelor's in Computer Science, and in an attempt to keep myself fit and sane I train at USA Shaolin Temple. I've found, particularly with my schoolwork, that I understand new material better if I talk through the day's information at least once.
I'll be summarizing the information presented in my classes, and keeping a training journal of my progress in shaolin. I will certainly find other things to drop in here and there as well, but my main hope is that this will help me keep my thoughts straight as I grind through my current (relatively insane) workload.