# Continuing Adventures in Machine Learning

In the last post, I wrote about calculating the cost of linear regression learning models combined with using gradient descent to find the minimized cost.

Quick review of the key equations.

• Hypothesis: $h_\theta(x) = \theta_0 + \theta_{1}x$
• Parameters: $\theta_0, \theta_1$
• Cost Function: $J(\theta_0,\theta_1) = \frac{1}{2m} \sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$
• Goal: $\underset{\rm \theta_0,\theta_1}{\rm minimize}$ $J(\theta_0, \theta_1)$

With these tools, we can perform a gradient descent, an optimization algorithm designed to find $\underset{\rm \theta_0,\theta_1}{\rm minimize}$ $J(\theta_0, \theta_1)$.

The algorithm is described as follows:

where $\alpha$ is the “learning rate” and determines the magnitude of gradient steps per iteration. If $\alpha$ is too small, gradient descent can be slow, while if $\alpha$ is too large, gradient descent can overshoot the minimum and fail to converge or diverge.

One of the most powerful mathematical tools to apply against this problem are matrix and vector (a 1-dimensional matrix) operations. For example, we had the generic form of a hypothesis:

For illustrative purposes, we’ll make that a concrete hypothesis as follows:

Recall the sample data: We can represent the variable (house size) as a 7x2 matrix and the hypothesis as a 2x1 vector:

We’ve added the first column of 1 values to the matrix because we are going to do some matrix multiplication and we want to ensure that we’re able to perform the calculation (the inner dimensions must match) and to ensure that we don’t alter the hypothesis parameter. Let’s do the multiplication now:

The result is a 7x1 matrix. So we’ve just calculated the entire set of input values against the hypothesis in a single matrix operation. In a rudimentary programming route, this would’ve been implemented as an iterative loop. Using well-designed linear algebra libraries in various languages for matrix operations will be far more efficient than a straight loop. This is all well and good, but having a single feature in a real-world hypothesis is unrealistic. So, let’s move on to linear regression with multiple features, otherwise known as multivariate linear regression.

Let’s take our previous housing data and add a few additional features: From left to right, the corresponding variables would be $x_1$, $x_2$, $x_3$, $x_4$ and $y$. We use $m$ to denote the number of training sets. Other definitions:

• $n$ = number of features (e.g. $x_1$ - $x_4$)
• $x^{(i)}$ = input (features) of $i^{th}$ training example
• $x^{(i)}_j$ = value of features $j$ in $i^{th}$ training example

Though our previous hypothesis equation was defined as $h_\theta(x) = \theta_0 + \theta_{1}x$, with more than one feature, we update the equation as follows:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n$, and for convenience of notation, define $x_0 = 1$

Which means that we can define two matrices as follows:

In this current form of nx1 matrices, we can’t calculate the $h_\theta(x)$ (remember, the inner dimensions, in this case 1, must match). The solution is as simple as doing a matrix transpose and then multiplying.

So, this:

can be rewritten as:

and we would end up with an nx1 result vector. I thought this was one of the coolest things—being able to perform what would be an iterative calculation and do it in one go with a single matrix operation. This was where school-taught theory (which seemed to have no basis in practical work) actually made perfect sense for the problem at hand.

The equation for gradient descent for multiple features is:

As you can imagine, however, given the varying scales in the universe of possible features for a given hypothesis (e.g. sqft, # of bedrooms, floors, etc.) plotting out the results will result in skewed shapes which will result in inefficient walking of gradients to local/global minima. So there are a couple techniques to make these more efficient:

• feature scaling: as the name implies, we perform some mean normalization on the inputs to make sure they are all on the same scale where $% $, etc. For example, for each given input, e.g. $x_i=\frac{x_i-\mu_i}{max(x)}$
• adjustment of learning rate: this is the $\alpha$ multiplicand in the gradient descent equation above. This will increase/decrease the rate at which we reach convergence. For sufficiently small $\alpha$, $J(\theta)$ should decrease on every iteration. If $\alpha$ is too small, gradient descent can be slow to converge, requiring a very large number of iterations to obtain the solution, or if $\alpha$ is too large, $J(\theta)$ may not decrease on every iteration and, furthermore, may never converge. Rule of thumb for selecting $\alpha$ is to start from 0.0001 and increase by factors of 3: 0.0001, 0.0003, 0.01, 0.03, 01., 0.3, 1,…

That’s enough for today. Next time, we’ll go over polynomial regression for more sophisticated, non-linear hypotheses.