{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recitation notes for November 4th, 2016 by Eric Wong for 15-388/688 Practical Data Science"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import matplotlib\n",
"matplotlib.use('svg')\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('ggplot')\n",
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Hypothesis Testing\n",
"\n",
"Not a lot of new libraries here, in fact everything you need to know is already in the last recitation, and also already in the lecture notes.\n",
"\n",
"In additional to the Gaussian you'll need to work with a T distribution as well. These are all included under `scipy.stats`. Note that the T distribution requires a degree of freedom parameter. \n",
"\n",
"A reminder that you'll need to use the inverse CDF function; in `scipy.stats` this goes under the name `ppf`. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# From the last recitation\n",
"import scipy.stats as st\n",
"st.norm.ppf(0.5); # the value x such that P(X <= x) = 0.5 for X a standard normal random variable\n",
"st.t(4).ppf(0.5); # the value t such that P(X <= t) = 0.5 for t a T continuous random variable with 4 degrees of freedom"
]
},
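{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch (not part of any assignment), here is the standard way `ppf` gets used: building a two-sided 95% confidence interval for a sample mean. The sample `x` below is made-up data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"x = np.random.randn(20) + 5 # made-up sample\n",
"mean, se = x.mean(), x.std(ddof=1)/np.sqrt(len(x))\n",
"t = st.t(len(x)-1).ppf(0.975) # inverse CDF at 1 - alpha/2\n",
"(mean - t*se, mean + t*se) # 95% confidence interval for the mean"
]
},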
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pandas again\n",
"You'll need to do various SQL-like queries on your Pandas dataframes. If you are not familiar with it, you can do the following: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])\n",
"df[df[\"a\"] > 0] # gets the subset of the dataframe where the column a is greater than 0\n",
"df[(df[\"a\"] >= 1) & (df[\"b\"] < -2)] # gets subset of the dataframe where column a >= 1 and b < -2\n",
"df[\"a\"].unique(); # gets all unique values in column \"a\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pandas Dataframes are array-likes\n",
"So you can pass them to `numpy` like any other array. Pandas Series are just 1D numpy arrays. But you can always convert to a numpy array wrapper as soon as you don't need to the dataframe structure anymore. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"np.sum(df, axis=0) # sum along the 0th axis\n",
"np.multiply(df,df) # elementwise multiplication\n",
"df[[\"a\", \"c\"]].as_matrix(); # converts the dataframe to a numpy array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Numpy Tricks & Performance\n",
"Again, it's mostly the same numpy (with an initial Pandas to load from the CSV but after processing we immediately just work with numpy arrays). We're letting you use the greatly reduced dataset, so everything works on dense numpy operations. Of course, it is good practice to just write general code that doesn't rely on it. The solution doesn't.\n",
"\n",
"The more advanced algorithms take much longer to run, so its important to write them efficiently. Things like SVM and KMeans are quick and easy even if written inefficiently, but EM algorithms and matrix factorization can take significantly longer even if written well. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Brief aside from last recitation: `np.newaxis` is actually just a variable with value `None`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"None\n"
]
}
],
"source": [
"print np.newaxis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Iterating over matrices\n",
"\n",
"Numpy has a list-like indexing method. You don't need to always provide all indices. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"X = np.random.randn(6,10)\n",
"X[0] # 0th row\n",
"X.T[0] # 0th column\n",
"for Xi in X:\n",
" Xi # iterates over each row of X\n",
"for Xj in X.T:\n",
" Xj # iterates over each column of X\n",
"for i,Xi in enumerate(X):\n",
" i,Xi # iterates and enumerates each row of X, nicer than doing X[i] within the loop"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preallocation\n",
"Numpy arrays are not like lists. While lists are implemented as unbounded arrays and can be resized somewhat efficiently, numpy arrays are optimized for fixed shape operations. \n",
"\n",
"Pre-allocate any matrices that you'll use over and over again that don't change (like identity matrices), instead of recreating them every iteration. "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 loops, best of 3: 46.4 ms per loop\n"
]
}
],
"source": [
"%%timeit\n",
"# don't iteratively append rows to a matrix\n",
"n = 1000\n",
"X = np.ones(n)\n",
"for _ in range(200):\n",
" X = np.vstack([X, np.ones(n)])"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 1.3 ms per loop\n"
]
}
],
"source": [
"%%timeit\n",
"# Instead, preallocate and assign values\n",
"n = 1000\n",
"X = np.zeros((200,n))\n",
"for i in range(200):\n",
" X[i] = np.ones(n)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 loops, best of 3: 40.6 ms per loop\n",
"10 loops, best of 3: 96.4 ms per loop\n"
]
}
],
"source": [
"# preallocate matrices that don't change\n",
"n = 1000\n",
"X = np.random.randn(n,n)\n",
"D = np.eye(n)*1e-4\n",
"%timeit [X+D for _ in range(10)]\n",
"%timeit [X+np.eye(n)*1e-4 for _ in range(10)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Computing pairwise distances\n",
"Recall from Zico's slides that for KMeans, you need to compute all distances from points to cluster centers. We can compute them naively:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"n,m,k = 4,6,2\n",
"X = np.random.randn(n,k)\n",
"Y = np.random.randn(m,k)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 165 µs per loop\n"
]
}
],
"source": [
"%%timeit\n",
"D0 = np.zeros((n,m))\n",
"for i, Xi in enumerate(X):\n",
" for j, Yj in enumerate(Y):\n",
" D0[i,j] = ((Xi-Yj)**2).sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As is the theme with all matrix algebra, it is significantly more efficient to do it in matrix form. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 3513.63 times longer than the fastest. This could mean that an intermediate result is being cached \n",
"10000 loops, best of 3: 25 µs per loop\n"
]
}
],
"source": [
"%%timeit\n",
"D = (-2*X.dot(Y.T) + np.sum(X**2,axis=1)[:,None] + np.sum(Y**2, axis=1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Min/argmin\n",
"Use these functions judiciously. Never iterate through your matrix to find the smallest entry one. "
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"n,m = 10,6\n",
"X = np.random.randn(n,m)\n",
"X.argmin(axis=1)\n",
"X.min(axis=0);"
]
},
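{
"cell_type": "markdown",
"metadata": {},
"source": [
"To locate the single smallest entry of the whole matrix, combine `argmin` (which operates on the flattened array when no axis is given) with `np.unravel_index`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"i, j = np.unravel_index(X.argmin(), X.shape) # row and column of the smallest entry\n",
"X[i,j] == X.min()"
]
},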
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The log trick\n",
"You will need this to get well-conditioned numbers when computing probabilities. Suppose we have a vector of values $a$ and we need to compute $\\exp(a_i)/\\sum_{j}\\exp(a_j)$. Computing this directly when $a$ is a very large negative number can pose numerical problems, since everything is 0. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ nan, nan, nan, nan, nan])"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n=5\n",
"a = -np.ones(5)*(1e10)\n",
"np.exp(a)/np.exp(a).sum()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"-0.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"-np.exp(-1e10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, note that\n",
"$$\\frac{\\exp(a_i)}{\\sum_{j}\\exp(a_j)} = \\frac{\\exp(a_i)}{\\sum_{j}\\exp(a_j)}\\cdot \\frac{\\exp(-b)}{\\exp(-b)} = \\frac{\\exp(a_i-b)}{\\sum_{j}\\exp(a_j-b)}$$\n",
"holds for any number $b$. It is common to pick b = \\text{max}(a)$, so that at least one term in the resulting probability vector is 1. "
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.2, 0.2, 0.2, 0.2, 0.2])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n=5\n",
"a = -np.ones(5)*(1e10)\n",
"b = a.max()\n",
"np.exp(a-b)/np.exp(a-b).sum()"
]
},
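{
"cell_type": "markdown",
"metadata": {},
"source": [
"scipy also packages this trick as `logsumexp` (it lives under `scipy.misc` in older scipy versions and under `scipy.special` in newer ones), so the stable computation can equivalently be written as:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"try:\n",
"    from scipy.special import logsumexp # newer scipy\n",
"except ImportError:\n",
"    from scipy.misc import logsumexp # older scipy\n",
"np.exp(a - logsumexp(a)) # same stable probabilities as above"
]
},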
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extra Stuff\n",
"Here's a few extra things that may help you with your contests. \n",
"\n",
"### Decaying learning rate in gradient descent. \n",
"Recall that for gradient descent, we just apply a constant learning rate from a number of iterations. Consider the simple example $y = x^2$, and experiment with the learning rate: "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.09968510225e+139\n"
]
}
],
"source": [
"def y(x): \n",
" return x*x\n",
"def g(x): \n",
" return 0.5*x\n",
"lr = 50\n",
"niters= 100\n",
"x = 20\n",
"\n",
"for i in range(niters):\n",
" x -= g(x)*lr\n",
"print x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the size of the learning rate, this gradient descent may blow up. Furthermore, in some problems, if you run for too many iterations, a constant sized learning rate that is normally OK becomes too large. Thus, we introduce the idea of a decaying learning rate: every time our objective doesn't decrease, we reduce the learning rate by a factor of $\\beta$ until it does. In practice, $\\beta = 0.9$ works quite well. "
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.72928978033\n"
]
}
],
"source": [
"beta = 0.9\n",
"lr = 100\n",
"niters= 100\n",
"x = 20\n",
"for i in range(niters):\n",
" x0 = x\n",
" y0 = y(x)\n",
" x = x0 - g(x)*lr\n",
" while (y0 < y(x)):\n",
" lr *= beta\n",
" x = x0 - g(x)*lr\n",
"print x"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Earliy stopping conditions for gradient descent\n",
"By doing so, you can stop adjusting the number of iterations for each particular problem and just pick one that is large enough. If you want to remove even that condition, you can include an early stopping condition. For gradient descent, a simple condition is to just stop when the gradient is too small (since no progress will be made anyways): "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2.64884105353e-07 488\n"
]
}
],
"source": [
"beta = 0.9\n",
"lr = 50\n",
"niters= 1000\n",
"x = 20\n",
"for i in range(niters):\n",
" if g(x)*lr < 1e-8:\n",
" break\n",
" x0 = x\n",
" y0 = y(x)\n",
" x = x0 - g(x)*lr\n",
" while (y0 < y(x)):\n",
" lr *= beta\n",
" x = x0 - g(x)*lr\n",
"print x,i"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, you select `niters` to be as long as you are willing to wait, and if it terminates early, you save some time and don't need to tune as many parameters. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting combinations for median trick\n",
"If you decide to use RBF features you'll find that in order to use the median trick, you need to iterate over all pairs of centers. You can do this by using the `itertools.combinations` function:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(0, 1),\n",
" (0, 2),\n",
" (0, 3),\n",
" (0, 4),\n",
" (0, 5),\n",
" (1, 2),\n",
" (1, 3),\n",
" (1, 4),\n",
" (1, 5),\n",
" (2, 3),\n",
" (2, 4),\n",
" (2, 5),\n",
" (3, 4),\n",
" (3, 5),\n",
" (4, 5)]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import itertools\n",
"[c for c in itertools.combinations(range(6), 2)]"
]
},
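{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, assuming your centers are stored as rows of a matrix `C` (a made-up example here), the median pairwise distance, a common choice of RBF bandwidth, could be computed as:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"C = np.random.randn(6, 3) # hypothetical cluster centers, one per row\n",
"dists = [np.linalg.norm(C[i]-C[j]) for i,j in itertools.combinations(range(len(C)), 2)]\n",
"np.median(dists)"
]
},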
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [Root]",
"language": "python",
"name": "Python [Root]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}