python - What's wrong with my PCA?
my code:
    from numpy import *

    def pca(orig_data):
        data = array(orig_data)
        data = (data - data.mean(axis=0)) / data.std(axis=0)
        u, s, v = linalg.svd(data)
        print s  # should be s**2 instead!
        print v

    def load_iris(path):
        lines = []
        with open(path) as input_file:
            lines = input_file.readlines()
        data = []
        for line in lines:
            cur_line = line.rstrip().split(',')
            cur_line = cur_line[:-1]
            cur_line = [float(elem) for elem in cur_line]
            data.append(array(cur_line))
        return array(data)

    if __name__ == '__main__':
        data = load_iris('iris.data')
        pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
output:
    [ 20.89551896  11.75513248   4.7013819    1.75816839]
    [[ 0.52237162 -0.26335492  0.58125401  0.56561105]
     [-0.37231836 -0.92555649 -0.02109478 -0.06541577]
     [ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
     [ 0.26199559 -0.12413481 -0.80115427  0.52354627]]
desired output:
eigenvalues - [2.9108 0.9212 0.1474 0.0206]
principal components - I got the same values, just transposed, so that part is okay I guess.
Also, what's the deal with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
    cov_mat = cov(orig_data)
    val, vec = linalg.eig(cov_mat)
    print val
but it doesn't match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should get 4 eigenvalues, not the 150 that eig gives me. What am I doing wrong?
Edit: I've noticed the values differ by a factor of 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add up to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I divide the eigenvalues by len(data) I get the result I want, but I don't understand why. Either way, the proportion of the eigenvalues isn't altered, but they are important to me and I'd like to understand what's going on.
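Here is a minimal sketch of the check I mean (it reuses load_iris from my code above, and the division by len(data) is just my own guess, not something from a tutorial):

    data = load_iris('iris.data')                          # 150 x 4
    data = (data - data.mean(axis=0)) / data.std(axis=0)   # same standardization as in pca()
    u, s, v = linalg.svd(data)
    print(s**2 / len(data))   # roughly [2.9108 0.9212 0.1474 0.0206], the values I want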
You decomposed the wrong matrix.
Principal component analysis requires manipulating the eigenvectors/eigenvalues of the covariance matrix, not of the data itself. The covariance matrix, created from an m x n data matrix (m observations, n features), is an n x n matrix; because your data are standardized, it is in fact the correlation matrix, which has ones along the main diagonal.
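A quick sketch of those shapes, using made-up random data rather than your iris data:

    import numpy as np

    X = np.random.rand(150, 4)              # 150 observations, 4 features
    C = np.cov(X, rowvar=0)                 # 4 x 4 covariance matrix of the features
    R = np.corrcoef(X, rowvar=0)            # 4 x 4 correlation matrix, ones on the diagonal
    print(C.shape, R.shape)                 # (4, 4) (4, 4)
    print(np.allclose(np.diag(R), 1.0))     # True
    # note: np.cov(X) with the default rowvar=1 treats each row as a variable and
    # returns a 150 x 150 matrix; that is where your 150 eigenvalues came from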
You can indeed use the cov function, but you'll need further manipulation of your data. It's a little easier to use a similar function, corrcoef:
    import numpy as np
    import numpy.linalg as la

    # simulated data set: 8 data points, each point having 5 features
    data = np.random.randint(0, 10, 40).reshape(8, 5)

    # it's a good idea to mean-center the data first
    # (use a float copy so the subtraction doesn't truncate the integers)
    data = data - np.mean(data, axis=0)

    # calculate the covariance matrix
    C = np.corrcoef(data, rowvar=0)
    # returns a square feature-by-feature matrix, here 5 x 5

    # now get the eigenvalues/eigenvectors of C
    eval, evec = la.eig(C)
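From here, one possible way to finish the PCA (my own continuation of the snippet, not part of the original) is to sort the eigenpairs and project the data onto the leading eigenvectors:

    # sort the eigenpairs from largest to smallest eigenvalue
    order = np.argsort(eval)[::-1]
    eval, evec = eval[order], evec[:, order]

    # since C came from corrcoef, scale the (already centered) columns by their
    # standard deviation before projecting onto the first two components
    z = data / np.std(data, axis=0)
    scores = np.dot(z, evec[:, :2])
    print(scores.shape)            # (8, 2)

    # fraction of variance explained by each component
    print(eval / eval.sum())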
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) linalg module; it is a little easier to work with than svd, because the return values are the eigenvectors and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these directly.
Granted, the svd function will decompose any matrix, not just square ones (to which the eig function is limited); but when doing PCA, you'll always have a square matrix to decompose, regardless of the form your data is in. That's because the matrix you decompose in PCA is the covariance matrix, which is square by definition: its rows and columns both correspond to the variables (features) of the original data matrix, and each cell is the covariance of two of them. The ones down the main diagonal of the correlation matrix above simply reflect that a variable is perfectly correlated with itself.
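If you do prefer the SVD route, here is a small sketch of the equivalence, reusing the simulated data from above: the squared singular values of the mean-centered data, divided by n - 1, are the eigenvalues of its covariance matrix. This is also essentially the factor of len(data) you noticed in your edit, since your code standardizes the data before the SVD.

    # SVD route on the mean-centered data from the snippet above
    u, s, vt = la.svd(data, full_matrices=False)

    # eigenvalues of the covariance matrix (np.cov divides by n - 1 by default)
    cov_eval, cov_evec = la.eig(np.cov(data, rowvar=0))

    n = data.shape[0]
    print(np.allclose(np.sort(s**2 / (n - 1)), np.sort(cov_eval.real)))   # True
    # the rows of vt are the principal directions (eigenvectors of the covariance matrix)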