python - What's wrong with my PCA?
my code:
    from numpy import *

    def pca(orig_data):
        data = array(orig_data)
        data = (data - data.mean(axis=0)) / data.std(axis=0)
        u, s, v = linalg.svd(data)
        print s  # should be s**2 instead!
        print v

    def load_iris(path):
        lines = []
        with open(path) as input_file:
            lines = input_file.readlines()
        data = []
        for line in lines:
            cur_line = line.rstrip().split(',')
            cur_line = cur_line[:-1]
            cur_line = [float(elem) for elem in cur_line]
            data.append(array(cur_line))
        return array(data)

    if __name__ == '__main__':
        data = load_iris('iris.data')
        pca(data)
The iris dataset: http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
output:
    [ 20.89551896  11.75513248   4.7013819    1.75816839]
    [[ 0.52237162 -0.26335492  0.58125401  0.56561105]
     [-0.37231836 -0.92555649 -0.02109478 -0.06541577]
     [ 0.72101681 -0.24203288 -0.14089226 -0.6338014 ]
     [ 0.26199559 -0.12413481 -0.80115427  0.52354627]]
desired output:
eigenvalues - [2.9108 0.9212 0.1474 0.0206]
principal components - I got the same values, just transposed, so that part is okay I guess.
Also, what's the deal with the output of the linalg.eig function? According to the PCA description on Wikipedia, I'm supposed to do this:
    cov_mat = cov(orig_data)
    val, vec = linalg.eig(cov_mat)
    print val
but it doesn't match the output in the tutorials I found online. Plus, if I have 4 dimensions, I thought I should get 4 eigenvalues, not the 150 that eig gives me. What am I doing wrong?
Edit: I've noticed the values differ by a factor of 150, which is the number of elements in the dataset. Also, the eigenvalues are supposed to add up to the number of dimensions, in this case 4. What I don't understand is why this difference is happening. If I divide the eigenvalues by len(data) I get the result I want, but I don't understand why. Either way, the proportion of the eigenvalues isn't altered, but they are important to me and I'd like to understand what's going on.
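Here is a minimal sketch of the check I mean (it reuses load_iris from my code above, and the division by len(data) is just my own guess, not something from a tutorial):

    data = load_iris('iris.data')                          # 150 x 4
    data = (data - data.mean(axis=0)) / data.std(axis=0)   # same standardization as in pca()
    u, s, v = linalg.svd(data)
    print(s**2 / len(data))   # roughly [2.9108 0.9212 0.1474 0.0206], the values I want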
You decomposed the wrong matrix.
Principal component analysis requires manipulating the eigenvectors/eigenvalues of the covariance matrix, not of the data itself. The covariance matrix, created from an m x n data matrix (m observations, n features), is an n x n matrix; because your data are standardized, it is in fact the correlation matrix, which has ones along the main diagonal.
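A quick sketch of those shapes, using made-up random data rather than your iris data:

    import numpy as np

    X = np.random.rand(150, 4)              # 150 observations, 4 features
    C = np.cov(X, rowvar=0)                 # 4 x 4 covariance matrix of the features
    R = np.corrcoef(X, rowvar=0)            # 4 x 4 correlation matrix, ones on the diagonal
    print(C.shape, R.shape)                 # (4, 4) (4, 4)
    print(np.allclose(np.diag(R), 1.0))     # True
    # note: np.cov(X) with the default rowvar=1 treats each row as a variable and
    # returns a 150 x 150 matrix; that is where your 150 eigenvalues came from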
You can indeed use the cov function, but you'll need further manipulation of your data. It's a little easier to use a similar function, corrcoef:
    import numpy as np
    import numpy.linalg as la

    # simulated data set: 8 data points, each point having 5 features
    data = np.random.randint(0, 10, 40).reshape(8, 5)

    # it's a good idea to mean-center the data first
    # (use a float copy so the subtraction doesn't truncate the integers)
    data = data - np.mean(data, axis=0)

    # calculate the covariance matrix
    C = np.corrcoef(data, rowvar=0)
    # returns a square feature-by-feature matrix, here 5 x 5

    # now get the eigenvalues/eigenvectors of C
    eval, evec = la.eig(C)
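From here, one possible way to finish the PCA (my own continuation of the snippet, not part of the original) is to sort the eigenpairs and project the data onto the leading eigenvectors:

    # sort the eigenpairs from largest to smallest eigenvalue
    order = np.argsort(eval)[::-1]
    eval, evec = eval[order], evec[:, order]

    # since C came from corrcoef, scale the (already centered) columns by their
    # standard deviation before projecting onto the first two components
    z = data / np.std(data, axis=0)
    scores = np.dot(z, evec[:, :2])
    print(scores.shape)            # (8, 2)

    # fraction of variance explained by each component
    print(eval / eval.sum())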
To get the eigenvectors/eigenvalues, I did not decompose the covariance matrix using SVD, though you certainly can. My preference is to calculate them using eig in NumPy's (or SciPy's) linalg module; it is a little easier to work with than svd, because the return values are the eigenvectors and eigenvalues themselves, and nothing else. By contrast, as you know, svd doesn't return these directly.
Granted, the svd function will decompose any matrix, not just square ones (to which the eig function is limited); but when doing PCA, you'll always have a square matrix to decompose, regardless of the form your data is in. That's because the matrix you decompose in PCA is the covariance matrix, which is square by definition: its rows and columns both correspond to the variables (features) of the original data matrix, and each cell is the covariance of two of them. The ones down the main diagonal of the correlation matrix above simply reflect that a variable is perfectly correlated with itself.
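If you do prefer the SVD route, here is a small sketch of the equivalence, reusing the simulated data from above: the squared singular values of the mean-centered data, divided by n - 1, are the eigenvalues of its covariance matrix. This is also essentially the factor of len(data) you noticed in your edit, since your code standardizes the data before the SVD.

    # SVD route on the mean-centered data from the snippet above
    u, s, vt = la.svd(data, full_matrices=False)

    # eigenvalues of the covariance matrix (np.cov divides by n - 1 by default)
    cov_eval, cov_evec = la.eig(np.cov(data, rowvar=0))

    n = data.shape[0]
    print(np.allclose(np.sort(s**2 / (n - 1)), np.sort(cov_eval.real)))   # True
    # the rows of vt are the principal directions (eigenvectors of the covariance matrix)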