About the Dataset
This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica. The data were collected by Anderson, Edgar (1935, The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5).
Exploratory Data Analysis
Before we begin clustering, let us visualize the dataset. The following are a few ways to do so.
The graph below shows that setosa can be linearly separated from the other two species, whereas versicolor and virginica overlap.
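One simple way to check this kind of claim numerically is to ask whether a single cut point on one measurement separates two species, which is a one-dimensional special case of linear separability. The sketch below assumes petal length as the measurement and uses hypothetical values; the helper name `separable_by_threshold` is made up for illustration and does not come from the original analysis.

```python
def separable_by_threshold(a, b):
    """True if one cut point puts all of `a` on one side and all of `b` on the other."""
    return max(a) < min(b) or max(b) < min(a)

# Hypothetical petal lengths in cm, for illustration only (not the real data).
petal_length = {
    "setosa": [1.3, 1.4, 1.5, 1.6],
    "versicolor": [4.0, 4.5, 4.7, 5.0],
    "virginica": [4.8, 5.6, 5.9, 6.0],
}

setosa_vs_versicolor = separable_by_threshold(
    petal_length["setosa"], petal_length["versicolor"]
)
versicolor_vs_virginica = separable_by_threshold(
    petal_length["versicolor"], petal_length["virginica"]
)
print(setosa_vs_versicolor, versicolor_vs_virginica)  # True False
```

With these made-up values the first pair is separable and the second is not, mirroring the pattern described above. Running the same check on the real measurements would be needed to confirm it for the actual dataset.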
The density plot shows the distribution of sepal lengths.
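A density plot is usually drawn from a kernel density estimate. The sketch below shows one common way to compute such a curve: a Gaussian kernel with a normal-reference rule-of-thumb bandwidth. The function name `gaussian_kde`, the grid, and the sample values are assumptions for illustration; the plotting tool used for the original figure may choose its bandwidth differently.

```python
import math

def gaussian_kde(values, grid, bandwidth=None):
    """Estimate a density at each grid point using a Gaussian kernel."""
    n = len(values)
    if bandwidth is None:
        # Normal-reference rule of thumb: 1.06 * sample std * n^(-1/5).
        mean = sum(values) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
        bandwidth = 1.06 * std * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]

# Hypothetical sepal lengths in cm, for illustration only (not the real data).
sepal_lengths = [4.9, 5.0, 5.1, 5.4, 5.8, 6.0, 6.3, 6.4, 6.7, 7.1]
step = 0.05
grid = [3.0 + step * i for i in range(121)]  # 3.0 cm to 9.0 cm
density = gaussian_kde(sepal_lengths, grid)

# Sanity check: a density should integrate to roughly 1 over a wide enough grid.
area = sum(density) * step
```

Plotting `density` against `grid` gives the smooth curve that a density plot displays. A smaller bandwidth makes the curve follow individual observations more closely, and a larger one smooths them out.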
Clustering Using K-means
The confusion matrix shows that 6 of the 150 observations are assigned to a cluster that does not match their species.
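The original code for this step is not shown, so the following is only a minimal sketch of the idea. It runs Lloyd's k-means algorithm, cross-tabulates clusters against species, and counts the observations that fall outside their cluster's majority species. The data are hypothetical, and the deterministic farthest-point initialization is an assumption made to keep the example reproducible; library implementations typically use random or k-means++ starts.

```python
from collections import Counter

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update until assignments stop changing."""
    centroids = farthest_point_init(points, k)
    assignments = None
    for _ in range(iterations):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignments, centroids

# Hypothetical (petal length, petal width) pairs in cm, for illustration only.
points = [
    (1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (1.6, 0.2),
    (4.0, 1.3), (4.5, 1.5), (4.7, 1.4), (4.4, 1.3),
    (6.0, 2.5), (5.9, 2.1), (5.7, 2.3), (4.8, 1.6),
]
species = ["setosa"] * 4 + ["versicolor"] * 4 + ["virginica"] * 4

clusters, centroids = kmeans(points, k=3)

# Confusion matrix as counts of (species, cluster) pairs.
confusion = Counter(zip(species, clusters))

# Label each cluster with its majority species, then count the mismatches.
majority = {
    c: Counter(s for s, a in zip(species, clusters) if a == c).most_common(1)[0][0]
    for c in set(clusters)
}
misclassified = sum(1 for s, a in zip(species, clusters) if majority[a] != s)
print(misclassified)  # 1
```

In this toy example the one virginica point that sits close to the versicolor group ends up in the versicolor cluster. That is the same kind of error the confusion matrix reports for the real data, where the two species overlap.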
The decision tree achieves 85% accuracy on unseen data.
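Accuracy on unseen data means fitting the tree on one part of the data and scoring it on observations it was not trained on. The sketch below illustrates that with a small CART-style tree (Gini impurity, threshold splits) written from scratch. The training and test values, the depth limit, and all function names are assumptions for illustration; this is not the model or the split behind the 85% figure.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini impurity the most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree(*zip(*left), depth + 1, max_depth),
            build_tree(*zip(*right), depth + 1, max_depth))

def predict(tree, row):
    # Internal nodes are tuples; leaves are class labels (strings).
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical (petal length, petal width) pairs in cm, for illustration only.
train_rows = [(1.4, 0.2), (1.3, 0.2), (1.6, 0.2),
              (4.0, 1.3), (4.5, 1.5), (4.7, 1.4),
              (6.0, 2.5), (5.9, 2.1), (5.7, 2.3)]
train_labels = ["setosa"] * 3 + ["versicolor"] * 3 + ["virginica"] * 3

# Held-out observations the tree never sees during fitting.
test_rows = [(1.5, 0.3), (4.4, 1.3), (5.6, 2.2), (4.6, 1.7)]
test_labels = ["setosa", "versicolor", "virginica", "virginica"]

tree = build_tree(train_rows, train_labels)
predictions = [predict(tree, r) for r in test_rows]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(accuracy)  # 0.75
```

The tree fits the training rows perfectly but misses one held-out virginica flower that lies in the versicolor range. This is why accuracy is reported on data kept out of training rather than on the training set itself.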
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole. (has