Objective: Compare and analyze various machine learning algorithms across various data sets using the open-source Weka toolkit. This assignment looks at both supervised and unsupervised approaches.

NOTE: Programming is not required for this work if you use WEKA, so put your effort into the analysis and the report. Be sure to include any references you use for the report. Setting up WEKA should take less than an hour. The run times for these algorithms should be very short: each takes seconds for a single run, except for the neural network, which takes a few minutes. You need to use multiple runs for your report. (Use the Experimenter feature, described in the tutorial, to set up multiple runs in WEKA.)
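When reporting multiple runs, one common way to summarize them is a mean and standard deviation per algorithm and data set. The sketch below is only an illustration in Python (the assignment itself needs no programming). The function name `summarize_runs` and the accuracy values are hypothetical placeholders, not real results.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of per-run accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder values for illustration only; replace with your own run results.
example_runs = [0.90, 0.92, 0.94]
mean_acc, std_acc = summarize_runs(example_runs)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over {len(example_runs)} runs")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two algorithms is larger than the run-to-run variation.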

Part A: Supervised Learning

You are to compare three popular supervised learning algorithms across four classification data sets.

Decision tree: Algorithm 1 will be a C4.5 decision tree. This algorithm can be found in Weka under the Classify tab using the label trees/J48.

Neural network: Algorithm 2 will be a standard neural network trained using backpropagation. This algorithm can be found in Weka under the Classify tab using the label functions/MultilayerPerceptron.

K nearest neighbours: Algorithm 3 will be the K nearest neighbours classification algorithm (will be reviewed in Tuesday's tutorial, however, you c). This algorithm can be found in Weka under the Classify tab using the label lazy/IBk.
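As a reminder of the idea behind K nearest neighbours, here is a minimal Python sketch: a query point receives the majority label among its k closest training points. This is a generic illustration and not Weka's IBk implementation. The function name `knn_predict`, the use of Euclidean distance, and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_tuple, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Made-up two-class toy data, for illustration only.
toy = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b"), ((5.2, 4.9), "b")]
print(knn_predict(toy, (1.1, 1.0), k=3))
```

Larger values of k smooth the decision boundary, while k = 1 follows the training data as closely as possible. This is the kind of parameter worth varying in your experiments.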

Data sets:

Iris classification data set

Contains features regarding iris plants, with the goal of determining which class of iris the plant is. There are 150 input vectors. Each input vector contains 4 attributes, and there are 3 possible classifications.

https://archive.ics.uci.edu/ml/datasets/Iris

Wisconsin breast cancer data set

Contains medical features of a tumour, with the goal of determining if the tumour is malignant or benign. There are 699 input vectors. Each input vector contains 9 attributes, and there are 2 possible classifications.

https://archive.ics.uci.edu/ml/datasets/Breast+Can…

inal%29

Car evaluation data set

Contains features regarding different vehicles, with the goal of determining the overall acceptability of the car. There are 1728 input vectors. Each input vector contains 6 attributes, and there are 4 possible classifications.

https://archive.ics.uci.edu/ml/datasets/Car+Evalua…

Diabetic retinopathy data set

Contains features of medical images, with the goal of determining whether the image shows signs of diabetic retinopathy or not. There are 1151 input vectors. Each input vector contains 20 attributes, and there are 2 possible classifications.

https://archive.ics.uci.edu/ml/datasets/Diabetic+R…

ata+Set

For Part A, analyze the performance of each required algorithm for each data set. What observations can you make regarding the data set used and the models trained? Does one approach beat all others for every data set, or do different approaches work better on the different problems? Using your understanding of the algorithms, try to explain the observations you make. Try modifying the parameters for the different algorithms. Does changing the parameters from their default values significantly impact the performance of the algorithm?

Part B: Unsupervised Learning

Implement and analyze the performance of clustering on unlabelled data sets using various clustering algorithms. For this part you will use the K-means clustering algorithm. The data sets that you will use are available at the following link: http://cs.joensuu.fi/sipu/datasets/

Note: Use the S1, S2, S3 and S4 data sets. Feel free to use any additional data sets from the above link. The data sets will need to be converted into ARFF files, as explained in the tutorial, for use with Weka.
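If you prefer to script the conversion rather than do it by hand, the following Python sketch shows one possible way. It assumes each line of the downloaded text file holds two whitespace-separated numeric values. Check your files first, since this layout is an assumption. The function name `text_to_arff`, the relation name `s1`, and the attribute names `x` and `y` are hypothetical choices, and the sample rows are made up.

```python
def text_to_arff(lines, relation="s1", attribute_names=("x", "y")):
    """Convert whitespace-separated numeric rows into ARFF text."""
    out = [f"@relation {relation}", ""]
    out += [f"@attribute {name} numeric" for name in attribute_names]
    out += ["", "@data"]
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(attribute_names):
            raise ValueError(f"expected {len(attribute_names)} values, got: {line!r}")
        out.append(",".join(fields))  # ARFF data rows are comma-separated
    return "\n".join(out) + "\n"


# Made-up rows for illustration; in practice pass the lines of the text file.
sample = ["100 200", "150 250"]
arff_text = text_to_arff(sample)
print(arff_text)
```

Writing `arff_text` to a file with the `.arff` extension should give something Weka can open, provided the input really matches the assumed layout.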

You are to make observations regarding how the k-means clustering works on the different data sets. How does modifying the number of clusters impact the within-cluster sum of squared errors? What happens if you use too many or too few clusters? What sort of impact would you expect from modifying the way clusters are initialized? What observations can you make comparing the clustering on an easily separable data set (S1) to one where the optimal clusters are a lot less clear (S4)?
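To make the within-cluster sum of squared errors concrete, here is a minimal Python sketch of the standard k-means procedure (Lloyd's algorithm) that returns that error for a chosen number of clusters. It is a generic illustration, not Weka's implementation, so its numbers should not be expected to match Weka's output. The function name `kmeans`, the random choice of initial centroids from the data, and the toy points are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (centroids, within-cluster sum of squared errors)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k randomly chosen data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        new_centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    sse = sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)
    return centroids, sse


# Made-up toy data with two obvious groups, for illustration only.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for k in (1, 2, 3):
    _, sse = kmeans(toy_points, k)
    print(k, round(sse, 3))
```

On this toy data the error drops sharply between one and two clusters, because two clusters match the structure of the data. Changing `seed` changes the initial centroids, which is one way to explore the effect of initialization.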

For bonus marks, extend your analysis by including a self-organizing map approach (self-organizing maps are not a topic typically covered in 4P76, but they are a seminar topic that will be presented and are worth knowing). To install the self-organizing map package, in the main Weka window (the GUI Chooser) open the Tools menu and click Package manager. In the package manager, select the SelfOrganizingMap package and click Install. It will now be available under the Cluster tab. Compare the self-organizing map clustering approach to the k-means clustering approach. How does its clustering procedure differ from that of k-means? What sort of impact does modifying the lattice width and height have on the algorithm? What observations can you make when the values are the same, or when one value is larger than the other?
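For orientation, the sketch below shows the general shape of a self-organizing map in Python: nodes sit on a width-by-height lattice, and each sample pulls its best matching unit and that unit's lattice neighbours toward it. This is a generic textbook-style illustration and not the Weka SelfOrganizingMap package. The function names, the linear decay of the learning rate and radius, the Gaussian neighbourhood, and all constants are assumptions made for the example.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def train_som(points, width, height, epochs=50, seed=0):
    """Train a small rectangular SOM; returns {(col, row): weight_list}."""
    rng = random.Random(seed)
    # Each lattice node starts at a randomly chosen data point.
    nodes = {(c, r): list(rng.choice(points)) for c in range(width) for r in range(height)}
    total_steps = epochs * len(points)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(points, len(points)):  # shuffled pass over the data
            progress = step / total_steps
            rate = 0.5 * (1.0 - progress)  # learning rate decays linearly
            radius = max(width, height) / 2 * (1.0 - progress) + 0.01  # shrinking neighbourhood
            bmu = min(nodes, key=lambda n: sq_dist(nodes[n], x))  # best matching unit
            for (c, r), w in nodes.items():
                lattice_d2 = (c - bmu[0]) ** 2 + (r - bmu[1]) ** 2
                influence = math.exp(-lattice_d2 / (2 * radius ** 2))
                for i in range(len(w)):
                    w[i] += rate * influence * (x[i] - w[i])
            step += 1
    return nodes


# Made-up toy data, for illustration only.
som_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for position, weights in train_som(som_points, width=2, height=1).items():
    print(position, [round(v, 2) for v in weights])
```

Unlike k-means, an update here also moves nodes that are near the winner on the lattice, so the `width` and `height` arguments shape which nodes influence each other.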

Assignment Requirements and Grading: The results are to be handed in via a technical paper written in the IEEE format shown to you in the tutorial. Your report should contain the following headers and sections:

Abstract, Introduction & problem definition

supervised and unsupervised learning, applicability of the two approaches.

Background

learning and one on unsupervised learning.

and equations.

they are not the focus of this assignment you do not need as much detail.

Results and Discussion

learning and one for unsupervised learning.

experimental evidence to support your claims.

provide further supporting evidence to your claims.

Conclusions

in your results section.
