Weka is a standard Java tool for running machine learning experiments and for embedding trained models in Java applications. It supports both supervised and unsupervised learning. There are three ways to use Weka: through the command line, through the Weka GUI, and through its Java API. The Weka library provides a large collection of machine learning algorithms implemented in Java. The objective of this post is to explain how to generate a model from an ARFF data file and how to classify a new instance with this model using the Weka API.
Requirements:

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-stable</artifactId>
    <version>3.8.0</version>
</dependency>
Weka uses a data file format called ARFF (Attribute-Relation File Format). An ARFF file consists of a list of all the instances, with the attribute values for each instance separated by commas. I will use the Iris 2D dataset in this example. It has three attributes: petallength, petalwidth, and class (Iris-setosa, Iris-versicolor, or Iris-virginica). Our objective is to generate a model that correctly classifies any new instance, given its petallength and petalwidth attributes, as Iris-setosa, Iris-versicolor, or Iris-virginica.
@relation iris-weka.filters.unsupervised.attribute.Remove-R1-2

@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
1.4,0.2,Iris-setosa
4.7,1.4,Iris-versicolor
5,1.5,Iris-virginica
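To make the comma-separated layout of the @data section concrete, here is a minimal sketch (plain Java, no Weka; the class name `ArffRowDemo` is my own) that splits one data row into its attribute values:

```java
public class ArffRowDemo {
    public static void main(String[] args) {
        // One line from the @data section: two numeric attributes, then the class label
        String row = "1.4,0.2,Iris-setosa";
        String[] values = row.split(",");

        double petallength = Double.parseDouble(values[0]);
        double petalwidth = Double.parseDouble(values[1]);
        String label = values[2];

        System.out.println(petallength + " " + petalwidth + " " + label);
        // → 1.4 0.2 Iris-setosa
    }
}
```

In practice you never parse ARFF by hand; Weka's `DataSource.read(path)` (used in `loadDataset` below) handles the header and data sections for you.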
In this stage we will generate a model using MultilayerPerceptron (a neural network) to classify the Iris 2D dataset. I used the default values for the neural network learning process; of course, you can change them manually through the setter methods.
This is done by the ModelGenerator class, which has four methods, as described in the next table.
| Name | Function |
| --- | --- |
| loadDataset | Loads the dataset from an ARFF file and saves it to an Instances object |
| buildClassifier | Builds a classifier for the training set using MultilayerPerceptron (neural network) |
| evaluateModel | Evaluates the accuracy of the generated model against the test set |
| saveModel | Saves the generated model to a path so it can be used for future prediction |
package com.emaraic.ml;

import java.util.logging.Level;
import java.util.logging.Logger;
import weka.classifiers.Classifier;
import weka.classifiers.evaluation.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

/**
 * @author Taha Emara
 * Website: http://www.emaraic.com
 * Email : taha@emaraic.com
 * Created on: Jun 28, 2017
 * Github link: https://github.com/emara-geek/weka-example
 */
public class ModelGenerator {

    public Instances loadDataset(String path) {
        Instances dataset = null;
        try {
            dataset = DataSource.read(path);
            if (dataset.classIndex() == -1) {
                dataset.setClassIndex(dataset.numAttributes() - 1);
            }
        } catch (Exception ex) {
            Logger.getLogger(ModelGenerator.class.getName()).log(Level.SEVERE, null, ex);
        }
        return dataset;
    }

    public Classifier buildClassifier(Instances traindataset) {
        MultilayerPerceptron m = new MultilayerPerceptron();
        /* MultilayerPerceptron parameters and their default values:
         *
         * Learning rate for the backpropagation algorithm
         * (value between 0 and 1, default 0.3):
         *   m.setLearningRate(0.3);
         *
         * Momentum rate for the backpropagation algorithm
         * (value between 0 and 1, default 0.2):
         *   m.setMomentum(0.2);
         *
         * Number of epochs to train through (default 500):
         *   m.setTrainingTime(500);
         *
         * Percentage size of the validation set used to terminate training;
         * if non-zero, it can pre-empt the number of epochs
         * (value between 0 and 100, default 0):
         *   m.setValidationSetSize(0);
         *
         * Seed for the random number generator (a long >= 0, default 0):
         *   m.setSeed(0);
         *
         * Hidden layers to create for the network: a comma-separated list of
         * natural numbers, or the wildcards 'a' = (attribs + classes) / 2,
         * 'i' = attribs, 'o' = classes, 't' = attribs + classes (default 'a').
         * For example, three hidden layers with 2, 3, and 3 nodes:
         *   m.setHiddenLayers("2,3,3");
         *
         * Batch size for batch prediction (default 100):
         *   m.setBatchSize("100");
         *
         * You can also open the GUI to watch the training process:
         *   m.setGUI(true);
         */
        try {
            m.buildClassifier(traindataset);
        } catch (Exception ex) {
            Logger.getLogger(ModelGenerator.class.getName()).log(Level.SEVERE, null, ex);
        }
        return m;
    }

    public String evaluateModel(Classifier model, Instances traindataset, Instances testdataset) {
        Evaluation eval = null;
        try {
            // Evaluate the classifier with the test dataset
            eval = new Evaluation(traindataset);
            eval.evaluateModel(model, testdataset);
        } catch (Exception ex) {
            Logger.getLogger(ModelGenerator.class.getName()).log(Level.SEVERE, null, ex);
        }
        return eval.toSummaryString("", true);
    }

    public void saveModel(Classifier model, String modelpath) {
        try {
            SerializationHelper.write(modelpath, model);
        } catch (Exception ex) {
            Logger.getLogger(ModelGenerator.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
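Under the hood, `SerializationHelper` is essentially a convenience wrapper around standard Java object serialization (Weka's classifiers implement `Serializable`). As a minimal sketch of the same round trip with plain `java.io` — using a `String` as a stand-in for the model object, since this sketch does not depend on Weka:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("model", ".bin");
        tmp.deleteOnExit();

        // Write: roughly what saveModel / SerializationHelper.write does
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(tmp))) {
            out.writeObject("a Serializable model would go here");
        }

        // Read: roughly what SerializationHelper.read does, followed by a cast
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(tmp))) {
            String restored = (String) in.readObject();
            System.out.println(restored);
        }
    }
}
```

This is also why the loading side (shown in the ModelClassifier class below) must cast the result of `SerializationHelper.read(path)` back to the concrete classifier type.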
In this stage we will use the model generated by the ModelGenerator class to classify a new instance. This is done by the ModelClassifier class.
package com.emaraic.ml;

import java.util.ArrayList;
import java.util.logging.Level;
import java.util.logging.Logger;
import weka.classifiers.Classifier;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.SerializationHelper;

/**
 * This is a classifier for the iris.2D.arff dataset
 * @author Taha Emara
 * Website: http://www.emaraic.com
 * Email : taha@emaraic.com
 * Created on: Jul 1, 2017
 * Github link: https://github.com/emara-geek/weka-example
 */
public class ModelClassifier {

    private Attribute petallength;
    private Attribute petalwidth;
    private ArrayList<Attribute> attributes;
    private ArrayList<String> classVal;
    private Instances dataRaw;

    public ModelClassifier() {
        petallength = new Attribute("petallength");
        petalwidth = new Attribute("petalwidth");
        attributes = new ArrayList<Attribute>();
        classVal = new ArrayList<String>();
        classVal.add("Iris-setosa");
        classVal.add("Iris-versicolor");
        classVal.add("Iris-virginica");
        attributes.add(petallength);
        attributes.add(petalwidth);
        attributes.add(new Attribute("class", classVal));
        dataRaw = new Instances("TestInstances", attributes, 0);
        dataRaw.setClassIndex(dataRaw.numAttributes() - 1);
    }

    public Instances createInstance(double petallength, double petalwidth, double result) {
        dataRaw.clear();
        // The class value (third slot) is a placeholder; it is ignored at prediction time
        double[] instanceValue1 = new double[]{petallength, petalwidth, 0};
        dataRaw.add(new DenseInstance(1.0, instanceValue1));
        return dataRaw;
    }

    public String classifiy(Instances insts, String path) {
        String result = "Not classified!!";
        Classifier cls = null;
        try {
            cls = (MultilayerPerceptron) SerializationHelper.read(path);
            result = classVal.get((int) cls.classifyInstance(insts.firstInstance()));
        } catch (Exception ex) {
            Logger.getLogger(ModelClassifier.class.getName()).log(Level.SEVERE, null, ex);
        }
        return result;
    }

    public Instances getInstance() {
        return dataRaw;
    }
}
In the Test class, I provide a complete example of using the ModelGenerator and ModelClassifier classes to generate a model and use it for future prediction.
import com.emaraic.ml.ModelClassifier;
import com.emaraic.ml.ModelGenerator;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Debug;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

/**
 * @author Taha Emara
 * Website: http://www.emaraic.com
 * Email : taha@emaraic.com
 * Created on: Jul 1, 2017
 * Github link: https://github.com/emara-geek/weka-example
 */
public class Test {

    public static final String DATASETPATH = "/Users/Emaraic/Temp/ml/iris.2D.arff";
    public static final String MODElPATH = "/Users/Emaraic/Temp/ml/model.bin";

    public static void main(String[] args) throws Exception {
        ModelGenerator mg = new ModelGenerator();
        Instances dataset = mg.loadDataset(DATASETPATH);

        Filter filter = new Normalize();

        // Divide the dataset into a train set (80%) and a test set (20%)
        int trainSize = (int) Math.round(dataset.numInstances() * 0.8);
        int testSize = dataset.numInstances() - trainSize;

        // If you comment out this line, the accuracy of the model drops from 96.6% to 80%
        dataset.randomize(new Debug.Random(1));

        // Normalize the dataset
        filter.setInputFormat(dataset);
        Instances datasetnor = Filter.useFilter(dataset, filter);

        Instances traindataset = new Instances(datasetnor, 0, trainSize);
        Instances testdataset = new Instances(datasetnor, trainSize, testSize);

        // Build the classifier with the train dataset
        MultilayerPerceptron ann = (MultilayerPerceptron) mg.buildClassifier(traindataset);

        // Evaluate the classifier with the test dataset
        String evalsummary = mg.evaluateModel(ann, traindataset, testdataset);
        System.out.println("Evaluation: " + evalsummary);

        // Save the model
        mg.saveModel(ann, MODElPATH);

        // Classify a single instance
        ModelClassifier cls = new ModelClassifier();
        String classname = cls.classifiy(Filter.useFilter(cls.createInstance(1.6, 0.2, 0), filter), MODElPATH);
        System.out.println("\nThe class name for the instance with petallength = 1.6 and petalwidth = 0.2 is " + classname);
    }
}
Evaluation:
Correctly Classified Instances          29               96.6667 %
Incorrectly Classified Instances         1                3.3333 %
Kappa statistic                          0.9497
K&B Relative Info Score               2783.763  %
K&B Information Score                   44.1136 bits      1.4705 bits/instance
Class complexity | order 0              47.6278 bits      1.5876 bits/instance
Class complexity | scheme                3.5142 bits      0.1171 bits/instance
Complexity improvement     (Sf)         44.1136 bits      1.4705 bits/instance
Mean absolute error                      0.046
Root mean squared error                  0.1051
Relative absolute error                 10.3365 %
Root relative squared error             22.2694 %
Total Number of Instances               30

The class name for the instance with petallength = 1.6 and petalwidth = 0.2 is Iris-setosa
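The headline numbers in the summary are simple ratios over the 30-instance test set. A quick sketch of the arithmetic behind the first two lines:

```java
public class AccuracyDemo {
    public static void main(String[] args) {
        int correct = 29;
        int total = 30;

        // Correctly classified percentage: 29 / 30 ≈ 96.6667 %
        double accuracy = 100.0 * correct / total;

        // Incorrectly classified percentage: 1 / 30 ≈ 3.3333 %
        double errorRate = 100.0 * (total - correct) / total;

        System.out.println(accuracy + " " + errorRate);
    }
}
```

The other statistics (Kappa, mean absolute error, and so on) are computed by Weka's Evaluation class from the same set of predictions.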