Weka is a standard Java tool for running machine learning experiments and for embedding trained models in Java applications. It supports both supervised and unsupervised learning. There are three ways to use Weka: through the command line, through the Weka GUI, and through its Java API. The Weka library provides a large collection of machine learning algorithms implemented in Java. The objective of this post is to explain how to generate a model from an ARFF data file and how to classify a new instance with this model using the Weka API.
Requirements:

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-stable</artifactId>
    <version>3.8.0</version>
</dependency>
Weka uses a data file format called ARFF (Attribute-Relation File Format). An ARFF file consists of a list of all the instances, with the attribute values for each instance separated by commas. I will use the Iris 2D dataset in this example. It has three attributes: petallength, petalwidth, and class (Iris-setosa, Iris-versicolor, or Iris-virginica). Our objective is to generate a model that correctly classifies any new instance, given its petallength and petalwidth attributes, as Iris-setosa, Iris-versicolor, or Iris-virginica.
@relation iris-weka.filters.unsupervised.attribute.Remove-R1-2

@attribute petallength numeric
@attribute petalwidth numeric
@attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}

@data
1.4,0.2,Iris-setosa
4.7,1.4,Iris-versicolor
5,1.5,Iris-virginica
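To make the comma-separated layout of the @data section concrete, here is a minimal sketch (plain Java, no Weka; the class name `ArffRowDemo` is my own) that splits one data row into its attribute values:

```java
public class ArffRowDemo {
    public static void main(String[] args) {
        // One line from the @data section: two numeric attributes, then the class label
        String row = "1.4,0.2,Iris-setosa";
        String[] values = row.split(",");

        double petallength = Double.parseDouble(values[0]);
        double petalwidth = Double.parseDouble(values[1]);
        String label = values[2];

        System.out.println(petallength + " " + petalwidth + " " + label);
        // → 1.4 0.2 Iris-setosa
    }
}
```

In practice you never parse ARFF by hand; Weka's `DataSource.read(path)` (used in `loadDataset` below) handles the header and data sections for you.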
In this stage we will generate a model using MultilayerPerceptron (a neural network) to classify the Iris 2D dataset. I used the default values for the neural network learning process; of course, you can change them manually through the setter methods.
This is done by the ModelGenerator class, which has four methods, as described in the next table.
| Name | Function |
| --- | --- |
| loadDataset | Loads the dataset from an ARFF file and saves it to an Instances object |
| buildClassifier | Builds a classifier for the training set using MultilayerPerceptron (neural network) |
| evaluateModel | Evaluates the accuracy of the generated model against the test set |
| saveModel | Saves the generated model to a path so it can be used for future prediction |
package com.emaraic.ml;

import java.util.logging.Level;
import java.util.logging.Logger;
import weka.classifiers.Classifier;
import weka.classifiers.evaluation.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

/**
 * @author Taha Emara
 * Website: http://www.emaraic.com
 * Email : taha@emaraic.com
 * Created on: Jun 28, 2017
 * Github link: https://github.com/emara-geek/weka-example
 */
public class ModelGenerator {

    public Instances loadDataset(String path) {
        Instances dataset = null;
        try {
            dataset = DataSource.read(path);
            if (dataset.classIndex() == -1) {
                dataset.setClassIndex(dataset.numAttributes() - 1);
            }
        } catch (Exception ex) {
            Logger.getLogger(ModelGenerator.class.getName()).log(Level.SEVERE, null, ex);
        }
        return dataset;
    }

    public Classifier buildClassifier(Instances traindataset) {
        MultilayerPerceptron m = new MultilayerPerceptron();
        /* MultilayerPerceptron parameters and their default values:
         *
         * Learning rate for the backpropagation algorithm
         * (value between 0 and 1, default 0.3):
         *   m.setLearningRate(0.3);
         *
         * Momentum rate for the backpropagation algorithm
         * (value between 0 and 1, default 0.2):
         *   m.setMomentum(0.2);
         *
         * Number of epochs to train through (default 500):
         *   m.setTrainingTime(500);
         *
         * Percentage size of the validation set used to terminate training;
         * if non-zero, it can pre-empt the number of epochs
         * (value between 0 and 100, default 0):
         *   m.setValidationSetSize(0);
         *
         * Seed for the random number generator (a long >= 0, default 0):
         *   m.setSeed(0);
         *
         * Hidden layers to create for the network: a comma-separated list of
         * natural numbers, or the wildcards 'a' = (attribs + classes) / 2,
         * 'i' = attribs, 'o' = classes, 't' = attribs + classes (default 'a').
         * For example, three hidden layers with 2, 3, and 3 nodes:
         *   m.setHiddenLayers("2,3,3");
         *
         * Batch size for batch prediction (default 100):
         *   m.setBatchSize("100");
         *
         * You can also open the GUI to watch the training process:
         *   m.setGUI(true);
         */
        try {
            m.buildClassifier(traindataset);
        } catch (Exception ex) {
            Logger.getLogger(ModelGenerator.class.getName()).log(Level.SEVERE, null, ex);
        }
        return m;
    }

    public String evaluateModel(Classifier model, Instances traindataset, Instances testdataset) {
        Evaluation eval = null;
        try {
            // Evaluate the classifier with the test dataset
            eval = new Evaluation(traindataset);
            eval.evaluateModel(model, testdataset);
        } catch (Exception ex) {
            Logger.getLogger(ModelGenerator.class.getName()).log(Level.SEVERE, null, ex);
        }
        return eval.toSummaryString("", true);
    }

    public void saveModel(Classifier model, String modelpath) {
        try {
            SerializationHelper.write(modelpath, model);
        } catch (Exception ex) {
            Logger.getLogger(ModelGenerator.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
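Under the hood, `SerializationHelper` is essentially a convenience wrapper around standard Java object serialization (Weka's classifiers implement `Serializable`). As a minimal sketch of the same round trip with plain `java.io` — using a `String` as a stand-in for the model object, since this sketch does not depend on Weka:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

public class SerializationDemo {
    public static void main(String[] args) throws Exception {
        File tmp = File.createTempFile("model", ".bin");
        tmp.deleteOnExit();

        // Write: roughly what saveModel / SerializationHelper.write does
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(tmp))) {
            out.writeObject("a Serializable model would go here");
        }

        // Read: roughly what SerializationHelper.read does, followed by a cast
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(tmp))) {
            String restored = (String) in.readObject();
            System.out.println(restored);
        }
    }
}
```

This is also why the loading side (shown in the ModelClassifier class below) must cast the result of `SerializationHelper.read(path)` back to the concrete classifier type.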
In this stage we will use the model generated by the ModelGenerator class to classify a new instance. This is done by the ModelClassifier class.
package com.emaraic.ml;

import java.util.ArrayList;
import java.util.logging.Level;
import java.util.logging.Logger;
import weka.classifiers.Classifier;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.core.SerializationHelper;

/**
 * This is a classifier for the iris.2D.arff dataset
 * @author Taha Emara
 * Website: http://www.emaraic.com
 * Email : taha@emaraic.com
 * Created on: Jul 1, 2017
 * Github link: https://github.com/emara-geek/weka-example
 */
public class ModelClassifier {

    private Attribute petallength;
    private Attribute petalwidth;
    private ArrayList<Attribute> attributes;
    private ArrayList<String> classVal;
    private Instances dataRaw;

    public ModelClassifier() {
        petallength = new Attribute("petallength");
        petalwidth = new Attribute("petalwidth");
        attributes = new ArrayList<Attribute>();
        classVal = new ArrayList<String>();
        classVal.add("Iris-setosa");
        classVal.add("Iris-versicolor");
        classVal.add("Iris-virginica");
        attributes.add(petallength);
        attributes.add(petalwidth);
        attributes.add(new Attribute("class", classVal));
        dataRaw = new Instances("TestInstances", attributes, 0);
        dataRaw.setClassIndex(dataRaw.numAttributes() - 1);
    }

    public Instances createInstance(double petallength, double petalwidth, double result) {
        dataRaw.clear();
        // The class value (third slot) is a placeholder; it is ignored at prediction time
        double[] instanceValue1 = new double[]{petallength, petalwidth, 0};
        dataRaw.add(new DenseInstance(1.0, instanceValue1));
        return dataRaw;
    }

    public String classifiy(Instances insts, String path) {
        String result = "Not classified!!";
        Classifier cls = null;
        try {
            cls = (MultilayerPerceptron) SerializationHelper.read(path);
            result = classVal.get((int) cls.classifyInstance(insts.firstInstance()));
        } catch (Exception ex) {
            Logger.getLogger(ModelClassifier.class.getName()).log(Level.SEVERE, null, ex);
        }
        return result;
    }

    public Instances getInstance() {
        return dataRaw;
    }
}
In the Test class, I provide a complete example of using the ModelGenerator and ModelClassifier classes to generate a model and use it for future prediction.
import com.emaraic.ml.ModelClassifier;
import com.emaraic.ml.ModelGenerator;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Debug;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

/**
 * @author Taha Emara
 * Website: http://www.emaraic.com
 * Email : taha@emaraic.com
 * Created on: Jul 1, 2017
 * Github link: https://github.com/emara-geek/weka-example
 */
public class Test {

    public static final String DATASETPATH = "/Users/Emaraic/Temp/ml/iris.2D.arff";
    public static final String MODElPATH = "/Users/Emaraic/Temp/ml/model.bin";

    public static void main(String[] args) throws Exception {
        ModelGenerator mg = new ModelGenerator();
        Instances dataset = mg.loadDataset(DATASETPATH);

        Filter filter = new Normalize();

        // Divide the dataset into a train set (80%) and a test set (20%)
        int trainSize = (int) Math.round(dataset.numInstances() * 0.8);
        int testSize = dataset.numInstances() - trainSize;

        // If you comment out this line, the accuracy of the model drops from 96.6% to 80%
        dataset.randomize(new Debug.Random(1));

        // Normalize the dataset
        filter.setInputFormat(dataset);
        Instances datasetnor = Filter.useFilter(dataset, filter);

        Instances traindataset = new Instances(datasetnor, 0, trainSize);
        Instances testdataset = new Instances(datasetnor, trainSize, testSize);

        // Build the classifier with the train dataset
        MultilayerPerceptron ann = (MultilayerPerceptron) mg.buildClassifier(traindataset);

        // Evaluate the classifier with the test dataset
        String evalsummary = mg.evaluateModel(ann, traindataset, testdataset);
        System.out.println("Evaluation: " + evalsummary);

        // Save the model
        mg.saveModel(ann, MODElPATH);

        // Classify a single instance
        ModelClassifier cls = new ModelClassifier();
        String classname = cls.classifiy(Filter.useFilter(cls.createInstance(1.6, 0.2, 0), filter), MODElPATH);
        System.out.println("\nThe class name for the instance with petallength = 1.6 and petalwidth = 0.2 is " + classname);
    }
}
Evaluation:
Correctly Classified Instances          29               96.6667 %
Incorrectly Classified Instances         1                3.3333 %
Kappa statistic                          0.9497
K&B Relative Info Score               2783.763  %
K&B Information Score                   44.1136 bits      1.4705 bits/instance
Class complexity | order 0              47.6278 bits      1.5876 bits/instance
Class complexity | scheme                3.5142 bits      0.1171 bits/instance
Complexity improvement     (Sf)         44.1136 bits      1.4705 bits/instance
Mean absolute error                      0.046
Root mean squared error                  0.1051
Relative absolute error                 10.3365 %
Root relative squared error             22.2694 %
Total Number of Instances               30

The class name for the instance with petallength = 1.6 and petalwidth = 0.2 is Iris-setosa
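The headline numbers in the summary are simple ratios over the 30-instance test set. A quick sketch of the arithmetic behind the first two lines:

```java
public class AccuracyDemo {
    public static void main(String[] args) {
        int correct = 29;
        int total = 30;

        // Correctly classified percentage: 29 / 30 ≈ 96.6667 %
        double accuracy = 100.0 * correct / total;

        // Incorrectly classified percentage: 1 / 30 ≈ 3.3333 %
        double errorRate = 100.0 * (total - correct) / total;

        System.out.println(accuracy + " " + errorRate);
    }
}
```

The other statistics (Kappa, mean absolute error, and so on) are computed by Weka's Evaluation class from the same set of predictions.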