DeepDetect uses a dedicated connector to train and predict from CSV data. At prediction time, raw CSV lines can also be passed directly, without the header and without reading them from a file.

For a comprehensive list of parameters to the CSV input connector, see the API Connectors section.

For a complete tutorial on how to train from CSV data, see here.
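
The calls below assume that a supervised service backed by the CSV input connector has already been created. The following is a minimal creation sketch, not a definitive recipe: the model repository path and the mllib settings (an "mlp" template with 7 classes and two hidden layers, borrowed from the Forest Cover Type example) are illustrative and should be adapted to your own data.

curl -X PUT "http://localhost:8080/services/covert" -d '{
       "mllib":"caffe",
       "description":"forest cover type classification service",
       "type":"supervised",
       "parameters":{
         "input":{
           "connector":"csv"
         },
         "mllib":{
           "template":"mlp",
           "nclasses":7,
           "layers":[150,150],
           "activation":"prelu"
         }
       },
       "model":{
         "repository":"models/covert"
       }
     }'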

Training

Training reads from CSV files and allows DeepDetect to apply some pre-processing without modifying the original data. The main features of the CSV input connector at training time are as follows:

  • the data field should hold the CSV training file, and when available, another CSV file containing the test data
  • the CSV files must include the CSV header
  • specifying the label column name is mandatory in training mode
  • at this stage only numerical data are supported; textual data are ignored
  • typical preprocessing includes scaling all data columns into [0,1], test splitting and shuffling of the data
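
For reference, the training CSV is expected to start with its header line, with the label held in a named column. A shortened, illustrative sample (columns borrowed from the Forest Cover Type dataset):

Id,Elevation,Aspect,Slope,Cover_Type
1,2596,51,3,5
2,2590,56,2,5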

Below is a typical call for training from CSV data:

curl -X POST "http://localhost:8080/train" -d '{
       "service":"covert",
       "async":true,
       "parameters":{
         "mllib":{
           "gpu":true,
           "solver":{
             "iterations":1000,
             "test_interval":100
           },
           "net":{
             "batch_size":512
           }
         },
         "input":{
           "label_offset":-1,
           "label":"Cover_Type",
           "id":"Id",
           "separator":",",
           "shuffle":true,
           "test_split":0.1,
           "scale":true
         },
         "output":{
           "measure":["acc","mcll","f1"]
         }
       },
       "data":["models/covert/train.csv"]
     }'

Note the relevant options:

  • data holds a single file that contains the training set
  • test_split, in combination with shuffle, turns a random 10% of the CSV training set into a testing set
  • label, id and separator are options to parse the CSV and flag the label and id columns
  • label_offset is useful when your labels do not originally range from 0 and beyond
  • scale tells the input connector to scale all data within [0,1] in order to get similar sensitivity across all dimensions. This usually helps the optimization procedure that underlies learning a neural net.
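
Since the call above is asynchronous ("async":true), the server returns a job identifier immediately, and the training status and measures can then be polled with a GET call on the /train resource. A sketch, assuming the returned job id is 1:

curl -X GET "http://localhost:8080/train?service=covert&job=1"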

See the Tutorial on training from CSV for a detailed application example.

Prediction

Prediction supports:

  • CSV file data
curl -X POST "http://localhost:8080/predict" -d '{
       "service":"covert",
       "parameters":{
         "input":{
           "id":"Id",
           "separator":",",
           "scale":true
         }
       },
       "data":["models/covert/test10.csv"]
     }'
  • passing the data directly as raw CSV lines, without the header
curl -X POST "http://localhost:8080/predict" -d '{
       "service":"covert",
       "parameters":{
         "input":{
           "connector":"csv",
           "scale":true,
           "min_vals":[1863,0,0,0,-146,0,0,99,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1],
           "max_vals":[3849,360,52,1343,554,6890,254,254,248,6993,1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
         }
       },
       "data":["2499,0,9,150,55,1206,207,223,154,859,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0"]
     }'
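
In both cases the server answers with a JSON document whose body.predictions array holds one entry per input row, each identified by a uri field (the value of the id column when one is provided), together with the predicted category and its probability. The response has roughly the following shape; all values here are purely illustrative:

{
  "status":{"code":200,"msg":"OK"},
  "head":{"method":"/predict","service":"covert","time":16.0},
  "body":{
    "predictions":[
      {"uri":"15121","loss":0.0,"classes":{"prob":0.93,"cat":"6"}}
    ]
  }
}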
