Machine learning using Clojure, libpython-clj2, and Pytorch
Motivation
The Clojure programming language has desirable properties for software engineering such as immutabilitty and strong support for implementing parallel algorithms. Currently Python is popular for machine learning due to Pytorch and other machine learning libraries targeting the Python programming language. However using libpython-clj2 one can invoke Pytorch for machine learning from within Clojure.
This article demonstrates a few basics of machine learning using the \(y=x^2\) parabola function as an example.
Import PyTorch
The PyTorch library is quite comprehensive, is free software, and you can find a lot of documentation on how to use it. The default version of PyTorch on pypi.org comes with CUDA (Nvidia) GPU support. There are also PyTorch wheels provided by AMD which come with ROCm support. Here we are going to use a CPU version of PyTorch which is a much smaller install.
You need to install Python 3.10 or later. For package management we are going to use the uv package manager. The following pyproject.toml file is used to install PyTorch and NumPy.
[project]
name = "ppo"
version = "0.1.0"
description = "Proximal Policy Optimization"
authors = [{ name="Jan Wedekind", email="jan@wedesoft.de" }]
requires-python = ">=3.10.0"
dependencies = [
"numpy",
"torch",
]
[tool.uv]
python-preference = "only-system"
[tool.uv.sources]
torch = { index = "pytorch" }
numpy = { index = "pytorch" }
[[tool.uv.index]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cpu"
[build-system]
requires = ["setuptools", "wheel"]
build-backend = "setuptools.build_meta"Note that we are specifying a custom repository index to get the CPU-only version of PyTorch. Also we are using the system version of Python to prevent uv from trying to install its own version which lacks the _cython module. To freeze the dependencies and create a uv.lock file, you need to run
uv lockYou can install the dependencies using
uv syncIn order to access PyTorch from Clojure you need to run the clj command via uv:
uv run cljNow you should be able to import the Python modules using require-python.
(require-python '[torch :as torch]
'[torch.nn :as nn]
'[torch.utils.data :as data]
'[torch.nn.functional :as F]
'[torch.optim :as optim]):okThe training data
First we are going to set the random seed for reproducibility of this article.
(torch/manual_seed 1)<torch._C.Generator object at 0x7f28bf9a0e10>We are going to sample a small set of points from the interval \([-6, 6]\) and add Gaussian noise to the labels.
(def extent 6.0)(def n 64)(def noise 1.0)(def features (torch/sub (torch/mul (torch/rand [n 1]) (* 2 extent)) extent))(def labels (torch/add (torch/mul features features) (torch/mul noise (torch/randn [n 1]))))featurestensor([[ 3.0916],
[-2.6483],
[-1.1632],
[ 2.8162],
[-5.6486],
[ 3.5983],
[-1.2344],
[ 3.0525],
[ 0.8341],
[-0.7347],
[ 1.6642],
[ 0.2960],
[ 2.1914],
[-2.3382],
[-0.4375],
[-0.5402],
[ 0.8697],
[-0.0240],
[ 5.2450],
[ 1.8671],
[-2.2344],
[-3.6238],
[-1.0057],
[-2.5880],
[-1.9227],
[ 0.2873],
[ 3.5768],
[ 3.2612],
[-5.8653],
[ 3.7195],
[ 1.6762],
[ 5.6913],
[ 3.9604],
[-5.4668],
[-5.7049],
[-2.8940],
[ 5.2687],
[-0.9994],
[ 2.5678],
[-2.7883],
[ 5.8873],
[-2.5386],
[ 4.4995],
[ 0.0710],
[-3.1609],
[ 3.0841],
[-3.1849],
[ 1.7646],
[-1.7325],
[-0.6578],
[-5.7683],
[-2.8607],
[ 3.2558],
[-1.4585],
[ 5.9763],
[ 4.8095],
[-0.2809],
[-4.0049],
[ 3.6538],
[ 1.8621],
[-3.8785],
[ 3.8973],
[ 3.6426],
[ 5.3214]])labelstensor([[ 9.5324e+00],
[ 5.9900e+00],
[ 7.5678e-01],
[ 6.9255e+00],
[ 3.1696e+01],
[ 1.2940e+01],
[ 3.1971e+00],
[ 9.3279e+00],
[-8.2310e-03],
[ 3.5447e-01],
[ 1.7732e+00],
[-7.4364e-01],
[ 4.3411e+00],
[ 4.9071e+00],
[ 5.8694e-01],
[-6.9050e-01],
[ 2.4983e-01],
[ 1.0035e-01],
[ 2.6856e+01],
[ 4.2179e+00],
[ 3.5583e+00],
[ 1.2631e+01],
[ 1.1831e+00],
[ 6.5380e+00],
[ 3.9514e+00],
[-4.1942e-01],
[ 1.1752e+01],
[ 1.1368e+01],
[ 3.3353e+01],
[ 1.3364e+01],
[ 3.1007e+00],
[ 3.4382e+01],
[ 1.6346e+01],
[ 3.1076e+01],
[ 3.3362e+01],
[ 7.4617e+00],
[ 2.9144e+01],
[ 1.8498e-01],
[ 5.6658e+00],
[ 8.8864e+00],
[ 3.5996e+01],
[ 7.0487e+00],
[ 2.0142e+01],
[-1.4617e-01],
[ 7.8892e+00],
[ 8.8916e+00],
[ 8.6655e+00],
[ 1.9805e+00],
[ 3.8755e+00],
[-1.2755e-01],
[ 3.4559e+01],
[ 9.0004e+00],
[ 1.0806e+01],
[ 2.4322e+00],
[ 3.6252e+01],
[ 2.2700e+01],
[ 2.6371e+00],
[ 1.5806e+01],
[ 1.3337e+01],
[ 5.3282e+00],
[ 1.3062e+01],
[ 1.6987e+01],
[ 1.3370e+01],
[ 2.8657e+01]])Next we are going to combine features and labels into a PyTorch dataset.
(def dataset (data/TensorDataset features labels))In order to check that the model generalises well, one usually splits up the available data into a training, development and test set.
(def train-size (int (* 0.6 n)))(def dev-size (int (* 0.2 n)))(def test-size (- n train-size dev-size))The random_split function is used to shuffle and split the dataset. Shuffling can be quite important for stability of the training process.
(def splits (data/random_split dataset [train-size dev-size test-size]))(def train-ds (nth splits 0))(def dev-ds (nth splits 1))(def test-ds (nth splits 2))The datasets are used in the following way:
- training data is used to train the model
- development data is used to check that the model is not overfitting or underfitting
- test data is used to report the performance of the model
Next we are going to use PyTorch DataLoaders to split training and development data into mini-batches.
(def train-data-loader (data/DataLoader train-ds :batch_size 4 :shuffle true))(def dev-data-loader (data/DataLoader dev-ds :batch_size 4 :shuffle true))The Model
In Python you can implement a small neural network with an input layer, two hidden layers, and an output layer as follows:
class ParabolaNet(nn.Module):
def __init__(self, n_hidden):
super().__init__()
self.fc1 = nn.Linear(1, n_hidden)
self.fc2 = nn.Linear(n_hidden, n_hidden)
self.fc3 = nn.Linear(n_hidden, 1)
def forward(self, x):
x = self.fc1(x)
x = F.sigmoid(x)
x = self.fc2(x)
x = F.sigmoid(x)
x = self.fc3(x)
return xUsing libpython-clj2 one can create the Python class from within Clojure.
(def ParabolaNet
(py/create-class
"ParabolaNet" [nn/Module]
{"__init__"
(py/make-instance-fn
(fn [self n-hidden]
(py. nn/Module __init__ self)
(py/set-attrs!
self
{"fc1" (nn/Linear 1 n-hidden)
"fc2" (nn/Linear n-hidden n-hidden)
"fc3" (nn/Linear n-hidden 1)})
nil))
"forward"
(py/make-instance-fn
(fn [self x]
(let [x (py. self fc1 x)
x (F/sigmoid x)
x (py. self fc2 x)
x (F/sigmoid x)
x (py. self fc3 x)]
x)))}))Each arrow represents a weight stored in the fully connected layers fc1, fc2, and fc3. There are different layer types in PyTorch. A fully connected layer is the most basic one. To be able to model non-linear functions, activation functions are used (here: sigmoid). The model looks like this when using 8 units in each hidden layer.
When evaluating a model, you should disable gradient accumulation. Otherwise gradients will leak into subsequent training steps. In Python this looks like this:
with torch.no_grad():
...In Clojure we can define a macro to disable gradient accumulation:
(defmacro without-gradient
[& body]
`(let [no-grad# (torch/no_grad)]
(try
(py. no-grad# ~'__enter__)
~@body
(finally
(py. no-grad# ~'__exit__ nil nil nil)))))Now one can perform a model prediction as follows:
(def model (ParabolaNet 10))(without-gradient
(model (torch/tensor [0.0])))tensor([-0.0908])Note that the output is not zero. PyTorch performs random initialisation of the weights for us. Breaking the symmetry like this is important, otherwise all activations and gradients will be the same and the model will not be able to learn.
Training
A training epoch performs the following step for each mini-batch in the training set:
- reset the gradient of the optimizer
- perform model predictions for the input features
- compute the loss which compares the predictions with the labels
- perform backpropagation to get gradients for each model parameter
- perform a gradient descent step using the optimizer
- return the loss value
(defn train-epoch
[train-data-loader criterion model optimizer]
(py. model train)
(for [[features labels] train-data-loader]
(do
(py. optimizer zero_grad)
(let [prediction (model features)
loss (criterion prediction labels)]
(py. loss backward)
(py. optimizer step)
(py. loss item)))))In order to check whether the model is overfitting or underfitting, we need to evaluate the model on the development set. The following method computes the loss values for each mini-batch in the development set.
(defn dev-epoch
[dev-data-loader criterion model]
(py. model eval)
(without-gradient
(for [[features labels] dev-data-loader]
(let [prediction (model features)
loss (criterion prediction labels)]
(py. loss item)))))We define a function to compute the average of a list of numbers.
(defn average [numbers]
(/ (reduce + numbers) (count numbers)))Now we can implement a training run. A training run basically consists of many training epochs. Here we are using the stochastic gradient descent method. Note that usually the Adam optimizer is used, because it is more efficient. As a loss function we simply use the mean squared error.
(defn training-run
[train-data-loader dev-data-loader epochs n-hidden lr weight-decay]
(let [model (ParabolaNet n-hidden)
optimizer (optim/SGD (py. model "parameters") :lr lr :weight_decay weight-decay)
criterion (nn/MSELoss)]
(loop [epoch 1 train-losses [] dev-losses []]
(let [train-loss (average (train-epoch train-data-loader criterion model optimizer))
dev-loss (average (dev-epoch dev-data-loader criterion model))]
(if (< epoch epochs)
(recur (inc epoch) (conj train-losses train-loss) (conj dev-losses dev-loss))
{:model model :train-losses (conj train-losses train-loss) :dev-losses (conj dev-losses dev-loss)})))))Let’s train a model.
(def result (training-run train-data-loader dev-data-loader 5000 100 0.01 0.0))Hyperparameter tuning
The following function plots all data points (training, development, and test set) and the model predictions for different inputs.
(defn plot-model
[features labels {:keys [model]}]
(without-gradient
(let [x (range (- extent) (+ extent 0.01) 0.01)
y (map (fn [x] (py. (first (model (torch/tensor [x]))) item)) x)
ds (tc/dataset {:x x :y y})
pts (tc/dataset {:x (map first (py/->jvm (py. features tolist)))
:y (map first (py/->jvm (py. labels tolist)))})]
(-> ds
(plotly/base {:=title "Model"})
(plotly/layer-point {:=dataset pts :=x :x :=y :y :=name "data"})
(plotly/layer-line {:=x :x :=y :y :=name "prediction"})))))Let’s plot our model.
(plot-model features labels result)As one can see, the model is overfitting the training set. I.e. the model fits the training data closely but does not generalize well. This makes the model sensitive to noise.
Here one can observe overfitting directly. In general however one uses the dev set to detect if the model is overfitting.
First we define a smoothing function which smoothes a sequence of loss values.
(defn smoothing
[alpha]
(fn [coll]
(reductions (fn [prev-avg current] (+ (* alpha prev-avg) (* (- 1 alpha) current))) 0.0 coll)))Next we define a function to plot the training loss (learning curve) and validation loss over time.
(defn plot-losses
[{:keys [train-losses dev-losses]} smoothing-fn]
(-> (tc/dataset {:x (range 1 (count train-losses)) :y (smoothing-fn train-losses)})
(plotly/base {:=title "Losses"})
(plotly/layer-line {:=x :x :=y :y :=name "training loss"})
(plotly/layer-line {:=dataset (tc/dataset {:x (range 1 (count dev-losses)) :y (smoothing-fn dev-losses)})
:=x :x :=y :y :=name "dev loss"})))Let’s plot the losses without smoothing.
(plot-losses result (smoothing 0.0))In practice one uses smoothing to make the trend of the curves more visible.
(plot-losses result (smoothing 0.99))As one can see, the dev set loss is much higher than the training loss. This is a sign that the model is overfitting (high variance).
High variance can be resolved using the following techniques:
- use more data
- regularization
- neural network architecture search
Let’s see what underfitting looks like.
(def result2 (training-run train-data-loader dev-data-loader 5000 1 0.01 0.0))(plot-model features labels result2)Let’s look at the losses.
(plot-losses result2 (smoothing 0.99))As one can see, both the training and dev set loss are high. This is a sign that the model is underfitting (high bias).
High bias can be resolved using the following techniques:
- bigger network
- train longer
- neural network architecture search
Regularization
Instead of tuning the number of hidden units and layers (which can only be done in discrete steps), one can use regularization to tune the network.
Here we are using weight decay (which is equivalent to L2 regularization). For example when using a weight decay of 0.001, we get the following model.
(def result3 (training-run train-data-loader dev-data-loader 5000 100 0.01 0.001))Here is the model output.
(plot-model features labels result3)And the losses are as follows.
(plot-losses result3 (smoothing 0.99))We did not explore different learning rates, but this hyperparameter is usually straightforward to tune:
- The learning rate is too low if the model converges very slowly.
- The learning rate is too high if the loss becomes unstable or the parameters diverge.
Enjoy!
source: src/mlp/main.clj