NNabla Python API Demonstration Tutorial¶
Let us import nnabla first, along with some additional useful tools.
# python2/3 compatibility
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
import nnabla as nn # Abbreviate as nn for convenience.
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
2017-09-27 14:00:30,785 [nnabla][INFO]: Initializing CPU extension...
NdArray¶
NdArray is a data container for a multi-dimensional array. NdArray is device- (e.g. CPU, CUDA) and type- (e.g. uint8, float32) agnostic; the type and device are implicitly cast or transferred when the array is used. Below, you create an NdArray with a shape of (2, 3, 4).
a = nn.NdArray((2, 3, 4))
You can see the values held inside a by the following. The values are not initialized, and are created as float32 by default.
print(a.data)
[[[ 9.42546995e+24 4.56809286e-41 8.47690058e-38 0.00000000e+00]
[ 7.38056336e+34 7.50334969e+28 1.17078231e-32 7.58387310e+31]
[ 7.87001454e-12 9.84394250e-12 6.85712044e+22 1.81785692e+31]]
[[ 1.84681296e+25 1.84933247e+20 4.85656319e+33 2.06176836e-19]
[ 6.80020530e+22 1.69307638e+22 2.11235872e-19 1.94316151e-19]
[ 1.81805047e+31 3.01289097e+29 2.07004908e-19 1.84648795e+25]]]
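The NdArray is also type-agnostic, as mentioned above. The following is a minimal sketch (assuming the cast method of the NdArray API): requesting the contents with a different dtype casts the underlying array implicitly. A temporary array is used so that a is left untouched.
tmp = nn.NdArray((2, 2))
print(tmp.data.dtype)  # float32 by default
tmp.cast(np.int32)     # Request the contents as int32; the array is cast implicitly.
print(tmp.data.dtype)  # int32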
The accessor .data returns a reference to the values of NdArray as numpy.ndarray. You can modify these by using the NumPy API as follows.
print('[Substituting random values]')
a.data = np.random.randn(*a.shape)
print(a.data)
print('[Slicing]')
a.data[0, :, ::2] = 0
print(a.data)
[Substituting random values]
[[[ 0.36133638 0.22121875 -1.5912329 -0.33490974]
[ 1.35962474 0.2165522 0.54483992 -0.61813235]
[-0.13718799 -0.44104072 -0.51307833 0.73900551]]
[[-0.59464753 -2.17738533 -0.28626776 -0.45654735]
[ 0.73566747 0.87292582 -0.41605178 0.04792296]
[-0.63856047 0.31966645 -0.63974309 -0.61385244]]]
[Slicing]
[[[ 0. 0.22121875 0. -0.33490974]
[ 0. 0.2165522 0. -0.61813235]
[ 0. -0.44104072 0. 0.73900551]]
[[-0.59464753 -2.17738533 -0.28626776 -0.45654735]
[ 0.73566747 0.87292582 -0.41605178 0.04792296]
[-0.63856047 0.31966645 -0.63974309 -0.61385244]]]
Note that the above operations are all performed on the host device (CPU). If you want to fill all values with a constant, NdArray provides the more efficient functions .zero and .fill. They are lazily evaluated when the data is requested (when a neural network computation requests the data, or when the NumPy array is requested from Python). The filling operation is executed within a specific device (e.g. a CUDA GPU), and is more efficient if you specify the device setting, which we explain later.
a.fill(1) # Filling all values with one.
print(a.data)
[[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]]
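For completeness, .zero (mentioned above) works in the same lazy way as .fill, setting all values to zero when the data is requested.
a.zero()       # Lazily fill all values with zero.
print(a.data)  # Requesting the data triggers the evaluation.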
You can create an NdArray instance directly from a NumPy array object.
b = nn.NdArray.from_numpy_array(np.ones(a.shape))
print(b.data)
[[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]]
NdArray is used in the Variable class, as well as in NNabla's imperative computation of neural networks. We describe both in later sections.
Variable¶
The Variable class is used when you construct a neural network. A neural network can be described as a graph in which an edge represents a function (a.k.a. operator or layer) defining a minimal unit of computation, and a node represents a variable holding the input/output values of a function (the Function class is explained later). The graph is called a "computation graph".
In NNabla, a Variable, a node of a computation graph, holds two NdArrays: one for storing the input or output values of a function during forward propagation (executing the computation graph in forward order), and another for storing the backward error signal (gradient) during backward propagation (executing the computation graph in backward order to propagate error signals down to the parameters (weights) of the neural network). The first is called data and the second grad in NNabla.
The following line creates a Variable instance with a shape of (2, 3, 4). It has data and grad as NdArrays. The flag need_grad is used to omit unnecessary gradient computation during backprop if set to False.
x = nn.Variable([2, 3, 4], need_grad=True)
print('x.data:', x.data)
print('x.grad:', x.grad)
x.data: <NdArray((2, 3, 4)) at 0x7f575caf4ea0>
x.grad: <NdArray((2, 3, 4)) at 0x7f575caf4ea0>
You can get the shape by:
x.shape
(2, 3, 4)
Since both data and grad are NdArrays, you can get a reference to their values as NdArray with the .data and .grad accessors; the values can also be accessed directly through the .d and .g properties for data and grad, respectively.
print('x.data')
print(x.d)
x.d = 1.2345 # To avoid NaN
assert np.all(x.d == x.data.data), 'd: {} != {}'.format(x.d, x.data.data)
print('x.grad')
print(x.g)
x.g = 1.2345 # To avoid NaN
assert np.all(x.g == x.grad.data), 'g: {} != {}'.format(x.g, x.grad.data)
# Zeroing grad values
x.grad.zero()
print('x.grad (after `.zero()`)')
print(x.g)
x.data
[[[ 9.42553452e+24 4.56809286e-41 8.32543479e-38 0.00000000e+00]
[ nan nan 0.00000000e+00 0.00000000e+00]
[ 3.70977305e+25 4.56809286e-41 3.78350585e-44 0.00000000e+00]]
[[ 5.68736600e-38 0.00000000e+00 1.86176378e-13 4.56809286e-41]
[ 4.74367616e+25 4.56809286e-41 5.43829710e+19 4.56809286e-41]
[ 0.00000000e+00 0.00000000e+00 2.93623372e-38 0.00000000e+00]]]
x.grad
[[[ 9.42576510e+24 4.56809286e-41 9.42576510e+24 4.56809286e-41]
[ 9.27127763e-38 0.00000000e+00 9.27127763e-38 0.00000000e+00]
[ 1.69275966e+22 4.80112800e+30 1.21230330e+25 7.22962302e+31]]
[[ 1.10471027e-32 4.63080422e+27 2.44632805e+20 2.87606258e+20]
[ 4.46263300e+30 4.62311881e+30 7.65000750e+28 3.01339003e+29]
[ 2.08627352e-10 1.03961868e+21 7.99576678e+20 1.74441223e+22]]]
x.grad (after .zero())
[[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]]
Like NdArray, a Variable can also be created from NumPy array(s).
x2 = nn.Variable.from_numpy_array(np.ones((3,)), need_grad=True)
print(x2)
print(x2.d)
x3 = nn.Variable.from_numpy_array(np.ones((3,)), np.zeros((3,)), need_grad=True)
print(x3)
print(x3.d)
print(x3.g)
<Variable((3,), need_grad=True) at 0x7f572a5242c8>
[ 1. 1. 1.]
<Variable((3,), need_grad=True) at 0x7f572a5244a8>
[ 1. 1. 1.]
[ 0. 0. 0.]
Besides storing the values of a computation graph, a Variable has another important role: it points to its parent edge (function), which makes it possible to trace the computation graph. Here x doesn't have any connection yet. Therefore, the .parent property returns None.
print(x.parent)
None
Function¶
A function defines an operation block of a computation graph as we described above. The module nnabla.functions offers various functions (e.g. Convolution, Affine and ReLU). You can see the list of functions available in the API reference guide.
import nnabla.functions as F
As an example, here you will define a computation graph that computes the element-wise sigmoid of the input variable and sums up all values into a scalar. (This is simple enough to explain how it behaves, but is a meaningless example in the context of neural network training. We will show you a neural network example later.)
sigmoid_output = F.sigmoid(x)
sum_output = F.reduce_sum(sigmoid_output)
The function API in nnabla.functions takes one (or several) Variables and arguments (if any), and returns one (or several) output Variables. The .parent property of an output variable points to the function instance that created it. Note that no computation occurs at this point since we have only defined the graph. (This is the default behavior of the NNabla computation graph API. You can also fire actual computation during graph definition, which we call "dynamic mode" and explain later.)
print("sigmoid_output.parent.name:", sigmoid_output.parent.name)
print("x:", x)
print("sigmoid_output.parent.inputs refers to x:", sigmoid_output.parent.inputs)
sigmoid_output.parent.name: Sigmoid
x: <Variable((2, 3, 4), need_grad=True) at 0x7f572a51a778>
sigmoid_output.parent.inputs refers to x: [<Variable((2, 3, 4), need_grad=True) at 0x7f572a273a48>]
print("sum_output.parent.name:", sum_output.parent.name)
print("sigmoid_output:", sigmoid_output)
print("sum_output.parent.inputs refers to sigmoid_output:", sum_output.parent.inputs)
sum_output.parent.name: ReduceSum
sigmoid_output: <Variable((2, 3, 4), need_grad=True) at 0x7f572a524638>
sum_output.parent.inputs refers to sigmoid_output: [<Variable((2, 3, 4), need_grad=True) at 0x7f572a273a48>]
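As a small illustration using only the .parent, .name and .inputs attributes shown above, the computation graph can be traced backwards from sum_output to the input variable.
v = sum_output
while v.parent is not None:
    print(v.parent.name)    # Prints ReduceSum, then Sigmoid.
    v = v.parent.inputs[0]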
Calling .forward() at a terminal (output) Variable executes the forward pass computation of the computation graph.
sum_output.forward()
print("CG output:", sum_output.d)
print("Reference:", np.sum(1.0 / (1.0 + np.exp(-x.d))))
CG output: 18.59052085876465
Reference: 18.5905
Calling .backward() performs backpropagation through the graph. Here we initialize the grad values to zero before backprop, since the NNabla backprop algorithm always accumulates the gradients in the root variables.
x.grad.zero()
sum_output.backward()
print("d sum_o / d sigmoid_o:")
print(sigmoid_output.g)
print("d sum_o / d x:")
print(x.g)
d sum_o / d sigmoid_o:
[[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]]
d sum_o / d x:
[[[ 0.17459197 0.17459197 0.17459197 0.17459197]
[ 0.17459197 0.17459197 0.17459197 0.17459197]
[ 0.17459197 0.17459197 0.17459197 0.17459197]]
[[ 0.17459197 0.17459197 0.17459197 0.17459197]
[ 0.17459197 0.17459197 0.17459197 0.17459197]
[ 0.17459197 0.17459197 0.17459197 0.17459197]]]
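As a quick sanity check (a NumPy-only sketch), the gradient with respect to x should match the analytic derivative of the sigmoid, sigmoid(x) * (1 - sigmoid(x)).
s = 1.0 / (1.0 + np.exp(-x.d))          # Element-wise sigmoid of the input.
print(np.allclose(x.g, s * (1.0 - s)))  # True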
NNabla is developed mainly with a focus on neural network training and inference. Neural networks have learnable parameters associated with computation blocks such as Convolution and Affine (a.k.a. fully connected, dense, etc.). In NNabla, the learnable parameters are also represented as Variable objects. Just like input variables, those parameter variables are used by passing them into Functions. For example, the Affine function takes input, weights and biases as inputs.
x = nn.Variable([5, 2]) # Input
w = nn.Variable([2, 3], need_grad=True) # Weights
b = nn.Variable([3], need_grad=True) # Biases
affine_out = F.affine(x, w, b) # Create a graph including only affine
The above example takes an input with B=5 (batch size) and D=2 (dimensions) and maps it to D'=3 outputs, i.e. an output of shape (B, D').
You may also notice that here you set need_grad=True only for the parameter variables (w and b). The variable x is a non-parameter variable and the root of the computation graph; therefore, it doesn't require gradient computation. In this configuration, the gradient with respect to x is not computed in the first affine, which omits unnecessary backpropagation computation.
The next block sets data and initializes grad, then applies forward and backward computation.
# Set random input and parameters
x.d = np.random.randn(*x.shape)
w.d = np.random.randn(*w.shape)
b.d = np.random.randn(*b.shape)
# Initialize grad
x.grad.zero() # Just for showing gradients are not computed when need_grad=False (default).
w.grad.zero()
b.grad.zero()
# Forward and backward
affine_out.forward()
affine_out.backward()
# Note: Calling backward at a non-scalar Variable propagates ones as the error signal from all elements of the output.
You can see that affine_out holds the output of Affine.
print('F.affine')
print(affine_out.d)
print('Reference')
print(np.dot(x.d, w.d) + b.d)
F.affine
[[-0.17701732 2.86095762 -0.82298267]
[-0.75544345 -1.16702223 -2.44841242]
[-0.36278027 -3.4771595 -0.75681627]
[ 0.32743117 0.24258983 1.30944324]
[-0.87201929 1.94556415 -3.23357344]]
Reference
[[-0.1770173 2.86095762 -0.82298267]
[-0.75544345 -1.16702223 -2.44841242]
[-0.3627803 -3.4771595 -0.75681627]
[ 0.32743117 0.24258983 1.309443 ]
[-0.87201929 1.94556415 -3.23357344]]
The resulting gradients of weights and biases are as follows.
print("dw")
print(w.g)
print("db")
print(b.g)
dw
[[ 3.10820675 3.10820675 3.10820675]
[ 0.37446201 0.37446201 0.37446201]]
db
[ 5. 5. 5.]
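As a verification sketch (relying on the ones propagated as the error signal by backward() on a non-scalar Variable, as noted in the code comment above), the gradients match the analytic expressions dW = x^T dy and db = sum_b dy.
dy = np.ones(affine_out.shape)              # Error signal propagated from affine_out.
print(np.allclose(w.g, np.dot(x.d.T, dy)))  # True
print(np.allclose(b.g, dy.sum(axis=0)))     # True (equals the batch size, 5)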
The gradient of x is not changed because need_grad is set to False.
print(x.g)
[[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]
[ 0. 0.]]
Parametric Function¶
Considering parameters as inputs of Function
enhances expressiveness
and flexibility of computation graphs. However, to define all parameters
for each learnable function is annoying for users to define a neural
network. In NNabla, trainable models are usually created by composing
functions that have optimizable parameters. These functions are called
“Parametric Functions”. The Parametric Function API provides various
parametric functions and an interface for composing trainable models.
To use parametric functions, import:
import nnabla.parametric_functions as PF
A function with an optimizable parameter can be created as follows.
with nn.parameter_scope("affine1"):
c1 = PF.affine(x, 3)
The first line creates a parameter scope. The second line then applies PF.affine (an affine transform) to x, and creates a variable c1 holding that result. The parameters are created and initialized randomly at the function call, and registered under the name "affine1" using the parameter_scope context. The function nnabla.get_parameters() allows you to retrieve the registered parameters.
nn.get_parameters()
OrderedDict([('affine1/affine/W',
<Variable((2, 3), need_grad=True) at 0x7f572822f0e8>),
('affine1/affine/b',
<Variable((3,), need_grad=True) at 0x7f572822f138>)])
The name= argument of any PF function creates a parameter space equivalent to the above definition using PF.affine, as shown below. It can make your Python code more compact. The nnabla.parameter_scope context is more useful when you group multiple parametric functions, such as the Convolution-BatchNormalization pair found in a typical unit of a CNN.
c1 = PF.affine(x, 3, name='affine1')
nn.get_parameters()
OrderedDict([('affine1/affine/W',
<Variable((2, 3), need_grad=True) at 0x7f572822f0e8>),
('affine1/affine/b',
<Variable((3,), need_grad=True) at 0x7f572822f138>)])
It is worth noting that the shapes of both the output and the parameter variables (as you can see above) are automatically determined by providing only the output size of the affine transformation (in the example above, the output size is 3). This makes it easy to create a graph.
c1.shape
(5, 3)
Parameter scopes can be nested as follows (although this is a meaningless example).
with nn.parameter_scope('foo'):
h = PF.affine(x, 3)
with nn.parameter_scope('bar'):
h = PF.affine(h, 4)
This creates the following.
nn.get_parameters()
OrderedDict([('affine1/affine/W',
<Variable((2, 3), need_grad=True) at 0x7f572822f0e8>),
('affine1/affine/b',
<Variable((3,), need_grad=True) at 0x7f572822f138>),
('foo/affine/W',
<Variable((2, 3), need_grad=True) at 0x7f572822fa98>),
('foo/affine/b',
<Variable((3,), need_grad=True) at 0x7f572822fae8>),
('foo/bar/affine/W',
<Variable((3, 4), need_grad=True) at 0x7f572822f728>),
('foo/bar/affine/b',
<Variable((4,), need_grad=True) at 0x7f572822fdb8>)])
Also, get_parameters() can be used in parameter_scope. For example:
with nn.parameter_scope("foo"):
print(nn.get_parameters())
OrderedDict([('affine/W', <Variable((2, 3), need_grad=True) at 0x7f572822fa98>), ('affine/b', <Variable((3,), need_grad=True) at 0x7f572822fae8>), ('bar/affine/W', <Variable((3, 4), need_grad=True) at 0x7f572822f728>), ('bar/affine/b', <Variable((4,), need_grad=True) at 0x7f572822fdb8>)])
nnabla.clear_parameters() can be used to delete the parameters registered under the current scope.
with nn.parameter_scope("foo"):
nn.clear_parameters()
print(nn.get_parameters())
OrderedDict([('affine1/affine/W', <Variable((2, 3), need_grad=True) at 0x7f572822f0e8>), ('affine1/affine/b', <Variable((3,), need_grad=True) at 0x7f572822f138>)])
MLP Example For Explanation¶
The following block creates a computation graph to predict a one-dimensional output from two-dimensional inputs using a 2-layer fully connected neural network (multi-layer perceptron).
nn.clear_parameters()
batchsize = 16
x = nn.Variable([batchsize, 2])
with nn.parameter_scope("fc1"):
h = F.tanh(PF.affine(x, 512))
with nn.parameter_scope("fc2"):
y = PF.affine(h, 1)
print("Shapes:", h.shape, y.shape)
Shapes: (16, 512) (16, 1)
This will create the following parameter variables.
nn.get_parameters()
OrderedDict([('fc1/affine/W',
<Variable((2, 512), need_grad=True) at 0x7f572822fef8>),
('fc1/affine/b',
<Variable((512,), need_grad=True) at 0x7f572822f9a8>),
('fc2/affine/W',
<Variable((512, 1), need_grad=True) at 0x7f572822f778>),
('fc2/affine/b',
<Variable((1,), need_grad=True) at 0x7f572822ff98>)])
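As a quick sketch, you can count the total number of trainable scalars in this MLP from the shapes above: 2*512 + 512 + 512*1 + 1 = 2049.
print(sum(np.prod(p.shape) for p in nn.get_parameters().values()))  # 2049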
As described above, you can execute the forward pass by calling the forward method at the terminal variable.
x.d = np.random.randn(*x.shape) # Set random input
y.forward()
print(y.d)
[[-0.05708594]
[ 0.01661986]
[-0.34168088]
[ 0.05822293]
[-0.16566885]
[-0.04867431]
[ 0.2633169 ]
[ 0.10496549]
[-0.01291842]
[-0.09726256]
[-0.05720493]
[-0.09691752]
[-0.07822668]
[-0.17180404]
[ 0.11970415]
[-0.08222144]]
Training a neural network requires a loss value to be minimized by gradient descent with backprop. In NNabla, a loss function is just another function, and is packaged in the functions module.
# Variable for label
label = nn.Variable([batchsize, 1])
# Set loss
loss = F.reduce_mean(F.squared_error(y, label))
# Execute forward pass.
label.d = np.random.randn(*label.shape) # Randomly generate labels
loss.forward()
print(loss.d)
1.9382084608078003
As you've seen above, NNabla's backward accumulates the gradients in the root variables. You have to initialize the grad of the parameter variables before backprop (we will show you the easiest way to do this with the Solver API).
# Collect all parameter variables and init grad.
for name, param in nn.get_parameters().items():
param.grad.zero()
# Gradients are accumulated to grad of params.
loss.backward()
Imperative Mode¶
After performing backprop, the gradients are held in the grad buffers of the parameter variables. The next block updates the parameters with vanilla gradient descent.
for name, param in nn.get_parameters().items():
param.data -= param.grad * 0.001 # 0.001 as learning rate
The above computation is an example of NNabla's "imperative mode" for executing neural networks. Normally, NNabla functions (instances of nnabla.functions) take Variables as their inputs. When at least one NdArray is provided as an input to an NNabla function (instead of Variables), the function computation is fired immediately and an NdArray is returned as the output, instead of a Variable. In the above example, the NNabla functions F.mul_scalar and F.sub2 are called by the overridden operators * and -=, respectively.
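The same update can also be written with explicit function calls. The following is a sketch (not part of the original notebook) using F.mul_scalar and F.sub2 with NdArray inputs; each call executes immediately and returns an NdArray, and copy_from (introduced below) writes the result back.
for name, param in nn.get_parameters().items():
    scaled_grad = F.mul_scalar(param.grad, 0.001)          # NdArray in, immediate execution, NdArray out.
    param.data.copy_from(F.sub2(param.data, scaled_grad))  # Write the result back into the parameter data.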
In other words, NNabla's imperative mode doesn't create a computation graph, and can be used like NumPy. If device acceleration such as CUDA is enabled, it can be used like NumPy empowered with device acceleration. Parametric functions can also be used with NdArray input(s), as shown in the sketch after the next block. The following block demonstrates a simple imperative execution example.
# A simple example of imperative mode.
xi = nn.NdArray.from_numpy_array(np.arange(4).reshape(2, 2))
yi = F.relu(xi - 1)
print(xi.data)
print(yi.data)
[[0 1]
[2 3]]
[[ 0. 0.]
[ 1. 2.]]
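As mentioned above, parametric functions also accept NdArray inputs. The following sketch (the scope name "imperative_demo" is used only for illustration) fires the affine computation immediately and returns an NdArray; the illustration-only parameters are cleared afterwards so they don't affect later cells.
with nn.parameter_scope("imperative_demo"):
    yi2 = PF.affine(nn.NdArray.from_numpy_array(np.random.randn(2, 2).astype(np.float32)), 3)
print(isinstance(yi2, nn.NdArray))  # True: an NdArray is returned, not a Variable.
print(yi2.data)
with nn.parameter_scope("imperative_demo"):
    nn.clear_parameters()  # Remove the illustration-only parameters.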
Note that in-place substitution from the rhs to the lhs cannot be done by the = operator. For example, when x is an NdArray, writing x = x + 1 will not increment all values of x; instead, the expression on the rhs creates a new NdArray object that is different from the one originally bound to x, and that new NdArray object is bound to the Python variable x on the lhs. For in-place editing of NdArrays, the in-place assignment operators +=, -=, *=, and /= can be used. The copy_from method can also be used to copy the values of an existing NdArray to another. For example, incrementing an NdArray x by 1 can be done with x.copy_from(x+1). The copy is performed with device acceleration if a device context is specified using nnabla.set_default_context or nnabla.context_scope.
# The following doesn't perform substitution but assigns a new NdArray object to `xi`.
# xi = xi + 1
# The following copies the result of `xi + 1` to `xi`.
xi.copy_from(xi + 1)
assert np.all(xi.data == (np.arange(4).reshape(2, 2) + 1))
# Inplace operations like `+=`, `*=` can also be used (more efficient).
xi += 1
assert np.all(xi.data == (np.arange(4).reshape(2, 2) + 2))
Solver¶
NNabla provides stochastic gradient descent algorithms, listed in the nnabla.solvers module, for optimizing parameters. The parameter updates demonstrated above can be replaced with this Solver API, which is easier to use and usually faster.
from nnabla import solvers as S
solver = S.Sgd(lr=0.00001)
solver.set_parameters(nn.get_parameters())
# Set random data
x.d = np.random.randn(*x.shape)
label.d = np.random.randn(*label.shape)
# Forward
loss.forward()
Just call the following solver method to zero-fill the grad regions, then run backprop.
solver.zero_grad()
loss.backward()
The following block updates the parameters with the vanilla SGD rule (equivalent to the imperative example above).
solver.update()
Toy Problem To Demonstrate Training¶
The following function defines the target of a regression problem: computing the norm (length) of a 2D vector.
def vector2length(x):
# x : [B, 2] where B is number of samples.
return np.sqrt(np.sum(x ** 2, axis=1, keepdims=True))
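A quick sanity check of the mapping: the length of the vector (3, 4) is 5.
print(vector2length(np.array([[3.0, 4.0]])))  # [[ 5.]]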
We visualize this mapping with a contour plot using matplotlib as follows.
# Data for plotting contour on a grid data.
xs = np.linspace(-1, 1, 100)
ys = np.linspace(-1, 1, 100)
grid = np.meshgrid(xs, ys)
X = grid[0].flatten()
Y = grid[1].flatten()
def plot_true():
"""Plotting contour of true mapping from a grid data created above."""
plt.contourf(xs, ys, vector2length(np.hstack([X[:, None], Y[:, None]])).reshape(100, 100))
plt.axis('equal')
plt.colorbar()
plot_true()
[Figure: contour plot of the true vector-length mapping]
We define a deep prediction neural network.
def length_mlp(x):
h = x
for i, hnum in enumerate([4, 8, 4, 2]):
h = F.tanh(PF.affine(h, hnum, name="fc{}".format(i)))
y = PF.affine(h, 1, name='fc')
return y
nn.clear_parameters()
batchsize = 100
x = nn.Variable([batchsize, 2])
y = length_mlp(x)
label = nn.Variable([batchsize, 1])
loss = F.reduce_mean(F.squared_error(y, label))
We created a 5-layer-deep MLP using a for-loop. Note that only 3 lines of code can potentially create arbitrarily deep neural networks. The next block adds helper functions to visualize the learned function.
def predict(inp):
ret = []
for i in range(0, inp.shape[0], x.shape[0]):
xx = inp[i:i + x.shape[0]]
# Imperative execution
xi = nn.NdArray.from_numpy_array(xx)
yi = length_mlp(xi)
ret.append(yi.data.copy())
return np.vstack(ret)
def plot_prediction():
plt.contourf(xs, ys, predict(np.hstack([X[:, None], Y[:, None]])).reshape(100, 100))
plt.colorbar()
plt.axis('equal')
Next, we instantiate a solver object as follows. We use the Adam optimizer, which is one of the most popular SGD algorithms in the literature.
from nnabla import solvers as S
solver = S.Adam(alpha=0.01)
solver.set_parameters(nn.get_parameters())
The following function generates an unlimited supply of data from the true system.
def random_data_provider(n):
x = np.random.uniform(-1, 1, size=(n, 2))
y = vector2length(x)
return x, y
In the next block, we run 2000 training steps (SGD updates).
num_iter = 2000
for i in range(num_iter):
# Sample data and set them to input variables of training.
xx, ll = random_data_provider(batchsize)
x.d = xx
label.d = ll
# Forward propagation given inputs.
loss.forward(clear_no_need_grad=True)
# Parameter gradients initialization and gradients computation by backprop.
solver.zero_grad()
loss.backward(clear_buffer=True)
# Apply weight decay and update by Adam rule.
solver.weight_decay(1e-6)
solver.update()
# Just print progress.
if i % 100 == 0 or i == num_iter - 1:
print("Loss@{:4d}: {}".format(i, loss.d))
Loss@ 0: 0.6976373195648193
Loss@ 100: 0.08075223118066788
Loss@ 200: 0.005213144235312939
Loss@ 300: 0.001955194864422083
Loss@ 400: 0.0011660841992124915
Loss@ 500: 0.0006421314901672304
Loss@ 600: 0.0009330055327154696
Loss@ 700: 0.0008817618945613503
Loss@ 800: 0.0006205961108207703
Loss@ 900: 0.0009072928223758936
Loss@1000: 0.0008160348515957594
Loss@1100: 0.0011569359339773655
Loss@1200: 0.000837412488181144
Loss@1300: 0.0011542742140591145
Loss@1400: 0.0005833200993947685
Loss@1500: 0.0009848927147686481
Loss@1600: 0.0005141657311469316
Loss@1700: 0.0009339841199107468
Loss@1800: 0.000950580753851682
Loss@1900: 0.0005430278833955526
Loss@1999: 0.0007046313839964569
Memory usage optimization: You may notice that, in the above updates, .forward() is called with the clear_no_need_grad= option and .backward() is called with the clear_buffer= option. Training a neural network in more realistic scenarios usually consumes a huge amount of memory due to the nature of the backpropagation algorithm, in which all of the forward variable buffers (data) must be kept in order to compute the gradient of a function. In a naive implementation, we keep all the variable data and grad alive until the NdArray objects are no longer referenced (i.e. the graph is deleted). The clear_* options in .forward() and .backward() save memory by clearing (erasing) the memory of data and grad when they are not referenced by any subsequent computation. (More precisely speaking, it doesn't actually free memory; we use our memory pool engine by default to avoid memory alloc/free overhead.) The unreferenced buffers can be reused in subsequent computation. See the documentation of Variable for more details. Note that the following loss.forward(clear_buffer=True) clears the data of all intermediate variables. If you are interested in intermediate variables for some purpose (e.g. debugging, logging), you can use the .persistent flag to prevent clearing the buffer of a specific Variable, as shown below.
loss.forward(clear_buffer=True)
print("The prediction `y` is cleared because it's an intermediate variable.")
print(y.d.flatten()[:4]) # to save space show only 4 values
y.persistent = True
loss.forward(clear_buffer=True)
print("The prediction `y` is kept by the persistent flag.")
print(y.d.flatten()[:4]) # to save space show only 4 values
The prediction `y` is cleared because it's an intermediate variable.
[ 2.27279830e-04 6.02164946e-05 5.33679675e-04 2.35557582e-05]
The prediction `y` is kept by the persistent flag.
[ 1.0851264 0.87657517 0.79603785 0.40098712]
We can confirm that the prediction performs fairly well by looking at the following visualization of the ground-truth and predicted functions.
plt.subplot(121)
plt.title("Ground truth")
plot_true()
plt.subplot(122)
plt.title("Prediction")
plot_prediction()
[Figure: ground truth (left) and prediction (right) contour plots]
You can save learned parameters with nnabla.save_parameters and load them with nnabla.load_parameters.
path_param = "param-vector2length.h5"
nn.save_parameters(path_param)
# Remove all once
nn.clear_parameters()
nn.get_parameters()
2017-09-27 14:00:40,544 [nnabla][INFO]: Parameter save (.h5): param-vector2length.h5
OrderedDict()
# Load again
nn.load_parameters(path_param)
print('\n'.join(map(str, nn.get_parameters().items())))
2017-09-27 14:00:40,564 [nnabla][INFO]: Parameter load (<built-in function format>): param-vector2length.h5
('fc0/affine/W', <Variable((2, 4), need_grad=True) at 0x7f576328df48>)
('fc0/affine/b', <Variable((4,), need_grad=True) at 0x7f57245f2868>)
('fc1/affine/W', <Variable((4, 8), need_grad=True) at 0x7f576328def8>)
('fc1/affine/b', <Variable((8,), need_grad=True) at 0x7f5727ee5c78>)
('fc2/affine/W', <Variable((8, 4), need_grad=True) at 0x7f5763297318>)
('fc2/affine/b', <Variable((4,), need_grad=True) at 0x7f5727d29908>)
('fc3/affine/W', <Variable((4, 2), need_grad=True) at 0x7f57632973b8>)
('fc3/affine/b', <Variable((2,), need_grad=True) at 0x7f57632974a8>)
('fc/affine/W', <Variable((2, 1), need_grad=True) at 0x7f57632974f8>)
('fc/affine/b', <Variable((1,), need_grad=True) at 0x7f5763297598>)
Both save and load functions can also be used in a parameter scope.
with nn.parameter_scope('foo'):
nn.load_parameters(path_param)
print('\n'.join(map(str, nn.get_parameters().items())))
2017-09-27 14:00:40,714 [nnabla][INFO]: Parameter load (<built-in function format>): param-vector2length.h5
('fc0/affine/W', <Variable((2, 4), need_grad=True) at 0x7f576328df48>)
('fc0/affine/b', <Variable((4,), need_grad=True) at 0x7f57245f2868>)
('fc1/affine/W', <Variable((4, 8), need_grad=True) at 0x7f576328def8>)
('fc1/affine/b', <Variable((8,), need_grad=True) at 0x7f5727ee5c78>)
('fc2/affine/W', <Variable((8, 4), need_grad=True) at 0x7f5763297318>)
('fc2/affine/b', <Variable((4,), need_grad=True) at 0x7f5727d29908>)
('fc3/affine/W', <Variable((4, 2), need_grad=True) at 0x7f57632973b8>)
('fc3/affine/b', <Variable((2,), need_grad=True) at 0x7f57632974a8>)
('fc/affine/W', <Variable((2, 1), need_grad=True) at 0x7f57632974f8>)
('fc/affine/b', <Variable((1,), need_grad=True) at 0x7f5763297598>)
('foo/fc0/affine/W', <Variable((2, 4), need_grad=True) at 0x7f5763297958>)
('foo/fc0/affine/b', <Variable((4,), need_grad=True) at 0x7f57632978b8>)
('foo/fc1/affine/W', <Variable((4, 8), need_grad=True) at 0x7f572a51ac78>)
('foo/fc1/affine/b', <Variable((8,), need_grad=True) at 0x7f5763297c78>)
('foo/fc2/affine/W', <Variable((8, 4), need_grad=True) at 0x7f5763297a98>)
('foo/fc2/affine/b', <Variable((4,), need_grad=True) at 0x7f5763297d68>)
('foo/fc3/affine/W', <Variable((4, 2), need_grad=True) at 0x7f5763297e08>)
('foo/fc3/affine/b', <Variable((2,), need_grad=True) at 0x7f5763297ea8>)
('foo/fc/affine/W', <Variable((2, 1), need_grad=True) at 0x7f5763297f48>)
('foo/fc/affine/b', <Variable((1,), need_grad=True) at 0x7f5763297cc8>)
!rm {path_param} # Clean up