This article gives a sense of how matrix operations such as matrix multiplication arise in machine learning.
Dense layers
Before we get started, we need a small bit of prerequisite knowledge about “dense layers” of neural networks.
What is a “dense layer”? Don’t worry about how “dense layers” fit into the big picture of a neural network. All we need to know is that a dense layer can be thought of as a collection of “nodes”.

Well, now we’ve raised the question: what is a “node”? A node can be thought of as a machine, associated with a set of numbers called weights, that is responsible for producing a numeric output in response to receiving some number of binary (i.e. 0 or 1) inputs. Specifically, if a node N is associated with n weights w_{1}, …, w_{n}, then, when n binary numbers x_{1}, …, x_{n} are given to N as input, N will compute the following “weighted sum” as output:
$$w_{1} x_{1} + \cdots + w_{n} x_{n}$$
For example, if a node has weights 3, 4, 5 and is passed the binary inputs 1, 0, 1, then that node will return the weighted sum 3 * 1 + 4 * 0 + 5 * 1 = 8.
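The weighted sum a single node computes can be sketched in a few lines of Python. The function name `weighted_sum` is a hypothetical choice made for illustration and does not come from any particular library.

```python
def weighted_sum(weights, inputs):
    """Return w_1*x_1 + ... + w_n*x_n for a single node."""
    if len(weights) != len(inputs):
        raise ValueError("a node needs exactly one input per weight")
    return sum(w * x for w, x in zip(weights, inputs))


# The example from the text: weights 3, 4, 5 and binary inputs 1, 0, 1.
result = weighted_sum([3, 4, 5], [1, 0, 1])
print(result)  # 8
```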
That’s it! We are done covering the prerequisites. We now know what a dense layer of a neural network is responsible for accomplishing: a dense layer contains nodes, which are associated with weights; the nodes use the weights to compute weighted sums when they receive binary inputs.
Matrices
Now, we see what happens when we consider all of the nodes in a dense layer producing their outputs at once.
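Suppose L is a dense layer. Then L consists of m nodes N_{1}, …, N_{m} for some whole number m, where the ith node N_{i} has n weights* associated with it, w_{i1}, …, w_{in}. When N_{i} receives n binary (0 or 1) inputs x_{1}, …, x_{n}, it computes
$$w_{i1} x_{1} + \cdots + w_{in} x_{n}$$
Since the above weighted sum is computed by the ith node, let’s refer to it as f(N_{i}):
$$f(N_{i}) = w_{i1} x_{1} + \cdots + w_{in} x_{n}$$
The nodes N_{1}, …, N_{m} will need to compute the weighted sums f(N_{1}), …, f(N_{m}):
$$\begin{aligned}
f(N_{1}) &= w_{11} x_{1} + \cdots + w_{1n} x_{n} \\
&\;\;\vdots \\
f(N_{m}) &= w_{m1} x_{1} + \cdots + w_{mn} x_{n}
\end{aligned}$$
The above can be rewritten using so-called column vectors:
$$\begin{pmatrix} f(N_{1}) \\ \vdots \\ f(N_{m}) \end{pmatrix} = x_{1} \begin{pmatrix} w_{11} \\ \vdots \\ w_{m1} \end{pmatrix} + \cdots + x_{n} \begin{pmatrix} w_{1n} \\ \vdots \\ w_{mn} \end{pmatrix}$$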
A column vector is simply a list of numbers written out in a column. Note that in the above we have made use of the notions of “column vector addition” and “scaling of a column vector by a number”.
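The above expression can be written as a matrix-vector product:
$$\begin{pmatrix} f(N_{1}) \\ \vdots \\ f(N_{m}) \end{pmatrix} = \begin{pmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & \ddots & \vdots \\ w_{m1} & \cdots & w_{mn} \end{pmatrix} \begin{pmatrix} x_{1} \\ \vdots \\ x_{n} \end{pmatrix} = Wx$$
Here W is the m × n matrix whose entry in row i and column j is w_{ij}, and x is the column vector of inputs x_{1}, …, x_{n}. (The matrix-vector product Wx is literally defined to be the sum of scaled column vectors described above.)
This matrix-vector product notation gives us a succinct way to express what each node in a layer does to the inputs x_{1}, …, x_{n}.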
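As a rough illustration, the sketch below computes a layer's output in two ways: row by row (one weighted sum per node) and as a sum of scaled columns. The helper names `matvec` and `column_combination`, and the example weights, are hypothetical and chosen only for this sketch.

```python
def matvec(W, x):
    """Matrix-vector product: entry i is the weighted sum computed by node N_i."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def column_combination(W, x):
    """Same result, built as x_1 * (column 1) + ... + x_n * (column n)."""
    m = len(W)
    out = [0] * m
    for j, xj in enumerate(x):
        for i in range(m):
            out[i] += xj * W[i][j]
    return out


# Hypothetical layer with m = 2 nodes and n = 3 weights per node.
W = [[3, 4, 5],
     [1, 0, 2]]
x = [1, 0, 1]

layer_output = matvec(W, x)
print(layer_output)               # [8, 3]
print(column_combination(W, x))   # [8, 3]
```

Both functions return the same list, which is the point of the definition: the row view and the column view describe one and the same product.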
Generalization
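The above approach generalizes in the following way. Assume that instead of n inputs x_{1}, …, x_{n}, we have p groups of inputs, where the ith group has inputs x_{1i}, …, x_{ni}. Additionally, define the column vector x_{i} := (x_{1i}, …, x_{ni}), so that from here on x_{i} denotes a whole group of inputs rather than a single input, and let f(N_{j}, x_{i}) denote the result of node N_{j} acting on x_{1i}, …, x_{ni}. Writing W for the layer's m × n matrix of weights, we then have
$$\begin{pmatrix} f(N_{1}, x_{i}) \\ \vdots \\ f(N_{m}, x_{i}) \end{pmatrix} = W x_{i} \quad \text{for each } i = 1, \ldots, p$$
This is equivalent to
$$\begin{pmatrix} f(N_{1}, x_{1}) & \cdots & f(N_{1}, x_{p}) \\ \vdots & \ddots & \vdots \\ f(N_{m}, x_{1}) & \cdots & f(N_{m}, x_{p}) \end{pmatrix} = W \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = WX$$
where X is the n × p matrix whose ith column is x_{i}.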
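To make the batched version concrete, here is a minimal sketch of a matrix-matrix product on nested lists. The function name `matmul` and the example values are assumptions made for illustration; each column of `X` holds one group of inputs.

```python
def matmul(A, B):
    """Product of an (m x n) matrix A and an (n x p) matrix B, as nested lists."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


# Hypothetical layer: m = 2 nodes, n = 3 weights per node.
W = [[3, 4, 5],
     [1, 0, 2]]
# p = 2 groups of inputs stored as columns: x_1 = (1, 0, 1), x_2 = (0, 1, 1).
X = [[1, 0],
     [0, 1],
     [1, 1]]

outputs = matmul(W, X)
print(outputs)  # [[8, 9], [3, 2]]
```

Column i of the result is what the layer produces for the ith group of inputs, so one product handles all p groups at once.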
Further generalization
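Now, assume that in addition to having multiple groups of inputs x_{1}, …, x_{p}, we want to pass these groups of inputs through multiple layers L_{1}, …, L_{l}. If layer L_{i} has weight matrix W_{i}, then the result of passing x_{1}, …, x_{p} through L_{1}, …, L_{l} is the following product of matrices:
$$W_{l} W_{l-1} \cdots W_{1} X$$
Here X is the matrix whose columns are x_{1}, …, x_{p}. W_{1} sits rightmost because L_{1} acts on the inputs first, and the product is defined only when each W_{i+1} has as many columns as W_{i} has rows.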
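A small sketch of passing inputs through several layers in sequence follows. It assumes that the outputs of one layer are fed directly to the next as inputs (so later layers see arbitrary numbers rather than only 0 or 1); the names `matmul` and `forward` and all example weights are hypothetical.

```python
from functools import reduce


def matmul(A, B):
    """Product of an (m x n) matrix A and an (n x p) matrix B, as nested lists."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("number of columns of A must equal number of rows of B")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


def forward(weight_matrices, X):
    """Apply layers L_1, ..., L_l in order, i.e. compute W_l ... W_1 X."""
    return reduce(lambda acc, W: matmul(W, acc), weight_matrices, X)


W1 = [[3, 4, 5],
      [1, 0, 2]]   # hypothetical layer L_1: 2 nodes, 3 weights each
W2 = [[1, 1],
      [2, 0],
      [0, 1]]      # hypothetical layer L_2: 3 nodes, 2 weights each
X = [[1, 0],
     [0, 1],
     [1, 1]]       # two groups of inputs, one per column

result = forward([W1, W2], X)
print(result)  # [[11, 11], [16, 18], [3, 2]]
```

Because matrix multiplication is associative, multiplying the weight matrices together first, as in `matmul(matmul(W2, W1), X)`, gives the same result.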
* You may object and remark that the number of weights associated with a given node depends on the node; in other words, that each node N_{i} has n_{i} weights associated with it, not n weights. We are justified in assuming that n_{i} = n for all i because we can always associate extra weights of 0 to nodes that don’t already have the requisite n weights associated with them.
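The zero-padding argument in the footnote can be sketched as follows. The helper `pad_weights` is hypothetical, and the sketch assumes that a node with fewer weights reads only the first n_{i} inputs, so the padding zeros go at the end.

```python
def pad_weights(ragged_weights):
    """Pad each node's weight list with trailing zeros so all have length n."""
    n = max(len(ws) for ws in ragged_weights)
    return [ws + [0] * (n - len(ws)) for ws in ragged_weights]


# Hypothetical layer where the second node has only 2 weights.
ragged = [[3, 4, 5],
          [1, 2]]
W = pad_weights(ragged)
print(W)  # [[3, 4, 5], [1, 2, 0]]

x = [1, 0, 1]
# The padded node ignores the extra input because its added weight is 0.
padded_output = sum(w * xi for w, xi in zip(W[1], x))
unpadded_output = sum(w * xi for w, xi in zip(ragged[1], x[:2]))
print(padded_output, unpadded_output)  # 1 1
```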