
Recurrent neural networks for personalized customer journeys

Know thy customer, flog thy product.

15 June 2022
  1. Machine-learning in marketing and services
  2. How RNNs function
  3. RNNs in retail

Machine learning for data-assisted customer relationship management is becoming an increasingly important topic in marketing and quality of service. For large companies with extensive product ranges and millions of customers, a clear challenge is the delivery of personalized service. In response to this challenge, retailers are turning to machine-learning-assisted tools that sift through transactions, customer histories, and other data in order to make better predictions about future consumer behavior from prior trends. For retailers that are able to effectively leverage the vast amounts of data at their fingertips, the potential upside is enormous. Applications range from improved customer recommendations and targeted advertising to fraud detection, real-time bidding for online ad inventory, optimized newsletter send-out timing, and more accurate product demand predictions. Without targeted assistance, on the other hand, customers may miss products of interest, be turned off by frequent irrelevant advertising, or find it easier to seek products elsewhere. The aim of this article is to introduce you to a special kind of neural network known as a recurrent neural network, which may be of particular interest for your business. We can implement similar tools for you. Feel free to reach out.

Machine-learning in marketing and services

The question many retailers face is how best to make use of the data they have at their fingertips. Traditional approaches to machine learning rely on input of a fixed format, which allows a fixed neural architecture to take advantage of the known, constant size of the input data. The classic example is image recognition, in which the input is always represented as an n×n grid of colored pixels, while the output corresponds to a finite list of target labels (e.g. numbers, letters, or the names of animals). In contrast, commercial data is usually structured as a time series, with different customers having vastly different purchasing histories. The traditional approach to dealing with diverse consumer behaviors and purchasing channels has been to rely on feature engineering, which attempts to re-express large and amorphous input datasets in terms of a small number of key parameters that can then be fed into some fixed neural architecture. For instance, the complicated purchasing history of any customer can be re-parameterized in terms of so-called 'RFM' data detailing the 'recency', 'frequency', and 'monetary' value of a full purchasing history. The benefit of feature engineering is its simplicity, and that it evolves naturally from the classical analysis of large commercial datasets (RFM data, for instance, has been used for many years to calculate the CLV, or 'customer lifetime value', of customers). The downside, however, is that determining which finite set of features best captures the essence of consumer histories is a time-consuming and fiddly art, which at its core involves throwing away input data. If not done prudently, this reduction risks drawing incorrect or incomplete conclusions. It would be better to have access to the full richness of the dataset. After all, isn't that the promised benefit of machine learning over traditional approaches in the first place?

Because consumer histories are inherently sequential and of varying length, a natural approach to consider in place of feature engineering is that of recurrent neural networks (RNNs). RNNs are artificial neural networks that are specifically designed to deal with sequential data of varying lengths. RNNs use a fixed neural architecture, but this architecture is applied to each element of a sequence of data in turn. At each step, output from the preceding step is included along with the input, providing 'contextual' information for the calculation. For the engineers reading, the difference between RNNs and more traditional feed-forward neural networks is analogous to the difference between open-loop and closed-loop control theory. The following schematic of an unrolled RNN is helpful for understanding the idea.

The kinds of input sequences that you should be thinking of are things like English sentences or songs, where each element of the sequence is a word or a note. A useful way of understanding an RNN is to think of it 'unrolled', acting on each element of a sequence of data one step at a time. At each step in an input sequence, the network takes in two pieces of input: (i) external input (e.g. a word, a photo, a piece of music), and (ii) output 'memory' from the preceding step. In turn, the network produces two pieces of output at each moment in time: (i) some usable output (perhaps an image, or a word), and (ii) an output 'memory' to feed into the next iteration of the network. The specifics of the input and output, as well as the internal topology of the network, depend on the problem at hand. We will say a bit more about these shortly. For now, the key point is that the same network is employed at each step in the sequence, which allows RNNs to process arbitrarily long inputs and to produce arbitrarily long outputs.
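To make the unrolling concrete, here is a minimal sketch of a vanilla RNN step in Python with NumPy. The tanh activation, the weight shapes, and the dimensions are illustrative assumptions rather than a prescription; the point is simply that the same weights are reused at every step while the hidden state carries the 'memory' forward.

    import numpy as np

    def rnn_step(x_t, h_prev, W_x, W_h, b):
        """One unrolled step: combine the current input with the 'memory'
        from the preceding step to produce the new hidden state."""
        return np.tanh(W_x @ x_t + W_h @ h_prev + b)

    # Illustrative dimensions: 4-dimensional inputs, 8-dimensional hidden state.
    rng = np.random.default_rng(0)
    W_x, W_h, b = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8)

    sequence = [rng.normal(size=4) for _ in range(5)]   # any length works
    h = np.zeros(8)                                     # initial 'memory'
    for x_t in sequence:
        h = rnn_step(x_t, h, W_x, W_h, b)               # same weights at every step
    # h now summarizes the whole sequence and could feed a prediction layer.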

While the key idea behind RNNs, feeding output back in as part of the input, may sound simple, in practice it is extremely powerful, and has achieved remarkable results in natural language processing, speech translation, and a range of other domains. A classic example is a time series, in which temporal context should be taken into consideration at each moment in time. Take for instance the problem of determining the future or past trajectory of a thrown ball from a sequence of photos. From a single photo it would be very difficult (and perhaps even impossible) to determine in which direction the ball had initially been thrown or where it is headed. With two or more photos, and knowledge of the order in which they were taken, it becomes possible to piece together the trajectory of the ball over time. Said another way, context allows one to know not only the position of the ball at a given point in time, but also its derivatives.

How RNNs function

Before discussing potential business applications, let's pull back the curtain just a little to see how RNNs work. In order to properly motivate the neural architecture behind RNNs, it is helpful to first mention how RNNs are trained. As with any neural network, training is achieved by updating the internal weights of the network in order to minimize the error of a predicted output compared with the expected output for a given input. For those completely unfamiliar with neural networks, you can think of one as a 'black box' that takes some input and uses it to produce an output. The 'weights' determine exactly what function is performed by the network, and so 'training' a network adjusts how the black box performs.

For a standard feed-forward neural network, this training process follows three key steps (sketched in code after the list):

  • Feed some input into the network to calculate a predicted output.
  • Calculate the 'error', or difference from the expected output, as well as the derivatives of the error with respect to the network weights.
  • Use the derivatives of the error to adjust the network's internal weights in order to minimize the error.
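As a toy illustration of the three steps above, the following sketch trains a single linear neuron on one example with a squared-error loss. All names and numbers are made up for the example.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=3)              # network weights
    x = np.array([0.5, -1.0, 2.0])      # one training input
    y_expected = 1.0

    # 1. Feed the input through the network to get a predicted output.
    y_pred = w @ x

    # 2. Calculate the error and its derivatives with respect to the weights.
    error = y_pred - y_expected
    grad = error * x                    # d(0.5 * error**2) / dw

    # 3. Adjust the weights in the direction that reduces the error.
    learning_rate = 0.1
    w -= learning_rate * grad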

This training process, known as backpropagation, is repeated many times, preferably with a large amount of training data, with the ultimate goal of developing a set of weights capable of producing highly accurate output predictions for any given input. For RNNs, the analogous process is known as backpropagation through time (BPTT). Conceptually it is very similar to regular backpropagation, and can be understood by analyzing the unrolled neural network applied to a sequence of input data (as in the very first figure above). Each step of the unrolled network can be viewed as a 'hidden layer' of a regular feed-forward deep neural network, with each layer being an identical copy of the same network. In this way, the exact same training techniques used in feed-forward networks (which at their core reduce to the repeated application of the chain rule for derivatives) can also be applied to RNNs. In short: errors are calculated at each step, and the weights are correspondingly updated.
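The following toy sketch makes the unrolled view explicit for a one-dimensional RNN with a loss on the final step only: the forward pass stores each hidden state, and the backward pass applies the chain rule step by step from the end of the sequence back to the start, accumulating gradients for the shared weights. The scalar weights and the single-target loss are simplifying assumptions for illustration.

    import numpy as np

    # Toy 1-D RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t), loss on the final step.
    w_h, w_x = 0.5, 1.0
    xs = [0.2, -0.1, 0.4]               # a short input sequence
    target = 0.3

    # Forward pass: unroll over the sequence, storing hidden states.
    hs = [0.0]
    for x in xs:
        hs.append(np.tanh(w_h * hs[-1] + w_x * x))
    loss = 0.5 * (hs[-1] - target) ** 2

    # Backward pass through time: chain rule applied from the last step back
    # to the first, accumulating gradients for the shared weights.
    dh = hs[-1] - target                # dL/dh_T
    grad_w_h, grad_w_x = 0.0, 0.0
    for t in reversed(range(len(xs))):
        dpre = dh * (1.0 - hs[t + 1] ** 2)   # through the tanh
        grad_w_h += dpre * hs[t]
        grad_w_x += dpre * xs[t]
        dh = dpre * w_h                      # pass the gradient to the previous step

    # One gradient-descent update of the shared weights.
    lr = 0.1
    w_h -= lr * grad_w_h
    w_x -= lr * grad_w_x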

From the brief description above, readers familiar with training deep neural networks may be anticipating some of the problems that arise when training RNNs. Deep neural networks often suffer from 'gradient' problems, in which the training algorithm is effective at training the last layers of a deep neural network but becomes increasingly less effective the deeper into the network it reaches. Intuitively, this effect stems from the dependence of each layer on input from the preceding layers, so the effects compound as one proceeds through the layers. This can lead to derivatives in the backpropagation calculation that either vanish or explode, resulting in slow and noisy learning. Because RNNs are designed to work on sequences of arbitrary length, when unrolled (as in the first figure above) they may be viewed as deep neural networks of arbitrary depth. The usual result is 'memory loss' in training: as the training algorithm adjusts the weights of a network while working over a sequence of training data, it does a much better job towards the end of the sequence. RNNs therefore tend to naturally assign far more importance to the end of an input data sequence.
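A back-of-the-envelope illustration: the gradient that reaches an early step in the unrolled network is, roughly speaking, a product of one factor per intervening step. If those factors are typically smaller than one, the gradient shrinks exponentially with distance; if they are larger than one, it explodes. The specific factors below are arbitrary.

    # Rough illustration of vanishing and exploding gradients: one multiplicative
    # factor per step separating the loss from an early part of the sequence.
    for steps in (5, 20, 50):
        print(steps, 0.5 ** steps, 1.5 ** steps)
    # With factor 0.5 the contribution is below 1e-15 after 50 steps (vanishing);
    # with factor 1.5 it is roughly 6e8 (exploding).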

At its heart, the reason that vanilla RNNs are subject to gradient problems is that the output from each step in the unrolled network is only shared with the next step in the sequence, so there is no direct connection between the start and the end of the sequence. An obvious solution would be to feed the output of each step directly into every subsequent step, but this would drastically increase the number of connections and the complexity of the algorithm. A less invasive approach is to explicitly create a single long-term memory state that is propagated through the network and shared at each step in the sequence along with the usual short-term memory. This is the approach of 'long short-term memory' networks, or LSTMs, which have been employed to great effect in circumventing the gradient problems of RNNs. At each step when analyzing a sequence of data, an LSTM receives the current long-term memory of the network, known as the 'cell state', in addition to the usual output from the previous point in the sequence and the external input data for the current point in the sequence.

The key new idea in LSTMs is the 'cell state', which provides contextual information about the data that has already been processed in a sequence. The beauty of LSTMs is that the network itself is trained to regulate the flow of information into and out of the cell. LSTMs are composed of three smaller neural networks known as the 'forget' gate, the 'new memory' (or input) gate, and the 'output' gate, which can be understood schematically from the following diagram of an LSTM cell.

The first gate, known as the forget gate, decides which information in the incoming long-term memory, the 'cell state' vector, is important, and which should be dropped. To do so it takes the output from the previous step in the sequence, together with the external input, and uses it to generate a 'forget' vector with elements between 0 and 1 and the same dimension as the cell state vector. The forget vector and cell state are multiplied together point-wise, which has the effect of removing information from the cell state, or long-term memory. In other words, the forget gate decides which pieces of the long-term memory should be forgotten, given the output of the preceding hidden state and the current data in the input sequence.

The next gate, known as the 'new memory' gate ('New Memory Update Gate' in the diagram), determines what new information should be added to the long-term memory cell state. To do so it uses the output from the previous hidden state together with the new input data to generate a 'new memory update' vector with both positive and negative elements. This vector is added to the cell state, so that the long-term memory of the network is updated with information that depends on the new input data and on the context given by the previous hidden state. In combination, the 'forget' and 'new memory' gates update the long-term memory, or cell state, of the network. The final 'output' gate then combines the data from the cell state, the input, and the output of the preceding step to produce the output, which is passed along to the next step in the sequence together with the cell state.
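Putting the gates together, here is a minimal NumPy sketch of one LSTM step in one common formulation, in which the new-memory update is split into a candidate vector and an input gate that scales it. All weight names, shapes, and dimensions are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM step: W and b hold the weights of the forget, input,
        candidate ('new memory'), and output gates (illustrative names)."""
        z = np.concatenate([h_prev, x_t])                   # previous output + current input
        f = sigmoid(W["forget"] @ z + b["forget"])          # what to keep of the old cell state
        i = sigmoid(W["input"] @ z + b["input"])            # how much new memory to admit
        g = np.tanh(W["candidate"] @ z + b["candidate"])    # candidate new memory (can be +/-)
        c = f * c_prev + i * g                              # updated long-term memory (cell state)
        o = sigmoid(W["output"] @ z + b["output"])          # what to expose as output
        h = o * np.tanh(c)                                  # short-term memory / output
        return h, c

    # Illustrative dimensions: 4-dimensional input, 8-dimensional hidden/cell state.
    rng = np.random.default_rng(0)
    W = {k: rng.normal(size=(8, 12)) for k in ("forget", "input", "candidate", "output")}
    b = {k: np.zeros(8) for k in W}
    h, c = np.zeros(8), np.zeros(8)
    for x_t in [rng.normal(size=4) for _ in range(5)]:      # a short input sequence
        h, c = lstm_step(x_t, h, c, W, b)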

RNNs in retail

We experience the power of RNNs every day, often without realizing it, when we use auto-complete on our mobile phones or when Google suggests search queries. The same properties that make RNNs ideal for language processing also make them perfectly suited for tasks in retail and commerce such as customer sentiment analysis, calculating customer lifetime value, attribution analysis, and next-purchase prediction, as well as analyzing other consumer behaviors and interests. From the perspective of data processing, customer transaction histories, just like sentences, are sequences of 'words' of arbitrary length in which the ordering holds information. RNNs also reduce the effort spent on feature engineering: they make it possible to feed in sequential data where otherwise only aggregated data would be usable, thereby increasing the quality of the analysis, since data aggregation always translates into a loss of information.

Take the example of a return forecast model, where you want to predict the likelihood that an ordered item will be returned. You can immediately write down a long list of input information that should be relevant for this prediction. However, it comes in two different types. On the one hand, you have sequential data, such as the return rate for each of the customer's past orders. On top of that, you also have non-sequential, static information about the customer and the current order, like the time of the order or whether the customer has subscribed to your newsletter. If you want to get the most out of your model, it should be able to process both types of data, and it is for the sequential data that RNNs come into play.
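As a sketch of how the two types of data could be combined, the following Keras model feeds a padded sequence of past-order features through an LSTM and concatenates the result with the static customer and order features before predicting a return probability. The feature counts, layer sizes, padding value, and variable names are assumptions made for illustration, not a fixed recipe.

    from tensorflow import keras
    from tensorflow.keras import layers

    # Two inputs: a variable-length sequence of past-order features (illustratively
    # 6 per order) and a small vector of static customer/order features.
    past_orders = keras.Input(shape=(None, 6), name="past_orders")
    static = keras.Input(shape=(4,), name="customer_and_order")

    x = layers.Masking(mask_value=0.0)(past_orders)   # ignore padded steps in shorter histories
    x = layers.LSTM(32)(x)                            # summarize the purchase/return history
    combined = layers.Concatenate()([x, static])
    combined = layers.Dense(16, activation="relu")(combined)
    return_probability = layers.Dense(1, activation="sigmoid")(combined)

    model = keras.Model(inputs=[past_orders, static], outputs=return_probability)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])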

By now, you should have a big-picture view of what RNNs are and roughly how they function. You may already be imagining intriguing new use cases for your own business. The applications in retail are nearly endless and limited only by your imagination, with huge potential for creating a data-driven, highly efficient business. If you have an application you would like to discuss, we will be happy to help bring your ideas to life. From analyzing transaction tables to customer classification and segmentation to making predictions based on customer online activity, RNNs provide a host of possibilities.


