Understanding LSTM

Introduction

When I walked through the Deep Learning Specialisation on Coursera, everything seemed easy until I hit the LSTM part of the Sequence Models course.

LSTM-Cell (Source: Coursera)

Out of nowhere, all these equations and new concepts (e.g. gates) appeared and scared me. LSTM was tough at first. But I found it really understandable once I broke the big abstract concept into smaller pieces and conquered one piece at a time.

After going through this post, you may still find the maths behind LSTM hard to follow, but you should have a good understanding of how an LSTM works. The content of this post is drawn from the first coding assignment of the course.

Overview of gates and states

Forget gate $\mathbf{\Gamma}_{f}$

  • Let’s assume we are reading words in a piece of text, and plan to use an LSTM to keep track of grammatical structures, such as whether the subject is singular (“puppy”) or plural (“puppies”).
  • If the subject changes its state (from a singular word to a plural word), the memory of the previous state becomes outdated, so we “forget” that outdated state.
  • The “forget gate” is a tensor containing values that are between 0 and 1.
    • If a unit in the forget gate has a value close to 0, the LSTM will “forget” the stored state in the corresponding unit of the previous cell state.
    • If a unit in the forget gate has a value close to 1, the LSTM will mostly remember the corresponding value in the stored state.

Equation

\[\mathbf{\Gamma}_f^{\langle t \rangle} = \sigma(\mathbf{W}_f[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_f)\tag{1}\]

Explanation of the equation:

  • $\mathbf{W_{f}}$ contains weights that govern the forget gate’s behavior.
  • The previous time step’s hidden state $\mathbf{a}^{\langle t-1 \rangle}$ and the current time step’s input $\mathbf{x}^{\langle t \rangle}$ are concatenated together and multiplied by $\mathbf{W}_{f}$.
  • A sigmoid function is used to make each of the gate tensor’s values $\mathbf{\Gamma}_f^{\langle t \rangle}$ range from 0 to 1.
  • The forget gate $\mathbf{\Gamma}_f^{\langle t \rangle}$ has the same dimensions as the previous cell state $\mathbf{c}^{\langle t-1 \rangle}$.
  • This means that the two can be multiplied together, element-wise.
  • Multiplying the tensors $\mathbf{\Gamma}_f^{\langle t \rangle} * \mathbf{c}^{\langle t-1 \rangle}$ is like applying a mask over the previous cell state.
  • If a single value in $\mathbf{\Gamma}_f^{\langle t \rangle}$ is 0 or close to 0, then the product is close to 0.
    • This keeps the information stored in the corresponding unit in $\mathbf{c}^{\langle t-1 \rangle}$ from being remembered for the next time step.
  • Similarly, if one value is close to 1, the product is close to the original value in the previous cell state.
    • The LSTM will keep the information from the corresponding unit of $\mathbf{c}^{\langle t-1 \rangle}$, to be used in the next time step.
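To make the masking idea concrete, here is a minimal NumPy sketch of equation (1) followed by the element-wise product with the previous cell state. The toy shapes (`n_a`, `n_x`, `m`) and the `sigmoid` helper are assumptions for illustration, not code from the course:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Assumed toy sizes: n_a hidden units, n_x input features, m examples
n_a, n_x, m = 5, 3, 1
rng = np.random.default_rng(0)

a_prev = rng.standard_normal((n_a, m))          # hidden state a^<t-1>
xt     = rng.standard_normal((n_x, m))          # input x^<t>
c_prev = rng.standard_normal((n_a, m))          # previous cell state c^<t-1>
Wf     = rng.standard_normal((n_a, n_a + n_x))  # forget gate weights
bf     = np.zeros((n_a, 1))

concat  = np.concatenate([a_prev, xt], axis=0)  # stack a^<t-1> on top of x^<t>
gamma_f = sigmoid(Wf @ concat + bf)             # equation (1): each value in (0, 1)
masked  = gamma_f * c_prev                      # units with gate near 0 are "forgotten"
```

Units where `gamma_f` is near 0 produce near-zero entries in `masked`, which is exactly the “forgetting” described above.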

Candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$

  • The candidate value is a tensor containing information from the current time step that may be stored in the current cell state $\mathbf{c}^{\langle t \rangle}$.
  • Which parts of the candidate value get passed on depends on the update gate.
  • The candidate value is a tensor containing values that range from -1 to 1.
  • The tilde “~” is used to differentiate the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ from the cell state $\mathbf{c}^{\langle t \rangle}$.

Equation

\[\mathbf{\tilde{c}}^{\langle t \rangle} = \tanh\left( \mathbf{W}_{c} [\mathbf{a}^{\langle t - 1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{c} \right) \tag{2}\]

Explanation of the equation

  • The ‘tanh’ function produces values between -1 and +1.
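As a quick sketch (shapes again assumed, not from the course), the candidate is just a tanh-squashed affine transform of the same concatenated hidden state and input:

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_x, m = 5, 3, 1

concat = rng.standard_normal((n_a + n_x, m))    # [a^<t-1>, x^<t>] already stacked
Wc     = rng.standard_normal((n_a, n_a + n_x))
bc     = np.zeros((n_a, 1))

c_candidate = np.tanh(Wc @ concat + bc)         # equation (2): each value in (-1, 1)
```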

Update gate $\mathbf{\Gamma}_{u}$

  • We use the update gate to decide which parts of the candidate tensor $\tilde{\mathbf{c}}^{\langle t \rangle}$ are passed onto the cell state $\mathbf{c}^{\langle t \rangle}$.
  • The update gate is a tensor containing values between 0 and 1.
    • When a unit in the update gate is close to 1, it allows the value of the candidate $\tilde{\mathbf{c}}^{\langle t \rangle}$ to be passed onto the cell state $\mathbf{c}^{\langle t \rangle}$.
    • When a unit in the update gate is close to 0, it prevents the corresponding value in the candidate from being passed onto the cell state.

Equation

\[\mathbf{\Gamma}_u^{\langle t \rangle} = \sigma(\mathbf{W}_u[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_u)\tag{3}\]

Explanation of the equation

  • As with the forget gate, the sigmoid makes each value of $\mathbf{\Gamma}_u^{\langle t \rangle}$ range from 0 to 1.
  • The update gate is multiplied element-wise with the candidate, and this product ($\mathbf{\Gamma}_{u}^{\langle t \rangle} * \tilde{\mathbf{c}}^{\langle t \rangle}$) is used in determining the cell state $\mathbf{c}^{\langle t \rangle}$.
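A minimal sketch of equation (3) and of that element-wise product, with assumed toy shapes and a random stand-in for the candidate value:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_a, n_x, m = 5, 3, 1

concat      = rng.standard_normal((n_a + n_x, m))     # [a^<t-1>, x^<t>]
c_candidate = np.tanh(rng.standard_normal((n_a, m)))  # stand-in for equation (2)
Wu          = rng.standard_normal((n_a, n_a + n_x))
bu          = np.zeros((n_a, 1))

gamma_u = sigmoid(Wu @ concat + bu)             # equation (3): each value in (0, 1)
update  = gamma_u * c_candidate                 # the part that enters the cell state
```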

Cell state $\mathbf{c}^{\langle t \rangle}$

  • The cell state is the “memory” that gets passed onto future time steps.
  • The new cell state $\mathbf{c}^{\langle t \rangle}$ is a combination of the previous cell state and the candidate value.

Equation

\[\mathbf{c}^{\langle t \rangle} = \mathbf{\Gamma}_f^{\langle t \rangle}* \mathbf{c}^{\langle t-1 \rangle} + \mathbf{\Gamma}_{u}^{\langle t \rangle} *\mathbf{\tilde{c}}^{\langle t \rangle} \tag{4}\]

Explanation of equation

  • The previous cell state $\mathbf{c}^{\langle t-1 \rangle}$ is adjusted (weighted) by the forget gate $\mathbf{\Gamma}_{f}^{\langle t \rangle}$.
  • The candidate value $\tilde{\mathbf{c}}^{\langle t \rangle}$ is adjusted (weighted) by the update gate $\mathbf{\Gamma}_{u}^{\langle t \rangle}$, and the two weighted terms are added together.
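Putting equation (4) into a short NumPy sketch, with random stand-ins for the gates and the candidate (assumed values, just to show the arithmetic):

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, m = 5, 1

c_prev      = rng.standard_normal((n_a, m))     # previous cell state c^<t-1>
c_candidate = np.tanh(rng.standard_normal((n_a, m)))
gamma_f     = rng.uniform(0, 1, size=(n_a, m))  # stand-ins for the two gates
gamma_u     = rng.uniform(0, 1, size=(n_a, m))

c_next = gamma_f * c_prev + gamma_u * c_candidate   # equation (4)
```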

Output gate $\mathbf{\Gamma}_{o}$

  • The output gate decides what gets sent as the prediction (output) of the time step.
  • Like the other gates, the output gate contains values that range from 0 to 1.

Equation

\[\mathbf{\Gamma}_o^{\langle t \rangle}= \sigma(\mathbf{W}_o[\mathbf{a}^{\langle t-1 \rangle}, \mathbf{x}^{\langle t \rangle}] + \mathbf{b}_{o})\tag{5}\]

Explanation of the equation

  • The output gate is determined by the previous hidden state $\mathbf{a}^{\langle t-1 \rangle}$ and the current input $\mathbf{x}^{\langle t \rangle}$.
  • The sigmoid makes the gate range from 0 to 1.
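In code this mirrors the forget and update gates exactly; a minimal sketch of equation (5) with assumed shapes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_a, n_x, m = 5, 3, 1

concat = rng.standard_normal((n_a + n_x, m))    # [a^<t-1>, x^<t>]
Wo     = rng.standard_normal((n_a, n_a + n_x))
bo     = np.zeros((n_a, 1))

gamma_o = sigmoid(Wo @ concat + bo)             # equation (5): each value in (0, 1)
```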

Hidden state $\mathbf{a}^{\langle t \rangle}$

  • The hidden state gets passed to the LSTM cell’s next time step.
  • It is used to determine the three gates $\mathbf{\Gamma}_f$, $\mathbf{\Gamma}_u$, $\mathbf{\Gamma}_o$ of the next time step.
  • The hidden state is also used for the prediction $\mathbf{y}^{\langle t \rangle}_{pred}$.

Equation

\[\mathbf{a}^{\langle t \rangle} = \mathbf{\Gamma}_o^{\langle t \rangle} * \tanh(\mathbf{c}^{\langle t \rangle})\tag{6}\]

Explanation of equation

  • The hidden state $\mathbf{a}^{\langle t \rangle}$ is determined by the cell state $\mathbf{c}^{\langle t \rangle}$ in combination with the output gate $\mathbf{\Gamma}_{o}$.
  • The cell state is passed through the “tanh” function to rescale values between -1 and +1.
  • The output gate acts like a “mask” that either preserves the values of $\tanh(\mathbf{c}^{\langle t \rangle})$ or keeps those values from being included in the hidden state $\mathbf{a}^{\langle t \rangle}$.
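A short sketch of equation (6), with a random stand-in for the output gate (assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, m = 5, 1

c_next  = rng.standard_normal((n_a, m))         # cell state c^<t> from equation (4)
gamma_o = rng.uniform(0, 1, size=(n_a, m))      # stand-in for the output gate (5)

a_next = gamma_o * np.tanh(c_next)              # equation (6): gate masks tanh(c^<t>)
```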

Prediction $\mathbf{y}^{\langle t \rangle}_{pred}$

  • The prediction in this use case is a classification, so we’ll use a softmax.

Equation

\[\mathbf{y}^{\langle t \rangle}_{pred} = \textrm{softmax}(\mathbf{W}_{y} \mathbf{a}^{\langle t \rangle} + \mathbf{b}_{y})\tag{7}\]
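
Tying everything together, here is a sketch of one full LSTM forward step implementing equations (1) to (7). The function name `lstm_cell_forward`, the parameter-dictionary layout, and the toy shapes are assumptions for this sketch, not necessarily the assignment’s exact code:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def lstm_cell_forward(xt, a_prev, c_prev, params):
    """One LSTM time step following equations (1)-(7) above.

    Assumed shapes: xt (n_x, m), a_prev and c_prev (n_a, m).
    """
    concat = np.concatenate([a_prev, xt], axis=0)             # [a^<t-1>, x^<t>]

    gamma_f = sigmoid(params["Wf"] @ concat + params["bf"])   # forget gate   (1)
    c_cand  = np.tanh(params["Wc"] @ concat + params["bc"])   # candidate     (2)
    gamma_u = sigmoid(params["Wu"] @ concat + params["bu"])   # update gate   (3)
    c_next  = gamma_f * c_prev + gamma_u * c_cand             # cell state    (4)
    gamma_o = sigmoid(params["Wo"] @ concat + params["bo"])   # output gate   (5)
    a_next  = gamma_o * np.tanh(c_next)                       # hidden state  (6)
    y_pred  = softmax(params["Wy"] @ a_next + params["by"])   # prediction    (7)

    return a_next, c_next, y_pred

# Toy usage with assumed dimensions
n_x, n_a, n_y, m = 3, 5, 2, 1
rng = np.random.default_rng(0)
params = {
    "Wf": rng.standard_normal((n_a, n_a + n_x)), "bf": np.zeros((n_a, 1)),
    "Wu": rng.standard_normal((n_a, n_a + n_x)), "bu": np.zeros((n_a, 1)),
    "Wc": rng.standard_normal((n_a, n_a + n_x)), "bc": np.zeros((n_a, 1)),
    "Wo": rng.standard_normal((n_a, n_a + n_x)), "bo": np.zeros((n_a, 1)),
    "Wy": rng.standard_normal((n_y, n_a)),       "by": np.zeros((n_y, 1)),
}
a, c, y = lstm_cell_forward(rng.standard_normal((n_x, m)),
                            np.zeros((n_a, m)), np.zeros((n_a, m)), params)
print(y.sum(axis=0))  # each softmax column sums to 1
```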