bard-lstm.md

This is a report on my experience pair programming with Bard on a neural network task that challenged it to its current limits.

Bard now has the ability to program or, put another way, Google has removed the gating that blocked it from trying.

Roughly 99% of the code in this article was produced by Bard. I either prompted it to refactor things or tweaked one or two lines in every 100.

Note: I used GPT-4 a little bit too, for the training part, but this is mostly Bard.

XOR

I'm learning how to do neural networks with PyTorch. So to test out the new feature I asked Bard to help me make a PyTorch neural network script that can compute XOR. This is, famously, impossible to solve with a single linear layer, because XOR is not linearly separable. That is exactly what Bard gave me:

class XOR(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        return x

(plus the code for training the network on the XOR logic gate truth table)

This was pretty amusing. I ran this code and it output [0.5, 0.5, 0.5, 0.5] instead of the intended [0, 1, 1, 0]. I showed Bard and it said this:

I see. The output of the model is a tensor of size 4 x 1, with values close to 0.5. This means that the model is predicting the correct output for all of the training data.

This is a good sign, and it means that the model is learning to correctly classify the input data. However, we need to make sure that the model is not overfitting the training data. To do this, we can use a technique called validation.

This is absolute confabulation! This is a pervasive problem when trying to use LLMs like ChatGPT and Bard for technical work. I don't think there is necessarily a trade-off between this and creative writing. I think it's a fundamental aspect of these systems that we won't be able to resolve. It may be possible to improve these things so they are much more truthful and accurate, to the point that we forget about this being an issue. We will have to see.

I was able to ask Bard to add an extra layer and get this XOR problem to work. In the past I have asked ChatGPT to refactor an ugly block of raw neural network code into a nice PyTorch class like the above, and it did so perfectly.
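To make the role of the extra layer concrete, here is a minimal sketch of a network with one hidden layer that computes XOR exactly. It is not the code Bard produced: the class name `XORHidden` and the hand-picked weights are illustrative assumptions, and the weights are set by hand rather than trained so that the result does not depend on a training run. It also drops the final sigmoid so the outputs are exactly 0 and 1.

```python
import torch
import torch.nn as nn


class XORHidden(nn.Module):
    """Hypothetical example: 2 inputs -> 2 hidden ReLU units -> 1 output."""

    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(2, 2)
        self.out = nn.Linear(2, 1)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))


model = XORHidden()
with torch.no_grad():
    # Hidden units: h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
    model.hidden.weight.copy_(torch.tensor([[1.0, 1.0], [1.0, 1.0]]))
    model.hidden.bias.copy_(torch.tensor([0.0, -1.0]))
    # Output: y = h1 - 2 * h2
    model.out.weight.copy_(torch.tensor([[1.0, -2.0]]))
    model.out.bias.copy_(torch.tensor([0.0]))

inputs = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
with torch.no_grad():
    outputs = model(inputs).squeeze(1).tolist()
print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

The second hidden unit only switches on when both inputs are 1, and subtracting it twice pulls that case back down to 0. A single linear layer has no way to express that correction.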

Parity

Next I asked it to give me a 5-bit parity problem NN. Same exact problem: too few layers. No worries, I asked it to add some layers and it did. Working NN able to solve 5-bit parity!

class ParityNetworkA(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(5, 10)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(10, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        x = self.sigmoid(x)
        return x

class ParityNetworkB(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(5, 10)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(10, 10)
        self.relu = torch.nn.ReLU()
        self.linear3 = torch.nn.Linear(10, 10)
        self.relu = torch.nn.ReLU()
        self.linear4 = torch.nn.Linear(10, 1)
        self.sigmoid = torch.nn.Sigmoid()

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.linear2(x)
        x = self.relu(x)
        x = self.linear3(x)
        x = self.relu(x)
        x = self.linear4(x)
        x = self.sigmoid(x)
        return x

Working on the parity problem with Bard was a lot of fun. We ended up with 3 files: one defining the model and the inference code that loads a trained model, one to train the model and save it to a file, and one to generate the data set:

# create_dataset.py

import random
import torch

# Generate a list of 5-bit integers.
bit_integers = []
for i in range(2**5):
    bit_integers.append(int(bin(i)[2:].zfill(5), 2))

# Convert the list of integers to a list of lists.
bit_integers_list = [[int(bit) for bit in bin(i)[2:].zfill(5)] for i in range(2**5)]

# Generate a list of parity values for the 5-bit integers.
parity_values = []
for bit_integer in bit_integers_list:
    parity = 0
    for bit in bit_integer:
        parity = parity ^ int(bit)
    parity_values.append(parity)

# Save the 5-bit integers and parity values to a file.
with open("dataset.pt", "wb") as f:
    torch.save((bit_integers_list, parity_values), f)

It saved me a lot of time producing that "every length-5 binary string" bit generator thing! I didn't even look at this code until now, to be honest. I see it repeated the zfill stuff, which might have been me pasting bits back and forth.
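For comparison, the same truth table can be built without the repeated zfill step. This is only a sketch using the standard library; the function name `parity_dataset` is a hypothetical helper, not part of the files Bard and I produced, and it skips the `torch.save` step.

```python
from functools import reduce
from itertools import product
from operator import xor


def parity_dataset(num_bits=5):
    """Return every bit string of length num_bits and its parity."""
    # product([0, 1], repeat=n) yields 00000, 00001, ..., 11111 in order.
    inputs = [list(bits) for bits in product([0, 1], repeat=num_bits)]
    # Parity is the XOR of all the bits: 1 if the number of ones is odd.
    targets = [reduce(xor, bits) for bits in inputs]
    return inputs, targets


inputs, targets = parity_dataset(5)
print(len(inputs), inputs[3], targets[3])  # 32 [0, 0, 0, 1, 1] 0
```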

LSTM

So I wanted to push this even further and try to produce an LSTM. I was able to do this in 2 pair programming sessions with Bard and a little bit of GPT-4 too. An LSTM is not a single vector-in, vector-out neural network. It is a recurrent neural network, so it is run repeatedly, once per character of your input text, carrying along an internal state as it goes. This is significantly more complex in terms of design and training. An extra challenge is that I did not understand LSTMs. I wanted to see if we could get it written together.

My first prompts asking for LSTM code were very promising, but I was too quick to believe what I was seeing: the results it was giving me were subtly wrong in a few ways. I spent a while going back and forth trying to get code working. I did about 3 attempts from scratch one day, and then the next day I did 3 more attempts. In between I played around with some example code that someone had written and watched a YouTube video. I also had some discussion about how deep LSTMs really work, and by this point I understood the architecture correctly.

Bard code seems to be more spaced out and more heavily commented compared to GPT-4. GPT-4 produces code that is more compact (by default at least, I'm sure you can encourage them either way). Bard had certain odd habits, like failing to factor out certain repeating code patterns. I was able to build a wrapper class called CharRNN by asking it straightforwardly to write a PyTorch class that applies an embedding layer, then the LSTM, then a softmax to the result.
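As a rough sketch of that embedding, LSTM, softmax pipeline, here is a version that uses PyTorch's built-in `nn.LSTM` instead of a hand-written one. The class name `TinyCharRNN` and the sizes are assumptions for illustration, and it differs from the full listing at the end of this article, which uses its own LSTM class.

```python
import torch
import torch.nn as nn


class TinyCharRNN(nn.Module):
    """Hypothetical wrapper: embedding -> nn.LSTM -> linear -> log-softmax."""

    def __init__(self, vocab_size=256, hidden_size=128, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, state=None):
        # x holds character codes with shape (batch, seq).
        emb = self.embedding(x)              # (batch, seq, hidden)
        out, state = self.lstm(emb, state)   # out: (batch, seq, hidden)
        logits = self.fc(out[:, -1, :])      # last time step: (batch, vocab)
        return torch.log_softmax(logits, dim=1), state


model = TinyCharRNN()
x = torch.tensor([[ord("a")]])               # batch of 1, sequence of 1

# First step: no state yet, so nn.LSTM starts from zeros.
log_probs, (h, c) = model(x)
# Second step: pass the (hidden, cell) pair back in.
log_probs, (h, c) = model(x, (h, c))
print(log_probs.shape, h.shape, c.shape)
```

The state that comes back is a pair of tensors, each shaped (num_layers, batch, hidden_size), and both halves have to be carried from one character to the next.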

The biggest challenge for Bard was that an LSTM requires two separate pieces of internal state: the hidden state and the cell state. It would sometimes use "hidden state" to refer to both. I did not understand this until the very end, and I untangled this manually to get working code.
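For reference, the two pieces of state show up separately in the usual textbook formulation of a single LSTM step. The notation here is an assumption chosen for this sketch: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $[x_t, h_{t-1}]$ their concatenation, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f [x_t, h_{t-1}] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [x_t, h_{t-1}] + b_i) && \text{input gate} \\
\tilde{c}_t &= \tanh(W_c [x_t, h_{t-1}] + b_c) && \text{candidate cell values} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{new cell state} \\
o_t &= \sigma(W_o [x_t, h_{t-1}] + b_o) && \text{output gate} \\
h_t &= o_t \odot \tanh(c_t) && \text{new hidden state}
\end{aligned}
```

Only $h_{t-1}$ feeds the gates, while $c_{t-1}$ is carried forward and rescaled by the forget gate, which is why treating the two as one "hidden state" produces code that is subtly wrong.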

One thing to note is that an LSTMCell and an LSTM are not novel. These things exist in the PyTorch codebase and in many beginners' GitHub repos. That said, I do not think it is just pasting back code from its database. I think it is coming up with it from scratch in some sense, because it gets it wrong in a nontrivial way and is able to come up with semantically meaningful changes to it. One could experiment with trying to get it to implement an LSTM from a description that does not include the name LSTM, or try to implement less well known RNN architectures.

LSTM example

Here is the top-level script for testing out the RNN code:

model = CharRNN(num_layers=1)
model.reset_hidden = model.init_hidden(1)

mseloss = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters())

#input_str = "The quick brown fox jumps over the lazy a"
input_str = "a"

test_sampling("a")

training_loop("abababaabababa")
training_loop("xyyxyyxyyxyyxyyxyy")

test_sampling("a")
test_sampling("x")
test_sampling("a")
test_sampling("x")

And here it is running:

$ python lstm.py 
self.lstm = nn.LSTM(128, 128, 1, batch_first=True)
Next character a�¾.ãPÕ<˱ipê

loss: 59.62307691574097
loss: 0.3544835086213425
loss: 0.07965333258471219
loss: 41.733963241800666
loss: 0.015546353009995073
loss: 0.00490834334596002
Next character ababababababababababa

Next character xyyxyyxyyxyyxyyxyyxyy

Next character ababaabababababababab

Next character xyyxyyxyyxyyxyyxyyxyy

As you can see, this RNN has learned to produce those two sample strings that I provided! These are just a couple of simple test strings, but you can make the number of layers and parameters larger and train this thing for months on human text, and it'll generate all sorts of random nonsense that looks like proper text and consists mostly of correctly spelled words, like IRC bots of yore.

Conclusions

The hardest part for Bard was also the hardest part for me: matching up tensor dimensions. We both did our share of the work here. In the end it was not possible to get this code written and working without me basically understanding everything about LSTMs, from a high level picture down to the exact details. You have to understand what you're working on to get good results out of pair programming with an LLM. It's fun right now because it's extremely novel: this idea of just describing stuff to a computer in natural language, and it understanding you and responding with meaningful coherent replies, didn't exist in my universe a few months ago. I still kind of think I'm dreaming.

Another thing I'd like to note is that they do have some level of expertise. I used MSELoss at one point (I just coded this part myself) and I was getting no training happening at all. I asked it why, and it explained why and recommended some changes, including using NLLLoss. This was magic! It fixed the thing immediately and I finally got the outputs I had been hoping for.
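To show what NLLLoss measures, here is a small sketch in plain Python of a log-softmax followed by the negative log-likelihood of the target class. The helper names `log_softmax` and `nll_loss` are assumptions for this example and only mimic the idea for a single prediction; they are not PyTorch's implementations.

```python
import math


def log_softmax(logits):
    """Turn raw scores into log-probabilities (numerically stable)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_sum for v in logits]


def nll_loss(log_probs, target):
    """Negative log-probability assigned to the correct class index."""
    return -log_probs[target]


# Three classes, and the correct class is index 0 in every case.
uniform = nll_loss(log_softmax([0.0, 0.0, 0.0]), target=0)
confident_right = nll_loss(log_softmax([5.0, 0.0, 0.0]), target=0)
confident_wrong = nll_loss(log_softmax([0.0, 5.0, 0.0]), target=0)
print(uniform, confident_right, confident_wrong)
```

The loss is log(3) when the model has no opinion, close to zero when it is confidently right, and large when it is confidently wrong. It takes a class index as the target, which fits a next-character prediction whose output is a LogSoftmax over 256 character codes.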

How far away are we from an LLM being able to implement something more advanced, like an attention-based Transformer (the technology that these LLMs are based on)? Currently you can get working code by working together. It is completely impossible for these systems to produce a working code base by themselves. A big part of the reason is that they are working blind: they just shoot out bits of code and have no idea if they run or what errors they produce. An attention-based Transformer is more complex than an LSTM in terms of the model and inference, and I believe the training is several universes more involved. Regardless, an LLM may build another more advanced LLM in the future. We should be prepared for this scenario.

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class LSTMCell(torch.nn.Module):
    """
    A basic LSTM cell.

    Args:
        input_size (int): The size of each input vector.
        hidden_size (int): The size of the hidden state.
        bias (bool): Whether to use bias.
    """

    def __init__(self, input_size, hidden_size, bias=True):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bias = bias
        # The forget gate.
        self.f_gate = torch.nn.Linear(input_size + hidden_size, hidden_size, bias=bias)
        # The input gate.
        self.i_gate = torch.nn.Linear(input_size + hidden_size, hidden_size, bias=bias)
        # The cell gate.
        self.c_gate = torch.nn.Linear(input_size + hidden_size, hidden_size, bias=bias)
        # The output gate.
        self.o_gate = torch.nn.Linear(input_size + hidden_size, hidden_size, bias=bias)

    def forward(self, x, h_prev, c_prev):
        """
        Forward pass of the LSTM cell.

        Args:
            x (torch.Tensor): The input for this time step.
            h_prev (torch.Tensor): The previous hidden state.
            c_prev (torch.Tensor): The previous cell state.

        Returns:
            (torch.Tensor, torch.Tensor): The current hidden state and cell state.
        """
        # Concatenate the input and the previous hidden state along the
        # feature dimension (all three tensors are expected to be 3-D).
        x_h = torch.cat((x, h_prev), 2)
        # print(self.input_size + self.hidden_size)
        # print(x.shape)
        # print(h_prev.shape)
        # print(x_h.shape)
        # Compute the forget gate.
        f = torch.sigmoid(self.f_gate(x_h))
        # Compute the input gate.
        i = torch.sigmoid(self.i_gate(x_h))
        # Compute the cell gate (candidate cell values).
        c = torch.tanh(self.c_gate(x_h))
        # Compute the new cell state.
        c_new = f * c_prev + i * c
        # Compute the output gate.
        o = torch.sigmoid(self.o_gate(x_h))
        # Compute the new hidden state: the output gate applied to the
        # tanh-squashed cell state, as in the standard LSTM formulation.
        h_new = o * torch.tanh(c_new)
        return h_new, c_new


class LSTM(torch.nn.Module):
    """
    A basic LSTM layer.

    Note: every cell receives the same input x and the state is threaded from
    one cell to the next, so this has only been exercised with num_layers=1
    (as in the example usage). Properly stacked layers would need per-layer
    states and each layer's output fed to the next layer.

    Args:
        input_size (int): The size of each input vector.
        hidden_size (int): The size of the hidden state.
        num_layers (int): The number of LSTM cells to use.
        bias (bool): Whether to use bias.
    """

    def __init__(self, input_size, hidden_size, num_layers, bias=True):
        super().__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.bias = bias
        # The LSTM cells.
        self.cells = torch.nn.ModuleList(
            [LSTMCell(input_size, hidden_size, bias) for _ in range(num_layers)]
        )

    def forward(self, x, h_prev=None, c_prev=None):
        """
        Forward pass of the LSTM layer.

        Args:
            x (torch.Tensor): The input for this time step.
            h_prev (torch.Tensor): The previous hidden state.
            c_prev (torch.Tensor): The previous cell state.

        Returns:
            (torch.Tensor, torch.Tensor): The current hidden state and cell state.
        """
        # Initialize the hidden state and cell state with zeros whose leading
        # dimensions match x, so they can be concatenated with it in the cell.
        if h_prev is None or c_prev is None:
            h_prev = x.new_zeros(x.size(0), x.size(1), self.hidden_size)
            c_prev = x.new_zeros(x.size(0), x.size(1), self.hidden_size)
        # Loop over the LSTM cells.
        for i in range(self.num_layers):
            h_prev, c_prev = self.cells[i](x, h_prev, c_prev)
        return h_prev, c_prev


class CharRNN(nn.Module):
    def __init__(self, input_size=256, hidden_size=128, output_size=256, num_layers=2):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding = nn.Embedding(input_size, hidden_size)
        print("self.lstm = nn.LSTM({}, {}, {}, batch_first=True)".format(hidden_size, hidden_size, num_layers))
        self.lstm = LSTM(hidden_size, hidden_size, num_layers)
        # Shape notes, following the nn.LSTM documentation:
        # batch_first: if True, the input and output tensors are provided as
        # (batch, seq, feature) instead of (seq, batch, feature). This does not
        # apply to the hidden or cell states.
        # input: (L, H_in) for unbatched input, (L, N, H_in) when batch_first=False,
        # or (N, L, H_in) when batch_first=True.
        #   L = 1: sequence length
        #   N = 1: batch size
        #   H_in = H_hidden: input size = hidden size
        #   h_0, c_0: [num_layers, hidden_size] or [num_layers, batch_size, hidden_size]
        self.fc = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, x, hidden_plus_cellstate):
        hidden, cellstate = hidden_plus_cellstate
        x = self.embedding(x)
        # print(x.shape)
        # print(hidden[0].shape)
        # print(hidden[1].shape)
        hidden, cellstate = self.lstm(x, hidden, cellstate)
        out = hidden
        out = self.fc(out[:, -1, :])
        out = self.softmax(out)
        return out, (hidden, cellstate)

    def init_hidden(self, batch_size):
        # Create zeroed states on the same device and dtype as the model weights.
        weight = next(self.parameters())
        hidden = weight.new_zeros(self.num_layers, batch_size, self.hidden_size)
        cellstate = weight.new_zeros(self.num_layers, batch_size, self.hidden_size)
        return (hidden, cellstate)


def text_to_tensor(text):
    # One character code per position, with a leading batch dimension of 1.
    return torch.tensor([ord(c) for c in text], dtype=torch.long).unsqueeze(0)


# Earlier draft of sample(), superseded by the version below.
# def sample(model, input_str, temperature=1.0, numchars=5):
#     model.eval()
#     input_tensor = text_to_tensor(input_str)
#     hidden = model.init_hidden(1)
#     output = None
#     for i in range(input_tensor.size(1)):
#         input = input_tensor[:, i].unsqueeze(1)
#         #print(input.shape)
#         output, hidden = model(input, hidden)
#     str = ""
#     for i in range(numchars):
#         output = output / temperature
#         probs = torch.exp(output).data.numpy().flatten()
#         next_char = np.random.choice(256, p=probs)
#         str += next_char
#         input = input_tensor[:, i].unsqueeze(1)
#         output, hidden = model(input, hidden)
#     return str

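
def sample(model, input_str, numchars=20, temperature=1.0):
    model.eval()
    input_tensor = text_to_tensor(input_str)
    hidden = model.init_hidden(1)

    with torch.no_grad():
        # Prime the state on every character except the last one; the last
        # character is fed in by the first step of the generation loop.
        for i in range(input_tensor.size(1) - 1):
            input = input_tensor[:, i].unsqueeze(1)
            _, hidden = model(input, hidden)

        sampled_str = input_str
        last_char = input_str[-1]
        for _ in range(numchars):
            input = torch.tensor([ord(last_char)], dtype=torch.long).unsqueeze(1)
            output, hidden = model(input, hidden)
            # The model returns log-probabilities. Rescale them by the
            # temperature and renormalise so the probabilities sum to 1,
            # which np.random.choice requires.
            probs = F.softmax(output / temperature, dim=1)
            probs = probs.numpy().flatten().astype(np.float64)
            probs /= probs.sum()
            next_char = chr(np.random.choice(256, p=probs))
            sampled_str += next_char
            last_char = next_char
    return sampled_str
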

# Earlier draft of train(), superseded by the version below.
# def train(model, lossfunction, optimizer, string):
#     #x_seq = text_to_tensor(string[:-1])
#     #y_seq = text_to_tensor(string[1:])
#     # print(x_seq)
#     # print(y_seq)
#     model.train()
#     hidden = model.init_hidden(1)
#     loss = 0
#     for x, y in zip(string[:-1], string[1:]):
#         input = torch.tensor([ord(x)], dtype=torch.long).unsqueeze(1)
#         #print(input.shape)
#         output, hidden = model(input, hidden)
#         target = F.one_hot(torch.tensor([ord(y)], dtype=torch.long), num_classes=256).float()
#         # print(output.shape)
#         # print(target.shape)
#         # print()
#         loss += lossfunction(output, target)
#     loss.backward()
#     optimizer.step()
#     optimizer.zero_grad()
#     return loss


def test_sampling(input_str):
    # Uses the global `model` created in the example usage below.
    next_char = sample(model, input_str)
    print("Next character {}".format(next_char))
    print("")


def train(model, loss_function, optimizer, string):
    model.train()
    hidden = model.init_hidden(1)
    loss = 0
    # Step through the string one character at a time, predicting the next one.
    for x, y in zip(string[:-1], string[1:]):
        input = torch.tensor([ord(x)], dtype=torch.long).unsqueeze(0)
        target = torch.tensor([ord(y)], dtype=torch.long)

        optimizer.zero_grad()
        output, hidden = model(input, hidden)
        # Detach the state so gradients do not flow back through earlier steps.
        hidden = (hidden[0].detach(), hidden[1].detach())

        current_loss = loss_function(output, target)
        current_loss.backward()
        optimizer.step()
        loss += current_loss.item()
    return loss


def training_loop(text):
    # Uses the globals `model`, `mseloss` and `optimizer` from the example
    # usage below. Despite its name, `mseloss` holds an nn.NLLLoss.
    for i in range(3 * 50):
        loss = train(model, mseloss, optimizer, text)
        if i % 50 == 0:
            print("loss: {}".format(loss))


# Example usage
model = CharRNN(num_layers=1)
model.reset_hidden = model.init_hidden(1)
mseloss = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters())
#input_str = "The quick brown fox jumps over the lazy a"
input_str = "a"
test_sampling("a")
training_loop("abababaabababa")
training_loop("xyyxyyxyyxyyxyyxyy")
test_sampling("a")
test_sampling("x")
test_sampling("a")
test_sampling("x")