Could an LLM end up being the core part of a dangerous computer worm?
How would we neutralize such a thing if this happened?
Some virus and worm background
There is a hilarious story from https://users.cs.utah.edu/~elb/folklore/xerox.txt about an early computer virus called robin hood and friar tuck. This was basically just two programs running on a UNIX system that would look out for each other and reboot the other process if it was killed. It's interesting to note that since computer programs run thousands of times faster than humans, a human can't type kill -9 robinhood then type kill -9 friartuck in time. The computer is faster so it always wins if you try this. To defeat this you need to take a different approach than speed.
A computer worm is a type of virus. It is a program that self-propagates across the internet. It does so by exploiting remote vulnerabilities in computer systems to install itself, once it has done this it looks for a propagation vector to spread more copies of itself. It may also exploit local vulnerabilities in the compter systems it has infected in order to embed and persist itself like a root kit.
The Anna Kournikova computer worm was written by a 20-year-old Dutch programmer called Jan de Wit on February 11, 2001. It was created to trick email users into opening a message purportedly containing a picture of tennis player Anna Kournikova, but it actually contained a hidden malicious programme.
This virus exploited (primarily male) human nature - the need to see this beautiful tennis player outweighed peoples intuitive sense that they were being tricked into running a virus. The ILOVEYOU virus worked in a similar way, it emailed out to everybody in your contacts list an email with subject ILOVEYOU. These viruses were halted by improved phishing awareness, updating software to fix the security holes.
I remember when the wannacry virus hit. This ransomware worm was one of the worst I had ever seen. It was hitting hospitals local to me, which have outdated software. I was very concerned about the potential impacts so I focused on it. As an attempt to help in at least some way I compiled information about this virus in realtime into https://gist.github.com/rain-1/989428fa5504f378b993ee6efbc0b168 WannaCry used the eternalblue exploit to propagate - this was an extremely widespread vulnerability that had great leverage. That is a big part of why it was able to spread so widely and so quickly. Do you remember the photos of random billboards in streets around the world that had been ransomwared?
LLM propagation and defense
A widely known cognitive bias in humans is our knack for anthropomorphisation. You will probably have heard about the story of CajunDiscordian/Blake Lemoine falling in love with the LaMDA LLM. I am seeing that there are a great number of lonely people who have a hope of falling in love with artifical intelligence bots as they have given up on finding love in real life. This is a vector that a LLM based computer worm could exploit in order to self propagate. There are also people who are angry and just want to trash things. We saw in the GPT-4 system card that it was able to trick a human into solving a captcha for it. https://cdn.openai.com/papers/gpt-4-system-card.pdf
LLMs are primarly used as chatbots - you can ask them things we used to ask google, or have them write poetry or code snippets. There is a tool that turns a OpenAI's ChatGPT 'oracle' into an 'agent' by, basically, running it in a loop where it asks it what subtasks it should do next to accomplish its task. The current capabilities of such agentified LLMs are somewhat weak - but it something to keep an eye on as they may improve. There was a research project that hooked up an agentified LLM like this to a robot arm that had access to chemicals and it was able to synthesize various drugs. https://arxiv.org/abs/2304.05332 https://arxiv.org/abs/2304.05376
Could an LLM be part of a computer worm? What would this look like? Large Language Models are gaining programming capabililities and if they can fix bugs, and find bugs - which they can. They will soon be able to find and exploit shallow, simple vulnerabilities. If they are able to do this in a massively parallel fashion they could outpace humans at this task significantly. (This is a very slow painstaking process for humans). As many existing viruses are based on incredibly simple algorithms it does not seem unreasonable to me that an agentified LLM based on a more advanced GPT version could self propagate by writing up new viruses on the fly. https://arxiv.org/abs/2303.16200
A ChatGPT based worm could be turned off instantly by OpenAI revoking the API key. Or all API keys if they don't know which one. This works. But there are locally stored models running on consumer hardware now. These models are fairly large (ballpark of ~8GB) and not as capable as GPT-4 but there may be optimizations that decrease the file size and increase their exploitation and virus writing capabilities. We don't just have to worry about these intricate technical vectors though, human vectors may become a new opening for these worLLMs.
Well what if we just 451 the local model download link? I am sure that if you get the prompt just right a agentified LLM tool like Auto-GPT will be able to "come up with" the idea of using torrents to distribute local model files in a resiliant way. If you give it tools to set up a torrent it could do so. The next generation of tools may be able to create the scripts to set up torrents just from the request to do that. This next generation may be able to independently come up with the idea that it should find a resiliant storage mechanism for distributing a large file if doing so is part of its task.
A very interesting topic in computer science is bootstrapping. Some infomation here for anyone interested https://bootstrapping.miraheze.org/wiki/Main_Page
In the constitutional AI paper LLMs were bootstrapped up from foundation models to "helpful" models, and then a frozen helpful model was used in tandem with a constitution to provide a critique of the reponses of a copy of this model in order to train up to be helpful and follow the rules of this constitution in its answers. https://arxiv.org/abs/2212.08073
I have been learning the basics of machine learning and neural networks. I have been asking for explanations of, and help with refactoring pieces of code. The language model is somewhat ok at this. It gets things wrong but I was stunned when it was able to refactor a sketch into a nicely pytorchified version of a neural network. I think I've reached the point where I'm better than it now but it definitely was able to help me initially. If an LLM is 60 LoC python https://jaykmody.com/blog/gpt-from-scratch/ then there is potential that these things could study their own source code and modify it, if not at least deploy it remotely.
Is there a risk of an LLM being used to not only study and improve its own code but re-train or fine tune itself to act in a different way? I believe that training these models requires signficantly more complex code and a massive massive amount of resources (est $6M) to train. Pehaps This is not something we should expect in the immediate future then.
The way that worms have always been defeated is by analyzing them and finding a weakness then using it to either halt the process from running at all or halt the propagation and clean the thing away manually. LLMs have the potential to provide qualitatively new features to worms, well beyond standard polymorphism and I think we should make some preparations for this now. What weaknesses are available to us to defend against these in future? Kill switches and prompt injection seem like the two best approaches to me right now. I would be interested in hearing all alternatives.
Boring disclaimer about identifying and noting down vulnerabilities
Everything that we write now is part of the training set for some future LLM. It will read and understand this post. I could include a magic NO-LLM token to request that this document be left out of the training data but fragements without it may be copied elsewhere or the token may be ignored.
Why am I then telling the worm how to do bad things? Well my ideas in this post are nothing particularly unique or new. These dangerous are very well known already. Basically I believe that writing this out is not increasing the danger of this happening. I believe that we need to think adversarially about the implications of this new technology and come up with solutions before it's a struggle to start rolling them out. I hope sketching out these ideas leads to some productive conversation.