The Quest for a Replacement MegaHAL

Why

Blah blah IRC UTF8 blah blah fucking hell, I'll write this up when I fucking feel like it

MegaHAL brain/corpus details

  • Brain age: ~6.5 years (October 2012 to May 2019)
  • Brain size: 8.1MBytes on disk
  • Training file: 16,085 lines; mostly ASCII (some UTF8)

Trial By Fire

markovify

https://github.com/jsvine/markovify

Written in Python, supporting versions 2.7 through 3.6. At first glance it looked like a perfect replacement: "oh wow, it can generate a random sentence based on its corpus (fed data)!"

I fed it about 20,000 lines of corpus data and it seemed to work: I could generate a random sentence from past lines, and the output seemed humorous in the same way MegaHAL's was. There was just one problem...

MegaHAL (generally; there are functions to bypass this behaviour) operates as follows: if you give it a word/sentence, it will not only learn from what you tell it, but it will give you a reply that usually relates to what you just told it. The keywords are "relates to". markovify does not natively support this.

When I had just about given up, I noticed make_sentence_with_start() and its strict=False capability:

Tries making a sentence that begins with beginning string, which should be a string of one to self.state words known to exist in the corpus.

If strict == False, then markovify will draw its initial inspiration from any sentence containing the specified word/phrase.

This sounded like exactly what I wanted... except it behaves nothing like what is described. make_sentence_with_start("prostate", strict=False) resulted in responses that, for the most part, had nothing to do with "prostate"... and every response, of course, started with the string prostate. The more I thought about this function, the more it felt almost pointless.
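To illustrate, here's a minimal sketch against markovify's documented API (corpus.txt is a placeholder for my training file):

import markovify

# Build a model from the corpus; training "just works".
with open("corpus.txt") as f:
    model = markovify.Text(f.read())

# Random generation works fine (returns None if no sentence can be built):
print(model.make_sentence())

# But this merely *starts* the output with the given word; the rest of
# the sentence rarely has anything to do with it:
print(model.make_sentence_with_start("prostate", strict=False))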

In contrast: MegaHAL's learn() function accepts a string -- a sentence a human would write, including single words like "prostate" -- splits it up into words, then proceeds to learn from what it was fed, then returns a capitalised reply that has keyed off the aforementioned words.

I found no way to simulate/replicate this behaviour in markovify. Bummer, because compared with all the rest of the trash below (and they are trash), this proved to be the best of the bunch; I really wanted markovify to be the solution.

megahal (Python + C version)

https://github.com/4ZM/megahal

Written in Python, unknown supported versions. Uses Extensions (i.e. Python/C bindings: it compiles the actual C-based MegaHAL source, then glues the two PLs together via the Extension).

At first glance, this seemed awesome. Why? Because this is exactly what AI::MegaHAL does, and there it works extremely well: fast, non-bloated, and the underlying perlxs code is mostly legible (enough for me to add a C function and the related perlxs tie-in without any trouble).

The first thing I learned is that this version is intended for Python 2.x; the Python interface/extension is not 3.x-compatible. I took a stab at converting it, but this led me down a rabbit hole of terrible documentation that cannot be easily summarised other than as "terrible". A colleague of mine did a better job explaining it than I could:

they seem to love writing tons of useless prose that's neither good as a reference, nor good as a specification: more like obese books that tell you nothing useful, but presented as references. and of course, since it's another turd dynlang, any sort of generated documentation output usually doesn't tell you the most basic shit: what arguments are valid input to a given goddamn function (yes, there are type annotations for python now; 30 years too late)

I then switched back to Python 2. I was at least able to get something functional... sort of.

Sadly, the Python+C version is quite terrible in comparison. None of the Python extension/interface functions behave identically to the C version, instead opting for basically hard-coding everything. The interface lacks many functions and much overall functionality, which is very disappointing. It seems the individual who wrote this interface chose to diverge from many of the expected norms of MegaHAL and implement something bizarre.

However, I do expect that on some level, if someone were to step up to the plate and actually write a decent interface to the C version, it would be usable. But in its current (IMO, half-assed) state, it cannot be used decently.
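For what it's worth, a decent interface might not even need a compiled Extension. Here's a minimal ctypes sketch of the idea -- assuming megahal.c built as a shared library, and megahal_* names/signatures as I remember them from megahal.h (verify against your copy of the source):

import ctypes

# Assumes something like: cc -shared -fPIC megahal.c -o libmegahal.so
hal = ctypes.CDLL("./libmegahal.so")

# Declared per megahal.h (an assumption; check the header):
hal.megahal_do_reply.restype = ctypes.c_char_p
hal.megahal_do_reply.argtypes = [ctypes.c_char_p, ctypes.c_int]

hal.megahal_initialize()                   # loads (or creates) megahal.brn
reply = hal.megahal_do_reply(b"tell me about IRC", 0)
print(reply.decode("latin-1", "replace"))  # brain text may not be UTF8
hal.megahal_cleanup()                      # saves the brain back to disk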

cobe

https://github.com/pteichman/cobe

Written in Python, unknown supported versions. I tried 2.7.16 and 3.6.8.

This software was a saga in itself, and tells the tale of what I consider a huge problem in the open-source world: maintainer neglect. Read on for tales of woe.

With 2.7.16: I was unable to get this software to run/function due to a deeper dependency problem not even directly related to cobe itself. It indirectly depends, through a chain of dependencies, on an egg called more_itertools, version 7.0.0. Apparently version 5.0.0 is the last version to support Python 2.x; versions 6.0.0 and newer intentionally dropped it (tests and runtime both fail).

With 3.6.8: the more_itertools problem was no more, but now we had an actual problem with cobe itself: it uses 2.x syntax in places, such as print filename (note the missing parens).
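That class of breakage is at least mechanical to fix (filename below is just a placeholder variable):

filename = "corpus.txt"  # placeholder
# Python 2 statement form, as found in cobe (a SyntaxError under 3.x):
#   print filename
# Python 3 function-call form (also valid under 2.7):
print(filename)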

So it seems cobe is written in Python 2.x, but cannot actually be used with Python 2.x because of the aforementioned more_itertools problem. (Remember: it does not depend directly on more_itertools, but rather some other package that depends on something that then depends on more_itertools.)

(I should add that as of this writing, the same problem happens via pip install cobe when using Python 3.x, so this is universally broken.)

It was around this time that I noticed there were 47 forks of this software on GitHub. This was followed by discovering many Issues and a few PRs submitted for all sorts of things.

I had looked at trying to update cobe's code myself to Python 3.x, starting with the outdated print statements, but apparently there was a lot more to fix.

Upon reviewing both of the Python 3 PRs, I noticed differences. PR #26 used 2to3, which had tons of fixes (and broke TravisCI, but that's because the version of Python used there is 2.7.6 (!)), while the older PR #24 had different fixes, particularly UTF-8-related fixes when training from a file.

This is a great example of why maintainers need to be responsible and not neglectful of their projects. Whose fork to use?!?

I opted to try GitHub user gnowxilef's fork master branch, since their PRs had been merged in the past (i.e. the author presumably trusted them). Nope, failure:

Traceback (most recent call last):
  File "bin/cobe", line 11, in <module>
    load_entry_point('cobe==2.1.2', 'console_scripts', 'cobe')()
  File ".../.local/lib/python3.6/site-packages/cobe-2.1.2-py3.6.egg/cobe/control.py", line 42, in main
AttributeError: 'Namespace' object has no attribute 'run'
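For what it's worth, that failure pattern usually points at the common argparse subcommand idiom: each subcommand attaches its handler via set_defaults(run=...), and main() blindly calls args.run(args). If the registration is broken, or no subcommand gets matched, the Namespace never grows a run attribute. A minimal sketch (an assumption about how cobe's control.py is wired, not a quote of it):

import argparse

def learn(args):
    print("learning from", args.filename)

parser = argparse.ArgumentParser(prog="cobe")
subparsers = parser.add_subparsers()

learn_parser = subparsers.add_parser("learn")
learn_parser.add_argument("filename")
learn_parser.set_defaults(run=learn)  # attaches the handler to the Namespace

args = parser.parse_args()
args.run(args)  # AttributeError here if nothing ever set "run"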

I then tried CrazyPython's fork: same failure. This is the exact type of shit that drove me away from Linux during the 1.3.x days and over to FreeBSD.

I then began to sift through all 47 forks, one at a time, to see who had commits more recent than master. This led me to the following discoveries:

  • chauffer/cobe -- had a commit adding Python 3 support, but the repo had been archived
  • mumtoofs/cobe -- a fork of chauffer/cobe, but not archived
  • Circlepuller/cobe -- had a commit adding Python 3 support
  • garym/cobe -- had a commit adding Python 3 support, but was 10 commits behind master
  • appscluster/cobe -- a fork of gnowxilef/cobe with additional fixes
  • slickriptide/cobe -- various commits, including Python 3 fixes (and some wonky commits too)

I should note all of these individuals' Python 3 commits were different in various ways from one another.

I eventually stopped when I noticed these three things:

  1. LeMagnesium/cobe -- commit 5d2d4b69
  2. tvishwanadha/cobe -- commits
  3. Wintervenom/cobe -- commits

The first implied the IRC client that came with it didn't support being in multiple channels at once. Hrm... okay, so I'd have to do my own IRC client interface -- that's fine, but shows neglect.

The second indicated the IRC client egg/package used by this software was extremely old (their commits were from 2013, so 3 years before the last commit from the actual cobe author!). Wow, what the fuck?

The third implied things like cobe dump didn't work, amongst other issues.

I paused and decided to go back to the official repository; fuck these forks. I couldn't believe all the chaos. I did my own Python 3 fixes (haphazardly, i.e. just print statements). This got me a bit further, only to encounter a whole new problem:

Traceback (most recent call last):
  File "./cobe", line 11, in <module>
    load_entry_point('cobe==2.1.2', 'console_scripts', 'cobe')()
  ...
  File ".../.local/lib/python3.6/site-packages/cobe-2.1.2-py3.6.egg/cobe/control.py", line 6, in <module>
  File ".../.local/lib/python3.6/site-packages/cobe-2.1.2-py3.6.egg/cobe/commands.py", line 13, in <module>
  File ".../.local/lib/python3.6/site-packages/cobe-2.1.2-py3.6.egg/cobe/brain.py", line 11, in <module>
  File "/usr/local/lib/python3.6/sqlite3/__init__.py", line 23, in <module>
    from sqlite3.dbapi2 import *
  File "/usr/local/lib/python3.6/sqlite3/dbapi2.py", line 27, in <module>
    from _sqlite3 import *
ModuleNotFoundError: No module named '_sqlite3'

Review of cobe/brain.py sure enough showed import sqlite3, yet nowhere in setup.py was this dependency declared. It turns out sqlite3 ships as part of the Python standard library, so neither setup.py nor PyPI can provide it -- and FreeBSD, helpfully, splits the module out into its own separate package. So, to get this to work, I had to install py36-sqlite3 system-wide via pkg on FreeBSD.

At this point, I stopped dead in my tracks and thought about how I was spending my time. This was an endless rabbit hole. All this time spent just for one program that was known to be totally neglected? No. My time is worth more than this.

I rm -fr'd everything pertaining to this software and removed the system packages I had installed, since odds were -- given the Issues and PRs -- that this stuff would just break further at run-time anyway.

jsmegahal

https://github.com/seiyria/jsmegahal

Written in Node.js and/or CoffeeScript. Was able to get it built/installed using Node v10.x on FreeBSD via npm.

However, Node is not a language I am familiar with. I found things like reading a file (to train it) tedious as hell, and the same went for doing something like executing a simple shell pipeline of commands (to find out memory usage before/after training). Not my cup of tea.

Thus, I can't really comment on whether or not this is a worthy contender because I gave up struggling with the PL in general.

megahal (native Python)

https://github.com/krmaxwell/megahal

Written in Python, unknown supported versions. I tried 2.7.16 and 3.6.8.

I first tried 3.6.8. This failed due to use of functions like xrange, but I was able to convert those and other bits with ease. However, something curious happened that should have acted as a premonition of the horrible quest I was about to embark upon:

    for i in xrange(self.order + 1, 0, -1):
NameError: name 'xrange' is not defined
Segmentation fault (core dumped)
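(The xrange half of that is the usual trivial compatibility shim, shown below; the Segmentation fault is another story.)

try:
    xrange              # exists on Python 2
except NameError:
    xrange = range      # Python 3 folded xrange into range()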

The actual Python crash was what shocked me. This problem was reproducible. I took a look at the stack trace, hoping to glean some small tidbit of an idea as to why the PL would be crashing on such a simple error:

#0  0x0000000800cbb9c9 in PyModule_GetState () from /usr/local/lib/libpython3.6m.so.1.0
#1  0x0000000802a1f67b in ?? () from /usr/local/lib/python3.6/lib-dynload/_pickle.so
#2  0x0000000802a1cbbe in ?? () from /usr/local/lib/python3.6/lib-dynload/_pickle.so
#3  0x0000000802a2a86b in ?? () from /usr/local/lib/python3.6/lib-dynload/_pickle.so
#4  0x0000000800cba60f in _PyCFunction_FastCallDict () from /usr/local/lib/libpython3.6m.so.1.0
#5  0x0000000800d3988d in ?? () from /usr/local/lib/libpython3.6m.so.1.0
#6  0x0000000800d36ee4 in _PyEval_EvalFrameDefault () from /usr/local/lib/libpython3.6m.so.1.0
#7  0x0000000800d3ae59 in _PyFunction_FastCallDict () from /usr/local/lib/libpython3.6m.so.1.0
#8  0x0000000800c71aaf in _PyObject_FastCallDict () from /usr/local/lib/libpython3.6m.so.1.0
#9  0x0000000800c71c48 in _PyObject_Call_Prepend () from /usr/local/lib/libpython3.6m.so.1.0
#10 0x0000000800c718e6 in PyObject_Call () from /usr/local/lib/libpython3.6m.so.1.0
#11 0x0000000800cd5fb0 in ?? () from /usr/local/lib/libpython3.6m.so.1.0
#12 0x0000000800cd5666 in ?? () from /usr/local/lib/libpython3.6m.so.1.0
#13 0x0000000800d33585 in _PyEval_EvalFrameDefault () from /usr/local/lib/libpython3.6m.so.1.0

Valid and botched at the same time? What in god's name was going on here? No, on second thought, I don't even want to know.

I switched to 2.7.16 where I was able to get things at least functional. But I will warn readers here and now: it only gets worse.

The issues were hideous in several ways:

  1. Dogshit slow. It took a very, VERY long time to train.

  2. Lurking bugs as a result of a haphazardly-done porting job. Upon feeding it a corpus, it raised an exception at runtime (but no segfault) like so:

  File ".../.local/lib/python2.7/site-packages/megahal.py", line 172, in boundary
     string[position + 1].isalpha()):
IndexError: string index out of range

I fired up python2 -m pdb test.py to see what was going on: it was crashing when encountering a learn/training line ending with an apostrophe. In this case, the variables and their contents for the problematic line:

string   = "THIS THING KEEPS ON JAMMIN'"
phrase   = "JAMMIN'"
position = 7

Yeah, index 7+1 is certainly past the end of that string, so the exception was legitimate. But wait... why didn't the C version blow up or misbehave with such a string?

Once I began to look at the actual code (both the Python and the deeper innards of MegaHAL), it quickly dawned on me: this was quite literally ported from C -- as in, the author sat with two windows and wrote the Python code to be as identical as possible to the C, nearly all the way down to how strings were parsed/managed.

Apostrophe in MegaHAL is treated specially for some unmentioned reason; the special-casing lives in MegaHAL's boundary() function.

I suspect it gets special treatment to try and handle contractions like they're or there's. I really don't know; the comments suck.

If you look closely at the C code, you'll see that the malloc() calls for a word add 1 to the size, needed for the trailing NUL terminator. isalpha(string[position+1]) would therefore be comparing a byte of 0x00. That byte is certainly not in the range [A-Za-z], so the function returns 0; more importantly, it isn't reading some unrelated byte of memory (or outright crashing, as it could if the byte past the end landed in a page not owned by the process).

I suspect Jason Hutchens (author of MegaHAL) "got lucky" with this piece of code. I firmly believe the whole blind position+1 approach was written under bad assumptions. I readily admit string parsing/splitting is not particularly "fun" in C, but why weren't functions like strtok(), strsep(), and/or strpbrk() used instead? strtok() and strpbrk() are part of ISO C90, and strsep() is a BSD extension; all were available by the early 1990s, and MegaHAL was written in 1997.

Did Hutchens not think about the edge cases? I can't speak for other programmers, but when I deal with string input -- even in Perl -- I am meticulous as hell.

Anyway, the silly workaround I came up with was to simply change the nested conditionals to the following:

                elif (string[position] == "'" and
                    string[position - 1].isalpha() and
                    position + 1 < len(string) and  # bounds check; short-circuits before the next test
                    string[position + 1].isalpha()):
                    boundary = False
  3. A gargantuan memory hog. Once trained, the underlying Python process took up around 450MBytes of RAM (RSS/RES); the native C version is around 30MBytes. But it's worse than that...

Any time the brain was written to disk, the size of the process would double (not to mention take a good 30+ seconds). Given that the system this software runs on contained only 1GB of RAM and 1GB of swap... well, let's just say things got pretty hairy.

And in one case, Python itself emitted a message I'd never seen before in my life (it appears to come from the Berkeley DB hash backend underneath the brain storage):

HASH: Out of overflow pages.  Increase page size

Remember what I said earlier? "On second thought, I don't even want to know."

  4. Disk space: the brain itself was around 60MBytes. While this version used Python's native shelve for brain storage (convenient, since it comes with Python), I suspect the overall "brain implementation" was the source of the wasted space (and likely the above issues); see the sketch below.
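For context, shelve is just pickled values on top of a dbm-style database: every value is stored as a standalone pickle, which is convenient but far from compact, and it squares with both the bloat and that Berkeley DB overflow message. A minimal illustration (the key/value shapes are hypothetical, not halpy's actual schema):

import shelve

brain = shelve.open("megahal.brn")  # a dbm database of pickled values
brain["A-B"] = {"C": 3, "D": 1}     # e.g. an ngram -> follower counts
brain.sync()                        # flush pickled values to the dbm file
brain.close()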

So who did this terrible port? The GitHub repository is owned by a Kyle Maxwell, but I suspect he isn't the original person who created this travesty, given that the GitHub repository description says:

Copied from https://code.google.com/p/halpy/

So I took a gander at the commit history for that Google Code repo. All the commits are from someone identified as cjones, going all the way back to 2008. I think the quality of the commit messages speaks for itself.
