The Quest for a Replacement MegaHAL

Why

Blah blah IRC UTF8 blah blah fucking hell, I'll write this up when I fucking feel like it

MegaHAL brain/corpus details

  • Brain age: ~6.5 years (October 2012 to May 2019)
  • Brain size: 8.1MBytes on disk
  • Training file: 16,085 lines; mostly ASCII (some UTF8)

Trial By Fire

markovify

https://github.com/jsvine/markovify

Written in Python, supporting versions 2.7 through 3.6. At first glance it looked like a perfect replacement: "oh wow, it can generate a random sentence based on its corpus (fed data)!"

I fed it about 20,000 lines of corpus data and it seemed to work: I could generate a random sentence from past lines, and the output seemed humorous in the same way MegaHAL's was. There was just one problem...

MegaHAL (generally; there are functions to bypass this behaviour) operates as follows: if you give it a word/sentence, it will not only learn from what you tell it, but it will give you a reply that usually relates to what you just told it. The keywords are "relates to". markovify does not natively support this.

When I had just about given up, I noticed make_sentence_with_start() and its strict=False capability:

Tries making a sentence that begins with beginning string, which should be a string of one to self.state words known to exist in the corpus.

If strict == False, then markovify will draw its initial inspiration from any sentence containing the specified word/phrase.

This sounded like exactly what I wanted... except it behaves nothing like what is described. make_sentence_with_start("prostate", strict=False) resulted in responses that, for the most part, had nothing to do with "prostate"... and every response, of course, started with the string prostate. The more I thought about this function, the more it felt almost pointless.
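To illustrate, here's a minimal sketch against markovify's documented API (corpus.txt is a placeholder for my training file):

import markovify

# Build a model from the corpus; training "just works".
with open("corpus.txt") as f:
    model = markovify.Text(f.read())

# Random generation works fine (returns None if no sentence can be built):
print(model.make_sentence())

# But this merely *starts* the output with the given word; the rest of
# the sentence rarely has anything to do with it:
print(model.make_sentence_with_start("prostate", strict=False))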

In contrast: MegaHAL's learn() function accepts a string -- a sentence a human would write, including single words like "prostate" -- splits it up into words, then proceeds to learn from what it was fed, then returns a capitalised reply that has keyed off the aforementioned words.

I found no way to simulate/replicate this behaviour in markovify. Bummer, because compared with all the rest of the trash below (and they are trash), this proved to be the best of the bunch; I really wanted markovify to be the solution.

megahal (Python + C version)

https://github.com/4ZM/megahal

Written in Python, unknown supported versions. Uses Extensions (i.e. Python/C bindings: it compiles the actual C-based MegaHAL source, then glues the two PLs together via the Extension).

At first glance, this seemed awesome. Why? Because this is exactly what AI::MegaHAL does, and there it works extremely well: fast, non-bloated, and the underlying perlxs code is mostly legible (enough for me to add a C function and the related perlxs tie-in without any trouble).

The first thing I learned is that this version is intended for Python 2.x; the Python interface/extension is not 3.x-compatible. I took a stab at converting it, but this led me down a rabbit hole of terrible documentation that cannot be easily summarised other than as "terrible". A colleague of mine did a better job explaining it than I could:

they seem to love writing tons of useless prose that's neither good as a reference, nor good as a specification: more like obese books that tell you nothing useful, but presented as references. and of course, since it's another turd dynlang, any sort of generated documentation output usually doesn't tell you the most basic shit: what arguments are valid input to a given goddamn function (yes, there are type annotations for python now; 30 years too late)

I then switched back to Python 2. I was at least able to get something functional... sort of.

Sadly, the Python+C version is quite terrible in comparison. None of the Python extension/interface functions behave identically to the C version, instead opting for basically hard-coding everything. The interface lacks many functions and much overall functionality, which is very disappointing. It seems the individual who wrote this interface chose to diverge from many of the expected norms of MegaHAL and implement something bizarre.

However, I do expect that on some level, if someone were to step up to the plate and actually write a decent interface to the C version, it would be usable. But in its current (IMO, half-assed) state, it cannot be used decently.
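For what it's worth, a decent interface might not even need a compiled Extension. Here's a minimal ctypes sketch of the idea -- assuming megahal.c built as a shared library, and megahal_* names/signatures as I remember them from megahal.h (verify against your copy of the source):

import ctypes

# Assumes something like: cc -shared -fPIC megahal.c -o libmegahal.so
hal = ctypes.CDLL("./libmegahal.so")

# Declared per megahal.h (an assumption; check the header):
hal.megahal_do_reply.restype = ctypes.c_char_p
hal.megahal_do_reply.argtypes = [ctypes.c_char_p, ctypes.c_int]

hal.megahal_initialize()                   # loads (or creates) megahal.brn
reply = hal.megahal_do_reply(b"tell me about IRC", 0)
print(reply.decode("latin-1", "replace"))  # brain text may not be UTF8
hal.megahal_cleanup()                      # saves the brain back to disk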

cobe

https://github.com/pteichman/cobe

Written in Python, unknown supported versions. I tried 2.7.16 and 3.6.8.

This software was a saga in itself, and tells the tale of what I consider a huge problem in the open-source world: maintainer neglect. Read on for tales of woe.

With 2.7.16: I was unable to get this software to run/function due to a deeper dependency problem not even directly related to cobe itself. It indirectly depends, through a chain of dependencies, on an egg called more_itertools, version 7.0.0. Apparently version 5.0.0 is the last version to support Python 2.x; versions 6.0.0 and newer intentionally dropped it (tests and runtime both fail).

With 3.6.8: the more_itertools problem was no more, but now we had an actual problem with cobe itself: it uses 2.x syntax in places, such as print filename (note the missing parens).
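That class of breakage is at least mechanical to fix (filename below is just a placeholder variable):

filename = "corpus.txt"  # placeholder
# Python 2 statement form, as found in cobe (a SyntaxError under 3.x):
#   print filename
# Python 3 function-call form (also valid under 2.7):
print(filename)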

So it seems cobe is written in Python 2.x, but cannot actually be used with Python 2.x because of the aforementioned more_itertools problem. (Remember: it does not depend directly on more_itertools, but rather some other package that depends on something that then depends on more_itertools.)

(I should add that as of this writing, the same problem happens via pip install cobe when using Python 3.x, so this is universally broken.)

It was around this time that I noticed there were 47 forks of this software on GitHub. This was followed by discovering many Issues and a few PRs submitted for all sorts of things.

I had looked at trying to update cobe's code myself to Python 3.x, starting with the outdated print statements, but apparently there was a lot more to fix.

Upon reviewing both of the Python 3 PRs, I noticed differences. PR #26 used 2to3, which had tons of fixes (and broke TravisCI, but that's because the version of Python used there is 2.7.6 (!)), while the older PR #24 had different fixes, particularly UTF-8-related fixes when training from a file.

This is a great example of why maintainers need to be responsible and not neglectful of their projects. Whose fork to use?!?

I opted to try GitHub user gnowxilef's fork master branch, since their PRs had been merged in the past (i.e. the author presumably trusted them). Nope, failure:

Traceback (most recent call last):
  File "bin/cobe", line 11, in <module>
    load_entry_point('cobe==2.1.2', 'console_scripts', 'cobe')()
  File ".../.local/lib/python3.6/site-packages/cobe-2.1.2-py3.6.egg/cobe/control.py", line 42, in main
AttributeError: 'Namespace' object has no attribute 'run'
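For what it's worth, that failure pattern usually points at the common argparse subcommand idiom: each subcommand attaches its handler via set_defaults(run=...), and main() blindly calls args.run(args). If the registration is broken, or no subcommand gets matched, the Namespace never grows a run attribute. A minimal sketch (an assumption about how cobe's control.py is wired, not a quote of it):

import argparse

def learn(args):
    print("learning from", args.filename)

parser = argparse.ArgumentParser(prog="cobe")
subparsers = parser.add_subparsers()

learn_parser = subparsers.add_parser("learn")
learn_parser.add_argument("filename")
learn_parser.set_defaults(run=learn)  # attaches the handler to the Namespace

args = parser.parse_args()
args.run(args)  # AttributeError here if nothing ever set "run"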

I then tried CrazyPython's fork: same failure. This is the exact type of shit that drove me away from Linux during the 1.3.x days and over to FreeBSD.

I then began to sift through all 47 forks, one at a time, to see who had commits more recent than master. This led me to the following discoveries:

  • chauffer/cobe -- had a commit adding Python 3 support, but the repo had been archived
  • mumtoofs/cobe -- a fork of chauffer/cobe, but not archived
  • Circlepuller/cobe -- had a commit adding Python 3 support
  • garym/cobe -- had a commit adding Python 3 support, but was 10 commits behind master
  • appscluster/cobe -- a fork of gnowxilef/cobe with additional fixes
  • slickriptide/cobe -- various commits, including Python 3 fixes (and some wonky commits too)

I should note all of these individuals' Python 3 commits were different in various ways from one another.

I eventually stopped when I noticed these three things:

  1. LeMagnesium/cobe -- commit 5d2d4b69
  2. tvishwanadha/cobe -- commits
  3. Wintervenom/cobe -- commits

The first implied the IRC client that came with it didn't support being in multiple channels at once. Hrm... okay, so I'd have to do my own IRC client interface -- that's fine, but shows neglect.

The second indicated the IRC client egg/package used by this software was extremely old (their commits were from 2013, so 3 years before the last commit from the actual cobe author!). Wow, what the fuck?

The third implied things like cobe dump didn't work, amongst other issues.

I paused and decided to go back to the official repository; fuck these forks. I couldn't believe all the chaos. I did my own Python 3 fixes (haphazardly, i.e. just print statements). This got me a bit further, only to encounter a whole new problem:

Traceback (most recent call last):
  File "./cobe", line 11, in <module>
    load_entry_point('cobe==2.1.2', 'console_scripts', 'cobe')()
  ...
  File ".../.local/lib/python3.6/site-packages/cobe-2.1.2-py3.6.egg/cobe/control.py", line 6, in <module>
  File ".../.local/lib/python3.6/site-packages/cobe-2.1.2-py3.6.egg/cobe/commands.py", line 13, in <module>
  File ".../.local/lib/python3.6/site-packages/cobe-2.1.2-py3.6.egg/cobe/brain.py", line 11, in <module>
  File "/usr/local/lib/python3.6/sqlite3/__init__.py", line 23, in <module>
    from sqlite3.dbapi2 import *
  File "/usr/local/lib/python3.6/sqlite3/dbapi2.py", line 27, in <module>
    from _sqlite3 import *
ModuleNotFoundError: No module named '_sqlite3'

Review of cobe/brain.py sure enough showed import sqlite3, yet nowhere in setup.py was this dependency declared. It turns out sqlite3 ships as part of the Python standard library, so neither setup.py nor PyPI can provide it -- and FreeBSD, helpfully, splits the module out into its own separate package. So, to get this to work, I had to install py36-sqlite3 system-wide via pkg on FreeBSD.

At this point, I stopped dead in my tracks and thought about how I was spending my time. This was an endless rabbit hole. All this time spent just for one program that was known to be totally neglected? No. My time is worth more than this.

I rm -fr'd everything pertaining to this software and removed the system packages I had installed, since odds were -- given the Issues and PRs -- that this stuff would just break further at run-time anyway.

jsmegahal

https://github.com/seiyria/jsmegahal

Written in Node.js and/or CoffeeScript. Was able to get it built/installed using Node v10.x on FreeBSD via npm.

However, Node is not a language I am familiar with. I found things like reading a file (to train it) tedious as hell, and the same went for doing something like executing a simple shell pipeline of commands (to find out memory usage before/after training). Not my cup of tea.

Thus, I can't really comment on whether or not this is a worthy contender because I gave up struggling with the PL in general.

megahal (native Python)

https://github.com/krmaxwell/megahal

Written in Python, unknown supported versions. I tried 2.7.16 and 3.6.8.

I first tried 3.6.8. This failed due to use of functions like xrange, but I was able to convert those and other bits with ease. However, something curious happened that should have acted as a premonition of the horrible quest I was about to embark upon:

    for i in xrange(self.order + 1, 0, -1):
NameError: name 'xrange' is not defined
Segmentation fault (core dumped)
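(The xrange half of that is the usual trivial compatibility shim, shown below; the Segmentation fault is another story.)

try:
    xrange              # exists on Python 2
except NameError:
    xrange = range      # Python 3 folded xrange into range()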

The actual Python crash was what shocked me. This problem was reproducible. I took a look at the stack trace, hoping to glean some small tidbit of an idea as to why the PL would be crashing on such a simple error:

#0  0x0000000800cbb9c9 in PyModule_GetState () from /usr/local/lib/libpython3.6m.so.1.0
#1  0x0000000802a1f67b in ?? () from /usr/local/lib/python3.6/lib-dynload/_pickle.so
#2  0x0000000802a1cbbe in ?? () from /usr/local/lib/python3.6/lib-dynload/_pickle.so
#3  0x0000000802a2a86b in ?? () from /usr/local/lib/python3.6/lib-dynload/_pickle.so
#4  0x0000000800cba60f in _PyCFunction_FastCallDict () from /usr/local/lib/libpython3.6m.so.1.0
#5  0x0000000800d3988d in ?? () from /usr/local/lib/libpython3.6m.so.1.0
#6  0x0000000800d36ee4 in _PyEval_EvalFrameDefault () from /usr/local/lib/libpython3.6m.so.1.0
#7  0x0000000800d3ae59 in _PyFunction_FastCallDict () from /usr/local/lib/libpython3.6m.so.1.0
#8  0x0000000800c71aaf in _PyObject_FastCallDict () from /usr/local/lib/libpython3.6m.so.1.0
#9  0x0000000800c71c48 in _PyObject_Call_Prepend () from /usr/local/lib/libpython3.6m.so.1.0
#10 0x0000000800c718e6 in PyObject_Call () from /usr/local/lib/libpython3.6m.so.1.0
#11 0x0000000800cd5fb0 in ?? () from /usr/local/lib/libpython3.6m.so.1.0
#12 0x0000000800cd5666 in ?? () from /usr/local/lib/libpython3.6m.so.1.0
#13 0x0000000800d33585 in _PyEval_EvalFrameDefault () from /usr/local/lib/libpython3.6m.so.1.0

Valid and botched at the same time? What in god's name was going on here? No, on second thought, I don't even want to know.

I switched to 2.7.16 where I was able to get things at least functional. But I will warn readers here and now: it only gets worse.

The issues were hideous in several ways:

  1. Dogshit slow. It took a very, VERY long time to train.

  2. Lurking bugs as a result of a haphazardly-done porting job. Upon feeding it a corpus, it raised an exception at runtime (but no segfault) like so:

  File ".../.local/lib/python2.7/site-packages/megahal.py", line 172, in boundary
     string[position + 1].isalpha()):
IndexError: string index out of range

I fired up python2 -m pdb test.py to see what was going on: it was crashing when encountering a learn/training line ending with an apostrophe. In this case, the variables and their contents for the problematic line:

string   = "THIS THING KEEPS ON JAMMIN'"
phrase   = "JAMMIN'"
position = 7

Yeah, index 7+1 is certainly past the end of that string, so the exception was legitimate. But wait... why didn't the C version blow up or misbehave with such a string?

Once I began to look at the actual code (both the Python and the deeper innards of MegaHAL), it quickly dawned on me: this was quite literally ported from C -- as in, the author sat with two windows and wrote the Python code to be as identical as possible to the C, nearly all the way down to how strings were parsed/managed.

Apostrophe in MegaHAL is treated specially for some unmentioned reason; the special-casing lives in MegaHAL's boundary() function.

I suspect it gets special treatment to try and handle contractions like they're or there's. I really don't know; the comments suck.

If you look closely at the C code, you'll see that the malloc() calls for a word add 1 to the size, needed for the trailing NUL terminator. isalpha(string[position+1]) would therefore be comparing a byte of 0x00. That byte is certainly not in the range [A-Za-z], so the function returns 0; more importantly, it isn't reading some unrelated byte of memory (or outright crashing, as it could if the byte past the end landed in a page not owned by the process).

I suspect Jason Hutchens (author of MegaHAL) "got lucky" with this piece of code. I firmly believe the whole blind position+1 approach was written under bad assumptions. I readily admit string parsing/splitting is not particularly "fun" in C, but why weren't functions like strtok(), strsep(), and/or strpbrk() used instead? strtok() and strpbrk() are part of ISO C90, and strsep() is a BSD extension; all were available by the early 1990s, and MegaHAL was written in 1997.

Did Hutchens not think about the edge cases? I can't speak for other programmers, but when I deal with string input -- even in Perl -- I am meticulous as hell.

Anyway, the silly workaround I came up with was to simply change the nested conditionals to the following:

                elif (string[position] == "'" and
                    string[position - 1].isalpha() and
                    position + 1 < len(string) and  # bounds check; short-circuits before the next test
                    string[position + 1].isalpha()):
                    boundary = False
  3. A gargantuan memory hog. Once trained, the underlying Python process took up around 450MBytes of RAM (RSS/RES); the native C version is around 30MBytes. But it's worse than that...

Any time the brain was written to disk, the size of the process would double (not to mention take a good 30+ seconds). Given that the system this software runs on contained only 1GB of RAM and 1GB of swap... well, let's just say things got pretty hairy.

And in one case, Python itself emitted a message I'd never seen before in my life (it appears to come from the Berkeley DB hash backend underneath the brain storage):

HASH: Out of overflow pages.  Increase page size

Remember what I said earlier? "On second thought, I don't even want to know."

  4. Disk space: the brain itself was around 60MBytes. While this version used Python's native shelve for brain storage (convenient, since it comes with Python), I suspect the overall "brain implementation" was the source of the wasted space (and likely the above issues); see the sketch below.
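For context, shelve is just pickled values on top of a dbm-style database: every value is stored as a standalone pickle, which is convenient but far from compact, and it squares with both the bloat and that Berkeley DB overflow message. A minimal illustration (the key/value shapes are hypothetical, not halpy's actual schema):

import shelve

brain = shelve.open("megahal.brn")  # a dbm database of pickled values
brain["A-B"] = {"C": 3, "D": 1}     # e.g. an ngram -> follower counts
brain.sync()                        # flush pickled values to the dbm file
brain.close()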

So who did this terrible port? The GitHub repository is owned by a Kyle Maxwell, but I suspect he isn't the original person who created this travesty, given that the GitHub repository description says:

Copied from https://code.google.com/p/halpy/

So I took a gander at the commit history for that Google Code repo. All the commits are from someone identified as cjones, going all the way back to 2008. I think the quality of the commit messages speaks for itself.
