@tra38
Created March 20, 2017 05:48
The Origins of ZombieWriter

Note:

The following paragraphs came from a discussion I had with @danbernier in the generative-art Slack chat, in September 2016.

I refer to "Latent Semantic Indexer" and "Dumb Clustering" in this text. Latent Semantic Indexer is another name for Latent Semantic Analysis (LSA). I also mentioned searching for keywords; LSA is really good at information retrieval, so I can easily search for paragraphs that are related to a user-specified keyword.
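
To make that retrieval step concrete, here is a minimal sketch in Python using scikit-learn. This is not the original implementation; the paragraph texts, component count, and variable names are purely illustrative:

```python
# Hypothetical sketch: keyword retrieval over paragraphs with LSA (scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "Suffering is a universal part of the human condition.",
    "The semiconductor boom reshaped entire local economies.",
    "Grief and loss remind us how much we valued what is gone.",
]

# TF-IDF vectors projected into a low-rank LSA space.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(paragraphs)
lsa = TruncatedSVD(n_components=2, random_state=0)
X_lsa = lsa.fit_transform(X)

# Project a user-specified keyword into the same space and rank paragraphs
# by cosine similarity to it.
query = lsa.transform(tfidf.transform(["suffering"]))
scores = cosine_similarity(query, X_lsa)[0]
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.2f}  {paragraphs[idx]}")
```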

Dumb Clustering was a clustering method I invented because I really wanted to do some clustering but didn't yet know of any standard clustering algorithms. The algorithm worked as follows (a rough sketch in code appears after the list):

  1. The algorithm goes through a list of paragraphs.
  2. For every paragraph in the list, it finds three similar paragraphs using Latent Semantic Analysis...and then removes all four from the list in question and places them into their own unique "cluster".
  3. Keep going until every paragraph in the list has been assigned to a cluster.
  4. Print out the clusters.
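
Here is a rough sketch of that loop in Python. It assumes a paragraph-by-paragraph similarity matrix has already been computed (for example, cosine similarities over the LSA vectors from the sketch above); the function and parameter names are illustrative, not the original code.

```python
# Sketch of the "Dumb Clustering" loop described above.
import numpy as np

def dumb_clustering(similarity: np.ndarray, group_size: int = 4):
    """Greedily pull each unassigned paragraph plus its most similar
    neighbours out of the pool and call that a cluster."""
    unassigned = set(range(similarity.shape[0]))
    clusters = []
    while unassigned:
        seed = min(unassigned)                      # step 1: walk the list in order
        unassigned.remove(seed)
        # step 2: the (up to) three most similar paragraphs still in the pool
        neighbours = sorted(unassigned,
                            key=lambda j: similarity[seed, j],
                            reverse=True)[:group_size - 1]
        for j in neighbours:
            unassigned.remove(j)
        clusters.append([seed] + neighbours)        # step 3: its own unique "cluster"
    return clusters                                 # step 4: caller prints these

# Example usage, reusing the LSA vectors from the earlier sketch:
# from sklearn.metrics.pairwise import cosine_similarity
# print(dumb_clustering(cosine_similarity(X_lsa)))
```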

When I first described this Dumb Clustering algorithm on September 15th, 2016, I wrote:

I don't know whether this algorithm is good enough for the purpose of sorting arbitrary text into random clusters. Due to its lack of intelligence, the clustering may be of lower quality. It might also be slower than other forms of clustering. However, it seems much easier to implement and since clustering is used for "exploratory analysis" as opposed to making predictions and business decisions, the damage that the Dumb Clustering Approach may do to the text analysis may be limited. In other words, the Dumb Clustering Approach may not be "good", but it's "good enough", and "good enough" is much cheaper than "good".

I thought about using Dumb Clustering during NaNoGenMo, but I did not get the chance to do so. In any event, I never revisited "Dumb Clustering" once I got a better idea of how to use "k-means clustering".
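
For comparison, the k-means version of the same idea is only a few lines with scikit-learn. This sketch reuses the X_lsa vectors and paragraphs list from the retrieval example above, and the cluster count is an arbitrary illustrative choice:

```python
# Hypothetical sketch: k-means over LSA vectors, each cluster becoming an "article".
from sklearn.cluster import KMeans

n_articles = 2                                   # illustrative cluster count
km = KMeans(n_clusters=n_articles, n_init=10, random_state=0)
labels = km.fit_predict(X_lsa)                   # X_lsa: LSA vectors from above

# Group paragraphs by cluster label; each group becomes one "article".
articles = {}
for paragraph, label in zip(paragraphs, labels):
    articles.setdefault(label, []).append(paragraph)
for label, paras in articles.items():
    print(f"--- Article {label} ---")
    print("\n\n".join(paras))
```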

I also referred to a Markov Clustering Algorithm. Sadly, I still do not know how to use that algorithm, though it was Markov clustering that actually got me on the path to studying clustering in general.


September 15th 2016, Part 1

@danbernier: You mentioned how you were using Markov Clustering for some project or another during our discussion about "AI credit". To wit:

"related: i've been reading about/playing with the Markov Cluster Algorithm. you feed it a sparse matrix of items, and their similarity to each other. it interprets this as a proximity graph, and then does random walks. it infers clusters from graph-reachability. it seems to work pretty well. a lot depends on how you calculate those similarity scores - get that wrong, and you're lost. but if those are good, it seems to work pretty well."

I just had the idea of finding text-paragraph similarities using the Latent Semantic Indexer (LSI) and then feeding that into the Markov Clustering algorithm you mentioned. Just find the paragraph similarities, feed them in, and be done with it.
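
A sketch of what that pipeline might look like, assuming the third-party markov_clustering package on PyPI and the X_lsa vectors from the earlier sketch (this was never actually built, so treat the names and thresholds as assumptions rather than project code):

```python
# Hypothetical sketch: LSA similarities fed into Markov Clustering, assuming
# the third-party `markov_clustering` package (pip install markov_clustering).
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
import markov_clustering as mc

# X_lsa: paragraphs projected into LSA space, as in the earlier sketch.
similarity = cosine_similarity(X_lsa)
np.fill_diagonal(similarity, 0.0)        # drop self-loops
similarity[similarity < 0.0] = 0.0       # MCL expects non-negative edge weights

result = mc.run_mcl(csr_matrix(similarity))   # random walks on the proximity graph
clusters = mc.get_clusters(result)            # tuples of paragraph indices
print(clusters)
```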

As for "wandering far afield of generative-art", I'll need to copy all this convo and turn it into a corpus that a Markov Cluster Algorithm can traverse. Just to see whether it could be possible to generate literature using it... (edited)

The best way to describe my "text generation technique" is "glorified cut-ups". You grab a bunch of random paragraphs, and place them right next to each other, and BAM, you now have a longform opinion piece! People assume that paragraphs that are right next to each other are related to each other, and so long as the paragraphs appear relatively coherent, they are willing to consider it as part of a greater whole. The "ELIZA Effect" prevails.
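
In its crudest form, the cut-up step is nothing more than sampling paragraphs and gluing them together. A toy sketch, reusing the illustrative paragraphs list from the first snippet:

```python
# Toy "glorified cut-up": sample a few paragraphs at random and present them
# back to back as one piece. Retrieval or clustering replaces the random
# sampling in the real pipeline.
import random

def cut_up(paragraphs, n=3, seed=None):
    rng = random.Random(seed)
    chosen = rng.sample(paragraphs, k=min(n, len(paragraphs)))
    return "\n\n".join(chosen)

print(cut_up(paragraphs, n=2, seed=42))
```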

Previously, I had to search for keywords within the corpus to find related paragraphs, and then simply print out those paragraphs. You can see examples of these "generated articles" in #bring-n-brag, with me searching for terms about "suffering". Now with the "dumb" clustering algorithm, I don't even need to type in a keyword...the machine will break down the corpus into clusters (and I can then simply present each cluster as its own standalone opinion piece).


September 15th 2016, Part 2

I was previously using "glorified cut-ups" with Prolefeed and random paragraph shuffling (see the Prolefeed README: https://github.com/tra38/Prolefeed for more information), but that technique couldn't scale (you would have to handwrite all the paragraphs before feeding them to Prolefeed). LSI/Dumb Clustering can scale up, since you can input arbitrary paragraphs and the algorithm will try to cluster them semi-effectively. The resulting generated work may still need some human editing (and you may want to edit the input corpus too), but even a bad output is already much more readable than the best Markov chain. Since LSI is a form of machine learning, you also get to partake in the great "Machine Learning Hype Cycle". And, of course, now that your algorithm has generated a "rough draft", you could try programming some algorithmic editing as well to improve it further...

The thing is, I think this "cut-ups" technique is...not common, or at least not common enough for people to instantly recognize it and give it an academic name. It's probably been reinvented independently multiple times. The earliest use I've seen was in the early 1980s (an experimental work by a comp. sci. professor about a woman who attended a dance party during the semiconductor boom...she would later enter a revised version of that story into the Dartmouth Turing Test last year), and I know that spammers use a similar approach to disguise their plagiarism of preexisting content (for example, rearranging and rewriting paragraphs). I even found a blog post from as late as 2013 by someone attempting to revamp a program that keeps track of notes on different subjects; they wanted to use LSI to group together similar paragraphs so that users could be exposed to different opinions about the same topic, but it seems work on that program may have been cancelled (and the blog post's author didn't really seem to like the output of the LSI either).

Here's that 2013 blog post by the way: http://bodong.ch/blog/2013/03/11/analyze-text-similarity-in-r-latent-semantic-analysis-and-multidimentional-scaling.html


September 16th 2016

The clustering algorithm sorts all the paragraphs in the corpus into different clusters, which I'll refer to as "articles". The paragraphs within an "article" are all related to each other, so the "article" as a whole gains some coherence. Since each "article" is both coherent and readable (it is essentially made up of other people's writing), each "article" has a chance of being accepted by a reader.

Now, the problem is that the generated articles are not super-coherent, and they probably require some human editing (though I might experiment with algorithmic editing to reduce the manual labor). The corpus may also need to be sanitized beforehand, in the sense that each paragraph needs to be its own standalone piece that does not depend on what comes before or after it (the "cut-up" approach I'm using is unable to handle continuity).

I also wouldn't be surprised if I never actually end up using clustering in a "real-world scenario". It's possible that a user might prefer an alternate approach: search the corpus for paragraphs mentioning keywords, and then combine those paragraphs to generate an "article". This approach is a bit more manual than clustering (you have to think of an idea first)...but it lets you generate articles based on your own criteria (whether to improve your SEO or to respond to current events). Still, it may be useful to know about clustering, in case a user wants a more automated approach. (edited)
