{ | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Clustering Related Posts", | |
"level": 1 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "This is a series of notebooks (in progress) to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.\n\n<a target=\"_parent\" href=\"http://rmdk.ca\">Visit my webpage for more</a>. \n\nEmail me: <a href=\"mailto:email.ryan.kelly@gmail.com?Subject=Hey\" target=\"_top\">email.ryan.kelly@gmail.com</a>\n\nI'd love for you to share if you liked this post." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "This notebook covers or includes: ", | |
"level": 2 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "* Introduction to text processing\n* Natural Language Toolkit (NLTK)\n* KMeans clustering of text data" | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Measuring Similarity Between Text Messages:", | |
"level": 2 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "\nThis notebook explores the idea of recommending news posts to a reader based on their search query. To do this, we first have to introduce basic text processing. Clustering can be defined as grouping unlabelled data by a measurement of similarity.\n\nOne of the most robust methods to quantify meaning in textual data is the **bag-of-words** approach. For each word in a post, we track the number of occurrences in a vector (vectorization). In this way the data can be stored in an efficient matrix structure." | |
}, | |
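{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "To make the idea concrete, here is a minimal sketch of a bag-of-words count that uses only the Python standard library. The helper name `bag_of_words` and the simple whitespace tokenization are assumptions made for illustration; scikit-learn's vectorizer tokenizes differently, so its vocabulary will not match this one exactly." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "from collections import Counter\n\ndef bag_of_words(docs):\n    # Hypothetical helper: lowercase each document and split on whitespace\n    tokenized = [doc.lower().split() for doc in docs]\n    # The vocabulary is the sorted set of all words seen in any document\n    vocab = sorted(set(word for tokens in tokenized for word in tokens))\n    # One count vector per document, one column per vocabulary word\n    vectors = []\n    for tokens in tokenized:\n        counts = Counter(tokens)\n        vectors.append([counts[word] for word in vocab])\n    return vocab, vectors\n\nvocab, vectors = bag_of_words(['beer bottles or beer cans', 'open a beer'])\nprint(vocab)\nprint(vectors)", | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |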
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Preprocessing:", | |
"level": 3 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "First we have to convert the text into a `bag-of-words`. We can do this using scikit-learn's built-in `CountVectorizer`. The parameter `min_df` determines how the vectorizer treats words that are used infrequently. If set to an integer, all words that occur in fewer than that many documents are dropped. If set to a fraction, all words that occur in less than that fraction of the documents are dropped. There are also a lot of other options, which we will get into later." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "from sklearn.feature_extraction.text import CountVectorizer\n\nvect = CountVectorizer(min_df=1)\nprint vect", | |
"prompt_number": 2, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "CountVectorizer(analyzer=word, binary=False, charset=None, charset_error=None,\n decode_error=strict, dtype=<type 'numpy.int64'>, encoding=utf-8,\n input=content, lowercase=True, max_df=1.0, max_features=None,\n min_df=1, ngram_range=(1, 1), preprocessor=None, stop_words=None,\n strip_accents=None, token_pattern=(?u)\\b\\w\\w+\\b, tokenizer=None,\n vocabulary=None)\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "We see that for now the counting is done at the word level (`analyzer = word`)." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "content = ['how to open a beer without a bottle opener', \n 'Beer bottles or beer cans',]\nX = vect.fit_transform(content)\n\nvect.get_feature_names()", | |
"prompt_number": 3, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 3, | |
"metadata": {}, | |
"text": "[u'beer',\n u'bottle',\n u'bottles',\n u'cans',\n u'how',\n u'open',\n u'opener',\n u'or',\n u'to',\n u'without']" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "#Print the vectorized word occurrences\nprint X\nprint X.toarray()", | |
"prompt_number": 4, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": " (0, 0)\t1\n (1, 0)\t2\n (0, 1)\t1\n (1, 2)\t1\n (1, 3)\t1\n (0, 4)\t1\n (0, 5)\t1\n (0, 6)\t1\n (1, 7)\t1\n (0, 8)\t1\n (0, 9)\t1\n[[1 1 0 0 1 1 1 0 1 1]\n [2 0 1 1 0 0 0 1 0 0]]\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- Count vectors returned by `fit_transform` and `transform` are stored in a memory-efficient sparse matrix format, so we call `toarray()` to get the full dense array for inspection.\n\nLet's add some more data." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "posts = ['how to open a beer without a bottle opener', \n 'Do girls like beer bottles or beer cans?',\n 'where did all my beer go?',\n 'where did all my beer go? where did all my beer go?',\n 'recycling beer bottles and cans',\n 'Is it worth recycling?',\n 'do not bring bottles to my backyard party, only cans please.', \n 'This is useless']", | |
"prompt_number": 5, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "X_train = vect.fit_transform(posts)\n\nnum_samples, num_features = X_train.shape\n\nprint '#samples: {}, #features: {}'.format(num_samples, num_features)", | |
"prompt_number": 6, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "#samples: 8, #features: 31\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "- Unsurprisingly, we have 8 posts with a total of 31 different words.\n\nLet's vectorize a new post, then see how similar it is to our existing corpus." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "new_post = 'Opening beer bottles and cans 101'\nnew_post_vect = vect.transform([new_post])\n\nprint(new_post_vect.toarray())", | |
"prompt_number": 7, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "[[0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "import scipy as sp\nimport scipy.linalg  # make sure the linalg submodule is loaded\n\ndef dists(v1, v2):\n    delta = v1 - v2\n    # Euclidean distance is the norm of the difference vector\n    return sp.linalg.norm(delta.toarray())\n\ndef similarity(new_post_vector, corpus):\n    # Uses the globals `posts` and `new_post` defined above\n    best_dist = float('inf')\n    best_i = None\n\n    for i in range(corpus.shape[0]):\n        post = posts[i]\n\n        if post == new_post:\n            continue\n        post_vec = corpus.getrow(i)\n        d = dists(post_vec, new_post_vector)\n        print('Post %i with dist = %.2f: %s' % (i, d, post))\n\n        if d < best_dist:\n            best_dist = d\n            best_i = i\n    print('Best post is {} with dist = {}'.format(best_i, best_dist))", | |
"prompt_number": 8, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "similarity(new_post_vect, X_train)", | |
"prompt_number": 9, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Post 0 with dist = 3.00: how to open a beer without a bottle opener\nPost 1 with dist = 2.45: Do girls like beer bottles or beer cans?\nPost 2 with dist = 2.83: where did all my beer go?\nPost 3 with dist = 4.90: where did all my beer go? where did all my beer go?\nPost 4 with dist = 1.00: recycling beer bottles and cans\nPost 5 with dist = 2.83: Is it worth recycling?\nPost 6 with dist = 3.32: do not bring bottles to my backyard party, only cans please.\nPost 7 with dist = 2.65: This is useless\nBest post is 4 with dist = 1.0\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Great, our first text similarity measurement! We can see here that post 4 is most similar to our new post. However, `post 2` appears much closer to the new post than `post 3` does (2.83 versus 4.90), even though `post 3` is simply `post 2` doubled. It is clear that using raw word counts is too simple. The next step is to normalize the word count vectors to unit length to avoid this problem." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Update our dists function\ndef dists(v1, v2):\n    v1_norm = v1/sp.linalg.norm(v1.toarray())\n    v2_norm = v2/sp.linalg.norm(v2.toarray())\n    delta = v1_norm-v2_norm\n    # Calculate Euclidean \"norm\" distance\n    return sp.linalg.norm(delta.toarray())", | |
"prompt_number": 10, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "similarity(new_post_vect, X_train)", | |
"prompt_number": 11, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Post 0 with dist = 1.27: how to open a beer without a bottle opener\nPost 1 with dist = 0.86: Do girls like beer bottles or beer cans?\nPost 2 with dist = 1.26: where did all my beer go?\nPost 3 with dist = 1.26: where did all my beer go? where did all my beer go?\nPost 4 with dist = 0.46: recycling beer bottles and cans\nPost 5 with dist = 1.41: Is it worth recycling?\nPost 6 with dist = 1.18: do not bring bottles to my backyard party, only cans please.\nPost 7 with dist = 1.41: This is useless\nBest post is 4 with dist = 0.459505841095\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Great, posts 2 & 3 are now equally similar to our new post." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Removing Less Important Words:", | |
"level": 3 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "There are many words in a language that do not carry much meaning in terms of the overall interpretation of the message. Words like \"it\" should be much less meaningful than \"beer\" in our current context. These less important words are called `stop words`, and can be removed from the posts since they do not help us distinguish between different posts." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "#Add english stop words to our vectorizer object.\nvect = CountVectorizer(min_df=1, stop_words='english')\n#Display a sample\nprint sorted(vect.get_stop_words())[80:-150]", | |
"prompt_number": 12, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "['empty', 'enough', 'etc', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'found', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'hasnt', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'ie', 'if', 'in', 'inc', 'indeed', 'interest', 'into', 'is', 'it', 'its', 'itself', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'ltd', 'made', 'many', 'may', 'me', 'meanwhile', 'might', 'mill', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', 'name']\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "If you already have a list of words in mind that you wish to `stop`, you can simply pass them as a list to the `stop_words` argument." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Stemming", | |
"level": 2 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "We also need to consider that similar words, such as \"girl\" and \"girls\", should probably be treated as the same word. Thus we need a function that reduces words to a common 'word stem'. We can do this with the **Natural Language Toolkit (NLTK)**. After installing NLTK, import the library and try out the stemmer for English." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "import nltk.stem\n\ns = nltk.stem.SnowballStemmer('english')\n\nprint s.stem('bottles')\nprint s.stem('bottle')\n\nprint s.stem('perception')\nprint s.stem('perceptive')\n\nprint s.stem('crashing')\nprint s.stem('crashed')", | |
"prompt_number": 13, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "bottl\nbottl\npercept\npercept\ncrash\ncrash\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Extending the vectorizer with NLTK stemming", | |
"level": 3 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "We need to stem the posts before we feed them into the `CountVectorizer`. The best way to do this is to override the method `build_analyzer`. \n\nBy doing this we reuse the preprocessing in the parent class, which converts the raw posts to lower case and tokenizes them into words. We then convert each word into its stemmed version." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "import nltk.stem\n\nenglish_stemmer = nltk.stem.SnowballStemmer('english')\n\nclass StemmedCountVectorizer(CountVectorizer):\n    def build_analyzer(self):\n        analyzer = super(StemmedCountVectorizer, self).build_analyzer()\n        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))\n\nvectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')", | |
"prompt_number": 14, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "X = vectorizer.fit_transform(posts)\nvectorizer.get_feature_names()", | |
"prompt_number": 15, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 15, | |
"metadata": {}, | |
"text": "[u'backyard',\n u'beer',\n u'bottl',\n u'bring',\n u'can',\n u'did',\n u'girl',\n u'like',\n u'open',\n u'parti',\n u'recycl',\n u'useless',\n u'worth']" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Restate the new vectorizer on the data\nX_train = vectorizer.fit_transform(posts)\nnew_post_vect = vectorizer.transform([new_post])\n\nsimilarity(new_post_vect, X_train)", | |
"prompt_number": 16, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Post 0 with dist = 0.61: how to open a beer without a bottle opener\nPost 1 with dist = 0.77: Do girls like beer bottles or beer cans?\nPost 2 with dist = 1.14: where did all my beer go?\nPost 3 with dist = 1.14: where did all my beer go? where did all my beer go?\nPost 4 with dist = 0.71: recycling beer bottles and cans\nPost 5 with dist = 1.41: Is it worth recycling?\nPost 6 with dist = 1.05: do not bring bottles to my backyard party, only cans please.\nPost 7 with dist = 1.41: This is useless\nBest post is 0 with dist = 0.605810893055\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "We see now that post 0 is most similar to our new post, because \"bottles\" and \"bottle\" are now treated as the same word, as are \"Opening\", \"open\" and \"opener\"." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "print new_post\nprint posts[0]", | |
"prompt_number": 17, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Opening beer bottles and cans 101\nhow to open a beer without a bottle opener\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Thinking a bit deeper about relevant post features", | |
"level": 3 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "So far we have assumed that a higher occurrence of a certain word in a post means that the word is more important to the post. While this is true to some extent, some very frequent words carry no real meaning. For example, the word \"Subject\" appears in every post, so it communicates nothing important and does not help us distinguish between posts.\n\nWe could perhaps set a 90% occurrence cutoff in our vectorizer, such that words that occur in >90% of the posts are excluded. However, we would still run into the problem of border cases, say where a word occurs in only 89% of the posts.\n\nTo solve these problems we count the term frequencies for every post **while** discounting those words that appear in many posts. This is the concept of **term frequency - inverse document frequency (TF-IDF)**. We can implement this using scikit-learn's `TfidfVectorizer`." | |
}, | |
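{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "As a rough illustration of the idea, the sketch below computes a plain TF-IDF score by hand. The function name `tfidf`, the toy corpus, and the unsmoothed formula `tf * log(N / df)` are assumptions chosen for simplicity. scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalizes each document vector by default, so its numbers will differ." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "import math\n\ndef tfidf(term, doc, corpus):\n    # Hypothetical helper; assumes `term` occurs in at least one document\n    # Term frequency: share of the document's tokens that equal `term`\n    tf = float(doc.count(term)) / len(doc)\n    # Document frequency: number of documents that contain `term`\n    df = len([d for d in corpus if term in d])\n    # Terms that appear in many documents get a small idf weight\n    idf = math.log(float(len(corpus)) / df)\n    return tf * idf\n\n# Toy corpus of already-tokenized documents (made-up example data)\na, abb, abc = ['a'], ['a', 'b', 'b'], ['a', 'b', 'c']\ncorpus = [a, abb, abc]\n\nprint(tfidf('a', a, corpus))    # 'a' is in every document, so idf = log(1) = 0\nprint(tfidf('b', abb, corpus))  # 'b' is frequent here and rarer overall\nprint(tfidf('c', abc, corpus))  # 'c' occurs in only one document", | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |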
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "from sklearn.feature_extraction.text import TfidfVectorizer\n# Rebuild the class to include our stemmer\n\nclass StemmedTfidfVectorizer(TfidfVectorizer):\n    def build_analyzer(self):\n        # Look up the parent analyzer starting from this subclass\n        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()\n\n        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))\n\nvectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english', decode_error='ignore')\n", | |
"prompt_number": 18, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Now instead of counts, our document vectors will contain individual TF-IDF values per term (token)." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Restate new vectorizer\nX_train = vectorizer.fit_transform(posts)\nnew_post_vect = vectorizer.transform([new_post])\n\nsimilarity(new_post_vect, X_train)", | |
"prompt_number": 19, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Post 0 with dist = 0.57: how to open a beer without a bottle opener\nPost 1 with dist = 0.99: Do girls like beer bottles or beer cans?\nPost 2 with dist = 1.26: where did all my beer go?\nPost 3 with dist = 1.26: where did all my beer go? where did all my beer go?\nPost 4 with dist = 0.90: recycling beer bottles and cans\nPost 5 with dist = 1.41: Is it worth recycling?\nPost 6 with dist = 1.17: do not bring bottles to my backyard party, only cans please.\nPost 7 with dist = 1.41: This is useless\nBest post is 0 with dist = 0.572957858071\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Recap", | |
"level": 2 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "So far we have:\n\n1. Tokenized the text\n2. Discarded words that occur too often and don't help us detect relevant posts\n3. Thrown away very uncommon words\n4. Counted the remaining words\n5. Calculated TF-IDF values from the counts, considering the whole text corpus\n\n**Limitations of the bag-of-words approach**\n\n* It does not cover word relations: \"Car hits wall\" and \"Wall hits car\" will both have the same feature vector.\n* It does not capture negations well: \"I will eat soup\" and \"I will *not* eat soup\" will have very similar feature vectors. This can be remedied by also counting bigrams and trigrams (two or three words in a row together).\n* It fails on misspelled words, since each misspelling is treated as a different word." | |
}, | |
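{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "The word-order limitation, and the bigram remedy, can be seen in a small standard-library sketch. The helper `ngram_counts` is a hypothetical name introduced here for illustration. In scikit-learn, the `ngram_range` parameter shown in the `CountVectorizer` printout earlier plays this role; for example, `ngram_range=(1, 2)` counts both single words and bigrams." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "from collections import Counter\n\ndef ngram_counts(text, n):\n    # Hypothetical helper: count every run of n consecutive lowercase words\n    tokens = text.lower().split()\n    ngrams = [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]\n    return Counter(ngrams)\n\ns1, s2 = 'Car hits wall', 'Wall hits car'\n\n# Single words: both sentences contain the same words, so the counts match\nprint(ngram_counts(s1, 1) == ngram_counts(s2, 1))\n# Bigrams: 'car hits' / 'hits wall' differ from 'wall hits' / 'hits car'\nprint(ngram_counts(s1, 2) == ngram_counts(s2, 2))", | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |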
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Clustering", | |
"level": 1 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Now that we can represent our posts quantitatively, at least to some degree, our goal is to cluster similar posts. There are two main types of clustering algorithms: **flat** and **hierarchical**. \n\n**Flat clustering** divides the posts into a set of clusters that minimizes the difference _within_ clusters and maximizes the difference _between_ clusters. Generally we have to specify the number of clusters upfront.\n\n**Hierarchical clustering** does not require the number of clusters as an input. It creates a hierarchy of clusters in which very similar posts are grouped together, and similar clusters are then grouped recursively until one cluster is left that contains all the data. Once completed, the user can choose a suitable number of clusters from the hierarchy." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "KMeans", | |
"level": 2 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "KMeans is probably the most common **flat** clustering algorithm. First you must specify the number of desired clusters (k). The algorithm then picks k random _seeds_ within the data and assigns each post to the closest seed centroid. Next, each seed is relocated to the mean center of the points assigned to it. The process is then repeated: posts are reassigned to their new closest centroid and the centroids are moved again. This continues as long as the centroids move a considerable amount. After some _n_ iterations the movements fall below a threshold, and the algorithm is considered converged." | |
}, | |
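{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "The loop described above can be sketched in a few lines of plain Python. This is a toy illustration on 2-D points, not scikit-learn's implementation; the function name `kmeans_sketch`, the fixed random seed, the example points, and the stopping threshold `tol` are all assumptions made for the example." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "import random\n\ndef kmeans_sketch(points, k, tol=1e-4, max_iter=100, seed=0):\n    # Toy k-means on 2-D points; not scikit-learn's implementation\n    rng = random.Random(seed)\n    # Pick k of the points at random as the initial seeds\n    centers = rng.sample(points, k)\n    for _ in range(max_iter):\n        # Assign each point to its closest center (squared Euclidean distance)\n        clusters = [[] for _ in range(k)]\n        for x, y in points:\n            sq = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]\n            clusters[sq.index(min(sq))].append((x, y))\n        # Move each center to the mean of the points assigned to it\n        new_centers = []\n        for center, members in zip(centers, clusters):\n            if members:\n                n = float(len(members))\n                new_centers.append((sum(p[0] for p in members) / n,\n                                    sum(p[1] for p in members) / n))\n            else:\n                # Keep a center where it is if no point was assigned to it\n                new_centers.append(center)\n        # Stop once no center moved by more than tol\n        shift = max(abs(ox - nx) + abs(oy - ny)\n                    for (ox, oy), (nx, ny) in zip(centers, new_centers))\n        centers = new_centers\n        if shift < tol:\n            break\n    return centers\n\npoints = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),\n          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]\nprint(sorted(kmeans_sketch(points, k=2)))", | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |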
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Get some test data ", | |
"level": 4 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "We will utilize a machine learning dataset that contains 18,828 posts from 20 different newsgroups. There are many topics including technology, politics, and religion. However, for now we will only use the technical groups.\n\nOne question we could ask is: for a certain topic, can we effectively cluster the newsgroups that published on that topic into distinct categories?\n\n\nThis data is already split into testing and training data, and we can load it using sklearn." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "import sklearn.datasets\n\nsave_dir = '/users/ryankelly/downloads/' # Your save file path\n\n# Load the dataset stored under save_dir using sklearn\ndf = sklearn.datasets.load_mlcomp(\"20news-18828\", mlcomp_root=save_dir)", | |
"prompt_number": 20, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Data files\nprint df.filenames\nprint len(df.filenames)", | |
"prompt_number": 22, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "['/users/ryankelly/downloads/379/raw/comp.graphics/1190-38614'\n '/users/ryankelly/downloads/379/raw/comp.graphics/1383-38616'\n '/users/ryankelly/downloads/379/raw/alt.atheism/487-53344' ...,\n '/users/ryankelly/downloads/379/raw/rec.sport.hockey/10215-54303'\n '/users/ryankelly/downloads/379/raw/sci.crypt/10799-15660'\n '/users/ryankelly/downloads/379/raw/comp.os.ms-windows.misc/2732-10871']\n18828\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Data Topics\ndf.target_names", | |
"prompt_number": 23, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 23, | |
"metadata": {}, | |
"text": "['alt.atheism',\n 'comp.graphics',\n 'comp.os.ms-windows.misc',\n 'comp.sys.ibm.pc.hardware',\n 'comp.sys.mac.hardware',\n 'comp.windows.x',\n 'misc.forsale',\n 'rec.autos',\n 'rec.motorcycles',\n 'rec.sport.baseball',\n 'rec.sport.hockey',\n 'sci.crypt',\n 'sci.electronics',\n 'sci.med',\n 'sci.space',\n 'soc.religion.christian',\n 'talk.politics.guns',\n 'talk.politics.mideast',\n 'talk.politics.misc',\n 'talk.religion.misc']" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Restrict data to five 'tech' categories\ngroup = ['comp.graphics', 'comp.os.ms-windows.misc', \n         'comp.sys.ibm.pc.hardware', \n         'comp.windows.x', 'sci.space']\n# Reload only the training data with the desired categories\ntrain_data = sklearn.datasets.load_mlcomp('20news-18828', 'train', \n                                          mlcomp_root=save_dir, \n                                          categories=group)", | |
"prompt_number": 24, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "print(len(train_data.filenames))", | |
"prompt_number": 25, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "3414\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Clustering posts", | |
"level": 2 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "While initializing our `vectorizer` we have to remember that we are working with real data, which contains many errors; in this case, invalid characters that cannot be decoded. Setting `decode_error='ignore'` tells the vectorizer to skip them." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "vec = StemmedTfidfVectorizer(min_df=10, max_df=0.5,\n stop_words='english', decode_error='ignore')\n\nvecData = vec.fit_transform(train_data.data)", | |
"prompt_number": 26, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "num_samples, num_features = vecData.shape\nprint('#samples: {}, #features: {}'.format(num_samples, num_features))", | |
"prompt_number": 27, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "#samples: 3414, #features: 4331\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "This is the information we will use as input for KMeans clustering. Since we know there are 5 topic groups in these data, it makes sense that there could be 5 clusters in the data, so we will try this first." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "num_clusters = 5\nfrom sklearn.cluster import KMeans\n\nkm = KMeans(n_clusters=num_clusters, init='random', n_init=1, verbose=1)\nkm.fit(vecData)", | |
"prompt_number": 110, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Initialization complete\nIteration 0, inertia 6434.212\nIteration 1, inertia 3302.138\nIteration 2, inertia 3286.234\nIteration 3, inertia 3278.006\nIteration 4, inertia 3274.039" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3271.234\nIteration 6, inertia 3268.856\nIteration 7, inertia 3267.609\nIteration 8, inertia 3266.964\nIteration 9, inertia 3266.352" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3265.901\nIteration 11, inertia 3265.509\nIteration 12, inertia 3264.970\nIteration 13, inertia 3263.969\nIteration 14, inertia 3261.887" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3259.657\nIteration 16, inertia 3258.196\nIteration 17, inertia 3257.560\nIteration 18, inertia 3256.997\nIteration 19, inertia 3256.714" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3256.482\nIteration 21, inertia 3256.326\nIteration 22, inertia 3256.126\nIteration 23, inertia 3255.998\nIteration 24, inertia 3255.918" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3255.870\nIteration 26, inertia 3255.826\nIteration 27, inertia 3255.768\nIteration 28, inertia 3255.658\nIteration 29, inertia 3255.574" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3255.550\nIteration 31, inertia 3255.533\nIteration 32, inertia 3255.527\nIteration 33, inertia 3255.522\nIteration 34, inertia 3255.513" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 35, inertia 3255.508\nIteration 36, inertia 3255.503\nConverged at iteration 36\n" | |
}, | |
{ | |
"output_type": "pyout", | |
"prompt_number": 110, | |
"metadata": {}, | |
"text": "KMeans(copy_x=True, init='random', max_iter=300, n_clusters=5, n_init=1,\n n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,\n verbose=1)" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "After fitting, we can get the cluster assignment of each post from the `labels_` attribute, and the cluster centers from `cluster_centers_`. We then compute the completeness score, which ranges from 0 to 1 and measures how well all posts from the same newsgroup end up in the same cluster." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "from sklearn import metrics\n\nmetrics.completeness_score(train_data.target, km.labels_)", | |
"prompt_number": 111, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 111, | |
"metadata": {}, | |
"text": "0.40904043798434664" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
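"source": "A completeness score of about 0.41 isn't great. One possible reason is that, although there are five different topics, their contents are related. Let's try several values of `k` and compare the scores. Keep in mind that completeness tends to favour fewer clusters (a single cluster scores 1.0), so this comparison is only a rough guide." | |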
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "from sklearn import metrics\nfrom sklearn.cluster import KMeans\n\ndef best_k():\n    # Track the best k and its score across all candidates,\n    # so initialise them once, before the loop.\n    best = 0\n    best_score = 0\n    for i in range(2, 40):\n        # Use the loop variable as the number of clusters\n        km = KMeans(n_clusters=i, init='random', n_init=1, verbose=1)\n        km.fit(vecData)\n        score = metrics.completeness_score(train_data.target, km.labels_)\n        if score > best_score:\n            best = i\n            best_score = score\n    return [best, best_score]\n\nbest_k()", | |
"prompt_number": 109, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Initialization complete\nIteration 0, inertia 6445.479\nIteration 1, inertia 3292.339\nIteration 2, inertia 3275.461\nIteration 3, inertia 3270.621\nIteration 4, inertia 3268.049" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3266.777\nIteration 6, inertia 3266.141\nIteration 7, inertia 3265.889\nIteration 8, inertia 3265.754\nIteration 9, inertia 3265.668" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3265.602\nIteration 11, inertia 3265.509\nIteration 12, inertia 3265.367\nIteration 13, inertia 3265.151\nIteration 14, inertia 3264.775" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3264.314\nIteration 16, inertia 3263.827\nIteration 17, inertia 3263.243\nIteration 18, inertia 3262.592\nIteration 19, inertia 3262.179" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3261.991\nIteration 21, inertia 3261.915\nIteration 22, inertia 3261.842\nIteration 23, inertia 3261.741\nIteration 24, inertia 3261.661" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3261.614\nIteration 26, inertia 3261.582\nIteration 27, inertia 3261.569\nIteration 28, inertia 3261.557\nIteration 29, inertia 3261.539" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3261.525\nIteration 31, inertia 3261.499\nConverged at iteration 31\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6524.930\nIteration 1, inertia 3308.247\nIteration 2, inertia 3292.389\nIteration 3, inertia 3283.365\nIteration 4, inertia 3278.358" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3276.421\nIteration 6, inertia 3275.128\nIteration 7, inertia 3273.981\nIteration 8, inertia 3272.630\nIteration 9, inertia 3270.863" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3268.894\nIteration 11, inertia 3267.018\nIteration 12, inertia 3265.305\nIteration 13, inertia 3263.985\nIteration 14, inertia 3263.395" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3262.957\nIteration 16, inertia 3262.720\nIteration 17, inertia 3262.581\nIteration 18, inertia 3262.501\nIteration 19, inertia 3262.414" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3262.318\nIteration 21, inertia 3262.253\nIteration 22, inertia 3262.192\nIteration 23, inertia 3262.085\nIteration 24, inertia 3261.962" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3261.815\nIteration 26, inertia 3261.625\nIteration 27, inertia 3261.492\nIteration 28, inertia 3261.394\nIteration 29, inertia 3261.278" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3261.206\nIteration 31, inertia 3261.134\nIteration 32, inertia 3261.077\nIteration 33, inertia 3261.018\nIteration 34, inertia 3260.997" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 35, inertia 3260.975\nIteration 36, inertia 3260.958\nIteration 37, inertia 3260.949\nConverged at iteration 37\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6392.513\nIteration 1, inertia 3298.129\nIteration 2, inertia 3286.500\nIteration 3, inertia 3280.842\nIteration 4, inertia 3277.803" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3276.304\nIteration 6, inertia 3274.915\nIteration 7, inertia 3273.931\nIteration 8, inertia 3273.201\nIteration 9, inertia 3272.640" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3272.355\nIteration 11, inertia 3272.069\nIteration 12, inertia 3271.870\nIteration 13, inertia 3271.619\nIteration 14, inertia 3271.328" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3271.052\nIteration 16, inertia 3270.824\nIteration 17, inertia 3270.511\nIteration 18, inertia 3270.053\nIteration 19, inertia 3269.612" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3269.327\nIteration 21, inertia 3269.190\nIteration 22, inertia 3269.089\nIteration 23, inertia 3269.024\nIteration 24, inertia 3268.943" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3268.846\nIteration 26, inertia 3268.764\nIteration 27, inertia 3268.697\nIteration 28, inertia 3268.597\nIteration 29, inertia 3268.465" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3268.295\nIteration 31, inertia 3268.120\nIteration 32, inertia 3267.779\nIteration 33, inertia 3267.203\nIteration 34, inertia 3266.515\nIteration 35, inertia 3265.992" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 36, inertia 3265.674\nIteration 37, inertia 3265.235\nIteration 38, inertia 3264.315\nIteration 39, inertia 3263.987\nIteration 40, inertia 3263.929" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 41, inertia 3263.905\nIteration 42, inertia 3263.885\nIteration 43, inertia 3263.866\nIteration 44, inertia 3263.859\nIteration 45, inertia 3263.852\nConverged at iteration 45\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6326.529\nIteration 1, inertia 3294.746\nIteration 2, inertia 3282.371\nIteration 3, inertia 3276.461\nIteration 4, inertia 3273.181" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3271.013\nIteration 6, inertia 3268.783\nIteration 7, inertia 3266.648\nIteration 8, inertia 3265.133\nIteration 9, inertia 3264.077" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3263.566\nIteration 11, inertia 3263.328\nIteration 12, inertia 3263.232\nIteration 13, inertia 3263.172\nIteration 14, inertia 3263.125" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3263.087\nIteration 16, inertia 3263.064\nIteration 17, inertia 3263.053\nIteration 18, inertia 3263.047\nIteration 19, inertia 3263.044" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 19\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6396.511\nIteration 1, inertia 3292.367\nIteration 2, inertia 3280.269\nIteration 3, inertia 3275.911\nIteration 4, inertia 3272.600" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3270.273\nIteration 6, inertia 3269.109\nIteration 7, inertia 3268.377\nIteration 8, inertia 3267.638\nIteration 9, inertia 3266.541" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3265.821\nIteration 11, inertia 3265.175\nIteration 12, inertia 3264.720\nIteration 13, inertia 3264.471\nIteration 14, inertia 3264.307" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3264.199\nIteration 16, inertia 3264.110\nIteration 17, inertia 3264.035\nIteration 18, inertia 3263.980\nIteration 19, inertia 3263.934" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3263.922\nIteration 21, inertia 3263.906\nIteration 22, inertia 3263.890\nIteration 23, inertia 3263.867\nIteration 24, inertia 3263.857" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3263.845\nIteration 26, inertia 3263.827\nIteration 27, inertia 3263.818\nIteration 28, inertia 3263.816\nConverged at iteration 28\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6431.988\nIteration 1, inertia 3293.092\nIteration 2, inertia 3278.216\nIteration 3, inertia 3269.663\nIteration 4, inertia 3265.719" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3263.092\nIteration 6, inertia 3261.218\nIteration 7, inertia 3260.260\nIteration 8, inertia 3259.782\nIteration 9, inertia 3259.574" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3259.506\nIteration 11, inertia 3259.466\nIteration 12, inertia 3259.449\nIteration 13, inertia 3259.435\nIteration 14, inertia 3259.422" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 14\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6434.113\nIteration 1, inertia 3296.655\nIteration 2, inertia 3278.784\nIteration 3, inertia 3272.196\nIteration 4, inertia 3270.036" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3268.580\nIteration 6, inertia 3266.836\nIteration 7, inertia 3265.345\nIteration 8, inertia 3264.172\nIteration 9, inertia 3263.147" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3262.455\nIteration 11, inertia 3261.793\nIteration 12, inertia 3261.236\nIteration 13, inertia 3260.754\nIteration 14, inertia 3260.035\nIteration 15, inertia 3259.548" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 16, inertia 3259.407\nIteration 17, inertia 3259.335\nIteration 18, inertia 3259.323\nIteration 19, inertia 3259.319\nIteration 20, inertia 3259.313\nIteration 21, inertia 3259.307" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 22, inertia 3259.302\nIteration 23, inertia 3259.298\nIteration 24, inertia 3259.296\nConverged at iteration 24\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6421.814\nIteration 1, inertia 3300.660\nIteration 2, inertia 3287.858\nIteration 3, inertia 3281.381\nIteration 4, inertia 3276.546" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3271.531\nIteration 6, inertia 3267.330\nIteration 7, inertia 3264.234\nIteration 8, inertia 3263.418\nIteration 9, inertia 3262.728" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3262.077\nIteration 11, inertia 3261.563\nIteration 12, inertia 3261.202\nIteration 13, inertia 3260.836\nIteration 14, inertia 3260.469" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3260.095\nIteration 16, inertia 3259.766\nIteration 17, inertia 3259.590\nIteration 18, inertia 3259.492\nIteration 19, inertia 3259.396" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3259.263\nIteration 21, inertia 3259.172\nIteration 22, inertia 3259.122\nIteration 23, inertia 3259.087\nIteration 24, inertia 3259.059" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3259.021\nIteration 26, inertia 3258.983\nIteration 27, inertia 3258.919\nIteration 28, inertia 3258.870\nIteration 29, inertia 3258.826" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3258.756\nIteration 31, inertia 3258.694\nIteration 32, inertia 3258.621\nIteration 33, inertia 3258.534\nIteration 34, inertia 3258.440" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 35, inertia 3258.277\nIteration 36, inertia 3258.160\nIteration 37, inertia 3258.098\nIteration 38, inertia 3258.041\nIteration 39, inertia 3257.966" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 40, inertia 3257.909\nIteration 41, inertia 3257.860\nIteration 42, inertia 3257.774\nIteration 43, inertia 3257.727\nIteration 44, inertia 3257.694" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 45, inertia 3257.666\nIteration 46, inertia 3257.593\nIteration 47, inertia 3257.551\nIteration 48, inertia 3257.537\nConverged at iteration 48\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6373.464\nIteration 1, inertia 3297.963\nIteration 2, inertia 3287.660\nIteration 3, inertia 3282.323\nIteration 4, inertia 3279.099" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3277.759\nIteration 6, inertia 3277.064\nIteration 7, inertia 3276.650\nIteration 8, inertia 3276.232\nIteration 9, inertia 3275.737\nIteration 10, inertia 3275.473" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3275.339\nIteration 12, inertia 3275.253\nIteration 13, inertia 3275.199\nIteration 14, inertia 3275.158\nIteration 15, inertia 3275.128\nIteration 16, inertia 3275.107" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 17, inertia 3275.076\nIteration 18, inertia 3275.055\nIteration 19, inertia 3275.041\nIteration 20, inertia 3275.022\nIteration 21, inertia 3274.999\nIteration 22, inertia 3274.979" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 23, inertia 3274.960\nIteration 24, inertia 3274.942\nIteration 25, inertia 3274.931\nIteration 26, inertia 3274.926\nIteration 27, inertia 3274.922\nIteration 28, inertia 3274.920" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 28\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6289.281\nIteration 1, inertia 3304.450\nIteration 2, inertia 3288.473\nIteration 3, inertia 3282.639\nIteration 4, inertia 3280.544\nIteration 5, inertia 3279.671" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 6, inertia 3279.139\nIteration 7, inertia 3278.606\nIteration 8, inertia 3278.196\nIteration 9, inertia 3277.723\nIteration 10, inertia 3277.261" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3276.925\nIteration 12, inertia 3276.570\nIteration 13, inertia 3276.117\nIteration 14, inertia 3275.711\nIteration 15, inertia 3275.582" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 16, inertia 3275.538\nIteration 17, inertia 3275.526\nIteration 18, inertia 3275.517\nIteration 19, inertia 3275.509\nConverged at iteration 19\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6390.941\nIteration 1, inertia 3290.356\nIteration 2, inertia 3274.869\nIteration 3, inertia 3268.843\nIteration 4, inertia 3265.737" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3264.341\nIteration 6, inertia 3263.580\nIteration 7, inertia 3262.989\nIteration 8, inertia 3262.543\nIteration 9, inertia 3262.156" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3261.898\nIteration 11, inertia 3261.653\nIteration 12, inertia 3261.429\nIteration 13, inertia 3261.209\nIteration 14, inertia 3260.992" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3260.760\nIteration 16, inertia 3260.407\nIteration 17, inertia 3259.996\nIteration 18, inertia 3259.382\nIteration 19, inertia 3258.432" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3257.154\nIteration 21, inertia 3256.723\nIteration 22, inertia 3256.546\nIteration 23, inertia 3256.446\nIteration 24, inertia 3256.391" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3256.362\nIteration 26, inertia 3256.344\nIteration 27, inertia 3256.339\nConverged at iteration 27\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6432.035\nIteration 1, inertia 3304.822\nIteration 2, inertia 3291.831\nIteration 3, inertia 3281.480\nIteration 4, inertia 3275.025" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3270.730\nIteration 6, inertia 3266.021\nIteration 7, inertia 3261.621\nIteration 8, inertia 3259.239\nIteration 9, inertia 3258.382" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3257.763\nIteration 11, inertia 3257.227\nIteration 12, inertia 3256.768\nIteration 13, inertia 3256.410\nIteration 14, inertia 3256.245" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3256.139\nIteration 16, inertia 3256.045\nIteration 17, inertia 3256.003\nIteration 18, inertia 3255.975\nIteration 19, inertia 3255.955" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3255.938\nIteration 21, inertia 3255.926\nIteration 22, inertia 3255.919\nIteration 23, inertia 3255.906\nIteration 24, inertia 3255.901" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3255.899\nIteration 26, inertia 3255.897\nConverged at iteration 26\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6439.297\nIteration 1, inertia 3292.780\nIteration 2, inertia 3279.272\nIteration 3, inertia 3275.342\nIteration 4, inertia 3271.297" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3266.407\nIteration 6, inertia 3264.193\nIteration 7, inertia 3262.548\nIteration 8, inertia 3261.671\nIteration 9, inertia 3260.768" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3259.996\nIteration 11, inertia 3259.212\nIteration 12, inertia 3258.566\nIteration 13, inertia 3258.245\nIteration 14, inertia 3258.081" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3257.916\nIteration 16, inertia 3257.788\nIteration 17, inertia 3257.724\nIteration 18, inertia 3257.663\nIteration 19, inertia 3257.642" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3257.620\nIteration 21, inertia 3257.606\nIteration 22, inertia 3257.599\nIteration 23, inertia 3257.597\nIteration 24, inertia 3257.592" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 24\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6437.135\nIteration 1, inertia 3308.062\nIteration 2, inertia 3296.359\nIteration 3, inertia 3288.127\nIteration 4, inertia 3284.844" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3282.816\nIteration 6, inertia 3280.496\nIteration 7, inertia 3277.755\nIteration 8, inertia 3274.709\nIteration 9, inertia 3271.397" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3269.900\nIteration 11, inertia 3269.041\nIteration 12, inertia 3268.558\nIteration 13, inertia 3268.149\nIteration 14, inertia 3267.920" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3267.757\nIteration 16, inertia 3267.569\nIteration 17, inertia 3267.379\nIteration 18, inertia 3267.232\nIteration 19, inertia 3267.083" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3266.887\nIteration 21, inertia 3266.684\nIteration 22, inertia 3266.575\nIteration 23, inertia 3266.486\nIteration 24, inertia 3266.413" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3266.331\nIteration 26, inertia 3266.293\nIteration 27, inertia 3266.268\nIteration 28, inertia 3266.235\nIteration 29, inertia 3266.214" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3266.203\nIteration 31, inertia 3266.192\nIteration 32, inertia 3266.186\nIteration 33, inertia 3266.183\nConverged at iteration 33\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6493.097\nIteration 1, inertia 3302.676\nIteration 2, inertia 3285.066\nIteration 3, inertia 3278.241\nIteration 4, inertia 3274.562" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3270.829\nIteration 6, inertia 3265.238\nIteration 7, inertia 3261.167\nIteration 8, inertia 3259.118\nIteration 9, inertia 3258.502" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3258.201\nIteration 11, inertia 3257.948\nIteration 12, inertia 3257.797\nIteration 13, inertia 3257.716\nIteration 14, inertia 3257.673\nIteration 15, inertia 3257.666" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 15\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6359.146\nIteration 1, inertia 3291.541\nIteration 2, inertia 3279.445\nIteration 3, inertia 3275.558\nIteration 4, inertia 3273.488" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3272.191\nIteration 6, inertia 3271.287\nIteration 7, inertia 3270.702\nIteration 8, inertia 3270.374\nIteration 9, inertia 3270.197" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3269.949\nIteration 11, inertia 3269.697\nIteration 12, inertia 3269.348\nIteration 13, inertia 3268.820\nIteration 14, inertia 3267.955" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3266.767\nIteration 16, inertia 3265.877\nIteration 17, inertia 3265.359\nIteration 18, inertia 3264.872\nIteration 19, inertia 3264.386" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3263.777\nIteration 21, inertia 3263.350\nIteration 22, inertia 3262.954\nIteration 23, inertia 3262.645\nIteration 24, inertia 3262.343" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3262.119\nIteration 26, inertia 3262.012\nIteration 27, inertia 3261.943\nIteration 28, inertia 3261.875\nIteration 29, inertia 3261.808" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3261.770\nIteration 31, inertia 3261.744\nIteration 32, inertia 3261.707\nIteration 33, inertia 3261.679\nIteration 34, inertia 3261.674" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 35, inertia 3261.669\nIteration 36, inertia 3261.667\nConverged at iteration 36\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6373.946\nIteration 1, inertia 3294.749\nIteration 2, inertia 3278.626\nIteration 3, inertia 3273.958\nIteration 4, inertia 3271.969" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3270.800\nIteration 6, inertia 3269.873\nIteration 7, inertia 3269.060\nIteration 8, inertia 3268.193\nIteration 9, inertia 3267.473" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3266.822\nIteration 11, inertia 3266.335\nIteration 12, inertia 3266.065\nIteration 13, inertia 3265.876\nIteration 14, inertia 3265.720\nIteration 15, inertia 3265.663" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 16, inertia 3265.627\nIteration 17, inertia 3265.610\nIteration 18, inertia 3265.577\nIteration 19, inertia 3265.549\nIteration 20, inertia 3265.523" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 21, inertia 3265.513\nIteration 22, inertia 3265.503\nIteration 23, inertia 3265.497\nConverged at iteration 23\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6454.118\nIteration 1, inertia 3303.824\nIteration 2, inertia 3288.688\nIteration 3, inertia 3282.998\nIteration 4, inertia 3279.922" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3278.183\nIteration 6, inertia 3276.889\nIteration 7, inertia 3275.991\nIteration 8, inertia 3275.039\nIteration 9, inertia 3273.694" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3272.089\nIteration 11, inertia 3270.481\nIteration 12, inertia 3269.142\nIteration 13, inertia 3267.853\nIteration 14, inertia 3266.220" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3264.370\nIteration 16, inertia 3262.774\nIteration 17, inertia 3261.495\nIteration 18, inertia 3260.136\nIteration 19, inertia 3258.555" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3256.940\nIteration 21, inertia 3256.170\nIteration 22, inertia 3255.746\nIteration 23, inertia 3255.497\nIteration 24, inertia 3255.385" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3255.340\nIteration 26, inertia 3255.299\nIteration 27, inertia 3255.283\nConverged at iteration 27\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6362.969\nIteration 1, inertia 3296.454\nIteration 2, inertia 3282.673\nIteration 3, inertia 3275.059\nIteration 4, inertia 3269.156" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3264.227\nIteration 6, inertia 3259.917\nIteration 7, inertia 3257.108\nIteration 8, inertia 3256.442\nIteration 9, inertia 3256.069" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3255.857\nIteration 11, inertia 3255.774\nIteration 12, inertia 3255.708\nIteration 13, inertia 3255.674\nIteration 14, inertia 3255.650" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3255.635\nIteration 16, inertia 3255.631\nIteration 17, inertia 3255.629\nConverged at iteration 17\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6476.107\nIteration 1, inertia 3296.095\nIteration 2, inertia 3282.579\nIteration 3, inertia 3276.890\nIteration 4, inertia 3272.801\nIteration 5, inertia 3268.908" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 6, inertia 3266.789\nIteration 7, inertia 3265.977\nIteration 8, inertia 3265.409\nIteration 9, inertia 3264.982\nIteration 10, inertia 3264.650\nIteration 11, inertia 3264.401" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 12, inertia 3264.138\nIteration 13, inertia 3263.900\nIteration 14, inertia 3263.748\nIteration 15, inertia 3263.628\nIteration 16, inertia 3263.528" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 17, inertia 3263.422\nIteration 18, inertia 3263.345\nIteration 19, inertia 3263.335\nIteration 20, inertia 3263.326\nIteration 21, inertia 3263.324" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 21\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6467.892\nIteration 1, inertia 3299.482\nIteration 2, inertia 3284.474\nIteration 3, inertia 3276.773\nIteration 4, inertia 3273.421" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3271.134\nIteration 6, inertia 3269.243\nIteration 7, inertia 3268.631\nIteration 8, inertia 3268.409\nIteration 9, inertia 3268.296" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3268.184\nIteration 11, inertia 3268.000\nIteration 12, inertia 3267.834\nIteration 13, inertia 3267.674\nIteration 14, inertia 3267.473" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3267.362\nIteration 16, inertia 3267.273\nIteration 17, inertia 3267.147\nIteration 18, inertia 3267.035\nIteration 19, inertia 3266.914" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3266.829\nIteration 21, inertia 3266.699\nIteration 22, inertia 3266.545\nIteration 23, inertia 3266.270\nIteration 24, inertia 3265.958" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3265.560\nIteration 26, inertia 3265.069\nIteration 27, inertia 3264.684\nIteration 28, inertia 3264.510\nIteration 29, inertia 3264.421" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3264.306\nIteration 31, inertia 3264.165\nIteration 32, inertia 3264.036\nIteration 33, inertia 3263.952\nIteration 34, inertia 3263.910" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 35, inertia 3263.856\nIteration 36, inertia 3263.814\nIteration 37, inertia 3263.778\nIteration 38, inertia 3263.729\nIteration 39, inertia 3263.623" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 40, inertia 3263.525\nIteration 41, inertia 3263.408\nIteration 42, inertia 3263.292\nIteration 43, inertia 3263.134\nIteration 44, inertia 3262.944" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 45, inertia 3262.742\nIteration 46, inertia 3262.450\nIteration 47, inertia 3261.958\nIteration 48, inertia 3260.961\nIteration 49, inertia 3259.360" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 50, inertia 3258.312\nIteration 51, inertia 3257.919\nIteration 52, inertia 3257.750\nIteration 53, inertia 3257.643\nIteration 54, inertia 3257.588\nIteration 55, inertia 3257.580" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 55\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6420.248\nIteration 1, inertia 3304.572\nIteration 2, inertia 3289.501\nIteration 3, inertia 3282.402\nIteration 4, inertia 3278.539" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3276.338\nIteration 6, inertia 3274.250\nIteration 7, inertia 3272.702\nIteration 8, inertia 3270.959\nIteration 9, inertia 3269.232\nIteration 10, inertia 3267.949" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3266.887\nIteration 12, inertia 3265.973\nIteration 13, inertia 3265.242\nIteration 14, inertia 3264.568\nIteration 15, inertia 3264.087\nIteration 16, inertia 3263.834" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 17, inertia 3263.631\nIteration 18, inertia 3263.505\nIteration 19, inertia 3263.451\nIteration 20, inertia 3263.379\nIteration 21, inertia 3263.328\nIteration 22, inertia 3263.294" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 23, inertia 3263.249\nIteration 24, inertia 3263.226\nIteration 25, inertia 3263.212\nIteration 26, inertia 3263.198\nIteration 27, inertia 3263.185\nIteration 28, inertia 3263.176" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 29, inertia 3263.173\nIteration 30, inertia 3263.171\nConverged at iteration 30\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6400.961\nIteration 1, inertia 3298.251\nIteration 2, inertia 3280.432\nIteration 3, inertia 3275.345\nIteration 4, inertia 3273.142\nIteration 5, inertia 3271.588" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 6, inertia 3269.971\nIteration 7, inertia 3268.344\nIteration 8, inertia 3267.296\nIteration 9, inertia 3266.664\nIteration 10, inertia 3265.748" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3264.808\nIteration 12, inertia 3263.649\nIteration 13, inertia 3262.882\nIteration 14, inertia 3262.461\nIteration 15, inertia 3262.228" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 16, inertia 3262.058\nIteration 17, inertia 3261.915\nIteration 18, inertia 3261.792\nIteration 19, inertia 3261.680\nIteration 20, inertia 3261.592" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 21, inertia 3261.520\nIteration 22, inertia 3261.401\nIteration 23, inertia 3261.279\nIteration 24, inertia 3261.215\nIteration 25, inertia 3261.126" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 26, inertia 3261.046\nIteration 27, inertia 3260.992\nIteration 28, inertia 3260.953\nIteration 29, inertia 3260.912\nIteration 30, inertia 3260.862" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 31, inertia 3260.815\nIteration 32, inertia 3260.791\nIteration 33, inertia 3260.779\nIteration 34, inertia 3260.773\nIteration 35, inertia 3260.761" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 36, inertia 3260.741\nIteration 37, inertia 3260.718\nIteration 38, inertia 3260.701\nIteration 39, inertia 3260.698\nIteration 40, inertia 3260.688\nIteration 41, inertia 3260.677" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 42, inertia 3260.672\nConverged at iteration 42\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6472.336\nIteration 1, inertia 3303.275\nIteration 2, inertia 3284.138\nIteration 3, inertia 3274.154\nIteration 4, inertia 3268.411" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3265.076\nIteration 6, inertia 3262.077\nIteration 7, inertia 3261.408\nIteration 8, inertia 3260.914\nIteration 9, inertia 3260.573" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3260.182\nIteration 11, inertia 3259.746\nIteration 12, inertia 3259.141\nIteration 13, inertia 3258.615\nIteration 14, inertia 3258.188" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3257.699\nIteration 16, inertia 3257.071\nIteration 17, inertia 3256.768\nIteration 18, inertia 3256.620\nIteration 19, inertia 3256.475" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3256.358\nIteration 21, inertia 3256.205\nIteration 22, inertia 3256.133\nIteration 23, inertia 3256.099\nIteration 24, inertia 3256.074" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3256.063\nIteration 26, inertia 3256.057\nIteration 27, inertia 3256.055\nIteration 28, inertia 3256.053\nConverged at iteration 28\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6409.635\nIteration 1, inertia 3306.802\nIteration 2, inertia 3292.770\nIteration 3, inertia 3282.228\nIteration 4, inertia 3274.919" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3269.284\nIteration 6, inertia 3265.479\nIteration 7, inertia 3262.476\nIteration 8, inertia 3260.595\nIteration 9, inertia 3259.696" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3259.101\nIteration 11, inertia 3258.481\nIteration 12, inertia 3258.167\nIteration 13, inertia 3257.964\nIteration 14, inertia 3257.725" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3257.538\nIteration 16, inertia 3257.429\nIteration 17, inertia 3257.344\nIteration 18, inertia 3257.202\nIteration 19, inertia 3257.062" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3256.865\nIteration 21, inertia 3256.692\nIteration 22, inertia 3256.549\nIteration 23, inertia 3256.403\nIteration 24, inertia 3256.245" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 25, inertia 3256.127\nIteration 26, inertia 3256.025\nIteration 27, inertia 3255.952\nIteration 28, inertia 3255.853\nIteration 29, inertia 3255.769" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 30, inertia 3255.630\nIteration 31, inertia 3255.571\nIteration 32, inertia 3255.543\nIteration 33, inertia 3255.516\nIteration 34, inertia 3255.496" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 35, inertia 3255.489\nConverged at iteration 35\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6414.364\nIteration 1, inertia 3292.636\nIteration 2, inertia 3274.091\nIteration 3, inertia 3266.486\nIteration 4, inertia 3263.416" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3261.789\nIteration 6, inertia 3260.794\nIteration 7, inertia 3260.258\nIteration 8, inertia 3259.941\nIteration 9, inertia 3259.658\nIteration 10, inertia 3259.351" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3258.914\nIteration 12, inertia 3258.190\nIteration 13, inertia 3257.195\nIteration 14, inertia 3256.270\nIteration 15, inertia 3255.707" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 16, inertia 3255.556\nIteration 17, inertia 3255.521\nIteration 18, inertia 3255.483\nIteration 19, inertia 3255.469\nIteration 20, inertia 3255.460\nIteration 21, inertia 3255.456" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 21\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6324.895\nIteration 1, inertia 3293.965\nIteration 2, inertia 3275.830\nIteration 3, inertia 3267.741\nIteration 4, inertia 3263.209" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3261.451\nIteration 6, inertia 3260.725\nIteration 7, inertia 3260.367\nIteration 8, inertia 3260.137\nIteration 9, inertia 3259.991" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3259.924\nIteration 11, inertia 3259.890\nIteration 12, inertia 3259.877\nIteration 13, inertia 3259.861\nConverged at iteration 13\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6439.038\nIteration 1, inertia 3291.442\nIteration 2, inertia 3276.028\nIteration 3, inertia 3271.637\nIteration 4, inertia 3269.695" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3268.859\nIteration 6, inertia 3268.340\nIteration 7, inertia 3267.780\nIteration 8, inertia 3267.261\nIteration 9, inertia 3266.530" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3265.668\nIteration 11, inertia 3264.816\nIteration 12, inertia 3263.986\nIteration 13, inertia 3263.582\nIteration 14, inertia 3263.172" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3262.976\nIteration 16, inertia 3262.861\nIteration 17, inertia 3262.783\nIteration 18, inertia 3262.751\nIteration 19, inertia 3262.726" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3262.708\nIteration 21, inertia 3262.699\nConverged at iteration 21\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6458.746\nIteration 1, inertia 3309.368\nIteration 2, inertia 3296.435\nIteration 3, inertia 3288.927\nIteration 4, inertia 3282.518" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3275.289\nIteration 6, inertia 3267.311\nIteration 7, inertia 3264.367\nIteration 8, inertia 3263.004\nIteration 9, inertia 3262.378\nIteration 10, inertia 3261.967" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3261.658\nIteration 12, inertia 3261.507\nIteration 13, inertia 3261.294\nIteration 14, inertia 3261.093\nIteration 15, inertia 3260.902\nIteration 16, inertia 3260.740" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 17, inertia 3260.652\nIteration 18, inertia 3260.585\nIteration 19, inertia 3260.539\nIteration 20, inertia 3260.491\nIteration 21, inertia 3260.454\nIteration 22, inertia 3260.426" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 23, inertia 3260.412\nIteration 24, inertia 3260.405\nIteration 25, inertia 3260.402\nIteration 26, inertia 3260.398\nIteration 27, inertia 3260.390\nIteration 28, inertia 3260.382" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 29, inertia 3260.380\nIteration 30, inertia 3260.376\nConverged at iteration 30\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6350.535\nIteration 1, inertia 3291.919\nIteration 2, inertia 3279.374\nIteration 3, inertia 3273.346\nIteration 4, inertia 3269.117\nIteration 5, inertia 3266.915" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 6, inertia 3265.431\nIteration 7, inertia 3264.712\nIteration 8, inertia 3264.349\nIteration 9, inertia 3264.067\nIteration 10, inertia 3263.850\nIteration 11, inertia 3263.726" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 12, inertia 3263.650\nIteration 13, inertia 3263.619\nIteration 14, inertia 3263.607\nIteration 15, inertia 3263.597\nConverged at iteration 15\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6456.248\nIteration 1, inertia 3300.444\nIteration 2, inertia 3283.503\nIteration 3, inertia 3276.788\nIteration 4, inertia 3274.204" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3272.677\nIteration 6, inertia 3271.439\nIteration 7, inertia 3270.415\nIteration 8, inertia 3269.341\nIteration 9, inertia 3268.165\nIteration 10, inertia 3267.504" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3267.135\nIteration 12, inertia 3266.829\nIteration 13, inertia 3266.572\nIteration 14, inertia 3266.337\nIteration 15, inertia 3266.077" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 16, inertia 3265.841\nIteration 17, inertia 3265.544\nIteration 18, inertia 3265.359\nIteration 19, inertia 3265.181\nIteration 20, inertia 3265.045" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 21, inertia 3264.936\nIteration 22, inertia 3264.811\nIteration 23, inertia 3264.654\nIteration 24, inertia 3264.496\nIteration 25, inertia 3264.081" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 26, inertia 3263.339\nIteration 27, inertia 3261.533\nIteration 28, inertia 3258.654\nIteration 29, inertia 3256.621\nIteration 30, inertia 3255.979" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 31, inertia 3255.643\nIteration 32, inertia 3255.477\nIteration 33, inertia 3255.403\nIteration 34, inertia 3255.360\nIteration 35, inertia 3255.335" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nConverged at iteration 35\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6451.563\nIteration 1, inertia 3304.684\nIteration 2, inertia 3285.713\nIteration 3, inertia 3279.365\nIteration 4, inertia 3277.067" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3275.508\nIteration 6, inertia 3274.519\nIteration 7, inertia 3273.507\nIteration 8, inertia 3272.746\nIteration 9, inertia 3272.162\nIteration 10, inertia 3271.657" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3271.264\nIteration 12, inertia 3270.956\nIteration 13, inertia 3270.540\nIteration 14, inertia 3270.082\nIteration 15, inertia 3269.869" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 16, inertia 3269.726\nIteration 17, inertia 3269.584\nIteration 18, inertia 3269.468\nIteration 19, inertia 3269.352\nIteration 20, inertia 3269.178" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 21, inertia 3269.011\nIteration 22, inertia 3268.723\nIteration 23, inertia 3268.353\nIteration 24, inertia 3267.843\nIteration 25, inertia 3267.215\nIteration 26, inertia 3266.362" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 27, inertia 3265.584\nIteration 28, inertia 3265.157\nIteration 29, inertia 3264.786\nIteration 30, inertia 3264.364\nIteration 31, inertia 3263.901\nIteration 32, inertia 3263.552" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 33, inertia 3263.260\nIteration 34, inertia 3262.937\nIteration 35, inertia 3262.485\nIteration 36, inertia 3261.695\nIteration 37, inertia 3261.107\nIteration 38, inertia 3260.828" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 39, inertia 3260.594\nIteration 40, inertia 3260.428\nIteration 41, inertia 3260.389\nIteration 42, inertia 3260.367\nIteration 43, inertia 3260.365" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 44, inertia 3260.359\nConverged at iteration 44\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6405.600\nIteration 1, inertia 3302.004\nIteration 2, inertia 3283.203\nIteration 3, inertia 3276.145\nIteration 4, inertia 3273.083\nIteration 5, inertia 3271.498" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 6, inertia 3270.418\nIteration 7, inertia 3269.699\nIteration 8, inertia 3268.915\nIteration 9, inertia 3267.884\nIteration 10, inertia 3266.646" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 11, inertia 3265.083\nIteration 12, inertia 3263.472\nIteration 13, inertia 3262.431\nIteration 14, inertia 3261.918\nIteration 15, inertia 3261.636" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 16, inertia 3261.445\nIteration 17, inertia 3261.310\nIteration 18, inertia 3261.224\nIteration 19, inertia 3261.135\nIteration 20, inertia 3261.059\nIteration 21, inertia 3261.018" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 22, inertia 3260.983\nIteration 23, inertia 3260.947\nIteration 24, inertia 3260.900\nIteration 25, inertia 3260.840\nIteration 26, inertia 3260.790" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 27, inertia 3260.764\nIteration 28, inertia 3260.743\nIteration 29, inertia 3260.738\nConverged at iteration 29\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6448.216\nIteration 1, inertia 3298.831\nIteration 2, inertia 3279.635\nIteration 3, inertia 3269.284\nIteration 4, inertia 3263.260" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3259.594\nIteration 6, inertia 3257.439\nIteration 7, inertia 3256.139\nIteration 8, inertia 3255.675\nIteration 9, inertia 3255.538" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3255.445\nIteration 11, inertia 3255.393\nIteration 12, inertia 3255.364\nIteration 13, inertia 3255.356\nIteration 14, inertia 3255.344" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3255.334\nConverged at iteration 15\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6455.246\nIteration 1, inertia 3306.953\nIteration 2, inertia 3294.150\nIteration 3, inertia 3287.016\nIteration 4, inertia 3283.105" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3280.206\nIteration 6, inertia 3277.649\nIteration 7, inertia 3275.314\nIteration 8, inertia 3273.816\nIteration 9, inertia 3272.719" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3271.792\nIteration 11, inertia 3270.814\nIteration 12, inertia 3270.039\nIteration 13, inertia 3269.696\nIteration 14, inertia 3269.384" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3269.025\nIteration 16, inertia 3268.540\nIteration 17, inertia 3268.051\nIteration 18, inertia 3267.514\nIteration 19, inertia 3267.302\nIteration 20, inertia 3267.222" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 21, inertia 3267.177\nIteration 22, inertia 3267.135\nIteration 23, inertia 3267.080\nIteration 24, inertia 3266.960\nIteration 25, inertia 3266.678\nIteration 26, inertia 3265.716" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 27, inertia 3262.812\nIteration 28, inertia 3257.784\nIteration 29, inertia 3256.421\nIteration 30, inertia 3255.818\nIteration 31, inertia 3255.614\nIteration 32, inertia 3255.518" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 33, inertia 3255.469\nIteration 34, inertia 3255.443\nIteration 35, inertia 3255.435\nIteration 36, inertia 3255.429\nIteration 37, inertia 3255.420\nIteration 38, inertia 3255.416" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 39, inertia 3255.409\nIteration 40, inertia 3255.399\nIteration 41, inertia 3255.378\nIteration 42, inertia 3255.365\nIteration 43, inertia 3255.355" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 44, inertia 3255.347\nIteration 45, inertia 3255.345\nIteration 46, inertia 3255.342\nIteration 47, inertia 3255.340\nConverged at iteration 47\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6373.585\nIteration 1, inertia 3295.265\nIteration 2, inertia 3276.429\nIteration 3, inertia 3270.790\nIteration 4, inertia 3269.210" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3268.392\nIteration 6, inertia 3267.849\nIteration 7, inertia 3267.406\nIteration 8, inertia 3267.006\nIteration 9, inertia 3266.540" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3266.094\nIteration 11, inertia 3265.727\nIteration 12, inertia 3265.176\nIteration 13, inertia 3264.168\nIteration 14, inertia 3262.569" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3261.010\nIteration 16, inertia 3260.253\nIteration 17, inertia 3260.028\nIteration 18, inertia 3259.907\nIteration 19, inertia 3259.861" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3259.830\nIteration 21, inertia 3259.785\nIteration 22, inertia 3259.758\nIteration 23, inertia 3259.755\nConverged at iteration 23\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6354.581\nIteration 1, inertia 3307.480\nIteration 2, inertia 3294.591\nIteration 3, inertia 3286.870\nIteration 4, inertia 3283.171" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3280.286\nIteration 6, inertia 3277.624\nIteration 7, inertia 3275.121\nIteration 8, inertia 3272.140\nIteration 9, inertia 3269.519" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3267.144\nIteration 11, inertia 3264.701\nIteration 12, inertia 3262.442\nIteration 13, inertia 3260.466\nIteration 14, inertia 3258.164" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 15, inertia 3257.111\nIteration 16, inertia 3256.494\nIteration 17, inertia 3255.938\nIteration 18, inertia 3255.690\nIteration 19, inertia 3255.623" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 20, inertia 3255.598\nIteration 21, inertia 3255.591\nIteration 22, inertia 3255.587\nIteration 23, inertia 3255.583\nConverged at iteration 23\nInitialization complete" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 0, inertia 6456.341\nIteration 1, inertia 3299.840\nIteration 2, inertia 3286.698\nIteration 3, inertia 3281.930\nIteration 4, inertia 3279.365" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 5, inertia 3275.912\nIteration 6, inertia 3271.700\nIteration 7, inertia 3268.976\nIteration 8, inertia 3267.243\nIteration 9, inertia 3266.373" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 10, inertia 3265.959\nIteration 11, inertia 3265.614\nIteration 12, inertia 3265.320\nIteration 13, inertia 3265.040\nIteration 14, inertia 3264.620\n" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "Iteration 15, inertia 3264.257\nIteration 16, inertia 3264.017\nIteration 17, inertia 3263.875\nIteration 18, inertia 3263.794\nIteration 19, inertia 3263.725\nIteration 20, inertia 3263.691" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 21, inertia 3263.666\nIteration 22, inertia 3263.640\nIteration 23, inertia 3263.625\nIteration 24, inertia 3263.621\nIteration 25, inertia 3263.610\nIteration 26, inertia 3263.607" | |
}, | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "\nIteration 27, inertia 3263.604\nConverged at iteration 27\n" | |
}, | |
{ | |
"output_type": "pyout", | |
"prompt_number": 109, | |
"metadata": {}, | |
"text": "[39, 0.40027932557045898]" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "40% accuracy using 39 clusters is only marginally better than our model with 5 clusters, so we will definitely choose the simpler model moving forward. Remember, though, that these results are still measured `in sample`, and are probably better than we can expect on real data. " | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "heading", | |
"source": "Solve a real problem", | |
"level": 2 | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Now we are at the stage where we can recommend similar articles to the user. This could be implemented as part of the search algorithm, or simply as a list of recommended posts to read after the current page.\n\nWe first need to vectorize the new post before we predict its label." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "new_post = '''hard drives can fail at any time,\n it is important to always backup your data.'''\n \nnew_post_vec = vec.transform([new_post])\nnew_post_label = km.predict(new_post_vec)[0] # predict the class it belongs to", | |
"prompt_number": 114, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Select all posts with the same cluster label as the new post vector\nsimilar_label = (km.labels_ == new_post_label).nonzero()[0]", | |
"prompt_number": 115, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "markdown", | |
"source": "Now, among the posts that share the new post's cluster label, we build a list of similarity scores (Euclidean distances to the new post, where a smaller distance means a more similar post), as we did in the earlier examples." | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Rank the posts in the same cluster by Euclidean distance to the new post\nsimilar = []\nfor i in similar_label:\n    dist = sp.linalg.norm(new_post_vec - vecData[i].toarray())\n    similar.append((dist, train_data.target[i], train_data.data[i]))\nsimilar = sorted(similar)  # smallest distance (most similar) first\nprint(len(similar))", | |
"prompt_number": 116, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "175\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "# Present the most similar post (the first entry has the smallest distance)\nprint(similar[0])", | |
"prompt_number": 117, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": "(1.1757159813728066, 2, 'From: gjp@sei.cmu.edu (George Pandelios)\\nSubject: Help me select a Backup Solution\\n\\n\\nHi Netters!\\n\\nI\\'m looking at purchasing some sort of backup solution. After you read about\\nmy situation, I\\'d like your opinion. Here\\'s the scenario:\\n\\n1. There are two computers in the house. One is a small 286 (40MB IDE drive).\\n The other is a 386DX (213 SCSI drive w/ Adaptec 1522 controller). Both \\n systems have PC TOOLS and will use Central Point Backup as the backup / \\n restore program. Both systems have 3.5\" and 5.25\" floppies.\\n\\n2. The computers are not networked (nor will they be anytime soon).\\n\\nFrom what I have seen so far, there appear to be at least 4 possible\\nsolutions (I\\'m sure there are others I haven\\'t thought about). For these \\noptions, I would appreciate hearing from anyone who has tried them or sees \\nany flaws (drive type X won\\'t coexist with device Y, etc.) in my thinking \\n(I don\\'t know very much about these beasts):\\n\\n1. Put 2.88MB floppy drives (or a combination drive) on each system.\\n Can someone supply cost and brand information? What\\'s a good brand?\\n What do the floppies themselves cost?\\n\\n\\n2. Put an internal tape backup unit on the 386 using my SCSI adapter, and\\n continue to back up the 286 with floppies. Again, can someone recommend a\\n few manufacturers? The only brand I remember is Colorado Memories. Any\\n happy or unhappy users (I know about the compression controversy)?\\n \\n\\n3. Connect an external tape backup unit on the 386 using my SCSI adapter, and\\n (maybe?) connect it to the 286 somehow (any suggestions?)\\n\\n\\n4. Install a Floptical drive in each machine. Again, any gotcha\\'s or \\n recommendations for manufacturers? \\n\\nI appreciate your help. You may either post or send me e-mail. 
I will\\nsummarize all responses for the net.\\n\\nThanks,\\n\\nGeorge\\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\\n George J. Pandelios\\t\\t\\t\\tInternet: gjp@sei.cmu.edu\\n Software Engineering Institute\\t\\tusenet:\\t sei!gjp\\n 4500 Fifth Avenue\\t\\t\\t\\tVoice:\\t (412) 268-7186\\n Pittsburgh, PA 15213\\t\\t\\t\\tFAX:\\t (412) 268-5758\\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\\nDisclaimer: These opinions are my own and do not reflect those of the\\n\\t Software Engineering Institute, its sponsors, customers, \\n\\t clients, affiliates, or Carnegie Mellon University. In fact,\\n\\t any resemblence of these opinions to any individual, living\\n\\t or dead, fictional or real, is purely coincidental. So there.\\n=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\\n')\n" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "from IPython.core.display import HTML\n\n\ndef css_styling():\n styles = open(\"/users/ryankelly/desktop/custom_notebook.css\", \"r\").read()\n\n return HTML(styles)\ncss_styling()\n", | |
"prompt_number": 122, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"html": "\n<style>\nbody {\n font-family: Century Gothic, sans;\n\n}\n\n\ndiv.text_cell_render h1 { /* Main titles bigger, centered */\nfont-size: 2.2em;\nline-height:1.4em;\ntext-align:center;\n}\n\n/*Input and output cells formatting*/\ndiv.prompt.input_prompt, div.prompt.output_prompt {\n visibility: hidden;\n /*font-family: Consolas;*/\n color: #575748;\n /*background-color: #CCCCCC;*/\n border: 0px;\n width: 6.5em;\n float:left;\n}\n\n\ndiv.output_subarea.output_text.output_stream.output_stdout,div.output_subarea.output_text {\n margin-left: 1.5em;\n padding-top: 1em;\n padding-bottom: 0.5em;\n margin-top: 8px; /*This is for getting the box-shadow property of the parent to display properly;*/\n}\n\ndiv.cell { /* Tunes the space between cells */\nmargin-top:1em;\nmargin-bottom:1em;\nwidth:100%;\nmargin-right:auto;\noverflow-x:hidden;\n}\n\ndiv.text_cell_render{\n overflow-x:hidden;\n \n}\n\n\ndiv.input{\nmargin-right:1%;\n}\n\n</style>\n \n\n\n\n", | |
"metadata": {}, | |
"prompt_number": 122, | |
"text": "<IPython.core.display.HTML at 0x11f0b5a10>" | |
} | |
], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
}, | |
{ | |
"metadata": {}, | |
"cell_type": "code", | |
"input": "def social():\n code = \"\"\"\n <a style='float:left; margin-right:5px;' href=\"https://twitter.com/share\" class=\"twitter-share-button\" data-text=\"Check this out\" data-via=\"Ryanmdk\">Tweet</a>\n<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>\n <a style='float:left; margin-right:5px;' href=\"https://twitter.com/Ryanmdk\" class=\"twitter-follow-button\" data-show-count=\"false\">Follow @Ryanmdk</a>\n<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>\n <a style='float:left; margin-right:5px;'target='_parent' href=\"http://www.reddit.com/submit\" onclick=\"window.location = 'http://www.reddit.com/submit?url=' + encodeURIComponent(window.location); return false\"> <img src=\"http://www.reddit.com/static/spreddit7.gif\" alt=\"submit to reddit\" border=\"0\" /> </a>\n<script src=\"//platform.linkedin.com/in.js\" type=\"text/javascript\">\n lang: en_US\n</script>\n<script type=\"IN/Share\"></script>\n\"\"\"\n return HTML(code)", | |
"prompt_number": 121, | |
"outputs": [], | |
"language": "python", | |
"trusted": true, | |
"collapsed": false | |
} | |
], | |
"metadata": {} | |
} | |
], | |
"metadata": { | |
"gist_id": "7c54600b9c2af68914b3", | |
"name": "", | |
"signature": "sha256:f93da21c0dee31a8cb4b36b501cb0663551fcfdc019d52aa013a8864cdc3a526" | |
}, | |
"nbformat": 3 | |
} |
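The recommendation step in the notebook (predict the new post's cluster, keep only the posts with that label, then rank them by distance) can be sketched without any libraries. This is a minimal illustration under stated assumptions, not the notebook's own code: the helper names `euclidean`, `nearest_centroid`, and `recommend`, and the toy vectors, labels, and centroids, are all hypothetical stand-ins for the vectorized posts and the fitted k-means model.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec (the idea behind a k-means predict step)."""
    return min(range(len(centroids)), key=lambda k: euclidean(vec, centroids[k]))


def recommend(new_vec, vectors, labels, centroids, top_n=3):
    """Return (distance, index) pairs for the top_n closest posts in new_vec's cluster."""
    cluster = nearest_centroid(new_vec, centroids)
    # Only posts that share the predicted cluster label are candidates.
    candidates = [i for i, label in enumerate(labels) if label == cluster]
    # Sorting the (distance, index) pairs puts the most similar post first.
    ranked = sorted((euclidean(new_vec, vectors[i]), i) for i in candidates)
    return ranked[:top_n]


# Hypothetical toy data: four "posts" as 2-d vectors that fall into two clusters.
vectors = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels = [0, 0, 1, 1]
centroids = [[0.0, 0.5], [5.5, 5.0]]

result = recommend([5.0, 4.0], vectors, labels, centroids, top_n=2)
print(result)
```

Restricting the ranking to one cluster means only a fraction of the posts need a distance computation. The trade-off is that a close post assigned to a neighbouring cluster is never considered.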