Zsolt Pocsaji tinychaos42

## OptimizelyCLA
OPTIMIZELY CONTRIBUTION LICENSE AGREEMENT

This contribution license agreement (“Agreement”) is an agreement between you
and Optimizely North America Inc., and grants certain rights to Optimizely
North America Inc. and its affiliates (collectively, “Optimizely”) with respect
to your open-source contributions to Optimizely’s Repositories.  This Agreement
is effective on the date of your acceptance and is confirmed by you Submitting
Contributions.

1.  Definitions – (and words denoting the singular includes the plural and vice

## gist:864ee951b224cd8fbc125aba95e816f2
13  2016-04-18 17:16:55
222 2016-04-19 09:51:55
287 2016-04-19 09:54:30
801 2016-04-21 16:52:54
803 2016-04-21 16:52:54
812 2016-04-21 16:53:02
816 2016-04-21 16:54:08
1114    2016-04-21 17:39:34
1415    2016-04-21 19:49:07
1471    2016-04-24 19:01:09

## gist:dd3d4187ebd3080035ad3d464739e5bc
test

## gist:6038877
NOI ?
<nonident/>
Web site does not collect identified data.

ADM not needed
<admin/>
Web Site and System Administration: Information may be used for the technical support of the Web site and its computer system. This would include processing computer account information, information used in the course of securing and maintaining the site, and verification of Web site activity by the site or its agents.

DEV ?
<develop/>

## clusters
Creating word bags...
Calculating index numbers...
Checking top keywords in each document...
Checking correlations...
Creating clusters based on the correlations...
Swapping document id-s with titles for readability...

Cluster 1 contents:
	The Queen toasts Barack Obama and special relationship with the US
	Obama gives message of support to the Queen at lavish state banquet

## algorithm.textile

      
tinychaos42 / algorithm.textile
Created January 17, 2012 21:58
Explanation of the clustering algorithm

Task 3: Article Clustering

The algorithm I used is essentially tf-idf (term frequency-inverse document frequency). The idea is that for each term in each document, it calculates two values. The first is the term frequency: the number of occurrences of the term in that specific document, normalized by the length of the document. The second is the inverse document frequency, which measures how rare the term is across the whole document store: the logarithm of the size of the document store divided by the number of documents in which the term occurs. After some research I concluded that this algorithm suits the task well, can be programmed in a clean and readable way, and is not overly complex.
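The two frequencies described above can be sketched in a few lines of PHP. This is a minimal illustration, not the code from `cluster.php`: the function names `tokenize` and `tfIdf` are hypothetical, the input is assumed to be an array mapping document ids to raw text, and the inverse document frequency is assumed to use the natural logarithm of the store size divided by the number of documents containing the term.

```php
<?php
// Hypothetical helper: lowercase the text and split it into word tokens.
function tokenize(string $text): array
{
    preg_match_all('/[a-z0-9]+/', strtolower($text), $matches);
    return $matches[0];
}

// Compute a tf-idf weight for every term of every document.
// $documents maps a document id to its raw text (assumed input shape).
function tfIdf(array $documents): array
{
    $termCounts = [];   // per-document term counts
    $docFrequency = []; // number of documents that contain each term

    foreach ($documents as $id => $text) {
        $termCounts[$id] = array_count_values(tokenize($text));
        foreach (array_keys($termCounts[$id]) as $term) {
            $docFrequency[$term] = ($docFrequency[$term] ?? 0) + 1;
        }
    }

    $total = count($documents);
    $weights = [];
    foreach ($termCounts as $id => $counts) {
        $length = array_sum($counts); // document length in tokens
        $weights[$id] = [];
        foreach ($counts as $term => $count) {
            $tf = $count / $length;                    // term frequency
            $idf = log($total / $docFrequency[$term]); // inverse document frequency
            $weights[$id][$term] = $tf * $idf;
        }
    }
    return $weights;
}
```

Under these assumptions, calling `tfIdf(['a' => 'queen obama', 'b' => 'queen banquet'])` gives `queen` a weight of 0 in both documents, because it appears in every document, while `obama` gets 0.5 times the natural logarithm of 2 in document `a`. Terms that are frequent in one document but rare in the store score highest, which is what makes them useful as keywords.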
During the research I found two other options, which I concluded were either slightly irrelevant or too complex for the task. The first one was a Bag-of-words solu
  

## cluster.php
<?php
// With no argument, process the demo JSON; otherwise read the file given as the first argument.
if (!isset($argv[1]))
{
	$file = file_get_contents('data.json');
}
else
{
	$file = file_get_contents($argv[1]);
}