@Qyoom
Last active August 29, 2015 14:01
Scala worksheet for experimenting with Spark Vector class
import org.apache.spark.util.Vector
import org.apache.spark._
object Vector_lab_1 {
println("Vector_lab_1")
val vec1 = Vector(Array(1.5, 2.5, 3.5))
val vec2 = Vector(Array(1.0, 2.0, 3.0))
val vec3 = Vector(Array(5.1, 6.1, 7.1, 8.1))
val vec4 = Vector(Array(1.0, 1.0, 1.0))
val vec5 = Vector(6, 3.2, 9.8)
val vec6 = Vector(3, (x:Int)=>1*2.6) // length 3; the initializer ignores the index, so every element is 2.6
vec1(0)
vec1(0) + vec2(1)
vec1 + vec2
vec1 add vec2
// vec1 + vec3 // IllegalArgumentException: Vectors of different length
// (vec1 + vec2).reduce(_ + _) // reduce not a member of Vector
vec1 - vec2
vec1 dot vec2
1.5 + 5 + 10.5 // manual check of vec1 dot vec2: 1.5*1.0 + 2.5*2.0 + 3.5*3.0
// vec1 * vec2 // Error: * takes a Double scale factor, not a Vector
// vec1 dot vec3 // IllegalArgumentException: Vectors of different length
vec1 plusDot(vec2, vec4) // (vec1 + vec2) dot vec4
vec1 plusDot(vec4, vec2) // (vec1 + vec4) dot vec2
vec1
vec1 += vec2 // adds vec2 into vec1 in place
vec1 // now holds the updated values
vec1 addInPlace(vec2) // same as +=; modifies vec1 again
3 * vec2
vec2 * 3
vec3.sum
vec3 / 3.9
vec2.unary_- // negation; equivalent to -vec2
vec1 squaredDist vec2
vec1 dist vec2
}

Qyoom commented May 21, 2014

References:
http://en.wikipedia.org/wiki/Dot_product
"[Dot product] is the sum of the products of the corresponding entries of the two sequences of numbers. Geometrically, it is the product of the magnitudes of the two vectors and the cosine of the angle between them."
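Both halves of this definition can be checked against the worksheet's initial `vec1` and `vec2` values. What follows is a minimal sketch in plain Scala that uses `Array[Double]` instead of Spark's `Vector`; the helper names `dot`, `norm`, and `cosine` are illustrative assumptions and not part of the Spark API.

```scala
object DotProductSketch {
  // Algebraic definition: sum of the products of corresponding entries.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors of different length")
    a.zip(b).map { case (x, y) => x * y }.sum
  }

  // Magnitude (Euclidean norm) of a vector.
  def norm(a: Array[Double]): Double = math.sqrt(dot(a, a))

  // Cosine of the angle between a and b, rearranged from the geometric
  // definition dot(a, b) = norm(a) * norm(b) * cos(theta).
  // Not meaningful for zero vectors (division by zero).
  def cosine(a: Array[Double], b: Array[Double]): Double =
    dot(a, b) / (norm(a) * norm(b))

  val v1 = Array(1.5, 2.5, 3.5)
  val v2 = Array(1.0, 2.0, 3.0)

  // 1.5*1.0 + 2.5*2.0 + 3.5*3.0 = 17.0
  val algebraic: Double = dot(v1, v2)

  // Product of the magnitudes and the cosine of the angle; agrees with
  // the algebraic value up to floating-point rounding.
  val geometric: Double = norm(v1) * norm(v2) * cosine(v1, v2)
}
```

The `require` check mirrors the worksheet's observation that a dot product of vectors with different lengths is rejected.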

Manning, Christopher D. and Schütze, Hinrich, Foundations of Statistical Natural Language Processing, MIT Press, 1999, p. 539
"The vector space model is one of the most widely used models for ad-hoc retrieval, mainly because of its conceptual simplicity and the appeal of the underlying metaphor of using spatial proximity for semantic proximity. Documents and queries are represented in a high-dimensional space, in which each dimension of the space corresponds to a word in the document collection. The most relevant documents for a query are expected to be those represented by the vectors closest to the query, that is, documents that use similar words to the query. Rather than considering the magnitude of the vectors, closeness is often calculated by just looking at angles and choosing documents that enclose the smallest angle with the query vector."
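The ranking described in this passage can be sketched as follows: represent each document and the query as word-count vectors over a shared vocabulary, then order the documents by the cosine of their angle with the query (a larger cosine means a smaller angle). The vocabulary, the counts, and the names `cosine` and `rank` below are made up for illustration; they are not taken from the book or from Spark.

```scala
object VectorSpaceSketch {
  // Cosine of the angle between two equal-length, non-zero vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "Vectors of different length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Hypothetical three-word vocabulary; each dimension counts one word:
  // ("spark", "vector", "cooking")
  val docs: Map[String, Array[Double]] = Map(
    "doc_a" -> Array(2.0, 3.0, 0.0),   // uses "spark" and "vector"
    "doc_b" -> Array(20.0, 30.0, 0.0), // same direction as doc_a, ten times longer
    "doc_c" -> Array(0.0, 1.0, 5.0)    // mostly "cooking"
  )

  val query: Array[Double] = Array(1.0, 1.0, 0.0)

  // Document names ordered from the smallest angle with the query to the largest.
  def rank(q: Array[Double]): Seq[String] =
    docs.toSeq
      .map { case (name, vec) => (name, cosine(q, vec)) }
      .sortBy { case (_, score) => -score }
      .map { case (name, _) => name }
}
```

Because only the angle is compared, `doc_a` and `doc_b` receive the same score up to rounding even though one is ten times longer, while `doc_c`, which points in a different direction, ranks last.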