I am writing this with a hope that you would get a wide angle perception on the term "Symbol" in programming. I am deliberately mixing multiple programming languages to make our understanding better, and at-least one of them would make sense to you - if not entirely - intuitively! As you read through, you would discover better reasons on why we use multiple languages to explain this concept.
Before we explain Symbols, let's get familiarised with the behaviour of strings
(instantiation, storage, etc.) in all these languages.
let x be "afsal"
and it's a string type (not language specific). x
will be allocated a memory space M
and is identified by an id
(sometimes we call object_id
). Invoking x
is getting the contents of x
identified by id
.
Note: Most of you would guess this to be an intro to reference identity, reference equality and related topic. But, for the time being, let us restrict ourselves on to the terminologies mentioned above.
Let us use Python here as an example of getting the id
of a variable. You may not see this feature in all languages.
But the concept remains the same.
>>> x = "afsal"
>>> id (x)
4438888640
>>> y = "afsal"
>>> id (y)
4438888640
>>>
As seen above, the strings with the same content share the same id.
- It means, when we tried to define y with the same content as that of x, the application looked up the heap/memory and identified that
y
could reuse the contents ofx
. - If
y
needs to re-usex
, obvious that they should have the same object_id.id (x) == id (y)
- No extra copy of
y
is created, and we saved some space.
- In computer science, the behaviour is termed as string interning. That is, internalising the strings will ensure that all strings with same contents share the same memory.
- One source of drawbacks is that string interning may be problematic when mixed with multithreading, but this discussion is out of scope.
- In the above examples with python/ruby, the string x is internalised automatically, so is for Java/Scala. (However, ou will find differences in behaviour across these languages soon)
- In java, String.intern() internalise the strings forcefully, but you don't do this generally.
- The intern() method returns a canonical representation of the string object.
Again, for demonstration I am using Scala console. I strongly recommend to view this example as a conceptual explanation of String interning, and it doesn't intend to explain the various functions in Scala/Java and its differences. For the time being, all that you would need to know
is eq
method in Scala verifies if two strings are pointing to the same memory location.
scala> val x = "afsal"
x: String = afsal
scala> val y = "afsal"
y: String = afsal
scala> x == y
res0: Boolean = true
// `eq` verifies if they point to the same memory.
scala> x eq y
res1: Boolean = true
// In the above example, strings are automatically interned,
// Now, please have a look at the below examples. You can see strings are not automatically interned here:
scala> val x = new String("afsal")
x: String = afsal
scala> val y = new String("afsal")
y: String = afsal
scala> x == y
res2: Boolean = true
scala> x eq y
res3: Boolean = false
// Let's look at the below example where we intern strings forcefully, resulting in `eq` giving a result `true`
scala> val x = new String("afsal").intern
x: String = afsal
scala> val y = new String("afsal").intern
y: String = afsal
scala> x == y
res4: Boolean = true
scala> x eq y
res5: Boolean = true
Let's do a simple test on the automatic interning of Strings. For simplicity, let's use Scala console. Let's create ten strings, each with 10000000 characters of 'a'. All ten long strings have same contents, and ideally, they should share the same id.
scala> val longString = List.fill(10000000)("a").mkString
longString: String = aaaa..............
scala> val strings = List.fill(10)(longString)
strings: List[String] = List(aaaaaaaaaaaaaa.............
scala> strings.forall(t => t eq longString)
res0: Boolean = true
Yes, it is a property of strings. You won't find an intern
method for an Integer variable.
However, you may note that ids are constant for certain primitives. Ex: id
of integer 1 is always the same.
Example: In python, it would look like this. You can verify the same in Ruby, and it behaves in the same manner.
>>> x = 12121212121321232432453234123
>>> y = 12121212121321232432453234123
>>> id (x) == id (y)
False
>>> x = 1
>>> y = 1
>>> id (x) == id (y)
True
Surprisingly, the answer is NO. In java world, we say there is no guarantee that strings would be internalised and it depends on JVM's whim, and probably the content itself. Python doesn't intern strings with special characters for example.
Let's analyse the consistency of this string interning in back to python/ruby (programming language does matter here). We will come to know String interning, being such a generalised term, behave in different ways in different languages.
In python:
>>> x = "afsal11111111111111111111116789i|dsad!@##"
>>> y = "afsal11111111111111111111116789i|dsad!@##"
>>> id(x) == id(y)
False
>>> x == y
True
>>> x = "afsal"
>>> y = "afsal"
>>> id(x) == id (y)
True
In Ruby, you would see something known as object_id. It is the equivalent of id
in python. But as per documentation, the object_ids always differ for 2 active objects.
Hence, the following code gives different object_id
(or id
) for two variables with the same content afsal
. There is no string interning happening here!
afsalthaj@Afsals-MacBook-Pro ~> irb
irb(main)> x = "afsal"
=> "afsal"
irb(main)> y = "afsal"
=> "afsal"
irb(main)> x.object_id == y.object_id
=> false
The answer is "Symbol".
To make it further simpler, call the method intern
for a string in Ruby, and you get a "Symbol" in return.
irb(main)> x = "afsal".intern
=> :afsal
irb(main)> x.is_a?(Symbol)
=> true
irb(main)> y = "afsal".intern
=> :afsal
irb(main)> x.object_id == y.object_id
=> true
irb(main)> x = :afsal
=> :afsal
irb(main)> y = :afsal
=> :afsal
irb(main)> x.object_id == y.object_id
=> true
Now you must be wondering, can we solve the intern inconsistencies in any language using Symbols? Yes and No. Yes for languages who has specific Symbol type but we may not always use this in practice, and there are alternative methods in languages that don't have symbols inbuilt. In other words, even if some languages don't have the concept of Symbol, they have string internalising, and there is one or other way of doing it. In Ruby/Java/Scala it is done by making use of symbols. In Python, use the method intern, and in Clojure, we could use keywords or symbols. Let us try to understand it better.
The concept of symbol is not language specific, but they differ in some or other ways. Let us explore.
As mentioned in above examples, the way Ruby handles intern is by converting it into symbols. i.e, Symbol representation of String "afsal"
is :afsal
.
If you are not using symbols, every time you define "afsal", Ruby instantiates a new String object, interpreter looks at the memory (heap) and allocate a new memory, assign a new object_id to keep track of the object, and interpreter marks it for destruction if not used often, resulting in the repetition of the whole steps next time the String "afsal" is defined. It can affect the performance - especially when it comes to massive datasets.
Let us define a string 100000000 times and similarly a symbol 100000000 times, and see which one performs better:
def test_strings_performance
x = Time.now
100000000.times do
"afsal"
end
y = Time.now
y - x
end
irb(main):020:0> test_strings_performance
=> 6.360865
irb(main):021:0> test_strings_performance
=> 6.381256
def test_symbol_performance
x = Time.now
100000000.times do
:afsal
end
y = Time.now
y - x
end
irb(main):018:0> test_symbol_performance
=> 3.76367
irb(main):019:0> test_symbol_performance
=> 3.773923
Hurray! symbols are performing almost ~2 times faster than string.
Many times symbols are used as identifiers.Example: Every method name in Ruby is saved as a symbol under the hood. Symbols are widely used as keys in your hash. By using symbols as keys, Ruby need to compare only the object_ids of the already stored key with the new ones, and not its contents/compute-hash-of-each-value. It could be used anywhere with-in your application.
Be aware using excessive use of Symbols results in lots of memory usage. The frequent casting of Symbols to Strings can also slow down your application. Said that memory leakage due to the usage of Symbols is not a concern in Ruby anymore (for version > 2.2) as there is symbol garbage collector now.
There is no python equivalent for Ruby's symbols. However, as you have seen from the above examples, they are interned by default. We have also seen a few examples of strings that were not interned by default in Python. But we can force the intern of those strings using the intern function.
>>> x = "sfbakjdbakjdbajhdbjadbjasbdjad&&&%^&'"
>>> y = "sfbakjdbakjdbajhdbjadbjasbdjad&&&%^&'"
>>> id(x) == id(y)
False
>>> x = intern(x)
>>> y = intern(y)
>>> id(x) == id(y)
True
>>> type(x)
<type 'str'>
>>> type(y)
<type 'str'>
Clojurists have Strings, Symbols and keywords. This trichotomy may confuse many. It may partly make sense to you as we need String and Symbol.
user=> :afsal
:afsal
user=> 'afsal
afsal
user=> (symbol? :afsal)
false
user=> (symbol? 'afsal)
true
user=>
According to documentation, the one which provides faster equality tests is given by keywords (i.e.,afsal). Also, you may observe that these keywords are not Symbols. As far as I can understand, a Symbol in Clojure/Lisp is mainly used to manipulate the function names, variables and also program forms (closely related to macros). We could use Symbols ('afsal) to manipulate objects, but that is done less common in practice. Another difference is while Symbols are namespace qualified, keywords are not. However, we could explicitly qualify a keyword in Clojure to a particular namespace by giving an extra column
user=> :afsal
:afsal
user=> ::afsal
:user/afsal
user=>
Although the difference is not explained in detail here, it is important to know at-least this much if you are into Clojure.
In dynamic languages, symbols are often used to identify things that have a stronger meaning than a string content, identifiers that are often used more than once. Moreover, in homoiconic languages like Clojure, where code can act as data, the programmer has control over manipulating functions and variables using Symbols to produce various custom behaviours. However, in statically typed languages we could argue that your comparison space is already restricted by types, and most of the times the homoiconic nature doesn't exist. Hence, although symbols have the same meaning in the context of Java/Scala (i.e., guaranteed interning and faster equality operations), they are probably less used in practice when we compare with Ruby or Clojure.
scala> 'afsal
res0: Symbol = 'afsal
scala> Symbol("afsal")
res1: Symbol = 'afsal