Skip to content

Instantly share code, notes, and snippets.

@brijesh-deb
Last active June 1, 2018 05:38
Show Gist options
  • Save brijesh-deb/f93407bcd71c6c47f7db0f22f90a7ce9 to your computer and use it in GitHub Desktop.
Save brijesh-deb/f93407bcd71c6c47f7db0f22f90a7ce9 to your computer and use it in GitHub Desktop.
#Database #Index #Notes

Why use index ?

  • The whole point of having an index is to speed up search queries by essentially cutting down the number of records/rows in a table that need to be examined.
    • SELECT * FROM Employee WHERE Employee_Name = 'Jesus'
  • In absence of index, for searching column value, all rows in the database table has to be scanned. This is called full table scan.

What is an index?

  • An index is a data structure (most commonly a B- tree) that stores the values for a specific column in a table. An index is created on a column of a table. So, the key points to remember are that an index consists of column values from one table, and that those values are stored in a data structure. The index is a data structure – remember that.

What kind of data structure is an index?

  • B- trees are the most commonly used data structures for indexes. The reason B- trees are the most popular data structure for indexes is due to the fact that they are time efficient – because look-ups, deletions, and insertions can all be done in logarithmic time. And, another major reason B- trees are more commonly used is because the data that is stored inside the B- tree can be sorted. The RDBMS typically determines which data structure is actually used for an index. But, in some scenarios with certain RDBMS’s, you can actually specify which data structure you want your database to use when you create the index itself.

What kind of data structure is an index?

  • B- trees are the most commonly used data structures for indexes. The reason B- trees are the most popular data structure for indexes is due to the fact that they are time efficient – because look-ups, deletions, and insertions can all be done in logarithmic time. And, another major reason B- trees are more commonly used is because the data that is stored inside the B- tree can be sorted. The RDBMS typically determines which data structure is actually used for an index. But, in some scenarios with certain RDBMS’s, you can actually specify which data structure you want your database to use when you create the index itself.
  • Hash tables are another data structure that you may see being used as indexes – these indexes are commonly referred to as hash indexes. The reason hash indexes are used is because hash tables are extremely efficient when it comes to just looking up values. So, queries that compare for equality to a string can retrieve values very fast if they use a hash index. For instance, the query we discussed earlier (SELECT * FROM Employee WHERE Employee_Name = ‘Jesus’) could benefit from a hash index created on the Employee_Name column.
  • Disadvantage of hash index - hash tables are not sorted data structures, and there are many types of queries which hash indexes can not even help with. For instance, suppose you want to find out all of the employees who are less than 40 years old. 
  • Indexes that use a R- tree data structure are commonly used to help with spatial problems. For instance, a query like “Find all of the Starbucks within 2 kilometers of me” would be the type of query that could show enhanced performance if the database table uses a R- tree index.
  • Another type of index is a bitmap index, which work well on columns that contain Boolean values (like true and false), but many instances of those values – basically columns with low selectivity.

Database index with B Tree

  • Idea behind inverted index, which is used in all three technologies covered by this article, is very simple. Imagine we have record with fields A, B and C and we want to execute queries like “select * from records where A=x”. Instead of scanning all records in table, we can create additional data structure (index), which will track for each possible value of attribute A list to all records having this value.
  • So if we need all record which have A=1 we will find “1” entry point in index (by attribute A) and get list of all records which conforms that criterion.
  • BTree is most common implementation of inverted index in RDBMS (world of RDBMS is diverse, so I’m sure there are alterative implementations but let’s stick with BTrees here). BTree is good for disk based storages and has been used in relational databases for decades. Typical SQL database will allow you to create index using one or more attributes and you can create multiple indexes for same table. If you are creating index for multiple attributes, order of attributes is important. Normally RDBMS can use only one BTree index for particular table during execution of query. That’s an very important point!
  • Use of B Tree for database index - http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.adref.doc/adref235.htm

What exactly is inside a database index?

  • So, now you know that a database index is created on a column in a table, and that the index stores the values in that specific column. But, it is important to understand that a database index does not store the values in the other columns of the same table. For example, if we create an index on the Employee_Name column, this means that the Employee_Age and Employee_Address column values are not also stored in the index. If we did just store all the other columns in the index, then it would be just like creating another copy of the entire table – which would take up way too much space and would be very inefficient.

How does a database know when to use an index?

  • When a query like “SELECT * FROM Employee WHERE Employee_Name = ‘Jesus’ ” is run, the database will check to see if there is an index on the column(s) being queried. Assuming the Employee_Name column does have an index created on it, the database will have to decide whether it actually makes sense to use the index to find the values being searched – because there are some scenarios where it is actually less efficient to use the database index, and more efficient just to scan the entire table. Read this article to understand more about those scenarios: Selectivity in SQL.

Can you force the database to use an index on a query?

  • Generally, you will not tell the database when to actually use an index – that decision will be made by the database itself. Although it is worth noting that in most databases (like Oracle and MySQL), you can actually specify that you want the index to be used.

How to create an index in SQL

  • Single column index
    • CREATE INDEX name_index ON Employee (Employee_Name)
  • Multi column index
    • CREATE INDEX name_index ON Employee (Employee_Name, Employee_Age)

What is the cost of having a database index?

  • So, what are some of the disadvantages of having a database index? Well, for one thing it takes up space – and the larger your table, the larger your index. Another performance hit with indexes is the fact that whenever you add, delete, or update rows in the corresponding table, the same operations will have to be done to your index. Remember that an index needs to contain the same up to the minute data as whatever is in the table column(s) that the index covers.
  • As a general rule, an index should only be created on a table if the data in the indexed column will be queried frequently.

Clustered index

  • A clustered index determines the order in which the rows of a table are stored on disk. If a table has a clustered index, then the rows of that table will be stored on disk in the same exact order as the clustered index. 
  • Suppose we have a table named Employee which has a column named EmployeeID. Let’s say we create a clustered index on the EmployeeID column. What happens when we create this clustered index? Well, all of the rows inside the Employee table will be physically – sorted (on the actual disk) – by the values inside the EmployeeID column. What does this accomplish? Well, it means that whenever a lookup/search for a sequence of EmployeeID’s is done using that clustered index, then the lookup will be much faster because of the fact that the sequence of employee ID’s are physically stored right next to each other on disk – that is the advantage with the clustered index.

Disadvantage of clustered index

  • A disadvantage to using a clustered index is the fact that if a given row has a value updated in one of it’s (clustered) indexed columns what typically happens is that the database will have to move the entire row so that the table will continue to be sorted in the same order as the clustered index column.
  • Since a clustered index also tells the database in which order to physically store the rows on disk, when the Owner_Name is changed it will have to move an updated row so that it is still in the correct sorted order. So, now the row that used to belong to “Roger Federer” will have to be moved on disk so that it’s grouped (or clustered) with all the car entries that belong to “Rafael Nadal”. Clearly, this is a performance hit. This means that a simple UPDATE has turned into a DELETE and then an INSERT – just to maintain the order of the clustered index.
  • For this exact reason, clustered indexes are usually created on primary keys or foreign keys, because of the fact that those values are less likely to change once they are already a part of a table.

Why can a table have only one clustered index but multiple non-clustered index ?

  • Because a clustered index determines the order in which the rows will be stored on disk, having more than one clustered index on one table is impossible. Imagine if we have two clustered indexes on a single table – which index would determine the order in which the rows will be stored? Since the rows of a table can only be sorted to follow just one index, having more than one clustered index is not allowed.
  • A table can have multiple non-clustered indexes because they don’t affect the order in which the rows are stored on disk like clustered indexes.

Create a clustered index

  • CREATE CLUSTERED INDEX name_index ON Employee (Employee_Name)

Pointers

  • By default a clustered index is created on a primary key
  • You can specify the NONCLUSTERED key word when creating a primary key and it won't be clustered
  • You can then specify the CLUSTERED key word when creating an index, unique or not, and it will be clustered.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment