ColumnStore Indexes

ColumnStore Index is a new feature created as a result  of  project named “Apollo” in  SQL Server Code Name “Denali”.It is very important NEWS if you have SQL Server Datawarehouse.In this Paper I will let you know the difference between SQL Server’s traditional indexes and new ColumnStore index.  I will also let you know when to use Traditional row indexes and when to use columnStore indexes.


Intended Audience

           
SQL Server developers , Dataware house team , DBA’s..


Introduction

           
First of all we should know what is Columnstore index.Columnstore index is that index which stores data column-wise instead of row-wise.it is mainly for datawarehouse purpose like fact and dimensions more than 10 Million rows. We can create only one per table and currently we can create only non –clustered columnstore index. If we are partitioning the table the columnstore index must contain partitioning column. It is for read only purpose. SQL Server’s traditional indexes, clustered and nonclustered, are based on the B-tree.  B-trees are great for finding data that match a predicate on the primary key.  They’re also reasonably fast when you need to scan all the data in a table .

How Columstore index data store.

           
If we take the data with example shown below the data is stored in 2 different formats
1)    Data Stored in a Traditional “Row Store” Format
Row Store

1
Ross
San Francisco
CA
2
Sherry
New York
NY
3
Gus
Seattle
WA
4
Stan
San Jose
CA
5
Lijon
Sacramento
CA


2)    Data stored in Column store format .
ColumnStore
1 2 3 4 5
Ross Sherry Gus Stan Lijon
San Francisco New York Seattle San Jose Sacramento
CA NY WA CA CA


In columnstore index each index page stores data from a single column  but in traditional row indexes each index page stores rows .this is very major change .

What are the benefits of using Column store index .


   The 2 main important benefits we will get when we are using Columnstore index are
1.    Compression
2.    Fetching only required columns .
      Compression :     Most general-purpose relational database management systems, including SQL Server, store data in row-wise fashion.  This organization is sometimes called a row store.  Both heaps and B-trees are row stores because they store the values from each column in a given row contiguously.  When you want to find all the values associated with a row, having the data stored together on one page is very efficient.  Storing data by rows is less ideal for compressing the data.  Most compression algorithms exploit the similarities of a group of values.  The values from different columns usually are not very similar.  When data is stored row-wise, the number of rows per page is relatively few, so the opportunities to exploit similarity among values are limited.  A column store organizes data in column-wise fashion.  Data from a single column are stored contiguously.  Usually there is repetition and similarity among values within a column.  The column store organization allows compression algorithms to exploit that similarity.

Fetching only Required Columns :   When data is stored column-wise, each column can be accessed independently of the other columns.  If a query touches only a subset of the columns in a table, IO is reduced.  Data warehouse fact tables are often wide as well long.  Typical queries touch only 10 – 15% of the columns.  That means a column store can reduce IO by 85 – 90%, a huge speedup in systems that are often IO bound, meaning the query speed is limited by the speed at which needed data can be transferred from disk into memory

When to use Traditional Indexes Vs Cloumnstore Indexes


            While it’s possible to build a system that stores all data in columnar format, row stores still have advantages in some situations.  A B-tree is a very efficient data structure for looking up or modifying a single row of data.  So if your workload entails many single row lookups and many updates and deletes, which is common for OLTP workloads, you will probably continue to use row store technology.  Data warehouse workloads typically scan, aggregate, and join large amounts of data.  In those scenarios, column stores really shine.
SQL Server now provides you with a choice.  You can build columnstore indexes on your big data warehouse tables and get the benefits of column store technology and batch mode processing without giving up the benefits of traditional row store technology when a B-tree is the right tool for the job

Why we get higher  Performance from cloumnstore indexes.


           The performance  of the queries on columnstore indexed is very high because data organized in a column, shares many more similar characteristics than data organized across rows this result in higher level of compression . Use of VertiPaq compression algorithm technology also more superior than SQL compression algorithm (available in Analyse server for PowerPivot in SQL 2008R2) . columnstore index  requires Less I/O transferred from disk to Memory . It Fetches data only for columns needed by query (Bitmap filter optimization).Algorithms are optimized to take better advantage of modern hardware (More Cores, more Ram,..) . it will also run in  Batch mode process . Batch-mode processing: is a new, highly-efficient vector technology that works with columnstore indexes. Check Query plan for execution mode. A batch is stored as a vector in a separate area of memory and represent +/- 1000 rows of data
Columnstore index creation Syntax

  CREATE [ NONCLUSTERED ] COLUMNSTORE INDEX index_name ON <object> ( column [ ,...n ] )
 [ WITH DROP_EXISTING = { ON | OFF } [ MAXDOP = x ] ) ]
 [ ON { { partition_scheme_name ( column_name ) } | filegroup_name | "default" } ]
  Explination to some of the arguments
ON partition_scheme_name(column_name)
Specifies the partition scheme that defines the filegroups onto which the partitions of a partitioned index will be mapped. The partition scheme must exist within the database by executing CREATE PARTITION SCHEME. column_name specifies the column against which a partitioned index will be partitioned. This column must match the data type, length, and precision of the argument of the partition function that partition_scheme_name is using. column_name is not restricted to the columns in the index definition. When partitioning a columnstore index, Database Engine adds the partitioning column as a column of the index, if it is not already specified.
If partition_scheme_name or filegroup is not specified and the table is partitioned, the index is placed in the same partition scheme, using the same partitioning column, as the underlying table.
  MAXDOP = max_degree_of_parallelism
Overrides the Configure the max degree of parallelism Server Configuration Option configuration option for the duration of the index operation. Use MAXDOP to limit the number of processors used in a parallel plan execution. The maximum is 64 processors.
max_degree_of_parallelism can be:
1
Suppresses parallel plan generation.
>1
Restricts the maximum number of processors used in a parallel index operation to the specified number or fewer based on the current system workload.
0 (default)
Uses the actual number of processors or fewer based on the current system workload.
Columnstore index can create on the fallowing datatypes.
 The common business data types can be included in a columnstore index. The following data types can be included in a columnstore index.
  char and varchar
  char and nvarchar (except varchar(max) and nvarchar(max))
 Decimal (and numeric) (Except with precision greater than 18 digits.)
  int , bigint, smallint, and tinyint
 float (and real)
 bit
 money and smallmoney
All date and time data types (except datetimeoffset with scale greater than 2)
Columnstore index cannot be  created on the fallowing datatypes.
The following data types cannot be included in a columnstore index.
binary and varbinary
ntext , text, and image
varchar(max) and nvarchar(max)
uniqueidentifier
rowversion (and timestamp)
sql_variant
decimal (and numeric) with precision greater than 18 digits
datetimeoffset with scale greater than 2
CLR types (hierarchyid and spatial types)
xml

How much memory need to create column store index


        The memory required for creating a columnstore index depends on the number of columns, the number of string columns, the degree of parallelism (DOP), and the characteristics of the data. SQL Server will request a memory grant before trying to create the index. If not enough memory is available to create the index in parallel with the current max DOP, SQL Server will reduce the DOP as needed to get an adequate memory grant. If SQL Server cannot get a memory grant to build the index with DOP = 1, the index creation will fail.
A rule of thumb for estimating the memory grant that will be requested for creating a columnstore index is:
Memory grant request in MB = [(4.2 *Number of columns in the CS index) + 68]*DOP + (Number of string cols * 34)

Conclusion


     Columnstore indexes can reduce the burden on IT and shorten ETL time by decreasing reliance on pre-built summary aggregates, whether they are indexed views, user-defined summary tables, or OLAP cubes. Designing and maintaining aggregates is often a difficult, labor-intensive task. A single columnstore index can replace dozens of aggregates. Column stores are less brittle than aggregates because if a query is changed slightly, the columnstore can still support it, whereas a specific aggregate may no longer be useful to accelerate the query.


Comments

Popular posts from this blog

SSAS : A connection cannot be made to redirector. Ensure that 'SQL Browser' service is running. (Microsoft.AnalysisServices.AdomdClient). + Solution

SSIS Error : Unable to infer the XSD from the XML file. The XML contains multiple namespaces