Shared posts

26 Mar 07:51

Foray into Jenkins, Puppet, Docker, and Photon

by Edward Haletky

I have made a foray into Jenkins to deploy VMware Photon within my vSphere environment. This foray has the end goal of using Jenkins, VMware Photon, and Docker to deploy applications within my hybrid cloud. I have an increased need to deploy web properties as well as to automatically apply in-depth testing to those properties. Quite frankly, the amount of time it takes me to do these things by hand is just getting to be too much of a time sink, so now it is time to use modern tools to augment my existing scripts. Here is my journey into the new world.

The steps to achieving this are well laid out by many others, and I leveraged previous work as much as possible. So, here are my steps:

  • Install CentOS 7 Minimal Install into a VM, and then add the following RPMs:
    • perl
    • open-vm-tools
    • git
  • Install VMware Tools into the CentOS 7 Minimal Install VM.
  • Install Photon from ISO (http://bl.ocks.org/jrrickard/114b8c35b1d5306ff3e0). Now, I stopped my duplication of this effort after the first jenkins-slave was installed. All else did not meet my needs.
  • Install Fedora EPEL distribution into the Jenkins VM for CentOS 7 (https://fedoraproject.org/wiki/EPEL). This is needed to install Open Source Puppet.
  • Install Puppet into the Jenkins VM. While I could have used another server, this simplifies network traffic somewhat (https://docs.puppetlabs.com/guides/install_puppet/install_el.html). Before following any of the steps in the install guide for Puppet, I installed the following RPMs:
    • puppet
    • puppet-firewalld
    • puppet-server
  • Add the following plugins to Jenkins:
    • Active Directory Plugin
    • Docker Plugin
    • docker-build-step
    • GIT Client Plugin
    • GIT Plugin
    • git-notes Plugin
    • PowerShell Plugin
    • Tracking Git Plugin
    • vSphere Plugin

These plugins, once configured, allow secure integration with Git, vSphere, and Docker.

  • Configure Global Security to enable Active Directory. After enabling security within Jenkins, I enabled Active Directory and Project-based Matrix Authorization Strategy. Once you enable this authorization strategy, be sure to add at least one user as the administrator so you can log in and continue to use Jenkins. If you get this wrong, you will need to edit the config.xml file for Jenkins and remove any authentication stanzas, then restart Jenkins.
  • Add a Jenkins-specific user to vCenter with a Jenkins Role. Remember, there is a best practice of having one user per service in vCenter, and there is a need for each service user to have the proper roles. The roles are for those who need to clone a VM (http://kb.vmware.com/kb/1027743).
  • Configure (Jenkins) System with Add a new cloud. The Add a new cloud button configures the vSphere plugin. You configure the vSphere plugin by pointing it to your vCenter server and giving it a unique name. In my case, I called my vCenter server “Photon,” to remind me I am only using this for Photon VMs.
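The config.xml recovery mentioned above can be sketched roughly as follows. This is a minimal illustration, not my actual procedure: it assumes the usual top-level `useSecurity`, `authorizationStrategy`, and `securityRealm` elements in the Jenkins config.xml, and the function name `disable_jenkins_security` is made up for this example. Back up the real file and stop Jenkins before trying anything like this.

```python
import xml.etree.ElementTree as ET


def disable_jenkins_security(config_xml: str) -> str:
    """Return config.xml text with the authentication stanzas removed.

    Assumption: security settings live in top-level <useSecurity>,
    <authorizationStrategy> and <securityRealm> elements.
    """
    root = ET.fromstring(config_xml)
    # Drop the authorization and authentication stanzas if present.
    for tag in ("authorizationStrategy", "securityRealm"):
        for node in root.findall(tag):
            root.remove(node)
    # Switch the global security flag off, creating it if it is missing.
    flag = root.find("useSecurity")
    if flag is None:
        flag = ET.SubElement(root, "useSecurity")
    flag.text = "false"
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed config.xml used only for illustration.
sample = """<hudson>
  <useSecurity>true</useSecurity>
  <authorizationStrategy class="hudson.security.ProjectMatrixAuthorizationStrategy"/>
  <securityRealm class="hudson.plugins.active_directory.ActiveDirectorySecurityRealm"/>
</hudson>"""
fixed = disable_jenkins_security(sample)
```

After writing the result back and restarting Jenkins, security can be reconfigured from the web interface, this time with an administrator added.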

The first stage of my install is now finished. I should be able to clone a virtual machine using the vSphere plugin within Jenkins. I achieved this by doing the following:

  • Use the New Item link to create a new Project inside Jenkins.
  • Give that Project a name, such as “Photon-Nginx.”
  • Disable any source code management, as we have yet to write any code.
  • Disable everything else inside the Project configuration, as we do not want to do anything but a vSphere Build Step.
  • Using Add build step, add a vSphere Build Step.
  • Select the vCenter server you previously configured.
  • Select the Clone VM from VM or Template vSphere Action and fill out all the lines necessary. I furthermore selected Linked Clone. All fields need to be filled out; the VM to clone name should be your Photon instance. You should click on the Check Data button to test your entries. I found it useful to also have the vCenter Web Client running at the same time.
  • Save the Project configuration.
  • Right-click on the Project name and select Build Now. If everything works as expected, and it did for me, you should now have deployed a linked clone of your Photon instance.

However, that is not all I wish to do. Deploying Photon linked clones from within Jenkins is just the first step. I really want to configure a Docker container within the Photon instance and then use Instant Clone to deploy many copies of my chosen application as needed. This requires a bit more work. Specifically, it requires the use of PowerShell, as that is the only way Instant Clone works. It also requires that I know the IP of the virtual machine, which I can get through the vSphere SDK, and also via PowerShell. To do all this, I need a Windows Jenkins slave. The next steps are to deploy this Jenkins slave to the Windows vCenter helper VM that I use for all Windows-related tasks. I could use another node, but this one has everything installed, such as PowerCLI, etc. The steps I took are:

sudo firewall-cmd --zone=public --add-port=34540/tcp --permanent
sudo firewall-cmd --reload

Now, if everything works, you should see a new node within your Jenkins Manage Jenkins -> Manage Nodes list. I labelled this slave node as vCenter so I can target build steps to just that node.

The next set of build steps are to do the following:

  • Determine the public IP of the linked clone just copied.
  • Load NGINX as a Docker container within that VM.
  • Run the container.
  • Ensure I can access this new web server.
  • Use Jenkins to run a load test against my newly deployed environment.
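The "ensure I can access this new web server" step could be sketched as a small check that a Jenkins build step runs once the container is up. This is only an illustration under stated assumptions: the helper name `web_server_is_up` is hypothetical, and the URL would come from whatever IP the earlier step discovers.

```python
import urllib.error
import urllib.request

# Talk to the VM directly; ignore any proxy configured in the environment,
# since the target sits on the local network.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def web_server_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to url succeeds with a 2xx or 3xx status."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

A build step could call `web_server_is_up("http://<vm-ip>/")` and fail the build when it returns False, which keeps the load test from running against a server that never started.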

My future plans include making containers for an NGINX Load Balancer/Reverse Proxy so I can have multiple web properties using the same ports, and deploying WordPress with NGINX and Facebook’s HipHop VM to improve PHP performance as a container, using a containerized version of MySQL. Eventually, I should have containers for my web-based applications so that I can be Linux-distribution agnostic. All this, while maintaining a level of security, isolation, and separation of my workloads.

The post Foray into Jenkins, Puppet, Docker, and Photon appeared first on AstroArch Consulting, Inc.

26 Mar 07:51

How I Became a Consultant: Part 1

by Edward Haletky

There is no one true path to becoming an independent consultant: there are many. But perhaps a look at my path may help others. There are two, perhaps three, types of consultants out there. The questions to ask yourself are, “Which are you?” and “Why does each matter?” There are concerns with each type that impact your life, your family, and your state of mind. Keeping all those in balance is important as a consultant.

The three types of consultants are:

  1. Consultant for an existing organization. If you work for a vendor, value added reseller, analyst firm, etc., you may be asked to be a consultant. That means talking to customers and solving their problems. It also usually means independent work at a customer site, with lots of travel. This is the usual starting point for those who wish to be consultants who like autonomy and working directly with customers.
  2. Consultant for a contracting agency. You are working for another business that works with customers to find the proper consultant to meet their needs. You are, in essence, paid by the agency, which in turn charges more for your time than you charge. The benefit is that you do not need to worry about invoicing, finding gigs, or even insurance. It is all handled for you. The only thing you may need to worry about is your health insurance (there is a need to hold many different insurance policies besides a health policy as a consultant) and ensuring the agency or agencies know your latest endeavors. Once more, you will be required to travel to make the best use of this form of consultancy.
  3. Consultant for your own consultancy. This is the hardest approach to consulting, as you need to find customers yourself and handle the business as well as the consulting. Yet the benefits are far-reaching, including the level of travel you are willing to perform. In the modern day and age, you may not have to travel at all. You will need good communication skills and tools, however. There are two forms of this type of consultancy: as a way of life, and as a business. In essence, are you doing consulting to support your way of life, or are you trying to build an ongoing business that may eventually hire others, at which point you become their agency or existing organization? Where this one takes you depends on your goals.

Now, I have worked at all three of these types of consultancies. I started as a captured employee fresh out of university, but soon went on to working for a contracting agency I found via C.E. Weekly (a then-weekly publication of potential gigs, now Contract Job Hunter).

One thing to note is that many contracts you find may lead you to full-time employment. I was offered several jobs and eventually chose one I really liked (and stayed with the company for roughly fifteen years). Eventually, I went out on my own with my own consultancy. Actually, I’d had my own consultancy since high school, and kept it as a way to work on what I wanted to while doing other things. However, this kind of situation can cause quite a few issues, specifically around conflict of interest.

You need to ensure that your consultancy, if you have one that you keep while doing other things, does not pose a conflict of interest for your other gigs or jobs. One way to do this is to be incredibly open about your work and to ensure that anything you have developed or will do for others is not considered part of your job and will not be owned by anyone else. This is where reading your employee and contract agreements comes in. Most have a place to mention prior art, patents, or works you have done for yourself.

For example, I had sold a product to a part of the organization that wanted to hire me. I was open and clear about that purchase and the need for continued support. The organization stated that this was fine, but that I couldn’t work on it during work hours and could not sell it to any other parts of the organization. The good thing is that the product was diametrically opposed to the work I was doing. That helps. If your product is too similar, you may have to give up on it while you are employed, except to grant support.

As a consultant, understanding potential conflicts of interest is a full-time part of the job. You need to be very careful here, and keep accurate records of the work you have been asked to do and by whom, and what you can and cannot say or even do. Transparency up front is very important.

Being a consultant is a decision you get to make, but in the case of the first approach, you may be forced into it by your current role. Could that lead to other things? Absolutely. What it really takes is the proper mindset and entrepreneurial spirit. Even with these characteristics, taking the leap to become a contract consultant or an owner of a consultancy is a big step, one that needs careful consideration.

However, there is no one path to being an independent consultant. My path moved from type three to one, two, and one again, and then back to type three, where I am now. It all depends on your situation, your appetite for travel, and your goals. A big part of this decision should take into account your family and their comfort levels.

We will discuss many of those considerations in future posts on the subject of being a consultant. Look for more to come soon!

The post How I Became a Consultant: Part 1 appeared first on AstroArch Consulting, Inc.

26 Mar 07:51

Foray into Jenkins, Docker, and Photon: Part 3

by Edward Haletky

In previous Foray into Jenkins, Puppet, Docker, and Photon posts, I cloned a Photon OS VM (part 1) and deployed a Docker container into that VM (part 2). Now it is time to automate load and security testing of the deployed application. Load testing is required to determine the upper limit of the load this one container can handle. Once I know that, I can properly scale out the environment. But I also need to ensure that known security holes do not exist.

Load and security testing can be accomplished a number of different ways, but I have the opportunity to play with an Ixia BreakingPoint VE appliance, so I will use that.

Simply put, I did the following:

  • Used part 1 of this Foray to deploy a Photon OS VM with Remote Docker enabled as one build in Jenkins.
  • Used part 2 of this Foray to deploy an NGINX Docker container as a downstream build that fires when part 1 is successful within Jenkins.
  • Created a third build within Jenkins to fire when part 2 is successful.

This third build uses three scripts stored within a private Git repository. The first gets the IP of the NGINX Docker Photon OS instance (just as it did within part 2). The second digs the username and password out of the SDK credentials file so the build can log into the predeployed Ixia BreakingPoint VE. The third runs a load test against the NGINX instance.

Very simply put, I have Infrastructure as Code and Testing as Code, not to mention the ability to do Security Testing as Code, within a Jenkins build flow.

To do all this, I needed to add the BreakingPoint TCL shell (BPSH) to my Jenkins image. I did this by downloading BPSH directly from the BreakingPoint VE and placing it in the /usr/local/bin directory on the Jenkins server. However, it would not run immediately: BPSH is a 32-bit application and needs glibc.i686 and libgcc.i686, which I installed using:

yum -y install glibc.i686 libgcc.i686

Now it was just a case of creating a simple test within the Ixia BreakingPoint interface, exporting that test, and using it as an import to the automated test I would run. This way, most of the heavy lifting was done, and all my script had to do was:

  • Change the target IP for the test
  • Change the URI to use during the test
  • Change the number of sessions to run per second
  • Change the session counts within the test

Four small things to do using just three scripts and a canned test. In effect, it takes 95 seconds until I know whether there are security issues or not. Does my parser fail? Does the web server fail? I get scale plus security testing from one test. Adding more from the Ixia library of tests will tell me if, at scale, my server can handle the attacks.

If we tie this testing to an application performance management tool such as New Relic or Dynatrace Ruxit, we get even more data, including where any failures occur.

The first script gets the IP of the test target from vCenter using the Perl SDK. You can use PowerShell, but I had trouble with credential file permissions as the Jenkins user. You first have to create a credential store (see the VMware SDK cred_store.pl instructions).

Get the IP script
#!/usr/bin/perl -w
#
# Copyright (c) 2015 AstroArch Consulting, Inc. All rights reserved
#
# requires credstore

use strict;
use VMware::VIRuntime;
use VMware::VILib;
use VMware::VICredStore;
use File::Basename;

my $vmname=$ARGV[0];

VMware::VICredStore::init(filename => "./vicredentials.xml");
my @server_list=VMware::VICredStore::get_hosts();
my @user_list=VMware::VICredStore::get_usernames(server => $server_list[0]);
my $password=VMware::VICredStore::get_password(server => $server_list[0], username => $user_list[0]);
my $url = "https://".$server_list[0]."/sdk/vimService";
VMware::VICredStore::close();


eval {
	Vim::login(service_url => $url, user_name => $user_list[0], 
		password => $password);
};
if ($@) {
	print "$@"; exit 3;
}

my $vdata = Vim::find_entity_views(view_type => 'VirtualMachine', 
	filter => {"config.name" => $vmname});

foreach (@$vdata) {
	my $vm_view = $_;
	if (defined $vm_view->guest) {
		if (defined $vm_view->guest->net) {
			my $net_len = @{$vm_view->guest->net};
			my $cnt = 0;
			while ($cnt < $net_len) {
				if (defined $vm_view->guest->net->[$cnt]->ipAddress) {
					my $ip_len = @{$vm_view->guest->net->[$cnt]->ipAddress};
					my $cnt_ip = 0;
					while ($cnt_ip < $ip_len) {
						print $vm_view->guest->net->[$cnt]->ipAddress->[$cnt_ip]."\n";
						$cnt_ip++;
					}
				}
				$cnt++;
			}
		}
	}
}

Vim::logout();
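Note that the script above prints every address the guest reports, one per line, which can include IPv6 and link-local entries. A downstream step needs exactly one target, so a small filter helps. The following is a sketch only, with the hypothetical name `first_usable_ipv4`; it simply takes the first IPv4 address that is neither loopback nor link-local, which may not be the right choice in every network layout.

```python
import ipaddress
from typing import Optional


def first_usable_ipv4(script_output: str) -> Optional[str]:
    """Pick the first non-loopback, non-link-local IPv4 address.

    script_output is the text printed by the get-IP script: one address
    per line, possibly mixing IPv4 and IPv6.
    """
    for line in script_output.splitlines():
        text = line.strip()
        if not text:
            continue
        try:
            addr = ipaddress.ip_address(text)
        except ValueError:
            # Not an IP address at all (for example, an error message).
            continue
        if addr.version == 4 and not (addr.is_loopback or addr.is_link_local):
            return str(addr)
    return None
```

If the guest also reports an internal bridge address (Docker creates one), the filter would need an extra rule, such as matching the expected subnet.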

This second script gets the credentials from the credential store for use by the BreakingPoint script. It is set up to not be called from the command line directly.

Get Credentials Script
#!/usr/bin/perl -w
#
# Copyright (c) 2015 AstroArch Consulting, Inc. All rights reserved
#
# requires credstore
# not designed to be used directly

use strict;
use VMware::VIRuntime;
use VMware::VILib;
use VMware::VICredStore;
use File::Basename;

sub ltrim { my $s = shift; $s =~ s/^\s+//;       return $s };
sub rtrim { my $s = shift; $s =~ s/\s+$//;       return $s };
sub  trim { my $s = shift; $s =~ s/^\s+|\s+$//g; return $s };

my $parent_id;
foreach (`ps -ef`) {
	my ($uid,$pid,$ppid) = split;
	next unless ($pid eq $$);
	$parent_id = $ppid;
	last;
}

my $parent = (grep {/^\s*\d+/} (`ps -p $parent_id`))[0];
my $parent_name = trim((split /\s+/, $parent, 5)[4]);
if ($parent_name eq "bash") {
	exit;
}

VMware::VICredStore::init(filename => "./vicredentials.xml");
my @server_list=VMware::VICredStore::get_hosts();
my @user_list=VMware::VICredStore::get_usernames(server => $server_list[0]);
my $password=VMware::VICredStore::get_password(server => $server_list[0], username => $user_list[0]);
my $url = "https://".$server_list[0]."/sdk/vimService";
VMware::VICredStore::close();

print $user_list[0]." ".$server_list[0]." ".$password;

What follows is the BreakingPoint script, which takes several arguments. The most important are the target IP, the IP of the Ixia BreakingPoint server, and the URL to attack. We can also pass in the session count we want and the maximum number of sessions per second. In fact, the script requires all of these arguments. It is invoked as:

./bps_run.tcl TargetIP TestName IxiaServerIP LocalTestFileName SessionCount MaxPerSecond URL

This script allows us to use one script but pass many different URLs, each with a different attack embedded within it, such as SQL injection, cross-site scripting, and fuzzed input.
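To illustrate the "one script, many URLs" idea, here is a rough sketch of how a wrapper might build one bps_run.tcl command line per URL, keeping the argument order shown above. The helper names and the sample values (IPs, test name, file name, URLs) are all hypothetical and not taken from my Jenkins job.

```python
from typing import Dict, List


def bps_command(target_ip: str, test_name: str, ixia_ip: str, test_file: str,
                session_count: int, max_per_second: int, url: str,
                script: str = "./bps_run.tcl") -> List[str]:
    """Build the argument vector in the order bps_run.tcl expects."""
    return [script, target_ip, test_name, ixia_ip, test_file,
            str(session_count), str(max_per_second), url]


def bps_commands_for_urls(urls: List[str], **common: Dict) -> List[List[str]]:
    """One command line per URL, sharing all other arguments."""
    return [bps_command(url=u, **common) for u in urls]


# Hypothetical values for illustration only.
commands = bps_commands_for_urls(
    ["/index.html", "/index.html?id=1%27%20OR%20%271%27%3D%271"],
    target_ip="192.168.10.25", test_name="NginxLoad", ixia_ip="192.168.10.5",
    test_file="NginxLoad.bpt", session_count=1000, max_per_second=100,
)
```

Each list can then be handed to `subprocess.run(cmd)`, with the exit status deciding whether the build passes.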

Breaking Point Test Script
#!/usr/local/bin/bpsh
#
# Copyright (c) 2015 AstroArch Consulting, Inc. All rights reserved
#
# -- BPSH comes from BreakingPoint VE after login at
#	Ixia Web Apps -> Help -> Download
# -- BPSH is a 32-bit application and requires 
#	glibc.i686 and libgcc.i686 to be installed on x86_64 systems

####
# Get command line arguments
set testip [lindex $argv 0]
set testname [lindex $argv 1]
set testrig [lindex $argv 2]
set localtestname [lindex $argv 3]
set sessioncount [lindex $argv 4]
set maxpersecond [lindex $argv 5]
set uri [lindex $argv 6]
####

####
# We are getting the credentials to use from an encrypted store
# SECURITY: Never embed credentials unencrypted in code!
set creds [exec ./get_creds.pl]
set username [lindex $creds 0]
set password [lindex $creds 2]
# strip everything after @ as that is not allowed in BP VE
set au [split $username @]
set username [lindex $au 0]
####

####
# Base procedures for import/export
proc testimport {bpsobj} {
	global testname
	global localtestname
	puts "Import the Test"
	$bpsobj importTest $testname -file $localtestname -force
	puts "Imported Test: $testname"
}
 
proc exportresults {testobject} {
	puts "Starting Test Report Export"
	$testobject exportReport -file MyTestReport.pdf
	puts "Test Report Export Completed."
}
####
 
####
# Get the Breaking Point connection
puts "Start the connection to BPS"
set bps [bps::connect $testrig $username $password]
puts "Connected to BPS bps, connection ID = $bps"
####

####
# Work with the chassis to put ports into a group for use
set chsobj [$bps getChassis -onclose exit]; #creates the chassis object
#
# Reserve the ports / really should check reserve state
# -- defaults to group 1 needs to be user specific
# -- should look for open ports and assign to 'new group'
#set pstate [$chsobj getState]
#set lstate [eval "list $pstate"]
#array set astate $lstate
#foreach index [array names astate] {
#	set x [string match {[0-9]} $index]
#	set y [string match {[0-9][0-9]} $index]
#	if { $x > 0 || $y > 0 } {
#		set vstate [eval list $astate($index)]
#		array set avstate $vstate
#		foreach vndex [array names avstate] {
#			puts $vndex
#		}
#	}
#}
# reserve everything <-- This should be smarter
set group 1
for {set i 1} {$i < 9} {incr i 1} {
	for {set j 0} {$j < 8} {incr j 1} {
		$chsobj reservePort $i $j
	}
}
####

####
# Import existing test (created on Ixia BreakingPoint and exported)
testimport $bps
# Create the test from the import
set testobject [$bps createTest -template "$testname" -name "$testname"]
####

########
# Edit the Test! We need to
# 	-- change target
#	-- change session counts
####
# Modify Network Neighborhood of the test by
#	cloning the neighborhood
#	changing the target IP of the test
puts "Modifying Network Neighborhood of Existing Test"
set neighborhood [$testobject cget -neighborhood]
set nnew [join [split $neighborhood "-"] ""]
set newNN [append nnew "New"]
puts "New Test Neighborhood Name: $newNN"
set nn [$bps createNetwork -template $neighborhood -name $newNN]
set ips [$nn getAll ip_external_hosts]
# should only be one
foreach {name ipObj} $ips {
	puts "External Hosts Entry: [$ipObj cget -id] to -ip_address $testip"
	$ipObj configure -ip_address $testip
}
$nn save -name $newNN -force
$testobject configure -network $newNN
####

####
# Get Test Components
puts "Getting Test Components"
set components [$testobject getComponents]
set lcomps [eval "list $components"]
array set acomps $lcomps
# Get New Superflow name
set sflow [$acomps(clientsimpreset_1) cget -superflow ]
set nflow [append sflow New]
####

####
# Modify Superflow
#	change web address
puts "Modify Superflow $sflow to $nflow using Uri $uri"
set superflowobject [$bps createSuperflow -template $sflow -name $nflow]
$superflowobject modifyAction 1 -uri $uri
$superflowobject save -force
####

####
# Modify Test Components with new SuperFlow and Parameters
puts "Modify Test Components for Max Sessions of $sessioncount using Superflow $nflow"
$acomps(clientsimpreset_1) configure -superflow $nflow
$acomps(clientsimpreset_1) configure -sessions.max $sessioncount
$acomps(clientsimpreset_1) configure -sessions.maxPerSecond $maxpersecond
####

####
# save the test
$testobject save -force
####
########

####
# Run the saved test
puts "Starting the test: $testname with neighborhood: $newNN"
#$testobject run -group $group -progress "bps::textprogress stdout"
$testobject run -group $group
####

####
# Get the results
set rc [$acomps(clientsimpreset_1) result]
set aa [$rc get appAttempted] 
set as [$rc get appSuccessful] 
set au [$rc get appUnsuccessful]

puts "Test results: Attempted: $aa Successful: $as Unsuccessful: $au"

$testobject exportReport -file "$testname.pdf" -format pdf
####

####
# Close the connection, cleanup
# -- should clean up reserved ports based on 'group' used
for {set i 1} {$i < 9} {incr i 1} {
	for {set j 0} {$j < 8} {incr j 1} {
		$chsobj unreservePort $i $j
	}
}
$bps delete
####

set res "passed"
if { $au > 0 } {
	set res "failed"
}
puts "Test Completed with $res Result!!!!"
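# Exit statuses are 8 bits wide, so a raw count such as 256 would wrap to 0
# and look like success; report 1 for any failure instead.
exit [expr {$au > 0 ? 1 : 0}]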

The post Foray into Jenkins, Docker, and Photon: Part 3 appeared first on AstroArch Consulting, Inc.

26 Mar 07:38

Review: littleBits Gadgets and Gizmos electronics kits for STEM kids

by Scott Hanselman

I love posting about STEM (Science, Technology, Engineering, and Mathematics) and some of the great resources, products, and software that we can use to better prepare the next generation of little techies.

Here's some previous posts I've done on the topics of STEM, kids, programming, and learning with young people:

The 8 year old (recently 7, now barely 8) has been playing with littleBits lately and having a blast. He loved SnapCircuits so littleBits seemed like a reasonable, if slightly higher-level, option.

SnapCircuits boldly has kids as young as three or four creating circuitry from a simple light and switch all the way up to a solar-powered radio or a burglar/door alarm. It doesn't hide the complexities of volts and amps and includes low-level components like resistors. Frankly, I wish my first EE (Electrical Engineering) class in college was taught with SnapCircuits.

LittleBits (usually a lowercase L) jumps up a layer of abstraction and includes motors, motion detectors, LED arrays, and lots more. There are also specific kits for specific interests like a littleBits Musical Electronics Synth Kit and a littleBits Smart Home Kit that include specific littleBits that extend the base kit.


The key to littleBits is their magic magnet that makes it basically impossible to do something wrong or hurt yourself. The genius here is that the magnet only goes one way (because: magnets) and the underlying connector transmits both power and data.

You start with a power bit, then add an "if" statement like a switch, then move on to a "do" statement like a motor or light or whatever. In just about 20 minutes my 8 year old was able to take a LEGO custom Star Wars Blaster and add totally new functionality like lights and sounds.



One of the aspects of littleBits that I think is powerful but that wasn't immediately obvious to me is that you shouldn't be afraid to use glue or more permanent attachments with your projects. I initially tried to attach littleBits with rubber bands and strings but realized that they'd smartly included "glue dots" and Velcro as well as 3M adhesive pads. Once we stopped being "afraid" to use these stickers and adhesives, suddenly little projects became semi-permanent technical art installations.

We got the "Gizmos & Gadgets" kit which is a little spendy, but it includes 15 bits that enable you to do basically anything. The instructions are great and we had a remote-controlled robot that could drive around the room running within an hour. It's a great setup, a fun kit, and something that kids 8-14 will use all the time.

Here are some fantastic examples of other Star Wars related littleBits projects for you to explore:

*Amazon links are referral links on my blog. Click them and share them to support the blog and the work I do, writing this blog on my own time. Thanks!





© 2016 Scott Hanselman. All rights reserved.
     
15 Feb 10:17

With Deep SQL Knowledge, These MVPs Do Extensive Good

by MVP Award Program

The MVP community is well known for its generosity—freely sharing technical knowledge and expert advice, and often helping to make the world a better place through philanthropic initiatives. We checked back in with a project that got its start almost a decade ago and has brought together scores of MVPs to help children around the world.

 

Inspired by Bill Gates at the 2007 MVP Global Summit to make giving back to others a part of their lives, 53 MVPs came together afterwards to write SQL Server MVP Deep Dives. It was described by its publisher in 2009 as “no ordinary SQL Server book. In SQL Server MVP Deep Dives, the world’s leading experts and practitioners offer a masterful collection of techniques and best practices for SQL Server development and administration.”

 

What also set the book apart was that its authors contributed 100% of their royalties to support War Child International, a network of independent organizations working across the world to help children affected by war.


Two years later, 64 MVPs came together to reprise those efforts, combining about 1,000 years of experience in SQL Server administration, development, training, and design to author SQL Server MVP Deep Dives, Volume 2. Those MVPs generously pledged the proceeds from their book to Operation Smile, an international children’s medical charity that heals children’s smiles through a mobilized force of medical professionals who provide safe, effective reconstructive surgery for children born with facial deformities such as cleft lip and cleft palate.

 

Congratulations to all the MVPs involved in this initiative, which so far has contributed more than $60,000 to children in need!

 

 

15 Feb 10:17

A Hitchhikers Guide to Search

by MVP Award Program

Editor’s note: The following post was written by Windows Development MVP Matias Quaranta as part of our Technical Tuesday series.

 

“The answer to the ultimate question of life, the universe, and everything is… 42”

The Hitchhiker’s Guide to the Galaxy

 

Or, translating it to Web-Applications Terms, the answer to the ultimate question of content discoverability, performance and everything is… a great Search Service.

The goal of this Guide is to provide quick access to the most relevant pieces of information regarding the key aspects of Azure Search. I won’t provide exact code examples, but I’ll point you in the best direction to find them, hopefully making your journey to knowledge faster.

 

A look back in time

To understand what Azure Search is, let’s first go back in time to Lucene, a text search engine (originally created in Java and later ported to .NET as Lucene.net) that creates indexes on our documents based on their data and characteristics, allowing fast searches including tokenization and root word analysis. It also lets us create custom logic to weight and score documents, so our results will match more closely what the user is trying to find.

Then came Solr, a wrapper around Lucene that provided index access through XML/HTTP services, caching and replication. We could install Solr on our servers or a virtual machine and consume it from our applications.

ElasticSearch was born as an implementation over Lucene that added a REST interface, replication, faceting, filtering, JSON (schema-free) document storage, geo-localization and suggestions (among its main features). But we still needed to rely on maintaining and managing our own infrastructure.

 

Enter Azure Search


Azure Search can be understood as a fully managed Search-as-a-service solution working in the Cloud. This means that we don’t need to worry about or invest our time in maintaining the infrastructure behind the search engine. We can focus 100% on creating our product and offer increased value to our customers by adding a robust search experience to our applications. Azure Search adds scalability, fault-tolerance and replication, all working behind a public REST API (that uses OData syntax and Lucene Query syntax for queries) that allows our applications to use the engine directly from our on-premises servers, from Azure, from another cloud provider or from any other hosting solution.

It’s important to highlight the flexibility of Azure Search. With a couple of clicks on the Azure Portal (or through the Azure Search Management REST API) we can enlarge or shrink our engine capacity (either size by adjusting partitions or throughput/availability by adjusting replicas) according to our needs and budget, effectively adjusting to the demand curve in a quick and effortless way.

The main capabilities (some of which we will discuss in this article) of Azure Search are:

  • Full-Text Search: We can create any number of indexes without extra cost to index our documents manually (using the API) or automatically (using Indexers) and perform blazing fast searches no matter the size of the documents.
  • Multi-Language support: Right now, Azure Search supports 56 languages for the text Analyzers that work the indexing magic, allowing word stemming in each and every one of those languages.
  • Custom scoring: By default, Azure Search applies the TF-IDF algorithm to our index’s Searchable fields to calculate a score and provide results ordered from highest to lowest, but we can customize this behavior by assigning weights to different fields or applying functions (boosting new content, for example) that alter the resulting score.
  • Hit-highlighting: Allows you to show where in the documents the search words were found.
  • Suggestions: Suggesting possible values for an auto-complete search text input.
  • Faceting: A facet is the quantitative categorization of a document on a given area, used mostly on Product Listings or Navigation that include possible filtering values with their quantities.
  • Filtering: Allows for result narrowing given certain field values.
  • Advanced querying: Azure Search supports keyword, phrase and prefix search (including the use of “+”, “-” or “*”) and, more recently, Lucene query syntax (we’ll discuss this in detail in the following sections).
  • Geo-spatial support: Geo-spatial data can be stored and used to provide custom scoring, filtering and sorting.

 

Creating indexes

An index is defined as an abstraction above your document data. It contains all the necessary information for users to search and find your content. It can contain your whole document or just key attributes, for example, for a news article, the title, body and main image, but not all the other images associated with it, because that’s not relevant to the search function. You can then point to the whole document using the search result if you need to.

An index contains fields. Azure Search supports several field types within an index:

  • Edm.String: Text that can optionally be tokenized for full-text search (word-breaking, stemming, etc.).
  • Edm.Boolean: True/false.
  • Edm.Int32: 32-bit integer values.
  • Edm.Int64: 64-bit integer values.
  • Edm.Double: Double-precision numeric data.
  • Edm.DateTimeOffset: Date time values represented in the OData V4 format: yyyy-MM-ddTHH:mm:ss.fffZ or yyyy-MM-ddTHH:mm:ss.fff[+|-]HH:mm. Precision of DateTime fields is limited to milliseconds. If you upload datetime values with sub-millisecond precision, the value returned will be rounded to milliseconds (for example, 2015-04-15T10:30:09.7552052Z will be returned as 2015-04-15T10:30:09.7550000Z).
  • Collection(Edm.String): A list of strings that can optionally be tokenized for full-text search.
  • Edm.GeographyPoint: A point representing a geographic location on the globe. For request and response bodies the representation of values of this type follows the GeoJSON “Point” type format. For URLs OData uses a literal form based on the WKT standard.

And each field can have one or more of these attributes:

  • Retrievable: Can be retrieved among the search results.
  • Searchable: The field is indexed and analyzed and can be used for full-text search.
  • Filterable: The field can be used to apply filters or be used in Scoring Functions (next section).
  • Sortable: The field can be used to sort results. Sorting results overrides the scoring order that Azure Search provides.
  • Facetable: The field values can be used to calculate Facets and possibly afterwards used for Filtering.
  • Key: It’s the primary unique key of the document.

With this in mind, creating an index in the Azure Portal or through the API is quite easy.
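As a rough illustration of how the field types and attributes above combine, here is a minimal sketch that builds an index definition as JSON. The overall shape (a name plus a list of fields with per-field flags) follows my understanding of the Create Index REST request body; the index name, the field names and the helper functions are hypothetical.

```python
import json


def field(name, edm_type, **attributes):
    """Build one field entry; attributes are the per-field flags described above."""
    entry = {"name": name, "type": edm_type}
    entry.update(attributes)
    return entry


# Hypothetical index for the news-article example (all names are illustrative).
index_definition = {
    "name": "articles",
    "fields": [
        field("id", "Edm.String", key=True, retrievable=True),
        field("title", "Edm.String", searchable=True, retrievable=True),
        field("body", "Edm.String", searchable=True, retrievable=False),
        field("category", "Edm.String", filterable=True, facetable=True),
        field("publishedOn", "Edm.DateTimeOffset", filterable=True, sortable=True),
        field("tags", "Collection(Edm.String)", searchable=True, facetable=True),
    ],
}


def key_field(definition):
    """Return the name of the key field, insisting that there is exactly one."""
    keys = [f["name"] for f in definition["fields"] if f.get("key")]
    if len(keys) != 1:
        raise ValueError("an index needs exactly one key field")
    return keys[0]


# The serialized body is what would be sent to the service when creating the index.
request_body = json.dumps(index_definition, indent=2)
```

Only the fields that really need an attribute get it here, which keeps indexing time and storage down.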

 

Exploring and searching

Searching is achieved through the API using OData syntax or Lucene Query syntax, but there’s a Search explorer available on the Azure Portal too.

Queries use the Simple query syntax by default (we’ll look at it in more detail later on). We submit a search text, Azure Search calculates the score of each document, and the results are returned in descending score order. Score customization is explained in the next section.

Results contain the fields marked as Retrievable, in JSON format, along with the calculated score.
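To make the request side concrete, the following sketch composes a search URL without sending it. The service name, index name and the api-version value are assumptions for illustration, and a real request also needs an api-key header, which is omitted here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(service, index, search_text, options=None, api_version="2015-02-28"):
    """Compose a GET search URL; options holds OData parameters such as $filter or $top."""
    params = {"search": search_text, "api-version": api_version}
    params.update(options or {})
    base = "https://{0}.search.windows.net/indexes/{1}/docs".format(service, index)
    return base + "?" + urlencode(params)


def query_params(url):
    """Decode the query string back into a dict, handy for checking what was built."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


# Hypothetical service and index names.
url = build_search_url(
    "myservice", "articles", "azure search",
    options={"$filter": "category eq 'cloud'", "$top": 5},
)
```

The response to such a request would carry the Retrievable fields of each hit together with its score.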

 

Scoring

We talked about indexes and how searches are, by default, treated with the TF-IDF algorithm to calculate the result score on Searchable fields.

What if we don’t want the default behavior? What if our documents have attributes that are more relevant than others, or if we want to provide our users with geo-spatial support?

Luckily, we can do this with Custom Scoring Profiles. A Scoring Profile is defined by:

  • A Name (following Naming Rules).
  • A group of one or more Searchable Fields and a Weight for each of them. The Weight is just a relative value of relevance among the selected fields. For example, in a document that represents a news article with a Title, Summary and Body, I could assign a Weight of 1 to the Body, a Weight of 2 to the Summary (because it’s twice as important) and a Weight of 3.5 to the Title (Weights can have decimals).
  • Optionally, Scoring Functions that will alter the result of the document score for certain scenarios. Available scoring functions are:
    • “freshness”: For boosting documents that are older or newer (on an Edm.DateTimeOffset field). For example, raising the score of the current month’s news above the rest.
    • “magnitude”: For boosting documents based on numeric field (Edm.Int32, Edm.Int64 and Edm.Double) values. Mostly used to boost items given their price (cheaper ranks higher) or count of downloads, but can be applied to any custom logic you can think of.
    • “distance”: For boosting documents based on their location (Edm.GeographyPoint fields). The most common scenario is the “Show the results closer to me” feature on search apps.
    • “tag”: Used for Tag Boosting scenarios. If we know our users, we can “tag” them with (for example) the product categories they like more, and when they search, we can boost the results that match those categories, providing a personalized result list for each user.

Custom Scoring Profiles can be created through the API or on the Portal.
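The news-article example above could be written down roughly as follows. This is a sketch of a scoring profile as it would appear in an index definition, based on my understanding of the REST schema; the profile name, the publishedOn field, the boost value and the 30-day duration are all hypothetical.

```python
import json

# Weights from the example: Body 1, Summary 2, Title 3.5.
scoring_profile = {
    "name": "newsRelevance",  # hypothetical profile name
    "text": {"weights": {"title": 3.5, "summary": 2, "body": 1}},
    "functions": [
        {
            # Boost recent articles; the field name and values are assumptions.
            "type": "freshness",
            "fieldName": "publishedOn",
            "boost": 2,
            "interpolation": "linear",
            "freshness": {"boostingDuration": "P30D"},
        }
    ],
}


def relative_weights(profile):
    """Express each weight relative to the smallest one, since weights are only relative."""
    weights = profile["text"]["weights"]
    smallest = min(weights.values())
    return {name: value / smallest for name, value in weights.items()}


profile_json = json.dumps(scoring_profile, indent=2)
```

Because weights are relative, doubling all three values would leave the ranking unchanged.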

 

Content Indexers

What good is searching if you don’t have content? Adding or updating documents is almost as important as finding them. That’s why Azure Search provides several methods for importing our data into our indexes.

The most straightforward method is using the Document API, directly as a REST service or using the Azure Search SDK. There are plenty of examples around, including geo-spatial support.

Using the API, however, means that the logic deciding when and which documents are added to and maintained in the index lives in our code: we decide when to add, when to update and when to remove. We may need this “freedom” because our business logic or our storage requires it.

Another option, more dynamic and faster, is to use Indexers. An Indexer is a process that describes how the data flows from your data source into a target search index. A search index can have several Indexers (from different data sources), but an Indexer can have only one associated index.

Supported data sources right now are:

  • DocumentDB: You can sync your index with a DocumentDB collection and even customize the queries used to feed the Indexer.
  • SQL Server: Content that lives in Azure SQL databases, on other cloud providers or even in on-premises databases can be indexed by creating an Indexer for SQL that maps a query on a table or a view.
  • Blob Storage: Azure Search can index your blobs (HTML, MS Office formats, PDF, XML, ZIP, JSON and plain text) in Azure Storage by mapping blob metadata to index fields and the file contents as a single field.

Indexers can be created using the API or using the Portal. They can be run once or assigned a schedule and they can track changes based on SQL Integrated Change Tracking or a High Watermark Policy (an internal mark that tracks last updated timestamps).
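The high watermark idea can be illustrated with a toy model. This is not how Azure Search implements it internally; it only sketches the concept of remembering the newest timestamp seen and picking up rows newer than that on the next run. The row layout and function name are hypothetical.

```python
from datetime import datetime


def rows_changed_since(rows, high_watermark):
    """Return the rows newer than the watermark, plus the watermark to store for next time."""
    changed = [
        row for row in rows
        if high_watermark is None or row["last_updated"] > high_watermark
    ]
    new_watermark = max((row["last_updated"] for row in changed), default=high_watermark)
    return changed, new_watermark


# Hypothetical source table with a last-updated column.
source_rows = [
    {"id": 1, "last_updated": datetime(2016, 1, 10)},
    {"id": 2, "last_updated": datetime(2016, 1, 12)},
]

# First run: no watermark yet, so every row is indexed.
first_run, watermark = rows_changed_since(source_rows, None)

# A new row arrives; the second run only picks up what changed after the watermark.
source_rows.append({"id": 3, "last_updated": datetime(2016, 1, 15)})
second_run, watermark = rows_changed_since(source_rows, watermark)
```

Note that a watermark of this kind cannot see deleted rows, which is one reason a separate change tracking mechanism can be preferable when the data source offers it.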

 

Suggestions

There cannot be a complete search experience without some sort of auto-complete functionality that offers the user possible search terms based on what they have already typed. With Azure Search you can create your own Suggesters: you define which fields feed the Suggester with their values, and then consume the Suggester through the API to offer possible search terms to the user.

The fields used by the Suggester can only be of types Edm.String and Collection(Edm.String), and they must use the default analyzer. It is advised that these fields have low cardinality to provide the best performance, but we’ll talk about performance shortly.

Azure Search Suggesters support Fuzzy Search too. Keep in mind that, performance-wise, fuzzy suggestions are slower because of the extra fuzzy analysis.

The creation of the Suggester can be achieved through the API or through the Portal. Keep in mind that you can only have one per index and you cannot edit it afterwards.
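A small sketch of both halves, defining a Suggester and building a suggestion request, might look like this. The JSON keys and query parameters reflect my understanding of the REST API and should be checked against the reference; the service, index, field and suggester names are hypothetical.

```python
from urllib.parse import urlencode

# Field types the text above allows as Suggester sources.
ALLOWED_SOURCE_TYPES = {"Edm.String", "Collection(Edm.String)"}


def make_suggester(name, source_fields, field_types):
    """Build a suggester definition, rejecting source fields of unsupported types."""
    for field_name in source_fields:
        if field_types.get(field_name) not in ALLOWED_SOURCE_TYPES:
            raise ValueError("field %r cannot feed a suggester" % field_name)
    return {
        "name": name,
        "searchMode": "analyzingInfixMatching",
        "sourceFields": list(source_fields),
    }


def suggest_url(service, index, partial_text, suggester_name, fuzzy=False,
                api_version="2015-02-28"):
    """Compose a suggestions URL; fuzzy=True trades speed for typo tolerance."""
    params = {
        "search": partial_text,
        "suggesterName": suggester_name,
        "fuzzy": "true" if fuzzy else "false",
        "api-version": api_version,
    }
    return "https://{0}.search.windows.net/indexes/{1}/docs/suggest?{2}".format(
        service, index, urlencode(params))


# Hypothetical index fields and their types.
types = {"title": "Edm.String", "tags": "Collection(Edm.String)", "price": "Edm.Double"}
suggester = make_suggester("sg", ["title", "tags"], types)
url = suggest_url("myservice", "articles", "azur", "sg", fuzzy=True)
```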

There are full examples available, including one that implements Type-Ahead client functionality.

 

Advanced querying

Lucene query syntax support, based on Apache’s definition, was recently announced for Azure Search.

Some of you might wonder: what’s the difference between Azure Search’s default queries (let’s call them Simple query syntax) and Lucene syntax? If the latter is more complete, why wouldn’t I use it by default?

Azure Search Simple query syntax (based on Lucene’s Simple Query Parser) is enough for almost all scenarios. It matches documents containing any or all of the search terms, including any variations found during analysis of the text, and calculates the score based on the TF-IDF algorithm (Custom Scoring Profiles help to customize the fields and the resulting score). The internal Lucene queries used are optimized to provide the best possible performance.

Lucene query syntax gives you a more granular and powerful control of the query. It’s mostly used for these key scenarios:

  • Fuzzy search: By adding the tilde “~” after a search term, you can instruct Azure Search to consider variations of the word or misspellings.
  • Proximity search: By adding the tilde “~” plus a number after a quoted phrase, you can specify the maximum word distance between the search terms. For example, “hotel airport”~5 will find documents that contain both words with at most 5 words in between.
  • Term boosting: By adding a caret “^” and a number (any positive number, including decimal values) to a search term, we can boost the score of the documents that contain that particular term. For example, searching for “lucene^2 search” scores documents that contain “lucene” along with “search” higher than those that only contain “search”. This is different from Scoring Profiles, since Term Boosting applies to search terms and Scoring Profiles apply to index fields. By default, every term has a boost of 1; using values below one (like 0.2) effectively decreases the score contribution of that term.
  • Regular expressions: A regular expression can be used by placing it between forward slashes “/”. Valid expressions are described by Lucene’s RegExp class.

 

As you can see, Lucene query syntax gives you more flexibility and power in creating your queries but, as the saying goes, “with great power comes great responsibility”: your queries are only as optimized as you make them.
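The four scenarios above translate into search strings like the ones in this sketch, which also shows how the full syntax would be requested. The queryType parameter selects the Lucene parser as far as I know; the service name, index name and the api-version value are assumptions.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# One example search string per scenario (the search terms are illustrative).
LUCENE_EXAMPLES = {
    "fuzzy": "hotle~",                    # tolerate the misspelling of "hotel"
    "proximity": '"hotel airport"~5',     # both words, at most 5 words apart
    "term boosting": "lucene^2 search",   # "lucene" counts double
    "regular expression": "/[mh]otel/",   # matches "motel" or "hotel"
}


def full_syntax_url(service, index, search_text, api_version="2015-02-28-Preview"):
    """Compose a search URL that asks for Lucene (full) query syntax."""
    params = {"search": search_text, "queryType": "full", "api-version": api_version}
    return "https://{0}.search.windows.net/indexes/{1}/docs?{2}".format(
        service, index, urlencode(params))


def decoded_search_text(url):
    """Recover the search string from a URL to confirm it survived encoding."""
    return parse_qs(urlsplit(url).query)["search"][0]


urls = {name: full_syntax_url("myservice", "articles", text)
        for name, text in LUCENE_EXAMPLES.items()}
```

Special characters such as the quotes and the caret must be URL-encoded, which is why the sketch leaves that to urlencode rather than concatenating strings by hand.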

Analytics

After creating our service and consuming it for some time, we may be wondering: Can I see how frequently the service is being used? What are the most common queries? Am I reaching my service throughput quota?

The answer is yes. You can enable Traffic Analytics for your Azure Search Service. You just need an Azure Storage Account where your analytics logs can be stored.

Once the data starts flowing, you can use tools like PowerBI Desktop to get a more graphical and comprehensive view of the service.

Not only can you see how often your service is used and its latency, but you can even find out how often you are hitting HTTP Status 503, which means that you are above your service quota.

Performance guidelines and tips

Even on a service as optimized as Azure Search there’s room for good practices and correct use scenarios.

Following Pablo Castro’s excellent presentation at AzureCon, I’ll humbly highlight some of the most common questions and doubts regarding the performance of the service.

 

Let’s start with some common points:

  • When you create an index, only mark fields as Searchable, Facetable or Filterable if they really need to be. Each of these attributes increases the indexing time and the storage usage of your service.
  • Enable Suggesters only if you are going to use them; they increase indexing times.
  • If your data is in Azure SQL, DocumentDB or Azure Storage Blobs, use Indexers; they have optimized queries to obtain and process the data.
  • Facetable fields work best with low-cardinality values. High cardinality will slow down your queries, so it’s best to plan ahead.
  • Queries with low selectivity are slower, since the engine has to consider all the indexed documents if we don’t apply any kind of Filter. Make queries as explicit and selective as possible.

Keep in mind that all service tiers (especially the Free tier) have usage limitations. To overcome these limitations we can scale our service for more storage (increasing Partitions) and/or for more throughput and parallelism (increasing Replicas).

For High-availability, it is recommended to use two Replicas for read-only queries and three Replicas for read-write workloads.

To understand how many replicas we really need, it’s crucial to know what latency our own product or service expects from the search service. The best way to size the Replicas is to test our service with a normal workload and measure the current latency. If we need lower latency, we can add Replicas and repeat the test.

 

Conclusion

Hopefully this guide will help you find your answers as quickly as Azure Search provides results, well, maybe a little bit slower, but it will save you enough time so you can focus on building your best search application and get it running quickly enough.

Remember that you can try Azure Search for FREE, you don’t need to spend any money to make your proofs of concept or tests.

Matias

About the author

Microsoft MVP, Azure & Web Engineer, open source contributor and firm believer in the freedom of knowledge.

15 Feb 10:16

Azure SQL Database Security Features

by Kun Cheng (SQLCAT)

The Microsoft Azure platform is evolving fast. Azure SQL Database, which is a relational database service running on Azure, is riding high on the cloud wave, with new features enabled at a fast pace. I want to share a few Azure SQL Database security features (currently in GA or public preview) that can help developers and DBAs develop and manage a secure SQL Database solution. All security features mentioned in this blog are available for Basic, Standard, and Premium databases in v12 servers.

Feature | Status | Target scenario
Firewall | GA | All
Secure connection | GA | All
Auditing | GA | Log data access/change trails for regulatory compliance
Data masking | Public preview (V12) | Obfuscate confidential data in the result set of a query.
Row-level security (RLS) | Public preview (V12) | Multi-tenant data access isolation.

 

Firewall (GA) – This feature has been available for Azure SQL Database since the very beginning. It’s a way for DBAs to control which clients, based on IP addresses, can access a logical Azure SQL Server or a specific database. By default, for a newly created logical server, no firewall rules are defined and nobody outside of Azure can access any database on that server. You must define a rule before the first connection can be made. Note that the firewall rule IP ranges at server level and at database level don’t overlap. You may also allow other Azure services to access your server or database with a single rule, by selecting a checkbox rather than listing IP addresses.

Secure connection (GA) – SQL Database requires secure communication from clients, based on the TDS protocol over TLS (Transport Layer Security). Note that for an application to be truly protected against man-in-the-middle attacks, we encourage you to follow these guidelines: explicitly request an encrypted connection and do NOT trust the server-side certificate blindly (that is, let the client validate it).
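One common way to express those two settings is in the connection string. The sketch below assembles an ODBC-style string with encryption requested and certificate validation left on; the driver name, server, database and credentials are placeholders, and other client libraries spell the same options differently.

```python
def build_connection_string(server, database, user, password):
    """Assemble an ODBC connection string that requests encryption and validates the certificate."""
    parts = [
        "Driver={ODBC Driver 17 for SQL Server}",  # assumed driver; use whichever is installed
        "Server=tcp:" + server + ",1433",
        "Database=" + database,
        "Uid=" + user,
        "Pwd=" + password,
        "Encrypt=yes",                 # explicitly request an encrypted connection
        "TrustServerCertificate=no",   # do not skip certificate validation
    ]
    return ";".join(parts) + ";"


# Placeholder values only; never hard-code real credentials.
connection_string = build_connection_string(
    "myserver.database.windows.net", "mydb", "app_user", "<password>")
```

Setting the trust option to yes would still encrypt traffic but would accept any certificate, which is exactly what leaves room for a man-in-the-middle attack.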

Auditing (GA) – Allows customers to record selected database events in log files for alerting and post-mortem analysis, for example as part of maintaining regulatory compliance such as PCI or HIPAA. Common auditing events include insert, update, and delete events on tables. Using SQL Database Auditing, you can store the audit logs in Azure table storage and build reports on top of them. There is a preconfigured dashboard report template available for download (requires Excel 2013 or later plus Power Query). SQL Database Auditing requires the use of a secure connection string.

Data masking (public preview) – A policy-based security feature that limits exposure of sensitive data, such as credit card numbers, social security numbers or clinic patient information, to non-privileged users. Similar to Auditing, it’s useful for scenarios with compliance requirements. You may specify masking rules to be applied to designated fields, either at the source (tables/columns) or at the results (aliases used in queries). Note that masking rules are applied to the appropriate data in the result set of a query. Unlike encryption, data masking does NOT protect sensitive data at rest or during query processing in memory. Data masking requires the use of a secure connection string.

Row-level security (RLS) (public preview) – The feature is aimed at multi-tenant applications that share data in a single table within the same database. Typically, application developers currently have to build logic in the application code to keep tenants from accessing each other’s data. In contrast, RLS centralizes the isolation logic within the database, simplifying application design and reducing the risk of error. With RLS, security policy managers can encode the isolation logic in a security policy using inline table-valued functions. An example of how to use RLS in a middle-tier, multi-tenant application can be found here.

 

15 Feb 10:16

SQL 2014 Clustered Columnstore index rebuild and maintenance considerations

by Denzil Ribeiro

This article describes the index rebuild process as well as index maintenance for clustered Columnstore indexes and is directed towards SQL 2014. In SQL 2016, there have been several index maintenance enhancements that will be covered in a separate post.

 

Overview of Columnstore index build or rebuild process

Building (or rebuilding) a Columnstore index can be a resource-intensive operation. Index creation can take 1.5 times longer than for a regular b-tree index. Beyond the physical resources available, resource consumption and duration depend on a few factors, including:

  • Number and data types of columns which determine the memory grant estimation
  • Degree of Parallelism (DOP)
  • Resource Governor settings

 

Plan for Index Rebuild of a Non-Partitioned Table with 6 billion rows:

image

Plan for Index Rebuild of a partitioned table with 6 billion rows spread across 22 partitions

image

The Columnstore index is built in 2 phases indicated by the plans above:

  • Primary (Global) Dictionary: This is built first, using a serial plan irrespective of MAXDOP settings, and is persisted once complete. To build the primary dictionary we sample 1% of the rows if the table has over a million rows. Because this is a serial plan, the time taken by this step is incurred in all cases. The memory grant here is limited to 10%.
  • Segments and Local Dictionaries: Segments are then built in parallel, as are local dictionaries. First, memory consumption is estimated on a per-thread basis, based on a segment size of 1 million rows. The memory grant is requested and all the threads are started, but only the first one builds a segment; once it is done, the actual memory grant needed per thread is known. While this first segment is being built, all other threads wait for it to finish with the wait type COLUMNSTORE_BUILD_THROTTLE. Because COLUMNSTORE_BUILD_THROTTLE is cumulative across all the threads waiting for the first segment to be built, the higher the DOP on the system, the higher this value will be. As you can see in the XE capture below, the first segment build completes and then the waits for all the remaining threads complete.

 

image

 

From the first segment build, we know how much memory was needed. Based on this knowledge we activate only N number of threads whose total memory grant will fit into the memory that was granted earlier. This number of threads activated is represented in the plan as “Effective Degree of Parallelism” and can be checked against the actual degree of parallelism as seen in the diagram.

image
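The thread activation rule described above can be sketched as simple arithmetic. This is only an illustration of the reasoning, not the engine’s actual algorithm; the function name and the sample memory figures are hypothetical.

```python
def effective_dop(requested_dop, granted_memory_kb, per_thread_memory_kb):
    """Estimate how many threads fit in the grant once per-thread memory use is known."""
    if per_thread_memory_kb <= 0:
        raise ValueError("per-thread memory must be positive")
    threads_that_fit = granted_memory_kb // per_thread_memory_kb
    # At least one thread runs (the one that built the first segment),
    # and never more than the requested degree of parallelism.
    return max(1, min(requested_dop, threads_that_fit))


# Hypothetical figures: a 15 GB grant and 600 MB measured for the first segment.
constrained = effective_dop(requested_dop=32, granted_memory_kb=15461128,
                            per_thread_memory_kb=600000)

# With a smaller per-thread footprint, the requested DOP is the limit instead.
unconstrained = effective_dop(requested_dop=32, granted_memory_kb=15461128,
                              per_thread_memory_kb=400000)
```

In the first case only 25 of the 32 requested threads fit in the grant, which is the kind of gap between actual and effective degree of parallelism the plan exposes.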

 

The Memory Grant information is also available from the Query Plan.

image

 

During the life of the index build, additional memory can be granted within resource governor limits and low memory conditions are checked as well. If a Low memory condition is detected, the segment will be trimmed before it reaches the 1 million row mark.

A segment can be trimmed or closed before the 1 million row mark during an index build if:

  • A low memory condition is hit
  • The dictionary is full (dictionary size is 16MB)
  • DOP is greater than 1 and the last “N” row groups created don’t have 1 million rows

 

Columnstore Index Build Extended Events

There are a few Extended Events that can help in diagnosing index build and segment quality related issues.

a. Column_store_index_build_throttle – indicates if the DOP has been throttled.

image

b. Column_store_index_build_low_memory – tells us if a segment is trimmed due to low memory condition

image

c. Column_store_index_build_process_segment – gives us the segment that was trimmed and the trim_reason. Known trim_reason values include 1 = Low Memory and 2 = Dictionary full; this list isn’t exhaustive.

image

Columnstore Index build test cases:

The following table depicts Index build results on a non-partitioned table with 6 billion rows. While observing the numbers we can see that the higher the DOP the higher the CPU. Also the higher the DOP, given more threads are spawned, the memory grant goes up.

Note: There wasn’t a noticeable difference when the same tests were performed on a partitioned table. The benefits of having Columnstore index on a partitioned table include being able to rebuild an index at the partition level, partition elimination in conjunction with segment elimination and ability to archive data efficiently.

The table tested has 6 billion rows, and the SQL Server has 60 cores.

StartTime | EndTime | Duration | MAXDOP | Actual DOP | max_grant_kb | CPU
4/10/15 8:02 AM | 4/10/15 8:27 AM | 0:24:50 | 64 | 60 | 28989608 | 85%
4/10/15 8:28 AM | 4/10/15 9:07 AM | 0:39:40 | 32 | 32 | 15461128 | 50%
4/10/15 9:08 AM | 4/10/15 10:08 AM | 1:00:13 | 16 | 16 | 7730568 | 28%
4/10/15 10:09 AM | 4/10/15 11:53 AM | 1:43:59 | 8 | 8 | 3865288 | 13%

image

If you look at the perfmon charts below and the timings, you can see that rebuilding a Columnstore index does not scale linearly. This is because the Global dictionary is built serially, as discussed, and takes a constant amount of time irrespective of the MAXDOP for the index build. As the perfmon images below show, the dictionary build takes a larger percentage of the overall time the higher the degree of parallelism.

MAXDOP 32: Building the Global dictionary takes around 20% of the total index build time with MAXDOP 32.

image

MAXDOP 64: Building the Global dictionary takes around 29% of the total index build time when maxdop is 64.

image
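A quick back-of-the-envelope check with the figures quoted above shows why the percentages grow: the serial dictionary build stays at roughly seven to eight minutes while the total duration shrinks. The sketch only reuses the durations and percentages reported in this post; the helper names are my own.

```python
def to_seconds(duration):
    """Convert an h:mm:ss duration string into seconds."""
    hours, minutes, seconds = (int(part) for part in duration.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def serial_seconds(duration, dictionary_share):
    """Time spent in the serial global dictionary build for one run."""
    return to_seconds(duration) * dictionary_share


# (total duration, approximate share taken by the global dictionary), per MAXDOP.
runs = {32: ("0:39:40", 0.20), 64: ("0:24:50", 0.29)}

serial_by_maxdop = {
    maxdop: serial_seconds(duration, share)
    for maxdop, (duration, share) in runs.items()
}

# Observed speedup when going from MAXDOP 32 to MAXDOP 64.
speedup = to_seconds("0:39:40") / to_seconds("0:24:50")
```

The two serial estimates come out within about a minute of each other, and the speedup is well short of 2x, which matches the claim that the serial phase caps the benefit of adding parallelism.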

 

Columnstore Index Maintenance

There are 2 operations from a Columnstore index maintenance perspective:

INDEX REORGANIZE: This manually moves closed rowgroups into compressed columnar storage, and it is done online. You do not have to do this: the tuple mover will ultimately move a closed rowgroup into columnar storage. However, the tuple mover is single-threaded, so issuing an index reorganize is a way to manually invoke compression on a closed rowgroup.

INDEX REBUILD: This can be done at the partition level for a partitioned table. It is an offline index build, and at this point there isn’t an ONLINE equivalent. The rebuild reads and recompresses all the data in the specified partition or in the entire index. In an updatable Columnstore, deletes of data that resides in a compressed rowgroup are logical deletes, so you can encounter situations where the majority of the data in a rowgroup has been deleted. In such cases, in order to reclaim that space, you have to REBUILD the index for the partition in question. Another reason to rebuild an index on a partition is to improve rowgroup quality if you have a large number of rowgroups that each have a small number of rows. In this case, rebuilding the index can consolidate smaller rowgroups into larger ones, which can help both compression and query performance.

 

Here is a sample script which can help identify partitions that are good candidates to be rebuilt based on some thresholds defined in the script.

 

/*--------------------------------------------------------------------------------- 
The sample scripts are not supported under any Microsoft standard support program or service 
and are intended as a supplement to online documentation. The sample scripts are provided AS IS without warranty 
of any kind either expressed or implied. Microsoft further disclaims all implied warranties including, 
without limitation, any implied warranties of merchantability or of fitness for a particular purpose.
#--------------------------------------------------------------------------------- */
 
 /*
 A rebuild index statement is printed at partition level if
  a. RGQualityMeasure is not met for @PercentageRGQualityPassed percent of rowgroups
    -- @RGQuality is an arbitrary number; if the average rows per rowgroup is above it, don't bother rebuilding, as we consider those to be good quality rowgroups
  b. The second constraint is deleted rows; the default set below is 10% of the partition itself. If the partition is very large or small, consider adjusting this
  c. In SQL 2014, after an index rebuild, the DMV doesn't show why a rowgroup was trimmed to < 1 million rows.
   - If the dictionary is full (16MB) then there is no use in rebuilding this rowgroup, as even after rebuild it may get trimmed
   - If the dictionary is full, only rebuild if deleted rows are above the threshold
  */

 if object_id('tempdb..#temp') IS NOT NULL
 drop table #temp
 go
 
 Declare @DeletedRowsPercent Decimal(5,2)
 -- Debug = 1 if you need all rowgroup information regardless
 Declare @Debug int =0
 -- Percent of deleted rows for the partition
 Set @DeletedRowsPercent = 10   
 -- RGQuality: any compressed rowgroup with over 500K rows is considered good quality; anything less needs to be re-evaluated.
 Declare @RGQuality int = 500000 
 -- Threshold: flag the partition when at least this percentage (20%) of its rowgroups have fewer than @RGQuality rows
 Declare @PercentageRGQualityPassed smallint = 20  
 ;WITH CSAnalysis
 ( object_id,TableName,index_id,partition_number,CountRGs,TotalRows,
  AvgRowsPerRG,CountRGLessThanQualityMeasure,RGQualityMeasure,PercentageRGLessThanQualityMeasure
  ,DeletedRowsPercent,NumRowgroupsWithDeletedRows)
 AS
 (SELECT object_id,object_name(object_id) as TableName, index_id,
  rg.partition_number,count(*) as CountRGs, sum(total_rows) as TotalRows, Avg(total_rows) as AvgRowsPerRG,
  SUM(CASE WHEN rg.Total_Rows <@RGQuality THEN 1 ELSE 0 END) as CountRGLessThanQualityMeasure, @RGQuality as RGQualityMeasure,
  cast((SUM(CASE WHEN rg.Total_Rows <@RGQuality THEN 1.0 ELSE 0 END)/count(*) *100)  as Decimal(5,2))  as PercentageRGLessThanQualityMeasure,
  Sum(rg.deleted_rows * 1.0)/sum(rg.total_rows *1.0) *100 as 'DeletedRowsPercent',
  sum (case when rg.deleted_rows >0 then 1 else 0 end ) as 'NumRowgroupsWithDeletedRows'
  FROM  sys.column_store_row_groups rg  
  where rg.state = 3 
  group by rg.object_id, rg.partition_number,index_id
),
CSDictionaries  --(maxdictionarysize int,maxdictentrycount int,[object_id] int, partition_number int)
 AS
 (   select max(dict.on_disk_size) as maxdictionarysize, max(dict.entry_count) as maxdictionaryentrycount
  ,max(partition_number) as maxpartition_number,part.object_id, part.partition_number
  from sys.column_store_dictionaries dict
  join sys.partitions part on dict.hobt_id = part.hobt_id
  group by part.object_id, part.partition_number
) 
 select a.*,b.maxdictionarysize,b.maxdictionaryentrycount,maxpartition_number 
 into #temp from CSAnalysis a
 inner join CSDictionaries b
 on a.object_id = b.object_id and a.partition_number = b.partition_number

 
-- MAXDOP hint optionally added to ensure we don't spread a small number of rows across many threads.
-- If we do that, we may end up with smaller rowgroups anyway.
 declare @maxdophint smallint, @effectivedop smallint  
 -- True if running from the same context that will run the rebuild index.
 select @effectivedop=effective_max_dop from sys.dm_resource_governor_workload_groups
 where group_id in (select group_id from sys.dm_exec_requests where session_id = @@spid)
 
 -- Get the Alter Index Statements.
  select 'Alter INDEX ' + QuoteName(IndexName) + ' ON ' + QuoteName(TableName) + '  REBUILD ' +
 Case 
 when maxpartition_number = 1 THEN ' '
 else  ' PARTITION = ' + cast(partition_number as varchar(10)) 
 End
  + ' WITH (MAXDOP ='  + cast((Case  WHEN (TotalRows*1.0/1048576) < 1.0 THEN 1 WHEN (TotalRows*1.0/1048576) < @effectivedop THEN  FLOOR(TotalRows*1.0/1048576) ELSE 0 END) as varchar(10)) + ')'
 as Command
 from #temp a
 inner join
 ( select object_id,index_id,Name as IndexName from sys.indexes
    where type in ( 5,6) -- 5 = clustered columnstore, 6 = nonclustered columnstore
 ) as b
on b.object_id = a.object_id and a.index_id = b.index_id
where ( DeletedRowsPercent >= @DeletedRowsPercent)
-- Rowgroup Quality trigger, percentage less than rowgroup quality as long as dictionary is not full
 OR ( ( ( AvgRowsPerRG < @RGQuality and TotalRows > @RGQuality) AND PercentageRGLessThanQualityMeasure>= @PercentageRGQualityPassed)
  AND maxdictionarysize < ( 16*1000*1000)) -- DictionaryNotFull, lower threshold than 16MB.
 order by TableName,a.index_id,a.partition_number

-- Debug print if needed
if @Debug=1
  Select getdate() as DiagnosticsRunTime,* from #temp
  order by TableName,index_id,partition_number
else
  Select getdate() as DiagnosticsRunTime,* from #temp
  -- Deleted rows trigger
  where ( DeletedRowsPercent >= @DeletedRowsPercent)
  -- Rowgroup Quality trigger, percentage less than rowgroup quality as long as dictionary is not full
  OR ( ( ( AvgRowsPerRG < @RGQuality and TotalRows > @RGQuality) AND PercentageRGLessThanQualityMeasure>= @PercentageRGQualityPassed)
  AND maxdictionarysize < ( 16*1000*1000)) -- DictionaryNotFull, lower threshold than 16MB.
  order by TableName,index_id,partition_number
-- Add logic to actually run those statements
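
To make the script's decision rules easier to follow, here is the same logic restated as a small Python sketch. It mirrors the WHERE clause and the MAXDOP CASE expression above using the script's default thresholds; the function names and the dictionary-based input are my own, and it does not query SQL Server.

```python
DICTIONARY_LIMIT_BYTES = 16 * 1000 * 1000  # same "dictionary not full" threshold as the script
ROWGROUP_MAX_ROWS = 1048576                # maximum rows in one compressed rowgroup


def should_rebuild(partition, deleted_rows_percent=10, rg_quality=500000,
                   percentage_rg_quality_passed=20):
    """Return True when a partition trips the deleted-rows or the rowgroup-quality trigger."""
    deleted_trigger = partition["deleted_rows_percent"] >= deleted_rows_percent
    quality_trigger = (
        partition["avg_rows_per_rg"] < rg_quality
        and partition["total_rows"] > rg_quality
        and partition["pct_rg_below_quality"] >= percentage_rg_quality_passed
        and partition["max_dictionary_size"] < DICTIONARY_LIMIT_BYTES
    )
    return deleted_trigger or quality_trigger


def maxdop_hint(total_rows, effective_dop):
    """Pick a MAXDOP so few rows are not spread over many threads; 0 means no cap."""
    full_rowgroups = total_rows / ROWGROUP_MAX_ROWS
    if full_rowgroups < 1.0:
        return 1
    if full_rowgroups < effective_dop:
        return int(full_rowgroups)  # floor, as in the script
    return 0


# Hypothetical partition: many small rowgroups, few deletes, dictionary not full.
fragmented = {
    "deleted_rows_percent": 2.0,
    "avg_rows_per_rg": 120000,
    "total_rows": 3000000,
    "pct_rg_below_quality": 80.0,
    "max_dictionary_size": 4000000,
}
```

For the sample partition the quality trigger fires, and with an effective DOP of 8 the hint comes out as 2, since three million rows fill fewer than three rowgroups.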


 

Summarizing some of the best practices:

  • Partitioning enables rebuilding an index at the partition level, and dictionaries are created for each partition, in addition to the other manageability benefits.
  • MAXDOP influences the memory grant size. If segments are getting trimmed due to low memory, reducing MAXDOP can help.
  • Resource Governor settings for the respective pool can be increased from the default of 25% when rebuilding indexes if low memory causes rowgroups to be trimmed.
  • Massive deletions of rows that are already compressed in columnar format require a REBUILD of the index to reclaim that space.

 

 

Denzil Ribeiro

Program Manager SQL/Azure CAT

15 Feb 10:16

Choosing hash distributed table vs. round-robin distributed table in Azure SQL DW Service

by Murshed Zaman_AzureCAT

This topic explains the various Azure SQL Data Warehouse distributed table types, and offers guidance for choosing the type of distributed table to use and when. There are two types of distributed tables in Azure SQL DW at the time of writing: hash distributed tables and round-robin distributed tables.

Designing databases to use these distributed tables effectively will help you to achieve the storage and query processing benefits of the Azure SQL DW Service (SQL DW).

In SQL DW, a distribution is an Azure SQL database in which one or more distributed tables are stored. Each instance of SQL DW has many distributions, and many distributions can reside in a single Azure SQL instance.

The number of distributions is subject to change and is not important for understanding this particular topic.

Hash Distributed Table Basics

A hash distributed table is a table whose rows are dispersed across multiple distributions based on a hash function applied to a column. Each SQL instance contains a group of one or more rows. The following diagram depicts how a table within SQL DW is stored as a hash distributed table.

[Diagram: a table stored as a hash distributed table]

When processing queries involving distributed tables, SQL DW instances execute multiple internal queries, in parallel, within each SQL instance, one per distribution.  These separate processes (independent internal SQL queries) are executed to handle different distributions during query and load processing.

A distribution column is a single column (specified at table creation time) that SQL DW uses to assign each row to a distribution. A deterministic hash function uses the value in the distribution column to assign each row to belong to one and only one distribution. Two identical column values with the same data type will be hashed the same and thus will end up in the same distribution.

In the diagram, each row in the original file is stored on one distribution. The number of rows in each distribution can vary and is usually not identical from distribution to distribution.

There are performance considerations for the selection of a distribution column, such as minimizing data skew, minimizing data movement, and the types of queries executed on the system. For example, query performance improves when two hash distributed tables are joined on their distribution columns and those columns are of the same data type and size. This is called a distribution compatible join or a co-located join.
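
To make this concrete, here is a minimal sketch of two hash distributed tables that can be joined without data movement. The table and column names are hypothetical; the only point is that both tables are distributed on a NOT NULL column of the same data type and joined on that column.

```sql
-- Hypothetical fact tables, both hash distributed on CustomerKey.
CREATE TABLE dbo.FactSales
(
    SaleKey     BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,  -- distribution column
    SaleDate    DATE          NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);

CREATE TABLE dbo.FactReturns
(
    ReturnKey   BIGINT        NOT NULL,
    CustomerKey INT           NOT NULL,  -- same data type as in dbo.FactSales
    ReturnDate  DATE          NOT NULL,
    Amount      DECIMAL(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);

-- Rows with the same CustomerKey live in the same distribution in both
-- tables, so each distribution can resolve its part of the join locally.
SELECT s.CustomerKey, s.SaleKey, r.ReturnKey
FROM dbo.FactSales AS s
JOIN dbo.FactReturns AS r
    ON r.CustomerKey = s.CustomerKey;
```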

Round-Robin Distributed Table Basics

A round-robin distributed table is a table whose data is distributed evenly (or as evenly as possible) among all the distributions without the use of a hash function. Row placement in a round-robin distributed table is non-deterministic: the same row can end up in a different distribution each time it is inserted.

Each JOIN to a round-robin distributed table is a data movement operation. The data movement needed to perform join operations is a separate topic and will be published as a separate blog soon.

Common dimension tables, or tables that don’t distribute evenly on any column, are usually good candidates for round-robin distribution.

The following diagram shows a round-robin distributed table stored across different distributions.

[Diagram: a round-robin distributed table stored across distributions]
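
For comparison with the hash distributed case, a round-robin table only differs in its DISTRIBUTION option. The dimension table below is a hypothetical example, and the clustered index choice is just one reasonable option for a small lookup table.

```sql
-- Hypothetical small dimension table stored round-robin.
CREATE TABLE dbo.DimProduct
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL,
    Category    NVARCHAR(50)  NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,   -- no distribution column, rows are spread evenly
    CLUSTERED INDEX (ProductKey)
);
```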

Best Practices

In SQL DW, a user query is a logical query that gets divided into many physical queries, one for each distribution. The Engine Service on the control node acts as a coordinator and waits for each of these individual queries to finish before it returns results or executes the next part of a multi-step query.

When creating a table in SQL DW, you need to decide whether the table will be hash distributed or round-robin distributed. This decision has implications for query performance. Either type of distributed table may require data movement during query processing when tables are joined together. Data movement in an MPP RDBMS is an expensive but sometimes unavoidable step. In my 8-plus years of working with MPP data warehouses, I haven’t seen a real customer workload that can completely eliminate data movement. The objective of a good data warehouse design in SQL DW is to minimize data movement, so let’s keep that in mind while choosing a table design.

Here are considerations for choosing to use a round-robin distributed table or a hash distributed table:

1. To choose a good distribution design with SQL DW, you should know your data, DDL and queries. This is not unique to SQL DW; it applies to most MPP RDBMS systems. You need to minimize queries that move data, but also watch out for data that can heavily skew a certain distribution. If one of the distributions has more data than the others, it will be the slowest performing distribution. Since a SQL DW query is only as fast as its slowest distribution, we need to take note of any data-heavy (skewed or hot) distribution for the same table.

2. A nullable column is a bad candidate for the distribution column of a hash distributed table. All NULL values are hashed the same, so those rows will end up on the same distribution, creating a skewed (hot) distribution. If most of the columns are nullable and no good hash distribution can be achieved, that table is a good candidate for round-robin distribution. Choose ‘not null’ columns when creating a table that will be hash distributed.

3. A fact table column that carries a default value is also not a good candidate for the hash distribution column. DW developers will sometimes assign a value of -1 to otherwise unknown or early arriving values in a fact table. These values will create data skew on a particular distribution. Avoid this kind of default-value column unless you know for sure that the -1 values are negligible in your data.

4. Large fact tables or historical transaction tables are usually stored as hash distributed tables. These tables usually have a surrogate key that is monotonically increasing and is used in JOIN conditions with other fact and dimension tables. These surrogate keys are good candidates for distributing the data, as there are many unique values in that column. This allows the query operations to be performed across all distributions. Each distribution can work independently on separate subsets of data. This takes advantage of the processing resources across the MPP system. Queries on distributed tables may require data movement between distributions during query execution, and that is okay.

5. Dimension tables or other lookup tables in a schema can usually be stored as round-robin tables. Usually these tables connect to more than one fact table, and optimizing for one join may not be the best idea. Also, dimension tables are usually smaller, which can leave some distributions empty when hash distributed. Round-robin by definition guarantees a uniform data distribution.

6. If you are unsure of query patterns and data, you can start with all tables in round-robin distribution. As you learn the patterns, the data can easily be redistributed on a hash key.

7. When using ‘group by’, SQL DW will shuffle the data on the group by key. When multiple keys are present and statistics are up to date, SQL DW’s cost-based optimizer will pick the right key to shuffle the data. If this group by key is heavily non-unique, the query will be slower. A worst-case example would be grouping a large customer table by gender. If your query is running slowly, look at the explain plan (add the word ‘explain’ before your query and execute it) to find out which key is being used as the shuffle key. There may or may not be anything you can do to change this based on the query.
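
The considerations above map onto a few statements. The sketch below uses hypothetical table and column names to show one way to check a table for skew, redistribute a round-robin table on a hash key with CREATE TABLE AS SELECT, and look at the plan for a ‘group by’ query.

```sql
-- 1. Check for skew: rows and space used per distribution.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');

-- 2. Redistribute a round-robin table on a hash key with CTAS, then swap names.
--    CustomerKey is assumed to be NOT NULL with many distinct values.
CREATE TABLE dbo.FactSales_Hash
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.FactSales;

RENAME OBJECT dbo.FactSales TO FactSales_RoundRobin;
RENAME OBJECT dbo.FactSales_Hash TO FactSales;

-- 3. Return the query plan instead of running the query,
--    to see which key is used for the shuffle.
EXPLAIN
SELECT CustomerKey, SUM(Amount) AS TotalAmount
FROM dbo.FactSales
GROUP BY CustomerKey;
```

Keeping the old table under a new name until the new distribution has been validated is a design choice here, not a requirement.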

15 Feb 10:15

SQL 2016: Columnstore row group Merge policy and index maintenance improvements

by Denzil Ribeiro

            

A Columnstore index contains row groups that can have a maximum of 1,048,576 rows each. A row group can be closed and compressed for several reasons before that maximum is reached. Ideally, we want the row count in each row group to be as close to the maximum as possible, to get both better compression and, in turn, better segment elimination, as fewer segments have to be scanned. In SQL Server 2014, if a row group was compressed before it reached the maximum of 1,048,576 rows, there was no way to see why it had been compressed early.

For a more in-depth discussion of why row groups could be compressed before reaching the maximum row limit (“trim reason”), see the post “Data Loading performance considerations with Clustered Columnstore indexes”.

In SQL Server 2016 we expose the row group trim reason via the Dynamic Management View (DMV) sys.dm_db_column_store_row_group_physical_stats as shown below.

[Image: sys.dm_db_column_store_row_group_physical_stats output showing the trim reason]

The reasons for trim for a clustered Columnstore index include:

 

| Trim_reason_desc | Trim Reason |
|---|---|
| UNKNOWN_UPGRADED_FROM_PREVIOUS_VERSION | The reason wasn’t provided prior to SQL Server 2016, so it is also not captured for upgraded data. |
| NO_TRIM | Not trimmed. The row group has 1,048,576 rows. |
| BULKLOAD | BATCHSIZE specified for bulk insert, or end of bulk insert. |
| REORG_FORCED | REORG with COMPRESS_ALL_ROW_GROUPS = ON, which closes every open row group and compresses it into columnar format. |
| DICTIONARY_SIZE | If the dictionary is full (16MB dictionary), the row group will be trimmed. |
| MEMORY_LIMITATION | Memory pressure during index build caused the row group to be trimmed. |
| RESIDUAL_ROW_GROUP_INDEXBUILD | The last row group(s) have fewer than 1 million rows when the index is rebuilt. |
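
As a sketch of how the trim reasons above can be inspected, the query below lists compressed row groups that hold fewer rows than the maximum, together with the reason they were trimmed. It runs in the context of the database that owns the index; the filter and ordering are just one reasonable choice.

```sql
-- List compressed row groups that were trimmed before reaching 1,048,576 rows.
SELECT
    OBJECT_NAME(object_id) AS table_name,
    index_id,
    partition_number,
    row_group_id,
    state_desc,
    total_rows,
    deleted_rows,
    trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE state_desc = 'COMPRESSED'
  AND total_rows < 1048576
ORDER BY table_name, index_id, partition_number, row_group_id;
```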

 

Prior to SQL Server 2016, if we wanted to coalesce smaller compressed row groups into larger row groups or if we wanted to reclaim space due to deleted rows, we would have to rebuild the index. This offline index rebuild could be done at the table or the partition level.

For a more in-depth discussion of index rebuild and maintenance in SQL Server 2014 see SQL 2014 Clustered Columnstore index rebuild and maintenance considerations

Consider the two trimmed row group scenarios below:

A. Multiple row groups exist that are compressed and have fewer rows than the 1,048,576 row maximum. Here you see the trim reason as BULKLOAD, which means that an insert of more than 102,400 rows moved the data directly into compressed row groups.

Looking at the sys.dm_db_column_store_row_group_physical_stats output below, the current number of rows per row group could be due to multiple sessions (threads) each inserting 125,000 rows, or it could be due to a single INSERT…SELECT that used a parallel plan, with each thread having inserted 125,000 rows.

[Image: sys.dm_db_column_store_row_group_physical_stats output showing row groups trimmed with reason BULKLOAD]

B. The Columnstore index sys.dm_db_column_store_row_group_physical_stats output below shows several row groups with a significant number of rows deleted. In this scenario, the deleted rows in a compressed segment are logical deletes and not physical deletes, so in order to reclaim space, you have to rebuild the index in versions prior to SQL 2016.

[Image: sys.dm_db_column_store_row_group_physical_stats output showing row groups with many deleted rows]

 

SQL Server 2016 Merge Policy

SQL Server 2016 introduces the ability to “merge” smaller, eligible row groups into larger row groups. This is achieved by running INDEX REORGANIZE against a Columnstore index in SQL Server 2016.

INDEX REORGANIZE

  • Moves closed row groups into compressed columnar format
  • Merges multiple row groups into larger row groups that fit within the maximum row group size
  • Is an online operation unlike a rebuild index which is offline
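
The variants of the command can be sketched as follows. The index and table names and the partition number are hypothetical; COMPRESS_ALL_ROW_GROUPS is the option that also forces open row groups into columnar format.

```sql
-- Default behaviour: compress closed row groups and merge eligible compressed ones.
ALTER INDEX cci_FactSales ON dbo.FactSales REORGANIZE;

-- Additionally close every open row group and compress it.
ALTER INDEX cci_FactSales ON dbo.FactSales
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

-- Limit the operation to a single partition.
ALTER INDEX cci_FactSales ON dbo.FactSales
REORGANIZE PARTITION = 3;
```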

A row group is “eligible” to be merged if it meets any of the following conditions:

  • The row group is compressed
  • The row group has 10% or more rows deleted
  • The row group was NOT trimmed due to a full dictionary

Here are a couple of examples of the merge policy at work:

a. Single Rowgroup (Self) -merge for deleted rows (reclaiming of space): In the simplest case, you don’t even need multiple row groups to be involved in the merge. For example, if a single row group has more than 10% of its rows logically deleted, then it qualifies for a “self-merge” (space is reclaimed by removing the logically deleted rows from the single row group). The table below shows two examples of single row groups: one that is eligible for self-merge and one that is not.

| Row group size | Deleted rows | Self-merge |
|---|---|---|
| 400,000 | 120,000 | Yes, deleted rows > 10% |
| 150,000 | 5,000 | No |

 

b. Merge across multiple row groups: In the example below, only two row groups are being depicted, but more than 2 row groups can be candidates to be merged as will be shown later in this article.

 

| Row group 1 row count | Row group 2 row count | Eligible for merge? |
|---|---|---|
| 950,000 | 920,000 | No, because row group 1 combined with row group 2 exceeds the row maximum per row group |
| 400,000 | 500,000 | Yes, because row group 1 combined with row group 2 results in 900,000 rows (under the 1,048,576 row maximum) |
| 1,000,000 (200,000 deleted) | 500,000 (300,000 deleted) | Yes, subtracting out the deleted rows, the combined row count across row groups is under the row maximum |
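
To get a rough picture of which row groups might qualify, the query below applies the rules from the examples above to the physical stats DMV. It is only an approximation of the merge policy under the conditions stated in this post, not the engine's actual algorithm: it flags row groups with 10% or more deleted rows and reports the live row count that matters when row groups are combined.

```sql
-- Approximate view of merge candidates among compressed row groups.
SELECT
    OBJECT_NAME(object_id)      AS table_name,
    index_id,
    partition_number,
    row_group_id,
    total_rows,
    deleted_rows,
    total_rows - deleted_rows   AS live_rows,
    CAST(100.0 * deleted_rows / NULLIF(total_rows, 0) AS DECIMAL(5, 2)) AS deleted_pct,
    CASE
        WHEN deleted_rows * 10 >= total_rows THEN 'self-merge candidate'
        WHEN total_rows - deleted_rows < 1048576 THEN 'may merge with other small row groups'
        ELSE 'full'
    END AS merge_hint
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE state_desc = 'COMPRESSED'
  AND trim_reason_desc <> 'DICTIONARY_SIZE'   -- full-dictionary row groups are not eligible
ORDER BY table_name, index_id, partition_number, live_rows;
```

For the last row of the table above, the live rows are 800,000 and 200,000, which add up to 1,000,000 and therefore fit under the 1,048,576 row maximum.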

 

Tracking Merge Activity:

You can track merge activity and row group qualification using Extended Events such as columnstore_rowgroup_merge_start and columnstore_rowgroup_merge_complete, as below:

CREATE EVENT SESSION [TupleMover] ON SERVER
ADD EVENT sqlserver.columnstore_no_rowgroup_qualified_for_merge,
ADD EVENT sqlserver.columnstore_rowgroup_compressed,
ADD EVENT sqlserver.columnstore_rowgroup_merge_complete,
ADD EVENT sqlserver.columnstore_rowgroup_merge_start
ADD TARGET package0.event_file(SET filename=N'XeMerge',max_file_size=(10))
GO
-- Start the session created above (the name must match CREATE EVENT SESSION)
ALTER EVENT SESSION [TupleMover] ON SERVER STATE = START
 

 


Now consider the following INDEX REORGANIZE scenario against an existing Columnstore index:

-- Invoke the REORG command to fire Merge and remove deleted rows
-- and coalesce the smaller rowgroups into a larger rowgroup.
alter index cci_temp on FactResellerSalesXL_CCI_temp reorganize

Looking at the output of the extended events, notice that across all the merged row groups there were 1,000,000 rows and 501,786 deleted rows.

[Image: Extended Events output for the merge operation]

Once the merge operation is done, the output of sys.dm_db_column_store_row_group_physical_stats shows one compressed rowgroup that contains the rows in all the prior merged row groups. The merged row groups will have a state description of TOMBSTONE momentarily until they are cleaned up.

[Image: sys.dm_db_column_store_row_group_physical_stats output after the merge]

In addition to the new sys.dm_db_column_store_row_group_physical_stats DMV, an operational stats DMV sys.dm_db_column_store_row_group_operational_stats has also been added to SQL Server 2016 which gives us visibility into the frequency of scans and lock contention for partitions and row groups (similar to sys.dm_db_index_operational_stats DMV that exists for regular indexes):

[Image: sys.dm_db_column_store_row_group_operational_stats output]

In short, the new Merge functionality in INDEX REORGANIZE simplifies Columnstore index maintenance significantly, and the new DMVs add visibility into the internals of columnar indexes. Several other supportability improvements on the Columnstore front are not covered in this post, including additional Extended Events and Perfmon counters that enable better troubleshooting of both clustered Columnstore and updatable non-clustered Columnstore indexes.

Denzil Ribeiro ( @denzilribeiro )

 

 

15 Feb 10:15

Data Analysis and Analytics software – what to choose? Infographics and Information

by SQLMaster



Data analysis is a key process: it systematically applies relevant statistical methods and techniques to logical and physical data models, which is essential for evaluating data (or getting near the truth of the data).

That is the traditional analysis method followed in relational and data warehouse platforms. More recently, data analytics has emerged as a new concept (a science) that has opened up enormous opportunities for users to analyse, draw conclusions from and visualise information about a relevant data set. Analytics helps organisations make better business decisions and add data science capabilities to the existing models and theory within their data platform. In my opinion, technology isn’t a barrier to sustaining this kind of new trend.

 

By creating a data science practice, an organisation can build on data collection methods, ingestion processes and end results for entire data models that aren’t specific to a particular domain. Big Data is another trend that is catching up (at a fast pace in a few parts of the world), building data streams and collating multiple data sources for better insights with a variety of software tools and machine learning methods. So there is a big question about which software or programming method is right for this kind of job, without going into a biased technology or vendor discussion for a data analysis or analytics project.

In this cyber age, information is available at your fingertips (a few clicks on your device). For specific groups of users, such as data stewards, report authors and data scientists, a wide range of software is available. To name a few:

I have listed R at the top (which reflects its standing as well) because it puts the emphasis on data insights, statistics and data visualisations that can solve real-world problems. No doubt it is a favourite of professionals such as data scientists and statisticians.

It is not just R: Python also lives up to expectations in this kind of scenario, with a few differences in how one should evaluate data analysis and analytics software for specific data science needs.

To gain a strong foothold in the data analytics world, Microsoft has opened up enormous opportunities for data science in SQL Server 2016 by adding R integration into the database. In a few words:

SQL Server 2016 expands its scope beyond transaction processing, data warehousing and business intelligence to deliver advanced analytics as an additional workload in SQL Server with proven technology from Revolution Analytics.  Not just that, building PolyBase into SQL Server, expanding the power to extract value from unstructured and structured data using your existing T-SQL skills. With this wave, you can then gain faster insights through rich visualizations on many devices including mobile applications on Windows, iOS and Android.

Recently I found a fantastic comparison between R and Python, see below (source):

As a closing note, I would encourage you to look at what is available from Microsoft as well:

A few links as well:

SQL Server R Services

What’s New in SQL Server 2016, December Update

What’s New in Reporting Services

Interactive Data Visualization BI Tools

Happy knowledge-sharing!


15 Feb 10:14

On the addition of useless where clauses

by Gail

I remember a forum thread from a while back. The question was on how to get rid of the index scan that was in the query plan. Now that’s a poor question in the first place, as the scan might not be a problem, but it’s the first answer that really caught my attention.

Since the primary key is on an identity column, you can add a clause like ID > 0 to the query, then SQL will use an index seek.

Technically that’s correct. If the table has an identity column with the default properties (We’ll call it ID) and the clustered index is on that identity column, then a WHERE clause of the form WHERE ID > 0 AND <any other predicates on that table> can indeed execute with a clustered index seek (although it’s in no way guaranteed to do so). But is it a useful thing to do?

Time for a made up table and a test query.

CREATE TABLE dbo.Orders(
  OrderID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
  OrderDate DATETIME2(7) NULL,
  ShipmentRef CHAR(10) NULL,
  ShipmentDate DATE NULL,
  Status VARCHAR(20) NOT NULL
);

That’ll do the job. And then a few hundred thousand rows via SQL Data Generator and we’re good to go.

And for a query that has a nasty index scan, how about

SELECT  OrderDate,
    ShipmentDate,
    Status
  FROM    dbo.Orders
  WHERE   LTRIM(RTRIM(Status)) = 'Delivered';

[Image: execution plan showing a Clustered Index Scan]

Now, that’s running as a clustered index scan because the predicate’s not SARGable and besides, there’s no index on that column, but let’s pretend we don’t know that.

If I add a WHERE clause predicate that filters no row out, can I get a query plan with an index seek?

SELECT  OrderDate,
    ShipmentDate,
    Status
  FROM    dbo.Orders
  WHERE   LTRIM(RTRIM(Status)) = 'Delivered'
    AND OrderID > 0;

Why yes, I can.

[Image: execution plan showing a Clustered Index Seek]

Op Success? Well…

The goal of performance tuning is to improve the performance of a query, not to change operators in a query plan. The plan is a tool, not a goal.

Have we, by adding a WHERE clause predicate that filters out no rows, improved performance of the query? This needs an extended events session to answer. Nothing fancy, just a sql_statement_completed event will do the trick.
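
A minimal session of that kind might look like the sketch below. The session name, the session_id filter value and the choice of a ring buffer target are assumptions for illustration, not the exact session used for the measurements in this post.

```sql
-- Capture completed statements for one test session.
CREATE EVENT SESSION [QueryTiming] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(
    ACTION (sqlserver.sql_text)
    WHERE (sqlserver.session_id = 55)   -- hypothetical: the SPID running the test queries
)
ADD TARGET package0.ring_buffer;
GO

ALTER EVENT SESSION [QueryTiming] ON SERVER STATE = START;
GO

-- ... run the two test queries from the filtered session ...

ALTER EVENT SESSION [QueryTiming] ON SERVER STATE = STOP;
DROP EVENT SESSION [QueryTiming] ON SERVER;
GO
```

The event carries cpu_time and duration fields, which are the two values compared here.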

I ran each query 10 times, copied the captured events into Excel and averaged them:

Query with just the LTRIM(RTRIM(Status)) = 'Delivered'
CPU: 77ms
Duration: 543ms

Query with LTRIM(RTRIM(Status)) = 'Delivered' AND OrderID > 0
CPU: 80ms
Duration: 550ms

We haven’t tuned that query. I won’t say we’ve made it slower either, the differences are well within the error range on our measuring, but there’s definitely no meaningful performance gain.

There’s no gain because we haven’t changed how the query executes. A scan, and in this case it will be a scan of the entire index, will likely use the non-leaf levels of the b-tree to locate the logical first page of the leaf level, then will read the entire leaf level. The seek we managed to generate will use the b-tree to find the value 0 in the clustered index key, that’s what makes it a seek. Since the column is an identity starting at 1, that means the first row read will be on the logical first page of the leaf level, then it will read the entire leaf level.

Both will do the same amount of work, and so we haven’t done anything useful to the query by adding a WHERE clause that filters out no rows.

Scans are not always bad. If a query needs to read every row of a table, that’s a scan and effort shouldn’t be expended trying to make it an index seek.

To improve the performance of a query, we need to make changes that reduce the work needed to run the query. That often starts with reducing the amount of data that the query reads, by changing the query so that it can use indexes effectively and/or adding indexes to support the query. Not by adding pointless pieces to a query just to change plan operators from ones that are believed to be bad to ones that are believed to be good. Doing that is just a waste of time and effort.
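
For the test table above, a change that does reduce work could look like the following sketch. It assumes the Status values carry no leading whitespace (trailing spaces are ignored by the equality comparison), so the functions can be dropped from the predicate; the index name is made up.

```sql
-- Support the predicate with an index that also covers the selected columns.
CREATE NONCLUSTERED INDEX IX_Orders_Status
    ON dbo.Orders (Status)
    INCLUDE (OrderDate, ShipmentDate);

-- SARGable version of the query: no function wrapped around the column,
-- so the optimizer can seek on IX_Orders_Status.
SELECT  OrderDate,
        ShipmentDate,
        Status
FROM    dbo.Orders
WHERE   Status = 'Delivered';
```

Whether this is worth it still has to be measured the same way as before, by comparing CPU and duration rather than plan operators.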

15 Feb 10:14

It’s time! Registration is now open for Microsoft Ignite

by SQL Server Team

It’s time — Register now for Microsoft Ignite and mark your calendar for September 26–30. Be there when top industry leaders talk about what’s next, get direct access to the people who built the products you use every day, and surround yourself with smart people talking tech.

Sharpen your ideas with the best in the business.

From high-level strategy to deep product insights, you’ll hear about new technologies and explore real-world solutions to today’s challenges. Customize your schedule to go deep on the topics that matter most to you and your organization including: cloud infrastructure and management, big data, analytics, productivity, communications, operating systems, mobile devices, and more.

  • 1,000+ hours of content, 700+ sessions, and a multitude of networking opportunities
  • Insights and roadmaps from industry leaders
  • Deep dives and live demos on the products you use every day
  • Direct access to Microsoft product experts
  • Hands-on learning in digital labs

Connect with techies like yourself.

After last year’s conference, an attendee wrote, “Networking at #MSIgnite is like coming home.” We make it easy to connect, with dozens of opportunities including networking lounges, mealtime mashups, and meetups. You’ll expand your contact list—and meet people who can challenge your thinking and inspire you to take your ideas further.

Join us. Register for Microsoft Ignite 2016.

15 Feb 10:14

Cores is Cores

by Jeremiah Peschka

CPU cores are all made the same, right? Hyper-Threading is just a fancy way of saying “Push the turbo button harder!”

Actually, Wikipedia informs me that I’m wrong and Hyper-Threading is a fancy (and trademarked) way of saying “You can do more than one thing on a core at the same time because computers are a pack of lies.”

I can assure you, these are not the same core dog.


The General Idea

Many people seem to operate under the assumption that any one core is as good as any other core and since two cores are better than one, why doesn’t my CPU with 4 physical cores run twice as fast when I turn on Hyper-Threading?

Because physics.

Hyper-Threading works by taking advantage of the idea that computers are usually off doing other things – waiting for storage, waiting for the network, waiting for RAM, waiting for you to click “Buy it now” on 10 pounds of socks. So while the CPU is waiting, it goes off and does something else, like sending your credit card numbers to hackers.

Hyper-Threading exists because computers spend most of their time doing nothing, so they might as well try to be productive.

What’s That Mean For Us?

For most workloads, Hyper-Threading is great. You’re usually waiting on storage, so you might as well go ahead and send those credit card numbers off elsewhere. For CPU intensive workloads, you have to use your brain a little bit and say “Wait a minute, if I can scale linearly to the number of physical cores, what happens when I’m pretending I have more cores than I really have?”

Since this is computers and not the global financial sector, circa 2007, you hit a performance cliff. When CPU is your bottleneck, faking it won’t make anything faster.

How can I say all of this so jovially? Because I broke my computer, that’s why.

Oh Crap, He Wrote Code!

That’s right, I wrote code. I wrote a program that I call The HyperThreader. It’s dumb as a brick – it counts from 1 to 10E8 and then computes the square root of that number. This is a CPU intensive workload, no disks were harmed. The program then does the same thing but across 6 workers (the number of cores I have) and then again across 12 workers (the number of logical cores I have).

You can see the raw results over on github.

Here’s what happens:

1 thread – average time of 1151.857ms
6 threads – average time of 1194.262ms

So far so good. Execution time isn’t really changing, each task is off wandering around on its own processor core. We can account for the 40ms difference between these two because I was playing Paula Abdul’s greatest hits in the background.

12 threads – average time of 1831.81ms

Since this isn’t twice as slow, I’m going to assume that I’m not using all of my CPUs on each task (something could probably be more efficient), but this leads me to my conclusion…

IT’S ALL FILTHY DIRTY LIES!

This is where people usually get tripped up. Execution gets around 53% slower when I start pretending I have resources available. Windows and the .NET Framework do their best to pretend that I have resources available. But the fact is that I don’t. I only have 6 cores, so the computer has to spend time switching between them. If resources were still available, the average execution time would be closer to what we saw with only 1 core.

If you’re wondering why your SQL Server In-Memory OLTP demo doesn’t scale beyond the number of physical cores, now you know: you can’t imagine performance out of nothing. That’s like saying “This 4 cylinder car can haul a family of 4, so to take the extended family out and about, I need a V12” and then rushing out to buy a supercar with only 2 seats.

Hat tip to Josh Bush and Dave Liebers for eyeballing the code to make sure it did what it claimed.


 

“Cute Clones” by Steve Jurvetson, licensed under CC BY 2.0.

20 Jan 23:50

JSON in SQL Server 2016

by SQLMaster

JSON is frequently used as a common data exchange format, including by web browsers. When we deal with formatted text in databases, it is essential to be able to process JSON text retrieved from other systems, load it into SQL Server database tables and return data as JSON text.

JSON is the newest addition to the SQL Server family, from the 2016 version onwards (which is at CTP 3.2 as of now).

Here is a logical overview of the JSON functions within SQL Server that enable you to analyze and query JSON data, transform it to relational format and export results as JSON text. (Source: Microsoft)
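
As a small taste of those three capabilities, the sketch below queries a JSON value, shreds a JSON array into rows and exports a rowset as JSON text. The JSON document and column names are invented for illustration, and OPENJSON assumes a database at compatibility level 130.

```sql
DECLARE @json NVARCHAR(MAX) = N'{
  "order": { "id": 42, "status": "Delivered" },
  "items": [ { "sku": "A1", "qty": 2 }, { "sku": "B7", "qty": 1 } ]
}';

-- Analyze/query: validate the text and pull out a scalar value.
SELECT ISJSON(@json)                        AS is_valid_json,
       JSON_VALUE(@json, '$.order.status')  AS order_status;

-- Transform to relational format: one row per array element.
SELECT sku, qty
FROM OPENJSON(@json, '$.items')
     WITH (sku VARCHAR(10) '$.sku',
           qty INT         '$.qty');

-- Export results as JSON text.
SELECT sku, qty
FROM (VALUES ('A1', 2), ('B7', 1)) AS t(sku, qty)
FOR JSON PATH;
```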

 

To learn more, I recommend going through the blog series posted by Jovan Popovic (MSFT). See below:


20 Jan 23:50

4 Use Cases for NoSQL Databases

by A.R. Guess

by Angela Guess Jim Scott recently put together a list of four “game-changing” use cases for NoSQL databases in Smart Data Collective. He begins with financial services: “When you picture a trading firm or hedge fund, you might see a Gordon Gecko clone, clutching multiple phones, hunched over a Bloomberg terminal, barking orders to buy […]

The post 4 Use Cases for NoSQL Databases appeared first on DATAVERSITY.

20 Jan 23:50

Data security, SQL Server 2016, and your business

by David Hobbs-Mallyon

Security is unquestionably a major priority for Microsoft. A recent news story reported that the company “is spending $1 billion a year to make Microsoft products more secure.” The Microsoft data platform, including SQL Server and Azure SQL Database, is at the top of the list of products investing in security. But be aware that a commitment to data security is actually nothing new. SQL Server has long been recognized for its outstanding security record: according to the National Institute of Standards and Technology (NIST) [1] public security board, for the past six years SQL Server has had the fewest security vulnerabilities when compared with the major database vendors. In addition, SQL Server has been deemed “the most secure database” by the Information Technology Industry Council (ITIC). Despite this excellent security record, Microsoft is not content to rest on its laurels and is continuing to invest in security, providing customers with new and improved tools to secure data and applications.

From an IT infrastructure and compliance perspective, the importance of protecting data is clear. Witness the fact that security has been identified as one of the “Eight emerging data center trends to follow in 2016.” But data protection also has profound business implications and can even be a competitive differentiator by helping drive customer loyalty and retention, create opportunities for premium offers and new sources of revenue, and protect future revenue streams, according to Forrester Research [2]. To help deal with the complexity and scope of data security, and diminish risks to your business, Microsoft provides an across-the-board, in-depth security approach that includes application security, network security, and database security.

Data Security and SQL Server

Playing into this overall approach, SQL Server 2016 and Azure SQL Database include advanced, layered security functionality to help protect data itself as well as access to that data, and then provide monitoring capabilities. Data security features include (but are not limited to) the following:

  • Always Encrypted enables encryption inside client applications without revealing encryption keys to SQL Server. It allows changes to encrypted data without the need to decrypt it first.
  • Transparent Data Encryption (TDE) protects data at rest by encrypting all the user data in data files. TDE prevents users from attaching or restoring a database to another server as a way to access the data.
  • Support for Transport Layer Security (TLS), which has now been updated to version 1.2, protects data in transit and offers protection from such tactics as man-in-the-middle attacks.
  • Dynamic Data Masking (DDM) and Row-Level Security (RLS) help developers build applications that require restricted direct access to certain data as a means of preventing users from seeing specific information.
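
To give a feel for how little is involved in turning on one of these features, here is a minimal sketch of enabling Transparent Data Encryption. The database name, certificate name and password placeholder are hypothetical, and a real deployment would also back up the certificate and its private key before relying on it.

```sql
USE master;
GO
-- One-time setup in master: a master key and a certificate to protect the database key.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<use a strong password here>';
CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate for SalesDb';
GO

USE SalesDb;   -- hypothetical user database
GO
CREATE DATABASE ENCRYPTION KEY
    WITH ALGORITHM = AES_256
    ENCRYPTION BY SERVER CERTIFICATE TdeCert;
GO

ALTER DATABASE SalesDb SET ENCRYPTION ON;
GO
```

Without the certificate, the database files and backups cannot be attached or restored on another server, which is the protection described for TDE above.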

This layered approach to data security and Microsoft’s overall commitment to advancing security and privacy protection address important considerations for business today. Upcoming blogs will go into deep technical detail on these security capabilities, but examining a business scenario can help illuminate the business benefits that data security can help ensure.

Business implications

Data has become not only a business asset, but it is now also a competitive differentiator: A company that can ensure that customer and business data are secured has a competitive edge over a company that does not make data security a priority. This means that for business and technical decision-makers to enable their businesses to compete effectively, they need a data platform with built-in security features and they need a strategy that takes advantage of the built-in security capabilities.

The business implications of data security range from speeding up customer service, to impacting the bottom line, to protecting shareholder value. Underscoring the potential bottom-line concerns of financial executives, a recent survey found that 66 percent of CFOs consider security to be a high or very high priority. Even at the end-user level, the potential business impact of exposing sensitive data is recognized: Another recent survey discloses that “71 percent of end users say that they have access to company data they should not be able to see.”

How can Microsoft’s data security capabilities ease such concerns? Consider just one example showing how Dynamic Data Masking, as a part of your data security program, can help you address the point raised by those end users who admitted they had access to data (such as Social Security Numbers or health details) that they shouldn’t be able to view. For example, suppose you have a call center where representatives deal with customer billing questions. When a customer record comes up, the representative needs to see certain information to answer questions. But some customer information, such as specific personal health details, needs to remain confidential for HIPAA compliance. With Dynamic Data Masking, IT administrators can take simple steps to define policies, or rules, to mask any personally identifiable information that is not needed for the customer interaction. This way, the representative can view a customer record without having access to confidential information. Customer information is secured, but at the same time, customer service is able to answer questions by accessing appropriate data without compromising privacy.
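
The call center scenario could be sketched roughly as follows in SQL Server 2016. The table, its columns, the sample row, and the CallCenterRep user are all hypothetical, chosen only to mirror the example above.

```sql
-- Hypothetical billing table; names and data are illustrative only.
CREATE TABLE dbo.CustomerBilling
(
  CustomerID   int           NOT NULL PRIMARY KEY,
  FullName     nvarchar(100) NOT NULL,
  -- Show only the last four characters of the SSN.
  SSN          char(11)      MASKED WITH (FUNCTION = 'partial(0,"XXX-XX-",4)') NOT NULL,
  -- Fully mask free-text health details.
  HealthNotes  nvarchar(400) MASKED WITH (FUNCTION = 'default()') NULL,
  BalanceDue   money         NOT NULL
);

INSERT dbo.CustomerBilling (CustomerID, FullName, SSN, HealthNotes, BalanceDue)
VALUES (1, N'Pat Example', '123-45-6789', N'(confidential details)', 42.50);

-- A representative with SELECT permission but without UNMASK.
CREATE USER CallCenterRep WITHOUT LOGIN;
GRANT SELECT ON dbo.CustomerBilling TO CallCenterRep;

-- Run the query as the representative: masked columns come back masked.
EXECUTE AS USER = 'CallCenterRep';
SELECT CustomerID, FullName, SSN, HealthNotes, BalanceDue
FROM dbo.CustomerBilling;
REVERT;
```

In this sketch the representative sees the name and balance in full, while the SSN is reduced to its last four digits and the health notes are replaced by the default mask. Granting the UNMASK permission to a user would reveal the underlying values to that user.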

Commitment to security built-in

As the article cited above emphasizes, Microsoft is spending $1 billion per year to ensure that its products are secured so that businesses are protected. SQL Server and Azure SQL Database are continuously building in state-of-the-industry security technologies as part of this ongoing commitment to security. For business, this means you don’t have to pay extra to give IT staff security tools that are easy to deploy and maintain; those tools are built into Microsoft’s data platform. At the same time, businesses can build data security infrastructure that supports customers and provides a competitive edge. To learn more about Microsoft’s data security approach, see the Security Center for SQL Server Database Engine and Azure SQL Database and the SQL Security Blog.

See the other posts in the SQL Server 2016 blogging series

1. National Institute of Standards and Technology Comprehensive Vulnerability Database update 10/2015

2. The Future of Data Security And Privacy: Growth And Competitive Differentiation Vision: The Data Security And Privacy Playbook, John Kindervag, Heidi Shey, and Kelley Mak, Forrester, July 10, 2015

20 Jan 22:35

Pros and Cons: Warehouse Vs. Data Lakes

by Thomas Hazel

This column will not be the proverbial “Pros and Cons” article, weighing the good with the bad. One can find such content habitually year after year and month after month, all of which will outline the obvious advantages and disadvantages between any two things. This is particularly prevalent in the […]

The post Pros and Cons: Warehouse Vs. Data Lakes appeared first on DATAVERSITY.

20 Jan 22:35

Can ColumnStore Help Pagination Workloads?

by Aaron Bertrand

Almost a year ago to the day, I posted my solution to pagination in SQL Server, which involved using a CTE to locate just the key values for the set of rows in question, and then joining back from the CTE to the source table to retrieve the other columns for just that "page" of rows. This proved most beneficial when there was a narrow index that supported the ordering requested by the user, or when the ordering was based on the clustering key, but even performed a little better without an index to support the required sort.

Since then, I've wondered if ColumnStore indexes (both clustered and non-clustered) might help any of these scenarios. TL;DR: Based on this experiment in isolation, the answer to the title of this post is a resounding NO. If you don't want to see the test setup, code, execution plans, or graphs, feel free to skip to my summary, keeping in mind that my analysis is based on a very specific use case.

Setup

On a new VM with SQL Server 2016 CTP 3.2 (13.0.900.73) installed, I ran through roughly the same setup as before, only this time with three tables. First, a traditional table with a narrow clustering key and multiple supporting indexes:

CREATE TABLE [dbo].[Customers]
(
	[CustomerID] [int] NOT NULL,
	[FirstName] [nvarchar](64) NOT NULL,
	[LastName] [nvarchar](64) NOT NULL,
	[EMail] [nvarchar](320) NOT NULL UNIQUE,
	[Active] [bit] NOT NULL DEFAULT 1,
	[Created] [datetime] NOT NULL DEFAULT SYSDATETIME(),
	[Updated] [datetime] NULL,
  CONSTRAINT [PK_Customers] PRIMARY KEY CLUSTERED ([CustomerID])
);
 
CREATE NONCLUSTERED INDEX [Active_Customers] 
ON [dbo].[Customers]([FirstName],[LastName],[EMail])
WHERE ([Active]=1);
 
-- to support "PhoneBook" sorting (order by Last,First)
CREATE NONCLUSTERED INDEX [PhoneBook_Customers] 
ON [dbo].[Customers]([LastName],[FirstName])
INCLUDE ([EMail]);

Next, a table with a clustered ColumnStore index:

CREATE TABLE [dbo].[Customers_CCI]
(
	[CustomerID] [int] NOT NULL,
	[FirstName] [nvarchar](64) NOT NULL,
	[LastName] [nvarchar](64) NOT NULL,
	[EMail] [nvarchar](320) NOT NULL UNIQUE,
	[Active] [bit] NOT NULL DEFAULT 1,
	[Created] [datetime] NOT NULL DEFAULT SYSDATETIME(),
	[Updated] [datetime] NULL,
  CONSTRAINT [PK_CustomersCCI] PRIMARY KEY NONCLUSTERED ([CustomerID])
);
 
CREATE CLUSTERED COLUMNSTORE INDEX [Customers_CCI] 
ON [dbo].[Customers_CCI];

And finally, a table with a non-clustered ColumnStore index covering all of the columns:

CREATE TABLE [dbo].[Customers_NCCI]
(
	[CustomerID] [int] NOT NULL,
	[FirstName] [nvarchar](64) NOT NULL,
	[LastName] [nvarchar](64) NOT NULL,
	[EMail] [nvarchar](320) NOT NULL UNIQUE,
	[Active] [bit] NOT NULL DEFAULT 1,
	[Created] [datetime] NOT NULL DEFAULT SYSDATETIME(),
	[Updated] [datetime] NULL,
  CONSTRAINT [PK_CustomersNCCI] PRIMARY KEY CLUSTERED 
  ([CustomerID])
);
 
CREATE NONCLUSTERED COLUMNSTORE INDEX [Customers_NCCI] 
ON [dbo].[Customers_NCCI]
(
	[CustomerID],
	[FirstName],
	[LastName],
	[EMail],
	[Active],
	[Created],
	[Updated]
);

Notice that for both tables with ColumnStore indexes, I left out the index that would support quicker seeks on the "PhoneBook" sort (last name, first name).

Test Data

I then populated the first table with 1,000,000 random rows, based on a script I've re-used from previous posts:

INSERT dbo.Customers WITH (TABLOCKX) 
  (CustomerID, FirstName, LastName, EMail, [Active])
SELECT rn = ROW_NUMBER() OVER (ORDER BY n), fn, ln, em, a
FROM 
(
  SELECT TOP (1000000) fn, ln, em, a = MAX(a), n = MAX(NEWID())
  FROM
  (
    SELECT fn, ln, em, a, r = ROW_NUMBER() OVER (PARTITION BY em ORDER BY em)
    FROM
    (
      SELECT TOP (2000000)
        fn = LEFT(o.name, 64), 
        ln = LEFT(c.name, 64), 
        em = LEFT(o.name, LEN(c.name)%5+1) + '.' 
             + LEFT(c.name, LEN(o.name)%5+2) + '@' 
             + RIGHT(c.name, LEN(o.name+c.name)%12 + 1) 
             + LEFT(RTRIM(CHECKSUM(NEWID())),3) + '.com', 
        a  = CASE WHEN c.name LIKE '%y%' THEN 0 ELSE 1 END
      FROM sys.all_objects AS o CROSS JOIN sys.all_columns AS c 
      ORDER BY NEWID()
    ) AS x
  ) AS y WHERE r = 1 
  GROUP BY fn, ln, em 
  ORDER BY n
) AS z 
ORDER BY rn;

Then I used that table to populate the other two with exactly the same data, and rebuilt all of the indexes:

INSERT dbo.Customers_CCI WITH (TABLOCKX)
  (CustomerID, FirstName, LastName, EMail, [Active])
SELECT CustomerID, FirstName, LastName, EMail, [Active]
FROM dbo.Customers;
 
INSERT dbo.Customers_NCCI WITH (TABLOCKX)
  (CustomerID, FirstName, LastName, EMail, [Active])
SELECT CustomerID, FirstName, LastName, EMail, [Active]
FROM dbo.Customers;
 
ALTER INDEX ALL ON dbo.Customers      REBUILD;
ALTER INDEX ALL ON dbo.Customers_CCI  REBUILD;
ALTER INDEX ALL ON dbo.Customers_NCCI REBUILD;

The total size of each table:

Table            Reserved     Data         Index
Customers        463,200 KB   154,344 KB   308,576 KB
Customers_CCI    117,280 KB   30,288 KB    86,536 KB
Customers_NCCI   349,480 KB   154,344 KB   194,976 KB
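
The post does not show how these sizes were collected. One common way to get reserved, data, and index sizes per table, offered here only as a sketch, is the sp_spaceused system procedure:

```sql
-- Returns name, rows, reserved, data, index_size, and unused for each table.
EXEC sys.sp_spaceused @objname = N'dbo.Customers';
EXEC sys.sp_spaceused @objname = N'dbo.Customers_CCI';
EXEC sys.sp_spaceused @objname = N'dbo.Customers_NCCI';
```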

 
And the row count / page count of the relevant indexes (the unique index on e-mail was there more for me to babysit my own data generation script than anything else):

Table            Index                 Rows        Pages
Customers        PK_Customers          1,000,000   19,377
Customers        PhoneBook_Customers   1,000,000   17,209
Customers        Active_Customers      808,012     13,977
Customers_CCI    PK_CustomersCCI       1,000,000   2,737
Customers_CCI    Customers_CCI         1,000,000   3,826
Customers_NCCI   PK_CustomersNCCI      1,000,000   19,377
Customers_NCCI   Customers_NCCI        1,000,000   16,971
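
Row and page counts per index can be pulled from the catalog views. The query below is a sketch of one way to do that and is not necessarily the query behind the numbers above; in particular, it reports used_page_count, and a different page counter may have been used for the figures in the post.

```sql
-- Rows and used pages per index for the three test tables.
SELECT [Table] = OBJECT_NAME(i.[object_id]),
       [Index] = i.name,
       [Rows]  = SUM(ps.row_count),
       [Pages] = SUM(ps.used_page_count)
FROM sys.indexes AS i
INNER JOIN sys.dm_db_partition_stats AS ps
   ON ps.[object_id] = i.[object_id]
  AND ps.index_id    = i.index_id
WHERE i.[object_id] IN
(
  OBJECT_ID(N'dbo.Customers'),
  OBJECT_ID(N'dbo.Customers_CCI'),
  OBJECT_ID(N'dbo.Customers_NCCI')
)
GROUP BY i.[object_id], i.name
ORDER BY [Table], [Index];
```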

 

Procedures

Then, in order to see if the ColumnStore indexes would swoop in and make any of the scenarios better, I ran the same set of queries as before, but now against all three tables. I got at least a little bit smarter and made two stored procedures with dynamic SQL to accept the table source and sort order. (I am well aware of SQL injection; this isn't what I would do in production if these strings were coming from an end user, so please don't take it as a recommendation to do so. I trust myself just enough in my enclosed environment that it's not a concern for these tests.)

CREATE PROCEDURE dbo.P_Old
  @PageNumber  INT = 1,
  @PageSize    INT = 100,
  @Table       SYSNAME,
  @Sort        VARCHAR(32)
AS
BEGIN
  SET NOCOUNT ON;
 
  DECLARE @sql NVARCHAR(MAX) = N'
 
  SELECT CustomerID, FirstName, LastName,
      EMail, Active, Created, Updated
    FROM dbo.' + QUOTENAME(@Table) + N'
    ORDER BY ' + CASE @Sort 
	  WHEN 'Key'         THEN N'CustomerID'
	  WHEN 'PhoneBook'   THEN N'LastName, FirstName'
	  WHEN 'Unsupported' THEN N'FirstName DESC, EMail'
	END
	+ N'
    OFFSET @PageSize * (@PageNumber - 1) ROWS
    FETCH NEXT @PageSize ROWS ONLY OPTION (RECOMPILE);';
 
  EXEC sys.sp_executesql @sql, N'@PageSize INT, @PageNumber INT', @PageSize, @PageNumber;
END
GO
 
CREATE PROCEDURE dbo.P_CTE
  @PageNumber  INT = 1,
  @PageSize    INT = 100,
  @Table       SYSNAME,
  @Sort        VARCHAR(32)
AS
BEGIN
  SET NOCOUNT ON;
 
  DECLARE @sql NVARCHAR(MAX) = N'
 
  ;WITH pg AS
  (
    SELECT CustomerID
      FROM dbo.' + QUOTENAME(@Table) + N'
      ORDER BY ' + CASE @Sort 
	  WHEN 'Key'         THEN N'CustomerID'
	  WHEN 'PhoneBook'   THEN N'LastName, FirstName'
	  WHEN 'Unsupported' THEN N'FirstName DESC, EMail'
	END
	+ N' OFFSET @PageSize * (@PageNumber - 1) ROWS
      FETCH NEXT @PageSize ROWS ONLY
  )
  SELECT c.CustomerID, c.FirstName, c.LastName,
      c.EMail, c.Active, c.Created, c.Updated
  FROM dbo.' + QUOTENAME(@Table) + N' AS c
  WHERE EXISTS (SELECT 1 FROM pg WHERE pg.CustomerID = c.CustomerID)
  ORDER BY ' + CASE @Sort 
	  WHEN 'Key'         THEN N'CustomerID'
	  WHEN 'PhoneBook'   THEN N'LastName, FirstName'
	  WHEN 'Unsupported' THEN N'FirstName DESC, EMail'
	END
	+ N' OPTION (RECOMPILE);';
 
  EXEC sys.sp_executesql @sql, N'@PageSize INT, @PageNumber INT', @PageSize, @PageNumber;
END
GO

Then I whipped up some more dynamic SQL to generate all the combinations of calls I would need to make in order to call both the old and new stored procedures, in all three of the desired sort orders, and at different page numbers (to simulate needing a page near the beginning, middle, and end of the sort order). So that I could copy PRINT output and paste it into SQL Sentry Plan Explorer in order to get runtime metrics, I ran this batch twice, once with the procedures CTE returning P_Old, and then again with it returning P_CTE.

DECLARE @sql NVARCHAR(MAX) = N'';
 
;WITH [tables](name) AS 
(
  SELECT N'Customers' UNION ALL SELECT N'Customers_CCI' 
  UNION ALL SELECT N'Customers_NCCI'
),
sorts(sort) AS
(
  SELECT 'Key' UNION ALL SELECT 'PhoneBook' UNION ALL SELECT 'Unsupported'
),
pages(pagenumber) AS
(
  SELECT 1 UNION ALL SELECT 500 UNION ALL SELECT 5000 UNION ALL SELECT 9999
),
procedures(name) AS
(
  SELECT N'P_CTE' -- N'P_Old'
)
SELECT @sql += N'
  EXEC dbo.' + p.name
  + N' @Table = N' + CHAR(39) + t.name
  + CHAR(39) + N', @Sort = N' + CHAR(39)
  + s.sort + CHAR(39) + N', @PageNumber = ' 
  + CONVERT(NVARCHAR(11), pg.pagenumber) + N';'
FROM tables AS t
  CROSS JOIN sorts AS s
  CROSS JOIN pages AS pg
  CROSS JOIN procedures AS p
  ORDER BY t.name, s.sort, pg.pagenumber;
 
PRINT @sql;

This produced output like this (36 calls altogether for the old method (P_Old), and 36 calls for the new method (P_CTE)):

  EXEC dbo.P_CTE @Table = N'Customers', @Sort = N'Key', @PageNumber = 1;
  EXEC dbo.P_CTE @Table = N'Customers', @Sort = N'Key', @PageNumber = 500;
  EXEC dbo.P_CTE @Table = N'Customers', @Sort = N'Key', @PageNumber = 5000;
  EXEC dbo.P_CTE @Table = N'Customers', @Sort = N'Key', @PageNumber = 9999;
  EXEC dbo.P_CTE @Table = N'Customers', @Sort = N'PhoneBook', @PageNumber = 1;
  ...
  EXEC dbo.P_CTE @Table = N'Customers', @Sort = N'PhoneBook', @PageNumber = 9999;
  EXEC dbo.P_CTE @Table = N'Customers', @Sort = N'Unsupported', @PageNumber = 1;
  ...
  EXEC dbo.P_CTE @Table = N'Customers', @Sort = N'Unsupported', @PageNumber = 9999;
  EXEC dbo.P_CTE @Table = N'Customers_CCI', @Sort = N'Key', @PageNumber = 1;
  ...
  EXEC dbo.P_CTE @Table = N'Customers_CCI', @Sort = N'Unsupported', @PageNumber = 9999;
  EXEC dbo.P_CTE @Table = N'Customers_NCCI', @Sort = N'Key', @PageNumber = 1;
  ...
  EXEC dbo.P_CTE @Table = N'Customers_NCCI', @Sort = N'Unsupported', @PageNumber = 9999;

I know, this is all very cumbersome; we're getting to the punchline soon, I promise.

Results

I took those two sets of 36 statements and started two new sessions in Plan Explorer, running each set multiple times to ensure we were getting data from a warm cache and taking averages (I could compare cold and warm cache too, but I think there are enough variables here).

I can tell you right off the bat a couple of simple facts without even showing you supporting graphs or plans:

  • In no scenario did the "old" method beat the new CTE method I promoted in my previous post, no matter what type of indexes were present. So that makes it easy to virtually ignore half of the results, at least in terms of duration (which is the one metric end users care about most).
  • No ColumnStore index fared well when paging toward the end of the result – they only provided benefits toward the beginning, and only in a couple of cases.
  • When sorting by the primary key (clustered or not), the presence of ColumnStore indexes did not help – again, in terms of duration.

With those summaries out of the way, let's take a look at a few cross-sections of the duration data. First, the results of the query ordered by first name descending, then e-mail, with no hope of using an existing index for sorting. As you can see in the chart, performance was inconsistent – at lower page numbers, the non-clustered ColumnStore did best; at higher page numbers, the traditional index always won:

Unsupported Sort Order: Duration (milliseconds) for different page numbers and different index types

And then the three plans representing the three different types of indexes (with grayscale added by Photoshop in order to highlight the major differences between the plans):

Plan for traditional index

Plan for clustered ColumnStore index

Plan for non-clustered ColumnStore index

A scenario I was more interested in, even before I started testing, was the phone book sorting approach (last name, first name). In this case the ColumnStore indexes were actually quite detrimental to the performance of the result:

Duration (milliseconds) for phone book sorting

The ColumnStore plans here are near mirror images to the two ColumnStore plans shown above for the unsupported sort. The reason is the same in both cases: expensive scans or sorts due to a lack of a sort-supporting index.

So next, I created supporting "PhoneBook" indexes on the tables with the ColumnStore indexes as well, to see if I could coax a different plan and/or faster execution times in any of those scenarios. I created these two indexes, then rebuilt again:

CREATE NONCLUSTERED INDEX [PhoneBook_CustomersCCI] 
ON [dbo].[Customers_CCI]([LastName],[FirstName])
INCLUDE ([EMail]);
 
ALTER INDEX ALL ON dbo.Customers_CCI  REBUILD;
 
CREATE NONCLUSTERED INDEX [PhoneBook_CustomersNCCI] 
ON [dbo].[Customers_NCCI]([LastName],[FirstName])
INCLUDE ([EMail]);
 
ALTER INDEX ALL ON dbo.Customers_NCCI REBUILD;

Here were the new durations:

Duration (milliseconds) against three index types

Most interesting here is that now the paging query against the table with the non-clustered ColumnStore index seems to be keeping pace with the traditional index, up until we get beyond the middle of the table. Looking at the plans, we can see that at page 5,000, a traditional index scan is used, and the ColumnStore index is completely ignored:

Phone Book plan ignoring the non-clustered ColumnStore index

But somewhere between the mid-point of 5,000 pages and the "end" of the table at 9,999 pages, the optimizer has hit a kind of tipping point and – for the exact same query – is now choosing to scan the non-clustered ColumnStore index:

Phone Book plan 'tips' and uses the ColumnStore index

This turns out to be a not-so-great decision by the optimizer, primarily due to the cost of the sort operation. You can see how much better the duration gets if you hint the regular index:

-- ...
;WITH pg AS
  (
    SELECT CustomerID
      FROM dbo.[Customers_NCCI] WITH (INDEX(PhoneBook_CustomersNCCI)) -- hint here
      ORDER BY LastName, FirstName OFFSET @PageSize * (@PageNumber - 1) ROWS
      FETCH NEXT @PageSize ROWS ONLY
  )
-- ...

This yields the following plan, almost identical to the first plan above (a slightly higher cost for the scan, though, simply because there is more output):

Phone Book plan with hinted index

You could achieve the same using OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX) instead of the explicit index hint. Just keep in mind that this is the same as not having the ColumnStore index there in the first place.
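
For illustration, here is a static version of the paging query with that query-level hint in place of the index hint. It is written out by hand rather than generated by P_CTE, and the page number and page size are example values:

```sql
DECLARE @PageSize INT = 100, @PageNumber INT = 9999;

;WITH pg AS
(
  SELECT CustomerID
    FROM dbo.Customers_NCCI
    ORDER BY LastName, FirstName
    OFFSET @PageSize * (@PageNumber - 1) ROWS
    FETCH NEXT @PageSize ROWS ONLY
)
SELECT c.CustomerID, c.FirstName, c.LastName,
    c.EMail, c.Active, c.Created, c.Updated
FROM dbo.Customers_NCCI AS c
WHERE EXISTS (SELECT 1 FROM pg WHERE pg.CustomerID = c.CustomerID)
ORDER BY c.LastName, c.FirstName
-- Tell the optimizer to disregard any non-clustered ColumnStore index
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX, RECOMPILE);
```

Unlike the table-level index hint, this option applies to the whole statement, so neither reference to the table can use the ColumnStore index.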

Conclusion

While there are a couple of edge cases above where a ColumnStore index might (barely) pay off, it doesn't seem to me that they're a good fit for this specific pagination scenario. I think, most importantly, while ColumnStore does demonstrate significant space savings due to compression, the runtime performance is not fantastic because of the sort requirements (even though these sorts are estimated to run in batch mode, a new optimization for SQL Server 2016).

In general, this could do with a whole lot more time spent on research and testing; in piggy-backing off of previous articles, I wanted to change as little as possible. I'd love to find that tipping point, for example, and I'd also like to acknowledge that these are not exactly massive-scale tests (due to VM size and memory limitations), and that I left you guessing about a lot of the runtime metrics (mostly for brevity, but I don't know that a chart of reads that aren't always proportional to duration would really tell you). These tests also assume the luxuries of SSDs, sufficient memory, an always-warm cache, and a single-user environment. I'd really like to perform a larger battery of tests against more data, on bigger servers with slower disks and instances with less memory, all the while with simulated concurrency.

That said, this could also just be a scenario that ColumnStore isn't designed to help solve in the first place, as the underlying solution with traditional indexes is already pretty efficient at pulling out a narrow set of rows – not exactly ColumnStore's wheelhouse. Perhaps another variable to add to the matrix is page size – all of the tests above pull 100 rows at a time, but what if we are after 10,000 or 100,000 rows at a time, regardless of how big the underlying table is?
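
Since both procedures already accept @PageSize, a follow-up test with larger pages would only need different calls. The values below are examples and were not part of the tests above; with 1,000,000 rows and 10,000 rows per page there are 100 pages, so pages 1, 50, and 100 approximate the beginning, middle, and end:

```sql
-- Hypothetical larger-page calls; not part of the original test run.
EXEC dbo.P_CTE @Table = N'Customers',      @Sort = N'PhoneBook', @PageNumber = 1,   @PageSize = 10000;
EXEC dbo.P_CTE @Table = N'Customers_CCI',  @Sort = N'PhoneBook', @PageNumber = 50,  @PageSize = 10000;
EXEC dbo.P_CTE @Table = N'Customers_NCCI', @Sort = N'PhoneBook', @PageNumber = 100, @PageSize = 10000;
```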

Do you have a situation where your OLTP workload was improved simply by the addition of ColumnStore indexes? I know that they are designed for data warehouse-style workloads, but if you've seen benefits elsewhere, I'd love to hear about your scenario and see if I can incorporate any differentiators into my test rig.

The post Can ColumnStore Help Pagination Workloads? appeared first on SQLPerformance.com.

20 Jan 22:34

The Key to Unlocking Big Data is Machine Learning

by A.R. Guess

by Angela Guess

Ben Rossi recently wrote for Information Age, “A recent Gartner survey found that more than 75% of companies are currently investing or planning to invest in big data initiatives over the next two years. This heightened interest has led analysts to speculate that big data project investments will reach $242 billion in […]

The post The Key to Unlocking Big Data is Machine Learning appeared first on DATAVERSITY.

20 Jan 22:34

American Express’s Use of Big Data and Machine Learning

by A.R. Guess

by Angela Guess

Bernard Marr writes in Data Informed, “American Express handles more than 25 percent of credit card activity in the United States and, in 2014, surpassed handling $1 trillion in transactions. The company interacts with people on both sides of transactions: millions of businesses and millions of buyers. So it’s no surprise then […]

The post American Express’s Use of Big Data and Machine Learning appeared first on DATAVERSITY.

20 Jan 22:33

Stop using Task Manager to check SQL’s memory usage!

by Gail

There are two fairly common questions I see on the forums around SQL Server’s memory usage. Either the question asks why SQL’s using too much memory, or why it’s using too little.

Too much memory isn’t usually a real problem; it’s often due to max server memory being left at its default of 2048TB, along with a lack of understanding of how SQL uses memory.

Too little memory used is also often not a real problem, rather it’s usually from using the wrong tools to check SQL Server’s memory usage.
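
For reference, max server memory can be checked and changed with sp_configure. This is a small sketch; the 10240 MB value is only an example chosen to match the 10GB used in the test below, and the right value depends on the server.

```sql
-- 'max server memory (MB)' is an advanced option, so expose those first.
EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;

-- Show the current setting (the default is 2147483647 MB, i.e. 2048TB).
EXEC sys.sp_configure N'max server memory (MB)';

-- Example only: cap SQL Server at 10GB.
EXEC sys.sp_configure N'max server memory (MB)', 10240;
RECONFIGURE;
```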

Let’s start by looking at an example.

This particular server has 16 GB of memory, and SQL Server’s max server memory is set to 10GB. Since the last restart of the instance, I’ve run SELECT * FROM .. against every table in a 30GB database. That should have warmed the cache up nicely.

MemoryTaskManager

Err, what? SQL Server’s not even using 100MB? I’ve just read 30GB of data and SQL Server’s not even using 1% of its allowed memory!!!

Or is it?

TotalServerMemory

A look at perfmon shows a completely different picture. Total and target server memory are both at 10GB. So why is Task Manager showing such a low figure?

LockedPagesInErrorLog

The service account that SQL’s running under has been granted the Lock Pages in Memory permission. This means that SQL’s not using the normal Windows memory routines to allocate memory.

Normally, SQL Server and other Windows applications allocate memory using the VirtualAlloc API call. This allocates virtual memory which is pageable. When SQL Server has been given the Lock Pages in Memory permission, it doesn’t use VirtualAlloc to allocate memory. Instead it uses the API call AllocateUserPhysicalPages. Memory allocated with this API call is not pageable; it has to remain in physical memory.

Task Manager’s memory counters (the Working Set ones) only show memory that’s been allocated using VirtualAlloc. Hence, when SQL Server has the Lock Pages in Memory permission and is allocating most of its memory using AllocateUserPhysicalPages, its memory usage in Task Manager will appear to be extraordinarily low. As far as I recall, in that case it’s only the non-buffer memory which is still allocated with VirtualAlloc, and that’s just things like the thread stacks, the CLR memory, backup buffers, and similar. It’s not the buffer pool. The buffer pool gets allocated with AllocateUserPhysicalPages.

If I remove the Lock Pages in Memory permission and re-run the test, Task Manager shows completely different values.

MemoryTaskManager_NoLockedPages

Now the buffer pool is being allocated with VirtualAlloc and so Task Manager shows the full 10GB of memory usage.

In summary, Task Manager can show a completely incorrect value for SQL Server’s memory usage if the SQL service account has the Lock Pages in Memory permission. This can lead to a lot of wasted time if it is concluded that Task Manager is correct and SQL Server is using little memory.

Rather leave Task Manager alone and use perfmon and the DMVs to check what SQL Server’s memory allocation actually is. They’ll both be correct whether Lock Pages are being used or not.
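
A sketch of what checking via the DMVs can look like is below. Both queries require the VIEW SERVER STATE permission, and the column aliases are my own:

```sql
-- Process-level view: physical memory in use, including locked pages.
SELECT physical_memory_in_use_mb  = physical_memory_in_use_kb / 1024,
       locked_page_allocations_mb = locked_page_allocations_kb / 1024
FROM sys.dm_os_process_memory;

-- The same Total/Target Server Memory counters that perfmon shows.
SELECT counter_name = RTRIM(counter_name),
       value_mb     = cntr_value / 1024
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE N'%Memory Manager%'
  AND counter_name IN (N'Total Server Memory (KB)', N'Target Server Memory (KB)');
```

When Lock Pages in Memory is in use, locked_page_allocations_kb accounts for most of the memory that Task Manager's working set figure leaves out.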

20 Jan 22:33

Everything Fails

by Jeremiah Peschka

Everything is horrible!

Wait, that’s not the message I want to send at all.

I wonder if these guys have an HA/DR plan…

Planning for Failure Should Be Comprehensive

Think about the last time you thought about high availability and disaster recovery…

You’re lying, nobody ever thinks about HA and DR. Not until something is already on fire, at least.

Now, pretending you did think about HA and DR at some point in the distant past, how far down the rabbit hole did you go? Were there two servers? Did each server have redundant NICs? Power supplies? Were you using RAID? Did you think about the UPS?

Every component in the system needs to be considered when you’re looking into HA and DR. Using an AlwaysOn Availability Group, clustering, or database mirroring isn’t enough – there’s more to it.

Failure Has Consequences

Let’s use a specific example instead of talking in the abstract.

We’ll assume that you’ve decided those super fast consumer grade SSDs are the way to go. You’ve planned the rest of your deployment. You’ve got an AlwaysOn Availability Group. You’re ready to go. Right?

There’s still one more thing to talk about – power. See, most of those consumer grade SSDs don’t have any kind of battery in them. And, as you might know, disks lie. So we can’t really be sure if our writes are actually permanently stored somewhere unless we safely shut down the computer. Which always happens when the power goes out, right?

In this particular case, we need to keep worrying about power – what happens if the power fails? Is this server connected to a UPS? What happens when the UPS kicks in? Is there a backup generator? Will the server stay on? Can the server be automatically shut down? What’s that look like instead?

Ask Awful Questions

Being prepared has everything to do with asking yourself terrible questions. Work through the entire stack and come up with as many ways for things to fail as you can. Explore how you’d prevent these scenarios. You can’t provide a mitigation for everything that you come up with, but it’s good to think of these things.

Once you’ve got your List of Awfulness, work the feasible things into your HA and DR plans. Make sure that you’re covered as best as you can. Sometimes it makes sense to sweat the small stuff.


 

“Explosión” by kinojam is licensed under CC BY-NC-SA 2.0

20 Jan 22:33

SQL Server 2016 CTP 3.2: Introducing end-to-end mobile BI

by SQL Server Team

The SQL Server 2016 Community Technology Preview (CTP) 3.2 is now available for download! In SQL Server 2016 CTP 3.2, part of our new rapid preview model, we made enhancements to several features which you can try in your development and test environments. Additionally, a number of Mobile BI additions and enhancements will be available in CTP 3.2 and by the end of December.

In SQL Server 2016 CTP 3.2, available for download or in an Azure VM today, you will see enhancements in several areas, including:

SQL Server Reporting Services – End-to-end mobile BI on any device

The SQL Server 2016 CTP 3.2 release marks a significant milestone for SQL Server Reporting Services (SSRS) as we continue to deliver on our Microsoft BI reporting roadmap and the promise to enable users to get business insights, any way, anywhere and from any device.

In this preview release, we are adding support for mobile reports to SQL Server Reporting Services for on-premises implementations. This means that Reporting Services will be able to support two report types: paginated reports, which are the existing Reporting Services reports, and mobile reports, which are based on Datazen technology that was acquired in April of 2015.

Mobile reports are optimized for mobile devices and form factors and provide an optimal experience for users accessing BI reports on mobile devices. With SQL Server 2016 CTP 3.2, you can author and manage mobile reports for easy consumption by users across your organization online and on mobile devices.

Author interactive mobile reports

Microsoft SQL Server Mobile Report Publisher is the single point for creation of mobile reports. You can simply connect to SQL Server Reporting Services to access data sources and easily create stunning reports, then publish them to SQL Server Reporting Services for access by others in the organization via a unified web experience or on mobile devices.

Microsoft SQL Server Mobile Report Publisher preview is available for download from the Microsoft Download Center.

Consume mobile reports using a unified Mobile BI experience

Whether you are using SQL Server Reporting Services on-premises, Power BI in the cloud, or both as your report delivery solution, you will only need one mobile app (for each of Windows, iOS and Android) to access dashboards and mobile reports on mobile devices.

Starting with the Power BI app for iOS, all of your BI content will be available at your fingertips from within one single, unified mobile app. The Power BI app for iOS that includes a preview of the SSRS mobile reporting capabilities is available from the App Store.

New Web portal experience

The addition of the mobile report type in SQL Server Reporting Services is accompanied by an entirely new web portal experience allowing users to access KPIs, paginated and mobile reports in one centralized location.

Ready to learn more about these exciting new capabilities and explore the opportunities and scenarios it can enable for you and your organization? Download the CTP 3.2 release today and check out this SSRS blog post to find out how your organization can get business insights, any way, anywhere and from any device.

Additional enhancements

SQL Server Management Studio (SSMS) features improvements to the XEvents wizard to allow the use of templates when connected to an Azure v12 server, user interface improvements to Always Encrypted wizards and dialogs, an improvement to the Results pane to enable switching to the results tab after query execution, and updates to the Showplan comparison feature to enable comparison of the current execution plan with one saved in a file. Please visit the SSMS blog post to learn more.

SQL Server Analysis Services (SSAS) updates allow scripting in SSMS, creation of calculated tables, and DirectQuery for models with the 1200 compatibility level. Please visit the SSAS team blog post to learn more.

SQL Server Data Tools (SSDT) now includes enhancements for the new connection experience for Microsoft SQL Server and Azure SQL Database, which was introduced in CTP 3.1, as well as programmability support for SQL Server 2016 CTP 3.2 features and enhancements in SQL Server Analysis Services. Please visit the SSDT team blog post to learn more.

SQL Server Integration Services (SSIS) enhancements include HDFS-to-HDFS copy support, as well as Hadoop connectivity improvements, including Avro file format support and Kerberos authentication support.

Download SQL Server 2016 CTP 3.2 today!

As the foundation of our end-to-end data platform, SQL Server 2016 is the biggest leap forward in Microsoft's data platform history, with real-time operational analytics, rich visualizations on mobile devices, built-in advanced analytics, and new advanced security technology, both on-premises and in the cloud.

To learn more about SQL Server 2016, visit the SQL Server 2016 preview page. To experience the new, exciting features in SQL Server 2016 and the new rapid release model, download the preview or try the preview by using a virtual machine in Microsoft Azure and start evaluating the impact these new innovations can have for your business.

Have questions? Join the discussion on the new SQL Server 2016 capabilities at MSDN and Stack Overflow. If you run into an issue or would like to make a suggestion, you can let us know through Connect. We look forward to hearing from you!

For additional information about CTP 3.2, see What’s New in SQL Server 2016 and SQL Server 2016 Release Notes.

20 Jan 22:32

Preview the newest ODBC SQL Server Driver for Windows and Linux

by SQL Server Team

We are pleased to announce the community technology preview of Microsoft ODBC Driver 13 for SQL Server on Windows and Linux, supporting Ubuntu, RedHat and SUSE distributions! The updated driver provides robust data access to Microsoft SQL Server and Microsoft Azure SQL Database via ODBC on Windows and Linux platforms.

Always Encrypted for Windows and Linux

You can now use Always Encrypted with the Microsoft ODBC Driver on Linux and Windows. Always Encrypted is a new SQL Server 2016 and Azure SQL Database security feature that can help prevent sensitive data from being seen in plaintext in a SQL Server instance. It lets you transparently encrypt the data in the application, so that SQL Server handles only the encrypted data and never the plaintext values. Even if the SQL instance or the host machine is compromised, an attacker gets only the ciphertext of the sensitive data. To use the Always Encrypted feature, you have to use a supported driver, such as ADO.NET or the ODBC 13 Driver for SQL Server Preview, to encrypt the plaintext data and then store the encrypted data inside SQL Server 2016 CTP2 and above or Azure SQL Database. Similarly, you will use a capable driver like the new ODBC driver or ADO.NET to decrypt the data.

Internationalized Domain Names for Windows

Internationalized Domain Names (IDNs) allow your web server to use Unicode characters for the server name, enabling support for more languages. Using the new Microsoft ODBC Driver 13 for SQL Server on Windows Preview, you can convert a Unicode serverName to ASCII-compatible encoding (Punycode) when required during a connection. This conversion is enabled by setting the property serverNameAsACE to true. Otherwise, if the DNS service is configured to allow the use of Unicode characters, use the default serverNameAsACE property value of false.
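
To make the Punycode conversion concrete, here is a minimal sketch in Python. It does not use the ODBC driver; it only shows what an ASCII-compatible encoding of a Unicode server name looks like, using Python's built-in "idna" codec. The function name to_ace and the host names are made up for illustration.

```python
def to_ace(server_name: str) -> str:
    """Convert a Unicode host name to its ASCII-compatible (Punycode) form."""
    # Python's built-in "idna" codec converts each dot-separated label;
    # labels that are already plain ASCII are left as they are.
    return server_name.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(to_ace("bücher.example"))  # prints "xn--bcher-kva.example"
    print(to_ace("sql01.example"))   # prints "sql01.example"
```

A driver with serverNameAsACE set to true would perform a conversion of this kind before resolving the name, which is why it is only needed when the DNS service cannot handle Unicode names directly.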

Linux ODBC drivers add Ubuntu support

The preview ODBC drivers for Linux now support Ubuntu, RedHat and SUSE. This is Microsoft’s first ODBC Driver for SQL Server release to support Ubuntu, so you can now enjoy enterprise-level support while connecting to SQL Server from Ubuntu. The release also updates the drivers to support unixODBC driver manager 2.3.1.

Learn more

The ODBC driver is part of SQL Server and the Microsoft Data Platform’s wider interoperability program, with drivers for PHP 5.6, Node.js, JDBC, and ADO.NET already available. Look for more features in coming releases as we continue to build out support for Linux in our ODBC driver.

We invite you to explore the latest the Microsoft Data Platform has to offer via a trial of Microsoft Azure SQL Database or by trying the new SQL Server 2016 CTP.

For more information see documentation on the Microsoft Developer Network.

Questions? Join the discussion of the new driver capabilities at MSDN and Stack Overflow. If you run into an issue or would like to make a suggestion, let us know via Connect.

20 Jan 22:32

Data security, SQL Server 2016, and your business

by David Hobbs-Mallyon

Security is unquestionably a major priority for Microsoft. A recent news story reported that the company “is spending $1 billion a year to make Microsoft products more secure.” The Microsoft data platform, including SQL Server and Azure SQL Database, is at the top of the list of products investing in security. But a commitment to data security is actually nothing new. SQL Server has long been recognized for its outstanding security record: According to the National Institute of Standards and Technology (NIST)1 public security board, for the past six years, SQL Server has had the fewest security vulnerabilities when compared with the major database vendors. In addition, SQL Server has been deemed “the most secure database” by the Information Technology Industry Council (ITIC). Despite this excellent security record, Microsoft is not content to rest on its laurels and is continuing to invest in security, providing customers with new and improved tools to secure data and applications.

From an IT infrastructure and compliance perspective, the importance of protecting data is clear. Witness the fact that security has been identified as one of the “Eight emerging data center trends to follow in 2016.” But data protection also has profound business implications and can even be a competitive differentiator by helping drive customer loyalty and retention, create opportunities for premium offers and new sources of revenue, and protect future revenue streams, according to Forrester Research 2. To help deal with the complexity and scope of data security, and diminish risks to your business, Microsoft provides an across-the-board, in-depth security approach that includes application security, network security, and database security.

Data Security and SQL Server

Playing into this overall approach, SQL Server 2016 and Azure SQL Database include advanced, layered security functionality to help protect data itself as well as access to that data, and then provide monitoring capabilities. Data security features include (but are not limited to) the following:

  • Always Encrypted enables encryption inside client applications without revealing encryption keys to SQL Server. It allows changes to encrypted data without the need to decrypt it first.
  • Transparent Data Encryption (TDE) protects data at rest by encrypting all the user data in data files. TDE prevents users from attaching or restoring a database to another server as a way to access the data.
  • Support for Transport Layer Security (TLS), which has now been updated to version 1.2, protects data in transit and offers protection from such tactics as man-in-the-middle attacks.
  • Dynamic Data Masking (DDM) and Row-Level Security (RLS) help developers build applications that require restricted direct access to certain data as a means of preventing users from seeing specific information.

This layered approach to data security and Microsoft’s overall commitment to advancing security and privacy protection address important considerations for business today. Upcoming blogs will go into deep technical detail on these security capabilities, but examining a business scenario can help illuminate the business benefits that data security can help ensure.

Business implications

Data has become not only a business asset, but it is now also a competitive differentiator: A company that can ensure that customer and business data are secured has a competitive edge over a company that does not make data security a priority. This means that for business and technical decision-makers to enable their businesses to compete effectively, they need a data platform with built-in security features and they need a strategy that takes advantage of the built-in security capabilities.

The business implications of data security range from speeding up customer service, to impacting the bottom line, to protecting shareholder value. Underscoring the potential bottom-line concerns of financial executives, a recent survey found that 66 percent of CFOs consider security to be a high or very high priority. Even at the end-user level, the potential business impact of exposing sensitive data is recognized: Another recent survey discloses that “71 percent of end users say that they have access to company data they should not be able to see.”

How can Microsoft’s data security capabilities ease such concerns? Consider just one example showing how Dynamic Data Masking, as a part of your data security program, can help you address the point raised by those end users who admitted they had access to data (such as Social Security Numbers or health details) that they shouldn’t be able to view. For example, suppose you have a call center where representatives deal with customer billing questions. When a customer record comes up, the representative needs to see certain information to answer questions. But some customer information, such as specific personal health details, needs to remain confidential for HIPAA compliance. With Dynamic Data Masking, IT administrators can take simple steps to define policies, or rules, to mask any personally identifiable information that is not needed for the customer interaction. This way, the representative can view a customer record without having access to confidential information. Customer information is secured, but at the same time, customer service is able to answer questions by accessing appropriate data without compromising privacy.
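
The call center scenario can be sketched in a few lines of Python. This is only an illustration of the masking idea, not how SQL Server implements Dynamic Data Masking: in the product, masks are defined on table columns and applied by the database engine. The function names, column names, and sample record below are all hypothetical, and the partial_mask helper is loosely modeled on the idea of exposing a prefix and suffix while padding the middle.

```python
def partial_mask(value: str, prefix: int, padding: str, suffix: int) -> str:
    """Keep the first `prefix` and last `suffix` characters; pad the middle."""
    if prefix + suffix >= len(value):
        # Value is too short to expose anything safely (a choice made in this sketch).
        return padding
    head = value[:prefix]
    tail = value[len(value) - suffix:] if suffix else ""
    return head + padding + tail


def view_for_representative(record: dict, masks: dict) -> dict:
    """Return a copy of `record` with each masking rule in `masks` applied."""
    masked = dict(record)
    for column, (prefix, padding, suffix) in masks.items():
        if column in masked:
            masked[column] = partial_mask(str(masked[column]), prefix, padding, suffix)
    return masked


# Hypothetical customer record and masking policy.
record = {"name": "Pat Doe", "balance_due": "42.10",
          "ssn": "123-45-6789", "diagnosis": "asthma"}
masks = {"ssn": (0, "XXX-XX-", 4), "diagnosis": (0, "xxxx", 0)}

if __name__ == "__main__":
    print(view_for_representative(record, masks))
```

The representative still sees the name and the balance needed for the billing question, while the Social Security Number is reduced to its last four digits and the health detail is hidden entirely.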

Commitment to security built in

As the article cited above emphasizes, Microsoft is spending $1 billion per year to ensure that its products are secured so that businesses are protected. SQL Server and Azure SQL Database are continuously building in state-of-the-industry security technologies as part of this ongoing commitment to security. For business, this means you don’t have to pay extra to give IT staff security tools that are easy to deploy and maintain, because those tools are built into Microsoft’s data platform. At the same time, businesses can build data security infrastructure that supports customers and provides a competitive edge. To learn more about Microsoft’s data security approach, see the Security Center for SQL Server Database Engine and Azure SQL Database and the SQL Security Blog.

See the other posts in the SQL Server 2016 blogging series

1. National Institute of Standards and Technology Comprehensive Vulnerability Database update 10/2015

2. The Future of Data Security And Privacy: Growth And Competitive Differentiation Vision: The Data Security And Privacy Playbook, John Kindervag, Heidi Shey, and Kelley Mak, Forrester, July 10, 2015

20 Jan 22:32

Geek City: SQL Server 2016 CTP3 In-Memory OLTP Internals Whitepaper

by Kalen Delaney
Just a quick note that my CTP3 paper has finally been published, and now I can start working on the final version! Here’s the link: http://download.microsoft.com/download/D/5/2/D52D374F-D442-4275-B570-0EB527102F4D/SQL_Server_In_Memory_OLTP_Internals_Overview_for_SQL_Server_2016_CTP3_EN_US.pdf Enjoy! ~Kalen
20 Jan 22:31

Big Data, Antitrust, and the European Union

by A.R. Guess

by Angela Guess

Peter Sayer reports in Computer World, “Europe’s top antitrust authority is on the lookout for companies using big data to stifle competition, although it hasn’t spotted any problems yet, according to Competition Commissioner Margrethe Vestager. It’s good news when companies use data to cut costs and offer better service, the European Commission’s […]

The post Big Data, Antitrust, and the European Union appeared first on DATAVERSITY.

20 Jan 22:31

Microsoft SQL Server Index Performance

by Andrew

An index in SQL Server is a structure associated with a table or view that speeds up the retrieval of rows. It consists of keys built from one or more columns of the table or view. These keys are stored in a B-tree structure that lets SQL Server find the rows associated with the key values very quickly.

This blog deals with a SQL Server index performance issue that was faced by one of our customers.

Issue:

The customer had an ETL package that was running longer than usual. The package served a simple function: inserting multiple rows from a staging table into a large table containing millions of rows.

Reason:

The cause turned out to be the OLE DB destination adapter, which was consuming most of the execution time. Further investigation revealed that the destination table had a large number of indexes, some of them larger than 200 GB. Moreover, the execution plan showed that the SQL Server index updates were also responsible for the slow insert operations.

Solution:

To solve the SQL Server index performance issue, we first thought of disabling the large indexes. However, these indexes were used for daily reports, and disabling them would have severely affected those reports.

The second solution that came to mind was to disable the indexes before the ETL run and enable them afterwards. However, this solution also had its drawbacks: the index rebuild took too long, and the reports that depend on the indexes had to be run immediately. This solution was therefore discarded as well.

The only solution that suited every requirement was to reduce the index size. It rested on the assumption that smaller indexes require less work per index update, leading to better performance for update and insert transactions.

Reduction of SQL Server Index Size

A deep analysis of the indexes revealed that although the large indexes had only a small number of key fields, they carried a large number of columns in their INCLUDE clause.

SQL Server Index Optimization

The preferred solution was to optimize the existing SQL Server indexes without compromising the performance of read queries. To do so, we searched the cache for all queries that referenced the large indexes. Analyzing the execution plans of these queries showed that no index fully satisfied them, because Lookup operators were present in the plans. We therefore decided to remove a few columns from the INCLUDE part, since doing so would not change the query plans: the Lookup operator already had to fetch many more fields.

After a number of fields were removed from the INCLUDE portion of the covering indexes, without disturbing the main index, the index sizes dropped to 40%. Query performance improved drastically as well, because the execution cost of the SELECT queries was reduced and they ran faster than before.

The most important point in the above process is that removing some columns from the INCLUDE part of the SQL Server index improved the performance of both read and write queries, because smaller indexes require much less memory and disk input/output.
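
To see why trimming INCLUDE columns shrinks an index so much, here is a rough back-of-the-envelope sketch in Python. It follows the general shape of Microsoft's published guidance for estimating nonclustered index size (about 8,096 usable bytes per page and a 2-byte slot entry per row), but the per-row overhead, the column widths, and the row count below are assumptions chosen for illustration, not figures from the customer case.

```python
import math

PAGE_BYTES_USABLE = 8096  # usable bytes on an 8 KB page
SLOT_BYTES = 2            # slot-array entry per row


def estimated_leaf_pages(row_count: int, key_bytes: int, included_bytes: int,
                         overhead_bytes: int = 7) -> int:
    """Estimate leaf-level pages for an index; overhead_bytes is an assumed constant."""
    row_bytes = key_bytes + included_bytes + overhead_bytes
    rows_per_page = PAGE_BYTES_USABLE // (row_bytes + SLOT_BYTES)
    if rows_per_page < 1:
        raise ValueError("index row does not fit on a single page")
    return math.ceil(row_count / rows_per_page)


if __name__ == "__main__":
    rows = 100_000_000  # hypothetical table size
    wide = estimated_leaf_pages(rows, key_bytes=12, included_bytes=400)
    narrow = estimated_leaf_pages(rows, key_bytes=12, included_bytes=150)
    print(f"wide index:   {wide} leaf pages")
    print(f"narrow index: {narrow} leaf pages")
    print(f"narrow / wide = {narrow / wide:.2f}")
```

Because every insert has to write a leaf row in each index, fewer bytes per row means more rows per page, fewer pages touched, and less memory and disk I/O for both reads and writes.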

Summary

Whilst covering indexes are of great importance for improving read queries, they can also play a major role in degrading the performance of read and write transactions on large tables. With fully covering indexes, reducing their size can be problematic, because it can make the queries less efficient by leaving them only partially covered. However, if the SQL Server index is only partially covering and the queries and transactions are not performing well, downsizing is the most efficient solution.

20 Jan 22:31

HA/DR for Azure SQL Database

by James Serra

Azure SQL Database is a relational database-as-a-service in the cloud. It uses a special version of Microsoft SQL Server as its backend that is nearly identical to SQL Server (see Azure SQL Database Transact-SQL differences). While there are many benefits to using SQL Database over SQL Server, in this blog post I’ll talk about the various types of high-availability and disaster recovery options, which are much easier to set up than with SQL Server.

When you use the Azure portal to create a SQL Database, the various plans under the pricing tier include three service tiers: Basic, Standard, and Premium.  Here are those three plans with their high-availability (HA) and disaster recovery (DR) options:

Basic: Automatic Backups, Point In Time Restore up to 7 days, Disaster recovery (DR): Geo-Restore, restore to any Azure region

Standard: Automatic Backups, Point In Time Restore up to 14 days, DR: Standard Geo-Replication, offline secondary

Premium: Automatic Backups, Point In Time Restore up to 35 days, DR: Active Geo-Replication, up to 4 online (readable) secondary backups

Here are more details on those options:

High Availability: Each database possesses one primary and two local replica databases stored on LRS Azure Blob Storage that reside in the same datacenter, providing high availability within that datacenter. At least two of those databases are synchronous. The hardware these databases reside on is on completely physically separate sub-systems. So if the hardware fails, your database will automatically and seamlessly fail over to the synchronous copy.

Automatic Backups: All Basic, Standard, and Premium databases are protected by automatic backups.  Full backups are taken every week, differential backups every day, and log backups every 5 minutes.  The first full backup is scheduled immediately after a database is created.  Normally this completes within 30 minutes but it can take longer.  If a database is “born big”, for example if it is created as the result of database copy or restore from a large database, then the first full backup may take longer to complete.  After the first full backup all further backups are scheduled automatically and managed silently in the background.  Exact timing of full and differential backups is determined by the system to balance overall load.  Backup files are stored locally in blob storage in the same data center as your databases with local redundancy.  When you restore a database, the required backup files are retrieved and applied.  The full, differential, and log backups are also copied to the blob storage in the paired secondary region in the same geo-political area for disaster recovery purpose (RA-GRS).  These geo-redundant copies are used to enable geo-restore as explained shortly.
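
As a rough illustration of how full, differential, and log backups combine during a restore, here is a simplified sketch in Python. It is a generic model of a restore chain, not the service's actual implementation: the Backup type, the restore_chain function, and the idea of identifying backups only by their completion time are all assumptions made for this example.

```python
from datetime import datetime
from typing import NamedTuple


class Backup(NamedTuple):
    kind: str           # "full", "diff", or "log"
    finished: datetime  # when the backup completed


def restore_chain(backups: list[Backup], target: datetime) -> list[Backup]:
    """Pick the backups to apply, in order, to reach `target`."""
    ordered = sorted(backups, key=lambda b: b.finished)
    fulls = [b for b in ordered if b.kind == "full" and b.finished <= target]
    if not fulls:
        raise ValueError("no full backup exists at or before the target time")
    base = fulls[-1]
    # Only the most recent differential after the chosen full backup is needed.
    diffs = [b for b in ordered
             if b.kind == "diff" and base.finished < b.finished <= target]
    chain = [base] + diffs[-1:]
    start = chain[-1].finished
    # Apply log backups taken after that point, stopping at the first one
    # that ends at or after the target, since it contains the target moment.
    for b in ordered:
        if b.kind == "log" and b.finished > start:
            chain.append(b)
            if b.finished >= target:
                break
    return chain
```

With weekly full backups, daily differentials, and log backups every 5 minutes, a restore in this model needs one full backup, at most one differential, and only the log backups taken since that differential.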

Point In Time Restore: Point In Time Restore is designed to return your database to an earlier point in time. It uses the full backups, differential backups, and transaction log backups that the service automatically maintains for every user database. See Azure SQL Database Point in Time Restore. To restore a database, see Recover an Azure SQL Database from a user error. When you perform a restore, you’ll get a new database on the same server.
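
The retention windows listed for the three tiers above can be expressed as a small lookup. The sketch below is in Python; the function name can_restore_to is hypothetical, and the check ignores details such as the database's creation time.

```python
from datetime import datetime, timedelta

# Point In Time Restore retention per service tier, as listed in this post.
RETENTION_DAYS = {"Basic": 7, "Standard": 14, "Premium": 35}


def can_restore_to(tier: str, target: datetime, now: datetime) -> bool:
    """Return True if `target` falls inside the tier's restore window."""
    window = timedelta(days=RETENTION_DAYS[tier])
    return now - window <= target <= now


if __name__ == "__main__":
    now = datetime(2016, 1, 20, 12, 0)
    ten_days_ago = now - timedelta(days=10)
    for tier in RETENTION_DAYS:
        print(tier, can_restore_to(tier, ten_days_ago, now))
```

A restore point ten days in the past is out of reach on Basic but available on Standard and Premium, which is one practical reason to pick a higher tier.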

Geo-Restore: When you create a SQL Database server, you choose the region you want it in (e.g., East US), and this is your primary region. If there is an incident in this region and a database is unavailable, you can restore it from the geo-redundant backup copy in the secondary region to any region, using the same technology as point in time restore, and therefore the databases are resilient to storage outages in the primary region. Note that with this option, your data could be up to one hour behind. See Azure SQL Database Geo-Restore.

Standard Geo-Replication: This is where a copy of your data in the primary database is constantly being written asynchronously to a non-readable secondary database on a server in a different region (geo-redundancy). In the event of a disaster you can fail over to the secondary. Since the copy is asynchronous, the data in the secondary database will be behind the primary, but not by more than five seconds (you can have the application wait until committed transactions have been replicated to the secondary by calling the system procedure sp_wait_for_database_copy_sync). See Azure SQL Database Standard Geo-Replication.

Active Geo-Replication: Similar to Standard Geo-Replication, your data is written asynchronously, except that it goes to up to four secondary servers in different regions, and these secondaries are readable (each continuous copy is referred to as an online secondary database). You can also fail over to a secondary in the event of disaster in the same way as Standard Geo-Replication. In addition, Active Geo-Replication can be used to support application upgrade or relocation scenarios without downtime, as well as load balancing for read-only workloads. See Active Geo-Replication for Azure SQL Database.

A word about database failover:
If a region has an extended outage, you will receive an alert in the Azure Portal and will see your SQL Database servers’ state set to Degraded. At that point an application has a choice of initiating the failover or waiting for the datacenter to recover. If your application needs to optimize for higher availability and can tolerate a data loss of 5 seconds, then it should fail over as soon as you receive an alert or detect database connectivity failures. If your application is sensitive to data loss, you may opt to wait for the SQL Database service to recover. If this happens, no data loss will occur. If you initiate the failover of the database, you must reconfigure your applications appropriately to connect to the new primary databases. Once you have completed the failover, you will want to ensure that the new primary is also protected as soon as possible. Since primary region recovery may take time, you will have to wait for your server to change from Degraded back to Online status. This will allow you to initiate geo-replication from the new primary to protect it. Until seeding of the new secondary is completed, your new primary will remain unprotected.
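
The failover decision described above boils down to a simple rule, sketched here in Python. The names Action and choose_action are hypothetical, and the 5-second figure is the replication lag quoted in this post, not a value read from the service.

```python
from enum import Enum

# Replication lag quoted in this post for asynchronous geo-replication.
MAX_EXPECTED_LAG_SECONDS = 5


class Action(Enum):
    NONE = "no action needed"
    FAIL_OVER = "fail over to the secondary"
    WAIT = "wait for the primary region to recover"


def choose_action(server_state: str, tolerable_loss_seconds: float) -> Action:
    """Pick a response to the server state shown in the portal."""
    if server_state != "Degraded":
        return Action.NONE
    if tolerable_loss_seconds >= MAX_EXPECTED_LAG_SECONDS:
        # Availability matters more than the last few seconds of writes.
        return Action.FAIL_OVER
    # Data-loss-sensitive workload: waiting loses nothing if the service recovers.
    return Action.WAIT
```

In practice the application would also need to repoint its connection strings after a failover and re-establish geo-replication from the new primary, as described above.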

More info:

Creating a large data warehouse in Azure

Business Continuity Overview

Design for business continuity

SQL Database Enable Geo Replication in Azure Portal

Fault-tolerance in Windows Azure SQL Database

Distributed Storage: How SQL Azure Replicas Work

High Availability and Disaster Recovery for Azure SQL Databases