Shared posts

26 Mar 07:43

Cloud Platform Release Announcements for March 2, 2016

by Cloud Platform Team

This is a blog post of a new ongoing series of consolidated updates from the Cloud Platform team.

 

In today’s mobile first, cloud first world, Microsoft provides the technologies and tools to enable enterprises to embrace a cloud culture. Our differentiated innovations, comprehensive mobile solutions and developer tools help all of our customers realize the true potential of the cloud first era.

You expect cloud-speed innovation from us, and we’re delivering across the breadth of our Cloud Platform product portfolio. Below is a consolidated list of our latest releases to help you stay current, with links to additional details if you’d like more information. In this update:

  • Microsoft Cloud App Security (Adallom) | Preannounce
  • Azure AD Identity Protection | Public preview
  • Intune: New MAM enabled Apps from Skype for Business, Dynamics CRM, and Adobe Reader | GA and Intune: Outlook support for MAM without Device Enrollment | GA
  • Microsoft R Server | Azure Marketplace
  • OMS extension for Linux VMs | Public – Full experience
  • Power BI Content Packs Alpine Metrics, Dynamics AX – Retail Channel Manager and Dynamics AX – Financial Performance generally available
  • StorSimple Virtual Array, on-premises | GA
  • Visual Studio Code extension for React Native | Public preview

Microsoft Cloud App Security (Adallom) | Preannounce

Last week, prior to RSA conference, we announced the upcoming general availability of a new, cloud-delivered security solution: Microsoft Cloud App Security.

Based on our Adallom acquisition, Microsoft Cloud App Security is a new service that provides visibility, control, and security over cloud applications at a level similar to on-premises. You get the benefits of cloud and SaaS applications, while gaining insight into user activity, detecting anomalous behavior, identifying compromised accounts, and increasing protection over critical company data. With this new solution, we are committed to supporting not only Microsoft but also third-party cloud applications.

Cloud App Security, as a standalone solution, will be generally available in April 2016. If you would like to learn more about this product and experience it real time, sign up to be notified once it’s available.

Azure AD Identity Protection | Public preview

Identity Protection by Azure Active Directory now in public preview.

Azure Active Directory Identity Protection takes secure identity and access management to the next level by detecting attacks in real time, informing you of risks and applying controls to keep your enterprise safe.

The service detects suspicious activities for users and privileged (admin) identities based on signals like brute force attacks, leaked credentials, sign ins from unfamiliar locations, infected devices and more and provides remediation recommendations to protect against these activities in real-time. More importantly, based on these suspicious activities, a user risk severity is calculated and risk-based conditional access policies can be configured and automatically protect the identities of your organization from future threats.

For more information, please refer to the Identity Protection documentation.

New in Microsoft Intune: More MAM enabled apps from Outlook, Skype for Business, Dynamics CRM, and Adobe Reader

Having the right set of managed apps is essential for creating a productive and protected mobile experience across your organization, that’s why we’re continually working to expand our portfolio of MAM enabled apps for Microsoft Intune. Since its release in 2015, the Microsoft Outlook app has been one of the most popular Intune managed apps available. We’re excited to announce an update to the Outlook app that allows you to protect your corporate data using MAM controls without device enrollment.

This is a great option for BYOD scenarios where you want to keep corporate data safe without managing a user’s device. Available now for iOS and Android!

In case you missed them, here is a list of other MAM-enabled apps we’ve recently made available for MDM enrolled devices.

  • Microsoft Skype for Business (MAM with MDM) available now for iOS and Android
  • Microsoft Dynamics CRM (MAM with MDM) available now for iOS and Android
  • Adobe Reader (MAM with MDM) available now for iOS

Keep up-to-date on what apps are coming and what’s been released on the Intune Partners page. Also, check out this great article to learn more about how Intune MAM functionality and policies help protect your corporate data.

Microsoft R Server | Azure Marketplace

Now available as a Market Place VM. Microsoft R Server is a cost effective enterprise-class big data big analytics platform available today for purchase on Azure Marketplace as Virtual Machines. Supporting a variety of big data statistics, predictive modeling and machine learning capabilities, Microsoft R Server is also 100% R. Microsoft R Server provides users with the best of both: cost-effective and fast big data analytics that are fully compatible with the R language, the de facto standard for modern analytics users.

Microsoft R Server WM in Azure

Microsoft is excited to announce the availability of its latest release of Microsoft R Server as a general availability on Microsoft Azure via Windows, Linux, Hadoop and Teradata-based virtual machines in the Azure Marketplace. Through Azure’s world-wide cloud infrastructure customers now have on-demand access to high-performance predictive analytics to accelerate growth, optimize operations, and expedite data insight and discovery from any place and at any time. Availability in Azure Marketplace is the first step in Microsoft’s plan to bring Microsoft R Server into Azure and, in the bigger picture, Cortana Analytics.

Through the Azure Marketplace, users can run computations on data sets up to one terabyte on cloud-based Windows and Linux multi-CPU instances from four to 32 vCPUs (virtual CPUs), accessing data copied from an Azure data store, including blob storage, Azure Data Lake, or SQL Server, or accessed through direct connection to Azure Data Lake, use of Azure Files, or ODBC. Utility pricing for both versions starts at $1.50 per four-cores per hour with no long-term commitments.

The Windows version of Microsoft R Server is accessed with Windows Remote Desktop and includes Microsoft ’s Visual Studio-based IDE for R developers (Microsoft R Server DevelopR). Access to the Linux version of RRE is via SSH with R access through the RGui. Alternatively, customers have the option of bringing their own license for a favorite IDE such as RStudio, RStudio Server or StatET. Both the Windows and Linux offerings include Microsoft ’s DeployR web services module.

Microsoft R Server features a unique “write once deploy anywhere” functionality that enables data analysts and IT teams to write code once and deploy it anywhere in a variety of data management platforms, enterprise data warehouses, grids, clusters, servers and workstations without re-engineering costs. It is the only Big Data Big Analytics platform to include a library of Big Data-ready algorithms that run inside the Cloudera, Hortonworks, and MapR Hadoop platforms, Teradata databases, and soon SQL Server, with the highest possible performance.

Some important considerations

Microsoft R Server offers are now a first-party MSFT offers based on MRS 8.0 and having CSS support. Although the new offers will be first-party offers and branded as MSFT their purchase will still require credit cards for the R Server portion of the fees due to the restrictions of the Azure Marketplace commerce infrastructure.

This will impact large customers who purchase Azure credits and partners with sponsored subscriptions until we can replace the Marketplace offerings with core images. We are working to offer the entire server solution and integrate the meters with Azure billing.

OMS extension for Linux VMs | Public – Full experience

Microsoft Operations Management Suite enables seamless onboarding experience for virtual machines running in Azure, either Linux or Windows.

Operations Management Suite provides a VM extension for Linux IaaS VMs in Azure. This capability will enable Azure users to simply onboard to the Log Analytics service from the Azure management portal.

Visit the Hybrid IT management blog for additional information.

Power BI Content Packs Alpine Metrics, Dynamics AX – Retail Channel Manager, and Dynamics AX – Financial Performance generally available

Power BI continues to make it easier for users to connect to their data, providing pre-built solutions for popular services as part of the experience. This month we added a content pack for Alpine Metrics and Dynamics AX – Financial Performance and Retail Channel Performance.

Alpine Metrics, utilizing Cortana Analytics, predicts sales outcomes with 95%+ accuracy. Leveraging Dynamics CRM and Salesforce, the Alpine Metrics Sales Predictions content pack for Power BI includes metrics such as potential and predicted sales and risks, allowing you deeper insight into the future of your business.

Microsoft Dynamics AX is an enterprise resource planning (ERP) solution that helps enterprises increase the speed of doing business. This update includes two Dynamics AX 7 content packs which are targeted towards different business users. The Financial Performance content pack, designed specifically for CFOs, provides access to insights about your organization’s financial performance. The Retail Channel Performance is targeted for channel managers and focuses on sales performance to predict trends and uncover insights by drawing directly from Retail & Commerce data.

Subscribers to these supported services can now quickly connect to their account from Power BI and see their data through live dashboards and interactive reports that have been pre-built for them. Getting started with data visualization and analysis has never been easier, for additional information visit the Power BI blog.

StorSimple Virtual Array, on-premises | GA

StorSimple on-premises Virtual Array generally available

No additional charge | StorSimple

Microsoft Azure StorSimple is announcing the general availability of the StorSimple Virtual Array. This is a version of the StorSimple solution in a virtual machine form installed in a customer’s remote or branch office. This provides a cost-effective, light-weight solution for remote and branch office environments while still delivering the fundamental value of the StorSimple Hybrid Cloud Storage solution:

  • Seamless cloud integration with automated tiering to manage data growth
  • Simplified data protection with cloud-based snapshots
  • Location independent recovery of data from the cloud

The StorSimple Virtual Array can be downloaded from the StorSimple Manager, on the Azure portal, and installed on a Virtual Machine configured on a Hyper-V or VMware hypervisor. The StorSimple Virtual Array can be configurable as a NAS (using SMB) or a SAN (using iSCSI) device. It will initially be targeted for File Sharing, for information worker and collaboration scenarios, and small database workloads using a combination of local and tiered volumes. The StorSimple Virtual Array will be managed from the StorSimple Manager Service, in the Azure portal.

The on-premises StorSimple Virtual Array is billed against the same meters as the StorSimple 8000 Series, the StorSimple meter and the Azure Blob Storage meter.

For more information, please refer to the StorSimple overview.

Visual Studio Code extension for React Native | Public preview

At the reactconf conference (SFO 22,23) we announced the availability of a Preview Visual Studio Code extension for ReactNative development.

ReactNative is a framework to create native, cross-platform applications for Android and iOS using JavaScript. The Visual Studio Code extension, adds IntelliSense, code editing and debugging capabilities on Visual Studio Code for ReactNative projects. To find out more about the release, check out the Visual Studio Blog announcement or ReactNative website.

26 Mar 07:35

Databas(ics)

by Buck Woody

The beginnings of data science is data. Data are things that you know about, well, other things, so it makes …

Continue reading →

26 Mar 07:35

Data Science and the Lytro Camera

by Buck Woody

Since the first practical camera was invented in the 1800’s, it’s been used as a scientific tool. In essence, it’s …

Continue reading →

26 Mar 07:34

But *Why* Do You Trust Your Data?

by Buck Woody

At the beginning of every data project is the data. While we spend a great deal of time figuring out …

Continue reading →

26 Mar 07:34

Occam’s Razor and the Data Science Project

by Buck Woody

The Cortana Analytics suite from Microsoft is not a single platform, but actually a group of related products and features. Why …

Continue reading →

26 Mar 07:32

In-Memory OLTP Simulator: The Experiment

by Artemakis Artemiou [MVP]
A few days ago I published a 38-page technical paper titled "In-Memory OLTP Simulator: The Experiment". In-Memory OLTP Simulator is a software tool that I developed, which allows the user to easily simulate virtually any workload against the powerful In-Memory OLTP Engine of Microsoft SQL Server. This paper presents a series of simulations for a specific scenario against different workload
26 Mar 07:26

SQL Nexus 2016 in Copenhagen

by Davide Mauri

For 2nd to the 4th of May, in Copenhagen, the SQL Nexus conference will take place and it looks like is going to be one of those events that, if you live in Europe, you really cannot miss.

SQL_Nexus_930x180px_webbanner_speaker

Just visit the website to see how awesome is the speaker roster and the, even if the agenda is not yet there, you can already feel that is going to be *really* interesting:

http://www.sqlnexus.com

Now, Beside the following Pre-Conference

Reference Big Data Lambda Architecture in Azure
The Lambda Architecture is a new generic, scalable and fault-tolerant data processing architecture, that is becoming more and more popular now that big data and real-time analytics are frequently requested by end users, enabling them to make informed decisions more precisely and quickly. During this full-day workshop we'll see how the Azure Data Platform can perfectly support such an architecture and how to use each technology to build it. From Azure IoT Hub and Azure Stream Analytics to Azure Data Lake and Power BI, we'll build a small Lambda-Architecture solution so that you'll be able to become confident with it and its implementation using Azure technologies.

that I’ll deliver with my friend Allan Mitchell that I’ve already mentioned before I’m happy to announce that I’ll also have a regular session on Machine Learning, a topic I really love:

Azure ML: from basic to integration with custom applications
In this session, Davide will explore Azure ML from the inside out. After a gentle approach on Machine Learning, we’ll see the Microsoft offering in this field and all the feature it offers, creating a simple yet 100% complete Machine Learning solution.
We’ll start from something simple and then we’ll also move to some more complex topics, such as the integration with R and Python, IPython Notebook until the Web Service publishing and usage, so that we can integrate the created ML solution with batch process or even use it in real time with LOB application.
All of this sound cool to you, yeah? Well it is, since with ML you can really give that “something more” to your customers or employees that will help you to make the difference. Guaranteed at 98.75%!

See you there!

26 Mar 07:22

Transposing Columns onto Rows

by superlatch

After a long period of absence, I have returned to posting some blog articles.  This one popped up last week from an Oracle DBA while I was onsite with a customer.  The conversation went something along the lines of:

Oracle DBA: “How can I place columns values on rows? I have a limited page width and need to alter the output”

Microsoftie “ah.. well, that’s easy..”

Consider the output below:

 

image

but we want it to appear like this:

image

 

The magic is really in the UNPIVOT function as shown below.

CREATE DATABASE sandbox;

USE sandbox;

CREATE TABLE tblPerson
(
    Email_Address varchar(50),
    First_Name varchar(50),
    Last_Name varchar(50)
);

INSERT INTO tblPerson VALUES
('ben@test.com', 'Ben', 'WJ')

SELECT * FROM tblPerson;
SELECT tblPivot.Property, tblPivot.Value FROM (SELECT CONVERT(sql_variant,Email_Address) AS Email_Address, CONVERT(sql_variant,First_Name) AS First_Name, CONVERT(sql_variant,Last_Name) AS Last_Name FROM tblPerson) Person UNPIVOT (Value For Property In (Email_Address, First_Name, Last_Name)) as tblPivot;
26 Mar 07:21

Testing SQL Server HA using Hyper-V in Windows 8 Release Preview

by superlatch

I often find the need to build and test SQL Server clusters to support native two-node or N+1 (multi-instance) scenarios and also validate SQL Server 2012 Always On configurations.  I was previously running Windows Server 2008 R2 on my laptop but I a few issues using this as a day to day operating system e.g. no support for standby when hyper-v was enabled, no dual boot (hyper-v on/off) with bitlocker enabled, no Bluetooth support etc.  Fortunately, Windows 8 meets all my needs and the experience has been excellent to date.

Note: for rapid provisioning of other images, I created a fully patched base OS image which was sysprep’d (this is incredibly easy).  I now just copy the image if I want to create other server roles e.g. System Center 2012.

The only issue I have found to date is that host internet connectivity is affected (delayed) after defining multiple internal network adapters, therefore I switched to using Private network adapters.   I only need internal adapters for host to guest connectivity e.g. copying files etc.

My hardware and software is defined below:

  • Laptop HP 8540w (16GB memory, 4 cores hyper-threaded)
  • Two internal SATA disks (hybrid)
  • Windows 8 Release Preview (x64) with Hyper-V enabled

The environment consists of the following virtual guest images:

  • 1 x Domain Controller (also hosts the virtual storage)
  • 1 x Primary Node (node 1)
  • 1 x Secondary Node (node 2)

All servers are running Windows Server 2008 R2 Enterprise Edition x64 with Service Pack 1.  The cluster role is enabled on node 1 and node 2. 

Storage

The shared storage is provisioned using the iSCSI Software Target which I defined on the domain controller.  I didn’t find a need to dedicate a specific storage server.  The screenshot of the virtual storage is presented below

image

 

image

Cluster Validation is good, the only warning was inconsistent OS patch levels on both cluster nodes as shown below.

 

image

After running cluster validation, I created a cluster and the final configuration is shown below. 

Below is a screenshot of the raw cluster configuration.

image

I’ll blog more following the SQL Server install.  I also plan to repeat this for Window Server 2012 and SQL Server 2012.

26 Mar 07:21

Are you interested in Data Science?

by superlatch

The University of Dundee is now gauging interest in a number of data science focused courses (see below).

image

You can register interest via this link http://www.dundee.ac.uk/admissions/new_courses/register.php

26 Mar 07:21

Redux: Using a C# script task in SSIS to download a file over http

by superlatch

A few people have asked for further information about the C# script task which I blogged about (quite a while ago).  I mistakenly forgot to add the full source code, sorry everyone.  Here is the link to the original blog post:  http://blogs.msdn.com/benjones/archive/2009/03/29/using-a-c-script-task-in-ssis-to-download-a-file-over-http.aspx

I have since imported the SSIS package into Visual Studio 2010 (BIDS) and the code compiles without error.  Some of the code below is truncated on the right (just a formatting issue I need to resolve) but the core of the code is there.  

As always, let me know if there are any problems. 

/*
   Download a file over http using Script Task in SQL Server 2008 R2 Integration Services.   
Two key variables, vSSOReportURL, which is constructed in a prior Script Task e.g. http://www..
vSSOLocalFileName, which is the fully qualified reference for the downloaded file e.g. c:\myfile.zip


*/ using System; using System.Data; using Microsoft.SqlServer.Dts.Runtime; using System.Windows.Forms; using System.Net; using System.Net.Security; namespace ST_7e897e41dd5945f3b77366d32f0a97e0.csproj { [Microsoft.SqlServer.Dts.Tasks.ScriptTask.SSISScriptTaskEntryPointAttribute] public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase { #region VSTA generated code enum ScriptResults { Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success, Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure }; #endregion public void Main() { WebClient myWebClient; string RemoteURI; string LocalFileName; bool FireAgain = true; Variables vars = null; Dts.VariableDispenser.LockForRead("User::vSSOReportURL"); Dts.VariableDispenser.LockForRead("User::vSSOLocalFileName"); Dts.VariableDispenser.LockForWrite("User::vSSOReportURLIndicator"); Dts.VariableDispenser.GetVariables(ref vars); try { // Ignore certificate warnings ServicePointManager.ServerCertificateValidationCallback = new RemoteCertificateValidationCallback(delegate { return true; }); // Initiate webclient download, use default credentials (current login) myWebClient = new WebClient(); myWebClient.Credentials = CredentialCache.DefaultCredentials; RemoteURI = vars["User::vSSOReportURL"].Value.ToString(); LocalFileName = vars["User::vSSOLocalFileName"].Value.ToString(); // Log provider notification Dts.Events.FireInformation(0, String.Empty, String.Format("Downloading '{0}' from '{1}'", LocalFileName, RemoteURI), String.Empty, 0, ref FireAgain); // Download the file myWebClient.DownloadFile(RemoteURI, LocalFileName); // Set report URL indicator, this is used to determine the http source of the // download i.e. vSSOReportURL or vSSOReportURLRetry for the message which is // written to the table vars["User::vSSOReportURLIndicator"].Value = 0; // Return success Dts.TaskResult = (int)ScriptResults.Success; } catch (Exception ex) { // Catch and handle error Dts.Events.FireError(0, String.Empty, ex.Message, String.Empty, 0); Dts.TaskResult = (int)ScriptResults.Failure; } } } }

The files are extracted using an Execute Process Task (with 7-Zip) as shown below:

image

And the arguments are set using the expression (below).  There are probably better ways of doing this but I found this worked well.

image

The .zip file is then archived using a File System task and the extracted file is renamed to .xlsx. 

26 Mar 07:21

Redux: Using an SSIS package to monitor and archive the default trace file

by superlatch

I’ve recently been asked for details regarding the SSIS package I wrote to copy and archive the default trace file which is generated by SQL Server.  The contents of the file can be quite useful for troubleshooting or diagnosis purposes.

I’ve updated the package to work with SQL Server 2008 R2 and SQL Server 2012.  

The detection of a new trace file is implemented using a Script Task which watches the \Log folder for new files.  When a new file is detected it copies the previous file to an archive location.   The logic is embedded in Visual Basic.NET (not my personal choice although I was limited in SQL Server 2005 when I wrote the package). 

image

The archive process renames the file with the date and time and then copies the file to a chosen location.   I should point out that I use expressions on a few variables to alter outputs such as the filename i.e. date_time_filename. 

I also noticed that the service needs appropriate permissions to both access the \LOG directory and also copy to the target directory, in my scenario, this was \\server\share.  When I was testing, I launch SSDT (SQL Server Data Tools) using Administrator privileges for testing purposes as a quick workround to permission issues).

image

Here is the code for the Script Task (apologies for the word wrap, the Live Writer plug-in seems to do this to fit it on the page).  I have commented out some of the writeline commands I was using to debug the package when it was initially developed.

Code Snippet
  1. 'Disclaimer:
  2.     'The sample scripts and SSIS package are not supported under any Microsoft standard support program or service.
  3.     'The sample scripts and SSIS package are provided AS IS without warranty of any kind.
  4.     'Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose.
  5.     'The entire risk arising out of the use or performance of the sample scripts and documentation remains with you.
  6.     'In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits,
  7.     'business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.
  8.  
  9.     Public Sub Main()
  10.  
  11.         Dim vars As Variables
  12.         Dim strComputer As String
  13.         Dim objWMIService As Object
  14.         Dim colMonitoredEvents As Object
  15.         Dim objEventObject As Object
  16.         Dim strSourceDirectory As String
  17.         Dim strServerName As String
  18.         Dim strSourceErrorLogDirectory As String
  19.         Dim strSourceErrorLogDirectoryWithQuotes As String
  20.  
  21.         Try
  22.             ' Use the SSIS variables in this code for the WMI query
  23.             strServerName = Dts.Variables("v_ServerName").Value
  24.             strSourceErrorLogDirectory = Dts.Variables("v_DT_SourceLogDirectory").Value
  25.  
  26.             Console.WriteLine("Servername: " + strServerName)
  27.             Console.WriteLine("Monitoring \Log Directory: " + strSourceErrorLogDirectory)
  28.             
  29.             ' Replace \ with \\\\ which is needed for the WMI query
  30.             strSourceErrorLogDirectory = Replace(strSourceErrorLogDirectory, "\", "\\\\")
  31.             strSourceErrorLogDirectoryWithQuotes = Chr(34) & strSourceErrorLogDirectory & Chr(34)
  32.  
  33.             'MsgBox("Server Name: " + strServerName)
  34.  
  35.             ' Connect to the WMI source
  36.             objWMIService = GetObject("winmgmts:\\" & strServerName & "\root\cimv2")
  37.  
  38.             ' Monitor the error log folder for instances of ERRORLOG.1 as this is the file we want to archive
  39.             ' The directory name is parameterised and populated from the SSIS variable
  40.  
  41.             ' Monitor the directory for new default trace files
  42.             colMonitoredEvents = objWMIService.ExecNotificationQuery _
  43.                 ("SELECT * FROM __InstanceCreationEvent WITHIN 10 WHERE " _
  44.                     & "Targetinstance ISA 'CIM_DirectoryContainsFile' and " _
  45.                         & "TargetInstance.GroupComponent= " _
  46.                                 & "'Win32_Directory.Name=" & strSourceErrorLogDirectoryWithQuotes & "'")
  47.  
  48.             objEventObject = colMonitoredEvents.NextEvent()
  49.  
  50.             'MsgBox("A new file was just created: " + objEventObject.TargetInstance.PartComponent)
  51.  
  52.             Dim strReturned, strFilePath As String
  53.  
  54.             strReturned = objEventObject.TargetInstance.PartComponent
  55.             strFilePath = Split(strReturned, "CIM_DataFile.Name=")(1)
  56.             strFilePath = Replace(strFilePath, """", "")
  57.             strFilePath = Replace(strFilePath, "\\", "\")
  58.             'MsgBox("Sliced file: " + strFilePath)
  59.  
  60.             ' strFilePath is C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\LOG\log_nnn.trc
  61.             ' Array element (6) is log_nnn.trc, this assumes the above directory structure
  62.             Dim strFilename As String
  63.  
  64.             'strFilename = Split(strFilePath, "\")(6)
  65.             'MsgBox("Split: " + strFilename)
  66.  
  67.             strFilename = System.IO.Path.GetFileName(strFilePath)
  68.             'MsgBox("IO.Path: " + strFilename)
  69.  
  70.             ' If filename like log_ then enter this code path
  71.             ' The default trace filename is always log_ so we can rely on this for filename matching
  72.             If strFilename Like "log_*.trc" Then
  73.  
  74.                 Console.WriteLine("A new default trace file was just created in \LOG called " + strFilename)
  75.                 Trace.WriteLine("A new default trace file was just created in \LOG called " + strFilename)
  76.  
  77.                 ' Archive the previous default trace file
  78.                 'MsgBox("Default Trace found, now process the file")
  79.  
  80.                 Dim arrContainer As Array
  81.                 Dim intTraceFileNumber As Integer
  82.                 Dim strArchiveFileName As String
  83.  
  84.                 arrContainer = Split(strFilename, "_")
  85.  
  86.                 'Console.WriteLine(arrContainer(0).ToString)
  87.                 'Console.WriteLine(arrContainer(1).ToString)
  88.  
  89.                 ' Split 1111.trc so we only store 1111 to convert to int
  90.                 arrContainer = Split(arrContainer(1), ".")
  91.  
  92.                 ' This is the active default trace file number
  93.                 'Console.WriteLine(arrContainer(0).ToString)
  94.  
  95.                 ' Convert the active trace file number to int and decrease by 1
  96.                 intTraceFileNumber = CInt(arrContainer(0)) - 1
  97.  
  98.                 ' Convert back to string and create the default trace file name
  99.                 strArchiveFileName = "log_" + CStr(intTraceFileNumber) + ".trc"
  100.                 'Console.WriteLine("Archiving: " + strArchiveFileName + " to " + Dts.Variables("v_DT_DestinationDefaultTraceDirectory").Value)
  101.  
  102.                 'MsgBox(strArchiveFileName)
  103.  
  104.                 'Write the filename to the SSIS variable
  105.                 Dts.Variables("v_DT_ActiveFileName").Value = strArchiveFileName
  106.  
  107.                 Console.WriteLine("Archiving: " + strArchiveFileName + " to " + Dts.Variables("v_DT_DestinationDefaultTraceDirectory").Value)
  108.                 MsgBox("Output to SSIS Variable: " + Dts.Variables("v_DT_ActiveFileName").Value + " strFilename: " + strArchiveFileName)
  109.  
  110.                 ' Indicate success to move on to the next step
  111.                 Dts.TaskResult = ScriptResults.Success
  112.             End If
  113.  
  114.             ' Error handling
  115.         Catch ex As Exception
  116.             Console.WriteLine(System.DateTime.Now.ToString + " - SSIS Script Task Error: " + ex.Message.ToString)
  117.             Dts.TaskResult = ScriptResults.Failure
  118.         End Try
  119.  
  120.     End Sub

 

I hope you find this useful.

I will try and attach the SSIS package to this post later.

26 Mar 07:21

Building a data mart to analyse web log traffic

by superlatch
I recently had the opportunity to build a data mart in order to analyse web logs from IIS servers.  The solution utilised SQL Server 2012 Database Engine, Integration Services, Analysis Services and Excel 2010 (to slice/dice the data).   

Loading IIS logs (text files with fixed width spaces) was fairly challenging due as the column definitions would alter throughout.  I therefore used the .NET StreamReader class to read the metadata and detect changes in the fields. 

The data flow is shown below.

image

A snippet of the code in the Load Data (Source Script Component) is presented below

Code Snippet
  1. // Get variables
  2. strSourceFile = Variables.vCurrentSourceFileName;
  3. intAuditLogFileIdentifier = Variables.vAuditExecutionFileIdentifier;
  4.  
  5. try
  6. {
  7.     // Create an instance of StreamReader to read from a file.
  8.     // The using statement also closes the StreamReader.
  9.     using (StreamReader sr = new StreamReader(strSourceFile))
  10.     {
  11.         String line;
  12.         int intNumberOfFields = 0;
  13.         string[] strListOfFields = null;
  14.  
  15.         Trace.WriteLine("Log File: " + strSourceFile);
  16.  
  17.         // Output the source file name as the first line (debugging purposes)
  18.         OutputLogFileRawDataBuffer.AddRow();
  19.         OutputLogFileRawDataBuffer.colRawData = strSourceFile;
  20.  
  21.         // Read and display lines from the file until the end of the file is reached.
  22.         while ((line = sr.ReadLine()) != null)

 

Extracting the data from the file was relatively straightforward.  I placed the string into an array based on the fixed spacing between fields.  From the data, I was able to extract useful information such as browser type i.e. Safari, Chrome, IE and even browser version. 

IP addresses were mapped to geolocation using the free GeoLite information http://www.maxmind.com/app/geolite (CSV data imported into the database).  I converted longitude and latitude to the spatial geography data type and presented this against a world map (I wanted to validate that the GeoLite data correctly mapped to the IP address e.g. country/city to IP address).

   1:  USE BiKitchen;
   2:   
   3:  DECLARE @SRID int = 4326
   4:  DECLARE @pLat nvarchar(max)
   5:  DECLARE @pLong nvarchar(max)
   6:  DECLARE @g geography
   7:   
   8:  -- Check longitude and latitude for London
   9:  SET @pLat = (SELECT CAST(Latitude AS nvarchar(max)) FROM [GeoData].[GeoLiteCity-Location] WHERE locid = 13547)
  10:  SET @pLong = (SELECT CAST(longitude AS nvarchar(max)) FROM [GeoData].[GeoLiteCity-Location] WHERE locid = 13547)
  11:   
  12:  SET @g =  geography::STPointFromText('POINT(' +
  13:          @pLong + ' ' +
  14:          @pLat + ')', @SRID).BufferWithTolerance(11000, 1000, 0)
  15:   
  16:  SELECT @pLat AS Latitude, @pLong AS Longitude
  17:  -- SELECT @g.Lat, @g.Long
  18:   
  19:  -- Map the geography type to base world map data
  20:  -- View the result in the spatial tab to validate coordinates
  21:  SELECT @g AS spatiallocation
  22:  UNION ALL SELECT geog FROM World_Borders
 

The star schema was built and deployed in SQL Server 2012 Analysis Services (UDM).  I found named calculations to be incredibly powerful way of extending the data model and making attributes more meaningful for end-users

image

The data was presented using Excel 2010, a screenshot is shown below.  I found slicers to be extremely useful

image

image

 

I thought it would be interesting to see what Excel 2013 had to offer so I tried to create a Power View report but this is not currently supported against the UDM.   There are however some nice enhancements to chart types so I’ll be looking at this in more detail.

image 

26 Mar 07:20

Using PowerShell to deploy Windows Azure Virtual Machines and Windows Azure SQL Databases

by superlatch

Author: Benjamin Wright-Jones (Microsoft)
Contributors: Karthika Raman (Microsoft)
Technical Reviewers: Guy Bowerman (Microsoft), Sanjay Nagamangalam (Microsoft)

I have recently been exploring the use IaaS (Infrastructure as a Service) to provide cloud-based virtual machines (VM’s) as opposed to laptop-based VM’s and also PaaS (Platform as a Service) for SQL databases.  I like the idea of carrying around a lighter more portable laptop and using cloud services to help me day to day, in contrast to carrying a heavy weight workstation for Hyper-v usage.

I know we can deploy VM’s through the Azure Portal but I prefer an automated approach.  Fortunately the new PowerShell cmdlets support Azure VM provisioning and also Azure SQL database provisioning (plus some other nice interfaces).  This enables me to quickly spin up a SQL Server VM in Azure or SQL database in Azure.   I am actually quite amazed what is possible with PowerShell and it is my new best friend.  PowerShell ISE in Windows 8 is superb, I highly recommend this as a development environment due to the intellisense and cmdlets search pane integration. 

Below is the PowerShell script I wrote to provision a VM.  Unfortunately some manual steps are still required if you wish to manage the SQL Server instance remotely through Management Studio e.g. opening firewall ports, enabling TCP etc.   This is all documented here http://www.windowsazure.com/en-us/manage/windows/common-tasks/sql-server-on-a-vm/.  Fortunately it is a lot easier to connect to an Azure SQL database and we can automate the registration of the provisioned instance in Management Studio.

Azure VM Provisioning (SQL Server 2012 Evaluation Edition)

Step 1. Download and register the Azure publishing certificate (one time only event). 

In order to use PowerShell with the Azure VM and SQL Database services you will need to download and import the publishing file. Fortunately, this is a simple process.  I also store my certificate on Skydrive so I can access it everywhere I go in case I need it again.

Get-AzurePublishSettingsFile
Import-AzurePublishSettingsFile C:\...

If you don’t import the publishing file then you may see an error similar to below when attempting to access the Azure services.

An error occurred while making the HTTP request to https://management.core.
windows.net/ae81ecb1-a8af-4fb7-87c5-4418babb4ff2/services/sqlservers/servers
/<server>?op=ResetPassword. This could be due to the fact that the server
certificate is not configured properly with HTTP.SYS in the HTTPS case. This
could also be caused by a mismatch of the security binding between the client
and the server. 
Step 2. View available subscriptions and set the correct Subscription
Get-AzureSubscription | Select SubscriptionName
Select-AzureSubscription –SubscriptionName

Incidentally, you may also need to associate a specific Azure Storage Vault (ASV) Account with your subscription, for some reason the default value was null so I had to allocate a specific account.  The Azure Storage account is required to host the VM disks which are provisioned during the creation of an Azure VM image.  

Get-AzureStorageAccount | Select Label
Set-AzureSubscription -SubscriptionName "Windows Azure MSDN - Visual Studio Ultimate" 
-CurrentStorageAccount "<storageaccount>"
Step 3. View the available Azure VM images and locations
Get-AzureVMImage | Select ImageName
Get-AzureLocation | Select DisplayName
Step 4. Create an Azure VM

I like my VM’s big! ExtraLarge!  You can use the New-AzureQuickVM syntax, New-AzureVM or New-AzureVMConfig syntax,  The New-AzureQuickVM automatically creates and provisions the VM which does not require any additional steps. 

New-AzureQuickVM -Windows -ServiceName "<azureservice>" -Name "<vmname>" 
-ImageName "b83b3509582419d99629ce476bcb5c8__Microsoft-SQL-Server-2012-
Evaluation-CY13Feb-SQL11-SP1-CU2-11.0.3339.0" –Password <password> -Location "North Europe"
–InstanceSize “ExtraLarge” 

You should see something like this below. 

image

If you see a DNS error then you may be trying to provision a under a duplicate service name e.g. Error "DNS name already exists" is misleading as it refers to a duplicate service name so change this in the parameters. This servicename refers to the VM service by which you reference/connect e.g. <vmservice>.cloudapp.net

Step 6. Start the Azure VM

Start the Azure VM using the command below.

Start-AzureVM -ServiceName "<servicename>" -Name "<vmname>"

You can view the properties of your Azure VM’s by using Get-AzureVM –ServiceName <ServiceName>.  I would also add that additional data disks can be simply added using the Add-AzureDataDisk syntax providing the ability to simply increase the capacity of the provisioned Azure VM instance.

Step 7. Download the  remote desktop connection file to your local desktop!

Another great feature is the ability to automatically download the remote desktop file for the provisioned Azure VM. 

Get-AzureRemoteDesktopFile -ServiceName "<ServiceName>" -name "<vmname>" -LocalPath "$ENV:userprofile\Desktop\myVm01.rdp"

It is also possible to automatically launch and connect to your VM instance using the standard Remote Desktop client, mstsc.exe, which just eliminates another step in the process to connect to your Azure VM. 

mstsc $ENV:userprofile\Desktop\myAzureVm1.rdp

Of course you could parameterise all of this to make life even simpler. 

Step 8. Connect!

Simply enter the username and password off you go, simple.   You could create multiple PowerShell batch files for each Azure VM image type.  I am sticking with the SQL Server 2012 instance for now. 

Teardown

Cleaning up the environment is simple too, just a couple of PowerShell commands to stop and remove the provisioned VM.

Step 9. Stop the Azure VM
Stop-AzureVM  -ServiceName "<ServiceName>" -Name "<VmName>"
Step 10. Remove/delete the Azure VM
Remove-AzureVM -ServiceName "<ServiceName>" -Name "<VmName>"
Step 11. Remove/delete the Azure VM disks

The VHD’s associated with the image are not automatically removed so you will need to issue the Remove-AzureDisk command.  You can view the existing VHD’s and the associated image and container using the Get-AzureDisk command as shown below.  You will notice that I only have one disk (VHD) associated to an image.  The other VHD’s were from previous Azure VM deployments. 

image

Removing (deleting) the VHD is simple.  The –DeleteVHD parameter is required if you wish to permanently delete the image from ASV so use with caution!

Remove-AzureDisk –DiskName <diskname> –DeleteVHD

image

Step 12. Remove the allocated cloud service
Remove-AzureService -ServiceName "<servicename>" 
Step 13. Remember to save your PowerShell script and parameterise it for one-click deployment!

Azure SQL Database Provisioning

On a related noted, it is also possible to provision a Windows Azure SQL Database using the new PowerShell cmdlets allowing me to rapidly deploy a cloud-based relational data store.  You must register the Azure publishing certificate unless this has been done previously (as above in step 1.).

Import-AzurePublishSettingsFile C:\....
Set-AzureSqlDatabaseServer –ServerName <server> –AdminPassword <password>

Creating a new Azure SQL Database instance is also easy, an example is provided below.  I chose North Europe due to my geographical location. 

New-AzureSqlDatabaseServer -location "North Europe" -AdministratorLogin "<login>" 
-AdministratorLoginPassword "<password>”

You’ll need to create a firewall rule so you can connect to the new Azure SQL Database instance:

New-AzureSqlDatabaseServerFirewallRule -ServerName <server> –RuleName <rulename> 
-StartIPAddress "0.0.0.0" -EndIPAddress "222.222.222.222"

As I was working through this, I discovered a VERY useful command which pops up a dialog with the help options for a specific command, an example is shown below.  Omitting the –ShowWindow syntax will output the help details to the console window.

Get-Help Set-AzureSqlDatabase –ShowWindow

The next step was to create a SQL Server authenticated connection to the server hosting the Windows Azure SQL Database.  This is an important step as it establishes the context for the connection. 

#specify sql auth credential
$servercredential = new-object System.Management.Automation.PSCredential("<username>", 
("<password>" | ConvertTo-SecureString -asPlainText -Force)) #create a connection context $ctx = New-AzureSqlDatabaseServerContext –ServerName <servername> -Credential $serverCredential

Incidentally, if you are wondering what is stored in the connection context then see below:

ServerName        : <server>
SessionActivityId : 850d5e6f-7201-4bad-8fd5-331086064d4a
ClientSessionId   : e8d82a6d-0ed2-4aa3-9c38-3c3da924ab6a-2013-03-13 15:49:35Z
ClientRequestId   : d2436b67-ecca-453d-8d7c-12619e599784-2013-03-13 16:02:36Z
Databases         : {master} 

Wondering what databases you can see?  Easy.

Get-AzureSqlDatabase -Context $ctx

Which returns all the databases deployed on the provisioned instance.  In the case below, only the master database was listed as no other databases are currently deployed.

Name          : master
CollationName : SQL_Latin1_General_CP1_CI_AS
Edition       : Web
MaxSizeGB     : 5
CreationDate  : 12/03/2013 22:35:57

Want a new database?  Easy again.

New-AzureSqlDatabase -Context $ctx –DatabaseName <databasename>-Collation SQL_Latin1_General_CP1_CI_AS 
-Edition Web -MaxSizeGB 5

Interestingly, there are a host of DataServiceContext class options made available under the context of the connection such as ServerMetrics and DatabaseMetrics.  This provides some interesting insight into the metadata for your Azure SQL database server such as throttled connections and failures.   Unfortunately, the context commands are not documented right now so this is just exploratory and the exposed properties may be removed in the future. 

$ctx.ServerMetrics.IncludeTotalCount()
$ctx.DatabaseMetrics.IncludeTotalCount()

Beyond the ability to provision a Windows Azure SQL Database using PowerShell cmdlets, I can also save time by automatically registering the Azure SQL instance in SQL Server Management Studio by invoking the SQL Server 2012 PowerShell command New-Item as below (thereby saving even more time!).  The AzureSqlDbServer1 reference is the friendly name which appears in the SQL Server Management console.

Import-Module sqlps
Cd "sqlregistration\Database Engine Server Group"
New-Item AzureSqlDbServer1 -ItemType Registration -Value "server=<server>.database.windows.net; 
integrated security=false; userid=<username>; password=<password>; initial catalog=<databasename>"

Teardown

Removing (or de-provisioning) the Azure SQL database, instance and Management Studio registration is simple.  The last command, Remove-Item, is a SQL Server PowerShell command to delete the Management Studio server registration and this must be invoked using sqlps as above.

Remove-AzureSqlDatabase $ctx –DatabaseName "<dbname>"
Remove-AzureSqlDatabaseServer -ServerName "<AzureSqlDbServer>"
Import-Module sqlps
Remove-Item AzureSqlDbServer1

Closing Remarks

The new PowerShell cmdlets for Azure are a fantastic way to easily provision either VM’s or a database in the cloud.  I will be parameterising my scripts (and including try.. catch blocks) to quickly create an Windows Azure Virtual Machine or Windows Azure SQL Database as needed (one-click deployment made easy!).  PowerShell ISE is also an excellent development environment which can be leveraged for not only Azure VM or SQL database provisioning but also for many more solution scenarios. 

What is missing?

  • Provisioning Azure HDInsight clusters is not currently possible however this should be coming soon http://hadoopsdk.codeplex.com (refer to Programmatic Cluster Management). 
  • PowerShell cmdlets for Azure SQL Reporting are not currently available.
  • The ability to provision an Azure VM image with the full business intelligence stack deployed i.e. SharePoint 2013, PowerView and Power Pivot integration. 
  • PowerShell remoting to be automatically enabled in the Azure VM.
  • The ability to invoke SqlCmd and query Azure SQL databases through PowerShell (inc. support for Federations). 

References

Managing Windows Azure SQL Databases with PowerShell http://gallery.technet.microsoft.com/scriptcenter/Managing-Windows-Azure-SQL-632acc4b

Windows Azure SQL Database Management with PowerShell(http://blogs.msdn.com/b/windowsazure/archive/2013/02/07/windows-azure-sql-database-management-with-powershell.aspx

Getting Started with SQL Server on a Windows Azure Virtual Machine http://www.windowsazure.com/en-us/manage/windows/common-tasks/sql-server-on-a-vm/

26 Mar 07:19

Announcing SQL Server Stretch Database service preview

by SQL Server Team

Microsoft is pleased to announce a preview of the new SQL Server Stretch Database service. With this preview, SQL Server 2016 users can dynamically stretch their on-premises warm and cold data to an Azure service that is engineered from the ground up for compute capacity and virtually infinite storage. Unlike typical cold data management solutions, your data is always at hand and secure by design. You can now keep as much data as you need indefinitely without long maintenance windows or the high costs of traditional enterprise storage.

The Stretch Database service makes remote query processing possible by providing compute and storage in a way that’s completely transparent to the application; using Stretch Database typically doesn’t require any application changes. If added security is required you can use Stretch Database with new Always Encrypted technology, where sensitive data is encrypted before sending to Azure and the encryption key always remains on-premises—extending data in a more secured manner for greater peace of mind.

How to use Stretch Database with SQL Server 2016

Right now, with SQL Server 2016 RCO, when you enable stretching to Azure from SQL Server Management Studio, the default behavior is to stretch to Azure SQL Database. If you would like to preview the new Stretch Database service, you can enroll in the preview and we’ll provide instructions for configuring your SQL Server 2016 database for this new service. For future builds of SQL Server 2016, Stretch Database service will be the default target location for stretching.

Getting started with Stretch Database

If you’re ready to learn more about Stretch Database, read our getting started articles about Stretch Database in SQL Server books online.

If you’re ready to sign up for the Stretch Database preview, register for preview access. To trial SQL Server 2016, you can download an evaluation copy here or try the preview by using a virtual machine in Microsoft Azure.

Questions?

Join the discussion of the new SQL Server 2016 capabilities at MSDN and StackOverflow. If you run into an issue or would like to make a suggestion, you can let us know on Connect. We look forward to hearing from you!

26 Mar 07:19

Technical Overview: SQL Server 2016 Release Candidate 0

by SQL Server Team

The SQL Server engineering team is pleased to announce the availability of SQL Server 2016 Release Candidate 0. This is an important milestone in the release of SQL Server 2016, as it marks feature completion for most dimensions of the product and means a very rich set of capabilities are now available. These include: real-time operational analytics, rich visualizations on mobile devices, built-in advanced analytics, new advanced security technologies, and new hybrid scenarios allowing you to securely stretch data to the cloud.

To learn more about the release, visit the SQL Server 2016 preview page. To experience the new, exciting features in SQL Server 2016 and the new rapid release model, download the preview and start evaluating the impact these new innovations can have for your business.

Questions? Join the discussion of the new SQL Server 2016 capabilities at MSDN and StackOverflow. If you run into an issue or would like to make a suggestion, you can let us know using Microsoft’s Connect tool. We look forward to hearing from you.

For more information on this release, see SQL Server 2016 Release Notes and What’s New in SQL Server 2016.

Database scoped configuration

This release now supports a new database level object that holds optional configuration values that affect the behavior of the application code at the database level. This support is available in both SQL Server 2016 Release Candidate (RC0) and SQL Database V12 using the new ALTER DATABASE SCOPED CONFIGURATION (Transact-SQL) statement. This statement modifies the default SQL Server 2016 Database Engine behavior for a particular database. A generic mechanism for creating database configuration(s) at creation time is not provided.

These options are:

  • Clear procedure cache.
  • Set the MAXDOP parameter to an arbitrary value (1,2, …) for the primary database based on what works best for that particular database and set a different value (e.g. 0) for all secondary database used (e.g. for reporting queries).
  • Set the query optimizer cardinality estimation model independent of the database to compatibility level.
  • Enable or disable parameter sniffing at the database level.
  • Enable or disable query optimization hotfixes at the database level.

The following T-SQL Syntax is supported:

ALTER DATABASE SCOPED CONFIGURATION

{
{ [ FOR SECONDARY] SET }
}
| CLEAR PROCEDURE_CACHE
[;]

< set_options > ::=
{
MAXDOP = { | PRIMARY}
| LEGACY_CARDINALITY_ESTIMATION = { ON | OFF | PRIMARY}
| PARAMETER_SNIFFING = { ON | OFF | PRIMARY}
| QUERY_OPTIMIZER_HOTFIXES = { ON | OFF | PRIMARY}
}

The new functionality is supported for both Azure SQL Database and SQL Server.

PolyBase enhancements in RC0 include:

  • Support for the latest Cloudera distribution CDH5.5 on Linux
  • Support for public containers and blobs in Azure blob storage

SQL Server Management Studio improvements in this release include:

In-Memory OLTP new features in RC0:

  • Parallel scan of non-clustered indexes. All indexes on memory-optimized tables now support parallel scan, leading to increased performance of analytics-style queries that scan large sets of data.
  • Reduced downtime during upgrade: upgrade from earlier builds of SQL Server 2016 no longer runs database recovery; the time it takes to run upgrade is no longer a factor of the data size. For upgrade and attach/restore from SQL Server 2014 the number of database restarts is reduced to one: during upgrade, the database is restarted once, thus the time it takes to upgrade is a factor of the data size.
  • Built-in function @@SPID is now supported natively compiled T-SQL modules and in constraints on memory-optimized tables
  • Log-optimized and parallel ALTER: most ALTER TABLE operations now run in parallel and result in only the metadata change to be written to the transaction log, reducing the time required to run the ALTER operation.

Improved support for large number of foreign key references

With CTP3.3, we added support for efficient DELETE operations for tables with up to 10,000 incoming foreign key references. With RC0, we are adding support for the following:

  • Efficient UPDATE operations on tables with up to 10,000 incoming foreign key references.
  • Efficient DELETE and UPDATE operations for partitioned tables with large numbers of foreign key references.

The following limitations still apply:

  • MERGE operations are not supported for tables with large numbers of foreign key references
  • Large numbers of foreign key references are not currently supported for memory-optimized tables, or stretch database.

New Built-in Table-Valued Function STRING_SPLIT

STRING_SPLIT is a T-SQL function that splits input character expression by specified separator and outputs result as a table.

Syntax:

STRING_SPLIT ( string, separator )
returns: table with a column named value

Examples:

Split comma separated value literal string
SELECT * FROM STRING_SPLIT(‘Lorem ipsum dolor sit amet.’, ”)

The result is:

value
—–
Lorem
ipsum
dolor
sit
amet.

Split comma separated value string in a column
SELECT id, title, value
FROM Article
CROSS APPLY STRING_SPLIT(tags, ‘,’)

Support for import and export of UTF-8 data

Data stored in a UTF-8 encoded file can now be imported into SQL Server and exported from SQL Server into a UTF-8 encoded file, using BULK INSERT T-SQL command and bcp command line utility. As a part of the import process, the UTF-8 encoded strings are read from the file, and converted and stored as native data types that correspond to the target columns in SQL Server tables. As a part of the export process, data are converted from native data types into UTF-8 encoded strings and written into the file. All native data types are supported except Xml, SqlVariant, Text, NText, and Image. The UTF-8 code page is specified by providing the following arguments:

  • CODEPAGE = ‘65001’, DATAFILETYPE = ‘Char’, in case of BULK INSERT command,
  • -c -C 65001, in case of bcp utility.

Examples:

Importing with BULK INSERT into SQL Server from a file
BULK INSERT MyTable
FROM ‘path\file.csv’
WITH ( FIELDTERMINATOR = ‘,’, CODEPAGE = ‘65001’, DATAFILETYPE = ‘Char’)

Importing with bcp utility into SQL Server from a file
bcp MyTable in   “path\file.csv” -T -t , -c -C 65001

Exporting with bcp utility from SQL Server to a file
bcp MyTable out “path\file.csv” -T -t , -c -C 65001

AlwaysOn Availability Groups add support for:

  • Distributed Availability Groups: It is now possible to configure replication between availability groups. Note: these different availability groups can live on different Windows Clusters.
  • Streaming seeding of replicas: New secondary replicas in an Availability Group can now be seeded directly from the primary replica without the need for database backup/restore.

SQL Server Analysis Services (SSAS)

This month’s update to Analysis Services delivers support for display folders for Tabular models, any models created with new SQL Server 2016 compatibility level can now be used with PowerShell and SSIS. Finally, the new Tabular Object model is released to allow easier development and maintenance through code against tabular models. See the Analysis Services blog for more details.

SQL Server Reporting Services (SSRS)

Much more functionality is now available in this next preview of the new Reporting Services web portal. You can now:

  • Subscribe to reports to receive them in your email inbox, download, rename, move, and delete reports and other catalog items
  • Manage shared datasets
  • Manage data caching and refresh
  • Create “linked” reports with different default parameter values
  • Pin to a dashboard in a group when pinning a report chart, gauge, map, or image to a Power BI dashboard
  • Choose the currency for monetary values when creating a KPI or mobile report

SQL Server Integration Services (SSIS)

SSIS Projects in Visual Studio SQL Server Data Tools (SSDT) can now target 2012, 2014 and 2016 versions of SQL Server by switching a project level property between 2012, 2014 and 2016 SQL server. SSIS now supports setting a server wide customized logging level. Finally, the SSIS HDFS connector now supports the ORC file format, in addition to CSV and Avro. See the Integration Services blog for more details.

Master Data Services (MDS) adds a New Derived Hierarchy Management web page. See the Master Data Services blog for more details.

For more information on this release, see SQL Server 2016 Release Notes and What’s New in SQL Server 2016.

26 Mar 07:18

Announcing SQL Server on Linux

by SQL Server Team

Today, we announced our plans to bring SQL Server to Linux. This will enable SQL Server to deliver a consistent data platform across Windows Server and Linux, as well as on-premises and cloud. We are bringing the core relational database capabilities to preview today, and are targeting availability in mid-2017.  Read Scott Guthrie’s blog post to learn more.

26 Mar 07:16

How to make text searching go faster

by Rob Farley

...but first, let’s look at one oft-forgotten reason why finding a particular piece of text can be slow, if not impossible: collation. This will then provide a useful platform for making it go faster.

I say ‘impossible’, but of course it’s never impossible to find something in a database (assuming it’s there). It might take longer, but you can always scan the column for it, starting on the first page and going until you’ve found it. Various things like Full Text Search can help make things easier, but all-too-frequently we see code that searches for SomeText%, or worse: %SomeText%. This is the thing I want to look at – finding non-indexed strings patterns.

First let’s remember that if we are hoping to use an index, we need to know what language we’re in. I have spoken before about how I picked up a map in Sweden to find Västerås, but couldn’t find it listed in the map’s index. I didn’t realise that in Swedish, ‘ä’ and ‘å’ are not the same as ‘a’, and found at a different part of the alphabet. When I searched using an English alphabet, I couldn’t find the entry. I might think that ‘Västerås’ and ‘Vasteras’ are the same, but a Swedish person would tell me otherwise. It’s like if I refer to a game as ‘football’, you would need to understand my personal collation setting to know what I was talking about. When Michael Palin sung (as a lumberjack) about wearing high-heels, suspenders and a bra, he wasn’t referring to anything that held his trousers up, despite what people using an American collation setting might think.

But this is about making searches for text go faster. If we’re comparing two strings in a different collation we get an error, but let’s think about speed.

Consider that I’m looking for rows in a table WHERE CommentText LIKE '%Farl%'. Right away, I’m sure you appreciate that no amount of regular indexing on CommentText would let me perform an ordinary b-tree index search to find that row. I could improve it by using other technologies that will allow the individual words in my text to be found, but I’m just looking for a particular piece of text. It’s not even a whole word.

For my experiment, I’m using a table on SQL 2014 on my Surface Pro 2. It’s a larger version of AdventureWorks2012’s Person.Address with 19 million rows. There is a column called AddressLine1, which has collation SQL_Latin1_General_CP1_CI_AS and has type nvarchar(120). You can create it using code like this:

CREATE TABLE [Person].[Address_Big](
    [BigAddressID] [int] IDENTITY(1,1) NOT NULL,
    [AddressID] [int] NOT NULL,
    [AddressLine1] [nvarchar](60) NOT NULL,
    [AddressLine2] [nvarchar](60) NULL,
    [City] [nvarchar](30) NOT NULL,
    [StateProvinceID] [int] NOT NULL,
    [PostalCode] [nvarchar](15) NOT NULL,
    [SpatialLocation] [geography] NULL,
    [rowguid] [uniqueidentifier] NOT NULL,
    [ModifiedDate] [datetime] NOT NULL
);
go

insert Person.Address_Big
select * from Person.Address;
go 100

I ran this query quite a few times, and it took about 40 seconds to tell me there were no rows returned.

select *
from Person.Address_Big
where AddressLine1 like N'%Farl%'
option (maxdop 1);

Obviously no one lives on 203 Farley Avenue, or 1 Farlabulous Drive. But nor do they live at 711 Gofarles Street. You see, despite the fact that I had specified ‘Farl’ with a capital F and lower-case ‘arl’, it didn’t care about that at all. My collation setting told it explicitly to ignore case. That’s what the CI was for in SQL_Latin1_General_CP1_CI_AS. In fact, if we query select * from fn_helpcollations() where name = 'SQL_Latin1_General_CP1_CI_AS'; we see it says “Latin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 52 on Code Page 1252 for non-Unicode Data”. So it’s not only case-insensitive, it’s also kanatype-insensitive and width-insensitive too.

Clearly there is a lot more work for it to do, when scanning large amounts of text looking for a group of consecutive characters that could match inexactly.

Now, without changing my table at all, I changed my query like so, telling it to use a binary collation for the search. To search exactly rather than inexactly.

select *
from Person.Address_Big
where AddressLine1 like N'%Farl%' collate Latin1_General_BIN
option (maxdop 1);

I could’ve used the SQL collation SQL_Latin1_General_CP437_BIN, but I find it easier to remember the Windows collation, and in Australia, the default settings for SQL are to use Windows collations. They’re a little better than the SQL collations [citation needed].

But anyway – this query returned in just 7 seconds. I re-ran the original one – 40 seconds. I re-ran this one – 7 seconds. It really was significantly faster. The plan is the same. There is no sneakiness going on. The search for the binary text was simply faster.

This makes sense. If I’m looking for a particular string, it’s going to be quicker if I can just look for the exact bits, and not have to consider what the text might be in a different case, or if width needs to play a part, and so on.

Now you might think “Great – I’m going to add that to all my string searches”, but you should understand that there is potential for the results to be different. If there were someone in my table who lived in FARLEY BOULEVARD (in all caps, in the way that French people often write their surnames, for example), then that would have been found in my case-insensitive-collation search, but not in my binary-collation search for the lower-case letters ‘arl’. It’s useful if the data in your system is only stored in capitals, in which case (ha!) you could actually change the collation of your column, but it’s definitely worth considering the benefits of asking for a collation-specific search.

And what about grouping, you ask? Ok, maybe I didn’t hear you ask that, but let’s pretend you did.

If there’s an index on the column you’re grouping, then changing the collation is going to hurt a bit. Grouping could take advantage of a Stream Aggregate under our indexed collation, but changing the column is like throwing it away the index order (ORDER BY doesn’t get handled well by changing the collation) means a Hash is required. But comparing two query plans that both use Hash Match (Aggregate), one on a case-insensitive collation and one on a binary collation, then I found the latter was slightly faster. Not as drastic a change as searching, but still 10-30% better. One would run in about 12 seconds, and one in about 10.

select City, count(*)
from Person.Address_Big
group by City
option (maxdop 1);

select City collate Latin1_General_BIN, count(*)
from Person.Address_Big
group by City collate Latin1_General_BIN
option (maxdop 1);

Considering what’s going on with a hash function and non-exact strings is actually pretty interesting. HASH(Value) must produce the same value for any two values that are considered equal – such as ‘FARLEY’ and ‘Farley’ in my CI collation. For this to happen, it obviously can’t hash the actual values, it must have to convert the values into a common form that will hash the same way regardless of case, kana, and width. But this is work that is hidden from the query plan. We can see the impact of it through the query speed, but not anywhere in the plan. This will become yet another thing for me to investigate – but not this week before T-SQL Tuesday comes around and I need to publish this post. New father Bob Pusateri (@sqlbob) is hosting this month, about text searching, in case you hadn’t guessed.

TSQL2sDay150x150

@rob_farley

26 Mar 07:13

SQL Server 2016 SSIS Data Flow Buffer Auto Sizing capability benefits data loading on Clustered Columnstore tables

by Denzil Ribeiro

 

Reviewers: Joe Sack, Arvind Shyamsundar, Murshed Zaman, Sanjay Mishra

 The focus of this blog is to introduce SSIS changes in the Data Flow task in SQL Server 2016 that help data loading into Columnstore tables.

On SQL Server relational data warehouses, Columnstore indexes are being widely used due to the immense value realized by customers from both a performance aspect as well as cost savings due to compressed size on storage.  SSIS is a popular ETL tool that is often used to load data into the warehouse. We covered some of the Data loading best practices with Data Loading Performance Considerations with Clustered Columnstore indexes  on SQL Server 2014 and had mentioned a few caveats with choosing appropriate Maximum insert commit size and its implications on rows potentially landing in a delta store uncompressed due to the Maximum buffer size in SSIS. For details see “SSIS Data Flow Task Max Insert Commit Size and implications on Row group sizes section” here.

On SQL Server 2016, looking at the default SSIS Data Flow Properties you see that the default value for “DefaultBufferSize” is 10MB and the DefaultBufferMaxRows is 10,000.

image

Given the rowgroup max size of 1,048,576 if you try to force your Maximum insert commit size as shown below, you could end up getting very few rows committed per insert and more importantly they can all end up in the delta store which is a rowstore. How many rows are inserted in one transaction totally depends on the row size as to how many rows would fit in a 10MB buffer. Prior to SQL Server 2016, you could only adjust the DefaultBufferSize to a maximum of 100MB. For additional details see the article: SSIS Buffer Sizing

 image

An extended event capture below shows us that each bulk insert ended up with a batch size of 5825 rows.

image

From the DMV output below, you can see that all the rows end up in the Delta store. You could adjust the BufferSize to 100MB and that row count could increase to a higher number but it all depended on the size of the row. Ultimately the rowgroup will fill up and close and will be compressed by Tuple mover but that is not the most efficient way to load data as soon after loading the data is not compressed.

select * from sys.dm_db_column_store_row_group_physical_stats
where object_id = object_id('FctSales')

clip_image008[8]

AutoAdjustBufferSize Data Flow Property

SQL Server 2016  introduces a new Data Flow property called AutoAdjustBufferSize which when set to “true” ignores the DefaultBufferSize setting and the actual buffer size is calculated based on the DefaultBufferMaxRows. The formula used for calculating the Buffersize is specified below

BufferSize = DefaultBufferMaxRows (set at Design time) * Rowsize calculated at runtime.

Given this, you can set my Maximum Insert commit size to an appropriate batch size (a good start is 1,048,576) which is very useful in case of a large data transfer so you are not committing the entire large transfer in one transaction. Looking at the prior example, if I set AutoAdjustBufferSize property to true and set the DefaultBufferMaxRows to 1,048,576 which is the max row group size, you see a difference in behavior where that Max Insert commit size is
honored as the Buffer size is adjusted automatically.

clip_image010[8]image

 Extended Events showing the inserts from the SSIS package show us honoring the Max Insert commit size.

image

And the goal is to get the data into compressed rowgroups directly as you can see below. As an aside, the state INVISIBLE is an in-memory state of a compressed rowgroup before made available to users.

image

Performance test results

From a performance perspective these settings do have ramifications on the load times. This is a sample table used in load tests depicted by the figure below, the row size of this sample table was 1,674 bytes. These test results are on a specific VM with a specific table definition, results of performance in other scenarios will depend on the configuration of the machine, and the table definition. 

Test SSIS Data flow configuration Effective Batch size Rows Elapsed Time
1 AutoAdjustBuffer = FalseDefaultBufferMaxRows = 10000 (default)DefaultBufferSize = 10485760 (default) 10,000 30 million 15:54
2 AutoAdjustBuffer = FalseDefaultBufferMaxRows = 1048576DefaultBufferSize = 104857600 62,638 30 million 12:03
3 AutoAdjustBuffer = TrueDefaultBufferMaxRows = 1048576 1,048,576 30 million 5:56

In the test #2 in the table above, given the max default buffer size is 100MB (which was the maximum on SQL Server 2014), given the individual row was 1674 bytes, at max you would get 62638 rows to fill that buffer before SSIS issued a commit which means it all ended up in the Delta store. Only Test 3 in this scenario ended up in compressed row groups directly.

Note: Increasing the buffer size can result in paging which can be monitored with the “Buffers Spooled” performance counter if the machine running the SSIS package cannot sustain that buffer size in memory

AutoAdjustBufferSize  property of the data flow task is a great addition in SQL Server 2016 to aid in data loading scenarios.

 

26 Mar 07:11

Cross-database queries in Azure SQL Database

by James Serra

A limitation with Azure SQL database has been its inability to do cross-database SQL queries.  This has changed with the introduction of elastic database queries, now in preview.  However, it’s not as easy as on-prem SQL Server, where you can just use the three-part name syntax DatabaseName.SchemaName.TableName.  Instead, you have to define remote tables (tables outside your current database), which is similar to how PolyBase works for those of you familiar with PolyBase.

Here is sample code that, from within database AdventureWorksDB, selects data from table Customers in database Northwind:

--Within database AdventureWorksDB, will select data from table Customers in database Northwind

--Create database scoped master key and credentials

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<password>';

--Needs to be username and password to access SQL database

CREATE DATABASE SCOPED CREDENTIAL jscredential WITH IDENTITY = '<username>', SECRET = '<password>';

--Define external data source

CREATE EXTERNAL DATA SOURCE RemoteNorthwindDB WITH 
           (TYPE = RDBMS,
            LOCATION = '<servername>.database.windows.net',
            DATABASE_NAME = 'Northwind',  
            CREDENTIAL = jscredential 
            );

--Show created external data sources

select * from sys.external_data_sources; 

--Create external (remote) table.  The schema provided in your external table definition needs to match the schema of the tables in the remote database where the actual data is stored. 

CREATE EXTERNAL TABLE [NorthwindCustomers]( --what we want to call this table locally
	[CustomerID] [nchar](5) NOT NULL,
	[CompanyName] [nvarchar](40) NOT NULL,
	[ContactName] [nvarchar](30) NULL,
	[ContactTitle] [nvarchar](30) NULL,
	[Address] [nvarchar](60) NULL,
	[City] [nvarchar](15) NULL,
	[Region] [nvarchar](15) NULL,
	[PostalCode] [nvarchar](10) NULL,
	[Country] [nvarchar](15) NULL,
	[Phone] [nvarchar](24) NULL,
	[Fax] [nvarchar](24) NULL
)    
WITH
(
  DATA_SOURCE = RemoteNorthwindDB,
  SCHEMA_NAME = 'dbo', --schema name of remote table
  OBJECT_NAME = 'Customers' --table name of remote table
);

--Show created external tables

select * from sys.external_tables; 

--You can now select data from this external/remote table, including joining it to local tables

select * from NorthwindCustomers

--Cleanup

DROP EXTERNAL TABLE NorthwindCustomers;

DROP EXTERNAL DATA SOURCE RemoteNorthwindDB;

DROP DATABASE SCOPED CREDENTIAL jscredential;  

DROP MASTER KEY;  

More info:

Elastic database query for cross-database queries (vertical partitioning)

26 Mar 07:11

ERwin Modeling Products Sale is Final

by Karen Lopez

image

CA announced today that CA ERwin Data Modeler and the rest of the modeling business (people, content, communities, etc.) have been sold to Parallax Capital Partners

CA has completed the sale of the ERwin data modeling business to Parallax Capital Partners, a private equity firm with an exceptional track record of transitioning divisions, subsidiaries and product lines into successful stand-alone entities.

The transaction, which closed on February 29, is a win-win scenario that was carefully designed to ensure mutual value and a seamless transition for customers, partners, and each of the approximately 60 ERwin employees worldwide. This move also aligns with our global partner strategy, which is an important component to CA’s growth model.

With this divestiture, ERwin is an independent company that will continue to be led by its current management team.

Parallax Capital is a private equity firm that specializes in lower middle market (between $5 and $100 million) software companies.  In looking at their current portfolio, I recognize only a couple of companies, with Micro Focus being the one that I recognized instantly, but they sold that in the early 2000s.  Parallax owns a diverse set of companies, so I’m not sure where they will go with the ERwin Modeling product set.

What I do know is that CA was clear after the failed Embarcadero purchase attempt that they were still intending to sell off ERwin, so a purchase is important to the ERwin user market.  I have no other information and expect that initial communications will be that everything is remaining the same until it changes.

This quote: “This move also aligns with our global partner strategy, which is an important component to CA’s growth model. “ appears to imply that CA did not consider data modeling a growth area of the enterprise software business.  As sad as that is, I agree.

My initial feelings are that having the ERwin business owned by an entity that does not own a competing product is likely best for customers.  Competition is good, for technical quality, innovation and pricing.

UPDATE: a new, more upbeat announcement has gone up on ERwin.com http://erwin.com/resources/news/erwin-divested-from-ca-technologies/

What do you think the impact of this sale will be on you and the data modeling market?

26 Mar 07:09

Conor vs. SQL Injection

by Conor Cunningham [MSFT]

While I was at SQL Intersection yesterday, I was asked if Microsoft has any best practice guidance on avoiding SQL Injection attacks.

I asked around a bit and, sure enough, yes we do.

Best Practice Guidance on SQL Injection Proections

26 Mar 07:09

Conor vs. SQL Azure Optimizer Hotfix Traceflag 4199

by Conor Cunningham [MSFT]

Another question:  Does Windows Azure SQL Database (aka SQL Azure) run with or without the optimizer hotfix traceflag (4199) from normal SQL Server?

Answer:  It currently does NOT run with 4199 enabled. 

In the future, we may be able to expose something more specific and advanced here.  Customers needing to have this level of control can use SQL Server 2012 in a Virtual Machine instead on Windows Azure.  FWIW, we have not seen this to be a significant issue on SQL Azure so far (in terms of escalated support cases).  Obviously this can change in the future.

Thanks,

 

Conor

26 Mar 07:07

T-SQL UDTs. (Huh!) What are they good for?

by bartduncan

(The title of this post might seem a little inflammatory, but it’s not just a troll -- I really do think that people should seriously question whether it’s wise to use T-SQL User Defined Data Types.  Mostly, though, I just like that Edwin Starr song.) 

 

Let me start this post off by saying that I understand why user-defined data types (UDTs) in T-SQL seem alluring.  At first glance, you might expect UDTs to give you some very nice benefits, such as:

 

 

Expected Benefit #1: Suppose you have a field that stores percentages values (0-100).  In T-SQL you might choose the tinyint data type for such a field, but this isn’t the only data type that you could have chosen to store percentage values.  If there is more than one table that contains a percentage-type field, you could end up with the same sort of value being stored using different data types.  By creating a UDT called PERCENT, you can avoid the need for database developers to memorize “Percentages should be stored using tinyint.”

 

Reality: This is based on the idea that it’s easier for a developer to memorize “Percentages are stored using a UDT named PERCENT” than it is to memorize “Percentages are stored using system type tinyint.”  The problem is that you have to know the underlying data type in order to deal with the column correctly.  For example, a database developer that was retrieving a PERCENT (UDT) field in a VB.NET or C# app would need to know what .Net data type should be used to retrieve values from the field.  A thoughtful dev would wonder: “Maybe I should store this PERCENT value in a byte variable, which can hold values from 0 to 255.  But wait: Can a PERCENT column hold a negative percentage?  If so, I’ll need to use the short data type, which can represent negative integers.  Hold on a second!  I wonder if this PERCENT data type can store fractional percentages?  If so, I’d need to retrieve the column value into a decimal variable...” 

 

In other words, if you don’t use a UDT, the application developer has to memorize “Percentages are stored using system type tinyint.”  But if you use a T-SQL UDT the application developer has to know both “Percentages are stored using a UDT named PERCENT” and “Percentages are stored using system type tinyint.”  The attempt to make the app developer’s life easier can backfire and actually made it a little harder.  (It’s worth mentioning that SQLCLR UDTs don’t have this same limitation.  App developers can reference an assembly that gives them a client-side data type that corresponds to the server-side SQLCLR UDT, so the developer doesn’t have to know the details of the UDT’s implementation in order to use it.)

 

 

Expected Benefit #2: A tinyint can store 0-255, but your percentage fields should only allow 0-100.  You can restrict a column’s domain by defining a CHECK constraint on the column.  But if there are several percentage type columns in several different tables, you’d need to define identical CHECK constraints on each column.  Wouldn’t it be nice if you could attach the constraint to the PERCENT user-defined data type, so that it would apply automatically to each column that used this UDT? 

 

Reality: This would be a real benefit of T-SQL UDTs.  Unfortunately, the ability to attach a rule to a UDT has been officially deprecated, so you shouldn’t use this functionality.  Since you can no longer attach custom domain constraints or other business rules to the data type, a T-SQL UDT is really nothing more than an alias for a system data type.  Sorry!

 

 

Expected Benefit #3: Suppose that your requirements have changed, and now you need to be able to store fractional percentages.  You decide to store percentages as decimal(3,2) instead of tinyint.  If all percentage columns were defined as tinyint, you’d have to go around to all of the columns and modify each column’s data type separately.  And to change a column’s data type you must first drop any indexes or constraints that reference the column.  What a pain.  Wouldn’t it be easier if you could just change the PERCENT UDT definition in one place?

 

Reality: T-SQL doesn’t have an ALTER TYPE command for UDTs.  To change a UDT definition you must do a DROP TYPE followed by CREATE TYPE.  But of course you can’t drop a type that is being used in a column, so you have to do an awkward little dance to switch all of the columns that use the UDT to use the underlying system data type, instead (so they no longer reference the UDT), then drop and redefine the UDT, then ALTER the columns a second time to re-reference the new UDT. 

 

                ALTER TABLE tbl1 ALTER COLUMN foo_pct TINYINT;

                ALTER TABLE tbl2 ALTER COLUMN foo_pct TINYINT;

                ALTER TABLE tbl3 ALTER COLUMN foo_pct TINYINT;

                ALTER TABLE tbl4 ALTER COLUMN foo_pct TINYINT;

 

                DROP TYPE PERCENT;

                CREATE TYPE PERCENT FROM DECIMAL(3,2);

 

                ALTER TABLE tbl1 ALTER COLUMN foo_pct PERCENT;

                ALTER TABLE tbl2 ALTER COLUMN foo_pct PERCENT;

                ALTER TABLE tbl3 ALTER COLUMN foo_pct PERCENT;

                ALTER TABLE tbl4 ALTER COLUMN foo_pct PERCENT;

 

And if any indexes or constraints reference any of these columns, you’d still have to drop those before the first ALTER COLUMN, and recreate them afterward.  If you have to write DDL that redefines the columns using the base data type that the UDT is supposed to hide, where’s the abstraction benefit?  More importantly, if you didn’t use UDTs this same change can be done with less than half the code:  

 

                ALTER TABLE tbl1 ALTER COLUMN foo_pct DECIMAL(3,2);

                ALTER TABLE tbl2 ALTER COLUMN foo_pct DECIMAL(3,2);

                ALTER TABLE tbl3 ALTER COLUMN foo_pct DECIMAL(3,2);

                ALTER TABLE tbl4 ALTER COLUMN foo_pct DECIMAL(3,2);

 

 

So there you have the reasons why I avoid using T-SQL user-defined types in my own code.  It’s not that UDTs are all that bad; it’s just that in my experience they fall short of their promise, and they can make some things even more cumbersome than the same task would be without the use of UDTs.  If you think I’m overlooking some concrete benefit of UDTs, I’d love to hear from you in the comments section.  If you do choose to use UDTs, just be sure you aren’t basing the decision to use them on inaccurate assumptions about the benefits they’ll provide. 

 

 

An important caveat: The above applies to simple T-SQL UDTs.  You can also create a user-defined type in SQL Server using SQLCLR and .Net.  While a T-SQL UDT is never anything more than a simple alias for one of the system data types, a SQLCLR UDT can be a truly custom data type.  With SQLCLR you can implement complex types, embed custom business logic like domain constraints or validation rules into the type, and so on.  In other words, with SQLCLR UDTs you can actually realize the benefit that I called “Expected Benefit #2”, above.  As mentioned earlier, you also get much of Expected Benefit #1.  SQLCLR UDTs still have the limitations described in #3, but the ability to extend SQL Server in the way that SQLCLR allows is very powerful.  That benefit could easily compensate for the issue discussed in #3, if the task you are trying to accomplish would be much harder to implement using one of the system data types. 

 

 

 

26 Mar 07:07

Living with SQL's 900 Byte Index Key Length Limit

by bartduncan

We recently had a situation where we needed to interface with an external non-relational data source that happened to use really long strings to identify entity instances. These identity strings had theoretically unbounded length, but in practice they were never more than 1000-2000 characters. The database layer needed to accept an identifier of this sort and use it to retrieve some information from the relational store that was related to that object. I’ll call the entity type “widget”, and the long string identifier from the external system the “external key”.

 

We could just store the external key in the database table and use it to locate the correct widget, but we couldn’t afford to scan the widget table for every lookup. The number of rows in the widget table was expected to be large, so we needed the lookup to be indexed. Unfortunately, SQL has a limit of 900 bytes for the keys in each index, and the external key values can be longer than this. If you create an index on combination of columns that could contain more than 900 bytes, you get this warning:

 

Warning! The maximum key length is 900 bytes. The index 'xyz' has maximum length of 4000 bytes. For some combination of large values, the insert/update operation will fail.

 

Then if you try to insert a row into the table that has more than 900 bytes in the indexed columns, it will fail with this error:

 

Msg 1946, Level 16, State 3, Line 11

Operation failed. The index entry of length 1501 bytes for the index 'xyz' exceeds the maximum length of 900 bytes.

 

Even if this restriction within SQL didn’t exist, I wouldn’t ever think of creating a table with a >1KB primary key. A surrogate primary key like an IDENTITY column or a GUID/UNIQUEIDENTIFIER column should be used, instead. You'd use the surrogate as your primary key, and all lookups and foreign keys within the data tier should use the surrogate. But the external system doesn’t know anything about the surrogate keys used in the data tier; the only identifier that the external system can pass to the database is that >900 byte external key value. So while a surrogate for primary key is a good idea here, it doesn’t solve the immediate problem – we still need to be able to do an efficient (indexed) lookup of an external key value in the database, even though a column that could store the external key values would be too large to index.

 

If you find yourself in a similar situation, here are some alternatives to consider:

 

 

A. Change the external system to use an identifier that is always less than 900 bytes long. Examine the entity’s attributes: is there some combination of attributes that together make up a candidate key (in other words, is there some combination of other attributes that is guaranteed to be unique for each entity instance)?  Even if the external system doesn’t internally use those columns as an identifier, it could still pass them to the data tier when it wanted to look up a widget.  In the database, you’d create a UNIQUE index on this column or combination of columns.  

 

 

B. Require non-key columns as additional criteria for each lookup. For example, suppose a widget has a “Name” attribute that is always short enough to be indexed. Name is not guaranteed to be globally unique, so neither the external system nor the relational model can use this as a key. But if the column is fairly selective it may be sufficient to give you lookups that are efficient enough that it’s almost like doing an index seek on the external key. In this example you would index the [name] column in the database, and use a routine for lookups like this.

 

   CREATE PROC usp_get_widet @external_key VARCHAR(MAX), @name NVARCHAR(60) AS
   SELECT *
   FROM widgets
   WHERE external_key = @external_key AND name = @name;

 

SQL will use a seek on the non-clustered [name] index to narrow the results down to a tiny percentage of the table (hopefully no more than a few rows), then it will do a residual filter to get rid of all but the single row that matches the external key value. Note that you can add more than one supplemental search column if necessary to increase the selectivity of the non-key index seek. Also be sure to note that you still need to do the filter on the [external_key] column. That predicate won’t be satisfied by an index seek, but it’s still necessary for correctness. Consider making the full external key INCLUDEd in the nonclustered index, to minimize the number of bookmark lookups the engine needs to do to find the single matching row.

 

One obvious downside to this approach is that the additional required search criteria complicate the interface between the database and the external component.

 

 

C. Truncate the external key to 900 bytes and index the truncated portion of the key. For example, your widget table might look like this:

 

   CREATE TABLE widgets (
       surrogate_key INT PRIMARY KEY IDENTITY,
       external_key VARCHAR(MAX),
       external_key_fragment AS LEFT (external_key, 900),
       ...

 

You’d create a (nonunique) index on the [external_key_fragment] column, and your lookup procedure might look something like this:

 

   CREATE PROC usp_get_widet @external_key VARCHAR(MAX) AS
   SELECT *
   FROM widgets
   WHERE external_key = @external_key
       AND external_key_fragment = LEFT (@external_key, 900);

 

Of course, this assumes that some predictable 900-byte portion of the external key is reasonably selective. If it’s possible for a large portion of the rows to have the same initial 900 bytes (assuming the fragment is generated in the way shown above), then this won’t help any – the optimizer will correctly estimate that a scan would be more efficient than a seek, and you’ll still end up scanning the entire table for each lookup.

 


D. Break up or parse the external key into its component parts, and index a selective subset of the parts.  This assumes that the long external key is really a serialized version of a composite key. It must have some underlying structure, and it must be practical to break the key into its child parts within the data tier. It also assumes that some portion of this structured key is guaranteed to be unique enough to support an efficient index seek that only returned a few rows.

 

For example, suppose that the external key was string that actually encoded a hierarchy like “RootIdentifierX\ChildIdentifierY\GrandChildIdentifierZ\...”  If you’re willing to push knowledge of the structure of this serialized key down into the data tier, the database could parse this apart and use the identifiers for the first few levels of the hierarchy to do a secondary lookup that could be satisfied via an index seek. You’d need to store those parts of the external key in separate columns (so they could be index) in addition to the full external key.  Once you’ve indexed the portions that you can, consider adding the remainder as an INCLUDEd column in the index; that would allow the index seek to locate the single matching row without resorting to a bookmark lookup. (It would also make it less likely that the optimizer would choose to scan the table because it guessed that a seek on the component parts wouldn’t be selective enough.)

 

 

E. Hash the external key and index the hash. This is essentially building a hash table and persisting it in the database. With this approach, your widgets table might look something like this:

 

 CREATE TABLE widgets (
     surrogate_key INT PRIMARY KEY IDENTITY, 
     external_key VARCHAR(MAX), 

     -- A computed column that hashes the external key (MD5 returns a 
     -- 128-bit hash). 
     external_key_hash AS CAST (HASHBYTES('MD5', external_key) AS VARBINARY(16)), 
     ...

 

The [external_key_hash] column would be indexed, and the lookup proc would do this:

 

 CREATE PROC usp_get_widet @external_key VARCHAR(MAX) AS 
 SELECT * 
 FROM widgets 
 WHERE external_key = @external_key 
     AND external_key_hash 
          = CAST (HASHBYTES('MD5', @external_key) AS VARBINARY(16));

 

It could be argued that the CAST(HASHBYTES(…)) stuff should be hidden within a UDF (ditto for alternatives (C) and (D)):


   CREATE FUNCTION dbo.fn_hash_external_key (@urn nvarchar(max))
   RETURNS VARBINARY(16)
   WITH SCHEMABINDING
   AS
   BEGIN
      RETURN HASHBYTES ('MD5', @urn);
   END;

 

The computed column and the lookup proc would both reference this function so that the details of the hash function lived in a single place. That would indeed be cleaner, but before taking that route be sure you’re aware of the potential problems with scalar UDF performance that are described at http://www.developerfusion.com/community/blog-entry/8389063/tsql-scalar-functions-are-evil/ and http://sqlblog.com/blogs/adam_machanic/archive/2006/08/04/scalar-functions-inlining-and-performance-an-entertaining-title-for-a-boring-post.aspx

 

 

 

Of all of these, option (A) is the only one that I find very aesthetically pleasing.  The others could work in a pinch, though.  Please drop me a comment if I’m overlooking some other alternative here.

(Cross-posted here)

26 Mar 07:07

Query Tuning Fundamentals: Density, Predicates, Selectivity, and Cardinality

by bartduncan

A few days ago I was in the middle of writing up a quick post about a query performance problem I recently ran into. The writeup referenced predicate selectivity, and I found myself wondering whether everyone who came across the post would have a good intuitive understanding of what that referred to. Just in case, I thought I'd do a quick overview of some terms that you'll definitely run across if you're doing much query tuning. You can find this information in other forms elsewhere on the Internet, but these are practical descriptions that I've found to be useful when introducing people to query tuning concepts. 


Density
Density, when used to describe the data in a column, is a measure of how often duplicate values occur in that column. Another way to think of density is as a measure of the uniqueness of the data in a column: high density --> less unique data. Density values range from 0 to 1.0.  There are different (but equivalent) ways to think of density.

Density = 1/[# of distinct values in a column]
Density = Avg. number of duplicates for a given value / Total row count

Consider a Customers table with a CountryID column and a RegionID column. Suppose that 100 different CountryID values are present in the table, and 1000 different RegionIDs are present. The density of the CountryID column would be higher than the density of the RegionID column because a typical country is less unique than a typical Region. Note that this is an overall characterization of a range of values. There may be some regions that occur more frequently than some countries -- a U.S.-based company might have more customers in California than in Estonia, for example.  But CountryID still has higher density than RegionID simply because the average country is represented in more rows than the average region.

Density of the CountryID column:

Density = 1/[# of distinct values in the column]
Density = 1/100 = 0.01

Note that the density of 0.01, or 1%, corresponds to the percentage of the total rows that would be returned by a query for a single value.  In other words, an average country makes up 1% of the table.  if there were 10,000 rows in the table, there were be 100 rows (1% of 10,000) per country, on average.  

Density of the RegionID column:

Density = 1/[# of distinct values in a column]
Density = 1/1000 = 0.001

The density of the CountryID column (0.01) is higher than the density of the RegionID column (0.001) because the CountryID column is less unique.

 

Q: What is the density of a column that contains only a single value repeated in every row?
A: 1.0 -- the highest possible density.

Q: What is the density of a unique column, such as an IDENTITY or primary key column?
A: 1/[row count]

Hands-On Example
Here is a script to create the scenario described above. 

use tempdb
go
create table customers (customerid int identity, country varchar(10), region varchar(10))
go
set nocount on
begin tran
declare @x int
set @x = 1
while (@x <=10000)
begin
    insert into customers (country, region) values ('Country' + convert (varchar, @x % 100), 'Region' + convert (varchar, @x % 1000))
    set @x = @x + 1
end
while @@trancount > 0 commit tran
go
create statistics stats_country on customers (country) with fullscan
create statistics stats_region on customers (region) with fullscan
create statistics stats_country_region on customers (country, region) with fullscan
go
dbcc show_statistics (customers, stats_country)
dbcc show_statistics (customers, stats_region)
go

Run this and check out the DBCC SHOW_STATISTICS output for the stats on the [country] and [region] columns.  In the SHOW_STATISTICS output you'll see a value called "All density".  This should match the values predicted in the calculations described above: Country and Region densities of 0.01 and 0.001, respectively.  (TIP: you should generally use "All density" and disregard the "Density" value in SHOW_STATISTICS's first resultset -- the value in the "Density" column is generally not used by the query optimizer.)  If the statistics are multi-column as in the [stats_country_region] statistics created in this example, an All density value is also maintained for combinations of columns; this can help the optimizer make more accurate predictions when a query includes a filter on more than one column.

Name Updated Rows Rows Sampled Steps Density Average Key Length String Index Filter Expression Unfiltered Rows
stats_country Jan 25 2011 10:14AM 10000 10000 55 0.01 8.9 YES NULL 10000

 

All density Average Length Columns
0.01 8.9 country

 


Predicate
A predicate is an expression that evaluates to True or False. In SQL you encounter predicates in joins and in WHERE or HAVING clauses, where they are used to either filter out or qualify rows. Here's a query with two filter predicates:

SELECT *
FROM table1
WHERE column1 > 20 AND column2 IS NULL

And a query with one join predicate:

SELECT *
FROM table1
INNER JOIN table2 ON table1.column1 = table2.column1

 


Selectivity
Selectivity is also a measure of uniqueness, of sorts. High selectivity implies high uniqueness, or a low number of matching values. Low selectivity implies a low uniqueness, or a high percent of matches. 

Selectivity is most commonly used to describe a predicate. Returning to the example used to describe density, consider a Customers table with a Region column. Suppose that most of the company's customers are in RegionA, and only a few are in RegionB. The predicate "Region='RegionB'" would be described as more selective because it returns a smaller percentage of the rows in the table. (As an aside, this kind of uneven distribution of data within a column is an example of data skew.)  The estimated selectivity of a predicate is essential when the optimizer is evaluating possible plans. It will affect things like join order (all other things equal, the most selective join operation is generally performed first) and whether SQL chooses to do a table scan or an index seek followed by a bookmark lookup.

Selectivity for a filter predicate against a base table can be calculated as "[# rows that pass the predicate]/[# rows in the table]".  If the predicate passes all rows in the table, its selectivity is 1.0.  If it disqualifies all rows, its selectivity is 0. (This can be confusing.  Note that 0.000001 reflects a high selectivity even though the number is small, while 1.0 is low selectivity even though the number is higher.)

The optimizer often uses the histogram steps in column statistics to estimate the selectivity of a predicate.  Re-run the sample script above, and note the third and final resultset in the "DBCC SHOW_STATISTICS(customers, stats_country)" output.  A single row in this resultset represents a step in a histogram.  Each histogram step summarizes the density of all rows in the table that have a [country] value that is equal to or less than that row's RANGE_HI_KEY value, but greater than the RANGE_HI_KEY of the preceding histogram step. The snippet below shows the first 4 rows.

 

RANGE_HI_KEY RANGE_ROWS EQ_ROWS DISTINCT_RANGE_ROWS AVG_RANGE_ROWS
Country0 0 100 0 1
Country1 0 100 0 1
Country11 100 100 1 100
Country13 100 100 1 100

 

Take the third row in the histogram as an example. This step summarizes all rows with a country value greater than 'Country1' (the RANGE_HI_KEY for the preceding step), but less than or equal to 'Country11' (this step's RANGE_HI_KEY).  There are 100 rows with exactly the value 'Country11' (EQ_ROWS), and 100 rows with some other value in this range (RANGE_ROWS).   The DISTINCT_RANGE_ROWS value of 1 tells us that those 100 other RANGE_ROWS are made up of exactly one other distinct [country] value.  (These happen to be the rows with a country value of 'Country10', although that information isn't present in the statistics.)  I've often thought that the DISTINCT_RANGE_ROWS column was poorly-named; a better name might be "DISTINCT_RANGE_VALUES" because it reports a count of distinct values, not rows.  The AVG_RANGE_ROWS column reports the average number of rows (100) for each of the (1) distinct values in the range.  Note that by dividing RANGE_ROWS by DISTINCT_RANGE_ROWS, you calculate step density, a measure of the uniqueness of an average value within the range of values described by that histogram step.

Some simple examples might help illustrate how the optimizer uses this data to estimate selectivity.  For the query below, the optimizer's task is to guess the selectivity of a predicate like "country='Country11'".  It will look for the "Country11" value in the histogram on the country column, and will find that this value happens to be one of the RANGE_HI_KEY values (histogram step endpoints).   As a result, it can use EQ_ROWS -- the number of rows observed with this exact value. The estimated selectivity of the predicate [country]='Country11' is 100/10,000 (EQ_ROWS / [table rowcount]), or 0.01.  The actual selectivity is also 0.01, meaning that the optimizer's estimated selectivity turned out to be accurate. 

Actual Selectivity

Est. Selectivity
... WHERE [country]='Country11' 0.1 0.1

 

For the query below the search value 'Country10' is not a RANGE_HI_KEY -- instead, the value falls somewhere in the middle of a range of values that is summarized by a histogram step.  This means that the optimizer will need to use step density, which implies an assumtion that the search value is typical of the values in its histogram step. The estimated number of rows with a given value in this historgram step is AVG_RANGE_ROWS, which is 100 for this histogram step.  That means that the estimated and actual selectivity are both 0.01 (100 / 10,000). 

Actual Selectivity Est. Selectivity
... WHERE [country]='Country10' 0.1 0.1
 

 

Here's a more interesting example. Note that the optimizer's guessed selectivity is wrong: it thought that 100 rows would be returned (selectivity of 0.1 * 10000 rows in the table), but in reality no rows passed the filter predicate. The search value 'Country10xyz' falls within the same histogram step used in the prior query, so the optimizer guessed the number of rows that would be returned using the exact same calculation as in the prior example. This isn't a bug; it's just a reflection of the fact that statistics can only provide an imperfect summary of the data in a table. You can read more about this here.

Actual Selectivity Est. Selectivity
.. WHERE [country]='Country10xyz' 0 0.1
 

 

The information exposed by DBCC SHOW_STATISTICS can give you a better understanding of why the optimizer thinks that a particular plan will be cheaper than other plans. This is just a quick overview; for a much more detailed description SHOW_STATISTICS output and SQL statistics in general, check out the whitepaper "Statistics Used by the Query Optimizer in Microsoft SQL Server 2008" @ http://msdn.microsoft.com/en-us/library/dd535534(v=sql.100).aspx.  (If you haven't ever read this paper, it's a very good read.)

 


Cardinality
All of this brings us finally to cardinality, which is kind of "where the rubber hits the road" when you are trying to understand why a query plan is slow.

For our purposes, cardinality can be thought of as the number of rows returned by a query operator.  (A few examples of physical query operators that you will probably recognize: Index Seek, Nested Loop Join, Filter)  The cardinality of an operator like a Filter is determined by multiplying the selectivity of that operator -- that is, the % of its input rows that it will pass -- by the cardinality (rowcount) of the operator's child.  If an operator receives 500 rows from its child operator and has a selectivity of 0.3 (30%), it has a cardinality of 500*0.3, or 150 rows.

Each operator in a query plan has an estimated cardinality (the number of rows the optimizer guessed that the operator would return) and an actual cardinality (the number of rows that the operator returned in the real world).  You can see both by running a query with "SET STATISTICS PROFILE ON".

Rows Executes EstimateRows EstimateExecutions
SELECT c1 FROM t1 WHERE c1 = 'xyz'
2210 1   |--Filter([t1].[c1] = 'xyz')  21.0 1.0
2742 1       |--Index Scan(OBJECT:([t1].[idx1])) 2971 1.0
    

 

Sometimes the query optimizer cannot accurately predict the number of rows that a given operator will return. This can prevent SQL from estimating the cost of a query plan correctly, which can lead to the selection of a suboptimal plan. Cardinality estimation errors are one of the most common causes of slow query plans in SQL Server, so it is very important to know how to identify cardinality estimation problems in a query plan. (That's a little beyond the point of this post, though -- here I'm just defining some terms.) In the plan above, you can see that SQL made a cardinality estimation error: it guessed that the filter would be very selective and would only pass 21 of 2971 rows, when in reality almost all of the rows passed the filter.

The optimizer has a number of ways to estimate cardinality, none of which are completely foolproof. 

  1. If the predicate is simple like "column=123" and if the search value happens to be a histogram endpoint (RANGE_HI_KEY), then EQ_ROWS can be used for a very accurate estimate.
  2. If the search value happens to fall between two step endpoints, then the average density of values in that particular histogram step is used to estimate predicate selectivity and operator cardinality.
  3. If the specific search value is not known at compile time, the next best option is to use average column density ("All density"), which can be used to calculate the number of rows that will match an average value in the column. 
  4. In some cases none of the above are possible and optimizer has to resort to a "magic number"-based estimate.  For example, it might make a totally blind guess that 10% of the rows will be returned, where the "10%" value would be hardcoded in the optimizer's code rather than being derived from statistics. 

In most scenarios option #1 is the ideal, but because most values aren't histogram endpoints, option #2 is the most frequently-used method for simple filter and join predicates. Option #4 usually provides the worst estimates, and is only used when there is no other alternative. Thankfully, most of the time when a magic number is used, this is a side effect of poor query design; by rewriting the query you can enable a more accurate cardinality estimate based on either average column density or histogram lookup.

 

26 Mar 07:07

Don’t depend on expression short circuiting in T-SQL (not even with CASE)

by bartduncan

There are a fair number of blog posts and forum discussions regarding expression short circuiting in T-SQL. Some of the most authoritative posts, like this one, come to the following conclusions: (a) You cannot depend on expression evaluation order for things like “WHERE <expr1> OR <expr2>“, since the optimizer might choose a plan that evaluates the second predicate before the first one. But, (b) order of evaluation of the expressions in a CASE statement is fixed, so you can depend on deterministic short circuit evaluation of a CASE statement. For example, this wouldn’t protect you from a divide-by-zero error:

    WHERE (@value = 0) OR ((col1 / @value) = 2)

But the idea is that this variation is functionally-equivalent, and should protect you from the error:

    WHERE
        CASE
            WHEN (@value = 0) THEN 2
            ELSE (col1 / @value)
        END = 2

Before now that’s the advice I would have offered, too. But I just ran into a situation where a CASE statement does not provide predictable short circuiting. Here’s a simplified repro:

    ALTER FUNCTION dbo.test_case_short_circuit (@input INT)
    RETURNS TABLE
    AS
    RETURN (
        SELECT calculated_value =
            CASE
                WHEN @input <= 0 THEN 0
                ELSE LOG10 (@input)
            END
    );
    GO

    SELECT * FROM dbo.test_case_short_circuit (0);
    GO

This fails with this error:


     Msg 3623, Level 16, State 1, Line 2
     An invalid floating point operation occurred.

The LOG10 function raises this error when its input is 0 or a negative value. In some cases it appears that the plan may still evaluate the expression in the second CASE branch even when it won’t be using the value. This is a case where CASE doesn’t provide deterministic short circuiting.

I want to make sure no one takes away the conclusion that SQL Server doesn’t support expression short circuiting. It definitely does. It’s just that you don’t have explicit control over the order of expression evaluation — even with CASE, apparently. And if you’re going to depend on short circuiting, you need a deterministic order of expression evaluation.

What can you do about it? One option would be to always scrub things so that an error isn’t possible even when the CASE branch’s output won’t be used. For example, using “ELSE LOG10 (CASE WHEN @input <= 0 THEN 1 ELSE @input END)” in the repro script doesn’t change the behavior of the function, but avoids the error. Unfortunately, that’s not so pretty.

UPDATE (4 Mar 2011): To be clear, I’ve used CASE before for its short-circuiting properties, and I don’t intend to go back and revisit all of that old code in light of this example. This seems like an edge case to me. But it’s worth being aware that such edge cases exist if you’re thinking about relying on CASE for short circuiting. The most defensive programming approach would be to write the expression in such a way that it doesn’t require particular short circuiting behavior.

UPDATE (10 Jun 2011): The owners of this code have marked this bug as fixed. From their comments, it sounds like you are supposed to be able to rely on deterministic order of expression evaluation for CASE statements. But any SQL release in your hands right now will still be vulnerable to this problem -- keep an eye out for the issue as you use CASE for short circuiting.

(Cross-posted to here.)

04 Mar 18:38

The Internet of Things Will Make Big Data Look Small

by A.R. Guess

by Angela Guess Tom Krazit recently wrote in Fortune, “It’s kind of amazing that we all settled on the term “big data” before the “Internet of things” really arrived. That pending revolution, in which we’ll see all kinds of new objects connected to the Internet thanks to the cheap hardware provided by the smartphone boom, […]

The post The Internet of Things Will Make Big Data Look Small appeared first on DATAVERSITY.

04 Mar 18:38

10 Commandments of BI

by A.R. Guess

by Angela Guess Shant Hovsepian recently wrote in Datanami, “When looking at BI tools for your organization, there are 10 ‘Commandments’ you should live by. First Commandment: Thou Shalt Not Move Big Data. Moving Big Data is expensive: it is big, after all, so physics is against you if you need to load it up […]

The post 10 Commandments of BI appeared first on DATAVERSITY.

04 Mar 18:38

Bad cardinality estimates from SSMS plans – redux

by Aaron Bertrand

Over three years ago now, I posted about a fix to Plan Explorer regarding bad cardinality estimates that SQL Server's Showplan XML was producing, in the case of key/RID lookups with a filter predicate in SQL Server 2008 and above. I thought it would be interesting to look back and go into a little more detail about one of these plans and the iterations that we went through to ensure we were displaying correct metrics, regardless of what Management Studio shows. Again, this work was largely done by Brooke Philpott (@MacroMullet) and Greg Gonzalez (@SQLsensei) and with great input from Paul White (@SQL_Kiwi).

This is quite similar to the query I presented in my earlier post, which came from Paul (and which would take some work to reproduce exactly in modern versions of AdventureWorks, where at the very least transaction dates have changed):

SELECT
    th.ProductID,
    p.Name,
    th.TransactionID,
    th.TransactionDate
FROM Production.Product AS p
JOIN Production.TransactionHistory AS th ON
    th.ProductID = p.ProductID
WHERE 
    p.ProductID IN (1, 2)
    AND th.TransactionDate BETWEEN '20070901' AND '20071231';

The plan from Management Studio looked correct enough:

However, if you look closer, it seems that the ShowPlan has pushed the estimated number of executions from the key lookup straight over to the estimated number of rows for the final exchange:

On first glance, the graphical plan diagram in Plan Explorer looks quite similar to the plan that SSMS produces:

Now, in the process of developing Plan Explorer, we have discovered several cases where ShowPlan doesn't quite get its math correct. The most obvious example is percentages adding up to over 100%; we get this right in cases where SSMS is ridiculously off (I see this less often today than I used to, but it still happens).

Another case is where, starting in SQL Server 2008, SSMS started putting total estimated rows instead of rows per execution along with lookups, but only in cases where a predicate is pushed to the lookup (such as the case in this bug reported by Paul, and this more recent observation by Joey D'Antoni). In earlier versions of SQL Server (and with functions and spools), we would typically show estimated row counts coming out of a lookup by multiplying the estimated rows per execution (usually 1) by the estimated number of rows according to SSMS. But with this change, we would be over-counting, since the operator is now already doing that math. So, in earlier versions of Plan Explorer, against 2008+, you would see these details in the tooltips, connector lines, or in the various grids:

(Where does 1,721 come from? 67.5 estimated executions x 25.4927 estimated rows.)

Back in 2012, we fixed part of this issue by not performing this mathematical operation any longer, and relying solely on the estimated row counts coming out of the key lookup. This was almost correct, but we were still relying on the estimated row count ShowPlan was providing us for the final exchange:

We quickly addressed this issue as well, in version 7.2.42.0 (released on Hallowe'en 2012), and now feel we are providing information that is much more accurate than Management Studio (though we will keep an eye on this bug from Paul):

This clearly happened a long time ago, but I still thought it would be interesting to share. We continue to make enhancements to Plan Explorer to provide you with the most accurate information possible, and I will be sharing a few more of these nuggets in upcoming posts.

The post Bad cardinality estimates from SSMS plans – redux appeared first on SQLPerformance.com.