08 Sep 19:25

Primer on Power BI (Business Intelligence)

by MVP Award Program

Editor’s note: The following post was written by SQL Server MVP Mark Tabladillo

Power BI is new and emerging self-service business intelligence and business analytics framework brings together and enhances key Microsoft technologies:

Office
SQL Server
Azure
SharePoint

Fundamentally, Power BI is considered a premium Office option, because Microsoft licenses it that way. Yet, the technology details also comprise new collaboration technologies for SQL Server, Azure and SharePoint. A successful technology collaboration will have boundaries which could arguably belong to one or more of the contributing technology groups.

This document provides links and introductory information to Power BI. My analysis is more useful for the enterprise planner (CIO, CTO, Information Technology Architect), but also is useful for individual consumers. Power BI is a technology which extends from individual use on any device (laptop, tablet or smartphone) and all the way up to high-scale cloud or hybrid (cloud plus on premise) production architecture.

Books have already been written on aspects of the component Power BI technologies, and some will be recommended for further study. For this document, the purpose is to provide an overview of the key points in knowing what this technology is and how it might be useful in your organization. Along the way, I provide web links (URLs) to pages and videos that document and demonstrate key features of Power BI technology. In larger view, Power BI is at the heart of how Microsoft is now developing integrated aspects of the already-technologies (Azure, SQL Server, Office, and SharePoint) and represents a direction for all these technologies for the foreseeable future.

The major sections of this report include:

Definition – what is Power BI?
Licensing Power BI – how can I or we get Power BI?
Excel 2013 Features – what are the major features in on premise (legacy) Excel 2013?
Power BI for Office 365 – what are the major features of the online Power BI for Office 365?
Power BI with Excel 2010 – what can Excel 2010 users do with Power BI?
Recommended Resources – where can I find free online resources and recommended paid books?

Definition

Formally, Microsoft claims that this technology is comprised of the following features and services:

Excel Features

Power Query – easily discover and connect to data from public and corporate data sources
Power Pivot – create a sophisticated Data Model directly in Excel
Power View – create reports and analytical views with interactive data visualizations
Power Map – explore and navigate geospatial data on a 3D map experience in Excel

Power BI for Office 365

Power BI Sites – share, view, and interact with reports in these collaborative Power BI sites
Power BI Q&A (Question and Answer) – use natural language queries to find, explore, and report over your data
Query and Data Management – share and manage queries and data sources, and view query usage analytics
Power BI Windows Store App – view reports on the go, with the Power BI app

IT (Information Technology) Infrastructure Services for Power BI Office 365

Provision Power BI for Office 365 – create a tenant, and get Power BI for your organization
Power BI Admin Center – manage your organization’s Power BI infrastructure from a site
Data Management Gateway – connect on-premises data to the cloud, seamlessly and securely

Many presentations I have seen on Power BI start with flashy demos and features. I have both seen and done such demos several times for user groups and a national conference called PASS Business Analytics Conference. I will recommend some video demos throughout in this document, and provide some recommended links for further reading.

Though first, the licensing needs to be examined because Microsoft is offering more than just the legacy pricing options for Office. In this report, I will be discussing and emphasizing the Office 365 integration over the SharePoint integration. In my experience with consulting clients, a key question is how much the technology costs and how to obtain it.

Licensing Power BI

Licensing has been one of the most actively discussed aspects among current Power BI users. It is wise to spend some time on the topic up front. Though a combination of technologies, Power BI is obtained through an Office license.

While some pieces of Power BI have been and are available for Office 2010, I support Microsoft in recommending that people obtain Office 2013 for a more stable technology and complete Power BI experience (especially if organizations are coming from Office 2007 or Office 2003 or earlier). The technology reason is that Excel 2013 Power Pivot is superior to Excel 2010 Power Pivot, and in enough ways to recommend the higher level.

Power BI is not required for Office 2013: the reason is that Power BI is considered a premium set of features and services, available only at higher licensing levels. For people wanting Office as a one-time purchase, they can continue to obtain a single PC license in the United States:

Office Home & Student 2013 – Word, Excel, PowerPoint, OneNote
Office Home & Business 2013 – Word, Excel, PowerPoint, OneNote, Outlook
Office Professional 2013 – Word, Excel, PowerPoint, OneNote, Outlook, Publisher, Access

Though, none of these three versions include Power BI Excel Features, which instead requires Office Professional Plus 2013 (only available through volume subscriptions, or through MSDN Premium or Ultimate subscriptions). I have a MSDN Ultimate subscription, and therefore I have the Power BI Excel Features.

If you only have Power BI Excel Features, you can start using many new self-service business intelligence features, and you can be happy having such abilities. However, these Excel Features are only part of the Power BI technology. In several online forums, many others who have the same MSDN Ultimate subscription level have been asking Microsoft to extend the benefits to include all of Power BI and not just the Excel Features.

For now, experiencing Power BI for Office 365 and the IT Infrastructure Services for Power BI currently requires one of several higher Office 365 subscriptions. All Office 365 subscriptions require an annual commitment, and would have a penalty for early cancellation. As with the one-time purchase options, only certain higher Office 365 subscriptions include Power BI. Again, the reason is that Power BI is a premium set of features and services.

In general, Office 365 subscriptions have been organized into several clusters, one for personal users, and the others toward different types of businesses. You could be a single person or private group and purchase a Business Use option. I listed these options in the following table, so that you can see it all in one place.

Personal Use (single account, priced per month, could be paid monthly)

Office 365 Personal – Single Computer and Tablet: Word, Excel, PowerPoint, OneNote, Outlook, Publisher, Access
Office 365 Home – Five Computers and Five Tablets: Word, Excel, PowerPoint, OneNote, Outlook, Publisher, Access
Office 365 Pro Plus – Five Computers and Five Tablets: Word, Excel, PowerPoint, OneNote, Outlook, Publisher, Access – Includes Power BI for Office 365

Business Use (priced per user, per month) – details on the Microsoft Website

Office Small Business
- Office 365 Small Business
- Office 365 Small Business Premium
Midsize Business

Office 365 Midsize Business – Includes Power BI for Office 365

Enterprise

Hosted email (Exchange Online Plan 1)
Office 365 Enterprise E1
Office 365 Enterprise E3 – Includes Power BI for Office 365
Office 365 Enterprise E4 – Includes Power BI for Office 365

Education or Academic– for qualifying organizations, reduced cost compared to Enterprise

Office 365 Education A2
Office 365 Education A3
Office 365 Education A4

Government– for qualifying organizations, reduced cost compared to Enterprise

Exchange Online (Plan 1)
Exchange Online (Plan 2)
Office 365 (Plan E1) for Government
Office 365 (Plan E3) for Government

Non-Profits– for qualifying organizations, reduced cost compared to Enterprise
- Office 365 Small Business for Nonprofits
- Office 365 Small Business Premium for Nonprofits
- Office 365 Enterprise E1 for Nonprofits
- Office 365 Enterprise E3 for Nonprofits – Includes Power BI for Office 365

If you were counting, of the above options, the only five Office 365 subscriptions which include complete Power BI are:

Office 365 Pro Plus
Office 365 Midsize Business
Office 365 Enterprise E3
Office 365 Enterprise E4
Office 365 Enterprise E3 for Nonprofits

In summary, when considering Office 2013, there is one partial licensing path to Power BI (Office 2013 Professional Plus) or one of the five listed Office 365 subscriptions.

Recommendations

If you are an individual, I first recommend the Office 365 Pro Plus subscription.
If you are looking for a business, I would first look at the Office 365 Midsize Business subscription.

In a later section, I will describe what features of Power BI you could have using Office 2010 (and particularly Excel 2010). Even having Office 2010 should make someone happy with at least getting started. Again, I am recommending Office 2013 over Office 2010, though pragmatically many organizations (for example) already committed to Office 2010 before or as Power BI was being created.

Excel 2013 Features

The Excel Features work on-premises with either Excel 2010 or Excel 2013 (requiring an Office Professional Plus subscription, which comes with an MSDN Premium or Ultimate subscription). Again, Excel 2013 is my recommended platform for using this technology, and the focus of this section (a later section discusses the comparatively limited Excel 2010). The Excel Features are comprised of four elements, which I will summarize and provide key technical descriptions.

Power Query – easily discover and connect to data from public and corporate data sources
Power Pivot – create a sophisticated Data Model directly in Excel
Power View – create reports and analytical views with interactive data visualizations
Power Map – explore and navigate geospatial data on a 3D map experience in Excel

Some of the Power BI elements (like Power Pivot and Power View) are now native to Excel 2013. Other elements are add-ins, and sometimes may “disappear” from the ribbon: first try to enable them again from the COM add-in window, or secondly, uninstall and reinstall them. Naturally, the best situation is when these emerging features are native to the Office version. Not all these features directly impact SQL Server technology, though the one which most clearly extends SQL Server (Analysis Services) is Power Pivot.

Power Query

The development version of Power Query was termed “Data Explorer”. I first saw this technology under development and while interesting as a web application, did not immediately excite me for its possibilities. Since then, the product is continuing to mature, and my interest has come to increase in what this technology can do. For power Excel users, I would hope that Power Query would become the default way to import data into Excel.

Power Query allows for some amount of data preprocessing during the import phase. Many of the steps which Excel users have come to do manually, such as splitting columns, removing columns or renaming columns, can be scripted within the Power Query interface. The software is wizard driven, and I did say scripting: underneath the technology is the Power Query Formula Language (informally known as “M”) allowing for future maturity into a reusable import technology. How that technology grows can depend on what Microsoft hears from the user community.

The next table summarizes many of the features. Having Office 365 increases the features by allowing shared queries. This type of structure is to be expected: the standalone features would come with in-premise Excel, and having the Office 365 cloud options open up collaboration features.

Power Query Summary of Features

Standard Power Query features

Value-added features with an Office 365 Power BI subscription

Easily discover, combine, and refine data for better analysis in Excel.

Introduction to Microsoft Power Query for Excel
Import data from external data sources
Shape data from multiple data sources
Filter, sort, and group data
Search public data (currently United States only)

In addition to the features in the standalone edition, securely share and manage your data queries within the enterprise in Excel.

Inside Excel, here’s what the latest ribbon looks like in Excel 2013 x64:

The online search is considered a feature for searching your Office 365 datasets. It becomes active once you sign in.

Beyond that first icon, the common “Get External Data” icons allow for reading any number of sources. The “From Web” link is often used in demonstrations, and permits searching website URLs or feeds. Other external data options include ODBC, SQL Server, Windows Azure, Microsoft Access, SharePoint Lists, OData, Windows Azure Marketplace, Hadoop File System, and even Facebook.

In process, the wizard will provide a preview of the data (once a connection, meaning authentication and authorization, are established). The preview window then allows for choosing the preprocessing steps for the data: though nothing happens until you submit the entire list. Along the way, the Power Query Formula Language builds the specific steps together into a query script. Once the preprocessing selections are done, you submit the list and the results come back to Excel. I recommend trying the technology yourself.

Limitations are important for power users, so I am including them in this report.

Power Query Specifications and Limits

Feature	Limitation
Query name length	80 characters
Invalid characters in a query name	Double quotes (“), periods (.), leading or trailing whitespaces
Number of cells in a Query Editor data preview	3,000 cells
Navigation pane items displayed per level: databases per server and tables per database.	First 1,000 items in alphabetical order. You can manually add a non-visible item by modifying the formula for this step
Size of data processed by the Engine	Limited to available virtual memory (for 64-bit version) or about 1GB for 32-bit version, if data cannot be fully streamed, such as when sorting the data set locally before filling it
Number of columns per table	16,384
Maximum size of text in a preview cell	1M characters
Maximum size of text filled to Excel or data model	Not limited by Power Query
Maximum dataset size when evaluating a query	256MB
Maximum number of rows filled to worksheet	1,048,576
Soft limit of persistent cache. A soft limit can be exceeded for periods of time.	4GB
Individual entries in the cache	1GB
Compressed query and dependencies as stored in the connection string. For more information about how to display connection information, see Display connection information.	64K characters

Action Steps:

Power Pivot

I will count Power Pivot as the first of the Excel features, at least the first one I saw. Of the four Excel features, this one has been where I have been spending most of the time with clients, and also my own presentations (I did some combining Power Pivot and data mining, which you can find on web). This technology sits on top of what we now call xVelocity, a rapid summation and compression engine. The underlying technology now scales to production in what is called Tabular mode for Analysis Services.

Collectively, Power Pivot for Excel, Power Pivot for SharePoint and Tabular mode in Analysis services comprise key elements of what Microsoft has named the BI (Business Intelligence) Semantic Model:

BI Semantic Model

As with the rest of Power BI, there is an Excel feature which works well just on its own. However, there is also a technology path and way to share Power Pivot data models using either SharePoint or Tabular mode in Analysis Services. The interface is its own Excel window, but has a familiar spreadsheet-type interface which can declare data types and relationships among data models (tables).

Power Pivot Features

The DAX language permits programming custom measures. This language works with Power Pivot in Excel, but more generally also allows for querying Microsoft’s Multidimensional and Data Mining mode (OLAP cube) databases. Learning and using DAX is considered an intermediate to advanced Power Pivot skill.

Again, here are some summarized capacity specifications for Power Pivot. Please note that using a 32-bit system is an additional strain: only about 2.1 GB of memory is available for all Excel activity, including Power BI. I recommend that Power BI users upgrade from 32-bit (x86) to 64-bit (x64). Using x64, I have shown demos where Power Pivot can import over 2M records from a SQL Server data warehouse, and of course I was not even pushing the entire limit.

Some users report that because of their corporate environment, they were able to acquire a substitute: access to a virtual machine (perhaps shared) running x64 Office and accessed through Remote Desktop Connection Manager.

Data Model Specification and Limits

Product or Platform	Maximum Limit
Excel 2013	32-bit environment is subject to 2 gigabytes (GB) of virtual address space, shared by Excel, the workbook, and add-ins that run in the same process. A data model’s share of the address space might run up to 500 – 700 megabytes (MB), but could be less if other data models and add-ins are loaded. 64-bit environment imposes no hard limits on file size. Workbook size is limited only by available memory and system resources.
SharePoint Server 2013 ¹	Maximum file size for uploading to a document library: 50 megabytes (MB) default 2 gigabytes (GB) maximum ² Maximum file size for rendering a workbook in Excel Services: 10 megabytes (MB) default 2 gigabytes (GB) maximum ²
Excel Online in Office 365 ³	250 megabytes (MB) total file size limit. Core worksheet contents (everything not in the Data Model) size limits according to file size limits for workbooks in SharePoint Online.

Footnotes

¹On SharePoint Server, notice that the defaults that are much lower than the maximum allowed. Ask your SharePoint administrator about raising file size limits if your file is too big to upload or render. More information about Software boundaries and limits for SharePoint Server 2013.

²Maximum Upload Size must be configured for each web application by a SharePoint administrator. Maximum Workbook Size must be configured in Excel Services by a service administrator. More information for administrators can be found in Configure Maximum File Upload Size on TechNet.

³Limits in Office 365 are not configurable, but can change over time. Check the Office 365 for Enterprise Service Descriptions for the latest information. You can also see SharePoint Online: software boundaries and limits.

Power Pivot Capacity Specifications

Object	Specification / Limit
Object name length	100 characters
Invalid characters in a Name	. , ; ' ` : / \ * \| ? " & % $ ! + = () [] {} < >
Number of tables per PowerPivot database	(2^31) - 1 = 2,147,483,647
Number of columns and calculated columns per table	(2^31) - 1 = 2,147,483,647
Number of calculated measures in a table	(2^31) - 1 = 2,147,483,647
PowerPivot memory size for saving a workbook	4GB = 4,294,967,296 bytes
Concurrent requests per workbook	6
Local cubes connections	5
Number of distinct values in a column	1,999,999,997
Number of rows in a table	1,999,999,997
String length	536,870,912 bytes (512 MB), equivalent to 268,435,456 Unicode characters (256 mega characters)

Power Pivot is a feature of Excel 2013 (Office Professional Plus).

Action Steps:

Power View

More than just a graphing interface, Power View surfaces an interactive work surface for exploring visual data. The technology targets both Excel and SharePoint. The technology for Excel is currently based on the Silverlight. Here are some screenshots of options:

Microsoft provides a few guidance documents for Power View specifications. Key to making Power View for Excel work includes making sure the prerequisite Silverlight is available. Viewing the results in browsers requires considering what specific browser version is used:

Power View is a feature of Excel 2013 (Office Professional Plus).

Action Steps:

Power Map

The fourth and final Excel feature for Power BI teams Bing Maps with Excel. The technology goes beyond just the standard 2D maps available with Excel or Power View, and extends mapping into three-dimensional views which can be turned into movies you can create.

Within Excel, Power Map shares the same earlier-mentioned size limitations of Excel 32-bit (x86) or 64-bit (x64).

Action Steps:

Power BI for Office 365

As I mentioned earlier, Power BI for Office 365 extends what Power BI could achieve with Excel alone. The main two features of Office 365 are collaboration (working with teams) and portability (working across locations and devices). The technology is viewed through web browsers, and therefore would allow teams to use whatever laptop, tablet, or smartphone device they have already invested in. As mentioned in the licensing section, obtaining this software also includes the legacy on premise versions of Microsoft Office, demonstrating the desire to keep this new web world connected with the already familiar Office experience. Because of the streamlined licensing structure, I also am emphasizing and promoting Office 365 as the primary way to experience Power BI collaboration. It is true that collaboration can happen with SharePoint, and that software is separately licensed from the lengthy options already discussed.

The components of Office 365 are:

Power BI Sites – share, view, and interact with reports in these collaborative Power BI sites
Power BI Q&A (Question and Answer) – use natural language queries to find, explore, and report over your data
Query and Data Management – share and manage queries and data sources, and view query usage analytics
Power BI Windows Store App – view reports on the go, with the Power BI app

The first action step is to view the two minute preview video on Office 365.

Power BI Sites

This technology is comprised of websites you make with your Power BI content. Organizing your information on the web allows collaboration (teams) and portability (location and devices). This feature is considered an aspect of SharePoint online. Click this link to see the preview video.

Power BI Q&A (Question and Answer)

This technology opens up a way to query inside data stored on Power BI sites. Based on Bing search technology, the interface interprets natural language entries and provides data results. Click this link to see the preview video.

Query and Data Management

These features are an inherent part of the Power BI sites. Collaboration requires knowing what happens with that information.

Click this link to study the Data Management Experience in Power BI Office 365.

Power BI Windows Store App

Office 365 already allows for HTML5 rendering for mobile devices (including laptops, tablets and smartphones). Though, the native Power BI Windows Store App extends that functionality for Windows devices. The consumption experience matches Power BI sites. Click here for the preview video.

Managing Power BI for Office 365

As mentioned in the introduction, there are other features which involve managing Office 365. We can expect these features to grow and improve, as Microsoft often asks the community for input on what features people would find useful.

IT (Information Technology) Infrastructure Services for Power BI Office 365

Provision Power BI for Office 365 – create a tenant, and get Power BI for your organization
Power BI Admin Center – manage your organization’s Power BI infrastructure from a site
Data Management Gateway – connect on-premises data to the cloud, seamlessly and securely

Power BI with Excel 2010

The following chart summarizes the key elements of Power BI, and what would be available for users of Excel 2010.

Software	Solution
Power Pivot	A version one of Power Pivot is available for Excel 2010. Microsoft’s website provides video, demos, and hands-on labs to try out the software. Opening up an Excel 2010 Power Pivot workbook in Excel 2013 requires an irreversible upgrade (meaning that collaboration between Excel 2010 and Excel 2013 users is not possible – pick one version or the other). There are some technical details involved in upgrading, which you can study by clicking this link.
Power Map	Power Map is not available for Excel 2010. Versions of Excel previous to 2002 had a native map feature. Though, advanced Excel users will know about Microsoft MapPoint, which has gone through nineteen versions since its debut. Sadly, MapPoint is being discontinued as of December 31, 2014. You might be able to obtain a copy now, or through some MSDN subscriptions.
Power Query	Power Query is available for Microsoft Office 2010 Professional Plus with Software Assurance. There are two versions, one for 32-bit (x86) and one for 64-bit (x64).
Power View	Power View is not available for Excel 2010.

Recommended Resources

First, the free Microsoft digital books – these books include some Power BI topics like DAX, but generally all types of Microsoft topics:

Next, I generally recommend reading Microsoft’s website and documentation http://msdn.microsoft.com. Often, the documentation integrates video and demos: Microsoft has become better in doing that. A general site for Power BI video is Channel 9, which has many videos from Microsoft technical conferences.

Finally, there are books you buy. Many of my MVP friends and other equally-skilled professionals have authored books on Power BI topics. More continue to be published, some of them came out this month (July 2014), and most of them are generally available in both hardcopy and digital formats. All these listed books are on some aspect of Power BI.

Collie, R. (2012). DAX Formulas for PowerPivot: A Simple Guide to the Excel Revolution: Holy Macro! Books.

de Jonge, K. (2014). Dashboarding and Reporting with Power Pivot and Excel: How to Design and Create a Financial Dashboard with PowerPivot – End to End: Holy Macro! Books.

Ferrari, A., & Russo, M. (2013). Microsoft Excel 2013 Building Data Models with PowerPivot: Microsoft Press.

Jelen, B., & Collie, R. (2014). PowerPivot Alchemy: Patterns and Techniques for Excel: Holy Macro! Books.

Larson, B., Davis, M., English, D., & Purington, P. (2012). Visualizing Data with Microsoft Power View: McGraw-Hill/Osborne Media.

Webb, C. (2014). Power Query for Power BI and Excel: Apress.

Summary

Hopefully, this report provides good background information on Power BI. Feel free to contact me with feedback, either through my website http://marktab.net or on Twitter @marktabnet.

About the author

Mark provides enterprise data science analytics advice and solutions. He uses Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, SAS, SPSS, R, and Hadoop (among other tools). He works with Microsoft BI (SSAS, SSIS, SSRS, SharePoint).

Mark has a been a public voice for analytics since 1998: Microsoft TechEd, PASS Business Analytics Conference, Predictive Analytics World, SAS Global Forum, PASS Summit. He is a SQL Server MVP, a trainer and consultant with SolidQ, and teaches part-time at the University of Phoenix. His blog is at http://marktab.net

About MVP Mondays

The MVP Monday Series is created by Melissa Travers. In this series we work to provide readers with a guest post from an MVP every Monday. Melissa is a Community Program Manager, formerly known as MVP Lead, for Messaging and Collaboration (Exchange, Lync, Office 365 and SharePoint) and Microsoft Dynamics in the US. She began her career at Microsoft as an Exchange Support Engineer and has been working with the technical community in some capacity for almost a decade. In her spare time she enjoys going to the gym, shopping for handbags, watching period and fantasy dramas, and spending time with her children and miniature Dachshund. Melissa lives in North Carolina and works out of the Microsoft Charlotte office.

08 Sep 17:44

PowerScripting Podcast

by arcanecode

I just wanted to give a thanks to the guys at the PowerScripting Podcast for having me on tonight. As soon as it is released I’ll follow up with a link.

For those who came here from hearing me on the podcast, you can find more info on SQL Saturday at: http://bit.ly/sqlsat328

If you want to find out more about my sessions at the PASS Summit, you can jump to http://bit.ly/acsummit. My co-presenters for the precon are Brad Ball @sqlballs and Jason Strate @stratesql.

The Pragmatic Works webinars can be found on the company website at http://pragmaticworks.com. Just follow the Free Training on the T’s to get access to the webinars. You can search by author name (Robert Cain will get you mine) or topic.

My other training videos can be found on Pluralsight, http://pluralsight.com/training.

I also have a youtube channel with a couple of videos, https://www.youtube.com/user/arcanecode. Check out the Column Mode Editing video for a quick editing tip on making your life easier with both PowerShell and SQL Server.

08 Sep 17:44

Grouped Concatenation in SQL Server

by Aaron Bertrand

Grouped concatenation is a common problem in SQL Server, with no direct and intentional features to support it (like XMLAGG in Oracle, STRING_AGG or ARRAY_TO_STRING(ARRAY_AGG()) in PostgreSQL, and GROUP_CONCAT in MySQL). It has been requested, but no success yet, as evidenced in these Connect items:

Connect #247118 : SQL needs version of MySQL group_Concat function (Postponed)
Connect #728969 : Ordered Set Functions – WITHIN GROUP Clause (Active)

What is Grouped Concatenation?

For the uninitiated, grouped concatenation is when you want to take multiple rows of data and compress them into a single string (usually with delimiters like commas, tabs, or spaces). Some might call this a "horizontal join." A quick visual example demonstrating how we would compress a list of pets belonging to each family member, from the normalized source to the "flattened" output:

There have been many ways to solve this problem over the years; here are just a few, based on the following sample data:

CREATE TABLE dbo.FamilyMemberPets
(
  Name SYSNAME,
  Pet SYSNAME,
  PRIMARY KEY(Name,Pet)
);
 
INSERT dbo.FamilyMemberPets(Name,Pet) VALUES
(N'Madeline',N'Kirby'),
(N'Madeline',N'Quigley'),
(N'Henry',   N'Piglet'),
(N'Lisa',    N'Snowball'),
(N'Lisa',    N'Snowball II');

I am not going to demonstrate an exhaustive list of every grouped concatenation approach ever conceived, as I want to focus on a few aspects of my recommended approach, but I do want to point out a few of the more common ones:

Scalar UDF

CREATE FUNCTION dbo.ConcatFunction
(
  @Name SYSNAME
)
RETURNS NVARCHAR(MAX)
WITH SCHEMABINDING 
AS 
BEGIN
  DECLARE @s NVARCHAR(MAX);
 
  SELECT @s = COALESCE(@s + N', ', N'') + Pet
    FROM dbo.FamilyMemberPets
	WHERE Name = @Name
	ORDER BY Pet;
 
  RETURN (@s);
END
GO
 
SELECT Name, Pets = dbo.ConcatFunction(Name)
  FROM dbo.FamilyMemberPets
  GROUP BY Name
  ORDER BY Name;

Note: there is a reason we don't do this:

SELECT DISTINCT Name, Pets = dbo.ConcatFunction(Name)
  FROM dbo.FamilyMemberPets
  ORDER BY Name;

With DISTINCT, the function is run for every single row, then duplicates are removed; with GROUP BY, the duplicates are removed first.

Common Language Runtime (CLR)

This uses the GROUP_CONCAT_S function found at http://groupconcat.codeplex.com/:

SELECT Name, Pets = dbo.GROUP_CONCAT_S(Pet, 1)
  FROM dbo.FamilyMemberPets
  GROUP BY Name
  ORDER BY Name;

Recursive CTE

There are several variations on this recursion; this one pulls out a set of distinct names as the anchor:

;WITH x as 
(
  SELECT Name, Pet = CONVERT(NVARCHAR(MAX), Pet),
    r1 = ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Pet)
  FROM dbo.FamilyMemberPets
),
a AS 
(
  SELECT Name, Pet, r1 FROM x WHERE r1 = 1
),
r AS
(
  SELECT Name, Pet, r1 FROM a WHERE r1 = 1
  UNION ALL
  SELECT x.Name, r.Pet + N', ' + x.Pet, x.r1
    FROM x INNER JOIN r
	ON r.Name = x.Name
	AND x.r1 = r.r1 + 1
)
SELECT Name, Pets = MAX(Pet)
  FROM r
  GROUP BY Name 
  ORDER BY Name
  OPTION (MAXRECURSION 0);

Cursor

Not much to say here; cursors are usually not the optimal approach, but this may be your only choice if you are stuck on SQL Server 2000:

DECLARE @t TABLE(Name SYSNAME, Pets NVARCHAR(MAX),
  PRIMARY KEY (Name));
 
INSERT @t(Name, Pets)
  SELECT Name, N'' 
  FROM dbo.FamilyMemberPets GROUP BY Name;
 
DECLARE @name SYSNAME, @pet SYSNAME, @pets NVARCHAR(MAX);
 
DECLARE c CURSOR LOCAL FAST_FORWARD
  FOR SELECT Name, Pet 
  FROM dbo.FamilyMemberPets
  ORDER BY Name, Pet;
 
OPEN c;
 
FETCH c INTO @name, @pet;
 
WHILE @@FETCH_STATUS = 0
BEGIN
  UPDATE @t SET Pets += N', ' + @pet
    WHERE Name = @name;
 
  FETCH c INTO @name, @pet;
END
 
CLOSE c; DEALLOCATE c;
 
SELECT Name, Pets = STUFF(Pets, 1, 1, N'') 
  FROM @t
  ORDER BY Name;
GO

Quirky Update

Some people *love* this approach; I don't comprehend the attraction at all.

DECLARE @Name SYSNAME, @Pets NVARCHAR(MAX);
 
DECLARE @t TABLE(Name SYSNAME, Pet SYSNAME, Pets NVARCHAR(MAX),
  PRIMARY KEY (Name, Pet));
 
INSERT @t(Name, Pet)
  SELECT Name, Pet FROM dbo.FamilyMemberPets
  ORDER BY Name, Pet;
 
UPDATE @t SET @Pets = Pets = COALESCE(
    CASE COALESCE(@Name, N'') 
      WHEN Name THEN @Pets + N', ' + Pet
      ELSE Pet END, N''), 
	@Name = Name;
 
SELECT Name, Pets = MAX(Pets)
  FROM @t
  GROUP BY Name
  ORDER BY Name;

FOR XML PATH

Quite easily my preferred method, at least in part because it is the only way to *guarantee* order without using a cursor or CLR. That said, this is a very raw version that fails to address a couple of other inherent problems I will discuss further on:

SELECT Name, Pets = STUFF((SELECT N', ' + Pet 
  FROM dbo.FamilyMemberPets AS p2
   WHERE p2.name = p.name 
   ORDER BY Pet
   FOR XML PATH(N'')), 1, 2, N'')
FROM dbo.FamilyMemberPets AS p
GROUP BY Name
ORDER BY Name;

I've seen a lot of people mistakenly assume that the new CONCAT() function introduced in SQL Server 2012 was the answer to these feature requests. That function is only meant to operate against columns or variables in a single row; it cannot be used to concatenate values across rows.

More on FOR XML PATH

FOR XML PATH('') on its own is not good enough – it has known problems with XML entitization. For example, if you update one of the pet names to include an HTML bracket or an ampersand:

UPDATE dbo.FamilyMemberPets
  SET Pet = N'Qui>gle&y'
  WHERE Pet = N'Quigley';

These get translated to XML-safe entities somewhere along the way:

Qui&gt;gle&amp;y

So I always use PATH, TYPE).value(), as follows:

SELECT Name, Pets = STUFF((SELECT N', ' + Pet 
  FROM dbo.FamilyMemberPets AS p2
   WHERE p2.name = p.name 
   ORDER BY Pet
   FOR XML PATH(N''), TYPE).value(N'.[1]', N'nvarchar(max)'), 1, 2, N'')
FROM dbo.FamilyMemberPets AS p
GROUP BY Name
ORDER BY Name;

I also always use NVARCHAR, because you never know when some underlying column will contain Unicode (or later be changed to do so).

You may see the following varieties inside .value(), or even others:

... TYPE).value(N'.', ...
... TYPE).value(N'(./text())[1]', ...

These are interchangeable, all ultimately representing the same string; the performance differences between them (more below) were negligible and possibly completely nondeterministic.

Another issue you may come across is certain ASCII characters that are not possible to represent in XML; for example, if the string contains the character 0x001A (CHAR(26)), you will get this error message:

Msg 6841, Level 16, State 1, Line 51
FOR XML could not serialize the data for node 'NoName' because it contains a character (0x001A) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.

This seems pretty complicated to me, but hopefully you don't have to worry about it because you're not storing data like this or at least you're not trying to use it in grouped concatenation. If you are, you may have to fall back to one of the other approaches.

Performance

The above sample data makes it easy to prove that these methods all do what we expect, but it is hard to compare them meaningfully. So I populated the table with a much larger set:

TRUNCATE TABLE dbo.FamilyMemberPets;
 
INSERT dbo.FamilyMemberPets(Name,Pet)
  SELECT o.name, c.name
  FROM sys.all_objects AS o
  INNER JOIN sys.all_columns AS c
  ON o.[object_id] = c.[object_id]
  ORDER BY o.name, c.name;

For me, this was 575 objects, with 7,080 total rows; the widest object had 142 columns. Now again, admittedly, I did not set out to compare every single approach conceived in the history of SQL Server; just the few highlights I posted above. Here were the results:

You may notice a couple of contenders missing; the UDF using DISTINCT and the recursive CTE were so off the charts that they would skew the scale. Here are the results of all seven approaches in tabular form:

Approach	Duration (milliseconds)
FOR XML PATH	108.58
CLR	80.67
Quirky Update	278.83
UDF (GROUP BY)	452.67
UDF (DISTINCT)	5,893.67
Cursor	2,210.83
Recursive CTE	70,240.58

Average duration, in milliseconds, for all approaches

Also note that the variations on FOR XML PATH were tested independently but showed very minor differences so I just combined them for the average. If you really want to know, the .[1] notation worked out fastest in my tests; YMMV.

Conclusion

If you are not in a shop where CLR is a roadblock in any way, and especially if you're not just dealing with simple names or other strings, you should definitely consider the CodePlex project. Don't try and re-invent the wheel, don't try unintuitive tricks and hacks to make CROSS APPLY or other constructs work just a little faster than the non-CLR approaches above. Just take what works and plug it in. And heck, since you get the source code too, you can improve upon it or extend it if you like.

If CLR is an issue, then FOR XML PATH is likely your best option, but you'll still need to watch out for tricky characters. If you are stuck on SQL Server 2000, your only feasible option is the UDF (or similar code not wrapped in a UDF).

Next Time

A couple of things I want to explore in a follow-on post: removing duplicates from the list, ordering the list by something other than the value itself, cases where putting any of these approaches into a UDF can be painful, and a slew of practical use cases for this functionality.

The post Grouped Concatenation in SQL Server appeared first on SQLPerformance.com.

08 Sep 17:43

How to ask questions politely and correctly

by Paul Randal

This post is a much-expanded version of the editorial in my last bi-weekly SQLskills Insider newsletter, including a bunch of the feedback I got.

(Also see my corollary post How to answer questions politely and correctly.)

Recently I’ve been complaining on Twitter about some of the ridiculous questions I receive over email (I get a bunch of SQL Server questions every day from random people across the world), where it was clear that the senders had done no research at all before sending me an email, or they demanded a response ASAP, or they didn’t say please or thank you, or a combination of all three. It gets pretty frustrating and it bugs me when people are demonstrably lazy and/or rude (usually without meaning to be). I see the same thing in forum questions too and occasionally on #sqlhelp on Twitter.

Don’t get me wrong, I like getting SQL Server questions in email, and I encourage class students to do it. I’m not being a grumpy old curmudgeon: I think people should learn how to ask questions properly. If you’re old enough to be responsible for SQL Server, you should have basic human-human communication skills.

I had feedback from the newsletter that some people might not know how to use Google/Bing. In this day and age, if someone is responsible for SQL Server in any way, shape, or form, and they don’t know how to search online, they shouldn’t be allowed near any I.T. infrastructure IMHO. I had a few people suggest sending Let Me Google That For You links as replies to emails where the answer can easily be found online, but I think that’s too snarky of a response and I don’t like to be rude.

I came up with a handy rule: if my 14-yr old smart daughter could find the answer on Google from your question, you’d shouldn’t post or ask in email.

But what about the concern of over-use of Google, so people never remember anything and rely on finding the answer online? And what if the answer is wrong? (Sometimes my life is this! :-) Yes, these are legitimate concerns. So maybe do the research and then ask a clarifying question or ask for corroboration from an authoritative source. And research the methodology for solving your problem, not just ‘the answer’. Teach yourself to fish rather than just finding the fish (mangling the old saying…)

And what about the concern about people wasting hours searching on Google rather than spending 5 minutes asking a question? That’s between you and your manager to work out how your company wants you to spend your time. But if you’re sending a random email question to someone *outside* your company, due diligence is not optional unless you’re paying for help.

This all brings to mind a training course from my days at Microsoft entitled Precision Questioning. The course taught how to hone your questioning technique for maximum efficacy, saving time for the questioner and the person/group being questioned. It also stressed the need to do due diligence in trying to find the answer to your question before taking up someone else’s time to help. I think something similar would be of great benefit to technical communities on the Internet where people ask questions.

Here are some quick guidelines for asking SQL Server questions I came up with (many of these apply to asking questions in general):

State your problem as unambiguously as possible.
State what research you’ve done before asking the question (this show’s you’ve done due diligence). You could even say what you searched for on Google/Bing.
State the SQL Server version/SP/CU/build you’re using.
State what you’ve done to try to alleviate your problem so far, if anything.
Ideally, include all Transact-SQL code to reproduce the phenomena (or, at least the details of the schema, indexes, query, query plan etc.) – just use your best judgment and include these as an annotated/commented script.
Include whatever results are pertinent.
Try to anticipate what people would ask to get more information, and provide answers in your question.
If you’re sending an unsolicited email to someone like me, consider this: if it’s going to take more than a minute or two to read the question, the answer’s going to be something like: this is too long to do in a quick email, please post on forum XYZ or let me know if you’d like some formal consulting help. Free questions are cool (to me), but don’t abuse the privilege. Note that this doesn’t apply if you know me or you’ve been invited to send questions (e.g. from being in a class). You might still get that same answer though, but it won’t be because you’ve abused the privilege.
Mandatory: say please and thank you.
Don’t demand urgent or immediate help. The only people who can ask for urgent assistance or use ASAP in their email are colleagues and people paying for help. Unless you’re in one of those categories, you’re not entitled to help – doesn’t matter how urgent your problem is. If your problem really is urgent, put yourself into one of those two categories. And even if you’re in one of those categories, make sure you frame the request so it isn’t too demanding as that can come across as rude.
DO NOT SEND THE SAME EMAIL QUESTION TO MULTIPLE PEOPLE INDIVIDUALLY. Seriously, this wastes people’s time by having multiple people respond to you. That’s what a forum is for. This is the #1 thing that annoys me and 100% guarantees a response stating that they will not get help from us (unless they’re paying, of course :-) Some people also expand this rule to include not posting the same question to multiple forums at the same time.

You might say to all of this that I have to consider the culture of the person asking the question and cut them some slack if they don’t say please or thank you, for instance. NO!!!! *They* need to consider *my* culture. That’s how communication works. The onus is on the person asking for help to make sure they’re communicating appropriately. This is *especially* true when sending random email. Along these lines, there’s a really great book that I recommend to anyone who interacts in a business setting with people outside their own culture: Kiss, Bow, or Shake Hands (and all its more focused companion titles).

If you’re using #sqlhelp on Twitter, most of this isn’t possible, and it’s accepted to not take up valuable tweet space with please and thank you. But asking a question over two tweets, using 1/2 and 2/2 is perfectly acceptable, and putting ‘Looked on Google already’ shows due diligence. You might be asked to post a longer question on a forum, in which case you should do so.

Whatever your chosen method for asking, make sure you respond to those who respond to you. Nothing puts people off from answering questions more than a complete lack of any thanks, or response to a clarifying question. And it’s just common courtesy (sadly lacking these days, especially in the relative anonymity of the internet).

I could sum all of this up by just asking people to be polite, conscientious, and grateful when asking questions.

If you’re reading this and thinking ‘OMG I sent a question to Paul and didn’t do those things, now he hates me’, you can stop panicking. If you really weren’t polite, demanded urgent help, or showed lack of any research, I would have let you know in my reply.

And don’t forget, this isn’t just for asking questions online, these guidelines should help within your company as well. Feel free to use this list, forward this post, or use it in a way that will help you or your company. I don’t mean this to be a marketing post, but a lot of what I have here is covered in my Communications: How to Talk, Write, Present, and Get Ahead! course on Pluralsight.

Now, where can you ask your question online?

Twitter, using the #sqlhelp hash tag
Forums like on SQL Server Central, and their dedicated Ask SQL Server Central site
MSDN’s SQL Server forums
SQL Sentry’s performance Q&A forum
Stack Overflow for general programming questions
Server Fault for general admin questions
Direct email to people who you know accept them and reply

And a whole bunch of other places where people give out help for free, but these are the main ones as far as I’m concerned.

To finish off, I leave you with two blog posts I wrote three years ago:

The more information you can provide, show you’ve tried on your own to find an answer, and are polite and grateful, the more likely you’ll get useful answers quickly. And if you’re sending direct email, you’d better do all of this if you want a reply. Often you’ll find that going through the rigmarole of putting all this into your question will help you get to the answer on your own, so problem solved!

You’ll be amazed at the great help you can get online for free… if you ask properly!

PS Happy to hear your thoughts on this. If I’ve missed anything I’ll update the post based on your comment.

The post How to ask questions politely and correctly appeared first on Paul S. Randal.

08 Sep 17:42

Trace flags - TF 1117

by Damian

I had a session about trace flags this year on the SQL Day 2014 conference that was held in Wrocław at the end of April. The session topic is important to most of DBA's and the reason I did it was that I sometimes forget about various trace flags :). So I decided to prepare a presentation but I think it is a good idea to write posts about trace flags, too.

Let's start then - today I will describe the TF 1117. I assume that we all know how to setup a TF using starting parameters or registry or in the session or on the query level. I will always write if a trace flag is local or global to make sure we know how to use it.

Why do we need this trace flag? Let’s create a test database first. This is quite ordinary database as it has two data files (4 MB each) and a log file that has 1MB. The data files are able to expand by 1 MB and the log file grows by 10%:

USE [master]

CREATE DATABASE [TF1117]

ON PRIMARY

( NAME = N'TF1117',

FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.SQL2014\MSSQL\DATA\TF1117.mdf' ,

SIZE = 4096KB ,

MAXSIZE = UNLIMITED,

FILEGROWTH = 1024KB

( NAME = N'TF1117_1',

FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.SQL2014\MSSQL\DATA\TF1117_1.ndf' ,

SIZE = 4096KB ,

MAXSIZE = UNLIMITED,

FILEGROWTH = 1024KB

)

LOG ON

( NAME = N'TF1117_log',

FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL12.SQL2014\MSSQL\DATA\TF1117_log.ldf' ,

SIZE = 1024KB ,

MAXSIZE = 2048GB ,

FILEGROWTH = 10%

)

Without the TF 1117 turned on the data files don’t grow all up at once. When a first file is full the SQL Server expands it but the other file is not expanded until is full. Why is that so important? The SQL Server proportional fill algorithm will direct new extent allocations to the file with the most available space so new extents will be written to the file that was just expanded. When the TF 1117 is enabled it will cause all files to auto grow by their specified increment. That means all files will have the same percent of free space so we still have the benefit of evenly distributed IO. The TF 1117 is global flag so it affects all databases on the instance. Of course if a filegroup contains only one file the TF does not have any effect on it.

Now let’s do a simple test. First let’s create a table in which every row will fit to a single page: The table definition is pretty simple as it has two integer columns and one character column of fixed size 8000 bytes:

create table TF1117Tab

(

col1 int,

col2 int,

col3 char (8000)

)

Now I load some data to the table to make sure that one of the data file must grow:

declare @i int

select @i = 1

while (@i < 800)

begin

insert into TF1117Tab values (@i, @i+1000, 'hello')

select @i= @i + 1

end

I can check the actual file size in the sys.database_files DMV:

SELECT name, (size*8)/1024 'Size in MB'

FROM sys.database_files

As you can see only the first data file was expanded and the other has still the initial size:

name Size in MB

--------------------- -----------

TF1117 5

TF1117_log 1

TF1117_1 4

There is also other methods of looking at the events of file autogrows. One possibility is to create an Extended Events session and the other is to look into the default trace file:

DECLARE @path NVARCHAR(260);

SELECT @path = REVERSE(SUBSTRING(REVERSE([path]),

CHARINDEX('\', REVERSE([path])), 260)) + N'log.trc'

FROM sys.traces

WHERE is_default = 1;

SELECT DatabaseName,

[FileName],

SPID,

Duration,

StartTime,

EndTime,

FileType =

CASE EventClass

WHEN 92 THEN 'Data'

WHEN 93 THEN 'Log'

END

FROM sys.fn_trace_gettable(@path, DEFAULT)

WHERE EventClass IN (92,93)

AND StartTime >'2014-07-12' AND DatabaseName = N'TF1117'

ORDER BY StartTime DESC;

After running the query I can see the file was expanded and how long did the process take which might be useful from the performance perspective.

Now it’s time to turn on the flag 1117.

DBCC TRACEON(1117)

I dropped the database and recreated it once again. Then I ran the queries and observed the results. After loading the records I see that both files were evenly expanded:

name Size in MB

--------------------- -----------

TF1117 5

TF1117_log 1

TF1117_1 5

I found also information in the default trace. The query returned three rows. The last one is connected to my first experiment when the TF was turned off. The two rows shows that first file was expanded by 1MB and right after that operation the second file was expanded, too. This is what is this TF all about J

08 Sep 17:42

Nepotism In The SQL Family

by Rob Farley

There’s a bunch of sayings about nepotism. It’s unpopular, unless you’re the family member who is getting the opportunity.

But of course, so much in life (and career) is about who you know.

From the perspective of the person who doesn’t get promoted (when the family member is), nepotism is simply unfair; even more so when the promoted one seems less than qualified, or incompetent in some way. We definitely get a bit miffed about that.

But let’s also look at it from the other side of the fence – the person who did the promoting. To them, their son/daughter/nephew/whoever is just another candidate, but one in whom they have more faith. They’ve spent longer getting to know that person. They know their weaknesses and their strengths, and have seen them in all kinds of situations. They expect them to stay around in the company longer. And yes, they may have plans for that person to inherit one day. Sure, they have a vested interest, because they’d like their family members to have strong careers, but it’s not just about that – it’s often best for the company as well.

I’m not announcing that the next LobsterPot employee is one of my sons (although I wouldn’t be opposed to the idea of getting them involved), but actually, admitting that almost all the LobsterPot employees are SQLFamily members… …which makes this post good for T-SQL Tuesday, this month hosted by Jeffrey Verheul (@DevJef).

You see, SQLFamily is the concept that the people in the SQL Server community are close. We have something in common that goes beyond ordinary friendship. We might only see each other a few times a year, at events like the PASS Summit and SQLSaturdays, but the bonds that are formed are strong, going far beyond typical professional relationships.

And these are the people that I am prepared to hire. People that I have got to know. I get to know their skill level, how well they explain things, how confident people are in their expertise, and what their values are. Of course there people that I wouldn’t hire, but I’m a lot more comfortable hiring someone that I’ve already developed a feel for. I need to trust the LobsterPot brand to people, and that means they need to have a similar value system to me. They need to have a passion for helping people and doing what they can to make a difference. Above all, they need to have integrity.

Therefore, I believe in nepotism. All the people I’ve hired so far are people from the SQL community. I don’t know whether I’ll always be able to hire that way, but I have no qualms admitting that the things I look for in an employee are things that I can recognise best in those that are referred to as SQLFamily.

…like Ted Krueger (@onpnt), LobsterPot’s newest employee and the guy who is representing our brand in America. I’m completely proud of this guy. He’s everything I want in an employee. He’s an experienced consultant (even wrote a book on it!), loving husband and father, genuine expert, and incredibly respected by his peers.

It’s not favouritism, it’s just choosing someone I’ve been interviewing for years.

@rob_farley

08 Sep 17:41

Third Party Applications and Other Acts of Violence Against Your SQL Server

by kevin

I just got finished reading a great blog post from my buddy, Thomas LaRock (t | b), in which he describes a useful personal policy he used to track changes made to his SQL Servers when installing third-party products. Note that I’m talking about line-of-business applications here – your inventory management systems and help desk ticketing apps. I’m not talking about monitoring and tuning applications since they, by their very nature, need a different sort of access to your back-end server resources. (Full disclosure: both Tom and I currently work for different tools vendors. But we’re both wearing our former enterprise DBA hats for this discussion).

You can read Tom’s blog post, as well as download the T-SQL script which checks for the most common vendor red flags, HERE. I strongly recommend that you read this post and utilize his script, or some similar technology such as Policy-Based Management, to accomplish the same ends. I penciled in a quick comment on Tom’s post. But as my comment began to lengthen, I realized it might make a nice supplement to his insights as a blog post of my own.

So here’s my addition to Tom’s post:

Third-Party Applications Also Raise a Red Flag for Things They DO NOT CHANGE in Your SQL Server, BUT SHOULD.

One area where I frequently regretted the need for vendor apps, when I worked as an enterprise DBA, was in security. Nothing sends quite as strong a message of “We don’t really care enough to work hard on this application” like an app which uses only the SA account for user access.

This practice of using SA for a line-of-business application is so bad on so many levels. My experiences showed that most apps that used only SA to access the database had many other problems. First of all, you could be certain that security was the least of their concerns and that there’d be other flagrant breaches of database security best practices. But often, upon deeper inspection, I would discover that use of SA as the only account for an application has the harbinger of database design issues, lurking performance problems, and lousy code. That one red flag foretold of very bad things to come with that vendor’s product.

Now in case you were hiding in a cave and/or refused to ever read the news, I want to remind you that SQL injection hacker attacks are one of the most damaging of all hacks on the Internet. And much of the time, those SQL injection attack happen because of sloppy coding practices such as using SA for standard transaction processing. Surely you mean “sloppy security practices”, Kevin? Nope. I mean CODING. The main reason these applications rely on SA is because the development team did not want to code a more robust authorization system. “Hey, let’s give it ALL to the end-user. They know what they’re doing, right?” Well, sometimes. But you can’t count on that assumption. And you can also assume that bad people who are not users will want to break in to the application. Again, there’s that word ‘harbinger’ again. It just sounds so fricken ominous, doesn’t it? But I digress…

The next time you are face with the buy versus build decision and the executives choose to buy, make sure that the application DOES NOT USE SA for standard transactional data processing. For example, if your company installs a new help desk ticketing application, make sure the application comes with at least a distinct account for data readers, a data writer, and for super-users, FOR THAT ONE DATABASE.

If you don’t know whether an application and its backend database use SA or not, be sure to check. (Tom’s script helps you do that). And if you’re able to influence future buy vs build decisions, be sure to make this a sticking point. Nothing helps an application vendor clean up their act, technologically speaking, like telling them WHY they are losing your business. You’ll make the world a better place.

So what do you think? Are there other things about installing a vendor database that raise a red flag when they do not change? What are they? I’d love to hear your comments.

Many thanks,

-Kevin

-Follow me on Twitter!
-Google Author
-More content at KevinEKline.com

The post Third Party Applications and Other Acts of Violence Against Your SQL Server appeared first on Kevin Kline.

31 Jul 22:00

A 24-Year-Old Scammed Apple 42 Times In 16 Different States

by timothy

redletterdave (2493036) writes "Sharron Laverne Parrish Jr., 24, allegedly scammed Apple not once, but 42 times, cheating the company out of more than $300,000 — and his scam was breathtakingly simple. According to a Secret Service criminal complaint, Parrish allegedly visited Apple Stores and tried to buy products with four different debit cards, which were all closed by his respective financial institutions. When his debit card was inevitably declined by the Apple Store, he would protest and offer to call his bank — except, he wasn't really calling his bank. So he would allegedly offer the Apple Store employees a fake authorization code with a certain number of digits, which is normally provided by credit card issuers to create a record of the credit or debit override. But that's the problem with this system: as long as the number of digits is correct, the override code itself doesn't matter."

Read more of this story at Slashdot.

31 Jul 21:52

Ask Slashdot: Is Running Mission-Critical Servers Without a Firewall Common?

by Soulskill

An anonymous reader writes: I do some contract work on the side, and am helping a client set up a new point-of-sale system. For the time being, it's pretty simple: selling products, keeping track of employee time, managing inventory and the like. However, it requires a small network because there are two clients, and one of the clients feeds off of a small SQL Express database from the first. During the setup, the vendor disabled the local firewall, and in a number of emails back and forth since (with me getting more and more aggravated) they went from suggesting that there's no need for a firewall, to outright telling me that's just how they do it and the contract dictates that's how we need to run it. This isn't a tremendous deal today, but with how things are going, odds are there will be e-Commerce worked into it, and probably credit card transactions... which worries the bejesus out of me. So my question to the Slashdot masses: is this common? In my admittedly limited networking experience, it's been drilled into my head fairly well that not running a firewall is lazy (if not simply negligent), and to open the appropriate ports and call it a day. However, I've seen forum posts here and there with people admitting they run their clients without firewalls, believing that the firewall on their incoming internet connection is good enough, and that their client security will pick up the pieces. I'm curious how many real professionals do this, or if the forum posts I'm seeing (along with the vendor in question) are just a bunch of clowns.

Read more of this story at Slashdot.

31 Jul 21:42

The Role of Technical Communities

by Artemakis Artemiou [MVP]

Everything in our world is governed by technology. From large corporations to small stores and simple individuals, everyone uses a computational system for performing the most complex or even the simplest task. Even though the level of technology’s involvement in our lives can be characterized by many as a controversial topic, it is a fact that technology is everywhere. The creation of new

31 Jul 21:42

An error occurred while executing batch. Error message is: Arithmetic Overflow

by Tim Radney

Recently an incident came across my desk where an end user was receiving an error trying to retrieve some records from the database. The error message the customer received was “An error occurred while executing batch. Error message is: Arithmetic Overflow”.

Experience has taught me that this message is related to an invalid dataset within a record or invalid column size. In this particular event the column was FLOAT and a data entry contain characters and symbols. Obviously an error in user entry or an import.

The issue that comes up is how can you query the data if it continues to give this error each time you try to query a set of data that contains that record. The easiest way is to convert the column to another data type. In my case I converted the FLOAT data type to NVARCHAR.

SELECT CONVERT(NVARCHAR(255),COLUMN) AS Column_Name
FROM TABLE
WHERE Value = ‘ABC123’

I am then able to query the dataset to my range of data to see the invalid record. I can then run an update statement to correct the values or null them out.

31 Jul 19:23

MacGyver Moments

by kevin

In case you haven’t heard, your MacGyver Moments are those times when you improvised an excellent solution to a problem using non-traditional materials, techniques, or tools – like the time I repaired my flux capacitor using bailing wire, chewing gum, the tears of Glenn Beck, and the sweat of a master ninja.

I’m a bit more like MacGruber than MacGyver

Necessity is the Mutha of Invention

Believe it or not, I’ve got a long history of doing things MacGyver style. To begin with, I grew up without two nickels to rub together. That means you have to improvise – a lot. Once I hit teenage years, I was constantly tinkering with all things mechanical and electrical trying to stretch their useful lifespan beyond any conception of “reasonable use”, much like my teenage dating experiences. For example, I once cobbled together garage workshop fan from the leftover parts of an electric pencil sharpener, a frame made of bailing wire (yes, bailing wire), solder, and handmade cardboard fan blades. Shamefully, it did not include any duct tape. But it worked great! I had to do that stuff all the time out of necessity, such as with my series of malfunctioning cars and dodgy electronics. Hey, my MacGyver experiences also taught me that admitting to owning a “Plymouth Horizon” is, when trying to impress the ladies, about the same as sporting a 4″ gravy stain on your “Howling Wolf” t-shirt.

Aaaah, the Plymouth – a vehicle truly designed somewhere between the third and fifth ring of Hades. I can’t even count the number of jury-rigged fixes I put into that thing. One that stands out clearly, because my friends used this to prank me on several occasions, was the repair to the broken door handle on the driver’s side. Since I couldn’t afford replacement parts, I used bailing wire (I could always count on you, my old friend) fixed to the inner locking mechanism and a key ring as a handle dangling invisibly from the bottom door sill. About a year after that (around 1986 or so), I discovered junk yards and was able to rip a barely used mechanism from a Dodge of the same body style for $3. I became pretty well known at those junk yards – I blame Chrysler. Quality was not job 1 in those days.

Creativity and Innovation can be like that. No buck teeth in my case.

Geek Creativity

Like my friend Brent Ozar, I played a lot of Dungeons & Dragons (D&D) growing up. I was always the “DM”, that is, the guy who administrated the game, while all my friends played characters on the adventure. But D&D, like many good products, was designed to suck the dollars right out of your product by producing an endless stream of new adventures to play and supplements to purchase. I, on the other hand, wasn’t gonna spend a dime of my money on that. So I created my own adventures (called modules). As I got better, I constructed campaigns (i.e a related sets of modules) and from that, a related set of campaigns into an entire Tolkien-like world (this is called a mileau) which I called Aquilonia. My buddies loved it and, realizing that bragging about geekness creates an inverse coolness effect, can brag that I won some contests at regional conventions for game design.

Code Creativity

Back in the day, I remember wondering if I should stop doing things differently than my buddies and peers. For example, we were assigned to write a program that would find the day of the week based on passing in any pre-Y2K date in one of my COBOL college courses. (No mocking please – COBOL was big back then). Everyone in the class, and I mean everyone, wrote very large programs that delivered the day of the week through very large (and, imo, cumbersome) IF-THEN-ELSE structures. My program had two elegant WHILE loops and thirty-eight lines of code based on the premise that our calendar repeats every 14 years and that all months contain at least 28 days and at most 31 days. I got an A+.

SQL Server Improvisation

Back when my IT shop had bragging privileges as one of the largest enterprise installations of Microsoft SQL Server (in the v6.0 and v6.5 days), we were faced with implementing some sort of way to do bi-directional replication. The product did a good job of standard transactional replication, but bi-directional replication was only possible through a major rewrite of the application. Instead, we cooked up a way to make it work – SQL Server would allow us do outward transactional replication as the transactions occurred. We were able to bring the data back by also enabling sync replication (which is non-transactional) every evening. Thus, we got bi-directional data flows without spending hundreds of thousands of dollars on third-party tools. (I must also give kudos to the team I worked with and who made this solution work, particularly Dwayne Seiber, who is still lead DBA there.)

Summary

Creativity, innovation, and improvisation are what you make of them. You must make the choice to pursue them. Yes, necessity often forces to seek out MacGyver moments. So does laziness in some cases. How many master developers do you know who’ll work very hard to fully conceptualize their program before starting to write it, saving time in the long run. Now that’s creative laziness! But MacGyver moments also stretch our abilities and call us to higher achievements.

The post MacGyver Moments appeared first on Kevin Kline.

31 Jul 19:22

Scripting out Contained Database Users

by Argenis

Today’s blog post comes from a Twitter thread on which @SQLSoldier, @sqlstudent144 and @SQLTaiob were discussing the internals of contained database users.

Unless you have been living under a rock, you’ve heard about the concept of contained users within a SQL Server database (hit the link if you have not). In this article I’d like to show you that you can, indeed, script out contained database users and recreate them on another database, as either contained users or as good old fashioned logins/server principals as well.

Why would this be useful? Well, because you would not need to know the password for the user in order to recreate it on another instance. I know there is a limited number of scenarios where this would be necessary, but nonetheless I figured I’d throw this blog post to show how it can be done.

A more obscure use case: with the password hash (which I’m about to show you how to obtain) you could also crack the password using a utility like hashcat, as highlighted on this SQLServerCentral article.

The Investigation

SQL Server uses System Base Tables to save the password hashes of logins and contained database users. For logins it uses sys.sysxlgns, whereas for contained database users it leverages sys.sysowners.

I’ll show you what I do to figure this stuff out: I create a login/contained user, and then I immediately browse the transaction log with, for example, fn_dblog.

It’s pretty obvious that only two base tables touched by the operation are sys.sysxlgns, and also sys.sysprivs – the latter is used to track permissions. If I connect to the DAC on my instance, I can query for the password hash of this login I’ve just created.

A few interesting things about this hash. This was taken on my laptop, and I happen to be running SQL Server 2014 RTM CU2, which is the latest public build of SQL Server 2014 as of time of writing. In 2008 R2 and prior versions (back to 2000), the password hashes would start with 0x0100. The reason why this changed is because starting with SQL Server 2012 password hashes are kept using a SHA512 algorithm, as opposed to SHA-1 (used since 2000) or Snefru (used in 6.5 and 7.0). SHA-1 is nowadays deemed unsafe and is very easy to crack.

For regular SQL logins, this information is exposed through the sys.sql_logins catalog view, so there is really no need to connect to the DAC to grab an SID/password hash pair. For contained database users, there is (currently) no method of obtaining SID or password hashes without connecting to the DAC.

If we create a contained database user, this is what we get from the transaction log:

Note that the System Base Table used in this case is sys.sysowners. sys.sysprivs is used as well, and again this is to track permissions.

To query sys.sysowners, you would have to connect to the DAC, as I mentioned previously. And this is what you would get:

There are other ways to figure out what SQL Server uses under the hood to store contained database user password hashes, like looking at the execution plan for a query to sys.dm_db_uncontained_entities (Thanks, Robert Davis!)

SIDs, Logins, Contained Users, and Why You Care…Or Not.

One of the reasons behind the existence of Contained Users was the concept of portability of databases: it is really painful to maintain Server Principals (Logins) synced across most shared-nothing SQL Server HA/DR technologies (Mirroring, Availability Groups, and Log Shipping). Often times you would need the Security Identifier (SID) of these logins to match across instances, and that meant that you had to fetch whatever SID was assigned to the login on the principal instance so you could recreate it on a secondary. With contained users you normally wouldn’t care about SIDs, as the users are always available (and synced, as long as synchronization takes place) across instances.

Now you might be presented some particular requirement that might specify that SIDs synced between logins on certain instances and contained database users on other databases. How would you go about creating a contained database user with a specific SID?

The answer is that you can’t do it directly, but there’s a little trick that would allow you to do it.

Create a login with a specified SID and password hash, create a user for that server principal on a partially contained database, then migrate that user to contained using the system stored procedure sp_user_migrate_to_contained, then drop the login.

CREATE LOGIN <login_name> WITH PASSWORD = <password_hash> HASHED, SID = <sid> ;

GO

USE <partially_contained_db>;

GO

CREATE USER <user_name> FROM LOGIN <login_name>;

GO

EXEC sp_migrate_user_to_contained @username = <user_name>, @rename = N’keep_name’, @disablelogin = N‘disable_login’;

GO

DROP LOGIN <login_name>;

GO

Here’s how this skeleton would look like in action:

And now I have a contained user with a specified SID and password hash. In my example above, I renamed the user after migrated it to contained so that it is, hopefully, easier to understand.

Enjoy!

31 Jul 19:21

The 2008 R2 Bug Fix That Breaks RCSI

by Paul White

One of the fixes included in Cumulative Update 11 for SQL Server 2008 R2 Service Pack 2 addresses an "incorrect deadlock" that can occur in a specific scenario (explained later in this article). Unfortunately, the fix introduces a new bug, where SELECT queries under RCSI (read committed snapshot isolation) start taking table-level intent-shared locks. As a consequence, you may see increased blocking (and potentially deadlocking) for RCSI queries after applying 2008 R2 SP2 CU11 (or later).

This will come as an unwelcome surprise to anyone accustomed to readers not blocking writers (and vice-versa) when using RCSI. There is no fix for the RCSI bug at the time of writing. In fact, the Connect item created by Eugene Karpovich to report the issue has been closed as "Won't Fix", though I understand this decision is currently under review.

Ordinarily, this issue might not be such a huge concern, because cumulative updates are generally not as widely applied as full service packs. However, Microsoft recently announced there will be a Final Service Pack 3 for SQL Server 2008 R2. This service pack will be a simple roll up of existing SP2 cumulative updates (up to and including CU13) but with no new fixes. The result of all this is that, unless something changes in the meantime, users applying SP3 will suddenly start being affected by the RCSI bug introduced in CU11.

edit: Just before this article was published, Microsoft confirmed this regression will be fixed in SP3.

The same "incorrect deadlock" bug (whose fix introduces the new bug) was also fixed in Cumulative Update 8 for SQL Server 2012 Service Pack 1 as described in KB2923460. The fix for SQL Server 2012 is different, and does not introduce the new RCSI problem.

SQL Server 2014 was never affected by either issue, as far as I can tell. There is certainly no documentation to indicate otherwise, and the tests I have performed on 2014 RTM, CU1, and CU2 do not reproduce either bug.

The 2008 R2 RCSI Bug

A SELECT query running under RCSI typically takes only a schema stability (Sch-S) lock, which is compatible with all other locks except a schema modification (Sch-M) lock. When CU11 (or later) is applied to a SQL Server 2008 R2 instance, these queries start taking a table-level intent-shared (Tab-IS) lock. The following test script can be used to demonstrate the difference in behaviours:

USE master;
GO
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET NOCOUNT ON;
GO
CREATE DATABASE RCSI;
GO
ALTER DATABASE RCSI
SET READ_COMMITTED_SNAPSHOT ON;
GO
ALTER DATABASE RCSI
SET ALLOW_SNAPSHOT_ISOLATION OFF;
GO
USE RCSI;
GO
CREATE TABLE dbo.Test
(
    id integer IDENTITY NOT NULL, 
    col1 integer NOT NULL,
 
    CONSTRAINT PK_Test
    PRIMARY KEY CLUSTERED (id)
);
GO
INSERT dbo.Test 
    (col1) 
VALUES 
    (1), (2), (3), (4);
GO
-- Show locks
DBCC TRACEON (1200, 3604, -1) WITH NO_INFOMSGS;
SELECT * FROM dbo.Test;
DBCC TRACEOFF (1200, 3604, -1) WITH NO_INFOMSGS;
GO
ALTER DATABASE RCSI
SET SINGLE_USER
WITH ROLLBACK IMMEDIATE;
 
USE master;
 
DROP DATABASE RCSI;

When run against an instance of SQL Server 2008 R2 without the bug, the debug output shows a single Sch-S lock taken for the test statement as expected:

Process acquiring Sch-S lock on OBJECT: 7:2105058535:0 result: OK
Process releasing lock on OBJECT: 7:2105058535:0

When run against SQL Server 2008 R2 build 10.50.4302 (or higher) the output is similar to:

Process acquiring IS lock on OBJECT: 7:2105058535:0 result: OK
Process releasing lock on OBJECT: 7:2105058535:0

Notice the Sch-S lock has been replaced by a Tab-IS lock.

Implications and Mitigations

An intent-shared (IS) lock is still a very compatible lock, but it is not quite as concurrency-friendly as Sch-S. The lock compatibility matrix shows that an IS lock conflicts with:

Sch-M (schema modification) – as per Sch-S
BU (bulk update)
X (exclusive)

The incompatibility with exclusive (X) locks means a read under RCSI will block if a concurrent process holds an exclusive lock on the same resource. Likewise, a writer that needs an exclusive lock will block if a concurrent RCSI reader holds an IS lock. Excusive locks are obtained whenever data is modified, and held to the end of the transaction, so the effect of the bug is that readers under RCSI will be blocked by concurrent writers (and vice versa) when they weren't before CU11 was applied.

A significant mitigating factor is that the bug only causes a table-level intent-shared lock to be taken. A concurrent writer that needs a table-level exclusive lock will cause blocking (and potentially a deadlock). However, concurrent writers that only require exclusive locks at a lower (e.g. row, page or partition) level will not cause blocking or a deadlock. At the table level, these writers will only acquire an intent-exclusive (IX) lock, which is compatible with Tab-IS. The exclusive locks taken at lower levels of granularity will not cause a conflict.

In most systems, table-level exclusive (Tab-X) locks will be relatively uncommon. Unless explicitly requested using a TABLOCKX hint, some possible causes of a Tab-X lock are:

Lock escalation from a lower granularity
Using SERIALIZABLE without a supporting index for key-range locks

A technical workaround is to add the (redundant) table hint WITH (READCOMMITTED) to every table in every query that runs under RCSI. This happens to bypass the bug so only a Sch-S lock is taken, but it is hardly a practical proposition.

Despite these mitigations, taking Tab-IS for a read-only query under RCSI is still incorrect behaviour. I hope it can be fixed for SQL Server 2008 R2 before Service Pack 3 is released.

The "Incorrect Deadlock" Bug

As mentioned earlier, the RCSI bug is introduced as a side-effect of a fix for an "incorrect deadlock" bug. This earlier issue is documented for SQL Server 2008 R2 in KB2929464 and for SQL Server 2012 in KB2923460. Neither document is a model of clarity (or accuracy), but the underlying issue is quite interesting, so I want to spend a bit of time looking at it here.

Essentially, the deadlock occurs when:

Three or more concurrent transactions read from the same table
The UPDLOCK and TABLOCK hints are used in all three cases
The database setting READ_COMMITTED_SNAPSHOT is ON

Note that it does not matter which isolation level the transactions run under. To reproduce the bug, first run the setup script below:

USE master;
GO
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
GO
CREATE DATABASE IncorrectDeadlock;
GO
ALTER DATABASE IncorrectDeadlock 
SET READ_COMMITTED_SNAPSHOT ON;
GO
USE IncorrectDeadlock;
GO
CREATE TABLE dbo.Test
(
    id integer IDENTITY NOT NULL, 
    col1 integer NOT NULL,
 
    CONSTRAINT PK_Test
    PRIMARY KEY CLUSTERED (id)
);
GO
INSERT dbo.Test 
    (col1) 
VALUES 
    (1);

Next, run the following script in three separate connections (note that the transaction is left open):

USE IncorrectDeadlock;
GO
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
GO
BEGIN TRANSACTION;
SELECT
    T.id,
    T.col1
FROM dbo.Test AS T
    WITH (UPDLOCK, TABLOCK);

At this point, the first session will have returned a result set and the other two will be blocked. The "incorrect deadlock" arises when the first session completes its transaction (either committing or rolling back). When this occurs, one of the other two sessions will report a deadlock:

Deadlock error

The deadlock occurs because the two previously-blocked sessions hold Tab-IX (table-level intent-exclusive) and both want to convert their lock to Tab-X (table-level exclusive). Tab-IX is compatible with another Tab-IX, but not Tab-X. This is a conversion deadlock (and the irony here is that UPDLOCK is often used to avoid conversion deadlocks).

Feel free to vary the transaction isolation level for the three queries as you wish. The deadlock will occur regardless, so long as RCSI is enabled, with the same locks involved. When the tests are complete, remove the test database:

USE IncorrectDeadlock;
 
ALTER DATABASE IncorrectDeadlock
SET SINGLE_USER 
WITH ROLLBACK IMMEDIATE;
 
USE master;
 
DROP DATABASE IncorrectDeadlock;

Analysis and Explanation

I personally don't recall ever having used UPDLOCK and TABLOCK together in my code. To me, this combination of hints seems intuitively odd because SQL Server does not have a table-level update lock. So, what does it even mean to specify UPDLOCK and TABLOCK hints together?

The documentation has this to say:

UPDLOCK

Specifies that update locks are to be taken and held until the transaction completes. UPDLOCK takes update locks for read operations only at the row-level or page-level. If UPDLOCK is combined with TABLOCK, or a table-level lock is taken for some other reason, an exclusive (X) lock will be taken instead.

This suggests that the hint combination ought to result in a single exclusive table lock. In fact, this is not quite the whole story:

In SQL Server 2000, combining UPDLOCK and TABLOCK hints results in Tab-S (a shared table lock) being taken followed by conversion to Tab-X (exclusive table lock) under all isolation levels except READ UNCOMMITTED. This sequence of locks can result in a deadlock where three or more sessions are involved: two sessions acquire Tab-S and both wait on the other to convert to Tab-X. Under READ UNCOMMITTED, SQL Server 2000 takes Sch-S then Tab-X, which is not prone to deadlock (just normal blocking).

In SQL Server 2005 onward (without the bug fix) the locks taken depend only on whether RCSI is enabled or not. If RCSI is enabled, all isolation levels take Tab-IX then convert to Tab-X. This sequence causes the deadlock the bug fix addresses.

If RCSI is not enabled, the matching isolation levels behave as they did under SQL Server 2000 (taking Tab-S then converting to Tab-X). The (new for 2005) snapshot isolation level takes Sch-S followed by Tab-X. As a consequence, SI and READ UNCOMMITTED are the only isolation levels not prone to this deadlock in the UPDLOCK, TABLOCK scenario when RCSI is not enabled.

The Deadlock Fix

The fix changes the locks taken when UPDLOCK and TABLOCK are specified together, for all isolation levels, and regardless of whether RCSI is enabled or not. After the fix is applied, UPDLOCK and TABLOCK cause the engine to acquire Tab-SIX (table-level shared with intent exclusive), which is then converted to Tab-X.

This avoids the deadlock scenario because Tab-SIX is incompatible with another Tab-SIX. Remember, the deadlock occurred when two processes held Tab-IX waiting to convert to Tab-X. With Tab-IX replaced by Tab-SIX, it is not possible for both to hold Tab-SIX at the same time. The result is a normal blocking scenario instead of a deadlock.

Final Thoughts

The "incorrect deadlock" fix does resolve one particular deadlock scenario, but it still does not result in the behaviour I imagine the people specifying UPDLOCK and TABLOCK envisaged. If SQL Server did have a Tab-U (table-level update) lock, it would prevent concurrent changes to the table but allow concurrent readers. This is what I imagine the intent of people using these hints together would be, and I can see how it might be useful.

The current implementation (where Tab-X is ultimately taken instead of the missing Tab-U) does not match this expectation because Tab-X prevents concurrent reads (unless a row-versioning isolation level is used). We might as well specify TABLOCKX in many cases. The fact that the fix also introduces a new bug (for users of SQL Server 2008 R2 only) is also unfortunate, particularly if the bug goes on to be included in 2008 R2 SP3.

Note that the deadlock fix is not being made available for SQL Server versions prior to 2008 R2. These versions will continue to have the complex locking behaviour for UPDLOCK and TABLOCK as described above.

My thanks to Eugene Karpovich who first brought this issue to my attention in a comment to my article on Data Modifications Under RCSI.

The post The 2008 R2 Bug Fix That Breaks RCSI appeared first on SQLPerformance.com.

31 Jul 19:20

Transitioning from SMP to MPP, the why and the how

by SQL Server Team

This blog post was authored by: Sahaj Saini, PM on the Microsoft Analytics Platform System (APS) team.

In this blog post, we’ll provide a quick overview of Symmetric Multi-Processing (SMP) vs. Massively Parallel Processing (MPP) systems, how to identify triggers for migrating from SMP to MPP, key considerations when moving to Microsoft Analytics Platform System (APS), and a discussion about how to take advantage of the power of an MPP solution such as APS.

Let us begin with a scenario. Emma is the Database Administrator at Adventure Works Cycles, a bicycle manufacturing company. At Adventure Works, Emma and her team are using traditional SQL Server SMP as their data warehousing solution. The company has been growing rapidly and with growing competition in the bicycle industry, the business analysts at Adventure Works Cycles would like quicker insight into their data. Emma is now facing the following challenges with the SMP deployment –

High Data Volume and Data Growth: With increasing sales and a growing customer base, the data volume has grown rapidly to cross 10 TB.
Longer Data Loading/ETL times: With the need to produce daily reports to management, Emma finds the current ETL speed inadequate to intake and process the increasing quantity of data flowing from other OLTP and non-relational systems.
Slow Query Execution: Query execution times are slowing down due to the increase of data and it is becoming increasingly difficult to generate insights for daily reporting in a timely manner.
Long Cube Processing Time: With the current cube processing time, it is difficult to meet the real-time reporting needs of the company.

In order to overcome these challenges, Emma and her team evaluate the purchase of a larger, expensive and more powerful set of server and storage hardware to their datacenter. This approach would solve their problem but only for the short-term as the data growth is expected to explode in the next 12 months. With data growth that Adventure Works is expecting to see, even the bigger and more powerful SMP solutions would hit a wall very quickly. Emma would like to see a solution that scales as their data needs grow.

What’s the difference between SMP and MPP?

Before we jump into solving Emma’s problems, let’s quickly define what SMP and MPP are. Symmetric Multi-Processing (SMP) is a tightly coupled multiprocessor system where processors share resources – single instances of the Operating System (OS), memory, I/O devices and connected using a common bus. SMP is the primary parallel architecture employed in servers and is depicted in the following image.

Massively Parallel Processing (MPP) is the coordinated processing of a single task by multiple processors, each processor using its own OS and memory and communicating with each other using some form of messaging interface. MPP can be setup with a shared nothing or shared disk architecture.

In a shared nothing architecture, there is no single point of contention across the system and nodes do not share memory or disk storage. Data is horizontally partitioned across nodes, such that each node has a subset of rows from each table in the database. Each node then processes only the rows on its own disks. Systems based on this architecture can achieve massive scale as there is no single bottleneck to slow down the system. This is what Emma is looking for.

MPP with shared-nothing architecture is depicted in the following image.

Microsoft Parallel Data Warehouse (PDW) running on a Microsoft Analytics Platform System appliance is implemented as an MPP shared-nothing architecture. It consists of one control node and storage attached compute nodes inter-connected by Ethernet and Infiniband. The control node hosts the PDW engine – the brains of the MPP system – that creates parallel query plans, co-ordinates query execution on compute nodes, and data aggregation across the entire appliance. All nodes, including control and compute, host a Data Movement Service (DMS) to transfer data between nodes.

For more details on PDW architecture, you can read the Architecture of the Microsoft Analytics Platform System post.

Transitioning to MPP

To realize the value offered by MPP, Emma and her team purchase a Microsoft APS appliance and begin transitioning to MPP. Let’s take a look at how they adapt their solution to take full advantage of APS’s shared nothing MPP architecture.

Table Design

As previously mentioned, APS is based on a shared nothing MPP architecture which means that nodes are self-sufficient and do not share memory or disks. The architecture, therefore, requires you to distribute your large tables across nodes to get the benefits of the massively parallel processing. APS allows the definition of a table as either distributed or replicated. The decision to choose one versus the other depends on the volume of data and the need for access to all of the data on a single node.

Distributed Tables

A distributed table is one where row data within the table is distributed across the nodes within the appliance to allow for massive scale. Each row ends up in a one distribution in one compute node as depicted by the image below.

To take advantage of the distributed nature of APS, Emma modifies the large tables, typically Fact and large dimension tables, to be distributed in APS as follows:

CREATE TABLE [dbo].[FactInternetSales]
(
  [ProductKey] [int] NOT NULL,
  [OrderDateKey] [int] NOT NULL,
  .
  .
  [ShipDate] [datetime] NULL
) 
WITH
(
  DISTRIBUTION = HASH(ProductKey),
CLUSTERED COLUMNSTORE INDEX
);

As you can see, this is a typical DDL statement for table creation with a minor addition for distributed tables. Tables are distributed by a deterministic hash function applied to the Distribution Column chosen for that table. Emma chooses Product Key as the distribution column in the FactInternetSales table because of the high cardinality and absence of skew, therefore distributing the table evenly across nodes.

Replicated Tables

If all tables were distributed, however, it would require a great deal of data movement between nodes before performing join operations for all operations. Therefore, for smaller dimension tables such as language, countries etc. it makes sense to replicate the entire table on each compute node. That is to say, the benefits of enabling local join operations with these tables outweigh the cost of extra storage consumed. A replicated table is one that is replicated across all compute nodes as depicted below.

Emma designs the small tables, typically dimension tables, to be replicated as follows:

 CREATE TABLE [dbo].[DimDate](
  [DateKey] [int] NOT NULL,
  .
  .
  [SpanishDayNameOfWeek] [nvarchar](10) NOT NULL,
)
WITH
(
CLUSTERED COLUMNSTORE INDEX
);

By appropriately designing distributed and replicated tables, Emma aligns her solution with common MPP design best practices and enables efficient processing of high volumes of data. For example, a query against 100 billion rows in a SQL Server SMP environment would require the processing of all of the data in a single execution space. With MPP, the work is spread across many nodes to break the problem into more manageable and easier ways to execute tasks. In a four node appliance (see the picture above), each node is only asked to process roughly 25 billion rows – a much quicker task. As a result, Emma observes significant improvements to the query execution time and her business can now make better decisions, faster. Additionally, Emma can grow the data warehouse to anywhere from a few terabytes to over 6 petabytes of data in by adding “scale units” to APS.

Data Loading

With SQL Server SMP, Emma and her team were using ETL processes via a set of SSIS packages to load data into the data warehouse – (1) Extracting data from the OLTP and other systems; (2) Transforming the data into dimensional format; and (3) Loading the data to target dimension or fact tables in the Data Warehouse. With increasing volumes of data, the SSIS sever in the middle becomes a bottleneck while performing transformations, resulting in slow data loading.

With APS, Emma and her team can use ELT instead, to Extract the data from the OLTP and other systems and Load it to a staging location on APS. Then, the data can be Transformed into dimensional format not with SSIS but with the APS Engine utilizing the distributed nature of the appliance and the power of parallel processing. In a 4-node appliance, four servers would be doing the transformations on subsets of data versus the single node SSIS server.

This parallel processing results in a significant boost in data loading performance. Emma can then use the Create Table As Select (CTAS) statement to create the table from the staging table as follows.

CREATE TABLE [dbo].[DimCustomer] 
WITH
(
  CLUSTERED COLUMN INDEX,
  DISTRIBUTION = HASH (CustomerKey)
)
AS
SELECT * FROM [staging].[DimCustomer];

By switching to an ELT process, Emma utilizes the parallel processing power of APS to see performance gains in data loading.

In conclusion, Emma and her team have found answers to their SMP woes with MPP. They can now feel confident handling the data volume and growth at Adventure Works with the ability to scale the data warehouse as needed. With ELT and the power of parallel processing in APS, they can load data into APS faster and within the expected time-window. And by aligning with APS’s MPP design, they can achieve breakthrough query performance, allowing for real-time reporting and insight into their data.

Visit the Analytics Platform System page to access more resources including: datasheet, video, solution brief, and more..

To learn more about migration from SQL Server to the Analytics Platform System

31 Jul 19:20

PASS Summit 2014: Inside the World’s Largest Gathering of SQL Server and BI Professionals

by SQL Server Team

Mrdenny
now

PASS VP of Marketing Denise McInerney – a SQL Server MVP and Data Engineer at Intuit – began her career as a SQL Server DBA in 1998 and attended her first PASS Summit in 2002. The SQL Server Team caught up with her ahead of this year’s event, returning to Seattle, WA, Nov. 4-7, to see what she’s looking forward to at the world’s largest conference for SQL Server and BI professionals.

For those who’ve never attended or who’ve been away for a while, what is PASS Summit?
PASS Summit is the world’s largest gathering of Microsoft SQL Server and BI professionals. Organized by and for the community, PASS Summit delivers the most technical sessions, the largest number of attendees, the best networking, and the highest-rated sessions and speakers of any SQL Server event.

We like to think of PASS Summit as the annual reunion for the #sqlfamily. With over 200 technical sessions and 70+ hours of networking opportunities with MVPs, experts and peers, it’s 3 focused days of SQL Server. You can take hands-on workshops, attend Chalk Talks with the experts, and get the answers you need right away at the SQL Server Clinic, staffed by the Microsoft CSS and SQLCAT experts who build and support the features you use every day. Plus, you can join us early for 2 days of pre-conference sessions with top industry experts and explore the whole range of SQL Server solutions and services under one roof in the PASS Summit Exhibit Hall.

Nowhere else will you find over 5,000 passionate SQL Server and BI professionals from 50+ countries and 2,000 different companies connecting, sharing, and learning how to take their SQL Server skills to the next level.

What’s on tap this year as far as sessions?
We’ve announced a record 160+ incredible community sessions across 5 topic tracks: Application and Database Development, BI Information Delivery, BI Platform Architecture, Development and Administration; Enterprise Database Administration and Deployment, and Professional Development. And watch for over 60 sessions from Microsoft’s top experts to be added to the lineup in early September.

You can search by speaker, track, session skill level, or session type – from 10-minute Lightning Talks, to 75-minute General Sessions, to 3-hour Half-Day Sessions and our full-day pre-conference workshops.

And with this year’s new Learning Paths, we’ve made it even easier to find the sessions you’re most interested in. Just use our 9 Learning Path filters to slice and dice the lineup by everything from Beginner sessions to Big Data, Cloud, Hardware Virtualization, and Power BI sessions to SQL Server 2014, High Availability/Disaster Recovery, Performance, and Security sessions.

Networking is at the heart of PASS Summit – what opportunities do you have for attendees to connect with each other?
PASS Summit is all about meeting and talking with people, sharing issues and solutions, and gaining knowledge that will make you a better SQL Server professional. Breakfasts, lunches, and evening receptions are all included and are designed to offer dedicated networking opportunities. And don't underestimate the value of hallway chats and the ability to talk to speakers after their sessions, during lunches and breaks, and at the networking events.

We have special networking activities for first-time attendees, for people interested in the same technical topics at our Birds of a Feather luncheon, and at our popular annual Women in Technology luncheon, which connects 600+ attendees interested in advancing role of women in STEM fields. Plus, our Community Zone is THE place to hang out with fellow attendees and community leaders and learn how to stay involved year-round.

You mentioned the networking events for first-time attendees. With everything going on at Summit, how can new attendees get the most out of their experience?
Our First-Timers Program takes the hard work out of conference prep and is designed specifically to help new attendees make the most of their time at Summit. We connect first-timers with conference alumni, take them inside the week with community webinars, help them sharpen their networking skills through fun onsite workshops, and share inside advice during our First Timers orientation meeting.

In addition, in our “Get to Know Your Community Sessions,” longtime PASS members share how to get involved with PASS and the worldwide #sqlfamily, including encouraging those new to PASS to connect with their local SQL Server communities through PASS Chapters and continue their learning through Virtual Chapters, SQLSaturdays, and other free channels.

How can you learn more about sessions and the overall PASS Summit experience?
A great way to get a taste of Summit is by watching PASS Summit 2013 sessions, interviews, and more on PASStv. You can also check out the best of last year’s Community blogs.

Plus, stay tuned for 24 Hours of PASS: Summit Preview Edition on September 9 to get a free sneak peek at some of the top sessions and speakers coming to PASS Summit this year. Make sure you follow us on Twitter at @PASS24HOP / #pass24hop for the latest updates on these 24 back-to-back webinars.

Where can you register for PASS Summit?
To register, just go to Register Now – and remember to take advantage of the $150 discount code from your local or Virtual PASS Chapter. We also have a great group discount for companies sending 5 or more employees. And don’t forget to purchase the session recordings for year-round learning on all aspects of SQL Server.

Once you get a taste for the learning and networking waiting for you at PASS Summit, we invite you to join the conversation by following us on Twitter (watch the #sqlpass #summit 14 hashtags) and joining our Facebook and LinkedIn groups. We’re looking forward to an amazing, record-breaking event, and can’t wait to see everyone there!

Please stay tuned for regular updates and highlights on Microsoft and PASS activities planned for this year’s conference.

31 Jul 18:49

CodeSOD: The Joys of Interdisciplinary Work

by Erik Gern

Lisa thought that the Modesto Biology Institute was the perfect working environment. The scientists who showed her around were all friendly, not the "evil, lab-coated villains" portrayed in Fritz Lang films. The lab director, Howard, pointed out the lack of horror monsters in their lab after Lisa joked about it during her interview.

"See?" Howard said, gesturing. "You won't find anything scarier than a petri dish in here ... except for grant applications." He looked disgusted at the suggestion.

"I'll take your word for it," Lisa said, laughing.

But there were horrors lurking in the lab ... just not the kind that grow in petri dishes.

SQL Swarm

Lisa heard a plaintive cry down the hall one day. She found Howard looked forlornly at his monitor.

"It's our web interface," he said. "We use it to log experimental data. I wrote it myself a few years back, but it's been giving me so much trouble lately. That's the third time the server's crashed today. I don't see why it keeps choking up every time we add a new dataset."

"Well, that's why I'm here," Lisa said. "Mind giving me an hour or two to see what's wrong?"

The web interface used a custom Java library -- also written by Howard -- that spoke to an Oracle database. Whereas she expected to see a few tables (one for data, one for experiments, and maybe one for additional metadata), she instead found over 3000: two tables for each experiment ever conducted at Modesto.

"Howard," Lisa said, "can I prune any of these tables?"

"Afraid not. We can't afford to lose any of that data. Our funding would go out the window." He shuddered.

Lisa instead optimized some confiurations in Oracle. She tried not to think of the thousands of tables that lay waiting in that database.

`System.out` Basilisk

"Lisa, could you give me a hand? I'm trying to add a file input field to our spreadsheet uploader, but it keeps rendering outside of the form."

By now Lisa had gotten her hands dirty in the web interface. She had finally convinced Howard to combine the 3000 tables after another server crash broke the app for almost a day. However, she hadn't yet touched the view code.

Well, it can't be any worse than all those tables, she thought, and immediately regretted it once she opened spreadsheetUpload.java.

System.out.println("<select name=\"operation1\"><option value=\"greater\">><option value=\"smaller\"><"); System.out.println("<option value=\"greaterequal\">>=<option value=\"smallerequal\"><="); System.out.println("<option value=\"equal\">=<option value=\"notequal\"><>"); System.out.println("<option value=\"like\">LIKE<option value=\"in\">IN"); System.out.println("<option value=\"notin\">NOT IN<option value=\"between\">BETWEEN</select>"); System.out.println("<input type=text size=10 name=\"bed1\">&nbsp</center><br>"); 2 System.out.println("<center><font color=white>[ Name='osmY' ] [ Name LIKE 'flg_' ] [Name LIKE '%rp%'] [IN (2,3,5,6)] [BETWEEN 2 AND 3]</font></center><br>"); System.out.println("&nbsp<input type=checkbox name=\"ck2\">"); System.out.println("<select name=\"and2\"><option value=\"and\">AND<option value=\"or\">OR</select>");

"It'll be a few minutes," she yelled to Howard down the hall. There were no comments, no string variables to be found, just line after line of System.out.println. How on Earth did Howard even get it working in the first place?

Class-Based Hydra

Grants and the neverending search for funding are the worst aspects of working in a research lab, Howard explained one day. There are always strings attached.

Such as one that came from their biggest contributor.

"There have to be at least 100 defined classes," Howard said, pointing out a stipulation in a grant application. "Otherwise the interface can't be considered a 'major research appliance.'"

"That many classes would be a nightmare," Lisa objected. "The app has 15. It doesn't need any more."

"They'll get access to the source code any way they can," he said. "You've practically taken over for me since you started. Is there any way you could make it happen?"

"I'll try."

To inflate the class count, Lisa duplicated the different categories of experiments conducted at Modesto. Before, there was just one class called ExperimentModel.java. Now there were over several hundred, including DoubleBlindPharmaceuticalModel.java, DoubleBlindPsychologicalModel.java, and even FieldStudyModel.java.

But it suited their benefactors ... for a time.

The FORTRAN Menace

"God, I'm so sorry for putting you through all this." Howard held the latest request from the lab's benefactors. "Apparently they'll double our funding if all of our software ran on FORTRAN."

"...What." Lisa nearly dropped her tea.

"FORTRAN. It's well beyond my expertise. Do they even teach it in school nowadays?"

"There's a class." Lisa looked on a the world-weary Howard, trying to please the lab's patrons while keeping her sane. "I'll look into it."

As Lisa discovered, it was possible to rewrite their web application to run in FORTRAN, with a few extra libraries installed for HTTP communication and UTF-8 support. She gave the news to Howard, who shook his head. "Thanks, but I know when to stop before we create a monster."

Lisa didn't have to rewrite the interface in FORTRAN. The Modesto Biology Institute kept its funding, and Lisa works there to this day.

[Advertisement] Have you seen BuildMaster 4.3 yet? Lots of new features to make continuous delivery even easier; deploy builds from TeamCity (and other CI) to your own servers, the cloud, and more.

31 Jul 18:45

Tablemania!

by snoofle

We've all heard of dates. Not the kind with someone to whom you're attracted. No, I mean the kind that represent some relative position on the calendar. Now we've all seen what people do when faced with the extremely difficult task of working with dates. Short of programming in brainfsck, intelligent programmers use the built-in variety, be it in C#, Java, etc.

Sometimes, you need to be able to identify different parts of a date. But, given (let's use Java as an example) a java.util.Date, how on Earth can any reasonable person possibly figure out the year without too much grief? I mean, it's mathematically hard to convert all those milliseconds to years. There must be a better way to identify years!

Mick's co-worker decided that a database was required. Not only did he incorporate a database to figure out the year component of dates, he thought that each year needed to be explicitly identified...

Kris must work with the folks who work with Mick, because he worked on fixing a foreign key constraint issue, and was instructed to just add rows to the table that was causing the problem...

Of course, others have difficulties understanding the proper use of database tables for things beyond dates. Justus found this wonderful representation of catalog data:

Jon sent in a sample of a catalog table designed by Nigel who left shortly thereafter. Apparently, Nigel felt that each item in the catalog deserved it's own column in the table. Jon's pretty sure this will need a rewrite:

Still others feel that normalization is simply not an option. An anonymous submitter sent in something their offshore outsourcing company turned in; a table with 600 columns, including:

50 float columns
50 integer columns
50 date-time columns
100 bit columns
100 money columns
200 varchar(1024) columns
50 XML columns

Of course, dates/years and normalization are notoriously difficult concepts. Booleans on the other hand... Shane found a Yes table which has no counterpart No table... It has one row with one column containing a single "Y"...

One of the strengths of databases is their ability to record events that have happened. Sam was tasked with bringing over a mailing list to a new site. Part of the system mails out newsletters to all subscribers each month. By the time Sam inherited it, about 300 newsletters had already been sent. Apparently, the developer thought it would be a good idea to keep track of every recipient of every newsletter, and so added a slew of extra rows of tinyint's to keep track of who had / had not received the mail.

Of course, while there are many right ways to do that, this is probably not one of them.

31 Jul 18:43

German MVPs Host SQL Saturday and Hackathon

by MVP Award Program

Nearly 300 MVPs, technology enthusiasts and SQL fans joined forces in Germany to participate in a hackathon and SQL Saturday event featuring MVP presenters. The two-day event was organized by SQL Server MVPs Constantin Klein, Tillmann Eitelberg and Oliver Engels. We had the chance to catch up with Constantin “Kostja” Klein to get an inside look into the event.

What was the inspiration for creating such a unique community event?

"Since BIG DATA is a hype-topic, we decided to favor a Hackathon over a regular Pre-Con in order to allow attendees to really get a first, hands-on experience with reference to the existing technologies on the Microsoft platform, like HDInsight and PowerBI. With Azure MVP Sascha Dittmann, Scott Klein and Emil Siemens we also found the right people to introduce the tools to the attendees and help them on occurring problems."

What was the highlight of the event?

"The highlight was the presentation of the results at the end of the day. This was when all other teams had the chance to find out about different approaches to the same problem, different ways of visualization, etc.
By the way, the challenge we prepared for the day was to get some interesting insight and visualization out of more than 250,000 tweets collected during [the] football World Cup with [the] hashtag #WorldCup. And actually we had some interesting findings, like it seems that in the UK, there is a disproportionately high number of people who like football and use Twitter.

We wanted attendees to use the Microsoft cloud technology stack. Therefore we had Azure accounts prepared and helped people to get the environment (Cloud Storage, Azure SQL Database and HDInsight cluster) up and running. People helped each other and we started the day with building teams of up to five people who then worked together. In fact, most of the teams were not colleagues and had never worked together before."

What is the benefit of attending such a great community event?

"We believe that you get a much better kick-start for dealing with a brand new topic – which it was for almost all attendees – when you really have time to start a project and play with the technology. So during the wrap-up many attendees confirmed, that they now have a first real experience, they can take home and elaborate on that. This is totally different if you just listened to a whole day lecture. Obviously this lecture could cover much more details, but an attendee would not be able to immediately reproduce what he heard or had seen. Working in mixed groups is another interesting aspect, which helps people to deal with and adjust to new situations."

Congratulations to the MVP organizers, presenters and all the participants!

27 Jul 03:11

The Guard Changes on the Iron Throne

Mrdenny
now

Submitted by: (via Omer Barnea)

Tagged: Game of Thrones , nerdgasm , Video , g rated , win

Share on Facebook

Wishfor420 likes this

27 Jul 02:26

The Psychology of Phishing

by samzenpus

Mrdenny
test

An anonymous reader writes Phishing emails are without a doubt one of the biggest security issues consumers and businesses face today. Cybercriminals understand that we are a generation of clickers and they use this to their advantage. They will take the time to create sophisticated phishing emails because they understand that today users can tell-apart spam annoyances from useful email, however they still find it difficult identifying phishing emails, particularly when they are tailored to suit each recipient individually. Fake emails are so convincing and compelling that they fool 10% of recipients into clicking on the malicious link. To put that into context a legitimate marketing department at a FTSE 100 company typically expects less than a 2% click rate on their advertising campaigns. So, how are the cybercriminals out-marketing the marketing experts?

Read more of this story at Slashdot.

27 Jul 02:25

For Half, Degrees In Computing, Math, Or Stats Lead To Other Jobs

by Unknown Lamer

dcblogs (1096431) writes The Census Bureau reports that only 26% of people with any type of four-year STEM degree are working in a STEM field. For those with a degree specifically in computer, math or statistics, the figure is 49%, nearly the same for engineering degrees. What happens to the other STEM trained workers? The largest numbers are managers at non-STEM businesses (22.5%), or having careers in education (17.7%), business/finance (13.2%) and office support (11.5%). Some other data points: Among those with college degrees in computer-related occupations, men are paid more than women ($90,354 vs. $78,859 on average), and African American workers are more likely to be unemployed than white or Asian workers.

Read more of this story at Slashdot.

27 Jul 02:23

Lawrence Krauss: Congress Is Trying To Defund Scientists At Energy Department

by samzenpus

Lasrick writes Physicist Lawrence Krauss blasts Congress for their passage of the 2015 Energy and Water Appropriations bill that cut funding for renewable energy, sustainable transportation, and energy efficiency, and even worse, had amendments that targeted scientists at the Department of Energy: He writes that this action from the US Congress is worse even than the Australian government's move to cancel their carbon tax, because the action of Congress is far more insidious: "Each (amendment) would, in its own way, specifically prohibit scientists at the Energy Department from doing precisely what Congress should mandate them to do—namely perform the best possible scientific research to illuminate, for policymakers, the likelihood and possible consequences of climate change." Although the bill isn't likely to become law, Krauss is fed up with Congress burying its head in the sand: The fact that those amendments "...could pass a house of Congress, should concern everyone interested in the appropriate support of scientific research as a basis for sound public policy."

Read more of this story at Slashdot.

27 Jul 02:18

New SSL Server Rules Go Into Effect Nov. 1

by Soulskill

alphadogg writes: Public certificate authorities (CAs) are warning that as of Nov. 1 they will reject requests for internal SSL server certificates that don't conform to new internal domain naming and IP address conventions designed to safeguard networks. The concern is that SSL server digital certificates issued by CAs at present for internal corporate e-mail servers, Web servers and databases are not unique and can potentially be used in man-in-the-middle attacks involving the setup of rogue servers inside the targeted network, say representatives for the Certification Authority/Browser Forum (CA/B Forum), the industry group that sets security and operational guidelines for digital certificates. Members include the overwhelming bulk of public CAs around the globe, plus browser makers such as Microsoft and Apple. The problem today is that network managers often give their servers names like 'Server1' and allocate internal IP addresses so that SSL certificates issued for them through the public CAs are not necessarily globally unique, notes Trend Micro's Chris Bailey.

Read more of this story at Slashdot.

27 Jul 02:14

Enraged Verizon FiOS Customer Seemingly Demonstrates Netflix Throttling

by timothy

MojoKid (1002251) writes The ongoing battle between Netflix and ISPs that can't seem to handle the streaming video service's traffic, boiled over to an infuriating level for Colin Nederkoon, a startup CEO who resides in New York City. Rather than accept excuses and finger pointing from either side, Nederkoon did a little investigating into why he was receiving such slow Netflix streams on his Verizon FiOS connection. What he discovered is that there appears to be a clear culprit. Nederkoon pays for Internet service that promises 75Mbps downstream and 35Mbps upstream through his FiOS connection. However, his Netflix video streams were limping along at just 375kbps (0.375mbps), equivalent to 0.5 percent of the speed he's paying for. On a hunch, he decided to connect to a VPN service, which in theory should actually make things slower since it's adding extra hops. Speeds didn't get slower, they got much faster. After connecting to VyprVPN, his Netflix connection suddenly jumped to 3000kbps, the fastest the streaming service allows and around 10 times faster than when connecting directly with Verizon. Verizon may have a different explanation as to why Nederkoon's Netflix streams suddenly sped up, but in the meantime, it would appear that throttling shenanigans are taking place. It seems that by using a VPN, Verizon simply doesn't know which packets to throttle, hence the gross disparity in speed.

Read more of this story at Slashdot.

27 Jul 02:01

Should new index columns be in the key, or included?

by Erin Stellato

SQL Server 2005 added the ability to include nonkey columns in a nonclustered index. In SQL Server 2000 and earlier, for a nonclustered index, all columns defined for an index were key columns, which meant they were part of every level of the index, from the root down to the leaf level. When a column is defined as an included column, it is part of the leaf level only. Books Online notes the following benefits of included columns:

They can be data types not allowed as index key columns.
They are not considered by the Database Engine when calculating the number of index key columns or index key size.

For example, a varchar(max) column cannot be part of an index key, but it can be an included column. Further, that varchar(max) column doesn't count against the 900-byte (or 16-column) limit imposed for the index key.

The documentation also notes the following performance benefit:

An index with nonkey columns can significantly improve query performance when all columns in the query are included in the index either as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index; table or clustered index data is not accessed resulting in fewer disk I/O operations.

We can infer that whether the index columns are key columns or nonkey columns, we get an improvement in performance compared to when all columns are not part of the index. But, is there a performance difference between the two variations?

The Setup

I installed a copy of the AdventuresWork2012 database and verified the indexes for the Sales.SalesOrderHeader table using Kimberly Tripp's version of sp_helpindex:

USE [AdventureWorks2012];
GO
EXEC sp_SQLskills_SQL2012_helpindex N'Sales.SalesOrderHeader';

Default indexes for Sales.SalesOrderHeader

We'll start with a straight-forward query for testing that retrieves data from multiple columns:

SELECT [CustomerID], [SalesPersonID], [SalesOrderID],
  DATEDIFF(DAY, [OrderDate], [ShipDate]) AS [DaysToShip], [SubTotal]
FROM [Sales].[SalesOrderHeader]
WHERE [CustomerID] BETWEEN 11000 and 11200;

If we execute this against the AdventureWorks2012 database using SQL Sentry Plan Explorer and check the plan and the Table I/O output, we see that we get a clustered index scan with 689 logical reads:

Execution plan from original query

(In Management Studio, you could see the I/O metrics using SET STATISTICS IO ON;.)

The SELECT has a warning icon, because the optimizer recommends an index for this query:

USE [AdventureWorks2012];
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [Sales].[SalesOrderHeader] ([CustomerID])
INCLUDE ([OrderDate],[ShipDate],[SalesPersonID],[SubTotal]);

Test 1

We will first create the index the optimizer recommends (named NCI1_included), as well as the variation with all the columns as key columns (named NCI1):

CREATE NONCLUSTERED INDEX [NCI1]
ON [Sales].[SalesOrderHeader]([CustomerID], [SubTotal], [OrderDate], [ShipDate], [SalesPersonID]);
GO
 
CREATE NONCLUSTERED INDEX [NCI1_included]
ON [Sales].[SalesOrderHeader]([CustomerID])
INCLUDE ([SubTotal], [OrderDate], [ShipDate], [SalesPersonID]);
GO

If we re-run the original query, once hinting it with NCI1, and once hinting it with NCI1_included, we see a plan similar to the original, but this time there's an index seek of each nonclustered index, with equivalent values for Table I/O, and similar costs (both about 0.006):

Original query with index seeks – key on the left, include on the right

(The scan count is still 1 because the index seek is actually a range scan in disguise.)

Now, the AdventureWorks2012 database isn't representative of a production database in terms of size, and if we look at the number of pages in each index, we see they’re exactly the same:

SELECT
  [Table]    = N'SalesOrderHeader',
  [Index_ID] = [ps].[index_id],
  [Index]    = [i].[name],
  [ps].[used_page_count],
  [ps].[row_count]
FROM [sys].[dm_db_partition_stats] AS [ps]
INNER JOIN [sys].[indexes] AS [i] 
  ON [ps].[index_id] = [i].[index_id] 
  AND [ps].[object_id] = [i].[object_id]
WHERE [ps].[object_id] = OBJECT_ID(N'Sales.SalesOrderHeader');

Size of indexes on Sales.SalesOrderHeader

If we're looking at performance, it's ideal (and more fun) to test with a larger data set.

Test 2

I have a copy of the AdventureWorks2012 database that has a SalesOrderHeader table with over 200 million rows (script HERE), so let’s create the same nonclustered indexes in that database and re-run the queries:

USE [AdventureWorks2012_Big];
GO
CREATE NONCLUSTERED INDEX [Big_NCI1]
ON [Sales].[Big_SalesOrderHeader](CustomerID, SubTotal, OrderDate, ShipDate, SalesPersonID);
GO
 
CREATE NONCLUSTERED INDEX [Big_NCI1_included]
ON [Sales].[Big_SalesOrderHeader](CustomerID)
INCLUDE (SubTotal, OrderDate, ShipDate, SalesPersonID);
GO
 
SELECT [CustomerID], [SalesPersonID],[SalesOrderID], 
  DATEDIFF(DAY, [OrderDate], [ShipDate]) AS [DaysToShip], [SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1))
WHERE [CustomerID] between 11000 and 11200;
 
SELECT [CustomerID], [SalesPersonID],[SalesOrderID], 
  DATEDIFF(DAY, [OrderDate], [ShipDate]) AS [DaysToShip], [SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1_included))
WHERE [CustomerID] between 11000 and 11200;

Original query with index seeks against Big_NCI1 (l) and Big_NCI1_Included (r)

Now we get some data. The query returns over 6 million rows, and seeking each index requires just over 32,000 reads, and the estimated cost is the same for both queries (31.233). No performance differences yet, and if we check the size of the indexes, we see that the index with the included columns has 5,578 fewer pages:

SELECT
  [Table]    = N'Big_SalesOrderHeader',
  [Index_ID] = [ps].[index_id],
  [Index]    = [i].[name],
  [ps].[used_page_count],
  [ps].[row_count]
FROM [sys].[dm_db_partition_stats] AS [ps]
INNER JOIN [sys].[indexes] AS [i] 
  ON [ps].[index_id] = [i].[index_id] 
  AND [ps].[object_id] = [i].[object_id]
WHERE [ps].[object_id] = OBJECT_ID(N'Sales.Big_SalesOrderHeader');

Size of indexes on Sales.Big_SalesOrderHeader

If we dig into this a big further and check dm_dm_index_physical_stats, we can see that difference exists in the intermediate levels of the index:

SELECT
  [ps].[index_id],
  [Index] = [i].[name],
  [ps].[index_type_desc],
  [ps].[index_depth],
  [ps].[index_level],
  [ps].[page_count],
  [ps].[record_count]
FROM [sys].[dm_db_index_physical_stats](DB_ID(), 
  OBJECT_ID('Sales.Big_SalesOrderHeader'), 5, NULL, 'DETAILED') AS [ps]
INNER JOIN [sys].[indexes] AS [i] 
  ON [ps].[index_id] = [i].[index_id] 
  AND [ps].[object_id] = [i].[object_id];
 
SELECT
  [ps].[index_id],
  [Index] = [i].[name],
  [ps].[index_type_desc],
  [ps].[index_depth],
  [ps].[index_level],
  [ps].[page_count],
  [ps].[record_count]
FROM [sys].[dm_db_index_physical_stats](DB_ID(), 
  OBJECT_ID('Sales.Big_SalesOrderHeader'), 6, NULL, 'DETAILED') AS [ps]
INNER JOIN [sys].[indexes] [i] 
  ON [ps].[index_id] = [i].[index_id] 
  AND [ps].[object_id] = [i].[object_id];

Size of indexes (level-specific) on Sales.Big_SalesOrderHeader

The difference between the intermediate levels of the two indexes is 43 MB, which may not be significant, but I'd probably still be inclined to create the index with included columns to save space – both on disk and in memory. From a query perspective, we still don't see a big change in performance between the index with all the columns in the key and the index with the included columns.

Test 3

For this test, let's change the query and add a filter for [SubTotal] >= 100 to the WHERE clause:

SELECT [CustomerID],[SalesPersonID],[SalesOrderID], 
  DATEDIFF(DAY, [OrderDate], [ShipDate]) AS [DaysToShip], [SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1))
WHERE CustomerID = 11091
AND [SubTotal] >= 100;
 
SELECT [CustomerID], [SalesPersonID],[SalesOrderID], 
  DATEDIFF(DAY, [OrderDate], [ShipDate]) AS [DaysToShip], [SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1_included))
WHERE CustomerID = 11091
AND [SubTotal] >= 100;

Execution plan of query with SubTotal predicate against both indexes

Now we see a difference in I/O (95 reads versus 1,560), cost (0.848 vs 1.55), and a subtle but noteworthy difference in the query plan. When using the index with all the columns in the key, the seek predicate is the CustomerID and the SubTotal:

Seek predicate against NCI1

Because SubTotal is the second column in the index key, the data is ordered and the SubTotal exists in the intermediate levels of the index. The engine is able to seek directly to the first record with a CustomerID of 11091 and SubTotal greater than or equal to 100, and then read through the index until no more records for CustomerID 11091 exist.

For the index with the included columns, the SubTotal only exists in the leaf level of the index, so CustomerID is the seek predicate, and SubTotal is a residual predicate (just listed as Predicate in the screen shot):

Seek predicate and residual predicate against NCI1_included

The engine can seek directly to the first record where CustomerID is 11091, but then it has to look at every record for CustomerID 11091 to see if the SubTotal is 100 or higher, because the data is ordered by CustomerID and SalesOrderID (clustering key).

Test 4

We'll try one more variation of our query, and this time we'll add an ORDER BY:

SELECT [CustomerID],[SalesPersonID],[SalesOrderID], 
  DATEDIFF(DAY, [OrderDate], [ShipDate]) AS [DaysToShip], [SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1))
WHERE CustomerID = 11091
ORDER BY [SubTotal];
 
SELECT [CustomerID],[SalesPersonID],[SalesOrderID], 
  DATEDIFF(DAY, [OrderDate], [ShipDate]) AS [DaysToShip], [SubTotal]
FROM [Sales].[Big_SalesOrderHeader] WITH (INDEX (Big_NCI1_included))
WHERE CustomerID = 11091
ORDER BY [SubTotal];

Execution plan of query with SORT against both indexes

Again we have a change in I/O (though very slight), a change in cost (1.5 vs 9.3), and much larger change in the plan shape; we also see a larger number of scans (1 vs 9). The query requires the data to be sorted by SubTotal; when SubTotal is part of the index key it is sorted, so when the records for CustomerID 11091 are retrieved, they are already in the requested order.

When SubTotal exists as an included column, the records for CustomerID 11091 must be sorted before they can be returned to the user, therefore the optimizer interjects a Sort operator in the query. As a result, the query that uses the index Big_NCI1_included also requests (and is given) a memory grant of 29,312 KB, which is notable (and found in the properties of the plan).

Summary

The original question we wanted to answer was whether we would see a performance difference when a query used the index with all columns in the key, versus the index with most of the columns included in the leaf level. In our first set of tests there was no difference, but in our third and fourth tests there was. It ultimately depends on the query. We only looked at two variations – one had an additional predicate, the other had an ORDER BY – many more exist.

What developers and DBAs need to understand is that there are some great benefits to including columns in an index, but they will not always perform the same as indexes that have all columns in the key. It may be tempting to move columns that are not part of predicates and joins out of the key, and just include them, to reduce the overall size of the index. However, in some cases this requires more resources for query execution and may degrade performance. The degradation may be insignificant; it may not be…you will not know until you test. Therefore, when designing an index, it’s important to think about the columns after the leading one – and understand whether they need to be part of the key (e.g. because keeping the data ordered will provide benefit) or if they can serve their purpose as included columns.

As is typical with indexing in SQL Server, you have to test your queries with your indexes to determine the best strategy. It remains an art and a science – trying to find the minimum number of indexes to satisfy as many queries as possible.

The post Should new index columns be in the key, or included? appeared first on SQLPerformance.com.

27 Jul 02:00

How To Get The Boss To Pay For Training

by Andy Warren

How do you get the boss to pay for training? In my experience you have to:

Ask. The boss isn’t sitting around thinking of ways to make your life better
Align. Ask for training that is relevant, or show how it is relevant if not obvious
Persist. Even if you do a nice clean ask, it’s not a top priority. Follow up without being annoying!

It’s doesn’t need to be complicated. A short email, a link to the class or a an attachment with the details. Something as simple as this might net you some nice company paid training!

<<Boss>>,

I’m planning to attend SQLSaturday Orlando on September 27th, it’s a free all day training event for SQL Server professionals. On Thursday September 25th they are also offering an all day class on Enterprise Monitoring for SQL Server that is being taught by David Pless, a Senior Field Engineer for Microsoft. It looks very good and I think I’d benefit from attending. The registration fee is $120 and that includes lunch. Would it be possible for the company to pay for the training?

Thanks,

<<you>>

I know that works out to AAP. I’ll work on a better mnemonic!

27 Jul 02:00

Get started backing up to the cloud with SQL Server Backup to Microsoft Azure Tool

by SQL Server Team

If you’re considering backing up your SQL Server database to the cloud, there are many compelling reasons. Not only will you have an offsite copy of your data for business continuity and disaster recovery purposes, but you can save on CAPEX by using Microsoft Azure for cost-effective storage. And now, you can choose to backup to Microsoft Azure even for databases that aren’t running the latest version of SQL Server – creating a consistent backup strategy across your database environment.

SQL Server has these tools and features to help you back up to the cloud:

In SQL Server 2014, Managed Backup to Microsoft Azure manages your backup to Microsoft Azure, setting backup frequency based on data activity. It is available inside the SQL Server Management Studio in SQL Server 2014.
In SQL Server 2012 and 2014, Backup to URL provides backup to Microsoft Azure using T-SQL and PowerShell scripting.
For prior versions, SQL Server Backup to Microsoft Azure Tool enables you to back up to the cloud all supported versions of SQL Server, including older ones. It can also be used to provide encryption and compression for your backups – even for versions of SQL Server that don’t support these functions natively.

To show you how easy it is to get started with SQL Server Backup to Microsoft Azure Tool, we’ve outlined the four simple steps you need to follow:

Prerequisites: Microsoft Azure subscription and a Microsoft Azure Storage Account. You can log in to the Microsoft Azure Management Portal using your Microsoft account. In addition, you will need to create a Microsoft Azure Blob Storage Container: SQL Server uses the Microsoft Azure Blob storage service and stores the backups as blobs.

Step 1: Download the SQL Server Backup to Microsoft Azure Tool, which is available on the Microsoft Download Center.

Step 2: Install the tool. From the download page, download the MSI (x86/x64) to your local machine that has the SQL Server Instances installed, or to a local share with access to the Internet. Use the MSI to install the tool on your production machines. Double click to start the installation.

Step 3: Create your rules. Start the Microsoft SQL Server Backup to Microsoft Azure Tool Service by running SQLBackup2Azure.exe. Going through the wizard to setup the rules allows the program to process the backup files that should be encrypted, compressed or uploaded to Azure storage. The Tool does not do job scheduling or error tracking, so you should continue to use SQL Server Management Studio for this functionality.

On the Rules page, click Add to create a new rule. This will launch a three screen rule entry wizard.

The rule will tell the Tool what local folder to watch for backup file creation. You must also specify the file name pattern that this rule should apply to.

To store the backup in Microsoft Azure Storage, you must specify the name of the account, the storage access key, and the name of the container. You can retrieve the name of the storage account and the access key information by logging into the Microsoft Azure management portal.

At this time, you can also specify whether or not you wish to have the backup files encrypted or compressed.

Once you have created one or more rules, you will see the existing rules and the option to Modify or Delete the rule.

Step 4: Restore a Database from a Backup Taken with SQL Server Backup to Microsoft Azure Tool in place. The SQL Server Backup to Microsoft Azure Tool creates a ‘stub’ file with some metadata to use during restore. Use this file like your regular backup file when you wish to restore a database. SQL Server uses the metadata from this file and the backup on Microsoft Azure storage to complete the restore.

If the stub file is ever deleted, you can recover a copy of it from the Microsoft Azure storage container in which the backups are stored. Place the stub file into a folder on the local machine where the Tool is configured to detect and upload backup files.

That’s all it takes! Now you’re up and running with Backup to and Restore from Microsoft Azure.

To learn more about why to back up to the cloud, join Forrester Research analyst Noel Yuhanna in a webinar on Database Cloud Backup and Disaster Recovery. You’ll find out why enterprises should make database cloud backup and DR part of their enterprise database strategy.

The webinar takes place on Tuesday, 7/29 at 9 AM Pacific time; register now.

27 Jul 02:00

For proven in-memory technology without costly add-ons, migrate your Oracle databases to SQL Server 2014

by SQL Server Team

Today, we are making available a new version of SQL Server Migration Assistant (SSMA), a free tool to help customers migrate their existing Oracle databases to SQL Server 2014. Microsoft released SQL Server 2014 earlier this year, after months of customer testing, with features such as In-Memory OLTP to speed up transaction performance, In-Memory Columnstore to speed up query performance, and other great hybrid cloud features such as backup to cloud directly from SQL Server Management Studio and the ability to utilize Azure as a disaster recovery site using SQL Server 2014 AlwaysOn.

Available now, the SQL Server Migration Assistant version 6.0 for Oracle databases, greatly simplifies the database migration process from Oracle databases to SQL Server. SSMA automates all aspects of migration including migration assessment analysis, schema and SQL statement conversion, data migration as well as migration testing to reduce cost and reduce risk of database migration projects. Moreover, SSMA version 6.0 for Oracle databases brings additional features such as automatically moving Oracle tables into SQL Server 2014 in-memory tables, the ability to process 10,000 Oracle objects in a single migration, and increased performance in database migration and report generation.

Many customers have realized the benefits of migrating their database to SQL Server using previous versions of SSMA. For example:

Dollar Thrifty Automotive Group migrated their rental car rate engine and saves $135,000 annually
Sumitomo Rubber Industries migrated their 21 mission critical systems from an Oracle database to SQL Server and cut software licensing costs by half.
G&T Conveyor saves 83 percent on ERP by moving from an Oracle database to SQL Server

SSMA for Oracle is designed to support migration from Oracle 9i or later version to all editions of SQL Server 2005, SQL Server 2008, SQL Server 2008 R2, and SQL Server 2012 and SQL Server 2014. The SSMA product team is also available to answer your questions and provide technical support at ssmahelp@microsoft.com

To download SSMA for Oracle, go here. To evaluate SQL Server 2014, go here.

27 Jul 02:00

Thank You SQLBits!

Just a very quick post to say thank you to all the organisers, speakers, helpers, sponsors, venue staff, caterers, party entertainers, long suffering partners and anyone else involved with the event as it was a top, top few days.

This was my 3rd SQLBits after Liverpool + Nottingham and in just this short time, it is plain to see how much it has grown and how popular it has become. The quality of each and every facet seems to be getting better at each event. There was ample of space to network, chill or roam at the venue and everywhere seemed to be a nice ambient temperature even though it was pretty hot outside on the Thursday + Friday. Food delivery was bang on and only took a matter of moments to queue and get your lunch/refreshments. Sessions were plentiful, wide ranging and delivered well by both seasoned and new speakers alike.

And the party, well, simply WOW!.. that’s one I’ll remember for a very long time. So much to do, eat, drink and play it was past my bedtime before I knew it. The party organisers should be damn proud of what they achieved there.

Other than SQLBits 5 which was held in Wales, all of the events have been hosted in England, so it’ll be interesting how many more events are held before SQLBits travels west across the waters to Ireland or the other way into Europe somewhere (or even on the waters as per the feedback survey). I’d actually quite like to see SQLBits in Europe somewhere as I’d tie it all in with a vacation with the missus seeing as partners are welcome at the Friday night party.

I’ve thought long and hard at things to improve following on from this event, but I really struggled to think of anything I didn’t like such was the quality of the event. One suggestion I made on the feedback survey was the potential for a 90 minute spotlight session or two by renowned speakers. I know that it’ll might make the timetable a little trickier to work out, but a 90 minute deep dive by a world class speaker would be pretty awesome.

Additionally with the number of session abstracts being submitted, I do wonder if there is the potential to make it a four day event! Hopefully one day not before too long eh nudge, nudge!?..

Enjoy!...

Follow me on twitter @sqlserverrocks

Subscribe to my blog RSS feed

Comment on this or any of my posts or contact me directly from http://www.olcot.co.uk

Mrdenny

Shared posts

Definition

Licensing Power BI

Excel 2013 Features

Power Query

Power Pivot

BI Semantic Model

Power Pivot Features

Power View

Power Map

Power BI for Office 365

Power BI Sites

Power BI Q&A (Question and Answer)

Query and Data Management

Power BI Windows Store App

Managing Power BI for Office 365

IT (Information Technology) Infrastructure Services for Power BI Office 365

Power BI with Excel 2010

Recommended Resources

Summary

What is Grouped Concatenation?

Scalar UDF

Common Language Runtime (CLR)

Recursive CTE

Cursor

Quirky Update

FOR XML PATH

More on FOR XML PATH

Performance

Conclusion

Next Time

Necessity is the Mutha of Invention

Geek Creativity

Code Creativity

SQL Server Improvisation

Summary

The Investigation

SIDs, Logins, Contained Users, and Why You Care…Or Not.

The 2008 R2 RCSI Bug

Implications and Mitigations

The "Incorrect Deadlock" Bug

Analysis and Explanation

The Deadlock Fix

Final Thoughts

What’s the difference between SMP and MPP?

Transitioning to MPP

Table Design

Distributed Tables

Replicated Tables

Data Loading

SQL Swarm

System.out Basilisk

Class-Based Hydra

The FORTRAN Menace

The Setup

Test 1

Test 2

Test 3

Test 4

Summary

`System.out` Basilisk