Shared posts

19 Apr 00:14

Top community posts from February 2015

by Michael Tidmarsh

shutterstock_241547266
Jeb Bush image via Shutterstock

You might be wondering why there’s a picture of former Florida governor Jeb Bush here. Well, his partial email dump was included in our top community blogs for the month of February. Come check out the stories and share your thoughts with the group.

14 Apr 20:09

How to resequence column based on numeric prefix using TSQL

by Marlon Ribunal

Beginning in version 2005, SQL Server is making manipulation of  partitioned set of rows easier by using Window Functions (ROW_NUMBER(), aggregate functions with OVER(), etc.). You can manipulate partitions of rows on the fly with Window Functions.

I’ve seen how complicated queries that require recursion were simplified by using combination of   Common Table Expression (CTE) and Window Functions. There are less reasons now, or almost zero excuses, to justify using cursors in your queries.

I had a data requirement that I thought on the onset that I could apply the awesomeness of CTE’s and Window Functions on. I needed to process a dataset into deterministic output. I had to output the data in pre-determined sort order. Each partition would have maximum of 10 rows.

To preserve the deterministic order of my data, I would have to prefix each data with 0,1,2,3,5,6,7,8,and 9 within the partition group.

Easy, right? Not really. The problem was my input (bar delimited string) has values that could exceed that 10 max. In cases when the number of values exceeds 10, I would grab the last 10 values. The number of values is not fixed. Some record would have 2, 3, 11, 99, or who-knows-what number of values. Regardless of the number of values, I need to grab the last 10.

Since I don’t know how many values there are in a given record, I would have to read beginning from the last value by reversing the whole string input, and reversing the results back to normal order. I have a parse function that split the values in the bar delimited string input. Since I am parsing the input in reverse, the 0 index gets assigned to the actual last value in the string (reversed).

I was working on SQL Server 2000 for this requirement :-(

To give you an idea of what I am talking about, here’s my input (left) and the desired output (right). Basically, I needed to resequence the numeric prefix in order, i.e., 0,1,2,3…9 from an input of x…3,2,1,0

resequence tsql sql server

Let me walk you through step by step…

Let’s build the sample dataset. Pardon my insert constructor. Remember this is SQL 2000. It could have been nicer, I know.


SELECT x.part_no, x.sub_part_no
INTO #temp
FROM ( SELECT 'a' AS part_no, '7ABC' AS sub_part_no
UNION ALL
SELECT 'a' AS part_no,'6DEF' AS sub_part_no
UNION ALL
SELECT 'a' AS part_no, '5GHI' AS sub_part_no
UNION ALL
SELECT 'a' AS part_no, '4JKL' AS sub_part_no
UNION ALL
SELECT 'a' AS part_no, '3MNO' AS sub_part_no
UNION ALL
SELECT 'a' AS part_no, '2QRS' AS sub_part_no
UNION ALL
SELECT 'a' AS part_no, '1TUV' AS sub_part_no
UNION ALL
SELECT 'a' AS part_no, '0WXY' AS sub_part_no
UNION ALL
SELECT 'b' AS part_no, '4A12' AS sub_part_no
UNION ALL
SELECT 'b' AS part_no, '3B34' AS sub_part_no
UNION ALL
SELECT 'b' AS part_no, '2C56' AS sub_part_no
UNION ALL
SELECT 'b' AS part_no, '1D78' AS sub_part_no
UNION ALL
SELECT 'b' AS part_no, '0E01' AS sub_part_no
) x

This is how the pre-processed dataset looks like:

input sequence tsql

SQL Server 2000

The next step is to find out the maximum, or highest, number of prefix (sequence) in each partition (part_no in this case). You probably already know what I am trying to do here.


SELECT part_no
,LEFT(sub_part_no, 1) AS sequence_no
FROM #temp y

max sequence tsql

A simple subquery (subselect) returns exactly that:


SELECT part_no
 ,( SELECT MAX(LEFT(sub_part_no, 1))
 FROM #temp x
 WHERE x.part_no = y.part_no
 ) AS max_sequence
 ,sub_part_no
FROM #temp y

resequencing values in tsql

Now that I have determined the max value of my prefix (sequence) within my partition, I lay that next to my original sequence and let simple arithmetic do its magic.


SELECT part_no, max_sequence, sequence_no, cast(max_sequence AS INT) - CAST(sequence_no AS INT) AS new_sequence
FROM(
SELECT part_no
,(SELECT MAX(LEFT(sub_part_no,1)) FROM #temp x WHERE x.part_no = y.part_no) AS max_sequence
,LEFT(sub_part_no,1) AS sequence_no
FROM #temp y
) xx

new sequence tsql

Putting that together…

SELECT z.part_no
 ,sub_part_no AS old_sub_part_no
 ,CAST(CAST(max_counter AS INT) - CAST(LEFT(sub_part_no, 1) AS INT) AS VARCHAR(2)) + RIGHT(sub_part_no, LEN(sub_part_no) - 1) AS new_sub_part_no
FROM ( SELECT part_no
 ,sub_part_no
 ,( SELECT MAX(LEFT(sub_part_no, 1))
 FROM #temp x
 WHERE x.part_no = y.part_no
 ) AS max_counter
 FROM #temp y
 GROUP BY part_no
 ,sub_part_no
 ) z
ORDER BY z.part_no
 ,z.new_sub_part_no

Another option is to simulate the ROW_NUMBER() window function (reference here)…

SELECT part_no
,sub_part_no AS old_sub_part_no
,CAST(z.rowNumber AS VARCHAR(1)) + RIGHT(sub_part_no, LEN(sub_part_no) - 1) AS new_sub_part_no
FROM ( SELECT part_no
,sub_part_no
,( SELECT COUNT(*) - 1
FROM #temp AS x
WHERE x.part_no = y.part_no
AND x.sub_part_no >= y.sub_part_no
) AS rowNumber
FROM #temp AS y
) AS z
ORDER BY part_no, new_sub_part_no

tsql resequence output
And there you have it. That’s a simple way of resequencing values within partition in TSQL.

 SQL Server 2008/2012/2014

UPDATE: (3/26/2015)

Thanks, Bob for the script (see comment below). I tested this with SQL Server 2008 R2 and SQL Server 2012

SELECT part_no
,sub_part_no AS old_sub_part_no
,CAST (ROW_NUMBER() OVER ( PARTITION BY part_no ORDER BY CONVERT(INT, LEFT(sub_part_no, 1)) DESC ) - 1 AS VARCHAR(1)) + RIGHT(sub_part_no, LEN(sub_part_no) - 1) AS new_sub_part_no
FROM #temp

The post How to resequence column based on numeric prefix using TSQL appeared first on SQL, Code, Coffee, etc..

No related posts.

14 Apr 19:53

The Awesome Optimization

by Xander R.

Dave liked to think that he was a responsible, thorough application developer. He always tried to understand a problem before tackling it, to think through all ways a change could affect things, and to gather information before making decisions. So when he received complaints about the speed of the custom web application used by his work’s health department, he decided to do a little research to gather all the facts before reporting the issue to his boss.

Do not throw in trash

And sure enough, all it took was a little research: the logs were full of database query timeouts and memory limit terminations. After a small amount of performance analysis on the test database server, it was very clear that the database was by far the main cause of the slow speeds. As the number of entries in the database naturally increased, many of the queries became exponentially slower.

Satisfied he had actionable information, Dave went to his boss to get approval to make fixes.

“Thank you, Dave! But the database isn’t your job - you’re an application developer. Anything related to the database is the DBA’s job. I’ll get ‘Awesome’ on this right away.”

‘Awesome’, the affectionately-nicknamed-by-the-boss local Database Administrator (DBA), had no experience programming, but was considered “master of all things even tangentially database-related”. Dave tried to make his boss aware of an important caveat: “But the data access code is in the application. How can the DBA optimize it there?”

His boss was resolute: “That. Is. Not. Your. Job. Don’t even worry about it. I’m sure Awesome knows how to do it.”

And so instead of optimizing queries, Dave begrudgingly set up another testing instance of the application for Awesome to work with and handed over the details. Three days later, Awesome showed up at his door with bad news: “The application is completely broken.”

Incredulous, Dave logged in to debug a little and quickly discovered that the “web.config” file had been modified and was pointing to an incorrect server. “Awesome, please do not modify any code or application settings without letting me know. Modifying the web.config file can be dangerous if you don’t know what you’re doing. Just optimize by profiling the database on the testing database server - you don’t need to dig through the code.” With a glare, Awesome left, never to be heard from again.

Until 5 days later, when he showed up at Dave’s door with amazing news: “It’s done.” Again incredulous, Dave logged into the test instance… and was shocked by how fast it was. Every page loaded nearly instantly. Even complex records searches displayed results immediately. Not a single perceptible delay remained.

“How did you do it?”, Dave asked.

Awesome was recalcitrant: “You don’t need to know. It’s not your job. All you need to know is that it’s a lot faster, and I did my job.”

Dave pressed him further. Had he modified any code? Had he added any indexes? What changes had he made?
“No, no changes in code, like you told me. The code is your job”, Awesome calmly informed Dave. Honestly puzzled, Dave pressed one more time for details: “Well… what exactly did you do?”

“Oh, the old database server was really slow, so I copied the database to a new server with a lot more RAM.”

“So copying the database to another server took 5 days?”, Dave asked.

“Well, I also had to archive some old records. The database had too much data in it.”

And that’s when Dave discovered the real reason things were so fast on the test instance: Awesome had “optimized” the database by simply removing a lot of data from it. Were it full of logs, or other data that had a defined shelf-life, this might have been acceptable. But as the database was full of the health department’s records that had to be maintained, this was pretty much the definition of Not Okay™.

“Awesome, we can’t just delete data from the database. We have to maintain those health records.”

At least the changes hadn’t been deployed to production.

“… But… I already made the switchover to the new database server.”

And that’s when Dave checked the production instance and realized that it, too, was similarly fast. And that it, too, was similarly missing lots of records.

As luck would have it though, Awesome was not very thorough at his job, and he hadn’t bothered to touch the old database on the original server. So Dave logged in, changed the configuration to point at the original location, and the site came back up with all old records intact… and without the “optimizations” that Awesome had put in place. He informed Awesome: “No worries - I’ve fixed it. All the data is back. Just leave the new server up for a bit so all the changes can sync. I’ll let you know when it’s done.”

As Awesome walked away, confused but relieved, Dave settled in for a long day’s work. Someone was going to have to manually merge any new entries and records changes back into the original database - and then actually improve performance. And, while it wasn’t his job, Dave figured that those tasks might be better left to someone a bit more understanding and thorough, instead of someone “awesome”.

[Advertisement] Release! is a light card game about software and the people who make it. Play with 2-5 people, or up to 10 with two copies - only $9.95 shipped!
14 Apr 19:51

Dynamic Tables

by snoofle

We've all learned to write programs a bit more generically than the immediate problem requires. We use interfaces in our code and include concrete implementation classes via some language-appropriate mechanism. We use factories to produce the object we want, rather than hard code them. We use code generators to spew forth mountains of code from configurable templates. Sometimes, we even generate SQL on-the-fly. It provides more flexibility that way; instead of having to write a separate query for every permutation of question, we can write something that can dynamically create the query, execute it and return the results.

Drop Leaf Table

At Initech, Ian was assigned an interesting JIRA ticket: Investigate errors regarding column length. Since that was all the information written in the ticket, Ian hunted down the author of the ticket to pry out a tad more information. Apparently, the part of Initech's intranet website that was used by the sales agents was suddenly throwing errors about some kind of maximum-size error, and he needed to find out what it was, why it was happening and how to fix it.

Sounds routine enough. Someone probably overflowed some buffer somewhere, or had too much data to fit in some column in the database.

The website in question was nothing out of the ordinary. It was written in PHP using the Zend framework connecting via PDO to a MySQL database. It had been mostly written by several programmers who had left the company a few months prior, and was of a quality that was not unusual for a corporate PHP website. There were functions with reasonably sensible names. The MVC pattern was applied in a rational way. It was possible to follow the code structure without retching or having seizures.

Of course, since this was a fairly decent sized company, there were also the requisite SQL injection vulnerabilities on every single input field. No input was ever sanitized; not even manually. There was no error checking on query result sets. Countless uninitialized variables littered the landscape, scattered amongst informative debug statements like: echo("whyyyyyyyy?");

The page producing the error was written by Gary. It was a multi-tabbed form with only some minor WTFs. One drop-down box had a completely different look&feel than all the others on the page, and any kind of error more complicated than a validation error produced the error message "1".

Sales agents were supposed to use it while talking to customers to enter information about the person they were calling. The error they were getting was a MySQL error regarding maximum row sizes, but it wasn't because of too much data in an input field. After a double-take and a quick Googling, Ian found that yes, MySQL -does- have a maximum size on the length of a row, namely 64KB. But this was just an Insert statement, right? What would maximum row length have to do with an existing table schema?

After several spelunking sessions down through the fetid bowels of the system, Ian found the fecal mass of detritus that was causing the problem.

It turns out that Gary was being micro-managed by the VP of Operations. This led to an endless series of requests to add a new field to the form. Gary got tired of doing the same work over and over, and took it upon himself to automate the problem. To this end, any time a new field was added to the form (usually by someone SSH'ing onto the remote server and just editing the necessary PHP files directly in production), the page would later on detect that the database did not have a matching column for it, and would create an alter table statement to add the new column to the database. Automatically. On the fly. With no human knowledge of, or intervention thereof. Live. In Production.

The fact that it had lasted several months without causing some kind of outage was impressive.

The name of the new column was the ID of the field on the form. But what would the data type be? The answer Gary settled upon was... varchar(255). Fast-forward a few months; all those varchar(255) columns started to add up, as management decreed more and more and more fields for more and more and more customer data to be recorded. While having a database table with 200+ Stringly-Typed columns is a WTF in and of itself, the fact that they were automatically added and with a substantial size only made the problem worse, particularly as 255 characters was far more than necessary for most fields. Two-letter state abbreviations, 5-digit Postal codes, 4-digit Zip+4 postal codes, 8-digit customer IDs, apartment numbers, phone numbers, and many others were all blanketed under the varchar(255) datatype.

While Ian wanted to tear out this monstrosity and banish it to the recesses of the Git commit log, he was informed by his boss that, since the Intranet website was going to be taken down in a few months, it would be more prudent to simply resize some of the columns to buy them enough time to take the whole site down. And so, choosing some of the more egregious offenders, Ian altered ten columns or so to reduce them to a sane size (or, better yet, to turn columns holding integer values into actual Integers).

[Advertisement] Use NuGet or npm? Check out ProGet, the easy-to-use package repository that lets you host and manage your own personal or enterprise-wide NuGet feeds and npm repositories. It's got an impressively-featured free edition, too!
14 Apr 18:12

Unexpected Clustered Index Fragmentation

by Paul Randal

In a departure from my ‘knee-jerk performance tuning’ series, I’d like to discuss how index fragmentation can creep up on you under some circumstances.

What is Index Fragmentation?

Most people think of ‘index fragmentation’ as meaning the problem where the index leaf pages are out of order – the index leaf page with the next key value is not the one that’s physically contiguous in the data file to the index leaf page currently being examined. This is called logical fragmentation (and some people refer to it as external fragmentation – a confusing term that I don’t like).

Logical fragmentation happens when an index leaf page is full and space is required on it, either for an insert or to make an existing record longer (from updating a variable-length column). In that case, the Storage Engine creates a new, empty page and moves 50 % of the rows (usually, but not always) from the full page to the new page. This operation creates space in both pages, allowing the insert or update to proceed, and is called a page split. There are interesting pathological cases involving repeated page splits from a single operation and page splits that cascade up the index levels, but they’re beyond the scope of this post.

When a page split occurs, it usually causes logical fragmentation because the new page that’s allocated is highly unlikely to be physically contiguous to the one that’s being split. When an index has lots of logical fragmentation, index scans are slowed down because the physical reads of the necessary pages cannot be done as efficiently (using multi-page ‘readahead’ reads) when the leaf pages are not stored in order in the data file.

That’s the basic definition of index fragmentation, but there’s a second kind of index fragmentation that most people don’t consider: low page density (sometimes call internal fragmentation, again, a confusing term I don’t like).

Page density is a measure of how much data is stored on an index leaf page. When a page split occurs with the usual 50/50 case, each leaf page (the splitting one and the new one) are left with a page density of only 50%. The lower the page density, the more empty space there is in the index and so the more disk space and buffer pool memory you can think of as being wasted. I blogged about this problem a few years back and you can read about it here.

Now that I’ve given a basic definition of the two kinds of index fragmentation, I’m going to refer to them collectively as simply ‘fragmentation.’

For the remainder of this post I’d like to discuss three cases where clustered indexes can become fragmented even if you’re avoiding operations that would obviously cause fragmentation (i.e. random inserts and updating records to be longer).

Fragmentation from Deletes

“How can a delete from a clustered index leaf page cause a page split?” you might be asking. It won’t, under normal circumstances (and I sat thinking about it for a few minutes to make sure there wasn’t some weird pathological case! But see the section below…) However, deletes can cause page density to get progressively lower.

Imagine the case where the clustered index has a bigint identity key value, so inserts will always go to the right-hand side of the index and will never, ever be inserted into an earlier portion of the index (barring someone reseeding the identity value – potentially very problematic!). Now imagine that the workload deletes records from the table that are no longer required, after which the background ghost cleanup task will reclaim the space on the page and it will become free space.

In the absence of any random inserts (impossible in our scenario unless someone reseeds the identity or specifies a key value to use after enabling SET IDENTITY INSERT for the table), no new records will ever use the space that was freed from the deleted records. This means that the average page density of the earlier portions of the clustered index will steadily decrease, leading to increasing amount of wasted disk space and buffer pool memory as I described earlier.

Deletes can cause fragmentation, as long as you consider page density as part of ‘fragmentation.’

Fragmentation from Snapshot Isolation

SQL Server 2005 introduced two new isolation levels: snapshot isolation and read-committed snapshot isolation. These two have slightly different semantics, but basically allow queries to see a point-in-time view of a database, and for lock-collision-free selects. That’s a vast simplification, but it’s enough for my purposes.

To facilitate these isolation levels, the development team at Microsoft that I led implemented a mechanism called versioning. The way that versioning works is that whenever a record changes, the pre-change version of the record is copied into the version store in tempdb, and the changed recorded gets a 14-byte versioning tag added on the end of it. The tag contains a pointer to the previous version of the record, plus a timestamp that can be used to determine what is the correct version of a record for a particular query to read. Again, hugely simplified, but it’s only the addition of the 14-bytes that we’re interested in.

So whenever a record changes when either of these isolation levels is in effect, it may expand by 14 bytes if there isn’t already a versioning tag for the record. What if there isn’t enough space for the extra 14 bytes on the index leaf page? That’s right, a page split will occur, causing fragmentation.

Big deal, you might think, as the record is changing anyway so if it was changing size anyway then a page split would probably have occurred. No – that logic only holds if the record change was to increase the size of a variable-length column. A versioning tag will be added even if a fixed-length column is updated!

That’s right – when versioning is in play, updates to fixed-length columns can cause a record to expand, potentially causing a page split and fragmentation. What's even more interesting is that a delete will also add the 14-byte tag, so a delete in a clustered index could cause a page split when versioning is in use!

The bottom line here is that enabling either form of snapshot isolation can lead to fragmentation suddenly starting to occur in clustered indexes where previously there was no possibility of fragmentation.

Fragmentation from Readable Secondaries

The last case I want to discuss is using readable secondaries, part of the availability group feature that was added in SQL Server 2012.

When you enable a readable secondary, all queries you do against the secondary replica are converted to using snapshot isolation under the covers. This prevents the queries from blocking the constant replaying of log records from the primary replica, as the recovery code acquires locks as it goes along.

To do this, there needs to be 14-byte versioning tags on records on the secondary replica. There’s a problem, because all replicas need to be identical, so that the log replay works. Well, not quite. The versioning tag contents aren’t relevant as they’re only used on the instance that created them. But the secondary replica can’t add versioning tags, making records longer, as that would change the physical layout of records on a page and break the log replaying. If the versioning tags were already there though, it could use the space without breaking anything.

So that’s exactly what happens. The Storage Engine makes sure that any needed versioning tags for the secondary replica are already there, by adding them on the primary replica!

As soon as a readable secondary replica of a database is created, any update to a record in the primary replica causes the record to have an empty 14-byte tag added, so that the 14-bytes is properly accounted for in all the log records. The tag isn’t used for anything (unless snapshot isolation is enabled on the primary replica itself), but the fact that it’s created causes the record to expand, and if the page is already full then…

Yes, enabling a readable secondary causes the same effect on the primary replica as if you enabled snapshot isolation on it – fragmentation.

Summary

Don’t think that because you’re avoiding using GUIDs as cluster keys and avoiding updating variable-length columns in your tables then your clustered indexes will be immune to fragmentation. As I’ve described above, there are other workload and environmental factors that can cause fragmentation problems in your clustered indexes that you need to be aware of.

Now don’t knee-jerk and think that you shouldn’t delete records, shouldn’t use snapshot isolation, and shouldn’t use readable secondaries. You just have to be aware that they can all cause fragmentation and know how to detect, remove, and mitigate it.

SQL Sentry has a cool tool, Fragmentation Manager, which you can use as an add-on to Performance Advisor to help figure out where fragmentation problems are and then address them. You may be surprised at the fragmentation you find when you check! As a quick example, here I can visually see – down to the individual partition level – how much fragmentation exists, how quickly it got that way, any patterns that exist, and the actual impact it has on wasted memory in the system:

SQL Sentry Fragmentation Manager data<br />
(click to enlarge)SQL Sentry Fragmentation Manager data (click to enlarge)

In my next post, I’ll discuss more about fragmentation and how to mitigate it to make it less problematic.

The post Unexpected Clustered Index Fragmentation appeared first on SQLPerformance.com.

14 Apr 18:11

Incomplete checkpoints and recovery

by Paul Randal

Back in 2009 I blogged about how checkpoints work (see How do checkpoints work and what gets logged) and I received a question in email on Monday that I thought would make a good little blog post.

The question is (paraphrased): What happens if a checkpoint starts but doesn’t finish before a crash occurs? Will that checkpoint be used for crash recovery?

The answer is no, it won’t. Now if I left it at that, that really would be a little blog post, so let me explain my answer :-)

The purpose of a checkpoint is to bring the pages in the data files up-to-date with what’s in the transaction log. When a checkpoint ends, there’s a guarantee that as of the LSN of the LOP_BEGIN_CKPT log record, all changes from log records before that point are persisted in the data files on disk. There’s no guarantee about logged changes after that point, only before it. In other words, all the log records before the LSN of the LOP_BEGIN_CKPT log record are no longer required for crash recovery, unless there’s a long running transaction that started before that LSN.

When the checkpoint ends, the boot page of the database (page 9 in file 1 – see here for some more info) is updated with the beginning LSN of the checkpoint (and then if in the SIMPLE recovery mode, any log clearing/truncating can occur).

So if a checkpoint started but didn’t end before a crash, it’s LSN would not be in the boot page and so crash recovery would start from the previous checkpoint. This is good, because an incomplete checkpoint means there’s no guarantee about which logged changes are persisted in the data files, and so crash recovery wouldn’t be able to work correctly from only starting at the beginning of the incomplete checkpoint.

A corollary question could be: How does SQL Server guarantee that there’s always one complete checkpoint in the active portion of the log, in case a crash occurs?

The answer is that log clearing/truncation of a VLF containing an LOP_BEGIN_CKPT log record cannot happen until another complete checkpoint has occurred. In other words, a complete checkpoint has to occur since the last log clearing/truncation before the next one can happen. If a checkpoint hasn’t occurred, the log_reuse_wait_desc for the database in sys.databases will return CHECKPOINT. It’s not common to see this occur, but you might see it if there’s a very long running checkpoint (e.g. a very large update on a system with a slow I/O subsystem so the flushing of data file pages takes a long time) and very frequent log backups, so two log backups occur over the time taken by a single checkpoint operation. It could also happen if you’ve messed with the sp_configure recovery interval and set it higher than the default.

Interesting, eh?

The post Incomplete checkpoints and recovery appeared first on Paul S. Randal.

14 Apr 18:11

Powershell Invoke-Sqlcmd –Hostname Parameter sets the Application Name instead of the Host Name

by Greg Low

Two of the parameters in SQL Server connections are the Application Name and the Host Name. You can see these in SQL Server if you execute the following command:

image

I’ve always been a fan of having applications identify themselves in their connection strings. It makes tasks like tracing much easier. The tools supplied with SQL Server do a reasonable job of that as you can see above. But many other tools don’t do such a good job.

I was working at a site today where they are using Powershell to execute commands for monitoring. I noticed that the Powershell commands did not set the Application Name in the connection string when using Invoke-Sqlcmd. Note the following example:

image

I then tried to work out how to set the Application Name. When I checked the documentation for Invoke-Sqlcmd, it shows that the Hostname is set via the SQLCMD option –H, by using the –Hostname parameter.

image

However, note that if you use the –Hostname option, it actually sets the Application Name and does not set the Host Name:

image

 



17 Mar 09:52

Handling NULL Character \x00 when Exporting to File Using BCP

by Artemakis Artemiou [MVP]
This article discusses the behavior of the BCP utility when extracting data from SQL Server to files and more specifically to CSV files. Imagine that you have the below table: As you can see, records 2 and 3 contain null values. Then you export the table contents to CSV using BPC: bcp "SELECT [code],[description],[price] FROM [TestDB1].[dbo].[product]" queryout "c:\tmp\
17 Mar 09:51

SQL Automated Backup and Patching Support in Existing SQL Server Virtual Machines

by SQL Server Team

In January, we released the SQL Automated Backup and SQL Automated Patching services for SQL Server Virtual Machines in Azure. These services automate the processes of backing up and patching your SQL Server VMs. In that release, you were able to configure these services in the Azure Preview Portal when provisioning a new SQL Server 2014 Enterprise or Standard VM. You could also configure these services in an existing Virtual Machine via PowerShell commandlets.

We have now expanded the experience so you can configure these services in an existing SQL Server VM in the Azure Preview Portal. Whether you have already enabled these services or not inside your Virtual Machine, you can go to that VM in the Azure Preview Portal and either update your configuration or create a new configuration for each service. You will find both services under the Configuration label, shown in Figure 1 below.

Figure 1. Configuration label has both services

Try these services out in the Azure Portal, and check out the documentation for further details.

17 Mar 09:51

SQLRally Nordic - after conference thoughts

by Damian

Hello All

I was extremely honored to be able to participate in the SQLRally Nordic this year. The event took place this week in Copenhagen and I think it was a great success - from every angle. Many thanks to the organizers and many thanks to the community for being in that place. There was a great atmosphere during the whole event and what I like the most as a speaker - I had a great room for my speech! 

I hope that the event will take place also next years but now I would like to invite to the annual PLSSUG conference to Wrocław. Let's meet in May. At least half of the tracks will be in English and you could meet also may world-class speakers. Just visit the page http://sqlday.pl

 

Cheers

Damian 

17 Mar 09:51

Power BI vNext SSAS Connector and Security Reloaded

by tlachev

To follow up on my previous post on the same subject, a customer was eager to jump on the Power BI->On Prem SSAS bandwagon and try the simplified security model. But "omne initium difficile est" (every beginning is difficult). Their SSAS server was installed on a domain acme.com while their e-mail addresses were using a different scheme, e.g. user@contoso.com, although both server and accounts were under the same acme domain. As we've quickly found out, Active Directory had an issue with this setup which manifested with the following error in the SQL Server Profiler connected to SSAS.

"The following system error occurred: The name provided is not a properly formed account name."

If you see this error, follow these steps to confirm the issue:

  1. Remote in to your SSAS server.
  2. Open Command Prompt and enter:
    Whoami/upn
  3. If you see that your login domain name is different that the e-mail you use to sign up to Power BI, you'll have the same issue. For example, your e-mail address might be user@contoso.com while whoami might show user@acme.com.

The workaround is to bridge the two identity schemes by adding an alternative UPN suffix in Active Directory Users and Trusts, as Greg Galloway explained here. This of course will require help from your AD administrator (one of the most important persons to befriend on your BI quest) so some coercing is to be expected.

Thanks to Jen Underwood, Will Thompson, Mini Nair and Giri Nair from Microsoft who helped us troubleshoot the issue.

17 Mar 09:51

Customer Service Done Well

by andyleonard
Fri, Mar 6, 2015 at 8:44 AM Hi Trello People, I think our organization has Trello Gold, but I cannot access features I think I should be able to access. I cannot change backgrounds, I cannot attach emails to cards. Help, Andy Fri, Mar 6, 2015 at 11:26 AM Hi Andy, Thanks for writing to us! Is [my work email address] the email address associated with your account? If so, you're correct- as part of Trello Business Class, your account should have Trello Gold. Could you create specific steps to help us...(read more)
17 Mar 09:51

ICYMI: Troubleshooting Availability Groups and the Schema Switcharoo – SQL Server Performance

by kevin

Aaron Bertrand (b | t) is on a roll this month’s SQL Server performance highlights with more great articles from SQLPerformance.com.

 

Troubleshooting AlwaysOn – Sometimes it takes many sets of eyes

Aaron Bertrand (b | t) hit some obstacles recently when configuring an Availability Groups lab environment. This post shows how some assistance from the community helped him isolate and solve the issues.

Another argument for stored procedures

Learn the subtle ways that ad hoc queries can interfere with SQL Server performance by taking up more space in the plan cache than they really need.

Schema Switch-A-Roo : Part 2

Get even more detail about what happens to metadata when you use the schema transfer technique behind the scenes move data and database objects.

So what do you think? What SQL Server performance issues would you like to read about?

-Kev

-Follow me on Twitter!
-Google Author

The post ICYMI: Troubleshooting Availability Groups and the Schema Switcharoo – SQL Server Performance appeared first on Kevin Kline.

17 Mar 09:51

Updated SQL Server PHP Driver Now Available

by SQL Server Team

As part of SQL Server’s ongoing interoperability program, we are pleased to announce an updated Microsoft SQL Server driver for PHP.  The new driver, which supports PHP 5.6, is now available!

This driver allows developers who use the PHP scripting language to access Microsoft SQL Server and Microsoft Azure SQL Database, and to take advantage of new features implemented in ODBCThe new version works with Microsoft ODBC Driver 11 or higher.

You can download the PHP driver here.  We invite you to explore the rest of the latest the Microsoft Data Platform has to offer via a trial evaluation of Microsoft SQL Server 2014, or by trying the new preview of Microsoft Azure SQL Database.

17 Mar 09:51

Tuning Parallel Data Warehouse Queries

by Rob Farley

Performance tuning in regular SQL Server can be addressed in a number of ways. This can involve looking at what’s going on with the disk configuration, the memory configuration, the wait stats, the parallelism settings, indexing, and so much more. But if you have a Parallel Data Warehouse (PDW) environment, then there are a lot of things that are taken out of your hands.

When you buy the Analytics Platform System (APS) appliance, which hosts PDW, you’re buying a complete system. It’s an appliance, in the same way that you buy appliances for the kitchen. When you go camping and you want to cook some food, you have to build a fire, and find some way of suspending the food over the fire. Building a fire isn’t exactly easy in itself – you need to consider what you’re going to do with kindling, you need larger pieces of wood for building heat, you need a spark – and then working out how to suspend the food at an appropriate height is another matter again! There are plenty of opportunities to tune this environment, and I’m sure many people have this down to a fine art. For most of us, though, we would rather just have an appliance (an oven) and just put the food in and turn it on.

The PDW environment is like the oven rather than the campfire. Disk comes preconfigured. Network comes preconfigured. Placement of files and filegroups is preconfigured, as is the configuration of tables and databases across these filegroups. High availability decisions are made for us, and we are left to design the database.

But the design of the tables within the database can dramatically affect the performance of the system – in ways that regular SQL Server environments don’t even start to have to consider. And this is because of the concept of RAM across multiple servers.

To be able to see a row of data, ready to be able to do something with it – whether that be to return it to the user who ran the query, to manipulate it in some way, or to join it to another row from another table – it must exist in RAM. DBAs understand this, and make sure that their SQL environments have as much RAM as possible, to avoid having to pull that row into RAM from disk mid-query. But when two rows that need to be joined exist on different servers, the memory spaces in RAM that can hold those rows exist on different servers too. In PDW, we have just that situation – lots of servers, with the data spread out across them. It looks like a single server, but is actually made up of lots of servers in a scaled-out architecture.

Therefore, to join data in PDW, we have to be confident that any two rows that need to be joined exist on the same server. This can be done in one of three ways:

1. At least one of the rows can be stored on every server.

This is done by configuring one of the tables as a Replicated table rather than a Distributed table. This is very common, and is a common feature for dimension tables that don’t change particularly often. If the entire list of Stores is stored (ha!) on every server, then our Sales data can be easily joined to it without fear that the Store in question for a particular Sales row might not be on the same server as said Sales row.

2. The two tables can be distributed using the same key, with that key being part of the join conditions.

Another common method. Suppose we have a large number of Products, which are frequently being changed and versioned, and we have a large amount of Sales. Both are good candidates for being Distributed rather than Replicated – we probably wouldn’t want a full copy of our large and quickly-growing Products table on every server. But if both tables are distributed on ProductKey and this is part of the join condition (p.ProductKey = s.ProductKey), then the system will know that any two rows that are a successful match according to the join conditions must also be on the server.

3. Data can be moved mid-query.

In regular SQL, if a row isn’t in RAM already, we need to pull a copy of it into RAM off the disk. Similarly, in PDW, if the system can’t be sure that the row exists on the right server, it can pull a copy of it onto the right server. This is known as Data Movement, and just as regular DBAs try to minimise the amount of disk activity for performance tuning, PDW DBAs try to minimise Data Movement.

Data Movement comes in a variety of flavours, but I want to mention two of them here. One is broadcasting, which is turning the smaller of the two tables (or at least, the data which is identified as being of interest once other row and column filters have been applied) into a temporary replicated table, putting a copy of that data on every server. This turns our scenario into situation 1, and the query can continue. Another is shuffling, which is distributing some of the data on a different key, which is part of the join conditions – putting us into situation 2.

To satisfy a single query, there might be a lot of data movement required.

If you consider a typical data warehouse query, there are a lot of joins, across a lot of tables, and the join conditions are different for every join. Unless you start to think about PDW queries in a very different way to which you think about non-PDW queries, you can very easily suffer from poor performance simply down to unnecessary data movement.

Thinking outside the box can provide some useful mechanisms for tuning PDW, which can see the performance improve massively.

Denormalising is a method that many data warehouse designers are very familiar with, but when you start to consider the pain of joins in a PDW environment, it starts to become a lot more significant. It’s often not appropriate to denormalise too much, but if you have been keen on normalising significantly in the past, you should potentially reconsider this useful method.

Redundant predicates are something which many developers don’t seem to appreciate. I’ve heard developers say that any piece of redundant code should be removed, because it’s going to create extra work. But in the PDW world, these predicates could help explain to the PDW Query Optimizer that data movement isn’t actually necessary, and could save you a large amount of effort.

Views are often maligned in the regular SQL world, as people write views that query views that query views. I’m not suggesting that you fall into that trap, but you should certainly consider views to encapsulate join logic, encouraging the use of those redundant predicates that I just mentioned.

Table splitting seems counter-intuitive to avoiding joins, but when split horizontally (having some rows in one table and others in another), gives extra options around how to distribute the data. Some data could be distributed according to one column, and other data could be distributed on another. This has the added benefit of avoiding skew, but can also benefit from using views when accessing these tables, as PDW does not support check constraints. Splitting tables vertically is rarer, but can allow dimensions to get wider and allow for additional denormalisation before hitting the maximum row width of 32kB (in-row data is limited to 8kB, but data movement buffers are only 32kB, which makes for a much harder limit).

These methods don’t have the same kind of impact in your ordinary SQL Server environments, but can have a massive effect in PDW. You can see examples of how these things can come into play by looking at a previous post of mine http://sqlblog.com/blogs/rob_farley/archive/2014/12/05/minimising-data-movement-in-pdw-using-query-optimisation-techniques.aspx, where you should be able to spot my use of all of these methods.

Oh, and this piece about performance tuning fits into this month’s T-SQL Tuesday topic, hosted by Russ Thomas (@sqljudo) on the topic of Performance.

TSQL2sDay150x150

@rob_farley

17 Mar 09:50

Data Stories…

by Karen Lopez
17 Mar 09:50

7 Databases in 70 Minutes

by Karen Lopez

Lara Rubbelke (@sqlgal ) and I recently presented 7 Databases in 70 Minutes, a sort of homage to the book 7 Databases in 7 Weeks.  The event was SQLBits, a UK-based SQL Server event.  We first gave this talk at the PASS Summit last year.

We don’t talk about the same databases as the book, but the concepts are the same.  We cover relational, column family, graph, key value, Hadoop, and document database technologies, focusing mostly on the reasons why you would want to consider these and what a typical create and query statement might look like.

And then we end with 7 reasons why you should start exploring them.

It’s a blast talking about so many things in such a short time frame and it’s fun watching light bulbs go off as people realize these aren’t just silly open source projects, but real, enterprise class solutions for common enterprise processes.

Check out our slide deck.

Have you been looking at non-relational technologies to tell your data stories, too?

17 Mar 09:50

T-SQL Tuesday #64 : One Trigger or Many?

by Aaron Bertrand

http://sqljudo.wordpress.com/2015/03/02/tsql-tue-64-calling-all-tuners-and-gear-heads/

It's that Tuesday of the month – you know, the one when the blogger block party known as T-SQL Tuesday happens. This month it is hosted by Russ Thomas (@SQLJudo), and the topic is, "Calling All Tuners and Gear Heads." I'm going to treat a performance-related problem here, though I do apologize that it might not be fully in line with the guidelines Russ set out in his invitation (I'm not going to use hints, trace flags or plan guides).

At SQLBits last week, I gave a presentation on triggers, and my good friend and fellow MVP Erland Sommarskog happened to attend. At one point I suggested that before creating a new trigger on a table, you should check to see if any triggers already exist, and consider combining the logic instead of adding an additional trigger. My reasons were primarily for code maintainability, but also for performance. Erland asked if I had ever tested to see if there was any additional overhead in having multiple triggers fire for the same action, and I had to admit that, no, I hadn't done anything extensive. So I'm going to do that now.

In AdventureWorks2014, I created a simple set of tables that basically represent sys.all_objects (~2,700 rows) and sys.all_columns (~9,500 rows). I wanted to measure the effect on the workload of various approaches to updating both tables – essentially you have users updating the columns table, and you use a trigger to update a different column in the same table, and a few columns in the objects table.

  • T1: Baseline: Assume that you can control all data access through a stored procedure; in this case, the updates against both tables can be performed directly, with no need for triggers. (This isn't practical in the real world, because you can't reliably prohibit direct access to the tables.)
  • T2: Single trigger against other table: Assume that you can control the update statement against the affected table and add other columns, but the updates to the secondary table need to be implemented with a trigger. We'll update all three columns with one statement.
  • T3: Single trigger against both tables: In this case, we have a trigger with two statements, one that updates the other column in the affected table, and one that updates all three columns in the secondary table.
  • T4: Single trigger against both tables: Like T3, but this time, we have a trigger with four statements, one that updates the other column in the affected table, and a statement for each column updated in the secondary table. This might be the way it's handled if the requirements are added over time and a separate statement is deemed safer in terms of regression testing.
  • T5: Two triggers: One trigger updates just the affected table; the other uses a single statement to update the three columns in the secondary table. This might be the way it's done if the other triggers aren't noticed or if modifying them is prohibited.
  • T6: Four triggers: One trigger updates just the affected table; the other three update each column in the secondary table. Again, this might be the way it's done if you don't know the other triggers exist, or if you're afraid to touch the other triggers due to regression concerns.

Here is the source data we're dealing with:

-- sys.all_objects:
SELECT * INTO dbo.src FROM sys.all_objects;
CREATE UNIQUE CLUSTERED INDEX x ON dbo.src([object_id]);
GO
 
-- sys.all_columns:
SELECT * INTO dbo.tr1 FROM sys.all_columns;
CREATE UNIQUE CLUSTERED INDEX x ON dbo.tr1([object_id], column_id);
-- repeat 5 times: tr2, tr3, tr4, tr5, tr6

Now, for each of the 6 tests, we're going to run our updates 1,000 times, and measure the length of time

T1: Baseline

This is the scenario where we're lucky enough to avoid triggers (again, not very realistic). In this case, we'll be measuring the reads and duration of this batch. I put /*real*/ into the query text so that I can easily pull the stats for just these statements, and not any statements from within the triggers, since ultimately the metrics roll up to the statements that invoke the triggers. Also note that the actual updates I'm making do not really make any sense, so ignore that I'm setting the collation to the server/instance name and the object's principal_id to the current session's session_id.

UPDATE /*real*/ dbo.tr1 SET name += N'',
  collation_name = @@SERVERNAME
  WHERE name LIKE '%s%';
 
UPDATE /*real*/ s SET modify_date = GETDATE(), is_ms_shipped = 0, principal_id = @@SPID
  FROM dbo.src AS s
  INNER JOIN dbo.tr1 AS t
  ON s.[object_id] = t.[object_id]
  WHERE t.name LIKE '%s%';
 
GO 1000

T2: Single Trigger

For this we need the following simple trigger, which only updates dbo.src:

CREATE TRIGGER dbo.tr_tr2
ON dbo.tr2
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE s SET modify_date = GETDATE(), is_ms_shipped = 0, principal_id = SUSER_ID()
    FROM dbo.src AS s 
	INNER JOIN inserted AS i
	ON s.[object_id] = i.[object_id];
END
GO

Then our batch only needs to update the two columns in the primary table:

UPDATE /*real*/ dbo.tr2 SET name += N'', collation_name = @@SERVERNAME
  WHERE name LIKE '%s%';
GO 1000

T3: Single trigger against both tables

For this test, our trigger looks like this:

CREATE TRIGGER dbo.tr_tr3
ON dbo.tr3
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE t SET collation_name = @@SERVERNAME
    FROM dbo.tr3 AS t
	INNER JOIN inserted AS i
	ON t.[object_id] = i.[object_id];
 
  UPDATE s SET modify_date = GETDATE(), is_ms_shipped = 0, principal_id = @@SPID
    FROM dbo.src AS s
    INNER JOIN inserted AS i
    ON s.[object_id] = i.[object_id];
END
GO

And now the batch we're testing merely has to update the original column in the primary table; the other one is handled by the trigger:

UPDATE /*real*/ dbo.tr3 SET name += N''
  WHERE name LIKE '%s%';
GO 1000

T4: Single trigger against both tables

This is just like T3, but now the trigger has four statements:

CREATE TRIGGER dbo.tr_tr4
ON dbo.tr4
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE t SET collation_name = @@SERVERNAME
    FROM dbo.tr4 AS t
	INNER JOIN inserted AS i
	ON t.[object_id] = i.[object_id];
 
  UPDATE s SET modify_date = GETDATE()
    FROM dbo.src AS s
    INNER JOIN inserted AS i
    ON s.[object_id] = i.[object_id];
 
  UPDATE s SET is_ms_shipped = 0
    FROM dbo.src AS s
    INNER JOIN inserted AS i
    ON s.[object_id] = i.[object_id];
 
  UPDATE s SET principal_id = @@SPID
    FROM dbo.src AS s
    INNER JOIN inserted AS i
    ON s.[object_id] = i.[object_id];
END
GO

The test batch is unchanged:

UPDATE /*real*/ dbo.tr4 SET name += N''
  WHERE name LIKE '%s%';
GO 1000

T5: Two triggers

Here we have one trigger to update the primary table, and one trigger to update the secondary table:

CREATE TRIGGER dbo.tr_tr5_1
ON dbo.tr5
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE t SET collation_name = @@SERVERNAME
    FROM dbo.tr5 AS t
	INNER JOIN inserted AS i
	ON t.[object_id] = i.[object_id];
END
GO
 
CREATE TRIGGER dbo.tr_tr5_2
ON dbo.tr5
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE s SET modify_date = GETDATE(), is_ms_shipped = 0, principal_id = @@SPID
    FROM dbo.src AS s
    INNER JOIN inserted AS i
    ON s.[object_id] = i.[object_id];
END
GO

The test batch is again very basic:

UPDATE /*real*/ dbo.tr5 SET name += N''
  WHERE name LIKE '%s%';
GO 1000

T6: Four triggers

This time we have a trigger for each column that is affected; one in the primary table, and three in the secondary tables.

CREATE TRIGGER dbo.tr_tr6_1
ON dbo.tr6
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE t SET collation_name = @@SERVERNAME
    FROM dbo.tr6 AS t
    INNER JOIN inserted AS i
    ON t.[object_id] = i.[object_id];
END
GO
 
CREATE TRIGGER dbo.tr_tr6_2
ON dbo.tr6
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE s SET modify_date = GETDATE()
    FROM dbo.src AS s
    INNER JOIN inserted AS i
    ON s.[object_id] = i.[object_id];
END
GO
 
CREATE TRIGGER dbo.tr_tr6_3
ON dbo.tr6
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE s SET is_ms_shipped = 0
    FROM dbo.src AS s
    INNER JOIN inserted AS i
    ON s.[object_id] = i.[object_id];
END
GO
 
CREATE TRIGGER dbo.tr_tr6_4
ON dbo.tr6
AFTER UPDATE
AS
BEGIN
  SET NOCOUNT ON;
  UPDATE s SET principal_id = @@SPID
    FROM dbo.src AS s
    INNER JOIN inserted AS i
    ON s.[object_id] = i.[object_id];
END
GO

And the test batch:

UPDATE /*real*/ dbo.tr6 SET name += N''
  WHERE name LIKE '%s%';
GO 1000

Measuring workload impact

Finally, I wrote a simple query against sys.dm_exec_query_stats to measure reads and duration for each test:

SELECT 
  [cmd] = SUBSTRING(t.text, CHARINDEX(N'U', t.text), 23), 
  avg_elapsed_time = total_elapsed_time / execution_count * 1.0,
  total_logical_reads
FROM sys.dm_exec_query_stats AS s 
CROSS APPLY sys.dm_exec_sql_text(s.sql_handle) AS t
WHERE t.text LIKE N'%UPDATE /*real*/%'
ORDER BY cmd;

Results

I ran the tests 10 times, collected the results, and averaged everything. Here is how it broke down:

Test/Batch Average Duration
(microseconds)
Total Reads
(8K pages)
T1: UPDATE /*real*/ dbo.tr1 … 22,608 205,134
T2: UPDATE /*real*/ dbo.tr2 … 32,749 11,331,628
T3: UPDATE /*real*/ dbo.tr3 … 72,899 22,838,308
T4: UPDATE /*real*/ dbo.tr4 … 78,372 44,463,275
T5: UPDATE /*real*/ dbo.tr5 … 88,563 41,514,778
T6: UPDATE /*real*/ dbo.tr6 … 127,079 100,330,753

 
And here is a graphical representation of the duration:

Average Duration (microseconds)

Conclusion

It is clear that, in this case, there is some substantial overhead for each trigger that gets invoked – all of these batches ultimately affected the same number of rows, but in some cases the same rows were touched multiple times. I will probably perform further follow-on testing to measure the difference when the same row is never touched more than once – a more complicated schema, perhaps, where 5 or 10 other tables have to be touched every time, and these different statements could be in a single trigger or in multiple. My guess is that the overhead differences will be driven more by things like concurrency and the number of rows affected than by the overhead of the trigger itself – but we shall see.

Want to try the demo yourself? Download the script here.

The post T-SQL Tuesday #64 : One Trigger or Many? appeared first on SQLPerformance.com.

17 Mar 09:50

Announcing SSIS Design Patterns Training in London 7-10 Sep 2015!

by andyleonard
I am honored to work alongside TechniTrain to deliver SQL Server Integration Services Design Patterns in London 7-10 Sep 2015! Register before 31 Mar to save £100. Who is it for? The target audience for this course is intermediate SQL Server Integration...(read more)
17 Mar 09:49

Data Loading performance considerations with Clustered Columnstore indexes

by Denzil Ribeiro

This article describes data loading strategies specific to tables with a Clustered Columnstore index on SQL Server 2014.  For fundamental data  loading strategies, an excellent read is the whitepaper Data Loading Performance Guide and it is greatly recommended. Though that whitepaper doesn’t include Columnstore indexes, many other concepts presented there still hold true for any data loading into SQL server.

 

Overview of Loading Data into Columnstore Indexes

When you insert or load data into a table that has a Clustered Columnstore index, the insert lands in one of 2 places:

  • The delta store
  • Compressed Columnstore

If you insert a small number of rows (< 102,400 rows) in a table with a Clustered Columnstore index ends up in the “delta store”. The “delta-store” is a row store. When this Delta store has 1,048,576 rows, it is marked as CLOSED and is compressed into columnar format by a background process called the Tuple Mover. The Tuple mover encodes and compresses the data and stores it in a columnar format.

If you are bulk loading data, whether this data lands in the delta store (row store) or directly in a compressed Columnstore depends on the batch size specified during the bulk insert process. If the Batch size for the insert is > 102,400 rows, the data no longer ends up in the delta store, rather is inserted directly into a compressed rowgroup in columnar format. This is true whether it is Bulk insert or BCP or anything using the Bulk API, even true for insert/select operations inserting a large amount of data as long as a single batch size inserts more than 102,400 rows.The size of a rowgroup is capped at 1,048,576 rows. Ideally we want the compressed rowgroup to be as close to that upper limit as the potential for larger compression exists. The added benefit of having a higher density rowgroup is scanning less row groups when querying the data.

The rowgroup size is not just a factor of the max limit but is affected by the following factors.

  • The dictionary size limit which is 16MB
  • Insert batch size specified
  • The Partition scheme of the table since a rowgroup doesn’t span partitions
  • Memory Grants or memory pressure causing row groups to be trimmed
  • Index REORG being forced or Index Rebuild being run

 

Transaction Logging implications

In a Data-warehouse scenario, assuming that our recovery model is SIMPLE or BULK-Logged, the trace flag 610 can have implications in certain cases even in the Columnstore scenario. There is a detailed explanation of trace flag 610 in the Data Loading Guide under the section I/O Impact of Minimal Logging under Trace Flag 610

The delta store is a page compressed Btree and so the minimal logging considerations still apply. Below is an example of a bulk load of 50,000 rows into a table. Given the number of rows is < 102,400, it lands in the delta store. Logging of any inserts into the delta store are affected by the trace flag 610.

Looking at the Log Records:

select count(*) as CountLogRecords from fn_dblog(NULL,NULL)

image

 

Ordering Data at Initial Load Time

When loading data into a clustered Columnstore index, consideration should be given not only to data load performance but also to query performance. Unlike a btree clustered index, where rows are ordered, loading into a clustered Columnstore index doesn’t order the data. Row groups are compressed in the order that the data is loaded and no specific ordering is done post rowgroup compression.

Say for example, we have a table with one of the columns being MarketID and we have a daily load that happens for all markets. If the input data that is loaded is not ordered by MarketID, then then queries that encompass the Market ID as a group by Column or predicate may have to scan all the segments. Due to how the data was inserted, rowgroups may not be able to be eliminated

In a case such as this, for the initial data load into a clustered column store table, it may be beneficial to sacrifice some load speed in favor of query speed by implementing the approach below

  • Load Data into a Heap ( Can be loaded concurrently with TABLOCK)
  • Create a Clustered index on the required column (MarketID in this example)
  • Create a Clustered Columnstore index with DROP_EXISTING = ON clause.

This will increase your load time but the trade-off is better rowgroup elimination and query execution time. This strategy can be used only on the initial load. Incremental data loads would benefit from the loaded data being ordered or partitioned by MarketID if the application or ETL process has the ability to sort the data prior to inserting or produce sorted data when reading from the data source.

 

Data Loading Scenarios: Non-Partitioned table

In this scenario we will look at loading data into a non-partitioned table and specifically look at the impact of parallel loading of data streams and the effect batch size in loading data.

 

Concurrent Loading

Loading in concurrent Streams can increase the Load throughput into a single table. In this case when we load concurrently, multiple row groups are created whether delta rowgroups or compressed rowgroups. Concurrent loading of the data does enable us to utilize multiple cores on the system and in an ideal scenario with no other bottleneck you can spawn one bulk load per core. How many parallel streams to spawn in reality though depends on both the size of data and resources on the server as at some point you will hit a physical resource bottleneck whether it is CPU or Memory or Disk IO or a logical resource bottleneck such as contention on a ROWGROUP_FLUSH lock which can be induced by slower disk IO on the Log drive. You can use the Waits and Queues Methodology alongside performance monitor to find out your primary bottleneck if at some point you do not get closer to linear scale as you increase the number of streams. A good tool to analyze the waits is SQL Nexus

The data below is from some tests run on a particular table on a particular server (see appendix). The objective of this is to demonstrate that parallel streams do improve throughput.

Data Input: 32 files for a total of 340 GB, 3 billion rows inserted into a single non-partitioned table. All the 32 files are processed for each run, the number of concurrent streams indicate how many files are processed in parallel.

 

Table Type

Concurrent
streams

Duration

hh:mm:ss

DBSize_GB

Throughput
GB /Hour

AvgCPU

MaxCPU

CCI

4

2:05:01

164

135

6%

9%

CCI

8

1:05:39

164

341

14%

15%

CCI

16

0:32:41

164

637

27%

28.5%

CCI

32

0:24:00

164

816

50%

56%

Heap with TABLOCK

16

0:15:29

360

1360

18%

22%

Heap with TABLOCK

32

0:09:15

360

2267

36%

38%

 

Note: In the case of loading into a heap, there isn’t any encoding or compression that needs to be done on the data which does affect the overall load speed but in the case of loading data into a heap you do significantly more IO. However with a Columnstore index given the data is both encoded and compressed the final size of the data is significantly smaller and some of the time is taken to compress and encode this data.

If we consider just the Clustered Columnstore tables, we see that concurrent streams scale in a rather linear fashion until you start hitting resource bottlenecks. This test had multiple streams being inserted into the same table.

 

image

 

Size or space considerations and Segment Density

The space or compression that you get on a table using CCI depends on several factors including the schema of the table and types and number of columns, the redundancy of data within a row group, the average size of a row group among others. When you look at the size for a table with a cluster Columnstore index, the size includes the size for the rowgroup itself and the size for the dictionaries.

Loading with different batch sizes can result in different sizes of the table potentially. Note that batch size here isn’t the only factor that controls the size of the rowgroup as we have mentioned earlier in the article, but is one that is easily controlled during load. One of the other implications of having a rowgroup filled as close to the limit as possible is that during Query time, the unit of IO is a segment which means that the fuller the segments, the less segments have to be read.

In the table below, we not only consider the batch size, but also the average rowgroup size in question since there are multiple factors as described earlier that contribute to a rowgroup being trimmed which in turn can result in lower compression. In the case of bulk loading in batches less than 102,400 rows, we ought to consider using trace flag 610 as performance is benefited if the delta store operations are minimally logged which either way is a best practices in a Data Warehouse. As you can see from the table below, as the batch size reduces, the Database size potentially increases. One more observation here is that though we specify a batch size of 1048576 rows, the rowgroup is trimmed due to the dictionary being full. A good practice is to start out with a batch size of 1048576 and then adjust from there based on the average rows/rowgroup.

Data Input: 16 files totaling 170 GB concurrently loaded by 16 processes. Total of 1.5 Billion rows inserted into a single non-partitioned table.

 

Batch Size

DBSize

Duration

Avg Rows/ rowgroup

Trim Reason

50,000

138.5 (1) 82.5 ( 2)

0:24:22

1048576 (1) 349813 (2)

Inserts into Row-store until it is filled, compressed later by tuple Mover. Data not immediately available in columnar format. Rowgroup trimmed due to dictionary full.

102,400

87.4

0:20:01

102,400

Batch Size

1048576

82.5

0:17:05

349813

Dictionary Full

 

Note: The batch size parameter depends on the bulk load method. BCP uses BATCHSIZE, BULK Insert can use BATCHSIZE, SSIS uses Maximum Insert Commit Size. Further details on the reasons for rowgroups being trimmed are here

(1) When you insert into a row store (batch size < 102,400) data is initially inserted into the delta store which is a page compressed btree. At a later point compression into columnar format is done by the tuple mover background process. When you look at the size initially the size can be larger as the background process hasn’t moved all the rows into compressed columnar storage.

(2) This is the point that the Tuple Mover has moved all the rows into compressed row groups and this can take a while as the tuple mover is single threaded.

 

Data Loading Scenarios: Partitioned table

Most often, very large data warehouse tables are partitioned due to some of the benefits of partitioning that include

  • Enabling Data archival with a sliding window
  • Efficient loading of data
  • Manageability improvements including index rebuilds at the partition level
  • Partition elimination during queries

Loading data into a partitioned table with a Columnstore index depends largely on a several factors

  • The data source from which Data is being Loaded
  • Whether the data being loaded is sorted or not.
  • Data being loaded into an empty partition or rather the granularity of data load matching the partition granularity

 

Data Loads aligned to Partition Boundary:

This is a case that we have a partitioned table and data is loaded into an empty partition. In this case we follow the methodology below just like we do for a single partition table and then swap that table into the empty partition. This is covered in detail in the Data Loading Guide under then Bulk Loading into a partitioned table Section. Here are the brief steps

  • Create a Staging table with same definition as the destination table including the Clustered Columnstore index.
  • Load the data into the staging table (concurrently if possible).
  • Create the constraints that match the partition boundaries
  • Swap that staging table with the corresponding empty partition

Optional: If you require the data ordered for better rowgroup elimination based on query patterns, see the steps detailed out in “Ordering Data at Initial Load Time” section.

Data Loads that span multiple partitions

In this scenario, we have a partition table where the data being loaded from a file or single stream may span multiple partitions. In this case, SQL Server has to sort the data is sorted prior to insert into the CCI table. Depending on the batch size used in this scenario and the resources such as memory required, the Sort will reside in memory or will spill to disk as seen in the plan below. The sort is more expensive if the input data is not sorted on the partition key. If the data load is coming from an existing relational table, it would be beneficial to sort it with the ORDER BY clause on the partition key against the source where the data resides. In the case of flat files as your input, we would recommend if possible for the application or ETL process to produce files that are sorted on the partition key and aligned to a partition boundary.

 

image

 

Choosing a batch size here has a several ramifications:

  • The sorting is done for a particular batch, so it affects the memory grant size and resources to sort the number of rows that correspond to the batch size
  • The batch size is applicable to the Bulk insert statement as a whole and not limited to the number of rows inserted within a specific partition. Even though you specify a batch size of ~ 1 Million, you could end up with several smaller compressed row groups than you are used to seeing for a non-partitioned table.
  • If more than 102,400 rows are inserted in a single partition within a single batch, then they are instantly compressed, if not they are inserted into a delta store (row store).
  • If the bulk load distributes rows across all partitions in small batches, you may end up with open delta rowgroups in all partitions and may have to force compression using the index REORG
  • Using a higher batch size than ~1 Million rows may be beneficial if you know that the incoming data being inserted spans multiple partitions.

If we look at the sys.column_store_row_groups, the general expectation is that if we use a batch size > 102,400 you will see data being inserted into compressed row groups directly. However row groups exist within the partition boundary and so we may see inserts into a delta store or row store within a partition though our batch size is greater than 102,400. At the end of your data load, you could potentially see a cross section of row groups in the OPEN state, some CLOSED and some COMPRESSED.

Example below shows row groups in partition 2 that has both a delta store rowgroup that is OPEN and currently accepting rows, and one that is CLOSED waiting for TUPLE mover to move it to compressed columnar format and one that is already in a compressed format.

image

 

Looking at the table below we observe that

  • When the Input Data is sorted on the Partition key, the throughput is higher than when it is not
  • Import into a Partitioned table can insert into delta store as shown above if the rows that land in a specific partition are < 102,400 per batch
  • The DB Size specified below is the final DB size once the Tuple mover is run. Depending on how spread out the data is across partitions, initial size can be much larger before Tuple move compresses the row groups
  • If the data is not sorted on the partition key, knowing a good batch size to use at load time is trickier and we may end up with smaller row groups which means the compression will be less than ideal as seen in the last 2 rows of the table. Increasing batch size will reduce table size at the expense of resources such as memory for the sort.

Data Input: 16 files totaling 170 GB concurrently loaded by 16 processes. Total of 1.5 Billion rows inserted.

 

Table Type

Sorted on Partition Key

Concurrent
streams

 

 

Batch Size

Duration

  

DB Size

GB

Throughput
GB /Hour

 

Average

CPU

 

Non-Partitioned

N/A

16

1048576

0:16: 05

82.5

637

27%

Partitioned

17 Mar 09:49

SQL Server 2014 is Certified for SAP Applications On-Premises and in the Cloud

by SQL Server Team

As of March 11 2015, SAP has certified support for SAP NetWeaver-based applications on Microsoft SQL Server 2014.  Now you can run even more of your Tier-1, mission-critical workloads on SQL Server.  And, the ability to run SAP on Microsoft Azure means that it can be accomplished with low total cost of ownership (TCO).

SQL Server 2014 provides the higher scale, availability, and breakthrough performance needed for your most demanding SAP workloads. The updatable in-memory ColumnStore will deliver blazing fast query performance for your SAP Business Warehouse (BW).  SQL Server AlwaysOn availability groups help with the reliability and availability requirements of SAP systems by enabling multiple, readable secondaries that can be used for failover and read workloads like reporting and backup.

With SAP’s certification, you can also run SAP in Microsoft Azure Virtual Machines with SQL Server 2014. Azure enables SAP customers to reduce TCO by leveraging Microsoft infrastructure as system needs grow, rather than investing in additional servers and storage.  With Microsoft Azure, customers can leverage development and test environments in the cloud that can be spun up and scaled out as needed.  SQL Server 2014 also introduced Disaster Recovery to Azure using an asynchronous AlwaysOn secondary, which can make Azure a part of your SAP disaster recovery plan.

With the certification, customers can now adopt SQL Server 2014 for mission-critical SAP workloads, and we look forward to telling you their stories soon. Here are some customers who are taking advantage of SAP on SQL Server today:

  • Quanta Computer Boosts Performance of Its SAP ERP System with In-Memory Technology
  • Zespri International Prunes Costs, Defends Business from Disasters by Running SAP in the Cloud
  • Saudi Electric Company Increases Query Times by 75 Percent, Can Respond Faster to Customers
  • Mitsui & Co. Deploys High-Availability and Disaster-Recovery Solution After Earthquake

Many companies are already betting their mission critical apps on SQL Server 2014. To read about Microsoft’s leader position for business critical operational database management and data warehouse workloads, read Gartner's Magic Quadrant for Operational Database Management Systems and Magic Quadrant for Data Warehouse Database Management Systems Report.  For more information about how customers are already using SQL Server 2014 for mission critical applications, read these case studies:

For more about the powerful combination of Microsoft and SAP, visit http://www.microsoft.com/SAP.  To get started with SQL Server 2014, click here.

17 Mar 09:49

Biml: An Official Topic for PASS Summit 2015 Presentations

by andyleonard
I have waited (again) until the last minute to submit presentations for the PASS Summit. I’m submitting “Using Biml as an SSIS Design Patterns Engine” – a presentation based on a chapter in SSIS Design Patterns and Levels 2-4 of the Stairway to Biml – and I am pleasantly surprised to see Biml included in the top selections. w00t! :{> PS – The PASS Summit 2015 Call for Speakers closes at 9:00 PM PDT Sunday, 15 Mar 2015 (04:00 GMT 16 Mar 2015). There’s still time, but hurry! Learn more: Linchpin...(read more)
17 Mar 09:48

Moving Reporting Services off of a Cluster

by psssql

We had a customer that had deployed Reporting Services to their Cluster. They now wanted to move the RS Instance off of the Cluster and onto its own machine and leave the Catalog Database on the Clustered SQL Server.

We have a blog talks about Reporting Services and clusters. You can find that at the following link.

Reporting Services, Scale Out and Clusters…
http://blogs.msdn.com/b/psssql/archive/2010/05/27/reporting-services-scale-out-and-clusters.aspx

This focuses more on why you shouldn’t do it and doesn’t address how to get out of the situation if you are in it. So, I wanted to just outline what we did for this customer and it may help others who get into the same situation.

Our goal is to not have RS running on either physical node of the Cluster and instead have RS running on a separate machine outside of the cluster. We want RS to be running on its own server.

NOTE: This is for Native Mode Reporting Services.

Let’s go through the steps to get this migration accomplished.

Backup the Encryption Key

The first thing we need to do is backup the Encryption key for the current instance that is running. We can do this by going to the Reporting Services Configuration Manager and going to the Encryption Keys section.

clip_image001

clip_image003

The Backup button should be enabled if you haven’t already backed up the key.

Make sure you have the Virtual Network Name (VNN) of your SQL Cluster

If you don’t know the VNN of your SQL Cluster, you can go to the Failover Cluster Manager to get this. Make sure you are looking at the SQL Cluster and not the Windows Cluster. We will need this name when we point the other machine to the Database holding the catalog database.

Assuming that the current RS Instance on the cluster is using that cluster for the catalog database, you can also get it from the Reporting Services Configuration Manager in the Database Section.

clip_image004

Stop Reporting Services

Make sure that the Reporting Services Service is stopped on both Cluster Nodes. You will also want to change the service to be disabled so it doesn’t start back up. To disable the service, you can do that within the SQL Server Configuration Manager.

Go to the properties of the Reporting Services Service. On the Service Tab, change the Start Mode to Disabled.

clip_image006

Install Reporting Services

Go ahead and install Reporting Services on the server you want it to run on. Depending on what you are going to do on that server, you should only need to choose the Native Mode RS Feature and nothing else.

Configure the new Reporting Services Instance

After the instance is installed on the new machine, start the Reporting Services Configuration Manager.

The setup will be the normal configuration steps you would do for configuring Reporting Services with the following exceptions.

Database

Make sure we are pointing to the Virtual Network Name of the SQL Cluster for the Database Server. Also make sure we select the Catalog Database that the other server was using. We want to use the same one to make sure we don’t lose any data. The default name will be ReportServer.

Scale-Out Deployment

After the database is configured, you can go to the Scale-Out Deployment section. If you see the Cluster Nodes listed here, you will want to remove them. As we only want this new server to be used.

clip_image008

Encryption Keys

We will now want to restore the Encryption Key that we already backed up. Go to the Encryption Keys tab and click on Restore.

clip_image010

That’s it! It should be up and running now on the new server and you should be able to browse to Report Manager and see your reports and they should render.

References

Host a Report Server Database in a SQL Server Failover Cluster
https://msdn.microsoft.com/en-us/library/bb630402.aspx

Configure a Native Mode Report Server Scale-Out Deployment
https://msdn.microsoft.com/en-us/library/ms159114.aspx

 

Robyn Ciliax
Microsoft Business Intelligence Support

17 Mar 09:48

Fixing VLF issues for all databases on a server

by Jeff Garbus
Note: This script presumes that databases are in SIMPLE recovery mode (as it was at this client site at the time). It reduces log file size to minimum, then increases it to what it was before it was shrunk, in one big alter. At the same time, it changes autogrowth to 10% across the board.




 


/*


select db_name(),(size+127)/128,* from sys.sysfiles where fileid = 2


DBCC SHRINKFILE (N'IMChartData_log' , 2)


GO


ALTER DATABASE [AMDDEMO] MODIFY FILE ( NAME = N'IMChartData_log', SIZE = 267264KB )


GO


*/


DECLARE @cmd VARCHAR(6000)


SET @cmd = 'USE [?] ;


declare @filename varchar(1000), @dbcc varchar(2000), @alter varchar(1000), @size int


select @filename = name, @size = (size+127)/128 from sys.sysfiles where filename like ''%ldf''


print db_name()


print @filename


set @dbcc = ''dbcc shrinkfile (xx, 2)''


set @dbcc = REPLACE(@dbcc, ''xx'' ,@filename)


set @alter = ''ALTER DATABASE ['' + db_name() +


''] MODIFY FILE ( NAME = xx, SIZE = '' + convert(varchar, @size ) + ''MB, FILEGROWTH = 10%)''


set @alter = REPLACE(@alter, ''xx'' ,@filename)


print @dbcc


print @alter


exec (@dbcc)


exec (@alter)


'


Exec sp_msforeachdb @cmd

17 Mar 09:48

The preview of Power BI is now available worldwide #powerbi

by Marco Russo (SQLBI)

Today Microsoft announced the availability of Power BI preview also outside United States.

http://blogs.msdn.com/b/powerbi/archive/2015/03/16/power-bi-preview-now-available-worldwide.aspx

One important thing to note is that also the mobile app for iPad and iPhone are also available worldwide. Jus go in the Power BI Download page and select the app you want. For a PC, I strongly suggest to download the Power BI Designer. You will find another good reason to do that later this week on this blog... but no spoilers by now!

I suggest you to read the longest recap made by Chris Webb on his blog: there is also some news about public preview of Office 2016. 

17 Mar 09:48

Does statistics update cause a recompile?

by JackLi

This my “statistics series” blog.   See “Statistics blogs reference” at end of this blog.

In this blog, I will talk two scenarios related to recompile in conjunction with statistics update.  A statement can be recompiled for two categories of reasons.  First category is related to correctness (such as schema change).  Another category is related to plan optimality.   Statistics update related recompile falls into second category.

If I were to ask you a question “Does statistics update cause recompile for a query referencing the table?”,  what would your answer be?  In most cases, the answer is YES!  However there are a couple of scenarios where recompile is not necessary.   In other words, a query won’t recompile even you have updated statistics for the tables being accessed.  We actually got users who called in and inquired about the behaviors from time to time.

Scenario 1 – trivial plan

When a plan is trivial, it’s unnecessary to recompile the query even statistics has been updated.  Optimizer generates trivial plan for very simple queries (usually referencing a single table).  In XML plan, you will see statementOptmLevel="TRIVIAL". In such case, it’s futile and you won't get a better or different plan.

Let’s see this in action.  In the script below, I created a table and two procedures (p_test1 and p_test2).  p_test1 has a very simple statement.   I execute them once so that the plans will be cached.  Then one row is inserted (this is very important as it will be explained in the 2nd scenario).  Statistics then is updated.

use tempdb
go
if object_id ('t') is not null      drop table t
go
create table t(c1 int)
go
create procedure p_test1 @p1 int
as
    select c1 from t where c1 = @p1
go
create procedure p_test2 @p1 int
as
select t1.c1 from t t1 join t t2 on t1.c1=t2.c1 where t1.c1 = @p1
go
set nocount on
declare @i int = 0

while @i < 100000
begin
    insert into t values (@i)
    set @i += 1
end
go
create index ix on t(c1)
go
exec p_test1 12
exec p_test2  12
go
insert into t values (-1)
go
update statistics t
go

 

I started profiler trace to trace “SQL:StmtRecompile” event followed by running the following queries again

--no recompile because of trivial plan
exec p_test1 12
--recompile because of stats updated with data change and it's not a trivial plan
exec p_test2  12

Note that only the statement from p_test2 produced StmtRecompile event.   This is because the statement in p_test1 produced a trivial plan.  Recompile would be futile anyways.

image

 

Scenario 2 –no data change

In this scenario, the plan can be non-trivial plan but it still won’t recompile if the table whose statistics was updated hasn’t got any row modification(insert,delete and update) since last statistics update. 

Let’s use the same demo above to illustrate the behavior. 

let’s update statistics one more time  (update statistics t).  Note that I didn’t modify the table.  Now run the query (p_test2)  again below.  Note that no StmtRecompile event was produced.  It used the existing plan.    In short,  if there is no data change, there is no need to recompile.

--no recompile because there is no data change even though stats got updated
exec p_test2  12

image

 

Scenario 2 actually has complications.  Suppose that you updated statistics yesterday.  Today you decided that you need to update statistics with fullscan thinking that it may produce better statistics to benefit queries.   But there was no change in data.  You may be in for a surprise that SQL still used the same plans without recompileing.   In such case, you will need to manually free procedure cache to get rid of the plan.

Statistics blogs reference

  1.  filtered statistics
  2. statistics update with index rebuild
  3. partitioned table statistics
  4. sampling & statistics quality.

 

Jack LI | Senior Escalation Engineer | Microsoft SQL Server Support

17 Mar 09:47

MVP Events - March 2015

by MVP Award Program

 

The event will take place at Microsoft offices in São Paulo March 13-14. During the two-day event, attendees will have a varied program of technical and strategic sessions, delivered by experts from Brazil and Microsoft Technology Center, as well such as MVPs. The sessions were selected to serve all segments (Developer, Consumer Experts and IT Pro), and strive for a strong interaction and relationship between MVPs and Microsoft. We want you to make the most and enjoy!

Hosted by Microsoft, the inaugural US MVP Open Days kicks off March 27th at the Microsoft Office in Malvern, Pennsylvania. The two-day event will be packed with technical sessions, a series of facilitated breakout sessions, opportunities to connect with Microsoft team members and MVPs and—of course—delicious food. We will have sessions which are of interest to all MVPs across the business segments (developer, consumer and IT pro)—presented by Microsoft team members and MVPs. Our goal: help enhance your award year experience with this regional opportunity to learn, network and have fun with the community.  

Representing 17 countries within Middle East & Africa, MEA MVP Program is the home for 128 MVPs, providing support on 32 different expertise contributing to the Microsoft communities day and night. Additionally to the Global MVP Summit where you can interact with specific product groups focusing on your expertise, we are happy to announce that we have decided to host the “MEA MVP Open day” MVP networking gathering again this year since we had a successful event last year. You will not only have the chance to meet or see again your peers from the region but use this as an opportunity to enjoy great sessions from our guest speakers and fellow MVPs. Once we have more about the event we will share… Please don’t forget to regularly check our site! Looking forward in seeing you in March!

The French Community Day, called “Journée des Communautés”, is set for the Friday March 20, at the Microsoft France subsidiary. This annual event will gather MVP, MSP and also Regional Director, from France, Belgium and Switzerland. It will also be a great opportunity to meet and connect with Microsoft France employees, especially through a Q/A with the French DX Evangelists team. Discussing, meeting, sharing experience and also having fun are the main goals of this community day!

17 Mar 09:46

Python and Data : SQL Server as a data source for Python applications

by Susan Ibach

python-logo@2xThis post will show you how to use Python to connect to a SQL Server database, save and retrieve data.

I ( @HockeyGeekGirl ) recently recorded some courses with Christopher Harrison ( @GeekTrainer ) on Microsoft Virtual Academy about coding with Python. During that series of courses we explored several different data sources. Sometimes it was difficult to find good code examples and documentation on how to connect to those data SQLServerNoVersionsources with Python. So I have put together this series on Python and Data to help others who may be trying to work with different data sources using Python

This blog post will explain

  • What Python package should I use?
  • Connecting to the database
  • Inserting a row
  • Retrieving a single row
  • Retrieving multiple rows
  • Additional Python resources

The examples in this post are written using CPython 3.4 in Visual Studio and Python Tools for Visual Studio. If you want to use the same tools:

What Python package should I use?

Connecting to SQL Server requires installing a Python package in your code that supports connections to SQL Server. In this post we use pypyodbc.

pypyodbc runs on runs  on PyPy / CPython / Iron Python , Python 3.4 / 3.3 / 3.2 / 2.4 / 2.5 / 2.6 / 2.7 , Win / Linux / Mac , 32 / 64 bit

To install this package in a Visual Studio Python project, create a virtual environment in your solution in Solution Explorer for one of the supported versions of Python

image

Right click your Virtual Environment and select Install Python Package

image

Enter the package name pypyodbc and Select OK.

image

Connecting to the database

In order to connect to the database you use the connect method of the Connection object.

pypyodbc.connect(‘Driver = {drivername};Server=servername; Database=databaseName; uid=username;pwd=password)

  • Driver - identifies the driver you wish to use to connect to the database, the correct driver to use will depend on which database product you are using. Since we are using SQL Server, our driver should be SQL Server
  • Server - identifies the server where SQL Server is running. If you are running SQL Server on the same PC where you are running your Python code the server name will be localhost
  • Database - is the name of your database in SQL Server. I have created a database called testdb.
  • uid and pwd - are the SQL Server username and password that has permissions to log into the database and perform the desired actions. In this example I am logging in with the default sys admin password sa.

In this example we assume you are using Mixed Mode authentication on your SQL Server database instead of Windows authentication/Integrated security. If you are not sure what form of authentication your SQL Server installation is using, check out the MSDN article Change Server Authentication Mode

Here is what that call looks like in my code

import pypyodbc
connection = pypyodbc.connect('Driver={SQL Server};'
                                'Server=localhost;'
                                'Database=testdb;'
                                'uid=sa;pwd=P@ssw0rd')
connection.close()

Inserting a record

In order to insert a record you need to

  • declare a cursor.
  • pass the SQL Statement you wish to execute to the cursor using the execute method.
  • save your changes using the commit method of the connection or cursor

If you need to pass any values to your SQL statement, you can represent those in your SQL statement using a ? then pass in an array containing the values to use for the parameters when you call the execute method of your cursor

In SQL we insert a row into a database with the INSERT statement

INSERT INTO tablename

(columnName1, columndName2, columndName3, …)

VALUES

(value1, value2, value3, …)

For example. If I have a table called customers with the columns customerid, firstname, lastname, city. On my customers table customerid is an IDENTITY column that assigns an id to any new record inserted automatically. Therefore, when I insert a new customer record I don’t need to specify a value for customerid.

INSERT INTO customers

( lastname, firstname, city)

VALUES

(‘Susan’,’Ibach’,’Toronto’)

Here’s a code example that will insert that record into our customers table using Python

import pypyodbc
connection = pypyodbc.connect('Driver={SQL Server};'
                                'Server=localhost;'
                                'Database=testdb;'
                                'uid=sa;pwd=P@ssw0rd')
cursor = connection.cursor()
SQLCommand = ("INSERT INTO Customers "
                 "(firstName, lastName, city) "
                 "VALUES (?,?,?)")
Values = ['Susan','Ibach','Toronto']
cursor.execute(SQLCommand,Values)
connection.commit()
connection.close()

Retrieving a single row

If you want to retrieve a single row from a database table you use the SQL SELECT command.

SELECT columnname1, columnname2, columndname3, …

FROM tablename

WHERE columnnamex = specifiedvalue

for example if I want to retrieve the firstname, lastname and city information for the customer with a customer id of 2 you would use the following SELECT statement

SELECT firstname, lastname, city

FROM customers

WHERE customerid = 2

To execute that command with Python I use a cursor and the execute statement the same way I executed the insert command.  After I execute the command I need to call the fetchone() method of the cursor to populate an array with the values returned by the SELECT statement. The first row of the array will contain the first column specified in the select statement. The second row of the array will contain the second column specified in the select statement and so on.

import pypyodbc
connection = pypyodbc.connect('Driver={SQL Server};'
                                'Server=localhost;'
                                'Database=testdb;'
                                'uid=sa;pwd=P@ssw0rd')
cursor = connection.cursor()
SQLCommand = ("SELECT firstname, lastname, city "
               "FROM customers "
               "WHERE customerid = ?")
Values = [2]
 
cursor.execute(SQLCommand,Values)
 
results = cursor.fetchone()
 
print("Your customer " + results[0] + " " + results[1] + " lives in " + results[2])
 
connection.close()

Retrieving multiple rows

If your select statement will retrieve multiple rows, you can simply move your fetchone() method call into a loop to retrieve all the rows from the command.

import pypyodbc
connection = pypyodbc.connect('Driver={SQL Server};'
                                'Server=localhost;'
                                'Database=testdb;'
                                'uid=sa;pwd=P@ssw0rd')
cursor = connection.cursor()
SQLCommand = ("SELECT customerid, firstname, lastname, city "
               "FROM customers")
 
cursor.execute(SQLCommand)
 
results = cursor.fetchone()
 
while results:
     print ("Your customer " +  str(results[0]) + " " + results[1] + " lives in " + results[2])
     results = cursor.fetchone()
 
connection.close()

Additional Python resources

If you want to learn more about Python check out the learning to code with Python series on Microsoft Virtual Academy
Part 1 - Introduction to Coding with Python
  • Displaying Text
  • String Variables
  • Storing Numbers
  • Working with Dates and Times
  • Making Decisions with Code
  • Complex Decisions with Code
  • Repeating Events
  • Remembering Lists
  • How to Save Information in Files
  • Functions
  • Handling Errors
Part 2 – Introduction to Creating Websites Using Python and Flask
  • Introduction to Flask
  • Creating a Web Interface
  • Data Storage Locations
  • Using Redis
  • Using Redis and Flask on Azure
Part 3 - Python, SQL and Flask : (recorded March 4th, available on demand by mid March at Microsoft Virtual Academy
  • Design of a Flask Application
  • Designing Python Classes
  • Introduction to Relational Databases
  • Connecting to Relational Databases
  • Layouts Using Jinja
  • Introduction to Bootstrap
QuickStart Python and MongoLab
17 Mar 09:44

Windows Core is Windows 10 is Windows Core is Windows 10

by Jerry Nixon

If you want to understand Windows 10, you need to understand some of the underpinnings that make it possible. It’s more than the next version of Windows. There are engineering accomplishments that have been underway for years; they culminate in a new kind of Windows and a significantly advanced approach to enabling apps across devices.

image

Windows is a large body of code. Its complexity and interdependencies imply a kind of immutability. After all, who would want to pull a loose thread on that complex tapestry? Yet thoughtful Microsoft architects and engineers, almost a decade ago, foresaw a looming problem and decided to fix it before it was at their throats. They plotted to refactor Windows into core components.

Code refactoring is the process of restructuring existing computer code – changing the factoring – without changing its external behavior. Refactoring improves nonfunctional attributes of the software. Advantages include improved code readability and reduced complexity to improve source code maintainability, and create a more expressive internal architecture or object model to improve extensibility.

It was as simple as saying, “Build a starship,” and equally as daunting. One piece at a time Windows’ core was taken out and put back in. This remarkable, multi-year work touched every critical system, resulting in more eyes on the code, more tests, more compatibility, and better implementations. It is difficult to overstate the value of this effort, yet it will likely be the most understated.

With Windows 8, Microsoft announced an engineering marvel: all Windows devices would share one OS kernel. It resulted in little fanfare, probably because its complexities were unfathomable by mainstream marketing and tech media. It was the predecessor ultimately crowning Windows 10 with Windows Core – the completion of an unsung, multi-year refactoring effort.

The return of the king

The first benefactors would be the authors of hardware drivers. Instead of fumbling to create multiple drivers for printers, USB devices, and graphics cards for every possible version of Windows – driver authors now have a path toward a universal hardware driver. Though the winner seems to be the authors, the consumer will ultimately benefit the most from this one.

Because of Windows Core, Windows 10 is a modular, lightweight operating system that can be deployed to various devices with a small storage footprint and memory requirement. If this were not true, how could Windows dream of shipping on a Raspberry Pi (for instance)? But, it does. Windows is everywhere, shipping its core components without interdependencies on irrelevant, heavy, device-specific subsystems.

What’s more, the Window desktop is awesome. How does Microsoft deliver this rich functionality? By adding to the Windows Core with desktop-specific capabilities tailored for desktop users. These do not ship to Xbox, as an example, because Xbox has its own console-specific capabilities for to deliver incredible experiences to Xbox users. Yet both share the Windows Core.

Let’s be clear, Windows Core is not a new Windows SKU. Windows Core is the name given to the subset of common Windows code enabling the fundamental features of Windows across devices. The subset upon which features are added to enable a SKU.

Common across every device is Windows Core. This enables a consistent, reliable operating system made modular and delivered without the weight of one device family constraining another. The delicate work of maintaining a reliable code base while continually moving forward across such an implausible timeline might be Windows 10’s magnum opus; one we will likely never discuss again – not adequately.

The man from Snowy River

To the team, I tip my hat.

17 Mar 09:36

Tragedies of the Remote Worker: "Looks like you're the only one on the call"

by Scott Hanselman

You're the only one on the callI'm writing this as I sit alone in a remote meeting room. As a remote worker, this is just one of the tiny, daily paper cuts. To be clear, I like being remote and I wouldn't change it, but some days Being a Remote Worker Sucks.

This is a rant, but if you were remote you'd understand.

You're the only one on the call.

It was nice of them put a link to join the meeting into the invitation, except they never joined the meeting. They've changed their IM status to Do Not Disturb and aren't answering their phones. You're all alone in a virtual room and are now late for a meeting you were originally early for.

When's the next time you're up?

Whenever you are on-site, folks always say "when are you up next?" Seriously. Like I'm just on vacation the other 6 weeks I'm not at the mother ship.

If only there were a global network with cameras and audio that would allow us to have a conversation while I'm away? But, alas, there isn't, so I'll see you again in 6 to 8 weeks.

I'm remote but that doesn't mean I'm not available EVERY WORK DAY.

Fifteen Minutes of "Can you hear me?"

Please. Unmute your damn phone. http://howtounmute.com. Learning how to use your basic VOIP camera and audio is a sign of respect for your remote workers.

You have a Webcam, use it.

You can see each other, but I can't see you. I don't care that you "don't like to use your webcam." We are having a business meeting, turn it on so the remote works can get one of their 5 senses back. Seeing your face is the whole point. It really helps. Bonus points if you adjust your webcam when it's time to see the whiteboard.

Have Empathy - Put yourself in the remote person's shoes

When I came to work here I sent five managers gift-wrapped web cams with a note on how to use them. During my next office visit I found 4 of them opened and shoved off to the side of their desks. If I had a gluten allergy I think you'd be more accommodating. But I don't, I'm a remote worker.

Remote iPad on a Stick - Double Robotics

I'm remote, please add call link to the meeting invite

Thanks for scheduling that meeting. Awesome that you got a room and everything. But I'm going to email you right back and remind you to add a call bridge/goto meeting/lync invite/google hangout. I just need access.

Move closer to the mic

You're in your office talking to me remotely, but not only will you not turn on your camera but you're talking on a speaker phone with your back to me as you spin in your desk chair.

Did the meeting end? Guys? Any one there?

It's so sad when I'm left on the table and you've all left the room. I'm just trapped in the Klingon Phone and you've got feet.

Don't fade away. When someone is remote it's so important to check in as you're closing the meeting.

The Klingon PhoneYour Inability to Deal with Me Remotely

Everyone has some special need. Mine is I'm remote. Your inability to be even slightly flexible to that fact causes me problems literally daily. Remote workers go out of their way to be available.

I'm on Lync, Skype, Slack, Twitter, and my cell phone is published in the company directory.

And you just literally said with a straight face, "I couldn't get ahold of you." O_O

Hearing an Important Conversation...as they hang up

This happens more often than you'd think. The meeting is over and they are hanging up. You can see their hand dropping to hit "End Call" and then someone starts mentioning something TOTALLY IMPORTANT and....dial tone.

Why don't you move up here?

Wow! I never thought of that. After 7 years of working remotely for a dozen reasons, you finally asked the right question! Why don't I just move up there?

Because. Reasons.

What tiny indignities do you deal with as a remote worker? Sound off in the comments.

Related Links


Sponsor: Big thanks to Aspose for sponsoring the blog feed this week! Are you working with Files? Aspose.Total for .NET has all the APIs you need to create, manipulate and convert Microsoft Office documents and many other formats in your applications. Start a free trial today.



© 2015 Scott Hanselman. All rights reserved.