07 Jul 18:48

Half Credit

by Remy Porter

“I didn’t buy ten cylinders of CO₂! Why is PCard telling me I did?”

There were a pile of tickets in Adam’s queue, slightly larger than the proverbial molehill, but nowhere near mountain status either. Several employees saw incorrect charges on their purchasing cards, including one of the VPs. The first suspicion was fraud, but these charges looked legitimate: plane tickets to cities where the company had plants, raw materials purchases from vendors the plants usually did business with, etc. Since a VP was impacted, this confusion needed to be resolved ASAP, which meant Adam was pulling long hours to fix it.
Jean Luc PCard
“PCard” was the name for the purchasing card application. The purpose of the tool was to allow managers and executives validate the purchases made on the purchasing cards, and determine if they really belonged there or should have had a purhase order. It was one of Adam’s nuisance products, simply because it mostly worked fine- but minor issues were viewed as “my hair’s on fire” class emergencies. The only silver lining was that the application was considered “mature”, which meant that in three months, support would move to a much cheaper offshore team.

The bulk of the application’s logic was an SSIS package. BankCo, the bank which provided the purchasing cards, handled all of the billing and payments, but didn’t provide any sort of reporting on purchases. They did provide a steady feed of flat-files. It true enterprise IT fashion, these files ping-ponged across a series of SFTP servers and network shares until they dropped in a folder that the SSIS package monitored.

Adam started with the bank. Over the phone, they assured him that the charges were correct, and that the problem must be in how Adam’s tool processed the data. Adam wasn’t convinced, but he agreed to examine the import job in detail.

The job imported two major files. The first was a record of every purchasing card issued by BankCo:

4916086563859697 Bobsen     Joebob          032015…
4916410586003002 Bobsen     Sallybob        052015…

The first column was the card number, and the remaining columns contained all of the important metadata- holder, expiration date, date issued, etc. The SSIS job just blindly grabbed this data and jammed it into a table called PCARDS. The schema of the table was nearly identical to the flat file, save for an additional column called PC4, which contained the 4-digit suffix of the card number.

The second file was the actual list of transactions:

9697 Office Max             473.75  04272014…
9697 HmtpnInnSuites         578.95  03192014…
3002 CndaGas Co.            1047.20 05162014…

Again, the PTRANS table was nearly identical to this file format. For relational integrity, the 4-digit suffix linked this table to the underlying purchasing card. The fact that transactions keyed off of the last four digits of the card number immediately caught Adam’s eye. He checked the purchasing card table and confirmed that several cards did have the same 4-digit suffix.

Adam called the bank. “You’ve issued our company multiple cards with the same suffix, but when you send us transactions, the only identifier you send us is the last four digits. So we can’t actually tell which card that transaction is for.”

“That’s impossible. There should never be any repeats in the suffix.”

“And yet,” Adam said, “there are. Can you cahnge the file to send us the whole card number, instead?”

“No! Too many companies depend on the format staying the way it is. We’re not going to make a custom format for you.”

“Couldn’t you add a column at the end? Something? Someone else must have this problem!”

The conversation chased itself in circles until it fell over, panting. Adam hung up the phone with zero progress made. The bank was going to do nothing. Stuck, Adam went to his boss, Steve.

“What I was thinking,” Adam said, “was that we could put all of the unclear transactions in a bucket. When users log in, they’ll be prompted to claim any transactions that might be theirs. There could be an escalation path if transactions remain unclaimed-”

“No!”, Steve said. “First, the business units already feel that the application requires too much data entry. Second, they don’t have a budget to add features, only to fix bugs. Third, no one on the business side is going to actually monitor and maintain that. Fix the bug. Don’t invent new functionality.”

Stymied by the bank and his boss, Adam realized that he was going to have to solve this problem on his own. There was no perfect solution, but Adam was feeling… lucky.

Only a handful of people had duplicate suffixes on their p-cards. Most of the time, the cards were used for one off purchases or to resupply raw materials outside of a regular purchase-order. This meant than some users were heavy users of the cards, but most weren’t. It also meant that many of the common transactions would show repeat business with the same vendor.

Armed with that information, Adam was ready to “fix the bug”. He changed the primary key of the PCARDS table to an auto-generated numeric ID. Then, on PTRANS, he removed the foreign key on the suffix, and replaced it with a foreign key to the new ID field. Finally, he did a bit of creative programming in the SSIS package. If there were multiple purchasing cards that could link to the same transaction, the loading job differentiated them this way:

If one and only one card had previous transactions with this vendor, assign the transaction to that card.
In all other cases, assign the transaction to the card which had the most transactions for the month.

Adam slapped a comment on the SSIS package to explain the WTFery he had perpetrated. In the long run, his simple heuristic would almost certainly fail to categorize transactinos correctly, but Adam wasn’t concerned with the “long run”. He just needed it to work for three more months, until the support for the app moved offshore.

[Advertisement] BuildMaster 4.1 has arrived! Check out the new Script Repository feature and see how you can deploy builds from TFS (and other CI) to your own servers, the cloud, and more.

Kevin Walz likes this

07 Jul 17:54

An Unexpected Side-Effect of Adding a Filtered Index

by Paul White

Adding a filtered index can have surprising side-effects on existing queries, even where it seems the new filtered index is completely unrelated. This post looks at an example affecting DELETE statements that results in poor performance and an increased risk of deadlock.

Test Environment

The following table will be used throughout this post:

CREATE TABLE dbo.Data 
(
    RowID       integer IDENTITY NOT NULL, 
    SomeValue   integer NOT NULL,      
    StartDate   date NOT NULL,
    CurrentFlag bit NOT NULL,
    Padding     char(50) NOT NULL DEFAULT REPLICATE('ABCDE', 10),
    CONSTRAINT PK_Data_RowID
        PRIMARY KEY CLUSTERED (RowID)
);

This next statement creates 499,999 rows of sample data:

INSERT dbo.Data WITH (TABLOCKX)
    (SomeValue, StartDate, CurrentFlag)
SELECT
    CONVERT(integer, RAND(n) * 1e6) % 1000,
    DATEADD(DAY, (N.n - 1) % 31, '20140101'),
    CONVERT(bit, 0)
FROM dbo.Numbers AS N
WHERE 
    N.n >= 1 
    AND N.n < 500000;

That uses a Numbers table as a source of consecutive integers from 1 to 499,999. In case you do not have one of those in your test environment, the following code can be used to efficiently create one containing integers from 1 to 1,000,000:

WITH
    N1 AS (SELECT N1.n FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS N1 (n)),
    N2 AS (SELECT L.n FROM N1 AS L CROSS JOIN N1 AS R),
    N3 AS (SELECT L.n FROM N2 AS L CROSS JOIN N2 AS R),
    N4 AS (SELECT L.n FROM N3 AS L CROSS JOIN N2 AS R),
    N AS (SELECT ROW_NUMBER() OVER (ORDER BY n) AS n FROM N4)
SELECT
    -- Destination column type integer NOT NULL
    ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM N
OPTION (MAXDOP 1);
 
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1);

The basis of the later tests will be to delete rows from the test table for a particular StartDate. To make the process of identifying rows to delete more efficient, add this nonclustered index:

CREATE NONCLUSTERED INDEX 
    IX_Data_StartDate
ON dbo.Data 
    (StartDate);

The Sample Data

Once those steps are completed, the sample will look like this:

SELECT TOP (100)
    D.RowID,
    D.SomeValue,
    D.StartDate,
    D.CurrentFlag,
    D.Padding
FROM dbo.Data AS D
ORDER BY
    D.RowID;

The SomeValue column data may be slightly different due to the pseudo-random generation, but this difference is not important. Overall, the sample data contains 16,129 rows for each of the 31 StartDate dates in January 2014:

SELECT 
    D.StartDate, 
    NumRows = COUNT_BIG(*)
FROM dbo.Data AS D
GROUP BY
    D.StartDate
ORDER BY 
    D.StartDate;

Data Distribution

The last step we need to perform to make the data somewhat realistic, is to set the CurrentFlag column to true for the highest RowID for each StartDate. The following script accomplishes this task:

WITH LastRowPerDay AS
(
    SELECT D.CurrentFlag
    FROM dbo.Data AS D
    WHERE D.RowID =
    (
        SELECT MAX(D2.RowID)
        FROM dbo.Data AS D2
        WHERE D2.StartDate = D.StartDate
    )
)
UPDATE LastRowPerDay
SET CurrentFlag = 1;

The execution plan for this update features a Segment-Top combination to efficiently locate the highest RowID per day:

Notice how the execution plan bears little resemblance to the written form of the query. This is a great example of how the optimizer works from the logical SQL specification, rather than implementing the SQL directly. In case you are wondering, the Eager Table Spool in that plan is required for Halloween Protection.

Deleting A Day of Data

Ok, so with the preliminaries completed, the task at hand is to delete rows for a particular StartDate. This is the sort of query you might routinely run on the earliest date in a table, where the data has reached the end of its useful life.

Taking 1 January 2014 as our example, the test delete query is simple:

DELETE dbo.Data
WHERE StartDate = '20140101';

The execution plan is likewise pretty simple, though worth looking at in a bit of detail:

Simple Delete

Plan Analysis

The Index Seek on the far right uses the nonclustered index to find rows for the specified StartDate value. It returns just the RowID values it finds, as the operator tooltip confirms:

Index Seek Properties

If you are wondering how the StartDate index manages to return the RowID, remember that RowID is the unique clustered index for the table, so it is automatically included in the StartDate nonclustered index.

The next operator in the plan is the Clustered Index Delete. This uses the RowID value found by the Index Seek to locate rows to remove.

The final operator in the plan is an Index Delete. This removes rows from the nonclustered index IX_Data_StartDate that are related to the RowID removed by the Clustered Index Delete. To locate these rows in the nonclustered index, the query processor needs the StartDate (the key for the nonclustered index).

Remember the original Index Seek did not return the Start Date, just the RowID. So how does the query processor get the StartDate for the index delete? In this particular case, the optimizer might have noticed that the StartDate value is a constant and optimized it away, but this is not what happened. The answer is that the Clustered Index Delete operator reads the StartDate value for the current row and adds it to the stream. Compare the Output List of the Clustered Index Delete shown below, with that of the Index Seek just above:

Clustered Index Delete Properties

It might seem surprising to see a Delete operator reading data, but this is the way it works. The query processor knows it will have to locate the row in the clustered index in order to delete it, so it might as well defer reading columns needed to maintain nonclustered indexes until that time, if it can.

Adding a Filtered Index

Now imagine someone has a crucial query against this table that is performing badly. The helpful DBA performs an analysis and adds the following filtered index:

CREATE NONCLUSTERED INDEX
    FIX_Data_SomeValue_CurrentFlag
ON dbo.Data (SomeValue)
INCLUDE (CurrentFlag)
WHERE CurrentFlag = 1;

The new filtered index has the desired effect on the problematic query, and everyone is happy. Notice that the new index does not reference the StartDate column at all, so we do not expect it to affect our day-delete query at all.

Deleting a day with the filtered index in place

We can test that expectation by deleting data for a second time:

DELETE dbo.Data
WHERE StartDate = '20140102';

Suddenly, the execution plan has changed to a parallel Clustered Index Scan:

Notice there is no separate Index Delete operator for the new filtered index. The optimizer has chosen to maintain this index inside the Clustered Index Delete operator. This is highlighted in SQL Sentry Plan Explorer as shown above ("+1 non-clustered indexes") with full details in the tooltip:

Nonclustered index details

If the table is large (think data warehouse) this change to a parallel scan might be very significant. What happened to the nice Index Seek on StartDate, and why did a completely unrelated filtered index change things so dramatically?

Finding the Problem

The first clue comes from looking at the properties of the Clustered Index Scan:

Clustered Index Scan Properties

As well as finding RowID values for the Clustered Index Delete operator to delete, this operator is now reading CurrentFlag values. The need for this column is unclear, but it does at least begin to explain the decision to scan: the CurrentFlag column is not part of our StartDate nonclustered index.

We can confirm this by rewriting the delete query to force the use of the StartDate nonclustered index:

DELETE D
FROM dbo.Data AS D 
    WITH (INDEX(IX_Data_StartDate))
WHERE StartDate = '20140103';

The execution plan is closer to its original form, but it now features a Key Lookup:

The Key Lookup properties confirm this operator is retrieving CurrentFlag values:

Key Lookup Properties

You might also have noticed the warning triangles in the last two plans. These are missing index warnings:

Missing Index

This is further confirmation that SQL Server would like to see the CurrentFlag column included in the nonclustered index. The reason for the change to a parallel Clustered Index Scan is now clear: the query processor decides that scanning the table will be cheaper than performing the Key Lookups.

Yes, but why?

This is all very weird. In the original execution plan, SQL Server was able to read extra column data needed to maintain nonclustered indexes at the Clustered Index Delete operator. The CurrentFlag column value is needed to maintain the filtered index, so why does SQL Server not just handle it in the same way?

The short answer is that it can, but only If the filtered index is maintained in a separate Index Delete operator. We can force this for the current query using undocumented trace flag 8790. Without this flag, the optimizer chooses whether to maintain each index in a separate operator or as part of the base table operation.

-- Forced wide update plan
DELETE dbo.Data
WHERE StartDate = '20140105'
OPTION (QUERYTRACEON 8790);

The execution plan is back to seeking the StartDate nonclustered index:

The Index Seek returns just RowID values (no CurrentFlag):

Index Seek Properties

And the Clustered Index Delete reads the columns needed to maintain the nonclustered indexes, including CurrentFlag:

Clustered Index Delete Properties

This data is eagerly written to a table spool, which is the replayed for each index that needs maintaining. Notice also the explicit Filter operator before the Index Delete operator for the filtered index.

Another pattern to watch out for

This problem does not always result in a table scan instead of an index seek. To see an example of this, add another index to the test table:

CREATE NONCLUSTERED INDEX
    IX_Data_SomeValue_CurrentFlag
ON dbo.Data (SomeValue, CurrentFlag);

Note this index is not filtered, and does not involve the StartDate column. Now try a day-delete query again:

DELETE dbo.Data
WHERE StartDate = '20140104';

The optimizer now comes up with this monster:

This query plan has a high surprise factor, but the root cause is the same. The CurrentFlag column is still needed, but now the optimizer chooses an index intersection strategy to get it instead of a table scan. Using the trace flag forces a per-index maintenance plan and sanity is once again restored (the only difference is an extra spool replay to maintain the new index):

Only filtered indexes cause this

This issue only occurs if the optimizer chooses to maintain a filtered index in a Clustered Index Delete operator. Non-filtered indexes are not affected, as the following example shows. The first step is to drop the filtered index:

DROP INDEX FIX_Data_SomeValue_CurrentFlag
ON dbo.Data;

Now we need to write the query in a way that convinces the optimizer to maintain all the indexes in the Clustered Index Delete. My choice for this is to use a variable and a hint to lower the optimizer's row count expectations:

-- All qualifying rows will be deleted
DECLARE @Rows bigint = 9223372036854775807;
 
-- Optimize the plan for deleting 100 rows
DELETE TOP (@Rows)
FROM dbo.Data
OUTPUT
    Deleted.RowID,
    Deleted.SomeValue,
    Deleted.StartDate,
    Deleted.CurrentFlag
WHERE StartDate = '20140106'
OPTION (OPTIMIZE FOR (@Rows = 100));

The execution plan is:

Narrow Plan

Both nonclustered indexes are maintained by the Clustered Index Delete:

Index Maintenance

The Index Seek returns only the RowID:

Index Seek Properties

The columns needed for the index maintenance are retrieved internally by the delete operator; these details are not exposed in show plan output (so the output list of the delete operator would be empty). I added an OUTPUT clause to the query to show the Clustered Index Delete once again returning data it did not receive on its input:

Clustered Index Delete Properties

Final Thoughts

This is a tricky limitation to work around. On the one hand, we generally do not want to use undocumented trace flags in production systems.

The natural 'fix' is to add the columns needed for filtered index maintenance to all nonclustered indexes that might be used to locate rows to delete. This is not a very appealing proposition, from a number of points of view. Another alternative is to just not use filtered indexes at all, but that is hardly ideal either.

My feeling is that the query optimizer ought to consider a per-index maintenance alternative for filtered indexes automatically, but its reasoning appears to be incomplete in this area right now (and based on simple heuristics rather than properly costing per-index/per-row alternatives).

To put some numbers around that statement, the parallel clustered index scan plan chosen by the optimizer came in at 5.5 units in my tests. The same query with the trace flag estimates a cost of 1.4 units. With the third index in place, the parallel index-intersection plan chosen by the optimizer had an estimated cost of 4.9, whereas the trace flag plan came in at 2.7 units (all tests on SQL Server 2014 RTM CU1 build 12.0.2342 under the 120 cardinality estimation model, and with trace flag 4199 enabled).

I regard this as behaviour that should be improved. You can vote to agree or disagree with me on this Connect item.

The post An Unexpected Side-Effect of Adding a Filtered Index appeared first on SQLPerformance.com.

07 Jul 17:50

PolyBase in APS - Yet another SQL over Hadoop solution?

by SQL Server Team

This blog post will highlight PolyBase’s truly unique approach focusing on:

Query capabilities across various heterogeneous data sources – on-premises and cloud (Microsoft Azure) bringing Microsoft data services together forming one complete data platform solution
Total freedom for users with no lockdown, agnostic to the actual Hadoop distribution and/or underlying operating system
Faster insights from all your data in a simple and performing fashion allowing users to leverage their existing tools and SQL scripts

1. Bringing the relational world together with Hadoop & Cloud (Azure)

In the very recent past, various SQL over Hadoop/HDFS solutions have been developed, such as Impala, HAWQ, Stinger, SQL-H, Hadapt to name just a few. While there are clear technical differences between the various solutions, at a high level, they are similar in offering a SQL-like front end over data stored in HDFS.

So, is PolyBase yet another similar solution competing with these approaches? The answer is yes and no. On first glance, PolyBase is a T-SQL front end that allows customers to query data stored in HDFS. However, with the recently announced Analytics Platform System (APS), we have updated PolyBase with new syntax to highlight our extensible approach. With PolyBase, we bring various Microsoft data management services together and allow appliance users to leverage a variety of Azure services. This enables a new class of hybrid scenarios and reflects the evolution of PolyBase to a true multi-data source query engine. It allows users to query their big data – regardless of whether it is stored in an on-premises Hadoop/HDFS cluster, Azure storage, Parallel Data Warehouse, and other relational DBMS systems (offered in a future PolyBase release).

Complete Data Platform with PolyBase as key integrative component

2. Freedom of Choice

Openness

One important key differentiator of PolyBase compared to all of the existing competitive approaches is ‘openness’. We do not force users to decide on a single solution, like some Hadoop providers are pursuing. With PolyBase, you have the freedom to use an HDInsight region as a part of your APS appliance, to query an external Hadoop cluster connected to APS, or to leverage Azure services from your APS appliance (such as HDInsight on Azure).

To achieve this openness, PolyBase offers these three building blocks.

Building blocks for PolyBase

The syntax for using PolyBase is simple and follows familiar T-SQL language constructs.

T-SQL for creating external data sources (Azure, external Hadoop cluster, HDI region)

CREATE EXTERNAL DATA SOURCE Azure_DS WITH (TYPE=Hadoop, LOCATION= 'wasbs://<youraccount>.blob.core.windows.net');
CREATE EXTERNAL DATA SOURCE External_HDP_DS WITH (TYPE=Hadoop, LOCATION='hdfs://…’, JOB_TRACKER_LOCATION='…');
CREATE EXTERNAL DATA SOURCE HDI_DS WITH (TYPE=Hadoop, LOCATION = 'hdfs://…’, JOB_TRACKER_LOCATION = '…’);

T-SQL for creating external file formats (delimited text files and Hive RCFiles)

CREATE EXTERNAL FILE FORMAT DelimText1 WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR =','));
CREATE EXTERNAL FILE FORMAT DelimText2 (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR ='|', USE_TYPE_DEFAULT = TRUE));
CREATE EXTERNAL FILE FORMAT HiveRC WITH (FORMAT_TYPE = RCFILE, SERDE_METHOD = 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe')

T-SQL for creating external tables (for Azure, external Hadoop cluster, HDI region)

CREATE EXTERNAL TABLE [dbo].[Old_SensorData_Azure] (…) WITH (LOCATION='//Sensor_Data/May2009/sensordata.tbl', DATA_SOURCE Azure_DS, FILE_FORMAT = DelimText2, REJECT_TYPE = VALUE, REJECT_VALUE = 100)
CREATE EXTERNAL TABLE [dbo].[SensorData_ExternalHDP] (…) WITH (LOCATION='//Sensor_Data/May2014/sensordata.tbl', DATA_SOURCE = External_HDP_DS, FILE_FORMAT = DelimText1, REJECT_TYPE = VALUE, REJECT_VALUE = 0)
CREATE EXTERNAL TABLE [dbo].[SensorData_HDI] (…) WITH (LOCATION='//Sensor_Data/Year2013', DATA_SOURCE = HDI_DS, FILE_FORMAT = HiveRC)

A user can now create statistics for each of the external tables shown above to improve the query performance. We extended SQL Server’s mature stats framework to work against external tables in the same way it works against regular tables. Statistics are crucial for the PolyBase query engine in order to generate optimal execution plans and to decide when pushing computation into the external data source is beneficial.

Performance

While other SQL over Hadoop solutions (e.g. Impala, Stinger, and HAWQ) have improved, it remains true that they still cannot match the query performance of a mature relational MPP system. With PolyBase, the user can import data in a very simple fashion into PDW (through a CTAS statement, see below), use the fast SQL Server column store technology along with the MPP architecture, or let the PDW/PolyBase query optimizer decide which parts of the query get executed in Hadoop and which parts in PDW. This optimized querying, called split-based query processing, allows parts of the query to be executed as Hadoop MR jobs that are generated on-the-fly completely transparent for the end user. Thereby, the PolyBase query optimizer takes into account parameters such as the spin-up time for MR jobs and the generated statistics to determine the optimal query plan.

In general, if it comes to performance the answer usually is ‘it depends on the actual use case/query’. With PolyBase, the user has total freedom and can leverage capabilities of PDW and/or Hadoop based on their actual needs and application requirements.

PolyBase in APS bridging the gap between the relational world, Hadoop (external or internal) and Azure

The T-SQL statement below will run across all data sources combining structured appliance data with un/semi-structured data in external Hadoop, internal HDInsight region, and Azure (e.g. historical data) –

T-SQL SELECT querying external Hadoop, HDInsight & PDW regions, and Azure

SELECT machine_name, machine.location 
 FROM Machine_Information_PDW, Old_SensorData_Azure, SensorData_HDI, SensorData_ExternalHDP
WHERE Machine_Information_PDW.MachineKey = Old_SensorData_Azure.MachineKey
and Machine_Information_PDW.MachineKey = SensorData_HDI.MachineKey
and Machine_Information_PDW.MachineKey = SensorData_ExternalHDP.MachineKey
and SensorData_HDI.Temperature> 80 and Old_SensorData_Azure.Temperature > 80
and SensorData_ExternalHDP.Temperature > 80

This query example shows how simplicity and performance are combined at the same time. It shows three external tables referring to three different locations plus one regular (distributed) PDW table. While executing the query, the PolyBase/PDW query engine will decide, based on the statistics, whether or not to push computation to the external data source (i.e. Hadoop).

Rewriting & Migrating existing applications

Finally, you may have heard that Hadoop is ‘cheaper’ than more mature MPP DBMS systems. However, what you might not have heard about is the cost associated with rewriting existing applications and ensuring continued tool support. This goes beyond simple demos showing that tool ‘xyz’ works on top of Hadoop/HDFS.

PolyBase does not require you to download and install different drivers. The beauty of our approach is that external tables appear like regular tables in your tool of choice. The information about the external data sources and file formats is abstracted away. Many Hadoop-only solutions are not fully SQL-ANSI compliant and do not support various SQL constructs. With PolyBase, however, you don’t need to rewrite your apps because it uses T-SQL and preserves its semantics. This is specifically relevant when users are coming from a ‘non-Java/non-Hadoop world’. You can explore and visualize your data sets either by using the Microsoft BI solutions (initiated on-premises or through corresponding Azure services) or by using the visualization tool of your choice. PolyBase keeps the user experience the same.

3. Simplified ETL & Fast Insights

It’s already a painful reality that many enterprises store and maintain data in different systems that are optimized for different workloads and applications, respectively. Admins are spending much time moving, organizing, and keeping data in sync. This reality imposes another key challenge which we are address with PolyBase – in addition to querying data in external data sources, a user can achieve a simpler and more performant ETL (extraction, transformation, loading). Different than existing connector technologies, such as SQOOP, a PolyBase user can use T-SQL statements to either import data from external data sources (CTAS) or export data to external data sources (CETAS).

T-SQL CETAS statement to age out Hadoop & PDW data to Azure

CREATE EXTERNAL TABLE Old_Data_2008_Azure
WITH (LOCATION='//Sensor_Data/2008/sensordata.tbl', DATA_SOURCE=Azure_DS, FILE_FORMAT=DelimText2)
AS SELECT T1.* FROM Machine_Information_PDW T1 JOIN SensorData_ExternalHDP T2
ON (T1.MachineKey = T2.MachineKey) WHERE T2.YearMeasured = 2008

Combines data from external Hadoop and PDW sources and stores the results in Azure

Under-the-covers, the PolyBase query engine is not only leveraging the parallelism of an MPP system, it also pushes computation to the external data source to reduce the data volume that needs to be moved. The entire procedure remains totally transparent for the user while ensuring a very fast import & export of data that greatly outperforms any connector technology offered today. With the CTAS statement, a user can import data into the relational PDW region where it stores the data as column store. This way, users can immediately leverage the column store technology in APS without any further action.

T-SQL CTAS statement for importing Hadoop data into PDW

CREATE TABLE Hot_Machines_2011 WITH (Distribution = hash(MachineKey),
CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM SensorData_HDI where SensorData_HDI.YearMeasured = 2011 and SensorData_HDI.Temperature > 150

Combines PolyBase with column store – Imports data from Hadoop into PDW CCI tables

In summary, PolyBase is more than just another T-SQL front end over Hadoop. It has evolved into a key integrative component that allows users to query, in a simple fashion, data stored in heterogeneous data stores. There is no need to maintain separate import/export utilities. PolyBase ensures great performance by leveraging the computation power available in external data sources. Finally, the user has freedom in almost every dimension whether it’s about tuning the system and getting the best performance, choosing their tools of choice to derive valuable insights, and to leverage data assets stored both on-premises and within the Azure data platform.

Watch how APS seamlessly integrates data of all sizes and types here

Learn more about APS here

07 Jul 17:48

Eek! Starting as a Regular Full-Time Employee

by noreply@blogger.com (Jessica M. Moss)

Well, I jumped into the career deep end last week. I started my first-ever job as a regular full-time employee. This week contained a mixture of emotions for me: scared, nervous, but overall - EXCITEMENT! It's been a long road, but I'm very happy with where I've come and where I'm going.

I started with a consulting company right out of college. Back then, graduating with a computer science degree pretty much meant you knew a little bit about a lot, but had no deep knowledge in any area. So working on a little bit of everything to gain some knowledge in a consulting company seemed like a perfect fit! I still remember during my interview, the interviewee said "You'll be assigned a project over the weekend, buy a book on the technology, and by the time you arrive on Monday, you need to be an expert". While we all know that won't exactly work, I was able to learn SO much by hopping around and trying different technologies, industries, and systems.

While trying out different technologies, I fell in love with data warehousing. To that end, I then tried my hand at my own business, still consulting, but with more training and mentoring. I absolutely loved this kind of work, but wasn't a huge fan of the full-time travel or all of the paperwork ;) I then went back to consulting for a couple of local companies, where I was able to work with many organizations with data, without the travel or paperwork.

So... after ten years of consulting, why did I give it up? Truthfully, I'm ready for a change. While it's fun to bounce from organization to organization, there's always that dichotomy between "the client" and "the consultant". And I don't blame either side for thinking that way. Consultants provide a great service for organizations that don't have the resources (time, people, money, etc.) to do it themselves. At the end of the day, a consultant's number one duty is to their company, whether that is looking for additional sales or is eventually leaving the project to move onto the next.

All that being said, I'm off to start my new adventure! I'm a data architect for a firm based in Richmond that handles chemicals and manufacturing. I get to help them consolidate their data in a data warehouse and hopefully learn a bunch in the process... I also need to learn how to be a regular full-time employee. Do you have any advice for me?

23 Jun 18:38

SQL Server 2014 Backup Encryption

by MVP Award Program

Editor’s note: The following post was written by SQL Server MVP Nicolas Souquet

Introduction

With the verly large amounts of data companies store today, databases are very likely to contain critical and confidential data, making database backups encryption an essential database engine feature. It is easily foreseeable that such feature will soon be required to pass accreditation audits.

Raising the security bar through releases, SQL Server 2014 now enables to encrypt database backups natively. It has been a long awaited feature, as the options to cover this need were rather cumbersome :

- Transparent Data Encryption, released with SQL Server 2008, but it implies some CPU overhead, and the backup file compression ratio is then very low;

- Disk encryption, such as BitLocker, but ciphering large disk volumes is a long operation;

- Third-party software.

In this paper, we will expose how backup encryption works, before walking through the steps needed to encrypt databases backups with T-SQL and with Maintenance Plans. We will finally detail the actions required to restore a ciphered database backup. Backing up to a file and to Azure will be addressed.

How does backup encryption work ?

There is no encryption without keys, and backup encryption is no exception : we first need to create a Database Master Key (DMK) of the master system database, and a certificate. The DMK is a symmetric key, and is unique to each database in each SQL Server instance : we cannot restore an encrypted backup file on a distinct or re-installed SQL Server instance without the master system database DMK (or an encrypted database without its DMK).

The Database Master Key is encrypted by the AES 256 algorithm, using the Service Master Key, which is also a symmetric key. It is encrypted based on the SQL Server service account credentials and the Windows Data Protection API machine key. The Service Master Key is unique per SQL Server instance, and created during SQL Server installation. It is stored in the master system database and in the user database, so as to enable its automatic decryption (cf. sys.symmetric_keys system view).

Finally, a Certificate contains a public key that is digitally signed, and may contain a private key. This private key is protected by the DMK. While SQL Server can generate IETF X.509v3-compliant certificates, it also allows to use certificates generated by third-parties (cf. sys.certificates system view).

We can summarize the encryption hierarchy levels with the following diagram :

The backup process works on a data page basis : it copies data pages from the data files into the backup file. Since SQL Server 2008, we can compress database backups. This feature is supported by an algorithm similar to the ones behind file compression software : it factorizes patterns of data found in the data pages. Whether the backup data pages are compressed or not, SQL Server 2014 is able to encrypt these pages with the AES 128, AES 192, AES 256 or Triple DES algorithms.

Encrypting a database backup in T-SQL

In this example, we will use the Contoso demo database. The code needed to take an encrypted database backup is following the steps described in the previous section. First of all, we must backup the service master key :

USE master

-- Saving the service master key

-- Ideally, the resulting file should be stored on a distinct,

-- secure machine with restricted access

BACKUP SERVICE MASTER KEY

TO FILE = 'E:\SQLServerBackup\SQL2014_service_master_key.key'

ENCRYPTION BY PASSWORD = 'a$tr0n9#!P@$$w0r2_f0rDBb@ckupEncryption';

We then have to create the master key, which again is straightforward. The password used to encrypt the master key can be different than the one used to encrypt the service master key backup file.

-- Creating a new database master key

CREATE MASTER KEY

ENCRYPTION BY PASSWORD = 'a$tr0n9#!P@$$w0r2_f0rDBb@ckupEncryption';

-- Saving the database master key

-- Ideally, the resulting file should be stored on a distinct,

-- secure machine with restricted access

BACKUP MASTER KEY

TO FILE = 'E:\SQLServerBackup\SQL2014_Contoso_master_key.key'

ENCRYPTION BY PASSWORD = 'a$tr0n9#!P@$$w0r2_f0rDBb@ckupEncryption';

We can confirm the creation of the master key by examining the sys.symmetric_keys system view : it now shows one, named ##MS_DatabaseMasterKey##. Next, we need to create the certificate, which is achieved with a single instruction :

USE master

-- Creating the certificate

CREATE CERTIFICATE Contoso_BackupEncryptionWithSQLServer2014

WITH SUBJECT = 'SQL Server 2014 Backup Encryption demo with Contoso';

Similarly, interrogating the sys.certificates system view in the master database context will now return a supplementary row, and the column pvt_key_encryption_type_desc indicates ENCRYPTED_BY_MASTER_KEY.

As for every key, we want to save the certificate : this requires to specify a private key file. It clearly implies that one does not work without the other. Here again, a password is required to cipher the certificate file, and optionally a different one can be defined for the decryption. This password is the public key of the certificate.

-- Saving the certificate

-- Ideally, the resulting file should be stored on a distinct,

-- secure machine with restricted access

BACKUP CERTIFICATE Contoso_BackupEncryptionWithSQLServer2014

TO FILE = 'E:\SQLServerBackup\SQL2014_Contoso_certificate.cer'

WITH PRIVATE KEY

(

FILE = 'E:\SQLServerBackup\SQL2014_Contoso_certificate_private_key.key'

, ENCRYPTION BY PASSWORD = 'a$tr0n9#!P@$$w0r2_f0rDBb@ckupEncryption'

);

At this point, the groundwork is laid for taking encrypted backups. We have 4 files that we need to keep in a secure and distinct storage area :

Backing up a database and ciphering the resulting file only requires us to choose an encryption algorithm and to specify the certificate we want to use :

-- Backing up the ContosoRetailDW demo database, with encryption

BACKUP DATABASE ContosoRetailDW

TO DISK = 'E:\SQLServerBackup\ContosoRetailDW\ContosoRetailDW_FULL_ENCRYPTED.bak'

WITH INIT, CHECKSUM, COMPRESSION, STATS = 1

, ENCRYPTION

(

ALGORITHM = AES_256

, SERVER CERTIFICATE = Contoso_BackupEncryptionWithSQLServer2014

)

Restoring a database from an encrypted backup file

Before we walk thorugh the restoration process, we must keep in mind that SQL Server 2014 introduces native backup encryption. Thus, restoring a database from a natively encrypted backup file on a version of SQL Server that is anterior to SQL Server 2014 is not supported.

Restoring the database on the same SQL Server 2014 instance

Restoring a database from an encrypted backup file on the same SQL Server 2014 instance as the one on which its backup has been taken is operated as usual : all the keys and the certificate are already registered in the master database. Consequently, they are opened automatically when needed for decryption.

RESTORE DATABASE ContosoRetailDW_RestoredFromEncryptedBackupFile

FROM DISK = 'E:\SQLServerBackup\ContosoRetailDW\ContosoRetailDW_FULL_ENCRYPTED.bak'

WITH MOVE 'ContosoRetailDW2.0' TO 'E:\Contoso\ContosoRetailDW_EncrypteBackup_data.mdf'

, MOVE 'ContosoRetailDW2.0_log' TO 'E:\Contoso\ContosoRetailDW_EncrypteBackup_log.ldf'

, STATS = 1

Restoring a database from an encrypted backup file on another SQL Server 2014 instance

This operation requires to :

Restore the Database Master Key from its backup file
Create the Certificate from its backup file, which involves the private key file.

The restoration process is as follows :

USE master

-- Restoring the master key on the target SQL Server instance

-- from its backup file

RESTORE MASTER KEY

FROM FILE = 'E:\SQLServerBackup\SQL2014_Contoso_master_key.key'

-- the password that was used to encrypt the master key in the source SQL Server instance

DECRYPTION BY PASSWORD = 'a$tr0n9#!P@$$w0r2_f0rDBb@ckupEncryption'

-- the password with which we want to encrpyt it on the target SQL Server instance

-- the service key is different on the source and the target SQL Server instances

ENCRYPTION BY PASSWORD = '@n0therStrongP@$$w0r2!';

-- Since the master key is not registered in the master database

-- We need to open it in order to decrypt it

-- It stays opened for the session duration

OPEN MASTER KEY

DECRYPTION BY PASSWORD = '@n0therStrongP@$$w0r2!'

-- Restoring the certificate by the private key

-- the password is the one we used to encrypt it on the source SQL Server instance

CREATE CERTIFICATE Contoso_BackupEncryptionWithSQLServer2014

FROM FILE = 'E:\SQLServerBackup\SQL2014_Contoso_certificate.cer'

WITH PRIVATE KEY

(

FILE = 'E:\SQLServerBackup\SQL2014_Contoso_certificate_private_key.key'

, DECRYPTION BY PASSWORD = 'a$tr0n9#!P@$$w0r2_f0rDBb@ckupEncryption'

)

-- Getting the list of files in the backup file

-- The instruction is identical for a non-encrypted database

RESTORE FILELISTONLY

FROM DISK = 'E:\SQLServerBackup\ContosoRetailDW\ContosoRetailDW_FULL_ENCRYPTED.bak'

-- Finally restoring the database

-- The instruction is identical for a non-encrypted database

RESTORE DATABASE ContosoRetailDW

FROM DISK = 'E:\SQLServerBackup\ContosoRetailDW\ContosoRetailDW_FULL_ENCRYPTED.bak'

WITH MOVE 'ContosoRetailDW2.0' TO 'D:\SQL Server\\ContosoRetailDW_EncryptBackup_data.mdf'

, MOVE 'ContosoRetailDW2.0_log' TO 'D:\SQL Server\ContosoRetailDW_EncryptBackup_log.ldf'

, STATS = 1

-- Closing the master key

CLOSE MASTER KEY

In the case of a restoration automation, we may not want to restore and open the database master key each time.

This is achievable by running the following command, after having restored and opened the master key in the same session :

ALTER MASTER KEY REGENERATE

WITH ENCRYPTION BY PASSWORD = 'aN0th€r$tr0n9#!P@$$w0r2_f0rDBb@ckupEncryption'

Taking encrypted backups with Maintenance Plans

The dialog window to customize maintenance plan backup tasks had been revamped in SQL Server 2014, and reveals the option to take encrypted backups. It lets you pick the encryption algorithm among the four available.

However, there is not a way yet to manually create a certificate or an asymmetric key from this interface, or from SQL Server Management Studio Object Explorer : it has to be created following the steps described earlier.

Backing up and restoring in Azure

SQL Server 2014 enables to backup databases to Microsoft’s cloud platform : Azure (More exactly since SQL Server 2012 SP1 CU4). This nice feature is now enabling companies :

- to store their backups outside their data center, covering the case in which it would face a complete outage;

- to save their data in a geographically remote location at a low cost compared to investing in a secondary data center.

Prerequisites

In order to backup a database in Azure, we need to have :

- An Azure subscription

- Access to Azure by the Azure Portal (or via PowerShell)

- An Azure Storage Account created

- A storage container created. In this example, it is named “backup”.

Instructions to save a database to Azure

As the documentation reveals, instead of specifying the TAPE or DISK clause of the BACKUP instruction, we use URL, which makes the implementation of this feature remarkably simple.

Prior to this, we need to create a Credential : it is a security element that contains a user name and password for a user to access resources external to SQL Server. In our case, the Credential will contain the name of the storage account, and the primary key to this storage account.

In the print-screen below, we find the name of the storage account under the NAME column :

Clicking the MANAGE ACCESS KEYS button brings a popup window, in which we find the storage primary access key. So to create the credential with which we will gain access to Azure storage, we write :

USE master;

CREATE CREDENTIAL SQLServer2014EncryptedBackupInTheCloud

WITH IDENTITY = 'account_name'

, SECRET = 'AzureStoragePrimaryAccessKey_ItIsNormallyALongString'

Before issuing the BACKUP instruction, we need to get the URL that we will specify. We can get it by successively clicking on the storage account name, and on the CONTAINERS tab :

This is the occasion to confirm the name of the storage container : “backup”, which is also visible in the URL.

Finally, the instruction is :

BACKUP DATABASE ContosoRetailDW

TO URL = 'http://account_name.blob.core.windows.net/backup/ContosoRetailDW.bak'

WITH INIT, COMPRESSION, CHECKSUM, STATS = 1

, CREDENTIAL = 'SQLServer2014EncryptedBackupInTheCloud'

, ENCRYPTION

(

ALGORITHM = AES_256

, SERVER CERTIFICATE = Contoso_BackupEncryptionWithSQLServer2014

)

Once the backup is complete, we can see the file in the container :

Restoring a database from Azure

Similar to the restoration instruction sequence we have seen for file backups, it is required to have :

- the database master key and the certificate restored;

- the credential existing; if necessary, we can recreate it based on the data gathered from the Azure portal website.

The command batch is then :

OPEN MASTER KEY

DECRYPTION BY PASSWORD = '@n0therStrongP@$$w0r2!'

RESTORE DATABASE ContosoRetailDW

FROM URL = 'http://account_name.blob.core.windows.net/backup/ContosoRetailDW.bak'

WITH CREDENTIAL = 'SQLServer2014EncryptedBackupInTheCloud'

Performance Considerations

SQL Server 2014 offers four algorithms to encrypt backup files, and ciphering is an intensive CPU activity. It is also possible that the various algorithms have a different impact on the size of the resulting backup files.

To measure the impact of the various algorithm on CPU usage, we have taken a Performance Monitor trace with the Processor / % Processor Time counter only, and taken a backup of the same database with each of the four algorithms, plus a non-encrypted backup. All backups were compressed. The batch of commands is :

BACKUP DATABASE [ContosoRetailDW]

TO DISK = 'E:\SQLServerBackup\ContosoRetailDW_NO_ENCRYPTION.bak'

WITH INIT, COMPRESSION, CHECKSUM

WAITFOR DELAY '00:00:10'

BACKUP DATABASE [ContosoRetailDW]

TO DISK = 'E:\SQLServerBackup\ContosoRetailDW_AES_128.bak'

WITH INIT, COMPRESSION, CHECKSUM , ENCRYPTION

(

ALGORITHM = AES_256

, SERVER CERTIFICATE = Contoso_BackupEncryptionWithSQLServer2014

)

WAITFOR DELAY '00:00:10'

BACKUP DATABASE [ContosoRetailDW]

TO DISK = 'E:\SQLServerBackup\ContosoRetailDW_AES_192.bak'

WITH INIT, COMPRESSION, CHECKSUM , ENCRYPTION

(

ALGORITHM = AES_256

, SERVER CERTIFICATE = Contoso_BackupEncryptionWithSQLServer2014

)

WAITFOR DELAY '00:00:10'

BACKUP DATABASE [ContosoRetailDW]

TO DISK = 'E:\SQLServerBackup\ContosoRetailDW_AES_256.bak'

WITH INIT, COMPRESSION, CHECKSUM , ENCRYPTION

(

ALGORITHM = AES_256

, SERVER CERTIFICATE = Contoso_BackupEncryptionWithSQLServer2014

)

WAITFOR DELAY '00:00:10'

BACKUP DATABASE [ContosoRetailDW]

TO DISK = 'E:\SQLServerBackup\ContosoRetailDW_TRIPLE_DES_3KEY.bak'

WITH INIT, COMPRESSION, CHECKSUM , ENCRYPTION

(

ALGORITHM = AES_256

, SERVER CERTIFICATE = Contoso_BackupEncryptionWithSQLServer2014

)

We wait 10 seconds between each backup, so as for the CPU usage to show bumps on the graph between each backup execution. Below is the commented Perfmon trace graph. It shows that while the AES algorithms have not much impact on the CPU usage compared to each other and to a backup taken with no encryption, the TRIPLE DES algorithm uses 25-30% more CPU than the AES ones.

After having integrated the Perfmon trace file into a table, and associated the CPU usage to the encryption algorithm used based on the time, we computed the median of the CPU usage, which produces the result below :

Finally, we can look at the backup duration and the backup file size. A few columns have been added to the system tables in the msdb system database, to expose backup encryption :

- dbo.backupset

key_algorithm
encryptor_thumbprint
encryptor_type

- dbo.backupmediaset

is_encrypted

Below is a query that retrieves the backup history, specifying the columns from the dbo.backupset table listed above :

;WITH

CTE AS

(

SELECT database_name

, last_backup_start_date_time

, physical_device_name

, backup_size / (1024 * 1024) AS backup_size_MB

, compressed_backup_size / (1024 * 1024) AS compressed_backup_size_MB

, CAST(backup_size / compressed_backup_size AS decimal(5,2)) AS compression_ratio

, key_algorithm

, encryptor_thumbprint

, encryptor_type

FROM (

SELECT S.database_name

, MAX(S.backup_start_date) AS last_backup_start_date_time

, MF.physical_device_name

, S.backup_size

, S.compressed_backup_size

, S.key_algorithm

, S.encryptor_thumbprint

, S.encryptor_type

FROM msdb.dbo.backupset AS S

INNER JOIN msdb.dbo.backupmediafamily AS MF

ON S.media_set_id = MF.media_set_id

GROUP BY S.database_name

, S.type

, MF.physical_device_name

, S.backup_size

, S.compressed_backup_size

, S.key_algorithm

, S.encryptor_thumbprint

, S.encryptor_type

) AS BH

)

SELECT D.name AS database_name

, C.last_backup_start_date_time

, C.physical_device_name

, CAST(C.backup_size_MB AS decimal(15,2)) AS backup_size_MB

, CAST(C.compressed_backup_size_MB AS decimal(15,2)) AS compressed_backup_size_MB

, compression_ratio

, C.key_algorithm

, C.encryptor_thumbprint

, C.encryptor_type

FROM sys.databases AS D

INNER JOIN CTE AS C

ON D.name = C.database_name

ORDER BY C.last_backup_start_date_time

We obtain the following result, which clearly demonstrates that the encryption has no impact on the compression of the backup file size :

About the author

Nicolas Souquet is a Bangkok-based SQL Server database architect, and also a writer and speaker. He has centered his career on SQL Server databases modeling and performance tuning. With 10 years of industry experience, he has been working on all versions of SQL Server, ranging from 2000 to 2014, and has been continuously involved in several OLTP data-intensive environments. He has been rewarded SQL Server MVP in 2009, 2012 and 2013 for publishing articles on his blog (http://blog.developpez.com/elsuket/) and animating SQL Server conferences in Bangkok. He is now gaining interest in Business Intelligence implementations with SQL Server. Finally, he has participated in writing a book on SQL Server 2014 in French with two other SQL Server MVPs, Frédéric Brouard and David Barbarin, which will be released in July 2014.

About MVP Monday

The MVP Monday Series is created by Melissa Travers. In this series we work to provide readers with a guest post from an MVP every Monday. Melissa is a Community Program Manager, formerly known as MVP Lead, for Messaging and Collaboration (Exchange, Lync, Office 365 and SharePoint) and Microsoft Dynamics in the US. She began her career at Microsoft as an Exchange Support Engineer and has been working with the technical community in some capacity for almost a decade. In her spare time she enjoys going to the gym, shopping for handbags, watching period and fantasy dramas, and spending time with her children and miniature Dachshund. Melissa lives in North Carolina and works out of the Microsoft Charlotte office.

29 May 17:00

The End of the Mentoring Experiment

by Andy Warren

Back in April 2011 Steve Jones and I launched The Mentoring Experiment. We had some ideas and some goals, but truly it was an experiment to see what we might do to drive more mentoring and mentoring relationships. We had some success at the experiments we tried, but in the end we found that it is just hard to do – mentoring relationships are absolutely about chemistry and that’s just hard to find, and our very unsophisticated matching didn’t come close to that. What we did find was that for many (even most) people a sounding board/coach/very temporary mentor can help them figure out a challenging issue quickly (a few hours of talking). Career direction, career changes, salary, we saw a lot of common themes and I think our mentors did pretty well at helping the people we matched them with.

It was a hard project because we didn’t have a good way to show results or velocity and that matters, especially when you’re the experimenter and trying to measure success and stay engaged. There’s something powerful about seeing SQLSaturday #318 – it feels like success/velocity. It’s not the same to talk about mentoring matchup #318. It’s even harder because Steve and I are both fairly intense about privacy matters and mentoring falls into that zone – we couldn’t (and wouldn’t) just list everyone who wanted a mentor or who was paired up with a mentor. We tried to figure out a way to inject mentoring into the culture more. I’ve said for a while that the one thing I’d go back and change about SQLSaturday would be to have more focus on networking, and I’d add mentoring to that, but it’s really hard to go back and change a formula, especially when it works. Mentoring is several steps harder than networking, and networking isn’t easy. We just couldn’t find a way that didn’t feel forced and didn’t have the potential to create more problems than it fixed. Think about trying to drive mentoring in a chapter meeting with 10-20 people. It’s just hard.

It’s also suffered from lack of attention over the past six months. Steve and I both have other things going on and as we prioritized, this kept getting pushed down. We wanted to continue, but clearly we were struggling, and we decided it was time to wrap this up rather than let it linger, and maybe we should have done it sooner in hindsight. Not all experiments lead to big victories. This one didn’t. At least for us, but maybe it did change the conversation some and we raised the profile of the topic. We’re grateful to our friends that answered our call for mentors and to those who applied to be mentees, taking a chance on us and an idea, and hopefully they all learned something from the experience.

It isn’t the end of our interest in mentoring. We’ll still talk about it, we’re both still interesting in helping people where we can as we’ve always done – quietly, behind the scenes, in a pay-it-forward kind of way. We’d be thrilled to see someone take up the cause, in whatever shape that might take. If you can find a way to build on what we’ve done, or use it for inspiration, or do it entirely differently and make it work better – do it! There’s a lot to be said for trying to do some good in the world.

29 May 16:59

Comparing Windows Azure Virtual Machine Performance

by Glenn Berry

Unless you have been making a concerted effort to ignore it, you may have heard that Microsoft would really like for you to move much of your SQL Server database infrastructure into a Microsoft data center, whether you go to an Azure SQL Database (which I recently discussed here), or whether you host it on a Windows Azure Virtual Machine. Microsoft calls these persistent virtual machines “compute instances”, and they have two main tiers to choose from, which include the Basic Compute Tier and the Standard Compute Tier. They describe these two tiers as:

Basic Compute Tier: This new tier of compute instances is similar in configuration to the Standard tier with lower prices. These instances do not include load balancer and auto-scaling. They are well-suited for single instance production applications, development workloads, test servers and batch processing applications that might not require these features. The basic compute tier is currently available only for the General Purpose Instances. These instances range from Basic A0 to Basic A4.

Standard Compute Tier: This tier of compute instances provides an optimal set of compute, memory and IO resources for running a wide array of applications. These instances include both auto-scaling and load balancing capabilities at no additional cost. The standard compute tier is available across General Purpose, Memory Intensive and Compute Intensive instances. These instances range from Standard A0 to Standard A7.

There are several important advantages to hosting your SQL Server infrastructure on a Windows Azure Virtual Machine. First, you have no capital costs for storage or hardware, along with no ongoing maintenance of the storage or hardware. Second, you have no OS or SQL Server license costs (when you use a SQL image from the standard Azure VM gallery). Third, you can create a new Azure VM that already has SQL Server installed in a few minutes (even though it may take a little longer to completely configure the OS and the SQL Server instance to your exact requirements). Going forward, it will still be up to you to install Windows and SQL Server updates, but you won’t have to worry about things like firmware, BIOS, or driver updates.

If you want to use Windows Azure Virtual Machines to host all or part of your SQL Server infrastructure, you should be aware of the current pricing details that are available for the specific Azure data center that you want to host your virtual machines, since pricing can vary across different Microsoft data centers. Microsoft currently has 13 different Windows Azure virtual machine sizes, as detailed in their Virtual Machine and Cloud Service Sizes for Azure page. Microsoft reduced the hourly pricing for the memory intensive instances (Standard A5, Standard A6, and Standard A7) by 18% in most of their data centers on May 1, 2014, and the pricing shown in Table 1 reflects those new, lower prices.

The Single-Core Score and the Multi-Core Score in the two right-hand columns of Table 1 are the average scores that I observed using the 32-bit version of the Geekbench 3.05 processor and memory benchmark on a sample Windows Server 2012 R2 Datacenter VM in the East U.S. Data Center. These scores may or may not be representative of what you will see.

VM Size	CPU Cores	Memory (GB)	Hourly Cost	Monthly Cost	Single-Core Score	Multi-Core Score
Basic A0	1 (shared)	0.75	$0.018	~$14	507	498
Basic A1	1	1.75	$0.074	~$56	679	670
Basic A2	2	3.50	$0.148	~$111	709	1,358
Basic A3	4	7.00	$0.296	~$221	717	2,472
Basic A4	8	14.00	$0.592	~$441	724	4,042
Standard A0	1 (shared)	0.75	$0.020	~$15	492	502
Standard A1	1	1.75	$0.090	~$67	1,068	1,083
Standard A2	2	3.50	$0.180	~$134	1,069	2,002
Standard A3	4	7.00	$0.360	~$268	1,070	3,593
Standard A4	8	14.00	$0.720	~$536	1,094	6,446
Standard A5	2	14.00	$0.330	~$246	1,080	2,026
Standard A6	4	28.00	$0.660	~$492	1,080	3,686
Standard A7	8	56.00	$1.320	~$983	1,056	6,185

Table 1: Selected Virtual Machine Specifications for Windows Azure in the East U.S. Data Center

Currently, Microsoft has nine data centers that can host new persistent virtual machines, which include East U.S., West U.S., Brazil South (Preview), North Europe, West Europe, East Asia, Southeast Asia, Japan West, and Japan East. According to Microsoft, “A1 is the smallest size recommended for production workloads”, and you should select “a virtual machine with 4 or 8 CPU cores when using SQL Server Enterprise Edition”. One useful, if somewhat dated reference for running SQL Server on a Windows Azure Virtual Machine is the Performance Guidance for SQL Server in Windows Azure Virtual Machines that was published in June, 2013.

Windows Azure Virtual Machine Characteristics

When you look at the CPU properties on the Performance tab in Windows Server 2012 R2 Task Manager (in Figure 1 and Figure 2), you will notice that it reports that it is using a relatively old, 45nm AMD Opteron 4171 HE processor, running at a speed of 2.1GHz. This particular six-core processor was introduced in June of 2010, as part of the two-socket Lisbon family. The HE suffix means that it is a “low-powered” energy efficient model which is not a good choice for SQL Server usage, since it gives up a significant amount of performance for a relatively small amount of reduced energy usage. After doing some digging, I have been told that this processor is a special OEM processor for Microsoft data centers.

Figure 1: Basic A0 Virtual Machine in East U.S. Data Center

Figure 2: Standard A7 Virtual Machine in East U.S. Data Center

The other big issue with this processor besides its age and relatively poor single-threaded performance is the fact that it only has six physical cores. This is a problem with the Basic A4, Standard A4, and Standard A7 VM sizes, which have two NUMA nodes and eight total physical cores. This would mean that a VM of that size would cross a NUMA node on the underlying physical host, which is not a good idea for memory performance. I have a hard time believing that Microsoft would do this on purpose. I also have a hard time believing that every single Azure VM in every single data center that I have tried so far is using the exact same elderly AMD processor. It is fairly common knowledge that Microsoft has at least three different generations of hardware (Gen 1, Gen 2, and Gen 3) that they have used so far in their Azure data centers. After some more inquiries, I have discovered that this AMD Opteron 4171 HE processor is an Azure Gen 2 processor.

You can browse the Geekbench 3 online database of uploaded benchmark results, looking for systems using the AMD Opteron 4171 HE processor here. You may notice that every single result for this processor seems to be for a Microsoft Virtual Machine, which is also quite curious. Windows Server 2012 R2 Task Manager is reporting the L1 cache as “N/A” and not even listing the L2 and L3 cache sizes on these Azure VMs. Another curious piece of evidence is the fact that the Standard Instances have about 50% higher Geekbench 3 scores than the equivalent Basic Instances when they have the exact same total processor core counts and memory sizes, for both the Single-Core score and Multi-Core score. This much of a variance does not make any sense if the underlying host machine is actually using the same processor.

Summary

All of this evidence initially led me to the conclusion that Microsoft was probably obscuring the actual processor in the host machine. I thought they might be doing this to try to prevent people from purposely provisioning multiple VMs until they happen to get a VM is running on newer, faster, host hardware. It turns out that Microsoft is not quite that clever. I have been assured that Microsoft does not alter the identity of the CPU in an Azure VM. There are newer Azure Gen 3 processors that you may get in an Azure VM, as you provision new VMs in the future. Another possible reason for my results was that they are likely using some sort of governance to limit VM performance to a reliable, uniform level, regardless of the underlying host hardware, so that they can host more VMs on less hardware over time. This would be a smart course of action for an IaaS hoster.

The relatively low Geekbench 3.05 scores (see Figure 3) for even the largest Azure VMs means that you are giving up a significant amount of processor and memory performance compared to an equivalent physical two-socket server with the same number of processor cores and memory.

Figure 3: 32-bit Geekbench 3.05 Results for Standard A4 Virtual Machine in East U.S. Data Center

Many SQL Server workloads will run perfectly well with this level of processor and memory performance, albeit a little slower than you may be used to. If you factor in the SQL Server 2014 Enterprise Edition license savings from an eight-core machine, plus the capex for a modest, two-socket server and its associated storage, you could afford to run a Standard A7 virtual machine 24×7 for about five to six years. Given that sort of ROI, I can see many organizations making the economic decision to move at least a portion of their SQL Server infrastructure to Azure Virtual Machines. As long as your workload can run on a 56GB or smaller VM, and as long as having less CPU and memory performance than a typical recent vintage laptop is also acceptable, this is a rational course of action. Microsoft recently announced the availability of larger, much faster A8 and A9 VM Compute Intensive Instances, that use Intel Xeon E5-2670 processors. This will be a huge improvement in performance over the Azure Gen 2 processors.

I’ll be taking a look at I/O performance in Azure Virtual Machines in an upcoming article.

The post Comparing Windows Azure Virtual Machine Performance appeared first on SQLPerformance.com.

29 May 10:26

CodeSOD: Finding Meaning

by Bruce Johnson

Ray was not just a recent addition to the development team. She was also a recent grad just getting started in her career. As a result, she was eager to glean as much knowledge as she could from any source she could find. She poured over all of the documentation provided to her by her team lead. She read the wiki (because that's the storehouse of communal knowledge for every team), reviewed the code and examined the diagrams of the data models used within the application.

All was going swimmingly...until she came across the following.

She hunkered down, trying to figure out what going on. The names of the tables really didn't help out much. I mean, sure. There is a Header table. With fifteen variations on the fSID field. And there are a whole bunch of SetSizes tables. Each of which is...er...nicely numbered. Because the distinction between tblSetSizes_13 and tblSetSizes_6 is readily apparent. Right???

And to top it off, notice how the names of the SetSizes tables have been aligned with the fSID fields. So that fSID1 lines up with the tblSetSizes_2. All in the name of avoiding confusion by the developer.

After spending enough time to ensure that the purpose of the diagram wasn't something blinding obvious (or opaquely obvious), she walked over to her team lead's desk.

"Ahhhh. I see you found it.", he said.

"Found it??"

"Yes. My masterpiece. I spent more than a few hours on that diagram getting everything defined and lined up properly"

Ray had a confused look on her face.

"I even had to set the size of the tables so they would all fit on the screen."

"Set...the size??"

"Yes. That's why I named them like I did."

"I don't think I understand. What is this structure for?"

"It's not for anything. It's art."

Ray paused. "Art?"

"Yes. I like the way it looked. The symmetry. The layout and contrasting colors. I like it, so, I kept the diagram around to preserve it".

"Okay. Thanks. I guess", said Ray.

"No problem. Come back any time you need to have something explained."

Ray retreated back to her desk. Afraid of what she might find, Ray didn't really have the enthusiasm to dig into more of the project code. Or database diagrams. But she did start to think about how long she would have to stay in this job before it wouldn't look bad on her resume if she left. Or maybe she should just start looking now. By using a little artistic license, she could just pretend this job never actually happened.

[Advertisement] BuildMaster 4.1 has arrived! Check out the new Script Repository feature and see how you can deploy builds from TFS (and other CI) to your own servers, the cloud, and more.

byron lewis likes this

29 May 10:24

Report: Verizon Claimed Public Utility Status To Get Government Perks

by samzenpus

An anonymous reader writes "Research for the Public Utility Law Project (PULP) has been released which details 'how Verizon deliberately moves back and forth between regulatory regimes, classifying its infrastructure either like a heavily regulated telephone network or a deregulated information service depending on its needs. The chicanery has allowed Verizon to raise telephone rates, all the while missing commitments for high-speed internet deployment' (PDF). In short, Verizon pushed for the government to give it common carrier privileges under Title II in order to build out its fiber network with tax-payer money. Result: increased service rates on telephone users to subsidize Verizon's 'infrastructure investment.' When it comes to regulations on Verizon's fiber network, however, Verizon has been pushing the government to classify its services as that of information only — i.e., beyond Title II. Verizon has made about $4.4 billion in additional revenue in New York City alone, 'money that's funneled directly from a Title II service to an array of services that currently lie beyond Title II's reach.' And it's all legal. An attorney at advocacy group Public Knowledge said it best: 'To expect that you can come in and use public infrastructure and funds to build a network and then be free of any regulation is absurd....When Verizon itself is describing these activities as a Title II common carrier, how can the FCC look at broadband internet and continue acting as though it's not a telecommunication network?'"

Read more of this story at Slashdot.

29 May 08:07

Microsoft Demos Real-Time Translation Over Skype

by samzenpus

Z80xxc! (1111479) writes "Today at the first annual Code Conference, Microsoft demonstrated its new real-time translation in Skype publicly for the first time. Gurdeep Pall, Microsoft's VP of Skype and Lync, compares the technology to Star Trek's Universal Translator. During the demonstration, Pall converses in English with a coworker in Germany who is speaking German. 'Skype Translator results from decades of work by the industry, years of work by our researchers, and now is being developed jointly by the Skype and Microsoft Translator teams. The demo showed near real-time audio translation from English to German and vice versa, combining Skype voice and IM technologies with Microsoft Translator, and neural network-based speech recognition.'"

Read more of this story at Slashdot.

29 May 08:06

Webinar Follow-Up: Leveraging Cloud for the Modern SQL Developer

Presenter: Jason Strate

In the past, developing a database was a simple affair where you built and deployed the database for a server down the hall in your datacenter, or in some cases to a server under someone's desk. While those good ole days were a much more simple time, they were quite limiting in how a database could be deployed and there were much fewer options for ensuring stability and performance. Today, SQL Developers have many more options for deployment including the services provided through Windows Azure. In this session, we'll dive into what you need to know to start developing databases with Windows Azure and discuss the value that cloud delivery can provide to your database development.

Click here to view the full webinar!

Follow-Up Questions:

Q: Can you also create that BACPAC from within Visual Studio?

Yes, provided I am understanding the questions correctly. You can use SQL Server Data Tools, within VS, to create BACPACs

Q: If you want to learn SQL Azure, how does the pricing work? I was a little confused with how charges are incurred, I assume that it's not by each query you execute, etc..

For SQL Database, pricing is based on the size of the database and the network throughput that is incurred. The compute per query isn't calculated, just the throughput of data after. You can find more information here - http://azure.microsoft.com/en-us/pricing/details/sql-database/

As far as getting started goes, if you have an MSDN account, you can actually get lower that then published rates.

Q: How do you troubleshoot problems on the azure db, like querying the DMVs and Extended events, where do you save the file or what other option is there?

Tracing similar to on-premise databases is not available yet. There are DMVs though that can provide a lot of what is needed.

Q: I have SS 2012 on my machine, when i Create a project in Data Tool I am just seeing SQL Server 2005 & SQL Server 2008 under Project Version

You have to download SQL Server Data Tools seperately from the Visual Studio and SQL Server installs. The tools can be found here - http://msdn.microsoft.com/en-us/data/tools.aspx

Q: Wouldn't you have to migrate from Cloud 2012 to Cloud 2014 to XXX?

No, changes are introduced across the SQL Database SaaS platform on a regular basis. The idea is that as you own and use the service, it is continuously improving.

29 May 08:02

Working with Temporal Data in SQL Server

by Dejan Sarka

My third Pluralsight course, Working with Temporal Data in SQL Server, is published. I am really proud on the second part of the course, where I discuss optimization of temporal queries. This was a nearly impossible task for decades. First solutions appeared only lately. I present all together six solutions (and one more that is not a solution), and I invented four of them. http://pluralsight.com/training/Courses/TableOfContents/working-with-temporal-data-sql-server

29 May 08:02

Architecture and Benefits of VMware vSphere Virtualization- Part - I

by Sam Lucido

In today’s world many technocrats, DBAs, CIOs and CTOs are very concerned about the expenses bill on hardware procurements and also on the optimal , consolidated and cost-effective usage of the same. I will try to get inside the problem with the help virtualization as a solution and try to build a series of blogs on virtualization (with special reference to vBlock and Oracle) . In this particular blog I will try to address the basic concepts of virtualization and its comparison with the traditional physical server architecture.

Virtualization Traditionally, operating systems and software run on a physical server. Many challenges exist in running a large number of physical servers in a typical datacenter. To run this type of infrastructure may not be efficient and cost effective in the long run. To plan and spend for the maintenance cost of this type of very large infrastructure (square footage, rack space, power, cooling, cabling, and server provisioning) are some of the problems that IT staff and their managements are addressing on a daily basis.

Typically, there exists a one to one correspondence between a physical computer and the software that it runs. This relationship leaves most of the resources of the computers hugely idle and underutilized, leaving between only 5–15 percent (approx.) of physical server capacity in use. The cost of the space and power required to house, run and keep these systems cool can be expensive.

It is very tedious to provision physical servers as it is a time consuming process. In nonvirtualized environments time is required to procure new hardware, and get it installed in the datacenter, install and patch an operating system and finally install and configure the required applications on the same and that can take a huge time duration to get going. This process also includes many other tasks to integrate the system into the infrastructure. For example, configuring databases, servers, firewall rules, enabling switch ports and provisioning storage. Below figure shows us a basic landscape of virtualization.

Benefits The benefits of using the virtual Box vis-a-vis Physical box can be summed up in the following chart.

In physical environments, the OS is installed on the top of the physical hardware and while upgrading ,we need to ensure that the device drivers are in right version and installed properly as per the requirements and latest compatibility matrix. If there are incompatibilities, there may be some adverse implications in terms of business cost ,time and personnel.

Virtualizing these systems save on this cost because virtual machines are 100 percent software. The virtual machine is a set of files. A virtual machine uses standardized virtual device drivers. The hardware can be upgraded without change to the virtual machine.

Architecture The difference between the Physical and the virtual architecture can be depicted in the below diagram:-

The graphics shown above illustrate the differences between a virtualized and a nonvirtualized host. In traditional architectures, the operating system interacts directly with the installed hardware. It schedules processes to run, allocates memory to applications, sends and receives data on network interfaces and reads from and writes to attached storage devices. In comparison, a virtualized host interacts with installed hardware through a thin layer of software called the virtualization layer or hypervisor. The hypervisor provides physical hardware resources dynamically to virtual machines as needed to support the operation of the virtual machines. The hypervisor allows virtual machines to operate with a degree of independence from the underlying physical hardware. For example, a virtual machine can be moved from one physical host to another. Also, its virtual disks can be moved from one type of storage to another without affecting the functioning of the virtual machine.

Virtualization is the panacea for many problems that related to the CPU, memory, and networking and resources bottlenecks. Virtualization is a technology that decouples physical hardware from a computer operating system and allows us to consolidate and run multiple workloads as virtual machines on a single computer. In short ,a virtual machine is a computer that is created by software that enables us to use all the computer resources in a shared manner and enables us to run like a physical computer which runs an operating system and applications. Each virtual machine contains its own virtual hardware, including a virtual CPU, memory, hard disk, and network interface card, which look like physical hardware to the operating systems and applications.

Scope for Virtualization

There are multiple benefits/scope for virtualization as shown below :-

CPU Virtualization
Memory Virtualization
Networking Virtualization
File System Virtualization
Server resources sharing or Virtualization

In the next blog I will discuss the architecture of vSphere and its allied features in greater detail. In the future blogs, I will try to take take a deep dive into the various aspects of virtualization with special reference to vSphere and oracle. Gradually I will highlight why Oracle should be used in virtualized environment like vSphere which leverages the performance of Oracle Database and also reduces the TCO of the DB in terms of licensing costs.

Further Suggested Readings : Best Practices for Virtualizing Your Oracle Database – Datastores

28 May 22:17

TrueCrypt Website Says To Switch To BitLocker

by Soulskill

Several readers sent word that the website for TrueCrypt, the popular disk encryption system, says that development has ended, and Windows users should switch to BitLocker. A notice on the site reads, "WARNING: Using TrueCrypt is not secure as it may contain unfixed security issues. ... You should migrate any data encrypted by TrueCrypt to encrypted disks or virtual disk images supported on your platform." It includes a link to a new version of TrueCrypt, 7.2, and provides instructions on how to migrate to BitLocker. Many users are skeptical of a site defacement, and there's been no corroborating post or communication from the maintainers. However, the binaries appear to be signed with the same GPG key that the TrueCrypt Foundation used for previous releases. A source code diff of the two versions has been posted, and the new release appears to simply remove much of what the software was designed to do. It also warns users away from relying on it for security. (The people doing an audit of TrueCrypt had promised a 'big announcement' soon, but that was coincidental.) Security experts are warning to avoid the new version until the situation can be verified.

Read more of this story at Slashdot.

28 May 22:16

Extended Events In Azure SQL Database. For real.

I’ve just returned home after teaching two weeks of SQLskills Immersion Events. Total post-event exhaustion for two days. Trying to produce some nice demos on the new SQL Azure Database tiers and services for my SQL Azure Database A-Z preconference ssession at SQLTuneIn. And then this happens…

About a year or so ago, after attending a TechEd talk on XEvents in Azure SQL Database (ASD, the PaaS offering), I looked for and found metadata and blogged about it. The feature never seemed to come to fruition and after a while, I stopped looking.

This morning, I received email from Egbert Schagen, a person I’d met in my conference travels. He let me know that XEvents now appeared to work in ASD. They appeared in the SSMS object explorer. Egbert said he’d seen no announcement for it, and by web search he came across my old blog entry. He thought I might like to know. And he was right.

And so they DO work. I tried this in an Standard S1 Edition database I had and Egbert said they also worked in Web/Business edition. As with almost everything ASD, these event sessions exist at a database level, rather than in master.

There are 12 pre-configured event sessions. It doesn’t look like you can create your own. The sessions are:
azure_xe_activity_tracking*
azure_xe_errors_warnings
azure_xe_object_ddl
azure_xe_post_query_detail
azure_xe_pre_query_detail
azure_xe_query
azure_xe_query_batch*
azure_xe_query_detail*
azure_xe_query_execution*
azure_xe_query_waits*
azure_xe_query_waits_detail*
azure_xe_waits

The event sessions with an asterisk can be turned on and you can “Watch Live Data” in SSMS. The rest of them look like they require a credential because they write to Azure blob storage. More on that as I try and figure through this with the metadata. It appears you can only run one event session at a time. The second concurrent one always produces a “system is currently busy” message on attempting to start it.

I’m not sure what the “official” status is, but for now it looks like you *can* trace in Azure SQL Database. Great! Thanks, ASD developers. Thanks, Egbert.

@bobbeauch

The post Extended Events In Azure SQL Database. For real. appeared first on Bob Beauchemin.

28 May 20:23

Extended Events In Azure SQL Database. For real.

by Bob Beauchemin

UPDATE: Last month (Sept 2015), Extended Events (not just the metadata that I wrote about) are officially available in Azure SQL Database!! For more information, here’s the main documentation article.

I’m not sure what the “official” status is, but for now it looks like you *can* trace in Azure SQL Database. Great! Thanks, ASD developers. Thanks, Egbert.

@bobbeauch

The post Extended Events In Azure SQL Database. For real. appeared first on Bob Beauchemin.

28 May 12:46

SQL Server Change Data Capture Application

by Ahmad Osama

I have written a SQL Server Change Data Capture GUI interface to facilitate for CDC operations. The project is hosted here https://sqlcdcapp.codeplex.com/. Feel free to download, review and suggest changes. You can even download the source code and implement things too.

This blog describes how to download, install and use the application.

Installation

The application doesn't require installation. Download and extract the zip file. Execute SQLCDCApp.exe to get started.

Enable CDC on a database

To enable CDC on a database, first connect to SQL server and list all databases as shown below.

Click on button “Get Databases” to list available databases as shown below.

Select a database to enable CDC and click “Enable CDC” button. Similarly, to disable CDC on a database, select it from the grid and click “Disable CDC”. Multiple databases can be selected to enable/disable CDC.

Enable CDC on a table

Click “Get Tables” button shown in above image to list tables for selected databases as shown below.

Enable cdc on table

Select table/tables from the grid and click “Capture Table” to enable CDC or “Disable CDC” to disable CDC on selected tables with default values. Specify custom options if required.

View Change Data

To view change data, select a CDC enabled table from the grid and click “Get Changed Data” button. A new window will open as shown below.

View change records

Select the captured instance to view change data, then select the type of data to view by checking on of the radio buttons. Click “GO” button display data as shown below.

change data

Export Change Data

The data can be exported to

- A comma delimited (.txt/.csv) file.

- A table in a database

Export data to a file

Select “Export to file” option and click “Export” button. A dialog box will appear as shown below

export to file

Select the file name to save data to and then click “Save”. The data will be written to selected file.

Export data to database

Select “Export to database” and connect to the server to export data to. Select the database and table to export data to. Select “Create New” in the table combo box to create a new table as shown below.

export to database

A new window will open. Provide the new table name and click “GO” button as shown below.

New table name

Once done, click “Export” button. This will create the new table in the selected database and will populate it with the change data.

Cheers!!!

Ahmad

@_ahmadosama

If you like this post do like us on facebook.

All our events: http://www.sqlservergeeks.com/default-category/events

Our spam-free, 24x7 moderated FaceBook group: www.facebook.com/groups/theSQLgeeks

Our FaceBook page: www.facebook.com/SQLServerGeeks

Our awesome gallery: http://www.facebook.com/SQLServerGeeks/photos_albums

And a must watch, our video - http://www.youtube.com/watch?v=EAFhDwXxlzM

28 May 12:43

Detecting Hyper-Threading state

by jchang

To interpret performance counters and execution statistics correctly, it is necessary to know state of Hyper-Threading (on or off). In principle, at low overall CPU utilization, for non-parallel execution plans, it should not matter whether HT is enabled or not. Of course, DBA life is never that simple (see my other blogs on HT). The state of HT does matter at high overall utilization and in parallel execution plans depending on the DOP. SQL Server does seem to try to allocate threads on distinct...(read more)

28 May 12:38

New Middleware Promises Dramatically Higher Speeds, Lower Power Draw For SSDs

by timothy

mrspoonsi (2955715) writes "A breakthrough has been made in SSD technology that could mean drastic performance increases due to the overcoming of one of the major issues in the memory type. Currently, data cannot be directly overwritten onto the NAND chips used in the devices. Files must be written to a clean area of the drive whilst the old area is formatted. This eventually causes fragmented data and lowers the drive's life and performance over time. However, a Japanese team at Chuo University have finally overcome the issue that is as old as the technology itself. Officially unveiled at the 2014 IEEE International Memory Workshop in Taipei, the researchers have written a brand new middleware for the drives that controls how the data is written to and stored on the device. Their new version utilizes what they call a 'logical block address scrambler' which effectively prevents data being written to a new 'page' on the device unless it is absolutely required. Instead, it is placed in a block to be erased and consolidated in the next sweep. This means significantly less behind-the-scenes file copying that results in increased performance from idle."

Read more of this story at Slashdot.

28 May 12:38

Microsoft Office Mix: No-Teacher-Left-Behind Course Authoring

by timothy

theodp (442580) writes "While they aim to democratize learning, the Massive Open Online Course (MOOC) movement has, for the most part, oddly left K-12 teachers out of the online content creation business. ZDNet's Simon Bisson reports on Office Mix, Microsoft's new PowerPoint plug-in and associated cloud service, which Bisson says makes it easy to create and distribute compelling educational content (screenshots). GeekWire's Frank Catalano also makes an interesting case for why Office Mix's choice of PowerPoint, "the poster child for delivering boring presentations in non-interactive settings," could still be a disrupter in the online content creation space. By the way, MOOC.org, the collaboration of edX and Google which also aims to help "teachers easily build and host courses for the world to take," is slated to go live in the first half of 2014. It'll be interesting to see how MOOC.org's authoring tools differ from Google Research's Course Builder effort."

Read more of this story at Slashdot.

28 May 12:37

Ask Slashdot: Tech Customers Forced Into Supporting Each Other?

by timothy

An anonymous reader writes "Has anyone else noticed the trend towards 'community forums' where customers are basically being recruited to solve the issues of other customers while the companies selling the products causing the issues sit back and take a passive role in the process? Granted, sometimes the companies' employees play an active part in the forums and provide some value-add by contributing crucial, and often undocumented, knowledge that solves the problem in a timely fashion. Unfortunately, that isn't always the case, and this leaves customers with no visibility into whether or not their problems are being addressed, and, if they are, when they might expect to receive assistance. This is bad enough when dealing with consumer electronics that cost up to a couple of hundred of dollars, but it's completely unacceptable when dealing with proprietary design tool vendors that are charging several thousand dollars for software licenses for tools that are the only option if a customer doesn't want to drop an order of magnitude more money to go with 3rd party tools (e.g., Synopsys). Who do you think are the worst offenders of this downloading of support onto the backs of the customers themselves, and what can be done about it?"

Read more of this story at Slashdot.

28 May 12:31

Bug In DOS-Based Voting Machines Disrupts Belgian Election

by samzenpus

jfruh (300774) writes "In 20 cantons in Belgium's Flanders region, voting machines are x86 PCs from the DOS era, with two serial ports, a parallel port, a paltry 1 megabyte of RAM and a 3.5-inch disk drive used to load the voting software from a bootable DOS disk. A software bug in those machines is slowing the release of the results from yesterday's election, in which voters chose members of the regional, national, and European parliaments. The remaining voting machines, which are Linux-based, are unaffected, as were voters in the French-speaking Wallonia region of the country, most of whom use paper ballots."

Read more of this story at Slashdot.

28 May 12:28

Book Review: Hacking Point of Sale

by samzenpus

benrothke (2577567) writes "The only negative thing to say about Hacking Point of Sale: Payment Application Secrets, Threats, and Solutions is its title. A cursory look at it may lead the reader that this is a book for a script kiddie, when it is in fact a necessary read for anyone involved with payment systems. The book provides a wealth of information that is completely pragmatic and actionable. The problem is, as the book notes in many places, that one is constantly patching a system that is inherently flawed and broken." Keep reading for the rest of Ben's review.

Read more of this story at Slashdot.

28 May 12:24

Why You Shouldn't Use Spreadsheets For Important Work

by Soulskill

An anonymous reader writes "Computer science professor Daniel Lemire explains why spreadsheets shouldn't be used for important work, especially where dedicated software could do a better job. His post comes in response to evaluations of a new economics tome by Thomas Piketty, a book that is likely to be influential for years to come. Lemire writes, 'Unfortunately, like too many people, Piketty used spreadsheets instead of writing sane software. On the plus side, he published his code on the negative side, it appears that Piketty's code contains mistakes, fudging and other problems. ... Simply put, spreadsheets are good for quick and dirty work, but they are not designed for serious and reliable work. ... Spreadsheets make code review difficult. The code is hidden away in dozens if not hundreds of little cells If you are not reviewing your code carefully and if you make it difficult for others to review it, how do expect it to be reliable?'"

Read more of this story at Slashdot.

28 May 12:13

Tales from the Interview: Case-Closed Captioned

by Dan Adams-Jacobson

Computers do... funny things to people. With alarming frequency, capable, adult professionals of all stripes just can't seem to overcome the most basic obstacles involved in the everyday practice of computing. When your grandmother struggles to understand that sending email doesn't require postage stamps, it's kind of endearing; when it's your co-workers, who grew up with computers just like you did and use them every day to do every aspect of their jobs, it's... less so. That so many people still fail so frequently leads one to wonder: can everyone really be this feckless, or are we, the creators of the software they struggle to master, just incapable of creating something usable?

Sometimes, the answer is emphatically the former.

Tazza worked for a company that subtitled live television, and, as such, had staff working on a round-the-clock basis. While the twenty four-hour support desk dealt with on-air problems and operational issues, Tazza was the main point of escalation for issues that were harder to classify. Bright and early one morning at 4 AM, Tazza received a call from the support desk.

"Hi Taz," Jordan yawned down the line, "how's your morning?"

"Have coffee, will travel," Tazza said. "Guess your shift's almost over?"

"Thank God. So, look, Mary's trying to prep subs for the 5 AM slot, but she says Titleist isn't working since it was upgraded to the new version yesterday."

"Hmm, that's weird. I'm pretty sure it wasn't upgraded yesterday. I'll take a look, thanks. Get some sleep."

Tazza fired up LogMeIn and tried to connect to Mary's laptop. It seemed to be offline, which was pretty serious if she was trying to finish subs for the next time slot. Taz dialed her number.

"Hi, Mary. I can't connect to your laptop, which might explain the problems you're having with Titleist. Can you access the internet?"

"Hmm..." Mary murmured, her mouse clicking in the background. "How do I check that?"

"Just go to google.com, or any other site that you don't already have open."

"Okay, that's working! Thanks!"

"Wait, wait... I haven't done anything yet. And I still can't see you online. Let me send you an IM."

After a moment, Tazza heard the tell-tale ding! from Mary's side. "Ok, I got it!"

"Okay. Under the Lync chat window there's a couple of buttons. Can you click on the monitor button and share your desktop with me so I can see what's up?"

"I don't have those buttons," Mary said. "Someone changed the Lync software on my laptop and removed all the other stuff, too."

"Uh..." Tazza stared at the screen blankly for a minute, trying to work out what could have actually happened. The subtitlers' laptops were pretty mission-critical; no one would just update or remove a bunch of software from them at random. It was time for a sanity check.

"Ok, Mary, thanks for bearing with me. Next, please click the Start button, type C-M-D and press Enter. In the black window that appears, type "HOSTNAME" and press Enter. What does it say?"

"Let's see... Oh! LAP-JENNYSMITH. What's that mean?"

Tazza took a deep breath. "It means you're using Jenny's laptop, not yours. Jenny isn't a subtitler, so she doesn't have the same software on her computer. Where's your laptop?"

"No, see, this is my laptop, just someone changed everything on it."

"Please humour me, Mary. Do you take your laptop home or do you leave it in the office?"

"I leave it here, but—"

"Okay, so where in the office did you leave it yesterday?"

"Over on the other desk..."

"Would you just take a quick look over there, please?"

"Oh! There it is!"

Taz sighed, already contemplating his second coffee break. "Would you like me to stay on the line to see if it's okay?"

"No, thanks! I'll be fine. I better go get the five o'clock titles done!"

"Have a nice day, Mary."

[Advertisement] BuildMaster 4.1 has arrived! Check out the new Script Repository feature and see how you can deploy builds from TFS (and other CI) to your own servers, the cloud, and more.

byron lewis, Ronald.phillips likes this

28 May 12:11

PHP Next Generation

by Soulskill

An anonymous reader writes "The PHP Group has put up a post about the future of PHP. They say, 'Over the last year, some research into the possibility of introducing JIT compilation capabilities to PHP has been conducted. During this research, the realization was made that in order to achieve optimal performance from PHP, some internal API's should be changed. This necessitated the birth of the phpng branch, initially authored by Dmitry Stogov, Xinchen Hui, and Nikita Popov. This branch does not include JIT capabilities, but rather seeks to solve those problems that prohibit the current, and any future implementation of a JIT capable executor achieving optimal performance by improving memory usage and cleaning up some core API's. By making these improvements, the phpng branch gives us a considerable performance gain in real world applications, for example a 20% increase in throughput for Wordpress. The door may well now be open for a JIT capable compiler that can perform as we expect, but it's necessary to say that these changes stand strong on their own, without requiring a JIT capable compiler in the future to validate them.'"

Read more of this story at Slashdot.

28 May 11:34

Aligning Skills with Real World Business Benefits

by MVP Award Program

Editor’s note: The following post was written by SharePoint MVP Steve Smith

This article deals with the subject of ‘Aligning skills with real world business benefits’ and why continued investment in technology skillsets is even more important today than it ever was. We will be looking at the importance of practical training alongside real world skills and aligning it all with qualifications and how a company as well an individual or team would benefit in both the short and long term after being through this process.

After many years in the education space especially around Microsoft products there is no doubt that the products themselves have evolved into much more complex platforms. The knowledge and skills that we developed in the nineties and early 2000 certainly provided a solid foundation for the core skills needed in today’s world. It is also more important than ever to have our end users trained and supported on these products, historically IT would deploy the product such as email and our users would get little if no formal training on how to use the products themselves. As the product range has evolved so has the benefit to the users and the business, but only if they all know how to get the most from it.

But what if you are fairly new to this brave new world of Microsoft technologies what skills am I talking about and why are they still so important?

To read the full post, click here

28 May 11:31

vCenter Operations Manager Troubleshooting

by Jason Gaudreau

vCenter Operations Manager 5.8 (vC Ops) is a tool from VMware that collects massive amounts of data from a variety of sources. You might wonder what is the difference between the metrics collected from ESXi by vCenter server and the metrics collected by vCenter Operations Manager? VMware vCenter shows you a lot of different metrics for the past hour in 20 second increments, if you start to research information further back in time it reveals less metrics and the data points become more averaged out. For example, the past day has 20 second intervals, the past week shows 30 minute intervals, and the past month shows two-hour intervals in vCenter. A two-hour long average can hide a lot of peaks and valleys, it might be good for some general capacity planning, but it isn't good if you are trying to troubleshoot the root cause of an application performance issue. It is simply to large an interval, you need a much finer data sampling. That is where vC Ops comes in!

vCenter Operations Manager does three things differently. It keeps all the metrics, it keeps five minute intervals, and it keeps them for six months. You can retroactively go back and tell an application owner if he was having a performance problem at a certain time. It is going to give you a lot more confidence about providing relevant information to your IT business partners.

Another fearture with vCenter Operations Manager is dynamic thresholds. vC Ops takes the collected metrics and it looks for patterns over time. It can then make predictions for the future with these patterns that help you proactively maintain your environment.

Under the Operations tab there are four badges, which include Health, Workload, Anomalies, and Faults for every resource. Health is nothing more than the aggregate of the other three badges and is scored 0 to 100, with 100 being the best score.

Faults show alert information, such as the link state down. When an event triggered alert occurs, it does not automatically clear by design, this is to ensure that someone looks at the state and takes corrective actions so it doesn't happen in the future.

The Workload badge shows the demand for the physical resources. Every resource has some level of capacity. For example, CPU capacity is the number of cores you have times the amount of GHz per core. The virtual machines on the host are using a portion of that capacity, the workload is what fraction of that capacity is being demanded. Workload is demand divided by the capacity as a percentage.

Workload shows us four resource dimensions with CPU, memory, disk I/O, and network I/O. In this case, our host is providing the resources and the demand is being generated by the virtual machines. There is some virtualization overhead for ESXi, but predominately the demand is coming from the guests. We can see that CPU workload shows us 4%, which means that of the total capacity (number of cores times the amount of GHz per core), 4% of the resources are being asked by the virtual machines on the host.

Looking at the host, we can immediately see that the host isn't stressed. The Workload badge is calculated by the most utilized resource, in this instance memory is the most heavily used resource with 29%, which provides the Workload badge of 29% above.

The important thing to remember about workload is that it is the demand for the resources not the usage. Think of it this way, demand is how much of the resource was wanted and the usage is what was delivered. This is fundamental to performance troubleshooting, most performance problems are caused because the demand is higher than what is delivered. The gap between demand and usage can be considered contention. The more contention, the less likely your end user is happy with the performance. And ultimately, the definition of a performance problem is the user is unhappy and the definition of their not being a performance problem is the user is happy.

If we had a virtual machine that ran a batch process and was at 100% utilization for 5 hours during the night is that a performance problem? If the user received the expected output the next day, there isn't a performance problem. Adding additional resources may speed up the process, but it isn't really necessary. Performance problems are very subjective.

Virtual machines have approximately 250 to 300 metrics. But, none of those metrics predict if the user is happy. A small fraction of those metrics can help establish if the user is happy, and those tend to be the metrics that measure contention; which show that the virtual machine demanded a certain amount of resources and the amount of resources that were provided.

In the Workload box, it displays the resource utilization for the last 5 minutes for CPU, memory, disk I/O, and network I/O. Just above, we can see the host is bound by Memory and it shows us the utilization for the past 6 hours. There are some peaks and valleys, but there is plenty of capacity remaining over the past 6 hours. If we want to look further back, then we need to switch from the Detail view to the All Metrics view.

We are going to finish off our post by talking about the last badge, Anomalies. Remember earlier, I mentioned that vCenter Operations Manager collects the metrics and then tries to find patterns in the numbers to produce dynamic thresholds. It tries to predict the resource utilization for the next day. If that value is outside the predicted range, then that becomes an anomaly. If you have a very high anomaly count, it doesn't mean there is a problem, but it might be a good indicator that there is something wrong based on past behavior. For instance, if you had a virtual machine that had been sitting around for months for an application to be rolled out to your business users, your anomaly score is going to be low until that application load has become available for production. The workload may still be relatively low, but because it is outside of the predicted range the anomaly score is going to be high.

Every night the vCenter Operations Manager analytic virtual machine dynamically calculates the thresholds, it looks at every metrics for every virtual machine and predicts what it is going to look like for tomorrow. Every single one of the predictions is different. That is why the analytic virtual machine in vC Ops requires so much disk capacity, disk I/O, CPU, and memory. Every night it is analyzing millions of data points.

If we look on the right side of the Operations details, we can see the metrics that are not behaving normally. We can see the expected ranges at the specific time frames. In the picture above, the Data Store Capacity Contention (%) is below what is expected. The red arrows pointing down shows values below the dynamic threshold, and the red arrows pointing up shows the values above the dynamic threshold. The yellow light-bulb shows that it is an active anomaly.

If your Workload and your Anomaly badges are both red, then you definitely want to track down the issue.

vCenter Operations Manager can be a bit daunting at first, but if you take the time to learn the tool it can help you move from reactive problem solving to proactive maintenance with its built in analytics engine.

23 May 19:37

Why Your Abstract Wasn't Selected

by AllenMWhite

We're anxiously waiting to hear from PASS which sessions were selected for the 2014 Summit in November. It's a big job to go through the hundreds of submissions and pick the sessions that will appeal to the people who will be paying over $1,000 to attend...(read more)

23 May 18:42

Failed To Lock Virtual Machine’s Configuration

by Marlon Ribunal

[Updated]

I rebuilt my Hyper-V virtual machines / virtual network by deleting the old ones and building new machines on a portable external hard drive. So after a while of not starting up the host laptop, I could not start any of the machines. First, I noticed that the state of the machines on the Hyper-V Manager now says “Off-critical.” I am getting this connection error: Failed to lock virtual machine’s configuration. Unspecified error (0x80004005). I checked the Hyper-V Setting and I found out that the folder path of the Hard Disks and Virtual Machines files has been changed to its default setting. I’m not sure why but that’s exactly what happened: (This has nothing to do with this error. See correction at the bottom of this post.) ~~Needless to say, I just needed to change that to the appropriate path, which is my external hard drive:~~ The real issue in this case is the Hyper-V not recognizing the external hard drive because of the fact that the Hyper-V Virtual Machine Management Service was started before the external hard drive was attached via the USB.

I could have simply restarted the service or the laptop after I attached the external hard drive, and that would have fixed the issue. ~~After the change, I just performed a quick a restart and that fixed the connection issue:~~

CORRECTION: The path setting described above has nothing to do with this connection error. The Hyper-V Setting path, illustrated above, has nothing to do with the Hyper-V configuration not finding the correct path. This path setting tells Hyper-V where you want the succeeding virtual machines stored. For more information on this “Failed to lock virtual machine’s configuration” error, please read: Why is my virtual machine “Off-Critical”?

Mrdenny

Shared posts

Test Environment

The Sample Data

Deleting A Day of Data

Plan Analysis

Adding a Filtered Index

Deleting a day with the filtered index in place

Finding the Problem

Yes, but why?

Another pattern to watch out for

Only filtered indexes cause this

Final Thoughts

1. Bringing the relational world together with Hadoop & Cloud (Azure)

2. Freedom of Choice

3. Simplified ETL & Fast Insights

How does backup encryption work ?

Encrypting a database backup in T-SQL

Restoring a database from an encrypted backup file

Taking encrypted backups with Maintenance Plans

Backing up and restoring in Azure

Prerequisites

Instructions to save a database to Azure

Restoring a database from Azure

Performance Considerations

Windows Azure Virtual Machine Characteristics

Summary

Installation

Enable CDC on a database

Enable CDC on a table

View Change Data

Export Change Data

Export data to a file

Export data to database