Shared posts

24 Sep 06:22

Always Encrypted Performance : A Follow-Up

by Aaron Bertrand

Last week, I wrote about the limitations of Always Encrypted as well as the performance impact. I wanted to post a follow-up after performing more testing, primarily due to the following changes:

  • I added a test for local, to see if network overhead was significant (previously, the test was only remote). Though, I should put "network overhead" in air quotes, because these are two VMs on the same physical host, so not really a true bare metal analysis.
  • I added a few extra (non-encrypted) columns to the table to make it more realistic (but not really that realistic).
      DateCreated  DATETIME NOT NULL DEFAULT SYSUTCDATETIME(),
      DateModified DATETIME NOT NULL DEFAULT SYSUTCDATETIME(),
      IsActive     BIT NOT NULL DEFAULT 1

    Then altered the retrieval procedure accordingly:

    ALTER PROCEDURE dbo.RetrievePeople
    AS
    BEGIN
      SET NOCOUNT ON;
      SELECT TOP (100) LastName, Salary, DateCreated, DateModified, Active
        FROM dbo.Employees
        ORDER BY NEWID();
    END
    GO
  • Added a procedure to truncate the table (previously I was doing that manually between tests):
    CREATE PROCEDURE dbo.Cleanup
    AS
    BEGIN
      SET NOCOUNT ON;
      TRUNCATE TABLE dbo.Employees;
    END
    GO
  • Added a procedure for recording timings (previously I was manually parsing console output):
    USE Utility;
    GO
     
    CREATE TABLE dbo.Timings
    (
      Test NVARCHAR(32),
      InsertTime INT,
      SelectTime INT,
      TestCompleted DATETIME NOT NULL DEFAULT SYSUTCDATETIME(),
      HostName SYSNAME NOT NULL DEFAULT HOST_NAME()
    );
    GO
     
    CREATE PROCEDURE dbo.AddTiming
      @Test VARCHAR(32),
      @InsertTime INT,
      @SelectTime INT
    AS
    BEGIN
      SET NOCOUNT ON;
      INSERT dbo.Timings(Test,InsertTime,SelectTime)
        SELECT @Test,@InsertTime,@SelectTime;
    END
    GO
  • I added a pair of databases which used page compression – we all know that encrypted values don't compress well, but this is a polarizing feature that may be used unilaterally even on tables with encrypted columns, so I thought I would just profile these too. (And added two more connection strings to App.Config.)
    <connectionStrings>
        <add name="Normal"  
             connectionString="...;Initial Catalog=Normal;"/>
        <add name="Encrypt" 
             connectionString="...;Initial Catalog=Encrypt;Column Encryption Setting=Enabled;"/>
        <add name="NormalCompress"
             connectionString="...;Initial Catalog=NormalCompress;"/>
        <add name="EncryptCompress" 
             connectionString="...;Initial Catalog=EncryptCompress;Column Encryption Setting=Enabled;"/>
    </connectionStrings>
  • I made many improvements to the C# code (see the Appendix) based on feedback from tobi (which led to this Code Review question) and some great assistance from co-worker Brooke Philpott (@Macromullet). These included:
    • eliminating the stored procedure to generate random names/salaries, and doing that in C# instead
    • using Stopwatch instead of clumsy date/time strings
    • more consistent use of using() and elimination of .Close()
    • slightly better naming conventions (and comments!)
    • changing while loops to for loops
    • using a StringBuilder instead of naive concatenation (which I had initially chosen intentionally)
    • consolidating the connection strings (though I am still intentionally making a new connection within every loop iteration)

Then I created a simple batch file that would run each test 5 times (and repeated this on both the local and remote computers):

for /l %%x in (1,1,5) do (        ^
AEDemoConsole "Normal"          & ^
AEDemoConsole "Encrypt"         & ^
AEDemoConsole "NormalCompress"  & ^
AEDemoConsole "EncryptCompress" & ^
)

After the tests were complete, measuring the durations and space used would be trivial (and building charts from the results would just take a little manipulation in Excel):

-- duration
 
SELECT HostName, Test, 
  AvgInsertTime = AVG(1.0*InsertTime), 
  AvgSelectTime = AVG(1.0*SelectTime)
FROM Utility.dbo.Timings
GROUP BY HostName, Test
ORDER BY HostName, Test;
 
-- space
 
USE Normal; -- NormalCompress; Encrypt; EncryptCompress;
 
SELECT COUNT(*)*8.192 
  FROM sys.dm_db_database_page_allocations(DB_ID(), 
    OBJECT_ID(N'dbo.Employees'), NULL, NULL, N'LIMITED');

Duration Results

Here are the raw results from the duration query above (CANUCK is the name of the machine that hosts the instance of SQL Server, and HOSER is the machine that ran the remote version of the code):

Raw results of duration queryRaw results of duration query

Obviously it will be easier to visualize in another form. As shown in the first graph, remote access had a significant impact on the duration of the inserts (over 40% increase), but compression had little impact at all. Encryption alone roughly doubled the the duration for any test category:

Duration (milliseconds) to insert 100,000 rowsDuration (milliseconds) to insert 100,000 rows

For the reads, compression had a much bigger impact on performance than either encryption or reading the data remotely:

Duration (milliseconds) to read 100 random rows 1,000 timesDuration (milliseconds) to read 100 random rows 1,000 times

Space Results

As you might have predicted, compression can significantly reduce the amount of space required to store this data (roughly in half), whereas encryption can be seen impacting data size in the opposite direction (almost tripling it). And, of course, compressing encrypted values doesn't pay off:

Space used (KB) to store 100,000 rows with or without compression and with or without encryptionSpace used (KB) to store 100,000 rows with or without compression and with or without encryption

Summary

This should give you a rough idea of what to expect the impact to be when implementing Always Encrypted. Keep in mind, though, that this was a very particular test, and that I was using an early CTP build. Your data and access patterns may yield very different results, and further advances in future CTPs and updates to the .NET Framework may reduce some of these differences even in this very test.

You'll also notice that the results here were slightly different across the board than in my previous post. This can be explained:

  • The insert times were faster in all cases because I am no longer incurring an extra round-trip to the database to generate the random name and salary.
  • The select times were faster in all cases because I am no longer using a sloppy method of string concatenation (which was included as part of the duration metric).
  • The space used was slightly larger in both cases, I suspect because of a different distribution of random strings that were generated.

Appendix A – C# Console Application Code

using System;
using System.Configuration;
using System.Text;
using System.Data;
using System.Data.SqlClient;
 
namespace AEDemo
{
    class AEDemo
    {
        static void Main(string[] args)
        {
            // set up a stopwatch to time each portion of the code
            var timer = System.Diagnostics.Stopwatch.StartNew();
 
            // random object to furnish random names/salaries
            var random = new Random();
 
            // connect based on command-line argument
            var connectionString = ConfigurationManager.ConnectionStrings[args[0]].ToString();
 
            using (var sqlConnection = new SqlConnection(connectionString))
            {
                // this simply truncates the table, which I was previously doing manually
                using (var sqlCommand = new SqlCommand("dbo.Cleanup", sqlConnection))
                {
                    sqlConnection.Open();
                    sqlCommand.ExecuteNonQuery();
                }
            }
 
            // first, generate 100,000 name/salary pairs and insert them
            for (int i = 1; i <= 100000; i++)
            {
                // random salary between 32750 and 197500
                var randomSalary = random.Next(32750, 197500);
 
                // random string of random number of characters
                var length = random.Next(1, 32);
                char[] randomCharArray = new char[length];
                for (int byteOffset = 0; byteOffset < length; byteOffset++)
                {
                    randomCharArray[byteOffset] = (char)random.Next(65, 90); // A-Z
                }
                var randomName = new string(randomCharArray);
 
                // this stored procedure accepts name and salary and writes them to table
                // in the databases with encryption enabled, SqlClient encrypts here
                // so in a trace you would see @LastName = 0xAE4C12..., @Salary = 0x12EA32...
                using (var sqlConnection = new SqlConnection(connectionString))
                {
                    using (var sqlCommand = new SqlCommand("dbo.AddEmployee", sqlConnection))
                    {
                        sqlCommand.CommandType = CommandType.StoredProcedure;
                        sqlCommand.Parameters.Add("@LastName", SqlDbType.NVarChar, 32).Value = randomName;
                        sqlCommand.Parameters.Add("@Salary", SqlDbType.Int).Value = randomSalary;
                        sqlConnection.Open();
                        sqlCommand.ExecuteNonQuery();
                    }
                }
            }
 
            // capture the timings
            timer.Stop();
            var timeInsert = timer.ElapsedMilliseconds;
            timer.Reset();
            timer.Start();
 
            var placeHolder = new StringBuilder();
 
            for (int i = 1; i <= 1000; i++)
            {
                using (var sqlConnection = new SqlConnection(connectionString))
                {
                    // loop through and pull 100 rows, 1,000 times
                    using (var sqlCommand = new SqlCommand("dbo.RetrieveRandomEmployees", sqlConnection))
                    {
                        sqlCommand.CommandType = CommandType.StoredProcedure;
                        sqlConnection.Open();
                        using (var sqlDataReader = sqlCommand.ExecuteReader())
                        {
                            while (sqlDataReader.Read())
                            {
                                // do something tangible with the output
                                placeHolder.Append(sqlDataReader[0].ToString());
                            }
                        }
                    }
                }
            }
 
            // capture timings again, write both to db
            timer.Stop();
            var timeSelect = timer.ElapsedMilliseconds;
 
            using (var sqlConnection = new SqlConnection(connectionString))
            {
                using (var sqlCommand = new SqlCommand("Utility.dbo.AddTiming", sqlConnection))
                {
                    sqlCommand.CommandType = CommandType.StoredProcedure;
                    sqlCommand.Parameters.Add("@Test", SqlDbType.NVarChar, 32).Value = args[0];
                    sqlCommand.Parameters.Add("@InsertTime", SqlDbType.Int).Value = timeInsert;
                    sqlCommand.Parameters.Add("@SelectTime", SqlDbType.Int).Value = timeSelect;
                    sqlConnection.Open();
                    sqlCommand.ExecuteNonQuery();
                }
            }
        }
    }
}

The post Always Encrypted Performance : A Follow-Up appeared first on SQLPerformance.com.

24 Sep 06:20

SQL Server 2016 – Introduction to Stretch Database

by Tim Radney
No, not that stretchNo, that is not the stretch you're looking for

Beginning with SQL Server 2016 you will have the ability to store portions of a database in the cloud. This new ability is known as Stretch Database and the feature will be beneficial to those needing to keep transactional data for long periods of time and those looking to save money on storage. Being able to seamlessly migrate data to the Microsoft Azure Cloud will give you the ability to archive data without having to change the way your applications query the data.

In SQL Server 2016 Community Technology Preview 2 (CTP2), Stretch Database migrates entire tables. If your database is already set up to store archive data in separate tables from current data, you will be able to easily migrate the archive data to Azure. Once you enable Stretch Database, it will silently migrate your data to an Azure SQL Database. Stretch Database leverages the processing power in Azure to run queries against remote data by rewriting the query. You will see this as a "remote query" operator in the query plan.

An easy way to identify databases and tables that are eligible for being Stretch-enabled is to download and run the SQL Server 2016 Upgrade Advisor and run the Stretch Database Advisor. Aaron Bertrand (@AaronBertrand) wrote about this recently:

Limitations for Stretch Database

Not all tables will be eligible for being Stretch-enabled. Certain table properties, data and column types, constraints, and indexes are not supported, such as:

  • Memory-optimized and replicated tables
  • Tables that contain FILESTREAM data, use Change Tracking or Change Data
  • Data types such as timestamp, sql_variant, XML, geography or columns that are Always Encrypted
  • Check and default constraints or foreign key constraints that reference the table
  • XML, full-text, spatial, clustered columnstore and indexed views that reference the Stretch-enabled table
  • You cannot run UPDATE or DELETE statements, or run CREATE INDEX or ALTER INDEX operations on a Stretch-enabled table

For a full listing of limitations, you can visit: Requirements and limitations for Stretch Database.

Setting up Stretch Database

Getting started isn't a complicated task. You'll need an Azure account and then enable Stretch Database on the instance.

To enable Stretch Database on an instance run:

EXEC sys.sp_configure N'remote data archive', '1';
RECONFIGURE;
GO

For this demo I'm going to use the AdventureWorks2014 database on a SQL Server 2016 CPT2 instance. I'll start by creating a new table:

USE [AdventureWorks2014];
GO
 
CREATE TABLE dbo.StretchTest
(
  FirstName VARCHAR(50),
  LastName  VARCHAR(50)
);
GO

And then I'll populate the test table StretchTest with some data:

USE [AdventureWorks2014];
GO
 
INSERT INTO dbo.StretchTest(FirstName, LastName)
VALUES('Paul', 'Randal'),  ('Kimberly', 'Tripp'),('Jonathan', 'Kehayias'),
      ('Erin', 'Stellato'),('Glenn', 'Berry'),   ('Tim', 'Radney');
GO

I now have a table that I can stretch to the Microsoft Azure Cloud. To do this I'll use the GUI by right-clicking on AdventureWorks2014, choosing Tasks, and selecting Enable Database for Stretch.

Image1

The Enable Database for Stretch wizard will open, as below:

Image2

I'll click next:

And sign in to my Microsoft Azure account:

I'm then prompted to verify which account I want to use:

Then I select which Azure location I want to use and specify an admin login and password. When you do this, make sure to make note of the admin username and password because you will need this in the future in order to reconnect to the Azure SQL Database if you have to restore the database.

I then click next:

And click Finish and the database starts provisioning to Azure SQL Database Server.

Image7

I've just created a secure linked server definition on my local server that has the remote Azure SQL database as the endpoint. I can view this in Server Objects, Linked Servers as well as in my Azure account under SQL Databases. Note that only system processes can use this linked server; user logins cannot issue queries through the linked server to the remote endpoint.

Now that Stretch Database is enabled for the instance and for the AdventureWorks2014 database I can now stretch my new table. To stretch the table to Azure I need to alter the table and enable remote data archive.

USE [AdventureWorks2014];
GO
 
ALTER TABLE [StretchTest]
ENABLE REMOTE_DATA_ARCHIVE WITH ( MIGRATION_STATE = ON );
GO

In addition to new features with SQL Server 2016, there are some new DMVs as well. To monitor the migration of data to Azure you can query sys.dm_db_rda_migration_status. When I queried the DMV after enabling remote data archive I was able to see that the 6 rows were migrated:

Image8

Backup and Restore of a Stretch Database

Currently in SQL Server 2016 CTP2 when a database that is Stretch-enabled is backed up, a shallow backup is created which does not include the data that has been migrated to the Azure SQL database. It is expected that with the RTM release of SQL Server 2016 backing up a Stretch-enabled database will create a deep backup that will contain both local and stretched data.

When restoring a database that is Stretch-enabled, you'll have to reconnect the local database to the remote Azure SQL Database. You do this by running the stored procedure sys.sp_reauthorize_remote_data_archive as a db_owner.

If I now back up the Stretch-enabled AdventureWorks2014 database and restore it, I will no longer be able to query the StretchTest table until I reconnect to Azure SQL Database by running:

USE [AdventureWorks2014];
GO
 
EXEC sys.sp_reauthorize_remote_data_archive @azure_username, @azure_password;
GO

Once reconnected I get a message similar to the one below and then I'm able to query the Stretched data once again:

Copying remote database 'RDAAdventureWorks201467B6D9D4-E8E0-4C54-B3EF-7C2D3F1326C4' to remote database 'RDAAdventureWorks2014660B555C-8DD1-4750-9A04-2868CD1C646D'.

Waiting for remote database copy to complete.

Remote database 'RDAAdventureWorks2014660B555C-8DD1-4750-9A04-2868CD1C646D' has completed copying, and is now online.

When restoring a Stretch-enabled database to another instance, that instance must have "remote data archive enabled". Once you have restored the database and enabled "remote data archive", all that is required is reconnecting to the Azure SQL Database by running the sys.sp_reauthorize_remote_data_archive stored procedure.

The backups for Azure SQL Databases for Basic, Standard, and Premium service tiers are taken every hour. The backup retention period varies depending on the service tier level. At time of writing, for basic it is 7 days, standard 14 days, and premium is 35 days. You can restore Azure SQL Databases by using the Microsoft Azure web portal.

Un-migrate Data

To migrate data back to local storage from an Azure SQL Database you will need to create a new local table with the same schema as the Stretch-enabled table. You then have to copy the data from the Stretch-enabled table into the new local table. Once the data is copied you drop the Stretch-enabled table and rename the new local table to the name of the Stretch-enabled table that was just dropped.

You can only disable Stretch for a database once all Stretch-enabled tables have been dropped. If you drop a database enabled for Stretch, the local database is removed but the remote data is not; you will have to drop the remote database from the Azure management portal.

Summary

Stretch Database is an easy way to migrate archive data to Microsoft Azure, if your database supports it. Currently in SQL Server 2016 CTP2 there are many limitations with table, data, and column properties, data and column types, constraints, and indexes. If you are not restricted by those limitations, then Stretch Database is a simple way to migrate historical data to Azure SQL Database and free up valuable local storage. Managing backups will become a bit more complex since your data will be split between on premise and in the cloud.

I'm looking forward to these restrictions being lifted in the RTM release, and I'm sure many of you will be able to make use of this cool feature.

The post SQL Server 2016 – Introduction to Stretch Database appeared first on SQLPerformance.com.

24 Sep 06:20

Case Study: The Data Science Behind Zillow’s Zestimates

by A.R. Guess

by Angela Guess Alex Woodie recently wrote for Datanami, “If you’re like most homeowners, you probably sneak a peek at your ‘Zestimate’ from time to time to see how your home’s value might have changed. Getting a Zestimate is very easy and straightforward for users, but behind the scenes, there’s a hefty amount of data…

The post Case Study: The Data Science Behind Zillow’s Zestimates appeared first on DATAVERSITY.

24 Sep 06:19

“I Didn’t Realize That I Was a Data Scientist”

by A.R. Guess

by Angela Guess Apryl DeLancey, President and CEO of Social Age Media recently wrote in the Huffington Post, “Hello, my name is Apryl and I am a data scientist. I didn't realize that I was a data scientist until earlier this year. For my entire career I just thought that I was a statistician that…

The post “I Didn’t Realize That I Was a Data Scientist” appeared first on DATAVERSITY.

24 Sep 06:18

More on CXPACKET Waits: Skewed Parallelism

by Paul Randal

In my previous post, I discussed CXPACKET waits and ways to prevent or limit parallelism. I also explained how the control thread in a parallel operation always registers a CXPACKET wait, and that sometimes non-control threads may also register CXPACKET waits. This can happen if one of the threads is blocked waiting for a resource (so all the other threads finish before it and register CXPACKET waits also), or if cardinality estimates are incorrect. In this post I’d like to explore the latter.

When cardinality estimates are incorrect, the parallel threads doing the query work are given uneven amounts of work to do. The typical case is where one thread is given all the work, or way more work than the other threads. This means that those threads that finish processing their rows (if they were even given any) before the slowest thread register a CXPACKET from the moment they finish until the slowest thread finishes. This problem can lead to a seeming explosion in CXPACKET waits occurring and is commonly called skewed parallelism, because the distribution of work between the parallel threads is skewed, not even.

Example of Skewed Parallelism

I’ll walk through a contrived example to show how to identify such cases.

First off, I’ll create a scenario where a table has wildly inaccurate statistics, by manually setting the number of rows and pages in an UPDATE STATISTICS statement (don’t do this in production!):

USE [master];
GO
 
IF DB_ID (N'ExecutionMemory') IS NOT NULL
BEGIN
    ALTER DATABASE [ExecutionMemory] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DROP DATABASE [ExecutionMemory];
END
GO
 
CREATE DATABASE [ExecutionMemory];
GO
USE [ExecutionMemory];
GO
 
CREATE TABLE dbo.[Test] (
    [RowID] INT IDENTITY,
    [ParentID] INT,
    [CurrentValue] NVARCHAR (100),
    CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED ([RowID]));
GO
 
INSERT INTO dbo.[Test] ([ParentID], [CurrentValue])
SELECT 
    CASE WHEN ([t1].[number] % 3 = 0)
        THEN [t1].[number] – [t1].[number] % 6
        ELSE [t1].[number] END, 
    'Test' + CAST ([t1].[number] % 2 AS VARCHAR(11))
FROM [master].[dbo].[spt_values] AS [t1]
WHERE [t1].[type] = 'P';
GO
 
UPDATE STATISTICS dbo.[Test] ([PK_Test]) WITH ROWCOUNT = 10000000, PAGECOUNT = 1000000;
GO

So my table only has a few thousand rows in it, but I’ve faked it having 10 million rows.

Now I’ll create a contrived query to select the top 500 rows, which will go parallel as it thinks there are millions of rows to scan.

USE [ExecutionMemory];
GO
 
SET NOCOUNT ON;
GO
 
DECLARE @CurrentValue NVARCHAR (100);
 
WHILE (1=1)
SELECT TOP (500) 
    @CurrentValue = [CurrentValue]
FROM dbo.[Test]
ORDER BY NEWID() DESC;
GO

And set that running.

Viewing the CXPACKET Waits

Now I can look at the CXPACKET waits that are occurring using a simple script to look at the sys.dm_os_waiting_tasks DMV:

SELECT
    [owt].[session_id],
    [owt].[exec_context_id],
    [owt].[wait_duration_ms],
    [owt].[wait_type],
    [owt].[blocking_session_id],
    [owt].[resource_description],
    [er].[database_id],
    [eqp].[query_plan]
FROM sys.dm_os_waiting_tasks [owt]
INNER JOIN sys.dm_exec_sessions [es] ON
    [owt].[session_id] = [es].[session_id]
INNER JOIN sys.dm_exec_requests [er] ON
    [es].[session_id] = [er].[session_id]
OUTER APPLY sys.dm_exec_sql_text ([er].[sql_handle]) [est]
OUTER APPLY sys.dm_exec_query_plan ([er].[plan_handle]) [eqp]
WHERE
    [es].[is_user_process] = 1
ORDER BY
    [owt].[session_id],
    [owt].[exec_context_id];

If I execute this a few times, eventually I see some results showing skewed parallelism (I stripped off the query plan handle link and curtailed the resource description, for clarity, and notice I put in the code to grab the SQL text if you want that too):

session_id exec_context_id wait_duration_ms wait_type blocking_session_id resource_description database_id
56 0 1 CXPACKET NULL exchangeEvent 13
56 1 1 CXPACKET 56 exchangeEvent 13
56 3 1 CXPACKET 56 exchangeEvent 13
56 4 1 CXPACKET 56 exchangeEvent 13
56 5 1 CXPACKET 56 exchangeEvent 13
56 6 1 CXPACKET 56 exchangeEvent 13
56 7 1 CXPACKET 56 exchangeEvent 13

Results showing skewed parallelism in action

The control thread is the one with exec_context_id set to 0. The other parallel threads are those with exec_context_id higher than 0, and they’re all showing CXPACKET waits apart from one (note that exec_context_id = 2 is missing from the list). You’ll notice that they all list their own session_id as the one that’s blocking them, and that’s correct because all the threads are waiting for another thread from their own session_id to complete. The database_id is the database in whose context the query is being executed, not necessarily the database where the problem is, but it usually is unless the query is using three-part naming to execute in a different database.

Viewing the Cardinality Estimation Problem

With the query_plan column in the query output (which I removed for clarity), you can click on it to bring up the graphical plan and then right-click and select View with SQL Sentry Plan Explorer. This shows as below:

Plan Tree

I can immediately see that there’s a cardinality estimate issue, as the Actual Rows for the Clustered Index Scan is only 2,048, compared to 10,000,000 Est (Estimated) Rows.

If I scroll across, I can see the distribution of rows across the parallel threads that were used:

Rows distributed over threads

Lo and behold, only a single thread was doing any work during the parallel portion of the plan – the one that didn’t show up in the sys.dm_os_waiting_tasks output above.

In this case, the fix is to update the statistics for the table.

In my contrived example that won’t work, as there haven’t been any modifications to the table, so I’ll re-run the set up script, leaving out the UPDATE STATISTICS statement.

The query plan then becomes:

New, fixed plan

Where there’s no cardinality problem and no parallelism either – problem solved!

Summary

If you see CXPACKET waits occurring, it’s easy to check for skewed parallelism, using the method described above. All the cases I’ve seen have been due to cardinality estimation issues of one kind or another, and often it’s simply a case of updating statistics.

As far as general wait statistics are concerned, you can find more information about using them for performance troubleshooting in:

Until next time, happy troubleshooting!

The post More on CXPACKET Waits: Skewed Parallelism appeared first on SQLPerformance.com.

24 Sep 06:18

AUTO_CLOSE and the SQL Server ERRORLOG

Today I opened up a SQL Server ERRORLOG and saw these two messages repeated every 20 seconds or so:

Starting up database ‘AdventureWorks2014′.

CHECKDB for database ‘AdventureWorks2014′ finished without errors on 2015-08-23 02:15:08.070 (local time).  This is an information message only; no user action required.

When you initially see these two messages repeated over and over, it might seem like SQL Server is caught in some issue with recovery.  Or you might think it’s running CHECKDB over and over.  Neither are true.  The database has AUTO_CLOSE enabled.  (And you see the CHECKDB message because it’s reading the boot page and noting the last time CHECKDB ran successfully…to see what updates that entry, check out my post What DBCC Checks Update dbccLastKnownGood?)

When AUTO_CLOSE is enabled, after the last user exits the database, the database shuts down and its resources are freed.  When someone tries to access the database again, the database reopens.  You might be thinking that for databases that are not accessed that often, this might be a good thing.  After all, freeing resources and giving them back to SQL Server for use elsewhere sounds useful.  Not so much.  There’s a cost associated with that shut down, and a cost to open the database back up when a user connects.  For example – shutting down a database removes all plans for that database from cache.  The next time a user runs a query, it will have to be compiled.  If the user disconnects, the plan is freed from cache.  If someone connects one minute later and runs the same query, it has be compiled again.  You get the point: this is inefficient.  And really, how many databases in your production environment do you really not access?  If you’re not accessing the database, why is it in a production instance?  If you want a few more details on AUTO_CLOSE, check out the entry for ALTER DATABASE in Books Online.

I am sure (maybe?) that there are valid cases for having AUTO_CLOSE enabled.  But I haven’t found one yet :)

On top of the resource use, realize that every time the database starts up, you’re going to get the above two messages in the ERRORLOG.  In the log I was looking at, there were multiple databases with this option enabled, so the log was flooded with these messages.  In general, I’m a huge fan of cycling the ERRORLOG on a regular basis (just set up an Agent job that runs sp_cycle_errorlog every week), and I try to reduce “clutter” in the log as much as possible.  This means don’t enable a setting like AUTO_CLOSE which can put in all those messages, and use trace flag 3226 to stop logging successful backup messages (they still go to msdb).

Oh yes, to disable AUTO_CLOSE:

ALTER DATABASE 'AdventureWorks2014' SET AUTO_CLOSE OFF;

GO

 

The post AUTO_CLOSE and the SQL Server ERRORLOG appeared first on Erin Stellato.

24 Sep 06:18

Long Internet of Things Life Cycles Could Cause Major Security Problems

by A.R. Guess

by Angela Guess Stilgherrian recently wrote for ZDnet, “Not long ago, Dick Bussiere visited a power station. The visit was part of his job as Tenable Network Security's principal architect for the Asia Pacific Region. And what did he find? A machine running Windows 2000 Server connected to the network, in control of things. That…

The post Long Internet of Things Life Cycles Could Cause Major Security Problems appeared first on DATAVERSITY.

24 Sep 06:17

Relational databases vs Non-relational databases

by James Serra

I see a lot of confusion about the place and purpose of the many new database solutions (“NoSQL databases”) compared to the relational databases solutions that have been around for many years.  So let me try to explain the differences and best use cases for each.

First lets clarify these database solutions into two groups:

1) Relational databases, which can also be called relational database management systems (RDBMS) or SQL databases.  The most popular of these are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2.  These RDBMS’s are mostly used in large enterprise scenarios, with the exception of MySQL, which is mostly used to store data for web applications, typically as part of the popular LAMP stack (Linux, Apache, MySQL, PHP/ Python/ Perl).

2) Non-relational databases, also called NoSQL databases, the most popular being MongoDB, DocumentDB, Cassandra, Coachbase, HBase, Redis, and Neo4j.  These databases are usually grouped into four categories: Key-value stores, Graph stores, Column stores, and Document stores (see Types of NoSQL databases).

All relational databases can be used to manage transaction-oriented applications (OLTP), and most non-relational databases that are in the categories Document stores and Column stores can also be used for OLTP, adding to the confusion.  OLTP databases can be thought of as “Operational” databases, characterized by frequent, short transactions that include updates and that touch a small amount of data and where concurrency of thousands of transactions is very important (examples including banking applications and online reservations).  Integrity of data is very important so they support ACID transactions (Atomicity, Consistency, Isolation, Durability).  This is opposed to data warehouses, which are considered “Analytical” databases characterized by long, complex queries that touch a large amount of data and require a lot of resources.  Updates are infrequent.  An example is analysis of sales over the past year.

Relational databases usually work with structured data, while non-relational databases usually work with semi-structured data (i.e. XML, JSON).

Let’s look at each group in more detail:

Relational Databases

A relational database is organized based on the relational model of data, as proposed by E.F. Codd in 1970.  This model organizes data into one or more tables (or “relations”) of rows and columns, with a unique key for each row.  Generally, each entity type that is described in a database has its own table with the rows representing instances of that type of entity and the columns representing values attributed to that instance.  Since each row in a table has its own unique key, rows in a table can be linked to rows in other tables by storing the unique key of the row to which it should be linked (where such unique key is known as a “foreign key”).  Codd showed that data relationships of arbitrary complexity can be represented using this simple set of concepts.

Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.

The reasons for the dominance of relational databases are: simplicity, robustness, flexibility, performance, scalability and compatibility in managing generic data.

But to offer all of this, relational databases have to be incredibly complex internally.  For example, a relatively simple SELECT statement could have dozens of potential query execution paths, which a query optimizer would evaluate at run time.  All of this is hidden to users, but under the hood, the RDBMS determines the best “execution plan” to answer requests by using things like cost-based algorithms.

For large databases, especially ones used for web applications, the main concern is scalability.  As more and more applications are created in environments that have massive workloads (i.e. Amazon), their scalability requirements can change very quickly and grow very large.  Relational databases scale well, but usually only when that scaling happens on a single server (“scale-up”).  When the capacity of that single server is reached, you need to “scale-out” and distribute that load across multiple servers, moving into so-called distributed computing.  This is when the complexity of relational databases starts to cause problems with their potential to scale.  If you try to scale to hundreds or thousands of servers the complexities become overwhelming.  The characteristics that make relational databases so appealing are the very same that also drastically reduce their viability as platforms for large distributed systems.

Non-relational databases

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

Motivations for this approach include:

  1. Simplicity of design.  Not having to deal with the “impedance mismatch” between the object-oriented approach to write applications and the schema-based tables and rows of a relational database.  For example, storing all the customer order info in one document as opposed to having to join many tables together, resulting in less code to write, debug, and maintain
  2. Better “horizontal” scaling to clusters of machines, which solves the problem when the number of concurrent users skyrockets for applications that are accessible via the web and mobile devices.  Using documents makes it much easier to scale-out as all the info for that customer order is contained in one place as opposed to being spread out on multiple tables.  NoSQL databases automatically spread data across servers without requiring application changes (auto-sharding), meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool.  Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption
  3. Finer control over availability.  Servers can be added or removed without application downtime.  Most NoSQL databases support data replication, storing multiple copies of data across the cluster or even across data centers, to ensure high availability and disaster recovery
  4. To easily capture all kinds of data “Big Data” which include unstructured and semi-structured data.  Allowing for a flexible database that can easily and quickly accommodate any new type of data and is not disrupted by content structure changes.  This is because document database are schemaless, allowing you to freely add fields to JSON documents without having to first define changes (schema-on-read instead of schema-on-write).  You can have documents with a different number of fields than other documents.  For example, a patient record that may or may not contain fields that list allergies
  5. Speed.  The data structures used by NoSQL databases (i.e. JSON documents) differ from those used by default in relational databases, making many operations faster in NoSQL than relational databases due to not having to join tables (at the cost of increased storage space due to duplication of data – but storage space is so cheap nowadays this is usually not an issue).  In fact, most NoSQL databases do not even support joins
  6. Cost.  NoSQL databases usually use clusters of cheap commodity servers, while RDBMS tend to rely on expensive proprietary servers and storage systems.  Also, the licenses for RDBMS systems can be quite expensive while many NoSQL databases are open source and therefore free

The particular suitability of a given NoSQL database depends on the problem it must solve.

NoSQL databases are increasingly used in big data and real-time web applications.  They became popular with the introduction of the web, when databases went from a max of a few hundred users on an internal company application to thousands or millions of users on a web application.  NoSQL systems are also called “Not only SQL” to emphasize that they may also support SQL-like query languages.

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance.  Some reasons that block adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and huge investments in existing SQL.  Also, most NoSQL stores lack true ACID transactions or only support transactions in certain circumstances and at certain levels (e.g., document level).  Finally, RDBMS’s are usually much simpler to use as they have GUI’s where many NoSQL solution use a command-line interface.

Comparing the two

One of the most severe limitations of relational databases is that each item can only contain one attribute.  If we use a bank example, each aspect of a customer’s relationship with a bank is stored as separate row items in separate tables.  So the customer’s master details are in one table, the account details are in another table, the loan details in yet another, investments in a different table, and so on.  All these tables are linked to each other through the use of relations such as primary keys and foreign keys.

Non-relational databases, specifically a database’s key-value stores or key-value pairs, are radically different from this model.  Key-value pairs allow you to store several related items in one “row” of data in the same table.  We place the word “row” in quotes because a row here is not really the same thing as the row of a relational table.  For instance, in a non-relational table for the same bank, each row would contain the customer’s details as well as their account, loan and investment details.  All data relating to one customer would be conveniently stored together as one record.

This seems an obviously superior method of storing data, but it has a major drawback: key-value stores, unlike relational databases, cannot enforce relationships between data items.  For instance, in our key-value database, the customer details (name, social security, address, account number, loan processing number, etc.) would all be stored as one data record (instead of being stored in several tables, as in the relational model).  The customer’s transactions (account withdrawals, account deposits, loan repayments, bank charges, etc.) would also be stored as another single data record.

In the relational model, there is an built-in and foolproof method of ensuring and enforcing business logic and rules at the database layer, for instance that a withdrawal is charged to the correct bank account, through primary keys and foreign keys.  In key-value stores, this responsibility falls squarely on the application logic and many people are very uncomfortable leaving this crucial responsibility just to the application.  This is one reason why relational databases will continued to be used.

However, when it comes to web-based applications that use databases, the aspect of rigorously enforcing business logic is often not a top priorities.  The highest priority is the ability to service large numbers of user requests, which are typically read-only queries.  For example, on a site like eBay, the majority of users simply browse and look through posted items (read-only operations).  Only a fraction of these users actually place bids or reserve the items (read-write operations).  And remember, we are talking about millions, sometimes billions, of page views per day.  The eBay site administrators are more interested in quick response time to ensure faster page loading for the site’s users, rather than the traditional priorities of enforcing business rules or ensuring a balance between reads and writes.

Relational-model databases can be tweaked and set up to run large-scale read-only operations through data warehousing, and thus potentially serve a large amount of users who are querying a large amount of data, especially when using relational MPP architectures like Analytics Platform System, Teradata, Oracle Exadata, or IBM Netezza, which all support scaling.  As mentioned before, data warehouses are distinct from typical databases in that they are used for more complex analysis of data.  This differs from the transactional (OLTP) database, whose main use is to support operational systems and offer day-to-day, small scale reporting.

However, the real challenge is the relational model’s lack of scalability when dealing with OLTP applications, or any solution with a lot of individual writes, which is the domain of relational SMP architectures.  This is where non-relational models can really shine.  They can easily distribute their data loads across dozens, hundreds and in extreme cases (think Google search) even thousands of servers.  With each server handling only a small percentage of the total requests from users, response time is very good for each individual user.  Although this distributed computing model can be built for relational databases, it is a real pain to implement, especially when there are a lot of writes (i.e OLTP), requiring techniques like sharding which usually requires significant coding outside of the application’s business logic.  This is because the relational model insists on data integrity at all levels, which must be maintained, even as the data is accessed and modified by several different servers.  This is the reason for the non-relational model as the architecture of choice for web applications such as cloud-computing and social networking.

So in summary, RDBMS’s suffer from no horizontal scaling for high transaction loads (millions of read-writes), while NoSQL databases solve high transaction loads but at the cost of data integrity and joins.

Keep in mind many solutions will use a combination of relational and non-relational databases (see What is Polyglot Persistence?).

Also keep in mind that you may not need the performance of a non-relational database and instead just going with storing files in HDFS and using Apache Hive will be enough (Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis that it provides via an SQL-like language called HiveQL).

And to end on a note that adds to the confusion, we have a another category forming called NewSQL: NewSQL is a class of modern RDBMS’s that seek to provide the same scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional relational database system.  The disadvantages is they are not for OLAP-style queries, and they are inappropriate for databases over a few terabytes.  Examples include VoltDB, NuoDB, MemSQL, SAP HANA, Splice Machine, Clustrix, and Altibase.

A picture showing the categories that many of the products fit into:

Untitled picture

An excellent graphic that shows how all the technologies fit in the Azure cloud is from Understanding NoSQL on Microsoft Azure:

nosql

The bottom line for using a NoSQL solution is if you have an OLTP application that has thousands of users and has a very large database requiring a scale-out solution and/or is using JSON data, in particular if this JSON data has various structures.  You also get the benefit of high availability as NoSQL solutions store multiple copies of the data.  Just keep in mind you for performance you may sacrifice data consistency, as well as the ability to join data, use SQL, and to do quick mass updates.

More info:

MySQL vs MongoDB

MySQL vs. MongoDB: Looking At Relational and Non-Relational Databases

10 things you should know about NoSQL databases

Introduction to Databases

Difference between SQL and NoSQL : Comparision

SQL vs NoSQL Database Differences Explained with few Example DB

NoSQL, NewSQL, or RDBMS: How To Choose

NewSQL – RDBMS on Steroids

NoSQL vs NewSQL Databases Choose the Right Tool for the Right Job

SQL vs NoSQL: you do want to have a relational storage by default

Oracle Defends Relational DBs Against NoSQL Competitors

Understanding NoSQL on Microsoft Azure

Meet the Avant-Garde of New Relational Databases

To SQL or NoSQL? That’s the database question

24 Sep 06:17

New Azure SQL Database offerings add capabilities to scale data up and out

by SQL Server Team

Tiffany Wissner, Senior Director of Data Platform Marketing

At Microsoft we’re committed to meeting the needs of our customers by offering choices as they build cloud applications for scale and performance. Today we’re continuing to deliver on this commitment with two announcements:

As more customers build more powerful applications in Azure using Azure SQL Database, we've seen strong demand for more options to scale performance. We’re dedicated to offering a portfolio of data services for cloud applications that effectively scales up to support data growth from more users, and scales out to support growth from more customers and fluctuating resource demands.

To meet this scale up demand, we are announcing the release of new Azure SQL Database Premium performance levels, P4 and P11. With the introduction of both P4 and P11, we’ve expanded the range of throughput performance developers can leverage to scale their applications. Developers can quickly start with our basic or standard performance tiers and easily scale database performance up to 350 times greater with our premium tier as their data needs grow – all without code or application changes.  This enables developers to match application needs with the right database performance.



On the scale out front, we have seen an increasing demand for Elastic Database pools since we announced its public preview.  We’ve seen SaaS developers adopt and embrace elastic databases because it makes it easier to manage their growing number of customers.  With Elastic Database pools SaaS developers can manage hundreds of databases for each of their customers, optimize the performance of each database, and manage the cost of these databases to support their explosive growth. This means their customers still get a great experience, even during peak usage periods, and there’s no need to over-provision resources during lulls.

One example is StudioPlus, which offers solutions for professional photographers. CEO Matthew Hunt says, “We are expanding our legacy on-premises solution with a SaaS solution. Azure SQL Database is a fundamental part of this journey because it allows us to leverage an existing code base and gives us a platform for unlimited customer growth. From a DevOps perspective, it’s simple to set up and manage Azure SQL Database Elastic Database pools to support the unlimited scale-out of a one-database-per-customer pattern.”



To support the increasing demand of Elastic Database Pools, we are now introducing a public preview of two additional service tiers for Elastic Databases, Basic and Premium.  These new tiers give organizations and developers with SaaS applications greater options and flexibility for getting started with Elastic Databases and scaling out to meet new customer demands.




We are thrilled that we can support StudioPlus and a growing number of developers on their SaaS journey with Azure SQL Database.

Our announcements today add scale and performance options, and these releases are just part of our end-to-end data platform. We continue to make it easier for customers to maximize their data dividends with our data platform and services. It’s never been easier to capture, transform, mash-up, analyze and visualize any data, of any size, at any scale, in its native format using familiar tools, languages and frameworks in a trusted environment on-premises and in the cloud.

Find out more and try Azure SQL Database and Azure SQL Database Elastic Databases today.

24 Sep 06:16

Analytics Platform System Update 4 Generally Available

by SQL Server Team

Matt Usher, Senior Program Manager, DS SQL Engineering

We are pleased to announce the general availability of Appliance Update 4 for the Microsoft Analytics Platform System. The Microsoft Analytics Platform System is a turnkey big data analytics appliance, combining Microsoft’s massively parallel processing (MPP) data warehouse technology with HDInsight, Microsoft’s 100% Apache Hadoop distribution, and delivering it as a turnkey appliance. APS’s big data capability is powered by PolyBase, an integrated technology allowing customers to utilize the skillset they've developed in TSQL for querying and managing data in Hadoop platforms.

This update delivers several new capabilities and features, including:

PolyBase/Hadoop enhancements

  • Support for Cloudera 5.1 and Hortonworks Data Platform 2.1 and 2.2
  • Extended partial aggregate pushdowns into Hadoop
  • Grow HDInsight region on an appliance with an existing region

T-SQL compatibility improvements to reduce migration friction from SQL SMP

  • Scalar User Defined Functions – CREATE/DROP FUNCTION statements
  • PIVOT/UNPIVOT relational operators
  • Round Robin as a third table type for hybrid performance
  • Query plan improvements to favor hash joins over nested loop joins in DW scenarios for better query performance
  • Addition of a new DBCC command - DBCC DROPCLEANBUFFERS
  • SQL Server SMP to APS Migration Utility

Performance

  • 1.5x data return rate improvement for SELECT queries to improve external analysis integration

OEM Hardware on AU4

  • Dell, Quanta, and HP (Gen9) next generation server support
    • Intel Haswell Processors, 256 GB (16x16Gb) 2133MHz memory
  • Updated HP 5900 series switches including High Availability improvements
  • Mixed hardware version support allows extended appliance lifetime and expandability

Together with SQL Data Warehouse, Microsoft offers a choice of MPP data warehouse on-premises, in the cloud or both.

To learn more, please see the Microsoft Analytics Platform portal.

Related Link:
Microsoft Analytics Platform System Appliance Update 4 Documentation and Client Tools

24 Sep 06:14

Live Query Statistics in SQL 2014, not just SQL 2016

by Rob Farley

One of the coolest things in SQL Server 2016 is Live Query Statistics (LQS). But did you know that it’s available in SQL 2014? (Edit: Needs SP1)

The thing is that we haven’t been able to view it effectively before now, before SQL Server Management Studio 2016 became available.

LQS provides the ability to watch an execution plan while the query is still running.

image

In this image above, you can see that I have a fairly long-running query, and I’ve taken a screen shot while the query is running. I can tell it’s still running because of the dotted lines. At this point, the Hash Match has asked the Index Scan for all the rows it’s going to, and the Index Scan has been marked as 100% done. The Hash Match is now asking the Merge Join for rows. The Merge Join is pulling data out of the Sort, which has already pulled in the rows it needed, and so on.

It’s cool stuff.

And significantly, this is against SQL Server 2014 (SP1). I had to use SSMS 2016, but then the button appeared at the top to include Live Query Statistics...

image

...and then I could see them. When I connected to an earlier version, such as SQL Server 2012, the box was disabled and I couldn’t see them.

So why not install SSMS 2016 (it’s only the client tool – I’m not suggesting you put it on a server), and see how some of your queries look?

You shouldn’t do this against a production machine, because it takes some effort for SQL Server to produce the data used for this. But see how you go. It’s definitely a useful performance-tuning feature which is available right now.

@rob_farley

24 Sep 05:57

"Cannot change the host configuration" error message when adding disk storage to a VMware vSphere ESXi Host

by Simon Seagrave

"Cannot change the host configuration" error message when adding disk storage to a VMware vSphere ESXi HostExperiencing the “Cannot change the host configuration” and or Call “HostDatastoreSystem.QueryVmfsDatastoreCreateOptions” for object “datastoreSystem-xxxx” on vCenter Server error message when trying to add disk storage to your VMware vSphere ESXi host?  If so, check out my video below where I run through the easy to follow steps in resolving this issue.

As with most things there can often be a multiple causes to a particular error, though in my experience I have found in the majority of cases this particular error message is down to adding pre-used disks which contain an existing partition or similar information on the disk.  Now this isn’t always the case, so in the basic steps outlined in the video I show you how to confirm what the actual error is.

*** IMPORTANT ***  Whenever touching anything to do with your storage, always ensure you have a good backup of your production data first.  Even in an instance such as this where “in theory” you are adding a “new” disk to your ESXi host, if you were to identify the disk ID incorrectly you could end up impacting data on another disk.  Just be careful, and use common sense.  Smile

 

Also, check out this useful VMware KB article for some more troubleshooting information.

I hope this videos helps, and definitely keep note of the couple of commands I use as these often prove useful when storage trouble shooting on an ESXi host.  If you found the video of use, please consider giving it a thumbs up on YouTube and also subscribing to my channel – thanks.

The post "Cannot change the host configuration" error message when adding disk storage to a VMware vSphere ESXi Host appeared first on TechHead and was written by .

24 Sep 05:56

Dealing with Software Religious Arguments and Architectural Zealotry

by Scott Hanselman
4302882942_9e2b92fdeb_b

Warning: Excessive use of Capitals for Emphasis ahead.

A friend of mine left his job to start a medical startup and has been in the middle of a Fight Over The Tech Stack. The current challenge is very bifurcated...very polarized. It's old vs. new, enterprise vs. startup, closed vs. open source, reliable vs. untested. There doesn't seem to be any middle ground.

Sometimes fights like these start with a Zealot.

Zealot: a person who is fanatical and uncompromising in pursuit of their religious, political, or other ideals.

Not all, don't get mad yet, but sometimes. Sometimes a Technical Religious Zealot is on your team - or runs your team - and they can't make objective decisions about a particular piece of technology.

"Don't use Microsoft, it killed my Pappy! Rails? Please, that won't scale. Node? Maybe if you're 17 that'll work! The only real way to write right software is with Technology X."

The language may not be this overt, but the essence is that Software can only be built This Way.

Here's the thing. Lean in. There's lots of ways to build software. Lots of successful ways. In fact, Success is a great metric.

But there's a lot of crappy Java apps, there's a lot of crappy C# apps, and there's lot of crappy Technology X apps.

Enthusiasm for a technology is understandable, especially if you've had previous success. I've worked in C++, Pascal, node.js, Java, and C#, myself. I've had great success with all of them, but I'm currently most excited about .NET and C#. I'm an enthusiast, to be clear. I've also told people who have hired me for projects that .NET wasn't the right tech for their problem.

Be excited about your technical religion, but also not only respect others' technical religion, celebrate their successes and learn from them as they may inform your own architectures. Every religious can learn from others, and the same is true in software.

Beware the Zealots. Software is a place for measurement, for experience, for research, and for thoughtful and enthusiastic discussion. You or the Zealot may ultimately disagree with the team decision but you should disagree and commit. A good Chief Architect can pull all these diverse architectural conversations and business requirements into a reasonable (and likely hybrid) stack that will serve the company for years to come.

Dear Reader, how do you deal with Technology Decisions that turn into Religious Arguments? Sound off in the comments.

SOCIAL: Hey folks, please do follow me on Facebook https://fb.me/scott.hanselman or Twitter! https://twitter.com/shanselman

* Photo "Enthusiasm Rainbow Gel" by Raquel Baranow used under CC BY 2.0


Sponsor: Big thanks to Infragistics for sponsoring the feed this week! Responsive web design on any browser, any platform and any device with Infragistics jQuery/HTML5 Controls.  Get super-charged performance with the world’s fastest HTML5 Grid - Download for free now!



© 2015 Scott Hanselman. All rights reserved.
     
15 Aug 06:22

U.K. Government Seeking To End Reliance On Oracle

by timothy
Mrdenny

now

jfruh writes: The U.K. Cabinet Office has reportedly asked government departments and agencies to try to find ways to end their reliance on Oracle software, a move motivated by the truly shocking number of Oracle licenses currently being paid for by the British taxpayer. The Department for Environment, Food and Rural Affairs alone has paid £1.3 million (US$2 million) per year for some 2 million Oracle licenses, or about 200 licenses per staff member.

Share on Google+

Read more of this story at Slashdot.

15 Aug 04:35

Tech Firms, Retailers Propose Security and Privacy Rules For Internet of Things

by Soulskill
chicksdaddy writes: As the Obama Administration and the rest of the federal bureaucracy hem and haw about whether and how to regulate the fast-growing Internet of Things, a group representing private sector firms has come out with a framework for ensuring privacy and security protections in IoT products that is lightyears ahead of anything under consideration inside the Beltway. The Online Trust Alliance — a group made up of such staunch civil liberties and privacy advocates as Target Stores (?), Microsoft and home security firm ADT — on Tuesday released a draft of its IoT Trust Framework (PDF), which offers voluntary best practices in security, privacy and what OTA calls "sustainability" (read "lifecycle management") for home automation, and wearable health/fitness technologies.

Share on Google+

Read more of this story at Slashdot.

15 Aug 04:33

The Fastest-Growing Tech State Is... Minnesota

by samzenpus
Nerval's Lobster writes: What's the fastest-growing state for technology jobs? You might be tempted to say California or New York, or even North Carolina. But according to new data from the U.S. Bureau of Labor Statistics, it's actually Minnesota, which saw the size of its tech workforce jump 8.36 percent over the past six months, to 37,600 workers. Utah and Nebraska came in second and third on the list of fastest-growing states, with six-month tech-employment gains of 5.75 percent and 5.22 percent, respectively. Michigan and Florida came in fourth and fifth. States with smallish tech-worker populations can enjoy heady growth rates by adding relatively few workers. But not all states saw their tech workforce grow in the first half of 2015. Four states—Pennsylvania, Washington, North Carolina, and Alabama—actually saw their workforce decline by 0.61 percent, 0.63 percent, 2.36 percent, and 3.52 percent, respectively, during the period in question. The declines in Washington and North Carolina may come as a surprise to anyone following those states' tech industries, which are quite robust. In Washington's case, layoffs at Microsoft and other firms over the past few months may have contributed to the slight decline.

Share on Google+

Read more of this story at Slashdot.

15 Aug 04:32

Samsung Unveils V-NAND High Performance SSDs, Fast NVMe Card At 5.5GB Per Second

by samzenpus
MojoKid writes: Sometimes it's the enterprise sector that gets dibs on the coolest technology, and so it goes with a trio of TCO-optimized, high-performance solid state drives from Samsung that were just announced, all three of which are based on three-dimensional (3D) Vertical NAND (V-NAND) flash memory technology. The fastest of bunch can read data at up to 5,500 megabytes per second. That's the rated sequential read speed of Samsung's PM1725, a half-height, half-length (HHHL) PCIe card-type NVMe SSD. Other rated specs include a random read speed of up to 1,000,000 IOPS, random write performance of up to 120,000 IOPS, and sequential writes topping out at 1,800MB/s. The PM1725 comes in just two beastly storage capacities, 3.2TB and 6.4TB, the latter of which is rated to handle five drive writes per day (32TB) for five years. Samsung also introduced two other 3D V-NAND products, the PM1633 and PM953. The PM1633 is a 2.5-inch 12Gb/s SAS SSD that will be offered in 480GB, 960GB, 1.92TB, and 3.84TB capacities. As for the PM953, it's an update to the SM951 and is available in M.2 and 2.5-inch form factors at capacities up to 1.92TB.

Share on Google+

Read more of this story at Slashdot.

15 Aug 04:20

How Microsoft Built, and Is Still Building, Windows 10

by samzenpus
An anonymous reader writes with this Venturebeat story about how Windows 10 is different from previous versions because of the way it was designed, including 15 public preview builds, and how much work is still being done. Windows 10 for PCs arrived two weeks ago. Thankfully, we don't need to wait years to say this will be a Microsoft operating system release like no other. The most obvious clue is not the fact that Windows 10 was installed on more than 14 million devices in 24 hours, that you can get it for cheap or upgrade to it for free, nor even that it ships with a digital assistant and a proper browser. No, the big deal here is that Microsoft is turning its OS into a service, and that means as you read these words, it's still being built. For the next few years, we'll be getting not just Windows 10 updates and patches, but new improvments and features. This is possible because Microsoft built this version very differently from all its previous releases.

Share on Google+

Read more of this story at Slashdot.

15 Aug 04:14

Snapshot Isolation Level and Concurrent Modification Collisions - On Disk and In Memory OLTP

by drsql
This is part 1 of my In Memory OLTP blogs for my upcoming book project . In this blog I want to take a (reasonably) quick look at how concurrent modifications are handled using SNAPSHOT isolation level. There are two different flavors of SNAPSHOT Isolation level you can deal with, and while they are similar, there are some very big differences internally, as well as how they work when you interact with them. Generally speaking, SNAPSHOT isolation level is meant to be an implementation of an optimistic...(read more)
15 Aug 04:13

HP’s Big Split and Its New Focus on Big Data

by A.R. Guess

by Angela Guess Jonathan Vanian of Fortune reports, “In just a little under three months, Hewlett Packard will officially split into two separate companies, with one (Hewlett-Packard Enterprise) focusing on enterprise technology, including data center hardware, and the other (HP Inc.) selling personal computers and printers. At a recent HP conference in June, CEO Meg…

The post HP’s Big Split and Its New Focus on Big Data appeared first on DATAVERSITY.

15 Aug 04:01

Running VMware vSphere 6 on an HP Proliant ML110 G6

by Simon Seagrave

HP Proliant ML110 G6 VMware vSphere ESXi 6This is just a quick post to confirm that VMware vSphere/ESXi 6 runs on the HP Proliant ML110 G6.  Although this is now an older server, it is still very much capable and I know that there are still quite a few being used in home/work labs, hence this post.

The CPU, Memory, Storage Controller and Network Adapter are all detected out of the box.  I used the HP image of vSphere ESXi 6.0 that I downloaded from here.

The images below show what is presented through the vSphere Client – I hope this helps answer any questions on whether VMware vSphere 6 will run on the ML110 G6.

 HP-ML110-G6-vSphere6-1

HP-ML110-G6-vSphere6-2

HP-ML110-G6-vSphere6-4

HP-ML110-G6-vSphere6-3

The post Running VMware vSphere 6 on an HP Proliant ML110 G6 appeared first on TechHead and was written by Simon Seagrave.

13 Aug 19:54

SQL Server 2016 : Performance Impact of Always Encrypted

by Aaron Bertrand

As part of T-SQL Tuesday #69, I've blogged about the limitations of Always Encrypted, and I mentioned there that performance could be negatively impacted by its use (as you might expect, stronger security often does have trade-offs). In this post, I wanted to take a quick look at this, keeping in mind (again) that these results are based off of CTP 2.2 code, so very early in the development cycle, and are not necessarily reflective of the performance you'll see come RTM.

First, I wanted to demonstrate that Always Encrypted works from client applications even if the latest version of SQL Server 2016 is not installed there. You do, though, have to install the .NET Framework 4.6 preview (most recent version here, and that may change) in order to support the Column Encryption Setting connection string attribute. If you are running Windows 10, or have installed Visual Studio 2015, this step is not necessary, as you should already have a recent enough version of the .NET Framework.

Next, you need to make sure the Always Encrypted certificate exists on all clients. You create the master and column encryption keys within the database, as any Always Encrypted tutorial will show you, then you need to export the certificate from that machine, and import it on the others where application code will run. Open certmgr.msc, and expand Certificates – Current User > Personal > Certificates, and there should be one there called Always Encrypted Certificate. Right-click that, choose All Tasks > Export, and follow the prompts. I exported the private key and provided a password, which produced a .pfx file. Then you just repeat the opposite process on the client machines: Open certmgr.msc, expand Certificates – Current User > Personal, right-click Certificates, choose All Tasks > Import, and point it at the .pfx file you created above. (Official help here.)

(There are more secure ways to manage these certificates – it is not likely you'd just want to deploy the certificate like this to all machines, as you will soon ask yourself what was the point? I was only doing this in my isolated environment for the purposes of this demo – I wanted to make sure my application was retrieving data over the wire and not just in local memory.)

We create two databases, one with an encrypted table, and one without. We do this to isolate connection strings and also to measure space usage. Of course, there are more granular ways to control which commands need to use an encryption-enabled connection – see the note titled "Controlling the performance impact…" in this article.

The tables look like this:

-- encrypted copy, in database Encrypted
 
CREATE TABLE dbo.Employees
(
  ID INT IDENTITY(1,1) PRIMARY KEY,
  LastName NVARCHAR(32) COLLATE Latin1_General_BIN2 
    ENCRYPTED WITH (ENCRYPTION_TYPE = DETERMINISTIC,
	ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256',
	COLUMN_ENCRYPTION_KEY = ColumnKey) NOT NULL,
  Salary INT
    ENCRYPTED WITH (ENCRYPTION_TYPE = RANDOMIZED,
	ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256',
	COLUMN_ENCRYPTION_KEY = ColumnKey) NOT NULL
);
 
-- unencrypted copy, in database Normal
 
CREATE TABLE dbo.Employees
(
  ID INT IDENTITY(1,1) PRIMARY KEY,
  LastName NVARCHAR(32) COLLATE Latin1_General_BIN2 NOT NULL,
  Salary INT NOT NULL
);

With these tables in place, I wanted to set up a very simple command-line application to perform the following tasks against both the encrypted and unencrypted versions of the table:

  • Insert 100,000 employees, one at a time
  • Read through 100 random rows, 1,000 times
  • Output timestamps before and after each step

So we have a stored procedure in a completely separate database used to produce random integers to represent salaries, and random Unicode strings of varying lengths. We're going to do this one at a time to better simulate real usage of 100,000 inserts happening independently (though not concurrently, as I am not brave enough to try to properly develop and manage a multi-threaded C# application, or try to coordinate and synchronize multiple instances of a single application).

CREATE DATABASE Utility;
GO
 
USE Utility;
GO
 
CREATE PROCEDURE dbo.GenerateNameAndSalary
  @Name NVARCHAR(32) OUTPUT,
  @Salary INT OUTPUT
AS
BEGIN
  SET NOCOUNT ON;
  SELECT @Name = LEFT(CONVERT(NVARCHAR(32), CRYPT_GEN_RANDOM(64)), RAND() * 32 + 1);
  SELECT @Salary = CONVERT(INT, RAND()*100000)/100*100;
END
GO

A couple of rows of sample output (we don't care about the actual content of the string, just that it varies):

酹2׿ዌ륒㦢㮧羮怰㉤盿⚉嗝䬴敏⽁캘♜鼹䓧
98600
 
贓峂쌄탠❼缉腱蛽☎뱶
72000

Then the stored procedures the application will ultimately call (these are identical in both databases, since your queries don't need to be changed to support Always Encrypted):

CREATE PROCEDURE dbo.AddPerson
  @LastName NVARCHAR(32),
  @Salary INT
AS
BEGIN
  SET NOCOUNT ON;
  INSERT dbo.Employees(LastName, Salary) SELECT @LastName, @Salary;
END
GO
 
CREATE PROCEDURE dbo.RetrievePeople
AS
BEGIN
  SET NOCOUNT ON;
  SELECT TOP (100) ID, LastName, Salary 
    FROM dbo.Employees
    ORDER BY NEWID();
END
GO

Now, the C# code, starting with the connectionStrings portion of App.config. The important part being the Column Encryption Setting option for only the database with the encrypted columns (for brevity, assume all three connection strings contain the same Data Source, and the same SQL authentication User ID and Password):

<connectionStrings>
  <add name="Utility" connectionString="Initial Catalog=Utility;..."/>
  <add name="Normal"  connectionString="Initial Catalog=Normal;..."/>
  <add name="Encrypt" connectionString="Initial Catalog=Encrypted; Column Encryption Setting=Enabled;..."/>
</connectionStrings>

And Program.cs (sorry, for demos like this, I'm terrible at going in and renaming things logically):

using System;
using System.Collections.Generic;
using System.Text;
using System.Configuration;
using System.Data;
using System.Data.SqlClient;
 
namespace AEDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            using (SqlConnection con1 = new SqlConnection())
            {
                Console.WriteLine(DateTime.UtcNow.ToString("hh:mm:ss.fffffff"));
                string name;
                string EmptyString = "";
                int salary;
                int i = 1;
                while (i <= 100000)
                {
                    con1.ConnectionString = ConfigurationManager.ConnectionStrings["Utility"].ToString();
                    using (SqlCommand cmd1 = new SqlCommand("dbo.GenerateNameAndSalary", con1))
                    {
                        cmd1.CommandType = CommandType.StoredProcedure;
                        SqlParameter n = new SqlParameter("@Name", SqlDbType.NVarChar, 32) 
                                         { Direction = ParameterDirection.Output };
                        SqlParameter s = new SqlParameter("@Salary", SqlDbType.Int) 
                                         { Direction = ParameterDirection.Output };
                        cmd1.Parameters.Add(n);
                        cmd1.Parameters.Add(s);
                        con1.Open();
                        cmd1.ExecuteNonQuery();
                        name = n.Value.ToString();
                        salary = Convert.ToInt32(s.Value);
                        con1.Close();
                    }
 
                    using (SqlConnection con2 = new SqlConnection())
                    {
                        con2.ConnectionString = ConfigurationManager.ConnectionStrings[args[0]].ToString();
                        using (SqlCommand cmd2 = new SqlCommand("dbo.AddPerson", con2))
                        {
                            cmd2.CommandType = CommandType.StoredProcedure;
                            SqlParameter n = new SqlParameter("@LastName", SqlDbType.NVarChar, 32);
                            SqlParameter s = new SqlParameter("@Salary", SqlDbType.Int);
                            n.Value = name;
                            s.Value = salary;
                            cmd2.Parameters.Add(n);
                            cmd2.Parameters.Add(s);
                            con2.Open();
                            cmd2.ExecuteNonQuery();
                            con2.Close();
                        }
                    }
                    i++;
                }
                Console.WriteLine(DateTime.UtcNow.ToString("hh:mm:ss.fffffff"));
                i = 1;
                while (i <= 1000)
                {
                    using (SqlConnection con3 = new SqlConnection())
                    {
                        con3.ConnectionString = ConfigurationManager.ConnectionStrings[args[0]].ToString();
                        using (SqlCommand cmd3 = new SqlCommand("dbo.RetrievePeople", con3))
                        {
                            cmd3.CommandType = CommandType.StoredProcedure;
                            con3.Open();
                            SqlDataReader rdr = cmd3.ExecuteReader();
                            while (rdr.Read())
                            {
                                EmptyString += rdr[0].ToString();
                            }
                            con3.Close();
                        }
                    }
                    i++;
                }
                Console.WriteLine(DateTime.UtcNow.ToString("hh:mm:ss.fffffff"));
            }
        }
    }
}

Then we can call the .exe with the following command lines:

AEDemoConsole.exe "Normal"
AEDemoConsole.exe "Encrypt"

And it will produce three lines of output for each call: the start time, the time after 100,000 rows were inserted, and the time after 100 rows were read 1,000 times. Here were the results I saw on my system, averaged over 5 runs each:

Duration (seconds) of writing and reading dataDuration (seconds) of writing and reading data

There is a clear impact to writing the data – not quite 2X, but more than 1.5X. There was a much lower delta on reading and decrypting the data – at least in these tests – but that wasn't free, either.

As far as space usage, there is roughly a 3X penalty for storing encrypted data (given the nature of most encryption algorithms, this shouldn't be shocking). Keep in mind this was on a table with only a single clustered primary key. Here were the figures:

Space (MB) used to store dataSpace (MB) used to store data

So obviously there are some penalties with using Always Encrypted, as there typically are with just about all security-related solutions (the saying "no free lunch" comes to mind). I'll repeat that these tests were performed against CTP 2.2, which may be radically different than the final release of SQL Server 2016. Also, these differences I have observed may reflect only the nature of the tests I concocted; obviously I am hoping you can use this approach to test your results against your schema, on your hardware, and with your data access patterns.

The post SQL Server 2016 : Performance Impact of Always Encrypted appeared first on SQLPerformance.com.

13 Aug 19:53

Can you still use DBCC PAGE to access data in dropped tables if TDE is turned on?

by Rob Farley

Yes.

#ShortestPostEver

@rob_farley 

 

Oh, ok... I’ll explain some more.

Transparent Data Encryption encrypts data at rest. That’s the stuff that’s on disk – the encryption happens when the data is written to disk, and the decryption happens as the data is loaded into RAM from the disk. The engine handles this so that it’s invisible to the user, applications, and so on. Without it, you can open an MDF/NDF file in a hex editor and read the contents. With it, you can’t.

Here’s an example with a database that’s not encrypted:

image

And here’s an example that is:

image

I searched for some of the text that I could see – successfully in the first, unsuccessfully in the second.

I also used SQL Server to show me the contents of a page using DBCC PAGE, and could do this successfully (once I’d closed the files in the hex editor and brought the databases back online).

image

...which also worked in both databases.

image

I had hoped this would work okay, because I figured that DBCC PAGE would have to pull the data into RAM again (remember this system was offline – the pages weren’t in RAM before), and that it would decrypt this as it did it. But I wondered if DBCC PAGE might be slightly lower-level, and bypass it somehow. I argued with myself that if TDE was indeed Transparent, it shouldn’t care... what if my application relied on using DBCC PAGE, it’s a known feature, even if it is officially undocumented (which is where my doubts set in).

But as you see, it worked okay.

But what if I dropped the table first? Would SQL then go “Hang on – this page isn’t one that I have control over any more...” and refuse to decrypt it?

No – it works just the same.

Even if you drop an object, you can still access the pages that it used until they get overwritten. You won’t be able to read them with a hex editor, but DBCC PAGE will still read them in decrypted form, letting you pull that data out.

And yes, you can even use DBCC WRITEPAGE to overwrite the bytes in their unencrypted form, so that you can use (dangerous) method of fixing corruption, even in an encrypted database. I’ve just redone my fix for Steve Stedman’s 10th Corruption Challenge, and it worked just fine on an encrypted version of the database.

It’s still T-SQL Tuesday on the topic of encryption, so I’m throwing this one into the mix for that as well.

TSQL2sDay150x150

13 Aug 19:51

Choosing hash distributed table vs. round-robin distributed table in Azure SQL DW Service

by Murshed Zaman_AzureCAT

This topic explains the various Azure SQL Data Warehouse distributed table types, and offers guidance for choosing the type of distributed table to use and when. There are two types of distributed tables in Azure SQL DW at the writing of this article, hash distributed table and round-robin distributed table.

Designing databases to use these distributed tables effectively will help you to achieve the storage and query processing benefits of the Azure SQL DW Service (SQL DW).

In SQL DW a distribution is an Azure SQL database, in which one or more distributed tables are stored. Each instance of SQL DW has many distributions. Many distributions can reside in a single instance of Azure SQL instance.

The amount of distributions are subject to change and not important for understanding this particular topic.

Hash Distributed Table Basics

A hash distributed table is a table whose rows are dispersed across multiple distributions based on a hash function applied to a column. Each SQL instance contains a group of one or more rows. The following diagram depicts how table within SQL DW gets stored as a hash distributed table.

clip_image002

When processing queries involving distributed tables, SQL DW instances execute multiple internal queries, in parallel, within each SQL instance, one per distribution.  These separate processes (independent internal SQL queries) are executed to handle different distributions during query and load processing.

A distribution column is a single column (specified at table creation time) that SQL DW uses to assign each row to a distribution. A deterministic hash function uses the value in the distribution column to assign each row to belong to one and only one distribution. Two identical column values with the same data type will be hashed the same and thus will end up in the same distribution.

In the diagram, each row in the original file is stored on one distribution. The number of rows in each distribution can vary and is usually not identical from distribution to distribution.

There are performance considerations for the selection of a distribution column, such as minimizing data skew, minimizing data movement, and the types of queries executed on the system. For example, query performance improves when two distributed tables are joined on a column that is of the same data type and size. This is called a distribution compatible join or a co-located join.  

Round-Robin Distributed Table Basics

A round-robin distributed table is a table where the data is evenly (or as evenly as possible) distributed among all the distributions without the use of a hash function. A row in a round-robin distributed table is non-deterministic and can end up in different distributions each time they are inserted.

Each JOIN to a round-robin distributed table is a data movement operation. The data movement needed to perform join operations is a separate topic and will be published as a separate blog soon.

Usually common dimension tables or tables that doesn’t distribute evenly are good candidates for round-robin distributed table.

The following diagram shows a round-robin distributed table that is stored on different distribution.

clip_image002[7]

Best Practices

In SQL DW, a user query is a logical query that gets divided into many physical queries one for each distribution. The Engine Service on the control node acts as a coordinator and waits for each of these individual queries to finish before returning results or the next part of the multi-step query is executed.

When creating a table in SQL DW, you need to decide if the table will be hash distributed or round-robin distributed. This decision has implications for query performance. Each of these distributed tables may require data movement during query processing when joined together. Data movement in MPP RDBMS system is an expensive but sometimes unavoidable step. In my 8 plus years of working with MPP data warehouse I haven’t seen a real customer workload that can completely eliminate data movement. The objective of a good data warehouse design in SQL DW is to minimize data movement so let’s keep that in mind while choosing table design.

Here are considerations for choosing to use a round-robin distributed table or a hash distributed table:

1. To choose a good distribution design with SQL DW, one should know their data, DDL and queries. This is not unique to SQL DW but for most MPP RDBMS system. You need to minimize data movement queries but also watch out for data that can heavily skew a certain distribution. If one of the distribution has more data than others, it will be the slowest performing distribution. Since SQL DW queries are as fast as its slowest distribution, we need to take notes of any data-heavy (skewed or hot) distribution for the same table.

2. A nullable column is a bad candidate for any hash distributed table. All null columns are hashed the same and thus the rows will end up on the same distribution creating a skewed (hot) distribution. If most of the columns are null able and no good hash distribution can be achieved, that table is a good candidate for round-robin distribution. Choose ‘not null’ columns when creating table that will be hash distributed.

3. Any fact tables that has a default value in a column is also not a good candidate to create a hash distributed table. DW Developers will sometime assign -1 value to an otherwise unknown value or early arriving values for a fact table. These values will create data skew on a particular distribution. Avoid these kind of default value column unless you know for sure that the -1 values are negligible in your data.

4. Large fact tables or historical transaction tables are usually stored as hash distributed tables. These tables usually have a surrogate key that is monotonically increasing and are used in JOIN conditions with other fact and dimension tables. These surrogate keys are a good candidate for distributing the data as there are many unique values in that column. This allows the query operations to be performed across all distributions. Each distribution can work independently on separate subsets of data. This takes advantage of the processing resources across the MPP system. Queries on distributed tables may require data movement between distributions during query execution and that is okay.

5. Dimension tables or other lookup tables in a schema can usually be stored as round-robin tables. Usually these tables connects to more than one fact tables and optimizing for one join may not be the best idea. Also usually dimension tables are smaller which can leave some distributions empty when hash distributed. Round-robin by definition guarantees a uniform data distribution.

6. If you are unsure of query patterns and data, you can start with all tables in round-robin distribution. And as you learn the patterns the data can be easily redistributed on a hash key.

7. When using ‘group by’ SQL DW will shuffle the data on the group by key. When multiple keys are present and statistics is up-to-date SQL DW’s cost based optimizer will pick the right key to shuffle the data. If this group by key is heavily non-unique then the query will be slower. A worst case example would be grouping by gender of a large customer table. If your query is running slower, look into explain plan (add the word ‘explain’ before your query and execute) to find out what key is being used as the shuffle key. There may or may not be anything you can do to change this based on the query.

13 Aug 19:51

5 Ways to Grow Your Business with Big Data

by A.R. Guess

by Angela Guess Jonha Revesencio recently wrote for Smart Data Collective, “Looking at the future of big data, around 53-percent of Internet experts and observers believe that it will cause a huge positive impact for society in nearly all aspects. It’s useful in a way that it can create transparency and allow better analysis of…

The post 5 Ways to Grow Your Business with Big Data appeared first on DATAVERSITY.

13 Aug 19:50

How not to do a Masters degree

by Gail
Mrdenny

now

There are many, many guides to successfully completing a post grad degree, so I am not going to add to them. Instead, this is a list of things, based on personal experience, that you can do to make it somewhere between very difficult and impossible to complete that Masters or Doctorate degree.

To be clear, I’m talking about the degrees which are mostly, if not entirely, based on research and a dissertation, not coursework.

Do the degree while holding down a full-time job

Initially this looks like it’ll work out fine. Work Monday to Friday, work on the degree on Saturday and Sunday. For the first few months it does work fine.

But there’s one thing that a dissertation requires and that’s a large amount of dedicated time. Time to read the literature. Time to come up with the hypothesis or research questions. Time to design an experiment. Time to conduct that experiment. Time to revise the experiment, conduct it again, revise again, conduct again… Time to analyse the results. Time to write up the results. Time to edit, rewrite, edit, rewrite, edit…

As the months pass, and as the initial enthusiasm and fun fades, so it becomes harder and harder to spend most of the weekend on the research, weekend after weekend after weekend, for a couple of years

Sure, it’s possible to complete a Masters degree while working full time, but it’s like playing a game on Insane difficulty level.

Decide that you want to get heavily involved in the SQL community

The SQL community are a great bunch of people and there’s a lot of encouragement to jump in and get involved, for many good reasons

Posting on forums is great in many ways, it boosts your confidence, it gets you recognitions and it’s a good way to get to know features you’ll never use in your regular job.

Blogging and writing are a great way to share knowledge, and there’s a thrill from watching the page hit count go up, from the first comment, from the complements, especially when your blog post gets referenced by others as the authoritative article on a subject.

But it takes time. Lots of time. Articles can take days of work, blog posts can take anything from a few minutes to many hours depending on the subject and the depth of the post. Presenting takes lots and lots of prep time. An hour-long presentation can easily require a day of prep, and that’s once you’ve done several presentations. The first one can take many days of writing slides, rewriting slides, writing demos and rehearsing the presentation several times. Forums will take every minute you’re willing to give to them and more, and there’s the constant temptation of ‘just one more post…’

And where’s that time going to come from? The time that would otherwise have been spent on the research and dissertation.

Get burned out, and don’t seek help

I’m probably going to get flak for this, but it has to be said.

As an industry in general, we are too reluctant to ask for help. I don’t mean technical help (though that too in many cases).

We are too eager to put on a pedestal the person who works 12 hours a day, 6 days a week, the person who pulls all-nighters on a regular basis, the person who never takes vacation because he ‘doesn’t have time’. We’re conditioned to see extreme hours worked as a sign of dedication, of what is needed to get ahead.

It’s not heroic. It’s not required. It’s not something to be admired.

It’s stupid.

Long days are sometimes required, weekend work is sometimes necessary, but they both should be the exception, not the norm. Excessive overtime, if needed to meet deadlines, should be followed with a discussion on what went wrong such that the overtime was required. Was the estimation inadequate? Was the project analysis flawed? Did the scope creep (or gallop)? Were people working on multiple projects at the same time while the project plan expected them to be dedicated? Were assumptions not met or essential infrastructure delayed?

If overtime is frequently required, then management has failed at their job. A developer working 12 hours a day, 6 days a week for months on end is not a sign of dedication or heroics, it’s a sign that someone, somewhere in management is not doing their job properly.

This kind of workload and times are not sustainable. They lead to mistakes and buggy code, they lead to burnout and they lead to people, top people, walking away and never coming back.

I wrote about burnout a few years ago. Looking back now, after another recent brush with it, I don’t think I went far enough in what I said there.

It is not a flaw, it is not a weakness, it is not admitting failure, to seek professional help for burnout. It’s far better to do so than to suffer for years with the effects.

And to get this back on topic, trying to work through burnout is like trying to swim through syrup. Lots of effort, little progress, very easy to give up and stop trying. That’s not going to do wonders for that research and the couple-hundred page dissertation that needs to be done.


Despite all of that, in case anyone is wondering, graduation is mid-September

13 Aug 19:47

Hadoop and Microsoft

by James Serra

In my Introduction to Hadoop I talked about the basics of Hadoop.  In this post, I wanted to cover some of the more common Hadoop technologies and tools and show how they work together, in addition to showing how they work well with Microsoft technologies and tools.  So you don’t have to choose between going with Open-source software (OSS) and going with Microsoft.  Instead, they can be combined and work together very nicely.

In short, Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.  The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common – Contains libraries and utilities needed by other Hadoop modules
  • Hadoop Distributed File System (HDFS) – A distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
  • Hadoop MapReduce – A programming model for large scale data processing.  It is designed for batch processing.  Although the Hadoop framework is implemented in Java, MapReduce applications can be written in other programming languages (R, Python, C# etc).  But Java is the most popular
  • Hadoop YARN – YARN is a resource manager introduced in Hadoop 2 that was created by separating the processing engine and resource management capabilities of MapReduce as it was implemented in Hadoop 1 (see Hadoop 1.0 vs Hadoop 2.0).  YARN is often called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop

The term “Hadoop” has come to refer not just to the base modules above, but also to the “ecosystem”, or collection of additional software packages that can be installed on top of or alongside Hadoop.  There are hundreds of such software packages (see the Apache Projects Directory) that are frequently updated, making it difficult to build a Hadoop solution that uses multiple software packages.  Making things much easier are companies that provide “Hadoop in a box”.  These are ready-to-use platforms that contain numerous Hadoop software packages pre-installed and where all the packages are tested to ensure they all work together.  These include Microsoft’s HDInsight, Hortonworks Data Platform (HDP), Cloudera’s CDH, and MapR’s Distribution.

Microsoft’s HDInsight is simply a managed Hadoop service on the cloud (via Microsoft Azure) built on Apache Hadoop that is available for Windows and Linux.  Microsoft Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying and managing applications and services through a global network of Microsoft-managed and Microsoft partner hosted datacenters.  Behind the covers HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.  Although there are many software packages installed by default on a HDInsight distribution, you can add any additional packages if you wish via a Script Action (see Customize HDInsight clusters using Script Action).  With HDInsight you can easily build a Hadoop cluster in the cloud in minutes.  The cluster will consist of virtual machines with the data stored separately in Azure Storage Blobs instead of HDFS (see Why use Blob Storage with HDInsight on Azure).  The great thing about this is you can shut down the cluster and the data will remain (or you can store the data in the native HDFS file system that is local to the compute nodes but would lose the data if you deleted your cluster).  This can result in substantial savings as you can do things like shut down the cluster on weekends if it is not needed.  Since HDInsight is 100 percent Apache-based and not a special Microsoft version, this means as Hadoop evolves, Microsoft will embrace the newer versions.  Moreover, Microsoft is a major contributor to the Hadoop/Apache project and has provided a great deal of time and effort into improving various Hadoop tools.

Shown below is the software packages and their versions included in the Hortonworks Data Platform:

Asparagus-Chart

HDInsight is currently using HDP 2.2 and will be upgraded in the next couple of months to HDP 2.3 (see What’s new in the Hadoop cluster versions provided by HDInsight?).  Microsoft also has a new partner program for Hadoop independent software vendors (ISVs) to expand their offerings on to HDInsight (see Azure HDInsight launches ISV Partner Program at WPC).

Pictured below is the HDInsight platform that shows how the most popular Hadoop software packages fit together:

  • Hive – A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.  The query language is similar to standard SQL statements and is called Hive Query Language (HQL).  It converts queries to MapReduce, Apache Tez and Spark jobs (all three execution engines can run in Hadoop YARN).  This makes querying data much easier as you are writing SQL instead of Java code
  • Storm – A distributed real-time computation system for processing fast, large streams of data, adding real-time data processing to Apache Hadoop
  • HBase – A scalable, distributed NoSQL database that supports structured data storage for large tables
  • Tez – Provides API’s to write YARN apps.  It is a replacement for MapReduce since it is much faster.  It will process data for both batch and interactive use-cases
  • Spark – Spark is intended to be a drop in replacement for Hadoop MapReduce providing the main benefit of improved performance.  It does not require YARN.  It is up to a 100x faster than MapReduce and can also perform real-time analysis of data and supports interactive processing.  It can run on HDFS as well as the Cassandra File System (CFS)

Untitled picture

Data in a Hadoop solution can be processed either in a batch or in real time.

For batch processing, your choices are MapReduce or Spark.  MapReduce was the first and only way for a long time, but now Spark has become the best choice.  Spark has a number of benefits over MapReduce such as performance, a unified programming model (can be used for both batch and real-time data processing), richer and simper API, and multiple datastore support.

For real-time processing, the most popular tools to use are Kafka and Storm for event queuing, and Cassandra or HBase for storage.  These are all OSS solutions but there are many Microsoft equivalent products should you choose to go that route (with most of the Microsoft products easier to use than the OSS products).  The equivalent on the Microsoft stack is Azure Event Hub for Kafka, Azure Stream Analytics for Storm, and Azure blob storage/Azure Data Lake for HBase.

AzureStreamAnalytics_Page

Note that Spark is quickly becoming the tool of choice over Storm.  The difference between Hadoop and Storm in the way they work is that Storm is processing an event at a time versus Hadoop or Spark, which is processing a batch at a time.  The batch can be large, the batch can be small, or the batch can be really small, which is called a micro-batch.  The paradigm in Spark that processes a micro-batch is called Spark streaming.

Storm & Trident (Trident is a high-level abstraction for doing real-time computing on top of Storm) were previously the most popular solution for real-time streaming, but now the most popular is Spark Streaming (an abstraction on Spark to perform stateful stream processing).  Spark is also supported on HDInsight (see Announcing Spark for Azure HDInsight public preview).

spark1_png-550x0

I hope this has provided a clearer picture to you on which Hadoop technologies are the most popular for building solutions today and how Microsoft supports those technologies via HDInsight.  You can learn more about HDInsight and try it for free by clicking here.

13 Aug 19:47

Using the SSMS ConnectionDialog

by psssql

I was attempting to add the SSMS connection dialog to my utility and ran into problems with referenced assemblies.

The ConnectionDialog is documented here: https://msdn.microsoft.com/en-us/library/microsoft.sqlserver.management.ui.connectiondlg.aspx

The following is a snippet using the SSMS ConnectionDialog in a C# application.

Microsoft.SqlServer.Management.UI.ConnectionDlg.

ConnectionDialog dlg = new Microsoft.SqlServer.Management.UI.ConnectionDlg.ConnectionDialog();

Microsoft.SqlServer.Management.Smo.RegSvrEnum.

UIConnectionInfo connInfo = new Microsoft.SqlServer.Management.Smo.RegSvrEnum.UIConnectionInfo { ApplicationName = "My App" };

IDbConnection conn;
dlg.AddServer(
new Microsoft.SqlServer.Management.UI.ConnectionDlg.SqlServerType());
dlg.Text =
"Connect to My Database";

DialogResult dr = dlg.ShowDialogValidateConnection(this, ref connInfo, out conn);
if (DialogResult.OK == dr)
{
   m_strConnString = conn.ConnectionString +
";Pooling=TRUE";
  
if(false == m_strConnString.Contains("Database="))
      m_strConnString +=
";Database=MyDb";

   bRC =
true;
}
else
{
  
bRC = false;
}

To compile the properly, references to the RegSvrEnum and ManagementControls are required.   I compiled it on my system and provided it to a peer, who quickly caused the application to fail with missing assembly references.

I built the application using the SQL Server SSMS 2015 July Preview and they had SQL Server 2014 on their system.   No problem I made sure both assemblies were in the same directory as my application but this still failed.

Following the trail of missing assemblies I had to provide the following in order for the application to execute.

  • Microsoft.SqlServer.Management.Controls
  • Microsoft.SqlServer.Management.UserSettings
  • Microsoft.SqlServer.RegSvrEnum
  • SqlWorkbbench.Interfaces

There is not a redistributable package containing these assemblies.  The supported way is to install the matching SSMS package on the target system.   SSMS can be installed separately using the following link: https://msdn.microsoft.com/en-us/library/mt238290.aspx?f=255&MSPPError=-2147217396  

Bob Dorr - Principal SQL Server Escalation Engineer

13 Aug 19:47

Keep an Eye Out for Fake Data Scientists

by A.R. Guess

by Angela Guess Data expert Bernard Marr recently wrote, “Have you noticed how many people are suddenly calling themselves data scientists? Your neighbour, that gal you met at a cocktail party — even your accountant has had his business cards changed! There are so many people out there that suddenly call themselves ‘data scientists’ because…

The post Keep an Eye Out for Fake Data Scientists appeared first on DATAVERSITY.

13 Aug 19:46

Will Smart Devices Cost Us Our Privacy, Perhaps Our Safety?

by A.R. Guess

by Angela Guess Craig Spiezle writes for TechCrunch, “‘Smart’ devices are all the buzz, whether in the connected home (thermostats, lightbulbs, garage door openers, locks and various appliances) or new wearable devices. They promise convenience along with improved control and efficiency in our lives. But, as highlighted by the recent hacking of automobiles, connectivity can…

The post Will Smart Devices Cost Us Our Privacy, Perhaps Our Safety? appeared first on DATAVERSITY.