Hashing within the Azure Data landscape

The research

I’ve did a small research about what’s the best option is for hashing (e.g. determining deltas or hashing concatenated natural keys). Up until now, the algorithm used for hashing is MD5. Please note that we’re not discussing the hashing of passwords. Want to read more? https://www.linkedin.com/feed/update/urn:li:activity:6295982490112925696

 

Which hashing algorithm should we use?

MD5 has a minimal performance impact and the output (VARBINARY or CHAR(32)) has the lowest storage footprint. But…as of SQL Server 2016, all hashing algorithms except SHA2_256 and SHA2_512 are deprecated:

“…Beginning with SQL Server 2016, all algorithms other than SHA2_256, and SHA2_512 are deprecated. Older algorithms (not recommended) will continue working, but they will raise a deprecation event…”

https://docs.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql

 

Basically Microsoft is saying good bye to all other/older algorithms. When you look at the current Azure data landscape, you bet that for the near future you’re unable to continue using MD5 in your Data Warehouse or Data Mart (SQL Server 2017+, Azure SQL DB and Azure SQL DWH).

So… SHA2_256 it is for hashing keys. I guess for hashing passwords or other sensitive values, use SHA2_512 in combination with salt.

 

Wouldn’t that impact the (ETL) performance?

Yep, unfortunately. But please keep in mind that we just need to continue evolving and Microsoft is in charge 🙂 Just make sure that you only need to hash values once, e.g. in your staging/landing area.

 

How should we use it?

Within SQL Server, Azure SQL DWH or Azure SQL DB, you’ll probably use HASHBYTES() to calculate the hash and it returns a VARBINARY. Personally I’m not a big fan of storing ‘varbinaries’ in a DWH, that why I’m storing it as a CHAR(). I’m not sure whether VARBINARY is supported in all of Azure’s services (ML, etc.?).

Depending on the hashing algorithm, HASHBYTES() will return hashes with different sizes/lengths (bytes):



 


 


 

The VARBINARY values are:



 


 

Now…how to cast/convert it to CHAR()? On LinkedIn we had a short discussion about using CAST() or CONVERT(), since they output different strings. My advice would be to use CONVERT(), since it represents the actual VARBINARY-value (with or without the ‘0x’-prefix).

If you want to have the prefix included in the output string, use:
SELECT CONVERT(CHAR(),HASHBYTES('',''),(1))

Personally I think that you can lose the prefix, because you can always concatenate it if needed. So:
SELECT CONVERT(CHAR(),HASHBYTES('',''),(2))



 


 

Now when you’re converting the VARBINARY to a CHAR(xx), the size/length increases:



 


 

Conclusion

So basically it all come to:

Generate a SHA2_256 hash like SELECT CONVERT(CHAR(64),HASHBYTES('SHA2_256',@TESTVAL),(2))
Generate a SHA2_512 hash like SELECT CONVERT(CHAR(128),HASHBYTES('SHA2_512,@TESTVAL),(2))

Store a SHA2_256 hash as CHAR(64)
Store a SHA2_512 hash as CHAR(128)

Unless Microsoft decides to support MD5 or SHA1 within SQL Server vNext / HASHBYTES(), my advice will be to start using SHA2_256 for key hashes.

In case you don’t need to combine Azure services and just stick to a ‘normal’ DWH on SQL Server, storing a SHA2_256 as BINARY(32) would be the best choice. BINARY() as a datatype could not be supported in a specific service or tool.

Happy hashing! 🙂

 


 


 

Advertisements

Two handy queries which can help you find things in your database


--Search for a specific string in object-definitions:
SELECT DISTINCT
o.name AS Object_Name
,o.type_desc
FROM sys.sql_modules m
INNER JOIN sys.objects o ON m.object_id = o.object_id
WHERE m.definition Like '%SELECT%'

--Find a column within the database:
SELECT
t.name AS TableName,
SCHEMA_NAME(schema_id) AS SchemaName,
c.name AS ColumnName
FROM sys.tables AS t
INNER JOIN sys.columns c ON t.OBJECT_ID = c.OBJECT_ID
WHERE c.name LIKE '%Customer%'
ORDER BY SchemaName, ColumnName









SSRS Dashboarding: a ‘webdesign-look’

Couple of weeks ago, I shared my SSRS dashboard with the community on LinkedIn:

SSRS Dashboard shared on LinkedIn














Koos van Strien asked for the source (RDL) and restyled the dashboard with his ‘webdesign-look’. Pretty nice!

SSRS Dashboard response on LinkedIn

















Koos discusses the changes he made to my dashboard on his blog:
http://www.msbiblog.com/2016/01/28/ssrs-non-ugly-5-the-nitty-griddy-details/#more-402

Here are both dashboards in full size:

SSRS Dashboard Clint Huijbers

 

SSRS Dashboard Koos van Strien



 


 


 

Pause SQL Server service before restarting

Did you know that by pausing the SQL Server service before restarting the instance we allow end users to continue their work uninterrupted and we also stop any new connections to the instance? This is a nicer way of telling people to “get out” of the database in order for the server to be rebooted. I wouldn’t leave the server paused for 60 minutes of course, but I would rather use this method than forcibly disconnect users and rollback their transactions.

When a server is paused you will see messages similar to this in the SQL Server error log:

Error: 17142, Severity: 14, State: 0.
SQL Server service has been paused. No new connections will be allowed. To resume the
service, use SQL Computer Manager or the Services application in Control Panel.

Error: 18456, Severity: 14, State: 13.
Login failed for user ''. Reason: SQL Server service is paused.
No new connections can be accepted at this time. [CLIENT: ]

Next time you are worried about rebooting during the day think about the pause button instead. It might be a nice compromise for your end-users.

Thank Thomas Larock for his blog post: http://thomaslarock.com/2016/01/pause-sql-server-service-before-restarting/








How to calculate the number of working minutes…FAST :)

This is always a performance thing, ETL-wise, but also in stored procedures.
I’ve published an example on how to calculate the number of working days (rounded) via a scalar-valued function:
Function: Calculate working days

This example however calculates the working minutes between dates, while taking into account the company and/or national holidays and weekends.
You could extend the functionality to also include local office hours.

Three examples:
1) The number of working minutes between a given date/time-value and the present (NOW)
2) The number of working minutes between two given date/time-values
3) The number of working minutes between two given date/time-values from a large dataset

Number-of-working-minutes-between-dates-sql-server-set-based

Testing results:
I ran a test on a large dataset which included actual order dates and after the so called ‘WorkingMinutes’-table was made in 4~5 seconds, it took only 3 seconds to process more than 1 million records!

Download the SQL script here:
WorkingMinutes-SQL-Examples








SSIS: Extract package information (e.g. DataFlowTaskName, TaskDescription, PrecedenceConstraint, etc.)

Found this great blog post from Saravanan:
http://sqlmall.wordpress.com/2014/05/29/extract-ssis-2012-package-data/

He describes a way to search through the XML coding of your SSIS packages, to for example, look for a certain DataFlow which is called ‘RAW_DWH_Main‘.

Step 1) Enable FILESTREAM via SQL Server Configuration Manager

http://sqlserverscribbles.com/2013/07/09/how-to-enable-and-configure-filestream-in-sql-server-2008-2012/
http://msdn.microsoft.com/en-us/library/cc645923.aspx

Step 2) Create the FileTable (and a new database)
NOTE: Make sure to fill in an existing path for the directory.


EXEC sp_configure filestream_access_level, 2
RECONFIGURE
GO

CREATE DATABASE PackageDefinition
ON PRIMARY
(Name = PackageDefinition,
FILENAME = 'E:\FileTableDB\PackageDefinition.mdf'),
FILEGROUP FTFG CONTAINS FILESTREAM
(NAME = Package,
FILENAME='E:\FileTableDB\PackageDefinition')
LOG ON (Name = PackageDefinitionLog,
FILENAME = 'E:\FileTableDB\PackageDefinitionLog.ldf')
WITH FILESTREAM (NON_TRANSACTED_ACCESS = FULL,
DIRECTORY_NAME = N'PackageDefinition');
GO

USE PackageDefinition;
CREATE SCHEMA FT
GO

CREATE TABLE FT.PackageDefinition AS FileTable
WITH
(FileTable_Directory = 'PackageDirectory');
GO

Here’s your new FileTable:

Search-in-ssis-packages-with-sql-filetable-structure


Explore the directory and copy/paste your SSIS packages here

Search-in-ssis-packages-with-sql-filetable-explore-directory


Now copy (or perhaps move/relocate) your TFS to this directory.

Almost done…run the query!

;WITH XMLNAMESPACES
(
'www.microsoft.com/SqlServer/Dts' AS DTS
),

PackageDefinition
AS
( /*Query the file content([file_stream]) from filetable*/
SELECT *
,FileTableRootPath() AS RootPath
,file_stream.GetFileNamespacePath() as FilePath
,path_locator.GetLevel() AS PathLevel
,PackageData = CAST(CAST([file_stream] AS VARBINARY(MAX)) AS XML )
FROM [FT].[PackageDefinition]
WHERE file_type = 'dtsx' --SSIS Packages
--AND Name LIKE '%RAW_DWH_Main%' --FileName
AND file_stream.GetFileNamespacePath() LIKE '\PackageDirectory\DWH.SSIS\%' --Project folder
AND path_locator.GetLevel() = 2

)

SELECT r.*
FROM
(
SELECT
RootPath
,FilePath
,PathLevel
,PackageName = Exe.pkg.value ('@DTS:ObjectName' ,'VARCHAR(250)')
,CreationDate = Exe.pkg.value ('@DTS:CreationDate' ,'DATETIME2(0)')
,CreatorName = Exe.pkg.value ('@DTS:CreatorName' ,'VARCHAR(250)')
,DataFlowTaskName = Task.pkg.value ('@DTS:Description' ,'VARCHAR(250)')
,TaskDescription = Task.pkg.value ('@DTS:ObjectName' ,'VARCHAR(250)')
,PrecedenceConstraint = '['+REPLACE((pc.pkg.value ('@DTS:From','VARCHAR(250)')),'Package\','')+'] ' +'---->' + '['+REPLACE((pc.pkg.value ('@DTS:To' ,'VARCHAR(250)')),'Package\','') +']'
,refId = Con.pkg.value ('@DTS:refId' ,'VARCHAR(250)')
,ConnectionProvider = Con.pkg.value ('@DTS:CreationName' ,'VARCHAR(250)')
,ConnectionManagerName = Con.pkg.value ('@DTS:ObjectName' ,'VARCHAR(250)')
,ConnectionString = Par.pkg.value ('@DTS:ObjectName' ,'VARCHAR(250)')
,ParameterName = CM.pkg.value ('@DTS:ConnectionString' ,'VARCHAR(250)')
,ParameterValue = Par.pkg.value ('.' ,'VARCHAR(250)')
,VariableName = Var.pkg.value ('@DTS:ObjectName' ,'VARCHAR(250)')
,VariableValue = Var.pkg.value ('.' ,'VARCHAR(250)')
,PackageData
FROM PackageDefinition
OUTER APPLY PackageData.nodes('/DTS:Executable') AS Exe(pkg)
OUTER APPLY Exe.pkg.nodes('DTS:Executables/DTS:Executable') AS Task(pkg)
OUTER APPLY Exe.pkg.nodes('DTS:PackageParameters/DTS:PackageParameter') AS Par(pkg) --Will only return data while using the 'Package Deployment'-model I guess.
OUTER APPLY Exe.pkg.nodes('DTS:Variables/DTS:Variable') AS Var(pkg)
OUTER APPLY Exe.pkg.nodes('DTS:ConnectionManagers/DTS:ConnectionManager/DTS:ObjectData/DTS:ConnectionManager') AS CM(pkg)
OUTER APPLY Exe.pkg.nodes('DTS:ConnectionManagers/DTS:ConnectionManager') AS Con(pkg)
OUTER APPLY Exe.pkg.nodes('DTS:PrecedenceConstraints/DTS:PrecedenceConstraint') AS pc(pkg)
) AS r
WHERE r.DataFlowTaskName LIKE '%MyDataFlow%'
ORDER BY FilePath



The results:

Search-in-ssis-packages-with-sql-filetable-xml-query-results


Important note:
The query doesn’t return any nested tasks, you may need to use a LIKE-operator for that:
WHERE CAST(PackageData AS NVARCHAR(MAX)) LIKE '%%'

If you have a solution for this, please let me know or post the solution in the comments. Thanks!


More info about FileTables:
http://blog.tallan.com/2012/01/01/its-a-file-system-its-a-database-table-its-sql-server-denali-filetable/

How-filetables-work-more-info-sql-server








How To: Build a (SQL) Server Inventory

Thanks to Brent Ozar for this one.

He describes how you could easily populate a list of (SQL) Servers by using one or a combination of these tools:

How to Survey Your Network for Servers
Put a row in the spreadsheet for every server you have – whether you’re in charge of it or not. We want to start with a good inventory of what we have, and there’s two good free tools to do it.

Microsoft Assessment and Planning Toolkit – it’s actually designed for licensing compliance, but it works great for building server inventories. It scans your network looking for whatever programs you pick, but just confine it to SQL Servers only.

Dell Discovery Wizard for SQL Server – it’s a GUI tool that pings all the servers in your network and tries to figure out if they’ve got SQL Server installed. If you’re in a small shop where your account has admin privileges in the domain, you might find a lot more servers than you expected.

SQL Power Doc – open source PowerShell scripts to discover, document, and diagnose your SQL Servers. I’m not gonna lie: this one is not for the faint of heart, but if you’re willing to get your hands dirty and deal with some manual work, the end result is a nice set of Excel spreadsheets.