Hashing within the Azure Data landscape

The research

I’ve did a small research about what’s the best option is for hashing (e.g. determining deltas or hashing concatenated natural keys). Up until now, the algorithm used for hashing is MD5. Please note that we’re not discussing the hashing of passwords. Want to read more? https://www.linkedin.com/feed/update/urn:li:activity:6295982490112925696

 

Which hashing algorithm should we use?

MD5 has a minimal performance impact and the output (VARBINARY or CHAR(32)) has the lowest storage footprint. But…as of SQL Server 2016, all hashing algorithms except SHA2_256 and SHA2_512 are deprecated:

“…Beginning with SQL Server 2016, all algorithms other than SHA2_256, and SHA2_512 are deprecated. Older algorithms (not recommended) will continue working, but they will raise a deprecation event…”

https://docs.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql

 

Basically Microsoft is saying good bye to all other/older algorithms. When you look at the current Azure data landscape, you bet that for the near future you’re unable to continue using MD5 in your Data Warehouse or Data Mart (SQL Server 2017+, Azure SQL DB and Azure SQL DWH).

So… SHA2_256 it is for hashing keys. I guess for hashing passwords or other sensitive values, use SHA2_512 in combination with salt.

 

Wouldn’t that impact the (ETL) performance?

Yep, unfortunately. But please keep in mind that we just need to continue evolving and Microsoft is in charge 🙂 Just make sure that you only need to hash values once, e.g. in your staging/landing area.

 

How should we use it?

Within SQL Server, Azure SQL DWH or Azure SQL DB, you’ll probably use HASHBYTES() to calculate the hash and it returns a VARBINARY. Personally I’m not a big fan of storing ‘varbinaries’ in a DWH, that why I’m storing it as a CHAR(). I’m not sure whether VARBINARY is supported in all of Azure’s services (ML, etc.?).

Depending on the hashing algorithm, HASHBYTES() will return hashes with different sizes/lengths (bytes):



 


 


 

The VARBINARY values are:



 


 

Now…how to cast/convert it to CHAR()? On LinkedIn we had a short discussion about using CAST() or CONVERT(), since they output different strings. My advice would be to use CONVERT(), since it represents the actual VARBINARY-value (with or without the ‘0x’-prefix).

If you want to have the prefix included in the output string, use:
SELECT CONVERT(CHAR(),HASHBYTES('',''),(1))

Personally I think that you can lose the prefix, because you can always concatenate it if needed. So:
SELECT CONVERT(CHAR(),HASHBYTES('',''),(2))



 


 

Now when you’re converting the VARBINARY to a CHAR(xx), the size/length increases:



 


 

Conclusion

So basically it all come to:

Generate a SHA2_256 hash like SELECT CONVERT(CHAR(64),HASHBYTES('SHA2_256',@TESTVAL),(2))
Generate a SHA2_512 hash like SELECT CONVERT(CHAR(128),HASHBYTES('SHA2_512,@TESTVAL),(2))

Store a SHA2_256 hash as CHAR(64)
Store a SHA2_512 hash as CHAR(128)

Unless Microsoft decides to support MD5 or SHA1 within SQL Server vNext / HASHBYTES(), my advice will be to start using SHA2_256 for key hashes.

In case you don’t need to combine Azure services and just stick to a ‘normal’ DWH on SQL Server, storing a SHA2_256 as BINARY(32) would be the best choice. BINARY() as a datatype could not be supported in a specific service or tool.

Happy hashing! 🙂

 


 


 

Advertisements

Ensemble Modeling Forms (DWH Models)

This is a copy/paste post of Hans Hultgren‘s blog post about the forms of Ensemble Modeling:
https://hanshultgren.wordpress.com/2013/06/16/ensemble-modeling-forms/

Ensemble Modeling Forms: Modeling the Agile Data Warehouse

Ensemble-modeling


Anchor Modeling. Data Vault Modeling. Focal Point Modeling. To name a few. In fact there are dozens of data warehouse data modeling patterns that have been introduced over the past decade. Among the top ones there are a set of defining characteristics. These characteristics are combined in the definition of Ensemble modeling forms (AKA Data Warehouse Modeling). See coverage notes on the Next Generation DWH Modeling conference here (and summary here).

The differences between them define the flavors of Ensemble Modeling. These flavors have vastly more in common than they have differences. When compared to 3NF or Dimensional modeling, the defining characteristics of the Ensemble forms have an 80/20 rule of commonality.

All these forms practice Unified Decomposition (breaking concepts into component parts) with a central unique instance as a centerstone (Anchor, Hub, Focal Point, etc.).
Each separates context attributes into dedicated table forms that are directly attached to the centerstone.
Each uncouples relationships from the concepts they seek to relate.
Each purposefully manages historical time-slice data with varying degrees of sophistication concerning temporal variations.
Each recognizes the differences between static and dynamic data.
Each recognizes the reality of working with constantly changing sources, transformation and rules.
Each recognizes the dichotomy of the enterprise-wide natural business key.
From that foundation of commonality, the various forms of Ensembles begin to take on their own flavors.

While Data Vault is foundationally based on the natural business key as the foundation of the centerstone (Hub), both Anchor and Focal Point center on a unique instance of a concept where the business key is strongly coupled but separate from the centerstone (Anchor, Focal Point).
Both Data Vault and Anchor aim to model Ensembles at the Core Business Concept level while Focal Point tends to deploy slightly more abstracted or generic forms of concepts.
Data Vault and Focal Point utilize forms of attribute clusters (logical combinations) for context while Anchor relies on single attribute context tables.
And there are other differentiating factors as well.

There is one thing that we can all agree on: modeling the agile data warehouse means applying some form of Ensemble modeling approach. The specific flavor choice (Data Vault, Anchor, Focal Point, etc.) should be based on the specific characteristics of your data warehouse program.

* Learn more on Anchor Modeling with Lars Rönnbäck here: Anchor Modeling

More info about Ensemble Modeling: https://hanshultgren.wordpress.com/2012/11/20/ensemble-modeling/








SQLTimer: a SQL Server Timer

the-science-of-timing

thinking_animated_smiley_emoticon

So…you’ve got a fancy Business Intelligence (BI) solution? Predictive Analytics or perhaps Actionable Analytics? Ever thought about the right time to contact your (potential) customers?



SQLTimer might be a handy addition to your BI Suite!
Like a timer for your lighting, it checks every hour whether it’s the right moment to execute jobs for categories like:

BusinessHour: Indicates whether or not it’s the right time to contact your customer during business hours (included in the SQLTimer script)

Other additional time schedules which are downloadable:
Twitter: Indicates whether or not it’s the right time to contact your Twitter users
FaceBook: Indicates whether or not it’s the right time to contact your FaceBook users
LinkedIn: Indicates whether or not it’s the right time to contact your LinkedIn users
GooglePlus: Indicates whether or not it’s the right time to contact your Google+ users
Pinterest: Indicates whether or not it’s the right time to contact your Pinterest users

In this blogpost, I’ll be diving into the SQLTimer itself and the free time schedule called ‘BusinessHour’ (included in the script).

Download the complete script free on:
SQLTimer – How to install


How it works:
Just like the timer for your lightbulb, it is configured on an hourly basis:

SQL-Timer-Execute-TimeSchedule

The values indicate whether it is a bad, better or even the best time to contact your business on a weekly basis:

BusinessHour values:
0) = Outside of business hours
1) = Inside of business hours
2) = Preferred hours

Important note: SQLTimer is setup on a day-to-day basis and thus you might tweak the time schedule a bit for local holidays or other important days in the year:

SQL-Timer-TimeSchedule

SQLTimer objects:
(An existing database is required)
Let’s have a look at all the SQLTimer objects:

SQL-Timer-Objects

*** Tables ***

timer.TimerLogging
This table contains the logging (when a time schedule was executed (StartTime/EndTime) and what value the time schedule had)

SQLTimer’s logging in timer.TimerLogging:

SQL-Timer-Logging

timer.TimeSchedule
This table actually contains the time schedule itself

timer.TimeScheduleConfiguration
This table is all about the configuration

SQLTimer’s configuration in timer.TimeScheduleConfiguration
In case you would like to change the name of the schema, don’t forget to change the configuration settings:

SQL-Timer-TimeScheduleConfiguration


*** Stored Procedures ***

timer.usp_Hourly_Timer
This SP is the timer itself and needs to be scheduled in SQL Agent to be run on an hourly basis

timer.usp_Select_TimeSchedule
You might want to use the timer.usp_Select_TimeSchedule() stored procedure to analyze the time schedule for the coming week:

SQL-Timer-Execute-Select-TimeSchedule

timer.usp_Execute_BusinessHour_Jobs
This SP will execute the preferred jobs (for example: T-SQL code, SP’s or SQL Agent Jobs) for the time schedule ‘BusinessHour’

Here’s how the timer.usp_Execute_BusinessHour_Jobs looks like.
You need to add your T-SQL code, SP’s or SQL Agent Jobs here that need to be executed if the @TimeScheduleValue reaches a certain value.

SQL-Timer-Execute-BusinessHour-Jobs


*** Install the SQLTimer ***

So far about the SQLTimer itself, it’s time to install it!
1) Download the T-SQL script here: SQLTimer – How to install

2) Unzip the ZIP-file and open the *.sql file in SSMS or SSDT

3) Change the database name and uncomment the T-SQL code:

SQL-Timer-T-SQL-Script

4) Execute the T-SQL script
Output of the script:

SQL-Timer-T-SQL-Script-executed

5) Create a SQL Agent Job (Execute T-SQL statements) which executes the timer.usp_Hourly_Timer stored procedure every hour


6) Don’t forget to update the time schedule with your local (company) holidays 🙂
Table: timer.TimeSchedule


Websites:
SQLTimer
BumbleBI IT


Gartner: Magic Quadrant for Data Warehouse (DWH) Database Management Systems 2013

Gartner released its annual Data Warehouse (DWH) Magic Quadrant report last Monday!

Click here for the full report by Gartner.

GartnerDW2013

Microsoft
Microsoft (www.microsoft.com) markets the SQL Server 2012 Fast Track reference architecture, SQL Server 2008 R2 Parallel Data Warehouse Appliance (Update 3), HP Business Decision Warehouse and Dell Quickstart Data Warehouse Appliance. Microsoft does not report customer or license counts, but there are thousands of SQL Server customers worldwide (including analytical and transactional systems).

Strengths
Broad market usage.

With customers in the manufacturing, telecommunications, insurance, government, retail, financial services, transportation and other industries, Microsoft’s solution demonstrates wide applicability to vertical markets. In addition, customers come from all across the world, from North America to EMEA to Asia/Pacific, including countries such as Belgium, Denmark, New Zealand, South Korea, Ukraine and the U.S. Furthermore, customers report annual revenues ranging from under $100 million to over $10 billion. Microsoft is a major rival to any vendor just in sheer numbers of deployed data warehouses (although not in revenue, which, though rising, remains much lower).
Competitive capability and customer loyalty.
Microsoft customers tend to grow with Microsoft. Microsoft also holds its own when competing with other leading vendors, its usual rivals in competitive situations being IBM and Oracle. Its competitors are most often rejected on the basis of price, which means Microsoft’s pricing is considered favorable. Reference customers indicated that half of Microsoft’s competitive wins were directly related to its price and ease-of-use advantages over competitors. Importantly, although customers identified issues of product maturity (such as with metadata and monitoring tools), they considered Microsoft’s skills availability and pricing compensated for these weaknesses — though they still encourage Microsoft to improve on its shortcomings. In addition, Microsoft can claim the highest percentage of customers performing data warehousing on a current software release, according to our survey data.
Product form factors and partners.
By offering DBMS software, providing reference architectures, prebuilding and preloading implementations of reference architectures, offering an appliance and offering professional services (and partner connections), Microsoft offers almost every configuration of data warehouse deployment. In addition, Microsoft has an impressive mix of partners for colead management and lead development, hardware partners, codevelopment agreements with other suppliers and implementers, and reseller agreements. Moreover, Microsoft’s Parallel Data Warehouse appliance, despite a slow start, has been adopted by approximately 100 organizations in the past 18 months. Further adoption of this appliance is likely as Dell continues to enter the data warehouse space and sells the Dell Parallel Data Warehouse Appliance.
Completeness of solution approach.
With so much of the software needed for BI and data warehousing initiatives included in a single license, Microsoft makes the processes of purchasing and license management very easy. In addition, Microsoft offers data models (through partners) and provides assistance (directly and through partners) at the time of implementation. The most notable change by Microsoft in 2012 regarding data warehousing was its new vision for combining structured and unstructured data, desktop analytics tools (such as PowerPivot), enterprise warehousing, fast implementation strategies and the cloud.
Logical data warehouse practices.
Microsoft started early on unstructured and structured data in combination by using SharePoint, PowerPivot and technology acquired from Fast Search & Transfer — all coordinated by the SQL Server engine. It has since added in-memory capabilities across the stack by using xVelocity in SQL Server 2012, SQL Server Analysis Services (SSAS) and PowerPivot. In addition, Microsoft has been catching up with the announcement of HDInsight and a solution offering in partnership with Hortonworks. Finally, with Microsoft’s StreamInsight complex event processing solution, customers can address the velocity requirements of big data.

Cautions
Perceived lack of enterprise-readiness.

As SQL Server grows to enable its expanded use for data warehousing, it is generating more questions from clients about its readiness to support large data warehouses and HA/DR capabilities. However, with the release of SQL Server 2012, Microsoft has continued to mature this offering with the addition of Always On features to improve clustering, failover and active-active data warehouses. Growth in the number of large SQL Server data warehouses will also reduce this perception.
Complete data management capabilities.
SQL Server offers a complete set of data management capabilities, including data integration with SQL Server Integration Services and OLAP capabilities with SSAS. However, as data warehousing projects have matured and grown in complexity, clients have reported (both through Gartner’s direct interactions and through the survey we conducted for this Magic Quadrant) limitations with the capabilities offered with SQL Server, specifically in the areas of metadata management and data integration. Microsoft is still developing its capabilities in these areas.
Business model driven by the “majority market.”
Distributed process management and execution is one aspect of the logical data warehouse. However, although Microsoft was quick to respond to its customers’ fast-developing need to cope with data in high volume, in great variety and at fluctuating velocity, and to offer integration with major Hadoop distributions, this left it behind some of the Leaders in terms of vision. But, then, being a “fast follower” for some features and functions has always been part of Microsoft’s business model.



Gartner: Magic Quadrant for BI Platforms 2013