Thursday, 23 February 2017

Microsoft Certification 70-475 : Designing and Implementing Big Data Analytics Solutions


Having recently sat and passed Microsoft’s exam 70-475, I thought I’d publish the list of references I built up whilst studying. This is still a relatively new exam, so study materials are hard to come by, just as for exam 70-473. As usual, I also made use of the Mindhub practice exam.

I found it difficult to pin-down specific resources for some of the objective areas, so it’s by no means extensive, but covers a good chunk of the exam content.

I also recommend having some prior knowledge of MS SQL, Hadoop and Azure ecosystems before tackling this exam.

Hope this helps!

1. Design big data batch processing and interactive solutions

  • Ingest data for batch and interactive processing
https://docs.microsoft.com/en-us/azure/data-lake-store/
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-load
    • Ingest from cloud-born or on-premises data,
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-scenarios
    • store data in Microsoft Azure Data Lake,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-azure-datalake-connector#sample-copy-data-from-azure-blob-to-azure-data-lake-store
    • store data in Azure BLOB Storage,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-azure-datalake-connector#sample-copy-data-from-azure-data-lake-store-to-azure-blob
    • perform a one-time bulk data transfer,
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-offline-bulk-data-upload
    • perform routine small writes on a continuous basis
  • Design and provision compute clusters
https://blogs.msdn.microsoft.com/cindygross/2015/02/26/create-hdinsight-cluster-in-azure-portal/
    • Select compute cluster type,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-introduction#a-nameoverviewaoverview-of-the-hadoop-ecosystem-in-hdinsight
https://www.blue-granite.com/blog/how-to-choose-the-right-hdinsight-cluster
    • estimate cluster size based on workload
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-provision-clusters
  • Design for data security
    • Protect personally identifiable information (PII) data in Azure
    • encrypt and mask data,
    • implement role-based security
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-secure-data
  • Design for batch processing
https://docs.microsoft.com/en-us/azure/batch/batch-technical-overview
    • Select appropriate language and tool,
    • identify formats,
    • define metadata,
Microsoft Azure Batch - slides 46-48
    • configure output
  • Design interactive queries for big data
https://docs.microsoft.com/en-gb/azure/hdinsight/hdinsight-apache-spark-overview
    • Provision Spark cluster,
https://docs.microsoft.com/en-gb/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
    • set the right resources in Spark cluster,
https://blogs.msdn.microsoft.com/bigdatasupport/2015/08/19/some-things-to-consider-for-your-spark-on-hdinsight-workload/
    • execute queries using Spark SQL,
    • select the right data format (Parquet),
http://parquet.apache.org/documentation/latest/
    • cache data in memory (make sure cluster is of the right size),
    • visualize using business intelligence (BI) tools (for example, Power BI, Tableau),
https://docs.microsoft.com/en-gb/azure/hdinsight/hdinsight-apache-spark-use-bi-tools
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-power-bi
    • select the right tool for business analysis

2. Design big data real-time processing solutions

  • Ingest data for real-time processing
https://docs.microsoft.com/en-gb/azure/stream-analytics/stream-analytics-introduction
http://download.microsoft.com/download/6/2/3/623924DE-B083-4561-9624-C1AB62B5F82B/real-time-event-processing-with-microsoft-azure-stream-analytics.pdf
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-sensor-data-analysis - hands-on tutorial
    • Select data ingestion technology,
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs
    • design partitioning scheme,
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs#partitions
    • design row key of event tables in Hbase
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-overview
http://www.dummies.com/programming/big-data/hadoop/row-keys-in-the-hbase-data-model/
http://hbase.apache.org/0.94/book/rowkey.design.html
  • Design and provision compute resources
    • Select streaming technology in Azure,
https://docs.microsoft.com/en-gb/azure/stream-analytics/stream-analytics-comparison-storm
    • select real-time event processing technology,
https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-compare-event-hubs
    • select real-time event storage technology,
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs
    • select streaming units,
https://azure.microsoft.com/en-us/pricing/details/stream-analytics/#
https://docs.microsoft.com/en-gb/azure/stream-analytics/stream-analytics-scale-jobs
    • configure cluster size,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-provision-clusters#basic-configuration-options
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-provision-clusters#cluster-types
    • assign appropriate resources for Spark clusters,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-resource-manager#what-is-the-optimum-cluster-configuration-to-run-spark-applications
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-resource-manager#how-do-i-know-if-i-am-running-out-of-resource
    • assign appropriate resources for HBase clusters,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-tutorial-get-started#create-hbase-cluster
    • utilize Visual Studio to write and debug Storm topologies
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-develop-csharp-visual-studio-topology
  • Design for Lambda architecture
https://blogs.technet.microsoft.com/msuspartner/2016/01/27/azure-partner-community-big-data-advanced-analytics-and-lambda-architecture/
https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx
http://lambda-architecture.net/
    • Identify application of Lambda architecture,
    • utilize streaming data to draw business insights in real time,
    • utilize streaming data to show trends in data in real time,
    • utilize streaming data and convert into batch data to get historical view,
    • design such that batch data doesn’t introduce latency,
    • utilize batch data for deeper data analysis
  • Design for real-time processing
Real-Time Event & Stream Processing on MS Azure
  • Design for latency and throughput,
    • design reference data streams,
    • design business logic,
    • design visualization output

 

3. Design Machine Learning solutions

  • Create and manage experiments
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-create-experiment
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-studio-overview-diagram
    • Create, manage, and share workspaces;
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-walkthrough-1-create-ml-workspace
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-create-workspace
    • create training experiment;
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-walkthrough-3-create-new-experiment
    • select template experiment from Machine Learning gallery
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-sample-experiments
  • Determine when to pre-process or train inside Machine Learning Studio
    • Select model type based on desired algorithm,
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-algorithm-choice
    • select technique based on data size
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-prepare-data
  • Select input/output types
    • Select appropriate SQL parameters,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-import-data-from-online-sources
    • select BLOB storage parameters,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-import-data-from-online-sources#supported-online-data-sources
    • identify data sources,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-import-data
    • select HiveQL queries
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-create-features-hive
  • Apply custom processing steps with R and Python
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-python-data-access
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-extend-your-experiment-with-r
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-custom-r-modules
    • Visualize custom graphs,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-custom-r-modules#elements-in-the-xml-definition-file
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-execute-python-scripts#working-with-visualizations
    • estimate custom algorithms,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
http://download.microsoft.com/download/A/6/1/A613E11E-8F9C-424A-B99D-65344785C288/microsoft-machine-learning-algorithm-cheat-sheet-v6.pdf
    • select custom parameters,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-web-service-parameters
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-execute-python-scripts#basic-usage-scenarios-in-machine-learning-for-python-scripts
    • interact with datasets through notebooks (Jupyter Notebook)
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-gallery-jupyter-notebooks
https://gallery.cortanaintelligence.com/notebooks
https://gallery.cortanaintelligence.com/Notebook/Tutorial-on-Azure-Machine-Learning-Notebook-1
  • Publish web services
    • Operationalize Azure Machine Learning models,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-publish-a-machine-learning-web-service
    • operationalize Spark models using Azure Machine Learning,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-spark-overview
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-spark-model-consumption#consume-spark-models-through-a-web-interface
    • operationalize custom models
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-model-progression-experiment-to-web-service
 

4. Operationalize end-to-end cloud analytics solutions

  • Create a data factory
    • Identify data sources,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-introduction#data-movement-activities
    • identify and provision data processing infrastructure,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-introduction#data-transformation-activities
    • utilize Visual Studio to design and deploy pipelines
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-tutorial-using-visual-studio
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-build-your-first-pipeline-using-vsm
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-build-your-first-pipeline
  • Orchestrate data processing activities in a data-driven workflow
    • Leverage data-slicing concepts,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution#time-series-datasets-and-data-slices
    • identify data dependencies and chaining multiple activities,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution#run-activities-in-a-sequence
    • model complex schedules based on data dependencies,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution#data-dependency-deep-dive
    • provision and run data pipelines
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-create-pipelines#create-pipelines
  • Monitor and manage the data factory
    • Identify failures and root causes,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-monitor-manage-app
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-monitor-manage-pipelines
    • create alerts for specified conditions,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-monitor-manage-app#creating-alerts
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-monitor-manage-pipelines#create-alerts
    • perform a restatement
  • Move, transform, and analyze data
    • Leverage Pig, Hive, MapReduce for data processing;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-pig-activity
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-hive-activity
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-map-reduce
    • copy data between on-premises and cloud;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-move-data-between-onprem-and-cloud
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-data-management-gateway
    • copy data between cloud data sources;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-data-movement-activities
    • leverage stored procedures;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-stored-proc-activity
    • leverage Machine Learning batch execution for scoring, retraining, and update resource;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-azure-ml-batch-execution-activity
    • extend the data factory with custom processing steps;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-use-custom-activities
    • load data into a relational store
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-azure-sql-connector
    • visualize using Power BI
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-power-bi
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-get-started-visualize-with-power-bi
  • Design a deployment strategy for an end-to-end solution
    • Leverage PowerShell for deployment,
https://docs.microsoft.com/en-us/powershell/resourcemanager/azurerm.datafactories/v2.3.0/azurerm.datafactories
    • automate deployment programmatically
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-create-data-factories-programmatically
https://msdn.microsoft.com/library/mt415893.aspx
https://msdn.microsoft.com/library/dn906738.aspx

Wednesday, 8 February 2017

Copying an Azure Blob snapshot to another storage account using PowerShell

I've been working with Azure Blobs and Snapshots recently. One gripe is that the Azure portal and the various storage explorers don't give you the power to copy a blob snapshot. Blobs, yes, but their snapshots, no.

Depending on your requirements there are many approaches. The PowerShell below gives an example where the snapshot timestamp is extracted from the snapshot URI and then used to obtain an object to the specific snapshot before copying it with the Start-AzureStorageBlobCopy cmdlet.

# Define the source snapshot blob URI
$SrcBlobURI="https://<SOURCE_STORAGE_ACCOUNT>.blob.core.windows.net/vhds/<SOURCE_VHD_PREFIX>.vhd?snapshot=2017-01-24T21:08:36.9371576Z"

# Define the destination storage account and context.
$DestStorageAccountName = "<DESTINATION_STORAGE_ACCOUNT>"
$DestStorageAccountKey = "<DESTINATION_STORAGE_ACCOUNT_KEY>"
$DestContainerName = "<DESTINATION_CONTAINER_NAME>"
$DestContext = New-AzureStorageContext -StorageAccountName $DestStorageAccountName -StorageAccountKey $DestStorageAccountKey

# Determine source snapshot blob container name and context
$SrcDiskInfo = "" | Select StorageAccountName, VHDName, ContainerName
$SrcDiskInfo.StorageAccountName = ($SrcBlobURI -split "https://")[1].Split(".")[0]
$SrcDiskInfo.VHDName = $SrcBlobURI.Split("/")[-1]
$SrcDiskInfo.ContainerName = $SrcBlobURI.Split("/")[3]
$SrcContainerName = $SrcDiskInfo.ContainerName
$SrcStorageAccount = Find-AzureRmResource `
    -ResourceNameContains $SrcDiskInfo.StorageAccountName `
    -WarningAction Ignore
$SrcStorageKey = Get-AzureRmStorageAccountKey `
    -Name $SrcStorageAccount.Name `
    -ResourceGroupName $SrcStorageAccount.ResourceGroupName
$SrcContext = New-AzureStorageContext `
    -StorageAccountName $SrcStorageAccount.Name `
    -StorageAccountKey $SrcStorageKey[0].Value

# Extract timestamp from the Snapshot filename and convert to a datetime
$pre, $post = $SrcBlobURI.split("=")
$snapdt = [uri]::UnescapeDataString($post)
$snapdt = [datetime]::ParseExact( `
    $snapdt,'yyyy-MM-dd"T"HH:mm:ss"."fffffff"Z"',$null `
)

# Get a reference to blobs in the source container.
$blob = Get-AzureStorageBlob -Container $SrcContainerName `
    -Context $SrcContext `
    | Where-Object {`
        $_.ICloudBlob.IsSnapshot `
        -and $_.SnapshotTime -eq $snapdt `
    }

# Copy blobs from one container to another.
$blob | Start-AzureStorageBlobCopy `
    -DestContainer $DestContainerName `
    -DestContext $DestContext `
    -DestBlob "acopy_$($blob.Name)"

Wednesday, 1 February 2017

Microsoft Certification 70-473 : Designing and Implementing Cloud Data Platform Solutions

I recently passed the Microsoft exam 70-473 "Designing and Implementing Cloud Data Platform Solutions", here's my insight into the certification for anyone else thinking of studying and sitting this exam. I've also included a selection of my most useful links I found whilst looking for relevant online content.

Generally, if you have experience with SQL Server (2008+) then focus your study on hybrid scenarios, HA/DR, auditing, monitoring, security and the differences between SQL Server (on-prem Vs Azure) and Azure SQL Database.
Having some Azure certs already will also give you a small head-start. Also, knowing the different MS tools available and when to use them was key.

With the absence of any MS Exam References and Certification videos my preparation material was based on MSDN, Azure and Technet articles, so lots and lots of online reading. I did find various online videos that dealt with specific aspects of the exam, but in the end I stuck to the docs and took the Mindhub practice exam I had as part of the (no longer available) Microsoft Exam Booster Pack.

I hope these prove useful (Note: not all objective areas have a link, if you find a good one then let me know)...


·         Design a hybrid SQL Server solution
o    Design Geo/DR topology,
o    design a data storage architecture,
o    design a security architecture,
o    design a data load strategy
·         Implement SQL Server on Azure Virtual Machines (VMs)
o    Provision SQL Server in an Azure VM,
o    configure firewall rules,
o    configure and optimise storage,
o    migrate an on-premises database to Microsoft Azure,
o    configure and optimise VM sizes by workload
·         Design a SQL Database solution
o    Design a solution architecture,
o    design Geo/DR topology,
o    design a security architecture,
o    design a data load strategy,
o    determine the appropriate service tier
·         Implement SQL Database
o    Provision SQL Database,
o    configure firewall rules,
o    configure active geo-replication,
o    migrate an on-premises database to SQL Database,
o    configure for scale and performance
·         Design and implement data warehousing on Azure
o    Design a data warehousing solution on Azure,
o    design a data load strategy and topology,
o    configure SQL Data Warehouse,
o    migrate an on-premises database to SQL Data Warehouse

·         Design and implement SQL Server Database security
o    Configure firewalls;
o    manage logins, users and roles;
o    assign permissions;
o    configure auditing;
o    configure transparent database encryption
·         Implement Azure SQL Database security
o    Configure firewalls;
o    manage logins, users, and roles;
o    assign permissions;
o    configure auditing;
o    configure row-level security;
o    configure data encryption;
o    configure data masking;
o    configure Always Encrypted

·         Design and implement high availability solutions
o    Design a high availability solution topology,
o    implement high availability solutions between on-premises and Azure,
o    design cloud-based backup solutions,
o    implement backup and recovery strategies
·         Design and implement scalable solutions
o    Design a scale-out solution,
o    implement multi-master scenarios with database replication,
o    implement elastic scale for SQL Database
·         Design and implement SQL Database data recovery
o    Design a backup solution for SQL Database,
o    implement self-service restore,
o    copy and export databases

·         Monitor and troubleshoot SQL Server VMs on Azure
o    Monitor database and instance activity,
o    monitor using dynamic management views (DMVs) and dynamic management functions (DMFs),
o    monitor performance and scalability
·         Monitor and troubleshoot SQL Database
o    Monitor and troubleshoot SQL Database,
o    monitor database activity,
o    monitor using DMVs and DMFs,
o    monitor performance and scalability
·         Automate and manage database implementations on Azure
o    Manage SQL Server in Azure VMs with PowerShell,
o    manage Azure SQL Database with PowerShell,
o    configure Automation and Runbooks

Related Videos

Ignite Exam Prep Session - Channel 9
Exam Boot Camp - MAPA
1.1/3.1 SQL Server HA/DR with Azure
1.2 Migrate SQL Server to Azure
1.2 SQL Server in Azure VMs
1.4/4.2 Hybrid Cloud workloads SQL
1.5 Elastic Data Warehouse as-a-service
2.2 Securing data in SQL Database
3.1 Design a scale-out solution
3.2 Elastic Database with Azure SQL Database
3.3 Implementing SQL Database Data Recovery
4.1 Monitor and Troubleshoot SQL Server
4.3 SQL Server and SQL Database Powershell

Other Links

General Introduction
SQL Server on Azure VMs Learning Path
SQL Database Intro
TechNet SQL Server Reference
MSDN SQL Server Reference
SQL Server (2008) Data Loading White Paper
SQL Server Customer Advisory Team (SQLCAT)
SQL Server Solution Design
Hybrid SQL Server Scenarios
Migrate On-prem to Azure VMs
Training Course