Thursday, 23 February 2017

Microsoft Certification 70-475: Designing and Implementing Big Data Analytics Solutions


Having recently sat and passed Microsoft’s exam 70-475, I thought I’d publish the list of references I built up whilst studying. This is still a relatively new exam, so study materials are hard to come by, just as for exam 70-473. As usual, I also made use of the Mindhub practice exam.

I found it difficult to pin down specific resources for some of the objective areas, so this list is by no means exhaustive, but it covers a good chunk of the exam content.

I also recommend having some prior knowledge of the MS SQL, Hadoop and Azure ecosystems before tackling this exam.

Hope this helps!

1. Design big data batch processing and interactive solutions

  • Ingest data for batch and interactive processing
https://docs.microsoft.com/en-us/azure/data-lake-store/
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-load
    • Ingest from cloud-born or on-premises data,
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-data-scenarios
    • store data in Microsoft Azure Data Lake,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-azure-datalake-connector#sample-copy-data-from-azure-blob-to-azure-data-lake-store
    • store data in Azure BLOB Storage,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-azure-datalake-connector#sample-copy-data-from-azure-data-lake-store-to-azure-blob
    • perform a one-time bulk data transfer,
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-offline-bulk-data-upload
    • perform routine small writes on a continuous basis
  • Design and provision compute clusters
https://blogs.msdn.microsoft.com/cindygross/2015/02/26/create-hdinsight-cluster-in-azure-portal/
    • Select compute cluster type,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-introduction#a-nameoverviewaoverview-of-the-hadoop-ecosystem-in-hdinsight
https://www.blue-granite.com/blog/how-to-choose-the-right-hdinsight-cluster
    • estimate cluster size based on workload
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-provision-clusters
  • Design for data security
    • Protect personally identifiable information (PII) data in Azure,
    • encrypt and mask data,
    • implement role-based security
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-secure-data
  • Design for batch processing
https://docs.microsoft.com/en-us/azure/batch/batch-technical-overview
    • Select appropriate language and tool,
    • identify formats,
    • define metadata,
Microsoft Azure Batch - slides 46-48
    • configure output
  • Design interactive queries for big data
https://docs.microsoft.com/en-gb/azure/hdinsight/hdinsight-apache-spark-overview
    • Provision Spark cluster,
https://docs.microsoft.com/en-gb/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
    • set the right resources in Spark cluster,
https://blogs.msdn.microsoft.com/bigdatasupport/2015/08/19/some-things-to-consider-for-your-spark-on-hdinsight-workload/
    • execute queries using Spark SQL,
    • select the right data format (Parquet),
http://parquet.apache.org/documentation/latest/
    • cache data in memory (make sure cluster is of the right size),
    • visualize using business intelligence (BI) tools (for example, Power BI, Tableau),
https://docs.microsoft.com/en-gb/azure/hdinsight/hdinsight-apache-spark-use-bi-tools
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-power-bi
    • select the right tool for business analysis

2. Design big data real-time processing solutions

  • Ingest data for real-time processing
https://docs.microsoft.com/en-gb/azure/stream-analytics/stream-analytics-introduction
http://download.microsoft.com/download/6/2/3/623924DE-B083-4561-9624-C1AB62B5F82B/real-time-event-processing-with-microsoft-azure-stream-analytics.pdf
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-sensor-data-analysis - hands-on tutorial
    • Select data ingestion technology,
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs
    • design partitioning scheme,
https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-what-is-event-hubs#partitions
    • design row key of event tables in HBase
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-overview
http://www.dummies.com/programming/big-data/hadoop/row-keys-in-the-hbase-data-model/
http://hbase.apache.org/0.94/book/rowkey.design.html
  • Design and provision compute resources
    • Select streaming technology in Azure,
https://docs.microsoft.com/en-gb/azure/stream-analytics/stream-analytics-comparison-storm
    • select real-time event processing technology,
https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-compare-event-hubs
    • select real-time event storage technology,
https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-define-outputs
    • select streaming units,
https://azure.microsoft.com/en-us/pricing/details/stream-analytics/#
https://docs.microsoft.com/en-gb/azure/stream-analytics/stream-analytics-scale-jobs
    • configure cluster size,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-provision-clusters#basic-configuration-options
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-provision-clusters#cluster-types
    • assign appropriate resources for Spark clusters,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-resource-manager#what-is-the-optimum-cluster-configuration-to-run-spark-applications
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-resource-manager#how-do-i-know-if-i-am-running-out-of-resource
    • assign appropriate resources for HBase clusters,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hbase-tutorial-get-started#create-hbase-cluster
    • utilize Visual Studio to write and debug Storm topologies
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-develop-csharp-visual-studio-topology
  • Design for Lambda architecture
https://blogs.technet.microsoft.com/msuspartner/2016/01/27/azure-partner-community-big-data-advanced-analytics-and-lambda-architecture/
https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx
http://lambda-architecture.net/
    • Identify application of Lambda architecture,
    • utilize streaming data to draw business insights in real time,
    • utilize streaming data to show trends in data in real time,
    • utilize streaming data and convert into batch data to get historical view,
    • design such that batch data doesn’t introduce latency,
    • utilize batch data for deeper data analysis
  • Design for real-time processing
Real-Time Event & Stream Processing on MS Azure
    • Design for latency and throughput,
    • design reference data streams,
    • design business logic,
    • design visualization output


3. Design Machine Learning solutions

  • Create and manage experiments
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-create-experiment
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-studio-overview-diagram
    • Create, manage, and share workspaces;
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-walkthrough-1-create-ml-workspace
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-create-workspace
    • create training experiment;
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-walkthrough-3-create-new-experiment
    • select template experiment from Machine Learning gallery
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-sample-experiments
  • Determine when to pre-process or train inside Machine Learning Studio
    • Select model type based on desired algorithm,
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-algorithm-choice
    • select technique based on data size
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-prepare-data
  • Select input/output types
    • Select appropriate SQL parameters,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-import-data-from-online-sources
    • select BLOB storage parameters,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-import-data-from-online-sources#supported-online-data-sources
    • identify data sources,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-import-data
    • select HiveQL queries
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-create-features-hive
  • Apply custom processing steps with R and Python
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-python-data-access
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-extend-your-experiment-with-r
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-custom-r-modules
    • Visualize custom graphs,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-custom-r-modules#elements-in-the-xml-definition-file
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-execute-python-scripts#working-with-visualizations
    • estimate custom algorithms,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-algorithm-choice
http://download.microsoft.com/download/A/6/1/A613E11E-8F9C-424A-B99D-65344785C288/microsoft-machine-learning-algorithm-cheat-sheet-v6.pdf
    • select custom parameters,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-web-service-parameters
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-execute-python-scripts#basic-usage-scenarios-in-machine-learning-for-python-scripts
    • interact with datasets through notebooks (Jupyter Notebook)
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-gallery-jupyter-notebooks
https://gallery.cortanaintelligence.com/notebooks
https://gallery.cortanaintelligence.com/Notebook/Tutorial-on-Azure-Machine-Learning-Notebook-1
  • Publish web services
    • Operationalize Azure Machine Learning models,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-publish-a-machine-learning-web-service
    • operationalize Spark models using Azure Machine Learning,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-spark-overview
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-spark-model-consumption#consume-spark-models-through-a-web-interface
    • operationalize custom models
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-model-progression-experiment-to-web-service

4. Operationalize end-to-end cloud analytics solutions

  • Create a data factory
    • Identify data sources,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-introduction#data-movement-activities
    • identify and provision data processing infrastructure,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-introduction#data-transformation-activities
    • utilize Visual Studio to design and deploy pipelines
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-tutorial-using-visual-studio
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-build-your-first-pipeline-using-vsm
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-build-your-first-pipeline
  • Orchestrate data processing activities in a data-driven workflow
    • Leverage data-slicing concepts,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution#time-series-datasets-and-data-slices
    • identify data dependencies and chaining multiple activities,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution#run-activities-in-a-sequence
    • model complex schedules based on data dependencies,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution#data-dependency-deep-dive
    • provision and run data pipelines
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-create-pipelines#create-pipelines
  • Monitor and manage the data factory
    • Identify failures and root causes,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-monitor-manage-app
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-monitor-manage-pipelines
    • create alerts for specified conditions,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-monitor-manage-app#creating-alerts
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-monitor-manage-pipelines#create-alerts
    • perform a restatement
  • Move, transform, and analyze data
    • Leverage Pig, Hive, MapReduce for data processing;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-pig-activity
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-hive-activity
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-map-reduce
    • copy data between on-premises and cloud;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-move-data-between-onprem-and-cloud
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-data-management-gateway
    • copy data between cloud data sources;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-data-movement-activities
    • leverage stored procedures;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-stored-proc-activity
    • leverage Machine Learning batch execution for scoring, retraining, and update resource;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-azure-ml-batch-execution-activity
    • extend the data factory with custom processing steps;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-use-custom-activities
    • load data into a relational store;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-azure-sql-connector
    • visualize using Power BI
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-power-bi
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-get-started-visualize-with-power-bi
  • Design a deployment strategy for an end-to-end solution
    • Leverage PowerShell for deployment,
https://docs.microsoft.com/en-us/powershell/resourcemanager/azurerm.datafactories/v2.3.0/azurerm.datafactories
    • automate deployment programmatically
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-create-data-factories-programmatically
https://msdn.microsoft.com/library/mt415893.aspx
https://msdn.microsoft.com/library/dn906738.aspx
