Having recently sat and passed Microsoft’s exam 70-475, I thought I’d publish the list of references I built up whilst studying. This is still a relatively new exam, so study materials are hard to come by, just as for exam 70-473. As usual, I also made use of the Mindhub practice exam.
I found it difficult to pin down specific resources for some of the objective areas, so this list is by no means exhaustive, but it covers a good chunk of the exam content.
I also recommend having some prior knowledge of the MS SQL, Hadoop and Azure ecosystems before tackling this exam.
Hope this helps!
1. Design big data batch processing and interactive solutions
- Ingest data for batch and interactive processing
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-performance
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-load
- Ingest cloud-born or on-premises data,
- store data in Microsoft Azure Data Lake,
- store data in Azure Blob storage,
- perform a one-time bulk data transfer,
- perform routine small writes on a continuous basis
- Design and provision compute clusters
- Select compute cluster type,
https://www.blue-granite.com/blog/how-to-choose-the-right-hdinsight-cluster
- estimate cluster size based on workload
- Design for data security
- Protect personally identifiable information (PII) data in Azure
- encrypt and mask data,
- implement role-based security
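As a rough illustration of the data-masking idea, here is a sketch loosely modelled on SQL Database Dynamic Data Masking's default email function, which keeps the first character and the top-level domain. The regex and the exact output format are my own illustrative choices, not the service's behaviour:

```python
import re

def mask_email(value: str) -> str:
    # Partial masking in the spirit of Dynamic Data Masking's email rule:
    # keep the first character and the TLD, hide everything else.
    m = re.match(r"^(.)(.*)@(.*)(\.[A-Za-z]+)$", value)
    if not m:
        return "****"  # fall back to full masking for malformed values
    return f"{m.group(1)}****@****{m.group(4)}"
```

Role-based security and encryption (TDE, Always Encrypted, Azure Key Vault) are separate controls layered on top of masking like this.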
- Design for batch processing
- Select appropriate language and tool,
- identify formats,
- define metadata,
- configure output
- Design interactive queries for big data
- Provision Spark cluster,
- set the right resources in Spark cluster,
- execute queries using Spark SQL,
- select the right data format (Parquet),
- cache data in memory (make sure cluster is of the right size),
- visualize using business intelligence (BI) tools (for example, Power BI, Tableau),
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-integrate-power-bi
- select the right tool for business analysis
2. Design big data real-time processing solutions
- Ingest data for real-time processing
http://download.microsoft.com/download/6/2/3/623924DE-B083-4561-9624-C1AB62B5F82B/real-time-event-processing-with-microsoft-azure-stream-analytics.pdf
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-storm-sensor-data-analysis - hands-on tutorial
- Select data ingestion technology,
- design partitioning scheme,
- design row key of event tables in HBase
http://www.dummies.com/programming/big-data/hadoop/row-keys-in-the-hbase-data-model/
http://hbase.apache.org/0.94/book/rowkey.design.html
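As a rough sketch of the salting and timestamp-reversal patterns those row-key articles describe (the bucket count, separator, and key layout here are arbitrary choices of mine, not an HBase requirement):

```python
import hashlib

def salted_row_key(device_id: str, timestamp: int, buckets: int = 16) -> str:
    # Prefix with a salt bucket derived from the device id, so keys from
    # sequential timestamps spread across region servers instead of
    # hotspotting a single one.
    salt = int(hashlib.md5(device_id.encode()).hexdigest(), 16) % buckets
    # Reverse the timestamp so the newest event sorts first for a device,
    # which makes "latest N events" scans cheap.
    reversed_ts = 9_999_999_999 - timestamp
    return f"{salt:02d}|{device_id}|{reversed_ts:010d}"
```

The trade-off is that range scans across all devices now need one scan per salt bucket.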
- Design and provision compute resources
- Select streaming technology in Azure,
- select real-time event processing technology,
- select real-time event storage technology,
- select streaming units,
https://docs.microsoft.com/en-gb/azure/stream-analytics/stream-analytics-scale-jobs
- configure cluster size,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-provision-clusters#cluster-types
- assign appropriate resources for Spark clusters,
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-resource-manager#how-do-i-know-if-i-am-running-out-of-resource
- assign appropriate resources for HBase clusters,
- utilize Visual Studio to write and debug Storm topologies
- Design for Lambda architecture
https://social.technet.microsoft.com/wiki/contents/articles/33626.lambda-architecture-implementation-using-microsoft-azure.aspx
http://lambda-architecture.net/
- Identify application of Lambda architecture,
- utilize streaming data to draw business insights in real time,
- utilize streaming data to show trends in data in real time,
- utilize streaming data and convert it into batch data to get a historical view,
- design such that batch data doesn’t introduce latency,
- utilize batch data for deeper data analysis
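The serving-layer merge at the heart of the Lambda architecture can be sketched in a few lines. This is a minimal illustration, assuming the batch view holds precomputed totals up to the last batch run and the speed layer holds increments for events since then:

```python
def merged_view(batch_view: dict, speed_view: dict) -> dict:
    # Start from the (accurate but stale) batch layer results, then
    # overlay the speed layer's deltas for events the batch hasn't
    # processed yet. This is how batch recomputation avoids adding
    # latency to the query path.
    out = dict(batch_view)
    for key, delta in speed_view.items():
        out[key] = out.get(key, 0) + delta
    return out
```

When the next batch run completes, the speed layer's state for that window is discarded and the cycle repeats.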
- Design for real-time processing
- Design for latency and throughput,
- design reference data streams,
- design business logic,
- design visualization output
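To make the latency/throughput design point concrete, here is a plain-Python sketch of the tumbling-window aggregation that Stream Analytics expresses as `TumblingWindow(second, 60)` in its query language (the event shape and window size here are assumptions for illustration):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    # Assign each (timestamp, key) event to a fixed, non-overlapping
    # window; larger windows raise latency but smooth throughput spikes.
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)
```

Hopping and sliding windows are the other common choices; they trade extra computation for overlapping views of the stream.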
3. Design Machine Learning solutions
- Create and manage experiments
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-studio-overview-diagram
- Create, manage, and share workspaces;
https://docs.microsoft.com/en-gb/azure/machine-learning/machine-learning-create-workspace
- create training experiment;
- select template experiment from Machine Learning gallery
- Determine when to pre-process or train inside Machine Learning Studio
- Select model type based on desired algorithm,
- select technique based on data size
- Select input/output types
- Select appropriate SQL parameters,
- select BLOB storage parameters,
- identify data sources,
- select HiveQL queries
- Apply custom processing steps with R and Python
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-extend-your-experiment-with-r
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-custom-r-modules
- Visualize custom graphs,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-execute-python-scripts#working-with-visualizations
- estimate custom algorithms,
http://download.microsoft.com/download/A/6/1/A613E11E-8F9C-424A-B99D-65344785C288/microsoft-machine-learning-algorithm-cheat-sheet-v6.pdf
- select custom parameters,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-execute-python-scripts#basic-usage-scenarios-in-machine-learning-for-python-scripts
- interact with datasets through notebooks (Jupyter Notebook)
https://gallery.cortanaintelligence.com/notebooks
https://gallery.cortanaintelligence.com/Notebook/Tutorial-on-Azure-Machine-Learning-Notebook-1
- Publish web services
- Operationalize Azure Machine Learning models,
- operationalize Spark models using Azure Machine Learning,
https://docs.microsoft.com/en-us/azure/machine-learning/machine-learning-data-science-spark-model-consumption#consume-spark-models-through-a-web-interface
- operationalize custom models
4. Operationalize end-to-end cloud analytics solutions
- Create a data factory
- Identify data sources,
- identify and provision data processing infrastructure,
- utilize Visual Studio to design and deploy pipelines
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-build-your-first-pipeline-using-vsm
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-build-your-first-pipeline
- Orchestrate data processing activities in a data-driven workflow
- Leverage data-slicing concepts,
- identify data dependencies and chain multiple activities,
- model complex schedules based on data dependencies,
- provision and run data pipelines
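Data slicing is easiest to picture as carving a pipeline's active period into fixed intervals, each processed by one activity run. A minimal sketch of that idea (the hourly frequency is just an example; Data Factory also supports minute, day, week, and month):

```python
from datetime import datetime, timedelta

def slice_boundaries(start: datetime, end: datetime, frequency_hours: int = 1):
    # Divide the active period [start, end) into consecutive slices;
    # downstream datasets depend on upstream slices for the same
    # interval, which is what lets the service chain activities.
    slices = []
    step = timedelta(hours=frequency_hours)
    cur = start
    while cur < end:
        slices.append((cur, min(cur + step, end)))
        cur += step
    return slices
```

A failed slice can be rerun independently, which is also the mechanism behind restatements.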
- Monitor and manage the data factory
- Identify failures and root causes,
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-monitor-manage-pipelines
- create alerts for specified conditions,
https://docs.microsoft.com/en-us/azure/data-factory/data-factory-monitor-manage-pipelines#create-alerts
- perform a restatement
- Move, transform, and analyze data
- Leverage Pig, Hive, MapReduce for data processing;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-hive-activity
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-map-reduce
- copy data between on-premises and cloud;
https://docs.microsoft.com/en-gb/azure/data-factory/data-factory-data-management-gateway
- copy data between cloud data sources;
- leverage stored procedures;
- leverage Machine Learning batch execution for scoring, retraining, and the update resource activity;
- extend the data factory with custom processing steps;
- load data into a relational store
- visualize using Power BI
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-get-started-visualize-with-power-bi
- Design a deployment strategy for an end-to-end solution
- Leverage PowerShell for deployment,
- automate deployment programmatically
https://msdn.microsoft.com/library/mt415893.aspx
https://msdn.microsoft.com/library/dn906738.aspx