16. AWS EMR - Access S3 Data with Hue, Hive and Apache Tez

Amazon EMR is a leading cloud-native big data platform. It processes vast amounts of data quickly and cost-effectively using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto. Combined with the auto-scaling capability of Amazon EC2 and the storage scalability of Amazon S3, EMR gives you the flexibility to run short-lived clusters that scale automatically to meet demand, or long-running, highly available clusters.

AWS provides many services that make this kind of work easier, among them: Amazon EC2, Amazon RDS, Amazon S3, Amazon CloudFront, Amazon EC2 Auto Scaling, AWS Lambda, Amazon Redshift, and Amazon Elastic MapReduce (EMR). The service we deal with here is Amazon EMR. EMR, short for Elastic MapReduce, offers an easy, approachable way to process large volumes of data.

Imagine a big data scenario with a huge dataset and a set of operations running over it, say a MapReduce job. One of the major issues big data applications face is program tuning: it is often difficult to fine-tune a program so that all the resources allocated to it are actually consumed, and poor tuning gradually increases processing time. Elastic MapReduce, the service by Amazon, is a web service providing a framework that manages everything big data processing needs in a cost-effective, fast, and secure manner. From cluster creation to data distribution across instances, all of this is handled by Amazon EMR. Furthermore, the service is on-demand: we can control the number of instances based on our data, which makes it both cost-efficient and scalable.

Reasons for Using AWS EMR

So why use EMR? What makes it better than the alternatives?
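As a concrete illustration, a short-lived cluster of the kind described above, with the Hive, Hue, and Tez applications this walkthrough uses, can be created from the AWS CLI. The sketch below only generates the helper script; the key pair, log bucket, instance types, and release label are placeholders to replace with your own values before running it with configured AWS credentials.

```shell
# Sketch: write a helper script that would create a transient EMR cluster
# with Hive, Hue, and Tez installed. All names below (key pair, log bucket,
# instance sizes, release label) are illustrative placeholders.
cat > create_demo_cluster.sh <<'EOF'
#!/bin/sh
aws emr create-cluster \
  --name "hive-hue-tez-demo" \
  --release-label emr-6.15.0 \
  --applications Name=Hive Name=Hue Name=Tez \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m5.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,InstanceType=m5.xlarge \
    InstanceGroupType=TASK,InstanceCount=2,InstanceType=m5.xlarge \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --log-uri "s3://my-log-bucket/emr-logs/" \
  --auto-terminate
EOF
chmod +x create_demo_cluster.sh
echo "wrote create_demo_cluster.sh"
```

Note the `--auto-terminate` flag: it is what turns this into a short-lived cluster that shuts down once its work is done; omit it for a long-running cluster. The three `--instance-groups` entries map directly onto the master, core, and task node types discussed below.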
First, we often encounter a fundamental problem: we cannot hand-allocate all the resources available on a cluster to an application. Amazon EMR takes care of this, allocating the necessary resources based on the size of the data and the demands of the application; and being elastic in nature, the allocation can be changed as needed. Second, EMR has broad application support, be it Hadoop, Spark, or HBase, which makes data processing easier. It supports various ETL operations quickly and cost-effectively, and it can also be used with Spark's MLlib to run various machine learning algorithms. Whether the workload is batch data or real-time streaming, EMR can organize and process both.

1. Clusters are the central component of the Amazon EMR architecture. A cluster is a collection of EC2 instances called nodes. Each node has a specific role within the cluster, termed its node type, and based on these roles we can classify nodes into three types: master node, core node, and task node.
2. The master node, as the name suggests, manages the cluster: it runs the cluster components, distributes data over the nodes for processing, keeps track of whether everything is running properly, and steps in in the case of failure.
3. The core node runs tasks and stores data in HDFS within the cluster. Core nodes handle the processing work, and the processed data is written to the desired HDFS location.
4. The task node is optional and only runs tasks; it does not store data in HDFS.
5. When submitting a job, we can choose among several ways to complete the work, from a cluster that terminates after job completion to a long-running cluster, using the EMR console or the CLI to submit steps.
6. We can run jobs directly on the cluster by connecting to the master node through the available interfaces and tools.
7. We can also process data in steps: we submit one or more ordered steps to the EMR cluster, and the data, stored as files, is processed sequentially. Each step can be traced from the Pending state to the Completed state, and errors are just as easy to trace through the Failed and Cancelled states.
8. Once all instances are terminated, the cluster reaches its completed state.

High speed: since all resources are utilized properly, query processing time is faster than with other data processing tools. Bulk data processing: however large the data, EMR can process huge amounts of it in a reasonable amount of time.
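The step-based workflow above, combined with the S3 access in this walkthrough's title, can be sketched as follows: a Hive script exposes a file already sitting in S3 as an external table and queries it on Tez. The bucket names, table schema, and cluster ID below are assumptions for illustration, not real resources; the block only writes the script locally, and the commented commands show how it would be uploaded and submitted as a step.

```shell
# Sketch: a Hive script that maps data in S3 to an external table.
# Bucket names, schema, and cluster id are illustrative placeholders.
cat > s3_sales_query.hql <<'EOF'
-- External table over files in S3; dropping the table leaves the data intact.
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  order_id STRING,
  amount   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-data-bucket/sales/';

-- Run the query on Tez, the execution engine used in this walkthrough.
SET hive.execution.engine=tez;
SELECT COUNT(*) FROM sales;
EOF

# The script must be readable by the cluster, so it would first be copied to S3:
#   aws s3 cp s3_sales_query.hql s3://my-script-bucket/s3_sales_query.hql
# and then submitted as an ordered step (Pending -> Running -> Completed):
#   aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
#     --steps Type=Hive,Name="s3 sales count",ActionOnFailure=CONTINUE,Args=[-f,s3://my-script-bucket/s3_sales_query.hql]
echo "wrote s3_sales_query.hql"
```

The same query can also be run interactively from Hue's Hive editor once the cluster is up, which is often the easier way to experiment before wiring a script into a step.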

