
Apache Spark Tutorial 2018 | Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training

Apache Spark Tutorial 2018 | Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training
https://acadgild.com/big-data/big-data-development-training-certification?aff_id=6003&source=youtube&account=X4R8GweypwQ&campaign=youtube_channel&utm_source=youtube&utm_medium=apache-spark-tutorial&utm_campaign=youtube_channel

Hello and welcome to this Apache Spark tutorial, powered by Acadgild. This is an instructor-led online tutorial on Apache Spark, and the following topics are covered.

Topics Covered:
• What is Apache Spark
• Hadoop MapReduce limitations
• Spark vs Hadoop MapReduce
• Why RDDs
• How RDDs are fault tolerant
• Additional traits of RDDs
• The two kinds of RDD operations: transformations and actions
• SparkContext
• SQLContext

Why Is Apache Spark in Such Demand?
• It is a fast, in-memory data-processing engine.
• It can efficiently execute streaming, machine-learning, or SQL workloads that require fast, iterative access to datasets.
• It can run on top of Apache Hadoop YARN.
• It is designed for data science, and its abstractions make data science easier.
• It can cache datasets in memory and speed up iterative data processing (see the sketch after this list).
• It includes MLlib, Spark's machine-learning library.
• It is up to 100x faster than MapReduce in benchmark tests.
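The caching point above, and the transformations/actions topic from the list, are easiest to see in code. The following Scala sketch is illustrative only and is not taken from the video; the input path logs.txt and the ERROR filter are made-up placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext; on a cluster this would point at YARN or Mesos.
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("logs.txt")            // "logs.txt" is a placeholder path
    val errors = lines.filter(_.contains("ERROR")) // transformation: lazy, nothing runs yet

    errors.cache()                                 // keep the filtered RDD in memory

    println(s"error lines: ${errors.count()}")     // action #1: reads, filters, and caches
    errors.take(5).foreach(println)                // action #2: served from the in-memory cache

    sc.stop()
  }
}

The second action is where the "iterative access" claim pays off: because of cache(), it reuses the in-memory result instead of re-reading the file from disk.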
Hadoop MapReduce Limitations:
• It is based on disk-based computing.
• It is suitable for single-pass computations, not iterative ones.
• It needs a sequence of MapReduce jobs to run iterative tasks.
• It needs to be integrated with several other frameworks/tools to solve big data use cases:
1. Apache Storm for stream data processing
2. Apache Mahout for machine learning

Spark vs Hadoop MapReduce

Performance:
• Spark processes data in memory, while MapReduce persists results back to disk after each job, so Spark should outperform MapReduce. Nonetheless, Spark needs a lot of memory.
• If the data is too big to fit in memory, Spark's performance degrades heavily.
• MapReduce kills its processes as soon as a job is done, so it runs easily alongside other services with only minor performance differences.
• Still, Spark has the advantage as long as we are talking about iterative operations on the data.

Ease of Use:
• Spark has APIs for Java, Scala, and Python.
• Spark SQL provides a familiar SQL interface.
• In Spark, it is easy to write user-defined functions (a short sketch appears at the end of this description).
• Hadoop MapReduce in Java is difficult to program; Pig makes it easier (though it has its own syntax to learn), and Hive adds SQL compatibility.
• Unlike Spark, MapReduce has no interactive mode.

Cost:
• Memory in a Spark cluster needs to be at least as large as the data being processed, so disk-based Hadoop is the cheaper option.
• Considering Spark's benchmarks, though, less hardware can perform the same task much faster, especially in the cloud, where compute power is paid for per use.

Compatibility:
• Spark can run standalone, on top of Hadoop YARN, or on Mesos on premises.
• It supports any data source that implements the Hadoop InputFormat, so it is compatible with all data sources and file formats that Hadoop supports.
• It also works with BI tools via JDBC and ODBC.

Data Processing:
• Spark can do more than plain data processing: it can process graphs and work with existing machine-learning libraries.
• Spark can do real-time as well as batch processing.
• Hadoop MapReduce is great for batch processing, but for real-time processing on top of it you need platforms like Storm or Impala, and Giraph for graph processing.
• Spark is the Swiss army knife of data processing; Hadoop MapReduce is the commando knife of batch processing.

Fault Tolerance:
• Spark retries per task and supports speculative execution, just like MapReduce.
• But MapReduce relies on hard drives, so if a process crashes in the middle of execution it can carry on from where it left off, whereas Spark has to retry from the beginning.
• Both have good fault tolerance; MapReduce is slightly more tolerant.

Security:
• Spark can run on top of YARN and use HDFS, so it can enjoy Kerberos authentication, HDFS file permissions, and encryption between nodes.
• Hadoop MapReduce enjoys all Hadoop security benefits and integrates with Hadoop security projects like Knox Gateway and Sentry.

Kindly go through the complete video to learn about the remaining topics: why RDDs, how RDDs are fault tolerant, SparkContext, and more.

#Sparktutorial, #Tutorial, #ApacheSpark, #Hadoop

Please like, share, and subscribe to the channel for more such videos.

For more updates on courses and tips, follow us on:
Facebook: https://www.facebook.com/acadgild
Twitter: https://twitter.com/acadgild
LinkedIn: https://www.linkedin.com/company/acadgild
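Finally, as promised in the Ease of Use section, here is a minimal illustration of how little code a Spark SQL user-defined function takes. This Scala sketch is not from the video: the table, column names, and the shout function are invented for illustration, and SparkSession is the Spark 2.x entry point that subsumes the SQLContext named in the topics list.

import org.apache.spark.sql.SparkSession

object UdfSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // A tiny made-up table, registered so it is queryable from SQL.
    val people = Seq(("ann", 34), ("bob", 29)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // Registering a user-defined function is a one-liner...
    spark.udf.register("shout", (s: String) => s.toUpperCase)

    // ...and it is immediately usable from plain SQL.
    spark.sql("SELECT shout(name) AS name, age FROM people").show()

    spark.stop()
  }
}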

