https://mlwhiz.com/blog/2015/05/09/hadoop_mapreduce_streaming_tricks_and_technique/
https://www.tutorialandexample.com/hadoop-interview-questions
https://engineering.linkedin.com/blog/2019/02/the-present-and-future-of-apache-hadoop--a-community-meetup-at-l

HBASE
-----
https://www.stackchief.com/blog/Top%203%20Most%20Important%20Things%20to%20Know%20About%20HBase

HBase is schemaless. The data stored in HBase is not normalized, meaning there is no logical connection or relationship linking different tables of data. A column-oriented table uses a row key to access different column families, and these column families contain the actual columns, each holding multiple versions of the data. The result is a four-dimensional data model: accessing a single value requires knowing the row key, column family, column (qualifier), and version (a client-side sketch appears just after the HIVE notes below).

HBase relies on ZooKeeper for the coordination of cluster nodes. Updates to the database are recorded in a write-ahead log (WAL) and cached in a MemStore. Once the MemStore reaches its storage capacity, the accumulated changes are flushed to HFiles stored in HDFS. HBase achieves faster reads by consulting the MemStore before checking HFiles, and the WAL provides a backup for anything lost from the MemStore. When data is flushed from the MemStore to HFiles, the HFiles are replicated to other data nodes automatically.

Advantages of HBase
HBase provides a dual approach to data access. While its row-key-based table scans provide consistent, real-time reads and writes, it also leverages Hadoop MapReduce for batch jobs. This makes it suitable for both real-time querying and batch analytics. HBase also manages sharding and failover automatically.

Cassandra and HBase - Key differences
-------------------------------------------
Different Architecture
With HBase, a master node handles administrative functions and region assignments, while region servers manage the data nodes that perform the actual reads/writes on HDFS. HBase uses ZooKeeper to coordinate server state and operations, and it stores its data in HDFS.

Cassandra stores data outside of HDFS and the Hadoop cluster. Unlike HBase, all Cassandra nodes play the same role: there is no concept of masters and slaves, only multiple seed nodes. With this decentralized architecture, any node can serve any operation. Cassandra uses a gossip protocol instead of ZooKeeper to manage internode communication.

Which to use and why
Use HBase if you are emphasizing consistency with large-scale reads. Use Cassandra if high availability is the priority; it requires minimal setup with little administration overhead, making it easier to get started. If you work a lot with MapReduce and batch processing, HBase is preferred for its direct relationship with HDFS. Cassandra is good for single-row reads and is optimized for writes. Cassandra also offers more flexibility in CAP-theorem trade-offs: for example, you can configure consistency levels per operation in Cassandra, whereas HBase lacks such explicit configuration.

HIVE
-----
https://www.stackchief.com/blog/The%20Apache%20Hive%20Metastore

The Hive metastore is simply a relational database. It stores metadata related to the tables/schemas you create so that big data stored in HDFS can be queried easily. When you create a new Hive table, the information related to the schema (column names, data types) is stored in the Hive metastore relational database, along with other information such as input/output formats, partitions, and HDFS locations. The key point of Hive is to provide a SQL-like abstraction for running MapReduce jobs on Hadoop, which makes it a good fit for analytical (OLAP) queries.
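As a rough illustration of what the metastore tracks, the sketch below connects to HiveServer2 over JDBC and prints the output of DESCRIBE FORMATTED, which surfaces the metastore-backed metadata (columns and types, input/output formats, SerDe, partition keys, HDFS location). The host/port, database, and table name "page_views" are assumptions, and the Hive JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DescribeHiveTable {
  public static void main(String[] args) throws Exception {
    // May be unnecessary with newer JDBC-4 drivers, but harmless.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // HiveServer2 JDBC URL; adjust host/port/database for your cluster (assumed values).
    Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // DESCRIBE FORMATTED returns rows of (col_name, data_type, comment),
    // including sections for storage information and table parameters.
    ResultSet rs = stmt.executeQuery("DESCRIBE FORMATTED page_views");
    while (rs.next()) {
      System.out.printf("%s\t%s\t%s%n", rs.getString(1), rs.getString(2), rs.getString(3));
    }
    con.close();
  }
}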
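Returning to the HBase four-dimensional data model described above: a minimal client-side sketch, assuming the standard HBase Java client API, of reading one cell by its four coordinates. The table name "users", column family "info", and qualifier "email" are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FourDimensionalGet {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {
      Get get = new Get(Bytes.toBytes("row-42"));                    // dimension 1: row key
      get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"));  // dimensions 2 and 3: family + qualifier
      get.setMaxVersions(3);                                         // dimension 4: how many versions to fetch
      Result result = table.get(get);
      for (Cell cell : result.rawCells()) {
        // Each returned cell is one version, identified by its timestamp.
        System.out.println(cell.getTimestamp() + " -> " + Bytes.toString(CellUtil.cloneValue(cell)));
      }
    }
  }
}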
https://www.packtpub.com/big-data-and-business-intelligence/hadoop-mapreduce-cookbook

To be used as a value data type of a MapReduce computation, a data type must implement the org.apache.hadoop.io.Writable interface. To be used as a key data type, it must implement the org.apache.hadoop.io.WritableComparable interface. The Writable interface consists of two methods, readFields() and write(). The WritableComparable interface adds the compareTo() method to the readFields() and write() methods of Writable (a sketch of a custom key type follows the Partitioner notes below).

Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output: it is processed by reduce tasks to produce the final output, and once the job is complete, the map output can be thrown away. Storing it in HDFS with replication would therefore be overkill.

Oozie provides a workflow/job coordinator; within the MapReduce API itself, the Hadoop ControlledJob and JobControl classes provide a way to chain jobs and express dependencies between them.

JOIN: Reduce-Side Join or Map-Side Join
=======================================
https://www.edureka.co/blog/map-side-join-vs-join/
https://www.edureka.co/blog/mapreduce-example-reduce-side-join/
http://kickstarthadoop.blogspot.in/2011/05/hadoop-for-dependent-data-splits-using.html
http://kickstarthadoop.blogspot.in/2011/09/joins-with-plain-map-reduce.html

Three input files, as follows:
1. UserDetails.txt - every record is of the format 'mobile number, consumer name'
2. DeliveryDetails.txt - every record is of the format 'mobile number, delivery status code'
3. DeliveryStatusCodes.txt - every record is of the format 'delivery status code, status message'

Expected output:
Jim, Delivered
Tom, Pending
Harry, Failed
Richa, Resend

Partitioner
===========
The sorted set of keys and their values in a partition form the input for a reduce task. In Hadoop, the total number of partitions should equal the number of reduce tasks of the MapReduce computation. Hadoop Partitioners should extend the org.apache.hadoop.mapreduce.Partitioner abstract class. Hadoop uses org.apache.hadoop.mapreduce.lib.partition.HashPartitioner as the default Partitioner; it partitions keys based on their hashCode(), using the formula key.hashCode() mod r, where r is the number of reduce tasks.
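A minimal custom Partitioner sketch following the HashPartitioner formula above. The composite "countryCode|userId" Text key is hypothetical, not from the cookbook; the idea is to route all records for one country to the same reducer by hashing only part of the key.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CountryPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Partition on the country prefix of a hypothetical "countryCode|userId" key.
    String countryCode = key.toString().split("\\|")[0];
    // Same idea as HashPartitioner: hashCode mod r, masked to keep the result non-negative.
    return (countryCode.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

It would be registered on the job with job.setPartitionerClass(CountryPartitioner.class).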
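And, as noted at the start of this section, a key type must implement WritableComparable. A hedged sketch of a hypothetical composite key (not from the cookbook) with the required write(), readFields(), and compareTo() methods:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {
  private int year;
  private int month;

  public YearMonthKey() {}  // no-arg constructor required by Hadoop's serialization
  public YearMonthKey(int year, int month) { this.year = year; this.month = month; }

  @Override
  public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
    out.writeInt(year);
    out.writeInt(month);
  }

  @Override
  public void readFields(DataInput in) throws IOException { // deserialize in the same order
    year = in.readInt();
    month = in.readInt();
  }

  @Override
  public int compareTo(YearMonthKey other) {  // defines the sort order of intermediate keys
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : Integer.compare(month, other.month);
  }

  @Override
  public int hashCode() { return year * 31 + month; }  // used by the default HashPartitioner

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof YearMonthKey)) return false;
    YearMonthKey k = (YearMonthKey) o;
    return year == k.year && month == k.month;
  }

  @Override
  public String toString() { return year + "-" + month; }
}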
SequenceFileOutputFormat
========================

WordCount
=========
The WordCount sample counts the number of word occurrences within a set of input documents. Locate the sample code from src/chapter1/Wordcount.java. The code has three parts: mapper, reducer, and the main program.

Map
===
The mapper extends the org.apache.hadoop.mapreduce.Mapper class; Hadoop passes each line of the input files as an input to the mapper. The map function breaks each line into substrings using whitespace characters as separators, and for each token (word) emits (word, 1) as the output.

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  StringTokenizer itr = new StringTokenizer(value.toString());
  while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());              // word is a reusable Text field of the mapper
    context.write(word, new IntWritable(1));
  }
}

Reduce
========
The reduce function receives all the values that have the same key as the input, and it outputs the key and the number of occurrences of the key as the output.

public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  result.set(sum);                          // result is a reusable IntWritable field of the reducer
  context.write(key, result);
}

The main
============
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
  System.err.println("Usage: wordcount <in> <out>");
  System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
// Uncomment this to run a combiner after each map task:
// job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);

Combiner
===========
A combiner only works with commutative and associative functions. For example, the same idea does not work when calculating a mean: because the mean of partial means is not the overall mean, a combiner that simply re-applies the reduce function will yield a wrong result (see the small numeric check at the end of these notes).

Speculative execution: if most of the map tasks have completed and Hadoop is waiting for a few more, the Hadoop JobTracker will start a second copy of the remaining task(s) on other nodes and use the output of whichever copy finishes first.

Distributed cache
========================

Hadoop configuration
======================
conf/core-site.xml:
conf/hdfs-site.xml:
conf/mapred-site.xml:

JVM reuse
=============
In its default configuration, Hadoop starts a fresh JVM for each map or reduce task it executes. This recipe explains how to control this behavior.

How to do it...
1. Run the WordCount sample by passing the following option as an argument:
   >bin/hadoop jar hadoop-examples-1.0.0.jar wordcount -Dmapred.job.reuse.jvm.num.tasks=-1 /data/input1 /data/output1
2. Monitor the number of processes created by Hadoop (through the ps -ef | grep hadoop command in Unix or the task manager in Windows). Hadoop now starts only a single JVM per task slot and then reuses it for an unlimited number of tasks in the job.

However, passing arguments through the -D option only works if the job implements the org.apache.hadoop.util.Tool interface. Otherwise, you should set the option through the JobConf.setNumTasksToExecutePerJvm(-1) method.
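A minimal sketch of wiring the driver through the Tool interface so that generic options such as -Dmapred.job.reuse.jvm.num.tasks=-1 are picked up automatically by ToolRunner. It assumes the TokenizerMapper and IntSumReducer classes from the WordCount sample above and is not the cookbook's exact driver.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D options parsed by ToolRunner.
    Job job = Job.getInstance(getConf(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);   // from the WordCount sample above
    job.setReducerClass(IntSumReducer.class);    // from the WordCount sample above
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options (-D, -files, -libjars, ...) before calling run().
    System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
  }
}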
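Returning to the combiner note above: a small standalone check (not from the cookbook, numbers are arbitrary) showing that averaging partial averages gives the wrong answer, while carrying partial sums and counts through to a final division stays correct. This is why a naive mean combiner breaks, and why the usual fix is to emit (sum, count) pairs.

public class MeanCombinerPitfall {
  public static void main(String[] args) {
    double[] split1 = {1, 2};        // values seen by one map task
    double[] split2 = {3, 4, 5};     // values seen by another map task

    // Wrong: combine by taking the mean of the partial means.
    double meanOfMeans = (mean(split1) + mean(split2)) / 2;                            // 2.75

    // Right: combine partial sums and counts, divide once at the end.
    double trueMean = (sum(split1) + sum(split2)) / (split1.length + split2.length);   // 3.0

    System.out.println("mean of means = " + meanOfMeans + ", true mean = " + trueMean);
  }

  static double sum(double[] xs) { double s = 0; for (double x : xs) s += x; return s; }
  static double mean(double[] xs) { return sum(xs) / xs.length; }
}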