sunwinner
Article list
Secondary sort allows some records to arrive at a reducer ahead of other records. It requires an understanding of both data arrangement and data flow (partitioning, sorting and grouping) and how they are integrated into MapReduce. As the figure below shows: The partitioner is invoked a ...
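The partitioning, sorting and grouping steps named in the excerpt above can be sketched outside Hadoop. This is a minimal in-memory simulation, not the post's implementation: the function names, the byte-sum partitioner and the sample records are all assumptions made for illustration. The composite key is (natural key, value); partitioning and grouping look only at the natural key, while sorting uses the whole composite key.

```python
from itertools import groupby


def partition(natural_key, num_reducers):
    # Hypothetical partitioner: a deterministic byte sum of the natural key,
    # so every record with the same natural key reaches the same reducer.
    return sum(natural_key.encode("utf-8")) % num_reducers


def secondary_sort(records, num_reducers=2):
    """Simulate partition -> sort -> group for (natural_key, value) records."""
    partitions = [[] for _ in range(num_reducers)]
    for natural_key, value in records:
        composite_key = (natural_key, value)
        index = partition(natural_key, num_reducers)
        partitions[index].append((composite_key, value))

    reducer_inputs = []
    for part in partitions:
        # Sort on the full composite key: natural key first, then value.
        part.sort(key=lambda pair: pair[0])
        # Group on the natural key only, so one reduce call sees all of
        # a key's values, already in sorted order.
        for natural_key, group in groupby(part, key=lambda pair: pair[0][0]):
            reducer_inputs.append((natural_key, [value for _, value in group]))
    return reducer_inputs
```

Calling `secondary_sort` on (last name, first name) pairs yields one entry per last name, with the first names already ordered before the reducer would iterate over them.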
In the relational world, a semi-join can be defined as a join between two tables that returns rows from the first table where one or more matches are found in the second table. The difference between a semi-join and a conventional join is that rows in the first table will be returned at most once. Even if the ...
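The "at most once" property described above is easy to see in a small in-memory sketch. This is only an illustration of the relational definition, not a MapReduce implementation; `semi_join`, `inner_join` and the key-extractor parameters are hypothetical names.

```python
def semi_join(left_rows, right_rows, left_key, right_key):
    """Return each left row at most once if its key appears in right_rows."""
    right_keys = {right_key(row) for row in right_rows}
    return [row for row in left_rows if left_key(row) in right_keys]


def inner_join(left_rows, right_rows, left_key, right_key):
    """Conventional join, for contrast: one output row per matching pair."""
    return [
        (left, right)
        for left in left_rows
        for right in right_rows
        if left_key(left) == right_key(right)
    ]
```

With a user that matches two rows in the second table, `inner_join` emits that user twice while `semi_join` emits it once.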
Map-side join is also known as replicated join, and gets its name from the fact that the smallest of the datasets is replicated to all the map hosts. You can find an implementation in Hadoop in Action. Another implementation uses CompositeInputFormat, which is shown in this blog post. The goal of ...
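The replicated-join idea above can be sketched as follows: each mapper loads the small dataset into memory once, then joins every record of the large dataset against it, so no reduce phase is needed. This is a plain Python simulation under that assumption; `build_lookup` and `map_join` are hypothetical names, and it does not use CompositeInputFormat or any Hadoop API.

```python
def build_lookup(small_dataset):
    """Load the replicated (small) dataset into an in-memory hash table."""
    lookup = {}
    for key, value in small_dataset:
        lookup.setdefault(key, []).append(value)
    return lookup


def map_join(record, lookup):
    """Map function: join one large-dataset record against the lookup table."""
    key, value = record
    for small_value in lookup.get(key, []):
        yield key, value, small_value


def replicated_join(large_dataset, small_dataset):
    # In Hadoop the lookup would be built once per map task, typically
    # during task setup; here it is built once for the whole simulation.
    lookup = build_lookup(small_dataset)
    return [row for record in large_dataset for row in map_join(record, lookup)]
```

The sketch only works when the small dataset fits in a mapper's memory, which is the same constraint the technique has on a real cluster.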
Env: single node with CentOS 6.2 x86_64, 2 processors, 4 GB memory; CDH4.3 with Cloudera Manager 4.5; HBase 0.94.6-cdh4.3.0. HBase shell exercise: [root@n8 ~]# hbase shell 13/07/21 21:11:25 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native ...
Generally there are three different ways of interacting with HBase from a MapReduce application. HBase can be used as a data source at the beginning of a job, as a data sink at the end of a job, or as a shared resource. HBase as a data source: The following example using HBase as a MapReduce sourc ...
Suppose you write some Java code that operates on HBase via the HBase Java client interface, then compile and package the Java source code into a jar called examples.jar. On a Hadoop cluster you can use "hbase classpath" to get the class path needed. $ java -cp examples.jar:`hbase classpath` hbase ...
Hadoop has a number of built-in mechanisms that can facilitate ingress and egress operations, to name a few: the embedded NameNode HTTP server; WebHDFS and Hadoop interfaces; the HBase built-in API, specifically the org.apache.hadoop.hbase.mapreduce.TableInputFormat and org.apache.hadoop.hbase.mapredu ...
To enable Oozie's web console, you must download and add the ExtJS library to the Oozie server. If you have not already done this, proceed as follows. If you use CDH3: Download the ExtJS version 2.2 library from http://extjs.com/deploy/ext-2.2.zip and place it in a convenient loc ...
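When receiving syslog-format logs over UDP or TCP, for example: flume dump 'syslogUdp(5140)'  This command receives logs over UDP on port 5140. Suppose you then want to test from the command line whether logs are received successfully: echo '<37>Hello from cmd.' |nc -u localhost 5140  You must prefix the test text with <37>, which is used to classify the log; otherwise flume throws the following error: 2013-07-16 08:26:49,614 [logicalNode dump-10] WARN syslog.SyslogUdpSource: 1 rejected pack ...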
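The <37> prefix in the test message above is the syslog priority (PRI) value, which by the syslog convention encodes facility * 8 + severity. The following minimal sketch decodes it; `parse_pri` is a hypothetical helper written for this illustration and is not Flume's own parsing code.

```python
import re

# A syslog message starts with "<PRI>", where PRI is a decimal number.
PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def parse_pri(message):
    """Split a syslog line into (facility, severity, remaining text)."""
    match = PRI_PATTERN.match(message)
    if match is None:
        # Assumption: treat a missing prefix as an error, which mirrors the
        # rejection described above but is not Flume's actual behavior.
        raise ValueError("missing <PRI> prefix")
    priority = int(match.group(1))
    facility, severity = divmod(priority, 8)  # PRI = facility * 8 + severity
    return facility, severity, message[match.end():]
```

For `<37>`, this gives facility 4 (security/authorization) and severity 5 (notice), since 37 = 4 * 8 + 5.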
Chapter 5 of Data-Intensive Text Processing with MapReduce introduces how to implement the PageRank algorithm in MapReduce. I am not going to discuss PageRank itself here; please refer to Wikipedia or other papers for further explanation. What I'm going to talk about is how to imple ...
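As a rough orientation for the MapReduce formulation mentioned above, one PageRank iteration can be simulated in memory: the map step sends each node's rank, divided evenly, to its outgoing links, and the reduce step sums what each node receives. This sketch makes simplifying assumptions that may differ from the book and the post: every node has at least one outgoing link (no dangling nodes), and the damping factor of 0.85 is an arbitrary illustrative choice.

```python
from collections import defaultdict


def pagerank_map(node, links, rank):
    """Map: distribute this node's rank evenly over its outgoing links."""
    share = rank / len(links)
    for target in links:
        yield target, share


def pagerank_iteration(graph, ranks, damping=0.85):
    """Run one simulated map -> shuffle -> reduce pass over the graph."""
    received = defaultdict(list)
    for node, links in graph.items():
        for target, share in pagerank_map(node, links, ranks[node]):
            received[target].append(share)  # shuffle: group shares by target

    node_count = len(graph)
    # Reduce: sum the incoming shares and apply the random-jump term.
    return {
        node: (1 - damping) / node_count + damping * sum(received[node])
        for node in graph
    }
```

Repeating `pagerank_iteration` until the ranks stop changing by more than a small tolerance gives the usual iterative scheme, with one MapReduce job per iteration on a real cluster.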
Chapter 5 of the book "Data-Intensive Text Processing with MapReduce" introduces how to parallelize breadth-first graph search with MapReduce. This parallel algorithm is a variant of Dijkstra's algorithm. I'm not going to talk about the sequential version of Dijkstra's algorithm, for ...
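The iterative structure of parallel breadth-first search can be sketched as an in-memory simulation: each map call passes the node's adjacency list along and proposes distance + 1 to every neighbor, and each reduce call keeps the smallest distance seen. The sketch assumes unit edge weights and uses hypothetical names throughout; it is not the post's or the book's code.

```python
import math
from collections import defaultdict


def bfs_map(node_id, node):
    distance, neighbors = node
    yield node_id, ("node", node)  # pass the graph structure along
    if distance != math.inf:
        for neighbor in neighbors:
            yield neighbor, ("dist", distance + 1)  # candidate distance


def bfs_reduce(node_id, values):
    best, neighbors = math.inf, []
    for kind, payload in values:
        if kind == "node":
            distance, neighbors = payload
            best = min(best, distance)
        else:
            best = min(best, payload)
    return node_id, (best, neighbors)


def bfs_iteration(graph):
    shuffled = defaultdict(list)
    for node_id, node in graph.items():
        for key, value in bfs_map(node_id, node):
            shuffled[key].append(value)
    return dict(bfs_reduce(key, values) for key, values in shuffled.items())


def parallel_bfs(adjacency, source):
    graph = {
        node: (0 if node == source else math.inf, list(neighbors))
        for node, neighbors in adjacency.items()
    }
    while True:  # one simulated MapReduce job per loop pass
        updated = bfs_iteration(graph)
        if updated == graph:  # no distance changed, so the search is done
            return {node: distance for node, (distance, _) in updated.items()}
        graph = updated
```

The number of passes grows with the longest shortest path from the source, which is why each iteration being a full MapReduce job matters for cost.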
To configure the MapReduce or YARN task scheduler, go to Services -> mapreduce1/yarn1 -> Configuration. Then click the 'view and edit' tab and search for the property 'mapred.jobtracker.taskScheduler'. You will see the options shown in the screenshot below:
Hadoop workshop homework. For privacy, this blog post will not show source code at all, only the job output logs and counters. Copy the packaged jar file into the Hadoop cluster: [root@n1 hadoop-examples]# scp gsun@192.168.1.102:~/prog/hadoop/cdh4-examples/cdh4-examples.jar . Password: cdh4-ex ...
Hadoop workshop homework. I am an IntelliJ IDEA guy now (I switched to IntelliJ IDEA from Eclipse several months ago because IntelliJ IDEA is much, much better than Eclipse now). Currently IntelliJ doesn't have any Hadoop plugins, so I package the output into a jar file, then copy the jar (c ...
In this blog post I introduce some of the benchmarking and testing tools in the Apache Hadoop distribution. Namely, I'll look at TeraSort, NNBench and MRBench. These are popular choices to benchmark a Hadoop cluster. Before we start, let me show you the clusters on which the tests will run: ...