I am trying to write the following in Hadoop MapReduce.
I have a log file containing IP addresses, each followed by the URL that IP opened. It looks like this:
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
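Now I need to organize the results from this file so that the output lists each distinct IP address with its URLs, followed by the number of times that IP opened each URL.
For example, if 192.168.72.224 opened www.yahoo.com 15 times across the whole log file, the output must contain:
192.168.72.224 www.yahoo.com 15
This should be done for every IP in the file, and the final output should look like this: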
192.168.72.224 www.yahoo.com 15
               www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
               www.gmail.com 19
...
The code I have tried is:
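import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException
    {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Emits every whitespace-separated token (IP and URL alike) as its own key
        while (tokenizer.hasMoreTokens())
        {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}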
I know this code is seriously flawed; please suggest how I should move forward.
Thank you.
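I would propose this design: have the mapper emit the (IP, URL) pair as the key with a count of 1, and have the reducer sum the counts for each key.
To implement this in Hadoop, you need a custom Writable to handle the (IP, URL) pair.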
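The flow can be illustrated without a cluster. What follows is a minimal sketch in plain Python, not Hadoop code: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are made up for illustration, and a Python tuple stands in for the custom Writable that would carry the composite key in a real job.

```python
from collections import defaultdict


def map_phase(lines):
    """Mapper: emit ((ip, url), 1) for every well-formed log line."""
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ip, url = parts
        yield (ip, url), 1


def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same composite key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reducer: sum the ones for each (ip, url) key."""
    return {key: sum(values) for key, values in grouped.items()}


# A few lines taken from the start of the log in the question
sample = [
    "192.168.72.224 www.m4maths.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.224 www.facebook.com",
]

counts = reduce_phase(shuffle_phase(map_phase(sample)))
for (ip, url), n in sorted(counts.items()):
    print(ip, url, n)
```

The key point is that the IP and the URL travel together as one key, so the framework groups on the pair rather than on each token separately.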
Personally, I would use Spark for this unless you are overly concerned about performance. With PySpark it is as simple as:
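from pyspark import SparkContext

sc = SparkContext(appName='ip-url-counts')  # omit in the pyspark shell, where sc already exists
rdd = sc.textFile('/sparkdemo/log.txt')
# Key each line by (ip, url) and sum the ones per key
counts = rdd.map(lambda line: line.split()).map(lambda f: ((f[0], f[1]), 1)).reduceByKey(lambda x, y: x + y)
# Re-key by ip so that each ip collects its (url, count) pairs
result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for url, cnt in sites:
        print(' website: %s count: %d' % (url, cnt))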
The output for your example would be:
IP: 192.168.72.224
 website: www.facebook.com count: 2
 website: www.m4maths.com count: 2
 website: www.google.com count: 5
 website: www.gmail.com count: 4
 website: www.indiabix.com count: 8
 website: www.yahoo.com count: 3
IP: 192.168.72.177
 website: www.yahoo.com count: 14
 website: www.google.com count: 3
 website: www.facebook.com count: 3
 website: www.m4maths.com count: 3
 website: www.indiabix.com count: 1
IP: 192.168.198.92
 website: www.facebook.com count: 4
 website: www.m4maths.com count: 3
 website: www.yahoo.com count: 3
 website: www.askubuntu.com count: 2
 website: www.indiabix.com count: 1
 website: www.google.com count: 5
 website: www.gmail.com count: 1
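For a log small enough to fit in memory, the regrouping by IP can be cross-checked without Spark. This is only a sketch under that assumption: `count_by_ip` and `format_report` are hypothetical helper names, and the layout (IP shown on the first row of each group only) is one reading of the format requested in the question.

```python
from collections import Counter, defaultdict


def count_by_ip(lines):
    """Return {ip: {url: count}} for whitespace-separated 'ip url' lines."""
    pair_counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # ignore blank or malformed lines
            pair_counts[(parts[0], parts[1])] += 1
    by_ip = defaultdict(dict)
    for (ip, url), n in pair_counts.items():
        by_ip[ip][url] = n
    return dict(by_ip)


def format_report(by_ip):
    """Print the IP on the first row of each group and pad the rest."""
    rows = []
    for ip in sorted(by_ip):
        for i, url in enumerate(sorted(by_ip[ip])):
            prefix = ip if i == 0 else " " * len(ip)
            rows.append(f"{prefix} {url} {by_ip[ip][url]}")
    return "\n".join(rows)


# The first six lines of the log in the question
sample = [
    "192.168.72.224 www.m4maths.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.224 www.facebook.com",
    "192.168.72.224 www.gmail.com",
    "192.168.72.177 www.facebook.com",
]

by_ip = count_by_ip(sample)
report = format_report(by_ip)
print(report)
```

Sorting here is only to make the report deterministic; Spark's `groupByKey` gives no ordering guarantee, which is why the order in the output above looks arbitrary.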