Extracting hit counts from a log file using MapReduce

Aditya Vikas Devarapalli

I am trying to write the following in Hadoop MapReduce. I have a log file that contains IP addresses, each followed by the URL opened by that IP, as follows:

192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com
192.168.198.92 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.facebook.com
192.168.198.92 www.indiabix.com
192.168.72.177 www.indiabix.com
192.168.72.224 www.google.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.askubuntu.com
192.168.198.92 www.facebook.com
192.168.198.92 www.gmail.com
192.168.72.177 www.facebook.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.72.224 www.yahoo.com
192.168.72.177 www.google.com
192.168.72.177 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.m4maths.com
192.168.72.177 www.yahoo.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.facebook.com
192.168.198.92 www.google.com
192.168.198.92 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.198.92 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.google.com
192.168.72.177 www.yahoo.com
192.168.72.224 www.yahoo.com
192.168.198.92 www.m4maths.com
192.168.198.92 www.facebook.com
192.168.72.224 www.gmail.com
192.168.72.177 www.google.com
192.168.72.224 www.indiabix.com
192.168.72.224 www.indiabix.com
192.168.72.177 www.m4maths.com
192.168.72.224 www.indiabix.com

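Now I need to organize the results from this file so that it lists the distinct IP addresses with their URLs, each followed by the number of times that IP opened that particular URL.

For example, if 192.168.72.224 opened www.yahoo.com 15 times across the whole log file, the output must contain: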

192.168.72.224 www.yahoo.com 15

This should be done for every IP in the file, and the final output should look like this:

192.168.72.224 www.yahoo.com 15
               www.m4maths.com 11
192.168.72.177 www.yahoo.com 6
               www.gmail.com 19
....
...
..
.

The code I tried is:

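import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        // Emits every token on its own, so IPs and URLs are counted independently.
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}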

I know this code is seriously flawed; please suggest an idea for moving forward.

Thank you.

0x0FFF

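I would propose this design:

  1. Mapper: takes a line from the file and outputs the IP as the key and the pair (website, 1) as the value.
  2. Combiner and reducer: take the IP as the key and a sequence of (website, count) pairs, aggregate them by website (using a HashMap), and output the IP, website, and count.

To implement this, you would need a custom Writable to handle the (website, count) pair.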
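To make the data flow of this design concrete, here is a minimal pure-Python sketch of it. It is not Hadoop code: `map_line`, `reduce_ip`, and `run_job` are hypothetical names chosen for illustration, the shuffle phase is simulated with an in-memory dictionary, and a plain tuple stands in for the custom Writable.

```python
from collections import defaultdict


def map_line(line):
    """Mapper step: emit (ip, (url, 1)) for one log line; skip malformed lines."""
    parts = line.split()
    if len(parts) != 2:
        return []
    ip, url = parts
    return [(ip, (url, 1))]


def reduce_ip(ip, pairs):
    """Reducer step: sum the counts per URL for a single IP."""
    totals = defaultdict(int)
    for url, count in pairs:
        totals[url] += count
    return [(ip, url, total) for url, total in sorted(totals.items())]


def run_job(lines):
    """Stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for ip, pair in map_line(line):
            grouped[ip].append(pair)
    results = []
    for ip in sorted(grouped):
        results.extend(reduce_ip(ip, grouped[ip]))
    return results


# The first four lines of the log from the question.
sample = [
    "192.168.72.224 www.m4maths.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.177 www.yahoo.com",
    "192.168.72.224 www.facebook.com",
]
for ip, url, total in run_job(sample):
    print(ip, url, total)
```

Because `reduce_ip` adds up whatever counts it receives rather than assuming each one is 1, the same aggregation logic could also serve as a combiner, provided the combiner emits (ip, (url, partial count)) so that its output type matches the mapper's.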

Personally I would use Spark for this, unless you are too concerned about performance. With PySpark it would be as simple as this:

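# Assumes an existing SparkContext named sc, as in the pyspark shell.
rdd = sc.textFile('/sparkdemo/log.txt')
# Count occurrences of each (ip, url) pair.
counts = (rdd.map(lambda line: line.split())
             .map(lambda parts: ((parts[0], parts[1]), 1))
             .reduceByKey(lambda x, y: x + y))
# Re-key by IP and gather that IP's (url, count) pairs.
# Indexing replaces the Python 2-only tuple-parameter lambda.
result = counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().collect()
for ip, sites in result:
    print('IP: %s' % ip)
    for url, cnt in sites:
        print('    website: %s count: %d' % (url, cnt))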

The output for your example is:

IP: 192.168.72.224
    website: www.facebook.com count: 2
    website: www.m4maths.com count: 2
    website: www.google.com count: 5
    website: www.gmail.com count: 4
    website: www.indiabix.com count: 8
    website: www.yahoo.com count: 3
IP: 192.168.72.177
    website: www.yahoo.com count: 14
    website: www.google.com count: 3
    website: www.facebook.com count: 3
    website: www.m4maths.com count: 3
    website: www.indiabix.com count: 1
IP: 192.168.198.92
    website: www.facebook.com count: 4
    website: www.m4maths.com count: 3
    website: www.yahoo.com count: 3
    website: www.askubuntu.com count: 2
    website: www.indiabix.com count: 1
    website: www.google.com count: 5
    website: www.gmail.com count: 1
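
The layout asked for in the question prints each IP only once and indents its remaining URLs beneath it. A small formatter along the following lines could produce that from the grouped counts. This is only a sketch: `format_report` is a hypothetical helper, and it assumes the results have already been collected into a dictionary mapping each IP to a list of (url, count) pairs.

```python
def format_report(counts):
    """Render {ip: [(url, count), ...]} with the IP shown only on its first row."""
    lines = []
    for ip, sites in counts.items():
        pad = " " * len(ip)  # blank column as wide as the IP itself
        for i, (url, cnt) in enumerate(sites):
            prefix = ip if i == 0 else pad
            lines.append("%s %s %d" % (prefix, url, cnt))
    return "\n".join(lines)


# A few of the counts from the output above.
counts = {
    "192.168.72.177": [("www.yahoo.com", 14), ("www.google.com", 3)],
    "192.168.198.92": [("www.askubuntu.com", 2)],
}
print(format_report(counts))
```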
