阅读943 返回首页    go 微信


Hadoop Streaming__Hadoop_开发人员指南_E-MapReduce-阿里云

python 写hadoop streaming作业

mapper代码如下

  1. #!/usr/bin/env python
  2. import sys
  3. for line in sys.stdin:
  4. line = line.strip()
  5. words = line.split()
  6. for word in words:
  7. print '%st%s' % (word, 1)

reducer代码如下

  1. #!/usr/bin/env python
  2. from operator import itemgetter
  3. import sys
  4. current_word = None
  5. current_count = 0
  6. word = None
  7. for line in sys.stdin:
  8. line = line.strip()
  9. word, count = line.split('t', 1)
  10. try:
  11. count = int(count)
  12. except ValueError:
  13. continue
  14. if current_word == word:
  15. current_count += count
  16. else:
  17. if current_word:
  18. print '%st%s' % (current_word, current_count)
  19. current_count = count
  20. current_word = word
  21. if current_word == word:
  22. print '%st%s' % (current_word, current_count)

假设mapper代码保存在/home/hadoop/mapper.py, reducer代码保存在/home/hadoop/reducer.py , 输入路径为hdfs文件系统的/tmp/input,输出路径为hdfs文件系统的/tmp/output。则在E-MapReduce集群上提交下面的hadoop命令

hadoop jar /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-*.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input /tmp/hosts -output /tmp/output

最后更新:2016-11-23 16:04:18

  上一篇:go Pig 开发手册__Hadoop_开发人员指南_E-MapReduce-阿里云
  下一篇:go HBase 开发手册__开发人员指南_E-MapReduce-阿里云