[Programming Notes] Big Data Platform Fundamentals Course Summary: Hadoop Basics

The instructor's course slides are available here:

Hadoop

MODULES OF HADOOP

  • Hadoop Distributed File System (HDFS): A reliable, high-bandwidth, low-cost, data storage cluster that facilitates the management of related files across machines.
  • Hadoop MapReduce: A high-performance parallel/distributed data-processing implementation of the MapReduce algorithm.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop Common: The common utilities that support the other Hadoop modules.

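Basic Operations

For the most part, you take the familiar Linux command and prefix it with hadoop fs -, so the individual commands need no further explanation.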

hadoop fs
hadoop fs -help
hadoop fs -ls /
hadoop fs -ls /user/yanfei
hadoop fs -mv LICENSE license.txt
hadoop fs -mkdir yourNAME

Advanced Operations

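# Put /home/hadoop.txt into the current HDFS directory (.); a relative HDFS path
# resolves against your HDFS home directory, and an HDFS directory is essentially a logical location
hadoop fs -put /home/hadoop.txt .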
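The upload above is usually followed by a quick check that the file really arrived. Below is a minimal sketch of that sequence, assuming a running cluster with the hadoop CLI on the PATH. The helper name upload_and_verify is hypothetical; it only chains the standard hadoop fs subcommands -put, -test and -ls.

```shell
#!/usr/bin/env bash
# upload_and_verify is a hypothetical helper name, not part of Hadoop.
# Usage: upload_and_verify <local-file> [hdfs-dir]
upload_and_verify() {
  local src=$1
  local dest=${2:-.}   # "." is the current HDFS directory, as in the command above
  local name
  name=$(basename "$src")

  # Copy the local file into HDFS; without -f, -put fails if the target already exists.
  hadoop fs -put "$src" "$dest" || return 1

  # -test -e exits with status 0 when the path exists in HDFS.
  hadoop fs -test -e "$dest/$name" || return 1

  # List the uploaded file to inspect its size and replication factor.
  hadoop fs -ls "$dest/$name"
}
```

For example, upload_and_verify /home/hadoop.txt . repeats the upload shown above and then lists the resulting file.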
# hdfs fsck checks the health of files and blocks in HDFS
hdfs fsck

Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]]
<path>                    start checking from this path (the directory to check; defaults to the root directory / when omitted)
-move                     move corrupted files to /lost+found
-delete                   delete corrupted files
-files                    print out files being checked
-openforwrite             print out files opened for write
-includeSnapshots         include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
-list-corruptfileblocks   print out list of missing blocks and files they belong to
-blocks                   print out block report
-locations                print out locations for every block (that is, which nodes hold it)
-racks                    print out network topology for data-node locations
-storagepolicies          print out storage policy summary for the blocks
-blockId                  print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)
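As a hedged illustration of how these options combine in practice, the sketch below wraps two common invocations. The function names fsck_report and list_corrupt are hypothetical; the flags are the ones from the usage text above. Neither call passes -move or -delete, so both only report and do not change anything in HDFS.

```shell
#!/usr/bin/env bash
# fsck_report and list_corrupt are hypothetical wrapper names, not Hadoop commands.

# Print every file under the path together with its blocks and the
# DataNodes that hold each block. Defaults to the root directory.
fsck_report() {
  local path=${1:-/}
  hdfs fsck "$path" -files -blocks -locations
}

# Print only the missing or corrupt blocks and the files they belong to.
list_corrupt() {
  local path=${1:-/}
  hdfs fsck "$path" -list-corruptfileblocks
}
```

For example, fsck_report /user/yanfei inspects the directory used in the earlier listing commands.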

Common Port Numbers

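dfs.namenode.http-address:50070 (Hadoop 3.x default: 9870)
dfs.datanode.http-address:50075 (Hadoop 3.x default: 9864)
dfs.namenode.secondary.http-address (SecondaryNameNode):50090 (Hadoop 3.x default: 9868)
dfs.datanode.address:50010 (Hadoop 3.x default: 9866)
fs.defaultFS:8020 or 9000
yarn.resourcemanager.webapp.address:8088
mapreduce.jobhistory.webapp.address (JobHistory server web UI):19888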
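Because these are only defaults and a cluster may override them, it can be safer to ask the cluster what it is actually configured with. The sketch below does that with hdfs getconf -confKey, assuming the hdfs CLI is on the PATH and picks up the cluster's configuration directory. The function name show_hadoop_addresses is hypothetical, and the key list is limited to core and HDFS properties.

```shell
#!/usr/bin/env bash
# show_hadoop_addresses is a hypothetical helper name, not part of Hadoop.
# It prints the effective value of each address-related property.
show_hadoop_addresses() {
  local key
  for key in \
    fs.defaultFS \
    dfs.namenode.http-address \
    dfs.namenode.secondary.http-address \
    dfs.datanode.http-address \
    dfs.datanode.address
  do
    # hdfs getconf -confKey prints the configured value of a single key.
    printf '%s = %s\n' "$key" "$(hdfs getconf -confKey "$key")"
  done
}
```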

MapReduce

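Advantages of MapReduce:

  • Parallel processing: MapReduce divides a job among multiple nodes, and each node processes its share of the job at the same time.

  • Data locality: Rather than moving the data to the processing unit, the MapReduce framework moves the processing unit to the data.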
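The parallel-processing idea can be tried on a single machine without Hadoop. The sketch below is only an analogy under that assumption: it splits a file into chunks, counts the lines of each chunk in its own background process (the "map" step), and then sums the partial counts (the "reduce" step). The function name count_lines_parallel and the chunk size are hypothetical choices, and nothing here models data locality or fault tolerance.

```shell
#!/usr/bin/env bash
# count_lines_parallel is a hypothetical name; this runs on one machine only.
# Usage: count_lines_parallel <file> [lines-per-chunk]
count_lines_parallel() {
  local input=$1
  local lines_per_chunk=${2:-1000}
  local workdir part

  # An empty or missing input has nothing to split.
  [ -s "$input" ] || { echo 0; return; }

  workdir=$(mktemp -d)

  # Split the input into chunks, loosely mirroring how a job's input is divided.
  split -l "$lines_per_chunk" "$input" "$workdir/part-"

  # "Map" step: count the lines of every chunk in its own background process.
  for part in "$workdir"/part-*; do
    wc -l < "$part" > "$part.count" &
  done
  wait   # block until all background counters have finished

  # "Reduce" step: add the partial counts into a single total.
  cat "$workdir"/part-*.count | awk '{ total += $1 } END { print total + 0 }'

  rm -rf "$workdir"
}
```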

# -input: the input text; -output: the output location (must not already exist)
# -mapper: the map function; -reducer: the reduce function
# -numReduceTasks: the number of reduce tasks
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
    -input /user/data/texts.txt \
    -output /user/data/output \
    -mapper "/usr/bin/cat" \
    -reducer "/usr/bin/wc" \
    -numReduceTasks 1

Hadoop Streaming

cat ***.txt | mapper | sort | reducer > output
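The pipeline above can be run locally to test a mapper and reducer before submitting a job. Here is a minimal word-count sketch of it, assuming plain whitespace-separated text. The names mapper, reducer and run_local are illustrative; in a real Hadoop Streaming job the mapper and reducer would need to be executables or script files rather than shell functions.

```shell
#!/usr/bin/env bash
# Illustrative local stand-ins for a streaming mapper and reducer.

# mapper: emit one word per line; each word acts as the key.
mapper() {
  tr -s '[:space:]' '\n' | grep -v '^$'
}

# reducer: the input arrives sorted, so identical words are adjacent
# and can be counted in a single pass. Output is "word<TAB>count".
reducer() {
  uniq -c | awk '{ print $2 "\t" $1 }'
}

# run_local: same shape as cat | mapper | sort | reducer,
# with sort playing the role of the shuffle between map and reduce.
run_local() {
  cat "$1" | mapper | LC_ALL=C sort | reducer
}
```

Calling run_local on a text file and redirecting the result to a file matches the > output step of the one-line pipeline.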

HDFS

Reading data:

[Figure: HDFS data read flow]

Writing data:

[Figure: HDFS data write flow]