[Learning to Code] Key Points from the Big Data Platform Fundamentals Course: Hadoop Basics

6Young | 2024-03-20 | 2024-12-23

The instructor's course slides are linked here:


Big Data Essentials

Yanfei Kang. Ph.D.

Hadoop

MODULES OF HADOOP

Hadoop Distributed File System (HDFS): A reliable, high-bandwidth, low-cost, data storage cluster that facilitates the management of related files across machines.
Hadoop MapReduce: A high-performance parallel/distributed data-processing implementation of the MapReduce algorithm.
Hadoop YARN: A framework for job scheduling and cluster resource management.
Hadoop Common: The common utilities that support the other Hadoop modules.

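Basic operations

Most HDFS commands are the familiar Linux commands prefixed with hadoop fs -, so the individual commands need no further explanation.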

hadoop fs
hadoop fs -help
hadoop fs -ls /
hadoop fs -ls /user/yanfei
hadoop fs -mv LICENSE license.txt
hadoop fs -mkdir yourNAME
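
The "prefix the Linux command with hadoop fs -" rule can be sketched as a small Python helper. This is only an illustration: the function name to_hadoop_fs is hypothetical, it only builds the argument list without running anything, and it assumes the Linux command has an HDFS counterpart with the same name (not every command or option does).

```python
import shlex


def to_hadoop_fs(linux_command):
    """Turn a Linux-style command such as 'ls /user' into a hadoop fs argv list.

    Hypothetical helper: it applies the naming rule only and does not check
    that hadoop fs actually supports the resulting subcommand.
    """
    parts = shlex.split(linux_command)
    if not parts:
        raise ValueError("empty command")
    name, args = parts[0], parts[1:]
    return ["hadoop", "fs", "-" + name] + args


if __name__ == "__main__":
    for cmd in ["ls /", "mv LICENSE license.txt", "mkdir yourNAME"]:
        # shlex.join quotes arguments so the line can be pasted into a shell
        print(shlex.join(to_hadoop_fs(cmd)))
```

On a machine with Hadoop installed, the returned list could be passed to subprocess.run; here it is only printed.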

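Advanced operations

# Upload the local file /home/hadoop.txt to HDFS. A relative destination such as . resolves to the user's HDFS home directory (/user/<username>); an HDFS directory is a logical location in the namespace, not a folder on any single machine.
hadoop fs -put /home/hadoop.txt .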

hdfs fsck

Usage: hdfs fsck <path> [-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]]
    # The directory to check; defaults to the root directory / if omitted
    <path>  start checking from this path
    # Move corrupted files to /lost+found
    -move   move corrupted files to /lost+found
    # Delete corrupted files outright
    -delete delete corrupted files
    # Print the files being checked
    -files  print out files being checked
    # Print files that are currently open for writing
    -openforwrite   print out files opened for write
    # Also check files under snapshot directories
    -includeSnapshots   include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it
    # Print the corrupt blocks and the files they belong to
    -list-corruptfileblocks print out list of missing blocks and files they belong to
    # Print block information
    -blocks print out block report
    # Print block locations, i.e. which nodes hold each block
    -locations  print out locations for every block
    # Print the rack each block is on
    -racks  print out network topology for data-node locations
    # Print the storage policy information for the blocks
    -storagepolicies    print out storage policy summary for the blocks
    # Print the status, location, and other details of the block with the given blockId
    -blockId    print out which file this blockId belongs to, locations (nodes, racks) of this block, and other diagnostics info (under replicated, corrupted or not, etc)
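
The nesting in the usage string ([-files [-blocks [-locations | -racks]]]) can be made explicit with a small argument builder. This is a minimal sketch: build_fsck_args is a hypothetical name, it covers only the four reporting flags, and it enforces the nesting exactly as the usage string writes it, which may be stricter than what the real command accepts.

```python
def build_fsck_args(path="/", files=False, blocks=False,
                    locations=False, racks=False):
    """Build an 'hdfs fsck' argv list following the usage string's nesting.

    Hypothetical helper: it only assembles arguments and runs nothing.
    """
    if blocks and not files:
        raise ValueError("-blocks is nested under -files in the usage string")
    if (locations or racks) and not blocks:
        raise ValueError("-locations / -racks are nested under -blocks")
    if locations and racks:
        raise ValueError("the usage string lists -locations | -racks as alternatives")

    args = ["hdfs", "fsck", path]
    if files:
        args.append("-files")
    if blocks:
        args.append("-blocks")
    if locations:
        args.append("-locations")
    if racks:
        args.append("-racks")
    return args


if __name__ == "__main__":
    # Report every file under /user/yanfei with its blocks and their nodes
    print(" ".join(build_fsck_args("/user/yanfei", files=True,
                                   blocks=True, locations=True)))
```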

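Common port numbers

The values below are the Hadoop 2.x defaults; where Hadoop 3.x changed the default, the new value follows in parentheses.

dfs.namenode.http-address: 50070 (Hadoop 3.x: 9870)
dfs.datanode.http-address: 50075 (Hadoop 3.x: 9864)
dfs.namenode.secondary.http-address (SecondaryNameNode): 50090 (Hadoop 3.x: 9868)
dfs.datanode.address: 50010 (Hadoop 3.x: 9866)
fs.defaultFS: 8020 or 9000
yarn.resourcemanager.webapp.address: 8088
JobHistory Server web UI (mapreduce.jobhistory.webapp.address): 19888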
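
As a quick reference, the web UI ports can be kept in a lookup table that builds the URL to open in a browser. This is a sketch under assumptions: the names WEB_UI_PORTS and web_ui_url are hypothetical, the host defaults to localhost, the 2.x column uses the values from the list, and the 3.x column uses the defaults that Hadoop 3 moved to. A cluster's own hdfs-site.xml, yarn-site.xml, or mapred-site.xml may override any of them.

```python
# Default web UI ports per Hadoop major version (assumed defaults; a cluster's
# configuration files can override every one of these).
WEB_UI_PORTS = {
    "2.x": {
        "namenode": 50070,
        "datanode": 50075,
        "secondarynamenode": 50090,
        "resourcemanager": 8088,
        "jobhistory": 19888,
    },
    "3.x": {
        "namenode": 9870,
        "datanode": 9864,
        "secondarynamenode": 9868,
        "resourcemanager": 8088,
        "jobhistory": 19888,
    },
}


def web_ui_url(service, host="localhost", hadoop_version="3.x"):
    """Return the default web UI URL for a service (hypothetical helper)."""
    port = WEB_UI_PORTS[hadoop_version][service]
    return "http://{}:{}".format(host, port)


if __name__ == "__main__":
    for version in ("2.x", "3.x"):
        print(version, web_ui_url("namenode", hadoop_version=version))
```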

MapReduce

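Advantages of MapReduce:

Parallel processing: MapReduce splits a job into tasks and distributes them across multiple nodes, each of which processes its share at the same time.
Data locality: rather than moving the data to the processing unit, the MapReduce framework moves the processing unit to the data.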
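
Both ideas can be imitated on a single machine. In the sketch below, which is not how Hadoop itself is implemented, a list of lines is split into blocks, a thread pool stands in for the cluster nodes, each worker counts words only in its own block (the "move the computation to the data" part), and only the small partial counts are merged at the end. All function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_block(block_lines):
    """Count words in one block; in Hadoop this would run on the node holding it."""
    counts = Counter()
    for line in block_lines:
        counts.update(line.split())
    return counts


def reduce_counts(partials):
    """Merge the per-block counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, num_blocks=3):
    # Split the input into blocks; threads stand in for separate nodes here.
    blocks = [lines[i::num_blocks] for i in range(num_blocks)]
    with ThreadPoolExecutor(max_workers=num_blocks) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = [
        "big data big ideas",
        "hadoop stores big data",
        "mapreduce processes data",
    ]
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Only the Counter objects travel between the map and reduce steps, which is the point of data locality: the bulky input stays where it is and the small intermediate results move.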

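# -input: the input text in HDFS
# -output: the output directory in HDFS (must not already exist)
# -mapper: the map function
# -reducer: the reduce function
# -numReduceTasks: the number of reduce tasks
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
    -input /user/data/texts.txt \
    -output /user/data/output \
    -mapper "/usr/bin/cat" \
    -reducer "/usr/bin/wc" \
    -numReduceTasks 1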
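
To see roughly what this job computes without a cluster, the pipeline can be imitated locally: an identity mapper plays the role of /usr/bin/cat, a sort stands in for the shuffle step between map and reduce, and a counting reducer plays the role of /usr/bin/wc. This is a simplified sketch with hypothetical function names. It ignores how Hadoop Streaming splits each line into a tab-separated key and value, so the byte count on a real cluster may differ.

```python
def identity_mapper(lines):
    """Pass every input line through unchanged, like /usr/bin/cat."""
    for line in lines:
        yield line


def wc_reducer(lines):
    """Return (line count, word count, byte count), like /usr/bin/wc."""
    n_lines = n_words = n_bytes = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_bytes += len(line.encode("utf-8")) + 1  # +1 for the newline
    return n_lines, n_words, n_bytes


def run_local(lines):
    """Imitate map -> sort -> reduce with a single reducer."""
    mapped = identity_mapper(lines)
    shuffled = sorted(mapped)  # the framework sorts mapper output before reducing
    return wc_reducer(shuffled)


if __name__ == "__main__":
    print(run_local(["hello hadoop", "hello mapreduce world"]))
```

With a single reduce task, all mapper output reaches one reducer, which is why one set of totals comes out.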