Here I pull together some notes on LZO and how to use it.
Code for LZO-compressing the logs under a given directory is available at: https://github.com/wzktravel/hadoop-codec
LZO is a compression codec which gives better compression and decompression speed than gzip, and also the capability to split. LZO allows this because it is composed of many smaller (~256K) blocks of compressed data, allowing jobs to be split along block boundaries, as opposed to gzip, where the dictionary for the whole file is written at the top.
When you specify mapred.output.compression.codec as LzoCodec, Hadoop will generate .lzo_deflate files. These contain the raw compressed data without any header and cannot be decompressed with the lzop -d command. Hadoop can read these files in the map phase, but this makes your life hard.
When you specify LzopCodec as the compression codec, Hadoop will generate .lzo files. These contain the header and can be decompressed using lzop -d.
However, neither .lzo nor .lzo_deflate files are splittable by default. This is where LzoIndexer comes into play. It generates an index file which tells you where the record boundaries are. This way, multiple map tasks can process the same file.
See the Cloudera blog post on splittable LZO and LzoIndexer (both listed in the references below) for more info.
Creating indexes
The LZO format is not splittable by default; an index file has to be created for it before multiple map tasks can process an LZO file in parallel.
Creating the index when MapReduce writes its output
Use the LZO indexer:

```java
// use the LZO indexer
LzoIndexer lzoIndexer = new LzoIndexer(conf);
lzoIndexer.index(new Path(outputPath));
```

Or use the distributed indexer, which runs as a MapReduce job:

```java
DistributedLzoIndexer lzoIndexer = new DistributedLzoIndexer();
lzoIndexer.setConf(conf);
lzoIndexer.run(new String[]{outputPath});
```
Indexing files that are already LZO-compressed
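The same indexers can also be run from the command line against files or directories that already contain .lzo data. A minimal sketch, assuming the hadoop-lzo jar from your installation (the jar path and the HDFS paths below are placeholders):

```bash
# single-process indexer, fine for a few files
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.LzoIndexer /logs/lzo/app.log.lzo

# distributed indexer, runs a MapReduce job and can index a whole directory
hadoop jar /path/to/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /logs/lzo
```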
The index file (<file>.lzo.index) is written to the same directory as the source file.
Benchmark
Word count with MapReduce
Input | Input size | Output size | CPU | Memory | Map time | Reduce time | Total time
---|---|---|---|---|---|---|---
Text | 70G | 55G | 190 | 389G | 2 min 4 s | 10 min 15 s | 12 min 24 s
LZO | 4.3G | 3.1G | 36 | 78G | 1 min 55 s | 6 min 11 s | 8 min 10 s
Ratio (LZO/Text) | 6.14% | 5.64% | 18.95% | 20.05% | 92.74% | 60.33% | 65.86%
How to use LZO
Java
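A minimal sketch of compressing a file to .lzo from plain Java, assuming hadoop-lzo and its native libraries are installed; the class name LzoCompressFile and the paths are illustrative, and this is not the exact code from the repository linked at the top:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class LzoCompressFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // LzopCodec writes the lzop header, so the result is a regular .lzo file
        // that `lzop -d` and the LZO input formats can read
        CompressionCodec codec = ReflectionUtils.newInstance(
                com.hadoop.compression.lzo.LzopCodec.class, conf);

        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(args[0]);                                // e.g. /logs/app.log
        Path output = new Path(args[0] + codec.getDefaultExtension()); // -> /logs/app.log.lzo

        try (java.io.InputStream in = fs.open(input);
             java.io.OutputStream out = codec.createOutputStream(fs.create(output))) {
            IOUtils.copyBytes(in, out, 64 * 1024, false);
        }
    }
}
```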
MapReduce
Reading LZO files
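A sketch of the driver-side setup, assuming the new-API com.hadoop.mapreduce.LzoTextInputFormat from hadoop-lzo; the input path is illustrative. This input format consults the .index files, so an indexed .lzo file is split across multiple map tasks:

```java
// inside the job driver
Job job = Job.getInstance(conf, "read-lzo");
// LzoTextInputFormat reads .lzo files and splits them along the indexed block boundaries
job.setInputFormatClass(com.hadoop.mapreduce.LzoTextInputFormat.class);
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.addInputPath(
        job, new org.apache.hadoop.fs.Path("/logs/lzo"));
```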
Compressing intermediate map output with LZO
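A sketch of the configuration, using the Hadoop 2.x property names (older releases use mapred.compress.map.output and mapred.map.output.compression.codec). LzoCodec is enough here because the intermediate files never need the lzop header:

```java
// compress the map output that is shuffled to the reducers
conf.setBoolean("mapreduce.map.output.compress", true);
conf.set("mapreduce.map.output.compress.codec",
         "com.hadoop.compression.lzo.LzoCodec");
```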
Writing LZO output files
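A sketch using the new-API FileOutputFormat helpers; LzopCodec (not LzoCodec) is used so the job writes .lzo files that can be indexed and read back later:

```java
// in the job driver
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setCompressOutput(job, true);
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputCompressorClass(
        job, com.hadoop.compression.lzo.LzopCodec.class);
```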
Spark
Reading LZO files
Spark can read LZO files directly.
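A Scala sketch (sc is an existing SparkContext, paths are illustrative), assuming the LZO codecs are registered in io.compression.codecs on the cluster. Plain textFile will decompress a .lzo file but reads it as a single partition; going through LzoTextInputFormat honors the .index files:

```scala
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// works, but one .lzo file ends up in a single partition
val lines = sc.textFile("/logs/app.log.lzo")

// uses the .index files, one partition per indexed LZO block
val indexedLines = sc.newAPIHadoopFile(
    "/logs/lzo",
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map(_._2.toString)
```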
Writing LZO output
When saving, pass the output codec class: classOf[com.hadoop.compression.lzo.LzopCodec].
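A Scala sketch (rdd and the output path are illustrative):

```scala
// saveAsTextFile takes an optional compression codec class;
// LzopCodec produces .lzo files that can be indexed afterwards
rdd.saveAsTextFile("/output/lzo", classOf[com.hadoop.compression.lzo.LzopCodec])
```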
Hive
Specifying LZO storage when creating a table
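A HiveQL sketch (table name and columns are illustrative); DeprecatedLzoTextInputFormat from hadoop-lzo is what lets Hive use the .index files when splitting:

```sql
CREATE TABLE access_log (
  line STRING
)
STORED AS
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```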
Switching an existing table to LZO storage
For a table that has already been created, use an ALTER statement to switch it to the LZO storage format, as sketched below.
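A HiveQL sketch (table name is illustrative):

```sql
ALTER TABLE access_log SET FILEFORMAT
  INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe';
```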
Inserting data
The following two parameters need to be set first.
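A sketch of the usual two settings for LZO-compressed Hive output (on older releases the second property is mapred.output.compression.codec; table names are illustrative):

```sql
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

-- then insert as usual; the written files will be .lzo
INSERT OVERWRITE TABLE access_log_lzo
SELECT line FROM access_log;
```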
Hadoop compression choices
- For plain HDFS files, the LZO format is recommended: it balances compression ratio against (de)compression speed, and you can still view file contents directly with hadoop fs -text xx.log
- For Hive, ORCFile is recommended
- For HBase, Snappy compression is recommended
- For Spark SQL and Impala, Parquet is recommended
References
- Hadoop at Twitter (part 1): Splittable LZO Compression
- Do we need to create an index file (with lzop) if compression type is RECORD instead of block?
- What’s the difference between the LzoCodec and the LzopCodec in Hadoop-LZO?
- HDFS中文件的压缩与解压
- Hadoop, how to compress mapper output but not the reducer output
- mapreduce中的压缩
- mapred-default.xml
- Hadoop列式存储引擎Parquet/ORC和snappy压缩
- IBM Developerworks: Hadoop 压缩实现分析