Tuesday, June 24, 2014

File Archive using Hadoop Archive

by Unknown  |  in Big Data at  5:28 AM


Archiving small files
The Hadoop Archive's data format is called har, with the following layout:
foo.har/_masterindex  // stores hashes and offsets
foo.har/_index        // stores file statuses
foo.har/part-[1..n]   // stores actual file data
The file data is stored in multiple part files, which are indexed to keep the original separation of the data intact. Moreover, the part files can be accessed by MapReduce programs in parallel. The index files also record the original directory tree structure and the file statuses. In Figure 1, a directory containing many small files is archived into a directory with large files and indexes.
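As a sketch of how such an archive is built, the `hadoop archive` tool packs a directory into a har; the paths below are hypothetical and assume a running HDFS cluster:

```shell
# Pack the directory /user/hadoop/foo into foo.har under /user/zoo.
# -p sets the parent path relative to which source paths are resolved.
hadoop archive -archiveName foo.har -p /user/hadoop foo /user/zoo

# The archive is an HDFS directory holding the index and part files
# described above (_masterindex, _index, part-*).
hadoop fs -ls /user/zoo/foo.har
```

Note that archive creation runs as a MapReduce job, so the cluster must be able to launch jobs.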
HarFileSystem – A first-class FileSystem providing transparent access
Most archival systems, such as tar, are tools for archiving and de-archiving. Generally, they do not fit into the actual file system layer and hence are not transparent to the application writer: the user has to de-archive the archive before use.
Hadoop Archive is integrated into Hadoop's FileSystem interface. The HarFileSystem class implements the FileSystem interface and provides access via the har:// scheme. This exposes the archived files and directory tree structures transparently to users. Files in a har can be accessed directly, without expanding it. For example, the following command copies an HDFS file to a local directory:
hadoop fs -get hdfs://namenode/foo/file-1 localdir
Suppose an archive bar.har is created from the foo directory. Then, the command to copy the original file becomes
hadoop fs -get har://namenode/bar.har#foo/file-1 localdir
Users only have to change the URI paths. Alternatively, users may choose to create a symbolic link (from hdfs://namenode/foo to har://namenode/bar.har#foo in the example above); then even the URIs do not need to be changed. In either case, HarFileSystem is invoked automatically to provide access to the files in the har. Because of this transparent layer, har is compatible with the Hadoop APIs, MapReduce, the shell command-line interface, and higher-level applications like Pig, Zebra, Streaming, Pipes, and DistCp.
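To illustrate the transparent access described above, archived files can be listed and consumed through the har:// scheme exactly like ordinary HDFS paths; the archive and job paths below are hypothetical:

```shell
# Browse the original directory tree inside the archive without expanding it
hadoop fs -ls -R har:///user/zoo/foo.har

# Use the archive directly as MapReduce input, e.g. for the wordcount example
# (jar name and output path are placeholders for an actual installation)
hadoop jar hadoop-mapreduce-examples.jar wordcount \
    har:///user/zoo/foo.har/foo /user/hadoop/wordcount-output
```

The three-slash form har:/// resolves the archive against the default file system, so no namenode host needs to be spelled out in the URI.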

