Hadoop File Formats, when and what to use?

Rajesh Dangi / June 23, 2017

Hadoop File Formats, when and what to use? Hadoop is gaining traction and on a higher adaption curve to liberate the data from the clutches of the applications and native formats. This article helps us look at the file formats supported by Hadoop ( read, HDFS) file system. A quick broad categorizations of file formats would be

  • Basic file formats are: Text format, Key-Value format, Sequence format
  • Other formats which are used and are well known are: Avro, Parquet, RC or Row-Columnar format, ORC or Optimized Row Columnar format

The need ..

A file format is just a way to define how information is stored in HDFS file system. This is usually driven by the use case or the processing algorithms for specific domain, File format should be well-defined and expressive. It should be able to handle variety of data structures specifically structs, records, maps, arrays along with strings, numbers etc. File format should be simple, binary and compressed.. When dealing with Hadoop’s filesystem not only do you have all of these traditional storage formats available to you (like you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data. A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are exacerbated with the difficulties managing large datasets, such as evolving schemas, or storage constraints. The various Hadoop file formats have evolved as a way to ease these issues across a number of use cases. Choosing an appropriate file format can have some significant benefits: 1. Faster read times 2. Faster write times 3. Splittable files (so you don’t need to read the whole file, just a part of it) 4. Schema evolution support (allowing you to change the fields in a dataset) 5. Advanced compression support (compress the columnar files with a compression codec without sacrificing these features) Some file formats are designed for general use (like MapReduce or Spark), others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice.

The generic classification of the characteristics are Expressive, Simple, Binary, Compressed, Integrity to name few. Typically text-based, serial and columnar types…

Since Protocol buffers & thrift are serializable but not splittable they are not largely popular on HDFS use cases and thus Avro becomes the first choice …

Text Input Format

Simple text-based files are common in the non-Hadoop world, and they’re super common in the Hadoop world too. Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX fashion. Text-files are inherently splittable (just split on \n characters!), but if you want to compress them you’ll have to use a file-level compression codec that support splitting, such as BZIP2 Because these files are just text files you can encode anything you like in a line of the file. One common example is to make each line a JSON document to add some structure. While this can waste space with needless column headers, it is a simple way to start using structured data in HDFS.

  • Default, JSON, CSV formats are available
  • Slow to read and write
  • Can’t split compressed files (Leads to Huge maps)
  • Need to read/decompress all fields.

An Input format for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text. Advantages: Light weight Disadvantages: Slow to read and write, Can’t split compressed files (Leads to Huge maps)

Sequence File Input Format

Sequence files were originally designed for MapReduce, so the integration is smooth. They encode a key and a value for each record and nothing more. Records are stored in a binary format that is smaller than a text-based format would be. Like text files, the format does not encode the structure of the keys and values, so if you make schema migrations they must be additive. Typically if you need to store complex data in a sequence file you do so in the value part while encoding the id in the key. The problem with this is that if you add or change fields in your Writable class it will not be backwards compatible with the data stored in the sequence file. One benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks.

  • Traditional map reduce binary file format
  • Stores Keys and Values as a class
  • Not good for Hive ,Which has sql types
  • Hive always stores entire line as a value
  • Default block size is 1 MB
  • Need to read and Decompress all the fields

In addition to text files, Hadoop also provides support for binary files. Out of these binary file formats, Hadoop Sequence Files are one of the Hadoop specific file format that stores serialized key/value pairs. Advantages: Compact compared to text files, Optional compression support. Parallel processing. Container for huge number of small files. Disadvantages: Not good for Hive, Append only like other data formats, Multi Language support not yet provided One key benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. Sequence files are well supported across Hadoop and many other HDFS enabled projects, and I think represent the easiest next step away from text files.

RC (Row-Columnar) File Input Format

RCFILE stands of Record Columnar File which is another type of binary file format which offers high compression rate on the top of the rows used when we want to perform operations on multiple rows at a time. RCFILEs are flat files consisting of binary key/value pairs, which shares much similarity with SEQUENCE FILE. RCFILE stores columns of a table in form of record in a columnar manner. It first partitions rows horizontally into row splits and then it vertically partitions each row split in a columnar way. RCFILE first stores the metadata of a row split, as the key part of a record, and all the data of a row split as the value part. This means that RCFILE encourages column oriented storage rather than row oriented storage. This column oriented storage is very useful while performing analytics. It is easy to perform analytics when we “hive’ a column oriented storage type. We cannot load data into RCFILE directly. First we need to load data into another table and then we need to overwrite it into our newly created RCFILE.

  • columns stored separately
  • Read and decompressed only needed one.
  • Better compression
  • Columns stored as binary Blobs
  • Depend on Meta store to supply Data types
  • Large Blocks - 4MB default
  • Still search file for split boundary

ORC (Optimized Row Columnar)Input Format

ORC stands for Optimized Row Columnar which means it can store data in an optimized way than the other file formats. ORC reduces the size of the original data up to 75%. As a result the speed of data processing also increases and shows better performance than Text, Sequence and RC file formats. An ORC file contains rows data in groups called as Stripes along with a file footer. ORC format improves the performance when Hive is processing the data. We cannot load data into ORCFILE directly. First we need to load data into another table and then we need to overwrite it into our newly created ORCFILE. ORC File Format Full Form is Optimized Row Columnar File Format.ORC File format provides very efficient way to store relational data then RC file,By using ORC File format we can reduce the size of original data up to 75%.Comparing to Text,Sequence,Rc file formats ORC is better

  • Column stored separately
  • Knows Types - Uses Types specific en-coders
  • Stores statistics (Min,Max,Sum,Count)
  • Has Light weight Index
  • Skip over blocks of rows that that don’t matter
  • Larger Blocks - 256 MB by default, Has an index for block boundaries

Using ORC files improves performance when Hive is reading, writing, and processing data comparing to Text,Sequence and Rc. RC and ORC shows better performance than Text and Sequence File formats. Comparing to RC and ORC File formats always ORC is better as ORC takes less time to access the data comparing to RC File Format and ORC takes Less space space to store data. However, the ORC file increases CPU overhead by increasing the time it takes to decompress the relational data. ORC File format feature comes with the Hive 0.11 version and cannot be used with previous versions.

AVRO Format

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop. Avro is an opinionated format which understands that data stored in HDFS is usually not a simple key/value combo like int/string. The format encodes the schema of its contents directly in the file which allows you to store complex objects natively. Honestly, Avro is not really a file format, it’s a file format plus a serialization and de-serialization framework with regular old sequence files you can store complex objects but you have to manage the process. Avro handles this complexity whilst providing other tools to help manage data over time and is a well thought out format which defines file data schemas in JSON (for interoperability), allows for schema evolutions (remove a column, add a column), and multiple serialization/deserialization use cases. It also supports block-level compression. For most Hadoop-based use cases Avro becomes really good choice. Avro depends heavily on its schema. It allows every data to be written with no prior knowledge of the schema. It serializes fast and the resulting serialized data is lesser in size. Schema is stored along with the Avro data in a file for any further processing. In RPC, the client and the server exchange schemas during the connection. This exchange helps in the communication between same named fields, missing fields, extra fields, etc. Avro schemas are defined with JSON that simplifies its implementation in languages with JSON libraries. Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol Buffers, and Thrift.

Thrift & Protocol Buffers Vs. Avro

Thrift and Protocol Buffers are the most competent libraries with Avro. Avro differs from these frameworks in the following ways –

  • Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.
  • Avro is built in the Hadoop ecosystem. Thrift and Protocol Buffers are not built in Hadoop ecosystem. Unlike Thrift and Protocol Buffer, Avro's schema definition is in JSON and not in any proprietary IDL.

Parquet Format

The latest hotness in file formats for Hadoop is columnar file storage. Basically this means that instead of just storing rows of data adjacent to one another you also store column values adjacent to each other. So datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework just needs access to a subset of data that is stored on disk as it can access all values of a single column very quickly without reading whole records.

  • Design based on googles Dreamel paper
  • Schema segregated into footer
  • Column major format with stripes
  • Simple type-model with logical types
  • All data pushed to leaves of the tree
  • Integrated compression and indexes

Parquet file format is also a columnar format. Instead of just storing rows of data adjacent to one another you also store column values adjacent to each other. So datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework just needs access to a subset of data that is stored on disk as it can access all values of a single column very quickly without reading whole records. Just like ORC file, it’s great for compression with great query performance especially efficient when querying data from specific columns. Parquet format is computationally intensive on the write side, but it reduces a lot of I/O cost to make great read performance. It enjoys more freedom than ORC file in schema evolution, that it can add new columns to the end of the structure. If you’re chopping and cutting up datasets regularly then these formats can be very beneficial to the speed of your application, but frankly if you have an application that usually needs entire rows of data then the columnar formats may actually be a detriment to performance due to the increased network activity required. One huge benefit of columnar oriented file formats is that data in the same column tends to be compressed together which can yield some massive storage optimizations (as data in the same column tends to be similar). It supports both File-Level Compression and Block-Level Compression. File-level compression means you compress entire files regardless of the file format, the same way you would compress a file in Linux. Some of these formats are splittable (e.g. bzip2, or LZO if indexed). Block-level compression is internal to the file format, so individual blocks of data within the file are compressed. This means that the file remains splittable even if you use a non-splittable compression codec like Snappy. However, this is only an option if the specific file format supports it. Summary Overall these format can drastically optimize workloads, especially for Hive and Spark which tend to just read segments of records rather than the whole thing (which is more common in MapReduce). Since Avro and Parquet have so much in common when choosing a file format to use with HDFS, we need to consider read performance and write performance. Because the nature of HDFS is to store data that is write once, read multiple times, we want to emphasize on the read performance. The fundamental difference in terms of how to use either format is this: Avro is a Row based format. If you want to retrieve the data as a whole, you can use Avro. Parquet is a Column based format. If your data consists of lot of columns but you are interested in a subset of columns, you can use Parquet. Hopefully by now you’ve learned a little about what file formats actually are and why you would think of choosing a specific one. We’ve discussed the main characteristics of common file formats and talked a little about compression. This is not an exhaustive article that can answer all the use cases but definitely provide pointers for you to explore, so if you want to learn more about the particular codecs I would request you to visit their respective Apache / Wikipedia pages…