The first time I was asked in an interview what file formats ADF can work with, my innocent answer was, "Well, I have worked with Parquet and CSV, and since Parquet is allowed, I think Avro and ORC must be allowed too. Pretty much everything is allowed in ADF." Later I was left wondering why I was not selected for the project!! In the hope that it does not happen again, to me or to you, this time we are discussing the various file formats that ADF allows:
- Avro Format
- Binary Format
- Delimited Text Format
- Excel Format
- JSON Format
- ORC Format
- Parquet Format
- XML Format
Avro Format
Avro is a row-based storage format for Hadoop that is widely used as a serialization format. Because the data definition (schema) of an Avro file is stored in JSON, it is easy for any program to read and interpret. The key feature of the Avro file format is its robust handling of schema changes: it copes with missing fields, added fields, and changed fields, which is why old programs can read new data and new programs can read old data. This makes it an obvious choice for storing data in the landing zone:
- data in the landing zone is usually read as a whole for further processing, which makes a row-based format the right choice
- downstream systems can retrieve table schemas directly from the file, so there is no need to store schemas in an external metastore
- the Avro file format handles line breaks and other non-printable characters in data gracefully
- since schema evolution is handled, it is a natural fit for data whose structure changes over time
However, there are a few limitations with Avro as well (a sample dataset definition follows this list):
- white space in column names is not supported
- currently, ADF does not allow snappy as the compression type when reading/writing Avro files
- Avro complex data types (records, enums, arrays, maps, unions, and fixed) are not supported in the Copy activity; they can, however, be used in data flows
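To make this concrete, here is a minimal sketch of what an Avro dataset definition in ADF could look like. The linked service name, container, and folder path are hypothetical placeholders, and deflate is chosen since snappy is not allowed in the Copy activity:

```json
{
    "name": "LandingZoneAvro",
    "properties": {
        "type": "Avro",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "landing",
                "folderPath": "sales/raw"
            },
            "avroCompressionCodec": "deflate"
        }
    }
}
```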
Binary Format
A binary format file goes by its name! It is indeed binary in nature: its content must be interpreted by a program or hardware processor that understands how it is formatted, and only such a program can readily make use of the data inside. When transmitting a binary file from one location to another, the transmitting program does not parse the file but sends it over to the desired location as-is.
Did you know that most digital data is stored in binary format? Text files are as rare as 2%, and there are clear reasons for preferring binaries:
- input and output are faster with binary files
- binary files are smaller than their text equivalents
When using a Binary dataset, the sink dataset must also be Binary. Also, when handling a binary file, ADF does not parse it but treats it as-is. The allowed values for the compression codec of a binary file are bzip2, gzip, deflate, ZipDeflate, Tar, and TarGzip.
When a binary file is used in a Copy activity sink, two compression levels can be applied (see the dataset sketch below):
- optimal – the file should be optimally compressed, even if the operation takes a bit longer
- fastest – the compression operation should complete as quickly as possible, even if the resulting file is not optimally compressed
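As a rough sketch, a Binary dataset with compression configured might look like the following; all names and paths are illustrative, and the level property takes the two values described above:

```json
{
    "name": "ZippedBinaryData",
    "properties": {
        "type": "Binary",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "archive",
                "folderPath": "backups"
            },
            "compression": {
                "type": "ZipDeflate",
                "level": "Optimal"
            }
        }
    }
}
```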
Delimited Text Format
Delimited text files store tabular data in plain text, with a chosen character separating the fields. So we have CSV, which is a comma-separated file, tab-separated files, and others; in the general case, the file can be delimited by anything! Here as well, the allowed values for the compression level are optimal and fastest.
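For illustration, here is a minimal sketch of a DelimitedText dataset using a pipe as the delimiter; the linked service, container, and file names are placeholders:

```json
{
    "name": "PipeDelimitedSales",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "data",
                "fileName": "sales.txt"
            },
            "columnDelimiter": "|",
            "rowDelimiter": "\n",
            "quoteChar": "\"",
            "firstRowAsHeader": true
        }
    }
}
```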
Excel Format
ADF supports both .xls and .xlsx formats. However, when transferring data via HTTP(S), the .xls format is not supported.
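A minimal sketch of an Excel dataset pointing at a single worksheet might look like this; the names are placeholders, and sheetName and firstRowAsHeader are the properties you would typically set:

```json
{
    "name": "MonthlyReportExcel",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "reports",
                "fileName": "monthly.xlsx"
            },
            "sheetName": "Sheet1",
            "firstRowAsHeader": true
        }
    }
}
```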
JSON Format
JSON stands for JavaScript Object Notation. It is a lightweight, schema-less, text-based format for storing and transporting structured data, built on key-value pairs. Its typical use case is data being shared to and from a server. JSON can store strings, numbers, arrays, booleans, NULL, and nested JSON objects, but it cannot natively store dates, functions, or undefined values.
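For example, a minimal JSON dataset definition in ADF could look like the sketch below; the names are placeholders, and encodingName is optional:

```json
{
    "name": "OrdersJson",
    "properties": {
        "type": "Json",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "data",
                "fileName": "orders.json"
            },
            "encodingName": "UTF-8"
        }
    }
}
```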
ORC Format
ORC stands for Optimized Row Columnar, a file format that stores data more efficiently than most other file formats; it can reduce data size by up to 75%. An ORC file holds row data in groups called stripes, along with a file footer. Data usually cannot be loaded into an ORC file directly: in Hive, for example, the data is first inserted into a table and the ORC file is then written from it. The good thing about the ORC file format is that it takes up far less space than other file formats, but this comes with some CPU overhead.
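Within ADF itself, an ORC dataset definition stays very simple; a minimal sketch (with placeholder names) is shown below:

```json
{
    "name": "WarehouseOrc",
    "properties": {
        "type": "Orc",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "warehouse",
                "folderPath": "facts"
            }
        }
    }
}
```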
Parquet Format
Parquet, like ORC, is a columnar file format, but it is implemented using the record-shredding and assembly algorithm. Similar values within a column are stored in contiguous memory locations, which offers the following advantages (a sample dataset definition follows the list):
- Column-wise compression is efficient and saves storage space
- Queries that fetch specific column values need not read the entire row data thus improving performance
- Different encoding techniques can be applied to different columns
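As a sketch, a Parquet dataset in ADF with snappy compression could be defined as follows; the linked service, container, and folder are hypothetical:

```json
{
    "name": "CuratedParquet",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "curated",
                "folderPath": "sales"
            },
            "compressionCodec": "snappy"
        }
    }
}
```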
XML Format
XML files are plain text files that use custom tags to describe the structure and other features of the document. The XML specification was created by the W3C and is used widely across the web. The allowed compression values for XML are bzip2, gzip, deflate, ZipDeflate, TarGzip, Tar, snappy, and lz4; the default is no compression.
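To round things off, here is a minimal sketch of an XML dataset with gzip compression; names and paths are placeholders:

```json
{
    "name": "CatalogXml",
    "properties": {
        "type": "Xml",
        "linkedServiceName": {
            "referenceName": "MyBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "data",
                "fileName": "catalog.xml"
            },
            "compression": {
                "type": "gzip"
            }
        }
    }
}
```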
These file formats have much more to explore, especially the big data ones: Parquet, ORC, and Avro! We will be covering them in detail in the posts to come.
Stay tuned!
Happy Learning!