ORC vs Parquet vs Avro
In the simplest terms, these are all file formats.
Big-data storage and processing ecosystems like Hadoop need data formats optimized for read and write performance.
1) Avro
It is a row-major format.
Its primary design goal was schema evolution.
The schema is defined in JSON and is typically maintained as a separate schema file (.avsc); Avro data files also embed the writer's schema in their header, so each file remains self-describing.
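For illustration, a minimal Avro schema might look like the following (a hypothetical example; the record and field names are made up). The union-with-null type and the default on the last field are what make schema evolution work: a reader using this schema can still decode older records that were written without the email field.

    {
      "type": "record",
      "name": "User",
      "namespace": "com.example",
      "fields": [
        {"name": "id",    "type": "long"},
        {"name": "name",  "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": null}
      ]
    }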
2) ORC
Column-oriented storage format.
It originated as Hive's RCFile (Record Columnar File) and was later improved into the Optimized Row Columnar (ORC) format.
The schema is stored with the data, as part of the file footer.
Data is organized into stripes, which are subdivided into row groups.
Each stripe maintains indexes and statistics (min/max, counts) about the data it stores.
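A minimal Spark sketch for writing and reading ORC (assuming a SparkSession named spark and an existing DataFrame df; the path and column name are hypothetical):

    import org.apache.spark.sql.functions.col

    // Write ORC with ZLIB compression; stripe-level indexes and
    // min/max statistics are created automatically by the ORC writer.
    df.write.option("compression", "zlib").orc("/tmp/events_orc")

    // On read, a pushed-down filter can use those stripe statistics
    // to skip stripes whose min/max range excludes the predicate.
    spark.read.orc("/tmp/events_orc")
      .filter(col("event_date") === "2020-01-01")
      .show()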
3) Parquet
Similar to ORC; based on Google's Dremel paper.
The schema is stored in the file footer.
Column-oriented storage format.
Has integrated compression, plus column- and page-level statistics that serve as indexes.
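A similar sketch for Parquet (same assumptions: a DataFrame df and hypothetical paths). The codec is chosen per write via the compression option:

    // Write Parquet with GZIP compression; column-chunk and page-level
    // min/max statistics are embedded alongside the data.
    df.write.option("compression", "gzip").parquet("/tmp/events_parquet")

    // SNAPPY gives a larger file but faster compression/decompression.
    df.write.option("compression", "snappy").parquet("/tmp/events_parquet_snappy")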
Space-wise (compression), I found them pretty close to each other.
Around 10 GB of CSV data compressed to 1.1 GB with ORC (ZLIB) and to 1.2 GB with Parquet (GZIP); with SNAPPY compression, both formats used around 1.6 GB.
Conversion-speed-wise, ORC was a little better: it took 9 minutes, whereas Parquet took 10+ minutes.
The following link should be useful for further comparison:
File Format Benchmark - Avro, JSON, ORC & Parquet
https://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet