Big Data: File Compaction & Maintenance Using Apache Spark

Have you heard comments like “Throw the files into a filesystem and the big data framework will take care of the rest”, “Spark is 100x faster, so it can handle any volume and any dataset”, or “Why should I worry about partitioning and sorting in a big data ecosystem? That is an RDBMS thing”? I hear these comments from customers on most big data projects. I also hear complaints from customers about poor table read performance on existing big data systems. There are plenty of articles on processing big data with frameworks like Spark, but not many on file organization, data storage, and regular maintenance.

Apache Spark is one of the popular frameworks used in enterprises for ETL jobs. Spark ingests data into memory and runs the data processing logic there. A typical Spark ETL architecture involves the Spark framework, a file format, file compression, and a metastore for the schema. The common file formats are Parquet and ORC for batch and Avro for real-time streaming. These file formats contain the data and the associated metadata, such as partition, stripe size, stride size, and the min and max values of the partition keys. First, it is important to understand how the Spark code can change the file format metadata.

After processing the data, the Spark engine writes it into a table or filesystem based on the number of partitions in the dataframe, typically one file per partition. For example, if the dataframe is 100GB in size and has 200 partitions, Spark writes 200 files of about 0.5GB each into the filesystem or the table. This creates a lot of relatively small data files in the filesystem or under the Hive table. An ETL job with a daily delta load usually handles far less data, so applying the same Spark default partitioning produces even smaller files. After a month of data loads, we might end up with too many small files under the table, which eventually leads to performance issues.
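The arithmetic behind this example is easy to sketch in plain Python. The helper names are illustrative only, and the 30-day figure assumes one load per day that writes the same number of files each time (200 also happens to be the default value of `spark.sql.shuffle.partitions`).

```python
def avg_file_size_gb(total_gb: float, num_partitions: int) -> float:
    """Average output file size when each dataframe partition becomes one file."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return total_gb / num_partitions


def files_after_loads(files_per_load: int, num_loads: int) -> int:
    """Files accumulated under a table when every load appends files_per_load files."""
    return files_per_load * num_loads


# The 100GB / 200-partition example from the text.
print(avg_file_size_gb(100, 200))   # 0.5

# Assumption: a daily delta load that also writes 200 files, kept for 30 days.
print(files_after_loads(200, 30))   # 6000
```

The file count grows with every load while the data per file stays small, which is the pattern compaction is meant to undo.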

In the above scenario, the Spark engine is forced to perform massive I/O to open and close each small file and its associated metadata, which consumes resources such as node memory and compute. If the data is also not sorted or partitioned, that adds fuel to the fire: Spark now has to open and close the small files and do a full table scan to run the filter and join operations. In such situations, you might have heard a comment like “We can jack up the resources on the cloud and address the issues”. Overspending or scaling up resources on the cloud at this point may not help.

A compaction process is highly recommended to maintain or improve the performance of new-generation schema-on-read tables, reduce the cost of infrastructure, and meet data latency requirements. It can be implemented during the development phase or the support phase, depending on the project timeline.

The compaction process involves the following high-level steps:

1. Read the smaller files from the table using the Spark reader.

2. Sort the data on the chosen keys with the dataframe method sort().

3. Partition the dataframe on the sort key, a derived partition, or the partition key using repartition(sort key).

4. Write the bigger files into the filesystem or table using the dataframe writer (write) and sortWithinPartitions().
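The steps above can be sketched in PySpark roughly as follows. This is a minimal sketch under assumptions, not a definitive implementation: the function name, column names, and paths are hypothetical, the output format is assumed to be Parquet, and the separate sort() call is folded into sortWithinPartitions(), because a repartition shuffles the rows again and would not preserve an earlier global sort.

```python
def compact(df, partition_col, sort_col, num_files, target_path):
    """Rewrite df as fewer, larger, sorted Parquet files.

    df is expected to be a pyspark.sql.DataFrame, for example the result of
    spark.read.parquet(source_path). target_path should be a different
    location from the one df was read from, since the write overwrites it.
    """
    (
        df
        # Shuffle into num_files partitions so that rows sharing the same
        # partition_col value end up in the same task.
        .repartition(num_files, partition_col)
        # Sort rows inside each task; leading with partition_col keeps the
        # order compatible with the partitioned write below.
        .sortWithinPartitions(partition_col, sort_col)
        .write
        .mode("overwrite")
        # One output directory per distinct partition_col value.
        .partitionBy(partition_col)
        .parquet(target_path)
    )


# Hypothetical usage with a live SparkSession named `spark`:
#   small_files_df = spark.read.parquet("/data/orders_raw")
#   compact(small_files_df, "load_date", "customer_id", 50, "/data/orders_compacted")
```

Because every row for a given partition value is routed to a single task, each partition directory receives one file. For very large partitions that may be too coarse, and the repartition keys would need adjusting.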

From the above example, the table or filesystem with 200 small files of 0.5GB each can be turned into 50 files of about 2GB each, partitioned by a date field. This prep process enables Spark queries that filter on a date or partition key to pick the right files using the file format metadata, avoid a full table scan, and return results quicker. The compaction process also updates the file format metadata, thereby balancing the work of filter and join operations between Spark and the file format.
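To see why sorted, compacted files help a filtered read, here is a toy model in plain Python. It is not Spark's actual reader: `FileStats` is a hypothetical stand-in for the per-file min/max metadata, and the dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for per-file min/max metadata on a date column."""
    name: str
    min_date: date
    max_date: date


def files_to_scan(files, wanted):
    """Return the names of files whose [min_date, max_date] range can contain wanted."""
    return [f.name for f in files if f.min_date <= wanted <= f.max_date]


# Unsorted data: every file spans the whole month, so nothing can be skipped.
unsorted_files = [
    FileStats(f"part-{i}", date(2020, 1, 1), date(2020, 1, 31)) for i in range(4)
]

# Sorted and compacted by date: each file covers a narrow, non-overlapping range.
sorted_files = [
    FileStats("part-0", date(2020, 1, 1), date(2020, 1, 8)),
    FileStats("part-1", date(2020, 1, 9), date(2020, 1, 16)),
    FileStats("part-2", date(2020, 1, 17), date(2020, 1, 24)),
    FileStats("part-3", date(2020, 1, 25), date(2020, 1, 31)),
]

wanted = date(2020, 1, 20)
print(files_to_scan(unsorted_files, wanted))  # ['part-0', 'part-1', 'part-2', 'part-3']
print(files_to_scan(sorted_files, wanted))    # ['part-2']
```

With overlapping ranges the metadata cannot rule out any file, while the sorted layout lets the reader skip three of the four.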

Compaction processes are typically conducted at the table level and run during enterprise downtime. The process is usually followed by computing statistics to update the table metadata. There is always a trade-off between faster reads and faster writes. It is highly recommended to experiment with file size, compression formats, partitions, and sorting to find the optimal settings that meet the data latency requirements. I have noticed significantly better read performance on compacted datasets than on non-compacted ones.
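For the statistics step, Spark SQL provides the ANALYZE TABLE command. The small helper below only builds the statements. Its name, the table name, and the column names are hypothetical, and running the statements assumes a live SparkSession and a table registered in the metastore.

```python
def analyze_statements(table, columns=()):
    """Build Spark SQL statements that refresh table-level and column-level statistics."""
    stmts = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        col_list = ", ".join(columns)
        stmts.append(f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {col_list}")
    return stmts


# Hypothetical table and columns, used only for illustration.
for stmt in analyze_statements("sales_db.orders", ["order_date", "customer_id"]):
    print(stmt)
    # With a live session this would be executed as: spark.sql(stmt)
```

Running this right after the compaction job keeps the optimizer's view of the table in line with the rewritten files.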

Developing a compaction process is an art and sometimes a tough sell. It takes time and resources to conduct controlled experiments with different variables to find the appropriate file sizing. I have observed significant performance benefits and a reduction in infrastructure cost from implementing the compaction process. Please share your experience and comments on this topic.



Data Engineering Architect
