A project initiated to save time by using Spark to process a large volume of text records (approximately 20k instances).

Reflection on using Spark for processing data

  1. If processing appears to stall (a deadlock-like hang), increase the number of partitions; a single CPU core stuck with one oversized chunk of data is a common cause (see the repartitioning sketch after this list).
  2. Ensure each partition holds a similar number of records to get the full performance benefit of Spark; if the partitions are skewed (one partition holds significantly more records than the others), the job is effectively no faster than sequential processing.
  3. Use df.withColumn to transform an existing column; calls can be chained, e.g. df.withColumn().withColumn(), to form a pipeline-like sequence of transformations (see the second sketch below).
  4. Prefer the built-in functions provided by Spark, such as regexp_replace, for better readability and performance; fall back to a User-Defined Function (UDF) only when no built-in function fits.
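
A minimal PySpark sketch of points 1 and 2: repartitioning so no single core is left with one oversized chunk, then counting records per partition to check for skew. The input path, column layout, and partition count of 16 are illustrative assumptions, not taken from this project.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("resume-processing").getOrCreate()

# Hypothetical input: one text record per row.
df = spark.read.csv("resumes.csv", header=True)

# Spread the records over more partitions so no single CPU core
# ends up processing one oversized chunk on its own.
df = df.repartition(16)

# Count the records in each partition; roughly equal counts mean no skew.
records_per_partition = df.rdd.glom().map(len).collect()
print("records per partition:", records_per_partition)
```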

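A minimal sketch of points 3 and 4, continuing from the df in the previous sketch: chaining withColumn calls into a pipeline-like sequence while using the built-in regexp_replace (and lower) instead of a UDF. The column name resume_text and the regex patterns are illustrative assumptions.

```python
from pyspark.sql import functions as F

cleaned = (
    df
    # strip HTML-like tags with the built-in regexp_replace
    .withColumn("resume_text", F.regexp_replace("resume_text", r"<[^>]+>", " "))
    # collapse repeated whitespace into a single space
    .withColumn("resume_text", F.regexp_replace("resume_text", r"\s+", " "))
    # lowercase with another built-in, still avoiding a UDF
    .withColumn("resume_text", F.lower(F.col("resume_text")))
)
```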