- If processing appears to deadlock (hang), repartition the data into more partitions; a single CPU core may be stuck working through one very large chunk of data, which causes the stall (see the first sketch after this list).
- Ensure that each partition holds a similar number of records to get the best performance out of Spark; if the data is skewed (one partition has significantly more records than the others), the job is effectively no faster than sequential processing.
- Use `df.withColumn()` to transform an existing column; calls can be chained like `df.withColumn().withColumn()` to form a pipeline-like process (see the second sketch after this list).
- Prefer Spark's built-in functions such as `regexp_replace` for better readability and performance, and fall back to a User-Defined Function (UDF) only when you have no other choice.
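The following is a minimal sketch of the repartitioning advice above; the file name `resumes.csv`, the column layout, and the partition count of 16 are illustrative assumptions, not values taken from this project.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-process-resume").getOrCreate()

# Hypothetical input file; replace with the actual dataset path.
df = spark.read.csv("resumes.csv", header=True)

# Count how many records land in each partition. One partition that is far
# larger than the rest signals skew: a single core ends up doing most of
# the work, and the job can look like it has deadlocked.
partition_sizes = df.rdd.glom().map(len).collect()
print(partition_sizes)

# Spread the records across more, evenly sized partitions so every core
# gets a comparable share of the data.
df = df.repartition(16)
```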
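Below is a sketch of a chained `withColumn()` pipeline built only from Spark built-ins; the column names `resume_text` and `cleaned` are assumptions for illustration, not columns from this project.

```python
from pyspark.sql import functions as F

# Each withColumn() call rewrites the "cleaned" column, so the chain reads
# like a small text-cleaning pipeline made entirely of built-in functions.
cleaned_df = (
    df.withColumn("cleaned", F.lower(F.col("resume_text")))
      .withColumn("cleaned", F.regexp_replace("cleaned", r"<[^>]+>", " "))  # drop HTML tags
      .withColumn("cleaned", F.regexp_replace("cleaned", r"\s+", " "))      # collapse whitespace
      .withColumn("cleaned", F.trim(F.col("cleaned")))
)

# The equivalent Python UDF would run row by row outside the JVM, e.g.:
# clean_udf = F.udf(lambda s: re.sub(r"\s+", " ", s or ""), StringType())
```

Built-ins like `regexp_replace` are executed inside the JVM and can be optimized by Spark, while a Python UDF serializes every row out to a Python worker, which is why the built-in version is usually faster.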
About
A project initiated to save time by using Spark to process a huge amount of text records (approximately 20k instances).