A project initiated to save time by using Spark to process a large volume of text records (approximately 20k instances).

Reflection on using Spark for processing data

  1. If processing appears to stall (a deadlock-like hang), increase the number of partitions; a single CPU core stuck with one oversized chunk of data is a common cause (see the repartitioning sketch after this list).
  2. Ensure each partition holds a similar number of records to get the full performance benefit of Spark; if the partitions are skewed (one partition holds significantly more records than the others), the job is effectively no faster than sequential processing.
  3. Use df.withColumn to transform an existing column; calls can be chained, e.g. df.withColumn().withColumn(), to form a pipeline-like sequence of transformations (see the second sketch below).
  4. Prefer the built-in functions provided by Spark, such as regexp_replace, for better readability and performance; fall back to a User-Defined Function (UDF) only when no built-in function fits.
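
A minimal PySpark sketch of points 1 and 2: repartitioning so no single core is left with one oversized chunk, then counting records per partition to check for skew. The input path, column layout, and partition count of 16 are illustrative assumptions, not taken from this project.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("resume-processing").getOrCreate()

# Hypothetical input: one text record per row.
df = spark.read.csv("resumes.csv", header=True)

# Spread the records over more partitions so no single CPU core
# ends up processing one oversized chunk on its own.
df = df.repartition(16)

# Count the records in each partition; roughly equal counts mean no skew.
records_per_partition = df.rdd.glom().map(len).collect()
print("records per partition:", records_per_partition)
```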

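A minimal sketch of points 3 and 4, continuing from the df in the previous sketch: chaining withColumn calls into a pipeline-like sequence while using the built-in regexp_replace (and lower) instead of a UDF. The column name resume_text and the regex patterns are illustrative assumptions.

```python
from pyspark.sql import functions as F

cleaned = (
    df
    # strip HTML-like tags with the built-in regexp_replace
    .withColumn("resume_text", F.regexp_replace("resume_text", r"<[^>]+>", " "))
    # collapse repeated whitespace into a single space
    .withColumn("resume_text", F.regexp_replace("resume_text", r"\s+", " "))
    # lowercase with another built-in, still avoiding a UDF
    .withColumn("resume_text", F.lower(F.col("resume_text")))
)
```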