Spark - How to mask column on pySpark?

TjMan 19/Oct/2019 Spark

In many data analysis tasks the sensitive details in a column are not needed, and it is good practice to partially hide them for security. In this example we take some sample credit card data and mask the card number using PySpark.

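We also decide that if a value in the column is invalid, it is left unmasked in the result. This example treats a card number as valid only when it is exactly 16 characters long, and that is the check made in the mask_func function. Only values that pass the check are masked in the output; anything else is returned unchanged. Note that not every card network uses 16 digits: the Diners Club number in the sample data has 14, so that row stays unmasked here.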
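The masking rule described above can be tried in plain Python before it is wired into Spark. This is a minimal sketch: the name mask_card is hypothetical, and the None check is an added assumption so that empty cells do not raise an error.

```python
def mask_card(value):
    """Mask the middle 8 characters of a 16-character value.

    Hypothetical helper for illustration; None and values of any
    other length are returned unchanged.
    """
    if value is not None and len(value) == 16:
        # Keep the first four and last four characters.
        return value[:4] + "x" * 8 + value[12:]
    return value


# Two card numbers taken from the sample data, plus a missing value.
samples = ["6480195344642784", "30295201231669", None]
print([mask_card(s) for s in samples])
# prints ['6480xxxxxxxx2784', '30295201231669', None]
```

The 14-digit value and the missing value pass through untouched, which matches the "do not mask invalid values" rule.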

pySpark Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import os

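def main():
    # Point Spark at a local Java 8 installation; adjust the path for your machine.
    os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-8-openjdk'
    spark = SparkSession.builder.master("local").getOrCreate()
    # Without schema inference every CSV column is read as a string.
    df = spark.read.csv('/home/Downloads/CC_Records.csv', header=True)

    print("Input Data: \n")
    df.show(2)

    # Wrap the plain Python function as a UDF that returns a string.
    mask_func_udf = udf(mask_func, StringType())
    # Add the masked column, then drop the original and reuse its name.
    # As a result the masked column ends up last in the output.
    df_masked = df.withColumn("CardNumber_masked", mask_func_udf(df["CardNumber"]))
    df_masked = df_masked.drop("CardNumber").withColumnRenamed("CardNumber_masked", "CardNumber")

    print("Masked Data: \n")
    df_masked.show(2)
    return 0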

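def mask_func(colVal):
    # Mask only non-null values that are exactly 16 characters long.
    if colVal is not None and len(colVal) == 16:
        # Keep the first and last four characters; replace the middle eight with 'x'.
        return colVal[:4] + 'x' * 8 + colVal[12:]
    # Null values and values of any other length are returned unchanged.
    return colVal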



if __name__ == '__main__':
    main()

Result:
Input Data: 

+------------+--------------------+-----------+----------------+-----------------+---+---------+----------+-----------+-------+-----------+
|CardTypeCode|    CardTypeFullName|IssuingBank|      CardNumber|   CardHolderName|CVV|IssueDate|ExpiryDate|BillingDate|CardPIN|CreditLimit|
+------------+--------------------+-----------+----------------+-----------------+---+---------+----------+-----------+-------+-----------+
|          DS|            Discover|   Discover|6480195344642784|Brenda D Peterson|689|  01/2017|   01/2022|          4|   1998|      22700|
|          DC|Diners Club Inter...|Diners Club|  30295201231669|     Dawn U Reese|070|  12/2015|   12/2016|         11|   3915|      12700|
+------------+--------------------+-----------+----------------+-----------------+---+---------+----------+-----------+-------+-----------+
only showing top 2 rows

Masked Data: 

+------------+--------------------+-----------+-----------------+---+---------+----------+-----------+-------+-----------+----------------+
|CardTypeCode|    CardTypeFullName|IssuingBank|   CardHolderName|CVV|IssueDate|ExpiryDate|BillingDate|CardPIN|CreditLimit|      CardNumber|
+------------+--------------------+-----------+-----------------+---+---------+----------+-----------+-------+-----------+----------------+
|          DS|            Discover|   Discover|Brenda D Peterson|689|  01/2017|   01/2022|          4|   1998|      22700|6480xxxxxxxx2784|
|          DC|Diners Club Inter...|Diners Club|     Dawn U Reese|070|  12/2015|   12/2016|         11|   3915|      12700|  30295201231669|
+------------+--------------------+-----------+-----------------+---+---------+----------+-----------+-------+-----------+----------------+
only showing top 2 rows
