4 thoughts on “PySpark Tutorial – Introduction, Read CSV, Columns”

  1. Hi,
    Interview question:
    How can we skip two or more header lines while reading a file into a DataFrame in PySpark?

    Sample file:

    prod,daily,impress
    id,name,country
    01,manish,USA
    02,jhon,UK
    03,willson,Africa
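
One way to think about the question above is "drop rows by position before parsing". The sketch below shows that idea in plain Python on the sample file; the helper name `rows_after_header` and the choice to discard both leading lines are assumptions made for illustration. In PySpark the same positional filter is often done by reading the file as text, calling `zipWithIndex()` on the RDD, and keeping only lines whose index is at least the number of header lines.

```python
import csv
from itertools import islice


def rows_after_header(lines, header_lines=2):
    """Parse CSV rows, discarding the first `header_lines` lines."""
    # islice skips lines by position, the same idea as filtering on a row index.
    return list(csv.reader(islice(lines, header_lines, None)))


# The sample file from the comment, one string per line.
sample = [
    "prod,daily,impress",
    "id,name,country",
    "01,manish,USA",
    "02,jhon,UK",
    "03,willson,Africa",
]

rows = rows_after_header(sample, header_lines=2)
print(rows)
```

Passing `header_lines=1` instead would keep `id,name,country` as the first row, which is useful if that line should become the column names.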

  2. Hi,
    I have the data below in an abc.csv file:

    S.NO NAME AGE SEX ADDRESS SAMPLE_SENT_DATE SAMPLE_RESULT_DATE
    1 CH.KEERTHI 55 F ZARUGUMALLI 08.08.2020 08.08.2020
    2 P SURESH 57 F Zarugumalli 05-08-2020 11.08.2020
    3 P HEMASRI 35 MALE Zarugumalli 05-08-2020 11.08.2020
    4 CH.DEEPTHI 32 FEMALE Y PALEM 11.08.2020 11.08.2020
    5 CH.KARTHIK 24 FEMALE Y PALEM 11.08.2020 11.08.2020
    6 D.subbarao 23 M Y PALEM 11.08.2020 11.08.2020
    7 iethakshi 40 M Y PALEM 11.08.2020 11.08.2020
    8 irajeswari 50 M Y PALEM 11.08.2020 11.08.2020
    9 CH.KEERTHI 58 MALE Volepalem 31-07-2020 11.08.2020
    10 irajeswari 22 FEMALE Volepalem 30-07-2020 11.08.2020

    Here SAMPLE_SENT_DATE and SAMPLE_RESULT_DATE contain dates in different formats.
    How can I convert SAMPLE_SENT_DATE and SAMPLE_RESULT_DATE to DD-MM-YYYY format?

    1. Hi Vivek
      Try this.

      from pyspark.sql.functions import col, from_unixtime, unix_timestamp, when

      columns = ["SAMPLE_SENT_DATE","SAMPLE_RESULT_DATE"]
      data = [("08.08.2020", "08.08.2020"), ("05-08-2020", "11.08.2020"), ("31-07-2020","11.08.2020")]
      # `spark` is assumed to be an existing SparkSession
      df = spark.createDataFrame(data, columns)
      df.show()
      +----------------+------------------+
      |SAMPLE_SENT_DATE|SAMPLE_RESULT_DATE|
      +----------------+------------------+
      |      08.08.2020|        08.08.2020|
      |      05-08-2020|        11.08.2020|
      |      31-07-2020|        11.08.2020|
      +----------------+------------------+
      # Raw string so that "\." reaches the regex engine unchanged; matches dd.MM.yyyy
      dotted = r"^[0-9]{2}\.[0-9]{2}\.[0-9]{4}$"
      for name in columns:
          # Rewrite dotted dates as dd-MM-yyyy; leave all other values untouched
          df = df.withColumn(
              name,
              when(col(name).rlike(dotted),
                   from_unixtime(unix_timestamp(col(name), "dd.MM.yyyy"), "dd-MM-yyyy"))
              .otherwise(col(name)),
          )
      
      df.show()
      +----------------+------------------+
      |SAMPLE_SENT_DATE|SAMPLE_RESULT_DATE|
      +----------------+------------------+
      |      08-08-2020|        08-08-2020|
      |      05-08-2020|        11-08-2020|
      |      31-07-2020|        11-08-2020|
      +----------------+------------------+
      

      I'd also recommend keeping dates in the standard "yyyy-MM-dd" format, so that future date operations need no format conversion.
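
To make that recommendation concrete, here is a minimal plain-Python sketch that normalizes both formats seen in the data to "yyyy-MM-dd". The helper `to_iso` and the `KNOWN_FORMATS` tuple are hypothetical names, and the sketch assumes the only input formats are dd.MM.yyyy and dd-MM-yyyy. A PySpark version would presumably try each pattern with `to_date` and combine the results with `coalesce`, but that is an assumption and is not shown here.

```python
from datetime import datetime

# Assumption: only these two day-first formats occur in the data.
KNOWN_FORMATS = ("%d.%m.%Y", "%d-%m-%Y")


def to_iso(value):
    """Return the date as an ISO 'YYYY-MM-DD' string, or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None


for raw in ("08.08.2020", "05-08-2020", "31-07-2020"):
    print(raw, "->", to_iso(raw))
```

Returning `None` for unrecognized values, rather than passing them through, makes bad rows easy to find and fix before they reach later date operations.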
