In Spark, we can use the “explode” method to convert single column values into multiple rows. Recently I was working on a task to convert a COBOL VSAM file, which often has nested columns defined in it. In Spark, my requirement was to convert a single column value (an array of values) into multiple rows. So let’s see an example to understand it better: create a sample dataframe with one column as an ARRAY, then run the explode function to split each value in col2 into a new row, as in the sketch below. So using the explode function, you can split one column into multiple rows. Read More →
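
For reference, a minimal self-contained sketch of that flow (the column names and sample values are illustrative, assuming a spark-shell session where the toDF implicits are in scope):

import org.apache.spark.sql.functions.{col, explode}

// sample dataframe with col2 holding an ARRAY of values
val df = Seq((1, Seq("a", "b", "c")), (2, Seq("d", "e"))).toDF("col1", "col2")

// explode emits one output row per array element
df.select(col("col1"), explode(col("col2")).as("col2")).show()
// col1=1 now appears on 3 rows (a, b, c) and col1=2 on 2 rows (d, e)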

In Spark Dataframe, the SHOW method is used to display Dataframe records in a readable tabular format. This method is used very often to check what the content of a Dataframe looks like. Let’s see it with an example. A few things to observe here: 1) By default, the SHOW function will return only 20 records. This is equivalent to the Sample/Top/Limit 20 we have in other SQL environments. 2) You can see that a string longer than 20 characters is truncated, like “William Henry Har…” in place of “William Henry Harrison”. This is equivalent to width/colwidth etc. in a typical SQL environment. This is equivalent to the below syntax: We can… Read More →
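
A quick sketch of those show variants (the sample data is illustrative; numRows and truncate are the standard arguments of Dataset.show):

val df = Seq((9, "William Henry Harrison"), (16, "Abraham Lincoln")).toDF("pres_id", "pres_name")

df.show()                      // default: first 20 rows, strings cut at 20 characters
df.show(5)                     // only the first 5 rows
df.show(20, truncate = false)  // 20 rows, full column width (no "..." truncation)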

Recently I was working on a task where I wanted the Spark Dataframe column list in a variable. This was required to do further processing depending on some technical columns present in the list. We know that you can print the schema of a Dataframe using the printSchema method; it will show a tree hierarchy of the columns along with their data type and other info. To fetch the column details, we can use “columns”, which returns all the column names in the dataframe as an Array of Strings. The requirement was to get this info into a variable, so we can convert the Array of Strings to a single String using “mkString”… Read More →
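
A minimal sketch of that approach, with illustrative column names:

val df = Seq((1, "George Washington", "VA")).toDF("pres_id", "pres_name", "pres_bs")

df.printSchema()                          // tree view: name, data type, nullability per column

val cols: Array[String] = df.columns      // Array(pres_id, pres_name, pres_bs)
val colList: String = cols.mkString(",")  // "pres_id,pres_name,pres_bs" as one String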

The UNION method is used to MERGE data from 2 dataframes into one. The dataframes must have identical schemas. If you are from a SQL background, please be very cautious while using the UNION operator on Spark dataframes. Unlike a typical RDBMS, UNION in Spark does not remove duplicates from the resultant dataframe; it simply MERGEs the data without removing any duplicates. UNION ALL is deprecated, and it is recommended to use UNION only. Let’s see one example to understand it better. I have 2 dataframes with 5 and 10 records respectively, with the first 5 common to both dataframes.

scala> df_pres1.show(30)
+-------+-----------------+----------+-------------------+-------------+----------+----------+
|pres_id|        pres_name|  pres_dob|            pres_bp|      pres_bs| … Read More →
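
A small sketch of that behaviour, using a single-column stand-in for the two dataframes (5 and 10 rows, first 5 ids common):

val df1 = (1 to 5).toDF("pres_id")
val df2 = (1 to 10).toDF("pres_id")

df1.union(df2).count()             // 15: Spark keeps the 5 duplicate rows

// for RDBMS-style UNION semantics, deduplicate explicitly
df1.union(df2).distinct().count()  // 10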

In Spark, you can perform aggregate operations on a dataframe. This is similar to what we have in SQL, like MAX, MIN, SUM etc. We can also perform aggregation on some specific columns, which is equivalent to the GROUP BY clause we have in typical SQL. Let’s see it with some examples. The first method we can use is “agg”. To calculate count, max, min and sum we can use the below syntax:

scala> df_pres.agg(count($"pres_id"),min($"pres_id"),max($"pres_id"),sum("pres_id")).show()
+--------------+------------+------------+------------+
|count(pres_id)|min(pres_id)|max(pres_id)|sum(pres_id)|
+--------------+------------+------------+------------+
|            45|           1|          45|        1035|
+--------------+------------+------------+------------+

Let’s add alias names to the columns.

scala> df_pres.agg(count($"pres_id").as("count"),min($"pres_id").as("min"),max($"pres_id").as("max"),sum("pres_id").as("sum")).show()
+-----+---+---+----+
|count|min|max| sum|
+-----+---+---+----+
|   45|  1| 45|1035|
+-----+---+---+----+

The second method is to use “agg” with “groupBy”… Read More →
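
The excerpt cuts off at the second method; a minimal sketch of the groupBy + agg pattern it refers to, assuming the same spark-shell session and df_pres dataframe (pres_bs, the column from the earlier schema, is used here as the grouping key):

import org.apache.spark.sql.functions.{count, min, max}

df_pres.groupBy($"pres_bs")
  .agg(count($"pres_id").as("count"), min($"pres_id").as("min"), max($"pres_id").as("max"))
  .show()
// one output row per pres_bs value, with the aggregates computed within each group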

In PySpark, you can do almost all the date operations you can think of using in-built functions. Let’s quickly jump to an example and see it one by one. Create a dataframe with sample date values:

>>> df_1 = spark.createDataFrame([('2019-02-20','2019-10-18',)], ['start_dt','end_dt'])

Check the dataframe info:

>>> df_1
DataFrame[start_dt: string, end_dt: string]

Now the problem I see here is that the columns start_dt and end_dt are of type string and not date. So let’s quickly convert them to date:

>>> df_2 = df_1.select(df_1.start_dt.cast('date'), df_1.end_dt.cast('date'))
>>> df_2
DataFrame[start_dt: date, end_dt: date]

Now we are good: we have a dataframe with 2 columns, start_dt and end_dt, both of datatype ‘date’. Let’s… Read More →
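
The excerpt stops before the date functions themselves; here is a sketch of the kind of operations that follow, written with the Scala API used elsewhere on this page (datediff, months_between and add_months are standard functions from org.apache.spark.sql.functions; the column names match the excerpt):

import org.apache.spark.sql.functions.{add_months, datediff, months_between, to_date}

val df1 = Seq(("2019-02-20", "2019-10-18")).toDF("start_dt", "end_dt")

// cast the string columns to proper dates first
val df2 = df1.select(to_date($"start_dt").as("start_dt"), to_date($"end_dt").as("end_dt"))

df2.select(
  datediff($"end_dt", $"start_dt").as("days_between"),         // whole days between the two dates
  months_between($"end_dt", $"start_dt").as("months_between"), // fractional months
  add_months($"start_dt", 3).as("start_plus_3_months")         // simple date arithmetic
).show()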