Teradata to PySpark – Replicate ACTIVITYCOUNT to Spark

Recently I was working on a project to convert Teradata BTEQ scripts to PySpark code. Since the scripts were mostly SQL queries, we were asked to transform them into Spark SQL and run them using PySpark. We used sqlContext for most of the SQL queries; however, Teradata has constructs like ACTIVITYCOUNT that can help you decide whether or not to run subsequent queries. These conditional constructs cannot be directly converted to equivalent Spark SQL. So in PySpark, …
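A minimal sketch of one way to replicate this behaviour: since Spark SQL has no ACTIVITYCOUNT, you can capture the row count of a query with count() and branch on it in Python. The table names (sample_table, target_table) and the threshold are illustrative assumptions, not from the original post; the post itself uses sqlContext.sql, which behaves the same as spark.sql here.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("activitycount-example").getOrCreate()

# Run the first query and capture its row count -- this plays the role
# of Teradata's ACTIVITYCOUNT for the preceding statement.
# (sample_table is a hypothetical table name used for illustration.)
df = spark.sql("SELECT * FROM sample_table WHERE load_date = current_date()")
activity_count = df.count()

# Mimic BTEQ's ".IF ACTIVITYCOUNT = 0 THEN .GOTO skip" pattern in Python:
# only run the downstream statement when the first query returned rows.
if activity_count > 0:
    spark.sql(
        "INSERT INTO target_table "
        "SELECT * FROM sample_table WHERE load_date = current_date()"
    )
else:
    print("No rows found; skipping downstream queries.")
```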

Continue Reading

PySpark – zipWithIndex Example

One of the most common operations in any data analytics environment is generating sequences. There are multiple ways of generating sequence numbers, but I find zipWithIndex the best one in terms of simplicity and performance combined, especially when the requirement is to generate consecutive numbers without any gaps. Below is the detailed code, which helps in generating surrogate keys/natural keys/sequence numbers. Step 1: Create a dataframe with all the required columns from the table. df_0 = sqlContext.sql("select pres_name,pres_dob,pres_bp,pres_bs,pres_in,pres_out from…
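A minimal sketch of the zipWithIndex approach described above: zipWithIndex pairs each RDD element with a 0-based, gap-free index, which you then shift and fold back into the DataFrame. The column names follow the snippet above; the table name presidents and the output column seq_id are hypothetical, since the original excerpt truncates before naming them. The post uses sqlContext; a SparkSession works the same way here.

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("zipWithIndex-example").getOrCreate()

# Step 1: select the required columns (presidents is a hypothetical
# table name; the columns are taken from the snippet above).
df_0 = spark.sql(
    "select pres_name, pres_dob, pres_bp, pres_bs, pres_in, pres_out "
    "from presidents"
)

# Step 2: zipWithIndex attaches a consecutive, gap-free 0-based index
# to each row, yielding (Row, index) pairs.
rdd_indexed = df_0.rdd.zipWithIndex()

# Step 3: flatten each (Row, index) pair back into a single Row,
# shifting the index by 1 so the sequence starts at 1.
df_seq = rdd_indexed.map(
    lambda pair: Row(seq_id=pair[1] + 1, **pair[0].asDict())
).toDF()

df_seq.show(5)
```

Unlike monotonically_increasing_id(), which leaves gaps between partitions, this gives strictly consecutive numbers, at the cost of a pass over the RDD.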

Continue Reading