SPARK - PySpark DataFrame

DataFrame is the closest thing a SQL developer can find in Apache Spark to a regular table in an RDBMS. I come from a SQL background, having initially worked on traditional RDBMS platforms like Teradata, Oracle, Netezza, and Sybase. So when I moved from traditional RDBMS to Hadoop for my new projects, I was excited to explore the SQL options available in it. I must admit Hive was the most relevant one, and it made my life much simpler in my first Hadoop project. Next comes Apache Spark.

Apache Spark has Spark SQL as one of its components, an API which is a blessing for people like me: we don't prefer writing Java applications, but SQL is our forte. Since Apache Spark is very popular in the market today, working on Spark SQL is exciting too. One of the core objects in Spark SQL is the DataFrame, and it is as good as any table in an RDBMS. You can apply all sorts of SQL operations on a DataFrame, directly or indirectly.

Below are the posts using Scala and PySpark DataFrames that I would like to share with you; I hope they can help you in transitioning from SQL to Spark.