The Spark SQL DataFrame is the closest thing a SQL developer will find in Apache Spark. I come from a SQL background, with 10+ years of experience working in traditional RDBMSs like Teradata, Oracle, Netezza, and Sybase. So when I moved from traditional RDBMS to Hadoop for my new projects, I was eager to see which SQL options it offered. I must admit Hive was the most relevant one, and it made my life much simpler in my new project. Next came Apache Spark.
Apache Spark has Spark SQL as one of its components, which is a blessing for people like me. We would rather not write Java applications; SQL is our forte. Apache Spark is perhaps the loudest buzzword in the market today, so working on Spark SQL is exciting too. One of the core objects in Spark SQL is the DataFrame, which behaves much like a table in an RDBMS. You can apply most SQL operations to a DataFrame either directly, through DataFrame methods, or indirectly, by registering it as a temporary view and querying it with plain SQL.
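To make the "directly or indirectly" point concrete, here is a minimal sketch of the same query written both ways. The `employees` data, its column names, and the helper names (`sampleEmployees`, `salesViaApi`, `salesViaSql`) are made up for illustration; the sketch assumes Spark 2.x or later with a local `SparkSession`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DataFrameSqlSketch {

  // Hypothetical sample data, standing in for a table in an RDBMS.
  def sampleEmployees(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (1, "Asha", "Sales", 52000),
      (2, "Ben", "Sales", 48000),
      (3, "Chen", "IT", 61000)
    ).toDF("emp_id", "emp_name", "dept", "salary")
  }

  // "Directly": express SELECT / WHERE / ORDER BY with DataFrame methods.
  def salesViaApi(employees: DataFrame): DataFrame =
    employees
      .select("emp_name", "salary")
      .where(col("dept") === "Sales")
      .orderBy(col("salary").desc)

  // "Indirectly": register the DataFrame as a temporary view, then use SQL.
  def salesViaSql(spark: SparkSession, employees: DataFrame): DataFrame = {
    employees.createOrReplaceTempView("employees")
    spark.sql(
      "SELECT emp_name, salary FROM employees WHERE dept = 'Sales' ORDER BY salary DESC"
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataframe-sql-sketch")
      .master("local[*]") // assumption: running locally, not on a cluster
      .getOrCreate()

    val employees = sampleEmployees(spark)
    salesViaApi(employees).show()
    salesViaSql(spark, employees).show()

    spark.stop()
  }
}
```

Both versions should return the same rows, so you can start with whichever style feels closer to the SQL you already know.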
Below are the posts on Scala DataFrames that I would like to share with you. I hope they help you in transitioning from SQL to Spark SQL.
- SPARK DATAFRAME SELECT
- SPARK DATAFRAME ALIAS AS
- SPARK DATAFRAME WHERE FILTER
- SPARK DATAFRAME IN-NOT IN
- SPARK DATAFRAME LIKE NOT LIKE RLIKE
- SPARK DATAFRAME WHEN CASE
- SPARK DATAFRAME ORDERBY SORT
- SPARK DATAFRAME REPLACE STRING
- SPARK DATAFRAME CONCATENATE STRINGS
- SPARK DATAFRAME – DISTINCT OR DROP DUPLICATES
- SPARK DATAFRAME JOINS – ONLY POST YOU NEED TO READ
- SPARK DATAFRAME UPDATE COLUMN VALUE
- SPARK DATAFRAME AGGREGATE FUNCTIONS
- SPARK DATAFRAME – UNION/UNION ALL
- SPARK DATAFRAME COLUMN LIST
- SPARK DATAFRAME SHOW
- SPARK DATAFRAME – EXPLODE
- SPARK DATAFRAME NULL VALUES
- SPARK DATAFRAME – MONOTONICALLY_INCREASING_ID
- SPARK DATAFRAME REPARTITION
- SPARK DATAFRAME ADD MULTIPLE COLUMNS WITH VALUE