Recently I have received few queries regarding the query which we are passing to “load” function when using JDBC connection to connect to any RDBMS. The question is whether that query should be Spark SQL compliant or should be RDBMS specific. This is actually a very valid question because Spark SQL does not support all SQL constructs which are supported by typical RDBMS like Teradata , Netezza etc.
Answer to this question is : Query must be RDBMS specific. When we use a jdbc connection, then the Query which you pass is actually executed on RDBMS and then the result set is pushed to DATAFRAME in spark.
To validate it you can try few things like pass a query which is not RDBMS specific. Like “select * from TABLENAME limit 20”. Now this query shall fail in Teradata because “LIMIT” keyword is not supported. You can also see JDBC error codes which are Teradata Error Code and not spark. Similarly, you can pass query which is specific to RDBMS like “select * from TABLENAME sample 20”. Now this query is specific to Teradata but not to Spark as Spark SQL does not support “sample” keyword. But this query will run successfully. Hence we know that query must be RDBMS specific.