SQL & Hadoop

SQL on Hadoop with Hive, Spark & PySpark on EMR & AWS Glue

Category: Apache Spark

Spark Case Study – optimise executor memory and cores per executor

I was recently working on a task where I had to read more than a terabyte of data spread across multiple Parquet files. Some filters were also applied to that data to get the required result set. I did a […]

Apache Spark | By Raj | 2 comments

PySpark-How to Generate MD5 of entire row with columns

I was recently working on a project to migrate some records from an on-premises data warehouse to S3. The requirement was also to run an MD5 check on each row between source and target to gain confidence that the data moved is […]

Apache Spark | By Raj | 2 comments

Spark single application consumes all resources – Good or Bad for your cluster ?

While working with Spark, I often hear a client or my team “complain” that a single Spark job is taking all the resources. So is it bad for your cluster? Whether to consider this as bad or good […]

Apache Spark | By Raj | 0 comments

Spark Performance Tuning with help of Spark UI

Spark is a distributed data processing engine that relies heavily on the memory available for computation. If you have worked with Spark, you have probably faced job, task, or stage failures due to memory issues. Hence making memory management one of […]

Apache Spark | By Raj | 0 comments

Problem with Decimal Rounding & solution

If you migrate from one RDBMS platform to another, one technical challenge you may face is that decimal rounding differs between the two platforms. I was recently working for a client where we migrated a Teradata application to Spark on EMR and […]

Apache Spark | By Raj | 0 comments

Recent Posts

  • AWS Glue create dynamic frame
  • AWS Glue read files from S3
  • How to check Spark run logs in EMR
  • PySpark apply function to column
  • Run Spark Job in existing EMR using AIRFLOW

Join the discussion

  1. Raj on Free Online SQL to PySpark Converter, August 9, 2022

    Thank you for sharing this. I will give it a try as well.

  2. John K-W on Free Online SQL to PySpark Converter, August 8, 2022

    Might be interesting to add a PySpark dialect to SQLglot https://github.com/tobymao/sqlglot https://github.com/tobymao/sqlglot/tree/main/sqlglot/dialects

  3. Meena M on Spark Dataframe WHEN case, July 28, 2022

    try something like df.withColumn("type", when(col("flag1"), lit("type_1")).when(!col("flag1") && (col("flag2") || col("flag3") || col("flag4") || col("flag5")), lit("type2")).otherwise(lit("other")))

  4. tagu on Free Online SQL to PySpark Converter, July 20, 2022

    It would be great if you could add a link to the converter. It helps the community for anyone starting…

  5. Kyle on Hive Date Functions – all possible Date operations, May 13, 2022

    I am wondering if there is a way to preserve time information when adding/subtracting days from a datetime. If I…

© 2022 SQL & Hadoop.