SQL & Hadoop

SQL on Hadoop with Hive, Spark & PySpark on EMR & AWS Glue

Category: Apache Spark

Spark Case Study – optimise executor memory and cores per executor

I was recently working on a task where I had to read more than a terabyte of data spread across multiple Parquet files. Some filters were also applied to that data to get the required result set. I did a […]

Apache Spark | By Raj | 2 comments

PySpark-How to Generate MD5 of entire row with columns

I was recently working on a project to migrate some records from an on-premises data warehouse to S3. One requirement was to run an MD5 check on each row between source and target to gain confidence that the data moved is […]

Apache Spark | By Raj | 4 comments

Spark single application consumes all resources – Good or Bad for your cluster?

While working with Spark, I have often heard a client or my team complain that a single Spark job is taking all the resources. So is it bad for your cluster? Whether to consider this as bad or good […]

Apache Spark | By Raj | 0 comments

Spark Performance Tuning with help of Spark UI

Spark is a distributed data processing engine that relies heavily on the memory available for computation. If you have worked with Spark, you have likely faced job/task/stage failures due to memory issues. Hence making memory management one of […]

Apache Spark | By Raj | 0 comments

Problem with Decimal Rounding & solution

If you migrate from one RDBMS platform to another, one technical challenge you may face is that decimal rounding differs between the two platforms. I was recently working for a client where we migrated a Teradata application to Spark on EMR and […]

Apache Spark | By Raj | 0 comments

Join the discussion

  1. Raj on PySpark-How to Generate MD5 of entire row with columns (March 9, 2023)

    Done. Please check now.

  2. Anand on PySpark-How to Generate MD5 of entire row with columns (February 25, 2023)

    Can you please make the video available to learn?

  3. Raj on Free Online SQL to PySpark Converter (August 9, 2022)

    Thank you for sharing this. I will give it a try as well.

  4. John K-W on Free Online SQL to PySpark Converter (August 8, 2022)

    Might be interesting to add a PySpark dialect to SQLglot https://github.com/tobymao/sqlglot https://github.com/tobymao/sqlglot/tree/main/sqlglot/dialects

  5. Meena M on Spark Dataframe WHEN case (July 28, 2022)

    try something like df.withColumn("type", when(col("flag1"), lit("type_1")).when(!col("flag1") && (col("flag2") || col("flag3") || col("flag4") || col("flag5")), lit("type2")).otherwise(lit("other")))

© 2023 SQL & Hadoop.