Handling scientific notation (exponent numbers) in PySpark

What is scientific notation (an exponent number)?

Recently I was working on a PySpark process in which the requirement was to apply some aggregations on big numbers. The results in the output were accurate; however, they were in exponential format, i.e. scientific notation, which definitely does not look right in a display. I am talking about numbers represented as “1.0125000010125E-8”, often called “E to the power of” numbers.
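As a quick illustration outside Spark, plain Python renders the same kind of value the same way (the divisor here simply mirrors the one used in the example further below):

```python
# Dividing 1 by a large number gives a tiny fraction, which Python,
# like the JVM, renders in scientific ("E") notation
value = 1 / 98765432
print(value)  # prints something like 1.0125000010125e-08
```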

How to handle scientific notation in Spark?

You can handle scientific notation using the format_number function in Spark. There is no direct configuration to turn scientific notation off in Spark; however, you can apply the format_number function to display a number in plain decimal format rather than exponential format.

In this post, I have shared how I converted the exponent format to a plain decimal format in PySpark. Also, per my observation, if you are reading data from a database via a JDBC connection and the data type is DECIMAL with a scale greater than 6, the value is converted to exponential format in Spark.

Create a sample dataframe

Let us first create a sample dataframe whose computed column ends up in scientific notation.

from pyspark.sql.types import StructType, StructField, IntegerType
from pyspark.sql.functions import col, format_number

data = [
    (1, 1),
    (2, 12),
    (3, 123),
    (4, 1234),
    (5, 12345),
    (6, 123456),
    (7, 1234567),
    (8, 12345678),
    (9, 123456789)
]

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("val", IntegerType(), True)
])

# Assumes an active SparkSession named `spark` (e.g. in the pyspark shell)
df = spark.createDataFrame(data=data, schema=schema)
df1 = df.withColumn("new_val", col("val") / 98765432)
df1.select("id", "val", "new_val").show(truncate=False)

+---+---------+---------------------+
|id |val      |new_val              |
+---+---------+---------------------+
|1  |1        |1.0125000010125E-8   |
|2  |12       |1.215000001215E-7    |
|3  |123      |1.245375001245375E-6 |
|4  |1234     |1.249425001249425E-5 |
|5  |12345    |1.2499312512499313E-4|
|6  |123456   |0.001249992001249992 |
|7  |1234567  |0.012499990887499991 |
|8  |12345678 |0.124999989875       |
|9  |123456789|1.249999989875       |
+---+---------+---------------------+

You can see that some values are presented in exponential format, i.e. numbers with an “E” near the end. This follows the JVM's Double.toString behaviour: a double is shown in scientific notation whenever its magnitude is smaller than 10^-3 or at least 10^7, which is why the first five rows use the “E” form while the rest do not. I don’t want scientific notation in the output, so I will use the format_number function; let’s see the result.

format_number in PySpark

df1.select("id", "val", format_number(col("new_val"), 10).alias("new_val")).show(truncate=False)

+---+---------+------------+
|id |val      |new_val     |
+---+---------+------------+
|1  |1        |0.0000000101|
|2  |12       |0.0000001215|
|3  |123      |0.0000012454|
|4  |1234     |0.0000124943|
|5  |12345    |0.0001249931|
|6  |123456   |0.0012499920|
|7  |1234567  |0.0124999909|
|8  |12345678 |0.1249999899|
|9  |123456789|1.2499999899|
+---+---------+------------+

In the above example I have set the scale to 10; you can change it as per your requirement.
In this post, we saw how to use the format_number function in Spark to handle scientific notation in double, float, or decimal numbers.
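For intuition, the transformation format_number performs (round to the given number of decimal places, group thousands with commas, return a string) can be sketched in plain Python; format_number_py is a hypothetical helper for illustration, not part of PySpark:

```python
def format_number_py(x: float, d: int) -> str:
    # Round to d decimal places and group the integer part with
    # commas, approximating the output of Spark's format_number
    return f"{x:,.{d}f}"

print(format_number_py(1 / 98765432, 10))  # 0.0000000101
print(format_number_py(1234567.891, 2))    # 1,234,567.89
```

One caveat worth remembering: format_number returns a string column, so use it for display purposes rather than for further arithmetic.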
