In our last video, we learned how to read a CSV file in Databricks. Today, we’re diving into some data analysis using that file. So, let’s get started!
Exploring Aggregate Functions in Apache Spark with Python
Our first task is to extract the total number of sales orders received. We’ll use aggregate functions such as count and sum in Apache Spark with Python to achieve this.
Total Sales Orders:
First, we’ll explore how to use the count function to determine the total number of sales orders received.
from pyspark.sql.functions import count
total_sales_orders = df.select(count('*')).collect()[0][0]
print(f"Total Sales Orders: {total_sales_orders}")
Total Sales Orders by Date:
Next, if we want to analyze this by date, we can group by the date column in our DataFrame. This will allow us to view total sales orders for each day.
from pyspark.sql.functions import col
sales_by_date = df.groupBy("date").agg(count("*").alias("total_sales"))
sales_by_date.show()
As you can see, we now have total sales displayed by day.
Sorting Sales Orders by Date:
To further refine our results, we can sort the data by date using orderBy, the DataFrame equivalent of a SQL ORDER BY clause, in either ascending or descending order.
sorted_sales_by_date = sales_by_date.orderBy(col("date").asc())
sorted_sales_by_date.show()
Here, the sales are now ordered by date in ascending order.
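If we want the most recent dates first instead, the same orderBy call accepts a descending sort:
# Same grouping, sorted newest to oldest
sorted_sales_desc = sales_by_date.orderBy(col("date").desc())
sorted_sales_desc.show()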
Analyzing Sales by Different Time Dimensions
Similarly, if we want to see total sales by year, month, or week, we can adjust the dimensions accordingly.
Total Sales by Month:
from pyspark.sql.functions import month
sales_by_month = df.groupBy(month("date").alias("month")).agg(count("*").alias("total_sales"))
sales_by_month.show()
Now, you can see the total sales by month.
Total Sales by Week:
from pyspark.sql.functions import weekofyear
sales_by_week = df.groupBy(weekofyear("date").alias("week")).agg(count("*").alias("total_sales"))
sales_by_week.show()
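Total Sales by Year:
The same pattern applies to years as well. Here’s a quick sketch, assuming the date column parses as a date or timestamp type, just as in the monthly and weekly examples:
from pyspark.sql.functions import year
sales_by_year = df.groupBy(year("date").alias("year")).agg(count("*").alias("total_sales"))
sales_by_year.orderBy("year").show()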
Now, let’s say we only want to see sales for the first quarter of 2013, broken down by week. To achieve this, we can filter our data, which serves as the DataFrame equivalent of a SQL WHERE clause.
Total Sales by Week for Q1 2013:
from pyspark.sql.functions import year, quarter
sales_q1_2013 = df.filter((year("date") == 2013) & (quarter("date") == 1)) \
.groupBy(weekofyear("date").alias("week")) \
.agg(count("*").alias("total_sales"))
sales_q1_2013.show()
Now you can see total sales by week for the first quarter of 2013.
Conclusion
By leveraging aggregate functions and various time dimensions in Apache Spark, we can perform detailed analysis on our sales data. This allows us to gain valuable insights and make informed decisions based on our data.
Stay tuned for our next blog where we will dive even deeper into data analysis techniques using Databricks!