Data Analysis in Databricks: Aggregating and Analyzing Sales Data

In our last video, we learned how to read a CSV file in Databricks. Today, we’re diving into some data analysis using that file. So, let’s get started!

Exploring Aggregate Functions in Apache Spark with Python

Our first task is to extract the total number of sales orders received. We’ll use aggregate functions like sum and count in Apache Spark with Python to achieve this.

Total Sales Orders:
First, we’ll explore how to use the count function to determine the total number of sales orders received in our dataset.

from pyspark.sql.functions import count

# count("*") counts every row; collect() returns the single aggregate value to the driver
total_sales_orders = df.select(count("*")).collect()[0][0]
print(f"Total Sales Orders: {total_sales_orders}")

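Total Revenue (using sum):
We also mentioned sum, so here is a minimal sketch of the same pattern applied to an order value column. Note that "amount" is a placeholder column name, not something from our CSV, so swap in whatever your data actually calls the order value.

from pyspark.sql.functions import sum as sum_

# "amount" is a hypothetical column name; replace it with your actual order value column
total_revenue = df.select(sum_("amount")).collect()[0][0]
print(f"Total Revenue: {total_revenue}")
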
Total Sales Orders by Date:
Next, if we want to analyze this by date, we can add a date dimension to our DataFrame. This will allow us to view total sales orders by date.

# Group by the date column and count the orders for each date
sales_by_date = df.groupBy("date").agg(count("*").alias("total_sales"))
sales_by_date.show()

As you can see, we now have total sales displayed by day.

Sorting Sales Orders by Date:
To further refine our results, we can use orderBy (the DataFrame equivalent of a SQL ORDER BY clause) to sort the data by date, in either ascending or descending order.

from pyspark.sql.functions import col

sorted_sales_by_date = sales_by_date.orderBy(col("date").asc())
sorted_sales_by_date.show()

Here, the sales are now ordered by date in ascending order.
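
If you would rather see the most recent dates first, the same call with .desc() flips the sort direction:

sorted_sales_by_date_desc = sales_by_date.orderBy(col("date").desc())
sorted_sales_by_date_desc.show()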

Analyzing Sales by Different Time Dimensions

Similarly, if we want to see total sales by year, month, or week, we can adjust the dimensions accordingly.

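Total Sales by Year:
As a quick sketch of the year case, assuming the same df and date column as above:

from pyspark.sql.functions import year

# year() extracts the year from the date column
sales_by_year = df.groupBy(year("date").alias("year")).agg(count("*").alias("total_sales"))
sales_by_year.show()

This gives one row per year with its total order count.
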
Total Sales by Month:

from pyspark.sql.functions import month

# month() extracts the month number (1-12) from the date column
sales_by_month = df.groupBy(month("date").alias("month")).agg(count("*").alias("total_sales"))
sales_by_month.show()

Now, you can see the total sales by month.
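
One thing to keep in mind: grouping only on month() folds together the same month from different years. If the data spans more than one year, a sketch that groups by both year and month keeps them separate:

from pyspark.sql.functions import year, month

sales_by_year_month = df.groupBy(year("date").alias("year"), month("date").alias("month")) \
                        .agg(count("*").alias("total_sales")) \
                        .orderBy("year", "month")
sales_by_year_month.show()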

Total Sales by Week:

from pyspark.sql.functions import weekofyear

# weekofyear() returns the ISO week number (1-53) for each date
sales_by_week = df.groupBy(weekofyear("date").alias("week")).agg(count("*").alias("total_sales"))
sales_by_week.show()

Now, let’s say we only want to see sales for the first quarter of 2013, broken down by week. To achieve this, we can apply a filter (the DataFrame equivalent of a SQL WHERE clause) to our data.

Total Sales by Week for Q1 2013:

from pyspark.sql.functions import year, quarter

# Keep only rows from Q1 2013, then count the orders in each week
sales_q1_2013 = df.filter((year("date") == 2013) & (quarter("date") == 1)) \
                  .groupBy(weekofyear("date").alias("week")) \
                  .agg(count("*").alias("total_sales"))
sales_q1_2013.show()

Now, you can see we have total sales by week for that first quarter of 2013.
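
Since we have been describing these steps in terms of WHERE and ORDER BY, it is worth noting that the same query can be written directly in SQL from a Databricks notebook. This is just a sketch that assumes df is registered as a temporary view named "sales":

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("sales")

sales_q1_2013_sql = spark.sql("""
    SELECT weekofyear(date) AS week, count(*) AS total_sales
    FROM sales
    WHERE year(date) = 2013 AND quarter(date) = 1
    GROUP BY weekofyear(date)
    ORDER BY week
""")
sales_q1_2013_sql.show()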

Conclusion

By leveraging aggregate functions and various time dimensions in Apache Spark, we can perform detailed analysis on our sales data. This allows us to gain valuable insights and make informed decisions based on our data.

Stay tuned for our next blog where we will dive even deeper into data analysis techniques using Databricks!