Inner Join Using PySpark in Databricks

Creating an inner join in PySpark within a Databricks environment is straightforward. Below is a step-by-step guide covering the notebook setup, sample data preparation, and the join itself.

Step 1: Set Up Databricks Notebook

  1. Create a Cluster: If you don’t already have a cluster, create one (video walkthrough: https://youtu.be/xa7P1cYZZSE).
  2. Create a Notebook: Once the cluster is running, create a new notebook for your PySpark code; a quick sanity check is shown below.
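In a Databricks notebook, the spark session is already created for you, so a minimal sanity check (assuming the notebook is attached to a running cluster) is simply:

# Databricks predefines `spark`; printing the version confirms the attachment works
print(spark.version)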

Step 2: Sample Data Preparation

For demonstration purposes, let’s create two sample DataFrames.

# Sample data for DataFrame 1
data1 = [
    (1, "John", "Doe"),
    (2, "Jane", "Smith"),
    (3, "Mike", "Johnson")
]
columns1 = ["id", "first_name", "last_name"]

# Sample data for DataFrame 2
data2 = [
    (1, "[email protected]", "123-456-7890"),
    (2, "[email protected]", "098-765-4321"),
    (4, "[email protected]", "555-555-5555")
]
columns2 = ["id", "email", "phone"]

# Create DataFrames
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)
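
Before joining, it can help to confirm what each DataFrame contains, for example:

# Inspect the sample DataFrames
df1.show()
df2.show()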

Step 3: Perform Inner Join

Now, let’s perform an inner join on these two DataFrames using the id column.

# Perform inner join
inner_join_df = df1.join(df2, df1.id == df2.id, "inner")

# Display the result
inner_join_df.show()
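
With the sample data above, only ids 1 and 2 appear in both DataFrames, so the output should look roughly like this (exact column widths may differ):

+---+----------+---------+---+----------------+------------+
| id|first_name|last_name| id|           email|       phone|
+---+----------+---------+---+----------------+------------+
|  1|      John|      Doe|  1|john@example.com|123-456-7890|
|  2|      Jane|    Smith|  2|jane@example.com|098-765-4321|
+---+----------+---------+---+----------------+------------+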

Full Script in Databricks Notebook

Here’s the full script you can run in your Databricks notebook:

# Import required libraries
from pyspark.sql import SparkSession

# Get the active Spark session (Databricks provides one; getOrCreate reuses it)
spark = SparkSession.builder.appName("InnerJoinExample").getOrCreate()

# Sample data for DataFrame 1
data1 = [
    (1, "John", "Doe"),
    (2, "Jane", "Smith"),
    (3, "Mike", "Johnson")
]
columns1 = ["id", "first_name", "last_name"]

# Sample data for DataFrame 2
data2 = [
    (1, "[email protected]", "123-456-7890"),
    (2, "[email protected]", "098-765-4321"),
    (4, "[email protected]", "555-555-5555")
]
columns2 = ["id", "email", "phone"]

# Create DataFrames
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)

# Perform inner join
inner_join_df = df1.join(df2, df1.id == df2.id, "inner")

# Display the result
inner_join_df.show()
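
One caveat worth noting: because the join condition references both DataFrames, the result keeps two id columns, which can cause ambiguous-column errors in later selects. A common alternative, shown here as a small sketch using the same df1 and df2, is to pass the column name itself so the result keeps a single id column:

# Joining on the column name deduplicates the join key in the result
inner_join_df = df1.join(df2, on="id", how="inner")
inner_join_df.show()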

Explanation

  • Creating DataFrames: We create two sample DataFrames, df1 and df2, with spark.createDataFrame.
  • Inner Join: We call the join method with the condition df1.id == df2.id and the join type "inner". Because the condition references both DataFrames, both id columns are kept; the selection sketch below shows one way to trim the result.
  • Display Results: The show method prints the joined rows to the notebook output.
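
If you keep the df1.id == df2.id form, one way to narrow the output (a sketch using the DataFrames from this example) is to select the columns you need explicitly, qualifying the join key:

# Keep df1's id and drop the duplicate column from df2
inner_join_df.select(df1.id, "first_name", "last_name", "email", "phone").show()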

Conclusion

This simple example demonstrates how to perform an inner join in PySpark on Databricks. You can extend it with your own data and join conditions as needed; one possible extension is sketched below.
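
For instance, if your DataFrames also shared a region column (hypothetical here, not part of the sample data), a compound join condition might look like this sketch:

# Hypothetical two-column join condition
joined = df1.join(
    df2,
    (df1.id == df2.id) & (df1.region == df2.region),
    "inner"
)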