We can use read different types of data files in the DataBricks. Thanks to it’s capability and available libraries which makes easier to read different types of data files. In this tutorial we will see how we can read excel files and what are the options available.
In this tutorial we will https://github.com/crealytics/spark-excel plugin to read excel files in DataBricks.
spark.read
.format("com.crealytics.spark.excel")
.option("dataAddress", "'My Sheet'!A1:Z500")
.option("header", "true")
.option("treatEmptyValuesAsNulls", "false")
.option("setErrorCellsToFallbackValues", "true")
.option("usePlainNumberFormat", "false")
.option("inferSchema", "false")
.option("addColorColumns", "true")
.option("timestampFormat", "MM-dd-yyyy HH:mm:ss")
.option("maxRowsInMemory", 20)
.option("excerptSize", 10)
.option("workbookPassword", "pass")
.sschema(myCustomSchema)
.load("ExcelFileName.xlsx")
Uses of different options
dataAddress: This option in optional in and default value for this is A1.
header: This option is used to define if data file has header in first row, if this option set to false then the columns will be names as col1, col2…etc.
Optional: No
treatEmptyValuesAsNulls: This option is used to convert all empty value to null while reading the data. This option is optional and default value for this true.
Optional: Yes
Default: True
setErrorCellsToFallbackValues: This option is used to handle cells which contains error values. According to documentation: where errors will be converted to null. If true, any ERROR cell values (e.g. #N/A) will be converted to the zero values of the column’s data type.
Optional: Yes
Default: False