Databricks Volumes
Ingest your files into Unstructured from Databricks Volumes.
The requirements are as follows.
The preceding video shows how to use Databricks personal access tokens (PATs), which are supported only for Unstructured Ingest.
To learn how to use Databricks-managed service principals, which are supported by both the Unstructured Platform and Unstructured Ingest, see the additional videos later on this page.
- The Databricks workspace URL. Get the workspace URL for AWS, Azure, or GCP.
  Examples:
  - AWS: https://<workspace-id>.cloud.databricks.com
  - Azure: https://adb-<workspace-id>.<random-number>.azuredatabricks.net
  - GCP: https://<workspace-id>.<random-number>.gcp.databricks.com
- The Databricks authentication details. For more information, see the documentation for AWS, Azure, or GCP.
The following videos show how to create a Databricks-managed service principal and then grant it access to a Databricks volume:
For the Unstructured Platform, only the following Databricks authentication type is supported:
- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal. Note that for Azure, only Databricks-managed service principals are supported. Microsoft Entra ID-managed service principals are not supported.
For Unstructured Ingest, the following Databricks authentication types are supported:
- For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token’s value.
- For username and password (basic) authentication (AWS only): The user’s name and password values.
- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
- For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
- For Azure managed identities (MSI) authentication (Azure only): The client ID value for the corresponding managed identity.
- For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
- For Azure CLI authentication (Azure only): No additional values.
- For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
- For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account’s credentials file.
- For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account’s email address.
- The Databricks catalog name for the volume. Get the catalog name for AWS, Azure, or GCP.
- The Databricks schema name for the volume. Get the schema name for AWS, Azure, or GCP.
- The Databricks volume name, and optionally any path in that volume that you want to access directly. Get the volume information for AWS, Azure, or GCP.
- Make sure that the target user or service principal has access to the target volume. To learn more, see the documentation for AWS, Azure, or GCP.
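As a quick sanity check on the workspace URL requirement, the following minimal Python sketch infers the cloud provider from the host suffix. The suffixes come from the example workspace URLs in the requirements above. The function name `detect_cloud` is hypothetical and is not part of any Unstructured or Databricks library, and workspaces that use custom or vanity domains would not match these suffixes.

```python
from urllib.parse import urlparse

# Host suffixes taken from the example workspace URLs in the requirements.
# Assumption: the workspace uses one of these standard domains.
_CLOUD_SUFFIXES = {
    ".cloud.databricks.com": "aws",
    ".azuredatabricks.net": "azure",
    ".gcp.databricks.com": "gcp",
}


def detect_cloud(workspace_url: str) -> str:
    """Return "aws", "azure", or "gcp" for a Databricks workspace URL.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(workspace_url)
    if parsed.scheme != "https" or not parsed.hostname:
        raise ValueError(f"Expected an https:// workspace URL, got: {workspace_url!r}")

    host = parsed.hostname.lower()
    for suffix, cloud in _CLOUD_SUFFIXES.items():
        if host.endswith(suffix):
            return cloud
    raise ValueError(f"Unrecognized Databricks workspace host: {host!r}")
```

Knowing the cloud provider up front helps when choosing an authentication type, because several of the types listed above are available on only one cloud.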
To create or change a Databricks Volumes source connector, see the following examples.
Replace the preceding placeholders as follows:
- <name> (required) - A unique name for this connector.
- <host> (required) - The Databricks workspace host URL.
- <client-id> (required) - The application ID value for the Databricks-managed service principal that has access to the volume.
- <client-secret> (required) - The associated OAuth secret value for the Databricks-managed service principal that has access to the volume.
- <catalog> (required) - The name of the catalog to use.
- <schema> - The name of the associated schema. If not specified, default is used.
- <volume> (required) - The name of the associated volume.
- <volume-path> - Any optional path to access within the volume.
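The placeholder rules above (which values are required, and the fallback for the schema) can be sketched in Python as follows. This is an illustration only: the function name `build_source_config`, the key names, and the nesting of the returned dictionary are assumptions and are not taken from the Unstructured API reference. Check the actual request format before sending anything to the API.

```python
from typing import Optional


def build_source_config(
    name: str,
    host: str,
    client_id: str,
    client_secret: str,
    catalog: str,
    volume: str,
    schema: Optional[str] = None,
    volume_path: Optional[str] = None,
) -> dict:
    """Assemble connector settings from the placeholder values.

    Hypothetical helper: key names and structure are assumed for illustration.
    """
    required = {
        "name": name,
        "host": host,
        "client_id": client_id,
        "client_secret": client_secret,
        "catalog": catalog,
        "volume": volume,
    }
    # Reject missing or blank required values early.
    missing = [key for key, value in required.items() if not value or not value.strip()]
    if missing:
        raise ValueError(f"Missing required values: {', '.join(missing)}")

    config = {
        "host": host,
        "client_id": client_id,
        "client_secret": client_secret,
        "catalog": catalog,
        # Per the placeholder list, the schema falls back to "default".
        "schema": schema or "default",
        "volume": volume,
    }
    # The volume path is optional, so include it only when given.
    if volume_path:
        config["volume_path"] = volume_path

    return {"name": name, "config": config}
```

Validating the values before building the request keeps configuration mistakes, such as a blank client secret, from surfacing later as authentication failures.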
To learn how to create a Databricks-managed service principal, get its application ID, and generate an associated OAuth secret, see the documentation for AWS, Azure, or GCP.
For Azure, only Databricks-managed service principals are supported. Microsoft Entra ID-managed service principals are not supported.
To learn how to grant a Databricks-managed service principal access to a volume, see the documentation for AWS, Azure, or GCP.
To change a connector, replace <connector-id> with the source connector’s unique ID. To get this ID, see List source connectors.