Google Drive
This page contains the setup guide and reference information for the Google Drive source connector.
The Google Drive source connector pulls data from a single folder in Google Drive. Subfolders are recursively included in the sync, so all files in the specified folder and its subfolders will be considered.
Prerequisites
- Drive folder link - The link to the Google Drive folder you want to sync files from (includes files located in subfolders)
- For Airbyte Cloud: A Google Workspace user with access to the Google Drive folder
- For Airbyte Open Source:
  - A GCP project
  - The Google Drive API enabled in your GCP project
  - A Service Account Key with access to the folder you want to replicate
Setup guide
The Google Drive source connector supports authentication via either OAuth or Service Account Key Authentication.
For Airbyte Cloud users, we highly recommend using OAuth, as it significantly simplifies the setup process and allows you to authenticate directly from the Airbyte UI.
For Airbyte Open Source users, we recommend using Service Account Key Authentication. Follow the steps below to create a service account, generate a key, and enable the Google Drive API.
If you prefer to use OAuth for authentication with Airbyte Open Source, you can follow Google's OAuth instructions to create an authentication app. Be sure to set the scope to https://www.googleapis.com/auth/drive.readonly. You will need to obtain your client ID, client secret, and refresh token for the connector setup.
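If it helps, the sketch below shows one way to obtain those values. It is only a sketch: it assumes the google-auth-oauthlib package and an OAuth client secrets file downloaded from your GCP project (the file name client_secret.json is a placeholder).

```python
# Sketch: obtain a refresh token for the Google Drive read-only scope.
# Assumes `pip install google-auth-oauthlib` and a downloaded OAuth client
# secrets file; "client_secret.json" is a placeholder for your own file.
from google_auth_oauthlib.flow import InstalledAppFlow

SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]

flow = InstalledAppFlow.from_client_secrets_file("client_secret.json", SCOPES)
# Opens a browser for consent; offline access is required to receive a refresh token.
credentials = flow.run_local_server(port=0, access_type="offline", prompt="consent")

print("client_id:     ", credentials.client_id)
print("client_secret: ", credentials.client_secret)
print("refresh_token: ", credentials.refresh_token)
```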
Set up the service account key (Airbyte Open Source)
Create a service account
- Open the Service Accounts page in your Google Cloud console.
- Select an existing project, or create a new project.
- At the top of the page, click + Create service account.
- Enter a name and description for the service account, then click Create and Continue.
- Under Service account permissions, select the roles to grant to the service account, then click Continue. We recommend the Viewer role.
Generate a key
- Go to the API Console/Credentials page and click on the email address of the service account you just created.
- In the Keys tab, click + Add key, then click Create new key.
- Select JSON as the Key type. This will generate and download the JSON key file that you'll use for authentication. Click Continue.
Enable the Google Drive API
- Go to the API Console/Library page.
- Make sure you have selected the correct project from the top.
- Find and select the Google Drive API.
- Click ENABLE.
If your folder is viewable by anyone with its link, no further action is needed. If not, give your service account access by sharing the folder with the service account's email address. Check out this video for how to do this.
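As an optional sanity check before configuring the connector, a sketch like the following (assuming `pip install google-api-python-client google-auth`; the key file name and folder ID are placeholders) can confirm that the key works and that the service account can see the folder:

```python
# Sketch: verify the service account key and folder access with the Drive API.
# "service_account.json" and YOUR_FOLDER_ID are placeholders for your own values.
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]
FOLDER_ID = "YOUR_FOLDER_ID"  # the ID at the end of the Drive folder URL

creds = service_account.Credentials.from_service_account_file(
    "service_account.json", scopes=SCOPES
)
drive = build("drive", "v3", credentials=creds)

# List a few files directly inside the folder.
response = drive.files().list(
    q=f"'{FOLDER_ID}' in parents",
    pageSize=10,
    fields="files(id, name)",
).execute()
for item in response.get("files", []):
    print(item["name"], item["id"])
```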
Set up the Google Drive source connector in Airbyte
To set up Google Drive as a source in Airbyte:
- Log in to your Airbyte Cloud or Airbyte Open Source account.
- In the left navigation bar, click Sources. In the top-right corner, click + New source.
- Find and select Google Drive from the list of available sources.
- For Source name, enter a name to help you identify this source.
- Select your authentication method:
For Airbyte Cloud
- (Recommended) Select Authenticate via Google (OAuth) from the Authentication dropdown, click Sign in with Google and complete the authentication workflow.
For Airbyte Open Source
- (Recommended) Select Service Account Key Authentication from the dropdown and enter your Google Cloud service account key in JSON format:
{ "type": "service_account", "project_id": "YOUR_PROJECT_ID", "private_key_id": "YOUR_PRIVATE_KEY", ... }
- To authenticate your Google account via OAuth, select Authenticate via Google (OAuth) from the dropdown and enter your Google application's client ID, client secret, and refresh token.
- For Folder Link, enter the link to the Google Drive folder. To get the link, navigate to the folder you want to sync in the Google Drive UI, and copy the current URL.
- Configure the optional Start Date parameter that marks a starting date and time in UTC for data replication. Any files that have not been modified since this specified date/time will not be replicated. Use the provided datepicker (recommended) or enter the desired date programmatically in the format YYYY-MM-DDTHH:mm:ssZ (see the short sketch after these steps). Leaving this field blank will replicate data from all files that have not been excluded by the Path Pattern and Path Prefix.
- Click Set up source and wait for the tests to complete.
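As referenced in the Start Date step above, a minimal sketch (standard library only) that prints the current UTC time in the expected format:

```python
# Sketch: format a UTC timestamp as YYYY-MM-DDTHH:mm:ssZ for the Start Date field.
from datetime import datetime, timezone

start_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
print(start_date)  # e.g. 2024-01-31T12:00:00Z
```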
Supported sync modes
The Google Drive source connector supports the following sync modes:
| Feature | Supported? |
|---|---|
| Full Refresh Sync | Yes |
| Incremental Sync | Yes |
| Replicate Incremental Deletes | No |
| Replicate Multiple Files (pattern matching) | Yes |
| Replicate Multiple Streams (distinct tables) | Yes |
| Namespaces | No |
Path Patterns
(tl;dr -> path pattern syntax using wcmatch.glob. GLOBSTAR and SPLIT flags are enabled.)
This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:
- Referencing many files with just one pattern, e.g. ** would indicate every file in the folder.
- Referencing future files that don't exist yet (and therefore don't have a specific path).
You must provide a path pattern. You can also provide many patterns split with | for more complex directory layouts.
Each path pattern is a reference from the root of the folder, so don't include the root folder name itself in the pattern(s).
Some example patterns:
- **: match everything.
- **/*.csv: match all files with a specific extension.
- myFolder/**/*.csv: match all csv files anywhere under myFolder.
- */**: match everything at least one folder deep.
- */*/*/**: match everything at least three folders deep.
- **/file.*|**/file: match every file called "file" with any extension (or no extension).
- x/*/y/*: match all files that sit in sub-folder x -> any folder -> folder y.
- **/prefix*.csv: match all csv files with a specific prefix.
- **/prefix*.parquet: match all parquet files with a specific prefix.
Let's look at a specific example, matching the following folder layout (MyFolder is the folder specified in the connector config as the root folder, which the patterns are relative to):
MyFolder
    -> log_files
    -> some_table_files
        -> part1.csv
        -> part2.csv
    -> images
    -> more_table_files
        -> part3.csv
    -> extras
        -> misc
            -> another_part1.csv
We want to pick up part1.csv, part2.csv and part3.csv (excluding another_part1.csv for now). We could do this a few different ways:
- We could pick up every csv file called "partX" with the single pattern **/part*.csv.
- To be a bit more robust, we could use the dual pattern some_table_files/*.csv|more_table_files/*.csv to pick up relevant files only from those exact folders.
- We could achieve the above in a single pattern by using the pattern *table_files/*.csv. This could however cause problems in the future if new unexpected folders started being created.
- We can also recursively wildcard, so adding the pattern extras/**/*.csv would pick up any csv files nested in folders below "extras", such as "extras/misc/another_part1.csv".
As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
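If you want to test patterns locally before configuring the connector, here is a minimal sketch (assuming `pip install wcmatch`; the file list is just the example layout above) that evaluates a pattern with the same GLOBSTAR and SPLIT flags the connector enables:

```python
# Sketch: evaluate path patterns with wcmatch using the GLOBSTAR and SPLIT flags.
from wcmatch import glob

FLAGS = glob.GLOBSTAR | glob.SPLIT

# Paths from the example layout, relative to the configured root folder.
paths = [
    "some_table_files/part1.csv",
    "some_table_files/part2.csv",
    "more_table_files/part3.csv",
    "extras/misc/another_part1.csv",
]

pattern = "some_table_files/*.csv|more_table_files/*.csv"  # SPLIT allows the | separator
for path in paths:
    print(path, glob.globmatch(path, pattern, flags=FLAGS))
```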
User Schema
When using the Avro, Jsonl, CSV or Parquet format, you can provide a schema to use for the output stream. Note that this doesn't apply to the experimental Document file type format.
Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the folder matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:
- You only care about a specific known subset of the columns. The other columns would all still be included, but packed into the _ab_additional_properties map.
- Your initial dataset is quite small (in terms of number of records), and you think the automatic type inference from this sample might not be representative of the data in the future.
- You want to purposely define types for every column.
- You know the names of columns that will be added to future data and want to include these in the core schema as columns rather than have them appear in the _ab_additional_properties map.
Or any other reason! The schema must be provided as valid JSON as a map of {"column": "datatype"} where each datatype is one of:
- string
- number
- integer
- object
- array
- boolean
- null
For example:
{"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}
{"username": "string", "friends": "array", "information": "object"}