Not long ago we launched our new integration with Google Sheets providing a user interface to our background check API. Our customers really like this new way of interacting with our API as it reduces their workload and simplifies the review process. Clients get a spreadsheet and voilá — they don’t need to build anything in order to utilize our system. They can put profile data in the given spreadsheet and the results are posted as soon as the background check is completed for each profile. It’s not rocket science – we want to make everything as easy as possible for our customers.
Since our product infrastructure is built on services other than the Google Cloud Platform, bringing them to our development stack brought some challenges for us during the development of this tool. It took a few weeks of planning before we came to understand that our main concern was to find a way that gives our customers the power to instantiate as many background check executions as needed in the cleanest and simplest way possible.
With that goal in mind, we developed a Google Sheets add-on that lets the customers start the process. However, it didn’t work according to our expectations, and we needed a private add-on with Google Sheets shared outside our GCP organization. Unfortunately, we found out, it’s not possible with the current feature set offered by a Google Sheets add-on.
Next, we built a polling-based solution that checks all the spreadsheets for all of our customers every few minutes. But that solution is only efficient with a small number of spreadsheets due to the fact that Google Sheets API offers a very limited quota (i.e. 100 requests/100 seconds). We started facing scalability issues as we onboarded more customers to this new product and concluded that we needed a system that should only check certain spreadsheets, ones that have had changes made by the owner of the spreadsheet.
In order to achieve our new goal, we dove into the Google Drive changes API, which seemed promising. The Google Drive API offers two methods to read changes, though, really, it only provides a single method, and so the second method is designed to be integrated with the first method to avoid making unnecessary requests.
Method # 1: Polling with Changes.list
This is a pretty straightforward method. However, it requires polling the same endpoint continuously to receive the changes. Here is how it works:
- Use that token to fetch change items using the Changes.list endpoint. This will also provide a next-page token. It returns the following JSON object in response:
Where each change item is like the following JSON object:
“file”: files Resource,
“teamDrive”: teamdrives Resource,
“drive”: drives Resource
- Repeat Step 2 with the next-page token received in Step 2.
The only tricky part of this method is that in order to continue receiving the changes the system needs to make continuous requests with the `nextPageToken` tokens. It’s a never-ending loop.
Method # 2: Push Notifications with Changes.watch
This method adds a layer at the top of the first method, using a webhook URL to fix that never-ending loop. With this method, we no longer need to poll. Anytime a change occurs, Google makes an empty HTTP POST request to the webhook URL we register. Here is how it works:
- Register your domain on the Google Search Console using the Domain method. The URL prefix method won’t work because your domain also needs to be registered with your GCP project, which doesn’t support URL prefix.
- Register your URL using the Changes.watch endpoint.
From this point, for every change that occurs, Google will make an empty-body HTTP POST request to the URL you just registered. It returns the following JSON object in the response for registering the Webhook URL:
Note that this step has to be integrated with the first method. For every POST request your URL receives, it needs to make a single request to the Changes.list endpoint to actually get the list of changes and then process them. In other words, this is the point where the first method begins. But instead of polling, it only makes the request when there are notifications.
If you wish to deregister your webhook URL and stop it from receiving the change notifications, you can make an HTTP POST request to Channels.stop to do that.
Comparison of the Two Methods
|Polling||– Not needed||– Needed|
|Obtaining Changes||– For each change, the registered webhook URL is called with no change item||– Making a poll request returns a list of change items
– Option to limit the number of changes received per request
|Webhook Expiration||– The webhook registration expires after one week. The default timeout is set for an hour
– No option to automatically renew the webhook registration
|– No expiration|
|Duplicate Change items||– Duplicate changes are possible. Note that there is likely to be an “overlap” period of time when the two notification channels for the same resource are active. Read more||– Each token is a reference to a list of change items. It should be the responsibility of the endpoint to return non-duplicate change items|
|Webhook URL||– A public-facing webhook URL is required to be registered with the API||– No webhook URL is required|
|Change Item Criteria||– Opening a file is not considered as a change item||– Opening a file is considered as a change item. This occurs due to the change in metadata information storing date/time values|
|Old Change Items||– Only includes the changes that were made after the webhook was registered||– Around 2 months of old changes are fetched if a new start-page-token is used. This can be avoided by storing and using the last token used|
Sadly, these methods don’t provide information regarding the actual change that was made. While it does include the `fileId` in the change item, it also doesn’t specify whomade the change or at what time it was made. For instance, it does not say, “John modified Column 3 of Row 5 of file XYZ on the 14th August 2020 at 5:00 AM.” It just says that “File XYZ has a change.”
Thankfully, the Google Drive API also provides the Revisions API, which can help us uncover more details about the change that occurred. With it, a revision history can be fetched for a specific file using the Revisions.list endpoint, where each revision item looks something like this:
Hurrah! We now know the last `modifiedTime` and the user who modified it (via `lastModifyingUser`).
But wait, what if that last revision item in the revisions list was just added to the list, right before you fetched it? What if the revision item you were looking for comes before it? What if the time you received the change item was far later than when the change actually happened? How do we make sure that we receive the change on time? How do we know the revision history carries the exact item we are expecting to process? The list of questions goes on and on, and with no obvious answers.
The Google Drive API may work for many problems, but the problems we need to solve requires several features that we believe are currently missing from this API. We need a way to be able to receive the change item when it occurs, along with all the relevant information related to it. The Google Drive API appeared to be a promising utility but after we spent some time researching all its deep dark secrets, it turned out that it cannot currently fulfill our needs. Plus, the resources (cost and time) required to build a solution based on the existing Google Drive API features are far more expensive than its benefits.
Google Sheets is great for our customers, and we are currently working to bring an advanced system for our customers with a robust user interface that will eventually replace this integration. It will provide a consumer-friendly experience of our API with several new features. Along with it, we’ll also be releasing an open-source Python wrapper for our API to enable our customers to build their own custom solutions using our API.
We hope that you now have a slightly better understanding of how the Google Drive API works and how it might (and might not) meet your needs.
Perform background checks in latam with Emptor