I recently had to migrate a PostgreSQL Database to Google's Cloud Data Warehouse Solution BigQuery. So, I had a bunch of PostgreSQL queries to convert using BigQuery Standard SQL. It was not as easy as I thought it would be.
SQL, or Structured Query Language, is the standard language for interacting with relational databases. But, as you probably know, many dialects of SQL are available depending on the database management system you're using: PostgreSQL, MySQL, SQL Server, etc.
The first thing you should know is that BigQuery uses legacy SQL or standard SQL. You can set up a default dialect or specify it before each request. In the following queries, I use Bigquery standard SQL.
That being said, some BigQuery specificities make the conversion a bit tricky.
BigQuery’s data types
Let's start with the basics! To write BigQuery queries, it is necessary to use the right data types. So, here is a table with primitive PostgreSQL data types and their BigQuery equivalent.
As you may have noticed, there are some PostgreSQL data types that are not available in BigQuery Standard SQL. You can always find a workaround to solve this problem, but it depends on your use case. For instance, I use the STRING type for my UUID column.
Other data types are available in BigQuery. I find GEOGRAPHY types particularly valuable. You can take a look at the official Google documentation for more details on BiQuery’s data types.
User-defined functions from PostgreSQL to BigQuery
I found writing functions in BigQuery much simpler than using PostgreSQL. Here is a function that increments an integer using PostgreSQL.
CREATE OR REPLACE FUNCTION increment(i integer) RETURNS integer AS $$
BEGIN
RETURN i + 1;
END;
$$
LANGUAGE plpgsql;
Here is the same one using BigQuery:
CREATE OR REPLACE FUNCTION increment(i int64) AS (i + 1)
[RETURNS int64];
It is also possible to use javascript to write user-defined functions in BigQuery. This is mainly useful for complex functions.
Here is a simple example of a function using javascript to check if a number is even or odd:
CREATE OR REPLACE FUNCTION evenOrOdd(n INT64) RETURNS STRING
LANGUAGE js AS """
if (n % 2 == 0) {
return 'Even';
} else {
return 'Odd';
}
""";
Subqueries in BigQuery standard SQL
A subquery is a query that appears inside another query statement. In BigQuery Standard SQL, the syntax of a subquery depends on its type. So, if the query you are converting from PostgreSQL to BigQuery query contains a subquery make sure to identify its type first: Scalar, ARRAY, IN, or EXISTS.
Scalar subquery
It is a subquery inside an expression. It is commonly used in the SELECT list or WHERE clause. The important thing to know is this type of subquery must return a single row. If there is no row to return the result is NULL and in the case of multiple rows, it fails with a runtime error. Actually, this is also a failure case in PostgreSQL. One of the good things about BigQuery Standard SQL is the possibility to return a single STRUCT using SELECT AS STRUCT
. A STRUCT may contain multiple expressions.
For example, to get the quantity remaining in stock of booked items you can use this query:
SELECT
id,
(SELECT quantity FROM db.stock WHERE booking.stockId=id) AS quantity
FROM db.booking;
PS: this query will not work in PostgreSQL since the stockId column will be interpreted as stockid unless you use double quotes: booking.”stockId”. On the other hand, double quotes are not acceptable in BigQuery Standard SQL.
However, if you also want the lastModificationDate of the stock you should use a STRUCT:
SELECT
id,
(SELECT AS STRUCT quantity, lastModificationDate
FROM db.stock WHERE booking.stockId=id) AS stock
FROM db.booking;
ARRAY
You guessed it, this type of subquery returns an ARRAY. To avoid errors, the SELECT
list in an ARRAY subquery must have exactly one column: it defines the type of its elements.
SELECT
id,
ARRAY(SELECT itemId FROM db.booking WHERE booking.userId=user.id)
AS bookings
FROM db.user;
As you can see in the example above, it is possible to create a table named user in BigQuery which is a reserved word in PostgreSQL. But, the structure of the query is the same in both dialects.
Similar to scalar subqueries, you can use SELECT AS STRUCT
to build arrays of structs. In this case, the SELECT list can include multiple columns.
IN
The IN subquery checks if a value is in the set of rows returned by the subquery. It returns True if it is the case and FALSE if no rows are returned by the subquery. However, the subquery's SELECT list must have a single column and its type must be comparable to the type of the value to check.
Now, if we want to select users having booked a certain item with id 123, we should use the IN subquery and UNNEST the array :
SELECT id
FROM (SELECT id, ARRAY(SELECT itemId FROM db.booking WHERE booking.userId=user.id) AS bookings FROM db.user) AS user_bookings
WHERE 123 IN UNNEST(user_bookings.bookings);
An equivalent query in PostgreSQL would be the following:
SELECT id
FROM (SELECT id, ARRAY(SELECT "itemId" FROM db.booking WHERE booking."userId"="user".id) AS bookings FROM "user") AS user_bookings
WHERE 123=ANY(user_bookings.bookings);
EXISTS
This is the least complicated one since there are no rules about the column list and it is used similarly in the two dialects. It returns TRUE if the subquery produces at least one row and FALSE if the subquery produces none.
SELECT EXISTS(SELECT userId FROM db.booking WHERE itemId=123);
To conclude this section I would like to draw your attention to one of BigQuery drawbacks. If you want to create a materialized view, you should probably know that it supports a restricted SQL syntax. Subqueries, for instance, can not be used as of today.
I hope this article will help you smoothly migrate your PostgreSQL queries to BigQuery Standard SQL and avoid the errors I encountered.
If you are looking for Data Experts, don't hesitate to contact us!