Why I like to use UUIDv7 as primary keys
Auto-incrementing integers - the classic approach
The classic approach to relational database design is to use an auto-incrementing integer as the primary key. It is simple and efficient, but it has some drawbacks:
- It does not scale horizontally out of the box. If the database has to be split into multiple shards, each shard would hand out the same integers, so we would need an extra coordination mechanism to keep keys unique across shards, which adds complexity and overhead.
- It can contribute to the current number one risk in the OWASP Top 10, Broken Access Control: predictable primary keys that identify resources make it easier for attackers to exploit gaps in access control. A single endpoint that is not properly secured can open the gates to data leaks and other critical scenarios.
UUIDv4 - a common (not so good) solution
One of the most common solutions to this problem is to replace the usage of auto-incrementing integers with UUIDs. There are many versions of UUID, but the most common one is UUIDv4.
UUIDv4 is an almost entirely random sequence of bits, which means that the value of a UUIDv4 cannot be predicted in any practical way. At first glance, this seems like a good solution: it greatly mitigates the risk of broken access control and, in practice, guarantees the uniqueness of the primary key across multiple shards.
However, there is a critical drawback to using UUIDv4 as a primary key: the database can't index it optimally. This post won't get into the technical details of how indexes work, but long story short, UUIDv4's completely random, non-sequential nature hurts the database, especially when writing data, because new keys land at random positions in the index instead of being appended at the end. This post from PlanetScale explains the tradeoffs of using UUIDs (including UUIDv4 and UUIDv7) as primary keys: The Problem with Using a UUID Primary Key in MySQL. There are alternatives like Snowflake ID, ULID or NanoID, but UUIDs can be stored compactly in relational databases: PostgreSQL has a native uuid type, and MySQL can store them as BINARY(16), which helps us save space.
UUIDv7 - a better solution
UUIDv7 keeps most of the randomness of UUIDv4 while putting a 48-bit Unix timestamp (in milliseconds) at the front of the UUID, which makes it sortable by creation time and much friendlier to indexes. Despite these nice-to-have properties, most of the UUID is still random, which almost guarantees uniqueness even across multiple shards. Now, the question is: does Ecto have a way to automatically use UUIDv7 as primary keys?
The short answer is: no. The long answer is: no, but we can use metaprogramming to achieve this.
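To make the layout less abstract, here is a minimal sketch of how a UUIDv7-shaped value can be assembled by hand: 48 bits of Unix time in milliseconds, 4 version bits, 12 random bits, 2 variant bits and 62 more random bits. The module name UUIDv7Sketch and its functions are made up for illustration and only rely on :crypto and the standard library. This is not how the uuidv7 package is implemented, and in a real project a library should do this job.

```elixir
defmodule UUIDv7Sketch do
  # Illustrative only: builds a 128-bit binary with the UUIDv7 field layout.
  def generate(unix_ms \\ System.os_time(:millisecond)) do
    # 10 random bytes = 80 bits; 74 of them are used, 6 are discarded.
    <<rand_a::12, rand_b::62, _unused::6>> = :crypto.strong_rand_bytes(10)

    # timestamp (48) | version 7 (4) | random (12) | variant 0b10 (2) | random (62)
    <<unix_ms::48, 7::4, rand_a::12, 2::2, rand_b::62>>
  end

  # Formats the 16 bytes in the usual 8-4-4-4-12 hexadecimal form.
  def format(<<a::binary-size(4), b::binary-size(2), c::binary-size(2),
               d::binary-size(2), e::binary-size(6)>>) do
    Enum.map_join([a, b, c, d, e], "-", &Base.encode16(&1, case: :lower))
  end
end

# Two values created one millisecond apart (timestamps chosen arbitrarily).
older = UUIDv7Sketch.generate(1_700_000_000_000)
newer = UUIDv7Sketch.generate(1_700_000_000_001)

IO.puts(UUIDv7Sketch.format(older))
IO.puts(UUIDv7Sketch.format(newer))

# Binaries compare byte by byte, so the later value sorts after the earlier one.
IO.inspect(older < newer, label: "older sorts first")
```

Because the timestamp sits in the most significant bits, values generated later compare greater than earlier ones, which is what keeps index inserts close to append-only.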
UUIDv7 in Elixir
Elixir doesn't ship with UUIDv7 support, but we can use the uuidv7 package for this. As with any other external dependency, we need to add it to our mix.exs file:
defp deps do
  [
    {:uuidv7, "~> 1.0"}
  ]
end
UUIDv7 in Ecto
This is the most interesting part, where we will need to use metaprogramming. The vanilla way of using UUIDv7 as the primary key of an Ecto schema would be to define the module attributes manually in every schema. I'm too lazy to do that, and that approach is prone to errors (aka I'd forget to add them 90% of the time).
However, Elixir comes to the rescue one more time with its metaprogramming capabilities. We can create a macro that wraps Ecto.Schema and sets UUIDv7 as the primary key type. Metaprogramming can be scary at first, but once you understand how it works it opens a lot of possibilities and allows you to write code that is more flexible and easier to maintain (but it also allows you to do the opposite, so use it wisely).
defmodule MyApp.Schema do
  defmacro __using__(_opts) do
    quote do
      use Ecto.Schema
      @primary_key {:id, UUIDv7, autogenerate: true}
      @foreign_key_type UUIDv7
    end
  end
end
It's that simple: a little macro that wraps the Ecto.Schema.__using__/1 macro to inject UUIDv7 as the primary key type. From here on, in any schema we define, we will use MyApp.Schema instead of Ecto.Schema, and we will get UUIDv7 primary keys that are autogenerated when we insert records with Repo.insert/2.
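If use still feels like magic, the toy example below may help. It assumes nothing about Ecto: Sketch.Defaults and Sketch.User are hypothetical modules that only show how the code quoted inside __using__/1 ends up compiled into the module that calls use.

```elixir
defmodule Sketch.Defaults do
  # Hypothetical module: stands in for a wrapper like the one above.
  defmacro __using__(_opts) do
    quote do
      # Everything in this quote block is compiled inside the calling module.
      @primary_key_type :uuid_v7
      def __primary_key_type__, do: @primary_key_type
    end
  end
end

defmodule Sketch.User do
  # Expands to the quoted code returned by Sketch.Defaults.__using__/1.
  use Sketch.Defaults
end

# The function was never written in Sketch.User, yet it exists there.
IO.inspect(Sketch.User.__primary_key_type__())
```

The same mechanism is what lets a single use line set @primary_key and @foreign_key_type for every schema.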
The time issue
UUIDv7 helps us have a unique identifier for each record, allowing the database to scale horizontally. When this need appears, it usually means that you will be storing data coming from multiple sources, each with its own local time. However, our database should be a single source of truth, so we need a way to represent time that is consistent across the different timezones. The easiest way to do this is to use timestamps in UTC. This is the approach that Ecto uses by default, but it has a drawback: in many cases, we will want to keep the information related to the local timezone.
For example, in financial applications, the time at which a transaction takes place is crucial when it comes to detecting potential fraud. If a credit card is used at an ATM at 4:00 AM to withdraw a big sum of money, it's more likely to be a fraudulent transaction. If the DB storing transactions stores the time in UTC and for whatever reason the record doesn't contain the location of the ATM, we won't be able to tell whether the transaction took place at 4:00 AM local time or at any other time. However, if we store the time together with the local timezone, that single data point will be enough to flag a transaction as suspicious in this scenario.
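The information loss can be seen with nothing but the standard library. In this sketch, the withdrawal time is an invented ISO 8601 string with a -05:00 offset. DateTime.from_iso8601/1 normalises it to UTC and returns the offset separately, so the local hour can only be recovered if that offset is stored as well.

```elixir
# Hypothetical ATM withdrawal at 4:00 AM local time (UTC-5).
{:ok, utc_time, offset_seconds} =
  DateTime.from_iso8601("2024-05-20T04:00:00-05:00")

# The UTC value alone says 09:00, which does not look suspicious.
IO.inspect(utc_time.hour, label: "UTC hour")

# The offset is returned separately, in seconds.
IO.inspect(offset_seconds, label: "offset in seconds")

# With the offset kept next to the UTC value, the local wall-clock hour
# can be rebuilt (the struct is still labelled UTC, only the clock shifts).
local_wall_clock = DateTime.add(utc_time, offset_seconds, :second)
IO.inspect(local_wall_clock.hour, label: "local hour")
```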
Keeping the local timezone
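Thankfully, Ecto has a built-in timestamp type that helps here: :utc_datetime. Setting it through the @timestamps_opts attribute makes the timestamp fields timezone-aware DateTime structs instead of naive datetimes. Keep in mind that :utc_datetime values are always in UTC, so the original offset or zone name has to be stored in its own column if the local time needs to be recovered later. If we add this to the macro that we already defined, we get the following: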
defmodule MyApp.Schema do
  defmacro __using__(_opts) do
    quote do
      use Ecto.Schema
      @primary_key {:id, UUIDv7, autogenerate: true}
      @foreign_key_type UUIDv7
      @timestamps_opts [type: :utc_datetime]
      def __timestamp_defaults__, do: @timestamps_opts
    end
  end
end
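You may have noticed that we are adding a new function, __timestamp_defaults__/0, to the macro. As far as I can tell, Ecto itself doesn't call it: timestamps/1 reads the @timestamps_opts attribute directly when it is expanded inside a schema definition. The function simply exposes the configured defaults so they can be inspected at runtime.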
Automating the inclusion of timestamps in Ecto schemas
The reason I wrote this whole post is that I kept forgetting to add the timestamps/1 call to the schemas I defined in a toy project I've been working on lately. It really annoyed me to see tests failing just because I had left out a piece of boilerplate, so I looked for a way to automate it: my macro should add the timestamps/1 call to the schema definition ONLY if I forgot to add it.
The Elixir AST
From the official documentation:
Elixir syntax was designed to have a straightforward conversion to an abstract syntax tree (AST)
Elixir metaprogramming makes use of this syntax design to represent our code as an AST, making it easy to manipulate and transform our code programmatically. This post won't get into the details of how Elixir metaprogramming works, but if you want to learn more, the official documentation is a good starting point. I also can't recommend enough the book Metaprogramming Elixir by Chris McCord (the creator of Phoenix).
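As a small illustration of what that AST looks like, the snippet below quotes two schema-like bodies and inspects the resulting tuples. The field and timestamps calls are never executed here, they are only turned into data. A single expression is quoted as one three-element tuple, while several expressions are wrapped in a :__block__ node.

```elixir
# A body with a single expression is quoted as one {name, metadata, args} tuple.
single =
  quote do
    field :name, :string
  end

# A body with several expressions is wrapped in a :__block__ node
# whose third element is the list of expressions.
multi =
  quote do
    field :name, :string
    timestamps()
  end

IO.inspect(single, label: "single expression")
IO.inspect(multi, label: "several expressions")

# Macro.to_string/1 turns the AST back into source code.
IO.puts(Macro.to_string(multi))
```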
Manipulating the AST of our schema definitions to include the timestamps
To perform this manipulation, we will define a macro that overrides the original schema/2 macro from Ecto.Schema. That macro is responsible for generating, at compile time, the Elixir code that internally defines a schema based on the one written by the user.
This override checks whether the schema defined by the user already contains a call to timestamps/1. If it does, the block is passed unchanged to the original Ecto.Schema.schema/2 macro. Otherwise, a call to timestamps/1 is appended to the block before it is passed to the original Ecto.Schema.schema/2 macro.
defmodule MyApp.Schema do
  alias MyApp.Schema
  defmacro __using__(opts \\ []) do
    quote do
      use Ecto.Schema, unquote(opts)
      # We want to import all the functions from Ecto.Schema except for schema/2, which will be overridden by our macro.
      import Ecto.Schema, except: [schema: 2]
      import Schema
      require Logger
      @primary_key {:id, UUIDv7, autogenerate: true}
      @foreign_key_type UUIDv7
      @timestamps_opts [type: :utc_datetime]
      def __timestamp_defaults__, do: @timestamps_opts
    end
  end
  # Define the schema macro at module level
  defmacro schema(source, do: block) do
    has_timestamps? =
      case block do
        {:__block__, _, expressions} ->
          # This captured version is equivalent to:
          # Enum.any?(expressions,
          #   fn expression -> match?({:timestamps, _, _}, expression)
          # end)
          Enum.any?(expressions, &match?({:timestamps, _, _}, &1))
        {:timestamps, _, _} ->
          true
        _ ->
          false
      end
    if has_timestamps? do
      quote do
        require Ecto.Schema
        Ecto.Schema.schema unquote(source) do
          unquote(block)
        end
      end
    else
      quote do
        require Ecto.Schema
        Ecto.Schema.schema unquote(source) do
          unquote(block)
          timestamps(type: :utc_datetime)
        end
      end
    end
  end
end
And that's it! Now, in any schema we define, we can forget about adding the timestamps/1 call, and our macro will do it for us.
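A quick way to sanity-check the detection logic without compiling any schema is to pull the pattern match into a plain function and feed it quoted blocks. TimestampCheck is a hypothetical helper written for this sketch, not part of Ecto or of the macro above, but it uses the same three patterns.

```elixir
defmodule TimestampCheck do
  # Several expressions: look for a top-level timestamps call among them.
  def has_timestamps?({:__block__, _, expressions}) do
    Enum.any?(expressions, &match?({:timestamps, _, _}, &1))
  end

  # A single expression that is itself the timestamps call.
  def has_timestamps?({:timestamps, _, _}), do: true

  # Anything else.
  def has_timestamps?(_), do: false
end

with_timestamps =
  quote do
    field :email, :string
    timestamps()
  end

without_timestamps =
  quote do
    field :email, :string
  end

# The call is nested inside another expression, not at the top level.
nested =
  quote do
    field :email, :string

    if true do
      timestamps()
    end
  end

IO.inspect(TimestampCheck.has_timestamps?(with_timestamps), label: "with")
IO.inspect(TimestampCheck.has_timestamps?(without_timestamps), label: "without")
IO.inspect(TimestampCheck.has_timestamps?(nested), label: "nested")
```

Note that the check only looks at top-level expressions, so a timestamps call wrapped in another construct would not be detected and the macro would append a second one. For ordinary schemas, where timestamps sits directly in the schema block, that is not a problem.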