A Python ORM for the ORM haters: A minimal implementation in less than 100 lines
Object-relational mappers, or ORMs, continue to be controversial. There are many critics out there (Ted Neward famously called them the Vietnam of computer science, a line later popularized by Jeff Atwood), but many others swear by them. They continue to be the default in the big web frameworks like Rails and Django.
I'll state my own position - I'm mostly a hater. The main purported benefit is helping you avoid contact with SQL. But this abstraction layer is leaky, and anything remotely serious involving the database is going to mean getting your hands dirty sooner or later. If you aren't familiar with their internals they can feel very opaque and reliant on "magic".
Another problem is lock-in. ORMs are heavily tied to their particular framework, and typically go beyond the role of query helper to act as schema definition and migration tool as well. This means it's hard to swap out individual parts, or to play nicely with other apps and change management tooling that might also use the same database. (These drawbacks frustrated me enough that I wrote a whole non-ORM-reliant migration tool to address them.)
Going off-piste
Nevertheless, if we decide we don't like ORMs, what can we do instead? ORMs are so ubiquitous that writing application code without one can feel like seriously uncharted territory. Say what you like about 'em, at least ORMs are well-documented!
And while aspects of existing ORM implementations can be frustrating, organising our logic into objects and entities seems conceptually sound as a general principle of code organization.
Also, we don't want to get to a situation where we are defining fields manually in both schema definition and object definition: we want to specify our fields only once.
What's the minimum amount of code we can get away with that lets us work in accordance with these principles, but avoids the heavy weight and coupling of a full-on ORM?
Principles for something minimal
Let's see what we can throw together that gives us the following:
- Typical saving-and-loading from the database to Python objects and back again
- Uses real SQL for select queries and schema definition, for maximum control
- No field repetition
This definitely won't be suitable as-is for production use. But it shows how you might approach wiring together your own database interaction code that works best for your needs and preferences.
And it definitely won't have feature parity with a full ORM - that's not the point. This is an exercise in trading intricate complexity for directness and explicitness.
Our tools
For this example we'll use Python and PostgreSQL. If following along, use the most recent versions of both - we'll be using dataclasses, which were only added to the Python standard library fairly recently (in 3.7).
We'll be using a couple of Python packages to help us:
- results: I helped write results at my former job; it's mostly just a boilerplate-reducing wrapper over sqlalchemy. I still use it a lot, but it isn't well-documented, so only use it if you're feeling adventurous!
- schemainspect: This is a dependency of migra. It inspects database schemas (currently PostgreSQL only).
Defining and creating
Let's pretend we're writing some software to manage books in a library. A simplified schema might look like this:
create table book(
    book_id uuid default gen_random_uuid() primary key,
    title varchar,
    author varchar
);

create table borrower(
    borrower_id uuid default gen_random_uuid() primary key,
    name varchar
);

create table issue(
    issue_id uuid default gen_random_uuid() primary key,
    book_id uuid,
    borrower_id uuid,
    issued_at timestamptz default now()
);
We'd like a class to represent each of these entities. We can do this succinctly by inspecting the schema and using that to define data classes:
from dataclasses import make_dataclass, field as dcfield
from pathlib import Path
import results
db = results.db('postgresql:///library')
schema_sql = Path('schema.sql').read_text()
db.raw(schema_sql)
# or you can sync the schema (requires pg running locally)
# db.sync_db_structure_to_definition(schema_sql)
i = db.inspect()
def make_class(tablename):
    # look up the inspected table (schemainspect keys tables by
    # their quoted, schema-qualified name)
    it = i.tables[f'"public"."{tablename}"']
    classname = tablename.title()
    columns = [
        (c.name, c.pytype, dcfield(default=None))
        for c in it.columns
    ]
    return make_dataclass(classname, columns)
Book = make_class('book')
Issue = make_class('issue')
Borrower = make_class('borrower')
This small amount of code gives us real classes but with their fields defined from the database.
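For reference, the generated Book class is roughly equivalent to writing something like this by hand (a sketch - the exact type annotations depend on how the inspected column types map to Python types):

from dataclasses import dataclass, field
from uuid import UUID

@dataclass
class Book:
    # fields mirror the book table's columns, all defaulting to None
    book_id: UUID = field(default=None)
    title: str = field(default=None)
    author: str = field(default=None)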
We can create a Book object, confident that we're setting the correct fields.
book = Book(title='Moby Dick', author='Herman Melville') # WORKS
book = Book(title='Moby Dick', bad_field_name='Herman Melville') # DOESN'T - raises TypeError
Saving
So far so good. Let's add some saving code. We don't want to have to implement that individually for each object, so let's make all our entity objects inherit from a class called Entity, and add some save code to that.
To do that, we just need to declare the class, and add a parameter to the make_dataclass call from earlier:
return make_dataclass(classname, columns, bases=(Entity,))
We'd like to be able to save an entity like this:
book.save(db)
Which we can implement as follows, taking advantage of the convention that the primary key of the table is the table name suffixed with _id.
from dataclasses import asdict

class Entity:
    def save(self, db):
        table_name = type(self).__name__.lower()
        id_field = f'{table_name}_id'
        fields = asdict(self)
        entity_id = fields[id_field]
        if entity_id is None:
            # not saved yet: plain insert, let the database generate the id
            upsert_on = None
            fields.pop(id_field)
        else:
            # already has a primary key: upsert on it
            upsert_on = [id_field]
        inserted = db.insert(table_name, fields, upsert_on=upsert_on)
        if entity_id is None:
            # copy the generated primary key back onto the object
            inserted_id = getattr(inserted.one(), id_field)
            setattr(self, id_field, inserted_id)
This uses PostgreSQL's upsert (insert ... on conflict) to either insert a new entity (if the primary key is not set, we assume it hasn't been saved yet), or update an existing entity using its primary key.
dataclasses help us here again, with their inbuilt asdict function for returning an object's fields as a dictionary.
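For example - a quick sketch using the Book class and db connection from above - the first save inserts the row and fills in the generated primary key, and a second save updates that same row:

book = Book(title='Moby Dick', author='Herman Melville')
book.save(db)        # book_id is None, so this inserts a new row
print(book.book_id)  # now populated with the database-generated uuid

book.title = 'Moby-Dick; or, The Whale'
book.save(db)        # book_id is set, so this upserts on the primary key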
That takes care of saving.
Querying
For the final piece, we need a way to do the most complicated part: querying. Our preference here is to be able to write our selects in vanilla SQL, for maximum control and directness, and minimal magic.
For instance we might wish to write a query like this:
select * from book where ...
and get the corresponding list of Book objects back.
We will need some code that enables this from Python, something like:
books = query_to_objects(db, 'select * from book', Book)
All that remains is to implement this query_to_objects function, which we can do straightforwardly as follows:
from dataclasses import fields as dataclass_fields

def query_to_objects(db, query, _class):
    rows = db.ss(query)
    fields = [_.name for _ in dataclass_fields(_class)]
    return [
        _class(**{f: row[f] for f in fields})
        for row in rows
    ]
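Assuming the book we saved earlier, a quick round trip might look something like this:

melville_books = query_to_objects(
    db,
    "select * from book where author = 'Herman Melville'",
    Book,
)
for b in melville_books:
    print(b.book_id, b.title)  # plain Book dataclass instances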
Joins and n+1
But of course, we typically want joined data too.
A common problem with ORMs is the N+1 query problem, which we obviously want to avoid by getting all the data we want in a single query.
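To make that concrete, here's the sort of thing we don't want to end up writing - one query for the issues, then an extra query per issue (a sketch; interpolating values into SQL like this is also unsafe in real code):

issues = query_to_objects(db, 'select * from issue', Issue)
for issue in issues:
    # one extra round trip per issue - these are the "+1" queries
    book = query_to_objects(
        db, f"select * from book where book_id = '{issue.book_id}'", Book
    )[0]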
Take this query for example. We want all the overdue books in the library (the where clause can be whatever here; let's assume all books get issued for a month).
select
    *
from
    issue i
    join book using (book_id)
    join borrower using (borrower_id)
where
    i.issued_at < now() - '1 month'::interval
We need to get back the resulting Issue objects, plus the linked Book and Borrower objects for each.
That means we need to extend our loading method a bit.
We need to not only load the linked objects, but also correctly set up the links between them based on their rows. These relationships could be one-to-one or one-to-many, so we need to handle either situation.
We'd want to call it something like this (with OVERDUE_BOOKS holding the SQL above):
issues = query_to_objects(
    db,
    OVERDUE_BOOKS,
    _class=Issue,
    linked_classes=[Book, Borrower]
)
This would return a list of Issue objects, with the linked Book and Borrower objects attached to each.
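So after the call, we'd be able to walk the results directly:

for issue in issues:
    # .book and .borrower are attached by the loader below
    print(issue.book.title, issue.borrower.name, issue.issued_at)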
A rough implementation is below. Queries involving to-many relationships will return multiple rows for the initial entity, so there is some grouping involved.
def query_to_objects(db, query, _class, linked_classes=None):
    linked_classes = linked_classes or []
    rows = db.ss(query)
    class_name = _class.__name__
    fields = [_.name for _ in dataclass_fields(_class)]
    # group the rows by the initial entity's fields, so each entity
    # appears once even if it spans multiple joined rows
    grouped = rows.grouped_by(columns=fields).values()
    objects_with_rows = [
        (_class(**{f: g[0][f] for f in fields}), g)
        for g in grouped
    ]
    for obj, obj_rows in objects_with_rows:
        for linked_class in linked_classes:
            linked_fields = [_.name for _ in dataclass_fields(linked_class)]
            table_name = linked_class.__name__.lower()
            id_field = f'{table_name}_id'
            linked_objects = [
                linked_class(**{k: r[k] for k in linked_fields})
                for r in obj_rows
            ]
            # backlink from each linked object to the initial entity
            for o in linked_objects:
                setattr(o, class_name, obj)
            if hasattr(obj, id_field):  # one-to-one
                setattr(obj, table_name, linked_objects[0])
            elif id_field in linked_fields:  # one-to-many
                setattr(obj, f'{table_name}s', linked_objects)
    return [_[0] for _ in objects_with_rows]
This has obvious limitations - in particular, it won't handle longer chains of joins without further enhancement. But for more complex joins, you might wish to handle those with a view anyway.
Views and other non-table selectables often fit awkwardly into the conceptual model of full-fledged ORMs, but can be integrated quite easily into code like this (you can inspect views just like tables, and generate classes similarly).
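For instance, given a hypothetical overdue_loan view wrapping the join above (and assuming schemainspect exposes views via i.views, keyed the same way as tables), we could generate and query a class for it much as before:

# hypothetical: an overdue_loan view defined in schema.sql alongside the tables
it = i.views['"public"."overdue_loan"']
OverdueLoan = make_dataclass(
    'OverdueLoan',
    [(c.name, c.pytype, dcfield(default=None)) for c in it.columns],
)
overdue = query_to_objects(db, 'select * from overdue_loan', OverdueLoan)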
Fully-fledged ORMs have quite sophisticated change tracking/saving features (for instance, sqlalchemy's Unit of Work). Your own database integration code won't give you this - again, going ORMless means making the trade-off between controlled explicitness and magical implicitness.
Next steps
This is an example of a general approach, not something you should actually use as-is. If you'd like to play around with the code, take a look at the gist.
And let me know what you think via the links at the bottom of the page.
If you enjoyed this, you might also enjoy my talk on database migrations: Your migrations are bad, and you should feel bad