Code and slides from Jayson Falkner's "Efficient Django QuerySet Use".
This talk is about advanced use of Django's QuerySet API, based on lessons learned at Counsyl. You'll learn how to debug, optimize and use the QuerySet API efficiently. The main examples focus on demystifying how a QuerySet maps to SQL and how seemingly simple Python code, presumed to run as a fast O(1) query, can result in unacceptably long runtimes from O(N) or worse queries. Solutions are provided for keeping your Python code simple and your RDBMS use performant.
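As a preview of the problem, here is the shape of the classic one-query-per-row trap using the demo app's models (the full, timed version appears in the examples below): the loop looks like plain Python, but every iteration silently issues another SQL query.

```
from example.models import Sample

samples = Sample.objects.filter(production=True)  # one query...
for sample in samples:
    # ...plus one more query per row to look up the latest status,
    # so N rows cost N+1 round trips to the database.
    print sample.status().created
```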
Here is one of the key slides to pique your interest. It is the same query tuned via different strategies to get optimal performance.
Jayson Falkner is a software developer at Counsyl. He has an EE/CS degree and a PhD in Bioinformatics. He is a long-time open-source Linux/Java/Web developer, book author, and presenter. He has focused on proteomics and genomics research since the mid-2000s, including working for a few years at startups and for Dow AgroSciences. Jayson is quick to get his geek on and also loves to build ridiculous side projects, including a 6-foot-tall Jenga set that weighs 300 lbs and mobile apps that give you awesome virtual mustaches.
This material comes from work experience at Counsyl and has been presented at the following events:
- Oct 10th, 2013 - The San Francisco Django Meetup Group @ Counsyl
- Oct 23rd, 2013 - Petri Dish (Counsyl Blog). Includes slideshare + audio.
- Nov 12th, 2013 - Portland Python Users Group @ Urban Airship
If you think this talk would be good for your user group or conference, contact jayson@counsyl.com.
You are encouraged to take this material and reuse it according to the included MIT license. In short, you can use it however you like, but the work is provided AS IS and you must include the license to credit the original authors and Counsyl.
Thanks to Counsyl co-workers Kevin Haas, for the original authorship of the `pg_bulk_update()` code, and Constantin Berzan, for guidance regarding bulk update strategies using Postgres. Thanks to Justine Lam, Imran Haque, and Dave Peticolas for discussion and feedback on the presentation, slides and related code.
Below are the scripts to run the examples from the slides. All scripts assume:

- You are running Python 2.7.2 (newer versions may work fine too)
- You have Django 1.5 installed
- You are in the base directory (the one with this file)
- You have run `pip install sqlparse`
This set demonstrates the often intuitive yet worst possible QuerySet use. The optimization of using `prefetch_related()` yields a 10x speed increase. Examples are done with 1,000 rows in the `Sample` table.
These examples require the `demo` Django app included in this repository. From the base directory that this README.md file is in, do the following.
```
# Setup the demo's data models.
cd demo
psql -c "CREATE DATABASE demo;"
python manage.py syncdb

# Run the demo.
python manage.py shell
```
The rest of the commands can be copy-and-pasted into Django's shell. For the `track_sql()` and `print_sql()` timing metrics to be helpful, you must paste each complete `%cpaste` through `--` section all at once.
```
# Make up 1,000 samples to get sub-second queries for looping.
from example.utils import make_fake_data
make_fake_data()

# Import models.
from example.models import Sample, SampleStatus
import counsyl.db
from django.db.models import Max

# Run the worst performance, looping query.
%cpaste
counsyl.db.track_sql()
results = []
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
for sample in samples:
    results.append((sample.barcode, sample.status().created))
counsyl.db.print_sql()
counsyl.db.print_sql(show_queries=False)
--
```
```
# Run the prefetch_related(), looping query.
%cpaste
counsyl.db.track_sql()
results = []
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
samples = samples.prefetch_related('statuses')
for sample in samples:
    results.append((sample.barcode, sample.status().created))
counsyl.db.print_sql()
counsyl.db.print_sql(show_queries=False)
--
```
This example set relies on a SQL JOIN to perform the original query 20-50x more efficiently. Examples use 10,000 rows in the `Sample` table.
You do not need to reset the `demo` Django app for these examples. The same code base is used for the JOIN examples.
```
# Make up 10,000 samples to get second(ish) queries for JOIN use.
from example.utils import make_fake_data
make_fake_data(samples_to_make=10000)

# Run the JOIN without select_related().
%cpaste
counsyl.db.track_sql()
vals = list(Sample.objects
            .annotate(latest_status_code=Max('statuses__status_code'))
            .filter(production=True,
                    latest_status_code=SampleStatus.LAB)
            .values_list('barcode', 'statuses__created'))
counsyl.db.print_sql()
--
```
```
# Run the select_related(), looping query. Denormalize fail.
%cpaste
counsyl.db.track_sql()
results = []
samples = Sample.objects.filter(
    production=True,
    statuses__status_code=SampleStatus.LAB)
samples = samples.select_related('latest_status')
for sample in samples:
    results.append((sample.barcode, sample.latest_status.created))
counsyl.db.print_sql()
--
```
This set shows the optimal known use of QuerySet with Postgres, including denormalizing the fields used for the query and adding a multicolumn index. This is roughly a 100,000x improvement over the original query and 1,000x over the SQL JOIN based strategies.
You will have to reset `demo`, switch to the `demo-optimized` Django app and `syncdb` in order to run these examples. The denormalized data model is slightly different than that of the original examples.
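For reference, here is a minimal sketch of what the denormalized `Sample` model in `demo-optimized` plausibly looks like, inferred from the queries below and the `psql` schema dump at the end of this README (the actual model in the repository is authoritative):

```
from django.db import models


class Sample(models.Model):
    barcode = models.CharField(max_length=10, unique=True)
    production = models.BooleanField()
    created = models.DateTimeField()
    # Denormalized copies of the latest status, so queries never
    # need to JOIN against the SampleStatus table. The multicolumn
    # index is added later via pg_multicolumn_index().
    status_code = models.PositiveSmallIntegerField()
    status_created = models.DateTimeField()
```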
```
# Reset the data models and load a denormalized view.
cd demo-optimized
psql -c "DROP DATABASE demo;"
psql -c "CREATE DATABASE demo;"
python manage.py syncdb

# Launch the shell.
python manage.py shell
```
```
# Make up 1,000,000 fake samples. Won't take long.
from example.utils import make_fake_data
make_fake_data(samples_to_make=1000000, batch_threshold=100000, make_statuses=False)

# Import models.
from example.models import Sample, SampleStatus
import counsyl.db
```
```
# Query without a multicolumn index.
%cpaste
counsyl.db.track_sql()
vals = list(Sample.objects
            .filter(production=True,
                    status_code=SampleStatus.LAB)
            .values_list('barcode', 'status_created'))
counsyl.db.print_sql()
--
```
```
# Create the multicolumn index.
from counsyl.db import pg_multicolumn_index
pg_multicolumn_index(Sample, ['production', 'status_code'])
```
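If you are curious, a plain `CREATE INDEX` over the model's table is the essence of what such a helper has to do. The sketch below is an assumption for illustration, not the actual `counsyl.db` implementation:

```
from django.db import connection


def multicolumn_index(model, fields):
    # Hypothetical stand-in for counsyl.db.pg_multicolumn_index():
    # issue a CREATE INDEX spanning the given columns so Postgres
    # can satisfy the filter from a single btree.
    table = model._meta.db_table
    index_name = '%s_%s_idx' % (table, '_'.join(fields))
    cursor = connection.cursor()
    cursor.execute('CREATE INDEX "%s" ON "%s" (%s)'
                   % (index_name, table, ', '.join(fields)))
```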
```
# Re-do the same query above, but this time the index is available.
%cpaste
counsyl.db.track_sql()
vals = list(Sample.objects
            .filter(production=True,
                    status_code=SampleStatus.LAB)
            .values_list('barcode', 'status_created'))
counsyl.db.print_sql()
--
```
```
# Make 100,000,000 rows. This takes a while!
from example.utils import make_fake_data
make_fake_data(samples_to_make=100000000, batch_threshold=1000000, make_statuses=False)

# Optimal query including multicolumn index. 100,000,000 rows in sub-second time!
%cpaste
counsyl.db.track_sql()
vals = list(Sample.objects
            .filter(production=True,
                    status_code=SampleStatus.LAB)
            .values_list('barcode', 'status_created'))
counsyl.db.print_sql()
--
```
Django's `index_together` option was added in 1.5. If you are running an older version of Django, you can still manually add a SQL index and obtain the same performance boost by using the `pg_multicolumn_index` method.
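On Django 1.5+, the declarative equivalent lives on the model itself. A minimal sketch, assuming the denormalized fields from the `demo-optimized` app:

```
from django.db import models


class Sample(models.Model):
    production = models.BooleanField()
    status_code = models.PositiveSmallIntegerField()
    # ... remaining fields omitted for brevity ...

    class Meta:
        # syncdb creates the same (production, status_code) multicolumn
        # btree index that pg_multicolumn_index adds by hand.
        index_together = [('production', 'status_code')]
```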
For example, to replicate the index on this presentation's `Sample` class manually:
```
# Launch a shell.
./manage.py shell

# Add the multicolumn index using the helper method.
from example.models import Sample
from django_pg_utils import pg_multicolumn_index
pg_multicolumn_index(Sample, ['production', 'status_code'])
```
You can confirm that the index exists by using `psql` (if using Postgres) or an equivalent tool to browse the schema for the `example_sample` table.
```
# Launch psql to directly interact with Postgres.
psql -h 127.0.0.1 demo

# Display the table and constraints.
demo=# \d example_sample
                           Table "public.example_sample"
     Column     |           Type           |                          Modifiers
----------------+--------------------------+-------------------------------------------------------------
 id             | integer                  | not null default nextval('example_sample_id_seq'::regclass)
 barcode        | character varying(10)    | not null
 production     | boolean                  | not null
 created        | timestamp with time zone | not null
 status_code    | smallint                 | not null
 status_created | timestamp with time zone | not null
Indexes:
    "example_sample_pkey" PRIMARY KEY, btree (id)
    "example_sample_barcode_key" UNIQUE CONSTRAINT, btree (barcode)
    "example_sample_6d7e6b4a" btree (production, status_code)
    "example_sample_barcode_like" btree (barcode varchar_pattern_ops)
Check constraints:
    "example_sample_status_code_check" CHECK (status_code >= 0)
Referenced by:
    TABLE "example_samplestatus" CONSTRAINT "example_samplestatus_sample_id_fkey" FOREIGN KEY (sample_id) REFERENCES example_sample(id) DEFERRABLE INITIALLY DEFERRED
```
The specific index you are looking for is `"example_sample_6d7e6b4a" btree (production, status_code)`. The particular name doesn't matter as long as `(production, status_code)` are grouped.
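You can also check from the Django shell without leaving Python. A small sketch that queries Postgres's standard `pg_indexes` view (not part of this repository's helpers):

```
from django.db import connection

# List every index on the table; look for one whose definition
# covers btree (production, status_code).
cursor = connection.cursor()
cursor.execute("SELECT indexname, indexdef FROM pg_indexes "
               "WHERE tablename = 'example_sample'")
for name, definition in cursor.fetchall():
    print name, definition
```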