Zephyrnet Logo

How To Speed Up SQL Queries Using Indexes [Python Edition] – KDnuggets

Date:

How To Speed Up SQL Queries Using Indexes [Python Edition]
Image by Author
 

Suppose you’re sifting through the pages of a book. And you want to find the information that you’re looking for much faster. How’d you do that? Well, you’d probably look up the index of terminologies and then jump to the pages that reference a particular term. Indexes in SQL work similarly to the indexes in books.

In most real-world systems, you’ll run queries against a database table with a large number of rows (think millions of rows). Queries that require a full-table scan through all the rows to retrieve the results will be quite slow. If you know that you’ll have to query information based on some of the columns often, you can create database indexes on those columns. This will speed up the query significantly.

So what’ll we learn today? We’ll learn to connect to and query a SQLite database in Python—using the sqlite3 module. We’ll also learn how to add indexes and see how it improves performance.

To code along to this tutorial, you should have Python 3.7+ and SQLite installed in your working environment.

Note: The examples and sample output in this tutorial are for Python 3.10 and SQLite3 (version 3.37.2) on Ubuntu LTS 22.04.

We’ll use the built-in sqlite3 module. Before we start running queries, we need to:

  • connect to the database 
  • create a database cursor to run queries

To connect to the database, we’ll use the

connect() function from sqlite3 module. Once we have established a connection, we can call cursor() on the connection object to create a database cursor as shown:

import sqlite3 # connect to the db
db_conn = sqlite3.connect('people_db.db')
db_cursor = db_conn.cursor()

 

Here we try to connect to the database

people_db. If the database doesn’t exist, running the above snippet will create the sqlite database for us.

Now, we’ll create a table in the database and populate it with records.

Let’s create a table named people in the people_db database with the following fields:

  • name
  • email
  • job
# main.py
...
# create table
db_cursor.execute('''CREATE TABLE people ( id INTEGER PRIMARY KEY, name TEXT, email TEXT, job TEXT)''') ... # commit the transaction and close the cursor and db connection
db_conn.commit()
db_cursor.close()
db_conn.close()

Synthetic Data Generation with Faker

We now have to insert records into the table. To do this, we’ll use the Faker—a Python package for synthetic data generation—installable through pip:

$ pip install faker

 

After installing faker, you can import the Faker class into the Python script:

# main.py
...
from faker import Faker
...

 

The next step is to generate and insert records into the people table. Just so we know how indexes can speed up queries, let’s insert a large number of records. Here, we’ll insert 100K records; set the num_records variable to 100000.

Then, we do the following:

  • Instantiate a Faker object fake and set the seed so we get reproducibility. 
  • Get a name string using first and last names—by calling first_name() and last_name() on the fake object.
  • Generate a fake domain by calling domain_name().
  • Use the first and last names and the domain to generate the email field.
  • Get a job for each individual record using job().

We generate and insert records into the people table:

# create and insert records
fake = Faker() # be sure to import: from faker import Faker Faker.seed(42) num_records = 100000 for _ in range(num_records): first = fake.first_name() last = fake.last_name() name = f"{first} {last}" domain = fake.domain_name() email = f"{first}.{last}@{domain}" job = fake.job() db_cursor.execute('INSERT INTO people (name, email, job) VALUES (?,?,?)', (name,email,job)) # commit the transaction and close the cursor and db connection
db_conn.commit()
db_cursor.close()
db_conn.close()

 

Now the main.py file has the following code:

# main.py
# imports
import sqlite3
from faker import Faker # connect to the db
db_conn = sqlite3.connect('people_db.db')
db_cursor = db_conn.cursor() # create table
db_cursor.execute('''CREATE TABLE people ( id INTEGER PRIMARY KEY, name TEXT, email TEXT, job TEXT)''') # create and insert records
fake = Faker()
Faker.seed(42) num_records = 100000 for _ in range(num_records): first = fake.first_name() last = fake.last_name() name = f"{first} {last}" domain = fake.domain_name() email = f"{first}.{last}@{domain}" job = fake.job() db_cursor.execute('INSERT INTO people (name, email, job) VALUES (?,?,?)', (name,email,job)) # commit the transaction and close the cursor and db connection
db_conn.commit()
db_cursor.close()
db_conn.close()

 

Run this script—once—to populate the table with num_records number of records.

Now that we have the table with 100K records, let’s run a sample query on the people table. 

Let’s run a query to:

  • get the names and emails of the records where the job title ‘Product manager’, and
  • limit the query results to 10 records.

We’ll use the default timer from the time module to get the approximate execution time for the query.

# sample_query.py import sqlite3
import time db_conn = sqlite3.connect("people_db.db")
db_cursor = db_conn.cursor() t1 = time.perf_counter_ns() db_cursor.execute("SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;") res = db_cursor.fetchall()
t2 = time.perf_counter_ns() print(res)
print(f"Query time without index: {(t2-t1)/1000} us")

 

Here’s the output:

Output >>
[ ("Tina Woods", "Tina.Woods@smith.com"), ("Toni Jackson", "Toni.Jackson@underwood.com"), ("Lisa Miller", "Lisa.Miller@solis-west.info"), ("Katherine Guerrero", "Katherine.Guerrero@schmidt-price.org"), ("Michelle Lane", "Michelle.Lane@carr-hardy.com"), ("Jane Johnson", "Jane.Johnson@graham.com"), ("Matthew Odom", "Matthew.Odom@willis.biz"), ("Isaac Daniel", "Isaac.Daniel@peck.com"), ("Jay Byrd", "Jay.Byrd@bailey.info"), ("Thomas Kirby", "Thomas.Kirby@west.com"),
] Query time without index: 448.275 us

 

You can also invoke the SQLite command-line client by running sqlite3 db_name at the command line:

$ sqlite3 people_db.db
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.

 

To get the list of indexes, you can run .index:

sqlite> .index

 

As there are no indexes currently, no index will be listed.

You can also check the query plan like so:

sqlite> EXPLAIN QUERY PLAN SELECT name, email FROM people WHERE job='Product Manager' LIMIT 10;
QUERY PLAN
`--SCAN people

 

Here the query plan is to scan all the rows which is inefficient.

To create a database index on a particular column you can use the syntax:

CREATE INDEX index-name on table (column(s))

 

Say we need to frequently look up the records of individuals with a particular job title. It’d help to create an index people_job_index on the job column:

# create_index.py import time
import sqlite3 db_conn = sqlite3.connect('people_db.db') db_cursor =db_conn.cursor() t1 = time.perf_counter_ns() db_cursor.execute("CREATE INDEX people_job_index ON people (job)") t2 = time.perf_counter_ns() db_conn.commit() print(f"Time to create index: {(t2 - t1)/1000} us") Output >>
Time to create index: 338298.6 us

 

Even though creating the index takes this long, it’s a one-time operation. You will still get substantial speed-up when running multiple queries.

Now if you run .index at the SQLite command-line client, you’ll get:

sqlite> .index
people_job_index

If you now look at the query plan, you should be able to see that we now search people table using index people_job_index on the job column:

sqlite> EXPLAIN QUERY PLAN SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;
QUERY PLAN
`--SEARCH people USING INDEX people_job_index (job=?)

 

You can re-run sample_query.py. Only modify the print() statement and see how long it takes for the query to run now:

# sample_query.py import sqlite3
import time db_conn = sqlite3.connect("people_db.db")
db_cursor = db_conn.cursor() t1 = time.perf_counter_ns() db_cursor.execute("SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;") res = db_cursor.fetchall()
t2 = time.perf_counter_ns() print(res)
print(f"Query time with index: {(t2-t1)/1000} us")

 

Here’s the output:

Output >>
[ ("Tina Woods", "Tina.Woods@smith.com"), ("Toni Jackson", "Toni.Jackson@underwood.com"), ("Lisa Miller", "Lisa.Miller@solis-west.info"), ("Katherine Guerrero", "Katherine.Guerrero@schmidt-price.org"), ("Michelle Lane", "Michelle.Lane@carr-hardy.com"), ("Jane Johnson", "Jane.Johnson@graham.com"), ("Matthew Odom", "Matthew.Odom@willis.biz"), ("Isaac Daniel", "Isaac.Daniel@peck.com"), ("Jay Byrd", "Jay.Byrd@bailey.info"), ("Thomas Kirby", "Thomas.Kirby@west.com"),
] Query time with index: 167.179 us

 

We see that the query now takes about 167.179 microseconds to execute.

Performance Improvement

For our sample query, querying with index is about 2.68 times faster. And we get a percentage speedup of 62.71% in execution times. 

You can also try running a few more queries: queries that involve filtering on the job column and see the performance improvement. 

Also note: Because we have created an index only on the job column, if you are running queries that involve other columns, the queries will not run any faster than without index.

I hope this guide helped you understand how creating database indexes—on frequently queried columns—can significantly speed up queries. This is an introduction to database indexes. You can also create multi-column indexes, multiple indexes for the same column, and much more. 

You can find all the code used in this tutorial in this GitHub repository. Happy coding!
 
 
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.
 

spot_img

Latest Intelligence

spot_img