Uendi Hoxha

SQL Query Optimization for Data Engineers

Moving data efficiently can make the difference between a smooth system and a frustratingly slow one. Optimizing your SQL queries not only speeds up your jobs, but also reduces cloud costs and improves system scalability.

In this post, I'll share 7 practical SQL optimization tips you can apply immediately, with real-world examples.

I. Always SELECT Only the Columns You Need
It’s easy to get lazy and use SELECT *, especially when you're exploring data.
However, pulling all columns increases the amount of data transferred across the network and the memory needed to process it. On wide tables, this can severely impact performance.

Bad example:

SELECT * FROM orders;

Better:

SELECT order_id, order_date, total_amount FROM orders;

II. Use Proper Indexes
Indexes are critical for query performance, especially when filtering (WHERE), joining (JOIN), or sorting (ORDER BY).

If your query frequently filters on a column, it’s a strong candidate for indexing.

Example:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Pro tip: Always check your queries with EXPLAIN to verify whether your indexes are actually being used. A missing or unused index can make queries 10x slower.
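For queries that filter on more than one column, a composite (multi-column) index can serve both predicates at once. A sketch, assuming the `orders` table has `customer_id` and `order_date` columns:

```sql
-- Hypothetical composite index: usable for filters on customer_id alone,
-- or on customer_id and order_date together (leftmost-prefix rule)
CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, order_date);

-- Verify the planner actually uses it
EXPLAIN
SELECT order_id
FROM orders
WHERE customer_id = 123
  AND order_date > '2025-01-01';
```

Put the most selective, most frequently filtered column first: the index can only be used for prefixes of its column list.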

III. Avoid Unnecessary JOINs
JOINs are powerful — but they can be costly, especially across large tables.
If you're joining tables just to retrieve a field you don't actually use, or if the JOIN isn't adding value to your result set, rethink the query.

Best practices:

  • Fetch only what you truly need
  • Consider denormalization if two tables are always accessed together
  • Use INNER JOIN instead of LEFT JOIN when you don't need unmatched rows

Example:

Instead of this:

SELECT o.order_id, c.customer_name
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id;

If you know every order has a customer, prefer:

SELECT o.order_id, c.customer_name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
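And when a JOIN exists only to check that a related row is present, with no columns from the second table in the output, an EXISTS subquery is often cheaper: the engine can treat it as a semi-join and stop at the first match. A sketch against the same tables:

```sql
-- Orders whose customer row exists; no customer columns are returned,
-- so a semi-join (EXISTS) avoids materializing the full join
SELECT o.order_id
FROM orders o
WHERE EXISTS (
    SELECT 1
    FROM customers c
    WHERE c.customer_id = o.customer_id
);
```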

IV. Filter Early With WHERE Clauses
Always narrow down your data as early as possible.

The earlier you apply your WHERE filters, the less data the database engine needs to process — making the query faster and lighter.

Example:

SELECT customer_id, order_id
FROM orders
WHERE order_date > '2025-01-01';

Filtering after joining or fetching lots of rows will cause unnecessary load. Make filtering a priority.
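The same principle applies to JOINs: shrinking each side before joining keeps the intermediate result small. Most planners push WHERE clauses down automatically, but writing the filter inside a subquery makes the intent explicit. A sketch, reusing the `orders` and `customers` tables from above:

```sql
-- Reduce orders to the date range first, then join the smaller set
SELECT o.order_id, c.customer_name
FROM (
    SELECT order_id, customer_id
    FROM orders
    WHERE order_date > '2025-01-01'
) AS o
INNER JOIN customers c ON o.customer_id = c.customer_id;
```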

V. Limit Result Sets When Exploring
When you're writing queries to explore data or debug issues, always add a LIMIT to avoid pulling millions of rows by accident.

Example:

SELECT * FROM orders
WHERE total_amount > 1000
LIMIT 100;

This tiny habit prevents unnecessary load on your database and keeps you from crashing your local environment.
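One caveat: without an ORDER BY, the rows a LIMIT returns are not guaranteed to be the same from run to run. When you need a repeatable sample while debugging, pin the order:

```sql
-- Deterministic sample: the same 100 rows on every run
SELECT order_id, total_amount
FROM orders
WHERE total_amount > 1000
ORDER BY order_id
LIMIT 100;
```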

VI. Analyze Execution Plans (EXPLAIN)
Want to know why a query is slow?
Use your database’s execution plan tools.

In PostgreSQL and MySQL, running EXPLAIN shows how the database will execute your query, including whether it will do a sequential scan (slow) or an index scan (fast).

Example:

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 123;

Look out for:

  • Seq Scan → sequentially scanning the whole table (bad for large tables)
  • Index Scan → using indexes efficiently (good)
  • High-cost operations like sorts, nested loops, or large hash joins

Learning to read execution plans is one of the best investments you can make as a data engineer.
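In PostgreSQL, EXPLAIN also accepts options that add runtime detail. A sketch (note that ANALYZE actually executes the statement, so use it carefully on writes):

```sql
-- ANALYZE executes the query and reports actual row counts and timings;
-- BUFFERS adds shared-buffer hit/read statistics
EXPLAIN (ANALYZE, BUFFERS)
SELECT order_id, total_amount
FROM orders
WHERE customer_id = 123;
```

Comparing estimated versus actual row counts in the output is a quick way to spot stale statistics.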

VII. Batch Large Updates and Inserts
Updating or inserting millions of rows at once can lock tables and overwhelm resources.
Instead, break large operations into smaller batches.

Example:

Instead of:

INSERT INTO large_table
SELECT * FROM very_large_temp_table;

Use a batching strategy:

INSERT INTO large_table
SELECT * FROM very_large_temp_table
WHERE id BETWEEN 1 AND 10000;

-- Repeat with next batch

This keeps locks short, memory usage reasonable, and reduces the risk of timeouts.
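Rather than repeating the statement by hand, the batch loop can be scripted. A PL/pgSQL sketch, assuming PostgreSQL and a numeric `id` column on `very_large_temp_table`; note that COMMIT inside a DO block requires PostgreSQL 11 or later, and without it the whole loop runs as one transaction:

```sql
DO $$
DECLARE
    batch_start bigint := 1;
    batch_size  bigint := 10000;
    max_id      bigint;
BEGIN
    SELECT max(id) INTO max_id FROM very_large_temp_table;
    WHILE batch_start <= max_id LOOP
        INSERT INTO large_table
        SELECT * FROM very_large_temp_table
        WHERE id BETWEEN batch_start AND batch_start + batch_size - 1;
        COMMIT;  -- release locks between batches (PostgreSQL 11+)
        batch_start := batch_start + batch_size;
    END LOOP;
END $$;
```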
