<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Lydia</title>
    <description>The latest articles on Forem by Lydia (@lydzb123).</description>
    <link>https://forem.com/lydzb123</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F464792%2F9c2258a5-b760-48cc-9ede-43e5637dfe75.png</url>
      <title>Forem: Lydia</title>
      <link>https://forem.com/lydzb123</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/lydzb123"/>
    <language>en</language>
    <item>
      <title>Optimizing your Postgres queries and server for heavy traffic</title>
      <dc:creator>Lydia</dc:creator>
      <pubDate>Thu, 31 Dec 2020 09:49:03 +0000</pubDate>
      <link>https://forem.com/lydzb123/optimizing-your-postgres-queries-and-server-for-heavy-traffic-1ok1</link>
      <guid>https://forem.com/lydzb123/optimizing-your-postgres-queries-and-server-for-heavy-traffic-1ok1</guid>
      <description>&lt;h2&gt;
  
  
  Let's optimize your psql queries!
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Indexing:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;PostgreSQL automatically creates an index for every primary key, but are your foreign keys automatically indexed? Nope! &lt;strong&gt;You will need to explicitly create these indexes on any foreign key columns you query against.&lt;/strong&gt; This is particularly useful when your queries use inner joins to reference relationships among your tables.&lt;/p&gt;

&lt;p&gt;By default, Postgres uses &lt;strong&gt;B-Tree indexing&lt;/strong&gt; (a balanced-tree structure), which is great for replacing sequential scans. You will most likely see a dramatic drop in execution times.&lt;/p&gt;

&lt;p&gt;When dealing with a smaller count of records, or rows, sequential lookups may be fine. But imagine a table of 10 million records: a sequential lookup for record number 99999 will take much longer than searching for record number 1. Keep in mind that B-Tree indexes are heavy on memory usage, so avoid unnecessary indexes and index only the columns you actually need.&lt;/p&gt;

&lt;p&gt;If your data can be organized in ranges, an option to consider is a &lt;strong&gt;BRIN index&lt;/strong&gt; (Block Range Index), which summarizes ranges of physically adjacent rows in your records. By comparison, BRIN indexes require significantly less memory.&lt;/p&gt;
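&lt;p&gt;As a sketch (the &lt;strong&gt;orders&lt;/strong&gt; table, its &lt;strong&gt;customer_id&lt;/strong&gt; foreign key, and its &lt;strong&gt;created_at&lt;/strong&gt; timestamp here are hypothetical names, not from any particular schema), creating both kinds of index might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- B-Tree index on a foreign key column (not created automatically!)
CREATE INDEX idx_orders_customer_id ON orders (customer_id);

-- BRIN index on a column whose values follow the physical row order
CREATE INDEX idx_orders_created_at ON orders USING brin (created_at);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;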

&lt;h3&gt;
  
  
  &lt;strong&gt;Use Analytics and Test Your Queries:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you take a look at your query execution times, which you can do by prefixing your query statement with &lt;strong&gt;EXPLAIN ANALYZE&lt;/strong&gt;, you can easily see the decrease in execution time when proper indexing is in place.&lt;/p&gt;

&lt;p&gt;Try it out! Write a query, and use EXPLAIN ANALYZE to compare planning and execution times for records from the front, middle, and end of your table. How does your query perform?&lt;/p&gt;
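&lt;p&gt;A quick sketch of what that looks like (table and column names are hypothetical): run the same lookup before and after creating an index and compare the plans it prints.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 99999;
-- without an index, the plan shows a Seq Scan on orders
-- with one, it shows an Index Scan using idx_orders_customer_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;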

&lt;p&gt;&lt;strong&gt;pg_stat_statements&lt;/strong&gt; is a built-in Postgres extension that lets you see the mean, min, max, total, and other metrics for your queries. It is an especially powerful and useful tool when executing prepared statements. To make it available, you must add it to &lt;strong&gt;shared_preload_libraries&lt;/strong&gt; in the Postgres config file. Here are some tips to help with the config:&lt;/p&gt;

&lt;p&gt;-To find your config file, type in your terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;psql -U username(usually defaulted to postgres) 
-c 'SHOW config_file'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;-Open the config file (postgresql.conf) and search for “shared_preload_libraries”. Follow the guide below. These changes will require a restart of the Postgres server.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 0
auto_explain.log_analyze = on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
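&lt;p&gt;Once the server has restarted, you can enable the extension in your database and start pulling stats from it. A minimal sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- top 5 queries by total time spent
-- (on Postgres 12 and older these columns are total_time / mean_time)
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;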
&lt;h3&gt;
  
  
  &lt;strong&gt;Optimizing Query Statements:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Avoid unnecessary inner joins. Compare writing your queries with joins and then without. How do the execution times compare?&lt;/p&gt;

&lt;p&gt;Be specific: only select the data that you actually need. The beloved &lt;strong&gt;SELECT *&lt;/strong&gt; is great and all, but if you can be more specific in your request, it'll save you memory in the long run.&lt;/p&gt;
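&lt;p&gt;For instance (the table and column names here are made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- instead of pulling back every column...
SELECT * FROM users WHERE id = 42;

-- ...request only the columns you actually need
SELECT name, email FROM users WHERE id = 42;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;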
&lt;h3&gt;
  
  
  &lt;strong&gt;What's a Query Plan?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A query plan is a sequence of steps, literally A PLAN, made by the database and executed to get your data. A query plan is prepared every single time a new query is performed. When you use EXPLAIN ANALYZE, your query plan will be printed along with planning and execution times.&lt;/p&gt;

&lt;p&gt;The query plan lists how many rows were traversed, what kind of scans your query runs (index or sequential), how many workers were planned and launched, and any loops or nested loops, along with execution times.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Prepared Statements!&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You can create prepared statements so that, after your first query, every subsequent execution spends less time on preparation and just focuses on executing an already existing query plan. &lt;strong&gt;Creating a prepared statement is useful for queries that you know will be repeated.&lt;/strong&gt; Postgres only has to read through the query string and create a query plan on the first execution, with subsequent executions focusing solely on executing, not preparing.&lt;/p&gt;

&lt;p&gt;Here's an example of a prepared statement taken from the PostgreSQL Documentation:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PREPARE fooplan (int, text, bool, numeric) AS
INSERT INTO foo VALUES($1, $2, $3, $4);

EXECUTE fooplan(1, 'Hunter Valley', 't', 200.00);
EXECUTE fooplan(2, 'Death Valley', 'f', 600.00);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In it, a plan called 'fooplan' is created, with parameters confined to the chosen datatypes. A dollar sign plus an integer ($1, $2, and so on) serves as a placeholder for each argument. To use the prepared statement, you write EXECUTE followed by the plan name, passing any arguments in the parens.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Utilize thy CPU.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;How many cores does your machine have? Maybe you are running your database server on a toaster, I mean, a t2.micro instance... Your beloved machine will only have 1 core. If you have more cores, which you probably do haha, you can utilize parallel queries by setting the number of &lt;strong&gt;maximum workers&lt;/strong&gt;. This allows a single query to be split up and worked on simultaneously across different cores. WOOT!&lt;/p&gt;
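&lt;p&gt;These knobs live in postgresql.conf. The numbers below are just an illustration for an 8-core machine, not a recommendation; tune them to your own hardware:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max_worker_processes = 8             # background workers for the whole server
max_parallel_workers = 8             # of those, how many may serve parallel queries
max_parallel_workers_per_gather = 4  # workers a single query node may use
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;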

&lt;h3&gt;
  
  
  &lt;strong&gt;Please manage your connections.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How many concurrent connections to your database server do you have?&lt;/strong&gt; AKA, how many users will be simultaneously making requests which will be making connections to your database server?&lt;/p&gt;

&lt;p&gt;PostgreSQL uses a &lt;strong&gt;max_connections&lt;/strong&gt; setting to limit the number of connections, and the resources consumed by them. The default is typically 100 connections, but you may want to consider increasing this number to allow for more concurrent users. Having too few connections can cause connection errors when traffic is heavy.&lt;/p&gt;
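&lt;p&gt;A quick way to check and change the limit (200 below is just an example value, and raising it costs memory per connection):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- check the current limit from psql
SHOW max_connections;

-- then raise it in postgresql.conf (requires a server restart):
-- max_connections = 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;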

</description>
      <category>postgres</category>
      <category>indexing</category>
      <category>query</category>
      <category>optimizations</category>
    </item>
    <item>
      <title>Scraping images for batch downloads</title>
      <dc:creator>Lydia</dc:creator>
      <pubDate>Thu, 31 Dec 2020 08:49:00 +0000</pubDate>
      <link>https://forem.com/lydzb123/scraping-images-for-batch-downloads-4a9</link>
      <guid>https://forem.com/lydzb123/scraping-images-for-batch-downloads-4a9</guid>
      <description>&lt;p&gt;To use existing image APIs, like &lt;a href="https://loremflickr.com/"&gt;LoremFlickr&lt;/a&gt; or &lt;a href="http://placeimg.com/"&gt;PlaceImg&lt;/a&gt;, you can run this for loop directly in your terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#install wget if you don't already have it
brew install wget 

for i in {1..10} ; do wget 
https://loremflickr.com/g/630/340/dog -O dog-"$i".jpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will go to the URL 10 times, downloading each response as a jpg image and naming the file "dog-" with the index number appended (i.e. dog-1.jpg, dog-2.jpg, etc.).&lt;/p&gt;

&lt;p&gt;Here's another resource, by loadenmb, for bulk downloads using your shell:&lt;br&gt;
&lt;a href="https://gist.github.com/loadenmb/08203f3467965776caea6b44d6f88fe3"&gt;https://gist.github.com/loadenmb/08203f3467965776caea6b44d6f88fe3&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To scrape a whole page, you can go through Chrome DevTools to get a list of all the image URLs. I'm sure there are other ways to go about this, but this is the way I found. It was also quite fun to get more familiar with the DevTools Network panel and its filtering capabilities.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;-Open dev tools&lt;br&gt;
-Select the Network panel&lt;br&gt;
-Click the funnel icon to enable filtering&lt;br&gt;
-Select Img to enable filter for images only&lt;br&gt;
-Click and highlight all the image urls you want&lt;br&gt;
-Copy urls into a text file&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If your txt file is formatted correctly (with a new URL on each line), you can run this wget command to download every URL in the file. FYI, you may need to convert those downloads into JPEGs afterwards, something that I ended up just doing through my Mac Finder.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wget -i yourtextfile.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
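&lt;p&gt;If you'd rather skip Finder, macOS ships with a &lt;strong&gt;sips&lt;/strong&gt; tool that can do that conversion from the terminal. A rough sketch, assuming the downloads came in as PNGs in the current directory:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# convert every .png in this folder to a .jpg with the same basename
for f in *.png ; do
  sips -s format jpeg "$f" --out "${f%.png}.jpg"
done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;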

&lt;p&gt;Hopefully these tips can help someone out there :D Best of luck with your scraping!&lt;/p&gt;

</description>
      <category>scrape</category>
      <category>images</category>
      <category>devtools</category>
      <category>wget</category>
    </item>
  </channel>
</rss>
