Redshift is AWS’ managed data warehouse solution, which is massively parallel, scalable, and columnar. It is very widely used. It was built using ParAccel technology and exposes Postgres-compatible interfaces.
Redshift Alternatives and Lock-in
⛓🚪Whatever data warehouse you select, your business will likely be locked in for a long time. Also (and not coincidentally) the data warehouse market is highly fragmented. Selecting a data warehouse is a choice to be made carefully, with research and awareness of the market landscape and what business intelligence tools you’ll be using.
Although Redshift is mostly Postgres-compatible, its SQL dialect and performance profile are different.
VACUUM regularly following a significant number of deletes or updates to reclaim space and improve query performance.
Avoid performing blanket VACUUM or ANALYZE operations at a cluster level. The checks on each table to determine whether VACUUM or ANALYZE action needs to be taken is wasteful. Only perform ANALYZE and VACUUM commands on the objects that require it. Utilize the Analyze & Vacuum Schema Utility to perform this work. The SQL to determine whether a table needs to be VACUUMed or ANALYZEd can be found in the Schema Utility README if you wish to create your own maintenance process.
Redshift provides various column compression options to optimize the stored data size. AWS strongly encourages users to use automatic compression at the COPY stage, when Redshift uses a sample of the data being ingested to analyze the column compression options. However, automatic compression can only be applied to an empty table with no data. Therefore, make sure the initial load batch is big enough to provide Redshift with a representative sample of the data (the default sample size is 100,000 rows).
Redshift uses columnar storage, hence it does not have indexing capabilities. You can, however, use distribution key and sortkey to improve performance. Redshift has two types of sort keys: compounding sort key and interleaved sort key.
A compound sort key is made up of all columns listed in the sort key definition. It is most useful when you have queries with operations using the prefix of the sortkey.
An interleaved sort key on the other hand gives equal weight to each column or a subset of columns in the sort key. So if you don't know ahead of time which column(s) you want to choose for sorting and filtering, this is a much better choice than the compound key. Here is an example using interleaved sort key.
🔸⏱ Distribution strategies: Since data in Redshift is physically distributed among nodes, choosing the right data distribution key and distribution style is crucial for adequate query performance. There are three possible distribution style settings — EVEN (the default), KEY, or ALL. Use KEY to collocate join key columns for tables which are joined in queries. Use ALL to place the data in small-sized tables on all cluster nodes.
Redshift Gotchas and Limitations
❗⏱While Redshift can handle heavy queries well, it does not scale horizontally, i.e. does not handle multiple queries in parallel. Therefore, if you expect a high parallel load, consider replicating or (if possible) sharding your data across multiple clusters.
🔸 The leader node, which manages communications with client programs and all communication with compute nodes, is the single point of failure.
⏱Although most Redshift queries parallelize well at the compute node level, certain stages are executed on the leader node, which can become the bottleneck.
🔹Redshift data commit transactions are very expensive and serialized at the cluster level. Therefore, consider grouping multiple mutation commands (COPY/INSERT/UPDATE) commands into a single transaction whenever possible.
🔹Redshift does not support multi-AZ deployments. Building multi-AZ clusters is not trivial. Here is an example using Kinesis.
🔸Beware of storing multiple small tables in Redshift. The way Redshift tables are laid out on disk makes it impractical. The minimum space required to store a table (in MB) is nodes slices/node columns. For example, on a 16 node cluster an empty table with 20 columns will occupy 640MB on disk.
⏱ Query performance degrades significantly during data ingestion. WLM (Workload Management) tweaks help to some extent. However, if you need consistent read performance, consider having replica clusters (at the extra cost) and swap them during update.
❗ Never resize a live cluster. The resize operation can take hours depending on the dataset size. In rare cases, the operation may also get stuck and you'll end up having a non-functional cluster. The safer approach is to create a new cluster from a snapshot, resize the new cluster and shut down the old one.
🔸Redshift has reserved keywords that are not present in Postgres (see full list here). Watch out for DELTA (Delta Encodings).
🔸Redshift does not support many Postgres functions, most notably several date/time-related and aggregation functions. See the full list here.
🔸 Uniqueness, primary key, and foreign key constraints on Redshift tables are informational only and are not enforced. They are, however, used by the query optimizer to generate query plans. NOT NULL column constraints are enforced. See here for more information on defining constraints.
🔹 Choosing a sort key is very important since you can not change a table’s sort key after it is created. If you need to change the sort or distribution key of a table, you need to create a new table with the new key and move your data into it with a query like “insert into new_table select * from old_table”.
❗🚪 When moving data with a query that looks like “insert into x select from y”, you need to have twice as much disk space available as table “y” takes up on the cluster’s disks. Redshift first copies the data to disk and then to the new table. Here is a good article on how to this for big tables.