Shared posts

01 Nov 11:49

Domestic na Kanojo - Volume 26, Chapter 249

Group: Yandere Scans - Uploader: Kastos - Language: Portuguese (Br)
26 Oct 14:18

Hong Kong's Oppression Might Be Ours Too

Britain took control of Hong Kong from China by threat of force in 1841. In 1997 the United Kingdom returned control of Hong Kong to China, to be governed independently as "one country, two systems" until 2047.

While there have been protests in previous years that the Chinese government is eroding the "one country, two systems" policy by increasing its control over Hong Kong, this year's protests (summary) have been even more energetic.

While this is a serious issue for Hong Kong and the Chinese government, it also affects others. The Chinese government has pressured Chinese companies to economically punish individuals and companies that are supportive of the Hong Kong protests, even if they are not Chinese nationals or companies. The most dramatic punishment was related to a tweet by an NBA employee supporting the Hong Kong protesters. NBA fans were so upset at China's pressure that they are now expressing support for Hong Kong at basketball games.

Continue Reading »

04 Oct 19:50

Trump's Real Liability Isn't Impeachment: It's China and the Economy

by David P. Goldman
Below is an open letter to Larry Kudlow I posted here on July 19, 2018. I warned that Trump's trade war would backfire. Today we heard that the National Association of Purchasing Managers' industrial production index had fallen to the lowest level since June of 2009. We are in a manufacturing recession, according to the Federal Reserve. Factory output is contracting. Trump won in 2016 by carrying key manufacturing states like Pennsylvania, Ohio, Michigan, and Wisconsin. This blunder could lose him the election. This is MUCH more dangerous than the impeachment masquerade. Americans really don't care about Ukraine but they do care about their jobs. The president is trying to deflect blame onto the Federal Reserve, but he's not fooling anybody.
04 Oct 19:38

PostgreSQL 12 Released!

The PostgreSQL Global Development Group today announced the release of PostgreSQL 12, the latest version of the world's most advanced open source database.

PostgreSQL 12 enhancements include notable improvements to query performance, particularly over larger data sets, and overall space utilization. This release provides application developers with new capabilities such as SQL/JSON path expression support, optimizations for how common table expression (WITH) queries are executed, and generated columns. The PostgreSQL community continues to support the extensibility and robustness of PostgreSQL, with further additions to internationalization, authentication, and providing easier ways to administrate PostgreSQL. This release also introduces the pluggable table storage interface, which allows developers to create their own methods for storing data.

"The development community behind PostgreSQL contributed features for PostgreSQL 12 that offer performance and space management gains that our users can achieve with minimal effort, as well as improvements in enterprise authentication, administration functionality, and SQL/JSON support." said Dave Page, a core team member of the PostgreSQL Global Development Group. "This release continues the trend of making it easier to manage database workloads large and small while building on PostgreSQL's reputation of flexibility, reliability and stability in production environments."

PostgreSQL benefits from over 20 years of open source development and has become the preferred open source relational database for organizations of all sizes. The project continues to receive recognition across the industry, including being featured for the second year in a row as the "DBMS of the Year" in 2018 by DB-Engines and receiving the "Lifetime Achievement" open source award at OSCON 2019.

Overall Performance Improvements

PostgreSQL 12 provides significant performance and maintenance enhancements to its indexing system and to partitioning.

B-tree Indexes, the standard type of indexing in PostgreSQL, have been optimized in PostgreSQL 12 to better handle workloads where the indexes are frequently modified. Using a fair use implementation of the TPC-C benchmark, PostgreSQL 12 demonstrated on average a 40% reduction in space utilization and an overall gain in query performance.

Queries on partitioned tables have also seen demonstrable improvements, particularly for tables with thousands of partitions that only need to retrieve data from a limited subset. PostgreSQL 12 also improves the performance of adding data to partitioned tables with INSERT and COPY, and includes the ability to attach a new partition to a table without blocking queries.

There are additional enhancements to indexing in PostgreSQL 12 that affect overall performance, including lower overhead in write-ahead log generation for the GiST, GIN, and SP-GiST index types, the ability to create covering indexes (the INCLUDE clause) on GiST indexes, the ability to perform K-nearest neighbor queries with the distance operator (<->) using SP-GiST indexes, and CREATE STATISTICS now supporting most-common value (MCV) statistics to help generate better query plans when using columns that are nonuniformly distributed.
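
As a rough sketch of two of these features (the table, column, and statistics names below are hypothetical, with location assumed to be a built-in point column):

-- covering (INCLUDE) index on a GiST index, new in PostgreSQL 12
CREATE INDEX idx_places_location ON places USING gist (location) INCLUDE (name);

-- extended statistics with a most-common value (MCV) list for two correlated columns
CREATE STATISTICS stts_city_state (mcv) ON city, state FROM addresses;
ANALYZE addresses;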

Just-in-time (JIT) compilation using LLVM, introduced in PostgreSQL 11, is now enabled by default. JIT compilation can provide performance benefits to the execution of expressions in WHERE clauses, target lists, aggregates, and some internal operations, and is available if your PostgreSQL installation is compiled or packaged with LLVM.
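
As a quick, illustrative check of whether JIT is available and active on a given installation:

-- jit defaults to on in PostgreSQL 12 (it only takes effect if the server was built with LLVM)
SHOW jit;
SHOW jit_above_cost;   -- planner cost threshold above which JIT compilation is considered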

Enhancements to SQL Conformance & Functionality

PostgreSQL is known for its conformance to the SQL standard - one reason why it was renamed from "POSTGRES" to "PostgreSQL" - and PostgreSQL 12 adds several features to continue its implementation of the SQL standard with enhanced functionality.

PostgreSQL 12 introduces the ability to run queries over JSON documents using JSON path expressions defined in the SQL/JSON standard. Such queries may utilize the existing indexing mechanisms for documents stored in the JSONB format to efficiently retrieve data.
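
A minimal sketch of such a query (the orders table and its JSONB details column are hypothetical):

-- return every item costing more than 100 from each JSONB document
SELECT jsonb_path_query(details, '$.items[*] ? (@.price > 100)')
FROM orders;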

Common table expressions, also known as WITH queries, can now be automatically inlined by PostgreSQL 12, which in turn can help increase the performance of many existing queries. In this release, a WITH query can be inlined if it is not recursive, does not have any side-effects, and is only referenced once in a later part of a query.
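
A short sketch of the new behavior (the orders table is hypothetical); the pre-12 behavior can still be requested with the MATERIALIZED keyword:

-- non-recursive, side-effect-free, referenced once: PostgreSQL 12 inlines this CTE
WITH recent AS (
    SELECT * FROM orders WHERE created_at > now() - interval '7 days'
)
SELECT count(*) FROM recent;

-- force the old behavior (compute the CTE once into a separate result set)
WITH recent AS MATERIALIZED (
    SELECT * FROM orders WHERE created_at > now() - interval '7 days'
)
SELECT count(*) FROM recent;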

PostgreSQL 12 introduces "generated columns." Defined in the SQL standard, this type of column computes its value from the contents of other columns in the same table. In this version, PostgreSQL supports "stored generated columns," where the computed value is stored on the disk.
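
For example, a stored generated column might be declared like this (table and columns are illustrative):

CREATE TABLE people (
    height_cm numeric,
    height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
);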

Internationalization

PostgreSQL 12 extends its support of ICU collations by allowing users to define "nondeterministic collations" that can, for example, allow case-insensitive or accent-insensitive comparisons.
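
As a sketch, assuming PostgreSQL was built with ICU support (the collation and table names are illustrative):

CREATE COLLATION case_insensitive (
    provider = icu,
    locale = 'und-u-ks-level2',   -- ICU level-2 comparison ignores case differences
    deterministic = false
);

CREATE TABLE users (email text COLLATE case_insensitive);
-- 'Alice@Example.COM' and 'alice@example.com' now compare as equal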

Authentication

PostgreSQL expands on its robust authentication method support with several enhancements that provide additional security and functionality. This release introduces both client and server-side encryption for authentication over GSSAPI interfaces, as well as the ability for PostgreSQL to discover LDAP servers if PostgreSQL is compiled with OpenLDAP.

Additionally, PostgreSQL 12 now supports a form of multi-factor authentication. A PostgreSQL server can now require an authenticating client to provide a valid SSL certificate with their username using the clientcert=verify-full option and combine this with the requirement of a separate authentication method (e.g. scram-sha-256).
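
In pg_hba.conf terms, such a combination might look like the following line (the address range and scope are illustrative):

# require both a valid client certificate and a SCRAM password
hostssl  all  all  10.0.0.0/8  scram-sha-256  clientcert=verify-full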

Administration

PostgreSQL 12 introduces the ability to rebuild indexes without blocking writes to an index via the REINDEX CONCURRENTLY command, allowing users to avoid downtime scenarios for lengthy index rebuilds.
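
For example (the index name is hypothetical):

-- rebuild the index without blocking concurrent writes to the table
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;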

Additionally, PostgreSQL 12 can now enable or disable page checksums in an offline cluster using the pg_checksums command. Previously page checksums, a feature to help verify the integrity of data stored to disk, could only be enabled at the time a PostgreSQL cluster was initialized with initdb.

For a full list of features included in this release, please read the release notes, which can be found at: https://www.postgresql.org/docs/12/release-12.html

About PostgreSQL

PostgreSQL is the world's most advanced open source database, with a global community of thousands of users, contributors, companies and organizations. The PostgreSQL Project builds on over 30 years of engineering, starting at the University of California, Berkeley, and has continued with an unmatched pace of development. PostgreSQL's mature feature set not only matches top proprietary database systems, but exceeds them in advanced database features, extensibility, security, and stability.

28 Sep 01:22

No, the President of Poland Didn't Say That Jews Cause Anti-Semitism

by David P. Goldman
In late June of 1941, my father's first cousins Moshe and Dvora fled their house in the tiny town of Kameny Most, about halfway between Slonim and Baranavichy in what is now Belarus. Hitler had launched Operation Barbarossa days before, and their home stood almost on the German-Soviet line that had divided Poland in 1939. Now the Germans had marched into town. The boy and girl, then 15 and 16, reached the woods behind their house before the Germans arrived. Their parents and baby brother fled a minute later, but the Germans already were there. They couldn't reach the woods and hid in the tall grass. A Polish neighbor pointed them out to the Germans, who shot them on the spot. Moshe and Dvora joined the partisan brigade led by the Bielski brothers, made famous in the film Defiance. Not long afterwards they returned at night, barricaded the neighbor in his house, set it afire and burned him alive. By the grace of God the teenagers survived the war and came to the new State of Israel and started large and flourishing families. I had the merit to arrive at Dvora's deathbed in 2004 just in time to receive her blessing upon the American branch of her family.
25 Jul 13:06

Spreadsheets

My brother once asked me if there was a function to produce a calendar grid from a list of dates in Google Sheets. I replied with a single-cell formula that took in a list of dates and outputted a calendar. It used SEQUENCE(), REGEXMATCH(), and a double-nested ARRAYFORMULA(), and it locked up the browser for 15 seconds every time it ran. I think he learned a lot about asking me things.
23 Jul 15:39

Karakai Jouzu no (Moto) Takagi-san - Chapter 95.5

Group: 15avaughn - Uploader: 15avaughn - Language: English
17 Jul 13:17

President Trump Has the Moral High Ground Against the Democrats

by David P. Goldman
I'm tired of hearing conservative friends apologize for President Trump's "go back to where you came from" tweets about the likes of Ilhan Omar. The president has the moral high ground, and the weasel war dance of the mainstream media shouldn't distract us from this fact.
25 Jun 20:41

5Toubun no Hanayome - Chapter 91

Group: /a/nonymous | 5toubun sc/a/ns - Uploader: ThunderCloud - Language: English
19 Jun 12:12

.: The Fire of London (1666) and the Petty France Baptist Church :.

by Pedro Issa

The text says little more than the title, but it is nonetheless extremely interesting.

The Fire of London destroyed so many parish churches that violent hands were laid on some of the meeting-houses erected by Baptists, and they were appropriated for parish use. Apparently the Petty France house, and certainly the Bishopsgate house in Devonshire Square, were thus taken from their owners for some years. It is a sign of the growing confidence of the [Baptist] churches that Kiffin's people opened a new book for their records, which from then on remained in regular use.

Whitley, A History, p. 116.

The Fire of London was devastating. It was perhaps only less terrible than the Plague of 1665, semi-fictionalized by Daniel Defoe, a Puritan, in the delightful A Journal of the Plague Year. Now, people always ask me why the government tolerated sectarian churches, such as the Baptist ones, even though it knew of their existence and location. Voilà! So that it could make use of their premises when convenient. Petty France, which had Nehemiah Coxe as one of its pastors, came to hold as many as 600 members. My question is where these people met during those years of crypto-Baptist life. I hope Sam Renihan explains it to us in his forthcoming trilogy on Petty France.

*

WHITLEY, W. T. A History of British Baptists. London: Charles Griffin & Company, 1923.

17 Jun 13:52

Avinash Kumar: Bloom Indexes in PostgreSQL

There is a wide variety of indexes available in PostgreSQL. While most are common to almost all databases, some index types are more specific to PostgreSQL. For example, GIN indexes are helpful for speeding up searches for element values within documents. GIN and GiST indexes can both be used to make full-text searches faster, whereas BRIN indexes are more useful when dealing with large tables, as they store only summary information about ranges of pages. We will look at these indexes in more detail in future blog posts. For now, I would like to talk about another of the special indexes that can speed up searches on a table with a huge number of columns and which is massive in size. And that is called a bloom index.

In order to understand the bloom index better, let's first understand the bloom filter data structure. I will try to keep the description as short as I can, so that we can spend more time discussing how to create this index and when it will be useful.

Most readers will know that an array in computer science is a data structure consisting of a collection of elements, while a bit (binary digit) is the smallest unit of data, represented as either 0 or 1. A bloom filter is a bit array of m bits that are all initially set to 0.

A bit array is simply an array that stores a certain number of bits (0s and 1s). The bloom filter built on top of it is one of the most space-efficient data structures for testing whether an element is in a set or not.

Why use bloom filters?

Let's consider some alternatives, such as a list data structure and hash tables. With a list, we need to iterate through each element to search for a specific one. We could instead maintain a hash table in which each element of the list is hashed, and then check whether the hash of the element we are searching for matches a hash in the table. But checking through all the hashes can be more expensive than expected, and if there is a hash collision, the linear probing that follows may be time-consuming. When we move hash tables to disk, they also require additional IO and storage. For a more space-efficient solution, we can look at bloom filters, which are similar in spirit to hash tables.

Type I and Type II errors

While using bloom filters, we may see a result that falls into a type I error but never a type II error. A nice example of a type I error is a result saying that a person with the last name "vallarapu" exists in the relation foo.bar whereas it does not exist in reality (a false positive conclusion). An example of a type II error is a result saying that a person with the last name "vallarapu" does not exist in the relation foo.bar, when in reality it does (a false negative conclusion). A bloom filter is 100% accurate when it says the element is not present. But when it says the element is present, it may be 90% accurate or less. So it is usually called a probabilistic data structure.

The bloom filter algorithm

Let’s now understand the algorithm behind bloom filters better. As discussed earlier, it is a bit array of m bits, where m is a certain number. And we need a k number of hash functions. In order to tell whether an element exists and to give away the item pointer of the element, the element (data in columns) will be passed to the hash functions. Let’s say that there are only two hash functions to store the presence of the first element “avi” in the bit array. When the word “avi” is passed to the first hash function, it may generate the output as 4 and the second may give the output as 5. So now the bit array could look like the following:

All the bits are initially set to 0. Once we store the existence of the element “avi” in the bloom filter, it sets the 4th and 5th bits to 1. Let’s now store the existence of the word “percona”. This word is again passed to both the hash functions and assumes that the first hash function generates the value as 5 and the second hash function generated the value as 6. So, the bit array now looks like the following – since the 5th bit was already set to 1 earlier, it doesn’t make any modifications there:

Now, consider that our query is searching for a predicate with the name as “avi”. The input: “avi” will now be passed to the hash functions. The first hash function returns the value as 4 and the second returns the value as 5, as these are the same hash functions that were used earlier. Now when we look in position 4 and 5 of the bloom filter (bit array), we can see that the values are set to 1. This means that the element is present.

Collision with bloom filters

Consider a query that is fetching the records of a table with the name: “don”. When this word “don” is passed to both the hash functions, the first hash function returns the value as 6 (let’s say) and the second hash function returns the value as 4. As the bits at positions 6 and 4 are set to 1, the membership is confirmed and we see from the result that a record with the name: “don” is present. In reality, it is not. This is one of the chances of collisions. However, this is not a serious problem.

A point to remember is: "The fewer the hash functions, the greater the chance of collisions; the more the hash functions, the smaller the chance of collisions. But if we have k hash functions, the time it takes to validate membership is on the order of k."

Bloom Indexes in PostgreSQL

Now that you understand bloom filters, you'll see that a bloom index simply uses them. When you have a table with many columns, and queries use many different combinations of those columns as predicates, you could need many indexes. Maintaining so many indexes is not only costly for the database but is also a performance killer when dealing with larger data sets.

So, if you create a bloom index on all these columns, a hash is calculated for each of the columns and merged into a single index entry of the specified length for each row/record. When you specify the list of columns on which you need a bloom filter, you can also choose how many bits are set per column. The following is an example of the syntax, with the length of each index entry and the number of bits for a specific column.

CREATE INDEX bloom_idx_bar ON foo.bar USING bloom (id,dept_id,zipcode)
WITH (length=80, col1=4, col2=2, col3=4);

length is rounded up to the nearest multiple of 16. The default is 80 and the maximum is 4096. The default number of bits per column is 2, and the maximum that can be specified is 4095.

Bits per each column

Here is what it means in theory when we have specified length = 80 and col1=2, col2=2, col3=4. A bit array of length 80 bits is created per row/record. Data in col1 (column 1) is passed to two hash functions because col1 was set to 2 bits. Let's say these two hash functions generate the values 20 and 40. The bits at the 20th and 40th positions are then set to 1 within the 80 bits (m), since the length is specified as 80 bits. Data in col3 is passed to four hash functions; let's say the values generated are 2, 4, 9 and 10. So four bits – 2, 4, 9 and 10 – are set to 1 within the 80 bits.

There may be many empty bits, but this allows for more randomness across the bit arrays of the individual rows. Using a signature function, a signature is stored in the index data page for each record, along with the row pointer that points to the actual row in the table. Now, when a query uses an equality operator on a column that has been indexed using bloom, the number of hash functions already set for that column is used to generate the appropriate number of hash values – say four for col3, giving 2, 4, 9 and 10. The index data is then scanned row by row and checked to see whether those bits (the bit positions generated by the hash functions) are set to 1.

And finally, it says a certain number of rows have got all of these bits set to 1. The greater the length and the bits per column, the more the randomness and the fewer the false positives. But the greater the length, the greater the size of the index.

Bloom Extension

The bloom index is shipped as an extension in the contrib module, so you must create the bloom extension in order to take advantage of this index, using the following command:

CREATE EXTENSION bloom;

Example

Let’s start with an example. I am going to create a table with multiple columns and insert 100 million records.

percona=# CREATE TABLE foo.bar (id int, dept int, id2 int, id3 int, id4 int, id5 int,id6 int,id7 int,details text, zipcode int);
CREATE TABLE
percona=# INSERT INTO foo.bar SELECT (random() * 1000000)::int, (random() * 1000000)::int,
(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,(random() * 1000000)::int,
(random() * 1000000)::int,(random() * 1000000)::int,md5(g::text), floor(random()* (20000-9999 + 1) + 9999)
from generate_series(1,100*1e6) g;
INSERT 0 100000000

The size of the table is now 9647 MB as you can see below.

percona=# \dt+ foo.bar
                    List of relations
 Schema | Name | Type  |  Owner   |  Size   | Description
--------+------+-------+----------+---------+-------------
 foo    | bar  | table | postgres | 9647 MB |
(1 row)

Let’s say that all the columns: id, dept, id2, id3, id4, id5, id6 and zip code of table: foo.bar are used in several queries in random combinations according to different reporting purposes. If we create individual indexes on each column, it is going to take almost 2 GB disk space for each index.

Testing with btree indexes

We’ll try creating a single btree index on all the columns that are most used by the queries hitting this table. As you can see in the following log, it took 91115.397 ms to create this index and the size of the index is 4743 MB.

postgres=# CREATE INDEX idx_btree_bar ON foo.bar (id, dept, id2,id3,id4,id5,id6,zipcode);
CREATE INDEX
Time: 91115.397 ms (01:31.115)
postgres=# \di+ foo.idx_btree_bar
                             List of relations
 Schema |     Name      | Type  |  Owner   | Table |  Size   | Description
--------+---------------+-------+----------+-------+---------+-------------
 foo    | idx_btree_bar | index | postgres | bar   | 4743 MB |
(1 row)

Now, let's try some of the queries with a random selection of columns. You can see that the execution times of these queries are 2440.374 ms and 2406.498 ms for query 1 and query 2 respectively. To avoid issues with disk IO, I made sure that the execution plans were captured once the index was cached in memory.

Query 1
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id4 = 295294 and zipcode = 13266;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_bar on bar  (cost=0.57..1607120.58 rows=1 width=69) (actual time=1832.389..2440.334 rows=1 loops=1)
   Index Cond: ((id4 = 295294) AND (zipcode = 13266))
 Planning Time: 0.079 ms
 Execution Time: 2440.374 ms
(4 rows)
Query 2
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id5 = 281326 and id6 = 894198;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_btree_bar on bar  (cost=0.57..1607120.58 rows=1 width=69) (actual time=1806.237..2406.475 rows=1 loops=1)
   Index Cond: ((id5 = 281326) AND (id6 = 894198))
 Planning Time: 0.096 ms
 Execution Time: 2406.498 ms
(4 rows)

Testing with Bloom Indexes

Let’s now create a bloom index on the same columns. As you can see from the following log, there is a huge size difference between the bloom (1342 MB) and the btree index (4743 MB). This is the first win. It took almost the same time to create the btree and the bloom index.

postgres=# CREATE INDEX idx_bloom_bar ON foo.bar USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4, col8=4);
CREATE INDEX
Time: 94833.801 ms (01:34.834)
postgres=# \di+ foo.idx_bloom_bar
                             List of relations
 Schema |     Name      | Type  |  Owner   | Table |  Size   | Description
--------+---------------+-------+----------+-------+---------+-------------
 foo    | idx_bloom_bar | index | postgres | bar   | 1342 MB |
(1 row)

Let’s run the same queries, check the execution time, and observe the difference.

Query 1
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id4 = 295294 and zipcode = 13266;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on bar  (cost=1171823.08..1171824.10 rows=1 width=69) (actual time=1265.269..1265.550 rows=1 loops=1)
   Recheck Cond: ((id4 = 295294) AND (zipcode = 13266))
   Rows Removed by Index Recheck: 2984788
   Heap Blocks: exact=59099 lossy=36090
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..1171823.08 rows=1 width=0) (actual time=653.865..653.865 rows=99046 loops=1)
         Index Cond: ((id4 = 295294) AND (zipcode = 13266))
 Planning Time: 0.073 ms
 Execution Time: 1265.576 ms
(8 rows)
Query 2
-------
postgres=# EXPLAIN ANALYZE select * from foo.bar where id5 = 281326 and id6 = 894198;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on bar  (cost=1171823.08..1171824.10 rows=1 width=69) (actual time=950.561..950.799 rows=1 loops=1)
   Recheck Cond: ((id5 = 281326) AND (id6 = 894198))
   Rows Removed by Index Recheck: 2983893
   Heap Blocks: exact=58739 lossy=36084
   ->  Bitmap Index Scan on idx_bloom_bar  (cost=0.00..1171823.08 rows=1 width=0) (actual time=401.588..401.588 rows=98631 loops=1)
         Index Cond: ((id5 = 281326) AND (id6 = 894198))
 Planning Time: 0.072 ms
 Execution Time: 950.827 ms
(8 rows)

From the above tests, it is evident that the bloom indexes performed better. Query 1 took 1265.576 ms with a bloom index and 2440.374 ms with a btree index. And query 2 took 950.827 ms with bloom and 2406.498 ms with btree. However, the same test would show a better result for a btree index if you had created the btree index on only those two columns (instead of on many columns).

Reducing False Positives

If you look at the execution plans generated after creating the bloom indexes (consider Query 2),  98631 rows are considered to be matching rows. However, the output says only one row. So, the rest of the rows – all 98630 – are false positives. The btree index would not return any false positives.

In order to reduce the false positives, you may have to increase the signature length and the bits per column, using some of the formulas mentioned in this interesting blog post, through experimentation and testing. As you increase the signature length and bits, you might see the bloom index growing in size, but this should reduce the number of false positives. If the query time suffers because of the number of false positives returned by the bloom index, you could increase the length. If increasing the length does not make much difference to the performance, then you can leave the length as it is.
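
As an illustrative sketch (the values are untested and chosen only to show the knobs), the index from the earlier example could be rebuilt with a longer signature and more bits per column:

DROP INDEX foo.idx_bloom_bar;
CREATE INDEX idx_bloom_bar ON foo.bar USING bloom(id, dept, id2, id3, id4, id5, id6, zipcode)
WITH (length=128, col1=8, col2=8, col3=8, col4=8, col5=8, col6=8, col7=8, col8=8);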

Points to be carefully noted

  1. In the above tests, we have seen how a bloom index has performed better than a btree index. But, in reality, if we had created a btree index just on top of the two columns being used as predicates, the query would have performed much faster with a btree index than with a bloom index. This index does not replace a btree index unless we wish to replace a chunk of the indexes with a single bloom index.
  2. Just like hash indexes, a bloom index is applicable for equality operators only.
  3. Some formulas on how to calculate the appropriate length of a bloom filter and the bits per column can be read on Wikipedia or in this blog post.

Conclusion

Bloom indexes are very helpful when we have a table that stores huge amounts of data and a lot of columns, where we find it difficult to create a large number of indexes, especially in OLAP environments where data is loaded from several sources and maintained for reporting. You could consider testing a single bloom index to see if you can avoid implementing a huge number of individual or composite indexes that could take additional disk space without much performance gain.

07 Jun 17:49

Normalization and Further Normalization Part 2: If You Need Them, You're Doing It Wrong

by noreply@blogger.com (Fabian Pascal)

In Part 1 we outlined some fundamentals of database design, namely the distinction between normalization to 1NF, and further normalization (to "full" 5NF), and explained that they are necessary only to repair poor designs -- if you (1) develop a complete conceptual model and (2) formalize it properly using the RDM, (3) adhering to the three core principles of database design, you should end up with a relational database in both 1NF and 5NF.

Here we apply this knowledge to the typical request for "normalization" help we presented in Part 1.
06 Jun 10:23

Comments

NPR encourages you to add comments to their stories using the page inspector in your browser's developer tools. Note: Your comments are visible only to you, and will be lost when you refresh the page.
30 May 11:44

Ufo

"It's a little low for a weather balloon; it might be some other kind." "Yeah. Besides, I know I'm the alien conspiracy guy, but come on--the idea that the government would care about hiding something so mundane as atmospheric temperature measurement is too ridiculous even for me."
23 May 16:11

Andy Wingo: bigint shipping in firefox!

I am delighted to share with folks the results of a project I have been helping out on for the last few months: implementation of "BigInt" in Firefox, which is finally shipping in Firefox 68 (beta).

what's a bigint?

BigInts are a new kind of JavaScript primitive value, like numbers or strings. A BigInt is a true integer: it can take on the value of any finite integer (subject to some arbitrarily large implementation-defined limits, such as the amount of memory in your machine). This contrasts with JavaScript number values, which have the well-known property of only being able to precisely represent integers between -2^53 and 2^53.

BigInts are written like "normal" integers, but with an n suffix:

var a = 1n;
var b = a + 42n;
b << 64n
// result: 793209995169510719488n

With the bigint proposal, the usual mathematical operations (+, -, *, /, %, <<, >>, **, and the comparison operators) are extended to operate on bigint values. As a new kind of primitive value, bigint values have their own typeof:

typeof 1n
// result: 'bigint'

Besides allowing for more kinds of math to be easily and efficiently expressed, BigInt also allows for better interoperability with systems that use 64-bit numbers, such as "inodes" in file systems, WebAssembly i64 values, high-precision timers, and so on.

You can read more about the BigInt feature over on MDN, as usual. You might also like this short article on BigInt basics that V8 engineer Mathias Bynens wrote when Chrome shipped support for BigInt last year. There is an accompanying language implementation article as well, for those of y'all that enjoy the nitties and the gritties.

can i ship it?

To try out BigInt in Firefox, simply download a copy of Firefox Beta. This version of Firefox will be fully released to the public in a few weeks, on July 9th. If you're reading this in the future, I'm talking about Firefox 68.

BigInt is also shipping already in V8 and Chrome, and my colleague Caio Lima has a project in progress to implement it in JavaScriptCore / WebKit / Safari. Depending on your target audience, BigInt might be deployable already!

thanks

I must mention that my role in the BigInt work was relatively small; my Igalia colleague Robin Templeton did the bulk of the BigInt implementation work in Firefox, so large ups to them. Hearty thanks also to Mozilla's Jan de Mooij and Jeff Walden for their patient and detailed code reviews.

Thanks as well to the V8 engineers for their open source implementation of BigInt fundamental algorithms, as we used many of them in Firefox.

Finally, I need to make one big thank-you, and I hope that you will join me in expressing it. The road to ship anything in a web browser is long; besides the "simple matter of programming" that it is to implement a feature, you need a specification with buy-in from implementors and web standards people, you need a good working relationship with a browser vendor, you need willing technical reviewers, you need to follow up on the inevitable security bugs that any browser change causes, and all of this takes time. It's all predicated on having the backing of an organization that's foresighted enough to invest in this kind of long-term, high-reward platform engineering.

In that regard I think all people that work on the web platform should send a big shout-out to Tech at Bloomberg for making BigInt possible by underwriting all of Igalia's work in this area. Thank you, Bloomberg, and happy hacking!

20 May 19:21

Bluetooth's Complexity Has Become a Security Risk (Wired)

by corbet
Wired looks at the security issues stemming from the complexity of the Bluetooth standard. "Bluetooth has certainly been investigated to a degree, but researchers say that the lack of intense scrutiny historically stems again from just how involved it is to even read the standard, much less understand how it works and all the possible implementations. On the plus side, this has created a sort of security through obscurity, in which attackers have also found it easier to develop attacks against other protocols and systems rather than taking the time to work out how to mess with Bluetooth."
22 Apr 20:22

'The Quantum Supremacy' Serial Continues

by David P. Goldman
Part four of Spengler’s spy thriller which pits China’s Ministry of State Security against the CIA in a deadly battle of wits
04 Apr 19:29

To Stop the Deep State, Bring Back Mike Flynn!

by David P. Goldman
Why did the Deep State throw caution to the winds in a desperate effort to frame Donald Trump for alleged collusion with Russia--and failing that, to entrap him in an obstruction of justice case? There are a lot of reasons for the Establishment to hate Donald Trump, but one of them stands out. During the campaign, Donald Trump denounced the Obama administration for having created ISIS. That claim drew ridicule from the mainstream media, but it is entirely correct. Lt. Gen. Mike Flynn, Trump's campaign adviser, was head of the Defense Intelligence Agency in 2012 when the DIA blew the whistle on CIA backing for Sunni Islamists fighting the Assad regime during the then-raging Syrian civil war. As my friend Michael Ledeen wrote at PJ Media on March 26, the Mueller investigation was all about Flynn.
01 Apr 16:45

Sebastian Insausti: How to Deploy Highly Available PostgreSQL with Single Endpoint for WordPress

WordPress is open source software you can use to create your website, blog, or application. There are many designs and features/plugins to add to your WordPress installation. WordPress is free software; however, there are many commercial plugins to improve it, depending on your requirements.

WordPress makes it easy for you to manage your content and it’s really flexible. Create drafts, schedule publication, and look at your post revisions. Make your content public or private, and secure posts and pages with a password.

To run WordPress you should have at least PHP version 5.2.4+, MySQL version 5.0+ (or MariaDB), and Apache or Nginx. Some of these versions have reached EOL and could expose your site to security vulnerabilities, so you should install the latest versions available for your environment.

As we can see, WordPress currently supports only the MySQL and MariaDB database engines. WPPG is a plugin based on the PG4WP plugin that gives you the possibility of installing and using WordPress with a PostgreSQL database as a backend. It works by replacing calls to MySQL-specific functions with generic calls that map them to other database functions, and by rewriting SQL queries on the fly when needed.

For this blog, we'll install 1 Application Server with WordPress 5.1.1 and HAProxy 1.5.18 on the same server, and 2 PostgreSQL 11 database nodes (Master-Standby). The operating system will be CentOS 7 on all servers. To deploy the databases and the load balancer we'll use the ClusterControl system.

This is a basic environment. You can improve it by adding more high availability features as you can see here. So, let’s start.

Database Deployment

First, we need to install our PostgreSQL database. For this, we’ll assume you have ClusterControl installed.

To perform a deployment from ClusterControl, simply select the option “Deploy” and follow the instructions that appear.

When selecting PostgreSQL, we must specify User, Key or Password and port to connect by SSH to our servers. We also need a name for our new cluster and if we want ClusterControl to install the corresponding software and configurations for us.

After setting up the SSH access information, we must define the database user, version and datadir (optional). We can also specify which repository to use.

In the next step, we need to add our servers to the cluster that we are going to create.

When adding our servers, we can enter IP or hostname.

In the last step, we can choose if our replication will be Synchronous or Asynchronous.

We can monitor the status of the creation of our new cluster from the ClusterControl activity monitor.

Once the task is finished, we can see our cluster in the main ClusterControl screen.

Once we have our cluster created, we can perform several tasks on it, like adding a load balancer (HAProxy) or a new replica.

Load Balancer Deployment

To perform a load balancer deployment, in this case, HAProxy, select the option “Add Load Balancer” in the cluster actions and fill the asked information.

We only need to add IP/Name, port, policy and the nodes we are going to use. By default, HAProxy is configured by ClusterControl with two different ports, one read-write and one read-only. In the read-write port, only the master is UP. In case of failure, ClusterControl will promote the most advanced slave and it’ll change the HAProxy configuration to enable the new master and disable the old one. In this way, we’ll have automatic failover in case of failure.

If we followed the previous steps, we should have the following topology:

So, we have a single endpoint created in the Application Server with HAProxy. Now, we can use this endpoint in the application as a localhost connection.

WordPress Installation

Let’s install WordPress on our Application Server and configure it to connect to the PostgreSQL database by using the local HAProxy port 3307.

First, install the packages required on the Application Server.

$ yum install httpd php php-mysql php-pgsql postgresql
$ systemctl start httpd && systemctl enable httpd

Download the latest WordPress version and move it to the apache document root.

$ wget https://wordpress.org/latest.tar.gz
$ tar zxf latest.tar.gz
$ mv wordpress /var/www/html/

Download the WPPG plugin and move it into the wordpress plugins directory.

$ wget https://downloads.wordpress.org/plugin/wppg.1.0.1.zip
$ unzip wppg.1.0.1.zip
$ mv wppg /var/www/html/wordpress/wp-content/plugins/

Copy the db.php file to the wp-content directory. Then, edit it and change the 'PG4WP_ROOT' path:

$ cp /var/www/html/wordpress/wp-content/plugins/wppg/pg4wp/db.php /var/www/html/wordpress/wp-content/
$ vi /var/www/html/wordpress/wp-content/db.php
define( 'PG4WP_ROOT', ABSPATH.'wp-content/plugins/wppg/pg4wp');

Rename the wp-config.php and change the database information:

$ mv /var/www/html/wordpress/wp-config-sample.php /var/www/html/wordpress/wp-config.php
$ vi /var/www/html/wordpress/wp-config.php
define( 'DB_NAME', 'wordpressdb' );
define( 'DB_USER', 'wordpress' );
define( 'DB_PASSWORD', 'wpPassword' );
define( 'DB_HOST', 'localhost:3307' );

Then, we need to create the database and the application user in the PostgreSQL database. On the master node:

postgres=# CREATE DATABASE wordpressdb;
CREATE DATABASE
postgres=# CREATE USER wordpress WITH PASSWORD 'wpPassword';
CREATE ROLE
postgres=# GRANT ALL PRIVILEGES ON DATABASE wordpressdb TO wordpress;
GRANT

And edit the pg_hba.conf file to allow the connection from the Application Server.

$ vi /var/lib/pgsql/11/data/pg_hba.conf
host  all  all  192.168.100.153/24  md5
$ systemctl reload postgresql-11

Make sure you can access it from the Application Server:

$ psql -hlocalhost -p3307 -Uwordpress wordpressdb
Password for user wordpress:
psql (9.2.24, server 11.2)
WARNING: psql version 9.2, server version 11.0.
         Some psql features might not work.
Type "help" for help.
wordpressdb=>

Now, go to the install.php in the web browser, in our case, the IP Address for the Application Server is 192.168.100.153, so, we go to:

http://192.168.100.153/wordpress/wp-admin/install.php

Add the Site Title, Username and Password to access the admin section, and your email address.

Finally, go to Plugins -> Installed Plugins and activate the WPPG plugin.

Conclusion

Now, we have WordPress running with PostgreSQL by using a single endpoint. We can monitor our cluster activity on ClusterControl checking the different metrics, dashboards or many performance and management features.

There are different ways to implement WordPress with PostgreSQL. It could be by using a different plugin, or by installing WordPress as usual and adding the plugin later, but in any case, as we mentioned, PostgreSQL is not officially supported by WordPress, so we must perform an exhaustive testing process if we want to use this topology in production.

25 Mar 16:29

Spengler: 'The Quantum Supremacy: A Novel'

by David P. Goldman
For the past several years I've warned about China's ascendance and the threat to America's world standing. Not enough people are listening, so I've written a spy thriller about China: The Quantum Supremacy. It's available now in an Amazon Kindle edition, but I will publish it as a serial in this space over the next six months, a chapter a week. A lot of people don't like the unvarnished facts, but they like stories. So here's a story about China. I hope you have as much fun reading it as I did writing it.
23 Mar 19:26

Graph Databases: They Who Forget the Past...

by noreply@blogger.com (Fabian Pascal)

Out of the plethora of misconceptions common in the industry[1], quite a few are squeezed into this paragraph:
“The relational databases that emerged in the ’80s are efficient at storing and analyzing tabular data but their underlying data model makes it difficult to connect data scattered across multiple tables. The graph databases we’ve seen emerge in the recent years are designed for this purpose. Their data model is particularly well-suited to store and to organize data where connections are as important as individual data points. Connections are stored and indexed as first-class citizens, making it an interesting model for investigations in which you need to connect the dots. In this post, we review three common fraud schemes and see how a graph approach can help investigators defeat them.”
--AnalyticBridge.DataScienceCentral.com

  • Relational databases did not emerge in the ’80s (SQL DBMSs did);
  • There is no "tabular data" (the relational data structure is the relation, which can be visualized as a table on a physical medium[2], and SQL tables are not relations);
  • Analysis is not a DBMS, but an application function (while database queries, as deductions, are an important aspect of analysis, and computational functions can be added to the data sublanguage (as in SQL), the primary function of a DBMS is data management)[3];
  • A data model has nothing to do with storage (storage and access methods are part of physical implementation, which determines efficiency/performance[4]).

Here, however, we will focus on the current revival (rather than emergence) of graph DBMSs claimed superior -- without any evidence or qualifications -- to SQL DBMSs (not relational, which do not exist) that purportedly "make it difficult to connect data scattered across multiple tables". This is a typical example of how lack of foundation knowledge and of familiarity with the history of the field inhibit understanding and progress[5].

20 Mar 18:31

Craig Kerstiens: How to evaluate your database

Choosing a database isn’t something you do every day. You generally choose it once for a project, then don’t look back. If you enjoy years of success with your application, you may one day have to migrate to a new database, but that occurs years down the line. In choosing a database there are a few key things to consider. Here is your checklist, and spoiler alert: Postgres checks out strongly in each of these categories.

Does your database solve your problem?

There are a lot of new databases that rise up every year, each of them looking to solve hard problems within the data space. But you should start by checking whether they solve a problem that you personally have. Most applications at the end of the day have some relational data model, and more and more are also working with some level of unstructured data. Relational databases of course solve the relational piece, but they increasingly support the unstructured piece as well; Postgres in particular handles unstructured data well with its JSONB type and the indexing that comes with it.

Do you need strong guarantees for your data? ACID is still at the core of how durable and safe your data is, and knowing how a database stacks up here is a good evaluation criterion. Then there is also the CAP theorem, which you see applied especially to distributed or clustered databases. Each of the previous links is worth a read to get a better understanding of the theory around databases. If you’re interested in how various databases perform under CAP then check out the Jepsen series of tests. But for the average person like myself it can be boiled down a bit more: do you need full guarantees around your transactions, or can you trade some of that away for performance?

While it doesn’t fully speak to all the possible options you can have with databases, Postgres comes with some pretty good flexibility out of the box. It allows both synchronous (guaranteed to have made it) and asynchronous (queued up, occurring soon after) replication to standbys. Those standbys could be read replicas for reporting or standbys for high availability. What’s nice about Postgres is that it actually allows you to swap between synchronous and asynchronous on a per-transaction basis.
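
A minimal sketch of that per-transaction switch (the page_views table is hypothetical):

BEGIN;
-- this transaction alone will not wait for a synchronous standby to confirm the commit
SET LOCAL synchronous_commit = off;
INSERT INTO page_views (url, viewed_at) VALUES ('/pricing', now());
COMMIT;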

Then there is the richness of features. Postgres has rich data types, powerful indexes, and a range of features such as geospatial support and full text search. By default yes, Postgres usually does solve my problem. But that is only one of my criteria.

How locked in am I to my database?

Once I’ve established that a database solves my problem, I want to know a bit more about what I’m getting myself into. Whether the database is open source is a factor. That doesn’t mean I require the database to be open source, but it simplifies my evaluation. A closed source database means I’m committing to whatever the steward of that database decides. If the company is well established and is a good steward of the product, a closed source database can absolutely satisfy what I need.

On the flip side, open source doesn’t immediately mean it is perfect. Is it open source but with an extremely restrictive license? Is there a community around it? Has it been getting new releases? All of these play into my level of comfort in trusting it with my data.

Can I hire for my database?

This one gets missed so often by early stage companies! It is the number one reason I like using open technologies and frameworks: I can hire someone already familiar with my tech stack. In contrast, with a homegrown, in-house framework or database, the ability to test knowledge is harder and the ramp-up time for a new hire is considerably longer. Postgres shines as brightly as any database here. A look at Hacker News "Who is hiring" trends, which I view as a leading indicator, from a couple of years ago shows Postgres leading the pack of desired database skills. The number of people who know Postgres continues to increase each day. It is not a fading skill.

Who's hiring trends from HN

What does the future look like?

Finally, I’m looking at what my future needs will be combined with the future of the database. Does the database have momentum to keep improving and advancing? Does it not only have features I need today, but does it have features that can benefit me in the future without complicating my stack? I often favor a database that can solve multiple problems not just one. Combining 10 very specialized tools leads to a much more complex stack, vs. in Postgres if I need to layer in a little full text search I already have something that can be my starting point.

Does it scale is my final big one. If my business is expected to remain small this is not a concern, but I want to know what my limits are. Replacing my database is a large-effort task, so how far can I scale my database without rearchitecting?

Personally Postgres having a good answer to the scale question is what attracted me to join Citus over 3 years ago. It takes Postgres and makes it even more powerful. It removes the scaling question for you, so when I need to scale I have my answer.

These aren’t the only criteria

I’m sure this is not an exhaustive list, but it is a framework I’ve used for many years. In most of those years I’m led back to the simple answer: just use Postgres.

What other criteria do you use when choosing your database? Let us know @citusdata

10 Mar 17:18

health @ Savannah: Thalamus 0.9.8 is out. HIS Migration from MongoDB to PostgreSQL

Dear all

I am proud to announce the availability of Thalamus 0.9.8, which has been migrated from MongoDB to now interact with PostgreSQL.
(see related news https://savannah.gnu.org/forum/forum.php?forum_id=9366 )

Please make sure you update the package

$ pip3 install --user --upgrade thalamus

The interaction for the end user and from the GNU Health HMIS node will be transparent.

The Health Information System installation documentation has been updated on Wikibooks. You can refer to https://en.wikibooks.org/wiki/GNU_Health/Federation_Technical_Guide#Installing_Thalamus

Please don't forget to report any issues you find to health@gnu.org

Bests
Luis

09 Mar 15:48

Light Pollution

It's so sad how almost no one alive today can remember seeing the galactic rainbow, the insanity nebula, or the skull and glowing eyes of the Destroyer of Sagittarius.
04 Mar 22:50

Rosenzweig: The federation fallacy

by corbet
Here's a lengthy piece from Alyssa Rosenzweig on preserving freedom despite the inevitable centralization of successful information services. "Indeed, it seems all networked systems tend towards centralisation as the natural consequence of growth. Some systems, both legitimate and illegitimate, are intentionally designed for centralisation. Other systems, like those in the Mastodon universe, are specifically designed to avoid centralisation, but even these succumb to the centralised black hole as their user bases grow towards the event horizon."
09 Feb 12:56

Huawei case demonstrates importance of Free Software for security

Huawei case demonstrates importance of Free Software for security

The discussion of the Huawei security concerns showcases a general trust issue when it comes to critical infrastructure. A first step to solve this problem is to publish the code under a Free and Open Source Software licence and take measures to facilitate its independently-verifiable distribution.

The ongoing debate about banning Huawei hardware for the rollout of 5G networks, following earlier state espionage allegations, falls short. It is not just about the Chinese company but about a general lack of transparency within this sector. As past incidents have proved, the problem of backdoors inside black-boxed hardware and software is widespread, independently of the manufacturers' origins.

However, it is unprecedented that the demand to inspect the source code of a manufacturer's equipment has been discussed so broadly and intensely. The Free Software Foundation Europe (FSFE) welcomes the fact that the importance of source code is being recognised, but is afraid that the proposed solution falls short. Allowing inspection of the secret code by selected authorities and telephone companies might help in this specific case, but will not solve the general problem.

To establish trust in critical infrastructure like 5G, it is a crucial precondition that all software code powering those devices is published under a Free and Open Source Software licence. Free and Open Source Software guarantees the four freedoms to use, study, share, and improve an application. On this basis, everyone can inspect the code, not only for backdoors, but for all security risks. Only these freedoms allow for independent and continuous security audits which will lead citizens, the economy, and the public sector to trust their communication and data exchange.

Furthermore, in order to verify code integrity – so that the provided source code corresponds to the executable code running on the equipment – it is either necessary that there are reproducible builds in case of binary distribution, or that providers are brought into the position to compile and deploy the code on their own.

"We should not only debate the Huawei case but extend the discussion to all critical infrastructure." says Max Mehl, FSFE Programme Manager. "Only with Free and Open Source Software, transparency and accountability can be guaranteed. This is a long-known crucial precondition for security and trust. We expect from state actors to immediately implement this solution not only for the Huawei case but for all comparable IT security issues."

09 Feb 12:38

Invisible Formatting

To avoid errors like this, we render all text and pipe it through OCR before processing, fixing a handful of irregular bugs by burying them beneath a smooth, uniform layer of bugs.
03 Feb 01:40

James Coleman: PostgreSQL at Scale: Database Schema Changes Without Downtime

Braintree Payments uses PostgreSQL as its primary datastore. We rely heavily on the data safety and consistency guarantees a traditional relational database offers us, but these guarantees come with certain operational difficulties. To make things even more interesting, we allow zero scheduled functional downtime for our main payments processing services.

Several years ago we published a blog post detailing some of the things we had learned about how to safely run DDL (data definition language) operations without interrupting our production API traffic.

Since that time PostgreSQL has gone through quite a few major upgrade cycles — several of which have added improved support for concurrent DDL. We’ve also further refined our processes. Given how much has changed, we figured it was time for a blog post redux.

In this post we’ll address the following topics:

First, some basics

For all code and database changes, we require that:

  • Live code and schemas be forward-compatible with updated code and schemas: this allows us to roll out deploys gradually across a fleet of application servers and database clusters.
  • New code and schemas be backward-compatible with live code and schemas: this allows us to roll back any change to the previous version in the event of unexpected errors.

For all DDL operations we require that:

  • Any exclusive locks acquired on tables or indexes be held for at most ~2 seconds.
  • Rollback strategies do not involve reverting the database schema to its previous version.

Transactionality

PostgreSQL supports transactional DDL. In most cases, you can execute multiple DDL statements inside an explicit database transaction and take an “all or nothing” approach to a set of changes. However, running multiple DDL statements inside a transaction has one serious downside: if you alter multiple objects, you’ll need to acquire exclusive locks on all of those objects in a single transaction. Because holding locks on multiple tables creates the possibility of deadlock and increases exposure to long waits, we do not combine multiple DDL statements into a single transaction. PostgreSQL will still execute each separate DDL statement transactionally; each statement will either be cleanly applied or fail, with the transaction rolled back.

Note: Concurrent index creation is a special case. Postgres disallows executing CREATE INDEX CONCURRENTLY inside an explicit transaction; instead Postgres itself manages the transactions. If for some reason the index build fails before completion, you may need to drop the index before retrying, though the index will still never be used for regular queries if it did not finish building successfully.

Locking

PostgreSQL has many different levels of locking. We’re concerned primarily with the following table-level locks since DDL generally operates at these levels:

  • ACCESS EXCLUSIVE: blocks all usage of the locked table.
  • SHARE ROW EXCLUSIVE: blocks concurrent DDL against, and row modification in, the locked table (reads are still allowed).
  • SHARE UPDATE EXCLUSIVE: blocks concurrent DDL against the locked table.

Note: “Concurrent DDL” for these purposes includes VACUUM and ANALYZE operations.

All DDL operations generally necessitate acquiring one of these locks on the object being manipulated. For example, when you run:

https://medium.com/media/b1ba327e98b89a84461c915a075c77aa/href
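
The embedded snippet above is not preserved in this copy. As a hedged stand-in, a hypothetical column addition is the kind of statement being described:

  -- Hypothetical example; a plain ALTER TABLE of this sort needs an ACCESS EXCLUSIVE lock on foos.
  ALTER TABLE foos ADD COLUMN bar text;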

PostgreSQL attempts to acquire an ACCESS EXCLUSIVE lock on the table foos. Attempting to acquire this lock causes all subsequent queries on this table to queue until the lock is released. In practice your DDL operations can cause other queries to back up for as long as your longest running query takes to execute. Because arbitrarily long queueing of incoming queries is indistinguishable from an outage, we try to avoid any long-running queries in databases supporting our payments processing applications.

But sometimes a query takes longer than you expect. Or maybe you have a few special case queries that you already know will take a long time. PostgreSQL offers some additional runtime configuration options that allow us to guarantee query queueing backpressure doesn’t result in downtime.

Instead of relying on Postgres to lock an object when executing a DDL statement, we acquire the lock explicitly ourselves. This allows us to carefully control the time the queries may be queued. Additionally when we fail to acquire a lock within several seconds, we pause before trying again so that any queued queries can be executed without significantly increasing load. Finally, before we attempt lock acquisition, we query pg_locks¹ for any currently long running queries to avoid unnecessarily queueing queries for several seconds when it is unlikely that lock acquisition is going to succeed.

Starting with Postgres 9.3, you can adjust the lock_timeout parameter to control how long Postgres will wait for lock acquisition before returning without acquiring the lock. If you happen to be using 9.2 or earlier (those versions are unsupported; you should upgrade!), then you can simulate this behavior by using the statement_timeout parameter around an explicit LOCK <table> statement.
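
A minimal sketch of the lock_timeout approach, using a hypothetical table and column and an illustrative two-second budget:

  SET lock_timeout = '2s';               -- give up on lock acquisition rather than queueing traffic behind the DDL
  ALTER TABLE foos ADD COLUMN bar text;  -- fails fast if the ACCESS EXCLUSIVE lock isn't granted in time
  RESET lock_timeout;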

In many cases an ACCESS EXCLUSIVE lock need only be held for a very short period of time, i.e., the amount of time it takes Postgres to update its "catalog" (think metadata) tables. Below we'll discuss the cases where a lower lock level is sufficient or alternative approaches for avoiding long-held locks that block SELECT/INSERT/UPDATE/DELETE.

Note: Sometimes holding even an ACCESS EXCLUSIVE lock for something more than a catalog update (e.g., a full table scan or even rewrite) can be functionally acceptable when the table size is relatively small. We recommend testing your specific use case against realistic data sizes and hardware to see if a particular operation will be "fast enough". On good hardware with a table easily loaded into memory, a full table scan or rewrite for thousands (possibly even 100s of thousands) of rows may be "fast enough".

Table operations

Create table

In general, adding a table is one of the few operations we don’t have to think too hard about since, by definition, the object we’re “modifying” can’t possibly be in use yet. :D

While most of the attributes involved in creating a table do not involve other database objects, including a foreign key in your initial table definition will cause Postgres to acquire a SHARE ROW EXCLUSIVE lock against the referenced table, blocking any concurrent DDL or row modifications. While this lock should be short-lived, it nonetheless requires the same caution as any other operation acquiring such a lock. We prefer to split these into two separate operations: create the table, then add the foreign key (as in the sketch below).
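
For example, assuming hypothetical tables foos and bars (with bars.id already in place), the split looks like this:

  -- Create the table without the foreign key...
  CREATE TABLE foos (
    id     bigserial PRIMARY KEY,
    bar_id bigint NOT NULL
  );

  -- ...then add the foreign key as a separate step (see also Foreign keys below).
  ALTER TABLE foos
    ADD CONSTRAINT foos_bar_id_fk
    FOREIGN KEY (bar_id) REFERENCES bars (id);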

Drop table

Dropping a table requires an exclusive lock on that table. As long as the table isn’t in current use you can safely drop the table. Before allowing a DROP TABLE ... to make its way into our production environments we require documentation showing when all references to the table were removed from the codebase. To double-check that this is the case, you can query PostgreSQL's table statistics view pg_stat_user_tables², confirming that the returned statistics don't change over a reasonable length of time.

Rename table

While it’s unsurprising that a table rename requires acquiring an ACCESS EXCLUSIVE lock on the table, that's far from our biggest concern. Unless the table isn't being read from or written to at all, it's very unlikely that your application code could safely handle the table being renamed underneath it.

We avoid table renames almost entirely. But if a rename is an absolute must, then a safe approach might look something like the following (a sketch follows the list):

  1. Create a new table with the same schema as the old one.
  2. Backfill the new table with a copy of the data in the old table.
  3. Use INSERT and UPDATE triggers on the old table to maintain parity in the new table.
  4. Begin using the new table.
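
A hedged sketch of steps 1-3 for a hypothetical table foos(id, data) with a primary key on id; real code would also need to handle DELETEs and any schema details specific to your table:

  -- Step 1: new table with the same schema.
  CREATE TABLE foos_new (LIKE foos INCLUDING ALL);

  -- Step 3: mirror new writes into the copy.
  CREATE FUNCTION mirror_foos_writes() RETURNS trigger AS $$
  BEGIN
    INSERT INTO foos_new (id, data)
    VALUES (NEW.id, NEW.data)
    ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data;
    RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER foos_mirror_writes
    AFTER INSERT OR UPDATE ON foos
    FOR EACH ROW EXECUTE PROCEDURE mirror_foos_writes();

  -- Step 2: backfill existing rows (installing the trigger first avoids missing writes).
  INSERT INTO foos_new SELECT * FROM foos ON CONFLICT (id) DO NOTHING;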

Other approaches involving views and/or RULEs may also be viable depending on the performance characteristics required.

Column operations

Note: For column constraints (e.g., NOT NULL) or other constraints (e.g., EXCLUDE), see Constraints.

Add column

Adding a column to an existing table generally requires holding a short ACCESS EXCLUSIVE lock on the table while catalog tables are updated. But there are several potential gotchas:

Default values: Introducing a default value at the same time as adding the column will cause the table to be locked while the default value is propagated to every existing row. Instead, you should (as sketched below):

  1. Add the new column (without the default value).
  2. Set the default value on the column.
  3. Backfill all existing rows separately.
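
A concrete version with hypothetical names (in practice the backfill would be batched rather than one big UPDATE):

  ALTER TABLE foos ADD COLUMN bar text;                 -- 1: add the column without a default
  ALTER TABLE foos ALTER COLUMN bar SET DEFAULT 'baz';  -- 2: only affects newly inserted rows
  UPDATE foos SET bar = 'baz' WHERE bar IS NULL;        -- 3: backfill existing rows separately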

Note: In the recently released PostgreSQL 11, this is no longer the case for non-volatile default values. Instead, adding a new column with a default value only requires updating catalog tables, and any reads of rows without a value for the new column will magically have it “filled in” on the fly.

Not-null constraints: Adding a column with a NOT NULL constraint is only possible if there are no existing rows or a DEFAULT is also provided. If there are no existing rows, then the change is effectively equivalent to a catalog only change. If there are existing rows and you are also specifying a default value, then the same caveats apply as above with respect to default values.

Note: Adding a column will cause all SELECT * FROM ... style queries referencing the table to begin returning the new column. It is important to ensure that all currently running code safely handles new columns. To avoid this gotcha in our applications we require queries to avoid * expansion in favor of explicit column references.

Change column type

In the general case changing a column’s type requires holding an exclusive lock on a table while the entire table is rewritten with the new type.

There are a few exceptions where no table rewrite is required, such as binary-coercible changes like converting a varchar(n) column to text (allowed without a rewrite since Postgres 9.1) and, since 9.2, increasing or removing the length limit of a varchar column.

Note: Even though one of the exceptions above was added in 9.1, changing the type of an indexed column would still always rewrite the index even if a table rewrite was avoided. In 9.2, any column type change that avoids a table rewrite also avoids rewriting the associated indexes. If you’d like to confirm that your change won’t rewrite the table or any indexes, you can query pg_class³ and verify that the relfilenode column doesn't change.

If you need to change the type of a column and one of the above exceptions doesn’t apply, then the safe alternative is:

  • Add a new column new_<column>.
  • Dual write to both columns (e.g., with a BEFORE INSERT/UPDATE trigger).
  • Backfill the new column with a copy of the old column’s values.
  • Rename <column> to old_<column> and new_<column> to <column> inside a single transaction with an explicit LOCK <table> statement (see the sketch after this list).
  • Drop the old column.
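
A hedged sketch of the rename-and-swap step with hypothetical names, assuming new_amount has already been fully backfilled:

  BEGIN;
  SET LOCAL lock_timeout = '2s';
  LOCK TABLE foos IN ACCESS EXCLUSIVE MODE;
  ALTER TABLE foos RENAME COLUMN amount TO old_amount;
  ALTER TABLE foos RENAME COLUMN new_amount TO amount;
  COMMIT;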

Drop column

It goes without saying that dropping a column is something that should be done with great care. Dropping a column requires an exclusive lock on the table to update the catalog but does not rewrite the table. As long as the column isn’t in current use you can safely drop the column. It’s also important to confirm that the column is not referenced by any dependent objects that could be unsafe to drop. In particular, any indexes using the column should be dropped separately and safely with DROP INDEX CONCURRENTLY, since otherwise they will be automatically dropped along with the column under an ACCESS EXCLUSIVE lock. You can query pg_depend⁴ for any dependent objects.

Before allowing an ALTER TABLE ... DROP COLUMN ... to make its way into our production environments we require documentation showing when all references to the column were removed from the codebase. This process allows us to safely roll back to the release prior to the one that dropped the column.

Note: Dropping a column will require that you update all views, triggers, functions, etc. that rely on that column.

Index operations

Create index

The standard form of CREATE INDEX ... acquires an ACCESS EXCLUSIVE lock against the table being indexed while building the index using a single table scan. In contrast, the form CREATE INDEX CONCURRENTLY ... acquires a SHARE UPDATE EXCLUSIVE lock but must complete two table scans (and hence is somewhat slower). This lower lock level allows reads and writes to continue against the table while the index is built.
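
For example, with hypothetical names:

  CREATE INDEX CONCURRENTLY index_foos_on_created_at ON foos (created_at);
  -- If the build fails partway through, the leftover invalid index must be dropped before retrying:
  -- DROP INDEX CONCURRENTLY index_foos_on_created_at;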

Caveats:

  • Multiple concurrent index creations on a single table will not return from any of the CREATE INDEX CONCURRENTLY ... statements until the slowest one completes.
  • CREATE INDEX CONCURRENTLY ... may not be executed inside of a transaction but does maintain transactions internally. Holding a transaction open means that no autovacuum (against any table in the system) will be able to clean up dead tuples introduced after the index build begins until it finishes. If you have a table with a large volume of updates (particularly bad if it's a very small table), this can result in extremely sub-optimal query execution.
  • CREATE INDEX CONCURRENTLY ... must wait for all transactions using the table to complete before returning.

Drop index

The standard form of DROP INDEX ... acquires an ACCESS EXCLUSIVE lock against the table with the index while removing the index. For small indexes this may be a short operation. For large indexes, however, file system unlinking and disk flushing can take a significant amount of time. In contrast, the form DROP INDEX CONCURRENTLY ... acquires a SHARE UPDATE EXCLUSIVE lock to perform these operations allowing reads and writes to continue against the table while the index is dropped.

Caveats:

  • DROP INDEX CONCURRENTLY ... cannot be used to drop any index that supports a constraint (e.g., PRIMARY KEY or UNIQUE).
  • DROP INDEX CONCURRENTLY ... may not be executed inside of a transaction but does maintain transactions internally. Holding a transaction open means that no autovacuum (against any table in the system) will be able to clean up dead tuples introduced after the drop begins until it finishes. If you have a table with a large volume of updates (particularly bad if it's a very small table), this can result in extremely sub-optimal query execution.
  • DROP INDEX CONCURRENTLY ... must wait for all transactions using the table to complete before returning.

Note: DROP INDEX CONCURRENTLY ... was added in Postgres 9.2. If you're still running 9.1 or prior, you can achieve somewhat similar results by marking the index as invalid and not ready for writes, flushing buffers with the pgfincore extension, and then dropping the index.

Rename index

ALTER INDEX ... RENAME TO ... requires an ACCESS EXCLUSIVE lock on the index blocking reads from and writes to the underlying table. However a recent commit expected to be a part of Postgres 12 lowers that requirement to SHARE UPDATE EXCLUSIVE.

Reindex

REINDEX INDEX ... requires an ACCESS EXCLUSIVE lock on the index, blocking reads from and writes to the underlying table. Instead we use the following procedure (sketched after the list):

  • Create a new index concurrently that duplicates the existing index definition.
  • Drop the old index concurrently.
  • Rename the new index to match the original index’s name.
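
A hedged sketch with hypothetical names, assuming the index does not back a constraint:

  CREATE INDEX CONCURRENTLY index_foos_on_email_new ON foos (email);
  DROP INDEX CONCURRENTLY index_foos_on_email;
  ALTER INDEX index_foos_on_email_new RENAME TO index_foos_on_email;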

Note: If the index you need to rebuild backs a constraint, remember to re-add the constraint as well (subject to all of the caveats we’ve documented.)

Constraints

NOT NULL Constraints

Removing an existing not-null constraint from a column requires an exclusive lock on the table while a simple catalog update is performed.

In contrast, adding a not-null constraint to an existing column requires an exclusive lock on the table while a full table scan verifies that no null values exist. Instead you should (a concrete example follows the list):

  1. Add a CHECK constraint requiring the column be not-null with ALTER TABLE <table> ADD CONSTRAINT <name> CHECK (<column> IS NOT NULL) NOT VALID;. The NOT VALID tells Postgres that it doesn't need to scan the entire table to verify that all rows satisfy the condition.
  2. Manually verify that all rows have non-null values in your column.
  3. Validate the constraint with ALTER TABLE <table> VALIDATE CONSTRAINT <name>;. With this statement PostgreSQL will block acquisition of other EXCLUSIVE locks for the table, but will not block reads or writes.
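
A concrete version of those steps with hypothetical names:

  ALTER TABLE foos ADD CONSTRAINT foos_bar_not_null CHECK (bar IS NOT NULL) NOT VALID;
  -- ...verify (and backfill if needed) so that no NULLs remain in foos.bar...
  ALTER TABLE foos VALIDATE CONSTRAINT foos_bar_not_null;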

Bonus: There is currently a patch in the works (and possibly it will make it into Postgres 12) that will allow you to create a NOT NULL constraint without a full table scan if a CHECK constraint (like we created above) already exists.

Foreign keys

ALTER TABLE ... ADD FOREIGN KEY requires a SHARE ROW EXCLUSIVE lock (as of 9.5) on both the altered and referenced tables. While this won't block SELECT queries, blocking row modification operations for a long period of time is equally unacceptable for our transaction processing applications.

To avoid that long-held lock you can use the following process (a concrete sketch follows the list):

  • ALTER TABLE ... ADD FOREIGN KEY ... NOT VALID: Adds the foreign key and begins enforcing the constraint for all new INSERT/UPDATE statements but does not validate that all existing rows conform to the new constraint. This operation still requires SHARE ROW EXCLUSIVE locks, but the locks are only briefly held.
  • ALTER TABLE ... VALIDATE CONSTRAINT <constraint>: This operation checks all existing rows to verify they conform to the specified constraint. Validation requires only a SHARE UPDATE EXCLUSIVE lock, so it may run concurrently with row reads and modifications.
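
A concrete version of the two steps with hypothetical names:

  ALTER TABLE foos
    ADD CONSTRAINT foos_bar_id_fk
    FOREIGN KEY (bar_id) REFERENCES bars (id) NOT VALID;

  ALTER TABLE foos VALIDATE CONSTRAINT foos_bar_id_fk;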

Check constraints

ALTER TABLE ... ADD CONSTRAINT ... CHECK (...) requires an ACCESS EXCLUSIVE lock. However, as with foreign keys, Postgres supports breaking the operation into two steps:

  • ALTER TABLE ... ADD CONSTRAINT ... CHECK (...) NOT VALID: Adds the check constraint and begins enforcing it for all new INSERT/UPDATE statements but does not validate that all existing rows conform to the new constraint. This operation still requires an ACCESS EXCLUSIVE lock.
  • ALTER TABLE ... VALIDATE CONSTRAINT <constraint>: This operation checks all existing rows to verify they conform to the specified constraint. Validation requires a SHARE UPDATE EXCLUSIVE lock on the altered table, so it may run concurrently with row reads and modifications. A ROW SHARE lock is held on the referenced table, which will block any operations requiring exclusive locks while the constraint is being validated.

Uniqueness constraints

ALTER TABLE ... ADD CONSTRAINT ... UNIQUE (...) requires an ACCESS EXCLUSIVE lock. However, Postgres supports breaking the operation into two steps (a concrete sketch follows the list):

  • Create a unique index concurrently. This step will immediately enforce uniqueness, but if you need a declared constraint (or a primary key), then continue to add the constraint separately.
  • Add the constraint using the already existing index with ALTER TABLE ... ADD CONSTRAINT ... UNIQUE USING INDEX <index>. Adding the constraint still requires an ACCESS EXCLUSIVE lock, but the lock will only be held for fast catalog operations.
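
A concrete version with hypothetical names:

  CREATE UNIQUE INDEX CONCURRENTLY index_foos_on_token ON foos (token);
  ALTER TABLE foos ADD CONSTRAINT foos_token_unique UNIQUE USING INDEX index_foos_on_token;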

Note: If you specify PRIMARY KEY instead of UNIQUE, then any columns in the index that are not already NOT NULL will be made NOT NULL. This requires a full table scan, which currently can't be avoided. See NOT NULL Constraints for more details.

Exclusion constraints

ALTER TABLE ... ADD CONSTRAINT ... EXCLUDE USING ... requires an ACCESS EXCLUSIVE lock. Adding an exclusion constraint builds the supporting index, and, unfortunately, there is currently no support for using an existing index (as you can do with a unique constraint).

Enum Types

CREATE TYPE <name> AS ENUM (...) and DROP TYPE <name> (after verifying there are no existing usages in the database) can both be done safely without unexpected locking.
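
For example, with a hypothetical type:

  CREATE TYPE foo_status AS ENUM ('pending', 'settled', 'failed');
  -- ...and later, once nothing in the database references it any more:
  DROP TYPE foo_status;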

Modifying enum values

ALTER TYPE <enum> RENAME VALUE <old> TO <new> was added in Postgres 10. This statement does not require locking tables which use the enum type.

Deleting enum values

PostgreSQL does not currently support removing values from an existing enum type. Enums are stored internally as integers with no support for gaps in the valid range, so removing a value would require shifting the remaining values and rewriting all rows that use them.

Announcing Pg_ha_migrations for Ruby on Rails

We’re also excited to announce that we have open-sourced our internal library pg_ha_migrations. This Ruby gem enforces DDL safety in projects using Ruby on Rails and/or ActiveRecord with an emphasis on explicitly choosing trade-offs and avoiding unnecessary magic (and the corresponding surprises). You can read more in the project’s README.

Footnotes

[1] You can find active long-running queries and the tables they lock with the following query:

https://medium.com/media/90ee0e73f1d666273b12ad4df0ce2dfd/href
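
The embedded query is not preserved in this copy; a hedged sketch of one way to write it (not necessarily the original):

  SELECT pid,
         now() - query_start AS duration,
         relation::regclass AS locked_relation,
         mode,
         query
  FROM pg_locks
  JOIN pg_stat_activity USING (pid)
  WHERE relation IS NOT NULL
    AND now() - query_start > interval '5 seconds'
  ORDER BY duration DESC;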

[2] You can see PostgreSQL’s internal statistics about table accesses with the following query:

https://medium.com/media/571d5801ea7602f91f4c6cf98361de39/href
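
Again the embedded query is lost; a hedged sketch against a hypothetical table:

  SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd, n_tup_del
  FROM pg_stat_user_tables
  WHERE relname = 'foos';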

[3] You can see if DDL causes a relation to be rewritten by seeing if the relfilenode value changes after running the statement:

https://medium.com/media/4526240441c8b45733d22b7b04326d48/href
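
A hedged sketch (not the original embed) is to run the following before and after the DDL and compare, using a hypothetical table name:

  SELECT relname, relfilenode
  FROM pg_class
  WHERE relname = 'foos';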

[4] You can find objects (e.g., indexes) that depend on a specific column by running the statement:

https://medium.com/media/5dfc04bc00958bcd9a44f3568c73a49c/href
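
One hedged way to write such a query (not necessarily the original embed), for a hypothetical table foos and column bar:

  SELECT classid::regclass AS catalog, objid, deptype
  FROM pg_depend
  WHERE refobjid = 'foos'::regclass
    AND refobjsubid = (SELECT attnum
                       FROM pg_attribute
                       WHERE attrelid = 'foos'::regclass
                         AND attname = 'bar');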

PostgreSQL at Scale: Database Schema Changes Without Downtime was originally published in Braintree Product and Technology on Medium.

16 Jan 22:50

Missal of Silos

Welcome to Wyoming, motto "We'd like to clarify that Cheyenne Mountain is in Colorado."
16 Dec 01:12

Laptop Issues


Computiŋ is becomiŋ unſuſtainable.

Hang on, we got a call from the feds. They say we can do whatever with him, but the EPA doesn't want that laptop in the ocean. They're sending a team.