PostgreSQL 12 : Where in copy from

5 days ago a commit made it to 12devel that implements WHERE clause in COPY FROM.
Today we’re gonna see how it works and see how someone could achieve the same by using file_fdw.

To begin with, lets create a table and put some data in it.

create table test (
id int,
date timestamp without time zone,
);

insert into test (id,date) select generate_series (1,1000000)            ,’2015-01-01′;
insert into test (id,date) select generate_series (1000001,2000000),’2016-01-01′;

insert into test (id,date) select generate_series (2000001,3000000),’2017-01-01′;
insert into test (id,date) select generate_series (3000001,4000000),’2018-01-01′;
insert into test (id,date) select generate_series (4000001,5000000),’2019-01-01′;
now, lets make this a bit larger than 170MB, dump the data in csv and truncate the table :
monkey=# insert into test select * from test;
INSERT 0 5000000
monkey=# insert into test select * from test;
INSERT 0 10000000
monkey=# select count(*) from test ;
  count
———-
 20000000
(1 row)
monkey=# copy test to ‘/home/vasilis/test.csv’ with csv;
COPY 20000000
monkey=# truncate test;
TRUNCATE TABLE
vasilis@Wrath > ls -lh ~/test.csv
-rw-r–r– 1 vasilis vasilis 759M Jan 25 12:24 /home/vasilis/test.csv
Our test file is about 750Mb, now with an empty table , lets import only the rows that are up to 2 years old :
monkey=# copy test from ‘/home/vasilis/test.csv’ with csv where date >= now() – INTERVAL ‘2 year’ ;
COPY 8000000
Time: 17770.267 ms (00:17.770)

It worked , awesome !
Now, lets compare to the alternative, file_fdw :
monkey=# CREATE EXTENSION file_fdw;
CREATE EXTENSION
Time: 7.288 ms
monkey=# CREATE SERVER pgtest FOREIGN DATA WRAPPER file_fdw;
CREATE SERVER
Time: 3.957 ms
monkey=# create foreign table file_test (id int, date timestamp without time zone, name text) server pgtest options (filename ‘/home/vasilis/test.csv’, format ‘csv’);
CREATE FOREIGN TABLE
Time: 16.463 ms
monkey=# truncate test ;
TRUNCATE TABLE
Time: 15.123 ms

monkey=# insert into test select * from file_test where date >= now() – INTERVAL ‘2 year’ ;
INSERT 0 8000000
Time: 21368.595 ms (00:21.369)
I automated the test, and copy was (obviously) always faster. I’d imagine that the key benefit here is not performance, which is inherited by COPY itself but extending the functionality of COPY, which is good, especially for workloads that export data , ETL and load to another DB, or for restoring specific rows from backups (assuming pg_dump backups is still a thing).
Thanks for reading
Vasilis Ventirozos
Credativ

Source: eVOL Monkey

Leave a Reply

Your email address will not be published. Required fields are marked *