PostgreSQL數據庫隨機取數據,數據抽樣

  • 對于數據量比較大的表,不要使用下面的這種方法,會比較慢

select * from review order by random() limit 10;

會全表掃描,很慢

  • SYSTEM抽樣方法

SELECT ...

FROM table_name

TABLESAMPLE sampling_method (argument [, ...]) [REPEATABLE (seed)]

sampling_method 抽樣方法,主要有兩種,system和bernoulli,argument是指抽樣百分比,

舉個栗子


select id from review tablesample system(0.0001);

[email protected]:5432=#explain analyze select id from review tablesample system(0.0001);

QUERY PLAN

----------------------------------------------------------------------------------------------------

Sample Scan on review (cost=0.00..4.06 rows=6 width=4) (actual time=3.328..3.899 rows=10 loops=1)

Sampling: system ('0.0001'::real)

Planning time: 0.110 ms

Execution time: 3.920 ms

(4 rows)

需要指出一條的是,system抽樣方式是隨機抽取表上的數據塊的數據,理論上每個數據塊被檢索的概率是一樣的,system抽樣方式基于數據塊級別,被選中的快上面的所以數據將被檢索, 所以可能會出現每次隨機抽取的數據條數是不一樣的,甚至可能返回0條,可能是抽取到了一個沒有數據的數據塊

關于數據塊檢索我們可以通過下面的例子看到,ctid是數據塊的值,前面的那個是數據庫編號,后面那個是該條數據在數據塊上面的編號


[email protected]:5432=#select ctid, id from review tablesample system(0.0001);

ctid | id

------+----

(0 rows)

[email protected]:5432=#select ctid, id from review tablesample system(0.0001);

ctid | id

------------+---------

(289034,1) | 4131829

(289034,2) | 4131830

(289034,3) | 4131853

(289034,4) | 4131854

(289034,5) | 4924212

(289034,6) | 5142532

(6 rows)

[email protected]:5432=#select ctid, id from review tablesample system(0.0001);

ctid | id

------+----

(0 rows)

[email protected]:5432=#select ctid, id from review tablesample system(0.0001);

ctid | id

-------------+---------

(368648,1) | 5006567

(368648,2) | 5164392

(368648,3) | 4844708

(368648,4) | 4844524

(368648,5) | 4844526

(368648,6) | 4844621

(368648,7) | 4844748

(368648,8) | 4844614

(368648,9) | 4844678

(368648,10) | 4844530

(368648,11) | 4844532

(11 rows)
  • BERNOULLI抽樣方式

BERNOULLI抽樣方式,是基于數據行的,所以相對來說,每次返回的數量相對比較一致,相差不多,這種抽取方式比SYSTEM抽樣方式更加隨機一些,但是相對性能要低很多,


[email protected]:5432=#explain analyze select ctid, id from review tablesample bernoulli(0.0001);

QUERY PLAN

--------------------------------------------------------------------------------------------------------------

Sample Scan on review (cost=0.00..429583.06 rows=6 width=10) (actual time=136.399..1524.570 rows=6 loops=1)

Sampling: bernoulli ('0.0001'::real)

Planning time: 0.113 ms

Execution time: 1524.615 ms

(4 rows)

然后由于BERNOULLI抽樣方式是基于數據行的,所以一般而言,抽樣得到的數據應該位于不同的數據塊


[email protected]:5432=#select ctid, id from review tablesample bernoulli(0.0001);

ctid | id

-------------+---------

(82429,15) | 1419928

(131678,12) | 2159220

(273168,2) | 3965564

(344546,3) | 4675681

(4 rows)
  • 總結

SYSTEM抽樣方式適合抽樣效率優先,針對大數據量啥的;

BERNOULLI適合隨機性優先的場景

你可能感興趣的

广东25选5开奖结果