
PostgreSQL Techniques: Indexes

To query data more efficiently, we should consider using indexes.

In PostgreSQL, an index can be created with the following syntax:
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON [ ONLY ] table_name [ USING method ]
( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ INCLUDE ( column_name [, ...] ) ]
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace_name ]
[ WHERE predicate ]

1. btree

Introduction

A btree index is applicable in the vast majority of scenarios. It can be used for range lookups and supports the following operators:

  • >, >=, =, <, <=
  • like
  • between
  • in

Btree indexes also support multi-column indexes, index-only scans, covering indexes (via INCLUDE), and unique constraints.

However, a btree index cannot handle very long keys; otherwise you will see: [54000] ERROR: index row requires 21792 bytes, maximum size is 8191

Usage

Create table and data
create table test_btree
(
key varchar(32),
value int
);
insert into test_btree
select md5(random()::text), (random() * 1000000)::int
from generate_series(1, 1000000);

Create index
create index test_btree_key on test_btree (key) include (value);

Example query
select sum(value)
from test_btree
where key in ('323816e2d4bb47d85d669b90089a2382', 'f3180734688c85d1cf5b173b261e9521');

Query plan
Aggregate  (cost=8.89..8.90 rows=1 width=8) (actual time=0.024..0.024 rows=1 loops=1)
  ->  Index Only Scan using test_btree_key on test_btree  (cost=0.42..8.88 rows=2 width=4) (actual time=0.016..0.021 rows=2 loops=1)
        Index Cond: (key = ANY ('{323816e2d4bb47d85d669b90089a2382,f3180734688c85d1cf5b173b261e9521}'::text[]))
        Heap Fetches: 0
Planning Time: 0.052 ms
Execution Time: 0.038 ms

2. hash

Introduction

A hash index is suitable when the indexed values are very long. It supports only equality lookups.

Usage

Create table
create table test_hash
(
key text,
value int
);
insert into test_hash
select md5(random()::text), (random() * 1000000)::int
from generate_series(1, 10000);

Create index
create index test_hash_key on test_hash using hash (key);

Example query
select key, value
from test_hash
where key = '70913971e3676ba202faba804a12f9bd';

Query plan
Index Scan using test_hash_key on test_hash  (cost=0.00..8.02 rows=1 width=37) (actual time=0.010..0.011 rows=1 loops=1)
  Index Cond: (key = '70913971e3676ba202faba804a12f9bd'::text)
Planning Time: 0.037 ms
Execution Time: 0.021 ms

3. GiST

Introduction

GiST stands for Generalized Search Tree. It is a balanced, tree-structured access method that can be used to solve geometric problems such as spatial and range queries. For various data types it supports the following operators (reconstructed here from the PostgreSQL documentation):

Name  Indexed Data Type  Indexable Operators  Ordering Operators
box_ops  box  && &> &< &<| >> << <<| <@ @> @ |&> |>> ~ ~=
circle_ops  circle  && &> &< &<| >> << <<| <@ @> @ |&> |>> ~ ~=
inet_ops  inet, cidr  && >> >>= > >= <> << <<= < <= =
point_ops  point  >> >^ << <@ <@ <@ <^ ~=  <->
poly_ops  polygon  && &> &< &<| >> << <<| <@ @> @ |&> |>> ~ ~=
range_ops  any range type  && &> &< >> << <@ -|- = @> @>
tsquery_ops  tsquery  <@ @>
tsvector_ops  tsvector  @@

Usage

Create table
create table test_gist
(
range int4range,
value int,
exclude using gist (range with &&)
);
insert into test_gist
with a as (
select (random() * 1000)::int v
from generate_series(1, 100)
)
select int4range(v, v + 11), (random() * 100)::int
from a
on conflict do nothing;

Example query
select range, value
from test_gist
where range @> 400;

Query plan
Bitmap Heap Scan on test_gist  (cost=4.19..13.66 rows=6 width=36) (actual time=0.049..0.049 rows=1 loops=1)
  Recheck Cond: (range @> 400)
  Heap Blocks: exact=1
  ->  Bitmap Index Scan on test_gist_range_excl  (cost=0.00..4.19 rows=6 width=0) (actual time=0.009..0.009 rows=1 loops=1)
        Index Cond: (range @> 400)
Planning Time: 0.142 ms
Execution Time: 0.068 ms

4. SP-GiST

Introduction

SP-GiST is short for Space-Partitioned GiST. Its use cases are similar to GiST's. The SP-GiST operators it supports for two-dimensional points are:

  • <<, is strictly left of
  • >>, is strictly right of
  • ~=, is the same as
  • <@, is contained in or on
  • <^, is below (allows touching)
  • >^, is above (allows touching)

Usage

Create table
create table test_spgist
(
key point,
value int
);
insert into test_spgist
select point(random() * 1000, random() * 1000), (random() * 10000)::int
from generate_series(1, 1000000);

Create index
create index test_spgist_key on test_spgist using spgist (key);

Example query
select key, value
from test_spgist
where key >> point(100, 100);

Query plan
Bitmap Heap Scan on test_spgist  (cost=2951.28..10571.28 rows=100000 width=20) (actual time=64.614..119.100 rows=899566 loops=1)
  Recheck Cond: (key >> '(100,100)'::point)
  Heap Blocks: exact=6370
  ->  Bitmap Index Scan on test_spgist_key  (cost=0.00..2926.28 rows=100000 width=0) (actual time=64.098..64.099 rows=899566 loops=1)
        Index Cond: (key >> '(100,100)'::point)
Planning Time: 0.048 ms
Execution Time: 135.882 ms

5. GIN

介绍

GIN为通用倒排索引, 用于处理索引项为组合值的情况. 比如所搜包含特定词的文档, 或拥有特定值的数组. 对于一维数组, 它支持以下的操作符

  • <@, 被包含
  • @>, 包含
  • =, 等于
  • &&, 重叠

Usage

Create table
create table test_gin
(
uuid uuid,
tags integer[]
);
insert into test_gin
select gen_random_uuid(), array [(random() * 100)::int,(random() * 100)::int,(random() * 100)::int]
from generate_series(1, 1000);

Create index
create index test_gin_tags on test_gin using gin (tags);

Example query
select uuid, tags
from test_gin
where tags @> array [10, 2];

Query plan
Bitmap Heap Scan on test_gin  (cost=12.00..16.02 rows=1 width=49) (actual time=0.018..0.019 rows=1 loops=1)
  Recheck Cond: (tags @> '{10,2}'::integer[])
  Heap Blocks: exact=1
  ->  Bitmap Index Scan on test_gin_tags  (cost=0.00..12.00 rows=1 width=0) (actual time=0.012..0.012 rows=1 loops=1)
        Index Cond: (tags @> '{10,2}'::integer[])
Planning Time: 0.078 ms
Execution Time: 0.065 ms

6. BRIN

Introduction

A BRIN index (Block Range Index) maintains summary information about ranges of data blocks. Compared with a btree index, a BRIN index takes far less space. It is suited to exact and range queries on columns that correlate strongly with the physical row order.
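The idea can be illustrated outside the database. The Go sketch below (all names are mine, not PostgreSQL's) keeps only a min/max summary per fixed-size block of a physically ordered column, and shows how a range query narrows the scan down to a few candidate blocks:

```go
package main

import "fmt"

// A BRIN-style summary: for each block of rows, store only the minimum
// and maximum key present in that block.
type blockRange struct{ min, max int }

// buildSummaries splits keys into fixed-size blocks and records min/max.
func buildSummaries(keys []int, blockSize int) []blockRange {
	var sums []blockRange
	for i := 0; i < len(keys); i += blockSize {
		end := i + blockSize
		if end > len(keys) {
			end = len(keys)
		}
		br := blockRange{keys[i], keys[i]}
		for _, k := range keys[i:end] {
			if k < br.min {
				br.min = k
			}
			if k > br.max {
				br.max = k
			}
		}
		sums = append(sums, br)
	}
	return sums
}

// candidateBlocks returns the blocks that may contain values in [lo, hi];
// only these blocks would need to be read and rechecked.
func candidateBlocks(sums []blockRange, lo, hi int) []int {
	var out []int
	for i, s := range sums {
		if s.max >= lo && s.min <= hi {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	keys := make([]int, 1000) // keys correlated with physical order
	for i := range keys {
		keys[i] = i
	}
	sums := buildSummaries(keys, 100)
	fmt.Println(candidateBlocks(sums, 250, 260)) // [2]
}
```

When the column is uncorrelated with physical order, every block's min/max range overlaps the query and the summary filters out nothing, which is why linear correlation matters.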

Usage

Create table
create table test_brin
(
key int,
value int
);
insert into test_brin
select g, (random() * 1000000)::int
from generate_series(1, 1000000) g;

Create index
create index test_brin_key on test_brin (key) include (value);

Example query
select key, value
from test_brin
where key between 11100 and 11150;

Query plan
Index Only Scan using test_brin_key on test_brin  (cost=0.42..5.42 rows=50 width=8) (actual time=0.014..0.018 rows=51 loops=1)
  Index Cond: ((key >= 11100) AND (key <= 11150))
  Heap Fetches: 0
Planning Time: 0.046 ms
Execution Time: 0.029 ms

7. bloom

Introduction

Although a bloom index supports only equality queries, a single multi-column bloom index can efficiently serve equality queries on any combination of the indexed columns.

Usage

Create table
create table test_bloom as
select md5(random()::text) as i1,
md5(random()::text) as i2,
md5(random()::text) as i3
from generate_series(1, 1000000);

Create index
create extension if not exists bloom;
create index test_bloom_all on test_bloom using bloom (i1, i2, i3);

Example query
select i1,i2,i3
from test_bloom
where i2 = '2efeb41703124b0c5a3ac7cfbc86029f';

Query plan
Bitmap Heap Scan on test_bloom  (cost=15348.00..15352.01 rows=1 width=99) (actual time=5.933..8.660 rows=1 loops=1)
  Recheck Cond: (i2 = '2efeb41703124b0c5a3ac7cfbc86029f'::text)
  Rows Removed by Index Recheck: 4380
  Heap Blocks: exact=3890
  ->  Bitmap Index Scan on test_bloom_all  (cost=0.00..15348.00 rows=1 width=0) (actual time=5.615..5.616 rows=4381 loops=1)
        Index Cond: (i2 = '2efeb41703124b0c5a3ac7cfbc86029f'::text)
Planning Time: 0.039 ms
Execution Time: 8.678 ms
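The "Rows Removed by Index Recheck" line is characteristic of bloom indexes: the index stores a small lossy signature per row, so index hits can be false positives that must be rechecked against the table. A minimal Bloom-filter sketch in Go (my own toy code, not PostgreSQL's actual signature scheme) shows the same one-sided behavior, possible false positives but never false negatives:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter: k hash probes set/test bits in a
// fixed-size bitmap.
type bloom struct {
	bits []bool
	k    int
}

func newBloom(m, k int) *bloom { return &bloom{bits: make([]bool, m), k: k} }

// positions derives k bit positions for s by salting one hash function.
func (b *bloom) positions(s string) []int {
	out := make([]int, b.k)
	for i := 0; i < b.k; i++ {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d:%s", i, s)
		out[i] = int(h.Sum64() % uint64(len(b.bits)))
	}
	return out
}

func (b *bloom) add(s string) {
	for _, p := range b.positions(s) {
		b.bits[p] = true
	}
}

// mayContain reports whether s might be in the set; a true result still
// requires a recheck, just like the bitmap heap scan above.
func (b *bloom) mayContain(s string) bool {
	for _, p := range b.positions(s) {
		if !b.bits[p] {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 3)
	f.add("2efeb41703124b0c5a3ac7cfbc86029f")
	fmt.Println(f.mayContain("2efeb41703124b0c5a3ac7cfbc86029f")) // true
}
```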

PostgreSQL Techniques: IP Range Indexes

This article is based on my own experience, but it also draws heavily on one other article.

Scenario

In business applications you sometimes need to look up the city, latitude/longitude, carrier, and other information for an IP address. How can that lookup be made fast?

As an example, take IP2LOCATION's free LITE IP-COUNTRY Database, which provides the start address of each IP range, the end address, the country code, and the country name.

If you have trouble downloading it, you can click here.

Scheme 1: btree index on int8

We follow the PostgreSQL table definition suggested on the official download page:
CREATE TABLE ip2location_db1_test1(
ip_from bigint NOT NULL,
ip_to bigint NOT NULL,
country_code character(2) NOT NULL,
country_name character varying(64) NOT NULL,
CONSTRAINT ip2location_db1_pkey PRIMARY KEY (ip_from, ip_to)
);
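The ip_from and ip_to columns store IPv4 addresses as plain integers. A small Go sketch of that conversion (the function name is mine; it assumes a valid IPv4 string):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// ipv4ToUint32 converts a dotted-quad IPv4 address to the integer
// representation used by the ip_from/ip_to columns.
func ipv4ToUint32(s string) uint32 {
	ip := net.ParseIP(s).To4()
	return binary.BigEndian.Uint32(ip)
}

func main() {
	fmt.Println(ipv4ToUint32("1.0.0.0")) // 16777216
	fmt.Println(ipv4ToUint32("8.8.8.8")) // 134744072
}
```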

Benchmark

Script file: ./test-sql1.sql
\set ip random(0, 2147483647)
select *
from ip2location_db1_test1
where ip_from <= :ip
and ip_to >= :ip;

Run the test:
pgbench -M simple -c 4 -j 4 -f ./test-sql1.sql -n -T 30 -h localhost -U postgres test

Results
transaction type: ./test-sql1.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 30 s
number of transactions actually processed: 39007
latency average = 3.077 ms
tps = 1299.848793 (including connections establishing)
tps = 1299.933199 (excluding connections establishing)

Scheme 2: btree index on int4

To make the columns smaller, we modify scheme 1 to use int4 (int4 is simply int). Because int4 is signed, the unsigned 32-bit address must be shifted down by 2147483648 to fit.

Create table:
create table ip2location_db1_test2
(
ip_from int not null,
ip_to int not null,
country_code char(2) not null,
country_name varchar(64) not null
);
create index ip2location_db1_test2_pkey on ip2location_db1_test2 using btree (ip_from, ip_to);

Benchmark

Script file: ./test-sql2.sql
\set ip random(0, 2147483647)
select *
from ip2location_db1_test2
where ip_from <= :ip - 2147483648
and ip_to >= :ip - 2147483648;
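The `- 2147483648` in the script maps the unsigned 32-bit address space onto the signed int4 range while preserving order; the same shift must be applied when loading ip_from and ip_to. A Go sketch of the mapping (the function name is mine):

```go
package main

import "fmt"

// toInt4 maps an unsigned 32-bit IPv4 integer onto PostgreSQL's signed
// int4 range by subtracting 2^31. The mapping is order-preserving, so
// range comparisons still work.
func toInt4(ip uint32) int32 {
	return int32(int64(ip) - 2147483648)
}

func main() {
	fmt.Println(toInt4(0))          // -2147483648
	fmt.Println(toInt4(4294967295)) // 2147483647
}
```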

Run the test:
pgbench -M simple -c 4 -j 4 -f ./test-sql2.sql -n -T 30 -h localhost -U postgres test

Results
transaction type: ./test-sql2.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 30 s
number of transactions actually processed: 173751
latency average = 0.691 ms
tps = 5791.490212 (including connections establishing)
tps = 5791.971583 (excluding connections establishing)

Scheme 3: GiST index on a custom iprange

We use a GiST index over a custom range type, iprange.

Create table:
create type iprange as range
(
subtype=inet
);
create table ip2location_db1_test3
(
ip_range iprange not null,
country_code char(2) not null,
country_name varchar(64) not null
);
create index ip2location_db1_test3_key on ip2location_db1_test3 using gist (ip_range);

Benchmark

Script file: ./test-sql3.sql
\set ip random(0, 2147483647)
select *
from ip2location_db1_test3
where ip_range @> '0.0.0.0'::inet + :ip;

Run the test:
pgbench -M simple -c 4 -j 4 -f ./test-sql3.sql -n -T 30 -h localhost -U postgres test

Results
transaction type: ./test-sql3.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 30 s
number of transactions actually processed: 198253
latency average = 0.605 ms
tps = 6608.167036 (including connections establishing)
tps = 6608.692985 (excluding connections establishing)

Scheme 4: GiST index on int8range

We use a GiST index, this time over int8range.

Create table:
create table ip2location_db1_test4
(
ip_range int8range not null,
country_code character(2) not null,
country_name character varying(64) not null
);
create index ip2location_db1_test4_range on ip2location_db1_test4 using gist (ip_range);

Benchmark

Script file: ./test-sql4.sql
\set ip random(0, 2147483647)
select *
from ip2location_db1_test4
where ip_range @> :ip;

Run the test:
pgbench -M simple -c 4 -j 4 -f ./test-sql4.sql -n -T 30 -h localhost -U postgres test

Results
transaction type: ./test-sql4.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 30 s
number of transactions actually processed: 1070154
latency average = 0.112 ms
tps = 35671.455573 (including connections establishing)
tps = 35674.666859 (excluding connections establishing)

Scheme 5: GiST index on int4range

We improve on scheme 4 by replacing int8range with int4range.

Create table:
create table ip2location_db1_test5
(
ip_range int4range not null,
country_code character(2) not null,
country_name character varying(64) not null
);
create index ip2location_db1_test5_range on ip2location_db1_test5 using gist (ip_range);

Benchmark

Script file: ./test-sql5.sql
\set ip random(0, 2147483647)
select *
from ip2location_db1_test5
where ip_range @> (:ip - 2147483648)::int4;

Run the test:
pgbench -M simple -c 4 -j 4 -f ./test-sql5.sql -n -T 30 -h localhost -U postgres test

Results
transaction type: ./test-sql5.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 30 s
number of transactions actually processed: 980291
latency average = 0.122 ms
tps = 32675.974742 (including connections establishing)
tps = 32678.161957 (excluding connections establishing)

Scheme 6: GiST index on box

The PostgreSQL wiki has a page suggesting this approach; following its advice:

Create table:
create table ip2location_db1_test6
(
ip_range box not null,
country_code character(2) not null,
country_name character varying(64) not null
);
create index ip2location_db1_test6_range on ip2location_db1_test6 using gist (ip_range);

Benchmark

Script file: ./test-sql6.sql
\set ip random(0, 2147483647)
select *
from ip2location_db1_test6
where ip_range @> box(point(:ip, :ip), point(:ip, :ip));

Run the test:
pgbench -M simple -c 4 -j 4 -f ./test-sql6.sql -n -T 30 -h localhost -U postgres test

Results
transaction type: ./test-sql6.sql
scaling factor: 1
query mode: simple
number of clients: 4
number of threads: 4
duration: 30 s
number of transactions actually processed: 1218063
latency average = 0.099 ms
tps = 40600.765922 (including connections establishing)
tps = 40604.364144 (excluding connections establishing)

Data and index size

Finally, let's compare the disk space used by data and indexes under each scheme.

Query
SELECT table_name                                               "table name",
pg_size_pretty(pg_relation_size(table_name::text)) "data size",
pg_size_pretty(pg_indexes_size(table_name::text)) "index size",
pg_size_pretty(pg_total_relation_size(table_name::text)) "total size"
FROM information_schema.tables
WHERE table_schema = 'public'
AND table_type = 'BASE TABLE'
and table_name like 'ip2location_db1_%';

Results

Table name  Data size  Index size  Total size
ip2location_db1_test1 12 MB 10 MB 22 MB
ip2location_db1_test2 10 MB 4080 kB 14 MB
ip2location_db1_test3 12 MB 14 MB 26 MB
ip2location_db1_test4 13 MB 14 MB 27 MB
ip2location_db1_test5 11 MB 11 MB 22 MB
ip2location_db1_test6 14 MB 18 MB 32 MB

Summary

Since scheme 6 has the highest query performance, we ultimately chose it. Its data and index take more space, but as the IP information table grows wider that disadvantage becomes negligible.

PostgreSQL Techniques: Window Functions

Today I continue sharing PostgreSQL techniques.

Feature definition

Window functions were defined as early as SQL:2003. A window function call applies an aggregate-like function across some portion of the rows selected by a query. Unlike a non-window aggregate call, this does not force the selected rows to be grouped into a single output row; each row remains separate in the query output. However, the window function can access all rows in the current row's partition, as declared by its PARTITION BY list. The syntax of a window function call is one of the following:
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ([expression [, expression ... ]]) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER window_name
function_name ( * ) [ FILTER ( WHERE filter_clause ) ] OVER ( window_definition )

where the syntax of window_definition is:
[ existing_window_name ]
[ PARTITION BY expression [, ...] ]
[ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | LAST } ] [, ...] ]
[ frame_clause ]

The optional frame_clause is one of the following:
{ RANGE | ROWS | GROUPS } frame_start [ frame_exclusion ]
{ RANGE | ROWS | GROUPS } BETWEEN frame_start AND frame_end [ frame_exclusion ]

where frame_start and frame_end can take one of the following forms:
UNBOUNDED PRECEDING
offset PRECEDING
CURRENT ROW
offset FOLLOWING
UNBOUNDED FOLLOWING

frame_exclusion can be one of the following:
EXCLUDE CURRENT ROW
EXCLUDE GROUP
EXCLUDE TIES
EXCLUDE NO OTHERS

For a more detailed description, see the PostgreSQL manual.

Use cases

The following queries use the data defined here:
-- Employee salary table
create table test_salary
(
dep_name varchar(64), -- department name
emp_no integer, -- employee number
salary numeric -- salary
);

-- Insert data
insert into test_salary
values ('develop', 11, 5200),
('develop', 7, 4200),
('develop', 9, 4500),
('develop', 8, 6000),
('develop', 10, 5200),
('personnel', 5, 3500),
('personnel', 2, 3900),
('sales', 3, 4800),
('sales', 1, 5000),
('sales', 4, 4800);

Average salary of each employee's department

Query:
select *, (avg(salary) over (partition by dep_name))::int dep_avg_salary
from test_salary;

Result:

dep_name emp_no salary dep_avg_salary
develop 11 5200 5020
develop 7 4200 5020
develop 9 4500 5020
develop 8 6000 5020
develop 10 5200 5020
personnel 5 3500 3700
personnel 2 3900 3700
sales 3 4800 4867
sales 1 5000 4867
sales 4 4800 4867

Each employee's gap from the top salary in their department

Query:
select *, max(salary) over (partition by dep_name) - salary  diff_dep_max_salary
from test_salary;

Result:

dep_name emp_no salary diff_dep_max_salary
develop 11 5200 800
develop 7 4200 1800
develop 9 4500 1500
develop 8 6000 0
develop 10 5200 800
personnel 5 3500 400
personnel 2 3900 0
sales 3 4800 200
sales 1 5000 0
sales 4 4800 200

Each employee's gap to the nearest higher salary in their department

Query:
select *, min(salary) over w - salary diff_higher_salary
from test_salary
window w as (partition by dep_name order by salary desc groups unbounded preceding exclude group);

Result:

dep_name emp_no salary diff_higher_salary
develop 8 6000 NULL
develop 10 5200 800
develop 11 5200 800
develop 9 4500 700
develop 7 4200 300
personnel 2 3900 NULL
personnel 5 3500 400
sales 1 5000 NULL
sales 3 4800 200
sales 4 4800 200

Each department's average salary and its rank

Query:
select rank() over (order by avg(salary) desc ), dep_name, avg(salary)::int dep_avg_salary
from test_salary
group by dep_name;

Result:

rank dep_name dep_avg_salary
1 develop 5020
2 sales 4867
3 personnel 3700

Average salary of the two employees with adjacent employee numbers

Query:
select *, avg(salary) over w adj_emp_avg_salary
from test_salary
window w as (order by emp_no range between 1 preceding and 1 following exclude current row );

Result:

dep_name emp_no salary adj_emp_avg_salary
sales 1 5000 3900
personnel 2 3900 4900
sales 3 4800 4350
sales 4 4800 4150
personnel 5 3500 4800
develop 7 4200 6000
develop 8 6000 4350
develop 9 4500 5600
develop 10 5200 4850
develop 11 5200 5200
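The frame logic can be checked by hand. This Go sketch (all names are mine) recomputes adj_emp_avg_salary: for each row it averages the salaries of rows whose emp_no lies within 1 of the current row's, excluding the current row itself:

```go
package main

import "fmt"

type emp struct {
	no     int
	salary float64
}

// adjAvg emulates: avg(salary) over (order by emp_no
// range between 1 preceding and 1 following exclude current row).
func adjAvg(rows []emp, i int) float64 {
	sum, n := 0.0, 0
	for j, r := range rows {
		if j == i {
			continue // EXCLUDE CURRENT ROW
		}
		d := r.no - rows[i].no
		if d >= -1 && d <= 1 { // RANGE 1 PRECEDING .. 1 FOLLOWING
			sum += r.salary
			n++
		}
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	rows := []emp{{1, 5000}, {2, 3900}, {3, 4800}, {4, 4800}, {5, 3500},
		{7, 4200}, {8, 6000}, {9, 4500}, {10, 5200}, {11, 5200}}
	for i, r := range rows {
		fmt.Printf("emp %d: %.0f\n", r.no, adjAvg(rows, i))
	}
}
```

Note that RANGE frames are defined over the ORDER BY value (emp_no), not over row positions, which is why employee 5 averages only employee 4 (there is no employee 6).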

PostgreSQL Techniques: CTEs and Recursion

I have been working with PostgreSQL for over a year now. To share that experience, I will be writing a series of related articles.

Feature definition

Common table expressions and recursive queries were defined as early as SQL:1999. The part beginning with WITH is called a common table expression (CTE). It provides a way to write auxiliary statements for use in a larger query, and can be thought of as a temporary table defined for a single query, breaking a complex query into simpler parts. Although the auxiliary statement in WITH can be an INSERT, DELETE, UPDATE, or SELECT, in practice SELECT is the most common.

Adding the RECURSIVE modifier to a CTE enables recursion. (Strictly speaking the processing is iteration rather than recursion, but RECURSIVE is the terminology chosen by the SQL standards committee.) Structurally, it takes the following form:
WITH RECURSIVE cte_name AS (
CTE_query_definition -- non-recursive term
UNION [ALL]
CTE_query_definition -- recursive term
) SELECT * FROM cte_name;

How a recursive query is evaluated

  1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.

  2. As long as the working table is not empty, repeat these steps:

2.1 Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.

2.2 Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
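The procedure above can be simulated directly. This Go sketch (my own illustration) runs the working-table loop for a non-recursive term of 1 and a recursive term that produces num + 1 while num < limit, then sums the rows produced:

```go
package main

import "fmt"

// recursiveCTESum mimics the evaluation procedure above: a working
// table is repeatedly replaced by the rows the recursive term produces
// from it, and every row produced goes into the overall result.
func recursiveCTESum(limit int) int {
	var result []int
	work := []int{1} // step 1: the non-recursive term
	for len(work) > 0 {
		result = append(result, work...)
		var next []int // the intermediate table
		for _, num := range work {
			if num < limit { // step 2.1: the recursive term
				next = append(next, num+1)
			}
		}
		work = next // step 2.2: working table := intermediate table
	}
	sum := 0
	for _, n := range result {
		sum += n
	}
	return sum
}

func main() {
	fmt.Println(recursiveCTESum(100)) // 5050
}
```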

Use cases

Summing 1 to 100

The obvious way is generate_series:
select sum(num)
from generate_series(1, 100) nums(num);

But it can also be done with a recursive CTE:
with recursive num_generate(num) as (
values (1)
union all
select num + 1
from num_generate
where num < 100
)
select sum(num)
from num_generate;

Speeding up distinct-value queries

Consider the following scenario: we have one million rows of log data falling into roughly a hundred categories.
-- We use postgresql 13

-- Create the table
create table test_data
(
category_id int, -- category ID
log_id uuid -- log ID
);

-- Create an index on the category ID
create index idx_category on test_data using btree (category_id);

-- Insert test data
insert into test_data
select (random() * 100)::int, gen_random_uuid()
from generate_series(1, 1000000);

How can we quickly list every category? Common approaches include the following.

distinct query

Query:
select distinct category_id
from test_data;

Plan and timing:
HashAggregate  (cost=18870.00..18871.00 rows=100 width=4) (actual time=155.989..155.997 rows=100 loops=1)
  Group Key: category_id
  Batches: 1  Memory Usage: 24kB
  ->  Seq Scan on test_data  (cost=0.00..16370.00 rows=1000000 width=4) (actual time=0.006..39.345 rows=1000000 loops=1)
Planning Time: 0.041 ms
Execution Time: 156.018 ms

group by query

Query:
select category_id
from test_data
group by 1;

Plan and timing:
Group  (cost=12582.68..12606.51 rows=100 width=4) (actual time=58.393..60.113 rows=100 loops=1)
  Group Key: category_id
  ->  Gather Merge  (cost=12582.68..12606.01 rows=200 width=4) (actual time=58.392..60.095 rows=300 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=11582.66..11582.91 rows=100 width=4) (actual time=55.155..55.163 rows=100 loops=3)
              Sort Key: category_id
              Sort Method: quicksort  Memory: 29kB
              Worker 0:  Sort Method: quicksort  Memory: 29kB
              Worker 1:  Sort Method: quicksort  Memory: 29kB
              ->  Partial HashAggregate  (cost=11578.33..11579.33 rows=100 width=4) (actual time=55.126..55.134 rows=100 loops=3)
                    Group Key: category_id
                    Batches: 1  Memory Usage: 24kB
                    Worker 0:  Batches: 1  Memory Usage: 24kB
                    Worker 1:  Batches: 1  Memory Usage: 24kB
                    ->  Parallel Seq Scan on test_data  (cost=0.00..10536.67 rows=416667 width=4) (actual time=0.006..15.642 rows=333333 loops=3)
Planning Time: 0.040 ms
Execution Time: 60.135 ms

This is faster than distinct mainly because it uses parallel query. We can temporarily lower the parallel degree:
set max_parallel_workers_per_gather = 1

Plan and timing:
HashAggregate  (cost=18870.00..18871.00 rows=100 width=4) (actual time=158.472..158.480 rows=100 loops=1)
  Group Key: category_id
  Batches: 1  Memory Usage: 24kB
  ->  Seq Scan on test_data  (cost=0.00..16370.00 rows=1000000 width=4) (actual time=0.007..39.888 rows=1000000 loops=1)
Planning Time: 0.040 ms
Execution Time: 158.499 ms

Recursive query

Query:
with recursive rcs(category_id) as (
select min(category_id)
from test_data
union all
select (select min(test_data.category_id) from test_data where test_data.category_id > r1.category_id)
from rcs r1
where r1.category_id is not null
)
select count(*)
from rcs
where category_id is not null;

Plan and timing:
Aggregate  (cost=52.59..52.60 rows=1 width=8) (actual time=0.361..0.363 rows=1 loops=1)
  CTE rcs
    ->  Recursive Union  (cost=0.45..50.32 rows=101 width=4) (actual time=0.015..0.340 rows=101 loops=1)
          ->  Result  (cost=0.45..0.46 rows=1 width=4) (actual time=0.015..0.015 rows=1 loops=1)
                InitPlan 3 (returns $1)
                  ->  Limit  (cost=0.42..0.45 rows=1 width=4) (actual time=0.013..0.013 rows=1 loops=1)
                        ->  Index Only Scan using idx_category on test_data test_data_1  (cost=0.42..20920.42 rows=1000000 width=4) (actual time=0.012..0.012 rows=1 loops=1)
                              Index Cond: (category_id IS NOT NULL)
                              Heap Fetches: 0
          ->  WorkTable Scan on rcs r1  (cost=0.00..4.78 rows=10 width=4) (actual time=0.003..0.003 rows=1 loops=101)
                Filter: (category_id IS NOT NULL)
                Rows Removed by Filter: 0
                SubPlan 2
                  ->  Result  (cost=0.45..0.46 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=100)
                        InitPlan 1 (returns $3)
                          ->  Limit  (cost=0.42..0.45 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=100)
                                ->  Index Only Scan using idx_category on test_data  (cost=0.42..7807.09 rows=333333 width=4) (actual time=0.002..0.002 rows=1 loops=100)
                                      Index Cond: ((category_id IS NOT NULL) AND (category_id > r1.category_id))
                                      Heap Fetches: 0
  ->  CTE Scan on rcs  (cost=0.00..2.02 rows=100 width=0) (actual time=0.017..0.355 rows=100 loops=1)
        Filter: (category_id IS NOT NULL)
        Rows Removed by Filter: 1
Planning Time: 0.088 ms
Execution Time: 0.419 ms
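The recursive query wins because it performs one index descent per distinct value (about a hundred probes) instead of scanning a million rows, a pattern often called a "loose index scan". A Go sketch of that idea over a sorted slice, with sort.SearchInts standing in for the btree probe (names are mine):

```go
package main

import (
	"fmt"
	"sort"
)

// distinctSorted collects the distinct values of a sorted slice the way
// the recursive CTE does: repeatedly find the smallest value strictly
// greater than the previous one, one probe per distinct value.
func distinctSorted(xs []int) []int {
	var out []int
	i := 0
	for i < len(xs) {
		v := xs[i]
		out = append(out, v)
		// Probe: first position holding a value > v
		// (the equivalent of one btree descent in SQL).
		i += sort.SearchInts(xs[i:], v+1)
	}
	return out
}

func main() {
	data := []int{1, 1, 1, 2, 2, 5, 5, 5, 9}
	fmt.Println(distinctSorted(data)) // [1 2 5 9]
}
```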

Querying tree structures

We define the following tree data:
-- Create the table
create table test_data
(
node_id int, -- node ID
child_node_id int -- children of this node
);

-- Insert data
insert into test_data
values (0, 1),
(0, 2),

(1, 10),
(1, 11),
(1, 12),
(2, 20),
(2, 21),

(10, 101),
(10, 102),
(11, 110),
(21, 210);

Query
with recursive rsc(node_id, path) as (
values (0, '0')
union all
select test_data.child_node_id, format('%s -> %s', r1.path, test_data.child_node_id)
from rsc r1
join test_data using (node_id)
)
select path
from rsc;

Result
0
0 -> 1
0 -> 2
0 -> 1 -> 10
0 -> 1 -> 11
0 -> 1 -> 12
0 -> 2 -> 20
0 -> 2 -> 21
0 -> 1 -> 10 -> 101
0 -> 1 -> 10 -> 102
0 -> 1 -> 11 -> 110
0 -> 2 -> 21 -> 210
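The same expansion can be traced in ordinary code: each pass of the recursive term extends every path found in the previous pass by one edge, which is why the output is ordered by depth. A Go sketch (names are mine):

```go
package main

import "fmt"

// paths reproduces the recursive CTE: starting from the root, each pass
// of the loop extends every path from the previous pass by one edge.
func paths(edges map[int][]int, root int) []string {
	type row struct {
		node int
		path string
	}
	var out []string
	work := []row{{root, fmt.Sprint(root)}} // non-recursive term
	for len(work) > 0 {
		var next []row
		for _, r := range work {
			out = append(out, r.path)
			for _, c := range edges[r.node] { // recursive term: join on node_id
				next = append(next, row{c, fmt.Sprintf("%s -> %d", r.path, c)})
			}
		}
		work = next
	}
	return out
}

func main() {
	edges := map[int][]int{0: {1, 2}, 1: {10, 11, 12}, 2: {20, 21},
		10: {101, 102}, 11: {110}, 21: {210}}
	for _, p := range paths(edges, 0) {
		fmt.Println(p)
	}
}
```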

Counting the 1 Bits of an Integer

Given an unsigned 32-bit integer, count the number of 1 bits in its binary representation.

Method 1

Check each bit in a loop and count the 1s.

This is the most obvious approach, but it always loops 32 times, which is a lot of overhead.
// CountBits1 counts the 1 bits of an unsigned 32-bit integer.
func CountBits1(num uint32) (count int) {
	// Check each bit in turn and count the 1s.
	for num > 0 {
		if num&0b1 == 1 {
			count++
		}
		num >>= 1
	}
	return
}

Method 2

Starting from the low end, repeatedly clear the lowest 1 bit until the number reaches 0.

For numbers with few 1 bits this greatly reduces the number of iterations.
// CountBits2 counts the 1 bits of an unsigned 32-bit integer.
func CountBits2(num uint32) (count int) {
	// Repeatedly clear the lowest 1 bit until the number is 0;
	// the count of 1s equals the number of iterations.
	for num > 0 {
		num &= num - 1
		count++
	}
	return
}

Method 3

In the spirit of merge sort, first add adjacent pairs of bits, then adjacent pairs of those results, and keep combining adjacent results until only one remains.

For example, for the binary number 0b10001111, the diagram below shows the count of 1s is 5.
graph BT;
1(101)
1-->2(1)
1-->3(100)
2-->4(1)
2-->5(0)
3-->6(10)
3-->7(10)
4-->8(1)
4-->9(0)
5-->10(0)
5-->11(0)
6-->12(1)
6-->13(1)
7-->14(1)
7-->15(1)
// CountBits3 counts the 1 bits of an unsigned 32-bit integer.
func CountBits3(num uint32) (count int) {
	num = num&0x55555555 + num>>1&0x55555555
	num = num&0x33333333 + num>>2&0x33333333
	num = num&0x0f0f0f0f + num>>4&0x0f0f0f0f
	num = num&0x00ff00ff + num>>8&0x00ff00ff
	num = num&0x0000ffff + num>>16&0x0000ffff
	return int(num)
}

Performance test

We test with the fixed number 0b11001111110011111100111111001111.

The benchmark results:
➜  igos go test -bench=.
goos: linux
goarch: amd64
pkg: github.com/souhup/igos
BenchmarkCountBits1-4 87418681 13.5 ns/op
BenchmarkCountBits2-4 144392588 8.16 ns/op
BenchmarkCountBits3-4 1000000000 0.435 ns/op
PASS
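For completeness, Go's standard library already provides this operation: math/bits.OnesCount32, which the compiler can lower to a hardware popcount instruction on modern CPUs. A quick cross-check against method 3:

```go
package main

import (
	"fmt"
	"math/bits"
)

// countBits3 is the bit-parallel method 3 from the text.
func countBits3(num uint32) int {
	num = num&0x55555555 + num>>1&0x55555555
	num = num&0x33333333 + num>>2&0x33333333
	num = num&0x0f0f0f0f + num>>4&0x0f0f0f0f
	num = num&0x00ff00ff + num>>8&0x00ff00ff
	num = num&0x0000ffff + num>>16&0x0000ffff
	return int(num)
}

func main() {
	n := uint32(0b11001111110011111100111111001111)
	fmt.Println(countBits3(n), bits.OnesCount32(n)) // both print 24
}
```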

Wanting to build a website offering various basic utility functions, I got right to work 😆.

Feature requirements

Requirement 1

Provide a way to look up your own public IP.

A GET request to http(s)://api.utils.pro/myip returns your public IP and related information.

The information is returned as plain text: IP: xxx.xxx.xxx.xxx, country (region): xxxx, province: xxx, city: xxx, carrier: xxx

Requirement 2

Provide a way to look up any IP.

A GET request to http(s)://api.utils.pro/query-ip?ip=xxx.xxx.xxx.xxx returns that IP's information.

The information is returned as plain text in the same format: IP: xxx.xxx.xxx.xxx, country (region): xxxx, province: xxx, city: xxx, carrier: xxx

Requirement 3

Transfer files between machines with curl or any other HTTP client.

The sender or receiver first obtains a random code with a GET call to http(s)://api.utils.pro/v1/data-transfer-code.

Using that code, the sender uploads with a PUT call to http(s)://api.utils.pro/v1/data-transfer?code=xxx

and the receiver downloads with a GET call to http(s)://api.utils.pro/v1/data-transfer?code=xxx

Implementation approach

The implementation should satisfy the following conditions:

  • Be cheap enough 🤣
  • Be easy to extend later
  • Be easy to deploy
  • Take system security into account

The project choices are as follows:

  • Buy a free single-domain certificate from a cloud vendor.
  • Host the functionality on a cloud vendor's virtual server.
  • To save money, use Nginx on the server as the gateway instead of a load balancer; switch later if needed.
  • Use the cloud vendor's logging service.
  • Host the code in a private github repository.
  • Store images in the cloud vendor's private image registry.
  • Skip front-end design for now.
  • Pay attention to server security.

Buying the domain

Bought the domain utils.pro from a cloud vendor, and obtained a free single-domain certificate for it.

Later I may switch to a free wildcard certificate from let's encrypt.

Creating the server

Bought a server from a cloud vendor, assigned a public IP, and configured DNS resolution.

For security: created a vendor-managed login key for the server.

For security: configured the security group so that only ports 22, 80, and 443 accept inbound traffic from the internet.

Configuring the server

Operating system: centos-8.

Installed nginx and docker.

Creating the project

Created a private repository on github for the design documents and the code.

Back-end language: go.

Operations scripting language: python.

Since I already own the IntelliJ suite, the IDE is goland.

Business and performance metrics are collected with prometheus,

and the prometheus metrics are displayed with grafana.

Configuring the image registry

Created a private image registry at a cloud vendor, with github as the code source, set to build automatically when a commit carries a matching TAG.

After an image is built, a trigger notifies the cloud server.

A python script listens on a specific port; when the registry notifies the server of an image change, the server updates the docker containers.

When testing, you sometimes need to set up an etcd cluster on a single machine; the steps are recorded below.

#!/bin/bash
docker stop etcd-1 etcd-2 etcd-3
docker rm etcd-1 etcd-2 etcd-3

THIS_IP=??
ETCD_HOSTS=s1=http://$THIS_IP:?,s2=http://$THIS_IP:?,s3=http://$THIS_IP:?
ETCD_TOKEN=??

CON_NAME=etcd-1
THIS_NAME=s1
THIS_CLIENT_PORT=?
THIS_PEER_PORT=?
docker run -d \
--name $CON_NAME \
--restart always \
--net host \
quay.io/coreos/etcd:v3 \
/usr/local/bin/etcd \
--name ${THIS_NAME} \
--data-dir /etcd-data \
--listen-client-urls http://${THIS_IP}:${THIS_CLIENT_PORT} \
--advertise-client-urls http://${THIS_IP}:${THIS_CLIENT_PORT} \
--listen-peer-urls http://${THIS_IP}:${THIS_PEER_PORT} \
--initial-advertise-peer-urls http://${THIS_IP}:${THIS_PEER_PORT} \
--initial-cluster $ETCD_HOSTS \
--initial-cluster-state new \
--initial-cluster-token ${ETCD_TOKEN}

CON_NAME=etcd-2
THIS_NAME=s2
THIS_CLIENT_PORT=?
THIS_PEER_PORT=?
docker run -d \
--name $CON_NAME \
--restart always \
--net host \
quay.io/coreos/etcd:v3 \
/usr/local/bin/etcd \
--name ${THIS_NAME} \
--data-dir /etcd-data \
--listen-client-urls http://${THIS_IP}:${THIS_CLIENT_PORT} \
--advertise-client-urls http://${THIS_IP}:${THIS_CLIENT_PORT} \
--listen-peer-urls http://${THIS_IP}:${THIS_PEER_PORT} \
--initial-advertise-peer-urls http://${THIS_IP}:${THIS_PEER_PORT} \
--initial-cluster $ETCD_HOSTS \
--initial-cluster-state new \
--initial-cluster-token ${ETCD_TOKEN}

CON_NAME=etcd-3
THIS_NAME=s3
THIS_CLIENT_PORT=?
THIS_PEER_PORT=?
docker run -d \
--name $CON_NAME \
--restart always \
--net host \
quay.io/coreos/etcd:v3 \
/usr/local/bin/etcd \
--name ${THIS_NAME} \
--data-dir /etcd-data \
--listen-client-urls http://${THIS_IP}:${THIS_CLIENT_PORT} \
--advertise-client-urls http://${THIS_IP}:${THIS_CLIENT_PORT} \
--listen-peer-urls http://${THIS_IP}:${THIS_PEER_PORT} \
--initial-advertise-peer-urls http://${THIS_IP}:${THIS_PEER_PORT} \
--initial-cluster $ETCD_HOSTS \
--initial-cluster-state new \
--initial-cluster-token ${ETCD_TOKEN}

For testing, we can generate our own root certificate and server certificates.

Generate the root certificate
# Generate the private key
openssl genrsa -out ca.key
# Generate the certificate signing request
openssl req -new -key ca.key -out ca.csr -subj "/C=CN/ST=XXXX/L=XXXX/O=XXXX/OU=XXXX/CN=XXXX/emailAddress=XXX@XXX.com"
# Generate the root certificate
openssl x509 -req -days 36500 -sha256 -signkey ca.key -in ca.csr -out ca.cer -extfile <(printf "subjectKeyIdentifier=hash\nauthorityKeyIdentifier=keyid,issuer\nbasicConstraints=CA:TRUE")

Generate a single-domain certificate
# Generate a single-domain certificate
domain=$1
# Create the directory
mkdir -p $domain
# Generate the private key
openssl genrsa -out $domain/$domain.key
# Generate the certificate signing request
openssl req -new -key $domain/$domain.key -out $domain/$domain.csr -subj "/C=CN/ST=XXX/L=XXX/O=XXX/OU=XXX/CN=$domain/emailAddress=XXX@XXX.com" -addext "subjectAltName=DNS:$domain"
# Generate the certificate
openssl x509 -req -in $domain/$domain.csr -CA ../rootca/root.cer -CAkey ../rootca/root.key -CAcreateserial -out $domain/$domain.crt -days 730 -extfile <(printf "subjectKeyIdentifier=hash\nauthorityKeyIdentifier=keyid,issuer\nbasicConstraints=CA:FALSE\nkeyUsage=digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment\nextendedKeyUsage=serverAuth,OCSPSigning\nsubjectAltName=DNS:$domain")

Generate a wildcard certificate
# Generate a wildcard certificate
domain=$1
# Create the directory
mkdir -p $domain
# Generate the private key
openssl genrsa -out $domain/$domain.key
# Generate the certificate signing request
openssl req -new -key $domain/$domain.key -out $domain/$domain.csr -subj "/C=CN/ST=XXX/L=XXX/O=XXX/OU=XXX/CN=$domain/emailAddress=XXX@XXX.com" -addext "subjectAltName=DNS:*.$domain,DNS:$domain"
# Generate the certificate
openssl x509 -req -in $domain/$domain.csr -CA ../rootca/root.cer -CAkey ../rootca/root.key -CAcreateserial -out $domain/$domain.crt -days 730 -extfile <(printf "subjectKeyIdentifier=hash\nauthorityKeyIdentifier=keyid,issuer\nbasicConstraints=CA:FALSE\nkeyUsage=digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment\nextendedKeyUsage=serverAuth,OCSPSigning\nsubjectAltName=DNS:*.$domain,DNS:$domain")
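The root-certificate step can also be done programmatically. Below is a hedged Go sketch using the standard crypto/x509 package; the subject name, key size, and validity are arbitrary placeholders of mine, not values from the scripts above:

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// makeCA builds a self-signed CA certificate, roughly the programmatic
// equivalent of the openssl genrsa/req/x509 steps above.
func makeCA() (*x509.Certificate, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "Test Root CA"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().AddDate(100, 0, 0),
		IsCA:                  true, // basicConstraints=CA:TRUE
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageCRLSign,
	}
	// Self-signed: the template is both the certificate and its issuer.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	ca, err := makeCA()
	if err != nil {
		panic(err)
	}
	fmt.Println(ca.Subject.CommonName, ca.IsCA)
}
```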

A Self-Reproducing Program in golang

While reading today, I suddenly felt like trying to write a quine in golang.

A quine is a program whose output is its own source code. That sounds easy: just read the source file from the filesystem and print it. But then there would be no challenge at all, would there 😂

My own version is below. It is not elegant, but it works.

package main

import (
 "fmt"
 "strings"
)

func main() {
 code := "package main\n\nimport (\n \"fmt\"\n \"strings\"\n)\n\nfunc main() {\n code := %s\n tmp := strings.Replace(code, \"\\\\\", \"\\\\\\\\\", -1)\n tmp = strings.Replace(tmp, \"\\n\", \"\\\\n\", -1)\n tmp = strings.Replace(tmp, \"\\\"\", \"\\\\\\\"\", -1)\n code = fmt.Sprintf(code, fmt.Sprintf(\"\\\"%%s\\\"\", tmp))\n println(code)\n}"
 tmp := strings.Replace(code, "\\", "\\\\", -1)
 tmp = strings.Replace(tmp, "\n", "\\n", -1)
 tmp = strings.Replace(tmp, "\"", "\\\"", -1)
 code = fmt.Sprintf(code, fmt.Sprintf("\"%s\"", tmp))
 println(code)
}

A widely circulated version is the one by rsc, which is indeed more concise.

/* Go quine */
package main
import "fmt"
func main() {
fmt.Printf("%s%c%s%c\n", q, 0x60, q, 0x60)
}
var q = `/* Go quine */
package main
import "fmt"
func main() {
fmt.Printf("%s%c%s%c\n", q, 0x60, q, 0x60)
}
var q = `

golang-nuts offers even shorter code, similar in spirit to rsc's, though it has not been through go fmt; written as one line it does look rather odd.
package main;func main(){print(c+"\x60"+c+"\x60")};var c=`package main;func main(){print(c+"\x60"+c+"\x60")};var c=`

Using ubuntu-20

Install basic tools
sudo apt update
sudo apt install -y vim git telnet curl wget tmux gitk
sudo apt install -y iftop htop net-tools tcpdump
sudo apt install -y gcc g++ make

Configure a local proxy

Add the local proxy address to the hosts file:
sudo sed -i '$a192.168.152.1\tmyproxy' /etc/hosts

Terminal proxy
# Session-level proxy
export ALL_PROXY="socks5://myproxy:1080"
# Command-level proxy
ALL_PROXY="socks5://myproxy:1080" {other command}

Configure git

Basic settings
git config --global user.name "souhup"
git config --global user.email "souhup@gmail.com"
git config --global core.editor "vim"
git config --global core.quotepath false # show non-ASCII file names in git status

Generate a key pair

ssh-keygen -t rsa -C "souhup@gmail.com"

Configure a proxy for github
cat >> ~/.ssh/config << EOF
Host github.com
HostName github.com
User git
ProxyCommand nc -v -x myproxy:1080 %h %p
EOF

Install zsh

Install zsh
sudo apt install zsh
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

Install the zsh-autosuggestions plugin
git clone https://github.com/zsh-users/zsh-autosuggestions ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-autosuggestions

Install the zsh-syntax-highlighting plugin
git clone https://github.com/zsh-users/zsh-syntax-highlighting.git ${ZSH_CUSTOM:-~/.oh-my-zsh/custom}/plugins/zsh-syntax-highlighting

Install the autojump plugin
sudo apt install autojump

Modify .zshrc
sed -i '/^plugins=/ c plugins=(git zsh-syntax-highlighting autojump zsh-autosuggestions)' ~/.zshrc

Download custom zsh settings from github
ALL_PROXY="socks5://myproxy:1080" curl https://raw.githubusercontent.com/souhup/configuration/master/.zshrc >> .zshrc
source ~/.zshrc

Miscellaneous

Run a program in the background
nohup XXX >> /dev/null 2>&1 &