forked from dimitri/pgloader
-
Notifications
You must be signed in to change notification settings - Fork 0
/
TODO.txt
214 lines (154 loc) · 7.06 KB
/
TODO.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
= PGLOADER TODO LIST =
== Multi Threaded Import ==
Release-Canditate status now. See documentation and its new +PARALLEL
LOADING+ section. Supports for both round-robin reader and split file
reading concepts, and provides processing multiple sections at a
time. Please test if you're interrested.
Current status::
Released in pgloader +2.3.0+.
Limits::
See http://docs.python.org/api/threads.html[], and consider
+pgloader+ is using generators, so maybe the problem won't get so
huga as to completely inhibit any performance benefit from
multi-threaded processing.
== Constraint Exclusion support ==
Current status::
Design seems ok enough for devel to get started anytime. Will wait
for +2.3.0+ to get released.
Version::
Targeted for +2.4.0+, which will certainly be the next one after
+2.3.0+ gets released.
=== User level configuration
At user level, you will have to add a +constraint_exclusion = on+
parameter to pgloader section configuration for it to bother checking
if the destination table has some children etc.
You'll need to provide also a global ce_path parameter (where to find user
python constraint exclusion modules) and a +ce_modules+ parameter for each
section where +constraint_exclusion = on+:
ce_modules = columnA:module:class, columnB:module:class
As the +ce_path+ could point to any number of modules where a single
type is supported by several modules, I'll let the user choose which
module to use.
=== Constraint exclusion modules
The modules will provide one or several class(es) (kind of a packaging
issue), each one will have to register which datatypes and operators
they know about. Here's some pseudo-code of a module, which certainly
is the best way to express a code design idea:
class MyCE:
def __init__(self, operator, constant, cside='r'):
""" CHECK ( col operator constant ) => cside = 'r', could be 'l' """
...
@classmethod
def support_type(cls, type):
return type in ['integer', 'bigint', 'smallint', 'real', 'double']
@classmethod
def support_operator(cls, op):
return op in ['=', '>', '<', '>=', '<=', '%']
def check(self, op, data):
if op == '>' : return self.gt(data)
...
def gt(self, data):
if cside == 'l':
return self.constant > data
elif cside == 'r':
return data > self.constant
This way pgloader will be able to support any datatype (user datatype
like IP4R included) and operator (+@@+, +~<=+ or whatever). For
pgloader to handle a +CHECK()+ constraint, though, it'll have to be
configured to use a CE class supporting the used operators and
datatypes.
=== PGLoader constraint exclusion support
The +CHECK()+ constraint being a tree of check expressions[*] linked
by logical operators, pgloader will have to build some logic tree of
MyCE (user CE modules) and evaluate all the checks in order to be able
to choose the input line partition.
[*]: +check((a % 10) = 1)+ makes an expression tree containing 2 check
nodes
After having parsed +pg_constraint.consrc+ (not +conbin+ which seems
too much an internal dump for using it from user code) and built a
CHECK tree for each partition, pgloader will try to decide if it's
about range partitioning (most common case).
If each partition CHECK tree is +AND((a>=b, a<c)+ or a variation of
it, we have range partitioning. Then surely we can optimize the code
to run to choose the partition where to COPY data to and still use the
module operator implementation, e.g. making a binary search on a
partitions limits tree.
If you want some other widely used (or not) partitioning scheme to be
recognized and optimized by pgloader, just tell me and we'll see about it :)
Having this step as a user module seems overkill at the moment,
though.
=== Multi-Threading behavior and CE support
Now, pgloader will be able to run N threads, each one loading some
data to a partitionned child-table target. N will certainly be
configured depending on the number of server cores and not depending
on the partition numbers...
So what do we do when reading a tuple we want to store in a partition
which has no dedicated Thread started yet, and we already have N
Threads running? I'm thinking about some LRU(Thread) to choose a
Thread to terminate (launch COPY with current buffer and quit) and
start a new one for the current partition target.
Hopefully there won't be such high values of N that the LRU is a bad
choice per see, and the input data won't be so messy to have to
stop/start Threads at each new line.
== Reject Behaviour ==
Current status::
In need for design.
Version::
After +2.4.0+, either as a minor version or a +2.5.0+ one,
depending on the author mood...
Implement several reject behaviours for pgloader and allow users to
configure them depending on the error we had. For example user might
configure pgloader to load rejected data to some other table when the
error is PK violation.
=== Offload to error table
This is a design idea where we add a new configuration parameter to
+pgloader+, namely +errors_table+. Then +pgloader+ will create the
table with the given name as following:
CREATE TABLE mysection_errors (
id bigserial PRIMARY KEY,
first_line bigint not null,
last_line bigint,
sqlstate text,
message text,
hint text,
context text,
data bytea
);
The lines number are intended to represent input file physical line
numbers, +last_line+ being +NULL+ when input files do not contain line
breaks.
At the implementation level, it migth be a good idea to use
+SAVEPOINTS+ (see
http://www.postgresql.org/docs/8.3/static/sql-savepoint.html[])
instead of plain transactions.
== Fixed Format ==
Current status::
Released in pgloader 2.3.1
Support fixed format: no separator, known length (in bytes) per
column. See +examples/fixed+.
== Facilities ==
Current status::
Partially implemented, +skip_head_lines+ is in CVS (2.3.2~dev1)
Add options:
+skip_head_lines+::
Will allow to skip the +n+ first lines of the given files (headers)
+with_header+::
Will allow pgloader to configure the +columns+ parameter from the
first line of the file, and skip loading it.
+optional_cols+::
Allow pgloader to consider some columns optionnal, replace their
missing content with NULL when needed (line does not provide
them). Only concerns last columns of a line, by definition.
Change default behaviour:
+only_cols+::
pgloader will only complain about missing cols when those are not
listed in only_cols, if this knob is used.
== XML support ==
Implement XML support by allowing the user to give +pgloader+ a custom
+XSLT+ StyleSheet transforming his XML into some +CSV+ that pgloader
can parse. Any formating option should be available for parsing the
+XSL+ output, and it should ideally be possible for pgloader to
parallelize this processing too.
Consider using http://xmlsoft.org/XSLT/python.html[].
Add options allowing the user to have his +XML+ validated with a
custom DTD or schema, if provided by the +XML+ lib.