-
Notifications
You must be signed in to change notification settings - Fork 0
/
4. HQL Data Analysis (Partitioning and Bucketing).HQL
256 lines (201 loc) · 5.99 KB
/
4. HQL Data Analysis (Partitioning and Bucketing).HQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
1. Create a partitioned table on 'Education' and 'Marital' status. Load data from the original table to this new partitioned table.
step 1.create partition table :
CREATE EXTERNAL TABLE car_insurance_data_partition (
Id INT,
Age INT,
Job STRING,
Default INT,
Balance INT,
HHInsurance INT,
CarLoan INT,
Communication STRING,
LastContactDay INT,
LastContactMonth INT,
NoOfContacts INT,
DaysPassed INT,
PrevAttempts INT,
Outcome STRING,
CallStart STRING,
CallEnd STRING,
CarInsurance INT)
partitioned by (Education STRING ,Marital STRING)
row format delimited
fields terminated by ','
stored as textfile
step 2. set property:
set hive.exec.dynamic.partition.mode=nonstrict;
....because we are doing dynamic partition , dynamic beacuse we not mention age is equal to this or that
step 3.load data:
INSERT OVERWRITE TABLE car_insurance_data_partition PARTITION (Education, Marital)
SELECT
Id, Age, Job, Default, Balance, HHInsurance,
CarLoan, Communication, LastContactDay,
LastContactMonth, NoOfContacts, DaysPassed,
PrevAttempts, Outcome, CallStart, CallEnd,
CarInsurance, Education, Marital
FROM car_insurance_data;
................................................................................................................................
2. Create a bucketed table on 'Age', bucketed into 4 groups (as per
the age groups mentioned above). Load data from the original
table into this bucketed table.
step 1: create bucket table
CREATE TABLE car_insurance_data_bucket (
Id INT,
Age INT,
Job STRING,
Marital STRING,
Education STRING,
Default INT,
Balance INT,
HHInsurance INT,
CarLoan INT,
Communication STRING,
LastContactDay INT,
LastContactMonth INT,
NoOfContacts INT,
DaysPassed INT,
PrevAttempts INT,
Outcome STRING,
CallStart STRING,
CallEnd STRING,
CarInsurance INT)
clustered by (Age) into 4 buckets
row format delimited
fields terminated by ','
stored as textfile;
step 2. set property:
set hive.enforce.bucketing=true;
step 3: load data
INSERT OVERWRITE TABLE
car_insurance_data_bucket
SELECT * FROM car_insurance_data;
..................................................................................................................................................
3. Add an additional partition on 'Job' to the partitioned table created
earlier and move the data accordingly.
(((
note :
If you want to add an additional partition on 'Job' to the
previously created partitioned table, you actually have
to create a new table as Hive does not allow altering
the partitioning of existing tables. However, it's a
straightforward task to create a new partitioned table
and move the data accordingly
)))
step 1st : create table:
CREATE TABLE car_insurance_data_d_partition (
Id INT,
Age INT,
Default INT,
Balance INT,
HHInsurance INT,
CarLoan INT,
Communication STRING,
LastContactDay INT,
LastContactMonth INT,
NoOfContacts INT,
DaysPassed INT,
PrevAttempts INT,
Outcome STRING,
CallStart STRING,
CallEnd STRING,
CarInsurance INT)
partitioned by (Job STRING,Marital STRING,Education STRING)
row format delimited
fields terminated by ','
stored as textfile
step 2. set property:
set hive.exec.dynamic.partition.mode=nonstrict;
....because we are doing dynamic partition , dynamic beacuse we not mention age is equal to this or that
step 3: load data
insert overwrite table car_insurance_data_d_partition partition(Job,Marital,Education)
select
Id, Age, Default, Balance, HHInsurance,
CarLoan, Communication, LastContactDay,
LastContactMonth, NoOfContacts, DaysPassed,
PrevAttempts, Outcome, CallStart, CallEnd,
CarInsurance, Education, Marital, Job
from car_insurance_data
(((
imp : during running this query the map and reducer get failed i get to know that my file format should be more optimized rather than stored it as text file as we know orc support better with hive so i rewrite query again.
)))
/////////////////////////
new query :
step 1st : create table:
CREATE TABLE car_insurance_data_d_partition_orc (
Id INT,
Age INT,
Default INT,
Balance INT,
HHInsurance INT,
CarLoan INT,
Communication STRING,
LastContactDay INT,
LastContactMonth INT,
NoOfContacts INT,
DaysPassed INT,
PrevAttempts INT,
Outcome STRING,
CallStart STRING,
CallEnd STRING,
CarInsurance INT)
partitioned by (Job STRING,Marital STRING,Education STRING)
row format delimited
fields terminated by ','
stored as orc
step 2. set property:
set hive.exec.dynamic.partition.mode=nonstrict;
....because we are doing dynamic partition , dynamic beacuse we not mention age is equal to this or that
step 3: load data
insert overwrite table car_insurance_data_d_partition_orc partition(Job,Marital,Education)
select
Id, Age, Default, Balance, HHInsurance,
CarLoan, Communication, LastContactDay,
LastContactMonth, NoOfContacts, DaysPassed,
PrevAttempts, Outcome, CallStart, CallEnd,
CarInsurance, Education, Marital, Job
from car_insurance_data_partition ;
still it fails
......................................................................................................................
4. Increase the number of buckets in the bucketed table to 10 and
redistribute the data.
(((
note :
In Hive, once a table is bucketed, the number of
buckets cannot be changed. The process of bucketing
happens at the time of table creation and is immutable.
Therefore, in order to increase the number of buckets,
you will need to create a new table with the desired
number of buckets and then insert data into the new
table from the existing one.
)))
step 1 . create table:
CREATE TABLE car_insurance_data_n_bucket (
Id INT,
Age INT,
Job STRING,
Marital STRING,
Education STRING,
Default INT,
Balance INT,
HHInsurance INT,
CarLoan INT,
Communication STRING,
LastContactDay INT,
LastContactMonth INT,
NoOfContacts INT,
DaysPassed INT,
PrevAttempts INT,
Outcome STRING,
CallStart STRING,
CallEnd STRING,
CarInsurance INT)
clustered by (age) into 10 buckets
row format delimited
fields terminated by ','
stored as textfile;
step 2. enable property :
set hive.enforce.bucketing=true;
step 3. load data :
insert overwrite table car_insurance_data_n_bucket
select *
from car_insurance_data_bucket;