Incremental loading (new + update, no delete) in Redshift when columns are updated at different times
I have the following tables as an example:
Events(id, business_uuid, event_category, event_name, created, row_updated)
Bank_accounts(id, business_uuid, bank_name, account_no, etc., created, row_updated)
Orders(id, uuid, business_uuid, qty, item, created, row_updated)
where created and row_updated are timestamps. For each business I want to capture the first time something happened for that business in each table. I am on Redshift. My query looks like this:
with universal as (
    select business_uuid from events
    union
    select business_uuid from bank_accounts
    union
    select business_uuid from orders
)
, ev as (
    select business_uuid
    , min(case when event_name = 'created_account' then created end) as created_account
    , min(case when event_name = 'enter_business_detail' then created end) as enter_business_detail
    , min(case when event_name = 'accepted_terms' then created end) as accepted_terms
    from events
    group by 1
)
, bank as (
    select business_uuid
    , min(row_updated) as entered_banking_details
    from bank_accounts
    group by 1
)
, od as (
    select business_uuid
    , min(created) as first_order
    from orders
    group by 1
)
select u.business_uuid
, ev.created_account
, ev.enter_business_detail
, ev.accepted_terms
, bank.entered_banking_details
, od.first_order
from universal u
left join ev using (business_uuid)
left join bank using (business_uuid)
left join od using (business_uuid)
giving me an output of:
result(business_uuid, created_account, enter_business_detail, accepted_terms, entered_banking_details, first_order)
The way these events occur, the columns get filled at different times: if I am pulling the data every 10 minutes, some columns will be null at first and only be filled in later. I am dealing with large tables and I can't do full loads each time, so I would like to do incremental loads (new + update) and hopefully no deletes. I introduced a column last_updated:
result(business_uuid, created_account, enter_business_detail, accepted_terms, entered_banking_details, first_order, last_updated)
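For reference, the target table looks roughly like this (the column types are my assumption, not shown above):

```sql
-- Assumed DDL for the target table; timestamp types are my guess
create table result (
    business_uuid           varchar(36),
    created_account         timestamp,
    enter_business_detail   timestamp,
    accepted_terms          timestamp,
    entered_banking_details timestamp,
    first_order             timestamp,
    last_updated            timestamp
);
```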
and I rewrote the query above as:
with universal as (
    select business_uuid from events
    union
    select business_uuid from bank_accounts
    union
    select business_uuid from orders
)
, ev as (
    select business_uuid
    , min(case when event_name = 'created_account' then created end) as created_account
    , min(case when event_name = 'enter_business_detail' then created end) as enter_business_detail
    , min(case when event_name = 'accepted_terms' then created end) as accepted_terms
    from events
    where row_updated > (select max(last_updated) from result)
    group by 1
)
, bank as (
    select business_uuid
    , min(row_updated) as entered_banking_details
    from bank_accounts
    where row_updated > (select max(last_updated) from result)
    group by 1
)
, od as (
    select business_uuid
    , min(created) as first_order
    from orders
    where row_updated > (select max(last_updated) from result)
    group by 1
)
select u.business_uuid
, ev.created_account
, ev.enter_business_detail
, ev.accepted_terms
, bank.entered_banking_details
, od.first_order
from universal u
left join ev using (business_uuid)
left join bank using (business_uuid)
left join od using (business_uuid)
It does not work out: when I compare my full load against the attempted incremental load, there is a slight difference in row counts.
If the above is clear, I would like a suggestion on how best to solve it.
For further info: I use Apache Airflow as my ETL tool to execute my SQL statements.
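From my own debugging, I suspect the issue is that filtering each source on row_updated and then taking MIN over only the changed rows can produce a later timestamp than the true historical minimum, and simply inserting that would overwrite values already captured. The direction I am considering is staging the increment and merging it, keeping the earliest value per column. This is an untested sketch; stage_result is a hypothetical staging table holding the output of the incremental query above:

```sql
-- Untested sketch: merge this run's increment into result, keeping the
-- earliest timestamp seen per column. In Redshift, least() returns the
-- smallest non-null argument, so existing values are preserved when the
-- increment has NULLs and vice versa.
update result
set created_account         = least(result.created_account, s.created_account)
  , enter_business_detail   = least(result.enter_business_detail, s.enter_business_detail)
  , accepted_terms          = least(result.accepted_terms, s.accepted_terms)
  , entered_banking_details = least(result.entered_banking_details, s.entered_banking_details)
  , first_order             = least(result.first_order, s.first_order)
  , last_updated            = getdate()
from stage_result s
where result.business_uuid = s.business_uuid;

-- Then insert businesses that are new in this increment
insert into result
select s.*, getdate()
from stage_result s
left join result r on r.business_uuid = s.business_uuid
where r.business_uuid is null;
```

Is something along these lines the right approach, or is there a better pattern for this on Redshift?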
sql amazon-redshift incremental-build
asked Nov 21 '18 at 12:36
Plengo
479