Efficient join of >100 files
I have a list of >100 tab-delimited files, each containing 5-8 million rows and 16 columns (always in the exact same order). From each file I need to extract 5 specific columns, including one identifier column. My final output (using 3 input files as an example) should be 4 files, containing the following columns:
- output1: ID, VAR1
- output2: VAR2.1,VAR2.2,VAR2.3
- output3: VAR3.1,VAR3.2,VAR3.3
- output4: VAR4.1,VAR4.2,VAR4.3
where ".1", ".2", and ".3" indicate that the column originates from the first, second and third input file, respectively.
My problem is that the input files contain partially overlapping IDs and I need to extract the union of these rows (i.e. all IDs that occur at least once in one of the input files). To be more exact, output1 should contain the unions of the "ID"- and "VAR1"-columns of all input files. The row order of the remaining output files should be identical to output1. Finally, rows not present in any given input file should be padded with "NA" in output2, output3 and output4.
I'm using a combination of a while-loop, awk and join to get the job done, but it takes quite some time. I'd like to know whether there's a faster way to get this done, because I have to run the same script over and over with varying input files.
My script so far:
ID=1
VAR1=6
VAR2=9
VAR3=12
VAR4=16
while read FILE; do
    sort -k${ID},${ID} < ${FILE} | awk -v ID=${ID} -v VAR1=${VAR1} -v VAR2=${VAR2} -v VAR3=${VAR3} -v VAR4=${VAR4} 'BEGIN{OFS="\t"};{print $ID,$VAR1 > "tmp1"; print $ID,$VAR2 > "tmp2"; print $ID,$VAR3 > "tmp3"; print $ID,$VAR4 > "tmp4"}'
    awk 'FNR==NR{a[$1]=$1;next};{if(($1 in a)==0){print $0 > "tmp5"}}' output1 tmp1
    cat output1 tmp5 > foo && mv foo output1
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar2 output3
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar2 output4
    rm tmp?
done < files.list
sort -k1,1 output1 > foo && mv foo output1
Final remark: I use cat for output1 because all values in VAR1 for the same ID are identical across all input files (I've made sure of that when I pre-process my files). So I can just append rows that are not already included to the bottom of output1 and sort the final output file.
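For reference, the NA-padded full outer join that drives outputs 2-4 can be seen on two toy files (GNU join; the data here is made up):

```shell
# Two toy, tab-delimited inputs, both sorted on the ID column.
printf 'a\t1\nb\t2\n' > left
printf 'b\t9\nc\t8\n' > right

# -a1 -a2 keep unpairable lines from both files (full outer join),
# -e NA fills in the missing side, -o auto pads the column count.
join -t $'\t' -e "NA" -a1 -a2 -o auto left right
# a 1 NA
# b 2 9
# c NA 8   (tab-separated)
```

Both inputs must already be sorted on the join field, which is exactly why the script sorts each file before joining.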
Tags: bash, performance, join, merge
- why do you sort? – NEGR KITAEC, Jan 3 at 8:56
- good question @NEGR KITAEC, and check the bar2 && mv bar2 output2 and the next line bar3 && mv bar2 ... – Doncho Gunchev, Jan 3 at 9:15
- Is "ouptut" in cat ouptut1 tmp5 > foo a mistake in your original script or only here? – Walter A, Jan 3 at 9:57
- The first awk command uses ID, VAR1, VAR2, VAR3, VAR4 as input variables. When are they filled? – Walter A, Jan 3 at 10:06
- Where do you get the ID? Is that the first column, and VAR1 the second column? – Walter A, Jan 3 at 10:09
asked Jan 3 at 8:52 by Hill; edited Jan 3 at 18:51
1 Answer
First you have to figure out where most of the time is lost. You can `echo "running X"; time ./X` each part and make sure you are not trying to optimize the fastest part of the script.
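For example, each stage can be wrapped in time to see where the script actually spends its time (the input file and stages here are placeholders for the real sort | awk pipeline and joins):

```shell
# Time each stage separately before optimizing anything (toy input).
printf 'b\t2\na\t1\n' > demo.tsv
echo "running sort"; time sort -k1,1 demo.tsv > sorted.tsv
echo "running awk";  time awk -F'\t' 'END { print NR }' sorted.tsv
```

time writes its report to stderr, so it does not disturb the pipeline's output.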
You can simply run the three joins in the background in parallel with ( cmd args ) & and then wait for all of them to finish. If this takes 1 second and the awk part before takes 10 minutes then this will not help a lot.
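A minimal runnable sketch of that idea, with made-up data standing in for the real output*/tmp* files:

```shell
# Toy setup (hypothetical data): one sorted, tab-delimited
# accumulator and one new chunk per output stream.
for i in 2 3 4; do
    printf 'a\t1\nb\t2\n' > "output$i"
    printf 'b\t9\nc\t8\n' > "tmp$i"
done

# Run the three joins in the background, in parallel.
for i in 2 3 4; do
    join -e "NA" -a1 -a2 -t $'\t' -o auto "output$i" "tmp$i" > "bar$i" &
done
wait   # block until all three background joins have finished
for i in 2 3 4; do mv "bar$i" "output$i"; done
```

The wait is what keeps the loop iterations correct: no file is renamed until every background join of the current iteration is done.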
You can also put the wait before the cat output1 tmp5 ... line and before the final sort -k1... line. For this to work you'll have to name the temporary files differently and rename them just before the joins. The idea is to generate the input for the three parallel joins for the first file in the background, wait, then rename the files, run the joins in the background and generate the next inputs. After the loop is complete, just wait for the last joins to finish. This will help if the awk part consumes CPU time comparable to the joins.
HTH, you can make even more complex parallel execution scenarios.
- Thanks for the tip! It's actually the sort command in the first line inside the while-loop that's taking up most of the time. I thought this would be faster than sorting the separate files before the joins, but forgot that I could do those in parallel. I now run the 3 joins in parallel where, within each parallel run, the required columns are first extracted and then sorted. This has halved the running time of my script. – Hill, Jan 3 at 12:59
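The per-stream approach Hill describes can be sketched like this (input file, toy data, and column numbers are hypothetical; cut stands in for the original awk extraction):

```shell
# One background job per output stream: extract the ID column plus one
# value column, then sort that small two-column stream independently.
FILE=input.tsv
printf 'b\t2\t9\na\t1\t8\n' > "$FILE"   # toy data: ID, VAR1, VAR2

for spec in "1,2:tmp1" "1,3:tmp2"; do
    cols=${spec%%:*}                     # columns to extract
    out=${spec##*:}                      # per-stream output file
    ( cut -f "$cols" "$FILE" | sort -t $'\t' -k1,1 > "$out" ) &
done
wait   # all streams extracted and sorted
```

Sorting a 2-column stream moves far less data per comparison than sorting the full 16-column rows, which is consistent with the speedup Hill reports.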
answered Jan 3 at 9:03 by Doncho Gunchev; edited Jan 3 at 9:09