efficient join >100 files

I have a list of >100 tab-delimited files, each containing 5-8 million rows and 16 columns (always in the same exact order). From each file I need to extract 5 specific columns, including one identifier column. My final output (using 3 input files as an example) should be 4 files, containing the following columns:




  • output1: ID, VAR1

  • output2: VAR2.1, VAR2.2, VAR2.3

  • output3: VAR3.1, VAR3.2, VAR3.3

  • output4: VAR4.1, VAR4.2, VAR4.3


where ".1", ".2", and ".3" indicate that the column originates from the first, second and third input file, respectively.



My problem is that the input files contain partially overlapping IDs and I need to extract the union of these rows (i.e. all IDs that occur at least once in any of the input files). To be more exact, output1 should contain the union of the "ID"- and "VAR1"-columns of all input files. The row order of the remaining output files should be identical to output1. Finally, rows not present in a given input file should be padded with "NA" in output2, output3 and output4.
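
For illustration, with two input files instead of three and made-up IDs and values: if file 1 contains IDs A and B, and file 2 contains IDs B and C, output2 would look like this (tab-delimited, columns ID, VAR2.1, VAR2.2):

A	0.1	NA
B	0.2	0.5
C	NA	0.7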



I'm using a combination of a while-loop, awk and join to get the job done, but it takes quite some time. I'd like to know whether there's a faster way to get this done, because I have to run the same script over and over with varying input files.



My script so far:



# 1-based indices of the columns to extract
ID=1
VAR1=6
VAR2=9
VAR3=12
VAR4=16
while read FILE; do
    # sort on the ID column, then split the ID plus each VAR column into its own temp file
    sort -k${ID},${ID} < ${FILE} | awk -v ID=${ID} -v VAR1=${VAR1} -v VAR2=${VAR2} -v VAR3=${VAR3} -v VAR4=${VAR4} \
        'BEGIN{OFS="\t"};{print $ID,$VAR1 > "tmp1"; print $ID,$VAR2 > "tmp2"; print $ID,$VAR3 > "tmp3"; print $ID,$VAR4 > "tmp4"}'
    # collect rows whose IDs are not yet present in output1
    awk 'FNR==NR{a[$1]=$1;next};{if(($1 in a)==0){print $0 > "tmp5"}}' output1 tmp1
    cat output1 tmp5 > foo && mv foo output1
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar2 output3
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar2 output4
    rm tmp?
done < files.list
sort -k1,1 output1 > foo && mv foo output1


Final remark: I use cat for output1 because all values in VAR1 for the same ID are identical across all input files (I've made sure of that when I pre-process my files). So I can just append rows that are not already included to the bottom of output1 and sort the final output file.

bash performance join merge

asked Jan 3 at 8:52 by Hill (edited Jan 3 at 18:51)
  • why do you sort? – NEGR KITAEC, Jan 3 at 8:56

  • good question NEGR KITAEC and check the bar2 && mv bar2 output2 and the next line bar3 && mv bar2 .... – Doncho Gunchev, Jan 3 at 9:15

  • Is ouptut in cat ouptut1 tmp5 > foo a mistake in your original script or only here? – Walter A, Jan 3 at 9:57

  • The first awk command uses ID, VAR1, VAR2, VAR3, VAR4 as input variables. When are they filled? – Walter A, Jan 3 at 10:06

  • Where do you get the ID? Is that the first column, and VAR1 the second column? – Walter A, Jan 3 at 10:09

1 Answer
First you have to figure out where most of the time is lost. You can wrap each step in echo "running X"; time ./X and make sure you are not trying to optimize the fastest part of the script.
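
For example, you could time the stages of one loop iteration separately, along these lines (a minimal sketch; file names and columns follow the question's script):

echo "running sort";  time sort -k1,1 "$FILE" > sorted.tmp
echo "running joins"; time join -e "NA" -a1 -a2 -t $'\t' output2 -o auto tmp2 > bar2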



You can simply run the three joins in the background in parallel with (cmd args) & and then wait for all of them to finish. If the joins take 1 second and the awk part before them takes 10 minutes, this will not help much.
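
A minimal sketch of that idea, reusing the temporary-file names from the question (each result is moved into place only after its own join has succeeded):

# run the three joins concurrently, one background job per output file
( join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2 ) &
( join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar3 output3 ) &
( join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar4 output4 ) &
wait    # block until all three background jobs have exited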



You can also put the wait before the cat output1 tmp5 ... line and before the final sort -k1,1 ... line. For this to work you'll have to name the temporary files differently and rename them just before the joins. The idea is to generate the input for the three parallel joins for the first file, wait, rename the files, run the joins in the background, and generate the next inputs while they run. After the loop completes, just wait for the last joins to finish. This will help if the awk part consumes CPU time comparable to the joins.
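
A rough sketch of that pipelining (the next*/cur* names and the initial touch are illustrative additions, not part of the original script; the output1/tmp1 bookkeeping is left out for brevity):

touch output2 output3 output4      # make sure the join targets exist on the first round
while read FILE; do
    # extract the next round's columns in the foreground; this overlaps
    # with the previous round's joins still running in the background
    sort -k1,1 "$FILE" | awk -v OFS='\t' \
        '{print $1,$9 > "next2"; print $1,$12 > "next3"; print $1,$16 > "next4"}'
    wait                           # the previous round's joins must be finished now
    for i in 2 3 4; do mv next$i cur$i; done
    for i in 2 3 4; do
        ( join -e "NA" -a1 -a2 -t $'\t' output$i -o auto cur$i > bar$i \
              && mv bar$i output$i ) &
    done
done < files.list
wait                               # let the final round of joins finish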



HTH; you can build even more complex parallel execution scenarios along these lines.

answered Jan 3 at 9:03 by Doncho Gunchev (edited Jan 3 at 9:09)
  • Thanks for the tip! It's actually the sort command in the first line inside the while-loop that takes up most of the time. I thought this would be faster than sorting the separate files before the joins, but forgot that I could do those in parallel. I now run the 3 joins in parallel where, within each parallel run, the required columns are first extracted and then sorted. This has halved the running time of my script. – Hill, Jan 3 at 12:59