efficient join >100 files

I have a list of >100 tab-delimited files, each containing 5-8 million rows and 16 columns (always in the same exact order). From each file I need to extract 5 specific columns, including one identifier column. My final output (using 3 input files as an example) should be 4 files, containing the following columns:




  • output1: ID, VAR1

  • output2: VAR2.1, VAR2.2, VAR2.3

  • output3: VAR3.1, VAR3.2, VAR3.3

  • output4: VAR4.1, VAR4.2, VAR4.3


where ".1", ".2", and ".3" indicate that the column originates from the first, second and third input file, respectively.



My problem is that the input files contain partially overlapping IDs and I need to extract the union of these rows (i.e. all IDs that occur at least once in any of the input files). To be more exact, output1 should contain the union of the "ID"- and "VAR1"-columns of all input files. The row order of the remaining output files should be identical to output1. Finally, rows not present in a given input file should be padded with "NA" in output2, output3 and output4.
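
For illustration, with two input files instead of three and made-up IDs and values: if file 1 contains IDs A and B, and file 2 contains IDs B and C, output2 would look like this (tab-delimited, columns ID, VAR2.1, VAR2.2):

A	0.1	NA
B	0.2	0.5
C	NA	0.7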



I'm using a combination of a while-loop, awk and join to get the job done, but it takes quite some time. I'd like to know whether there's a faster way to get this done, because I have to run the same script over and over with varying input files.



My script so far:



# 1-based indices of the columns to extract
ID=1
VAR1=6
VAR2=9
VAR3=12
VAR4=16
while read FILE; do
    # sort on the ID column, then split the ID plus each VAR column into its own temp file
    sort -k${ID},${ID} < ${FILE} | awk -v ID=${ID} -v VAR1=${VAR1} -v VAR2=${VAR2} -v VAR3=${VAR3} -v VAR4=${VAR4} \
        'BEGIN{OFS="\t"};{print $ID,$VAR1 > "tmp1"; print $ID,$VAR2 > "tmp2"; print $ID,$VAR3 > "tmp3"; print $ID,$VAR4 > "tmp4"}'
    # collect rows whose IDs are not yet present in output1
    awk 'FNR==NR{a[$1]=$1;next};{if(($1 in a)==0){print $0 > "tmp5"}}' output1 tmp1
    cat output1 tmp5 > foo && mv foo output1
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar2 output3
    join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar2 output4
    rm tmp?
done < files.list
sort -k1,1 output1 > foo && mv foo output1


Final remark: I use cat for output1 because all values in VAR1 for the same ID are identical across all input files (I've made sure of that when I pre-process my files). So I can just append rows that are not already included to the bottom of output1 and sort the final output file.

bash performance join merge

asked Jan 3 at 8:52 by Hill (edited Jan 3 at 18:51)
  • why do you sort? – NEGR KITAEC, Jan 3 at 8:56

  • good question NEGR KITAEC and check the bar2 && mv bar2 output2 and the next line bar3 && mv bar2 .... – Doncho Gunchev, Jan 3 at 9:15

  • Is ouptut in cat ouptut1 tmp5 > foo a mistake in your original script or only here? – Walter A, Jan 3 at 9:57

  • The first awk command uses ID, VAR1, VAR2, VAR3, VAR4 as input variables. When are they filled? – Walter A, Jan 3 at 10:06

  • Where do you get the ID? Is that the first column, and VAR1 the second column? – Walter A, Jan 3 at 10:09

1 Answer
First you have to figure out where most of the time is lost. You can wrap each step in echo "running X"; time ./X and make sure you are not trying to optimize the fastest part of the script.
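
For example, you could time the stages of one loop iteration separately, along these lines (a minimal sketch; file names and columns follow the question's script):

echo "running sort";  time sort -k1,1 "$FILE" > sorted.tmp
echo "running joins"; time join -e "NA" -a1 -a2 -t $'\t' output2 -o auto tmp2 > bar2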



You can simply run the three joins in the background in parallel with (cmd args) & and then wait for all of them to finish. If the joins take 1 second and the awk part before them takes 10 minutes, this will not help much.
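
A minimal sketch of that idea, reusing the temporary-file names from the question (each result is moved into place only after its own join has succeeded):

# run the three joins concurrently, one background job per output file
( join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output2 -o auto tmp2 > bar2 && mv bar2 output2 ) &
( join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output3 -o auto tmp3 > bar3 && mv bar3 output3 ) &
( join -e "NA" -a1 -a2 -t $'\t' -1 1 -2 1 output4 -o auto tmp4 > bar4 && mv bar4 output4 ) &
wait    # block until all three background jobs have exited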



You can also put the wait before the cat output1 tmp5 ... line and before the final sort -k1,1 ... line. For this to work you'll have to name the temporary files differently and rename them just before the joins. The idea is to generate the input for the three parallel joins for the first file, wait, rename the files, run the joins in the background, and generate the next inputs while they run. After the loop completes, just wait for the last joins to finish. This will help if the awk part consumes CPU time comparable to the joins.
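
A rough sketch of that pipelining (the next*/cur* names and the initial touch are illustrative additions, not part of the original script; the output1/tmp1 bookkeeping is left out for brevity):

touch output2 output3 output4      # make sure the join targets exist on the first round
while read FILE; do
    # extract the next round's columns in the foreground; this overlaps
    # with the previous round's joins still running in the background
    sort -k1,1 "$FILE" | awk -v OFS='\t' \
        '{print $1,$9 > "next2"; print $1,$12 > "next3"; print $1,$16 > "next4"}'
    wait                           # the previous round's joins must be finished now
    for i in 2 3 4; do mv next$i cur$i; done
    for i in 2 3 4; do
        ( join -e "NA" -a1 -a2 -t $'\t' output$i -o auto cur$i > bar$i \
              && mv bar$i output$i ) &
    done
done < files.list
wait                               # let the final round of joins finish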



HTH; you can build even more complex parallel execution scenarios along these lines.

answered Jan 3 at 9:03 by Doncho Gunchev (edited Jan 3 at 9:09)
  • Thanks for the tip! It's actually the sort command in the first line inside the while-loop that takes up most of the time. I thought this would be faster than sorting the separate files before the joins, but forgot that I could do those in parallel. I now run the 3 joins in parallel where, within each parallel run, the required columns are first extracted and then sorted. This has halved the running time of my script. – Hill, Jan 3 at 12:59