I have ~250 folders, one per day.
Each folder contains 24 parquet files.
I need to read them all, apply a function to the data, and write the transformed result back out.

When writing, I am doing this:

    df
      .repartition('date)
      .write
      .partitionBy("date")
      .mode(SaveMode.Overwrite)
      .parquet(outputPath)

But this "loses" the original split into 24 parts per date and writes a single file per date. Is there any option to split each day into n parts?

asked Nov 20 '18 at 13:46

Amir H.

scala apache-spark apache-spark-sql parquet

1 Answer

You can specify the number of target partitions when calling repartition (see the Dataset scaladoc):

    df
      .repartition(numPartitions = 24, 'date)
      .write
      .partitionBy("date")
      .mode(SaveMode.Overwrite)
      .parquet(outputPath)

Edit

I just realized that numPartitions is the total number of resulting partitions, not the number per date. You could therefore try passing the number of days times the number of splits you want per day, e.g. numPartitions = 24 * 250. However, there is no guarantee that every day will end up with exactly 24 splits, especially if the amount of data per day differs drastically.
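One caveat worth checking before relying on a larger numPartitions: repartition hashes rows on the given expressions, so when 'date is the only expression, all rows of one date land in the same partition and each date directory would still receive a single file. A common workaround is to hash on the date plus a random "salt" bucket. The following is a minimal sketch under assumptions: the helper name `writeSplit`, the column name `salt`, and the parameters `filesPerDate` and `numPartitions` are made up for illustration and are not part of Spark or of the original post.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.{col, floor, rand}

object SplitByDate {
  // Hypothetical helper (not a Spark API): writes at most `filesPerDate`
  // parquet files into each date=... directory under `outputPath`.
  def writeSplit(df: DataFrame,
                 outputPath: String,
                 filesPerDate: Int,
                 numPartitions: Int): Unit = {
    // "salt" is an illustrative column name: a random bucket id in [0, filesPerDate).
    val salted = df.withColumn("salt", floor(rand() * filesPerDate).cast("int"))

    salted
      // Hash on (date, salt) so that one date can spread over several partitions.
      .repartition(numPartitions, col("date"), col("salt"))
      // The salt is only needed for the shuffle, so drop it before writing.
      .drop("salt")
      .write
      .partitionBy("date")
      .mode(SaveMode.Overwrite)
      .parquet(outputPath)
  }
}
```

For the case in the question this could be called as `SplitByDate.writeSplit(df, outputPath, filesPerDate = 24, numPartitions = 24 * 250)`. The file count per date is an upper bound rather than a guarantee: two salt buckets of the same date can hash to the same partition, and the random salt only balances file sizes approximately.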

edited Nov 20 '18 at 16:08

answered Nov 20 '18 at 13:53

Luis Miguel Mejía Suárez

Control number of target parquet files
