How to use the new Hadoop Parquet magic committer with a custom S3 server from Spark
I have Spark 2.4.0 and Hadoop 3.1.1. According to the Hadoop documentation, to use the new magic committer, which allows writing Parquet files to S3 consistently, I have set these values in conf/spark-defaults.conf:
spark.sql.sources.commitProtocolClass com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name magic
spark.hadoop.fs.s3a.committer.magic.enabled true
When using this configuration I end up with the exception:
java.lang.ClassNotFoundException: com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
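For reference, the exception is raised as soon as any Parquet write to an s3a:// path runs; the bucket and file names below are placeholders rather than my real paths:
val df = spark.read.option("header", "true").csv("s3a://some-bucket/input.csv") // placeholder input
df.write.parquet("s3a://some-bucket/output.parquet") // fails during commit setup with the ClassNotFoundException above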
My question is twofold: first, do I understand correctly that Hadoop 3.1.1 allows writing Parquet files to S3 consistently?
Second, if I have understood correctly, how do I use the new committer properly from Spark?
apache-spark hadoop amazon-s3
asked Nov 20 '18 at 8:32 by Kiwy, edited Nov 20 '18 at 9:33
2 Answers
Kiwy: that's my code; I can help you with this. Some of the classes haven't made it into the ASF Spark releases, but you'll find them in the Hadoop JARs, and I could have a go at building the ASF release with the relevant dependencies in (I could put them in downstream; they used to be there).
You do not need S3Guard turned on to use the "staging committer"; it's only the "magic" variant which needs consistent object-store listings during the commit phase.
answered Nov 20 '18 at 15:00 by Steve Loughran
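To make the staging-committer route concrete, here is a sketch of how the spark-defaults.conf from the question could change; it keeps the same binding classes and only swaps the committer name (valid names are directory, partitioned, magic and file, where directory and partitioned are the two staging variants). Bear in mind that the commit protocol class is exactly one of the classes missing from the plain ASF Spark 2.4.0 distribution, so this assumes the JAR providing it (the Hortonworks spark-cloud-integration module or an equivalent PathOutputCommitProtocol binding) is already on the classpath:
spark.sql.sources.commitProtocolClass com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter
spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory
spark.hadoop.fs.s3a.committer.name directory
The spark.hadoop.fs.s3a.committer.magic.enabled flag from the question is only needed for the magic committer and can be dropped here.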
I shall try this today.
– Kiwy
Nov 21 '18 at 4:21
I won't be able to use the magic committer, so I'm trying `fs.s3a.committer.name=partitioned` and `fs.s3a.committer.staging.conflict-mode=fail`.
So far it's OK. The replace conflict mode would consistently throw a 403 error; I suspect my instance of Swift is not to be relied on. I should try with a recent minio server to confirm how consistent the error is. You've been of great help, Steve, thank you a lot.
– Kiwy
Nov 21 '18 at 8:20
Not tested with either. If it doesn't work, file a bug on the Apache JIRA, component "fs/s3", and link it to "S3A features for Hadoop 3.3".
– Steve Loughran
Nov 21 '18 at 12:48
So far my attempt to convert a 500 GB CSV on S3 to Parquet on S3 in 3393 pieces has failed with this message: filecsv.write.parquet("s3a://bucket/file.parquet") 2018-11-21 12:47:36 ERROR FileFormatWriter:91 - Aborting job 594be6. org.apache.hadoop.fs.s3a.AWSBadRequestException: delete on s3a://bucket/file.parquet/_temporary: com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema. (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: txa745d7; S3 Extended Request ID: null)
– Kiwy
Nov 21 '18 at 13:02
I can confirm that older Swift releases might not support the new partitioned committer. Thank you for your amazing work.
– Kiwy
Nov 30 '18 at 8:12
Edit:
OK, I have two server instances, one of them being a bit old now. I've attempted to use the latest version of minio with these parameters:
sc.hadoopConfiguration.set("hadoop.fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload","true")
sc.hadoopConfiguration.set("hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.multipart.size","128M")
sc.hadoopConfiguration.set("fs.s3a.fast.upload.active.blocks","4")
sc.hadoopConfiguration.set("fs.s3a.committer.name","partitioned")
I'm able to write without trouble so far.
However my Swift server, which is a bit older and needs this extra setting:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
does not seem to support the partitioned committer properly.
Regarding "Hadoop S3guard":
It is not possible currently, Hadoop S3guard that keep metadata of the S3 files must be enable in Hadoop. The S3guard though rely on DynamoDB a proprietary Amazon service.
There's no alternative now like a sqlite file or other DB system to store the metadata.
So if you're using S3 with minio
or any other S3 implementation, you're missing DynamoDB.
This article explains nicely how works S3guard
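For illustration only, this is roughly what enabling S3Guard looks like when DynamoDB is available, i.e. on AWS; the property names are taken from the Hadoop 3.1 S3Guard documentation, I have not tested this, and it is only here to show why a minio or Swift backend cannot provide it:
sc.hadoopConfiguration.set("fs.s3a.metadatastore.impl",
  "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore") // metadata store backed by DynamoDB, hence AWS only
sc.hadoopConfiguration.set("fs.s3a.s3guard.ddb.table.create", "true") // let S3A create the DynamoDB table if it is missing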
Sorry, missed this. If your object store has list consistency then you don't need S3Guard. Use the staging committer and leave it generating unique UUIDs on every file (avoids the problem of write-after-write consistency). As for the signing type, that should not matter. File a JIRA on issues.apache.org/hadoop and include any stack traces.
– Steve Loughran
Jan 8 at 21:51
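As a sketch of what that suggestion means in configuration terms (the property names follow the Hadoop 3.1 S3A committer documentation; treat the exact values as assumptions to check against your Hadoop build rather than a tested setup):
sc.hadoopConfiguration.set("fs.s3a.committer.name", "directory") // or "partitioned"; both are staging committers
sc.hadoopConfiguration.set("fs.s3a.committer.staging.conflict-mode", "fail")
sc.hadoopConfiguration.set("fs.s3a.committer.staging.unique-filenames", "true") // unique file names per write, as suggested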