How to submit Spark jobs to EMR cluster from Airflow?





How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server in the same security group, VPC, and subnet.



I need a solution so that Airflow can talk to EMR and execute spark-submit.



https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/



This blog covers execution after the connection has been established (it didn't help much).



In Airflow, I have created connections for AWS and EMR using the UI:



[screenshot: Airflow connection settings for AWS and EMR]



Below is the code that lists the EMR clusters that are active and terminated; I can also fine-tune it to get only the active clusters:



from airflow.contrib.hooks.aws_hook import AwsHook

hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')

# List clusters in all states (active and terminated)
response = client.list_clusters()
for x in response['Clusters']:
    print(x['Status']['State'], x['Name'])
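
As a sketch of that fine-tuning, list_clusters accepts a ClusterStates filter; treating RUNNING and WAITING as the "active" states is an assumption here:

from airflow.contrib.hooks.aws_hook import AwsHook

hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')

# Only clusters currently able to accept work (assumed "active" states)
active = client.list_clusters(ClusterStates=['RUNNING', 'WAITING'])
for x in active['Clusters']:
    print(x['Status']['State'], x['Name'])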


My question is: how can I update the code above to perform spark-submit actions?










amazon-web-services terraform airflow amazon-emr






asked Jan 3 at 12:15 by Kally; edited Feb 27 at 14:15 by GabLeRoux

  • Hi Kally, please specify what issue you are facing and what you have tried so far. – varnit, Jan 3 at 13:06

  • Hi Kally, can you share what resources you have created and which connection is not working? – pradeep, Jan 3 at 13:39

  • @varnit I have updated the code, which lists all the EMR clusters. How can I find the master server IP of a single EMR cluster so that I can submit my Spark code? – Kally, Jan 3 at 16:45

  • @pradeep I have updated the code, which lists all the EMR clusters. How can I find the master server IP of a single EMR cluster so that I can submit my Spark code? – Kally, Jan 3 at 16:46














2 Answers
While it may not directly address your particular query, broadly speaking, here are some ways you can trigger spark-submit on a (remote) EMR cluster via Airflow:





  1. Use Apache Livy




    • This solution is actually independent of remote server, i.e., EMR


    • Here's an example

    • The downside is that Livy is in its early stages and its API appears incomplete and wonky to me




  2. Use EmrSteps API




    • Dependent on remote system: EMR

    • Robust, but since it is inherently async, you will also need an EmrStepSensor alongside EmrAddStepsOperator (a minimal sketch follows after this list)

    • On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)




  3. Use SSHHook / SSHOperator




    • Again independent of remote system

    • Comparatively easier to get started with

    • If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome
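
To make the EmrSteps option concrete, here is a minimal sketch; the job-flow id, step name, and S3 script path are placeholders, not values from the question:

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

# One EMR step that wraps spark-submit via command-runner.jar
SPARK_STEPS = [{
    'Name': 'my_spark_job',                    # placeholder step name
    'ActionOnFailure': 'CONTINUE',
    'HadoopJarStep': {
        'Jar': 'command-runner.jar',
        'Args': ['spark-submit', '--deploy-mode', 'cluster',
                 's3://my-bucket/my_job.py'],  # placeholder script location
    },
}]

with DAG('emr_spark_submit', start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    add_steps = EmrAddStepsOperator(
        task_id='add_steps',
        job_flow_id='j-XXXXXXXXXXXXX',         # placeholder cluster id (JobFlowId)
        aws_conn_id='aws_default',
        steps=SPARK_STEPS,
    )
    # Adding a step is async, so a sensor polls the step until it completes
    watch_step = EmrStepSensor(
        task_id='watch_step',
        job_flow_id='j-XXXXXXXXXXXXX',
        step_id="{{ task_instance.xcom_pull(task_ids='add_steps')[0] }}",
        aws_conn_id='aws_default',
    )
    add_steps >> watch_step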






EDIT-1



There seems to be another straightforward way





  1. Specifying remote master-IP




    • Independent of remote system


    • Needs modifying Global Configurations / Environment Variables

    • See @cricket_007's answer for details; a rough sketch of the idea follows below
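
As a rough illustration of that idea (the config path and script location are assumptions, and the YARN/Hadoop configuration files must first be copied from the EMR master onto the Airflow machine):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG('remote_spark_submit', start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:
    # HADOOP_CONF_DIR points spark-submit at the remote cluster's YARN;
    # the directory holds config copied from the EMR master (an assumption)
    spark_submit = BashOperator(
        task_id='spark_submit',
        bash_command=(
            'HADOOP_CONF_DIR=/home/airflow/emr-conf '
            'spark-submit --master yarn --deploy-mode cluster '
            's3://my-bucket/my_job.py'  # placeholder script
        ),
    )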






Useful links




  • This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master

  • Spark job submission using Airflow by submitting batch POST method on Livy and tracking job

  • Remote spark-submit to YARN running on EMR






answered Jan 8 at 13:18 by y2k-shubham; edited Mar 6 at 1:01
  • Thank you for the info. My EMR clusters are created by an AWS ASG, and I need a way to pull the single running EMR master cluster from AWS (currently we are running 4 clusters in a single environment). In other words, how can I specify which EMR cluster to spark-submit to? – Kally, Jan 8 at 17:04

  • @Kally if you take the EmrStep route, the cluster id, a.k.a. JobFlowId, is needed to specify which cluster to submit to. Otherwise, you will have to obtain the private IP of that cluster's master (which I think you can easily do via boto3; see the sketch after these comments). While I'm a novice with AWS infrastructure, I believe IAM roles would come in handy for authorization (I assume you already know that). – y2k-shubham, Jan 8 at 17:20

  • See this for hints on how to modify Airflow's built-in operators to work over SSH. – y2k-shubham, Feb 5 at 19:44
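
For reference, a minimal boto3 sketch of the private-IP lookup mentioned above; the cluster id is a placeholder:

from airflow.contrib.hooks.aws_hook import AwsHook

hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')

# Fetch the instances in the MASTER instance group of a given cluster
response = client.list_instances(ClusterId='j-XXXXXXXXXXXXX',
                                 InstanceGroupTypes=['MASTER'])
print(response['Instances'][0]['PrivateIpAddress'])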



















As you have created the EMR cluster using Terraform, you can get the master address from the aws_emr_cluster.my-emr.master_public_dns attribute.



Hope this helps.






answered Jan 3 at 20:16 by pradeep

  • Thank you. How can I authenticate to this master IP server and do spark-submit? – Kally, Jan 4 at 15:48











