Convert String to Double in Scala / Spark?



























I have a JSON data set that contains a price as a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLlib LabeledPoint, and I have managed to split the price string into an array of strings. The code below creates a data set with the correct structure:


import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

case class Obs(f1: Double, f2: Double, price: Array[String])

val obs1 = Obs(1, 2, Array("USD", "5.00"))
val obs2 = Obs(2, 1, Array("USD", "3.00"))

val df = sc.parallelize(Seq(obs1, obs2)).toDF()
df.printSchema
df.show()

// the cast below throws a ClassCastException once an action runs (see below)
val labeled = df.map(row => LabeledPoint(
  row.get(2).asInstanceOf[Array[String]].apply(1).toDouble,
  Vectors.dense(row.getDouble(0), row.getDouble(1))))

labeled.take(2).foreach(println)


The output looks like:



df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
 |-- f1: double (nullable = false)
 |-- f2: double (nullable = false)
 |-- price: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---+---+-----------+
| f1| f2|      price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+


but then I wind up getting a ClassCastException:



java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;


I think the ClassCastException is due to the println, but I didn't expect it. How can I handle this situation?



The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a DataFrame" remains. I'll let the mods determine whether this is truly a dupe.
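For reference, the splitting step described above can be sketched in plain Scala (this is an assumed reconstruction, not the asker's actual code):

```scala
// Hedged sketch of the splitting step: turning the raw price string
// "USD 5.00" into an Array("USD", "5.00"), then parsing the numeric part.
val raw = "USD 5.00"
val parts: Array[String] = raw.split(" ")
val amount: Double = parts(1).toDouble
```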










      scala apache-spark apache-spark-mllib






asked Oct 23 '15 at 1:16 by schnee; edited Apr 25 '16 at 10:24 by zero323
2 Answers
































Let me propose an alternative solution which I believe is much cleaner than playing with asInstanceOf:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val labeled = assembler.transform(df)
  .select($"price".getItem(1).cast("double"), $"features")
  .map { case Row(price: Double, features: Vector) =>
    LabeledPoint(price, features)
  }


Regarding your problem: an ArrayType is stored in a Row as a WrappedArray, hence the error you see. You can either use

import scala.collection.mutable.WrappedArray

row.getAs[WrappedArray[String]](2)


or simply

row.getAs[Seq[String]](2)




answered Oct 23 '15 at 16:35 by zero323; edited Oct 25 '15 at 6:28
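The getAs[Seq[String]] idea can be demonstrated without Spark at all. A minimal plain-Scala sketch, where Array(...).toSeq stands in for the value Spark hands back for an array column (on Scala 2.12 and earlier that value is in fact a WrappedArray):

```scala
// Spark returns array<string> columns as a Scala Seq wrapper, not a JVM array.
// Accessing the cell through the Seq interface therefore works.
val cell: Any = Array("USD", "5.00").toSeq
val parts = cell.asInstanceOf[Seq[String]]  // safe: the wrapper is a Seq
val label = parts(1).toDouble
```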















































I think the problem is here:

.asInstanceOf[Array[String]]





answered Oct 23 '15 at 1:25 by terma
























• can you clarify? A row consists of (double, double, Array[String]), so getting the third element as an Array[String] seems like the right thing to do.

  – schnee
  Oct 23 '15 at 12:18
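A plain-Scala sketch (assumed, no Spark needed) of why that intuition fails: the schema reports array<string>, but the runtime object in the Row is a Scala Seq wrapper, not a JVM Array[String], so the cast throws:

```scala
// Array(...).toSeq stands in for the value Spark stores in the Row
// (a WrappedArray on Scala 2.12 and earlier).
val cell: Any = Array("USD", "3.00").toSeq
assert(!cell.isInstanceOf[Array[String]])  // this is why asInstanceOf throws
assert(cell.isInstanceOf[Seq[_]])          // but Seq access works
val price = cell.asInstanceOf[Seq[String]](1).toDouble
```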










