Convert String to Double in Scala / Spark?



























I have a JSON data set that contains a price as a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLlib LabeledPoint, and I have managed to split the price string into an array of strings. The code below creates a data set with the correct structure:


import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

case class Obs(f1: Double, f2: Double, price: Array[String])

val obs1 = Obs(1, 2, Array("USD", "5.00"))
val obs2 = Obs(2, 1, Array("USD", "3.00"))

val df = sc.parallelize(Seq(obs1, obs2)).toDF()
df.printSchema
df.show()

// the cast below throws a ClassCastException once an action runs (see below)
val labeled = df.map(row => LabeledPoint(
  row.get(2).asInstanceOf[Array[String]].apply(1).toDouble,
  Vectors.dense(row.getDouble(0), row.getDouble(1))))

labeled.take(2).foreach(println)


The output looks like:



df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
 |-- f1: double (nullable = false)
 |-- f2: double (nullable = false)
 |-- price: array (nullable = true)
 |    |-- element: string (containsNull = true)

+---+---+-----------+
| f1| f2|      price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+


but then I wind up getting a ClassCastException:



java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;


I think the ClassCastException is due to the println, but I didn't expect it. How can I handle this situation?



The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a DataFrame" remains. I'll let the mods determine whether this is truly a dupe.
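For reference, the splitting step described above can be sketched in plain Scala (this is an assumed reconstruction, not the asker's actual code):

```scala
// Hedged sketch of the splitting step: turning the raw price string
// "USD 5.00" into an Array("USD", "5.00"), then parsing the numeric part.
val raw = "USD 5.00"
val parts: Array[String] = raw.split(" ")
val amount: Double = parts(1).toDouble
```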










      scala apache-spark apache-spark-mllib






asked Oct 23 '15 at 1:16 by schnee; edited Apr 25 '16 at 10:24 by zero323
2 Answers
































Let me propose an alternative solution which I believe is much cleaner than playing with asInstanceOf:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row

val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")

val labeled = assembler.transform(df)
  .select($"price".getItem(1).cast("double"), $"features")
  .map { case Row(price: Double, features: Vector) =>
    LabeledPoint(price, features)
  }


Regarding your problem: an ArrayType is stored in a Row as a WrappedArray, hence the error you see. You can either use

import scala.collection.mutable.WrappedArray

row.getAs[WrappedArray[String]](2)


or simply

row.getAs[Seq[String]](2)




answered Oct 23 '15 at 16:35 by zero323; edited Oct 25 '15 at 6:28
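The getAs[Seq[String]] idea can be demonstrated without Spark at all. A minimal plain-Scala sketch, where Array(...).toSeq stands in for the value Spark hands back for an array column (on Scala 2.12 and earlier that value is in fact a WrappedArray):

```scala
// Spark returns array<string> columns as a Scala Seq wrapper, not a JVM array.
// Accessing the cell through the Seq interface therefore works.
val cell: Any = Array("USD", "5.00").toSeq
val parts = cell.asInstanceOf[Seq[String]]  // safe: the wrapper is a Seq
val label = parts(1).toDouble
```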















































I think the problem is here:

.asInstanceOf[Array[String]]





answered Oct 23 '15 at 1:25 by terma
























• can you clarify? A row consists of (double, double, Array[String]), so getting the third element as an Array[String] seems like the right thing to do.

  – schnee
  Oct 23 '15 at 12:18
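A plain-Scala sketch (assumed, no Spark needed) of why that intuition fails: the schema reports array<string>, but the runtime object in the Row is a Scala Seq wrapper, not a JVM Array[String], so the cast throws:

```scala
// Array(...).toSeq stands in for the value Spark stores in the Row
// (a WrappedArray on Scala 2.12 and earlier).
val cell: Any = Array("USD", "3.00").toSeq
assert(!cell.isInstanceOf[Array[String]])  // this is why asInstanceOf throws
assert(cell.isInstanceOf[Seq[_]])          // but Seq access works
val price = cell.asInstanceOf[Seq[String]](1).toDouble
```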










