Convert String to Double in Scala / Spark?
I have JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLLIB LabeledPoint, and have managed to split the price string into an array of string. The below creates a data set with the correct structure:
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1: Double, f2: Double, price: Array[String])
val obs1 = new Obs(1,2,Array("USD", "5.00"))
val obs2 = new Obs(2,1,Array("USD", "3.00"))
val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema
df.show()
val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
The output looks like:
df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
|-- f1: double (nullable = false)
|-- f2: double (nullable = false)
|-- price: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+-----------+
| f1| f2| price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+
but then I wind up getting a ClassCastException:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
I think the ClassCastException is due to the println. But I didn't expect it; how can I handle this situation?
The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a dataframe remain". I'll let the mods determine if this is truly a dupe.
scala apache-spark apache-spark-mllib
add a comment |
I have JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLLIB LabeledPoint, and have managed to split the price string into an array of string. The below creates a data set with the correct structure:
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1: Double, f2: Double, price: Array[String])
val obs1 = new Obs(1,2,Array("USD", "5.00"))
val obs2 = new Obs(2,1,Array("USD", "3.00"))
val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema
df.show()
val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
The output looks like:
df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
|-- f1: double (nullable = false)
|-- f2: double (nullable = false)
|-- price: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+-----------+
| f1| f2| price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+
but then I wind up getting a ClassCastException:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
I think the ClassCastException is due to the println. But I didn't expect it; how can I handle this situation?
The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a dataframe remain". I'll let the mods determine if this is truly a dupe.
scala apache-spark apache-spark-mllib
add a comment |
I have JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLLIB LabeledPoint, and have managed to split the price string into an array of string. The below creates a data set with the correct structure:
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1: Double, f2: Double, price: Array[String])
val obs1 = new Obs(1,2,Array("USD", "5.00"))
val obs2 = new Obs(2,1,Array("USD", "3.00"))
val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema
df.show()
val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
The output looks like:
df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
|-- f1: double (nullable = false)
|-- f2: double (nullable = false)
|-- price: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+-----------+
| f1| f2| price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+
but then I wind up getting a ClassCastException:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
I think the ClassCastException is due to the println. But I didn't expect it; how can I handle this situation?
The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a dataframe remain". I'll let the mods determine if this is truly a dupe.
scala apache-spark apache-spark-mllib
I have JSON data set that contains a price in a string like "USD 5.00". I'd like to convert the numeric portion to a Double to use in an MLLIB LabeledPoint, and have managed to split the price string into an array of string. The below creates a data set with the correct structure:
import org.apache.spark.mllib.linalg.{Vector,Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
case class Obs(f1: Double, f2: Double, price: Array[String])
val obs1 = new Obs(1,2,Array("USD", "5.00"))
val obs2 = new Obs(2,1,Array("USD", "3.00"))
val df = sc.parallelize(Seq(obs1,obs2)).toDF()
df.printSchema
df.show()
val labeled = df.map(row => LabeledPoint(row.get(2).asInstanceOf[Array[String]].apply(1).toDouble, Vectors.dense(row.getDouble(0), row.getDouble(1))))
labeled.take(2).foreach(println)
The output looks like:
df: org.apache.spark.sql.DataFrame = [f1: double, f2: double, price: array<string>]
root
|-- f1: double (nullable = false)
|-- f2: double (nullable = false)
|-- price: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+-----------+
| f1| f2| price|
+---+---+-----------+
|1.0|2.0|[USD, 5.00]|
|2.0|1.0|[USD, 3.00]|
+---+---+-----------+
but then I wind up getting a ClassCastException:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
I think the ClassCastException is due to the println. But I didn't expect it; how can I handle this situation?
The potential duplicate solved one part of my question (thanks), but the deeper question of "promoting elements of a struct in a dataframe remain". I'll let the mods determine if this is truly a dupe.
scala apache-spark apache-spark-mllib
scala apache-spark apache-spark-mllib
edited Apr 25 '16 at 10:24
zero323
164k39477572
164k39477572
asked Oct 23 '15 at 1:16
schneeschnee
600616
600616
add a comment |
add a comment |
2 Answers
2
active
oldest
votes
Let me propose an alternative solution which I believe is much cleaner than playing with all asInstanceOf
:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row
val assembler = new VectorAssembler()
.setInputCols(Array("f1", "f2"))
.setOutputCol("features")
val labeled = assembler.transform(df)
.select($"price".getItem(1).cast("double"), $"features")
.map{case Row(price: Double, features: Vector) =>
LabeledPoint(price, features)}
Regarding your problem ArrayType
is stored in Row
as a WrappedArray
hence the error you see. You can either use
import scala.collection.mutable.WrappedArray
row.getAs[WrappedArray[String]](2)
or simply
row.getAs[Seq[String]](2)
add a comment |
I think problem here:
.asInstanceOf[Array[String]]
can you clarify? A row consists of double,double,Array[String]] so getting the third element as an Array[[String]] seems like the right thing to do.
– schnee
Oct 23 '15 at 12:18
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f33293365%2fconvert-string-to-double-in-scala-spark%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
Let me propose an alternative solution which I believe is much cleaner than playing with all asInstanceOf
:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row
val assembler = new VectorAssembler()
.setInputCols(Array("f1", "f2"))
.setOutputCol("features")
val labeled = assembler.transform(df)
.select($"price".getItem(1).cast("double"), $"features")
.map{case Row(price: Double, features: Vector) =>
LabeledPoint(price, features)}
Regarding your problem ArrayType
is stored in Row
as a WrappedArray
hence the error you see. You can either use
import scala.collection.mutable.WrappedArray
row.getAs[WrappedArray[String]](2)
or simply
row.getAs[Seq[String]](2)
add a comment |
Let me propose an alternative solution which I believe is much cleaner than playing with all asInstanceOf
:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row
val assembler = new VectorAssembler()
.setInputCols(Array("f1", "f2"))
.setOutputCol("features")
val labeled = assembler.transform(df)
.select($"price".getItem(1).cast("double"), $"features")
.map{case Row(price: Double, features: Vector) =>
LabeledPoint(price, features)}
Regarding your problem ArrayType
is stored in Row
as a WrappedArray
hence the error you see. You can either use
import scala.collection.mutable.WrappedArray
row.getAs[WrappedArray[String]](2)
or simply
row.getAs[Seq[String]](2)
add a comment |
Let me propose an alternative solution which I believe is much cleaner than playing with all asInstanceOf
:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row
val assembler = new VectorAssembler()
.setInputCols(Array("f1", "f2"))
.setOutputCol("features")
val labeled = assembler.transform(df)
.select($"price".getItem(1).cast("double"), $"features")
.map{case Row(price: Double, features: Vector) =>
LabeledPoint(price, features)}
Regarding your problem ArrayType
is stored in Row
as a WrappedArray
hence the error you see. You can either use
import scala.collection.mutable.WrappedArray
row.getAs[WrappedArray[String]](2)
or simply
row.getAs[Seq[String]](2)
Let me propose an alternative solution which I believe is much cleaner than playing with all asInstanceOf
:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.Row
val assembler = new VectorAssembler()
.setInputCols(Array("f1", "f2"))
.setOutputCol("features")
val labeled = assembler.transform(df)
.select($"price".getItem(1).cast("double"), $"features")
.map{case Row(price: Double, features: Vector) =>
LabeledPoint(price, features)}
Regarding your problem ArrayType
is stored in Row
as a WrappedArray
hence the error you see. You can either use
import scala.collection.mutable.WrappedArray
row.getAs[WrappedArray[String]](2)
or simply
row.getAs[Seq[String]](2)
edited Oct 25 '15 at 6:28
answered Oct 23 '15 at 16:35
zero323zero323
164k39477572
164k39477572
add a comment |
add a comment |
I think problem here:
.asInstanceOf[Array[String]]
can you clarify? A row consists of double,double,Array[String]] so getting the third element as an Array[[String]] seems like the right thing to do.
– schnee
Oct 23 '15 at 12:18
add a comment |
I think problem here:
.asInstanceOf[Array[String]]
can you clarify? A row consists of double,double,Array[String]] so getting the third element as an Array[[String]] seems like the right thing to do.
– schnee
Oct 23 '15 at 12:18
add a comment |
I think problem here:
.asInstanceOf[Array[String]]
I think problem here:
.asInstanceOf[Array[String]]
answered Oct 23 '15 at 1:25
termaterma
9611714
9611714
can you clarify? A row consists of double,double,Array[String]] so getting the third element as an Array[[String]] seems like the right thing to do.
– schnee
Oct 23 '15 at 12:18
add a comment |
can you clarify? A row consists of double,double,Array[String]] so getting the third element as an Array[[String]] seems like the right thing to do.
– schnee
Oct 23 '15 at 12:18
can you clarify? A row consists of double,double,Array[String]] so getting the third element as an Array[[String]] seems like the right thing to do.
– schnee
Oct 23 '15 at 12:18
can you clarify? A row consists of double,double,Array[String]] so getting the third element as an Array[[String]] seems like the right thing to do.
– schnee
Oct 23 '15 at 12:18
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f33293365%2fconvert-string-to-double-in-scala-spark%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown