Spark SQL - createDataFrame wrong struct schema

When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:

some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},

             {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]

spark.createDataFrame([Row(**d) for d in some_data]).printSchema()

The resulting DataFrame's schema is:

root

 |--  some-column: array (nullable = true)

 |    |-- element: map (containsNull = true)

 |    |    |-- key: string

 |    |    |-- value: long (valueContainsNull = true)

This schema is wrong, as strVal column is of string type (and indeed collecting on this DataFrame would result in nulls on this column).

I'd expect for the schema to be an Array of appropriate Structs - inferred with a bit of Python reflection on the types of values.
Why is this not the case?
Is there anything I can do besides providing the schema explicitly in this case?

edited Nov 20 '18 at 9:15

user10465355

1,4221413

asked Nov 19 '18 at 23:22

user976850

4411616

add a comment |

When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:

some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},

             {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]

spark.createDataFrame([Row(**d) for d in some_data]).printSchema()

The resulting DataFrame's schema is:

root

 |--  some-column: array (nullable = true)

 |    |-- element: map (containsNull = true)

 |    |    |-- key: string

 |    |    |-- value: long (valueContainsNull = true)

This schema is wrong, as strVal column is of string type (and indeed collecting on this DataFrame would result in nulls on this column).

edited Nov 20 '18 at 9:15

user10465355

1,4221413

asked Nov 19 '18 at 23:22

user976850

4411616

add a comment |

When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:

some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},

             {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]

spark.createDataFrame([Row(**d) for d in some_data]).printSchema()

The resulting DataFrame's schema is:

root

 |--  some-column: array (nullable = true)

 |    |-- element: map (containsNull = true)

 |    |    |-- key: string

 |    |    |-- value: long (valueContainsNull = true)

This schema is wrong, as strVal column is of string type (and indeed collecting on this DataFrame would result in nulls on this column).

edited Nov 20 '18 at 9:15

user10465355

1,4221413

asked Nov 19 '18 at 23:22

user976850

4411616

When trying to create a DataFrame with Spark SQL by passing it a list of Rows like so:

some_data = [{'some-column': [{'timestamp': 1353534535353, 'strVal': 'some-string'}]},

             {'some-column': [{'timestamp': 1353534535354, 'strVal': 'another-string'}]}]

spark.createDataFrame([Row(**d) for d in some_data]).printSchema()

The resulting DataFrame's schema is:

root

 |--  some-column: array (nullable = true)

 |    |-- element: map (containsNull = true)

 |    |    |-- key: string

 |    |    |-- value: long (valueContainsNull = true)

This schema is wrong, as strVal column is of string type (and indeed collecting on this DataFrame would result in nulls on this column).

apache-spark dataframe pyspark apache-spark-sql schema

edited Nov 20 '18 at 9:15

user10465355

1,4221413

asked Nov 19 '18 at 23:22

user976850

4411616

edited Nov 20 '18 at 9:15

user10465355

1,4221413

asked Nov 19 '18 at 23:22

user976850

4411616

edited Nov 20 '18 at 9:15

user10465355

1,4221413

edited Nov 20 '18 at 9:15

user10465355

1,4221413

edited Nov 20 '18 at 9:15

user10465355

1,4221413

asked Nov 19 '18 at 23:22

user976850

4411616

asked Nov 19 '18 at 23:22

user976850

4411616

asked Nov 19 '18 at 23:22

user976850

4411616

add a comment |

1 Answer
1

active

oldest

votes

This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.

To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):

from pyspark.sql import Row



Outer = Row("some-column")

Inner = Row("timestamp", "strVal")



spark.createDataFrame([

    Outer([Inner(1353534535353, 'some-string')]),

    Outer([Inner(1353534535354, 'another-string')])

]).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- timestamp: long (nullable = true)

 |    |    |-- strVal: string (nullable = true)

With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:

import json



spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- strVal: string (nullable = true)

 |    |    |-- timestamp: long (nullable = true)

or explicit schema:

from pyspark.sql.types import *



schema = StructType([StructField(

    "some-column", ArrayType(StructType([

        StructField("timestamp", LongType()), 

        StructField("strVal", StringType())])

))])



spark.createDataFrame(some_data, schema)

although the last method might not be fully robust.

edited Nov 20 '18 at 0:58

answered Nov 20 '18 at 0:48

user10465355

1,4221413

Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

– user976850
Nov 20 '18 at 9:05

Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

– user10465355
Nov 20 '18 at 11:24

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384072%2fspark-sql-createdataframe-wrong-struct-schema%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.

To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):

from pyspark.sql import Row



Outer = Row("some-column")

Inner = Row("timestamp", "strVal")



spark.createDataFrame([

    Outer([Inner(1353534535353, 'some-string')]),

    Outer([Inner(1353534535354, 'another-string')])

]).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- timestamp: long (nullable = true)

 |    |    |-- strVal: string (nullable = true)

With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:

import json



spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- strVal: string (nullable = true)

 |    |    |-- timestamp: long (nullable = true)

or explicit schema:

from pyspark.sql.types import *



schema = StructType([StructField(

    "some-column", ArrayType(StructType([

        StructField("timestamp", LongType()), 

        StructField("strVal", StringType())])

))])



spark.createDataFrame(some_data, schema)

although the last method might not be fully robust.

edited Nov 20 '18 at 0:58

answered Nov 20 '18 at 0:48

user10465355

1,4221413

Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

– user976850
Nov 20 '18 at 9:05

Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

– user10465355
Nov 20 '18 at 11:24

add a comment |

This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.

To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):

from pyspark.sql import Row



Outer = Row("some-column")

Inner = Row("timestamp", "strVal")



spark.createDataFrame([

    Outer([Inner(1353534535353, 'some-string')]),

    Outer([Inner(1353534535354, 'another-string')])

]).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- timestamp: long (nullable = true)

 |    |    |-- strVal: string (nullable = true)

With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:

import json



spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- strVal: string (nullable = true)

 |    |    |-- timestamp: long (nullable = true)

or explicit schema:

from pyspark.sql.types import *



schema = StructType([StructField(

    "some-column", ArrayType(StructType([

        StructField("timestamp", LongType()), 

        StructField("strVal", StringType())])

))])



spark.createDataFrame(some_data, schema)

although the last method might not be fully robust.

edited Nov 20 '18 at 0:58

answered Nov 20 '18 at 0:48

user10465355

1,4221413

Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

– user976850
Nov 20 '18 at 9:05

Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

– user10465355
Nov 20 '18 at 11:24

add a comment |

This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.

To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):

from pyspark.sql import Row



Outer = Row("some-column")

Inner = Row("timestamp", "strVal")



spark.createDataFrame([

    Outer([Inner(1353534535353, 'some-string')]),

    Outer([Inner(1353534535354, 'another-string')])

]).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- timestamp: long (nullable = true)

 |    |    |-- strVal: string (nullable = true)

With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:

import json



spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- strVal: string (nullable = true)

 |    |    |-- timestamp: long (nullable = true)

or explicit schema:

from pyspark.sql.types import *



schema = StructType([StructField(

    "some-column", ArrayType(StructType([

        StructField("timestamp", LongType()), 

        StructField("strVal", StringType())])

))])



spark.createDataFrame(some_data, schema)

although the last method might not be fully robust.

edited Nov 20 '18 at 0:58

answered Nov 20 '18 at 0:48

user10465355

1,4221413

This happens because the structure doesn't encode what you mean. As explained in the SQL guide Python dict is mapped to MapType.

To work with structures you should use nested Rows (namedtuples are preferred in general, but require valid name identifiers):

from pyspark.sql import Row



Outer = Row("some-column")

Inner = Row("timestamp", "strVal")



spark.createDataFrame([

    Outer([Inner(1353534535353, 'some-string')]),

    Outer([Inner(1353534535354, 'another-string')])

]).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- timestamp: long (nullable = true)

 |    |    |-- strVal: string (nullable = true)

With the structure you have at the moment, the scheme outcome could be achieved with intermediate JSON:

import json



spark.read.json(sc.parallelize(some_data).map(json.dumps)).printSchema()

root

 |-- some-column: array (nullable = true)

 |    |-- element: struct (containsNull = true)

 |    |    |-- strVal: string (nullable = true)

 |    |    |-- timestamp: long (nullable = true)

or explicit schema:

from pyspark.sql.types import *



schema = StructType([StructField(

    "some-column", ArrayType(StructType([

        StructField("timestamp", LongType()), 

        StructField("strVal", StringType())])

))])



spark.createDataFrame(some_data, schema)

although the last method might not be fully robust.

edited Nov 20 '18 at 0:58

answered Nov 20 '18 at 0:48

user10465355

1,4221413

edited Nov 20 '18 at 0:58

answered Nov 20 '18 at 0:48

user10465355

1,4221413

answered Nov 20 '18 at 0:48

user10465355

1,4221413

answered Nov 20 '18 at 0:48

user10465355

1,4221413

Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

– user976850
Nov 20 '18 at 9:05

Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

– user10465355
Nov 20 '18 at 11:24

add a comment |

Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

– user976850
Nov 20 '18 at 9:05

Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

– user10465355
Nov 20 '18 at 11:24

Thanks, the JSON trick is exactly what i was hoping for. Why isn't this the default behavior? e.g. for spark.createDataFrame(rdd, schema) - cant the schema be inferred by doing this very same trick?

– user976850
Nov 20 '18 at 9:05

Schema is inferred according to the linked specification so it couldn't be, without making a whole process full of special cases.

– user10465355
Nov 20 '18 at 11:24

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

Search This Blog

Ufyukyu