How to use lateral view explode in Hive for XML data format?












0















I am trying to load sales data which is in XML format to the Hive table.
Below is a small sample of the data.



I am aware that I can load the data below to Hive if I separate it to several tables and then join them as required. But just wanted to know if I can load them in a single table and the expected output should look like the screenshot attached.



Please help me with the table structure I should use and how can I use the lateral view explode option effectively to achieve this.



Sample data:



  <Store>
<Version>1.1</Version>
<StoreId>16695</StoreId>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>29559</GTIN>
<ItmDsc>CHOCALATE</ItmDsc>
<ItmProm>
<PromCD>CM</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>59653</GTIN>
<ItmDsc>CORN FLAKES</ItmDsc>
</Itm>
<Itm>
<ItmSeq>3</ItmSeq>
<GTIN>42260</GTIN>
<ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
<ItmProm>
<PromCD>MTSRO</PromCD>
<OfferID>11766</OfferID>
</ItmProm>
</Itm>
</Bskt>
<Bskt>
<TillNo>5</TillNo>
<BsktNo>1947</BsktNo>
<DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
<OpID>50063</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>24064</GTIN>
<ItmDsc>TOMATOES 2KG</ItmDsc>
<ItmProm>
<PromCD>INSTORE</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>81287</GTIN>
<ItmDsc>ROTHMANS BLUE</ItmDsc>
<ItmProm>
<PromCD>TF</PromCD>
</ItmProm>
</Itm>
</Bskt>
</Store>


Desired Output



enter image description here



Table structure:



CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
"xmlinput.start"="<Store","xmlinput.end"="</Store>"
);


Output:
enter image description here



Tried below query to read the data, It is not showing the results in the way i want.



select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
LATERAL VIEW explode(DateTime) table1 as basket_dtm
LATERAL VIEW explode(BsktNo) table2 as basket_number
LATERAL VIEW explode(TillNo) table3 as till_number;


Results:



enter image description here










share|improve this question

























  • what have you tried so far? are you getting any error?

    – Gaurang Shah
    Nov 20 '18 at 15:19











  • I have created the table structure as above and loaded all the data in a single row. Now i need to explode the arrays correctly to get the output as attached in the screenshot. I am not familiar with the explode options, just wondering how i can do it

    – Nandakumar_bigdata
    Nov 21 '18 at 0:47











  • Is there any way that I can use the XSD schema file of an XML and load the data in HIve tables. Like we do for AVRO files with AVRO schema? If that is possible, it would be great.

    – Nandakumar_bigdata
    Nov 21 '18 at 0:51











  • I believe that i need to make my query better to get the desired output, any help is much appreciated.

    – Nandakumar_bigdata
    Nov 21 '18 at 3:03
















0















I am trying to load sales data which is in XML format to the Hive table.
Below is a small sample of the data.



I am aware that I can load the data below to Hive if I separate it to several tables and then join them as required. But just wanted to know if I can load them in a single table and the expected output should look like the screenshot attached.



Please help me with the table structure I should use and how can I use the lateral view explode option effectively to achieve this.



Sample data:



  <Store>
<Version>1.1</Version>
<StoreId>16695</StoreId>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>29559</GTIN>
<ItmDsc>CHOCALATE</ItmDsc>
<ItmProm>
<PromCD>CM</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>59653</GTIN>
<ItmDsc>CORN FLAKES</ItmDsc>
</Itm>
<Itm>
<ItmSeq>3</ItmSeq>
<GTIN>42260</GTIN>
<ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
<ItmProm>
<PromCD>MTSRO</PromCD>
<OfferID>11766</OfferID>
</ItmProm>
</Itm>
</Bskt>
<Bskt>
<TillNo>5</TillNo>
<BsktNo>1947</BsktNo>
<DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
<OpID>50063</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>24064</GTIN>
<ItmDsc>TOMATOES 2KG</ItmDsc>
<ItmProm>
<PromCD>INSTORE</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>81287</GTIN>
<ItmDsc>ROTHMANS BLUE</ItmDsc>
<ItmProm>
<PromCD>TF</PromCD>
</ItmProm>
</Itm>
</Bskt>
</Store>


Desired Output



enter image description here



Table structure:



CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
"xmlinput.start"="<Store","xmlinput.end"="</Store>"
);


Output:
enter image description here



Tried below query to read the data, It is not showing the results in the way i want.



select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
LATERAL VIEW explode(DateTime) table1 as basket_dtm
LATERAL VIEW explode(BsktNo) table2 as basket_number
LATERAL VIEW explode(TillNo) table3 as till_number;


Results:



enter image description here










share|improve this question

























  • what have you tried so far? are you getting any error?

    – Gaurang Shah
    Nov 20 '18 at 15:19











  • I have created the table structure as above and loaded all the data in a single row. Now i need to explode the arrays correctly to get the output as attached in the screenshot. I am not familiar with the explode options, just wondering how i can do it

    – Nandakumar_bigdata
    Nov 21 '18 at 0:47











  • Is there any way that I can use the XSD schema file of an XML and load the data in HIve tables. Like we do for AVRO files with AVRO schema? If that is possible, it would be great.

    – Nandakumar_bigdata
    Nov 21 '18 at 0:51











  • I believe that i need to make my query better to get the desired output, any help is much appreciated.

    – Nandakumar_bigdata
    Nov 21 '18 at 3:03














0












0








0








I am trying to load sales data which is in XML format to the Hive table.
Below is a small sample of the data.



I am aware that I can load the data below to Hive if I separate it to several tables and then join them as required. But just wanted to know if I can load them in a single table and the expected output should look like the screenshot attached.



Please help me with the table structure I should use and how can I use the lateral view explode option effectively to achieve this.



Sample data:



  <Store>
<Version>1.1</Version>
<StoreId>16695</StoreId>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>29559</GTIN>
<ItmDsc>CHOCALATE</ItmDsc>
<ItmProm>
<PromCD>CM</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>59653</GTIN>
<ItmDsc>CORN FLAKES</ItmDsc>
</Itm>
<Itm>
<ItmSeq>3</ItmSeq>
<GTIN>42260</GTIN>
<ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
<ItmProm>
<PromCD>MTSRO</PromCD>
<OfferID>11766</OfferID>
</ItmProm>
</Itm>
</Bskt>
<Bskt>
<TillNo>5</TillNo>
<BsktNo>1947</BsktNo>
<DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
<OpID>50063</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>24064</GTIN>
<ItmDsc>TOMATOES 2KG</ItmDsc>
<ItmProm>
<PromCD>INSTORE</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>81287</GTIN>
<ItmDsc>ROTHMANS BLUE</ItmDsc>
<ItmProm>
<PromCD>TF</PromCD>
</ItmProm>
</Itm>
</Bskt>
</Store>


Desired Output



enter image description here



Table structure:



CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
"xmlinput.start"="<Store","xmlinput.end"="</Store>"
);


Output:
enter image description here



Tried below query to read the data, It is not showing the results in the way i want.



select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
LATERAL VIEW explode(DateTime) table1 as basket_dtm
LATERAL VIEW explode(BsktNo) table2 as basket_number
LATERAL VIEW explode(TillNo) table3 as till_number;


Results:



enter image description here










share|improve this question
















I am trying to load sales data which is in XML format to the Hive table.
Below is a small sample of the data.



I am aware that I can load the data below to Hive if I separate it to several tables and then join them as required. But just wanted to know if I can load them in a single table and the expected output should look like the screenshot attached.



Please help me with the table structure I should use and how can I use the lateral view explode option effectively to achieve this.



Sample data:



  <Store>
<Version>1.1</Version>
<StoreId>16695</StoreId>
<Bskt>
<TillNo>4</TillNo>
<BsktNo>1753</BsktNo>
<DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
<OpID>50056</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>29559</GTIN>
<ItmDsc>CHOCALATE</ItmDsc>
<ItmProm>
<PromCD>CM</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>59653</GTIN>
<ItmDsc>CORN FLAKES</ItmDsc>
</Itm>
<Itm>
<ItmSeq>3</ItmSeq>
<GTIN>42260</GTIN>
<ItmDsc> MILK CHOCOLATE 162GM</ItmDsc>
<ItmProm>
<PromCD>MTSRO</PromCD>
<OfferID>11766</OfferID>
</ItmProm>
</Itm>
</Bskt>
<Bskt>
<TillNo>5</TillNo>
<BsktNo>1947</BsktNo>
<DateTime>2017-10-31T16:24:59.000+11:00</DateTime>
<OpID>50063</OpID>
<Itm>
<ItmSeq>1</ItmSeq>
<GTIN>24064</GTIN>
<ItmDsc>TOMATOES 2KG</ItmDsc>
<ItmProm>
<PromCD>INSTORE</PromCD>
</ItmProm>
</Itm>
<Itm>
<ItmSeq>2</ItmSeq>
<GTIN>81287</GTIN>
<ItmDsc>ROTHMANS BLUE</ItmDsc>
<ItmProm>
<PromCD>TF</PromCD>
</ItmProm>
</Itm>
</Bskt>
</Store>


Desired Output



enter image description here



Table structure:



CREATE EXTERNAL TABLE IF NOT EXISTS POC_BASKET_ITEM_PROMO (
`Version` string,
`StoreId` string,
`DateTime` array<string>,
`BsktNo` array<double>,
`TillNo` array<int>,
`Item_Seq_num` array<int>,
`GTIN` array<string>,
`ItmDsc` array<string>,
`Promo_CD` array<string>,
`Offer_ID` array<int>
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.Version"="/Store/Version/text()",
"column.xpath.StoreId"="/Store/StoreId/text()",
"column.xpath.DateTime"="/Store/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Store/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Store/Bskt/TillNo/text()",
"column.xpath.Item_Seq_num"="/Store/Bskt/Itm/ItmSeq/text()",
"column.xpath.GTIN"="/Store/Bskt/Itm/GTIN/text()",
"column.xpath.ItmDsc"="/Store/Bskt/Itm/ItmDsc/text()",
"column.xpath.Promo_CD"="/Store/Bskt/Itm/ItmProm/PromCD/text()",
"column.xpath.Offer_ID"="/Store/Bskt/Itm/ItmProm/OfferID/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://namenode:8020/DEV/TEST/nanda_test'
TBLPROPERTIES (
"xmlinput.start"="<Store","xmlinput.end"="</Store>"
);


Output:
enter image description here



Tried below query to read the data, It is not showing the results in the way i want.



select Version,StoreId,basket_dtm,basket_number,till_number from POC_BASKET_ITEM_PROMO
LATERAL VIEW explode(DateTime) table1 as basket_dtm
LATERAL VIEW explode(BsktNo) table2 as basket_number
LATERAL VIEW explode(TillNo) table3 as till_number;


Results:



enter image description here







xml parsing hadoop hive explode






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Nov 22 '18 at 3:10







Nandakumar_bigdata

















asked Nov 20 '18 at 0:54









Nandakumar_bigdataNandakumar_bigdata

53




53













  • what have you tried so far? are you getting any error?

    – Gaurang Shah
    Nov 20 '18 at 15:19











  • I have created the table structure as above and loaded all the data in a single row. Now i need to explode the arrays correctly to get the output as attached in the screenshot. I am not familiar with the explode options, just wondering how i can do it

    – Nandakumar_bigdata
    Nov 21 '18 at 0:47











  • Is there any way that I can use the XSD schema file of an XML and load the data in HIve tables. Like we do for AVRO files with AVRO schema? If that is possible, it would be great.

    – Nandakumar_bigdata
    Nov 21 '18 at 0:51











  • I believe that i need to make my query better to get the desired output, any help is much appreciated.

    – Nandakumar_bigdata
    Nov 21 '18 at 3:03



















  • what have you tried so far? are you getting any error?

    – Gaurang Shah
    Nov 20 '18 at 15:19











  • I have created the table structure as above and loaded all the data in a single row. Now i need to explode the arrays correctly to get the output as attached in the screenshot. I am not familiar with the explode options, just wondering how i can do it

    – Nandakumar_bigdata
    Nov 21 '18 at 0:47











  • Is there any way that I can use the XSD schema file of an XML and load the data in HIve tables. Like we do for AVRO files with AVRO schema? If that is possible, it would be great.

    – Nandakumar_bigdata
    Nov 21 '18 at 0:51











  • I believe that i need to make my query better to get the desired output, any help is much appreciated.

    – Nandakumar_bigdata
    Nov 21 '18 at 3:03

















what have you tried so far? are you getting any error?

– Gaurang Shah
Nov 20 '18 at 15:19





what have you tried so far? are you getting any error?

– Gaurang Shah
Nov 20 '18 at 15:19













I have created the table structure as above and loaded all the data in a single row. Now i need to explode the arrays correctly to get the output as attached in the screenshot. I am not familiar with the explode options, just wondering how i can do it

– Nandakumar_bigdata
Nov 21 '18 at 0:47





I have created the table structure as above and loaded all the data in a single row. Now i need to explode the arrays correctly to get the output as attached in the screenshot. I am not familiar with the explode options, just wondering how i can do it

– Nandakumar_bigdata
Nov 21 '18 at 0:47













Is there any way that I can use the XSD schema file of an XML and load the data in HIve tables. Like we do for AVRO files with AVRO schema? If that is possible, it would be great.

– Nandakumar_bigdata
Nov 21 '18 at 0:51





Is there any way that I can use the XSD schema file of an XML and load the data in HIve tables. Like we do for AVRO files with AVRO schema? If that is possible, it would be great.

– Nandakumar_bigdata
Nov 21 '18 at 0:51













I believe that i need to make my query better to get the desired output, any help is much appreciated.

– Nandakumar_bigdata
Nov 21 '18 at 3:03





I believe that i need to make my query better to get the desired output, any help is much appreciated.

– Nandakumar_bigdata
Nov 21 '18 at 3:03












2 Answers
2






active

oldest

votes


















0














Explode for array object works like cross join.
So if you have 3 columns with each containing array with 2 elements, applying explode on all the columns will give you 8 rows.



You can't map one object from array to another.



Actually you can by using posexplode which gives you index for each element. which you can use to join based on condition. However, that' tricky when you have multiple columns and the array size is different for each column.



Solution




  • Use posexplode if you have less column to explode and array size is same. for your case this is not going to work. So


  • Store XML as Complex Data Type : Store your whole XML as a complex data type (not just array), I am talking about creating a struct based on your xml.
    If you don't have much complex xml, you can achieve this. However xmlSerde is not as good as JSONserde when it comes to converting file to complex data type.


So in your case best solution would be.




  • Convert your XML to JSON. You can use NiFi or some other technology for that.

  • Create Hive table using JSONserde and load this file.

  • Create a view as per your requirement.


JSON for Your XML



{"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}


JsonSerde might give your error if you have tabs or other white spaces in your file. So it's always best to remove them.



Hive Table



create external table temp.test_json
(
Version string,
StoreId string,
Bskt array<struct<
BsktNo:string,
DateTime:string,
OpID:string,
TillNo:string,
Itm:array<struct<
GTIN:string,
ItmDsc:string,
ItmSeq:string,
ItmProm:struct<
OfferID:string,
PromCD:string
>

>
>
>
>
)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
location '/tmp/test_json/table/';


enter image description hereCreate View



SELECT Version,
StoreId,
basket.bsktno,
basket.tillno,
basket.`datetime`,
item.itmseq,
item.itmdsc,
item.gtin,
item.itmprom.offerid,
item.itmprom.promcd
FROM temp.test_json
lateral view explode(bskt) b AS basket
lateral view explode(basket.itm) i AS item


enter image description here






share|improve this answer































    0














    Thanks for detailed solution. I tested it and it worked perfectly fine.
    I tried a similar approach to read the data from the XML directly with XML serde.



    My challenges:



    1)XML to JSON conversion takes additional development efforts and we don't have Apache Nifi installation parcels in Cloudera by default, we need to install it with custom parcels.
    2) My data will definitely have spaces/tab spaces in it, especially in 'Item description' field.We need to load the data with the same names as we receive. So converting to JSON and use the 'org.openx.data.jsonserde.JsonSerDe' didn't help. Queries failed with errors as suggested by you.


    Below is the Hive table structure and the query I am using to read the data.
    I am able to explode the first level array (Bskt) successfully without any issues.



    But when i try to explode the second level array (Itm) it returns NULL results for all the fields in 'Itm'.



    Is there any issue with my query or the table structure itself?



    create external table nanda_scan_xml  (
    Version string,
    StoreId string,
    Bskt array<struct<
    Bskt:struct<
    DateTime:string,
    TillNo:string,
    BsktNo:string,
    Itm:array<struct<
    Itm:struct<
    ItmSeq:string,
    GTIN:string,
    ItmDsc:string,
    DeptCD:string,
    ItmCD:string,
    SalesQTY:string,
    SalesExGST:string,
    Points:string,
    CostExGST:string,
    GSTRate:string,
    DiscAmtExGST:string,
    ItmProm:struct<
    PromCD:string,
    OfferID:string
    >
    >
    >
    >
    >
    >
    >
    )
    row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    with serdeproperties
    (
    "column.xpath.Version" = "/Store/Version/text()",
    "column.xpath.StoreId" = "/Store/StoreId/text()",
    "column.xpath.Bskt" = "/Store/Bskt"

    )
    stored as
    inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
    tblproperties
    (
    "xmlinput.start" = "<Store>",
    "xmlinput.end" = "</Store>"
    );


    Query:



    1)For Bskt which works fine:



    SELECT  Version,
    StoreId,
    basket.Bskt.DateTime,
    basket.Bskt.bsktno,
    basket.Bskt.tillno
    FROM eim_stg.nanda_scan_xml
    LATERAL VIEW EXPLODE(Bskt) b AS basket;


    Results:



    enter image description here
    2) When trying two lateral view explode in a single query:



    SELECT  Version,
    StoreId,
    basket.Bskt.DateTime,
    basket.Bskt.bsktno,
    basket.Bskt.tillno,
    item.Itm.ItmSeq,
    item.Itm.ItmDsc,
    item.Itm.GTIN,
    item.Itm.itmprom.OfferID,
    item.Itm.itmprom.PromCD
    FROM eim_stg.nanda_scan_xml
    LATERAL VIEW EXPLODE(Bskt) b AS basket
    LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;


    Results:



    enter image description here



    3) Query:



    SELECT  Version,
    StoreId,
    basket.Bskt.DateTime,
    basket.Bskt.bsktno,
    basket.Bskt.tillno,
    item.Itm.ItmSeq,
    item.Itm.ItmDsc,
    item.Itm.GTIN,
    item.Itm.itmprom.OfferID,
    item.Itm.itmprom.PromCD
    FROM eim_stg.nanda_scan_xml
    LATERAL VIEW EXPLODE(Bskt) b AS basket
    LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;


    Error:



    enter image description here






    share|improve this answer


























    • if you are able to get the xml table with complex data type. no issue. it's just XML serde is not as mature as JSON

      – Gaurang Shah
      Nov 27 '18 at 19:58











    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384728%2fhow-to-use-lateral-view-explode-in-hive-for-xml-data-format%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    Explode for array object works like cross join.
    So if you have 3 columns with each containing array with 2 elements, applying explode on all the columns will give you 8 rows.



    You can't map one object from array to another.



    Actually you can by using posexplode which gives you index for each element. which you can use to join based on condition. However, that' tricky when you have multiple columns and the array size is different for each column.



    Solution




    • Use posexplode if you have less column to explode and array size is same. for your case this is not going to work. So


    • Store XML as Complex Data Type : Store your whole XML as a complex data type (not just array), I am talking about creating a struct based on your xml.
      If you don't have much complex xml, you can achieve this. However xmlSerde is not as good as JSONserde when it comes to converting file to complex data type.


    So in your case best solution would be.




    • Convert your XML to JSON. You can use NiFi or some other technology for that.

    • Create Hive table using JSONserde and load this file.

    • Create a view as per your requirement.


    JSON for Your XML



    {"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}


    JsonSerde might give your error if you have tabs or other white spaces in your file. So it's always best to remove them.



    Hive Table



    create external table temp.test_json
    (
    Version string,
    StoreId string,
    Bskt array<struct<
    BsktNo:string,
    DateTime:string,
    OpID:string,
    TillNo:string,
    Itm:array<struct<
    GTIN:string,
    ItmDsc:string,
    ItmSeq:string,
    ItmProm:struct<
    OfferID:string,
    PromCD:string
    >

    >
    >
    >
    >
    )
    row format serde 'org.openx.data.jsonserde.JsonSerDe'
    location '/tmp/test_json/table/';


    enter image description hereCreate View



    SELECT Version,
    StoreId,
    basket.bsktno,
    basket.tillno,
    basket.`datetime`,
    item.itmseq,
    item.itmdsc,
    item.gtin,
    item.itmprom.offerid,
    item.itmprom.promcd
    FROM temp.test_json
    lateral view explode(bskt) b AS basket
    lateral view explode(basket.itm) i AS item


    enter image description here






    share|improve this answer




























      0














      Explode for array object works like cross join.
      So if you have 3 columns with each containing array with 2 elements, applying explode on all the columns will give you 8 rows.



      You can't map one object from array to another.



      Actually you can by using posexplode which gives you index for each element. which you can use to join based on condition. However, that' tricky when you have multiple columns and the array size is different for each column.



      Solution




      • Use posexplode if you have less column to explode and array size is same. for your case this is not going to work. So


      • Store XML as Complex Data Type : Store your whole XML as a complex data type (not just array), I am talking about creating a struct based on your xml.
        If you don't have much complex xml, you can achieve this. However xmlSerde is not as good as JSONserde when it comes to converting file to complex data type.


      So in your case best solution would be.




      • Convert your XML to JSON. You can use NiFi or some other technology for that.

      • Create Hive table using JSONserde and load this file.

      • Create a view as per your requirement.


      JSON for Your XML



      {"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}


      JsonSerde might give your error if you have tabs or other white spaces in your file. So it's always best to remove them.



      Hive Table



      create external table temp.test_json
      (
      Version string,
      StoreId string,
      Bskt array<struct<
      BsktNo:string,
      DateTime:string,
      OpID:string,
      TillNo:string,
      Itm:array<struct<
      GTIN:string,
      ItmDsc:string,
      ItmSeq:string,
      ItmProm:struct<
      OfferID:string,
      PromCD:string
      >

      >
      >
      >
      >
      )
      row format serde 'org.openx.data.jsonserde.JsonSerDe'
      location '/tmp/test_json/table/';


      enter image description hereCreate View



      SELECT Version,
      StoreId,
      basket.bsktno,
      basket.tillno,
      basket.`datetime`,
      item.itmseq,
      item.itmdsc,
      item.gtin,
      item.itmprom.offerid,
      item.itmprom.promcd
      FROM temp.test_json
      lateral view explode(bskt) b AS basket
      lateral view explode(basket.itm) i AS item


      enter image description here






      share|improve this answer


























        0












        0








        0







        Explode for array object works like cross join.
        So if you have 3 columns with each containing array with 2 elements, applying explode on all the columns will give you 8 rows.



        You can't map one object from array to another.



        Actually you can by using posexplode which gives you index for each element. which you can use to join based on condition. However, that' tricky when you have multiple columns and the array size is different for each column.



        Solution




        • Use posexplode if you have less column to explode and array size is same. for your case this is not going to work. So


        • Store XML as Complex Data Type : Store your whole XML as a complex data type (not just array), I am talking about creating a struct based on your xml.
          If you don't have much complex xml, you can achieve this. However xmlSerde is not as good as JSONserde when it comes to converting file to complex data type.


        So in your case best solution would be.




        • Convert your XML to JSON. You can use NiFi or some other technology for that.

        • Create Hive table using JSONserde and load this file.

        • Create a view as per your requirement.


        JSON for Your XML



        {"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}


        JsonSerde might give your error if you have tabs or other white spaces in your file. So it's always best to remove them.



        Hive Table



        create external table temp.test_json
        (
        Version string,
        StoreId string,
        Bskt array<struct<
        BsktNo:string,
        DateTime:string,
        OpID:string,
        TillNo:string,
        Itm:array<struct<
        GTIN:string,
        ItmDsc:string,
        ItmSeq:string,
        ItmProm:struct<
        OfferID:string,
        PromCD:string
        >

        >
        >
        >
        >
        )
        row format serde 'org.openx.data.jsonserde.JsonSerDe'
        location '/tmp/test_json/table/';


        enter image description hereCreate View



        SELECT Version,
        StoreId,
        basket.bsktno,
        basket.tillno,
        basket.`datetime`,
        item.itmseq,
        item.itmdsc,
        item.gtin,
        item.itmprom.offerid,
        item.itmprom.promcd
        FROM temp.test_json
        lateral view explode(bskt) b AS basket
        lateral view explode(basket.itm) i AS item


        enter image description here






        share|improve this answer













        Explode for array object works like cross join.
        So if you have 3 columns with each containing array with 2 elements, applying explode on all the columns will give you 8 rows.



        You can't map one object from array to another.



        Actually you can by using posexplode which gives you index for each element. which you can use to join based on condition. However, that' tricky when you have multiple columns and the array size is different for each column.



        Solution




        • Use posexplode if you have less column to explode and array size is same. for your case this is not going to work. So


        • Store XML as Complex Data Type : Store your whole XML as a complex data type (not just array), I am talking about creating a struct based on your xml.
          If you don't have much complex xml, you can achieve this. However xmlSerde is not as good as JSONserde when it comes to converting file to complex data type.


        So in your case best solution would be.




        • Convert your XML to JSON. You can use NiFi or some other technology for that.

        • Create Hive table using JSONserde and load this file.

        • Create a view as per your requirement.


        JSON for Your XML



        {"Version":"1.1","StoreId":"16695","Bskt":[{"TillNo":"4","BsktNo":"1753","DateTime":"2017-10-31T11:19:34.000+11:00","OpID":"50056","Itm":[{"ItmSeq":"1","GTIN":"29559","ItmDsc":"CHOCALATE","ItmProm":{"PromCD":"CM"}},{"ItmSeq":"2","GTIN":"59653","ItmDsc":"CORNFLAKES"},{"ItmSeq":"3","GTIN":"42260","ItmDsc":"MILKCHOCOLATE162GM","ItmProm":{"PromCD":"MTSRO","OfferID":"11766"}}]},{"TillNo":"5","BsktNo":"1947","DateTime":"2017-10-31T16:24:59.000+11:00","OpID":"50063","Itm":[{"ItmSeq":"1","GTIN":"24064","ItmDsc":"TOMATOES2KG","ItmProm":{"PromCD":"INSTORE"}},{"ItmSeq":"2","GTIN":"81287","ItmDsc":"ROTHMANSBLUE","ItmProm":{"PromCD":"TF"}}]}]}


        JsonSerde might give your error if you have tabs or other white spaces in your file. So it's always best to remove them.



        Hive Table



        create external table temp.test_json
        (
        Version string,
        StoreId string,
        Bskt array<struct<
        BsktNo:string,
        DateTime:string,
        OpID:string,
        TillNo:string,
        Itm:array<struct<
        GTIN:string,
        ItmDsc:string,
        ItmSeq:string,
        ItmProm:struct<
        OfferID:string,
        PromCD:string
        >

        >
        >
        >
        >
        )
        row format serde 'org.openx.data.jsonserde.JsonSerDe'
        location '/tmp/test_json/table/';


        enter image description hereCreate View



        SELECT Version,
        StoreId,
        basket.bsktno,
        basket.tillno,
        basket.`datetime`,
        item.itmseq,
        item.itmdsc,
        item.gtin,
        item.itmprom.offerid,
        item.itmprom.promcd
        FROM temp.test_json
        lateral view explode(bskt) b AS basket
        lateral view explode(basket.itm) i AS item


        enter image description here







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 22 '18 at 16:40









        Gaurang ShahGaurang Shah

        3,04211333




        3,04211333

























            0














            Thanks for detailed solution. I tested it and it worked perfectly fine.
            I tried a similar approach to read the data from the XML directly with XML serde.



            My challenges:



            1)XML to JSON conversion takes additional development efforts and we don't have Apache Nifi installation parcels in Cloudera by default, we need to install it with custom parcels.
            2) My data will definitely have spaces/tab spaces in it, especially in 'Item description' field.We need to load the data with the same names as we receive. So converting to JSON and use the 'org.openx.data.jsonserde.JsonSerDe' didn't help. Queries failed with errors as suggested by you.


            Below is the Hive table structure and the query I am using to read the data.
            I am able to explode the first level array (Bskt) successfully without any issues.



            But when i try to explode the second level array (Itm) it returns NULL results for all the fields in 'Itm'.



            Is there any issue with my query or the table structure itself?



            create external table nanda_scan_xml  (
            Version string,
            StoreId string,
            Bskt array<struct<
            Bskt:struct<
            DateTime:string,
            TillNo:string,
            BsktNo:string,
            Itm:array<struct<
            Itm:struct<
            ItmSeq:string,
            GTIN:string,
            ItmDsc:string,
            DeptCD:string,
            ItmCD:string,
            SalesQTY:string,
            SalesExGST:string,
            Points:string,
            CostExGST:string,
            GSTRate:string,
            DiscAmtExGST:string,
            ItmProm:struct<
            PromCD:string,
            OfferID:string
            >
            >
            >
            >
            >
            >
            >
            )
            row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
            with serdeproperties
            (
            "column.xpath.Version" = "/Store/Version/text()",
            "column.xpath.StoreId" = "/Store/StoreId/text()",
            "column.xpath.Bskt" = "/Store/Bskt"

            )
            stored as
            inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
            outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
            LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
            tblproperties
            (
            "xmlinput.start" = "<Store>",
            "xmlinput.end" = "</Store>"
            );


            Query:



            1)For Bskt which works fine:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket;


            Results:



            enter image description here
            2) When trying two lateral view explode in a single query:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno,
            item.Itm.ItmSeq,
            item.Itm.ItmDsc,
            item.Itm.GTIN,
            item.Itm.itmprom.OfferID,
            item.Itm.itmprom.PromCD
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket
            LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;


            Results:



            enter image description here



            3) Query:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno,
            item.Itm.ItmSeq,
            item.Itm.ItmDsc,
            item.Itm.GTIN,
            item.Itm.itmprom.OfferID,
            item.Itm.itmprom.PromCD
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket
            LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;


            Error:



            enter image description here






            share|improve this answer


























            • if you are able to get the xml table with complex data type. no issue. it's just XML serde is not as mature as JSON

              – Gaurang Shah
              Nov 27 '18 at 19:58
















            0














            Thanks for detailed solution. I tested it and it worked perfectly fine.
            I tried a similar approach to read the data from the XML directly with XML serde.



            My challenges:



            1)XML to JSON conversion takes additional development efforts and we don't have Apache Nifi installation parcels in Cloudera by default, we need to install it with custom parcels.
            2) My data will definitely have spaces/tab spaces in it, especially in 'Item description' field.We need to load the data with the same names as we receive. So converting to JSON and use the 'org.openx.data.jsonserde.JsonSerDe' didn't help. Queries failed with errors as suggested by you.


            Below is the Hive table structure and the query I am using to read the data.
            I am able to explode the first level array (Bskt) successfully without any issues.



            But when i try to explode the second level array (Itm) it returns NULL results for all the fields in 'Itm'.



            Is there any issue with my query or the table structure itself?



            create external table nanda_scan_xml  (
            Version string,
            StoreId string,
            Bskt array<struct<
            Bskt:struct<
            DateTime:string,
            TillNo:string,
            BsktNo:string,
            Itm:array<struct<
            Itm:struct<
            ItmSeq:string,
            GTIN:string,
            ItmDsc:string,
            DeptCD:string,
            ItmCD:string,
            SalesQTY:string,
            SalesExGST:string,
            Points:string,
            CostExGST:string,
            GSTRate:string,
            DiscAmtExGST:string,
            ItmProm:struct<
            PromCD:string,
            OfferID:string
            >
            >
            >
            >
            >
            >
            >
            )
            row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
            with serdeproperties
            (
            "column.xpath.Version" = "/Store/Version/text()",
            "column.xpath.StoreId" = "/Store/StoreId/text()",
            "column.xpath.Bskt" = "/Store/Bskt"

            )
            stored as
            inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
            outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
            LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
            tblproperties
            (
            "xmlinput.start" = "<Store>",
            "xmlinput.end" = "</Store>"
            );


            Query:



            1)For Bskt which works fine:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket;


            Results:



            enter image description here
            2) When trying two lateral view explode in a single query:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno,
            item.Itm.ItmSeq,
            item.Itm.ItmDsc,
            item.Itm.GTIN,
            item.Itm.itmprom.OfferID,
            item.Itm.itmprom.PromCD
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket
            LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;


            Results:



            enter image description here



            3) Query:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno,
            item.Itm.ItmSeq,
            item.Itm.ItmDsc,
            item.Itm.GTIN,
            item.Itm.itmprom.OfferID,
            item.Itm.itmprom.PromCD
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket
            LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;


            Error:



            enter image description here






            share|improve this answer


























            • if you are able to get the xml table with complex data type. no issue. it's just XML serde is not as mature as JSON

              – Gaurang Shah
              Nov 27 '18 at 19:58














            0












            0








            0







            Thanks for detailed solution. I tested it and it worked perfectly fine.
            I tried a similar approach to read the data from the XML directly with XML serde.



            My challenges:



            1)XML to JSON conversion takes additional development efforts and we don't have Apache Nifi installation parcels in Cloudera by default, we need to install it with custom parcels.
            2) My data will definitely have spaces/tab spaces in it, especially in 'Item description' field.We need to load the data with the same names as we receive. So converting to JSON and use the 'org.openx.data.jsonserde.JsonSerDe' didn't help. Queries failed with errors as suggested by you.


            Below is the Hive table structure and the query I am using to read the data.
            I am able to explode the first level array (Bskt) successfully without any issues.



            But when i try to explode the second level array (Itm) it returns NULL results for all the fields in 'Itm'.



            Is there any issue with my query or the table structure itself?



            create external table nanda_scan_xml  (
            Version string,
            StoreId string,
            Bskt array<struct<
            Bskt:struct<
            DateTime:string,
            TillNo:string,
            BsktNo:string,
            Itm:array<struct<
            Itm:struct<
            ItmSeq:string,
            GTIN:string,
            ItmDsc:string,
            DeptCD:string,
            ItmCD:string,
            SalesQTY:string,
            SalesExGST:string,
            Points:string,
            CostExGST:string,
            GSTRate:string,
            DiscAmtExGST:string,
            ItmProm:struct<
            PromCD:string,
            OfferID:string
            >
            >
            >
            >
            >
            >
            >
            )
            row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
            with serdeproperties
            (
            "column.xpath.Version" = "/Store/Version/text()",
            "column.xpath.StoreId" = "/Store/StoreId/text()",
            "column.xpath.Bskt" = "/Store/Bskt"

            )
            stored as
            inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
            outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
            LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
            tblproperties
            (
            "xmlinput.start" = "<Store>",
            "xmlinput.end" = "</Store>"
            );


            Query:



            1)For Bskt which works fine:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket;


            Results:



            enter image description here
            2) When trying two lateral view explode in a single query:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno,
            item.Itm.ItmSeq,
            item.Itm.ItmDsc,
            item.Itm.GTIN,
            item.Itm.itmprom.OfferID,
            item.Itm.itmprom.PromCD
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket
            LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;


            Results:



            enter image description here



            3) Query:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno,
            item.Itm.ItmSeq,
            item.Itm.ItmDsc,
            item.Itm.GTIN,
            item.Itm.itmprom.OfferID,
            item.Itm.itmprom.PromCD
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket
            LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;


            Error:



            enter image description here






            share|improve this answer















            Thanks for detailed solution. I tested it and it worked perfectly fine.
            I tried a similar approach to read the data from the XML directly with XML serde.



            My challenges:



            1)XML to JSON conversion takes additional development efforts and we don't have Apache Nifi installation parcels in Cloudera by default, we need to install it with custom parcels.
            2) My data will definitely have spaces/tab spaces in it, especially in 'Item description' field.We need to load the data with the same names as we receive. So converting to JSON and use the 'org.openx.data.jsonserde.JsonSerDe' didn't help. Queries failed with errors as suggested by you.


            Below is the Hive table structure and the query I am using to read the data.
            I am able to explode the first level array (Bskt) successfully without any issues.



            But when i try to explode the second level array (Itm) it returns NULL results for all the fields in 'Itm'.



            Is there any issue with my query or the table structure itself?



            create external table nanda_scan_xml  (
            Version string,
            StoreId string,
            Bskt array<struct<
            Bskt:struct<
            DateTime:string,
            TillNo:string,
            BsktNo:string,
            Itm:array<struct<
            Itm:struct<
            ItmSeq:string,
            GTIN:string,
            ItmDsc:string,
            DeptCD:string,
            ItmCD:string,
            SalesQTY:string,
            SalesExGST:string,
            Points:string,
            CostExGST:string,
            GSTRate:string,
            DiscAmtExGST:string,
            ItmProm:struct<
            PromCD:string,
            OfferID:string
            >
            >
            >
            >
            >
            >
            >
            )
            row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
            with serdeproperties
            (
            "column.xpath.Version" = "/Store/Version/text()",
            "column.xpath.StoreId" = "/Store/StoreId/text()",
            "column.xpath.Bskt" = "/Store/Bskt"

            )
            stored as
            inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
            outputformat 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
            LOCATION 'hdfs://namenode/LandingArea/Sources/SCANP/IGA_SCAN/STAGING/'
            tblproperties
            (
            "xmlinput.start" = "<Store>",
            "xmlinput.end" = "</Store>"
            );


            Query:



            1)For Bskt which works fine:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket;


            Results:



            enter image description here
            2) When trying two lateral view explode in a single query:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno,
            item.Itm.ItmSeq,
            item.Itm.ItmDsc,
            item.Itm.GTIN,
            item.Itm.itmprom.OfferID,
            item.Itm.itmprom.PromCD
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket
            LATERAL VIEW EXPLODE(basket.Bskt.Itm) i AS item limit 100;


            Results:



            enter image description here



            3) Query:



            SELECT  Version,
            StoreId,
            basket.Bskt.DateTime,
            basket.Bskt.bsktno,
            basket.Bskt.tillno,
            item.Itm.ItmSeq,
            item.Itm.ItmDsc,
            item.Itm.GTIN,
            item.Itm.itmprom.OfferID,
            item.Itm.itmprom.PromCD
            FROM eim_stg.nanda_scan_xml
            LATERAL VIEW EXPLODE(Bskt) b AS basket
            LATERAL VIEW EXPLODE(basket.Itm) i AS item limit 100;


            Error:



            enter image description here







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Nov 27 '18 at 4:10

























            answered Nov 27 '18 at 0:35









            Nandakumar_bigdataNandakumar_bigdata

            53




            53













            • if you are able to get the xml table with complex data type. no issue. it's just XML serde is not as mature as JSON

              – Gaurang Shah
              Nov 27 '18 at 19:58



















            • if you are able to get the xml table with complex data type. no issue. it's just XML serde is not as mature as JSON

              – Gaurang Shah
              Nov 27 '18 at 19:58

















            if you are able to get the xml table with complex data type. no issue. it's just XML serde is not as mature as JSON

            – Gaurang Shah
            Nov 27 '18 at 19:58





            if you are able to get the xml table with complex data type. no issue. it's just XML serde is not as mature as JSON

            – Gaurang Shah
            Nov 27 '18 at 19:58


















            draft saved

            draft discarded




















































            Thanks for contributing an answer to Stack Overflow!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53384728%2fhow-to-use-lateral-view-explode-in-hive-for-xml-data-format%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Can a sorcerer learn a 5th-level spell early by creating spell slots using the Font of Magic feature?

            Does disintegrating a polymorphed enemy still kill it after the 2018 errata?

            A Topological Invariant for $pi_3(U(n))$