Pandas dataframe wide vs long - unstack vs pivot vs outer join for MULTIPLE df

Problem



I have some enormous dataframes pulled from equipment. Each tracks multiple runs on that equipment, and each run records multiple sensors (voltage, current, rpm, pressures, etc.).
I need to widen this data set for plotting and further analysis, but unfortunately the clocks on the sensors are not synchronised, so each parameter is collected with its own timestamp, and the series can vary in length (millisecond resolution, so sometimes by more than 10 rows).



I've attempted unstacking:



df.set_index(['index','start_time','param']).value.unstack().rename_axis(None, axis=1).reset_index()



and pivoting:



df.pivot_table(values='value', index=['index','start_time'], columns='param')



but the different lengths cause real problems (understandably).
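To illustrate the problem on a toy frame (the column names here are just illustrative): with duplicate (index, columns) pairs, `DataFrame.pivot` refuses to reshape, while `pivot_table` silently aggregates the duplicates with its default `mean`, so rows disappear without warning.

```python
import pandas as pd

# Two readings of the same param on the same date -- the situation the
# unequal sensor lengths create.
df = pd.DataFrame({"Date": [1, 1], "Param": ["a", "a"], "value": [0.1, 0.2]})

# pivot raises: "Index contains duplicate entries, cannot reshape"
try:
    df.pivot(index="Date", columns="Param", values="value")
except ValueError as exc:
    print(exc)

# pivot_table quietly averages the duplicates instead of keeping both rows
res = df.pivot_table(index="Date", columns="Param", values="value")
print(res.loc[1, "a"])  # ~0.15 -- two readings collapsed into one
```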



I have code to split the data by date (i.e. individual run) or by param into a dictionary of dataframes, and I can run the analysis per run or per param -- but there are ~100 sensors and 18 months' worth of runs(!), so I'd like to be sure there is no direct way to do what I want, which I think is some sort of multiple outer join. Because of the differing lengths, it would need to fill the gaps with NaN (which is fine) and stretch each date out to the maximum length recorded by any param.
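To make the ask concrete, the multiple outer join I have in mind can be sketched like this, using the model dataset defined below (the occurrence counter `n` is a helper of my own, not a column in the real data):

```python
from functools import reduce

import numpy as np
import pandas as pd

df_long = pd.DataFrame({"Date": np.array([1]*5 + [2]*3 + [3]*4 + [4]*2 + [5]*4),
                        "Param": list('aaabbabbabccaaaacc'),
                        "value": [0.1, 0.2, 0.2, 1, 4, 0.6, 0.5, 90, 0.9,
                                  8.8, 4.1, 0.4, 0.5, 0.1, 0.1, 0.3, 3.4, 5.1],
                        "time": [1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2]})

# One frame per sensor, each with a per-Date occurrence counter "n" so
# that series of different lengths can still be lined up row by row.
frames = []
for param, g in df_long.groupby("Param"):
    g = g.assign(n=g.groupby("Date").cumcount())
    g = g.rename(columns={"value": param, "time": f"time-{param}"})
    frames.append(g[["Date", "n", param, f"time-{param}"]])

# Chain of outer joins: missing rows are padded with NaN, and each Date
# is stretched to the longest run recorded by any sensor.
df_wide = (reduce(lambda l, r: pd.merge(l, r, on=["Date", "n"], how="outer"), frames)
           .sort_values(["Date", "n"])
           .drop(columns="n")
           .reset_index(drop=True))
```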



Model dataset



Start



df_long = pd.DataFrame({"Date": np.array([1]*5 + [2]*3 + [3]*4 + [4]*2 + [5]*4),
                        "Param": list('aaabbabbabccaaaacc'),
                        "value": [0.1, 0.2, 0.2, 1, 4, 0.6, 0.5, 90, 0.9, 8.8, 4.1, 0.4, 0.5, 0.1, 0.1, 0.3, 3.4, 5.1],
                        "time": [1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2]})


Ideal output



df_wide = pd.DataFrame({"Date": [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
                        "a": [0.1, 0.2, 0.2, 0.6, np.nan, 0.9, np.nan, 0.5, 0.1, 0.1, 0.3],
                        "time-a": [1, 2, 3, 1, np.nan, 1, np.nan, 1, 2, 1, 2],
                        "b": [1, 4, np.nan, 0.5, 90, 8.8, np.nan, np.nan, np.nan, np.nan, np.nan],
                        "time-b": [1, 2, np.nan, 1, 2, 1, np.nan, np.nan, np.nan, np.nan, np.nan],
                        "c": [np.nan, np.nan, np.nan, np.nan, np.nan, 4.1, 0.4, np.nan, np.nan, 3.4, 5.1],
                        "time-c": [np.nan, np.nan, np.nan, np.nan, np.nan, 1, 2, np.nan, np.nan, 1, 2]})


Any help greatly appreciated.
  • In df_long the Date column is equal to 1 five times, but in df_wide it is only equal to 1 three times. Is this part of the expected output, and if so, according to what logic?

    – user3471881
    Nov 20 '18 at 11:15
  • As the dataframe is widened it also shortens, so the five 1s in long become only three 1s for param a and two 1s for param b. This is standard unstacking behaviour; the issue here is that the measurements per date and per param are all of different lengths.

    – BAC83
    Nov 20 '18 at 11:27
Tags: python, pandas, dataframe, pandas-groupby
asked Nov 20 '18 at 10:36 by BAC83
edited Nov 20 '18 at 15:23 by jpp
1 Answer
pd.pivot_table



You can pivot your dataframe. The only difference from your desired output is that you get a single time column; if you wish, you can construct time-a, time-b, etc. from it by considering null values in the corresponding value columns.



res = pd.pivot_table(df_long, index=['Date', 'time'],
                     columns=['Param'], values='value').reset_index()

print(res)

Param  Date  time    a     b    c
0         1     1  0.1   1.0  NaN
1         1     2  0.2   4.0  NaN
2         1     3  0.2   NaN  NaN
3         2     1  0.6   0.5  NaN
4         2     2  NaN  90.0  NaN
5         3     1  0.9   8.8  4.1
6         3     2  NaN   NaN  0.4
7         4     1  0.5   NaN  NaN
8         4     2  0.1   NaN  NaN
9         5     1  0.1   NaN  3.4
10        5     2  0.3   NaN  5.1
  • I was obsessed with keeping the times separated, which isn't needed! Thanks, this works. Annoying that I was most of the way there! Cheers

    – BAC83
    Nov 20 '18 at 15:35
  • How would it work if I wanted to keep the time series separated (as alluded to in your answer)?

    – BAC83
    Nov 20 '18 at 15:38
  • You'd do something like df['time-a'] = df['time'].mask(df['a'].isnull()).

    – jpp
    Nov 20 '18 at 15:41
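Putting the answer and this suggestion together, a self-contained sketch (the param list ['a', 'b', 'c'] is hard-coded here for the model data; in general it would come from df_long['Param'].unique()):

```python
import numpy as np
import pandas as pd

df_long = pd.DataFrame({"Date": np.array([1]*5 + [2]*3 + [3]*4 + [4]*2 + [5]*4),
                        "Param": list('aaabbabbabccaaaacc'),
                        "value": [0.1, 0.2, 0.2, 1, 4, 0.6, 0.5, 90, 0.9,
                                  8.8, 4.1, 0.4, 0.5, 0.1, 0.1, 0.3, 3.4, 5.1],
                        "time": [1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2]})

res = pd.pivot_table(df_long, index=['Date', 'time'],
                     columns=['Param'], values='value').reset_index()

# One masked time column per param: keep the shared timestamp only where
# that param actually has a reading.
for p in ['a', 'b', 'c']:
    res[f'time-{p}'] = res['time'].mask(res[p].isnull())
res = res.drop(columns='time')
```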
answered Nov 20 '18 at 15:22 by jpp