Pandas dataframe wide vs long - unstack vs pivot vs outer join for MULTIPLE df
Problem
I have some enormous dataframes pulled from equipment, which track multiple runs on that equipment, each recording multiple sensors (voltage, current, rpm, pressures, etc.).
I need to widen this data set for plotting and further analysis, but unfortunately the clocks on the sensors are not synchronised, so the parameters are each collected with their own timestamps and can vary in length (msec resolution, so sometimes >10 rows).
I've attempted unstacking:
df.set_index(['index', 'start_time', 'param']).value.unstack().rename_axis(None, axis=1).reset_index()
and pivoting:
df.pivot_table(values='value', index=['index', 'start_time'], columns='param')
but the different lengths cause real problems (understandably).
I have code to convert the data, keyed on date (i.e. individual run) or on param, into a dictionary of dfs, and can then analyse each run or param separately -- but there are ~100 sensors and 18 months' worth of runs(!), so I'd like to be sure there is no way to do what I want, which I think is some sort of multiple outer join. Because of the differing lengths, it would need to fill blanks with NaN (which is fine) and find the max length of any param, padding each date out to that length.
Model dataset
Start
df_long = pd.DataFrame({"Date": np.array([1]*5 + [2]*3 + [3]*4 + [4]*2 + [5]*4),
                        "Param": list('aaabbabbabccaaaacc'),
                        "value": [0.1, 0.2, 0.2, 1, 4, 0.6, 0.5, 90, 0.9, 8.8, 4.1, 0.4, 0.5, 0.1, 0.1, 0.3, 3.4, 5.1],
                        "time": [1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2]})
Ideal output
df_wide = pd.DataFrame({
    "Date": [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "a": [0.1, 0.2, 0.2, 0.6, np.nan, 0.9, np.nan, 0.5, 0.1, 0.1, 0.3],
    "time-a": [1, 2, 3, 1, np.nan, 1, np.nan, 1, 2, 1, 2],
    "b": [1, 4, np.nan, 0.5, 90, 8.8, np.nan, np.nan, np.nan, np.nan, np.nan],
    "time-b": [1, 2, np.nan, 1, 2, 1, np.nan, np.nan, np.nan, np.nan, np.nan],
    "c": [np.nan, np.nan, np.nan, np.nan, np.nan, 4.1, 0.4, np.nan, np.nan, 3.4, 5.1],
    "time-c": [np.nan, np.nan, np.nan, np.nan, np.nan, 1, 2, np.nan, np.nan, 1, 2]})
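For reference, the "multiple outer join" idea described above can be sketched directly against this model dataset: split df_long by param, number the readings within each (Date, Param) group so rows line up, and outer-merge the pieces. The helper column n is a name introduced here for illustration; the merge pads shorter params with NaN, which reproduces df_wide.

```python
import numpy as np
import pandas as pd

df_long = pd.DataFrame({
    "Date": np.array([1]*5 + [2]*3 + [3]*4 + [4]*2 + [5]*4),
    "Param": list("aaabbabbabccaaaacc"),
    "value": [0.1, 0.2, 0.2, 1, 4, 0.6, 0.5, 90, 0.9, 8.8,
              4.1, 0.4, 0.5, 0.1, 0.1, 0.3, 3.4, 5.1],
    "time": [1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2],
})

# Number each observation within its (Date, Param) group so that rows of
# different params line up; the outer merge pads shorter params with NaN.
df_long["n"] = df_long.groupby(["Date", "Param"]).cumcount()

wide = None
for p, grp in df_long.groupby("Param"):
    sub = grp[["Date", "n", "value", "time"]].rename(
        columns={"value": p, "time": f"time-{p}"})
    wide = sub if wide is None else wide.merge(sub, on=["Date", "n"], how="outer")

# Unmatched rows are appended at the end of each merge, so restore order.
wide = (wide.sort_values(["Date", "n"])
            .drop(columns="n")
            .reset_index(drop=True))
```

This scales to ~100 sensors without hand-writing joins, since the loop runs over whatever params are present.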
Any help greatly appreciated
python pandas dataframe pandas-groupby
In df_long the Date column is equal to 1 five times, but in df_wide it is only equal to 1 three times. Is this part of the expected output, and if so, according to what logic?
– user3471881
Nov 20 '18 at 11:15
As the df is widened, it also shortens - so the five 1s in long are now only three 1s for param a and two 1s for param b. This is standard unstacking behaviour; the issue here is that the measurements per date and per param are all different lengths.
– BAC83
Nov 20 '18 at 11:27
edited Nov 20 '18 at 15:23 by jpp
asked Nov 20 '18 at 10:36 by BAC83
1 Answer
pd.pivot_table
You can pivot your dataframe. The only difference versus your desired output is that you have a single time series rather than one per param; you can, if you wish, construct time-a, time-b, etc. by considering the null values in the other series.
res = pd.pivot_table(df_long, index=['Date', 'time'],
                     columns=['Param'], values='value').reset_index()
print(res)
Param Date time a b c
0 1 1 0.1 1.0 NaN
1 1 2 0.2 4.0 NaN
2 1 3 0.2 NaN NaN
3 2 1 0.6 0.5 NaN
4 2 2 NaN 90.0 NaN
5 3 1 0.9 8.8 4.1
6 3 2 NaN NaN 0.4
7 4 1 0.5 NaN NaN
8 4 2 0.1 NaN NaN
9 5 1 0.1 NaN 3.4
10 5 2 0.3 NaN 5.1
I was obsessed with keeping the times separated - which isn't needed! Thanks, this works. Annoying that I was most of the way there! Cheers
– BAC83
Nov 20 '18 at 15:35
How would it work if I wanted to keep the time series separated (as alluded to in your answer)?
– BAC83
Nov 20 '18 at 15:38
You'd do something like df['time-a'] = df['time'].mask(df['a'].isnull()).
– jpp
Nov 20 '18 at 15:41
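Generalising that mask trick over every param gives the separated time columns in a few lines. This is a sketch reusing the model df_long from the question; it pivots as in the answer, then blanks time wherever a given param has no reading.

```python
import numpy as np
import pandas as pd

df_long = pd.DataFrame({
    "Date": np.array([1]*5 + [2]*3 + [3]*4 + [4]*2 + [5]*4),
    "Param": list("aaabbabbabccaaaacc"),
    "value": [0.1, 0.2, 0.2, 1, 4, 0.6, 0.5, 90, 0.9, 8.8,
              4.1, 0.4, 0.5, 0.1, 0.1, 0.3, 3.4, 5.1],
    "time": [1, 2, 3, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2],
})

# Pivot as in the answer, then derive a per-param time column by blanking
# `time` wherever that param has no reading in the row.
res = pd.pivot_table(df_long, index=["Date", "time"],
                     columns="Param", values="value").reset_index()
for p in df_long["Param"].unique():
    res[f"time-{p}"] = res["time"].mask(res[p].isnull())
```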
answered Nov 20 '18 at 15:22 by jpp