LSTM - Making predictions on partial sequence


























This question is a continuation of a previous question I've asked.



I've trained an LSTM model to predict a binary class (1 or 0) for batches of 100 samples with 3 features each, i.e. the shape of the data is (m, 100, 3), where m is the number of batches.



Data:



[
  [[1,2,3],[1,2,3]... 100 samples],
  [[1,2,3],[1,2,3]... 100 samples],
  ... available batches in the training data
]


Target:



[
[1]
[0]
...
]
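
To make these shapes concrete, here is a minimal NumPy sketch (my addition, with placeholder values; the 75 is borrowed from the error message further down):

import numpy as np

m = 75                     # hypothetical number of sequences
X = np.zeros((m, 100, 3))  # 100 timesteps, 3 features per timestep
y = np.zeros((m, 1))       # one binary label per sequence

print(X.shape, y.shape)    # (75, 100, 3) (75, 1)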


Model code:



from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout, LeakyReLU
from keras import optimizers
import keras_metrics  # f1 is a custom metric defined elsewhere

def build_model(num_samples, num_features, is_training):
    model = Sequential()
    opt = optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0001)

    # training: stateless with flexible batch size; prediction: stateful with batch size 1
    batch_size = None if is_training else 1
    stateful = not is_training
    first_lstm = LSTM(32, batch_input_shape=(batch_size, num_samples, num_features),
                      return_sequences=True, activation='tanh', stateful=stateful)

    model.add(first_lstm)
    model.add(LeakyReLU())
    model.add(Dropout(0.2))
    model.add(LSTM(16, return_sequences=True, activation='tanh', stateful=stateful))
    model.add(Dropout(0.2))
    model.add(LeakyReLU())
    model.add(LSTM(8, return_sequences=False, activation='tanh', stateful=stateful))
    model.add(LeakyReLU())
    model.add(Dense(1, activation='sigmoid'))

    if is_training:
        model.compile(loss='binary_crossentropy', optimizer=opt,
                      metrics=['accuracy', keras_metrics.precision(), keras_metrics.recall(), f1])
    return model
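
For reference, here is a hedged sketch (my addition; the hyperparameters and the weight hand-off are assumptions, not code from the original post) of how the two variants fit together:

# stateless model for training on full 100-timestep sequences
training_model = build_model(num_samples=100, num_features=3, is_training=True)
training_model.fit(X, y, epochs=10, batch_size=32)  # hypothetical training call

# stateful copy that accepts one timestep at a time, then receives the trained weights
predicting_model = build_model(num_samples=1, num_features=3, is_training=False)
predicting_model.set_weights(training_model.get_weights())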


For the training stage the model is NOT stateful. When predicting, I use a stateful model, iterating over the data and outputting a probability for each sample:



for index, row in data.iterrows():
    if index % 100 == 0:
        predicting_model.reset_states()  # a new sequence starts every 100 rows
    vals = np.array([[row[['a', 'b', 'c']].values]])  # shape (1, 1, 3): one timestep
    prob = predicting_model.predict_on_batch(vals)


When looking at the probability at the end of a batch, it is exactly the value I get when predicting on the entire batch at once (not one sample at a time). However, I expected the probability to keep moving in the right direction as new samples arrive. What actually happens is that the probability can spike towards the wrong class on an arbitrary sample (see below).





Two examples of 100-sample batches over the course of prediction (label = 1):



[plot: predicted probability per timestep, label = 1]



and label = 0:
[plot: predicted probability per timestep, label = 0]



Is there a way to achieve what I want (avoid extreme spikes while predicting probability), or is that a given fact?



Any explanation or advice would be appreciated.





Update
Thanks to @today's advice, I've tried retraining the network with a hidden-state output for each input timestep, using return_sequences=True on the last LSTM layer.



So now the labels look like this (shape (100, 100)):



[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
...]
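
A minimal sketch (my addition) of how labels of this shape can be built from the per-sequence labels; note the result is 2-D, which turns out to matter for the error below:

import numpy as np

y = np.array([0, 1, 0])    # hypothetical per-sequence labels
y_rep = np.repeat(y, 100).reshape(-1, 100)
print(y_rep.shape)         # (3, 100): one label per timestep, but only 2-D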


The model summary:



Layer (type)                 Output Shape              Param #   
=================================================================
lstm_1 (LSTM)                (None, 100, 32)           4608      
_________________________________________________________________
leaky_re_lu_1 (LeakyReLU)    (None, 100, 32)           0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 32)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 16)           3136      
_________________________________________________________________
dropout_2 (Dropout)          (None, 100, 16)           0         
_________________________________________________________________
leaky_re_lu_2 (LeakyReLU)    (None, 100, 16)           0         
_________________________________________________________________
lstm_3 (LSTM)                (None, 100, 8)            800       
_________________________________________________________________
leaky_re_lu_3 (LeakyReLU)    (None, 100, 8)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 100, 1)            9         
=================================================================
Total params: 8,553
Trainable params: 8,553
Non-trainable params: 0
_________________________________________________________________


However, I get an exception:



ValueError: Error when checking target: expected dense_1 to have 3 dimensions, but got array with shape (75, 100)


What do I need to fix?










Tags: python, tensorflow, machine-learning, keras, lstm






edited Nov 20 '18 at 11:38

























asked Nov 19 '18 at 14:28 by Shlomi Schwartz
  • What is the training accuracy? Have you tried setting activation='linear' for LSTM layers since you are using LeakyReLU layers?
    – today
    Nov 19 '18 at 14:51










  • And please don't use "samples" when you mean "timesteps". They are different things, and it leads to confusion. In your example, each sample (i.e. sequence) has a shape of (100, 3), which means each sample consists of 100 timesteps where each timestep is a feature vector of length 3. Further, "the shape of the data is (m, 100, 3), where m is the number of batches" is a bit wrong: m is the number of samples (or maybe the number of samples in one batch), not the number of batches. Each batch may consist of one or more samples.
    – today
    Nov 19 '18 at 15:00








  • I don't know whether the claim that the probabilities should not fluctuate or spike, and should monotonically increase or decrease as we process more timesteps, is right or wrong. But you must consider that 1) the model has been trained on sequences of length 100, 2) it has been trained to output the right label after seeing all 100 timesteps, and 3) it does not generate any output for the intermediate timesteps during training. Therefore, I think we should not expect the intermediate outputs in the prediction phase to show any specific behavior; rather, the final one matters.
    – today
    Nov 19 '18 at 15:33








  • I think I agree with "today". I don't think that is a problem, but you can prevent it by creating targets containing all 100 steps. Instead of y = [[0],[1],[0],...], use y = [[0,0,0...],[1,1,1...],[0,0,0...x100], ....] -- For that you'd need return_sequences=True until the end.
    – Daniel Möller
    Nov 20 '18 at 1:09






  • Oh, that's already an answer below :) -- Upvote
    – Daniel Möller
    Nov 20 '18 at 1:10


















1 Answer
Note: This is just an idea and it might be wrong. Try it if you would like, and I would appreciate any feedback.






Is there a way to achieve what I want (avoid extreme spikes while
predicting probability), or is that a given fact?




You can try this experiment: set the return_sequences argument of the last LSTM layer to True, and replicate the label of each sample as many times as the sample's length. For example, if a sample has a length of 100 and its label is 0, create a new label for this sample consisting of 100 zeros (you can easily do this using a NumPy function like np.repeat). Then retrain the new model and test it on new samples afterwards. I am not sure about this, but I would expect more monotonically increasing/decreasing probability graphs this time.





Update: The error you mentioned is caused by the fact that the labels should be a 3D array (look at the output shape of the last layer in the model summary). Use np.expand_dims to add another axis of size one at the end. The correct way of repeating the labels looks like this, assuming y_train has a shape of (num_samples,):



rep_y_train = np.repeat(y_train, num_reps).reshape(-1, num_reps, 1)
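
To sanity-check the resulting shape on toy data (my addition, with made-up numbers):

import numpy as np

y_train = np.array([0, 1, 0])   # hypothetical labels, shape (3,)
num_reps = 100                  # timesteps per sequence
rep_y_train = np.repeat(y_train, num_reps).reshape(-1, num_reps, 1)
print(rep_y_train.shape)        # (3, 100, 1): 3-D, matching dense_1's output shape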




The experiment on the IMDB dataset:



Actually, I tried the experiment suggested above on the IMDB dataset, using a simple model with one LSTM layer. Once I used only one label per sample (as in @Shlomi's original approach), and once I replicated the labels to have one label per timestep of a sample (as I suggested above). Here is the code if you would like to try it yourself:



from keras.layers import *
from keras.models import Sequential, Model
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
import numpy as np

vocab_size = 10000
max_len = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(x_train, maxlen=max_len)

def create_model(return_seq=False, stateful=False):
    batch_size = 1 if stateful else None
    model = Sequential()
    model.add(Embedding(vocab_size, 128, batch_input_shape=(batch_size, None)))
    # CuDNNLSTM needs a GPU; on a CPU-only machine a plain LSTM layer works as well
    model.add(CuDNNLSTM(64, return_sequences=return_seq, stateful=stateful))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
    return model

# train a model with one label per sample
train_model = create_model()
train_model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.3)

# replicate the labels: one label per timestep, with a trailing axis of size 1
y_train_rep = np.repeat(y_train, max_len).reshape(-1, max_len, 1)

# train a model with one label per timestep
rep_train_model = create_model(True)
rep_train_model.fit(X_train, y_train_rep, epochs=10, batch_size=128, validation_split=0.3)


Then we can create stateful replicas of the trained models and run them on some test data to compare their results:



# replica of `train_model` with the same weights
test_model = create_model(False, True)
test_model.set_weights(train_model.get_weights())
test_model.reset_states()

# replica of `rep_train_model` with the same weights
rep_test_model = create_model(True, True)
rep_test_model.set_weights(rep_train_model.get_weights())
rep_test_model.reset_states()

def stateful_predict(model, samples):
    preds = []
    for s in samples:
        model.reset_states()   # fresh state for each sequence
        ps = []
        for ts in s:
            # feed one timestep at a time; input shape is (1, 1)
            p = model.predict(np.array([[ts]]))
            ps.append(p[0, 0])
        preds.append(list(ps))
    return preds

X_test = pad_sequences(x_test, maxlen=max_len)


The first sample of X_test has a 0 label (i.e. it belongs to the negative class) and the second sample has a 1 label (i.e. the positive class). So let's first see what the stateful predictions of test_model (i.e. the one trained using one label per sample) look like for these two samples:



import matplotlib.pyplot as plt

preds = stateful_predict(test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



[plot: test_model stateful predictions]



The correct probability at the end (i.e. at timestep 200), but very spiky and fluctuating in between. Now let's compare this with the stateful predictions of rep_test_model (i.e. the one trained using one label per timestep):



preds = stateful_predict(rep_test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



[plot: rep_test_model stateful predictions]



Again, the correct label prediction at the end, but this time with a much smoother and more monotonic trend, as expected.



Note that this was just a demonstration, so I used a very simple model with only one LSTM layer and did not attempt to tune it at all. I guess that with better tuning (e.g. adjusting the number of layers, the number of units per layer, the activation functions, the optimizer type and parameters, etc.), you might get far better results.






answered Nov 19 '18 at 15:50 by today, edited Nov 20 '18 at 12:34
  • Do you mean I'll have 100 probabilities as the output, or is it just the input to the last dense layer?
    – Shlomi Schwartz
    Nov 19 '18 at 15:52












  • @ShlomiSchwartz Yes, only at training time. But don't change the number of units in the last layer; I was wrong about that and have modified my answer. At prediction time, though, you would feed one timestep and get only one probability per timestep (not 100).
    – today
    Nov 19 '18 at 15:56








  • @ShlomiSchwartz Essentially, at training time you would have a probability for each sub-sequence of length L.
    – today
    Nov 19 '18 at 15:58






  • @ShlomiSchwartz See my answer. I have updated it with the way to resolve the error you get, as well as a simple experiment on the IMDB dataset.
    – today
    Nov 20 '18 at 12:04






  • @ShlomiSchwartz You are welcome! Oh, I wish I was! But actually I am very very very far from the best :)
    – today
    Nov 20 '18 at 16:50













Your Answer






StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});


}
});














draft saved

draft discarded


















StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53376761%2flstm-making-predictions-on-partial-sequence%23new-answer', 'question_page');
}
);

Post as a guest















Required, but never shown

























1 Answer
1






active

oldest

votes








1 Answer
1






active

oldest

votes









active

oldest

votes






active

oldest

votes









3














Note: This is just an idea and it might be wrong. Try it if you would like and I would appreciate any feedback.






Is there a way to achieve what I want (avoid extreme spikes while
predicting probability), or is that a given fact?




You can do this experiment: set the return_sequences argument of last LSTM layer to True and replicate the labels of each sample as much as the length of each sample. For example if a sample has a length of 100 and its label is 0, then create a new label for this sample which consists of 100 zeros (you can probably easily do this using numpy function like np.repeat). Then retrain your new model and test it on new samples afterwards. I am not sure of this, but I would expect more monotonically increasing/decreasing probability graphs this time.





Update: The error you mentioned is caused by the fact that the labels should be a 3D array (look at the output shape of last layer in the model summary). Use np.expand_dims to add another axis of size one to the end. The correct way of repeating the labels would look like this, assuming y_train has a shape of (num_samples,):



rep_y_train = np.repeat(y_train, num_reps).reshape(-1, num_reps, 1)




The experiment on IMDB dataset:



Actually, I tried the experiment suggested above on the IMDB dataset using a simple model with one LSTM layer. One time, I used only one label per each sample (as in original approach of @Shlomi) and the other time I replicated the labels to have one label per each timestep of a sample (as I suggested above). Here is the code if you would like to try it yourself:



from keras.layers import *
from keras.models import Sequential, Model
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
import numpy as np

vocab_size = 10000
max_len = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(x_train, maxlen=max_len)

def create_model(return_seq=False, stateful=False):
batch_size = 1 if stateful else None
model = Sequential()
model.add(Embedding(vocab_size, 128, batch_input_shape=(batch_size, None)))
model.add(CuDNNLSTM(64, return_sequences=return_seq, stateful=stateful))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
return model

# train model with one label per sample
train_model = create_model()
train_model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.3)

# replicate the labels
y_train_rep = np.repeat(y_train, max_len).reshape(-1, max_len, 1)

# train model with one label per timestep
rep_train_model = create_model(True)
rep_train_model.fit(X_train, y_train_rep, epochs=10, batch_size=128, validation_split=0.3)


Then we can create the stateful replicas of the training models and run them on some test data to compare their results:



# replica of `train_model` with the same weights
test_model = create_model(False, True)
test_model.set_weights(train_model.get_weights())
test_model.reset_states()

# replica of `rep_train_model` with the same weights
rep_test_model = create_model(True, True)
rep_test_model.set_weights(rep_train_model.get_weights())
rep_test_model.reset_states()

def stateful_predict(model, samples):
preds =
for s in samples:
model.reset_states()
ps =
for ts in s:
p = model.predict(np.array([[ts]]))
ps.append(p[0,0])
preds.append(list(ps))
return preds

X_test = pad_sequences(x_test, maxlen=max_len)


Actually, the first sample of X_test has a 0 label (i.e. belongs to negative class) and the second sample of X_test has a 1 label (i.e. belongs to positive class). So let's first see what the stateful prediction of test_model (i.e. the one that were trained using one label per sample) for these two samples would look like:



import matplotlib.pyplot as plt

preds = stateful_predict(test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



<code>test_model</code> stateful predictions



Correct label (i.e. probability) at the end (i.e. timestep 200) but very spiky and fluctuating in between. Now let's compare it with the stateful predictions of the rep_test_model (i.e. the one that were trained using one label per each timestep):



preds = stateful_predict(rep_test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



<code>rep_test_model</code> stateful predictions



Again, correct label prediction at the end but this time with a much more smoother and monotonic trend, as expected.



Note that this was just an example for demonstration and therefore I have used a very simple model here with just one LSTM layer and I did not attempt to tune it at all. I guess with a better tuning of the model (e.g. adjusting the number of layers, number of units in each layer, activation functions used, optimizer type and parameters, etc.), you might get far better results.






share|improve this answer























  • do you mean I'll have 100 probabilities as the output, or is it just the input to the last dense layer?
    – Shlomi Schwartz
    Nov 19 '18 at 15:52












  • @ShlomiSchwartz Yes only in training time. But don't change the number of units in the last layer. I was wrong on that and modified my answer. In prediction time though, you would give one timestep and you would get only one probability per timestep (and not 100).
    – today
    Nov 19 '18 at 15:56








  • 1




    @ShlomiSchwartz Essentially, in training time you would have a probability for each sub-sequence of length L.
    – today
    Nov 19 '18 at 15:58






  • 1




    @ShlomiSchwartz See my answer. I have updated it with the way to resolve the error you get as well as a simple experiment on IMDB dataset.
    – today
    Nov 20 '18 at 12:04






  • 1




    @ShlomiSchwartz You are welcome! Oh, I wish I was! But actually I am very very very far from the best :)
    – today
    Nov 20 '18 at 16:50


















3














Note: This is just an idea and it might be wrong. Try it if you would like and I would appreciate any feedback.






Is there a way to achieve what I want (avoid extreme spikes while
predicting probability), or is that a given fact?




You can do this experiment: set the return_sequences argument of last LSTM layer to True and replicate the labels of each sample as much as the length of each sample. For example if a sample has a length of 100 and its label is 0, then create a new label for this sample which consists of 100 zeros (you can probably easily do this using numpy function like np.repeat). Then retrain your new model and test it on new samples afterwards. I am not sure of this, but I would expect more monotonically increasing/decreasing probability graphs this time.





Update: The error you mentioned is caused by the fact that the labels should be a 3D array (look at the output shape of last layer in the model summary). Use np.expand_dims to add another axis of size one to the end. The correct way of repeating the labels would look like this, assuming y_train has a shape of (num_samples,):



rep_y_train = np.repeat(y_train, num_reps).reshape(-1, num_reps, 1)




The experiment on IMDB dataset:



Actually, I tried the experiment suggested above on the IMDB dataset using a simple model with one LSTM layer. One time, I used only one label per each sample (as in original approach of @Shlomi) and the other time I replicated the labels to have one label per each timestep of a sample (as I suggested above). Here is the code if you would like to try it yourself:



from keras.layers import *
from keras.models import Sequential, Model
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
import numpy as np

vocab_size = 10000
max_len = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(x_train, maxlen=max_len)

def create_model(return_seq=False, stateful=False):
batch_size = 1 if stateful else None
model = Sequential()
model.add(Embedding(vocab_size, 128, batch_input_shape=(batch_size, None)))
model.add(CuDNNLSTM(64, return_sequences=return_seq, stateful=stateful))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
return model

# train model with one label per sample
train_model = create_model()
train_model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.3)

# replicate the labels
y_train_rep = np.repeat(y_train, max_len).reshape(-1, max_len, 1)

# train model with one label per timestep
rep_train_model = create_model(True)
rep_train_model.fit(X_train, y_train_rep, epochs=10, batch_size=128, validation_split=0.3)


Then we can create the stateful replicas of the training models and run them on some test data to compare their results:



# replica of `train_model` with the same weights
test_model = create_model(False, True)
test_model.set_weights(train_model.get_weights())
test_model.reset_states()

# replica of `rep_train_model` with the same weights
rep_test_model = create_model(True, True)
rep_test_model.set_weights(rep_train_model.get_weights())
rep_test_model.reset_states()

def stateful_predict(model, samples):
preds =
for s in samples:
model.reset_states()
ps =
for ts in s:
p = model.predict(np.array([[ts]]))
ps.append(p[0,0])
preds.append(list(ps))
return preds

X_test = pad_sequences(x_test, maxlen=max_len)


Actually, the first sample of X_test has a 0 label (i.e. belongs to negative class) and the second sample of X_test has a 1 label (i.e. belongs to positive class). So let's first see what the stateful prediction of test_model (i.e. the one that were trained using one label per sample) for these two samples would look like:



import matplotlib.pyplot as plt

preds = stateful_predict(test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



<code>test_model</code> stateful predictions



Correct label (i.e. probability) at the end (i.e. timestep 200) but very spiky and fluctuating in between. Now let's compare it with the stateful predictions of the rep_test_model (i.e. the one that were trained using one label per each timestep):



preds = stateful_predict(rep_test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



<code>rep_test_model</code> stateful predictions



Again, correct label prediction at the end but this time with a much more smoother and monotonic trend, as expected.



Note that this was just an example for demonstration and therefore I have used a very simple model here with just one LSTM layer and I did not attempt to tune it at all. I guess with a better tuning of the model (e.g. adjusting the number of layers, number of units in each layer, activation functions used, optimizer type and parameters, etc.), you might get far better results.






share|improve this answer























  • do you mean I'll have 100 probabilities as the output, or is it just the input to the last dense layer?
    – Shlomi Schwartz
    Nov 19 '18 at 15:52












  • @ShlomiSchwartz Yes only in training time. But don't change the number of units in the last layer. I was wrong on that and modified my answer. In prediction time though, you would give one timestep and you would get only one probability per timestep (and not 100).
    – today
    Nov 19 '18 at 15:56








  • 1




    @ShlomiSchwartz Essentially, in training time you would have a probability for each sub-sequence of length L.
    – today
    Nov 19 '18 at 15:58






  • 1




    @ShlomiSchwartz See my answer. I have updated it with the way to resolve the error you get as well as a simple experiment on IMDB dataset.
    – today
    Nov 20 '18 at 12:04






  • 1




    @ShlomiSchwartz You are welcome! Oh, I wish I was! But actually I am very very very far from the best :)
    – today
    Nov 20 '18 at 16:50
















3












3








3






Note: This is just an idea and it might be wrong. Try it if you would like and I would appreciate any feedback.






Is there a way to achieve what I want (avoid extreme spikes while
predicting probability), or is that a given fact?




You can do this experiment: set the return_sequences argument of last LSTM layer to True and replicate the labels of each sample as much as the length of each sample. For example if a sample has a length of 100 and its label is 0, then create a new label for this sample which consists of 100 zeros (you can probably easily do this using numpy function like np.repeat). Then retrain your new model and test it on new samples afterwards. I am not sure of this, but I would expect more monotonically increasing/decreasing probability graphs this time.





Update: The error you mentioned is caused by the fact that the labels should be a 3D array (look at the output shape of last layer in the model summary). Use np.expand_dims to add another axis of size one to the end. The correct way of repeating the labels would look like this, assuming y_train has a shape of (num_samples,):



rep_y_train = np.repeat(y_train, num_reps).reshape(-1, num_reps, 1)




The experiment on IMDB dataset:



Actually, I tried the experiment suggested above on the IMDB dataset using a simple model with one LSTM layer. One time, I used only one label per each sample (as in original approach of @Shlomi) and the other time I replicated the labels to have one label per each timestep of a sample (as I suggested above). Here is the code if you would like to try it yourself:



from keras.layers import *
from keras.models import Sequential, Model
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
import numpy as np

vocab_size = 10000
max_len = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(x_train, maxlen=max_len)

def create_model(return_seq=False, stateful=False):
batch_size = 1 if stateful else None
model = Sequential()
model.add(Embedding(vocab_size, 128, batch_input_shape=(batch_size, None)))
model.add(CuDNNLSTM(64, return_sequences=return_seq, stateful=stateful))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
return model

# train model with one label per sample
train_model = create_model()
train_model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.3)

# replicate the labels
y_train_rep = np.repeat(y_train, max_len).reshape(-1, max_len, 1)

# train model with one label per timestep
rep_train_model = create_model(True)
rep_train_model.fit(X_train, y_train_rep, epochs=10, batch_size=128, validation_split=0.3)


Then we can create the stateful replicas of the training models and run them on some test data to compare their results:



# replica of `train_model` with the same weights
test_model = create_model(False, True)
test_model.set_weights(train_model.get_weights())
test_model.reset_states()

# replica of `rep_train_model` with the same weights
rep_test_model = create_model(True, True)
rep_test_model.set_weights(rep_train_model.get_weights())
rep_test_model.reset_states()

def stateful_predict(model, samples):
preds =
for s in samples:
model.reset_states()
ps =
for ts in s:
p = model.predict(np.array([[ts]]))
ps.append(p[0,0])
preds.append(list(ps))
return preds

X_test = pad_sequences(x_test, maxlen=max_len)


Actually, the first sample of X_test has a 0 label (i.e. belongs to negative class) and the second sample of X_test has a 1 label (i.e. belongs to positive class). So let's first see what the stateful prediction of test_model (i.e. the one that were trained using one label per sample) for these two samples would look like:



import matplotlib.pyplot as plt

preds = stateful_predict(test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



<code>test_model</code> stateful predictions



Correct label (i.e. probability) at the end (i.e. timestep 200) but very spiky and fluctuating in between. Now let's compare it with the stateful predictions of the rep_test_model (i.e. the one that were trained using one label per each timestep):



preds = stateful_predict(rep_test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



<code>rep_test_model</code> stateful predictions



Again, correct label prediction at the end but this time with a much more smoother and monotonic trend, as expected.



Note that this was just an example for demonstration and therefore I have used a very simple model here with just one LSTM layer and I did not attempt to tune it at all. I guess with a better tuning of the model (e.g. adjusting the number of layers, number of units in each layer, activation functions used, optimizer type and parameters, etc.), you might get far better results.






share|improve this answer














Note: This is just an idea and it might be wrong. Try it if you would like and I would appreciate any feedback.






Is there a way to achieve what I want (avoid extreme spikes while
predicting probability), or is that a given fact?




You can do this experiment: set the return_sequences argument of last LSTM layer to True and replicate the labels of each sample as much as the length of each sample. For example if a sample has a length of 100 and its label is 0, then create a new label for this sample which consists of 100 zeros (you can probably easily do this using numpy function like np.repeat). Then retrain your new model and test it on new samples afterwards. I am not sure of this, but I would expect more monotonically increasing/decreasing probability graphs this time.





Update: The error you mentioned is caused by the fact that the labels should be a 3D array (look at the output shape of last layer in the model summary). Use np.expand_dims to add another axis of size one to the end. The correct way of repeating the labels would look like this, assuming y_train has a shape of (num_samples,):



rep_y_train = np.repeat(y_train, num_reps).reshape(-1, num_reps, 1)




The experiment on IMDB dataset:



Actually, I tried the experiment suggested above on the IMDB dataset using a simple model with one LSTM layer. One time, I used only one label per each sample (as in original approach of @Shlomi) and the other time I replicated the labels to have one label per each timestep of a sample (as I suggested above). Here is the code if you would like to try it yourself:



from keras.layers import *
from keras.models import Sequential, Model
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences
import numpy as np

vocab_size = 10000
max_len = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
X_train = pad_sequences(x_train, maxlen=max_len)

def create_model(return_seq=False, stateful=False):
batch_size = 1 if stateful else None
model = Sequential()
model.add(Embedding(vocab_size, 128, batch_input_shape=(batch_size, None)))
model.add(CuDNNLSTM(64, return_sequences=return_seq, stateful=stateful))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
return model

# train model with one label per sample
train_model = create_model()
train_model.fit(X_train, y_train, epochs=10, batch_size=128, validation_split=0.3)

# replicate the labels
y_train_rep = np.repeat(y_train, max_len).reshape(-1, max_len, 1)

# train model with one label per timestep
rep_train_model = create_model(True)
rep_train_model.fit(X_train, y_train_rep, epochs=10, batch_size=128, validation_split=0.3)
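
One caveat (my note, not from the original answer): CuDNNLSTM only runs on a GPU with cuDNN available. On a CPU-only machine you could swap the recurrent layer in create_model for a standard LSTM, which should behave roughly the same for the purposes of this experiment:

# CPU-only alternative inside create_model (slower, same interface here):
model.add(LSTM(64, return_sequences=return_seq, stateful=stateful))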


Then we can create stateful replicas of the trained models and run them on some test data to compare their results:



# replica of `train_model` with the same weights
test_model = create_model(False, True)
test_model.set_weights(train_model.get_weights())
test_model.reset_states()

# replica of `rep_train_model` with the same weights
rep_test_model = create_model(True, True)
rep_test_model.set_weights(rep_train_model.get_weights())
rep_test_model.reset_states()

def stateful_predict(model, samples):
    preds = []
    for s in samples:
        model.reset_states()
        ps = []
        for ts in s:
            # feed one timestep at a time; input shape (1, 1) = (batch, timesteps)
            p = model.predict(np.array([[ts]]))
            ps.append(p[0, 0])
        preds.append(list(ps))
    return preds

X_test = pad_sequences(x_test, maxlen=max_len)
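
As a quick sanity check (a hypothetical addition, not part of the original experiment), the stateful replica's final-timestep probability should agree with the non-stateful model's prediction on the full sequence:

# Final-step probability from the stateful model should match the
# full-sequence prediction of the non-stateful model (up to numeric noise).
full_prob = train_model.predict(X_test[0:1])[0, 0]
step_probs = stateful_predict(test_model, X_test[0:1])[0]
print(abs(full_prob - step_probs[-1]) < 1e-4)  # expected: True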


Actually, the first sample of X_test has a 0 label (i.e. it belongs to the negative class) and the second sample has a 1 label (i.e. the positive class). So let's first see what the stateful predictions of test_model (i.e. the one that was trained using one label per sample) look like for these two samples:



import matplotlib.pyplot as plt

preds = stateful_predict(test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



[plot: test_model stateful predictions]



Correct label (i.e. probability) at the end (i.e. timestep 200), but very spiky and fluctuating in between. Now let's compare it with the stateful predictions of rep_test_model (i.e. the one that was trained using one label per timestep):



preds = stateful_predict(rep_test_model, X_test[0:2])

plt.plot(preds[0])
plt.plot(preds[1])
plt.legend(['Class 0', 'Class 1'])


The result:



[plot: rep_test_model stateful predictions]



Again, a correct label prediction at the end, but this time with a much smoother and more monotonic trend, as expected.



Note that this was just an example for demonstration, so I used a very simple model with just one LSTM layer and did not attempt to tune it at all. I expect that with better tuning of the model (e.g. adjusting the number of layers, the number of units in each layer, the activation functions, the optimizer type and parameters, etc.) you might get far better results.







answered Nov 19 '18 at 15:50, edited Nov 20 '18 at 12:34 – today
  • do you mean I'll have 100 probabilities as the output, or is it just the input to the last dense layer? – Shlomi Schwartz, Nov 19 '18 at 15:52

  • @ShlomiSchwartz Yes, but only at training time. Don't change the number of units in the last layer; I was wrong about that and have modified my answer. At prediction time, you would feed one timestep and get one probability per timestep (not 100). – today, Nov 19 '18 at 15:56

  • @ShlomiSchwartz Essentially, at training time you would have a probability for each sub-sequence of length L. – today, Nov 19 '18 at 15:58

  • @ShlomiSchwartz See my answer. I have updated it with the way to resolve the error you get, as well as a simple experiment on the IMDB dataset. – today, Nov 20 '18 at 12:04

  • @ShlomiSchwartz You are welcome! Oh, I wish I was! But actually I am very, very far from the best :) – today, Nov 20 '18 at 16:50