For loop and if statements in R
I have a data frame orange_train with 231 variables and 50,000 observations. I want to check each variable for NAs or zeros. If the number of NAs (for factors) or zeros (for numeric and integer columns) is greater than 75% of the 50,000 observations, I want to eliminate that variable. My code is below, but it's not working as expected:
counting_na <- function(x) {sum(is.na(x))}
counting_zero <- function(x){length(which(x==0))}
for(i in 1:ncol(orange_train)){
if (class(orange_train$Var[i])=='numeric' && sum(is.na(orange_train$Var[i]))< 32500)
{print(orange_train$Var[i])}
else (class(orange_train$Var[i])=='integer' && counting_zero(orange_train$Var[i]) < 32500)
{print(orange_train$Var[i])}
Could someone please help me with the code? I have been struggling for a long time now and am very new to R.
My columns have headers Var1 to Var231, and the data types are numeric, factor and integer. I hope this helps.
r
counting_zero <- function(x) sum(x==0)
– jogo
Nov 21 '18 at 21:49
It would be helpful if you gave a sample of what your data looks like using dput(). Also, you're looping over the columns in orange_train, but you're indexing over the rows in one variable. Perhaps you mean orange_train[[i]], instead of orange_train$Var[i]?
– mickey
Nov 21 '18 at 22:03
Welcome to SO! Please read How to Ask and give a Minimal, Complete, and Verifiable example in your question! Copy the output of dput(head(orange_train, 10)) into your question!
– jogo
Nov 21 '18 at 22:04
my columns have headers Var1 - Var231 and the data types are numeric, factors and integers. I hope this helps
– Sindhu Viswanathan
Nov 22 '18 at 0:29
@SindhuViswanathan, it does, but then you are still indexing them improperly. You could use orange_train[paste0('Var', i)] instead.
– mickey
Nov 22 '18 at 2:37
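Putting the comments above together, a corrected version of the loop might look like the sketch below. The small orange_train here is a made-up stand-in (the real data was not posted), and keep and orange_small are hypothetical names. Note that 75% of 50,000 is 37,500 rather than 32,500, so the sketch derives the threshold from nrow() instead of hard-coding it.

```r
# Hypothetical stand-in for orange_train (the real data was not posted)
orange_train <- data.frame(
  Var1 = c(0, 0, 0, 0, 5),                  # numeric, 80% zeros
  Var2 = c(1L, 2L, 0L, 4L, 5L),             # integer, 20% zeros
  Var3 = factor(c(NA, NA, NA, NA, "a")),    # factor, 80% NA
  Var4 = factor(c("a", "b", NA, "a", "b"))  # factor, 20% NA
)

threshold <- 0.75 * nrow(orange_train)      # 75% of the rows

counting_na   <- function(x) sum(is.na(x))
# na.rm = TRUE so that NAs in a column do not turn the whole sum into NA
counting_zero <- function(x) sum(x == 0, na.rm = TRUE)

keep <- logical(ncol(orange_train))
for (i in seq_along(orange_train)) {
  col <- orange_train[[i]]                  # the i-th column, not the i-th row of one column
  if (is.factor(col)) {
    keep[i] <- counting_na(col) <= threshold
  } else if (is.numeric(col)) {             # TRUE for both double and integer columns
    keep[i] <- counting_zero(col) <= threshold
  } else {
    keep[i] <- TRUE                         # leave any other column type untouched
  }
}

orange_small <- orange_train[, keep, drop = FALSE]
names(orange_small)
# [1] "Var2" "Var4"
```

The else-if chain replaces the original `else (condition)`, which R does not treat as a second test, and the factor branch counts NAs as described in the question text.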
edited Nov 22 '18 at 0:58
asked Nov 21 '18 at 21:46 by Sindhu Viswanathan
1 Answer
Example data

set.seed(10)
df <- data.frame(a = sample(c(NA, LETTERS[1]), 100, TRUE, prob = c(.75, .25))
                 , b = sample(0:1, 100, TRUE, prob = c(.75, .25))
                 , stringsAsFactors = TRUE)  # R >= 4.0 no longer makes strings factors by default

Calculate the percentages for each column (percent NA for factor, percent 0 for numeric)

percents <-
  sapply(df, function(x){
    if(is.factor(x)) mean(is.na(x))
    else if(is.numeric(x)) mean(x == 0)
    else NA})

percents
# output as posted; set.seed(10) may give other values on R >= 3.6
#    a    b
# 0.84 0.75

Remove the ones greater than 75%

df[percents > 0.75] <- NULL

names(df)
#[1] "b"

You can see that the column a was removed, because it was a factor with 84% NAs.
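On real data two things may need extra care: mean(x == 0) returns NA when a numeric column itself contains NAs, and any column that is neither factor nor numeric gets NA as its percentage. A more defensive variant could look like the sketch below. The data frame and the name drop are hypothetical, and dividing by length(x) is one possible choice that counts NAs as non-zero values.

```r
# Hypothetical data: a numeric column with an NA and a character column
df <- data.frame(
  a = factor(c(NA, NA, NA, NA, "A")),  # factor, 80% NA
  b = c(0, 0, 0, NA, 1),               # numeric, 60% zeros plus one NA
  c = c("x", "y", "z", "x", "y"),      # character: neither factor nor numeric
  stringsAsFactors = FALSE
)

percents <- vapply(df, function(x) {
  if (is.factor(x)) mean(is.na(x))
  else if (is.numeric(x)) sum(x == 0, na.rm = TRUE) / length(x)  # share of all rows that are zero
  else NA_real_                                                  # unknown type: no percentage
}, numeric(1))

drop <- which(percents > 0.75)         # which() skips NA entries
if (length(drop) > 0) df <- df[-drop]  # guard: df[-integer(0)] would select no columns
names(df)
# [1] "b" "c"
```

Using which() keeps NA out of the column index, so columns without a percentage are simply left in place.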
This worked like a charm! Thank you @IceCreamToucan!!! I appreciate your timely help!
– Sindhu Viswanathan
Nov 22 '18 at 4:21
edited Nov 21 '18 at 22:21
answered Nov 21 '18 at 22:14 by IceCreamToucan