Read a comma-separated file that also has commas within a column

I have a file where the separator between columns is a comma (,). However, commas may also occur within a column, namely in the 'Notes' column:

Id,Notes,Other_ID
100,This text looks good,1000
101,This text,have,comma,2000

I tried to read the csv:

r <- read.csv("test.csv", sep = ",")

As a result, I got:

Id.Notes.GUID
100,This text is good,1000
102,This text,have,comma,2000

which is incorrect, as I would like to have the output as

Id       Notes                  GUID
100      This text is good      1000
102      This text,have,comma   2000

The goal is to read the data with the columns intact, so that a comma inside a column is kept as text and not treated as a delimiter.

Thanks in advance.

edited Jan 2 at 17:21

Henrik

asked Jan 2 at 16:49

Ambuj

r csv

1 Answer

1) read.pattern: read.pattern (from the gsubfn package) reads the fields according to the supplied regular expression. For reproducibility we use Lines from the Note below; if the data is in a file, replace text = Lines with the file name, e.g. "myfile.csv".

library(gsubfn)

read.pattern(text = Lines, pattern = "^(.*?),(.*),(.*)$", header = TRUE, as.is = TRUE)

giving:

   Id                Notes Other_ID
1 100 This text looks good     1000
2 101 This text,have,comma     2000

2) Base R: Read the data into a character vector and replace the first and last comma on each line with a character that does not otherwise occur in the data, such as a semicolon. Then read the result.

L.raw <- readLines(textConnection(Lines))
L.semi <- sub(",(.*),", ";\\1;", L.raw)
read.table(text = L.semi, header = TRUE, sep = ";", as.is = TRUE)

giving:

   Id                Notes Other_ID
1 100 This text looks good     1000
2 101 This text,have,comma     2000
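The approach in 2) relies on the replacement character being absent from the data. A minimal sketch of checking that assumption before substituting is given here; the list of candidate separators and the name sep_char are illustrative choices, not part of the original answer.

```r
Lines <- "Id,Notes,Other_ID
100,This text looks good,1000
101,This text,have,comma,2000"

con <- textConnection(Lines)
L.raw <- readLines(con)
close(con)

# Candidate separators (an assumption); keep the first one that never occurs in the data
candidates <- c(";", "|", "\t")
absent <- !vapply(candidates,
                  function(ch) any(grepl(ch, L.raw, fixed = TRUE)),
                  logical(1))
stopifnot(any(absent))
sep_char <- candidates[absent][1]

# Replace the first and last comma on each line with the chosen separator
L.sep <- sub(",(.*),", paste0(sep_char, "\\1", sep_char), L.raw)

DF <- read.table(text = L.sep, header = TRUE, sep = sep_char,
                 as.is = TRUE, quote = "", comment.char = "")
DF
```

Setting quote = "" and comment.char = "" here keeps apostrophes or # characters in the free-text column from being interpreted by read.table.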

3) gawk: If the input file is very large, it is likely faster to do as much as possible outside of R, for example with gawk. (On Windows, install Rtools if you don't already have gawk, and make sure it is on your path or refer to it by its full pathname.) In the BEGIN block, first is the number of commas to replace before the field that contains commas, and last is the number of commas to replace after it. Here the field with commas is the second of 3 fields, so first = last = 1.

# generate test input
Lines <- "Id,Notes,Other_ID
100,This text looks good, 1000
101,This text,have,comma,2000"

cat(Lines, file = "ambuj.dat")

# gawk program to replace commas
ambuj.awk <- '
BEGIN { first = 1; last = 1 }
{
  nc = gsub(/,/, ",") # number of commas
  for(i = nc; i > nc-last; i--) $0 = gensub(/,/, ";", i) # replace the final "last" commas
  for(i = 0; i < first; i++) sub(/,/, ";") # replace the leading "first" commas
  print
}'
cat(ambuj.awk, file = "ambuj.awk")

read.csv(pipe("gawk -f ambuj.awk ambuj.dat"), sep = ";", quote = "",
 comment.char = "")

Also you could set colClasses= to speed it up a bit more.
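As a rough illustration of the colClasses= suggestion, the sketch below reads semicolon-separated text of the kind the gawk step would produce. The text is supplied inline so that the example runs without gawk, and the column types are assumptions based on the sample data.

```r
# Semicolon-separated lines, as they would look after the comma replacement
L.semi <- c("Id;Notes;Other_ID",
            "100;This text looks good;1000",
            "101;This text,have,comma;2000")

# Declaring the column types up front spares read.csv from guessing them
DF <- read.csv(text = L.semi, sep = ";", quote = "", comment.char = "",
               colClasses = c("integer", "character", "integer"))
str(DF)
```

With a real file, the same colClasses= argument could be added to the read.csv(pipe(...)) call above.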

Note

Lines <- "Id,Notes,Other_ID
100,This text looks good, 1000
101,This text,have,comma,2000"

edited Jan 8 at 16:21

answered Jan 2 at 16:56

G. Grothendieck

Awesome, worked like a charm. Quick question: do we have to write (.*) as many times as there are columns in the data set?

– Ambuj
Jan 2 at 18:10

The pattern in read.pattern must correspond to the data so if the data has a different number of columns or the columns are in a different order then the pattern will have to be modified appropriately.

– G. Grothendieck
Jan 7 at 13:58

One more question: the number of commas (,) in that text column is not fixed; some rows have 2 and others have 7. How can we determine the maximum number of commas and prevent them from splitting the text into several parts?

– Ambuj
Jan 7 at 14:06
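One way to see how many commas each row contains is base R's count.fields, which reports the number of comma-separated fields on every line. The sketch below applies it to the sample data from the question; the name n_expected is illustrative, and the count of 3 expected columns is taken from the header.

```r
Lines <- "Id,Notes,Other_ID
100,This text looks good,1000
101,This text,have,comma,2000"

con <- textConnection(Lines)
# quote = "" and comment.char = "" so that only commas decide the field count
n_fields <- count.fields(con, sep = ",", quote = "", comment.char = "")
close(con)

n_fields                      # 3 3 5

# Fields beyond the expected count come from commas inside the Notes column
n_expected <- n_fields[1]     # number of columns in the header line
extra_commas <- n_fields - n_expected
max(extra_commas)             # 2
```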

The number of commas shouldn't matter. The ^(.*?), captures everything up to the first comma; the first (.*), captures everything after the first comma up to the last comma, no matter how many commas that field contains; and the (.*)$ at the end captures the last field.

– G. Grothendieck
Jan 7 at 14:12
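The capture behaviour described in this comment can be checked in base R without gsubfn. The sketch below applies the same pattern to a made-up line whose middle field contains seven commas; perl = TRUE is passed only to make the lazy quantifier explicit.

```r
# A hypothetical line whose middle field contains seven commas
x <- "101,a,b,c,d,e,f,g,h,2000"

pattern <- "^(.*?),(.*),(.*)$"

# regexec returns the full match followed by the capture groups; drop the full match
fields <- regmatches(x, regexec(pattern, x, perl = TRUE))[[1]][-1]
fields
# "101"  "a,b,c,d,e,f,g,h"  "2000"
```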

So do I have to enter (.*) as many times as there are text columns in the data, or will defining (.*) once do the task? I have 8 columns: 3 numeric columns at the beginning, the 4th column is text with commas, the 5th column is blank, the 6th is a number, and the 7th and 8th are text. The only column that has commas in its text is the 5th one, for which I wrote the pattern as: ^(.*?),(.*?),(.*?),(.*),(.*),(.*?),(.*),(.*)$. Please correct me if I am wrong.

– Ambuj
Jan 7 at 15:19
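For a layout with more columns, an alternative to writing a longer pattern is to split every line on commas and glue the surplus pieces back into the one free-text column. The following is a sketch of that different technique, not the read.pattern method from the answer. The helper name read_with_commas, the sample data, and the choice of the 4th of 8 columns as the one containing commas are all assumptions made for illustration.

```r
# Hypothetical helper: n_before / n_after are the numbers of ordinary columns
# to the left and right of the single column that may contain commas.
read_with_commas <- function(lines, n_before, n_after) {
  parts <- strsplit(lines, ",", fixed = TRUE)
  rows <- lapply(parts, function(p) {
    n <- length(p)
    stopifnot(n >= n_before + n_after + 1)
    c(p[seq_len(n_before)],
      paste(p[(n_before + 1):(n - n_after)], collapse = ","),  # rejoin the text field
      p[seq(n - n_after + 1, length.out = n_after)])
  })
  DF <- as.data.frame(do.call(rbind, rows[-1]), stringsAsFactors = FALSE)
  names(DF) <- rows[[1]]                           # first line is the header
  DF[] <- lapply(DF, type.convert, as.is = TRUE)   # restore numeric columns
  DF
}

# Made-up data: 8 columns, commas only in the 4th, 5th column blank
lines <- c("A,B,C,Notes,E,F,G,H",
           "1,2,3,plain text,,6,foo,bar",
           "4,5,6,text,with,commas,,7,baz,qux")

DF <- read_with_commas(lines, n_before = 3, n_after = 4)
DF
```

One limitation of this sketch: strsplit drops a trailing empty string, so a line whose last column is empty would need separate handling.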
