How can I find the largest number in a very large text file (~150 GB)?
I have a text file that has around 100000000 lines, each of the following type:



string num1 num2 num3 ... num500
string num1 num2 num3 ... num40


I want to find the largest number present in this file.



My current code reads each line, splits it by space, and stores the largest number in the current line. Then, I compare it with the largest number of the next line, and retain the larger of the two.



with open(filename, 'r') as f:
    prev_max = -1
    for line in f:
        nums = [int(n) for n in line.split(' ')[1:]]
        line_max = max(nums)
        if line_max > prev_max:
            prev_max = line_max


But this takes forever. Is there a better way to do this?
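
For reference, the same scan can be written as one generator expression over the whole file (a sketch; it assumes every field after the first parses as an int and that the file contains at least one number), though it is unlikely to change the picture much:

with open(filename) as f:
    largest = max(int(tok) for line in f for tok in line.split()[1:])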



I am open to solutions with awk or other shell commands as well.



Edit: Added how I am reading the file.

python-3.x text awk large-data

asked Jan 1 at 5:20 by user110327 (edited Jan 1 at 6:15)

  • How are you getting all_lines? – Alexander Reynolds, Jan 1 at 5:21

  • What do you mean by reading normally? Please post a minimal example showing what you actually do with the file. – Mad Physicist, Jan 1 at 6:08

  • You didn't answer my question. What is all_lines specifically? Please post all of your code. – Alexander Reynolds, Jan 1 at 6:13

  • blog.pythonlibrary.org/2014/01/27/… – Windchill, Jan 1 at 6:13

  • Basically -- take the size of your file in bytes, halve it, seek to that point, find the location of the next newline, make that your split point, so one thread finds the max of everything before it and one thread finds the max of everything after. Repeat until you've got the workload split into an adequate number of subdivisions. (A sketch of this approach follows below.) – Charles Duffy, Jan 1 at 15:36
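
A minimal sketch of the splitting approach from the last comment, using processes rather than threads so the int-parsing work actually runs in parallel (the file name and worker count are placeholders; it assumes the line format described in the question):

import os
from multiprocessing import Pool

FILENAME = "file"  # placeholder path

def chunk_max(bounds):
    # Scan only the lines that *start* inside [start, end); a line that
    # straddles a boundary is handled exactly once via the seek(start - 1) trick.
    start, end = bounds
    best = None
    with open(FILENAME, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()  # finish the line that straddles the boundary
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            for tok in line.split()[1:]:
                v = int(tok)
                if best is None or v > best:
                    best = v
    return best

def parallel_max(nprocs=4):
    size = os.path.getsize(FILENAME)
    cuts = [(size * i // nprocs, size * (i + 1) // nprocs) for i in range(nprocs)]
    with Pool(nprocs) as pool:
        found = [m for m in pool.map(chunk_max, cuts) if m is not None]
    return max(found)

if __name__ == "__main__":
    print(parallel_max())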
















3 Answers






It's a trivial task for awk.



awk '(max == "") { max = $2 } { for (i = 2; i <= NF; ++i) if (max < $i) max = $i } END { print max }' file


If it's guaranteed that your file is not all zeroes or negative numbers, you can drop the (max == "") { max = $2 } part.

answered Jan 1 at 10:02 by oguzismail (edited Jan 1 at 19:22)
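
The same initialization concern applies to the Python loop in the question, whose prev_max = -1 seed silently gives the wrong answer on all-negative data. A minimal sketch that seeds the running maximum from the data itself instead:

def file_max(filename):
    best = None  # seed from the data, not from a sentinel like -1
    with open(filename) as f:
        for line in f:
            for tok in line.split()[1:]:
                v = int(tok)
                if best is None or v > best:
                    best = v
    return best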






Try this Perl solution:

$ cat sample1.txt
string 1 2 4 10 7
string 1 2 44 10 7
string 3 2 4 10 70
string 9 2 44 10 7
$ perl -lane ' $m=(sort {$b<=>$a} @F[1..$#F])[0]; $max=$m>$max?$m:$max ; END { print $max } ' sample1.txt
70
$

answered Jan 1 at 10:37 by stack0114106





  • Can use max from core List::Util instead of sort, for efficiency: perl -MList::Util=max -lane'$m = max @F; .... – zdim, Jan 1 at 11:10

  • @zdim..you are right..:-) my office RHEL Perl is throwing error for installing CPAN modules.. so I'll have to live with core modules :-( – stack0114106, Jan 1 at 11:14

  • Oh, sorry. Can you upgrade? The v5.10.1 is fine but really old at this point. Or, run with perlbrew? – zdim, Jan 1 at 11:17

  • yeah..it is old.. If I'm admin, I can do that.. that will take a long time.. btw if you have time can you try questions/53706983 using Perl.. – stack0114106, Jan 1 at 11:22



















I wanted to write an awk script without for-looping over the columns, to compare execution times against a for-looped solution such as @oguzismail's trivial one. I created a million records of 1-100 columns of data, with values between 0 and 2^32. I played around with RS to only compare columns 2-100, but as that required regex it slowed down the execution. Much. Using tr to swap spaces and newlines I got pretty close:

$ cat <(echo 0) file | tr ' \n' '\n ' | awk 'max<$1{max=$1}END{print max}'

Output of cat <(echo 0) file | tr ' \n' '\n ':

0 string1
1250117816
3632742839
172403688 string2
2746184479
...

The trivial solution used:

real    0m24.239s
user    0m23.992s
sys     0m0.236s

whereas my tr + awk spent:

real    0m28.798s
user    0m29.908s
sys     0m2.256s

(Surprisingly, if I first preprocessed the data with tr to a file and then read it with awk, it wasn't faster; most of the time it was actually slower.)

So, then I decided to test my rusty C skills to set some kind of baseline (the man pages are pretty good. And Google.):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *fp;
    char *line = NULL;
    char *word = NULL;
    size_t len = 0;
    ssize_t read;
    long max = 0;
    long tmp = 0;

    fp = fopen("file", "r");
    if (fp == NULL)
        exit(EXIT_FAILURE);
    while ((read = getline(&line, &len, fp)) != -1) {
        if ((word = strtok(line, " ")) != NULL) {  /* skip the leading string */
            while (word != NULL) {
                if ((word = strtok(NULL, " ")) != NULL) {
                    tmp = strtol(word, NULL, 10);
                    if (max < tmp)
                        max = tmp;
                }
            }
        }
    }
    fclose(fp);
    printf("%ld\n", max);
    exit(EXIT_SUCCESS);
}

Result of that:

$ time ./a.out 
4294967292

real    0m9.307s
user    0m9.144s
sys     0m0.164s

Oh, using mawk instead of gawk almost halved the results.

answered Jan 1 at 15:25 by James Brown (edited Jan 1 at 15:31)






  • not an expert on C but I would mess around with mmap. see: paste.ubuntu.com/p/8Q2SpjGTX5 – oguzismail, Jan 1 at 22:04
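
The same idea sketched in Python, which has mmap in the standard library (a sketch only; it assumes a 64-bit interpreter so a ~150 GB file can be mapped, and "file" is a placeholder path):

import mmap

def mmap_max(path="file"):
    best = None
    with open(path, "rb") as f:
        # Map the whole file read-only; pages are faulted in on demand,
        # so this does not load 150 GB into memory at once.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                for tok in line.split()[1:]:
                    v = int(tok)
                    if best is None or v > best:
                        best = v
    return best

print(mmap_max())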












