How can I find the largest number in a very large text file (~150 GB)?
I have a text file that has around 100000000 lines, each of the following type:



string num1 num2 num3 ... num500
string num1 num2 num3 ... num40


I want to find the largest number present in this file.



My current code reads each line, splits it by space, and stores the largest number in the current line. Then, I compare it with the largest number of the next line, and retain the larger of the two.



with open(filename, 'r') as f:
    prev_max = -1
    for line in f:
        nums = [int(n) for n in line.split(' ')[1:]]
        line_max = max(nums)
        if line_max > prev_max:
            prev_max = line_max


But this takes forever. Is there a better way to do this?
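
For reference, the same scan can be written as one generator expression over the whole file (a sketch; it assumes every field after the first parses as an int and that the file contains at least one number), though it is unlikely to change the picture much:

with open(filename) as f:
    largest = max(int(tok) for line in f for tok in line.split()[1:])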



I am open to solutions with awk or other shell commands as well.



Edit: Added how I am reading the file.

python-3.x text awk large-data

asked Jan 1 at 5:20 by user110327 (edited Jan 1 at 6:15)

  • How are you getting all_lines? – Alexander Reynolds, Jan 1 at 5:21

  • What do you mean by reading normally? Please post a minimal example showing what you actually do with the file. – Mad Physicist, Jan 1 at 6:08

  • You didn't answer my question. What is all_lines specifically? Please post all of your code. – Alexander Reynolds, Jan 1 at 6:13

  • blog.pythonlibrary.org/2014/01/27/… – Windchill, Jan 1 at 6:13

  • Basically -- take the size of your file in bytes, halve it, seek to that point, find the location of the next newline, make that your split point, so one thread finds the max of everything before it and one thread finds the max of everything after. Repeat until you've got the workload split into an adequate number of subdivisions. (A sketch of this approach follows below.) – Charles Duffy, Jan 1 at 15:36
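
A minimal sketch of the splitting approach from the last comment, using processes rather than threads so the int-parsing work actually runs in parallel (the file name and worker count are placeholders; it assumes the line format described in the question):

import os
from multiprocessing import Pool

FILENAME = "file"  # placeholder path

def chunk_max(bounds):
    # Scan only the lines that *start* inside [start, end); a line that
    # straddles a boundary is handled exactly once via the seek(start - 1) trick.
    start, end = bounds
    best = None
    with open(FILENAME, "rb") as f:
        if start > 0:
            f.seek(start - 1)
            f.readline()  # finish the line that straddles the boundary
        while f.tell() < end:
            line = f.readline()
            if not line:
                break
            for tok in line.split()[1:]:
                v = int(tok)
                if best is None or v > best:
                    best = v
    return best

def parallel_max(nprocs=4):
    size = os.path.getsize(FILENAME)
    cuts = [(size * i // nprocs, size * (i + 1) // nprocs) for i in range(nprocs)]
    with Pool(nprocs) as pool:
        found = [m for m in pool.map(chunk_max, cuts) if m is not None]
    return max(found)

if __name__ == "__main__":
    print(parallel_max())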
















3 Answers






It's a trivial task for awk.



awk '(max == "") { max = $2 } { for (i = 2; i <= NF; ++i) if (max < $i) max = $i } END { print max }' file


If it's guaranteed that your file is not all zeroes or negative numbers, you can drop the (max == "") { max = $2 } part.

answered Jan 1 at 10:02 by oguzismail (edited Jan 1 at 19:22)
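
The same initialization concern applies to the Python loop in the question, whose prev_max = -1 seed silently gives the wrong answer on all-negative data. A minimal sketch that seeds the running maximum from the data itself instead:

def file_max(filename):
    best = None  # seed from the data, not from a sentinel like -1
    with open(filename) as f:
        for line in f:
            for tok in line.split()[1:]:
                v = int(tok)
                if best is None or v > best:
                    best = v
    return best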






Try this Perl solution:

$ cat sample1.txt
string 1 2 4 10 7
string 1 2 44 10 7
string 3 2 4 10 70
string 9 2 44 10 7
$ perl -lane ' $m=(sort {$b<=>$a} @F[1..$#F])[0]; $max=$m>$max?$m:$max ; END { print $max } ' sample1.txt
70
$

answered Jan 1 at 10:37 by stack0114106





  • Can use max from core List::Util instead of sort, for efficiency: perl -MList::Util=max -lane'$m = max @F; .... – zdim, Jan 1 at 11:10

  • @zdim..you are right..:-) my office RHEL Perl is throwing error for installing CPAN modules.. so I'll have to live with core modules :-( – stack0114106, Jan 1 at 11:14

  • Oh, sorry. Can you upgrade? The v5.10.1 is fine but really old at this point. Or, run with perlbrew? – zdim, Jan 1 at 11:17

  • yeah..it is old.. If I'm admin, I can do that.. that will take a long time.. btw if you have time can you try questions/53706983 using Perl.. – stack0114106, Jan 1 at 11:22



















I wanted to write an awk script without for-looping over the columns, to compare execution times against a for-looped solution such as @oguzismail's trivial one. I created a million records of 1-100 columns of data, with values between 0 and 2^32. I played around with RS to only compare columns 2-100, but as that required regex it slowed down the execution. Much. Using tr to swap spaces and newlines I got pretty close:

$ cat <(echo 0) file | tr ' \n' '\n ' | awk 'max<$1{max=$1}END{print max}'

Output of cat <(echo 0) file | tr ' \n' '\n ':

0 string1
1250117816
3632742839
172403688 string2
2746184479
...

The trivial solution used:

real    0m24.239s
user    0m23.992s
sys     0m0.236s

whereas my tr + awk spent:

real    0m28.798s
user    0m29.908s
sys     0m2.256s

(Surprisingly, if I first preprocessed the data with tr to a file and then read it with awk, it wasn't faster; most of the time it was actually slower.)

So, then I decided to test my rusty C skills to set some kind of baseline (the man pages are pretty good. And Google.):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *fp;
    char *line = NULL;
    char *word = NULL;
    size_t len = 0;
    ssize_t read;
    long max = 0;
    long tmp = 0;

    fp = fopen("file", "r");
    if (fp == NULL)
        exit(EXIT_FAILURE);
    while ((read = getline(&line, &len, fp)) != -1) {
        if ((word = strtok(line, " ")) != NULL) {  /* skip the leading string */
            while (word != NULL) {
                if ((word = strtok(NULL, " ")) != NULL) {
                    tmp = strtol(word, NULL, 10);
                    if (max < tmp)
                        max = tmp;
                }
            }
        }
    }
    fclose(fp);
    printf("%ld\n", max);
    exit(EXIT_SUCCESS);
}

Result of that:

$ time ./a.out 
4294967292

real    0m9.307s
user    0m9.144s
sys     0m0.164s

Oh, using mawk instead of gawk almost halved the results.

answered Jan 1 at 15:25 by James Brown (edited Jan 1 at 15:31)






  • not an expert on C but I would mess around with mmap. see: paste.ubuntu.com/p/8Q2SpjGTX5 – oguzismail, Jan 1 at 22:04
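
The same idea sketched in Python, which has mmap in the standard library (a sketch only; it assumes a 64-bit interpreter so a ~150 GB file can be mapped, and "file" is a placeholder path):

import mmap

def mmap_max(path="file"):
    best = None
    with open(path, "rb") as f:
        # Map the whole file read-only; pages are faulted in on demand,
        # so this does not load 150 GB into memory at once.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for line in iter(mm.readline, b""):
                for tok in line.split()[1:]:
                    v = int(tok)
                    if best is None or v > best:
                        best = v
    return best

print(mmap_max())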












