How can I find the largest number in a very large text file (~150 GB)?
I have a text file that has around 100000000 lines, each of the following type:
string num1 num2 num3 ... num500
string num1 num2 num3 ... num40
I want to find the largest number present in this file.
My current code reads each line, splits it by space, and stores the largest number in the current line. Then, I compare it with the largest number of the next line, and retain the larger of the two.
with open(filename, 'r') as f:
    prev_max = -1
    for line in f:
        nums = [int(n) for n in line.split(' ')[1:]]  # skip the leading string field
        line_max = max(nums)  # avoid shadowing the built-in max()
        if line_max > prev_max:
            prev_max = line_max
But this takes forever. Is there a better way to do this?
I am open to solutions with awk or other shell commands as well.
Edit: Added how I am reading the file.
python-3.x text awk large-data
asked Jan 1 at 5:20 by user110327, edited Jan 1 at 6:15
How are you getting all_lines?
– Alexander Reynolds
Jan 1 at 5:21
What do you mean by reading normally? Please post a minimal example showing what you actually do with the file.
– Mad Physicist
Jan 1 at 6:08
You didn't answer my question. What is all_lines specifically? Please post all of your code.
– Alexander Reynolds
Jan 1 at 6:13
blog.pythonlibrary.org/2014/01/27/…
– Windchill
Jan 1 at 6:13
Basically -- take the size of your file in bytes, halve it, seek to that point, find the location of the next newline, make that your split point, so one thread finds the max of everything before it and one thread finds the max of everything after. Repeat until you've got the workload split into an adequate number of subdivisions.
– Charles Duffy
Jan 1 at 15:36
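For reference, here is a minimal Python sketch of the byte-range splitting Charles Duffy describes above (the filename, worker count, and helper names are illustrative, not taken from the question):
import os
from multiprocessing import Pool

FILENAME = "file"  # hypothetical path -- point this at your data

def chunk_max(bounds):
    # Max over every line that *starts* inside [start, end).
    start, end = bounds
    best = -1
    with open(FILENAME, "rb") as f:
        if start > 0:
            # Consume the line containing byte start-1; the previous
            # chunk owns any line that begins before `start`.
            f.seek(start - 1)
            f.readline()
        pos = f.tell()
        while pos < end:
            line = f.readline()
            if not line:
                break
            nums = line.split()[1:]  # drop the leading string field
            if nums:
                best = max(best, max(int(n) for n in nums))
            pos = f.tell()
    return best

def parallel_max(workers=4):
    size = os.path.getsize(FILENAME)
    cuts = [size * i // workers for i in range(workers + 1)]
    with Pool(workers) as pool:
        return max(pool.map(chunk_max, list(zip(cuts, cuts[1:]))))

if __name__ == "__main__":
    print(parallel_max())
Each worker only handles lines that start inside its own byte range, so a line straddling a chunk boundary is processed exactly once.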
3 Answers
It's a trivial task for awk.
awk '(max == "") { max = $2 } { for (i = 2; i <= NF; ++i) if (max < $i) max = $i } END { print max }' file
If it's guaranteed that your file is not all zeroes or negative numbers, you can drop the (max == "") { max = $2 } part.
– oguzismail, answered Jan 1 at 10:02, edited Jan 1 at 19:22
Try this Perl solution:
$ cat sample1.txt
string 1 2 4 10 7
string 1 2 44 10 7
string 3 2 4 10 70
string 9 2 44 10 7
$ perl -lane ' $m=(sort {$b<=>$a} @F[1..$#F])[0]; $max=$m>$max?$m:$max ; END { print $max } ' sample1.txt
70
$
– stack0114106, answered Jan 1 at 10:37
Can use max from core List::Util instead of sort, for efficiency: perl -MList::Util=max -lane '$m = max @F; ....
– zdim
Jan 1 at 11:10
@zdim.. you are right.. :-) my office RHEL Perl throws errors when installing CPAN modules.. so I'll have to live with core modules :-(
– stack0114106
Jan 1 at 11:14
Oh, sorry. Can you upgrade? The v5.10.1 is fine but really old at this point. Or, run with perlbrew?
– zdim
Jan 1 at 11:17
yeah.. it is old.. if I were admin I could do that, but it would take a long time.. btw if you have time can you try questions/53706983 using Perl..
– stack0114106
Jan 1 at 11:22
I wanted to write an awk script without for-looping over the columns, to compare execution times against a for-looped solution such as @oguzismail's trivial one. I created a million records of 1-100 columns of data, with values between 0 and 2^32. I played around with RS to compare only columns 2-100, but as that required regex it slowed down the execution. Much. Using tr to swap spaces and newlines I got pretty close:
$ cat <(echo 0) file | tr ' \n' '\n ' | awk 'max<$1{max=$1}END{print max}'
Output of cat <(echo 0) file | tr ' \n' '\n ':
0 string1
1250117816
3632742839
172403688 string2
2746184479
...
The trivial solution used:
real 0m24.239s
user 0m23.992s
sys 0m0.236s
whereas my tr + awk spent:
real 0m28.798s
user 0m29.908s
sys 0m2.256s
(Surprisingly, preprocessing the data with tr into a file first and then reading it with awk was not faster; most of the time it was actually slower.)
So, then I decided to test my rusty C skills to set some kind of baseline (the man pages are pretty good. And Google.):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *fp;
    char *line = NULL;
    char *word = NULL;
    size_t len = 0;
    ssize_t read;
    long max = 0;
    long tmp = 0;

    fp = fopen("file", "r");
    if (fp == NULL)
        exit(EXIT_FAILURE);

    while ((read = getline(&line, &len, fp)) != -1) {
        /* the first strtok call skips the leading string field */
        if ((word = strtok(line, " ")) != NULL) {
            while (word != NULL) {
                /* subsequent calls walk the numeric columns */
                if ((word = strtok(NULL, " ")) != NULL) {
                    tmp = strtol(word, NULL, 10);
                    if (max < tmp) {
                        max = tmp;
                    }
                }
            }
        }
    }
    fclose(fp);
    printf("%ld\n", max);
    exit(EXIT_SUCCESS);
}
Result of that:
$ time ./a.out
4294967292
real 0m9.307s
user 0m9.144s
sys 0m0.164s
Oh, and using mawk instead of gawk almost halved the times.
– James Brown, answered Jan 1 at 15:25 (edited Jan 1 at 15:31)
not an expert on C but I would mess around with mmap. see: paste.ubuntu.com/p/8Q2SpjGTX5
– oguzismail
Jan 1 at 22:04
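The paste link above is kept as-is. As a rough Python illustration of the same mmap idea (a hedged sketch under stated assumptions, not the code from the paste): map the file once and scan it as a single buffer. This assumes non-negative integers and that the leading string field never contains a space followed by digits:
import mmap
import re

# Needs a 64-bit Python build to map a ~150 GB file.
with open("file", "rb") as f:  # "file" is a placeholder path
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Every number in the data is preceded by a space, while the
        # leading string field sits at the start of its line, so the
        # pattern rb" (\d+)" picks out only the numeric columns.
        best = max(int(m.group(1)) for m in re.finditer(rb" (\d+)", mm))
print(best)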