When I investigated the run times of grep -f with large pattern files (hundreds of lines and more), it became apparent that one can speed up grep considerably by splitting the pattern file into smaller chunks.
I tested this with GNU grep 2.5.3 on a machine with an Intel Xeon 2400 MHz CPU running Debian Lenny. grep -f with a pattern file of 1900 lines takes at least 72 seconds (with no match found).
$ TIMEFORMAT="%R seconds"
$ time grep -f pattern-file testmail_for_grep.txt
73.852 seconds
To speed things up, one can, for instance, use split to break pattern-file into smaller chunks and run grep with each of them. Here is an example which uses chunks of 50 lines:
split -l 50 pattern-file pattern-file.split.
for CHUNK in pattern-file.split.* ; do
    grep -f "$CHUNK" testmail_for_grep.txt
done
rm pattern-file.split.*
Using the same pattern file as above (1900 lines), grep -f over all the chunks produced by split -l 50 takes only about 1.1 seconds!
For the fun of it I tried to find the optimal chunk size. I assume that it depends on a multitude of factors (such as the speed of your file system), but in this particular case I got the best results with chunks of about 20 lines; the run time for chunks of 20 lines was 0.70 seconds. I did the same tests with a similar pattern file of 900 lines, and there the optimal chunk size was about 20 lines, too.
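If you want to probe the optimal chunk size on your own data, a small benchmark loop along these lines (just a sketch, reusing the pattern-file and testmail_for_grep.txt from above) will do:
TIMEFORMAT="%R seconds"
for SIZE in 10 20 50 100 200 ; do
    split -l "$SIZE" pattern-file pattern-file.split.
    echo "chunk size $SIZE:"
    # time the complete run over all chunks; the matches themselves are discarded
    time for CHUNK in pattern-file.split.* ; do
        grep -f "$CHUNK" testmail_for_grep.txt
    done > /dev/null
    rm pattern-file.split.*
done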
In the related bug report bug #16305: grep much less efficient ..., Levi Waldron suggested using
for line in `cat patterns.txt`;do grep $line data.txt >> matches.txt;done
This won't work in general (it breaks, e.g., when patterns contain spaces). However, a while loop using bash's read might do better (testmail_for_grep.txt being data.txt):
while read line ; do grep "$line" testmail_for_grep.txt ; done < pattern-file
The run time of this method was a bit more than 4 seconds for 1900 patterns. This is much slower than using split, but in cases where one does not want to write temporary files it might come in handy. Note also that this method always scales linearly (in contrast to grep -f), which at least makes it easier to estimate the run times of arbitrarily large pattern files.
BTW, grep does not check whether there are duplicate lines in your pattern file. If there are, you might want to run it through sort -u or uniq first. Also, it does not check whether your input file is empty. Running my pattern file of 1900 lines over an empty file still took nearly 10 seconds ;-)
$ rm testmail_for_grep.txt ; touch testmail_for_grep.txt
$ time grep -f pattern-file testmail_for_grep.txt
9.879 seconds
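As for the duplicate patterns mentioned above, one way to weed them out on the fly is bash's process substitution, for example:
grep -f <(sort -u pattern-file) testmail_for_grep.txt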
Discussion
I am using this suggestion and it works great. But it is not usable for the -v option. Any ideas?
Thanks for the great question. I haven't tried it myself with large pattern files, but you might want to try comm.
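If all your patterns are complete lines, a sketch of the comm route (it needs sorted input and does plain string comparison, no regular expressions) could be:
# lines of data.txt that do not occur verbatim in patterns.txt
comm -23 <(sort data.txt) <(sort patterns.txt)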
Useful info, but I wanted to add my own observations; I found this page when having grep performance trouble myself. I've found the line-by-line version using cat a lot quicker. This is on a production box where I was looking to pull circa 13,000 accounts in my grep list file out of some batch files containing 1.2 million records. Although the box has 2x 12-core Xeons, it makes no difference since grep is single-threaded ;-|
Tried the split method with 20 rows per split file, and after an hour it had produced about 90 accounts in the output file. I'd have been quicker finding, cutting and pasting from vi. Reading a line at a time with cat and a for loop got me all 12,918 accounts in 25 minutes. Note this is Red Hat Enterprise Linux Server release 5.8 (Tikanga).
Script I knocked up to make it reusable:
cat greplots.sh
Note the -a is there as the files I'm looking for have lots of control characters, making grep think they are binary files. The -F says these are fixed strings, not expressions, which speeds things up a bit too. Given a bit longer I'd have split the file being searched into 12 streams and fired off multiple processes in subshells, with a wait at the end, and then concatenated the results back together.
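A rough sketch of that kind of loop (not the actual greplots.sh, and with made-up argument names) might look like this:
#!/bin/bash
# Sketch: grep each pattern from a list file out of one or more data files,
# one pattern at a time.  Usage: greplots.sh list-file datafile...
LIST="$1"; shift
for PATTERN in $(cat "$LIST") ; do
    # -a: treat binary-looking files as text; -F: fixed strings, not regexps
    grep -a -F "$PATTERN" "$@"
done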
@all: Please keep in mind that locales can have a huge impact on the performance of text operations. If possible, avoid UTF-8 locales and set e.g. LC_ALL=C before grepping.
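For a single run that can be as simple as:
LC_ALL=C grep -f pattern-file testmail_for_grep.txt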
@Simon: Thanks for your data and feedback.
It's a bit of a special case, but if you're lucky and know where your values are in the lines of the (log)file, e.g. starttag: value, whatever, you can turn the problem around:
cat $logfile | while read line
do
    value=${line#*starttag}
    value=${value%%, whatever*}
    [ -n "$value" ] && fgrep -q $value $patternfile && echo "$line"
done
It's an order of magnitude faster on very large files (in my case, $patternfile was 1200 lines, $logfile 15000).
I don't have access to your test data, but I suspect an awk approach along the lines sketched below would be faster still. It reads the first input file to populate a hash with keys equal to each line, then scans each remaining input line against each of those lines (as a regex) and prints matches. This eliminates duplicate matches as well as eliminating the need to check for them.
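A minimal sketch of that idea (with the file names from above; not necessarily the exact program meant here):
awk 'NR == FNR { patterns[$0]; next }      # first file: collect the patterns
     { for (p in patterns)                 # every other line: try each pattern as a regex
           if ($0 ~ p) { print; break } }  # print a matching line only once
    ' pattern-file testmail_for_grep.txt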
If you're comparing whole lines as strings (mimicking grep -Fx), this becomes significantly faster (see the sketch below). You can also use that trick to match a particular field by changing both $0s to the field number. Use awk -F DELIMITER to specify a different field separator (a regex). Note: awk uses extended POSIX regular expressions, with a slightly different definition to GNU grep -E. One notable difference is the lack of support for \b.
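A sketch of that whole-line variant (again with the file names from above):
awk 'NR == FNR { patterns[$0]; next }  # first file: store each pattern line as a key
     $0 in patterns                    # print lines that occur verbatim among the patterns
    ' pattern-file testmail_for_grep.txt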
@Adam: Thanks a lot for your comment. I happened to find some old data that I used in 2008. So, I did some quick tests.
Unfortunately, I cannot share the data for privacy reasons. But it should be easy to reproduce. The pattern file contains only lines like (yeah, I know it's stupid)
The data file is an RFC822 message of 114 lines including 1 "From:" line but no line that matches any pattern.
For a quick test I created two smaller pattern files, one with only 5302 lines (50%) and the other with 7953 lines (75%); the full pattern file has 10604 lines.
Results
Averages of a few runs are given in real seconds, tested on an Intel Core i7-870 @ 2.93 GHz using GNU grep 2.20 and GNU Awk 4.1.1.
So, Awk is not faster here, but its times might (and should) grow linearly, whereas they do not with grep. This suggests that Awk will indeed be faster as the number of patterns increases.