
Grep with large pattern files

When I investigated the run times of grep -f with large pattern files (hundreds of lines and more), it became apparent that grep can be sped up considerably by splitting the pattern file into smaller chunks.

I tested this with GNU grep 2.5.3 on a machine with an Intel Xeon 2400 MHz CPU running Debian Lenny. grep -f with a pattern file of 1900 lines1) takes at least 72 seconds (with no match found).

$ TIMEFORMAT="%R seconds"
$ time grep -f pattern-file testmail_for_grep.txt
73.852 seconds

To speed things up one can, for instance, use split to break pattern-file into smaller chunks and run grep with each of them. Here is an example using chunks of 50 lines:

split -l 50 pattern-file pattern-file.split.
for CHUNK in pattern-file.split.* ; do
        grep -f "$CHUNK" testmail_for_grep.txt
done
rm pattern-file.split.*

Using the same pattern file as above (1900 lines), running grep -f with all the chunks produced by split -l 50 takes only about 1.1 seconds!

Optimal chunks

For the fun of it I tried to find the optimal chunk size. I assume that it depends on a multitude of factors (such as the speed of your file system), but in this particular case I got the best results with chunks of about 20 lines; the run time was then 0.70 seconds. I did the same tests with a similar pattern file of 900 lines. There, too, the optimal chunk size was about 20 lines.
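
If you want to find a good chunk size for your own data, a small benchmark loop along these lines might help (just a sketch, reusing the file names from above; the timings come from bash's time keyword):

TIMEFORMAT="%R seconds"
for SIZE in 10 20 50 100 200 ; do
        split -l "$SIZE" pattern-file pattern-file.split.
        echo "chunk size $SIZE:"
        # time the whole inner loop over all chunks of this size
        time for CHUNK in pattern-file.split.* ; do
                grep -f "$CHUNK" testmail_for_grep.txt > /dev/null
        done
        rm pattern-file.split.*
done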

Pattern by pattern

In the related bug report bug #16305: grep much less efficient ..., Levi Waldron suggested using

for line in `cat patterns.txt`;do grep $line data.txt >> matches.txt;done

This won't work in general (it breaks, for example, when patterns contain spaces). However, a while loop using bash's read might do better (with testmail_for_grep.txt playing the role of data.txt).

while read line ; do grep "$line" testmail_for_grep.txt ; done < pattern-file

The run time of this method was a bit more than 4 seconds for the 1900 patterns. This is much slower than using split, but in cases where one does not want to write temporary files it might come in handy. Note also that this method always scales linearly (in contrast to grep -f), which at least makes it easier to estimate run times for arbitrarily large pattern files.
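
If your patterns may contain backslashes or leading/trailing whitespace, a slightly more defensive variant of the loop would be something like this (still one grep per pattern, so expect similar run times):

# IFS= keeps leading/trailing whitespace, read -r keeps backslashes literal,
# and grep -e protects patterns that start with a dash
while IFS= read -r line ; do
        grep -e "$line" testmail_for_grep.txt
done < pattern-file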

Duplicates and nothing

BTW, grep does not check whether there are duplicate lines in your pattern file. If there are, you might want to run it through sort -u or uniq first. Also, it does not check whether your input file is empty: running my pattern file of 1900 lines over an empty file still took nearly 10 seconds ;-)

$ rm testmail_for_grep.txt ; touch testmail_for_grep.txt
$ time grep -f pattern-file testmail_for_grep.txt
9.879 seconds
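
Both checks are easy to do yourself, for example along these lines (same file names as above; pattern-file.uniq is just a made-up name):

# remove duplicate patterns (sorting them does not bother grep)
sort -u pattern-file > pattern-file.uniq

# run grep only if the input file exists and is not empty (-s)
[ -s testmail_for_grep.txt ] && grep -f pattern-file.uniq testmail_for_grep.txt
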
1) The same as before.

Discussion

Milan, 2011-05-05 13:05

I am using this suggestion and it works great. But it is not usable for the -v option. Any ideas?

Andreas Schamanek, 2011-05-06 20:28, 2011-05-06 20:30

Thanks for the great question :-) I haven't tried it myself with large pattern files but you might want to try comm.

$ cat data.txt
111
AAA
222
BBB
333
CCC
$ cat patterns.txt
A
33
$ # we simulate the approach of splitting up patterns.txt here:
$ ( grep 'A' data.txt ; grep '33' data.txt ) >inverse.txt
$ cat inverse.txt
AAA
333
$ comm -23 <(sort data.txt) <(sort inverse.txt)
111
222
BBB
CCC
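
To scale that up one could combine the chunked grep from above with comm, roughly like this (a sketch only; note that the output comes out sorted rather than in the original order, and duplicate data lines are not treated exactly as grep -v would treat them):

# collect everything that matches at least one pattern chunk ...
split -l 50 patterns.txt patterns.split.
for CHUNK in patterns.split.* ; do
        grep -f "$CHUNK" data.txt
done | sort -u > inverse.txt
rm patterns.split.*

# ... then keep only the lines of data.txt that matched nothing
comm -23 <(sort data.txt) <(sort inverse.txt)
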
Simon Heath, 2014-03-07 13:13, 2014-03-10 22:13

Useful info, but I wanted to add my own observations; I found this page when having grep performance trouble myself. I've found the line-by-line version (reading with cat in a for loop) a lot quicker. This is on a production box where I was looking to pull the circa 13,000 accounts in my grep list file out of some batch files containing 1.2 million records. Although the box has two 12-core Xeons, that makes no difference since grep is single-threaded ;-|

I tried the split method with 20 rows per split file, and after an hour it had produced only about 90 accounts in the output file :-( I'd have been quicker finding, cutting and pasting from vi :-)

Reading a single line at a time with cat and a for loop got me all 12,918 accounts in 25 minutes. Note this is Red Hat Enterprise Linux Server release 5.8 (Tikanga).

Script I knocked up to make it reusable:

cat greplots.sh

#Usage : greplots.sh <greplistfile> <filetogrep>
 
#Reads 1 line from $1 search string file and greps the file.
 
#Used as grep gets exponentially slower with large input files.
 
for searchstring in `cat $1`
do
  grep -Fa "$searchstring" $2
done

Note the -a is there because the files I'm searching have lots of control characters, which makes grep think they are binary files. The -F says these are fixed strings, not expressions, which speeds things up a bit too. Given a bit longer I'd have split the file being searched into 12 streams, fired off multiple grep processes, each in its own subshell, with a wait at the end, and then concatenated the results back together.
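
That last idea might look roughly like this (an untested sketch with made-up file names matching the <greplistfile> <filetogrep> placeholders; it splits the data file, not the pattern list, and merges the per-piece output at the end):

# split the big file into 12 pieces of roughly equal line count
LINES=$(( ( $(wc -l < filetogrep.txt) + 11 ) / 12 ))
split -l "$LINES" filetogrep.txt filetogrep.part.

# run one grep per piece in the background, then wait for all of them
for PART in filetogrep.part.?? ; do
        grep -Faf greplistfile.txt "$PART" > "$PART.out" &
done
wait

# concatenate the per-piece results and clean up
cat filetogrep.part.??.out > matches.txt
rm filetogrep.part.?? filetogrep.part.??.out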

Andreas Schamanek, 2014-03-10 22:19

@all: Please keep in mind that locales can have a huge impact on the performance of text operations. If possible, avoid UTF-8 locales and set e.g. LC_ALL=C before grepping.
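
For example, to set it just for one grep invocation (file names as in the article above):

LC_ALL=C grep -f pattern-file testmail_for_grep.txt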

@Simon: Thanks for your data and feedback.

Gabor Laszlo, 2014-10-17 18:40

It's a bit of a special case, but if you're lucky and know where your values are in the lines of the (log)file, e.g. starttag: value, whatever, you can turn the problem around:

cat "$logfile" | while read line; do value=${line#*starttag}; value=${value%%, whatever*}; [ -n "$value" ] && fgrep -q "$value" "$patternfile" && echo "$line"; done
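
Spelled out over several lines, the same idea might look like this (starttag and ", whatever" stand for whatever surrounds the value in your log lines):

# cut the value out of each log line, then look up just that value
# in the pattern file with a quiet fixed-string grep
while read -r line ; do
        value=${line#*starttag}           # strip everything up to and including "starttag"
        value=${value%%, whatever*}       # strip the rest of the line after the value
        [ -n "$value" ] && fgrep -q "$value" "$patternfile" && echo "$line"
done < "$logfile"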

It's an order of magnitude faster on very large files (in my case, $patternfile was 1200 lines, $logfile 15000).

Adam Katz, 2016-07-14 22:54

I don't have access to your test data, but I suspect this would be faster still:

awk 'NR==FNR {query[$0]=1; next} {for (q in query) if ($0~q) {print; break}}' PATTERN_FILE FILE...

This reads the first input file to populate a hash with keys equal to each of its lines, then scans each remaining input file against each key (as a regex) and prints matches. This eliminates duplicate matches as well as the need to check for them.

If you're comparing whole lines as strings (mimicking grep -Fx), this becomes significantly faster:

awk 'NR==FNR {query[$0]=1; next} query[$0]' PATTERN_FILE FILE...

You can also use that trick to match a particular field by changing both $0s to the field number.

Use awk -F DELIMITER to specify a different field separator (regex).
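
For instance, one way to combine the two, keeping whole-line patterns but comparing them against field 2 of tab-separated data (just an illustration):

awk -F '\t' 'NR==FNR {query[$0]=1; next} query[$2]' PATTERN_FILE FILE...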

Note: awk uses POSIX extended regular expressions, which differ slightly from those of GNU grep -E. One notable difference is the lack of support for \b.

Andreas Schamanek, 2016-07-26 21:01, 2016-07-26 21:05

@Adam: Thanks a lot for your comment. I happened to find some old data that I used in 2008. So, I did some quick tests.

Unfortunately, I cannot share the data for privacy reasons, but it should be easy to reproduce. The pattern file contains only lines like the following (yeah, I know it's stupid):

^From: .*006infos@thrifty.com.lb
^From: .*007joao.serra@edp.pt
...

The data file is an RFC 822 message of 114 lines, including one "From:" line but no line that matches any pattern.

For a quick test I created two smaller pattern files, one with only 5302 lines (50%), the other with 7953 lines (75%). The full pattern file has 10604 lines.

Results

Averages of a few runs, given in real seconds, tested on an Intel Core i7-870 @ 2.93 GHz using GNU grep 2.20 and GNU Awk 4.1.1. The column headers are the number of patterns; the last column is the mean of the 5302 and 10604 runs, i.e. what the 7953 run should take if run times grew linearly.

Code      5302    7953   10604   (1st+3rd)/2
Ur awk     8.4    12.5    16.5      12.45
egrep      1.7     4.9     9.4       5.55

So, Awk is not faster here, but its run times grow (and should grow) roughly linearly, whereas grep's do not. This suggests that Awk will indeed be faster as the number of patterns increases.

 