Discussion:
bug#22357: grep -f not only huge memory usage, but also huge time cost
Norihiro Tanaka
2016-12-11 02:22:11 UTC
On Fri, 9 Dec 2016 01:24:19 -0600, Trevor Cordes wrote:
662b19f2d0edc0bf07f1a2c1421080252df4a37c
468d5217ed6ec1679512dec208c7f30fb8612957
(can't narrow it down because the latter doesn't compile for me)
The changes switch the algorithm used: they convert grep -w -F into grep -w.

Please try the following test case both before and after the changes.

$ yes $(printf %040d 0) | head -10000000 >k
$ time -p env LC_ALL=C grep -w -F 0 k

Without the changes it had not finished after 40s on my machine, but
with them it finished in 0.54s. The following case finished in 1.35s
regardless of whether the changes were applied:

$ time -p env LC_ALL=C grep -w 0 k

It may be an extreme example, but since a lot of people assume that
grep -F is faster than plain grep, I think the old slowness is a bug.
Post by Trevor Cordes
grep -w -f /usr/share/dict/words /tmp/greptest
(good older version: 2 seconds to complete, minimal memory)
(any version after the above commits: 10 or more minutes, never waited for
it to finish, 1.2GB RAM usage and 100% cpu)
I believe that cannot happen, as the changes affect grep -F only.
Did you mean grep -F -w? grep -F -w handles long pattern lists poorly
after the changes.

I think that the real issue behind bug#21763, bug#22239, and bug#22357
is that dfa uses a lot of memory for a long pattern, but we currently
have no idea how to improve that.

The following case performs poorly and uses a lot of memory regardless
of whether the changes are applied:

$ env LC_ALL=C grep -w -f /usr/share/dict/words /dev/null

However, the poor performance can be partly improved: the following
case returns in 2.4s on my machine after my patch.

$ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null

Thanks,
Norihiro
Trevor Cordes
2016-12-11 11:28:56 UTC
Post by Norihiro Tanaka
The changes switch the algorithm used: they convert grep -w -F into grep -w.
Hi, thanks for helping! Sorry, I forgot I was using
--fixed-strings (-F); yes, my example should have used -F.
Post by Norihiro Tanaka
Please try the following test case both before and after the changes.
$ yes $(printf %040d 0) | head -10000000 >k
$ time -p env LC_ALL=C grep -w -F 0 k
Without the changes it had not finished after 40s on my machine, but
with them it finished in 0.54s. The following case finished in
1.35s regardless of whether the changes were applied:
$ time -p env LC_ALL=C grep -w 0 k
OK, I see now where you are going with your examples. You are trying
to show why the change was made in the first place. That makes sense.
Post by Norihiro Tanaka
It may be an extreme example, but since a lot of people assume that
grep -F is faster than plain grep, I think the old slowness is a bug.
I've read in numerous places (O'Reilly books mostly) that grep or pcre
is often/sometimes faster than fgrep, so I think it is (somewhat) common
knowledge and I wouldn't worry too much about that perception. (Even
though what you say is intuitive.)
Post by Norihiro Tanaka
grep -w -f /usr/share/dict/words /tmp/greptest
(good older version: 2 seconds to complete, minimal memory)
(any version after the above commits: 10 or more minutes, never
waited for it to finish, 1.2GB RAM usage and 100% cpu)
I believe that cannot happen, as the changes affect grep -F
only. Did you mean grep -F -w? grep -F -w handles long
pattern lists poorly after the changes.
Yes, you are correct, I mistakenly left out the -F in my example.
Post by Norihiro Tanaka
I think that the real issue behind bug#21763, bug#22239, and bug#22357
is that dfa uses a lot of memory for a long pattern, but we currently
have no idea how to improve that.
Let me digress for a moment... I think it's important that you can reproduce my problem:
Post by Norihiro Tanaka
However, the poor performance can be partly improved: the following
case returns in 2.4s on my machine after my patch.
$ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null
On my box the above runs for >2m (never completes before I ^C) on the
version **AFTER** the commits (v2.22). On the test build just *BEFORE*
the commits (2.21.73-8058), it runs in <2s. So for me, a working
command (-F -w -f) that used to run quickly suddenly became
slow/useless. And none of the suggestions I've read alleviates the
problem for me: not locale settings, nor anything else. If you can
confirm this is the case on your box, then maybe we can figure out why?

The main question for my case, I think, is why "-F -w -f" is switching
modes to the slow DFA at all. AFAIK there's no encoding flaw in the
input, so why is grep even switching modes in this use case? What was
wrong with just using the old mode? From what I've read, people don't
want DFA-mode when using -F and -f.

Or, there should be an option to grep that we can specify to turn off
this new behavior to get back the old (<2s) performance.

Thanks!
Norihiro Tanaka
2016-12-11 15:26:33 UTC
On Sun, 11 Dec 2016 05:28:56 -0600
Post by Trevor Cordes
On my box the above runs for >2m (never completes before I ^C) on the
version **AFTER** the commits (v2.22). On the test build just *BEFORE*
the commits (2.21.73-8058), it runs in <2s. So for me, a working
command (-F -w -f) that used to run quickly suddenly became
slow/useless. And none of the suggestions I've read alleviates the
problem for me: not locale settings, nor anything else. If you can
confirm this is the case on your box, then maybe we can figure out why?
confirm this is the case on your box, then maybe we can figure out why?
Can you re-run with current master? dfa has been improved frequently since then.
Post by Trevor Cordes
The main question for my case, I think, is why "-F -w -f" is switching
modes to the slow DFA at all. AFAIK there's no encoding flaw in the
input, so why is grep even switching modes in this use case? What was
wrong with just using the old mode? From what I've read, people don't
want DFA-mode when using -F and -f.
Or, there should be an option to grep that we can specify to turn off
this new behavior to get back the old (<2s) performance.
Thanks!
The dfa matcher is not always slower than the kws matcher.

- $ env LC_ALL=C grep -F -w 0 k

- $ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null

The first is faster after the changes, and the second is slower. It's
a trade-off. Do you have any idea how to select the better matcher for
both cases?

If we adopt a simple rule, selecting the kws matcher for grep -F -w -f
and the dfa matcher for grep -F -w -e, we will get extremely different
performance for the following two cases, even though they are equivalent:

- $ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null

- $ env LC_ALL=C grep -F -w -e "`cat /usr/share/dict/words`" /dev/null

Thanks,
Norihiro
Bruno Haible
2016-12-11 17:55:32 UTC
Post by Trevor Cordes
I've read in numerous places (O'Reilly books mostly) that grep or pcre
is often/sometimes faster than fgrep, so I think it is (somewhat) common
knowledge and I wouldn't worry too much about that perception.
It is wrong to put the burden of the algorithm choice on the user.

Many computation tasks can be done through several different algorithms that
produce the same results but with different performance characteristics. The
main benefit of having one program that incorporates both/all algorithms is
that the user does NOT have to think about which algorithm is better.
For example:

- When you write a C program, you don't have to choose between
    y = 2 * x;
  and
    y = x + x;
  depending on the CPU. The compiler picks the right instruction for you.

- When you compute a sin(x) value, you don't have to say whether you
want the Taylor series around 0 or a polynomial approximation.
The implementation does it for you.

- You don't need to tell 'sort' whether it should sort in memory or
use temporary files.

- When you execute a database query SELECT X WHERE A = 10 AND B = 20
you don't have to tell the database engine whether to match the A
column first and then the B column or vice versa - the database engine
knows better which way to choose.

And so on. Computer science wouldn't be where it is without this paradigm.
Post by Norihiro Tanaka
The dfa matcher is not always slower than the kws matcher. ...
It's a trade-off. Do you have any idea how to select the better
matcher for both cases?
There are at least two approaches.

1) If you have vastly different run times (2 sec vs. 10 min or so), the
program can set up two threads and run each algorithm in one thread. Buffer
the output. When a thread terminates, use its output and kill the other thread.

Now that is still suboptimal, because on average this approach will lose a
factor of 2, just because it does not know which is the best algorithm.
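
To make approach 1 concrete, here is a minimal sketch, using child
processes rather than threads (killing a whole process mid-search is
simpler and safer); run_kwset_matcher and run_dfa_matcher are
hypothetical wrappers that buffer their output, not actual grep
functions:

  #include <signal.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Hypothetical wrappers, one per algorithm; each is assumed to
     buffer its matches instead of writing them to stdout.  */
  extern int run_kwset_matcher (void);
  extern int run_dfa_matcher (void);

  int
  race_matchers (void)
  {
    pid_t kws = fork ();
    if (kws == 0)
      _exit (run_kwset_matcher ());
    pid_t dfa = fork ();
    if (dfa == 0)
      _exit (run_dfa_matcher ());

    /* The first child to finish wins; kill and reap the loser.  */
    int status;
    pid_t winner = wait (&status);
    kill (winner == kws ? dfa : kws, SIGKILL);
    while (wait (NULL) > 0)
      continue;
    return WIFEXITED (status) ? WEXITSTATUS (status) : EXIT_FAILURE;
  }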

2) You can run a set of benchmarks, indexed by 2 parameters (m,n)
m = number of keywords,
n = total number of bytes of the files to be searched,
run each of the two algorithms, and note the best algorithm. This way
you fill a matrix of results (in the (m,n) plane). Now find a formula
that roughly tells you for given (m,n) whether you are in one area of
the plane (where kwset is better) or in the other area of the plane
(where dfa is better). Finally, code this formula into the 'grep' program.
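
The end product of such a benchmark campaign might look like the
sketch below; the constants and the shape of the formula are
placeholders standing in for whatever the measured (m,n) plane
actually shows:

  #include <stddef.h>

  enum matcher { MATCHER_KWSET, MATCHER_DFA };

  /* Hypothetical selector distilled from a benchmark matrix.  */
  static enum matcher
  choose_matcher (size_t m /* number of keywords */,
                  size_t n /* bytes to be searched */)
  {
    /* Example shape only: huge keyword sets favor kwset because dfa
       construction cost grows quickly with total pattern size; long
       inputs with few keywords favor the dfa's faster scanning.  */
    if (m > 10000)
      return MATCHER_KWSET;
    return n > 1000 * m ? MATCHER_DFA : MATCHER_KWSET;
  }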

Bruno
a***@skeeve.com
2016-12-11 18:07:12 UTC
Post by Bruno Haible
Finally, code this formula into the 'grep' program.
I'm sure that Paul and Jim would welcome patches.

Arnold
Paul Eggert
2016-12-12 03:12:19 UTC
Post by a***@skeeve.com
I'm sure that Paul and Jim would welcome patches.
I wrote a patch that prefers -F when the input looks like a troublesome case. It
merely uses heuristics though, nothing scientific like what Bruno Haible suggested.

Before installing anything like that, I'd first like to look into improving the
DFA performance, along the lines suggested by Norihiro Tanaka. Part of the
problem appears to be that position-set merging, even with his latest proposed
changes, is O(N**2) where N is the pattern size. Perhaps we should change the
representation of position sets to use simple arrays; although this would take
more space, merging would be O(N) and the code would surely be simpler.
Bruno Haible
2016-12-12 11:49:29 UTC
Hi Paul,
Post by Paul Eggert
Before installing anything like that, I'd first like to look into improving the
DFA performance, along the lines suggested by Norihiro Tanaka. Part of the
problem appears to be that position-set merging, even with his latest proposed
changes, is O(N**2) where N is the pattern size. Perhaps we should change the
representation of position sets to use simple arrays; although this would take
more space, merging would be O(N) and the code would surely be simpler.
I'm confused. Which code are you talking about? gnulib's dfa.c, function merge()
(line 1994)? This merge function operates in O(s1->nelem + s2->nelem), since in
each loop round i+j is being incremented by at least 1. Thus it is asymptotically
optimal. Or do you mean that merge() is being called too frequently? But then
changing the representation of position_set wouldn't help.
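
For readers without the source at hand, the loop has roughly the
following shape (a simplified sketch with illustrative types, not the
actual gnulib declarations): both sets are kept sorted by position
index, so one pass advances i+j by at least 1 per round.

  #include <stddef.h>

  typedef struct { size_t index; unsigned int constraint; } position;
  typedef struct { position *elems; size_t nelem; } position_set;

  /* Linear merge of two sorted position sets; m->elems is assumed
     large enough to hold s1->nelem + s2->nelem entries.  */
  static void
  merge_sets (position_set const *s1, position_set const *s2,
              position_set *m)
  {
    size_t i = 0, j = 0, k = 0;
    while (i < s1->nelem && j < s2->nelem)
      {
        if (s1->elems[i].index < s2->elems[j].index)
          m->elems[k++] = s1->elems[i++];
        else if (s1->elems[i].index > s2->elems[j].index)
          m->elems[k++] = s2->elems[j++];
        else
          {
            /* Same position in both sets: keep one copy and OR the
               match constraints together.  */
            m->elems[k] = s1->elems[i++];
            m->elems[k++].constraint |= s2->elems[j++].constraint;
          }
      }
    while (i < s1->nelem)
      m->elems[k++] = s1->elems[i++];
    while (j < s2->nelem)
      m->elems[k++] = s2->elems[j++];
    m->nelem = k;
  }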

When I run a gprof build of grep with the two files file1, file2, I get:

$ ./grep -v -f <(head -n 2000 ../../../file1) <(head -n 2000 ../../../file2) | cat > /dev/null
$ gprof grep
...
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
48.84       1.47     1.47    26339     0.00     0.00  dfastate
15.62       1.94     0.47    24365     0.00     0.00  state_index
14.29       2.37     0.43    50101     0.00     0.00  merge
...

$ ./grep -v -f <(head -n 5000 ../../../file1) <(head -n 5000 ../../../file2) | cat > /dev/null
$ gprof grep
...
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
54.06      10.45    10.45    67061     0.00     0.00  dfastate
13.50      13.06     2.61    62127     0.00     0.00  state_index
12.21      15.42     2.36   126516     0.00     0.00  merge
...

So, the larger the two input files, the larger the relative time eaten by
'dfastate'. IMO this means the bottleneck is 'dfastate'.

And when I run a gcov build of grep with the two files file1, file2, I get:

$ ./grep -v -f <(head -n 5000 ../../../file1) <(head -n 5000 ../../../file2) | cat > /dev/null
$ cd ../lib
$ gcov dfa.gcda
Find attached the dfa.c.gcov.

Conclusions?

Bruno
Paul Eggert
2016-12-15 01:19:27 UTC
Post by Bruno Haible
Post by Paul Eggert
Part of the
problem appears to be that position-set merging, even with his latest proposed
changes, is O(N**2) where N is the pattern size....
I'm confused. Which code are you talking about?
I was referring to code with his proposed patch installed. 'delete' is
O(N); 'replace' calls 'delete' in a loop and is therefore O(N**2).
'epsclosure' calls 'replace' in a loop and so I suppose it is O(N**3). I
haven't looked into how likely the worst-case performance is, though.
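
To make the cost concrete, a delete() of the kind described would look
roughly like this (an illustrative sketch with made-up types, not the
proposed patch itself); removing one element shifts the array tail
left, so calling it once per element from replace() yields the
quadratic bound:

  #include <stddef.h>
  #include <string.h>

  typedef struct { size_t index; unsigned int constraint; } position;
  typedef struct { position *elems; size_t nelem; } position_set;

  /* O(N): scan for the element, then shift the tail down by one.  */
  static void
  delete_pos (position_set *s, size_t index)
  {
    for (size_t i = 0; i < s->nelem; i++)
      if (s->elems[i].index == index)
        {
          memmove (s->elems + i, s->elems + i + 1,
                   (s->nelem - i - 1) * sizeof *s->elems);
          s->nelem--;
          return;
        }
  }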
Post by Bruno Haible
So, the larger the two input files, the larger the relative time eaten by
'dfastate'. IMO this means the bottleneck is 'dfastate'.
Yes, there is a bottleneck there. I haven't had time yet to investigate
more.
Norihiro Tanaka
2016-12-18 01:30:13 UTC
On Wed, 14 Dec 2016 17:19:27 -0800
Post by Paul Eggert
I was referring to code with his proposed patch installed. 'delete' is
O(N); 'replace' calls 'delete' in a loop and is therefore O(N**2).
'epsclosure' calls 'replace' in a loop and so I suppose it is O(N**3).
I haven't looked into how likely the worst-case performance is, though.
Yes. It is the same both before and after my proposed patch, but it
seems that the memset() for VISITED causes the slowdown in the old
code. It may severely hurt memory-cache efficiency, although I have
no hard evidence.

- before

$ time -p env LC_ALL=C src/grep -F -w -f /usr/share/dict/linux.words /dev/null
real 188.98
user 188.46
sys 0.33

- after

$ time -p env LC_ALL=C src/grep -F -w -f /usr/share/dict/linux.words /dev/null
real 2.19
user 1.84
sys 0.34

Norihiro Tanaka
2016-12-12 14:43:47 UTC
On Sun, 11 Dec 2016 18:55:32 +0100
Post by Bruno Haible
Post by Norihiro Tanaka
The dfa matcher is not always slower than the kws matcher. ...
It's a trade-off. Do you have any idea how to select the better
matcher for both cases?
There are at least two approaches.
1) If you have vastly different run times (2 sec vs. 10 min or so), the
program can set up two threads and run each algorithm in one thread. Buffer
the output. When a thread terminates, use its output and kill the other thread.
Now that is still suboptimal, because on average this approach will lose a
factor of 2, just because it does not know which is the best algorithm.
2) You can run a set of benchmarks, indexed by 2 parameters (m,n)
m = number of keywords,
n = total number of bytes of the files to be searched,
run each of the two algorithms, and note the best algorithm. This way
you fill a matrix of results (in the (m,n) plane). Now find a formula
that roughly tells you for given (m,n) whether you are in one area of
the plane (where kwset is better) or in the other area of the plane
(where dfa is better). Finally, code this formula into the 'grep' program.
We cannot know N before searching. If the input is a regular file, we
can examine it with stat(), but it may be stdin.

BTW, the following test case uses a lot of memory with the grep matcher
especially, as grep compiles both the dfa and the regex. dfa uses a lot
of memory in the calculation of `D->follow' in dfaanalyze(). On my
machine, dfa uses about 500MB, and regex uses about 1GB.

$ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null
Trevor Cordes
2016-12-12 10:36:21 UTC
Post by Norihiro Tanaka
Post by Trevor Cordes
On my box the above runs for >2m (never completes before I ^C) on
the version **AFTER** the commits (v2.22). On the test build just
*BEFORE* the commits (2.21.73-8058), it runs in <2s. So for me, a
working command (-F -w -f) that used to run quickly suddenly
became slow/useless. And none of the suggestions I've read
alleviates the problem for me: not locale settings, nor
anything else. If you can confirm this is the case on your box,
then maybe we can figure out why?
Can you re-run with current master, as dfa has been improved
frequently?
env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null
ran for 3+ minutes and then quit; so, no change. I had run HEAD
before my git bisect anyway, just to ensure it hadn't been fixed in
the meantime.

If HEAD helps, then it still isn't good enough, as taking 2+ orders of
magnitude more time is still not a viable solution for me.
Post by Norihiro Tanaka
The first is faster after the changes, and the second is slower.
The problem, I think, isn't "faster" or "slower", it's usable vs.
unusable. I think it's not possible to change a working-for-decades use
case to increase runtime by 2, 3 or 10 orders of magnitude without
leaving a bunch of previously-happy end users stranded. Hence all the
bug reports. Conversely, not many other people are going to notice or
care that their other use case improved by 10 or 20% in runtime. This
change is exactly that tradeoff: under 1 order of magnitude improvement
for some people, at the expense of many orders of magnitude detriment
for others.
Post by Norihiro Tanaka
If we adopt a simple rule, selecting the kws matcher for grep -F -w
-f and the dfa matcher for grep -F -w -e, we will get extremely
different performance for the following two cases, even though they are equivalent.
OK, I looked at the code a lot more closely. Now I see a more salient
point that perhaps we should be looking at:

"In a unibyte locale, switch from fgrep to grep if the pattern matches
words (where grep is typically faster)."

Why is this even something that's being done? If a user bothered to
specify fgrep or -F then that user knows what they are doing! Only
advanced users even know that fgrep/-F exists and use it. What the
above comment says is that "since we think it'll be faster, let's
override the will of the user because we think we know better"! Why not
give the user credit that they know what they are doing?

And in this case, again consider my above point that you are gaining
tiny improvements for some at the expense of horrendous penalties for
others.

Perhaps undoing the commit and simply expounding on the issue in the
docs (man page, etc) would be a better approach? Then the user has
control of the situation in their choice of -F or no -F. What is wrong
with that?
Post by Norihiro Tanaka
changes. It's a trade-off. Do you have any idea how to select the
better matcher for both cases?
- $ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null
- $ env LC_ALL=C grep -F -w -e "`cat /usr/share/dict/words`" /dev/null
Yes, you are correct, if this commit is to stand and we want to find
some way to solve my (and the other bug reporters') use case
performance problems, then that question will have to be answered. I
would say empirical tests would have to be run to discover where the
performance gain/loss crossover point lies (perhaps based on
input/expression size) and switch based on that.

However, my hunch is that not trying to outsmart the user in the first
place is the correct approach.

Thanks for your time and input!
Paul Eggert
2016-12-12 16:57:55 UTC
Post by Trevor Cordes
If a user bothered to
specify fgrep or -F then that user knows what they are doing!
Here I have to disagree, and to agree with Bruno. The user should not
have to know what grep's algorithm is. grep should just work. It is not
working now and we need to fix that, but we shouldn't fix things by
asking users to know that 'grep -F' is "fast" and plain grep is "slow".

Instead, we should fix things so that grep -F and plain grep are both
fast when given equivalent patterns. If the file foo contains no special
characters, then 'grep -F -f foo' and 'grep -f foo' should run at about
the same speed (fast in both cases of course :-).
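
One way to get that equivalence (a hypothetical helper, not grep's
actual code) is to notice when a pattern contains no regex
metacharacters and hand it to the same matcher -F would use; the exact
metacharacter set to test depends on the regex flavor:

  #include <stdbool.h>
  #include <string.h>

  /* True if PAT would match the same way under -F, i.e. it contains
     no characters that are special in the regex flavor in use.  */
  static bool
  pattern_is_literal (char const *pat)
  {
    static char const metachars[] = "\\^$.[]()*+?{|";
    return pat[strcspn (pat, metachars)] == '\0';
  }
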
Trevor Cordes
2016-12-13 00:04:38 UTC
Post by Paul Eggert
grep should just work. It is
not working now and we need to fix that
By that rationale, the commit that causes the problems for me should be
backed out as a first step, and then a proper solution should be
found. If you look at the commit in isolation, it:
- at best speeds up some use cases ("case A") by 20%
- at worst slows down some use cases ("case B") by 10000%+

This all should be approached from the standpoint that the pre-commit
state is correct, and the post-commit state is buggy. And then try to
find the proper solution to regain the speed up in use case A, without
killing case B. All I'm talking about here is perspective... the end
result may be the same.
Post by Paul Eggert
Instead, we should fix things so that grep -F and plain grep are both
fast when given equivalent patterns.
Agreed, but it would appear that no amount of optimization to the dfa
algorithm (minutes/hours) is going to get us anywhere near the speed of
the old -F algorithm (2s) in my use case. Then there's this...
Post by Norihiro Tanaka
memory in the calculation of `D->follow' in dfaanalyze(). On my
machine, dfa uses about 500MB, and regex uses about 1GB.
$ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null
Even if a patch solves the time issue, perhaps the memory issue must be
considered as well, as Norihiro suggests. On my box, the above gives
me an RSS (in KB) of:
pre-commit: 141248
post-commit: 1410152

An order of magnitude more memory, ironic since pre-commit also runs
faster :-) I would assume that if someone's -f file were even
bigger than mine (or their box had a small amount of RAM), the memory
requirements could be more than their box could handle (as per RHBZ
1338497). My point being: wouldn't any dfa solution require much
larger amounts of RAM? Any heuristic switching should take into
account both time and RAM, making the choice even harder, perhaps
necessitating --optimize-for-ram and --optimize-for-time switches.
Post by Bruno Haible
It is wrong to put the burden of the algorithm choice on the user.
OK, you convinced me. In my mind I was thinking of -F as more of an
algorithm choice than a has/hasnot regex meta-chars choice. That is
clearly wrong.
Post by Bruno Haible
There are at least two approaches.
1) If you have vastly different run times (2 sec vs. 10 min or so),
[...]
Post by Bruno Haible
2) You can run a set of benchmarks, indexed by 2 parameters (m,n)
I love both of your ideas; though #1 is for sure "kludgy", it would
solve the problem for my use case. Perhaps #1 could be invoked only
when a heuristic applies, such as the -f file being >X bytes or words.
Post by Norihiro Tanaka
We cannot know N before searching. If the input is a regular file, we
can examine it with stat(), but it may be stdin.
It seems to me N is vastly less important here than M, in terms of
runtime. I could be wrong, but perhaps just looking at M is good
enough. If not, then you are correct, a heuristic solution would be
difficult to determine.

I'm going to run some test cases and benchmarks to see exactly how M vs
N impacts runtime to see if some heuristics are viable solutions.

Thanks everyone for all your great ideas and thoughtful discussion!
Paul Eggert
2016-12-13 01:47:11 UTC
Post by Trevor Cordes
Post by Paul Eggert
grep should just work. It is
not working now and we need to fix that
By that rationale, the commit that causes the problems for me should be
backed out as a first step, and then a proper solution should be
found.
But it may be better to improve on the solution. Backing up is not always the
best way to move forward, even if a previous step had unwanted consequences.
Post by Trevor Cordes
This all should be approached from the standpoint that the pre-commit
state is correct
But the pre-commit state had bad performance in some significant cases. So we
can't assume it was correct.
Post by Trevor Cordes
My point being: wouldn't any dfa solution require much
larger amounts of RAM?
That might be OK these days. It would depend on how much RAM of course.
Nowadays, trading RAM for CPU time is often a win.
Post by Trevor Cordes
Perhaps
necessitating a need for a --optimize-for-ram and --optimize-for-time
switch choice.
I would rather avoid that. Users will not want to mess with such options. "grep
should just work."
L.A. Walsh
2016-12-15 09:55:44 UTC
Post by Norihiro Tanaka
The dfa matcher is not always slower than the kws matcher.
- $ env LC_ALL=C grep -F -w 0 k
- $ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null
The first is faster after the changes, and the second is slower.
It's a trade-off. Do you have any idea how to select the better
matcher for both cases?
----
If a way can't be found to determine the best choice algorithmically,
then on a multiprocessor machine, run both at the same time. The first
one to complete terminates the slowpoke.

It's similar to how some think the brain works: computing multiple
potential paths at the same time, where early returns may not be the
most accurate but might be better than nothing.

Given time (meditation, "I think I'll sleep on it", "thinking about
it", etc.), a more complete answer can be found.

In this case, with extreme time differences and the possibility of a
quick, accurate resolution, searching in parallel, while not trivial
in our linear programming languages, would almost certainly yield a
faster result.

Using parallel solutions should be the default on a multi-core machine.