Discussion:
[master b3cf281] Unbreak the MinGW build
(too old to reply)
Paul Eggert
2016-12-16 15:43:59 UTC
Permalink
Raw Message
I needed this commit to prevent temacs from crashing during dumping.
Don't ask me how including errno.h (both the one from Gnulib and the
MinGW one) could cause this, especially as the preprocessed __fpending
doesn't seem to change a bit as result of that, and it doesn't seem to
even be called during dumping. The facts are stubborn: if I leave
that inclusion in place, I get a crash, removing it fixes the crash.
I propagated this fix back to gnulib by installing the attached patch to Gnulib.
I'll CC: this to Bruno Haible, who made the recent change to fpending. I don't
use MinGW myself.
Bruno Haible
2016-12-16 18:11:31 UTC
Permalink
Raw Message
Post by Paul Eggert
I needed this commit to prevent temacs from crashing during dumping.
Don't ask me how including errno.h (both the one from Gnulib and the
MinGW one) could cause this, especially as the preprocessed __fpending
doesn't seem to change a bit as result of that, and it doesn't seem to
even be called during dumping. The facts are stubborn: if I leave
that inclusion in place, I get a crash, removing it fixes the crash.
I propagated this fix back to gnulib by installing the attached patch to Gnulib.
I'll CC: this to Bruno Haible, who made the recent change to fpending. I don't
use MinGW myself.
On mingw, the fpending.o generated by this code, with or without this
#include <errno.h>, is identical (even with debugging information [-g]).

I wouldn't have applied this patch, as the cause of the crash is obviously
somewhere else. We have all seen Heisenbugs in complex pieces of code. The
suspicious complex piece of code in this case, for me, is the temacs
dumping code.

Eli:
- How often have you tried to temacs+dump with and without the change?
Once each? Twice each? 10 times each? If it's a small number, you may
be seeing a random result and not realize it was random.
- Did you run temacs in the same directory in both cases, or in different
directories? Different directories could lead to a different memory
layout in temacs, due to different filename lengths.
- After determining that disabling the include would fix the crash, did
you test whether reenable the include would reenable the crash? If not,
some leftover file on the file system could make the difference.

Bruno
Eli Zaretskii
2016-12-16 21:10:59 UTC
Permalink
Raw Message
Date: Fri, 16 Dec 2016 19:11:31 +0100
On mingw, the fpending.o generated by this code, with or without this
#include <errno.h>, is identical (even with debugging information [-g]).
Yes, I said that much. Which is why this is a mystery in my eyes.
I wouldn't have applied this patch, as the cause of the crash is obviously
somewhere else.
I know. I just don't know where, and have run out of ideas.
Suggestions welcome, but I cannot continue my work on Emacs without
being able to build the latest HEAD of the master branch. Having
uncommitted/unpushed changes in master means I need to jump through
hoops each time I need to push a change upstream.
- How often have you tried to temacs+dump with and without the change?
Once each? Twice each? 10 times each? If it's a small number, you may
be seeing a random result and not realize it was random.
I did it in two separate branches, 3 times in each one. The results
are consistent. This crash isn't random.
- Did you run temacs in the same directory in both cases, or in different
directories? Different directories could lead to a different memory
layout in temacs, due to different filename lengths.
Same directory.
- After determining that disabling the include would fix the crash, did
you test whether reenable the include would reenable the crash?
Yes.
Bruno Haible
2016-12-16 23:17:31 UTC
Permalink
Raw Message
Post by Eli Zaretskii
Post by Bruno Haible
- How often have you tried to temacs+dump with and without the change?
Once each? Twice each? 10 times each? If it's a small number, you may
be seeing a random result and not realize it was random.
I did it in two separate branches, 3 times in each one. The results
are consistent. This crash isn't random.
Post by Bruno Haible
- Did you run temacs in the same directory in both cases, or in different
directories? Different directories could lead to a different memory
layout in temacs, due to different filename lengths.
Same directory.
Post by Bruno Haible
- After determining that disabling the include would fix the crash, did
you test whether reenable the include would reenable the crash?
Yes.
OK, that much for our "conventional" wisdom...

Since, as you say, the crash occurs during dumping, this is what I would
turn to now. Do you have a systematic approach for debugging crashes during
dump? If dump is based on malloc, does it help to set MALLOC_PERTURB_?
Can you use tools such as valgrind to debug it?
Post by Eli Zaretskii
Having
uncommitted/unpushed changes in master means I need to jump through
hoops each time I need to push a change upstream.
Yes, having to do
$ git rebase -i HEAD~2
$ git push origin HEAD~1:master
each time is extra work.
Post by Eli Zaretskii
Post by Bruno Haible
I wouldn't have applied this patch, as the cause of the crash is obviously
somewhere else.
I know.
Actually gnulib has a way to keep in your project changes relative to gnulib
that should not be pushed upstream: It's gnulib-tool's --local-dir option.
You would have had to create a small lib/stdio-impl.h.diff file that gets
applied on the fly each time you invoke gnulib-tool for emacs.

Bruno
Paul Eggert
2016-12-17 00:30:01 UTC
Permalink
Raw Message
Post by Bruno Haible
gnulib has a way to keep in your project changes relative to gnulib
that should not be pushed upstream
I'd rather not do that, as the Emacs build procedure is already too complicated.
Eli Zaretskii
2016-12-17 07:51:15 UTC
Permalink
Raw Message
Date: Sat, 17 Dec 2016 00:17:31 +0100
Since, as you say, the crash occurs during dumping, this is what I would
turn to now. Do you have a systematic approach for debugging crashes during
dump? If dump is based on malloc, does it help to set MALLOC_PERTURB_?
Can you use tools such as valgrind to debug it?
Why would I need them? The dumping code on Windows is just C code
that reads and writes a binary file, so GDB is good enough, it shows
exactly what causes the crash. Except that what it told me made very
little sense, so I tried the "blame the last change" approach. Which
worked, even though examining the source-level changes clearly
indicates that the new fpending.c does the same as the old one did, as
long as you compare the executable C code in the two versions.
Actually gnulib has a way to keep in your project changes relative to gnulib
that should not be pushed upstream: It's gnulib-tool's --local-dir option.
You would have had to create a small lib/stdio-impl.h.diff file that gets
applied on the fly each time you invoke gnulib-tool for emacs.
I didn't ask to propagate the change back into Gnulib, that was Paul's
decision. I understand his motivation, but I'm okay with anything you
decide about Gnulib.

I will try to look into this some more today, and see if I can unlock
the mystery. It bothers me, too.

Thanks.
Eli Zaretskii
2016-12-17 11:17:23 UTC
Permalink
Raw Message
Date: Sat, 17 Dec 2016 09:51:15 +0200
I will try to look into this some more today, and see if I can unlock
the mystery. It bothers me, too.
Bug squashed, see commit 0757b4f. It was always there, we were just
lucky until now.

What the inclusion of errno.h did is increase the size of the object
file (due to massive debug info about macros), and thus the size of
temacs, by 1.5KB, and that caused writes to emacs.exe to cross the
page boundary beyond the allocated memory mapping, and segfault.

You can now remove the workaround from stdio-impl.h. (I already did
that in the Emacs repository.)

Thanks.
Bruno Haible
2016-12-17 12:33:42 UTC
Permalink
Raw Message
Post by Eli Zaretskii
You can now remove the workaround from stdio-impl.h.
Done.

Thanks for the rapid debugging. This is the kind of bug that you typically
cannot reproduce any more after two weeks, if you don't handle it quickly.

Bruno

Bruno Haible
2016-12-17 01:48:47 UTC
Permalink
Raw Message
Is Plan 9 still a valid porting target, anyway?
Well, it got requested in 2012 [1]. And, I guess, it has some educational value
regarding Unix and minimalism.

Bruno

[1] https://lists.gnu.org/archive/html/bug-gnulib/2012-02/msg00004.html
Loading...