[pacman-dev] makepkg: Include more source files in debug packages.

Message ID 20200220015143.2249601-1-austin.lund@gmail.com
State Changes Requested
Headers show
Series
  • [pacman-dev] makepkg: Include more source files in debug packages.
Related show

Commit Message

Austin Lund Feb. 20, 2020, 1:51 a.m. UTC
Currently only the file pointed to by the DW_AT_name is included as a source
file in debug packages.  This means many files that are useful for debugging are
not included.  For example, no header files are included but yet these may by
referenced in the .debug_line section.

This sed script converts into shell variables the debug dump information from
readelf about compilation units, directory tables and file tables.  This can be
used to get the full path of all the source files from within the package being
compiled that are referenced in the debugging information.  Also, placeholder
symbols (e.g. <builtin>) and paths outside the current source (e.g. linked
libraries) will be more consistently ignored from inclusion in the debug
packages.

Signed-off-by: Austin Lund <austin.lund@gmail.com>
---
 scripts/libmakepkg/tidy/strip.sh.in | 46 +++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

Comments

Eli Schwartz Feb. 20, 2020, 3:40 a.m. UTC | #1
On 2/19/20 8:51 PM, Austin Lund wrote:
> Currently only the file pointed to by the DW_AT_name is included as a source
> file in debug packages.  This means many files that are useful for debugging are
> not included.  For example, no header files are included but yet these may by
> referenced in the .debug_line section.

Why do we need headers? headers by definition are supposed to declare
things that are then defined by a source file which is in DW_AT_name.

> This sed script converts into shell variables the debug dump information from
> readelf about compilation units, directory tables and file tables.  This can be
> used to get the full path of all the source files from within the package being
> compiled that are referenced in the debugging information.  Also, placeholder
> symbols (e.g. <builtin>) and paths outside the current source (e.g. linked
> libraries) will be more consistently ignored from inclusion in the debug
> packages.

I... don't really follow what this sed thing is doing. Is it printing
out a bash script? Why?
Austin Lund Feb. 21, 2020, 1:47 a.m. UTC | #2
On Thu, 20 Feb 2020 at 13:41, Eli Schwartz <eschwartz@archlinux.org> wrote:
>
> On 2/19/20 8:51 PM, Austin Lund wrote:
> > Currently only the file pointed to by the DW_AT_name is included as a source
> > file in debug packages.  This means many files that are useful for debugging are
> > not included.  For example, no header files are included but yet these may by
> > referenced in the .debug_line section.
>
> Why do we need headers? headers by definition are supposed to declare
> things that are then defined by a source file which is in DW_AT_name.

Some function definitions are presented within header files.  For
example, people will declare short inline functions in header files.
The line debugging information then points into the header files.

Just a random example:

gdb ls
b initialize_exit_failure
r does_not_exist

A code listing in GDB should point to the inline function definition
in system.h in coreutils  But currently that's missing from a built
coreutils-debug package.

> > This sed script converts into shell variables the debug dump information from
> > readelf about compilation units, directory tables and file tables.  This can be
> > used to get the full path of all the source files from within the package being
> > compiled that are referenced in the debugging information.  Also, placeholder
> > symbols (e.g. <builtin>) and paths outside the current source (e.g. linked
> > libraries) will be more consistently ignored from inclusion in the debug
> > packages.
>
> I... don't really follow what this sed thing is doing. Is it printing
> out a bash script? Why?

A fairly complex state needs to be saved to give a output useful for
packaging.  The things that are needed are the compilation directory
for each compilation unit and the file and directory tables for the
corresponding compilation units.  The way this is structured in the
file and the way readelf presents this information isn't totally
linear.  DW_AT_comp_dir is in the .debug_info section whilst the file
and directory tables are in the .debug_line section where the file
table refers to the directory table.  readelf tries to resolve things
with the --debug-dump=decodedline option.  But the compilation
directories are lost and the directory tables aren't used consistently
in the output (at least in my version of readelf (i.e. 2.34).

Keeping track of this state information is completely impractical in
sed, and confusing in awk.  Using bash associative arrays was an easy
solution that came to mind, so I just used that and parsed the output
in a subshell.  Perhaps there is a better way.
Allan McRae Feb. 24, 2020, 5:56 a.m. UTC | #3
On 20/2/20 11:51 am, Austin Lund wrote:
> Currently only the file pointed to by the DW_AT_name is included as a source
> file in debug packages.  This means many files that are useful for debugging are
> not included.  For example, no header files are included but yet these may by
> referenced in the .debug_line section.
> 
> This sed script converts into shell variables the debug dump information from
> readelf about compilation units, directory tables and file tables.  This can be
> used to get the full path of all the source files from within the package being
> compiled that are referenced in the debugging information.  Also, placeholder
> symbols (e.g. <builtin>) and paths outside the current source (e.g. linked
> libraries) will be more consistently ignored from inclusion in the debug
> packages.
> 
> Signed-off-by: Austin Lund <austin.lund@gmail.com>
> ---


So...  that sed script is horrendous!  But let me see if I understand
this correctly.


We currently only look at the .debug_info section, finding
DW_AT_name/DW_AT_comp_dir pair to grab file names.  That appears to get
the main compilation units, but misses header files.  It looks like your
sed does something slightly different to get that info, although I can't
tell if there is a functional difference.

Your patch additionally looks at the .debug_line section.  This section
has a table of directories that source files come from (which can be
filtered to remove system directories), and a file name table with files
from each directory.   This does not include the files we currently grab.

But there must be something I am missing...  For the example of "ls" I
see "selinux.h" in from directory "./lib/selinux" in that .debug_line
output.   Your script does not include this file.  Using the rpmtool
"debugedit -l" does include that file in its file list.  There are quite
a few other examples.

Am I on the right track?   Can you clarify what I am missing here?

Thanks,
Allan
Austin Lund Feb. 26, 2020, 1 a.m. UTC | #4
On Mon, Feb 24, 2020 at 03:56:58PM +1000, Allan McRae wrote:
> On 20/2/20 11:51 am, Austin Lund wrote:
> > Currently only the file pointed to by the DW_AT_name is included as a source
> > file in debug packages.  This means many files that are useful for debugging are
> > not included.  For example, no header files are included but yet these may by
> > referenced in the .debug_line section.
> > 
> > This sed script converts into shell variables the debug dump information from
> > readelf about compilation units, directory tables and file tables.  This can be
> > used to get the full path of all the source files from within the package being
> > compiled that are referenced in the debugging information.  Also, placeholder
> > symbols (e.g. <builtin>) and paths outside the current source (e.g. linked
> > libraries) will be more consistently ignored from inclusion in the debug
> > packages.
> > 
> > Signed-off-by: Austin Lund <austin.lund@gmail.com>
> > ---
> 
> 
> So...  that sed script is horrendous!  But let me see if I understand
> this correctly.
> 
> 
> We currently only look at the .debug_info section, finding
> DW_AT_name/DW_AT_comp_dir pair to grab file names.  That appears to get
> the main compilation units, but misses header files.  It looks like your
> sed does something slightly different to get that info, although I can't
> tell if there is a functional difference.
> 
> Your patch additionally looks at the .debug_line section.  This section
> has a table of directories that source files come from (which can be
> filtered to remove system directories), and a file name table with files
> from each directory.   This does not include the files we currently grab.
> 
> But there must be something I am missing...  For the example of "ls" I
> see "selinux.h" in from directory "./lib/selinux" in that .debug_line
> output.   Your script does not include this file.  Using the rpmtool
> "debugedit -l" does include that file in its file list.  There are quite
> a few other examples.
> 
> Am I on the right track?   Can you clarify what I am missing here?

It would seem I made a bad assmption about what was in and not in the current
source tree.  There will be references to libc files that appear in the output
as a relative path.  My thought was that if there wasn't a preceeding './' then
it must refer to another directory.

Anyway, yes.  You are on the right track.  My hope was that you can install the
debug package and get all available source listings.  Without these functions
within the header files, that isn't possible.

What's a better approach to parsing the readelf output?  Awk?
Allan McRae March 2, 2020, 3:08 a.m. UTC | #5
On 26/2/20 11:00 am, Austin Lund wrote:
> On Mon, Feb 24, 2020 at 03:56:58PM +1000, Allan McRae wrote:
>> On 20/2/20 11:51 am, Austin Lund wrote:
>>> Currently only the file pointed to by the DW_AT_name is included as a source
>>> file in debug packages.  This means many files that are useful for debugging are
>>> not included.  For example, no header files are included but yet these may by
>>> referenced in the .debug_line section.
>>>
>>> This sed script converts into shell variables the debug dump information from
>>> readelf about compilation units, directory tables and file tables.  This can be
>>> used to get the full path of all the source files from within the package being
>>> compiled that are referenced in the debugging information.  Also, placeholder
>>> symbols (e.g. <builtin>) and paths outside the current source (e.g. linked
>>> libraries) will be more consistently ignored from inclusion in the debug
>>> packages.
>>>
>>> Signed-off-by: Austin Lund <austin.lund@gmail.com>
>>> ---
>>
>>
>> So...  that sed script is horrendous!  But let me see if I understand
>> this correctly.
>>
>>
>> We currently only look at the .debug_info section, finding
>> DW_AT_name/DW_AT_comp_dir pair to grab file names.  That appears to get
>> the main compilation units, but misses header files.  It looks like your
>> sed does something slightly different to get that info, although I can't
>> tell if there is a functional difference.
>>
>> Your patch additionally looks at the .debug_line section.  This section
>> has a table of directories that source files come from (which can be
>> filtered to remove system directories), and a file name table with files
>> from each directory.   This does not include the files we currently grab.
>>
>> But there must be something I am missing...  For the example of "ls" I
>> see "selinux.h" in from directory "./lib/selinux" in that .debug_line
>> output.   Your script does not include this file.  Using the rpmtool
>> "debugedit -l" does include that file in its file list.  There are quite
>> a few other examples.
>>
>> Am I on the right track?   Can you clarify what I am missing here?
> 
> It would seem I made a bad assmption about what was in and not in the current
> source tree.  There will be references to libc files that appear in the output
> as a relative path.  My thought was that if there wasn't a preceeding './' then
> it must refer to another directory.
> 
> Anyway, yes.  You are on the right track.  My hope was that you can install the
> debug package and get all available source listings.  Without these functions
> within the header files, that isn't possible.
> 
> What's a better approach to parsing the readelf output?  Awk?

Short answer...  I'm not sure of the best approach.  I'd just like it to
be readable.

I have file a bug:
https://bugs.archlinux.org/task/65677

Patch

diff --git a/scripts/libmakepkg/tidy/strip.sh.in b/scripts/libmakepkg/tidy/strip.sh.in
index 2b6f732d..2598d606 100644
--- a/scripts/libmakepkg/tidy/strip.sh.in
+++ b/scripts/libmakepkg/tidy/strip.sh.in
@@ -35,9 +35,51 @@  build_id() {
 	LANG=C readelf -n "$1" | sed -n '/Build ID/ { s/.*: //p; q; }'
 }
 
+_source_files_sedscript='1 { a \
+declare -A COMPDIR
+}
+/^Contents of the .debug_info section:$/,/^[^[:space:]]/ {
+/^ <0>.*\(DW_TAG_compile_unit\)$/,/^ {0,3}[^[:space:]]/{
+	/^ <0>.*\(DW_TAG_compile_unit\)$/{h;d}
+	/^ {0,3}[^[:space:]]/{
+		x
+		s/(.*)/\n&\n/
+		s/.*DW_AT_comp_dir[^\n]*: ([^\n]*)\n.*/'\''\1'\''&/
+		s/.*DW_AT_stmt_list[^\n]*: ([^\n]*)\n.*/COMPDIR\[\1\]=&/
+		P;d
+	}
+	H
+	d
+}
+}
+/^Raw dump of debug contents of section .debug_line:$/,/^[^[:space:]]/ {
+/^  Offset:/,/^ Line Number Statements:$/ {
+	/^  Offset:/ {
+		a \
+unset DIRTABLE\
+unset FILETABLE
+
+		s/.*(0x[[:xdigit:]]+)$/OFFSET=\1/; p; d }
+	/^ The Directory Table/,/^$/ {
+		/^  [[:digit:]]+\t[^/.]/ {
+		s/^  ([[:digit:]]+)\t(.*)/DIRTABLE[\1]=\$\{COMPDIR\[\$OFFSET]\}\/\2/
+		p; d
+		}
+		d
+	}
+	/^ The File Name Table/,/^$/ {
+		/^  [[:digit:]]+\t[1-9]/ {
+		s/^  [[:digit:]]+\t([[:digit:]]+)\t.*\t(.*)$/\[\[ -v DIRTABLE\[\1\] \]\] \&\& echo \$\{DIRTABLE\[\1\]\}\/\2/
+		p; d
+		}
+		d
+	}
+}
+}'
+
 source_files() {
 	LANG=C readelf "$1" --debug-dump | \
-		awk '/DW_AT_name +:/{name=$8}/DW_AT_comp_dir +:/{{if (name == "<artificial>") next}{if (name !~ /^[<\/]/) {printf "%s/", $8}}{print name}}'
+		sed -n -E -e "${_source_files_sedscript}" | /bin/bash
 }
 
 strip_file() {
@@ -60,7 +102,7 @@  strip_file() {
 		while IFS= read -r t; do
 			f=${t/${dbgsrcdir}/"$srcdir"}
 			mkdir -p "${dbgsrc/"$dbgsrcdir"/}${t%/*}"
-			cp -- "$f" "${dbgsrc/"$dbgsrcdir"/}$t"
+			cp -n -- "$f" "${dbgsrc/"$dbgsrcdir"/}$t"
 		done < <(source_files "$binary")
 
 		# copy debug symbols to debug directory