[dbscripts] Don't parse .db files ourselves; use pyalpm instead
diff mbox

Message ID 20180709011400.554-1-lukeshu@lukeshu.com
State Rejected
Headers show

Commit Message

Luke Shumaker July 9, 2018, 1:14 a.m. UTC
From: Luke Shumaker <lukeshu@parabola.nu>

In a patchset that I recently submitted, Eli was concerned that I was
parsing .db files with bsdtar+awk, when the format of .db files isn't
"public"; the only guarantees made about it are that libalpm can parse it.

https://lists.archlinux.org/pipermail/arch-projects/2018-June/004932.html

I wasn't too concerned, because `ftpdir-cleanup` and `sourceballs` already
parse the .db files in the same way.  Nonetheless, I think Eli is right: we
shouldn't be parsing these files ourselves.

So, add a `dbquery` function that uses pyalpm to parse the .db files:

 - It takes as arguments Python 3 expressions;
   1. one that that returns a bool deciding whether we want to print
      information on a package, and
   2. another that returns the string to print for a package.

   Currently, all callers use "True" for the decider expression, as
   ftpdir-cleanup and sourceballs operate on *every* package.  However, I'm
   including a way to filter packages because, I'm coming at this from the
   context that I want to parse .db files in other places too.

 - libalpm doesn't offer an easy way to say "parse this DB file for me";
   instead, we must construct a configuration that has a syncdb pointing to
   that file, which we then have it sync in to a temporary directory.

As a final note, when re-writing the bit of sourceballs to use dbquery
instead of AWK, I realized that it does not correctly handle licenses that
have a space in them (as of 2018-07-07 there are 67 packages in the Arch
repos that have license containing a space).  I did not fix this bug; I
merely translated it from AWK to Python, as the program would also need to
be adjusted elsewhere.
---
 cron-jobs/ftpdir-cleanup |  2 +-
 cron-jobs/sourceballs    | 14 ++------------
 db-functions             | 25 +++++++++++++++++++++++++
 test/Dockerfile          |  2 +-
 4 files changed, 29 insertions(+), 14 deletions(-)

Comments

Eli Schwartz via arch-projects July 9, 2018, 2:38 a.m. UTC | #1
On 07/08/2018 09:14 PM, Luke Shumaker wrote:
> From: Luke Shumaker <lukeshu@parabola.nu>
> 
> In a patchset that I recently submitted, Eli was concerned that I was
> parsing .db files with bsdtar+awk, when the format of .db files isn't
> "public"; the only guarantees made about it are that libalpm can parse it.
> 
> https://lists.archlinux.org/pipermail/arch-projects/2018-June/004932.html
> 
> I wasn't too concerned, because `ftpdir-cleanup` and `sourceballs` already
> parse the .db files in the same way.  Nonetheless, I think Eli is right: we
> shouldn't be parsing these files ourselves.
> 
> So, add a `dbquery` function that uses pyalpm to parse the .db files:

What's wrong with expac?

expac --config ${dbscripts_root}/pacman-community.conf -S '%f'

expac is not only super elegant, there's pending patches to provide it
in pacman 6 as part of the core project. This is what I'm waiting for,
actually.

I see no reason to add an external dependency on both python and pyalpm,
in order to run a small python program which evals its arguments in
order to inject database queries, when a tool with a simple API can do
the same and will eventually be guaranteed to be everywhere pacman
itself is.

(Let's ignore for a moment, the defunct integrity checks service which
is written in python, but not pyalpm. pyalpm is not currently installed
on the dbscripts server ATM.)

>  - It takes as arguments Python 3 expressions;
>    1. one that that returns a bool deciding whether we want to print
>       information on a package, and
>    2. another that returns the string to print for a package.
> 
>    Currently, all callers use "True" for the decider expression, as
>    ftpdir-cleanup and sourceballs operate on *every* package.  However, I'm
>    including a way to filter packages because, I'm coming at this from the
>    context that I want to parse .db files in other places too.
> 
>  - libalpm doesn't offer an easy way to say "parse this DB file for me";
>    instead, we must construct a configuration that has a syncdb pointing to
>    that file, which we then have it sync in to a temporary directory.
> 
> As a final note, when re-writing the bit of sourceballs to use dbquery
> instead of AWK, I realized that it does not correctly handle licenses that
> have a space in them (as of 2018-07-07 there are 67 packages in the Arch
> repos that have license containing a space).  I did not fix this bug; I
> merely translated it from AWK to Python, as the program would also need to
> be adjusted elsewhere.
Keeping in mind the ones we're looking for are a whitelist of
strictly-defined license types... I think those are all ad-hoc custom
licenses, none of which we're interested in in the primary sourceballs
deployment.
Luke Shumaker July 9, 2018, 5:32 p.m. UTC | #2
On Sun, 08 Jul 2018 22:38:06 -0400,
Eli Schwartz wrote:
> 
> On 07/08/2018 09:14 PM, Luke Shumaker wrote:
> > From: Luke Shumaker <lukeshu@parabola.nu>
> > 
> > In a patchset that I recently submitted, Eli was concerned that I was
> > parsing .db files with bsdtar+awk, when the format of .db files isn't
> > "public"; the only guarantees made about it are that libalpm can parse it.
> > 
> > https://lists.archlinux.org/pipermail/arch-projects/2018-June/004932.html
> > 
> > I wasn't too concerned, because `ftpdir-cleanup` and `sourceballs` already
> > parse the .db files in the same way.  Nonetheless, I think Eli is right: we
> > shouldn't be parsing these files ourselves.
> > 
> > So, add a `dbquery` function that uses pyalpm to parse the .db files:
> 
> What's wrong with expac?
> 
> expac --config ${dbscripts_root}/pacman-community.conf -S '%f'
> 
> expac is not only super elegant, there's pending patches to provide it
> in pacman 6 as part of the core project. This is what I'm waiting for,
> actually.
> 
> I see no reason to add an external dependency on both python and pyalpm,
> in order to run a small python program which evals its arguments in
> order to inject database queries, when a tool with a simple API can do
> the same and will eventually be guaranteed to be everywhere pacman
> itself is.

With the "True" filter that ftpdir-cleanup and sourceballs both use,
you're right; this could be done with expac.  But, with the context
that this patch exists to enable me to address the concern you had
with the other patchset:

AFAICT, with expac there's no way to do a query like:

    dbquery core x86_64 \
    	    "(pkg.base or pkg.name) == '$pkgbase'" \
	    ...

Which is what most (all?) of the queries in the other patchset would
become.

(Drat, it seems that discussing this separately from the other
patchset won't work after all.)

> (Let's ignore for a moment, the defunct integrity checks service which
> is written in python, but not pyalpm. pyalpm is not currently installed
> on the dbscripts server ATM.)

Good call; check_packages.py is python2, the pyalpm dep does add a new
dependency on python3.

> > As a final note, when re-writing the bit of sourceballs to use dbquery
> > instead of AWK, I realized that it does not correctly handle licenses that
> > have a space in them (as of 2018-07-07 there are 67 packages in the Arch
> > repos that have license containing a space).  I did not fix this bug; I
> > merely translated it from AWK to Python, as the program would also need to
> > be adjusted elsewhere.
> Keeping in mind the ones we're looking for are a whitelist of
> strictly-defined license types... I think those are all ad-hoc custom
> licenses, none of which we're interested in in the primary sourceballs
> deployment.

Indeed; if I thought it were a serious problem, I would have written a
patch for it :)
Eli Schwartz via arch-projects July 9, 2018, 10:37 p.m. UTC | #3
On 07/09/2018 01:32 PM, Luke Shumaker wrote:
> With the "True" filter that ftpdir-cleanup and sourceballs both use,
> you're right; this could be done with expac.  But, with the context
> that this patch exists to enable me to address the concern you had
> with the other patchset:
> 
> AFAICT, with expac there's no way to do a query like:
> 
>     dbquery core x86_64 \
>     	    "(pkg.base or pkg.name) == '$pkgbase'" \
> 	    ...
> 
> Which is what most (all?) of the queries in the other patchset would
> become.
> 
> (Drat, it seems that discussing this separately from the other
> patchset won't work after all.)

Well, it's not as complex as you think. No matter what tool is used,
we'd need to manually reassemble the repository layout vs. pacman's
system DBPath. (Whether by doing database loading by hand with low-level
APIs or by providing a custom pacman.conf). But:

Checking a specific arch is as simple as using setarch first.

Current versions of makepkg set the pkgbase regardless of whether the
pkgname is different (with the rationale that there's no reason not to,
and it makes parsing package metadata easier), so we can add this to the
long list of other reasons we want to do a complete rebuild of every
package (cf. reproducible builds, PIE, and more).

And expac supports "repo/package" -- or printing the repo and filtering
that while parsing the filename or what-have-you. That being said, I've
discussed on our development IRC channel, having expac's -Ss mode to
support 'repo/.*' filtering which might be useful too...
Luke Shumaker July 10, 2018, 12:33 a.m. UTC | #4
On Mon, 09 Jul 2018 18:37:04 -0400,
Eli Schwartz wrote:
> On 07/09/2018 01:32 PM, Luke Shumaker wrote:
> > AFAICT, with expac there's no way to do a query like:
> > 
> >     dbquery core x86_64 \
> >         "(pkg.base or pkg.name) == '$pkgbase'" \
> >         ...
> 
> Well, it's not as complex as you think. No matter what tool is used,
> we'd need to manually reassemble the repository layout vs. pacman's
> system DBPath. (Whether by doing database loading by hand with low-level
> APIs or by providing a custom pacman.conf). But:
>
> Checking a specific arch is as simple as using setarch first.

The part I'm concerned about is being able to do is filtering by
pkgbase rather than pkgname; setting up repository layout is simple
enough either way (it just comes down to setting DBpath (and RootDir?)
to a temporary directory and setting
`$repo.Server=file://$FTP_BASE/$repo/os/$arch`).

> Current versions of makepkg set the pkgbase regardless of whether the
> pkgname is different (with the rationale that there's no reason not to,
> and it makes parsing package metadata easier), so we can add this to the
> long list of other reasons we want to do a complete rebuild of every
> package (cf. reproducible builds, PIE, and more).

Ok, but even if/when all packages have pkgbase, expac still doesn't
have the ability to search based on pkgbase instead of pkgname, does
it?

> And expac supports "repo/package" -- or printing the repo and filtering
> that while parsing the filename or what-have-you. That being said, I've
> discussed on our development IRC channel, having expac's -Ss mode to
> support 'repo/.*' filtering which might be useful too...

I guess we could have it print all packages for the repo, include
pkgbase in the output, then filter that with grep...

Patch
diff mbox

diff --git a/cron-jobs/ftpdir-cleanup b/cron-jobs/ftpdir-cleanup
index 9df5f99..77e49c8 100755
--- a/cron-jobs/ftpdir-cleanup
+++ b/cron-jobs/ftpdir-cleanup
@@ -44,7 +44,7 @@  for repo in "${PKGREPOS[@]}"; do
 			fi
 		done | sort > "${WORKDIR}/repo-${repo}-${arch}"
 		# get a list of package files defined in the repo db
-		bsdtar -xOf "${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT}" | awk '/^%FILENAME%/{getline;print}' | sort > "${WORKDIR}/db-${repo}-${arch}"
+		dbquery "$repo" "$arch" True pkg.filename | sort > "${WORKDIR}/db-${repo}-${arch}"
 
 		missing_pkgs=($(comm -13 "${WORKDIR}/repo-${repo}-${arch}" "${WORKDIR}/db-${repo}-${arch}"))
 		if (( ${#missing_pkgs[@]} >= 1 )); then
diff --git a/cron-jobs/sourceballs b/cron-jobs/sourceballs
index 6be28ab..784b48b 100755
--- a/cron-jobs/sourceballs
+++ b/cron-jobs/sourceballs
@@ -24,18 +24,8 @@  for repo in "${PKGREPOS[@]}"; do
 		if [[ ! -f ${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT} ]]; then
 			continue
 		fi
-		bsdtar -xOf "${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT}" \
-			| awk '/^%NAME%/ { getline b };
-				/^%BASE%/ { getline b };
-				/^%VERSION%/ { getline v };
-				/^%LICENSE%/,/^$/ {
-					if ( !/^%LICENSE%/ ) { l=l" "$0 }
-					};
-				/^%ARCH%/ {
-					getline a;
-					printf "%s %s %s %s\n", b, v, a, l;
-					l="";
-				}'
+		dbquery "$repo" "$arch" True \
+			'f"{pkg.base or pkg.name} {pkg.version} {pkg.arch} {'\'' '\''.join(pkg.licenses)}"'
 	done | sort -u > "${WORKDIR}/db-${repo}"
 done
 
diff --git a/db-functions b/db-functions
index 0491c22..f1d821a 100644
--- a/db-functions
+++ b/db-functions
@@ -294,6 +294,31 @@  getpkgfiles() {
 	echo "${files[@]}"
 }
 
+# usage: dbquery repo arch filter_expr output_expr
+dbquery() {
+	local repo=$1
+	local arch=$2
+	local filter=$3
+	local output=$4
+	local dbfile="${FTP_BASE}/${repo}/os/${arch}/${repo}.db"
+
+	python3 - "$dbfile" "$filter" "$output" <<-'EOT'
+		import os.path
+		import sys
+		import tempfile
+		import pyalpm
+		db_dir, db_file = os.path.split(os.path.abspath(sys.argv[1]))
+		with tempfile.TemporaryDirectory() as tmpdirname:
+		    handle = pyalpm.Handle(tmpdirname, tmpdirname)
+		    db = handle.register_syncdb(db_file[:-3], 0)
+		    db.servers = ["file://{}".format(db_dir)]
+		    db.update(False)
+		    for pkg in db.search(".*"):
+		        if eval(sys.argv[2], {}, {"pkg": pkg}):
+		            print(eval(sys.argv[3], {}, {"pkg": pkg}))
+		EOT
+}
+
 check_pkgfile() {
 	local pkgfile=$1
 
diff --git a/test/Dockerfile b/test/Dockerfile
index 83c8449..0d01a75 100644
--- a/test/Dockerfile
+++ b/test/Dockerfile
@@ -1,5 +1,5 @@ 
 FROM archlinux/base
-RUN pacman -Syu --noconfirm --needed sudo fakeroot awk subversion make kcov bash-bats gettext grep
+RUN pacman -Syu --noconfirm --needed sudo fakeroot awk subversion make kcov bash-bats gettext grep pyalpm
 RUN pacman-key --init
 RUN echo '%wheel ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/wheel
 RUN useradd -N -g users -G wheel -d /build -m tester