From patchwork Mon Jul 9 01:14:00 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Luke Shumaker
X-Patchwork-Id: 684
From: Luke Shumaker
To: arch-projects@archlinux.org
Date: Sun, 8 Jul 2018 21:14:00 -0400
Message-Id: <20180709011400.554-1-lukeshu@lukeshu.com>
X-Mailer: git-send-email 2.17.1
Subject: [arch-projects] [dbscripts] [PATCH] Don't parse .db files ourselves; use pyalpm instead
List-Id: Arch Linux projects development discussion

From: Luke Shumaker

In a patchset that I recently submitted, Eli was concerned that I was
parsing .db files with bsdtar+awk, when the format of .db files isn't
"public"; the only guarantee made about it is that libalpm can parse
it.

https://lists.archlinux.org/pipermail/arch-projects/2018-June/004932.html

I wasn't too concerned, because `ftpdir-cleanup` and `sourceballs`
already parse the .db files in the same way.  Nonetheless, I think Eli
is right: we shouldn't be parsing these files ourselves.

So, add a `dbquery` function that uses pyalpm to parse the .db files:

 - It takes two Python 3 expressions as arguments:
    1. one that returns a bool deciding whether we want to print
       information on a package, and
    2. another that returns the string to print for a package.
   Currently, all callers use "True" for the decider expression, as
   ftpdir-cleanup and sourceballs operate on *every* package.
   However, I'm including a way to filter packages because I also want
   to parse .db files in other places.

 - libalpm doesn't offer an easy way to say "parse this DB file for
   me"; instead, we must construct a configuration that has a syncdb
   pointing to that file, which we then have it sync into a temporary
   directory.

As a final note, when rewriting the bit of sourceballs to use dbquery
instead of AWK, I realized that it does not correctly handle licenses
that have a space in them (as of 2018-07-07 there are 67 packages in
the Arch repos that have a license containing a space).  I did not fix
this bug; I merely translated it from AWK to Python, as the program
would also need to be adjusted elsewhere.
---
 cron-jobs/ftpdir-cleanup |  2 +-
 cron-jobs/sourceballs    | 14 ++------------
 db-functions             | 25 +++++++++++++++++++++++++
 test/Dockerfile          |  2 +-
 4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/cron-jobs/ftpdir-cleanup b/cron-jobs/ftpdir-cleanup
index 9df5f99..77e49c8 100755
--- a/cron-jobs/ftpdir-cleanup
+++ b/cron-jobs/ftpdir-cleanup
@@ -44,7 +44,7 @@ for repo in "${PKGREPOS[@]}"; do
 			fi
 		done | sort > "${WORKDIR}/repo-${repo}-${arch}"
 		# get a list of package files defined in the repo db
-		bsdtar -xOf "${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT}" | awk '/^%FILENAME%/{getline;print}' | sort > "${WORKDIR}/db-${repo}-${arch}"
+		dbquery "$repo" "$arch" True pkg.filename | sort > "${WORKDIR}/db-${repo}-${arch}"
 
 		missing_pkgs=($(comm -13 "${WORKDIR}/repo-${repo}-${arch}" "${WORKDIR}/db-${repo}-${arch}"))
 		if (( ${#missing_pkgs[@]} >= 1 )); then
diff --git a/cron-jobs/sourceballs b/cron-jobs/sourceballs
index 6be28ab..784b48b 100755
--- a/cron-jobs/sourceballs
+++ b/cron-jobs/sourceballs
@@ -24,18 +24,8 @@ for repo in "${PKGREPOS[@]}"; do
 		if [[ ! -f ${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT} ]]; then
 			continue
 		fi
-		bsdtar -xOf "${FTP_BASE}/${repo}/os/${arch}/${repo}${DBEXT}" \
-			| awk '/^%NAME%/ { getline b };
-				/^%BASE%/ { getline b };
-				/^%VERSION%/ { getline v };
-				/^%LICENSE%/,/^$/ {
-					if ( !/^%LICENSE%/ ) { l=l" "$0 }
-				};
-				/^%ARCH%/ {
-					getline a;
-					printf "%s %s %s %s\n", b, v, a, l;
-					l="";
-				}'
+		dbquery "$repo" "$arch" True \
+			'f"{pkg.base or pkg.name} {pkg.version} {pkg.arch} {'\'' '\''.join(pkg.licenses)}"'
 	done | sort -u > "${WORKDIR}/db-${repo}"
 done
 
diff --git a/db-functions b/db-functions
index 0491c22..f1d821a 100644
--- a/db-functions
+++ b/db-functions
@@ -294,6 +294,31 @@ getpkgfiles() {
 	echo "${files[@]}"
 }
 
+# usage: dbquery repo arch filter_expr output_expr
+dbquery() {
+	local repo=$1
+	local arch=$2
+	local filter=$3
+	local output=$4
+	local dbfile="${FTP_BASE}/${repo}/os/${arch}/${repo}.db"
+
+	python3 - "$dbfile" "$filter" "$output" <<-'EOT'
+		import os.path
+		import sys
+		import tempfile
+		import pyalpm
+		db_dir, db_file = os.path.split(os.path.abspath(sys.argv[1]))
+		with tempfile.TemporaryDirectory() as tmpdirname:
+		    handle = pyalpm.Handle(tmpdirname, tmpdirname)
+		    db = handle.register_syncdb(db_file[:-3], 0)
+		    db.servers = ["file://{}".format(db_dir)]
+		    db.update(False)
+		    for pkg in db.search(".*"):
+		        if eval(sys.argv[2], {}, {"pkg": pkg}):
+		            print(eval(sys.argv[3], {}, {"pkg": pkg}))
+	EOT
+}
+
 check_pkgfile() {
 	local pkgfile=$1
 
diff --git a/test/Dockerfile b/test/Dockerfile
index 83c8449..0d01a75 100644
--- a/test/Dockerfile
+++ b/test/Dockerfile
@@ -1,5 +1,5 @@
 FROM archlinux/base
-RUN pacman -Syu --noconfirm --needed sudo fakeroot awk subversion make kcov bash-bats gettext grep
+RUN pacman -Syu --noconfirm --needed sudo fakeroot awk subversion make kcov bash-bats gettext grep pyalpm
 RUN pacman-key --init
 RUN echo '%wheel ALL=(ALL) NOPASSWD: ALL' > /etc/sudoers.d/wheel
 RUN useradd -N -g users -G wheel -d /build -m tester
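
Illustrative note: the way dbquery evaluates its two expression
arguments, and the license-with-space problem described in the commit
message, can be sketched without pyalpm.  This is only a rough sketch:
the `query` helper, the SimpleNamespace stand-ins, and every sample
value are hypothetical; only the attribute names and the two eval()
calls mirror what the patch does.

```python
from types import SimpleNamespace


def query(pkgs, filter_expr, output_expr):
    """Mimic dbquery's loop: eval both expressions with only `pkg` in scope."""
    lines = []
    for pkg in pkgs:
        if eval(filter_expr, {}, {"pkg": pkg}):
            lines.append(str(eval(output_expr, {}, {"pkg": pkg})))
    return lines


# Hypothetical stand-ins for pyalpm package objects (sample data is made up).
pkgs = [
    SimpleNamespace(name="foo", base=None, version="1.0-1", arch="x86_64",
                    licenses=["GPL"],
                    filename="foo-1.0-1-x86_64.pkg.tar.xz"),
    SimpleNamespace(name="bar-docs", base="bar", version="2.3-1", arch="any",
                    licenses=["custom:Bar License", "MIT"],
                    filename="bar-docs-2.3-1-any.pkg.tar.xz"),
]

# Same output expression that the sourceballs hunk passes to dbquery.
SOURCEBALLS_EXPR = (
    'f"{pkg.base or pkg.name} {pkg.version} {pkg.arch} '
    '{\' \'.join(pkg.licenses)}"'
)

# ftpdir-cleanup style: every package, print the file name.
all_files = query(pkgs, "True", "pkg.filename")

# A non-trivial filter, which no caller in the patch uses yet.
only_any = query(pkgs, "pkg.arch == 'any'", "pkg.name")

# sourceballs style: licenses are joined with spaces, so a license that
# itself contains a space can no longer be told apart from two licenses.
sourceballs_lines = query(pkgs, "True", SOURCEBALLS_EXPR)
license_fields = sourceballs_lines[1].split()[3:]
```

In this sketch the second package has two licenses, but splitting its
output line on whitespace yields three license fields, which is the
ambiguity the commit message says it carries over from the AWK version.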