[pacman-dev] Add a timestamp file into repo tarballs

Message ID 20191105135434.1017126-1-allan@archlinux.org
State Deferred, archived
Headers show
Series [pacman-dev] Add a timestamp file into repo tarballs | expand

Commit Message

Allan McRae Nov. 5, 2019, 1:54 p.m. UTC
When creating or modifying repo tarballs, place a .TIMESTAMP file with
seconds since epoch in it.  This will be used in the future to enable
rejecting databases older that a given threshold.

Also skip reading the .TIMESTAMP file in sync_db_populate().

Signed-off-by: Allan McRae <allan@archlinux.org>
---

Anyone want to check my logic in the sync_db_populate() populate change?
Repo-add puts the .TIMESTAMP file first when calling bsdtar, so if present
it will be first in the repo db file.  Otherwise the first item read from
the tarball will be a directory, which we skip reading anyway.  So I just
read the header for the first item and discard it.

 lib/libalpm/be_sync.c  | 9 +++++++++
 scripts/repo-add.sh.in | 8 +++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

Comments

Morten Linderud Nov. 5, 2019, 1:58 p.m. UTC | #1
On Tue, Nov 05, 2019 at 11:54:34PM +1000, Allan McRae wrote:
> When creating or modifying repo tarballs, place a .TIMESTAMP file with
> seconds since epoch in it.  This will be used in the future to enable
> rejecting databases older that a given threshold.
> 
> Also skip reading the .TIMESTAMP file in sync_db_populate().
> 
> Signed-off-by: Allan McRae <allan@archlinux.org>
> ---
> 
> Anyone want to check my logic in the sync_db_populate() populate change?
> Repo-add puts the .TIMESTAMP file first when calling bsdtar, so if present
> it will be first in the repo db file.  Otherwise the first item read from
> the tarball will be a directory, which we skip reading anyway.  So I just
> read the header for the first item and discard it.
> 
>  lib/libalpm/be_sync.c  | 9 +++++++++
>  scripts/repo-add.sh.in | 8 +++++---
>  2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/libalpm/be_sync.c b/lib/libalpm/be_sync.c
> index 2c76fe83..041b2266 100644
> --- a/lib/libalpm/be_sync.c
> +++ b/lib/libalpm/be_sync.c
> @@ -506,6 +506,13 @@ static int sync_db_populate(alpm_db_t *db)
>  		goto cleanup;
>  	}
>  
> +	/* the .TIMESTAMP file will be first entry in the repo archive if present.
> +	 * If not, the first entry will be a directory and can be skipped too */
> +	if((archive_ret = archive_read_next_header(archive, &entry)) != ARCHIVE_OK) {
> +		ret = -1;
> +		goto readfail;
> +	}
> +
>  	while((archive_ret = archive_read_next_header(archive, &entry)) == ARCHIVE_OK) {
>  		mode_t mode = archive_entry_mode(entry);
>  		if(!S_ISDIR(mode)) {
> @@ -518,6 +525,8 @@ static int sync_db_populate(alpm_db_t *db)
>  			}
>  		}
>  	}
> +
> +readfail:
>  	if(archive_ret != ARCHIVE_EOF) {
>  		_alpm_log(db->handle, ALPM_LOG_ERROR, _("could not read db '%s' (%s)\n"),
>  				db->treename, archive_error_string(archive));
> diff --git a/scripts/repo-add.sh.in b/scripts/repo-add.sh.in
> index caf1232d..c87409f1 100644
> --- a/scripts/repo-add.sh.in
> +++ b/scripts/repo-add.sh.in
> @@ -526,6 +526,7 @@ create_db() {
>  	TAR_OPT=$(verify_repo_extension "$REPO_DB_FILE")
>  	# $LOCKFILE is already guaranteed to be absolute so this is safe
>  	dirname=${LOCKFILE%/*}
> +	timestamp=$(date +%s)

This should probably utilize SOURCE_DATE_EPOCH or something equivalent?

timestamp=$(date --date="@${SOURCE_DATE_EPOCH:-$(date +%s)}" +%s))

>  	for repo in "db" "files"; do
>  		filename=${REPO_DB_PREFIX}.${repo}.${REPO_DB_SUFFIX}
> @@ -533,12 +534,13 @@ create_db() {
>  		tempname=$dirname/.tmp.$filename
>  
>  		pushd "$tmpdir/$repo" >/dev/null
> +		echo $timestamp > .TIMESTAMP
>  		if ( shopt -s nullglob; files=(*); (( ${#files[*]} )) ); then
> -			bsdtar -c ${TAR_OPT} -f "$tempname" *
> +			bsdtar -c ${TAR_OPT} -f "$tempname" .TIMESTAMP *
>  		else
> -			# we have no packages remaining? zip up some emptyness
> +			# we have no packages remaining
>  			warning "$(gettext "No packages remain, creating empty database.")"
> -			bsdtar -c ${TAR_OPT} -f "$tempname" -T /dev/null
> +			bsdtar -c ${TAR_OPT} -f "$tempname" .TIMESTAMP
>  		fi
>  		popd >/dev/null
>  
> -- 
> 2.23.0
Allan McRae Nov. 5, 2019, 2:03 p.m. UTC | #2
On 5/11/19 11:58 pm, Morten Linderud wrote:
> On Tue, Nov 05, 2019 at 11:54:34PM +1000, Allan McRae wrote:
>> When creating or modifying repo tarballs, place a .TIMESTAMP file with
>> seconds since epoch in it.  This will be used in the future to enable
>> rejecting databases older that a given threshold.
>>
>> Also skip reading the .TIMESTAMP file in sync_db_populate().
>>
>> Signed-off-by: Allan McRae <allan@archlinux.org>
>> ---
>>

<snip>

>> diff --git a/scripts/repo-add.sh.in b/scripts/repo-add.sh.in
>> index caf1232d..c87409f1 100644
>> --- a/scripts/repo-add.sh.in
>> +++ b/scripts/repo-add.sh.in
>> @@ -526,6 +526,7 @@ create_db() {
>>  	TAR_OPT=$(verify_repo_extension "$REPO_DB_FILE")
>>  	# $LOCKFILE is already guaranteed to be absolute so this is safe
>>  	dirname=${LOCKFILE%/*}
>> +	timestamp=$(date +%s)
> 
> This should probably utilize SOURCE_DATE_EPOCH or something equivalent?
> 
> timestamp=$(date --date="@${SOURCE_DATE_EPOCH:-$(date +%s)}" +%s))

Why?  I can see no reason why it should...
Eli Schwartz Nov. 5, 2019, 2:12 p.m. UTC | #3
On 11/5/19 9:03 AM, Allan McRae wrote:
> On 5/11/19 11:58 pm, Morten Linderud wrote:
>> On Tue, Nov 05, 2019 at 11:54:34PM +1000, Allan McRae wrote:
>>> When creating or modifying repo tarballs, place a .TIMESTAMP file with
>>> seconds since epoch in it.  This will be used in the future to enable
>>> rejecting databases older that a given threshold.
>>>
>>> Also skip reading the .TIMESTAMP file in sync_db_populate().
>>>
>>> Signed-off-by: Allan McRae <allan@archlinux.org>
>>> ---
>>>
> 
> <snip>
> 
>>> diff --git a/scripts/repo-add.sh.in b/scripts/repo-add.sh.in
>>> index caf1232d..c87409f1 100644
>>> --- a/scripts/repo-add.sh.in
>>> +++ b/scripts/repo-add.sh.in
>>> @@ -526,6 +526,7 @@ create_db() {
>>>  	TAR_OPT=$(verify_repo_extension "$REPO_DB_FILE")
>>>  	# $LOCKFILE is already guaranteed to be absolute so this is safe
>>>  	dirname=${LOCKFILE%/*}
>>> +	timestamp=$(date +%s)
>>
>> This should probably utilize SOURCE_DATE_EPOCH or something equivalent?
>>
>> timestamp=$(date --date="@${SOURCE_DATE_EPOCH:-$(date +%s)}" +%s))
> 
> Why?  I can see no reason why it should...

I don't either see value in "reproducible builds" for the actual state
of the database. It's just a series of plaintext pointers to some other
(hopefully reproducible) packages.

If we actually did want to respect SOURCE_DATE_EPOCH, we'd need to do a
lot more, like doing that for the bsdtar metadata (both file timestamps
and file owners, probably sort files too, etc.) but again, I don't see
how this protects the supply chain.
Morten Linderud Nov. 5, 2019, 2:16 p.m. UTC | #4
On Wed, Nov 06, 2019 at 12:03:17AM +1000, Allan McRae wrote:
> On 5/11/19 11:58 pm, Morten Linderud wrote:
> > On Tue, Nov 05, 2019 at 11:54:34PM +1000, Allan McRae wrote:
> >> When creating or modifying repo tarballs, place a .TIMESTAMP file with
> >> seconds since epoch in it.  This will be used in the future to enable
> >> rejecting databases older that a given threshold.
> >>
> >> Also skip reading the .TIMESTAMP file in sync_db_populate().
> >>
> >> Signed-off-by: Allan McRae <allan@archlinux.org>
> >> ---
> >>
> 
> <snip>
> 
> >> diff --git a/scripts/repo-add.sh.in b/scripts/repo-add.sh.in
> >> index caf1232d..c87409f1 100644
> >> --- a/scripts/repo-add.sh.in
> >> +++ b/scripts/repo-add.sh.in
> >> @@ -526,6 +526,7 @@ create_db() {
> >>  	TAR_OPT=$(verify_repo_extension "$REPO_DB_FILE")
> >>  	# $LOCKFILE is already guaranteed to be absolute so this is safe
> >>  	dirname=${LOCKFILE%/*}
> >> +	timestamp=$(date +%s)
> > 
> > This should probably utilize SOURCE_DATE_EPOCH or something equivalent?
> > 
> > timestamp=$(date --date="@${SOURCE_DATE_EPOCH:-$(date +%s)}" +%s))
> 
> Why?  I can see no reason why it should...

If you wan't to write tests for `repo-add` in the future, I think it will be
beneficial to be able to create consistent databases.

Outside of pacman I believe being able to reproduce any artifact produced is
desireable. Enables us to not run through hoops recreating past database files
given the correct packages.
Morten Linderud Nov. 5, 2019, 2:19 p.m. UTC | #5
On Tue, Nov 05, 2019 at 09:12:33AM -0500, Eli Schwartz wrote:
> On 11/5/19 9:03 AM, Allan McRae wrote:
> > On 5/11/19 11:58 pm, Morten Linderud wrote:
> >> On Tue, Nov 05, 2019 at 11:54:34PM +1000, Allan McRae wrote:
> >>> When creating or modifying repo tarballs, place a .TIMESTAMP file with
> >>> seconds since epoch in it.  This will be used in the future to enable
> >>> rejecting databases older that a given threshold.
> >>>
> >>> Also skip reading the .TIMESTAMP file in sync_db_populate().
> >>>
> >>> Signed-off-by: Allan McRae <allan@archlinux.org>
> >>> ---
> >>>
> > 
> > <snip>
> > 
> >>> diff --git a/scripts/repo-add.sh.in b/scripts/repo-add.sh.in
> >>> index caf1232d..c87409f1 100644
> >>> --- a/scripts/repo-add.sh.in
> >>> +++ b/scripts/repo-add.sh.in
> >>> @@ -526,6 +526,7 @@ create_db() {
> >>>  	TAR_OPT=$(verify_repo_extension "$REPO_DB_FILE")
> >>>  	# $LOCKFILE is already guaranteed to be absolute so this is safe
> >>>  	dirname=${LOCKFILE%/*}
> >>> +	timestamp=$(date +%s)
> >>
> >> This should probably utilize SOURCE_DATE_EPOCH or something equivalent?
> >>
> >> timestamp=$(date --date="@${SOURCE_DATE_EPOCH:-$(date +%s)}" +%s))
> > 
> > Why?  I can see no reason why it should...
> 
> I don't either see value in "reproducible builds" for the actual state
> of the database. It's just a series of plaintext pointers to some other
> (hopefully reproducible) packages.
> 
> If we actually did want to respect SOURCE_DATE_EPOCH, we'd need to do a
> lot more, like doing that for the bsdtar metadata (both file timestamps
> and file owners, probably sort files too, etc.) but again, I don't see
> how this protects the supply chain.

Hmm, should probably discuss the threat model or attack vectors in
#archlinux-reproducible.

Patch

diff --git a/lib/libalpm/be_sync.c b/lib/libalpm/be_sync.c
index 2c76fe83..041b2266 100644
--- a/lib/libalpm/be_sync.c
+++ b/lib/libalpm/be_sync.c
@@ -506,6 +506,13 @@  static int sync_db_populate(alpm_db_t *db)
 		goto cleanup;
 	}
 
+	/* the .TIMESTAMP file will be first entry in the repo archive if present.
+	 * If not, the first entry will be a directory and can be skipped too */
+	if((archive_ret = archive_read_next_header(archive, &entry)) != ARCHIVE_OK) {
+		ret = -1;
+		goto readfail;
+	}
+
 	while((archive_ret = archive_read_next_header(archive, &entry)) == ARCHIVE_OK) {
 		mode_t mode = archive_entry_mode(entry);
 		if(!S_ISDIR(mode)) {
@@ -518,6 +525,8 @@  static int sync_db_populate(alpm_db_t *db)
 			}
 		}
 	}
+
+readfail:
 	if(archive_ret != ARCHIVE_EOF) {
 		_alpm_log(db->handle, ALPM_LOG_ERROR, _("could not read db '%s' (%s)\n"),
 				db->treename, archive_error_string(archive));
diff --git a/scripts/repo-add.sh.in b/scripts/repo-add.sh.in
index caf1232d..c87409f1 100644
--- a/scripts/repo-add.sh.in
+++ b/scripts/repo-add.sh.in
@@ -526,6 +526,7 @@  create_db() {
 	TAR_OPT=$(verify_repo_extension "$REPO_DB_FILE")
 	# $LOCKFILE is already guaranteed to be absolute so this is safe
 	dirname=${LOCKFILE%/*}
+	timestamp=$(date +%s)
 
 	for repo in "db" "files"; do
 		filename=${REPO_DB_PREFIX}.${repo}.${REPO_DB_SUFFIX}
@@ -533,12 +534,13 @@  create_db() {
 		tempname=$dirname/.tmp.$filename
 
 		pushd "$tmpdir/$repo" >/dev/null
+		echo $timestamp > .TIMESTAMP
 		if ( shopt -s nullglob; files=(*); (( ${#files[*]} )) ); then
-			bsdtar -c ${TAR_OPT} -f "$tempname" *
+			bsdtar -c ${TAR_OPT} -f "$tempname" .TIMESTAMP *
 		else
-			# we have no packages remaining? zip up some emptyness
+			# we have no packages remaining
 			warning "$(gettext "No packages remain, creating empty database.")"
-			bsdtar -c ${TAR_OPT} -f "$tempname" -T /dev/null
+			bsdtar -c ${TAR_OPT} -f "$tempname" .TIMESTAMP
 		fi
 		popd >/dev/null