[pacman-dev] Increase maximum database size

Message ID 20200118234233.62094-1-allan@archlinux.org
State Accepted, archived
Headers show
Series [pacman-dev] Increase maximum database size | expand

Commit Message

Allan McRae Jan. 18, 2020, 11:42 p.m. UTC
We previously has the maximum database size as 25MB.  This was set in the days
before repos had as many packages as they do now, and before we started
distributing files databases.  Increase this limit to 128MB.

Signed-off-by: Allan McRae <allan@archlinux.org>
---

So this has been hit in the wild.   Manjaro patches their pacman package to
allow databases of 32MB, because their [community] repo files database
breaks the 25MB limit.   But being Manjaro, the patch was never forwarded
upstream, just like everything they have ever done.

People in Arch are no better.  A bug was reported, but some idiot (named
Antonio Rojas) closed the bug as "not a bug", because it was not an
Arch repo running into the issue.

So I only discovered this by seeing a closed bug report.


Now, onto the change...  this is ~4x bigger than anything seen in the
wild currently.  Is that enough of an increasse.


 lib/libalpm/be_sync.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eli Schwartz Jan. 19, 2020, 12:36 a.m. UTC | #1
On 1/18/20 6:42 PM, Allan McRae wrote:
> We previously has the maximum database size as 25MB.  This was set in the days
> before repos had as many packages as they do now, and before we started
> distributing files databases.  Increase this limit to 128MB.

What ever happened to that long-ago idea to make .sig files be
downloaded on-demand rather than embedding them into the .db ? This
would have the added bonus of making downloads actually not be so big by
default...

Another potential optimization ISTR us discussing is making
community.files not include the content from community.db, and providing
e.g. community.alldb for anyone who needs the combined form.

Aside for that... we currently use .gz for databases on our official
infrastructure. We could get much better compression than that, I'm
sure. e.g. why not use xz there? We could even use xz -9, since the
databases tend to be fairly conservative in size so optimizing
decompression speed by switching to zstd is not really so important IMO,
and xz with level -9 compression can beat zstd -20 in both size and
compression speed.

community.files.gz when recompressed with either xz or zstd drops from
20MB to 15MB; exact numbers look like this:

$ du -b /var/lib/pacman/sync/community.files /tmp/community.files.*
20769830	/var/lib/pacman/sync/community.files
14969268	/tmp/community.files.xz
15090081	/tmp/community.files.zst

...

I'm not really a fan of just bumping the size forever, because it seems
to me people who are running into this issue are indeed doing something
they shouldn't. A 128MB repository that consumes 128MB of bandwidth on
every pacman -Syu just because a single package has been updated is
really not nice... I feel like the proper solution is more aggressive
compression, figuring out why these .files databases are actually so
huge (nodejs packages are probably a really annoying problem because
that completely ridiculous language will ship an application composed of
several hundred thousand micro-files, and the files database needs to
record every single path, so I'd quite like nodejs packaging to die a
horrible death), and, if databases are still running into size limits,
shipping packages in split repositories.

Splitting out more repositories is not just about "fooling" pacman into
splitting the limits up among multiple repos. It's about making a
single-package update only trigger an update to one of the splitted
repos. This strikes me as exactly the purpose of instituting a size
limit to begin with!
Andrew Gregory Jan. 19, 2020, 1:09 a.m. UTC | #2
On 01/19/20 at 09:42am, Allan McRae wrote:
> We previously has the maximum database size as 25MB.  This was set in the days
> before repos had as many packages as they do now, and before we started
> distributing files databases.  Increase this limit to 128MB.
> 
> Signed-off-by: Allan McRae <allan@archlinux.org>
> ---
> 
> So this has been hit in the wild.   Manjaro patches their pacman package to
> allow databases of 32MB, because their [community] repo files database
> breaks the 25MB limit.   But being Manjaro, the patch was never forwarded
> upstream, just like everything they have ever done.
> 
> People in Arch are no better.  A bug was reported, but some idiot (named
> Antonio Rojas) closed the bug as "not a bug", because it was not an
> Arch repo running into the issue.
> 
> So I only discovered this by seeing a closed bug report.
> 
> 
> Now, onto the change...  this is ~4x bigger than anything seen in the
> wild currently.  Is that enough of an increasse.

ACK.  Just update the comment.
 
>  lib/libalpm/be_sync.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/libalpm/be_sync.c b/lib/libalpm/be_sync.c
> index 07d2b4ae..a7050290 100644
> --- a/lib/libalpm/be_sync.c
> +++ b/lib/libalpm/be_sync.c
> @@ -224,7 +224,7 @@ int SYMEXPORT alpm_db_update(int force, alpm_db_t *db)
>  		int sig_ret = 0;
>  
>  		/* set hard upper limit of 25MiB */
> -		payload.max_size = 25 * 1024 * 1024;
> +		payload.max_size = 128 * 1024 * 1024;
>  
>  		/* print server + filename into a buffer */
>  		len = strlen(server) + strlen(db->treename) + strlen(dbext) + 2;
> -- 
> 2.25.0
Allan McRae Jan. 19, 2020, 1:29 a.m. UTC | #3
On 19/1/20 10:36 am, Eli Schwartz wrote:
> On 1/18/20 6:42 PM, Allan McRae wrote:
>> We previously has the maximum database size as 25MB.  This was set in the days
>> before repos had as many packages as they do now, and before we started
>> distributing files databases.  Increase this limit to 128MB.
> 
> What ever happened to that long-ago idea to make .sig files be
> downloaded on-demand rather than embedding them into the .db ? This
> would have the added bonus of making downloads actually not be so big by
> default...

Well...  it makes a difference for sync dbs, but barely makes a
difference for files dbs, which is where we are hitting the limit.

Also, out download output when downloading .sig files is rather bad...

> Another potential optimization ISTR us discussing is making
> community.files not include the content from community.db, and providing
> e.g. community.alldb for anyone who needs the combined form.

Many things have been discussed.

> Aside for that... we currently use .gz for databases on our official
> infrastructure. We could get much better compression than that, I'm
> sure. e.g. why not use xz there? We could even use xz -9, since the
> databases tend to be fairly conservative in size so optimizing
> decompression speed by switching to zstd is not really so important IMO,
> and xz with level -9 compression can beat zstd -20 in both size and
> compression speed.

That is all Arch Linux related, and decisions that don't happen on this
list.   But the speed of reading from a zstd file wins greatly over xz,
and that should be a consideration.

> community.files.gz when recompressed with either xz or zstd drops from
> 20MB to 15MB; exact numbers look like this:
> 
> $ du -b /var/lib/pacman/sync/community.files /tmp/community.files.*
> 20769830	/var/lib/pacman/sync/community.files
> 14969268	/tmp/community.files.xz
> 15090081	/tmp/community.files.zst
> 
> ...
> 
> I'm not really a fan of just bumping the size forever, because it seems
> to me people who are running into this issue are indeed doing something
> they shouldn't. A 128MB repository that consumes 128MB of bandwidth on
> every pacman -Syu just because a single package has been updated is
> really not nice... I feel like the proper solution is more aggressive
> compression, figuring out why these .files databases are actually so
> huge (nodejs packages are probably a really annoying problem because
> that completely ridiculous language will ship an application composed of
> several hundred thousand micro-files, and the files database needs to
> record every single path, so I'd quite like nodejs packaging to die a
> horrible death), and, if databases are still running into size limits,
> shipping packages in split repositories.
> 
> Splitting out more repositories is not just about "fooling" pacman into
> splitting the limits up among multiple repos. It's about making a
> single-package update only trigger an update to one of the splitted
> repos. This strikes me as exactly the purpose of instituting a size
> limit to begin with!
> 

None of this is the role of pacman to decide.  If a distribution (or
custom repo) wants to ship a database file or greater than 25MB in size,
I see no reason not to support that in pacman.

In fact, I'm of the opinion that the upper limit should be removed
altogether.  From memory, it was to prevent infinite download DoS type
attacks.  But we can stop that with a Ctrl+C anyway...  With the
patchset I'm working on that verifies repos before overwriting the old
ones, this becomes even less of an issue.

Allan
Allan McRae Jan. 19, 2020, 1:31 a.m. UTC | #4
On 19/1/20 11:09 am, Andrew Gregory wrote:
> On 01/19/20 at 09:42am, Allan McRae wrote:
>> We previously has the maximum database size as 25MB.  This was set in the days
>> before repos had as many packages as they do now, and before we started
>> distributing files databases.  Increase this limit to 128MB.
>>
>> Signed-off-by: Allan McRae <allan@archlinux.org>
>> ---
>>
>> So this has been hit in the wild.   Manjaro patches their pacman package to
>> allow databases of 32MB, because their [community] repo files database
>> breaks the 25MB limit.   But being Manjaro, the patch was never forwarded
>> upstream, just like everything they have ever done.
>>
>> People in Arch are no better.  A bug was reported, but some idiot (named
>> Antonio Rojas) closed the bug as "not a bug", because it was not an
>> Arch repo running into the issue.
>>
>> So I only discovered this by seeing a closed bug report.
>>
>>
>> Now, onto the change...  this is ~4x bigger than anything seen in the
>> wild currently.  Is that enough of an increasse.
> 
> ACK.  Just update the comment.
>  

Oops!   Done.

>>  lib/libalpm/be_sync.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/lib/libalpm/be_sync.c b/lib/libalpm/be_sync.c
>> index 07d2b4ae..a7050290 100644
>> --- a/lib/libalpm/be_sync.c
>> +++ b/lib/libalpm/be_sync.c
>> @@ -224,7 +224,7 @@ int SYMEXPORT alpm_db_update(int force, alpm_db_t *db)
>>  		int sig_ret = 0;
>>  
>>  		/* set hard upper limit of 25MiB */
>> -		payload.max_size = 25 * 1024 * 1024;
>> +		payload.max_size = 128 * 1024 * 1024;
>>  
>>  		/* print server + filename into a buffer */
>>  		len = strlen(server) + strlen(db->treename) + strlen(dbext) + 2;
>> -- 
>> 2.25.0
> .
>
Allan McRae Jan. 19, 2020, 10:04 a.m. UTC | #5
On 19/1/20 9:42 am, Allan McRae wrote:
> We previously has the maximum database size as 25MB.  This was set in the days
> before repos had as many packages as they do now, and before we started
> distributing files databases.  Increase this limit to 128MB.
> 
> Signed-off-by: Allan McRae <allan@archlinux.org>
> ---
> 
> So this has been hit in the wild.   Manjaro patches their pacman package to
> allow databases of 32MB, because their [community] repo files database
> breaks the 25MB limit.   But being Manjaro, the patch was never forwarded
> upstream, just like everything they have ever done.
> 
> People in Arch are no better.  A bug was reported, but some idiot (named
> Antonio Rojas) closed the bug as "not a bug", because it was not an
> Arch repo running into the issue.
> 
> So I only discovered this by seeing a closed bug report.
> 

I'm going to retract my statement there.   Manjaro did report this, in
the #archlinux-projects channel.  Not the right place, but at least they
tried.  Reportedly Arch team in there started discussing doing things
like splitting repos etc to avoid this, rather than reporting or fixing
the bug.

So Manjaro did try to help here.  And it turns out I need to extend my
idiot count to the Arch team members involved in the discussion in the
#archlinux-projects channel who did not forward the report from Manjaro
to the right place.

Allan
Eli Schwartz Jan. 19, 2020, 2:55 p.m. UTC | #6
On 1/19/20 5:04 AM, Allan McRae wrote:
> On 19/1/20 9:42 am, Allan McRae wrote:
>> We previously has the maximum database size as 25MB.  This was set in the days
>> before repos had as many packages as they do now, and before we started
>> distributing files databases.  Increase this limit to 128MB.
>>
>> Signed-off-by: Allan McRae <allan@archlinux.org>
>> ---
>>
>> So this has been hit in the wild.   Manjaro patches their pacman package to
>> allow databases of 32MB, because their [community] repo files database
>> breaks the 25MB limit.   But being Manjaro, the patch was never forwarded
>> upstream, just like everything they have ever done.
>>
>> People in Arch are no better.  A bug was reported, but some idiot (named
>> Antonio Rojas) closed the bug as "not a bug", because it was not an
>> Arch repo running into the issue.
>>
>> So I only discovered this by seeing a closed bug report.
>>
> 
> I'm going to retract my statement there.   Manjaro did report this, in
> the #archlinux-projects channel.  Not the right place, but at least they
> tried.  Reportedly Arch team in there started discussing doing things
> like splitting repos etc to avoid this, rather than reporting or fixing
> the bug.
> 
> So Manjaro did try to help here.  And it turns out I need to extend my
> idiot count to the Arch team members involved in the discussion in the
> #archlinux-projects channel who did not forward the report from Manjaro
> to the right place.

Well, we could stop leaping to assumptions.

Someone very rapidly pointed out:
"i think the best place would be #pacman-dev or the mailinglist, as
thats where the consensus for an upstream change needs to take place"

(followup, in case you're thinking the obvious: "ups, its
#archlinux-pacman")

While it is true that no Arch team members took up the ball on that one,
one could reasonably assume that everyone figured Jonathon from Manjaro
would do so, and never got around to verifying that that happened.
Personally, I'd forgotten about this whole conversation.

Patch

diff --git a/lib/libalpm/be_sync.c b/lib/libalpm/be_sync.c
index 07d2b4ae..a7050290 100644
--- a/lib/libalpm/be_sync.c
+++ b/lib/libalpm/be_sync.c
@@ -224,7 +224,7 @@  int SYMEXPORT alpm_db_update(int force, alpm_db_t *db)
 		int sig_ret = 0;
 
 		/* set hard upper limit of 25MiB */
-		payload.max_size = 25 * 1024 * 1024;
+		payload.max_size = 128 * 1024 * 1024;
 
 		/* print server + filename into a buffer */
 		len = strlen(server) + strlen(db->treename) + strlen(dbext) + 2;