Proof-of-concept of a dual PHP–Python stack

Message ID 20200307210152.GA2448188@tsubame.mg0.fr
State New
Series Proof-of-concept of a dual PHP–Python stack

Commit Message

Frédéric Mangano-Tarumi March 7, 2020, 9:01 p.m. UTC
Here’s a demonstration of a Python web stack running next to the current
PHP stack, such that it’s invisible to the client.

This approach aims at providing a way to migrate the PHP code base to
Python, one endpoint after another, with as little glue as possible to
have the two backends collaborate. Since they are both mounted on the
web root, Python could implement, say, /packages/ and leave the rest to
PHP. As soon as PHP yields it by returning 404 on this URL, Python will
take over automatically.

To run it, you need python-flask and nginx. Then, you need to start PHP,
Flask, and nginx, in whatever order:

    $ cd path/to/aurweb
    $ AUR_CONFIG="$PWD/conf/config" php -S 127.0.0.1:8080 -t web/html
    $ FLASK_APP=aurweb.wsgi flask run
    $ nginx -p . -c conf/nginx.conf

You may then open http://localhost:8081/ and http://localhost:8081/hello
to check that the former URL goes to PHP and the latter to Flask.

The key concept is nginx’s proxy_next_upstream feature. We set Flask as
a fallback backend, and tell nginx to use the fallback only on 404 from
PHP. See http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream

The main limitation of this approach is that PHP and Python need to use
the same gateway protocol, probably FastCGI or HTTP.

A minor caveat with this system is that the body of the 404 returned by
PHP is lost, though it could contain useful information like “Package X
doesn’t exist”, rather than a generic “Page not found”. Luckily, all the
404 cases are handled by 404.php, so we could port its logic to Flask
and preserve the current behavior.
---
 aurweb/wsgi.py  | 15 +++++++++++++++
 conf/nginx.conf | 23 +++++++++++++++++++++++
 2 files changed, 38 insertions(+)
 create mode 100644 aurweb/wsgi.py
 create mode 100644 conf/nginx.conf

Comments

Lukas Fleischer March 11, 2020, 11:44 p.m. UTC | #1
On Sat, 07 Mar 2020 at 16:01:52, Frédéric Mangano-Tarumi wrote:
> Here’s a demonstration of a Python web stack running next to the current
> PHP stack, such that it’s invisible to the client.
> 
> This approach aims at providing a way to migrate the PHP code base to
> Python, one endpoint after another, with as little glue as possible to
> have the two backends collaborate. Since they are both mounted on the
> web root, Python could implement, say, /packages/ and leave the rest to
> PHP. As soon as PHP yields it by returning 404 on this URL, Python will
> take over automatically.
> [...]
> ---
>  aurweb/wsgi.py  | 15 +++++++++++++++
>  conf/nginx.conf | 23 +++++++++++++++++++++++
>  2 files changed, 38 insertions(+)
>  create mode 100644 aurweb/wsgi.py
>  create mode 100644 conf/nginx.conf

Thanks! I like the approach. I wonder what the performance impact of
always querying the Python backend first would be, though, especially at
the beginning when most requests are expected to yield a 404.

Alternatively, would it make sense to use multiple location blocks and
use the right upstream based on matching the path against a predefined
set of patterns? It would add some additional maintenance work but since
the overall plan is to migrate everything to Python eventually, it would
exist only temporarily.
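
To make the alternative concrete, here is a minimal sketch of routing by
path patterns, written in Python rather than as nginx location blocks.
The pattern list and the `pick_upstream` helper are hypothetical; the
point is only that every ported endpoint needs an entry in such a list:

```python
import re
from typing import List, Pattern

# Hypothetical set of URL patterns already ported to Python. Each newly
# ported endpoint would need a new entry here, which is the extra
# maintenance work mentioned above.
PYTHON_PATTERNS: List[Pattern[str]] = [
    re.compile(r"^/rpc(/|$)"),
    re.compile(r"^/hello$"),
]


def pick_upstream(path: str) -> str:
    """Return "python" if the path matches a ported pattern, else "php"."""
    for pattern in PYTHON_PATTERNS:
        if pattern.search(path):
            return "python"
    return "php"
```

Here `/rpc` and `/rpc/` would go to Python, while `/packages/` and
anything else not listed would still go to PHP without Python being
consulted at all.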

I guess we could use a similar approach if we ever wanted to decouple
certain endpoints completely and make them a separate app (optionally
sharing some code with the "main" backend).

For an actual first patch to be merged, I suggest porting the RPC
interface which is rather small and largely independent from other parts
of the code. This patch should also add instructions to the
documentation: both INSTALL and doc/maintenance.txt need to be updated.
Maybe also add a note to README.md.

Frédéric Mangano-Tarumi March 13, 2020, 6:13 p.m. UTC | #2
Lukas Fleischer [2020-03-11 19:44:37 -0400]
> Thanks! I like the approach. I wonder what the performance impact of
> always querying the Python backend first would be, though, especially at
> the beginning when most requests are expected to yield a 404.

One way or the other, I don’t think it’s worth worrying about since most
404s consist of accessing a local socket and exchanging a few KB. Hardly
any disk or database access is performed. The best way to be sure is to
measure it under load, though. Are the AUR servers often overloaded?
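
For a first rough number, something like the following could time
sequential 404 round trips over a local socket. It is a toy sketch under
loud assumptions: `NotFoundHandler` and `time_404s` are made-up names,
the server is Python's built-in `http.server` rather than PHP or Flask,
and the requests are sequential. It therefore gives an order of
magnitude for the socket exchange only, not a real load test:

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


class NotFoundHandler(BaseHTTPRequestHandler):
    """Toy backend that answers every GET with a small 404."""

    def do_GET(self):
        body = b"Page not found\n"
        self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the timing run quiet


def time_404s(requests: int = 200) -> float:
    """Return the mean latency, in seconds, of sequential 404 round trips."""
    # Port 0 lets the OS pick any free port on the loopback interface.
    server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        start = time.perf_counter()
        for _ in range(requests):
            conn = http.client.HTTPConnection("127.0.0.1", port)
            conn.request("GET", "/missing")
            response = conn.getresponse()
            response.read()
            conn.close()
            if response.status != 404:
                raise RuntimeError(f"unexpected status {response.status}")
        elapsed = time.perf_counter() - start
    finally:
        server.shutdown()
        server.server_close()
    return elapsed / requests


if __name__ == "__main__":
    print(f"mean 404 round trip: {time_404s() * 1000:.3f} ms")
```

A meaningful measurement would point a proper load-testing tool at the
real nginx, PHP and Flask setup instead, and compare it against PHP
alone.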

> Alternatively, would it make sense to use multiple location blocks and
> use the right upstream based on matching the path against a predefined
> set of patterns? It would add some additional maintenance work but since
> the overall plan is to migrate everything to Python eventually, it would
> exist only temporarily.

I couldn’t find a smart way to do it without turning it into a
maintenance burden. Besides, I can’t see any advantage over the fallback
approach, except a performance speedup which I believe would be
irrelevant.

> For an actual first patch to be merged, I suggest porting the RPC
> interface which is rather small and largely independent from other parts
> of the code.

Sure! But first, are there other approaches you would like to try out
before we begin the serious work?

Also, I’d like to make a proposal about regression testing to limit, as
best we can, the potential bugs introduced by the rewrite.

Lukas Fleischer March 15, 2020, 12:25 p.m. UTC | #3
On Fri, 13 Mar 2020 at 14:13:56, Frédéric Mangano-Tarumi wrote:
> Lukas Fleischer [2020-03-11 19:44:37 -0400]
> > Thanks! I like the approach. I wonder what the performance impact of
> > always querying the Python backend first would be, though, especially at
> > the beginning when most requests are expected to yield a 404.
> 
> One way or the other, I don’t think it’s worth worrying about since most
> 404s consist of accessing a local socket and exchanging a few KB. Hardly
> any disk or database access is performed. The best way to be sure is to
> measure it under load, though. Are the AUR servers often overloaded?

Being overloaded is a relative term. Yes, the AUR servers are often
under heavy load, with millions of requests every day.

> > Alternatively, would it make sense to use multiple location blocks and
> > use the right upstream based on matching the path against a predefined
> > set of patterns? It would add some additional maintenance work but since
> > the overall plan is to migrate everything to Python eventually, it would
> > exist only temporarily.
> 
> I couldn’t find a smart way to do it without turning it into a
> maintenance burden. Besides, I can’t see any advantage over the fallback
> approach, except a performance speedup which I believe would be
> irrelevant.

Fair enough. We can always keep it in mind as an alternative solution in
case there are any issues with the fallback approach.

I also wonder whether we should ever use a version with two backends in
production. It might make sense to only switch over once the port has
been completed.

> > For an actual first patch to be merged, I suggest porting the RPC
> > interface which is rather small and largely independent from other parts
> > of the code.
> 
> Sure! But first, are there other approaches you would like to try out
> before we begin the serious work?

Your proposal makes the rewrite relatively easy, has low overhead and at
least one other person actively working on the rewrite (I briefly
discussed it with Filipe) likes it. Unless somebody else wants to
suggest an alternative approach here, I think we're good to go!

> 
> Also, I’d like to make a proposal about regression testing to limit, as
> best we can, the potential bugs introduced by the rewrite.

Great!

Frédéric Mangano-Tarumi March 15, 2020, 1:16 p.m. UTC | #4
Lukas Fleischer [2020-03-15 08:25:31 -0400]
> Being overloaded is a relative term. Yes, the AUR servers are often
> under heavy load, with millions of requests every day.

To get more concrete: are the AUR servers sometimes at 100% CPU
capacity, or do they hardly ever reach the point of saturation? In other
words, can we afford a 1% slowdown? What about 10%?

> It might make sense to only switch over once the port has been
> completed.

I strongly recommend against that.

First, not deploying the Python backend implies we keep developing the
PHP stack too, which in turn means we either need to stop developing
new features, or develop them twice.

Second, deploying a wholly different codebase at once is dreadful for an
actively used website. All the bugs introduced by the rewrite would pop
up simultaneously. This is all the more risky if we decide to adjust
features as we rewrite them. Debugging may also become harder if we
can’t narrow down the commits based on the date the bug appeared. By the
way, I think we should for that reason accelerate the release cycle when
we start porting code.

> Your proposal makes the rewrite relatively easy, has low overhead and at
> least one other person actively working on the rewrite (I briefly
> discussed it with Filipe) likes it. Unless somebody else wants to
> suggest an alternative approach here, I think we're good to go!

All right!

Filipe Laíns March 15, 2020, 1:50 p.m. UTC | #5
On Sun, 2020-03-15 at 14:16 +0100, Frédéric Mangano-Tarumi wrote:
> > It might make sense to only switch over once the port has been
> > completed.
> 
> I strongly recommend against that.
> 
> First, not deploying the Python backend implies we keep developing the
> PHP stack too, which in turn means we either need to stop developing
> new features, or develop them twice.

Not really, we just put it in maintenance mode -- no more features just
bugfixes.

> Second, deploying a wholly different codebase at once is dreadful for an
> actively used website. All the bugs introduced by the rewrite would pop
> up simultaneously. This is all the more risky if we decide to adjust
> features as we rewrite them. Debugging may also become harder if we
> can’t narrow down the commits based on the date the bug appeared. By the
> way, I think we should for that reason accelerate the release cycle when
> we start porting code.

We can deploy it to aur-dev.archlinux.org and have users test it before
we deploy it to the real website.

The reason I don't want to deploy it to the real installation right
away is mainly because we are changing the database structure. The plan
is to move to SQLAlchemy (you already have a patch for this) and then
start implementing the Flask app. If we mess something up in the
database backend and it does not become apparent at the moment, we are
screwing up the production database.

Filipe Laíns

Frédéric Mangano-Tarumi March 15, 2020, 3:43 p.m. UTC | #6
Filipe Laíns [2020-03-15 13:50:39 +0000]
> Not really, we just put it in maintenance mode -- no more features just
> bugfixes.

That won’t benefit the end user, but if we manage to port the code fast
enough I guess it’s a compromise. There’s still the second point though:
stability.

> We can deploy it to aur-dev.archlinux.org and have users test it before
> we deploy it to the real website.

Staging deployments help detect the bigger bugs, but we should still
expect a small portion to be discovered only in production. How about a
2-month release cycle where aur-dev is one release ahead of production?

> The reason I don't want to deploy it to the real installation right
> away is mainly because we are changing the database structure. The plan
> is to move to SQLAlchemy (you already have a patch for this) and then
> start implementing the Flask app. If we mess something up in the
> database backend and it does not become apparent at the moment, we are
> screwing up the production database.

So far aurweb only uses SQLAlchemy Core, which is a neutral Pythonic
wrapper over SQL. Unlike SQLAlchemy ORM, it does not make decisions on
the structure of the database or the operations to perform. For that
reason, I doubt SQLAlchemy will ever be the cause of a screw-up, but if
we’re uncertain we can pass it the raw SQL PHP currently uses, and
modernize it later on.

When I made the SQLAlchemy schema, I double-checked to make sure
SQLAlchemy uses the exact same structure as before, though I must admit
I can’t guarantee the production database is perfectly identical to my
local deployment. I can check it if you send me a structure dump of the
production database. In any case, we won’t run initdb on the production
server, and even if the new schema were to differ, at worst we’d get an
SQL error. Databases are good at detecting that.

Alembic is probably a much bigger factor of risk since database
migrations do alter the structure. However, regardless of when we deploy
it, every single new migration is equally risky. Please also note that
Alembic uses SQLAlchemy only for assisting migration generation, and
that the SQLAlchemy migration of our Python code is completely
independent from the use of Alembic as far as production is concerned.

More generally, I think deploying everything at once or incrementally
won’t affect the possibility of data screw ups. It’s mostly a matter of
how thoroughly each individual piece of code is tested. Deploying
incrementally helps focus on specific parts at a time, which I believe
is an advantage over having users test everything without a plan. If our
testers are overwhelmed by changes, it’s gonna make it harder for them
to notice oddities.

Lukas Fleischer March 15, 2020, 4:16 p.m. UTC | #7
On Sun, 15 Mar 2020 at 09:16:52, Frédéric Mangano-Tarumi wrote:
> Lukas Fleischer [2020-03-15 08:25:31 -0400]
> > Being overloaded is a relative term. Yes, the AUR servers are often
> > under heavy load, with millions of requests every day.
> 
> To get more concrete: are the AUR servers sometimes at 100% CPU
> capacity, or do they hardly ever reach the point of saturation? In other
> words, can we afford a 1% slowdown? What about 10%?

We are in the process of getting a new machine for the AUR and will be
scaling up whenever we are reaching resource limits. So, practically, we
will never reach full utilization unless we're running out of money.
However, it might be better to think of optimizations as tradeoffs
between total time invested and the optimization benefits (reduction in
running expenses, knowledge acquisition of the person implementing the
change, ...)

As I mentioned before, I am fine with trying your fallback approach
first and possibly optimizing later.

Your explanation of the performance hit being relatively small sounds
reasonable but a few simple experiments would be even better.

> > It might make sense to only switch over once the port has been
> > completed.
> 
> I strongly recommend against that.
> 
> First, not deploying the Python backend implies we keep developing the
> PHP stack too, which in turn means we either need to stop developing
> new features, or develop them twice.

It also depends on how long the port will take. We should certainly
prioritize the port and try to defer any feature additions for the time
being.

> Second, deploying a wholly different codebase at once is dreadful for an
> actively used website. All the bugs introduced by the rewrite would pop
> up simultaneously. This is all the more risky if we decide to adjust
> features as we rewrite them. Debugging may also become harder if we
> can’t narrow down the commits based on the date the bug appeared. By the
> way, I think we should for that reason accelerate the release cycle when
> we start porting code.

That's a good point. I agree that introducing the rewrite gradually is a
good idea.

Patch

diff --git a/aurweb/wsgi.py b/aurweb/wsgi.py
new file mode 100644
index 00000000..fd6b67d3
--- /dev/null
+++ b/aurweb/wsgi.py
@@ -0,0 +1,15 @@ 
+from flask import Flask, request
+
+
+def create_app():
+    app = Flask(__name__)
+
+    @app.route('/hello', methods=['GET', 'POST'])
+    def hello():
+        return (
+            f"{request.method} {request.url}\n"
+            f"{request.headers}"
+            f"{request.get_data(as_text=True)}\n"
+        ), {'Content-Type': 'text/plain'}
+
+    return app
diff --git a/conf/nginx.conf b/conf/nginx.conf
new file mode 100644
index 00000000..8e6e4edb
--- /dev/null
+++ b/conf/nginx.conf
@@ -0,0 +1,23 @@ 
+events {
+}
+
+daemon off;
+error_log /dev/stderr info;
+pid nginx.pid;
+
+http {
+	access_log /dev/stdout;
+
+	upstream aurweb {
+		server 127.0.0.1:8080 max_fails=0;
+		server 127.0.0.1:5000 backup max_fails=0;
+	}
+
+	server {
+		listen 8081;
+		location / {
+			proxy_pass http://aurweb;
+			proxy_next_upstream http_404 non_idempotent;
+		}
+	}
+}