Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Internal Server Error 😢 🔥 #325

Open
1 task
nelsonic opened this issue Nov 2, 2023 · 46 comments
Open
1 task

Internal Server Error 😢 🔥 #325

nelsonic opened this issue Nov 2, 2023 · 46 comments
Assignees
Labels
BLOCKED :fire: Core team's HIGHEST priority, blocking critical work chore a tedious but necessary task often paying technical debt elixir Pull requests that update Elixir code help wanted If you can help make progress with this issue, please comment! in-progress An issue or pull request that is being worked on by the assigned person priority-1 Highest priority issue. This is costing us money every minute that passes. question A question needs to be answered before progress can be made on this issue tech-debt A feature/requirement implemented in a sub-optimal way & must be re-written technical A technical issue that requires understanding of the code, infrastructure or dependencies

Comments

@nelsonic
Copy link
Member

nelsonic commented Nov 2, 2023

Is anyone else seeing an Internal Server Error on https://auth.dwyl.com/ ??

IMG_4570

Todo

  • Recover the Fly.io instance and reboot the authprod app. 🤞
@nelsonic nelsonic added help wanted If you can help make progress with this issue, please comment! question A question needs to be answered before progress can be made on this issue priority-1 Highest priority issue. This is costing us money every minute that passes. chore a tedious but necessary task often paying technical debt technical A technical issue that requires understanding of the code, infrastructure or dependencies BLOCKED :fire: Core team's HIGHEST priority, blocking critical work tech-debt A feature/requirement implemented in a sub-optimal way & must be re-written elixir Pull requests that update Elixir code labels Nov 2, 2023
@iteles
Copy link
Member

iteles commented Nov 2, 2023

Yep. dwyl/mvp#440

@iteles
Copy link
Member

iteles commented Nov 2, 2023

Suspect it has something to do with this email received from fly.io 16 hours ago:
image

@LuchoTurtle
Copy link
Member

Also confirm this error on my end 😢

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Lame. Thanks fly.io. The reward for being a loyal+paying customer... 🤦‍♂️

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Thankfully we have a "Staging" instance that still works flawlessly: https://authdemo.fly.dev/
So I'm temporarily switching over to that on the MVP until we can recover authprod.

image

Just re-deploying now because the machine restart didn't work to update the environment variable. 🙃

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

MVP now redirects to authdemo:
image

Works for me:

image

@iteles / @LuchoTurtle please confirm you're able to access your items using the authdemo login.
Thanks.

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Definitely still going to try and recover the authprod instance when I have a bit more time. (maybe only tomorrow)

@iteles
Copy link
Member

iteles commented Nov 2, 2023

Interesting. It's working for me on Firefox desktop!

On Chrome desktop, I'm still getting an internal server error straight away (on the start page, not even getting to the login portion) even after clearing cookies and cached data 🤷‍♀️

But fixed for now, no point troubleshooting this further until we're back on prod.

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

@iteles In Chrome, if you visit https://mvp.fly.dev/ what do you see?

@iteles
Copy link
Member

iteles commented Nov 2, 2023

image

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Well that's mega frustrating! Thank you Google Chrome! 🤦‍♂️
But it's useful feedback that someone using the App could encounter the same error. 💭 🙃
Glad you proactively tested it in Firefox. 👌
Can you still see all your items?

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Works on my phone:

IMG_4593

but doesn’t look great. Some tidy up to be done. ⏳

@iteles
Copy link
Member

iteles commented Nov 2, 2023

Works on Chrome on my phone but I can't see any of my items.

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Ok. So that means your items are tied to the person.id from authprod. 🤦‍♂️
This is very useful insight. Need to make the person.id a cid based on the authprovider -> userid 💭

@LuchoTurtle
Copy link
Member

LuchoTurtle commented Nov 2, 2023

Only works if I clear my cookies. I'm doing this on the web.
Although it now works after clearing them, all of my items are gone as well.

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Trying to re-deploy authprod ... 🤞

n@MBP auth % fly deploy --verbose
==> Verifying app config
Validating /auth/fly.toml
Platform: machines
✓ Configuration is valid
--> Verified app config
==> Building image
Waiting for remote builder fly-builder-solitary-rain-5830... 🌎

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Useless! 😢

WARN Remote builder did not start in time. Check remote builder logs with `flyctl logs -a fly-builder-solitary-rain-5830`

WARN Failed to start remote builder heartbeat: remote builder app unavailable

WARN failed to create build in graphql: returned error 504 Gateway Timeout: <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>


Error: failed to fetch an image or build from source: error connecting to docker: server returned a non-200 status code: 503

@nelsonic
Copy link
Member Author

nelsonic commented Nov 2, 2023

Web UI is unresponsive: https://fly.io/apps/authprod 🤷‍♂️ 😢

image

Gotta get going to Lx now. Will resume trying to figure this out later tonight. 🤞

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

Ok. Step 1: Attempt to create a backup of authprodb in case any of this goes wrong. 🔙

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

Following this helpful article: https://fly.io/docs/postgres/managing/backup-and-restore/

fly volumes list -a authprodb

ID                  	STATE  	NAME   	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM	CREATED AT
vol_g2yxp4mllj6v63qd	created	pg_data	10GB	lhr   	34c3	false    	           	2 years ago
vol_1g67340ggznrydxw	created	pg_data	10GB	lhr   	805e	false    	           	2 years ago

ATTACHED VM: blank. There's your problem ...

image

Note: "ECNRYPTED: false" ... Not encrypted by Fly.io.
But all the sensitive data is encrypted using fields.

List the snapshots:

fly volumes snapshots list vol_g2yxp4mllj6v63qd
Snapshots
ID                 	SIZE    	CREATED AT
vs_QGPN6pQObpLQpUma	96479939	10 hours ago
vs_za8NQRN70Dmnat9K	96479939	1 day ago
vs_8zZLpNxD3VB9Qhom	95992552	2 days ago
vs_OANN6OZkLJLNfz7 	95895907	3 days ago
vs_GpqzBBJgL7DmltGK	95019181	4 days ago
vs_bjjMlxqMzXjl7sB 	94680799	5 days ago

Dunno why it's continuing to take "dumb" snapshots if the data isn't changing. 🤷‍♂️

So that guide doesn't actually deliver on "backing up" the DB.
So I searched for: https://www.google.com/search?q=fly.io+postgres+backup+to+localhost

And the first hit is: https://community.fly.io/t/backup-and-restore-postgresql-in-fly-io-step-by-step/11455
image

No formatting of the input/output and using Windows 🙄 ... But it's better than nothing. Going to follow the steps.

First thing I had to do was find out what the Postgres password is for the authprodb instance as I don't have in any password manager.

https://community.fly.io/t/lost-postgres-password/4192/2

fly ssh console
echo $DATABASE_URL
postgres://authprod:[email protected]:5432/authprod

This also tells us that the internal TCP port used is 5432. 👍

Attempt to proxy the internal TCP port to a local one:

flyctl proxy 15432:5432 -a authprodb

Sadly, we get:

Error: authprodb.internal: host was not found in DNS

Not good.
How about connecting directly to the Postgres VM?

fly ssh console -a authprodb

Got:

Error: app authprodb has no started VMs.
It may be unhealthy or not have been deployed yet.
Try the following command to verify:

fly status

Tried:

fly status -a authprodb

got:

No machines are available on this app authprodb

But the Fly.io Web interface says it's "deployed": https://fly.io/apps/authprodb 🤷‍♂️

image

But the instance doesn't have an IP Address so it's not addressable ... 🙈 🙃
Going to try and temporarily allocate an IP V4 Address: https://fly.io/docs/flyctl/ips/#usage

flyctl ips allocate-v4 -a authprodb

Prompt:

? Looks like you're accessing a paid feature. Dedicated IPv4 addresses now cost $2/mo. Are you ok with this? (y/N)

Sure, why not? If $2/month resolves this (and I will immediately remove the IP address) let's go!

Appears to have worked:

VERSION	IP          	TYPE  	REGION	CREATED AT
v4     	137.66.62.95	public	global	19s ago

137.66.62.95 is now listed in the IP Addresses section:

image

Let's try connecting to it again. 🤞

fly ssh console -a authprodb

No started VMs:

Error: app authprodb has no started VMs.
It may be unhealthy or not have been deployed yet.
Try the following command to verify:

fly status

Great. So it allows me to attach a paid IP address to a VM that is "deployed" but not reachable. 👌 (🤦‍♂️)

Given that it's useless to us, let's release that IPV4 address before we forget:

flyctl ips release 137.66.62.95

Confirmed:

Released 137.66.62.95 from authprod

Going to try a different approach. Let's grab the latest snapshot and create a new Postgres instance from it! 🆕 🤞
That way if I mess up, it's just on a new VM that can easily be dropped.

Recall from above that the latest snapshot ID was vs_QGPN6pQObpLQpUma ⬆️

Run:

fly postgres create --snapshot-id vs_QGPN6pQObpLQpUma

But got:

Error:  "Failed to resolve source machine from snapshot."

😢

Try to fork the db:

fly postgres create --fork-from authprodb

Fails:

Error: No machines associated with fork-from target. See `fly pg create --help` for more information

Helpful. 🤦‍♂️

Reading: https://fly.io/docs/postgres/postgres-on-nomad/#upgrading-the-postgres-app

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

Tried:

flyctl image show -a authprodb

Output:

MACHINE ID	REGISTRY	REPOSITORY	TAG	VERSION	DIGEST	LABELS

🦗 🦗 🦗... (i.e: nothing...!)

Reading this very recent forum topic: https://community.fly.io/t/postgres-crashed-cannot-restart-restore-or-fork/16186

Sadly not a very helpful comment from the OP: 🙃

image

Read: https://fly.io/docs/flyctl/postgres-restart/

Tried:

flyctl postgres restart -a authprodb

Got:

Error: no active leader found

Read: https://fly.io/docs/apps/restart/

Tried restarting:

fly apps restart authprodb

Got:

Error: postgres apps should use `fly pg restart` instead

Great! Thanks fly ... 🤦‍♂️

Trying to "deploy" a machine:

fly deploy -a authprodb --verbose

Got the following output + error:

-------
 ✖ release_command failed
-------
Error release_command failed running on machine 1857439c5996d8 with exit code 1.
Check its logs: here's the last 100 lines below, or run 'fly logs -i 1857439c5996d8':
  Pulling container image registry.fly.io/authprodb:deployment-01HEA0YDEKTNN69C15H8MBZ7XJ
  Successfully prepared image registry.fly.io/authprodb:deployment-01HEA0YDEKTNN69C15H8MBZ7XJ (4.634590051s)
  Configuring firecracker
  [    0.042013] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
  [    0.047890] PCI: Fatal: No config space access function found
   INFO Starting init (commit: 15238e9)...
   INFO Preparing to run: `/app/bin/migrate` as nobody
   INFO [fly api proxy] listening at /.fly/api
  2023/11/03 07:22:14 listening on [fdaa:0:42c6:a7b:1bf:a08f:f999:2]:22 (DNS: [fdaa::3]:53)
  ERROR! Config provider Config.Reader failed with:
  ** (RuntimeError) environment variable DATABASE_URL is missing.
  For example: ecto://USER:PASS@HOST/DATABASE
      (stdlib 3.16.1) erl_eval.erl:685: :erl_eval.do_apply/6
      (stdlib 3.16.1) erl_eval.erl:446: :erl_eval.expr/5
      (stdlib 3.16.1) erl_eval.erl:123: :erl_eval.exprs/5
      (elixir 1.12.3) src/elixir.erl:280: :elixir.recur_eval/3
      (elixir 1.12.3) src/elixir.erl:265: :elixir.eval_forms/3
      (elixir 1.12.3) lib/code.ex:656: Code.eval_string_with_error_handling/3
      (elixir 1.12.3) lib/config.ex:258: Config.__eval__!/3
      (elixir 1.12.3) lib/config/reader.ex:86: Config.Reader.read!/2
  {"init terminating in do_boot",{#{'__exception__'=>true,'__struct__'=>'Elixir.RuntimeError',message=><<101,110,118,105,114,111,110,109,101,110,116,32,118,97,114,105,97,98,108,101,32,68,65,84,65,66,65,83,69,95,85,82,76,32,105,115,32,109,105,115,115,105,110,103,46,10,70,111,114,32,101,120,97,109,112,108,101,58,32,101,99,116,111,58,47,47,85,83,69,82,58,80,65,83,83,64,72,79,83,84,47,68,65,84,65,66,65,83,69,10>>},[{erl_eval,do_apply,6,[{file,"erl_eval.erl"},{line,685}]},{erl_eval,expr,5,[{file,"erl_eval.erl"},{line,446}]},{erl_eval,exprs,5,[{file,"erl_eval.erl"},{line,123}]},{elixir,recur_eval,3,[{file,"src/elixir.erl"},{line,280}]},{elixir,eval_forms,3,[{file,"src/elixir.erl"},{line,265}]},{'Elixir.Code',eval_string_with_error_handling,3,[{file,"lib/code.ex"},{line,656}]},{'Elixir.Config','__eval__!',3,[{file,"lib/config.ex"},{line,258}]},{'Elixir.Config.Reader','read!',2,[{file,"lib/config/reader.ex"},{line,86}]}]}}
  init terminating in do_boot ({,[{erl_eval,do_apply,6,[{_},{_}]},{erl_eval,expr,5,[{_},{_}]},{erl_eval,exprs,5,[{_},{_}]},{elixir,recur_eval,3,[{_},{_}]},{elixir,eval_forms,3,[{_},{_}]},{Elixir.Code,ev
  Crash dump is being written to: erl_crash.dump...done
   INFO Main child exited normally with code: 1
   INFO Starting clean up.
   WARN hallpass exited, pid: 314, status: signal: 15 (SIGTERM)
  2023/11/03 07:22:15 listening on [fdaa:0:42c6:a7b:1bf:a08f:f999:2]:22 (DNS: [fdaa::3]:53)
  [    2.293675] reboot: Restarting system
  machine restart policy set to 'no', not restarting
-------
Error: release command failed - aborting deployment. error release_command machine 1857439c5996d8 exited with non-zero status of 1

Dunno why it's trying to run the release command ... obviously it's going to fail!!

Trying to fork from the volume:

fly postgres create --fork-from authprodb:vol_g2yxp4mllj6v63qd

Output:

Creating app...
Setting secrets on app authprodb2...
Provisioning 1 of 1 machines with image flyio/postgres-flex:15.3@sha256:5e5fc53decb051f69b0850f0f5d137c92343fcd1131ec413015e5260620947ac
Waiting for machine to start...
Machine 1857509c162168 is created
==> Monitoring health checks
  Waiting for 1857509c162168 to become healthy (started, 3/3)

Postgres cluster authprodb2 created
  Username:    postgres
  Password:    redacted
  Hostname:    authprodb2.internal
  Flycast:     fdaa:0:42c6:0:1::6
  Proxy port:  5432
  Postgres port:  5433
  Connection string: postgres://postgres:[email protected]:5432

Save your credentials in a secure place -- you won't be able to see them again!

Connect to postgres
Any app within the DWYL Auth organization can connect to this Postgres using the above connection string

Now that you've set up Postgres, here's what you need to understand: https://fly.io/docs/postgres/getting-started/what-you-should-know/

That appears to have worked:

Reading: https://fly.io/docs/postgres/getting-started/what-you-should-know/
Recommends using a different service if you want managed Postgres ...
Rabbit hole ...

Back to getting this working!

Tried:

flyctl postgres attach authprodb2 -a authprod

Got:

Error: consumer app "authprod" already contains a secret named DATABASE_URL

Duh.

Let's quickly connect to the newly created authprodb2 and see if it has data:

fly postgres connect -a authprodb2

List the databases:

\l

                                                List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    | ICU Locale | Locale Provider |   Access privileges
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 repmgr    | repmgr   | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
(4 rows)

None of these look like what we need. 🙄

Going to need to take a step back ... 💭

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

Connecting to the (working) demo version of auth Postgres instance:

fly postgres connect -a authdemo-db

We see that the databases are:

\l
                                    List of databases
   Name    |   Owner    | Encoding |  Collate   |   Ctype    |     Access privileges
-----------+------------+----------+------------+------------+---------------------------
 authdemo  | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 |
 postgres  | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 |
 template0 | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 | =c/flypgadmin            +
           |            |          |            |            | flypgadmin=CTc/flypgadmin
 template1 | flypgadmin | UTF8     | en_US.utf8 | en_US.utf8 | =c/flypgadmin            +
           |            |          |            |            | flypgadmin=CTc/flypgadmin

Instead of the repmgr there is an authdemo database. 💭

As per the echo $DATABASE_URL above:

postgres://authprod:[email protected]:5432/authprod

We're expecting a database called authprod to exist in the newly created Postgres instance. 🤷‍♂️

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

This got me scratching my head ... 🤔
If we look at the https://fly.io/apps/authdemo-db/volumes we see that the authdemo-db is using 182 MB the available volume:

image

Whereas the authprodb volumes https://fly.io/apps/authprodb/volumes are using 0mb ...
image

Going to contact Fly.io support.

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

Just in case I was going crazy, I tried to deploy the authprod app one last time:

fly deploy --verbose

Output:

==> Verifying app config
Validating /auth/fly.toml
Platform: machines
✓ Configuration is valid
--> Verified app config
==> Building image
Remote builder fly-builder-solitary-rain-5830 ready
==> Building image with Docker
--> docker host: 20.10.12 linux x86_64
[+] Building 1.8s (30/30) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                                                                                     1.0s
 => => transferring dockerfile: 32B                                                                                                                                                                                      1.0s
 => [internal] load .dockerignore                                                                                                                                                                                        1.0s
 => => transferring context: 34B                                                                                                                                                                                         1.0s
 => [internal] load metadata for docker.io/library/debian:bullseye-20210902-slim                                                                                                                                         0.3s
 => [internal] load metadata for docker.io/hexpm/elixir:1.12.3-erlang-24.1.5-debian-bullseye-20210902-slim                                                                                                               0.4s
 => [internal] load build context                                                                                                                                                                                        0.3s
 => => transferring context: 29.65kB                                                                                                                                                                                     0.3s
 => [stage-1 1/6] FROM docker.io/library/debian:bullseye-20210902-slim@sha256:e3ed4be20c22a1358020358331d177aa2860632f25b21681d79204ace20455a6                                                                           0.0s
 => [builder  1/18] FROM docker.io/hexpm/elixir:1.12.3-erlang-24.1.5-debian-bullseye-20210902-slim@sha256:a2447f82e9125abb7edb0708eb51da638d8f40e7fda4b3ed5151bee883d476be                                               0.0s
 => CACHED [stage-1 2/6] RUN apt-get update -y && apt-get install -y libstdc++6 openssl libncurses5 locales   && apt-get clean && rm -f /var/lib/apt/lists/*_*                                                           0.0s
 => CACHED [stage-1 3/6] RUN sed -i '/en_US.UTF-8/s/^# //g' /etc/locale.gen && locale-gen                                                                                                                                0.0s
 => CACHED [stage-1 4/6] WORKDIR /app                                                                                                                                                                                    0.0s
 => CACHED [stage-1 5/6] RUN chown nobody /app                                                                                                                                                                           0.0s
 => CACHED [builder  2/18] RUN apt-get update -y && apt-get install -y build-essential git     && apt-get clean && rm -f /var/lib/apt/lists/*_*                                                                          0.0s
 => CACHED [builder  3/18] WORKDIR /app                                                                                                                                                                                  0.0s
 => CACHED [builder  4/18] RUN mix local.hex --force &&     mix local.rebar --force                                                                                                                                      0.0s
 => CACHED [builder  5/18] COPY .env_sample ./                                                                                                                                                                           0.0s
 => CACHED [builder  6/18] COPY mix.exs mix.lock ./                                                                                                                                                                      0.0s
 => CACHED [builder  7/18] RUN mix deps.get --only prod                                                                                                                                                                  0.0s
 => CACHED [builder  8/18] RUN mkdir config                                                                                                                                                                              0.0s
 => CACHED [builder  9/18] COPY config/config.exs config/prod.exs config/                                                                                                                                                0.0s
 => CACHED [builder 10/18] RUN mix deps.compile                                                                                                                                                                          0.0s
 => CACHED [builder 11/18] COPY priv priv                                                                                                                                                                                0.0s
 => CACHED [builder 12/18] COPY assets assets                                                                                                                                                                            0.0s
 => CACHED [builder 13/18] RUN mix assets.deploy                                                                                                                                                                         0.0s
 => CACHED [builder 14/18] COPY lib lib                                                                                                                                                                                  0.0s
 => CACHED [builder 15/18] RUN mix compile                                                                                                                                                                               0.0s
 => CACHED [builder 16/18] COPY config/runtime.exs config/                                                                                                                                                               0.0s
 => CACHED [builder 17/18] COPY rel rel                                                                                                                                                                                  0.0s
 => CACHED [builder 18/18] RUN mix release                                                                                                                                                                               0.0s
 => CACHED [stage-1 6/6] COPY --from=builder --chown=nobody:root /app/_build/prod/rel/auth ./                                                                                                                            0.0s
 => exporting to image                                                                                                                                                                                                   0.0s
 => => exporting layers                                                                                                                                                                                                  0.0s
 => => writing image sha256:5ba72f99464041b84ed2e1491eb709c1518dff88852706229e8d82f0f3af2d4c                                                                                                                             0.0s
 => => naming to registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02                                                                                                                                          0.0s
--> Building image done
==> Pushing image to fly
The push refers to repository [registry.fly.io/authprod]
7b5d131176f6: Layer already exists
4d5c3ee4879b: Layer already exists
0166b8c07570: Layer already exists
4dfd7d2f8a33: Layer already exists
11a55e4db61e: Layer already exists
ba5a5fe43301: Layer already exists
deployment-01HEADJTX39GT10NA7TY49HG02: digest: sha256:f5d586ae7234d29f8ef485ce6c06e5ec469f6ae66d77886459780fc172337147 size: 1575
--> Pushing image done
image: registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02
image size: 120 MB

Watch your deployment at https://fly.io/apps/authprod/monitoring

Running authprod release_command: /app/bin/migrate

-------
 ✖ release_command failed
-------
Error release_command failed running on machine 48ed037a343368 with exit code 1.
Check its logs: here's the last 100 lines below, or run 'fly logs -i 48ed037a343368':
  Pulling container image registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02
  Successfully prepared image registry.fly.io/authprod:deployment-01HEADJTX39GT10NA7TY49HG02 (5.834199334s)
  [    0.041401] Spectre V2 : WARNING: Unprivileged eBPF is enabled with eIBRS on, data leaks possible via Spectre v2 BHB attacks!
  [    0.042773] PCI: Fatal: No config space access function found
   INFO Starting init (commit: 15238e9)...
   INFO Preparing to run: `/app/bin/migrate` as nobody
   INFO [fly api proxy] listening at /.fly/api
  2023/11/03 11:02:57 listening on [fdaa:0:42c6:a7b:1be:99c2:78d9:2]:22 (DNS: [fdaa::3]:53)
  11:02:58.710 [error] Postgrex.Protocol (#PID<0.136.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
  11:02:58.710 [error] Postgrex.Protocol (#PID<0.135.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
  11:02:59.943 [error] Postgrex.Protocol (#PID<0.136.0>) failed to connect: ** (DBConnection.ConnectionError) tcp connect (top2.nearest.of.authprodb.internal:5432): non-existing domain - :nxdomain
  11:03:00.660 [error] Could not create schema migrations table. This error usually happens due to the following:
    * The database does not exist
    * The "schema_migrations" table, which Ecto uses for managing
      migrations, was defined by another library
    * There is a deadlock while migrating (such as using concurrent
      indexes with a migration_lock)
  To fix the first issue, run "mix ecto.create".
  To address the second, you can run "mix ecto.drop" followed by
  "mix ecto.create". Alternatively you may configure Ecto to use
  another table and/or repository for managing migrations:
      config :auth, Auth.Repo,
        migration_source: "some_other_table_for_schema_migrations",
        migration_repo: AnotherRepoForSchemaMigrations
  The full error report is shown below.
  ** (DBConnection.ConnectionError) connection not available and request was dropped from queue after 2975ms. This means requests are coming in and your connection pool cannot serve them fast enough. You can address this by:
    1. Ensuring your database is available and that you can connect to it
    2. Tracking down slow queries and making sure they are running fast enough
    3. Increasing the pool_size (although this increases resource consumption)
    4. Allowing requests to wait longer by increasing :queue_target and :queue_interval
  See DBConnection.start_link/2 for more information
      (ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:760: Ecto.Adapters.SQL.raise_sql_call_error/1
      (elixir 1.12.3) lib/enum.ex:1582: Enum."-map/2-lists^map/1-0-"/2
      (ecto_sql 3.7.2) lib/ecto/adapters/sql.ex:852: Ecto.Adapters.SQL.execute_ddl/4
      (ecto_sql 3.7.2) lib/ecto/migrator.ex:678: Ecto.Migrator.verbose_schema_migration/3
      (ecto_sql 3.7.2) lib/ecto/migrator.ex:504: Ecto.Migrator.lock_for_migrations/4
      (ecto_sql 3.7.2) lib/ecto/migrator.ex:419: Ecto.Migrator.run/4
      (ecto_sql 3.7.2) lib/ecto/migrator.ex:146: Ecto.Migrator.with_repo/3
      (auth 1.6.6) lib/auth/release.ex:12: anonymous fn/2 in Auth.Release.migrate/0
   INFO Main child exited normally with code: 1
   WARN Reaped child process with pid: 359 and signal: SIGUSR1, core dumped? false
   INFO Starting clean up.
   WARN hallpass exited, pid: 306, status: signal: 15 (SIGTERM)
  2023/11/03 11:03:01 listening on [fdaa:0:42c6:a7b:1be:99c2:78d9:2]:22 (DNS: [fdaa::3]:53)
  [    5.255192] reboot: Restarting system
  machine restart policy set to 'no', not restarting
-------
Error: release command failed - aborting deployment. error release_command machine 48ed037a343368 exited with non-zero status of 1

fly deploy definitely doesn't work because it cannot run the mix release command because there is no Postgres instance available to migrate. 🙅

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

Posted Forum topic: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247

Snapshot:
image

Full text of my forum topic in case it gets deleted:

Opening this question in case anyone `else` is in the same boat (having their `Nomad`-based `Postgres` instance killed ...) 

# Context

We deployed our [`auth`](https://github.com/dwyl/auth) (`Elixir/Phoenix`) app to Fly.io in 2022. :rocket: 
It was a good experience and worked flawlessly until they killed it. :neutral_face: 

# Fly.io App Disabled 💀

We received the following email from Fly.io: 

![image|690x464](upload://xgMVmyDeZnevWHqqpE3bKQLJFHV.png)

## `Fulltext` of the email for Google indexing/findability 📧

```md
Hi. We’re in the process of removing Nomad from Fly.io. This means any apps that haven’t been upgraded to Apps V2 will no longer function.

Your app authprodb is one of those apps. Here’s what this means for you:

1. authprodb is currently disabled.
  - That means the Nomad instances (virtual machines) 
     that ran your app have been shut down, but that the app’s configuration still exists.
2. Your app is technically now on the “machines” platform, rather than “nomad.”
  - That means that any subsequent fly deploy on your app will attempt to launch Fly Machines.
  - In many cases, a deploy will immediately work. In some cases, there may be some additional configuration required to get your app running on Machines. 
    The community forum is a great place to get help troubleshooting your deployment if you get stuck.
3. Nothing has been deleted.
  - Your app’s Docker image still exists, and any Volumes that were attached to your app still exist.

Thanks for bearing with us throughout this whole migration saga. It’s not been easy, but the results should soon speak for themselves.

If you have any questions or comments, we’ve posted a similar announcement on the community forum.

Thanks, Your Friends at Fly.io

They key point of the email:
they shut down our Postgres instance
and according to point 3 "Nothing has been deleted".

"Nothing has been deleted" ... 💭

So I dived into trying to recover the database.
Tried re-deploying the authprod App that was attached to the authprodb Postgres instance:

fly deploy --verbose

It errors because the Elixir App cannot run the mix release command:

Error: release command failed - aborting deployment. error release_command machine 48ed037a343368 exited with non-zero status of 1

Full log output in: #325 (comment)

After a couple of hours searching and trying everything I could think of with zero progress.
I'm opening this forum topic.

Question: Why is the Volume Used 0 MB?

Viewing the volumes for the authprodb, they have Used 0 MB i.e. no data
https://fly.io/apps/authprodb/volumes
image

image

Luckily, we have a working "staging" version of the auth App also hosted on Fly.io: https://authdemo.fly.dev/
and can easily compare the Volume: https://fly.io/apps/authdemo-db/volumes we see that the authdemo-db is using 182 MB the available volume:

image

This suggests that our data was in fact deleted when Fly.io disabled our App.

If anyone has seen this issue and can help us recover our Postgres data, please comment! 🙏

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

Replied to the support email:

image

Ball is in their court. 🎾
I get the sinking feeling that the data for authprodb is lost. 💭
Which is going to be a huge headache. 😖
But let's wait and see what they say. ⏳

AFK to help parents with something for the next couple of hours. 🤞

@nelsonic
Copy link
Member Author

nelsonic commented Nov 3, 2023

Meanwhile I've wasted my whole morning on this ... ⏳ 🔥 😢
image

@nelsonic
Copy link
Member Author

nelsonic commented Nov 7, 2023

Followed-up with [email protected] by email. If the data for authprodb is in fact gone, we need to move on. 💭

@nelsonic
Copy link
Member Author

nelsonic commented Nov 8, 2023

Got a couple of replies on the forum:
https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/5
image

Tried creating a fork from the second volume:

fly postgres create --fork-from authprodb:vol_1g67340ggznrydxw

Output:

? Select VM size: shared-cpu-1x - CPU Kind: Shared, vCPUs: 1 Memory: 256MB
Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app authprodb3...
Provisioning 1 of 1 machines with image flyio/postgres-flex:15.3@sha256:44b698752cf113110f2fa72443d7fe452b48228aafbb0d93045ef1e3282360a6
Waiting for machine to start...
Machine e82d377a069728 is created
==> Monitoring health checks
  Waiting for e82d377a069728 to become healthy (started, 3/3)

Postgres cluster authprodb3 created
  Username:    postgres
  Password:    redacted
  Hostname:    authprodb3.internal
  Flycast:     fdaa:0:42c6:0:1::7
  Proxy port:  5432
  Postgres port:  5433
  Connection string: postgres://postgres:[email protected]:5432

Save your credentials in a secure place -- you won't be able to see them again!

Connect to postgres
Any app within the DWYL Auth organization can connect to this Postgres using the above connection string

Now that you've set up Postgres, here's what you need to understand: https://fly.io/docs/postgres/getting-started/what-you-should-know/

Connect:

fly postgres connect -a authprodb3
\l
                                                List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    | ICU Locale | Locale Provider |   Access privileges
-----------+----------+----------+------------+------------+------------+-----------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 repmgr    | repmgr   | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            |
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 |            | libc            | =c/postgres          +
           |          |          |            |            |            |                 | postgres=CTc/postgres
(4 rows)

these-are-not-the-tables

I replied on the thread.

@nelsonic
Copy link
Member Author

Got a reply on the forum (but still nothing in email ...)
https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/8
image

Only saw it now ... didn't get a notification from the forum. We need a "Reminders" feature.

Going to try the command now:

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region CHOOSE --org your-org --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_yaddayaddayadda

Re-ran the command to check the snapshots list:

fly volume snapshots list vol_g2yxp4mllj6v63qd

Snapshots
ID                 	SIZE    	CREATED AT
vs_vYXP8wx6ezM53CR 	96479939	11 hours ago
vs_vnAvN76g36Apwt2A	96479939	1 day ago
vs_azyMXoAexXOG9tA7	96479939	2 days ago
vs_JjO3bbl7nbV2oSX5	96479939	3 days ago
vs_VD9Zm3LRVqQ6lfz7	96479939	4 days ago
vs_gnkGR2jXV7yM3sqN	96479939	5 days ago

Using the latest snapshot: vs_vYXP8wx6ezM53CR

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_vYXP8wx6ezM53CR
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...

Gotta go be a Dad for a few hours. BRB. ⏳

@nelsonic
Copy link
Member Author

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_vYXP8wx6ezM53CR
Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...
Machine e784434c420ed8 is created
==> Monitoring health checks
  Waiting for e784434c420ed8 to become healthy (started, 0/3)
Error: context deadline exceeded

great. going to have to run the command again ...

@nelsonic
Copy link
Member Author

Posted a comment on the forum topic to this effect:
https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/9

Not much else I can do on this right now.
Other than creating a new empty DB and starting again. 💭

@nelsonic
Copy link
Member Author

nelsonic commented Nov 23, 2023

Given the error above, I had deleted new-db to avoid incurring unnecessary charges.
So I had to re-run the command to re-create new-db:

fly postgres create --name new-db --vm-size shared-cpu-2x --volume-size 10 --initial-cluster-size 1 --region lhr --org dwyl-auth-546 --image-ref registry-1.docker.io/flyio/postgres:13 --stolon --snapshot-id vs_jXlNXQLxgLDeXclV

But sadly, Fly.io is not fully operational right now: https://fly.io/dashboard/dwyl-auth/status

fly-status-docker-fail

So the response is:

Creating app...
Error: server returned a non-200 status code: 504

Will need to revisit this in the morning. ⏳

@nelsonic
Copy link
Member Author

Resume: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/14

@nelsonic
Copy link
Member Author

At home now and still getting error trying to revive the DB:

Creating postgres cluster in organization dwyl-auth-546
Creating app...
Setting secrets on app new-db...
Restoring 1 of 1 machines with image registry-1.docker.io/flyio/postgres:13
Waiting for machine to start...
Machine 784e679b453408 is created
==> Monitoring health checks
  Waiting for 784e679b453408 to become healthy (started, 0/3)
Error: context deadline exceeded

@nelsonic
Copy link
Member Author

But when visiting the dashboard, we see that the new-db was created:
https://fly.io/dashboard/dwyl-auth-546
image

Replied on the thread: https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/16?u=dwyl_auth

@nelsonic
Copy link
Member Author

On the advice of the Daniel https://community.fly.io/t/where-is-our-data-postgres-database-volume-0mb/16247/18 I created a new DB instance to attempt to recover the data ...

https://fly.io/dashboard/dwyl-auth-546
image

It didn't work and the thread was automatically closed without any resolution. 😢
But the instance is still running and costing us a small fortune!!
So I'm killing it now:

image

The prod instance is still not recovered. 😢
Really wish I had someone to pair on this ...

image

@nelsonic
Copy link
Member Author

Going to attempt to host the Postgres on DigitalOcean dwyl/learn-devops#90 with the Auth App still running on Fly.io ref: dwyl/learn-devops#93

Meanwhile our Auth App has a mountain of credits: https://fly.io/dashboard/dwyl-auth-546
image
And since the first $5/month is "free", we aren't consuming them.
So I might bump the App up a notch once the DB transferred to DigitalOcean.

But just to say I'm tackling this now! ⏳ (or at least investing an hour into this and seeing how far I get ...)

@nelsonic
Copy link
Member Author

fly auth login
Opening https://fly.io/app/auth/cli/6c617a6637656868646571 ...

Waiting for session... Done
successfully logged in as nelsonic
fly status
App
  Name     = authprod
  Owner    = dwyl-auth-546
  Hostname = authprod.fly.dev
  Image    = authprod:deployment-01GK3WCNQ8Q374

Machines
PROCESS	ID            	VERSION	REGION	STATE  	ROLE	CHECKS            	LAST UPDATED  
app    	6e82d924ad3742	58     	lhr   	started	    	1 total, 1 passing	2023-11-03T07:20:03Z	

All running. 🆗

@nelsonic
Copy link
Member Author

Going to detach the (empty) DB: https://fly.io/docs/postgres/managing/attach-detach/#detach-a-fly-app

Example:

fly postgres detach <postgres app name> --app <app name>

Ours:

fly postgres detach  authprodb --app authprod

Got:

Error: no 6pn ips founds for authprodb app

Didin't expect it to be easy ... 🙄
But what is this error? 🤷‍♂️

@nelsonic
Copy link
Member Author

nelsonic commented Jul 27, 2024

Exact error message for forum title/topic:
https://community.fly.io/t/6pn-ips-are-not-static-re-fly-postgres-last-thurs/20022

Reading now ... ⏳

@nelsonic
Copy link
Member Author

Turns out the authprod App is not connected to the authprodb:
https://fly.io/docs/postgres/managing/attach-detach/#check-if-a-fly-app-is-attached-to-a-postgres-app

fly postgres users list --app authprodb
Error: no active leader found

Fly Postgres is Total Garbage. 🗑️ 😢

Why aren't these status checks all automatic?! 🤦‍♂️
Surely the Web Interface could inform us that the health of the Postgres is "no active leader". 🤷‍♂️

@nelsonic
Copy link
Member Author

nelsonic commented Jul 27, 2024

Interrupted by children. 🧒 ❤️ So no further progress will be made on this today.
Wish I could clone myself and be a good parent and still work on "side projects" ... 💭

@nelsonic
Copy link
Member Author

==> ecto
Compiling 56 files (.ex)
     warning: missing parentheses for expression following "do:" keyword. 
      Parentheses are required to solve ambiguity inside keywords.

     This error happens when you have function calls without parentheses inside keywords. For example:

         function(arg, one: nested_call a, b, c)
         function(arg, one: if expr, do: :this, else: :that)

     In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
      Or if the keywords "do" and "else" apply to the function "function" or "if". 
      You can solve this by explicitly adding parentheses:

         function(arg, one: if(expr, do: :this, else: :that))
         function(arg, one: nested_call(a, b, c))

     Ambiguity found at:

 568 │     do: raise ArgumentError, "cast/3 expects a list of atom keys, got: `#{inspect key}`"
     │     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     └─ lib/ecto/changeset.ex:568

     warning: missing parentheses for expression following "do:" keyword. 
     Parentheses are required to solve ambiguity inside keywords.

     This error happens when you have function calls without parentheses inside keywords. For example:

         function(arg, one: nested_call a, b, c)
         function(arg, one: if expr, do: :this, else: :that)

     In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
     Or if the keywords "do" and "else" apply to the function "function" or "if". 
     You can solve this by explicitly adding parentheses:

         function(arg, one: if(expr, do: :this, else: :that))
         function(arg, one: nested_call(a, b, c))

     Ambiguity found at:
     │
 445 │     do: raise ArgumentError,
     │     ~~~~~~~~~~~~~~~~~~~~~~~~
     │
     └─ lib/ecto/association.ex:445

      warning: missing parentheses for expression following "do:" keyword. 
      Parentheses are required to solve ambiguity inside keywords.

      This error happens when you have function calls without parentheses inside keywords. For example:

          function(arg, one: nested_call a, b, c)
          function(arg, one: if expr, do: :this, else: :that)

      In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
      Or if the keywords "do" and "else" apply to the function "function" or "if". 
      You can solve this by explicitly adding parentheses:

          function(arg, one: if(expr, do: :this, else: :that))
          function(arg, one: nested_call(a, b, c))

      Ambiguity found at:

 2904 │     raise ArgumentError, "cannot add constraint to changeset because it does not have a source, got: #{inspect data}"
      │     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      └─ lib/ecto/changeset.ex:2904

      warning: missing parentheses for expression following "do:" keyword. Parentheses are required to solve ambiguity inside keywords.

      This error happens when you have function calls without parentheses inside keywords. For example:

          function(arg, one: nested_call a, b, c)
          function(arg, one: if expr, do: :this, else: :that)

      In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
      Or if the keywords "do" and "else" apply to the function "function" or "if". 
      You can solve this by explicitly adding parentheses:

          function(arg, one: if(expr, do: :this, else: :that))
          function(arg, one: nested_call(a, b, c))

      Ambiguity found at:
      │
 2906 │     raise ArgumentError, "cannot add constraint because a changeset was not supplied, got: #{inspect item}"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      │
      └─ lib/ecto/changeset.ex:2906

     warning: atom ::: must be written between quotes, as in :"::", to avoid ambiguity
     │
 260 │       {:::, _, [left, _]} -> is_integer(left) or is_binary(left)
     │        ~
     │
     └─ lib/ecto/query/builder.ex:260:8

     warning: missing parentheses for expression following "do:" keyword. 
    Parentheses are required to solve ambiguity inside keywords.

     This error happens when you have function calls without parentheses inside keywords. For example:

         function(arg, one: nested_call a, b, c)
         function(arg, one: if expr, do: :this, else: :that)

     In the examples above, we don't know if the arguments "b" and "c" apply to the function "function" or "nested_call". 
     Or if the keywords "do" and "else" apply to the function "function" or "if". 
     You can solve this by explicitly adding parentheses:

         function(arg, one: if(expr, do: :this, else: :that))
         function(arg, one: nested_call(a, b, c))

     Ambiguity found at:

 621 │     do: raise ArgumentError, "a changeset with action #{inspect given} was given to #{inspect repo}.#{action}/2"
     │     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

     └─ lib/ecto/repo/schema.ex:621


== Compilation error in file lib/ecto/query.ex ==
** (Kernel.TypespecError) lib/ecto/query.ex:428: type dynamic/0 is a built-in type and it cannot be redefined
    (elixir 1.17.2) lib/kernel/typespec.ex:969: Kernel.Typespec.compile_error/2
    (stdlib 5.2.3) lists.erl:1599: :lists.foldl_1/3
    (elixir 1.17.2) lib/kernel/typespec.ex:226: Kernel.Typespec.translate_typespecs_for_module/2
could not compile dependency :ecto, "mix compile" failed. Errors may have been logged above. 
You can recompile this dependency with "mix deps.compile ecto --force", 
update it with "mix deps.update ecto" or clean it with "mix deps.clean ecto"

I obviously tried updating ecto but that didn't do anything. 😕

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BLOCKED :fire: Core team's HIGHEST priority, blocking critical work chore a tedious but necessary task often paying technical debt elixir Pull requests that update Elixir code help wanted If you can help make progress with this issue, please comment! in-progress An issue or pull request that is being worked on by the assigned person priority-1 Highest priority issue. This is costing us money every minute that passes. question A question needs to be answered before progress can be made on this issue tech-debt A feature/requirement implemented in a sub-optimal way & must be re-written technical A technical issue that requires understanding of the code, infrastructure or dependencies
Projects
Status: Backlog (Prioritized)
Status: 🔖 Ready for Development
Development

No branches or pull requests

3 participants