Make sure that add-on store resets do not delete the root folder. This
is important so that successive reset attempts do not fail (the
directory passed to `remove_folder` must exist, otherwise find fails
with an non-zero exit code).
While at it, handle find errors properly and report errors as critical.
* Improve D-Bus timeout error handling
Typically D-Bus timeouts are related to systemd activation timing out
after 25s. The current dbus-fast timeout of 10s is well below that
so we never get the actual D-Bus error. This increases the dbus-fast
timeout to 30s, which will make sure we wait long enought to get the
actual D-Bus error from the broker.
Note that this should not slow down a typical system, since we tried
three times each waiting for 10s. With the new error handling typically
we'll end up waiting 25s and then receive the actual D-Bus error. There
is no point in waiting for multiple D-Bus/systemd caused timeouts.
* Create D-Bus TimedOut exception
Make sure that add-on store resets do not delete the root folder. This
is important so that successive reset attempts do not fail (the
directory passed to `remove_folder` must exist, otherwise find fails
with an non-zero exit code).
While at it, handle find errors properly and report errors as critical.
* Handle permission error on backup create
Make sure we handle (write) permission errors when creating a backup.
* Introduce BackupFileExistError and BackupPermissionError exceptions
* Make error messages a bit more uniform
* Drop use of exclusive mode
SecureTar does not handle exclusive mode nicely. Drop use of it for now.
With PR #5634 (which had the goal to remove I/O in event loop for backup
operations) the semantics of `remove_folder` changed slightly: Non-zero
exits of subprocesses were no longer handled, but lead to a
CalledProcessError.
Now to restore the semantics of `remove_folder` we should simply log an
error. However, this semantic change actually uncovered a potential
problem in deployed systems: There are 34 users on beta channel which
regularly seem to run `FixupStoreExecuteReset`, and with the semantic
change we see those errors in Sentry.
An obvious problem could be no storage. But in a quick test that would
not execute the repair in first place since the fixup has the job
condition `FREE_SPACE` set. So the problem is likely elsewhere.
With this change, we log the stderr of find, while still raising the
exception. With that we should get more context in Sentry to see what
could be the underlying error.
* Remove I/O in event loop for add-on backup and restore
Remove I/O in event loop for add-on backup and restore operations. On
backup, this moves the add-on shutdown before metadata is stored in the
backup, which slightly lenghens the time the add-on is actually stopped.
However, the biggest contributor here is likely adding the image
itself if it is a local backup. However, since that is the minority of
cases, I've opted for simplicity over optimizing for this case.
* Use partial to explicitly bind arguments
* Remove I/O in event loop for Home Assistant Core backup
The Home Assistant Core backup still contains some I/O in the event
loop. Move all I/O into the executor.
* Update supervisor/homeassistant/module.py
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
---------
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
* Avoid IO in event loop when removing backup
* Refactor backup size calculation
Currently size is lazy loaded when required via properties. This
however is blocking the async event loop.
Backup sizes don't change. Instead of lazy loading the size of a backup
simply determine it on loading/after creation.
* Fix tests for backup size change
* Avoid IO in event loop when loading backups
* Avoid IO in event loop when importing a backup
We don't intent to run uv again, so the cache is not really useful.
The cache directory size is around 80MB, however, the files are mostly
hardlinks to the original files in `/usr/local/lib/python3.13/site-packages`
so the actual saving is much smaller.
* Remove I/O from backup create() function
* Move mount check into exectutor thread
* Remove I/O from backup open() function
* Remove I/O from _folder_save()
* Refactor remove_folder and remove_folder_with_excludes
Make remove_folder and remove_folder_with_excludes synchronous
functions which need to be run in an executor thread to be safely used
in asyncio. This makes them better composable with other I/O operations
like checking for file existence etc.
* Fix logger typo
* Use return values for functions running in an exectutor
* Move location check into a separate function
* Fix extract
Since transition from pip to uv in #5152, Supervisor container doesn't
contain bytecode for site-packages anymore, and because our AppArmor
profile denies mkdir operations, the compiled *.pyc files are never
created. Enable uv --compile option to opt for the same behavior as pip
had, to fix of the AA errors and the potential penalty of compilation on
every import.
* Validate Backup always before restoring
Since #5519 we check the encryption password early in restore case.
This has the side effect that we check the file existance early too.
However, in the non-encryption case, the file is not checked early.
This PR changes the behavior to always validate the backup file before
restoring, ensuring both encryption and non-encryption cases are
handled consistently.
In particular, the last case of test_restore_immediate_errors actually
validates that behavior. That test should actually have failed so far.
But it seems that because we validate the backup shortly after freeze
anyways, the exception still got raised early enough.
A simply `await asyncio.sleep(10)` right after the freeze makes the
test case fail. With this change, the test works consistently.
* Address pylint
* Fix backup_manager tests
* Drop warning message
When delivering multiple messages to Core, and the first fails with a
connection error, the second message will also fail with the same error.
But at this point the connection is already clsoed, which leads to an
exception in the exception handler. Avoid this compunding error by
checking if the connection is still exists before trying to close.
Fixes: #5629
* Delay inital version fetch until there is connectivity
* Add test
* Only mock get not whole websession object
* drive delayed fetch off of supervisor connectivity not host
* Fix test to not rely on sleep guessing to track tasks
* Use fixture to remove job throttle temporarily
* Drop Docker config from Supervisor backup
The Docker config is part of the main backup metadata. Because we
consolidate encrypted and unencrypted backups today, this leads to
potential bugs when restoring a backup.
* Drop obsolete encrypt/decrypt functions
* Drop unused Backup Job stage
* Fix restoring unencrypted backup in corner case
If a backup has a encrypted and unencrypted location, and the encrypted
location is beeing restored first, the encryption key is still cached.
When the user restores the unencrypted backup next, it will fail because
the Supervisor tries to use encryption key still.
* Add integration test for restoring backups with and without encryption
* Rename _validate_location_password to _set_location_password
* Reload backup metadata from restore location
* Revert "Reload backup metadata from restore location"
This reverts commit 9b47a1cfe9a2682a0908e08cd143373744084fb7.
* Make pytest work/punt the ball on docker config restore issue
* Address pylint error