Ask / Submit
8

Data loss during IO in deep sleep

asked 2015-06-17 02:02:40 +0200

Self-Perfection gravatar image

updated 2015-06-17 13:25:50 +0200

Preface

I have created a script for regular checking for updates, downloading of zip archive with update and uncompressing it's contents to ~/android_storage/XXX. It worked perfectly during manual testing, but when I finally configured it with timed to run without manual interaction I found that most of the time the script produces corrupted file. WHAT?!

Main point

Long story short: after excessive/exhaustive debugging I found, that writing large amounts of data (~100MB) while Jolla is in sleep mode are likely to fail. The result may vary: you may get truncated file, no file at all or in my case I got file 64 KiB shorter than proper result with 64 KiB chunk missing somewhere around the middle of the file. That was really "WTF?!?!" experience.

IO in ~/android_storage is much more likely to fail than IO without fuse intermediate layer. Most of the time I've noticed write failures, but once I got read failure. Looks like system load (read/write proportion, CPU usage) affects somehow, but I found sequence that always reproduces this issue on my device (see below).

For me this looks like a bug. Operating system might postpone actions when going to sleep, OS might consider long running apps as hanging and ask them (politely for the first time!) to stop, but reporting file system error is a bug. Entering sleep mode should be imperceptible for processes.

Guide to reproduce

Download sample zip file (~100MB). curl -o /home/nemo/android_storage/test.obf.zip -L 'http://download.osmand.net/download.php?standard=yes&file=Russia_moskovskaya-oblast_asia_2.obf.zip'

Save following script as /home/nemo/android_storage/test_unpack.sh

#!/bin/sh

DIR=$(dirname "$0")
ARCHIVE='test.obf.zip'
OUTFILE="$DIR/archive_unzipped_$(date +%T)"

exec >> "$DIR/test_unpack.log"
exec 2>> "$DIR/test_unpack.log"

unzip -p "$DIR/$ARCHIVE" | tee "$OUTFILE" |md5sum
ls -l "$OUTFILE"

echo ====

Schedule it execution for several times: for t in {300..1800..300};do timedclient-qt5 -a'whenDue;runCommand=/bin/sh /home/nemo/android_storage/test_unpack.sh' -e"APPLICATION=test_unpack;ticker=$t"; done

Make sure your device is not connected to charger or USB, switch off screen, put it aside and don't touch for 40 minutes.

When the time passes check ~/android_storage/test_unpack.logHere is my sample output. Only one uncompress out of six was successful!

Workaround and afterword

Now as a workaround I have scheduled the script to run as media_rw user to access /opt/alien/data/media/ directly and added several uncompression attempts to the script. I hope it would be robust now. I have neither skills nor courage to investigate the issue further and hope someone will pick it. Well, at least I would be glad if someone tries to follow reproduce guide and check if this issue happens on his device as well.

edit retag flag offensive close delete

Comments

2 Answers

Sort by » oldest newest most voted
2

answered 2015-06-17 09:58:03 +0200

t-lo gravatar image

There are a few good explanations for this behaviour other than a FS corruption on low power mode.

Also, did you notice that, although tee reported errors in your log output and the uncompressed file differs in size, you would always get the same MD5SUM? That's most likely because unzip finished successfully, and tee managed to pipe everything to md5sum, but did not manage to recover from EINTR (as it should; see below).

But let's start from the beginning. You never check for errors in your script:

  • curl may fail (data / connection loss)
  • unzip may fail (corrupt data in zip)
  • tee may fail (see below)

I'd recommend using #!/bin/sh -e instead of #!/bin/sh so as to at least abort execution if one of the binaries called returns an error. Ideally, you'd check $? after each step, re-starting the command as required.

Additionally to this, teeseems to be broken - EINTR is handled incorrectly (see https://www.quora.com/What-are-interrupted-system-calls?share=1). I assume that tee gets interrupted when writing to the device. Instead of re-starting the write() call it discards data - maybe the --ignore-interrupts option helps, although it's actually meant for a slightly different situation (it shields against keyboard interrupts, aka CTRL+C).

edit flag offensive delete publish link more

Comments

Hey, don't blame for lack of error checking in my script! Original script has plenty of it, actually original is too complex (103 lines bywc or 71 by sloccount) and most of it is unrelated to problem so I provided here simplest possible sequence to reproduce the problem and collect logs.

Yeah, I've put md5sum in this script right on purpose to indicate that it is writing error and not unzip error.

You have interesting point that EINTR is a normal thing and app should try to reissue system call if it receives EINTR. Well, in this case pretty much everything is broken. First version of my script worked without tee: unzip -p "$ARCHIVE" > "$UNZIPPED_ARCHIVE". This version produced incomplete files with data missing somewhere in the middle of the file. I suppose unzip received EINTR but silently ignored it and continued to write data.

Ok, after your answer I have tried cat: unzip -p "$ARCHIVE" | cat > "$UNZIPPED_ARCHIVE" In this command cat should handle EINTR from FS, but I've got the same result as with plain unzip: shorter file with missing data chunks. --ignore-interrupts option for tee does not change anything.

This situation makes me wonder if there is at least one app that handles EINTR properly and if EINTR is really a normal thing that should be anticipated.

Self-Perfection ( 2015-06-17 12:58:08 +0200 )edit
1

Sorry for making assumtions - your examples were all I had.

Regarding EINTR, no, it's completely legal (although quite uncommon) to receive this signal on a number of system calls. See e.g. man 7 signal, specifically all the parts that explicitly mention EINTR.

Regarding unzip -p "$ARCHIVE" | cat > "$UNZIPPED_ARCHIVE" - I think it's a rather unfortunate choice. When issuing ... cat > "xxx", the cat command does not even see the ...> "xxx" part. This is handled by the shell that runs cat since it's an I/O redirection: the shell will open the file "xxx" and hook it to the stdout of the cat process before exec()ing the cat binary. cat itself will see exactly zero command line arguments. It will thusly read from stdin and write to (what cat thinks is its) stdout. So, question is, are you certain that cat does handle terminal output correctly when receiving EINTR? Would your own code (i.e. if you would write your own cat) check for EINTR on e.g. printf() and repeat the output accordingly?

t-lo ( 2015-06-17 13:52:46 +0200 )edit

According to man 7 signal

If a blocked call to one of the following interfaces is interrupted by a signal handler, then the call will be automatically restarted after the signal handler returns if the SA_RESTART flag was used; otherwise the call will fail with the error EINTR: * read(2), readv(2), write(2), writev(2), and ioctl(2) calls on "slow" devices. A "slow" device is one where the I/O call may block for an indefinite time, for example, a terminal, pipe, or socket. (A disk is not a slow device according to this definition.)

Therefore I assume cat > FILE even more appropriate example for EINTR than tee file, because cat writes to pipe. And it seems to me tee should not get EINTR at all from FS.

I have not writed anything in compiling programming languages for ages. In my imagination cat authors are godlike superheroes who writes perfect code which handles all edge cases. :-D

Well, anyway, what's your point:

  1. Is cat broken too?
  2. Or am I wrong in my assumption that there has to be a way to perform ny task reliably independently of sleep state of the device?
Self-Perfection ( 2015-06-17 14:54:56 +0200 )edit

Regarding the man page quote, yupp, that's what a driver should do. Apparently the bind-mount/fuse-horror-stack Jolla/alien-dalvik is using on the android_storage mount point thinks differently.

So, even though erroneous signal/return code handling in the shell tools is the cause of your problems it can probably be better fixed by simply writing to a different file system. Do you use an SD-Card? It should be safe to write there.

Regarding your assumtions: cat does not write to a pipe, it just writes to what (to cat) looks like stdout. I.e. it will probably use a printf() variant of some kind. It just so happens that the stdout file descriptor does not point to a terminal but was re-directed (i.e. closed, then cloned) by the shell to an open file's descriptor. So, in a way, it's worse than tee because at least tee knows that it's writing to an actual file.

  1. cat and tee are not broken but will probably not help here since the FS reacts in a funny (although strictly speaking legal) way.
  2. Unfortunately things are beginning to look that way - using the tools at hand in combination with android_storage/ for storing data seems to break your use case.
t-lo ( 2015-06-17 15:09:47 +0200 )edit

I have already ditched working via android_storage, just as I mentioned in "workaround" section now my script is launched as media_rw user and wors directly with /opt/alien/data/media/. But working with Btrfs directly, without android_storage mediator is not 100% reliable. Check example output of running the same test in $HOME

Just one failure out of 8 attempts - much better but solution is not complete. Now unzip gets read error :-/

Self-Perfection ( 2015-06-17 16:55:34 +0200 )edit
0

answered 2015-06-17 08:29:50 +0200

juiceme gravatar image

That is fairly strange since I run device backup scripts from cron every night, and the transferred data size is pretty large, for example my largest callrecorder log is about 380M in size.

I believe your problem might be related to timed, maybe the device does not fully wake up when the triggering event occurs? You could install cron and see if that behaves better. (http://www.swagman.org/juice/vixie-cron-3.0pl1-1.src.rpm)

edit flag offensive delete publish link more

Comments

  1. True, as I use rsync for backup from the device there is not much writing being done... just the log entry that backup has been started and completed.

  2. I have no "android_storage" nor I know what it is... I have never used Android on my jolla (and am quite sure I will never do that)

  3. I thought your script also uses network to retrieve the file from remote location?

juiceme ( 2015-06-17 10:17:16 +0200 )edit

I suppose this problem is not caused by improper wake up but by going back to sleep during IO.

I believe you perform your backup over network to remote system. In that case your situation is totally different and does not contain "triggering factors":

  1. Problem happens most of the time during writing and backup is read only process
  2. Bug occurs mostly during IO on ~/android_storage. And your backup probably does not touch it much
  3. Network activity might prevent your device from going to sleep
Self-Perfection ( 2015-06-17 11:07:21 +0200 )edit
1

Yeah, my script downloads file to fs. And downloading always works fine. Then comes unzip, which does not use network. And unzip fails.

Unzip can't decompress zip archive from pipe therefore I can't run decompression in parallel with downloading.

Self-Perfection ( 2015-06-17 11:14:06 +0200 )edit

What if you have a script in parallel that just pings a host while your main script is running?

Blizzz ( 2015-06-17 13:55:15 +0200 )edit

@Blizzz, parallel ping is an ugly duct tape, I don't want to use it even if it helps. As far as i recall there is D-Bus interface to disable entering sleep mode but I am reluctant to use it as well, direct working with /opt/alien/data/media/ looks reliable enough for me.

Self-Perfection ( 2015-06-17 16:59:42 +0200 )edit
Login/Signup to Answer

Question tools

Follow
2 followers

Stats

Asked: 2015-06-17 02:02:40 +0200

Seen: 612 times

Last updated: Jun 17 '15