Sunday, 29 November 2009

Linux tips: Recovering lost text data from disk, or EVEN MEMORY

Okay, first new post for the new 'technical' blog. The topic is "Holy ****!!! I just had my browser/text editor crash, and I've lost all that work! What do I do now?"

Well, so long as the editor you were using stores its data in some form of plain text, and so long as you are using Linux, FreeBSD, or some other Unix (or possibly Mac OS X), I may be able to help.

WARNING: These techniques, like any secret powers, come with a degree of risk. I take no responsibility for any accidents where you wipe your hard-disk, explode your computer, or accidentally sell your soul to the devil. In fact I take no responsibility for anything at all. You have been warned.

I use opera as a web-browser, and every now and then it crashes. This often happens when I'm in the middle of typing a big old 'mr angry' post onto a web-forum somewhere. Other times, I'm writing code, or SF stories (I am, like so many people, a wannabe writer) and a muscle spasm causes me to kick the power-supply of my computer, disconnecting it. Or there's a power cut. Or more often than this it's the shameful FAIL!! of doing 'rm -r -f' in the wrong place.

Fortunately, I've always (so far) been able to recover my work when such things happen. The tool I use for doing this is 'grep'.

Now, as you probably know, 'grep' (whose name comes from the 'ed' editor command 'g/re/p', meaning 'globally search for a regular expression and print') is a tool that prints out lines matching a certain pattern in a file. But, if you had a file, you wouldn't have lost the work, would you? Well, let's consider our old friend, the reckless 'rm -r -f'. You thought you were in the /tmp directory, but actually you were in the /home/MyLifesWork directory, and now you've wiped all that oh-so-important stuff. You do have backups of course, don't you?

Well, if you don't, you may be able to at least recover any plain-text documents. When you 'delete' a file, what generally happens is that the chunk of hard-disk that the file was on gets marked as being free for re-use. But it doesn't get 'wiped clean', the data is still in there. So long as you can find this chunk of hard-disk and read the data out of it before it gets used again, you can recover much, maybe all, of your work.

You'll need to be 'root', i.e. the superuser on the unix box you are trying to recover from. As ever, mistakes made as 'root' can be disastrous, so be sure you know what you are doing. (Me, I'm always logged in as 'root'. Yes, I know, that's really bad practice. Do as I say, not as I do).

Linux has a directory called '/dev'. In this directory are 'virtual files'. These are not files in the sense we would normally understand them, but rather the input-output endpoints of device drivers. Devices like your hard-disk and computer memory appear in /dev as though they were files. You can read and write to the entire hard-disk, as though it were one giant file. I really wouldn't recommend writing anything to these files, as this is a good way to wipe your hard-disk clean!! BE VERY SURE YOU KNOW WHAT YOU ARE DOING HERE.

So, depending on what kind of hard-drive you have, the device files will probably be called something like '/dev/hda' or '/dev/sda' for the first hard-drive, and '/dev/hdb' etc for the second. In addition to this, there will be device files for the partitions on those hard-drives, named like '/dev/hda1', '/dev/hda2' etc.
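If you aren't sure which device file holds the directory you just wiped, 'df' can tell you. What follows is only a minimal sketch: the function name 'device_for_path' is my own invention, and on some systems 'df' reports names like '/dev/mapper/...' or '/dev/root' rather than a plain '/dev/sda2'.

```shell
# Print the device that holds the filesystem containing the given path.
# 'df -P' prints a header line, then one line for the filesystem,
# whose first field is the device name.
device_for_path() {
    df -P "$1" | awk 'NR == 2 { print $1 }'
}

# Example: which device is the current directory stored on?
device_for_path .
```

Whatever this prints for your lost directory is the file you would hand to grep in the examples that follow.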

If you know a word or phrase that occurs in your lost document, you can do

grep "My phrase" -a /dev/hda2

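to grep for it in the entirety of the 2nd partition on the 1st hard-drive. (The '-a' tells grep to treat the binary data as though it were text; without it, grep would only report that a binary file matches.) Further, you can redirect this output to a file like: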

grep "My phrase" -a /dev/hda2 > /tmp/

However, there is a danger here. The 'lost file' takes up a chunk of hard-disk that is available for re-use, so the second that you start writing to '/tmp/', you run the risk of overwriting the very data you are hoping to recover! It's better, if you have another partition you can use, to copy the data onto that partition, not the partition that you are trying to recover data from. For example, you might have a usb-disk or pendrive that you can mount on /mnt, and then redirect to '/mnt/' rather than to '/tmp/'. This will mean that you aren't writing onto the same hard-disk you are trying to recover from. If it's a usb-disk, this might be slow, but better slow than overwritten!
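A quick sanity check before you start writing can save you here. This is a rough sketch under assumptions: 'safe_output_dir' is a name I made up, and it only compares device names as strings, so it won't notice that two different partitions live on the same physical disk.

```shell
# Succeeds (exit status 0) when directory $2 is NOT stored on device $1,
# i.e. when writing recovered data there cannot overwrite the lost file.
safe_output_dir() {
    src_device=$1
    out_dir=$2
    # First field of the second line of 'df -P' is the device holding out_dir.
    out_device=$(df -P "$out_dir" | awk 'NR == 2 { print $1 }')
    [ "$out_device" != "$src_device" ]
}
```

You might use it as 'safe_output_dir /dev/hda2 /mnt && echo "ok to write to /mnt"', swapping in your own device and mount point.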

Well, this is all very well, but grep only returns lines that contain the searched for phrase. You want the whole document! Well, grep has two command-line options that help here, each followed by a number of lines:

-A NUM  Also print NUM lines after each matching line
-B NUM  Also print NUM lines before each matching line

For instance, half the reason that I'm writing this post right now, is that I lost my 'Margaret Atwood: Bad girl, or just Misunderstood?' post that I was writing for my other 'non technical' blog in a freak accident. I was able to recover it by grepping for 'Misunderstood'. I knew that 'Misunderstood' would be at the start of the document, so I only had to use '-A', like this

grep -a Misunderstood -A 200 > /tmp/

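This will cause grep to print out every line it finds with 'Misunderstood' in it, and also the 200 lines that follow that one. (The command as shown leaves out the file to search; as in the earlier examples, the device file, e.g. '/dev/hda2', goes just before the '>'.)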

If I'd known that 'Misunderstood' was at the end of the document, then I'd have used -B. If my search phrase is mid-way through, then I'd use -A and -B in combination to grep lines before and after the phrase.
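If you'd like to see what -A and -B do before pointing grep at a real device, you can try them on a harmless stand-in. This sketch builds a small temporary file with some binary junk in it; the function name 'context_grep' and the sample text are just for illustration.

```shell
# context_grep PATTERN FILE
# -a treats binary data as text, -B 1 keeps one line before each match,
# -A 2 keeps the two lines after it.
context_grep() {
    grep -a -B 1 -A 2 "$1" "$2"
}

# A small stand-in for a partition: text lines surrounded by binary junk.
sample=$(mktemp)
{
    printf '\000\001\002junk before\n'
    printf 'Title line\n'
    printf 'Margaret Atwood: Bad girl, or just Misunderstood?\n'
    printf 'First paragraph\n'
    printf 'Second paragraph\n'
    printf '\003\004junk after\n'
} > "$sample"

# Prints the title line, the matching line, and the two paragraphs after it.
context_grep Misunderstood "$sample"
rm -f "$sample"
```

On a real partition you would use much bigger numbers, like the 200 above, and then trim the excess by hand.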

Okay, so we have our file of saved data. Unfortunately, that's only the beginning. Looking in the file with a text-editor (I recommend 'vi') you will find that, frankly, it's full of crap. Oh, your missing work is in there somewhere, but there's loads of extra stuff been pulled along with it. You'll have to go through and pluck your work from the mess. Sorry, that's how it is.

Also, if your missing document is fairly large, it's unlikely that it will all have been stored in one place on disk. So, I grepped 'Misunderstood', and that got the first third or so of my blog-post. To get the rest, I had to grep for other words. Grepping 'Atwood' returned all kinds of chunks of the document, and in the end I was able to sew these back together into my original text. It's a messy procedure, but in the end, it works.
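One possible shortcut, which is not what I actually did, is to search for several keywords in a single pass and have grep print byte offsets, so that you can tell which chunk came from where. In this sketch 'keyword_chunks' is a made-up name, and '-b' and '-E' are the GNU grep options for byte offsets and extended regular expressions.

```shell
# keyword_chunks SOURCE CONTEXT REGEX
#   SOURCE  : file or device to search
#   CONTEXT : number of lines to keep either side of each match
#   REGEX   : extended regular expression, e.g. 'Misunderstood|Atwood'
keyword_chunks() {
    # -a  treat binary data as text
    # -b  prefix every printed line with its byte offset within SOURCE
    # -E  extended regex, so that '|' means "or"
    grep -a -b -E -A "$2" -B "$2" "$3" "$1"
}
```

Matching lines come out as 'offset:text' and context lines as 'offset-text'. Bear in mind that the order of chunks on disk need not be the order they had in your document, so the offsets are a hint, not an answer.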

One tool that can help you with this is 'strings'. 'Strings' takes a file, and only prints out the text in it that uses a restricted set of human-readable characters. Most of the 'control characters' and binary stuff, it throws away. So one could either do:

grep -a Misunderstood -A 200 > /tmp/
cat /tmp/ | strings > /tmp/MA.strings

or, in a single step:

grep -a Misunderstood -A 200 | strings > /tmp/

And this will eliminate some of the mess that you have to consider.
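Here is a small sketch of that filtering step that you can try on any file. The name 'printable_only' is my own, and the fallback for machines without 'strings' installed is only a rough approximation of what 'strings' does, not the real thing.

```shell
# Keep only runs of at least 4 printable characters from file $1.
printable_only() {
    if command -v strings >/dev/null 2>&1; then
        # -n 4: minimum length of a run worth printing
        strings -n 4 "$1"
    else
        # Fallback (an assumption of this sketch): turn every non-printable
        # byte into a newline, then keep lines of 4 or more characters.
        LC_ALL=C tr -c '[:print:]\n' '\n' < "$1" | LC_ALL=C grep -E '.{4,}'
    fi
}
```

Raising the minimum length (say 'strings -n 8') throws away more of the short junk, at the risk of losing very short lines of your own text.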

But here's something else about my lost 'Margaret Atwood' post. It wasn't saved on disk at all, I was typing it into a text box in my opera browser when, for some reason best known to itself, opera went 'plink'. How can you recover something that's not even written to disk, and only existed in the memory being used by a given application?

Well, it depends if your unix operating system blanks all memory down when it is freed. Some 'ultra secure' versions of unix do this (you can patch the linux kernel to do this) to ensure that someone naughty, who is logged in as root, can't use these techniques to look into memory you are using, and see all your dirty little secrets. However, if you don't have such an 'ultra secure' unix, you may be in with a chance. Indeed, about an hour ago, I managed to recover my 'Margaret Atwood' blogpost from memory.

First off you should probably shut down as many programs as you can, to prevent them from grabbing the memory that your 'crashed' application has just given up. WHATEVER YOU DO, DO NOT TURN OFF OR REBOOT YOUR COMPUTER, as this will obviously blank all memory.

Then, if you look in /dev, you'll see a file /dev/mem. This file lets you view your memory as a file, just as you can view hard-drive partitions as a file, so all the same procedures can be used to recover something that's still hanging around in memory.

The results of grepping /dev/mem tend to be even messier than grepping hard-drive partitions, but your work should be in there somewhere. You'll have to find the chunks and join them together.

If you were editing an html document, you might find that it's full of things like '%20' indicating a space. You can replace these using 'sed'

cat /tmp/ | sed "s/%20/ /g" > /tmp/MA.cleaned

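Similarly for any other characters that have been 'quoted' using URL percent-encoding. Some characters need care in a sed command: '/' is the divider in the 's' command and '[' starts a character class in the pattern, so you'll have to quote them with '\' so that sed knows what you mean. ('"' is special to the shell rather than to sed, so it needs a '\' too when the sed command sits inside double quotes.) For instance: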

cat /tmp/ | sed "s/%2F/\//g" > /tmp/MA.slashes

If you wrote '/' instead of '\/', sed will get confused, because it uses '/' to divide 'thing to replace' and 'thing to replace it with'.
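You can dodge most of the backslashes by giving sed a different divider and putting the command in single quotes. The sketch below decodes a handful of common codes in one go; 'decode_common' is a made-up name, the list of codes is nowhere near complete, and it assumes the hex digits are in upper case.

```shell
# Read text on standard input and decode a few common percent-encoded
# characters. '|' is used as the divider, so '/' needs no backslash,
# and the single quotes stop the shell from touching '"'.
decode_common() {
    sed -e 's|%20| |g' \
        -e 's|%2F|/|g' \
        -e 's|%22|"|g' \
        -e 's|%5B|[|g' \
        -e 's|%5D|]|g' \
        -e 's|%25|%|g'
}

# Example: decode a short sample string.
printf 'Bad%%20girl%%2For%%20just%%20%%22Misunderstood%%22%%3F\n' | decode_common
```

'%25' (a literal '%') is handled last so that text which really did contain, say, '%2520' is only decoded once. Codes not in the list, such as the '%3F' in the example, pass through untouched.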

So, that's all you need to recover text, html, and probably a few other 'text based' formats.

Good luck. You'll probably need a little luck.
