A content-based file manager

June 4, 2009

I’ve been thinking recently about Insight again, and I’ve been considering part of the problem with naming and uniqueness.

Names in a traditional file system are made unique based on a full path to the file, but most people think of a file name as just the final component. This would then cause a problem with the move to Insight, as a file could appear in multiple directories, and its only distinguishing feature would be the final component of its path. This is counter-intuitive and can cause all sorts of problems.

Consider makefiles, for example. They rely on a standard named file (Makefile) appearing at various levels in the hierarchy in order to work. Obviously, you would want different makefiles at different levels and in different projects, but Insight as it stands has no way to handle this.

I then started thinking about what makes a file unique. In the end, I came up with two things: name and content. This covers the makefile case (same name, different content) as well as the backup case (same content, different name). It then occurred to me that, in the general case, all you need to distinguish a file is its content, and then actually finding it can all be left up to metadata.
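To make the two cases concrete, here is a minimal Python sketch. It assumes that "content" can be summarised by a SHA-256 digest; that choice, and the helper name content_id, are illustrative assumptions and not part of Insight.

```python
import hashlib
import io


def content_id(stream, chunk_size=65536):
    """Return a hex SHA-256 digest of everything readable from a binary stream.

    Hypothetical helper: reads in chunks so large files need not fit in memory.
    """
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# The makefile case: same name ("Makefile"), different content.
makefile_a = io.BytesIO(b"all:\n\tcc -o app main.c\n")
makefile_b = io.BytesIO(b"all:\n\tcc -o lib lib.c\n")
# The backup case: different name, same content as makefile_a.
backup_of_a = io.BytesIO(b"all:\n\tcc -o app main.c\n")

id_a = content_id(makefile_a)
id_b = content_id(makefile_b)
id_backup = content_id(backup_of_a)

print(id_a != id_b)       # the two makefiles are told apart by content alone
print(id_a == id_backup)  # the backup is recognised as the same content
```

Note that under this assumption a backup and its original are deliberately indistinguishable by content, so telling them apart would still fall to metadata.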

If files are then thought of as containers for data that happen to have a unique internal identifier (which never needs to be exposed to the user, although it can be accessed as, say, the file’s inode number) then the idea of a content-based file manager comes into play. These examples work best with visual media, particularly images, but there is no reason in principle that this could not be extended.

Imagine searching for a file. You know it’s a photo, but you have a large collection of them. With a digital camera and a large-capacity memory card, who ever needs to delete a photo? We’ll assume that you’ve diligently tagged the photos with metadata as you’ve imported them from the camera, through some easy batch process.

On the tagging point: a lot can be taken from the metadata stored by the camera (date/time, resolution, orientation, black and white/colour, perhaps GPS co-ordinates) and with the right tools, more can be inferred (auto-tagging faces, buildings, perhaps recognising common events like football matches, converting GPS co-ordinates to places, …). As time goes on, people will need to do less and less manual tagging.

Anyway, back to the file manager. You know you are after a picture or a set of pictures. Normal thought processes will probably follow a path similar to: “Yeah, I wanted to show dad those photos from that holiday in Paris that we had two months ago. I think he’d particularly like the ones we got of the Louvre, as well as the ones with me in, of course.” The key words here (photos, holiday, Paris, two months ago, the Louvre, me) can be translated directly to metadata searches. Notice how these all involve a narrowing down of the query.

To convert these to filters, we then have:

type: photo
holiday
location: Paris
date: two months ago
at least one of:
- location: Louvre
- person: Me

This could also be represented by a query:

type:photo AND holiday AND location:Paris AND date:-2m
AND (location:Louvre OR person:me)
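As a rough illustration of how a query like this might be evaluated, here is a small Python sketch. The record layout, the sample photos and the reading of "date:-2m" as "the calendar month two months before today" are all assumptions made for the example, not Insight's actual data model or query semantics.

```python
from datetime import date


def months_ago(today, n):
    """Return (year, month) for the calendar month n months before `today`."""
    index = today.year * 12 + (today.month - 1) - n
    return index // 12, index % 12 + 1


def matches(photo, today):
    """Evaluate the example query against one (hypothetical) photo record."""
    return (
        photo["type"] == "photo"
        and "holiday" in photo["tags"]
        and "Paris" in photo["locations"]
        and (photo["taken"].year, photo["taken"].month) == months_ago(today, 2)
        and ("Louvre" in photo["locations"] or "me" in photo["people"])
    )


def record(name, taken, locations, people=()):
    """Build a sample record; every sample is a photo tagged 'holiday'."""
    return {"name": name, "type": "photo", "tags": {"holiday"},
            "taken": taken, "locations": set(locations), "people": set(people)}


photos = [
    record("louvre.jpg", date(2009, 4, 12), ["Paris", "Louvre"]),
    record("selfie.jpg", date(2009, 4, 13), ["Paris"], ["me"]),
    record("street.jpg", date(2009, 4, 13), ["Paris"]),            # no Louvre, no me
    record("rome.jpg", date(2009, 4, 20), ["Rome"], ["me"]),       # wrong city
    record("old_paris.jpg", date(2008, 4, 2), ["Paris", "Louvre"]),  # wrong date
]

today = date(2009, 6, 4)
results = [p["name"] for p in photos if matches(p, today)]
print(results)  # ['louvre.jpg', 'selfie.jpg']
```

Each AND clause only ever removes candidates, which is the narrowing-down effect; the OR group is the one place where either condition is enough.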

Breaking it down in this way feels fairly technical and wordy, however. I’d much prefer a visual view.

Imagine a black field, speckled with points of light representing your photos:

[Image: Content File Manager 1]

You filter by “holiday”, and (because it learns based on previous searches) it then groups by location. The ones which have been filtered out fade into nothing, and the photos group into labelled blobs and enlarge slightly:

[Image: Content File Manager 2]

You filter by date, and as you drag the slider, irrelevant items fade away and relevant ones enlarge:

[Image: Content File Manager 3]

Then you add the final filters and set the photos up for viewing, perhaps as a slideshow… and you’re done!

Pretty neat, I think.

6 Responses to A content-based file manager

David Durant says: June 4, 2009 at 23:55

I’m almost certainly too old as I still find the idea of visual clouds representing data cluttering and less useful than a hierarchy (although a hierarchy generated on the fly via metadata). No reason not to offer both I suppose. 🙂

I agree that if you have a need for the filesystem to have an internal identifier for the file when access is provided by metadata then content is as good a way as any to generate that (although realistically you could just as easily use pretty much anything – for example, the last-accessed time in seconds since the Epoch). I would suggest perhaps a hash rather than an inode as I’m sure some file systems would change the inode for a file in certain circumstances. However, hashing a large file (or a number of large files) is, of course, potentially computationally expensive.

I’d be interested in your thoughts on making this whole thing work at a low layer. It’s my (possibly incorrect) assumption that this is still a (SQLite?) database-based system sitting on top of an existing filesystem? I wonder if there is a way to eliminate the whole file system layer and work down at that level…?

Babul says: June 5, 2009 at 00:43

Very interesting article. Also liking the new blog layout, much nicer than before.

Dave says: June 5, 2009 at 11:14

@Babul: Thanks! I’m glad you like it. I’m not 100% happy with this layout, so I’m going to build my own… maybe this weekend. It is a lot less ugly than before.

@David Durant:

One of the other cool things to do would definitely be auto-generated hierarchies, but that either requires a file to have a file-system-unique basename, or for the basename to be autogenerated along with the hierarchy (which would be my preferred solution: context-sensitive file naming).

The internal identifier for a file is the inode, by definition. The inode number is unchanged during the lifetime of the file. If you were to create a file and hardlink it (e.g. “touch file1; ln file1 file2”) then both directory entries would share an inode number (as can be verified by “ls -i”). This makes it suitable and unique. Last access time since Epoch would not be unique, perhaps even down to nanosecond resolution. It’s perfectly possible to access or modify multiple files in under a second (think “grep foo *” or “sed 's/foo/bar/' *”).
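The hard-link behaviour can also be checked programmatically. This is a minimal Python sketch of the same “touch file1; ln file1 file2” experiment, run in a temporary directory; the function name is made up for the example, and it assumes a file system that supports hard links.

```python
import os
import tempfile


def hardlink_inodes():
    """Create a file and a hard link to it; return both inode numbers and the link count."""
    with tempfile.TemporaryDirectory() as workdir:
        file1 = os.path.join(workdir, "file1")
        file2 = os.path.join(workdir, "file2")
        open(file1, "w").close()  # equivalent of: touch file1
        os.link(file1, file2)     # equivalent of: ln file1 file2
        stat1, stat2 = os.stat(file1), os.stat(file2)
        return stat1.st_ino, stat2.st_ino, stat1.st_nlink


ino1, ino2, links = hardlink_inodes()
print(ino1 == ino2)  # both directory entries refer to the same inode
print(links)         # number of directory entries pointing at that inode
```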

My eventual plan is for this to work at the file system layer — it will tie directly into an Insight file system. If this was separate (i.e. to work for any file system) then I would probably store it using the same custom structure I use for Insight, which is optimised for this use case in a way that SQL/relational databases are not.

Ilya says: August 25, 2009 at 14:23

Hi guys,

I was having a think about this, and wondered whether the file system could mimic in some way how we file our own memories in our minds. This is highly conceptual, I know, but if you need uniqueness for a file system, why not simply use a unique id for the user of the file (creator/owner) and the timestamp created, plus maybe the duration/size of the file or the final timestamp [closed file]… The reason I came up with this is that when you recollect a memory, it’s based on context (emotion, content, people, interaction, senses) and the time it happened for you. So why not try to replicate this?

Technically I’m not clear on how this would work exactly, but as I see it you’re trying to merge the concept of uniqueness with the ability to have context. Uniqueness would be the ownerid & timestamp.

The problems, as I understand it, arise from the fact that if files are shared (which memories are not entirely, except explicitly), files can be owned by more than one person. This is where you can become creative, either by creating clones of files (with a new owner’s id and timestamp) or at least by creating links/references with the new owner id and timestamp, while the original context stays the same (with version control of the context for different users). Aha, what happens if the new owner wants to share the file with the originator? Well, that means the new owner would need to provide permission!

Now we get into types of permissions of files – implicit (owner), explicit (editable), information (read-only), data (tags/references)… this is the basis of knowledge management.

Anyway I hope some of that made sense… 🙂

Dave says: August 25, 2009 at 18:39

Hi Ilya,

The problem is in defining uniqueness in a way that’s easy for the user to understand. Internally we can just store a number that uniquely references the file just like most, if not all, other file systems.

I think the uniqueness would have to be context-related, as the way you would tell files apart would vary depending on what attributes they share and what you’ve already filtered on. Not all attributes are necessarily useful for uniquely identifying a file, but there needs to be something other than a name in this system.

As far as my own mind goes, I generally identify memories by a few key things: date, location, and people involved (if any). Something that I was doing at the time may also help identify things. This does of course rely on a few assumptions that aren’t necessarily true on a computer system.

A file may have a creation and last modified date associated with it, which would help for uniqueness. Location may help in two ways: location represented by the data, and also a visual/conceptual location of the data. People involved could also be represented in two ways: as owner/collaborator or as people represented by the data itself.

I feel I should explain “visual/conceptual location of the data” a little further. Visual location of the data could be likened to having a preferred layout for the icons on your desktop. After a little while, you know where something is by its position (and then its icon when you get closer), without having to read the labels. An example of a (not necessarily explicit) thought: “Office applications are up in the top right… then that’s the word processor icon…”

There is also conceptual location with traditional file systems, which the language we use reveals. “Oh, that file’s in folder FOO which is in either project XYZ or ABC” could represent a file “within”

/projects/xyz/foo

/projects/abc/foo

Ilya says: August 27, 2009 at 17:09

Hi Dave,

Interesting discussion, this. In my opinion, though, location (whether it is visual or structural) is an attribute of the content/context. I suppose what I mean is that there are plenty of visual search engines out there (just search for these), but where they fail (in my opinion) is that they are not personalized, and the visual data is represented based on some generic formula. This lack of personalization makes them complex and frustrating to use, because it takes too long to go through the copious amounts of data/files. Google are coming close with their visual representation of filing their videos (www.youtube.com/warp_speed) but this is only useful for browsing and having a bit of fun. To actually find a video, say about American Idol, the initial search word describing it would still be necessary.

With memories you place their location somewhere in your mind, and the way you recall them is by thinking: on Tues the 5th, at 11.30pm, at my house, on the couch, I read on TV that the stock of IBM went up 5%. This triggers something, because you have a linked memory of stocks that you purchased the week earlier; this links to your mate who told you that you should buy them, this links to your wife, whom you told, but she said it was a bad idea, and the list goes on…

So back to location as being an attribute. For me, memories come up not from where I’ve been (or place) but from other senses (touch, smell [highly sensitive], sounds, tastes, emotions, etc). Of course these are not possible to replicate via digital means as yet, but what’s interesting is having the ability to create links and connections between these attributes to provide meaningful results for search. I suppose what I am getting at is this: say a particular piece of music is playing in the background, and you automatically remember the location, the smells, who you were with, the emotions. Having the semantic file system work in this manner would provide a whole lot more interesting results. So a file located on the system can be retrieved through the context attributes which you save with it, i.e. meaningful extracted tag words, image recognition, sound attributes, etc. I know this is a little far out, but in my opinion it is the true meaning of semantic file storage. As an example, you search ‘last time I jammed with John’, and all the photos with John in them would come up, along with the music you have in common, the lyrics or documents you created, the emails/IMs you had discussing the jam sessions, the jam session events and future jam sessions scheduled, the instruments involved, and where you bought the instruments.

Now, how do we translate this into a digital format and create a file system? In my opinion, it cannot be linear. So in order to reference a file you have multiple attributes reflecting one file, but to store the file itself you only need some uniqueness so it doesn’t get mixed up with another file, so even a GUID (or a simplified version of a GUID) would work.

A Blog Less Ordinary
