Musings on Intelligence: thought experiments

Isn't intelligence just unbelievably annoying? How does it work? I spend many hours pondering this question. In this post I outline two of my more interesting thought experiments that aim to probe the answers. As I go through these in my head, I always think about how a robot could achieve these same "thoughts" or inferences. What kind of algorithms are required to at least approximately match my thinking process?

Thought Experiment #1. Try this: fully introspect your thinking process while doing a random, routine task. Suppose you're sitting at your desk in the office and suddenly decide to get some coffee from the coffee shop across the street. Think about every detail of your thought process as you go along: You form a plan to go down to the street. The plan is hierarchical in nature: overall goal, waypoints, immediate plans of getting from A to B, all of your muscle contractions that get executed to meet each tiny goal on the way... Just before you walk out of your office you slow down a bit in front of the door because the hallway can be full of people who may be walking quickly and are unaware of you. In other words, you're considering the possible dangers and planning ahead, minimizing the risk of undesirable outcomes. As you walk forward, a person is coming toward you. You immediately infer the goal of that person: They are most likely trying to pass you and continue on their way down the hallway. You steer slightly to the right and you anticipate them moving slightly to the left. You walk down the steps and you're about to open the door, but suddenly you notice a person coming in from the outside. Again, you understand that they want to come into the building. You immediately infer that they are likely to open the door. You also notice that the other person is not looking at you but slightly down at their feet while walking, so you infer that they are probably unaware of you. You step aside and wait for them to open the door and pass. Finally, you get to the shop and you see a line. You understand how a line works: people line up and wait for their turn to order things. You line up at the end because that is the right thing to do. You don't stand too far back and face elsewhere because other people who want to line up will be confused about whether or not you're waiting in line...

I feel like I'm doing an injustice to this exercise, but in general it is overwhelming to think about all the tiny inferences my brain is automatically making at any time. Now, how could a robot match similar processes or inferences? How could it ever learn what a line is at a coffee shop? How is it represented as a data structure in its memory? Or the fact that the rule is to "line up at the end of the line"? How could it ever understand that that person on the other side of the glass door had his own goal, and that in that particular moment his goal was to get into the building? How could it ever understand that the other person also has their knowledge base, and that since they were not looking at the robot they did not know it was there? And how could it ever decide that a particularly efficient way to handle the scenario was for it to step aside and wait for them to pass?

Thought Experiment #2. For my second thought experiment consider a slightly different setup. It is so ordinary and so boring, and yet, of all my experiments, I believe it reveals a lot about intelligence. It is inspired by a real-world situation: I was talking to a friend of mine at a party, when after a brief pause of us both taking a casual sip of our beverage, my friend suddenly asked: "Did you see John?" The question itself is trivial. The inferences that unfolded during my tiny state of confusion, on the other hand, are extraordinary if you try to enumerate them explicitly:

- John is probably a person. It probably isn't a movie, or a thousand other things.
- I can't think of a John I know at the moment. I know Johns, but I don't think my friend knows them.
- My friend would not ask me the question if he thought I did not know John. So he thinks I know a John.
- What is the set of people that we both know? Maybe I know John but only from seeing him? Maybe my friend doesn't know that I don't know him by name.
- What were we talking about just seconds before? We were talking about an assignment for a class. Is John in the class as well? Is there a person in the class who we sometimes hang out with and whose name I don't know, but should?
- Why is my friend asking this question? How does it fit with what we've just been talking about? How does it fit with what my friend would want to know at this moment?
- Is he merely thinking out loud, and does not really expect me to know John?
- Is he asking about the past? Did we ever talk about some John? Or is John a guest at the party that my friend is merely trying to find?
- Did I not hear my friend correctly? Maybe he meant Jen? We both know a Jen, but she doesn't fit too well into the context of the conversation moments ago. Is my friend trying to change the topic? Is there something interesting that happened with Jen in the last few days and maybe I don't know about it?
- Did my friend ever ask this question or a similar one before?

It feels like my brain went through hundreds of immediate hypotheses like the ones above, racing to make sense of the situation, striving to make it consistent. It felt like in a millisecond it tried to fit every hypothesis to the available data, and it felt like it retrieved vast amounts of past knowledge not only about the context of the situation at that time, but also the entire history of my relationship with my friend, and the events that unfolded moments ago. It felt like it was trying to find a hypothesis that "clicked". It considered not only my knowledge, but a model of knowledge of me from the perspective of my friend, and even my guess at his immediate intentions. In other words, somehow I maintain a model of what every person I know knows about me and the world, their attitudes toward me and the world, and the experiences and contexts we share. I also have an understanding of their personalities, and the kinds of things they are likely to talk to me about. Interestingly, I would also argue that I maintain a degree of certainty on every such piece of knowledge, sometimes only as a summary, and sometimes with pointers to events that led me to believe them.

It is quite amazing that our brains are capable of doing all this in fractions of a second, and they do it thousands and thousands of times a day. I believe that the process outlined above is at the heart of intelligence, in that it is just a single example of more general reasoning machinery that is used at any moment in time. The brain is, as best as I can describe it, a Hypothesis Generating Bayesian Scoring Machine. And don't get excited, by Bayesian I only mean the very simple idea that we have priors and assign likelihoods for every possible hypothesis, and we combine them in some way to get a winner: the hypothesis that "clicks" the best. And as far as I can tell, the inference is most similar to a kind of hybrid Loopy BP / MCMC scheme, where proposals that are based on experience are used to initialize hypotheses, and where a belief propagation-like procedure derives their consequences before scoring them.
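To make the "priors times likelihoods, pick the winner" idea concrete, here is a minimal Python sketch. It is only an illustration of that very simple Bayesian scoring step: the hypotheses and all of the numbers are made up for the "Did you see John?" moment, and it says nothing about how a brain (or a robot) would generate the hypotheses in the first place.

```python
def score_hypotheses(priors, likelihoods):
    """Combine prior and likelihood for each hypothesis and normalize."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical priors: how plausible each explanation is before
# considering what we were just talking about.
priors = {
    "John is in our class": 0.2,
    "John is a guest at the party": 0.3,
    "I misheard and he said Jen": 0.1,
    "He is just thinking out loud": 0.4,
}

# Hypothetical likelihoods: how well each explanation fits the evidence
# (seconds ago we were discussing a class assignment).
likelihoods = {
    "John is in our class": 0.6,
    "John is a guest at the party": 0.2,
    "I misheard and he said Jen": 0.1,
    "He is just thinking out loud": 0.1,
}

posteriors = score_hypotheses(priors, likelihoods)
best = max(posteriors, key=posteriors.get)  # the hypothesis that "clicks"

for hypothesis, probability in sorted(posteriors.items(), key=lambda kv: -kv[1]):
    print(f"{probability:.2f}  {hypothesis}")
```

The scoring is the easy part. The hard parts, which this sketch simply assumes away, are proposing good candidate hypotheses and coming up with the priors and likelihoods from context.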

In conclusion, these depressing thought experiments tell me that we are, indeed, very very (very!!) far from Artificial Intelligence. How can we write algorithms that can automatically explain data by generating and scoring hypotheses, while considering the full context? How do we write algorithms that understand and model intent, knowledge and goals of other agents? I don't have the answers, but one thing I do know is that there is no single machine learning system that I've heard of that I consider to be on the right path. I'm being harsh and my expectations are high, but my main concern is that our algorithms for the most part don't think, they compute boring feed-forward functions that depend on a fixed set of conveniently chosen parameters. An algorithm that attempts to model a mind must have a certain scent of meta... a scent that I have yet to feel.

 

My Last quarter: projects, courses, endeavors

First quarter at Stanford was extremely busy but a lot of fun. Here is the list of endeavors that kept me entertained:

1. I took two courses: Machine Learning with Andrew Ng, and Computer Vision with Fei-Fei Li. Both courses were fun, even though they contained mostly information I had already learned at UBC. Regardless, it was nice to hear it all again and get to practice it more.

2. I rotated in Daphne Koller's lab and worked on the Latent Structural Support Vector Machine. The optimization for LSSVMs is done in a coordinate-descent fashion: latent variables h are inferred given the weights of the SVM w, and then w is inferred given h. I worked on an extension to the first step: instead of inferring a fixed value of h, one tries to maintain a probability distribution over h. When inferring w in the second step, an expectation is calculated over h instead of simply using a fixed value. The intuition is that the algorithm should not be too hasty to commit to a bad h, or it can get stuck in a bad minimum. Of course, one pays a computational price for this, but the question was: is it worth it? As far as my experiments went with my specific data, the answer seems to be no. This general meta-issue is one that keeps coming up over and over again: Do you spend computational effort doing the right thing, or do you compute the wrong thing many times faster? In practice, the latter can be surprisingly effective.

Most importantly though, I reaffirmed during this rotation that this kind of work is not something I find personally appealing. I don't get excited about mathematically reformulating a problem in some slightly different form, and seeing it perform 1% better than state of the art on my favorite dataset. What motivates me best are more tangible projects that address large conceptual challenges. Projects that have the goal of AI in mind, or the goal of getting robots to live among us. Projects that have meta in them. Projects that can make me say embarrassing things, such as "This must be how the brain works".
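The difference between committing to a single h and keeping a distribution over h (the extension described in item 2 above) can be sketched in a few lines of Python. This is a toy illustration and not the code from the rotation: the softmax form of the distribution, the `temperature` parameter, and the tiny feature vectors are all assumptions made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hard_features(w, candidates):
    """Standard step: commit to the single best-scoring latent value h."""
    return max(candidates, key=lambda phi: dot(w, phi))


def soft_features(w, candidates, temperature=1.0):
    """Soft step: keep a distribution over h, return it with the expected features."""
    scores = [dot(w, phi) / temperature for phi in candidates]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [x / total for x in weights]
    dim = len(candidates[0])
    expected = [sum(p * phi[i] for p, phi in zip(probs, candidates)) for i in range(dim)]
    return probs, expected


# Hypothetical setup: one training example, three candidate latent values,
# each giving a 2-dimensional joint feature vector phi(x, h).
w = [1.0, 0.5]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

hard = hard_features(w, candidates)
probs, expected = soft_features(w, candidates)
print("hard choice:      ", hard)
print("distribution on h:", [round(p, 3) for p in probs])
print("expected features:", [round(e, 3) for e in expected])
```

As the temperature goes to zero the soft version collapses back to the hard one, so the hard update can be seen as a cheap special case. The extra cost of the soft version is that every candidate h contributes to the update for w, not just the winner.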

3. For my course project for both Computer Vision and Machine Learning, I was advised by Gary Bradski from Willow Garage and I worked on Object Detection. More specifically, I worked on extensions to the recently published (ICCV 2011) LINEMOD Object detector by Stefan Hinterstoisser. Stefan's work is essentially a super-fast, optimized implementation of template matching that can be applied to RGBD images (such as those coming from the Kinect) for object instance detection. I chose this project because it had all the tags necessary to get me excited: Kinect, Willow Garage, Object Detection, Super-fast, Vision, and Robots. In addition, I have this strange feeling that despite all the efforts that go into building clever systems for object detection, it will be common in 20 years to solve practical problems with template matching, naive Bayes, and bag of words models. In fact, I'm not entirely convinced that this is unlike what the visual cortex does in humans, at least for a large portion of the low-level processing.

However, clearly it is not practical to have a separate template for every possible view and for every possible object, so there must be mechanisms in place to scale the naive object-centric template matching strategy. I investigated two ways of scaling the algorithm. The first is based on the simple intuition that not all parts of the image should receive the same amount of attention in terms of matching, because it is possible to reject boring regions of images as candidates for objects based on very coarse matching at low resolution. I was able to use this (trivial) intuition to speed up the algorithm 20x without any loss in recall. The second is based on the observation that it would be nice if we didn't have to have a separate, large template for entire objects. Instead, I explored a Hough-voting approach where I detected little parts of objects, and had them vote for the object center. The intuition is that, for example, if you detect a bottle cap with high certainty, then a bottle center should be somewhere below. This turned out not to work too well but I was so puzzled by it that I kept searching, and indeed, shortly after the report was due I uncovered a severe bug in the code base I was using as a black box for matching that would directly lead to bad performance in these part-based experiments. Unfortunate!

I liked working on this project a lot! You can read my final report here. [PDF]
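For the curious, the Hough-voting idea from the project above (parts vote for where the object center should be) looks roughly like this in Python. This is a minimal sketch and not the code from my report: the part names, offsets, detections, and the `cell` size of the vote grid are all hypothetical.

```python
from collections import Counter


def hough_vote(detections, offsets, cell=10):
    """Let each detected part vote for an object center; return the winning cell."""
    votes = Counter()
    for part, (x, y), confidence in detections:
        dx, dy = offsets[part]
        cx, cy = x + dx, y + dy
        # Quantize votes into coarse cells so nearby votes accumulate.
        votes[(cx // cell, cy // cell)] += confidence
    (gx, gy), score = votes.most_common(1)[0]
    center = (gx * cell + cell // 2, gy * cell + cell // 2)
    return center, score


# Hypothetical offsets from each part to the bottle center, in pixels.
# Image y grows downward, so the center lies below the cap (positive dy).
offsets = {"cap": (0, 40), "label": (0, -5), "base": (0, -45)}

# Hypothetical part detections: (part name, (x, y) location, confidence).
detections = [
    ("cap", (100, 20), 0.9),
    ("label", (102, 66), 0.7),
    ("base", (101, 108), 0.6),
    ("cap", (200, 150), 0.3),  # a spurious detection elsewhere in the image
]

center, score = hough_vote(detections, offsets)
print("estimated bottle center:", center, "with vote mass", round(score, 2))
```

The three consistent parts pile their votes into the same cell, while the spurious cap ends up alone in its own cell and loses. The appeal is that small part templates can be shared across views and objects instead of storing one large template per view.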

4. Those of you who know me also know that I get very easily excited about anything Education. And since Andrew Ng's Machine Learning class was offered to the public online for free last quarter, I did not hesitate and volunteered almost 10-15 hours a week helping to prepare the programming assignments for the class. Looking back, it was probably not the best choice considering my career as a researcher, but I do not regret my choice. It was a lot of fun being involved in something I consider to be so ground-breaking, and I really hope that all the new initiatives that seek to revolutionize online education, such as Coursera, Udacity, and MITx go on to become very successful. And I hope I earned some bragging rights, because I'll be able to say that I was there, involved and at the heart of it when it all began.

 

This quarter I am rotating with Andrew Ng's group working with Adam Coates, and I am taking Convex Optimization with Stephen Boyd and Probabilistic Graphical Models with Daphne Koller and Kevin Murphy. More on this later! :)

My “Values and Assumptions about Teaching and Learning”

Those of you who know me well may also know that I get very passionate about education. I can write a whole other 10-page post on some of my thoughts on Khan Academy, and more recently the MLclass, AIclass, DBclass, etc. offered at Stanford. (By the way, update: I've volunteered to help make assignments for the ML class, and I LOVE to be a part of it). My name is on the "About us" page, and will go down in educational history! (ok just kidding, but I'm proud of it anyway :p)

For now, however, I wanted to share this writeup that I just randomly discovered hidden deep inside my Dropbox. It is my "Values and Assumptions about Teaching and Learning" that I submitted with my application for one of the top TA awards at the University of British Columbia last year. My application was rejected (which, by the way, I am bitter about because I think my application was overall very strong and there is no other student I know who worked even close to as hard as I did on my TA duties, who volunteered to TA more courses than was required, who volunteered many many more hours than he should have spent, who received identically near-perfect student evaluations every time.... I am normally a fairly modest person, but here I refuse. Ah well, hard work not recognized, fine with me.) Regardless, the writeup has some of my thoughts on what I learned while teaching (most of my experience was in teaching Tutorials - i.e. ~5-30 people per class with mean at around 20, and helping out students who worked on assignments in the learning center). Forgive the slight cheesiness of it at times :)

--------------------------------------------------

When I sometimes help a group of students along as they try to complete some problem, I wonder if they realize that I, as a teacher, am also in the process of solving an extremely difficult problem: that of teaching. It is very hard to over-estimate the difficulty of being an effective teacher. Even a simple question from a struggling student is often just the tip of an iceberg: a brief manifestation of a deeper misunderstanding. The task of the teacher is not to simply answer the question (that's easy!), but to first infer the exact shape and size of this iceberg, and then to address the source of the confusion. Over the last few years, I came to realize that teaching is one of the most intellectually demanding problems that I can hope to work on, and solving it correctly for some students, in some cases, is a great source of satisfaction.

I have accumulated many tips and tricks of teaching over the last two years, during which I conducted a tutorial almost every other day. In an effort to make my essay concrete, I will attempt to justify from experience a few of my core teaching principles. One of the first surprising discoveries I made when I started out was that being very comfortable with the content of the course was, paradoxically, detrimental to my ability to teach it. As I was trying to explain the material, I would frequently catch myself skipping over details in a problem derivation, simply because certain leaps of logic were obvious to me. For this reason, I volunteered to undertake the universally most hated task that a TA can have: marking assignments. Students are generally bad at conveying their misunderstanding, and are often even reluctant to admit it. A commonly occurring situation is that they aren't even aware of it in the first place. Overall, getting my hands dirty and poring over students' work in detail enabled me to more clearly understand the kinds of problems that often come up, it reminded me of all the little pieces of knowledge that I now take for granted, and ultimately led me to become a more effective instructor.

One of my other core principles was also strongly reinforced through personal experience. When I first started teaching, I felt very comfortable with the course material. After all, the course I taught only involved simple mathematics that I carried out many times since my first year. To my surprise, however, once I actually started teaching I realized that my understanding of these elementary concepts was only superficial, and often simply rule-driven. Forcing myself to make sense of it as I was explaining it to others led me directly toward a deeper understanding of all concepts and their relations. Similarly, as teachers we should encourage our students to not only passively absorb information, but to actively try to make sense of it through interaction, collaboration, and teaching.

My process of improvement as a teacher is not unlike the one that my students go through. We gradually learn to become better through long periods of sustained practice. I don't pretend to have anything figured out, but eagerly look forward to learning more.

Isaac Asimov’s I, Robot: thoughts

I finally had a chance to read *Isaac Asimov's I, Robot*.
It was certainly an interesting experience, given that the short stories in it were written around 1940-1950, but the events in the book take place around 2070. (i.e. right now in 2011 we are almost exactly halfway there)

The book contains 9 short stories, of which the ones I would most recommend are *Reason* and *Evidence*.

What strikes me as most interesting is the nature of predictions in the book. Some predictions are too pessimistic and some are too optimistic, but in funny ways. Here are examples:

- The robots in 2070 are described to be *heavy, metallic, and have diaphragms*. More likely, we'd now think that robots at that time will be made of super light-weight carbon fibers, and they certainly won't have diaphragms when we can just use speakers?

- Most interestingly, in charge of the hardcore theory of robots are ... *mathematicians*. In fact, the positronic brains are seen to yield *behavior based on solutions of differential equations*. These days, we would most likely not think of including (pure) mathematicians in robotics, and we rarely ever think of algorithms in AI/Machine Learning in terms of differential equations. (wait, should we? :) )

- One story mentions that the protagonist recorded a *video*, and that he *had to get it developed*. Interesting that it was not obvious that this limitation would be overcome by 2070, and that we wouldn't be using film.

- Even though some of the above contain severely pessimistic views of the world, Isaac imagines us to have *hyperatomic drives* in 2070, that allow for easy interstellar travel. It is strange to think that we can conquer space, but still need to "develop" a video.

Anyway, overall I liked the stories. Many of them essentially come down to an almost detective-like story, where there is something wrong with the robots, and the protagonist has to figure out how the observed behavior has come about from the 3 laws and logical inferences. In general I like the idea that sufficiently advanced robots will become so complicated that we will lose the ability to fully interpret their behavior. There will simply be too many moving parts, and what we observe in terms of the behavior will only ever be the tip of the iceberg. The underlying, perfectly deterministic and individually understandable complexity will simply collapse all together into one term, and we will call it *personality*. I look forward to these times, at some point around 2070 (sounds reasonable to me).

In the plex: fun quotes

A few weeks before my internship at Google I finished the book "In the plex", by Steven Levy (@stevenjayl). I found a few memorable quotes that I wrote down, and wanted to share them. The book is an interesting exploration of Google: how it started, its culture, philosophy, its legal struggles, etc. Parts of it were a very interesting read, such as the early Google and how it came about, the tension between product-oriented and revenue-oriented people, and Google's struggle in China. For the most part, the book reinforced my strong views that truly great and lasting tech products can only ever get built by selfless, forward-looking, ambitious tech geeks in the lead. People who are first and foremost interested in developing a great product, NOT making revenue off its users. If you build a good product, users will follow, and so will the revenue.

Now here are some random passages that stood out for me as funny/interesting:

- This one made me ROFL: (Larry&Sergey are showing their early early search algorithm to a (business man) CEO of Excite, and the search results are excellent) "Bell was visibly upset. The Stanford product was TOO good. If Excite were to host a search engine that instantly gave people information they sought, he explained, the users would leave the site instantly. Since his ad revenue came from people staying on the site-- "stickiness" was the most desired metric in websites at the time-- using [this] technology would be counterproductive." \facepalm (goes to reinforce my point above)

- Talking about Matt Cutts, who at first worked on Google SafeSearch. "Cutts asked his colleagues to help him locate adult websites so he could extract signals to better identify and block them, but everyone was too busy. "No one will help me look for porn!" he complained to his wife one night."

- "'Larry's the worst person you want designing your product- he's very smart but not your average user.' To avoid this situation, Chan had a strategy of giving him shiny objects to play with. At the beginning of one Google Voice product review, for instance, he offered Page, and Brin as well, the opportunity to pick their own phone numbers for the new service. For the next hour the founders brainstormed sequences that embodied mathematical puns, while the product sailed through the review." :)

- The Google approach: "Before the meeting, Pashupathy was warned not to ever bring cost into the discussion-- not to talk about return on investment. He was simply to look at the talent and the user value the project would bring."

- Based on things I've read, this funny excerpt pretty much summarizes the Larry and Eric relationship :) "'How many engineers does Microsoft have?' asked Page. About 25,000, Page was told. 'We should have a million,' said Page. Eric, accustomed to Page's hyperbolic responses by then, said, 'Come on, Larry, let's be real.'"

- Interesting, well-summarizing passage from the Google+China chapter, talking about one of the employees in the Google China office: "Her tenure came to an end when Google discovered that she had taken it upon herself to give Chinese officials new iPods. She had charged them to Google, and another executive had approved the charge. In the Chinese business culture such gifts were routine, but the act unambiguously violated Google policy... Google fired both her and the executive who had approved the expense.... she was dumbfounded that what she considered a normal business practice had led to her firing."

Good book! Recommended read :) You can also skip around chapters that interest you more, because they aren't strongly tied together by a single chronological story, but instead focus on separate aspects of the company and their struggles.

Top YouTube channels to subscribe to, and Google+

I've been inactive on this blog for a while, but I remain significantly more active on my Google+ account.

For example, today I posted about my top YouTube channels to subscribe to. There are many nerdy channels on YouTube that produce great content, and they should all be checked out and subscribed to. Let me know if you have more that I haven't discovered yet!

In general, I will reserve this blog for lengthy and substantial posts, but I will share many shorter snippets on Google+. I recommend you sign up and start using the service yourself, circle me, and let's find the best, nerdiest content on the internet together :)

I also use Google+ for academic discussions, and I've found many great researchers in Machine Learning / Computer vision / related on there. If you go through my circles, you will notice several academic "celebrities" :) It's great to see them online.

 

My upcoming whereabouts. I’m making a note here, huge success.

(Warning: Portal inside jokes around this post)

It's hard to overstate my satisfaction about my upcoming whereabouts:

- For the summer, I have accepted an offer for a summer internship at Google. I will be joining a secret bio/ML group there, working on creating the next Skynet. (joking! (or am i?)). I am quite excited about this offer, and eager to join my fellow interns at the Googleplex in Mountain View. To complement my excitement, I've also bought the recently released "In the plex" by Steven Levy. So far it's a great read, and I particularly enjoyed the back story on Larry and Sergey, running around the halls of the Gates building at Stanford.

- Speaking of Stanford and the Gates building, after long nights of obsessing about doing my PhD at Stanford or MIT, I've ended up choosing to continue my studies at Stanford. I expect that my work will be somewhere in the area of ML/Vision, which has been a passion of mine for as long as I can remember. In general, I want to work on machine learning approaches to perception, and eventually I may want to tackle harder problems of general intelligence. I'm very serious about sentient machines, let's do this! ...and change the world forever. Accordingly, I hope to be working with one of Andrew Ng/Fei-Fei Li/Daphne Koller/Sebastian Thrun (self-driving cars, cool!). I haven't made any hard arrangements yet because I plan to take advantage of the first year rotation program they just recently introduced at Stanford, which sounds great. (Basically, 3 month rotations with a different group each time).

In addition to the great faculty and the program, I am also very excited about Stanford because of its location at the heart of the Bay area. I can get very enthusiastic about new gadgets, apps, companies, and developments in the tech sector in general. I'm an early adopter of all kinds of crazy apps and new social media services and sites, so this is a great place to be in from this perspective. The culture in the area is very amusing :)

Still, saying no to MIT has haunted me since I officially committed. The faculty, students, environment-- all were great. I met with Antonio Torralba, Bill Freeman and Joshua Tenenbaum and they all made an excellent impression on me. I hope they will consider me again for a postdoc :( ?

For now, I'm excited and eager to start at both Google and then later at Stanford, but meanwhile first things first: I have yet to finish up the last details and paperwork for my thesis and draft 2 more papers, all in one month of work. But I'll do it... For science. Hop hop!

Lessons learned from manually classifying CIFAR-10 [with code]

CIFAR-10 is a popular vision classification dataset. It consists of 50,000 training images, all of them in 1 of 10 categories (displayed left). The test set consists of 10,000 novel images from the same categories, and the task is to classify each to its category. The state of the art is currently at about 80% classification accuracy (4000 centroids), achieved by Adam Coates et al. (PDF). This paper achieved the accuracy by using whitening, k-means to learn many centroids, and then using a soft activation function as features.

By the way, running their method with 1600 centroids gives 77% classification accuracy. If you set the clusters to be random, the accuracy becomes 92% on train and 70% on test. And if you set the clusters to be random patches from the training set, accuracy goes up to 74% on test and about 91% on train. It seems like the entire purpose of k-means there is to nicely spread out the clusters around the data. As the number of clusters grows, randomly sampling train data converges towards that. The 70% random clusters performance must be because many of the clusters are relatively too far away from data manifolds, and never become activated. So it's as if you had much fewer clusters to begin with.
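Here is a rough Python sketch of what a soft activation over centroids can look like, and why far-away centroids end up contributing nothing. I am assuming the "triangle" style activation f_k = max(0, mean(z) - z_k), where z_k is the distance from a patch to centroid k; treat the exact form as an assumption rather than a faithful copy of the paper's pipeline. Whitening is skipped, and the 2-dimensional "patches" and centroids are made up.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triangle_features(patch, centroids):
    """Soft activation: a centroid fires only if it is closer than average."""
    z = [distance(patch, c) for c in centroids]
    mean_z = sum(z) / len(z)
    return [max(0.0, mean_z - z_k) for z_k in z]


# Hypothetical centroids: two sit near the data, one is far from it
# (as a randomly placed centroid might be).
centroids = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
patch = [0.9, 0.1]

features = triangle_features(patch, centroids)
print([round(f, 3) for f in features])
```

The far centroid gets an activation of exactly zero for this patch, and it would for any patch near the other two. That is the sense in which badly placed clusters are wasted: it is as if you had fewer clusters to begin with.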

Anyway, over the weekend I wanted to see what kind of classification accuracy a human would achieve on this dataset. So I set out to write some quick MATLAB code that would provide the interface to do this. My classification accuracy was about 94% on 400 images (some images are really unfair), but more importantly I felt I learned something about the nature of this task by explicitly going through the data myself and thinking about why I classified something as one class rather than another.

To help in the process, for every testing image I also displayed a set of closest images in training set according to several distances. You can see examples of the interface by clicking on the images later in this post. The MATLAB code for this program can be found here: [MATLAB CODE].
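The nearest-image lookup is simple enough to sketch here. This is a small Python illustration rather than the MATLAB code linked above: the choice of L1 and L2 as the distances is an assumption, and the 4-pixel "images" are made up so the example stays readable.

```python
import math


def l1_distance(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, train, dist, k=3):
    """Return the indices of the k training images closest to the query."""
    order = sorted(range(len(train)), key=lambda i: dist(query, train[i]))
    return order[:k]


# Hypothetical flattened images (real CIFAR-10 images have 32*32*3 values).
train = [
    [4, 0, 0, 0],  # one large difference in a single pixel
    [2, 2, 2, 0],  # several small differences
    [9, 9, 9, 9],  # very different everywhere
]
query = [0, 0, 0, 0]

print("closest by L1:", nearest(query, train, l1_distance, k=2))
print("closest by L2:", nearest(query, train, l2_distance, k=2))
```

Note that the two distances disagree about which image is closest: L2 punishes one big pixel difference more than several small ones. That is one reason it is useful to look at neighbors under more than one distance.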

The lessons I learned:

- The objects within classes in this dataset are extremely varied. For example the "bird" class contains many different types of bird (both big birds and small). Not only are there many types of bird, but they occur at many possible magnifications, all possible angles and all possible poses. Sometimes only parts of the bird are shown. The poses problem is even worse for the dog/cat category, because these animals occur at many many different types of poses, and sometimes only the head is shown. Or the left part of the body, etc.

10 questionable images I took from a set of 50. 2nd from last is supposed to be a boat.

- My classification method felt strangely dichotomous. Sometimes you can clearly see the animal or object and classify it based on highly informative, distinct parts (for example, you find ears of a cat). Other times, my recognition was purely based on context and the overall cues in the image such as the colors.

- There are many distractors and occlusions in the images that surely confuse the classifier. For example, there are two images of the same type of plane in the same pose, but one has a cloud in the sky and the other doesn't. The HOG classifier got confused by the cloud, which threw off the entire prediction. How can the classifier "learn" something like clouds from this dataset, and separate them from planes? As a human, I had absolutely no trouble completely disregarding the cloud (which made up almost half of the image). It's just a cloud-- I have an understanding of the layered structure of the scene. How can a classifier learn reasoning of the following form: "It's an image of the sky because I see blue background and clouds. That little dark blob next to the cloud must be either a plane or some kind of bird. I see wing-like structures, and they seem a bit curvy. Also, planes are usually whiter and not that dark. Bird." Clearly, that's not the kind of reasoning the HOG classifier would produce.

few interface screencaps

My overall conclusions:

- The CIFAR-10 dataset is too small to properly contain examples of everything that it is asking for in the test set.

- Many images require high-level reasoning to classify, and not purely "appearance modeling".

- The 0/1 loss is not appropriate for this dataset, because in a great many images it is extremely hard to tell cat from dog, or horse from deer.

- The classifier I found myself internally using the most was some kind of a product of experts. I think I would search the image (at all scales and positions) for very informative parts that hint very strongly at the presence of one of the objects. For example, two dark dots that could indicate eyes. Or the legs of an animal like a horse or deer. I would also extract global scene features. What kind of scene is this? Natural? Water? Sky? I would infer this based on clouds or waves or background type. The information then gets merged to produce the object prediction. Finally, if nothing seemed to work, I predicted toad. (The toad images in this dataset are terrible. If you see a lot of brown noisy stuff, it's a toad.)

few more interface screencaps

- I don't quite understand how Adam Coates et al. perform so well on this dataset (80%) with the method they used. My guess is that it works along the following lines: if you squint at the image, you can almost always narrow the category down to about 2 or 3. The final disambiguation could come from finding good, specific, informative patches (like a patch of some kind of fur, or a pointy ear part, etc.)

- I don't think any method will go significantly higher than 80%, although improvements up to perhaps 85-90% might be possible.

- I'm not convinced this is a great dataset to work with, because I suspect it's too small, not only in the number of images but also in the size of each image.
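The "product of experts" idea from the conclusions above can be made concrete with a small sketch. This is an illustrative Python example, not part of the MATLAB download: the three experts, their class probabilities and the function name `product_of_experts` are all made up for the example. Each expert gives a probability per class, and the combined prediction multiplies them and renormalizes.

```python
def product_of_experts(expert_probs, eps=1e-6):
    """Multiply per-class probabilities from several experts, then renormalize.

    expert_probs: list of dicts mapping class name -> probability.
    eps keeps a single overconfident zero from vetoing a class outright.
    """
    classes = list(expert_probs[0].keys())
    scores = {}
    for c in classes:
        score = 1.0
        for probs in expert_probs:
            score *= max(probs.get(c, 0.0), eps)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical experts for the "dark blob next to a cloud" image.
part_expert = {"bird": 0.5, "airplane": 0.4, "cat": 0.05, "dog": 0.05}   # wing-like, curvy parts
scene_expert = {"bird": 0.4, "airplane": 0.5, "cat": 0.05, "dog": 0.05}  # background looks like sky
color_expert = {"bird": 0.6, "airplane": 0.2, "cat": 0.1, "dog": 0.1}    # object is dark, not white

combined = product_of_experts([part_expert, scene_expert, color_expert])
best = max(combined, key=combined.get)
print(best, round(combined[best], 3))
```

No single expert is decisive here (the scene expert alone would even prefer "airplane"), but the product settles on "bird" because it is the only class that every expert finds plausible.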

I encourage people to try this for themselves (see my code, above), as it is very interesting and fun! I have trouble exactly articulating what I learned, but overall I feel like I gained more intuition for image classification tasks.

BONUS: Included in the download above is also some code that I wrote to generate pretty confusion matrices. Below is a sample confusion matrix I created for the classification that comes from Coates et al. on CIFAR-10. Inside the code you can change a variable to make the confusion matrix larger and include more misclassified examples. I made confusion matrices for larger sizes too, and made them available here: [confusion matrices visualized (12.8MB)].

Visualized confusion matrix for 77% classification on CIFAR-10 by Coates et al. Predictions run along the vertical axis and ground truth along the horizontal axis.
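For readers who want the numbers behind such a figure, here is a minimal Python sketch of how a confusion matrix with this layout (prediction down the rows, ground truth across the columns) can be tallied. It is not the MATLAB code from the download, and the labels and predictions are toy values chosen for illustration.

```python
def confusion_matrix(truth, predicted, labels):
    """Count (prediction, ground truth) pairs.

    Rows index the predicted label, columns index the true label,
    matching the orientation of the figure above.
    """
    index = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    matrix = [[0] * n for _ in range(n)]
    for t, p in zip(truth, predicted):
        matrix[index[p]][index[t]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of examples on the diagonal (prediction == truth)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy example: one cat is mistaken for a dog.
labels = ["cat", "dog"]
truth = ["cat", "cat", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog"]
matrix = confusion_matrix(truth, predicted, labels)
print(matrix, accuracy(matrix))
```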

Quick experiment: vision + distances in high dimensions

When one is working with images, there is a natural tendency to treat them as vectors in a very high-dimensional pixel space for convenience. Basically, every dimension of the vector corresponds to a single R/G/B pixel value in the image. However, as a rule of thumb, it is best to avoid high-dimensional spaces whenever possible because they produce a large number of non-intuitive results. For example, the volume of the unit n-dimensional ball reaches a peak at about n=5, and then goes to zero as n -> infinity. What's up with that?
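The ball-volume claim is easy to check numerically. The sketch below uses the standard closed form for the volume of a ball of radius 1 in n dimensions, V(n) = pi^(n/2) / Gamma(n/2 + 1); the function name is just a label for this example.

```python
import math


def unit_ball_volume(n):
    """Volume of the radius-1 ball in n dimensions."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


volumes = {n: unit_ball_volume(n) for n in range(1, 31)}
peak = max(volumes, key=volumes.get)

for n in (1, 2, 3, 5, 10, 20, 30):
    print(n, volumes[n])
print("largest volume at n =", peak)
```

Note that the peak at n=5 is specific to radius 1. For a different radius the maximum moves, but for any fixed radius the volume still goes to zero as n grows.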

Anyway, getting back to images, here is an experiment I did in MATLAB this morning. The image in the top left is a random image from Google if you search for "face". The other 3 images are all modifications of this image: the first one is shifted to the left a bit, the second one is... "messed up", and the third one is darkened. Here's the kicker: all 3 images are the same (L2) distance away from the original in this high-dimensional pixel space.

I just thought it was funny/interesting and worth a share. The lesson for the kids is to not trust distance metrics in high-dimensional spaces, and to realize that pixel space is almost always NOT the space you want to work in or think about when dealing with images.
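One way to reproduce this kind of comparison is sketched below in plain Python. This is not the original MATLAB experiment: it uses a small random grayscale "image" stored as a flat list, a one-pixel wrap-around shift, Gaussian noise as the "messed up" variant, and no clipping of pixel values. All of those are assumptions made to keep the example short. The idea is to measure the L2 distance of the shifted image, then scale the other two perturbations to exactly that distance.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two images stored as flat lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def equal_distance_variants(img, width, seed=0):
    """Return (shifted, noisy, darker, d), all at L2 distance d from img."""
    rng = random.Random(seed)

    # 1. Shift every row one pixel to the left (wrapping around).
    shifted = []
    for start in range(0, len(img), width):
        row = img[start:start + width]
        shifted.extend(row[1:] + row[:1])
    d = l2(img, shifted)

    # 2. Random noise, rescaled so its norm is exactly d.
    noise = [rng.gauss(0.0, 1.0) for _ in img]
    norm = math.sqrt(sum(v * v for v in noise))
    noisy = [p + d * v / norm for p, v in zip(img, noise)]

    # 3. Uniform darkening: subtracting c from N pixels moves c * sqrt(N).
    c = d / math.sqrt(len(img))
    darker = [p - c for p in img]

    return shifted, noisy, darker, d


rng = random.Random(1)
img = [rng.random() for _ in range(8 * 8)]  # toy 8x8 image in [0, 1)
shifted, noisy, darker, d = equal_distance_variants(img, width=8)
print(d, l2(img, noisy), l2(img, darker))
```

The three variants look nothing alike to a person, yet pixel-space L2 distance cannot tell them apart.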

The set of things I was recently impressed with

EVENTS

- Twitter Announces Fire Hose Marketplace: Up to 10k Keyword Filters for 30 cents: The fire hose is a stream of all tweets on Twitter. Very few companies have access to it because it is costly to get access, and hard to process. Twitter has announced that they are partnering with a company that will act something like a fire hose re-seller. Think: Twitter Search on steroids for a small fee. Exciting also for academics, but probably mostly for companies.

TALKS

- Sebastian Thrun gave a great talk at TED about the self-driving car project he's working on with Google. The car can now actually drive on streets. There is a lot of exciting technology that goes into this. Some pretty visualizations are shown about halfway through the video.

- Robert Schapire on Boosting, Tutorial on videolectures.net: I was pleasantly surprised by the quality and clarity of this tutorial on boosting. Very accessible to beginners. Covers AdaBoost with C4.5 decision trees. In general, I have recently become very interested in ensemble methods, and in particular in Random Decision Forests (see the Kinect paper below). I know I am late to the party, but I feel that the idea of combining many non-homogeneous weak classifiers into a strong one is very sound. There are so many papers in vision that compute features on images, and then run some simple linear classifiers on top. I wonder if ensemble methods with a small number of linear classifiers can almost always do better? I find it hard to believe that you would suddenly overfit if you only included an extra one or two linear boundaries in the final classification process. I wish I knew more about this topic and the associated empirical results.
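To make the "many weak classifiers into a strong one" idea concrete, here is a minimal AdaBoost sketch in Python. It uses one-feature threshold stumps as the weak learners rather than the C4.5 trees covered in the tutorial, and the toy data set is invented for the example: the positive class is an interval on a line, which no single threshold can separate.

```python
import math


def stump_predict(stump, x):
    """A stump is (feature, threshold, polarity); it outputs +1 or -1."""
    feature, threshold, polarity = stump
    return polarity if x[feature] >= threshold else -polarity


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best_err, best_stump = None, None
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if best_err is None or err < best_err:
                    best_err, best_stump = err, stump
    return best_err, best_stump


def adaboost(X, y, rounds):
    """Return a list of (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, stump = train_stump(X, y, w)
        if err >= 0.5:            # no stump beats chance; stop early
            break
        err = max(err, 1e-12)     # avoid division by zero on a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified points, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy data: points 3..6 are positive, everything else negative.
X = [[float(i)] for i in range(10)]
y = [1 if 3 <= i <= 6 else -1 for i in range(10)]

single = adaboost(X, y, rounds=1)
boosted = adaboost(X, y, rounds=3)
print([predict(single, x) for x in X])
print([predict(boosted, x) for x in X])
```

On this toy problem one stump misclassifies three points, while a weighted vote of three stumps fits the interval exactly. It says nothing about generalization, which is the part of the question that needs empirical results.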

PAPERS

- BRIEF descriptor (Binary Robust Independent Elementary Features), from ECCV, by Calonder et al. (link to PDF). The authors introduce a cute bitstring descriptor. Every bit is simply the result of a test that determines if one pixel is brighter than another, somewhere in a neighborhood of a point. Smoothing the image with a Gaussian as a preprocessing step improves results. The final method is extremely fast for matching because it only relies on computing the Hamming distance between bitstrings. I was just surprised that such a trivial idea can still lead to great results. The thing I like about these kinds of representations is that you get to tune the number of bits to your needs: using more bits monotonically leads to better performance (up to a point). Future work will be focused on making the descriptor scale and orientation invariant to compete with SIFT and SURF. As a shortcoming: I'm a bit confused about how their method handles smooth parts of the image.
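The core of a BRIEF-style descriptor fits in a few lines. The sketch below is a simplified Python illustration, not the authors' implementation: test locations are drawn uniformly at random inside the patch (an assumption of this sketch), the image is assumed to be already smoothed, and the function names are made up.

```python
import random


def make_test_pairs(n_bits, patch_radius, seed=0):
    """Pick n_bits random pairs of pixel offsets inside the patch."""
    rng = random.Random(seed)

    def offset():
        return (rng.randint(-patch_radius, patch_radius),
                rng.randint(-patch_radius, patch_radius))

    return [(offset(), offset()) for _ in range(n_bits)]


def brief_descriptor(image, row, col, pairs):
    """One bit per pair: is the first pixel darker than the second?

    image is a 2D list of brightness values; (row, col) must be at least
    patch_radius away from the border. Returns the bits packed in an int.
    """
    bits = 0
    for (dr1, dc1), (dr2, dc2) in pairs:
        a = image[row + dr1][col + dc1]
        b = image[row + dr2][col + dc2]
        bits = (bits << 1) | (1 if a < b else 0)
    return bits


def hamming(d1, d2):
    """Number of differing bits between two descriptors."""
    return bin(d1 ^ d2).count("1")


rng = random.Random(3)
image = [[rng.randint(0, 200) for _ in range(9)] for _ in range(9)]
brighter = [[v + 10 for v in r] for r in image]  # same patch, uniformly brighter

pairs = make_test_pairs(n_bits=32, patch_radius=4)
d1 = brief_descriptor(image, 4, 4, pairs)
d2 = brief_descriptor(brighter, 4, 4, pairs)
print(hamming(d1, d2))
```

Because every bit is a comparison, adding a constant to all pixels leaves the descriptor unchanged. The same property hints at the smooth-region concern: where both pixels are nearly equal, the outcome of the test is decided by noise.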

- (CHoG) Compressed Histogram of Gradients (PDF) from CVPR: *I love this paper*. It also tries to create feature descriptors for images. An idea that I had not come across before, but that I really liked, was to create histograms of the joint distribution of (dx, dy) gradients in the patch, instead of simply building a histogram of orientations (Figure 1 illustrates). This has the obvious benefit of not introducing noise when the gradients in the image are too weak: this method places those pixels into a bin that essentially says "smooth patch", whereas the normal HoG descriptor would force a smooth patch into some particular orientation bin. It seems intuitive that this approach should do better, and I'm a bit surprised that I haven't seen more of it. Again, I wish I had more empirical experience with this kind of binning.

The authors then go on to a very interesting descriptor compression technique. Once they have the distribution over joint (dx, dy) in the patch summarized in the histogram, they compute a Huffman tree for the distribution. Next, they enumerate every possible Huffman tree and assign a fixed-length code to each possible tree. So at this point, for every patch we wish to describe, we can get the ID of the Huffman tree that best corresponds to the distribution of the (dx, dy) gradients in that patch. But why stop there? Not all trees are equally likely to occur! So more compression is achieved by entropy coding the fixed-length indices. The matching is also interesting because they pre-compute distances from all trees to all trees (in compressed form). So when matching two patches, you compute the low-bit representation for each, and then use the lookup table for the distance.

Here's the most interesting result. In the experiments they ran, they can match the performance of SIFT using 53-bit descriptors (about 20x compression from normal). There are just too many more cool results and ideas in this paper for me to enumerate. I'll stop here.
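The difference between the two kinds of binning can be seen in a toy Python sketch. The 3x3 grid over (dx, dy) and the `weak` threshold are assumptions made for illustration and are not the bin layouts used in the paper. The point is only that a joint histogram has a natural home for near-zero gradients, while an orientation histogram has to assign them some angle.

```python
import math


def joint_gradient_histogram(gradients, weak=0.1):
    """3x3 histogram over (dx, dy); the center cell collects weak gradients."""
    def cell(v):
        if v < -weak:
            return 0
        if v > weak:
            return 2
        return 1

    hist = [[0] * 3 for _ in range(3)]
    for dx, dy in gradients:
        hist[cell(dy)][cell(dx)] += 1
    return hist


def orientation_histogram(gradients, n_bins=8):
    """Histogram of gradient angle only; magnitude is ignored."""
    hist = [0] * n_bins
    for dx, dy in gradients:
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int(angle / (2 * math.pi) * n_bins) % n_bins] += 1
    return hist


# Two nearly flat pixels and one strong horizontal edge.
gradients = [(0.01, 0.02), (-0.02, 0.01), (1.0, 0.0)]
joint = joint_gradient_histogram(gradients)
orient = orientation_histogram(gradients)
print(joint)
print(orient)
```

In the joint histogram both flat pixels land in the center "smooth" cell. In the orientation histogram they occupy two different angle bins and look just as significant as the real edge.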

- Human Motion Capture for Xbox Kinect: Project page (includes video), also (Direct link to PDF), CVPR. I thought this was a very interesting paper to read as well. The paper describes the software that runs on Kinect and addresses the problem of finding a person in a depth image. They do this in four steps. 1. They synthesize a huge amount of depth data using computer graphics: they put people of various proportions into different poses, and collect the ground truth pose for every depth image. They use a bit of real mocap data too. 2. They come up with an intermediate representation of 31 body parts. 3. The learning problem then becomes one of multiclass classification: given a depth patch, which body part is at the center? 4. Once every pixel is (softly) classified as one of the 31 parts, they use mean-shift or similar to find modes and predict the location of each part.

To solve the learning problem they use a clever descriptor that strongly reminds me of BRIEF. However, instead of doing tests of image brightness, they look at two pixels and compare their depths. Then they use Random Decision Forests with decision stumps to create the classifier, by repeatedly proposing two pixel locations + a depth threshold, and picking the one with the largest gain in information (i.e., the standard way of doing things). They can train 3 separate trees on disjoint training data (which they can do because they are swimming in data), and their final classifier is a consensus vote. This consensus vote, together with a huge amount of data, leads to a very good classifier with great generalization. They have videos, and it seems to work very well. I recommend this paper for reading! A++ :)
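The ingredients described above (a two-pixel depth comparison, a threshold chosen by information gain, and a consensus vote over trees) can be sketched in Python as follows. This is a simplified illustration and not the paper's implementation: offsets are fixed pixel offsets, out-of-bounds probes return an assumed large `background` depth, and all names and numbers are invented for the example.

```python
import math
from collections import Counter


def depth_feature(depth, row, col, u, v, background=10.0):
    """Difference in depth between two probe pixels offset from (row, col)."""
    def probe(offset):
        r, c = row + offset[0], col + offset[1]
        if 0 <= r < len(depth) and 0 <= c < len(depth[0]):
            return depth[r][c]
        return background  # assumption: off-image counts as far away
    return probe(u) - probe(v)


def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(labels, goes_left):
    """Entropy reduction when a stump splits labels into left and right."""
    left = [l for l, g in zip(labels, goes_left) if g]
    right = [l for l, g in zip(labels, goes_left) if not g]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def consensus(distributions):
    """Average the per-tree class distributions and return the top class."""
    labels = distributions[0].keys()
    avg = {l: sum(d[l] for d in distributions) / len(distributions)
           for l in labels}
    return max(avg, key=avg.get), avg


# Toy depth image: near surface (1.0) on the left, far wall (5.0) on the right.
depth = [[1.0, 1.0, 5.0] for _ in range(3)]
f = depth_feature(depth, 1, 1, u=(0, 1), v=(0, -1))

labels = ["hand", "hand", "head", "head"]
good_split = information_gain(labels, [True, True, False, False])
bad_split = information_gain(labels, [True, False, True, False])

part, avg = consensus([
    {"hand": 0.7, "head": 0.3},
    {"hand": 0.4, "head": 0.6},
    {"hand": 0.8, "head": 0.2},
])
print(f, good_split, bad_split, part)
```

Training a tree then amounts to proposing many (u, v, threshold) candidates at each node and keeping the one with the highest information gain.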

A recurring theme I have seen come up a lot recently is that these simple ensemble methods, making simple decisions on a very large amount of data, do very well. And where can we get a large amount of data? Virtual worlds. I would love to see someone implement these ideas for object recognition, where training is done for the most part in a virtual world, plus a bit of transfer learning in the real world.