My “Values and Assumptions about Teaching and Learning”

Those of you who know me well may also know that I get very passionate about education. I could write a whole other 10-page post on some of my thoughts on Khan Academy, and more recently the MLclass, AIclass, DBclass, etc. offered at Stanford. (By the way, update: I've volunteered to help make assignments for the ML class, and I LOVE being a part of it.) My name is on the "About us" page, and will go down in educational history! (ok just kidding, but I'm proud of it anyway :p)

For now, however, I wanted to share this writeup that I just randomly discovered hidden deep inside my Dropbox. It is my "Values and Assumptions about Teaching and Learning" that I submitted with my application for one of the top TA awards at the University of British Columbia last year. My application was rejected (which, by the way, I am bitter about because I think my application was overall very strong and there is no other student I know who worked even close to as hard as I did on my TA duties, who volunteered to TA more courses than was required, who volunteered many many more hours than he should have spent, who received consistently near-perfect student evaluations every time.... I am normally a fairly modest person, but here I refuse. Ah well, hard work not recognized, fine with me.) Regardless, the writeup has some of my thoughts on what I learned while teaching (most of my experience was in teaching Tutorials, i.e. ~5-30 people per class with a mean of around 20, and helping out students who worked on assignments in the learning center). Forgive the slight cheesiness of it at times :)

--------------------------------------------------

When I sometimes help a group of students along as they try to complete some problem, I wonder if they realize that I, as a teacher, am also in the process of solving an extremely difficult problem: that of teaching. It is very hard to over-estimate the difficulty of being an effective teacher. Even a simple question from a struggling student is often just the tip of an iceberg: a brief manifestation of a deeper misunderstanding. The task of the teacher is not to simply answer the question (that's easy!), but to first infer the exact shape and size of this iceberg, and then to address the source of the confusion. Over the last few years, I came to realize that teaching is one of the most intellectually demanding problems that I can hope to work on, and solving it correctly for some students, in some cases, is a great source of satisfaction.

I have accumulated many tips and tricks of teaching over the last two years, during which I conducted a tutorial almost every other day. In an effort to make my essay concrete, I will attempt to justify from experience a few of my core teaching principles. One of the first surprising discoveries I made when I started out was that being very comfortable with the content of the course was, paradoxically, detrimental to my ability to teach it. As I was trying to explain the material, I would frequently catch myself skipping over details in a problem derivation, simply because certain leaps of logic were obvious to me. For this reason, I volunteered to undertake the universally most hated task that a TA can have: marking assignments. Students are generally bad at conveying their misunderstanding, and are often even reluctant to admit it. A commonly occurring situation is that they aren't even aware of it in the first place. Overall, getting my hands dirty and poring over students' work in detail enabled me to more clearly understand the kinds of problems that often come up, it reminded me of all the little pieces of knowledge that I now take for granted, and ultimately led me to become a more effective instructor.

One of my other core principles was also strongly reinforced through personal experience. When I first started teaching, I felt very comfortable with the course material. After all, the course I taught only involved simple mathematics that I had carried out many times since my first year. To my surprise, however, once I actually started teaching I realized that my understanding of these elementary concepts was only superficial, and often simply rule-driven. Forcing myself to make sense of it as I was explaining it to others led me directly toward a deeper understanding of all the concepts and their relations. Similarly, as teachers we should encourage our students not only to passively absorb information, but to actively try to make sense of it through interaction, collaboration, and teaching.

My process of improvement as a teacher is not unlike the one that my students go through. We gradually learn to become better through long periods of sustained practice. I don't pretend to have anything figured out, but eagerly look forward to learning more.

Isaac Asimov’s I, Robot: thoughts

I finally had a chance to read *Isaac Asimov's I, Robot*.
It was certainly an interesting experience, given that the short stories in it were written around 1940-1950, but the events in the book take place around 2070 (i.e. right now in 2011 we are almost exactly halfway there).

The book contains 9 short stories, of which the ones I would most recommend are *Reason* and *Evidence*.

What strikes me as most interesting is the nature of predictions in the book. Some predictions are too pessimistic and some are too optimistic, but in funny ways. Here are examples:

- The robots in 2070 are described as *heavy, metallic, and having diaphragms*. More likely, we'd now think that robots at that time will be made of super light-weight carbon fibers, and they certainly won't have diaphragms when we can just use speakers?

- Most interestingly, in charge of the hardcore theory of robots are ... *mathematicians*. In fact, the positronic brains are seen to yield *behavior based on solutions of differential equations*. These days, we would most likely not think of including (pure) mathematicians in robotics, and we rarely ever think of algorithms in AI/Machine Learning in terms of differential equations. (wait, should we? :) )

- One story mentions that the protagonist recorded a *video*, and that he *had to get it developed*. Interesting that it was not obvious that this limitation would be overcome by 2070, and that we wouldn't be using film.

- Even though some of the above contain severely pessimistic views of the world, Asimov imagines us to have *hyperatomic drives* in 2070 that allow for easy interstellar travel. It is strange to think that we can conquer space, but still need to "develop" a video.

Anyway, overall I liked the stories. Many of them essentially come down to an almost detective-like story, where there is something wrong with the robots, and the protagonist has to figure out how the observed behavior has come about from the 3 laws and logical inferences. In general I like the idea that sufficiently advanced robots will become so complicated that we will lose the ability to fully interpret their behavior. There will simply be too many moving parts, and what we observe in terms of the behavior will only ever be the tip of the iceberg. The underlying, perfectly deterministic and individually understandable complexity will simply collapse all together into one term, and we will call it *personality*. I look forward to these times, at some point around 2070 (sounds reasonable to me).

In the Plex: fun quotes

A few weeks before my internship at Google I finished the book "In the Plex", by Steven Levy (@stevenjayl). I found a few memorable quotes that I wrote down, and wanted to share them. The book is an interesting exploration of Google: how it started, its culture, philosophy, its legal struggles, etc. Parts of it were a very interesting read, such as the early Google and how it came about, the tension between product-oriented and revenue-oriented people, and Google's struggle in China. For the most part, the book reinforced my strong views that truly great and lasting tech products can only ever get built with selfless, forward-looking, ambitious tech geeks in the lead. People who are first and foremost interested in developing a great product, NOT making revenue off its users. If you build a good product, users will follow, and so will the revenue.

Now here are some random passages that stood out for me as funny/interesting:

- This one made me ROFL: (Larry&Sergey are showing their early early search algorithm to a (business man) CEO of Excite, and the search results are excellent) "Bell was visibly upset. The Stanford product was TOO good. If Excite were to host a search engine that instantly gave people information they sought, he explained, the users would leave the site instantly. Since his ad revenue came from people staying on the site-- "stickiness" was the most desired metric in websites at the time-- using [this] technology would be counterproductive." \facepalm (goes to reinforce my point above)

- Talking about Matt Cutts, who at first worked on Google Safe search. "Cutts asked his colleagues to help him locate adult websites so he could extract signals to better identify and block them, but everyone was too busy. "No one will help me look for porn!" he complained to his wife one night"

- "'Larry's the worst person you want designing your product -- he's very smart but not your average user.' To avoid this situation, Chan had a strategy of giving him shiny objects to play with. At the beginning of one Google Voice product review, for instance, he offered Page, and Brin as well, the opportunity to pick their own phone numbers for the new service. For the next hour the founders brainstormed sequences that embodied mathematical puns, while the product sailed through the review." :)

- The Google approach: "Before the meeting, Pashupathy was warned not to ever bring cost into the discussion-- not to talk about return on investment. He was simply to look at the talent and the user value the project would bring."

- Based on things I've read, this funny excerpt pretty much summarizes Larry and Eric's relationship :) "'How many engineers does Microsoft have?' asked Page. About 25,000, Page was told. 'We should have a million,' said Page. Eric, accustomed to Page's hyperbolic responses by then, said, 'Come on, Larry, let's be real.'"

- An interesting passage that sums up the Google-in-China chapter well, about one of the employees in the Google China office: "Her tenure came to an end when Google discovered that she had taken it upon herself to give Chinese officials new iPods. She had charged them to Google, and another executive had approved the charge. In the Chinese business culture such gifts were routine, but the act unambiguously violated Google policy... Google fired both her and the executive who had approved the expense.... she was dumbfounded that what she considered a normal business practice had led to her firing."

Good book! Recommended read :) You can also skip around chapters that interest you more, because they aren't strongly tied together by a single chronological story, but instead focus on separate aspects of the company and their struggles.

Top YouTube channels to subscribe to, and Google+

I've been inactive on this blog for a while, but I remain significantly more active on my Google+ account.

For example, today I posted about my top YouTube channels to subscribe to. There are many nerdy channels on YouTube that produce great content, and they should all be checked out and subscribed to. Let me know if you have more that I haven't discovered yet!

In general, I will reserve this blog for lengthy and substantial posts, but I will share many shorter snippets on Google+. I recommend you sign up and start using the service yourself, circle me, and let's find the best, nerdiest content on the internet together :)

I also use Google+ for academic discussions, and I've found many great researchers in Machine Learning / Computer vision / related on there. If you go through my circles, you will notice several academic "celebrities" :) It's great to see them online.

 

My upcoming whereabouts. I’m making a note here, huge success.

(Warning: Portal inside jokes around this post)

It's hard to overstate my satisfaction about my upcoming whereabouts:

- For the summer, I have accepted an offer for an internship at Google. I will be joining a secret bio/ML group there, working on creating the next Skynet. (joking! (or am i?)). I am quite excited about this offer, and eager to join my fellow interns at the Googleplex in Mountain View. To complement my excitement, I've also bought the recently released "In the Plex" by Steven Levy. So far it's a great read, and I particularly enjoyed the back story on Larry and Sergey, running around the halls of the Gates building at Stanford.

- Speaking of Stanford and the Gates building, after long nights of obsessing about doing my PhD at Stanford or MIT, I've ended up choosing to continue my studies at Stanford. I expect that my work will be somewhere in the area of ML/Vision, which has been a passion of mine for as long as I can remember. In general, I want to work on machine learning approaches to perception, and eventually I may want to tackle harder problems of general intelligence. I'm very serious about sentient machines, let's do this! ...and change the world forever. Accordingly, I hope to be working with one of Andrew Ng/Fei-Fei Li/Daphne Koller/Sebastian Thrun (self-driving cars, cool!). I haven't made any hard arrangements yet because I plan to take advantage of the first-year rotation program they just recently introduced at Stanford, which sounds great. (Basically, 3-month rotations with a different group each time).

In addition to the great faculty and the program, I am also very excited about Stanford because of its location at the heart of the Bay Area. I can get very enthusiastic about new gadgets, apps, companies, and developments in the tech sector in general. I'm an early adopter of all kinds of crazy apps and new social media services and sites, so this is a great place to be from this perspective. The culture in the area is very amusing :)

Still, saying no to MIT has haunted me since I officially committed. The faculty, students, environment-- all were great. I met with Antonio Torralba, Bill Freeman and Joshua Tenenbaum and they all made an excellent impression on me. I hope they will consider me again for a postdoc :( ?

For now, I'm excited and eager to start at Google and then later at Stanford, but first things first: I still have to finish up the last details and paperwork for my thesis and draft 2 more papers, all in one month of work. But I'll do it... For science. Hop hop!

Lessons learned from manually classifying CIFAR-10 [with code]

CIFAR-10 is a popular vision classification dataset. It consists of 50,000 training images, all of them in 1 of 10 categories (displayed left). The test set consists of 10,000 novel images from the same categories, and the task is to classify each to its category. The state of the art is currently at about 80% classification accuracy (4000 centroids), achieved by Adam Coates et al. (PDF). This paper achieved the accuracy by using whitening, k-means to learn many centroids, and then using a soft activation function as features.

By the way, running their method with 1600 centroids gives 77% classification accuracy. If you set the clusters to be random, the accuracy becomes 92% on train and 70% on test. And if you set the clusters to be random patches from the training set, accuracy goes up to 74% on test and about 91% on train. It seems like the entire purpose of k-means there is to nicely spread out the clusters around the data. As the number of clusters grows, randomly sampling train data converges towards that. The 70% random clusters performance must be because many of the clusters are too far away from the data manifolds, and never become activated. So it's as if you had far fewer clusters to begin with.
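
To make the "never become activated" point concrete, here is a minimal NumPy sketch of a soft activation of the kind described above. I'm assuming the "triangle" form, max(0, mean distance - distance to centroid k), which is my reading of the paper and not something taken from their code. The patch data is random stand-in data and the function name is made up for illustration.

```python
import numpy as np

def soft_activation(patches, centroids):
    """Soft features: max(0, mean distance to all centroids - distance to centroid k)."""
    # Pairwise Euclidean distances, shape (n_patches, n_centroids).
    diffs = patches[:, None, :] - centroids[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # A centroid only "fires" for a patch if it is closer than the average centroid.
    mean_dist = dists.mean(axis=1, keepdims=True)
    return np.maximum(0.0, mean_dist - dists)

rng = np.random.default_rng(0)
patches = rng.normal(size=(200, 108))  # stand-in for whitened 6x6x3 patches (assumed size)

# "Random patches from the training set" as centroids: sample rows without replacement.
centroids = patches[rng.choice(len(patches), size=16, replace=False)]
features = soft_activation(patches, centroids)

# Add one centroid that sits far away from all of the data.
far_centroid = np.full((1, 108), 100.0)
features_with_far = soft_activation(patches, np.vstack([centroids, far_centroid]))
print(features_with_far[:, -1].max())  # activation of the far-away centroid
```

The far-away centroid is always further than the mean distance, so its feature is zero for every patch. That is the sense in which badly placed random clusters are wasted.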

Anyway, over the weekend I wanted to see what kind of classification accuracy a human would achieve on this dataset. So I set out to write some quick MATLAB code that would provide the interface to do this. My classification accuracy was about 94% on 400 images (some images are really unfair), but more importantly I felt I learned something about the nature of this task by explicitly going through the data myself and thinking about why I classified something as one class rather than another.

To help in the process, for every testing image I also displayed a set of closest images in training set according to several distances. You can see examples of the interface by clicking on the images later in this post. The MATLAB code for this program can be found here: [MATLAB CODE].

The lessons I learned:

- The objects within classes in this dataset are extremely varied. For example the "bird" class contains many different types of bird (both big birds and small). Not only are there many types of bird, but they occur at many possible magnifications, all possible angles and all possible poses. Sometimes only parts of the bird are shown. The pose problem is even worse for the dog/cat categories, because these animals occur in many, many different poses, and sometimes only the head is shown. Or the left part of the body, etc.

10 questionable images I took from a set of 50. 2nd from last is supposed to be a boat.

- My classification method felt strangely dichotomous. Sometimes you can clearly see the animal or object and classify it based on highly informative, distinct parts (for example, you find the ears of a cat). Other times, my recognition was based purely on context and overall cues in the image, such as the colors.

- There are many distractors and occlusions in the images that surely confuse the classifier. For example, there are two images of the same type of plane in the same pose, but one has a cloud in the sky and the other doesn't. The HOG classifier got confused by the cloud, which threw off the entire prediction. How can the classifier "learn" something like clouds from this dataset, and separate them from planes? As a human, I had absolutely no trouble completely disregarding the cloud (which made up almost half of the image). It's just a cloud-- I have an understanding of the layered structure of the scene. How can a classifier learn reasoning of the following form: "It's an image of the sky because I see blue background and clouds. That little dark blob next to the cloud must be either a plane or some kind of bird. I see wing-like structures, and they seem a bit curvy. Also, planes are usually whiter and not that dark. Bird." Clearly, that's not the kind of reasoning the HOG classifier would produce.

few interface screencaps

My overall conclusions:

- The CIFAR-10 dataset is too small to properly contain examples of everything that it is asking for in the test set.

- Many images require high-level reasoning to classify, and not purely "appearance modeling".

- The 0/1 loss is not appropriate for this dataset, because it is extremely hard to tell cat/dog and horse/deer apart in many many images.

- The classifier I found myself internally using the most was some kind of a product of experts. I think I would search the image (at all scales and positions) for very informative parts that hint very strongly at a presence of one of the objects. For example, two dark dots that could indicate eyes.  Or legs of an animal like horse/deer. I would also extract global scene features. What kind of scene is this? Natural? Water? Sky? I would infer this based on clouds or waves or background type. The information then gets merged to produce object prediction. Finally, if nothing seemed to work, I predicted toad. (The toad images in this dataset are terrible. If you see a lot of brown noisy stuff, it's a toad).

few more interface screencaps

- I don't quite understand how Adam Coates et al. perform so well on this dataset (80%) with the method they used. My guess is that it works along the following lines: looking at the image squinting your eyes you can almost always narrow down the category to about 2 or 3. The final disambiguation could come from finding good specific informative patches (like a patch of some kind of fur, or pointy ear part, etc.)

- I don't think any method will go significantly higher than 80%, even though improvements might be possible up to about 85-90% perhaps.

- I'm not convinced this is a great dataset to work with, because I suspect it's too small not only in size of dataset, but also in size of images.

I encourage people to try this for themselves (see my code, above), as it is very interesting and fun! I have trouble exactly articulating what I learned, but overall I feel like I gained more intuition for image classification tasks.

 

 

BONUS: Included in the download above is also some code that I wrote to generate pretty confusion matrices. Below is a sample confusion matrix I created for the classification that comes from Coates et al. on CIFAR-10. Inside the code you can change a variable to make the confusion matrix larger and include more misclassified examples. I made confusion matrices for larger sizes too, and made them available here: [confusion matrices visualized (12.8MB)].

Visualized confusion matrix for 77% classification on CIFAR-10 by Coates et al. Prediction goes down along vertical, and ground truth is horizontal.
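
For reference, here is a small Python sketch of how the counts behind such a matrix can be built with the same layout as above (prediction along the vertical, ground truth along the horizontal). This is not the MATLAB code from the download; the function name and the toy labels are made up for illustration.

```python
import numpy as np

def confusion_matrix(predicted, truth, n_classes):
    """counts[i, j] = number of examples predicted as class i whose true class is j."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when the same (row, col) pair repeats.
    np.add.at(counts, (predicted, truth), 1)
    return counts

# Toy labels for 3 classes (illustrative only, not CIFAR-10 results).
truth = np.array([0, 0, 1, 1, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 0])

cm = confusion_matrix(predicted, truth, 3)
accuracy = np.trace(cm) / cm.sum()  # correct predictions sit on the diagonal
print(cm)
print(accuracy)
```

Everything off the diagonal is a misclassification, so reading down a column shows what a given true class tends to get confused with.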

Quick experiment: vision + distances in high dimensions

When one is working with images, there is a natural tendency to treat them as vectors in a very high-dimensional pixel space for convenience. Basically, every dimension of the vector corresponds to a single R/G/B pixel value in the image. However, as a rule of thumb, it is best to avoid high-dimensional spaces whenever possible because they produce a large number of non-intuitive results. For example, the volume of the unit n-dimensional ball reaches a peak at about n=5, and then goes to zero as n -> infinity. What's up with that?
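
The ball-volume claim is easy to check numerically. The sketch below uses the standard closed form for a ball of radius 1, V_n = pi^(n/2) / Gamma(n/2 + 1); the function name is just for illustration.

```python
import math

def unit_ball_volume(n):
    """Volume of the radius-1 ball in n dimensions: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

volumes = {n: unit_ball_volume(n) for n in range(1, 31)}
peak_dim = max(volumes, key=volumes.get)

for n in (1, 2, 3, 5, 10, 20, 30):
    print(n, volumes[n])
print("peak at n =", peak_dim)
```

The numerator grows geometrically in n but the Gamma function in the denominator grows faster, which is why the volume eventually collapses towards zero.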

Anyway, getting back to images, here is an experiment I did in MATLAB this morning. The image in the top left is a random image from Google if you search "face". The other 3 images are all modifications of this image: the first one is shifted to the left a bit, the second one is... "messed up", and the third one is darkened. Here's the kicker: the 3 images are the same (L2) distance away from the original, in this high-dimensional pixel space.

I just thought it was funny/interesting and worth a share. The lesson for the kids is to not trust distance metrics in high-dimensional spaces, and to realize that the pixel space is almost always NOT the space you want to work in or think about when dealing with images.
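
Here is a rough NumPy re-creation of the idea (not my original MATLAB code). It assumes a synthetic smooth grayscale image in place of the face photo, additive Gaussian noise as the "messed up" version, and a constant brightness offset as the darkening. The noise and the offset are simply rescaled to match the L2 distance of the shifted image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic smooth grayscale image standing in for the face photo (an assumption).
ys, xs = np.mgrid[0:64, 0:64]
image = 0.6 + 0.3 * np.sin(xs / 6.0) * np.cos(ys / 9.0)

# 1) Shift two pixels to the left (np.roll wraps around at the border).
shifted = np.roll(image, -2, axis=1)
target = np.linalg.norm(shifted - image)  # the L2 distance to match

# 2) "Messed up": random noise rescaled to have exactly the target norm.
noise = rng.normal(size=image.shape)
messed_up = image + noise * (target / np.linalg.norm(noise))

# 3) Darkened: subtracting a constant c from all N pixels moves the image by c * sqrt(N).
darkened = image - target / np.sqrt(image.size)

distances = [np.linalg.norm(x - image) for x in (shifted, messed_up, darkened)]
print(distances)
```

All three distances come out equal (up to floating point), even though the three edits look nothing alike to a person.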

The set of things I was recently impressed with

EVENTS

- Twitter Announces Fire Hose Marketplace: Up to 10k Keyword Filters for 30 cents: The fire hose is a stream of all tweets on Twitter. Very few companies have access to it because it is costly to get access, and hard to process. Twitter has announced that they are partnering with a company that will act something like a fire hose re-seller. Think: Twitter Search on steroids for a small fee. Exciting also for academics, but probably mostly for companies.

TALKS

- Sebastian Thrun gave a great talk on TED about the self-driving car project he's working on with Google. The car can now actually drive on streets. There is a lot of exciting technology that goes into this. Some pretty visualizations are shown in the video about half-way.

- Robert Schapire on Boosting, Tutorial on videolectures.net: I was pleasantly surprised by the quality and clarity of this tutorial about boosting. Very accessible to beginners. Covers AdaBoost with C4.5 decision trees. In general, I have recently become very interested in ensemble methods, and in particular in Random Decision Forests (see the Kinect paper, below). I know I am late to the party, but I feel that the idea of combining many non-homogeneous weak classifiers into a strong one is very sound. There are so many papers in vision that compute features on images, and then run some simple linear classifiers on top. I wonder if ensemble methods with a small number of linear classifiers can almost always do better? I find it hard to believe that you would suddenly overfit if you only included an extra one or two linear boundaries in the final classification process. I wish I knew more about this topic and the associated empirical results.

PAPERS

- BRIEF descriptor (Binary Robust Independent Elementary Features), from ECCV, by Calonder et al. (link to PDF). The authors introduce a cute bitstring descriptor. Every bit is simply the result of a test that determines if one pixel is brighter than another, somewhere in a neighborhood of a point. Sliding a Gaussian through the image as a preprocessing step improves results. The final method is extremely fast for matching because it only relies on computing the Hamming distance between bitstrings. I was just surprised that such a trivial idea can still lead to great results. The thing I like about these kinds of representations is that you get to tune the number of bits to your needs: using more bits monotonically leads to better performance (up to some point). Future work will be focused on making the descriptor scale and orientation invariant to compete with SIFT and SURF. As a shortcoming: I'm a bit confused about how their method handles smooth parts of the image.

- (CHoG) Compressed Histogram of Gradients (PDF) from CVPR: *I love this paper*. It also tries to create feature descriptors in images. An idea that I had not stumbled upon so far but that I really liked was to create histograms of the joint distribution of (dx, dy) gradients in the patch, instead of simply doing a histogram of orientations (Figure 1 illustrates). This has the obvious benefit of not introducing noise when the gradients in the image are too weak. While this method would place those pixels into a container that essentially says "smooth patch", a normal HoG descriptor would force this smooth patch into some particular orientation bin. It seems intuitive that this approach should do better-- I'm a bit surprised that I haven't seen more of it. Again, I wish I had more empirical experience with this kind of binning.

The authors then go on to a very interesting descriptor compression technique. Once they have the distribution over joint (dx, dy) in the patch summarized in the histogram, they compute a Huffman tree for the distribution. Next, they enumerate every possible Huffman tree and assign a fixed-length code to each possible tree. So at this point, for every patch we wish to make a descriptor of, we can get the ID of the Huffman tree that best corresponds to the distribution of the (dx, dy) gradients in that patch. But why stop there? Not all trees are equally likely to occur! So more compression is achieved by entropy coding the fixed-length indices. The matching is also interesting because they pre-compute distances from all trees to all trees (in compressed form). So when matching two patches, you compute the low-bit representation for each, and then use the lookup table for the distance.

Here's the most interesting result. In the experiments they ran, they can match the performance of SIFT using 53-bit descriptors (about 20x compression from normal). There are just too many more cool results and ideas in this paper for me to enumerate. I'll stop here.

- Human Motion Capture for Xbox Kinect: Project page (includes video) also (Direct link to PDF), CVPR. I thought this was a very interesting paper to read as well. The paper describes the software that runs on Kinect and addresses the problem of finding a person in a depth image. They did this in four steps: 1. They synthesize a huge amount of depth data using computer graphics: they put people of various proportions into different poses, and collect the ground truth pose for every depth image. They use a bit of real mocap data too. 2. They come up with an intermediate representation of 31 body parts. 3. The learning problem then becomes that of multiclass classification: given a depth patch, which body part is at the center? 4. Once every pixel is (softly) classified as one of the 31 parts, they use mean-shift or similar to find modes and predict the location of each part.

To solve this learning problem they use a clever descriptor that strongly reminds me of BRIEF. However, instead of doing tests of image brightness, they look at two pixels and compare their depths. Then they use Random Decision Forests with decision stumps to create the classifier, by repeatedly proposing two pixel locations + the depth threshold, and picking the one with the largest gain in information (i.e. the standard way of doing things). They can train 3 separate trees on disjoint training data (which they can do because they are swimming in data), and their final classifier is a consensus vote. This consensus vote, together with a huge amount of data, leads to a very good classifier with great generalization. They have videos, and it seems to work very well. I recommend this paper for reading! A++ :)
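
Since pairwise pixel tests show up in both BRIEF and the Kinect paper, here is a minimal NumPy sketch of a BRIEF-like descriptor with Hamming-distance matching. It is a toy under stated assumptions and not the authors' implementation: the test locations are drawn uniformly at random, the Gaussian pre-smoothing is left out, and all function names are made up.

```python
import numpy as np

def make_test_pairs(n_bits, patch_size, rng):
    """Each row holds (r1, c1, r2, c2): two random pixel locations to compare."""
    return rng.integers(0, patch_size, size=(n_bits, 4))

def brief_like_descriptor(patch, pairs):
    """Bit i is True if pixel (r1, c1) is brighter than pixel (r2, c2)."""
    r1, c1, r2, c2 = pairs.T
    return patch[r1, c1] > patch[r2, c2]

def hamming_distance(a, b):
    """Number of bit positions where two descriptors disagree."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(0)
pairs = make_test_pairs(n_bits=128, patch_size=32, rng=rng)  # same tests for every patch

patch = rng.uniform(size=(32, 32))
relit = 0.5 * patch + 0.2              # same patch under a brightness/contrast change
other = rng.uniform(size=(32, 32))     # an unrelated patch

d_patch = brief_like_descriptor(patch, pairs)
same_dist = hamming_distance(d_patch, brief_like_descriptor(relit, pairs))
diff_dist = hamming_distance(d_patch, brief_like_descriptor(other, pairs))
print(same_dist, diff_dist)
```

Because each bit only depends on which of two pixels is brighter, the descriptor does not change under the brightness/contrast change, while the unrelated patch disagrees on many bits. Swapping the brightness image for a depth image gives the flavor of the depth-comparison features described above.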

 

A recurring theme I've seen come up a lot recently is that these simple ensemble methods with simple decisions on a very large amount of data do very well. And where can we get a large amount of data? Virtual worlds. I would love to see someone implement these ideas for object recognition, where training is done for the most part in a virtual world + a bit of transfer learning in the real world.

 

Hacking around with Twitter API+Python: Tutorial

I am an avid Twitter user. Many people still don't quite understand its appeal, but I personally gained a lot by using it. I suppose the most frustrating part is that I do such a poor job of converting my friends to use it, for some reason. The world would be a better place if everyone used Twitter (or anything of the same nature, I'm not strictly a Twitter fan. I'm only a fan of the idea).

Regardless, one of the most exciting things about Twitter is that you get to meet a lot of like-minded individuals. There is a lot of sharing of information, news, ideas, etc. However, it is surprisingly hard to find these individuals in the Twitterverse because even though Twitter tries to recommend "People to Follow", it does a terrible job at it. I've therefore tried to take matters into my own hands, and write a couple of scripts that help me find good people to follow. In this post, I will try to get you up to speed with how to analyze Twitter, and provide some of my code.

I had to read around the internet for a very long time until I got everything working. Here is everything you need to know in condensed form, in one place and with no useless English glue (i.e. yay point form):

SETUP:
-Tweepy for Python is a great library for using Twitter API. Download & Install.
-Tweepy documentation, for later reference.

SETTING UP OAUTH
-To use Twitter API, you first need to go through the rite of passage of setting up the oauth. This will give you four magic strings that you need when you make connection to Twitter: consumer key, consumer secret, access token key, access token secret.
- You can try this Tweepy tutorial to set it up. See 'OAuth authentication' section.
- But this is how I got it to work, which seemed easier: Download&Install oauth2 for Python. Then use this tutorial. You get consumer key+secret in step 1-2, and then the code given is to get the access token key+secret.

API
-You should be able to now authenticate yourself with Tweepy and do something like

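    import tweepy
    # consumer_key, consumer_secret, key and secret are the four OAuth strings from the section above
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(key, secret)
    api = tweepy.API(auth)
    # make the call through the authenticated api object created above
    public_tweets = api.public_timeline()
    for tweet in public_tweets:
        print(tweet.text)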

-Twitter limits the number of API calls to 350 per hour. Every hour, they reset the counter of "API calls used" for every person back to 350. Use "api.rate_limit_status()['remaining_hits']" to see how many you have remaining.
-One API call can get you: 100 followers/friends of any person, or 20 latest tweets from any person.
-API calls can fail at random. API calls can fail for some person always (usually due to protected tweets etc). API calls can fail for some person for some time, but then work later. (I think). Basically, the calls are not very reliable, use caution.
-In API, word 'friend' refers to a person you are following, not a mutual following.
-Here is some example code on how to use Tweepy, where I show most common use cases.
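Since the calls can fail at random, it helps to wrap each one in a small retry helper. The following is only a minimal sketch: call_with_retries and its parameters are hypothetical names made up for illustration, not part of Tweepy, and it assumes that failures show up as ordinary Python exceptions.

```python
import time

def call_with_retries(fn, *args, retries=3, delay=1.0, default=None, **kwargs):
    """Call fn(*args, **kwargs), retrying after a pause if it raises.

    Hypothetical helper (not part of Tweepy). Returns `default` if every
    attempt fails, e.g. for a user whose tweets are protected.
    """
    for attempt in range(retries):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Wait before the next attempt, but not after the final one.
            if attempt < retries - 1:
                time.sleep(delay)
    return default
```

Any API call can then be passed in as fn together with its arguments, and people for whom the call keeps failing simply come back as the default value and can be skipped.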

MY ATTEMPTS: WHO TO FOLLOW
I tried to create a script for finding people similar to me on Twitter. I did this by collecting my depth 2 network (the union of the people I follow and the people they follow) and running some statistics. First, the graph alone suggested many very good people to follow. This was done simply by checking for people who are very popular in the collected network but whom I don't follow yet.
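The graph-only idea can be sketched in a few lines. This is a toy illustration rather than the actual script: it assumes the network has already been collected into a dict mapping each username to the set of people that user follows, and it takes "popular" to mean "followed by many of the people I follow", which is only one possible definition. The name suggest_people is made up for the example.

```python
from collections import Counter

def suggest_people(me, follows, top_n=10):
    """Rank people I don't follow by how many of my friends follow them.

    follows: dict mapping username -> set of usernames that user follows
    (assumed to be collected beforehand).
    """
    my_friends = follows.get(me, set())
    counts = Counter()
    for friend in my_friends:
        for candidate in follows.get(friend, set()):
            # Skip myself and anyone I already follow.
            if candidate != me and candidate not in my_friends:
                counts[candidate] += 1
    return counts.most_common(top_n)

# Tiny made-up network for illustration.
toy_network = {
    "me": {"a", "b", "c"},
    "a": {"x", "y", "b"},
    "b": {"x", "me"},
    "c": {"x", "y", "z"},
}
print(suggest_people("me", toy_network, top_n=2))
```

In the toy network, "x" is followed by three of my friends and "y" by two, so they come out on top, while "b" is left out because I already follow them.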

I conducted a few extra experiments by collecting 100 tweets for every person and looking at the kinds of things they talk about. More details on my initial approach can be found on this MetaOptimize thread I made about it.
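As a rough illustration of comparing people by the text of their tweets, here is a generic bag-of-words sketch. It is not taken from the thread and is not the approach described there: it simply counts words in each person's tweets and compares the counts with cosine similarity. The function names are made up for the example.

```python
import math
import re
from collections import Counter

def word_counts(tweets):
    """Lowercase the tweets and count every word (letters and apostrophes only)."""
    words = []
    for text in tweets:
        words.extend(re.findall(r"[a-z']+", text.lower()))
    return Counter(words)

def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A score near 1 means two people use very similar words, and 0 means they share none. Raw counts are dominated by common words such as "the", so a real version would likely need stop-word removal or some term weighting.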

SOME CODE
- My Python code for finding people similar to you based only on the graph. Change the username at the top and put in your OAuth credentials to run it.
- EDIT: Now also available in better form on my GitHub.
- You probably don't want to change any of the other settings, especially depth: if it is >2, your computer may explode or never halt. Just to give you an idea, my depth 2 network already consists of about 14,000 people.
- If you don't want to go through all the trouble above and you think I like you a lot, email me your username and I can run the script for you :)
- For now I am not releasing my code that tells you who to follow based on the text of tweets. It is much more involved to collect them all (it takes a few days) and to run it.
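To see why collecting the tweets is slow, here is a back-of-the-envelope estimate using the limits quoted earlier (350 calls per hour, 20 tweets per call). It is only a sketch under simplifying assumptions: every call succeeds, a single set of credentials is used, and nothing else eats into the hourly budget. The function name is made up for the example.

```python
import math

CALLS_PER_HOUR = 350   # hourly API limit quoted earlier in this post
TWEETS_PER_CALL = 20   # tweets returned by one call, as quoted earlier

def hours_to_collect(num_people, tweets_per_person,
                     calls_per_hour=CALLS_PER_HOUR,
                     tweets_per_call=TWEETS_PER_CALL):
    """Return (total_calls, hours), assuming every single call succeeds."""
    calls_per_person = math.ceil(tweets_per_person / tweets_per_call)
    total_calls = num_people * calls_per_person
    return total_calls, total_calls / calls_per_hour

total_calls, hours = hours_to_collect(14000, 100)
print(total_calls, "calls,", hours, "hours,", round(hours / 24, 1), "days")
```

With 14,000 people and 100 tweets each this comes to 70,000 calls, or about 200 hours at the full hourly budget. Collecting fewer people or fewer tweets per person brings the time down proportionally.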

TWITTER DATASET (aside, for fun)
- A huge Twitter dataset was available here: Twitter 7. "467 million Twitter posts from 20 million users covering a 7 month period from June 1 2009 to December 31 2009".
- Twitter actively pursues people who try to put up datasets of tweets because they are no fun, want to slow down innovation in NLP, and prevent us from doing cool research.
- Twitter made them take it down as you can see, but the links are still in the HTML code, commented. Direct link.

The super-useful website I would build if I had time

Example project site

Here is an idea for a website that I have had for a very long time. I see its application mostly in academia/Computer Science, but it could be extended to be more general.

Problem statement:
We computer scientists work on many projects throughout our lives. Most people host a projects page somewhere on their website. Some create project pages for their publications. However, every page is left by itself, stranded somewhere in the vastness of the internet, waiting for someone to stumble upon it, somehow.

In general, the idea of a project comes up all the time, and everyone builds a page for each project from scratch, somewhere on their own website. Have a look at, for example, the CVPR papers page. Note the sheer number of publications that have a "Project" link. Every one of these pages is hard to find (usually you Google the person and click around their site to find the link), and they all look the same: there are a couple of pictures, some videos, some links, a list of collaborators, and a bit of additional information. For my own projects I find it extremely interesting to also attach a little discussion forum at the bottom, as can be seen on my evolutionary creatures simulator project.

Proposed solution:

Why can't there be a central location for projects? Why can't academics easily upload their project images, links, and descriptions somewhere? Some place that is well known and visible? There could be ways of searching Project pages. A little discussion forum could be created for each project automatically. Projects could be rated, tagged, featured. Projects could come with a little news feed that could display updates. Projects could display a list of collaborators, maybe ranked/tagged by their contribution.

There could be profiles for people. A person's profile could even become their online portfolio. This is why we all make pages for ourselves in the first place: we want to showcase the things we worked on and give some contact information. Except that instead of everyone having to bother with HTML, CSS, etc. to create project pages for themselves, they could simply create these projects on the site and then link people to their profile. The profile could display some customizable information and point to all of their projects.

Imagine being able to follow certain people to get updates on what they are working on. It's also easy to imagine following projects, and getting notified of any changes or developments.

Eventually, I can see this website expanding beyond academia. In some sense, websites are projects. Any software can be thought of as a project as well. Even the LHC is a project. Anything that one or more people worked on for a substantial period of time is a project.

I hope that someone with the time, resources, and skill builds this website. I think it would benefit academia, and more importantly, it would benefit humanity overall if we made all of our projects easily searchable in one central location.