Author Archives: phillipmah

Watson: IBM's Next Deep Blue

Article in the New York Times: What is IBM’s Watson?

Watson is IBM’s next grand challenge. In 1997 IBM’s Deep Blue beat the world chess champion, Garry Kasparov. IBM is hoping to do it again by beating previous champions of Jeopardy, and has signed up with the TV show for an actual competition.

Watson is the latest in a string of trends in the computer science world. In the late 80s and early 90s artificial intelligence research entered the AI winter, where funding slowed down due to a lack of results. Projects like MYCIN led to a great deal of hype, and people believed computers were on the verge of true intelligence. Unfortunately the breakthroughs failed to appear, and research slowed down.

In the 90s the predominant idea in artificial intelligence was to create a set of rules to guide the computer: building models of language, grammar, speech, and intelligence. However, the models being built were never powerful enough to compete with human intelligence, and they were very complicated to build.

Recent breakthroughs in machine learning, however, are changing the landscape. Whereas before researchers would painstakingly build a model by hand, the current trend is to build a simple but flexible model. The computer is then fed a staggering amount of information, essentially teaching it. These breakthroughs are what led to speech and handwriting recognition. They are also the key idea behind modern natural language processing, which is what allows Watson to actually compete against humans.

These advances have been enabled by faster and cheaper computers, as well as the enormous growth in machine-readable information. Every time someone adds information to Wikipedia, or to a blog post, it allows computers to learn from it.

However, it isn’t clear whether these advances will be enough. Already speech recognition advances have plateaued, and it is an open question whether computing power and massive amounts of information will be enough to make true artificial intelligences.

Most likely there is still room to grow. Knowledge engines will supplant search engines when we just need answers to questions. Whether or not we’ll have computers acting as personal assistants is unknown. Kurzweil certainly thinks so; I myself am not so sure.

Open Catalogues of City Data

San Francisco Smart Parking reminded me of how far we’ve come with open access to the data that impacts our daily lives, and how much more we can do.

In Vancouver, there is the Vancouver Open Data Catalogue which consists mainly of Map/GIS data. Some applications I can think of using the existing data:

  • Take out garbage reminder email – sign up with an email and an address and the website sends you an email the day before collection.
  • Web application to mark areas with graffiti, potholes, broken street lights for City of Vancouver to fix.

Some ideas that there is no data for:

  • Map of parking zones (i.e. no signage, 2 hour parking, residential only etc.)
  • Dynamic map showing the amount of traffic throughout the city in the course of a week.

Several other cities also have open data policies: Edmonton, Toronto, New York, Washington, and San Francisco.

This brings up the question of data integration. Let’s say I make a great web application that allows people in Vancouver to mark locations with graffiti, potholes, or other minor maintenance issues, and the city of Vancouver also uses the application – everyone is happy. Why should someone have to develop an entirely new application to deal with the same issues in New York? Or in Toronto?

I can only hope for the day that question comes up, though, because right now most cities do not have open data catalogues.

OpenPCR – Develop a cheap, open design, DIY PCR Machine

Found an interesting project looking for donations: OpenPCR.

They aim to create a PCR machine out of open components (e.g. Arduino), and then release all the designs. In theory this means someone can hack together a PCR machine out of base components. This would in turn commoditize PCR machines, allowing generic manufacturers to produce them on the cheap.

An interesting fact: one of the designers/hackers was a judge for the 2009 MIT iGEM competition, which Eric was part of!

A Tour of the Visualization Zoo

Found an awesome article by the ACM, A Tour through the Visualization Zoo.

One visualization you cannot miss is the following flow map, a recreation based on the 1869 Minard visualization of Napoleon’s 1812 Russian campaign:

Napoleon’s March to Moscow

Edward Tufte, author of the classic book on visualizations, The Visual Display of Quantitative Information, called the original chart possibly the best graphical visualization of all time. It is easy to see why: the chart combines six dimensions (geographic position, which counts as two, plus time, the size of the army, the direction of the army, and temperature) with incredible clarity, and Minard did it in 1869.

Here is how the visualization works:

  • Width of the band indicates the size of the army.
  • Red band indicates movement towards Moscow, black the return march.
  • At the bottom is a time line as well as temperature during the return.

I find this visualization inspiring because it is a reminder to everyone who works with data that all you need to create a compelling story with data is diligence and careful thought. The original Minard chart can be found here.

Another visualization that was not included in the tour is John Snow’s cholera outbreak map:

Snow's Cholera Map

The black marks indicate deaths due to cholera, and the dots represent water pumps. Snow used his analysis of deaths to determine that the water pump on Broad Street was linked with the cholera outbreak. Some claim that Snow’s analysis was the birth of epidemiology – the study of the factors affecting the health and illnesses of populations.

Both of these visualizations demonstrate that sometimes the most convincing analysis is simply a visualization.

Resources for Graduate School

Found this article recently: The Secret Lives of Professors, which got me thinking about graduate school again. So for those of you thinking about applying, or going to grad school in September, I’ve found some good resources that I thought I would share.

If you have any good links feel free to leave a comment or email it and I’ll add it to this post.

Prospective Graduate Students

Graduate Students

How to Measure Anything: Mathless Confidence Intervals

I recently picked up a book by Douglas W. Hubbard, How to Measure Anything, which offers a table that pays for the book by itself.

The “Mathless” 90% CI (p139, How to Measure Anything)

Lower bound: __th smallest

Upper bound: __th largest

Sample Size   nth largest and smallest sample value   Actual Confidence
5             1st                                     93.8%
8             2nd                                     93.0%
11            3rd                                     93.5%
13            4th                                     90.8%
16            5th                                     92.3%
18            6th                                     90.4%
21            7th                                     92.2%
23            8th                                     90.7%
26            9th                                     92.4%
28            10th                                    91.3%
30            11th                                    90.1%

So what does the table mean and how can it be used?

Suppose you are a drug dealer and you’ve received 100 packages of marijuana, nominally 10g each, ready to sell. The suppliers may have tried to rip you off so you need to check. You decide you want to be more than 90% certain that the average weight of each package is actually 10g.

You don’t have the time to hire people on your end to weigh every package, and there are no friendly statisticians willing to calculate sample statistics for you. So what can you do?

With the table above you decide on how many packages you are willing to weigh. Suppose you have time to weigh 11 packages and find that they weigh 8, 8.9, 9, 9.5, 9.7, 9.9, 10, 10, 10.5, 11, 12g. With a sample size of 11 you only need to look at the 3rd smallest value (9g) and the 3rd largest value (10.5g) to construct a 90% confidence interval (actually 93.5%). Hence a 90% CI of the average weight of the packages is between 9g and 10.5g. You may or may not decide to accept the deal.
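The lookup described above can be sketched in a few lines of Python. This is only an illustration of the procedure, not code from the book; the names `MATHLESS_RANK` and `mathless_interval` are made up for this sketch.

```python
# Table from the post: sample size -> rank n of the n-th smallest / n-th largest value.
MATHLESS_RANK = {5: 1, 8: 2, 11: 3, 13: 4, 16: 5, 18: 6,
                 21: 7, 23: 8, 26: 9, 28: 10, 30: 11}


def mathless_interval(sample):
    """Return (lower, upper) bounds of the "mathless" 90% CI for a sample."""
    n = len(sample)
    if n not in MATHLESS_RANK:
        raise ValueError(f"sample size {n} is not in the table")
    k = MATHLESS_RANK[n]
    ordered = sorted(sample)
    # k-th smallest is at index k - 1, k-th largest is at index -k.
    return ordered[k - 1], ordered[-k]


# The eleven package weights (in grams) listed in the example.
weights = [8, 8.9, 9, 9.5, 9.7, 9.9, 10, 10, 10.5, 11, 12]
lower, upper = mathless_interval(weights)
print(lower, upper)  # prints: 9 10.5
```

Sample sizes that are not in the table raise an error here rather than guessing a rank.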

What is a 90% confidence interval (CI)? It means that, using the above table, 9 times out of 10 the actual average will fall between the ‘calculated’ values.

So what if you’re not a drug dealer? The example in the book is used to measure the average amount of time a group of managers spend on under-performing sales reps. Other examples I can think of include measuring the average amount of time developers spend on bug fixes, and the amount of time employees spend working at home.

The table comes with some caveats. It can construct a 90% CI of the median for any distribution. However, to use the table to calculate 90% CIs for averages, the distribution has to be symmetric. In the drug dealer’s case that means your suppliers are as likely to give you lighter packages as heavier ones. That is not super-realistic here, but many other things in life are symmetric.
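For the curious, the “Actual Confidence” column can be reproduced with a short binomial calculation. The sketch below is my own reconstruction, not something taken from the book, and it assumes a continuous distribution so that each sample value falls below the median with probability 1/2. The function name `actual_confidence` is made up for illustration.

```python
from math import comb


def actual_confidence(n, k):
    """Chance that the median lies between the k-th smallest and k-th largest of n values."""
    # The interval misses the median on one side only when fewer than k of the
    # n values fall on that side, which is a Binomial(n, 1/2) tail probability.
    one_sided_miss = sum(comb(n, i) for i in range(k)) / 2 ** n
    return 1 - 2 * one_sided_miss


# A few rows of the table: (sample size, rank).
for n, k in [(5, 1), (8, 2), (11, 3), (13, 4)]:
    print(n, k, f"{actual_confidence(n, k):.1%}")
```

This also shows why the table skips sample sizes: only certain combinations of size and rank land just above 90%.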

Sports Statistics: The No-Stats All-Star

Found an older article from the New York Times magazine: The No-Stats All-Star.

Shane Battier is a small forward for the Houston Rockets (NBA), who was originally drafted by the Vancouver (now Memphis) Grizzlies. What makes him so interesting from a statistics point of view? Here is a quote from the article:

[Shane Battier's] conventional statistics are unremarkable: he doesn’t score many points, snag many rebounds, block many shots, steal many balls or dish out many assists… When he is on the court, his teammates get better, often a lot better, and his opponents get worse — often a lot worse. He may not grab huge numbers of rebounds, but he has an uncanny ability to improve his teammates’ rebounding. He doesn’t shoot much, but when he does, he takes only the most efficient shots. He also has a knack for getting the ball to teammates who are in a position to do the same, and he commits few turnovers. On defense, although he routinely guards the N.B.A.’s most prolific scorers, he significantly reduces their shooting percentages. At the same time he somehow improves the defensive efficiency of his teammates — probably, Morey surmises, by helping them out in all sorts of subtle ways. “I call him Lego,” Morey says. “When he’s on the court, all the pieces start to fit together. And everything that leads to winning that you can get to through intellect instead of innate ability, Shane excels in. I’ll bet he’s in the hundredth percentile of every category.”

Since Bill James discovered the sports Pythagorean theorem and founded Sabermetrics, statistics has taken the world of sports by storm. Baseball pitchers are given minute details about every opposing player and their weaknesses. In basketball, Battier gets a report detailing how well his check, Kobe Bryant, plays in every part of the court. All of these statistics are created by people carefully watching videos of the game from all angles, and detailing every move.

But knowing that statistics can dramatically improve everyday decision making is only half of it; it is just as important to realize the weaknesses inherent in statistics. Shane Battier’s stats are a case in point. From the box score he comes off as a mediocre player, but that’s because we are looking at the wrong things. This is one of the largest weaknesses in statistics. Any statistical analysis is only as good as the data, and if we become myopic and look only at the obvious data, we miss seeing magnificent opportunities like Shane Battier.

Statistics is part of the future in terms of decision making, but don’t forget to look for hidden opportunities and ways to improve your analysis. One thing that can be done is to set aside some time every year to review your decisions and any obvious missed opportunities or odd recurrences, then fix the way you make decisions to catch these in the future.

Pythagorean Part 3: R Analysis of Hockey

With the data files extracted in Pythagorean Part 2 we can go on to the statistical analysis with R.

Recall, the proportion of games won by a team is predicted by the formula:

\text{Proportion of Wins} = \frac{\text{Runs Scored}^{2}}{\text{Runs Scored}^{2} + \text{Runs Allowed}^{2}}

Our goal is to calculate the optimum exponent that best describes the actual proportion of wins for each season. The complete script can be found on github. The script generates the following time series:

Exponent time series

The time series shows the best fitting exponent value for each season (measured in years). The size of the point indicates the number of teams playing in each season. The color represents the error as measured by the sum of squared residuals (SSE).

As you can see the error is relatively small for all the seasons (SSE less than or equal to 0.20). So the formula fits fairly well for all seasons since 1920.
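To make the fitting step concrete, here is a minimal sketch in Python. It is not the R script mentioned above; it swaps in a plain grid search over candidate exponents, and the season data and function names are invented for illustration.

```python
def predicted_win_share(scored, allowed, exponent):
    """Pythagorean expectation for one team."""
    return scored ** exponent / (scored ** exponent + allowed ** exponent)


def sse(teams, exponent):
    """Sum of squared residuals between actual and predicted win shares."""
    return sum(
        (actual - predicted_win_share(scored, allowed, exponent)) ** 2
        for scored, allowed, actual in teams
    )


def best_exponent(teams, low=0.5, high=4.0, step=0.01):
    """Grid search for the exponent with the smallest SSE (search range is an assumption)."""
    steps = int(round((high - low) / step))
    candidates = [low + i * step for i in range(steps + 1)]
    return min(candidates, key=lambda e: sse(teams, e))


# Invented season: (goals scored, goals allowed, actual proportion of wins).
season = [
    (260, 210, 0.61),
    (240, 235, 0.52),
    (225, 230, 0.47),
    (200, 250, 0.40),
]
fitted = best_exponent(season)
print(f"best exponent: {fitted:.2f}, SSE: {sse(season, fitted):.4f}")
```

Running the same search once per season would give one point per season, like the time series above.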

From the graph, an exponent value of 2 looks accurate enough for all seasons. An interesting trend is that the exponent values cluster more tightly as the number of teams increases.

So what does this mean?

We can make a fairly simple prediction for any team given the number of goals scored and the number of goals let in. In general, if there is a large discrepancy between the actual and the predicted proportion of games won, we can predict that the team will win more (or fewer) of its future games and regress back toward the predicted proportion of wins.
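As a rough sketch of that comparison in Python: the team numbers below are invented, `pythagorean_gap` is a name made up for illustration, and ties or overtime losses are ignored, so wins divided by games is treated as the proportion of wins.

```python
def pythagorean_gap(goals_for, goals_against, wins, games, exponent=2.0):
    """Actual minus predicted proportion of wins for one team."""
    predicted = goals_for ** exponent / (
        goals_for ** exponent + goals_against ** exponent
    )
    actual = wins / games
    return actual - predicted


# Hypothetical team: 240 goals scored, 200 allowed, 35 wins in 82 games.
gap = pythagorean_gap(240, 200, 35, 82)
if gap < 0:
    print(f"gap {gap:+.3f}: fewer wins than predicted, expect improvement")
else:
    print(f"gap {gap:+.3f}: more wins than predicted, expect a slide")
```

A negative gap means the team has won less often than its goal totals suggest; a positive gap means it has been winning more often than predicted.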

May/June 2010 Issue of Analytics – Cancer Therapy

The latest bimonthly issue of Analytics (flash needed) has been released. It features an article on the use of optimization programming in cancer treatment.

Do you remember Mario Tetris on the Nintendo DS? Researchers develop algorithms to fit different-sized blocks or spheres into certain volumes, like 3D Tetris. Many types of radiation therapy can be modeled as packing spheres in a volume defined by the tumour, and the algorithms can be applied to have a big clinical impact.

In general, cancer therapy is an optimization problem: the cancerous cells need to be killed with a minimum amount of treatment, and the treatment is constrained by the damage it does to the surrounding tissue. So as new types of treatment are developed, operations researchers will need to be there as well to maximize the effectiveness of treatment.

Why Do Harvard Kids Head to Wall Street?

I’m a fourth-year student at the University of British Columbia and many of my friends have finished their undergraduate degrees. One of their chief concerns: What do I do now?

Note the question is not: What should I do with my life?

I believe many new graduates have grand dreams of changing the world. I certainly do. Of course it seems all too easy to find a job that pays well and forget our grand dreams.

The Baseline Scenario is a great blog written by financial insiders, e.g. a former chief economist of the IMF. They recently wrote a post on Why Do Harvard Kids Head to Wall Street?

Their typical Harvard undergraduate, I believe, is really the typical high-achieving undergraduate:

(a) is very good at school; (b) has been very successful by conventional standards for his entire life; (c) has little or no experience of the “real world” outside of school or school-like settings; (d) feels either the ambition or the duty to have a positive impact on the world (not well defined); and (e) is driven more by fear of not being a success than by a concrete desire to do anything in particular.

So what happens to these ambitious high-achievers? They get taken up by the well-structured recruitment processes of established institutions. What are these processes? Medical school, law school, graduate school, consulting firms, accounting firms, and corporate law firms. These recruitment processes give students who have spent the last 16 years following a well-defined path their next stepping stone.

Once these new grads get recruited they get comfortable with the income and start building their families. After that, when kids come along, it gets exceedingly hard to change careers.

One interesting reason is that people start justifying their career choice, a case of cognitive dissonance. Where previously the reason for working at an investment firm was to pay the bills and learn the necessary skills to move on and change the world, the new justification is that the markets of the world need to be fluid, that the management of the portfolio is very important to so-and-so, and so on.

There’s nothing wrong with a career at an investment bank, or at a corporate law firm. But the universities of North America are supposed to be training the leaders of tomorrow. The smart, high-achieving new graduates have a deep desire to help the world, but many of them do not end up fulfilling their ambitious if vague dreams.

I can’t fault the investment banks; they are doing what they need to do to succeed. But the world would be a better place if more university graduates went on to really have an impact.

So what can we students do? Our chief problem is our vague goals. So try writing down what you really want to achieve in life, and stop worrying about failing to have an impact – because you just might end up working on Wall Street or Bay Street.
