Wednesday, July 24, 2013

[Figure 0. Today's Tech Xploration speaker: Jeremy Howard - President and Chief Data Scientist @ Kaggle]

This talk was given by Jeremy Howard, a serial entrepreneur who became a well-known data scientist without any prior technical background.  According to his brief introduction, he has spent a lot of time with the world's top data scientists, consultants, and analysts, including at McKinsey & Company.  He is also known as the President and Chief Data Scientist of Kaggle.  In this talk, he presented "Prediction Down to a Science: How You can Make the Next Big (Data) Breakthrough".  From it we can pick up some hints on how to become a great data scientist in the real world, and how to run a successful startup of one's own.

He started his talk with his experience at several startups.  An entrepreneur has to do everything alone, including virus and spam filtering for email and building server systems (Linux setups, filesystems).  He mentioned that by doing everything himself, he learned a lot of practical and useful things in a very short time.

Secondly, he described his early experience with data science, when he was working on an insurance product.  The most important goal was to increase profit by pricing the insurance products appropriately.  To do that, he needed to understand the actual curve of the tradeoff between profit and price (see Fig. 1).  Once the curve is understood, the company can sell the product in an optimal way (optimal profit? - probably arguable, but it can do the best possible with the known model).  This requires at least some math, but also real data from customers and past history, so it was a place to apply data science to a real product.

[Figure 1. Tradeoff between Profit and Price - drawn by Jeremy Howard]
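
To make the profit-versus-price idea a bit more concrete, here is a minimal sketch in Python.  It is not Jeremy's actual model: the logistic demand curve, the cost of 300 per policy, the 1,000 potential customers, and all function names are assumptions invented for illustration.  In practice the demand curve would be fitted from real customer data and past history, as described above.

```python
import math


def purchase_probability(price, midpoint=500.0, steepness=0.01):
    """Hypothetical logistic demand curve: fewer customers buy as the price rises."""
    return 1.0 / (1.0 + math.exp(steepness * (price - midpoint)))


def expected_profit(price, cost=300.0, customers=1000):
    """Expected profit = margin per policy * expected number of policies sold."""
    return (price - cost) * customers * purchase_probability(price)


def best_price(candidates):
    """Return the candidate price with the highest expected profit."""
    return max(candidates, key=expected_profit)


# Scan a grid of candidate prices (all values are illustrative assumptions).
candidate_prices = [float(p) for p in range(300, 1001, 10)]
optimal = best_price(candidate_prices)
print("best price on the grid:", optimal)
print("expected profit at that price:", round(expected_profit(optimal)))
```

A price that is too low sells many policies with almost no margin, and a price that is too high has a good margin but few buyers, so the best price sits somewhere in between.  The result is only "optimal" with respect to the assumed demand model, which is the same caveat noted in the paragraph above.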

Next, he explained how Kaggle works.  Kaggle is a platform for competitions in data mining and analysis, where data scientists compete to provide better results and solutions.  Time series analysis has been a popular topic, and a lot of practical work has been done on Kaggle (see Fig. 2).

[Figure 2. What Kaggle does - from]

About 100,000 data scientists are working on Kaggle (McKinsey actually estimated a total of about 15,000 data scientists in the world), and many interesting challenges are ongoing.  Crowdsourcing is typically used only for very simple, cheap tasks that are outsourced to people with enough time, but Kaggle has made higher-quality crowdsourcing possible, closer to scientific research done by the crowd.  Kaggle has been successful on several scientific projects, on topics such as HIV and dark matter.  This shows that the concept of crowdsourcing can be applied to subjects with depth.

[Figure 3. About a user on Kaggle: Xavier Conort]

One of the most interesting features of Kaggle is that it keeps a history of crowd workers' (users') past credentials and experience.  Fig. 3 shows the example of Xavier Conort.  As you can see in the figure, crowd task creators can make use of this history, and workers can build up their own credentials and profiles on the Kaggle platform.  In this example we can see that Xavier won first place in two competitions and finished 2nd to 4th in several others.  This can greatly reduce the concern, or myth, that 'crowdsourcing usually comes with low (bad) quality results'.

[Figure 4. TechXploration Talk Venue]

In the last part, Jeremy mentioned some of the 'dirty' parts of data science practice.  It requires a massive amount of trial and error, handling of exceptional cases in the input data, and many failures in the middle of an analysis.  His point was that data science is not something that can be achieved with a small amount of effort; it needs smart people who work hard.

In the Q&A session, there were many unusual and interesting questions that I could not really have asked myself:

1) Why do you draw pictures in your presentation?: Answer: "I use Windows 8, and there is an easy way to take notes, so it's a habit of mine."
2) What programming language do you use for development?: Answer: "I use Microsoft C#, for various reasons. :-)"
3) Should we have the model first and the data later, or vice versa?: Answer: "We need a basic model first and the data later.  Otherwise it will take more time and will produce noisy, less accurate results."
4) What is the status of education in data science, and of your current work at Singularity University?: Answer: "The current status is pretty awful, but Coursera and several online courses in Machine Learning and Data Mining help people get used to the topics and the field of data science."

[Figure 5. Jeremy Howard's contact]

So, that is a quick lesson from Jeremy Howard.  If you have any more questions, please ask Jeremy using the following contact (see Fig. 5).  I hope this helps people who want to become data scientists in the future!

- written by ANTICONFIDENTIAL in San Jose on July 24, 2013