Sunday, March 10, 2013

Microsoft and Big Data

While I was thinking about researching Microsoft's Big Data strategy, I came across an article about it on ZDNet: "Microsoft's big data strategy: Democratize, in-memory, and the cloud".

So what is Microsoft's strategy on Big Data? In short, Microsoft promotes Big Data on two fronts: its Business Intelligence experience, formed over more than a decade in that market, and the experience of its other lines of business, including online services, gaming and cloud platforms.

Specifically, the following products are related to Big Data:




While the above products are all impressive and can compete with those of other companies, Data Explorer has the potential to reach the most users, since Excel is so popular in the office software market.

As this Microsoft blog said, "Data Explorer enhances the self-service BI experience in Excel by simplifying data discovery and access to a broad range of public and enterprise data sources, enabling richer insights from data that has traditionally been difficult for users to get to. With Data Explorer, users can now quickly and easily import data from a variety of sources, including Web, Excel, Text, Database and Azure. Access to non-traditional sources such as Active Directory, Facebook and big data solutions like Hadoop are now within the reach of any user. Connecting directly to data from the web is easy and intuitive. Filtering and transforming your data can be done in just a few clicks and importing your final data into Excel is straightforward. "

You can check out the video below to see how Data Explorer works.

Friday, March 8, 2013

Computer Trading by using Big Data?

The FT's article "FBI joins SEC in computer trading probe", published on Mar. 5, reported the following detail:

"Authorities are exploring potential holes in the system, including new algorithms referred to as “news aggregation” that search the internet, news sites and social media for selected keywords, and fire off orders in milliseconds. The trades are so quick, often before the information is widely disseminated, that authorities are debating whether they violate insider trading rules, the people familiar with the matter said."

It makes me wonder whether this technology-driven trading trick uses Big Data technology. If some people or companies can use Big Data to generate benefits for other people or customers, there must be others who take advantage of Big Data to make money for themselves.

Why does Big Data make such bad things so easy?

  • Big Data, especially people's social network information on the Internet, is easy to access and consume.
  • There are many open source technologies that help process Big Data.
  • Cloud computing makes Big Data crunching faster.
  • Fast-growing IT technology, both hardware and software, drives the cost of Big Data processing lower.

There is no way to stop people from accessing Big Data, just as there is no way to stop data from getting big. It is like fighting computer viruses: people have to keep their anti-virus software updated. So in the computer trading case above, the FBI and SEC have to equip themselves with Big Data technology to fight those who also know Big Data.

Wednesday, February 20, 2013

If you want to be a data scientist, you should know these 67 questions.

Recently, I read a blog post by Vincent Granville on Data Science Central. If you would like to apply for a data scientist job, or just prepare for one, you may try to answer these 66 open-ended questions.

66 job interview questions for data scientists

Here are some questions from it:

What is the curse of big data? (answer)

Examples where mapreduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC's solution offering an hybrid approach - both internal and external cloud - to mitigate the risks and offer other advantages (which ones)?  (answer)

I would like to add one more question (and no, it is not "Why not 67 questions?"):

Why do we need more data scientists? (answer)

Tuesday, February 12, 2013

Big Data in Financial Service Industry

In order to get senior management's buy-in on Big Data, you will have to show them some use cases.

Let's start with the financial services industry, including banks and others.

From Oracle:

This Oracle White Paper briefly talks about Oracle Big Data technology and several use cases in the financial services industry.

Financial Services Data Management: Big Data Technology in Financial Services

From IBM:

IBM solutions for big data provides banks with an integrated and scalable set of cost-effective, high-performance tools that support the rapid ingestion of important customer data from a variety of sources and the fast analysis of large volumes of data at transactional, product or enterprise levels.

See this page on the IBM website: Deriving Business Insight from Big Data in Banking

And the white paper: IBM Information Agenda for Banking - Financial Crisis and Integrated Risk Management for Financial Institutions

From IDC:

The document is not free. You will have to pay US$1,000 to get it.

Big Data - Use Cases in Financial Services
Price: US $1,000

Author: Michael Versace

Insights Presentation
July, 2012  -  Doc # FIN236035
Number of Pages: 18
Abstract
Data is the currency of competition in financial service. The effective use of data and information is the foundation upon which firms compete. Services are wrapped around data to differentiate products and services. For example, knowing which customers represent the best credit revenue and profitability opportunity to a bank is a question that only data and analysis can answer.
As an extension, IDC Financial Insights believes that Big Data and business analytics can quickly deliver competitive advantage for those firms that effectively harness and leverage the trend. In this IDC Financial Insights presentation, we describe some of the drivers behind big data with examples for how big data technologies are being applied against some demanding business imperatives in the financial markets today. The presentation concludes with Essential Questions and Guidance to practitioners.


Sunday, February 10, 2013

The History of Big Data - 2

In my post "The History of Big Data", I noted that the name "big data" originated as a tag for a class of technology with roots in high-performance computing.

After reading the NYTimes article "The Origins of ‘Big Data’: An Etymological Detective Story", I found out that the origins of Big Data might be different.

The article mentions Francis X. Diebold, an economist at the University of Pennsylvania, and his most recent paper, which concludes: “The term Big Data, which spans computer science and statistics/econometrics, probably originated in the lunch-table conversations at Silicon Graphics in the mid-1990s, in which John Mashey figured prominently.”

Wednesday, February 6, 2013

Big Data skills

Lots of people, especially IT professionals, are wondering what kinds of skills are required to get into the field of Big Data.

Here are some skills you should have or plan to have:



They will take time to learn and explore, but all of the above skills will help you build a Big Data career path, such as becoming a Data Scientist.

Here are some articles for your reference:

"Big data analytics is sometimes sold as a boon for IT workers, with analyst house Gartner predicting that within three years there will be 4.4 million staff working on big data projects. "

"The U.S. faces a substantial shortage of workers with data science skills, according to a much-talked about report published last year by consulting firm McKinsey and Company. The report predicted that by 2018 the country will lack 1.5 million analysts who can make strategic decisions using big data and between 140,000 to 190,000 workers with the proper data-processing technology skills."

"Regardless if they are called Data Scientists or Data Analysts, Data geeks need to be more in control of their destiny. "

Saturday, February 2, 2013

Big Data University

You may be wondering where you should start your Big Data learning journey. After a bit of research, I found that Big Data University is a good place to start.

Big Data University is an online educational site run by new and experienced Hadoop, Big Data and DB2 users who want to learn, contribute course materials, or look for job opportunities. It is hosted in the cloud and uses the Moodle 2 course management system, enabled to run on DB2. It is in the beta stage.

The site includes free and fee-based courses delivered by experienced professionals and teachers.
Seeing DB2 but no other databases (including open source ones), I guess this site is either sponsored by IBM or run by IBM product lovers. Either way, it does no harm to learn Big Data there.


To study at this "university", you register either with your Google, Facebook, Yahoo or ChannelDB2 account or by creating a Big Data University account. Most IT or data professionals should already have at least one account from Google, Facebook or Yahoo. If you don't use DB2, you might not even know ChannelDB2.


According to the site statistics, there are 63,339 registered students (as of today, Feb. 2, 2013; I am not sure whether the site publishes the latest number). Put in the perspective of real universities, that is about 3 times the size of Harvard (about 20,000 students) or Stanford (about 18,000).


So, you want to join?

Sunday, January 27, 2013

Big Data Use Case #2 - Netflix

Just last week, Netflix stock soared after its fourth-quarter results topped forecasts. On Jan. 24, shares of Netflix rose $43.60 to $146.86 on Nasdaq, their highest level since September 2011.

Also in its earnings report, the company predicted it will add as many as 2.1 million U.S. streaming members in the first quarter, more than it gained during the first three months of last year.

How will Netflix attract new subscribers? Although the company didn't say, analyzing Big Data should be one of its techniques. People who have been following Netflix should know about the Netflix Prize contest. The following, in Netflix's own words, provides a bit of detail:

Netflix is all about connecting people to the movies they love. To help customers find those movies, we’ve developed our world-class movie recommendation system: Cinematch℠. Its job is to predict whether someone will enjoy a movie based on how much they liked or disliked other movies. We use those predictions to make personal movie recommendations based on each customer’s unique tastes. And while Cinematch is doing pretty well, it can always be made better.

Now there are a lot of interesting alternative approaches to how Cinematch works that we haven’t tried. Some are described in the literature, some aren’t. We’re curious whether any of these can beat Cinematch by making better predictions. Because, frankly, if there is a much better approach it could make a big difference to our customers and our business.

So, we thought we’d make a contest out of finding the answer. It’s “easy” really. We provide you with a lot of anonymous rating data, and a prediction accuracy bar that is 10% better than what Cinematch can do on the same training data set. (Accuracy is a measurement of how closely predicted ratings of movies match subsequent actual ratings.) If you develop a system that we judge most beats that bar on the qualifying test set we provide, you get serious money and the bragging rights. But (and you knew there would be a catch, right?) only if you share your method with us and describe to the world how you did it and why it works.

Serious money demands a serious bar. We suspect the 10% improvement is pretty tough, but we also think there is a good chance it can be achieved. It may take months; it might take years. So to keep things interesting, in addition to the Grand Prize, we’re also offering a $50,000 Progress Prize each year the contest runs. It goes to the team whose system we judge shows the most improvement over the previous year’s best accuracy bar on the same qualifying test set. No improvement, no prize. And like the Grand Prize, to win you’ll need to share your method with us and describe it for the world.

According to the company blog, Netflix announced team BellKor’s Pragmatic Chaos as the $1M Grand Prize winner of the Netflix Prize contest for their verified submission on July 26, 2009 at 18:18:28 UTC, which achieved the winning RMSE of 0.8567 on the test subset. This represents a 10.06% improvement over Cinematch’s score on the test subset at the start of the contest.
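To make the accuracy measure concrete, here is a minimal sketch of how RMSE (root mean squared error) and a percentage improvement over a baseline can be computed. The sample ratings are made up for illustration, and the baseline score is only back-calculated from the two figures quoted above (0.8567 and 10.06%); it is an assumption, not a number taken from Netflix. The function names are hypothetical as well.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


def improvement_percent(baseline_rmse, new_rmse):
    """Relative RMSE reduction versus a baseline, in percent."""
    return (baseline_rmse - new_rmse) / baseline_rmse * 100


# Hypothetical ratings on a 1-5 star scale (not Netflix data).
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 5]
print(round(rmse(predicted, actual), 4))

# Baseline implied by the two figures quoted above (0.8567 and 10.06%);
# this value is back-calculated here, not taken from Netflix.
winning_rmse = 0.8567
implied_baseline = winning_rmse / (1 - 0.1006)
print(round(implied_baseline, 4))
print(round(improvement_percent(implied_baseline, winning_rmse), 2))
```

A lower RMSE means the predicted ratings sit closer to the actual ones, so the contest bar of "10% better" means cutting this error by a tenth relative to Cinematch.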

Working out how much Cinematch has contributed to Netflix's financial results would need a project of its own. One thing is for sure: the company should collect more data about its subscribers, not only from its own business but also from social sources. The more data it gets, the better the recommendations it should be able to provide, and the larger the revenue it should make.

Friday, January 25, 2013

Big Data Use Case #1 - NBA

Have you ever heard of a company named Ayasdi? I didn't know this name until I recently read Sarah Reedy's blog post "Ideas Watch: Ayasdi Gives Big-Data a Name".

In her post, she reported that Ayasdi had just received $10.25 million in Series A funding. For what? Ayasdi's cloud-based Insight Discovery Platform uses distributed computing, machine learning, and user-experience technologies to take all the guesswork out of massive data sets. On its own website, the company says "Solving Today’s Biggest Problems Requires an Entirely New Approach to Data" and "A New Way to Discover Insights Leading to Breakthrough Outcomes".

I was amazed by the following picture, named "Big-Data Basketball". Had I not read the note under the picture, I would have thought it showed galaxies newly discovered by NASA or new genetic maps found by scientists. It is actually a topological similarity network of 452 NBA players during the 2010-2011 season. Ayasdi used its software to discover patterns in those players' data and broke the players down into 13 classifications beyond the 5 normal positions on the court (point guard, shooting guard, small forward, power forward and center).
 

Then what? The results of the analysis could change how coaches and general managers think about the roles their players fill and help teams win more games. The analysis could also help teams find good players and potentially good players. In other words, the software makes Big Data create value (money). You can get more detail from the WIRED magazine article "Analytics Reveal 13 New Basketball Positions".

This use case also shows that this valuable analysis of big data was done not by a large company like IBM, Oracle or Microsoft, but by a startup.

Big Data provides huge opportunities for startup companies.

Thursday, January 24, 2013

Oracle and Big Data

When people talk about Oracle, they first think about its RDBMS (relational database management system). Since Oracle acquired many companies, including BEA and Sun, people also know about its Java and WebLogic. So where is Oracle's Big Data product?

On its own site, Oracle provides information about its products that help customers acquire and organize big data and analyze it alongside their existing data to find new insights and make better business decisions. Oracle's Big Data platform provides an end-to-end solution: all the components customers need to get real results from their big data initiatives.

Acquire Big Data

Making the most of big data means quickly analyzing a high volume of data generated in many different formats. Oracle offers a range of products for acquiring all your data including:
Oracle NoSQL Database
Oracle Database

Organize Big Data

A big data platform needs to process massive quantities of data in batch and in parallel—filtering, transforming and sorting it before loading it into an enterprise data warehouse. Oracle offers a choice of products for organizing big data including:
Oracle Big Data Appliance
Oracle Data Integrator
Oracle Big Data Connectors

Analyze Big Data

Analyzing big data within the context of all your other enterprise data can reveal new insights that can have a significant impact on your bottom line. Oracle offers a portfolio of tools for statistical and advanced analysis that complement Oracle Exadata, including:
Oracle Advanced Analytics
Oracle Exadata Database Machine
Oracle Data Warehousing
Oracle Exalytics In-Memory Machine

You can watch the Oracle Bigdata Videos on YouTube:



Also, you can read this Oracle White Paper - Oracle Information Architecture: An Architect's Guide to Big Data.

Tuesday, January 22, 2013

IBM and Big Data

Most large companies, like IBM, Microsoft and Oracle, are promoting Big Data ideas and their related software.

There is a good site maintained by IBM. It tells you where to start with Big Data.

IBM Big Data - Where do I start?

One of the sections is as follows:

If you are new to BigData concepts you can start with this
1. http://www.ibm.com/bigdata - Quick introduction to Big Data. Reading time - 5 minutes
2. http://www-01.ibm.com/software/data/bigdata/enterprise.html - Give you an overview of the two products in IBM Big Data - InfoSphere Streams and InfoSphere BigInsights - Reading time 10 minutes 
3. http://bigdatauniversity.com - This contains an excellent Certification Course for Hadoop Fundamentals - and has a good coverage on the open source foundational components such as Hadoop, MapReduce concepts, Pig, Hive, Flume JAQL etc. There are videos, hands-on downloadable VM, lab exercises, reading material etc. The bible of Hadoop and MapReduce reference pdf book is available for download. If your expertise so far has been one line summary of each of the technolgies mentioned above, you will need to spend about 3 to 4 days to cover this course, reading time + exercises. There's a test that you can appear for at the end of the course and yes, you get a certificate if you clear it. Reading up and clearing certification time 4 to 5 days

Also, another site from IBM:

Big Data - Find developer and DBA resources, tutorials, and articles to help you grow your knowledge on big data technology and IBM's integrated big data platform.

Monday, January 21, 2013

The History of Big Data

Suddenly, everyone is talking about Big Data. When did Big Data start? Who invented the name "Big Data"?

I like GilPress' blog post "A Very Short History of Big Data", which summarizes big data's brief history starting in 1944, when Fremont Rider, Wesleyan University Librarian, published The Scholar and the Future of the Research Library. In December 2008, Randal E. Bryant (CMU), Randy H. Katz (Berkeley), and Edward D. Lazowska (Univ of Washington) published “Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society.” They wrote: “Big-data computing is perhaps the biggest innovation in computing in the last decade. We have only begun to see its potential to collect, organize, and process data in all walks of life. A modest investment by the federal government could greatly accelerate its development and deployment.”

According to another article, "Forrester: Big data – start small, but scale quickly", the name "big data" originated as a tag for a class of technology with roots in high-performance computing, as pioneered by Google in the early 2000s. That means the history of the name "Big Data", counted from when people started to use it, is about 10 years.

Thursday, January 17, 2013

McKinsey and Big Data

Big Data is hot. Big Data is as hot as Mobile and Cloud. If we add Mobile, Cloud and Big Data together (M+C+BD), the result will be the hottest thing (MCBD) in the world now.

This blog will focus only on Big Data. So I call it "Big Data Big".

Ever since I started to pay attention to Big Data a while ago (and again recently, when I joined a seminar about Big Data), McKinsey has always been mentioned. If you go to Google and type in McKinsey and Big Data, the first result (apart from the paid ones) will be the following:


Big data: The next frontier for innovation, competition - McKinsey ...

www.mckinsey.com/.../big_data_the_next_frontier_for_innov...
MGI studied big data in five domains—healthcare in the United States, the public ... For example, a retailer using big data to the full could increase its operating ...

Big data will become a key basis of competition, underpinning new ...


After you click the link, you will see the article (Big data: The next frontier for innovation, competition, and productivity, dated May 2011), which briefly introduces Big Data and the McKinsey Global Institute (MGI)'s seven key insights about it. It will eventually lead you to download the full report about Big Data.