AnalyticBridge — posts by Matthew A. Riebel (http://www.analyticbridge.com/profile/MatthewARiebel), feed retrieved 2014-09-17

Coolest things that have been done by statisticians, data scientists, or machine learning experts (2014-09-11)
<p>Obviously, there are some wonderful problems solved by data scientists. However, I love the problems below, which I have worked on, and believe they are among the best <img src="http://beyond.insofe.edu.in/wp-includes/images/smilies/icon_wink.gif" alt=";-)" class="wp-smiley"/></p>
<ul>
<li>Detecting sarcasm in speech</li>
<li>Identifying every fraudulent medicine administration among hundreds of thousands of cases</li>
<li>Helping physicians prescribe the most suitable medicine for a patient based on his insurance policy</li>
<li>Detecting patterns of customers in sales data that the company's marketing folks did not know until then</li>
<li>Helping a supply chain company plan its fleet to improve productivity by over 30%</li>
</ul>
<p>For a few interesting case studies, look at <a href="http://insofe.edu.in/init/default/consultancy">http://insofe.edu.in/init/default/consultancy</a></p>
<p>The point is that this is a wonderful and yet simple field where even regular practitioners can solve important problems.</p>

How to develop a churn prediction tool for mobile telecommunication using a data mining evolutionary algorithm (2014-09-10)
I am trying to develop a churn prediction tool for mobile telecommunication.<br />
My project is divided into four phases.<br />
Phase 1) User interface, which is divided into the following components:<br />
a) Start page<br />
The start page essentially consists of two buttons:<br />
• Button 1, for selecting the training data for the program<br />
• Button 2, for selecting the data on which the prediction has to be performed<br />
Pressing button 1 launches a file browser, which is used to select the text file containing the training data. This training data is then used to generate rules with the DMEL algorithm.<br />
<br />
b) File browser<br />
The file browser is invoked whenever the user has to select a file. It is a file-explorer window that prompts the user to browse to a file and returns its absolute path to the program.<br />
<br />
c) Attribute selector<br />
Once the file is selected, the program reads its contents and identifies the attributes. A popup window prompts the user to select the attributes that should be used for generating the rules; unchecked attributes are discarded.<br />
<br />
d) Window for setting importance factors<br />
After selecting the attributes, the user is presented with another window showing a slider for each attribute selected in the previous step. Each slider ranges from 0 to 1 in steps of 0.1. Once this step is complete, the selected attributes and their corresponding importance factors are stored in instances of the attribute class, each of which contains a variable named importance factor. The program then returns to the start page, where the user clicks button 2 to browse for the actual database file on which the prediction is to be done. For this, the file browser is used again.<br />
<br />
e) Output page<br />
Once the user selects the database file on which prediction is to be done, the prediction algorithm is invoked. A set of rules is generated from the previously selected training data, and these rules are applied to the prediction database. The target attribute value is then updated for every row. The output page provides options for displaying the rules as they are generated, for displaying the entire database, and for displaying only the target attribute field.<br />
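To make the phase-1 flow concrete, here is a minimal Python sketch of the data path: a labelled training file feeds a rule generator, and the resulting rules are applied to an unlabelled prediction file. The file contents, attribute names and the threshold-based rule learner are all made up for illustration; the real tool would plug the DMEL algorithm in where learn_rules stands.

```python
import csv
import io

# Hypothetical miniature training file: the last column "churn" is the target.
TRAINING_CSV = """tenure,complaints,churn
2,5,yes
30,0,no
3,4,yes
25,1,no
"""

# The prediction file has the same attributes but the target column is empty.
PREDICTION_CSV = """tenure,complaints,churn
4,6,
28,0,
"""

def read_rows(text):
    return list(csv.DictReader(io.StringIO(text)))

def learn_rules(rows, target="churn"):
    """Stand-in for the DMEL rule generator: derive one threshold rule
    per numeric attribute from the labelled training rows."""
    rules = []
    for attr in rows[0]:
        if attr == target:
            continue
        yes_vals = [float(r[attr]) for r in rows if r[target] == "yes"]
        no_vals = [float(r[attr]) for r in rows if r[target] == "no"]
        yes_mean = sum(yes_vals) / len(yes_vals)
        no_mean = sum(no_vals) / len(no_vals)
        # Midpoint between the class means acts as a crude decision threshold.
        threshold = (yes_mean + no_mean) / 2
        direction = "below" if yes_mean < threshold else "above"
        rules.append((attr, direction, threshold))
    return rules

def predict(rules, row):
    """Each rule votes churn yes/no; the majority wins."""
    votes = 0
    for attr, direction, threshold in rules:
        value = float(row[attr])
        hit = value < threshold if direction == "below" else value >= threshold
        votes += 1 if hit else -1
    return "yes" if votes > 0 else "no"

train = read_rows(TRAINING_CSV)
rules = learn_rules(train)
scored = [dict(r, churn=predict(rules, r)) for r in read_rows(PREDICTION_CSV)]
```

The key point for button 2 is that the prediction file must carry the same attributes as the training file, with the target (churn) column left empty; applying the generated rules fills that column in for every row.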
<br />
<br />
This is how my phase 1 will work, but the problem I am facing is with button 2, which is used to select the data for prediction. I want to know which data will be used for the prediction. Please explain with an example; it would be a great help in starting the project.

Skills you need to become a data scientist (2014-09-09)
<p>But, I thought, the more helpful approach might be a plan.</p>
<p>My goal is to create a plan where you get to the level of an average industry practitioner.</p>
<p>Skills you need: the ability to take Excel/CSV data sets, pre-process and visualize them, build a model, and visualize the results.</p>
<p>Recommended steps:</p>
<p>1. Download one data set from Kaggle/UCI or anywhere on the Internet. I am deliberately not giving a link, as I want you to search through multiple sets. Create a deck of slides describing the business problem, ROI, current practices, their weaknesses, etc.</p>
<p>Milestone 1: Creating a business context for a problem is a crucial step in becoming a practitioner. Congrats, you have done that! You should spend a week on this, provided you put in 20 hours a week.</p>
<p>2. Look at the attributes given. Brainstorm whether you can create more attributes from them. If transactions are given, you can create the average number of transactions per day, the average value of transactions, etc. Think and create as many new attributes as you can.<br/> 3. Download R and Deducer (my preference). Both are open source.<br/> 4. From the resources provided by others, learn the techniques and intuition behind standard data pre-processing (I mean ways in which you fill missing values, bin numeric variables, merge categorical variables, scale data, reduce dimensionality, etc.).<br/> 5. Use Excel/Deducer to create new data and pre-process the data.</p>
<p>Milestone 2: Creating one big structured table where independent attributes are columns and records are rows is a huge step in solving the problem. You should be able to do this with 4 weeks of work. Don't forget to add a few slides on data pre-processing to your ppt.</p>
<p>6. Learn descriptive statistics, histograms, box plots, scatter plots and bar charts. Learn to plot these in Deducer/ggplot.<br/> 7. Do detailed descriptive statistics and visualizations on the data. There are excellent resources on this all over the net. I created a few videos myself (<a href="http://beyond.insofe.edu.in/cate">http://beyond.insofe.edu.in/cate</a>…)</p>
<p>Milestone 3: Visualization is considered the most important interfacing step, and you are done with it. Add these to your slide deck. Allocate two weeks for this.</p>
<p>8. Learn linear regression, logistic regression and clustering from any of the resources given in these threads.<br/> 9. Apply them to your data sets and do all the diagnostics. Deducer makes this easy.</p>
<p>Milestone 4: Congrats! You built your predictive models. I think you need 3 weeks for this step.</p>
<p>10. Brainstorm how you can simplify and present these results. The goal is to present to a non-data scientist. Use your visualization skills again. Add these slides to your deck.</p>
<p>Milestone 5: Take a week or two for this.</p>
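As a tiny illustration of the pre-processing and model-fitting steps above (the thread recommends R/Deducer; plain Python is used here, and all numbers are made up), here is mean imputation followed by a closed-form simple linear regression:

```python
import statistics

# Toy records [x, y] with one missing value (None) to pre-process:
rows = [[1.0, 10.0], [2.0, None], [3.0, 14.0], [4.0, 16.0]]

# Fill the missing y with the column mean (one standard imputation choice).
known = [r[1] for r in rows if r[1] is not None]
mean_y = statistics.mean(known)
rows = [[x, y if y is not None else mean_y] for x, y in rows]

# Fit simple linear regression y = b0 + b1*x via the closed-form solution.
xs = [r[0] for r in rows]
ys = [r[1] for r in rows]
mx, my = statistics.mean(xs), statistics.mean(ys)
b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
b0 = my - b1 * mx
```

The same two steps (impute, then fit and inspect diagnostics) are exactly what Deducer's point-and-click interface does behind the scenes.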
<p>You have created a slide deck, some code and a knowledge base. More importantly, you solved a problem end-to-end. Voilà: in approximately 12 weeks you are where 90% of data scientists are <img src="http://beyond.insofe.edu.in/wp-includes/images/smilies/icon_smile.gif" alt=":-)" class="wp-smiley"/></p>
<p>Now, to get to a higher level:</p>
<p>Add more algorithms (decision trees, neural nets, etc.). Learn more domains and problems. Study techniques for unstructured data. There are wonderful courses in the thread. Take them slowly.</p>
<p>Hope this helps.</p>

Decomposition for log-linear model (2014-09-09)
<p>Hi, <br/> <br/> I have estimated a log-linear regression model using SAS with the following functional form:<br/> <br/> ln Y = a + b1*X1 + b2*X2 + b3*X3 + b4*X4<br/> <br/> The dependent variable is in log form; the independent/explanatory variables are in linear form.</p>
<p>With the equation I can estimate/forecast the linear value of Y by taking the antilog/exponent of the forecast from the equation, so that I can see the value in the original Y units instead of logs. This is fine.</p>
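That back-transformation can be sketched in a few lines of Python (the coefficient and X values below are made up). Note that because exp turns the sum into a product, the forecast decomposes exactly into one multiplicative factor per variable, which is the natural starting point for any per-variable attribution:

```python
import math

# Hypothetical fitted model: ln Y = a + b1*X1 + b2*X2 + b3*X3 + b4*X4
a = 2.0
b = [0.10, 0.05, -0.02, 0.08]   # b1..b4 (made-up coefficients)
x = [5.0, 12.0, 30.0, 7.0]      # one observation's X1..X4

log_forecast = a + sum(bi * xi for bi, xi in zip(b, x))
y_forecast = math.exp(log_forecast)   # back-transform to the original Y scale

# exp of a sum is a product of factors: one for the intercept, one per X.
factors = [math.exp(a)] + [math.exp(bi * xi) for bi, xi in zip(b, x)]
product = math.prod(factors)          # equals y_forecast up to rounding
```

An additive split of y_forecast (as asked below) is not unique; one common convention allocates the total in proportion to each variable's log contribution, bi*Xi / (a + sum of all contributions).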
<p>But I also want to decompose the forecast/estimate by the respective explanatory X variables. <br/> <br/> For example, if the total forecast in log form = 5, then the antilog/exponent of that gives me a forecast of 148 in the original Y series. Of that 148, what I need to calculate is how much X1, X2, etc. is worth, e.g.<br/> <br/> a = 2<br/> X1 = 45<br/> X2 = 15<br/> X3 = 25<br/> X4 = 61<br/> <br/> Total = 148<br/> <br/> Does anyone know how to do this?<br/> <br/> Thanks,<br/> <br/> Biswajit</p>

Logistic regression intercept term not significant (2014-09-08)
<p>Hi</p>
<p>I am presently building a short-stay model for inpatient claims. The dependent variable is defined as the claims for which we recovered an amount (such a claim can be a waste/fraud claim) through manual audits conducted last year.</p>
<p>Though the C statistic and other statistics are good for the model, the problem starts when I look at the significance of the intercept term, which is as below:</p>
<table border="0" cellspacing="0" width="384">
<tbody><tr><td height="20" width="64">Parameter</td>
<td width="64">DF</td>
<td width="64">Estimate</td>
<td width="64">Standard Error</td>
<td width="64">Wald Chi-Square</td>
<td width="64">Pr > ChiSq</td>
</tr>
<tr><td height="20">Intercept</td>
<td align="right">1</td>
<td align="right">-0.0165</td>
<td align="right">0.0728</td>
<td align="right">0.0514</td>
<td align="right">0.8206</td>
</tr>
</tbody>
</table>
<p>I am not sure what to do.</p>
<p>Can anyone please suggest what can be done? I understand that <span>the intercept gives the value of the dependent variable (on the log-odds scale, for logistic regression) when all the independent variables are equal to zero, but should I include the intercept while scoring my model, or do something else?</span></p>
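One way to see what the intercept does at scoring time, using the estimate from the table above (the slope coefficients and covariate values here are hypothetical):

```python
import math

intercept = -0.0165      # estimate from the table above (p = 0.82)
betas = [0.8, -0.3]      # hypothetical fitted slope coefficients
x = [1.2, 0.5]           # one claim's covariate values

# Scoring always uses the full linear predictor, intercept included;
# a non-significant intercept only means it is not distinguishable from 0.
z = intercept + sum(b * xi for b, xi in zip(betas, x))
prob = 1 / (1 + math.exp(-z))

# With all covariates at zero, the intercept alone sets the baseline rate:
baseline = 1 / (1 + math.exp(-intercept))   # close to 0.5
```

An estimate of -0.0165 with Pr > ChiSq = 0.82 simply says the baseline log-odds is statistically indistinguishable from 0, i.e. roughly a 50/50 baseline event rate; you still keep the fitted intercept in the linear predictor when scoring.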
<p><span>Thanks in advance!</span></p>
<p><span>Regards,<br/>Amitesh</span></p>

Curious formula generating all digits of square root numbers (2014-09-03)
<p>This elementary number, SQRT(2), has two interesting features. One that interests us data scientists: <a href="http://www.analyticbridge.com/forum/topics/challenge-of-the-week-random-numbers" target="_blank">the generation of great (non-periodic) random bits</a>, using its decimals. And one even more amazing to all mathematicians, one of the greatest unresolved mathematical problems of all time: are the digits of SQRT(2) randomly distributed, say in base 2?</p>
<p><a href="http://api.ning.com:80/files/kt*hMWIEF*D5*gCLqcqpr5ED-RfSVzMSflFQ0WRZJ4Kidce48SdH34dwuus0K562wOBik4AUxZtZtryN4LEhmol3Te3*81QV/bor55.PNG" target="_self"><img src="http://api.ning.com:80/files/kt*hMWIEF*D5*gCLqcqpr5ED-RfSVzMSflFQ0WRZJ4Kidce48SdH34dwuus0K562wOBik4AUxZtZtryN4LEhmol3Te3*81QV/bor55.PNG" width="682" class="align-center"/></a></p>
<p style="text-align: center;"><em>Figure 1: successive iterations of algorithm below; d(n) are digits of SQRT(2)/2 in base 2</em></p>
<p>The fact that such a simple question about such a basic number has yet to be proved or disproved as of today is simply amazing. Everyone very strongly believes that <a href="http://arxiv.org/abs/math/0512404" target="_blank">the answer is yes</a>, and that, indeed, subsequent digits are independent of each other. These decimals (and I mean billions of them, computed using some Map-Reduce architecture) have successfully passed all randomness tests so far.</p>
<p>Yet the result below might make you think that these decimals (bits, to be precise: here we work in base 2, not in base 10) just can't be random, because they obey such a natural, simple, non-chaotic recurrence relationship. The fact is that they almost certainly look extremely random (see column d(n) in Figure 1 above), and yet the result below is also, paradoxically, correct. Please check it out yourself: this is the purpose of this <a href="http://www.datasciencecentral.com/group/resources/forum/topics/best-kept-secret-about-data-science-competitions" target="_blank">challenge of the week</a>.</p>
<p><strong><span class="font-size-4">Recursive algorithm to compute digits of SQRT(2)/2</span></strong></p>
<p>Let us define the following recurrence system:</p>
<p><span style="font-family: 'courier new', courier;" class="font-size-2">p(0) = 0, p(1)= 1, e(1) = 2</span></p>
<p><span style="font-family: 'courier new', courier;" class="font-size-2"><strong>If</strong> 4p(n) + 1 < 2e(n) <strong>Then</strong></span></p>
<ul>
<li><span style="font-size: 10pt; font-family: 'courier new', courier;">p(n+1) = 2p(n) + 1</span></li>
<li><span style="font-size: 10pt; font-family: 'courier new', courier;">e(n+1) = 4e(n) - 8p(n) - 2</span></li>
<li><span style="font-size: 10pt; font-family: 'courier new', courier;">d(n+1) = 1</span></li>
</ul>
<p><span style="font-family: 'courier new', courier;" class="font-size-2"><strong>Else</strong></span></p>
<ul>
<li><span style="font-size: 10pt; font-family: 'courier new', courier;">p(n+1) = 2p(n)</span></li>
<li><span style="font-size: 10pt; font-family: 'courier new', courier;">e(n+1) = 4e(n)</span></li>
<li><span style="font-size: 10pt; font-family: 'courier new', courier;">d(n+1) = 0</span></li>
</ul>
<p>Note that d(n+1) = p(n+1) - 2p(n).</p>
<p><strong>The surprising result</strong> is that the d(n)'s represent the bits of SQRT(2)/2 when represented in base 2. In other words,</p>
<p style="text-align: center;">SQRT(2)/2 = SUM{ d(k)/2^k }</p>
<p>where the sum is over all integer k greater than 0. Can you prove it? Or find something wrong in my assertion? </p>
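Before attempting a proof, the claim is easy to check numerically; Python's arbitrary-precision integers run the recurrence exactly:

```python
from math import sqrt

def sqrt2_over_2_bits(n_bits):
    """Run the recurrence exactly with big integers.
    d(1) = 1 follows from d(n+1) = p(n+1) - 2p(n) with p(0)=0, p(1)=1."""
    p, e = 1, 2
    bits = [1]
    for _ in range(n_bits - 1):
        if 4 * p + 1 < 2 * e:
            # simultaneous assignment: e(n+1) uses the old p(n)
            p, e = 2 * p + 1, 4 * e - 8 * p - 2
            bits.append(1)
        else:
            p, e = 2 * p, 4 * e
            bits.append(0)
    return bits

bits = sqrt2_over_2_bits(50)
approx = sum(b / 2 ** (k + 1) for k, b in enumerate(bits))
err = abs(approx - sqrt(2) / 2)   # within 2**-50 of the true value
```

The first ten bits come out as 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, matching the binary expansion SQRT(2)/2 = 0.10110101 00…, and the partial sum after n iterations agrees with SQRT(2)/2 to within 2^-n.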
<p>Unfortunately, because the p(n)'s and e(n)'s grow exponentially fast, this recursion is of no use for producing random bits at scale. It does the job without violations, unlike pretty much all random number generators available in the public domain, but it cannot be implemented efficiently. Could you find similar recursions that produce (simulated) randomness and that are computationally manageable?</p>
<p>Attached is <a href="http://api.ning.com:80/files/kt*hMWIEF*D9YJaqqGdKZjl48htlITL*ESgmUKGre3HefxM-5ZddJEtrFOa7kkETbt1j*9zrCBNWx8RL-NZiaGfyun8awaQE/randomnumbersdigits.xlsx" target="_self">a small spreadsheet</a> that shows the calculation, if you want to replicate my results in Figure 1. <span>Some hints to optimize this recursion and prove this result, as well as a generalization to SQRT(a)/a in base b, and even to the digits of a^{-1/c} in base b using the same technique, are found in the next sections. </span>In short, it means that <strong>many irrational numbers have a highly predictable sequence of digits</strong>, despite looking perfectly random. This might be a concern if you design cryptographic applications relying on certain types of high-quality simulated random numbers.</p>
<p><strong>Another surprising result</strong>, if you look at the column <em>Approx to SQRT(2)/2</em> in Figure 1 (or better, download my spreadsheet), is that the algorithm converges relatively fast to SQRT(2)/2, but in a very weird way: any time d(n)=0, there is no progress, as if the algorithm had reached the limiting value, or got stalled. Such behavior would cause most numerical analysis programs to stop, erroneously believing they had reached the solution as soon as they got stuck in the first such configuration: in this example, when n=2 (n is the iteration counter). Interestingly, this algorithm will get stuck infinitely many times for a billion iterations in a row: this happens any time we hit a point in the expansion of SQRT(2)/2 where one billion bits in a row are zero. That happens incredibly rarely (that is the general belief), yet it happens an infinite number of times (that is the general belief too). Yet overall convergence is still very fast.</p>
<p><strong><span class="font-size-4">Numerical Optimization</span></strong></p>
<p>One way to get rid of the exponential growth (and thus of exponential storage and computing time as the number of iterations <strong>n</strong> grows) is to store the logs of p(n) and e(n) instead of p(n) and e(n). This would eliminate some of the computational and numerical issues associated with exponential growth. Note that p(n) is very well approximated by (SQRT(2)/2) * 2^n. Also note that d(n+1) is entirely determined by the sign of 4p(n) + 1 - 2e(n), that is, asymptotically, by the sign of 2p(n) - e(n).</p>
<p>Maybe transforming p(n) and e(n) is another solution, then obtaining recurrence relations for the transformed variables. For instance, instead of p(n), use q(n) = p(n) - INT(0.707106781 * 2^n). The number q(n) will be much more manageable (smaller) than p(n).</p>
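A quick check of that claim, with the truncated constant 0.707106781 exactly as above: running the recurrence for 30 iterations, p(n) grows to roughly 7.6 × 10^8, while q(n) never leaves {0, 1}.

```python
# Recompute p(n) with the exact recurrence, then the transformed q(n).
p, e = 1, 2
qs = []
for n in range(2, 31):
    if 4 * p + 1 < 2 * e:
        p, e = 2 * p + 1, 4 * e - 8 * p - 2
    else:
        p, e = 2 * p, 4 * e
    # q(n) = p(n) - INT(0.707106781 * 2^n), as suggested above
    qs.append(p - int(0.707106781 * 2 ** n))

# p(30) has ten digits, yet every q(n) computed so far is just 0 or 1.
```

Because the constant is truncated at nine decimals, q(n) would start to drift once 2^n times the truncation error exceeds 1 (around n = 33); storing more digits of the constant only postpones, and never removes, that growth.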
<p>However, no matter how efficiently we transform the two variables p(n) and e(n), it is impossible to keep storage and computing time bounded as the number of iterations (that is, the number <strong>n</strong>) is growing. At best, we can expect linear growth for storage, or maybe sub-linear (like a log function). Otherwise, our recurrence system (if bounded over positive integers as <strong>n</strong> grows) would be periodic, defeating the purpose of producing a non-periodic random number simulator. In addition, it could not converge to an irrational number, if bounded.</p>
<p><span class="font-size-4"><strong>Generalization and proof sketch</strong></span></p>
<p>A generalization of my irrational number <em>decimal generator</em> is as follows, and will help you prove the validity of my result:</p>
<p>Here we try to produce the decimals, in base <strong>b</strong>, of <strong>a</strong> at power -1/<strong>c</strong> (that is, the decimals of <strong>a</strong>^{-1/<strong>c</strong>} in base <strong>b</strong>), where <strong>a</strong>, <strong>b</strong>, <strong>c</strong> are positive integers. It works best when both <strong>b</strong>=2 and <strong>c</strong>=2.</p>
<p><strong>Define </strong></p>
<ul>
<li>k(n+1) = <strong>b</strong>*k(n)</li>
<li>p(n) is the largest integer such that { k(n) }^<strong>c</strong> > a*{ p(n) }^<strong>c</strong></li>
<li>p(n+1) = <strong>b</strong>*p(n) + r(n), with r(n) a non-negative integer strictly inferior to <strong>b</strong> (but as large as possible), such that <strong>a</strong> * { p(n+1) }^<strong>c</strong> < { k(n+1) } ^<strong>c</strong></li>
<li>e(n), the positive error term, is defined by { k(n) }^<strong>c</strong> = <strong>a</strong> * { p(n) }^<strong>c</strong> + e(n)</li>
<li>d(n+1) = p(n+1) - <strong>b</strong>*p(n)</li>
</ul>
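A direct Python transcription of this generalized definition, using exact integer arithmetic (a sketch for checking the construction, not an efficient generator):

```python
def root_digits(a, b, c, n_digits):
    """Digits of a**(-1/c) in base b, via the integer recurrence above:
    p(n) is the largest integer with a * p(n)**c < k(n)**c, k(n) = b**n."""
    p, k = 0, 1
    digits = []
    for _ in range(n_digits):
        k *= b                       # k(n+1) = b * k(n)
        # find the largest r in [0, b) keeping a * p(n+1)^c < k(n+1)^c
        r = 0
        while r + 1 < b and a * (b * p + r + 1) ** c < k ** c:
            r += 1
        digits.append(r)             # d(n+1) = p(n+1) - b*p(n) = r
        p = b * p + r
    return digits

demo = root_digits(3, 10, 2, 8)      # decimal digits of 1/SQRT(3)
```

With a=2, b=2, c=2 it reproduces the bits of SQRT(2)/2 from the earlier recursion, and with a=3, b=10, c=2 it emits 5, 7, 7, 3, 5, 0, 2, 6, the leading decimal digits of 1/SQRT(3) = 0.57735026…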
<p>Check out our <a href="http://www.datasciencecentral.com/group/resources/forum/topics/best-kept-secret-about-data-science-competitions" target="_blank">previous weekly challenges</a></p>

Some software and skills that every Data Scientist should know? (2014-09-01)
<p>To answer this question correctly, you need to ask yourself which job you want to aim for. A data scientist can aim for three different jobs. For lack of better words (or my lack of knowledge of those words!), let me classify them as</p>
<p>1. Analysts, 2. Consultants, 3. Engineers</p>
<p>Analysts: These are the people who do the same job repeatedly (statistical analysis in clinical trials, target marketing in banks, etc.). In India, quite a few companies that take on outsourced analytics also fall in this category. I noticed that they get data in a standard form, use the same model to analyze it, and use the same charts to visualize it. The variance from project to project is very little.</p>
<p>You need to be a master of one or two modules of one tool (like SAS or SPSS) for this. Any online video, an installed version of the software and some data are good enough to get you started. You do not need an in-depth understanding of the science either.</p>
<p>Your organization itself will have a lot of inertia against trying anything new. I had a really tough time convincing a bank to try decision trees (they had been doing logistic regression for 20 years) as late as 2010! The manager asked why I was bringing new things when the old ones were working fine :-)</p>
<p>Also, when I talked to his team about logistic regression, I realized that they did not understand the underlying mathematics or science well enough. But it was not a major deterrent for that specific job. They were doing fine.</p>
<p>Beware: these are the low-end jobs in data science. Choose this path if and only if you are OK with routine and not-so-difficult work.</p>
<p>2. Consultants: These are the McKinsey, Deloitte and Booz Allen Hamilton kind of people. I also see them in the dedicated analytics groups of large insurance and tech companies. They work on the different problems their clients are facing and provide the needed guidance and consulting.</p>
<p>You need a very good aptitude for understanding and communicating business problems at a high level (MBA-ish skills, of a sort). You need to be very good with a few standard algorithms (trees, nearest neighbors, regression, naive Bayes). If you position yourself as a data scientist and not a business consultant, you also need working knowledge of more advanced algorithms (support vector machines, belief nets, neural nets, etc.). I strongly recommend hands-on experience with one language to implement these (R, SAS, SPSS…). In fact, nowadays I am teaching R/Shiny to my students so they can quickly put up interactive demos. I strongly recommend a visualization tool (ggplot in R, Tableau or QlikView).</p>
<p>I also emphasize understanding the underlying mathematics intuitively. You should be able to play and experiment, not just use. Problem-solving and logical skills are very important.</p>
<p>3. Engineers: These are the product people. Google, Amazon, FB and scores of start-ups need data people who can code and build products.</p>
<p>You need to be very good at SQL and one programming language (my favorite is Python, but Java etc. is fine). Nowadays, NoSQL skills (Mongo, Cassandra, HBase, etc.) and Hive/Pig-style big data scripting skills are also very useful. You need to be very good with machine learning algorithms, efficient software engineering, and standard coding and development procedures. You will most likely work on technology, and hence the business and consulting skills are not as important as in the previous role.</p>
<p>In all three above, interestingly, an intuitive understanding of the algorithms is good enough and you do not need really deep math (I know I am scandalizing a purist here!).</p>
<p>If your goal is to teach and do research in data science, you need the skills mentioned in either 2 (if you want to teach in a business school) or 3 (if you want to teach in a CS school). In addition, you must be extremely good at advanced undergraduate mathematics (calculus, linear algebra and coordinate geometry). Designing newer algorithms and mathematics becomes very important here. For various topics related to data science, check <a href="http://beyond.insofe.edu.in/">http://beyond.insofe.edu.in/</a></p>
<p>So, to sum up: the skills you need to hone depend on the specific interests you want to pursue as a data scientist. Realize that data science is very broad and hence may lead to different professions. Pick what you love and tune yourself for that.</p>

Three Hadoop-Related Webinars in September (2014-08-29)
<p><strong>Hadoop 2.0: Learn YARN, Sept 16</strong></p>
<p>Speakers:</p>
<ul>
<li>John Kreisa, Hortonworks</li>
<li>John Haddad, Informatica</li>
<li>Imad Birouty, Teradata</li>
</ul>
<p>Hosted by: Tim Matteson, Cofounder, Data Science Central</p>
<p><a href="http://bit.ly/1tghmZY" target="_blank">Sign Up</a> </p>
<p><strong>Discover Red Hat and Apache Hadoop</strong> - Modern Data Architecture Series</p>
<p>September 3, 10 and 17 @10am PDT</p>
<p>Join us in this 3-part interactive webinar series as we'll demonstrate how Red Hat JBoss Data Virtualization can integrate with Hadoop through Hive and provide users easy access to data. During the session we will walk through best practices, use cases, demo and tutorials to get you started with Apache Hadoop and Red Hat.</p>
<p><a href="http://bit.ly/1lu6mWA" target="_blank">Sign Up</a> </p>
<p><strong>A Modern Data Architecture for Customer Analytics</strong> - with HP Vertica and Apache Hadoop</p>
<p>September 9, 2014 @10am PDT</p>
<p>Attend this webinar and learn how organizations are combining the HP Vertica Analytics Platform and Hortonworks to quickly explore and analyze a broad variety of data types, transforming them into actionable information that allows them to understand how their customers and site visitors interact with their business, offline and online.</p>
<p><a href="http://bit.ly/1nylY6q" target="_blank">Sign Up</a> </p>

New approaches for data modeling (2014-08-28)
<p>Very interesting question, which can be answered from multiple perspectives.<br/> Techniques: If you are looking at techniques in data modeling, there are quite a few that are exploding. Deep learning, spectral methods, kernel methods, probabilistic graphical models and social network analytics are all among the latest and fastest growing areas.<br/> Business verticals: We are also seeing a lot of interest in data science applications across the entire circle of health care industries, including pharmaceutical companies, hospitals and insurance companies. Previously, only banks and retail organisations used to be analytics savvy. So, if I interpret your question as asking where data science is becoming a new approach to problem solving, I advise you to watch the healthcare sector.<br/> Horizontal problems: We often hear from clients in a variety of verticals about their need to solve questions related to unstructured data analysis in the context of social media content. Data visualization is also a capability that is generating a lot of interest in the corporate world.</p>

What is the importance of understanding data distributions in machine learning? (2014-08-25)
<p>I would like to know the importance of understanding the underlying data distributions in a dataset before applying any machine learning algorithm, whether for a prediction or a classification problem.</p>