-- Josh Smith
"Science is knowledge which we understand so well that we can teach it to a computer. Everything else is art."
-- Donald Knuth
"Doing data analysis requires quite a bit of thinking and we believe that when you’ve completed a good data analysis, you’ve spent more time thinking than doing."
-- Roger Peng
- Frame: What to forecast? At what horizon? At what level?
- Acquire, Refine, Explore: Use exploratory data analysis (EDA) to understand the trends and patterns in the data
- Models: Mean Model, Linear Trend, Random Walk, Simple Moving Average, Exp Smoothing, Decomposition, ARIMA
- Insight: Share the insight through a data visualisation of the models
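
As a rough illustration of the simplest models in the list above, the sketch below implements the mean model, random walk, simple moving average, and simple exponential smoothing as one-step-ahead forecasts. The `sales` series, the function names, and the `window` and `alpha` values are all made up for this example; decomposition and ARIMA are left out because they are better served by a library such as StatsModels.

```python
from statistics import mean


def mean_forecast(series):
    """Mean model: forecast the historical average."""
    return mean(series)


def random_walk_forecast(series):
    """Random walk (naive) model: forecast the last observed value."""
    return series[-1]


def moving_average_forecast(series, window=3):
    """Simple moving average: average of the last `window` observations."""
    return mean(series[-window:])


def exp_smoothing_forecast(series, alpha=0.5):
    """Simple exponential smoothing.

    level_t = alpha * y_t + (1 - alpha) * level_{t-1}, seeded with the
    first observation (one common choice among several).
    """
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


# Hypothetical monthly sales figures, invented for illustration only.
sales = [10, 12, 14, 16, 18, 20]

forecasts = {
    "mean": mean_forecast(sales),
    "random_walk": random_walk_forecast(sales),
    "moving_average": moving_average_forecast(sales, window=3),
    "exp_smoothing": exp_smoothing_forecast(sales, alpha=0.5),
}

for name, value in forecasts.items():
    print(f"{name}: {value}")
```

On a trending series like this one, the mean model lags furthest behind and the random walk tracks the latest value, which is one reason to compare several baselines before reaching for ARIMA.
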
- Frame: What are the comments you are trying to understand?
- Acquire, Refine, Explore: Build a word cloud, then apply lemmatization, part-of-speech analysis, and entity chunking
- Models: TF-IDF, Topic Modelling, Sentiment Analysis
- Insight: Share the insight through a word cloud and topic visualisation
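
To make the TF-IDF step above concrete, here is a minimal sketch using only the standard library. It assumes whitespace tokenization and one particular weighting (relative term frequency times `log(N / df)`); libraries such as Scikit-Learn use smoothed variants, so the numbers will differ. The `comments` list is invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stemming or lemmatization)."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return scores


# Hypothetical customer comments, invented for illustration only.
comments = ["great product great price", "poor service", "great service"]

for comment, weights in zip(comments, tf_idf(comments)):
    top_term = max(weights, key=weights.get)
    print(f"{comment!r}: top term = {top_term!r}")
```

Terms that occur in every document get a score of zero under this weighting, which is the property that pushes distinctive words to the top.
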
- Toy Problems
- Simple Problems
- Complex Problems
- Business Problems
- Research Problems
- Scraping (structured, unstructured)
- Files (csv, xls, json, xml, pdf, ...)
- Database (sqlite, ...)
- APIs
- Streaming
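
A small sketch of three of the sources listed above (CSV files, JSON, and a SQLite database), using only the standard library. The inline strings stand in for real files, and the `weather` table and its columns are hypothetical; scraping, APIs, and streaming would need libraries such as Requests or Beautiful Soup and are not shown.

```python
import csv
import io
import json
import sqlite3

# Hypothetical inline data standing in for real files on disk.
csv_text = "city,temp\nPune,31\nDelhi,38\n"
json_text = '[{"city": "Pune", "temp": 31}, {"city": "Delhi", "temp": 38}]'

# CSV: every value arrives as a string and needs converting later.
rows_from_csv = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: numbers keep their type.
rows_from_json = json.loads(json_text)

# SQLite: load the records into an in-memory database and query them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (city TEXT, temp INTEGER)")
conn.executemany(
    "INSERT INTO weather VALUES (?, ?)",
    [(r["city"], r["temp"]) for r in rows_from_json],
)
rows_from_db = conn.execute(
    "SELECT city, temp FROM weather ORDER BY temp DESC"
).fetchall()
conn.close()

print(rows_from_csv[0])
print(rows_from_json[0])
print(rows_from_db)
```

The same records come back with different types depending on the source, which is why a refine step usually follows acquisition.
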
- Data Cleaning (inconsistent, missing, ...)
- Data Refining (derive, parse, merge, filter, convert, ...)
- Data Transformations (group by, pivot, aggregate, sample, summarise, ...)
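
The three steps above might look like this in Pandas. This is a minimal sketch on an invented `raw` table: the column names, the choice of median imputation, and the derived `month` column are assumptions made for the example, not a recommended recipe.

```python
import pandas as pd

# Hypothetical raw data with inconsistent labels and a missing value.
raw = pd.DataFrame(
    {
        "city": ["Pune", "pune ", "Delhi", "Delhi"],
        "date": ["2016-01-05", "2016-02-10", "2016-01-20", "2016-02-15"],
        "sales": [100.0, None, 80.0, 130.0],
    }
)

# Cleaning: normalise inconsistent labels and fill the missing value.
clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.title()
clean["sales"] = clean["sales"].fillna(clean["sales"].median())

# Refining: parse the date strings and derive a month column.
clean["date"] = pd.to_datetime(clean["date"])
clean["month"] = clean["date"].dt.month

# Transformations: group by city, then pivot city against month.
by_city = clean.groupby("city")["sales"].sum()
pivot = clean.pivot_table(
    index="city", columns="month", values="sales", aggfunc="sum"
)

print(by_city)
print(pivot)
```
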
- Simple Vis
- Multi Dimensional Vis
- Geographic Vis
- Large Data Vis (Bin - Summarise - Smooth)
- Interactive Vis
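
The "Bin - Summarise - Smooth" idea for large data can be sketched without any plotting library: reduce many points to a few bins, summarise each bin, then smooth the summaries before drawing them. The functions, the fixed-width binning, the mean as the summary, and the moving-average smoother below are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def bin_summarise(xs, ys, bin_width):
    """Bin x into fixed-width bins and summarise y by its mean per bin."""
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        bins[int(x // bin_width)].append(y)
    return {b: mean(values) for b, values in sorted(bins.items())}


def smooth(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(mean(values[lo:hi]))
    return out


# Hypothetical data: twelve points standing in for a much larger set.
xs = list(range(12))
ys = [1, 3, 2, 4, 10, 12, 11, 13, 5, 7, 6, 8]

binned = bin_summarise(xs, ys, bin_width=4)
smoothed = smooth(list(binned.values()))

print(binned)
print(smoothed)
```

Only the binned and smoothed values need to reach the plotting layer, so the cost of drawing no longer grows with the number of raw points.
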
- Continuous: Regression - Linear, Polynomial, Tree Based Methods - CART, Random Forest, Gradient Boosting Machines
- Categorical: Classification - Logistic Regression, Tree, KNN, SVM, Naive-Bayes, Bayesian Network
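
To illustrate the difference between the two targets above, the sketch below fits a least-squares line for a continuous target and uses k-nearest neighbours (KNN) for a categorical one. Both are written from scratch on invented toy data; in practice Scikit-Learn provides tested implementations of these and the other methods listed.

```python
from collections import Counter
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def knn_predict(train, query, k=3):
    """Majority vote among the k training points nearest to `query`.

    `train` is a list of (features, label) pairs; distance is squared
    Euclidean, which gives the same ranking as Euclidean distance.
    """
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Continuous target: toy points that lie exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# Categorical target: two well-separated toy groups.
train = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((8, 9), "high"), ((9, 8), "high"),
]
print(knn_predict(train, (2, 2)), knn_predict(train, (7, 8)))
```
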
- Continuous: Clustering & Dimensionality Reduction - K-means, PCA, SVD, MDS
- Categorical: Association Analysis
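
As one example from the list above, K-means can be sketched in a few lines as Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points. The toy points, the fixed starting centroids, and the fixed iteration count are assumptions made to keep the example deterministic; real implementations use random or k-means++ initialisation and a convergence check.

```python
def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm with caller-supplied starting centroids.

    Returns the final centroids and the clusters from the last
    assignment step (identical once the algorithm has converged).
    """
    clusters = [[] for _ in centroids]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (9, 8)])

print(centroids)
print(clusters)
```
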
- Time Series
- Text Analytics
- Network / Graph Analytics
- Optimization
- Reinforcement Learning
- Online Learning
- Deep Learning
- Other Applications: Image, Speech
- Narrative Visualisation
- Dashboard Visualisation
- Decision Making Tools
- Automated Decision Tools
- Acquire / Refine:
Pandas, Beautiful Soup, Selenium, Requests, SQLAlchemy, NumPy, Blaze
- Explore:
Matplotlib, Seaborn, Bokeh, Plotly, Vega, Folium
- Model:
Scikit-Learn, StatsModels, SciPy, Gensim, Keras, TensorFlow, PySpark
- Insight:
Django, Flask
- A good book on statistical learning is ISLR: An Introduction to Statistical Learning with Applications in R
- You can find all the ISLR code in Python at this GitHub repo: https://github.com/JWarmenhoven/ISLR-python
- Forecasting: Principles and Practice (Text)
- Statistical forecasting: Notes on regression and time series analysis (Case)
- Harvard Data Science Course - CS 109 Course (It is structured in a similar way to the approach we shared)
- Data Science Specialisation - JHU Data Science (It is a good course, though the material is coded in R)
- Many more on Coursera & Udacity...