Statistics-Watistics


 

Alright, this post is a bit of a side trip on our learning journey. I’ve always believed that having a strong grip on statistics is super important when diving into machine learning. A lot of popular ML algorithms, like linear regression, are basically borrowed from stats. If you’re more into analytics than hardcore ML, knowing your statistics becomes even more critical. The tools and concepts you pick up in stats are insanely useful across all sorts of fields—whether it’s physics, engineering, social sciences, medical research, or even testing new drugs. Whatever buzzwords you’re into—Data Science, AI, Machine Learning, Deep Learning, Analytics, Data-Driven Decisions, Data Modeling—you name it, stats is at the heart of it all. Trust me, getting comfy with statistics will make everything else click into place.

But how so?

Well, there are two ways to look at data science:

  1. Computational View: Here, data is seen as a huge sequence of numbers that need to be crunched by fast algorithms. Think of things like approximate nearest neighbors, low-dimensional embeddings, spectral methods, and distributed optimization.

  2. Statistical View: In this view, data comes from a random process. The goal is to figure out how this process works so we can make predictions or understand what influences it.

Statistics is all about understanding the process that generates data. This process has two parts: one part is predictable and makes sense to us, and the other part is just pure randomness. The aim of statistics is to dig into this process and explain as much of it as possible, so that whatever remains unexplained is true, unpredictable randomness.
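
To make the "predictable part plus randomness" idea concrete, here is a minimal sketch in Python. The data-generating process (a straight line plus Gaussian noise) and all the numbers in it are my own assumptions for illustration, not something taken from the course. It fits a line by least squares and checks that what is left over looks like the noise we put in.

```python
import random

random.seed(0)

# Hypothetical data-generating process: y = 2 + 3x + noise.
TRUE_INTERCEPT, TRUE_SLOPE, NOISE_SD = 2.0, 3.0, 1.0
xs = [random.uniform(0, 10) for _ in range(2000)]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, NOISE_SD) for x in xs]

# Ordinary least squares for a single predictor, written out by hand.
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals are what the fitted (predictable) part cannot explain.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
residual_sd = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5

print(f"slope={slope:.3f}, intercept={intercept:.3f}, residual sd={residual_sd:.3f}")
```

If the fit has captured the predictable part, the estimated slope and intercept should land close to the values used to generate the data, and the residual spread should be close to the noise level.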

Therefore, having a reliable resource to understand what this field has to offer is crucial. This is why I want to talk about the course MITx 18.6501x: Fundamentals of Statistics by Professor Philippe Rigollet. It is offered by MIT through edX. The course content is free for anyone to access, but if you want to complete it and get a certificate, there is a fee. It is one hell of a course to take if you want to solidify your foundations in your data science journey.

There are several advantages to accessing this course through edX, just like other courses on the platform. You get access to all videos, transcripts, and PowerPoints used in class, along with online support for any questions about the material. Additionally, there's an active learning community where you can ask questions and get help with assignments and quizzes, often through helpful hints. The community itself is warm and welcoming, acting as a learning support group and keeping you motivated throughout the journey.

The course itself isn't a walk in the park—it's quite tough and demands a lot of rigor and discipline to complete, as the professor himself admits in his initial lectures. However, if you really put in the effort and push through with determination, the learnings are incredibly rewarding. Finishing this course has been one of the most satisfying experiences I've had in a long time.

So what do you get out of the course?

The course starts right off the bat with the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT), the two results everything else builds on. One of the key concepts introduced early on is the three ways a sequence of random variables can converge to a limiting random variable:

  1. Almost sure convergence
  2. Convergence in probability
  3. Convergence in distribution
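
A quick simulation helps to see the LLN and the CLT at work. The sketch below is my own illustration, assuming draws from Uniform(0, 1), and uses only the Python standard library. The first loop shows the sample mean settling near 0.5 as the sample grows (LLN). The second part standardizes many sample means and checks that they behave roughly like a standard normal (CLT, which is a statement about convergence in distribution).

```python
import math
import random
import statistics

random.seed(1)

def sample_mean(n):
    """Average of n draws from Uniform(0, 1), whose true mean is 0.5."""
    return sum(random.random() for _ in range(n)) / n

# LLN: the sample mean settles near 0.5 as n grows.
for n in (10, 1_000, 100_000):
    print(n, round(sample_mean(n), 4))

# CLT: sqrt(n) * (mean - 0.5) / sigma is approximately standard normal.
n, reps = 200, 5_000
sigma = math.sqrt(1 / 12)  # standard deviation of Uniform(0, 1)
z = [math.sqrt(n) * (sample_mean(n) - 0.5) / sigma for _ in range(reps)]
inside = sum(abs(v) <= 1.96 for v in z) / reps

print(f"mean={statistics.fmean(z):.3f}, sd={statistics.pstdev(z):.3f}, "
      f"P(|Z|<=1.96)={inside:.3f}")
```

If the CLT approximation is doing its job, the standardized means should have mean near 0, standard deviation near 1, and roughly 95% of them should fall within 1.96 of zero.
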

From there, we move on to estimation, confidence intervals, and hypothesis testing. The course also demonstrates how to use the Delta method to build an asymptotic confidence interval: an interval that, for large samples, contains the true value of a parameter with a chosen level of confidence (say 95%). The recipe works for a wide range of models, such as the Exponential distribution.
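
Here is a rough sketch of what a Delta method interval can look like for the Exponential distribution. The setup is my own assumption: the data are Exponential with rate lambda, and the estimator is 1 divided by the sample mean. The CLT gives the asymptotic behaviour of the sample mean, and pushing it through g(x) = 1/x with the Delta method says the estimator is approximately normal around lambda with standard deviation lambda / sqrt(n). Plugging in the estimate for the unknown lambda gives the interval below, and the loop checks how often it covers a hypothetical true rate.

```python
import math
import random

random.seed(2)

def delta_method_ci(sample, z=1.96):
    """Asymptotic 95% CI for the rate of an exponential sample (plug-in variance)."""
    n = len(sample)
    lam_hat = n / sum(sample)                 # 1 / sample mean
    half_width = z * lam_hat / math.sqrt(n)   # asymptotic sd of lam_hat is lam / sqrt(n)
    return lam_hat - half_width, lam_hat + half_width

TRUE_RATE = 2.0  # hypothetical true parameter, chosen for the simulation
n, reps = 500, 2_000
hits = 0
for _ in range(reps):
    sample = [random.expovariate(TRUE_RATE) for _ in range(n)]
    lo, hi = delta_method_ci(sample)
    hits += lo <= TRUE_RATE <= hi

coverage = hits / reps
print(f"empirical coverage: {coverage:.3f}")
```

The empirical coverage should come out close to the nominal 95%, which is exactly what "asymptotic confidence interval" promises for large samples.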

One of the key takeaways for me from this course was learning three methods for constructing an estimator that converges to the true value of a parameter:

  1. Maximum Likelihood Estimation
  2. Method of Moments
  3. M-Estimation
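
The three methods above can be compared on a toy problem. This is a minimal sketch under my own assumptions: the data are Uniform(0, theta) with a hypothetical theta of 5, a model I picked because the three estimators come out visibly different. The maximum likelihood estimator is the sample maximum, the method of moments solves E[X] = theta / 2 using the sample mean, and the M-estimator here is the sample median (the minimizer of the average absolute loss), doubled because the median of this model is theta / 2.

```python
import random
import statistics

random.seed(3)

TRUE_THETA = 5.0  # hypothetical upper bound of Uniform(0, theta)

def estimates(n):
    """Return (MLE, method of moments, median-based M-estimate) of theta."""
    sample = [random.uniform(0, TRUE_THETA) for _ in range(n)]
    mle = max(sample)                       # likelihood theta**(-n) is largest at the sample max
    mom = 2 * statistics.fmean(sample)      # solve E[X] = theta / 2 with the sample mean
    m_est = 2 * statistics.median(sample)   # median minimizes average absolute loss
    return mle, mom, m_est

for n in (10, 1_000, 100_000):
    mle, mom, m_est = estimates(n)
    print(f"n={n:>6}  MLE={mle:.3f}  MoM={mom:.3f}  M-est={m_est:.3f}")
```

All three should drift toward 5 as the sample grows, though not at the same speed, which is a nice way to see that "converges to the true value" does not mean "equally good".
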

Throughout the course, you learn several tools that help a data scientist draw conclusions about an entire population by examining a sample of data. I urge readers to explore the course content themselves, either by accessing it for free or by paying for the certificate.

As mentioned before, taking this course through edX is very convenient. However, it's important to note that the workload goes beyond the lectures: there are homework sets, in-class exercises, two midterms, and a final exam (all objective questions). These tasks are quite challenging and serve as real head-scratchers for serious students. I've seen students struggle with the questions in classwork and homework, often relying on the community for hints. This shows that the course is truly meant for serious learners who are determined to master the material.

Conclusion

Should you take up this course and invest three months of your time and energy? Well, if you seriously want to master the field of data science and continue your journey into ML, then absolutely yes. Not only will it enrich your knowledge base with useful statistical tools that will help you become a better data scientist, but it will also strengthen your foundations for the learning journey ahead.
