ABC of Statistics for Data Science Day 01 English

Introduction

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, presentation, and organization of data.

In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied.

Here are some key aspects of statistics:

  1. Data Collection: Statistics involves gathering data from various sources, which could be experiments, surveys, or observational studies. The way data is collected is critical to ensuring its validity and relevance.

  2. Data Analysis: Once data is collected, statisticians use a variety of methods to analyze it. This includes descriptive statistics, which summarize data through numbers like the mean and standard deviation, and inferential statistics, which allow for conclusions and predictions to be drawn from data.

  3. Interpretation: One of the most crucial steps in statistics is interpreting the results of the data analysis. This involves understanding what the data is saying and, just as importantly, understanding the limitations of the data and the analysis methods used.

  4. Presentation: Effective communication of statistical findings, often through charts, graphs, and tables, is essential. This makes complex data understandable and accessible to those who need to make decisions based on it.

  5. Decision Making: In many cases, the ultimate goal of statistics is to inform decision-making. This can range from business decisions, like how to improve a product, to healthcare decisions, like evaluating the effectiveness of a new treatment.

  6. Prediction and Forecasting: Statistics is often used to make predictions about the future based on existing data. Techniques like regression analysis, time series analysis, and machine learning are used for forecasting.

  7. Variability and Uncertainty Handling: Statistics provides methods to deal with variability and uncertainty in data. It helps in understanding the randomness in data and making informed decisions despite the inherent uncertainties.

  8. Quantitative Research: In various fields like economics, medicine, psychology, and environmental science, statistics is used for quantitative research and hypothesis testing.

Statistics, thus, plays a crucial role in numerous fields, enabling us to make sense of the vast amounts of data generated in our world today. It is a fundamental tool in research and analysis, aiding in the understanding and solving of complex problems.
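
To make the descriptive statistics mentioned above a little more concrete, here is a minimal sketch using Python's standard `statistics` module. The `daily_sales` numbers are made up purely for illustration; they do not come from any real dataset.

```python
import statistics

# Hypothetical sample: units sold per day over one week (made-up numbers).
daily_sales = [12, 15, 11, 15, 20, 18, 14]

mean_sales = statistics.mean(daily_sales)          # arithmetic average
median_sales = statistics.median(daily_sales)      # middle value of the sorted data
mode_sales = statistics.mode(daily_sales)          # most frequent value
variance_sales = statistics.variance(daily_sales)  # sample variance (divides by n - 1)
stdev_sales = statistics.stdev(daily_sales)        # sample standard deviation

print(f"mean={mean_sales:.2f}, median={median_sales}, mode={mode_sales}")
print(f"variance={variance_sales:.2f}, stdev={stdev_sales:.2f}")
```

The mean, median, and mode describe where the data is centered, while the variance and standard deviation describe how spread out it is.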

Statistics in Data Science and Machine Learning

Statistics provides methods for designing experiments and surveys, and for making inferences about the characteristics of populations based on sample data. In the context of data science and machine learning, statistics plays several crucial roles:

  1. Data Understanding and Preparation: Statistics helps in understanding data through descriptive statistics such as mean, median, mode, variance, and standard deviation. These metrics provide insights into the data’s central tendency, dispersion, and overall distribution. Understanding these aspects is vital for data cleaning and preparation.

  2. Modeling and Algorithm Selection: Many machine learning algorithms are grounded in statistical theories. For instance, linear regression, logistic regression, and various types of clustering methods are directly based on statistical concepts. Selecting the right algorithm often requires understanding these statistical underpinnings.

  3. Inference and Prediction: Statistics is key to making inferences and predictions from data. It helps in estimating the relationships between variables and in making predictions about future observations. For example, statistical hypothesis testing is used to assess whether an observed effect is likely to be real or could plausibly have arisen by random chance.

  4. Performance Evaluation: After training a machine learning model, statistics is used to evaluate its performance. Tools such as the confusion matrix, precision, recall, the F1 score, and ROC curves are based on statistical concepts. These metrics help in understanding the strengths and weaknesses of a model.

  5. Experimentation and Validation: In machine learning, experimentation is essential. Statistical methods such as A/B testing and cross-validation are used to validate models and ensure their effectiveness and reliability before deploying them in real-world applications.

  6. Dealing with Uncertainty: Machine learning models often have to deal with uncertainty in data. Statistics provides tools to quantify, manage, and make decisions under uncertainty, for instance, through probabilistic models and Bayesian methods.

  7. Feature Engineering and Selection: Statistical methods help in identifying significant variables (features) that have more predictive power. Correlation analysis is commonly used for feature selection, and principal component analysis (PCA) for dimensionality reduction.

  8. Ethical and Responsible AI: Statistics plays a role in ensuring that machine learning models are fair, ethical, and unbiased. Statistical analysis can help identify and mitigate biases in data and models.

In essence, statistics forms the backbone of data science and machine learning, providing the necessary tools and methodologies for extracting insights and knowledge from data. Its importance cannot be overstated, as it enables practitioners to make data-driven decisions and build intelligent systems that are effective, reliable, and ethical.
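
As a small illustration of the performance evaluation point above, the sketch below counts the four cells of a confusion matrix for a binary classifier and derives precision, recall, and the F1 score from them. The labels in `y_true` and `y_pred` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```

In practice a library would usually compute these metrics, but writing them out by hand shows that they are simple ratios of counts.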

Why is data important for statistics?

Before you can use statistics to analyze a problem, you must convert information about the problem into data. That is, you must establish or adopt a system of assigning values, most often numbers, to the objects or concepts that are central to the problem in question.

For instance, when you buy something at the store, the price you pay is a measurement: it assigns a number signifying the amount of money that you must pay to buy the item. Similarly, when you step on the bathroom scale in the morning, the number you see is a measurement of your body weight. Depending on where you live, this number may be expressed in either pounds or kilograms, but the principle of assigning a number to a physical quantity (weight) holds true in either case.

Data need not be inherently numeric to be useful in an analysis. For instance, the categories male and female are commonly used in both science and everyday life to classify people, and there is nothing inherently numeric about these two categories. Similarly, we often speak of the colors of objects in broad classes such as red and blue, and there is nothing inherently numeric about these categories either. (Although you could make an argument about different wavelengths of light, it’s not necessary to have this knowledge to classify objects by color.)

This kind of thinking in categories is a completely ordinary, everyday experience, and we are seldom bothered by the fact that different categories may be applied in different situations. For instance, an artist might differentiate among colors such as carmine, crimson, and garnet, whereas a layperson would be satisfied to refer to all of them as red. Similarly, a social scientist might be interested in collecting information about a person’s marital status in terms such as single-never married, single-divorced, and single-widowed, whereas to someone else, a person in any of those three categories could simply be considered single. The point is that the level of detail used in a system of classification should be appropriate, based on the reasons for making the classification and the uses to which the information will be put.
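
The marital status example can be sketched in code. The snippet below records hypothetical survey responses at the detailed level and then collapses them into a coarser scheme. The responses and the `COARSE_CATEGORY` mapping are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical survey responses recorded at the detailed level.
responses = [
    "single-never married",
    "married",
    "single-divorced",
    "single-widowed",
    "married",
    "single-never married",
]

# Assumed mapping from the detailed categories to a coarser classification.
COARSE_CATEGORY = {
    "single-never married": "single",
    "single-divorced": "single",
    "single-widowed": "single",
    "married": "married",
}

detailed_counts = Counter(responses)
coarse_counts = Counter(COARSE_CATEGORY[r] for r in responses)

print(detailed_counts)
print(coarse_counts)
```

Collecting data at the detailed level keeps both views available: detail can always be collapsed later, but it cannot be recovered from data that was recorded only as "single".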

Measurements

Measurement is the process of systematically assigning numbers to objects and their properties to facilitate the use of mathematics in studying and describing objects and their relationships. Some types of measurement are fairly concrete: for instance, measuring a person’s weight in pounds or kilograms or his height in feet and inches or in meters. Note that the particular system of measurement used is not as important as the fact that we apply a consistent set of rules: we can easily convert a weight expressed in kilograms to the equivalent weight in pounds, for instance. Although any system of units may seem arbitrary (try defending feet and inches to someone who grew up with the metric system!), as long as the system has a consistent relationship with the property being measured, we can use the results in calculations.

Measurement is not limited to physical qualities such as height and weight. Tests to measure abstract constructs such as intelligence or scholastic aptitude are commonly used in education and psychology, and the field of psychometrics is largely concerned with the development and refinement of methods to study these types of constructs.

Establishing that a particular measurement is accurate and meaningful is more difficult when it can’t be observed directly. Although you can test the accuracy of one scale by comparing results with those obtained from another scale known to be accurate, and you can see the obvious use of knowing the weight of an object, the situation is more complex if you are interested in measuring a construct such as intelligence. In this case, not only are there no universally accepted measures of intelligence against which you can compare a new measure, there is not even common agreement about what “intelligence” means. To put it another way, it’s difficult to say with confidence what someone’s actual intelligence is because there is no certain way to measure it, and in fact, there might not even be common agreement on what it is. These issues are particularly relevant to the social sciences and education, where a great deal of research focuses on just such abstract concepts.
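
The point about consistent rules can be illustrated with the kilogram-to-pound conversion mentioned above. This is a minimal sketch: the helper names and the sample weights are hypothetical, while the conversion factor is the standard definition of the international pound.

```python
KG_PER_POUND = 0.45359237  # one international pound expressed in kilograms


def kg_to_pounds(kg: float) -> float:
    """Convert a weight in kilograms to pounds."""
    return kg / KG_PER_POUND


def pounds_to_kg(lb: float) -> float:
    """Convert a weight in pounds to kilograms."""
    return lb * KG_PER_POUND


# Hypothetical weights of three people, in kilograms.
weights_kg = [50.0, 72.5, 90.0]
weights_lb = [kg_to_pounds(w) for w in weights_kg]

# The ratio between two weights is the same in either unit system.
ratio_kg = weights_kg[2] / weights_kg[0]
ratio_lb = weights_lb[2] / weights_lb[0]

print([round(w, 1) for w in weights_lb])
print(round(ratio_kg, 6), round(ratio_lb, 6))
```

Because both units have a consistent relationship with the property being measured, comparisons made in one unit carry over to the other.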

These concepts of measurement have been adapted from the book Statistics in a Nutshell, 2nd Edition.

Scales or levels of measurements

Below is a tabulated summary of the scales or levels of measurement in statistics:

| Scale | Definition | Examples | Important Information |
| --- | --- | --- | --- |
| Nominal | Categorizes data without a natural order or ranking. | Gender, Nationality | Only used for labeling; mathematical operations are not meaningful. |
| Ordinal | Categorizes data with a natural order, but intervals are not consistent. | Movie ratings (e.g., 1-5 stars), Economic class (e.g., low, middle, high) | Indicates order, but differences between values are not standardized. |
| Interval | Numeric scale where intervals between values are consistent, but there is no true zero point. | Temperature in Celsius or Fahrenheit, Calendar years | Allows for meaningful addition and subtraction; multiplication and division are not meaningful. |
| Ratio | Similar to interval, but with a meaningful zero point, allowing for all mathematical operations. | Height, Weight, Age, Income | Allows for all mathematical operations, including meaningful ratios. |

This table encapsulates the essential aspects of each measurement level, including their definitions, typical examples, and important considerations for their use in statistical analysis and data science.
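
The difference between the interval and ratio scales can be checked with a short calculation. The sketch below, which uses arbitrary example temperatures, shows that equal differences stay equal when Celsius is converted to Fahrenheit, whereas the ratio between two temperatures changes with the unit. Only on a scale with a true zero, such as Kelvin, does a ratio carry physical meaning.

```python
def celsius_to_fahrenheit(c: float) -> float:
    return c * 9 / 5 + 32


def celsius_to_kelvin(c: float) -> float:
    return c + 273.15


# Arbitrary example temperatures in degrees Celsius.
t1, t2, t3 = 10.0, 20.0, 30.0

# Differences: equal steps in Celsius remain equal steps in Fahrenheit.
step_c = (t2 - t1, t3 - t2)
step_f = (
    celsius_to_fahrenheit(t2) - celsius_to_fahrenheit(t1),
    celsius_to_fahrenheit(t3) - celsius_to_fahrenheit(t2),
)

# Ratios: "20 is twice 10" holds only for the Celsius numbers, not the temperatures.
ratio_c = t2 / t1
ratio_f = celsius_to_fahrenheit(t2) / celsius_to_fahrenheit(t1)
ratio_k = celsius_to_kelvin(t2) / celsius_to_kelvin(t1)

print(step_c, step_f)
print(round(ratio_c, 4), round(ratio_f, 4), round(ratio_k, 4))
```

The three ratios disagree, which is why statements like "twice as hot" are not meaningful for Celsius or Fahrenheit readings.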

Data Types

Qualitative vs. Quantitative Data Types

| Aspect | Qualitative Data | Quantitative Data |
| --- | --- | --- |
| Definition | Data that describes qualities or characteristics. | Data that can be measured or counted. |
| Nature | Non-numeric, subjective. | Numeric, objective. |
| Examples | Colors, textures, smells, opinions, genres. | Height, weight, temperature, scores, quantities. |
| Analysis | Categorization, thematic analysis, content analysis. | Statistical analysis, mathematical calculations. |
| Measurement | Nominal or ordinal scales. | Interval or ratio scales. |
| Purpose | Understanding complex concepts, opinions, or experiences. | Quantifying characteristics, making predictions, testing hypotheses. |

Categorical vs. Numerical Data Types

| Aspect | Categorical Data | Numerical Data |
| --- | --- | --- |
| Definition | Data that represents groups or categories. | Data that represents quantities and can be measured. |
| Types | Nominal (no inherent order) and Ordinal (ordered categories). | Discrete (countable numbers) and Continuous (measurable quantities). |
| Examples | Gender, nationality, blood type. | Age, salary, temperature, distance. |
| Analysis | Used for classification, sorting, grouping. | Used for statistical calculations, comparisons. |
| Characteristic | Often non-numeric, but can be coded numerically. | Inherently numeric. |
| Use in Research | Identifying subgroups, exploring data distribution. | Performing calculations, establishing correlations. |

These tables should help differentiate these data types and provide a clear understanding of their uses and characteristics in data analysis, statistics, and research methodologies.
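
Since categorical data "can be coded numerically", here is a minimal sketch of two common ways to do that in plain Python: one-hot encoding for a nominal variable and integer codes for an ordinal one. The sample values and the `EDUCATION_ORDER` list are assumptions made for illustration.

```python
# Nominal variable: blood type has no inherent order, so each category
# gets its own 0/1 indicator column (one-hot encoding).
blood_types = ["A", "O", "B", "AB", "O"]
categories = sorted(set(blood_types))  # ['A', 'AB', 'B', 'O']
one_hot = [[1 if bt == c else 0 for c in categories] for bt in blood_types]

# Ordinal variable: education level has a natural order, so integer codes
# that preserve that order are a reasonable encoding.
EDUCATION_ORDER = ["high school", "undergraduate", "graduate"]  # assumed ranking
education = ["graduate", "high school", "undergraduate"]
education_codes = [EDUCATION_ORDER.index(e) for e in education]

print(categories)
print(one_hot)
print(education_codes)
```

Integer codes for a nominal variable would suggest an order that does not exist, which is why the two kinds of categorical data are encoded differently here.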

In summary:

  • Qualitative Data Types (Categorical): Nominal and Ordinal scales are used for categorizing and ranking data without implying a numeric nature.

  • Quantitative Data Types (Numerical): Interval and Ratio scales are used for numeric data, allowing for a wide range of arithmetic and statistical operations. Ratio data includes a true zero point, which differentiates it from interval data.

A few other data types that are important for statistical analysis are listed in this table:

| Data Type | Description | Examples |
| --- | --- | --- |
| Discrete Data | Data that can only take specific values (typically integers). These values are countable and have gaps between them. | Number of children in a family, number of cars in a parking lot. |
| Continuous Data | Data that can take any value within a range. These values are measurable and can be infinitely subdivided. | Height, weight, temperature, time. |
| Binary Data | A special case of nominal data with only two categories or states (0 or 1, True or False, Yes or No). | Outcome of a coin flip (heads or tails), a light switch (on or off). |
| Categorical Data | Data that can be divided into groups, which may or may not have a logical order. | Blood type (A, B, AB, O), types of cuisine (Italian, Chinese, Indian). |
| Ordinal Categorical Data | A type of categorical data with a clear ordering or ranking. | Star ratings for a hotel (1-star, 2-star, 3-star), education level (high school, undergraduate, graduate). |
| Time Series Data | Data points collected or recorded at regular time intervals. | Daily stock market prices, monthly rainfall amounts. |
| Spatial Data | Data that has a geographical or spatial component. | Locations on a map, regions in a geographic information system (GIS). |
| Multivariate Data | Data involving multiple variables or attributes. | Data sets containing demographics, economic indicators. |
| Structured Data | Data that adheres to a predefined model or schema, like in databases. | Relational database tables, Excel spreadsheets. |
| Unstructured Data | Data that doesn’t fit into a conventional database schema. | Text files, multimedia content, web pages. |
| Semi-Structured Data | A mix of structured and unstructured data formats. | Emails (structured headers, unstructured body), XML and JSON documents. |
| Boolean Data | Data with only two possible values. | True/False questions, On/Off switches. |
| Nominal Data | Categorizes data without any order or rank. | Types of animals, varieties of fruits. |
| Textual Data | Data consisting of words, sentences, or paragraphs. | Books, articles, social media posts. |
| Audio Data | Data in the form of sound. | Recorded speeches, music files. |
| Video Data | Sequences of images (frames). | Movies, surveillance footage. |
| Image Data | Visual data in the form of pixels. | Photographs, paintings. |
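
As one illustration of the semi-structured case, the sketch below parses a hypothetical email stored as JSON. The header fields can be read directly by name, while the free-text body needs further processing before it can be analyzed. The message content and field names are invented for this example.

```python
import json

# Hypothetical email represented as a JSON document.
raw = (
    '{"from": "a@example.com", "to": "b@example.com", '
    '"subject": "Report", '
    '"body": "Hi, the monthly report is attached. Thanks!"}'
)

message = json.loads(raw)

# Structured part: named fields that can be accessed directly.
headers = {key: value for key, value in message.items() if key != "body"}

# Unstructured part: free text that has to be processed, here with a
# very simple whitespace-based word count.
word_count = len(message["body"].split())

print(headers)
print(word_count)
```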

 

Learning Resources

Learn from our YouTube channel in Urdu/Hindi. Here is the link to the playlist:

ABC of Statistics for Data Science and Machine Learning.


18 Comments.

  1. Assalamu alaikum! Sir, this is excellent. I had been searching the internet for many days for a way to study statistics. Masha'Allah, your lectures have impressed me greatly and I am learning from them. JazakAllah. May Allah Almighty reward you with goodness and always keep you happy. Ameen.

  2. AOA, This blog post provides an introduction to statistics and covers topics such as the definition of statistics, types of data, and the importance of statistics in data science. You have done a good job of explaining the concepts in a simple and easy-to-understand manner. The use of examples and illustrations makes the blog post engaging and informative. Overall, I found the blog post to be a great resource for me. May Allah grant you the blessings of both worlds. Ameen.

  3. A very good explanation of statistics in this blog. This blog is very helpful for understanding statistics for data science and machine learning. Thank you, Sir.

  4. This blog is superb and easily understandable for all those who have no background in statistics. It is the first time in my life that I have taken an interest in statistics, and this is because of your easy way of explaining it.

  5. This blog post contains the starter topics of statistics, if anyone follows these topics then they can easily improve their statistical expertise.
