Pearson vs Spearman Correlation

Pearson:

Measures the strength of a linear relationship.

Assumes linearity.

Example: correlation between height and weight.

Sensitive to outliers.

Spearman:

Measures a monotonic relationship: consistently increasing or decreasing, but not necessarily linear.

Example: higher marks lead to lower (better) rank numbers, but generally not linearly.

Does not assume linearity.

Works well for ordinal (ranked) variables; it is not meant for unordered categories.

Less sensitive to outliers.
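The contrast above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (pearson, ranks, spearman) and the sample data are assumptions made for this example. Spearman is computed here as Pearson applied to the ranks, so a curved but monotonic relationship still scores close to 1.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y grows with x, but along a curve (monotonic, not linear).
x = [1, 2, 3, 4, 5, 6]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print("Pearson :", round(r_pearson, 3))
print("Spearman:", round(r_spearman, 3))
```

Pearson falls below 1 because the points do not lie on a straight line, while Spearman only looks at the ordering.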

Ref: Internet sources

Reporting (Results and Discussion) for your Data Analytics Projects

Evaluation, Results, Analysis, Reporting

Evaluation: What and How

•Evaluate the accuracy and generality of the model

•(covered under model evaluation and threats to validity)

•Now evaluate whether the model meets the business objectives

•Determine whether there are business reasons why the model is deficient

•Apply the model to a real-world case and observe the outcome

•Evaluate the data mining, model, and experiment results generated

Evaluation Results and Reporting

•Assess the data mining results against the business success criteria

•Report on the overall results

•Then analyze and evaluate them against the business success criteria

•Describe the impact and implications for the business

•Summarize the assessment results in terms of the business success criteria

•Include a final statement on whether the project meets the initial business objectives

•Reporting and Analysis

Examples

•Results section: Page 51

https://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=1692&context=etd_projects

•Check Results and Discussion sections

https://arxiv.org/ftp/arxiv/papers/2203/2203.06848.pdf

A Comparative Study on Forecasting of Retail Sales

•May be complicated

https://arxiv.org/pdf/2303.11633.pdf

•Learning Context-Aware Classifier for Semantic Segmentation

•Check results section; also Discussion Section

https://arxiv.org/pdf/2303.07533.pdf

•You can notice: results reported under different criteria, use of tables and figures.

•Notice/read the descriptions

Data Analytics, Machine Learning, Data Science

Examples: Experiment Design

Experiment 1:

Forecast the nations that will have the most suicides.

Data:

Output variables:

Method/Algorithm for this experiment

Experiment 2:

Find the association of GDP and population size with suicide rates.

Data:

Output variables:

Method/Algorithm for this experiment

Experiment 3:

Predict which age groups are most prone to suicide.

Data:

Output variables:

Method/Algorithm for this experiment

Tools and Tutorials for Data Manipulation

Join Data from Multiple Sources

Power BI

Python

SQL
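As a small illustration of joining data from two sources, the sketch below uses Python's built-in sqlite3 module to run a SQL join. The table names, column names, and rows are hypothetical and stand in for two separate sources. A LEFT JOIN is used so that customers without orders still appear in the result.

```python
import sqlite3

# In-memory database; the two tables stand in for two separate data sources.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cho');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# LEFT JOIN keeps every customer; COALESCE turns a missing total into 0.
rows = conn.execute("""
    SELECT c.name,
           COUNT(o.order_id) AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
""").fetchall()

for row in rows:
    print(row)

conn.close()
```

The same join can be expressed in Power BI (relationships or Merge Queries) or in pandas; the SQL version is shown only because it needs no extra packages.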

•Databases and Data Warehouse

https://durhamcollege.desire2learn.com/d2l/le/content/467097/viewContent/6376898/View

•Data Modeling and SQL

https://durhamcollege.desire2learn.com/d2l/le/content/467097/viewContent/6376900/View

•Microsoft Power BI

https://durhamcollege.desire2learn.com/d2l/le/content/467097/viewContent/6377023/View

Tutorials and Examples

•MySQL Data Manipulation:

https://www.databasejournal.com/mysql/mysql-data-manipulation-and-query-statements/

https://www.w3schools.com/sql/

https://www.tutorialspoint.com/sql/index.htm

•Workbench: https://www.tutorialspoint.com/create-a-new-database-with-mysql-workbench

•SQL Server Data Manipulation

https://www.tutorialspoint.com/ms_sql_server/index.htm

•Management Studio:

https://www.tutorialspoint.com/ms_sql_server/ms_sql_server_management_studio.htm

•Power BI Data Manipulation

https://learn.microsoft.com/en-us/power-bi/connect-data/desktop-tutorial-importing-and-analyzing-data-from-a-web-page

Data Manipulation in Python

https://www.analyticsvidhya.com/blog/2021/06/data-manipulation-using-pandas-essential-functionalities-of-pandas-you-need-to-know/

Threat To Validity for Your Data Analytics Projects

•Internal

•External

•Construct

•Statistical Conclusion

Internal: An informative variable is missing. Bring in data from other sources.

External: A fixation variable makes the result look perfect. The model may not generalize.

Construct: Class imbalance distorts the outcome.

Statistical Conclusion: Depending on the statistical measure used, the conclusion can be incorrect.

•Data Mining: Association: Support, Confidence, and Lift

Internal Validity

Is your experiment (and Model) Internally Valid?

What is the threat that the experiment (model and outcome) is internally invalid?

Example: The reasons for inferring a causal relationship between two variables are incorrect.

Cause: Lack of informative variables

Solution: Bring data from other sources

External Validity

Is your experiment (and Model) Externally Valid?

What is the threat that the experiment (model and outcome) is externally invalid?

“Study results may not apply to other groups.”

Cause: Fixation Variable

Solution: exclude fixation variable from the study

Ref: https://en.wikipedia.org/wiki/External_validity

Construct Validity

Is your experiment (and Model) Valid by Construction?

What is the threat that the experiment (model and outcome) is invalid by construction?

Example: In classification, if the data is imbalanced, the variables’ apparent effect on the outcome can be invalid.

Cause: Construction/balance problem

Solution: Treat Data for Imbalance
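One common way to treat imbalance is random oversampling of the minority class. The sketch below is an assumption-laden illustration: the function name random_oversample and the toy data are made up for this example, and other treatments (undersampling, class weights, synthetic samples) may suit a project better.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Hypothetical data: 8 rows of class 0 and only 2 rows of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [5.0], [5.5]]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

bal_rows, bal_labels = random_oversample(rows, labels)
print("before:", Counter(labels))
print("after: ", Counter(bal_labels))
```

Oversampling is normally applied to the training split only, so that duplicated rows do not leak into the test data.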

Statistical Conclusion Validity

Is your conclusion (from the experiment and the model) statistically valid, even when it comes from statistical analysis?

What is the threat that the conclusion (from the experiment and the model) is invalid?

Example: In data mining, you considered only whether an association exists. That does not give the full picture.

Solution: Include Support, Confidence, and Lift
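The three measures can be computed directly from transaction counts. The sketch below uses the standard definitions for a rule A -> C: support = P(A and C), confidence = P(C | A), lift = confidence / P(C). The function name association_metrics and the basket data are hypothetical.

```python
def association_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.
    Assumes the antecedent and the consequent each occur at least once."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))         # baskets with A
    n_c = sum(1 for t in transactions if c <= set(t))         # baskets with C
    n_ac = sum(1 for t in transactions if (a | c) <= set(t))  # baskets with both
    support = n_ac / n
    confidence = n_ac / n_a
    lift = confidence / (n_c / n)
    return support, confidence, lift

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

support, confidence, lift = association_metrics(baskets, {"bread"}, {"milk"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.4f}")
```

In this toy data the rule bread -> milk has a fairly high confidence, yet its lift is below 1, because milk is already common on its own. Reading confidence alone would overstate the association.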

Ref: https://www.analyticsvidhya.com/

McNemar’s Test

Chi-Square

McNemar’s Test

Chi-square: “A chi-square test is used to help determine if observed results are in line with expected results, and to rule out that observations are due to chance.” Example: a coin flip. [1]
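The coin-flip example can be worked through as a chi-square goodness-of-fit test. The counts below are hypothetical, and the helper names are made up for this sketch. With two outcomes there is 1 degree of freedom, for which the p-value has a closed form using the complementary error function.

```python
import math

def chi_square_gof(observed, expected):
    """Chi-square statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(chi2):
    """Upper-tail p-value for 1 degree of freedom only:
    P(X > chi2) = erfc(sqrt(chi2 / 2))."""
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical experiment: 100 flips of a coin assumed to be fair.
observed = [60, 40]  # heads, tails
expected = [50, 50]  # what a fair coin would give on average

chi2 = chi_square_gof(observed, expected)
p = p_value_df1(chi2)
print(f"chi-square = {chi2:.2f}, p-value = {p:.4f}")
```

A small p-value suggests the observed counts are unlikely under the fair-coin assumption; the usual 0.05 threshold is a convention, not a rule.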

References:
1. https://www.investopedia.com/terms/c/chi-square-statistic.asp

Statistics for Data Analytics and Machine Learning Projects

Null Hypothesis

[2]

Paired t-test

Unpaired t-test

•Pearson Correlation

One-Way Analysis of Variance

Spearman Correlation

•Kendall Tau Coefficient

Wilcoxon Rank-Sum Test

Basic EDA

•McNemar’s Test

•Friedman test

•Kruskal-Wallis Test

Two-Way Analysis of Variance

•K-Fold Cross Validation paired t-test

•Wilcoxon Signed Rank Test
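As one example from the list above, the paired t-test on k-fold cross-validation scores compares two models evaluated on the same folds. The sketch below only computes the t statistic and the degrees of freedom; the function name and the per-fold accuracies are hypothetical.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean of the differences divided by
    its standard error. Returns (t, degrees_of_freedom)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical accuracies of two models on the same 5 folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.83]
model_b = [0.78, 0.77, 0.80, 0.79, 0.80]

t_stat, dof = paired_t_statistic(model_a, model_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
```

The statistic is then compared with a t distribution with the given degrees of freedom. Because the folds share training data, the scores are not fully independent, so the result is best read with caution.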

Make Sense of your Data: For Data Analytics Project

Hypothesis-based versus data-driven analysis

“Only those data analysts who are given time to explore and analyze data thoughtfully and thoroughly are consistently successful.”

Data Identification and Prioritization

Use augmented data in addition to the data pipeline

Analytics Sandbox


Characterizing the Data—Exploring a Single Variable

Data: Descriptive analysis options

Find: Distribution of quantitative variables
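Exploring a single quantitative variable usually starts with summary statistics and a rough look at the distribution. The sketch below uses only Python's standard library; the values and the bin width are hypothetical.

```python
import statistics
from collections import Counter

# Hypothetical values of one quantitative variable (note the outlier, 95).
values = [12, 15, 15, 18, 21, 22, 22, 22, 30, 95]

summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points

for name, value in summary.items():
    print(f"{name:>6}: {value}")
print(f"quartiles: {q1}, {q2}, {q3}")

# Text histogram: count the values that fall into each bin of width 20.
bin_width = 20
histogram = Counter((v // bin_width) * bin_width for v in values)
for start in sorted(histogram):
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * histogram[start]}")
```

The gap between the mean and the median hints at skew or outliers, which the histogram makes visible.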

Reference: [1] Gregory S. Nelson. The Analytics Lifecycle Toolkit: A Practical Guide for an Effective Analytics Capability. John Wiley & Sons, 2018.

Factors/Variables to Consider For Experimental Design for Data Analytics Projects

Design of experiments fishbone

Ref: [1] Gregory S. Nelson. The Analytics Lifecycle Toolkit: A Practical Guide for an Effective Analytics Capability. John Wiley & Sons, 2018. Chapter 6: Problem Framing.
