Jackknife Logistic And Linear Regression For Clustering And Predictions

Originally posted on DataScienceCentral by Dr. Granville.

This article discusses a far more general version of the technique described in our article The best kept secret about regression. Here we adapt our methodology so that it applies to data sets with a more complex structure, in particular with highly correlated independent variables.

Our goal is to produce a regression tool that can be used as a black box, that is robust and parameter-free, and that non-statisticians can use and interpret easily. It is part of a bigger project: automating many fundamental data science tasks to make them easy, scalable and cheap for data consumers, not just for data experts. Our previous attempts at automation include

Readers are invited to further formalize the technology outlined here, and challenge my proposed methodology.

1. Introduction

As in our previous paper, without loss of generality, we focus on linear regression with centered variables (with zero mean) and no intercept. Generalization to logistic regression or to non-centered variables is straightforward.

Thus we are still dealing with the following regression framework:

Y = a_1 * X_1 + … + a_n * X_n + noise

Remember that the solution proposed in our previous paper was

b_i = cov(Y, X_i) / var(X_i), i = 1, …, n
a_i = M * b_i, i = 1, …, n
M (a real number, not a matrix) is chosen to minimize var(Z), with Z = Y - (a_1 * X_1 + … + a_n * X_n)
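
Because the a_i's are all proportional to the single scalar M, the minimization of var(Z) has a closed form. The short derivation below is a sketch that follows from the definitions above; the symbol T (the unscaled combination of the features) is introduced here for convenience and does not appear in the original article.

```latex
% T is an auxiliary symbol (an assumption of this sketch, not from the article).
T = b_1 X_1 + \dots + b_n X_n, \qquad Z = Y - M\,T

\operatorname{var}(Z) = \operatorname{var}(Y) - 2M\operatorname{cov}(Y,T) + M^2\operatorname{var}(T)

\frac{d}{dM}\operatorname{var}(Z) = 0
\;\Longrightarrow\;
M = \frac{\operatorname{cov}(Y,T)}{\operatorname{var}(T)}
```

In other words, M is the slope of a one-variable regression of Y on T.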

When cov(X_i, X_j) = 0 for i < j, my regression and the classical regression produce identical regression coefficients, and M = 1.
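
The claim above can be checked numerically. The following is a minimal sketch, not the author's implementation: the function name `approximate_regression`, the simulated data, and the coefficient values are all hypothetical. It computes M as cov(Y, T) / var(T), where T = b_1 * X_1 + … + b_n * X_n, which is the value that minimizes var(Z) because var(Z) is quadratic in M. The features are made exactly uncorrelated with a QR decomposition.

```python
import numpy as np

def approximate_regression(X, y):
    """Return (a, b, M) for the approximate regression described above.

    X: (m, n) array of features, y: (m,) response. Both are centered here.
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    # b_i = cov(Y, X_i) / var(X_i); the 1/m factors cancel in the ratio.
    b = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
    # T = b_1 * X_1 + ... + b_n * X_n; var(Y - M * T) is quadratic in M
    # and is minimized at M = cov(Y, T) / var(T).
    t = X @ b
    M = (y @ t) / (t @ t)
    return M * b, b, M

rng = np.random.default_rng(0)

# Hypothetical data: orthogonalize centered random columns so that
# cov(X_i, X_j) = 0 exactly for i != j, then give each column its own scale.
raw = rng.normal(size=(500, 4))
raw -= raw.mean(axis=0)
Q, _ = np.linalg.qr(raw)  # columns of Q are orthonormal and zero-mean
X = Q * np.array([1.0, 2.0, 0.5, 3.0])
y = X @ np.array([1.5, -2.0, 0.7, 0.3]) + rng.normal(scale=0.1, size=500)

a, b, M = approximate_regression(X, y)

# Classical least squares (no intercept, centered response) for comparison.
a_ols, *_ = np.linalg.lstsq(X, y - y.mean(), rcond=None)

print("M =", M)
print("approximate coefficients:", a)
print("classical coefficients:  ", a_ols)
```

With uncorrelated features, the two coefficient vectors should agree up to floating-point error, and M should equal 1 up to the same tolerance.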

Terminology: Z is the noise, Y is the (observed) response, the a_i’s are the regression coefficients, and S = a_1 * X_1 + … + a_n * X_n is the estimated or predicted response. The X_i’s are the independent variables or features.

2. Re-visiting our previous data set

I have added more cross-correlations to the previous simulated dataset consisting of 4 independent variables, still denoted as x, y, z, u in the new, updated attached spreadsheet. Now corr(x, y) = 0.99.
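
The spreadsheet itself is not reproduced here, so the sketch below uses freshly simulated data as a stand-in. Everything in it is an assumption made for illustration: the sample size, the way x, y, z, u are generated, the true coefficients, the noise level, and the variable name `resp` for the response. Only the target corr(x, y) = 0.99 comes from the text. It shows how the scalar M behaves once two features are nearly collinear, and compares the residual variance with that of classical least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 2000  # hypothetical sample size

# Hypothetical features: y is x plus a small independent perturbation,
# scaled so that the population correlation corr(x, y) is 0.99.
x = rng.normal(size=m)
y = x + np.sqrt(1 / 0.99**2 - 1) * rng.normal(size=m)
z = rng.normal(size=m)
u = rng.normal(size=m)
corr_xy = np.corrcoef(x, y)[0, 1]

X = np.column_stack([x, y, z, u])
X -= X.mean(axis=0)

# Hypothetical response with made-up coefficients and noise level.
resp = X @ np.array([1.0, 1.0, 0.5, -0.5]) + rng.normal(scale=0.5, size=m)
resp -= resp.mean()

# Approximate regression: b_i = cov(Y, X_i) / var(X_i), then a_i = M * b_i
# with M = cov(Y, T) / var(T), where T = b_1 * X_1 + ... + b_n * X_n.
b = (X * resp[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
t = X @ b
M = (resp @ t) / (t @ t)
a = M * b

# Classical least squares on the same centered data.
a_ols, *_ = np.linalg.lstsq(X, resp, rcond=None)

var_approx = np.var(resp - X @ a)
var_ols = np.var(resp - X @ a_ols)

print("corr(x, y) =", corr_xy)
print("M =", M)
print("residual variance (approximate, classical):", var_approx, var_ols)
```

On this simulated data, each b_i for x and y absorbs the effect of its near-duplicate, so M should come out well below 1 to compensate. The classical residual variance can never exceed the approximate one, since least squares minimizes it over all coefficient vectors.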

Read full article.

Source: http://www.datasciencecentral.com/xn/detail/6448529:Topic:172366
