Connect with us

Big Data

Complete Guide to Feature Engineering: Zero to Hero

Published

on

This article was published as a part of the Data Science Blogathon

Introduction

You must be aware of the fact that Feature Engineering is the heart of any Machine Learning model. How successful a model is or how accurately it predicts that depends on the application of various feature engineering techniques. In this article, we are going to dive deep to study feature engineering. The article will be explaining all the techniques and will also include code wherever necessary. So, let’s start from ground zero, what is feature engineering?

Feature Engineering

Image 1

What is feature engineering?

All machine learning algorithms use some input data to generate outputs. Input data contains many features which may not be in proper form to be given to the model directly. It needs some kind of processing and here feature engineering helps. Feature engineering fulfils mainly two goals:

  • It prepares the input dataset in the form which is required for a specific model or machine learning algorithm.
  • Feature engineering helps in improving the performance of machine learning models magically.

According to some surveys, data scientists spend their time on data preparation. See this figure below:

data preparation

Image 2

This clearly shows the importance of feature engineering in machine learning. So, this article will help you in understanding this whole concept.

Prerequisites:

1. Install Python and get its basic hands-on knowledge.

2. Pandas library in python. Command to install: pip install pandas

3. Numpy library in python. Command to install: pip install numpy

Then import these two libraries like this:

import pandas as pd
import numpy as np

Now, let’s begin!

I am listing here the main feature engineering techniques to process the data. We will then look at each technique one by one in detail with its applications.

The main feature engineering techniques that will be discussed are:

1. Missing data imputation

2. Categorical encoding

3. Variable transformation

4. Outlier engineering

5.  Date and time engineering

Missing Data Imputation for Feature Engineering

In your input data, there may be some features or columns which will have missing data, missing values. It occurs if there is no data stored for a certain observation in a variable. Missing data is very common and it is an unavoidable problem especially in real-world data sets. If this data containing a missing value is used then you can see the significance in the results. So, imputation is the act of replacing missing data with statistical estimates of the missing values. It helps you to complete your training data which can then be provided to any model or an algorithm for prediction.

There are multiple techniques for missing data imputation. These are as follows:-

  1. Complete case analysis
  2. Mean / Median / Mode imputation
  3. Missing Value Indicator

Complete Case Analysis for Missing Data Imputation

Complete case analysis is basically analyzing those observations in the dataset that contains values in all the variables. Or you can say, remove all the observations that contain missing values. But this method can only be used when there are only a few observations which has a missing dataset otherwise it will reduce the dataset size and then it will be of not much use.

So, it can be used when missing data is small but in real-life datasets, the amount of missing data is always big. So, practically, complete case analysis is never an option to use, although you can use it if the missing data size is small.

Let’s see the use of this on the titanic dataset.

Download the titanic dataset from here.

import numpy as np
import pandas as pd titanic = pd.read_csv('titanic/train.csv')
# make a copy of titanic dataset
data1 = titanic.copy()
data1.isnull().mean()

Output:

PassengerId 0.000000
Survived 0.000000
Pclass 0.000000
Name 0.000000
Sex 0.000000
Age 0.198653
SibSp 0.000000
Parch 0.000000
Ticket 0.000000
Fare 0.000000
Cabin  0.771044
Embarked 0.002245
dtype: float64

If we remove all the missing observations, we would end up with a very small dataset, given that the Cabin is missing for 77% of the observations.

# check how many observations we would drop
print('total passengers with values in all variables: ', data1.dropna().shape[0])
print('total passengers in the Titanic: ', data1.shape[0])
print('percentage of data without missing values: ', data1.dropna().shape[0]/ np.float(data1.shape[0]))
total passengers with values in all variables: 183
total passengers in the Titanic: 891
percentage of data without missing values: 0.2053872053872054

So, we have complete information for only 20% of our observations in the Titanic dataset. Thus, Complete Case Analysis method would not be an option for this dataset.

Mean/ Median/ Mode for Missing Data Imputation

Missing values can also be replaced with the mean, median, or mode of the variable(feature). It is widely used in data competitions and in almost every situation. It is suitable to use this technique where data is missing at random places and in small proportions.

# impute missing values in age in train and test set
median = X_train.Age.median()
for df in [X_train, X_test]: df['Age'].fillna(median, inplace=True)
X_train['Age'].isnull().sum()

Output:
0

0 represents that now the Age feature has no null values.

One important point to consider while doing imputation is that it should be done over the training set first and then to the test set. All missing values in the train set and test set should be filled with the value which is extracted from the train set only. This helps in avoiding overfitting.

Missing Value Indicator For Missing Value Indication

This technique involves adding a binary variable to indicate whether the value is missing for a certain observation. This variable takes the value 1 if the observation is missing, or 0 otherwise. But we still need to replace the missing values in the original variable, which we tend to do with mean or median imputation. By using these 2 techniques together, if the missing value has predictive power, it will be captured by the missing indicator, and if it doesn’t it will be masked by the mean / median imputation.

X_train['Age_NA'] = np.where(X_train['Age'].isnull(), 1, 0)
X_test['Age_NA'] = np.where(X_test['Age'].isnull(), 1, 0)
X_train.head()

Output:

data | Feature Engineering
X_train.Age.mean(), X_train.Age.median()
(29.915338645418327, 29.0)

Now, since mean and median are the same, let’s replace them with the median.

X_train['Age'].fillna(X_train.Age.median(), inplace=True)
X_test['Age'].fillna(X_train.Age.median(), inplace=True) X_train.head(10)

So, the Age_NA variable was created to capture the missingness.

Categorical encoding in Feature Engineering

Categorical data is defined as that data that takes only a number of values. Let’s understand this with an example. Parameter Gender in a dataset will have categorical values like Male, Female. If a survey is done to know which car people own then the result will be categorical (because the answers would be in categories like Honda, Toyota, Hyundai, Maruti, None, etc.). So, the point to notice here is that data falls in a fixed set of categories.

If you directly give this dataset with categorical variables to a model, you will get an error. Hence, they are required to be encoded. There are multiple techniques to do so:

  1. One-Hot encoding (OHE)
  2. Ordinal encoding
  3. Count and Frequency encoding
  4. Target encoding / Mean encoding

Let’s understand them in detail.

 One-Hot Encoding

It is a commonly used technique for encoding categorical variables. It basically creates binary variables for each category present in the categorical variable. These binary variables will have 0 if it is absent in the category or 1 if it is present. Each new variable is called a dummy variable or binary variable.

Example: If the categorical variable is Gender with labels female and male, two boolean variables can be generated called male and female. Male will take 1 if the person is male or 0 otherwise. Similarly for a female variable. See this code below for the titanic dataset.

pd.get_dummies(data['Sex']).head()
pd.concat([data['Sex'], pd.get_dummies(data['Sex'])], axis=1).head()

Output:

Sex female
0 male 0 1
1 female 1 0
2 female 1 0
3 female 1 0
4 male 0 1

But you can see that we only need 1 dummy variable to represent Sex categorical variable. So, you can take it as a general formula where if there are n categories, you only need an n-1 dummy variable. So you can easily drop anyone dummy variable. To get n-1 dummy variables simply use this:

pd.get_dummies(data['Sex'], drop_first=True).head()

Ordinal Encoding

What does ordinal mean? It simply means a categorical variable whose categories can be ordered and that too meaningfully.

For example, Student’s grades in an exam are ordinal. (A,B,C,D, Fail). In this case, a simple way to encode is to replace the labels with some ordinal number.  Look at sample code:

from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
le = preprocessing.LabelEncoder()
le.fit(["paris", "paris", "tokyo", "amsterdam"])
le.transform(["tokyo", "tokyo", "paris"])
>>>array([2, 2, 1]...)
list(le.inverse_transform([2, 2, 1]))
>>>['tokyo', 'tokyo', 'paris']

Count and Frequency Encoding

In this encoding technique, categories are replaced by the count of the observations that show that category in the dataset. Replacement can also be done with the frequency of the percentage of observations in the dataset. Suppose, if 30 of 100 genders are male we can replace male with 30 or by 0.3.

This approach is popularly used in data science competitions, so basically it represents how many times each label appears in the dataset.

Target / Mean Encoding

In target encoding, also called mean encoding, we replace each category of a variable with the mean value of the target for the observations that show a certain category. For example, there is a categorical variable “city”, and we want to predict if the customer will buy a TV provided we send a letter. If 30 percent of the people in the city “London” buy the TV, we would replace London with 0.3. So it helps in capturing some information regarding the target at the time of encoding the category and it also does not expands the feature space. Hence, it also can be considered as an option for encoding. But it may cause over-fitting to the model, so be careful. Look at this code for implementation:

import pandas as pd
# creating dataset
data={'CarName':['C1','C2','C3','C1','C4','C3','C2','C1','C2','C4','C1'], 'Target':[1,0,1,1,1,0,0,1,1,1,0]}
df = pd.DataFrame(data)
print(df)

Output:

CarName Target
0 C1 1
1 C2 0
2 C3 1
3 C1 1
4 C4 1
5 C3 0
6 C2 0
7 C1 1
8 C2 1
9 C4 1
10 C1 0

df.groupby([‘CarName’])[‘Target’].count()

Output:

CarName
C1 4
C2 3
C3 2
C4 2
Name: Target, dtype: int64

df.groupby(['CarName'])['Target'].mean()

Output:

CarName
s1 0.750000
s2 0.333333
s3 0.500000
s4 1.000000
Name: Target, dtype: float64

Mean_encoded = df.groupby(['CarName'])['Target'].mean().to_dict()

df['CarName'] = df['CarName'].map(Mean_encoded)

print(df)

CarName Target
0 0.750000 1
1 0.333333 0
2 0.500000 1
3 0.750000 1
4 1.000000 1
5 0.500000 0
6 0.333333 0
7 0.750000 1
8 0.333333 1
9 1.000000 1
10 0.750000 0

Variable Transformation

Machine learning algorithms like linear and logistic regression assume that the variables are normally distributed. If a variable is not normally distributed, sometimes it is possible to find a mathematical transformation so that the transformed variable is Gaussian. Gaussian distributed variables many times boost the machine learning algorithm performance.

Commonly used mathematical transformations are:

  1. Logarithm transformation – log(x)
  2. Square root transformation – sqrt(x)
  3. Reciprocal transformation – 1 / x
  4. Exponential transformation – exp(x)

Let’s check these out on the titanic dataset.

Loading numerical features of the titanic dataset.

cols_reqiuired = ['Age', 'Fare', 'Survived'])
data[cols_reqiuired].head()
Output:
Survived	Age	Fare
0	0	22.0	7.2500
1	1	38.0	71.2833
2	1	26.0	7.9250
3	1	35.0	53.1000
4	0	35.0	8.0500

First, we need to fill in missing data. We will start with filling missing data with a random sample.

 def impute(data, variable):
 df = data.copy() df[variable+'_random'] = df[variable] # extract the random sample to fill the na random_sample = df[variable].dropna().sample(df[variable].isnull().sum(), random_state=0) random_sample.index = df[df[variable].isnull()].index df.loc[df[variable].isnull(), variable+'_random'] = random_sample return df[variable+'_random']
# fill na
data['Age'] = impute_na(data, 'Age')

Now, to visualize the distribution of the age variable we will plot histogram and Q-Q-plot.

def plots(df, variable): plt.figure(figsize=(15,6)) plt.subplot(1, 2, 1) df[variable].hist() plt.subplot(1, 2, 2) stats.probplot(df[variable], dist="norm", plot=pylab) plt.show()
plots(data, 'Age')

Output:

transformation | Feature Engineering

The age variable is almost normally distributed, except for some observations on the lower tail. Also, you can notice slight skew in the histogram to the left.

Now, let’s apply the above transformation and compare the transformed Age variable.

Logarithmic transformation

data['Age_log'] = np.log(data.Age)
plots(data, 'Age_log')

Output:

Logrithmic transformation

You can observe here that logarithmic transformation did not produce a Gaussian-like distribution for Age column.

Square root transformation – sqrt(x)

data['Age_sqr'] =data.Age**(1/2)
plots(data, 'Age_sqr')

Output:

sqrt transformation

This is a bit better, but the still variable is not Gaussian.

Reciprocal transformation – 1 / x

data['Age_reciprocal'] = 1 / data.Age
plots(data, 'Age_reciprocal')

Output:

reciprocal transformation

This transformation is also not useful to transform Age into a  normally distributed variable.

Exponential transformation – exp(x)

data['Age_exp'] = data.Age**(1/1.2) plots(data, 'Age_exp')

Output:

exponential | Feature Engineering

This one is the best of all the transformations above, at the time of generating a variable that is normally distributed.

Outlier engineering

Outliers are defined as those values that are unusually high or low with respect to the rest of the observations of the variable. Some of the techniques to handle outliers are:

1. Outlier removal

2. Treating outliers as missing values

3. Outlier capping

How to identify outliers?

For that, the basic form of detection is an extreme value analysis of data. If the distribution of the variable is Gaussian then outliers will lie outside the mean plus or minus three times the standard deviation of the variable. But if the variable is not normally distributed, then quantiles can be used. Calculate the quantiles and then inter quartile range:

Inter quantile is 75th quantile-25quantile.

upper boundary: 75th quantile + (IQR * 1.5)

lower boundary: 25th quantile – (IQR * 1.5)

So, the outlier will sit outside these boundaries.

 Outlier removal

In this technique, simply remove outlier observations from the dataset. In datasets if outliers are not abundant, then dropping the outliers will not affect the data much. But if multiple variables have outliers then we may end up removing a big chunk of data from our dataset. So, this point has to be kept in mind whenever dropping the outliers.

 Treating outliers as missing values

You can also treat outliers as missing values. But then these missing values also have to be filled. So to fill missing values you can use any of the methods as discussed above in this article.

 Outlier capping

This procedure involves capping the maximum and minimum values at a predefined value. This value can be derived from the variable distribution. If a variable is normally distributed we can cap the maximum and minimum values at the mean plus or minus three times the standard deviation. But if the variable is skewed, we can use the inter-quantile range proximity rule or cap at the bottom percentiles.

Date and Time Feature Engineering

Date variables are considered a special type of categorical variable and if they are processed well they can enrich the dataset to a great extent. From the date we can extract various important information like: Month, Semester, Quarter, Day, Day of the week, Is it a weekend or not, hours, minutes, and many more. Let’s use some dataset and do some coding around it.

For this, we will use the Lending club dataset. Download it from here.

We will use only two columns from the dataset: issue_d and last_pymnt_d.

use_cols = ['issue_d', 'last_pymnt_d']
data = pd.read_csv('/kaggle/input/lending-club-loan-data/loan.csv', usecols=use_cols, nrows=10000)
data.head()
Output: issue_d	last_pymnt_d
0	Dec-2018	Feb-2019
1	Dec-2018	Feb-2019
2	Dec-2018	Feb-2019
3	Dec-2018	Feb-2019
4	Dec-2018	Feb-2019

Now, parse dates into DateTime format as they are coded in strings currently.

data['issue_dt'] = pd.to_datetime(data.issue_d)
data['last_pymnt_dt'] = pd.to_datetime(data.last_pymnt_d)
data[['issue_d','issue_dt','last_pymnt_d', 'last_pymnt_dt']].head()
Output: issue_d	issue_dt	last_pymnt_d	last_pymnt_dt
0	Dec-2018	2018-12-01	Feb-2019	2019-02-01
1	Dec-2018	2018-12-01	Feb-2019	2019-02-01
2	Dec-2018	2018-12-01	Feb-2019	2019-02-01
3	Dec-2018	2018-12-01	Feb-2019	2019-02-01
4	Dec-2018	2018-12-01	Feb-2019	2019-02-01

Now, extracting month from date.

data['issue_dt_month'] = data['issue_dt'].dt.month
data[['issue_dt', 'issue_dt_month']].head()
Output: issue_dt issue_dt_month
0	2018-12-01	12
1	2018-12-01	12
2	2018-12-01	12
3	2018-12-01	12
4	2018-12-01	12

Extracting quarter from date.

data['issue_dt_quarter'] = data['issue_dt'].dt.quarter
data[['issue_dt', 'issue_dt_quarter']].head()
Output: issue_dt	issue_dt_quarter
0	2018-12-01	4
1	2018-12-01	4
2	2018-12-01	4
3	2018-12-01	4
4	2018-12-01	4

Extracting the day of the week from the date.

data['issue_dt_dayofweek'] = data['issue_dt'].dt.dayofweek
data[['issue_dt', 'issue_dt_dayofweek']].head()
Output: issue_dt	issue_dt_dayofweek
0	2018-12-01	5
1	2018-12-01	5
2	2018-12-01	5
3	2018-12-01	5
4	2018-12-01	5

Extracting the weekday name from the date.

data['issue_dt_dayofweek'] = data['issue_dt'].dt.weekday_name
data[['issue_dt', 'issue_dt_dayofweek']].head()
Output: issue_dt	issue_dt_dayofweek
0	2018-12-01	Saturday
1	2018-12-01	Saturday
2	2018-12-01	Saturday
3	2018-12-01	Saturday
4	2018-12-01	Saturday

So, these are just a few examples with date and time, you can explore more. These kinds of things always help in improving the quality of data.

End Notes

In this article, I tried to explain feature engineering in detail with some code examples on the dataset. Feature engineering is very helpful in making your model more accurate and effective. As a next step, try out the techniques we discussed above on some other datasets for better understanding. I hope you find this article helpful. Let’s connect on Linkedin.

Thanks for reading if you reached here :).

Happy coding!

Image Sources-

  1. Image 1 https://onlinecoursebay.com/
  2. Image 2 https://www.forbes.com/

The media shown in this article on recursion in Python are not owned by Analytics Vidhya and are used at the Author’s discretion.

PlatoAi. Web3 Reimagined. Data Intelligence Amplified.
Click here to access.

Source: https://www.analyticsvidhya.com/blog/2021/09/complete-guide-to-feature-engineering-zero-to-hero/

Big Data

If you did not already know

Published

on

DataOps google
DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations. From a process and methodology perspective, DataOps applies Agile software development, DevOps software development practices and the statistical process control used in lean manufacturing, to data analytics. In DataOps, development of new analytics is streamlined using Agile software development, an iterative project management methodology that replaces the traditional Waterfall sequential methodology. Studies show that software development projects complete significantly faster and with far fewer defects when Agile Development is used. The Agile methodology is particularly effective in environments where requirements are quickly evolving – a situation well known to data analytics professionals. DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of analytics. This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics. Like lean manufacturing, DataOps utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data analytics team can be notified through an automated alert. DataOps is not tied to a particular technology, architecture, tool, language or framework. Tools that support DataOps promote collaboration, orchestration, agility, quality, security, access and ease of use. …

CoSegNet google
We introduce CoSegNet, a deep neural network architecture for co-segmentation of a set of 3D shapes represented as point clouds. CoSegNet takes as input a set of unsegmented shapes, proposes per-shape parts, and then jointly optimizes the part labelings across the set subjected to a novel group consistency loss expressed via matrix rank estimates. The proposals are refined in each iteration by an auxiliary network that acts as a weak regularizing prior, pre-trained to denoise noisy, unlabeled parts from a large collection of segmented 3D shapes, where the part compositions within the same object category can be highly inconsistent. The output is a consistent part labeling for the input set, with each shape segmented into up to K (a user-specified hyperparameter) parts. The overall pipeline is thus weakly supervised, producing consistent segmentations tailored to the test set, without consistent ground-truth segmentations. We show qualitative and quantitative results from CoSegNet and evaluate it via ablation studies and comparisons to state-of-the-art co-segmentation methods. …

Stochastic Computation Graph (SCG) google
Stochastic computation graphs are directed acyclic graphs that encode the dependency structure of computation to be performed. The graphical notation generalizes directed graphical models. …

Smooth Density Spatial Quantile Regression google
We derive the properties and demonstrate the desirability of a model-based method for estimating the spatially-varying effects of covariates on the quantile function. By modeling the quantile function as a combination of I-spline basis functions and Pareto tail distributions, we allow for flexible parametric modeling of the extremes while preserving non-parametric flexibility in the center of the distribution. We further establish that the model guarantees the desired degree of differentiability in the density function and enables the estimation of non-stationary covariance functions dependent on the predictors. We demonstrate through a simulation study that the proposed method produces more efficient estimates of the effects of predictors than other methods, particularly in distributions with heavy tails. To illustrate the utility of the model we apply it to measurements of benzene collected around an oil refinery to determine the effect of an emission source within the refinery on the distribution of the fence line measurements. …

PlatoAi. Web3 Reimagined. Data Intelligence Amplified.
Click here to access.

Source: https://analytixon.com/2021/10/24/if-you-did-not-already-know-1540/

Continue Reading

Big Data

If you did not already know

Published

on

DataOps google
DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. While DataOps began as a set of best practices, it has now matured to become a new and independent approach to data analytics. DataOps applies to the entire data lifecycle from data preparation to reporting, and recognizes the interconnected nature of the data analytics team and information technology operations. From a process and methodology perspective, DataOps applies Agile software development, DevOps software development practices and the statistical process control used in lean manufacturing, to data analytics. In DataOps, development of new analytics is streamlined using Agile software development, an iterative project management methodology that replaces the traditional Waterfall sequential methodology. Studies show that software development projects complete significantly faster and with far fewer defects when Agile Development is used. The Agile methodology is particularly effective in environments where requirements are quickly evolving – a situation well known to data analytics professionals. DevOps focuses on continuous delivery by leveraging on-demand IT resources and by automating test and deployment of analytics. This merging of software development and IT operations has improved velocity, quality, predictability and scale of software engineering and deployment. Borrowing methods from DevOps, DataOps seeks to bring these same improvements to data analytics. Like lean manufacturing, DataOps utilizes statistical process control (SPC) to monitor and control the data analytics pipeline. With SPC in place, the data flowing through an operational system is constantly monitored and verified to be working. If an anomaly occurs, the data analytics team can be notified through an automated alert. DataOps is not tied to a particular technology, architecture, tool, language or framework. Tools that support DataOps promote collaboration, orchestration, agility, quality, security, access and ease of use. …

CoSegNet google
We introduce CoSegNet, a deep neural network architecture for co-segmentation of a set of 3D shapes represented as point clouds. CoSegNet takes as input a set of unsegmented shapes, proposes per-shape parts, and then jointly optimizes the part labelings across the set subjected to a novel group consistency loss expressed via matrix rank estimates. The proposals are refined in each iteration by an auxiliary network that acts as a weak regularizing prior, pre-trained to denoise noisy, unlabeled parts from a large collection of segmented 3D shapes, where the part compositions within the same object category can be highly inconsistent. The output is a consistent part labeling for the input set, with each shape segmented into up to K (a user-specified hyperparameter) parts. The overall pipeline is thus weakly supervised, producing consistent segmentations tailored to the test set, without consistent ground-truth segmentations. We show qualitative and quantitative results from CoSegNet and evaluate it via ablation studies and comparisons to state-of-the-art co-segmentation methods. …

Stochastic Computation Graph (SCG) google
Stochastic computation graphs are directed acyclic graphs that encode the dependency structure of computation to be performed. The graphical notation generalizes directed graphical models. …

Smooth Density Spatial Quantile Regression google
We derive the properties and demonstrate the desirability of a model-based method for estimating the spatially-varying effects of covariates on the quantile function. By modeling the quantile function as a combination of I-spline basis functions and Pareto tail distributions, we allow for flexible parametric modeling of the extremes while preserving non-parametric flexibility in the center of the distribution. We further establish that the model guarantees the desired degree of differentiability in the density function and enables the estimation of non-stationary covariance functions dependent on the predictors. We demonstrate through a simulation study that the proposed method produces more efficient estimates of the effects of predictors than other methods, particularly in distributions with heavy tails. To illustrate the utility of the model we apply it to measurements of benzene collected around an oil refinery to determine the effect of an emission source within the refinery on the distribution of the fence line measurements. …

PlatoAi. Web3 Reimagined. Data Intelligence Amplified.
Click here to access.

Source: https://analytixon.com/2021/10/24/if-you-did-not-already-know-1540/

Continue Reading

Big Data

If you did not already know

Published

on

Correntropy google
Correntropy is a nonlinear similarity measure between two random variables.
Learning with the Maximum Correntropy Criterion Induced Losses for Regression


Patient Event Graph (PatientEG) google
Medical activities, such as diagnoses, medicine treatments, and laboratory tests, as well as temporal relations between these activities are the basic concepts in clinical research. However, existing relational data model on electronic medical records (EMRs) lacks explicit and accurate semantic definitions of these concepts. It leads to the inconvenience of query construction and the inefficiency of query execution where multi-table join queries are frequently required. In this paper, we propose a patient event graph (PatientEG) model to capture the characteristics of EMRs. We respectively define five types of medical entities, five types of medical events and five types of temporal relations. Based on the proposed model, we also construct a PatientEG dataset with 191,294 events, 3,429 distinct entities, and 545,993 temporal relations using EMRs from Shanghai Shuguang hospital. To help to normalize entity values which contain synonyms, hyponymies, and abbreviations, we link them with the Chinese biomedical knowledge graph. With the help of PatientEG dataset, we are able to conveniently perform complex queries for clinical research such as auxiliary diagnosis and therapeutic effectiveness analysis. In addition, we provide a SPARQL endpoint to access PatientEG dataset and the dataset is also publicly available online. Also, we list several illustrative SPARQL queries on our website. …

LogitBoost Autoregressive Networks google
Multivariate binary distributions can be decomposed into products of univariate conditional distributions. Recently popular approaches have modeled these conditionals through neural networks with sophisticated weight-sharing structures. It is shown that state-of-the-art performance on several standard benchmark datasets can actually be achieved by training separate probability estimators for each dimension. In that case, model training can be trivially parallelized over data dimensions. On the other hand, complexity control has to be performed for each learned conditional distribution. Three possible methods are considered and experimentally compared. The estimator that is employed for each conditional is LogitBoost. Similarities and differences between the proposed approach and autoregressive models based on neural networks are discussed in detail. …

Discretification google
Discretification’ is the mechanism of making continuous data discrete. If you really grasp the concept, you may be thinking ‘Wait a minute, the type of data we are collecting is discrete in and of itself! Data can EITHER be discrete OR continuous, it can’t be both!’ You would be correct. But what if we manually selected values along that continuous measurement, and declared them to be in a specific category? For instance, if we declare 72.0 degrees and greater to be ‘Hot’, 35.0-71.9 degrees to be ‘Moderate’, and anything lower than 35.0 degrees to be ‘Cold’, we have ‘discretified’ temperature! Our readings that were once continuous now fit into distinct categories. So, where we do we draw the boundaries for these categories? What makes 35.0 degrees ‘Cold’ and 35.1 degrees ‘Moderate’? At is at this juncture that the TRUE decision is being made. The beauty of approaching the challenge in this manner is that it is data-centric, not concept-centric. Let’s walk through our marketing example first without using discretification, then with it. …

PlatoAi. Web3 Reimagined. Data Intelligence Amplified.
Click here to access.

Source: https://analytixon.com/2021/10/23/if-you-did-not-already-know-1539/

Continue Reading

Big Data

Capturing the signal of weak electricigens: a worthy endeavour

Published

on

Recently several non-traditional electroactive microorganisms have been discovered. These can be considered weak electricigens; microorganisms that typically rely on soluble electron acceptors and donors in their lifecycle but are also capable of extracellular electron transfer (EET), resulting in either a low, unreliable, or otherwise unexpected current. These unanticipated electroactive microorganisms represent a new chapter in electromicrobiology and have important medical, environmental, and biotechnological relevance.
PlatoAi. Web3 Reimagined. Data Intelligence Amplified.
Click here to access.

Source: https://www.cell.com/trends/biotechnology/fulltext/S0167-7799(21)00229-8?rss=yes

Continue Reading
Blockchain3 days ago

People’s payment attitude: Why cash Remains the most Common Means of Payment & How Technology and Crypto have more Advantages as a Means of payment

Automotive4 days ago

7 Secrets That Automakers Wish You Don’t Know

Startups3 days ago

The 12 TikTok facts you should know

Energy2 days ago

U Power ties up with Bosch to collaborate on Super Board technology

Supply Chain3 days ago

LPG tubes – what to think about

Gaming4 days ago

New Steam Games You Might Have Missed In August 2021

Blockchain4 days ago

What Is the Best Crypto IRA for Me? Use These 6 Pieces of Criteria to Find Out More

Gaming4 days ago

How do casinos without an account work?

IOT4 days ago

The Benefits of Using IoT SIM Card Technology

Blockchain4 days ago

The Most Profitable Cryptocurrencies on the Market

Gaming4 days ago

Norway will crack down on the unlicensed iGaming market with a new gaming law

Blockchain4 days ago

What does swapping crypto mean?

Energy2 days ago

Piperylene Market Size to Grow by USD 428.50 mn from 2020 to 2024 | Growing Demand for Piperylene-based Adhesives to Boost Growth | Technavio

Energy2 days ago

Notice of Data Security Breach Incident

AR/VR5 days ago

Preview: Little Cities – Delightful City Building on Quest

Blockchain2 days ago

Blockchain & Infrastructure Post-Event Release

Blockchain2 days ago

Week Ahead – Between a rock and a hard place

Cyber Security2 days ago

Ransomware Took a New Twist with US Leading a Law Enforcement Effort to Hack Back

Code2 days ago

How does XML to JSON converter work?

Esports2 days ago

How to get Shiny Zacian and Zamazenta in Pokémon Sword and Shield

Trending