Pandas Profiling Report

Dataset statistics

Number of variables	5
Number of observations	77096
Missing cells	49053
Missing cells (%)	12.7%
Duplicate rows	8966
Duplicate rows (%)	11.6%
Total size in memory	2.9 MiB
Average record size in memory	40.0 B
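
The two percentage figures follow directly from the counts in the table. A minimal sketch of that arithmetic, using only the numbers reported above (the variable names are illustrative, not part of the report):

```python
# Counts copied from the "Dataset statistics" table.
n_variables = 5
n_observations = 77096
missing_cells = 49053
duplicate_rows = 8966

# Every observation has one cell per variable.
total_cells = n_variables * n_observations

missing_pct = 100 * missing_cells / total_cells
duplicate_pct = 100 * duplicate_rows / n_observations

print(f"Missing cells (%): {missing_pct:.1f}")
print(f"Duplicate rows (%): {duplicate_pct:.1f}")
```

The missing-cell count is also the sum of the per-variable missing counts reported further down (29909 + 19130 + 7 + 7).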

Variable types

Numeric	2
Categorical	2
Boolean	1

Alerts

Dataset has 8966 (11.6%) duplicate rows	Duplicates
`company_rating` has a high cardinality: 70 distinct values	High cardinality
`company_location` has a high cardinality: 181 distinct values	High cardinality
`company_rating` has 29909 (38.8%) missing values	Missing
`company_location` has 19130 (24.8%) missing values	Missing

Reproduction

Analysis started	2022-11-24 10:51:06.732214
Analysis finished	2022-11-24 10:51:12.799680
Duration	6.07 seconds
Software version	pandas-profiling v3.5.0
Download configuration	config.json

id
Real number (ℝ)

Distinct	50098
Distinct (%)	65.0%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Mean	25155.386

Minimum	1
Maximum	50098
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	602.4 KiB

Quantile statistics

Minimum	1
5-th percentile	2704.75
Q1	12935.75
median	25253.5
Q3	37410.25
95-th percentile	47520.25
Maximum	50098
Range	50097
Interquartile range (IQR)	24474.5

Descriptive statistics

Standard deviation	14300.991
Coefficient of variation (CV)	0.5685061
Kurtosis	-1.1684667
Mean	25155.386
Median Absolute Deviation (MAD)	12241.5
Skewness	-0.01137149
Sum	1.9393797 × 10⁹
Variance	2.0451833 × 10⁸
Monotonicity	Not monotonic
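
Several of the figures above are derived from others in the same block. As a rough cross-check, the sketch below recomputes the range, IQR, coefficient of variation and variance for `id` from the reported minimum, maximum, quartiles, mean and standard deviation (variable names are illustrative):

```python
# Values copied from the statistics reported for `id`.
minimum, maximum = 1, 50098
q1, q3 = 12935.75, 37410.25
mean, std = 25155.386, 14300.991

value_range = maximum - minimum   # Range
iqr = q3 - q1                     # Interquartile range (IQR)
cv = std / mean                   # Coefficient of variation (CV)
variance = std ** 2               # Variance

print(value_range, iqr, cv, variance)
```

Small differences in the last digits are expected, because the report rounds the mean and standard deviation before displaying them.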

Histogram with fixed size bins (bins=50)

Common values

Value	Count	Frequency (%)
29647	1086	1.4%
45111	297	0.4%
28828	184	0.2%
32203	176	0.2%
20334	167	0.2%
18077	125	0.2%
4745	114	0.1%
10711	108	0.1%
22721	106	0.1%
19019	102	0.1%
Other values (50088)	74631	96.8%

Minimum 10 values

Value	Count	Frequency (%)
1	2	< 0.1%
2	1	< 0.1%
3	1	< 0.1%
4	1	< 0.1%
5	1	< 0.1%
6	2	< 0.1%
7	1	< 0.1%
8	1	< 0.1%
9	2	< 0.1%
10	1	< 0.1%

Maximum 10 values

Value	Count	Frequency (%)
50098	2	< 0.1%
50097	1	< 0.1%
50096	1	< 0.1%
50095	1	< 0.1%
50094	1	< 0.1%
50093	1	< 0.1%
50092	1	< 0.1%
50091	1	< 0.1%
50090	1	< 0.1%
50089	2	< 0.1%

company_rating
Categorical

HIGH CARDINALITY
MISSING

Distinct	70
Distinct (%)	0.1%
Missing	29909
Missing (%)	38.8%
Memory size	602.4 KiB

100%	34204
90%	1698
99%	956
0%	920
80%	862
Other values (65)	8547

Length

Max length	4
Median length	4
Mean length	3.7053638
Min length	2

Characters and Unicode

Total characters	174845
Distinct characters	11
Distinct categories	2
Distinct scripts	1
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	5
Unique (%)	< 0.1%

Sample

1st row	100%
2nd row	67%
3rd row	67%
4th row	91%
5th row	100%

Common Values

Value	Count	Frequency (%)
100%	34204	44.4%
90%	1698	2.2%
99%	956	1.2%
0%	920	1.2%
80%	862	1.1%
98%	726	0.9%
50%	702	0.9%
97%	675	0.9%
94%	538	0.7%
95%	458	0.6%
Other values (60)	5448	7.1%
(Missing)	29909	38.8%

Length

Histogram of lengths of the category

Words

Value	Count	Frequency (%)
100	34204	72.5%
90	1698	3.6%
99	956	2.0%
0	920	1.9%
80	862	1.8%
98	726	1.5%
50	702	1.5%
97	675	1.4%
94	538	1.1%
95	458	1.0%
Other values (60)	5448	11.5%

Most occurring characters

Value	Count	Frequency (%)
0	73359	42.0%
%	47187	27.0%
1	34583	19.8%
9	7824	4.5%
8	3425	2.0%
7	2471	1.4%
5	1898	1.1%
6	1601	0.9%
3	1177	0.7%
4	832	0.5%

Most occurring categories

Value	Count	Frequency (%)
Decimal Number	127658	73.0%
Other Punctuation	47187	27.0%

Most frequent character per category

Decimal Number

Value	Count	Frequency (%)
0	73359	57.5%
1	34583	27.1%
9	7824	6.1%
8	3425	2.7%
7	2471	1.9%
5	1898	1.5%
6	1601	1.3%
3	1177	0.9%
4	832	0.7%
2	488	0.4%

Other Punctuation

Value	Count	Frequency (%)
%	47187	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Common	174845	100.0%

Most frequent character per script

Common

Value	Count	Frequency (%)
0	73359	42.0%
%	47187	27.0%
1	34583	19.8%
9	7824	4.5%
8	3425	2.0%
7	2471	1.4%
5	1898	1.1%
6	1601	0.9%
3	1177	0.7%
4	832	0.5%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	174845	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
0	73359	42.0%
%	47187	27.0%
1	34583	19.8%
9	7824	4.5%
8	3425	2.0%
7	2471	1.4%
5	1898	1.1%
6	1601	0.9%
3	1177	0.7%
4	832	0.5%

company_location
Categorical

HIGH CARDINALITY
MISSING

Distinct	181
Distinct (%)	0.3%
Missing	19130
Missing (%)	24.8%
Memory size	602.4 KiB

Russian Federation	1979
Niue	1863
Uzbekistan	1711
Guinea	1703
Isle of Man	1633
Other values (176)	49077

Length

Max length	51
Median length	29
Mean length	9.5813408
Min length	4

Characters and Unicode

Total characters	555392
Distinct characters	59
Distinct categories	8
Distinct scripts	2
Distinct blocks	1

The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique	21
Unique (%)	< 0.1%

Sample

1st row	Niue
2nd row	Anguilla
3rd row	Russian Federation
4th row	Barbados
5th row	Sao Tome and Principe

Common Values

Value	Count	Frequency (%)
Russian Federation	1979	2.6%
Niue	1863	2.4%
Uzbekistan	1711	2.2%
Guinea	1703	2.2%
Isle of Man	1633	2.1%
Sao Tome and Principe	1624	2.1%
Nicaragua	1490	1.9%
Tonga	1442	1.9%
Peru	1300	1.7%
Marshall Islands	1293	1.7%
Other values (171)	41928	54.4%
(Missing)	19130	24.8%

Length

Histogram of lengths of the category

Words

Value	Count	Frequency (%)
islands	3185	3.8%
and	3018	3.6%
guinea	2001	2.4%
russian	1979	2.4%
federation	1979	2.4%
niue	1863	2.2%
uzbekistan	1711	2.1%
of	1658	2.0%
isle	1633	2.0%
man	1633	2.0%
Other values (229)	62471	75.1%

Most occurring characters

Value	Count	Frequency (%)
a	78762	14.2%
n	49414	8.9%
i	47479	8.5%
e	43540	7.8%
o	28191	5.1%
r	28103	5.1%
s	27841	5.0%
(space)	25165	4.5%
l	20596	3.7%
u	19622	3.5%
Other values (49)	186679	33.6%

Most occurring categories

Value	Count	Frequency (%)
Lowercase Letter	448825	80.8%
Uppercase Letter	77999	14.0%
Space Separator	25165	4.5%
Close Punctuation	1497	0.3%
Open Punctuation	1497	0.3%
Other Punctuation	360	0.1%
Decimal Number	48	< 0.1%
Dash Punctuation	1	< 0.1%

Most frequent character per category

Lowercase Letter

Value	Count	Frequency (%)
a	78762	17.5%
n	49414	11.0%
i	47479	10.6%
e	43540	9.7%
o	28191	6.3%
r	28103	6.3%
s	27841	6.2%
l	20596	4.6%
u	19622	4.4%
d	18661	4.2%
Other values (16)	86616	19.3%

Uppercase Letter

Value	Count	Frequency (%)
M	9229	11.8%
I	6334	8.1%
S	5391	6.9%
G	5277	6.8%
N	5275	6.8%
T	5266	6.8%
C	4981	6.4%
F	4942	6.3%
P	4924	6.3%
R	4723	6.1%
Other values (15)	21657	27.8%

Other Punctuation

Value	Count	Frequency (%)
&	357	99.2%
'	3	0.8%

Decimal Number

Value	Count	Frequency (%)
6	24	50.0%
0	24	50.0%

Space Separator

Value	Count	Frequency (%)
(space)	25165	100.0%

Close Punctuation

Value	Count	Frequency (%)
)	1497	100.0%

Open Punctuation

Value	Count	Frequency (%)
(	1497	100.0%

Dash Punctuation

Value	Count	Frequency (%)
-	1	100.0%

Most occurring scripts

Value	Count	Frequency (%)
Latin	526824	94.9%
Common	28568	5.1%

Most frequent character per script

Latin

Value	Count	Frequency (%)
a	78762	15.0%
n	49414	9.4%
i	47479	9.0%
e	43540	8.3%
o	28191	5.4%
r	28103	5.3%
s	27841	5.3%
l	20596	3.9%
u	19622	3.7%
d	18661	3.5%
Other values (41)	164615	31.2%

Common

Value	Count	Frequency (%)
(space)	25165	88.1%
)	1497	5.2%
(	1497	5.2%
&	357	1.2%
6	24	0.1%
0	24	0.1%
'	3	< 0.1%
-	1	< 0.1%

Most occurring blocks

Value	Count	Frequency (%)
ASCII	555392	100.0%

Most frequent character per block

ASCII

Value	Count	Frequency (%)
a	78762	14.2%
n	49414	8.9%
i	47479	8.5%
e	43540	7.8%
o	28191	5.1%
r	28103	5.1%
s	27841	5.0%
(space)	25165	4.5%
l	20596	3.7%
u	19622	3.5%
Other values (49)	186679	33.6%

total_fleet_count
Real number (ℝ)

Distinct	90
Distinct (%)	0.1%
Missing	7
Missing (%)	< 0.1%
Infinite	0
Infinite (%)	0.0%
Mean	30.473699

Minimum	1
Maximum	1484
Zeros	0
Zeros (%)	0.0%
Negative	0
Negative (%)	0.0%
Memory size	602.4 KiB

Quantile statistics

Minimum	1
5-th percentile	1
Q1	1
median	1
Q3	4
95-th percentile	60
Maximum	1484
Range	1483
Interquartile range (IQR)	3

Descriptive statistics

Standard deviation	165.54827
Coefficient of variation (CV)	5.4324966
Kurtosis	54.721438
Mean	30.473699
Median Absolute Deviation (MAD)	0
Skewness	7.4147427
Sum	2349187
Variance	27406.229
Monotonicity	Not monotonic

Histogram with fixed size bins (bins=50)

Common values

Value	Count	Frequency (%)
1	38555	50.0%
2	11827	15.3%
3	4915	6.4%
4	2852	3.7%
5	1648	2.1%
6	1250	1.6%
1305	1086	1.4%
7	984	1.3%
8	867	1.1%
9	791	1.0%
Other values (80)	12314	16.0%

Minimum 10 values

Value	Count	Frequency (%)
1	38555	50.0%
2	11827	15.3%
3	4915	6.4%
4	2852	3.7%
5	1648	2.1%
6	1250	1.6%
7	984	1.3%
8	867	1.1%
9	791	1.0%
10	642	0.8%

Maximum 10 values

Value	Count	Frequency (%)
1484	102	0.1%
1305	1086	1.4%
1105	10	< 0.1%
781	1	< 0.1%
420	297	0.4%
419	1	< 0.1%
198	184	0.2%
185	8	< 0.1%
176	176	0.2%
172	19	< 0.1%

iata_approved
Boolean

Distinct	2
Distinct (%)	< 0.1%
Missing	7
Missing (%)	< 0.1%
Memory size	150.7 KiB

False	47482
True	29607
(Missing)	7

Common Values

Value	Count	Frequency (%)
False	47482	61.6%
True	29607	38.4%
(Missing)	7	< 0.1%

Interactions: id, total_fleet_count

Correlations

Auto

The auto setting picks an interpretable association measure for each pair of columns according to their types:

Variable_type-Variable_type : Method, Range
Categorical-Categorical : Cramér's V, [0,1]
Numerical-Categorical : Cramér's V, [0,1] (using a discretized numerical column)
Numerical-Numerical : Spearman's ρ, [-1,1]

The number of bins used in the discretization for the Numerical-Categorical column pair can be changed using config.correlations["auto"].n_bins. The number of bins affects the granularity of the association you wish to measure.

This configuration uses the recommended metric for each pair of columns.
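
To illustrate what discretizing a numerical column involves, here is a minimal sketch that maps numbers to equal-width bins. Equal-width binning is an assumption made for illustration and may differ from the strategy the library actually applies. The helper `discretize` and the sample values are hypothetical.

```python
def discretize(values, n_bins):
    """Map each number to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column collapses into a single bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum sits on the right edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical fleet sizes spanning the range reported for total_fleet_count.
fleet = [1, 1, 2, 4, 15, 60, 420, 1484]
print(discretize(fleet, n_bins=4))
```

The resulting bin indices can then be treated as category labels, so a categorical measure such as Cramér's V applies. Fewer bins give a coarser picture of the association; more bins give a finer one but leave fewer observations per cell.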

Spearman's ρ

Spearman's rank correlation coefficient (ρ) measures monotonic correlation between two variables and is therefore better than Pearson's r at catching nonlinear monotonic correlations. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of the standard deviations of those rank variables.
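
The calculation just described can be sketched in plain Python. This is an illustrative implementation, not the code the report uses; the function names are hypothetical, and ties are given the average of the ranks they span.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Covariance of the ranks divided by the product of their standard deviations."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# A nonlinear but strictly increasing relationship.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(spearman_rho(x, y))
```

Because only the ordering matters, the cubic relationship in the example is treated the same as a straight line would be.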

Pearson's r

Pearson's correlation coefficient (r) measures linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
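
A short sketch of this formula, together with a check of the location and scale invariance mentioned above. The function name and the sample values are hypothetical; the scale factors in the check are positive, since a negative factor would flip the sign of r.

```python
from statistics import mean


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    mx, my = mean(x), mean(y)
    # The 1/n factors of the covariance and standard deviations cancel out.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x, y)
# Shift and rescale each variable separately; r should stay the same.
r_transformed = pearson_r([10 * a + 3 for a in x], [0.5 * b - 7 for b in y])
print(r, r_transformed)
```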

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
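
The pair-counting definition above corresponds to the variant usually called tau-a. A minimal sketch follows; the function name is hypothetical, and tied pairs are counted as neither concordant nor discordant. Statistical libraries commonly default to a tie-corrected variant, so results on data with many ties can differ from this sketch.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # Same sign of both differences: concordant; opposite signs: discordant.
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(kendall_tau_a(x, y))
```

In the example, five of the six pairs are concordant and one is discordant, so τ is (5 - 1) / 6.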

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
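
As an illustration, the sketch below computes a bias-corrected Cramér's V from two sequences of category labels, following the correction as it is commonly stated: the squared φ and the row and column counts are each adjusted by a term that shrinks with the sample size. The function name and the toy data are hypothetical, and this is not necessarily the exact code path the report uses.

```python
from collections import Counter
from math import sqrt


def cramers_v_corrected(xs, ys):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_tot, col_tot = Counter(xs), Counter(ys)
    r, k = len(row_tot), len(col_tot)

    # Pearson chi-squared statistic from observed and expected cell counts.
    chi2 = 0.0
    for a in row_tot:
        for b in col_tot:
            expected = row_tot[a] * col_tot[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction of phi^2 and of the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0


# Toy data: perfectly associated labels, then independent labels.
perfect = cramers_v_corrected(["a", "b"] * 50, ["t", "f"] * 50)
independent = cramers_v_corrected(["a", "a", "b", "b"] * 25, ["t", "f", "t", "f"] * 25)
print(perfect, independent)
```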

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

The count plot is a simple visualization of nullity by column.

The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
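
Nullity correlation can be sketched as the Pearson correlation between the 0/1 missingness indicators of two columns. The example below uses the `company_rating` and `company_location` values from the ten "Last rows" of the sample in this report, with `None` standing in for NaN. Ten rows are far too few to say anything about the full dataset, and the function name is hypothetical.

```python
from statistics import mean

# Values copied from the ten "Last rows" of the sample; None marks a missing value.
rating = [None, None, None, None, None, "100%", None, None, "80%", "98%"]
location = ["Marshall Islands", None, None, "Kiribati", "Philippines",
            "Tonga", "Chile", "Netherlands", None, "Mauritania"]


def nullity_correlation(a, b):
    """Pearson correlation between the missingness indicators of two columns."""
    ma = [1.0 if v is None else 0.0 for v in a]
    mb = [1.0 if v is None else 0.0 for v in b]
    mean_a, mean_b = mean(ma), mean(mb)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(ma, mb))
    sa = sum((p - mean_a) ** 2 for p in ma) ** 0.5
    sb = sum((q - mean_b) ** 2 for q in mb) ** 0.5
    return cov / (sa * sb)


print(nullity_correlation(rating, location))
```

A value near +1 means the two columns tend to be missing together, a value near -1 means one tends to be present when the other is missing, and a value near 0 means the missingness patterns are unrelated.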

Sample

First rows

	id	company_rating	company_location	total_fleet_count	iata_approved
0	35029	100%	Niue	4.0	f
1	30292	67%	Anguilla	6.0	f
2	19032	67%	Russian Federation	4.0	f
3	8238	91%	Barbados	15.0	t
4	30342	NaN	Sao Tome and Principe	2.0	t
5	32413	100%	Faroe Islands	1.0	f
6	35620	90%	Micronesia	3.0	f
7	23820	NaN	Rwanda	1.0	t
8	46528	100%	Uzbekistan	3.0	t
9	11875	100%	Micronesia	2.0	t

Last rows

	id	company_rating	company_location	total_fleet_count	iata_approved
77086	15249	NaN	Marshall Islands	1.0	f
77087	44431	NaN	NaN	1.0	f
77088	25724	NaN	NaN	1.0	f
77089	32743	NaN	Kiribati	2.0	f
77090	19010	NaN	Philippines	2.0	f
77091	6654	100%	Tonga	3.0	f
77092	8000	NaN	Chile	2.0	t
77093	14296	NaN	Netherlands	4.0	f
77094	27363	80%	NaN	3.0	t
77095	12542	98%	Mauritania	19.0	t

Duplicate rows

Most frequently occurring

	id	company_rating	company_location	total_fleet_count	iata_approved	# duplicates
5293	29647	100%	Peru	1305.0	f	1086
8081	45111	100%	Sao Tome and Principe	420.0	f	297
5148	28828	100%	Isle of Man	198.0	t	184
5737	32203	100%	Barbados	176.0	f	176
3644	20334	99%	Niger	171.0	t	167
3222	18077	100%	Sao Tome and Principe	139.0	f	125
838	4745	100%	Croatia	119.0	f	114
1888	10711	93%	Uganda	108.0	t	108
4070	22721	100%	Ecuador	109.0	f	106
3400	19019	97%	Nicaragua	1484.0	f	102
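
A table like this one can be produced by counting identical rows. The sketch below shows the idea on a toy list of row tuples. The row values are borrowed from the tables in this report, but the repetition counts in the toy list are made up for illustration and do not reflect the real dataset.

```python
from collections import Counter

# Toy rows: (id, company_rating, company_location, total_fleet_count, iata_approved).
rows = [
    (29647, "100%", "Peru", 1305.0, "f"),
    (29647, "100%", "Peru", 1305.0, "f"),
    (45111, "100%", "Sao Tome and Principe", 420.0, "f"),
    (29647, "100%", "Peru", 1305.0, "f"),
    (8238, "91%", "Barbados", 15.0, "t"),
    (45111, "100%", "Sao Tome and Principe", 420.0, "f"),
]

# Count how often each complete row occurs and keep those seen more than once.
counts = Counter(rows)
duplicated = {row: n for row, n in counts.items() if n > 1}

# Most frequently occurring rows first.
for row, n in sorted(duplicated.items(), key=lambda item: -item[1]):
    print(n, row)
```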
