Overview

Dataset statistics

Number of variables: 5
Number of observations: 77096
Missing cells: 49053
Missing cells (%): 12.7%
Duplicate rows: 8966
Duplicate rows (%): 11.6%
Total size in memory: 2.9 MiB
Average record size in memory: 40.0 B
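
The percentages above can be cross-checked from the raw counts. The following is a minimal sketch, assuming that "Missing cells (%)" is taken over rows × columns and that "Duplicate rows (%)" is taken over the number of observations; the per-column missing counts are copied from the variable sections of this report.

```python
# Figures copied from the dataset statistics in this report.
n_rows = 77096
n_cols = 5
missing_cells = 49053
duplicate_rows = 8966

# Assumption: the missing-cell percentage is relative to all cells.
total_cells = n_rows * n_cols
missing_pct = 100 * missing_cells / total_cells

# Assumption: the duplicate percentage is relative to the row count.
duplicate_pct = 100 * duplicate_rows / n_rows

# Missing counts per column: id, company_rating, company_location,
# total_fleet_count, and the boolean column.
per_column_missing = [0, 29909, 19130, 7, 7]

print(f"{missing_pct:.1f}%")
print(f"{duplicate_pct:.1f}%")
print(sum(per_column_missing) == missing_cells)
```

Under these assumptions the per-column counts add up to the reported total of missing cells.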

Variable types

Numeric: 2
Categorical: 2
Boolean: 1

Alerts

Dataset has 8966 (11.6%) duplicate rows (Duplicates)
company_rating has a high cardinality: 70 distinct values (High cardinality)
company_location has a high cardinality: 181 distinct values (High cardinality)
company_rating has 29909 (38.8%) missing values (Missing)
company_location has 19130 (24.8%) missing values (Missing)

Reproduction

Analysis started: 2022-11-24 10:51:06.732214
Analysis finished: 2022-11-24 10:51:12.799680
Duration: 6.07 seconds
Software version: pandas-profiling v3.5.0
Download configuration: config.json

Variables

id
Real number (ℝ)

Distinct: 50098
Distinct (%): 65.0%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 25155.386
Minimum: 1
Maximum: 50098
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 602.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 2704.75
Q1: 12935.75
Median: 25253.5
Q3: 37410.25
95-th percentile: 47520.25
Maximum: 50098
Range: 50097
Interquartile range (IQR): 24474.5

Descriptive statistics

Standard deviation: 14300.991
Coefficient of variation (CV): 0.5685061
Kurtosis: -1.1684667
Mean: 25155.386
Median Absolute Deviation (MAD): 12241.5
Skewness: -0.01137149
Sum: 1.9393797 × 10^9
Variance: 2.0451833 × 10^8
Monotonicity: Not monotonic
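
Several of the statistics for id are derived from others. The sketch below recomputes them from the reported quartiles, extremes, mean and standard deviation, using the usual textbook definitions (range = max − min, IQR = Q3 − Q1, CV = standard deviation / mean, variance = standard deviation squared). Small differences in the last digits are expected because the inputs are already rounded.

```python
# Values copied from the quantile and descriptive statistics for `id`.
q1, q3 = 12935.75, 37410.25
minimum, maximum = 1, 50098
mean, std = 25155.386, 14300.991

value_range = maximum - minimum   # reported as 50097
iqr = q3 - q1                     # reported as 24474.5
cv = std / mean                   # reported as 0.5685061
variance = std ** 2               # reported as 2.0451833e8

print(value_range, iqr)
print(round(cv, 4))
print(f"{variance:.4e}")
```
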
Histogram with fixed size bins (bins=50)
Value                  Count   Frequency (%)
29647                  1086    1.4%
45111                  297     0.4%
28828                  184     0.2%
32203                  176     0.2%
20334                  167     0.2%
18077                  125     0.2%
4745                   114     0.1%
10711                  108     0.1%
22721                  106     0.1%
19019                  102     0.1%
Other values (50088)   74631   96.8%

Value   Count   Frequency (%)
1       2       < 0.1%
2       1       < 0.1%
3       1       < 0.1%
4       1       < 0.1%
5       1       < 0.1%
6       2       < 0.1%
7       1       < 0.1%
8       1       < 0.1%
9       2       < 0.1%
10      1       < 0.1%

Value   Count   Frequency (%)
50098   2       < 0.1%
50097   1       < 0.1%
50096   1       < 0.1%
50095   1       < 0.1%
50094   1       < 0.1%
50093   1       < 0.1%
50092   1       < 0.1%
50091   1       < 0.1%
50090   1       < 0.1%
50089   2       < 0.1%

company_rating
Categorical

HIGH CARDINALITY
MISSING

Distinct: 70
Distinct (%): 0.1%
Missing: 29909
Missing (%): 38.8%
Memory size: 602.4 KiB
Value               Count
100%                34204
90%                 1698
99%                 956
0%                  920
80%                 862
Other values (65)   8547

Length

Max length: 4
Median length: 4
Mean length: 3.7053638
Min length: 2

Characters and Unicode

Total characters: 174845
Distinct characters: 11
Distinct categories: 2
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
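
As an illustration of these character properties, Python's standard `unicodedata` module exposes the general category of each code point. The sketch below counts categories over the five sample values shown for this column; the helper name `category_counts` is hypothetical, and this is not necessarily how the report itself computes its tables. The two-letter codes `Nd` and `Po` correspond to "Decimal Number" and "Other Punctuation".

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count Unicode general categories over all characters in `values`."""
    counts = Counter()
    for value in values:
        for ch in value:
            counts[unicodedata.category(ch)] += 1
    return counts


# The five sample rows listed for company_rating in this report.
sample = ["100%", "67%", "67%", "91%", "100%"]
counts = category_counts(sample)
print(counts)
```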

Unique

Unique: 5
Unique (%): < 0.1%

Sample

1st row: 100%
2nd row: 67%
3rd row: 67%
4th row: 91%
5th row: 100%

Common Values

Value               Count   Frequency (%)
100%                34204   44.4%
90%                 1698    2.2%
99%                 956     1.2%
0%                  920     1.2%
80%                 862     1.1%
98%                 726     0.9%
50%                 702     0.9%
97%                 675     0.9%
94%                 538     0.7%
95%                 458     0.6%
Other values (60)   5448    7.1%
(Missing)           29909   38.8%

Length

Histogram of lengths of the category
Value               Count   Frequency (%)
100                 34204   72.5%
90                  1698    3.6%
99                  956     2.0%
0                   920     1.9%
80                  862     1.8%
98                  726     1.5%
50                  702     1.5%
97                  675     1.4%
94                  538     1.1%
95                  458     1.0%
Other values (60)   5448    11.5%

Most occurring characters

Value   Count   Frequency (%)
0       73359   42.0%
%       47187   27.0%
1       34583   19.8%
9       7824    4.5%
8       3425    2.0%
7       2471    1.4%
5       1898    1.1%
6       1601    0.9%
3       1177    0.7%
4       832     0.5%

Most occurring categories

Value               Count    Frequency (%)
Decimal Number      127658   73.0%
Other Punctuation   47187    27.0%

Most frequent character per category

Decimal Number
Value   Count   Frequency (%)
0       73359   57.5%
1       34583   27.1%
9       7824    6.1%
8       3425    2.7%
7       2471    1.9%
5       1898    1.5%
6       1601    1.3%
3       1177    0.9%
4       832     0.7%
2       488     0.4%
Other Punctuation
Value   Count   Frequency (%)
%       47187   100.0%

Most occurring scripts

Value    Count    Frequency (%)
Common   174845   100.0%

Most frequent character per script

Common
Value   Count   Frequency (%)
0       73359   42.0%
%       47187   27.0%
1       34583   19.8%
9       7824    4.5%
8       3425    2.0%
7       2471    1.4%
5       1898    1.1%
6       1601    0.9%
3       1177    0.7%
4       832     0.5%

Most occurring blocks

Value   Count    Frequency (%)
ASCII   174845   100.0%

Most frequent character per block

ASCII
Value   Count   Frequency (%)
0       73359   42.0%
%       47187   27.0%
1       34583   19.8%
9       7824    4.5%
8       3425    2.0%
7       2471    1.4%
5       1898    1.1%
6       1601    0.9%
3       1177    0.7%
4       832     0.5%

company_location
Categorical

HIGH CARDINALITY
MISSING

Distinct: 181
Distinct (%): 0.3%
Missing: 19130
Missing (%): 24.8%
Memory size: 602.4 KiB
Value                Count
Russian Federation   1979
Niue                 1863
Uzbekistan           1711
Guinea               1703
Isle of Man          1633
Other values (176)   49077

Length

Max length: 51
Median length: 29
Mean length: 9.5813408
Min length: 4

Characters and Unicode

Total characters: 555392
Distinct characters: 59
Distinct categories: 8
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 21
Unique (%): < 0.1%

Sample

1st row: Niue
2nd row: Anguilla
3rd row: Russian Federation
4th row: Barbados
5th row: Sao Tome and Principe

Common Values

Value                   Count   Frequency (%)
Russian Federation      1979    2.6%
Niue                    1863    2.4%
Uzbekistan              1711    2.2%
Guinea                  1703    2.2%
Isle of Man             1633    2.1%
Sao Tome and Principe   1624    2.1%
Nicaragua               1490    1.9%
Tonga                   1442    1.9%
Peru                    1300    1.7%
Marshall Islands        1293    1.7%
Other values (171)      41928   54.4%
(Missing)               19130   24.8%

Length

Histogram of lengths of the category
Value                Count   Frequency (%)
islands              3185    3.8%
and                  3018    3.6%
guinea               2001    2.4%
russian              1979    2.4%
federation           1979    2.4%
niue                 1863    2.2%
uzbekistan           1711    2.1%
of                   1658    2.0%
isle                 1633    2.0%
man                  1633    2.0%
Other values (229)   62471   75.1%

Most occurring characters

Value               Count    Frequency (%)
a                   78762    14.2%
n                   49414    8.9%
i                   47479    8.5%
e                   43540    7.8%
o                   28191    5.1%
r                   28103    5.1%
s                   27841    5.0%
(space)             25165    4.5%
l                   20596    3.7%
u                   19622    3.5%
Other values (49)   186679   33.6%

Most occurring categories

Value               Count    Frequency (%)
Lowercase Letter    448825   80.8%
Uppercase Letter    77999    14.0%
Space Separator     25165    4.5%
Close Punctuation   1497     0.3%
Open Punctuation    1497     0.3%
Other Punctuation   360      0.1%
Decimal Number      48       < 0.1%
Dash Punctuation    1        < 0.1%

Most frequent character per category

Lowercase Letter
Value               Count   Frequency (%)
a                   78762   17.5%
n                   49414   11.0%
i                   47479   10.6%
e                   43540   9.7%
o                   28191   6.3%
r                   28103   6.3%
s                   27841   6.2%
l                   20596   4.6%
u                   19622   4.4%
d                   18661   4.2%
Other values (16)   86616   19.3%
Uppercase Letter
Value               Count   Frequency (%)
M                   9229    11.8%
I                   6334    8.1%
S                   5391    6.9%
G                   5277    6.8%
N                   5275    6.8%
T                   5266    6.8%
C                   4981    6.4%
F                   4942    6.3%
P                   4924    6.3%
R                   4723    6.1%
Other values (15)   21657   27.8%
Other Punctuation
Value   Count   Frequency (%)
&       357     99.2%
'       3       0.8%
Decimal Number
Value   Count   Frequency (%)
6       24      50.0%
0       24      50.0%
Space Separator
Value     Count   Frequency (%)
(space)   25165   100.0%
Close Punctuation
Value   Count   Frequency (%)
)       1497    100.0%
Open Punctuation
Value   Count   Frequency (%)
(       1497    100.0%
Dash Punctuation
Value   Count   Frequency (%)
-       1       100.0%

Most occurring scripts

Value    Count    Frequency (%)
Latin    526824   94.9%
Common   28568    5.1%

Most frequent character per script

Latin
Value               Count    Frequency (%)
a                   78762    15.0%
n                   49414    9.4%
i                   47479    9.0%
e                   43540    8.3%
o                   28191    5.4%
r                   28103    5.3%
s                   27841    5.3%
l                   20596    3.9%
u                   19622    3.7%
d                   18661    3.5%
Other values (41)   164615   31.2%
Common
Value     Count   Frequency (%)
(space)   25165   88.1%
)         1497    5.2%
(         1497    5.2%
&         357     1.2%
6         24      0.1%
0         24      0.1%
'         3       < 0.1%
-         1       < 0.1%

Most occurring blocks

Value   Count    Frequency (%)
ASCII   555392   100.0%

Most frequent character per block

ASCII
Value               Count    Frequency (%)
a                   78762    14.2%
n                   49414    8.9%
i                   47479    8.5%
e                   43540    7.8%
o                   28191    5.1%
r                   28103    5.1%
s                   27841    5.0%
(space)             25165    4.5%
l                   20596    3.7%
u                   19622    3.5%
Other values (49)   186679   33.6%

total_fleet_count
Real number (ℝ)

Distinct: 90
Distinct (%): 0.1%
Missing: 7
Missing (%): < 0.1%
Infinite: 0
Infinite (%): 0.0%
Mean: 30.473699
Minimum: 1
Maximum: 1484
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 602.4 KiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
Median: 1
Q3: 4
95-th percentile: 60
Maximum: 1484
Range: 1483
Interquartile range (IQR): 3

Descriptive statistics

Standard deviation: 165.54827
Coefficient of variation (CV): 5.4324966
Kurtosis: 54.721438
Mean: 30.473699
Median Absolute Deviation (MAD): 0
Skewness: 7.4147427
Sum: 2349187
Variance: 27406.229
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Value               Count   Frequency (%)
1                   38555   50.0%
2                   11827   15.3%
3                   4915    6.4%
4                   2852    3.7%
5                   1648    2.1%
6                   1250    1.6%
1305                1086    1.4%
7                   984     1.3%
8                   867     1.1%
9                   791     1.0%
Other values (80)   12314   16.0%

Value   Count   Frequency (%)
1       38555   50.0%
2       11827   15.3%
3       4915    6.4%
4       2852    3.7%
5       1648    2.1%
6       1250    1.6%
7       984     1.3%
8       867     1.1%
9       791     1.0%
10      642     0.8%

Value   Count   Frequency (%)
1484    102     0.1%
1305    1086    1.4%
1105    10      < 0.1%
781     1       < 0.1%
420     297     0.4%
419     1       < 0.1%
198     184     0.2%
185     8       < 0.1%
176     176     0.2%
172     19      < 0.1%

iata_approved
Boolean

Distinct: 2
Distinct (%): < 0.1%
Missing: 7
Missing (%): < 0.1%
Memory size: 150.7 KiB
Value       Count
False       47482
True        29607
(Missing)   7
Value       Count   Frequency (%)
False       47482   61.6%
True        29607   38.4%
(Missing)   7       < 0.1%

Interactions


Correlations


Auto

The auto setting selects an interpretable pairwise metric according to the types of the two columns, using the following mapping:
  • Variable_type-Variable_type : Method, Range
  • Categorical-Categorical : Cramer's V, [0,1]
  • Numerical-Categorical : Cramer's V, [0,1] (using a discretized numerical column)
  • Numerical-Numerical : Spearman's ρ, [-1,1]
The number of bins used in the discretization for the Numerical-Categorical column pair can be changed using config.correlations["auto"].n_bins. The number of bins affects the granularity of the association you wish to measure.

This configuration uses the recommended metric for each pair of columns.
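
To make the role of the bin count concrete, the sketch below discretizes a numeric column into a fixed number of bins so that it can be treated as categorical. It is only an illustration: equal-width binning and the helper name `discretize` are assumptions, and the report does not state which binning strategy it applies internally.

```python
def discretize(values, n_bins=10):
    """Map each numeric value to a bin index in 0..n_bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return [0 for _ in values]
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy fleet counts spanning the reported range of total_fleet_count (1 to 1484).
fleet = [1, 1, 2, 4, 15, 60, 1484]
bins = discretize(fleet, n_bins=3)
print(bins)
```

With a skewed column such as this one, few bins put almost every row in the first bin, which is why the bin count changes the granularity of the measured association.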

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
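
The calculation described above can be sketched in plain Python: rank both variables, then apply the covariance-over-standard-deviations formula to the ranks. The helper names are hypothetical, and ties are given their average rank, which is a common convention rather than something the report specifies.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Covariance divided by the product of standard deviations.

    The 1/(n-1) factors cancel, so plain sums are used.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x**3: monotonic but not linear
print(spearman(x, y), pearson(x, y))
```

On this toy input ρ is 1 up to floating-point rounding, while Pearson's r stays below 1 because the relationship is not linear.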

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that, for a linear relationship, the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
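
A minimal sketch of this formula, with a check of the invariance under changes in location and (positive) scale. The function name and the toy data are illustrative only.

```python
from statistics import mean


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations.

    The 1/(n-1) factors cancel between numerator and denominator.
    """
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(x, y)
# Shift and rescale each variable separately (positive scale factors).
r_transformed = pearson([10 * a + 3 for a in x], [0.5 * b - 7 for b in y])
print(r, r_transformed)
```

A negative scale factor on one variable would flip the sign of r but leave its magnitude unchanged.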

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
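
The pair-counting definition above can be sketched directly. This is the simple variant often called τ-a, in which tied pairs count as neither concordant nor discordant; whether the report applies a tie-adjusted variant is not stated here, so treat the choice as an assumption.

```python
from itertools import combinations


def kendall_tau(x, y):
    """(concordant - discordant) / total pairs, ignoring tie corrections."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both variables move in the same direction
        elif sign < 0:
            discordant += 1   # the variables move in opposite directions
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


x = [1, 2, 3, 4, 5]
y = [3, 1, 2, 5, 4]
tau = kendall_tau(x, y)
print(tau)
```

For this toy input there are 7 concordant and 3 discordant pairs out of 10, giving τ = 0.4.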

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use the bias-corrected measure proposed by Bergsma in 2013.
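
A sketch of a bias-corrected Cramér's V in plain Python follows. It uses the commonly cited form of Bergsma's correction (shrinking φ² and the table dimensions by terms of order 1/(n−1)); the function name is hypothetical and the report's exact implementation may differ in detail.

```python
from collections import Counter
from math import sqrt


def cramers_v_corrected(a, b):
    """Bias-corrected Cramér's V for two equally long categorical sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    row, col = Counter(a), Counter(b)
    r, k = len(row), len(col)

    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for i in row:
        for j in col:
            expected = row[i] * col[j] / n
            observed = joint.get((i, j), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    # Bias correction: shrink phi^2 and the effective table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0


flags = ["t", "t", "f", "f"] * 5
same = list(flags)                      # perfectly associated with flags
unrelated = ["x", "y", "x", "y"] * 5    # independent of flags in this sample

print(cramers_v_corrected(flags, same))
print(cramers_v_corrected(flags, unrelated))
```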

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available.

Missing values

A simple visualization of nullity by column.
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
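
One way to read "nullity correlation" is as the Pearson correlation between the is-missing indicators of two columns. The sketch below follows that reading; the function name and the toy columns are illustrative, and `None` stands in for a missing cell.

```python
from statistics import mean


def nullity_correlation(col_a, col_b):
    """Pearson correlation between the 'is missing' indicators of two columns."""
    a = [1.0 if v is None else 0.0 for v in col_a]
    b = [1.0 if v is None else 0.0 for v in col_b]
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = sum((p - ma) ** 2 for p in a) ** 0.5
    sb = sum((q - mb) ** 2 for q in b) ** 0.5
    if sa == 0 or sb == 0:
        # A column that is never (or always) missing has no defined correlation.
        return float("nan")
    return cov / (sa * sb)


rating = ["100%", None, None, "67%"]
location = ["Niue", None, None, "Peru"]       # missing in the same rows
fleet = [None, 2.0, 1.0, None]                # missing in the opposite rows

print(nullity_correlation(rating, location))
print(nullity_correlation(rating, fleet))
```

A value near +1 means two columns tend to be missing together; a value near -1 means one tends to be present when the other is missing.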

Sample

        id      company_rating   company_location        total_fleet_count   iata_approved
0       35029   100%             Niue                    4.0                 f
1       30292   67%              Anguilla                6.0                 f
2       19032   67%              Russian Federation      4.0                 f
3       8238    91%              Barbados                15.0                t
4       30342   NaN              Sao Tome and Principe   2.0                 t
5       32413   100%             Faroe Islands           1.0                 f
6       35620   90%              Micronesia              3.0                 f
7       23820   NaN              Rwanda                  1.0                 t
8       46528   100%             Uzbekistan              3.0                 t
9       11875   100%             Micronesia              2.0                 t

        id      company_rating   company_location        total_fleet_count   iata_approved
77086   15249   NaN              Marshall Islands        1.0                 f
77087   44431   NaN              NaN                     1.0                 f
77088   25724   NaN              NaN                     1.0                 f
77089   32743   NaN              Kiribati                2.0                 f
77090   19010   NaN              Philippines             2.0                 f
77091   6654    100%             Tonga                   3.0                 f
77092   8000    NaN              Chile                   2.0                 t
77093   14296   NaN              Netherlands             4.0                 f
77094   27363   80%              NaN                     3.0                 t
77095   12542   98%              Mauritania              19.0                t

Duplicate rows

Most frequently occurring

       id      company_rating   company_location        total_fleet_count   iata_approved   # duplicates
5293   29647   100%             Peru                    1305.0              f               1086
8081   45111   100%             Sao Tome and Principe   420.0               f               297
5148   28828   100%             Isle of Man             198.0               t               184
5737   32203   100%             Barbados                176.0               f               176
3644   20334   99%              Niger                   171.0               t               167
3222   18077   100%             Sao Tome and Principe   139.0               f               125
838    4745    100%             Croatia                 119.0               f               114
1888   10711   93%              Uganda                  108.0               t               108
4070   22721   100%             Ecuador                 109.0               f               106
3400   19019   97%              Nicaragua               1484.0              f               102
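
A table like this one can be produced by grouping identical rows and keeping the groups that occur more than once. The sketch below does so with `collections.Counter` on a handful of toy rows whose values mimic the report; it is an illustration of the idea, not the report's own implementation.

```python
from collections import Counter

# Toy rows: (id, company_rating, company_location, total_fleet_count, iata_approved).
rows = [
    (29647, "100%", "Peru", 1305.0, False),
    (29647, "100%", "Peru", 1305.0, False),
    (8238, "91%", "Barbados", 15.0, True),
    (29647, "100%", "Peru", 1305.0, False),
]

# Count how often each complete row occurs, then keep only repeated rows.
counts = Counter(rows)
duplicated = {row: n for row, n in counts.items() if n > 1}

# Most frequently occurring duplicates first.
for row, n in sorted(duplicated.items(), key=lambda item: -item[1]):
    print(row, n)
```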