github.com/pandas-profiling/pandas-profiling

One-line data quality profiling and exploratory data analysis for pandas and Spark DataFrames.


Contributors: 62

Lines of Code: 3,905

From: 2016-01-09

To: 2020-12-27

About pandas-profiling/pandas-profiling

fg-data-profiling (formerly ydata-profiling) is a Python library that generates automated exploratory data analysis reports from pandas and Spark DataFrames in a single line of code. The package extends the functionality of pandas' df.describe() method to provide comprehensive statistical analysis and visualizations, outputting results in multiple formats including HTML, JSON, and interactive Jupyter notebook widgets.

The tool performs univariate analysis with descriptive statistics and distribution histograms, multivariate analysis including correlations and missing data patterns, automatic data type detection, and specialized analysis for time-series data, text content, files, and images. It also generates automated data quality alerts highlighting potential issues such as missing values, skewness, high correlation, and duplicate rows. The package includes features for comparing multiple datasets and can handle large-scale data processing through Spark integration.
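To make the idea of automated data quality alerts concrete, here is a minimal standard-library sketch of the kinds of checks named above (missing values, skewness, high correlation, duplicate rows). It is not the library's implementation: the function name `quality_alerts`, the alert labels, and all thresholds are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt
from statistics import mean, pstdev


def skewness(xs):
    """Population skewness (third standardized moment); 0.0 for constant data."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        return 0.0
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def quality_alerts(rows, missing_threshold=0.2, skew_threshold=1.0,
                   corr_threshold=0.9):
    """Return (kind, column, value) alerts for a list of dict rows.

    Assumes every row has the same keys and that None marks a missing value.
    The thresholds are illustrative defaults, not the library's settings.
    """
    if not rows:
        return []
    alerts = []
    columns = list(rows[0].keys())

    # Duplicate rows: count every repeat beyond the first occurrence.
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    n_dup = sum(c - 1 for c in counts.values())
    if n_dup:
        alerts.append(("DUPLICATES", None, n_dup))

    numeric = {}
    for col in columns:
        values = [row[col] for row in rows]
        present = [v for v in values if v is not None]

        # Missing values: fraction of None entries in the column.
        missing = 1 - len(present) / len(values)
        if missing > missing_threshold:
            alerts.append(("MISSING", col, round(missing, 3)))

        # Skewness applies to numeric columns only (bool is excluded).
        if present and all(isinstance(v, (int, float))
                           and not isinstance(v, bool) for v in present):
            numeric[col] = values
            skew = skewness(present)
            if abs(skew) > skew_threshold:
                alerts.append(("SKEWED", col, round(skew, 3)))

    # High correlation: compare each pair of numeric columns on the rows
    # where both values are present.
    names = list(numeric)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pairs = [(x, y) for x, y in zip(numeric[a], numeric[b])
                     if x is not None and y is not None]
            if len(pairs) >= 2:
                r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
                if abs(r) > corr_threshold:
                    alerts.append(("HIGH_CORRELATION", (a, b), round(r, 3)))
    return alerts
```

Calling `quality_alerts(rows)` on a list of dictionaries returns a flat list of tuples that can be printed or filtered by kind. A real profiling report would also attach these alerts to the per-column statistics and visualizations.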

fg-data-profiling integrates with popular data science tools and platforms including Great Expectations for validation, cloud services like Google Cloud and Kaggle, workflow orchestration tools like Airflow and Kedro, and interactive dashboard frameworks such as Streamlit and Dash. The project targets data analysts, scientists, and engineers working with pandas DataFrames who need quick insight into dataset characteristics and quality without manual analysis code.
