pyspark.pandas.Series.compare
Series.compare(other, keep_shape=False, keep_equal=False)
Compare to another Series and show the differences.
Note
This API behaves slightly differently from pandas when the indexes of the two Series are not identical and the config 'compute.eager_check' is False: pandas raises an exception, whereas pandas-on-Spark simply proceeds and performs the comparison, ignoring the mismatches.
>>> psser1 = ps.Series([1, 2, 3, 4, 5], index=pd.Index([1, 2, 3, 4, 5]))
>>> psser2 = ps.Series([1, 2, 3, 4, 5], index=pd.Index([1, 2, 4, 3, 6]))
>>> psser1.compare(psser2)
...
ValueError: Can only compare identically-labeled Series objects
>>> with ps.option_context("compute.eager_check", False):
...     psser1.compare(psser2)
...
   self  other
3   3.0    4.0
4   4.0    3.0
5   5.0    NaN
6   NaN    5.0
Parameters
- other : Series
  Object to compare with.
- keep_shape : bool, default False
  If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.
- keep_equal : bool, default False
  If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
Returns
- DataFrame
Notes
Matching NaNs will not appear as a difference.
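For instance, a minimal sketch of this behavior (output shown as expected; exact formatting may vary): the row where both sides are NaN compares as equal and is dropped from the result, so only the genuinely differing row remains.
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     ps.Series([1.0, None, 3.0]).compare(ps.Series([1.0, None, 4.0])).sort_index()
...
   self  other
2   3.0    4.0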
Examples
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> s1 = ps.Series(["a", "b", "c", "d", "e"])
>>> s2 = ps.Series(["a", "a", "c", "b", "e"])
Align the differences on columns
>>> s1.compare(s2).sort_index()
  self other
1    b     a
3    d     b
Keep all original rows
>>> s1.compare(s2, keep_shape=True).sort_index()
   self other
0  None  None
1     b     a
2  None  None
3     d     b
4  None  None
Keep all original rows and all original values
>>> s1.compare(s2, keep_shape=True, keep_equal=True).sort_index()
  self other
0    a     a
1    b     a
2    c     c
3    d     b
4    e     e
>>> reset_option("compute.ops_on_diff_frames")