pyspark.pandas.Series.compare

Series.compare(other, keep_shape=False, keep_equal=False)

Compare to another Series and show the differences.

Note

This API behaves slightly differently from pandas when the indexes of the two Series are not identical and the config ‘compute.eager_check’ is False. pandas raises an exception; pandas-on-Spark instead proceeds and ignores the mismatch.

>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> psser1 = ps.Series([1, 2, 3, 4, 5], index=pd.Index([1, 2, 3, 4, 5]))
>>> psser2 = ps.Series([1, 2, 3, 4, 5], index=pd.Index([1, 2, 4, 3, 6]))
>>> psser1.compare(psser2)
...
ValueError: Can only compare identically-labeled Series objects
>>> with ps.option_context("compute.eager_check", False):
...     psser1.compare(psser2)  
...
   self  other
3   3.0    4.0
4   4.0    3.0
5   5.0    NaN
6   NaN    5.0
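
The output above is consistent with aligning the two Series on the union of their index labels and then keeping only the rows whose values differ. The following is a minimal pure-Python sketch of that observed behavior, not the pandas-on-Spark implementation. Dictionaries stand in for the Series, `None` stands in for NaN, and the helper name `compare_ignoring_mismatch` is hypothetical.

```python
def compare_ignoring_mismatch(left, right):
    """Hypothetical helper: compare two label->value mappings over the union of labels."""
    rows = {}
    for label in sorted(set(left) | set(right)):
        self_val = left.get(label)    # None stands in for NaN when the label is missing
        other_val = right.get(label)
        if self_val != other_val:     # keep only rows that differ
            rows[label] = (self_val, other_val)
    return rows


# Same labels and values as psser1 and psser2 in the example above.
psser1 = dict(zip([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))
psser2 = dict(zip([1, 2, 4, 3, 6], [1, 2, 3, 4, 5]))

result = compare_ignoring_mismatch(psser1, psser2)
for label, (self_val, other_val) in result.items():
    print(label, self_val, other_val)
```

Labels 1 and 2 carry equal values on both sides and are dropped; labels 5 and 6 each exist on one side only, so the missing side shows up as a null.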

Parameters
other : Series

Object to compare with.

keep_shape : bool, default False

If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.

keep_equal : bool, default False

If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.
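
How the two flags interact can be modeled in a few lines of plain Python. This is only an illustrative sketch for two inputs with identical positional indexes; the function name `compare_sketch` is hypothetical, and `None` stands in for the null marker shown in the result.

```python
def compare_sketch(self_vals, other_vals, keep_shape=False, keep_equal=False):
    """Hypothetical model of keep_shape and keep_equal for equally indexed inputs."""
    rows = []
    for idx, (s, o) in enumerate(zip(self_vals, other_vals)):
        equal = s == o
        if equal and not keep_shape:
            continue                        # rows without a difference are dropped
        if equal and not keep_equal:
            rows.append((idx, None, None))  # row is kept, equal values are masked
        else:
            rows.append((idx, s, o))        # differing row, or equal values retained
    return rows


s1 = ["a", "b", "c", "d", "e"]
s2 = ["a", "a", "c", "b", "e"]

default_rows = compare_sketch(s1, s2)
shape_rows = compare_sketch(s1, s2, keep_shape=True)
full_rows = compare_sketch(s1, s2, keep_shape=True, keep_equal=True)

print(default_rows)  # [(1, 'b', 'a'), (3, 'd', 'b')]
```

In this model `keep_equal` only has a visible effect on rows that survive the `keep_shape` filter, since equal rows are otherwise removed entirely.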

Returns
DataFrame

Notes

Matching NaNs will not appear as a difference.
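
This matters because NaN does not compare equal to itself under ordinary floating-point rules, so a naive element-wise `!=` would flag every position where both sides are NaN. A small sketch of the distinction, using only the standard library (the helper `same_value` is hypothetical and is not how pandas-on-Spark implements the check):

```python
import math


def same_value(a, b):
    """Hypothetical equality test in which two NaNs count as equal."""
    both_nan = (
        isinstance(a, float) and isinstance(b, float)
        and math.isnan(a) and math.isnan(b)
    )
    return both_nan or a == b


left = [1.0, float("nan"), 3.0]
right = [1.0, float("nan"), 4.0]

# Plain inequality treats the NaN pair at position 1 as a difference.
naive = [i for i, (a, b) in enumerate(zip(left, right)) if a != b]

# NaN-aware comparison reports only the genuine difference at position 2.
nan_aware = [i for i, (a, b) in enumerate(zip(left, right)) if not same_value(a, b)]

print(naive)      # [1, 2]
print(nan_aware)  # [2]
```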

Examples

>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> s1 = ps.Series(["a", "b", "c", "d", "e"])
>>> s2 = ps.Series(["a", "a", "c", "b", "e"])

Align the differences on columns

>>> s1.compare(s2).sort_index()
  self other
1    b     a
3    d     b

Keep all original rows

>>> s1.compare(s2, keep_shape=True).sort_index()
   self other
0  None  None
1     b     a
2  None  None
3     d     b
4  None  None

Keep all original rows and all original values

>>> s1.compare(s2, keep_shape=True, keep_equal=True).sort_index()
  self other
0    a     a
1    b     a
2    c     c
3    d     b
4    e     e
>>> reset_option("compute.ops_on_diff_frames")