(Go: >> BACK << -|- >> HOME <<)

Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pylibcudf polars date from str #16306

Draft
wants to merge 2 commits into
base: branch-24.08
Choose a base branch
from

Conversation

brandon-b-miller
Copy link
Contributor
@brandon-b-miller brandon-b-miller commented Jul 18, 2024

This PR adds datetime/timestamp parsing from string columns in pylibcudf and cudf-polars.

Closes #16174

WIP

@brandon-b-miller brandon-b-miller added feature request New feature or request non-breaking Non-breaking change cudf.polars Issues specific to cudf.polars labels Jul 18, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. CMake CMake build issue pylibcudf Issues specific to the pylibcudf package labels Jul 18, 2024
@@ -811,6 +812,9 @@ def do_evaluate(
else prefix.obj,
)
)
elif self.name == pl_expr.StringFunction.Strptime:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wence- I'm a little unsure if what's being passed from rust here is correct. The docs list many options for this api but self.options is empty here. the rust code only passes the enum value out, is this right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we didn't send that through in pola-rs/polars#16221.

Please check if pola-rs/polars#17702 works for you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to be working, with that PR I get this:

(Pdb++) self.options
('%Y-%m-%d', True, True, True)

@brandon-b-miller
Copy link
Contributor Author

Polars seems to enforce some kind of handling for ambiguous datetimes through this API, we can either error, raise, or replace them but all seem to rely on identifying if they exist and where. This is problematic for us short term in the very general case because we'd likely have to write some careful logic and tests to do this properly.

Are there any conditions we can think of that preclude the possibility of this occurring, like maybe requiring that the format specify UTC for now?

@wence-
Copy link
Contributor
wence- commented Jul 23, 2024

Polars seems to enforce some kind of handling for ambiguous datetimes through this API, we can either error, raise, or replace them but all seem to rely on identifying if they exist and where. This is problematic for us short term in the very general case because we'd likely have to write some careful logic and tests to do this properly.

Are there any conditions we can think of that preclude the possibility of this occurring, like maybe requiring that the format specify UTC for now?

Paging @mroeschke who has done a load of this in cudf-classic for the pandas compatibility.

If we think the answer is "it's too complicated", let's just not support unconditionally and write up properly as an issue.

@mroeschke
Copy link
Contributor

Both %z and %Z raise in cudf classic currently as NotImplementedError

As you alluded to, a %Z in the format matching UTC exactly is the only safe timezone localization without introspecting the data. Technically any string matching %z could be supported since it should be treated as an "arbitrary UTC offset", but there's no way to represent that in the dtype today.

@brandon-b-miller
Copy link
Contributor Author

Both %z and %Z raise in cudf classic currently as NotImplementedError

As you alluded to, a %Z in the format matching UTC exactly is the only safe timezone localization without introspecting the data. Technically any string matching %z could be supported since it should be treated as an "arbitrary UTC offset", but there's no way to represent that in the dtype today.

Thanks @mroeschke , please correct me if I am wrong, but I think this means we should at least be able to proceed with the case of %Z - my reasoning here being that if we assume the data is all normalized to UTC, no instant in time should be ambiguous, so any kind of "handling" should be a no-op.

@mroeschke
Copy link
Contributor
mroeschke commented Jul 23, 2024

Thanks @mroeschke , please correct me if I am wrong, but I think this means we should at least be able to proceed with the case of %Z - my reasoning here being that if we assume the data is all normalized to UTC, no instant in time should be ambiguous, so any kind of "handling" should be a no-op.

assume the data is all normalized to UTC

Internally within cudf we can assume this. From polars I'm not sure if this can be assumed, especially if it's directly from user input. e.g.:

df = pl.DataFrame({"a": ["2024-03-10 02:30:00 US/Pacific"]})
df.select(pl.col("a").str.to_datetime(format="%Y-%m-%d %H:%M:%S %Z"))

should raise as 2024-03-10 02:30:00 is not a valid date and time in US/Pacific

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue cudf.polars Issues specific to cudf.polars feature request New feature or request non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: In Progress
Status: No status
Development

Successfully merging this pull request may close these issues.

[BUG] Support string to datetime conversion in Polars engine
3 participants