@agoose77
Last active August 10, 2023 09:40
---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.15.0
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

Parsing strings

Awkward Array implements support for ragged strings as ragged lists of code-units. As such, successive strings are closely packed in memory, leading to high-performance operations.

+++

Reading strings from a UTF-8 file

+++

Let's imagine that we want to read some logging output that is stored in a text file. For example, a subset of logs from the Android Application framework.

%%bash --out path
pushd $(mktemp -d) >/dev/null 2>&1
wget https://zenodo.org/record/8196385/files/Android_v1.zip >/dev/null 2>&1
unzip Android_v1.zip >/dev/null 2>&1
realpath Android.log

What do these logs look like?

!head {path}

To begin with, we can read these logs with NumPy as an array of dtype {data}np.uint8, and convert the resulting array to an Awkward Array

import awkward as ak
import numpy as np

with open(path.strip(), "rb") as f:
    arr = np.fromfile(f, dtype=np.uint8)

raw_bytes = ak.from_numpy(arr)
raw_bytes.type.show()

Awkward Array doesn't support scalar values, so we can't treat these characters as a single string. Instead, we need at least one dimension. Let's unflatten our array of characters to form a length-1 array of strings.

string = ak.enforce_type(ak.unflatten(raw_bytes, len(raw_bytes)), "string")
string.type.show()

The underlying mechanism for implementing strings as lists of code-units can be seen if we inspect the low-level layout that builds the array

string.layout

The __array__ parameter is special. It is reserved by Awkward Array, and signals that the layout is a special, pre-understood built-in type. In this case, the type of the outer ListOffsetArray is "string". The inner NumpyArray also has an __array__ parameter, this time with a value of "char". In Awkward Array, an array of strings must look like this layout: a list with the __array__="string" parameter wrapping a NumpyArray with the __array__="char" parameter.
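The idea of a list of offsets wrapping one flat buffer of code-units can be sketched in plain Python. This is only an illustration of the memory model, not Awkward Array's implementation; the names `content`, `offsets`, and `get_string` are made up for this sketch.

```python
# Plain-Python sketch: several strings stored as one byte buffer plus offsets.
strings = ["hello", "\u00c5", "log"]  # "\u00c5" is the single code point "Å"

# Concatenate the UTF-8 code-units of every string into one contiguous buffer
content = b"".join(s.encode("utf-8") for s in strings)

# offsets[i]:offsets[i + 1] delimits the bytes that belong to string i
offsets = [0]
for s in strings:
    offsets.append(offsets[-1] + len(s.encode("utf-8")))


def get_string(i):
    """Decode the i-th string from the shared buffer."""
    return content[offsets[i] : offsets[i + 1]].decode("utf-8")


print(offsets)        # [0, 5, 7, 10]
print(get_string(1))  # Å
```

Because all strings share one buffer, an operation over every byte touches a single contiguous block of memory rather than many small objects.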

+++

A single (very long) string isn't much use. Let's split this string at the line boundaries

split_at_newlines = ak.str.split_pattern(string, "\n")
split_at_newlines

Now we can remove the temporary length-1 outer dimension that was required to treat the data as a string

lines = split_at_newlines[0]
lines
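One thing to watch for when splitting on "\n" alone is what happens to files with Windows-style line endings. The plain-Python sketch below uses made-up bytes (not the log file above) to show that each piece keeps its trailing "\r", and that a final newline produces a trailing empty piece. Whether the same applies to a particular file depends on its contents.

```python
# Made-up input with "\r\n" line endings and a final newline
raw = b"first line\r\nsecond line\r\n"

# Splitting on "\n" only: the carriage returns stay attached to each piece
pieces = raw.split(b"\n")
print(pieces)  # [b'first line\r', b'second line\r', b'']

# One possible clean-up: drop empty pieces and strip the trailing "\r"
cleaned = [p.rstrip(b"\r") for p in pieces if p]
print(cleaned)  # [b'first line', b'second line']
```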

In the low-level layout, we can see that these lines are still just variable-length lists

lines.layout

Code-units vs code-points

In general, whilst strings can fundamentally be described as lists of bytes (code-units), many string operations do not operate at the byte-level. The {mod}ak.str submodule provides a suite of vectorised operations that operate at the code-point (not code-unit) level, such as computing the string length. Consider the following simple string

large_code_point = ak.Array(["Å"])

In Awkward Array, strings are UTF-8 encoded, meaning that a single code-point may comprise up to four code-units (bytes). Although this looks like a single character, the layout makes it clear that the number of code-units is in fact two

large_code_point.layout

This is reflected in the {func}ak.num function

ak.num(large_code_point)

The {mod}ak.str module provides a function for computing the length of a string

ak.str.length(large_code_point)

Clearly this function is code-point aware.
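The same distinction can be checked with the Python standard library alone. In this sketch, the length of a `str` counts code-points, while the length of its UTF-8 encoding counts code-units. The code points are written as escapes so that the example does not depend on how the source file is normalised.

```python
# U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
ring_a = "\u00c5"
print(len(ring_a))                  # 1 code-point
print(len(ring_a.encode("utf-8")))  # 2 code-units (bytes)

# U+1F3B5 MUSICAL NOTE needs the maximum of four UTF-8 code-units
note = "\U0001F3B5"
print(len(note))                    # 1 code-point
print(len(note.encode("utf-8")))    # 4 code-units (bytes)
```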

+++

Transforming strings

+++

Let's consider the following log data

import awkward as ak

lines = ak.from_iter(
    [
        "12-17 19:31:36.263  1795  1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0\r",
        "12-17 19:31:36.263  5224  5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0\r",
        "12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r",
        "12-17 19:31:36.264  1795  1825 I PowerManager_screenOn: DisplayPowerController updatePowerState mPendingRequestLocked=policy=BRIGHT, useProximitySensor=true, useProximitySensorbyPhone=true, screenBrightness=33, screenAutoBrightnessAdjustment=0.0, brightnessSetByUser=true, useAutoBrightness=true, blockScreenOn=false, lowPowerMode=false, boostScreenBrightness=false, dozeScreenBrightness=-1, dozeScreenState=UNKNOWN, useTwilight=false, useSmartBacklight=true, brightnessWaitMode=false, brightnessWaitRet=true, screenAutoBrightness=-1, userId=0\r",
        "12-17 19:31:36.264  1795  2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823\r",
        "12-17 19:31:36.264  1795  2750 I HwLightsService: back light level before map = 823\r",
        "12-17 19:31:36.264  1795  1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r",
        "12-17 19:31:36.264  1795  1825 V KeyguardServiceDelegate: onScreenTurnedOn()\r",
        "12-17 19:31:36.264  1795  1825 I WindowManger_keyguard: onScreenTurnedOn()\r",
        "12-17 19:31:36.264  1795  1825 D DisplayPowerController: Display ready!\r",
        "12-17 19:31:36.264  1795  1825 D DisplayPowerController: Finished business...\r",
        "12-17 19:31:36.264  2852  3328 D KeyguardService: Caller checkPermission fail\r",
        "12-17 19:31:36.264  2852  3328 D KeyguardService: KGSvcCall onScreenTurnedOn.\r",
        "12-17 19:31:36.264  2852  3328 D KeyguardViewMediator: notifyScreenTurnedOn\r",
        "12-17 19:31:36.265  2852  2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn\r",
        "12-17 19:31:36.265  2852  2852 I PhoneStatusBar: onScreenTurnedOn\r",
        "12-17 19:31:36.265  2852  2852 D KGWallpaper_Magazine: getNextIndex: 0; from 5 to 5; size: 44\r",
        "12-17 19:31:36.265  2852  2852 I HwLockScreenReporter: report msg is :{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg}\r",
        "12-17 19:31:36.265  2852  2852 W HwLockScreenReporter: report result = falsereport type:162 msg:{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg, channelId: 05}\r",
        "12-17 19:31:36.265  2852  2852 I OucScreenOnCounter: Screen already turned on at: 1481974212\r",
        "12-17 19:31:36.267  5224  5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0\r",
        "12-17 19:31:36.270  1795 16500 I HwActivityManagerService: Split enqueueing broadcast [callerApp]:ProcessRecord{580cfb2 5224:com.huawei.health:DaemonService/u0a99}\r",
        "12-17 19:31:36.271  2852  2852 I EventCenter: EventCenter Get :android.com.huawei.bone.NOTIFY_SPORT_DATA\r",
        "12-17 19:31:36.275  7741  7741 D Mms_TX_NOTIFY: Get no-perm notification callback android.intent.action.SCREEN_ON\r",
        "12-17 19:31:36.275  7741  7741 D Mms_TX_NOTIFY: ScreenState present\r",
        "12-17 19:31:36.275  5224  5283 I Step_HSNH: 20002302|upDateHealthNotification()|89|2.98|4180\r",
        "12-17 19:31:36.276  2883  2996 I HwSystemManager: ITrafficInfo:ITrafficInfo create 301updateBytes = 1769320345\r",
        "12-17 19:31:36.278  5224  5283 I Step_HSNH: 20002302|rebuild notification\r",
        "12-17 19:31:36.279  2852  2925 I EventCenter: ContentChange for slot: 1\r",
        "12-17 19:31:36.279  2852  2852 I HwBrightnessController: onChange selfChange:false uri.toString():content://settings/system/screen_auto_brightness mIsObserveAutoBrightnessChange:true\r",
        "12-17 19:31:36.279  1795  1825 D FpDataCollector: case xxx, not a fingerprint unlock \r",
        "12-17 19:31:36.280  1795  1825 D PowerManagerService: ready=true,policy=3,wakefulness=1,wksummary=0x11,uasummary=0x1,bootcompleted=true,boostinprogress=false,waitmodeenable=false,mode=true,manual=33,auto=-1,adj=0.0userId=0\r",
        "12-17 19:31:36.280  1795  1825 I PowerManager_screenOn: PowerManagerNotifier onWakefulnessChangeFinished mInteractiveChanging=true, mInteractive=true\r",
        "12-17 19:31:36.280  2852  2852 I HwBrightnessUtils: APS brightness=20.0,ConvertToPercentage=0.21667233\r",
        "12-17 19:31:36.280  2852  2852 I HwBrightnessUtils:  getSeekBarProgress isAutoMode:true current brightness:20 percentage:0.21667233\r",
        "12-17 19:31:36.280  2852  2852 I HwBrightnessController: updateSlider1 seekBarProgress:2167\r",
        "12-17 19:31:36.280  2852  2852 I HwBrightnessController: updateSlider2 seekBarProgress:2167\r",
        "12-17 19:31:36.280  2852  2852 I ToggleSlider:  mSeekListener onProgressChanged progress:2167 fromUser:false\r",
        "12-17 19:31:36.281  2852  2852 I ToggleSlider:  mSeekListener onProgressChanged progress:2167 fromUser:false\r",
        "12-17 19:31:36.282  3626  3753 I LogCollectService: msg = 103 received\r",
        "12-17 19:31:36.283  1795 11747 I NotificationManager: enqueueNotificationInternal: pkg=com.huawei.health id=10010 notification=Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x2 color=0x00000000 vis=PRIVATE)\r",
        "12-17 19:31:36.284  1795  1795 I NotificationManager: enqueueNotificationInternal: n.getKey = 0|com.huawei.health|10010|null|10099\r",
        "12-17 19:31:36.285  1795  2750 D HW_DISPLAY_EFFECT: presently, hw_update_color_temp_for_rg_led interface not achieved.\r",
        "12-17 19:31:36.285  3466  3466 I Contacts: DialpadFragment mBroadcastReceiver action:android.intent.action.SCREEN_ON\r",
        "12-17 19:31:36.289  3608  3608 D InCall  : InCallActivity - mScreenOnReceiver mCallEndOptionsDialog = null\r",
        "12-17 19:31:36.295  1795  1795 V NotificationService: disableEffects=null canInterrupt=false once update: false\r",
        "12-17 19:31:36.297  2852  2852 I StatusBar: onNotificationPosted: StatusBarNotification(pkg=com.huawei.health user=UserHandle{0} id=10010 tag=null key=0|com.huawei.health|10010|null|10099: Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x62 color=0x00000000 vis=PRIVATE)) important=2, post=1481974296283, when=1481531589202, vis=0, userid=0\r",
        "12-17 19:31:36.297  2852  2852 D StatusBar: updateNotification(StatusBarNotification(pkg=com.huawei.health user=UserHandle{0} id=10010 tag=null key=0|com.huawei.health|10010|null|10099: Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x62 color=0x00000000 vis=PRIVATE)))\r",
        "12-17 19:31:36.298  2852  2852 D HwCust  : Create obj success use class android.app.HwCustNotificationImpl\r",
        "12-17 19:31:36.299  2852  2852 I StatusBarIconView: updateTint: tint=0\r",
        "12-17 19:31:36.300  2852  2852 D StatusBar: No peeking: unimportant notification: 0|com.huawei.health|10010|null|10099\r",
        "12-17 19:31:36.301  2852  2852 D StatusBar: applyInPlace=true shouldPeek=false alertAgain=true\r",
        "12-17 19:31:36.301  2852  2852 I NotificationGroupManager: onEntryUpdated:0|com.huawei.health|10010|null|10099\r",
        "12-17 19:31:36.301  2852  2852 I NotificationGroupManager: onEntryAdded:0|com.huawei.health|10010|null|10099, group=0|com.huawei.health|10010|null|10099\r",
        "12-17 19:31:36.301  2852  2852 D StatusBar: reusing notification for key: 0|com.huawei.health|10010|null|10099\r",
        "12-17 19:31:36.301  2852  2852 D HwCust  : Create obj success use class android.app.HwCustNotificationImpl\r",
        "12-17 19:31:36.301  2852  2852 D HwCust  : Create obj success use class android.app.HwCustNotificationImpl\r",
        "12-17 19:31:36.302  2852  2852 I StatusBarIconView: updateTint: tint=0\r",
        "12-17 19:31:36.304 16628 16628 I TotemWeather: RetryTaskController:mTaskList is null\r",
        "12-17 19:31:36.311  2852  2852 I HwPhoneStatusBar: updateNotificationShade\r",
        "12-17 19:31:36.311  2852  2852 I PhoneStatusBar: updateNotificationShade\r",
        "12-17 19:31:36.311  2852  2852 I PhoneStatusBar: removeNotificationChildren\r",
        "12-17 19:31:36.311  2852  2852 I HwNotificationIconAreaController: showNotificationAll\r",
        "12-17 19:31:36.313 31949 31967 I PushService: main{1} PushService.onStartCommand(PushService.java:87) Push Service Start by  userEvent\r",
    ]
)

Decomposing strings into records

+++

In the {mod}ak.str module there is the {func}ak.str.extract_regex function. This function decomposes an array of strings into an array of records, where each field of the newly created records corresponds to a named group in the regular expression. Let's define a regular expression to match our log

pattern = (
    # Timestamp
    r"(?P<datetime>\d\d-\d\d\s\d\d:\d\d:\d\d)\."
    # Fractional seconds
    r"(?P<datetime_frac>\d\d\d)\s\s"
    # Unknown integers
    r"(?P<i0>\d\d\d\d)\s\s"
    r"(?P<i1>\d\d\d\d)\s"
    # String category
    r"(?P<category>\w)\s"
    # String message
    r"(?P<message>.*)"
)

Does this match the first line?

lines[0]

Let's use the {mod}re module to use the above pattern to parse this line

import re

match = re.match(pattern, lines[0])
match.groupdict()

Let's now apply {func}ak.str.extract_regex to our array of lines using this pattern

structured = ak.str.extract_regex(lines, pattern)
structured

The type of the structured record is an "optional record of optional fields". This is because the match itself can fail (producing the outer option), and individual groups may be missing (producing the inner options). If we know that either all groups succeed or all groups fail, then we can lift the inner options outside the record. To do this, we need to decompose the record and rebuild it with ak.zip, which provides a special optiontype_outside_record argument.

fields = ak.fields(structured)
contents = ak.unzip(structured)

result = ak.zip(dict(zip(fields, contents)), optiontype_outside_record=True)
result
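To see where a failed match (the outer option) can come from, it helps to try the pattern on individual lines. The sketch below uses the standard-library {mod}re module as a stand-in, as the text does for the first line; the helper `parse` is hypothetical, and the second line is a shortened copy of one of the log lines whose second integer has five digits.

```python
import re

pattern = (
    r"(?P<datetime>\d\d-\d\d\s\d\d:\d\d:\d\d)\."
    r"(?P<datetime_frac>\d\d\d)\s\s"
    r"(?P<i0>\d\d\d\d)\s\s"
    r"(?P<i1>\d\d\d\d)\s"
    r"(?P<category>\w)\s"
    r"(?P<message>.*)"
)

matching = "12-17 19:31:36.263  1795  1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0\r"
# Shortened copy of a log line: "1795 16500" has one space and a five-digit integer
non_matching = "12-17 19:31:36.270  1795 16500 I HwActivityManagerService: Split enqueueing broadcast\r"


def parse(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = re.match(pattern, line)
    return None if m is None else m.groupdict()


parsed = [parse(matching), parse(non_matching)]
print(parsed[0]["category"], parsed[0]["i1"])  # I 1825
print(parsed[1])                               # None
```

With this pattern, every group is mandatory, so a line either yields all fields or none at all. That is the situation in which lifting the options outside the record loses no information.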

Splitting and joining strings

+++

Strings in Awkward Array can be joined together and split into sublists. Let's start by creating an array of strings that we can later manipulate. The following timestamp array contains a list of timestamp-like strings

timestamp = ak.from_iter(
    [
        "12-17 19:31:36.263",
        "12-17 19:31:36.263",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.264",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.265",
        "12-17 19:31:36.267",
        "12-17 19:31:36.270",
        "12-17 19:31:36.271",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.275",
        "12-17 19:31:36.276",
        "12-17 19:31:36.278",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.279",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.280",
        "12-17 19:31:36.281",
        "12-17 19:31:36.282",
        "12-17 19:31:36.283",
        "12-17 19:31:36.284",
        "12-17 19:31:36.285",
        "12-17 19:31:36.285",
        "12-17 19:31:36.289",
        "12-17 19:31:36.295",
        "12-17 19:31:36.297",
        "12-17 19:31:36.297",
        "12-17 19:31:36.298",
        "12-17 19:31:36.299",
        "12-17 19:31:36.300",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.301",
        "12-17 19:31:36.302",
        "12-17 19:31:36.304",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.311",
        "12-17 19:31:36.313",
    ]
)

Joining strings together

Parsing datetimes in a performant manner is tricky. Pandas has such an ability, but it uses NumPy's fixed-width strings. Arrow provides strptime, but it does not handle fractional seconds or timedeltas and requires a full date. In order to use Arrow's {func}pyarrow.compute.strptime function, we can manipulate the string to prepend the date, operating only on the non-fraction part of the match.

+++

Let's assume that these timestamps were recorded in the year 2022. We can prepend the string "2022" with the "-" delimiter to complete the timestamp string

timestamp_with_year = ak.str.join_element_wise(["2022"], timestamp, ["-"])
timestamp_with_year

The ["2022"] and ["-"] arrays are broadcast with the timestamp array before joining element-wise.
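The broadcasting step can be pictured with a small plain-Python sketch. This is a simplified stand-in rather than Awkward Array's broadcasting: the hypothetical helper `join_element_wise_sketch` only repeats length-1 lists to the common length, and treats its last argument as the separator, mirroring the call above.

```python
def join_element_wise_sketch(*arrays):
    """Join strings position by position; the last argument holds the separator(s).

    Length-1 lists are repeated to the common length (a simplified form of
    broadcasting). Other length mismatches are not handled in this sketch.
    """
    length = max(len(a) for a in arrays)
    broadcast = [a * length if len(a) == 1 else a for a in arrays]
    *parts, separators = broadcast
    return [sep.join(items) for *items, sep in zip(*parts, separators)]


stamps = ["12-17 19:31:36.263", "12-17 19:31:36.264"]
with_year = join_element_wise_sketch(["2022"], stamps, ["-"])
print(with_year)  # ['2022-12-17 19:31:36.263', '2022-12-17 19:31:36.264']
```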

+++

{func}ak.str.join_element_wise is useful for building new strings from separate arrays. It might also be the case that one has a single array of strings that they wish to join along the final axis (like a reducer). There exists a separate function {func}ak.str.join for such a purpose

ak.str.join(
    [
        ["do", "re", "me"],
        ["fa", "so"],
        ["la"],
        ["ti", "da"],
    ],
    separator="-🎵-",
)
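For comparison, the reducer-like behaviour corresponds to calling Python's `str.join` once per sublist. The sketch below is plain Python and only illustrates the shape of the result (one string per inner list); it is not a statement about how {func}ak.str.join is implemented.

```python
words = [
    ["do", "re", "me"],
    ["fa", "so"],
    ["la"],
    ["ti", "da"],
]
separator = "-🎵-"

# One joined string per inner list; a single-element list is returned unchanged
joined = [separator.join(sublist) for sublist in words]
print(joined)
```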

Splitting strings apart

+++

The timestamps above still cannot be parsed by Arrow; the fractional time component is not (at time of writing) yet supported. To fix this, we can split the fractional component from the timestamp, and add it as a timedelta64[ms] later on.

+++

Let's split each timestamp into its non-fractional and fractional parts using {func}ak.str.split_pattern.

timestamp_split = ak.str.split_pattern(timestamp_with_year, ".", max_splits=1)
timestamp_split

timestamp_non_fractional = timestamp_split[:, 0]
timestamp_fractional = timestamp_split[:, 1]

Now we can parse these timestamps using Arrow!

import pyarrow.compute

datetime = ak.from_arrow(
    pyarrow.compute.strptime(
        ak.to_arrow(timestamp_non_fractional, extensionarray=False),
        "%Y-%m-%d %H:%M:%S",
        "ms",
    )
)
datetime

Finally, we build an offset for the fractional component (in milliseconds) using {func}ak.strings_astype

import numpy as np

datetime_offset = ak.strings_astype(timestamp_fractional, np.dtype("timedelta64[ms]"))
datetime_offset

This offset is added to the absolute datetime to obtain a timestamp

timestamp = datetime + datetime_offset
timestamp
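For a single value, the same three steps (split off the fraction, parse the whole seconds, add the fraction back as a millisecond offset) can be sketched with the standard library. This is a scalar illustration of the approach only; the variable names are made up, and `datetime` is imported as `dt` so that it does not clash with the `datetime` array defined above.

```python
import datetime as dt

example_stamp = "2022-12-17 19:31:36.263"

# 1. Split the fractional part from the rest, at the first "." only
whole, fraction = example_stamp.split(".", 1)

# 2. Parse the non-fractional part with a strptime-style format
parsed = dt.datetime.strptime(whole, "%Y-%m-%d %H:%M:%S")

# 3. Interpret the three fractional digits as milliseconds and add them back
offset = dt.timedelta(milliseconds=int(fraction))
combined = parsed + offset

print(combined.isoformat(timespec="milliseconds"))  # 2022-12-17T19:31:36.263
```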

If we had a different parsing library that could only handle dates and times separately, then we could also split on the whitespace. Although {func}ak.str.split_pattern supports whitespace, it is more performant (and versatile) to use {func}ak.str.split_whitespace

ak.str.split_whitespace(timestamp_with_year)

If we also needed to split off the fractional component (and manually build the time delta), then we could have used {func}ak.str.split_pattern_regex to split on both whitespace and the period

ak.str.split_pattern_regex(timestamp_with_year, r"\.|\s")
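The effect of the two kinds of split on one timestamp can be previewed with the standard library. This sketch uses `str.split` and `re.split` on a single string, as scalar analogues of the vectorised functions above.

```python
import re

example_stamp = "2022-12-17 19:31:36.263"

# Whitespace split: date and time (with its fraction) are separated
print(example_stamp.split())              # ['2022-12-17', '19:31:36.263']

# Regex split on either a period or whitespace: three components
print(re.split(r"\.|\s", example_stamp))  # ['2022-12-17', '19:31:36', '263']
```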

Splitting strings by a delimiter

Just as we might wish to join two strings from different arrays, it is also useful to be able to do the reverse. Consider the message field, which contains a heading prefix followed by a delimiter. We can use a regular expression to match this, but it is faster to use the dedicated split function {func}ak.str.split_pattern that accepts a pattern

+++

%%timeit
ak.str.split_pattern(structured.message, ": ", max_splits=1)

%%timeit
ak.str.extract_regex(
    ak.drop_none(structured.message), r"(?P<head>[^:]+):\s*?(?P<tail>.*)"
)

split_message = ak.str.split_pattern(structured.message, ": ", max_splits=1)
split_message

Some lines do not have a heading

is_small_split = ak.num(split_message) < 2
split_message[ak.fill_none(is_small_split, False)]

The message is always the last item. We can left-pad the array with None values to ensure that there is always a heading (albeit None). {func}ak.pad_none appends None values but we want to prepend them. Therefore, we need to flip the array, perform the padding, and revert the flip.

# Flip each list, pad on the right with None, then flip back so the None ends up in front
padded_message = ak.pad_none(split_message[:, ::-1], target=2)[:, ::-1]
heading = padded_message[:, 0]
message = padded_message[:, 1]
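The flip, pad, flip trick is easiest to see on small Python lists. In this sketch the two messages are illustrative (the first is adapted from the log above, the second is made up), and `left_pad` is a hypothetical helper that mimics right-padding on a reversed list.

```python
messages = [
    "PowerManager_screenOn: level=1.0",  # has a heading before ": "
    "no heading here",                   # made-up message without a delimiter
]


def left_pad(items, target):
    """Prepend None values until the list has at least `target` items."""
    flipped = items[::-1]                                        # flip
    flipped = flipped + [None] * max(0, target - len(flipped))   # pad on the right
    return flipped[::-1]                                         # flip back


# Split at the first ": " only, so later delimiters stay inside the message
split = [m.split(": ", 1) for m in messages]
padded = [left_pad(parts, 2) for parts in split]
print(padded)
# [['PowerManager_screenOn', 'level=1.0'], [None, 'no heading here']]
```

After padding, index 0 is always the heading (possibly None) and index 1 is always the message.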

Categorical types

+++

The {func}ak.str.to_categorical function can be used to encode a categorical type from an array of strings

category = ak.str.to_categorical(structured.category)
category

If we look at the layout for this array, we can see that categorical types are implemented as indexed views into the category array

category.layout

The {func}ak.categories function can be used to reveal the underlying categories

ak.categories(category)

Categorical types are particularly useful for reducing the size of a column composed of large, repeated strings. In this case, there is no performance benefit to converting the strings to a categorical, apart from the pre-computation of the unique values (given by {func}ak.categories).
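The "indexed view into a category array" idea can be sketched in plain Python as a list of unique values plus a list of integer indices. This is an illustration of the encoding only; the ordering of categories by first appearance is a choice made for this sketch, not a claim about the order Awkward Array produces.

```python
values = ["I", "I", "D", "I", "V", "D"]

categories = []  # unique values, stored once
index = []       # one small integer per original element
lookup = {}      # value -> position in `categories`

for value in values:
    if value not in lookup:
        lookup[value] = len(categories)
        categories.append(value)
    index.append(lookup[value])

print(categories)  # ['I', 'D', 'V']
print(index)       # [0, 0, 1, 0, 2, 1]

# Decoding is an index lookup into the categories
decoded = [categories[i] for i in index]
```

With single-character values like these, the indices are no smaller than the strings themselves, which is why the saving only appears for large, repeated strings.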

+++

Putting this all together, we have

log = ak.zip(
    {
        "timestamp": timestamp,
        "category": category,
        "heading": heading,
        "message": message,
    }
)
log