Awkward Array implements support for ragged strings as ragged lists of code-units. As such, successive strings are closely packed in memory, leading to high-performance operations.
+++
+++
Suppose that we want to read some logging output that is stored in a text file, for example a subset of logs from the Android application framework.
%%bash --out path
pushd $(mktemp -d) >/dev/null 2>&1
wget https://zenodo.org/record/8196385/files/Android_v1.zip >/dev/null 2>&1
unzip Android_v1.zip >/dev/null 2>&1
realpath Android.log
What do these logs look like?
!head {path}
To begin with, we can open these logs as an array of {data}np.uint8
dtype using NumPy, and convert the resulting array to an Awkward Array
import awkward as ak
import numpy as np
with open(path.strip(), "rb") as f:
    arr = np.fromfile(f, dtype=np.uint8)
raw_bytes = ak.from_numpy(arr)
raw_bytes.type.show()
Awkward Array doesn't support scalar values, so we can't treat these characters as a single string. Instead, we need at least one dimension. Let's unflatten our array of characters to form a length-1 array of strings.
string = ak.enforce_type(ak.unflatten(raw_bytes, len(raw_bytes)), "string")
string.type.show()
The underlying mechanism for implementing strings as lists of code-units can be seen if we inspect the low-level layout that builds the array
string.layout
The __array__
parameter is special. It is reserved by Awkward Array, and signals that the layout is a special, pre-understood built-in type. In this case, the type of the outer ListOffsetArray
is "string". It can also be seen that the inner NumpyArray
also has an __array__
parameter, this time with a value of char
. In Awkward Array, an array of strings must look like this layout; a list with the __array__="string"
parameter wrapping a NumpyArray
with the __array__="char"
parameter.
+++
A single (very long) string isn't much use. Let's split this string at the line boundaries
split_at_newlines = ak.str.split_pattern(string, "\n")
split_at_newlines
Now we can remove the temporary length-1 outer dimension that was required to treat the data as a string
lines = split_at_newlines[0]
lines
In the low-level layout, we can see that these lines are still just variable-length lists
lines.layout
In general, whilst strings can fundamentally be described as lists of bytes (code-units), many string operations do not operate at the byte-level. The {mod}ak.str
submodule provides a suite of vectorised operations that operate at the code-point (not code-unit) level, such as computing the string length. Consider the following simple string
large_code_point = ak.Array(["Å"])
In Awkward Array, strings are UTF-8 encoded, meaning that a single code-point may comprise up to four code-units (bytes). Although it looks like this is a single character, if we look at the layout it's clear that the number of code-units is in fact two
large_code_point.layout
This is reflected in the {func}ak.num
function
ak.num(large_code_point)
The {mod}ak.str
module provides a function for computing the length of a string
ak.str.length(large_code_point)
Clearly this function is code-point aware.
+++
+++
Let's consider the following log data
import awkward as ak
lines = ak.from_iter(
[
"12-17 19:31:36.263 1795 1825 I PowerManager_screenOn: DisplayPowerStatesetColorFadeLevel: level=1.0\r",
"12-17 19:31:36.263 5224 5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0\r",
"12-17 19:31:36.264 1795 1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r",
"12-17 19:31:36.264 1795 1825 I PowerManager_screenOn: DisplayPowerController updatePowerState mPendingRequestLocked=policy=BRIGHT, useProximitySensor=true, useProximitySensorbyPhone=true, screenBrightness=33, screenAutoBrightnessAdjustment=0.0, brightnessSetByUser=true, useAutoBrightness=true, blockScreenOn=false, lowPowerMode=false, boostScreenBrightness=false, dozeScreenBrightness=-1, dozeScreenState=UNKNOWN, useTwilight=false, useSmartBacklight=true, brightnessWaitMode=false, brightnessWaitRet=true, screenAutoBrightness=-1, userId=0\r",
"12-17 19:31:36.264 1795 2750 I PowerManager_screenOn: DisplayPowerState Updating screen state: state=ON, backlight=823\r",
"12-17 19:31:36.264 1795 2750 I HwLightsService: back light level before map = 823\r",
"12-17 19:31:36.264 1795 1825 D DisplayPowerController: Animating brightness: target=21, rate=40\r",
"12-17 19:31:36.264 1795 1825 V KeyguardServiceDelegate: onScreenTurnedOn()\r",
"12-17 19:31:36.264 1795 1825 I WindowManger_keyguard: onScreenTurnedOn()\r",
"12-17 19:31:36.264 1795 1825 D DisplayPowerController: Display ready!\r",
"12-17 19:31:36.264 1795 1825 D DisplayPowerController: Finished business...\r",
"12-17 19:31:36.264 2852 3328 D KeyguardService: Caller checkPermission fail\r",
"12-17 19:31:36.264 2852 3328 D KeyguardService: KGSvcCall onScreenTurnedOn.\r",
"12-17 19:31:36.264 2852 3328 D KeyguardViewMediator: notifyScreenTurnedOn\r",
"12-17 19:31:36.265 2852 2852 D KeyguardViewMediator: handleNotifyScreenTurnedOn\r",
"12-17 19:31:36.265 2852 2852 I PhoneStatusBar: onScreenTurnedOn\r",
"12-17 19:31:36.265 2852 2852 D KGWallpaper_Magazine: getNextIndex: 0; from 5 to 5; size: 44\r",
"12-17 19:31:36.265 2852 2852 I HwLockScreenReporter: report msg is :{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg}\r",
"12-17 19:31:36.265 2852 2852 W HwLockScreenReporter: report result = falsereport type:162 msg:{picture: Deepwater-05-2.3.001-bigpicture_05_8.jpg, channelId: 05}\r",
"12-17 19:31:36.265 2852 2852 I OucScreenOnCounter: Screen already turned on at: 1481974212\r",
"12-17 19:31:36.267 5224 5283 I SendBroadcastPermission: action:android.com.huawei.bone.NOTIFY_SPORT_DATA, mPermissionType:0\r",
"12-17 19:31:36.270 1795 16500 I HwActivityManagerService: Split enqueueing broadcast [callerApp]:ProcessRecord{580cfb2 5224:com.huawei.health:DaemonService/u0a99}\r",
"12-17 19:31:36.271 2852 2852 I EventCenter: EventCenter Get :android.com.huawei.bone.NOTIFY_SPORT_DATA\r",
"12-17 19:31:36.275 7741 7741 D Mms_TX_NOTIFY: Get no-perm notification callback android.intent.action.SCREEN_ON\r",
"12-17 19:31:36.275 7741 7741 D Mms_TX_NOTIFY: ScreenState present\r",
"12-17 19:31:36.275 5224 5283 I Step_HSNH: 20002302|upDateHealthNotification()|89|2.98|4180\r",
"12-17 19:31:36.276 2883 2996 I HwSystemManager: ITrafficInfo:ITrafficInfo create 301updateBytes = 1769320345\r",
"12-17 19:31:36.278 5224 5283 I Step_HSNH: 20002302|rebuild notification\r",
"12-17 19:31:36.279 2852 2925 I EventCenter: ContentChange for slot: 1\r",
"12-17 19:31:36.279 2852 2852 I HwBrightnessController: onChange selfChange:false uri.toString():content://settings/system/screen_auto_brightness mIsObserveAutoBrightnessChange:true\r",
"12-17 19:31:36.279 1795 1825 D FpDataCollector: case xxx, not a fingerprint unlock \r",
"12-17 19:31:36.280 1795 1825 D PowerManagerService: ready=true,policy=3,wakefulness=1,wksummary=0x11,uasummary=0x1,bootcompleted=true,boostinprogress=false,waitmodeenable=false,mode=true,manual=33,auto=-1,adj=0.0userId=0\r",
"12-17 19:31:36.280 1795 1825 I PowerManager_screenOn: PowerManagerNotifier onWakefulnessChangeFinished mInteractiveChanging=true, mInteractive=true\r",
"12-17 19:31:36.280 2852 2852 I HwBrightnessUtils: APS brightness=20.0,ConvertToPercentage=0.21667233\r",
"12-17 19:31:36.280 2852 2852 I HwBrightnessUtils: getSeekBarProgress isAutoMode:true current brightness:20 percentage:0.21667233\r",
"12-17 19:31:36.280 2852 2852 I HwBrightnessController: updateSlider1 seekBarProgress:2167\r",
"12-17 19:31:36.280 2852 2852 I HwBrightnessController: updateSlider2 seekBarProgress:2167\r",
"12-17 19:31:36.280 2852 2852 I ToggleSlider: mSeekListener onProgressChanged progress:2167 fromUser:false\r",
"12-17 19:31:36.281 2852 2852 I ToggleSlider: mSeekListener onProgressChanged progress:2167 fromUser:false\r",
"12-17 19:31:36.282 3626 3753 I LogCollectService: msg = 103 received\r",
"12-17 19:31:36.283 1795 11747 I NotificationManager: enqueueNotificationInternal: pkg=com.huawei.health id=10010 notification=Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x2 color=0x00000000 vis=PRIVATE)\r",
"12-17 19:31:36.284 1795 1795 I NotificationManager: enqueueNotificationInternal: n.getKey = 0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.285 1795 2750 D HW_DISPLAY_EFFECT: presently, hw_update_color_temp_for_rg_led interface not achieved.\r",
"12-17 19:31:36.285 3466 3466 I Contacts: DialpadFragment mBroadcastReceiver action:android.intent.action.SCREEN_ON\r",
"12-17 19:31:36.289 3608 3608 D InCall : InCallActivity - mScreenOnReceiver mCallEndOptionsDialog = null\r",
"12-17 19:31:36.295 1795 1795 V NotificationService: disableEffects=null canInterrupt=false once update: false\r",
"12-17 19:31:36.297 2852 2852 I StatusBar: onNotificationPosted: StatusBarNotification(pkg=com.huawei.health user=UserHandle{0} id=10010 tag=null key=0|com.huawei.health|10010|null|10099: Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x62 color=0x00000000 vis=PRIVATE)) important=2, post=1481974296283, when=1481531589202, vis=0, userid=0\r",
"12-17 19:31:36.297 2852 2852 D StatusBar: updateNotification(StatusBarNotification(pkg=com.huawei.health user=UserHandle{0} id=10010 tag=null key=0|com.huawei.health|10010|null|10099: Notification(pri=0 contentView=null vibrate=null sound=null defaults=0x0 flags=0x62 color=0x00000000 vis=PRIVATE)))\r",
"12-17 19:31:36.298 2852 2852 D HwCust : Create obj success use class android.app.HwCustNotificationImpl\r",
"12-17 19:31:36.299 2852 2852 I StatusBarIconView: updateTint: tint=0\r",
"12-17 19:31:36.300 2852 2852 D StatusBar: No peeking: unimportant notification: 0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.301 2852 2852 D StatusBar: applyInPlace=true shouldPeek=false alertAgain=true\r",
"12-17 19:31:36.301 2852 2852 I NotificationGroupManager: onEntryUpdated:0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.301 2852 2852 I NotificationGroupManager: onEntryAdded:0|com.huawei.health|10010|null|10099, group=0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.301 2852 2852 D StatusBar: reusing notification for key: 0|com.huawei.health|10010|null|10099\r",
"12-17 19:31:36.301 2852 2852 D HwCust : Create obj success use class android.app.HwCustNotificationImpl\r",
"12-17 19:31:36.301 2852 2852 D HwCust : Create obj success use class android.app.HwCustNotificationImpl\r",
"12-17 19:31:36.302 2852 2852 I StatusBarIconView: updateTint: tint=0\r",
"12-17 19:31:36.304 16628 16628 I TotemWeather: RetryTaskController:mTaskList is null\r",
"12-17 19:31:36.311 2852 2852 I HwPhoneStatusBar: updateNotificationShade\r",
"12-17 19:31:36.311 2852 2852 I PhoneStatusBar: updateNotificationShade\r",
"12-17 19:31:36.311 2852 2852 I PhoneStatusBar: removeNotificationChildren\r",
"12-17 19:31:36.311 2852 2852 I HwNotificationIconAreaController: showNotificationAll\r",
"12-17 19:31:36.313 31949 31967 I PushService: main{1} PushService.onStartCommand(PushService.java:87) Push Service Start by userEvent\r",
]
)
+++
In the {mod}ak.str
module there is the {func}ak.str.extract_regex
function. This function decomposes an array of strings into an array of records, where each field of the newly created records corresponds to a named group in the regular expression. Let's define a regular expression to match our log
pattern = (
    # Timestamp
    r"(?P<datetime>\d\d-\d\d\s\d\d:\d\d:\d\d)\."
    # Fractional seconds
    r"(?P<datetime_frac>\d\d\d)\s\s"
    # Unknown integers
    r"(?P<i0>\d\d\d\d)\s\s"
    r"(?P<i1>\d\d\d\d)\s"
    # String category
    r"(?P<category>\w)\s"
    # String message
    r"(?P<message>.*)"
)
Does this match the first line?
lines[0]
Let's use the {mod}re
module to use the above pattern to parse this line
import re
match = re.match(pattern, lines[0])
match.groupdict()
Let's now apply {func}ak.str.extract_regex
to our array of lines using this pattern
structured = ak.str.extract_regex(lines, pattern)
structured
The type of the structured
record is an "optional record of optional fields". This is because the match itself can fail (producing the outer option), and individual groups within a successful match may be missing (producing the inner options). If we know that the groups either all succeed or all fail together, then we can lift the inner options outside the record. To do this, we need to decompose the record, and rebuild it with ak.zip
which provides a special optiontype_outside_record
argument.
fields = ak.fields(structured)
contents = ak.unzip(structured)
result = ak.zip(dict(zip(fields, contents)), optiontype_outside_record=True)
result
+++
Strings in Awkward Array can be joined together and split into sublists. Let's start by creating an array of strings that we can later manipulate. The following timestamp
array contains a list of timestamp-like strings
timestamp = ak.from_iter(
[
"12-17 19:31:36.263",
"12-17 19:31:36.263",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.264",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.265",
"12-17 19:31:36.267",
"12-17 19:31:36.270",
"12-17 19:31:36.271",
"12-17 19:31:36.275",
"12-17 19:31:36.275",
"12-17 19:31:36.275",
"12-17 19:31:36.276",
"12-17 19:31:36.278",
"12-17 19:31:36.279",
"12-17 19:31:36.279",
"12-17 19:31:36.279",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.280",
"12-17 19:31:36.281",
"12-17 19:31:36.282",
"12-17 19:31:36.283",
"12-17 19:31:36.284",
"12-17 19:31:36.285",
"12-17 19:31:36.285",
"12-17 19:31:36.289",
"12-17 19:31:36.295",
"12-17 19:31:36.297",
"12-17 19:31:36.297",
"12-17 19:31:36.298",
"12-17 19:31:36.299",
"12-17 19:31:36.300",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.301",
"12-17 19:31:36.302",
"12-17 19:31:36.304",
"12-17 19:31:36.311",
"12-17 19:31:36.311",
"12-17 19:31:36.311",
"12-17 19:31:36.311",
"12-17 19:31:36.313",
]
)
Parsing datetimes in a performant manner is tricky. Pandas has such an ability, but it uses NumPy's fixed-width strings. Arrow provides strptime
, but it does not handle fractional seconds or timedeltas and requires a full date. In order to use Arrow's {func}pyarrow.compute.strptime
function, we can manipulate the string to prepend the date, operating only on the non-fraction part of the match.
+++
Let's assume that these timestamps were recorded in the year 2022. We can prepend the string "2022" with the "-" delimiter to complete the timestamp string
timestamp_with_year = ak.str.join_element_wise(["2022"], timestamp, ["-"])
timestamp_with_year
The ["2022"]
and ["-"]
arrays are broadcast with the timestamp
array before joining element-wise.
+++
{func}ak.str.join_element_wise
is useful for building new strings from separate arrays. It might also be the case that one has a single array of strings that they wish to join along the final axis (like a reducer). There exists a separate function {func}ak.str.join
for such a purpose
ak.str.join(
    [
        ["do", "re", "me"],
        ["fa", "so"],
        ["la"],
        ["ti", "da"],
    ],
    separator="-🎵-",
)
+++
The timestamps above still cannot be parsed by Arrow; the fractional time component is not (at time of writing) yet supported. To fix this, we can split the fractional component from the timestamp, and add it as a timedelta64[ms]
later on.
+++
Let's split the fractional time component into two parts using {func}ak.str.split_pattern
.
timestamp_split = ak.str.split_pattern(timestamp_with_year, ".", max_splits=1)
timestamp_split
timestamp_non_fractional = timestamp_split[:, 0]
timestamp_fractional = timestamp_split[:, 1]
Now we can parse these timestamps using Arrow!
import pyarrow.compute
datetime = ak.from_arrow(
    pyarrow.compute.strptime(
        ak.to_arrow(timestamp_non_fractional, extensionarray=False),
        "%Y-%m-%d %H:%M:%S",
        "ms",
    )
)
datetime
Finally, we build an offset for the fractional component (in milliseconds) using {func}ak.strings_astype
import numpy as np
datetime_offset = ak.strings_astype(timestamp_fractional, np.dtype("timedelta64[ms]"))
datetime_offset
This offset is added to the absolute datetime to obtain a timestamp
timestamp = datetime + datetime_offset
timestamp
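The arithmetic in this last step is ordinary NumPy datetime arithmetic. As a minimal sketch, here is the same combination written with plain NumPy arrays; the two sample values are illustrative, and the string-to-integer conversion stands in for what the Awkward functions do above.

```python
import numpy as np

# Whole-second part, stored with millisecond resolution
base = np.array(
    ["2022-12-17T19:31:36", "2022-12-17T19:31:36"], dtype="datetime64[ms]"
)

# Fractional part: digit strings -> integers -> millisecond offsets
fraction = np.array(["263", "264"]).astype(np.int64).astype("timedelta64[ms]")

# datetime64 + timedelta64 yields datetime64
combined = base + fraction
print(combined)
```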
If we had a different parsing library that could only handle dates and times separately, then we could also split on the whitespace. Although {func}ak.str.split_pattern
supports whitespace, it is more performant (and versatile) to use {func}ak.str.split_whitespace
ak.str.split_whitespace(timestamp_with_year)
If we also needed to split off the fractional component (and manually build the time delta), then we could have used {func}ak.str.split_pattern_regex
to split on both whitespace and the period
ak.str.split_pattern_regex(timestamp_with_year, r"\.|\s")
Just as we might wish to join two strings from different arrays, it is also useful to be able to do the reverse. Consider the message
field, which contains a heading prefix, followed by a delimiter. We can use a regular expression to match this, but it is faster to use the dedicated split function {func}ak.str.split_pattern
that accepts a pattern
+++
%%timeit
ak.str.split_pattern(structured.message, ": ", max_splits=1)
%%timeit
ak.str.extract_regex(
    ak.drop_none(structured.message), r"(?P<head>[^:]+):\s*?(?P<tail>.*)"
)
split_message = ak.str.split_pattern(structured.message, ": ", max_splits=1)
split_message
Some lines do not have a heading
is_small_split = ak.num(split_message) < 2
split_message[ak.fill_none(is_small_split, False)]
The message is always the last item. We can left-pad the array with None
values to ensure that there is always a heading (albeit None
). {func}ak.pad_none
appends None
values but we want to prepend them. Therefore, we need to flip the array, perform the padding, and revert the flip.
# Flip each list, pad with None on the right, then flip back
flipped_message = split_message[:, ::-1]
padded_message = ak.pad_none(flipped_message, target=2)[:, ::-1]
heading = padded_message[:, 0]
message = padded_message[:, 1]
+++
The {func}ak.str.to_categorical
function can be used to encode a categorical type from an array of strings
category = ak.str.to_categorical(structured.category)
category
If we look at the layout for this array, we can see that categorical types are implemented as indexed views into the category array
category.layout
The {func}ak.categories
function can be used to reveal the underlying categories
ak.categories(category)
Categorical types are particularly useful for reducing the size of a column comprised of large, repeated strings. In this case, there is no performance benefit to converting the strings to a categorical, apart from the pre-computation of the unique values (given by {func}ak.categories
).
+++
Putting this all together, we have
log = ak.zip(
    {
        "timestamp": timestamp,
        "category": category,
        "heading": heading,
        "message": message,
    }
)
log