Using Hex, Python, Segment, and AWS S3 for easy data visualization and collaboration
October 20, 2022
Summary
Analyze and visualize your first-party customer event data with this recipe, which combines Hex’s charting capabilities with data stored in an AWS S3 bucket (a popular Segment destination). You’ll also be able to publish and share your data work, allowing you to collaborate with your colleagues.
Intro
Segment customers who collect first-party customer data often want to analyze it and build visualizations. In this recipe, we show an easy, interactive way for Segment users to share that work with their colleagues without sending screenshots back and forth. We do so by leveraging Hex, a little bit of Python code, and Segment’s AWS S3 destination for easy data visualization and collaboration.
With Segment, you can collect, transform, send, and archive your first-party customer data. Segment simplifies the process of collecting data and connecting new tools, allowing you to spend more time using your data and less time trying to collect it. A very popular destination for this data is AWS S3, an object storage service offering industry-leading scalability, data availability, security, and performance.
With your data stored in AWS S3, you can now use it within Hex. Hex is a modern data workspace which makes it easy to:
- connect to data
- analyze it in collaborative SQL and Python-powered notebooks
- share your work as interactive data apps and stories
What do you need?
- A Segment workspace
- A properly configured source (like a website, mobile app, or server) and an AWS S3 destination
- A Hex.tech account
- Ability to read/write code in Python
Step 1 - Add an AWS S3 Destination
Follow the Segment documentation (https://segment.com/docs/connections/storage/catalog/aws-s3/) to add an AWS S3 destination to your Segment workspace and connect it to an existing source. Make note of your bucket’s name. Once set up, verify that events are flowing into your S3 bucket (you may have to wait ~1 hour for files to appear). Also take note of the path used by Segment in your S3 bucket.
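If you’d rather verify from code than click around the AWS console, a quick way to confirm objects are arriving is to list the Segment prefix with boto3. This is a minimal sketch; the bucket name is a placeholder, and it assumes AWS credentials are already configured locally:

```python
import boto3

# Assumes credentials are already configured locally
# (e.g. via `aws configure` or environment variables).
s3 = boto3.resource('s3')

# Replace with your bucket's name; Segment writes objects
# under the "segment-logs/<source-id>/" prefix.
bucket = s3.Bucket('my-segment-bucket')
for obj in bucket.objects.filter(Prefix='segment-logs/'):
    print(obj.key, obj.size)
```

If the loop prints gzipped log keys, events are landing in the bucket.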
Step 2 - Create/Retrieve API key(s) from AWS
In order for Hex to read data from the S3 bucket previously set up as a Destination in your Segment workspace, you’ll need to generate a new set of AWS API credentials. Once you open the AWS Console, click on the following in this order:
1. Your username near the top right and select My Security Credentials
2. Users in the sidebar
3. Your username
4. Security Credentials tab
5. Create Access Key
6. Show User Security Credentials
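Before adding the new key pair to Hex, you can sanity-check it with a short boto3 call. This is an optional sketch; the placeholder strings stand in for the credentials you just created:

```python
import boto3

# Placeholders -- substitute the access key pair generated above.
session = boto3.Session(
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
)

# get_caller_identity() succeeds for any valid credentials and
# reports which IAM identity they belong to.
print(session.client('sts').get_caller_identity()['Arn'])
```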
Step 3 - Create a new project in Hex
1. As a first step, create Secrets for the two AWS credentials: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Additionally, create a Secret for the name of your S3 bucket.
   1. To create the Secrets, navigate to the Variables tab and hit the + Add button. In this example, the Secrets are named aws_access_key_id, aws_secret_access_key, and s3_bucket_name. Hex exposes each Secret to Python cells as a variable of the same name, which is how the code in the Appendix references them.
2. Add a Python cell and copy in the code from the Appendix below.
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1666478513383_Screen+Shot+2022-10-22+at+6.41.24+PM.png)
3. In order to visualize our results, we’ll add a Table cell using df_froms3 as the dataframe
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1667925590298_Screen+Shot+2022-11-08+at+11.38.38+AM.png)
4. Because we’re only interested in Track events, we’ll now add a Filter cell with Source set to df_froms3 (an equivalent pandas snippet is sketched after this list).
   1. Add a filter where type is equal to track.
   2. Save the result as filter_result (the default).
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1667925376048_Screen+Shot+2022-11-08+at+11.34.50+AM.png)
5. Now that our data has been cleansed, we’ll add a Chart cell using filter_result for our Data.
   1. Let’s analyze how frequently each Track event appears in our data. To do so, set:
      1. Mark to Bar
      2. X-axis to event
      3. Y-axis to sent_at or any other field, then:
         1. set Type to Quantitative
         2. set Aggregate to Count
   2. The resulting chart should look something like this:
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1667926555170_Screen+Shot+2022-10-22+at+6.19.11+PM.png)
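For readers who prefer code over no-code cells, the Filter and Chart steps above can be reproduced with a couple of pandas expressions. This sketch assumes df_froms3 has already been built by the Appendix code:

```python
# Keep only Track calls, mirroring the Filter cell above
filter_result = df_froms3[df_froms3['type'] == 'track']

# Count how often each Track event appears -- the same aggregation
# the Chart cell performs (Count of sent_at, grouped by event)
event_counts = filter_result['event'].value_counts()
print(event_counts)
```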
(Optional) Step 4 - Publish/share your work
1. You can make your work available to others in your organization by publishing and sharing it.
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1667928395336_Screen+Shot+2022-11-08+at+12.25.44+PM.png)
2. Here is the published app used in this recipe.
Conclusion
In this recipe, we learned how to set up an AWS S3 destination in Segment, send our event data there, and then read that data using Hex (and some Python). In Hex, we ultimately created a chart to show how frequently each type of event appears in our data set. This work can then be shared and published, allowing our colleagues to easily collaborate with us.
Appendix
Python code
```python
import os

import boto3
import pandas as pd
import s3fs  # required by pandas to read from s3:// paths

# Set environment variables from the Secrets defined in the Hex project
os.environ['AWS_ACCESS_KEY_ID'] = aws_access_key_id
os.environ['AWS_SECRET_ACCESS_KEY'] = aws_secret_access_key
os.environ['S3_BUCKET_NAME'] = s3_bucket_name

session = boto3.Session(aws_access_key_id, aws_secret_access_key)

# Path created by the Segment AWS S3 Destination -
# https://segment.com/docs/connections/storage/catalog/aws-s3/#data-format
prefix = "segment-logs/8ss0NFrJnz/1666224000000/"

s3 = session.resource('s3')
bucket = s3.Bucket(name=s3_bucket_name)

# Collect the keys of all objects under the Segment prefix
s3_files = []
for obj in bucket.objects.filter(Prefix=prefix):
    s3_files.append(obj.key)

# Create a single dataframe from all matching files in the S3 bucket
df_froms3 = pd.concat(
    (pd.read_json(f's3://{s3_bucket_name}/{file}', compression='gzip', lines=True)
     for file in s3_files),
    ignore_index=True,
)

# Drop the integrations column due to a Hex bug -
# "we do not seem to be able to handle a column that only contains empty objects."
df_froms3 = df_froms3.drop(columns=['integrations'])
```
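After the cell runs, a quick sanity check on the resulting dataframe can confirm the load worked (the exact shape and event mix will, of course, depend on your data):

```python
print(df_froms3.shape)                   # (rows, columns) loaded from S3
print(df_froms3['type'].value_counts())  # mix of track / page / identify calls
```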