Using Hex, Python, Segment, and AWS S3 for easy data visualization and collaboration
October 20, 2022
Summary
Analyze and visualize your first-party customer event data with this recipe, which combines Hex’s charting capabilities with data stored in an AWS S3 bucket (a popular Segment destination). You’ll also be able to publish and share your data work, allowing you to collaborate with your colleagues.
Intro
Segment customers who collect first-party customer data often want to analyze it and build visualizations. In this recipe, we show an easy, interactive way for Segment users to share that work with their colleagues without sending screenshots back and forth. We do so by leveraging Hex, a little bit of Python code, and Segment’s AWS S3 destination for easy data visualization and collaboration.
With Segment, you can collect, transform, send, and archive your first-party customer data. Segment simplifies the process of collecting data and connecting new tools, allowing you to spend more time using your data and less time trying to collect it. A very popular destination for this data is AWS S3, an object storage service offering industry-leading scalability, data availability, security, and performance.
With your data stored in AWS S3, you can now use it within Hex. Hex is a modern data workspace which makes it easy to:
- connect to data
- analyze it in collaborative SQL and Python-powered notebooks
- share your work as interactive data apps and stories
What do you need?
- A Segment workspace
- A properly configured source (like a website, mobile app, or server) and an AWS S3 destination
- A Hex.tech account
- Ability to read/write code in Python
Step 1 - Add an AWS S3 Destination
Follow the Segment documentation (https://segment.com/docs/connections/storage/catalog/aws-s3/) to add an AWS S3 destination to your Segment workspace and connect it to an existing source. Make note of your bucket’s name. Once set up, verify that events are flowing into your S3 bucket (you may have to wait ~1 hour for files to appear). Also take note of the path used by Segment in your S3 bucket.
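If you’d rather verify from code than click around the AWS console, a quick way to confirm objects are arriving is to list the Segment prefix with boto3. This is a minimal sketch; the bucket name is a placeholder, and it assumes AWS credentials are already configured locally:

```python
import boto3

# Assumes credentials are already configured locally
# (e.g. via `aws configure` or environment variables).
s3 = boto3.resource('s3')

# Replace with your bucket's name; Segment writes objects
# under the "segment-logs/<source-id>/" prefix.
bucket = s3.Bucket('my-segment-bucket')
for obj in bucket.objects.filter(Prefix='segment-logs/'):
    print(obj.key, obj.size)
```

If the loop prints gzipped log keys, events are landing in the bucket.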
Step 2 - Create/Retrieve API key(s) from AWS
In order for Hex to read data from the S3 bucket previously set up as a Destination in your Segment workspace, you’ll need to generate a new set of AWS API credentials. Once you open the AWS Console, click on the following in this order:
1. Your username near the top right and select My Security Credentials
2. Users in the sidebar
3. Your username
4. Security Credentials tab
5. Create Access Key
6. Show User Security Credentials
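Before adding the new key pair to Hex, you can sanity-check it with a short boto3 call. This is an optional sketch; the placeholder strings stand in for the credentials you just created:

```python
import boto3

# Placeholders -- substitute the access key pair generated above.
session = boto3.Session(
    aws_access_key_id='YOUR_ACCESS_KEY_ID',
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
)

# get_caller_identity() succeeds for any valid credentials and
# reports which IAM identity they belong to.
print(session.client('sts').get_caller_identity()['Arn'])
```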
Step 3 - Create a new project in Hex
1. As a first step, create Secrets for the two AWS credentials: AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. Additionally, create a Secret for the name of your S3 bucket.
   1. To create the Secrets, navigate to the Variables tab and hit the + Add button. In this example, the Secrets are named aws_access_key_id, aws_secret_access_key, and s3_bucket_name. Hex exposes each Secret to Python cells as a variable of the same name, which is how the code in the Appendix references them.
2. Add a Python cell and copy in the code from the Appendix below.
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1666478513383_Screen+Shot+2022-10-22+at+6.41.24+PM.png)
3. In order to visualize our results, we’ll add a Table cell using df_froms3 as the dataframe
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1667925590298_Screen+Shot+2022-11-08+at+11.38.38+AM.png)
4. Because we’re only interested in Track events, we’ll now add a Filter cell with Source set to df_froms3 (an equivalent pandas snippet is sketched after this list).
   1. Add a filter where type is equal to track.
   2. Save the result as filter_result (the default).
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1667925376048_Screen+Shot+2022-11-08+at+11.34.50+AM.png)
5. Now that our data has been cleansed, we’ll add a Chart cell using filter_result for our Data.
   1. Let’s analyze how frequently each Track event appears in our data. To do so, set:
      1. Mark to Bar
      2. X-axis to event
      3. Y-axis to sent_at or any other field, then:
         1. set Type to Quantitative
         2. set Aggregate to Count
   2. The resulting chart should look something like this:
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1667926555170_Screen+Shot+2022-10-22+at+6.19.11+PM.png)
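For readers who prefer code over no-code cells, the Filter and Chart steps above can be reproduced with a couple of pandas expressions. This sketch assumes df_froms3 has already been built by the Appendix code:

```python
# Keep only Track calls, mirroring the Filter cell above
filter_result = df_froms3[df_froms3['type'] == 'track']

# Count how often each Track event appears -- the same aggregation
# the Chart cell performs (Count of sent_at, grouped by event)
event_counts = filter_result['event'].value_counts()
print(event_counts)
```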
(Optional) Step 4 - Publish/share your work
1. You can make your work available to others in your organization by publishing and sharing it.
![](https://paper-attachments.dropboxusercontent.com/s_6D8E727545722B7458881B3777F72091F9AD268811C126C507E1C656D1A42E25_1667928395336_Screen+Shot+2022-11-08+at+12.25.44+PM.png)
2. Here is the published app used in this recipe.
Conclusion
In this recipe, we learned how to set up an AWS S3 destination in Segment, send our event data there, and then read that data using Hex (and some Python). In Hex, we ultimately created a chart to show how frequently each type of event appears in our data set. This work can then be shared and published, allowing our colleagues to easily collaborate with us.
Appendix
Python code
```python
import os

import boto3
import pandas as pd
import s3fs  # required by pandas to read from s3:// paths

# Set environment variables from the Secrets defined in the Hex project
os.environ['AWS_ACCESS_KEY_ID'] = aws_access_key_id
os.environ['AWS_SECRET_ACCESS_KEY'] = aws_secret_access_key
os.environ['S3_BUCKET_NAME'] = s3_bucket_name

session = boto3.Session(aws_access_key_id, aws_secret_access_key)

# Path created by the Segment AWS S3 Destination -
# https://segment.com/docs/connections/storage/catalog/aws-s3/#data-format
prefix = "segment-logs/8ss0NFrJnz/1666224000000/"

s3 = session.resource('s3')
bucket = s3.Bucket(name=s3_bucket_name)

# Collect the keys of all objects under the Segment prefix
s3_files = []
for obj in bucket.objects.filter(Prefix=prefix):
    s3_files.append(obj.key)

# Create a single dataframe from all matching files in the S3 bucket
df_froms3 = pd.concat(
    (pd.read_json(f's3://{s3_bucket_name}/{file}', compression='gzip', lines=True)
     for file in s3_files),
    ignore_index=True,
)

# Drop the integrations column due to a Hex bug -
# "we do not seem to be able to handle a column that only contains empty objects."
df_froms3 = df_froms3.drop(columns=['integrations'])
```
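After the cell runs, a quick sanity check on the resulting dataframe can confirm the load worked (the exact shape and event mix will, of course, depend on your data):

```python
print(df_froms3.shape)                   # (rows, columns) loaded from S3
print(df_froms3['type'].value_counts())  # mix of track / page / identify calls
```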