@lukebarousse
Last active January 13, 2024 09:33
SerpApi Results to BigQuery - Google Cloud Function
import base64
import datetime

import pandas as pd
from serpapi import GoogleSearch
from google.cloud import bigquery


def hello_pubsub(event, context):
    search_term = "data analyst"
    search_location = "United States"

    for num in range(45):
        start = num * 10
        params = {
            "api_key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # Fill in with your API key from SerpApi
            "device": "desktop",
            "engine": "google_jobs",
            "google_domain": "google.com",
            "q": search_term,
            "hl": "en",
            "gl": "us",
            "location": search_location,
            "chips": "date_posted:today",
            "start": start,
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        # stop once we hit the last search page (i.e., no more results)
        try:
            if results['error'] == "Google hasn't returned any results for this query.":
                break
        except KeyError:
            print(f"Getting SerpAPI data for page: {start}")
        else:
            continue

        # create dataframe of the 10 pulled results, flattening detected_extensions into columns
        jobs = pd.DataFrame(results['jobs_results'])
        jobs = pd.concat([jobs, pd.json_normalize(jobs['detected_extensions'])],
                         axis=1).drop(columns='detected_extensions')
        jobs['date_time'] = datetime.datetime.utcnow()

        # accumulate results across pages
        if start == 0:
            jobs_all = jobs
        else:
            jobs_all = pd.concat([jobs_all, jobs])

    jobs_all['search_term'] = search_term
    jobs_all['search_location'] = search_location

    # send results to BigQuery
    table_id = "xxxxxxxxxxxxxxxxxxxxxxxx"  # BigQuery table name (project.dataset.table)
    client = bigquery.Client()
    table = client.get_table(table_id)
    errors = client.insert_rows_from_dataframe(table, jobs_all)

    # insert_rows_from_dataframe returns one error list per chunk; all empty means success
    if all(not chunk for chunk in errors):
        print("Data loaded into table")
        return "Success"
    else:
        print(errors)
        return "Failed"
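Not part of the original gist, but one quick way to sanity-check the entry point before deploying is to call it locally with a dummy Pub/Sub-style event; the function never reads the payload, so any base64 string works. A minimal sketch:

# Hypothetical local smoke test (not in the gist): invoke the entry point with a fake event.
import base64

if __name__ == "__main__":
    fake_event = {"data": base64.b64encode(b"manual run").decode("utf-8")}
    print(hello_pubsub(fake_event, context=None))
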
@rohan472000

The code is correct and the function is running on GCP, but on my end the data isn't showing up in the BigQuery table... why?

@lukebarousse
Author

That's really strange, I'll be honest, I'm not really sure...

For me, I did notice that there is sometimes a long delay (~30 mins) before the data appears in BigQuery after the function runs. Hopefully that's all it is here.
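
One way to check whether the rows have actually landed is to query the table directly, since streamed rows are usually visible to queries before they show up in the BigQuery console preview; a minimal sketch, assuming a hypothetical table id:

# Minimal check with a hypothetical table id; date_time is the load timestamp the function adds.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT COUNT(*) AS row_count, MAX(date_time) AS latest_load
    FROM `your-project.your_dataset.gsearch_jobs`
"""
for row in client.query(query).result():
    print(f"Rows in table: {row.row_count} (latest load: {row.latest_load})")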

@rohan472000

Yeah, that could be the issue, as I've seen this lag before in GCP and AWS.

But Luke, how can we schedule this function through a DAG (Apache Airflow or Cloud Composer) or any other scheduler, so that it fetches new job updates daily?

@lukebarousse
Author

Ha! I'm trying to code something like that with Apache Airflow right now to solve this. 🤣 So crazy you mention that.

My goal is to move this code from Cloud Functions (and Cloud Scheduler) to Airflow completely, to run it daily and export the results to BigQuery. I'll share the code on GitHub as I develop it (ping me if I don't in the next few days).

Right now, I'm trying to figure out if it's a simple enough task to just make it into a container and run it via Cloud Run... or if I should use Cloud Composer. Do you have any opinions on this?

@rohan472000

Wow... after watching your video I'm also trying to do the same on Apache Airflow through DAGs and their multiple tasks.

Containerizing it and running it via Cloud Run seems like a good idea... hope you make it work.

Btw, I'm one of your subscribers and really appreciate your work and content.

@lukebarousse
Author

Heck yeah! Well let me know how it goes for you as well! Thanks for the sub!! 🙌🏼

@lukebarousse
Author

@rohan472000 I made a pipeline with Airflow that collects about 10,000 jobs a day and stores them in BigQuery. I have it up and running on my local computer. Here is the link to the GitHub repo.
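
For anyone wanting to sketch the scheduling part before looking at the repo, a daily DAG wrapping a fetch-and-load callable like the function above could look roughly like this (the DAG id and callable are hypothetical; the linked repo is the actual implementation):

# Minimal sketch (not the code from the linked repo) of a daily Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_and_load_jobs():
    # hypothetical placeholder: pull SerpApi results and insert them into BigQuery,
    # same logic as hello_pubsub() above, minus the Pub/Sub event arguments
    ...

with DAG(
    dag_id="serpapi_jobs_to_bigquery",  # hypothetical DAG id
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="fetch_and_load_jobs",
        python_callable=fetch_and_load_jobs,
    )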

@rohan472000

Great work, Luke... much appreciated.

@wouter-suren

wouter-suren commented Dec 10, 2022

Hi guys! What a great project, thanks for sharing. Reading the conversation here: what's the main reason for moving from GCP's Cloud Functions to Airflow? What was the main limitation of the Cloud Function for this use case?

@lukebarousse
Author

Great question! Cloud Functions are designed for simple tasks and have time limits as low as 5 minutes before they time out and stop your code (up to 60 minutes for some). I needed to move to another option that didn't have a time limit, and there are a number of options on GCP that can help with this. Selfishly, I just like Airflow for its popularity in the data engineering community, so I decided to go with it. I plan to use it more in upcoming projects, so this was a test to learn it further.

@wouter-suren

Indeed, the v2 release of Cloud Functions allows runs of up to 60 minutes, which is the one I use most of the time for longer Python processes. Yeah, I've been eyeballing Airflow for quite a while now and it seems like a better, more professional step up, especially when you start to manage more Cloud Functions that have to work together, although GCP's Workflows is also a very easy alternative for scheduling/chaining multiple Cloud Functions, which is what I use most ATM.

I'm curious how the better approach you're after affects the costs, though; as you mention in your vid, Cloud Functions are very cheap. Looking forward to your results! And great initiative to make the jobs more accessible!

@lukebarousse
Author

Yeah, Google's Workflows looks interesting; I hadn't checked it out yet; how much is it costing you for this?

For Airflow, I'm running it locally on a server to cut costs. Google does have a managed Airflow option, "Cloud Composer", BUT I tried out the service and it was costing more than $10 a day... I wasn't about to pay $3.5k a year for this, so I migrated Airflow to my local server (NAS) and have it running there basically for free.

@wouter-suren

wouter-suren commented Dec 12, 2022

Sounds like a great approach; at least you can run it as much as you like.

The advantage of Google Workflows is that it's very cheap. It's very lightweight and serverless just like Cloud Functions, so you only pay when it has to spin up to execute some code. Plus you don't need to configure a lot to set it up. You can schedule it easily as well in the Workflows GUI, using cron syntax. In my experience it costs even less than the cheap Cloud Functions it has to execute. It is easy to chain a few Cloud Functions.

Steps:

  • For the Workflow to be allowed to trigger a Cloud Function, the Cloud Function's trigger settings need to be changed to 'Allow unauthenticated invocations', since the Workflow is treated as an 'outside' connection to the Cloud Function. (Not sure you can do this by editing an existing Cloud Function; you might need to re-create it...)
  • To prevent the whole world from being able to trigger the function, go to the 'Connections' tab of the Cloud Function and check 'Allow internal traffic only'. This way only the Workflow can trigger it.
  • Example code of how you trigger a Cloud Function in a Workflow:
main:
    steps:
    - call_customer:
        call: http.get
        args:
            url: https://custent-orwdeforwa-ew.a.run.app

@lukebarousse
Author

@wouter-suren That's awesome, just the fact that Workflows is a cheaper option makes it seem like a great option to pursue; thanks for providing all this information; I'm def going to have to implement it in my next project! 🙌🏼

@rohan472000


Thanks for sharing this info... I learned something new today.

@wouter-suren

Just a small afterthought: Workflows/Airflow is only needed when orchestration comes into play, i.e., when multiple jobs (Cloud Functions or other stuff) need to run periodically and, e.g., depend on each other. Also, Airflow is more advanced than Google Workflows. And if it's just a single Cloud Function, you can always schedule it directly using Google Cloud Scheduler, no need for a Workflow.

@karthickrajja

karthickrajja commented Dec 13, 2022

Hi Luke,
Is there another way to use SerpApi? It costs $50/month for more than 100 searches per month.

@rohan472000

Hi Luke, Is there another way to use SerpApi? It costs $50/month for more than 100 searches per month.

You can use serpstack instead... it has almost the same accuracy and it's free as well.

@lukebarousse
Author

@karthickrajja Unfortunately, I don't have a recommendation for a free API like this... These types of APIs cost money for those providing the service in order to get the data we need; so typically it's hard to find a service where you can get large quantities of data for free.

@karthickrajja

@karthickrajja Unfortunately, I don't have a recommendation for a free API like this... These types of APIs cost money for those providing the service in order to get the data we need; so typically it's hard to find a service where you can get large quantities of data for free.

Thanks for the reply, Luke 👍👍👍.....

@karthickrajja

karthickrajja commented Dec 14, 2022

Hi Luke, Is there another way to use SerpApi? It costs $50/month for more than 100 searches per month.

You can use serpstack instead... it has almost the same accuracy and it's free as well.

Rohan, it is also chargeable for more than 100 searches per month, but its plan is much cheaper than SerpApi... so I'll try it... thanks 👍👍👍

@thomastibouche

The code is correct and the function is running on GCP, but on my end the data isn't showing up in the BigQuery table... why?

Same issue here. I rechecked the code, the SerpApi key is correct and so is the BigQuery table_id, but the data just doesn't show up in BigQuery and there's no error in the code.

@rohan472000

The code is correct and the function is running on GCP, but on my end the data isn't showing up in the BigQuery table... why?

Same issue here. I rechecked the code, the SerpApi key is correct and so is the BigQuery table_id, but the data just doesn't show up in BigQuery and there's no error in the code.

Well @thomastibouche, I tried twice but didn't get that data into my BigQuery, and then I left it because I didn't have free time to look into it again.
But if you're also getting the same error then there is a problem that we aren't aware of..... @lukebarousse, have you heard about this error from anyone else?
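
One thing worth ruling out here (an assumption, not something confirmed in this thread) is a mismatch between the dataframe columns and the BigQuery table schema, since insert_rows_from_dataframe reports rejected rows in its return value instead of raising an error. A minimal check, using a hypothetical table id:

# Debugging sketch (hypothetical table id): list the table's schema columns.
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("your-project.your_dataset.gsearch_jobs")
print("Table schema columns:", [field.name for field in table.schema])
# Compare this list with jobs_all.columns inside the function; any mismatch will appear
# in the per-chunk error lists returned by insert_rows_from_dataframe.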
