Hülya Çoban, Author at Search Engine Land

Learn how to chart and track Google Trends in Data Studio using Python

Hülya Çoban — Wed, 12 Feb 2020 20:27:56 +0000

Google Trends is a free and incredibly useful tool that provides search interest, popular keywords and hot topics in many languages for different platforms such as web search, YouTube or Google Shopping. Regardless of the marketing channel, it can help you get valuable insights and make meaningful choices for the next steps of your project.

Basically, it gives the data on the relative popularity of a keyword from 2004 to the present, which is really cool! (Relative popularity means the ratio of your search term interest to the interests of all keywords searched on Google.)
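To make "relative popularity" concrete, here is a minimal sketch of the idea. It assumes the index is each period's share of all searches, rescaled so that the busiest period equals 100. The counts are made up, and the function name trends_index is only illustrative; Google does not publish raw counts.

```python
# Hypothetical weekly counts: searches for one term and all searches overall.
term_searches = [120, 300, 240, 60]
all_searches = [10_000, 12_000, 12_000, 10_000]


def trends_index(term_counts, total_counts):
    """Scale each period's share of searches so the peak period equals 100."""
    shares = [t / total for t, total in zip(term_counts, total_counts)]
    peak = max(shares)
    return [round(100 * s / peak) for s in shares]


index = trends_index(term_searches, all_searches)
print(index)
```

Note that the second and third weeks have the same total volume, so their scores differ only because the term itself was searched less.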

Everything is great so far, but analyzing Google Trends data at scale is mostly not practical. Many of us don’t use it much because it seems like a tedious job to search for keywords on the website and get data points one by one. So how can we use Google Trends in a more effective way?

In this article, my aim is to show you the pytrends library in Python and what benefits you can get from it in your data analysis. I will also explain how to connect Jupyter Notebook to Google Spreadsheets in order to import data into Google Data Studio and share it with others easily. For example, while analyzing Search Console data on a Data Studio dashboard, wouldn’t it be nice to have Google Trends data on the same page? If your answer is yes, let’s dig in!

3 topics I will cover in this article:

Coding with Pytrends library and exploring its features
Connecting Jupyter Notebook to Google Spreadsheets with gspread library
Importing data into Google Data Studio

System requirements to use the Pytrends Library

Python 2.7+ and Python 3.3+
Requires Requests, lxml, Pandas libraries. If you don’t know how to install libraries, check this Python document. (hint: pip install pandas)
Jupyter Notebook is an open source web application that provides the environment to run your code.

Coding with Pytrends Library

First of all, you have to install the library:

pip install pytrends

Importing necessary libraries:

import pytrends
from pytrends.request import TrendReq
import pandas as pd
import time
import datetime
from datetime import datetime, date, time

Now it is time to code!

pytrend = TrendReq()
pytrend.build_payload(kw_list=['tea', 'coffee', 'coke', 'milk', 'water'], timeframe='today 12-m', geo='GB')

The payload function is important for specifying your search. Write your keywords and decide on the date range, location and many other things, like choosing the YouTube or Shopping channel to analyze. In the code above, 'today 12-m' means one year of data. You can narrow your results by specifying a location with 'geo'.

Let’s say you have a YouTube channel and you only want to see YouTube search trends. Then your code will look like this:

pytrend.build_payload(kw_list=['tea', 'coffee', 'coke', 'milk', 'water'], timeframe='today 12-m', geo='GB', gprop='youtube')

Or let’s assume that you have a food&drink blog and want to get trend data of your keywords in that category, not relative to all searches. Then it will be something like this:

pytrend.build_payload(kw_list=['tea', 'coffee', 'coke', 'milk', 'water'], timeframe='today 12-m', geo = 'GB', cat = 71)

To see all features and filters, check this repository on GitHub; you can also find all category codes here.

(By the way, be careful: you cannot pass more than 5 keywords directly here. It will give an error because you can compare only 5 keywords at a time on Google Trends. I will use different code later to analyze more than 5 keywords.)
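One way to stay under the 5-keyword limit is to split a longer list into groups before building each payload. The helper below is a minimal sketch in plain Python; chunk_keywords is a hypothetical name, and 'juice' and 'wine' are extra sample terms added for illustration.

```python
def chunk_keywords(keywords, size=5):
    """Split a keyword list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [keywords[i:i + size] for i in range(0, len(keywords), size)]


# The first five terms come from the article; the last two are made up.
keywords = ['tea', 'coffee', 'coke', 'milk', 'water', 'juice', 'wine']
groups = chunk_keywords(keywords)
print(groups)
```

Each group could then be passed to build_payload in turn. Keep in mind that, as far as I know, scores are scaled within a single request, so values from different groups are probably not directly comparable.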

So, let’s keep on and get trends score now.

# To get the interest over time score, you'll need the pytrend.interest_over_time() function.
# For more functions, check this: https://github.com/GeneralMills/pytrends
interest_over_time_df = pytrend.interest_over_time()
print(interest_over_time_df.head())

# Let's draw
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
dx = interest_over_time_df.plot.line(figsize=(9, 6), title="Interest Over Time")
dx.set_xlabel('Date')
dx.set_ylabel('Trends Index')
dx.tick_params(axis='both', which='major', labelsize=13)

Suggested keywords

Now I will show you another cool feature of Google Trends. If you use the suggestions function, it will return suggested keywords and their "types".

print(pytrend.suggestions(keyword='search engine land'), '\n')
print(pytrend.suggestions(keyword='amazon'), '\n')
print(pytrend.suggestions(keyword='cats'), '\n')
print(pytrend.suggestions(keyword='macbook pro'), '\n')
print(pytrend.suggestions(keyword='beer'), '\n')
print(pytrend.suggestions(keyword='ikea'), '\n')

Related queries

This is my favorite! Especially because it can be really helpful in Google Ads, keyword research and content creation.

Let’s check ‘’foundation’’ keyword in the Beauty category and get related keywords.

pytrend.build_payload(kw_list=['foundation'], geo='US', timeframe='today 3-m', cat=44)
related_queries = pytrend.related_queries()
print(related_queries)

You will see two parts in the output: top keywords and rising keywords. The value of a top keyword is its Google Trends score from 0 to 100. The value of a rising keyword, however, shows how much interest in that keyword has increased, as a percentage.

If a website sells foundations, it would be great to follow what people have been searching for lately, right? These products might be getting popular or, on the contrary, they might have gained a bad reputation, which is why people search for them. For instance, noticing this as early as possible in Google Ads may prevent you from spending excessive amounts of money with no conversions.

Tracking lots of keywords

Now, I will write a group of random keywords here and get their data. You can also read keywords from a CSV or Excel file, but make sure you end up with a Python list.
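If your keywords live in a CSV file, a minimal sketch for turning one column into a list could look like this. The column name "keyword" and the in-memory sample text are assumptions made for illustration; with a real file you would pass an open file handle instead.

```python
import csv
import io

# Stand-in for a real file; with a file on disk use open('keywords.csv', newline='').
csv_text = "keyword\ndetox\nwater fasting\nalkaline diet\n"


def read_keywords(handle, column="keyword"):
    """Return the non-empty values of one CSV column as a plain Python list."""
    reader = csv.DictReader(handle)
    return [row[column].strip() for row in reader if row[column].strip()]


searches_from_csv = read_keywords(io.StringIO(csv_text))
print(searches_from_csv)
```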

searches = ['detox', 'water fasting', 'benefits of fasting', 'fasting benefits', 'acidic', 'water diet', 'ozone therapy', 'colon hydrotherapy', 'water fast', 'reflexology', 'balance', 'deep tissue massage', 'cryo', 'healthy body', 'what is detox', 'the truth about cancer', 'dieta', 'reverse diabetes', 'how to reverse diabetes', 'water cleanse', 'can you drink water when fasting', 'water fasting benefits', 'glycemic load', 'anti ageing', 'how to water fast', 'ozone treatment', 'healthy mind', 'can you reverse diabetes', 'anti aging', 'health benefits of fasting', 'hydrocolonic', 'shiatsu massage', 'seaweed wrap', 'shiatsu', 'can you get rid of diabetes', 'how to get rid of diabetes', 'healthy body healthy mind', 'colonic hydrotherapy', 'green detox', 'what is water fasting', '21 day water fast', 'benefits of water fasting', 'cellulite', 'ty bollinger', 'detox diet', 'detox program', 'anti aging treatments', 'ketogenic', 'glycemic index', 'water fasting weight loss', 'keto diet plan', 'acidic symptoms', 'alkaline diet', 'water fasting diet', 'laser therapy', 'anti cellulite massage', 'swedish massage', 'benefit of fasting', 'detox your body', 'colon therapy', 'reversing diabetes', 'detoxing', 'truth about cancer', 'how to remove acidity from body', '21 day water fast results', 'colon cleanse', 'fasting health benefits', 'antiaging', 'aromatheraphy massage']

# Put every keyword into its own one-item group (the *1 sets the group size)
groupkeywords = list(zip(*[iter(searches)]*1))
groupkeywords = [list(x) for x in groupkeywords]

dicti = {}
i = 1
for trending in groupkeywords:
    pytrend.build_payload(trending, timeframe='today 3-m', geo='GB')
    dicti[i] = pytrend.interest_over_time()
    i += 1

result = pd.concat(dicti, axis=1)
result.columns = result.columns.droplevel(0)
result = result.drop('isPartial', axis=1)

result

Yes! I have all of them, but I need to reshape my data frame so that I can merge this data with Search Console data.

result.reset_index(level=0, inplace=True)
pd.melt(result, id_vars='date', value_vars=searches)

result.to_excel('trends.xlsx')
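A small sketch of what the reshape does, using a made-up two-keyword frame in the same wide layout. pd.melt returns a new data frame rather than changing its input in place, so the reshaped table has to be assigned to a name to be kept. The column names "Query" and "Trends Index" are my own choices here, not something pytrends produces.

```python
import pandas as pd

# Hypothetical two-keyword frame in the same wide layout: one column per keyword.
wide = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-05", "2020-01-12"]),
    "detox": [80, 100],
    "cellulite": [55, 60],
})

# One row per (date, keyword) pair; the new column names are illustrative.
long_df = pd.melt(wide, id_vars="date", var_name="Query", value_name="Trends Index")
print(long_df)
```

The long layout has one score per row, which is the shape you need to join on keyword and date.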

Google Trends data is ready to go!

Connecting Jupyter Notebook to Google Spreadsheets with gspread library

First of all, you need to enable some APIs and create a secret client JSON file in order to authorize Google Sheets access. I will not explain this in this article, but here is a great guide explaining how to do that step by step.

Then you can just use the code below:

import gspread
from oauth2client.service_account import ServiceAccountCredentials
links = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('ENTER-YOUR-JSON-FILE-NAME-HERE.json', links)
gc = gspread.authorize(credentials)

Creating and opening a spreadsheet:

sh = gc.create('My cool spreadsheet')
wks = gc.open("My cool spreadsheet").sheet1
# check colab documents here for more examples → https://colab.research.google.com/notebooks/io.ipynb

Creating a custom function to send data frames into sheets:

#https://www.danielecook.com/from-pandas-to-google-sheets/

def iter_pd(df):
    # Yield the header cells first, then every value row by row
    for val in list(df.columns):
        yield val
    for row in df.values:
        for val in list(row):
            if pd.isna(val):
                yield ""
            else:
                yield val

def pandas_to_sheets(pandas_df, sheet, clear=True):
    # Updates all values in a workbook to match a pandas dataframe
    if clear:
        sheet.clear()
    (row, col) = pandas_df.shape
    cells = sheet.range("A1:{}".format(gspread.utils.rowcol_to_a1(row + 1, col)))
    for cell, val in zip(cells, iter_pd(pandas_df)):
        cell.value = val
    sheet.update_cells(cells)
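The function above asks for a cell range that starts at A1 and ends at the bottom-right cell of the data frame, with one extra row for the header. To show how that end cell is derived, here is a plain-Python sketch of a (row, column) to A1 conversion. It is my own illustrative re-implementation, not gspread's code, and rowcol_to_a1_sketch is a hypothetical name.

```python
def rowcol_to_a1_sketch(row, col):
    """Convert 1-based (row, col) to A1 notation, e.g. (1, 1) -> 'A1'."""
    if row < 1 or col < 1:
        raise ValueError("row and col are 1-based")
    letters = ""
    while col > 0:
        # Column letters work like base 26 without a zero digit
        col, remainder = divmod(col - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return f"{letters}{row}"


rows, cols = 3, 28  # hypothetical DataFrame shape
last_cell = rowcol_to_a1_sketch(rows + 1, cols)  # +1 row for the header
print(f"A1:{last_cell}")
```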

An example to see how it works:

df = pd.read_csv("train.csv")
pandas_to_sheets(df, wks)

Let’s continue with trends data and merge it with Search Console data.

sh = gc.create('GoogleTrends')
wks = gc.open("GoogleTrends").sheet1
pandas_to_sheets(result, wks)

dx = pd.read_excel('Trends.xlsx', sheet_name='Sheet1')
dz = pd.read_excel('Trends.xlsx', sheet_name='console')  # my console data is here, make sure where yours is
dm = pd.merge(dx, dz, on=['Query', 'Date'])
dm
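A minimal sketch of what this merge does, using made-up rows. The "Clicks" column stands in for any Search Console metric and is an assumption, as are the values. By default pd.merge performs an inner join, so only keyword and date pairs that exist in both tables are kept.

```python
import pandas as pd

# Hypothetical rows; 'Clicks' stands in for any Search Console metric.
trends = pd.DataFrame({
    "Query": ["detox", "detox", "cellulite"],
    "Date": pd.to_datetime(["2020-01-05", "2020-01-12", "2020-01-05"]),
    "Trends Index": [80, 100, 55],
})
console = pd.DataFrame({
    "Query": ["detox", "cellulite"],
    "Date": pd.to_datetime(["2020-01-05", "2020-01-05"]),
    "Clicks": [12, 7],
})

# Default inner join: only (Query, Date) pairs present in both frames survive.
merged = pd.merge(trends, console, on=["Query", "Date"])
print(merged)
```

If rows go missing after a merge, it is worth checking that both key columns use the same spelling and the same data type, for example real dates on both sides rather than dates on one side and text on the other.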

And let’s send this one also into Google Sheets.

wks = gc.open("GoogleTrends").get_worksheet(2)  # third worksheet (zero-based index); it must already exist
pandas_to_sheets(dm, wks)

Importing data into Google Data Studio

Now you can just connect this spreadsheet with Google Data Studio:

Tracking rising keywords

pytrend.build_payload(kw_list=['foundation', 'eyeliner', 'concealer', 'lipstick'], geo='US', timeframe='today 3-m', cat=44)
related_queries = pytrend.related_queries()
dg = related_queries.get('lipstick').get('rising')
dg

Use pandas_to_sheets again. Import these into Data Studio and visualize:

Wrapping up

It seems complicated at first, but just try this code and create your own dashboards. In the end, you will just run the code in Jupyter Notebook and refresh the data in Google Data Studio. It will take only 10-15 seconds to update all of them, I promise!

Here is my GitHub repository with all the Python code together.

Happy coding!


Here’s how I used Python to build a regression model using an e-commerce dataset

Hülya Çoban — Mon, 16 Dec 2019 18:47:03 +0000

The programming language of Python is gaining popularity among SEOs for its ease of use to automate daily, routine tasks. It can save time and generate some fancy machine learning to solve more significant problems that can ultimately help your brand and your career. Apart from automations, this article will assist those who want to learn more about data science and how Python can help.

In the example below, I use an e-commerce data set to build a regression model. I also explain how to determine if the model reveals anything statistically significant, as well as how outliers may skew your results.

I use Python 3 and Jupyter Notebooks to generate plots and equations with linear regression on Kaggle data. I checked the correlations and built a basic machine learning model with this dataset. With this setup, I now have an equation to predict my target variable.

Before building my model, I want to step back to offer an easy-to-understand definition of linear regression and why it’s vital to analyzing data.

What is linear regression?

Linear regression is a basic machine learning algorithm used for predicting a variable based on its linear relationship with other, independent variables. Let’s see a simple linear regression graph:

If you know the equation of the line, y = ax + b, you can find the y value for any x value. "a" is the coefficient of "x" and also the slope of the line; "b" is the intercept, which means that when x = 0, y = b.
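To see where a slope and an intercept come from, here is a minimal least-squares sketch in plain Python. The points are made up and lie exactly on y = 2x + 1, so the fit should recover those two numbers. The function name fit_line is only illustrative.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


xs = [0, 1, 2, 3, 4]  # made-up points lying exactly on y = 2x + 1
ys = [1, 3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)
```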

My e-commerce dataset

I used this dataset from Kaggle. It is not very complicated or detailed, but it is enough to study the concept of linear regression.

If you are new and haven’t used Jupyter Notebook before, here is a quick tip for you:

Launch the Terminal and write this command: jupyter notebook

Once entered, this command will automatically launch your default web browser with a new notebook. Click New and Python 3.

Now it is time to use some fancy Python codes.

Importing libraries

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
import seaborn as sns
pd.options.display.float_format = '{:.5f}'.format
import warnings
import math
import scipy.stats as stats
import scipy
from sklearn.preprocessing import scale
warnings.filterwarnings('ignore')

Reading data

df = pd.read_csv("Ecom_Customers.csv")
df.head()

My target variable will be Yearly Amount Spent, and I’ll try to find its relation with the other variables. It would be great to be able to say, for example, that users will spend this much more if Time on App increases by 1 minute. This is the main purpose of the study.

Exploratory data analysis

First let’s check the correlation heatmap:

df_kor = df.corr()
plt.figure(figsize=(10, 10))
sns.heatmap(df_kor, vmin=-1, vmax=1, cmap="viridis", annot=True, linewidth=0.1)

This heatmap shows correlations between each variable by giving them a weight from -1 to +1.

Purples mean negative correlation and yellows mean positive correlation. The closer a value gets to 1 or -1, the stronger the relationship, so those pairs are worth analyzing. For example:

Length of Membership has a high positive correlation with Yearly Amount Spent. (81%)
Time on App also has a correlation, but not as strong as Length of Membership. (50%)
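The numbers in such a heatmap are typically Pearson correlation coefficients. Here is a minimal plain-Python sketch of that calculation, using made-up membership and spending values, to show why the result always falls between -1 and +1 and how the sign follows the direction of the relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


membership_years = [1, 2, 3, 4, 5]      # made-up values
spend_up = [200, 260, 310, 390, 450]    # rises with membership
spend_down = [450, 390, 310, 260, 200]  # falls with membership

r_up = pearson(membership_years, spend_up)
r_down = pearson(membership_years, spend_down)
print(round(r_up, 3), round(r_down, 3))
```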

Let’s see these relations in detail. My favorite plot is sns.pairplot. With only one line of code, you will see all the distributions.

sns.pairplot(df)

This chart plots every variable against every other variable and draws all the graphs for you. To understand which data each graph shows, check the axis names on the left and at the bottom. (Where they are the same, you will see a simple histogram of that variable.)

Look at the last row: it shows Yearly Amount Spent (my target, on the left axis) against the other variables.

Length of Membership has an almost perfect linear relation with it. It is obvious that if I can increase customer loyalty, customers will spend more! But how much? Is there any number or coefficient to specify it? Can we predict it? We will figure it out.

Checking missing values

Before building any model, you should check if there are any empty cells in your dataset. It is not possible to keep on with those NaN values because many machine learning algorithms do not support data with them.

This is my code to see missing values:

df.isnull().sum()

isnull() detects NaN values and sum() counts them.

I have no NaN values, which is good. If I had any, I would need to fill them or drop them.

For example, to drop all NaN values use this:

df.dropna(inplace=True)

To fill, you can use fillna():

df["Time on App"].fillna(df["Time on App"].mean(), inplace=True)

My suggestion here is to read this great article on how to handle missing values in your dataset. That is another problem to solve and needs different approaches if you have them.

Building a linear regression model

So far, I have explored the dataset in detail and got familiar with it. Now it is time to create the model and see if I can predict Yearly Amount Spent.

Let’s define X and Y. First I will add all other variables to X and analyze the results later.

Y = df["Yearly Amount Spent"]
X = df[["Length of Membership", "Time on App", "Time on Website", 'Avg. Session Length']]

Then I will split my dataset into training and testing data, which means I will randomly select 20% of the data and set it aside from the training data. (test_size is the share of the data used for testing: 20%.) If you don’t specify random_state, a new random split is generated every time you run your code, so the training and test datasets will contain different rows each time.

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=465)
print('Training Data Count: {}'.format(X_train.shape[0]))
print('Testing Data Count: {}'.format(X_test.shape[0]))
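To illustrate why a fixed seed makes a split reproducible, here is a plain-Python sketch of the idea. It is not scikit-learn's implementation and will not select the same rows as train_test_split; the function name split_indices and the row count of 500 are assumptions for the example.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=465):
    """Shuffle row indices with a fixed seed and take the first share as test rows."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # train, test


train_a, test_a = split_indices(500)
train_b, test_b = split_indices(500)  # same seed, so the same split
print(len(train_a), len(test_a), test_a == test_b)
```

Because both calls use the same seed, they shuffle the indices identically, which is why results stay comparable between runs.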

Now, let’s build the model:

X_train = sm.add_constant(X_train)
results = sm.OLS(y_train, X_train).fit()
results.summary()

Understanding the outputs of the model: Is this statistically significant?

So what do all those numbers mean actually?

Before continuing, it will be better to explain these basic statistical terms here because I will decide if my model is sufficient or not by looking at those numbers.

What is the p-value?

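The p-value, or probability value, is used to judge statistical significance. It is the probability of seeing results at least as extreme as yours if the null hypothesis were true (for a regression coefficient, the null hypothesis is that the variable has no effect). Let’s say your null hypothesis is that the average CTR of your brand keywords is 70% or more, and your test returns a p-value of 0.02. This means that if the average really were 70% or more, there would be only a 2% chance of observing data as extreme as yours, so you would reject that hypothesis. Is it statistically significant? 0.05 is generally used as the threshold (95% confidence level), so if your p-value is smaller than 0.05, yes! It is significant. The smaller the p-value, the stronger the evidence against the null hypothesis.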
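As a rough illustration of how a test statistic turns into a p-value, here is a sketch that uses the standard normal distribution from Python's standard library. This is a simplification: the regression summary later in this article comes from statsmodels, which reports p-values based on the t distribution, although with a few hundred rows the two are very close. The z values below are made up.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability of a standard-normal value at least as extreme as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


for z in (1.0, 1.96, 3.0):
    p = two_sided_p_value(z)
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"z = {z}: p = {p:.4f} ({verdict})")
```

A statistic of about 1.96 sits right at the usual 0.05 threshold, which is why that number shows up so often in statistics.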

Now let’s look at the summary table. Each of my 4 variables has a p-value showing whether its relation with Yearly Amount Spent is significant or not. As you can see, Time on Website is statistically insignificant because its p-value is 0.180. So it will be better to drop it.

What is R squared and Adjusted R squared?

R squared is a simple but powerful metric that shows how much of the variance in the target is explained by the model. It takes into account all the variables you defined in X and gives the percentage of variance they explain. Think of it as a measure of your model’s capability.

Adjusted R squared is similar to R squared, but it penalizes the model for every variable you add, so it only rises when a new variable improves the model more than chance alone would. That is why it is better to compare models by their adjusted R squared.

In my model, 98.4% of the variance can be explained, which is really high.
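For reference, here is a small sketch of the standard formulas behind both numbers, with made-up actual and predicted values. R squared is 1 minus the ratio of unexplained to total variation, and adjusted R squared applies a penalty based on the number of rows and predictors. The assumption of two predictors is only for the example.

```python
def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_rows, n_predictors):
    """Penalize R squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)


actual = [10, 20, 30, 40, 50]     # made-up values
predicted = [12, 18, 31, 39, 50]  # made-up predictions
r2 = r_squared(actual, predicted)
adj = adjusted_r_squared(r2, n_rows=len(actual), n_predictors=2)
print(round(r2, 4), round(adj, 4))
```

With only five rows the penalty is visible; on a dataset with hundreds of rows and a handful of predictors, the two values are usually almost identical.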

What is Coef?

They are coefficients of the variables which give us the equation of the model.

So is it over? No! I have Time on Website variable in my model which is statistically insignificant.

Now I will build another model and drop Time on Website variable:

X2 = df[["Length of Membership", "Time on App", 'Avg. Session Length']]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, Y, test_size=0.2, random_state=465)
print('Training Data Count:', X2_train.shape[0])
print('Testing Data Count:', X2_test.shape[0])

X2_train = sm.add_constant(X2_train)
results2 = sm.OLS(y2_train, X2_train).fit()
results2.summary()

R squared is still good and I have no variable having p-value higher than 0.05.

Let’s look at the model chart here:

# Font settings were not shown in the original; these values are placeholders
ex_font = {'size': 12}
header_font = {'size': 14, 'weight': 'bold'}
X2_test = sm.add_constant(X2_test)
y2_preds = results2.predict(X2_test)
plt.figure(dpi=75)
plt.scatter(y2_test, y2_preds)
plt.plot(y2_test, y2_test, color="red")
plt.xlabel("Actual Scores", fontdict=ex_font)
plt.ylabel("Estimated Scores", fontdict=ex_font)
plt.title("Model: Actual vs Estimated Scores", fontdict=header_font)
plt.show()

It seems like I predict values really well! Actual scores and predicted scores have almost perfect linearity.

Finally, I will check the errors.

Errors

When building models, comparing them and deciding which one is better is a crucial step. You should test lots of things and then analyze the summaries. Drop some variables, sum or multiply them, and test again. After completing the series of analyses, you will check p-values, errors and R squared. The best model will have:

P-values smaller than 0.05
Smaller errors
Higher adjusted R squared

Let’s look at errors now:

print("Mean Absolute Error (MAE) : {}".format(mean_absolute_error(y2_test, y2_preds)))
print("Mean Squared Error (MSE) : {}".format(mse(y2_test, y2_preds)))
print("Root Mean Squared Error (RMSE) : {}".format(rmse(y2_test, y2_preds)))
print("Mean Absolute Perc. Error (MAPE) : {}".format(np.mean(np.abs((y2_test - y2_preds) / y2_test)) * 100))

If you want to know what MSE, RMSE or MAPE is, you can read this article.

They are all different calculations of errors and now, we will just focus on smaller ones while comparing different models.
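For readers who want to see what these four numbers measure, here is a plain-Python sketch of the usual definitions with made-up actual and predicted values. The function name error_metrics is only illustrative; the library functions used above do the same job on real data.

```python
import math


def error_metrics(actual, predicted):
    """Compute MAE, MSE, RMSE and MAPE (in percent) for paired values."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE divides by the actual values, so it breaks if any actual value is 0
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}


actual = [100, 200, 400]     # made-up values
predicted = [110, 190, 380]  # made-up predictions
metrics = error_metrics(actual, predicted)
print(metrics)
```

MSE and RMSE punish large misses more than MAE does, because the errors are squared before averaging.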

So, in order to compare my model with another one, I will create one more model including Length of Membership and Time on App only.

X3 = df[['Length of Membership', 'Time on App']]
Y = df['Yearly Amount Spent']
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, Y, test_size=0.2, random_state=465)
X3_train = sm.add_constant(X3_train)
results3 = sm.OLS(y3_train, X3_train).fit()
results3.summary()

# Font settings were not shown in the original; these values are placeholders
eksen_font = {'size': 12}                     # axis label font
baslik_font = {'size': 14, 'weight': 'bold'}  # title font
X3_test = sm.add_constant(X3_test)
y3_preds = results3.predict(X3_test)
plt.figure(dpi=75)
plt.scatter(y3_test, y3_preds)
plt.plot(y3_test, y3_test, color="red")
plt.xlabel("Actual Scores", fontdict=eksen_font)
plt.ylabel("Estimated Scores", fontdict=eksen_font)
plt.title("Model Actual Scores vs Estimated Scores", fontdict=baslik_font)
plt.show()
print("Mean Absolute Error (MAE) : {}".format(mean_absolute_error(y3_test, y3_preds)))
print("Mean Squared Error (MSE) : {}".format(mse(y3_test, y3_preds)))
print("Root Mean Squared Error (RMSE) : {}".format(rmse(y3_test, y3_preds)))
print("Mean Absolute Perc. Error (MAPE) : {}".format(np.mean(np.abs((y3_test - y3_preds) / y3_test)) * 100))

Which one is best?

As you can see, the errors of the last model are higher than those of the previous one, and adjusted R squared has decreased. If the errors were smaller, we would say the last one is better, independent of R squared. Ultimately, we choose smaller errors and higher R squared. I’ve added this extra model only to show you how you can compare models and decide which one is best.

Now our model is this:

Yearly Amount Spent = -1027.28 + 61.49x(Length of Membership) + 38.76x(Time on App) + 25.48x(Avg. Session Length)

This means, for example, that if we can increase Length of Membership by 1 year while holding all other features fixed, one person will spend 61.49 dollars more!
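The equation can be turned into a small prediction function. The coefficients below are copied from the model equation above; the customer values are made up for illustration, and predict_yearly_spend is a hypothetical name.

```python
# Coefficients copied from the model equation above.
INTERCEPT = -1027.28
COEFFICIENTS = {
    "Length of Membership": 61.49,
    "Time on App": 38.76,
    "Avg. Session Length": 25.48,
}


def predict_yearly_spend(customer):
    """Apply the linear equation to one customer's feature values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * customer[name] for name in COEFFICIENTS)


# Hypothetical customer: 3 years of membership, 12 minutes on app, 33-minute sessions
customer = {"Length of Membership": 3.0, "Time on App": 12.0, "Avg. Session Length": 33.0}
one_more_year = {**customer, "Length of Membership": 4.0}

base = predict_yearly_spend(customer)
uplift = predict_yearly_spend(one_more_year) - base
print(round(base, 2), round(uplift, 2))
```

Changing only Length of Membership by one unit moves the prediction by exactly its coefficient, which is what "holding all other features fixed" means in practice.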

Advanced tips: Outliers and nonlinearity

When you are dealing with the real data, generally things are not that easy. To find linearity or more accurate models, you may need to do something else. For example, if your model isn’t accurate enough, check for outliers. Sometimes outliers can mislead your results!

Source: https://r-statistics.co/Outlier-Treatment-With-R.html
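One common way to flag outliers, offered here as a sketch rather than as the method used in this article, is the interquartile range (IQR) rule: anything more than 1.5 times the IQR below the first quartile or above the third quartile is treated as suspicious. The values below are made up.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return the values lying more than `factor` * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in values if v < low or v > high]


session_minutes = [10, 12, 11, 13, 12, 11, 95]  # made-up values with one odd entry
outliers = iqr_outliers(session_minutes)
print(outliers)
```

Whether to drop, cap or keep such values depends on why they are there, so it is worth looking at them before removing anything.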

Apart from this, sometimes you will get curved lines instead of straight ones, but you will see that there is still a relation between the variables!

Then you should think of transforming your variables, for example by taking logarithms or squares.

Here is a trick for you to decide which one to use:

Source: https://courses.lumenlearning.com/boundless-algebra/chapter/graphs-of-exponential-and-logarithmic-functions/

For example, in the third graph, if you have a line similar to the green one, you should consider using logarithms in order to make it linear!
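Here is a minimal sketch of why a logarithm can straighten a curve. It assumes the target grows exponentially with x, which is only one of several possible curve shapes; the numbers are made up. After taking the log of y, every step in x adds the same amount, which is exactly what a straight line looks like.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [3 * math.exp(0.5 * x) for x in xs]  # made-up exponential curve


def step_sizes(values):
    """Differences between neighbouring values; constant steps mean a straight line."""
    return [b - a for a, b in zip(values, values[1:])]


raw_steps = step_sizes(ys)
log_steps = step_sizes([math.log(y) for y in ys])
print([round(s, 2) for s in raw_steps])
print([round(s, 2) for s in log_steps])
```

The raw steps keep growing, while the log steps are all equal, so a linear model would fit log(y) against x much better than y against x.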

There are lots of things to do so testing all of them is really important.

Conclusion

If you like to play with numbers and advance your data science skill set, learn Python. It is not a very difficult programming language to learn, and the statistics you can generate with it can make a huge difference in your daily work.

Google Analytics, Google Ads, Search Console… Using these tools already offers tons of data, and if you know the concepts of handling data accurately, you will get very valuable insights from them. You can create more accurate traffic forecasts, or analyze Analytics data such as bounce rate, time on page and their relations with the conversion rate. At the end of the day, it might be possible to predict the future of your brand. But these are only a few examples.

If you want to go further in linear regression, check my Google Page Speed Insights OLS model. I’ve built my own dataset and tried to predict the calculation based on speed metrics such as FCP (First Contentful Paint), FMP (First Meaningful Paint) and TTI (Time to Interactive).

In closing, blend your data, try to find correlations and predict your target. Hamlet Batista has a great article about practical data blending. I strongly recommend it before building any regression model.