**Learn how to chart and track Google Trends in Data Studio using Python**

Basically, Google Trends gives you data on the relative popularity of a keyword from 2004 to the present, which is really cool! (*Relative popularity is the ratio of interest in your search term to interest in all keywords searched on Google.*)

Everything is great so far, but analyzing Google Trends data at scale is rarely practical. Many of us don’t use it much because searching for keywords on the website and pulling data points one by one is tedious. So how can we use Google Trends more effectively?

In this article, my aim is to show you the pytrends library in Python and the benefits it can bring to your data analysis. I will also explain how to connect Jupyter Notebook to Google Sheets so you can import data into Google Data Studio and share it with others easily. For example, while analyzing Search Console data on a Data Studio dashboard, wouldn’t it be nice to have Google Trends data on the same page? If your answer is yes, let’s dig in!

3 topics I will cover in this article:

- Coding with Pytrends library and exploring its features
- Connecting Jupyter Notebook to Google Spreadsheets with gspread library
- Importing data into Google Data Studio

- Python 2.7+ or Python 3.3+
- The Requests, lxml and pandas libraries. If you don’t know how to install libraries, check this Python document. (Hint: pip install pandas)
- Jupyter Notebook, an open-source web application that provides an environment to run your code.

First of all, you have to install the library:

`pip install pytrends`

Importing necessary libraries:

```python
import pytrends
from pytrends.request import TrendReq
import pandas as pd
from datetime import datetime, date, time
```

Now it is time to code!

```python
pytrend = TrendReq()
pytrend.build_payload(kw_list=['tea', 'coffee', 'coke', 'milk', 'water'], timeframe='today 12-m', geo='GB')
```

The build_payload function is where you specify your search: your keywords, the date range, the location and many other options, such as restricting the analysis to the YouTube or Shopping channel. In the code above, *'today 12-m'* means one year of data. You can narrow your results to a location with the *geo* parameter.

Let’s say you have a YouTube channel and you only want to see YouTube search trends. Then your code will look like this:

`pytrend.build_payload(kw_list=['tea', 'coffee', 'coke', 'milk', 'water'], timeframe='today 12-m', geo='GB', gprop='youtube')`

Or let’s assume you have a food & drink blog and want trend data for your keywords within that category, rather than relative to all searches. Then it will look something like this:

`pytrend.build_payload(kw_list=['tea', 'coffee', 'coke', 'milk', 'water'], timeframe='today 12-m', geo = 'GB', cat = 71)`

To see all features and filters, check this repository on GitHub, where you can also find all the category codes.

(*By the way, be careful: you cannot pass more than five keywords here directly. It will raise an error, because Google Trends only lets you compare five keywords at a time. Later, I will use a different approach to analyze more than five keywords.*)

So, let’s carry on and get the trends scores now.

```python
# To get the interest-over-time scores, you'll need the pytrend.interest_over_time() function.
# For more functions, check this: https://github.com/GeneralMills/pytrends
interest_over_time_df = pytrend.interest_over_time()
print(interest_over_time_df.head())
```

```python
# Let's draw
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(color_codes=True)

dx = interest_over_time_df.plot.line(figsize=(9, 6), title="Interest Over Time")
dx.set_xlabel('Date')
dx.set_ylabel('Trends Index')
dx.tick_params(axis='both', which='major', labelsize=13)
```

Now I will show you another cool feature of Google Trends. If you use the suggestions function, it returns suggested keywords along with their **types**.

```python
print(pytrend.suggestions(keyword='search engine land'), '\n')
print(pytrend.suggestions(keyword='amazon'), '\n')
print(pytrend.suggestions(keyword='cats'), '\n')
print(pytrend.suggestions(keyword='macbook pro'), '\n')
print(pytrend.suggestions(keyword='beer'), '\n')
print(pytrend.suggestions(keyword='ikea'), '\n')
```

This is my favorite, especially because it can be really helpful for Google Ads, keyword research and content creation.

Let’s check the ‘foundation’ keyword in the Beauty category and get its related keywords.

```python
pytrend.build_payload(kw_list=['foundation'], geo='US', timeframe='today 3-m', cat=44)
related_queries = pytrend.related_queries()
print(related_queries)
```

You will see two parts in the output: top keywords and rising keywords. The value for top keywords is the Google Trends score from 0 to 100, while the value for rising keywords shows the percentage increase in interest.
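
If you want to work with those two parts separately, here is a minimal sketch (assuming the ‘foundation’ payload built above); pytrend.related_queries() returns a dictionary with one ‘top’ and one ‘rising’ DataFrame per keyword:

```python
related_queries = pytrend.related_queries()

top_df = related_queries['foundation']['top']        # columns: query, value (0-100 Trends score)
rising_df = related_queries['foundation']['rising']  # columns: query, value (% increase in interest)

print(top_df.head(10))
print(rising_df.head(10))
```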

If a website sells foundations, it would be great to follow what people have been searching for lately, right? These products might be getting popular, or the reverse: they might have recently earned a bad reputation, and that’s why people are searching for them. Noticing this as early as possible in Google Ads, for instance, may keep you from spending excessive amounts of money with no conversions.

Now, I will write a group of random keywords here and get their data. You can also read keywords from a CSV or Excel file, but make sure they end up as a Python ‘list’, as shown in the sketch below.
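
Here is a hedged sketch of reading keywords from a CSV file into a list; the file name ‘keywords.csv’ and its ‘keyword’ column are just placeholders for illustration, so adjust them to your own file:

```python
import pandas as pd

# Hypothetical file with one column named "keyword" (adjust to your own file).
keywords_df = pd.read_csv('keywords.csv')
searches = keywords_df['keyword'].dropna().tolist()  # pytrends expects a plain Python list

print(type(searches), len(searches))
```

If you go that route, the rest of the code stays the same; for this article, though, I’ll just use the hard-coded list below.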

```python
searches = ['detox', 'water fasting', 'benefits of fasting', 'fasting benefits',
            'acidic', 'water diet', 'ozone therapy', 'colon hydrotherapy', 'water fast',
            'reflexology', 'balance', 'deep tissue massage', 'cryo', 'healthy body', 'what is detox',
            'the truth about cancer', 'dieta', 'reverse diabetes', 'how to reverse diabetes',
            'water cleanse', 'can you drink water when fasting', 'water fasting benefits',
            'glycemic load', 'anti ageing', 'how to water fast', 'ozone treatment', 'healthy mind',
            'can you reverse diabetes', 'anti aging', 'health benefits of fasting', 'hydrocolonic',
            'shiatsu massage', 'seaweed wrap', 'shiatsu', 'can you get rid of diabetes',
            'how to get rid of diabetes', 'healthy body healthy mind', 'colonic hydrotherapy',
            'green detox', 'what is water fasting', '21 day water fast', 'benefits of water fasting',
            'cellulite', 'ty bollinger', 'detox diet', 'detox program', 'anti aging treatments',
            'ketogenic', 'glycemic index', 'water fasting weight loss', 'keto diet plan',
            'acidic symptoms', 'alkaline diet', 'water fasting diet', 'laser therapy',
            'anti cellulite massage', 'swedish massage', 'benefit of fasting', 'detox your body',
            'colon therapy', 'reversing diabetes', 'detoxing', 'truth about cancer',
            'how to remove acidity from body', '21 day water fast results', 'colon cleanse',
            'fasting health benefits', 'antiaging', 'aromatheraphy massage']
```

```python
# Google Trends compares at most five terms per request, so split the keyword
# list into chunks of one and query each keyword on its own.
groupkeywords = list(zip(*[iter(searches)] * 1))
groupkeywords = [list(x) for x in groupkeywords]

# Request interest-over-time data for each chunk and collect the results.
dicti = {}
i = 1
for trending in groupkeywords:
    pytrend.build_payload(trending, timeframe='today 3-m', geo='GB')
    dicti[i] = pytrend.interest_over_time()
    i += 1

# Combine everything into one data frame and drop the isPartial flag column.
result = pd.concat(dicti, axis=1)
result.columns = result.columns.droplevel(0)
result = result.drop('isPartial', axis=1)
result
```

Yes! I have all of them, but I need to reshape my data frame so I can merge this data with Search Console data later.

```python
result.reset_index(level=0, inplace=True)
result = pd.melt(result, id_vars='date', value_vars=searches)
result.to_excel('trends.xlsx')
```

Google Trends data is ready to go!

First of all, you need to enable some APIs and create a secret client JSON file in order to authorize Google Sheets access. I will not explain this in this article, but here is a great guide explaining how to do that step by step.

Then you can just use the code below:

```python
import gspread
from oauth2client.service_account import ServiceAccountCredentials

links = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']

credentials = ServiceAccountCredentials.from_json_keyfile_name('ENTER-YOUR-JSON-FILE-NAME-HERE.json', links)
gc = gspread.authorize(credentials)
```

Creating and opening a spreadsheet:

```python
sh = gc.create('My cool spreadsheet')
wks = gc.open("My cool spreadsheet").sheet1

# Check the Colab docs for more examples:
# https://colab.research.google.com/notebooks/io.ipynb
```

Creating a custom function to send data frames into sheets:

```python
# https://www.danielecook.com/from-pandas-to-google-sheets/

def iter_pd(df):
    # Yield the column names first, then every cell value row by row,
    # replacing NaN values with empty strings.
    for val in list(df.columns):
        yield val
    for row in df.values:
        for val in list(row):
            if pd.isna(val):
                yield ""
            else:
                yield val

def pandas_to_sheets(pandas_df, sheet, clear=True):
    # Updates all values in a worksheet to match a pandas dataframe.
    if clear:
        sheet.clear()
    (row, col) = pandas_df.shape
    cells = sheet.range("A1:{}".format(gspread.utils.rowcol_to_a1(row + 1, col)))
    for cell, val in zip(cells, iter_pd(pandas_df)):
        cell.value = val
    sheet.update_cells(cells)
```

An example to see how it works:

`df = pd.read_csv("train.csv") `

pandas_to_sheets(df, wks)

Let’s continue with trends data and merge it with Search Console data.

```python
sh = gc.create('GoogleTrends')
wks = gc.open("GoogleTrends").sheet1
pandas_to_sheets(result, wks)
```

```python
dx = pd.read_excel('Trends.xlsx', sheet_name='Sheet1')
dz = pd.read_excel('Trends.xlsx', sheet_name='console')  # my Search Console data is in this sheet; make sure you know where yours is
dm = pd.merge(dx, dz, on=['Query', 'Date'])
dm
```

And let’s send this one also into Google Sheets.

`wks = gc.open("GoogleTrends").sheet3`

pandas_to_sheets(dm, wks)

Now you can just connect this spreadsheet with Google Data Studio:

```python
pytrend.build_payload(kw_list=['foundation', 'eyeliner', 'concealer', 'lipstick'], geo='US', timeframe='today 3-m', cat=44)
related_queries = pytrend.related_queries()
dg = related_queries.get('lipstick').get('rising')
dg
```

Use *pandas_to_sheets* again, then import these into Data Studio and visualize.

It seems complicated at first, but just try this code and create your own dashboards. In the end, you will simply run the code in Jupyter Notebook and refresh the data in Google Data Studio. It will take only 10-15 seconds to update everything, I promise!

Here is my GitHub repository with all of the Python code together.

Happy coding!



**Here’s how I used Python to build a regression model using an e-commerce dataset**

In the example below, I use an e-commerce dataset to build a regression model. I also explain how to determine whether the model reveals anything statistically significant, as well as how outliers may skew your results.

I use Python 3 and Jupyter Notebook to generate plots and equations with linear regression on a Kaggle dataset. I check the correlations and build a basic machine learning model with this data. With this setup, I end up with an equation to predict my target variable.

Before building my model, I want to step back to offer an easy-to-understand definition of linear regression and why it’s vital to analyzing data.

Linear regression is a basic machine learning algorithm used to predict a variable based on its linear relationship with other independent variables. Let’s look at a simple linear regression graph:

If you know the equation of the line, y = ax + b, you can calculate y values for any x values. Here, ‘’a’’ is the coefficient of ‘’x’’ and also the slope of the line, and ‘’b’’ is the intercept, meaning the value of y when x = 0.
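
To make this concrete, here is a tiny hedged example with made-up numbers (not the e-commerce data), using numpy.polyfit to recover the slope a and intercept b of a straight line:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 5, 7, 9, 11])    # exactly y = 2x + 1

a, b = np.polyfit(x, y, 1)        # a degree-1 fit returns slope and intercept
print(a, b)                       # roughly 2.0 and 1.0
print(a * 6 + b)                  # predicted y when x = 6 -> roughly 13.0
```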

I used this dataset from Kaggle. It is not a very complicated or detailed one, but it is enough to study the concept of linear regression.

If you are new to this and have not used Jupyter Notebook before, here is a quick tip:

- Launch the Terminal and write this command:
**jupyter notebook**

Once entered, this command will automatically launch your default web browser with a new notebook. Click New and Python 3.

Now it is time to write some fancy Python code.

**Importing libraries**

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse, rmse
import seaborn as sns
import warnings
import math
import scipy.stats as stats
import scipy
from sklearn.preprocessing import scale

pd.options.display.float_format = '{:.5f}'.format
warnings.filterwarnings('ignore')
```

**Reading data**

`df = pd.read_csv("Ecom_Customers.csv")`

df.head()

My target variable will be *Yearly Amount Spent*, and I’ll try to find its relationship with the other variables. It would be great if I could say, for example, that users will spend this much more if *Time on App* is increased by one minute. This is the main purpose of the study.

**Exploratory data analysis**

First let’s check the correlation heatmap:

```python
df_kor = df.corr()

plt.figure(figsize=(10, 10))
sns.heatmap(df_kor, vmin=-1, vmax=1, cmap="viridis", annot=True, linewidth=0.1)
```

This heatmap shows correlations between each variable by giving them a weight from -1 to +1.

Purple means negative correlation, yellow means positive correlation, and values closer to 1 or -1 mean you have something meaningful there worth analyzing. For example:

- *Length of Membership* has a positive and high correlation with *Yearly Amount Spent* (81%).
- *Time on App* also has a correlation, but not as powerful as *Length of Membership* (50%).
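
If you prefer numbers over colors, a small optional step (not in the original notebook) is to sort each variable’s correlation with the target directly from the correlation matrix computed above:

```python
# df_kor was computed above with df.corr()
print(df_kor['Yearly Amount Spent'].sort_values(ascending=False))
```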

Let’s see these relations in detail. My favorite plot is **sns.pairplot**: only one line of code and you will see all the distributions.

`sns.pairplot(df)`

This chart shows the distributions between each pair of variables, drawing all the graphs for you. To understand which data each plot includes, check the left and bottom axis labels. (If they are the same variable, you will see a simple distribution bar chart.)

Look at the **last row**: Yearly Amount Spent (my target, on the left axis) plotted against the other variables.

Length of Membership shows nearly perfect linearity; it is obvious that if I can increase customer loyalty, customers will spend more! But how much more? Is there a number or coefficient to specify it? Can we predict it? We will figure it out.

**Checking missing values**

Before building any model, you should check whether there are any empty cells in your dataset. It is not possible to continue with NaN values because many machine learning algorithms do not support them.

This is my code to see missing values:

`df.isnull().sum()`

isnull() detects NaN values and sum() counts them.

I have no NaN values, which is good. If I did, I would have to fill or drop them.

For example, to drop all NaN values use this:

`df.dropna(inplace=True)`

To fill, you can use fillna():

`df["Time on App"].fillna(df["Time on App"].mean(), inplace=True)`

My suggestion here is to read this great article on how to handle missing values in your dataset. It is a separate problem to solve and needs different approaches if you have them.

So far, I have explored the dataset in detail and got familiar with it. Now it is time to create the model and see if I can predict *Yearly Amount Spent.*

Let’s define X and Y. First I will add all other variables to X and analyze the results later.

```python
Y = df["Yearly Amount Spent"]
X = df[["Length of Membership", "Time on App", "Time on Website", "Avg. Session Length"]]
```

Then I will split my dataset into training and testing sets, meaning I will randomly select 20% of the data and set it aside as test data (test_size sets the share of test data to 20%). If you don’t specify **random_state** in your code, a new random split is generated every time you run it, so your training and test datasets would contain different values each time.

```python
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=465)

print('Training Data Count: {}'.format(X_train.shape[0]))
print('Testing Data Count: {}'.format(X_test.shape[0]))
```

Now, let’s build the model:

```python
X_train = sm.add_constant(X_train)
results = sm.OLS(y_train, X_train).fit()
results.summary()
```

So what do all those numbers mean actually?

Before continuing, it will be better to explain these basic statistical terms here because I will decide if my model is sufficient or not by looking at those numbers.

**What is the p-value?**

The p-value, or probability value, shows statistical significance. Let’s say you have a hypothesis that the average CTR of your brand keywords is 70% or more, and its p-value is 0.02. This means there is only a **2% probability** of seeing data like yours if the average CTR of your brand keywords were actually **below** 70%. Is it statistically significant? 0.05 is generally used as the maximum limit (95% confidence level), so if the p-value is smaller than 0.05, yes, it is significant. The smaller the p-value, the better your results!
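
As a toy illustration of that idea (with made-up CTR numbers, nothing from this dataset), a one-sample t-test in SciPy returns exactly this kind of p-value; note that the `alternative` argument needs SciPy 1.6 or newer:

```python
import numpy as np
from scipy import stats

brand_ctrs = np.array([0.72, 0.68, 0.75, 0.71, 0.66, 0.74, 0.69, 0.73])  # made-up CTRs

# H0: mean CTR >= 0.70  vs.  H1: mean CTR < 0.70 (one-sided test)
t_stat, p_value = stats.ttest_1samp(brand_ctrs, 0.70, alternative='less')
print(p_value)  # below 0.05 would mean rejecting H0 at the 95% confidence level
```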

Now let’s look at the summary table. Each of my four variables has a p-value showing whether its relationship with *Yearly Amount Spent* is significant or insignificant. As you can see, *Time on Website* is statistically **insignificant** because its p-value is 0.180, so it will be better to drop it.

**What is R squared and Adjusted R squared?**

R squared is a simple but powerful metric that shows how much **variance is explained** by the model. It accounts for all the variables you defined in X and gives the percentage of variance explained. Think of it as a measure of your model’s capability.

**Adjusted** R squared is similar to R squared, but it penalizes extra variables, so adding a variable that does not really help the model will lower it. That is why it is better to look at adjusted R squared all the time.
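
For the curious, here is a hedged sketch of how adjusted R squared relates to plain R squared: it applies a penalty based on the number of observations n and the number of predictors p. statsmodels already reports both values, so this is only a sanity check:

```python
r2 = results.rsquared
n = int(results.nobs)
p = len(results.params) - 1              # number of predictors, excluding the constant

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2, results.rsquared_adj)      # the two numbers should match
```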

In my model, 98.4% of the variance can be explained, which is really high.

**What is Coef?**

These are the coefficients of the variables, and they give us the equation of the model.
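
If you’d rather not copy numbers out of the summary table by hand, a small sketch like this can print the model equation straight from results.params (the index names come from the X columns plus the added constant):

```python
intercept = results.params['const']
terms = " + ".join(
    "{:.2f} x ({})".format(coef, name)
    for name, coef in results.params.items()
    if name != 'const'
)
print("Yearly Amount Spent = {:.2f} + {}".format(intercept, terms))
```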

So are we done? No! I still have the *Time on Website* variable in my model, which is statistically insignificant.

Now I will build another model and drop the *Time on Website* variable:

```python
X2 = df[["Length of Membership", "Time on App", "Avg. Session Length"]]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, Y, test_size=0.2, random_state=465)

print('Training Data Count:', X2_train.shape[0])
print('Testing Data Count:', X2_test.shape[0])
```

```python
X2_train = sm.add_constant(X2_train)
results2 = sm.OLS(y2_train, X2_train).fit()
results2.summary()
```

R squared is still good, and no variable has a p-value higher than 0.05.

Let’s look at the model chart here:

```python
X2_test = sm.add_constant(X2_test)
y2_preds = results2.predict(X2_test)

plt.figure(dpi=75)
plt.scatter(y2_test, y2_preds)
plt.plot(y2_test, y2_test, color="red")
plt.xlabel("Actual Scores")
plt.ylabel("Estimated Scores")
plt.title("Model: Actual vs Estimated Scores")
plt.show()
```

It seems like I am predicting values really well! Actual and predicted scores show almost perfect linearity.

Finally, I will check the errors.

**Errors**

When building models, comparing them and deciding which one is better is a crucial step. You should test lots of things and then analyze the summaries: drop some variables, sum or multiply them, and test again. After completing this series of analyses, check the p-values, errors and R squared. The best model will have:

- P-values smaller than 0.05
- Smaller errors
- Higher adjusted R squared

Let’s look at errors now:

`print("Mean Absolute Error (MAE) : {}".format(mean_absolute_error(y2_test, y2_preds)))`

print("Mean Squared Error (MSE) : {}".format(mse(y2_test, y2_preds)))

print("Root Mean Squared Error (RMSE) : {}".format(rmse(y2_test, y2_preds)))

print("Root Mean Squared Error (RMSE) : {}".format(rmse(y2_test, y2_preds)))

print("Mean Absolute Perc. Error (MAPE) : {}".format(np.mean(np.abs((y2_test - y2_preds) / y2_test)) * 100))

If you want to know what MSE, RMSE or MAPE is, you can read this article.

They are all different ways of calculating the error; for now, we will simply prefer smaller values when comparing different models.

So, to compare my model with another one, I will create one more model including only *Length of Membership* and *Time on App*.

```python
X3 = df[['Length of Membership', 'Time on App']]
Y = df['Yearly Amount Spent']

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, Y, test_size=0.2, random_state=465)

X3_train = sm.add_constant(X3_train)
results3 = sm.OLS(y3_train, X3_train).fit()
results3.summary()
```

```python
X3_test = sm.add_constant(X3_test)
y3_preds = results3.predict(X3_test)

plt.figure(dpi=75)
plt.scatter(y3_test, y3_preds)
plt.plot(y3_test, y3_test, color="red")
plt.xlabel("Actual Scores")
plt.ylabel("Estimated Scores")
plt.title("Model: Actual Scores vs Estimated Scores")
plt.show()

print("Mean Absolute Error (MAE)        : {}".format(mean_absolute_error(y3_test, y3_preds)))
print("Mean Squared Error (MSE)         : {}".format(mse(y3_test, y3_preds)))
print("Root Mean Squared Error (RMSE)   : {}".format(rmse(y3_test, y3_preds)))
print("Mean Absolute Perc. Error (MAPE) : {}".format(np.mean(np.abs((y3_test - y3_preds) / y3_test)) * 100))
```

As you can see, the errors of this last model are higher than the previous one’s, and the adjusted R squared has decreased. *If the errors were smaller, we would say the last one is better, independent of R squared.* Ultimately, we choose smaller errors and higher adjusted R squared. I’ve added this extra model just to show how you can compare models and decide which one is best.

Now our model is this:

*Yearly Amount Spent = -1027.28 + 61.49x(Length of Membership) + 38.76x(Time on App) + 25.48x(Avg. Session Length)*

This means, for example, that if we can increase the length of membership by one year while holding all other features fixed, a customer will spend 61.49 dollars more!
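
Here is a quick hedged check of that interpretation, plugging the rounded coefficients from the summary into a small helper function; the input values are arbitrary illustrations, but the difference from one extra year of membership is 61.49 either way:

```python
def predict_spend(length_of_membership, time_on_app, avg_session_length):
    # Coefficients rounded from the fitted model above, so results are approximate.
    return (-1027.28
            + 61.49 * length_of_membership
            + 38.76 * time_on_app
            + 25.48 * avg_session_length)

base = predict_spend(3, 12, 33)
plus_one_year = predict_spend(4, 12, 33)
print(plus_one_year - base)  # ~61.49 dollars more per extra year of membership
```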

When you are dealing with real data, things are generally not that easy. To find linearity or build more accurate models, you may need to do more. For example, if your model isn’t accurate enough, check for outliers; sometimes outliers can mislead your results!

Apart from this, sometimes you will get curved lines instead of straight ones, yet you will see there is still a relationship between the variables!

Then you should consider transforming your variables using logarithms or squares.

Here is a trick for you to decide which one to use:

For example, if you have a curve similar to the green one in the third graph, you should consider applying a logarithm to make it linear!
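
Since the original graphs are not shown here, here is a minimal sketch of the idea (with hypothetical x and y values, not this dataset): if y grows roughly like log(x), regressing y on np.log(x) straightens the curve.

```python
import numpy as np
import statsmodels.api as sm

x = np.linspace(1, 100, 50)
y = 3 * np.log(x) + 5                 # a curved, logarithmic relationship

X_log = sm.add_constant(np.log(x))    # transform x, then fit a line as usual
log_model = sm.OLS(y, X_log).fit()
print(log_model.rsquared)             # close to 1.0 once the relationship is linearized
```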

There are lots of things to try, so testing all of them is really important.

If you like to play with numbers and advance your data science skill set, learn Python. It is not a very difficult programming language to learn, and the statistics you can generate with it can make a huge difference in your daily work.

Google Analytics, Google Ads, Search Console… Using these tools already offers tons of data, and if you know the concepts of handling data accurately, you will get very valuable insights from them. You can create more accurate traffic forecasts, or analyze Analytics data such as bounce rate, time on page and their relations with the conversion rate. At the end of the day, it might be possible to predict the future of your brand. But these are only a few examples.

If you want to go further with linear regression, check my Google PageSpeed Insights OLS model. I built my own dataset and tried to predict the score calculated from speed metrics such as FCP (First Contentful Paint), FMP (First Meaningful Paint) and TTI (Time to Interactive).

In closing, blend your data, try to find correlations and predict your target. Hamlet Batista has a great article about practical data blending. I strongly recommend it before building any regression model.

