How to Scrape Mobile Phone Data from Flipkart Using Python?

Anil Prajapati
Sep 6, 2021

Extract product data from Flipkart with Pandas, Selenium, BeautifulSoup4, and CSV.

These days, the Internet is awash with data compared with what we had a decade ago. As per Forbes, the amount of data we generate every day is mind-boggling: at the current pace, roughly 2.5 quintillion bytes are produced daily, driven largely by Internet of Things (IoT) devices. Whether this information comes as video, text, audio, or images, the majority of businesses depend heavily on data to beat their competitors and succeed. Unfortunately, most of this data is not open: most websites do not offer any option for saving the data they display. That is where web scraping tools come in, letting you extract data from different websites.

What is Web Scraping?

Web scraping is the procedure of automatically downloading the data displayed on a website using a computer program. A scraping tool can fetch many pages from a site and automate the tedious job of manually copying and pasting the displayed data. Web scraping matters because, whatever the industry, the web holds information that can offer actionable business insights and give you an edge over your competitors.

Steps Associated with Web Scraping

To fetch data through web scraping with Python, we need to go through the following steps (a miniature sketch combining them follows the list):

  • Get the URL of the page you wish to extract.
  • Inspect the page.
  • Find the data you need to scrape.
  • Write the code.
  • Run the code and scrape the data.
  • Finally, store the data in the required format.
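Here is a minimal end-to-end sketch of those six steps, assuming Chrome and a matching chromedriver are installed; the URL and the h1 tag are placeholders for whatever site and element you target:

# A miniature run-through of the six steps (URL and tag are placeholders)
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://example.com'                              # 1. get the URL
driver = webdriver.Chrome()
driver.get(url)                                          # 2. open the page for inspection
soup = BeautifulSoup(driver.page_source, 'html.parser')  # 3-4. locate the data, write the code
titles = [tag.text for tag in soup.find_all('h1')]       # 5. run the code and scrape
driver.close()
pd.DataFrame({'title': titles}).to_csv('output.csv', index=False)  # 6. store the data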

Packages Utilized for Web Scraping

We’ll use the following Python packages:

Pandas: Pandas is a library for data analysis and manipulation. We’ll use it to store the scraped data in the desired format.
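For instance, a few records can be turned into a DataFrame and saved to disk in a couple of lines (a minimal illustration with made-up sample values, not part of the project code):

import pandas as pd

# Two sample records stored as a DataFrame, then written to CSV
df = pd.DataFrame([('REDMI 9i', '4.3'), ('REDMI 9A', '4.4')], columns=['model', 'star'])
df.to_csv('sample.csv', index=False)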

BeautifulSoup4: BeautifulSoup4 is a Python library used to parse HTML documents. It builds parse trees that make it easy to extract data from the tags in an HTML string.
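As a quick illustration of such a parse tree (a toy snippet, not Flipkart’s real markup):

from bs4 import BeautifulSoup

# Parsing a small HTML snippet and pulling a tag out of the parse tree
html = '<div class="price">₹8,299</div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('div', {'class': 'price'}).text)  # ₹8,299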

Selenium: Selenium is a tool designed primarily to help you run automated tests of web applications. Although that is not its key objective, Selenium is also used in Python for data scraping because it can access JavaScript-rendered content, which regular extraction tools like BeautifulSoup cannot do on their own. We’ll use Selenium to download the HTML content from Flipkart and watch, interactively, what’s taking place.
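A minimal sketch of starting Selenium, assuming Chrome and a matching chromedriver are installed (headless mode is optional and not used in this article):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Optional: run Chrome without opening a visible browser window
options = Options()
options.add_argument('--headless')

driver = webdriver.Chrome(options=options)
driver.get('https://flipkart.com')
print(driver.title)  # confirm the page loaded
driver.close()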

CSV: The csv module implements classes for reading and writing tabular data in CSV format.
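For example, writing a header row and a few data rows with csv.writer, the same pattern the final script uses (the values here are made-up samples):

import csv

# Writing tabular records to a CSV file
rows = [('REDMI 9i', '8,299'), ('REDMI 9A', '6,799')]
with open('demo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Model', 'Price'])  # header row
    writer.writerows(rows)               # data rows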

Project Demonstration

Import Libraries

Let’s begin by importing the necessary packages.

import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

Starting the WebDriver

We start by creating a WebDriver object from the webdriver class and then use that object for whatever operations we need. Here, we create a Chrome object.

# Creating an instance of webdriver for Google Chrome
driver = webdriver.Chrome()

Now, we use the get() method of the driver object to open the Flipkart site in Chrome.

# Using the webdriver, we'll now open the Flipkart website in Chrome
url = 'https://flipkart.com'
# We'll use the get method of the driver and pass in the URL
driver.get(url)

Now, there are a couple of ways we can organize a product search.

The first way is to automate the browser: find the input element, type the search text into it, and hit the Enter key.

However, this kind of automation is unnecessary here and adds potential points of program failure. The rule of thumb for web scraping is to automate only what is absolutely necessary.

Instead, type a query into the search box and press Enter; you’ll see that the search term gets embedded into the page URL. We can use this pattern to build a function that constructs the required URL for the driver to retrieve. That is more efficient in the long run and less prone to program failure.

Now, let’s copy the pattern and write a function that inserts the search term using string formatting.

def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    # We're replacing every space with '+' to adhere to the URL pattern
    search_item = search_item.replace(" ", "+")
    return template.format(search_item)

Now we have a function that produces a URL based on whatever search term we provide.

# Checking whether the function is working properly or not
url = get_url('mobile phones')
print(url)

https://www.flipkart.com/search?q=mobile+phones&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on

The function produces the same URL as the search we performed manually.

Scraping the Collection

Now we’ll fetch the content of the webpage we wish to scrape data from.

To do that, we need to create a BeautifulSoup object that parses the HTML content from the page source: we pass driver.page_source to retrieve the HTML text and use the default HTML parser to parse it.

# Creating a soup object using driver.page_source to retrieve the HTML text,
# then using the default html parser to parse the HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')

Each record (mobile phone) on the results page is displayed as a box or card holding all the details we need. So let’s find the tag for the cards that contain the data we wish to scrape.

We’ll scrape the model, stars, total ratings, total reviews, RAM, storage capacity, expandable storage option, display, camera, battery, processor, warranty, and price.

Inspect the Tags

Usually, the data is embedded in tags, so we need to inspect the page to see which tag holds the data we wish to extract. To inspect a page, just right-click on an element and choose ‘Inspect’.

We can use the a tag, specifically with class=_1fQZEK, to get all the boxes or cards, and then we can easily pull the information for every mobile phone out of those cards.
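As a quick sanity check (note that Flipkart’s generated class names change over time, so this class may need updating when you run the code), confirm the selector actually matches the cards on the page:

# A quick check that the card selector matches the results on the page
results = soup.find_all('a', {'class': "_1fQZEK"})
print(len(results))  # expected: 24 cards on a full results page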

Prototype a Single Record

The first page lists 24 mobile phones, so let’s pick the first one.

# Picking the 1st card from the complete list of cards
item = results[0]

To scrape the phone model, we find a div tag with class=_4rR01T.

# Extracting the model of the phone from the 1st card
model = item.find('div', {'class': "_4rR01T"}).text
model
'REDMI 9i (Nature Green, 64 GB)'

Similarly, to get the stars given by users to a mobile phone, we find a div tag with class=_3LWZlK.

# Extracting Stars from the 1st card
star = item.find('div', {'class': "_3LWZlK"}).text
star
'4.3'

Now, scrape the remaining fields by finding their tags in the same way.

# Extracting the number of ratings and reviews from the 1st card; both live
# in the same span, separated by '\xa0&\xa0', so we replace that separator first
ratings_reviews = item.find('span', {'class': "_2_R_DZ"}).text.replace('\xa0&\xa0', " ; ")

num_ratings = ratings_reviews[0:ratings_reviews.find(';')].strip()
num_ratings
'4,06,452 Ratings'

reviews = ratings_reviews[ratings_reviews.find(';')+1:].strip()
reviews
'23,336 Reviews'

# Extracting RAM, storage/ROM and expandability from the 1st card; all three
# sit in the same li tag, separated by '|'
specs = item.find('li', {'class': "rgWa7D"}).text

ram = specs[0:specs.find('|')]
ram
'4 GB RAM '

storage = specs[specs.find('|')+1:][0:10].strip()
storage
'64 GB ROM'

# Extracting whether there is an option of expanding the storage or not
expandable = specs[specs.find('|')+1:][13:]
expandable
'Expandable Upto 512 GB'

# Extracting the display option from the 1st card
display = item.find_all('li')[1].text.strip()
display
'16.59 cm (6.53 inch) HD+ Display'

# Extracting the camera options from the 1st card
camera = item.find_all('li')[2].text.strip()
camera
'13MP Rear Camera | 5MP Front Camera'

# Extracting the battery option from the 1st card
battery = item.find_all('li')[3].text
battery
'5000 mAh Lithium Polymer Battery'

# Extracting the processor option from the 1st card
processor = item.find_all('li')[4].text.strip()
processor
'MediaTek Helio G25 Processor'

# Extracting the warranty from the 1st card
warranty = item.find_all('li')[-1].text.strip()
warranty
'Brand Warranty of 1 Year Available for Mobile and 6 Months for Accessories'

# Extracting the price of the model from the 1st card
price = item.find('div', {'class': '_30jeq3 _1_WHN1'}).text

Generalize a Pattern

It’s time to write a function that extracts all of these fields from a single card.

def extract_phone_model_info(item):
    """
    This function extracts the model, stars, number of ratings, number of
    reviews, RAM, storage, expandable-storage option, display, camera,
    battery, processor, warranty and price of a phone model at Flipkart
    """
    # Extracting the model of the phone
    model = item.find('div', {'class': "_4rR01T"}).text
    # Extracting the stars
    star = item.find('div', {'class': "_3LWZlK"}).text
    # Extracting the number of ratings and reviews
    ratings_reviews = item.find('span', {'class': "_2_R_DZ"}).text.replace('\xa0&\xa0', " ; ")
    num_ratings = ratings_reviews[0:ratings_reviews.find(';')].strip()
    reviews = ratings_reviews[ratings_reviews.find(';')+1:].strip()
    # Extracting RAM, storage/ROM and the expandable-storage option
    specs = item.find('li', {'class': "rgWa7D"}).text
    ram = specs[0:specs.find('|')]
    storage = specs[specs.find('|')+1:][0:10].strip()
    expandable = specs[specs.find('|')+1:][13:]
    # Extracting the display option
    display = item.find_all('li')[1].text.strip()
    # Extracting the camera options
    camera = item.find_all('li')[2].text.strip()
    # Extracting the battery option
    battery = item.find_all('li')[3].text
    # Extracting the processor option
    processor = item.find_all('li')[4].text.strip()
    # Extracting the warranty
    warranty = item.find_all('li')[-1].text.strip()
    # Extracting the price of the model
    price = item.find('div', {'class': '_30jeq3 _1_WHN1'}).text

    result = (model, star, num_ratings, reviews, ram, storage, expandable,
              display, camera, battery, processor, warranty, price)
    return result

Now let’s gather the data from all the cards on a single page into one list.

# Now extracting the information from all the cards/phone models
# and putting them into a list
records_list = []
results = soup.find_all('a', {'class': "_1fQZEK"})
for item in results:
    records_list.append(extract_phone_model_info(item))

Let’s see what the DataFrame looks like for the first page by creating it from the list above.

pd.DataFrame(records_list, columns=['model', 'star', 'num_ratings', 'reviews', 'ram',
                                    'storage', 'expandable', 'display', 'camera',
                                    'battery', 'processor', 'warranty', 'price'])
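If you prefer, that DataFrame can also be written straight to disk with pandas instead of the csv module used later; a short alternative (the filename here is just an example):

# Saving the first page's records directly with pandas (alternative to the csv module)
df = pd.DataFrame(records_list, columns=['model', 'star', 'num_ratings', 'reviews', 'ram',
                                         'storage', 'expandable', 'display', 'camera',
                                         'battery', 'processor', 'warranty', 'price'])
df.to_csv('Flipkart_page1.csv', index=False)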


Navigating to the Next Pages

Next, we modify our get_url() function so that it can help us fetch data from multiple result pages.

def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search,
    with a placeholder for the page number
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    search_item = search_item.replace(" ", "+")
    # Add search term to the URL
    url = template.format(search_item)
    # Add a placeholder for the page number (note the '=' so that
    # Flipkart receives a proper page query parameter)
    url += '&page={}'
    return url
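To confirm the placeholder works, format the returned URL with a page number (a quick check; note the page parameter at the end):

# The returned URL still contains a '{}' placeholder for the page number
url = get_url('mobile phones')
print(url.format(2))  # ...&as-backfill=on&page=2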

Put All Pieces Together

Let’s now combine everything we have built so far. At the end, we’ll write a main function that takes a search query and gives us a dataset scraped from 464 result pages, covering around 11,000 mobile phones.

# Importing necessary Libraries
import csv
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd


def get_url(search_item):
    '''
    This function fetches the URL of the item that you want to search,
    with a placeholder for the page number
    '''
    template = 'https://www.flipkart.com/search?q={}&as=on&as-show=on&otracker=AS_Query_HistoryAutoSuggest_1_4_na_na_na&otracker1=AS_Query_HistoryAutoSuggest_1_4_na_na_na&as-pos=1&as-type=HISTORY&suggestionId=mobile+phones&requestId=e625b409-ca2a-456a-b53c-0fdb7618b658&as-backfill=on'
    search_item = search_item.replace(" ", "+")
    # Add search term to the URL
    url = template.format(search_item)
    # Add a placeholder for the page number
    url += '&page={}'
    return url


def extract_phone_model_info(item):
    """
    This function extracts the model, stars, number of ratings, number of
    reviews, RAM, storage, expandable-storage option, display, camera,
    battery, processor, warranty and price of a phone model at Flipkart
    """
    # Extracting the model of the phone
    model = item.find('div', {'class': "_4rR01T"}).text
    # Extracting the stars
    star = item.find('div', {'class': "_3LWZlK"}).text
    # Extracting the number of ratings and reviews
    ratings_reviews = item.find('span', {'class': "_2_R_DZ"}).text.replace('\xa0&\xa0', " ; ")
    num_ratings = ratings_reviews[0:ratings_reviews.find(';')].strip()
    reviews = ratings_reviews[ratings_reviews.find(';')+1:].strip()
    # Extracting RAM, storage/ROM and the expandable-storage option
    specs = item.find('li', {'class': "rgWa7D"}).text
    ram = specs[0:specs.find('|')]
    storage = specs[specs.find('|')+1:][0:10].strip()
    expandable = specs[specs.find('|')+1:][13:]
    # Extracting the display option
    display = item.find_all('li')[1].text.strip()
    # Extracting the camera options
    camera = item.find_all('li')[2].text.strip()
    # Extracting the battery option
    battery = item.find_all('li')[3].text
    # Extracting the processor option
    processor = item.find_all('li')[4].text.strip()
    # Extracting the warranty
    warranty = item.find_all('li')[-1].text.strip()
    # Extracting the price of the model
    price = item.find('div', {'class': '_30jeq3 _1_WHN1'}).text

    result = (model, star, num_ratings, reviews, ram, storage, expandable,
              display, camera, battery, processor, warranty, price)
    return result


def main(search_item):
    '''
    This function collects the details of all the phone models across the
    result pages and saves them into a CSV file
    '''
    driver = webdriver.Chrome()
    records = []
    url = get_url(search_item)
    for page in range(1, 464):
        driver.get(url.format(page))
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        results = soup.find_all('a', {'class': "_1fQZEK"})
        for item in results:
            records.append(extract_phone_model_info(item))
    driver.close()

    # Saving the data into a csv file
    with open('Flipkart_results.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Model', 'Stars', 'Num_of_Ratings', 'Reviews', 'Ram',
                         'Storage', 'Expandable', 'Display', 'Camera', 'Battery',
                         'Processor', 'Warranty', 'Price'])
        writer.writerows(records)

Now run the main function to scrape the data for all the mobile phones across the result pages.

%%time
main('mobile phones')

Wall time: 40min 54s

View the data

Let’s see what the result looks like.

scraped_df = pd.read_csv('C:\\Users\\DELL\\Desktop\\Jupyter Notebook\\Jovian Web Scraping\\Amazon Products Web Scrapper\\Flipkart_results.csv')
scraped_df.head()

Originally published at https://www.xbyte.io.
