New Instagram module to avoid bans

We now use instagrapi ( https://github.com/adw0rd/instagrapi ), a more stable API for scraping data from Instagram. It requires logging in with a username and password, which must be set as environment variables before execution.
The instagram.py module has been rewritten from scratch, although it still follows the module structure used across the project.
I added a new method called *download(url, filename)* to utils.py to download any file type, since instagrapi doesn't come with a method to download media.
I also updated the requirements and the documentation to reflect the changes to the Instagram module.
The old Instagram module is still available as instagram_old.py; if I find no issues after a few weeks of testing the new module, I will delete it.
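Since the login depends on two environment variables, a missing one only shows up when the login call fails. A minimal sketch of a fail-fast check that could run before `cl.login(...)`; the helper name `load_credentials` is hypothetical and is not part of this commit, while the variable names are the ones introduced here:

```python
import os


def load_credentials(env=os.environ):
    """Return (username, password), or raise if either variable is unset."""
    names = ("INSTAGRAM_ACCOUNT_USERNAME", "INSTAGRAM_ACCOUNT_PASSWORD")
    values = [env.get(name) for name in names]
    # Collect every missing or empty variable so the error names all of them.
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values[0], values[1]
```

The returned pair could then be passed to `cl.login(username, password)`, so a misconfigured runner stops with a clear message instead of a login error.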
marco97pa · on Nov 22 · 7f006a1f7660cc98df16ef4bdef31d2ff09fbab5 · 7f006a1
1 parent 820ac5d
Showing with 250 additions and 70 deletions.

+2 −2 .github/workflows/main.yml

+3 −2 README.md

+5 −16 docs/source/index.rst

+67 −47 instagram.py

+154 −0 instagram_old.py

+3 −3 requirements.txt

+16 −0 utils.py
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -3,8 +3,8 @@
 name: Runner
 
 on:
-  # schedule:
-  #  - cron: "*/10 * * * *" # Runs every 10 minutes, everyday (actual timing may vary, depending on GitHub)
+  schedule:
+   - cron: "*/10 * * * *" # Runs every 10 minutes, everyday (actual timing may vary, depending on GitHub)
   workflow_dispatch:
 
 jobs:

diff --git a/README.md b/README.md
@@ -78,7 +78,7 @@ The project is really modular and by editing the YAML file you can easily fork t
 * Python 3
 * [Tweepy](https://pypi.org/project/tweepy/)
 * [pillow](https://pypi.org/project/Pillow/)
-* [insta-scrape](https://pypi.org/project/insta-scrape/)
+* [instagrapi](https://github.com/adw0rd/instagrapi)
 * [python-youtube](https://pypi.org/project/python-youtube/)
 * [spotipy](https://github.com/plamere/spotipy)
 * [billboard.py](https://github.com/guoguo12/billboard-charts)
@@ -119,7 +119,8 @@ in that case all the python command will be 'python' instead of 'python3' and 'p
 
     export YOUTUBE_API_KEY='xxxx'
 
-    export INSTAGRAM_SESSION_ID='xxxx'
+    export INSTAGRAM_ACCOUNT_USERNAME='xxxxxx'
+    export INSTAGRAM_ACCOUNT_PASSWORD='xxxxxx'
 
     export SPOTIPY_CLIENT_ID='xxxx'
     export SPOTIPY_CLIENT_SECRET='xxxx'

diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -92,24 +92,13 @@ keys. Then set them as environment variables, by running these lines:
 ``export SPOTIPY_CLIENT_ID='xxxx'``
 ``export SPOTIPY_CLIENT_SECRET='xxxx'``
 
-Instagram SESSION_ID cookie
+Instagram USERNAME and PASSWORD
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-You can get your Instagram **Session ID** if you are logged in to
-Instagram using any browser. This guide uses Google Chrome.
-
-1. Open Google Chrome
-2. Go to `Instagram`_ and log in to your account
-3. Right click somewhere on the webpage and select **Inspect** on the
-   dropdown menu to open Chrome Developer tools
-4. Click the **Application** tab under the Chrome Developer Tools window
-5. Under the **storage** header on the left-hand menu, expand
-   **Cookies** and click on the entry for ``https://www.instagram.com/``
-6. Find the row with the **Name** equal to ``sessionid``: this is your
-   **Session ID**
-
-| Then set it as environment variable by running:
-| ``export INSTAGRAM_SESSION_ID='xxxxx'``
+You can set your username and password like this:
+``export INSTAGRAM_ACCOUNT_USERNAME='xxxxxx'``
+``export INSTAGRAM_ACCOUNT_PASSWORD='xxxxxx'``
+
 
 Fork
 --------------------

diff --git a/instagram.py b/instagram.py
@@ -1,17 +1,21 @@
 #!/usr/bin/env python3
 import os
-from instascrape import *
-from utils import display_num, convert_num, download_image
+from instagrapi import Client
+from PIL import Image
+import requests
+from utils import display_num, convert_num, download
 from tweet import twitter_post, twitter_post_image
-from random import randint
-from time import sleep
+
+module = "Instagram"
 
 # Get Instagram cookies
-instagram_sessionid = os.environ.get('INSTAGRAM_SESSION_ID')
-headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
-"cookie": f"sessionid={instagram_sessionid};"}
+ACCOUNT_USERNAME = os.environ.get('INSTAGRAM_ACCOUNT_USERNAME')
+ACCOUNT_PASSWORD = os.environ.get('INSTAGRAM_ACCOUNT_PASSWORD')
+
+# Login
+cl = Client()
+cl.login(ACCOUNT_USERNAME, ACCOUNT_PASSWORD)
 
-module = "Instagram"
 
 def instagram_data(group):
     """Runs all the Instagram related tasks
@@ -26,26 +30,23 @@ def instagram_data(group):
     """
 
     print("[{}] Starting tasks...".format(module))
-    group, ig_profile = instagram_profile(group)
-    wait_random()
-    group = instagram_last_post(group, ig_profile)
+    group, user_id = instagram_profile(group)
+    group = instagram_last_post(group, user_id)
 
     for artist in group["members"]:
-        wait_random()
-        artist, ig_profile = instagram_profile(artist)
-        wait_random()
-        artist = instagram_last_post(artist, ig_profile)
+        artist, user_id = instagram_profile(artist)
+        artist = instagram_last_post(artist, user_id)
 
     print()
     return group
 
-def instagram_last_post(artist, profile):
+def instagram_last_post(artist, user_id):
     """Gets the last post of a profile
 
     It tweets if there is a new post: if the timestamp of the latest stored post does not match with the latest fetched posts timestamp
 
     Args:
-      - profile: a Profile instance, already scraped
+      - user_id: a profile ID
       - artist: a dictionary with all the details of the artist
 
     Returns:
@@ -54,32 +55,33 @@ def instagram_last_post(artist, profile):
 
     print("[{}] ({}) Fetching new posts".format(module, artist["instagram"]["url"][26:-1]))
 
-    recents = profile.get_recent_posts()
+    medias = cl.user_medias(user_id, 5)
 
-    for recent in recents:
-      recent.scrape(headers=headers)
+    for media in medias:
       # If the last post timestamp is greater (post is newest) or the saved post does not exist
-      if "last_post" not in artist["instagram"] or recent.timestamp > artist["instagram"]["last_post"]["timestamp"]:
-        url = "https://www.instagram.com/p/" + recent.shortcode
-        if recent.is_video:
+      if "last_post" not in artist["instagram"] or media.taken_at.timestamp() > artist["instagram"]["last_post"]["timestamp"]:
+        url = "https://www.instagram.com/p/" + media.code
+        if media.resources[0].media_type == 2:
             content_type = "video"
             filename = "temp.mp4"
+            source = media.resources[0].video_url
         else:
             content_type = "photo"
             filename = "temp.jpg"
-        recent.download(filename)
+            source = media.resources[0].thumbnail_url
+        download(source, filename)
         twitter_post_image(
-            "{} posted a new {} on #Instagram:\n{}\n{}\n{}\n\n{}".format(artist["name"], content_type, clean_caption(recent.caption), recent.timestamp, url, artist["hashtags"]),
+            "{} posted a new {} on #Instagram:\n{}\n{}\n{}\n\n{}".format(artist["name"], content_type, clean_caption(media.caption_text), media.taken_at.timestamp(), url, artist["hashtags"]),
             filename,
             None
         )
       else:
         break
 
     last_post = {}
-    last_post["url"] = "https://www.instagram.com/p/" + recents[0].shortcode
-    last_post["caption"] = recents[0].caption
-    last_post["timestamp"] = recents[0].timestamp
+    last_post["url"] = "https://www.instagram.com/p/" + medias[0].code
+    last_post["caption"] = medias[0].caption_text
+    last_post["timestamp"] = medias[0].taken_at.timestamp()
 
     artist["instagram"]["last_post"] = last_post
 
@@ -95,34 +97,35 @@ def instagram_profile(artist):
 
     Returns:
       - an dictionary containing all the updated data of the artist
-      - a Profile instance
+      - a Profile ID
     """
+    username = artist["instagram"]["url"][26:-1]
 
-    print("[{}] ({}) Fetching profile details".format(module, artist["instagram"]["url"][26:-1]))
+    print("[{}] ({}) Fetching profile details".format(module, username))
 
-    profile = Profile(artist["instagram"]["url"])
-    profile = profile.scrape(headers=headers, inplace=False)
-    artist["instagram"]["posts"] = profile.posts
+    user_id = cl.user_id_from_username(username)
+    info = cl.user_info(user_id)
+    artist["instagram"]["posts"] = info.media_count
     # Update profile pic
-    artist["instagram"]["image"] = profile.profile_pic_url_hd
+    artist["instagram"]["image"] = info.profile_pic_url
 
     # Add followers if never happened before
     if "followers" not in artist["instagram"]:
-      artist["instagram"]["followers"] = profile.followers
+      artist["instagram"]["followers"] = info.follower_count
 
     # Update followers only if there is an increase (fixes https://github.com/marco97pa/Blackpink-Data/issues/11)
-    if profile.followers > artist["instagram"]["followers"]:
-        print("[{}] ({}) Followers increased {} --> {}".format(module, artist["instagram"]["url"][26:-1], artist["instagram"]["followers"], profile.followers))
-        if convert_num("M", artist["instagram"]["followers"]) != convert_num("M", profile.followers):
+    if info.follower_count > artist["instagram"]["followers"]:
+        print("[{}] ({}) Followers increased {} --> {}".format(module, artist["instagram"]["url"][26:-1], artist["instagram"]["followers"], info.follower_count))
+        if convert_num("M", artist["instagram"]["followers"]) != convert_num("M", info.follower_count):
             twitter_post_image(
-                "{} reached {} followers on #Instagram\n{}".format(artist["name"], display_num(profile.followers), artist["hashtags"]),
-                download_image(artist["instagram"]["image"]),
-                display_num(profile.followers, short=True),
+                "{} reached {} followers on #Instagram\n{}".format(artist["name"], display_num(info.follower_count), artist["hashtags"]),
+                download_profile_pic(artist["instagram"]["image"]),
+                display_num(info.follower_count, short=True),
                 text_size=50
                 )
-        artist["instagram"]["followers"] = profile.followers
+        artist["instagram"]["followers"] = info.follower_count
 
-    return artist, profile
+    return artist, user_id
 
 def clean_caption(caption):
     """Removes unnecessary parts of an Instagram post caption
@@ -148,7 +151,24 @@ def clean_caption(caption):
 
     return clean[:90]
 
-def wait_random():
-  sleeptime = randint(5,50)
-  print("Sleeping for {} sec".format(sleeptime))
-  sleep(sleeptime)
+def download_profile_pic(url):
+    """Downloads an image, given a URL
+
+    The image is saved in the download.jpg file
+
+    Args:
+      url: URL to download the image from
+    """
+
+    filename = "download.jpg"
+    response = requests.get(url)
+
+    file = open(filename, "wb")
+    file.write(response.content)
+    file.close()
+
+    img = Image.open(filename)
+    img = img.resize((400, 400), Image.LANCZOS)
+    img.save(filename)
+
+    return filename
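In this diff, instagram_last_post reads the first entry of `media.resources` to choose between video and photo. A hedged sketch of a helper that also copes with posts whose `resources` list is empty. It assumes that instagrapi reports `media_type` as an integer (2 for video), that only albums populate `resources`, and that single posts expose `video_url` / `thumbnail_url` on the media object itself. The name `pick_source` is hypothetical, and `SimpleNamespace` stands in for an instagrapi object so the sketch runs without a login:

```python
from types import SimpleNamespace

VIDEO = 2  # assumed instagrapi media_type code for videos


def pick_source(media):
    """Return (content_type, filename, url) for the first item of a post."""
    # Use the first album item when there is one, otherwise the post itself.
    resources = getattr(media, "resources", None) or []
    item = resources[0] if resources else media
    if item.media_type == VIDEO:
        return "video", "temp.mp4", str(item.video_url)
    return "photo", "temp.jpg", str(item.thumbnail_url)


# Stand-in for a single-video post with no album resources.
post = SimpleNamespace(
    media_type=VIDEO,
    resources=[],
    video_url="https://example.com/clip.mp4",
    thumbnail_url="https://example.com/clip.jpg",
)
print(pick_source(post))
```

The tuple mirrors the three variables the loop already sets (`content_type`, `filename`, `source`), so it could feed `download(source, filename)` unchanged.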
diff --git a/instagram_old.py b/instagram_old.py
@@ -0,0 +1,154 @@
+#!/usr/bin/env python3
+import os
+from instascrape import *
+from utils import display_num, convert_num, download_image
+from tweet import twitter_post, twitter_post_image
+from random import randint
+from time import sleep
+
+# Get Instagram cookies
+instagram_sessionid = os.environ.get('INSTAGRAM_SESSION_ID')
+headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
+"cookie": f"sessionid={instagram_sessionid};"}
+
+module = "Instagram"
+
+def instagram_data(group):
+    """Runs all the Instagram related tasks
+
+    It scrapes data from Instagram for the whole group and the single artists
+
+    Args:
+      group: dictionary with the data of the group to scrape
+
+    Returns:
+      the same group dictionary with updated data
+    """
+
+    print("[{}] Starting tasks...".format(module))
+    group, ig_profile = instagram_profile(group)
+    wait_random()
+    group = instagram_last_post(group, ig_profile)
+
+    for artist in group["members"]:
+        wait_random()
+        artist, ig_profile = instagram_profile(artist)
+        wait_random()
+        artist = instagram_last_post(artist, ig_profile)
+
+    print()
+    return group
+
+def instagram_last_post(artist, profile):
+    """Gets the last post of a profile
+
+    It tweets if there is a new post: if the timestamp of the latest stored post does not match with the latest fetched posts timestamp
+
+    Args:
+      - profile: a Profile instance, already scraped
+      - artist: a dictionary with all the details of the artist
+
+    Returns:
+      an dictionary containing all the updated data of the artist
+    """
+
+    print("[{}] ({}) Fetching new posts".format(module, artist["instagram"]["url"][26:-1]))
+
+    recents = profile.get_recent_posts()
+
+    for recent in recents:
+      recent.scrape(headers=headers)
+      # If the last post timestamp is greater (post is newest) or the saved post does not exist
+      if "last_post" not in artist["instagram"] or recent.timestamp > artist["instagram"]["last_post"]["timestamp"]:
+        url = "https://www.instagram.com/p/" + recent.shortcode
+        if recent.is_video:
+            content_type = "video"
+            filename = "temp.mp4"
+        else:
+            content_type = "photo"
+            filename = "temp.jpg"
+        recent.download(filename)
+        twitter_post_image(
+            "{} posted a new {} on #Instagram:\n{}\n{}\n{}\n\n{}".format(artist["name"], content_type, clean_caption(recent.caption), recent.timestamp, url, artist["hashtags"]),
+            filename,
+            None
+        )
+      else:
+        break
+
+    last_post = {}
+    last_post["url"] = "https://www.instagram.com/p/" + recents[0].shortcode
+    last_post["caption"] = recents[0].caption
+    last_post["timestamp"] = recents[0].timestamp
+
+    artist["instagram"]["last_post"] = last_post
+
+    return artist
+
+def instagram_profile(artist):
+    """Gets the details of an artist on Instagram
+
+    It tweets if the artist reaches a new followers goal
+
+    Args:
+      artist: a dictionary with all the details of the artist
+
+    Returns:
+      - an dictionary containing all the updated data of the artist
+      - a Profile instance
+    """
+
+    print("[{}] ({}) Fetching profile details".format(module, artist["instagram"]["url"][26:-1]))
+
+    profile = Profile(artist["instagram"]["url"])
+    profile = profile.scrape(headers=headers, inplace=False)
+    artist["instagram"]["posts"] = profile.posts
+    # Update profile pic
+    artist["instagram"]["image"] = profile.profile_pic_url_hd
+
+    # Add followers if never happened before
+    if "followers" not in artist["instagram"]:
+      artist["instagram"]["followers"] = profile.followers
+
+    # Update followers only if there is an increase (fixes https://github.com/marco97pa/Blackpink-Data/issues/11)
+    if profile.followers > artist["instagram"]["followers"]:
+        print("[{}] ({}) Followers increased {} --> {}".format(module, artist["instagram"]["url"][26:-1], artist["instagram"]["followers"], profile.followers))
+        if convert_num("M", artist["instagram"]["followers"]) != convert_num("M", profile.followers):
+            twitter_post_image(
+                "{} reached {} followers on #Instagram\n{}".format(artist["name"], display_num(profile.followers), artist["hashtags"]),
+                download_image(artist["instagram"]["image"]),
+                display_num(profile.followers, short=True),
+                text_size=50
+                )
+        artist["instagram"]["followers"] = profile.followers
+
+    return artist, profile
+
+def clean_caption(caption):
+    """Removes unnecessary parts of an Instagram post caption
+
+    It removes all the hashtags and converts tags in plain text (@marco97pa --> marco97pa)
+
+    Args:
+      caption: a text
+
+    Returns:
+      the same caption without hashtags and tags
+    """
+
+    clean = ""
+
+    words = caption.split()
+    for word in words:
+        if word[0] != "#":
+            if word[0] == "@":
+                clean += word[1:] + " "
+            else:
+                clean += word + " "
+
+    return clean[:90]
+
+def wait_random():
+  sleeptime = randint(5,50)
+  print("Sleeping for {} sec".format(sleeptime))
+  sleep(sleeptime)
diff --git a/requirements.txt b/requirements.txt
@@ -8,10 +8,10 @@ pyyaml
 # Needed to post or retrieve data from Twitter
 tweepy
 #
-# insta-scrape
-# https://pypi.org/project/insta-scrape/
+# instagrapi
+# https://adw0rd.github.io/instagrapi/
 # Needed to retrieve data from Instagram
-insta-scrape
+instagrapi
 #
 # python-youtube
 # https://pypi.org/project/python-youtube/

diff --git a/utils.py b/utils.py
@@ -88,3 +88,19 @@ def download_image(url):
     file.close()
 
     return filename
+
+def download(url, filename):
+    """Downloads a file, given a URL and a filename
+
+    Args:
+      url: URL to download the file from
+      filename: name of the file to save
+    """
+
+    response = requests.get(url)
+
+    file = open(filename, "wb")
+    file.write(response.content)
+    file.close()
+
+    return filename
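The new download(url, filename) helper loads the whole response into memory and writes whatever comes back, including an error page. A possible variant, sketched here with the standard library's urllib instead of requests so it runs without extra dependencies; the `timeout` parameter and its default are assumptions, not part of this commit:

```python
import shutil
import urllib.request


def download(url, filename, timeout=30):
    """Stream the content at `url` into `filename` and return the filename."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so the
    # target file is only opened once the request has succeeded.
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(filename, "wb") as file:
        # Copy in chunks instead of holding the whole video in memory.
        shutil.copyfileobj(response, file)
    return filename
```

The signature stays compatible with the call in instagram.py, `download(source, filename)`; with requests, a comparable approach would use `stream=True` together with `raise_for_status()`.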