Robots.txt and special files, the rules to avoid mistakes
Let’s go back to SEO basics with the new episode of #askGoogleWebmaster, the series in which John Mueller answers a question from the SEO community. The latest episode covers the robots.txt file and, to be precise, the best practices for handling certain types of files and extensions, such as .css and .htaccess. The Googler explains the right approach in these cases: whether it is better to let Googlebot access those files or to block it from crawling them.
Using the disallow for special files?
It all starts, as mentioned, with a question from a user, who asks Google’s Senior Webmaster Trends Analyst how to handle the robots.txt file and “whether to disallow files like /*.css$, /php.ini and /.htaccess”, and therefore, more generally, how to manage these special files.
John Mueller first answers with his usual irony, saying that he cannot stop anyone from blocking such files (literally, “I can’t disallow you from disallowing those files”), then goes into a little more detail and offers his real opinion: that approach seems to be “a bad idea”.
The negative effects of unwanted blocks
In some cases, disallowing special files is simply redundant and therefore unnecessary, but in other circumstances it could seriously compromise Googlebot’s ability to crawl a site, with all the negative effects that follow.
The approach the user has in mind risks damaging Googlebot’s ability to crawl the site, and therefore affecting how the pages are understood, how correctly they are indexed and, last but not least, how they rank.
What the disallow on special files means
Mueller quickly explains what it means to proceed with that disallow and what may be the consequences for Googlebot and the site.
- disallow: /*.css$
would deny access to all CSS files. Google, however, must be able to access CSS files so that it can render the pages of the site correctly. This is crucial, for example, for recognizing when a page is optimized for mobile devices. The Googler adds that CSS files are generally not indexed on their own, but Google must be able to crawl them.
So, if site owners and webmasters want to disallow CSS files to keep them from being indexed, Mueller reassures them that this usually does not happen. On the contrary, blocking them makes life harder for Google, which needs the files regardless. And even if a CSS file does end up being indexed, it will not damage the site (or at least it will do less harm than blocking the file would).
- disallow: /php.ini
php.ini is a configuration file for PHP. In general, this file should be locked down or stored in a special location so that no one can access it, which means that Googlebot cannot access it either. So, forbidding the crawling of /php.ini in the robots.txt file is simply redundant and unnecessary.
- disallow: /.htaccess
as in the previous case, .htaccess is a special control file that is blocked by default and therefore cannot be accessed externally, not even by Googlebot. There is no need to disallow it explicitly, because the bot cannot access or crawl it anyway.
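To see what the three rules discussed above would actually match, here is a minimal Python sketch of the wildcard matching that Google documents for robots.txt paths (`*` matches any sequence of characters, a trailing `$` anchors the end of the URL path). The helper names `rule_to_regex` and `is_blocked` and the sample paths are hypothetical and exist only for this illustration. It ignores Allow rules, user-agent groups and rule precedence, so it is not a full robots.txt parser.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate one Disallow path pattern into a regular expression.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the path; everything else is matched literally.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any Disallow pattern matches the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

# The three rules from the user's question.
RULES = ["/*.css$", "/php.ini", "/.htaccess"]

for path in ["/assets/site.css", "/assets/site.css?v=2", "/index.html", "/php.ini"]:
    print(path, "blocked" if is_blocked(path, RULES) else "allowed")
```

In this sketch the stylesheet at /assets/site.css is blocked, which is exactly the outcome Mueller warns about, while the /php.ini and /.htaccess rules only cover files that a correctly configured server would not serve in the first place.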
Do not use a Robots.txt file copied from another site
Before concluding the video, John Mueller offers some thoughts and a precise suggestion for proper management of the robots.txt file.
The message is clear: do not recklessly copy and reuse a robots.txt file from another site, simply assuming that it will work for your own. The best way to avoid errors is to think carefully about “which parts of your site you want to avoid scanning” and then use disallow accordingly to keep Googlebot out of them.
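As a rough illustration of that advice, the following sketch uses Python’s standard `urllib.robotparser` module to check a deliberately small robots.txt before publishing it. The disallowed sections (/cart/ and /internal-search/) and the example.com URLs are assumptions made up for this example, not recommendations from the video. Note that this standard-library parser does simple prefix matching and does not interpret the `*` and `$` wildcards that Googlebot understands, so it is only suitable for plain path rules like these.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block only the sections we chose on purpose.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /internal-search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def googlebot_can_fetch(url: str) -> bool:
    """Check whether the rules above let a crawler named Googlebot fetch the URL."""
    return parser.can_fetch("Googlebot", url)

# Resources needed for rendering should stay crawlable.
print(googlebot_can_fetch("https://example.com/assets/site.css"))
# Sections we decided to keep out of crawling should be blocked.
print(googlebot_can_fetch("https://example.com/cart/checkout"))
```

A quick check like this makes it easier to notice when a rule blocks more than intended, for example a stylesheet or script that pages need in order to render.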