Introduction

As a software developer, I love building what I hope will become great software. Whether it be an application running locally on a machine, a service running on another machine that I connect to remotely, or a combination of the two (the most fun and challenging), software development is a process that, whilst DEEPLY FRUSTRATING, can yield results that fill you with euphoria 🙉! You just need to power through lots of mistakes and failed tests, and spot that one line breaking everything 🤣!

With the seeming increase in high-profile breaches at large, well-known companies, it has become apparent that security is something software developers need to consider seriously when designing architecture and writing lines of code. One bad access point could spill sensitive information, leaving you to put out a carefully worded apology-like non-apology and scrambling to undo the damage done.

Generate Your Own (Uncaring) Breach Notices with Ease!

Fortunately, there are many free guidelines out there to help us along the way, including those published by OWASP, NIST and the NCSC. Unfortunately, many developers and website owners aren't keeping up to date with the latest guidance, and due to their desire to be indexed by search engines (so users can find them), their sites fall victim to a silly mistake which inadvertently exposes the sensitive areas of a site to the PUBLIC internet for ALL to see. Here I explain the how and the why, and detail measures you can take to protect yourself.

iPad with Google on the Safari browser ready to surf the web

Spider-Man, Spider-Man, Crawls Wherever A Spider Can 🕷️🕸️

To understand how the problem eventually shows itself, we need to look at the behaviours that build up to such varying levels of catastrophe. Many site owners want their websites to be indexed by search engines so people can find their content. Search engines do this by sending out software known as Spiders to crawl the web, searching for content and following links. Nowadays, lots of users avoid typing in URLs directly (like heyjournal.com 😉), and instead fire up a search engine like Google, Bing, Yahoo or DuckDuckGo to search for content and select the best result(s) that suit their criteria. Therefore, there is a strong incentive to make the CONTENT pages of your site highly visible and ranked as high as possible for maximum reach.

However, many sites don't just have public CONTENT pages for search engines to index and users to consume. They may have generic assets that they don't want search engines to index. They may also have private pages intended for the eyes 👀 of only a select few people, as well as admin pages to log in to and manage content. They may even have sensitive files which, in the wrong hands, could prove deeply embarrassing 😬. Because of these classifications of assets, it is understandable that website owners do NOT want search engines indexing this content for the world to search and retrieve on a whim. "Okay Google, bring me up the sensitive health records of John Smith"! How quickly can you say... GDPR fine 🙈?!

This is where a special file comes into play known as robots.txt, which is ALWAYS stored at my-domain/robots.txt (so all tools know where to look). The role of this file is to tell the creepy Spiders which parts of a website they are *EXPECTED* to visit and which they are *EXPECTED* to avoid.
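For illustration, a minimal robots.txt might look something like this (the paths and sitemap URL here are made up):

    User-agent: *
    Disallow: /drafts/
    Allow: /

    Sitemap: https://www.example.com/sitemap.xml

The 'User-agent' line names the Spider being addressed ('*' covers all of them), 'Disallow' and 'Allow' mark the areas it is *EXPECTED* to avoid or welcome to visit, and 'Sitemap' points to a list of the pages you'd like indexed.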
The file also outlines which Spiders are *EXPECTED* to crawl the site (whether the site owner wants Google, Bing, Yahoo etc to crawl it), as well as the location of a sitemap, should one exist. I have highlighted the term *EXPECTED* because you should treat these directives as a *SUGGESTION ONLY*. There is *NOTHING* this file can do to ACTUALLY stop Spiders from ignoring it and crawling around your site. Search engines do say they adhere to the rules you detail in the robots.txt file, but remember that they aren't the only arachnids in town 😲.

About /robots.txt

A beautiful web for Spiders to investigate

Hasty and Ill-Judged Behaviour

Spiders are as fantastic at discovering content as Sherlock Holmes is at discovering clues. They are, however, almost clueless when it comes to determining the sensitivity of the information they find, meaning that if they crawl sensitive content, they will make it available in search results irrespective of whether it's someone's card details or private journal. We as people can look at information and judge whether or not it should be public, but Spiders are nowhere near as good at this, and work on the assumption that if it is online and public, it should be searchable. This has led to some sizeable data leaks, with misconfigured sites being made available to the public internet and duly indexed by search engines.

Huge Data Leak at Largest U.S. Bond Insurer

Dozens of Companies Leaked Sensitive Data Thanks to Misconfigured Box Accounts

Regus Spills Data of 900 Staff on Trello Board Set to ‘Public’

Sprint Exposed Customer Support Site to Web

To combat this, many companies turn to the robots.txt file and make use of its 'Disallow' directive to inform Spiders of the regions of the site they *SHOULDN'T* crawl. This will likely stop such sensitive pages being indexed by the likes of Google, Bing and Yahoo (hooray 🥳), but it brings with it another issue altogether. You see, the robots.txt file is standardised (always located at my-domain/robots.txt), PUBLIC, intended to be read by machines, and *VERY HUMAN READABLE*. This means that *ANYONE*, or *ANY MACHINE*, can read it and determine the exact places on your site that you don't want them to look!

Robots.txt Tells Hackers the Places You Don't Want Them to Look

Please Take Care when Putting Things on the Internet!

Using the 'Disallow' directive in your robots.txt file to hide important, unsecured portions of your site is like putting £5m in cash in an unlocked, unguarded, unattended shack on the side of a busy city street, with a sign next to it saying "no entry, valuable items inside"! If your site has monitoring, that is the equivalent of installing CCTV cameras pointing at the door. It will likely deter many people, but at the same time pique the interest of plenty of others! Would you even consider using this method to store that amount of money 🙈?! Also, thinking about human nature, I know that if people are told not to look somewhere, many would do just the opposite and DEFINITELY look there! You just need one opportunist... 😅

A cheeky monkey about to disobey any robots.txt 'Disallow' directives!

I have attached an example file called 'Robots REVEALING.txt' which clearly illustrates my point.
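To give a flavour of what I mean (the paths below are invented for illustration rather than lifted from the attachment), a revealing robots.txt might look something like this:

    User-agent: *
    Disallow: /admin/
    Disallow: /backups/
    Disallow: /internal-reports/
    Disallow: /customer-data/

Each of those 'Disallow' lines reads like an invitation: an attacker now has a neat list of the most sensitive corners of the site to try first.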
You should definitely avoid files like this, as they advertise your weak points for attackers to exploit.

A Better Way

I think it's safe to say that none of us really want to leave sensitive information unprotected on the public internet, let alone paint a MASSIVE target on it to draw in the bad people! I always operate with the mindset that if you put a resource online, someone will DEFINITELY find it, so protect everything that you can. With this in mind, here are a few things we can do to avoid disaster!

Wherever possible, avoid putting things on the internet that don't need to be there, and route resources like administrator panels and sensitive files through internal, controlled channels instead. Does every Tom, Richard and Sally really need to access such sensitive content? Likely not, so practise a deny-by-default approach. If it isn't available, malicious hackers can't access it.

Protect your sensitive resources with authentication (preferably multi-factor) and authorisation, so that people need to log in with valid (strong) credentials and have the required permissions. As credentials are private and known only to your trusted users, Spiders won't be able to index the pages and malicious hackers won't be able to steal valuable information.

Limit access to sensitive areas of your site to ONLY specific IP addresses that YOU control. This means that any access from machines outside your IP address list will be actively blocked by your site, leading to no indexing by Spiders and a harder time for attackers.

As you should now be using HTTPS to secure EVERY resource on your site, for your more sensitive assets you can enforce Mutual TLS to limit connections to trusted machines. Here, any accessor would need to provide a valid certificate signed by your trusted Certificate Authority (CA), otherwise your server will reject the connection attempt. You can use the free, open-source and EXCELLENT OpenSSL to act as your own CA and issue certificates (I've sketched this out in a P.S. at the end).

Using digital techniques to lock up your online assets

By implementing these measures, every time a Spider tries to crawl these protected pages, it will receive a BAD response that prevents it from finding what it needs, which in turn stops search engines from indexing those pages. Now there is no need to use the 'Disallow' directive, and you can use a file like the attached 'Robots SAFE.txt' on your site, which tells ALL Spiders they can crawl ALL areas of your site. As an added bonus, the lack of specific 'Disallowed' areas will make it just that little bit more difficult for attackers to find interesting URLs 😎. This is far better than having a massive sign saying "you can hack me here"!

More Secure and Safer Sites

Most site owners want their sites to be discoverable whilst at the same time free from hacks, leaks and breaches. The robots.txt file is great for guiding Spiders to the areas you want search engines to index using the 'Allow' directive, and, thanks to the 'Disallow' directive, just as excellent at guiding those who want to test your security. Instead, I recommend allowing Spiders to crawl your entire site and using actual security mechanisms to keep pages from being searchable on Google, Bing, Yahoo etc. This comes with the added bonus of not being an easy target for attackers! This way you will be taking the privacy and security of your data seriously, and your users can browse free of worry 😀.

Take care and all the best, Si.
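P.S. If you fancy trying the Mutual TLS route, here is a rough sketch of using OpenSSL to act as your own CA and issue a certificate to a trusted machine. The file names and subject names are illustrative, and you'll still need to configure your server to require client certificates:

    # Create a private key and self-signed root certificate for your own CA
    openssl genrsa -out ca.key 4096
    openssl req -x509 -new -key ca.key -sha256 -days 365 -subj "/CN=My Private CA" -out ca.crt

    # Create a key and certificate signing request (CSR) for the trusted machine
    openssl genrsa -out client.key 2048
    openssl req -new -key client.key -subj "/CN=trusted-machine" -out client.csr

    # Sign the CSR with your CA to issue the client certificate
    openssl x509 -req -in client.csr -CA ca.crt -CAkey ca.key -CAcreateserial -sha256 -days 365 -out client.crt

The trusted machine then presents client.crt (and proves it holds client.key) during the TLS handshake, and your server, configured to trust ca.crt, rejects anyone who can't do the same.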