Yak shaving to speed up creating VMs

tl;dr

Over the last few weeks I have been building VMs for VMware Fusion with Packer, which works really well (Packer, that is; VMware not so much).

This is just preparation for playing with Docker on CentOS, for which I needed to install a newer kernel (3.14 to be exact). The biggest hurdle was installing the VMware tools for the newer kernel. The “thinprint” service never started and the hgfs file system never came up either, so Vagrant could not share a folder with the VM (I solved that by using NFS).

While trying different ways to get everything working, I built one VM after another. Sitting in cafés or in my hotel room with abysmal internet access made that so much slower. Downloading all the extra RPMs, the newer kernel etc. challenged my patience.

rsync the repo?

So I came up with the idea to have all the RPMs locally. But mirroring a complete repository takes some 30 GB (and many hours to rsync). Tried that. Cancelled.

reverse proxy?

Next I wanted to use a proxy. I found some examples of how to configure squid, but they used it as a reverse proxy. Even after taking some time to read up on it, I couldn’t understand the configuration. Why a reverse proxy?
Normally a reverse proxy handles connections from the ‘outside’ to protect a slow backend. My server should still be able to access the internet in general, so this didn’t sound right (please use the comments to increase my knowledge).

configure yum

So I stepped back to using squid as a forward proxy. I found this description really straightforward to help me set up the yum configuration inside the VM, resulting in the following script:

  sed -i -e "s/^mirrorlist/#mirrorlist/" \
         -e "s%#baseurl=http://mirror.centos.org/centos%baseurl=http://centos.mirror-server.de%" \
         /etc/yum.repos.d/CentOS-Base.repo  

This comments out the mirrorlist (s/^mirrorlist/#mirrorlist/) and uncomments the baseurl, pointing it at a fixed mirror (I chose http://centos.mirror-server.de because it is close to me), resulting in:

#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
baseurl=http://centos.mirror-server.de/$releasever/os/$basearch/  
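To try the substitution without touching the real file, the same two sed expressions can be run against a scratch copy. This is only a sketch: the function name `use_fixed_mirror` and the sample stanza are made up for illustration, and it prints to stdout instead of editing in place.

```shell
# Hypothetical helper: apply the two substitutions from above to a repo
# file and print the result, leaving the file itself untouched.
use_fixed_mirror() {
  sed -e "s/^mirrorlist/#mirrorlist/" \
      -e "s%#baseurl=http://mirror.centos.org/centos%baseurl=http://centos.mirror-server.de%" \
      "$1"
}

# Sample stanza modelled on CentOS-Base.repo (illustrative, not a full file).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
[base]
mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os
#baseurl=http://mirror.centos.org/centos/$releasever/os/$basearch/
EOF

use_fixed_mirror "$sample"
rm -f "$sample"
```

Once the output looks right, the same expressions can be applied with `-i` as in the script above.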

Then I add the proxy configuration to yum.conf:

echo "proxy=$http_proxy" >> /etc/yum.conf  
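A plain `>>` appends another proxy line every time the provisioning script runs. If that matters, a small guard keeps it to one line. This is a sketch with a made-up function name (`set_yum_proxy`); it is demonstrated on a temporary file rather than on /etc/yum.conf, and the proxy URL is just the example address used in this post.

```shell
# Hypothetical helper: add a proxy line to a yum.conf-style file,
# but only if no proxy line is present yet.
set_yum_proxy() {
  conf="$1"
  proxy="$2"
  if grep -q '^proxy=' "$conf"; then
    return 0    # already configured, leave the file alone
  fi
  echo "proxy=$proxy" >> "$conf"
}

# Demonstration on a scratch file; calling it twice still yields one line.
demo_conf="$(mktemp)"
printf '[main]\n' > "$demo_conf"
set_yum_proxy "$demo_conf" "http://192.168.117.1:3128"
set_yum_proxy "$demo_conf" "http://192.168.117.1:3128"
cat "$demo_conf"
rm -f "$demo_conf"
```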

and remove the fastestmirror plugin:

  mv -f /etc/yum/pluginconf.d/fastestmirror.conf $HOME/fastestmirror.conf  
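An alternative to moving the file away is to switch the plugin off in place. The sketch below assumes the plugin config contains an `enabled=1` line, as the stock fastestmirror.conf normally does; the function name is made up, and the demonstration runs on a scratch file.

```shell
# Hypothetical helper: turn a yum plugin off in its config file,
# keeping a .bak copy of the original next to it.
disable_yum_plugin() {
  sed -i.bak 's/^enabled *= *1/enabled=0/' "$1"
}

# Demonstration on a scratch file that mimics a plugin config.
demo_plugin="$(mktemp)"
printf '[main]\nenabled=1\nverbose=0\n' > "$demo_plugin"
disable_yum_plugin "$demo_plugin"
cat "$demo_plugin"
rm -f "$demo_plugin" "$demo_plugin.bak"
```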

IP address of the host

The next problem was to determine the IP address of my laptop from within the VM. I found no officially documented or supported mechanism. VMware creates an interface called vmnet8 on my computer (a virtual network interface used for ‘shared’ networking); ifconfig shows:

vmnet8: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 00:50:56:c0:00:08
    inet 192.168.117.1 netmask 0xffffff00 broadcast 192.168.117.255

So we are looking for the IP address 192.168.117.1. I could have set that value as an environment variable in Packer, but I chose to determine it from inside the VM instead, because I need that information later as well, when the VM is running. Luckily the IP address of my laptop as seen from within the VM always seems to be XXX.XXX.XXX.1. I just need the IP address of the VM (the guest) and replace the last number: 192.168.117.132 => 192.168.117.1

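# IP address of the guest on $INTERFACE (old-style "inet addr:" ifconfig output)
MY_IP=$(ifconfig "$INTERFACE" | grep "inet " | sed "s/.*inet addr:\([0-9.]*\).*/\1/")
# replace the last octet with 1 to get the host; quoted so the shell keeps the \.
HOST_IP=$(echo "$MY_IP" | sed 's/\.[0-9]*$/.1/')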
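The two steps can be tried without a VM by feeding in a sample line. This is a minimal sketch: the function names are made up, the sample line imitates the old net-tools `inet addr:` format found on CentOS 6, and the "host is always .1" rule is the observation from above, not a documented VMware guarantee.

```shell
# Extract the first IPv4 address from old-style ifconfig output on stdin,
# i.e. from a line like "inet addr:192.168.117.132  Bcast:...".
guest_ip_from_ifconfig() {
  sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p' | head -n 1
}

# Replace the last octet with 1 using shell parameter expansion.
host_ip_from_guest_ip() {
  echo "${1%.*}.1"
}

sample='          inet addr:192.168.117.132  Bcast:192.168.117.255  Mask:255.255.255.0'
guest="$(printf '%s\n' "$sample" | guest_ip_from_ifconfig)"
host_ip_from_guest_ip "$guest"   # prints 192.168.117.1
```

The parameter expansion avoids a second sed call and the quoting pitfalls that come with it.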

My squid proxy will be running on port 3128, so I just have to check whether the proxy really is available:

# discard stdout and stderr; only the exit status matters
curl --connect-timeout 1 http://www.google.com > /dev/null 2>&1
if [[ $? != 0 ]]; then
   ...
fi
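The check above only goes through the proxy if `http_proxy` is already exported. A variant that names the proxy explicitly could look like the sketch below. The function name, the fallback message and the decision to export `http_proxy` only on success are my assumptions, not part of the original script.

```shell
# Hypothetical helper: succeed only if an HTTP request through the given
# proxy works. Host, port and test URL are parameters, not fixed values.
proxy_available() {
  host="$1"
  port="${2:-3128}"
  url="${3:-http://www.google.com}"
  curl --silent --head --connect-timeout 1 --max-time 5 \
       --proxy "http://$host:$port" --output /dev/null "$url"
}

# Assumed usage: export the proxy only when it answers.
HOST_IP="${HOST_IP:-192.168.117.1}"
if proxy_available "$HOST_IP" 3128; then
  export http_proxy="http://$HOST_IP:3128"
else
  echo "no proxy at $HOST_IP:3128, downloading directly" >&2
fi
```

That way a VM built without the laptop proxy nearby falls back to direct downloads instead of failing.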

configuring squid

To speed things up, I first used SquidMan. Everything looked dandy, but when the VM ran yum update and started to download the repository metadata (a 4 MB sqlite db), it started okay and then got really slow. I mean really slow: instead of 20 seconds it estimated 2 hours. Some googling suggested that Squid waits for all the data to arrive before handing it over to the client. So I waited. And waited. Nothing. I tried the download from within the VM, bypassing the proxy: 20 seconds again.

I googled, and googled some more, until I was ready to give up. What I know about networking “fits on a stamp”, but I started a tcpdump nonetheless. And there was something curious: when I bypassed the proxy, everything looked normal (to me). But when I used the proxy, I saw many entries with IPv6 addresses instead of IPv4. So I looked for a way to tell Squid not to use IPv6, or at least to prefer IPv4:

dns_v4_first on  

Now the speed was as expected. I still do not understand whether the problem lies with the chosen mirror (centos.mirror-server.de) or with my ISP. I do not care.

In the meantime (I thought SquidMan might be the culprit) I had switched over to a ‘normal’ squid, learned some more stuff, and ended up with the following script to configure my squid (installed via Homebrew on OS X, so YMMV):

# we need the directory, where the squid configuration file can be found
SQUID_DIR=$(brew info squid | grep Cellar | sed "s/^\([^ ]*\).*/\1/")

# insert some additional refresh patterns (before the other refresh patterns)
# note: "-i .org" is the BSD/OS X sed syntax; no spaces may follow the backslashes
sed -i .org '/refresh_pattern .ftp/i \
refresh_pattern -i .rpm$ 129600 100% 129600 \
refresh_pattern -i .bz2$ 129600 100% 129600 \
' "$SQUID_DIR/etc/squid.conf"

# append some lines to the configuration
cat <<EOF >> "$SQUID_DIR/etc/squid.conf"

# log file locations:
cache_access_log stdio:/usr/local/var/logs/squid/squid-access.log
cache_store_log stdio:/usr/local/var/logs/squid/squid-store.log
cache_log /usr/local/var/logs/squid/squid-cache.log

# store objects up to:
maximum_object_size 16 MB

# I needed that at my home to avoid slow ipv6
dns_v4_first on

# and we want the cache to survive a restart:
cache_dir ufs /usr/local/var/cache/squid 10000 16 256
EOF

The refresh_pattern lines make Squid store the RPMs and keep them, even if no cache headers are set. They have to come before the existing refresh_pattern lines, so I insert them before the first one (/refresh_pattern .ftp/i). Then I set the log file locations, the maximum object size (16 MB), the preference for IPv4 (dns_v4_first) and finally a cache directory, so that the cached data survives a restart of the proxy. Done.
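The in-place insert above relies on the BSD sed that ships with OS X; GNU sed wants a different `-i` syntax. A variant that behaves the same on both can be sketched with awk. The function name is made up, and escaping the dot (`\.rpm$`) is my own tightening of the pattern, not what the script above does.

```shell
# Hypothetical helper: insert the two long-lived refresh patterns before
# the first existing refresh_pattern line, keeping a .org backup copy.
insert_refresh_patterns() {
  conf="$1"
  tmp="$(mktemp)" || return 1
  awk '
    # Emit the new rules once, right before the first existing rule.
    !inserted && /^refresh_pattern / {
      print "refresh_pattern -i \\.rpm$ 129600 100% 129600"
      print "refresh_pattern -i \\.bz2$ 129600 100% 129600"
      inserted = 1
    }
    { print }
  ' "$conf" > "$tmp" || return 1
  cp "$conf" "$conf.org" && cat "$tmp" > "$conf"
  rm -f "$tmp"
}

# Demonstration on a scratch file with two stock-looking rules.
demo_squid="$(mktemp)"
printf '%s\n' 'http_port 3128' \
  'refresh_pattern ^ftp: 1440 20% 10080' \
  'refresh_pattern . 0 20% 4320' > "$demo_squid"
insert_refresh_patterns "$demo_squid"
cat "$demo_squid"
rm -f "$demo_squid" "$demo_squid.org"
```

After changing the real file, `squid -k parse` checks the configuration for errors, and `squid -z` creates the cache directories before the first start.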
