Difference between pages "Intel HD Audio" and "Ebuild Version Encoding"

From Funtoo
(Difference between pages)
Jump to: navigation, search
m
 
(Own considerations and script)
 
Line 1: Line 1:
== Introduction ==
+
= Introduction =
  
This tutorial describes how to configure an Intel HD Audio sound card. It explains the configuration at the kernel level and the changes you'll need to make to /etc/modprobe.d/alsa.conf.
+
When implementing a database backend for portage, the versions of packages need to be directly comparable. An integer sized 64 or 128 bit is the way to go. This page describes a scheme to encode the original version strings into compacted integers.
  
== Prerequisites ==
+
= Version Format =
  
Before using this tutorial to configure your PC's sound card, make sure that the command '''''lspci''''' displays information similar to the following:
+
The format of version strings used in ebuilds can be categorized as:
  
 
<pre>
 
<pre>
# lspci | grep -i audio
+
version ::= <basic> <suffix>
00:1b.0 Audio device: Intel Corporation 5 Series/3400 Series Chipset High Definition Audio (rev 06)
+
basic  ::= <number> {'.' <number>} [char]
 +
suffix  ::= ['_' <symbol> [number]] ['-r' <number>]
 +
number  ::= an arbitary non-negative number
 +
symbol  ::= one of "alpha", "beta", "rc", "pre", "p"
 +
char    ::= 'a' to 'z'
 +
 
 +
NOTE: {'a'} means 'a' can appear none or arbitary times.
 
</pre>
 
</pre>
  
In addition, the following packages must be installed. They are usually installed during the installation of the base system.
+
== Comparison Rules ==
  
*media-libs/alsa-lib
+
* Basic version: Use the general comparison rules. For example: 1.0 < 1.1 < 1.1.1 < 1.2 < 1.2a < 1.3 < 2 < 2.0
*media-plugins/gst-plugins-alsa
+
* Suffixes: _alpha < _beta < _pre < _rc < none < _p
*media-sound/alsa-headers
+
* Overall: If the basic versions are different, compare them. Or compare the suffixes. At last compare ebuild revisions.
*media-sound/alsa-utils
+
  
The codec for the sound card used in this example is:
+
= Basics =
 +
 
 +
We have to use a single integers to represent multiple integers plus some symbols while keep them comparable.
 +
 
 +
In this part we are focusing on the basic versions. Take 16 bit integers for explanation. Using 16 bit integer 2 and 12 have the form
  
 
<pre>
 
<pre>
$ cat /proc/asound/card*/codec#0 | grep -i codec
+
0000000000000010
Codec: IDT 92HD73C1X5
+
0000000000001100
 
</pre>
 
</pre>
  
== Kernel configuration ==
+
and they are apparently comparable. What's more, even they appear in the middle of the integer:
  
{{Fancynote|The kernel configuration shown in this example is based on vanilla-sources-3.0.3, but the options are present in other versions of the kernel, including gentoo-sources and git-sources.}}
+
<pre>
 +
0000001000000000
 +
0001100000000000
 +
</pre>
  
=== menuconfig ===
+
Provided their LSB (least significant bit) are alligned, they are still comparable. So basically we can align the LSB of numbers to some fixed positions. If we use each 4 bit for a number we can get:
 
+
Shown here are the kernel options that need to be selected. Configure the options as modules ('''<M>''') if permitted. Leave the other options as they are.  
+
  
 
<pre>
 
<pre>
Device Drivers  --->
+
1.0      => 0001 0000 0000 0000
  <M> Sound card support  --->
+
1.0.0   => 0001 0000 0000 0000
    [*]  Preclaim OSS device numbers
+
1.1      => 0001 0001 0000 0000
    <M>  Advanced Linux Sound Architecture  --->
+
3.4.1.10 => 0011 0100 0001 1010
      <M>  Sequencer support
+
      <M>    Sequencer dummy client
+
      <M>  OSS Mixer API
+
      <M>  OSS PCM (digital audio) API
+
      [*]    OSS PCM (digital audio) API - Include plugin system
+
      [*]  OSS Sequencer API
+
      <M>  HR-timer backend support
+
      [*]    Use HR-timer as default sequencer timer
+
      -*-  Dynamic device file minor numbers
+
      [*]  Verbose procfs contents
+
      [*]  Generic sound devices  --->
+
        <M>  Virtual MIDI soundcard
+
      [*]  PCI sound devices  --->
+
        <M>  Intel HD Audio  --->
+
          [*]  Build Realtek HD-audio codec support
+
          [*]  Build Analog Device HD-audio codec support
+
          [*]  Build IDT/Sigmatel HD-audio codec support
+
          [*]  Build VIA HD-audio codec support
+
          [*]  Build HDMI/DisplayPort HD-audio codec support
+
          [*]  Build Conexant HD-audio codec support
+
          [*]  Build Creative CA0110-IBG codec support
+
          [*]  Build C-Media HD-audio codec support
+
          [*]  Build Silicon Labs 3054 HD-modem codec support
+
          [*]  Enable generic HD-audio codec parser
+
          [*]  Aggressive power-saving on HD-audio
+
          (20)   Default time-out for HD-audio power-save mode
+
      [*]  USB sound devices  --->
+
        <M>   USB Audio/MIDI driver
+
      [*]  FireWire sound devices  --->
+
 
</pre>
 
</pre>
  
=== /usr/src/linux/.config ===
+
We put the numbers from left (most significant) to right (least significant) to give numbers the correct priority.
  
The options required for the sound card can be selected by editing /usr/src/linux/.config as well:
+
This method has a big limitation that the size of an integer and the number of the integers are all fixed. Neither we can't destinguish 1.0 and 1.0.0. To overcome these limitations, we can prepend a length marker before the number and use the following encoding:
 +
 
 +
{| {{table}}
 +
! Encoding
 +
! Range
 +
|-
 +
|<tt><b>0</b>xx</tt>
 +
|0~3
 +
|-
 +
|<tt><b>1</b>xxxxx</tt>
 +
|4~31
 +
|}
 +
 
 +
Losing one bit for a number, now we can encode more complex versions:
  
 
<pre>
 
<pre>
CONFIG_SOUND=m
+
1.0        => 001 000              => 0010 0000 0000 0000
CONFIG_SOUND_OSS_CORE=y
+
1.0.0      => 001 000 000          => 0010 0000 0000 0000
CONFIG_SOUND_OSS_CORE_PRECLAIM=y
+
1.1        => 001 001              => 0010 0100 0000 0000
CONFIG_SND=m
+
1.3        => 001 011              => 0010 1100 0000 0000
CONFIG_SND_TIMER=m
+
1.4        => 001 100100            => 0011 0010 0000 0000
CONFIG_SND_PCM=m
+
2.6.20      => 010 100110 110100    => 0101 0011 0110 1000
CONFIG_SND_HWDEP=m
+
3.0        => 001 000              => 0010 0000 0000 0000
CONFIG_SND_RAWMIDI=m
+
3.1        => 001 001              => 0010 0100 0000 0000
CONFIG_SND_SEQUENCER=m
+
3.2.3.3    => 011 010 011 011      => 0110 1001 1011 0000
CONFIG_SND_SEQ_DUMMY=m
+
3.2.3.3.1  => 011 010 011 011 001  => 0110 1001 1011 0010
CONFIG_SND_OSSEMUL=y
+
3.4.1.10    => 011 100100 001 101010 => 0111 0010 0001 1010 10 (too long!)
CONFIG_SND_MIXER_OSS=m
+
</pre>
CONFIG_SND_PCM_OSS=m
+
CONFIG_SND_PCM_OSS_PLUGINS=y
+
CONFIG_SND_SEQUENCER_OSS=y
+
CONFIG_SND_HRTIMER=m
+
CONFIG_SND_SEQ_HRTIMER_DEFAULT=y
+
CONFIG_SND_DYNAMIC_MINORS=y
+
# CONFIG_SND_SUPPORT_OLD_API is not set
+
CONFIG_SND_VERBOSE_PROCFS=y
+
# CONFIG_SND_VERBOSE_PRINTK is not set
+
# CONFIG_SND_DEBUG is not set
+
CONFIG_SND_VMASTER=y
+
CONFIG_SND_DMA_SGBUF=y
+
CONFIG_SND_RAWMIDI_SEQ=m
+
# CONFIG_SND_OPL3_LIB_SEQ is not set
+
# CONFIG_SND_OPL4_LIB_SEQ is not set
+
# CONFIG_SND_SBAWE_SEQ is not set
+
# CONFIG_SND_EMU10K1_SEQ is not set
+
CONFIG_SND_DRIVERS=y
+
# CONFIG_SND_PCSP is not set
+
# CONFIG_SND_DUMMY is not set
+
# CONFIG_SND_ALOOP is not set
+
CONFIG_SND_VIRMIDI=m
+
# CONFIG_SND_MTPAV is not set
+
# CONFIG_SND_SERIAL_U16550 is not set
+
# CONFIG_SND_MPU401 is not set
+
CONFIG_SND_PCI=y
+
# CONFIG_SND_AD1889 is not set
+
# CONFIG_SND_ALS300 is not set
+
# CONFIG_SND_ALS4000 is not set
+
# CONFIG_SND_ALI5451 is not set
+
# CONFIG_SND_ASIHPI is not set
+
# CONFIG_SND_ATIIXP is not set
+
# CONFIG_SND_ATIIXP_MODEM is not set
+
# CONFIG_SND_AU8810 is not set
+
# CONFIG_SND_AU8820 is not set
+
# CONFIG_SND_AU8830 is not set
+
# CONFIG_SND_AW2 is not set
+
# CONFIG_SND_AZT3328 is not set
+
# CONFIG_SND_BT87X is not set
+
# CONFIG_SND_CA0106 is not set
+
# CONFIG_SND_CMIPCI is not set
+
# CONFIG_SND_OXYGEN is not set
+
# CONFIG_SND_CS4281 is not set
+
# CONFIG_SND_CS46XX is not set
+
# CONFIG_SND_CS5530 is not set
+
# CONFIG_SND_CS5535AUDIO is not set
+
# CONFIG_SND_CTXFI is not set
+
# CONFIG_SND_DARLA20 is not set
+
# CONFIG_SND_GINA20 is not set
+
# CONFIG_SND_LAYLA20 is not set
+
# CONFIG_SND_DARLA24 is not set
+
# CONFIG_SND_GINA24 is not set
+
# CONFIG_SND_LAYLA24 is not set
+
# CONFIG_SND_MONA is not set
+
# CONFIG_SND_MIA is not set
+
# CONFIG_SND_ECHO3G is not set
+
# CONFIG_SND_INDIGO is not set
+
# CONFIG_SND_INDIGOIO is not set
+
# CONFIG_SND_INDIGODJ is not set
+
# CONFIG_SND_INDIGOIOX is not set
+
# CONFIG_SND_INDIGODJX is not set
+
# CONFIG_SND_EMU10K1 is not set
+
# CONFIG_SND_EMU10K1X is not set
+
# CONFIG_SND_ENS1370 is not set
+
# CONFIG_SND_ENS1371 is not set
+
# CONFIG_SND_ES1938 is not set
+
# CONFIG_SND_ES1968 is not set
+
# CONFIG_SND_FM801 is not set
+
CONFIG_SND_HDA_INTEL=m
+
# CONFIG_SND_HDA_HWDEP is not set
+
# CONFIG_SND_HDA_INPUT_BEEP is not set
+
# CONFIG_SND_HDA_INPUT_JACK is not set
+
# CONFIG_SND_HDA_PATCH_LOADER is not set
+
CONFIG_SND_HDA_CODEC_REALTEK=y
+
CONFIG_SND_HDA_CODEC_ANALOG=y
+
CONFIG_SND_HDA_CODEC_SIGMATEL=y
+
CONFIG_SND_HDA_CODEC_VIA=y
+
CONFIG_SND_HDA_CODEC_HDMI=y
+
# CONFIG_SND_HDA_CODEC_CIRRUS is not set
+
CONFIG_SND_HDA_CODEC_CONEXANT=y
+
CONFIG_SND_HDA_CODEC_CA0110=y
+
CONFIG_SND_HDA_CODEC_CMEDIA=y
+
CONFIG_SND_HDA_CODEC_SI3054=y
+
CONFIG_SND_HDA_GENERIC=y
+
CONFIG_SND_HDA_POWER_SAVE=y
+
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=20
+
# CONFIG_SND_HDSP is not set
+
# CONFIG_SND_HDSPM is not set
+
# CONFIG_SND_ICE1712 is not set
+
# CONFIG_SND_ICE1724 is not set
+
# CONFIG_SND_INTEL8X0 is not set
+
# CONFIG_SND_INTEL8X0M is not set
+
# CONFIG_SND_KORG1212 is not set
+
# CONFIG_SND_LOLA is not set
+
# CONFIG_SND_LX6464ES is not set
+
# CONFIG_SND_MAESTRO3 is not set
+
# CONFIG_SND_MIXART is not set
+
# CONFIG_SND_NM256 is not set
+
# CONFIG_SND_PCXHR is not set
+
# CONFIG_SND_RIPTIDE is not set
+
# CONFIG_SND_RME32 is not set
+
# CONFIG_SND_RME96 is not set
+
# CONFIG_SND_RME9652 is not set
+
# CONFIG_SND_SONICVIBES is not set
+
# CONFIG_SND_TRIDENT is not set
+
# CONFIG_SND_VIA82XX is not set
+
# CONFIG_SND_VIA82XX_MODEM is not set
+
# CONFIG_SND_VIRTUOSO is not set
+
# CONFIG_SND_VX222 is not set
+
# CONFIG_SND_YMFPCI is not set
+
CONFIG_SND_USB=y
+
CONFIG_SND_USB_AUDIO=m
+
# CONFIG_SND_USB_UA101 is not set
+
# CONFIG_SND_USB_USX2Y is not set
+
# CONFIG_SND_USB_CAIAQ is not set
+
# CONFIG_SND_USB_US122L is not set
+
# CONFIG_SND_USB_6FIRE is not set
+
CONFIG_SND_FIREWIRE=y
+
# CONFIG_SND_FIREWIRE_SPEAKERS is not set
+
# CONFIG_SND_ISIGHT is not set
+
# CONFIG_SND_PCMCIA is not set
+
# CONFIG_SND_SOC is not set
+
# CONFIG_SOUND_PRIME is not set
+
</pre>  
+
  
== /etc/modprobe.d/alsa.conf ==
+
You see previous version 3.4.1.10 fail using this method. This is a trade-off for flexibility.
  
=== Dell laptop ===
+
The length markers can also be variable-size. The point here is to use [http://en.wikipedia.org/wiki/Huffman_coding Huffman coding]. We just have to make sure:
  
{{Fancynote|The configuration shown here applies to a Dell Studio 1749 laptop. It might also be applicable to other Dell laptops.}}
+
* The length markers are comparable. In other words, markers for shorter length must be less than markers for longer length, when aligned to MSB.
 +
* No length marker can be a prefix of another, or we can't distinguish them.
  
If you plan on using external speakers or a headset, you need to edit /etc/modprobe.d/alsa.conf so that the integrated speakers are muted when you plug an external sound device in. Add the following lines to the end of the file:
+
= Specification =
  
<pre>
+
For an arbitary version string, we just encode each part of the version and put them to the encoded number from MSB to LSB one by one, using the following rules.
# Sound card config
+
# Codec: IDT 92HD73C1X5
+
  
options snd-hda-intel model=dell-m6
+
Firstly a general rule for numbers is given below.
options snd-hda-intel enable_msi=1
+
 
</pre>
+
{| {{table}}
 +
!Encoding
 +
!Meaning
 +
!Explanation
 +
|-
 +
|<tt><b>01000</b>xxxxx</tt>
 +
|'a' ~ 'z'
 +
|For the single char at the last, can only appear once
 +
|-
 +
|<tt><b>01</b>x{3}</tt> except <tt><b>01000</b></tt>
 +
|0 ~ 6
 +
|For commonly used small version numbers
 +
|-
 +
|<tt><b>100</b>x{6}</tt>
 +
|7 ~ 70
 +
|
 +
|-
 +
|<tt><b>101</b>x{12}</tt>
 +
|71 ~ 4166
 +
|
 +
|-
 +
|<tt><b>110</b>x{27}</tt>
 +
|4167 ~ 134221894
 +
|For dates, 99999999 ~ 2^27
 +
|-
 +
|<tt><b>1110</b>x{40}</tt>
 +
|134221895 ~ 1099645849670
 +
|For timestamps, 999999999999 ~ 2^40
 +
|-
 +
|<tt><b>1111</b>x{48}</tt>
 +
|1099645849671 ~ 282574622560326
 +
|For some really large numbers
 +
|}
 +
(x{n} means x repeat n times)
 +
 
 +
Here we apply two tricks:
 +
* we leave code <tt><b>01000</b></tt> for encoding the ending single char, so <tt><b>01001</b></tt> means 0 and then.
 +
* as 0~6 is already encoded as 01xxx, the code range <tt><b>100000000</b></tt> ~ <tt><b>100000100</b></tt> became illegal. To save the space, we use <tt><b>100000000</b></tt> to encode 7 immediately, which made the upper bound become 6 + 64 = 70 instead of 63. The same trick does to others.
 +
 
 +
We encode basic versions as a continuous sequence of variable size integers.
 +
 
 +
Now depends on the existance of suffixes (not including the ebuild revison '-r'), we append a sequence denoting the end of basic version:
 +
 
 +
{| {{table}}
 +
!Encoding
 +
!Meaning
 +
!Explanation
 +
|-
 +
|<tt><b>000</b></tt>
 +
|Suffix is "_alpha", "_beta", "_pre", "_rc"
 +
|These suffixes are less than nothing
 +
|-
 +
|<tt><b>001</b></tt>
 +
|No suffix, or '_p'
 +
|General case
 +
|}
 +
 
 +
Then if there are suffixes, we encode the suffixes and append the sequences according to this table:
 +
 
 +
{| {{table}}
 +
!Encoding
 +
!Meaning
 +
!Explanation
 +
|-
 +
|<tt><b>0</b></tt> + NUM
 +
|ebuild revision '-r'
 +
|NUM is a variable size integer
 +
|-
 +
|<tt><b>10</b>xxx</tt>
 +
|Suffix without a number
 +
|
 +
|-
 +
|<tt><b>11</b>xxx</tt> + NUM
 +
|Suffix with a number
 +
|NUM is a variable size integer
 +
|}
 +
 
 +
We can encode the suffix as a 3-bit number, following their priority:
 +
 
 +
{| {{table}}
 +
!Encoding
 +
!Suffix
 +
|-
 +
|<tt><b>000</b></tt>
 +
|_alpha
 +
|-
 +
|<tt><b>001</b></tt>
 +
|_beta
 +
|-
 +
|<tt><b>010</b></tt>
 +
|_rc
 +
|-
 +
|<tt><b>011</b></tt>
 +
|_pre
 +
|-
 +
|<tt><b>000</b></tt>
 +
|_p
 +
|}
 +
 
 +
Here we reused <tt><b>000</b></tt> for '_p' because it's in different category (larger than nothing) as '_alpha' (less than nothing). As you can see 3-bit is efficient enough for further expanding.
 +
 
 +
= Consideration =
 +
 
 +
* We can encode obvious date/timestamps using more compact numbers (think of 22b vs 27b for date). But the drawback is dates will be incomparable to ordinary numbers. When some numbers looks like dates but are actually not dates, this will cause severe issues.
 +
* There are still some packages whose version doesn't fit in a 64 bit integer (about ~40). Plus no database support native 128bit integers at this time. To workaround this issue we can either fork these packages and shorten their version number, or use a combination of two 64bit integers to form a 128bit integer. But the second method might be an ugly trick.
 +
* We could also leave these packages alone without encoding. When we need to do SQL query we just fetch all these packages and compare the versions using the original string-based comparison. Since there are not too many packages fail the 64 bit boundary, this will have little impact on the performance.
 +
 
 +
= Script =
 +
 
 +
The following is a perl script which does the counting of bits needed for each package. The numbers in subs can be changed to test adjusting the encoding.
 +
 
 +
The script is used for prototype only and isn't well tested. Please feel free to play with it and improve it.
 +
 
 +
<syntaxhighlight lang="perl">
 +
#!/usr/bin/env perl
 +
 
 +
use 5.010;
 +
use strict;
 +
 
 +
my @list = grep /^.+\.ebuild$/, `find /usr/portage/`;
 +
my $max_bits = 0;
 +
my $max_ebuild;
 +
my $failed_count = 0;
 +
my $orig;
 +
 
 +
# bits needed for an ordinary number
 +
sub bits_for_number
 +
{
 +
given ($_)
 +
 +
when ($_ <= "6")                    { return 2 + 3;      }
 +
when ($_ <= "70")                    { return 3 + 6;      }
 +
when ($_ <= "4166")                  { return 3 + 12;      }
 +
when ($_ <= "134221894")            { return 3 + 27;      }
 +
when ($_ <= "1099645849670")        { return 4 + 40;      }
 +
when ($_ <= "282574622560326")      { return 4 + 48;      }
 +
default                              { die "ERROR: $orig"; }
 +
}
 +
}
 +
 
 +
# bits needed for the single char immediately after basic version
 +
sub bits_for_char
 +
{
 +
return 10;
 +
}
 +
 
 +
# bits needed to denote the end of basic version
 +
sub bits_for_suffix_start
 +
{
 +
return 3;
 +
}
 +
 
 +
# bits needed for a suffix
 +
sub bits_for_suffix
 +
{
 +
/^(?<Suffix>[-_][a-z]+)(?<Num>\d*)$/;
 +
my $suffix = $+{Suffix};
 +
my $num = $+{Num};
 +
given ($suffix)
 +
{
 +
when (/-r/)      { return 1 + bits_for_number($num); }
 +
default          { return 5 + bits_for_number($num); }
 +
}
 +
}
 +
 
 +
foreach (@list)
 +
{
 +
chomp;
 +
$orig = $_;
 +
my $bits_needed = 0;
 +
 +
/\/(?<Name>.+)\/\g{Name}-(?<Version>.+)\.ebuild$/;
 +
$+{Version} =~ /^(?<Basic>.+?)(?<Suffix>([-_].+)?)$/;
 +
 +
my $basic_version = $+{Basic};
 +
my $suffix = $+{Suffix};
 +
 +
my @basic_parts = $basic_version =~ /\d+/g;
 +
foreach (@basic_parts)
 +
{
 +
$bits_needed += bits_for_number($_);
 +
}
 +
if ($basic_version =~ /([a-z])$/)
 +
{
 +
$bits_needed += bits_for_char();
 +
}
 +
 +
$bits_needed += bits_for_suffix_start();
 +
if ($suffix)
 +
{
 +
my @suffix_parts = $suffix =~ /[-_][a-z]+\d*/g;
 +
foreach (@suffix_parts)
 +
{
 +
$bits_needed += bits_for_suffix($_);
 +
}
 +
}
 +
 +
if ($bits_needed > $max_bits)
 +
{
 +
$max_bits = $bits_needed;
 +
$max_ebuild = $orig;
 +
}
 +
if ($bits_needed > 64)
 +
{
 +
$failed_count++;
 +
say "$orig: $bits_needed";
 +
}
 +
}
 +
 
 +
say;
 +
say "Failed ebuild count: $failed_count";
 +
say "Maximum: $max_ebuild: $max_bits";
 +
</syntaxhighlight>
 +
 
 +
[[Category:Portage]]
 +
[[Category:Labs]]

Latest revision as of 16:48, 13 November 2011

Introduction

When implementing a database backend for portage, the versions of packages need to be directly comparable. An integer sized 64 or 128 bit is the way to go. This page describes a scheme to encode the original version strings into compacted integers.

Version Format

The format of version strings used in ebuilds can be categorized as:

version ::= <basic> <suffix>
basic   ::= <number> {'.' <number>} [char]
suffix  ::= ['_' <symbol> [number]] ['-r' <number>]
number  ::= an arbitary non-negative number
symbol  ::= one of "alpha", "beta", "rc", "pre", "p"
char    ::= 'a' to 'z'

NOTE: {'a'} means 'a' can appear none or arbitary times.

Comparison Rules

  • Basic version: Use the general comparison rules. For example: 1.0 < 1.1 < 1.1.1 < 1.2 < 1.2a < 1.3 < 2 < 2.0
  • Suffixes: _alpha < _beta < _pre < _rc < none < _p
  • Overall: If the basic versions are different, compare them. Or compare the suffixes. At last compare ebuild revisions.

Basics

We have to use a single integers to represent multiple integers plus some symbols while keep them comparable.

In this part we are focusing on the basic versions. Take 16 bit integers for explanation. Using 16 bit integer 2 and 12 have the form

0000000000000010
0000000000001100

and they are apparently comparable. What's more, even they appear in the middle of the integer:

0000001000000000
0001100000000000

Provided their LSB (least significant bit) are alligned, they are still comparable. So basically we can align the LSB of numbers to some fixed positions. If we use each 4 bit for a number we can get:

1.0      => 0001 0000 0000 0000
1.0.0    => 0001 0000 0000 0000
1.1      => 0001 0001 0000 0000
3.4.1.10 => 0011 0100 0001 1010

We put the numbers from left (most significant) to right (least significant) to give numbers the correct priority.

This method has a big limitation that the size of an integer and the number of the integers are all fixed. Neither we can't destinguish 1.0 and 1.0.0. To overcome these limitations, we can prepend a length marker before the number and use the following encoding:

Encoding Range
0xx 0~3
1xxxxx 4~31

Losing one bit for a number, now we can encode more complex versions:

1.0         => 001 000               => 0010 0000 0000 0000
1.0.0       => 001 000 000           => 0010 0000 0000 0000
1.1         => 001 001               => 0010 0100 0000 0000
1.3         => 001 011               => 0010 1100 0000 0000
1.4         => 001 100100            => 0011 0010 0000 0000
2.6.20      => 010 100110 110100     => 0101 0011 0110 1000
3.0         => 001 000               => 0010 0000 0000 0000
3.1         => 001 001               => 0010 0100 0000 0000
3.2.3.3     => 011 010 011 011       => 0110 1001 1011 0000
3.2.3.3.1   => 011 010 011 011 001   => 0110 1001 1011 0010
3.4.1.10    => 011 100100 001 101010 => 0111 0010 0001 1010 10 (too long!)

You see previous version 3.4.1.10 fail using this method. This is a trade-off for flexibility.

The length markers can also be variable-size. The point here is to use Huffman coding. We just have to make sure:

  • The length markers are comparable. In other words, markers for shorter length must be less than markers for longer length, when aligned to MSB.
  • No length marker can be a prefix of another, or we can't distinguish them.

Specification

For an arbitary version string, we just encode each part of the version and put them to the encoded number from MSB to LSB one by one, using the following rules.

Firstly a general rule for numbers is given below.

Encoding Meaning Explanation
01000xxxxx 'a' ~ 'z' For the single char at the last, can only appear once
01x{3} except 01000 0 ~ 6 For commonly used small version numbers
100x{6} 7 ~ 70
101x{12} 71 ~ 4166
110x{27} 4167 ~ 134221894 For dates, 99999999 ~ 2^27
1110x{40} 134221895 ~ 1099645849670 For timestamps, 999999999999 ~ 2^40
1111x{48} 1099645849671 ~ 282574622560326 For some really large numbers

(x{n} means x repeat n times)

Here we apply two tricks:

  • we leave code 01000 for encoding the ending single char, so 01001 means 0 and then.
  • as 0~6 is already encoded as 01xxx, the code range 100000000 ~ 100000100 became illegal. To save the space, we use 100000000 to encode 7 immediately, which made the upper bound become 6 + 64 = 70 instead of 63. The same trick does to others.

We encode basic versions as a continuous sequence of variable size integers.

Now depends on the existance of suffixes (not including the ebuild revison '-r'), we append a sequence denoting the end of basic version:

Encoding Meaning Explanation
000 Suffix is "_alpha", "_beta", "_pre", "_rc" These suffixes are less than nothing
001 No suffix, or '_p' General case

Then if there are suffixes, we encode the suffixes and append the sequences according to this table:

Encoding Meaning Explanation
0 + NUM ebuild revision '-r' NUM is a variable size integer
10xxx Suffix without a number
11xxx + NUM Suffix with a number NUM is a variable size integer

We can encode the suffix as a 3-bit number, following their priority:

Encoding Suffix
000 _alpha
001 _beta
010 _rc
011 _pre
000 _p

Here we reused 000 for '_p' because it's in different category (larger than nothing) as '_alpha' (less than nothing). As you can see 3-bit is efficient enough for further expanding.

Consideration

  • We can encode obvious date/timestamps using more compact numbers (think of 22b vs 27b for date). But the drawback is dates will be incomparable to ordinary numbers. When some numbers looks like dates but are actually not dates, this will cause severe issues.
  • There are still some packages whose version doesn't fit in a 64 bit integer (about ~40). Plus no database support native 128bit integers at this time. To workaround this issue we can either fork these packages and shorten their version number, or use a combination of two 64bit integers to form a 128bit integer. But the second method might be an ugly trick.
  • We could also leave these packages alone without encoding. When we need to do SQL query we just fetch all these packages and compare the versions using the original string-based comparison. Since there are not too many packages fail the 64 bit boundary, this will have little impact on the performance.

Script

The following is a perl script which does the counting of bits needed for each package. The numbers in subs can be changed to test adjusting the encoding.

The script is used for prototype only and isn't well tested. Please feel free to play with it and improve it.

#!/usr/bin/env perl
 
use 5.010;
use strict;
 
my @list = grep /^.+\.ebuild$/, `find /usr/portage/`;
my $max_bits = 0;
my $max_ebuild;
my $failed_count = 0;
my $orig;
 
# bits needed for an ordinary number
sub bits_for_number
{
	given ($_)
	{  
		when ($_ <= "6")                     { return 2 + 3;       }
		when ($_ <= "70")                    { return 3 + 6;       }
		when ($_ <= "4166")                  { return 3 + 12;      }
		when ($_ <= "134221894")             { return 3 + 27;      }
		when ($_ <= "1099645849670")         { return 4 + 40;      }
		when ($_ <= "282574622560326")       { return 4 + 48;      }
		default                              { die "ERROR: $orig"; }
	}
}
 
# bits needed for the single char immediately after basic version
sub bits_for_char
{
	return 10;
}
 
# bits needed to denote the end of basic version
sub bits_for_suffix_start
{
	return 3;
}
 
# bits needed for a suffix
sub bits_for_suffix
{
	/^(?<Suffix>[-_][a-z]+)(?<Num>\d*)$/;
	my $suffix = $+{Suffix};
	my $num = $+{Num};
	given ($suffix)
	{
		when (/-r/)      { return 1 + bits_for_number($num); }
		default          { return 5 + bits_for_number($num); }
	}
}
 
foreach (@list)
{
	chomp;
	$orig = $_;
	my $bits_needed = 0;
 
	/\/(?<Name>.+)\/\g{Name}-(?<Version>.+)\.ebuild$/;
	$+{Version} =~ /^(?<Basic>.+?)(?<Suffix>([-_].+)?)$/;
 
	my $basic_version = $+{Basic};
	my $suffix = $+{Suffix};
 
	my @basic_parts = $basic_version =~ /\d+/g;
	foreach (@basic_parts)
	{
		$bits_needed += bits_for_number($_);
	}
	if ($basic_version =~ /([a-z])$/)
	{
		$bits_needed += bits_for_char();
	}
 
	$bits_needed += bits_for_suffix_start();
	if ($suffix)
	{
		my @suffix_parts = $suffix =~ /[-_][a-z]+\d*/g;
		foreach (@suffix_parts)
		{
			$bits_needed += bits_for_suffix($_);
		}
	}
 
	if ($bits_needed > $max_bits)
	{
		$max_bits = $bits_needed;
		$max_ebuild = $orig;
	}
	if ($bits_needed > 64)
	{
		$failed_count++;
		say "$orig: $bits_needed";
	}
}
 
say;
say "Failed ebuild count: $failed_count";
say "Maximum: $max_ebuild: $max_bits";